Lecture Notes in Computer Science
Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1614
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Dionysius P. Huijsmans Arnold W.M. Smeulders (Eds.)
Visual Information and Information Systems Third International Conference, VISUAL’99 Amsterdam, The Netherlands, June 2-4, 1999 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editors Dionysius P. Huijsmans Leiden University, Computer Science Department Niels Bohrweg 1, 2333 CA Leiden, The Netherlands E-mail:
[email protected] Arnold W.M. Smeulders University of Amsterdam, Research Institute Computer Science Kruislaan 403, 1098 SJ Amsterdam, The Netherlands E-mail:
[email protected]
Cataloging-in-Publication data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Visual information and information systems : third international conference ; proceedings / VISUAL ’99, Amsterdam, The Netherlands, June 2 - 4, 1999. D. P. Huijsmans ; Arnold W.M. Smeulders (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999 (Lecture notes in computer science ; Vol. 1614) ISBN 3-540-66079-8
CR Subject Classification (1998): H.3, H.5, H.2, I.4, I.5, I.7, I.3 ISSN 0302-9743 ISBN 3-540-66079-8 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. c Springer-Verlag Berlin Heidelberg 1999 Printed in Germany
Typesetting: Camera-ready by author SPIN 10705199 06/3142 – 5 4 3 2 1 0
Printed on acid-free paper
Preface

Visual Information at the Turn of the Millennium

Visual information dominates the senses we have been given to observe the world around us. We tend to believe information most when it is in visual form. Television and the Internet have accelerated the spread of visual information to unprecedented levels. Now that all sensors are turning digital, and personal computers and the Net are powerful enough to process visual information, a new era is being born: the age of multimedia information. The dominant component of multimedia information is visual. Hence the conclusion: we are on the threshold of the age of visual information. The approach of the new millennium provokes these sweeping thoughts. Five hundred years after the invention of printed books, visual information has returned to the forefront of information dissemination, on a par with textual and numerical information.

The practice of designing visual information systems is far removed from such grandiose thoughts. Visual information systems are radically different from conventional information systems, and many novel issues need to be addressed. A visual information system should be capable of providing access to the content of pictures and video. Where symbolic and numerical information are identical in content and form, pictures require a delicate treatment to approach their content. To search and retrieve items on the basis of their pictorial content requires a new, visual or textual, way of specifying the query, new indices to order the data, and new ways to establish similarity between the query and the target. A novel element, still lacking research, is the display of the information space of all visual items in the system.

Derived from the Third International Conference on Visual Information Systems, held in Amsterdam, this volume of Springer's Lecture Notes in Computer Science provides a state-of-the-art view of visual information systems.

Among the building blocks of visual information systems, the computation of features is currently attracting the most attention. Good features are instrumental in reducing the abundance of information in the picture or in the video to the essence. Ideally, a feature is insensitive to irrelevant variations in the data and sensitive to semantic differences in the data. In the proceedings you will find features of various kinds, where invariance is of specific importance to features for image databases.

For browsing and searching for unspecified items in the information space of all items in the system, visual interaction on the ensemble of all items can provide an overview to the surfing user. In the proceedings you will find contributions on query by iterative optimization of the target, on displaying the information space, and on other ways to trace semantically similar items or documents. It is expected that this topic will attract more attention, more completely fulfilling the name: visual information systems.

An important issue of visual search is the similarity measure. It is not easy to decide what makes two objects, example and target, be experienced as equal.
Similarity is currently approached either as an exact correspondence (as in standard databases), as a statistical problem (as in object classification), or as a metrical problem (in feature space). It is quite likely that similarity search will increasingly be treated as a cognitive problem, with human-perceived similarity at its core. For all practical purposes, similarity search is proximity search: the subject and the target match by proximity. In the proceedings you will find many different implementations of the notion of proximity.

Underlying any information system, there should be a database proper, with data structures, query specification, and indexing schemes for efficient search. While the main emphasis of the techniques embodied here is on processing visual information, the connection to databases and to database parlance is still underrated. In the proceedings you will find contributions on extensions of the database tradition towards unstructured multimedia items, on data structures especially suited for spatial data, and on new ways to access spatial data.

An essential part of visual information processing is success in capturing the information in the image. Where the biggest problem in computer vision is a successful segmentation step, in image databases several authors find their way around this step. In the proceedings you will find contributions based on characterizing internally similar partitions in the image, on salient details, or on total image profiles.

Contributions on these and many more topics can be found in the proceedings. Their combination in one LNCS volume gives an up-to-date overview of all components of visual information systems.

All the contributions in this book have been reviewed thoroughly. The editors wish to thank the members of the program committee and the additional reviewers for their effort; their work has enhanced the final submissions to this book. You will find their names on the following pages. We thank them cordially.

With this book we hope that the conference series on visual information systems will continue into a long-lived future. The conference chair would like to take this opportunity to thank the members of the local committee and the conference bureau for making the conference happen. Finally, the support of the members of the visual information systems steering committee has been much appreciated.
March 1999
Arnold W.M. Smeulders
Nies Huijsmans
VISUAL'99 Conference Organization

Conference Chair
Arnold W.M. Smeulders (University of Amsterdam, NL)
The Visual Information Systems Steering Committee
S.K. Chang (University of Pittsburgh, USA)
Ramesh Jain (University of California, USA)
Tosiyasu Kunii (The University of Aizu, J)
Clement Leung (Victoria University of Technology, AU)
Arnold W.M. Smeulders (University of Amsterdam, NL)
Program Chairs
Ruud M. Bolle (IBM Watson, USA)
Alberto Del Bimbo (University of Florence, I)
Clement Leung (Victoria University of Technology, AU)
Program Committee
Jan Biemond (Technical University Delft, NL)
Josef Bigun (Halmstad University, S)
S.K. Chang (Pittsburgh, USA)
David Forsyth (Berkeley, USA)
Theo Gevers (University of Amsterdam, NL)
Luc van Gool (Catholic University, Leuven, B)
William Grosky (Wayne State University, USA)
Glenn Healey (University of California, Irvine, USA)
Nies Huijsmans (Leiden University, NL)
Yannis Ioannidis (University of Athens, G)
Horace Ip (City University of Hong Kong, HK)
Ramesh Jain (University of California, San Diego, USA)
Rangachar Kasturi (Penn State University, USA)
Martin Kersten (CWI, Amsterdam, NL)
Inald Lagendijk (Technical University Delft, NL)
Robert Laurini (Université C. Bernard Lyon, F)
Carlo Meghini (IEI CNR, Pisa, I)
Erich Neuhold (University of Darmstadt, D)
Eric Pauwels (Catholic University, Leuven, B)
Fernando Pereira (Instituto Superior Técnico, Lisbon, P)
Dragutin Petkovic (IBM, Almaden, USA)
Hanan Samet (University of Maryland, USA)
Simone Santini (University of California, San Diego, USA)
Stan Sclaroff (Boston University, USA)
Raimondo Schettini (ITIM CNR, Milan, I)
Stephen Smoliar (Xerox, Palo Alto, USA)
Aya Soffer (Technion, Haifa, IL)
Michael Swain (DEC, USA)
Hemant Tagare (Yale University, USA)
George Thoma (National Library of Medicine, USA)
Remco Veltkamp (Utrecht University, NL)
Jian Kang Wu (National University of Singapore, SP)
Additional Reviewers
Giuseppe Amato (IEI CNR, Pisa, I)
Sameer Antani (Penn State University, USA)
Frantisek Brabec (University of Maryland, USA)
André Everts (University of Darmstadt, D)
Ullas Gargi (Penn State University, USA)
Sennay Ghebreab (University of Amsterdam, NL)
Henk Heijmans (CWI, Amsterdam, NL)
Gisli R. Hjaltason (University of Maryland, USA)
Bertin Klein (University of Darmstadt, D)
Thomas Klement (University of Darmstadt, D)
Martin Leissler (University of Darmstadt, D)
Michael Lew (Leiden University, NL)
Ingo Macherius (University of Darmstadt, D)
Giuseppe De Marco (IEI CNR, Pisa, I)
Vladimir Y. Mariano (Penn State University, USA)
TatHieu Nguyen (University of Amsterdam, NL)
S.D. Olabarriaga (University of Amsterdam, NL)
Patrizia Palamidese (IEI CNR, Pisa, I)
Fabio Paternò (IEI CNR, Pisa, I)
P. Savino (IEI CNR, Pisa, I)
Geert Streekstra (University of Amsterdam, NL)
V.S. Subrahmanian (University of Maryland, USA)
Ulrich Thiel (University of Darmstadt, D)
Jeroen Vendrig (University of Amsterdam, NL)
Marcel Worring (University of Amsterdam, NL)
Local Organizing Committee
Theo Gevers (University of Amsterdam, NL)
Nies Huijsmans (Leiden University, NL)
Dennis Koelma (University of Amsterdam, NL)
Carel van den Berg (PGS, Amsterdam, NL)
Remco Veltkamp (Utrecht University, NL)
Marcel Worring (University of Amsterdam, NL)

Sponsors
Shell Nederland B.V.
Netherlands Computer Science Research Foundation
Advanced School for Computing and Imaging
University of Amsterdam
Royal Academy of Arts and Sciences
Table of Contents
Visual Information Systems Supporting Image-Retrieval by Database Driven Interactive 3D Information-Visualization . . . . . . . . . . . . 1 M. Leissler, M. Hemmje, E.J. Neuhold
Video Libraries: From Ingest to Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 R.M. Bolle, A. Hampapur Querying Multimedia Data Sources and Databases . . . . . . . . . . . . . . . . . . . . . 19 S.-K. Chang, G. Costagliola, E. Jungert General Image Database Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 P.L. Stanchev System for Medical Image Retrieval: The MIMS Model . . . . . . . . . . . . . . . . . 37 R. Chbeir, Y. Amghar, A. Flory An Agent-Based Visualisation Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 J. Meddes, E. McKenzie Error-Tolerant Database for Structured Images . . . . . . . . . . . . . . . . . . . . . . . . 51 A. Ferro, G. Gallo, R. Giugno
Interactive Visual Query Query Processing and Optimization for Pictorial Query Trees . . . . . . . . . . . . 60 A. Soffer, H. Samet Similarity Search Using Multiple Examples in MARS . . . . . . . . . . . . . . . . . . . 68 K. Porkaew, S. Mehrotra, M. Ortega, K. Chakrabarti Excluding Specified Colors from Image Queries Using a Multidimensional Query Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 D. Androutsos, K.N. Plataniotis, A.N. Venetsanopoulos Generic Viewer Interaction Semantics for Dynamic Virtual Video Synthesis 83 C.A. Lindley, A.-M. Vercoustre Category Oriented Analysis for Visual Data Mining . . . . . . . . . . . . . . . . . . . . 91 H. Shiohara, Y. Iizuka, T. Maruyama, S. Isobe User Interaction in Region-Based Color Image Segmentation . . . . . . . . . . . . . 99 N. Ikonomakis, K.N. Plataniotis, A.N. Venetsanopoulos
Using a Relevance Feedback Mechanism to Improve Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 G. Ciocca, R. Schettini Region Queries without Segmentation for Image Retrieval by Content . . . . 115 J. Malki, N. Boujemaa, C. Nastar, A. Winter Content-Based Image Retrieval over the Web Using Query by Sketch and Relevance Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 E. Di Sciascio, G. Mingolla, M. Mongiello Visual Learning of Simple Semantics in ImageScape . . . . . . . . . . . . . . . . . . . . 131 J.M.Buijs, M.S. Lew
Browsing Information Space Task Analysis for Information Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 S.L. Hibino Filter Image Browsing: Exploiting Interaction in Image Retrieval . . . . . . . . . 147 J. Vendrig, M. Worring, A.W.M. Smeulders Visualization of Information Spaces to Retrieve and Browse Image Data . . 155 A. Hiroike, Y. Musha, A. Sugimoto, Y. Mori Mandala: An Architecture for Using Images to Access and Organize Web Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 J.I. Helfman A Compact and Retrieval-Oriented Video Representation Using Mosaics . . 171 G. Baldi, C. Colombo, A. Del Bimbo
Internet Search Engines Crawling, Indexing and Retrieval of Three-Dimensional Data on the Web in the Framework of MPEG-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 E. Paquet, M. Rioux A Visual Search Engine for Distributed Image and Video Database Retrieval Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 J.-R. Ohm, F. Bunjamin, W. Liebsch, B. Makai, K. Mueller, B. Saberdest, D. Zier Indexing Multimedia for the Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 B. Eberman, B. Fidler, R. Iannucci, C. Joerg, L. Kontothanassis, D.E. Kovalcin, P. Moreno, M.J. Swain, J.-M. Van Thong Crawling for Images on the WWW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 J. Cho, S. Mukherjea
A Dynamic JAVA-Based Intelligent Interface for Online Image Database Searches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 V. Konstantinou, A. Psarrou
Video Parsing Motion-Based Feature Extraction and Ascendant Hierarchical Classification for Video Indexing and Retrieval . . . . . . . . . . . . 221 R. Fablet, P. Bouthemy Automatically Segmenting Movies into Logical Story Units . . . . . . . . . . . . . . 229 A. Hanjalic, R.L. Lagendijk, J. Biemond Local Color Analysis for Scene Break Detection Applied to TV Commercials Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 J.M. Sánchez, X. Binefa, J. Vitrià, P. Radeva Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 P. Bouthemy, C. Garcia, R. Ronfard, G. Tziritas, E. Venau, D. Zugaj Automatic Recognition of Camera Zooms . . . . . . . . . . . . . . . . . . . . . . . . . 253 S. Fischer, I. Rimac, R. Steinmetz A Region Tracking Method with Failure Detection for an Interactive Video Indexing Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 M. Gelgon, P. Bouthemy, T. Dubois Integrated Parsing of Compressed Video . . . . . . . . . . . . . . . . . . . . . . . . . 269 S.M. Bhandarkar, Y.S. Warke, A.A. Khombhadia Improvement of Shot Detection Using Illumination Invariant Metric and Dynamic Threshold Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 W. Kong, X. Ding, H. Lu, S. Ma Temporal Segmentation of MPEG Video Sequences . . . . . . . . . . . . . . . . . . . . 283 E. Ardizzone, C. Lodato, S. Lopes Detecting Abrupt Scene Change Using Neural Network . . . . . . . . . . . . . . . . . 291 H.B. Lu, Y.J. Zhang Multi-Modal Feature-Map: An Approach to Represent Digital Video Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 U. Srinivasan, C. Lindley Robust Tracking of Video Objects through Topological Constraint on Homogeneous Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 M. Liao, Y. Li, S. Ma, H. Lu
Spatial Data The Spatial Spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 G.S. Iwerks, H. Samet A High Level Visual Language for Spatial Data Management . . . . . . . . . . . . 325 M.-A. Aufure-Portier, C. Bonhomme A Global Graph Model of Image Registration . . . . . . . . . . . . . . . . . . . . . . . . . . 333 S.G. Nikolov, D.R. Bull, C.N. Canagarajah A Graph-Theoretic Approach to Image Database Retrieval . . . . . . . . . . . . . . 341 S. Aksoy, R.M. Haralick Motion Capture of Arm from a Monocular Image Sequence . . . . . . . . . . . . . . 349 C. Pan, S. Ma
Visual Languages Comparing Dictionaries for the Automatic Generation of Hypertextual Links: A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358 I. Gagliardi, B. Zonta Categorizing Visual Contents by Matching Visual ”Keywords” . . . . . . . . . . . 367 J.-H. Lim Design of the Presentation Language for Distributed Hypermedia System . 375 M. Katsumoto, S.-i. Iisaku A Generic Annotation Model for Video Databases . . . . . . . . . . . . . . . . . . . . . . 383 H. Rehatschek, H. Mueller Design and Implementation of COIRS(a COncept-Based Image Retrieval System) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391 H. Yang, H. Kim, J. Yang Automatic Index Expansion for Concept-Based Image Query . . . . . . . . . . . . 399 D. Sutanto, C.H.C. Leung
Features and Indexes for Image Retrieval Structured High-Level Indexing of Visual Data Content . . . . . . . . . . . . . . . . . 409 A.M. Tam, C.H.C. Leung Feature Extraction: Issues, New Features, and Symbolic Representation . . . 418 M. Palhang, A. Sowmya Detection of Interest Points for Image Indexation . . . . . . . . . . . . . . . . . . . . . . 427 S. Bres, J.-M. Jolion
Highly Discriminative Invariant Features for Image Matching . . . . . . . . . . . . 435 R. Alferez, Y.-F. Wang Image Retrieval Using Schwarz Representation of One-Dimensional Feature 443 X. Ding, W. Kong, C. Hu, S. Ma Invariant Image Retrieval Using Wavelet Maxima Moment . . . . . . . . . . . . . . 451 M. Do, S. Ayer, M. Vetterli Detecting Regular Structures for Invariant Retrieval . . . . . . . . . . . . . . . . . . . . 459 D. Chetverikov Color Image Texture Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467 N. Nes, M.C. d’Ornellas Improving Image Classification Using Extended Run Length Features . . . . . 475 S.M. Rahman, G.C. Karmaker, R.J. Bignall Feature Extraction Using Fractal Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483 B.A.M. Schouten, P.M. de Zeeuw
Object Retrieval Content-Based Image Retrieval Based on Local Affinely Invariant Regions . 493 T. Tuytelaars, L. Van Gool A Framework for Object-Based Image Retrieval at the Semantic Level . . . . 501 L. Jia, L. Kitchen Blobworld: A System for Region-Based Image Indexing and Retrieval . . . . . 509 C. Carson, M. Thomas, S. Belongie, J.M. Hellerstein, J. Malik A Physics-Based Approach to Interactive Segmentation . . . . . . . . . . . . . . . . . 517 B.A. Maxwell
Ranking and Performance Assessment of Effectiveness of Content Based Image Retrieval Systems . . . 525 A. Dimai Adapting k-d Trees to Visual Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533 R. Egas, D.P. Huijsmans, M. Lew, N. Sebe Content-Based Image Retrieval Using Self-Organizing Maps . . . . . . . . . . . . . 541 J. Laaksonen, M. Koskela, E. Oja Relevance Feedback and Term Weighting Schemes for Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549 D. Squire, W. Mueller, H. Mueller
Genetic Algorithm for Weights Assignment in Dissimilarity Function for Trademark Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 D. Y.-M. Chan, I. King
Shape Retrieval Retrieval of Similar Shapes under Affine Transform . . . . . . . . . . . . . . . . . . . . . 566 F. Mokhtarian, S. Abbasi Efficient Image Retrieval through Vantage Objects . . . . . . . . . . . . . . . . . . . . . 575 J. Vleugels, R. Veltkamp Using Pen-Based Outlines for Object-Based Annotation and Image-Based Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585 L. Schomaker, E. de Leau, L. Vuurpijl Interactive Query Formulation for Object Search . . . . . . . . . . . . . . . . . . . . . . . 593 T. Gevers, A.W.M. Smeulders Automatic Deformable Shape Segmentation for Image Database Search Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601 L. Liu, S. Sclaroff A Multiscale Turning Angle Representation of Object Shapes for Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609 G. Iannizzotto, L. Vita Contour-Based Shape Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617 L.J. Latecki, R. Lakaemper Computing Dissimilarity Between Hand Drawn-Sketches and Digitized Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 F. Banfi, R. Ingold
Retrieval Systems Document Generation and Picture Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . 632 K. van Deemter FLORES: A JAVA Based Image Database for Ornamentals . . . . . . . . . . . . . 641 G. van der Heijden, G. Polder, J.W. van Eck Pictorial Portrait Indexing Using View-Based Eigen-Eyes . . . . . . . . . . . . . . . 649 C. Saraceno, M. Reiter, P. Kammerer, E. Zolda, W. Kropatsch Image Retrieval Using Fuzzy Triples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657 S.H. Jeong, J.D. Yang, H.J. Yang, J.H. Choi
Image Compression Variable-Bit-Length Coding: An Effective Coding Method . . . . . . . . . . . . . . . 665 S. Sahni, B.C. Vemuri, F. Chen, C. Kapoor Block-Constrained Fractal Coding Scheme for Image Retrieval . . . . . . . . . . . 673 Z. Wang, Z. Chi, D. Deng, Y. Yu Efficient Algorithms for Lossless Compression of 2D/3D Images . . . . . . . . . . 681 F. Chen, S. Sahni, B.C. Vemuri
Virtual Environments Lucent Vision™: A System for Enhanced Sports Viewing . . . . . . . . . . . . . 689 G.S. Pingali, Y. Jean, I. Carlbom Building 3D Models of Vehicles for Computer Vision . . . . . . . . . . . . . . . . . . . 697 R. Fraile, S.J. Maybank Integrating Applications into Interactive Virtual Environments . . . . . . . . . . 703 A. Biancardi, V. Moccia
Recognition Systems Structural Sensitivity for Large-Scale Line-Pattern Recognition . . . . . . . . . . 711 B. Huet, E.R. Hancock Complex Visual Activity Recognition Using a Temporally Ordered Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719 S. Bhonsle, A. Gupta, S. Santini, M. Worring, R. Jain Image Database Assisted Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727 S. Santini, M. Worring, E. Hunter, V. Kouznetsova, M. Goldbaum, A. Hoover Visual Processing System for Facial Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 735 C. Xu, J. Wu, S. Ma Semi-interactive Structure and Fault Analysis of (111)7x7 Silicon Micrographs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745 P. Androutsos, H.E. Ruda, A.N. Venetsanopoulos Using Wavelet Transforms to Match Photographs of Individual Sperm Whales Identified by Contour of the Trailing Edge of the Fluke . . . . . . . . . . 753 R. Huele, J.N. Ciano From Gaze to Focus of Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761 R. Stiefelhagen, M. Finke, J. Yang, A. Waibel
Automatic Interpretation Based on Robust Segmentation and Shape-Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769 G. Frederix, E.J. Pauwels A Pre-filter Enabling Fast Frontal Face Detection . . . . . . . . . . . . . . . . . . . . . . 777 S.C.Y. Chan, P.H. Lewis
Visualization Systems A Technique for Generating Graphical Abstractions of Program Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785 C. Demetrescu, I. Finocchi Visual Presentations in Multimedia Learning: Conditions that Overload Visual Working Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793 R. Moreno, R.E. Mayer Visualization of Spatial Neuroanatomical Data . . . . . . . . . . . . . . . . . . . . . . . . . 801 C. Shahabi, A.E. Dashti, G. Burns, S. Ghandeharizadeh, N. Jiang, L.W. Swanson Visualization of the Cortical Potential Field by Medical Imaging Data Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 809 M.C. Erie, C.H. Chu, R.D. Sidman Applying Visualization Research Towards Design . . . . . . . . . . . . . . . . . . . . . . 817 P. Janecek
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825
Supporting Image-Retrieval by Database Driven Interactive 3D Information-Visualization

Martin Leissler, Matthias Hemmje, Erich J. Neuhold
GMD – German National Research Center for Information Technology
IPSI – Integrated Publication and Information Systems Institute
Dolivostr. 15, 64293 Darmstadt, Germany
[leissler, hemmje, neuhold]@darmstadt.gmd.de

Abstract. Supporting image-retrieval dialogues between naive users and information systems is a non-trivial task. Although a wide variety of experimental and prototypical image retrieval engines is available, most of them lack appropriate support for end-user oriented front ends. We have decided to illustrate the possible advantages of a tight coupling between interactive 3D information visualization systems and image retrieval systems based on database management systems by deriving requirements from a characteristic application scenario. By means of an "interactive 3D gallery" scenario, the paper provides an overview of the requirements, components, and architecture of a general database-driven 3D information visualization system on the basis of an RDBMS and VRML. The given approach supports loading time as well as runtime database access in various forms. It reflects the overall conceptual framework of our activities in this highly dynamic area of research and forms a basis for many other applications where information objects have to be visualized for interacting users or user groups.
1. Introduction

Supporting image retrieval dialogues between naive users and information systems is a non-trivial task. Although supporting the basic "pattern matching" process within the image retrieval mechanism has been tackled by various research activities (e.g. [Pentland et al. 95] [Picard et al. 93] [Wang et al. 97] [Müller & Everts 97]) over the last years, supporting the user interface front end of an image retrieval dialogue in an appropriate way has been neglected to some extent. Most of the work conducted in this area (e.g. [VVB], [Chang et al 97a], [Chang et al 97b], [Chang et al 96a], [Christel et al 96]) applies user-interface paradigms implemented on the basis of 2D interface toolkits. In contrast to these works, we want to outline in this paper how image retrieval user interfaces in distributed front-end scenarios can be supported by means of interactive 3D information visualization technologies. The work presented in this paper is based on concepts, experiments, experiences, and insights gained from our work aiming at supporting, e.g., full-text retrieval and multimedia retrieval dialogues in a similar way. The paper introduces the available base technologies. Furthermore, it outlines an overall architectural system model based on a requirement
analysis derived from an example scenario. An overview of implementation aspects of the proposed architectural framework and an outlook on potential future work conclude the paper.
2. Base Technologies

Looking at the development of non-immersive interactive information visualization applications over the past few years, the Virtual Reality Modeling Language (VRML) has clearly become the de facto standard for representing interactive 3D worlds on the web or in offline applications. Furthermore, in 1998, VRML97 (the current specification, [VRML97]) made its way to becoming an ISO standard. If we examine the historical development of VRML in detail, it has to be recognized that the first version of the language standard (VRML 1.0) was directly derived from the file format which the OpenInventor toolkit [Wernecke 94] from Silicon Graphics Inc. had defined to exchange 3D scenes between different applications. This version of VRML was completely static, i.e., neither user interactivity nor animation was supported in a VRML 1.0 scene. Soon after this first version came the second version (VRML 2.0), which is today largely identical to the current ISO standard. It incorporates complex interactive behavior, advanced animation features, and custom extensibility. However, VRML97 is still a closed file format which uses its own internal and proprietary event model to describe behavior and user interaction. All information about a scene is contained exclusively within the VRML code. In consequence, there is no "natural" way for a VRML scene to communicate and integrate with other external applications or software components. The task of identifying concepts to solve this problem has mainly been tackled by individual working groups within the VRML consortium [Web3D]. One working group, for example, has defined the so-called External Authoring Interface (EAI) [EAI] to handle bidirectional communication between a Java applet and a VRML97 scene coexisting on a web page. If VRML is, based on its promising starting point, ever to become a seriously used interactive information visualization medium and development platform for all kinds of information system applications, a flexible and efficient integration of the language with existing technological standards such as interfaces to database management systems or application servers has to be achieved. This means built-in standard mechanisms for communication between VRML and external systems have to be derived from the existing standard. Clearly, this problem has to be tackled from both sides: the VRML language side and the side of external standard application programming interfaces (APIs). Situated in this interactive information visualization working context, our research concentrates on VRML and database integration which, in our opinion, is the most urgent problem at hand. By working on the problem of how VRML could communicate with existing database management systems (DBMS), we can also learn a lot about supporting more complex information visualization scenarios, e.g., persistent multi-user scenarios. Until today, all applications that used VRML in
connection with a DBMS had to rely on custom-coded database access with proprietary APIs such as the Java Database Connectivity (JDBC) [JDBC] or the Microsoft Open Database Connectivity (ODBC) [ODBC]. This is highly unsatisfactory, because people work on the implementation of the same interfacing problems over and over again. A standardization of the VRML side of the DBMS integration has been partially proposed by the database working group of the Web3D consortium [Web3D]. However, we believe that, although the overall observations of the working group are correct and sound, some more work will have to be done on the detailed specification of the features of the database integration – on the API side as well as on the VRML side. Some examples of necessary features will be provided in later sections of this paper. To derive the necessary extended requirements, this paper first takes a look at a complex example scenario and later describes how this scenario can be conceptually solved and supported by corresponding architectural system models and implementations.
3. Example Scenario

Suppose that, for example in an electronic commerce application, users want to browse through a visually rich 3D virtual gallery environment (as described in [Müller et al. 99]) filled with, from their subjective point of view, more or less interesting paintings of various artists. They want to have the option to buy a painting or just enjoy reviewing works of their favorite artists for a while. As soon as they have entered the system, a so-called "query interface" is presented in which they can enter a description of their interest. After choosing a few of their favorite artists and selecting some painting styles they like, it is time to submit the query containing the so-called "search criteria" to the system. Now, a 3D visually interactive presentation of an art gallery is generated in an area of the screen. The selection of the works of art is based on the users' search criteria. Somewhere else on the screen, an interactive 2D overview visualization which explains the overall architectural structure of the gallery building is displayed. The displayed architectural topology of the gallery building is structured hierarchically and is therefore easy to navigate by selecting areas related to the different artists, painting techniques, and styles. As the users move around in this virtual gallery environment, they orient themselves with the help of so-called "landmarks" and "signposts" inside the 3D environment as well as in the 2D overview visualization of the gallery. After a short spatial navigation, a room entrance is reached. A label describes that the works of art displayed in this room match a couple of search criteria defined earlier in the form-based query construction dialogue by the user (e.g., a certain artist and his self-portrait style paintings). After navigating through the entrance, a room shows up which contains only the paintings expected in this section of the gallery. As the users navigate into the room, they take their time to study the paintings hanging on the walls of the room. By clicking on one of the paintings in the 3D environment, all information about the painting stored in the database is displayed on a separate area of the screen together
with a more detailed high-quality image of the painting and further meta-information like, e.g., pricing and sales status. While strolling around in some room, the users may notice that one of the paintings is suddenly marked with a small "not-available" banner. Apparently, it has just been removed from the gallery: either it has been sold to someone else and is therefore no longer available in the 3D gallery, or someone has, e.g., not paid his bill to the gallery service provider for renting the space in the gallery. The application aspects and the user experience described above demand a set of different interactive information visualization functions to be supported by an overall technical architecture. The concepts which can define a formal basis for the implementation of such an architectural framework and its components are described below. The complete conceptual interactive information visualization model will be described in more detail in a separate paper.
4. Other Work

Before we describe the detailed conceptual system model and implementational architecture of our approach, we take a short look at existing systems which support scenarios that are similar but not identical to the one described above.

4.1. Virgilio

The Virgilio system as described in [Massari et al. 98] and [Constabile et al. 98] is a software system architecture which allows the user to submit a generic query on a database which typically contains multimedia content. As a result, the system constructs a 3D metaphorical visual representation of the hierarchical query result structure through which the user can browse the query result set interactively. One of the main features of Virgilio is that the queries, the visual representation, and the mapping between the query and the visualization are all stored in persistent repositories and can therefore be easily exchanged. The mapping between visual objects and query structures is called a "metaphor" in Virgilio. This metaphor can be completely user-defined, which means that the appearance of the visual environment can be dynamically adjusted. On the basis of a set of properties of the visual objects (e.g. a room object can contain other objects), the metaphor is intelligently applied to the given query result. A prototypical implementation of the Virgilio architecture exists on the basis of a custom application using a proprietary API (OpenInventor). After choosing a different query, the system has to completely reconstruct the 3D environment. If the underlying data changes during runtime navigation, this has no immediate effect on the scene. Furthermore, the queries in the query repository are fixed; the user cannot query the system freely.

4.2. TVIG

The TVIG system (The Virtual Internet Gallery) [Müller et al. 99] implements a scenario very similar to the 3D gallery scenario described above. Users can use a
standard web browser to query a relational database for information about artworks (e.g., their favorite artist by name) and retrieve a dynamically constructed 3D gallery visualized in the same web page. While browsing through the gallery, users can click on door handles to enter the rooms containing the paintings. Rooms and paintings are queried and constructed on demand, at runtime, to keep the system load as low as possible. The mapping between the gallery building structure and the search results is user-definable from within the system. The implementation of TVIG uses some standards such as HTML, VRML, Java, and JDBC, but is mainly based on custom-written code. As in Virgilio, the visual environment does not react immediately to changes in the underlying database. The visualization is reconstructed if a new search query is submitted by the user. In both systems there are no generic standard mechanisms to communicate between the database and the 3D visualization.
5. Requirements

Any scenario similar to the one described above gives rise to the following general requirements:

First of all, we definitely need an appropriate information distribution mechanism for the given scenario. In today's world of globally networked computer systems, it is obvious that an application like a virtual gallery should be a completely web-based application in order to be able to reach as many users as possible. The numerous versions of web browsers with all kinds of multimedia plugins available on client machines make the WWW an ideal application platform for our scenario. In addition to the browser technology defining the web-based application front end, a generally VR-capable application server is required in the back end of the application solution.

Next, we need a standard way of representing and rendering a real-time interactive 3D environment for the virtual gallery. This technology should be able to run on as many client platforms as possible and has to support user interaction, scene object behavior, and scene animation. Furthermore, the visualization environment has to support interactive navigation in a straightforward way and should seamlessly scale with the available client resources.

Since the data for the information objects of interest (in this case paintings) should be available to all users of the application at any time, a persistent storage database is required. This mechanism should also be able to serve a broad range of different client platforms and a large number of users in parallel.

Because the visual environment is based on the highly dynamic content of the persistent storage mechanism as well as on the highly dynamic interests of the users, there has to be an integrated mechanism to parametrically construct, reconstruct, and adapt the whole visual environment in a very flexible fashion.
Any user interaction has to be mapped to operations on the persistent storage which are, in turn, reflected in the 3D environment. This has to be possible while the application is running. Therefore, we need a bidirectional communication mechanism between the running 3D environment and the underlying persistent storage mechanism [Hemmje 99].

Changes in the storage should be reflected in the 3D environment immediately. Therefore, we need a mechanism to automatically notify the 3D scene about changes occurring in the underlying database managed by the persistent storage mechanism.
6. Architectural Model for a Database Driven 3D Visualization Environment

Figure 1 displays the general architectural model of an application scenario like the virtual gallery. It supports the requirements derived in Section 5. The architectural model consists of a browsing client, an application server, and a persistent storage system supporting the information object database displayed in Figure 1. Since the communication between the components (depicted by arrows) can potentially take place over a networked connection, the single components can be arbitrarily assigned to hardware platforms, e.g., from all three components on one platform to all on different platforms. The left side of the diagram displays the users' VR-capable browsing client. After entering the application, the VR client displays the user interface with all its components like, for example, a form-based user interface component which can be used to enter search criteria describing the users' interest in the customized gallery experience.

Fig. 1. The general architectural model of a data driven 3D visualization application
If a query (i.e., an information request) containing the search criteria is submitted to the application server, the server recognizes that an appropriate 3D scene has to be delivered back to the client. Since the server has to produce the actual data that matches the users' request, it has to translate user requests into a query which can be sent to the database. The database system processes the query and sends the result back to the application server. Now, the application server can use the retrieved data
to construct the 3D scene with the help of some special server extension. The 3D scene is then sent to the user's VR browsing client and displayed properly. As the dynamically constructed 3D scene is interactively browsed, users can interact with certain objects of the surrounding environment, which, in turn, may lead to the necessity to get additional data from the database storage mechanism. If, for example, the user enters an area in which the images for the paintings on the walls have not been retrieved from the database, they have to be retrieved, downloaded, and integrated into the scene during application runtime. Furthermore, if the user clicks on the image of a painting to retrieve information about the artist, price, etc., the same holds true. In these cases, the client runtime access mechanism has to be used to query the database and retrieve the results. These have to be directly integrated into the scene, at runtime, too. Furthermore, as soon as any data manipulation action is performed on the database storage mechanism which affects the data visualized in the running 3D scene – be it from external manipulation or from the scene itself – an immediate notification of the client has to be performed, to which the client's 3D environment can, in turn, react. By now, we have described the application interaction cycle completely. Note that even though the arrows in the above figure are drawn straight from the client to the database storage mechanism, the runtime communication and notification mechanisms do not necessarily have to go directly from the client system to the storage system. It is possible (and in many cases appropriate) to handle the communication through an additional middleware layer. Taking the described scenario, the derived requirements, and the proposed general architectural system model as a basis for our implementation approach, we can now look at the technical details which are needed to implement such a system.
7. Implementation Aspects in the Architectural Model

Assuming that we use a commonly available web browser as the platform for our application front end, we have furthermore decided to store the information about the artists, paintings, styles, techniques, prices, etc. in the tables of a conventional relational database management system (DBMS) as the platform for the persistent storage mechanism, i.e., the back end of our application. Next, we assume that VRML97 is used to present the 3D visualization of the gallery. As stated before, VRML97 is the most advanced and widely accepted standard available on the market. Therefore, it is most likely to be installed on a large number of client web browsers. However, as stated in the first section, VRML97 has a few limitations. Since it is a closed file format with its own event model, all interactions which could influence external systems have to be translated from and to the VRML event model. To date, there is no predefined way to communicate between a VRML scene and a database system. This limitation applies both to VRML scenes stored on a web server (loading time database access) and to VRML scenes currently running in a client browser (runtime database access).
If a scenario such as the described virtual gallery is to be implemented with reusable components, we have to define and implement standardized and general ways of database communication.

7.1. Loading Time Database Access in VRML

Since the 3D gallery is dynamically constructed on the basis of results corresponding to the users' search criteria, we cannot use a static VRML scene to represent the visual environment. Therefore, the scene has to be constructed dynamically, more or less on the fly, after the user has submitted the search criteria to the system. Because all data for the paintings is stored in a relational database, the application has to translate the users' search criteria into query language statements (typically SQL statements) which are then sent to the DBMS to be executed. Furthermore, the results of the query have to be integrated into the process of dynamically constructing the 3D gallery scene. In a typical implementation this would work by letting the custom-coded application (typically a Java applet or an ActiveX control) send an SQL statement to the DBMS via some API. Then, while iterating over the query statement's result set, the VRML scene code is generated by some code segment deep inside the application and sent to the VRML browser component. Because of the broad variety of possible visualizations of the 3D content, the probability of being able to reuse the VRML scene constructor code in a different but to some extent similar application scenario is typically very low. Furthermore, even if this implementation technique may work well, what we really want is a flexible and efficient standard way to integrate database results into a dynamically built VRML scene. This can be achieved by integrating a loading time database access capability into VRML with the help of a server side include mechanism as presented in [Risse et al. 98] and [Müller et al. 98]. This mechanism allows the scene designer to define unfinished VRML templates which contain SQL statements. At loading time of the scene, the responsible server on which the VRML scene templates are stored (i.e., either a web server or a DBMS with a custom extension) fills the templates with actual result data from the SQL query results. The client browser retrieves a perfectly VRML-compliant scene description which visually represents the actual data from the DBMS. Figure 2 displays a possible implementation approach of the server extension module during the dynamic information visualization process. On the upper left of Figure 2, the VRML scene graph containing a server side include node and a template subgraph is displayed. Before the server returns the VRML scene, the SQL statement is executed. Then the template subgraph is instanced for each returned row and filled with the query results. The VRML-compliant scene constructed in this way is returned to the client browser.

7.2. Runtime Database Access in VRML

For HTML pages which are downloaded once from a web server and are only statically viewed by the user, a server-side include mechanism may be enough database
interaction. But the main difference between a static HTML page and a VRML scene is that, after download, the VRML scene starts to run, i.e., it has a runtime. The user interacts with scene elements, and predefined animation sequences may be running, triggered by scene events. Mapping this to the given scenario means that the user, for example, interactively opens a door of the gallery by clicking on the handle. If such an interaction should trigger a database access (read or write) to dynamically construct the room behind the door, we clearly need some runtime interaction component in VRML capable of sending SQL statements to a database and retrieving the results. Furthermore, it should be possible to distribute the result data into the running VRML scene. Generally, this means a mapping from VRML runtime user interaction to database interaction. In this case, the events occur in the VRML scene and are mapped onto DBMS manipulation sequences.

Fig. 2. Loading time database access with a VRML server side include mechanism. The SSI node carries a statement of the form "SELECT s1, s2, ... FROM table INTO f1, f2, ... WHERE C"; the template subgraph with fields f1, f2, ... is instanced once for each result row and filled with the row's values.
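To make the template mechanism of Fig. 2 more concrete, a minimal, purely illustrative sketch of such an unfinished VRML template is given below. The node name SSINode, the exact INTO-clause syntax binding result columns to template fields, and the table and column names (paintings, imageURL, title, artist) are assumptions made for this sketch only; the file becomes valid VRML97 only after the server has expanded it.

  # Unfinished VRML template, expanded by the server at loading time: the
  # template subgraph is instanced once per result row, and the INTO fields
  # (f_image, f_title) are replaced by that row's column values.
  SSINode {
    sqlStatement "SELECT imageURL, title FROM paintings
                  INTO f_image, f_title
                  WHERE artist = 'Vermeer'"
    template [
      Transform {
        children Shape {
          appearance Appearance {
            texture ImageTexture { url "f_image" }  # becomes the painting's image URL
          }
          geometry Box { size 1.2 0.9 0.02 }        # flat "canvas" placeholder
        }
      }
    ]
  }

Because the client browser only ever receives the expanded result, loading time access of this kind requires no extension on the client side.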
Again, this functionality can be achieved by using proprietary Java code inside VRML Script nodes. In a typical VRML scene which uses such a database access mechanism, the custom Java code inside a Script node reacts to some scene event (representing a user interaction) by accessing a DBMS via a defined API (e.g., JDBC). The results are then collected in a VRML multiple-value field and further distributed in the scene by the VRML event mechanisms. Once again, we have custom code which has to be rewritten from scratch or modified each time the application requirements change. This, of course, is costly and a never-ending source of bugs which are potentially hard to trace. Moreover, this solution is highly dependent on the VRML browser component, which typically leads to unpredictable behavior in cross-browser application scenarios.
What would make the VRML application developer's life a lot easier is a standard mechanism for runtime SQL database access from within a VRML scene. The approach has to be general enough to cover all possible cases within a runtime database access scenario. This mechanism is provided by a VRML extension node (prototype) which allows the definition of arbitrary SQL statements and of the distribution of the possible results in the scene graph, as shown in Figure 3.
Fig. 3. The VRML runtime SQL-node. First, an SQL statement is sent to the database (1). Next, the results are returned into the VRML scene graph (2). Finally, the results are distributed to the defined positions in the graph (3).
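To illustrate steps (1)-(3) of Fig. 3, a possible use of such a runtime extension node is sketched below. The prototype name SQLQueryNode, its field names, the URN, and the JDBC-style data source string are illustrative assumptions of this sketch; they are taken neither from the node described in this paper nor from the Web3D recommended practice.

  #VRML V2.0 utf8
  # Hypothetical runtime query node: a click fires the statement (1), the
  # results come back into the scene (2) and are routed onward (3).
  EXTERNPROTO SQLQueryNode [
    eventIn  SFTime   execute       # fire to run the statement
    field    SFString sqlStatement
    field    SFString dataSource    # e.g. a middleware or JDBC-style address
    eventOut MFString results       # result column flattened to strings
  ] "urn:example:SQLQueryNode"      # illustrative location of the prototype

  DEF DoorSensor TouchSensor { }    # clicking the sibling geometry below fires the query

  DEF RoomQuery SQLQueryNode {
    sqlStatement "SELECT imageURL FROM paintings WHERE room = 17"
    dataSource   "jdbc:example://gallery-server/artdb"
  }

  DEF Wall Shape {
    appearance Appearance { texture DEF WallTexture ImageTexture { } }
    geometry Box { size 4 3 0.1 }
  }

  ROUTE DoorSensor.touchTime TO RoomQuery.execute
  ROUTE RoomQuery.results    TO WallTexture.set_url

The behavior corresponding to step (3) is the essential point: the node itself routes result values onto ordinary VRML fields, so no per-scene Script node is needed merely to distribute data.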
An extension node with similar functionality has been proposed by the database working group (now enterprise working group) of the VRML consortium in their "Recommended practices for SQL database access" [Lipkin 98]. However, while the working group's proposal covers steps (1) and (2) of Figure 3, our solution also lets the user of such an extension node directly define exactly how the result data is to be distributed across the nodes of the VRML scene graph. Thereby, the developer is spared the burden of writing a couple of custom Script nodes for every different scene graph just in order to distribute data. This leads to less, and more efficient, VRML code. Communication between the SQL node and the DBMS can be implemented through a direct database connection. Because DBMS connections are expensive in many application scenarios, it is also possible to connect all clients to some middleware component via lightweight communication protocols and let this component handle all requests through one keep-alive DBMS connection. The implementation-specific details of this approach will be presented in a separate paper.

7.3. Automatic Database Event Triggering with VRML

The last requirement mentioned in our analysis is derived from the fact that, although the 3D scene in some way visualizes the actual database content, it does not automatically react to changes in the underlying data. If, for example, one of the paintings in our example scenario is marked as "not-available" in the database, for
some reason (e.g., because the painting has been sold), the VRML scene should react instantly by visualizing this event, e.g., by displaying a banner on top of the painting. Therefore, we need a mechanism which enables the database to automatically notify the running VRML scene about occurring events. More generally, this means that a mapping from database events to VRML scene events is required. Typically, in existing applications, this has to be done by reloading and reconstructing the whole scene based on the changed database content, which is inflexible and time-consuming. Another possibility is to query the database at regular intervals from within the VRML scene to detect possible changes in the underlying data. This approach could be implemented with the help of the above-mentioned runtime query component. However, it unnecessarily consumes network bandwidth and runtime resources even if the database content remains unchanged. An elegant solution to this problem is an active database trigger mechanism which enables the database to contact the VRML scene if some given part of the database has changed in a predefined way. Such a technology needs to define standards for how the database should invoke a notification mechanism, how the database events are translated to VRML events and sent to the running scene, and, finally, how the VRML scene can handle such events in order to delegate them to the relevant parts of the scene graph.
Fig. 4. Architecture for automatic notification through database triggers
Many different clients of the database system (including the VRML scene itself!) can access the database and change its internal data. Database triggers can be assigned to arbitrary parts of the data (i.e., tables in an RDBMS) and fire a predefined action sequence. In this case the trigger action launches a notification mechanism which contacts a defined node in the VRML client scene. After distributing the event in the scene, a new query can be sent to retrieve the updated data from the database. Note that this last aspect of the presented scenario already supports a shared multi-client architecture. As soon as multiple clients are connected to the system and display different parts of the scene (or even different scenes) based on the data in the storage, the trigger notification mechanism can also be used in shared virtual environment applications which have to synchronize a global application state across multiple connected clients. Again, this is best done by using a middleware component to handle the communication between the database trigger mechanism and the client machine. The middleware component can distribute the incoming notification events from the database to the appropriate clients and, at the same time, merge the expensive database connections. The details of this approach are presented in a separate paper.
7.4. Overall Architectural Framework
After describing all crucial system components, we can now define a generic overall architectural framework which matches our given requirements and is able to run database-driven interactive 3D information visualization applications similar to the described gallery scenario (Figure 5).
Fig. 5. Technical architecture of a database-driven interactive 3D information visualization system
The above figure displays the interaction of all components. Before VRML clients go into runtime interaction mode, they log into the system and request the 3D environment from the web server based on custom search criteria. The web server, like all other intermediary server components, is combined under the concept of an application server. Note again that all components in the diagram, including the application server components, can be arbitrarily assigned to physical machines. During the loading-time process (dotted arrows), the web server queries the DBMS, fetches the result data matching the user's search criteria, and, finally, returns the customized VRML templates to the client via the VR extension module. As a result of certain interactions (e.g., opening a door in the gallery), an SQL node in the running VRML scene queries the database via a middleware database driver (typically JDBC type 3) and distributes the query results to the proper positions in the scene graph (e.g., puts the pictures on the gallery walls).
If the underlying data for a VRML scene is now affected, either by a VRML client (through an SQL node) or by an external DBMS client, the trigger notification mechanism may be launched (normal arrows). The notification event is distributed to the clients via a middleware trigger server. This component notifies exclusively those clients which are affected by the current change in the underlying data (e.g., clients displaying the same scene), and thereby reduces the network load. Notifications are distributed in the running VRML scene as events which, in turn, may launch a query from an SQL node to retrieve the most recent data. Note how the trigger mechanism can be used to propagate user interactions (affecting the global database state) on one client across multiple other clients connected to the system.
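The following Java fragment sketches only the relay idea behind such a trigger server (the actual implementation is, as stated above, presented in a separate paper); the line-based socket protocol and all names are our own assumptions.

```java
import java.io.*;
import java.net.*;
import java.util.*;

/** Sketch of a trigger notification relay: the DBMS trigger invokes notifyScene(),
 *  and the relay forwards the event only to clients viewing the affected scene. */
public class TriggerNotificationRelay {
    // sceneId -> sockets of VRML clients currently displaying that scene
    private final Map<String, List<Socket>> clientsByScene = new HashMap<>();

    public synchronized void register(String sceneId, Socket client) {
        clientsByScene.computeIfAbsent(sceneId, k -> new ArrayList<>()).add(client);
    }

    /** Called (e.g., via a stored procedure or an external program launched
     *  by the database trigger) when the underlying data has changed. */
    public synchronized void notifyScene(String sceneId, String eventDescription) {
        for (Socket client : clientsByScene.getOrDefault(sceneId, Collections.emptyList())) {
            try {
                PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                out.println(eventDescription);  // the Trigger Node turns this line into a VRML event
            } catch (IOException e) {
                // a disconnected client would be removed from the registry in a real system
            }
        }
    }
}
```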
8. Conclusions and Outlook
In this paper, we have presented an architectural framework consisting mainly of three components (VRML server-side includes, the SQL runtime node and active database triggers) which is capable of running highly dynamic, database-driven, interactive 3D information visualization applications. We have outlined how all architectural components can work together in a complex visualization scenario such as the virtual gallery. However, the presented architectural components can also be used as completely independent stand-alone components in applications with different requirements. Indeed, not all three components need to be used in every application scenario; our experience has taught us that a combination of only some of them is often already sufficient.
References
[Chang et al 97a] Chang, S.-F., Chen, W., Meng, H., Sundaram, H., Zhong, D. (1997). VideoQ: An automated content-based video search system using visual cues. In: Proceedings of ACM Multimedia 1997.
[Chang et al 97b] Chang, S.-F., Smith, J., Meng, H., Wang, H., Zhong, D. (1997). Finding images/video in large archives. In: D-Lib Magazine, February 1997.
[Chang et al 96] Chang, Y.-L., Zeng, W., Kamel, I., Alonso, R. (1996). Integrated image and speech analysis for content-based video indexing. In: Proceedings of ACM Multimedia 1996.
[Christel et al 97] Christel, M., Winkler, D., Taylor, C. (1997). Multimedia abstraction for a digital video library. In: Proceedings of ACM Digital Libraries '97, pages 21-29, Philadelphia, PA.
[Costabile et al. 98] Costabile, M. F., Malerba, D., Hemmje, M., Paradiso, A. (1998). Building Metaphors for Supporting User Interaction with Multimedia Databases. In: Proceedings of the 4th IFIP 2.6 Working Conference on Visual DataBase Systems - VDB 4, L'Aquila, Italy, May 27-29, pp. 47-66, Chapman & Hall 1998.
[DBWork] Enterprise Technology Working Group of the Web3D consortium. http://www.vrml.org/WorkingGroups/dbwork/
[EAI] Information technology -- Computer graphics and image processing -- The Virtual Reality Modeling Language (VRML) -- Part 2: External authoring interface. Committee Draft ISO/IEC 14772-2:xxxx. http://www.web3d.org/WorkingGroups/vrml-eai/Specification/
[Hemmje 99] Hemmje, M. (1999). Supporting Information System Dialogues with Interactive Information Visualization. Dissertation, Technical University of Darmstadt, 1999 (to appear).
[JDBC] Sun Microsystems, Inc. The JDBC Data Access API. http://java.sun.com/products/jdbc/index.html
[Lipkin 98] Lipkin, D. (1998). Recommended Practices for SQL Database Access. http://www.web3d.org/Recommended/vrml-sql/
[Massari et al. 98] Massari, A., Saladini, L., Sisinni, F., Napolitano, W., Hemmje, M., Paradiso, A., Leissler, M. (1998). Virtual Reality Systems for Browsing Multimedia. In: Furht, B. (ed.): Handbook of Multimedia Computing.
[Müller et al. 98] Müller, U., Leissler, M., Hemmje, M. (1998). Entwurf und Implementierung eines generischen Mechanismus zur dynamischen Einbettung multimedialer Daten in VRML-Szenen auf der Basis eines objektrelationalen DBMS. GMD Research Series, No. 23/1998, GMD - Forschungszentrum Informationstechnik, St. Augustin 1998.
[Müller et al. 99] Müller, A., Leissler, M., Hemmje, M., Neuhold, E. (1999). Towards the Virtual Internet Gallery. To appear in: Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS'99).
[Müller & Everts 97] Müller, A., Everts, A. (1997). Interactive image retrieval by means of abductive inference. In: RIAO 97 Conference Proceedings -- Computer-Assisted Information Searching on Internet, pages 450-466, June 1997.
[ODBC] Microsoft Press (1997). Microsoft ODBC 3.0 software development kit and programmer's reference. Microsoft Press, Redmond, Washington.
[Picard et al. 93] Picard, R. W., Kabir, T. (1993). Finding Similar Patterns in Large Image Databases. In: IEEE ICASSP, Minneapolis, Vol. V, pp. 161-164, 1993.
[Pentland et al. 95] Pentland, A., Picard, R. W., Sclaroff, S. (1995). Photobook: Content-based Manipulation of Image Databases. In: SPIE Storage and Retrieval for Image and Video Databases II, San Jose, CA, 1995.
[Risse et al. 98] Risse, T., Leissler, M., Hemmje, M., Aberer, K. (1998). Supporting Dynamic Information Visualization With VRML and Databases. In: CIKM '98, Workshop on New Paradigms in Information Visualization and Manipulation, Bethesda, November 1998.
[VRML97] International Standard ISO/IEC 14772-1:1997. Information technology -- Computer graphics and image processing -- The Virtual Reality Modeling Language (VRML) -- Part 1: Functional specification and UTF-8 encoding. http://www.web3d.org/Specifications/VRML97/
[VVB] The virtual video browser. http://hulk.bu.edu/projects/vvb_demo.html
[Wang et al. 97] Wang, J. Z., Wiederhold, G., Firschein, O., Wei, S. X. (1997). Content-based image indexing and searching using Daubechies' wavelets. In: International Journal on Digital Libraries, Vol. 1, Number 4, December 1997, Springer-Verlag, pp. 311-328.
[Web3D] Web3D Consortium (formerly: VRML consortium). http://www.web3d.org (formerly http://www.vrml.org).
[Wernecke 94] Wernecke, Josie (1994). The Inventor Mentor: Programming Object-Oriented 3D Graphics with Open Inventor Release 2. Open Inventor Architecture Group; Addison-Wesley Publishing Company, Inc. 1994.
Video Libraries: From Ingest to Distribution
Ruud M. Bolle and Arun Hampapur IBM T. J. Watson Research Center Yorktown Heights, NY 10598 {bolle,arunh}@us.ibm.com
Abstract. Production, transmission and storage of video will eventually all be in digital form. Additionally, there is a need to organize video efficiently in databases so that videos are easily ingested, retrieved, viewed and distributed. We address and discuss many of the issues associated with video database management.
1 Introduction
The digital form of video will allow us to do many things – some of these things can be envisioned today, others will be discovered during the years to come. The digital form permits computationally extracting video content descriptors. Ideally, video is completely annotated through machine interpretation of the semantic content of the video. In practice, given the state of the art in computer vision, such sophisticated data annotations may not be feasible. Much of the focus in multimedia library efforts has been on text and image databases [1], not on video libraries. We touch upon techniques for processing video as an image of space-time, and we argue that the type of video analysis that has to be performed should be geared toward the specific video genre or category – e.g., sports versus soap operas. Processing video to derive annotations is one thing. Another, as we discuss first, is the infrastructural demand of putting such video management systems together.
2 Base Video Management System
Figure 1 shows the functional components of a base video management system and their relationship to each other.
Ingest and annotation: Video data can be analog or digital. Ingest deals with video digitization in the case of analog video, and with parsing a wide variety of digital video formats. Associating annotations with video segments is another part of the ingest process. Annotation consists of:
1. Real-time logging: Extracting a first level of time-dependent indices from a video stream. This may include keyframe extraction and basic audio index information.
The work reported in the paper has been funded in part by NIST/ATP under Contract Number 70NANB5H1174.
Fig. 1. Base video management system (video input, ingest and annotation, meta database, media database, retrieval and browsing, media distribution)
2. Manual annotation and cataloging: In many applications, manually associating information with different time segments of video and linking a unit of video to other media is essential.
3. Automatic offline annotation: Providing content-based access to video requires content analysis and annotation. Most of these processes run offline.
The data generated by the ingest and annotation process is stored in two locations, namely:
Meta database: This can be a traditional relational database system like DB2. A data model for video data management [2] includes a time-independent part (title, producer, directors, length, etc.) and, more interestingly, a time-dependent part which uses a relational table structure.
Media database: These databases handle both the storage and distribution aspects of managing the actual digital video data. They are file servers which are designed to handle streaming media like video.
Finally, there is the issue of component inter-operability. Each of the functional blocks shown in Figure 1 is a complex subsystem. This gives rise to issues of inter-operation between the components using a standardized command and control protocol. A research effort that addresses this issue can be found in [3].
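As a rough, hedged illustration of the two-part data model (the field names below are ours and are not taken from [2]), the time-independent and time-dependent parts could be represented as follows.

```java
import java.util.*;

/** Illustrative sketch of the two-part video data model:
 *  a time-independent record plus time-dependent annotations. */
public class VideoRecord {
    // Time-independent part
    String title;
    String producer;
    List<String> directors = new ArrayList<>();
    int lengthSeconds;

    // Time-dependent part: annotations attached to time segments; in a
    // relational system these would be rows (videoId, start, end, label)
    static class Annotation {
        double startSeconds;
        double endSeconds;
        String label;   // e.g., a keyframe id, an audio index, or a manual note

        Annotation(double start, double end, String label) {
            this.startSeconds = start;
            this.endSeconds = end;
            this.label = label;
        }
    }

    List<Annotation> annotations = new ArrayList<>();
}
```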
3 Video Annotation
Annotation is ideally achieved in a completely automatic fashion [4]. Video is a concatenation of shots. As described in [5], the analysis of video should not depend too much on the reliability of the shot detection algorithm that is used. Moreover, the analysis of video should go beyond computations on pixels just within shots, i.e., between-shot processing is important. The goal of between-shot processing is to derive high-level structure for the automatic annotation of possibly long video segments. The scene structure of, for example, sitcoms can be rediscovered using clustering of shots [5]. In [6] the concept of motion picture grammars is introduced. The thesis is that video data can be represented by grammars (e.g., [7]). The grammars need to be stochastic [8];
stochastic, context-free grammars and hidden Markov models [9] are closely related. Hidden Markov models are used in [10] to detect commercials.
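For concreteness, the sketch below shows one common form of shot detection: thresholding the gray-level histogram difference of consecutive frames. It is only an illustrative baseline, not the method of [5] or [10], and the bin count and threshold are arbitrary choices.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified shot-boundary detector: flags a cut when the gray-level
 *  histogram difference between consecutive frames exceeds a threshold. */
public class ShotBoundaryDetector {
    private static final int BINS = 64;
    private final double threshold;   // fraction of pixels that must change (0..1)

    public ShotBoundaryDetector(double threshold) {
        this.threshold = threshold;
    }

    // frames are assumed to be 8-bit gray-level images (pixel values 0..255)
    private static int[] histogram(int[][] grayFrame) {
        int[] hist = new int[BINS];
        for (int[] row : grayFrame) {
            for (int pixel : row) {
                hist[pixel * BINS / 256]++;
            }
        }
        return hist;
    }

    /** Returns the indices of frames that are likely to start a new shot. */
    public List<Integer> detect(List<int[][]> frames) {
        List<Integer> cuts = new ArrayList<>();
        int[] previous = null;
        for (int i = 0; i < frames.size(); i++) {
            int[] current = histogram(frames.get(i));
            if (previous != null) {
                long diff = 0, total = 0;
                for (int b = 0; b < BINS; b++) {
                    diff += Math.abs(current[b] - previous[b]);
                    total += previous[b];
                }
                // normalized histogram difference lies in [0, 1]
                if (total > 0 && (double) diff / (2.0 * total) > threshold) {
                    cuts.add(i);
                }
            }
            previous = current;
        }
        return cuts;
    }
}
```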
4 Video Retrieval and Distribution
Retrieving video through the formulation of a query is inherently more complicated than retrieving text documents. In addition to text, there is visual and audio information; moreover, there is temporal visual dynamics. Very much like text query formulation, a video query is a sequence of steps. Each step is an active filtering to reduce the number of relevant candidates. Each step allows interactive query formulation, and each gives a more refined query to the next step. Video query (see [5]) can be broken down as: query on the category of video (navigating), query on the text, and/or audio and visual feature descriptions (searching), query on the semantic summary of visual content (browsing) and query on the full-motion audio-visual content (viewing).
5 Specialized Video Management Systems
Base video systems are currently available as products. Such systems provide most of the infrastructural requirements for managing video. However, effective video management requires the ability to retrieve video based on much higher-level semantic concepts. This demands the development of specialized video data management systems which are tailored to different domains. Each new application domain will require several additional functionalities, including specialized indexing algorithms, user interfaces, and data models:
Indexing algorithms: Depending on the application domain, new indexing strategies are needed. For example, for sports, new event indexing algorithms need to be developed. Say, for basketball, algorithms for detecting events like scoring become critical.
User interfaces: The browsing and viewing patterns for video will differ significantly across domains. For example, in a news video management system, the searching will be based on the content of the speech. For sports it will be based more on visual content such as different play-patterns. Thus both the query interface and the video control interface need to suit the domain.
Data models: Certain domains may require that the management system be capable of managing several types of media and associations between them. This implies that data models for such systems have to be augmented beyond the simple models used in base video management systems.
6 Discussion
We have described many of the aspects of video database management systems. Video indexing is but one of the components of such systems; video ingest is another important and often neglected component. Finally, complicated infrastructures are needed for complete end-to-end systems.
References
1. A. Gupta and R. Jain, "Visual information retrieval," Comm. ACM, vol. 40, pp. 70-79, May 1997.
2. A. Coden, N. Haas, and R. Mack, "A system for representing and searching video segments defined by video content annotation methods," tech. rep., IBM T.J. Watson Research Center, 1998.
3. N. Haas, Proposed SMPTE Standard for Television Digital Studio Command and Control (DS-CC): Media and Metadata Location. NIST/ATP HD Studio Joint Venture, 1998.
4. A. Nagasaka and Y. Tanaka, "Automatic video indexing and full-motion search for object appearances," in Proc. IFIP TC2/WG2.6 2nd Working Conf. on Visual Database Systems, pp. 113-127, Sep.-Oct. 1991.
5. R. M. Bolle, B.-L. Yeo, and M. M. Yeung, "Video query: Research directions," IBM J. of R & D, vol. 42, pp. 233-252, March 1998.
6. R. Bolle, Y. Aloimonos, and C. Fermuller, "Toward motion picture grammars," in Proc. IEEE 3rd ACCV, pp. 283-290, Jan. 1998.
7. K. S. Fu, Syntactic Pattern Recognition and Applications. Englewood Cliffs, NJ: Prentice Hall, 1982.
8. E. Charniak, Statistical Language Learning. MIT Press, 1993.
9. X. D. Huang, Y. Ariki, and M. A. Jack, Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990.
10. Y.-P. Tan and R. Bolle, "Binary video classification," Tech. Rep. RC-21165, IBM T.J. Watson Research Center, 1998.
Querying Multimedia Data Sources and Databases1 Shi-Kuo Chang1, Gennaro Costagliola2, and Erland Jungert3 1
Department of Computer Science University of Pittsburgh
[email protected] 2 Dipartimento di Matematica ed Informatica Università di Salerno
[email protected] 3 Swedish Defense Research Institute (FOA)
[email protected]
Abstract. To support the retrieval and fusion of multimedia information from multiple sources and databases, a spatial/temporal query language called ΣQL is proposed. ΣQL is based upon the σ−operator sequence and in practice expressible in SQL-like syntax. ΣQL allows a user to specify powerful spatial/temporal queries for both multimedia data sources and multimedia databases, eliminating the need to write different queries for each. A ΣQL query can be processed in the most effective manner by first selecting the suitable transformations of multimedia data to derive the multimedia static schema, and then processing the query with respect to this multimedia static schema.
1 Introduction
The retrieval and fusion of spatial/temporal multimedia information from diversified sources calls for the design of spatial/temporal query languages capable of dealing with both multiple data sources and databases in a heterogeneous information system environment. With the rapid expansion of the wired and wireless networks, a large number of soft real-time, hard real-time and non-real-time sources of information need to be processed, checked for consistency, structured and distributed to the various agencies and people involved in an application [12]. In addition to multimedia databases, it is also anticipated that numerous web sites on the World Wide Web will become rich sources of spatial/temporal multimedia information. Powerful query languages for multiple data sources and databases are needed in applications such as emergency management (fire, flood, earthquake, etc.), telemedicine, digital library, community network (crime prevention, child care, senior citizens care, etc.), military reconnaissance and scientific exploration (field computing). These applications share the common characteristic that information from multiple sources and databases must be integrated.
This research was co-funded by the National Science Foundation, USA, the Swedish National Defence Institute and the Italian National Council of Research (CNR).
A typical scenario for information fusion in emergency management may involve a live report from a human observer, data collected by a heat sensor, video signal from a camera mounted on a helicopter, etc. Current systems often have preprogrammed, fixed scenarios. In order to enable the end user to effectively retrieve spatial/temporal multimedia information and to discover relevant associations among media objects, a flexible spatial/temporal multimedia query language for multiple data sources and databases should be provided. To support the retrieval and fusion of multimedia information from multiple sources and databases, a spatial/temporal query language called ΣQL is proposed. ΣQL is based upon the σ-operator sequence and in practice expressible in an SQL-like syntax. The natural extension of SQL to ΣQL allows a user to specify powerful spatial/temporal queries for both multimedia data sources and multimedia databases, eliminating the need to write different queries for each. Query languages for heterogeneous multimedia databases are a new and growing research area [9, 13]. There has been substantial research on query languages for images and spatial objects [2], and a survey can be found in [5, 6]. Of these query languages, many are based upon extensions of SQL [14], such as PSQL [15] and Spatial SQL [8]. Next come video query languages, where the focus is shifted to temporal constraints [1] and content-based retrieval [3]. While the above-described approaches each address some important issues, there is a lack of unified treatment of queries that can deal with both spatial and temporal constraints from both live data sources and stored databases. The proposed approach differs from the above in the introduction of a general powerful operator called the σ-operator, so that the corresponding query language can be based upon σ-operator sequences. The paper is organized as follows. The basic concepts of the σ-query are explained in Section 2. Section 3 introduces elements of Symbolic Projection Theory and the general σ-operator, and Section 4 describes the ΣQL query language. An illustration of data fusion using the σ-query is presented in Section 5. Section 6 formalizes the representation for multimedia sources and then gives a query processing example. In Section 7 we discuss further research topics.
2 Basic Concepts of the σ-Query
As mentioned in Section 1, the σ-query language is a spatial/temporal query language for information retrieval from multiple sources and databases. Its strength is its simplicity: the query language is based upon a single operator - the σ-operator. Yet the concept is natural and can easily be mapped into an SQL-like query language. The σ-query language is useful in theoretical investigation, while the SQL-like query language is easy to implement and is a step towards a user-friendly visual query language. An example is illustrated in Figure 1. The source R, also called a universe, consists of time slices of 2D frames. To extract three pre-determined time slices from the source R, the query in mathematical notation is: σt (t1, t2, t3) R. The meaning of the σ-operator in the above query is SELECT, i.e. we want to select the time axis and three slices along this axis. The subscript t in σt indicates the selection of the time axis. In the SQL-like language a ΣQL query is expressed as:
SELECT t CLUSTER t1, t2, t3 FROM R
Fig. 1. Example of extracting three time slices (frames) from a video source
A new keyword "CLUSTER" is introduced, so that the parameters for the σ-operator can be listed, such as t1, t2, t3. The word "CLUSTER" indicates that objects belonging to the same cluster must share some common characteristics (such as having the same x coordinate value). A cluster may have a sub-structure specified in another (recursive) query. Clustering is a natural concept when dealing with spatial/temporal objects. The mechanism for clustering will be discussed further in Section 3. The result of a ΣQL query is a string that describes the relationships among the clusters. This string is called a cluster-string, which will also be discussed further in Section 3. A cluster is a collection of objects sharing some common characteristics. The SELECT-CLUSTER pair of keywords in ΣQL is a natural extension of the SELECT keyword in SQL. In fact, in SQL implicitly each attribute is considered as a different axis. The selection of the attributes' axes defines the default clusters as those sharing common attribute values. As an example, the following ΣQL query is equivalent to an SQL query to select the attributes' axes "sname" and "status" from the suppliers in Paris:
SELECT sname, status CLUSTER * FROM supplier WHERE city = "Paris"
In the above ΣQL query, the * indicates any possible values for the dimensions sname and status. Since no clustering mechanism is indicated after the CLUSTER keyword, the default clustering is assumed. Thus by adding the "CLUSTER *" clause, every SQL query can be expressed as a ΣQL query. Each cluster can be open (with objects inside visible) or closed (with objects inside not visible). The notation is t2o for an open cluster and t2c, or simply no superscript, for a closed cluster. In the ΣQL language the keyword "OPEN" is used:
SELECT t CLUSTER t1, OPEN t2, t3 FROM R
With the notation described above, it is quite easy to express a complex, recursive query. For example, to find the spatial/temporal relationship between objects having the same x coordinate values x1 or x2 from the three time slices of a source R, as illustrated in Figure 1, the ΣQL query in mathematical notation is:
σx (x1, x2) (σt (t1o, t2o, t3o) R)   (1)
The query result is a cluster-string describing the spatial/temporal relationship between the objects 'a' and 'b'. How to express this spatial/temporal relationship depends upon the (spatial) data structure used. In the next section we explain Symbolic Projection as a means to express spatial/temporal relationships.
3 A General σ-Operator for σ-Queries
As mentioned above, the ΣQL query language is based upon a single operator - the σ-operator - which utilizes Symbolic Projection to express the spatial/temporal relationships in query processing. In the following, Symbolic Projection, the cutting mechanism and the general σ-operator are explained, which together constitute the theoretical underpinnings of ΣQL. Symbolic Projection [7, 11] is a formalism where space is represented as a set of strings. Each string is a formal description of space or time, including all existing objects and their relative positions viewed along the corresponding coordinate axis of the string. This representation is qualitative because it mainly describes sequences of projected objects and their relative positions. We can use Symbolic Projection as a means for expressing the spatial/temporal relationships extracted by a spatial/temporal query. Continuing the example illustrated by Figure 1, for time slice Ct1 its x-projection using the Fundamental Symbolic Projection is:
σx (x1, x2) Ct1 = (u: Cx1,t1 < Cx2,t1)   (2)
and its y-projection is:
σy (y1, y2) Ct1 = (v: Cy1,t1 < Cy2,t1)   (3)
In the above example, a time slice is represented by a cluster Ct1 containing objects with the same time attribute value t1. A cluster-string is a string composed from cluster identifiers and relational operators. The single cluster Ct1 is considered a degenerate cluster-string. After the σy operator is applied, the resulting cluster Cy1,t1 contains objects with the same time and space attribute values. In the above example, the cluster-string (v: Cy1,t1 < Cy2,t1) has the optional parentheses and projection variable "v" to emphasize the direction of projection. The query σt (t1, t2, t3) R yields the following cluster-string α:
α = (t: Ct1 < Ct2 < Ct3)   (4)
When another operator is applied, it is applied to the clusters in a cluster-string. Thus the query σx (x1, x2) σt (t1o, t2o, t3o) R yields the following cluster-string β:
β = (t: (u: Cx1,t1 < Cx2,t1) < (u: Cx1,t2 < Cx2,t2) < (u: Cx1,t3 < Cx2,t3))   (5)
The above cluster-string β needs to be transformed so that the relationships among the objects become directly visible. This calls for the use of a materialization function MAT to map clusters to objects. Since Cx1,t1 = Cx1,t2 = Cx1,t3 = {a} and Cx2,t1 = Cx2,t2 = Cx2,t3 = {b}, the materialization MAT(β) of the above cluster-string yields:
MAT(β) = (t: (u: a < b) < (u: a < b) < (u: a < b))   (6)
The query result in general depends upon the clustering, which in turn depends upon the cutting mechanism. The cutting is an important part of Symbolic Projection because a cutting determines both how to project and also the relationships among the objects or partial objects on either side of the cutting line. Usually the cuttings are ordered lists that are made in accordance with the Fundamental Symbolic Projection. The cutting type, κ-type, determines which particular cutting mechanism should be applied in processing a particular σ-query. The general σ-operator is defined by the following expression where, in order to make different cutting mechanisms available, the cutting mechanism κ-type is explicitly included:
σaxes, κ-type σ-type (clusters)ϕ = stype : cluster-string   (7)
The general σ-operator is of the type σ-type and selects an axis or multiple axes, followed by a cutting mechanism of the type κ-type on (clusters)ϕ, where ϕ is a predicate that objects in the clusters must satisfy. The σ-operator operates on a cluster-string that either describes a data source (e.g. data from a specified sensor) or is the result of another σ-operator. The result of the σ-operator is another cluster-string of type stype. Since the result of the σ-operator is always a cluster-string, a materialization operator MAT is needed to transform the cluster-string into real-world objects and their relationships for presentation to the user.
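As an informal illustration of how a projection string can be computed (a simplified version of the Fundamental Symbolic Projection, with invented class and method names), consider the following sketch, which joins objects sharing a coordinate with '=' and orders distinct coordinates with '<'.

```java
import java.util.*;

/** Sketch of the Fundamental Symbolic Projection along one axis:
 *  objects are grouped by coordinate, groups are joined with '=',
 *  and consecutive groups are separated by '<'. */
public class SymbolicProjection {
    /** positions maps an object name to its coordinate along the chosen axis. */
    public static String project(String axisVariable, Map<String, Integer> positions) {
        // group object names by coordinate value, in ascending coordinate order
        TreeMap<Integer, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : positions.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        StringBuilder s = new StringBuilder(axisVariable + ": ");
        boolean first = true;
        for (List<String> group : groups.values()) {
            if (!first) s.append(" < ");
            Collections.sort(group);                 // deterministic order within a group
            s.append(String.join(" = ", group));     // objects sharing the same coordinate
            first = false;
        }
        return s.toString();
    }

    public static void main(String[] args) {
        // a toy frame: objects a and b share the same x coordinate, c lies to their right
        Map<String, Integer> xCoords = new HashMap<>();
        xCoords.put("a", 0);
        xCoords.put("b", 0);
        xCoords.put("c", 2);
        System.out.println(project("u", xCoords));   // prints "u: a = b < c"
    }
}
```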
4 The ΣQL Query Language
ΣQL is an extension of SQL to the case of multimedia sources. In fact, it is able to query seamlessly traditional relational databases, multimedia sources, and their combination. The ΣQL query language operates on the extended multimedia static structure MSS which will be described in Section 6. A template of a ΣQL query is given below:
SELECT dimension_list
CLUSTER [cluster_type] [OPEN] cluster_val1, .., [OPEN] cluster_valn
FROM source
WHERE conditions
PRESENT presentation_description
which can be translated as follows: "Given a source (FROM source) and a list of dimensions (SELECT dimensions), select clusters (CLUSTER) corresponding to a list of projection values or variables ([OPEN] cluster_val1, ..) on the dimension axes using the default or a particular clustering mechanism ([cluster_type]). The clusters must satisfy a set of conditions (WHERE conditions) on the existing projection variables and/or on cluster contents if these are open ([OPEN]). The final result is presented according to a set of presentation specifications (PRESENT presentation_description)." Each σ-query can be expressed as a ΣQL query. For example, the σ-query σs,κ(s1, s2o, s3, .., sn)φ R can be translated as follows:
SELECT s
CLUSTER κ s1, OPEN s2, s3, .., sn
FROM R
WHERE φ
5 An Example of Multi-Sensor Data Fusion
In this section, ΣQL will be illustrated with a query that uses heterogeneous data from two different sensors -- a laser radar and a video camera. An example of a laser radar image is given in Figure 2. This image shows a parking lot with a large number of cars, which look like rectangles when viewed from the top. The only moving car in the image has a north-south orientation while all others have an east-west orientation. Laser radar images are characterized by being three-dimensional and having geometric properties, that is, each image point is represented by x-, y- and z-coordinate values. The particular laser radar used here is a product of SAAB Dynamics of Sweden; it is helicopter-borne and generates image elements from a laser beam that is split into short pulses by a rotating mirror. The laser pulses are transmitted to the ground in a scanning movement, and when reflected back to the platform a receiver collects the returning pulses, which are stored and analyzed. The result of the analysis is a sequence of points with a resolution of about 0.3 m. The video camera is carried by the helicopter as well, and the two sensors observe the same area. This means that most cars in the parking lot can be seen by both sensors. The moving car shown in two video frames in Figure 3 is encircled. Figure 4 shows two symbolic images corresponding to the two video frames in Figure 3. Almost identical projection strings can be generated from the laser radar image. Basically the query can be formulated as follows. Suppose we are interested in finding moving objects along a flight path. This can be done by analyzing only the video frames, but that may require too much computation time, and the problem cannot then be solved in real time. Laser radar images can, however, be used to recognize vehicles in real time, as has been shown by Jungert et al. in [9, 10]. However, it cannot be determined from the laser radar images whether the vehicles are moving. The solution is to analyze the laser radar image to first find existing vehicles, determine their positions in a second step, and then verify whether they are moving from a small number of video frames. Finally, in the fusion process, it can be determined which of the vehicles are moving.
Subquery 1: Are there any moving objects in the video sequence in [t1, t2]?
Q1 = σmotion(moving) σtype(vehicle) σxy,interval_cutting(*) σt(To) T mod 10 = 0 and T>t1 and T<t2
Subquery 2: Are there any vehicles in the laser radar image in [t1, t2]?
Q2 = σtype(vehicle) σxyz,interval_cutting(*) σt(To) T>t1 and T<t2
Fig. 2. A laser radar image of a parking lot with a moving car (encircled)
Fig. 3. Two video frames showing a moving white vehicle (encircled) in the parking lot
Fig. 4. Symbolic images of two video frames of the moving car and its close neighbors
The subquery Q1 selects the video source and opens the video frames. The selection of video frames includes some conditions to specify which frames to accept and in what time interval. In this case we have chosen every tenth video frame within the interval [t1, t2]. In the next selection the σxy-operator is applied to the video frames using the interval cutting mechanism [5, 6]. This operator generates the (u,v)-strings from which the object types are determined by the σtype-operator in the selected frames. Finally the vehicles in motion are determined by the application of the motion operator. The motion string (m) is generated from the time projection string:
t: (u: a0s < a1s a2s < a0e < a1e a2e < a3s a4s a5s < a3e a4e a5e , v: a0s < a0e < a1s a3s < a1e a3e < a4s < a4e < a2s a5s < a2e a5e) < (u: ... , v: ...) < … (u: a1s a2s < a1e a2e < a0s < a0e < a3s a4s a5s < a3e a4e a5e , v: a0s < a1s a3s < a1e a3e < a0e < a4s < a4e < a2s a5s < a2e a5e) < (u: ... , v: ...) < ...
From this string the motion string is generated by the application of a σ-operator which generates a string similar to an implicit merge_or-operation, i.e.:
m: t: (u: a0s < a0e < a´0s < a´0e , v: a0s < a0e < a´0s < a´0e)
The subquery Q2 first returns the (u,v)-strings for the time interval [t1, t2]. An intermediate result of the subquery thus becomes:
u: a1s a2s < a1e a2e < a0s < a0e < a3s a4s a5s < a3e a4e a5e , v: a1s a3s < a0s < a1e a3e < a4s < a4e < a2s a5s < a0e < a2e a5e
After applying the σxyz-operator, existing vehicles are detected by applying the σtype-operator. The subqueries can now be fused. A fusion operator φ, called merge_and, is designed, which performs fusion of vehicle information with respect to equality of vehicle type and position (x, y) in time t. Its input comes from multiple data sources of equal type. All object types are consequently equal in both subqueries. Fusion is thus applied such that only those objects selected from the two subqueries that can be associated with each other remain in the motion object string, which gives only a0 in response. The complete query now looks like:
φxyt merge-and(*) (σmotion(moving) σtype(vehicle) σxy,interval_cutting(*) σt(To) T mod 10 = 0 and T>t1 and T<t2 σmedia_sources(videoo) media_sources, σtype(vehicle) σxyz,interval_cutting(*) σt(To) T>t1 and T<t2 …)
In ΣQL syntax, the video branch of the fused query reads:
SELECT motion CLUSTER moving FROM
SELECT type CLUSTER vehicle FROM
SELECT x,y CLUSTER interval * FROM
SELECT t CLUSTER OPEN (* ALIAS T) FROM
SELECT media_sources CLUSTER OPEN video FROM media_sources
WHERE T mod 10 = 0 AND T > t1 AND T < t2
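A much-simplified sketch of the merge_and idea (keeping only those detections that both sources report with the same type at approximately the same position and time) is shown below; the record layout and tolerances are our own assumptions, not part of ΣQL.

```java
import java.util.*;

/** Sketch of merge_and fusion: keep only objects that appear in BOTH sources
 *  with the same type and (approximately) the same position at time t. */
public class MergeAndFusion {
    record Detection(String type, double x, double y, double t) {}

    static List<Detection> mergeAnd(List<Detection> video, List<Detection> laser,
                                    double positionTolerance, double timeTolerance) {
        List<Detection> fused = new ArrayList<>();
        for (Detection v : video) {
            for (Detection l : laser) {
                boolean sameType = v.type().equals(l.type());
                boolean samePlace = Math.hypot(v.x() - l.x(), v.y() - l.y()) <= positionTolerance;
                boolean sameTime = Math.abs(v.t() - l.t()) <= timeTolerance;
                if (sameType && samePlace && sameTime) {
                    fused.add(v);   // the video detection is confirmed by the laser radar
                    break;
                }
            }
        }
        return fused;
    }
}
```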
6 Multimedia Source Representation and ΣQL Query Execution
In the previous sections we have described a data source as a simple projection string. However, in general, in order to describe data sources we need a more complex data structure. In this section we describe an extension of the MSS model proposed in [4] for the description of multimedia data. A multimedia source description is composed of a hierarchy of entities. Each entity has the format [name, type, description1, .., descriptionk], where the name is the entity identifier, the type is the entity type, and each description is a triple ((d1..dm): {e1, e2, .., en}: rel_expr) with m ≥ 1 and n ≥ 0 containing (a) a list of dimensions di according to which the entity is being clustered; (b) a set of component entity identifiers resulting from the clustering; and (c) a relational expression where relations (depending on the dimensions) are used to relate the component entities. A description is legal if n = 0 or if a clustering mechanism able to derive the description from the source is available. In the case of n = 0, the entity is an atom with respect to the description dimension and the relational expression reduces to a simple value. Depending on the chosen description type (and consequently on the associated clustering mechanism) a source can be seen as a temporal sequence of entities, or a spatial disposition of entities, or as a set of attribute-value pairs, etc. As an example, let us consider a video clip segment showing two trees and two walking persons. The video clip is clustered according to the time dimension into three consecutive frames as shown in Figure 5. The video clip represents Cathy (c) and Bill (b) moving east, and two trees (a). The following entities constitute the MSS representation of the video clip:
[R, video, (τ: {R1, R2, R3}: (t: R1 < R2 < R3))]
[R1, frame, (τ: {}: t1), ((x, y): {a, b, c}: ((u: a=b < c < a), (v: a < b < a=c)))]
[R2, frame, (τ: {}: t2), ((x, y): {a, b, c}: ((u: a < b < a=c), (v: a < b < a=c)))]
[R3, frame, (τ: {}: t3), ((x, y): {a, b}: ((u: a < < a=b), (v: a < b < a)))]
[a, plant, (object: {}: {(name: tree)})]
[b, person, (object: {}: {(name: Bill)})]
[c, person, (object: {}: {(name: Cathy)})]
Fig. 5. A video clip segment of three frames R1, R2, R3
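For illustration only, the entity format could be mapped onto a simple in-memory structure such as the following; the names are invented and the relational expression is kept as an uninterpreted string, so this is not part of the MSS proposal in [4].

```java
import java.util.*;

/** Sketch of an MSS entity: a name, a type, and a list of descriptions,
 *  each clustering the entity along some dimensions into components
 *  related by a relational expression. */
public class MssEntity {
    static class Description {
        List<String> dimensions;     // e.g., ["t"] or ["x", "y"]
        List<String> components;     // component entity identifiers, may be empty
        String relationalExpression; // e.g., "(t: R1 < R2 < R3)" or an atomic value

        Description(List<String> dims, List<String> comps, String relExpr) {
            this.dimensions = dims;
            this.components = comps;
            this.relationalExpression = relExpr;
        }
    }

    final String name;
    final String type;
    final List<Description> descriptions = new ArrayList<>();

    MssEntity(String name, String type) {
        this.name = name;
        this.type = type;
    }

    public static void main(String[] args) {
        // the video clip R of Fig. 5, clustered along the time dimension
        MssEntity r = new MssEntity("R", "video");
        r.descriptions.add(new Description(
                Arrays.asList("t"),
                Arrays.asList("R1", "R2", "R3"),
                "(t: R1 < R2 < R3)"));
        System.out.println(r.name + " has " + r.descriptions.size() + " description(s)");
    }
}
```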
The phases needed to process a ΣQL query are as follows:
1. Lexical analysis and syntactic analysis.
2. Transformational analysis to validate semantic correctness, where the following actions are taken: a) confirm the compatibility between the dimensions and the sources in each sub-query; b) check if the overall query is consistent -- i.e., there exist representations for the intermediate results that make the query execution feasible -- and if so, keep track of the representations (intermediate result type inference); c) for each intermediate result, select a feasible representation (possibly with help from the user); d) if the main source is not structured, then build the MSS schema, else check if there exists a query engine that supports the querying of the structured source.
3. Query optimization and query execution: If the main source is not structured, then populate the MSS structure according to the schema defined during the semantic analysis by extracting the appropriate information from the source (the semantic validation implies that the algorithms to extract the information from the source are available); execute the query against the MSS; present the results. Otherwise, send the query to the database query engine and present the results.
7 Discussion
As explained in previous sections, ΣQL can express both spatial and temporal constraints individually using the SELECT/CLUSTER construct and nested subqueries. Its limitation seems to be that constraints simultaneously involving space and time cannot be easily expressed, unless embedded in the WHERE clause. Although such constraints may be rare in practical applications, further investigation is needed in order to deal with such complex constraints. ΣQL can be applied directly to a video sequence such as the one shown in Figure 1, by visually selecting and clustering (decomposing) the 2D (or 3D, 4D, etc.) space. Applications to distance learning, remote sensing, etc. can be explored. Another important application of ΣQL, as suggested in Section 6, is to facilitate user-system interaction in selecting feasible representations in the transformational analysis of a σ-query for visual reasoning.
References
1. Ahanger, G., Benson, D. and Little, T.D., "Video Query Formulation", Proceedings of Storage and Retrieval for Images and Video Databases II, San Jose, February 1995, SPIE, pp. 280-291.
2. Chan, E. P. F. and Zhu, R., "QL/G - A query language for geometric data bases", Proc. of the 1st Int. Conf. on GIS in Urban Regional and Env. Planning, Samos, Greece, April.
3. Chang, S. F., Chen, W., Meng, H. J., Sundaram, H., and Zhong, D., "VideoQ: An automated Content Based Video Search System using Visual Cues", Proceedings of the Fifth ACM International Multimedia Conference, November 1997.
4. Chang, S. K., "Content-Based Access to Multimedia Information", Proceedings of Aizu International Student Forum-Contest on Multimedia (N. Mirenkov and A. Vazhenin, eds.), The University of Aizu, Aizu, Japan, Jul 20-24, 1998, pp. 2-41. (Available at www.cs.pitt.edu/~chang/365/cbam7.html)
5. Chang, S.-K. and Jungert, E., Symbolic Projection for Image Information Retrieval and Spatial Reasoning, Academic Press, 1996.
6. Chang, S.-K. and Jungert, E., Pictorial data management based upon the theory of Symbolic Projection, Journal of Visual Languages and Computing, vol. 2, no. 3, 1990, pp. 195-215.
7. Chang, S.-K., Shi, Q. Y. and Yan, C. W., Iconic indexing by 2D strings, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 9, No. 3, pp. 413-428, 1987.
8. Egenhofer, M., "Spatial SQL: A Query and Presentation Language", IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 2, 1991, pp. 161-174.
9. Holden, R., "Digital's DB Integrator: a commercial multi-database management system", Proc. of the 3rd Int. Conf. on Parallel and Distributed Information Systems, Austin, TX, USA, Sept. 28-30, 1994, IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 267-268.
10. Jungert, E., Carlsson, C., and Leuhusen, C., "A Qualitative Matching Technique for Handling Uncertainties in Laser Radar Images", Proceedings of the SPIE conference on Automatic Target Recognition VIII, Orlando, Florida, April 13-17, 1998, pp. 62-71.
11. Lee, S.-Y. and Hsu, F.-S., Spatial reasoning and similarity retrieval of images using 2D C-string knowledge representation, Pattern Recognition, vol. 25, 1992, pp. 305-318.
12. Lin, C. C., Chang, S. K. and Xiang, J. X., Transformation and Exchange of Multimedia Objects in Distributed Multimedia Systems, ACM Journal of Multimedia Systems, Springer-Verlag, Vol. 4, Issue 1, 1996, pp. 12-29.
13. Li, J. Z., Ozsu, M. T., Szafron, D. and Oria, V., "MOQL: A Multimedia Object Query Language", Proc. 3rd Int. Workshop on Multimedia Information Systems, Como, Italy, Sept. 1997, pp. 19-28.
14. Oomoto, E. and Tanaka, K., Video Database Systems - Recent Trends in Research and Development Activities, in Handbook of Multimedia Information Management (Grosky, W. I., Jain, R. and Mehrotra, R., eds.), Prentice Hall, 1997, pp. 405-448.
15. Roussopoulos, N., Faloutsos, C. and Sellis, T., "An Efficient Pictorial Database System for PSQL", IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 639-650.
16. Waltz, E. and Llinas, J., Multisensor Data Fusion, Artech House, Boston, 1990.
General Image Database Model
Peter L. Stanchev
Institute of Mathematics and Computer Science, Bulgarian Academy of Sciences, Acad. G. Bonchev St. 8, 1113 Sofia, Bulgaria, [email protected]
Abstract. In this paper we propose a new General Image DataBase (GIDB) model. The model establishes a taxonomy based on the systematisation of existing approaches. The GIDB model is based on the General Image Data model [1] and the General Image Retrieval model [2]. The GIDB model uses the powerful features offered by object-oriented modelling, the elegance of relational databases, the state of the art in computer vision, and current methods for knowledge representation and management to achieve effective image retrieval. The developed language for the model is a hybrid between interactive and descriptive query languages. The ideas of the model can be used in the design of image retrieval libraries for an object-oriented database. As an illustration, the results of applying the GIDB model to a plant database in the Sofia Image Database Management System are presented.
1 Introduction
Image databases are becoming an important element of the emerging information technologies. They have been used in a wide variety of applications such as: geographical information systems, computer-aided design and manufacturing systems, multimedia libraries, medical image management systems, automated catalogues in museums, biology, geology, mineralogy, astronomy, botany, house furnishing design, anatomy, criminal identification, etc. They are also becoming an essential part of most multimedia databases. The first survey of image databases appeared in the early 1980s, by Tamura and Yokoya [3]. They classify image database systems into three categories: conventional databases, conventional databases with extended functions for image processing, and specialised systems designed for a particular application domain. Grosky and Mehrotra [4], on the other hand, classify image databases into three categories: systems using relational databases, systems based on an object-oriented model, and systems for image interpretation. Grosky elaborated the ideas of image databases to multimedia databases [5]. There are mainly five approaches towards image database system architecture:
(1) Conventional database system as an image database system. The use of a conventional database system as an image database system is based mainly on relational data models and rarely on hierarchical ones. The images are indexed as a set of attributes. At the time of the query, instead of retrieving by asking for information straight from the images, the information is extracted from previously calculated image attributes. Languages such as Structured Query Language (SQL) and Query By Example (QBE), with modifications such as Query by Pictorial Example (QPE), are common for such systems. This type of retrieval is referred to as attribute-based image retrieval. A representative prototype system from this class of systems is the system GRIM_DBMS [6].
(2) Image processing/graphical systems with database functionality. In these systems topological, vector and graphical representations of the images are stored in the database. The query is usually based on a command-based language. A representative of this model is the research system SAND [7].
(3) Extended/extensible conventional database system as an image database system. The systems in this class are extensions of the relational data model to overcome the limitations imposed by the flat tabular structure of relational databases. The retrieval strategy is the same as in the conventional database system. One of the research systems in this direction is the system GIS [8].
(4) Adaptive image database system. The framework of such a system is a flexible query specification interface to account for the different interpretations of images. An attempt at defining such systems is made in [9].
(5) Miscellaneous systems/approaches. Various other approaches are used for building image databases, such as: grammar-based and 2-D string-based approaches, entity-attribute-relationship semantic network approaches, matching algorithms, etc.
In this paper a new General Image DataBase (GIDB) model is presented. It includes descriptions of: (1) an image database system; (2) a generic image database architecture; (3) image definition, storage and manipulation languages.
2 The GIDB Model Description
In this section we start with some definitions.
Definition 1. A Data Model is a type of data abstraction that is used to provide a conceptual representation of the data. It is a set of concepts that can be used to describe the structure of a database. By the structure of a database we mean the data types, relationships, and constraints that should hold on the data. It can also include a set of operations for database retrieval and update [10].
Definition 2. An Image Database (IDB) is a logically coherent collection of images with some inherent meaning. The images usually belong to a specific application domain. An IDB is designed, built, and populated with images for a specific purpose and represents some aspects of the real world.
Definition 3. An Image Database Management System (IDBMS) is a collection of programs that enable the user to define, construct and manipulate an IDB for various applications. An image definition involves specifying the characteristics of
the application domain, the image indexing mechanism, the image-object recognition mechanism, and the information about the images that will be extracted and stored together with the images. Image database construction is the process of storing the images themselves on some storage media, together with the logical image description. Image database manipulation includes functions such as querying the database for retrieval of a specific image and updating the image database to reflect changes of the images in the real world. The user could also create his own set of programs and bind them into an image database system.
Definition 4. An Image Database System (IDBS) is constituted from an IDB and an IDBMS. The main differences from a conventional database system environment are: (1) the existence of tools for image database definition, including tools for image indexing and image-object recognition, and (2) the existence of image processing procedures.
Definition 5. The way that the users think about data is called the external view level, the way that the data are recognised by the database system is called the internal or physical level, and the middle layer is called the conceptual level.
2.1 The Generic Architecture of an Image Database
The architecture of a generic image database system is given in Fig. 1. Three phases of interaction with the system are provided: domain definition, image entering and image retrieval. In order to introduce new application areas into the system, the administrator uses the domain definition phase. In the second phase the images are entered into the system. The third phase is image retrieval, in which the end-users use the system for posing queries and viewing the image features.
Phase | Input | Process | Result
1. Domain definition a. logical description | Logical Image Definition Language (LIDL) | LIDL processor | Procedure for image indexing
b. physical description | Physical Image Definition Language (PIDL) | PIDL processor | Procedure for physical image storage
2. Image entering a. input the image and image information | images | Image Storage Language (ISL) processor | Logical and physical IDB
b. image updating | ISL updating tools | ISL processor | Logical and physical IDB
c. image deletion | ISL deletion tools | ISL processor | Logical and physical IDB
3. Image retrieval a. image display | Image Manipulation Language (IML) | Query processor | Images
b. logical image display | IML | Query processor & Statistical processor | Semantic data, Statistical data
Fig. 1. Generic architecture of an IDBS
2.2 Image Data Model Description
The proposed Image Data model establishes a taxonomy based on the systematisation of the existing approaches. The proposed approach to image modelling includes:
• a language approach, where language structures are used for physical and logical image content description;
• an object-oriented approach, where the image and the image objects are treated as objects containing the appropriate functions for calculating their features.
The data model is object oriented. The image itself, together with its semantic descriptions, is treated as an object in terms of the object-oriented approach. The image is presented in two layouts (classes) - logical and physical. A semantic schema of the proposed model is shown in Fig. 2.
Fig. 2. Semantic schema of the GID model
2.3 Image Retrieval
The retrieval model is unique in the sense of its comprehensive coverage of the image features. The main characteristics of the proposed model could be summarised as follows:
a.) The images are searched by their general image description model representation [1];
b.) The model is based on similarity retrieval. Let a query be converted through the general image data model into an image description Q(q1, q2, …, qn), and let an image in the image database have the description I(x1, x2, …, xn). Then the retrieval value (RV) between Q and I is defined as:
RVQ(I) = Σi = 1, …, n (wi * sim(qi, xi)),
where wi (i = 1, 2, …, n) is the weight specifying the importance of the ith parameter in the image description, and sim(qi, xi) is the similarity between the ith parameter of the query image and the database image, calculated in different ways according to the qi and xi values. These can be symbolic, numerical or linguistic values, histograms, attribute relational graphs, pictures, or spatial representations.
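A small sketch of the retrieval value computation is given below, assuming per-parameter similarity functions that return values in [0, 1]; the numeric similarity shown is just one plausible choice, not the one prescribed by the GIR model [2].

```java
import java.util.List;

/** Sketch of the weighted similarity score RV_Q(I) = sum_i w_i * sim(q_i, x_i).
 *  The per-parameter similarity functions are stand-ins; in the GIDB model they
 *  depend on the parameter type (numeric value, histogram, graph, ...). */
public class RetrievalValue {
    /** A single query/image parameter pair with its own similarity measure. */
    interface ParameterSimilarity {
        double similarity();   // expected to return a value in [0, 1]
    }

    /** Example similarity for numeric parameters, normalised by a value range. */
    static ParameterSimilarity numeric(double q, double x, double range) {
        return () -> 1.0 - Math.min(1.0, Math.abs(q - x) / range);
    }

    /** Computes RV_Q(I) for one database image. */
    static double retrievalValue(List<Double> weights, List<ParameterSimilarity> params) {
        double rv = 0.0;
        for (int i = 0; i < params.size(); i++) {
            rv += weights.get(i) * params.get(i).similarity();
        }
        return rv;
    }

    public static void main(String[] args) {
        // two illustrative parameters: a colour-related feature and a size-related feature
        List<Double> weights = List.of(0.7, 0.3);
        List<ParameterSimilarity> params = List.of(
                numeric(0.8, 0.6, 1.0),
                numeric(120, 100, 200));
        System.out.println(retrievalValue(weights, params));  // 0.7*0.8 + 0.3*0.9 = 0.83
    }
}
```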
STEP | FORMAT | PROCESS
1. Entering schema | (IDB name, entering media, file format) | reading from an outside source into memory
2. Editing schema | (general parameters, method1, method2, ..., methode) | manipulation, transform, spatial filter, histograms, and morphological filter
3. Global view obtaining | (meta attributes = name1: type1, name2: type2, ..., namema: typema; semantic attributes = name1: type1, name2: type2, ..., namesa: typesa) | procedure for meta and semantic attribute definition
4. General purpose view obtaining | (colour = method1, method2, ..., methodk1; texture = method1, method2, ..., methodk2) | procedure for colour and texture definition
5. Segmentation & object definition | (method1, method2, ..., methods) | procedure for image segmentation and object definition
6. Relation definition schema | (method1, method2, ..., methodr) | procedure for relation definition
Fig. 3. The steps in the Logical Image Definition Language
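As an illustration of the retrieval value defined in Section 2.3, the following sketch computes RVQ(I) as a weighted sum of per-parameter similarities. The attribute names, the per-type similarity functions and the example descriptions are illustrative assumptions and are not part of the GIDB specification.

```python
# Illustrative sketch of the retrieval value RV_Q(I) = sum_i w_i * sim(q_i, x_i).
# The similarity functions below are hypothetical stand-ins for the model's
# type-dependent comparisons (numeric values, symbols, histograms, ...).

def sim_numeric(q, x, scale=1.0):
    """Similarity in (0, 1] that decreases with the absolute difference."""
    return 1.0 / (1.0 + abs(q - x) / scale)

def sim_symbol(q, x):
    """Exact-match similarity for symbolic or linguistic values."""
    return 1.0 if q == x else 0.0

def sim_histogram(q, x):
    """Histogram intersection, assuming both histograms are normalised."""
    return sum(min(a, b) for a, b in zip(q, x))

def retrieval_value(query, image, weights, sims):
    """Weighted sum of per-parameter similarities between query and image."""
    return sum(w * sims[name](query[name], image[name])
               for name, w in weights.items())

# Toy image descriptions (purely illustrative attribute names and values).
query = {"colour_hist": [0.5, 0.3, 0.2], "subject": "monument", "objects": 3}
image = {"colour_hist": [0.4, 0.4, 0.2], "subject": "monument", "objects": 5}
weights = {"colour_hist": 0.5, "subject": 0.3, "objects": 0.2}
sims = {"colour_hist": sim_histogram, "subject": sim_symbol, "objects": sim_numeric}

print(retrieval_value(query, image, weights, sims))  # higher means more similar
```

In this sketch the database images would simply be ranked by decreasing retrieval value with respect to the query description.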
Physical Image Definition Language. The functions for physical image storage are given in Fig. 4.
STEP | FORMAT | PROCESS
1. Physical image storage schema | (method1, method2, ..., methodn1) | procedure for physical image storage
2. Logical image storage schema | (method1, method2, ..., methodn2) | procedure for logical image storage
3. Indexing mechanism schema | (method1, method2, ..., methodn3) | procedure for index creation
4. "Thumbnail" image storage schema | (method1, method2, ..., methodn4) | procedure for "thumbnail" image storage
Fig. 4. The steps in the Physical Image Definition Language
2.4.2 Image Storage Language
The ISL contains three parts: (1) an image entering language, (2) an image updating language and (3) an image deletion language. Image updating and deletion are seldom used and are not typical for image databases. For all these operations a specific interactive environment has to be created. Image processing and measurement functions are available in this language to assist the user.
2.4.3 Image Manipulation Language
The Image Manipulation Language includes retrieval by attribute value, shape, colour, texture, example image or spatial constraint. The query is translated into a GID model representation and then the GIR model is used to retrieve the desired images. The retrieval method is described in more detail in [2].
3 An Example of Applying the GIDB Model
The manipulation capabilities of the GIDB model are illustrated on drawings and pictures from a plant image database realised in the Sofia Image Database Management System.
3.1 Image Definition Language
The first level of interaction is the domain definition. The definition of the image application area is given in Fig. 5.
Fig. 5. An example of the Logical Image Definition Language
3.2 Image Storage Language and Image Manipulation Language
Let a plant image be entered into the image database. The GID description and the image itself are stored in the image database. The Image Manipulation Language is described in [2]. An example of a query result is given in Fig. 6.
Fig. 6. An example of a query result
4 Conclusions
The main advantages of the proposed model can be summarised as follows: (1) Its generality. The image representation is done through the general image data model and the image retrieval is based on the general image retrieval model; therefore the model is applicable to a wide variety of image collections. (2) Its practical applicability. There are numerous methods for the decomposition of an image into objects and for image indexing; the appropriate method can be chosen according to the application domain. (3) Its flexibility. The model can be customised when used for a specific application. The presented GIDB model could be extended to distributed IDBSs and to multimedia databases containing text, video and speech signals. At present, a software realisation of the model for Windows NT is being considered in the Sofia Image Database Management System.
Acknowledgement This work is partially supported by project VRP I 1/99 of the National Foundation for Science Research of Bulgaria and by the European Commission INCO-Copernicus project INTELLECT (PL961099).
References
1. Stanchev, P.: General Image Data Model. In: 22nd International Conference Information Technologies and Programming, Sofia (1997) 130-140
2. Stanchev, P.: General Image Retrieval Model. In: Proceedings of the Conference of the Union of Bulgarian Mathematicians, Pleven (1998) 63-71
3. Tamura, H., Yokoya, N.: Image Database Systems: A Survey. Pattern Recognition, Vol. 17, No. 1 (1984) 29-43
4. Grosky, W., Mehrotra, R.: Image Database Management. IEEE Computer, 22 (1989) 7-8
5. Grosky, W.: Multimedia Information Systems. IEEE MultiMedia, Spring (1994) 12-23
6. Stanchev, P., Rabitti, F.: GRIM_DBMS: A GRaphical IMage DataBase Management System. In: Kunii, T. (ed.): Visual Database Systems, North-Holland (1989) 415-430
7. Aref, W., Samet, H.: Extending a DBMS with Spatial Operations. In: Second Symposium on Large Spatial Databases, Zurich, Switzerland, ETH Zurich (1991) 299-318
8. Stanchev, P., Smeulders, A., Groen, F.: Retrieval from a Geographical Information System. In: Computing Science in the Netherlands, Amsterdam, The Netherlands, Stichting Mathematisch Centrum (1991) 528-539
9. Gudivada, V., Raghavan, V.: Picture Retrieval Systems: A Unified Perspective and Research Issues. Tech. Report CS-95-03 (1995)
10. Elmasri, R., Navathe, S.: Fundamentals of Database Systems. The Benjamin/Cummings Publishing Company (1994)
System for Medical Image Retrieval The MIMS Model Richard Chbeir, Youssef Amghar, and Andre Flory LISI - INSA 20 Avenue A. Einstein, F-69621 Villeurbanne - France Tel: (+33) 4 72438595, Fax: (+33) 4 72438597 {rchbeir,amghar,flory}@lisiflory.insa-lyon.fr
Abstract. The multifaceted description of image data raises several problems for traditional information systems designed for textual data. Image databases require efficient and direct spatial search based on image objects and their relationships, rather than cumbersome alphanumeric encodings of the images. Current systems generally approach the problem from a single point of view and therefore describe one image facet, dependent on the application domain. This paper presents a model for a Medical Image Management System (MIMS) that allows physicians to retrieve images and related information by combining the strengths of several approaches. Via a user-friendly iconic interface, our system assigns a graphical representation to each image object in order to describe image objects and their attributes. It also automatically calculates their interrelations. Keywords: Database management, image data, image indexing and interrogation, multimedia, hypermedia, thesaurus, spatial relations.
1 Introduction
The use of multimedia information in areas such as defence and civilian satellites, spatial data, CAD data, home entertainment and medical systems has demonstrated its utility. Manipulating media such as text, graphics, images and voice requires adequate and specialised management in terms of storage, transmission and manipulation. Traditional systems have been conceived to deal with textual documents; they are not adapted to handling complex objects such as those of multimedia applications. Current efforts have focused on retrieving media using only traditional artificial keys, such as identification numbers [6]. This obviously limits the querying power for any research domain. Several approaches have been proposed to solve this problem, but each of them manipulates only one or a few media aspects, such as the graphical aspect ([4]) or the semantic one ([2,1,7]). This paper presents a model for a medical information system, using medical images as the medium, whose main objective is to provide a general approach and to optimise the retrieval process.
In the following section, the description of our data model follows a more detailed discussion of the image representation problem and of the medical information problem. Section 3 concludes our current work and outlines future work.
2 Medical Image Description
Image description is highly person-dependent. It differs from one person to another and is related to the context in which it is placed. Several parameters are taken into consideration: image objects, image context, personal culture, application domain, etc. This partially reflects the difficulties encountered in both the indexing and the retrieval processes. The medical domain is particular and requires various parameters. An example of a medical search clarifies this point: "Select all images for patients older than 50 years, having a tumor on the lungs". It is obvious that a single-facet image description is too restrictive and unable to satisfy physicians' needs. In fact, search criteria in a medical information system rely not only on image context (patients older than 50 years), but also on semantic content (tumor on the lungs). It is therefore essential to mix approaches into a hybrid system in order to satisfy medical needs. Figure 1 describes the general conceptual schema explained in the next sections.
2.1 Image Context
The medical history of each patient is represented by an Electronic Medical Folder (EMF) [5]. To verify a diagnosis, the physician needs to retrieve all images of a specific patient; "Select all images of Mr. Robin" is an obvious example. Each medical image is also related to a specific domain (radiology, dermatology, etc.). In addition, we take into consideration several image attributes such as incidence (axial, coronal, sagittal, etc.), type (JPEG, GIF, etc.), device (scanner, MRI, radiography, etc.) and date.
2.2 Image Description
To specify the medical content of the image, three kinds of semantic objects must be described:
– Anatomic Organ (AO¹): corresponds to the organ that the image presents, such as the brain, a hand, etc.
– Pathologic Sign (PS): identified and detected by the physician, it represents a medical anomaly, for instance a tumor or a fracture.
– Anatomic Region (AR): represents the internal structure of the AO, for example the left ventricle or the right lobe.
Each medical image corresponds to an AO in which a set of anomalies is positioned on anatomic regions. Each anatomic region and anomaly is represented by a set of descriptors in order to outline its state. A descriptor may be a text, a numeric value or a date. A dangerous tumor, a swollen lung and a dehydrated lobe are examples of such medical states.
¹ These abbreviations will be used later.
Fig. 1. The general conceptual schema

2.3 Incertitude Description
Incertitude is inherent to the image because of the description process, which may be automatic or manual. It can occur at all description levels (AO, AR and PS). Because the semantic aspect is so important in a medical information system, it is important to take this possibility into consideration. This is the case, for example, when the physician is not sure of the diagnosis and hesitates between two anomalies. Our approach provides the possibility of describing each non-certain anomaly by one or several PSs, each associated with an incertitude degree. The incertitude degree is a decimal number bounded by the interval [0,1]. Whenever the physician is sure of the anomaly, the incertitude degree is evaluated as 1.
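A minimal sketch of how such an uncertain finding could be recorded is given below; the class and field names are hypothetical and the representation is ours, not the one prescribed by MIMS.

```python
# Hypothetical sketch: an anomaly described by several candidate Pathologic
# Signs, each carrying an incertitude degree in [0, 1] (1 = the physician is sure).
from dataclasses import dataclass, field
from typing import List

@dataclass
class CandidateSign:
    pathologic_sign: str          # e.g. "tumor", "fracture"
    incertitude_degree: float     # in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.incertitude_degree <= 1.0:
            raise ValueError("incertitude degree must lie in [0, 1]")

@dataclass
class Anomaly:
    anatomic_region: str                      # e.g. "right lobe"
    candidates: List[CandidateSign] = field(default_factory=list)

# The physician hesitates between two possible signs for the same anomaly.
finding = Anomaly("right lobe",
                  [CandidateSign("tumor", 0.7), CandidateSign("cyst", 0.3)])
best = max(finding.candidates, key=lambda c: c.incertitude_degree)
print(best.pathologic_sign)  # "tumor"
```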
2.4 The Thesaurus
This component supports all domain-specific information that must be maintained for each specific application. It represents a dictionary in which terms
are interrelated. As mentioned before, a medical image represents a set of Pathologic Signs (PS) in Anatomic Regions (AR) of the concerned Anatomic Organ (AO). We propose a hierarchical thesaurus² built upon these concepts (PS, AR and AO) to normalise the indexing process and to make both the indexing and querying processes more coherent, easier and more efficient. The use of the thesaurus offers the opportunity to combine the two traditional methods: the keyword method and the legend method.
1. The thesaurus relations: The thesaurus terms in MIMS are interrelated by several semantic relations (a small illustrative sketch of these relations follows this list):
– The synonymy relation: relates a non-descriptor term³ to a descriptor one. The medical vocabulary is so complex that equivalent concepts may differ from one domain to another. For instance, a tumor is a synonym of a cancer.
– The hierarchy relation: defines a partial order between descriptors in order to classify medical terms and extend queries. It covers the generalisation/specialisation relation (anomaly and tumor) and/or the composition relation (head and brain).
– The compatibility relation: connects AR to PS. This relation is used essentially to sort PSs with regard to compatible ARs. For instance, in the left ventricle only a tumor can be found.
2. Thesaurus graphical representation: As mentioned in the first section, it is important, when dealing with non-computer specialists, to use user-friendly interfaces. A graphical iconic interface is proposed in this framework to index and query the medical image database, and to manipulate thesaurus components (Figure 2).
– Each AO is attached to a pattern called Organ that the physician has to choose at the beginning to represent the analysed image.
– The internal structure of the selected organ, comprising its anatomic regions, is represented by a set of polygons.
– Each anomaly (or PS) is attached to an icon that can be placed by the physician to indicate the anomaly location.
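The following is a small, hypothetical sketch of the three thesaurus relations and of how they could be used to normalise a term and expand a query; the dictionaries and the medical terms are illustrative and do not reproduce the actual MIMS thesaurus.

```python
# Hypothetical thesaurus sketch: synonymy (non-descriptor -> descriptor),
# hierarchy (descriptor -> more specific descriptors) and compatibility
# (anatomic region -> pathologic signs that may occur there).

SYNONYMS = {"cancer": "tumor"}                       # synonymy relation
HIERARCHY = {"anomaly": ["tumor", "fracture"],       # generalisation/specialisation
             "head": ["brain"]}                      # composition
COMPATIBILITY = {"left ventricle": {"tumor"}}        # AR -> compatible PSs

def normalise(term):
    """Map a non-descriptor term to its descriptor via the synonymy relation."""
    return SYNONYMS.get(term, term)

def expand(term):
    """Expand a descriptor into itself plus all more specific descriptors."""
    terms, stack = set(), [normalise(term)]
    while stack:
        t = stack.pop()
        if t not in terms:
            terms.add(t)
            stack.extend(HIERARCHY.get(t, []))
    return terms

def compatible(region, sign):
    """Check whether a pathologic sign may be attached to an anatomic region."""
    return sign in COMPATIBILITY.get(region, set())

print(expand("anomaly"))                      # {'anomaly', 'fracture', 'tumor'} in some order
print(normalise("cancer"))                    # 'tumor'
print(compatible("left ventricle", "tumor"))  # True
```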
2.5 Spatial Relations
To specify the position of an anomaly in an anatomic region, the notion of spatial relations is used. Euclidean space is used to calculate the different spatial relations in both the directional and the topologic space. We exclude the metric space for two reasons: it has to be handled manually, and it risks being inappropriate in certain medical domains where measurement variability becomes important [3]. Only validated relations are stored in the database, not icon coordinates. This means that scale modification can be done easily because it affects only coordinates and not relations. Two kinds of spatial relations are identified: spatial relations between an AR and its PSs (AR/PS), and spatial relations between the PSs of one AR (PS/PS):
² The thesaurus is a knowledge base used to help physicians formulate their queries.
³ A term semantically near to another one.
Fig. 2. The MIMS Indexing Interface
1. AR/PS relations: Concerning the topologic space, only the inclusion relation ("IN") is considered. Such a relation is created implicitly, together with the AR/PS descriptor, whenever the user puts an icon inside the AR. On the other hand, it is also significant to specify the direction of the anomaly inside the damaged region in order to locate it. By comparing the barycenters of the damaged region and of the anomaly, we can determine the direction of the latter. Several possibilities then arise: left, right, low-right, high, left-high, etc. For that reason, only four relations are taken into account, "high, low, left, right", but each anomaly may always have two directional relations with its damaged region.
2. PS/PS relations: Four topologic relations are taken into consideration in our approach: touch, cover, disjoint, mix. To calculate these topologic relations, the rectangular form of each icon associated with a PS is used. Concerning the directional relations between anomalies, there are only two: High and Left. In fact, the equivalence between directional relations allows us to neglect the implicit ones, i.e. if X is higher than Y, then Y is lower than X and vice versa. The main element used to achieve this task is again the barycenter (a small sketch of these computations is given below).
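The sketch below illustrates the barycenter-based directional test and the rectangle-based topologic test, under the assumption that icons and regions are axis-aligned rectangles; the function names and coordinate conventions are ours, not the MIMS implementation.

```python
# Hypothetical sketch: directional relations from barycenter comparison and
# topologic relations (touch / cover / disjoint / mix) from icon rectangles.

def barycenter(rect):
    """Rectangle given as (x_min, y_min, x_max, y_max); y grows downwards."""
    x0, y0, x1, y1 = rect
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def directional(anomaly_rect, region_rect):
    """At most two of {high, low, left, right}, from barycenter comparison."""
    (ax, ay), (rx, ry) = barycenter(anomaly_rect), barycenter(region_rect)
    relations = []
    if ay < ry: relations.append("high")
    elif ay > ry: relations.append("low")
    if ax < rx: relations.append("left")
    elif ax > rx: relations.append("right")
    return relations

def topologic(a, b):
    """One of touch / cover / disjoint / mix for two axis-aligned rectangles."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    if ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0:
        return "disjoint"
    if ax1 == bx0 or bx1 == ax0 or ay1 == by0 or by1 == ay0:
        return "touch"
    if (ax0 <= bx0 and ay0 <= by0 and ax1 >= bx1 and ay1 >= by1) or \
       (bx0 <= ax0 and by0 <= ay0 and bx1 >= ax1 and by1 >= ay1):
        return "cover"
    return "mix"

region, anomaly = (0, 0, 100, 100), (10, 10, 30, 30)
print(directional(anomaly, region))   # ['high', 'left']
print(topologic(anomaly, region))     # 'cover'
```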
3 Conclusion
This paper describes the current status of our project on a medical image management system. A new data model has been introduced that is able to use image features and to represent all aspects of medical image objects. A further contribution is the independence between model components, which allows our approach to be used in several domains simply by changing the knowledge module. Our current work is the improvement of our model so that it integrates the knowledge base through a hypermedia thesaurus, a user-friendly interface where the user can formulate queries in a very flexible manner, and access to image databases via the Web. The evolution of image content appears to be an attractive direction. In fact, the images of a patient describe the evolution of the medical record. To carry out a diagnosis, the physician sometimes needs to know the chronological sequence of images. Maturity development, disease progression, medical therapy, traumatic events and other examples require the integration of image content evolution into the database. Our future work will be devoted to this issue, in order to make it possible to consult the database using the evolution of medical image content. We also aim to construct an online system able to guide the physician during the insertion and interrogation processes, and the construction of another intelligent database based on image descriptions will be explored. The implementation of our approach in several other domains seems straightforward. To support this claim, we envisage integrating our approach in pictorial databases in order to demonstrate its independence.
References
1. Abad-Mota, S., Kulikowski, C.: Semantic Queries on Image Databases: The IMTKAS Model. In: Proc. of the Basque International Workshop on Information Technology (BIWIT 95), IEEE Computer Society Press, 20-28
2. Ashley, J., Flickner, M., Hafner, J., Lee, D., Niblack, W., Petkovic, D.: The Query By Image Content System. In: Proc. of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD 95), 475
3. Clementini, E., Di Felice, P., van Oosterom, P.: A Small Set of Formal Topological Relationships Suitable for End-User Interaction. In: Advances in Spatial Databases, Proc. of the 3rd International Symposium (SSD 93), Singapore, June 1993, Lecture Notes in Computer Science 692, 277-295
4. Bach, J.R., Paul, S., Jain, R.: A Visual Information Management System for the Interactive Retrieval of Faces. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 4, August 1993
5. Laforest, L., Frénot, S., Flory, A.: A New Approach for Hypermedia Medical Records Management. In: 13th International Congress Medical Informatics Europe (MIE'96), Copenhagen, 1996
6. Mascarini, Ch., Ratib, O., Trayser, G., Ligier, Y., Appel, R.D.: In-house Access to PACS Images and Related Data through the World Wide Web. European Journal of Radiology, Vol. 22, 1996, 218-220
7. Mechkour, M.: EMIR2: An Extended Model for Image Representation and Retrieval. In: Database and Expert System Applications (DEXA'95), London, September 1995, 395-404
An Agent-Based Visualisation Architecture Jonathan Meddes and Eric McKenzie School of Computer Science, Division of Informatics The University of Edinburgh, James Clerk Maxwell Building The King’s Buildings, Mayfield Road, Edinburgh EH9 3JZ Tel: +44 131 650 5129 Fax: +44 131 667 7209 {jmx,ram}@dcs.ed.ac.uk
Abstract. Advances in storage technology have led to a dramatic increase in the volume of data stored on computer systems. Using a generic visualisation framework, we have developed a data classification scheme that creates an object-orientated data model. This data model is used in an agent-based visualisation architecture to address the problems of data management and visual representations. As part of an ongoing research project we have implemented DIME (Distributed Information in a Multi-agent Environment), a visualisation system based on this approach.
1 Introduction
Computer science is catching up with visualisation, and there is a substantial motivation for computer science to make the effort. We strive to make data more interpretable via a visual representation, but visualisation did not start with computers. Bertin's seminal work with graphical representations [1] is still a widely acknowledged guide to good visualisation practice and predates the widespread introduction of bitmapped displays. The increased storage capacity of computer systems has allowed application developers to retain more detailed data. Managers have realised that they can use this data to have more control over their organisation. From a manager's perspective, this has the potential to be an information revolution, and can satisfy their need for enhanced control. Unfortunately, this revolution cannot take place until the information is readily accessible. An effective visualisation uses computer graphics that allow the user to understand data. This paper presents an architecture for a novel approach to visualisation and provides an extensible visualisation system with which to develop our ideas. We have based our research around a generic visualisation framework in which we describe a data classification scheme used to create a data model that accurately classifies the data. The organisation of the data model directly influences the internal operation of our visualisation architecture, which is developed using a technique allowing distributed, communication-orientated visualisation. Finally, we describe the implementation of the visualisation system.
This research is supported by BT Laboratories.
2 Data Classification
Data classification is the process of identifying data items, structures and relationships within a data series. The result is a data model that can be used by the remaining stages of the visualisation pipeline. Automated systems including SAGE [7] and ANDD [6] emphasised that automated visualisation can only be successful if the system understands the data. The accurate classification of data and relationships using a suitably expressive scheme is crucial to a visualisation system capable of creating effective visualisations. Anomalies introduced at this stage are propagated down the visualisation pipeline; this has the potential to introduce spurious visual cues into a visual representation. Before we describe the main features of our data classification scheme, we shall identify the desirable properties it should exhibit. The data classification scheme attempts to create a model that captures the syntax and semantics of the data. The syntax of data is its raw features that define its structural properties, whereas the semantics of data is the implied meaning projected by the organisation of individual data items. The data classification scheme has two goals: (1) consistently encode the data using a set of well-defined data types so that subsequent stages of the pipeline receive the data model in a familiar format; and (2) capture relationships between data items and allow easy extensions to the model by introducing additional data items and relationships. For illustration purposes, we present a subset of our full data classification scheme that is specifically intended for the classification of geographic and topological based data. An application that generates this type of data is the analysis of wide-area-network traffic; cities have a geographic location, network links have a topological structure and traffic can be measured using a quantitative value. The data model created by our scheme uses an object-orientated paradigm; this is a convenient method of organising the data that simplifies extending the data classes. Every data item is represented in our data model by a data object that is an instantiation of a data class. The basic data classes that are available to the data model are nominal, quantitative, spatial and relational. In a nominal data class, each member is a data object that represents a textual item, e.g., name. A quantitative class contains members representing numerical values, e.g., age. Constrained subclasses provide support for more specialist values such as currency and percentage. The spatial data class has members that represent a location in space. In its basic form the spatial data class is abstract and cannot be instantiated, but the creation of subclasses provides support for two-dimensional, three-dimensional and longitudinal/latitudinal co-ordinates. The relational data class provides the data model with a versatile method of representing relationships. Instantiation of this class creates a data object containing references to its associated objects; the data class also contains references to all the classes of objects which are represented in its relations. Subclasses such as the Complex Data Class (CDC) provide increased specialisation by restricting the types of relations which can be made. The CDC defines a signature consisting of a collection of data classes. Each instantiation creates a Complex
Data Object (CDO); all instantiations must have exactly the same combination of data classes defining their relationships.

Using these basic data classes, a data model that accurately represents the input data is constructed using the data classification scheme. Figure 1 shows an external relational database table (a) translated into the classes and objects of the internal data model (b). In the table, every cell contains a data item, every item from a column is from a similar domain and the data items from a row constitute a tuple. In the data model, a data class captures the domain and is instantiated to describe its data items. A CDC is defined to represent the association between columns in the table. Its instantiations create CDOs that capture the association between the individual data items of a tuple.

Fig. 1. (a) A relational table and (b) its data model

The data contained in the data model is merely a collection of data values and relations. In the absence of further detail describing the contents of the data model, it is meaningless as a self-contained information source. Data is transformed into information by the introduction of context. Typically, context is introduced by the user when they extract data from a model by applying environmental influences from their domain knowledge. To move the data model beyond a simple collection of data values and relationships, we introduce domain functions to provide context and replicate the environmental influences that dictate how a user turns data into information. In the data model, a domain function provides a basic service that can manipulate the data by mirroring the simple processing and reasoning a user performs when they interpret some data. For example, if we consider the use and capacity of a network link, a typical user of that data can deduce that the average load of that link is represented by use/capacity; a domain function would introduce new load objects to the data model.
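As a rough sketch of the data classes, the complex data class signature and a domain function, the following code uses hypothetical class names; it is a simplified illustration of the scheme described above, not the DIME implementation.

```python
# Hypothetical sketch of the data model: basic data classes, a complex data
# class (CDC) with a fixed signature, and complex data objects (CDOs) that
# tie one data object per signature entry into a tuple-like relationship.

class DataObject:
    def __init__(self, value):
        self.value = value

class NominalObject(DataObject): pass        # textual items, e.g. a link name
class QuantitativeObject(DataObject): pass   # numerical values, e.g. traffic
class Spatial2DObject(DataObject): pass      # (x, y) or (lon, lat) locations

class ComplexDataClass:
    """A CDC fixes the combination of data classes every CDO must follow."""
    def __init__(self, name, signature):
        self.name = name
        self.signature = signature           # tuple of data classes
        self.objects = []

    def instantiate(self, *members):
        if len(members) != len(self.signature) or \
           not all(isinstance(m, c) for m, c in zip(members, self.signature)):
            raise TypeError("members do not match the CDC signature")
        cdo = tuple(members)                 # a CDO: one object per column
        self.objects.append(cdo)
        return cdo

# A CDC for rows of a (name, use, capacity) table, plus a domain function
# that adds derived 'load' objects, mirroring the use/capacity example.
link_cdc = ComplexDataClass("link", (NominalObject, QuantitativeObject,
                                     QuantitativeObject))
link_cdc.instantiate(NominalObject("Edinburgh-London"),
                     QuantitativeObject(40.0),    # use
                     QuantitativeObject(100.0))   # capacity

def load_domain_function(cdc):
    """Derive average-load objects from (name, use, capacity) CDOs."""
    return [QuantitativeObject(use.value / cap.value)
            for _, use, cap in cdc.objects]

print([o.value for o in load_domain_function(link_cdc)])   # [0.4]
```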
3 Visualisation Architecture
The visualisation framework allows us to develop visualisation ideas in an environment where we are not constrained by a specific visualisation architecture.
If we want to create tangible visual representations, we require a visualisation architecture to realise the visualisation system. In the remainder of this paper, we describe a visualisation architecture that demonstrates a novel approach to implementing the visualisation pipeline. A visualisation architecture must address (1) how data is stored or represented within the system, (2) how visual representations of the data are created, and (3) how the operational requirements of (1) and (2) are co-ordinated. The visualisation architecture we have adopted is an agent-based system inspired by behaviours observed in biological systems.
3.1 Principles of an Agent-Based Architecture
The simplest way of illustrating the underlying principles behind an agent-based system is to use a biological example. If we consider a room with no windows or doors, the room is essentially a box. Although the room is sparse, it does contain some objects: a telephone, a chair, a blackboard and a stick of chalk. The room is populated by five people, each with a personality that characterises their behaviour. For example, one could possess leadership qualities that lead them to command others, or have organisational skills that make them request actions from others, or might pace around the room. The person exhibiting the latter behaviour is physically constrained by the boundaries of the room. If they do not rely on the walls to restrict their movement, a cognitive process must be present that not only makes them move but also controls their movement. The inhabitants of the room can communicate using a commonly understood language. The person making the utterance could decide to communicate with another individual, a group of people (e.g., all males or some other commonly accepted group), or broadcast their utterance to everyone. This communication could transfer information, issue instructions (or orders) or ask and respond to questions. It would be fair to assume the people in the room would observe Grice's maxims of communication [4]. The adoption of these rules means that the people in the room communicate effectively and will not attempt to mislead one another, maliciously or otherwise. A person's environment is defined by the room and the objects within it. The room is a static boundary that restricts the movement of people and objects. The chair is a passive object that may be used by the people in the room but offers the bare minimum of interaction. Similarly, the blackboard provides a service to the people in the room, but could also be used for storage or as a device to communicate with other people. The telephone is an interesting object; it provides a service allowing people to communicate outside their immediate environment and can affect the people within the environment when it receives an incoming call. In this case the telephone is no longer a passive object that provides a service; it can cause the people in the room to change their behaviour. A simple example will provide an illustration of the types of communication that could take place in the room. It demonstrates the principle of co-operation that is essential for social groups to achieve a common goal. In this example, we want the names of the people in the room to be written on the blackboard. The
task is initiated by a telephone call that is answered by a person who knows how to use a receiver; the caller issues an unambiguous instruction to "write every person's name on the blackboard". The person who answered the call is now aware of the task that must be achieved and they take a controlling position in the environment. Luckily, the person who answered the call also has the ability to write on the blackboard and broadcasts a message to all the other people in the room asking them to reply with their name. One by one, the other people in the room reply to the question and, as the answers are received, they are written on the blackboard. Finally, the task is completed by adding their own name to the bottom of the list. This example demonstrates that some tasks can only be achieved by interaction and co-operation. This distributed computation, co-operation and communication, which is evident in the biological example above, demonstrates exactly the principles which underpin an agent-based system. In such a system, every person and object is represented as an agent. Agents in the environment have a similar existence to the people in the room; they are autonomous entities but are capable of communication. Through communication, agents can arrange co-operation to achieve common goals. Agents can co-exist in an environment that has a similar constraining effect to the room. The environment also provides a transport mechanism for inter-agent communication; this is analogous to the longitudinal sound waves that are transmitted through the air of the room. We have used this distributed model of computation to create DIME (Distributed Information in a Multi-agent Environment), a visualisation system that moves away from a heavyweight constraint-based optimisation algorithm towards a lightweight distributed system that empowers individual data items. The agent environment is populated directly from the data model created by the data classification scheme. The object-orientated nature of the data model translates easily into the agent environment by allowing every data object and class to be represented as an agent. Using this approach provides a convenient method of organising the data within the visualisation system; it also allows the responsibility for creating a visual representation of the data to be devolved to the individual data items and their associates.
Fig. 2. Architecture of an individual DIME agent with properties (P1 to Pn ) and the communication interface with the agent environment.
3.2 DIME Agent Environment
Using an agent-based philosophy, DIME supports agents operating within an agent environment. We first describe the construction of an agent before describing the operation of multiple agents within the environment; the discussion of our system is presented with a minimum of implementation detail. DIME agent: Figure 2 shows the structure of an agent which is characterised by the properties it holds (P1 to Pn). Typically an agent is assigned numerous properties and the interaction between the properties dictates its overall behaviour. Properties can be broadly cast into two types: (1) a value property stores details about the agent, and (2) a behaviour property directly describes how the agent behaves within the environment. The internal operation of an agent is controlled by its property communication interface. This is responsible for controlling the individual properties of an agent, organising the updates of each property via a voting mechanism, inter-property communication, and providing access to the agent communication interface for inter-agent communication. When an agent is introduced into the environment, it is activated to perform its tasks. Once activated, an agent uses a round-robin algorithm allowing each property to optionally perform some action. The combined effect of these actions allows agents to complete their goals. The action of each property is defined within its action method which is invoked each time it is the property's turn to do some work. This method allows the property to make changes to the character (or state) of the agent by updating or adding properties. Properties in control of the agent are co-operative and operate in a fair and reasonable manner; this is particularly important when a property needs to update the values of other properties in the agent. As a safeguard against an agent being dominated by a minority of properties, we use a voting mechanism in an attempt to reach a consensus on the evolution of the agent. The voting mechanism allows all other properties to cast a vote on their support for the agent to adopt the new property value. The support a property can express for new values ranges from definitely reject to definitely accept, with a neutral position being taken by a property with no interest in the vote. If the outcome of the vote is marginal, a random function makes the casting vote. The operation of an agent is clearly illustrated by a simple example; we limit our consideration to a single agent defined to randomly move around a limited area and only consider its directly relevant properties. This behaviour is expressed using three properties: (1) the position property stores the agent's current position in its environment; (2) the boundary property represents the limits of the agent's movement; and (3) the random movement property defines the agent's behaviour in its environment. Only the random movement property has an action method; the other two characterise the agent but have a passive role in its behaviour. When the random movement property wishes to move the agent, it uses the inter-property communication interface to request the current position of the agent; this is the value of the position property. It can then make a change to the position to represent a random movement in the environment. Before this change can be adopted, a vote must take place involving the position
and the boundary properties. The position property has no preference in the outcome of the vote and returns a neutral verdict. The boundary property must investigate the proposed change in detail. If the proposed position is outside the boundary, it rejects the change; otherwise, it votes to accept the change and the property communication interface will permit the new value to be adopted. DIME environment: Agents do not directly have access to other agents' internal properties and all contact must be channelled through the agent communication interface. This interface provides the necessary functionality to direct messages to individual agents, groups of agents (e.g., all the agents in a class or associated by a relationship agent) or all agents within the environment. Using this communication interface, agents can pass properties to one another using the following styles of communication: (1) instruction messages pass instructions to an agent which must be followed; (2) information messages inform other agents of a property value; and (3) question messages request information from an agent and are responded to using an answer message. Messages are routed by the environment communication system and stored until required by the agents. The agent environment supports the introduction of domain knowledge via domain functions. These specialist agents have two roles: (1) to search for combinations of data agents that can be used to create new agents; and (2) to co-ordinate groups of agents. In the latter role, domain function agents act as a proxy for other agents in the environment. In this guise, the agents for whom it is acting as a proxy are assigned a proxy property; this refers all action and voting to the proxy agent. This mechanism is available to an agent that co-ordinates groups of agents when a substantial influence on their conduct is required. The structure of the data model provides a natural organisation of the data within the visualisation system. Agents have a more complete understanding of the data than any other entity and can create visual representations of themselves using a special agent providing an agent display window. Agents that wish to be represented must possess the visible property; this is responsible for communicating visual information about the agent to the agent display window and encapsulates all the visual information about how to render the agent. This information is received from other properties and can be influenced directly or indirectly by other agents in the environment. Specialist properties can provide the agent with additional knowledge of a suitable visual representation for the data it represents. For example, a Gestalt property gives an agent knowledge about Gestalt principles [5] of appropriate organisation. Another property could provide the agent with knowledge of colour perception [2]. Using such properties, the agents have an improved knowledge of a suitable representation for the data. Agents can collaborate to represent themselves as one visual item in the agent display window. The data model provides several inherent organisational features that are retained within the agent environment. The class agents and their associated data members provide two extreme levels of abstraction. At the most detailed level, agents can provide a visual representation in the agent display window or, alternatively, the data agents can provide a high-level abstraction of the data. A CDO provides an orthogonal dimension of abstraction where
related data items provide an associated concept. In a data model created by the data classification scheme, the highest level of abstraction is a complex data class. A CDC agent is an abstract view derived from a table representing all the class agents, data agents and CDOs. The autonomous but collaborative nature of agents allows them to negotiate suitable visual representations for the natural structures present in the data they represent.
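The following is a minimal sketch of the property and voting mechanism described in Section 3.2; the class names, the three-valued vote scale and the tie-breaking rule are simplifying assumptions made for illustration and do not reproduce the DIME API.

```python
# Hypothetical sketch: an agent holds properties; a behaviour property proposes
# a new value and all properties vote before the change is adopted (the paper's
# "definitely reject ... definitely accept" scale is reduced to three values here).
import random

ACCEPT, NEUTRAL, REJECT = 1, 0, -1

class Property:
    def __init__(self, name, value=None):
        self.name, self.value = name, value
    def action(self, agent):                 # value properties do nothing
        pass
    def vote(self, prop_name, new_value, agent):
        return NEUTRAL                       # no interest in the vote

class BoundaryProperty(Property):
    def vote(self, prop_name, new_value, agent):
        if prop_name != "position":
            return NEUTRAL
        (x, y), (x0, y0, x1, y1) = new_value, self.value
        return ACCEPT if x0 <= x <= x1 and y0 <= y <= y1 else REJECT

class RandomMovementProperty(Property):
    def action(self, agent):
        x, y = agent["position"].value
        step = lambda: random.choice((-1, 0, 1))
        agent.propose("position", (x + step(), y + step()))

class Agent:
    def __init__(self, properties):
        self.properties = {p.name: p for p in properties}
    def __getitem__(self, name):
        return self.properties[name]
    def propose(self, name, new_value):
        votes = sum(p.vote(name, new_value, self) for p in self.properties.values())
        if votes > 0 or (votes == 0 and random.random() < 0.5):  # casting vote
            self.properties[name].value = new_value
    def step(self):                          # round-robin over the properties
        for p in list(self.properties.values()):
            p.action(self)

agent = Agent([Property("position", (5, 5)),
               BoundaryProperty("boundary", (0, 0, 10, 10)),
               RandomMovementProperty("movement")])
for _ in range(20):
    agent.step()
print(agent["position"].value)               # remains inside the boundary
```

In this sketch the boundary property plays exactly the role described in the random-movement example: it rejects any proposed position outside its limits, so the agent can never leave the permitted area.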
4 Current Work and Future Plans
The agent visualisation architecture has formed the basis for DIME, an ongoing research project implemented using Java and the Java 3D API, which has allowed the rapid development of prototype agents and provides a stable platform for introducing additional functionality. DIME provides an interface to the agent environment allowing a user to browse the agent class hierarchy and expand individual classes to study their data agent instantiations. Properties of agents can be investigated and, where necessary, the characteristics of an agent can be changed by removing, adding or changing its properties. Visible agents are rendered in the agent environment window by the agent rendering system. At present, visualisations in DIME use graphical style sheets (GSS) [3]. A GSS specifies the properties that should be assigned to the agents for an effective visual representation of the data they represent. A future enhancement will introduce dynamic property allocation by introducing user profile agents that characterise a user. Such agents are assigned the task of identifying data that is relevant to the user; the agents representing this data are given greater prominence in the visual representation. A long-term goal for the development of DIME is to create a system capable of acting as a data repository. Our ambition is for the data management to be controlled by agents, encouraging a long-term data management strategy rather than a system for short-term visualisation.
References
1. Bertin, J.: Graphics and Graphic Information Processing. Walter de Gruyter, Berlin (1981)
2. Healey, C.: Choosing Effective Colours for Data Visualisation. In: IEEE Visualization (1996)
3. Felciano, R., Altman, R.: Graphical Style Sheets: Towards Reusable Representations of Biomedical Graphics. In: Conference on Human Factors in Computing Systems (1998, to appear)
4. Grice, H.: Logic and Conversation. In: Cole, P., Morgan, J. (eds.): Syntax and Semantics: Speech Acts, Vol. 3 (1975) 41-58. Academic Press, New York
5. Koffka, K.: Principles of Gestalt Psychology. Kegan Paul, Trench, Trubner, London; Harcourt, Brace, New York (1935)
6. Marks, J.: A Formal Specification Scheme for Network Diagrams That Facilitates Automated Design. Journal of Visual Languages and Computing, 2 (1991) 395-414
7. Roth, S., Mattis, J.: Data Characterization for Intelligent Graphics Presentation. In: Proc. CHI'90 (1990)
Error-Tolerant Database for Structured Images
A. Ferro, G. Gallo, and R. Giugno Dipartimento di Matematica Università di Catania, Viale A. Doria 6, 95125, Catania, Italy Phone +39095733051, Fax +39095330091 {ferro, gallo, giugno}@dipmat.unict.it
Abstract. This paper reports the application of an error-tolerant retrieval technique introduced in 1997 by Oflazer [IEEE Transactions on P.A.M.I., vol. 19, No. 12, December 1997] for databases of trees coding pictures of similar objects. The technique is of great interest for the classification and maintenance of historical and archeological pictorial data. We demonstrate the approach on a small postal stamp collection and describe the same application on a large set of pictures.
1 Introduction
Some classes of images can be considered as "statements" in a complex, non-linear language. Such is the case for icons, coin images, coats of arms, postal stamps, etc. All these objects have their own structure and a precise syntax. Although not fully general [3], the "syntactic" approach to image understanding can be usefully applied to machine recognition. The database structure proposed in this paper is one such application. A high-level approach to classifying pictures according to their basic features can produce great advantages. These features can be the responses to linear and/or non-linear filters applied to the images, or they can be features with a precise semantics for the human observer. Features can be recovered and analyzed using object recognition algorithms in a completely automated approach or, in a semiautomatic approach, with the intervention of a human operator. Unfortunately there is no simple way, be it automated or dependent on human intervention, of organizing the elements of a generic picture into an unambiguous structure. "Content-based" image databases ([1,4,6,7]) are still in an experimental stage, although the initial efforts are very promising. Better results are possible when the database includes only a very specific kind of image: in particular, we propose a semantic approach to building databases for homogeneous, highly structured images. More precisely, a collection of homogeneous and highly structured images is a set of pictures where all elements share the same "layout" but differ in details. Furthermore, we require that the details range over a finite set of possibilities and that they can be unambiguously classified. Examples of such collections are sets of images of similar objects like those that frequently occur in the cataloguing of cultural or artistic artifacts: databases of coins, databases of postal stamps, etc. For demonstration's sake, in the rest of this paper we will describe our
innovative approach with reference to a collection of postal stamps. The good results obtained in our toy example have been easily replicated on larger collections (about 100 items), and there is no real obstacle to applying the same approach to much larger collections (more than 1000 items). We also report experiments with a database built in a completely automated way using the relative distribution of pixels in an image. Although this last approach focuses on issues that are not paramount for human perception, promising performance has been obtained in this case as well.
Fig. 1. The collection of the stamps used to illustrate our proposal, referred to in the text as the "toy example". The following ids have been used to refer to each stamp: (a) Italia650; (b) Usa60; (c) Berlin20; (d) Deutschland100; (e) Deutschland80; (f) Italia750; (g) Helvetia90; (h) Helvetia180; (i) Republique Francaise.
This paper is organized as follows: Section 2 describes how to code a picture into a tree, both using an interactive procedure based on the semantics of the image and using a totally automated procedure based on the spatial distribution of the pixels. Section 3 introduces a "metric" for the distance between the coding trees and shows how to organize a collection of trees into a compact, space-efficient trie. Section 4 reports the results obtained in the experiments and Section 5 draws some conclusions and indicates further research directions.
2 Building Tree from Images
This section describes how to code an image drawn from a collection of structured and homogeneous images into a tree. The basic idea is to treat a structured image as a sentence in a natural language, in order to be able to apply the technique described in [5]. Sentences in a natural language are structured according to a general pattern. For example, statements expressing the action of a subject on an object can be divided into subparts: subject, verb, object. Each subpart can be further decomposed into noun, article, attribute and so on. This, naturally, suggests describing and storing such sentences with a hierarchical data structure like a tree. The choice of the features to organize in a tree for each
image/object and the structure of the tree is perhaps the most sensitive parameter of the proposed approach. We experimented with two strategies: one semantically based and one based on the distribution of the pixels.
2.1 Semantically Based Tree Construction
The strategy based on the semantics of images depends on the knowledge of an expert: an accurate analysis has to determine which details and features are helpful in assessing the identity of each database item. It is important to avoid features that introduce only wasteful noise. To illustrate in a concrete way how the image coding is realized, we report one possible scheme for some of the features of postal stamps. A typical "stamp tree" can be seen in Fig. 2 and in Fig. 3. Observe that not all of the many possible details and features have been introduced in our scheme. This is an intentional choice to reduce the complexity of our example and grant greater clarity of exposition. As a side effect, if the details mentioned in Fig. 2 are the only ones considered, even different stamps (for example (g) and (i) in our collection, shown in Fig. 1) have very similar trees.
Fig. 2. General tree of a stamp, when the semantic approach is adopted. LAYOUT ∈ {Horizontal, Vertical}; COLOR ∈ {1, 2, 3, More}; SUBJECT ∈ {Monument, Face, Drawing}; BACKGROUND ∈ {Empty, Full}; N ≡ Number; T ≡ Text.
Fig. 3. The tree of the stamp Italia650.
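As a small illustration of this coding scheme, the sketch below builds one such feature tree as nested nodes; the node representation and the particular feature values are hypothetical and are not taken from Fig. 3.

```python
# Hypothetical sketch: a stamp coded as a small labelled tree following the
# scheme of Fig. 2. A node is (label, children); leaves have no children.

def node(label, *children):
    return (label, list(children))

def leaves(tree):
    """Collect the leaf labels in left-to-right order."""
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]

# A hypothetical vertical stamp with two colours, a face subject, an empty
# background, and a number plus text along its southern border.
stamp = node("V",                                   # LAYOUT = Vertical
             node("2"),                             # COLOR
             node("F"),                             # SUBJECT = Face
             node("E"),                             # BACKGROUND = Empty
             node("S", node("N"), node("T")))       # SOUTH: Number, Text

print(leaves(stamp))   # ['2', 'F', 'E', 'N', 'T']
```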
2.2 Pixels Based Tree Construction
The semantic tree construction gives impressive results but it is, in some way, inadequate because it requires the intervention of a human operator/expert. As an
alternative, we also tested a completely automated coding of a picture into a tree, based on the pixel distribution. The idea is to decompose a picture A recursively as follows (a small sketch of the procedure is given after the steps):
Step 1. Divide A with a median horizontal cut into two sub-regions Ah1 and Ah2.
Step 2. Check whether the two resulting sub-pictures are close under some prescribed similarity measure (for example, they have close mean values for the RGB channels). If Ah1 and Ah2 are close, go to Step 3; otherwise repeat the subdivision recursively on Ah1 and Ah2.
Step 3. Divide A with a median vertical cut into two sub-regions Av1 and Av2.
Step 4. Check whether the two resulting sub-pictures are close under the same similarity measure. If Av1 and Av2 are not close, repeat the subdivision recursively on Av1 and Av2; otherwise STOP the subdivision.
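A minimal sketch of this recursive decomposition is given below, assuming the picture is an RGB NumPy array and using the difference of mean RGB values as an illustrative closeness test; the threshold and minimum region size are arbitrary assumptions.

```python
# Illustrative sketch of the recursive median-cut decomposition (Steps 1-4).
# Assumes an H x W x 3 NumPy array; the closeness threshold is an assumption.
import numpy as np

def close(a, b, threshold=12.0):
    """Two regions are 'close' if their mean RGB values differ little."""
    return np.linalg.norm(a.mean(axis=(0, 1)) - b.mean(axis=(0, 1))) < threshold

def decompose(region, min_size=2):
    h, w, _ = region.shape
    if h >= 2 * min_size:
        top, bottom = region[:h // 2], region[h // 2:]          # Step 1
        if not close(top, bottom):                              # Step 2
            return ("H", decompose(top), decompose(bottom))
    if w >= 2 * min_size:
        left, right = region[:, :w // 2], region[:, w // 2:]    # Step 3
        if not close(left, right):                              # Step 4
            return ("V", decompose(left), decompose(right))
    # Non-decomposable region: a leaf labelled with its mean RGB values.
    return tuple(int(v) for v in region.mean(axis=(0, 1)))

# Toy image: a bright left half and a dark right half.
image = np.zeros((64, 64, 3), dtype=np.uint8)
image[:, :32] = 220
print(decompose(image))   # ('V', (220, 220, 220), (0, 0, 0))
```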
The decomposition obtained by following the above algorithm can be stored in a tree whose leaves are labelled with the mean RGB values of the non-decomposable regions. As an example, Fig. 4 shows the tree associated with the stamp Italia650.
Fig. 4. The decomposition tree of the stamp Italia650, when the pixels approach is adopted. The label H means "horizontal cut" and V means "vertical cut". The values of the leaves are the mean RGB and in this case they are: a = (227, 225, 192); b = (237, 233, 202); c = (192, 129, 92); d = (182, 128, 96); e = (231, 174, 115); f = (184, 153, 112); g = (154, 147, 119); h = (161, 158, 132); i = (166, 169, 146); l = (145, 160, 141); m = (117, 119, 98); n = (153, 155, 135); o = (163, 87, 68); p = (121, 109, 89); q = (233, 202, 178).
3 Finding Trees in a Forest
The constructions in section 2 show how to turn a collection of pictures into a forest of trees. Oflazer [5] has recently reported efficient ways to retrieve in this forest all the trees "similar" up to some degree to a given query. In particular, given an input
tree and a similarity threshold, an algorithm is known that efficiently retrieves all the trees in the database whose similarity with the input exceeds the threshold.
3.1 Similarity Measures between Trees
A similarity measure is essential if a flexible, error-tolerant retrieval of database items has to be performed. Following [5], the distance between two trees is defined taking into account structural differences and label differences. Two trees can differ because there are different labels in corresponding nodes, or because some branch in one has no correspondence in the other. Let C be the cost in case of different labels and S the cost of a structural difference. The distance between the trees is the minimum cost of the leaf or branch insertions, deletions or leaf label changes necessary to change one tree into the other.
To be more precise, we adopt the data structure "vertex list sequence" to represent a tree T. This structure is a sequence of lists; there are as many lists in the sequence as there are leaves in the tree. Each list contains the ordered sequence of vertices on the unique path from the root to the corresponding leaf. For example, the tree in Fig. 3 is represented as follows:
((V, MORE), (V, M), (V, F), (V, S, T), (V, N, N), (V, N, T))
Using this formalism we can introduce the following definition.
Definition. Let Z = Z1, …, Zp denote a generic vertex list sequence of p vertex lists. Z[j] denotes the initial subsequence of Z up to and including the j-th vertex list. We will use X (of length m) to denote the query vertex list sequence, and Y (of length j) to denote a (possibly partial) candidate vertex list sequence (from the database of trees). The distance is defined recursively by
dist(X[0], Y[j]) = j * S   (case i = 0),
dist(X[i], Y[0]) = i * S   (case j = 0),
and, for i, j > 0, dist(X[i], Y[j]) equals
  dist(X[i-1], Y[j-1])   if Xi = Yj, i.e. the last vertex lists are the same;
  min(dist(X[i-1], Y[j-1]) + C, dist(X[i-1], Y[j]) + S, dist(X[i], Y[j-1]) + S)   if Xi and Yj differ only at the leaf label;
  min(dist(X[i-1], Y[j]), dist(X[i], Y[j-1])) + S   otherwise.
If the semantic tree coding is adopted, Table 1 reports the "similarity values" between all the stamps in our toy collection assuming that C = 1 and S = 2; this is also the choice we made in the experiments reported in Section 4. If the pixel-based tree coding is adopted, very different results are obtained: Table 2 reports the "similarity values" between all the stamps in our toy collection, again with C = 1 and S = 2. Fig. 5 is a graphical representation of the distances between the stamp Italia650 and the other stamps stored in our collection, obtained following both the semantic and the pixel approach.
If the semantic tree coding is adopted Table 1 reports the "similarity values" between all the stamps in our toy collection assuming that C=1 and S=2. This is also the choice we made in the experiments reported in Section 4. If the pixels based tree coding is adopted very different results are obtained. Table 2 reports the "similarity values" between all the stamps in our toy collection assuming that C=1 and S=2 in this case. Fig. 5 is a graphical representation of the distances between the stamp Italia650 and the other stamps stored in our collection obtained following both the semantic and the pixel approach.
56
A. Ferro et al. DIST.
A
B
C
D
E
F
G
H
I
A B C D E F G H I
0 5 11 12 4 3 22 24 22
5 0 11 10 7 8 22 24 22
11 11 0 4 7 9 22 24 22
12 10 4 0 8 13 22 24 22
4 7 7 8 0 7 22 24 22
3 8 9 13 7 0 20 22 20
22 22 22 22 22 20 0 3 2
24 24 24 24 24 22 3 0 4
22 22 22 22 22 20 2 4 0
Table 1. Distances between the stamps in the database with C = 1 and S = 2 when the semantic tree coding is adopted. A = Italia650; B = Usa60; C = Berlin20; D = Deutschland100; E = Deutschland80; F = Italia750; G = Helvetia90; H = Helvetia180; I = Republique Francaise.

DIST. | A   | B   | C   | D   | E   | F   | G   | H   | I
A     | 0   | 20  | 22  | 20  | 114 | 146 | 89  | 90  | 149
B     | 20  | 0   | 24  | 16  | 116 | 148 | 91  | 92  | 151
C     | 22  | 24  | 0   | 24  | 112 | 144 | 87  | 88  | 147
D     | 20  | 16  | 24  | 0   | 116 | 148 | 91  | 92  | 151
E     | 114 | 116 | 112 | 116 | 0   | 122 | 98  | 96  | 206
F     | 146 | 148 | 144 | 148 | 122 | 0   | 178 | 176 | 238
G     | 89  | 91  | 87  | 91  | 98  | 178 | 0   | 50  | 181
H     | 90  | 92  | 88  | 92  | 96  | 176 | 50  | 0   | 182
I     | 149 | 151 | 147 | 151 | 206 | 238 | 181 | 182 | 0
Table 2. Distances between the stamps in the database with C=1 and S=2 when the pixel based tree coding is adopted. A=Italia650, B=Usa60; C=Berlin20; D=Deutschland100; E=Deutschland80; F=Italia750; G=Helvetia90; H=Helvetia180; I=Republique Francaise.
Fig. 5. The similarity between the stamp Italia650 and the other stamps when (a) the semantic tree coding and (b) the pixel-based tree coding is adopted.
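A small dynamic-programming sketch of the distance defined in Section 3.1, operating directly on vertex list sequences, is given below; the helper names are ours and the variant sequence in the example is invented for illustration.

```python
# Illustrative sketch of the vertex-list-sequence distance (C = label change
# cost, S = structural cost), computed bottom-up instead of by recursion.

def differ_only_at_leaf(x, y):
    """True if the two vertex lists share the path but not the leaf label."""
    return len(x) == len(y) and x[:-1] == y[:-1] and x[-1] != y[-1]

def dist(X, Y, C=1, S=2):
    m, n = len(X), len(Y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * S
    for j in range(n + 1):
        D[0][j] = j * S
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            elif differ_only_at_leaf(X[i - 1], Y[j - 1]):
                D[i][j] = min(D[i - 1][j - 1] + C,
                              D[i - 1][j] + S,
                              D[i][j - 1] + S)
            else:
                D[i][j] = min(D[i - 1][j], D[i][j - 1]) + S
    return D[m][n]

# The sequence of Fig. 3 against a variant with a changed leaf and a missing list.
italia650 = [("V", "MORE"), ("V", "M"), ("V", "F"),
             ("V", "S", "T"), ("V", "N", "N"), ("V", "N", "T")]
variant   = [("V", "MORE"), ("V", "F"), ("V", "F"),
             ("V", "S", "T"), ("V", "N", "T")]
print(dist(italia650, variant))   # 1 label change + 1 deleted vertex list = 3
```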
3.2 Organizing a Forest in a Tree
The trees associated with the elements in our collection are stored in the database in their vertex list sequence representation. The set of vertex list sequences is, in turn, converted into a trie structure [2]. The trie compresses any redundancies in the prefixes of the vertex list sequences to achieve a compact data structure. For instance, the trees associated with the stamps of the collection in Fig. 1, when coded according to the semantic approach, can be represented as a trie as shown in Fig. 6.
Fig. 6. The trie associated with the trees of the stamps in our database.
Error Tolerant Retrieval of Trees
Our goal is the retrieval of trees that match a query up to some degree of approximation. Standard searching within a trie corresponds to traversing a path from the start node to one of the leaves, so that the concatenation of the labels on the arcs along this path matches the input vertex list sequence. For error-tolerant searching, one has to find all paths from the start node to leaves such that the corresponding vertex list sequences are within a given distance threshold t of the query vertex list sequence. To perform this search efficiently, paths in the trie that lead to no solutions have to be pruned early, so that the search is confined to a very small portion of the data structure. The detailed description of an efficient search strategy is outside the scope of this paper; here we report a sketch of the idea developed in [5]. To perform an efficient depth-first probing of the trie one has to compute the similarity distance between subsequences of the query and the sequences obtained by chaining together the labels of the visited nodes of the trie. This can be done recursively, in a dynamic programming fashion, maintaining a suitable array H of distances where the entry H(i, j) represents the distance between the subsequence of the first i vertex lists in the query and the
sequence of j vertex lists obtained by visiting a node at depth j of the trie. The array H(i, j) is locally updated during the visit of the trie and during the possible backtracking steps. A backtracking step occurs whenever a similarity threshold t (also known as the cutoff distance) is exceeded. The leaves reached during the search within the cutoff distance bound are the output of the error-tolerant matching procedure. Oflazer has shown that this search can be realized in O(L^2 \log L \cdot k^{1/\lceil t/S \rceil}), where L is the number of leaves in each tree, k is the number of trees in the forest, t is the cutoff distance, and S is the cost of adding or deleting a leaf.
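The following sketch illustrates the idea, reusing the cost model of Section 3.1 and the dictionary trie layout assumed in the earlier sketch; it keeps one column of H per trie level and backtracks (prunes) as soon as every entry of the current column exceeds the cutoff t. This is only an illustration of the approach in [5], not Oflazer's algorithm.

```python
# Sketch of error-tolerant depth-first search in a trie of vertex list sequences.

def search_trie(trie, query, t, C=1, S=2):
    m = len(query)
    results = []

    def substitution_cost(q_vl, n_vl):
        if q_vl == n_vl:
            return 0                                   # identical vertex lists
        if len(q_vl) == len(n_vl) and q_vl[:-1] == n_vl[:-1]:
            return C                                   # differ only at the leaf label
        return None                                    # structural difference

    def visit(node, column):
        # column[i] = distance between the first i query vertex lists and the
        # sequence of labels on the path leading to this node.
        if "$leaves" in node and column[m] <= t:       # stored trees within the cutoff
            results.extend((sid, column[m]) for sid in node["$leaves"])
        for vertex_list, child in node.items():
            if vertex_list == "$leaves":
                continue
            nxt = [column[0] + S]                      # H(0, depth + 1)
            for i in range(1, m + 1):
                best = min(column[i] + S, nxt[i - 1] + S)
                c = substitution_cost(query[i - 1], vertex_list)
                if c is not None:
                    best = min(best, column[i - 1] + c)
                nxt.append(best)
            if min(nxt) <= t:                          # prune hopeless branches early
                visit(child, nxt)

    visit(trie, [i * S for i in range(m + 1)])
    return sorted(results, key=lambda pair: pair[1])
```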
4 Results
In this section we report some of the results obtained using our algorithm. A first experiment was to query the database of the stamps in Fig. 1 extensively. We performed searches with queries already in the database and with queries new to the database. If the query is already in the database, we have always obtained an exact match. Fig. 7 shows the portion of the trie visited with threshold t = 2 for a query outside the stored collection. The output is the stamp "Deutschland100", which belongs to the very same emission as the query and hence is very similar to it. If the threshold t is set to 4, more outputs are found. In particular the algorithm retrieves Deutschland100 and, in addition, Berlin20. Although Berlin20 may appear very different from Deutschland300, it shares with it the vertical layout, the woman portrait and a few other details that are important in our coding scheme. This shows that it is important to have a coding scheme that closely maps the perceptually and semantically important information of each item in the database.
Fig. 7. Search resulting from using the stamp "Deutschland300" as a query with threshold t = 2. The solid branches are the only branches visited during the search.
The good results obtained in these first simple experiments have been replicated on a larger database of 100 stamps coded using the semantic approach. We ran the following quantitative test: a human expert was asked to retrieve from the collection a set of 4 stamps that are, according to his judgment, close to a "query stamp". We compared the expert's selections with the output obtained from the same query using our algorithm. In all cases, both the expert and the algorithm retrieved the very same copy of the query whenever present. The set of the other stamps retrieved
by the expert included some of the output obtained from our algorithm in 42% of the cases with similarity threshold t = 2. This percentage rises to 84% with similarity threshold t = 4. Similar results have been obtained when the query does not have an exact match in the database. As a final experiment we ran the same test on the same 100-stamp database using the alternative coding described in Section 2.2. Although the performance decreased, because the pixel-based coding gives more relevance to features that are clearly not taken into account by a human expert, we believe that some combination of the two codings could produce very reliable results.
5 Conclusion and Research Perspectives
In this paper we have demonstrated an application of the error-tolerant retrieval technique proposed in [5] according to two coding strategies. We have conducted experiments using a database of images with similar layout and different details from a finite set of possibilities. The database has been queried and the answers have been statistically compared with the findings of an expert. The observed performance proves the suitability of the method for databases of visual information. Future work following this line of research concerns both applications and a more refined theoretical analysis of the problem. It is particularly interesting to characterize, in more precise terms, the class of images that are best suited to our technique, using the syntactic tools provided by visual languages. On the application side, we have undertaken the construction of a database for a collection of ancient Greek coins in order to validate the proposed method on a real-life situation of moderate size. We also intend to prove the applicability of the proposed technique to Web search, and as a support for electronic commerce.

References

1. J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey, and R. Jain, "The Virage Image Search Engine: An Open Framework for Image Management", Proc. SPIE, Storage and Retrieval for Still Image and Video Databases, vol. 2670, pp. 76-87, 1996.
2. T. H. Cormen, C. E. Leiserson, R. L. Rivest, "Introduction to Algorithms", MIT Press, 1990.
3. D. Marr, "Vision", Freeman and Co., New York, 1982.
4. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, G. Taubin, "The QBIC Project: Querying Images by Content Using Color, Texture, and Shape", Proc. SPIE, vol. 1908, Storage and Retrieval for Image and Video Databases, 1993, pp. 173-187.
5. K. Oflazer, "Error-Tolerant Retrieval of Trees", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 12, December 1997.
6. A. Pentland, R. W. Picard, S. Sclaroff, "Photobook: Tools for Content-Based Manipulation of Image Databases", Proc. SPIE, vol. 2185, Storage and Retrieval for Image and Video Databases, Feb. 1994, pp. 34-47.
7. J. R. Smith and S. F. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System", Proc. International Conference on Image Processing, Lausanne, Switzerland, 1996.
Query Processing and Optimization for Pictorial Query Trees

Aya Soffer (1,2) and Hanan Samet (2)

1 Computer Science Department, Technion City, Haifa 32000, Israel
[email protected]
2 Computer Science Department, Institute for Advanced Computer Science, University of Maryland at College Park, College Park, Maryland 20742
[email protected]
Abstract. Methods for processing pictorial queries specified by pictorial query trees are presented. Leaves of a pictorial query tree correspond to individual pictorial queries, while internal nodes correspond to logical operations on them. Algorithms for processing individual pictorial queries and for parsing and computing the overall result of a pictorial query tree are presented. Issues involved in optimizing query processing of pictorial query trees are outlined and some initial solutions are suggested.
Keywords: image databases, query specification, query optimization, retrieval by content, spatial databases, image indexing
1 Introduction
Image databases must be capable of being queried pictorially. The most common method of doing this is via an example image. The problem with this method is that in an image database we are usually not looking for an exact match. Instead, we want to find images similar to a given query image. One of the main issues is how to determine whether two images are similar and whether the similarity criteria used by the database system match the user's notion of similarity. Another difficulty with pictorial queries is that they are usually not very expressive in terms of specifying combinations of conditions and negative conditions. A good pictorial query specification method should leverage the expressiveness of pictorial queries in terms of describing what objects the target images should contain and their desired spatial configuration, resolve the ambiguities inherent to pictorial queries, and enable specifying combinations of conditions. In our previous work [8], we devised a pictorial query specification technique for formulating queries that specify which objects should appear in a target image
The support of the Lady Davis Foundation and the National Science Foundation under Grant CDA-950-3994 is gratefully acknowledged. The support of the National Science Foundation under Grant IRI-97-12715 is gratefully acknowledged.
as well as how many occurrences of each object are required. Moreover, the minimum matching certainty between query-image and database-image objects can be specified, and spatial constraints that specify bounds on the distance between objects and the relative direction between objects can be imposed. Expressive power is achieved by allowing a pictorial query specification to be composed of one or more query images and by allowing a query image to be negated. In [9] we extended the pictorial query specification technique to allow the formulation of complex pictorial queries via pictorial query trees where leaves correspond to individual pictorial queries while internal nodes represent logical operations on the set of pictorial queries (or subtrees) represented by its children. In this paper we describe in detail the algorithm for processing individual pictorial queries. It can handle multiple instances of each symbol in the query image as well as in the database image and can also handle wild cards. In addition, we present an algorithm for parsing and computing the overall result of a pictorial query tree. We also discuss some issues that are involved in query optimization of pictorial query trees (i.e., their simplification).
2 Related Work
Most image database research deals either with global image matching based on color and texture features [4,5,10] or with the ambiguity associated with matching one query-image object to another [3]. These methods do not address the case of images that are composed of several objects and their desired spatial configuration. There has been some work on the specification of topological and directional relations among query objects [1,2]. These studies only deal with tagged images. Furthermore, it is always assumed that the goal is to match as many query-image objects to database-image objects as possible. A limited form of spatial ambiguity is allowed in pictorial queries based on the 2D-string and its variants [2]. The spatial logic described in [1] also allows specification of query images in terms of spatial relations between objects and permits users to select the level of spatial similarity. In some cases (e.g., [6]), the images are segmented into regions either automatically or semi-automatically and some queries involving spatial constraints specifying the desired arrangement of these regions can be performed. However, the issue of the distance between objects is not addressed by these or any other method. Furthermore, none of these methods provide Boolean combinations or negations of query images.
3 Building Pictorial Query Trees
An individual pictorial query is specified by selecting the required query objects and positioning them according to the desired spatial configuration. Next, the similarity level in terms of three parameters is specified. The matching similarity level msl is a number between 0 and 1 that specifies a lower bound on the certainty that two symbols are from the same class and thus match. Contextual similarity specifies how well the content of database image DI matches that of query image QI (e.g., do all of the symbols in QI appear in DI?). We use four levels of contextual similarity (see Figure 1). Spatial similarity specifies how good a match is required in terms of the relative locations and orientation of the matching symbols between the query and database image. We use five levels of spatial similarity (see Figure 2). For more details and examples, see [8]. Complex pictorial queries are composed of combinations of individual pictorial queries and are specified via pictorial query trees. Leaves of a pictorial query tree correspond to individual pictorial queries. A negated leaf node (NOT) yields the set of all images that do not satisfy the pictorial query. Internal nodes in the tree represent logical operations (AND, OR, XOR) and their negations on the set of images that satisfy the pictorial query (or query subtree) represented by its children.

Fig. 1. Contextual similarity levels (csl).

Fig. 2. Spatial similarity levels (ssl).
Fig. 3. Images with (a) a camping site within 5 miles of a fishing site OR a hotel within 10 miles of a fishing site, AND an airport northeast of and within 7 miles of the fishing site; (b) a camping site within 5 miles of a fishing site OR a hotel within 10 miles of a fishing site, AND no airport within 2 miles of the fishing site (the line above the query denotes negation).

The root of the tree is either a pictorial query or a logical operator, while an internal node corresponds to a logical operator and can have one or more
children. For a conjunction of query images where the same symbol appears in both query images, the user may specify whether the two query-symbols must match the same instance of the symbol in the database image, or whether two different instances are allowed. This is termed object binding.
Fig. 4. Images with a camping site within 5 miles of a fishing site OR a hotel within 10 miles of a fishing site, AND with neither a restaurant nor a cafe.

Figures 3 and 4 are examples of pictorial query trees. Figure 3a specifies more than one acceptable spatial constraint (i.e., a camping site within 5 miles of a fishing site or a hotel within 10 miles of a fishing site) with an OR. We also specify, with an AND, that there is an airport northeast of and within 7 miles of the fishing site. Figure 3b shows negation of a pictorial query, specifying that there is no airport within 2 miles of the fishing site. Figure 4 shows negation of logical operators in internal nodes. Here we are seeking images with a camping site within 5 miles of a fishing site OR a hotel within 10 miles of a fishing site, but without either a restaurant or a cafe.
4 Pictorial Query Processing
In this section, we present an algorithm for retrieving all database images that conform to a given pictorial query tree specification.

4.1 Processing individual pictorial queries
The first step in the algorithm processes each pictorial query image (i.e., each leaf) individually using function GetSimilarImagesM that takes as input a query image (QI ), the matching similarity level (msl ), the contextual similarity level (csl ), and the spatial similarity level (ssl ) associated with QI . It returns the set of database images RI such that each image DI ∈ RI satisfies the pictorial query. Figure 5 summarizes this algorithm. GetSimilarImagesM handles wildcards as well as multiple instances of each class in the query image and in the database image.
GetSimilarImagesM(logical image QI, msl, csl, ssl)

  m ← 0
  /* check matching similarity */
  foreach el ∈ QI
      if (el = wildcard) then
          r_m ← set of all images stored in the database
      else
          r_m ← set of all images containing C(el) with certainty ≥ msl (use index on class)
      m ← m + 1
  /* check contextual similarity */
  if (csl = 1) ∨ (csl = 2) then
      RI ← ∩_{i=0}^{n−1} r_i
  elseif (csl = 3) ∨ (csl = 4) then
      RI ← ∪_{i=0}^{n−1} r_i
  if (csl = 1) ∨ (csl = 3) then
      RI ← RI − {I s.t. I includes symbols not in QI} (use index on image id)
  /* check multiple symbol instance conditions */
  if (csl = 1) ∨ (csl = 2) then
      RI ← RI − {I s.t. ∃k : n_k^QI > n_k^I}, i.e. some (k-th) symbol of QI is underrepresented in I
  /* check spatial similarity */
  foreach I ∈ RI
      for every possible matching of symbols between QI and I
          check feasibility of this matching w.r.t. spatial constraints
      if all matchings are infeasible
          RI ← RI − I
  return RI ordered by average certainties
Fig. 5. Algorithm to retrieve all database images similar to a query image (QI) conforming to the constraints dictated by msl, csl, and ssl. n_k^I denotes the number of occurrences of the k-th symbol in image I.
First, for each symbol in the query image it finds all database images, DI, that contain this symbol with certainty ≥ msl. Next, it handles the contextual constraints. If csl is 1 or 2 (images should contain all symbols in QI), then it intersects the sets of result images from the first step. If csl is 3 or 4 (any one symbol from QI is enough), then it takes the union of the result images. If the contextual similarity level is 1 or 3, then it avoids including images containing symbols that are not present in QI. Next, it checks the case of multiple instances of query symbols in the query image. If csl is 1 or 2, then for every instance of each symbol in QI, it checks whether there exists an instance of the symbol in DI.
Finally, it checks whether the spatial constraints are satisfied for each candidate image I in the candidate image list RI. Since multiple instances of symbols are allowed in QI and in I, this step needs to check many possible matchings. It can be that some mappings between QI symbols and I symbols create feasible configurations while others do not. For each QI symbol we create a set of possible matches in I; selecting one element from each of these sets generates one possible matching. If none of the possible matchings passes the spatial constraints test, then the image is removed from the candidate result set. The spatial similarity of any matching is calculated using algorithm CheckSsl [8], which determines whether the spatial constraints dictated by a query image QI and spatial similarity level ssl hold in a logical image DI. Images that pass all of the tests are ordered by the average matching certainty of all matching symbols and returned as the result of the query.
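As a rough illustration, the matching enumeration could look as follows; check_ssl stands in for the CheckSsl algorithm of [8], and the symbol representation (dictionaries with a "class" field) as well as the rule that a database symbol cannot be matched twice are assumptions of ours.

```python
# Sketch: enumerate candidate matchings and test them against the spatial constraints.
from itertools import product

def passes_spatial_constraints(query_symbols, db_symbols, check_ssl, ssl):
    candidates = []
    for q in query_symbols:
        matches = [s for s in db_symbols if s["class"] == q["class"]]
        if not matches:
            return False                      # some query symbol has no counterpart
        candidates.append(matches)
    for matching in product(*candidates):     # one candidate per query symbol
        if len(set(id(s) for s in matching)) < len(matching):
            continue                          # assumed: one database symbol per query symbol
        if check_ssl(query_symbols, matching, ssl):
            return True                       # one feasible configuration is enough
    return False
```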
4.2 Parsing and evaluating pictorial query trees
ProcessQueryTree(query tree node: N)

  S ← set of all images in the database (global variable)
  if (isLeaf(N))
      NR ← GetSimilarImagesM(QI(N), msl(N), csl(N), ssl(N))
      if (hasNegationFlag(N))
          NR ← S − NR
  else
      n ← 0
      foreach M ∈ sons(N)
          r_n ← ProcessQueryTree(M)
          n ← n + 1
      NR ← OP(N)_{i=0}^{n−1} r_i   (OP(N) can be ∩, ∪, or ⊕, possibly inverted)
  return NR
Fig. 6. Algorithm to retrieve all images satisfying the query represented by node N of a pictorial query tree.

Procedure ProcessQueryTree parses and evaluates the result of a pictorial query tree. Figure 6 summarizes the algorithm. ProcessQueryTree takes as input a node N in the query tree, and returns the set of images that satisfy the query tree rooted at N. If N is a leaf node, then it checks whether the results of this query are cached from earlier invocations. If they are not, then algorithm GetSimilarImagesM is invoked. If the leaf node is negated in the tree, then the complement of the result image set returned by GetSimilarImagesM is taken. The final result image set is returned. If N is an internal node in the query
tree, then ProcessQueryTree is called recursively on each child of N, followed by applying the appropriate logical operation to the results of these calls. The whole query tree is evaluated in this recursive manner by invoking algorithm ProcessQueryTree with the root of the query tree as an argument. Recall that users can specify object binding, that is, whether the same instance of an object is to be used when it appears in more than one of the pictorial query images that make up the pictorial query tree. The following is an outline of the additions to our algorithms that are necessary for handling object binding. Algorithm ProcessQueryTree receives as additional input a global set of constraints that stipulates the bindings that were specified as part of the query. This set consists of groups of symbols, where all of the symbols in the same group should be matched to the same symbol instance in the database image. To filter out database images that are incompatible with respect to the binding conditions, we combine these binding constraints with information that is provided by the algorithm GetSimilarImagesM, which is augmented to return, for each database image that was found similar to the query image, the mapping between query symbols and matched database symbols.
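A compact Python rendering of this recursive evaluation, under the assumption that a tree node exposes the fields used below and ignoring caching and object binding, might look like this; reading XOR as "satisfied by exactly one child" is our interpretation.

```python
# Sketch of the query-tree evaluation of Fig. 6 with set semantics.

def process_query_tree(node, all_images, get_similar_images):
    if node.is_leaf:
        result = get_similar_images(node.qi, node.msl, node.csl, node.ssl)
        return all_images - result if node.negated else result
    child_results = [process_query_tree(child, all_images, get_similar_images)
                     for child in node.children]
    if node.op == "AND":
        result = set.intersection(*child_results)
    elif node.op == "OR":
        result = set.union(*child_results)
    else:  # XOR, read here as: satisfied by exactly one child
        result = {img for img in set.union(*child_results)
                  if sum(img in r for r in child_results) == 1}
    return all_images - result if node.negated else result  # NAND/NOR/NXOR
```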
5 Query Optimization Issues
Several optimization techniques can be applied to improve the efficiency of processing pictorial query trees. These include methods designed for the optimization of individual pictorial query processing and for the optimization of query tree processing. Individual pictorial query processing may be made more efficient by handling spatial and contextual constraints simultaneously rather than one after the other as we do now. We addressed this issue in [7]. Two optimizations are possible for computing the result of the pictorial query tree. The first optimization is to change the order of processing individual query images in order to execute the parts that are more selective (i.e., result in fewer images) first. The selectivity of a pictorial query is based on three factors. Matching selectivity estimates how many images satisfy the matching constraint as specified by msl. Contextual selectivity estimates how many images satisfy the contextual constraint as specified by the query image and csl. Spatial selectivity estimates how many images satisfy the spatial constraint as specified by ssl. Depending on ssl, either distance, direction, both, or neither are constrained. Matching and contextual selectivity factors are computed based on statistics stored as histograms in the database, which indicate the distribution of classifications and certainty levels in the images. These histograms are constructed when populating the database. Computing spatial selectivity is much more complex. One approach to measuring the distance aspect of the spatial selectivity calculates some approximation of the area spanned by the symbols in the query image. This can be estimated, for example, using an approximation of the convex hull of the symbols in the query image. Details of this method are beyond the scope of this paper. The selectivity of an individual pictorial query (leaf) is computed by combining these three selectivity factors.
The query tree selectivity is computed using a recursive algorithm similar to the one executing the query. If an individual pictorial query is negated in the tree, its selectivity is 1 minus the selectivity of the query. The selectivity of a subtree is computed as follows. For OR or XOR, take the sum of the selectivities of the subtrees minus the probability that a combination of cases occurred. For AND, take the product of the selectivities of the subtrees. To illustrate the general use of this optimization method, consider the query trees in Figure 3. In both queries the left side of the tree requests images with a camping site within 5 miles of a fishing site OR a hotel within 10 miles of a fishing site. In query (a), we add the constraint that there exists an airport northeast of and within 7 miles of the fishing site. In our database, we have very few airfields, and thus the right side is more selective and it will be processed first. On the other hand, in query (b), we add the constraint that there is no airport within 2 miles of the fishing site. Clearly, in most cases there will be no such airport, and thus in this case the right side is not selective and the left side should be processed first. The second form of optimization is to combine individual query images and to process them together. To see its usefulness, we study how the query in Figure 4 is processed using the current algorithm. First, find the set {CF} of all images with a camping site within 5 miles of a fishing site. Next, find the set {HF} of all images with a hotel within 10 miles of a fishing site. Then, take the union of these two sets: {LS} = {CF} ∪ {HF}. Now, find the set {R} of images with a restaurant and the set {C} of images with a cafe, and compute the set RS = I − (R ∪ C). The final result is the intersection of the two sets: LS ∩ RS. A more sensible way to compute this query is as follows. For each fishing site, find the nearest neighbors up to distance 5 in incremental order. If the next nearest neighbor is a camping site or a hotel, then add this image to the candidate list. Continue retrieving nearest neighbors in incremental order up to distance 10. If the next nearest neighbor is a hotel, then add this image to the candidate list. For each image I in the candidate list, examine all of the objects in I. If there is a restaurant or a cafe in I, then remove I from the candidate list.
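The selectivity propagation described above could be sketched as follows; treating sibling subtrees as independent events (so that the OR/XOR overlap term becomes a product of complements) is an assumption of ours, and leaf_selectivity stands in for the histogram-based estimate of the three leaf factors.

```python
# Sketch: recursive selectivity estimate for a pictorial query tree.

def selectivity(node, leaf_selectivity):
    if node.is_leaf:
        s = leaf_selectivity(node)          # combines matching, contextual, spatial factors
        return 1.0 - s if node.negated else s
    child = [selectivity(c, leaf_selectivity) for c in node.children]
    if node.op == "AND":
        s = 1.0
        for c in child:
            s *= c                          # product of the subtree selectivities
    else:                                   # OR / XOR
        none_hold = 1.0
        for c in child:
            none_hold *= (1.0 - c)
        s = 1.0 - none_hold                 # sum minus overlap, assuming independence
    return 1.0 - s if node.negated else s
```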
References

1. A. Del Bimbo, E. Vicario, and D. Zingoni. A spatial logic for symbolic description of image contents. Jour. of Vis. Lang. and Comp., 5(3):267–286, Sept. 1994.
2. S. K. Chang, Q. Y. Shi, and C. Y. Yan. Iconic indexing by 2-D strings. IEEE Trans. on Patt. Anal. and Mach. Intel., 9(3):413–428, May 1987.
3. W. I. Grosky, P. Neo, and R. Mehrotra. A pictorial index mechanism for model-based matching. Data & Know. Engin., 8(4):309–327, Sept. 1992.
4. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, and P. Yanker. The QBIC project: Querying images by content using color, texture, and shape. In Proc. of the SPIE, Storage and Retrieval of Image and Video Databases, vol. 1908, pp. 173–187, San Jose, CA, Feb. 1993.
5. A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. In Proc. of the SPIE, Storage and Retrieval of Image and Video Databases II, vol. 2185, pp. 34–47, San Jose, CA, Feb. 1994.
6. J. R. Smith and S.-F. Chang. VisualSEEk: a fully automated content-based image query system. In ACM Int. Conf. on Multimedia, pp. 87–98, Boston, Nov. 1996.
7. A. Soffer and H. Samet. Pictorial queries by image similarity. In 13th Int. Conf. on Patt. Recog., vol. III, pp. 114–119, Vienna, Austria, Aug. 1996.
8. A. Soffer and H. Samet. Pictorial query specification for browsing through spatially-referenced image databases. Jour. of Vis. Lang. and Comp., 9(6):567–596, Dec. 1998.
9. A. Soffer, H. Samet, and D. Zotkin. Pictorial query trees for query specification in image databases. In 14th Int. Conf. on Patt. Recog., vol. I, pp. 919–921, Brisbane, Australia, Aug. 1998.
10. M. Swain. Interactive indexing into image databases. In Proc. of the SPIE, Storage and Retrieval for Image and Video Databases, vol. 1908, pp. 95–103, San Jose, CA, Feb. 1993.
Similarity Search Using Multiple Examples in MARS

Kriengkrai Porkaew (1), Sharad Mehrotra (2), Michael Ortega (1), and Kaushik Chakrabarti (1)

1 Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
2 Department of Information and Computer Science, University of California at Irvine, Irvine, CA 92697, USA
{nid,sharad,miki,kaushik}@ics.uci.edu
Abstract. Unlike traditional database management systems, in multimedia databases that support content-based retrieval over multimedia objects, it is difficult for users to express their exact information need directly in the form of a precise query. A typical interface supported by content-based retrieval systems allows users to express their query in the form of examples of objects similar to the ones they wish to retrieve. Such a user interface, however, requires mechanisms to learn the query representation from the examples provided by the user. In our previous work, we proposed a query refinement mechanism in which a query representation is modified by adding new relevant examples based on user feedback. In this paper, we describe query processing mechanisms that can efficiently support query expansion using multidimensional index structures.
1 Introduction
In a content-based multimedia retrieval system, it is difficult for users to specify their information need in a query over the feature sets used to represent the multimedia objects [10, 7, 12]. Motivated by this, recently, many content-based multimedia retrieval systems have explored a query by example (QBE) framework for formulating similarity queries over multimedia objects (e.g., QBIC [4], VIRAGE [1], Photobook [9], MARS [6]). In QBE, a user formulates a query by providing examples of objects similar to the one s/he wishes to retrieve. The system converts this into an internal representation based on the features extracted from the input images. However, a user may not initially be able to provide the system with “good” examples of objects that exactly capture their information needs. Furthermore, a user may also not be able to exactly specify the relative
This work was supported by NSF awards IIS-9734300, and CDA-9624396; in part by the Army Research Laboratory under Cooperative Agreement No. DAAL01-962-0003. Michael Ortega is supported in part by CONACYT grant 89061 and MAVIS fellowship.
importance of the different features used to represent the multimedia objects to the query. To overcome the above limitations, in the Multimedia Analysis and Retrieval (MARS) project we explored techniques that allow users to refine the initial query during the retrieval process using relevance feedback [10]. Given an initial query, the system retrieves the objects that are most similar to the query. The feedback from the user about the relevance of the retrieved objects is then used to adjust the query representation. Relevance feedback in MARS serves two purposes. Query Reweighting adjusts the relative importance of the different components of the query. It allows the system to learn the user's interpretation of similarity between objects. Query Modification changes the underlying representation of the query to incorporate new relevant information from the user's feedback. It overcomes the deficiency of having started from examples that only partially capture the user's information need. In [11, 12, 7, 10], various models for query reweighting and query modification were explored and compared over diverse multimedia collections. Specifically, two different strategies for query modification have emerged. The first, referred to as query point movement (QPM) [7, 11], attempts to move the query representation in the direction where relevant objects are located. At any instant, a query is represented using a single point in each of the feature spaces associated with the multimedia object. In contrast to QPM, in [10] we proposed a query expansion model (QEM) in which the query representation is changed by selectively adding new relevant objects (as well as deleting old and less relevant objects). In QEM, the query may consist of multiple points in each feature space. Our experiments over large image collections illustrated that QEM outperforms QPM in retrieval effectiveness (based on precision/recall measures) [10]. However, a potential drawback of QEM is that the cost of evaluating the query grows linearly with the number of objects in the query if done naively. In this paper, we explore efficient strategies to implement QEM that overcome the above overhead. The key is to traverse a multidimensional index structure (e.g., X-tree [2], hybrid tree [3], SS-tree [15], etc.) such that the best N objects are retrieved from the data collection without having to explicitly execute N nearest neighbor queries for each object in the query representation. We conduct an experimental evaluation of the developed strategies over a large image collection. Our results show that the developed algorithms make QEM an attractive strategy for query modification in content-based multimedia retrieval, since it provides better retrieval effectiveness without extensive overhead. The rest of the paper is organized as follows: Sect. 2 describes content-based retrieval in MARS, Sect. 3 describes the proposed approaches to implementing QEM, Sect. 4 compares the approaches and shows experimental results, and conclusions are given in Sect. 5.
2 Content-Based Retrieval in MARS
This section briefly describes the content-based retrieval mechanism supported in MARS which is characterized by the following models: Multimedia Object Model: a multimedia object is a collection of features and the functions used to compute the similarity between two objects for each of those features. Query Model: A query is also a collection of features. In QEM, a query may be represented by more than one instance (point) in each feature space. Furthermore, weights are associated with each feature, as well as, with each instance in the feature representation. These weights signify the relative importance of the component to the query. Figure 1 illustrates the query structure which consists of multiple features fi and each feature consists of multiple feature instances rij .
[Figure: the query tree has root Query, feature nodes F1 and F2 with weights w1 and w2, and instance nodes R11, R12 and R21, R22 with weights w11, w12, w21, w22. Legend: Fi = Feature i; wi = importance of Feature i with respect to the other features; wij = importance of Feature i of Object j with respect to Feature i of the other objects; Rij = representation of Feature i of Object j.]
Fig. 1. Query Model
Retrieval model: The retrieval model defines how the similarity Sim between a query Q and an object O is computed. Similarity is computed hierarchically over the query tree. That is, Sim = \sum_{i=1}^{n} w_i Sim_i, where \sum_{i=1}^{n} w_i = 1, n is the number of features used in the query, and Sim_i is the similarity between the object and the query based on feature i, which is computed as Sim_i = \sum_{j=1}^{m} w_{ij} Sim_{ij}, where \sum_{j=1}^{m} w_{ij} = 1, m is the number of feature instances of feature i in the query, and Sim_{ij} is the similarity between instance j and the object based on feature i. Sim_{ij} is computed using the similarity function determined by the object model. The retrieval process begins with some initial weights associated with the nodes at each level of the query tree. For simplicity, the weights associated with nodes of the same parent are initially equal. Refinement Model: The refinement model adjusts the query tree and the similarity functions used at the different levels of the tree based on the user's feedback. As discussed in the introduction, the refinement process consists of query reweighting and query modification using the query expansion model. The details of the reweighting and query modification models are not critical for the discussion of the implementation techniques in this paper and are hence omitted due to space restrictions. Details can be found in [10].
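For concreteness, the weighted-summation model can be written as a few lines of Python; the representation of the query as nested (weight, instance) lists is an illustrative assumption, not the system's actual data structure.

```python
# Sketch of the hierarchical similarity Sim = sum_i w_i * sum_j w_ij * Sim_ij.

def overall_similarity(query, obj_features):
    """query: one entry per feature i: (w_i, instances), where instances is a
    list of (w_ij, r_ij, sim_ij) and sim_ij(r_ij, x_i) computes Sim_ij.
    obj_features: the object's per-feature representations, indexed like query."""
    total = 0.0
    for (w_i, instances), x_i in zip(query, obj_features):
        sim_i = sum(w_ij * sim_ij(r_ij, x_i) for w_ij, r_ij, sim_ij in instances)
        total += w_i * sim_i
    return total
```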
3 Query Processing
At each iteration of query refinement, the system returns to the user the N objects from the database that have the highest similarity to the current query representation. Instead of ranking each object in the database and then selecting the best N answers, the query is evaluated in a hierarchical bottom-up fashion. First, the best few objects based on each feature individually are retrieved. The similarity values of these objects on the individual features are then combined (using the weighted summation model) to generate a ranked list of objects based on the entire query. The process continues until the best N matching objects have been retrieved. We next discuss how feature nodes of the query are evaluated, and how the answers are combined to obtain the best N answers for the query.
3.1 Evaluating Feature Nodes
In a query tree, let f be a feature node and r_1, ..., r_m be the instances (points) in the feature space F. The objective of evaluating the feature node is to retrieve the N objects from the database that best match f. We will use the notion of distance instead of similarity, since the evaluation of the feature node will use multidimensional indexing mechanisms that are organized based on distances. Let d_{r_j,x} be the distance between r_j and a point x in F, and D_{f,x} be the distance between f and x in F, where D_{f,x} = \sum_{j=1}^{m} w_j d_{r_j,x} and \sum_{j=1}^{m} w_j = 1. Thus, the best N matches to f correspond to the objects which are closest to f based on the above definition of distance. In the following two subsections, we describe two different strategies for evaluating the best N objects for a given feature node. Both strategies assume that the feature space is indexed using a multidimensional data structure that supports range and k-nearest neighbor queries.

Centroid Expansion Search (CES): The idea is to iteratively retrieve the next nearest neighbors of some point c (close to r_1, ..., r_m) in the feature space F using the feature index until the N best matches to f are found. Let x and y be two objects in the feature space F. x is a better match to f than y if and only if D_{f,x} \le D_{f,y}, or equivalently

    \sum_{j=1}^{m} w_j d_{r_j,x} \le \sum_{j=1}^{m} w_j d_{r_j,y}   (1)

Since distance functions are metric, the triangle inequality dictates that d_{r_j,x} \le d_{c,x} + d_{c,r_j} and d_{r_j,y} \ge |d_{c,y} - d_{c,r_j}|. Substituting d_{r_j,x} and d_{r_j,y} in (1):

    \sum_{j=1}^{m} w_j (d_{c,x} + d_{c,r_j}) \le \sum_{j=1}^{m} w_j |d_{c,y} - d_{c,r_j}|   (2)

Since \sum_{j=1}^{m} w_j = 1, we get:

    d_{c,x} + \sum_{j=1}^{m} w_j d_{c,r_j} \le \sum_{j=1}^{m} w_j |d_{c,y} - d_{c,r_j}|   (3)

Thus, if (3) holds, then (1) also holds. To remove the absolute value from (3), let R = {r_1, ..., r_m}, R_1 = {r_j ∈ R | d_{c,r_j} \le d_{c,y}}, and R_2 = R − R_1 = {r_j ∈ R | d_{c,r_j} > d_{c,y}}. Replacing R_1 and R_2 in (3):

    d_{c,x} + \sum_{r_j ∈ R_1} w_j d_{c,r_j} + \sum_{r_j ∈ R_2} w_j d_{c,r_j} \le \sum_{r_j ∈ R_1} w_j (d_{c,y} - d_{c,r_j}) + \sum_{r_j ∈ R_2} w_j (d_{c,r_j} - d_{c,y})   (4)

    d_{c,x} \le d_{c,y} - 2 \left( \sum_{r_j ∈ R_2} w_j d_{c,y} + \sum_{r_j ∈ R_1} w_j d_{c,r_j} \right)   (5)

    d_{c,x} \le d_{c,y} - 2 \sum_{j=1}^{m} w_j \min(d_{c,y}, d_{c,r_j})   (6)
Equation (6) provides the strategy to retrieve the best N answers based on the match to f. The strategy works as follows. We find the nearest neighbors of c incrementally. Let x_1, ..., x_P be the objects seen so far. We determine the target M, 1 \le M \le P, such that D_{c,x_M} \le D_{c,x_P} - 2 \sum_{j=1}^{m} w_j \min(d_{c,x_P}, d_{c,r_j}). By (6), D_{f,x_M} \le D_{f,x_{P+k}}, k = 1, 2, .... Let α = max{D_{f,x_i} | i = 1, ..., M}. We then determine the set {x_i | i = 1, ..., P ∧ D_{f,x_i} \le α}. All such x_i are better matches to f than any object x_{P+k}, k = 1, 2, ..., and are hence returned. If N objects have not yet been returned, the process continues iteratively by retrieving the next closest object to c (i.e., x_{P+1}) and repeating the above algorithm. Notice that c can be any point. However, the optimal choice of c minimizes \sum_{j=1}^{m} w_j d_{c,r_j}; i.e., c should be the weighted centroid of r_1, ..., r_m. This approach does not require any change to the incremental nearest neighbor search algorithm associated with the original multidimensional data structure. However, it does not perform well when the query changes dramatically due to the relevance feedback process, since the starting centroid is optimal only for the original query.

Multiple Expansion Search (MES): In this approach, the N nearest neighbors for a feature node f are determined by iteratively retrieving the next nearest neighbors of each instance r_1, ..., r_m associated with f. Let R_j be the set of ranked results for the instance r_j, j = 1, ..., m; that is, for all x ∈ R_j and y ∉ R_j, d_{r_j,x} \le d_{r_j,y}. Furthermore, let α_j be the maximum distance between r_j and any object in R_j in the feature space; that is, α_j = max{d_{r_j,x} | x ∈ R_j}. R_j contains all objects that are in the range of α_j from r_j. Note that if y ∉ ∪_{j=1}^{m} R_j, then d_{r_j,y} > α_j for all j. So \sum_{j=1}^{m} w_j d_{r_j,y} > \sum_{j=1}^{m} w_j α_j, that is, D_{f,y} > \sum_{j=1}^{m} w_j α_j. As a result, y ∈ ∪_{j=1}^{m} R_j if D_{f,y} \le \sum_{j=1}^{m} w_j α_j. Note that if ∪_{j=1}^{m} R_j contains at least N objects x_1, ..., x_N such that for all x_k, D_{f,x_k} \le \sum_{j=1}^{m} w_j α_j, then it is guaranteed that the N best matches to the feature node f are contained in ∪_{j=1}^{m} R_j. Thus, in order to evaluate the best
N matches to f, MES incrementally evaluates the nearest neighbors of each of the instances r_1, ..., r_m, thereby increasing the value of at least one α_j in each step, j = 1, ..., m, until there are at least N objects within ∪_{j=1}^{m} R_j for which D_{f,x_k} \le \sum_{j=1}^{m} w_j α_j. Many different strategies can be used to expand the α_j's. The optimal strategy determines the α_j that minimize |∪_{j=1}^{m} R_j|, since then the least number of objects are explored to retrieve the best N objects based on the match to the feature. We try different strategies for determining the α_j's and compare them in Sect. 4.
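An illustrative sketch of the MES stopping test is given below; the per-instance iterators, the balanced one-neighbor-per-round expansion policy, and the assumption that the iterators never run dry before N answers are confirmed are ours.

```python
# Sketch of MES: expand each instance's neighborhood until the best N matches
# to the feature node are provably contained in the union of the R_j sets.

def mes_best_n(instance_iters, weights, distance_to_f, N):
    """instance_iters[j] yields (object, distance) pairs for r_j in increasing
    distance order (an incremental k-NN search on the index); distance_to_f(x)
    computes D_{f,x} = sum_j w_j * d_{r_j,x}."""
    alphas = [0.0] * len(instance_iters)
    seen = {}                                        # union of the R_j sets
    while True:
        for j, it in enumerate(instance_iters):      # balanced expansion step
            obj, dist = next(it)
            alphas[j] = dist                         # alpha_j grows monotonically
            seen[id(obj)] = obj
        bound = sum(w * a for w, a in zip(weights, alphas))
        ranked = sorted(seen.values(), key=distance_to_f)
        # once N objects fall within the weighted-alpha bound, the global best
        # N are guaranteed to be among the objects seen so far
        if len(ranked) >= N and distance_to_f(ranked[N - 1]) <= bound:
            return ranked[:N]
```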
3.2 Evaluating the Query Node
Given the best matching answers for each of the feature nodes f_1, ..., f_n, the objective in evaluating the query node is to combine the results to determine the best N objects for the overall query. That is, we need to determine the N objects with the least distance to the query, where the distance between an object x and the query is defined as D_{Q,x} = \sum_{i=1}^{n} w_i D_{f_i,x}, where \sum_{i=1}^{n} w_i = 1. The MES strategy discussed for feature node evaluation can also be used for this purpose and is hence not discussed any further.
4 Experiments
To explore the effectiveness of the algorithms, we performed experiments over a large image dataset (65,000 images) obtained from the Corel collection. The image features used to test the query processing are color histogram [14], color histogram layout [8], color moments [13], and co-occurrence texture [5]. Manhattan distance is used for the first two features and Euclidean distance is used for the last two features [8]. The purposes of this experiment are to compare the various approaches we proposed and to show that QEM can be implemented efficiently. The effectiveness is measured by the number of objects seen before the best N answers are found. A good approach should not need to explore too many objects to guarantee the best N answers, and it should not degrade significantly when multiple objects are added to the query. We performed experiments on CES and MES with various parameters. Specifically, CES searches from the centroid of the query point set. In MES, we explored 4 expansion options as follows. Single Search searches from only one of the query points. Balanced Search searches from all query points with equal ranges. Weighted Search searches from all query points with ranges proportional to the weights of the query points. Inverse Weighted Search searches from all query points with ranges proportional to the inverse of the weights of the query points. In the experiments, we do not use any index structure, in order to avoid hidden effects caused by a specific index structure. Instead, we simulate a k-nearest neighbor search by scanning the dataset and ranking the answers.
The experimental results show that the single search performs the worst. Intuitively, one may expect the weighted search to perform the best among the four approaches. However, surprisingly, even though the weights are not balanced, the balanced search performed better than any other search technique, including the centroid expansion search.
[Figure: (a) number of objects seen vs. best N retrieved, and (b) number of points in a query vs. the seen/best-N ratio, for the Centroid, Balanced, Weighted, 1/Weight, and Single search strategies.]
Fig. 2. Experimental Result
Figure 2 compares the different approaches and shows that the number of objects in the query representation has very little impact on the balanced search and the weighted search, which are the best searches. The reason is simply that the feature space is sparse and the multiple query points are close together, due to the query expansion model which selectively adds relevant query points and removes less relevant ones. The other approaches do not perform well since they may have seen the best answers but cannot guarantee that those answers are among the best ones unless they explore further.
5 Conclusions
Content-based multimedia retrieval and multidimensional indexing have been among the most active research areas in the past few years. The two research areas are closely related: the supporting index structure has a big impact on the efficiency of the retrieval. In this paper, we proposed algorithms to extend index structures to support complex queries efficiently in the MARS weighted summation retrieval model. We focussed on an efficient implementation to support the QEM proposed in [10]. QEM modifies the query by selectively adding new relevant objects to the query (as well as deleting old and less relevant objects). In contrast, QPM modifies the query by moving the query point in the direction of the relevant objects.
Our previous work showed that QEM outperforms QPM in retrieval effectiveness. This paper further illustrates that QEM can be efficiently implemented using multidimensional index structures. As a result, we believe that QEM is a viable approach for query refinement in multimedia content based retrieval.
References

[1] Jeffrey R. Bach, Charles Fuller, Amarnath Gupta, Arun Hampapur, Bradley Horowitz, Rich Humphrey, Ramesh Jain, and Chiao-fe Shu. The Virage image search engine: An open framework for image management. In SPIE Conf. on Vis. Commun. and Image Proc., 1996.
[2] S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: An index structure for high-dimensional data. In VLDB, 1996.
[3] Kaushik Chakrabarti and Sharad Mehrotra. High dimensional feature indexing using hybrid trees. In ICDE, 1999.
[4] M. Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Q. Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, and Peter Yanker. Query by image and video content: The QBIC system. IEEE Computer, Sep 1995.
[5] Robert M. Haralick, K. Shanmugam, and Its'hak Dinstein. Texture features for image classification. IEEE Trans. on Sys., Man, and Cyb., SMC-3(6), 1973.
[6] Thomas S. Huang, Sharad Mehrotra, and Kannan Ramchandran. Multimedia analysis and retrieval system (MARS) project. In Annual Clinic on Library Application of Data Processing - Digital Image Access and Retrieval, 1996.
[7] Yoshiharu Ishikawa, Ravishankar Subramanya, and Christos Faloutsos. MindReader: Querying databases through multiple examples. In VLDB, 1998.
[8] Michael Ortega, Yong Rui, Kaushik Chakrabarti, Sharad Mehrotra, and Thomas S. Huang. Supporting similarity queries in MARS. In ACM Multimedia, 1997.
[9] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. Int'l Journal of Computer Vision, 18(3), 1996.
[10] Kriengkrai Porkaew, Sharad Mehrotra, and Michael Ortega. Query reformulation for content based multimedia retrieval in MARS. In IEEE Int'l Conf. on Multimedia Computing and Systems, 1999.
[11] Yong Rui, Thomas S. Huang, and Sharad Mehrotra. Content-based image retrieval with relevance feedback in MARS. In IEEE Int'l Conf. on Image Proc., 1997.
[12] Yong Rui, Thomas S. Huang, Michael Ortega, and Sharad Mehrotra. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Trans. on Circuits and Video Technology, Sep 1998.
[13] Markus Stricker and Markus Orengo. Similarity of color images. In SPIE Conf. on Vis. Commun. and Image Proc., 1995.
[14] Michael Swain and Dana Ballard. Color indexing. Int'l Journal of Computer Vision, 7(1), 1991.
[15] D. White and R. Jain. Similarity indexing with the SS-tree. In ICDE, 1995.
Excluding Specified Colors from Image Queries Using a Multidimensional Query Space

Dimitrios Androutsos (1), Kostas N. Plataniotis (2), and Anastasios N. Venetsanopoulos (1)

1 University of Toronto, Department of Electrical & Computer Engineering, Digital Signal & Image Processing Lab, 10 King's College Road, Toronto, Ontario, M5S 3G4, CANADA
{zeus,anv}@dsp.toronto.edu, WWW: http://www.dsp.toronto.edu
2 Ryerson Polytechnic University, Department of Math, Physics & Computer Science, 350 Victoria Street, Toronto, Ontario, M5B 2K3, CANADA
[email protected]
Abstract. Retrieving images in a database based on user-specified colors is a popular low-level retrieval technique. However, the available systems today do not easily allow a user or a specified query to tag certain colors as unwanted, so that they are ultimately excluded from the query result. Specifically, color histogram techniques do not allow a direct approach to excluding colors and would require a separate query stage to filter out images containing unwanted colors. In this paper we present our vector-based scheme for image retrieval using a multidimensional query space which naturally accepts the exclusion of specified colors in the overall similarity measure.
1 Introduction
Color image retrieval has received increasing attention lately as the field of image database retrieval grows. Its importance stems from the fact that color is a low-level image feature which is essential to the early stages of human vision. Color is easily recalled and identified and is a natural attribute for describing objects and scenes. For these reasons, image retrieval researchers have been trying to find efficient and effective ways to retrieve color images from large databases using color in the query definition [1]. To this end, color indices are created using color histograms to capture the color representation of all the database images [2,3]. Using these indices, a user can retrieve images from the database by building a query that specifies certain colors which the retrieved images should contain, or by specifying an example image which the retrieved images should match. There are a number of image retrieval systems which employ these techniques and there is much ongoing research in the area [4,5].
However, these systems do not address a very important issue in color retrieval, namely color exclusion. It is important for users to be able to tag a certain color or group of colors as unwanted so that they do not appear in the retrieval results. With present systems, specifying which colors to exclude would require an additional stage to filter retrieved images and modify their ranking according to whether or not an exclusion color is present. In this paper we describe how our system addresses this issue by virtue of using a Multidimensional Query Space, which incorporates the exclusion of any unwanted colors directly into the image similarity measure, without requiring an extra processing stage.
2 System Description
Utilizing color histograms for indexing and retrieval has gained much popularity. However, there are inherent problems with this technique which reduce the flexibility and accuracy of the query process and results. In particular, color histograms capture global color activity. Attempts to include spatial information by image partitioning have had some success, but storage and computational requirements increase accordingly. In addition, the similarity metrics which are commonly accepted and utilized allow little flexibility and have no valid perceptual basis. We have developed a system which is color vector-based. We do not use histograms to build indices. Instead, we store representative RGB color vectors from extracted color regions, along with spatial color information, to build an index of smaller dimension and with more information than a simple color histogram. In this section, we give a brief overview of our system, specifically how the feature extraction is done via segmentation, and we also present the distance measure which we use to perform similarity matching.
2.1 Feature Extraction & Indexing
Our feature extraction is based on unsupervised recursive color segmentation. Specifically, we perform HSV-space segmentation while taking into consideration certain perceptual attributes of human color perception and recall. The HSV space classifies similar colors under similar hue orientations and thus provides a more natural grouping. In addition, it lends itself to fast and efficient automated segmentation, since it does not depend on variables such as seed pixels or the number of extracted colors, as clustering techniques do. The details of our segmentation technique can be found in [8]. However, it is important to note that we:

– extract bright colors first
– extract and classify white and black regions
– treat the saturation histogram as multi-modal instead of bi-modal
For each image we extract c colors, where c is an image-dependent quantity. We calculate the average color of each of the c colors and use that RGB value as the region's representative vector. These c colors, along with spatial information such as the size and location of each region, are used to build each image index.
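As an illustration, an index entry of this kind could be assembled from a segmentation label mask as follows; the exact fields stored by the system are not spelled out here, so relative size and centroid are our choice of spatial information.

```python
# Sketch: summarise each segmented region by its average colour plus simple
# spatial information, given an integer label mask from the segmentation.

import numpy as np

def build_index_entry(rgb_image, label_mask):
    entries = []
    h, w, _ = rgb_image.shape
    for label in np.unique(label_mask):
        region = label_mask == label
        pixels = rgb_image[region]
        entries.append({
            "representative_rgb": pixels.mean(axis=0),           # average region colour
            "size": float(region.sum()) / (h * w),                # relative region size
            "centroid": [float(c.mean()) for c in np.nonzero(region)],
        })
    return entries
```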
2.2 Similarity Measure
Since our color indices are actual 3-dimensional color vectors which span the RGB space, a number of vector distance measures can be implemented for retrieval. However, we implement a measure which is based on the angle of a color vector. Angular measures are chromaticity-based, which means that they operate primarily on the orientation of the color vector in the RGB space and are therefore more resistant to intensity changes; it has been found that they provide much more accurate retrieval results than other measures [7]. Specifically, our similarity measure is a perceptually-tuned combination of an angle part and a magnitude-difference part, defined as [9]:

    β(x_i, x_j) = \exp\left\{ -α \left( 1 - \left[ 1 - \frac{2}{π} \cos^{-1}\left( \frac{x_i \cdot x_j}{|x_i||x_j|} \right) \right] \left[ 1 - \frac{|x_i - x_j|}{\sqrt{3 \cdot 255^2}} \right] \right) \right\}   (1)

where the first bracketed term is the angle part and the second is the magnitude part, x_i and x_j are 3-dimensional color vectors, α is a design parameter, and 2/π and \sqrt{3 \cdot 255^2} are normalization factors.
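Equation (1) translates directly into code; the sketch below assumes 8-bit RGB vectors with non-zero magnitude, and the value of α is only a placeholder.

```python
# Sketch of the combined angle/magnitude similarity of (1).

import math

def beta(xi, xj, alpha=1.0):
    dot = sum(a * b for a, b in zip(xi, xj))
    norm_i = math.sqrt(sum(a * a for a in xi))
    norm_j = math.sqrt(sum(b * b for b in xj))
    # angle part: 1 - (2/pi) * acos of the normalized inner product
    angle = 1.0 - (2.0 / math.pi) * math.acos(min(1.0, dot / (norm_i * norm_j)))
    # magnitude part: 1 - Euclidean difference normalized by sqrt(3 * 255^2)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
    magnitude = 1.0 - diff / math.sqrt(3 * 255 ** 2)
    return math.exp(-alpha * (1.0 - angle * magnitude))

# e.g. beta((200, 7, 25), (255, 240, 20)) is small compared to
#      beta((200, 7, 25), (210, 20, 30))
```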
3 Image Query
During the query process, a similarity measure to each representative color vector in a given database index is calculated using (1) for each user-specified query color. For each query color, the minimum distance is kept, and a multidimensional measure is created which consists of the minimum distances of the query colors to the indexed representative vectors in the given index:

    D(d_1, ..., d_n) = I - \left( \min(β(q_1, i_1), ..., β(q_1, i_m)), ..., \min(β(q_n, i_1), ..., β(q_n, i_m)) \right)   (2)

where I is a vector of size n with all entries of value 1, q_1, ..., q_n are the n query colors, and i_1, ..., i_m are the m indexed representative color vectors for a given image.
3.1 Multidimensional Query Space
The vector D in (2) exists in a vector space defined by the similarity measure of the specified query colors to the indexed colors. The dimensionality of this space changes and is dependent on the number of query colors. We refer to this space as the multidimensional query space.
The database image that is the closest match to all the given query colors q_1, q_2, ..., q_n is the one which is closest to the origin of the multidimensional query space. Within this query space, there is a line on which all components of D are equal. We refer to this line as the equidistant line. A distance vector D that is most centrally located, i.e., is collinear with the equidistant line, and at the same time has the smallest magnitude, corresponds to the image which contains the best match to all the query colors, as depicted in Figure 1(a). For each query, each database image exists at a point in this multidimensional query space. Its location and relation to the origin and the equidistant line determine its retrieval ranking, which we quantify by taking a weighted sum of the magnitude of D and the angle ∠D of D to the equidistant line:

    R = w_1 |D| + w_2 ∠D   (3)
where lower rank values R imply images with a closer match to all the query colors. The weights w1 and w2 can be adjusted to control which of the two parameters, i.e., magnitude or angle, dominates. We have found that values of w1 = 0.8 and w2 = 0.2 give the most robust results. This is to be expected, since collinearity with the equidistant line does not necessarily imply a match with any query color; it only implies that each query color is equally close (or far) to the indexed colors. However, |D| → 0 implies closer matches to one or more of the query colors. Thus, a greater emphasis must be placed on the magnitude component.
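To make the ranking concrete, here is a small sketch that builds a per-image distance vector and evaluates the rank of (3); it treats 1 − β as the per-color distance (so small components mean good matches), reuses the beta_similarity sketch above, and uses the weights w1 = 0.8 and w2 = 0.2 from the text. The helper names are illustrative.

```python
import numpy as np

def distance_vector(query_colors, indexed_colors, alpha=2.0):
    """Per-image distance vector: one component per query color, each equal
    to 1 minus the best similarity (Eq. (1)) achieved by any indexed
    representative color of the image."""
    return np.array([
        1.0 - max(beta_similarity(q, i, alpha) for i in indexed_colors)
        for q in query_colors
    ])

def rank_value(D, w1=0.8, w2=0.2):
    """Rank of Eq. (3): weighted sum of |D| and the angle between D and the
    equidistant line (the diagonal on which all components are equal).
    Lower values correspond to better matches."""
    diagonal = np.ones_like(D) / np.sqrt(len(D))   # unit vector along the equidistant line
    norm_D = np.linalg.norm(D)
    if norm_D == 0.0:                              # perfect match to every query color
        return 0.0
    angle = np.arccos(np.clip(np.dot(D, diagonal) / norm_D, -1.0, 1.0))
    return w1 * norm_D + w2 * angle
```

Database images are then sorted in increasing order of rank_value; the smallest values correspond to the images that best contain all the query colors.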
Fig. 1. (a) Vector representation of 2 query colors q1 and q2, their multidimensional distance vector D, and the corresponding equidistant line. (b) The same 2 query colors, 1 exclusion color x1, and the resulting multidimensional distance vector ∆.
4 Color Exclusion
Our proposed vector approach provides a framework which easily accepts exclusion in the query process. It allows image queries to exclude any number of colors in addition to the colors to be included in the retrieval results. From the discussion in Section 3.1 above, we are interested in distance vectors D which are collinear with the equidistant line and which have small magnitude. The exclusion of a certain color should thus affect D accordingly, as well as its relation to the equidistant line and the origin. For example, if it is found that an image contains an indexed color which is close to an exclusion color, the distance between the two can be used to pull D closer to, or push it further from, the ideal and accordingly affect the retrieval ranking of the given image, as shown in Figure 1(b). To this end, we determine the minimum distances of each exclusion color with the indexed representative colors, using (1), to quantify how close the indexed colors are to the exclusion colors:
X(ξ1, ..., ξn) = (min(β(ξ1, i1), ..., β(ξ1, im)), ..., min(β(ξn, i1), ..., β(ξn, im))),   (4)
where ξ1, ..., ξn are the n exclusion colors and i1, ..., im are the m indexed representative colors of each database image. Equation (4) quantifies how similar any indexed colors are to the exclusion colors. To quantify dissimilarity, a transformation of each vector component of X is required, and then this is merged with D to give a new overall multidimensional distance vector:
∆ = [D, I − X],   (5)
where I is a vector of size n with all entries of value 1. The dimensionality of ∆ is equal to the number of query colors plus the number of exclusion colors. The final retrieval rankings are then determined from |∆| and the angle which D in (5) makes with the equidistant line of the query color space (i.e., the space without excluded colors).

We performed an example query on our database of 1850 natural images, both with and without exclusion. Figure 2(a) depicts the query result when the R,G,B colors (26, 153, 33) (green) and (200, 7, 25) (red) were desired and the color (255, 240, 20) (yellow) was excluded. It can be seen that images which contained colors close to yellow were removed from the top-ranking results, as compared to Figure 2(b), where yellow was not excluded. We further investigated these exclusion results by determining by how much the retrieval ranking of the images which contained yellow changed. A trained user was asked to look at the top 40 retrieval results for the query of red and green, and determine which of these images contained yellow. This first step resulted in a set of 25 images, which we refer to as X, that contained the exclusion color. The retrieval ranking of each of the images in X was then calculated when the same query also excluded yellow. It was found that none of the images in X remained among the top 40 retrieval results. Furthermore, their ranking decreased significantly, and all 25 images were now ranked among the bottom 27% of the entire 1850 image database, i.e., among the 500 least similar images.
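Continuing the earlier sketches, the fragment below folds exclusion into the same ranking: for each exclusion color it records the minimum distance (1 − β) to the indexed colors, appends the transformed components I − X to D as in (5), and measures the angle against the equidistant line of the query color space alone. How strongly the exclusion components are weighted relative to |D| is an assumption of this example, not a prescription from the paper.

```python
import numpy as np

def exclusion_distances(exclusion_colors, indexed_colors, alpha=2.0):
    """Eq. (4) read as minimum distances: for each exclusion color, 1 minus
    its best similarity to any indexed representative color of the image."""
    return np.array([
        1.0 - max(beta_similarity(x, i, alpha) for i in indexed_colors)
        for x in exclusion_colors
    ])

def rank_with_exclusion(query_colors, exclusion_colors, indexed_colors,
                        alpha=2.0, w1=0.8, w2=0.2):
    """Eq. (5): Delta = [D, I - X]; an image whose indexed colors come close
    to an excluded color acquires a large extra component and drops in rank."""
    D = distance_vector(query_colors, indexed_colors, alpha)
    X = exclusion_distances(exclusion_colors, indexed_colors, alpha)
    delta = np.concatenate([D, 1.0 - X])
    # The angle is still measured in the query color space only.
    diagonal = np.ones_like(D) / np.sqrt(len(D))
    norm_D = np.linalg.norm(D)
    angle = 0.0 if norm_D == 0.0 else float(
        np.arccos(np.clip(np.dot(D, diagonal) / norm_D, -1.0, 1.0)))
    return w1 * np.linalg.norm(delta) + w2 * angle
```

In this reading, an image that contains a color near an excluded one gets a component of I − X close to 1, which inflates |∆| and pushes the image down the ranking, matching the behavior reported in the experiment above.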
Fig. 2. Query result for images with (a) red & green, excluding yellow, and (b) red & green, not excluding yellow.
Eight images which contained red and green remained in the top 40 retrieval results, and 7 images had their ranking slightly decreased for containing colors that were perceptually close to yellow. The flexibility of this technique allows any number of colors to be excluded in a given color query, and it can also be incorporated in query-by-example, where a seed image is fed as a query. Furthermore, the amount by which X of (4) affects D can be varied by a simple weighting to tune the strictness of the exclusion.
5 Conclusions
We have shown how easily and effectively our system addresses the concept of color exclusion in a color image query. It is incorporated into the overall similarity calculation of each candidate image in a given query and does not require a post-processing stage to filter out images which contain a color to be
excluded. This is accomplished by virtue of the multidimensional query space which the distance measures of the query vectors span and their relation to the equidistant line. The similarity of specified exclusion colors to indexed database colors affects the overall ranking by effectively lowering the rank of a given image which contains a color that should be excluded. In our system, any number of colors can be excluded in a given query to provide greater flexibility in how a user query is defined, to ultimately retrieve more valid images from a given database.
References
1. V. N. Gudivada and V. V. Raghavan, "Content-based image retrieval systems," Computer 28, September 1995.
2. M. J. Swain and D. H. Ballard, "Color indexing," International Journal of Computer Vision 7(1), 1991.
3. M. Stricker and M. Orengo, "Similarity of color images," in Storage and Retrieval for Image and Video Databases III, Proc. SPIE 2420, pp. 381-392, 1995.
4. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, "The QBIC project: Querying images by content using color, texture and shape," in Storage and Retrieval for Image and Video Databases, M. H. Loew, ed., Proc. SPIE 1908, 1993.
5. J. R. Smith and S. F. Chang, "VisualSEEk: a fully automated content-based image query system," in ACM Multimedia Conference, November 1996.
6. X. Wan and C.-C. J. Kuo, "Color distribution analysis and quantization for image retrieval," in Storage and Retrieval for Image and Video Databases IV, Proc. SPIE 2670, pp. 8-16, 1995.
7. D. Androutsos, K. N. Plataniotis, and A. N. Venetsanopoulos, "Distance Measures for Color Image Retrieval," International Conference on Image Processing '98, Chicago, USA, October 1998.
8. D. Androutsos, K. N. Plataniotis, and A. N. Venetsanopoulos, "A Vector Angular Distance Measure for Indexing and Retrieval of Color," Storage & Retrieval for Image and Video Databases VII, San Jose, USA, January 26-29, 1998.
9. D. Androutsos, K. N. Plataniotis, and A. N. Venetsanopoulos, "A Perceptually Motivated Method for Indexing and Retrieval of Color Images," International Conference on Multimedia Computing Systems 1999, Florence, Italy, June 7-11, 1999. Submitted.
Generic Viewer Interaction Semantics for Dynamic Virtual Video Synthesis Craig A. Lindley1 and Anne-Marie Vercoustre2 1
CSIRO Mathematical and Information Sciences Locked Bag 17, North Ryde NSW 2113, Australia Phone: +61-2-9325-3150, Fax: +61-2-9325-3101 [email protected] 2
INRIA-Rocquencourt, France [email protected]
Abstract. The FRAMES project is developing a system for video database search, content-based retrieval, and virtual video program synthesis. For dynamic synthesis applications, a video program is specified at a high level using a virtual video prescription. The prescription is a document specifying the video structure, including specifications for generating associative chains of video components. Association specifications are sent to an association engine during video synthesis. User selection of a virtual video prescription together with the default behavior of the prescription interpreter and the association engine define a tree structured search of specifications, queries, and video data components. This tree structure supports generic user interaction functions that either modify the traversal path across this tree structure, or modify the actual tree structure dynamically during video synthesis.
Introduction
The FRAMES project is developing a system for video database search, content-based retrieval, and virtual video program synthesis. The FRAMES project has been carried out within the Cooperative Research Centre for Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program. Video components within the FRAMES database are described in terms of a multi-layered model of film semantics, derived from film semiotics. For dynamic video program synthesis applications, a program is specified at a high level using a virtual video prescription (Lindley and Vercoustre, 1998a). Coherent sequences of video are required, rather than just lists of material satisfying a common description. To meet this requirement, the FRAMES system uses an engine for generating associative chains of video sequences, initiated by an initial specification embedded within a virtual video prescription. Once a virtual video prescription has been
selected, the prescription interpreter and associated instruction processing functions can be allowed to generate a virtual video with no further interaction from the viewer. In this case the resulting presentation has the form of a traditional linear film or video. However, depending upon the viewer’s overall purpose, it may be desirable to steer the ongoing presentation in various ways. For example, the user may wish to steer the presentation towards subjects of interest and away from those of less interest, gain an overview of the area, go into detail, or follow a particular mood or emotion. This paper defines generic user interaction semantics for dynamic virtual video synthesis based upon the data structures and sequencing functions of the FRAMES system. The semantics provide run-time interactions for the viewers of a virtual video; the interactions do not result in any permanent changes to the data structures involved, but affect the way those data structures are used to generate a particular video presentation. We begin with a summary of FRAMES system users and user tasks, provide an overview of the FRAMES system, and summarise the processes that are used to select video components during the generation of a synthesised video sequence. The high level algorithm used within the FRAMES association engine is described, and is seen to define a tree-structured search through the available video components. User interaction semantics are then analysed in terms of generic user interaction strategies, the default data structure that models the selection action of the synthesis engine, and generic interaction operations that can be defined in terms of their effect upon the implied data structure.
FRAMES System Users and User Tasks

The FRAMES video synthesis process implies four different author/system user roles that may be involved in the production and use of a virtual video. Within the FRAMES system, video data is a primitive (atomic) data input, organised as a set of discrete video sequences. The video maker may use a variety of software tools and products to create these digital video clips. Interactive video systems that support interaction within a complete video program represent a new medium requiring customised development of video data. The FRAMES video synthesis engine operates upon descriptions associated with raw video data. Hence, once the video data is available, a description author must develop a descriptor set and associate descriptors with appropriate video data sequences. The FRAMES environment includes data modeling interfaces to support this authoring process. The interfaces and underlying database are based upon the semiotic model described by Lindley and Srinivasan (1998). Once the descriptions have been created, they are stored in the FRAMES database for use by the video synthesis engine.
The FRAMES system can be used with these semantic descriptions to provide basic semantic search and retrieval services, where a user can directly interrogate the database using relational parametric queries, or interrogate the database via the FRAMES association engine either to conduct fuzzy parametric searches, or to generate an associative chain of video components. However, for many users and applications a specific high level program structure may be required. Such a structure can be defined using a virtual video prescription. A prescription, defined by a virtual video prescription author, contains a sequence of embedded queries for generating the low level video content, where the particular order, form, and content of the queries implements a specific type, genre and style of video production. The final end user/viewer community is the audience for whom the virtual video production is created. Such a user will typically select a virtual video prescription according to their current tasks and needs, and use the FRAMES virtual video synthesis engine to generate a virtual video presentation. For dynamic virtual video synthesis, there are a number of ways and points in the process where viewer interaction is meaningful. All viewer interaction functions may be available to the authors of the interaction system, to provide feedback to authors about the appropriateness and effectiveness of descriptions and prescriptions as they are being developed. The authoring process for interactive virtual videos is highly complex, and requires careful coordination between the video makers, description authors, and prescription authors to ensure that these three levels of content are compatible and function correctly to produce coherent viewer sequences. Understanding the principles for doing this effectively is an important topic of ongoing research.
The FRAMES Video Synthesis System
The FRAMES system consists of three primary elements: a virtual video prescription interpreter, a database containing semantic descriptions of individual video components, and the instruction engines for generating sequences of video data. A virtual video prescription represents a high level structure of, or template for, a video program of a particular type, containing a list of instructions for generating a virtual video production (Lindley and Vercoustre, 1998a). The virtual video interpreter reads virtual video prescriptions. A user may select a prescription, which may have values assigned to various embedded parameters to reflect the particular requirements and interests of that user before being forwarded to the interpreter. The interpreter reads the instructions within a prescription sequentially, routing each instruction in turn to an appropriate processor. Three types of instructions may occur within a prescription: direct references to explicitly identified video components, parametric database queries, and specifications for generating an associative chain of video components (Lindley, 1998). Access by direct reference uses an explicit, hard-coded reference to a video data file plus start and end offsets of the required segment (eg. using the referencing syntax of SMIL, Hoschka 1998). Parametric database queries may
include complex logical conditions or descriptor patterns. In parametric search, the initial query may form a hard constraint upon the material that is returned, such that all of its conditions must be satisfied. Alternatively, a ranked parametric search can return a list of items ranked in decreasing order of match to the initial query, down to some specified threshold. Access by associative chaining is a less constrained way of accessing video data, where material may be incorporated on the basis of its degree of match to an initial search specification, and then incrementally to successive component descriptions in the associative chain. Associative chaining starts with specific parameters that are progressively substituted as the chain develops. At each step of associative chaining, the video component selected for presentation at the next step is the component having descriptors that most closely match the association specification when parameterised using values from the descriptors attached to the video segment presented at the current step. The high-level algorithm for associative chaining is:

1. Initialise the current state description according to the associative chaining specification. The current state description includes:
   • the specification of object, attribute, and entity types that will be matched in the chaining process,
   • current values for those types (including NULL values when initial values are not explicitly given or components of the next instantiation are NULL),
   • conditions and constraints upon the types and values of a condition, and
   • weights indicating the significance of particular statements in a specification.
2. Generate a ranked list of video sequences matching the current state description.
3. Replace the current state description with the most highly ranked matching description: this becomes the new current state description.
4. Output the associated video sequence identification for the new current state description to the media server.
5. If further matches can be made and the termination condition (specified as a play length, number of items, or associative weight threshold) is not yet satisfied, go back to step 2.
6. End.

Since association is conducted progressively against descriptors associated with each successive video component, paths may evolve significantly away from the content descriptions that match the initial specification. This algorithm (described in detail in Lindley and Vercoustre, 1998b) has been implemented in the current FRAMES demonstrator. Specific filmic structures and forms can be generated in FRAMES by using particular description structures, association criteria and constraints. In this way the sequencing mechanisms remain generic, with emphasis shifting to the authoring of metamodels, interpretations, and specifications for the creation of specific types of dynamic virtual video productions.
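As an illustration only, the following Python sketch shows the shape of this loop; the state representation, the match-scoring function, and the termination test are placeholders standing in for the FRAMES description model and are not the project's actual implementation.

```python
def associative_chain(spec, components, match_score, max_items=20, min_score=0.0):
    """Sketch of the associative chaining loop.

    spec        : initial association specification (types, values, constraints, weights)
    components  : list of (component_id, description) pairs from the database
    match_score : function(state_description, description) -> float, higher = better match
    Returns the ordered list of component ids sent for presentation.
    """
    state = dict(spec)                                   # step 1: initialise current state
    played = []
    while len(played) < max_items:                       # step 5: termination condition
        candidates = [(cid, desc) for cid, desc in components if cid not in played]
        if not candidates:
            break
        # step 2: rank candidates against the current state description
        best_cid, best_desc = max(candidates,
                                  key=lambda c: match_score(state, c[1]))
        if match_score(state, best_desc) < min_score:    # associative weight threshold
            break
        state.update(best_desc)                          # step 3: best match becomes new state
        played.append(best_cid)                          # step 4: output to the media server
    return played
```

Because the state is re-parameterised from the selected component at every step, a user interaction that substitutes a different ranked candidate, or edits the weights inside the specification, changes all subsequent selections, which is exactly the class of interventions discussed in the following sections.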
Generic Interaction Strategies

User interaction in the context of dynamic virtual video synthesis can take place at several levels, and in relation to several broad types of user task. Canter et al. (described in McAleese, 1989) distinguish five discernible strategies that users may use in moving through an information space:

1. scanning: covering a large area without depth
2. browsing: following a path until a goal is achieved
3. searching: striving to find an explicit goal
4. exploring: finding out the extent of the information given
5. wandering: purposeless and unstructured globetrotting
These strategies are all relevant to interaction with dynamic virtual video synthesis, and the interactive presentation system for virtual videos should support each strategy. To these five strategies we can also add:

6. viewing: allowing the algorithm to generate a video sequence without further direction from a user (ie. the viewer is passively watching a video)

Dynamic virtual video synthesis in the FRAMES project uses the viewing model as the default behavior of the system. That is, once a virtual video prescription has been selected, the synthesiser generates the video display based upon that prescription and the semantics defined by the underlying algorithms. The virtual video prescription may define a video program amounting to a scan, browse, search, exploration of, or wander through the underlying video database, depending upon the application-specific purpose of the prescription. To provide interactive viewing functions, suitable interfaces must be provided allowing viewers to modify the behavior of the video synthesis engine away from this default behavior, within the form defined by the original virtual video prescription.
User Interaction Semantics

A prescription can be customised for a particular user by setting its parameter values. Parametric search may be an exact search mechanism (eg. if a traditional relational database is used), or may involve a fuzzy search process that returns identifiers of video components having descriptors that approximately match the search query, ranked in decreasing order of match to the query. A video synthesis system incorporating ranked search can include interfaces allowing users to select from the ranked list of returned results. Associative chaining can be modified in several ways by user interactions, by using user interactions to effectively modify the chaining specification dynamically as chaining proceeds. Users can modify the entity types used to associate the current component with the next component, modify the current
entity values, set or reset constraints upon entity values, or modify the weightings upon entity types. Users can also interrupt the default selection of the most highly associated video component by selecting another ranked element as the current element, which will re-parameterise the associative chaining specification at the current point in the chain.
Fig. 1. The default video synthesis search tree: a selected prescription (with its instructions instr 1, instr 2, ...), the associative chain components C1, C2, ..., Cp generated by an instruction, and, for each selected component, the ranked components C1,1, C1,2, ..., C1,q that were not selected.
The semantics of these user interactions can be modeled by regarding the operation of the association engine as a tree search behaviour, as shown in Figure 1. In considering choices that can be made by users, it is useful to regard the starting point as the decision about which virtual prescription to execute, this being the root node of the search tree. Each prescription contains a list of instructions that constitute its child nodes. The algorithm that interprets prescriptions will execute each instruction in sequential order. An instruction (specifically, an instruction that is an association specification) generates a series of video components that are its child nodes in turn, each component being presented for display in the sequential order in which it is returned. Finally, for each selected video component in a series, there is a list of other candidate components that have not been selected, ranked in decreasing order of associative strength (to the previous component in the selected list); this ranked list may be considered to be a set of child nodes for a selected component. Hence the video synthesis process defines an ordered, depth-first traversal of the system data structures and the dynamically generated association structure of video components. The default behavior of the synthesis engine without user interaction satisfies the user interaction strategy identified above as viewing. However, to support scanning, browsing, searching, exploring, and wandering strategies, specific and generic interaction functions can be provided. These are divided into two classes. The first class, the interaction functions that determine the path taken by the user in traversing the default synthesis tree, amounts to functions that interrupt or modify the default depth-first traversal behavior of the algorithm. These functions include:

• control of whether the process should stop, loop back to some point (eg. as identified on a history list), or proceed to the next default item
• jump to a position on the tree other than the next position defined by the depth-first algorithm
• display a set of video components in parallel

The second class, the interaction functions that dynamically alter the structure of the default tree during video synthesis, are functions that effectively produce an alteration in the specification that is driving the generation of a virtual video production. This can include:

• functions that dynamically modify virtual video prescriptions (eg. changing the values of variables used within a prescription during execution)
• functions that dynamically modify queries prior to their execution, or as they are being executed. Examples include adding or removing descriptor types that associative matching is taking place against, and modifying the weightings attached to descriptor types.
Related Work

Interactive video covers a broad range of technologies and interests, including interactive video editing systems, model-based video image generation, and interactive search and browsing of video data in archives or databases. The FRAMES project is addressing the interactive use of predefined video sequences. Dynamic access to predefined video using content-based retrieval techniques has generally been based upon an information retrieval model in which data is generated in response to a single query (eg. the IBM QBIC system, http://wwwqbic.almaden.ibm.com/stage/index.html); sequencing from this perspective is a contextual task within which content-based retrieval may take place. The MOVI project has incorporated some automated video analysis techniques into an interactive video environment that then uses hard-coded links between video elements (see http://www.inrialpes.fr/movi/Demos/DemoPascal/videoclic.html). Unlike these approaches, FRAMES generates links between video sequences dynamically using an associative chaining approach similar to that of the Automatist storytelling system developed at MIT (Davenport and Murtaugh, 1995, and Murtaugh, 1996). The Automatist system uses simple keyword descriptors specified by authors and associated with relatively self-contained video segments. In Automatist, users can interact with the associative chaining process either by explicitly modifying the influence of specific keyword descriptors arranged around the periphery of the interface, or by selecting a less strongly associated video component to become the current displayed component determining the ongoing associative chain. The FRAMES system extends this associative chaining approach by using a highly structured semantic model (described in Lindley and Srinivasan, 1998), which allows greater discrimination on descriptor types, and more types of relationship between sequenced video components. Flexible and modifiable association specifications in FRAMES and the incorporation of direct references and parametric queries in high
level prescriptions create opportunities for interaction beyond the simple selection of keywords and ranked components.
Conclusion

This paper has presented an analysis of the underlying semantics of user interaction in the context of the FRAMES dynamic virtual video sequence synthesis algorithms. Ongoing research is addressing the presentation of interaction options to users, and the problem of disorientation within the unfolding interactive video.
References

Aigrain P., Zhang H., and Petkovic D. 1996 "Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review", Multimedia Tools and Applications 3, 179-202, Kluwer Academic Publishers, The Netherlands.
Davenport G. and Murtaugh M. 1995 "ConText: Towards the Evolving Documentary", Proceedings, ACM Multimedia, San Francisco, California, Nov. 5-11.
Hoschka P. (ed) 1998 "Synchronised Multimedia Integration Language (SMIL) 1.0 Specification", W3C Recommendation 15 June 1998.
Lindley C. A. 1998 "The FRAMES Processing Model for the Synthesis of Dynamic Virtual Video Sequences", Second International Workshop on Query Processing in Multimedia Information Systems (QPMIDS), August 26-27th 1998, in conjunction with 9th International Conference DEXA98, Vienna, Austria.
Lindley C. A. and Srinivasan U. 1998 "Query Semantics for Content-Based Retrieval of Video Data: An Empirical Investigation", Storage and Retrieval Issues in Image- and Multimedia Databases, August 24-28, in conjunction with 9th International Conference DEXA98, Vienna, Austria.
Lindley C. A. & Vercoustre A. M. 1998a "Intelligent Video Synthesis Using Virtual Video Prescriptions", Proceedings, International Conference on Computational Intelligence and Multimedia Applications, Churchill, Victoria, 9-11 Feb., 661-666.
Lindley C. A. & Vercoustre A. M. 1998b "A Specification Language for Dynamic Virtual Video Sequence Generation", International Symposium on Audio, Video, Image Processing and Intelligent Applications, 17-21 August, Baden-Baden, Germany.
McAleese R. 1989 "Navigation and Browsing in Hypertext" in Hypertext: Theory into Practice, R. McAleese ed., Ablex Publishing Corp., 6-44.
Murtaugh M. 1996 The Automatist Storytelling System, Masters Thesis, MIT Media Lab, Massachusetts Institute of Technology.
Category Oriented Analysis for Visual Data Mining H. Shiohara, Y. Iizuka, T. Maruyama, and S. Isobe NTT, Cyber Solutions Laboratories 1-1 Hikarinooka, Yokosuka-shi, Kanagawa, 239 JAPAN TEL: +81 468 59 3701, FAX: +81 468 59 2332 {shiohara,iizuka,maruyama,isobe}@dq.isl.ntt.co.jp
Abstract. Enterprises are now storing large amounts of data, and data warehousing and data mining are gaining a great deal of attention as means of identifying effective business strategies. Data mining extracts effective patterns and rules from data warehouses automatically. Although various approaches have been attempted, we focus on visual data mining support to harness the perceptual and cognitive capabilities of the human user. The proposed visual data mining support system visualizes data using the rules or information induced by data mining algorithms. It helps users to acquire information. Although existing systems can extract data characteristics only from the complete data set, this paper proposes a category oriented analysis approach that can detect the features of the data associated with one or more particular categories.
1 Introduction
The great evolution in computing power has enabled businesses to collect and store copious amounts of data. As competition between enterprises intensifies, it is more important for the business strategy to be based on real data. Data mining has thus attracted attention as a way of obtaining such knowledge. Data mining can extract rules from copious amounts of data or classify data by using algorithms established in the field of artificial intelligence. Although it is suitable for handling copious amounts of data, its algorithms are very difficult to use if the user is not familiar with data analysis. We developed a visual data mining support system that combines data mining algorithms with visualization for better usability. Because effective visualization is needed to help users to discover rules/patterns, the selection of the attribute(s) to be the visualization target is very important. In our system, attribute selection is performed automatically by utilizing data mining methods. The 3 main functions of our system are as follows.

1. extract data characteristics by applying data mining
2. select the effective attributes to be visualized based on the extracted characteristics
3. create visual representations effectively by mapping the selected attributes to parameters of a plot profile

The resulting visualization makes it easier for the user to put forward hypotheses. Existing systems apply data mining only to the complete data set, which can lead to the significant characteristics of partial data sets being overlooked. Accordingly, we add category analysis to the existing method to create a more powerful visual data mining support system. This article overviews the visual data mining support method in section 2, and section 3 introduces the new attribute selection method, the category oriented selection method. The remaining sections describe effective visualization offered by this method and some application examples.
2 Visual Data Mining Support Method
The visual data mining support method should help the user in discovering rules/patterns in voluminous data sets easily by combining powerful data mining algorithms with user-friendly visualization techniques. Human beings can grasp data trends well if the data are well visualized. Furthermore, when they have background knowledge, they can understand what the trends mean in the real world. If the data set is too large, it is not easy even to guess which attributes are correlated, which degrades analysis performance. Therefore, we focused on the selection of the attributes to be visualized and adapted data mining algorithms to support visual analysis.

2.1 Multi-dimensional Visualization System
As the visualization environment, we developed INFOVISER, a multi-dimensional visual analysis tool. INFOVISER transforms character-based data into graphical information and visualizes one record of data as one element such as a node or line. Each attribute of data is mapped to one parameter of a plot profile such as axes, size, color, shape, and so on. This yields multi-dimensional and powerful visualization that is impossible with ordinary bar or circular charts. The degree of multi-dimensionality that can be visualized at once is about 10, and business data have many more attributes. This makes it difficult to manually find the attributes that are correlated.

2.2 Visualized Attribute Selection Method
As described before, the key to visual data mining support is how to find the effective target to be visualized. Selection of a group of attributes that are strongly correlated can lead to the discovery of new rules/patterns. For this purpose, the support system extracts characteristics from data, and selects target attributes based on these characteristics. The user evaluates the visualization results and finds a clue to the next analysis step, or may change
the extraction process to generate another visualization result. This integration of machine and human analysis is the key to our proposal. Existing attribute selection methods use decision trees or correlation coefficients for extracting the data characteristics of all data. The decision tree method is effective especially when there is an analysis target attribute. The attributes listed in the tree created by the decision tree algorithm are selected as visualization targets in order of hierarchy from the root node. When there is no analysis target or no clue, a correlation coefficient matrix is effective. The pairs of attributes having higher correlation coefficients are selected as visualization targets. These methods can overcome the difficulty of setting hypotheses due to an excessive number of attributes. The system configuration is depicted in Fig. 1.
Fig. 1. System Configuration of Visual Data Mining Support System
3 Category Oriented Attribute Selection Method

3.1 Requirements
The following are examples of visualization results manually generated using INFOVISER's GUI.

sumo wrestler: a high-rank wrestler has a good balance of height and weight (Fig. 2).
medical checkup 1: a heavy drinker has a high γGTP value and a high blood sugar level (Fig. 3).
Fig. 2. sumo wrestler
Fig. 3. medical checkup 1
Fig. 4. medical checkup 2
medical checkup 2: in the case of a light drinker, the person's obesity is inversely proportional to how much he smokes (Fig. 4).

In these examples, the attributes that seem to influence the attribute that attracted the user's interest, like rank of wrestler and obesity, are visualized using particular patterns. However, these attributes were not selected by the existing selection method. Fig. 5 depicts the decision tree for medical checkup data whose target is "Drinking". Because attributes like γGTP, blood sugar, smoking, or obesity do not rank highly in the tree, they are not visualized. Even the correlation coefficient method did not select these attributes because of their small absolute values (Table 1).

Fig. 5. Result of Decision Tree:
TREE Root
|--TREE NODE 1
|  if Sex in { 2 }
|  |--IF Systolic Pressure <= 157
|  |  THEN Drinking = 2.375
|  +--IF Systolic Pressure > 157
|     THEN Drinking = 1
+--if Sex not in { 2 }
   |--if Cholesterol <= 222.5
   |  |--if Diastolic Pressure <= 68
   |  |  |--IF Age <= 45
   |  |  |  THEN Drinking = 3
   |  |  +--IF Age > 45
   |  |     THEN Drinking = 1
   |  +--IF Diastolic Pressure > 68
   |     THEN Drinking = 3.04762
   +--IF Cholesterol > 222.5
      THEN Drinking = 3.95238

The reason seems to be that these characteristics are seen only as part of the data and are lost when extraction is applied uniformly to the whole set. Human beings can detect distinguishing patterns in the visualized figures by their pattern recognition ability and induce rules from their background knowledge. To select attributes like those in the above examples automatically, it is necessary to detect data subsets that include prominent patterns. In order to achieve this, we propose to combine data categorization with the attribute selection process. That is, grouping data according to some guideline, characterizing each data group, and extracting attributes that have remarkable patterns compared to the complete data set. This should satisfy the following requirements.
1. able to reflect the user's interest in data categorization
2. able to evaluate partial characteristics for attribute selection

3.2 Data Categorization
The direct way to reflect the user's interest is to make the analysis target attribute a guideline for categorization. The attributes that have special values within a certain category can be considered as those that are correlated to the user's interest. As for the user's interest, there are several cases:

· there is only one target attribute
· there are multiple target attributes
· there is no target attribute (the target attribute is not clear)

With just one analysis target attribute, categorization is achieved by dividing the data into groups by setting discrimination levels or simply into an equal number of groups. When there are multiple target attributes, such a naive method is not desirable; categorization that takes account of the correlation of the attributes, like multi-dimensional clustering, is more suitable. How then can we categorize data if the user does not select any target attributes? In this case, we first use factor analysis (a statistical technique) to divide the data attributes into target attributes and explanation attributes (or dependent variables and independent variables).

Table 1. Correlation Coefficient
Attribute              Correlation Coefficient
Age                     0.47
Height                  0.32
Weight                  0.07
Systolic pressure      -1.37
Diastolic pressure      0.18
Cholesterol             0.16
Blood sugar             0.13
GOT                     0.21
GPT                     0.20
γGTP                    0.24
Obesity degree          0.20
Smoke                   0.47

3.3 Attribute Selection
This section discusses how to evaluate, for attribute selection, the characteristics extracted from each category. As for the correlation coefficient method, a pair of attributes that shows a different trend from the remaining data is taken as the characteristic, rather than just considering high values within one category. That is, a pair of attributes that has a low correlation value in the whole data set may show high correlation in a certain category, or show an inverse correlation. We formalize these behaviors into the following expression to evaluate how characteristic the category is:

f1(rA, rp, np) = (1 − 1/√np) · rp · (rp − rA),

where np is the number of records in the partial data set, rA is the correlation over all data, and rp is the correlation over the partial data set. In the same way, data distribution can be used for characterizing categories, that is, locating attributes whose distribution is much different from that of the whole data set using basic statistical values such as the average and standard deviation. We use the following expression to compare and evaluate the strength of the characteristics created by the categories and attributes:

f2(np, mp, sp, mA) = (1 − 1/√np) · (mp − mA) / sp,

where np is the number of records in the partial data set, mp is the average of the partial data set, mA is the average of all data, and sp is the standard deviation of the partial data. Attributes are selected in order of the evaluation of characteristics.
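A minimal sketch of the two evaluation scores, with variable names mirroring the formulas; applying them over every category and attribute pair, and the use of NumPy, are assumptions of the example rather than the paper's implementation.

```python
import numpy as np

def f1_score(r_all, r_part, n_part):
    """f1: how strongly the correlation of an attribute pair inside one
    category deviates from its correlation over the whole data set,
    damped for small categories by the (1 - 1/sqrt(n_p)) factor."""
    return (1.0 - 1.0 / np.sqrt(n_part)) * r_part * (r_part - r_all)

def f2_score(n_part, mean_part, std_part, mean_all):
    """f2: how far the category mean of an attribute lies from the overall
    mean, measured in units of the category's standard deviation."""
    return (1.0 - 1.0 / np.sqrt(n_part)) * (mean_part - mean_all) / std_part
```

Attribute pairs (scored with f1) and single attributes (scored with f2) can then be sorted by the magnitude of these values within each category, and the top-ranked ones become the visualization targets.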
4 Visualization Method

4.1 Scatter Chart Representation

In this category oriented analysis method, the visualization target attributes are classified into two groups that are mapped as follows.
Fig. 6. Visualization Result 1
Fig. 7. Visualization Result 2
· attributes that categorize data (targeted attributes) → color, shape
· attributes that characterize data (explanative attributes) → X-axis, Y-axis

In this method, the category from which the characteristics are extracted is significant information, as are the attributes themselves, so we represent the categorizing attributes by color, the category by shape, and the extracted attributes as the X-axis and Y-axis.

4.2 Aggregation Representation
When a large amount of data is visualized, figures overlap because the display area is not infinite. This overlapping causes not only visual confusion but also a loss of information. In order to avoid this problem, the number of figures visualized on screen is reduced. One way is to thin the number of data records, but it is possible that the remaining data do not retain the original trend. Another way is to summarize neighboring data, and this is more desirable from the viewpoint of analysis. It is common to combine figures that are close in ordinary scatter charts. In INFOVISER, however, profiles such as color, shape, and size have meaning. So, if these profiles are ignored and only position is considered when figures are summarized, the data profile information is lost and cannot be evaluated. Therefore, we suppose a virtual space wherein all profiles are treated equally, and summarize by distance in this space. This enables visualization with fewer figures without losing the significance of the trends of all data.
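One way to realize this aggregation is a simple greedy grouping by distance in the joint profile space; the assumption that all profile dimensions are already scaled to comparable ranges, and the threshold value, are illustrative choices, not the INFOVISER implementation.

```python
import numpy as np

def aggregate_records(profiles, threshold=0.1):
    """Greedy aggregation sketch.

    profiles  : (n_records, n_profile_dims) array in which position, color,
                shape, size, ... have already been scaled to comparable ranges.
    threshold : records closer than this in the joint profile space are merged.
    Returns the representative profiles (running centroids) and the number of
    original records each one stands for.
    """
    representatives, counts = [], []
    for p in np.asarray(profiles, dtype=float):
        for k, rep in enumerate(representatives):
            if np.linalg.norm(p - rep) < threshold:
                counts[k] += 1
                representatives[k] = rep + (p - rep) / counts[k]   # update centroid
                break
        else:
            representatives.append(p.copy())
            counts.append(1)
    return np.array(representatives), np.array(counts)
```

The returned counts can be mapped to figure size, so the aggregated display still conveys how many records each drawn figure represents.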
5 Application Examples
This section shows the results of applying the category oriented methods and the visualization method to test data consisting of medical checkup data (3600 records with 24 attributes). In this article, only the cases of one target attribute and of no target attribute are shown.
Case of one target attribute. We selected "smoking" as the target attribute. The result of categorization by equal division of the value into 6 groups was evaluated using equations f1 and f2. The following features were extracted.

Correlation coefficient matrix results:
· as the smoking rate increases, the correlation of drinking and uric acid becomes stronger.
· as the smoking rate increases, the inverse correlation of obesity and HDL cholesterol (good cholesterol) increases.

Basic statistics results:
· heavy smokers have lower HDL cholesterol and more neutral fat.
· light smokers have a lower γGTP value (which indicates healthy hepatic function).

Medically it is said that HDL cholesterol decreases as the smoking rate increases, and that uric acid increases as the rate of drinking increases. In this result, when the degree of smoking or obesity is large, this phenomenon is seen strongly. Figs. 6 and 7 show visualization results (count as size, heavier smoking as darker color, and category as shape).

Case of no target attribute. The top 4 results of factor analysis of the test data are shown in Table 2. By performing categorization using multi-dimensional clustering, and using attributes highly correlated with the first factor, the following features were extracted.

Correlation coefficient matrix results:
· in the highest factor scoring group (cluster 6), the correlation of total cholesterol and GPT (hepatic index) is high, and meal frequency is inversely proportional to the drinking rate.
· in the high factor scoring group (cluster 1), the rate of eating before sleep is inversely correlated to meal frequency.

Basic statistics results:
· in the highest factor scoring group (cluster 6), heavy drinking is common, the smoking rate is high, and exercise and sleeping hours are small.
· in a high factor scoring group (cluster 1), the smoking rate is high.

In this case, the height of the factor score is interpreted as an index of poor health.
6 Discussion
We ascertained that effective attributes and visualization results were obtained by applying the proposed method to test data with enough records. A shortcoming is that the visualization result may not be very comprehensible even if the numerical value is significant. That is due to the relatively low value of the correlation coefficient or a small difference in distribution. The existing method can generate very plain visualizations. One of the examples generated by the
existing method indicates that systolic pressure and diastolic pressure are almost proportional and both are highly correlated with obesity, as shown in Fig. 8. However, the proposed method can identify conspicuous characteristics.
7 Conclusion
This article proposed a category oriented analysis method that can detect the strength of the characteristics of different categories, and confirmed that it effectively supports visual data mining. In the future, we will examine a user interpretation support function and other characterization methods.
User Interaction in Region-Based Color Image Segmentation Nicolaos Ikonomakis1, Kostas N. Plataniotis2 , and Anastasios N. Venetsanopoulos1 1
Department of Electrical & Computer Engineering Digital Signal & Image Processing Lab University of Toronto 10 King’s College Road, Toronto, Ontario, M5S 3G4, Canada {minoas,anv}@dsp.toronto.edu WWW:http://www.dsp.toronto.edu 2 School of Computer Science Ryerson Polytechnic University 350 Victoria Street, Toronto, Ontario, M5B 2K3, Canada [email protected]
Abstract. An interactive color image segmentation technique is presented for use in applications where the segmented regions correspond to meaningful objects, such as for image retrieval. The proposed technique utilizes the perceptual HSI (hue, saturation, intensity) color space. The scheme incorporates user interaction so that the best possible results can be achieved. Interaction with the user allows the segmentation algorithm to start efficiently and to refine the results. Interaction is performed for the key images (usually the first image or those where new objects enter the scene) of the video-telephony sequence. The user is allowed to identify on the screen the relevant regions by marking their seeds. The user guidance can be given by the sender, by the receiver, or by both. The effectiveness of the algorithm is found to be much improved over techniques used in the past.
1 Introduction
Image segmentation refers to partitioning an image into different regions that are homogeneous or "similar" in some image characteristic. It is an important first task of any image analysis process, because all subsequent tasks, such as feature extraction and object recognition, rely heavily on the quality of the segmentation. Image segmentation has taken a central place in numerous applications, including, but not limited to, multimedia databases, color image and video transmission over the Internet, digital broadcasting, interactive TV, video-on-demand, computer-based training, distance education, video-conferencing and
tele-medicine. For some specific applications, it is possible to find an automated analysis process that performs segmentation as desired. This may be the case of a surveillance application, where the automated analysis system provides the video encoder detailed information about the object of importance, allowing the selective coding of the scene. A simple, real-time, and automated analysis process based on the detection of moving objects may be used. However, for many applications (multimedia databases, web-based search engines, video-conferencing, tele-medicine, etc.), fully automated analysis schemes provide only part of the desired analysis results [1,2]. The performance of an automatic segmentation scheme on complex video scenes (i.e. lots of background objects) may not produce the desired results. For these applications, user interaction is imperative so that the achieved results can have a meaningful and powerful semantic value. For this reason, more recent research is given to interactive and “human in the loop” systems [1,2,3,4,5]. The QBIC (Query By Image Content) team [5] uses interactive region segmentation for image retrieval purposes. In this paper, an interactive color image segmentation scheme is proposed that employs the perceptual HSI (Hue, Saturation, Intensity) color space to segment color images. More specifically, the proposed scheme is developed for implementation in applications where the segmented regions should correspond to meaningful objects, such as image retrieval or video-telephony type sequences. The region-based segmentation scheme first employs an initial user interaction seed determination technique to find seed pixels to be used in a region growing algorithm. Initial user interaction also includes selecting several values for threshold parameters used in the region growing algorithm. Following the automatic growing of the regions a supervised region merging algorithm is employed to refine the results of the segmentation. The next section explains the segmentation scheme. This is followed by the results and conclusions.
2 Color Image Segmentation
The segmentation scheme presented utilizes the HSI color space, and thus the color values of the pixel are first converted from the standard RGB (red, green, blue) color values to the HSI color values using well known transformation formulas [6]. The scheme can be split into four general steps:

1. The pixels in the image are classified as chromatic or achromatic pixels by examining their HSI color values.
2. The user classifies seed pixels in the image.
3. The region growing algorithm is employed on the chromatic and achromatic pixels separately, starting from the seed pixels.
4. Regions are merged through user interaction.

The region growing algorithm has been presented in the past [7], but with arbitrary unsupervised seed determination. The automatic method gave good
results but still needed improvement. Automatic seed determination is one of the most difficult problems in color image segmentation [6]. A good seed pixel is the pixel with the most dominant color and is usually the center pixel of the region. Thus, to determine them, an initial segmentation of the image is needed to find the regions. The new seed determination method presented in this paper constitutes an initial user interaction process. Because the human visual system can segment an image automatically with little or no hesitation [8], the user can achieve the initial segmentation and, thus, determine the best starting pixels. Each step in the above generalization is explained in the following sections. Due to limitations in space, only experimental results for the Claire video-telephony type image will be discussed.

2.1 Chromatic/Achromatic Separation of Pixels
The HSI color model corresponds closely to the human perception of color [6]. The hue value of the pixel has the greatest discrimination power among the three values because it is independent of any intensity attribute. Even though hue is the most useful attribute, there are two problems in using this color value: hue is meaningless when the intensity is very low or very high; and hue is unstable when the saturation is very low [6]. Because of these attributes, in the proposed scheme, the image is first divided into chromatic and achromatic regions by defining effective ranges of hue, saturation, and intensity values. Since the hue value of a pixel is meaningless when the intensity is very low or very high, the achromatic pixels in the image are defined as the pixels that have low or high intensity values. Pixels can also be categorized as achromatic if their saturation value is very low, since hue is unstable for low saturation values. From the concepts discussed above, the achromatic pixels in the HSI color space are defined as follows:

achromatic pixels: (intensity > 90) or (intensity < 10) or (saturation < 10),   (1)
where the saturation and intensity values are normalized from 0 to 100. Only the intensity values of the achromatic pixels are considered when segmenting the achromatic pixels into regions. Pixels that do not fall into this category are categorized as chromatic pixels. For chromatic pixels all three HSI color values are considered in the algorithm.
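A minimal sketch of the test in (1); the conversion from RGB to HSI is assumed to have been done elsewhere, and the function below only applies the thresholds, with saturation and intensity already normalized to 0-100.

```python
def is_achromatic(hsi_pixel):
    """Eq. (1): a pixel is achromatic when its intensity is very high or very
    low, or when its saturation is very low (hue is unreliable there).

    hsi_pixel : (hue, saturation, intensity), with saturation and intensity
                normalized to the range 0..100.
    """
    _, saturation, intensity = hsi_pixel
    return intensity > 90 or intensity < 10 or saturation < 10

def split_pixels(hsi_image):
    """Partition pixel coordinates of an HSI image (a 2-D grid of HSI triples)
    into achromatic and chromatic sets, to be segmented separately."""
    achromatic, chromatic = [], []
    for row, line in enumerate(hsi_image):
        for col, pixel in enumerate(line):
            (achromatic if is_achromatic(pixel) else chromatic).append((row, col))
    return achromatic, chromatic
```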
2.2 Seed Determination
The region growing algorithm starts with a set of seed pixels and from these grows regions by appending to each seed pixel those neighboring pixels that satisfy a certain homogeneity criterion, which will be described later. An interactive algorithm is used to find the ’best’ seed pixels in the image. The ’best’ seed pixels are defined as the pixels that are in the center of their respective object. A benefit of having an initial user interaction is that different levels of
segmentation can be employed. More specifically, the user can choose two different sets of seed pixels: one for regions that require a low degree of segmentation (i.e. facial areas), and one for regions that require a high degree of segmentation (i.e. background, clothing, etc.). This is beneficial, because areas where some detail is important can be segmented separately from areas where detail is not essential. The user is first asked to select low level seed pixels from the original image. After this, the user is asked to select the high level seeds. It was found that having only two levels makes it quicker and easier for the user to choose the seed pixels. For classical unsupervised seed determination algorithms, this two level seed determination option is not an easy benefit to implement.

2.3 Region Growing
The region growing algorithm starts with the set of seed pixels determined by the user and from these grows regions by appending to each seed pixel those neighboring pixels that satisfy a certain homogeneity criterion. The algorithm is summarized in Figure 1. The first seed pixel is compared to its 8-connected neighbors: the eight neighbors of the seed pixel. Any of the neighboring pixels that satisfy a homogeneity criterion are assigned to the first region. This neighbor comparison step is repeated for every new pixel assigned to the first region until the region is completely bounded by the edge of the image or by pixels that do not satisfy the criterion. The color of each pixel in the first region is changed to the average color of all the pixels in the region. The process is repeated for the next and each of the remaining seed pixels. If all of the seed pixels are grown and there are still unassigned pixels in the image, further region growing segmentation is employed by considering random seed pixels amongst the unassigned pixels. For the chromatic pixels, the homogeneity criterion is that if the value of the distance metric comparing the unassigned pixel and the seed pixel is less than a threshold value Tchrom, then the pixel is assigned to the region. Varying the value of Tchrom controls the degree of segmentation, with a low value resulting in an over-segmented image and a high value in an under-segmented image. With the two different levels of seeds found, two different values of Tchrom are used. The seeds that require a low degree of segmentation have a threshold lower than the seeds that require a high degree of segmentation. Besides setting the seed pixels, the user initially sets the values of the low and high level thresholds. If the region growing segmentation results with these thresholds are not perfect, the user can refine the threshold values during the segmentation process until optimal results are obtained. The distance measure used for the chromatic pixels will be referred to as the cylindrical metric. It computes the distance between the projections of the pixel points on a chromatic plane. It is defined as follows [9]:

dcyl(s, i) = √( (dI)² + (dC)² ),   (2)

with

dI = |Is − Ii|   (3)
Fig. 1. The region growing algorithm.

and

$d_C = \left[ (S_s)^2 + (S_i)^2 - 2 S_s S_i \cos\theta \right]^{1/2}$,   (4)

where

$\theta = \begin{cases} |H_s - H_i| & \text{if } |H_s - H_i| < 180^{\circ}, \\ 360^{\circ} - |H_s - H_i| & \text{if } |H_s - H_i| > 180^{\circ}. \end{cases}$   (5)
The value of $d_C$ is the distance between the 2-dimensional (hue and saturation) vectors, on the chromatic plane, of the seed pixel and the pixel under consideration, as shown in Figure 2. Hence, $d_C$ combines both the hue and saturation (chromatic) components of the color. An examination of the metric equation (Equation (2)) shows that it can be considered a form of the popular Euclidean distance (L2 norm) metric. A pixel is assigned to a region if the value of the metric $d_{cyl}$ is less than the threshold $T_{chrom}$. In the case of the achromatic pixels, the homogeneity criterion used is that a pixel is assigned to the seed pixel's region if the difference in intensity values between the unassigned pixel and the seed pixel is less than a threshold value $T_{achrom}$. That is, if

$|I_s - I_i| < T_{achrom}$,   (6)

then pixel i is assigned to the region of seed pixel s.
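As an illustration of the two homogeneity tests above, the following sketch (in Python, with illustrative function names; pixels are assumed to be given as (H, S, I) tuples with hue in degrees, and the thresholds are supplied by the user) evaluates the cylindrical metric of Eqs. (2)-(5) and the achromatic test of Eq. (6):

import math

def cylindrical_distance(seed, pixel):
    # Cylindrical metric of Eqs. (2)-(5); seed and pixel are (H, S, I) tuples.
    h_s, s_s, i_s = seed
    h_p, s_p, i_p = pixel
    d_i = abs(i_s - i_p)                                  # Eq. (3)
    dh = abs(h_s - h_p)
    theta = dh if dh < 180.0 else 360.0 - dh              # Eq. (5)
    d_c = math.sqrt(s_s ** 2 + s_p ** 2
                    - 2.0 * s_s * s_p * math.cos(math.radians(theta)))  # Eq. (4)
    return math.sqrt(d_i ** 2 + d_c ** 2)                 # Eq. (2)

def is_homogeneous(seed, pixel, chromatic, t_chrom, t_achrom):
    # Chromatic pixels use the cylindrical metric; achromatic pixels use Eq. (6).
    if chromatic:
        return cylindrical_distance(seed, pixel) < t_chrom
    return abs(seed[2] - pixel[2]) < t_achrom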
2.4 Region Merging
Following the region growing algorithm for image segmentation, a user refinement process is employed. The user supervises the results of the region growing
algorithm, correcting undesired deviations when needed. The user does this by indicating which regions in the segmented image should be merged into one, which further reduces the number of regions. The user specifies how many regions are to be merged and then specifies the regions. The color of the merged regions is then changed to the average color of those regions, defined as

$\text{average color} = \dfrac{\sum_{k=1}^{r} c_k N_k}{\sum_{k=1}^{r} N_k}$   (7)

where r, $c_k$, and $N_k$ are the number of regions being merged, the color vector of region k, and the number of pixels in region k, respectively. This ensures that if a small region is merged with a larger one, the average color is heavily weighted toward the larger region.

Fig. 2. The chromatic plane of the HSI color model.
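As a small illustration of the merging rule in Eq. (7), the merged color can be computed as a pixel-count-weighted mean of the region colors (a sketch only; region colors are assumed to be stored as equal-length vectors):

def merged_color(regions):
    # regions: list of (color_vector, pixel_count) pairs; returns the average color of Eq. (7).
    total_pixels = sum(count for _, count in regions)
    dim = len(regions[0][0])
    return tuple(sum(color[d] * count for color, count in regions) / total_pixels
                 for d in range(dim))

# Merging a small bright region into a large dark one is dominated by the larger region:
# merged_color([((200, 200, 200), 50), ((50, 50, 50), 5000)])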
3 Results and Conclusions
The performance of the interactive color image segmentation scheme was tested with the video-telephony type image Claire. For this type of image, segmentation can be used for the coding and/or compression of the sequences before transmission. The original Claire image is displayed in Figure 3(a). Interaction with the user allows the automatic analysis process to start efficiently and to refine the produced results. Interaction is performed for the key images (usually the first image or those where new objects enter the scene) of the video-telephony sequence. The user is allowed to identify the relevant regions on the screen by marking their seeds. The user guidance can be given by the sender, by the receiver, or by both. Considering the achromatic pixels, it was found that the best results were obtained with a $T_{achrom}$ threshold value of 15, which is 15% of the maximum achromatic distance. This was the value used throughout the experimental analysis. Figure 3(b) shows an example of the seed pixels a user may choose for the Claire image. The white pixels refer to the high threshold level seed pixels and the black pixels refer to the low threshold level seed pixels.
Fig. 3. Claire image. (a) Original; (b) image showing seeds (white: high level, black: low level); (c) segmented image before interactive merging; (d) classical segmented image with random seed determination; (e) segmented image after interactive merging.
Figure 3(c) shows the segmented image, before merging, of Claire with $T_{chrom,low}$ = 10 (5% of the maximum distance) and $T_{chrom,high}$ = 30 (15% of the maximum distance). It was found that these threshold values give the best results for a varied set of images. As can be seen, the facial areas, which contain a lot of detail, are grown through several low-level seeds. The background, the jacket, and the hair in the Claire image are segmented using the high-level seeds. User refinement is incorporated to merge similar regions together. Figure 3(e) shows the segmented image after merging; as can be seen, it has fewer regions. Figure 3(d) shows the segmented image of Claire obtained with the classical method of segmentation with random seed determination and no merging. These results are clearly not as good as those obtained with the interactive segmentation scheme. A color image segmentation scheme using the HSI color space was used to segment video-telephony type images. It incorporated an interactive seed determination algorithm. The technique was found to be robust and relatively computationally inexpensive. The effectiveness of the algorithm was found to be much improved over the technique used in the past [7]. Other than video-telephony type images, the proposed interactive scheme can be employed in other applications, including web-based search engines, multimedia databases, digital broadcasting, interactive TV, video-on-demand, computer-based training, distance education, and tele-medicine.
References
1. Soltanian, Z.H., Windham, J.P., Robbins, L.: Semi-supervised segmentation of MRI stroke studies. Proc. of the SPIE, vol. 3034 (1997) 437-448
2. Olabarriaga, S.D., Smeulders, W.M.: Setting the Mind for Intelligent Interactive Segmentation: Overview, Requirements, and Framework. Proc. of the 15th Int. Conf. on Information Processing in Medical Imaging (1997) 417-422
3. Castagno, R., Ebrahimi, T., Kunt, M.: Video Segmentation Based on Multiple Features for Interactive Multimedia Applications. IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 5 (Sept 1998) 562-571
4. Correia, P., Pereira, F.: User Interaction in Content-based Video Coding and Indexing. EUSIPCO III (1998) 2293-2296
5. Niblack, W., Barber, R., et al.: The QBIC project: Querying images by content using color, texture and shape. Proc. SPIE Storage and Retrieval for Image and Video Databases (1996)
6. Gonzales, R.C., Wood, R.E.: Digital Image Processing. Addison-Wesley, Massachusetts (1992)
7. Ikonomakis, N., Plataniotis, K.N., Venetsanopoulos, A.N.: A Region-based Color Image Segmentation Scheme. Proc. of the SPIE: Visual Communications and Image Processing, vol. 3653 (1999) 1202-1209
8. Marr, D.: Vision. Freeman, San Francisco, California (1982)
9. Tseng, D.C., Chang, C.H.: Color Segmentation Using Perceptual Attributes. Proc. 11th Int. Conf. on Pattern Recognition, III: Conf. C (1992) 228-231
Using a Relevance Feedback Mechanism to Improve Content-Based Image Retrieval G. Ciocca and R. Schettini Istituto Tecnologie Informatiche Multimediali Consiglio Nazionale delle Ricerche Via Ampere 56, 20131 Milano, Italy {ciocca,centaura}@itim.mi.cnr.it
Abstract. The paper describes a new relevance feedback mechanism that evaluates the distribution of the features of images judged relevant or not relevant by the user, and dynamically updates both the similarity measure and the query in order to accurately represent the user's particular information needs. Experimental results are reported to demonstrate the effectiveness of this mechanism.
1. Introduction
Content-based retrieval systems operate on image databases to extract relevant images in response to a visual query. Unfortunately, the concept of "relevance" is frequently associated with the semantics of the image, and the encoding and exploitation of semantic information in a general-purpose retrieval system is still an unsolved issue. In practice, however, low-level visual features are often correlated with the semantic contents [1, 8], and image retrieval techniques exploiting these have become a promising research issue [3, 9, 13]. General-purpose systems of this type are already available [6, 16]. These systems usually make it possible to extract image descriptions in terms of color, texture, shape and layout features from the images and to define the relative search/matching functions that can be used to retrieve those of interest. The performance of an image retrieval system is, consequently, closely related to the nature and quality of the features used to represent image content, but it is not limited to this. Another important issue is that the measure adopted to quantify image similarity is user- and task-dependent [3, 10], and this dependence is not in general understood well enough to permit careful, a-priori selection of the optimal measure. In this paper we present a mechanism that allows the user to query the database and progressively refine the system's response by indicating the relevance, or irrelevance, of the retrieved items. Our approach differs from other recently presented studies [11, 15, 18] in both the strategy for learning the query representing the information needs and the strategy for learning the similarity function. The relevance feedback mechanism presented is part of a Visual Information Retrieval system currently under development. A full description of all the functionalities of this system is beyond the scope of the paper. We should like to provide here only
some experimental evidence that our retrieval mechanism makes it possible to approximate the user's information needs in a great variety of applications.
2. Image Indexing
The absolute performance of an image retrieval system is closely linked to the nature and the quality of the features used to represent the image content. The features used to index the images in the experiments reported here were calculated on the global image and on 5 sub-images, and are:
• a Color Coherence Vector (CCV) in the CIELAB color space quantized in 64 colors (pixels in regions whose size exceeds 1% of the image size are counted as coherent pixels) [13];
• a histogram of the transitions in colors (CIELAB color space quantized in 11 colors) [7];
• moments of inertia of the distribution of colors in the unquantized CIELAB color space [17];
• a histogram of contour directions, suitably filtered (only high-gradient pixels are considered) and using box widths of 15 [4, 9];
• the mean and variance of the absolute values of the coefficients of the sub-images of the first three levels of the multi-resolution wavelet transform of the luminance image [4];
• the Neighborhood Gray-Tone Difference Matrix (NGTDM) features, i.e. coarseness, contrast, busyness, complexity, and strength, as proposed by Amadasun et al. [2];
• the spatial composition of the color regions identified by the process of quantization in 11 colors [4].
3. Relevance Feedback
The sub-vectors of features are indicated by $X^i_{hs}$, where i is the vector index, h is the index of the feature, and s is the index of the sub-image to which that feature refers. We indicate with $D_{hs}$ the distance associated with the h-th feature of region s. In our experiments all the features are compared with the L1 distance measure, as it is statistically more robust than the L2 distance measure [14]. The global metric used to evaluate the similarity between two images of the database is, in general, a linear combination of the distances between the individual features:

$\mathrm{Sim}(X^i, X^j) = \sum_{s=1}^{q} \sum_{h=1}^{p} w_{hs}\, D_{hs}(X^i_{hs}, X^j_{hs})$   (1)
in which the $w_{hs}$ are weights. There are two problems in this formulation of image similarity. First, since the single distances may be defined on intervals of widely varying values, they must be normalized to a common interval to place equal
emphasis on every feature score. Second, the weights must often be heuristically set by the user, and this may be rather difficult to do, as there may be no clear relationship between the features used to index the image database and those evaluated by the user in a subjective image similarity evaluation. To cope with the first problem we use Gaussian normalization as follows [12, 15]:

$D(X^i, X^j) = \left[ \dfrac{D_{11}(X^i_{11}, X^j_{11}) - \mu_{11}}{K \sigma_{11}}, \ldots, \dfrac{D_{hs}(X^i_{hs}, X^j_{hs}) - \mu_{hs}}{K \sigma_{hs}}, \ldots, \dfrac{D_{pq}(X^i_{pq}, X^j_{pq}) - \mu_{pq}}{K \sigma_{pq}} \right]^T$   (2)

Assuming that the features' distance distributions are Gaussian, it can be shown that there is a 68% probability that the feature values will lie within the range [-1, 1] if K = 1, and a 99% probability if K = 3 [12]. As we cannot assume a priori that the distances will have Gaussian distributions, the following general relationship holds:

$P\left( -1 \le \dfrac{D_{hs} - \mu_{hs}}{K \sigma_{hs}} \le 1 \right) \ge 1 - \dfrac{1}{K^2}$   (3)
According to this relationship, the probability that the distance will lie within the range [-1, 1] is 89% if we set K at 3, and 94% if K is set at 4 [13]. The latter is the default value used in all our experiments. A simple additional shift moves the distances into the [0, 1] range. Out-of-range values are mapped to the extreme values, so that they do not bias further processing. At this point our similarity function has the following form:

$\mathrm{Sim}(X^i, X^j) = \sum_{s=1}^{q} \sum_{h=1}^{p} w_{hs}\, \dfrac{D_{hs}(X^i_{hs}, X^j_{hs}) + 1}{2} = \sum_{s=1}^{q} \sum_{h=1}^{p} w_{hs}\, d_{hs}(X^i_{hs}, X^j_{hs})$   (4)
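A minimal sketch of this normalization step (Eqs. (2)-(4)), assuming the per-feature means and standard deviations of the distance distributions have been estimated beforehand on the database (function and variable names are illustrative):

def normalize_distance(d, mu, sigma, K=4):
    # Shift by the mean, scale by K*sigma (Eq. 2), clip to [-1, 1], then shift to [0, 1].
    z = (d - mu) / (K * sigma)
    z = max(-1.0, min(1.0, z))
    return (z + 1.0) / 2.0

def similarity(raw_distances, stats, weights, K=4):
    # raw_distances, stats and weights are dicts keyed by (h, s); stats holds (mu, sigma).
    # Implements Eq. (4) as a weighted sum of the normalized distances d_hs.
    return sum(weights[key] * normalize_distance(d, *stats[key], K=K)
               for key, d in raw_distances.items())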
The algorithm must now determine the weights for the individual distances by a statistical analysis of the distance values of the images in the relevance set; these weights are used to accentuate, or diminish, the influence of a given feature in the overall evaluation of similarity [18]. In our experience in content-based retrieval, relevant images are sometimes selected because they resemble the query image in just some pictorial features. Consequently, after an initial query, one retrieved image may be considered relevant because it has the same color as the query, and another may be selected for its similarity in shape, although the two are actually quite different from each other. Let $R^+$ be the set of relevant images, $d^+_{hs}$ the set of normalized distances among the elements of $R^+$, and $\sigma^+_{hs}$ the corresponding variance; similarly, let $R^-$ be the set of non-relevant images and $d^-_{hs}$ the set of normalized distances among the elements of $R^-$. We then consider the union of $R^+$ and $R^-$ and compute the corresponding distance set $d^{+-}_{hs}$, letting $d^*_{hs}$ be $d^{+-}_{hs} \setminus d^+_{hs}$ and $\sigma^*_{hs}$ the corresponding variance. The weight terms are defined as:
$w^+_{hs} = \begin{cases} \dfrac{1}{\varepsilon} & \text{if } |R^+| < 3 \\ \dfrac{1}{\varepsilon + \sigma^+_{hs}} & \text{otherwise} \end{cases}$   (5)

$w^*_{hs} = \begin{cases} 0 & \text{if } |R^+| + |R^-| < 3 \text{ or } |R^-| = 0 \\ \dfrac{1}{\varepsilon + \sigma^*_{hs}} & \text{otherwise} \end{cases}$   (6)

$w_{hs} = \begin{cases} 0 & \text{if } w^+_{hs} < w^*_{hs} \\ w^+_{hs} - w^*_{hs} & \text{otherwise} \end{cases}$   (7)
where ε is a positive constant (set at 0.01 in our experiments). When at least three examples are given, we can take into account negative examples in tuning the similarity weights. For any given feature, the first term of $w_{hs}$ is high when there is some form of agreement among the feature values of the relevant (positive) set, while the second term is high when there is a similarity between positive and negative examples.
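The weight update of Eqs. (5)-(7) can be sketched as follows, assuming the normalized distances within the relevant set (d_plus) and between relevant and non-relevant images (d_star) have already been collected for one feature (h, s); names are illustrative:

import statistics

def feature_weight(d_plus, d_star, n_rel, n_notrel, eps=0.01):
    if n_rel < 3:
        w_plus = 1.0 / eps                                   # Eq. (5), too few positive examples
    else:
        w_plus = 1.0 / (eps + statistics.pvariance(d_plus))  # agreement among relevant images
    if n_rel + n_notrel < 3 or n_notrel == 0:
        w_star = 0.0                                         # Eq. (6), negatives not usable
    else:
        w_star = 1.0 / (eps + statistics.pvariance(d_star))  # similarity between R+ and R-
    return 0.0 if w_plus < w_star else w_plus - w_star       # Eq. (7)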
4. Query Processing
Query processing consists in modifying the feature vector of the query by taking into account the feature vectors of the images judged relevant by the user. Let $R^+$ be the set of relevant images the user has selected (including the original query), Q the average query, and σ the corresponding standard deviation; we then proceed as follows:

$Y_{hs}(j) = \left\{ X^i_{hs}(j) \;:\; \left| X^i_{hs}(j) - Q_{hs}(j) \right| \le 3\sigma_{hs}(j) \right\}$   (8)

$\tilde{Q}_{hs}(j) = \dfrac{1}{|Y_{hs}(j)|} \sum_{X^i_{hs}(j) \in Y_{hs}(j)} X^i_{hs}(j), \qquad \forall\, h, s, i, j$   (9)
The query processing thus formulates a new query $\tilde{Q}_{hs}$ that can better represent the images of interest to the user, taking into account the features of the relevant images, without allowing a single outlying feature value to bias the query computation. The query processing could similarly be applied to compute a query representing non-relevant examples, but this seems of little practical interest: non-relevant examples are usually not similar to each other and are, consequently, scattered throughout the feature space.
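A sketch of this query update (Eqs. (8)-(9)): for each feature component, values farther than 3σ from the average query are discarded before averaging. The code assumes relevant_vectors contains the feature vectors of the images in R+, including the original query (names are illustrative):

import statistics

def update_query(relevant_vectors):
    n_dims = len(relevant_vectors[0])
    new_query = []
    for j in range(n_dims):
        values = [v[j] for v in relevant_vectors]
        mean_j = statistics.fmean(values)
        sigma_j = statistics.pstdev(values)
        kept = [x for x in values if abs(x - mean_j) <= 3 * sigma_j]   # Eq. (8)
        new_query.append(sum(kept) / len(kept))                        # Eq. (9)
    return new_query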
5. Query Classification
At the first iteration, when the user has selected just one image to be searched, all the weights in the similarity function (4) are set to the equal value 1/ε. For faster tuning of the similarity function we can exploit previous query sessions performed by the user on the same database. To this end, the user is allowed to have satisfactory queries memorized together with the corresponding weights in the similarity measure. When the user has already formulated a query "similar" to the new one, the algorithm sets the initial weights of the similarity function to the values of the former query, reducing the time and effort needed to adapt the similarity measure by means of the relevance feedback algorithm. We let $\hat{Q} = \left\{ \tilde{Q}^k, w^k_{hs} \right\}$ be the set of memorized queries and the corresponding weights. When a new query $Q^n$ is submitted, the system first evaluates its similarity with respect to previous queries, using the corresponding weights in the similarity function; it then selects the closest one as follows:

$\left( \tilde{Q}^k, w^k_{hs} \right) = \operatorname{argmin}_{(\tilde{Q}^k, w^k_{hs}) \in \hat{Q}} \; \sum_{s=1}^{q} \sum_{h=1}^{p} w^k_{hs}\, d_{hs}(Q^n, \tilde{Q}^k)$   (10)
The initial weights, selected according to Equation (10), are now set as follows:

$w^{k(\text{initial})}_{hs} = \begin{cases} w^k_{hs} & \text{if } \dfrac{\sum_{s=1}^{q} \sum_{h=1}^{p} w^k_{hs}\, d_{hs}(Q^n, \tilde{Q}^k)}{pq} \le T\,\dfrac{1}{\varepsilon} \\ \dfrac{1}{\varepsilon} & \text{otherwise} \end{cases}, \qquad 0 < T \le 1$   (11)
Parameter T allows the user to tune the sensitivity. In our implementation the default is set at 0.1; that is, the new query is considered similar to an old one when the two differ by less than 10% of $(1/\varepsilon)\,pq$, the maximum distance value allowed.
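A sketch of this initialization (Eqs. (10)-(11)), under the assumption that memorized queries are stored as (query, weights) pairs and that d_hs is the normalized per-feature distance used above (all names are illustrative):

def initial_weights(new_query, memorized, d_hs, pq, eps=0.01, T=0.1):
    # memorized: list of (old_query, weights) pairs, weights being a dict over (h, s).
    def weighted_distance(old_query, weights):
        return sum(w * d_hs(new_query, old_query, key) for key, w in weights.items())
    # Eq. (10): the memorized query minimizing its own weighted distance to the new query
    best_query, best_weights = min(memorized, key=lambda qw: weighted_distance(*qw))
    # Eq. (11): reuse its weights if the normalized distance is below T/eps, else uniform 1/eps
    if weighted_distance(best_query, best_weights) / pq <= T / eps:
        return dict(best_weights)
    return {key: 1.0 / eps for key in best_weights}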
6. Experimental Results and Discussion
Users do not find it difficult to provide examples of similar and dissimilar images interactively. However, if the image database queried is large and heterogeneous, or the retrieval task particularly complex, the user may not find enough examples of images that are actually very similar to the query in the first screens and, to avoid the time-consuming visual browsing of the database, may mark as relevant images that are only partially similar. The user's information needs may also be rather vague, such as: find all the images containing people. In both cases, images judged relevant may differ widely. Treating all these images in the same way - for example, averaging the features of the relevant images to compute a new query vector or updating the similarity measure [18] - may consequently produce very poor results, while
processing all the relevant images as single queries and then combining the retrieval outputs may create an unacceptable computational burden when the database is large. Last but not least, relevant images may have some features, color for example, that are only casually similar. If the system is not able to identify these features and treat them differently, subsequent retrieval iterations will be biased. The key concept of the relevance feedback we have proposed is the statistical analysis of the feature distributions of the images the user has judged relevant, or not relevant, in order to understand which features have been taken into account (and to what extent) by the user in formulating this judgment, so that we can then accentuate the influence of these features in the overall evaluation of image similarity as well as in the formulation of a new query. The archives on which the algorithms were tested contain some 5,000 images: photographs of real landscapes, people and animals; reproductions of antique textiles of the Poldi Pezzoli Museum of Milan; a collection of paintings of the Accademia Carrara of Bergamo; and a collection of ceramics belonging to the Museo Internazionale della Ceramica of Faenza. To evaluate the effectiveness of our retrieval method we applied a measure proposed by Methre et al. [5]. We let S be the number of relevant items the user wanted to retrieve when posing a query; $R^I_q$, the set of relevant images; and $R^E_q$, the set of relevant images retrieved in the short list. The effectiveness measure was defined as:

$\eta_S = \begin{cases} \dfrac{|R^I_q \cap R^E_q|}{|R^I_q|} & \text{if } |R^I_q| \le S \\ \dfrac{|R^I_q \cap R^E_q|}{|R^E_q|} & \text{if } |R^I_q| > S \end{cases}$   (12)

If $|R^I_q| \le S$, the effectiveness reduces to the traditional recall measure, while if $|R^I_q| > S$, the effectiveness corresponds to precision (in our implementation S was set at 24). The effectiveness of the algorithms was tested on the individual databases, and on a combination of them, to evaluate the system's capacity for adaptation. In Table I we have summarized the experimental results for twenty queries for each of the different databases considered. The queries have not been classified: the first iteration corresponds to a similarity measure in which all the features have the same importance. Each bin corresponds to a different database, and reports the average effectiveness value at each of the first three retrieval iterations. The results show that relevance feedback improves the effectiveness of the retrieval considerably for all the databases, and, in general, the second iteration (that is, the first relevance feedback iteration) corresponds to the largest single improvement. There is, instead, little benefit in repeating the relevance feedback more than five or six times. It can reasonably be argued that this is due to the limited capability of the low-level features used to exhaustively describe image content, and not to the relevance feedback itself. The integration of other features and the exploitation of unsupervised image segmentation to focus automatically on the significant parts of the database images to
be indexed should improve the results. This integration will be straightforward for us, as the structure of the relevance feedback mechanism is actually description-independent; that is, the index can be modified, or extended to include other features, without requiring any change in the algorithm. However, as already pointed out by several authors, perceptually similar images are not necessarily similar in terms of low-level features. Therefore, we believe that only the integration of text-based image annotation will make it possible to further increase retrieval effectiveness in a significant way. Figure 1 represents an example of the system's application.

Table I. Retrieval effectiveness (bar chart of the average effectiveness at the first, second, and third retrieval iteration for the Photos (1745 images), Paintings (1768 images), and Ceramics and ancient textiles (942 images) databases).
Figure 1. The retrieval results after the second iteration of relevance feedback.
Interested readers may find additional examples at the following address: http://www.test.itim.mi.cnr.it/sitoitim/schettini/relfeme.htm.
References
1. Aigrain O., Zhang H., Petkovic D.: Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review. Multimedia Tools and Applications, 3 (1996) 179-182
2. Amadasun M., King R.: Textural features corresponding to textural properties. IEEE Transactions on Systems, Man and Cybernetics, 19 (1989) 1264-1274
3. Binaghi E., Gagliardi I., Schettini R.: Image retrieval using fuzzy evaluation of color similarity. International Journal of Pattern Recognition and Artificial Intelligence, 8 (1994) 945-968
4. Ciocca G., Gagliardi I., Schettini R.: Retrieving color images by content. In: Del Bimbo A., Schettini R. (eds.) Proc. of the Image and Video Content-Based Retrieval Workshop (1998)
5. Desai Narasimhalu A., Kankanhalli M.S.: Benchmarking multimedia databases. Multimedia Tools and Applications, 4 (1997) 333-356
6. Faloutsos C., Barber R., Flickner M., Hafner J., Niblack W., Petkovic D.: Efficient and effective querying by image content. Journal of Intelligent Systems, 3 (1994) 231-262
7. Gagliardi I., Schettini R.: A method for the automatic indexing of color images for effective image retrieval. The New Review of Hypermedia and Multimedia, 3 (1997) 201-224
8. Gudivada V.N., Raghavan V.V.: Modeling and retrieving images by content. Information Processing and Management, 33 (1997) 427-452
9. Jain A.K., Vailaya A.: Image retrieval using color and shape. Pattern Recognition, 29 (1996) 1233-1244
10. Minka T., Picard R.W.: Interactive learning with a "Society of Models". Pattern Recognition, 30 (1997) 565-581
11. Mitra M., Huang J., Kumar S.R.: Combining Supervised Learning with Color Correlograms for Content-Based Image Retrieval. Proc. of the Fifth ACM Multimedia Conference (1997)
12. Mood A.M., Graybill F.A., Boes D.C.: Introduzione alla statistica. McGraw-Hill (1988)
13. Pass G., Zabih R., Miller J.: Comparing Images Using Color Coherence Vectors. Proc. of the Fourth ACM Multimedia Conference (1996)
14. Rousseeuw P.J., Leroy A.M.: Robust regression and outlier detection. John Wiley & Sons (1987)
15. Rui Y., Huang T.S., Mehrotra S., Ortega M.: A relevance feedback architecture in content-based multimedia information retrieval systems. Proc. of the IEEE Workshop on Content-based Access of Image and Video Libraries (1997)
16. Smith J.R., Chang S.-F.: VisualSEEK: a fully automated content-based image query. Proc. of the Fourth ACM Multimedia Conference (1996)
17. Stricker M., Orengo M.: Similarity of Color Images. Proc. of the SPIE Storage and Retrieval for Image and Video Databases III Conference (1995)
18. Taycher L., La Cascia M., Sclaroff S.: Image Digestion and Relevance Feedback in the ImageRover WWW Search Engine. Proc. of the Visual 1997 Conference (1997)
Region Queries without Segmentation for Image Retrieval by Content Jamal Malki, Nozha Boujemaa, Chahab Nastar, and Alexandre Winter INRIA Rocquencourt, BP 105, 78153 Le Chesnay, France [email protected]
Abstract. Content-based image retrieval is today ubiquitous in computer vision. Most systems use the query-by-example approach, performing queries such as "show me more images that look like this one". Most often, the user is more specifically interested in specifying an object (or region) and in retrieving more images with similar objects (or regions), as opposed to similar images as a whole. This paper deals with that problem, called region querying. We suggest a method that uses a multiresolution quadtree representation of the images and thus avoids the hard problem of region segmentation. Several experimental results are presented on real-world databases.
1. Introduction
Surfimage is a Content-Based Image Retrieval (CBIR) system developed at INRIA since 1996. Its specificity is its capacity to deal with both categories of image databases:
• For image databases with ground truth, the system should be as efficient as possible on the specific application. Examples include face recognition or medical image retrieval. A quantitative evaluation of the system can then be reported in terms of recognition rate, precision/recall graph, etc.
• For image databases where no ground truth is available, the system should be flexible, since the notion of perceptual similarity is subjective and context-dependent. Smart browsing, query refinement, multiple queries, and partial search on user-defined regions are among the desirable features of the system. Applications include stock photography and the World Wide Web.
Surfimage uses the query-by-example approach for retrieving images and integrates advanced features such as image signature combination, multiple queries, query refinement, and partial queries [1, 2]. We focus on the latter problem in this paper. Indeed, the user is most often interested in performing a query on an object (or region), rather than on the whole image. The goal of the system is then to retrieve those images in the database that contain similar objects. This observation motivates recent research on spatially-localized features and region matching. Methods range
from histogram computation without segmentation [3-5] and approximate region segmentation [6-9] to the graphical and spatial structure of the image [10-12]. In this paper, we suggest a multiresolution quadtree approach for performing region queries. It represents a simple and efficient way to select image parts for retrieval. Indeed, when manual object segmentation through all database images is too hard, or when images are too complex for automatic segmentation, quadtree image subdivision is an appropriate solution for partial queries. Image signatures are then computed systematically on each subimage. A dedicated similarity measure is computed, allowing the system to retrieve images with similar regions. The similarity measure defines the invariance properties of the query (e.g. "find similar objects anywhere in the image" vs. "find similar objects in the same image location"). Section 2 details feature computation in the multiresolution quadtrees. Section 3 discusses user queries and the associated induced invariance. In section 4 we present comparative experimental results. We draw the conclusions in section 5.
2. Multiresolution Quadtrees
Several problems occur with region (or partial) matching. The first one is segmentation: accurate segmentation of an image into regions in the general case is quite difficult. Another problem is invariance: is the user interested in finding other occurrences of the object (or region) with the same position, orientation, and scale, or do they require invariance against these transformations? In order to deal with partial queries, we use a multiresolution quadtree representation (similar to [13]) that localizes image features in structured regions (fig. 1). This approach avoids image segmentation and offers effective alternatives for the invariance problem.

Fig. 1. Computation of image signatures on the multiresolution quadtree representation of the original image for 3 levels
Let A be the original image and p the number of levels. For a given level i ($1 \le i \le p$), A is divided into $n_i$ parts such that, over all levels, A is represented by a total number of N subimages:

$A = \left( A^i_{n_i} \right)_{n_i}, \qquad \sum_{i=1}^{p} n_i = N, \qquad n_i = 4^{i-1}, \quad 1 \le i \le p$   (1)
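A sketch of the subdivision in Eq. (1): level i splits the image into a 2^(i-1) x 2^(i-1) grid, i.e. 4^(i-1) tiles, so a p-level representation yields N sub-images in total (illustrative code; the image is assumed to be a NumPy-style array supporting 2-D slicing):

def quadtree_subimages(image, p):
    # Returns the list of sub-images A^i_{n_i} for levels 1..p (Eq. 1).
    height, width = image.shape[:2]
    subimages = []
    for level in range(1, p + 1):
        cells = 2 ** (level - 1)           # cells x cells grid -> 4^(level-1) tiles
        for row in range(cells):
            for col in range(cells):
                r0, r1 = row * height // cells, (row + 1) * height // cells
                c0, c1 = col * width // cells, (col + 1) * width // cells
                subimages.append(image[r0:r1, c0:c1])
    return subimages   # len(subimages) == sum(4**(i-1) for i in range(1, p + 1)) == N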
2.1. Feature Computation
We use a large selection of features offered by the Surfimage system [1]. Examples include color, shape and texture, mostly captured via histograms. The presentation of these commonly used features is not within the scope of this paper - for further details, see [14, 15] or [1]. The features are computed on each subimage of the quadtree representation, yielding higher-dimensional feature vectors. If we consider the p-level representation, the following feature vector is computed:

$h = \left( h^i_{n_i} \right), \qquad 1 \le i \le p, \quad n_i = 4^{i-1}$   (2)

where each $h^i_{n_i}$ is defined as the feature h computed on the sub-image $A^i_{n_i}$. The set of $\left( h^i_{n_i} \right)$ for a fixed i corresponds to the features of the i-th level of the representation. For instance, if the original single feature h is a 64-bin histogram, the dimensionality of the 3-level quadtree feature is (1+4+16)×64 = 1344. Finally, note that, when using histograms, we normalize the feature vectors so that they sum up to 1 - they thus represent distributions of that feature in the corresponding sub-image.

2.2. Feature Combination
Combination of different features has been a recent focus of image retrieval [2]. The main problem is how to combine "apples and oranges", i.e. features that have different numbers of components, different scales, etc. We have experimented with several combination methods: voting, gymnastics-rule, etc. [16]. Among these methods, one seems to be the most appropriate and is described in the following. Under a Gaussian assumption, the normalized linear combination method uses the estimated mean $\mu_i$ and standard deviation $\sigma_i$ of the distance measure d for each feature i, providing the normalized distance:

$d'(x^{(i)}, y^{(i)}) = \dfrac{d(x^{(i)}, y^{(i)}) - (\mu_i - 3\sigma_i)}{6\sigma_i}$   (3)

where $x^{(i)}$ and $y^{(i)}$ are the vector signatures of images X and Y within feature i. The new distance measure d' will essentially have its values in [0...1]. The combined distance between a query Q and an image X is then:

$D(Q, X) = \sum_i \rho\left( d'(x^{(i)}, y^{(i)}) \right) - \sum_i \rho(-\alpha_i)$   (4)

where $\alpha_i = \dfrac{\mu_i - 3\sigma_i}{6\sigma_i}$ and ρ is an increasing function (e.g. ρ(x) = x, ρ(x) = x³, etc.).
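A sketch of the combination of Eqs. (3)-(4), assuming the mean and standard deviation of the raw distance d have been estimated beforehand for each feature; rho defaults to the identity (names are illustrative):

def combined_distance(raw_distances, mus, sigmas, rho=lambda x: x):
    # raw_distances[i] is d(x^(i), y^(i)) for feature i; mus[i], sigmas[i] its statistics.
    total = 0.0
    for d, mu, sigma in zip(raw_distances, mus, sigmas):
        alpha = (mu - 3.0 * sigma) / (6.0 * sigma)
        d_norm = (d - (mu - 3.0 * sigma)) / (6.0 * sigma)   # Eq. (3), essentially in [0, 1]
        total += rho(d_norm) - rho(-alpha)                  # Eq. (4)
    return total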
3. User Queries
3.1. Performing a Partial Query
For performing a region query, the user has to specify the regions of interest (RoIs) at each level. Note that the RoIs do not have to be connected (fig. 2). For a given level,
we build the corresponding bounding box, defined as the smallest subimage containing the RoIs (see fig. 2). This bounding box (Bb) is able to capture the relative geometric positions of the different RoIs, and will be used for retrieving images.

Fig. 2. Bounding box of a partial query at the third level

In connection with the RoIs, the user has to specify which metric (fixed or location invariant) they are using. Note that most of the image signatures that we use are histograms, which most often yield rotation, translation and scale invariance within a subimage. The main invariances that the user can specify are thus: no location invariance (e.g. a face on the top left, a hand on the bottom left), no relative location encoding (a face and a hand anywhere in the image), and location invariance (a face and a hand with specified relative positions).

3.2. Similarity Metric and Invariance Issues
The previous invariances have to be translated into similarity metrics (or, equivalently, distances). We have dealt with the "no invariance" case in our previous work [16]. The "loose relative location" case is an alternative which can be easily derived. We describe location invariance hereafter; the method encodes the relative locations of the RoIs and is translation invariant (fig. 3).
Fig. 3. Authorized translations of the partial-query bounding box at level 3

Location invariance is built in as follows. Let $R_Q$ be the set of $N_r$ partial queries selected by the user at level l:

$R_Q = \bigcup_{\beta=0}^{N_r - 1} Q_{l\beta} = \left\{ Q_{l0}, \ldots, Q_{l\,N_r - 1} \right\}$

Let M be the bounding box of $\bigcup_{i=0}^{N_r - 1} \{ Q_{li} \}$, and let $N_t$ be the number of authorized translations. The similarity measure between a query image Q and any image X in the database is given by:
$d(Q, X) = \sum_{j=0}^{N_t - 1} d\left( M(Q), M_j(X) \right)$   (5)

where $M_j(X) = \bigcup_{\alpha_j = 0}^{N_r - 1} X_{l\alpha_j}$ is the sub-image obtained after a translation j. Note that:
$d\left( M(Q), M_j(X) \right) = \sum_{\beta=0}^{N_r - 1} \sum_{\alpha_j = 0}^{T_m} \delta_{\alpha_j \beta_j}\, d\left( Q_{l\beta}, X_{l\alpha_j} \right)$   (6)
where δ is the Kronecker symbol.
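As an illustration of Eqs. (5)-(6), the matching can be sketched as follows, assuming each level-l cell already carries its signature and that the authorized translations of the query bounding box (Fig. 3) are enumerated as mappings from query cells to image cells (all names are illustrative):

def partial_query_distance(query_cells, image_cells, translations, cell_distance):
    # query_cells: {cell_index: signature} for the RoIs at level l.
    # image_cells: {cell_index: signature} for the candidate image at the same level.
    # translations: one {query_cell_index: image_cell_index} dict per authorized translation j.
    total = 0.0
    for mapping in translations:                       # j = 0 .. Nt-1, Eq. (5)
        for q_idx, signature in query_cells.items():   # beta = 0 .. Nr-1, Eq. (6)
            total += cell_distance(signature, image_cells[mapping[q_idx]])
    return total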
3.3. Redundancy
We note that the quadtree representation of image features is redundant, especially with respect to histogram computation. More precisely, a histogram feature computed at any node of the tree is equal to the sum of the histogram features of its 4 sons (regardless of the normalization procedure). The quadtree representation is thus redundant. However, experimentally, this redundancy is effective and useful: indeed, the lower levels tend to specify the holistic features of the image, which allows the search to be restricted (e.g. I am looking for red roses, but in landscape scenes only). Note also that, since the contribution of each subimage to the global metric is equivalent, the lower levels participate less in the overall metric. Examples are shown in section 4. The user selects the first level to focus on an image category (e.g. a landscape), and specifies the query regions at a higher level.
4. Retrieval Results
We present results on our homebrew bigdatabase (Fig. 4), which was built by merging the MIT Vistex database of textures, the BTphoto database of city and country scenes, a homebrew paintings database, and the homeface database of people in the lab. The total number of images in bigdatabase is 3670. We use a kd-tree structure to optimize the search process. Note that when spatial relations between partial queries are not preserved, retrieval results are different from those displayed in figures 5, 6 and 7.
Fig. 4. A sample of bigdatabase, consisting of 3670 heterogeneous images
Fig. 5. Retrieval of the top-left image. Retrieved images are shown from top left to bottom right in order of best match. The query image at level 1 and the partial queries at level 3 are also presented. We search for the selections anywhere in the database images
Fig. 6. Second example of partial queries on bigdatabase
Fig. 7. Retrieved face images from bigdatabase with both levels 1 and 3
Fig. 8. Retrieved face images from bigdatabase with only level 1. The system returns just face images from among the texture, landscape and painting images
5. Conclusion
To address the issue of region matching, we avoided the difficult problem of image segmentation by using a multiresolution quadtree representation of the image. Our experience shows that we need two levels: level 1 to pick the image category and level 3
to specify more precise details within the same category. This approach provides flexibility for retrieval over general and heterogeneous image databases without using specialized features, such as those for face retrieval, as shown in fig. 7 and fig. 8. Various general features are computed and combined together on each subimage. Dedicated similarity metrics allow for inferring invariance properties - in particular location invariance. Experimental results show that the method is effective for local queries in image databases. They also emphasize the importance of preserving the spatial organization of local features to achieve good retrieval results.
6. References
1. Nastar C., Mitschke M., Meilhac C., Boujemaa N.: Surfimage: a Flexible Content-Based Image Retrieval System. ACM Multimedia'98, Bristol, September 1998
2. Nastar C., Mitschke M., Meilhac C.: Efficient Query Refinement for Image Retrieval. Computer Vision and Pattern Recognition (CVPR'98), Santa Barbara, June 1998
3. Swain M.J., Ballard D.H.: Color indexing. IJCV, vol. 7, no. 1, pp. 11-32, 1991
4. Smith J.R., Chang S.-F.: Tools and Techniques for Color Image Retrieval. Proceedings of Storage & Retrieval for Image and Video Databases I, San Jose, CA, USA, February 1996, pp. 426-437
5. Huang J., Kumar R., Mitra M., Zhu W.-J., Zabih R.: Image Indexing Using Color Correlograms. CVPR, Puerto Rico, June 1997, pp. 762-768
6. Servetto S.D., Rui Y., Ramchandran K., Huang T.S.: A Region-Based Representation of Images in MARS. Special Issue on Multimedia Signal Processing, Journal on VLSI Signal Processing, Oct. 1998
7. Belongie S., Carson C., Greenspan H., Malik J.: Color- and Texture-Based Image Segmentation Using EM and Its Application to Content-Based Image Retrieval. Proceedings of the Sixth International Conference on Computer Vision (ICCV'98), Bombay, January 1998
8. Cinque L., Lecca F., Levialdi S., Tanimoto S.: Retrieval of images using rich image descriptions. Proceedings of the ICPR, 1998
9. Campbell N.W., Mackeown W.P.J., Thomas B.T., Troscianko T.: Interpreting image databases by region classification. Pattern Recognition (Special Edition on Image Databases), vol. 30, no. 4, Apr. 1997, pp. 555-563
10. Santini S., Jain R.: The Graphical Specification of Similarity Queries. Journal of Visual Languages and Computing, 1997
11. Del Bimbo A., Vicario E.: Using Weighted Spatial Relationships in Retrieval by Visual Contents. IEEE Workshop on Image and Video Libraries, Santa Barbara, June 1998
12. Soffer A., Samet H., Zotkin D.: Pictorial Query trees for query specification in image databases. Proceedings of the ICPR, 1998
13. Vellaikal A., Kuo C.: Joint Spatial-Spectral Indexing of JPEG Compressed Data for Image Retrieval. Int'l Conf. on Image Proc., Lausanne, 1996
14. Pentland A., Picard R., Sclaroff S.: Photobook: Tools for Content-Based Manipulation of Image Databases. Storage and Retrieval of Image and Video Databases II, vol. 2185, San Jose, 1994
15. Jain A., Vailaya A.: Image Retrieval Using Color and Shape. Pattern Recognition, vol. 29, no. 8, 1996
16. Nastar C., Mitschke M., Meilhac C., Boujemaa N., Bernard H., Mautref M.: Retrieving Images by Content: the Surfimage System. Multimedia Information Systems'98, Istanbul, September 1998
Content-Based Image Retrieval over the Web Using Query by Sketch and Relevance Feedback E. Di Sciascio, G. Mingolla, and M. Mongiello Dip. Elettrotecnica ed Elettronica, Politecnico di Bari, Via E. Orabona 4, I-70125, Bari Italy Tel.+39(0)805460641 Fax +39(0)805460410 [email protected]
Abstract. This paper investigates the combined use of query by sketch and relevance feedback as techniques to ease user interaction and improve retrieval effectiveness in content-based image retrieval over the World Wide Web. To substantiate our ideas we implemented DrawSearch, a prototype image retrieval by content system that uses color, shape and texture to index and retrieve images. The system employs Java applets for query by sketch and relies on relevance feedback to allow users to dynamically refine queries.
1 Introduction
Content-based image retrieval (CBIR) [1-3] information systems use information extracted from the content of images for retrieval, and help the user retrieve images relevant to the contents of the query. A number of methodologies, techniques and tools related to image content processing have been studied for the identification and comparison of image features, in order to develop classification and retrieval systems based on (almost) automatic interpretation of image content. Complete image classification, indexing and retrieval based on content interpretation require semantic interpretation and cannot be afforded with current technology. A surrogate of semantic interpretation is the computation of visual features that can be used as quantitative parameters for the identification of similar images. Thus, the problem of retrieving images with homogeneous content is replaced with the problem of retrieving images visually close to a target one. Several systems have been proposed in recent years in the framework of content-based retrieval [5-20]. With particular reference to papers related to query by sketch, the interested reader is also referred to QBIC [10,11], QVE [18] and a recent work on the use of elastic deformation of user sketches [19]. It must anyway be noticed that these works emphasize the pattern matching problem more than the retrieval by similarity one. Also noteworthy is the work in [20], which uses wavelet-based indexing and query by sketch for color image retrieval. Here the emphasis is on avoiding any user specification but the submitted query sketch. Approaches proposed in [5-9] introduce relevance feedback [4] as a distinguishing aspect that can improve retrieval results using feedback provided by the user. Porting CBIR systems to the WWW is not a straightforward task. The largest part of current systems basically allows query by example, letting the user
browse the collection until he/she finds an image visually similar to the searched ones. Once such an image has been found, the user can pose a query and retrieve similar images. This process is cumbersome: as image collections grow in size it may take a lot of time, and eventually reduce the query-retrieval process to trivial browsing. Furthermore, the well-known problem of current Internet low speed makes the whole process rather tedious. We have investigated the possibility of increasing user interaction with image retrieval systems over the Internet using query by sketch and giving the user the ability to interact easily with the system by browsing the retrieved images and tuning the response through relevance feedback analysis. As a matter of fact, rather than putting much processing effort into answering a query in the sharpest way, in an image retrieval system it is reasonable to provide a fast reply with a rough retrieval method (yet offering non-trivial discrimination performance). Differently from text-based documents, browsing a small retrieved set of images is fast and easy, since the relevance of the content can be stated at a glance. To address these and other issues related to image information systems, we have started the DrawSearch [5] project with our group at Politecnico di Bari.
Figure 1. DrawSearch system overview (client side: the query by shape and color interface and the query by texture interface, both Java applets, plus an HTML retrieved-set viewer; server side: query server, feature extraction, feature data archive and image archive, accessed over the WWW)
2 System Overview
In order to substantiate our ideas on the use of query by sketch and relevance feedback to increase user interactivity and retrieval effectiveness in a web-based environment, we designed and implemented a prototype system, which is accessible over the Internet at: http://deecom03.poliba.it/DrawSearch/DrawSearch-home.html. Figure 1 shows its main components. Our system extracts as relevant features, as other systems do, color, shape and texture. Two separate user interfaces are provided: the first one allows posing queries by sketch, combining shape and color distribution; the second one allows queries by texture content. The reasons for having separate interfaces are various; the most relevant for us was to avoid increasing the computational burden on the client side. We are
anyway integrating relevant texture information in the retrieval stage of the color and shape system.
Feature extraction is a two-fold problem in our approach. Basically, the features described here are extracted off-line for images during the database population stage. Hence, though always relevant, time performance is not critical when referred to this stage. On the other hand, when the features have to be extracted from the user's sketch, time, and hence algorithmic complexity, becomes a primary issue. Furthermore, while feature computation on the server side can be done with efficient standard compiled languages, the client side has to rely on the much less efficient Java language.
All extracted features share a representation based on the vector space model [4]. This model is based on the association of term vectors to documents, each vector representing a specific document by holding information about the index terms or keywords associated to it. Such information may appear simply as a set of present/not-present flags, but more often it is a measure (weight) of the ability of each index term to discriminate the document within the collection. In our image retrieval model, feature vectors play almost the same role that term vectors play in text retrieval, holding normalized values of the image features as indexing information. However, differently from the vector space model, we do not weight the features against the whole image collection. The reason for this difference comes from the different user perception of image similarity with respect to text similarity, and from the different level at which the retrieved items are evaluated: visual for the images, semantic for the text. Image similarity is evaluated on visual properties, and may be verified at a glance. It is therefore independent from the image collection size and variety. The features are used to find similarities rather than to discriminate differences. On the other hand, reading a text to verify the adequacy of the retrieved documents is a long process; therefore a text retrieval system must be provided with good discrimination capabilities, relying on the use of words to describe the document's meaning and not its appearance. We believe that this model has an intrinsic strength due to its widespread use and its reliability. It also allows an extremely simple integration with text-based information systems. Hence we considered of primary importance feature evaluation schemes that allow a straightforward implementation in this model.
The color feature is extracted by computing the average values within the 16 predefined blocks (in a 4 by 4 arrangement) the image is divided into. Computing the color feature in this way is obviously not very precise, but it allows, with limited computational effort, representing the color distribution of the most immediately visible components, like large objects or background, hence also providing a limited degree of spatial information. The adopted color model is RGB (Red, Green, Blue). The resulting data are normalized to a sum of 1 and arranged in a vector of 48 components. We are currently changing our color model to the HSB one; anyway, the results reported in this paper still refer to color distribution computed using RGB.
Shape feature extraction in real images requires image segmentation. Although a number of segmentation algorithms have been proposed in the literature, this issue is far from being solved. Most problems come from the ill-posed nature of the edge detection problem.
The problem becomes harder when we need segmentation into semantically coherent regions, i.e. objects. We adopted a contour-based segmentation algorithm, which uses color and texture as ancillary information.
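A sketch of the 48-component color feature described above: the image is divided into a 4x4 grid, the mean R, G, B values of each block are concatenated, and the vector is normalized to sum to 1 (illustrative code; the image is assumed to be a NumPy array of shape (H, W, 3)):

import numpy as np

def block_color_feature(image, grid=4):
    # Per-block mean RGB values, normalized to sum to 1 (48 components for a 4x4 grid).
    height, width, _ = image.shape
    means = []
    for row in range(grid):
        for col in range(grid):
            block = image[row * height // grid:(row + 1) * height // grid,
                          col * width // grid:(col + 1) * width // grid]
            means.extend(block.reshape(-1, 3).mean(axis=0))
    vector = np.asarray(means, dtype=float)
    return vector / vector.sum()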
Figure 2. Query and retrieval results for a query by color and shape. a) sketch drawn on the canvas; b) retrieval results; c) retrieval results after two rounds of relevance feedback.
Shape characteristics extraction has been performed using a modified version of the Fourier descriptors approach proposed in [21]. Our shapes are hence assumed simply connected with borders represented by a closed curve. We currently use the real part
of the lower 100 Fourier coefficients for shape description, which are arranged in a single vector. As we discard the continuous component, we obtain scale, rotation and translation invariant coefficients. Retrieval based on texture analysis in real images also requires image segmentation. Our approach uses Gaussian Markov Random Fields (GMRF) for texture-based segmentation and texture extraction [8]. Two vectors storing mean values and variance values characterize each segmented area, obtained by applying the GMRF algorithm.
3 User Interface and Query Processing
The basic prerequisite a user interface for a visual, content-based image retrieval system should have is to let the user simply express "what" his/her information need is. Figure 2.a shows the color and shape query interface with a query sketched on the canvas. Here the user simply has to draw a rough sketch, in terms of color and shape, of the images he/she would like to retrieve. A message window is provided to inform the user of the actions. A control panel shows the values computed when features are extracted. It is worth noticing that separate layers exist for shape and color distribution. A further "border" layer is provided to let the user check whether the shape he/she has drawn is closed. This has drawbacks, as it may require drawing an object twice: once on the color layer and once on the shape layer. This strategy was determined by the need to avoid the further computational burden of extracting the shape information from a color image. Also to reduce the computational burden on the client side, the retrieval interface uses a simple html page to show results. The user interface for texture processing is also a Java applet. Figure 3.a shows a query obtained by selecting a user-defined area within a database image. Differently from most systems, which basically allow posing a query using pre-selected texture images, our system allows the user to dynamically select a texture area within images in the database, in order to increase the user's interactivity with the system. This interface is also endowed with a text area for messages from the system and a control panel that allows inspection of the extracted parameters. Retrieval is performed by measuring the distance, in the n-dimensional space defined by the index terms, between the term vector of the query and the term vectors of the documents. In the retrieval by shape and color sub-system, two similarity functions are computed: simC(R,Q) and simS(R,Q), accounting for color and shape respectively. Each function simX(R,Q), representing the similarity between a database image feature, defined by the tuple R = (r0, r1, ..., rn), and the query image feature, also defined by a tuple Q = (q0, q1, ..., qn), is computed using the cosine similarity coefficient, defined as:
$\mathrm{sim}(R, Q) = \dfrac{\sum_i r_i q_i}{\sqrt{\sum_i r_i^2 \times \sum_i q_i^2}}$   (1)
The resulting coefficients are merged to form the final similarity function as a linear combination:
$\mathrm{sim}(R, Q) = \alpha \times \mathrm{sim}_C(R, Q) + \beta \times \mathrm{sim}_S(R, Q)$   (2)
where α and β are weighting coefficients. These weights obviously increase or decrease the contribution of a feature with respect to the others. To better characterize the user's information need, the coefficients are dynamically modified during the relevance feedback stage. Figure 2.b shows an example of retrieval results.
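A sketch of the matching step of Eqs. (1)-(2): the cosine coefficient is computed separately on the color and shape vectors and the two scores are combined linearly (illustrative names; alpha and beta are the weights adjusted during relevance feedback):

import math

def cosine_similarity(r, q):
    # Eq. (1): cosine coefficient between a database feature vector r and a query vector q.
    num = sum(ri * qi for ri, qi in zip(r, q))
    den = math.sqrt(sum(ri * ri for ri in r)) * math.sqrt(sum(qi * qi for qi in q))
    return num / den if den else 0.0

def combined_similarity(r_color, q_color, r_shape, q_shape, alpha=0.5, beta=0.5):
    # Eq. (2): linear combination of the color and shape similarities.
    return alpha * cosine_similarity(r_color, q_color) + beta * cosine_similarity(r_shape, q_shape)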
Figure 3. Query and retrieval results for a query by texture. a) selected query area; b) retrieval results after relevance feedback. Relevant texture areas are the highlighted ones.
In the retrieval by texture sub-system, a single similarity function is computed between the selected query area, represented by its feature vector Q, and the textured areas of the images in the database, represented by their feature vectors $R_i$, using the Euclidean distance between the vectors storing the associated mean values. This distance is normalised with the associated variance values, providing the following expression:

$\mathrm{sim}_T(Q, R_i) = \sum_{j=1}^{4} \dfrac{\left( f_j(Q) - f_j(R_i) \right)^2}{\sigma_j^2(R_i)}$   (3)
The resulting retrieved images are ranked according to increasing distance score. Uncertainty is obviously present in image retrieval, due to the weak correspondence between the computed features and the image content perceived by the user; in other words, the system may not match the user's perception of similarity. For this reason there is a growing interest in the image database community in methodologies and techniques for relevance feedback. In our model the user can improve retrieval results by selecting, among the topmost-ranked retrieved images, the ones he/she considers relevant. This is known as "positive feedback". Based on experiments, this type of feedback is used in the retrieval by texture sub-system. The other sub-system also implements negative feedback for images considered not relevant. Leaving an image unselected marks it as "don't care", and it does not contribute to the relevance feedback process. A new query is computed by combining the feature vectors of the original query with those of the relevant and not relevant images. In practice, the modified query is computed by adding to the feature vector Q associated with the query image the feature vectors X of relevant images and subtracting the not relevant ones Y, respectively weighted with suitable δ and ε coefficients:
$Q^{(k+1)} = Q^{(k)} + \delta \sum_{i=1}^{N_{rel}} X_i - \varepsilon \sum_{i=1}^{M_{notrel}} Y_i$   (4)
with δ = 0 in the retrieval by texture sub-system. The modification is performed separately on the features. The modified query is then used in a new retrieval step. It is worth noticing that the shape and color retrieval utilizes a 3-layer relevance feedback strategy. The first layer has just been described and operates at the feature level. The other two operate on the combined features and on the image domains: if the user tends to concentrate on a single feature, the system strengthens the contribution of that feature in the similarity measure. Furthermore, if the user concentrates his/her interest on images of a single domain, i.e. all relevant images belong to a single category, the system retrieves in the following relevance feedback stages only images belonging to that domain. Figures 2.c and 3.b show results after relevance feedback processing. Table I shows quantitative results in terms of precision and recall for a first set of tests performed. Please note that mp1 = mp2, as in our tests we always retrieved the same number of images regardless of any threshold.
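A sketch of the query update of Eq. (4), applied component-wise to each feature vector (illustrative names; the default delta and epsilon values are assumptions, and one of the two feedback terms is dropped in the texture sub-system, as noted above):

def refine_query(query, relevant, not_relevant, delta=0.5, epsilon=0.25):
    # query: feature vector Q(k); relevant / not_relevant: lists of feature vectors
    # of the images marked by the user. Returns Q(k+1) as in Eq. (4).
    new_query = list(query)
    for j in range(len(query)):
        new_query[j] += delta * sum(x[j] for x in relevant)
        new_query[j] -= epsilon * sum(y[j] for y in not_relevant)
    return new_query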
4 Conclusions Content-based access appears to be a promising direction for increasing the efficiency and accuracy of unstructured data retrieval, yet strategies specifically tailored to web-based CBIR systems still have to be devised. We have investigated techniques to increase the user's interactivity in such a scenario, proposing query by sketch as an effective way to start a meaningful query, and relying on relevance feedback to refine retrieval results. We are still integrating and refining our model; nevertheless, first experiments carried out with the aid of some volunteers have shown fair precision and recall scores, and noteworthy improvements obtained through relevance feedback.
Table I. Precision and recall scores

Queries | mp1≡mp2 | mr1 | mr2
Results without relevance feedback | 0.495 | 0.649 | 0.659
Results with single-layer relevance feedback | 0.593 | 0.778 | 0.787
Results with two-layer relevance feedback | 0.637 | 0.836 | 0.842
Visual Learning of Simple Semantics in ImageScape Jean Marie Buijs and Michael S. Lew Leiden Institute for Advanced Computer Science Leiden University, Postbus 9512, 2300 RA Leiden, The Netherlands {buijs,mlew}@cs.leidenuniv.nl Abstract Learning visual concepts is an important tool for automatic annotation and visual querying of networked multimedia databases. It allows the user to express queries in his own vocabulary instead of the computer's vocabulary. This paper gives an overview of our current research directions in learning visual concepts for use in our online visual webcrawler, ImageScape. We discuss using the Kullback relative information for finding the most informative features in the case of human faces and generalize the method to other objects/concepts.
1 Introduction In many content based retrieval systems, the user is asked to understand how the computer sees the world. An emerging trend is to try to have the computer understand how people see the world. However, understanding the world is a fundamental computer vision problem which has withstood decades of research. The critical aspect to these emerging methods is that they have modest ambitions. Petkovic[1997] has called this finding "simple semantics." From recent literature, this generally means finding computable image features which are correlated with visual concepts. The key distinction is that we are not trying to fully understand how human intelligence works. This would imply creating a general model for understanding all visual concepts. Instead, we are satisfied to find features which describe some small, but useful domains of visual concepts.
1.1 Visual Search Paradigms Content based search researchers are constantly looking for methods which are usable by nonexperts. The typical method for this intuitive search is by finding a similar image. In this paradigm, the user clicks on an image, and then the search engine ranks the database images by similarity with respect to color, texture, and shape. In sketch search methods, the user draws a rough sketch of the goal image. The assumption is that the sketch corresponds to the object edges and contours. The database images which have the most similar shapes to the user sketch are returned. Sketches represent
an abstract level of reasoning about the image. Another abstract method uses icons to represent objects and/or concepts in the image. The user places the icons on a canvas in the position where they should appear in the goal image. In this context, the database images must have been preprocessed for the locations of the available objects/concepts. The database images which are most similar by the content of objects/concepts to the iconic user query are returned. For an overview, see Gudivada and Raghavan [1995] and especially Flickner, et al. [1995]. We designed a system for searching networked multimedia databases called ImageScape. One of the principal query types in ImageScape is semantic icons. Semantic icons are essentially drag-and-drop icons which represent concepts from the user's vocabulary. These can be objects such as human faces, textures such as sand or wood, or even colors. The importance of the method is that it does not require the user to learn how the computer understands the image. Instead, the computer learns how humans perceive images.
1.2 Imagescape System Overview When an image is brought to the server, it is analyzed for features pertaining to the semantic icons (i.e. faces, sand, water, etc.) and for extraction of the computer sketches. Then a thumbnail of the image and the features are stored in a compressed database. When a user sends an image query from a WWW Java browser/client program, the query is sent to the server, and matched against the database. The user drawn sketch is compared to the computer sketches and the semantic icons are compared to the automatically extracted features. The best matches are then sent back to the WWW browser/client program to be displayed to the user. In summary, the ImageScape system consists of the following modules:
• collection of text, images, audio, and video from the WWW
• compression of the image database
• semantic object detection in images
• computer sketching of images
• matching of the icons/sketches with the database images
• Java client connecting to the host server for visual query input and processing
There are other interesting WWW image search engines which have been described in the research literature. The WebSeek [Smith and Chang 1997] system from Columbia University finds similar images and performs automatic text based category classification. The WebSeer [Frankel, Swain, and Athitsos 1996] system from the University of Chicago lets users search by the number of faces and by text queries. Taycher, La Cascia, and Sclaroff [1997] designed the ImageRover system to primarily use
relevance feedback for the search process, and in the PicToSeek system, Gevers and Smeulders[1997] search through the images using similar images and image features. In a previous paper [Lew, et al. 1997], we introduced an early version of this system. In this work, our focus is entirely on the object/concept detection.
2 Learning Simple Semantics In this paper we discuss learning simple semantics, or in another way, visual learning of concepts. This brings into mind the question raised by readers and referees, which is “What is visual learning?” Such a general term could refer to anything having to do with artificial or human intelligence in all its sophistication and complexity. Rather than a vague description, we seek to define it clearly at least within the boundaries of this paper as either (1) feature tuning; (2) feature selection; or (3) feature construction. Feature tuning refers to determining the parameters which optimize the use of the feature. This is often called parameter estimation. Feature selection means choosing one or more features from a given initial set of features. The chosen features typically optimize the discriminatory power regarding the ground truth which consists of positive and negative examples. Feature construction is defined as creating new features from a base set of atomic features and integration rules. In this paper, we focus on feature selection and in the section on future work, we reveal preliminary results toward feature construction. What is an object/concept? For our purposes, an object/concept is anything which we can apply a label or recognize visually. These could be clearly defined objects like faces or more difficult concepts such as textures. Most textures do not have corresponding labels in common language. Object/Concept detection is essential to the usage of the WWW image search engine because it gives the computer the ability to understand our notion of an object or concept. Instead of requiring all users to understand low level feature queries, we are asking the computer to understand the high level queries posed by humans. For instance, if we want to find an image with a beach under a blue sky, most systems require the user to translate the concept of beach to a particular color and texture. In our system, the user can pose the query visually as a beach under a blue sky using icons to represent beach and blue sky, respectively. Giving a complete discussion of visual concept learning would not fit within the scope of a conference paper. In fact, it would require several books to do it justice. Furthermore, we suggest that what is necessary in the field now is a thorough survey on visual concept learning. For the scope of this paper, we give a brief overview of recent visual learning techniques in the research literature. We turn to an example of feature selection in the domain of human face detection, and then observe that it is straightforward to generalize the face detection method to other objects.
2.1 Background Picard[1996] reported promising results in classifying blocks in an image into "at a glance" categories. What this means is that she investigated multiple model methods
to classify an NxN block into categories which humans could classify without logically analyzing the content. Forsyth, et al. [1996] found objects from feature blobs. More recently, Vailaya, Jain, and Zhang [1998] have reported success in classifying images as city vs. landscape. They found that the edge direction features have sufficient discriminatory power for accurate classification of their test set. Buijs[1998] reported promising results in learning primary colors and textures using the Kullback relative information. The commonality between these methods was using multiple features for object/concept detection. Regarding object detection, the recent surge of research toward face recognition has motivated robust methods for face detection in complex scenery. Representative results have been reported by Sung and Poggio[1998], Rowley and Kanade[1998], Lew and Huijsmans [1996], and Lew and Huang[1996].
2.2 Feature Selection We begin by describing a method of finding human faces in grayscale images with complex backgrounds and then show that the method is easily extensible to other objects/concepts. The Kullback relative information [Kullback 1959] is generally regarded as one of the canonical methods of measuring discriminatory power. Specifically, we formulated the problem as discriminating between the classes of face and nonface, and used the Kullback relative information as a measurement of the class separation, i.e. the distance between the classes in feature space. As the class separation increases, the overlap between the classes decreases, making the confidence in the class decision increase. The detection algorithm, illustrated in Figure 2.1, can be stated concisely as follows:
(1) Create a ground truth set of positive (face) and negative (nonface) examples.
(2) Compute the histograms from the ground truth set for the classes face and nonface.
(3) Find the N most informative features by maximizing the Kullback relative information combined with a Markov random field.
(4) Arrange the N most informative features in a vector, and apply a minimum distance classifier.
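A minimal sketch of steps (2)-(3): per-feature histograms for the two classes and a ranking by Kullback relative information. The symmetrised form of the divergence is used and the Markov random field coupling is omitted; both are simplifying assumptions.

```python
import numpy as np

def kullback_j(p, q, eps=1e-12):
    """Symmetrised Kullback relative information J(p, q) between two
    histograms p and q (normalised internally)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum((p - q) * np.log(p / q)))

def most_informative_features(face_hists, nonface_hists, n):
    """Rank candidate features (e.g. pixel positions) by the divergence of
    their face vs. nonface intensity histograms; keep the n best."""
    scores = [kullback_j(f, g) for f, g in zip(face_hists, nonface_hists)]
    return list(np.argsort(scores)[::-1][:n])
```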
Figure 2.1. (a) Result of using the Kullback relative information on face (eyes and eyes/nose templates) and nonface examples; and (b) the 256 pixels which have the greatest discriminatory power.
2.3 Generalizing to Multiple Models In the previous explanation covering face detection, we used the 256 pixels which had the greatest discriminatory power. For lack of a better word, we define this set of features as a discriminatory model. This discriminatory model has the advantage that for N features in the model, it minimizes the misdetection rate. The question arises then of how to generalize the method to more features such as color, texture, and shape. For each object we wish to detect, a large set of positive and negative examples is collected. We measured a variety of texture, color, and shape features, and for each one we calculated the Kullback discriminant. The candidate features for our system included the color, gradient, Laplacian, and texture information from every pixel as shown in Figure 2.2. For the texture models, we used Trigrams [Huijsmans, Poles, and Lew 1996], LBP [Wang and He 1990], LBP/C, Grad. Mag., XDiff, YDiff [Ojala, Pietikainen, and Harwood 1996]. For shape comparison, we used the features derived from elastic contours [Del Bimbo and Pala 1997], invariant moments [Hu 1962], and Fourier Descriptors [Gonzalez and Woods 1993].
Figure 2.2. Selecting the best discriminatory model of N features from texture (trigrams, LBP/C, gradient magnitude, Xdiff, Ydiff), color (RGB, HSV), and shape (invariant moments, snakes, Fourier descriptors) features, measured on positive and negative examples.
3 Feature Construction In the previous discussion, we proposed using the Kullback relative information for feature selection. The next logical step was to consider methods for feature construction. In recent work [Buijs 1998], we presented a rule based method for combining the atomic features of color, texture, and shape toward representing more sophisticated visual concepts. Recall that in feature construction there are atomic features and rules for integrating them. Our atomic features were instances of color, texture, and shape. The atomic colors were red, yellow, purple, green, blue, brown, orange, white, gray, and black. From the Kullback relative information, the color model with the greatest discriminatory power was HSV. The atomic textural features were coarse, semi-coarse, semi-fine, fine, nonlinear, semi-linear, linear, and texture features based on examples: marble, wood, water, herringbone, etc. LBP/C had the greatest discriminatory power regarding the Kullback relative information. For the atomic shape features, we created a basic set of geometric primitives: circular, elliptical, square, triangular, rectangular, and pentagonal. We also tuned a variety of shape examples. Shape features were detected using template matching and active contour energy [Del Bimbo and Pala 1997]. Simple concepts were represented as AND conjoined boolean expressions:
If (color is orange) AND (texture is coarse) AND (shape is circular) Then object is an orange
More complex concepts were represented using AND/OR expressions such as:
If ((color is yellow OR color is white) AND (texture is fine) AND (texture is nonlinear)) Then object is sand
Rules were automatically generated from positive and negative example training sets using decision trees. We ranked the features used in the decision trees by the Kullback relative information, and created the tree using the features with greater discriminatory power first. Results for five outdoor categories are shown in Table 1; an illustrative sketch of the rule evaluation follows the table.
Table 1. Probability of misdetection

Category | forest | mountain | sand | sky | water
Misdetection | 0.23 | 0.14 | 0.27 | 0.09 | 0.15
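The boolean concept rules quoted above can be evaluated mechanically; the sketch below is illustrative only, and the attribute names (e.g. splitting the texture test into coarseness and linearity) are hypothetical, not the system's actual feature labels.

```python
def detect_concept(region_features, rule):
    """Evaluate a concept rule given as nested AND/OR tuples of
    ('test', attribute, value) leaves."""
    kind = rule[0]
    if kind == "test":
        _, attr, value = rule
        return region_features.get(attr) == value
    clauses = [detect_concept(region_features, r) for r in rule[1:]]
    return all(clauses) if kind == "and" else any(clauses)

# "If ((color is yellow OR color is white) AND (texture is fine)
#  AND (texture is nonlinear)) Then object is sand"
sand_rule = ("and",
             ("or", ("test", "color", "yellow"), ("test", "color", "white")),
             ("test", "texture_coarseness", "fine"),
             ("test", "texture_linearity", "nonlinear"))

region = {"color": "yellow", "texture_coarseness": "fine",
          "texture_linearity": "nonlinear"}
print(detect_concept(region, sand_rule))  # True -> label the region as sand
```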
4 Conclusions Visual concept learning has the potential of bringing intuitive searching for visual media to the general public. Regarding the World Wide Web, we can bring image and video search to anyone with a Web browser if these visual learning technologies mature. At Leiden University, we are currently creating a large library of visual feature training databases and detectors. In this paper, we gave an overview of the visual concept learning methods being used for the ImageScape project. From the perspective of visual learning as feature tuning, feature selection, and/or feature construction, we have shown a progression of techniques for learning simple concept domains. Regarding future work, we think that the methods for combining the results of multiple classifiers[Kittler 1998] have the most versatility and potential for improvement of simple semantic detection.
Acknowledgements This research was funded by the Dutch National Science Foundation and the Leiden Institute for Advanced Computer Science.
References
Buijs, J. M., "Toward Semantic Based Multimedia Search," Master's Thesis, Leiden Institute for Advanced Computer Science, August 13, 1998.
Del Bimbo, A., and P. Pala, "Visual Image Retrieval by Elastic Matching of User Sketches," IEEE Trans. Pattern Analysis and Machine Intelligence, February, pp. 121-132, 1997.
Flickner, M., H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by Image and Video Content: The QBIC System," Computer, IEEE Computer Society, pp. 23-32, Sept. 1995.
Forsyth, D., J. Malik, M. Fleck, T. Leung, C. Bregler, C. Carson, and H. Greenspan, "Finding Pictures of Objects in Large Collections of Images," Proceedings, International Workshop on Object Recognition, Cambridge, April 1996.
Frankel, C., M. Swain and V. Athitsos, "WebSeer: An Image Search Engine for the World Wide Web," Technical Report 96-14, University of Chicago, August 1996.
Gevers, T. and A. Smeulders, "PicToSeek: A Content-Based Image Search System for the World Wide Web," VISUAL'97, San Diego, December 1997, pp. 93-100.
Gonzalez, R. and R. E. Woods, "Digital Image Processing," Addison Wesley, 1993.
Gudivada, V. N., and V. V. Raghavan, "Finding the Right Image, Content-Based Image Retrieval Systems," Computer, IEEE Computer Society, pp. 18-62, Sept. 1995.
Hu, M., "Visual Pattern Recognition by Moment Invariants," IRE Trans. on Information Theory, vol. IT-8, no. 2, pp. 179-187, Feb. 1962.
Huijsmans, D. P., M. Lew, and D. Denteneer, "Quality Measures for Interactive Image Retrieval with a Performance Evaluation of Two 3x3 Texel-based Methods," International Conference on Image Analysis and Processing, Florence, Italy, September 1997.
Kittler, J., M. Hatef, R. Duin, and J. Matas, "On Combining Classifiers," IEEE Trans. Patt. Anal. and Mach. Intell., vol. 20, no. 3, March 1998.
Kullback, S., "Information Theory and Statistics," Wiley, New York, 1959.
Lew, M., K. Lempinen, and N. Huijsmans, "Webcrawling Using Sketches," VISUAL'97, San Diego, December 1997, pp. 77-84.
Lew, M. and N. Huijsmans, "Information Theory and Face Detection," Proceedings of the International Conference on Pattern Recognition, Vienna, Austria, August 25-30, 1996, pp. 601-605.
Lew, M. and T. Huang, "Optimal Supports for Image Matching," Proc. of the IEEE Digital Signal Processing Workshop, Loen, Norway, Sept. 1-4, 1996, pp. 251-254.
Ojala, T., M. Pietikainen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions," Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.
Petkovic, D., "Challenges and Opportunities for Pattern Recognition and Computer Vision Research in Year 2000 and Beyond," Proc. of the Int. Conf. on Image Analysis and Processing, Florence, September 1997, vol. 2, pp. 1-5.
Picard, R., "A Society of Models for Video and Image Libraries," IBM Systems Journal, 1996.
Rowley, H., and T. Kanade, "Neural Network Based Face Detection," IEEE Trans. Patt. Anal. and Mach. Intell., vol. 20, no. 1, pp. 23-38, 1998.
Smith, J. R. and S. F. Chang, "Visually Searching the Web for Content," IEEE Multimedia, 1997, pp. 12-20.
Sung, K. K., and T. Poggio, "Example-Based Learning for View-Based Human Face Detection," IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 20, no. 1, pp. 39-51, 1998.
Taycher, L., M. La Cascia, and S. Sclaroff, "Image Digestion and Relevance Feedback in the ImageRover WWW Search Engine," VISUAL'97, San Diego, December 1997, pp. 85-91.
Tekalp, A. M., Digital Video Processing, Prentice Hall, New Jersey, 1995.
Vailaya, A., A. Jain and H. Zhang, "On Image Classification: City vs. Landscape," IEEE Workshop on Content-Based Access of Image and Video Libraries, Santa Barbara, June 21, 1998.
Wang, L. and D. C. He, "Texture Classification Using Texture Spectrum," Pattern Recognition, vol. 23, pp. 905-910, 1990.
Task Analysis for Information Visualization Stacie L. Hibino Bell Labs, Lucent Technologies 263 Shuman Boulevard, Naperville, IL 60566 USA [email protected] http://www.bell-labs.com/~hibino/
Abstract. Previous research in information visualization has primarily focused on providing novel views and frameworks to aid users in exploring or accessing data; very little work has been done to support users through the full analysis process, from the raw data to the final results. But what tasks do users perform when analyzing data using an information visualization (infoVis) environment? A task analysis of experts' use of an existing infoVis system was conducted to examine this question. Results indicate that users work on various tasks outside of data exploration, such as conditioning and preparing data, collecting results, and gathering evidence for a presentation. This pilot study identifies key data analysis tasks that expert users perform when using an infoVis environment to analyze some real-life data.
1 Introduction Recent advances in information visualization (infoVis) have led to novel visualizations and new paradigms to support users in accessing and exploring a wide variety of data. General infoVis frameworks [1,13] and applications to fields such as temporal [8], software [9] and medical [10] data are just a few examples of the growing work in infoVis. Most of this work, however, has focused on enhancing data access and exploration; little, if any, work has examined the larger problem of supporting users' data analysis process, that is, the processes they use in transforming raw data into a presentation of results based on using infoVis as their key analysis tool. The continual introduction of more powerful computers and the recent explosion in large data repositories offer new opportunities while posing new challenges to the infoVis community. In our research lab, where we have already developed a suite of infoVis tools capable of handling moderately sized data sets (easily accommodating data sets containing 100,000 data records), we are beginning to see evidence of these challenges as users complain about the effort required to prepare data for exploration or abstract key results from an infoVis analysis of a complex data set. Previous work in information workspaces [3,7] offers glimpses of an organizing framework, but much of this work focuses on organizing data and objects (e.g., documents) rather than a user's process.
In order to understand how to better support users in the infoVis analysis process, we set out to identify the types of tasks they conduct during data analysis with an infoVis tool. In this paper, we describe the task analysis conducted to accomplish this goal. We had three driving questions in this endeavor:
• Do users conduct other tasks besides data exploration during data analysis?
• If so, what are these tasks?
• How important are these tasks to the data analysis process?
The task analysis was conducted on five infoVis experts using an existing infoVis environment (EDV: the Exploratory Data Visualizer [13]). Users were asked to analyze a disease data set used as part of an American Statistical Association (ASA) Data Exposition. Their goal at the end of their analysis sessions was to present key results of their analysis in their typical target presentation format. Overview. This paper is divided into four additional sections. In the next section, I describe the experimental method. I then present and discuss the results, summarize some related work and finally, I provide conclusions and describe some future work.
2 Experimental Method The goal of the task analysis was to study expert users analyzing real-life data with an existing infoVis system (EDV [13]) as their primary analysis tool, where users started with the raw data, analyzed it, and worked towards a presentation of results. In order to examine how users cope with a complex data analysis problem requiring multiple sessions, the users were asked to accomplish this over two separate sessions. Participants. Five users (all male) participated in the study. They were invited to participate based on their expertise in using EDV as well as their experience in using information visualization (infoVis) environments for data analysis. Two users have a statistics background, while the other three hold advanced computer science degrees. EDV. The Exploratory Data Visualizer (EDV) is an information visualization framework which provides linked data views [13]. EDV allows users to create and interact with any number of dynamically linked, user-specified data views including univariate (e.g., bar charts and histograms), bivariate (e.g., scatterplots and table views), multivariate, network, and text-based views (e.g., similar to a spreadsheet). Views are dynamically linked to one another so that a selection made in one view results in automatic highlighting of the corresponding selections in all other views. Data Set. Users analyzed real-life tuberculosis (TB) disease data from the 1991 American Statistical Association Data Exposition [5]. The TB data included one primary file of over 11,000 records and five other supporting data files on information such as census data. Procedure. The users' goal was to analyze the TB data and create a presentation of key results. They were given two 1-hour sessions, separated in time by at least one week, to accomplish this goal. They used EDV as their primary tool for analyzing the data, and were told that they could use any other tools that they typically use in conjunction with EDV to accomplish the analysis task. In the first session, users read a sheet describing this task analysis study and their goals, and a hard copy of the readme
file included with the TB data set. They were asked to provide think-aloud verbal protocols of what they were doing as they were doing it, and then proceeded with the analysis task at hand. At the end of the second session, if extra time was available, a short informal post-interview was conducted. Information about the post-questionnaire was sent to users after they had completed their second session. Data Collected. The following data was collected: observational data, files generated during the users’ sessions (e.g., scripts written to transform the data, new files of transformed data, etc.), post-interview notes, and results of the post questionnaire. In addition, output of the users’ screen was captured directly onto video, along with audio of their verbal protocols.
3 Results and Discussion While users did take different approaches to their analyses, their core set of tasks and findings were very similar. However, none of them created an actual presentation of their results. Although they analyzed a real-life data set, they were not motivated to create a presentation and in most cases did not feel that they had enough time to do so. Instead, they either captured their key findings on paper or articulated them aloud during their analysis. Information about presentation-related tasks was gathered through informal post-interviews and through the post-questionnaire.
3.1 High-Level Analysis Tasks During the task analysis, we observed several low-level user tasks. Through grouping similar tasks together, we identified seven categories of high-level tasks:
• prepare: data background and preparation tasks,
• plan: analysis planning and/or strategizing tasks,
• explore: data exploration tasks,
• present: presentation-related tasks,
• overlay: overlay and assessment tasks,
• re-orient: re-orientation tasks (when analysis requires more than one session), and
• other: statistics-based tasks.
The low-level tasks for each of the task categories are described in Section 3.2. When asked, users did not present an alternative list or any additional categories of high-level tasks. Users rated the importance of each of the low-level tasks to the analysis process based on a scale of 1 to 5. Figure 1 shows the average importance ratings for each of the high-level task categories, based on an average of all of the low-level task ratings in each category. Overall, the ratings are fairly high, with averages ranging from 3.4 to 4.5 on a 5-point scale. However, taking individual users into account, an analysis of variance indicates that the differences in importance ratings between categories are significant, leading to the following order of importance based on user ratings: plan > explore > prepare > present > statistics > overlay > re-orient tasks.
Fig. 1. Average importance ratings for task categories (1=unimportant, 5=very important; error bars indicate standard deviation)
3.2 Low-Level Analysis Tasks The low-level tasks were identified through observational data, previous infoVis taxonomies [12], informal interviews and through the post-questionnaire. Tables 1 to 6 summarize the low-level tasks by category and include user examples. While there is some overlap between tasks and task categories, these summaries provide a working taxonomy of data analysis through information visualization.
Table 1. Prepare: data background and preparation tasks
Task Description | Example
Gather background information about data set at hand | Review TB readme file
Understand data sources | Note TB data file names, sizes, formats
Get clarification on data ambiguities | How is "race" different from "ethnicity?"
Collect additional data from other external sources | Can I get my almanac?
Reformat data for suitable input | Add header information to a data file for importing into EDV
Check data for potential data errors | Spot check raw data file
Check for missing data | Is there any missing data?
Transform the data | Split variables, rollup/aggregate data
Table 2. Plan: analysis planning and strategizing tasks
Task Description | Example
Hypothesize | There was a hypothesis that TB incidence increased in HIV infected groups
Make a strategy or plan for all or a part of your analysis | Decide what, how, and how much to investigate or explore
Identify data formats and variables required for desired views | We need sums of census data by state…
Table 3. Explore: data exploration tasks (incorporating tasks from [12]) Task Description Get an overview of the data Investigate data to test hypotheses (top down approach) Explore data in search of trends or exceptions (bottom-up approach) “Query” or filter the database Identify curiosities to investigate further Zoom in on items of interest Remove uninteresting items Identify data clusters Identify relationships between variables Explain view/visualization Identify a trend or exception Verify a trend or exception Drill-down for more details
Example I always like to use something to get some idea of the whole data set… There was a hypothesis that…. We can look at that actually Now let’s look at [the] race [variable] We go to race=2 [African American], we see that they get TB around … So that’s sort of interesting… I wonder, let’s… Let’s look at [just] that peak of youngsters I’m just going to eliminate those early years; concentrate on data where there at least seems to be stable reporting going on Alright, let’s try clustering…cluster view So in terms of age, whites seem to get TB more when they’re older in comparison to the other races… There are two possible answers [explanations] here. One is that… An interesting gap in the [age] data here… There’s a gap in around 12 year olds. Examine alternative view to verify a trend [looking at text records] You can actually count the number of 15 year olds...
Table 4. Present: presentation-related tasks Task Description Gather evidence to answer a hypothesis or driving question Record or keep track of trends and results tested and found Articulate importance of a result (rank it or identify it as “interesting”) Articulate/summarize all and/or key results Decide what to include in presentation of results Create presentation of results Give presentation of results
Example Well, in terms of the urban area hypothesis, it looks like it might be reasonably, likely; District of Columbia, which is an urban area … No significant effect per time-of-year How does this result rank in comparison to the others? I’m going to present a summary of my results in written form What are the top 2-3 interesting results I want to show? Paste screen dumps into an electronic presentation and annotate Communicate results to others
Table 5. Overlay: overlay and assessment tasks Task Description Take notes Window management Assess your strategy Assess your observations Assess your assumptions about data formats Assess your progress Estimate cost-benefit ratio of additional data collection or conditioning
Example Write down code info: 1=male; 2=female Move and resize windows Is this the right strategy to take? Does this observation make sense? Is my data in the right format to accomplish this part of the analysis? I’m wondering if there’s anything else I haven’t considered which I should look at I should really recode those… but I couldn’t be bothered with that
Table 6. Re-Orient: re-orientation tasks Task Description Review goal(s), Review data and formats, Review notes Review progress Identify starting point for current session
Example Review TB readme file head tb.txt Flip through written notes What I remember doing last time was… So what I wanted to try doing today was…
Due to space limitations, we cannot discuss each of the low-level tasks in detail, but we do highlight some of the more interesting observations here. A couple of interesting data exploration tasks include explaining a visualization and verifying a trend or exception. These tasks, which have not previously been reported in other infoVis taxonomies, indicate that expert users do not just stop at the identification of a trend or exception, and that they consider their analysis to be as much of an investigation as an exploration. Different users have different types of target presentations. Users in this study listed a variety of typical target presentations ranging from static screen dumps to interactive web pages [6] and live EDV demonstrations. For small data sets and in situations where only a few results are identified, presentations are much easier to create and require fewer tasks. Large complex data sets, however, require additional work to keep track of, rank, and decide on presentation contents of results. This is especially the case when users may be sorting through a series of 20 to 30 results. The take notes and window management tasks are overlay tasks that cut across the other task categories. For example, users could take notes while preparing data, planning their analysis, or exploring the data. The assessment tasks identify the metacognitive activity exhibited by users during their analysis sessions. That is, users asked themselves the types of sample questions listed with each task in Table 5. Statistics-Based Tasks. Although users did not conduct any statistical tests during their analysis of the TB data, several users mentioned situations where they either would follow-up with a statistical test or where they thought it would be nice to have system support to conduct a particular statistical test.
Additional Tasks. Three users listed five tasks between them that they felt were not included in the task list presented in the post-questionnaire. Three of the five tasks listed were very similar to existing tasks while the other two included:
• given a relationship or feature of interest, explore what other factors may contribute to it, and
• sort and order results so that related results and their impact on each other can easily be accomplished.
4 Related Work Several infoVis taxonomies have been proposed (e.g., [12, 2]), but these typically focus on categorizing aspects of infoVis limited to accessing and exploring data: aspects such as data types, visualization types and exploration tasks. The results of the task analysis presented in this paper indicate that such taxonomies only address a part of the problem, especially when considering the use of infoVis for data analysis rather than only data access. While no infoVis environment currently addresses all of the types of tasks identified through this task analysis, some work has touched on some of the issues identified here. For example, the SAGE system [11] is a knowledge-based presentation system that potentially reduces the users' task load on analysis planning as well as presentation-related tasks. A second example is an infoVis spreadsheet [4] that provides a framework for organizing data conditioning as well as data exploration based on graphical transformations. Information workspaces (e.g., [3, 7]) have typically focused on organizing data rather than processes. However, one can imagine using a rooms [7] or book metaphor [3] for organizing an infoVis analysis. For example, rooms or books could be used as logical separators for the different types of tasks (e.g., data preparation, analysis planning, data exploration, etc.) or they could be used to separate the analysis along themes or threads (e.g., separate rooms could be dedicated to investigations of different hypotheses). The challenge in using either of these metaphors, however, is in understanding how, if, and when process support can be bridged across rooms.
5 Conclusion and Future Work Users do perform many other tasks beyond data exploration when using an infoVis environment for analyzing data. In this pilot study, I identified six other categories of analysis tasks besides data exploration: data background and preparation, analysis planning and strategizing, presentation-related, overlay and assessment, re-orientation, and statistics-based tasks. Moreover, not only do users conduct these other tasks, they also rate them highly in terms of their importance to the analysis process. In particular, they rate planning and strategizing tasks significantly higher than exploration tasks. We are currently in the process of performing a detailed analysis of the video data and verbal protocols to identify how often users performed the various tasks as well as how much time they spent on each of them. In the mean time, we note that users
indicated on the post-questionnaire that they typically spend, on average, about 25% of their analysis time on data exploration and at most 40% of their time; thereby spending over half of their analysis time on tasks other than data exploration. In the future, we plan to investigate and prioritize the importance of system support of each of the tasks. Our long-term goal is to work towards a more integrated infoVis framework, one that provides better support to users through the full data analysis process.
Acknowledgments Special thanks to expert users who participated in the study, to Graham Wills for EDV, and to Beki Grinter and Ken Cox for reviewing earlier drafts of this paper.
References
1. Ahlberg, C., & Shneiderman, B. (1994). Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays. CHI'94 Conf. Proc. ACM Press, 313-317.
2. Card, S. and J. Mackinlay. (1997). The Structure of the Information Visualization Design Space. IEEE Proceedings of Information Visualization'97, 92-99.
3. Card, S., Robertson, G., and W. York. (1996). The WebBook and the Web Forager: an information workspace for the World-Wide Web. CHI'96 Conf. Proceedings, 111-119.
4. Chi, E.H., Riedl, J., Barry, P., and J. Konstan. (1998). Principles for Information Visualization Spreadsheets. IEEE Computer Graphics & Applications, 18(4), 30-38.
5. 1991 ASA Data Exposition, Disease Data. Available at: http://www.stat.cmu.edu/disease/.
6. Eick, S., Mockus, A., Graves, T. and Karr, A. (1998). A Web Laboratory for Software Data Analysis. World Wide Web Journal, 12, 55-60.
7. Henderson, J. & S. Card. (1986). Rooms: The use of multiple virtual workspaces to reduce space contention in window-based graphical user interfaces. ACM Transactions on Graphics, 5(3), 211-241.
8. Hibino, S. and Rundensteiner, E. (1996). MMVIS: Design and Implementation of a Multimedia Visual Information Seeking Environment. ACM Multimedia'96 Conf. Proc. NY: ACM Press, 75-86.
9. Jerding, D.F., Stasko, J.T. and Ball, T. (1997). Visualizing interactions in program executions. ICSE'97 Conference Proceedings. NY: ACM Press, 360-370.
10. North, C., Shneiderman, B. and Plaisant, C. (1997). Visual Information Seeking in Digital Image Libraries: The Visible Human Explorer. Information in Images (G. Becker, Ed.), Thomson Technology Labs (http://www.thomtech.com/mmedia/tmr97/chap4.htm).
11. Roth, S., Kolojejchick, J., Mattis, J. and J. Goldstein. Interactive graphic design using automatic presentation knowledge. CHI'94 Conference Proceedings, 318-322.
12. Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. IEEE Proceedings of Visual Languages 1996, 336-343.
13. Wills, G. (1995). Visual Exploration of Large Structured Datasets. New Techniques and Trends in Statistics. IOS Press, 237-246.
Filter Image Browsing Exploiting Interaction in Image Retrieval J. Vendrig, M. Worring, and A.W.M. Smeulders Intelligent Sensory Information Systems, Department of Computer Science University of Amsterdam Kruislaan 403, 1098 SJ Amsterdam, The Netherlands Tel/fax: ++31-20-525.7463/7490 {vendrig, worring, smeulders}@wins.uva.nl
Abstract. In current image retrieval systems the user refines his query by selecting example images from a relevance ranking. Since the top ranked images are all similar, user feedback often results in rearrangement of the presented images only. The Filter Image Browsing method provides better incorporation of user interaction in the retrieval process, because it is based on differences between images rather than similarities. Filter Image Browsing presents overviews of the database to users and lets them iteratively zoom in on parts of the image collection. In contrast to many papers where a new system is just introduced, we performed an extensive evaluation of the methods presented using a user simulation. Results for a database containing 10,000 images show that Filter Image Browsing requires less effort from the user. The implementation of Filter Image Browsing in the ImageRETRO system is accessible via the Web.
1 Introduction
An important future research direction in image retrieval is the introduction of the "human in the loop" [4]. User interaction can be helpful to reduce large diverse domains to small specific domains, exploiting human knowledge about the query and context. In our opinion, there are only three image retrieval methods that are truly interactive, i.e. the input and output of the methods accommodate further interaction with the user. The first two methods are subtypes of Query by Navigation, meaning the user input consists of a choice of one of the navigation controls provided by the system. The subtypes are distinguished based on the relation of the navigation controls to the content of the image database:
– Visual Inspection. Controls are content-independent, e.g. "next" buttons.
– Query by Association. Controls are content-dependent, e.g. hyperlinks.
The third interactive image retrieval method is a special case of the well-known Query by Pictorial Example method, viz. Query by internal Pictorial Example (QiPE). In a QiPE system user input consists of one or more images
being selected from the database. The system returns a relevance ranking, i.e. a list containing the distances from the user input images to images in the database. In practice, systems make use of more than one retrieval method. Visual Inspection is always involved in showing the output images of all other methods. Although trivial, the use of Visual Inspection alone is too time consuming. Therefore we concentrate on the characteristics of the other two interactive methods, viz. Query by Association and QiPE. The advantage of Query by Association is the use of structure, which can easily be visualized. However, it requires construction of the relations between the items in the database, which is usually done manually. QiPE is very flexible and dynamic. The search path is determined run-time by computing relevance rankings. However, there are two important drawbacks to the QiPE method. Firstly, users can get stuck at a local optimum. Secondly, the selection of the initial set of images to show to the user is non-trivial. It is generally neglected in literature though. While traditional systems do not help users after answering a query in further specification of their information need, interactive systems do provide the user more overviews than just an initial one. Recent research does take advantage of interaction by providing relevance feedback, e.g. by extracting knowledge from the browse path [5] or giving feedback by visualizing distances between images [3,7]. The Filter Image Browsing method we developed combines the powerful concepts for interaction found in Query by internal Pictorial Example and Query by Association. Furthermore, our method makes use of the browse path of a user when interacting. Filter Image Browsing is described in section 2. In section 3 evaluation criteria and the simulation environment are described. The results of experiments are presented in section 4. Conclusions are given in section 5.
2 Filter Image Browsing
In Filter Image Browsing (FilIB) a user recursively selects dynamically generated clusters of images. By choosing the cluster most similar to the information need, the user zooms in on a small collection of relevant images. The scatter/gather method [2] uses a comparable approach for the retrieval of textual documents, but focuses on the visualization of features of document clusters in the form of words. The goal of FilIB is to assist in the quick retrieval of images by facilitating interaction between system and user with database overviews. FilIB can be seen as a structuring overlay over QiPE. The structuring overlay handles the lack of overview in traditional QiPE systems. Alternatively, FilIB can be viewed as the addition of a dynamic zoom function for image databases to Query by Association. In this section, the method and its consequences are discussed. Furthermore, a detailed description of the basics of Filter Image Browsing is given.
Fig. 1. Filter Image Browsing retrieval process.
2.1 The Retrieval Process
A Filter Image Browsing retrieval session (Fig. 1) first presents an initial overview of the content of the database in the form of images. The user inspects the images shown and selects the one most similar to the images he is looking for. Then the system performs a similarity ranking for the selected image. Next, the filter step, which characterizes this method, is performed. Only the images most similar to the query image are used in the remainder of the retrieval session. The three steps overview, selection and reduction are repeated until the set of remaining images is small enough to switch to Visual Inspection. Each overview is based on the subset of images in that state of the retrieval process so that the user zooms in on the image database. However, the reduction filter potentially results in the loss of relevant images during the retrieval process. Once an image is excluded, it cannot be retrieved anymore during that particular session. The amount of reduction and the number of required selections should be balanced to minimize loss of relevant images and time spent searching. For the sake of brevity, we do not consider navigational options such as "back (to previous state)" here. More formally explained, let Is be the active set with Ns images in state s of the retrieval process. In the initial state s=0, I0 is the entire image database of size N0. In each state s, the system presents images in the overview Īs ⊆ Is. Then the user selects seed images i ∈ Īs. For the sake of simplicity we limit i to one seed image. The system performs a filter operation ϕ resulting in a new active set of images: Is+1 = ϕ(Is, i). The user ends the session when the desired images are found or when he thinks the images cannot be found. The latter case indicates that either the interactive system failed or the desired images are not present in the database. In the following paragraphs the overview and filter operations are explored in more detail. As shown later, the overview operation is influenced by the filter operation. Therefore, the filter operation is described first.
Filter. The application of a filter results in a change of scope of the active image set. Since the purpose of FilIB is to zoom in on a set of images that suits the user's information need, reduction is the only filter operation considered here. The goal of reduction is getting a smaller set of images still containing the images desired. Similarity ranking of the active set according to user selected
images is an appropriate technique to select images for the new (reduced) set. Is+1 then contains the images from Is that are most similar to the query images i. A reduction factor ρ is used: e.g., if ρ=0.25, Is+1 is four times as small as Is. Thus the reduction operation targets a fixed size for the new active set, so that the outcome is predictable. When all parameters are known, it is easy to compute the number of steps necessary to reduce I0 to an image set suitable for Visual Inspection. Since Ns = ρ^s · N0, s can be computed for a fixed end size Nend:

s_{end} = \frac{\log(N_{end}/N_0)}{\log \rho}    (1)
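Eq. (1) in code; rounding up to whole filter steps is an added assumption, and the example numbers reuse the settings reported later in the paper (10,000 images, ρ = 0.25, presentation size 20).

```python
import math

def reduction_steps(n_start, n_end, rho):
    """Eq. (1): number of reduction steps needed to shrink the active set
    from n_start to about n_end images with reduction factor rho (0 < rho < 1)."""
    return math.ceil(math.log(n_end / n_start) / math.log(rho))

print(reduction_steps(10000, 20, 0.25))  # 5 filter steps
```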
Overview. The goal of the overview function is to present to the user a number of images representative of the active image set. The number of presented images must be small, since the user looks at the images by way of Visual Inspection. The images in Īs are the only access to other images in Is. Combined with the reduction operation this results in orphan images in Is, i.e. images i for which it is not possible to choose an image from Īs resulting in an Is+1 that contains i. Orphan images are always lost in the next state. To guarantee that Is is fully covered by Īs, we introduce the cover constraint, which says every image in Is must be part of at least one of the possible new active sets Is+1. Thus the overview function must produce a presentation set Īs that complies with the following constraint:

\bigcup_{i \in \bar{I}_s} \varphi(I_s, i) = I_s    (2)
The selection of a presentation set that complies with the cover constraint consists of three stages: Īs = postSelection(coverConstraint(preSelection)). In stage 1 a preselection is made, i.e. a set of input images is determined, e.g. randomly or by a user. In stage 2 the preselection is extended so that the total set of images complies with the cover constraint. If the number of images required for compliance with the cover constraint exceeds the maximum number of images shown to the user, the reduction factor should be adapted by user or system. In the optional stage 3 a postselection is made to extend the set to a predefined size. Postselection ensures a predictable amount of output to the user. We have constructed a brute force algorithm that guarantees compliance with the cover constraint for a presentation set Īs. The preselection is one seed image, either randomly chosen from I0, or selected by the user from Īs−1. The image is the first member of Īs. Then it is used to perform a reduction on a copy of Is. Images in the (virtually) reduced set are checked as being a child of a member of Īs. The algorithm then uses the image least similar to the last seed image to find the new seed image. Again, it is added to Īs. The process is repeated until all images in Is are known to be a child of at least one of the images in Īs. In a best case scenario, 1/ρ seed images are necessary. In practice the brute force approach resulted in an average of 12 images necessary to cover our data sets for ρ=0.25. In our opinion Īs typically should contain about 20 images. The
brute force algorithm does not exceed this maximum. We have implemented a scenario for the postselection of images that consists of the random selection of images from Is . The results for this scenario are discussed in section 4.
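A greedy sketch of that brute force cover algorithm follows; the choice of the next seed among the still uncovered images and the termination details are assumptions, since the text only states that the image least similar to the last seed is used.

```python
def build_presentation_set(active, reduce_fn, dist, seed):
    """Keep adding seed images until every image of the active set I_s
    appears in at least one reduced set reachable from the presentation set.
    reduce_fn(active, i) returns the reduced set phi(I_s, i) for seed i;
    dist(a, b) is a feature-space distance between two images."""
    presentation, covered = [], set()
    while len(covered) < len(active):
        presentation.append(seed)
        covered |= set(reduce_fn(active, seed))
        uncovered = [img for img in active if img not in covered]
        if not uncovered:
            break
        # next seed: an image least similar to the last seed, as in the text
        seed = max(uncovered, key=lambda img: dist(img, presentation[-1]))
    return presentation
```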
3 Evaluation
In order to evaluate the Filter Image Browsing concept an experiment on the ImageRETRO system1 was set up. In the following sections we describe the evaluation criteria and the experimental environment.
3.1 Criteria
For evaluation of the effectiveness of Filter Image Browsing, separate criteria are used for each of the two functions, viz. reduction and overview. It is assumed that the information need of the user, expressed by the set of target images T ⊂ I0, is static during the entire session.
Reduction Evaluation. To measure the effect of discarding images during the filter process, the Recall in each state of the retrieval session is computed:

Recall_s = \frac{|I_s \cap T|}{|T|}    (3)

In the best case, the reduction operation discards irrelevant images only, and Recall is 1 (maximum) in every state. The reduction criterion measures to what degree the best case scenario is approached in practice. By definition, Recall is 1 in the initial state 0. To prevent forced loss of relevant images, in every state of the retrieval session Ns has to be equal to or greater than the size of T.
Overview Evaluation. For comparison of presentation set generators, the Sought Recall (SR) [6] in each state of the retrieval session is measured. SR_s expresses the fraction of relevant images the user has actually seen on screen during the current and previous states:

SR_s = \frac{|\bigcup_{j=0}^{s} (\bar{I}_j \cap T)|}{|T|}    (4)

The values of SR for FilIB can be predicted from the Recall values, assuming all images in Is have an equal chance of being selected for Īs:

SR_s \approx \frac{|\bar{I}_s|}{N_s} \cdot Recall_s    (5)

1 http://carol.wins.uva.nl/~vendrig/imageretro/
Since SR is known to the user, the prediction can be used to derive Recall. Subsequently, conclusions about continuing or restarting the session can be made. In our opinion the use of representative images in FilIB should lead to better results than the presentation of a relevance ranking, albeit a local optimum. In order to test this hypothesis, the SR of both methods is compared. FilIB has to outperform QiPE (a relevance ranking system) within a reasonable number of user interactions.
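For concreteness, the two evaluation measures can be computed as in the following small Python sketch (the set representation of the image collections is an assumption for illustration):

```python
def recall(active_set, targets):
    """Recall_s: fraction of target images still present in the active set."""
    return len(set(active_set) & set(targets)) / len(targets)

def sought_recall(presented_history, targets):
    """SR_s: fraction of target images shown on screen up to the current state."""
    seen = set().union(*[set(p) for p in presented_history])
    return len(seen & set(targets)) / len(targets)

# toy example
targets = {"a", "b", "c", "d"}
states = [["a", "x", "b"], ["c", "y"]]           # presentation sets per state
print(recall(["a", "b", "c", "z"], targets))      # 0.75
print(sought_recall(states, targets))             # 0.75
```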
3.2 Experiment
Domain. A large and diverse domain was chosen to populate the image database, viz. the World Wide Web. The collection of 10,000 images (publicly available at http://carol.wins.uva.nl/~vendrig/icons/) is representative of the domain. For every image, index values were computed for 9 simple image features [8], primarily based on the distribution of hue, saturation and intensity.

Ranking. The relevance of an image is computed by averaging the similarity scores of all individual features. Since features cannot be expected to have similar distributions of their values, scores cannot be averaged to a meaningful overall score directly. Therefore a normalization function is added to the feature scoring function. For each feature, a histogram containing the frequency of each similarity score is built a priori. When the feature scoring function is invoked at run time, it looks up the similarity score in the histogram and returns the percentile, e.g. stating that the similarity between the two objects is in the top 5% (a sketch is given at the end of this subsection). The percentiles of the various features can be compared and averaged with one another because they are independent of the distribution and similarity metric used. The use of normalization via frequencies not only allows the use of different types of image features and similarity metrics, e.g. histograms and metric values, but also the use of features of media other than images.

Simulation. In the evaluation experiment users were simulated by a user model as introduced in [1]. In the user model it is assumed that all users are the same and that the decisions they make are based on image features, so that the modeled users are consistent and unbiased. Target sets defining the simulated information need comply with three conditions:

– Small distance in feature space, so that clustering the target images is possible.
– Same style (visual similarity). Style is defined by objective meta-data (common original site) and subjective evaluation of the images.
– Size bandwidth. There is a minimum and a maximum for the number of images, to focus on finding medium-sized groups of images.

For the experiment seven target sets (http://carol.wins.uva.nl/~vendrig/imageretro/target/) were selected from the image collection.
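The percentile normalization described under Ranking might be sketched as follows; the class name, bin count and use of NumPy are illustrative assumptions, not the ImageRETRO code:

```python
import numpy as np

class PercentileNormalizer:
    """Maps raw feature similarity scores to percentiles, using a score
    histogram built a priori over the collection (sketch)."""
    def __init__(self, sample_scores, bins=1000):
        self.edges = np.histogram_bin_edges(sample_scores, bins=bins)
        counts, _ = np.histogram(sample_scores, bins=self.edges)
        self.cdf = np.cumsum(counts) / counts.sum()

    def __call__(self, score):
        idx = np.clip(np.searchsorted(self.edges, score) - 1, 0, len(self.cdf) - 1)
        return self.cdf[idx]          # fraction of sampled scores <= this score

# one normalizer per feature makes heterogeneous features comparable
rng = np.random.default_rng(0)
norm_color = PercentileNormalizer(rng.normal(0.5, 0.1, 10000))
norm_texture = PercentileNormalizer(rng.exponential(2.0, 10000))
overall = np.mean([norm_color(0.62), norm_texture(0.9)])
```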
Fig. 2. Evaluation results. Left: effect of reduction for various reduction factors (Recall plotted against active set size on a logarithmic scale, from 10,000 down to 1). Right: retrieval performance of FilIB (reduction factors 0.2, 0.3, 0.4, 0.5) and QiPE (Sought Recall plotted against the number of interactions, 1–11).
The algorithms used for simulation of FilIB and QiPE make use of the predefined target sets and a given presentation size for Īs (the desired number of pictures to be shown), fixed at 20 in our experiments. The reduction factor ρ is given as a constant for each retrieval session. Both simulations use the choose seed function, which computes which image in the presentation set is most similar to the entire target set. If one of the images in the presentation set is a member of the target set, it is chosen by default. For QiPE Ī0 is given, i.e. Ī0 = overview(I0). The presentation set function returns overview(Is, i) for FilIB, and the top ranked images from Is for QiPE. The convergence criterion is "Ns equals presentation size" for Filter Image Browsing, and "high similarity of Īs and Īs+1 (>80% overlap)" for QiPE. Finally, the presentation set similarity ranking function results in ϕ(Is, i) for Filter Image Browsing, and a plain similarity ranking relative to seed image i for QiPE.
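A toy version of the simulated user's choose seed step could look as follows; treating "most similar to the entire target set" as the highest mean similarity is an assumption of this sketch:

```python
def choose_seed(presentation_set, target_set, similarity):
    """Simulated user choice: a presented target image is chosen by default;
    otherwise the presented image most similar to the target set as a whole
    (here: highest mean similarity, an assumption) is chosen."""
    for image in presentation_set:
        if image in target_set:
            return image
    return max(presentation_set,
               key=lambda i: sum(similarity(i, t) for t in target_set) / len(target_set))
```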
4 Results & Discussion
Simulations were run for the 7 target sets, with 5 different randomized seeds. In the Filter Image Browsing simulations, for each reduction factor reported, the average over 5 slightly varying reduction factors was taken; the reduction factors mentioned are the medians. The results for each Filter Image Browsing simulation are thus based on 175 runs in total. In Fig. 2 the graphs for the evaluation of the reduction operation are shown. Since the loss of relevant images depends on the reduction factor and the resulting size Ns, Recall is expressed as a function of Ns in the reduction effect graph. The reduction evaluation shows that even though Filter Image Browsing does cause the loss of desired images, far more irrelevant than relevant images are discarded. Low reduction factors of about 0.3, as well as high reduction factors of about 0.5, result in good performance. For the comparison of Filter Image Browsing to a QiPE system, a cluster-based overview was evaluated for both systems. In the case of QiPE this means that only the initial presentation set is constructed by
way of clustering techniques. For Filter Image Browsing four different reduction factors were used. The graphs for the Filter Image Browsing simulations stop when converged. The number of states necessary to converge can be computed from equation 1. The maximum number of interactions for both FilIB and QiPE is 11. The retrieval performance graph shows that even though both FilIB and QiPE find approximately the same number of relevant images, the former method requires fewer user interactions to reach that result.
5 Conclusions
The concept of Filter Image Browsing shows that incorporating user interaction in an image retrieval method pays off. Subsequent cycles of database overview and reduction lead the user in a few steps to a small collection of similar images. The consequences of the inherent ambiguity in the selection of representative images based on a combination of feature similarities are left for future research. The simulations used to evaluate the performance of Filter Image Browsing show satisfying results for all criteria considered. We conclude that more elaborate use of user interaction does result in quicker retrieval of images. Furthermore, the results of Filter Image Browsing are more predictable, as the number of user interactions can be computed a priori. This indicates that the method is also helpful when desired images are not present in the image collection, since a user does not have to search indefinitely. Thus the combination of Query by internal Pictorial Example and Query by Association into Filter Image Browsing results in a powerful method for browsing through image databases.
References

1. I. Cox, M. Miller, S. Omohundro, and P. Yianilos. Target testing and the PicHunter Bayesian multimedia retrieval system. In Proceedings of the Advanced Digital Libraries (ADL'96) Forum, pages 66–75, Washington D.C., 1996.
2. D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR'92, Copenhagen, Denmark, 1992.
3. Y. Rubner, C. Tomasi, and L. Guibas. Adaptive color-image embeddings for database navigation. In Proceedings of ACCV, pages 104–111, Hong Kong, 1998.
4. Y. Rui, T. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10:1–23, 1999.
5. Y. Rui, T. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: A power tool in interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):644–655, 1998.
6. G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
7. S. Santini and R. Jain. Beyond query by example. In Proceedings of the Sixth ACM International Multimedia Conference, pages 345–350, Bristol, England, 1998.
8. J. Vendrig, M. Worring, and A. Smeulders. Filter image browsing. Technical Report 5, Intelligent Sensory Information Systems, Faculty WINS, Universiteit van Amsterdam, 1998.
Visualization of Information Spaces to Retrieve and Browse Image Data

Atsushi Hiroike¹, Yoshinori Musha¹, Akihiro Sugimoto¹, and Yasuhide Mori²

¹ Information-Base Functions Hitachi Laboratory, RWCP, c/o Advanced Research Laboratory, Hitachi Ltd., Hatoyama, Saitama 350-03, Japan. Tel. +81-492-96-6111, Fax +81-492-96-6006, {he,sha,sugimoto}@harl.hitachi.co.jp
² Information-Base Functions Tsukuba Laboratory, RWCP, [email protected]
Abstract. We have developed a user interface for similarity-based image retrieval, where the distribution of retrieved data in a high-dimensional feature space is represented as a dynamical scatter diagram of thumbnail images in a 3-dimensional visualization space and similarities between data are represented as sizes in the 3-dimensional space. Coordinate systems in the visualization space are obtained by statistical calculations on the distribution of feature vectors of retrieved images. Our system provides some different transformations from a high-dimensional feature space to a 3-dimensional space that give different coordinate systems to the visualization space. By changing the coordinates automatically at some intervals, a spatial-temporal pattern of the distribution of images is generated. Furthermore a hierarchical coordinate system that consists of some local coordinate systems based on key images can be defined in the visualization space. These methods can represent a large number of retrieved results in a way that users can grasp intuitively.
1 Introduction
In recent years, image retrieval systems based on similarity have been reported by many researchers and companies [1], [2], [3]. We have been studying this issue from two points of view, i.e. “metrization” and “visualization.” For metrization, we are developing pattern recognition technologies applicable to a large-scale database containing various types of images. For visualization, our purpose is to find a suitable user interface for image retrieval systems, which is the main topic in this report. The importance of visualization in similarity-based image retrieval systems has been discussed by several researchers [4]. We will report our prototype system that uses the dynamical 3-dimensional representation.
2 Basic Concepts

2.1 Retrieval System as an Extension of Human Perception
In conventional retrieval systems the conditions of queries are well-defined in an algorithm, e.g. "documents including a keyword", "data created in 1998", etc.
In similarity-based systems, the definitions of conditions are not clear, because similarity is a subjective or psychological criterion. It changes among individuals and can change within a single individual. A user may not have a definite criterion; that is, the condition of a query may be obscure. Unfortunately, we have not yet developed enough technology to give a general definition of the similarity applicable to an arbitrary image. Therefore it is not guaranteed that a user will be satisfied with a decision of the retrieval system. Furthermore, it is not certain that a user would be satisfied even if the system were as intelligent as a human being, since he or she might not want to accept the decision of another person. Our main objective is to develop a similarity-based retrieval system which allows users to encounter as much data as possible. This is in contrast to the concept of conventional retrieval systems, where it is important to reduce the number of retrieved results appropriately. A computer system is inferior to a human being at pattern recognition, but superior in processing a large amount of data. Our system should present an understandable representation of a mass of data to users. It will enable a user to grasp a large information space that cannot be directly perceived. That is, the system will extend the user's perception system. The system should assist users to make their queries clear and also be adaptable to changes of their intentions. It should not discourage users but should enhance their imagination.
2.2 Representation of Metric Information
In our system, as in other similarity-based retrieval systems, each image is represented as a vector or a set of vectors, where each vector is a feature characterizing a specific property of the image. In many cases, the dimensionality of the vector is high. Our basic idea for the user interface is to use visualization of the metric space that consists of feature vectors.

The 2- or 3-dimensional representation of a data distribution in a high-dimensional space is not a new topic, but a classical issue in multivariate analysis. However, whereas the conventional usage applies multivariate analysis to represent the static structure of data, we use it to represent the status of the system changed by a user's operation. In our system, the resultant representation is interactive and dynamic. A user can give a requirement to the system through the visualization space; the requirement induces a change in the internal state of the system; and this change is reported to the user in the visualization space. The internal state of the system consists of abstract quantities, e.g. similarities defined in a high-dimensional space, which are difficult for users to understand. The visualization transforms these quantities into visible ones that can be perceived directly by users. In the desktop metaphor, where logical or symbolic relationships among objects in file systems are visualized, intuitive operations on files and directories are available. In our system, where metric relationships among pattern information are visualized, intuitive operations in retrieving and browsing image data will be available.
Most similarity-based retrieval systems present retrieved images as a list sorted according to the similarity. This is quasi-1-dimensional visualization, which represents the order of the similarity between the key image and the results. In the previous report, we proposed a system using 2-dimensional visualization [5], [6]. In that system, retrieved results are displayed as a scatter diagram of thumbnail images in a 2-dimensional space constructed from features used in the similarity calculation. The coordinates of the space are eigenvectors given by applying principal component analysis (PCA) to the feature vectors of the retrieved results. Therefore the space is optimal in the second order to represent a distribution of retrieved data. List-representation may gather similar images in the higher ranks. However, it loses the impression of orderliness in the lower ranks because adjacency among the retrieved images has no meaning. So a user will be unhappy if given a long list. In the 2-dimensional representation, the configuration of data represents the similarity among retrieved data, and gives a well-ordered impression to users. Consequently the system can report more than 100 data without boring a user.
3 Model of the Metrization

3.1 Features
Color-based feature. A histogram of the color distribution in an image is calculated by dividing the color space into Nr × Ng × Nb boxes. If compositional information is needed, the image is divided into Nx × Ny rectangular areas and multiple histograms are calculated. In this case, the number of dimensions is Nr × Ng × Nb × Nx × Ny.

Gradient-based feature. This feature is based on the direction distribution of gradient vectors in a gray-scale image. Suppose that the directions of vectors within the range −π/2 ≤ θ < π/2 are quantized into Nθ levels. Let v = {vk | k = 0, …, Nθ − 1} be a vector in which accumulation results are stored and (fx, fy) be the gradient vector at a pixel. vk is updated as vk → vk + √(fx² + fy²), where k is determined by the direction of (fx, fy). Finally, each element of v is normalized as vk → vk/S, where S is the number of pixels used in the updating. In the same way as for the color-based feature, features can be extracted in Nx × Ny separate areas of an image. Furthermore, it is well known that the resolution of an image is important for this type of feature. If Nl-level pyramidal data consisting of gray-scale images with different resolutions are given, the number of dimensions of the feature vector is Nθ × Nx × Ny × Nl.

Usually, we reduce the dimensions of these features by using PCA on the whole database. For example, a 1024-dimensional color feature (Nr = Ng = Nb = Nx = Ny = 4) and a 512-dimensional gradient feature (Nθ = 8, Nx = Ny = 4, Nl = 4) are transformed into 100-dimensional feature vectors respectively. This reduction of dimensions is necessary to perform the statistical calculations that depend on the retrieval.
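As a purely illustrative sketch (the use of NumPy and the function name are assumptions, not the authors' code), the gradient-based feature for a single image region could be computed along these lines:

```python
import numpy as np

def gradient_direction_feature(gray, n_theta=8):
    """Direction histogram of gradient vectors for one gray-scale region:
    each bin accumulates gradient magnitude, then is normalized by pixel count."""
    fy, fx = np.gradient(gray.astype(float))
    theta = np.arctan2(fy, fx)
    # fold directions into [-pi/2, pi/2) and quantize into n_theta levels
    theta = (theta + np.pi / 2) % np.pi - np.pi / 2
    k = ((theta + np.pi / 2) / np.pi * n_theta).astype(int) % n_theta
    mag = np.sqrt(fx ** 2 + fy ** 2)
    v = np.bincount(k.ravel(), weights=mag.ravel(), minlength=n_theta)
    return v / gray.size
```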
3.2 Similarity
The definition of similarity between two images is based on the squared distance. Let Nf be the number of feature types, and {xi} and {yi} (i = 0, …, Nf − 1) be the sets of feature vectors of an image X and an image Y respectively. The similarity between X and Y is defined as

$$s(X, Y) = \exp\Bigl(-\sum_i w_i \,\|x_i - y_i\|^2\Bigr) , \qquad (1)$$

where each squared distance is assumed to be normalized appropriately, ‖xi − yi‖² = 1 for all i, and w = {wi} is a non-negative weight vector. Our system allows a user to retrieve images using multiple keys which we call "reference images." A user can also arbitrarily divide reference images into some groups, each of which we call a "cluster." The similarity between an image X and a cluster C is defined as

$$s(X, C) = \min_{Y_i \in C} s(X, Y_i) . \qquad (2)$$
A weight vector w is assigned to each cluster. Increasing the total value of w decreases the similarities within a cluster, which is counterintuitive for users. Therefore we transform a weight vector W specified on the user level into w so that an increase in the mean value of W increases the similarities in the cluster. The definition of w is $w_i = N_f W^{-2} W_i$, where $W = \sum_i W_i$. In addition, each cluster has an attribute of "positive" or "negative." While in the normal case clusters are positive, an image in a negative cluster is one that a user does not want included in the retrieved data. Let the similarity of an image X given by the system be s(X), and let C be the nearest cluster to X. Then s(X) = −s(X, C) if C is negative; otherwise, s(X) = s(X, C). The retrieved results that the system finally reports are limited by a threshold on s(X) and the maximum number of retrieved data.
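A small illustrative sketch of these definitions (assuming NumPy arrays for the per-feature vectors; function names are placeholders, not the system's API):

```python
import numpy as np

def similarity(x_feats, y_feats, w):
    """Eq. (1): s(X, Y) = exp(-sum_i w_i * ||x_i - y_i||^2)."""
    d2 = [np.sum((x - y) ** 2) for x, y in zip(x_feats, y_feats)]
    return float(np.exp(-np.dot(w, d2)))

def cluster_similarity(x_feats, cluster_feats, w):
    """Eq. (2): similarity to a cluster is the minimum over its references."""
    return min(similarity(x_feats, y_feats, w) for y_feats in cluster_feats)

def user_weights_to_w(W, n_features):
    """Map user-level weights W to w so that raising their mean raises similarities."""
    total = float(np.sum(W))
    return n_features * np.asarray(W, dtype=float) / total ** 2
```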
3.3 Statistical Calculation
To obtain an appropriate visualization, we perform statistical calculations depending on the retrieved results. The modified statistics based on retrieved data are calculated within each cluster. Let {Xi} be the set of retrieved images whose nearest cluster is C, and let {xij} (j = 0, …, Nf) be the feature vectors of Xi, which are not necessarily identical to the features used in eq. (1). The statistics of the j-th feature in C are defined as

$$N = \sum_i s(X_i, C)^p \qquad (3)$$

$$\mu_j = N^{-1} \sum_i s(X_i, C)^p \, x_{ij} \qquad (4)$$

$$\Sigma_j = N^{-1} \sum_i s(X_i, C)^p \, (x_{ij} - \mu_j)(x_{ij} - \mu_j)^t . \qquad (5)$$
For p = 0, this is identical to the statistics of retrieved data used in our previous model. For p → ∞, this is the statistics of the reference data.
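The similarity-weighted statistics of eqs. (3)–(5) for one feature type can be written compactly as in the sketch below (a NumPy-based illustration, not the system's code):

```python
import numpy as np

def weighted_cluster_statistics(feats, sims, p=1.0):
    """Similarity-weighted mean and covariance of one feature type within a
    cluster (eqs. 3-5): the weights are s(X_i, C)^p."""
    X = np.asarray(feats, dtype=float)        # shape (n_images, dim)
    w = np.asarray(sims, dtype=float) ** p    # shape (n_images,)
    N = w.sum()
    mu = (w[:, None] * X).sum(axis=0) / N
    D = X - mu
    sigma = (w[:, None, None] * np.einsum("ni,nj->nij", D, D)).sum(axis=0) / N
    return mu, sigma

# p = 0 reduces to plain statistics of the retrieved data;
# large p approaches the statistics of the reference images.
```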
Fig. 1. The user interface of our system (sample images © 1996 PhotoDisc, Inc). a: View of the user interface. b: Spatial-temporal pattern of a distribution of images.
4 Main Features of the User Interface
In the prototype system, a user selects key images from previously retrieved results (the system usually provides some sample keys at the first step). Fig. 1-a shows a view of the user interface. The background is the visualization space, where the dynamical 3-dimensional scatter diagram of thumbnail images is displayed. In this case, up to 1,000 images are retrieved from among about 10,000. There are some panels at the front. The top one is a “collection panel” where a user keeps his favorite images, which has no effect on the retrieval. The panel displayed at the bottom is a “control panel” equipped with GUI components for controlling the view of the visualization space. The other panels are cluster panels. If a user selects a reference image by clicking the mouse,
the image is displayed on a cluster panel specified by the user. In this example, the front one is a positive cluster panel, and the back one is a negative cluster panel. If the user considers a retrieved image to be an error or noise, he can add it to the negative cluster panel. The sliders on the cluster panels are for setting the weights of the features. The main features of the user interface are as follows.

Representation as a spatial-temporal pattern. Many transformations from a high-dimensional feature space to a 2- or 3-dimensional visualization space are possible. In the previous system using the 2-dimensional representation, a user selects a combination of coordinates by himself. In the present system, the coordinate system changes automatically at some intervals (Fig. 1-b). The system provides a series of 3-dimensional coordinate systems. An element of the series determines the position of each image at a time step. The positions between two adjacent steps are defined by linear interpolation between the positions in the two different coordinate systems. This produces an independent motion of each image and a 3-dimensional spatial-temporal pattern of the image distribution. The buttons in the top left-hand corner of the "control panel" show the current step, and clicking them causes the view to jump to an arbitrary step. The upper one of the two sliders below the buttons controls the speed of motion, and the lower one is for setting the view at an arbitrary point between two adjacent steps while pausing the motion.

Representation of similarity. The similarity of each retrieved image is represented as the size of the image in the 3-dimensional space. The transformation from a similarity to a size has two parameters, S = βs^α, where α determines the intensity of the effect of the similarity and β is a general scaling parameter. The GUI component on the left side of the "control panel," the rectangular area with the small black circle, is the controller for this function. The x-y coordinates of the area correspond to the parameters α and β.

Local coordinate systems and hierarchical representation. Fig. 2 illustrates the local coordinate systems based on reference images. Reference images are located on a global coordinate system, and each of the retrieved results is located on a local coordinate system around the most similar reference. In the case of Fig. 1-a, the statistics of the retrieved data were calculated (eq. 5), and the eigenspaces of the color-based and gradient-based features were used as local and global coordinates respectively. The local coordinates enable an intuitive representation of the relationship between a key image and each retrieved image, and a more structured representation when displaying a large number of images. The coordinates used in visualizing are defined as a cluster attribute. When some clusters are defined, the center of each cluster (the mean vector defined in eq. 4) is located on the root coordinate system, which is also constructed from the feature vectors.

Walk-through functions. The arrow-shaped buttons on the right of the control panel are GUI components for the walk-through: x- or y-translations,
z-translations, and rotations. Our system provides two types of view rotation: a viewer-centered rotation and an object-centered rotation. The latter allows a user to rotate around a selected image. A change of similarities is caused by adding or removing reference images or by changing the weights of the features. For example, if one selects an image belonging to an island of images as a new key, the image becomes larger and separates from the island. Similar images also become larger (some of them gradually appear in the visualization space). They crowd around the key image and make a new island of images. If one removes a key image, the island around it disappears. The similar images become smaller, and those that are not similar to any of the remaining key images disappear from the space.
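As an illustration of the underlying mechanics (the authors do not publish code, so the names and the use of NumPy here are assumptions), a PCA-derived coordinate system and the interpolation that drives the animated scatter diagram might look like this:

```python
import numpy as np

def pca_coordinate_system(features, k=3):
    """Mean and top-k principal axes of the retrieved feature vectors (rows)."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]                      # axes: shape (k, dim)

def positions(features, mean, axes):
    """3-D coordinates of each image in one coordinate system."""
    return (np.asarray(features, dtype=float) - mean) @ axes.T

def interpolate(pos_a, pos_b, t):
    """Linear interpolation between two coordinate systems (0 <= t <= 1),
    yielding one intermediate frame of the spatial-temporal pattern."""
    return (1.0 - t) * pos_a + t * pos_b
```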
Fig. 2. Local coordinate systems.
5 Outline of the System Architecture
The software of the system consists of two server programs and client programs. An M-server ("metrization server") performs various kinds of calculations based on features. A V-server ("visualization server") displays a virtual 3-dimensional space. The client programs have the usual GUI components and accept a user's actions, except for some events which occur in the visualization space. The M-server keeps the features of all the images as compressed data in its memory. In response to a request from a client program, the M-server performs the image retrieval, generates the motion routes of retrieved images with the statistical calculations, and sends the series of routes to the V-server. The V-server receives data from the M-server in parallel with updating the view. A V-server needs a large texture mapping memory to create the real-time motions of more than 1,000 images. So we used an Onyx2 (Silicon Graphics Inc.) with 64 MB of texture mapping memory. The most serious problem is that the V-server has to update a lot of objects in the visualization space every time the retrieval condition changes. This is different from conventional visualization systems or from user interface models using virtual reality, whose contents
usually remain unchanged. In the worst case, the V-server has to load the image files of all the retrieved data. For example, it takes about 8 sec. on our system to read 1,000 thumbnail (64 × 64) images (encoded as JPEG files), which degrades the real-time response to the user's actions. If all images are cached in memory, a similar problem occurs because it is necessary to register the data in the texture memory. To minimize the degradation of the real-time response, the M-server sorts the data to be sent with respect to their importance, defined by a combination of the present similarities and the change in similarities since the last retrieval. If a new request for updating the contents is made while the route data is being received, the M-server discards the remainder.
6 Discussion and Conclusions
In this paper, we reported our image retrieval system that uses the dynamical 3-dimensional visualization, which allows users to access a large information space. Compared with the 2-dimensional visualization system in our previous report, the present system is superior in showing a lot of retrieved results to users without boring them, and in its intuitive representation of the internal state of the system. Cost is a clear and simple problem in this system, which needs a high-end graphics workstation, whereas our 2-dimensional version can run on a PC. The more critical problem is in the controllability of the virtual 3-dimensional space. A 3-dimensional representation is useful in browsing, but sometimes a user feels it is inconvenient for operations such as selecting data. A user has to operate a mouse, which is a 2-dimensional pointing device, in order to select an object in the virtual 3-dimensional space. Furthermore in the present system, many functions of the user interface are supplied by conventional 2-dimensional GUI components. A user has to frequently move the cursor from the visualization space to the GUI panels, and sometimes he feels frustration. So we have to develop new methods of data selection and data input to make a user interface with 3-dimensional representation that is easy to operate.
References

1. Flickner, M., et al.: "Query by image and video content: the QBIC system," IEEE Computer, 28(9), 23–32, 1995.
2. Faulus, D.S., Ng, R.T.: "An expressive language and interface for image querying," Machine Vision and Applications, 10, 74–85, 1997.
3. Stricker, M., Dimai, A.: "Spectral covariance and fuzzy regions for image indexing," Machine Vision and Applications, 10, 66–73, 1997.
4. Gupta, A., Santini, S., Jain, R.: "In search of information in visual media," Communications of the ACM, 40(12), 35–42, 1997.
5. Musha, Y., Mori, Y., Hiroike, A.: Visualizing Feature Space for Image Retrieval (in Japanese), Procs. of the 3rd Symposium on Intelligent Information Media, 301–308, 1997.
6. Musha, Y., Mori, Y., Hiroike, A., Sugimoto, A.: "An Interface for Visualizing Feature Space in Image Retrieval," Machine Vision and Applications, 447–450, 1998.
Mandala: An Architecture for Using Images to Access and Organize Web Information

Jonathan I. Helfman¹,²

¹ AT&T Labs - Research, Shannon Laboratory, Human Computer Interaction Department, Room B255, 180 Park Avenue, Florham Park, NJ 07932-0971, USA
² University of New Mexico, Computer Science Department, Room FEC 313, Albuquerque, NM 87131, USA
[email protected]
http://www.cs.unm.edu/~jon/
Abstract. Mandala is a system for using images to represent, access, and organize web information. Images from a web page represent the content of the page. Double-clicking on an image signals a web browser to display the associated page. People identify groups of images visually and share them with Mandala by dragging them between windows. Groups of image representations are stored as imagemaps, making it easy to save visual bookmarks, site indexes, and session histories. Image representations afford organizations that scale better than textual displays while revealing a wealth of additional information. People can easily group related images, identify relevant images, and use images as mnemonics. Hypermedia systems that use image representations seem less susceptible to classic hypertext problems. When image representations are derived from a proxy server cache, the resulting visualizations increase cache hitrates, access to relevant resources, and resource sharing, while revealing the dynamic access patterns of a community.
1 Using Images to Represent Web Information
Most web software represents information textually. We use textual bookmarks and history lists in browsers, textual indexes and similarity metrics in search engines, and textual concept hierarchies in link taxonomies. Textual representations and organizations shape our experience of interacting with information [13]. On the web, a selectable image is a link to another resource. Imagemaps link to multiple resources. In general, a selectable image contains more data than a typical string of selectable text and has the potential to provide more of an indication if the link is relevant and worth following. In many cases, images from a web page represent the content of the page. Even the images used for decoration, navigation, and advertisement often provide additional characterizations of a web site (although these can usually be recognized and suppressed). Web images may represent their context for many reasons. There is a rich history of using images to illustrate and illuminate
Fig. 1. Mandala client displaying three groups of representations.
manuscripts. Most technical documents contain highly descriptive illustrations, charts, or diagrams. On the web, image formats were standardized before audio or text stylesheets. As a result images are not only used for illustration, but they have become a common strategy for site differentiation. Mandala lets people visualize large groups of web pages by displaying selectable thumbnails of the pages’ images. Groups may be determined in many ways: the URLs in a bookmark file, the history of a browsing session, the results of a query, etc. Mandala’s displays function as visual interactive indexes; they provide an overview of large amounts of information without sacrificing access. Mandala automatically builds groups of representations and provides a user interface for viewing and editing groups of representations and saving them as imagemaps. A snapshot of a Mandala client is shown in Fig. 1. Creating a system that uses images to represent web information requires solving several technical problems. While digital text enjoys an efficient representation, digital images require more memory and bandwidth, as well as support for compression, decompression, and scaling. Web-based systems also require support for HTTP monitoring and HTML parsing. Additional requirements include the ability for people to identify image groups visually and share them with the system easily. These technical challenges have been addressed by Mandala’s modular architecture, which provides a flexible and general platform for visual information research on the web. Mandala’s architecture allows it to function in multiple ways (e.g. as a GUI for organizing and maintaining information represented by images, a visual bookmark facility, an imagemap editor, a cache visualization tool, and a visual look-ahead cache).
Mandala has been fully implemented and is almost fully operational. Mandala’s image server (see Sec. 2.1) is used to generate imagemaps for CoSpace, an experimental VRML system at AT&T Labs. Mandala’s proxy server (see Sec. 2.2) is used to monitor web browsing sessions for PadPrints, a web navigation tool, which builds multi-scale maps of a user’s browsing history[9]. The remainder of this paper describes Mandala’s architecture, as well as some preliminary observations about using image representations.
2 Architecture and Implementation
Figure 2 illustrates Mandala’s component structure. White boxes represent the Mandala components. Solid grey shapes represent standard components that are used without modification (e.g. web servers and browsers). Light grey boundaries indicate runtime constraints. Arrows indicate the exchange of HTTP-like messages consisting of ASCII headers and optional ASCII or binary data.
Fig. 2. Mandala’s Component Structure. Mandala is less concerned with supporting visual queries through feature extraction and indexing than previous Visual Information Systems[11]. Mandala’s architecture does not preclude a component for image analysis and feature extraction, but web images are surrounded by a rich context of meta-information, which seems to provide an ample feature set for indexing[19]. 2.1
Imago: Mandala’s Image Server
Mandala’s image server, called Imago, has been developed to support fast image data compression and decompression, image shrinking, imagemap creation, and
image meta-data extraction (i.e. information about the image that is normally hidden in the image's header, such as its dimensions or total number of colors). While the other Mandala components are written in Java, Imago is written in C, both for speed and reliability. Publicly available C code to read and write GIF and JPG images is relatively robust [3,10] compared to the Java image decompression classes, which frequently throw undocumented exceptions when trying to decode GIF or JPG variations that they don't fully support. Imago creates thumbnails and imagemaps according to client specifications. A minimal thumbnail specification is a URL for the input image, which causes Imago to use default scale factors and filtering functions. Clients can specify the maximum thumbnail dimension, filter function, return style (whether to return the URL or the data), and replacement style (whether to overwrite an existing thumbnail of the same name or generate a new name). Additional options ignore input images that don't meet minimal requirements for size or number of colors. Imago's default filter function uses a form of hierarchical discrete correlation (HDC) that scales the input by one half [5]. To scale images by arbitrary amounts, Imago uses successive invocations of HDC and a single invocation of a more general image rescaling algorithm that creates and scales filter kernels based on the scale factor [20]. The possible slowness of the final pass is minimized because scale factors are guaranteed to be less than one half. A minimal imagemap specification is a stub name for the output and a list of URL pairs for web images and associated resources. The imagemap command supports each of the thumbnail options described above, as well as layout style (grid, tile, random, user-set) and background style (solid color, tiled image). Imago also rates imagemaps according to a heuristic that attempts to identify images that are used for decoration, navigation, or advertisement. The rating scheme is based on image meta-information as opposed to image content. It ranks large, square images with many colors higher than small, wide or tall images with few colors. The rate of an imagemap is the average rate of the imagemap's thumbnails. Ratings allow clients to identify imagemaps that are most likely to contain useful representations.
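A toy version of such a meta-information-based rating (the weighting and example values are invented for illustration; Imago's actual heuristic is not reproduced here) could be:

```python
def rate_thumbnail(width, height, n_colors):
    """Toy rating in the spirit of Imago's heuristic: favor large, roughly
    square images with many colors over small, elongated, low-color ones."""
    area = width * height
    squareness = min(width, height) / max(width, height)   # 1.0 = square
    color_term = min(n_colors, 256) / 256.0
    return area * squareness * color_term

def rate_imagemap(thumbs):
    """Imagemap rate = average rate of its thumbnails."""
    return sum(rate_thumbnail(*t) for t in thumbs) / len(thumbs)

# e.g. a 120x120 photo with many colors outranks a 400x40 banner with few
print(rate_thumbnail(120, 120, 256) > rate_thumbnail(400, 40, 16))  # True
```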
Mirage: Mandala’s Proxy Server
Mandala’s proxy server, called Mirage, has been developed to support local caching of images, transparent web browser monitoring, and HTML parsing. A proxy server is a program that sits between web servers and web clients, such as browsers (see [1] and [12] for good introductions to proxy servers). Requests are copied from clients to servers. Responses, if any, are copied back to clients and possibly cached, in case they are requested again. Proxy servers provide several benefits, such as reduced web latency and increased effective bandwidth[12], increased web access (when the destination server is down, but the resource is cached), and savings on long-distance connection charges[15]. The two main problems for a proxy server implementation are how to determine if a cached file is fresh and which files to remove when the cache gets full. Freshness is difficult to determine because few servers use the Expires HTTP
header and there is no way for a destination server to inform a proxy server when a resource has been updated [14]. When the Expires header is not used, Mirage uses a typical heuristic, which estimates freshness time as a factor of the last-modified time. If the server does not transmit a last-modified time, Mirage assigns the resource a maximum age. When the cache is full, Mirage uses a Least-Recently-Used (LRU) algorithm for removing old files from the cache [18]. Researchers disagree about the performance of LRU [1,17,18]; however, LRU has a more efficient implementation than algorithms that need to compute scores and insert into sorted lists [8].

Fig. 3. Snapshot of a Dynamic Cache Visualization.
their needs or interests. When displaying images as they are cached, the cache visualizations reveal dynamic usage patterns of entire communities and promote unprecedented sharing of resources. Streams of similar images represent active hypertext trails of anonymous community members (e.g. the cars in Fig. 3). If any trail is of interest, selecting an image in the trail allows an observer to become a participant immediately, blazing a new trail, which is soon displayed as an additional stream of images in the communal montage.
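For illustration only, the freshness heuristic and LRU eviction described in this subsection might be sketched as below; the parameter names, the freshness factor and the cache interface are assumptions, not Mirage's actual code:

```python
from collections import OrderedDict

def is_fresh(now, fetched_at, last_modified=None, expires=None,
             factor=0.1, max_age=86400):
    """Honor Expires when present; otherwise estimate the freshness lifetime
    as a factor of the document's age at fetch time, capped by a maximum age."""
    if expires is not None:
        return now < expires
    if last_modified is not None:
        lifetime = factor * (fetched_at - last_modified)
        return now - fetched_at < min(lifetime, max_age)
    return now - fetched_at < max_age

class LRUCache(OrderedDict):
    """Least-Recently-Used eviction for cached resources."""
    def __init__(self, capacity):
        super().__init__()
        self.capacity = capacity

    def access(self, key, value=None):
        if key in self:
            self.move_to_end(key)          # mark as most recently used
            return self[key]
        if value is not None:
            self[key] = value
            if len(self) > self.capacity:
                self.popitem(last=False)   # evict the oldest entry
        return value
```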
2.3 Mandala Server
The Mandala Server automatically groups images based on origin (e.g. images from the same web server or from the same browsing session), but could be extended to use similarity of contextual information [19] or extracted features [11]. The Mandala Server also groups images from pages that are reachable from the page most recently requested by the user's web browser, a capability that allows a Mandala client to function as a visual look-ahead cache. Web searches initiated with any search engine define groups of images from pages that match the query. The Mandala Server communicates with multiple Mandala clients by posting messages to a bulletin-board and sending the clients brief messages that they should check the bulletin-board when they have a chance. This protocol prevents the server from swamping the clients with a flood of messages and image data. Clients are free to read updates when user activity decreases.
2.4 Mandala Client
Each Mandala client supports automatic layout and animation of image representations. User interactions include image selection, image positioning, and animation control. Clients may be full-fledged applications or Java applets (with slightly reduced functionality). Mandala clients communicate only with a single Mandala server. Double-clicking on an image causes the client to signal the Mandala server, which signals a web browser (via its client API) to display the associated page. Mandala clients use separate windows for each group of representations (see Fig. 1). People edit imagemaps by repositioning thumbnails. People edit groups by dragging and dropping thumbnails between windows. People define new groups by dragging thumbnails into an empty window.
3 Preliminary Observations about Image Representations
With the increased access afforded by viewing large groups of image representations come associated copyright and privacy concerns. Taking images off web pages may seem to be a copyright violation, but if they are used as an interactive index that promotes access to the information, then they are no more of an infringement than the textual indexes built by search engines. In both cases, the intent is to promote access to the original information and encourage the livelihood
of the purveyors of the original information. Privacy becomes a concern when people do not wish to share information about their browsing behavior with a community. In this case, they may choose not to use a proxy server, but then they will not benefit from the communal cache. Image representations seem to diminish the effect of several classic hypertext problems. For example, spatial disorientation is caused by unfamiliarity with possibly complex hypertext structures[6], while cognitive overhead is caused by the general level of complexity associated with multiple choices[7]. Because image representations deemphasize hypertext structure, people navigate through clusters of similar images instead of structures of hypertext links. Images may provide better navigational cues than textually-labeled links because images have been shown to improve human memory for associated information[16]. Other hypertext problems include lack of closure, the inability to determine which pages have been visited or if any nearby unvisited pages are relevant, and embedded digression, the inability to manage multiple, nested digressions[7]. Because people have a remarkable memory for images, they can distinguish quickly between familiar and unfamiliar images[21]. A system that uses image representations may therefore help people identify new information when seeing an unfamiliar image and find previously accessed information by remembering and locating a familiar image. The hypertext problem of trail-blazing, the inability to determine if a link is worth following, may be diminished when using image representations because images have the potential to provide a better indication if a link is worth following than brief textual labels. The hypertext problem of session summarization, the inability to save the state of a browsing session, is alleviated by storing the image representations associated with a browsing session as an imagemap, which provides a visual session summary that preserves the interactive nature of the browsing experience.
4 Conclusions
This paper describes Mandala, a system that provides visual interactive overviews of large amounts of information by using images from web pages to represent those pages (Sec. 1). Mandala’s architecture is discussed and distinguished from earlier Visual Information Systems that support image indexing and visual querying (Sec. 2). The use of a proxy server for transparent monitoring of web browsers is described, and cache visualizations are reported to improve cache hitrates and reveal dynamic communal access patterns (Sec. 2.2). In addition, image representations are shown to increase concerns over copyright and privacy while diminishing classic problems associated with hypertext (Sec. 3).
References

1. Marc Abrams, Charles R. Standridge, Ghaleb Abdulla, Stephen Williams, and Edward A. Fox. Caching proxies: Limitations and potentials. In Proceedings of the Fourth International World Wide Web Conference, December 1995. http://www.w3.org/pub/Conferences/WWW4/Papers/155/
2. Rob Barrett and Paul Maglio. Intermediaries: New places for producing and manipulating web content. In Proceedings of the Seventh International World Wide Web Conference, 1998. http://wwwcssrv.almaden.ibm.com/wbi/www7/306.html
3. T. Boutell. http://www.boutell.com/gd/
4. Charles Brooks, Murray S. Mazer, Scott Meeks, and Jim Miller. Application-specific proxy servers as HTTP stream transducers. In Proceedings of the Fourth International World Wide Web Conference, December 1995. http://www.w3.org/pub/Conferences/WWW4/Papers/56/
5. Peter J. Burt. Fast filter transforms for image processing. Computer Graphics and Image Processing, 16(1):20–51, 1981.
6. Jeff Conklin. A survey of hypertext. Technical Report STP-356-86, MCC, February 1987.
7. Carolyn Foss. Tools for reading and browsing hypertext. Information Processing and Management, 25(4):407–418, 1989.
8. Jonathan Helfman. Insights and surprises using image representations to access and organize web information. AT&T Web Implementor's Symposium, July 1998. http://www.cs.unm.edu/~jon/mandala/wis98/wis98.html
9. Ron Hightower, Laura Ring, Jonathan Helfman, Ben Bederson, and Jim Hollan. Graphical multiscale web histories: A study of PadPrints. In Hypertext '98 Proceedings, pages 58–65, 1998.
10. Independent JPEG Group. ftp://ftp.uu.net/graphics/jpeg/
11. Clement H. C. Leung and W. W. S. So. Characteristics and architectural components of visual information systems. In Clement Leung, editor, Visual Information Systems, Lecture Notes in Computer Science 1306. Springer, 1997.
12. Ari Loutonen and Kevin Altis. World-wide web proxies. In Proceedings of the First International Conference on the World-Wide Web, WWW '94, 1994. http://www1.cern.ch/PapersWWW94/luotonen.ps
13. Marshall McLuhan. The Gutenberg Galaxy: the Making of Typographic Man. University of Toronto Press, 1962.
14. J. C. Mogul. Forcing HTTP/1.1 proxies to revalidate responses, May 1997. http://www.es.net/pub/internet-drafts/draft-mogul-http-revalidate-01.txt
15. Donald Neal. The Harvest object cache in New Zealand. In Proceedings of the Fifth International World Wide Web Conference, 1996. http://www5conf.inria.fr/fich html/papers/P46/Overview.html
16. Allan Paivio. Imagery and Verbal Processes. Holt, Rinehart, & Winston, 1971.
17. Tomas Partl and Adam Dingle. A comparison of WWW caching algorithm efficiency. http://webcache.ms.mff.cuni.cz:8080/paper/paper.html
18. James E. Pitkow and Margaret M. Recker. A simple yet robust caching algorithm based on dynamic access patterns. In Proceedings of the Second World Wide Web Conference, 1994. http://www.ncsa.uiuc.edu/SGD/IT94/Proceedings/DDay/pitkow/caching.html
19. Neil C. Rowe and Brian Frew. Finding photograph captions multimodally on the world wide web. In AAAI-97 Spring Symposium Series, Intelligent Integration and Use of Text, Image, Video, and Audio Corpora, pages 45–51, March 1997.
20. Dale Schumacher. General filtered image rescaling. In David Kirk, editor, Graphics Gems III, pages 8–16. Academic Press, 1994.
21. Lionel Standing, Jerry Conezio, and Ralph Norman Haber. Perception and memory for pictures: Single-trial learning of 2500 visual stimuli. Psychonomic Science, 19(10):73–74, 1970.
A Compact and Retrieval-Oriented Video Representation Using Mosaics

Gabriele Baldi, Carlo Colombo, and Alberto Del Bimbo

Università di Firenze, Via Santa Marta 3, I-50139 Firenze, Italy
Abstract. Compact yet intuitive representations of digital videos are required to combine high quality storage with interactive video indexing and retrieval capabilities. The advent of video mosaicing has provided a natural way to obtain content-based video representations which are both retrieval-oriented and compression-efficient. In this paper, an algorithm for extracting a robust mosaic representation of video content from sparse interest image points is described. The representation, which is obtained via visual motion clustering and segmentation, features the geometric and kinematic description of all salient objects in the scene, being thus well suited for video browsing, indexing and retrieval by visual content. Results of experiments on several TV sequences provide an insight into the main characteristics of the approach.
1 Introduction
Quite recently, the rapid expansion of multimedia applications has encouraged research efforts in the direction of obtaining compact representations of digital videos. On the one hand, a compact video encoding is required for high quality video storage; on the other hand, an ad hoc video representation needs to be devised at archival time in order to ease video browsing and content-based retrieval. Past approaches to video compression (see e.g. the standards MPEG 1 and 2) have privileged image processing techniques which, taking into account only the signal-level aspects of visual content, emphasize size reduction over retrieval efficiency. More recently, browsing-oriented computer vision techniques have been presented to represent videos by reconstructing the very process of film making [4]. These techniques are capable of segmenting the video into a number of "shots," each delimited by film editing effects such as cuts, dissolves, fades, etc. The description of shot content then relies on the extraction of salient "keyframes." However, the above techniques have the limitation of providing only partial information about video content, since it is impossible to reconstruct a video only from its keyframes. The advent of mosaicing techniques [7,8] paved the way to content-based video representations which are both retrieval- and compression-efficient. Such techniques reduce data redundancy by representing each video shot through a single patchwork image composed using all of its frames.
In this paper, a method to represent video content through image mosaics is described. Mosaics are extracted from video data through corner-based tracking and a 2D affine motion model. An original motion clustering algorithm, called DETSAC, is proposed. The obtained video representation features the image and motion description of all salient objects in the scene, and is well suited to both video browsing and retrieval by visual content.
2 Video Segmentation
The primary task of video analysis is video editing segmentation, i.e. the identification of the start and end points of each shot. Such a task implies solving two problems: i) avoiding the incorrect identification of shot changes due to rapid motion or sudden lighting changes in the scene; ii) detecting sharp transitions (cuts) as well as gradual ones (dissolves). To avoid false shot change detection, a correlation metric based on HSI color histograms is used, which is highly insensitive even to rapid continuous light variations while remaining reliable for detecting cuts. To detect dissolves, a novel algorithm based on corner statistics is used, which monitors the minima in the number of salient points detected. During a dissolve, while the previous shot gradually fades out and its associated corners disappear, the new one fades in, with its corners still under the saliency threshold [3].
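A simplified sketch of histogram-correlation cut detection in this spirit is given below; the bin counts, the correlation measure and the threshold are assumptions for illustration, not the authors' exact metric:

```python
import numpy as np

def hsi_histogram(frame_hsi, bins=(8, 4, 4)):
    """Joint HSI histogram of one frame (frame_hsi: H x W x 3 array in [0, 1])."""
    h, _ = np.histogramdd(frame_hsi.reshape(-1, 3), bins=bins,
                          range=[(0, 1)] * 3)
    return h.ravel() / h.sum()

def histogram_correlation(h1, h2):
    """Normalized correlation between two frame histograms."""
    a, b = h1 - h1.mean(), h2 - h2.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_cut(prev_frame, frame, threshold=0.6):
    """Declare a cut when successive histograms are poorly correlated."""
    return histogram_correlation(hsi_histogram(prev_frame),
                                 hsi_histogram(frame)) < threshold
```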
3 Shot Analysis
Once a video is segmented into shots, each shot is processed so as to extract its 2D dynamic content and allow its mosaic representation. Image motion is processed between successive frames of the shot via a three-step analysis: (1) corner detection; (2) corner tracking; (3) motion clustering and segmentation.
Fig. 1. Corner detection. The pixel locations corresponding to extracted corners are shown in white.

Corner Detection. An image location is defined as a corner if the intensity gradient in a patch around it is not isotropic, i.e. it is distributed along two preferred directions. Corner detection is based on the algorithm originally presented by Harris and Stephens in [6], classifying as corners image points with
large and distinct values of the eigenvalues of the gradient auto-correlation matrix. Fig. 1 (right) shows the corners extracted for the indoor frame of Fig. 1 (left).

Corner Tracking. To perform intra-shot motion parameter estimation, corners are tracked from frame to frame, according to an algorithm originally proposed by Shapiro et al. in [9] and modified by the authors to enhance tracking robustness. The algorithm optimizes performance according to three distinct criteria, namely:

– Frame similarity: the image content in the neighborhood of a corner is virtually unchanged in two successive frames; hence, the matching score between image points can be measured via a local correlation operator.
– Proximity of correspondence: as frames go by, corner points follow smooth trajectories in the image plane, which allows the search space for each corner to be reduced to a small neighborhood of its expected location, as inferred from previous tracking results.
– Corner uniqueness: corner trajectories cannot overlap, i.e. it is not possible for two corners to share the same image location at the same time. Should this happen, only the corner point with the higher correlation is maintained, while the other is discarded.
Fig. 2. Corner tracking examples. 1st row: translation induced by camera panning; 2nd row: divergence induced by camera zooming; 3rd row: curl induced by camera cyclotorsion.
Since the corner extraction process is heavily affected by image noise (the number and individual location of corners varies significantly in successive frames; also,
a corner extracted in one frame, albeit still visible, could be ignored in the next one), the modified algorithm implements three different corner matching strategies, ensuring that the above tracking criteria are fulfilled:

– strong match, taking place between pairs of locations classified as corners in two consecutive frames;
– forced match, image correlation within the current frame, in the neighborhood of a previously extracted corner;
– backward match, image correlation within the previous frame, in the neighborhood of a currently extracted corner.

These matching strategies ensure that a corner trajectory continues to be traced even if, in some instants, the corresponding corner fails to be detected. Fig. 2 shows corner tracking examples from three different commercial videos, featuring diverse kinds of 2D motions induced by specific camera operations. Each row in the figure shows two successive frames of a shot, followed by the traced corner pattern.

Motion clustering and segmentation. After corner correspondences have been established, an original motion clustering technique is used to obtain the most relevant motions present in the current frame. Each individual 2D motion of the scene is detected and described by means of the affine motion model

$$\begin{pmatrix} x' - x \\ y' - y \end{pmatrix} = \begin{pmatrix} a_0 & a_1 \\ a_3 & a_4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} a_2 \\ a_5 \end{pmatrix} \qquad (1)$$

characterizing image point displacements, where (x, y) and (x', y') denote the coordinates of the same point in the previous and current frames, respectively. Motion clustering takes place starting from the set of corner correspondences found for each frame. A robust estimation method is adopted, guaranteeing on the one hand an effective motion clustering, and on the other a good rejection of false matches (clustering outliers). The clustering technique, called DETSAC ("DETerministic SAmple Consensus"), is an adaptation of the "RANdom SAmple Consensus" (RANSAC) algorithm ([5], see also [2]) to the problem of motion clustering. DETSAC operates as follows. For each trajectory obtained by corner tracking, the two closest corner trajectories are used to compute the affine transformation (i.e., the 6 degrees of freedom a0, …, a5) which best fits the trajectory triplet (each corner trajectory provides two constraints for eq. (1), hence three non-collinear trajectories are sufficient to solve for the six unknown parameters). The number of trajectories "voting" for each obtained transformation candidate determines the consensus for that candidate. Iterating the candidate search and consensus computation over all possible corner triplets, the dominant motion with maximum consensus is obtained. All secondary motions are computed iteratively in exactly the same way, after the elimination of all the corner points following the dominant motion. The RANSAC algorithm is conceived to reject outliers well in a set of data characterized by a unimodal population. Yet, in image motion segmentation, it is highly probable that two or more data populations are present at
a given time instant, corresponding to independently moving objects. In such cases, RANSAC is likely to produce grossly incorrect motion clusters. As an example, Fig. 3 shows that, when attempting to cluster data from two oppositely translating objects, RANSAC wrongly interprets the two motions as a single rotating motion.
Fig. 3. Motion clusters for two translating objects. Left: Ground truth solution. Right: RANSAC solution.
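The candidate search of DETSAC can be sketched as follows. This is a simplified, hedged version: the least-squares fit, the inlier tolerance and the helper names are illustrative choices; the affine parameterization follows eq. (1).

```python
import numpy as np

def fit_affine(src, dst):
    """Fit (a0..a5) of eq. (1) from 3 or more correspondences.
    src, dst: (N, 2) arrays of (x, y) in the previous and current frame."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    A = np.column_stack([src, np.ones(len(src))])      # [x  y  1]
    d = dst - src                                       # displacements
    ax, *_ = np.linalg.lstsq(A, d[:, 0], rcond=None)    # a0, a1, a2
    ay, *_ = np.linalg.lstsq(A, d[:, 1], rcond=None)    # a3, a4, a5
    return np.concatenate([ax, ay])

def consensus(params, src, dst, tol=1.5):
    """Number of trajectories whose displacement is predicted within tol pixels."""
    A = np.column_stack([src, np.ones(len(src))])
    pred = np.column_stack([A @ params[:3], A @ params[3:]])
    err = np.linalg.norm(dst - src - pred, axis=1)
    return int((err < tol).sum())

def detsac(src, dst, k=2):
    """For each trajectory, fit an affine map to it and its k nearest neighbours;
    return the candidate with maximum consensus (the dominant motion)."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    best, best_votes = None, -1
    for i in range(len(src)):
        d = np.linalg.norm(src - src[i], axis=1)
        idx = np.argsort(d)[:k + 1]                     # the point plus its k neighbours
        params = fit_affine(src[idx], dst[idx])
        votes = consensus(params, src, dst)
        if votes > best_votes:
            best, best_votes = params, votes
    return best, best_votes
```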
Although DETSAC is conceived to solve the multimodal distribution problem, it does so at the cost of a diminished parameter estimation accuracy (nearby trajectories tend to amplify ill-conditioning). Therefore, DETSAC is only meant to provide a rough motion estimate, to be refined later via an iterative weighted least squares strategy, in which the larger the departure of an individual observation from the current estimate, the lower its associated weight. In this way, the robustness to outliers of DETSAC is efficiently coupled with the estimation accuracy of least squares. Besides, the above clustering algorithm would exhibit a certain tendency to fragment each cluster into subclusters. To avoid this, a further cluster merging step is performed, which ensures that each new cluster is distant enough from previous clusters in the space of affine transformations [10]; clusters whose distance falls below a minimum threshold are merged together. The distance between two affine transformations A = (a_0, ..., a_5) and B = (a'_0, ..., a'_5) is defined as

d = \sqrt{\left(\tfrac{l}{2}\right)^2 \left(p_0^2 + p_1^2 + p_3^2 + p_4^2\right) + p_2^2 + p_5^2} \qquad (2)

where p_i = |a_i - a'_i| for i = 0, ..., 5, and l = (w + h)/2 is the average frame size. Qualitatively, eq. (2) expresses the displacement (in pixels) produced at the frame's periphery as the effect of the difference between the motions A and B. Indeed, each addend under the square root expresses the contribution of an individual parameter to the overall displacement. Another important feature of the clustering algorithm is temporal subsampling. In fact, by initially limiting the motion analysis to every 16 or 32 frames,
slow motions, or motions very similar to each other, can be successfully detected and differentiated. Only in a second phase is the motion analysis refined, by iteratively halving the frame interval until all the frames of a sequence are processed. The motion clusters obtained at the higher subsampling levels are used as constraints for clustering refinement, so as to avoid incorrectly merging previously formed clusters.
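For the cluster merging step, a small helper computing the distance of eq. (2) between two affine transformations might look as follows; the merge threshold is an illustrative value, not taken from the paper.

```python
import numpy as np

def affine_distance(A, B, frame_size):
    """Distance of eq. (2) between two affine transformations A = (a0..a5) and
    B = (a0'..a5'): the displacement, in pixels, that their difference produces
    at the frame periphery.  frame_size = (width, height)."""
    p = np.abs(np.asarray(A, float) - np.asarray(B, float))
    l = (frame_size[0] + frame_size[1]) / 2.0            # average frame size
    linear = (l / 2.0) ** 2 * (p[0]**2 + p[1]**2 + p[3]**2 + p[4]**2)
    return float(np.sqrt(linear + p[2]**2 + p[5]**2))

def should_merge(A, B, frame_size, threshold=3.0):
    """Merge two clusters whose motions differ by less than `threshold` pixels
    at the frame border (the threshold value is illustrative)."""
    return affine_distance(A, B, frame_size) < threshold
```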
Fig. 4. Residual errors and motion segmentation. The two motion clusters obtained are shown using white and black arrows, respectively.

Figure 4 provides a clustering example obtained from a news video, showing a man moving outdoors. As the camera pans rightwards to track the man, the image background moves leftwards; hence, the shot features two main image motions (man and background). The upper row of Fig. 4 reports two successive frames and the result of corner tracking; the lower row shows the residual segmentation errors relative to the background (dominant motion) and to the man (secondary motion), respectively, together with the final clustering result. The actual motion-based segmentation is performed by introducing spatial constraints on the classes obtained in the previous motion clustering phase. Compact image regions featuring homogeneous motion parameters – thus corresponding to single, independently moving objects – are extracted by region growing [1]. The motion segmentation algorithm is based on the computation of an a posteriori error obtained by plain pixel differences between pairs of frames realigned according to the extracted affine transformations.
4 Representation Model
The mosaic-based representation includes all the information required to reconstruct the video sequence, namely:
– location of each shot in the overall sequence;
– type of editing effect;
– mosaic image of the background;
Fig. 5. Four frames from a commercial video, and the associated mosaic.
Fig. 6. Four frames from a news video, the associated mosaic and the foreground moving object.
Fig. 7. Four frames from a commercial video, the associated mosaic and the foreground moving object.
– 2D motion due to the camera (for each frame);
– 2D motion and visual appearance of each segmented region (also for each frame).

Figure 5 shows some frames of a video sequence featuring composite horizontal and vertical camera pans, and the mosaic obtained. The mosaic image captures all the background details which are present in at least one frame of the sequence. Also, the mosaic composes individual frame details into a global description of the background (notice, e.g., that the babies are never all visible together in a single frame). Figures 6 and 7 show two more complicated examples, featuring camera panning and zooming and the presence of an independently moving object. Notice, again, that both the car in Fig. 6 and the farm in Fig. 7 are almost entirely visible in the mosaic image, although they are not in any individual frame. The mosaic representation also makes it possible to electronically "erase" an object from a video (e.g., in Fig. 6, the man is segmented out from the background mosaic image).
References
1. D.H. Ballard and C.M. Brown. Computer Vision. Prentice-Hall, 1982.
2. T.-J. Cham and R. Cipolla. A statistical framework for long-range feature matching in uncalibrated image mosaicing. In Proc. Int'l Conf. on Computer Vision and Pattern Recognition CVPR'98, pages 442–447, 1998.
3. C. Colombo, A. Del Bimbo, and P. Pala. Retrieval of commercials by video semantics. In Proc. Int'l Conf. on Computer Vision and Pattern Recognition CVPR'98, pages 572–577, 1998.
4. J.M. Corridoni and A. Del Bimbo. Structured digital video indexing. In Proc. Int'l Conf. on Pattern Recognition ICPR'96, pages (III):125–129, 1996.
5. M.A. Fischler and R.C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381–395, 1981.
6. C.G. Harris and M. Stephens. A combined corner and edge detector. In Proc. 4th Alvey Vision Conference, pages 147–151, 1988.
7. M. Irani, P. Anandan and S. Hsu. Mosaic based representations of video sequences and their applications. In Proc. Int'l Conference on Computer Vision ICCV'95, pages 605–611, 1995.
8. H.S. Sawhney and S. Ayer. Compact representations of videos through dominant and multiple motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:814–830, 1996.
9. L.S. Shapiro, H. Wang and J.M. Brady. A matching and tracking strategy for independently moving, non-rigid objects. In Proc. British Machine Vision Conference, pages 306–315.
10. J.Y.A. Wang and E.H. Adelson. Representing moving images with layers. IEEE Transactions on Image Processing, 3(5):625–638, 1994.
Crawling, Indexing and Retrieval of Three-Dimensional Data on the Web in the Framework of MPEG-7 Eric Paquet and Marc Rioux National Research Council Building M-50, Montreal Road Ottawa (Ontario) K1A 0R6 Canada [email protected], [email protected]
Abstract. A system for crawling, indexing and searching three-dimensional data on the Web in the framework of MPEG-7 is presented. After a brief presentation of the Web context, considerations on three-dimensional data are introduced followed by an overview of the MPEG-7 standard. A strategy to efficiently crawl three-dimensional data is presented. Methods designed to automatically index three-dimensional objects are introduced as well as the corresponding search engine. The implications of MPEG-7 for such a system are then analyzed from both the normative and non-normative point of view.
1. Introduction
Few among us thought that the Web would encompass so many fields and applications and that there would be such an amount and diversity of information available. In order to be useful, that information has to be crawled, indexed and retrieved; if it is not, the information is lost. The information available on the Web presents two very important characteristics. The first one is the multimedia nature of the content, which can be made of text, audio, pictures, video, three-dimensional data, or any combination of them. The second characteristic is the hypermedia nature: a given document can point to another and hopefully related document, so, starting with a few key documents, it is possible to gradually cover the network by linking from one document to the next. Crawling and indexing of pictures on the Web has received a lot of attention [1-4]. Three-dimensional data have received some attention from the indexing point of view [5] but little, if any, from the crawling point of view. This paper focuses on the crawling, indexing and retrieval of three-dimensional objects.
2. Three-Dimensional Data on the Web
The nature of three-dimensional data on the Web is manifold. A three-dimensional datum can be an object or a scene. They can themselves be represented in many formats. Some of them are based on triangular meshes, while others are based on parametric representations like the non-uniform rational B-spline, or NURBS. In some
cases a volumetric representation is used: the most common is the voxel representation. Some formats have only one resolution or level of detail while others have many. Some formats allow compression, but even when they do not, it is always possible to compress the files. Finally, they support ASCII, binary or even both representations. Currently there are more than 40 different formats on the Web. Some of them, like ACIS, are used in specialized applications such as CAD, and most of them account for a small proportion of the total three-dimensional population of the Web. The most important format by far is VRML, shorthand for Virtual Reality Modeling Language. There are many reasons for that, the first one being that, retrospectively, VRML was probably the first three-dimensional non-CAD standard used by a wide range of companies and applications. VRML is currently available in three versions: VRML 1, 2 and 97; the latest is the ISO version of VRML 2. So not only is VRML a de facto standard, it is also a rightful international standard. For all these reasons we have decided to use VRML as our working format and internal representation in our system. In addition to geometry, three-dimensional files can describe the color distribution. The color can be represented as a set of vertices or as a set of texture maps. For vertices, the RGB system is used most of the time. For texture maps it depends on the type of coding used for the maps; in the case of JPEG, a luminance-chrominance representation is used. The color information can be part of the file structure, as in the Metastream format, or it can be saved externally, as in VRML. It has to be pointed out that VRML also supports internally defined texture maps.
3. An Overview of MPEG-7
MPEG-7 [6] is formally known as the Multimedia Content Description Interface. MPEG-7 is not to be confused with its predecessors MPEG-1, MPEG-2 and MPEG-4: MPEG-7 describes the content of multimedia objects, while MPEG-1, 2 and 4 code the content. MPEG-7 is currently under development and the international standard is expected for 2001. MPEG-7 is made of a normative and a non-normative part. The normative part is what is standardized by MPEG, while the non-normative part is out of the scope of the standard. The normative part of MPEG-7 is the description. The non-normative part covers feature extraction, indexing and the search engine. The reason why MPEG-7 does not standardize these is that they are not necessary for interoperability and MPEG wants to keep the door wide open for future improvements. MPEG-7 is made out of three main elements: the descriptors (D), the description schemes (DS) and the description definition language, or DDL [6]. A descriptor is the representation of a feature, a feature being a distinctive or characteristic part of the data. The descriptor may be composite. A description scheme is formed by the combination of a plurality of descriptors and description schemes. The description
scheme specifies the structure of and the relations between the descriptors. Finally, the DDL is the language used to specify the description schemes. The reason why a standard is needed in the field of content-based access is very simple: interoperability. A search engine can efficiently utilize the descriptors only if their structure and meaning are known. If that is not the case, the descriptors are, in the worst case, useless or, in the best case, misused.
4. The Crawler
Before three-dimensional data can be described, they need to be located. This process is usually known as crawling. In order to crawl the documents we use a commercial crawler, Web robot or spider, made by Excalibur Corporation and running on an NT workstation. The spider consists of four server programs and one client program running together in a client-server architecture. Two of the server programs provide database functionality for the crawler. The other two server programs are the load balancer and the transfer processor; they provide the network transfer and data processing functionality for the crawler. The client program is where the user defines the domain of action of the spider. This is done through a series of configuration files. The O2 database is a commercial object-oriented database used to store metadata for crawled documents. It works in conjunction with the URL server to provide the database component of the spider. The transfer-processor program does the actual download and processing of Web documents. This program can transfer and process data in parallel. The load balancer divides the work among the transfer processors mentioned above. The spider program acts as the manager of a particular crawl job. It decides where the crawl should start from, where it should go, and when it should end. The user controls these behaviors through a set of highly configurable filters that constrain the crawling pattern. These filters act like intelligent decision points for choosing which forward links to follow and which not to follow. The Excalibur spider has not been designed to retrieve three-dimensional data; as a matter of fact, this medium has not been taken into consideration in the design of the Web robot. The spider initiates the crawl from a set of locations specified in the configuration files. These URLs are chosen for their richness in three-dimensional content and related hyperlinks. The spider does not have any filtering capability for three-dimensional files. Consequently, the crawling process has to be divided into two steps. In the first step the location of all documents is saved, but not their content, because the proportion of three-dimensional files is very small compared to other types like HTML; saving them would only overload the system disks. Once the spider has finished covering the Web domain specified in the configuration files, the locations are filtered. The filter keeps only the URLs corresponding to three-dimensional files by looking at the file extensions. For example, the extension wrl corresponds to the VRML format versions
1, 2 and 97. The filter takes into account the fact that some files can be in a compressed format and end up with a zip, gzip or Z extension. Once the location list has been filtered, a new list containing only the URLs of three-dimensional files is generated. The spider retrieves those files by performing a second crawl. Files that are not in the VRML format are converted to this working format by a converter. We have two converters: the first one can handle more than 40 different commercial formats, while the second one works on most CAD formats. Once the files have been converted to the working format, a set of descriptors is generated for each one of them by the analyzer. Some formats like VRML can link to external texture files. Because the spider was not designed for three-dimensional models, an additional process is needed to retrieve those files: the URL corresponding to each texture file is determined by analyzing the content of the three-dimensional file, and then the corresponding texture file can be retrieved.
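The second-pass filtering step described above can be illustrated with a short sketch; the extension lists are illustrative stand-ins for the 40+ formats the system actually recognizes.

```python
# Hypothetical extension lists; the real system recognizes many more 3D formats.
THREE_D_EXT = {".wrl", ".iv", ".3ds", ".dxf", ".obj"}   # .wrl covers VRML 1, 2 and 97
COMPRESSED_EXT = {".zip", ".gz", ".z"}

def is_three_d(url: str) -> bool:
    """True if the URL points to a 3D file, possibly wrapped in a compressed file."""
    path = url.split("?", 1)[0].lower()
    for ext in COMPRESSED_EXT:               # e.g. model.wrl.gz -> model.wrl
        if path.endswith(ext):
            path = path[: -len(ext)]
            break
    return any(path.endswith(ext) for ext in THREE_D_EXT)

def filter_locations(urls):
    """Keep only the URLs to be fetched in the second crawl."""
    return [u for u in urls if is_three_d(u)]
```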
5. The Analyzer
The analyzer parses each file, extracts the geometry and the color distribution, and analyzes the content in order to generate the descriptors and DS. It also creates a small picture of the model corresponding to the file, as well as a hyperlink to the file. In order to describe the geometrical shape we introduce the concept of a cord. A cord is a vector that goes from the center of mass of the object to the center of mass of a given triangle belonging to the surface of the object. We benefit from the fact that our working format, VRML, is based on a triangular mesh representation. In order to define the orientation of a cord we use a reference frame that does not depend on the particular orientation of the object. The reference frame is defined by the eigenvectors of the tensor of inertia of the object. Each axis is identified by the corresponding eigenvalue; the axes are labeled one, two and three in descending order of their eigenvalues. The orientation of a cord is completely determined by the two angles between the cord and the first two axes, and the cords are uniquely defined by their moduli, or norms, and those angles. We are interested in the distribution of the cords, so we define the cord distribution as a set of three histograms: the first histogram represents the distribution of the first angle, the second histogram the distribution of the second angle, and the third histogram the distribution of the norms. The histograms are normalized in order to make their comparison easier. Each histogram has a header made out of a single number representing the number of channels. The size of the histogram is dictated by the precision of the representation and by the discrimination capability that is needed. The behavior of a cord can be better understood by considering a regular pyramid and a step pyramid. Most people agree that they belong to the same category. If normal vectors were used to represent the pyramids, five directions
would characterize the regular pyramid while six directions would characterize the step pyramid, so they would be classified as two distinct objects. If a cord representation is used, the histograms corresponding to the regular pyramid and to the step pyramid are much more similar. Consequently, a cord can be viewed as a slowly varying normal vector. In addition to the cord, we propose another description based on a wavelet representation. This is because, in addition to being bounded by a surface, a three-dimensional object is also a volume and should be analyzed as such. In order to analyze the volume we use a three-dimensional wavelet transform. Let us review the procedure. For our purpose we use DAU4 wavelets, which have two vanishing moments. The N×N matrix corresponding to that transform is

W = \begin{pmatrix}
c_0 & c_1  & c_2 & c_3  &        &      &      &      \\
c_3 & -c_2 & c_1 & -c_0 &        &      &      &      \\
    &      & c_0 & c_1  & c_2    & c_3  &      &      \\
    &      & c_3 & -c_2 & c_1    & -c_0 &      &      \\
    &      &     &      & \ddots &      &      &      \\
    &      &     &      & c_0    & c_1  & c_2  & c_3  \\
    &      &     &      & c_3    & -c_2 & c_1  & -c_0 \\
c_2 & c_3  &     &      &        &      & c_0  & c_1  \\
c_1 & -c_0 &     &      &        &      & c_3  & -c_2
\end{pmatrix} \qquad (1)
The wavelet coefficients are obtained by applying the matrix W along the three axes defined by the tensor of inertia. We use those axes because the wavelet transform is neither translation nor rotation invariant. In order to apply the transform to the object, the latter has to be binarized by using a voxel representation. The set of wavelet coefficients represents a tremendous amount of information, which is reduced by computing a set of statistical moments for each level of resolution of the transform. For each moment order, a histogram of the distribution of the moment is built: the channels correspond to the level of detail while the amplitude corresponds to the moment values. The last geometrical descriptor is based on three-dimensional statistical moments. The statistical moments are not rotation invariant; in order to solve that problem they are computed in the same reference frame used for the wavelet and cord descriptors. The order of the moment is related to the level of detail. The colour distribution is simply handled by a set of three histograms corresponding to the red, green and blue channels. The scale, or physical dimensions, of the model has its own descriptor, which corresponds to the dimensions of the smallest bounding box that can contain the model. Whatever the descriptor, histograms are compared by means of the Hamming distance. All those descriptors can be weighted and combined in order to handle a particular query. Because of the non-linear and random behaviour of the error, the compared descriptors are weighted according to their rank and not according to the corresponding error [5].
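A minimal sketch of the cord-distribution descriptor described above, assuming the inertia reference frame is already available; bin counts are illustrative, and the object's center of mass is approximated by the unweighted mean of the triangle centers (a faithful version would weight by triangle area, as discussed in Section 6).

```python
import numpy as np

def cord_descriptor(vertices, triangles, axes, n_bins=64):
    """Cord distribution: three normalized histograms (angle to axis 1,
    angle to axis 2, cord norm).

    vertices:  (V, 3) array of mesh vertices.
    triangles: (T, 3) integer array of vertex indices.
    axes:      (3, 3) array whose rows are the inertia-tensor eigenvectors,
               ordered by descending eigenvalue (assumed precomputed).
    """
    tri = vertices[triangles]                        # (T, 3, 3)
    centers = tri.mean(axis=1)                       # triangle centers of mass
    center = centers.mean(axis=0)                    # simplified object center of mass
    cords = centers - center                         # one cord per triangle
    norms = np.linalg.norm(cords, axis=1)
    unit = cords / norms[:, None]
    ang1 = np.arccos(np.clip(unit @ axes[0], -1.0, 1.0))   # angle to first axis
    ang2 = np.arccos(np.clip(unit @ axes[1], -1.0, 1.0))   # angle to second axis
    h1, _ = np.histogram(ang1, bins=n_bins, range=(0.0, np.pi))
    h2, _ = np.histogram(ang2, bins=n_bins, range=(0.0, np.pi))
    h3, _ = np.histogram(norms / norms.max(), bins=n_bins, range=(0.0, 1.0))
    # Normalize so that descriptors of different meshes are comparable.
    return [h / h.sum() for h in (h1, h2, h3)]
```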
The search engine uses the concepts of direct and indirect query by example. In the case of the direct method, a three-dimensional model is used to specify the query; from the results of the first query one can refine the process by performing additional queries. In the case of the indirect method, keywords are used. Those keywords correspond to a dictionary of three-dimensional models; one word can correspond to one or more models. A query is performed for each model and the best results for each search are combined to form a unique answer. This process is completely transparent to the user. In addition, the latter can specify the descriptors and the weights attributed to the geometrical shape, color distribution and scale. An example of a query involving a sword is shown below.
Fig. 1. Some results for a query by shape for a sword using the cord description in a database of more than 1000 objects.
6. Implications of MPEG-7
Let us start with the normative part of MPEG-7. As we saw earlier, the normative part is concerned with the description, which in our case corresponds to the descriptors and DS presented in the previous section. Each histogram corresponds to a descriptor. The DS for the distribution of the cords is made out of the histograms of the first angle, of the second angle and of the norms. There is also a DS associated with the histograms of the wavelet and moment descriptions. We do not propose a DDL because it is not critical in our case. In order to be suitable for three-dimensional data, the DDL should support inheritance, possibly multiple inheritance, and, in the case of three-dimensional scenes and object parts, it should be able to describe the relations between the components. Those relations can be geometrical, like relative positions and orientations, but they can also be functional.
In order to be acceptable for MPEG-7, the descriptors and DS should fulfill a set of requirements: among them, they should be scalable and they should handle multi-level representations [6]. Our D and DS are scalable at both the object and database level. At the database level, the complexity of the search is a linear function of the number of models. At the object level, the size of the cord, moment and color distributions does not depend on the number of triangles representing the model. In the case of the wavelet description, the number of levels of detail has a logarithmic complexity, which is acceptable in most applications. Our descriptors and DS can also handle multi-level representations. This is already the case for the wavelet representation, which by nature provides a coarse-to-fine representation. The same can be said about the moments: the order of the moment is related to the level of detail. In the case of the cord distribution, a multi-level DS can be obtained by combining many cord DS and varying the number of channels. This is also a characteristic that MPEG-7 is looking for: easily forming D and DS from existing ones [6]. From an MPEG-7 perspective our D and DS are also useful because they cover a wide variety of applications ranging from CAD and catalogues to medical applications and virtual reality and, as far as our experiments show, they tend to describe most three-dimensional models adequately. The non-normative part is not part of the standard, but it can nevertheless have some indirect implications. Let us consider the extraction of D and DS. The extraction process should not depend on the implementation, otherwise the description obtained would not be consistent from one implementation to the next. Let us review our indexing procedure to see whether it is implementation dependent. We first convert each model to the VRML format. In almost all cases it is possible to reconvert the models back to their original formats without loss of information, which shows that conversion does not modify the content of the file. Cord extraction is clearly defined: the extraction of the eigenvectors and eigenvalues is a standard procedure, and the determination of the orientation of the cords leads to a unique solution. Some care must be taken with the tensor of inertia: when computing it, the weight of each triangle must be taken into account; this weight corresponds to the area of the corresponding triangle. This is because low-curvature surfaces tend to be represented by only a small number of triangles, while high-curvature surfaces are represented with many triangles even if their area is comparatively smaller. The same remarks can be made for the moment and wavelet descriptions.
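A sketch of the area-weighted tensor of inertia used to define the reference frame; this uses a point-mass approximation at the triangle centers and is meant only to illustrate the weighting, not to reproduce the authors' exact computation.

```python
import numpy as np

def inertia_axes(vertices, triangles):
    """Area-weighted inertia tensor of the mesh surface; returns the
    eigenvectors (as rows) ordered by descending eigenvalue."""
    tri = vertices[triangles]                            # (T, 3, 3)
    centers = tri.mean(axis=1)
    # Triangle areas = half the norm of the cross product of two edge vectors.
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    com = (areas[:, None] * centers).sum(axis=0) / areas.sum()
    r = centers - com
    r2 = (r * r).sum(axis=1)
    # Point-mass approximation: each triangle contributes area * (|r|^2 I - r r^T).
    I = np.zeros((3, 3))
    for a, ri, r2i in zip(areas, r, r2):
        I += a * (r2i * np.eye(3) - np.outer(ri, ri))
    eigval, eigvec = np.linalg.eigh(I)                   # ascending eigenvalues
    order = np.argsort(eigval)[::-1]                     # descending order
    return eigvec[:, order].T                            # rows = axes 1, 2, 3
```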
7. Conclusions
A system that can crawl, index and retrieve three-dimensional objects from the Web has been presented. The system can automatically retrieve three-dimensional models, describe them and search the Web for similar models. The description is based on the scale, three-dimensional shape and color distribution. It has been shown that the
proposed system could be integrated in the framework of MPEG-7. Our system runs on an NT workstation with two processors. It typically takes 2-5 seconds to compute the descriptors for a given three-dimensional object. The search engine is very fast: with the C++ implementation it takes less than one second to search a database of 2000 objects; the implementation based on Java servlets and Oracle 8 takes about two seconds but is much more flexible. The spider takes full advantage of the multiprocessor architecture and requires 512 Mbytes of memory in order to operate smoothly. In order to process the data faster, the spider requires two physical disks: one for the O2 database and one for the retrieved URLs. In its present state the system can only crawl a small domain of the Web: typically 20 hosts plus all the hosts to which they link. This limitation could easily be overcome by using more computers. The proportion of three-dimensional links is usually less than 0.5%. A demo of the search engine is available at http://cook.iitsg.nrc.ca:8800/Nefertiti/Nefertiti.html. This demo shows the potential as well as the performance of the search engine. A standard for the description of multimedia content is becoming an important issue: even the most sophisticated search engines retrieve a relatively small fraction of what is available. If a standard were available, material providers would be able to provide such a description. Commercial interests would motivate them, because the market would rapidly be divided between content that can be located thanks to the description and content that cannot. It is also important to develop descriptors that can be automatically extracted. When introducing MPEG-7 we will have to deal with the legacy of the past: millions of multimedia objects without any description. Only automatically generated descriptors can provide an adequate description of those multimedia objects within a reasonable amount of time.
References
1. M. La Cascia et al., "Combining Textual and Visual Cues for Content-based Image Retrieval on the World Wide Web", Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, pp. 24-28 (1998).
2. W. Bosques et al., "A Spatial Retrieval and Image Processing Expert System for the World Wide Web", Computers ind. Engng 33, pp. 433-436 (1997).
3. C.-L. Huang and D.-H. Huang, "A content-based image retrieval system", Image and Vision Computing 16, pp. 149-163 (1998).
4. S.-F. Chang et al., "Exploring Image Functionalities in WWW Applications - Development of Image/Video Search and Editing Engines", Proc. International Conf. on Image Processing, pp. 1-4 (1997).
5. E. Paquet and M. Rioux, "Content-based Access of VRML Libraries", IAPR International Workshop on Multimedia Information Analysis and Retrieval, Lecture Notes in Computer Science 1464, Springer, pp. 20-32 (1998).
6. MPEG-7: Evaluation Process Document, ISO/IEC JTC1/SC29/WG11 N2463, Atlantic City (USA), October (1998).
A Visual Search Engine for Distributed Image and Video Database Retrieval Applications Jens-Rainer Ohm, F. Bunjamin, W. Liebsch, B. Makai, K. Müller, B. Saberdest, and D. Zier Heinrich Hertz Institute for Communications Technology, Image Processing Department, Einsteinufer 37, D-10587 Berlin, Germany Phone +49-30-31002-617, Fax +49-30-392-7200 [email protected]
Abstract. This paper reports on an implementation of a search engine for visual information content, which has been developed in the context of the forthcoming MPEG-7 standard. The system supports similarity-based retrieval of visual (image and video) data along feature axes like color, texture, shape and geometry. The descriptors for these features have been developed in such a way that invariance against common transformations of visual material (e.g. filtering, contrast/color manipulation, resizing) is achieved, and that they are fitted to human perception properties. Furthermore, descriptors have been designed that allow a fast, hierarchical search procedure, where the inherent search mechanisms of database systems can be employed. This is important for client-server applications, where pre-selection should be performed at the database side. Database interfaces have been implemented in a platform-independent way based on SQL. The results show that efficient search and retrieval in distributed visual database systems is possible based on a normative feature description such as MPEG-7.
1 Introduction
Visual-feature based search and retrieval of images and videos in databases is a technique which has attracted considerable research interest recently [1][2][3]. Feature descriptions used to characterize the visual data and retrieval algorithms to search the databases are closely related in these implementations in order to obtain optimum results. The consequence is that the database provider usually also provides a specific search engine, which can only marginally be adapted to a user's needs. Moreover, it is impossible to perform a net-wide search, retrieving visual data which meet some predefined features from different database sources. With multimedia content emerging over the worldwide computer networks, use of distributed systems becomes necessary. If retrieval of data is supposed to be a de-coupled process, a normative feature description is required.
To meet this challenge, ISO's Moving Picture Experts Group (MPEG) has started a standardization activity for a "Multimedia Content Description Interface", called MPEG-7, which shall provide a standardized feature description for audiovisual data [4]. The meaning of "features" in this sense is very broad, and can consist of elements for
– high-level description (e.g. authoring information, scripting information, narrative relationships between scenes);
– mid-level description (e.g. semantic categories of objects or subjects present within a scene);
– low-level description (e.g. signal-based features like color, texture, geometry, motion of scene or camera).
Low-level description categories can be transformed into higher-level ones by setting specific rules, and a qualitative separation of these categories is not always straightforward. The work reported in this contribution concentrates on the low-level description; in this case, automatic extraction of features from the data is usually possible, and the definition of matching criteria for similarity-based retrieval using a specific feature type is more or less unique. Even so, it is not the intention of MPEG-7 to standardize the feature extraction, nor the search/retrieval algorithms, which may be optimized differently for specific applications. Nevertheless, the structure of a normative feature description has a high impact on the simplicity of adapting a non-normative search algorithm which uses this description for retrieval purposes. MPEG-7 concepts for description schemes and descriptors are briefly reviewed in Section 2. Section 3 describes feature descriptor examples, how they are combined in a flexible way in the framework of a description scheme, and how the ranking in the search/retrieval is performed. Furthermore, in a distributed retrieval application, the interrelationship between the database (at the server side) and the search engine (at the client side) is of major importance. This aspect is discussed in Section 4. Section 5 describes the implementation, and in Section 6, conclusions are drawn.
2 MPEG-7 Description Concept
The MPEG-7 description will consist of description schemes (DS) and descriptors (D), which are instantiated as descriptor values. Furthermore, it is planned to create a description definition language (DDL), which will allow new description schemes and descriptors to be defined for specific applications [5]. The whole description will be encoded, such that efficient storage and transmission are enabled. In this paper, we concentrate on the optimization of description scheme structures, such that they can be used in an efficient way for distributed applications. A description scheme is generally a combination of one or more sub-ordinate descriptor(s) and/or description scheme(s). An example is illustrated in Fig. 1, where the DS "A" at the top level is a combination of the Ds "A" and "B", and of DS "B", which again is a combination of Ds "C" and "D". Multiple nesting of DSs shall be possible. Each descriptor usually characterizes one single feature of the content. The associated descriptor value can be a scalar or a vector value, depending on the nature of the descriptor.
DS "A"
D "A"
D "B"
DS "B"
D "C"
D "D"
Fig.1. Example of a simple MPEG-7 description scheme structure.
3 Visual Feature Description and Content Retrieval
A generic description of visual content for similarity-based retrieval should be based on a flexible combination of different feature descriptors. Even though there are some examples where one single descriptor would be sufficient to find similar images (e.g. the color feature for sunset scenes), in most cases a more distinguishable specification of the query will be necessary. Combination of descriptors during the search can be achieved in two different ways:
1. Parallel approach. A similarity-cost function is associated with each descriptor, and the final ranking is performed by weighted summation of all descriptors' similarity results. This leads to an exhaustive search, where all features available for all items have to be compared during the similarity determination.
2. Coarse-to-fine approach. If specific features can be classified as dominant, a coarse pre-selection is possible based on these, which separates out the items with low similarity. Additional features are employed only in subsequent, finer phases of the search. A proper implementation will yield good results with a much shorter retrieval time than the parallel approach.
A search algorithm can also contain elements of the parallel approach (to combine different features) at each level of the coarse-to-fine approach. Descriptor organizations which support both of these approaches are described in the following subsections 3.1-3.3 for the examples of color, texture and contour features.
3.1 Color Feature
The use of color histograms is very efficient to describe the color feature of visual items. For this purpose, we use a transformation of the color into the HSV (Hue, Saturation, Value) space. It is known that differences within HSV space approximately coincide with the human perception of color differences. The HSV space is quantized into 166 bins in our implementation, and within each bin, the frequency of occurrence of the specific color is calculated. This technique was adopted from [6]. Comparison of the histograms of two visual items is performed by a weighted sum-of-squared-differences calculation. This descriptor is capable of finding images of similar color with
a high accuracy. If a re-ordering of the histogram is applied such that dominant colors are compared first, it is possible to implement a coarse-to-fine search strategy.
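A minimal sketch of the color descriptor just described, assuming a uniform HSV grid (162 bins) as a stand-in for the 166-bin quantization of [6]; the weights of the squared-difference comparison are left to the caller.

```python
import numpy as np
import colorsys

def hsv_histogram(rgb_image, bins=(18, 3, 3)):
    """Quantized HSV histogram of an (H, W, 3) uint8 RGB image, normalized to sum 1.
    A uniform 18x3x3 grid (162 bins) stands in for the 166-bin quantization of [6]."""
    pixels = rgb_image.reshape(-1, 3) / 255.0
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in pixels])
    hist, _ = np.histogramdd(hsv, bins=bins, range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / hist.sum()

def color_distance(h1, h2, weights=None):
    """Weighted sum of squared differences between two histograms (lower = more similar)."""
    d = h1 - h2
    if weights is None:
        weights = np.ones_like(d)
    return float((weights * d * d).sum())
```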
3.2 Texture Feature
The color feature can be calculated based on pixel statistics, while the texture of a visual item characterizes the interrelationship between adjacent pixels. For this purpose, we use a frequency description in the wavelet space, where the image is decomposed into 9 wavelet channels (see Fig. 2a). Two image signals have a high similarity if they are scaled or rotated versions of each other; scaling or rotation causes a shift in the frequency distribution. To overcome this problem, we use the values of energies calculated locally over the following frequency channel groups (ref. to Fig. 2b):
a) channel "S", the "scaled" image;
b) sums over channels 1+2+3, 4+5+6, 7+8+9, which are rotation-invariant criteria;
c) sums over channels 1+4+7, 2+5+8, 3+6+9, which are scale-invariant criteria.
Fig. 2. a Structure of channels in the wavelet decomposition. b Sums over channel groups for rotation-invariant and scale-invariant texture criteria.

The energy samples from each of the 6 wavelet combination channels (from b/c) undergo a local threshold operation. Based on the output of this binary operation, one of 64 possible frequency distribution patterns results. The frequency of occurrence (histogram) of each of these patterns is stored as a texture descriptor. In addition, a histogram of the mean, energy and variance calculated locally over 64 blocks of the image is extracted from the "S" channel after global mean extraction, and used as the similarity criterion at the first stage of a hierarchical (coarse-to-fine) search procedure.
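A sketch of the pattern-occurrence part of the texture descriptor, taking the six locally computed channel-group energy maps as given; the per-channel mean is used here as a stand-in for the local threshold operation.

```python
import numpy as np

def pattern_histogram(energy_maps, thresholds=None):
    """Texture descriptor: occurrence histogram of 64 binary frequency patterns.

    energy_maps: list of six 2D arrays of the same shape, the locally computed
                 energies of the rotation- and scale-invariant channel groups.
    thresholds:  per-channel thresholds; the channel mean is used if omitted.
    """
    maps = [np.asarray(m, float) for m in energy_maps]
    if thresholds is None:
        thresholds = [m.mean() for m in maps]          # simple threshold stand-in
    # Each position gets a 6-bit code: bit i is set if channel i exceeds its threshold.
    code = np.zeros(maps[0].shape, dtype=np.int32)
    for i, (m, t) in enumerate(zip(maps, thresholds)):
        code |= (m > t).astype(np.int32) << i
    hist = np.bincount(code.ravel(), minlength=64).astype(float)
    return hist / hist.sum()
```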
3.3 Contour Feature
If a segmented image or video scene is available, the description can be related to arbitrarily shaped objects instead of rectangular scenes. Basically, all the descriptors introduced above can likewise be applied to segments within an image. In addition, the geometry or contour of the segment can be used as an additional feature. Geometry or contour features can also be used as standalone criteria, e.g. for certain classes of graphics objects. We use wavelet descriptors, which characterize the 2D contours either in Cartesian- or polar-coordinate representation [7]. If each frequency band is used as a separate descriptor, it is possible to perform a coarse-to-fine search with a rough contour matching (lowest frequency coefficients) first, followed by more exact comparison at the later stages. The wavelet descriptors allow contours of unequal size and orientation to be compared by their shape similarity. A minimum number of 16 scaling coefficients is required to allow a reliable decision at the first stage.
the geometry or contour of the segment can be used as an additional feature. Geometry or contour features can also be used as standalone criteria, e.g. for certain classes of graphics objects. We are using wavelet descriptors, which characterize the 2D contours either in Cartesian- or polar-coordinate representation [7]. If each frequency band is used as a separate descriptor, it is possible to perform a coarse-to-fine search with a raw contour matching (lowest frequency coefficients) first, followed by more exact comparison at the later stages. The wavelet descriptors allow contours of unequal size and orientation to be compared by their shape similarity. A minimum number of 16 scaling coefficients is required to allow reliable decision at the first stage. ...
DS : Visual Scene
DS : Scene Structure
DS : Visual Object
DS : Visual Object
...
...
DS : Geometry
DS : Color
Contour, Position, Moments, ... DS : Color Transformation
D : Transformation Function
D: Reference Color Space
Descriptor Values
Descriptor Values
DS : Frequency Transformation
D: Analysis Function
DS : Color Histogram
Descriptor Values
D : Scaling Factor
D : Number of Cells
D : Data of Histogram
Descriptor Values
Descriptor Values
Descriptor Values
DS : Motion
DS : Texture
Object Motion, Camera Motion, Parametric Motion Model ...
DS : Wavelet Frequency Pattern Occurence
D: Number of Channels Descriptor Values
D: Threshold Function Descriptor Values
D : Energy, Mean, Var. Histogram Descriptor Values
D : Data of Histogram Descriptor Values
Fig. 3. A visual object DS based on color, texture and geometry descriptors.
3.4 Combination of Descriptors and Retrieval Operation
Fig. 3 shows the structure of a description scheme characterizing the visual features of an "image object", which may be a still image, a keyframe from a video, a single segment of one of those, or any other (rectangular or arbitrarily shaped) visual-content item. This DS can again be a sub-description of a higher-level DS in MPEG-7, e.g. for the purpose of shot description within a video scene. The full resolution is given in the figure only for the color and texture branches. Each descriptor (D) is instantiated by one or more descriptor values (DV). Note that two different types of DVs are present in our figure – one characterizing the "structural" parameters of the descriptor (quantization functions, number of histogram bins), and another
one giving the feature values of the content (data of histogram). The former type can be shared by all content items which are represented with the same description, while the latter must be made available "individually" for each item. Our retrieval algorithm designed for image database search supports the following properties, such that the database search is fast, reliable and efficient:
– Flexible combination of different features, according to the requirements of a query formulation. This means that, of all features contained in a description, only a subset may be used for a specific query or search step.
– Application of a weighting function w_i for each specific feature i in a feature combination. This weighting function will usually not be normative in the MPEG-7 context, but depends on the query formulation. Since the similarity metric values resulting from the different feature descriptors vary to a large extent, we perform a normalization. The final metric calculated from the normalized metrics s_i becomes s = w_1·s_1 + w_2·s_2 + ..., with the sum of the weights equal to 1; a sketch of this ranking step follows below.
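A minimal sketch of that ranking step; the normalization shown (rescaling each raw metric to [0, 1] over the candidate set) is one plausible choice rather than a normative one.

```python
import numpy as np

def combined_ranking(metrics, weights):
    """Rank candidate items by a weighted sum of normalized per-feature metrics.

    metrics: dict feature -> array of raw similarity metrics, one per candidate
             (lower = more similar).
    weights: dict feature -> query-dependent weight; renormalized to sum to 1.
    Returns candidate indices sorted from most to least similar.
    """
    w_total = sum(weights.values())
    s = None
    for feature, raw in metrics.items():
        raw = np.asarray(raw, float)
        span = raw.max() - raw.min()
        norm = (raw - raw.min()) / span if span > 0 else np.zeros_like(raw)
        term = (weights[feature] / w_total) * norm
        s = term if s is None else s + term
    return np.argsort(s)                 # ascending combined metric = best first
```

For instance, combined_ranking({'color': color_metrics, 'texture': texture_metrics}, {'color': 0.7, 'texture': 0.3}) returns the candidate indices ordered by the combined metric.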
4 Retrieval in a Client/Server Environment
For audiovisual data search and retrieval, a search engine needs flexible access to the feature description resources. This means that for specific search tasks only particular subsets of the feature representation data are needed. This can easily be achieved if not only the visual content items, but also the MPEG-7 description data, are organized as items in the database. In this case, only the structure of the organization (e.g., in which field of which data table particular descriptor values can be found) must be made available in an initial descriptor table, which is an entry point to the database. The AV data (to which the description is related) need not necessarily be stored in the same database; it is sufficient that the description holds a link to the real location. The search engine situated at the client can then access any descriptor values associated with any AV object in any set of data via a database interface. Since the Structured Query Language (SQL) [8] is a very common interface supported by most database systems, we have used it for this purpose. The configuration is illustrated in Fig. 4. Standard SQL mechanisms are employed via the interface to formulate the query in the remote pre-selection.
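A hedged illustration of this remote pre-selection path, using sqlite3 as a stand-in for any SQL database reachable through a standard interface such as JDBC; the table and column names are invented for the example, since the paper does not specify a schema.

```python
import sqlite3

# Hypothetical schema; the paper only requires that descriptions are items in a
# database reachable through SQL, with an initial descriptor table as entry point.
conn = sqlite3.connect("mpeg7_descriptions.db")

# 1. Read the initial descriptor table to learn where color histograms are stored.
entry = conn.execute(
    "SELECT table_name, value_column FROM descriptor_table WHERE descriptor = ?",
    ("color_histogram",),
).fetchone()

# 2. Fetch only the descriptor values needed for this query step, together with the
#    link to the actual AV content, which may live elsewhere.  (Identifiers are
#    interpolated directly here purely for brevity of the sketch.)
rows = conn.execute(
    f"SELECT object_id, {entry[1]}, av_link FROM {entry[0]} WHERE object_type = ?",
    ("keyframe",),
).fetchall()
```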
5 Implementation
The concepts elaborated in this paper have been implemented in a visual-data search engine developed at HHI. For platform independence, the system's core parts – presentation, user interactivity and database interfaces – were realized in Java. For database interfacing, we have used the Java Database Connectivity (JDBC) tools, which allow the implementation of a database-independent SQL interface. The user interface is also of high importance in the realization. The search engine's basic visible desktop only includes the most relevant setting capabilities, like data management and selection of basic features for a specific query. Users with more skills can also use fine-tune
settings for optimum search results, e.g. adjusting the specific weighting between the different texture descriptors available. Fig. 5a shows the desktop of the search engine, where the 10 images shown at the right side are the result of the query in ranked order, originating from the reference image in the left box. Fig. 5b illustrates additional query examples. The search can be performed both on still images and on key frames from video sequences. The presentation interface also contains a video player.
Fig. 4. Configuration of a client-server architecture in visual data retrieval.
Fig.5. a Desktop of the HHI search engine (Windows platform) b More query results (first 10 most similar displayed in each case) generated by the HHI search engine.
6 Conclusions
For distributed visual data retrieval applications, a normative description of visual features as defined by MPEG-7 is necessary, such that interoperability between the search engine at the client side and the database positioned at the server side is enabled. Basically, a distributed configuration imposes additional requirements on the content-based search, especially with respect to data organization and the structure of the feature description. We have tested different descriptors that can be used in a coarse-to-fine search strategy, which can partially be applied at the server end, in order to speed up the query and avoid unnecessary transmission of feature data. We have found that the communication between the search engine and the database can be organized in a very efficient way, using existing interconnection standards such as SQL. For multiple-tier systems, e.g. simultaneous linking with several databases or connection with intelligent agents, an object-oriented approach like CORBA would be more convenient.
7 Acknowledgements
This work was supported by the German Federal Ministry of Education, Research, Science and Technology under grant 01 BN 702.
References
1. J.R. Smith, S.-F. Chang: "VisualSEEk: A fully automated content-based image query system", Proc. Int. Conf. on Image Proc. (ICIP), Lausanne, 1996
2. J.R. Bach et al.: "The Virage image search engine: An open framework for image management", Proc. Storage and Retrieval for Image and Video Databases, SPIE vol. 2670, pp. 76-87, 1996
3. W. Niblack et al.: "The QBIC project: Querying images by content using colour, texture and shape", Proc. Storage and Retrieval for Image and Video Databases, SPIE vol. 1908, pp. 173-187, 1993
4. ISO/IEC JTC1/SC29/WG11: "MPEG-7 context and objectives", document no. N2460, Atlantic City, Oct. 1998
5. ISO/IEC JTC1/SC29/WG11: "MPEG-7 requirements document", document no. N2461, Atlantic City, Oct. 1998
6. J.R. Smith: "Integrated spatial and feature image systems: Retrieval, analysis and compression", PhD thesis, Columbia University, 1997
7. G. C.-H. Chuang and C.-C. J. Kuo: "Wavelet descriptor of planar curves: Theory and applications", IEEE Trans. Image Proc. 5 (1996), pp. 56-70
8. ISO/IEC 9075:1992, "Information Technology - Database Languages - SQL"
Indexing Multimedia for the Internet Brian Eberman, Blair Fidler, Robert Iannucci, Chris Joerg, Leonidas Kontothanassis, David E. Kovalcin, Pedro Moreno, Michael J. Swain, and Jean-Manuel Van Thong Cambridge Research Laboratory Compaq Computer Corporation, Cambridge, MA 02139, USA Tel: +1 617 692-7627, Fax: +1 617 692-7650 [email protected]
Abstract. We have developed a system that allows us to index and deliver audio and video over the Internet. The system has been in continuous operation since March 1998 within the company. The design of our system differs from previous systems because 1) the indexing can be based on an annotation stream generated by robust transcript alignment, as well as closed captions, and 2) it is a distributed system that is designed for scalable, high performance, universal access through the World Wide Web. Extensive tests of the system show that it achieves a performance level required for Internet-wide delivery. This paper discusses our approach to the problem, the design requirements, the system architecture, and performance figures. It concludes by showing how the next generation of annotations from speech recognition and computer vision can be incorporated into the system.
1 Introduction
Indexing of Web-based multimedia content will become an important challenge as the amount of streamed digital video and audio served continues to grow exponentially. Inexpensive storage makes it possible to store multimedia documents in digital formats at costs comparable to those of analog formats, and inexpensive, higher-bandwidth networks allow the transmission of multimedia documents to clients in corporate intranets and through the public Internet. The very rapid growth in the use of streaming media players and browser plug-ins demonstrates that this is a compelling medium for users, and the ease of use of products such as Microsoft's NetShow ServerTM and RealNetworks' RealServerTM will make it possible for even small organizations to distribute their content over the Internet or intranets. We built the CRL Media Search system (termed Media Search from now on) to investigate mechanisms for indexing video and audio content distributed via the Web. The Media Search service bears significant resemblance to existing search engines on the World Wide Web. Like search engines, it allows the users to perform a text search against a database of multimedia documents and returns a list of relevant documents. Unlike standard search engines, it also locates the
matches within the documents – search engines are able to leave this task up to the browser. We completed the initial implementation of the system in early March 1998 and have been running it nearly continuously over Compaq's corporate intranet, with updates, since then. A broad range of content sources has been added to the system. The content includes broadcast material from the Web, technical talks, produced management conferences, and captured broadcast video content. The audio quality of these assets varies widely and has provided a very useful set of tests. Statistical analysis of the system usage has enabled the development of a user model for extensive scalability testing of the system. This paper summarizes the system design issues and contrasts our solutions with some that have appeared previously in the literature. We then present the basic system architecture and how the system functions, and conclude with a discussion of our plans for incorporating additional annotation types into the system.
2 Design Issues
During the past few years a number of video indexing systems have been built. These systems generally take in a video feed from an analog source, digitize the source feed and then perform indexing on the content. Both research and commercial systems have been built. The News-on-Demand project, part of Informedia at CMU [4], is a good example of this type of system. In this project CNN and other live feeds were digitized, then image analysis was performed to cut the video into shots. Shots are then grouped into scenes using multiple cues. The speech in text format for each scene is then indexed using a standard text indexing system. Users can then send queries to retrieve scenes. Work since [4] has focused on video summarization, indexing with speech recognition, and learning the names of faces. A similar example is a commercial product from VirageTM. This product can provide a system to do real-time digitization, shot cut detection, and closed-caption extraction. Off-line, the system provides a keyframe summary of the video over a Web browser; by selecting time periods represented by pairs of keyframes, the user can add annotations to the index. Finally, Maybury's [5] system at MITRE focuses on adding advanced summarization and browsing capabilities to a system that is very similar to Informedia. These previous works, and others, employ the shot – an uninterrupted sequence of video from a single camera view – as the basic atomic unit for all additional processing. While shots are clearly important, this type of structuring of the information is not always appropriate. Davenport et al. [2] introduced the idea of a stream-based representation of video, from which multiple segmentations can be generated. In a stream-based representation, the video is left intact, and multi-layered annotations with precise beginning and ending times are stored as metadata associated with the video. Annotations can be a textual representation
of the speech, the name of a person, objects in the scene, statistical summaries of a sequence of video, or any other type of data. A stream-based annotation system provides a more flexible framework and can always be reduced to a shot/scene representation by projecting the time intervals of the other annotations against the shot annotations. In our work, the stream-based approach can be used to produce a text-oriented display. For example, if the system has primarily indexed conversations or speech, as ours has, then what is of interest to the user is the structure of the textual representation of the speech. A single keyframe per paragraph could be more appropriate than one determined from image analysis. A second example is content reuse and repurposing. Content companies are very interested in reusing old content for new productions. In this case, the semantic content of the story is not of interest. Instead the basic objects, people, and settings are of value. Annotations should mark their appearance and disappearance from the video. As a final case, consider indexing a symphony video based on instruments and scores. In this case a visually-based temporal segmentation is not appropriate, but one based on the musical structure is. Our system, paired with a Web-based annotation tool we built, can support all these differing styles of annotation. Another difference in our system is that we treat the Web as not just a delivery mechanism, but as the basic infrastructure on which to build. We believe that the ease with which video and audio content can be placed on the Web will soon cause a wide assortment of groups to distribute their multimedia content as ubiquitously as their text documents. We have experience with large corporate and institutional archives, and news and other content-production companies. All of these types of organizations are actively considering or have started to make this move. When this happens, video indexing systems will be needed not to index analog feeds, but to index video that has been placed on multimedia content servers. To investigate this issue, we designed our system so that HTTP-based content servers, which we call Media Servers, could be distributed across the organization, or anywhere on the Internet, and then be indexed from one central location. As a proof of concept, we indexed audio content on NPR's web site in our system. Users could search this content using our system; when they played one of these clips it was delivered directly from the NPR site. Since our system is designed to be used by a wide variety of users, we built an HTML-based user interface that would work on all major web browsers. To provide access across a world-wide corporate intranet, the content delivery was based on low-bitrate streaming video. The system follows the standard search engine user interaction model. This model consists of two steps: first, users search and get pointers to documents, and then they go to the documents and use the browser's find command to search for their particular query. Since find is not a browser-supported function for video, we had to create a way of supporting find through HTML pages that was consistent with this model. Since we worked with a very broad range of content sources, the audio quality of these materials varied broadly and followed many different production
formats. Thus our system could not use knowledge of the format for browsing. We further had to develop very robust methods for aligning the available, occasionally inaccurate, transcripts to very long audio segments (often greater than 1 hour in length) containing speech, music, speech over music, speech with background noise, etc. A paper by Moreno et al. [6] reports on this algorithm. Finally, new techniques for automatically annotating content are evolving rapidly in this area. This, plus our first experience with building a content processing system, led us to develop a distributed periodic processing model with a central database as a workflow controller. This approach is discussed more thoroughly in de Vries [3].
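Before turning to the system description, here is a minimal Python sketch of the stream-based view introduced above: word-level (or other) annotations carrying precise start and end times can be projected onto shot intervals to recover a shot/scene representation. The tuple layout and function names are illustrative assumptions, not the actual Media Search schema.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    kind: str       # e.g. "word", "keyframe", "person"
    start: float    # seconds from the beginning of the video
    end: float
    value: str

def project_onto_shots(annotations, shots):
    """Group arbitrary stream annotations under the shots whose time interval they overlap."""
    per_shot = {shot.value: [] for shot in shots}
    for ann in annotations:
        for shot in shots:
            if ann.start < shot.end and ann.end > shot.start:   # interval intersection
                per_shot[shot.value].append(ann)
    return per_shot

# Hypothetical example: two shots and two aligned words.
shots = [Annotation("shot", 0.0, 12.5, "shot-1"), Annotation("shot", 12.5, 30.0, "shot-2")]
words = [Annotation("word", 1.2, 1.6, "today"), Annotation("word", 13.0, 13.4, "markets")]
print(project_onto_shots(words, shots))
```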
3
System Description
The Media Search system, as shown in Figure 1, is broken into six components: 1) one or more Media Servers, 2) a metadatabase that is built on a standard relational database, 3) a collection of autonomous periodic processing engines or daemons managed by the metadatabase, 4) an Indexing System, which is a modified version of the NI2 index used in the AltaVista engine, 5) a Feeder to synchronize information in the database with the NI2 index, and 6) a Presentation Server that communicates with the other subsystems to construct an HTML response to a user query.
Fig. 1. CRL Media Search Architecture
Media Server A Media Server stores and serves up the actual content and provides a uniform interface to all the media stored in the system. In addition, it handles storage and access control, if required. We use it to store video (MPEG,
RealVideoTM ), audio(RealAudioTM ), and images (JPEG). We also implemented on-demand conversion functions to convert from MPEG to other formats. For large-scale Internet applications of the system, on-demand conversion is only used to produce formats accessed by our internal processing daemons. All formats accessed by user queries are pre-produced in order to improve performance. Both the stored files, and the “virtual” files which are computed on-demand are accessed through a uniform URL interface. The URL-based interface to Media Servers provides a layer of abstraction allowing storage versus computation trade-offs, and makes it possible to locate Media Servers anywhere within an intranet or across the Internet. Meta-Database and Daemons The metadatabase, which is built on a relational database, performs three functions. Firstly, it keeps track of the location of all stored formats, or representations, of each multimedia document. Secondly, the metadatabase acts as a central workflow control system for daemon processing. A daemon is a process that performs “work”, typically taking one or more representations and computing an output representation. For example, the alignment daemon takes a transcript and an audio file as input and computes an aligned transcript as output. Central workflow processing is enabled by having each daemon type register the format of its required inputs when that type is first installed. Then each daemon instance can request work from the metadatabase and is given the input URL handles that need to be processed. When a daemon completes work on its input representations, it stores the output representation in a Media Server and registers the availability of the new representation with the metadatabase. This simple model leads to a robust, distributed processing method which scales to the large processing systems needed for the Web. The third role of the metadatabase is to store all the stream-annotation information. Although we physically store the annotations in a way optimized for the presentation user interface, the format of the annotation tables can be thought of as tables giving the type, start time, end time, and value. The system is structured so that it can store arbitrary annotation types in this way. For example, using speech recognition technology we are able to align transcripts to the audio component of the video [6]. The annotation stream then consists of the start and end time of each word in the transcript, and the word itself. The system is sufficiently flexible that we can store a large variety of different forms of annotations, although currently we only store aligned transcripts and images, called keyframes, which represent a video shot. Index and Feeder One of the most important problems in video and audio indexing is not only to index the collection of words that were spoken during the video, but also to be able to determine where a particular word, phrase, or combination occurred. It was our aim to support this capability while still indexing full documents. We modified the NI2 index used by AltaVista [1] to accomplish this task. While the original NI2 index returns the ID of a document containing the words of the user query, our modified version will return multiple hits per document; one hit for every location in the document that matches the query.
In order to provide within-document indexing, we first had to define the within-document match locations for a given query. A match location provides an entry point into the video; the user is then free to play the video for as long as desired. Therefore, a match location naturally defines a subdocument, starting at the match location and ending at the end of the video. The match locations were then defined to be all the locations where terms from the query matched the document and which defined a subdocument that matched the complete query. We also extended the model to include rank queries, that is, how to rank certain subdocument matches more highly than others. A standard term frequency / inverse document frequency (tf.idf) metric [7] is used, with each term match multiplied by a position-dependent factor. To enhance the rank of subdocuments where there are many term matches appearing soon after the match location, the position-dependent factor takes the form of an exponential decay with its peak at the beginning of the subdocument. At this point we have not done sufficient information retrieval (IR) testing of this metric to report on the IR performance of within-document ranking.
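The ranking rule can be pictured with a small sketch that scores the subdocument starting at each match location, assuming a whitespace-tokenized transcript, a precomputed idf table, and an arbitrary decay constant. The exact weighting details and the requirement that the subdocument match the complete query are simplified away, so this is an illustrative reading rather than the system's actual scoring code.

```python
import math

def rank_match_locations(doc_tokens, query_terms, idf, decay=0.01):
    """Rank each match position by summing idf-weighted later matches,
    damped by an exponential of their distance from the subdocument start."""
    query = set(query_terms)
    positions = [i for i, tok in enumerate(doc_tokens) if tok in query]
    scores = []
    for start in positions:
        score = sum(idf.get(doc_tokens[p], 0.0) * math.exp(-decay * (p - start))
                    for p in positions if p >= start)
        scores.append((start, score))
    return sorted(scores, key=lambda item: -item[1])

# Hypothetical transcript and idf values.
tokens = "the senate voted today on the budget and the senate adjourned".split()
idf = {"senate": 2.0, "budget": 2.5, "voted": 1.5}
print(rank_match_locations(tokens, ["senate", "budget"], idf))
```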
Fig. 2. CRL Media Search Query Response Page
Presentation Server The Presentation Server communicates with the various services offered by the metadatabase, the Index Server, and the Media Server to allow the user to perform searches, see responses, and view parts of the video stream. A screen shot of the initial response produced by the Presentation Server after a query is shown in figure 2. From this page the user may decide to play the video from the first match within a document, or to go to another page to see all the matches within the video. From this document-specific match page, the user may play the video from any one of the matches within the document, search within the specified document, or progress to yet another page. This third style of page allows the user to browse the video starting at the location of one of the matches.
A typical response to a user query is computed as follows: The presentation system first sends a message to the Index Server; then, using a handle returned by the index, finds information in the metadatabase. In this way the presentation system uses the two systems together to compute responses to queries, in effect making the NI2 index an extension of the relational database. Although this type of extension can in principle be achieved with object relational databases or in extended relational databases by putting the index within the database, our specialized external text index that is synchronized with the database offers a higher performance solution.
4
System Performance
The system was tested running on two Digital AlphaServer 4100’s. One machine had four Alpha 21164 processors running at 466 MHz, 2 gigabytes (GB) of memory, and a disk array with 600 GB of data storage. This machine ran the database, index, and Media Server components of Media Search. The second machine had four 400MHz Alpha 21164 processors and 1.5 GB of memory and ran the presentation server. The test harness was a set of client processes probabilistically emulating the user model derived from the system deployed on Compaq’s internal network. The server side used Oracle 8.0.5 as the relational database with a custom developed interface between the presentation layer and the Oracle engine, and the NI2 index engine with the extensions described in section 3. The presentation server used the Apache web server with the presentation CGI scripts developed in Perl. To avoid the high startup overhead of Perl for every invocation of a CGI script we used the FastCGI extensions to CGI which allow Perl processes to become servers and converts CGI calls to socket communication between the Web server and the resident Perl processes. Our system has a throughput of 12.5 user sessions per second (as defined by the user model), resulting in 36 pages served per second, or approximately 3.2 million pages over a 24-hour period. We also achieved an average latency of less than 0.5 seconds per page served. The Presentation System, written in Perl, was the bottleneck in these tests, even with the FastCGI optimizations. Performance improvements to the presentation system could be obtained, for example, by rewriting it in C/C++ instead of Perl. We have also conducted performance tests of the metadatabase as a standalone component since it is the only component of our system whose performance can not be improved simply by replicating the data and using more machines to service requests. All other components can be distributed/replicated among multiple machines fairly easily. The test of the metadatabase component alone running on one AlphaServer 4100 with four Alpha 21164 processors at 466MHz measured a throughput of 29 user sessions per second, equivalent to 88 pages per second or 7.5 million pages over a 24-hour period.
5
Conclusions and Future Work
We implemented the CRL Media Search system to explore the issues that arise when building a distributed system for indexing multimedia content for distribution over the World Wide Web. The system uses the Internet as the basic platform for both organizing the computation and distributing content to users. We have tested the performance of the system and found that it scales well and can provide an indexing service at a cost comparable to indexing text (HTML) documents. We are investigating adding other meta types of annotation information to the system. For instance, the meta-information extracted by a face detector/recognizer and speaker spotter can be placed directly into the metadatabase. This information can then either be indexed in the current NI2 index or in a second NI2 index. In addition, we are researching indexing crawled multimedia from the World Wide Web, where we do not have transcripts or closed captions. Sound classification and speech recognition are key technologies for this area of research.
Acknowledgements The Media Search project has benefited from the insights, support, and work of many people, including: Mike Sokolov, Dick Greeley, Chris Weikart, Gabe Mahoney, Bob Supnik, Greg McCane, Katrina Maffey, Peter Dettori, Matthew Moores, Andrew Shepherd, Alan Nemeth, Suresh Masand, Andre Bellotti, S. R. Bangad, Yong Cho, Pat Hickey, Ebrahim Younies, Mike Burrows, Arjen De Vries, and Salim Yusufali. Thanks to Beth Logan for comments on drafts of this paper.
References 1. Burrows, M. Method for indexing information of a database. U.S. Patent 5745899 (April 1998). 199 2. Davenport, G., Aguierre-Smith, T. G., and Pincever, N. Cinematic primitives for multimedia. IEEE Computer Graphics and Applications 11, 4 (July 1991), 67–75. 196 3. de Vries, A. P., Eberman, B., and Kovalcin, D. E. The design and implementation of an infrastructure for multimedia digital libraries. In International Database Engineering Applications Symposium (July 1998). 198 4. Hauptmann, A., Witbrock, M., and Christel, M. News-on-Demand – an application of Informedia technology. In D-LIB Magazine (Sept. 1995). 196 5. Mani, I., House, D., and Maybury, M. Intelligent Multimedia Information Retrieval. MIT Press, 1997, ch. 12: Towards Content-Based Browsing of Broadcast News Video. 196
6. Moreno, P. J., Joerg, C., Van Thong, J.-M., and Glickman, O. A recursive algorithm for the forced alignment of very long audio segments. In International Conference on Spoken Language Processing (1998). 198, 199 7. Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983. 200
Crawling for Images on the WWW
Junghoo Cho¹ and Sougata Mukherjea²
¹ Department of Computer Science, Stanford University, Palo Alto, CA 94305, USA, cho@cs.stanford.edu
² C&C Research Lab, NEC USA, 110 Rio Robles, San Jose, CA 95134, USA, sougata@ccrl.sj.nec.com
Abstract. Search engines are useful because they allow the user to find information of interest from the World-Wide Web. These engines use a crawler to gather information from Web sites. However, with the explosive growth of the World-Wide Web it is not possible for any crawler to gather all the information available. Therefore, an efficient crawler tries to only gather important and popular information. In this paper we discuss a crawler that uses various heuristics to find sections of the WWW that are rich sources of images. This crawler is designed for AMORE, a Web search engine that allows the user to retrieve images from the Web by specifying relevant keywords or a similar image. Keywords: World-Wide Web, Crawling, Site-based Sampling, Non-icon detection.
1
Introduction
Search engines are some of the most popular sites on the World-Wide Web. However, most of the search engines today are textual; given one or more keywords they can retrieve Web documents that have those keywords. Since many Web pages have images, effective image search engines for the Web are required. There are two major ways to search for an image. The user can specify an image and the search engine can retrieve images similar to it. The user can also specify keywords and all images relevant to the user-specified keywords can be retrieved. Over the last two years we have developed an image search engine called the Advanced Multimedia Oriented Retrieval Engine (AMORE) [5] (http://www.ccrl.com/amore) that allows the retrieval of WWW images using both techniques. The user can specify keywords to retrieve relevant images or can specify an image to retrieve similar images. Like any search engine we need to crawl the WWW and gather images. With the explosive growth of the Web it is obviously not possible to gather all the WWW images. The crawlers run on machines that have limited storage capacity, and may be unable to index all the gathered data. Currently, the Web contains more than 1.5 TB, and is growing rapidly, so it is reasonable to expect that most
machines cannot cope with all the data. In fact a recent study has shown that the major text search engines cover only a small section of the Web [3]. The problem is magnified in an image search engine since image indexing takes more time and storage. Therefore the crawler should be “intelligent” and only crawl sections of the WWW that are rich sources of images. In this paper we present the AMORE crawler and explain several heuristics that can be used to determine WWW sections containing images of interest. The next section cites related work. Section 3 gives an overview of the AMORE system. Section 4 explains the crawler architecture. Section 5 discusses the heuristics used by the crawler. Finally section 6 concludes the paper with suggestions of future work.
2
Related Work
Crawlers are widely used today. Crawlers for the major search engines, for example, Alta Vista (http://www.altavista.com) and Excite (http://www.excite.com) attempt to visit most text pages, in order to build content indexes. At the other end of the spectrum, we have personal crawlers that scan for pages of interest to a particular user, in order to build a fast access cache (e.g. NetAttache http://www.tympani.com/products/NAPro/NAPro.html). Roughly, a crawler starts off with the URL for an initial page. It retrieves the page, extracts any URLs in it, and adds them to a queue of URLs to be scanned. Then the crawler gets URLs from the queue (in some order), and repeats the process [6]. [1] looks at the problem of how the crawler should select URLs to scan from its queue of known URLs. To ensure that the crawler selects important pages first, the paper suggests metrics like backlink count and page rank to determine the importance of a WWW page. Instead of finding the overall importance of a page, in this paper we are interested in the importance of a page with respect to images. Another research area relevant to this paper is the development of customizable crawlers. An example is SPHINX [4], a Java toolkit and interactive development environment for Web crawlers which allows site-specific crawling rules to be encapsulated.
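The generic crawl loop just summarized can be written down in a few lines. The fetch and link-extraction callbacks below are placeholders, and the FIFO queue stands in for whatever ordering policy a real crawler would use (for example the importance metrics of [1]); this is only a sketch of the loop, not AMORE's crawler.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """Generic crawler: pop a URL from the queue, fetch it, enqueue the new links."""
    queue, visited, pages = deque(seed_urls), set(), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()                 # FIFO here; "some order" in the text
        if url in visited:
            continue
        visited.add(url)
        content = fetch(url)
        if content is None:
            continue
        pages[url] = content
        for link in extract_links(content, url):
            if link not in visited:
                queue.append(link)
    return pages

# Toy in-memory "web" so the sketch runs without network access.
toy_web = {"http://a": ["http://b", "http://c"], "http://b": ["http://c"], "http://c": []}
print(sorted(crawl(["http://a"], fetch=toy_web.get, extract_links=lambda links, base: links)))
```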
3
AMORE Overview
During indexing the AMORE crawler, discussed in the next section, gathers “interesting” Web pages. The images contained and referred to in these pages are downloaded and the Content-Oriented Image Retrieval (COIR) library [2] is used to index these images using image processing techniques. We also use various heuristics, after parsing the HTML pages, to assign relevant keywords to the images and create keyword indices.
Fig. 1. Examples of different kinds of AMORE searches: (a) semantic similarity search with a picture of Egypt; (b) integrated search with the keyword ship and the picture of a ship
During searching, AMORE allows the user to retrieve images using various techniques. Figure 1 shows some retrieval scenarios. The user can specify keywords to retrieve relevant images. The user can also click on a picture and retrieve similar images. The user has the option of specifying whether the similarity is semantic or visual. For semantic similarity, the keywords assigned to the images are used. If two images have many common keywords assigned, they are considered to be similar. Thus in Figure 1(a) images of Egypt are retrieved even though they are not visually similar. For visual similarity, the COIR library is used. It looks at features of the images like color, shape and texture to determine similarity using the image indices. AMORE also allows the integration of keyword search and similarity search. Thus Figure 1(b) shows images visually similar to the picture of a ship that are also relevant to the keyword ship.
4
AMORE Crawler
The design of the AMORE image crawler embodies two goals. First, the crawler should crawl the Web as widely as possible. More precisely, we want the crawler to visit a significant number of the existing Web sites. If the crawling is restricted to a small set of sites, the scope and the number of images crawled may be limited and biased. Second, the crawler should not waste its resources examining “uninteresting” parts of the Web. For now, the information on the Web is mostly textual, and only a small portion of the Web contains images worthy of being indexed. The crawler should not waste its resources trying to crawl mostly textual parts of the Web.
Note that these two goals conflict. On one hand, we want to gather images from as many sites as possible, which means that the crawler should visit a significant portion of the Web. On the other hand, we want to limit the scope of the crawler to the “interesting” sections only. We try to achieve these two conflicting goals with a site-based sampling approach, which is discussed next.
4.1
Architecture of the Crawler
The AMORE crawler consists of two sub-crawlers: Explorer and Analyzer. Informally, Explorer discovers “interesting” sites and Analyzer filters out “uninteresting” sections from the identified sites. Figure 2 represents the data flow between these two crawlers.
Fig. 2. The architecture of the AMORE crawler.
– Explorer Explorer is the large-scale crawler whose main job is to discover “interesting” sites on the Web. It is optimized to find as many interesting sites as possible, and therefore it tries to visit the Web widely but shallowly. More precisely, it differs from most Web crawlers in that it only crawls k sample pages for each site it finds. After sampling k pages from a site, it checks the sample pages to see how many non-icon images the pages contain or refer to (the criteria for icon detection are described in detail in Section 5.1). If more than r% of the sampled pages have more than one non-icon image, then the site is considered “interesting”; a compact sketch of this decision rule is given after this list. The Analyzer works on these interesting sites that the Explorer found. Note that even if a site is not found to be interesting, the interesting pages in the site are sent to the AMORE indexer. This allows AMORE to index images from a large number of Web sites.
– Analyzer Analyzer is the small-scale crawler whose main job is to identify “interesting” sections within a Web site. The input to the Analyzer is the set of “interesting” sites that the Explorer found. For each input site, the Analyzer performs more crawling to gather m (≫ k) sample pages. These sampled pages are then analyzed to evaluate the directories in the site. For each directory, we calculate its importance as discussed in Section 5.2. Then the Analyzer crawls the directories in the order of their importance. The Analyzer examines all directories whose importance is greater than a threshold. Note that our two-step crawling approach is conceptually similar to iterative deepening [7]. Informally, we expand all high-level nodes (crawl the root-level pages of each Web site), and we go deeper (perform more crawling) for the interesting nodes expanded. Also note that there are various parameters in the crawling process, such as the number of pages to be sampled by the Explorer and the threshold value for the importance of the directories in the Analyzer. The AMORE administrator can set these values based on resource constraints.
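The Explorer's site-level decision mentioned above condenses to a small predicate. The value chosen for r and the layout of the sampled-page records are assumptions made for illustration only.

```python
def is_interesting_site(sample_pages, min_fraction=0.3):
    """A site is 'interesting' if more than r% of its k sampled pages
    contain (or refer to) more than one non-icon image."""
    if not sample_pages:
        return False
    rich = sum(1 for page in sample_pages if page["non_icon_images"] > 1)
    return rich / len(sample_pages) > min_fraction

# Hypothetical sample of k = 5 pages from one site.
samples = [{"url": f"http://example.org/p{i}", "non_icon_images": n}
           for i, n in enumerate([0, 3, 2, 0, 4])]
print(is_interesting_site(samples))   # 3 of 5 pages are image-rich -> True
```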
5
Heuristics
Fig. 3. Comparing the reasons why images referred to in HTML files were not indexed by AMORE.
5.1
Removing Icon Images
The Web is well-known for its heterogeneity of information. The heterogeneity is also true for images, and different types of images coexist on the Web. At one
extreme, a small icon is used as the bullet of a bulleted list, and at the other extreme, a page embeds a 1024x768 GIF image of a van Gogh painting. We believe the images on the Web can be classified into two categories: icons and authentic images. Icons are the images whose main function is to enhance the “look” of a Web page. They can be substituted by a symbol (e.g. bullets) or by text (e.g. advertising banners), but they are used to make the page more presentable. In contrast to icons, authentic images are the images that cannot be replaced by non-images. We cannot substitute the image of a van Gogh painting or the picture of Michael Jordan with text without losing information that we want to deliver. A usability study of AMORE has also shown that people are not interested in icons when using a WWW image retrieval engine. It is generally difficult to identify icons without analyzing the semantic meaning of an image. However, our experiments show that the following heuristics work reasonably well for icon detection (they are condensed into the sketch at the end of this subsection):
– Size: We remove very small images such as dots which are generally used for HTML page beautification. We only extract images that are more than a certain size (generally > 2000) and have a certain width and height.
– Ratio: We don't extract images if their width is much greater or smaller (> 3 or < 1/3) than their height. This filters out the headings and banners that appear at the top and the sides of many Web pages.
– Color: We also remove color images if they have very few colors (< 5). This removes uninteresting computer-generated logos, etc.
– Animated: Surprisingly, most users were also not interested in animated GIFs! So they are also removed.
– Transparent: We also remove transparent images since they are generally used as headings and logos.
About 10% of the images gathered by AMORE were not indexed because they were uninteresting (by the above criteria), or they could not be downloaded to our site (for example, some images were missing even though they were referenced in HTML pages), or they were not in the right format (like most browsers, AMORE supports only GIF and JPEG). Figure 3 is a chart showing the various reasons for which an image was not indexed.
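A condensed, illustrative version of the icon filter; the attribute names and the reading of the size cut-off as a pixel-area threshold are assumptions, while the numeric thresholds follow the list above.

```python
def is_authentic_image(img):
    """Return True if the image is worth indexing, False if it is treated as an icon."""
    if img["width"] * img["height"] <= 2000:      # too small: bullets, dots, spacers
        return False
    ratio = img["width"] / img["height"]
    if ratio > 3 or ratio < 1 / 3:                # headings, banners, side bars
        return False
    if img["num_colors"] < 5:                     # flat computer-generated logos
        return False
    if img.get("animated") or img.get("transparent"):
        return False
    return True

print(is_authentic_image({"width": 400, "height": 300, "num_colors": 4096}))  # True
print(is_authentic_image({"width": 468, "height": 60, "num_colors": 200}))    # False (banner shape)
```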
5.2
Finding Interesting Directories
An analysis of the major WWW sites have shown that they are well organized; all pages dealing with the same subject are generally organized in the same directory. It is also true that if a sample of the pages in a directory have images, a majority of the pages are “interesting” from our point of view. These observations are utilized by the Analyzer when it tries to find sections of interest in large Web sites. We use various heuristics to find interesting directories:
– In many Web sites images are kept in directories with relevant names. For example, directories like http://www.nba.com/finals97/gallery/, http://www.si.edu/natzoo/photos/ and http://www.indiabollywood.com/gallery/ contain good images. Individuals also organize their sites so that the images are kept in directories with meaningful names. For example, we found good images in http://fermat.stmarys-ca.edu/~jpolos/photos and http://www.mindspring.com/~zoonet/galleries. Therefore, we have a list of keywords (like gallery, photo, images), and if a directory name contains any of these words, the directory is considered interesting and crawled in detail.
– However, not all interesting directories have meaningful names. For example, http://cbs.sportsline.com/b/allsport/ has many images. Therefore, for most directories more analysis is necessary. We calculate the importance of a page as i + 1/w, where i is the number of non-icon images in the page and w is the number of words (excluding HTML tags). This makes a page with a lot of text and one image less important than a page with one image and less text. The overall importance of a directory is the average of the importance of the sampled pages of the directory (a sketch of this computation is given below). Since the Analyzer crawls the directories in order of their importance, image-intensive pages will be gathered first, and if there are resource constraints, the AMORE administrator can stop the crawling of a site after some time. At present, directories whose importance is greater than a pre-defined threshold are considered interesting and only these directories are crawled in detail.
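An illustrative rendering of the directory-importance computation; the abbreviated keyword list and the treatment of a name match as maximal importance are assumptions, while the page score i + 1/w and the averaging over sampled pages follow the text.

```python
IMAGE_DIR_KEYWORDS = ("gallery", "photo", "photos", "images", "pictures")

def page_importance(non_icon_images, num_words):
    """i + 1/w: many images and little text make a page important."""
    return non_icon_images + 1.0 / max(num_words, 1)

def directory_importance(directory_url, sampled_pages):
    if any(word in directory_url.lower() for word in IMAGE_DIR_KEYWORDS):
        return float("inf")            # a meaningful name alone marks it interesting
    if not sampled_pages:
        return 0.0
    return sum(page_importance(p["non_icon_images"], p["num_words"])
               for p in sampled_pages) / len(sampled_pages)

pages = [{"non_icon_images": 2, "num_words": 150}, {"non_icon_images": 1, "num_words": 40}]
print(directory_importance("http://cnnsi.com/almanac/", pages))
print(directory_importance("http://www.nba.com/finals97/gallery/", []))
```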
Fig. 4. An evaluation of our directory importance heuristic for http://cnnsi.com. The interesting directories are shown in italics.
To determine if our heuristics are correct, we have built a visual evaluation interface. Figure 4 shows the interface. Here we are evaluating our heuristics for the http://cnnsi.com Web site. The site is represented as a tree, and the interesting directories are shown in italics and in red. It is seen that directories like almanac (URL: http://cnnsi.com/almanac) and features are found interesting, while directories like jobs and help are found uninteresting. For a large directory, even if the whole directory is uninteresting, several subdirectories may be found to be interesting. For example, on exploring the hockey directory using our evaluation interface, we find that the directories events and players are interesting while directories like scoreboards and stats are not. On examining the Web site, we found that the directory importance heuristics performed up to our expectations.
6
Conclusion
In this paper we presented the crawler for the AMORE WWW Image Search Engine. The crawler uses several heuristics to crawl at least some pages of all Web sites as well as the “interesting” sections of “interesting” Web sites. This allows the AMORE crawler to achieve the conflicting goals of gathering as many “interesting” images as possible by visiting as few sites as possible. In the future we are planning to do an extensive evaluation of the crawler and extend the technique to other media like video and audio. Our ultimate objective is to develop an effective Multimedia WWW Search engine.
References 1. J. Cho, H. Garcia-Molina, and L. Page. Efficient Crawling through URL ordering. Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7):161–172, April 1998. 204 2. K. Hirata, Y. Hara, N. Shibata, and F. Hirabayashi. Media-based Navigation for Hypermedia Systems. In Proceedings of ACM Hypertext ’93 Conference, pages 159– 173, Seattle, WA, November 1993. 204 3. S. Lawrence and C. Giles. Searching the World-Wide Web. Science, 280(5360):98, 1998. 204 4. R. Miller and K. Bharat. SPHINX: a framework for creating personal, site-specific Web crawlers . Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7):119–130, April 1998. 204 5. S. Mukherjea, K. Hirata, and Y. Hara. Towards a Multimedia World-Wide Web Information Retrieval Engine. In Proceedings of the Sixth International World-Wide Web Conference, pages 177–188, Santa Clara, CA, April 1997. 203
6. B. Pinkerton. Finding what People Want: Experiences with the WebCrawler. In Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994. 204 7. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995. 207
A Dynamic JAVA-Based Intelligent Interface for Online Image Database Searches
V. Konstantinou and A. Psarrou
School of Computer Science, University of Westminster, Harrow Campus, Watford Road, Harrow HA1 3TP, London, U.K.
[email protected]
Abstract. This paper describes the use of a JAVA-based interface to enable remote image-based searches. The database used for our target system, resides at the Marcianna Library in Venice, Italy. It consists of a number of medieval manuscripts scanned at a variety of resolutions and indexed by both textual and image based information. The system described in this paper is a JAVA based client-server which can be used to interrogate a remote hybrid database and initiate searches using both the textual and image links of this database. Image queries can be based on data supplied by the user, or extracts of the original remote database. In the paper we also discuss the indexing and pattern recognition algorithms which are used on the Server side. This work is part of the HISTORIA project funded under the European Union Libraries Initiative (Telematics for Libraries - Project No:3117). Keywords: Digital libraries, Content-based indexing/retrieval, Java Interfaces, Distributed Databases, Multimedia.
1
Introduction
Scholars and museum curators in the areas of history of art, miniatures, seals, codicology and archivism need easy and rapid consultation and comparison of manuscript compilations that contain codices of heraldic arms, in order to date objects of art or collect information on family histories or historical events. The great interest of specialists throughout the world in such manuscript compilations is witnessed by the large number of requests to consult the codices received by museums and libraries. The processing of such requests is currently faced with three main obstacles:
– Information is available only locally: At present it is not possible to answer all of these requests accurately, and there are long waiting times due to the time taken to search through the various sources of information. Such information is usually available in the form of the original manuscripts, which are often fragile and can only be handled by trained curators.
– Indexing of codices of arms is subjective: Codices of arms are often described by keywords that characterise objects or shapes depicted on them and their spatial arrangement. However, text classification is not unique because the keywords used in the indexing of codices are subjective to the opinion of each researcher and in most cases are fairly general (i.e. the word ’flower’ can be used to describe a number of real-life flowers). In addition, keywords used in heraldry vary broadly across languages and cultures. – Indexing of codices of arms is not governed by heraldic rules: The most challenging obstacles are met from requests that are concerned with the determination of the origin of codices that don’t have a record in the current database. In this case a search is requested to provide images that closely match the enquired codice. However, Venetian heraldry is not governed by any standard set of heraldic rules and researchers rely more on plain pictorial information. The aim of the HISTORIA project is to overcome the problems associated with handling the original and usually very fragile manuscripts and provide a digital library which is incorporated with search and query tools tailored for (a) extracting information from reproductions of medieval manuscripts (b) meeting the needs of user groups such as historians and curators. We have addressed these issues by designing a digital library that has the following modules and functionality: – Information Server: A special Server (The Historia Server) module that accommodates and controls all of the system modules. The Historia Server is also linked to a Web server through which it can be accessed by both Intranet and Internet users. – User Interface: To enable both local and remote access of the system a JAVA based client-server interface was designed. The interface is loaded whenever a user visits a special Web page linked to the Historia server. – Image base: A specially-designed digital image library is used to store and retrieve low and high definition reproductions of images and text that represent (a) pages of various sources/books that describe the genealogy of Venetian families and (b) Venetian heraldry associated with these families. – Index base: During the acquisition/information collection phase the images are indexed according to family names, colour and pictorial information. – Hyper-imaging: A dedicated hybrid (text and/or image) search engine which finds images that contain similar shapes and colours is used for image or text based searches. – Drawing tool: A drawing tool which allows the user to draw and colour representations that resemble the image that would like to retrieve. The facility can also be used to change and manipulate images already in the database. – Network links: Objects in the digital library can be linked with two types of links: (a) hard coded links that associate pages of medieval books that describe the genealogy of noble Venetian families to images of Venetian heraldry and (b) user defined links that hold the user notes.
– Hyper-text: The user can also create links to individual text notes which can in turn be automatically linked to other text or image related information. For this purpose we have employed a special automatic Hypertext engine developed at the University of Westminster.
2
Background
Research in digital image libraries is widely known as content-based retrieval, where search is not based on the indexing and retrieval of information using keywords but on the appearance of the whole or parts of the image. The approaches used in many digital libraries are classified according to the type of generic query format used in retrieving information, for example browsing, colour, texture, sketch, shape, motion and others [6]. In addition, search and retrieval approaches are divided into image-domain-dependent and domain-independent ones. Often, however, the notion of domain-independent approaches is associated with techniques that do not preserve any semantic information about the content of the images. Such approaches, even though they might be regarded as generic, fail in the long term to produce satisfactory results for two main reasons: (a) the search is based on some mathematical formula that does not provide a unique representation of shapes, and (b) it does not cater for information that the user may be interested in including. In contrast, direct meaningful search on images is only possible by the use of semantic-preserving image compression [9]. Such a compact representation may or may not depend on prior knowledge of the information that appears in the image, but it preserves essential image similarities. The need for the development of image indexing and retrieval algorithms tailored to specific domains has been raised by a number of curators and art researchers. In particular, it has been shown that many of the commercial and research image databases available have not been successful in retrieving similar images from an art domain. For example, QBIC [4,3] had only a 10% success rate when used to relate similar landscape paintings of an artist [5]. The main reason for this is that many standard image processing algorithms assume that images are the perspective or parallel projection of a three-dimensional world. However, artists have struggled to represent the world filtered through their imagination, and it is rare, for example, to find paintings that, despite superficial appearances, present even a perspective view from a single fixed viewpoint [7]. In our work we are looking at the development of a digital library to host medieval manuscripts that contain hand drawings of (a) rectangular shapes or stripes and (b) irregular shapes of animals or objects. The algorithms developed to index and search this library have been based on the needs of the users [10]. From the algorithmic point of view, they are based on the analysis of colour histograms and their application in region-growing algorithms.
3
System Overview
To achieve the functionality of the image base described in section 1, the operation of the HISTORIA system can be divided into two main cycles (phases): 1. The acquisition phase: During this phase the image base, the index base and the hard coded network links are created. During the first stage the scanned images are automatically processed to extract features that describe their colour and image regions or the outline of shapes in the image and their spatial organisation. The extracted features are stored in a SQL-formed database and augmented with text information. 2. The query phase: During this phase the user can query the image base through the hyper-imaging and hyper-text links and create his/her own user defined network links. During the query phase the user composes queries using one of the following means (a) ’family name’ or other text associated with the heraldic images (b) appearance information that is related to the pictorial or colour information of the image. During this phase a process similar to the database acquisition is performed to the input image. The features extracted from the search image (if the search is based on an image as mentioned in b above) are stored again in a similar SQL format and matched against the stored database information. The user can select to use more than one of the matching mechanisms available (see Hyper-imaging below). The resulting set of images from the database is based on similarity. The similarity is calculated with the aid of distance functions between the features.
4
Hyper-Imaging
Representation and association of images in HISTORIA are based on two types of image analysis [10]: colour histogram analysis and regional analysis.
4.1
Colour Histogram Analysis
Colour queries let users find images that have a similar colour distribution. This is preferred because the required heraldic features sometimes relate to the colours used in the coats of arms (emblems). The search algorithm used here is based on colour histogram analysis. Although the RGB basis is good for the acquisition or display of colour information, it is not a particularly good basis for explaining the perception of colours. Following Ballard and Brown [1] we transform the RGB values to opponent colours defined as follows:
rg = r − g
by = 2b − r − g
wb = r + g + b
Based on opponent-colour histograms we create colour models for the emblems, which we then match using a histogram intersection technique [11].
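A sketch of the colour-model construction and matching step, assuming 8 bins per opponent axis (the quantisation actually used in HISTORIA is not specified here); the intersection measure follows Swain and Ballard [11].

```python
import numpy as np

def opponent_histogram(rgb, bins=8):
    """Normalised 3-D histogram over the opponent axes (rg, by, wb) of an RGB image."""
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    axes = np.stack([(r - g).ravel(), (2 * b - r - g).ravel(), (r + g + b).ravel()], axis=1)
    hist, _ = np.histogramdd(axes, bins=bins,
                             range=[(-255, 255), (-510, 510), (0, 765)])
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Swain-Ballard intersection: 1.0 for identical colour distributions."""
    return np.minimum(h1, h2).sum()

img1 = np.random.randint(0, 256, (64, 64, 3))
img2 = np.random.randint(0, 256, (64, 64, 3))
print(histogram_intersection(opponent_histogram(img1), opponent_histogram(img2)))
```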
4.2
Region Analysis
The main problem in the retrieval of images based on pictorial information is the segmentation of the images into meaningful parts. In our case the segmentation of images requires the partition of the images into homogeneous regions that represent composite shapes found in heraldry. Grey-level thresholding is a tool widely used in image segmentation, where a single threshold or a band of threshold values is used to discriminate objects from background. The most important problem is selecting the grey-level values that constitute the right thresholds for segmenting the image [12]. We can select appropriate threshold values for image segmentation based on information regarding the number of neighbouring pixels that have the same grey level. Such information can be derived from local spatial analysis of the grey-level values of the images and can be represented in a co-occurrence matrix. This is a second-order statistical measure of the image grey levels occurring in specified relative positions, and it reveals the spatial variation of grey levels (i.e. homogeneity) in an image. It thus allows homogeneous regions to be identified and separated from cluttered background. In particular, the diagonal of the co-occurrence matrix represents the number of neighbouring pixels that have the same grey level. Consequently, clusters near the diagonal of the co-occurrence matrix provide an indication of the grey level of the homogeneous regions in an image. Using such a representation we can select threshold values that minimise the amount of busyness, noise or roughness present in the thresholded image. If we assume that the ideal objects and background have simple compact shapes, then noise points and holes in the object and background of the thresholded image and strong irregularities in the object–background border are undesirable, as we normally want thresholded images to look smooth rather than busy.
Busyness measures using co-occurrence matrices Following Chanda and Dutta Majumder [2] and Lie [8], we define g(x, y) to be an L-level monochrome image of size M × M. We are interested in considering the co-occurrence of grey levels in pairs of pixels where one lies at a distance d in the direction θ with respect to the other. Here we take the value d = 1 and θ as an integer multiple of π/2. The element (m, n) of Cθ gives the frequency with which a pixel having grey level n occurs adjacent to a pixel having grey level m in direction θ. Considering all 4-neighbourhood directions, the co-occurrence matrix C is obtained as
C = C0 + Cπ/2 + Cπ + C3π/2
where Cθ represents the co-occurrence statistics of pairs of pixels adjacent in the θ direction. Let t be the threshold for image binarisation that maps all grey levels greater than t into the object and all other levels into the background. This mapping partitions the co-occurrence matrix into four distinct areas [12]:
1. matrix elements representing co-occurrences of grey levels in the object, i.e. those C(i, j) such that i > t and j > t (area B3);
Fig. 1. The threshold value t divides the co-occurrence matrix into four non-overlapping blocks. The values on each diagonal element C(i, i) of the co-occurrence matrix represent the i-th entry of the grey-level histogram of the image.
2. matrix elements representing co-occurrences of grey levels in the background, i.e. those C(i, j) such that i ≤ t and j ≤ t (area B4);
3. matrix elements representing co-occurrences of object grey levels with background grey levels, i.e. those C(i, j) such that i ≤ t and j > t (area B1) or i > t and j ≤ t (area B2).
Given a threshold t of an image, the measure of busyness C(t) that was used throughout this work is computed by summing those entries of the co-occurrence matrix representing the percentage of object–background adjacencies (i.e. the entries in the B1 and B2 areas). If C(t) is relatively high for a given threshold, we would expect the thresholded image to contain a large number of noise points and/or jagged edges. Conversely, a relatively low C(t) would indicate that the chosen threshold results in a smooth picture. C(t) will be zero if all grey levels are mapped into the same output level. To avoid this we require that the threshold lie between the object and background means. Once the co-occurrence matrix C has been computed, the busyness measure C(t) can be calculated for all thresholds using the recurrence relationship
C(t) = C(t − 1) − Σ_{i=1}^{t−1} C(i, t) + Σ_{j=t+1}^{n} C(t, j)
where n is the number of grey levels in the image (and hence the dimension of the co-occurrence matrix). The method of threshold selection using the co-occurrence matrix looks for the threshold for which the number of pairs of border pixels, i.e. the sum of C(m, n) over the blocks B1 and B2, is minimal. In other words, it searches for a threshold which segments the image into the largest homogeneous regions possible.
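A direct (non-recursive) sketch of this threshold selection: build the 4-neighbour co-occurrence matrix, compute the busyness as the mass of the off-diagonal blocks B1 and B2, and keep the threshold that minimises it. The guard against degenerate thresholds (one class empty) stands in for the object/background-mean constraint mentioned above; it and the lack of normalisation are simplifying assumptions.

```python
import numpy as np

def cooccurrence_matrix(img, levels=256):
    """Sum of the four directional co-occurrence matrices for d = 1."""
    C = np.zeros((levels, levels), dtype=np.int64)
    h, w = img.shape
    for dy, dx in [(0, 1), (1, 0), (0, -1), (-1, 0)]:
        a = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        b = img[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
        np.add.at(C, (a.ravel(), b.ravel()), 1)
    return C

def best_threshold(C, img):
    """Pick the t that minimises C(t), the mass of the B1 and B2 blocks."""
    hist = np.bincount(img.ravel(), minlength=C.shape[0])
    cum = np.cumsum(hist)
    best_t, best_busyness = None, np.inf
    for t in range(C.shape[0] - 1):
        if cum[t] == 0 or cum[t] == cum[-1]:     # all pixels on one side: skip
            continue
        busyness = C[:t + 1, t + 1:].sum() + C[t + 1:, :t + 1].sum()
        if busyness < best_busyness:
            best_t, best_busyness = t, busyness
    return best_t

img = np.zeros((32, 32), dtype=np.uint8)
img[8:24, 8:24] = 200                            # bright object on a dark background
print(best_threshold(cooccurrence_matrix(img), img))
```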
4.3
Region Labelling
Only under very unusual circumstances can thresholding be successful using a single threshold for the whole image, since even in very simple images there are likely to be grey-level variations in objects and background. Better segmentation results can be achieved using variable thresholds, in which the threshold value varies over the image as a function of local image characteristics. To find such thresholds we represent the busyness measure C(t) obtained from the co-occurrence matrix as a function of the threshold t. It should be noted that the busyness curve, as a function of the threshold, should have the same general shape as the grey-level histogram. This is because when we threshold at a point on a histogram peak, i.e. within the object or background grey-level range, we may expect a high degree of busyness in the thresholded image; whereas when we threshold in the valley between the object and background, the busyness should be relatively low. Therefore, C(t) is represented in a histogram which is searched for local troughs. The local troughs in the histogram correspond to the threshold values that are used to define the boundaries of homogeneous regions in an image. The thresholds found with this process are listed in ascending order and represent, pairwise, the boundary values of regions. The labelling process then separates the image into regions by using stacks and performing a recursive search over neighbouring pixels. The region description of the images is performed using shape metrics that describe: (a) area, as the number of pixels occupied by the region, (b) elongation, as the ratio of the length and width of the smallest bounding rectangle encompassing the region, and (c) compactness, as the ratio of the square of the length of the region boundary to the region area. The description of image regions using simple metric values has the advantage that it describes a region independently of its colour content and requires minimal storage, which can be held in SQL tables and queried. Figure 2 shows a search result based on combined histogram and region analysis on one emblem of the Basegnio family.
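A sketch of the three region descriptors; the axis-aligned bounding-box reading of elongation and the 4-neighbour boundary-length estimate are assumptions consistent with the definitions above.

```python
import numpy as np

def region_descriptors(mask):
    """Area, elongation and compactness of a binary region mask."""
    ys, xs = np.nonzero(mask)
    area = len(ys)
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    elongation = max(height, width) / min(height, width)       # bounding-rectangle ratio
    padded = np.pad(mask, 1)
    interior = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = int((mask & ~interior).sum())                   # pixels touching background
    compactness = boundary ** 2 / area
    return {"area": area, "elongation": elongation, "compactness": compactness}

mask = np.zeros((20, 30), dtype=bool)
mask[5:15, 5:25] = True                                        # a 10 x 20 rectangular region
print(region_descriptors(mask))
```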
Fig. 2. Search results on one of the emblems of the Basegnio family.
5
The HISTORIA JAVA Interface
The HISTORIA end-user interface has been augmented by incorporating a front-end, shown in Figure 3, that allows the user to access the database through the WWW and MsqlJava classes.
Fig. 3. The Java interface of HISTORIA
The screen in the Java interface is essentially divided up into four main parts:
1. The top-left area is headed Families and contains a scrolling list of the names of the families for which there are emblems in the database.
2. The top-right area is headed Image Editor. This provides a canvas and a set of drawing tools and options to allow the user to create small images to use as search items.
3. The middle panel is titled Search By and its main purpose is to keep a visual record of the search term that produced the results shown.
4. The bottom panel, titled Search Results, displays the search results for the search term shown in the middle panel.
Each retrieved image can then be used as a retrieval key as it is, or changed using the image editor panel. The functionality of the Java interface is shown in Figure 4. A member of the Badoeri family is selected in the Search By panel and similar images are retrieved. The image is then copied to the Image Editor panel where it can be modified and then used as a key retrieval image. The interface is compatible with the JAVA implementation found in both Netscape (v3 or greater) and Internet Explorer (v4 or greater).
Fig. 4. The Search and Editing facilities of the Java interface
The 'canvas' provided with the interface enables the user to sketch and even colour parts of an emblem of interest. As the server incorporates both object-based and colour-based search algorithms, the 'sketches' can be just rough outlines of objects or of the major coloured regions appearing in the emblem in question.
6
HISTORIA Intranet and Internet Access
The use of a Java interface enables a library that uses the HISTORIA system to make it available to all interested researchers without the need to upgrade any equipment. The Marcianna library, for which the system was developed, currently uses a NOVELL/UNIX network, but its users have a variety of machines ranging from 486 PCs running Windows 3 to more modern Pentium II machines. The system runs on a Silicon Graphics O2 running IRIX version 6.3. The interface has been designed in such a way as to auto-detect local (Intranet) users and enable extended search, printing and display features (high-resolution images). This way the library can maintain its copyright and control the distribution of the high-quality images.
7
Experimental Search Results
The prototype image base, currently hosted by the Marcianna library, includes approximately 900 images, and the success rates based on the two available matching mechanisms (and their combination) are as follows:
Image Type   Histogram   Region Analysis   Combined
Known        100         100               100
Related      60          70                90
All numbers shown above are success percentages. 'Known' indicates that the image is already stored in the image base. 'Related' denotes the relevant images that contain items or colours similar to those of the 'searched' image. The success rates
for the related images were derived by comparing the HISTORIA results and the sets derived by human scholars given the same sources.
8
Conclusion
One benefit of the HISTORIA system is that it can improve access to the information contained within the manuscripts and coats of arms held in the database, allowing individual researchers "hands on" access to the database. With this in mind, the documentation system has been designed such that it can be extended to provide individual researchers with a personalized research tool allowing them to record research paths and the results of previous searches and to define their own links between families and images with associated notes. The JAVA interface, apart from offering world-wide access, also enables the operators to protect the copyright of their documents by auto-detecting local users and enabling different features as appropriate. The interface is currently being updated so that it can interrogate a number of Historia servers and in that way enable true distributed image searches.
References 1. D. Ballard and C. Brown. Computer Vision. Prentice Hall, 1982. 214 2. B. Chanda and D. Dutta Majumder. A note on the use of the graylevel cooccurrence matrix in threshold selection. Signal Processing, 15(2):149–167, 1988. 215 3. Myron Flickner et al. Query by image and video content: The QBIC system. IEEE Computer, 28(9):23–32, September 1995. 213 4. W. Niblack et al. The QBIC project: Querying images by content using colour, texture and shape. In Storage and Retrieval for Image and Video Databases I, Proc. SPIE 1908, pages 173–187, 1993. 213 5. Catherine Grout. From 'virtual curator' to 'virtual librarian': What is the potential for the emergent image recognition technologies in the art-historical domain? In Electronic Imaging and the Visual Arts, London, 1996. 213 6. V. Gudivada and V. Raghavan. Content-based image retrieval systems. IEEE Computer, 28(9):18–22, September 1995. 213 7. J. Lansdown. Some trends in computer graphic art. In S. Mealing, editor, Computers & Art. intellect, 1997. 213 8. W.-N. Lie. An efficient threshold-evaluation algorithm for image segmentation based on spatial graylevel co-occurrences. Signal Processing, 33(1):121–126, July 1993. 215 9. A. Pentland, R.W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of image databases. In Storage and Retrieval for Image and Video Databases I, Proc. SPIE 2185, pages 34–47, 1993. 213 10. A. Psarrou, S. Courtenage, V. Konstantinou, P. Morse, and P. O'Reilly. Historia: Final report. Telematics for Libraries, 3117. 213, 214 11. M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991. 214 12. J. S. Weszka and A. Rosenfeld. Threshold evaluation techniques. IEEE Transactions on Systems, Man, and Cybernetics, 8(8):622–629, 1978. 215
Motion-Based Feature Extraction and Ascendant Hierarchical Classification for Video Indexing and Retrieval
Ronan Fablet¹ and Patrick Bouthemy²
¹ IRISA/CNRS, ² IRISA/INRIA, Campus universitaire de Beaulieu, 35042 Rennes Cedex, France
Tel: (33) 2.99.84.25.23, Fax: (33) 2.99.84.71.71
{rfablet,bouthemy}@irisa.fr
Abstract. This paper describes an original approach for motion characterization with a view to content-based video indexing and retrieval. A statistical analysis of temporal cooccurrence distributions of relevant local motion-based measures is exploited to compute global motion descriptors, which makes it possible to handle diverse motion situations. These features are used in an ascendant hierarchical classification procedure to supply a meaningful hierarchy from a set of sequences. Results of classification and retrieval on a database of video sequences are reported.
1
Introduction
Image databases are at the core of various application fields, either concerned with professional use (remote sensing and meteorology from satellite images, road traffic surveillance from video sequences, medical imaging, . . . ) or targeted at a more general public (television archives including movies, documentaries, news, . . . ; multimedia publishing, . . . ). Reliable and convenient access to visual information is of major interest for an efficient use of these databases. Thus, there exists a real need for indexing and retrieving visual documents by their content. A large amount of research is currently devoted to image and video database management [1,7,16]. Nevertheless, due to the complexity of image interpretation and dynamic scene analysis, it remains hard to easily identify relevant information with regard to a given query. As far as image sequences are concerned, content-based video indexing, browsing, editing, or retrieval primarily require recovering the elementary shots of the video and recognizing typical forms of video shooting such as static shot, traveling, zooming and panning [1,3,15,16]. These issues also motivate studies concentrating on image mosaicing [9], on object motion characterization in the case of a static camera [4], or on segmentation and tracking of moving elements [6]. These methods generally exploit motion segmentation relying either on 2D parametric motion models or on dense optical flow field estimation. They aim at
determining a partition of a given scene into regions attached to different types of motions with a view to extracting relevant moving objects. Nevertheless, they turn out to be ill-suited to certain classes of sequences, particularly in the case of unstructured motions of rivers, flames, foliage in the wind, or crowds (see Figure 1). Moreover, providing a global interpretation of motion along a sequence, without any prior motion segmentation and without any complete motion estimation in terms of parametric models or optical flow fields, seems attractive and achievable in the context of video indexing in order to discriminate general types of motion situations. These remarks emphasize the need for designing new low-level approaches that supply a direct global motion description [2,11,13,14]. We propose an original approach to video indexing and retrieval according to the motion content. It relies on the global motion-based features presented in our previous work [2]. They are extracted using a statistical analysis of temporal cooccurrences of local non-parametric motion-related information. These motion indexes are introduced in a flexible ascendant hierarchical classification scheme to determine a meaningful hierarchy from a large video sequence set. This hierarchy expresses similarities based on some metrics in the feature space. We can easily exploit the computed hierarchy for efficient retrieval with query by example. This paper is organized as follows. In Section 2, we outline the general ideas leading to our work. Section 3 briefly describes the motion-based feature extraction. In Section 4, we introduce the indexing structure and the retrieval procedure. Section 5 contains classification results and retrieval examples, obtained on a large set of video sequences, and Section 6 contains concluding remarks.
2 Problem Statement and Related Work
Video sequences are first processed to extract elementary shots with the technique presented in [3] (note that in the following we may use the term sequence to refer to an elementary shot). Then, for each previously extracted shot, we intend to characterize the whole spatio-temporal motion distribution in order to build a motion-based indexing and retrieval system. Let us note that, in the same manner, texture analysis methods study the spatial grey-level distribution. In particular, cooccurrence measurements provide efficient tools for texture description in terms of homogeneity, contrast or coarseness [8]. Therefore, we aim at adapting cooccurrence-based features to the context of motion analysis. Preliminary research in that direction was developed by Polana and Nelson for activity recognition [11]. As part of their work, they introduce the notion of temporal texture, as opposed to periodic activities or rigid motions, and associated with fluid motions. Indeed, motions of rivers, foliage, flames, or crowds can be regarded as temporal textures (see Figure 1). In [14], temporal texture synthesis examples close to the original sequences are reported. However, this work is devoted to these particular cases of dynamic scenes, and cannot be extended to rigid motions or periodic activities. In [13], temporal texture features are extracted based on the description of spatio-temporal trajectories. However, it relies on detection of moving contours
by a simple thresholding of the pixel-based frame differences, which are known to be noisy. In the following, maps of local motion measures along the image sequence are required as input of the cooccurrence measurements. As dense optical flow field estimation is time-consuming and unreliable in the case of complex dynamic scenes, we prefer to consider local motion-related information, easily computed from the spatio-temporal derivatives of the intensity. Rather than the normal velocity used in [11], a more reliable quantity is exploited, as explained in the next section. Besides, we intend to design a new video indexing and retrieval approach using the global motion-based features extracted from the temporal cooccurrence statistics. Thus, we first need to determine a meaningful indexing structure on a large dataset. Among all the clustering methods, we focus on ascendant hierarchical classification (AHC) [5,10]. It exploits a Euclidean norm on the motion-based feature space and aims at minimizing the within-class variances. The obtained hierarchical representation is directly exploited for efficient retrieval with query by example.
Fig. 1. Examples of temporal textures: a) foliage, b) fire (by courtesy of MIT).
3 Extraction of Global Motion-Based Features
3.1 Local Motion-Related Measures
By assuming intensity constancy along 2d motion trajectories, the well-known image motion constraint relates the 2d apparent motion and the spatio-temporal derivatives of the intensity function, and the normal velocity $v_n$ at a point $p$ is given by: $v_n(p) = \frac{-I_t(p)}{\|\nabla I(p)\|}$, where $I(p)$ is the intensity function, $\nabla I = (I_x, I_y)$ the intensity spatial gradient, and $I_t(p)$ the intensity partial temporal derivative. If the motion direction is orthogonal to the spatial intensity gradient, this quantity $v_n$ can in fact be null whatever the motion magnitude. $v_n$ is also very sensitive to the noise attached to the computation of the intensity derivatives. Nevertheless, an appropriately weighted average of $v_n$ in a given neighborhood forms a more relevant motion-related quantity, as shown in [12]:
$$v_{obs}(p) = \frac{\sum_{s \in F(p)} \|\nabla I(s)\|^2 \cdot |v_n(s)|}{\max\left(\eta^2, \sum_{s \in F(p)} \|\nabla I(s)\|^2\right)} \qquad (1)$$
where $F(p)$ is a $3 \times 3$ window centered on $p$, and $\eta^2$ is a predetermined constant, related to the noise level in uniform areas, which prevents division by zero
or by a very low value. Thus, $v_{obs}$ provides us with a local motion measure, easily computed and reliably exploitable. The loss of the information relative to motion direction is not a real shortcoming, since we are interested in interpreting the general type of dynamic situations observed in a given video shot. The computation of cooccurrence matrices cannot be achieved on a set of continuous variables. Due to the spreading out of the measures $v_{obs}$, a simple linear quantization within the interval $[\inf_p v_{obs}(p); \sup_p v_{obs}(p)]$ is not pertinent. Since it is generally acknowledged in motion analysis that large displacements cannot be handled through a single-resolution analysis, we set a limit beyond which measures are no longer regarded as reliable. Thus, in practice, we quantize the motion quantities linearly within [0, 4] on 16 levels.
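As an illustration, a minimal sketch of this local measure and of its quantization is given below, assuming grey-level frames are available as NumPy arrays. The finite-difference derivatives, the default value of $\eta^2$ and the helper names are our own choices; only the $3\times 3$ window, the [0, 4] clipping range and the 16 levels come from the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_motion_measure(prev_frame, curr_frame, eta2=1.0, win=3):
    """Weighted average of |v_n| over a win x win window F(p), as in Eq. (1)."""
    I = curr_frame.astype(float)
    Iy, Ix = np.gradient(I)                     # spatial derivatives
    It = I - prev_frame.astype(float)           # temporal derivative
    grad2 = Ix ** 2 + Iy ** 2                   # ||grad I||^2
    vn_abs = np.abs(It) / np.sqrt(np.maximum(grad2, 1e-12))   # |v_n|
    # uniform_filter computes local means; scaling by win*win gives sums over F(p)
    num = uniform_filter(grad2 * vn_abs, size=win) * win * win
    den = np.maximum(eta2, uniform_filter(grad2, size=win) * win * win)
    return num / den

def quantize_motion(vobs, vmax=4.0, levels=16):
    """Linear quantization of v_obs within [0, vmax] on 'levels' levels."""
    v = np.clip(vobs, 0.0, vmax)
    return np.minimum((v / vmax * levels).astype(int), levels - 1)
```

The quantized maps produced by `quantize_motion` are the input of the cooccurrence measurements introduced in the next subsection.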
3.2 Global Motion Features
In [11], spatial cooccurrence distributions are evaluated on normal flow fields to classify processed examples as pure motion (rotational, divergent) or as temporal texture (river, foliage). In that case, since the studied interactions are spatial, only motions which are stationary along the time axis can be characterized. Moreover, to recover the spatial structure of motion, several configurations corresponding to different spatial interactions have to be computed, which is highly time-consuming. Consequently, we focus on temporal cooccurrences defined for a pair of quantized motion quantities $(i, j)$ at the temporal distance $d_t$ by:
$$P_{d_t}(i,j) = \frac{\left|\left\{(r,s) \in C_{d_t} \,/\, obs(r) = i,\ obs(s) = j\right\}\right|}{|C_{d_t}|} \qquad (2)$$
where $obs$ stands for the quantized version of $v_{obs}$, and $C_{d_t} = \{(r,s)$ at the same spatial position in the image grid $/\ \exists t,\ r \in image(t)$ and $s \in image(t - d_t)\}$. From these cooccurrence matrices, global motion features similar to those defined in [8] are extracted:
$$\begin{aligned}
f^1 &= \textstyle\sum_{(i,j)} P_{d_t}(i,j)\,\log\!\left(P_{d_t}(i,j)\right) \\
f^2 &= \textstyle\sum_{(i,j)} P_{d_t}(i,j) \,/\, \left[1 + (i-j)^2\right] \\
f^3 &= \textstyle\sum_{(i,j)} (i-j)^2\, P_{d_t}(i,j) \\
f^4 &= \textstyle\sum_{(i,j)} i^4\, P_{d_t}(i,j) \,/\, \left[\textstyle\sum_{(i,j)} i^2\, P_{d_t}(i,j)\right]^2 - 3 \\
f^5 &= \textstyle\sum_{(i,j)} (i-j)^4\, P_{d_t}(i,j) \,/\, \left[\textstyle\sum_{(i,j)} (i-j)^2\, P_{d_t}(i,j)\right]^2 - 3
\end{aligned} \qquad (3)$$
where $f^1$ is the entropy, $f^2$ the inverse difference moment, $f^3$ the acceleration, $f^4$ the kurtosis and $f^5$ the difference kurtosis. In this work, this set of global motion features is computed over the whole image grid. In order to cope with non-stationarity in the spatial domain, we can easily obtain a region-based characterization of motion. Indeed, the extraction of the motion descriptors can also be achieved either on predefined blocks or on extracted regions resulting from a spatial segmentation, since we focus only on temporal interactions. In that case, the retrieval process will consist of determining regions of sequences of the database similar in terms of motion properties to those characterized for the processed query.
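A small sketch of how these descriptors could be accumulated from the quantized motion maps of a shot follows; the function names are ours, and $f^1$ is transcribed exactly as printed in (3) (the usual entropy definition carries an extra minus sign, so the sign convention should be checked against the authors' definition).

```python
import numpy as np

def temporal_cooccurrence(quantized_maps, dt=1, levels=16):
    """P_dt(i, j) of Eq. (2): cooccurrences of quantized motion values observed
    at the same pixel position in images t and t - dt, accumulated over the shot."""
    P = np.zeros((levels, levels), dtype=float)
    for t in range(dt, len(quantized_maps)):
        i = quantized_maps[t].ravel()
        j = quantized_maps[t - dt].ravel()
        np.add.at(P, (i, j), 1.0)
    return P / P.sum()

def global_motion_features(P):
    """The five global features of Eq. (3): entropy, inverse difference moment,
    acceleration, kurtosis and difference kurtosis."""
    eps = 1e-12
    i, j = np.indices(P.shape)
    f1 = np.sum(P * np.log(P + eps))
    f2 = np.sum(P / (1.0 + (i - j) ** 2))
    f3 = np.sum((i - j) ** 2 * P)
    f4 = np.sum(i ** 4 * P) / max(np.sum(i ** 2 * P) ** 2, eps) - 3.0
    f5 = np.sum((i - j) ** 4 * P) / max(np.sum((i - j) ** 2 * P) ** 2, eps) - 3.0
    return np.array([f1, f2, f3, f4, f5])
```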
4 Motion-Based Indexing and Retrieval
4.1 Motion-Based Indexing
Since we plan to design an efficient indexing and retrieval scheme based on the global motion features presented above, we are required to build an appropriate representation of the database. This will allow us to easily recover sequences similar in terms of motion properties to a given video query. Thus, we have to make use of a classification method in order to cluster video sequences into meaningful groups. Among the numerous clustering algorithms, we have selected an iterative process called ascendant hierarchical classification (AHC) [5]. Due to its computational simplicity and its hierarchical nature, it proves efficient for image and video database management, as shown in [10]. It amounts to computing a binary decision tree expressing the hierarchy of similarities between image sequences according to some metrics. Let us consider a set of motion-related feature vectors $f_n = (f_n^1, \ldots, f_n^5)$, where $n$ refers to a sequence in the database. The AHC algorithm proceeds incrementally as follows. At a given level of the hierarchy, pairs are formed by merging the closest clusters in the feature space in order to minimize the within-class variance and maximize the between-class centered second-order moment. We will use the Euclidean norm. Moreover, if an element $n$ represented by a feature vector $f_n$ is too far from all the other ones, i.e. $\min_m \|f_n - f_m\|^2 > V_{max}$, where $V_{max}$ is a predefined constant, it also forms a new cluster. This procedure is iterated from the lowest level to the upper one in the hierarchy. To initialize the algorithm at the lowest level, each cluster corresponds to a unique sequence. In our experiments, we have extracted the motion-based descriptors presented in Section 3.2 with a temporal distance $d_t = 1$. Nevertheless, we cannot directly use the Euclidean norm with such features of different nature. In order to exploit this norm to compare feature vectors, we compute for the feature $f^3$ its square root and we raise the features $f^4$ and $f^5$ to the one fourth power.
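A sketch of this indexing step is given below. It uses SciPy's Ward linkage as a stand-in for the AHC merging criterion (both minimize the within-class variance at each merge); the $V_{max}$ outlier rule of the paper is not reproduced, and the handling of negative kurtosis values in the rescaling is our own assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def rescale_features(F):
    """F holds one row (f1, ..., f5) per shot.  As described above, f3 is
    replaced by its square root and f4, f5 by their fourth roots (the sign is
    kept here, since kurtosis-based values may be negative)."""
    G = np.asarray(F, dtype=float).copy()
    G[:, 2] = np.sqrt(np.abs(G[:, 2]))
    for c in (3, 4):
        G[:, c] = np.sign(G[:, c]) * np.abs(G[:, c]) ** 0.25
    return G

def build_hierarchy(F, n_classes=4):
    """Bottom-up clustering of the rescaled features.  Ward's criterion merges
    at each step the pair of clusters whose fusion least increases the
    within-class variance, which is the spirit of the AHC used here."""
    G = rescale_features(F)
    Z = linkage(G, method="ward")                    # binary merge tree
    labels = fcluster(Z, t=n_classes, criterion="maxclust")
    return Z, labels
```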
4.2 Retrieval with Query by Example
We are interested in retrieving the sequences of the database most similar to a given video query. More particularly, we focus on matching sequences according to global motion properties. Indeed, the index structure described above provides us with an efficient hierarchical motion-based retrieval tool. We first compute the hierarchical index structure over the video database. Second, to handle the submitted query, the proposed sequence is processed to extract the meaningful motion-based features. In the same manner as previously, we compute the square root of the feature $f^3$ and raise the features $f^4$ and $f^5$ to the one fourth power in order to use the Euclidean norm as cost function. Then, we explore the hierarchy of sequences as follows. At its upper level, the retrieval algorithm selects the closest cluster, according to the Euclidean distance to the center of gravity of the considered cluster in the
feature space. Then, for each of the children nodes, the distance from the feature vector of the query video to the center of gravity of each cluster is computed, and the cluster with the shortest distance is selected. This procedure is iterated through the index structure until a given number of answers or a given similarity accuracy is reached.
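Continuing the indexing sketch above, the descent of the hierarchy could look as follows; the use of SciPy's tree object and the stopping rule based on a maximum number of answers are our own choices.

```python
import numpy as np
from scipy.cluster.hierarchy import to_tree

def retrieve(query_features, F, Z, max_answers=3):
    """Top-down descent of the binary hierarchy Z built above: at each node the
    child whose center of gravity (in the rescaled feature space) is closest to
    the query is selected, until at most max_answers sequences remain; the
    remaining leaves are finally ranked by distance to the query."""
    G = rescale_features(F)                          # from the sketch above
    q = rescale_features(np.asarray(query_features).reshape(1, -1))[0]
    node = to_tree(Z)
    while not node.is_leaf() and len(node.pre_order()) > max_answers:
        children = [node.get_left(), node.get_right()]
        dists = [np.linalg.norm(q - G[c.pre_order()].mean(axis=0)) for c in children]
        node = children[int(np.argmin(dists))]
    leaves = node.pre_order()                        # retrieved sequence indices
    return sorted(leaves, key=lambda i: np.linalg.norm(q - G[i]))
```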
5 Results and Concluding Remarks
We make use of the approach described above to process a database of image sequences. We have paid particular attention to choosing videos representative of various motion situations. Indeed, the database includes temporal textures such as fire or moving crowds, examples with a high motion activity such as sports videos (basketball, horse riding, ...), rigid motion situations (cars, train, ...), and sequences with a low motion activity. Finally, we consider a database of 25 video sequences (typically, each sequence is composed of 10 images). First, AHC is applied to the database in the space $(f^1, f^2, f^3, f^4, f^5)$. In Figure 2, the representation of the database in the feature space, restricted to the $(f^3, f^4, f^5)$ space for visualization convenience, is reported. The four sequence classes at level 4 in the hierarchy are indeed related to different types of motion situations: the class "o" involves temporal textures, the class "x" includes sport video motions, elements of the class "+" are related to rigid motion situations and the class "." is composed of low motion activity examples.
Fig. 2. Representation of the video database obtained with the AHC: a) Spreading of the sequences in the restricted feature space $(f^3, f^4, f^5)$. Symbols (+, o, ., *) are indexes of classes at level 4 in the AHC hierarchy. b) Examples representative of the extracted classes. We display the first image of the sequence that is closest to the center of gravity of its class.
Now, we deal with motion-based retrieval for query by example. Fig. 3 shows results obtained with two video queries.
[Figure 3: query 1 (high activity) and query 2 (low activity), each shown with its three retrieved answers.]
Fig. 3. Results of motion-based retrieval operations with query by example for a maximum of three answers. We display for each selected sequence its first image.
The maximum number of answers to a given query is fixed to 3. The first example is a horse riding sequence. The retrieval process supplies accurate answers of sport shots which appear similar to the query in terms of global motion properties. The second video query is a static shot of a meeting. It is matched with other low motion activity sequences. Let us proceed to a more quantitative evaluation of our approach. Since it seems difficult to directly analyze the accuracy of the classification scheme, we use the following procedure. First, we define a priori sequence classes among the dataset according to visual perception. Then, we analyze the three retrieved answers when considering each element of the base as a query. To evaluate the accuracy of our retrieval scheme, we consider two measures. We count the number of times that the query shot appears as the best answer, and, on the other hand, if the second retrieved sequence belongs to the same a priori class, we consider the retrieval process as correct. In practice, we have determined four a priori sequence classes: the first one with low motion activity, the second with rigid motions, important motion activity examples form the third one, and temporal textures the fourth one. Even if this evaluation procedure remains somewhat subjective, it delivers a convincing validation of the indexing and retrieval process. The results obtained for the whole database are rather promising (Table 1):

Table 1. Evaluation of the motion-based indexing and retrieval process
  Similar query and first retrieved answer (%)                    80
  Correct classification rate according to a priori class (%)     75
6 Conclusion
We have described an original method to extract global motion-related features and its application to video indexing and retrieval. Motion indexes rely
on a second-order statistical analysis of temporal distributions of relevant local motion-related quantities. We exploit an ascendant hierarchical classification to infer a binary tree over the video database. Examples of retrieval using query by example have shown good results. In future work, we should determine optimal sets of global features adapted to different types of content in the video database, and an evaluation over a still larger database should be performed.
Acknowledgments: This work is funded in part by AFIRST (Association Franco-Israelienne pour la Recherche Scientifique).
References
1. P. Aigrain, H.J. Zhang, and D. Petkovic. Content-based representation and retrieval of visual media: a state-of-the-art review. Multimedia Tools and Applications, 3(3):179–202, November 1996. 221
2. P. Bouthemy and R. Fablet. Motion characterization from temporal cooccurrences of local motion-based measures for video indexing. In Proc. Int. Conf. on Pattern Recognition, ICPR'98, Brisbane, August 1998. 222
3. P. Bouthemy and F. Ganansia. Video partitioning and camera motion characterization for content-based video indexing. In Proc. 3rd IEEE Int. Conf. on Image Processing, ICIP'96, Lausanne, September 1996. 221, 222
4. J.D. Courtney. Automatic video indexing via object motion analysis. Pattern Recognition, 30(4):607–625, April 1997. 221
5. E. Diday, G. Govaert, Y. Lechevallier, and J. Sidi. Clustering in pattern recognition. In Digital Image Processing, pages 19–58. J.-C. Simon, R. Haralick, eds, Kluwer, 1981. 223, 225
6. M. Gelgon and P. Bouthemy. Determining a structured spatio-temporal representation of video content for efficient visualization and indexing. In Proc. 5th European Conf. on Computer Vision, ECCV'98, Freiburg, June 1998. 221
7. B. Gunsel, A. Murat Tekalp, and P.J.L. van Beek. Content-based access to video objects: temporal segmentation, visual summarization and feature extraction. Signal Processing, 66:261–280, 1998. 221
8. R.M. Haralick, K. Shanmugan, and I. Dinstein. Textural features for image classification. IEEE Trans. on Systems, Man and Cybernetics, 3(6):610–621, Nov. 1973. 222, 224
9. M. Irani and P. Anandan. Video indexing based on mosaic representation. Proceedings of the IEEE, 86(5):905–921, May 1998. 221
10. R. Milanese, D. Squire, and T. Pun. Correspondence analysis and hierarchical indexing for content-based image retrieval. In Proc. 3rd IEEE Int. Conf. on Image Processing, ICIP'96, Lausanne, September 1996. 223, 225
11. R. Nelson and R. Polana. Qualitative recognition of motion using temporal texture. CVGIP: Image Understanding, 56(1):78–99, July 1992. 222, 223, 224
12. J.M. Odobez and P. Bouthemy. Separation of moving regions from background in an image sequence acquired with a mobile camera. In Video Data Compression for Multimedia Computing, chapter 8, pages 295–311. H. H. Li, S. Sun, and H. Derin, eds, Kluwer, 1997. 223
13. K. Otsuka, T. Horikoshi, S. Suzuki, and M. Fujii. Feature extraction of temporal texture based on spatiotemporal motion trajectory. In Proc. Int. Conf. on Pattern Recognition, ICPR'98, Brisbane, August 1998. 222
14. M. Szummer and R.W. Picard. Temporal texture modeling. In Proc. 3rd IEEE Int. Conf. on Image Processing, ICIP'96, Lausanne, September 1996. 222
15. W. Xiong and J. C.-H. Lee. Efficient scene change detection and camera motion annotation for video classification. Computer Vision and Image Understanding, 71(2):166–181, August 1998. 221
16. H.J. Zhang, J. Wu, D. Zhong, and S. Smoliar. An integrated system for content-based video retrieval and browsing. Pattern Recognition, 30(4), April 1997. 221
Automatically Segmenting Movies into Logical Story Units
Alan Hanjalic, Reginald L. Lagendijk, and Jan Biemond
Faculty of Information Technology and Systems, Information and Communication Theory Group, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands
{alan,inald,biemond}@it.et.tudelft.nl
Abstract. We present a newly developed strategy for automatically segmenting movies into logical story units. A logical story unit can be understood as an approximation of a movie episode and as the base for building an event-oriented movie organization structure. The automation aspect is becoming increasingly important with the rising amount of information in emerging digital libraries. The segmentation process is designed to work on MPEG-DC sequences and can be performed in a single pass through a video sequence.
1 Introduction
Easy user interaction with large volumes of video material in emerging digital libraries requires an efficient organization of the stored information. In this paper we concentrate on movies as a particularly important class of video programs and emphasize the need for an event-oriented movie organization scheme. Humans tend to remember different events after watching a movie and to think in terms of events during the video retrieval process. Such an event can be a dialog, an action scene or, generally, any series of shots "unified by location or dramatic incident" [5]. Therefore, an event as a whole should be treated as an elementary retrieval unit in advanced movie retrieval systems. We propose a novel method for automatically segmenting movies into logical story units. Each of these units is characterized by one or several temporally interrelated events, which implies that the segmentation result can provide a concise and comprehensive top level of an event-oriented movie organization scheme. The proposed high-level segmentation method can be carried out in a single pass through a video sequence.
2 From Episodes to Logical Story Units
Each shot [1] within a movie program belongs to a certain global context built up around one movie event or several of them taking place in parallel. Thereby, a shot can either be a part of an event or serve for its "description", e.g. by showing the scenery where the coming or the current event takes place, showing a "story telling" narrator in typical retrospective movies, etc. In view of such a distinction, we will further refer to the shots of a movie as either event shots or descriptive shots.
We can now realistically assume that a standard movie is produced as a series of meaningful segments corresponding to the event-oriented global contexts described above, which we will call episodes. An episode is generally a combination of the event shots and descriptive shots related to the event(s) of the episode. It can be simple, if it concentrates on one event only. However, a more complex episode structure is possible as well. This is the case where several events, taking place in parallel, are presented as a series of their interchanging fragments. We denote fragment $i$ of event $j$ by $T_i^j$ and introduce a model for the movie structure as shown in Fig. 1, built up by concatenating episodes of different complexity.
Fig. 1. A sample movie sequence consisting of three episodes. Descriptive shots are depicted as boxes with lined patterns.
In view of the event-based structure of an episode and the assumed limited number of episodes in a typical movie, a movie segmentation into episodes can provide a highly suitable top level for a compact and comprehensive event-oriented movie organization scheme. However, such a segmentation can be performed precisely only if the movie script is available. This is not the case in automated sequence analysis systems, especially those operating at the user side [3] of a video transmission network. In such systems, all movie content analysis, segmentation and organization processes are based on the movie's audiovisual characteristics and their temporal variations, measured and captured by standard audio, image and video processing tools. In this paper, we perform the movie segmentation using visual features only. As a result, approximations of the actual movie episodes are obtained, which we will call logical story units (LSUs). Various applications in digital video libraries can benefit from an LSU-based movie organization scheme. For example, an overview of a movie can be obtained immediately if one looks at the obtained set of LSUs. Fig. 2 illustrates how a movie can be broken up into LSUs and how existing content-based clustering algorithms can be applied to all shots of an LSU. The shots of each LSU that are most representative can be glued together and played as movie highlights. One can also use key frames to browse through each individual LSU, which is an especially important feature for LSUs having a complicated structure (e.g. containing several temporally interrelated events). The user only browses through relevant shots, e.g. those relating to the selected LSU (for instance, when searching for a particular movie character in the context of a certain event), and is not burdened with (the many) other shots of a sequence. For each granularity (cluster) level, a key-frame set is available, providing video representations through pictorial summaries having different amounts of detail.
[Figure 2: a movie segmented into LSU 1–3 along the time axis; for each LSU, characteristic video shots are available at different granularity levels (e.g. via content-based clustering), with one key-frame set per granularity level.]
Fig. 2. Possible scheme for movie representation based on LSUs
Few methods dealing with high-level movie segments can be found in the literature. In [2], characteristic temporal events like dialogs, high-motion and high-contrast segments are extracted for the purpose of making a movie trailer, but no attempt is made to capture the entire movie material. In [5], an approach is presented based on time-constrained clustering and label assignments to all shots within a sequence. Predefined models are used to analyze the resulting label sequence and recognize patterns corresponding to dialogs, action segments and arbitrary story units. The effectiveness of this method, especially for segmenting movies into story units, depends however on the applicability of the model used for a story unit. We foresee here several practical problems, such as the choice of the interval for time-constrained clustering, which puts an artificial limit on the duration of an episode. Another problem is that characterizing shots by distinct labels simplifies the real interrelation among neighboring shots far too much.
3 Concept of Logical Story Units
The concept of an LSU is based on the global temporal consistency of its visual content. Such a consistency is highly probable in view of the realistic assumption that an event is related to a specific location (scenery) and certain characters. It can be expected that within an event, every now and then, similar visual content elements (scenery, background, people, faces, dresses, specific patterns, etc.) appear and some of them even repeat. Such content matches clearly may not happen in immediately successive video shots, but most probably do within a certain time interval. We first assume that visual content elements from the current shot $k_1$ reappear (approximately) in shot $k_1 + p_1$. Then, shots $k_1$ and $k_1 + p_1$ form a linked pair. Since shots $k_1$ and $k_1 + p_1$ belong to the same LSU($m$), consequently all intermediate shots also belong to LSU($m$):
$$[k_1, k_1 + p_1] \in LSU(m) \quad \text{if} \quad p_1 \Leftarrow \min_{l=1,\ldots,c} A(k_1, k_1 + l) < M(k_1). \qquad (1)$$
Here, A(k,k+l) is the dissimilarity measure between the shots k and k+l, while c is the number of subsequent shots the current shot is compared with to check the visual dissimilarity. The threshold function M(k) specifies the maximum dissimilarity allowed within a single LSU. Since the visual content is usually time-variant, the function M(k) also varies with the shot under consideration. If there are no subsequent shots with sufficient similarity to the current shot k2 , i.e. the inequality in equation (1) is not satisfied, there is the possibility that one or more shots preceding shot k2 link with shot(s) following shot k2 . Then, the current shot is enclosed by a shot pair that belongs to LSU(m), i.e.
$$[k_2 - t, k_2 + p_2] \in LSU(m) \quad \text{if} \quad (t, p_2 > 0) \Leftarrow \min_{i=1,\ldots,r} \ \min_{l=-i+1,\ldots,c} A(k_2 - i, k_2 + l) < M(k_2). \qquad (2)$$
Here $r$ is the number of video shots to be considered preceding the current shot $k_2$. If for the current shot $k_3$ neither (1) nor (2) is fulfilled, but shot $k_3$ links with one of the previous shots, then shot $k_3$ is the last shot of LSU($m$). The objective is now to detect the boundaries between LSUs, given the described procedure for linking shots. In principle one can check equations (1) and (2) for all shots in the video sequence. This, however, is rather computationally intensive and also unnecessary. According to (1), if the current shot $k$ is linked to shot $k+p$, all intermediate shots automatically belong to the same LSU and do not have to be checked anymore. Only if no link can be found for shot $k$ is it necessary to check whether at least one of the $r$ shots preceding the current shot $k$ can be linked with shot $k+p$ (for $p>0$, as stated in (2)). If such a link is found, the procedure can continue at shot $k+p$; otherwise shot $k$ is at the boundary of LSU($m$). The procedure then continues with shot $k+1$ for LSU($m+1$). The LSU boundary detection procedure is illustrated in Fig. 3.
Fig. 3. Illustration of the LSU boundary detection procedure. The shots indicated by (a) and (b) can be linked and are by definition part of LSU(m). Shot (c) is implicitly declared part of LSU(m) since the shot (d) preceding (c) is linked to a future shot (e). Shot (e) is at the boundary of LSU(m) since it cannot be linked to future shots, nor can any of its r predecessors.
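The following single-pass sketch summarizes this boundary detection procedure. The inter-shot dissimilarity A(k, n) corresponds to Eq. (8) defined later in this section, and the threshold follows Eqs. (3)-(4) of the next paragraphs; the parameter values, and in particular the seeding of the very first threshold (which the text does not specify), are illustrative assumptions.

```python
def detect_lsu_boundaries(A, num_shots, c=3, r=2, alpha=1.4, c0_init=1.0):
    """One-pass LSU boundary detection following the linking rules (1)-(2),
    with the adaptive threshold of Eqs. (3)-(4).  A(k, n) is the inter-shot
    dissimilarity; returns the index of the last shot of each detected LSU."""
    boundaries = []
    C_hist = []          # content inconsistencies C(.) of the current LSU
    C0 = c0_init         # bias: last inconsistency of the previous LSU
    k = 0
    while k < num_shots - 1:
        M = alpha * (sum(C_hist) + C0) / (len(C_hist) + 1)   # Eqs. (3)-(4)
        # Eq. (1): try to link shot k to one of the next c shots
        fwd = [(A(k, k + l), k + l) for l in range(1, c + 1) if k + l < num_shots]
        best = min(fwd) if fwd else None
        if best is not None and best[0] < M:
            C_hist.append(best[0])
            k = best[1]                      # intermediate shots join the LSU
            continue
        # Eq. (2): try to link one of the r preceding shots past shot k
        # (only links that actually enclose shot k, i.e. t, p2 > 0, are kept)
        back = [(A(k - i, k + l), k + l)
                for i in range(1, r + 1) if k - i >= 0
                for l in range(1, c + 1) if k + l < num_shots]
        best = min(back) if back else None
        if best is not None and best[0] < M:
            C_hist.append(best[0])
            k = best[1]
            continue
        # no link found: shot k closes the current LSU
        boundaries.append(k)
        if C_hist:
            C0 = C_hist[-1]
        C_hist = []
        k += 1
    return boundaries
```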
To determine if a link can be established between two shots, we need the threshold function M(k). We compute this threshold recursively from already detected shots that belong to the current LSU. If the minimum of A(k,n) found in equation (1) (or equation (2) if (1) does not hold) denotes the content inconsistency value C(k), then the threshold function M(k) that we propose is:
$$M(k) = \alpha\, C(k, N_k) \qquad (3)$$
Here $\alpha$ is a fixed parameter whose value is not critical between 1.3 and 2.0, and $C(k, N_k)$ is computed as
$$C(k, N_k) = \frac{1}{N_k + 1}\left(\sum_{i=1}^{N_k} C(k - i) + C_0\right) \qquad (4)$$
The parameter $N_k$ denotes the number of links in the current LSU that have led to the current shot $k$, while the summation in (4) comprises the shots defining these links. Essentially, the threshold $M(k)$ adapts itself to the content inconsistencies found so far in the LSU. It also uses as a bias the last content inconsistency value $C_0$ of the previous LSU for which (1) or (2) is valid. We now proceed to define the content-based dissimilarity function $A(k,n)$, and assume that the video sequence is segmented into shots, using any of the methods found in the literature (e.g. [1]). Each detected shot is represented by one or multiple key frames so that its visual information is captured in the best possible way (e.g. by using [1]). All key frames belonging to a shot are merged together into one large variable-size image, called the shot image, which is then divided into blocks of H×W pixels. Each block is now a simple representation of one visual-content element of the shot. Since we cannot expect an exact shot-to-shot match in most cases, and because the influence of those shot-content details which are not interesting for an LSU as a whole should be as small as possible, we choose to use only those features that describe the H×W elements globally. In this paper we use only the average color in the L*u*v* uniform color space as a block feature. For each pair of shots (k, n), with k < n, the block-based comparison would ideally be
$$\min_{\text{all possible block combinations}} \ \sum_{\text{all blocks } b} d(b_k, b_n) \qquad (5)$$
where
$$d(b_k, b_n) = \sqrt{\left(L^*(b_k) - L^*(b_n)\right)^2 + \left(u^*(b_k) - u^*(b_n)\right)^2 + \left(v^*(b_k) - v^*(b_n)\right)^2} \qquad (6)$$
and where the minimum in (5) is taken over all possible block combinations. Unfortunately this is a problem of high combinatorial complexity. We therefore use a suboptimal approach to optimize (5). The blocks $b_k$ from a key frame of shot $k$ are matched in an unconstrained way in shot image $n$, starting with the top-left block in that key frame and subsequently scanning in a line-by-line fashion to its bottom-right block. If a block $b_n$ has been assigned to a block $b_k$, it is no longer available for assignment until the end of the scanning path. For each block $b_k$ the obtained match yields a minimal distance value $d_1(b_k)$. This procedure is repeated for the same key frame in the opposite scanning fashion, i.e. from bottom-right to top-left, yielding a different mapping for the blocks $b_k$ and a new minimal distance value for each block, denoted by $d_2(b_k)$. On the basis of these two different mappings for a key frame of shot $k$ and the corresponding minimal distance values $d_1(b_k)$ and $d_2(b_k)$ per block, the final correspondence and actual minimal distance $d_m(b_k)$ per block is constructed as follows:
• $d_m(b_k) = d_1(b_k)$, if $d_1(b_k) = d_2(b_k)$ (7a)
• $d_m(b_k) = d_1(b_k)$, if $d_1(b_k) < d_2(b_k)$ and $d_1(b_k)$ is the lowest distance value measured on the assigned block in shot image $n$ (one block in shot image $n$ can be assigned to two different blocks in a key frame from $k$: one time in each scanning direction) (7b); $d_m(b_k) = \infty$, otherwise (7c)
• $d_m(b_k) = d_2(b_k)$, if $d_2(b_k) < d_1(b_k)$ and $d_2(b_k)$ is the lowest distance value measured on the assigned block in shot image $n$ (7d); $d_m(b_k) = \infty$, otherwise (7e)
where $\infty$ stands for a fairly large value, indicating that no objective best match for a block $b_k$ could be found. The entire described procedure is repeated for all key frames of shot $k$, leading to one value $d_m(b_k)$ for each block of shot image $k$. Finally, the average of the distances $d_m(b_k)$ of the $B$ best-matching blocks in the shot image $k$ is computed as the final inter-shot dissimilarity value:
$$A(k, n) = \frac{1}{B} \sum_{B \text{ best-matching blocks}} d_m(b_k) \qquad (8)$$
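A condensed sketch of this two-pass matching is given below, assuming the block features (average L*u*v* colors, one row per block) have already been extracted from the key frame of shot k and from shot image n; the tie-breaking of rules (7a)-(7e) is implemented in a simplified form, and all names and default values are ours.

```python
import numpy as np

def greedy_pass(feat_k, feat_n, order):
    """One scanning pass of the suboptimal matching: each block of the key
    frame of shot k grabs its closest still-unassigned block of shot image n
    (distance = Eq. (6) on the average L*u*v* block colors)."""
    taken = np.zeros(len(feat_n), dtype=bool)
    dist = np.full(len(feat_k), np.inf)
    assign = np.full(len(feat_k), -1)
    for b in order:
        d = np.linalg.norm(feat_n - feat_k[b], axis=1)
        d[taken] = np.inf
        j = int(np.argmin(d))
        if np.isfinite(d[j]):
            dist[b], assign[b], taken[j] = d[j], j, True
    return dist, assign

def shot_dissimilarity(feat_k, feat_n, B_ratio=0.5):
    """A(k, n) of Eq. (8): average of the B best d_m(b_k), with d_m combined
    from the forward and backward passes as in rules (7a)-(7e)."""
    idx = np.arange(len(feat_k))
    d1, a1 = greedy_pass(feat_k, feat_n, idx)          # top-left -> bottom-right
    d2, a2 = greedy_pass(feat_k, feat_n, idx[::-1])    # bottom-right -> top-left
    dm = np.full(len(feat_k), np.inf)
    for b in idx:
        if d1[b] == d2[b]:                                             # (7a)
            dm[b] = d1[b]
        elif d1[b] < d2[b]:                                            # (7b)/(7c)
            rivals = d2[a2 == a1[b]]     # other block(s) mapped to the same n-block
            dm[b] = d1[b] if d1[b] <= np.min(rivals, initial=np.inf) else np.inf
        else:                                                          # (7d)/(7e)
            rivals = d1[a1 == a2[b]]
            dm[b] = d2[b] if d2[b] <= np.min(rivals, initial=np.inf) else np.inf
    n_best = max(1, int(B_ratio * len(dm)))
    best = np.sort(dm)[:n_best]
    finite = best[np.isfinite(best)]
    return float(finite.mean()) if finite.size else float("inf")
```

This `shot_dissimilarity` is the kind of callable that could be passed as `A` to the boundary-detection sketch given earlier in this section.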
The reason for taking only the B best-matching blocks is that two shots should be compared only on a global level. In this way, we allow for inevitable changes within the LSU, which, however, do not degrade the global continuity of its visual content. Having explained the LSU boundary detection procedure, we now discuss the characteristics of the obtained LSUs with respect to the actual episodes. For this purpose, we investigate a series of shots a to j, as illustrated in Fig. 4. According to the movie script, the boundary between episodes m and m+1 lies
between shots e and f. We now assume that shot e, although belonging to episode m, has a different visual content than the rest of the shots in that episode. This can be the case if, e.g., e is a descriptive shot, which generally differs from event shots. Consequently, the content consistency could be followed by overlapping links in LSU(m) up to shot d, so that the LSU boundary is found between shots d and e. If shot e contains enough visual elements also appearing in episode m+1, so that a link can be established, e is assumed to be the first shot of LSU(m+1) instead of shot f. This results in a displaced episode boundary, as shown in Fig. 4. However, if no content-consistency link can be established between shot e and any of the shots from episode m+1, another LSU boundary is found between shots e and f. Suppose that f is a descriptive shot of episode m+1, containing a different visual content than the rest of the shots in that episode, so that again no content-consistency link can be established. Another LSU boundary is then found between shots f and g. If the linking procedure can now be started from shot g, it is considered to be the first shot of the new LSU(m+1). In this case, no precise LSU boundary is found but one that is spread around the actual episode boundary, taking into consideration all places where the actual episode boundary can be defined. Consequently, the shots e and f are not included in LSUs, as shown in Fig. 4. Such scenarios occur quite often and show that, by investigating the temporal consistency of the visual content, only an approximation of the actual episode should be expected.
Fig. 4. LSU versus episode boundary. Note that LSUs are, in general, different from the corresponding episodes, since not all episode shots are included in the LSUs.
4 Experimental Validation
To test the proposed LSU boundary detection approach, we used two full-length movies. Both of them were available as DC sequences [4], obtained from MPEG streams with (slightly modified) frame sizes of 88×72 and 80×64 pixels, respectively. We detected the shots using the method from [1] and represented each shot by two key frames taken from the beginning and end of a shot, in order to capture most of its important visual content elements. To get an idea about the positions of the actual episode boundaries, we asked unbiased test subjects to manually segment both movies and took into account only those boundaries registered by all test subjects. These boundaries we call probable. Then, we had our algorithm perform the automatic segmentation of the movies for different values of the parameters B and α and compared the automatically obtained boundaries with the probable ones. Thereby, an automatically detected boundary not registered by any of the test users was considered as false.
The best performance for both movie sequences was obtained if 50% of the blocks were considered in (8) and for a threshold multiplication factor α of 1.4. Thereby, both block dimensions H and W used to segment the shot images were chosen as 8. For these parameter values, 69% of the probable boundaries were detected on average for both movie sequences, with only 6% false detections. The low number of false detections is fully compliant with our requirements for conciseness and comprehensiveness of the movie retrieval interface at its top level. These characteristics are not guaranteed if a high percentage of "false episodes" is present, making the first interaction level overloaded and the interaction inefficient. On the other hand, after investigating the missed 31% of probable boundaries, we found out that most of the episodes which could not be distinguished from each other belong to the same global context (e.g. a series of episodes including a wedding ceremony, a reception and a wedding party). Therefore, the comprehensiveness of the LSU set obtained for B=50% and α=1.4 was not strongly degraded by missed boundaries.
5 Conclusions
The high-level movie segmentation approach presented in this paper is based on investigating the visual information of a video sequence and its temporal variations, as well as on the assumption of global temporal consistency of the visual content within a movie episode. Our work shows that using only visual features of a movie sequence can provide satisfactory segmentation results, although the LSU boundaries only approximate the actual episode boundaries in some cases. The obtained LSUs can be used for developing an efficient event-oriented movie retrieval scheme. As the proposed technique computes the detection threshold recursively and looks only a limited number of shots ahead during the LSU boundary detection, the entire process, including shot-change detection, key-frame extraction, and LSU boundary detection, can be carried out in a single pass through a sequence.
References
1. Hanjalic A., Ceccarelli M., Lagendijk R.L., Biemond J.: Automation of systems enabling search on stored video data, Proceedings of IS&T/SPIE Storage and Retrieval for Image and Video Databases V, Vol. 3022 (1997)
2. Pfeiffer S., Lienhart R., Fischer S., Effelsberg W.: Abstracting Digital Movies Automatically, Journal of Visual Communication and Image Representation, Vol. 7, No. 4 (1996), pp. 345-353
3. The SMASH project home page: http://www-it.et.tudelft.nl/pda/smash.
4. Yeo B.-L., Liu B.: On the Extraction of DC Sequence from MPEG Compressed Video, Proceedings of IEEE ICIP (1996)
5. Yeung M., Yeo B.-L.: Video Content Characterization and Compaction for Digital Library Applications, Proceedings of IS&T/SPIE Storage and Retrieval for Image and Video Databases V, Vol. 3022 (1997)
Local Color Analysis for Scene Break Detection Applied to TV Commercials Recognition
Juan María Sánchez, Xavier Binefa, Jordi Vitrià, and Petia Radeva
Computer Vision Center, Departament d'Informàtica, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
{juanma,xavierb,jordi,petia}@cvc.uab.es
Abstract. TV commercials recognition is a need for advertisers in order to check the fulfillment of their contracts with TV stations. In this paper we present an approach to this problem based on compacting a representative frame of each shot by a PCA of its color histogram. We also present a new algorithm for scene break detection based on the analysis of local color variations in consecutive frames of some specific regions of the image.
1 Introduction
The recognition of known patterns in a given digital video is a task with many applications. Among these, we have developed a system that recognizes the broadcast of TV commercials on-line and stores their broadcast time as well as their length. This data can be used by advertisers to check the fulfillment of their contracts with TV stations, as well as to assess the impact of their publicity on the viewers, together with audience statistics. Some previous work has been done on the analysis of TV commercials. The aim of Lienhart et al. in [3] is the isolation and extraction of commercial blocks from TV broadcasts by detecting a set of features common to every commercial block. In the same paper, a recognition-based approach is also implemented to detect single commercials out of commercial blocks, using an approximate substring matching algorithm. Colombo et al. [1] obtain some semantic measures of video sequences from a set of perceptual features in order to characterize every commercial and perform content-based retrieval on a video database. We have not found any previous work about auditing commercial broadcasts. The problem we are dealing with consists of creating a suitable and compact representation to store a digital video sequence pattern, so that it can be recognized in a given video. We assume that every video sequence is constituted by a set of shots, which can be represented by a single frame. This assumption forces us to implement an algorithm to correctly detect every transition between shots.
This work was supported by the projects TAP97-0463 and TAP-0631 of the Spanish Ministry of Industry.
Different techniques have been used for this purpose, most of them based on color. The simplest one is the color histogram difference between consecutive frames [9]. Some improvements have been proposed by considering the spatial distribution of pixels in the image, like the color coherence vectors (CCV) [6], which divide the pixels in each bin into coherent and non-coherent, or the joint histograms [7], which build a multidimensional histogram considering a set of local features for each pixel. These algorithms detect almost every sharp transition, but gradual transitions, i.e. dissolves and fades, are not correctly detected due to the low global variation of color from one frame to the next. A completely different alternative is proposed by Zabih et al. in [12], where an algorithm is presented which is based on computing the number of edge pixels that appear and disappear in a given frame as a feature called the edge change ratio (ECR). The main problem of this algorithm is dealing with affine transformations of the contents of the image at a low computational cost. The algorithm presented in this paper treats this problem by locally considering some specific regions of the image. This scene break detection algorithm is introduced in Section 2, and our TV commercial recognition system is presented in Section 3, showing the way we use principal component analysis (PCA) to reduce the size of the representation of every shot and obtain a real-time recognition system.
2 Local Color Analysis for Scene Break Detection
We introduce a sharp and gradual transition detection algorithm based on the analysis of the color contents of specific regions in consecutive frames, in order to get a measure of the global variation between them as a combination of local differences. Let $R_j^i$ be region $j$ of frame $i$ of a video sequence. A measure of the variation of each region's colors from frame $i$ to frame $i+1$ can be computed as a difference $\Delta h_j^i$ of their associated color histograms $h_j^i$ and $h_j^{i+1}$. A global difference measure between frames $i$ and $i+1$ can then be obtained as a combination of the local differences. The possible transformations of the contents of each region from one frame to the next must be considered in the way we define the pairs of regions $(R_j^i, R_j^{i+1})$ and how their color histograms are built and compared.
2.1 Regions Around Color Transitions
Following our approach, regions $R_j^i$ are defined from the most significant color transitions of the image, detected as high values of its multispectral gradient. The main interest of these regions is that they can be divided into two sub-regions by the gradient creases. We consider that one of them ($O_j^i$) belongs to an object in the scene and the other one ($B_j^i$) belongs to a different object or to the background (or both). Each region $R_j^i$ is defined as follows:
1. The multispectral gradient of the image is obtained using Sobel's approximation.
2. Only significant color transitions are considered, by applying a threshold, obtained by Otsu's algorithm [5], to the gradient image and removing its smaller connected regions using a queue-based algorithm [11].
3. A region $R_j^i$ is defined from each connected region of the gradient image by an adaptive window that contains all of its pixels.
If the content of at least one of these sub-regions does not significantly change from one frame to the next, we can say that the main content of $R_j^i$ is the same in $R_j^{i+1}$. This situation can be easily detected if $h_j^i$ and $h_j^{i+1}$ are built in such a way that they contain the same number of pixels from $O_j^i$ and $B_j^i$. Using a queue-based algorithm, we know which pixels are at a distance less than or equal to $d$ from the border between both sub-regions. With this algorithm, the color histogram is exactly the same for every orientation of the contents of the region. Suppose that we have a region in frames $i$ and $i+1$ in which at least one of its sub-regions does not change. Therefore, the values of its corresponding colors in $h_j^i$ and $h_j^{i+1}$ are very close. If we sort their bins to get two vectors $h_j^i(n)$ and $h_j^{i+1}(n)$, $n = 0..N-1$, such that $|h_j^i(r) - h_j^{i+1}(r)| < |h_j^i(s) - h_j^{i+1}(s)|$, $\forall r < s$, we will find the bins corresponding to the unchanged sub-region in their $K$ first positions, where $\sum_{k=0}^{K-1} h_j^i(k) \approx \sum_{k=0}^{K-1} h_j^{i+1}(k) \approx 0.5$, and the sum of the differences between these $K$ elements of the vectors will be close to zero. Thus, we will know that a region has no continuity in either of its sub-regions when this sum of differences is high. These changed regions will be named $Q_j^i$.
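A small sketch of this test follows, assuming the two region histograms are already built as described (equal pixel counts from both sub-regions, normalized to sum to one); the decision threshold is an illustrative value of ours.

```python
import numpy as np

def region_changed(h_i, h_i1, tol=0.1):
    """Decide whether region R_j lost both of its sub-regions between frames
    i and i+1.  h_i and h_i1 are the region's normalized color histograms."""
    diff = np.abs(h_i - h_i1)
    order = np.argsort(diff)                  # most stable bins first
    # K first sorted bins: enough to cover one sub-region's half of the pixels
    K = int(np.searchsorted(np.cumsum(h_i[order]), 0.5)) + 1
    unchanged_gap = diff[order][:K].sum()
    # close to zero -> one sub-region survived; large -> the region is a Q_j
    return unchanged_gap > tol
```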
2.2 Tolerance to Transformations of Regions
Before applying this difference measure between regions, we must find the pairs of regions $(R_j^i, R_j^{i+1})$ that will be compared, considering the possible transformations of their contents, mainly translations, scalings due to camera zooms, and rotations. The window that defines $R_j^i$ on the gradient image is used to find the displacement of this region by correlating it with its neighbourhood in the gradient image of frame $i+1$. Changes in the scale of the contents of $R_j^i$ can mislead the correlation, as shown in Fig. 1, where the scale change is mistaken for a displacement and the content of the window that defines $R_j^{i+1}$ differs from that of $R_j^i$. To solve this situation, each connected region of the gradient image of frame $i+1$ is divided into smaller pieces with a maximum of $p$ pixels, and a region $R_j^i$ is defined for each piece, so all of them can be correctly located in frame $i+1$, as shown in Fig. 1. Although rotations are mainly considered in the way we compute the color histograms of the regions, as we have previously said, when the orientation of the contents of $R_j^i$ changes, the shape of the window that defines it may change as well. Figure 2 shows this situation. We assume that between two consecutive frames the rotation angle of an element of the scene is small, and so is the change in the shape of the window that defines $R_j^i$. This change can be overcome by increasing the window size. If an $R_j^{i+1}$ cannot be found for a given $R_j^i$, i.e. the maximum of the correlation with its surroundings is close to zero, we know that the contents of $R_j^i$ have completely changed from one frame to the next, with no need of computing the difference between $h_j^i$ and $h_j^{i+1}$. These regions will be named $P_j^i$.
Fig. 1. Tolerance to zooms dividing the connected regions of the gradient image into smaller pieces.
Fig. 2. Change in the shape of the window that defines an $R_j^i$ due to a rotation.
2.3 Detecting and Classifying Scene Breaks
Once we know which regions $R_j^i$ do not have continuity in frame $i+1$, we compute a measure of the global variation of the scene between two consecutive frames. The variation of a large region is more significant than that of a smaller one, so we weight each region's contribution to the global measure by the number of pixels it contains. Then, this measure is computed as:
$$V(i, i+1) = \frac{\sum_{j=0}^{L-1} |P_j^i| + \sum_{j=0}^{M-1} |Q_j^i|}{\sum_{j=0}^{N-1} |R_j^i|} \qquad (1)$$
where $|R_j^i|$ is the number of pixels in $R_j^i$. This measure lets us detect and classify different kinds of scene breaks. Cuts are characterized by high values of $V(i, i+1)$ caused by the sudden change of the scene content. Dissolves are mainly due to the $P_j^i$'s, because new edges gradually appear in frame $i+1$ far from the location of the ones in the previous frames. In a fade-out, every $R_j^i$ turns out to be a $P_j^i$ because all the gradient creases disappear, but a fade-in cannot be detected using $V(i, i+1)$. As gradient creases gradually appear, $V(i+1, i)$ must be used instead. Since we are forced to compute $V(i, i+1)$ and $V(i+1, i)$ to detect both kinds of fades, we can also improve the detection of cuts and dissolves, because both measures are high in these transitions, so their sum is a more robust measure than only one of them. Figure 3 shows their values corresponding to each different kind of scene break.
Fig. 3. Values of $V(i, i+1)$ and $V(i+1, i)$ during a cut (left), a dissolve (center) and both kinds of fades (right).
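The following sketch computes $V(i, i+1)$ from the region bookkeeping of Eq. (1) and applies a simple rule that mirrors the discussion above; the thresholds are illustrative values, not taken from the paper.

```python
def global_variation(P_sizes, Q_sizes, R_sizes):
    """V(i, i+1) of Eq. (1): pixels of regions with no match (P) plus pixels of
    changed regions (Q), normalized by the pixels of all regions R."""
    return (sum(P_sizes) + sum(Q_sizes)) / float(max(sum(R_sizes), 1))

def classify_break(V_fwd, V_bwd, t_cut=0.8, t_grad=0.4):
    """Rough classification: both directions high -> cut or dissolve (their sum
    is the robust cue); only the forward measure high -> fade-out; only the
    backward measure high -> fade-in."""
    if V_fwd + V_bwd > 2.0 * t_cut:
        return "cut or dissolve"
    if V_fwd > t_grad >= V_bwd:
        return "fade-out"
    if V_bwd > t_grad >= V_fwd:
        return "fade-in"
    return "no scene break"
```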
2.4 Experimental Results
We have tested our algorithm on a video sequence of Spanish TV commercials acquired at 25 frames per second with a 160 × 120 pixel resolution. This sequence contained several scene breaks, including 65 cuts, 13 dissolves, 3 fade-outs and 2 fade-ins. All of them were correctly detected by our algorithm. Fifteen false positives were reported during this sequence, mainly caused by frame sequences where there are no significant regions in the gradient image. In this situation, regions $R_j^i$ cannot be defined and the algorithm's behaviour is unpredictable. Only 4 of them were due to fast motion of large objects in the scene. For our application to commercial recognition, false positives turn out to be redundancy in the representation, which may even help the recognition process, while the detection of every scene break is a must.
3 Recognition of TV Commercials
In this section, we introduce our approach to the recognition of previously stored patterns of commercials in a given TV broadcast. The main difficulty of commercial recognition is the change of their length in different broadcasts in order to reduce their cost for advertisers. This reduction is achieved by removing shots and/or shortening their length. Our representation of a commercial assumes that every video sequence can be divided into shots. The information contained in each shot can be represented by a single image. For efficiency, we have chosen the first frame of a shot as its representative. Then, a commercial is stored as the sequence of representative frames of its shots. The recognition process is achieved by detecting the transitions between shots, getting their first frames and searching the stored database for the matching ones. The database is looked over in such a way that the elements with the highest matching probability are compared first. Let us suppose that we have previously detected and identified a shot of the input video sequence, so we know which commercial is being broadcast. When we detect the next shot, it is compared with every shot representative of the current commercial to know if it is still being broadcast. If it has finished, the new shot can very probably be the first one of another commercial, so its representative frame is compared with
every first-shot representative of the commercials in the database. Even if none of them matches it, it still must be compared with the rest of the shot representatives of every stored commercial, because the first shot of a commercial can be removed. The comparison of shot representatives is based on their color histograms in order to achieve some invariance to small changes in the representative frame of the shot, such as those caused by removing its first few frames. Color histograms are affected by color intensity variations in different TV stations' broadcasts, which can be modeled as a shift in each color channel histogram. Every representative frame's color channels are normalized depending on their average intensity values before searching the database.
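The search order described above could be organized as in the sketch below; the data layout and the generic `match` predicate (projected color histograms in the real system) are our own assumptions.

```python
def find_matching_shot(query, database, current=None, match=None):
    """Search order: (1) shots of the commercial currently being recognized,
    (2) the first-shot representative of every commercial, (3) all remaining
    shot representatives.  'database' maps a commercial name to its list of
    shot descriptors; returns (commercial, shot index) or None."""
    if current is not None:
        for idx, shot in enumerate(database[current]):
            if match(query, shot):
                return current, idx
    for name, shots in database.items():
        if shots and match(query, shots[0]):
            return name, 0
    for name, shots in database.items():
        for idx, shot in enumerate(shots[1:], start=1):
            if match(query, shot):
                return name, idx
    return None
```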
3.1 Dimensional Reduction Using PCA
If we want to work with a large commercial database in real time, the size of the elements to compare should be reduced to very few dimensions. This is achieved by applying PCA to the color histograms of the shot representatives. This statistical technique has been used extensively for object recognition tasks (Turk and Pentland [10], Nayar et al. [4], Huttenlocher et al. [2], Seales et al. [8]). Let $h$ be the color histogram of a frame, represented as a column vector. Given a set of learning histograms $h_m$, $m = 1..M$, we define the matrix $X = [h_1 - c, \ldots, h_M - c]$, where $c$ is the average of the $h_m$'s. The average histogram is subtracted from each $h_m$ so that the predominant eigenvectors of $XX^T$ will capture the maximal variation of the original set of histograms. The eigenvectors of $XX^T$ are an orthogonal basis in terms of which the $h_m$'s can be rewritten. Let $e_i$, $i = 1..N$, denote each of its eigenvectors and let $E$ be the matrix $[e_1, \ldots, e_N]$. Each $h_m$ is rewritten in terms of the orthogonal basis defined by the eigenvectors of $XX^T$ as $g_m = E^T(h_m - c)$. As distances are preserved under an orthonormal change of basis, it can be shown that $\|h_m - h_n\|^2 = \|g_m - g_n\|^2$ [8]. The most important fact is that $h_m$ can be approximated using just those eigenvectors corresponding to the $k$ largest eigenvalues, rather than all $N$ eigenvectors, where $k \ll N$. This low-dimensional representation is intended to capture the important characteristics of the set of learning images. Let $f_m$ be the vector of coefficients of $g_m$ corresponding to the $k$ largest eigenvalues. Then, the sum of squared differences, $\|h_m - h_n\|^2$, is approximated as $\|f_m - f_n\|^2$. This size reduction of the elements to compare allows us to speed up the recognition process by computing difference measures in the $k$-dimensional subspace instead of in the original $N$-dimensional space, while the maximal variation of the original set is kept.
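A minimal sketch of this projection step follows. It obtains the dominant eigenvectors of $XX^T$ through the SVD of the centered data matrix, which spans the same subspace; the value of k and the function names are ours.

```python
import numpy as np

def learn_pca_basis(H, k=10):
    """H: (M, N) array with one color histogram per learning frame.  Returns
    the mean histogram c and the k principal directions (eigenvectors of X X^T
    for the k largest eigenvalues, via SVD of X = [h_1 - c, ..., h_M - c])."""
    c = H.mean(axis=0)
    X = (H - c).T                              # N x M centered data matrix
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return c, U[:, :k]

def project_histogram(h, c, E_k):
    """f = E_k^T (h - c): the k-dimensional code used to compare shots."""
    return E_k.T @ (h - c)

def code_distance(f_a, f_b):
    """||f_a - f_b||^2 approximates ||h_a - h_b||^2, up to the discarded components."""
    return float(np.sum((f_a - f_b) ** 2))
```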
3.2 Real Application Results
We have tested our system on real prime-time Spanish TV broadcasts, using a PC and acquiring the video sequence directly with a Matrox Rainbow Runner video capture board. Although our scene break detection algorithm could not be fully used in real time because it has not been optimized yet, we have implemented it as a helper to a simple color histogram difference algorithm. When this measure gets over a low threshold (which reports many false positives), our new algorithm is asked to confirm the detection of the transition.
(a) Representative frames of the 10 first shots of a “Little Maidness” commercial.
(b) Representative frames of the 5 shots of a different “Little Maidness” commercial.
Fig. 4. Two different commercials of the same product.
exceeds a low threshold (which yields many false positives), our new algorithm is asked to confirm the detection of the transition. We have learned 23 commercials in our system, with a total of 399 shot representatives. Some of these commercials were broadcast more than once during the test sequence. The learned commercials appeared a total of 81 times during the 9 hours the system was running. All of them were recognized and their lengths were exactly reported, including those broadcast by a TV station different from the one used for learning. The test sequence also contained different commercials of the same product with some similar shots, like the ones shown in Fig. 4. Some cases were reported correctly, but some were not because the learned key frames did not correspond to the ones found during the broadcast. Although some of them are similar, the projections of their color histograms do not match. A better key frame selection would let us detect these commercials. However, since they are actually different commercials, both of them should be learned if they must be recognized by the system. Only 4 shots were reported as wrong commercials during this test. Such errors can easily be detected by the user if one frame is stored for every detected commercial.
4
Conclusions
This paper has introduced two contributions to digital video analysis. First of all, we have developed an application of pattern recognition in digital video
sequences based on their segmentation into shots and on the dimensional reduction of their representative frames, using PCA of their color histograms, in order to speed up the recognition process and achieve real-time recognition of commercials during TV broadcasts. Our system has proven to be very accurate in detecting the broadcast of commercials as well as their length. On the other hand, this approach requires us to correctly detect transitions between the shots of the sequence, including cuts, dissolves and the two kinds of fades. We have introduced a new scene break detection algorithm based on the local analysis of color variation between consecutive frames in specific regions of the image. We have chosen the regions around color transitions because their high perceptual content lets us interpret what is happening in each of them. The experimental results have been very satisfying. Future work will focus on optimizing the algorithm so that it can be fully used in real-time applications.
References
1. C. Colombo, A. Del Bimbo, and P. Pala. Retrieval of commercials by video semantics. In Proc. Computer Vision and Pattern Recognition, pages 572-577, 1998.
2. D. P. Huttenlocher, R. H. Lilien, and C. F. Olson. Object recognition using subspace methods. In Proc. of the 4th European Conference on Computer Vision, volume I, pages 536-545, April 1996.
3. R. Lienhart, C. Kuhmünch, and W. Effelsberg. On the detection and recognition of television commercials. In Proc. IEEE Conf. on Multimedia Computing and Systems, pages 509-516, Ottawa, Canada, June 1997.
4. S. K. Nayar, H. Murase, and S. A. Nene. Parametric appearance representation. In Early Visual Learning, pages 131-160. Oxford University Press, 1996.
5. N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, SMC-9(1):62-66, January 1979.
6. G. Pass and R. Zabih. Histogram refinement for content-based image retrieval. In Proc. of the 3rd Workshop on Applications of Computer Vision, Sarasota, Florida, December 1996.
7. G. Pass and R. Zabih. Comparing images using joint histograms. ACM Journal of Multimedia Systems, 1998 (to appear).
8. W. B. Seales, C. J. Yuan, W. Hu, and M. D. Cuts. Object recognition in compressed imagery. Image and Vision Computing, 16:337-352, 1998.
9. M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.
10. M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
11. L. Vincent. Algorithmes morphologiques à base de files d'attente et de lacets. Extension aux graphes. Ph.D. dissertation, École Nationale Supérieure des Mines de Paris, France, 1990.
12. R. Zabih, J. Miller, and K. Mai. A feature-based algorithm for detecting and classifying scene breaks. In ACM Conference on Multimedia, San Francisco, California, November 1995.
Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval
Patrick Bouthemy¹, Christophe Garcia², Rémi Ronfard³, George Tziritas², Emmanuel Veneau³, and Didier Zugaj¹
¹ INRIA-IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
{bouthemy,dzugaj}@irisa.fr
² ICS-FORTH, P.O. Box 1385, GR 711 10 Heraklion, Crete, Greece
{cgarcia,tziritas}@csi.forth.gr
³ INA, 4 avenue de l'Europe, 94366 Bry-sur-Marne, France
{rronfard,eveneau}@ina.fr
Abstract. We present a video analysis and indexing engine that can perform fully automatic scene segmentation and feature extraction in the context of a television archive, based on a library of image analysis functions and templates.
1
Introduction
The ESPRIT project DiVAN aims at building a distributed architecture for the management of - and access to - large television archives. An important part of the system is devoted to automatic segmentation of the video content, in order to assist the indexing and retrieval of programs at several levels of detail, from individual frames to scenes and stories, using a combination of annotated semantic information and automatically extracted content-based signatures. In Section 2, the template-based architecture of the system is described. The tools developed in this architecture are presented in Section 3. In Section 4, we briefly sketch how those tools can be integrated into complete analysis graphs, driven by the internal structure of television programs.
Fig. 1. Architecture of the video indexing system in DiVAN: Analysis is controlled by scripts stored in the template library, and launches tools from the tools library. Both libraries are accessed via the CORBA ORB.
2
Template-Based Indexing and Retrieval
In DiVAN, a video indexing session is distributed between several machines, using connections to two remote CORBA servers - a video analysis server and a template library server (Figure 1). Each program template contains the taxonomy of event categories (such as shots, scenes and sequences) meaningful to a collection of programs (such as news, drama, etc.), along with their temporal and compositional structure. In addition, the template contains algorithmic rules used to recover the structure of each individual program, based on observable events from the tools server (such as dissolves, wipes, camera motion types, extracted faces and logos). Section 3 reviews the tools available in the DiVAN system, and Section 4 presents examples of template-based indexing using those tools.
3
Shot Segmentation and Feature Extraction
The first step in the analysis of a video is its partitioning into elementary subparts, called shots, according to significant scene changes (cuts or progressive transitions). Features such as keyframes, dominant colors, faces and logos can also be extracted to represent each shot. 3.1
Cut Detection
Our algorithm involves the frames included in the interval between two successive I-frames. It is assumed that this interval is short enough to contain only one cut point. Given this assumption, the algorithm proceeds as follows.
Step 1: Perform cut detection on the next I-frame pair, using one of the histogram metrics and a predefined cut detection threshold.
Step 2: Confirm the detected cut in the next subinterval between two consecutive I- or P-frames included in the interval of I-frames.
Step 3: Perform cut detection between two successive I- or P-frames using histogram or MPEG macroblock motion information. If the cut is not confirmed in this subinterval, go to Step 2.
Step 4: Determine the exact cut point in the subinterval by applying a histogram- or MPEG-motion-based cut detection method on each frame pair. If the cut is not confirmed, go to Step 2; otherwise, go to Step 1.
Key-frame extraction is performed during shot detection and is based on a simple sequential clustering algorithm, using a single cluster C of candidate key-frames, which is successively updated by including the I-frame that follows the last element in display order. C is a set of K consecutive I-frames and C_m is an "ideal" frame whose features match the mean vector ("mean frame") of the corresponding characteristics of the frames C_k (k = 1..K). Assuming that the only feature used for classification is the colour/intensity histogram H_k of the frames C_k, the "mean frame" histogram H_m is computed. Then, in order to measure the distance δ(C_k, C_m), an appropriate histogram metric is used. Based on this distance, the key-frame C_K (the representative frame of C) is determined according to a threshold.
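The following Python sketch illustrates two of the ingredients above under stated assumptions: a histogram-distance test for Step 1 (the exact histogram metric and threshold are not specified here, so a chi-square-like distance is used for illustration) and the "mean frame" rule for selecting a key-frame within a cluster of consecutive I-frames.

import numpy as np

def hist_metric(h1, h2):
    # Chi-square-like histogram distance; an illustrative choice, not
    # necessarily the metric used in the actual system.
    denom = (h1 + h2).astype(float)
    denom[denom == 0] = 1.0
    return float(((h1 - h2) ** 2 / denom).sum())

def is_candidate_cut(h_prev_I, h_next_I, threshold):
    # Step 1: flag a candidate cut between two successive I-frames when
    # their histogram distance exceeds the cut detection threshold.
    return hist_metric(h_prev_I, h_next_I) > threshold

def select_keyframe(cluster_hists, threshold):
    # Key-frame extraction: the "mean frame" histogram H_m is the average
    # of the cluster's I-frame histograms, and the key-frame is the member
    # closest to it, accepted if its distance stays below the threshold.
    H = np.stack([h.astype(float) for h in cluster_hists])
    H_m = H.mean(axis=0)
    dists = [hist_metric(h, H_m) for h in H]
    best = int(np.argmin(dists))
    return best if dists[best] <= threshold else None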
3.2
Progressive Transitions and Wipes
The method we propose for detecting progressive transitions relies on the temporal coherence of the global dominant image motion between successive image pairs within a shot. To this end, we exploit a robust estimation criterion and a multiresolution scheme to estimate a 2D parametric motion model accounting for the dominant image motion in the presence of secondary motions [7]. The minimization problem is solved using an IRLS (Iteratively Re-weighted Least Squares) technique. The set of pixels conforming with the estimated dominant motion forms its estimation support. The analysis of the temporal evolution of the support size supplies the partitioning of the video into shots [2]. A cumulative sum test (Hinkley's test) is used to detect significant jumps of this variable, accounting for the presence of shot changes. It can accurately determine the beginning and end time instants of progressive transitions. Owing to the robust motion estimation method, this technique is fairly resilient to the presence of mobile objects or to global illumination changes. More details can be found in [3]. It is particularly efficient at detecting dissolve transitions. To handle wipe transitions between two static shots, we have developed a specific extension of this method. In the case of wipes, we conversely pay attention to the outliers. Indeed, as illustrated in Figure 2, the geometry of the main outlier region reflects the nature of the wipe, with the presence of a rectangular stripe. To exploit this property in a simple and efficient way, we project the outlier binary pixels onto the vertical and horizontal image axes, which supplies characterizable peaks. Then, to detect wipes, we compute the temporal correlation of these successive projections. The location and value of the correlation extremum allow us to characterize wipes in the video. One representative result is reported in Figure 2.
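A hedged sketch of this wipe cue in Python/NumPy: the binary outlier map is projected onto the two image axes, and successive projections are correlated over a range of spatial lags; for a wipe, the stripe of outliers translates from frame to frame, so the correlation peak occurs at a steadily moving, non-zero lag. The normalization and the lag search below are illustrative choices rather than the exact implementation.

import numpy as np

def outlier_projections(outlier_mask):
    # Project the binary outlier map (1 = pixel not conforming to the
    # dominant motion) onto the horizontal and vertical image axes.
    proj_x = outlier_mask.sum(axis=0).astype(float)   # one value per column
    proj_y = outlier_mask.sum(axis=1).astype(float)   # one value per row
    return proj_x / max(proj_x.max(), 1.0), proj_y / max(proj_y.max(), 1.0)

def projection_correlation(prev_proj, curr_proj, max_lag=20):
    # Correlate two successive projections over a range of spatial lags
    # and return the best lag and its correlation value.
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        corr = float(np.dot(np.roll(prev_proj, lag), curr_proj))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr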
3.3
Dominant Color Extraction
We characterize the color content of keyframes by automatically selecting a small set of colours, called dominant colours. Because, in most images, a small number of colour ranges captures the majority of pixels, these colours can be used efficiently to characterize the color content. These dominant colours can easily be determined from the keyframe color histograms. Using a smaller set of colours enhances the performance of image matching, because colour variations due to noise are removed. The description of the image using these dominant colours is simply the percentage of each of these colours found in the image. We compute the set of dominant colours with the k-means clustering algorithm. This iterative algorithm determines a set of clusters of nearly homogeneous colours, each cluster being represented by its mean value. At each iteration, two steps are performed: 1) using the nearest-neighbour rule, pixels are assigned to clusters, and 2) each new cluster is represented by its mean colour vector.
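A minimal NumPy implementation of this clustering step might look as follows; the number of clusters, the random initialization and the fixed iteration count are illustrative assumptions.

import numpy as np

def dominant_colors(pixels, k=8, n_iter=20, seed=0):
    # k-means clustering of the pixel colours of a keyframe.
    # pixels: (n, 3) float array; returns the k mean colours and the
    # fraction of the image assigned to each of them.
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    labels = np.zeros(len(pixels), dtype=int)
    for _ in range(n_iter):
        # 1) nearest-neighbour rule: assign each pixel to its closest cluster
        d2 = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2) represent each cluster by its mean colour vector
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    percentages = np.bincount(labels, minlength=k) / len(pixels)
    return centers, percentages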
Fig. 2. Detection of wipes: (a,b) images within a wipe effect, (c) image of outliers in black, (d,e) projections on the horizontal and vertical axes, (f) plot of the correlation criterion.
3.4
Logo Detection
We introduce a simple but efficient method for the detection of static logos, which is the most frequent case in TV programs. To speed up the detection, we specify the expected screen location of the logo by defining its associated bounding box B (see Figure 3). The proposed method is based on the two following steps: 1) determination of the logo binary mask, 2) evaluation of a matching criterion combining luminance and chrominance attributes. If the logo to be detected is not rectangular, the first step consists in building the exact logo mask for detection, that is, we have to determine which pixels of the bounding box belong to the logo. From at least two instances of the logo in images with different backgrounds, a discriminant analysis technique is used to cluster pixels into two classes: those belonging to the logo and the others. The clustering step is performed by considering image difference values for the luminance attribute {a1}, and directly the values of the chrominance images for the chrominance attributes {a2, a3} in the Y, U, V space. The two classes are discriminated by considering a threshold T for each attribute {ak}; we consider the inter-class variance σ²_InterC(T) for each attribute. The optimal threshold T* then corresponds to an extremum of the inter-class variance σ²_InterC(T). We display in Figure 3 an example of the histogram of the image-difference pixels within the bounding box, a plot of the inter-class variance σ²_InterC(T) supplying the dissimilarity measure between the two classes, and the determined binary mask M. The matching simply consists in comparing each pixel belonging to the logo mask in the current color image of the sequence with the corresponding value in the reference logo image. The criterion is given by the sum of the squared differences between these values over the logo mask.
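The sketch below illustrates the two steps in Python/NumPy under our own assumptions about binning and data layout: an exhaustive search for the threshold T* maximizing the inter-class variance of one attribute (a discriminant-analysis criterion of the Otsu type), and the sum-of-squared-differences matching criterion evaluated over the logo mask.

import numpy as np

def optimal_threshold(values, n_bins=256):
    # Threshold T* maximizing the inter-class variance sigma^2_InterC(T)
    # of a 1-D attribute, used to split bounding-box pixels into
    # 'logo' and 'non-logo' classes.
    hist, edges = np.histogram(values, bins=n_bins)
    p = hist / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, n_bins):
        w0, w1 = p[:i].sum(), p[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (p[:i] * centers[:i]).sum() / w0
        m1 = (p[i:] * centers[i:]).sum() / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = centers[i], var_between
    return best_t

def logo_match_score(frame_box, ref_box, mask):
    # Matching criterion: sum of squared differences over the logo mask
    # between the current frame and the reference logo image.
    diff = frame_box[mask].astype(float) - ref_box[mask].astype(float)
    return float((diff ** 2).sum())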
Fig. 3. Top row: images with the same logo and associated logo bounding boxes. Middle row: determination of the binary mask M (d) from the image of chrominance U (a); plot of the distribution of attribute {a2} (b) and of the inter-class variance values σ²_InterC(T) (c).
3.5
Face Detection
We have developed a fast and complete scheme for face detection in color images where the number of faces, their location, their orientation and their size are unknown, under non-constrained conditions such as complex backgrounds [5]. Our scheme starts by performing a chrominance-based segmentation using the YCbCr color model, which is closely related to the MPEG coding scheme. According to a precise approximation of the skin color sub-spaces in the YCbCr color model, the average color value of each macro-block is classified. A binary mask is computed in which a "one" corresponds to a skin color macro-block and a "zero" corresponds to a non-skin color macro-block. Then, scanning through the binary mask, connected "one" candidate face regions are extracted. In order to meet the speed requirements, we use a simple method of integral projection, as in [9]. Each macro-block mask image is segmented into non-overlapping rectangular regions that contain a connected "one" region, and we search for candidate face areas within these resulting areas. Given that we do not know a priori the size of the face regions contained in the frame, we first look for the largest possible ones; then we iteratively reduce their size and run over the possible aspect ratios and positions in each binary mask segment area. Constraints related to size range, shape and homogeneity are then applied to detect these candidate face regions, and possible overlaps are resolved. The purpose of the last stage of our algorithm is to verify the face detection results obtained after the candidate face area segmentation and to remove false alarms caused by objects with colors similar to skin and with similar aspect ratios, such as exposed parts of the body or parts of the background. For this purpose, a wavelet packet analysis scheme is applied, using a method similar to the one we described in [3,4] for face recognition. For a candidate face region, we search for a face at every position inside the region and for a number of possible dimensions of a bounding rectangle. This search is based on a feature vector that is extracted from a set
of wavelet packet coefficients related to the corresponding region. According to a metric derived from the Bhattacharyya distance, a subregion, defined by its position and size within the candidate face area, is classified as a face or non-face area, using the prototype face area vectors acquired in a previous training step.
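For the chrominance-based first stage, a minimal Python/NumPy sketch is given below. The Cb/Cr intervals are commonly quoted skin-colour ranges used only as placeholders; they are not the precise sub-spaces of [5], and the 16-pixel macro-block size is an assumption.

import numpy as np

def skin_macroblock_mask(frame_ycbcr, block=16,
                         cb_range=(77, 127), cr_range=(133, 173)):
    # Classify each macro-block of a YCbCr frame as skin / non-skin from
    # its average chrominance; returns the binary mask of 'one' blocks.
    h, w, _ = frame_ycbcr.shape
    rows, cols = h // block, w // block
    mask = np.zeros((rows, cols), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            mb = frame_ycbcr[r * block:(r + 1) * block,
                             c * block:(c + 1) * block]
            cb, cr = mb[..., 1].mean(), mb[..., 2].mean()
            if cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]:
                mask[r, c] = 1
    return mask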
4
Scene Segmentation
Shot segmentation and feature extraction alone cannot meet the needs of professional television archives, where the meaningful units are higher-level program segments such as stories and scenes. This section focuses on the segmentation of programs into scenes and stories, as described by some generic templates (special-purpose templates can also be created for specific program collections, although they are not described in this paper for lack of space). 4.1
Basic Program Template
Our basic program template implements a scene segmentation algorithm introduced by Yeo et al. [10]. A scene is defined as a grouping of contiguous shots, based on image similarity. This algorithm uses three parameters - an image similarity function, an average scene duration ∆T and a cluster distance threshold. The similarity function uses a combination of extracted features - color histograms, dominant colors, camera motion types, number, size and positions of faces, etc. Figure 4 shows the analysis graph for scene analysis using dominant colors. The result of the indexing session consists of a hierarchically organized scene track. Independently, a soundtrack is also extracted, but it is not taken into account in the building of the scene graph.
Fig. 4. Analysis graph for the basic program template.
The details of the scene segmentation algorithm follow the original algorithm of Yeo et al. A binary tree is constructed in order to represent the similarity between keyframes. A temporal constraint is introduced in order to reduce computation time. To calculate the N(N - 1) distances describing the similarity between clusters, we start with the pairwise distances between keyframes separated by less than a temporal window, as done in [2]. Distances are accessed via
a priority stack for efficiency reasons. At each step, the smallest distance among the N(N-1) distances is selected, the two corresponding clusters are merged, and all distances are updated through the Lance-Williams formula. A binary tree is built by successive merging of the nearest clusters, until the threshold value is reached. The scene segmentation is computed according to the methods described in [2]. We build an oriented graph where the nodes are the previously selected clusters and the edges are temporal succession links. The oriented graph is segmented into its strongly connected components, which are assumed to correspond to scenes. Finally, we complete the scene track by adding all progressive transitions at their appropriate level (between shots or scenes). We obtained good scene segmentation results using regional color histograms or dominant colors in programs with long sequences, each corresponding to a different setting (as in drama or fiction). In many other cases, this approach to scene segmentation suffers from some limitations, due to its restrictive definition of a scene. For instance, a short story item in a news broadcast will be incorporated into a larger scene, because of the recurrent anchorperson images before and after the story.
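A small Python sketch of the grouping step, under the assumption that each shot has already been assigned a cluster label by the hierarchical clustering above: the oriented graph of clusters with temporal succession links is built, and its strongly connected components (computed here with Kosaraju's algorithm) are returned as candidate scenes.

from collections import defaultdict

def scene_components(shot_clusters):
    # shot_clusters: cluster label of each shot, in temporal order.
    # Build the oriented graph whose nodes are clusters and whose edges
    # are temporal succession links, and return its strongly connected
    # components; each component is taken as one scene.
    edges = defaultdict(set)
    nodes = set(shot_clusters)
    for a, b in zip(shot_clusters, shot_clusters[1:]):
        if a != b:
            edges[a].add(b)

    visited, order = set(), []

    def dfs(u, adj, acc):
        # Iterative depth-first search appending nodes in finish order.
        stack = [(u, iter(adj[u]))]
        visited.add(u)
        while stack:
            node, it = stack[-1]
            advanced = False
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(adj[v])))
                    advanced = True
                    break
            if not advanced:
                acc.append(node)
                stack.pop()

    for u in nodes:
        if u not in visited:
            dfs(u, edges, order)

    transposed = defaultdict(set)
    for u in list(edges):
        for v in edges[u]:
            transposed[v].add(u)

    visited, components = set(), []
    for u in reversed(order):
        if u not in visited:
            comp = []
            dfs(u, transposed, comp)
            components.append(set(comp))
    return components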
4.2
Composite Program Template
The DiVAN composite program template is used to introduce a broader definition of program segments such as stories, using cues such as logo or face detection. Figure 5 shows the complete analysis graph for a composite program. The program is segmented into stories, based on detected logos, wipes, dissolves and soundtrack changes, and stories are segmented into scenes using the previously described algorithm. In [8], we state the story segmentation problem as a CSP (constraint satisfaction problem) in the framework of an extended description logic, whose solutions are all the segmentations of a composite program compatible with the given template.
Fig. 5. Analysis graph for a composite program template.
5
Conclusion
This paper describes the first prototype of the DiVAN indexing system, with a focus on the image processing library and the general system architecture. Template-based video analysis systems have been proposed in academic research [6,1], but no such system is available commercially. The Video Analysis Engine by Excalibur offers a choice of parameters for shot segmentation, but the list of choices is closed. The DiVAN prototype gives users control of many efficient video analysis tools through the editing of their own templates, and will be thoroughly tested by documentalists at INA, RAI and ERT in the spring of 1999. The second DiVAN prototype will provide many more functions, such as audio analysis, face recognition, query-by-example, moving object analysis and camera motion analysis.
Acknowledgments This work was funded in part under the ESPRIT EP24956 project DiVAN. We would like to thank the Innovation Department at Institut National de l’Audiovisuel (INA) for providing the video test material reproduced in this communication.
References
1. P. Aigrain, P. Joly, and V. Longueville. Medium knowledge-based macro-segmentation of video into sequences. In Intelligent Multimedia Information Retrieval, M.T. Maybury, Editor. MIT Press, 1997.
2. P. Bouthemy and F. Ganansia. Video partitioning and camera motion characterization for content-based video indexing. In Proc. 3rd IEEE Int. Conf. on Image Processing, ICIP'96, Lausanne, September 1996.
3. C. Garcia, G. Zikos, and G. Tziritas. A wavelet-based framework for face recognition. In Int. Workshop on Advances in Facial Image Analysis and Recognition Technology, 5th European Conference on Computer Vision, June 2-6, 1998, Freiburg, Germany.
4. C. Garcia, G. Zikos, and G. Tziritas. Wavelet packet analysis for face recognition. To appear in Image and Vision Computing, 1999.
5. C. Garcia, G. Zikos, and G. Tziritas. Face detection in color images using wavelet packet analysis. IEEE Multimedia Systems'99, 1999.
6. B. Merialdo and F. Dubois. An agent-based architecture for content-based multimedia browsing. In Intelligent Multimedia Information Retrieval, M.T. Maybury, Editor. MIT Press, 1997.
7. J.M. Odobez and P. Bouthemy. Robust multiresolution estimation and object-oriented segmentation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4), 1995.
8. J. Carrive, F. Pachet, and R. Ronfard. Using description logics for indexing audiovisual documents. International Workshop on Description Logics, Trento, Italy, 1998.
9. H. Wang and S.-F. Chang. A highly efficient system for automatic face region detection in MPEG video. IEEE Transactions on Circuits and Systems for Video Technology, 7(4):615-628, 1997.
10. M. Yeung, B.-L. Yeo, and B. Liu. Extracting story units from long programs for video browsing and navigation. International Conference on Multimedia Computing and Systems, June 1996.
Automatic Recognition of Camera Zooms* Stephan Fischer, Ivica Rimac, and Ralf Steinmetz Industrial Process and System Communications Department of Electrical Engineering and Information Technology Technical University of Darmstadt Merckstr. 25 • D-64283 Darmstadt • Germany
Abstract. In this paper we explain how camera zooms in digital video can be recognized automatically. To achieve this goal we propose a new algorithm to detect zooms automatically and present experimental data on its performance.
1
Introduction
In this paper we describe a new algorithm to recognize zooms automatically. Besides merely recognizing that a zoom has taken place, it is often very useful to compute an exact scaling factor between frames. We have developed a new algorithm which both recognizes zooms and computes a scaling rate. The algorithm can be used to derive semantic features of digital film, such as automatic scene groupings or trailers [Fis97], [BR96]. The paper is structured as follows: after a review of related work in section 2, we propose in section 3 a new algorithm to recognize camera zooms. In section 4 we present experimental results obtained with our algorithm. Section 5 concludes the paper and gives an outlook.
2
Related Work
The recognition of zooms, as well as the calculation of zoom factors, has been investigated in the past in the context of camera self-calibration. [PKG98] describe a versatile approach which can be used to self-calibrate a camera; the authors specialize the general case to scenarios where the focal length varies. The difference from the work described in this paper is that our algorithm is much simpler and hence faster to compute. The same holds for the work described in [HA97] and [FLM92]. Tse and Baker describe a global zoom/pan estimation algorithm in [TB91], which requires the analysis of a high-density field of motion vectors. Since determining an entire frame of motion vectors with high spatial resolution is a very time-consuming process, Zhang et al. [ZGST94] propose the block-matching algorithm for computing the motion vector field. But this method is inherently inaccurate when there is multiple
This work is supported in part by Volkswagen-Stiftung, Hannover, Germany.
motion inside a block. Furthermore, block-matching is particularly problematic during a zoom sequence because the area covered by the camera angle expands or contracts during a zoom. Finally, the motion of a large object in a camera shot proved to be a major source of error for camera pan detection.
3
Zoom Detection
A zoom operation is defined by an enlargement of the content of video frames towards the center of the zoom (zoom-in) or by a reduction (zoom-out). The overall size of the frames remains constant during this operation. A zoom-in is shown in Figure 1. The arrows specify that the image is the result of a zoom-in from an earlier image.
Fig. 1. Example of a zoom operation.
• The zoom factor is the factor by which an image has to be scaled to obtain the next image of a video.
3.1
Zoom Recognition Based on Spline Interpolation
The approach we propose uses splines to recognize that a zoom took place as well as to calculate a zoom factor. It is not our goal to calculate the self-calibration of a camera; we merely compute the zoom effect. The principal idea is that in any zoom-in or zoom-out sequence either an image A can be found as part of an image B, or an image B can be found in an image A, if A precedes B in time. The task should therefore be to use a mathematical approach to find one image in another. This can be achieved using the spline interpolation of images. Spline interpolation serves to interpolate a set of discrete values by a curve or by a surface in order to model a continuous function. To interpolate images, a variety of approaches to spline interpolation are well known, for example cubic polynomial splines, non-uniform rational B-splines (NURBS) or thin-plate splines. In our approach we use natural cubic splines. Two important advantages of the spline approach can be identified: First, no computation of motion vectors is needed. Motion
vectors themselves already contain a significant amount of error caused by camera, digitization and calculation artifacts, which also influences the subsequent zoom recognition process. Secondly, a precise zoom factor indicating the amount of enlargement or reduction in size of an image can be computed. The approach we propose is to transform two images A and B of a video sequence to a spline representation and then to stretch the spline representing image A in order to locate the spline representing image B, and vice versa. Both directions have to be tried, as it is unknown in advance whether a zoom-in or a zoom-out is to be found, and as it is not certain that one image is contained entirely in the other. By adapting a variable scaling factor, the real zoom factor can be computed. As the exact mathematical formulation of splines is beyond the scope of this paper, we first explain our method as applied to one-dimensional images, which simplifies the understanding of the underlying process. We then extend the algorithm to two-dimensional images. 3.1.1
Zoom Recognition in One-Dimensional Images
In the case of one-dimensional images, two arrays A and B can be used, containing the brightness values of the image pixels. Let B be an image which is the result of a zoom operation and thus some sort of transformation of image A (see Figure 2).
Fig. 2. Zoom operations.
Representing images A and B as splines, spline B can be located within spline A if image B is a video frame presented later in time than frame A (zoom-in operation). In the case of a zoom-out operation, a search for spline A in spline B will be successful; a zoom-out can thus be recognized by searching for image A in image B. A great advantage of this representation is that the search can be performed very efficiently, as only a limited set of spline values has to be compared and as the comparison is not bound to values lying on the pixel grid. The algorithm to compute the zoom factor of a pair of images A and B is shown in Figure 3.
Calculate splines representing images A and B.
Search pattern of spline B in spline A.
If pattern can be found then result := zoom-in.
Else search pattern of spline A in spline B.
  If pattern can be found then result := zoom-out.
  Else result := no zoom.
Fig. 3. Zoom recognition in 1D-images.
In a first step, the splines representing images A and B have to be calculated. A very efficient way to compare splines is to use the second derivative of the images, which can be computed by applying a Laplace operation. The use of the second derivative guarantees that uniform areas of the images as well as regions of constant gradient disappear, speeding up the calculation of the splines as well as the matching process between the images. As the Laplace operation calculates the second derivative, its zero values correspond to the turning points in the images if and only if the first derivative is nonlinear. Searching for turning points to match A and B, the following two cases are possible: image A is contained in image B entirely in a smaller form, or image A is only partially contained in image B due to fast camera panning. In the first case a potential zoom factor λ can be estimated using the following formula:

λ = (TP_{1,2} - TP_{1,1}) / (TP_{2,2} - TP_{2,1})

where the numerator involves the turning points (TP) of image A. After estimating λ, its validity has to be checked by picking x-values randomly and by verifying

spline_A(x) = d + spline_B(λx)
where d is the distance by which spline A is translated within spline B. If the verification fails, the next pair of turning points TP_{2,2} and TP_{2,1} is used to compare the images. This step is repeated until A has been located in B. It should be noted that the matching process is very fast, as only those values have to be compared where both splines are equal. As a result, a zoom factor greater than 1 denotes a zoom-out, while a value less than 1 denotes a zoom-in. In the second case, a verification can fail if the selected turning point of image A is no longer contained in image B due to fast camera panning. To solve this problem, a new pair of turning points has to be chosen in spline A, similar to the choice of new points in spline B. This is repeated until the spline has been located. If no match is found, we assume that no zoom is contained in the video.
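A Python/NumPy sketch of the 1D case, with our own simplifications: turning points are taken as zero crossings of a discrete second derivative rather than of an actual spline, only the first pair of turning points of each image is tried, and the verification samples the images on the pixel grid. The mapping of a position x in A to (x - TP_A,1)/λ + TP_B,1 in B is one reading of the verification step described above.

import numpy as np

def turning_points(signal):
    # Turning points approximated as zero crossings of the second
    # derivative (Laplacian) of a 1-D intensity profile.
    lap = np.diff(np.asarray(signal, dtype=float), n=2)
    s = np.sign(lap)
    return (np.where(s[:-1] * s[1:] < 0)[0] + 1).astype(float)

def estimate_zoom_factor(img_a, img_b, tol=10.0, n_checks=20, seed=0):
    # Estimate lambda from the first pair of turning points of each image
    # and verify it by random sampling; lambda > 1 suggests a zoom-out,
    # lambda < 1 a zoom-in, None means the zoom hypothesis was rejected.
    a = np.asarray(img_a, dtype=float)
    b = np.asarray(img_b, dtype=float)
    tp_a, tp_b = turning_points(a), turning_points(b)
    if len(tp_a) < 2 or len(tp_b) < 2:
        return None
    lam = (tp_a[1] - tp_a[0]) / (tp_b[1] - tp_b[0])
    rng = np.random.default_rng(seed)
    xs = rng.integers(0, len(a), n_checks)
    xb = (xs - tp_a[0]) / lam + tp_b[0]
    inside = (xb >= 0) & (xb <= len(b) - 1)
    if not inside.any():
        return None
    err = np.abs(a[xs[inside]] - b[xb[inside].astype(int)]).mean()
    return lam if err < tol else None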
Calculate splines representing images A and B.
SplineMatch := FALSE, result := undefined
select first turning point (TP) in A
while ((SplineMatch == FALSE) AND (turning points in A unvisited))
    select second turning point in A
    while ((SplineMatch == FALSE) AND (turning points in B unvisited))
        find turning point (TP) in B using zero crossings (second derivative)
        calculate zoom factor using second TP and evaluate result
            by testing spline equality in x- and y-direction
        if zoom factor can be calculated
            SplineMatch := TRUE
            result := zoom
        else
            select new base TP in B
    select new base TP in A
if (result == undefined)
    result := no zoom.
Fig. 4. Zoom recognition in 2D-images.
3.1.2
Zoom Recognition in Two-dimensional Images
The spline comparison is much more complex in the two-dimensional case. The search for spline equality based on rows or columns is a problem here, as rows or columns of image A are not only shrunk but also displaced towards the zoom center in image B in the case of a zoom-in. Moreover, the target coordinates of the displacement need not be integers. To solve this problem we propose the following algorithm, which is applied to only one direction (either x or y), while the complementary axis is used to validate the search results (see Figure 4). The search for turning points by localizing the zeros of the Laplace image cannot be performed by checking the zeros at the whole-numbered positions of the spline only, as zeros between those could be missed. To avoid this problem, zero crossings have to be examined: in the case of a change of sign between two values of the second derivative, a zero must have occurred between them. With this technique, turning points can also be localized which do not lie on the pixel grid. The algorithm first tries to match the images using turning points on the horizontal axis. In case of failure, the y-direction is examined. To speed up the algorithm, we start the localization process in image B with x- or y-values corresponding to those of the turning point in image A and extend the search radius as the algorithm continues its execution.
4
Experimental Results
We tested the spline algorithm with synthetic as well as real image sequences. The performance is defined as the number of recognized zooms in relation to the length of the sequence. Our experimental data contained sequences without camera panning and zooms but with object movement (control instance), sequences containing zooms only, sequences with zooms and camera panning, and sequences with zooms, camera panning and object movement. We also tested the robustness of the algorithm by adding artificial noise of variable strength to the source images, as video images very often include a certain amount of noise caused by camera and compression errors. In a first set of experiments we examined the performance of the algorithm applied to one-dimensional synthetic image sequences. The results are listed in Table 1.

Tab. 1. 1D zoom detection

Experiment | Zoom recognized [%] | False positive [%]
Sequences without camera operations | 0 | 0
Sequences with zoom only | 100 | 0
Sequences with zoom and camera panning | 100 | 0
Sequences with zoom, camera panning and strong object motion | 84.29 | 11.93
Fig. 5. 1D-zoom recognition in the presence of noise.
Fig. 6. Generation of synthetic images.
It is not surprising that a zoom can no longer be recognized in the presence of strong object movement. Experiments have shown that moving objects which are larger than 23 percent of the size of the total image cause a distortion which makes zoom
recognition impossible. This result was obtained by generating synthetic image sequences containing moving objects and increasing their size step by step. Surprisingly, the size of the zoom factor does not influence the recognition rate at all; we identified slow zooms as well as fast zooms with equal success. Figure 5 shows the impact of noise on the experimental results. These experiments were conducted using sequences containing zooms and camera panning. It is not surprising that a significant amount of noise lowers the performance of the algorithm. A similar behavior can be observed when applying the algorithm to two-dimensional images. The results are slightly worse, which can be explained by the efficiency of the localization, which only approximates the exact location of the spline. The results for zoom detection using 2D images are shown in Table 2. Tab. 2. Results of 2D-zoom recognition
Experiment | Zoom recognized [%] | False positive [%]
Sequences without camera operations with object movement | 2.19 | 1.1
Sequences with zoom only | 98.03 | 0
Sequences with zoom and camera panning | 94.77 | 3.11
Sequences with zoom, camera panning and strong object motion | 82.97 | 15.33

We tested the performance of the spline algorithm with 50 videos consisting of 1000 frames each. Concerning the synthetic videos, we adjusted the value of the zoom factor according to a periodic, sine-based function, resulting in zoom factors traversing the interval [0; 2a] periodically. By adjusting a and t we were able to create different zoom variations within the synthetic sequence. For a zoom-in we constructed the synthetic images in such a way that we scaled up an image i by the zoom factor and cut out the edge regions in order to obtain the new image (i+1), equal in size to image i (see Figure 6). As all of our experimental sequences started with a zoom-in, we buffered the images of the zoom-in and played them in reverse order to obtain the respective zoom-out. This way we were able to produce sequences consisting of periodic zoom-ins and zoom-outs. The zoom factor can be adjusted by varying the argument of the sine function, thus creating slower or faster zooms. The scaling of the synthetic zooms can furthermore be adjusted using the parameter a. The mean experimental values we measured with the synthetic clips and with the real-world clips are shown in Table 2. One result of the experiments is that errors occurring during the zoom recognition process are mainly due to noise in the images. This can in particular be shown by applying some sort of smoothing, for example a Gaussian filter, before calculating the splines; the efficiency can then be further improved. It turns out that zoom recognition using splines outperforms the analysis of motion vector fields (both optical flow and MPEG vectors). Errors caused by noise in the original images and by object motion lower the performance of the algorithm. Considering object motion, however, it should be possible to find regions in an image where the distortion caused by the moving object is
rather small. These areas can then be used to increase the performance of the algorithm.
5
Conclusions and Outlook
In this paper we have proposed a new method to recognize zooms automatically, which also calculates a precise scaling factor between successive frames of a video sequence. In the experimental section we have shown that the algorithm has a reliability greater than 80 percent. Although the localization algorithm is quite robust, we are currently considering an integration of our algorithm with algorithms that compute the camera calibration. Another problem faced by any algorithm that recognizes zooms is object movement. In another set of experiments we are currently examining whether the spline localization can be performed on selected parts of an image where no object movement takes place.
References
[BBBB93] R.H. Bartels, J.C. Beatty, K.S. Booth, E.G. Bosch, and P. Jolicoeur. Experimental comparison of splines using the shape-matching paradigm. ACM Trans. Graph., 12(3), pp. 179-208, 1993.
[BR96] J. S. Boreczky and L. A. Rowe. A comparison of video shot boundary detection techniques. Journal of Electronic Imaging, 5(2), pp. 122-128, 1996.
[BX94] C. Bajaj and G. Xu. NURBS approximation of surface/surface intersection curves. Adv. Comput. Math., 2(1), pp. 1-21, 1994.
[BFB84] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1), 1984.
[Fis97] S. Fischer. Feature combination for content-based analysis of digital film. PhD thesis, University of Mannheim, 1997.
[FLM92] O. Faugeras, Q.-T. Luong, and S. Maybank. Camera self-calibration: theory and experiments. Proc. ECCV, 1992.
[HA97] A. Heyden and K. Aström. Euclidean reconstruction from image sequences with varying and unknown focal length and principal point. Proc. CVPR, 1997.
[Hoe89] M. Hötter. Differential estimation of the global motion parameters zoom and pan. Signal Processing, 16, pp. 249-265, 1989.
[ISM93] K. Illgner, C. Stiller, and F. Müller. A robust zoom and pan estimation technique. Proceedings of the International Picture Coding Symposium PCS'93, Lausanne, Switzerland, 1993.
[LK81] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. DARPA IU Workshop, pp. 121-130, 1981.
[PKG98] M. Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. Proc. of ICCV, 1998.
[TB91] Y.T. Tse and R.L. Baker. Global zoom/pan estimation and compensation for video compression. Proceedings ICASSP, pp. 2725-2728, 1991.
[ZGST94] H. Zhang, Y. Gong, S.W. Smoliar, and S.Y. Tan. Automatic parsing of news video. Proceedings of IEEE Conf. on Multimedia Computing and Systems, 1994.
A Region Tracking Method with Failure Detection for an Interactive Video Indexing Environment Marc Gelgon, Patrick Bouthemy, and Thierry Dubois IRISA/INRIA Campus universitaire de Beaulieu 35042 Rennes cedex, France [email protected]
Abstract. In this paper, we address the problem of structuring video documents into meaningful spatio-temporal entities, involving the tracking of spatio-temporal zones of interest. We present a new technique for tracking over time a region in the image which is initially user-delimited. The tracking method is based on the robust estimation of a 2D affine motion. In order to facilitate the task of the user, the scheme includes the original idea of an automatic tracking failure detection procedure which exploits at low cost information related to the estimated motion model. Experimental results are provided on excerpts of a real movie.
1
Introduction
In this paper, we address the problem of structuring video documents into meaningful spatio-temporal entities, and focus on the issue of tracking given objects of interest within shots. These entities are the shots and spatio-temporal zones of interest within each shot. Such an analysis of a video document is an essential step for accessing (video browsing and indexing) and for manipulating (video editing) the video content. A survey of such applications of video analysis is presented in [5]. The topics revolving around extraction of zones of interest for enhanced browsing and exploitation of video documents include the generation of video summaries [8,11] and so-called "hypervideo links" that can be established between different instances of the same object in one or several documents [2]. From the spatial and temporal inter-relations between extracted objects, scenario recognition can be achieved [6] with a view to indexing. Partitioning the video into shots is generally the first analysis step. This task has received wide attention. Exploitable results can be obtained in a reasonable amount of time, and sometimes directly from the MPEG bit stream [18]. However, as reported in a comparative study [7], effective detection of progressive transitions is a stumbling block for many approaches, and hence remains the subject of current and recent work [4,17]. Structuring a shot into meaningful spatio-temporal zones (also called Video Objects in MPEG-4 terminology) is
also a much studied issue, both by the video analysis and video coding communities. The use of motion, possibly with the assistance of intensity or color, as an objective criterion for extracting "zones of interest" in an automatic manner is widespread. It is indeed meaningful in many situations, notably with the "surveillance scene" type involving moving people or vehicles. Recent advances in this field can for example be found in [10,15]. Still, there are limitations to the use of motion as a zone-of-interest delimiter. First, by its principle, motion cannot naturally capture all "interesting zones" (e.g. faces), or it induces undesirable results (a moving vehicle will be extracted with its moving shadow). Then, in practice, even if motion is a relevant criterion, the task at hand is often complex with general video contents. One has to perform either motion detection or segmentation, to model the apparent motion of objects that may be articulated or may change their attitude, to cope with the so-called aperture problem and parallax-related issues, and so forth. Automatic motion-based segmentation remains a computationally difficult and expensive process, even if used for some off-line applications related to video indexing as in [8]. Truly, there is also a need, in the context of an interactive general-purpose video indexing system, for assistance from the user to achieve an interactive segmentation, and conversely, a real need from video users for assistance from the system in tracking spatio-temporal zones of interest along a shot. Indeed, we focus in this paper on temporal tracking of a given zone. This zone can be delimited by the user in a first frame. It can also be extracted in an automatic way, then manually pointed out in the obtained segmentation map. The primitive actually tracked could be an intensity contour delimiting this zone [12], a few points of interest [16] or the whole information contained in the region. Tracking an initially user-defined zone with deformable meshes has been proposed in [9], and through the intensity-homogeneous or color-homogeneous zones which compose it in [13]. Besides, tracking techniques are bound to fail occasionally in difficult situations. Hence, it is crucial and helpful, in interactive applications, that the zone tracker can automatically warn the user of a failure in tracking, in order to hand control back to the user to re-define the zone in the frame to be tracked. The automatic tracking can then restart from this re-initialization. Such an alarm feature can improve user comfort and work efficiency. In this paper, we propose, after a first step for automatically partitioning a video into shots, a method for tracking in nearly real time an initially user-defined or user-selected region, including a tracking failure detection technique. These steps exploit information related to the apparent motion measured in the image. More precisely, we make use of a robust multiresolution estimation technique of a 2D affine motion model, and of the associated estimation support. The remainder of this paper is organized as follows. Section 2 first overviews the dominant affine motion model estimation technique employed, and how it is applied to automatic partitioning of the video into shots. Section 3 deals with the tracking of zones. We illustrate these techniques through experimental results and provide concluding remarks (Section 4).
2
Dominant Motion Estimation
The tracking scheme described in this paper exploits motion and motion-related information. For use in the successive video structuring steps, we are interested in measuring the apparent (2D) motion between two successive images, either globally in the image, for partitioning the video into shots, or on a region within the image, in order to track this region. In either case, we represent the apparent motion field at hand by a 2D parametric (affine) model. Since several motions may be present in the estimation zone S, we only seek the estimation of the dominant one. To estimate the dominant motion without prior motion segmentation, we employ a technique based on robust statistics. We outline here the method employed for estimating a 2D parametric motion model, further detailed in [14]. This method takes advantage of a multiresolution framework and an incremental scheme. It minimizes an M-estimator criterion to ensure robustness to the outliers formed by the points corresponding to secondary motions or to areas where the image motion equation is not valid. We have chosen the affine motion model w_Θ defined at point p = (x, y), considering a reference point (x_g, y_g), by:

w_Θ(p) = ( a1 + a2 (x - x_g) + a3 (y - y_g) ,  a4 + a5 (x - x_g) + a6 (y - y_g) )^T    (1)

This motion model is a good trade-off between complexity and representativity. In practice, the image centre is taken as reference point in the case of global motion estimation in the image, and the region centre is selected in the case of region-based motion estimation. The parameter vector Θ = (a1, a2, a3, a4, a5, a6) is estimated between images I(t) and I(t + 1) in an incremental way. The method for detecting shot changes uses affine motion models accounting for the global dominant image motion between successive images [3]. The evolution of the size of the associated estimation support enables the detection of both cuts and progressive transitions within the same framework and with a single parametrization. This technique is also fairly resilient to difficult situations such as the presence of large mobile objects and global illumination changes. We refer the reader to [3,4] for further detail on the technique.
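To make this concrete, here is a single-resolution Python/NumPy sketch of a robust affine motion estimator using iteratively re-weighted least squares with Tukey's biweight function. It omits the multiresolution and incremental warping aspects of the actual method [14]; the gradient inputs, the Tukey scale c and the support threshold are our own assumptions.

import numpy as np

def tukey_weights(r, c):
    # Weights w = psi(r) / r for Tukey's biweight M-estimator.
    w = np.zeros_like(r)
    inside = np.abs(r) <= c
    w[inside] = (1.0 - (r[inside] / c) ** 2) ** 2
    return w

def estimate_affine_motion(Ix, Iy, It, xs, ys, n_iter=10, c=20.0):
    # Ix, Iy: spatial gradients; It: temporal difference; xs, ys:
    # coordinates relative to the reference point (e.g. the region centre),
    # all flattened over the estimation zone. Solves for theta = (a1..a6)
    # minimizing a robust sum of the linearized brightness-constancy
    # residuals r = Ix*u + Iy*v + It, with u = a1 + a2*x + a3*y and
    # v = a4 + a5*x + a6*y.
    A = np.stack([Ix, Ix * xs, Ix * ys, Iy, Iy * xs, Iy * ys], axis=1).astype(float)
    b = -np.asarray(It, dtype=float)
    w = np.ones_like(b)
    theta = np.zeros(6)
    for _ in range(n_iter):
        Aw = A * w[:, None]                     # weighted design matrix
        theta, *_ = np.linalg.lstsq(Aw.T @ A, Aw.T @ b, rcond=None)
        r = A @ theta - b                       # residuals of the constraint
        w = tukey_weights(r, c)
    support = w >= 0.2                          # estimation support (inliers)
    return theta, support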
3 Zone Tracking with Automatic Failure Detection
3.1 Tracking
Given the video automatically divided into shots, the user can now define objects he wishes to track. Let a user-defined polygon delimit at time t the zone to be tracked. An affine motion model is estimated over this region using the multiresolution robust estimator outlined in Section 2 (other 2D parametric motion models could be handled as well). We then apply to each of the vertices of the polygon the motion vector provided at this vertex by the estimated affine motion model. The set of projected vertices forms the estimate of the tracked region at the next time instant. The use of a robust motion estimator enables the proposed technique to cope with a variety of challenging situations.
These include possible shadow effects or changes in illumination, partial occlusion, and the fact that an affine model may not reflect well enough the real 2D geometric transform. First, the robustness of the estimator makes the tracking resilient to such difficulties. Then, should a slight error nevertheless occur, motion model estimation is quite robust to pixels unduly included in the tracked zone. In other words, the region can be tracked correctly despite a degradation in the accuracy of the region boundary. This processing is repeated over successive pairs of images within the shot. Besides, let us point out that this tracking module can be run from any image in the shot, either in a forward or in a backward mode (if possible). A classical alternative way of tracking would be to perform intensity correlation. To assess the comparison with our technique, let us point out that our method estimates a full affine motion model with high accuracy. Such a motion model is quite important to account sufficiently for region boundary deformation in the image due to the 3D motion of the object, and hence to perform effective tracking. We think it would be quite difficult to carry this out as robustly and efficiently with a correlation matching technique as is done here. Moreover, the robust motion estimation step described above can be performed in nearly real time.
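The vertex-propagation step itself is straightforward; a minimal NumPy sketch, with our own function and variable names, is given below.

import numpy as np

def propagate_polygon(vertices, theta, center):
    # Apply the estimated affine motion model to each polygon vertex to
    # predict the tracked region in the next frame.
    # vertices: (V, 2) array of (x, y); theta = (a1..a6); center = (xg, yg).
    a1, a2, a3, a4, a5, a6 = theta
    dx = vertices[:, 0] - center[0]
    dy = vertices[:, 1] - center[1]
    u = a1 + a2 * dx + a3 * dy      # horizontal displacement at each vertex
    v = a4 + a5 * dx + a6 * dy      # vertical displacement at each vertex
    return vertices + np.stack([u, v], axis=1)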
3.2 Tracking Failure Detection
We describe here how a severe inaccuracy of the region boundary estimate can be detected in most cases. This is particularly helpful to ensure satisfactory extraction of the zone of interest over the whole shot. The motion model is in fact estimated iteratively, as the cumulated value of parameter vector increments ΔΘ_n. Denoting by Θ_n the current estimate, each increment is obtained as

ΔΘ_n = arg min_{ΔΘ} Σ_{p_i ∈ S} ρ(r_i)    (2)

with r_i = I(p_i + w_{Θ_n}(p_i), t + 1) - I(p_i, t) + ∇I(p_i + w_{Θ_n}(p_i), t + 1) · w_{ΔΘ}(p_i),

where ρ(x) is a hard-redescending M-estimator. Solving for these increments amounts to the following minimization problem:

ΔΘ_n = arg min_{ΔΘ} Σ_{p_i} (1/2) w_i r_i²   with   w_i = ψ(r_i) / r_i    (3)
where ψ is the derivative of the ρ function corresponding to the M-estimator. As explained in [14], the minimization problem defined in relation (2) is solved using an IRLS (Iteratively Re-weighted Least Squares) technique. Once the dominant motion estimation step is completed, the final value of the quantity w_i indicates whether or not a point p_i is likely to belong to the part of the image undergoing this dominant motion: in the former case, w_i is close or equal to 1; in the latter case, w_i is equal or close to 0 (outliers). We define the support of the dominant motion as the set of points p_i satisfying w_i ≥ ν, where ν is a predefined threshold (typically 0.2). The pertinent information for the following is in fact the size, denoted ζ_t, of this support (we actually use a version of this size normalized within [0,1]). The principle of the proposed technique is that, as long as the object tracking is correct, most pixels within the estimated region (polygon transformed over time) should be found to conform to the estimated dominant motion model,
and thus the value of ζ_t should be close to 1. Should the region boundary suddenly become significantly erroneously estimated, it will generally happen that more pixels attached to a motion different from that of the tracked region (e.g. background motion) are included in the estimation zone and rejected as outliers by the motion estimator. This induces a strong decrease in ζ_t. The estimated boundary of the tracked region may also drift only slowly over time; in such a case, a slow decrease of ζ_t occurs. The use of Hinkley's test [1] enables the detection of downward jumps of the variable ζ_t. Since this test is cumulative and accounts for all past values of the studied variable, sudden or more progressive decreases can be detected with the same criterion and the same threshold parameter values. Besides, in the case of a progressive jump, the beginning of this jump can be accurately located in time. Implementation of this test is simple and its computational cost is very low. Let us denote by m_0 the mean of ζ_t before a jump. This mean value is estimated on-line and re-estimated after each jump detection. δ_min denotes the minimal magnitude of jump we wish to detect. The following computation and test are carried out over time:

T_k = Σ_{t=0}^{k} ( ζ_t - m_0 + δ_min / 2 )    (k ≥ 0)
M_k = max_{0 ≤ i ≤ k} T_i ;   a jump is detected if M_k - T_k > α
α is a user-set threshold. We locate the beginning of the jump at the time following the last instant for which $M_k - T_k = 0$, i.e., the beginning of a significant downward evolution of the studied variable. When a failure is detected, the beginning of the failure is automatically located in time and the user is requested to redefine the region in this frame. Motion-related information is an effective means of detecting progressive drifts. Indeed, the content of the region in terms of intensity pattern must be allowed to evolve slowly over time, to tolerate changes in illumination and object attitude. Although the tracking technique works for tracking zones that belong to the background scene, tracking failure detection does not work in that case, since the motion within and surrounding the region is the same.
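As an illustration of the failure-detection step, the following minimal Python sketch (not the authors' implementation; the parameter values, the online estimate of m0 and the reset policy after a detection are assumptions) applies Hinkley's cumulative test to a sequence of normalized support sizes ζt and returns, for each detected downward jump, the detection time and the estimated beginning of the jump.

```python
def hinkley_downward_jumps(zeta, delta_min=0.2, alpha=0.5):
    """Detect downward jumps of the support size zeta_t with Hinkley's test.

    zeta      : sequence of normalized support sizes in [0, 1]
    delta_min : minimal jump magnitude we wish to detect
    alpha     : user-set detection threshold
    Returns a list of (detection_time, jump_start_time) pairs.
    """
    detections = []
    T, M = 0.0, 0.0        # cumulative sum T_k and its running maximum M_k
    m0, n = zeta[0], 1     # online estimate of the mean of zeta before a jump
    start = 0              # last instant for which M_k - T_k == 0
    for k, z in enumerate(zeta):
        m0 += (z - m0) / n # running mean, re-initialized after each detection
        n += 1
        T += z - m0 + delta_min / 2.0
        if T >= M:
            M, start = T, k
        if M - T > alpha:  # downward jump detected at time k
            detections.append((k, start + 1))
            T, M, m0, n, start = 0.0, 0.0, z, 1, k
    return detections

# Toy example: the track degrades around frame 55.
print(hinkley_downward_jumps([0.95] * 55 + [0.4] * 10))
```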
4 Experimental Results and Concluding Remarks
The proposed scheme has been applied to a part of an Avengers TV movie. The following results illustrate region tracking applied to zones of interest over a whole shot. All but the first come from the Avengers excerpt. In the "Irisa" sequence (Fig. 2a), a zone corresponding to a building is selected. Tracking performs well throughout a zoom effect and despite strong occlusion. In the "Cliff Road" sequence (Fig. 2b), a car is tracked. Camera and car motion are complex, but the region boundary is rather well maintained over this long sequence, in spite of the change in attitude, camera-to-object distance and strong
illumination changes. The head of the person is selected in the "Emma" sequence (Fig. 2c). The 2D dynamic content of this shot is complex, and an interactive approach is unavoidable to extract and track such a subjective element as the head. Though the inter-frame motion is strong, satisfactory results are obtained. The last shot (Fig. 2d) shows a car entering a pond. The complex forward motion of this car is quite well accounted for, and the water splashes do not severely affect motion measurement.
Fig. 1. Temporal evolution of ζt is plotted above; the three pictures below it are zooms on the tracked object at frames (a) 1, (b) 55, and (c) 56
Fig. 1 shows an example of failure detection, applied to the "Cliff Road" sequence, where this time the white car is tracked. The temporal evolution of ζt provides an example of tracking failure detection (strong decrease of ζt) corresponding to incorrect motion estimation between frames 55 and 56. Tracking failure detection has been shown to perform well for most objects we have attempted to track in a variety of sequences, with few false alarms. Let us point out that it is not necessary to be able to track over very long image sequences, but only over the length of a shot. Parameter α is tuned so as to favour false alarms over missed failures. With the current version, tracking time is about 0.5 sec. per frame, but it should decrease to approximately 0.2 sec. in a further implementation. We have presented a new technique for tracking an image region within a shot in nearly real time. It relies on robust motion estimation and involves the original idea of including a tracking failure detection method which exploits, at low cost, information related to the estimated motion model. Both tracking and failure detection have proven effective on many real sequences, including movies and adverts.
Fig. 2. Region tracking on several sequences: the first image of the series shows the manually defined boundary: (a) "Irisa" sequence at times 1, 12 and 116; (b) "Cliff Road" sequence at times 1, 15, 30, 45, 75, 95; (c) "Emma" sequence at times 1, 20, 44; (d) "Car in pond" sequence at times 1, 10, 17. Sequences (b, c, d) are shots of the Avengers TV movie.
Acknowledgments: This work is funded in part by Alcatel CRC, Marcoussis, France, and by AFIRST (Association Franco-Israelienne pour la Recherche Scientifique). The authors would like to thank INA (Institut National de l'Audiovisuel, Département Innovation) for providing the MPEG-1 Avengers sequence.
References
1. M. Basseville. Detecting changes in signals and systems – a survey. Automatica, 24(3):309–326, 1988.
2. S. Benayoun, H. Bernard, P. Bertolino, P. Bouthemy, M. Gelgon, R. Mohr, C. Schmid and F. Spindler. Structuring video documents for advanced interfaces. ACM Multimedia Conference, Demo session, Bristol, Sept. 1998.
3. P. Bouthemy and F. Ganansia. Video partitioning and camera motion characterization for content-based video indexing. In Proc. of 3rd IEEE Int. Conf. on Image Processing, volume I, pages 905–909, Lausanne, September 1996.
4. P. Bouthemy, M. Gelgon, and F. Ganansia. A unified approach to shot change detection and camera motion characterization. INRIA Report 3304, Nov. 1997.
5. P. Correia and F. Pereira. The role of analysis in content-based video coding and indexing. Signal Processing, (66):125–142, 1998.
6. J.D. Courtney. Automatic video indexing via object motion analysis. Pattern Recognition, 30(4):607–625, April 1997.
7. U. Gargi, R. Kasturi, and S. Antani. Performance characterization and comparison of video indexing algorithms. In Conf. on Computer Vision and Pattern Recognition (CVPR'98), pages 559–565, Santa Barbara, June 1998.
8. M. Gelgon and P. Bouthemy. Determining a structured spatio-temporal representation of video content for efficient visualisation and indexing. In 5th European Conference on Computer Vision (ECCV'98), Freiburg, June 1998.
9. B. Günsel, A.M. Tekalp, and P. van Beek. Content-based access to video: temporal segmentation, visual summarization and feature extraction. Signal Processing, (66):261–280, 1998.
10. M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(6):577–589, June 1998.
11. M. Irani and P. Anandan. Video indexing based on mosaic representations. Proc. of IEEE, 86(5):905–921, May 1998.
12. M. Isard and A. Blake. Contour tracking for stochastic propagation of conditional density. In 4th European Conf. on Computer Vision, Cambridge, UK, April 1996. LNCS 1065, Springer Verlag.
13. F. Marques and C. Molina. Object tracking for content-based functionalities. In SPIE Visual Communication and Image Processing (VCIP-97), volume 3024, pages 190–198, San Jose, 1997.
14. J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348–365, December 1995.
15. J.-M. Odobez and P. Bouthemy. Direct incremental model-based image motion segmentation for video analysis. Signal Processing, 66(3):143–156, May 1998.
16. Y. Rosenberg and M. Werman. Representing local motion as a probability distribution matrix and object tracking. In Proc. of Conf. on Computer Vision and Pattern Recognition, Puerto Rico, June 1997.
17. W. Xiong and J.C.M. Lee. Efficient scene change detection and camera motion annotation for video classification. Computer Vision and Image Understanding, 71(2):166–181, August 1998.
18. B.L. Yeo and B. Liu. Rapid scene analysis on compressed video. IEEE Trans. on Circuits and Systems for Video Technology, 6(5):533–544, December 1995.
Integrated Parsing of Compressed Video

Suchendra M. Bhandarkar (1), Yashodhan S. Warke (1), and Aparna A. Khombhadia (2)

(1) Department of Computer Science, The University of Georgia, Athens, Georgia 30602-7404, USA
(2) Netscape Communications Corporation, 501 E. Middlefield Rd, Bldg 14, MV-093, Mountain View, California 94043-4042, USA
Abstract. A technique for detecting scene changes in compressed video streams is proposed which combines multiple modes of information. The proposed technique directly exploits and combines the luminance, chrominance and motion compensation information and the prediction error signal in an MPEG1-coded video stream. By performing minimal decoding of the compressed video stream, the proposed technique achieves significant savings in terms of execution time and memory usage. The technique is capable of detecting abrupt scene changes (cuts), gradual scene changes (dissolves), and dominant camera motion in the form of pans and zooms in an MPEG1-coded video stream. Experimental results show that combining multiple modes of information is more effective in detecting cuts and dissolves than relying on any single mode. The proposed technique is capable of processing video frames in real time and could be used for the rapid generation of key frames for real-time browsing of video streams and for indexing to support content-based access to video libraries.
1 Introduction
Video parsing or scene change detection is typically used to extract key frames from a video stream. These key frames can then be used for rapid video browsing and automatic annotation and indexing of video streams to support content-based query access to a video database. The video parsing operation is primarily domain-independent and therefore is a crucial first step that precedes domain-dependent analysis of the video [9]. Video parsing techniques that operate upon compressed video directly have a considerable advantage in terms of execution time and memory usage. Techniques that directly parse MPEG1-coded video typically use the histograms of the DC images derived from the DCT coefficients of the I frames [6], the variance of the DC coefficients in the I and P frames [2] or the proportion of macroblocks with valid motion vectors in the P and B frames [4,9] to detect abrupt scene changes. Yeo and Liu [8] present algorithms for detecting abrupt and gradual scene changes, intrashot variations and flashlight scenes in MPEG1-coded and MJPEG-coded video data using DC images. Bhandarkar and Khombadia [1] present a technique for the detection of abrupt and gradual scene
changes that exploits motion compensation information and the prediction error in the MPEG1-coded video stream. However, video parsing techniques that rely on only one source or mode of information are seen to suffer from certain significant shortcomings. For example, techniques that rely only on chrominance and/or luminance values (or their average values in the DC images) are prone to misses when there is little change in background color or luminance between successive video shots, and to false positives when there is a change in background color or background luminance due to a change in ambient lighting within a single shot. The relative motion of objects between successive frames is more effective in detecting such scene changes. Moreover, certain types of scene changes such as pans, zooms, etc. cannot be detected reliably with luminance and/or chrominance information alone. Using motion information in isolation also has its limitations. In MPEG1-coded video, I frames do not contain motion compensation information. Consequently, scene changes occurring at I frames cannot be detected if the video parsing technique relies solely on motion compensation information. The same is true of MJPEG video, which consists solely of I frames. In this paper, we present a novel and efficient approach to video parsing that integrates luminance, chrominance and motion information, all of which are computed directly from an MPEG1-coded (compressed) video stream. It is shown that by appropriately combining luminance, chrominance and motion information, one can design a more accurate and robust video parsing technique. Since the proposed technique entails minimal decompression of the compressed video, it is capable of parsing video in real-time (i.e., at rates in excess of 30 frames/sec).
2 Description of MPEG1 Video
MPEG1 video compression [3] relies on two basic techniques: block-based motion compensation for the reduction of temporal redundancy and Discrete Cosine Transform (DCT)-based compression for the reduction of spatial redundancy. The motion information is computed using 16 × 16 pixel blocks (called macroblocks) and is transmitted with the spatial information. The MPEG1 video stream consists of three types of frames: intra-coded (I) frames, predictive-coded (P) frames and bidirectionally-coded (B) frames. I frames use only DCT-based compression with no motion compensation. P frames use previous I frames for motion encoding whereas B frames may use both previous and future I or P frames for motion encoding. An I frame in MPEG1-coded video is decomposed into 8 × 8 pixel blocks and the DCT is computed for each block. The DC term of the block DCT, termed DCT(0,0), is given by $\frac{1}{8}\sum_{i=0}^{7}\sum_{j=0}^{7} I_B(i,j)$, where $I_B$ is the image block. Operations of a global nature performed on the original image can be performed on the DC image [6,8,9] without significant deterioration in the final results. For P and B frames in MPEG1-coded video, motion vectors (MVs) are defined for each 16 × 16 pixel region of the image, called a macroblock. P frames have macroblocks that are motion compensated with respect to an I frame in the
immediate past. These macroblocks are deemed to have a forward predicted MV (FPMV). B frames have macroblocks that are motion compensated with respect to either a reference frame in the immediate past, one in the immediate future, or both. Such macroblocks are said to have an FPMV, a backward predicted MV (BPMV), or both, respectively. The prediction error signal is compressed using the DCT and transmitted in the form of 16 × 16 pixel macroblocks.
3 Scene Change Detection in MPEG1-Coded Video
Our technique is currently designed to detect two important types of scene changes in MPEG1-coded video streams: cuts and dissolves. A cut in a video stream is characterized by an abrupt scene change. A dissolve in a video stream is characterized by a gradual scene transition in which the present (outgoing) scene gradually fades out while the next (incoming) scene gradually fades in. Our technique is also capable of detecting dominant camera motion during zooms and pans and classifying the corresponding scenes as such.
3.1 Detection of Cuts Using DC Images
Let Xi, i = 1, 2, ..., N be a sequence of DC images. The difference sequence Di, i = 1, 2, ..., N − 1, is generated, where Di = d(Xi, Xi+1). Each element of Di is the difference of the cumulative DC values of two successive images. A scene cut is deemed to occur between frames Xl and Xl+1 if and only if (1) the difference Dl is the maximum within a symmetric sliding window of size 2m − 1, i.e., Dl ≥ Dj, j = l − m + 1, ..., l − 1, l + 1, ..., l + m − 1, and (2) Dl is at least n times the magnitude of the second largest maximum in the sliding window. The parameter m is set to be smaller than the minimum expected duration between scene changes [8]. Since the DC images have a luminance component and two chrominance components, the difference measure is the weighted sum of the absolute values of the individual component differences.
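A minimal sketch of this sliding-window criterion is given below (Python; the window half-size m, the peak ratio n and the input difference values are illustrative, not taken from the paper).

```python
def detect_cuts(dc_diffs, m=8, n=3.0):
    """Detect scene cuts from the weighted DC-image difference sequence D_i.

    A cut is declared at index l when D_l is the maximum of the symmetric
    window of size 2m - 1 centred at l and at least n times the second
    largest value in that window.
    """
    cuts = []
    for l in range(m - 1, len(dc_diffs) - m + 1):
        window = dc_diffs[l - m + 1:l + m]      # 2m - 1 values
        d_l = dc_diffs[l]
        if d_l < max(window):
            continue
        second = sorted(window, reverse=True)[1]
        if second == 0 or d_l >= n * second:
            cuts.append(l)
    return cuts

# Toy example: a single sharp peak at index 10 is reported as a cut.
diffs = [1.0] * 21
diffs[10] = 10.0
print(detect_cuts(diffs))    # -> [10]
```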
3.2 Detection of Cuts Using Motion Information
Since a cut in a video stream is characterized by an abrupt scene change, a typical motion estimator will need to intracode almost all the macroblocks at a cut. Thus, by computing the extent of motion or motion distance (MD) within each macroblock in the current frame relative to the corresponding macroblock in the previous frame, and setting it to a very high value if such correspondence cannot be determined, scene cuts can be detected. The computation of MD is complicated by the fact that the corresponding macroblocks of successive frames may have MVs based on different reference frames. As an example, consider Fig. 1. Let the macroblock under consideration be (i, j) for frames #1 and #2. Also let MV1 = (u1, v1) be the FPMV of the (i, j)th macroblock of frame #1, and MV2 = (u2, v2) be the BPMV of
Fig. 1. Motion Distance Computation: Possible scenario
Fig. 2. Motion Distance Computation: Two-level indirection
the (i, j)th macroblock of frame #2. Here the (i, j)th macroblock of frame #2 has a BPMV based on the (i, j)th macroblock of frame #3. However, the corresponding macroblock in frame #1 has an FPMV based on the (i, j)th macroblock of the reference frame #0. In order to calculate the MD, or the motion in frame #2 relative to frame #1, we need to find the MVs for the corresponding macroblocks of successive frames based on the same reference frame. Thus, we require a two-level indirection for the (i, j)th macroblock of frame #2 to obtain its MV based on frame #0. This two-level indirection is shown using dotted arrows in Fig. 2. The order of computation is indicated by the number on the arrows. This two-level indirection yields MV4 = (u4, v4), which is the MV of the (i, j)th macroblock of frame #2 based on the reference frame #0. In the above case, the MD can be computed as $MD = \sqrt{(u_4 - u_1)^2 + (v_4 - v_1)^2}$. Since the computation of MD involves floating point arithmetic, we approximated it by $MD_a$, the computation of which entails only integer arithmetic: $MD_a = |u_4 - u_1| + |v_4 - v_1|$. In general, for two consecutive frames Prev and Curr the corresponding macroblocks may have an FPMV, a BPMV, both or none. Thus, depending on the type of MVs the macroblocks possess, sixteen cases need to be dealt with. We refer the interested reader to [1] for a detailed explanation of how MD is computed in each of these cases. In order to detect cuts, we calculate the motion edge magnitude for each frame, which is defined as the sum of the MD's (or MDa's) of all the macroblocks in the frame. We detect scene cuts by thresholding the peaks within a sliding window of length 2m + 1 frames in the plot of the motion edge magnitude versus frame number, using the same criteria discussed in Section 3.1.
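The following Python sketch illustrates the integer approximation MDa and the per-frame motion edge magnitude. It assumes that the motion vectors have already been brought to a common reference frame (the two-level indirection of Fig. 2) and uses an arbitrary penalty for macroblocks without a usable correspondence; both the data layout and the penalty value are assumptions, not part of the paper.

```python
def md_approx(mv_curr, mv_prev):
    """Integer approximation MDa = |u4 - u1| + |v4 - v1| of the motion distance,
    for two MVs already expressed with respect to the same reference frame."""
    (u4, v4), (u1, v1) = mv_curr, mv_prev
    return abs(u4 - u1) + abs(v4 - v1)

def motion_edge_magnitude(curr_mvs, prev_mvs, no_match_penalty=64):
    """Sum of MDa over all macroblocks of a frame.

    curr_mvs / prev_mvs map a macroblock position (i, j) to a vector (u, v)
    relative to the common reference frame, or to None when the macroblock is
    intracoded and no correspondence can be established."""
    total = 0
    for pos, mv_curr in curr_mvs.items():
        mv_prev = prev_mvs.get(pos)
        if mv_curr is None or mv_prev is None:
            total += no_match_penalty      # "very high value" for missing MVs
        else:
            total += md_approx(mv_curr, mv_prev)
    return total

# Toy example with two macroblocks.
print(motion_edge_magnitude({(0, 0): (3, 1), (0, 1): None},
                            {(0, 0): (1, 1), (0, 1): (0, 0)}))   # -> 2 + 64 = 66
```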
3.3 Detection of Gradual Scene Changes Using DC Images
Fig. 3. Gradual Transition and Difference Sequence
A gradual scene transition is modeled as a linear transition from c1 to c2 in the time interval [a1, a2] (Fig. 3) [8]:

$$g_n = \begin{cases} c_1, & n < a_1, \\ \dfrac{c_2 - c_1}{a_2 - a_1}\,(n - a_2) + c_2, & a_1 \le n < a_2, \\ c_2, & n \ge a_2 \end{cases} \qquad (1)$$

Assuming that $k > a_2 - a_1$, the difference sequence $D_i^k(g_n) = d(X_i, X_{i+k})$ is given by [8]:

$$D_i^k(g_n) = \begin{cases} 0, & n < a_1 - k, \\ \dfrac{|c_2 - c_1|}{a_2 - a_1}\,[\,n - (a_1 - k)\,], & a_1 - k \le n < a_2 - k, \\ |c_2 - c_1|, & a_2 - k \le n < a_1, \\ |c_2 - c_1| - \dfrac{|c_2 - c_1|}{a_2 - a_1}\,(n - a_1), & a_1 \le n < a_2, \\ 0, & n \ge a_2 \end{cases} \qquad (2)$$

The plots of $g_n$ and $D_i^k(g_n)$ are shown in Fig. 3. The plateau between $a_2 - k$ and $a_1$ has a maximum constant height of $|c_2 - c_1|$ if $k > a_2 - a_1$. In order to detect the plateau, it is required that for fixed k: (1) $|D_i^k - D_j^k| < \epsilon$, where $j = i - s, \ldots, i - 1, i + 1, \ldots, i + s$, and (2) $D_i^k \ge l \times D_{i - k/2 - 1}^k$ or $D_i^k \ge l \times D_{i + k/2 + 1}^k$, for some large value of l. Since the width of the plateau is $k - (a_2 - a_1) + 1$, the value of k should be chosen to be approximately $2(a_2 - a_1)$, where $a_2 - a_1$ is the (expected) length of the transition.
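A minimal Python sketch of this plateau test follows; the parameter values (s, the tolerance and the ratio l) are illustrative and the boundary handling is simplified.

```python
def detect_dissolve_plateau(D, k, s=2, eps=0.01, l=4.0):
    """Mark indices i where the k-frame difference sequence D_i^k forms a plateau.

    Criterion (1): |D_i^k - D_j^k| < eps for j in i-s..i+s (j != i).
    Criterion (2): D_i^k >= l * D_{i-k//2-1}^k or D_i^k >= l * D_{i+k//2+1}^k.
    """
    hits = []
    margin = max(s, k // 2 + 1)
    for i in range(margin, len(D) - margin):
        flat = all(abs(D[i] - D[j]) < eps
                   for j in range(i - s, i + s + 1) if j != i)
        high = D[i] >= l * D[i - k // 2 - 1] or D[i] >= l * D[i + k // 2 + 1]
        if flat and high and D[i] > 0:
            hits.append(i)
    return hits

# Toy example: a plateau of height 1.0 between indices 6 and 11.
D = [0.0] * 6 + [1.0] * 6 + [0.0] * 6
print(detect_dissolve_plateau(D, k=8))   # -> [8, 9] (interior of the plateau)
```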
3.4 Detection of Gradual Scene Changes Using Motion Information
During a dissolve, the motion estimation algorithm typically finds the best matching blocks in the reference frame(s) for blocks in the current frame, but at the cost of a higher prediction error. Also, in a typical dissolve, the error is uniformly distributed in space and value over all of the macroblocks. These observations are encapsulated in the following metrics: (i) the average error $E_{avg} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} E(i,j)$ over the $M \cdot N$ macroblocks should be high; (ii) the error variance $\sigma_E^2 = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[E_{avg} - E(i,j)\right]^2$ should be high; and (iii) the error cross covariance $\sigma_{ij} = \dfrac{\sum_{i=1}^{M}\sum_{j=1}^{N} i\,j\,E(i,j)}{\sum_{i=1}^{M}\sum_{j=1}^{N} E(i,j)} - i_{avg}\, j_{avg}$, where $i_{avg} = \dfrac{\sum_{i=1}^{M}\sum_{j=1}^{N} i\,E(i,j)}{\sum_{i=1}^{M}\sum_{j=1}^{N} E(i,j)}$ and $j_{avg} = \dfrac{\sum_{i=1}^{M}\sum_{j=1}^{N} j\,E(i,j)}{\sum_{i=1}^{M}\sum_{j=1}^{N} E(i,j)}$, should be low.
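For concreteness, a small NumPy sketch of these three metrics is given below (not the authors' code; the 1-based index convention follows the formulas above).

```python
import numpy as np

def dissolve_error_metrics(E):
    """Average error, error variance, and error cross covariance computed over
    an M x N array of macroblock prediction errors E(i, j)."""
    E = np.asarray(E, dtype=float)
    M, N = E.shape
    i = np.arange(1, M + 1)[:, None]            # row indices 1..M
    j = np.arange(1, N + 1)[None, :]            # column indices 1..N
    e_avg = E.mean()
    e_var = ((e_avg - E) ** 2).mean()
    total = E.sum()
    i_avg = (i * E).sum() / total
    j_avg = (j * E).sum() / total
    sigma_ij = (i * j * E).sum() / total - i_avg * j_avg
    return e_avg, e_var, sigma_ij

# A spatially uniform error field gives a high average and a low cross covariance.
print(dissolve_error_metrics(np.full((4, 4), 10.0)))
```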
4 Detection of Scenes with Camera Motion
Scenes containing dominant camera motion such as pans and zooms can be detected and classified based on the underlying pattern of the MVs. In a pan, most of the MVs are aligned in a particular direction. For a zoom-in, most of the MVs point radially inwards towards the focus-of-expansion (FOE), whereas for a zoom-out they point radially outwards from the FOE. Let $\theta_{ij}$ be the direction of the MV associated with the ij-th macroblock, and let $\theta_{avg} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\theta_{ij}$ be the average of the MV directions in the frame. For a frame to qualify as a member of a pan shot, the variance $\sigma_\theta^2 = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[\theta_{ij} - \theta_{avg}\right]^2$ should be less than a predefined threshold. For zoom-ins and zoom-outs we analyze the MVs of the frame in each of the four quadrants. Let $MV_x$ and $MV_y$ denote the MV components along the horizontal and vertical directions, respectively. We compute the total number of macroblocks in each quadrant satisfying the following criteria:
Zoom-In: (i) first quadrant: $MV_x < 0$ and $MV_y < 0$; (ii) second quadrant: $MV_x > 0$ and $MV_y < 0$; (iii) third quadrant: $MV_x > 0$ and $MV_y > 0$; and (iv) fourth quadrant: $MV_x < 0$ and $MV_y > 0$.
Zoom-Out: (i) first quadrant: $MV_x > 0$ and $MV_y > 0$; (ii) second quadrant: $MV_x < 0$ and $MV_y > 0$; (iii) third quadrant: $MV_x < 0$ and $MV_y < 0$; and (iv) fourth quadrant: $MV_x > 0$ and $MV_y < 0$.
If the total number of macroblocks in the frame satisfying the aforementioned quadrant-specific conditions exceeds a specified threshold, then the frame is classified as belonging to a zoom-in or zoom-out, as the case may be.
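The sketch below (Python/NumPy) illustrates these tests. The mapping of the four quadrants onto image coordinates, the handling of angle wrap-around in the pan test, and the threshold values are all assumptions made for illustration.

```python
import numpy as np

def classify_camera_motion(mv, var_threshold=0.2, zoom_fraction=0.6):
    """Classify a frame as 'pan', 'zoom-in', 'zoom-out' or None from its MV field.

    mv : array of shape (M, N, 2) holding (MVx, MVy) for each macroblock."""
    mvx, mvy = mv[..., 0].astype(float), mv[..., 1].astype(float)
    M, N = mvx.shape
    # Pan: the variance of the MV directions must be small.
    theta = np.arctan2(mvy, mvx)
    if theta.var() < var_threshold:
        return "pan"
    # Spatial quadrants (assumed layout: 1 = top-right, 2 = top-left,
    # 3 = bottom-left, 4 = bottom-right).
    top, bottom = slice(0, M // 2), slice(M // 2, M)
    left, right = slice(0, N // 2), slice(N // 2, N)
    def count(c):
        return int(c.sum())
    zoom_in = (count((mvx[top, right] < 0) & (mvy[top, right] < 0)) +
               count((mvx[top, left] > 0) & (mvy[top, left] < 0)) +
               count((mvx[bottom, left] > 0) & (mvy[bottom, left] > 0)) +
               count((mvx[bottom, right] < 0) & (mvy[bottom, right] > 0)))
    zoom_out = (count((mvx[top, right] > 0) & (mvy[top, right] > 0)) +
                count((mvx[top, left] < 0) & (mvy[top, left] > 0)) +
                count((mvx[bottom, left] < 0) & (mvy[bottom, left] < 0)) +
                count((mvx[bottom, right] > 0) & (mvy[bottom, right] < 0)))
    if zoom_in > zoom_fraction * M * N:
        return "zoom-in"
    if zoom_out > zoom_fraction * M * N:
        return "zoom-out"
    return None
```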
5 Integrated Parsing
The current integration of the luminance-based, chrominance-based and motion-based approaches is based on computing a joint decision function which is the weighted sum of the detection criteria of the individual approaches. At present the weights are selected by the user. Automatic weight selection is a topic we wish to pursue in the future. At this time the integrated approach applies only to scene change detection, i.e., detection of abrupt and gradual scene changes. Detection of pans and zooms is done based on motion information alone. The GUI for the system has been developed in Java using JDK 1.1.5. The GUI allows the user to select the MPEG video, decide which parsing approach to use, decide
various threshold values, decide the relative weights of the individual approaches in the integrated approach, and view the plots of the various detection criteria.
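A minimal sketch of such a joint decision function is shown below (Python; the criterion names, weights and threshold are illustrative user choices, not values from the paper).

```python
def integrated_decision(criteria, weights, threshold=0.7):
    """Joint decision function: weighted sum of the per-frame detection
    criteria of the individual approaches.

    criteria : dict mapping approach name -> list of per-frame criterion values
    weights  : dict mapping approach name -> user-selected weight
    Returns the indices of frames whose joint score exceeds the threshold."""
    n_frames = min(len(v) for v in criteria.values())
    return [t for t in range(n_frames)
            if sum(w * criteria[name][t] for name, w in weights.items()) > threshold]

# Example with two modes and user-selected weights.
criteria = {"dc_diff": [0.1, 0.2, 0.9, 0.1], "motion": [0.2, 0.1, 0.8, 0.2]}
weights = {"dc_diff": 0.6, "motion": 0.4}
print(integrated_decision(criteria, weights))    # -> [2]
```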
6 Experimental Results
The MPEG1 video clips used for testing the proposed technique were either generated in our laboratory or downloaded over the Web [7]. The Tennis sequence contains cuts at frames 90 and 150. It was observed that the scene changes were successfully detected only by the MV approach (Fig. 4), with the DC difference approach displaying very small peaks in the luminance and chrominance domain (Fig. 5). The integration of the two approaches indicates clear peaks at both scene cut points (Fig. 6), thus showing it to be more reliable. The Spacewalk1 sequence contains 3 different dissolve sequences: between frames 74 and 85, between frames 155 and 166, and between frames 229 and 242. Both approaches detected the 3 dissolves accurately. However, it was found that the DC k-difference approach overestimated the span of the dissolve (Fig. 7). The results of the integrated approach (Fig. 9) were found to be more accurate than those of the MV approach (Fig. 8) and the DC k-difference approach (Fig. 7) individually. The videos used for testing the zoom and pan detection were created in our laboratory using a video camera. The results for one of the zoom sequences are depicted in Fig. 10. This clip has 267 frames with 2 zoom-outs, between frames 77 and 90, and frames 145 and 167. The algorithm successfully detected these zoom-out sequences (Fig. 10). The results for one of the pan sequences are shown in Fig. 11. The clip has a pan sequence starting from frame 90 and ending at frame 153. The algorithm successfully detected this pan sequence, displaying a low value for the variance of the MV angle (Fig. 11).
7 Conclusions and Future Work
The integrated approach presented in this paper combines information from multiple sources such as luminance, chrominance and motion, and is capable of more reliable detection of cuts and dissolves, as well as pans and zooms, in MPEG1 video. Since the MPEG1 video is minimally decompressed during parsing, the approach results in great savings in memory usage and processing bandwidth. Future work will investigate the detection of other scene parameters such as wipes, morphs and object motion, the automatic selection of the various threshold values, and learning these threshold values from examples.
References
1. S.M. Bhandarkar and A.A. Khombadia, Motion-based Parsing of Compressed Video, Proc. IEEE IWMMDBMS, Aug. 1998, pp. 80-87.
2. H. Ching, H. Liu, and G. Zick, Scene decomposition of MPEG compressed video, Proc. SPIE Conf. Dig. Video Comp., Vol. 2419, Feb. 1995, pp. 26-37.
Fig. 4. Tennis: Motion edge plot. Fig. 5. Tennis: DC difference plot. Fig. 6. Tennis: Integrated approach plot.
Fig. 7. Spacewalk1: DC k-difference plot. Fig. 8. Spacewalk1: Error variance plot. Fig. 9. Spacewalk1: Integrated approach plot.
Fig. 10. Pan7: Plot of % of pixels satisfying zoom criteria
Fig. 11. Pan5: Plot of MV angle variance
3. D.L. Gall, MPEG: A video compression standard for multimedia applications, Comm. ACM, Vol. 34(4), 1991, pp. 46-58.
4. V. Kobla, D. Doermann, K.I. Lin and C. Faloutsos, Compressed domain video indexing techniques using DCT and motion vector information in MPEG video, Proc. SPIE Conf. Stor. Retr. Img. Vid. Dbase., Vol. 3022, Feb. 1997, pp. 200-211.
5. J. Meng and S.F. Chang, CVEPS - A Compressed Video Editing and Parsing System, Proc. ACM Conf. Multimedia, Nov. 1996, pp. 43-53.
6. K. Shen and E. Delp, A fast algorithm for video parsing using MPEG compressed sequence, Proc. IEEE Intl. Conf. Image Process., Oct. 1995, pp. 252-255.
7. University of Illinois, Urbana-Champaign, ACM Multimedia Lab for sources of MPEG video data, URL: http://www.acm.uiuc.edu/rml/Mpeg/
8. B.L. Yeo and B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Cir. and Sys. for Video Tech., Vol. 5(6), 1995, pp. 533-544.
9. H.J. Zhang, C.Y. Low, and S.W. Smoliar, Video parsing and browsing using compressed data, Jour. Multimedia Tools Appl., Vol. 1(1), 1995, pp. 89-111.
Improvement of Shot Detection Using Illumination Invariant Metric and Dynamic Threshold Selection

Weixin Kong, Xianfeng Ding, Hanqing Lu, and Songde Ma

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing 100080, P.R. China
[email protected]
Abstract. Automatic shot detection is the first, and also an important, step for content-based parsing and indexing of video data. Many methods have been introduced to address this problem, e.g. pixel-by-pixel comparisons and histogram comparisons. However, the gray or color histograms used in most existing methods ignore the problem of illumination variation inherent in the video production process, so they often fail when the incident illumination varies. Moreover, because a shot change is basically a local process in a video, it is difficult to find an appropriate global threshold for an absolute difference measure. In this paper, new techniques for shot detection are proposed. We use color ratio histograms as the frame content measure because they are robust to illumination changes, and a local adaptive threshold technique is adopted to exploit the local characteristic of shot changes. The effectiveness of our methods is validated by experiments on some real-world video sequences, and some experimental results are also discussed in this paper.
1. Introduction and Related Work

Archiving and accessing multimedia information have become important tasks in several important application fields. Areas that will benefit from advances on this subject include VOD (video on demand), DLI (digital library), etc. Of all the media types, video is the most challenging one, because it combines all the other media information into a single bit stream. Because of its length and unstructured format, it is hard to efficiently browse and retrieve large video files. The most popular approach consists in assigning key words to each stored video file and doing retrieval only on these key words. But these key words often cannot capture the rich content of the videos, so methods to browse and retrieve video sequences directly by their content are urgently needed. As far as video browsing and retrieval are concerned, the primary requirement is the construction of the video structure. A typical video film has an obvious structure hierarchy, i.e. video, scene, shot and frame. A shot is a sequence of video frames generated during a continuous camera operation. It can serve as the smallest indexing
unit. So shot detection is an important first step in content-based video browsing and retrieval. In recent years, substantial efforts have been devoted to shot-based video analysis. A survey on this topic can be found in [1]. An assumption that is often made is that the content should not change greatly from one frame to the next within one camera shot. So, in general, shot boundaries can be detected by employing a difference metric to measure the change between two consecutive frames. A shot boundary is declared if the difference between the two frames exceeds a certain threshold. Therefore, the key issues in detecting shot boundaries are selecting suitable difference metrics and appropriate thresholds. Different solutions have been designed using pixel- or block-based temporal image differences [2,3], or differences of gray and color histograms [4,5]. Histograms are robust to object motion, i.e. two frames having an unchanged background and objects will show little difference in their overall gray or color distribution, and they are simple to compute. So they have been widely used in shot-based video analysis, and several authors claim that this measure can achieve a good trade-off between accuracy and speed. In this paper, we also address the problem of segmenting a film into a sequence of shots based on the difference of histograms. The rest of the paper is organized as follows. In section 2, we discuss our illumination invariant frame content metric: the color ratio histogram. In section 3, we propose our shot detection algorithm. We use color ratio histograms as the frame content measure, because they are robust against non-content-based changes (small movements, changes of illumination condition, etc.). To exploit the local characteristic of shot changes, a local adaptive threshold technique is adopted. The effectiveness of our methods is validated by experiments on some real-world video sequences, and some experimental results are also discussed in that section. Conclusion and future work are in section 4.
2. Illumination Invariant Frame Content Metric

The gray or color histograms used in most existing methods ignore the problem of illumination variation inherent in the video production process, so they often fail when the incident illumination varies. Even simple lighting changes will result in abrupt changes in histograms. This limitation might be overcome by preprocessing with a color constancy algorithm. In [6], Wei et al. proposed a color-channel normalization method; they then reduced the three-dimensional color to a two-dimensional chromaticity representation and defined a two-dimensional chromaticity histogram. Their method can discount simple spectral changes of illumination, but it is computationally expensive and cannot handle spatial changes of illumination. In [7], Funt and Finlayson showed that neighbourhood-based color ratios are invariant under various changes both in spectrum and in spatial distribution. They studied the use of color ratio histograms as features for indexing image databases. We now adopt color ratio histograms for shot detection. The color ratio histogram can be formulated as follows:
$$H(i,j,k) = \sum_{x,y} z(x,y), \qquad
z(x,y) = \begin{cases} 1, & \text{if } d_R(x,y) = i,\ d_G(x,y) = j,\ d_B(x,y) = k \\ 0, & \text{else} \end{cases} \qquad (1)$$

$$d_k(x,y) = \nabla^2 i_k(x,y), \qquad i_k(x,y) = \log f_k(x,y), \qquad k = R, G, B$$

where $f(x,y)$ denotes the RGB color values at position $(x,y)$, $i(x,y)$ their logarithms, and $d(x,y)$ the Laplacian of $i(x,y)$. The color ratio histogram is the histogram of $d(x,y)$. It can be seen as a histogram of the color ratios within a small image region. Since the ratios of RGB color triples from neighboring locations are relatively insensitive to changes in the incident illumination, this circumvents the need for color constancy preprocessing. The logarithm can be calculated by table lookup and the Laplacian transform can be implemented by a simple mask convolution, so the computational cost is very low.
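A compact NumPy/SciPy sketch of this measure is given below. It uses SciPy's Laplacian filter in place of the simple mask convolution mentioned above, and the bin count and value range are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import laplace

def color_ratio_histogram(frame, bins=16, value_range=(-2.0, 2.0)):
    """Color ratio histogram of an RGB frame (H x W x 3 array).

    d_k is the Laplacian of log f_k for each channel k; the joint histogram of
    (d_R, d_G, d_B) is insensitive to smooth spatial/spectral illumination changes."""
    f = np.clip(frame.astype(float), 1.0, None)          # avoid log(0)
    d = np.stack([laplace(np.log(f[..., k])) for k in range(3)], axis=-1)
    hist, _ = np.histogramdd(d.reshape(-1, 3), bins=bins, range=[value_range] * 3)
    return hist / hist.sum()

def histogram_difference(h1, h2):
    """L1 difference between two normalized color ratio histograms."""
    return float(np.abs(h1 - h2).sum())

# A global illumination change barely affects the color ratio histogram.
rng = np.random.default_rng(0)
frame = rng.integers(10, 255, size=(64, 64, 3))
print(histogram_difference(color_ratio_histogram(frame),
                           color_ratio_histogram(frame * 0.5)))
```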
3. Shot-Based Video Analysis

Shot-based video analysis involves segmenting the video clip into a sequence of camera shots by detecting shot boundaries. There are two types of shot boundary: camera breaks and gradual transitions. The latter kind of boundary is obviously more difficult to detect.

3.1 Camera Break Detection

Figure 1 shows the difference of traditional color histograms versus time in a sequence of 300 frames. There are two camera breaks, at frames 11 and 238. But in the shot between these two frames there are many illumination changes due to fireworks. As a result, many peaks of the difference value occur. For example, at frame 226 a very high pulse exists, although there is only a simple lighting change and the contents of the two consecutive frames are basically the same. In Figure 2, which illustrates the difference of color ratio histograms, these pulses have been effectively smoothed, and we can easily detect the two breaks. Another improvement of our algorithm is that we introduce a relative difference measure between two consecutive frames. Because a shot change is basically a local characteristic of a video, it is difficult to find an appropriate global threshold for an absolute difference measure. For a highly active video sequence, there will be a lot of frames where the absolute histogram differences are large, and using only absolute measures will bring many false alarms. The increment ratio of the histogram differences instead emphasizes this locality. It can be described as:

$$CD = D(H_{n+1}, H_n)\, /\, D(H_n, H_{n-1}) \qquad (2)$$

Here $D(H_{n+1}, H_n)$ is the absolute difference measure of two frames.
Figure 3 shows this relative measure. It can be seen that this measure is more reasonable for detecting shot changes.
Fig. 1.
Fig. 2.
Fig. 3.
3.2 Gradual Change Detection

Unlike camera breaks, a gradual shot change does not usually cause sharp peaks in the inter-frame difference of a video sequence, and can easily be confused with object or camera motion. It often lasts ten or more frames instead of only one. So gradual changes are usually determined by observing the inter-frame differences over a period of time. In [4], Zhang et al. proposed a method called the twin-comparison technique. Two thresholds $T_b$, $T_s$, with $T_s < T_b$, are set for camera breaks and gradual changes respectively.
If the histogram difference between two consecutive frames satisfies $T_s < SD(i, i+1) < T_b$, the i-th frame is marked as the potential start frame of a gradual change. Every potential frame detected is then compared to subsequent frames; this is called the accumulated comparison $A_c$. The comparison is computed until $A_c > T_b$ and $SD < T_s$; the end of the gradual change is declared when this condition is satisfied. But the thresholds in their method are difficult to set, and they should also vary within a long video sequence. Based on the twin-comparison method, we develop a new local adaptive threshold technique. We first calculate the average $a$ and standard deviation $\sigma$ of the histogram differences of the frames within a temporal window preceding the current frame. The histogram difference of the current frame is then compared with this average value. We use $a + (2 \sim 3)\sigma$ as the threshold to detect the start frame of a gradual change and $a + (5 \sim 6)\sigma$ to detect the end frame.

3.3 Shot Detection Experimental Results

Our approach has been validated by experiments with several kinds of video sequences. These sequences contain the usual features related to film producing and editing, including lighting conditions, object and camera motion, and editing frequency. Tables 1 and 2 report the performance results obtained in our experiments on two sequences, where the error rate reported in the last column is computed as the ratio (False+Missed)/Real.

Table 1. Experimental results on sequence 1 (5975 frames)

Shot Type        False   Missed   Real   Error Rate
Camera Break       2       1        96      3.1%
Gradual Change     1       0        10      10%

Table 2. Experimental results on sequence 2 (9136 frames)

Shot Type        False   Missed   Real   Error Rate
Camera Break       3       1       113      3.5%
Gradual Change     1       1        13      15%
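For illustration, a minimal Python sketch of the local adaptive threshold described in Section 3.2 follows; the window length, the multipliers (midpoints of the 2~3 and 5~6 ranges) and the example values are assumptions.

```python
import numpy as np

def adaptive_thresholds(diffs, t, window=50, k_start=2.5, k_end=5.5):
    """Local adaptive thresholds for the histogram difference at frame t.

    The average a and standard deviation sigma are computed over a temporal
    window preceding t; a + (2~3)*sigma flags the start of a gradual change
    and a + (5~6)*sigma flags its end."""
    past = np.asarray(diffs[max(0, t - window):t], dtype=float)
    a, sigma = past.mean(), past.std()
    return a + k_start * sigma, a + k_end * sigma

# Example: a rising difference at frame 60 exceeds the local thresholds.
diffs = [0.10, 0.12, 0.09, 0.11] * 15 + [0.40]
start_thr, end_thr = adaptive_thresholds(diffs, t=60)
print(diffs[60] > start_thr, diffs[60] > end_thr)
```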
4 Conclusions and Future Work

In this paper, new techniques for shot detection are proposed. We use color ratio histograms as the frame content measure; they are robust to illumination changes. A local adaptive threshold technique is adopted to exploit the local characteristic of shot changes. The effectiveness of our methods is validated by experiments on some real-world video sequences. Experimental results show that our method is effective in detecting both camera breaks and gradual changes. From these experiments, we also find that the number of shots in a typical film is very large (roughly one shot every 3 seconds), so the shot-level video structure alone cannot guarantee efficient browsing and retrieval. Higher semantic-level analysis of the video content and the construction of a scene structure are very important. A scene is defined as a sequence of shots related by semantic features. It is the scene that constitutes the semantic atom upon which a film is based. Obviously the construction of a scene structure is a far more difficult research task than shot detection, and little work has been done on this problem. Based on our accurate shot detection algorithm, we will study the problem of scene structuring in the future.
References
1. J.S. Boreczky and L.A. Rowe, Comparison of video shot boundary detection techniques, Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases IV, Vol. 2670, pp. 170-179, 1996.
2. K. Otsuji, Y. Tonomura and Y. Ohba, Video browsing using brightness data, Proc. SPIE Conf. Visual Communications and Image Processing, pp. 980-989, November 1991.
3. A. Nagasaka and Y. Tanaka, Automatic video indexing and full-video search for object appearances, Proc. 2nd Visual Database Systems, pp. 119-133, October 1991.
4. H. Zhang, A. Kankanhalli, and S. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems, vol. 1, pp. 10-28, 1993.
5. I.K. Sethi and N. Patel, A Statistical Approach to Scene Change Detection, Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, vol. 2420, pp. 329-338, 1995.
6. J. Wei, M.S. Drew, and Z.-N. Li, Illumination-invariant video segmentation by hierarchical robust thresholding, Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, vol. 3312, pp. 188-201, 1998.
7. B.V. Funt and G.D. Finlayson, Color Constant Color Indexing, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 522-529, 1995.
Temporal Segmentation of MPEG Video Sequences

Edoardo Ardizzone, Carmelo Lodato, and Salvatore Lopes

CNR-CERE, Centro Studi sulle Reti di Elaboratori, Viale delle Scienze, 90128 Palermo, Italy
[email protected], {ino,toty}@cere.pa.cnr.it
Abstract. Video segmentation is a fundamental tool for the evaluation of video semantic content. In multimedia applications, videos are often in MPEG-1 format. In this paper, an algorithm for the automatic shot segmentation of MPEG-1 sequences is presented. The adopted method is based on heuristic considerations concerning the characteristics of MPEG-1 video streams. In particular, the pattern structure and the I-, B- and P-frame sizes are taken into account. The proposed algorithm has been applied to MPEG-1 sequences and some results are reported.
1 Introduction
The effective use of video databases requires videos to be indexed not only by textual data, but also (and mainly) by the visual features they contain. Databases of this kind are conventionally called "content-based video databases" (CBVD), and systems that allow images to be retrieved by their visual content are often referred to as "content-based retrieval systems" (CBRS) (1). The visual content is described by features related to color, texture, object structure, etc. Features are normally extracted from images in a manual, semi-automatic, or automatic way during the phase of DB population, and stored in a feature database. During the query phase, the feature DB is searched for the features most similar to those provided by the user, and the related images, sorted in order of similarity, are shown to the user. As far as videos are concerned, motion features, e.g. related to object motion or to camera movements, are also important. Anyway, the first step of feature extraction is normally a temporal segmentation process. The objective of this process is the detection of scene cuts, in order to reduce the video to a sequence of short dynamic scenes (2), generally characterized by a set of homogeneous features. Each scene may therefore be characterized by the features of one or more representative frames, i.e. still images (3). The operation and the characteristics of several systems of this kind may be found in [2,3,5]. For example, the methodology and the algorithms used by JACOB, a general purpose system particularly suited for storage and retrieval of TV sequences, have been described in [2].
(1) Some examples of "content-based retrieval systems" are described in [4,6,9,10,11].
(2) Often referred to as shots in the literature.
(3) Often referred to as r-frames in the literature.
Most of these systems, JACOB included, operate on uncompressed video sequences. Nevertheless, video sequences are often compressed for efficient transmission or storage. Therefore, compressed videos have to be decompressed before indexing algorithms may be applied, thus requiring computationally intensive processing steps. More recently, pre-processing of compressed videos directly in the compressed domain has been proposed by several authors, mainly for MPEG-1 bitstreams [12-14]. Algorithms for scene change detection in MPEG-1 compressed video sequences have been proposed in [15]. Some video indexing methods based both on motion features (mainly camera movements and operations such as zooming and panning) and on motion-based spatial segmentation of single frames have been presented in [16]. In this paper we propose a method for scene cut detection which does not need any decompression. The method is based on the external analysis of the characteristics of the MPEG-1 bitstream. In particular, the frame pattern of MPEG-1 coding and the sizes and size changes of I, P and B frames are used to decide where a scene cut is most likely to occur. The decision is based on heuristics. Since no decompression process is necessary and the analysis relies on very simple computations, the algorithm is very fast. Moreover, as shown later, it is accurate enough to be used as a tool for a preliminary segmentation step. The rest of the paper is organized as follows. In section 2, MPEG-1 characteristics are reviewed. Section 3 describes the proposed algorithm, and section 4 reports the first experimental results.
2 MPEG-1 Coding Characteristics
The MPEG-1 standard concerns the compression of audio and video digital signals [1]. Video compression is achieved by removing the redundant information in a sequence of pictures. The compression ratio can be chosen in such a way that the compression process does not alter the quality of the compressed sequence. The standard also concerns the multiplexing of audio and video signals, but in this paper only video streams are treated. The video compression process exploits the spatial redundancy in a single picture and the temporal redundancy between pictures that are close to each other. The spatial redundancy is reduced using a block coding technique. The temporal redundancy is reduced using a motion estimation technique. During the compression process, the MPEG-1 encoder decides whether the arriving picture should be compressed using the block or the motion estimation technique. Frames compressed using only block coding are called intra-frames or I-frames. Frames coded with motion estimation with respect to the previous frame are called predicted frames or P-frames. Frames coded with respect to the previous and following frames are called bidirectionally predicted frames or B-frames. P frames are always coded with respect to the closest previous I or P frame. B frames are always coded with respect to the previous and following I or P frames. An MPEG-1 video is therefore characterised by a recurrent sequence of I-, P- and B-frames, always starting with an I-frame, commonly called a pattern. Generally, the pattern structure depends on the frame rate, because it is necessary to code at least two I frames every second for reasons related to random access and error propagation. The pattern structure is decided at coding time.
3 The Algorithm
The proposed algorithm for the automatic segmentation of MPEG-1 sequences into shots is essentially based on the search for "points" potentially representing scene cuts. These "points" or "events" are detected by analyzing the external characteristics of the MPEG-1 videos. In this study, the characteristics taken into account are the pattern structure and the sizes of all frame types. Internal characteristics such as intensity, chrominance or motion vectors are not considered. This choice agrees with the simplicity and speed requirements expressed in the introduction. The algorithm scans a whole MPEG-1 trace searching for any anomalies with respect to pattern structure or frame sizes. As already said, an MPEG-1 sequence consists of a recurrent structure of I-, P- and B-frames called a pattern. The pattern length and configuration depend on the video frame rate. For example, the coded pattern IBBPBBPBBPBB is 12 frames long for a 24 frame-per-second video. The pattern IBBPBBPBBPBBPBB is 15 frames long when the video is coded at 30 frames per second. Normally, the pattern remains unchanged for the whole duration of the sequence. On the other hand, some MPEG-1 encoders can modify the coding sequence with the aim of improving the perceptual quality of compressed videos. Such a modification consists, for instance, in coding a P frame instead of a B frame, or in the truncation of the current pattern before its completion. Generally, pattern changes happen very seldom, and a succession of normal patterns subsequently restarts. Pattern changes can be necessary, for example, to code fast evolving scenes. Digital video-editing tools that operate directly on compressed sequences could also introduce pattern changes or truncations.

IBBPBBPBBPBB - IBBPBBPBBPBBPBB - IBBPBBPBBPBB
(pattern change between frame I no. 837 and frame I no. 852)
Fig. 1. Correspondence between the pattern and semantic content change
From an analysis of MPEG-1 videos including localized alterations of the normal or prevalent pattern, a strong correlation has been observed between changes of pattern structure and changes in the semantic content of the pictures in the frames before and after the pattern modifications. It follows that the frames corresponding to pattern changes can be regarded as scene cuts between contiguous distinct shots. Thus, the complete patterns that precede and follow the modified one can be considered as belonging to different shots. A typical observed case is reported in Fig. 1.
The figure shows the pattern change and the corresponding scene change between the I frames preceding and following it. The other methods adopted for detecting potential scene cuts are all derived from an analysis of frame sizes. I-frames, as already said, include all the information necessary for their decoding. For this reason, their sizes are in some way correlated with picture complexity. I-frames of similar size will represent pictures of the same complexity, but not necessarily equal pictures. That is, it is not possible to find a useful relationship between the semantic content of a picture and the size of the frame coding it. Nevertheless, I-frames following each other in the same video within a very short time interval do not show significant variation in their size. On the other hand, a significant size variation between consecutive I-frame pairs can probably indicate a semantic content variation in the interval between them. Although this consideration can seem rather obvious from the qualitative point of view, the quantitative determination of the threshold that a size variation must exceed to be significant is not so easy. This conclusion derives from the analysis of several video sequences of different subjects, compressed with different MPEG-1 encoders. Threshold values vary strongly from one sequence to another, and many trials have been carried out in order to find a method suitable for sequences of any type. An easy-to-implement procedure for threshold determination is explained in the following. Firstly, frame sizes have been normalised using the following relationship:
$$I_j^* = \frac{I_j - I_{min}}{I_{max} - I_{min}} \qquad (1)$$
where $I_j^*$ and $I_j$ are respectively the normalised and the original size of the j-th frame, and $I_{max}$ and $I_{min}$ the sizes of the largest and smallest I-frames. Then, the statistical distribution of the differences ∆I* between the normalised sizes of consecutive I-frame pairs has been evaluated. Fig. 2 shows the statistical distribution of ∆I* for an examined sequence in the (0, 1) interval.
Fig. 2. Threshold value determination for ∆I*
The curve plotted in Fig. 2 expresses the cumulative frequency of the ∆I* values. The threshold value is determined by searching for the first point of that curve above a fixed value of the cumulative frequency (90% in the plot) with a null value of the tangent. Such a criterion, chosen to single out ∆I* values that are statistically anomalous, can easily be applied to all types of sequences. A value of ∆I* above the threshold allows the detection of a potential scene cut happening within a two-pattern-long interval. If the
transition between two contiguous shots is very smooth, with a very slow fade-in fade-out effect, the corresponding ∆I* values might not be large enough to signal the change. In such a situation, a useful check can be made on the normalised frame sizes I*. In fact, although the scene evolution may be such that the values of ∆I* stay below the threshold value, the corresponding I* can still show an anomalous behaviour. That is, an I* value exceeding a proper threshold can still mark a potential scene cut that would otherwise not be detected. The threshold value for I* is evaluated with the same procedure presented above. P-frames are coded using both the block and the motion estimation techniques, that is, reducing the temporal redundancy with respect to the nearest previous I or P frame, and also the spatial redundancy of the coded pictures. For this reason, the P-frame sizes themselves represent the scene evolution with respect to the reference frame. A succession of consecutive P-frames of large size will represent a scene that evolves fast. It is reasonable to suppose that a P-frame of a very large size is located at the beginning of a new shot. Just as in the previous case, the P-frame sizes have been normalised using the same relationship (1), substituting the P-frame size for the I-frame size, and, from the statistical distribution, it is possible to find the first point with a null tangent value above a fixed value of the cumulative frequency curve. The difference ∆P* of normalised sizes between consecutive P-frame pairs can also be used to evaluate the variation rate in time of the corresponding pictures. In analogy with the cases already discussed, P-frames whose ∆P* is greater than a threshold value are searched for by the algorithm. This search applies to scenes where there is relative motion between the camera and the subject. In this case, there could be, for example, a succession of P-frames with P* values all below the threshold, but which differ significantly from each other. In such a situation, ∆P* can reveal a shot transition that would otherwise go undetected. The same considerations made for P-frames can be applied to B-frames too. A further search can be done considering the sum of the sizes of the frames belonging to the same pattern, that is, the GOP (Group of Pictures) size. The results of all the searches (I*, ∆I*, P*, ∆P*, B*, ∆B*, GOP*, ∆GOP*), except for the pattern change, depend strongly on the threshold values chosen for each search. These values are derived from a statistical analysis by fixing a maximum percentage of items that will probably exceed the threshold value. For instance, a 95% value implies that no more than 5% of the items will exceed the fixed threshold. This parameter depends on the number of items in the sample and on the desired result. As a matter of fact, increasing this parameter decreases the number of items exceeding the threshold and increases the probability that the detected items correspond to effective scene changes. Conversely, decreasing this value in order to detect more events could result in excessive oversampling. From a qualitative study of MPEG-1 videos, it has been observed that the trends of the frame sizes of each type vary considerably over the whole sequence. This suggests that the choice of a unique threshold value for a long sequence may not produce good results.
For this reason, the results can be improved by applying the above search procedures to portions of the original video streams, with the aim of finding suitable threshold values for each subinterval. The partitioning of the stream into subintervals can be performed on the basis of the local video characteristics or from statistical considerations. Naturally, the partitioning cannot be too fine, for the sake of simplicity and for reasons related to the significance of the statistical sample. In fact, with too few items the statistical
analysis cannot be applied. Once all the searches have been performed, the subsequent task of the algorithm consists in an analysis of all detected events in order to eliminate the oversampling. Because each search produces events that are signalled by frames of different types, a single event corresponding to a scene cut could be marked several times on frames that are close to each other. A filtering process is then required in order to avoid an incorrect evaluation of multiple close events as distinct events, substituting a set of close signals with a single event. Close events can be considered coincident if they correspond to frames belonging to the same pattern, to two adjacent patterns, or to an interval of fixed length. Thus, by setting a minimal resolution of the algorithm, segmentation into shots shorter than a fixed interval is avoided. In this situation we assume that the last I-frame before the fixed interval and the last I-frame within the interval belong to different shots. A resolution of two patterns would result in neglecting scenes shorter than 1 second. From a practical perspective, in an application for storing and indexing the segmented shots in a content-based database, it may be less important to detect shots of short length.
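A minimal Python sketch of the per-subinterval threshold determination is given below (not the authors' implementation). It assumes the search variable has already been normalised to [0, 1] with relation (1) and approximates the "null tangent" condition on the cumulative frequency curve by an empty histogram bin.

```python
import numpy as np

def normalize_sizes(sizes):
    """Normalise frame sizes to [0, 1] as in relation (1)."""
    s = np.asarray(sizes, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def anomaly_threshold(values, min_percentage=0.90, bins=100):
    """Threshold above which values of a search variable (I*, dI*, P*, ...)
    are considered statistically anomalous: the first bin at or beyond the
    given cumulative frequency where the cumulative curve is flat."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    cum = np.cumsum(hist) / float(len(values))
    for b in range(bins):
        if cum[b] >= min_percentage and hist[b] == 0:
            return edges[b]
    return edges[-1]            # fall back to the maximum observed value

def detect_events(values, min_percentage=0.90):
    """Indices of frames whose value exceeds the subinterval threshold."""
    thr = anomaly_threshold(values, min_percentage)
    return [i for i, v in enumerate(values) if v > thr]

# Deterministic example with one anomalous normalised difference.
dI = np.concatenate([np.full(299, 0.05), [0.9]])
print(detect_events(dI))        # -> [299]
```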
4 Experimental Results
The presented algorithm has been tested and tuned on a set of MPEG-1 sequences of different subjects, lengths and quality. The results reported in this section regard three MPEG-1 movies, chosen for their good video quality, their length and the number of scene cuts they include. The transitions between adjacent shots are characterised by both abrupt and slow changes. The movies used are available as demos in the software set of a commercial graphics video card. The relevant video characteristics are summarised in Table 1.
Table 1. Relevant characteristics of the MPEG-1 video sequences used in the trials
Name          Ontario   Toronto   History
N of frames   7008      4302      5808
Length (s)    292       179       242
N of shots    136       119       47
The algorithm requires knowledge of the type and the size of each frame. Collecting all the necessary information from an MPEG-1 movie about 5000 frames long takes about 20 s of processing time on a Pentium 300 MHz PC, while the automatic shot segmentation itself requires about 0.15 s. Two series of trials have been performed. In the first set, the parameters have been fixed so as to achieve the largest number of successful events while minimising the oversampling. To this end, a procedure for the automatic determination of the relevant parameters (number of subintervals, minimal percentage for threshold evaluation) has been developed; the length of each subinterval has been chosen so as to obtain a statistical sample of 250 to 350 items. In the second set of trials, a minimal percentage value of 80% for the threshold evaluations has been fixed in order to maximise the number of successful events without regard for the oversampling. The trials of the second set have been carried out in the perspective of using the proposed
algorithm as a pre-processing phase within the framework of a more sophisticated automatic shot segmentation system. In fact, more complex systems need to decode the video streams partially or completely, thus requiring a considerably longer processing time; a pre-processing phase carried out with the proposed algorithm could greatly reduce the number of items to be processed, with a relevant gain in the overall processing time. In both trial sets the algorithm resolution has been set to 20, so that two consecutive signals are never less than 20 frames apart. This value derives from the current implementation of the filtering phase, which is not accurate enough for smaller resolution values; the resolution could be improved up to the intrinsic precision of each search by using more sophisticated filtering techniques. The adopted resolution value does not necessarily imply that the algorithm cannot detect shots shorter than 20 frames: in the current implementation each shot is represented by the frame extracted from the middle of the interval delimited by two consecutive events, so the adopted value does not preclude a priori the correct determination of a shorter shot. Table 2 reports all the relevant parameters for the two series of trials.

Table 2. Parameter values for the two trial sets

Sequence       N of intervals (I, P, B)   Min. percentage (I*/∆I*, P*/∆P*, B*/∆B*)   Resolution (n frames)
Ontario - I    2  7  18                   0.92  0.98  0.99                            20
Toronto - I    1  4  11                   0.92  0.98  0.99                            20
History - I    1  5  15                   0.92  0.98  0.99                            20
Ontario - II   2  7  18                   0.80  0.80  0.80                            20
Toronto - II   1  4  11                   0.80  0.80  0.80                            20
History - II   1  5  15                   0.80  0.80  0.80                            20

[Fig. 3: bar chart (0-100%) of detected vs. undetected scene cuts for Ontario-I, Toronto-I, History-I, Ontario-II, Toronto-II and History-II]
Fig. 3. Summary of trial results
The results of all the trials are reported in Fig. 3, where the percentages of scene cuts detected and undetected are given for each sequence and each trial. As can be seen in the histogram, the success percentage for the sequence History reaches 100% in the second trial. For the sequences Ontario and Toronto the number of detected shots increases, but it does not reach 100%: both sequences contain a relevant number of shots shorter than 20 frames, and not all of these shots can be detected with the current implementation of the filtering process. However, if only the shots longer than 2 patterns are
considered, the success percentage rises to 100% and 97% for Ontario and Toronto respectively. The oversampling, expressed as a percentage of the real number of scene cuts, is below 30% in the first set of trials and increases in the second set up to 27%, 86% and 350% for Toronto, Ontario and History respectively. Adopting a more accurate filtering process in order to increase the algorithm resolution would further improve the results in terms of success percentage and reduced oversampling.
Detecting Abrupt Scene Change Using Neural Network∗ H.B. Lu and Y.J. Zhang Department of Electronic Engineering Tsinghua University, Beijing 100084, China
Abstract: A real-time algorithm is proposed for the detection of abrupt scene changes. It makes use of a dual window (one big and one small) and single-side checking to avoid the false detections and missed detections caused by violent motion of the camera and/or large objects. In addition, a multi-layer perceptron is used to solve the problem of parameter determination in the proposed algorithm. The performance of our algorithm has been experimentally compared with that of some typical methods using real video sequences. The recall rate is greatly improved while a high precision rate is maintained.
1. Introduction
Digital video is a significant component of multimedia information systems, and the most demanding in terms of storage and transmission requirements. Content-based temporal sampling of video sequences is an efficient method for representing the visual information using only a small subset of the video frames. These frames are obtained by so-called video segmentation techniques. Through temporal segmentation, input video streams are decomposed into their fundamental units, shots, from which representative frames called key frames can be extracted. Here we define a shot as a continuous sequence of frames from one camera operation; each shot usually contains closely related visual contents. Many special video effects are used in video productions; the most frequent ones are the cut, i.e., abrupt scene change, and gradual scene changes such as fades and dissolves. In this paper we discuss cut detection with high recall and precision rates, defined as

\[ \text{recall rate} = \frac{\text{correct detections}}{\text{correct detections} + \text{missed detections}}, \qquad \text{precision rate} = \frac{\text{correct detections}}{\text{correct detections} + \text{false detections}}. \]

The paper is organised as follows: Section 2 discusses some current methods for cut detection. Section 3 presents an efficient method for cut detection, which is based on a novel dual-window concept and is implemented using a multi-layer neural network. Detection experiments with real video films are presented in Section 4 and the results are discussed in Section 5.
∗ This work has been supported by NNSF (69672029) and HTP (863-317-9604-05).
2. Related Previous Work
The existing procedures for cut detection often consist of two steps: measuring the disparity between frames according to some metric and comparing the disparity with a pre-determined threshold. Many metrics have been reported, and most of them fall into three categories: metrics based on histogram comparison, metrics based on first- and second-order intensity statistics, and metrics based on pixel differences in which the structure of the images is considered. Some quantitative comparisons of these metrics can be found in [1,2]. In this section we first present our own evaluation of several related metrics, then discuss threshold selection.

2.1 Previous Metrics for Cut Detection
Scene change detection has been carried out both in the compressed and in the uncompressed domain. The ideas are very similar: each video frame is sequentially compared with its adjacent ones and the points at which a large disparity is detected are marked. The question is how to define metrics that measure the disparity between two frames. In the compressed domain, the popular methods first compare consecutive I-frames to find possible ranges of scene change, and then use the B- and P-frames between two consecutive I-frames to locate the accurate position of the scene change [3]. This method does not need to decode the compressed stream, so it is very fast, but it is hard to detect gradual scene changes using the coded information alone. In the uncompressed domain, the major techniques are based on pixel differences, histogram comparison, edge differences, and motion vectors. The simplest method is based on histogram comparison. The observation is that if two frames have similar backgrounds and common objects, their histograms will show little difference. Since the grey-level histogram represents global information about a frame, histogram-based comparison is insensitive to small motions and noise. But as the histogram discards the spatial distribution of grey levels, some abrupt scene changes will be missed, because two frames with different objects may have similar histograms. Another popular method for scene change detection in the uncompressed domain is based on cumulative pixel differences, i.e., comparing each pixel in one frame with its corresponding pixel in the next frame and summing up the differences over the whole frame. If the total difference value is bigger than a pre-determined threshold, an abrupt scene change is declared. The major problem with this method is that it is very sensitive to camera and object motion. One possible solution is to smooth the frame first, so that each pixel takes, for instance, the mean value of its 8 nearest neighbours. This approach also filters out some noise in the frame, but it can only compensate for minor camera and/or object motion.

2.2 Threshold Selection
To determine a cut position, a disparity threshold must be set. In [4], cuts are identified when the histogram differences are beyond 5 or 6 standard deviations from the mean value. However, when violent motions occur inside a shot, many consecutive frames will be identified as cuts. In [5], a technique called the "sliding window method" (SWM) is proposed. Let Xi, i = 1, 2, …, N be a sequence of DC images; the difference sequence Di, i = 1, 2, …, N–1 is formed as follows:
\[ D_i = D(i, i+1) = \sum_{x,y} \left| I_i(x, y) - I_{i+1}(x, y) \right| \qquad (1) \]

where Ii(x, y) is the intensity of pixel (x, y) in frame i. To detect scene changes from the difference sequence Di, a sliding window of size 2m–1 is defined, and the frame to be examined is placed in the middle of this window. A scene change from Xl to Xl+1 is declared if the following two criteria are fulfilled: (1) Dl ≥ Dj, j = l–m+1, …, l–1, l+1, …, l+m–1; (2) Dl ≥ t × Dk, where Dk is the second largest maximum inside the sliding window and t is a pre-defined parameter. With the help of this local information, the precision and recall rates of the method are improved. However, we find that three problems remain. First, violent intra-shot motions of objects or the camera around a shot boundary can cause a sequence of high peaks near the boundary, so that criterion (2) cannot be satisfied. For example, a segment of inter-frame differences from a test sequence (see Section 4) is shown in Fig. 1: there is a shot boundary between peaks a and b, but in practice the cut position b is missed because of the motion. The second problem is false declaration. Fig. 2 shows another segment of inter-frame differences from a test sequence. Peaks a and b indicate two cut positions, while c and d are small peaks due to the adding and removing of a movie subtitle. Because there is practically no motion between peak a and peak b, peaks c and d become two significant local peaks that would be identified as cut positions.
Fig. 1. Example of missed detection
Fig. 2. Example of false detection
The selection of the parameter t is the third problem. A proper value is very important for cut detection. In [5], the sensitivity of the parameter t is investigated; the parameter should be selected as a trade-off between decreasing the missed detections and decreasing the false detections. It goes without saying that some cut positions will be missed whenever the ratio of the maximum to the second maximum is just below the selected t, so that criterion (2) cannot be satisfied.
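For concreteness, a minimal sketch of the SWM decision rule of [5] applied to a pre-computed difference sequence D is given below; the handling of the sequence borders and the function name are our own assumptions.

```python
def swm_cuts(D, m=10, t=2.0):
    """Sliding window method: declare a cut at l if D[l] is the window maximum
    (criterion 1) and at least t times the second largest value (criterion 2)."""
    cuts = []
    for l in range(m - 1, len(D) - m + 1):
        window = D[l - m + 1 : l + m]          # 2m - 1 values centred on l
        others = sorted(window)[:-1]           # drop the largest value
        second = others[-1] if others else 0.0
        if D[l] == max(window) and D[l] >= t * second:
            cuts.append(l)                     # change between frame l and l+1
    return cuts
```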
3. A New Approach for Cut Detection
To overcome the weaknesses of previous work, a new cut detection approach has been designed; Fig. 3 depicts its block diagram. Input video sources can be compressed video, such as MPEG streams, or uncompressed video sequences, such as image sequences captured by a frame grabber. In order to obtain the actual image sequence for detection, we partially decode the MPEG stream to get its DC sequence, while for uncompressed video an 8×8 average is carried out. Both cut and flashlight positions are detected from this sequence. Since flashlights cannot be considered as cuts, either from the definition of a cut or from the content of the video [5], we identify flashlights as intra-shot events and discard them to obtain the final abrupt scene change positions.
[Fig. 3 block diagram: Compressed Video Stream → Extract DC Image, Uncompressed Video Stream → 8×8 Average, both feeding a Video Sequence that is passed to Detect Flashlight and Detect Cut, yielding Shots]
Fig. 3. Diagram for cut detection
In the following, we first describe our cut detection approach, which is based on dual-window and single-side checking, and then we use a multi-layer perceptron to improve its performance.

3.1 Dual-Window Method for Cut Detection
For judging the performance of video segmentation, recall rate and precision rate are often used; however, there is a conflict between them. Since our objective of video segmentation is further video browsing and retrieval, we require that the detection method first provide a high recall rate, and only then consider the precision rate. To this purpose, the pixel difference comparison is selected as one of the important metrics, and a dual window (one big and one small) approach is used. The big window is used for selecting probable cut positions, and the small window, centred at the selected probable position, is used for determining the real cut positions. The decision made in the small window is helped by a single-side checking technique, in contrast to the double-side checking technique described in [5]. To avoid false detections from the single-side checking, we divide the image into four blocks and compare the histograms of corresponding blocks in consecutive frames. The whole algorithm can be described by the following steps:

(1) Define a disparity metric between frame j and frame k:

\[ D(j, k) = \frac{1}{N} \sum_{x,y} f\!\left( \left| I_j(x, y) - I_k(x, y) \right| \right) \qquad (2) \]
where N is the number of pixels in one frame. The function f (.) is defined as follows:
\[ f(x) = \begin{cases} 1 & x > T \\ 0 & \text{otherwise} \end{cases} \qquad (3) \]
It makes the statistics selective, exempting small disparity values.

(2) Define a big window of size WB and let the current frame l lie in the window. The mean inter-frame difference over this window is calculated.

(3) Define a small window of size WS = 2m–1 and let frame l lie at the centre of this window.

(4) Let Dl = D(l–1, l). If both of the following criteria (single-side criteria) are satisfied (t1 and t2 are predefined constants):
(a) Dl ≥ t1 × mean
(b) Dl ≥ t2 × Dleft .OR. Dl ≥ t2 × Dright
where Dleft = max(Dj), j = l–m+1, …, l–1, and Dright = max(Dj), j = l+1, …, l+m–1, then frame l is considered a possible cut position (otherwise return to (2) and consider the next frame).

(5) For further confirmation, another metric is defined (hj and hk are the histograms of frames j and k, respectively; the denominator is a normalising factor):

\[ \delta(j, k) = \frac{\sum_{i=1}^{M} \left| h_j(i) - h_k(i) \right|}{\sum_{i=1}^{M} \left\{ h_j(i) + h_k(i) \right\}} \qquad (4) \]
If δ(l–1, l) ≥ t3 (t3 is a predefined constant) is also satisfied, frame l is identified as a cut position. Return to (2). Three points should be noted. (1) The big window is used to avoid false detections: using the mean of the inter-frame differences avoids the false detections produced by SWM as pointed out in Section 2, and at the same time greatly reduces the number of positions to be searched. (2) The single-side criteria are used in step (4) to avoid the missed detections caused by violent camera and object motions; since a single window may introduce false positions, we add the criterion δ(l–1, l) ≥ t3. (3) The spatial distribution information is lost in a global histogram, so we divide the image into n×n blocks, compute δi, i = 1, 2, …, n×n, from the corresponding blocks of consecutive frames, and use the average δ(l–1, l) = Σ δi / (n×n). In this way the problem that different scenes may have similar histograms is amended.
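The sketch below puts steps (1)-(5) together for grey-scale frames: the thresholded pixel-difference metric, the single-side check and the block-histogram confirmation follow the description above, while the frame representation (NumPy arrays), the default parameters and the helper names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def disparity(f1, f2, T=20):
    """Metric (2): fraction of pixels whose absolute difference exceeds T."""
    return float(np.mean(np.abs(f1.astype(int) - f2.astype(int)) > T))

def block_hist_delta(f1, f2, n=2, bins=64):
    """Metric (4) averaged over n x n blocks of the two frames."""
    H, W = f1.shape
    deltas = []
    for bi in range(n):
        for bj in range(n):
            b1 = f1[bi*H//n:(bi+1)*H//n, bj*W//n:(bj+1)*W//n]
            b2 = f2[bi*H//n:(bi+1)*H//n, bj*W//n:(bj+1)*W//n]
            h1, _ = np.histogram(b1, bins=bins, range=(0, 256))
            h2, _ = np.histogram(b2, bins=bins, range=(0, 256))
            deltas.append(np.abs(h1 - h2).sum() / (h1 + h2).sum())
    return float(np.mean(deltas))

def dwm_cuts(frames, WB=500, m=12, t1=1.2, t2=2.0, t3=0.3):
    """Dual-window method with single-side checking, steps (1)-(5)."""
    D = [disparity(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    cuts = []
    for l in range(m, len(frames) - m + 1):
        lo = max(0, l - WB // 2)                      # big window around l
        mean = np.mean(D[lo:lo + WB]) + 1e-12
        Dl = D[l - 1]                                 # D(l-1, l)
        Dleft = max(D[l - m:l - 1], default=0.0)
        Dright = max(D[l:l + m - 1], default=0.0)
        if Dl >= t1 * mean and (Dl >= t2 * Dleft or Dl >= t2 * Dright):
            if block_hist_delta(frames[l - 1], frames[l], n=2) >= t3:
                cuts.append(l)
    return cuts
```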
3.2 Thresholding Using Multi-Layer Perceptron (MLP)
The above algorithm solves the problem of global threshold selection and the two weaknesses of SWM. But the parameters t1, t2, t3 are still difficult to select. A possible solution is to observe several video clips and then choose good values of these parameters experimentally.
We consider an artificial neural network a suitable way to fulfil this requirement. The goal of neural networks is to solve problems without explicit programming: the neurons and networks learn from examples and store the obtained knowledge in a distributed way among the connection weights. Neural networks are on-line learning systems, intrinsically non-parametric and model-free. Since neural-network classifiers are often able to achieve better reliability than classical statistical or knowledge-based structural recognition methods through their adaptive capability [6], we use a multi-layer perceptron (MLP) here. The MLP we use has three fully-connected layers and acts as a classifier that decides whether the current position is a cut position or not. The structure of the MLP is shown in Fig. 4. Our focus here is how to extract features to form the input vector, since efficient feature extraction is crucial for reliable classification. According to the analysis in Section 3.1, four features are extracted: the inter-frame difference of frame l according to metric (2), the ratio of Dl over Dleft, the ratio of Dl over Dright, and δ(l–1, l); i.e., the input layer has four neurons. We define the input vector I = [I1, I2, I3, I4] as follows:

\[ I_1 = D_l, \quad I_2 = D_l / D_{left}, \quad I_3 = D_l / D_{right}, \quad I_4 = \delta(l-1, l) = \frac{1}{n \times n} \sum_{i=1}^{n \times n} \delta_i \qquad (5) \]
This input vector forms a 4-D space. The values of Ii are all much bigger at cut positions than at non-cut positions, so it is easy to achieve robust classification with an MLP in this 4-D space. In order to classify the input frame l, represented by the input vector, into two classes, we define an output vector O = [O1, O2]: if O1 < O2, then frame l is declared a cut position, otherwise a non-cut position. The number of neurons in the hidden layer is calculated using the formula \( n_h = \sqrt{n_i + n_o} + k \), where k is a constant between 1 and 10. As here ni = 4 and no = 2, we
obtain nh = 4 ~ 13. We choose nh = 10 for a more robust system.

[Fig. 4: three-layer MLP with input neurons I1-I4, a hidden layer of nh neurons and output neurons O1, O2]
Fig. 4. Structure of MLP
In our MLP, the input neurons use linear activation functions, and the neurons in the hidden and output layers use sigmoid functions. A modified back-propagation training algorithm is applied [7]. Instead of minimising the squared differences between the actual and target values summed over the output units and all cases, the following error function is minimised:
\[ E = -\sum_{m=1}^{M} \sum_{n=1}^{N} \left[ T_{mn} \ln O_{mn} + (1 - T_{mn}) \ln(1 - O_{mn}) \right] \qquad (6) \]
where m runs over cases, M is the total number of cases, N is the total number of output units, Omn is the actual value (between 0 and 1) of output unit n, and Tmn is the target value of output unit n. The coupling strengths wji are updated according to the following rule:

\[ \Delta w_{ji}(s+1) = -\eta \sum_{m=1}^{M} \frac{\partial E}{\partial w_{ji}} + \alpha \, \Delta w_{ji}(s) \qquad (7) \]
where s is the sweep number, m runs over cases, M is the total number of cases, η is the learning rate and α is the momentum factor. In order to ensure that all coupling strengths are changed by the iterative learning procedure, they have to be initialised with small values (random numbers ranging from –0.03 to +0.03 in our case). In the real implementation, since t1 and t2 in step (4) of Section 3.1 are larger than 1, the positions that do not satisfy the criterion "Dl > mean .AND. (Dl > Dleft .OR. Dl > Dright)" cannot be cut positions and can be filtered out first to reduce the searching time.
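A compact sketch of the 4-10-2 perceptron and the momentum back-propagation update of Eqs. (6)-(7) is given below. The batch-gradient formulation, the ±0.03 initial weight range and the sigmoid outputs follow the text, while the absence of bias terms and the exact training loop details (learning rate, epochs, data handling) are our own simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

class CutMLP:
    """4-input, nh-hidden, 2-output perceptron trained with cross-entropy (6)
    and the momentum update rule (7)."""
    def __init__(self, ni=4, nh=10, no=2):
        self.W1 = rng.uniform(-0.03, 0.03, (ni, nh))
        self.W2 = rng.uniform(-0.03, 0.03, (nh, no))
        self.dW1 = np.zeros_like(self.W1)
        self.dW2 = np.zeros_like(self.W2)

    def forward(self, X):
        self.H = sigmoid(X @ self.W1)      # hidden activations
        return sigmoid(self.H @ self.W2)   # outputs in (0, 1)

    def train_epoch(self, X, T, eta=0.1, alpha=0.9):
        O = self.forward(X)
        # With cross-entropy (6) and sigmoid outputs, dE/dnet = O - T.
        delta2 = O - T
        delta1 = (delta2 @ self.W2.T) * self.H * (1.0 - self.H)
        self.dW2 = -eta * self.H.T @ delta2 + alpha * self.dW2   # rule (7)
        self.dW1 = -eta * X.T @ delta1 + alpha * self.dW1
        self.W2 += self.dW2
        self.W1 += self.dW1

    def is_cut(self, x):
        o1, o2 = self.forward(np.atleast_2d(x))[0]
        return o1 < o2                     # O1 < O2 means "cut"
```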
4. Experiment Results
In order to compare our proposed algorithms with SWM, four representative video clips have been chosen for testing. One is from "Four Wedding One Funeral" and contains static shots, flashlights and many camera operations; one is about air battles from "Top Gun", in which both camera and object motions are very violent; one is from "Fifth Element", with a number of large object motions; and one is a cartoon, selected to test the algorithm performance under fast scene changes. First, we compare DWM with SWM, choosing the parameters WB = 500, m = 12, t1 = 1.2, t2 = 2, t3 = 0.3 for DWM and m = 10, t = 2 for SWM. The results for the four test video clips are listed in Table 1 and Table 2, respectively. From these tables it is clear that the recall rate of DWM is much higher than that of SWM. Mainly thanks to the dual windows, DWM avoids many of the missed detections caused by violent motions; the precision rate is slightly improved because small local maxima can be discarded. But like SWM, DWM still has the problem of selecting proper parameters: in the clip from "Fifth Element", SWM misses 18 cut positions while DWM still misses 13. We then test the DWM-MLP method, with WB = 500 and m = 12 chosen for feature extraction. Another video clip, from "Airforce No.1", is used to train the MLP, since some representative types of cuts exist in this clip; the training algorithm converged after ten epochs. The four video clips mentioned above are used to test the MLP and the results are listed in Table 3. As expected, the MLP is very robust: a 98.4% recall rate and a 96.6% precision rate are achieved for these test clips. It is worth mentioning that with DWM-MLP only 1 cut position is missed in the clip from "Fifth Element".
Table 1. Test results obtained by using SWM (m = 10, t = 2)

Video Clip                 Frames  Total Cuts  Correct  Missed  False  Recall  Precision
Four Wedding One Funeral   3434    18          16       2       2      89%     89%
Top Gun                    1602    36          28       8       3      78%     90%
Fifth Element              2674    61          43       18      0      70%     100%
Cartoon                    1402    12          10       2       0      83%     100%
Totals                     9112    127         97       30      5      76.4%   95%

Table 2. Test results obtained by using DWM (WB = 500, m = 12, t1 = 1.2, t2 = 2, t3 = 0.3)

Video Clip                 Frames  Total Cuts  Correct  Missed  False  Recall  Precision
Four Wedding One Funeral   3434    18          18       0       0      100%    100%
Top Gun                    1602    36          36       0       3      100%    92%
Fifth Element              2674    61          48       13      0      79%     100%
Cartoon                    1402    12          12       0       1      100%    92%
Totals                     9112    127         114      13      4      89.8%   96.6%

Table 3. Test results obtained by using DWM-MLP (WB = 500, m = 12)

Video Clip                 Frames  Total Cuts  Correct  Missed  False  Recall  Precision
Four Wedding One Funeral   3434    18          18       0       0      100%    100%
Top Gun                    1602    36          36       0       3      100%    92%
Fifth Element              2674    61          60       1       0      98%     100%
Cartoon                    1402    12          11       1       1      100%    92%
Totals                     9112    127         125      2       4      98.4%   96.6%
5. Discussion
A real-time method for robust cut detection in uncompressed as well as compressed video has been proposed. The main features of this approach are the dual window and the single-side checking used to select probable cut positions. To achieve a robust classification, a multi-layer perceptron is applied. Our algorithm effectively avoids the false detections and missed detections caused by violent motion of the camera and/or large objects.
References
1. J.S. Boreczky and L.A. Rowe, SPIE, Vol. 2664: 170-179, 1996.
2. R.M. Ford et al., Proceedings of IEEE ICMCS, 610-611, 1997.
3. V. Kobla and D. Doermann, SPIE, Vol. 3022: 200-211, 1997.
4. H.J. Zhang et al., Multimedia Systems, Vol. 1: 10-28, 1993.
5. B.L. Yeo and B. Liu, IEEE Trans. CSVT-5: 533-544, 1995.
6. K. Fukushima, Neural Networks, Vol. 1: 119-130, 1988.
7. A. Van Ooyen and B. Nienhuis, Neural Networks, Vol. 5: 465-471, 1992.
Multi-modal Feature-Map: An Approach to Represent Digital Video Sequences1 Uma Srinivasan and Craig Lindley CSIRO Mathematical and Information Sciences Locked Bag 17, North Ryde NSW 1670, Australia Building E6B, Macquarie University Campus, North Ryde NSW Phone: 61 2 9325 3148, Fax: 61 2 93253200 {Uma.Srinivasan,Craig.Lindley}@cmis.csiro.au
Abstract. Video sequences retrieved from a database need to be presented in a compact, meaningful way in order to enable users to understand and visualise the contents presented. In this paper we propose a visual representation that exploits the multi-modal content of video sequences by representing retrieved video sequences with a set of multi-modal feature-maps arranged in a temporal order. The feature-map is a ‘collage’ represented as a visual icon that shows: the perceptual content such as a key-frame image, the cinematic content such as the type of camera work, some auditory content that represents the type of auditory information present in the sequence, temporal information that shows the duration of the sequence and its offset within the video.
1 Introduction
Video sequences retrieved from a database need to be presented in a compact, meaningful way in order to enable users to understand and visualise the contents presented. Currently, most approaches to visualisation [6,10,13,14] and presentation of video sequences deal with one modality at a time [2,3]. While the above approaches offer different ways of presenting visual summaries of videos, the basic information presented represents only the visual content of the videos. One piece of work that uses both audio and visual content is described in [7]. In this paper we propose a visual representation in which video sequences are represented by a set of multi-modal feature-maps arranged in temporal order. The feature-map is a 'collage' represented as a visual icon that shows the following: (i) perceptual content such as a key-frame image, (ii) cinematic content such as the type of camera work, (iii) some auditory content that represents the type of auditory information present in the sequence, (iv) temporal information that shows the duration of the sequence and its offset within the video, and (v) some indication of the semantic associations represented in the video sequence.
The authors wish to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program.
As each feature-map represents one video sequence, it also serves as a visual index to the retrieved video sequences. The multi-modal feature-maps arranged in temporal order serve as a browsing tool that can conserve and postpone the utilisation of the full bandwidth of limited network resources.
2 What is a Feature-Map?
A feature-map is a visual representation of the audio and visual contents of a video sequence, and its function is to facilitate the visualisation of digital content at a semantic level. The information represented in a feature-map may be available as annotated descriptions of audio and visual content, generated either manually or semi-automatically. While the idea of the feature-map is independent of the level of automation, in this paper we focus on representing those features that can be automatically detected, annotated and stored in a video database.

Model of Digital Video Features
Most models of video are based on a temporal organisation in which a video is described at various levels such as frame, shot and scene. We use such a model to obtain the key-frame and temporal components of the feature-map. However, to incorporate the feature-based components within the feature-map, we have taken a different approach: we have developed a feature model, based on the general research directions in the area, which is shown in Figure 1. This model forms the basis for identifying the features that can be represented in the feature-map.

[Fig. 1 classification tree: auditory features (speech: silence, monologue, dialogue, male voice, female voice; music: instrumental, vocal; other sounds: bird sounds, explosion, crowd noise) and visual features (camera-operation based: pan left, pan right, zoom in, zoom out, tilt; object-based: moving object, stationary object, foreground, background)]
Fig. 1. Classification of Audio and Video features in Digital Videos
Feature extraction has been studied from different perspectives. On the visual side, some groups have studied camera-motion detection [1], while others have studied object-motion-based features [5]. The audio analysis groups have largely focused on analysing the audio content of digital videos [9],[10]. Our focus here is to use existing feature extraction techniques and also to allow the model to evolve as more feature-extraction research results become available2. In order to represent the features shown in the model, we need an unambiguous way to represent and combine features that can occur at multiple levels. For example, 'silence' can occur at any of the levels, as silence may occur within a piece of music or during a conversation. The visual representation has to address this issue of granularity in a meaningful way.

Visual Representation of Features
As the feature-map is envisaged as a visual representation of the features present in a video sequence, we propose associating each feature with an icon that is a symbolic representation of that feature. The symbols should be chosen such that they are unique and do not interfere with the actual key-frame image of an object. This calls for a visual lexicon and a visual grammar that offer a systematic approach to presenting these visual icons so that they convey objective information.

Visual Lexicon
The visual lexicon has to represent multiple modalities, such as audio and visual features, in an unambiguous way. Initially we restrict the lexicon to only those features represented in the feature model shown in Figure 1; we expect the model and the lexicon to grow as it becomes possible to detect more features in digital videos. A feature-map should meet the following requirements: (i) it should summarise the automatically detected features, (ii) it should be easy to develop and learn, (iii) it should minimise redundancy, and (iv) it should be easier and quicker to understand than a textual annotation. An important criterion in designing the icons has been that it should be possible to combine them in such a way that the resulting symbols unambiguously convey the meaning of the multiple features available in the video sequence. In summary, the philosophy here is that 'a picture is worth a thousand words'. (As no industry-standard icons were available to us, we have come up with the visual symbols shown in Tables 1 and 2; the feature-map idea can be used by substituting these icons with industry-specific symbols, should they become available.) Table 1 shows some audio features represented using visual icons. The shaded symbols indicate the presence of sound, while the unshaded symbols represent silence within that feature; for example, under music, we could have some silence within a musical event. Table 2A shows the visual representation of features that represent camera motion, and Table 2B shows a visual representation of features that represent object motion.
With digital cameras and MPEG4 and MPEG7 compression schemes the list of detectable features is likely to change rapidly.
Contrasting colours show foreground and background objects. These feature-based icons form part of a feature-map and indicate the nature of the contents present in the retrieved video sequences.

Table 1. Representation of Audio Events (icons, not reproduced here, for: dialogue, monologue, explosion, music, music - male, music - female)
Table 2A. Camera-motion features (icons for: pan-left, pan-right, zoom-in, zoom-out, tilt-up, tilt-down)
Table 2B. Object-motion features (icons for: moving object, foreground object, background object)
When designing iconic systems, it is desirable to have some rules for constructing the image so that it conveys the same meaning to all users. This calls for a visual grammar that specifies some simple composition rules.

Visual Grammar
Sentences in visual languages are assemblies of pictorial objects (or icons) with spatial relationships. For a continuous medium such as video, the temporal aspects also need to be represented. For the purposes of constructing a feature-map, the visual grammar has to specify rules addressing the following: (i) the features that need to be represented in a feature-map, and what constitutes a valid feature-map; (ii)
composition rules for the layout of the feature-based icons, key-frame images and temporal information associated with the features to be represented. In order to address these issues, we have organised the grammar rules into two groups: feature selection rules and composition rules. At this stage we have only enumerated the criteria for these rules; developing a more rigorous syntax will be part of our on-going work.

Feature selection criteria
(i) Auditory and visual features specified in the query will be represented in the feature-map through the associated feature-icons.
(ii) Features available in the retrieved video sequences (i.e., pre-determined, annotated and stored in the database) will also be represented.
(iii) In case of overlapping features, only the most dominant feature will be displayed in the feature-map. (This is also because it becomes increasingly difficult to detect overlapping features, and usually only the dominant one can be detected easily.)

Composition criteria
Feature-maps have both temporal and spatial components that need to be represented. In addition, the composition rules have to accommodate the multiple modalities available in the video sequences.
(i) Each retrieved video sequence will have a representative key-frame; in the simplest case this could be the first frame of the sequence.
(ii) The spatial arrangement of the features (specified and/or extracted) will be based on their temporal ordering: if there are multiple features within a video sequence, they are presented in the order of their appearance.
(iii) Temporal information such as the start offset and the duration of the sequence will be represented.
(iv) The above three components will form part of the feature-map that represents the returned video sequence.
(v) The feature-maps will be placed in temporal order along the basic time-line of the video.

User Interaction
User interaction involves query formulation and the display of returned sequences. The user interface should support query formulation at three levels: (i) at a purely semantic level, which, from our discussions with people managing archives, is often the preferred level for a general user of a digital video library; (ii) at a combination of semantic and feature levels, which may be preferred by more informed users such as television archivists and librarians; and (iii) at the level of features, which could be useful for advanced users such as sound editors, film directors, etc. This requirement calls for an appropriate mapping between the semantic and feature levels. Figure 2 shows the proposed mapping scheme. In order to formulate a query, the general user, i.e., the first category above, would use information related to the video domain and the application domain (the two left ovals). The second category of user would use a combination of concepts and features (middle and right ovals), and the third category would use the features directly.
[Fig. 2 diagram: application domain concepts (sports game, documentary, news, main news, commercial breaks), video domain concepts (player movements, commentator's views, conversation) and content-based features (object motion, camera panning, dialogue, monologue, music, loud sounds) connected by the mapping described above]
Fig. 2. Mapping Semantics to Content-based Features

Allowing users to map out the concept-feature relationship provides us with rich semantics that are often difficult to capture and model. As the feature-map represents all the features that are related to a concept as perceived by a user, it provides a good visualisation of video content that is unique to the user specifying the query. Figure 3 shows a set of feature-maps returned by a query to retrieve sequences about the Australian election campaign. The key-frames shown in this example are the first frames of the returned sequences. The first image shows that the corresponding sequence contains a camera tilt operation, followed by a zoom-in, followed by some speech; the key frame shows the Australian symbol, and the start time and duration of the sequence are shown as part of the image. The second image shows that there is music, followed by a zoom-in, followed by speech; the image in the key-frame (the inside of a piano) indicates the associated music. The third image shows a zoom-in operation followed by some speech, which is followed by a crowd cheer and a moving object.
[Fig. 3: three feature-maps with start times and durations (00:25-02:30, 03:20-05:10, 06:30-08:15) arranged along the video time line]
Fig. 3. Mapping Semantics to Content-based Features
The picture of John Howard in the key frame, combined with the features, gives a reasonable understanding of the contents of that sequence.

Feature-Map Construction
The framework we have developed as part of the FRAMES [13] project provides a supporting environment for conducting experiments in generating feature-maps as described in this paper. Figure 4 shows the query processor component of that framework.

[Fig. 4 architecture: a client front-end application sends queries to the Query Processor, which uses the concept-feature associations and the application schema on the Database Server to issue SQL queries; the Feature Map Builder collates key-frame images, auditory and visual features and temporal offsets; the Display Manager combines the resulting feature-maps with the video sequence references and streams delivered by the Video Server]
Fig. 4. Architecture to generate multi-modal feature-maps

Information about specific audio and visual events of interest is stored in the database. We have developed our own algorithms to detect audio [10] and video events [12]. The Query Processor translates queries specified at a semantic level into SQL queries that relate to video objects and associated features characterised by their temporal and feature-based attributes. The Feature-map Builder collates the auditory and visual features by associating the query and the returned results with appropriate visual images, generating a set of multi-modal feature-maps arranged in temporal order. The Display Manager links each feature-map with the appropriate video sequence delivered by the video server, and presents a user interface with feature-maps arranged in temporal order to facilitate browsing through the video sequences retrieved for a query condition. An interesting extension to visualising video content would be to enable the play-back of only the mode chosen from the feature-map: if the auditory symbol in the feature-map is clicked, only the audio content would be played back. This aspect needs further investigation and will form part of our on-going research activity.
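To make the builder step concrete, the sketch below collates annotated events into per-sequence feature-maps following the feature selection and composition criteria of Section 2 (temporal ordering, dominant feature for overlaps, start offset and duration). The record layout, field names and event format are our own assumptions and are not the FRAMES schema.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureMap:
    """One retrieved video sequence summarised as a multi-modal feature-map."""
    key_frame: str                      # e.g. path or id of the first frame
    start: float                        # offset within the video (seconds)
    duration: float
    icons: list = field(default_factory=list)   # feature icons in temporal order

def build_feature_maps(sequences, events):
    """`sequences`: [(seq_id, key_frame, start, duration)];
    `events`: [(seq_id, time, feature, dominance)] as returned by the SQL queries."""
    maps = {}
    for seq_id, key_frame, start, duration in sequences:
        maps[seq_id] = FeatureMap(key_frame, start, duration)
    for seq_id, time, feature, dominance in sorted(events, key=lambda e: e[1]):
        fm = maps.get(seq_id)
        if fm is None:
            continue
        # Overlapping events at the same time: keep only the most dominant feature.
        if fm.icons and abs(fm.icons[-1][0] - time) < 1e-6:
            if dominance > fm.icons[-1][2]:
                fm.icons[-1] = (time, feature, dominance)
        else:
            fm.icons.append((time, feature, dominance))
    # Feature-maps are presented along the video time line, i.e. by start offset.
    return sorted(maps.values(), key=lambda fm: fm.start)
```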
3 Conclusion
The feature-map presented in this paper enables us to represent some important audio and visual information available in video sequences. The feature-maps provide a form of temporal compression where the loss of information involved is affordable with respect to the function of the feature-map. In the context of presenting information from a digital library of videos, such a representation offers a compact pictorial summary at a semantic level rather than at a purely perceptual level such as displaying colour histograms or audio wave patterns.
References
1. Aigrain, P.; Zhang, H., and Petkovic, D. Content-Based Representation and Retrieval of Visual Media. Multimedia Tools and Applications. 1996a; 3:179-202.
2. Arman; Depommier; Hsu, and Chiu. Content-based Browsing of Video Sequences. Proceedings of ACM International Conference on Multimedia '94; 1994; California. ACM; 1994.
3. Bolle, R.; Yeo, B., and Yeung, M. Video Query, Beyond the Keywords. IBM Research Report. 1996 Oct.
4. Bolle, Rudd M.; Yeo, Boon-Lock, and Yeung, Minerva M. Video Query and Retrieval. 1997; 13-23.
5. Chang, S. F.; Chen, W.; Meng, H. J.; Sundaram, H., and Zhong, D. A fully automated content-based video search engine supporting spatiotemporal queries. IEEE Transactions on Circuits and Systems for Video Technology. 1998 Sep; 8(5):602-615.
6. Jain, R. (editor). Communications of the ACM. ACM. Vol. 40, 1997.
7. Lienhart, Rainer; Pfeiffer, Silvia, and Effelsberg, Wolfgang. Video Abstracting. Communications of the ACM. 1997 Dec; 40(12).
8. Pfeiffer, S.; Fischer, S., and Effelsberg, W. Automatic Audio Content Analysis. Proceedings of ACM Multimedia '94; 1994; Boston.
9. Samouelian, A.; Robert-Ribes, J., and Plumpe, M. Speech, Silence, Music and Noise Classification of TV Broadcast Material. Proc. 5th International Conference on Spoken Language Processing; 1998 Dec; Sydney.
10. Smoliar, S. W. and Zhang, H. J. Content-based video indexing and retrieval. IEEE Multimedia. 1994 Summer; 343-350.
11. Srinivasan, U.; Gu, L.; Tsui, K., and Simpson-Young, W. G. A Data Model to support Content-based Search on Digital Video Libraries. Australian Computer Journal. 1997 Nov; 29(4):141-147.
12. Srinivasan, U.; Lindley, C., and Simpson-Young, W. G. A Multi-Model Framework for Video Information Systems. Semantic Issues in Multimedia Systems, Kluwer Academic Publishers, 85-107.
13. Taniguchi, Y.; Akutsu, A., and Tonomura, Y. PanoramaExcerpts: Extracting and Packaging Panoramas for Video Browsing. Proc. ACM Multimedia 97; Seattle. 1997 Nov.
14. Yeung, M. M. and Yeo, B. L. Video visualization for compact presentation and fast browsing of pictorial content. IEEE Transactions on Circuits and Systems for Video Technology. 1997 Oct; 7(5):771-785.
Robust Tracking of Video Objects through Topological Constraint on Homogeneous Motion Ming Liao, Yi Li, Songde Ma, and Hanqing Lu National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences P.O. Box 2728, Beijing 100080, P.R. China Tel. 86-10-62542971, Fax 86-10-62551993 [email protected]
Abstract. Considering the currently available methods for the motion analysis of video objects, we notice that the topological constraint imposed by homogeneous motion is usually ignored in piecewise methods, or improperly imposed through blocks that have no physical correspondence. In this paper we develop the idea of area-based parametric motion estimation with a spatial constraint involved, so that the semantic segmentation and tracking of non-rigid objects can be undertaken in an interactive environment, which is the central demand of applications such as MPEG-4/7 or content-based video retrieval. The estimation of global motion and occlusion can also be computed through the tracking of background areas. Moreover, based on the proposed hierarchical robust framework, accurate motion parameters between corresponding areas can be obtained and the computational efficiency is improved remarkably.
1. Introduction
The semantic description of object motion in video sequences has been a continuous research topic of motion analysis, and is becoming a hot point of current research [1]. This is mainly because of its potential application background, such as MPEG-4/7 or content-based video retrieval, which will bring considerable market profits. Nevertheless, the desired target is still beyond current technical ability: the various appearances of the objects and of the environment seem difficult to describe unambiguously by a formal method, so that totally automatic segmentation is impossible. In practice, interaction is inevitable, and motion segmentation and tracking of objects becomes the central problem. Current research efforts on this topic [2,6] are generally based on the assumption that homogeneous motion exists in the areas of the object appearance, so that local parametric models such as the affine model or the perspective model can be applied, and the problem is converted to the spatial clustering of a parametric motion field. Since the original motion field is basically derived from an intensity-based optical flow field [7] or from patch-based homogeneous motion analysis [6], some robust estimation methods [8] are implemented to resolve the fundamental problems of optical flow computation, such as the boundary problem. In short, by trying to describe the object motion from
local details, most of these approaches work in a bottom-up style, which has the following shortcomings:
1) The computation of homogeneous motion patches is very sensitive to the parametric motion clustering on the optical flow field; as a result, trivial patches may be derived, and a semantic description of the object motion becomes impossible. The main reason for this problem is that in the motion of physical objects not only data conservation but also topological homogeneity, such as connectivity, continuity and neighbourhood, is maintained, while no spatial topological constraint is imposed on the clustering in general approaches. This is also the case for some top-down methods [9].
2) Although robust methods can be applied, intensity-based optical flow computation is still unstable with respect to noise, boundaries, large displacements, etc.
3) Global motion, i.e., the motion of the camera, if not considered, will affect the local motion and violate the assumption of homogeneous motion; on the other hand, intensity-based estimation of the global motion is itself unstable and sensitive to local motion.
4) The efficiency of piecewise computation of the motion field is another problem. To resolve it, block-based methods [6] have been proposed; however, these blocks do not correspond to any physical areas of the image surface, which is harmful to the precision.
To deal with these problems, we propose a hierarchical robust framework in which perspective-motion-based multi-scale splitting, merging and compensation of spatial areas are applied to impose a topological constraint on the computation. The reason for selecting the perspective model is that, for top-down analysis, large areas may be considered, for which the depth-invariance assumption of the affine model cannot be satisfied; this is not the case for bottom-up analysis. The central idea of our method is that the motion of an object, especially its parametric representation, is smooth up to a precision threshold and sampling frequency, so that it is possible to predict and track the integral motion of objects if a comparatively accurate description is obtained by interaction at the first step. Violations of the prediction can be re-computed top-down through motion-based multi-scale area splitting, merging and compensation, so that multiple motions and occlusion can also be coped with, while both precision and efficiency are ensured. After prediction, the large-motion component is already available in the optical flow computation and only the residual needs to be calculated, so the small-motion requirement of the parametric optimization is satisfied. Finally, the estimation of the global motion is undertaken by background area tracking and motion clustering. This paper is organized as follows: Section 2 discusses the area-based robust estimation of perspective motion and its Kalman prediction. Section 3 discusses the region-growing-based watershed transformation on multiple features, as well as the established scale space. Section 4 presents our hierarchical segmentation framework for motion-based area splitting, merging and compensation. Section 5 shows some experimental results and Section 6 concludes the paper.
2. Area-Based Perspective Motion: Robust Estimation and Prediction
For a rigid area A moving along a plane far from the camera, its motion can be represented by the perspective model with θ as the motion parameter, i.e.,

\[ u(p(x, y) \in A, \theta) = \begin{pmatrix} x & y & 1 & 0 & 0 & 0 & x^2 & xy \\ 0 & 0 & 0 & x & y & 1 & xy & y^2 \end{pmatrix} \theta = M\theta \qquad (1) \]
It can be shown that, after this transformation, topological attributes of areas such as continuity, connectivity and neighbourhood are preserved. The motion parameter θ of area A between the image pair It and It+1 can be estimated by
\[ \arg\min_{\theta} \sum_{\substack{p(x,y) \in A \\ u(p,\theta) \in I_{t+1}}} E\big( I_t(p) - I_{t+1}(u(p, \theta)) \big) \qquad (2) \]

where E is an energy function. The classical least-squares (LST) energy function is based on a normal distribution assumption for the data points, which is not the general case in motion estimation. Although, thanks to the convexity of LST, the problem can be resolved efficiently by continuation methods such as SOR (simultaneous over-relaxation) or the Gauss-Newton method, it is very sensitive to outliers, i.e., data points that should not be considered. This is because data points far from the true solution, which cannot be classified as outliers at the beginning of the computation, contribute much more than correct data points to the energy. To resolve this problem, robust estimators such as the truncated quadratic, Geman & McClure or Lorentzian have been proposed [8]. However, as these estimators are not convex, GNC (deterministic graduated non-convexity) or stochastic optimization such as simulated annealing or genetic algorithms have to be applied, which loses the efficiency and is very sensitive to the initial solution. A convex robust estimator is therefore important. In our method the convex estimator is improved from the one in [10], originally defined as (see Fig. 1(a))
\[ \rho_\sigma(\eta) = \begin{cases} \eta^2 / \sigma^2 & |\eta| \le \sigma \\ |\eta| / \sigma - 1 & \text{otherwise} \end{cases} \qquad (3) \]
When implemented, estimator (3) exhibits a second zero point of the energy. This zero point, although it does not exist theoretically, does appear in the computation and brings much trouble. The infinite energy assigned to outliers at infinite distance is another problem. To overcome both problems while preserving convexity, we modify the estimator as (see Fig. 1(b))
\[ \rho_\sigma(\eta) = \begin{cases} \eta^2 / \sigma^2 & |\eta| \le \sigma \\ 2 - \left(\tfrac{1}{2}\right)^{n-1} + \big(|\eta| - n\sigma\big) \left(\tfrac{1}{2}\right)^{n} / \sigma & n\sigma < |\eta| \le (n+1)\sigma, \; n \in \mathbb{Z}^+ \end{cases} \qquad (4) \]
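A direct transcription of the modified estimator (4) is sketched below, mainly to make its piecewise definition and bounded limit explicit; treating it as a plain Python function, outside the SOR machinery of the paper, is our own simplification.

```python
def rho(eta, sigma=1.0):
    """Robust estimator of Eq. (4): quadratic inside [-sigma, sigma], then
    piecewise linear segments whose values approach 2 as |eta| grows."""
    a = abs(eta)
    if a <= sigma:
        return (a / sigma) ** 2
    n = int(a // sigma)                      # segment index, n >= 1
    return 2.0 - 0.5 ** (n - 1) + (a - n * sigma) * (0.5 ** n) / sigma

# The estimator saturates: rho(0.5) = 0.25, rho(1.0) = 1.0, rho(10.0) ~ 1.998.
for x in (0.5, 1.0, 2.0, 10.0):
    print(x, round(rho(x), 3))
```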
Fig. 1(a). Robust Estimator (3) and its Gradient.
Fig. 1(b). Robust Estimator (4) and its Gradient.
Since estimator (4) converges to 2 at infinity and remains convex, (2) can be resolved by a continuation method; nevertheless, (2) is still not a convex problem. This is because the accumulation area A(θ) is θ-related. For an infinite perspective plane, A(θ) is the whole plane and (2) is convex; but for a bounded area [M, N], its convexity is determined by the two intensity surfaces involved. This is also the case for the standard optical flow equation. As a result, simple SOR can only obtain the local optimum nearest to the zero point. This is why the assumption of small and locally homogeneous motion must be imposed, and why the global optimum can be found only if the initial solution of the iteration is properly given. This initial solution can be predicted by a traditional estimation method such as a Kalman motion filter. Generally, zooming does not occur often and, as the planar assumption requires the distance of the object from the camera to be very large, variations in depth can also be ignored. Therefore only three components need to be estimated, i.e., the rotation angle α and the two translations dx, dy along the axes, which can be assumed independent of each other. For an area At ∈ It with motion parameter
θ = (a1, a2, a3, a4, a5, a6, a7, a8)T, we have

\[ \begin{pmatrix} \alpha \\ dx \\ dy \end{pmatrix} \approx \begin{pmatrix} \arctan\!\left( \dfrac{a_2}{2 a_1} - \dfrac{a_4}{2 a_5} \right) \\ a_3 \\ a_6 \end{pmatrix} \qquad (5) \]
After applying a linear estimation filter Γ such as the α-β-λ filter [11], let the predicted values be (α′, dx′, dy′)T; the predicted θ′ between It and It+1 can then be easily calculated. Using θ′ as the initial solution of (2), the SOR algorithm generally converges to the global, or at least a reasonable, solution.
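The following sketch ties Eqs. (1), (2) and (4) together for one area: pixels of A are mapped with the perspective model Mθ and a robust residual energy is accumulated, starting the iteration from a predicted parameter vector. Plain gradient descent on numerical gradients stands in for the SOR continuation scheme of the paper, and the nearest-neighbour sampling, step size and iteration count are our own assumptions.

```python
import numpy as np

def rho(eta, sigma):
    # Robust estimator of Eq. (4), duplicated here for self-containment.
    a = abs(eta)
    if a <= sigma:
        return (a / sigma) ** 2
    n = int(a // sigma)
    return 2.0 - 0.5 ** (n - 1) + (a - n * sigma) * (0.5 ** n) / sigma

def warp_point(x, y, theta):
    """Perspective transform of Eq. (1): new position u = M(x, y) theta."""
    M = np.array([[x, y, 1, 0, 0, 0, x * x, x * y],
                  [0, 0, 0, x, y, 1, x * y, y * y]], dtype=float)
    return M @ theta

def area_energy(I_t, I_t1, area, theta, sigma=10.0):
    """Robust energy of Eq. (2) over the pixels of `area` (list of (x, y))."""
    H, W = I_t1.shape
    total = 0.0
    for x, y in area:
        ux, uy = warp_point(x, y, theta)
        xi, yi = int(round(ux)), int(round(uy))   # nearest-neighbour sampling
        if 0 <= xi < W and 0 <= yi < H:
            total += rho(float(I_t[y, x]) - float(I_t1[yi, xi]), sigma)
    return total

def estimate_theta(I_t, I_t1, area, theta0, steps=50, lr=1e-6, eps=1e-3):
    """Minimise the robust energy starting from the predicted theta0."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros(8)
        for i in range(8):                        # numerical gradient, per parameter
            d = np.zeros(8); d[i] = eps
            grad[i] = (area_energy(I_t, I_t1, area, theta + d)
                       - area_energy(I_t, I_t1, area, theta - d)) / (2 * eps)
        theta -= lr * grad
    return theta
```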
3. Region Growing Based Watershed Transformation on Multiple Features
The intensity-based watershed transformation [12] is extensively used for image segmentation as an area feature extractor, and fast algorithms have been proposed. To improve the correspondence between the watershed areas and the physical structure of the image, other quantifiable features such as gradient, texture, etc., are introduced [4,13]. Furthermore, when considering motion analysis, motion parameters can also be involved [14]. In this case the region-growing process of catchment basin computation is based on an area similarity which linearly combines all these features, i.e.,

\[ Sim(A_1, A_2) = \sum_{i=1}^{N} \alpha_i \cdot dist_i\big( \Gamma_i(A_1), \Gamma_i(A_2) \big) \qquad (6) \]
where Γi is a feature filter, disti is the distance operator of the respective dimension and αi is the corresponding weighting coefficient. Conjugate areas A1 and A2 are merged if their similarity is above a threshold T. For computational purposes, a multi-scale framework is desirable. To achieve this, the merging threshold T is designed up to a scale factor, i.e., T is determined by the so-called scale of region growing, and a larger scale yields a larger merging threshold. In this way fewer areas survive at larger scales and a strict scale space without any boundary deviation is established. In our experiments, the primary area partition is based on the morphological gradient. In the following region-growing steps, besides the mean and variance of the intensity, the positions of the mass centres of conjugate primary areas are taken into account. When performing motion-based area merging, as described in the next section, the distance between the perspective motion parameters of conjugate areas is also considered. As explained previously, totally automatic segmentation of objects from images is beyond the currently available technology, so interactive marking of objects is necessary. We apply top-down marking on the segmentation produced by the multiple-features-based watershed transformation: objects are marked first at large scales, and the remaining parts are marked at smaller scales, until a mask image of sufficient precision is obtained. After marking, a hybrid-scale representation of the objects is obtained, which is the basis of the object tracking. The first prediction of the motion of the object mask is set to zero for the α-β-λ filter.
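A minimal sketch of the similarity test of Eq. (6) and the scale-dependent merging decision follows. The particular features (mean and variance of intensity, mass centre), the distance functions, the weights and the way the threshold grows with scale are simplified assumptions; since Eq. (6) is computed here as a weighted sum of distances, areas merge when it falls below the scale-dependent threshold, which is consistent with fewer areas surviving at larger scales.

```python
import numpy as np

def area_features(pixels, intensities):
    """Per-area features: mean intensity, intensity variance, mass centre."""
    pts = np.asarray(pixels, dtype=float)
    vals = np.asarray(intensities, dtype=float)
    return [vals.mean(), vals.var(), pts.mean(axis=0)]

def similarity(f1, f2, weights=(1.0, 0.5, 0.1)):
    """Eq. (6): weighted sum of per-feature distances between two areas."""
    dists = [abs(f1[0] - f2[0]),
             abs(f1[1] - f2[1]),
             float(np.linalg.norm(f1[2] - f2[2]))]
    return sum(w * d for w, d in zip(weights, dists))

def should_merge(f1, f2, scale, base_threshold=10.0):
    """Conjugate areas merge when their distance-based similarity measure stays
    below a scale-dependent threshold: larger scales tolerate larger distances."""
    return similarity(f1, f2) < base_threshold * scale
```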
4. Hierarchical Segmentation of Areas with Homogeneous Motion

Our framework for hierarchical segmentation of areas with homogeneous motion is composed of three stages: splitting, merging, and compensation of uncertain areas. Fig. 2 illustrates the framework. In the following we discuss the stages in turn.
[Fig. 2 flow chart, boxes: Image 1 … Image(n); Multiple Features Based Region Growing; Areas in Scale 1 … Scale(n); Interactive Marking; Objects Mask(1); Prediction; Homogeneous Motion Estimation; Global Motion; Splitting, Merging and Compensation (with Thresholds 1–3); Validate; Objects Mask(2); outputs fed forward for the next image.]
Fig. 2. Illustration of our hierarchical framework

For a connected object mask, or a connected background area, if homogeneous motion is assumed and a possible motion is predicted, the perspective parameters can be calculated robustly according to Section 2. If the homogeneous motion assumption is not correct, an energy larger than a threshold will result from the computation. In that case the area should be split. Since these connected areas are composed of catchment basins at hybrid scales, as described in Section 3, the splitting can be accomplished by reducing the maximum scale of the component subareas. This splitting process is continued until the required energy threshold is satisfied. After the splitting stage, the initial motion of each object area, as well as the global motion, i.e., the dominant motion of the unmarked background areas, is obtained, and the motion field of the whole image is estimated. Notice that motion discontinuity at boundaries is implicitly resolved, which is difficult for piecewise methods. At this point area merging can be performed by the multiple-features-based region growing, with motion included in the calculation of area similarity as described in Section 3, and larger areas with re-computed homogeneous motion are derived. The similarity threshold is fixed to a global value, and the motion of the multiple objects as well as the global motion is determined in It. Since conjugate areas may have different motion, the warped image of It under the obtained motion field may not cover the whole area of It+1. The uncovered areas are called uncertain areas. Their motion is determined by the subsequent compensation stage. Simply put, for each uncertain area we try to combine it with one of its conjugate certain areas. The combination is determined by drawback analysis, i.e., if an uncertain area is assigned the motion of one of its
conjugate certain areas, its correspondent area in It can be determined by inverse warping with the assigned motion, and the similarity can be calculated. The motion that produces the maximum similarity above a specified threshold is finally assigned to the uncertain area in question. After iterative compensation and combination, every certain area grows to a limit. An uncertain area whose maximum similarity is still below the threshold is classified as a newly appeared area, meaning it was occluded in It. In this way occlusion is resolved. Its initial motion is assigned the global motion. This idea was first proposed in [14]; in our work it operates on areas rather than on points. After these three stages, the final motion of the background and of each object is obtained and input to the α − β − λ filter to predict their motion in It+2. Because our framework works entirely on connected and conjugate areas, and in a top-down manner, the topological constraint is implicitly imposed and violations of topological integrity are avoided; as a result, efficiency and reliability are remarkably improved.
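The compensation stage can be sketched as below; every helper callable (conjugate-area lookup, motion lookup, inverse warping, similarity) is assumed to be provided by the surrounding system, and the names are illustrative only:

    def compensate_uncertain_areas(uncertain_areas, conjugates, motion_of,
                                   inverse_warp, similarity, threshold, global_motion):
        # conjugates(a)      -> certain areas adjacent to the uncertain area a
        # motion_of(c)       -> motion parameters of the certain area c
        # inverse_warp(a, m) -> correspondent region of a in I_t under motion m
        # similarity(a, r)   -> similarity between a (in I_{t+1}) and region r (in I_t)
        assigned = {}
        for a in uncertain_areas:
            best_sim, best_motion = float("-inf"), None
            for c in conjugates(a):
                m = motion_of(c)
                s = similarity(a, inverse_warp(a, m))
                if s > best_sim:
                    best_sim, best_motion = s, m
            if best_motion is not None and best_sim >= threshold:
                assigned[a] = best_motion      # combined with a conjugate certain area
            else:
                assigned[a] = global_motion    # newly appeared area, occluded in I_t
        return assigned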
Fig. 3. Sequence taken in our lab: (a) previous frame; (b) current frame; (c) next frame; (d) spatio-temporally segmented and tracked result of the previous frame, where different gray levels denote regions with different motion; (e) spatially segmented result of the current frame: since the assumption of unified motion in the arm region is violated, the arm region is split into three regions at a finer scale; (f) spatio-temporally segmented result of the current frame after region merging; (g) spatio-temporally segmented and tracked result of the next frame.
5. Experimental Results

In this section, experimental results on two sequences are given. One is a sequence taken in our lab; the other is the Foreman sequence. Fig. 3 shows an example of region splitting and merging. From the previous frame to the current frame, the arm region undergoes a unified motion and can be tracked as a whole. From the current frame to the next frame, however, the arm regions undergo two different motions, so our assumption of unified motion in the arm region is violated and we have to split it, as seen in (e). In (f) we give the result after region merging. For the Foreman sequence, we give our tracking results for frames 181, 184, and 187. Note that there is large camera motion; in our method, region tracking and camera motion estimation are handled within a unified scheme. In spite of the large displacement between frames, our method can still track the foreman well.
Fig. 4. Foreman sequence: (a) frame 181; (b) frame 184; (c) frame 187; (e) tracked result of frame 181; (f) tracked result of frame 184; (g) tracked result of frame 187.
6. Conclusion

Through the proposed hierarchical robust framework, the homogeneous motion field is analyzed in units of areas, from large scales to small, so that topological constraints are implicitly enforced in the motion-based region growing. A splitting stage, followed by merging and compensation stages, is applied in the computation so that a top-down analysis is achieved. Compared with general piecewise or patch-based bottom-up methods, our method is more reliable and efficient. Since the performance of the framework relies heavily on the quality of the area partition, the linear combination used for the area similarity, i.e., equation (6), is not fully satisfactory. More principled measures and new features such as color information can also be introduced. Optimal region growing as a symbolic problem also needs further consideration. These are directions for future work.
References
[1] P. Correia and F. Pereira, "The role of analysis in content-based video coding and indexing", Signal Processing, special issue on video sequence segmentation for content-based processing and manipulation, 66(2), April 1998.
[2] F. Marqués and C. Molina, "An object tracking technique for content-based functionalities", SPIE Visual Communication and Image Processing (VCIP-97), volume 3024, pp. 190-198, San Jose, USA, 1997.
[3] F. Marqués, B. Marcotegui and F. Meyer, "Tracking areas of interest for content-based functionalities in segmentation-based coding schemes", Proc. ICASSP'96, volume II, pp. 1224-1227, Atlanta (GA), USA, May 1996.
[4] F. Marqués, "Temporal stability in sequence segmentation using the watershed algorithm", in P. Maragos, R. Schafer and M. Butt, editors, Mathematical Morphology and its Applications to Image and Signal Processing, pp. 321-328, Atlanta (GA), USA, May 1996, Kluwer Academic Press.
[5] D. Zhong and S.-F. Chang, "Spatio-Temporal Video Search Using the Object Based Video Representation", IEEE Intern. Conf. on Image Processing, invited talk, special session on video technology, Santa Barbara, Oct. 1997.
[6] D. Zhong and S.-F. Chang, "Video Object Model and Segmentation for Content-Based Video Indexing", IEEE Intern. Conf. on Circuits and Systems, June 1997, Hong Kong (special session on Networked Multimedia Technology & Application).
[7] L. Bergen and F. Meyer, "Motion Segmentation and Depth Ordering Based on Morphological Segmentation", Proc. ECCV, pp. 531-547, 1998.
[8] M. J. Black and P. Anandan, "The Robust Estimation of Multiple Motions: Parametric and Piecewise-Smooth Flow Fields", Computer Vision and Image Understanding, 63(1), 75-103, 1996.
[9] J. R. Bergen, P. J. Burt, R. Hingorani and S. Peleg, "Computing two motions from three frames", Proc. ICCV, pp. 27-32, December 1990.
[10] P. Huber, Robust Statistics, Wiley, 1981.
[11] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association, Academic Press, Inc., 1988.
[12] L. Vincent and P. Soille, "Watersheds in Digital Space: An Efficient Algorithm Based on Immersion Simulation", IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6), 583-598, 1991.
[13] M. Pardas and P. Salembier, "3D morphological segmentation and motion estimation for image sequences", EURASIP Signal Processing, 38(1):31-43, 1994.
[14] J. G. Choi, S.-W. Lee and S.-D. Kim, "Video Segmentation Based on Spatial and Temporal Information", Proc. ICASSP'97, pp. 2661-2664, 1997.
The Spatial Spreadsheet
Glenn S. Iwerks (1) and Hanan Samet (2)
(1) Computer Science Department, University of Maryland, College Park, Maryland 20742, [email protected]
(2) Computer Science Department, Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742, [email protected]
Abstract. The power of the spreadsheet can be combined with that of the spatial database to provide a system that is flexible, powerful and easy to use. In this paper we propose the Spatial Spreadsheet as a means to organize large amounts of spatial data, to quickly formulate queries on that data, and to propagate changes in the source data to query results on a large scale. Such a system can be used to organize related queries that not only convey the results of individual queries but also serve as a means of visual comparison of query results. Keywords: spreadsheets, spatial databases, visualization
1 Introduction
In this paper we introduce the Spatial Spreadsheet. The purpose of the Spatial Spreadsheet is to combine the power of a spatial database with that of the spreadsheet. The advantages of a spreadsheet are the ability to organize data, to formulate operations on that data quickly through the use of row and column operations, and to propagate changes in the data through the system. The Spatial Spreadsheet consists of a 2D array of cells containing data. Updates can propagate through the array via cell operations. Operations can be single cell operations, row operations, or column operations. Column operations iterate over rows in a column and row operations iterate over columns in a row. Cell values can be instantiated by the user or can be the result of operations performed on other cells. In the classic spreadsheet paradigm, cell values are primitive data types such as numbers and strings, whereas in the Spatial Spreadsheet, cells access database relations. The relation is part of a spatial relational database. A relation is a table of related attributes. A tuple in a relation is one instance of these related items. Each table is made up of a set of tuples [9]. Attributes in a spatial database relation can be primitive types such as numbers and strings or spatial data types such as points, lines and polygons.
The support of the National Science Foundation under Grant IRI-97-12715 is gratefully acknowledged.
Spatial attributes stored in the relations associated with each cell can be displayed graphically for visualization of query results. This allows the effects of updates on the base input relations to be observed through the graphical display when changes occur. The rest of this paper is organized as follows. Section 2 gives some background on spreadsheets and spatial databases. Section 3 describes the Spatial Spreadsheet. Section 4 provides some implementation details. Section 5 draws some concluding remarks as well as gives some directions for future research.
2 Background

2.1 The Classic Spreadsheet
The classic spreadsheet was designed as an accounting tool. It permitted the user to quickly formulate calculations on the data through column and row operations. It also allowed the user to easily observe how changes in the input data affected a whole series of calculations. The original spreadsheet was laid out in a two-dimensional array of cells in rows and columns. Users could populate the rows and columns with numeric data. They could then perform operations on entire columns (or rows) and populate additional columns with the results.

2.2 Spreadsheets for Images
Spreadsheets for Images (SI) is an application of the concept of a spreadsheet to the image processing domain [6]. In this case, the concept of a spreadsheet is used as a means of data visualization. Each cell in the spreadsheet contains graphical objects such as images and movies. Formulas for processing data can be assigned to cells. These formulas can use the contents of other cells as inputs. This ties the processing of data in the cells together. When a cell is modified, other cells that use it as input are updated. A somewhat related capability is provided by the CANTATA programming language to be used with the KHOROS system [8].

2.3 SAND Browser
The SAND Browser is a front end for the SAND [2] spatial relational database. The user need only point and click on a map image to input the spatial data used in the processing of query primitives. The results of the queries are then displayed graphically. This gives the user an intuitive interface to the database to help the visualization of the data and the derivation of additional information from it. However, such a system does have limitations. In the SAND Browser, one primitive operation is processed at a time. When the user wants to make a new query, the results of the previous operation are lost unless they are saved explicitly in a new relation. As a result, there is no simple and implicit way to generate more complicated queries from the primitives. In presenting the Spatial Spreadsheet we propose some possible solutions to these limitations of the SAND Browser while still maintaining its ease of use and intuitive nature.
Figure 1: Example query results in top-level window
3 The Spatial Spreadsheet
The Spatial Spreadsheet is a front end to a spatial database. A spatial database is a database in which spatial attributes can be stored. Attributes of a spatial relational database may correspond to spatial and non-spatial data. For example, spatial data types may consist of points, lines, and polygons. Numbers and character strings are examples of non-spatial data. By mapping the world coordinates of the spatial data to a bitmap it may be converted to an image for visualization of the data. The Spatial Spreadsheet provides a means to organize the relational data and query results in a manner that is intuitively meaningful to the user. One may apply meaning to a column, a row, or an entire set of columns or rows to organize data. For example, spatio-temporal data may be organized so that each row corresponds to a different time period and each column corresponds to a different region in the world.
The Spatial Spreadsheet is made up of a 2D array of cells. Each cell in the spreadsheet can be referenced by the cell's location (row, column). In the Spatial Spreadsheet, each cell represents a relation. A cell can contain two types of relations: a persistent relation or a query result. A persistent relation is a relation that exists in a permanent state. This is not to say that the data in the relation does not change, but rather that the relation existed before the spreadsheet was invoked and will continue to exist after the spreadsheet exits unless explicitly deleted by the user. The second type of relation contains the result of a query posed by the user. The user decides whether a query result will persist or not. The user can pose simple queries. Simple queries are primitive operations. Some examples of primitive operations are selection, projection, join, spatial join [5], window [1], nearest neighbor [4], etc. Primitive operations are composed to create complex queries.

3.1 Example

Let us consider a simple example (see Figure 1). Suppose that we are concerned about flooding in 3 different regions of the world: A, B and C. Roads close to rivers may get washed out when the rivers flood. We want to know what roads in these regions are close to a river at or near flood stage. For each of these regions we have a relation containing all the rivers at or near flood stage. We open these river relations in the first column of our spreadsheet (i.e., column 0). We let row 0 correspond to region A, row 1 to region B, and row 2 to region C. We open relations in column 1 that store position information for roads in each region. Our column operation is to find all the roads in cells in column 1 that are within 500 meters of a river in the cell in column 0 of the same row and store the result in column 2. In a modified version of SQL [9] the query might look as follows.

SELECT *
FROM Cell(X,0), Cell(X,1), distance(Cell(X,0).river, Cell(X,1).road) d
WHERE d < 500

The modification to SQL introduced here is the Cell() function. (SQL is not actually used in the Spatial Spreadsheet system; it is used here only for example purposes.) Instead of giving an explicit relation name in the FROM clause, we introduce the Cell() function that takes a row and a column value and returns a relation. The presence of the variable X for the row parameter tells the system to iterate over all open relations in the given columns. The operation producing the result in column 2 is an example of a column operation. Similarly, one can iterate over all the columns in a row using a row operation. One can also perform single cell operations.
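To make the Cell() iteration concrete, the toy sketch below (which is not the actual SAND/iTcl implementation; relation contents and attribute names are invented) models cells as lists of records and a column operation as a per-row distance join:

    from math import hypot

    class ToySpreadsheet:
        """Illustrative model of the Cell(row, column) addressing used above.
        Each cell holds a relation, represented here as a list of records."""
        def __init__(self, n_rows, n_cols):
            self.cells = {(r, c): [] for r in range(n_rows) for c in range(n_cols)}

        def cell(self, row, col):
            return self.cells[(row, col)]

        def column_op(self, col_a, col_b, col_out, op):
            # apply `op` row by row (the role of the variable X in the modified
            # SQL), storing the result relation in column col_out of the same row
            for r in {row for (row, _) in self.cells}:
                self.cells[(r, col_out)] = op(self.cell(r, col_a), self.cell(r, col_b))

    def roads_near_rivers(rivers, roads, max_dist=500.0):
        # distance join: keep every (river, road) pair closer than max_dist metres
        return [(river, road, hypot(river["x"] - road["x"], river["y"] - road["y"]))
                for river in rivers for road in roads
                if hypot(river["x"] - road["x"], river["y"] - road["y"]) < max_dist]

    sheet = ToySpreadsheet(3, 3)                 # rows 0-2 stand for regions A, B, C
    sheet.cells[(1, 0)] = [{"name": "Big River", "x": 0.0, "y": 0.0}]
    sheet.cells[(1, 1)] = [{"name": "Route 9", "x": 120.0, "y": 300.0}]
    sheet.column_op(0, 1, 2, roads_near_rivers)  # rivers in column 0, roads in column 1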
3.2 Design
The design of the Spatial Spreadsheet is object-oriented. Figure 2 shows the basic object model of the Spatial Spreadsheet in UML notation [3]. The figure shows
six class objects: Spreadsheet, Cell, Display, Relation, Query and Processor. It is important to note the distinction between a Cell object and what has previously been referred to as a cell. A cell is an element in the spreadsheet array. A Cell object is a class object named "Cell" used in the design and underlying implementation of the spreadsheet. Likewise, a Relation object is the class object named "Relation", not to be confused with a relation in the relational database. In the remainder of this paper we distinguish object names by using an italic font. When the Spatial Spreadsheet is started, an instance of the Spreadsheet object is created. This is the top-level object and acts as the root aggregator of all other objects. The primary responsibility of the Spreadsheet object is to keep track of Cell objects, global states, and the organization of cells in the top-level window of the graphical user interface. A Spreadsheet object can have one or more Cell objects. Query objects and Relation objects are Cell objects, that is, they are derived from Cell objects. An instance of a Cell object is created when a persistent relation is opened or a cell is needed to process and store a primitive operation. Cell objects have member data items to keep track of and manipulate their own relation. Cell objects can be associated with other Cell objects. Query objects derived from Cell objects use these associations to keep track of which other Cell objects they use as input. All Cell objects use these associations to keep track of which Query objects use them as input. This becomes important for update propagation. Each Cell object has a Display object. The Display object's role is to display data from the relation for the user. Display objects can display information for the user in several ways, including a metadata display, a tuple-by-tuple display of raw data, and a graphical display for spatial data types. In the graphical display, spatial attributes are rendered by projecting their coordinates onto a 2D bitmap as a means of data visualization. Each Query object also has a Processor object. Processor objects are responsible for processing primitive operations.
Figure 2: Spatial Spreadsheet Object Model: boxes indicate class objects, diamonds indicate aggregate or “has a” relationships, and triangles indicate inheritance.
3.3 Update Propagation
There are two ways the data stored in a relation open in the spreadsheet can be changed. The first is by an outside source: in particular, another process that accesses the underlying database can make changes to the data. The second is by the actions of the spreadsheet itself. If a persistent relation is updated by an outside source, the effects of those changes need to be propagated to all the other cells that directly or indirectly use that relation as input. Consider the river and road example. Suppose it has been raining a lot in region B and the relation containing the information on rivers at or near flood stage is updated by inserting more rivers. In this case, the Cell object holding the result in column 2 for region B would need to be updated after the change occurred in column 0. The propagation process works as follows. A relation corresponding to a Relation object is updated. The Relation object is notified and it marks itself as "dirty". When a Relation object or a Query object becomes dirty, it informs all Cell objects depending on it for input that they are now dirty too. It may be useful to think of the Cell objects in the spreadsheet as nodes in a directed graph. Edges directed into a node indicate Cell object inputs. Nodes in the graph having no incoming edges are Relation objects. All the other nodes are Query objects. We refer to Query objects that have no outgoing edges as terminals. The manner in which queries are created ensures that there are no cycles in this directed graph. Therefore, we do not have to check for cycles while passing messages. Eventually, these messages are passed through all possible paths from the initial dirty Relation object to all terminals reachable from it. Since there are no cycles, message passing will cease. After all Cell objects that can be marked dirty have been marked dirty, the initial dirty Relation object marks itself as "clean". The PropagateClean() method is invoked for each Cell object that uses the Relation object as direct input. The PropagateClean() method propagates the update.

PropagateClean() {
    If all my inputs are clean and I am active then {
        Mark myself clean and recalculate my primitive operation
        For each Cell object J that uses me as input do
            Call J's PropagateClean() method
    }
}

It is necessary to propagate all the "dirty" messages all the way through the graph of Cell objects before recalculating any primitive operations associated with a Cell object; otherwise some Cell objects might recalculate their operations more than once. For example, suppose that Cell object X recalculates its operation as soon as one of its inputs, say Cell object Y, indicates that a change has occurred. If Cell object Y is also input to Cell object Z which in turn is input to Cell object X, then Cell object X would have to update itself again after it is informed that Cell object Z has been updated. If this situation were not prevented, there could be as many as O(n^2) updates. This situation is prevented by informing each Cell object of all imminent updates before any updates are actually performed. This ensures O(n) updates. Note that individual Cell objects may be set "active" or "inactive" by the user. An inactive Cell object blocks the update of itself and blocks the propagation of updates to its dependents. This avoids spending time updating Cell objects in the spreadsheet in which the user is not currently interested. Updates may propagate automatically whenever a change occurs or only as desired by the user. At the top level, the Spreadsheet object has an UpdateSpreadsheet() method. This is called to initiate update propagation.

UpdateSpreadsheet() {
    For each Cell object K in the spreadsheet do
        If K is dirty then
            Call K's PropagateClean() method
}
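The two-phase protocol (first flood the "dirty" marks, then recompute each cell exactly once) can be sketched in Python; this is an illustration of the mechanism described above, not the actual iTcl/iTk implementation, and Relation and Query objects would both derive from the Cell class shown here:

    class Cell:
        def __init__(self, name):
            self.name = name
            self.inputs = []          # cells this cell reads from (empty for a Relation)
            self.dependents = []      # Query cells that use this cell as input
            self.dirty = False
            self.active = True

        def mark_dirty(self):
            # phase 1: flood the "dirty" mark to every reachable dependent;
            # the graph is acyclic, so the flooding terminates
            if not self.dirty:
                self.dirty = True
                for d in self.dependents:
                    d.mark_dirty()

        def propagate_clean(self):
            # phase 2: recompute only when every input is already clean,
            # so each cell recalculates at most once (O(n) updates)
            if self.dirty and self.active and all(not i.dirty for i in self.inputs):
                self.dirty = False
                self.recalculate()
                for d in self.dependents:
                    d.propagate_clean()

        def recalculate(self):
            pass                      # a Relation reloads its data, a Query re-runs its operation

    class Spreadsheet:
        def __init__(self, cells):
            self.cells = cells

        def update(self, changed_relation):
            changed_relation.mark_dirty()
            changed_relation.dirty = False        # the source relation itself is up to date
            for d in changed_relation.dependents:
                d.propagate_clean()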
3.4 Graphical User Interface
Rather than expressing operations on cells with a query language such as SQL, the simple operations associated with cells are created through the use of the “wizard”. The wizard consists of one or more popup windows that guide the user through the steps of instantiating a cell. To start the wizard, the user clicks on an empty cell. At each step, the wizard offers the possible choices to the user and the user selects the desired choice with the mouse. In some cases, the user may still have to type something. In particular, this is the case when an expression is required for a selection or join operation. At present, the user is required to type the entire expression. As in the SI system [6], we chose to use Tcl [7] for expressions. This requires the user to be knowledgeable of the expression syntax. This error-prone aspect detracts from the GUI’s ease of use. We intend to replace this with a more intuitive system in the future. The main window consists of an array of cells (see Figure 1). Cells can be expanded or contracted by sliding the row and column boundaries back and forth. Theoretically, the spreadsheet could hold an unlimited number of rows and columns but to simplify the implementation we limit the number of rows and columns. We can still start the system with a large number of cells and hide those that are not being used by moving the sliders. Display of spatial attributes is not limited to the graphical display in a single cell. Each graphical display can display spatial attributes from any relation associated with any cell in the spreadsheet. This allows the user to make visual comparisons by overlaying different layers in the display. The Spatial Spreadsheet also provides a global graphical display in a separate top-level window.
4 Implementation
The Spatial Spreadsheet is an interface used to interact with a spatial relational database. The spatial relational database we use is SAND [2]. SAND provides the database engine that underlies the system. It contains facilities to create,
update and delete relations. It provides access methods and primitive operations on spatial and non-spatial data. The Spatial Spreadsheet extends the basic set of primitive queries to include the classic selection, projection and nested loop join operations. The implementation of the Spatial Spreadsheet is object-oriented and was written entirely in incremental Tcl (iTcl) and incremental Tk (iTk). It runs on Sun Sparc and Linux systems.
5 Concluding Remarks
We have described how the power of the spreadsheet can be combined with a spatial database. The Spatial Spreadsheet provides a framework in which to organize data and build queries. Row and column operations provide a mechanism for rapid query creation on large amounts of related data. The systematic tabulation of the data as found in the two-dimensional array of the Spatial Spreadsheet enables the user to visually compare spatial components and pick out patterns. The user can also see how query results change as updates occur. An important issue for future work that was not addressed here is update propagation optimization. In particular, the output of any given Query object may be the result of many steps along the way between it and initial Relation objects. Currently the method of computation is determined in a procedural manner by the user. In the future we will focus on converting this to a declarative form and using query optimization techniques to improve refresh efficiency when updates occur.
References
1. W. G. Aref and H. Samet. Efficient window block retrieval in quadtree-based spatial databases. GeoInformatica, 1(1):59–91, April 1997.
2. C. Esperança and H. Samet. Spatial database programming using SAND. In M. J. Kraak and M. Molenaar, editors, Proceedings of the Seventh International Symposium on Spatial Data Handling, volume 2, pages A29–A42, Delft, The Netherlands, August 1996.
3. M. Fowler and K. Scott. UML Distilled, Applying the Standard Object Modeling Language. Addison-Wesley, Reading, MA, 1997.
4. G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. Computer Science Department TR-3919, University of Maryland, College Park, MD, July 1998. (To appear in ACM Transactions on Database Systems).
5. G. R. Hjaltason and H. Samet. Incremental distance join algorithms for spatial databases. In Proceedings of the ACM SIGMOD Conference, pages 237–248, Seattle, WA, June 1998.
6. M. Levoy. Spreadsheets for images. In Proceedings of the SIGGRAPH'94 Conference, pages 139–146, Los Angeles, 1994.
7. J. K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley, April 1994.
8. J. Rasure and C. Williams. An integrated visual language and software development environment. Journal of Visual Languages and Computing, 2(3):217–246, September 1991.
9. A. Silberschatz, H. F. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill, New York, third edition, 1996.
A High Level Visual Language for Spatial Data Management
Marie-Aude Aufaure-Portier and Christine Bonhomme
Laboratoire d'Ingénierie des Systèmes d'Information, INSA & UCBL Lyon, F-69621 Villeurbanne
[email protected]
[email protected]
Abstract. In this paper, we present a visual language dedicated to spatial data called Lvis. This language has been defined as an extension of the Cigales visual language, which is based on the Query-By-Example principle. The language is based on predefined icons modelling the spatial objects and operators that are used to build a visual query. The visual query is then translated into the host language of Geographic Information Systems (GIS). A major problem of such a language is that visual queries are generally ambiguous because of multiple interpretations of the visual representation. We first present a brief state of the art of languages dedicated to GIS and then formally define our visual language. The global architecture of the system is described. We then focus on visual ambiguities and propose a model for the detection and resolution of these ambiguities.
1 Introduction

Much research has recently been done in the field of Geographic Information Systems (GIS), especially on data storage, new indexing methods, query optimization, etc. [1]. A main characteristic of GIS is the management of large amounts of complex data. A fundamental research area concerns the definition of high-level user interfaces, because GIS users are generally not computer scientists. Many applications involve spatial data: urban applications, geomarketing, vehicle guidance and navigation, tourism, and so on. The human actors involved in these applications are architects, engineers, urban planners, etc. GIS applications have recently migrated towards citizen-oriented applications. This makes the definition of simple and user-friendly interfaces crucial. Cartographic information can be graphically visualized (maps, pictograms, etc.) using commercial GIS, but in most cases the languages developed for queries and updates are very poor and dedicated to one specific system. The consequence is that end-user applications cannot be supported by other systems. Another drawback is the complexity, for non-computer specialists, of designing and developing applications. However, the main characteristic of spatial information is that it is graphical. This implies that graphical or visual languages are well suited to spatial applications. Graphical languages are based on the use of symbols representing the data model concepts. These symbols are only pure graphical conventions, without any
metaphorical power, and consequently need to be explained and memorized. Visual languages use metaphors to show the concepts. Metaphors take the mental model of the end-user into account. We propose a visual language defined as an extension of the Cigales language [2]. This language, Lvis, is based on the use of icons. These icons represent the spatial objects stored in the database and the spatial operators used to build a visual query. The extension concerns: (1) the definition of new operators such as logical operators; (2) the detection and resolution of visual ambiguities due to the principle of query construction; (3) the integration of this language into a customizable visual environment [3] devoted to the design and development of spatial applications. We first present a brief state of the art of visual languages for spatial applications. Then, we define our visual language. We then explain how we deal with visual ambiguities and propose a detection and resolution model for a particular class of visual ambiguities. A prototype is now available and is briefly described in this paper. We conclude with our future work and perspectives.
2 State of the Art

Many proposals have been made over the last decade in the field of spatial information retrieval. We can distinguish between the query language approach and the hypermedia approach, and classify the proposed query languages into two kinds: (1) textual approaches (natural language and extensions of SQL), and (2) non-textual approaches (tabular, graphical or visual languages). The natural language approach [4] seems the most suitable for the end-user, but a serious difficulty is that many ambiguities must be solved. Another troublesome issue is that query formulation can be verbose and difficult (generally, a drawing is better than a long sentence). This approach can be seen as a good complement to graphical and visual approaches. Many extensions of the SQL language have been proposed [5,6]. These extensions are necessary in order to allow database management systems to store and retrieve spatial information. However, this class of languages is not suited to end-users, because of the difficulty of expressing spatial relations in natural language and the lack of conviviality of technical languages such as extensions of SQL. Tabular approaches [7,8] are defined as extensions of QBE (Query-By-Example) [9]. Their main difficulty is expressing joins. Graphical languages make better use of the visual medium, but the underlying concepts are not perceived in a metaphorical way. Considering that spatial information is visual, visual languages [2,10,11,12] have been proposed. Some work has also been done to design new metaphors [13,14]. Visual languages use icons and metaphors to model spatial objects, spatial relations between objects, and queries. The user's mental model is taken into account [15]. A metaphor can be seen as a mapping between a domain with a high level of abstraction and another domain with a low level of abstraction. An icon can be viewed as a visual representation of a concept. This approach has expanded very rapidly because of the evolution of applications towards citizens and the end-users' requirements of conviviality and ease of use of the interface. Visual languages offer an intuitive and incremental view of spatial queries but suffer from poor expressive power, execution inefficiency and multiple interpretations of a
query. Two main approaches have been developed to design visual languages: (1) the end-user draws a pattern using a set of icons, and (2) the end-user makes a drawing directly on the screen using the blackboard metaphor. The first approach is illustrated by the Cigales language [2] and the second one by the Sketch! [10] and Spatial-Query-By-Sketch [12] languages. The reader can refer to [16] for more details about query languages for GIS. The main advantage of these two approaches comes from the fact that the user has no constraints when expressing a query and no new language to learn. The main limitation is that a query can lead to multiple interpretations. The user's drawing may not represent the real world (an error due to the mental representation of the user) and may lead to a wrong interpretation, or may not represent the user's viewpoint. These languages can be seen as precursors of visual querying in the GIS application domain and provide two different approaches. Their main contribution is that users with little computer experience can express queries in an intuitive manner. These languages also permit the visual definition of spatial views. Nevertheless, many limitations still remain. The main limitation comes from the ambiguities of visual languages. This problem is the subject of Section 4. Another limitation is that alphanumerical and spatial data are not uniformly supported. A few operators have no graphical equivalent, such as operators used for reasoning and deduction.
3 Definition of the Visual Language Lvis

This section describes the general architecture of our project, then defines the syntax and semantics of the Lvis language. A prototype is already available on top of the commercial GIS MapInfo and is described in Section 3.1.

3.1 Architecture of the Project

Lvis is integrated into a customizable design and development environment [3]. The end-user interface is based on icons and pop-up menus. A visual query can be seen through a visual representation (visual metaphors), a technical representation (spatial objects and operators involved in the query) and a textual representation (extended SQL). A query is first expressed using the visual language, by incremental composition of icons and operators. This query is then translated into an intermediate language in order to be independent of the GIS. This intermediate language is based on the functionalities proposed in SQL3-MM [17]. The query is then translated, using a specific driver, into the host language of the GIS platform. A prototype is under development and already supports simple queries, i.e., queries with only one operator. The visual query is translated into a structural representation (a binary tree), from which a textual representation is extracted. This textual representation is then transformed into the GIS host query language (the MapInfo host language). Current work on our prototype concerns the integration of complex queries and the graphical display of query results. The graphical interface is shown in Figure 1.
Fig. 1. Graphical interface of the Lvis language
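As an illustration of the translation chain (visual query, then binary-tree structural representation, then textual representation), the sketch below uses invented keywords; the actual intermediate language is based on SQL3-MM [17] and the final output is the MapInfo host language:

    class Node:
        """Node of the binary-tree representation of a visual query: a leaf is a
        spatial object type, an internal node is an operator with one or two operands."""
        def __init__(self, label, left=None, right=None):
            self.label, self.left, self.right = label, left, right

    def to_text(node):
        # translate the structural representation into a textual, SQL-like form
        if node.left is None and node.right is None:
            return node.label
        if node.right is None:                        # unary operator
            return f"{node.label}({to_text(node.left)})"
        return f"{node.label}({to_text(node.left)}, {to_text(node.right)})"

    # "towns crossed by a river": intersection of the object types Town and River
    query = Node("INTERSECTION", Node("Town"), Node("River"))
    print(to_text(query))    # INTERSECTION(Town, River)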
3.2 Definition of the Language

This section describes the syntax and semantics of Lvis. The alphabet of the language is divided into two sub-sets: the set of spatial object types (polygons and lines) and the set of operators (Fig. 2). The operators are grouped as follows:

  Spatial, topological: Intersection, Inclusion, Adjacency, Disjunction, Equality
  Spatial, metrical: Point selection, Ray selection
  Set theory: Intersection, Union, Identity, Difference, Exclusive conjunction
  Logical: And, Or, Not
  Interactive selection: Point, Radius, Rectangle, Any Area, Buffer Zone
  Structural: Creation, Modification, Deletion

Fig. 2. Operators set
The two spatial object types to be handled are polygonal and linear objects. We assume that a punctual object is represented by a polygonal object whose area is null. The set ST of spatial object types is defined by:

  ST ⊆ STN × STI, with STN = {Polygonal, Linear} and STI the set of the two corresponding icons,
  ∀ st ∈ ST, st = (name_st, icon_st) ∧ name_st ∈ STN ∧ icon_st ∈ STI,
  where icon_st is the polygon icon iff name_st = "Polygonal" and the line icon iff name_st = "Linear".
Another set of object types is the set of data types. In this paper we only consider spatial data, i.e., objects of the database that have a spatial type. The icons that represent these object types generally use visual metaphors and aim to be as close as possible to the mental models of the users.
DT ⊆ DTN × DTI × ST, where DTN is the set of names of the object types stored in the database, DTI is the set of icons of the object types stored in the database, and ST is the set of spatial object types previously defined.

  ∀ dt ∈ DT, dt = (name_dt, icon_dt, type_dt) ∧ name_dt ∈ DTN ∧ icon_dt ∈ DTI ∧ icon_dt = f_icon(name_dt) ∧ type_dt ∈ ST,

where f_icon is a mapping function that associates an icon with the name of a data type. The set of operators contains spatial, set-theoretic, logical, interactive selection and structural operators. Spatial operators are composed of topological and metrical operators (Fig. 2). The topological operators have been chosen in accordance with those supported by the standardized spatial SQL [17]. All of these operators are either binary or unary.
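A compact sketch of the ST/DT definitions above; the icons are stood in by strings, since the real icons are bitmap metaphors, and the names used here are purely illustrative:

    from dataclasses import dataclass

    SPATIAL_TYPE_ICONS = {"Polygonal": "polygon_icon", "Linear": "line_icon"}   # STN -> STI

    def f_icon(name):
        # f_icon: associates an icon with the name of a data type
        return name.lower() + "_icon"

    @dataclass(frozen=True)
    class DataType:
        """An element of DT: a named object type with its spatial type; the icon
        is obtained through the mapping function f_icon."""
        name: str
        spatial_type: str              # must be a key of SPATIAL_TYPE_ICONS

        @property
        def icon(self):
            return f_icon(self.name)

    town, river = DataType("Town", "Polygonal"), DataType("River", "Linear")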
4 How to Deal with Ambiguities?

Visual ambiguities can occur at two different levels. The first level concerns the visual representation of the query by the system, and the second level is how the visual query is interpreted by end-users. On the one hand, ambiguities appear when several visual representations are suitable for a given query; the system must decide which of these visual representations will be displayed to the user. On the other hand, ambiguities are generated when a visual representation of a given query is interpreted in different ways. This second case, called interpretation ambiguities, is minimized thanks to the syntax and semantics of our visual language. For example, colours are used to associate the icon of an object with its shape, and the symbol '?' indicates the target object of the query. Moreover, the technical representation of a query records the steps of its formulation. Thus, we have focused our work on the first case, called representation ambiguities. First, a classification of visual ambiguity types has been defined (Fig. 3). Four main types of ambiguities are distinguished: visual ambiguities tied to (1) the topological relations between the objects of a query; (2) the location of objects, expressed in Cartesian or cardinal coordinates; (3) the geometry of objects; (4) the number of occurrences of a given spatial relation between objects. The first two classes are subdivided into three subclasses: simple ambiguities between simple objects, grouping ambiguities between groups of objects, and intersection ambiguities between intersections of objects. Figure 4 shows an example of an ambiguous visual representation for each of these classes.

  Topology: Simple (C11), Grouping (C12), Relations with objects' intersections (C13)
  Location: Simple (C21), Grouping (C22), Relations with objects' intersections (C23)
  Geometry (C3)
  Number of relations between two objects (C4)

Fig. 3. Taxonomy of visual ambiguities
C11: Some spatial relations may not be explicitly specified by the user (e.g., the spatial relation between A and C).
C12: The object A is disjoint from a group of objects. Must A be located inside or outside the grouped objects?
C13: Does the system allow the user to specify spatial relations between the intersections of objects (e.g., the spatial relation between A ∩ B and C)?
C21: Does the system allow the user to specify the coordinates of the objects?
C22: Does the system allow the user to specify the coordinates of grouped objects?
C23: Does the system allow the user to specify distances between the intersections of objects?
C3: Does the system allow the user to specify the exact shape of the objects?
C4: Does the system allow the user to specify the number of occurrences of the same spatial relation between two objects (left figure), or several different types of spatial relations (right figure)?
Fig. 4. Example of ambiguous visual representations
Our study concentrates on the handling of topological ambiguities concerning the intersections of objects, and especially on the problem of "don't care" relations. We illustrate this problem with an example of a spatial query. Consider the query "Which towns are crossed by a river and have a forestry zone?" This query is expressed with our language in two steps: first, the specification of an intersection relation between an object type "Town" and an object type "River"; then, the formulation of another intersection relation between the same object type "Town" and a new object type "Forest". But the user did not specify the spatial relation between the objects "River" and "Forest", simply because he does not care. What must the system decide? Which visual representation must be chosen for these two objects and for the whole query (Figure 5)?

[Fig. 5 illustrates the query "Which towns are crossed by a river and have a forestry zone?": since one spatial relation is left unspecified ("don't care"), several visual representations exist.]

Fig. 5. Example of a "don't care" relation
To solve this problem, a model has been proposed for the detection of visual ambiguities. It determines the set of possible visual representations for a given query. This model is based on the intersection levels between the objects involved in queries and is defined as a graph-type model including nodes and edges (Fig. 6 (a)). Nodes and edges can be either enabled or disabled according to the spatial criteria of the query. The main advantage of this model is that the model for queries composed of a given number of objects is built only once and is updated according to the spatial criteria of the other queries. This model is practical for queries with up to four objects (Fig. 6 (b)). For a query with n objects, the model contains 2^n − 1 nodes, and Card(2^n − 1, 2) = (2^n − 1)! / (2! (2^n − 1 − 2)!) relations can be specified between the objects. We assume that most spatial queries contain fewer than four objects, so this model can be integrated into our language.

  Objects   Nodes   Possible representations
  1         1       2
  2         3       8
  3         7       128
  4         15      32 768
  5         31      2 147 483 648
Fig. 6. Model of detection of visual ambiguities. (a) The graph structure of the model; (b) The complexity of the model
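The counts of Fig. 6 (b) can be reproduced as follows; reading the "possible representations" column as 2 raised to the number of nodes (each node enabled or disabled) matches the table values, although that reading is our interpretation:

    from math import comb

    def model_complexity(n_objects):
        # for n objects the graph has 2^n - 1 nodes (single objects and their
        # intersections); any two nodes may be related, giving C(2^n - 1, 2)
        # possible relations, and 2^(2^n - 1) possible representations
        nodes = 2 ** n_objects - 1
        relations = comb(nodes, 2)
        representations = 2 ** nodes
        return nodes, relations, representations

    for n in range(1, 6):
        print(n, *model_complexity(n))
    # 1 1 0 2
    # 2 3 3 8
    # 3 7 21 128
    # 4 15 105 32768
    # 5 31 465 2147483648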
When the user submits a visual query, the system searches for the set of possible visual representations over the whole graph of the query. If more than one possible representation exists, the system decides which one will be the least ambiguous for the user. To do this, we think it is necessary to allow interactions between the system and the user in order to build a user profile. For example, keeping the user's preferences concerning the visual representations of query types that are often formulated could be a good and efficient strategy. The system thus becomes a personalized, indeed even a self-adapting, system. We also think that it could be interesting to let the user modify the visual representation of the query directly. This can be realized by means of dynamically alterable visual representations of the queries. The spatial criteria of queries (and thus the spatial relations between the objects of the queries) remain valid whatever changes are made to the visual representation.
5 Conclusion and Future Work

This paper presents Lvis, an extension of the visual language Cigales devoted to spatial information systems. This language is based on a query-by-example philosophy. We have focused on how to detect and solve representation ambiguities, and have defined a resolution model for the "don't care" relationship between spatial objects. This model is realistic for queries containing fewer than four objects and will be integrated into our prototype. We must now study the other cases of
ambiguities. The prototype has been developed according to the architecture of the project described in this paper. We must now validate it in collaboration with potential end-users. A first set of cognitive tests has already been carried out; some conclusions have been drawn from these tests and must be confirmed.
References
1. Laurini, R., Thompson, D.: Fundamentals of Spatial Information Systems, The APIC Series, Academic Press (1992)
2. Aufaure-Portier, M.-A.: A High-Level Interface Language for GIS, Journal of Visual Languages and Computing, Vol. 6 (2), Academic Press (1995) 167-182
3. Lbath, A., Aufaure-Portier, M.-A., Laurini, R.: Using a Visual Language for the Design and Query in GIS Customization, 2nd International Conference on Visual Information Systems (VISUAL97), San Diego (1997) 197-204
4. Bell, J.E.: The Experiences of New Users of a Natural Language Interface to a Relational Database in a Controlled Setting, First Int. Workshop on Interfaces to Database Systems, Ed. R. Cooper, Springer-Verlag (1992) 433-454
5. Costagliola, G., et al.: GISQL - A Query Language Interpreter for Geographical Information Systems, IFIP Third Working Conference on Visual Database Systems (1995) 247-258
6. Egenhofer, M.: Spatial SQL: A Query and Presentation Language, IEEE Transactions on Knowledge and Data Engineering (1994), Vol. 6 (1) 86-95
7. Staes, F., et al.: A Graphical Query Language for Object Oriented Databases, IEEE Workshop on Visual Languages (1991) 205-210
8. Vadaparty, K., et al.: Towards a Unified Visual Database Access, SIGMOD Record (1993) Vol. (22) 357-366
9. Zloof, M.M.: Query-by-Example: A Database Language, IBM Systems Journal (1977) Vol. 16 (4) 324-343
10. Meyer, B.: Beyond Icons: Towards New Metaphors for Visual Query Languages for Spatial Information Systems, Proceedings of the First International Workshop on Interfaces to Database Systems (R. Cooper ed.), Springer-Verlag (1993) 113-135
11. Benzy, F., et al.: VISIONARY: A Visual Query Language Based on the User Viewpoint Approach, Third International Workshop on User-Interfaces to Database Systems (1996)
12. Egenhofer, M.J.: Query Processing in Spatial-Query-by-Sketch, Journal of Visual Languages and Computing (1997) Vol. 8 (4) 403-424
13. Egenhofer, M.J., Bruns, H.T.: Visual Map Algebra: A Direct-Manipulation User Interface for GIS, Third Working Conference on Visual Database Systems (IFIP 2.6) (1995) 211-226
14. Kuhn, W.: 7±2 Questions and Answers about Metaphors for GIS User Interfaces, Cognitive Aspects of Human-Computer Interaction for Geographic Information Systems (T. Nyerges, D. Mark, R. Laurini & M. Egenhofer ed.) (1993) 113-122
15. Downs, R.M., Stea, D.: Maps in Minds, Reflections on Cognitive Mapping, Harper and Row Series in Geography (1977)
16. Aufaure-Portier, M.A., Trepied, C.: A Survey of Query Languages for Geographic Information Systems, Proceedings of IDS-3 (3rd International Workshop on Interfaces to Databases), published in Springer-Verlag's Electronic Workshops in Computing Series (1996) 14p (www.springer.co.uk/eWiC/Worshops/IDS3.html)
17. ISO/IEC JTC1/SC21/WG3 DBL-SEL3b (1990)
A Global Graph Model of Image Registration
S. G. Nikolov, D. R. Bull, and C. N. Canagarajah
Image Communications Group, Centre for Communications Research, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, UK
Tel: (+44 117) 9545193, Fax: (+44 117) 9545206
{Stavri.Nikolov,Dave.Bull,Nishan.Canagarajah}@bristol.ac.uk
Abstract. The global graph model of image registration is a new visual framework for understanding the relationships and merits between the wide variety of existing image registration methods. It is a global, dynamically updateable model of the state-of-the-art in image registration, which is designed to assist researchers in the selection of the optimal technique for a specific problem under investigation. Two-dimensional and three-dimensional graph display techniques are used in this paper to visualise the new model. The Virtual Reality Modeling Language (VRML) was found to provide a very suitable representation of such a 3-D graph model.
1 Introduction
Image registration is a common problem in many diverse areas of science including computer vision, remote sensing, medical imaging, and microscopy imaging. Image registration can be defined as the process which determines the optimal correspondence between two or more images. Such images may be acquired from one and the same object: (a) at different times; (b) under different conditions; (c) from different viewpoints; (d) from various sensors. One of the images I1 is taken to be the reference image, and all other images I2 , I3 , . . . , In , called input images, are matched to the reference image. To register the images, a transformation must be found, which will map each point of an input image to a point in the reference image. The mapping has to be optimal in a way that depends on what needs to be matched in the images. Over the years, a great variety of image registration techniques have been developed for various types of data and problems. These techniques have been independently proposed and studied by researchers from different areas, often under different names, resulting in a vast collection of diverse papers on image registration. Research areas which have contributed significantly to the development of image registration techniques comprise computer vision and pattern recognition, medical image analysis, remotely sensed image processing, 3-D microscopy, astronomy, computer aided design (CAD), and automatic inspection. Each of these areas has developed its own specialised registration methods. The need to compare the different approaches to image registration has recently led to the publication of
several review papers [1,18,10]. Most of these review papers try to classify image registration methods according to some classification scheme, e.g. the primitives used to match the images, the type and complexity of the transform utilised to align the images, etc. The great majority of such schemes are very much domain specific. Reviews of image registration methods for alignment of medical images, for example, can be found in [15,11,18,10]. The similarity of some registration methods, applied to images from different research areas, however, suggests the usefulness of a global taxonomy of image registration techniques, where such techniques are compared not only on similar images from one single area (e.g. medical images, microscopic images), but also across scientific areas and across different scales (i.e. macroscopic versus microscopic images). The only comprehensive review paper on image registration methods spanning images and methods from different research areas is the paper published by Brown [1]. All image registration methods in [1] are described according to a four-component classification scheme. In this paper, we propose a new model of the image registration process. This model, which we call the global graph model of image registration, is an attempt to put together results from many diverse areas into a single representation, where the similarities and differences between the image registration methods and their components may be clearly seen. The global graph model of image registration is much like a taxonomy of image registration methods, although we would prefer to view it as a dynamically updateable, multi-component, graphical representation of the image registration process. The model has been derived from the model proposed by Brown, and several extensions have been added. The aim of this paper is to present the new graph model, rather than to review the existing techniques in image registration. Hence, only a few example papers from several research areas are used to build a nucleus of the graph model.
2 Brown's Image Registration Model
In her review of image registration techniques, Brown [1] considers image registration as a combination of four key components: (a) feature space (FS) - the set of image features which are extracted from the reference image and from the input images, and are used to perform the matching; (b) search space (SSp) - the class of potential transformations that establish the correspondence between the input images and the reference image; (c) search strategy (SSt) - the method used to choose which transformations have to be computed and evaluated; (d) similarity metric (SM) - which provides a quantitative measure of the match between the reference image and the transformed input images, for a given transformation chosen in the search space, using the search strategy. Brown has reviewed numerous articles on image registration and has classified all image registration methods into several tables [1], corresponding to the four components of her model. However, in Brown’s paper, it is very difficult to see
the relations between the tables, and furthermore, on the basis of these tables only, it is impossible to track down how the choices of each component are put together in each paper to form a complete image registration method. Finally, Brown’s model is static and cannot be updated on-line, thus being an excellent snapshot of the state-of-the-art in image registration at the time of its publication. Our global graph model is an attempt to overcome these disadvantages.
3 A Global Graph Model of Image Registration
3.1 Extensions to Brown's Image Registration Model
We propose to add the following new components to Brown’s model (Fig. 1 (left)): (e) image space (IS) - this is the space of images to be registered, grouped into classes on the basis of the area of research (e.g. medical images, remote sensing images, etc.); (f) dimension - the dimension of the images, which may be 2-D, 3-D, 4-D; (g) paper - the publication which describes a new image registration technique or a new application of a known registration algorithm. The last two additional components may be regarded as meta-components, because they specify some characteristics of instances from the other major components of the model. More meta-components can be added to the model, but here we want to keep the image registration model as simple and as general as possible.
3.2 A 2-D Global Graph Model of Image Registration
An example of the 2-D global graph model of image registration is given in Fig. 1 (left). The basic components of the model in Fig. 1 (left) define several layers in the global graph. Each new paper is a subgraph of the global graph. The root of this subgraph is the reference to the paper while its nodes are instances from the successive layers of the global graph. Generally, several kinds of subgraphs of the global graph can be distinguished: (a) paper graph - a graph which presents image registration results published in a specific paper. Three example paper graphs (dashed, solid, and bold edges) are included in Fig. 1 (left); (b) layer graph - a graph connecting all the nodes in one layer of the global graph model (e.g. all the image classes in the IS, as illustrated in Fig. 1 (right)). Fig. 1 (right) shows only some example image classes and modalities. More areas or new modalities can be added to the IS layer; (c) comparison graph - a graph which compares several different alternatives of some steps of the image registration process. A comparison graph may show the difference between two complete image registration algorithms, or it may compare only some steps of these algorithms; (d) area graph - an area graph is a generalisation of a comparison graph, where all the images from the IS are from one research area. Thus, the similarities and differences between image registration methods applied to images from a certain area (e.g. medical images, remote sensing images) can be observed at a glance.
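The subgraph types listed above can be mimicked with a small sketch. The following Python fragment is our own illustration and not part of the original paper (the authors' actual representation is the VRML scene of Section 3.4): every node carries a 'layer' attribute and every edge is tagged with the paper that introduced it, so paper graphs and layer graphs fall out as simple filters. The component chain used here for Le Moigne [12] is read off Fig. 1 and is only indicative.

# Sketch of the layered global graph using networkx (assumed tooling, not the authors').
import networkx as nx

LAYERS = ["paper", "IS", "dimension", "FS", "SSp", "SSt", "SM"]
G = nx.Graph()

def add_paper(graph, paper, chain):
    """chain maps a layer name to the node label chosen by that paper."""
    graph.add_node(paper, layer="paper")
    previous = paper
    for layer in LAYERS[1:]:
        node = chain[layer]
        graph.add_node(node, layer=layer)
        graph.add_edge(previous, node, paper=paper)
        previous = node

add_paper(G, "Le Moigne [12]", {
    "IS": "remote sensing", "dimension": "2-D", "FS": "WT maxima",
    "SSp": "piece-wise polynomial", "SSt": "hierarchical techniques",
    "SM": "normalized cross-correlation"})

# paper graph: the subgraph induced by the edges contributed by one specific paper
paper_edges = [(u, v) for u, v, d in G.edges(data=True) if d["paper"] == "Le Moigne [12]"]
paper_graph = G.edge_subgraph(paper_edges)

# layer graph: all nodes belonging to one layer, e.g. the image space (IS) layer
is_nodes = [n for n, d in G.nodes(data=True) if d["layer"] == "IS"]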
[Fig. 1 node labels (left), by layer: paper: Le Moigne [12], Nikolov [14], Studholme [17]; IS: remote sensing, microscopy, medical; dimension: 2-D, 3-D; FS: WT maxima, intensity; SSp: piece-wise polynomial, affine, rigid; SSt: hierarchical techniques; SM: normalized cross-correlation function, correlation coefficient, relative entropy.]
Fig. 1. The 2-D global graph model of image registration (left). Example papers included: Le Moigne [12] (dashed edges), Studholme [17] (bold edges), and Nikolov [14] (solid edges). The different nodes of the IS layer (right). All abbreviations are given in [13]. A double-ended arrow between any two modalities shows that registration of images from these modalities has been studied in a specific paper (paper graph).
3.3 A 3-D Global Graph Model of Image Registration
There is growing evidence that the human brain can comprehend increasingly complex structures if these structures are displayed as objects in 3-D space [20,19]. If the layers of the global graph are displayed as parallel planes in 3-D space, a 3-D global graph model of image registration (Fig. 2) can be built. The use of multiple abstraction levels is a common approach to visualisation of very large graphs. Several techniques have been proposed in the past for constructing 3-D visualisations of directed and undirected graphs [19,2,7], multi-level clustered graphs [3,6] and hierarchical information structures [16]. Three-dimensional graphs have also been successfully used as graphic representation of knowledge bases [5]. The main advantage of 3-D multi-level graph display over 2-D graph display, especially when it comes to very large graphs, is that the additional degree of freedom allows the numerous graph nodes to be spread across several levels, making the overall graph structure much more conceivable. The nodes in one layer can be positioned according to some kind of closeness measure, which is specific for this layer (e.g. Fig. 1 (right)), and thus can be grouped into meaningful clusters. Subgraphs, e.g. paper graphs and layer graphs, may be regarded as cross-sections of the global graph and can be plotted as 2-D graphs for easier interpretation. Large graphs can be displayed using a variety of methods such as: (a) all the information associated with the
Fig. 2. A 3-D display of part of the global graph model.
nodes and edges of the graph is displayed; (b) several views or zoom-in maps are plotted; (c) distorting views such as fish-eye lens are utilised; (d) stereopsis; (e) animation; (f) virtual reality. While most of these approaches generate one or several static views of the graph structure and display them to the observer, virtual reality allows the viewer to interactively examine the graph structure, or some of its details, by navigating around it (i.e. by rotation, zoom and translation of the whole structure). Hence, we have decided to use a 3-D virtual reality representation of the global graph model of image registration. One question of paramount importance is how to update the global graph model so that it stays up-to-date with the state-of-the-art in image registration. New image registration methods and new results should be easily incorporated in the global graph model. Modifications of the structure and relations in the model, in view of new developments, will also inevitably become necessary. Therefore, a dynamical 3-D representation of the global graph model is needed, which will be available to researchers from different scientific areas, who may submit new components and new methods and thus update the model.
3.4 A VRML Representation of the 3-D Global Graph Model
The Virtual Reality Modeling Language (VRML) is a file format for describing interactive 3-D objects and scenes to be experienced on the World Wide Web (WWW). With the introduction of VRML 2.0 (Moving Worlds), superseded by the VRML97 standard in December 1997, VRML is considered to be the de facto standard for describing and sharing 3-D interactive worlds over the WWW. We have decided to use VRML as a means to visualise the global graph model of image registration for the following reasons: (a) VRML files can be displayed on virtually any computer (multi-platform support); (b) VRML provides fast and high-quality rendering; (c) it comprises a rich set of geometrical primitives which can be used to construct various graph displays; (d) VRML is becoming more and more popular for scientific data visualisation and exploration. So far, there have been only a few attempts to use VRML to describe and display graphs [8]. The additional degree of freedom, compared to 2-D graphs, and the fact that the viewer can navigate around the graph structure and look at it from any position and angle, create numerous difficulties which have to be taken into account when constructing 3-D graphs using VRML. In the case of 3-D layered graphs, some of the problems that have to be solved are: (a) how to position the nodes in each layer (what kind of closeness measure to use); (b) how to add new nodes and edges to the graph so that it stays balanced and aesthetically pleasing; the global graph model will evolve in time, which means that the spatial arrangement of its nodes and edges will also change frequently; (c) how to display node labels in 3-D; several possibilities exist: the VRML text node can be used, or alternatively, text can be mapped as a texture onto the geometrical primitives (e.g. spheres, cylinders, cones, etc.) which represent graph nodes (see Fig. 3); (d) what kind of orientation to use for the text labels so that the text is always readable; one simple solution is to present alternative views of horizontal or vertical text labels, depending on the position of the viewer, while a more sophisticated way is to track the position of the viewer and keep the text labels always parallel to the viewer's eyes; (e) each node which belongs to the paper layer can be implemented as a link to the original paper. If a paper is available on-line, the link points to the file with the paper. Thus, paper nodes in the graph are anchors to on-line publications. VRML anchors are specified by the VRML Anchor node and a URL indicating the WWW address of the destination resource. A problem which remains open is how to keep all such paper links up-to-date, bearing in mind that some of the on-line resources will change their addresses in time. An initial 3-D VRML graph is displayed in Fig. 3. The optimal VRML representation of the global graph model of image registration is still under investigation. Since the goal of the global graph model is to be truly global, and thus accepted by most of the members of the image registration research community, the optimal VRML representation will be searched for by constructing several different VRML graphs and collecting feedback about the usefulness and aesthetic merits of each one of them.
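To make the idea of paper nodes as anchors concrete, the following Python helper (our own illustration, not the authors' code) writes one paper node of the graph as a VRML 2.0 Anchor containing a sphere, so that clicking the node opens the on-line publication. The URL below is a placeholder, not a real address.

def vrml_paper_node(label, url, x=0.0, y=0.0, z=0.0, radius=0.4):
    # Returns one VRML 2.0 Anchor node: a coloured sphere linked to 'url'.
    return (
        f'Anchor {{\n'
        f'  url "{url}"\n'
        f'  description "{label}"\n'
        f'  children [\n'
        f'    Transform {{\n'
        f'      translation {x} {y} {z}\n'
        f'      children Shape {{\n'
        f'        appearance Appearance {{ material Material {{ diffuseColor 0.2 0.4 0.9 }} }}\n'
        f'        geometry Sphere {{ radius {radius} }}\n'
        f'      }}\n'
        f'    }}\n'
        f'  ]\n'
        f'}}\n'
    )

with open("paper_layer.wrl", "w") as out:
    out.write("#VRML V2.0 utf8\n")
    out.write(vrml_paper_node("Le Moigne [12]", "http://example.org/lemoigne96.html"))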
Fig. 3. A VRML representation of the graph model shown in Fig. 1 (left). This VRML 2.0 file was generated with the new version of the GraphViz program [4]. A short description of the global graph model of image registration, including the VRML 2.0 representation, can be found at http://www.fen.bris.ac.uk/elec/research/ccr/imgcomm/fusion.html
4 Conclusion and Acknowledgements
In this paper we have presented a new graph model of the image registration process. This new model is an extension to Brown’s four-component model. The new global graph model has several advantages over other image registration models, i.e. it is domain independent, dynamically updateable, and it visually displays the similarities and differences between various image registration methods and their components. A VRML representation of the 3-D global graph model is presented and several problems connected with its construction and display are discussed in the paper. Similar graph models can also be used in other image related research areas, e.g. to characterise content-based retrieval systems, where the IS will comprise the different media (text, audio, image, video), and the FS may consist of texture, colour, motion, etc. This work was funded by UK EPSRC Grant #GR/L53373. We are grateful to Dr. Stephen North from AT&T Bell Laboratories for providing the dot [9] and GraphViz programs [4].
References 1. L. G. Brown. A survey of image registration techniques. ACM Computing Surveys, 24(4):325–376, 1992. 334 2. R. F. Cohen, P. Eades, T. Lin, and F. Ruskey. Three-dimensional graph drawing. In R. Tamassia and I. G. Tollis, editors, Graph Drawing (Proc. GD ’94), volume 894 of Lecture Notes in Computer Science, pages 1–11. Springer-Verlag, 1995. 336 3. P. Eades and Q. Feng. Multilevel visualization of clustered graphs. In Graph Drawing ’96 Proceedings. Springer-Verlag, 1996. 336 4. J. Ellson, E. Gansner, E. Koutsofios, and S. North. GraphViz: tools for viewing and interacting with graph diagrams. The GraphViz program is available at http://www.research.att.com/sw/tools/graphviz. 339 5. K. M. Fairchild, S. T. Poltrock, and F. W. Furnas. SemNet: Three-Dimensional Graphic Representations of Large Knowledge Bases. Lawrence Erlbaum, 1988. 336 6. Qingwen Feng. Algorithms for Drawing Clustered Graphs. PhD thesis, University of Newcastle, Australia, April 1997. 336 7. A. Garg and R. Tamassia. GIOTTO3D: a system for visualizing hierarchical structures in 3D. In Graph Drawing ’96 Proceedings. Springer-Verlag, 1996. 336 8. Cristian Ghezzi. A geometric approach to three-dimensional graph drawing. Technical report, Computation Dept, UMIST, Manchester, UK, 1997. 338 9. E. Koutsofios and S. C. North. Drawing graphs with dot. Technical report, AT&T Bell Laboratories, Murray Hill, NJ, USA, 1992. 339 10. J. B. A. Maintz and M. A. Viergever. A survey of medical image registration. Medical Image Analysis, 2(1):1–36, March 1998. 334 11. C. R. Maurer and J. M. Fitzpatrick. A review of medical image registration. In R. J. Maciunas, editor, Interactive Image-Guided Neurosurgery, pages 17–44. American Assoc of Neurological Surgeons, 1993. 334 12. J. Le Moigne and R. F. Cromp. The use of wavelets for remote sensing image registration and fusion. Technical Report TR-96-171, NASA Goddard Space Flight Center, 1996. 336 13. S. G. Nikolov. A Global Graph Model of Image Registration. Technical Report UoB-SYNERGY-TR01, Image Communications Group, Centre for Communications Research, University of Bristol, May 1998. 336 14. S. G. Nikolov, M. Wolkenstein, H. Hutter, and M. Grasserbauer. EPMA and SIMS image registration based on their wavelet transform maxima. Technical Report TR-97, Vienna Univesity of Technology, Austria, 1997. 336 15. C. A. Pelizzari, D. N. Levin, G. T. Y. Chen, and C. T. Chen. Image registration based on anatomical surface matching. In Interactive Image-Guided Neurosurgery, pages 47–62. American Assoc of Neurological Surgeons, 1993. 334 16. G. G. Robertson, J. D. Mackinlay, and S. Card. Cone trees: Animated 3-D visualization of hierarchical information. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems. ACM, 1991. 336 17. C. Studholme, D. L. G. Hill, and D. J. Hawkes. Automated 3D registration of truncated MR and CT images of the head. In David Pycock, editor, Proc. of BMVA, pages 27–36, 1995. 336 18. P. A. van den Elsen, E. Pol, and M. Viergever. Medical image matching - a review with classification. Eng. Med. Biol., 12(1):26–39, March 1993. 334 19. C. Ware, D. Hui, and G. Franck. Visualizing object oriented software in three dimensions. In CASCON 1993 Proceedings, 1993. 336 20. Y. Xiao and Milgram. Visualization of large networks in 3-D space: Issues in implementation and experimental evaluation. In CAS 1992 Proc., 1992. 336
A Graph–Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500 U.S.A. {aksoy,haralick}@isl.ee.washington.edu http://isl.ee.washington.edu
Abstract. Feature vectors that are used to represent images exist in a very high dimensional space. Usually, a parametric characterization of the distribution of this space is impossible. It is generally assumed that the features are able to locate visually similar images close in the feature space so that non-parametric approaches, like the k-nearest neighbor search, can be used for retrieval. This paper introduces a graph–theoretic approach to image retrieval by formulating the database search as a graph clustering problem, to increase the chances of retrieving similar images not only by ensuring that the retrieved images are close to the query image, but also by adding the constraint that they should be close to each other in the feature space. Retrieval precision with and without clustering is compared for performance characterization. The average precision after clustering was 0.78, an improvement of 6.85% over the average precision before clustering.
1 Motivation
As in many computer vision and pattern recognition applications, algorithms for image database retrieval have an intermediate step of computing feature vectors from the images in the database. Usually these feature vectors exist in a very high dimensional space where a parametric characterization of the distribution is impossible. In an image database retrieval application we expect visually similar images to lie close to each other in the feature space. Due to the high dimensionality, this problem is usually not studied, and the features are assumed to be able to locate visually similar images close enough so that non-parametric approaches, like the k-nearest neighbor search, can be used for retrieval. Unfortunately, none of the existing feature extraction algorithms can always map visually similar images to nearby locations in the feature space, and it is not uncommon to retrieve images that are quite irrelevant simply because they are close to the query image. We believe that a retrieval algorithm should be able to retrieve images that are not only close (similar) to the query image but also close (similar) to each other.
In this work, we introduce a graph–theoretic approach for image retrieval by formulating the database search as a graph clustering problem. Graph–theoretic approaches have been a popular tool in the computer vision literature, especially in object matching. Recently, graphs were used in image segmentation [8,7,4] by treating the image as a graph and defining some criteria to partition the graph. Graphs did not receive significant attention in image retrieval algorithms mainly due to the computational complexity of graph-related operations. Huet and Hancock [5] used attributed graphs to represent line patterns in images and used these graphs for image matching and retrieval. Clustering the feature space and visually examining the results to check whether visually similar images are actually close to each other is an important step in understanding the behavior of the features. This can help us determine the effectiveness of both the features and the distance measures in establishing similarity between images. In their Blobworld system, Carson et al. [3] used an expectation-maximization based clustering algorithm to find canonical blobs to mimic human queries. In our work we also use the idea that clusters contain visually similar images but we use them in a post-processing step instead of forming the initial queries. The paper is organized as follows. First, the features used are discussed in Section 2. Then, a new algorithm for image retrieval is introduced in Section 3, which is followed by the summary of a graph–theoretic clustering algorithm in Section 4. Experiments and results are presented in Section 5. Finally, conclusions are given in Section 6.
2 Feature Extraction
The textural features that are used were described in [1,2]. The feature vector consists of two sets of features intended to perform a multi-scale texture analysis, which is crucial for a compact representation in large databases containing diverse sets of images. The first set of features is computed from the line-angle-ratio statistics, which is a texture histogram of the angles between intersecting line pairs and the ratio of the mean gray levels inside and outside the regions spanned by those angles. The second set of features is the variances of gray level spatial dependencies, computed from the co-occurrence matrices for different spatial relationships. Each component in the 28-dimensional feature vector is normalized to the [0, 1] interval by an equal probability quantization.
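A minimal sketch of the normalization step follows. It assumes the usual reading of "equal probability quantization", namely that each feature component is replaced by its empirical quantile over the database so that the normalized values are roughly uniform on [0, 1]; the function name and the number of levels are our own choices.

import numpy as np

def equal_probability_normalize(features, n_levels=256):
    """features: (n_images, n_dims) array of raw feature values."""
    normalized = np.empty_like(features, dtype=float)
    for d in range(features.shape[1]):
        column = features[:, d]
        # bin edges chosen so that each bin holds roughly the same number of samples
        edges = np.quantile(column, np.linspace(0.0, 1.0, n_levels + 1))
        levels = np.searchsorted(edges[1:-1], column, side="right")
        normalized[:, d] = levels / (n_levels - 1)
    return normalized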
3 Image Retrieval
After computing the feature vectors for all images in the database, given a query image, we have to decide which images in the database are relevant to it. In most of the retrieval algorithms, a distance measure is used to rank the database images in ascending order of their distances to the query image, which is assumed to correspond to a descending order of similarity. In our previous work [1,2] we defined a likelihood ratio to measure the relevancy of two images, one being
the query image and one being a database image, so that image pairs which had a high likelihood value were classified as "relevant" and the ones which had a lower likelihood value were classified as "irrelevant". The distributions for the relevance and irrelevance classes were estimated from training sets and the likelihood values were used to rank the database images. We believe that a retrieval algorithm should be able to retrieve images that are not only similar to the query image but also similar to each other, and formulate a new retrieval algorithm as follows. Assume we query the database and get back the best N matches. Then, for each of these N matches we can issue a query and get back the best N matches again. Define S as the set containing the query image and at most N^2 + N images that are retrieved as the results of the original query and the N additional queries. Then, we can construct a graph with the images in S as the nodes and draw edges between each query image and each image in the retrieval set of that query image. We call these edges the set R, where R = {(i, j) ∈ S × S | image j is in the retrieval set when image i is the query}. The distance between the images corresponding to the two nodes that an edge connects can also be assigned as a weight to that edge. We want to find the connected clusters of this graph (S, R) because they correspond to similar images. The clusters of interest are the ones that include the original query image. The ideal problem now becomes finding the maximal P, where P ⊆ S such that P × P ⊆ R. This is called a clique of the graph. The images that correspond to the nodes in P can then be retrieved as the results of the query. An additional thing to consider is that the graph (S, R) can have multiple clusters. In order to select the cluster that will be returned as the result of the query, additional measures are required. In the next section we define the term "compactness" for a set of nodes. The cluster with the maximum compactness can then be retrieved as the final result. If more than one such cluster exists, we can select the one with the largest number of nodes, or compute the sum of the weights of the edges in each of the clusters and select the one that has the minimum total weight. This method increases the chance of retrieving similar images by not only ensuring that the retrieved images are close to the query image, but also adding another constraint that they should be close to each other in the feature space. In the next section we describe a graph–theoretic clustering algorithm which is used to find the clusters. Section 5 presents experimental results.
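A small sketch of the construction of (S, R) follows; it is our own illustration, with 'neighbors' assumed to hold the precomputed N-nearest-neighbour list of every image in the database.

def build_query_graph(query, neighbors, N):
    S = {query} | set(neighbors[query][:N])
    for img in list(S - {query}):
        S |= set(neighbors[img][:N])          # at most N^2 + N images plus the query
    R = set()
    for i in S:
        for j in neighbors[i][:N]:
            if j in S:
                R.add((i, j))                 # edge: j retrieved when i is the query
    return S, R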
4 Graph–Theoretic Clustering
In the previous section, we proposed that cliques of the graph correspond to similar images. Since finding the cliques is computationally too expensive, we use the algorithm by Shapiro and Haralick [6] that finds “near-cliques” as dense regions instead of the maximally connected ones. Another consideration for speed is to compute the N -nearest neighbor searches offline for all the images in the database so that only one N -nearest neighbor search is required for a new query, which is the same amount of computation for the classical search methods.
In the following sections, first we give some definitions, then we describe the algorithm for finding dense regions, and finally we present the algorithm for graph–theoretic clustering. The goal of this algorithm is to find regions in a graph, i.e. sets of nodes, which are not as dense as major cliques but are compact enough within some user specified thresholds.
4.1 Definitions
– (S, R) represents a graph where S is the set of nodes and R ⊆ S × S is the set of edges.
– (X, Y) ∈ R means Y is a neighbor of X. The set of all nodes Y such that Y is a neighbor of X is called the neighborhood of X and is denoted by Neighborhood(X).
– Conditional density D(Y|X) is the number of nodes in the neighborhood of X which have Y as a neighbor; D(Y|X) = #{N ∈ S | (N, Y) ∈ R and (X, N) ∈ R}.
– Given an integer K, a dense region Z around a node X ∈ S is defined as Z(X, K) = {Y ∈ S | D(Y|X) ≥ K}. Z(X) = Z(X, J) is a dense region candidate around X, where J = max{K | #Z(X, K) ≥ K}.
– Association of a node X to a subset B of S is defined as
A(X|B) = #{Neighborhood(X) ∩ B} / #B ,   0 ≤ A(X|B) ≤ 1.   (1)
– Compactness of a subset B of S is defined as
C(B) = (1/#B) Σ_{X∈B} A(X|B) ,   0 ≤ C(B) ≤ 1.   (2)
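The three measures above translate directly into code. The sketch below is a straightforward transcription of the definitions and of Eqs. (1) and (2); the graph is assumed to be given as an adjacency dictionary 'adj' mapping each node to the set of its neighbours.

def conditional_density(adj, X, Y):
    # D(Y|X): number of neighbours N of X that have Y as a neighbour
    return sum(1 for N in adj[X] if Y in adj[N])

def association(adj, X, B):
    # Eq. (1): fraction of B that lies in the neighbourhood of X
    return len(adj[X] & B) / len(B)

def compactness(adj, B):
    # Eq. (2): average association of the members of B
    return sum(association(adj, X, B) for X in B) / len(B)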
4.2 Algorithm for Finding Dense Regions
To determine the dense region around a node X,
1. Compute D(Y|X) for every other node Y in S.
2. Use the densities to determine a dense-region candidate set for node X by finding the largest positive integer K such that #{Y | D(Y|X) ≥ K} ≥ K.
3. Remove the nodes with a low association (determined by the threshold MINASSOCIATION) from the candidate set. Iterate until all of the nodes have high enough association.
4. Check whether the remaining nodes have high enough average association (determined by the threshold MINCOMPACTNESS).
5. Check the size of the candidate set (determined by the threshold MINSIZE).
When MINASSOCIATION and MINCOMPACTNESS are both 1, the resulting regions correspond to the cliques of the graph.
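The following sketch puts steps 1-5 together, reusing the helper functions from the previous sketch. The threshold names mirror the text; the exact order in which weakly associated nodes are dropped is our own assumption, since the paper does not specify a tie-breaking rule.

def dense_region(adj, X, min_association=0.4, min_compactness=0.6, min_size=12):
    densities = {Y: conditional_density(adj, X, Y) for Y in adj if Y != X}
    # largest K with at least K nodes of density >= K
    K = 0
    for k in range(1, len(adj) + 1):
        if sum(1 for d in densities.values() if d >= k) >= k:
            K = k
    candidate = {Y for Y, d in densities.items() if d >= K}
    # iteratively drop the most weakly associated node until all pass the threshold
    changed = True
    while changed and candidate:
        changed = False
        for Y in sorted(candidate, key=lambda n: association(adj, n, candidate)):
            if association(adj, Y, candidate) < min_association:
                candidate.discard(Y)
                changed = True
                break
    if len(candidate) < min_size or compactness(adj, candidate) < min_compactness:
        return None
    return candidate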
4.3 Algorithm for Graph Theoretic Clustering
Given dense regions, to find the clusters of the graph,
1. Merge the regions that have enough overlap, determined by the threshold MINOVERLAP, if all of the nodes in the set resulting after merging have high enough associations.
2. Iterate until no regions can be merged.
The result is a collection of clusters in the graph. Note that a node can be a member of multiple clusters because of the overlap allowed between them.
5 Experiments and Results
The test database consists of 340 images which were randomly selected from a database of approximately 10,000 aerial and remote sensing images. The images were grouped into 7 categories: parking lots, roads, residential areas, landscapes, LANDSAT USA, DMSP North Pole and LANDSAT Chernobyl, to form the groundtruth.
5.1 Clustering Experiments
The first step of testing the proposed retrieval algorithm is to check whether the clusters formed by the graph–theoretic clustering algorithm are visually consistent or not. First, each image was used as a query to search the database, and for each search, N top-ranked images were retrieved. Then, a graph was formed with all images as nodes and for each node N edges corresponding to its N top-ranked images. Finally, the graph was clustered by varying the parameters like N, MINASSOCIATION and MINCOMPACTNESS. In order to reduce the possible number of parameters, MINSIZE and MINOVERLAP were fixed as 12 and 0.75 respectively. The resulting clusters can overlap. This is a desired property because image content is too complex to be grouped into distinct categories. Hence, an image can be consistent with multiple groups of images. To evaluate the consistency of a cluster, we define the following measures. Given a cluster of K images,
CorrectAssociation_k = #{i | GT(i) = GT(k), i = 1, ..., K} / K   (3)
gives the percentage of the cluster that image k is correctly associated with, where GT(i) is the groundtruth group that image i belongs to. Then, consistency is defined as
Consistency = (1/K) Σ_{k=1}^{K} CorrectAssociation_k .   (4)
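Equations (3) and (4) amount to the short routine below (our own transcription); 'groundtruth' is assumed to map an image identifier to its groundtruth group.

def consistency(cluster, groundtruth):
    K = len(cluster)
    correct = [
        sum(1 for i in cluster if groundtruth[i] == groundtruth[k]) / K   # Eq. (3)
        for k in cluster
    ]
    return sum(correct) / K                                               # Eq. (4)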
To select the best set of parameters, we define a cost function
Cost = 0.7 (1 − Consistency) + 0.3 (Percentage of unclustered images)   (5)
and select the parameter set that minimizes it. Here Consistency is averaged over all resulting clusters. Among all possible combinations of the parameters given in Figure 1, the best parameter set was found as {N, MINCOMPACTNESS, MINASSOCIATION} = {15, 0.6, 0.4}, corresponding to an average Consistency of 0.75 with 6% of the images unclustered. Example clusters using these parameters are given in Figure 2. We observed that decreasing N or increasing MINCOMPACTNESS or MINASSOCIATION increases both Consistency and the Percentage of unclustered images.
Fig. 1. Consistency vs. Percentage of unclustered images for N ∈ {10, . . . , 70}, MINCOMPACTNESS ∈ {0.3, . . . , 1.0}, MINASSOCIATION ∈ {0, . . . , MINCOMPACTNESS}, MINSIZE = 12, MINOVERLAP = 0.75. Dashed lines correspond to the minimum cost.
5.2 Retrieval Experiments
We also performed experiments using all of the 340 groundtruthed images in the database as queries and, using the parameter set selected above, retrieved images in the clusters with the maximum compactness for each query. For comparison, we also retrieved only 12 top–ranked images (no clustering) for each query. Example queries without and with clustering are shown in Figures 3 and 4. We can observe that some images that are visually irrelevant to the query image can be eliminated after the graph–theoretic clustering. An average precision of 0.78 (compared to 0.73 when only 12 top-ranked images are retrieved) for the whole database showed that approximately 9 of the 12 retrieved images belong to the same groundtruth group, i.e. are visually similar to the query image. We also observed that, in order to get an improvement by clustering, the initial precision before clustering should be large enough so that the graph is not dominated by images that are visually irrelevant to the query image. In our experiments, when the initial precision was less than 0.5, the average precision after clustering was 0.19. For images with an initial precision greater than 0.5, the average precision after clustering was 0.93. The better the features are, the larger the improvement after clustering becomes.
(a) Consistency = 1
(b) Consistency = 1
(a) Using only 12 top–ranked images.
(b) Using graph–theoretic clustering.
Fig. 2. Example clusters for N =15, MINCOMPACTNESS=0.6, MINASSOCIATION=0.4, MINSIZE=12, MINOVERLAP=0.75.
Fig. 3. Example query 1. Upper left image is the query. Among the retrieved images, first three rows show the 12 most relevant images in descending order of similarity and the last row shows the 4 most irrelevant images in descending order of dissimilarity. When clustering is used, only 12 images that have the smallest distance to the original query image are displayed if the cluster size is greater than 12.
(a) Using only 12 top–ranked images.
(b) Using graph–theoretic clustering.
Fig. 4. Example query 2.
6 Conclusions
This paper addressed the problem of retrieving images that are quite irrelevant to the query image, which is caused by the assumption that the features are always able to locate visually similar images close enough in the feature space. We introduced a graph–theoretic approach for image retrieval by formulating the database search as a problem of finding the cliques of a graph. Experiments showed that some images that are visually irrelevant to the query image can be eliminated after the graph–theoretic clustering. Average precision for the whole database showed that approximately 9 of the 12 retrieved images belong to the same groundtruth group, i.e. are visually similar to the query image.
References 1. S. Aksoy and R. M. Haralick. Textural features for image database retrieval. In Proc. of IEEE Workshop on CBAIVL, in CVPR’98, pages 45–49, June 1998. 342 2. S. Aksoy, “Textural features for content-based image database retrieval,” Master’s thesis, University of Washington, Seattle, WA, June 1998. 342 3. C. Carson et al.. Color- and texture-based image segmentation using EM and its application to image querying and classification. submitted to PAMI. 342 4. P. Felzenszwalb and D. Huttenlocher. Image segmentation using local variation. In Proc. of CVPR, pages 98–104, June 1998. 342 5. B. Huet and E. Hancock. Fuzzy relational distance for large-scale object recognition. In Proc. of CVPR, pages 138–143, June 1998. 342 6. L. G. Shapiro and R. M. Haralick. Decomposition of two-dimensional shapes by graph-theoretic clustering. IEEE PAMI, 1(1):10–20, January 1979. 343 7. J. Shi and J. Malik. Normalized cuts and image segmentation. In Proc. of CVPR, pages 731–737, June 1997. 342 8. Zhenyu Wu and Richard Leahy. An optimal graph theoretic approach to clustering: Theory and its application to image segmentation. IEEE PAMI, 15(11):1101–1113, November 1993. 342
Motion Capture of Arm from a Monocular Image Sequence Chunhong Pan and Songde Ma Sino-French Laboratory in Computer Science, Automation and Applied Mathematics National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences [email protected]
Abstract. The paper develops a new motion capture method from a monocular sequence of 2D perspective images. Our starting point is arm motion. We first extract and track the feature points from the image sequence based on watershed segmentation and the Voronoi diagram, then by rigidity constraints and a motion modelling constraint we use motion analysis to recover the 3D information of the feature points. Finally the obtained data are used to simulate the motion of the model. An experiment with real images is included to demonstrate the validity of the theoretical results.
1 Introduction
Recently, human motion analysis has increased in importance for visual communications, virtual reality, animation and biomechanics [5,1,2], while generating appealing human motion is the central problem in virtual reality and computer animation. A wide variety of techniques have been presented for the process of creating a complex animation. Generally speaking, these techniques can be grouped into three main classes: keyframing [10], procedural methods [8], and motion capture [9]. The animations generated by these techniques are the so-called keyframe animation, procedural animation, and motion capture animation. Up to now motion capture is the only effective method to generate arbitrary human motion. Motion capture employs special sensors or markers to record the motion of a human performer with multiple cameras from different directions. The recorded data is then used to generate the motion for an animation. The system is able to estimate position with an accuracy of 0.1 diameter. However, to achieve such accuracy, it is necessary to have a complicated system composed of many special markers and 4-8 cameras that need to be accurately calibrated. Furthermore, many sensors have to be worn by a person all the time; this is stressful and hard to handle in many applications, which limits the use of the system. The determination of 3D motion by analysis of two or more frames captured at different instants is a major research topic in computer vision [12]. Generally there are two distinct ways. The first approach is based on optic flow [3]: by computing the optic flow of the images, the 3D motion of a rigid body can be determined. The second method depends on correspondence of features [7,6]. By extracting
a small number of features in the images corresponding to 3D object features, and then using the correspondence of these features in successive images, the parameters of motion can in theory be obtained. In this paper we study the unconstrained problem of estimating the 3D motion of the human arm based on the determination of the 3D motion. The obtained information is then used to generate an animation of the arm. We first extract and track joints in the images, and determine the 3D motion of the joints from these 2D correspondences. It is known that extracting and tracking feature points in grayscale image sequences or in segmented images is very difficult, while based on a 1D stick figure we can track the joint points conveniently over the image sequence. In order to obtain good correspondences of feature points across the sequence, we pre-process the grayscale image sequence. First, using gradient watershed segmentation, we obtain binary edge images; then, based on the Voronoi diagram, we skeletonize the binary edge images to obtain a 1D stick figure. Using the 1D stick figure we can establish good correspondences of the joint points over the image sequence manually.
2 Pre-Processing of Image Sequence
2.1 Motion Segmentation Based on Gradient Watershed
As a flexible, robust and efficient tool, the watershed transformation has been widely applied to tasks such as grayscale image segmentation, coding, etc. Here, in order to obtain robust area extraction, we use gradient watershed segmentation under geodesic reconstruction and the stratification transformation [14]. As a non-linear tool for image processing, gradient watershed segmentation has shown a special ability to extract areas that correspond well to objects in images. Efficient algorithms have also been proposed in the literature [13]. Nevertheless, simply applying the watershed transformation to an image will result in over-segmentation. To eliminate it, a scale space of the gradient watershed is necessary. In order to establish this scale space, two methods have been proposed, i.e. the multiscale geodesic reconstruction filter and the stratification transformation. The watershed transformation WS on an intensity image I(M, N) can be denoted on an 8-connectivity digital grid as WS(I) = {CB, WA}, where WA is a locally-connected subset of I that composes the watershed arcs and CB is the set of catchment basins. Without loss of generality, we allow catchment basins to include the adjoining watershed arcs and simplify the definition of the watershed transformation as:
WS(I) = {CB_i | i = 1, 2, ..., p} ,   I = ∪_i CB_i ;   (1)
Suppose R_min(I) = ∪_{1≤i≤p} N_i(g_i^(0)), in which N_i is an individual regional minimum with intensity g_i^(0); then we have:
CB_i = CB_i(g_i^(0)) = ΓZ_I(N_i) ;   (2)
where ΓZ_I is the geodesic influence zone transformation on the intensity surface I. Thus, for each CB_i, N_i is its only regional minimum. When building the scale space of the gradient watershed, the multiscale filter Φ_n we use is the combination of opening by reconstruction of erosion and closing by reconstruction of dilation, i.e.,
Φ_n(I) = γ_n^rec(φ_n^rec(I)) ;   (3)
The feature extractor at scale n (the feature being the catchment basins of the watershed transformation) is
WS_n^∇(I) = WS(∇_m(Φ_n(I))) = {CB_i(n)^∇ | 1 ≤ i ≤ p_n} ;   (4)
here ∇_m is the morphological gradient operator defined as:
∇_m(f) = δ_1(f) − ε_1(f) .   (5)
According to Eq. (2), suppose the gradient watershed partition of an intensity image I is
WS_0^∇(I) = {CB_i(0)^∇(g_i^(0)) | 1 ≤ i ≤ p_0} ;   (6)
for a point I[x, y], its stratification transformation ψ_s(I[x, y]) is defined as
ψ_s(I[x, y]) = g_i0^(0) , if [x, y] ⊂ CB_i0(0)^∇(g_i0^(0)) , 1 ≤ i_0 ≤ p_0 .   (7)
From Eq. (4) and Eq. (5), the segmentation operator we use is defined as
WS_n^∇(ψ(I)) = WS(∇_m(Φ_n(ψ(I)))) .   (8)
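A hedged sketch of the multiscale gradient-watershed step using scikit-image follows. It mirrors Eqs. (3)-(5) but is not the authors' implementation; in particular the stratification transformation of Eq. (7) is omitted, and the structuring element sizes are our own choice.

import numpy as np
from skimage.morphology import disk, erosion, dilation, reconstruction
from skimage.segmentation import watershed

def open_by_reconstruction(img, n):
    seed = erosion(img, disk(n))
    return reconstruction(seed, img, method='dilation')

def close_by_reconstruction(img, n):
    seed = dilation(img, disk(n))
    return reconstruction(seed, img, method='erosion')

def multiscale_watershed(img, n):
    filtered = open_by_reconstruction(close_by_reconstruction(img, n), n)   # Eq. (3)
    gradient = dilation(filtered, disk(1)) - erosion(filtered, disk(1))     # Eq. (5)
    labels = watershed(gradient)                                            # Eq. (4)
    return labels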
2.2 Skeletonization Based on Voronoi Diagram
Skeletonization in the plane denotes a process which transforms a 2D object into a 1D line representation, comparable to a stick figure. The skeletonization algorithm we use is based on the Voronoi diagram (VD) [11]. The Voronoi diagram is a well-known tool in computational geometry. Let p_i and p_j denote two elements from a set Ω of N points in the plane. The locus of all points which are closer to p_i than to p_j is a half-plane H(p_i, p_j) containing p_i, bounded by the perpendicular bisector of p_i p_j. Consequently, the locus V_1(p_i) of points closer to p_i than to any other point p_j amounts to a convex Voronoi polygon and can be computed by intersecting at most N−1 half-planes H(p_i, p_j). The first-order Voronoi polygon can be expressed as
V_1(p_i) = ∩_{i≠j} H(p_i, p_j) ;   (9)
then the first-order VD of Ω is the collection of all V_1(p_i):
Vor_1(Ω) = ∪_{p_i ∈ Ω} V_1(p_i) .   (10)
The Voronoi diagram ν is made of all the edges of these polygons [4]. By construction, every point V in ν is located at the same distance from its 2 nearest sites A and B on the boundary of the shape. Theoretically, the skeleton S of the shape can also be defined as the median line between opposite boundaries, so S is a subset of ν. By calculating the angle ∠AVB, where A and B are the nearest sites used for defining the Voronoi diagram and V is a point of the boundary between the two Voronoi regions owned by A and B, we can decide whether or not V belongs to S. When A and B are located on the same side of the shape boundary, the angle ∠AVB will be small, and when A and B are located on opposite sides, the angle will be large. A point V belongs to the median line if the angle ∠AVB is large enough, which can be represented in the following form:
S = {V | V ∈ ν, ∠AVB > K}   (11)
where K is an empirical tuning constant chosen to avoid disconnected areas in the median line as well as "barbules", and A, B are the nearest sites of V in ν.
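The angle criterion of Eq. (11) can be sketched on top of scipy's Voronoi diagram as shown below. This is our own illustration: 'points' are assumed to be densely sampled boundary points of the binary shape, and K is the angle threshold in radians.

import numpy as np
from scipy.spatial import Voronoi

def voronoi_skeleton(points, K=2.0):
    vor = Voronoi(points)
    keep = set()
    for (a, b), ridge in zip(vor.ridge_points, vor.ridge_vertices):
        A, B = vor.points[a], vor.points[b]        # the two nearest sites of this ridge
        for v in ridge:
            if v == -1:                            # ridge extends to infinity
                continue
            V = vor.vertices[v]
            u1, u2 = A - V, B - V
            cos_angle = np.dot(u1, u2) / (np.linalg.norm(u1) * np.linalg.norm(u2) + 1e-12)
            angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
            if angle > K:                          # Eq. (11): V lies on the median line
                keep.add(v)
    return vor.vertices[sorted(keep)]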
3 Performance Analysis with Constraint Fusion
Articulated model of arm: We propose a stick model for the human arm, which consists of 3 joint points; the links between adjacent joint points are rigid. In the articulated model, the points represent model features and the segments represent rigidity constraints. The model has 7 degrees of freedom (DOF). In order to keep the model as simple as possible and to have as few degrees of freedom as possible, we do not model the shape of the hand and simply assume it to extend along the axis of the forearm. In our approach, we first use rigid constraints between the feature points of the arm model. This means that the distance between certain feature points of the model should remain constant during the whole tracking process. Furthermore, from anthropometric data we know that the lengths of the links are fractions of the body height. For example, the relative lengths of the forearm and upperarm are 0.206 and 0.22 of the body height respectively, so a constraint on the length relationship between the forearm and upperarm can also be added to our algorithm.
Motion analysis and rigidity constraint: The image geometry of the multiple-frame motion analysis problem is shown in Figure 2. A pinhole camera model is used. Images are taken of a moving object at evenly-spaced times. The feature point P_i^(j) denotes the ith point at time t_j. Its 3D space coordinates will be given by P_i^(j)(X_i^(j), Y_i^(j), Z_i^(j)). The projection of P_i^(j) onto the image plane will be denoted by p_i^(j) = (x_i^(j), y_i^(j)). We use the subscript i to denote the ith point, while the superscript (j) plays the role of "primes" and refers to points at time t_j (z_i' = z_i^(2), z_i'' = z_i^(3)). When a pinhole camera is used, the object and image point coordinates are related by (for j = 0, 1, 2, ...):
x_i^(j) = f X_i^(j) / Z_i^(j) ;   y_i^(j) = f Y_i^(j) / Z_i^(j) ;   (12)
Fig. 1. The articulated model of the arm (shoulder: 3 DOF, elbow: 1 DOF, wrist: 2 DOF); the points represent model features.
(j)
Here f is the focal length of the camera. Images coordinate (xi , yi ) can be known from image plane when frames are sampled at different time, and correspondences of points over all the views are easily achieved due to checking out manually based on 1D figure stick. Thus the problem can be stated as: P0 y Z P1
P0’
P1’ o
Y
O
x
X
Fig. 2. Basic perspective imaging geometry for motion analysis Given correspondences of points throughout all the frames, how to (j) (j) (j) (j) find Pi (Xi , Yi , Zi ). It is well-known that from a monocular image sequence, the 3D depth of an isolated point is impossible to recover only by uncalibrated camera in rigidity constraints, In our analysis, we attempt to track only the arm, for simplicity, we keep the 3D position of the shoulder fixed in space and first calculate the relative 3D coordinates of another endpoints. In this case, P0 is stationary throughout (j) all the views. Equivalently, Z0 is equal for all j. Here we use an approach of decomposition. The arm under analysis is decomposed into two parts, each containing a single link: forearm and upperarm. Since we have assumed that the shoulder is fixed the upperarm is simpler part than forearm. We first analyze the motion of the upperarm and then propagate the analysis to forearm. By using the approach we can obtain 3D motion information of arm. First we have rigidity constraints over frames: (j)
P0 − P1 = P0
(j)
− P1 = l;
f or j = 0, 1, 2...
354
Chunhong Pan and Songde Ma (j)
(j)
l denotes the length of the link, and P0 , P1 are two endpoints of any link. For upperarm due to position of shoulder fixed, usimg Eqs.(12) this can be written as: (x21 + y12 + f 2 )Z12 − 2(x0 x1 + y0 y1 + f 2 )Z0 Z1 = (j)2 (j)2 (j)2 (j) (j) (13) (x1 + y1 + f 2 )Z1 − 2(x0 x1 + y0 y1 + f 2 ) (j) Z0 Z1 = l (j)
In above equations the depths Z0 and Z1 are unknown variables for j = 0, 1, 2, .... In fact if the shoulder is fixed the depth Z0 is only the scale. Our aim (j) is to calculate the relative depths Z1 over views. Motion modelling and optimal constraint: To obtain the unique 3D infomation of feature points over frames. A motion model is needed, here we assume that the arm moves with smooth motion during a short period of time. We know that when the shoulder fixed, the feature point of the elbow always moves in surface of sphere with radius of length of upperarm. Let dj be the (j) (j+1) distance between P1 and P1 when the feature point of the elbow moves (j)
from P1
(j+1)
to P1
(j)
(j+1)
, and θj be the angle between the line P1 P1
(j+1) (j+2) P1 P1 .
and the
(j) P1
line Here denotes point trajectory of the elbow over views. The smooth motion means that the variances of the distance and the angle above have the minimal values. This can be represented as follow: J1 = min
N −1
|dj − d|;
j=1
J2 = min
N −2
|θj − θ|;
j=1
where: d=
N −1 N −2 1 1 dj and θ = θj (N − 1) j=1 (N − 2) j=1
This is an optimization problem with rigidity constraints which can be solved by dynamic programming. Since the shoulder is fixed, its depth Z_0 can be seen as a scale factor, while the other endpoint of the upperarm satisfies the rigidity constraint, which means it moves on the surface of the sphere. By adding the motion model to this approach we obtain the relative 3D coordinates of the elbow feature point by dynamic programming. Once the relative 3D information of the two endpoints of the upperarm has been determined by the above method, the 3D coordinates of one endpoint of the adjacent forearm are known, and the length of the upperarm, up to scale, is also known. The rigidity constraint equation above then yields a second-degree polynomial equation with one unknown variable for each frame. Generally there are two solutions to a second-degree polynomial equation. We impose the above optimality constraints on the unknown endpoint of the forearm; furthermore, the depths Z^(j) must be physically realizable solutions, Z^(j) > 0, so we obtain a unique solution for Z^(j).
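As an illustration of the per-frame step for the forearm, the sketch below solves the quadratic in the unknown wrist depth that the rigidity constraint produces once the elbow position is known. It is a simplified stand-in for the procedure described above: instead of the dynamic-programming search over the smoothness criteria J_1 and J_2, it simply keeps the positive root closest to the previous frame's depth. All names and the interface are ours.

import numpy as np

def wrist_depth(elbow, x, y, f, l, z_prev=None):
    # elbow: (Xe, Ye, Ze) known 3D elbow point; (x, y): wrist image coordinates;
    # f: focal length; l: forearm length. Wrist 3D point is (x*Z/f, y*Z/f, Z).
    Xe, Ye, Ze = elbow
    a = (x * x + y * y + f * f) / (f * f)
    b = -2.0 * (x * Xe + y * Ye + f * Ze) / f
    c = Xe * Xe + Ye * Ye + Ze * Ze - l * l
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None                      # constraint cannot be met exactly (noise)
    roots = [(-b + s * np.sqrt(disc)) / (2.0 * a) for s in (+1.0, -1.0)]
    roots = [z for z in roots if z > 0]  # physically realizable depths only
    if not roots:
        return None
    if z_prev is None:
        return min(roots)
    return min(roots, key=lambda z: abs(z - z_prev))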
4 Experiment
The human model, which consists of the arm, was constructed from rigid links connected by rotary joints with one, two and three degrees of freedom. The dynamic model shown in Figure 3 has 15 body segments, which are represented by rigid generalized cylinders. The points of contact between the rigid links were skinned automatically.
Fig. 3. Model used to perform the experiment.
In order to acquire real data to test our algorithms, we used a video camera to record scenes of a moving arm. We videotaped the motion in an unconstrained scene and generated an image sequence. Figure 5 gives some sampled frames from a database sequence. We then pre-processed the image sequence using the watershed-based segmentation and the Voronoi-diagram-based skeletonization discussed above. Figures 6 and 7 show the binary edge images and the skeleton images respectively. From the skeleton image sequence, we obtained a set of 2D trajectories of feature points by manually tracking the joint points such as the elbow and the wrist. The (x, y) coordinates of each dot in all the sampled frames are used to analyze the motion. Figure 4 shows the 3D trajectories of the elbow and wrist up to scale. Obviously, once the length of the upperarm or forearm is determined, one can obtain the real 3D coordinates of the elbow and wrist, and once the body height is known the lengths of the arm segments can easily be calculated. Finally, the scaled 3D coordinates of the elbow and wrist obtained from the above approach are used to generate the motion of the arm modelled by the rigid generalized cylinders. Figure 8 shows the simulated motion of the arm. Due to noise, correspondence errors, and the approximate assumptions of the articulated model, it is in fact impossible to satisfy the rigidity constraints exactly. However, our motion model is based on smooth movement, and we obtain the 3D data of the joints by an optimal numerical search, so when the movement is small the method is effective.
5 Conclusion
A new motion capture method based on feature point correspondences over frames has been proposed. We first pre-process the image sequence and obtain 1D skeleton images. Using rigidity constraints and motion modelling we then recover the 3D information of the feature points. Experimental results show that the method is efficient.
Fig. 4. Motion trajectories of elbow and wrist
Fig. 5. A sampled image sequence with arm moving
Fig. 6. The segmented binary edge images
Fig. 7. The skeleton of binary edge image
Fig. 8. Simulated motion of a human arm
References 1. Devi L. Azoz Y. and Sharma R. Tracking hand dynamics in unconstrainted environments. In Proceedings of IEEE International Conference on Computer Vision, pages 274–280, 1998. 349 2. Barsky B. Badler N. and Zeltzer D. Making Them Move. Morgan Kaufmann, 1991. 349 3. Horn B.K.P. and Schunk B.G. Determining optical flow. Artificial Intelligence, 17:185–203, 1981. 349 4. Yu Z.Y. Delerue J.F., Perrier E. and Velde B. New algorthms in 3d image analysis and their application to the measurement of a spatialized pore size distribution in soils. to appear in Journal of Physics and Chemistry of the Earth, 1998. 352 5. Ureslla E. Goncalves L., Bernardo E.D. and Perona P. Monocular tracking of the human arm in 3d. In Proceedings of IEEE International Conference on Computer Vision, pages 764–770, 1995. 349 6. Netravali A.N. Holt R.J., Huang T.S. and Gian R.J. Determining articulated motion from perspective views: A decomposition approach. Pattern Recognition, 30:1435–1449, 1997. 349 7. Robert J.H. and Netravali A.N. Number of solutions for motion and structure from multiple frame correspondence. Intel. J. of Computer Vision, 23:5–15, 1997. 349 8. Hodgins J.K. Biped gait transition. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2092–2097, 1991. 349 9. David J.S. Computer puppetry. IEEE Computer Graphics and Applications, 18:38– 45, 1998. 349 10. Shoemaker K. Animation rotation with quaternion curves. In Proceedings of SIGGRAPH’85, pages 245–254, 1985. 349 11. Ogniewicz R.L. and Kubler O. Hierarchic voronoi skeletons. Pattern Recognition, 28:343–359, 1995. 351 12. Huang T.S. and Netravali A.N. Motion and structure from feature correspondence: A review. In proc. IEEE, volume 88, pages 252–258, 1994. 349 13. L. Vincent. Morphological grayscale reconstruction in image analysis: Applications and efficient algorithms. IEEE Transaction on Image Processing, 2:176–201, 1993. 350 14. Songde Ma Yi Li, Ming Liao and Hangqing Lu. Scale space of gradient watershed. to appear in Journal of Image and Graphics, 1998. 350
Comparing Dictionaries for the Automatic Generation of Hypertextual Links: A Case Study Isabella Gagliardi and Bruna Zonta CNR-ITIM Via Ampere 56, 20131 Milano Tel +39 02 7064 3270 / 53, Fax +39 02 7064 3292 {isabella, bruna}@itim.mi.cnr.it
Abstract. There is a great need for tools that can build hypertexts from "flat" texts in an automatic mode, assigning links. This paper addresses the problem of the automatic generation of similarity links between texts that are relatively homogeneous in form and content, such as the cards of an art catalogue. The experimentation it describes has compared the results obtained using weighted and unweighted supervised dictionaries with those produced using weighted and unweighted automatic dictionaries.
Introduction
There is a great need for tools that can build hypertexts from "flat" texts in an automatic, or at least partially automatic, mode, especially when the hypertexts concerned have a very high number of nodes and links. The work needs to be done more rapidly, and it must be systematized, in part to avoid involuntarily following different philosophies of thought in assigning links. A study of the state of the art, through a bibliographical search facilitated by the great number of articles investigating the problem that can be found in journals [11] and proceedings, as well as on the Internet, shows that the effectiveness of an algorithm depends greatly on the characteristics of the texts to which it is applied. The problems and solutions proposed in the literature lie between two extremes: strongly structured documents, that is, documents equipped with indexes, chapters, subdivisions, cross-references, etc., on the one hand; and linear, unstructured documents on the other. In this paper we address the problem of the automatic generation of associative links between texts that are relatively homogeneous in form and content, such as the cards of a catalogue describing works of art. The basic idea is that the presence in two cards of a certain number of common terms, in proportion to the sum of their terms, indicates that these cards can be linked to each other for their "conceptual similarity", and that the corresponding objects can as a consequence also be compared to each other, for their "perceptual similarity" [16]. In this process dictionaries, that is, the sets of terms used to establish whether two cards can be linked, and in what measure, play an important role. We have prepared a series of dictionaries, differing in the way they are built and in their
semantic content - supervised, supervised and weighted, automatic, automatic and weighted - and these dictionaries have been tested with an algorithm that calculates the similarity. The algorithm, the dictionaries, and the results obtained from the application of each are presented in this paper. In particular, the first section discusses related works, together with concepts of help in understanding our work; the second describes our approach, while the third contains details of the experiments performed, and preliminary results. The section Conclusion and Future Developments points to more, in-depth research on the correspondence between the conceptual similarities of the cards and the perceptual similarities of the objects they describe. An advance in this direction would make it possible, starting from texts, to automatically introduce links of comparison between similar images in multimedia art catalogues. This study is part of the project of the Italian National Research Council (CNR) on "Beni culturali: metodi e strumenti per la creazione di archivi multimediali nel settore della ceramica" (Cultural Resources: methods and tools for the creation of multimedia archives in the ceramic sector) developed at ITIM in Milan. 1. Related works The increasing availability of collections of on-line textual documents too large to allow the manual authoring and construction of a hypertext, is the main reason for the current interest in the study and implementation of fully or partially automated techniques. A pioneering few [9,12,13] began research in this field before hypermedia applications became as widespread as they are today. In 1995 a workshop on "IR and the Automatic Construction of Hypermedia" was held during the ACM SIGIR conference, and in 1997 the authoritative journal IP&M published a monographic issue on the subject [11]. In [3], Agosti supplied the key notions involved in the automatic construction of hypertexts, together with a brief selection of experiments conducted in this field. Salton et al. [14] proposed a technique that can be used to create links between text segments and practically construct a hypertext at retrieval time. They had to deal with the problem of identifying internally consistent fragments from available texts, and used a graph representation to show the results. More recently, Allan [4] has addressed in particular the problem of managing the different types of links. The technique he proposes provides a wholly automatic method for gathering documents for a hypertext, associating the set, after identifying the type of link, with its description. Document linking is based upon IR similarity measures with adjustable levels of strictness. Agosti & Crestani [2] have proposed a design methodology for the automatic generation of an IR hypertext, starting from a collection of multimedia documents and using well established IR techniques. Tudhope [17] has designed a semantic hypermedia architecture, in which the semantic similarity of information units forms the basis for the automatic construction of links integrated into hypermedia navigation. This architecture has been implemented in a prototype application: "A museum of social history".
In 1995 the authors designed and implemented a hypermedia Information Retrieval application on CD-ROM: "Sixteenth Century Genoese Textiles" [6,7]. Identification of the hypertextual links is based upon a pattern matching method, in two different contexts:
• given a glossary, the catalogue cards are automatically connected to the items in the glossary, forming referential links. The results have been most satisfactory;
• the texts of the cards indicate cross-references and comparisons, such as "different techniques, but similar type of decoration: Savona, Sanctuary of Our Lady of Charity, Museo del Tesoro, no. 183 [card in the same catalogue]; Mayer van Den Bergh Museum, inventory no. 1655 [referring to a different catalogue]". In this case the program links the card contained in the same catalogue, but ignores the second. More generally, the program's task is to associate only cards within the archive, ignoring cross-references to other catalogues. This algorithm has been moderately successful.
2. Our Approach
The procedure presented here has been designed to automatically define the links among the textual cards in multimedia art catalogues, where every card corresponds to an image of an object. Most of the cards contain partially structured texts, some of which are in a predefined format (name of the object, shape, dimensions, date, brief technical notes, etc.), and the rest in free text describing the subject represented on the object and how the object is decorated. A card of this type is rarely more than a page long, and usually of the length and style common to the catalogue. The free text tends to be concise, with few, and at any rate not significant, repetitions. From the point of view of automatic management, the parts in fixed text are easily subdivided into fields, while the uniform length and style of the parts in free text constitute a good premise for similarity comparison and the assigning of links. The fact that there are few repetitions eliminates the problem of the frequency of terms in the text of each card. Unfortunately, these cards are in general compiled by various people, with different degrees of expertise, over a period of time which may be very long. Consequently, the terminology is usually not homogeneous, and the texts would have to be normalized, or classified, to be considered a satisfactory basis for effective automation. The procedure described here has been designed to calculate the similarity between cards that have already been normalized: when this similarity is found, the two texts are connected by a link of the associative type. The similarity model used is the conceptual "contrast" type, which considers similarity an intersecting of features [16], in this case of terms. The basic idea is that the presence of terms common to two different cards indicates that these can be considered similar to each other. The possible links thus identified are a function of the number of terms present in the two cards, and have a "strength" in (0,1). The similarity between two texts is defined by the number of terms in common in proportion to the total number of terms of the two cards.
The model used has clearly suggested the choice of the well-known formula proposed by Salton [15]:

sim_i,j = 2 (w_i term_i ∩ w_j term_j) / (w_i term_i ∪ w_j term_j)
to which weights can be assigned; w_i is the weight associated with term_i throughout the catalogue, as we shall explain below. The results depend, obviously, upon the terms chosen for comparison. This choice can be made in two extreme manners: automatically, with the use of lists of stopwords, or manually, by experts in the domain who indicate the more significant terms according to certain criteria. Each method produces a different dictionary: the automatic dictionary is the richer one, the supervised the more specific. We have compared the results obtained using a supervised dictionary and a weighted supervised dictionary with those obtained using an automatic dictionary.
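To make the computation concrete, the similarity can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: the function name and the representation of a card as a list of dictionary terms are illustrative assumptions, and the denominator follows the textual definition (the total number, or total weight, of terms of the two cards), so that identical cards score 1.

def card_similarity(terms_a, terms_b, weight=None):
    # terms_a, terms_b: dictionary terms found in the two cards
    # weight: optional mapping term -> weight; unweighted terms count as 1.0
    w = (weight or {}).get
    a, b = set(terms_a), set(terms_b)
    shared = sum(w(t, 1.0) for t in a & b)
    total = sum(w(t, 1.0) for t in a) + sum(w(t, 1.0) for t in b)
    return 2.0 * shared / total if total else 0.0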
3. The Experiment
The experiment is part of a CNR project for the preservation of "Cultural Resources": the objective has been to compare the effectiveness of an automatic dictionary with that of supervised dictionaries, to see whether and how much the results improve in the latter case. The art catalogue employed in the experiment was The Cora Donation: Medieval and Renaissance Ceramics of the International Museum of Ceramics of Faenza [1], containing over 800 cards describing as many objects. In the catalogue each ceramic object is represented by an image, in color or black and white, and described by a card in text. Figure 1 shows a typical catalogue card and the corresponding image. The image elements and the textual description of subject and decoration do not always correspond exactly. This is due primarily to the fact that the image is bi-dimensional, while the description refers to a three-dimensional object. The free text, that is, the text describing the subject and/or the decoration of the object, was used to assign the links. Explicit references present in the field "Analogies" were ignored, since they had already been experimented with on the Genoese Textiles catalogue [6,7]. We did use these references later to verify the quality of the links assigned with our procedure. For the same purpose, when the cross-reference was of the type "Decoration like that of the preceding piece", the original description was repeated. Various trials were run in the course of the experimentation, each with a different dictionary, and the results were then compared. The Supervised Dictionary (SD) was created by:
• Extraction of the descriptors: over 1000 terms considered significant for the description of the subject and decoration were extracted manually.
• Creation of the lexicon: these descriptors were reduced to about 700 by unifying the variations, which were essentially of three kinds:
− graphic (with/without quotation marks, upper/lower case, ...);
− morphological-derivational (singular/plural, noun/adjective, noun/diminutive, ...);
− lexical ("composition/decoration", "writing/inscription/caption", "woman/female").
A taxonomy was also created, grouping the descriptors into categories, and these in turn into higher-level categories, for a total of three levels. These categories were used in the experimentation with weights. The Weighted Supervised Dictionary (WSD) contains the same terms as the SD, but weights have been applied to them. The weight can be assigned automatically, on the basis of the number of times the term occurs in the entire collection, following well-established procedures of Information Retrieval (IR), or manually, considering the importance of the term in the domain, or in the collection, regardless of its frequency. The former procedure was used here. Originally the adopted weights were: 1 for frequencies from 1 to 25, 0.75 for frequencies from 26 to 50, 0.50 for frequencies from 51 to 75, 0.25 for frequencies from 76 to 100, and 0.10 for frequencies over 100. To be able to assign an additional value to strongly characterizing terms ("lion") as compared with those designating abstract decorative borders ("braid"), the above values have been diminished. After many tests, the values have been set at 0.70, 0.55, 0.40, 0.25 and 0.10 respectively, so that adding the value of 0.30 to terms such as "lion" moves them two classes higher. A procedure was also set up that allows the user to assign a greater/lesser weight to some terms (or categories of terms) at query time, in order to express any specific interest. The Automatic Dictionary (AD) contains the words extracted by the ISIS Information Retrieval System. Consequently it is composed of all the terms present in the fields of the subject and decoration, except for the terms present in a stoplist. No stemming procedure has been applied, as no satisfactory algorithm is available for the Italian language. Since adding weights to this dictionary did not produce substantially different results, those results are not shown here.
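The frequency-to-weight mapping just described can be sketched as follows. This is an illustrative Python sketch; how strongly characterizing terms were actually marked is not specified in the paper, so the boolean flag below is an assumption.

def term_weight(frequency, characterizing=False):
    # maps a term's collection frequency to its WSD weight
    if frequency <= 25:
        w = 0.70
    elif frequency <= 50:
        w = 0.55
    elif frequency <= 75:
        w = 0.40
    elif frequency <= 100:
        w = 0.25
    else:
        w = 0.10
    # strongly characterizing terms ("lion") get an extra 0.30,
    # moving them two classes higher
    return (w + 0.30) if characterizing else w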
4. Preliminary Results
Both the supervised and automatic dictionaries always assigned a similarity value of 1 when the texts were identical, and similarity values that varied, but were always among the highest, when the "Analogies" field contained cross-references. As for the differences among the various dictionaries, we saw that the SD generally assigned higher values than the WSD, but on the whole in accord with it. Because of the way it is structured, the AD considers a larger number of terms, but does not recognize equivalent terms, and consequently assigns lower and less differentiated values, without excluding, as the other dictionaries did, terms designating colors. This means that color plays a determinant role, not desired in this case, in the identification of links. The table below summarizes the number of terms in each dictionary, the number of times these appear on the cards, and the average number of terms per card, together with the number of cards with non-zero similarity. The SD gave no results for three of the cards for extraneous reasons, such as the fact that the description of the subject had not been entered in the proper place.
                                           AD      SD     WSD
No. of terms in the dictionary           2040     690     690
No. of times terms appear on the cards  17266    5869    5869
Average no. of dictionary terms per card   18       6       6
No. of cards with non-zero similarity     918     915     915
We ran the program on all the cards in the catalogue. The following table lists, for each of the first seven cards, the card linked to it with the highest similarity value, and the value computed by each of the three dictionaries.

CardId   AD linked card   sim     SD linked card   sim     WSD linked card   sim
001           775        0.285         632        0.454         622         0.452
002           246        0.451         751        0.500         582         0.496
003           754        0.562         192        0.444         186         0.360
004           697        0.571         692        0.500         359         0.473
005           747        0.382         654        0.428         654         0.490
006           696        0.450         689        0.461         446         0.326
007           847        0.297         282        0.444         282         0.438
The following table summarizes the above results, where by minimum and maximum values is always meant the value of the card linked with the highest value (the first card in decreasing order).

                                       AD      SD     WSD
Minimum similarity value              0.142   0.153   0.111
Maximum value (excluding 1)           0.965   0.923   0.984
Average similarity value              0.609   0.632   0.615
Absolute interval of variation        0.858   0.847   0.889
Interval of variation (excluding 1)   0.823   0.770   0.873
The eight observers who participated in the evaluation of the results obtained by the three dictionaries were shown the image of the object corresponding to the query card, and the images of the objects corresponding to the cards the different dictionaries linked with it. To facilitate their task and allow them to repeat it a sufficiently large number of times, only the card linked with the highest value by each dictionary was considered. Their task consisted in ranking the images compared with the query by decreasing similarity. All eight found it difficult at first to restrict their attention rigorously to the task of evaluating only the similarity of subjects and decorations, ignoring shape, use, color, epoch, and style. However, with some experience the task was more readily performed. Fifty images were taken for comparison, and 150 images were compared with these, of which, however, only 97 were different, since the same image could be selected more than once, either in different contexts, or because the dictionaries agreed in the same context. The observers, who did not know which dictionaries had assigned which links, produced the following rankings:

order    AD    SD    WSD
I         8    19     23
II       15    18     17
III      27    13     10
In the course of the experiment the observers found that some of the images called forth by the supervised dictionaries did not at all resemble the query image. Analysis of the cards and the corresponding descriptors identified at least two reasons for these incongruities: either the descriptors referred to parts of the three-dimensional ceramic object that were not visible in the two-dimensional image, or the texts of the cards were not sufficiently appropriate and specific. The first drawback could easily be eliminated by using a series of images or films that show the object from various points of view. The second could be remedied, at least in part, by establishing norms for the compilation of the cards, together with a Thesaurus of the domain. The program was written in Microsoft Visual Basic 4.0(TM) with a control for the management of the HTML pages. Microsoft Access 97(TM) was used for the database; the images of the objects in the catalogue, originally in TIF format, were converted to GIF format, and in general reprocessed with Microsoft Photo Editor(TM).
5. Conclusions and Future Developments
This paper has presented a procedure for the automatic generation of hypertextual links among texts (cards) in art catalogues. We have used the very simple formula defined by G. Salton et al. (enriched with weights) to thoroughly examine the role of dictionaries in the successful realization of the links. In our experiment four different dictionaries were created and tested, and the results of three of these, the SD, WSD and AD (the WAD registered results very similar to those of the AD), were evaluated, on the basis of the corresponding images, by eight observers. As anticipated, better results were obtained with the supervised dictionaries than with the automatic dictionary. To effectively automate the entire procedure would take universal taxonomies such as ICONCLASS, or at least domain Thesauruses that could serve as guidelines for the drafters of the cards, and as filters for the automatic compilation of the dictionary. Integrating the algorithm presented here in a system that allows the automatic linking of images on the basis of their perceptual similarities may further improve results. Automatically classifying the pictorial content of the images to create the text-image links, or evaluating the "semantic" similarity of the images on the basis of low-level features alone (with a corresponding evaluation of their perceptual similarity), is generally an arduous task. But it will be possible in this application, despite the complexity of the images, because of the homogeneity of the database. In any case this will allow us to investigate any correlations between an image and the textual description of the represented object, or between the textual description of an object and the features that represent the corresponding image. A prototype system for the automatic creation of image-to-image and text-to-image links is now in an advanced state of construction. To create the text-to-image links we plan to apply the CART classification strategy. For the creation of image-to-image links, the set of features and the measure for the perceptual-semantic similarity of the images will be selected by means of a relevance feedback mechanism which we are now developing [8,10].
ID code no.: 487 Object: Albarello Height: 20 cm; diameter of base: 8.4 cm Material: Majolica Origin: Montelupo Period: XVI century Subject: In the central area, there are two circular medallions, framed by festoons, containing the S. Bernardino IHS monogram; between the two medallions, a decoration with Persian palmettos. Glaze: orange, green, yellow, blue, red.
ID code no.: 488 Object: Mug Height: 20.5 cm; diameter of base: 8.5 cm Material: Majolica Origin: Montelupo Subject: Front: a circular cartouche with an undulate border containing the S. Bernardino I.H.S. monogram framed by a festoon. Sides: vertical bands with Persian palmettos. Under the handle the initial P. Glaze: brown, blue, orange and green.
Fig. 1. Id. Card no. 487
Fig. 3. Id card no. 488 linked to card 487, using the WSD, with the value of 0.862
ID code no.: 402 Object: Mug Material: Majolica Origin: Cafaggiolo Period: 1520 ca. Subject: Front: a large circular medallion with a festoon and the S. Bernardino I.H.S. monogram; the remaining surface is decorated with grotesques on a blue background; back: under the handle, a graffito monogram SP. Glaze: orange, gray, yellow, blue, green and brown. Analogies: Preservation: Good
Fig. 2. Id card no. 402 linked to card 487, using the SD, with the value of 0.777
ID code no.: 490 Object: Globular vase with two handles Subject: On both faces, circular medallions containing a shield with palms and framed by festoons, surrounded by Persian palmettos. Under the handle the initial P. Glaze: blue, green, orange, yellow and red.
Fig. 4. Id card no. 490 linked to card 487, using the AD, with the value of 0.666
References
1. La donazione Galeazzo Cora: ceramiche dal medioevo al XIX secolo, Museo Internazionale delle Ceramiche in Faenza, Gruppo Editoriale Fabbri, Milano, 1985.
2. Agosti M., Crestani F., Melucci M., "Design and implementation of a tool for the automatic construction of hypertexts for information retrieval", Information Processing & Management, Vol. 32(4), pp. 459-476, 1996, Elsevier Science Ltd.
3. Agosti M., Crestani F., Melucci M., "On the use of information retrieval techniques for the automatic construction of hypertext", Information Processing & Management, Vol. 33(2), pp. 133-144, 1997, Elsevier Science Ltd.
4. Allan J., "Building hypertext using information retrieval", Information Processing & Management, Vol. 33(2), pp. 145-159, 1997, Elsevier Science Ltd.
5. Carrara P., Della Ventura A., Gagliardi I., "Designing hypermedia information retrieval systems for multimedia art catalogues", The New Review of Hypermedia and Multimedia, Vol. 2, pp. 175-195, 1996.
6. Carrara P., Gagliardi I., "A collection of antique Genoese textiles: an example of hypermedia Information Retrieval", poster session, HIM 95, Konstanz (Germany), 5-7/4/95.
7. Carrara P., Gagliardi I., Della Ventura A., CD-ROM Tessuti Genovesi del Seicento, new version, 1996.
8. Ciocca G., Schettini R., "Using a Relevance Feedback Mechanism to Improve Content-based Image Retrieval", Third International Conference on Visual Information Systems, Amsterdam, 2-4 June 1999 (submitted).
9. Frisse M. E., "Searching for information in a hypertext medical handbook", Communications of the ACM, Vol. 31(7), 1988.
10. Gagliardi I., Schettini R., Ciocca G., "Retrieving Color Images by Content", in Image And Video Content-Based Retrieval, February 23rd 1998, CNR, Milano.
11. Information Processing & Management, Vol. 33(2), 1997, Elsevier Science Ltd.
12. Pollard R., "A hypertext-based thesaurus as a subject browsing aid for bibliographic databases", Information Processing & Management, Vol. 29(3), pp. 345-357, 1993, Pergamon Press Ltd.
13. Rada R., "Converting a Textbook to Hypertext", ACM Trans. on Inf. Sys., Vol. 10(3), pp. 294-315, July 1992.
14. Salton G., Singhal A., Mitra M., Buckley C., "Automatic text structuring and summarization", Information Processing & Management, Vol. 33(2), pp. 193-207, 1997, Elsevier Science Ltd.
15. Salton G., Automatic Text Processing, Addison-Wesley, New York, 1989.
16. Similarity in Language, Thought and Perception, edited by Cristina Cacciari, Brepols, 1995.
17. Tudhope D., Taylor C., "Navigation via similarity: automatic linking based on semantic closeness", Information Processing & Management, Vol. 33(2), 1997, Elsevier Science Ltd.
Categorizing Visual Contents by Matching Visual “Keywords”
Joo-Hwee Lim
RWCP, Information-Base Functions KRDL Lab
21 Heng Mui Keng Terrace, S(119613), Singapore
Tel: +65 874-6671, Fax: +65 774-4990
[email protected]
Abstract. In this paper, we propose a three-layer visual information processing architecture for extracting concise non-textual descriptions from visual contents. These coded descriptions capture both the local saliencies and the spatial configurations present in visual contents via prototypical visual tokens called visual “keywords”. Categorization of images and of video shots represented by keyframes can be performed by comparing their coded descriptions. We demonstrate our proposed architecture on natural scene image categorization, where it outperforms methods which use aggregate measures of low-level features.
1 Introduction
Automatic categorization of text documents has received much attention in the information retrieval and filtering community (e.g. [7,8]). Visual content categorization is relatively less explored in multimedia database and retrieval research, though pattern classification and object recognition are well-studied fields. This is because, in general, visual contents (images, videos, etc.) are complex and ill-defined. More often than not, visual content categorization involves human visual perception. The latter is difficult due to two problems. First, interpreting visual data is underconstrained: a visual content can be associated with multiple consistent interpretations of the world. Second, semantically similar contents can be manifested in many instances with variations in illumination, translation, scale, etc. Many existing visual information systems (e.g. [15]) extract and annotate the data objects in the visual content manually, often with some assistance from user interfaces. It is assumed that once keywords are associated with the visual content, text retrieval techniques can be deployed easily. Although text descriptions are certainly important to reflect the (largely conceptual) semantics of multimedia data, they may result in a combinatoric explosion of keywords in the attempt at annotation, due to the ambiguous and variational nature of multimedia data.
(RWCP: Real World Computing Partnership; KRDL: Kent Ridge Digital Labs)
Also, there is a limit to how much semantic information the textual attributes can provide [3]. Visual content-based retrieval systems (e.g. [11,13,2]) have mainly focused on using primitive features such as color, texture, shape, etc. for describing and comparing visual contents. Very often, aggregate measures of an image’s color and texture are employed as a signature for image similarity comparison. This will often produce results incongruent with human expectations [9]. For example, images sharing a similar overall color distribution can differ greatly in semantic content. We argue this point further with the following scenario analysis. Suppose a coast/seaside image I0 (left half of Figure 1) is scrambled into I1 (right half of Figure 1). Based solely on distributions of color or other low-level features, I0 and I1 will be considered similar though they are perceptually dissimilar. Scrambling I0 in different ways can easily produce perceptually incoherent images I2, I3, ... to fool a search engine that relies only on distributions of low-level features and make its performance look bad for comparison.
Fig. 1. An example image and its scrambled version

When these feature-based techniques are applied to individual objects, an object is often the focus for retrieval and not much consideration has been given to the interrelationship among the objects. In a different approach that advocates the use of global configuration, the work reported in [14] developed a method for extracting relational templates that capture the color, luminance and spatial properties of classes of natural scene images from a small set of examples. The templates are then used for scene classification. Although the method improves over a previous effort [9] that hand-crafted the templates, scene representation and similarity matching are computed through the relationships between adjacent small local regions, which seem rather complex for comprehension. In this paper, we propose a three-layer visual information processing architecture for extracting concise non-textual descriptions from visual contents. Starting from the pixel-feature layer, the architecture progressively extracts locally salient visual information and spatially distributed configuration information present in the visual contents at the next two higher layers respectively. In a nutshell, visual contents are described in terms of prototypical visual tokens called visual “keywords”. The resulting descriptions are coded via singular value decomposition
for dimensionality and noise reduction. To demonstrate our novel architecture, we employ these coded descriptions for content comparison in a scene categorization task. When compared with the popular methods that rely on distribution of low-level features, our method has shown superior classification performance.
2 Content Description & Comparison
2.1 Visual “Keywords” Extraction
A key to alleviating the problems of ambiguity and variation in visual content for a visual information processing task such as categorization is to exploit its inherent statistical structure. There are prototypical visual entities present in the contents of a given distribution of visual documents (e.g. digital images, video shot keyframes). Using statistical learning methods, these visual “keywords” can be derived from a sufficiently large sample of visual tokens of a visual content domain. A visual token is a coherent unit (e.g. region of pixels) in a visual document. A visual content can then be spatially described in terms of the extracted visual “keywords”. For supervised learning, detectors for salient objects such as human faces, pedestrians, foliage, clouds, etc. can be induced from a training set of positive and negative examples of visual tokens collected from visual documents of a given visual content domain (e.g. [12]). Detectors may be further specialized for different views (e.g. faces of frontal and side views, skies of cloudy and clear days, etc.) to improve their detection accuracy. Alternatively, unsupervised methods such as self-organizing maps, the fuzzy c-means algorithm, and the EM algorithm can be used to discover regularities in the visual tokens in visual documents. Clusters that represent prototypical visual tokens are formed from a training set of visual tokens sampled from visual documents of a given visual content domain.
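A rough sketch of the unsupervised route is given below; it uses plain k-means from scikit-learn as a stand-in for the fuzzy c-means clustering used later in the paper, and the soft-membership function is an illustrative assumption rather than part of the proposed method.

import numpy as np
from sklearn.cluster import KMeans

def learn_visual_keywords(token_features, num_keywords):
    # token_features: (num_tokens, feature_dim) array of visual-token features
    km = KMeans(n_clusters=num_keywords, n_init=10, random_state=0)
    km.fit(token_features)
    return km.cluster_centers_            # one prototype per visual "keyword"

def keyword_memberships(token_feature, keywords, beta=1.0):
    # soft memberships in [0, 1] of one token with respect to all keywords
    d2 = np.sum((keywords - token_feature) ** 2, axis=1)
    m = np.exp(-beta * d2)
    return m / m.sum()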
2.2 Architecture
The proposed architecture has three layers (Figure 2). The lowest layer is a collection of low-level feature planes at pixel level (pixel-feature layer). For example, the color feature of an image can have three R, G, B planes of the same resolution. The middle layer, the Type Registration Map (TRM), is an abstraction of the lowest layer. More precisely, given an image I with resolution M × N, its TRM G has a lower resolution of P × Q, P ≤ M, Q ≤ N. Each pixel or node (p, q) of G has a receptive field R [1] that specifies a two-dimensional region of size rx × ry in I which can influence the node’s value. That is, R = {(x, y) ∈ I | xp ≤ x ≤ x'p, yq ≤ y ≤ y'q}, where rx = x'p − xp + 1, ry = y'q − yq + 1, and (xp, yq) and (x'p, y'q) are the starting and ending pixels of the receptive field in I respectively. We further allow tessellation displacements dx, dy > 0 in the X and Y directions respectively, such that adjacent pixels in G along the X direction (along the Y direction) have receptive fields in I which are displaced by dx pixels along the X direction (dy pixels along the Y direction) in I.
Fig. 2. Three-layer content description architecture

That is, two adjacent G pixels share pixels in their receptive fields unless dx ≥ rx (or similarly dy ≥ ry). For simplicity, we fix the size of the receptive field (rx, ry) and the displacements (dx, dy) for all pixels in G, and assume that (M − rx) is divisible by dx and (N − ry) is divisible by dy. A visual token tj is a receptive field in I. It can be characterized by different perceptual features such as color, texture, shape, and motion. The number of visual tokens in a visual document D is quantified by the spatial dimensions of its TRM G. Every pixel or node (p, q) in a TRM G registers the set/class membership of the visual token governed by its receptive field against the T visual “keywords” which have been extracted. In short, a TRM is a three-dimensional map, G = P × Q × T, that registers local type information. Likewise, the highest layer, the Spatial Histogram Map (SHM), is a summary of the TRM. A receptive field S of size sx × sy and displacements cx, cy are used to tessellate the spatial extent (P, Q) of the TRM with A × B, A ≤ P, B ≤ Q receptive fields. The memberships G(p, q, t) ∈ [0, 1] of visual “keywords” t at TRM pixels (p, q) that fall within the receptive field of SHM pixel (a, b) are histogrammed into frequencies of the different visual “keywords”, H(a, b, t):

H(a, b, t) = Σ_{(p,q) ∈ S(a,b)} G(p, q, t)     (1)
where S(a, b) denotes the receptive field of (a, b).
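Equation (1) simply sums the keyword memberships of all TRM nodes falling inside each SHM receptive field. A minimal sketch, assuming the TRM is stored as a P × Q × T array and the tessellation parameters are given:

import numpy as np

def spatial_histogram_map(trm, s_x, s_y, c_x, c_y):
    # trm: (P, Q, T) array of keyword memberships G(p, q, t)
    P, Q, T = trm.shape
    A = (P - s_x) // c_x + 1
    B = (Q - s_y) // c_y + 1
    shm = np.zeros((A, B, T))
    for a in range(A):
        for b in range(B):
            window = trm[a * c_x : a * c_x + s_x, b * c_y : b * c_y + s_y, :]
            shm[a, b, :] = window.sum(axis=(0, 1))   # Eq. (1): H(a, b, t)
    return shm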
2.3 Singular Value Decomposition
We apply Singular Value Decomposition (SVD) to the SHMs extracted from visual contents, analogous to Latent Semantic Analysis (LSA) [5]. We form the frequency
matrix X that associates visual “keywords” and visual documents as follows. Each column denotes a visual document in the form of H(a, b, t). Each row is about a visual term t in the receptive field of pixel (a, b). Thus each entry of X takes the value of H(a, b, t). SVD is carried out on X [5]:

X = U Σ V^T     (2)
where U, V are the matrices of left and right singular vectors, and Σ is the diagonal matrix of singular values. A coded description Ω of a visual document D (a query example or a database document) is computed as

Ω = D^T U_k Σ_k^{-1}     (3)

where U_k, Σ_k are approximated (truncated) versions of U, Σ respectively. Using this coded description, the similarity between two images x and y can be compared using appropriate similarity measures between their corresponding Ω_x and Ω_y.
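The coding step can be sketched with a plain SVD. This is an illustrative sketch using numpy rather than the authors' code; X is assumed to hold one column per visual document, each column being the flattened SHM, and the function names are hypothetical.

import numpy as np

def fit_svd_coder(X, k):
    # X: (num_terms, num_docs) frequency matrix of visual "keywords" vs. documents
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k]                # truncated U_k and diag(Sigma_k)

def code_document(d, U_k, s_k):
    # Eq. (3): Omega = d^T U_k Sigma_k^{-1}
    return d @ U_k / s_k

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))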
3 Experimental Results
Natural scene images from prepackaged PhotoCD collections from Corel [4,9,14] are used as test data in our experiments. We preclassify 347 images into the following non-overlapping classes (of sizes): coasts/seasides (59), fields (95), forests/trees (72), snowy mountains (85), and streams/waterfalls (36). Figure 3 shows three samples (rows) from each class (columns), in the left-to-right order as given in the previous sentence. Given an image (a query sample or a visual document), normalized to resolution 256 × 384, we extract color and orientation features based on the YIQ color model and Haar wavelet coefficients respectively. The RGB channels of a natural scene image are transformed into their equivalent values in the YIQ color space. A one-level Haar wavelet decomposition is applied to the Y channel to obtain the horizontal (H), vertical (V), and diagonal (D) details. Haar wavelets are chosen because they are fastest to compute and have been used with success [6,12]. As a result of preprocessing, an image is transformed into 3 YIQ planes of size 256 × 384 and 3 HVD planes of size 128 × 192. To extract visual tokens, a 32 × 32 receptive field and an 8 × 8 displacement size are used for the TRM on each YIQ plane. Equivalently, a 16 × 16 receptive field and an 8 × 8 displacement size are used for the HVD planes. A receptive field extracted from each of the YIQ planes is histogrammed into 100 bins in [0, 1] and the mode is taken as the feature value for the receptive field. For the HVD planes, only the 50 largest-magnitude coefficients for each plane are retained [6]. The feature value for a 16 × 16 receptive field is the frequency of these prominent coefficients. In short, a visual token is represented by a 6-dimension feature vector summarizing its dominant color and orientation components. One third of the visual tokens extracted from all images (i.e. one third of 452,835) are subjected to fuzzy c-means clustering. The resulting T cluster centers are the visual “keywords”. A TRM G is therefore a 29 × 45 matrix of T-element vectors.
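The color part of this token feature can be sketched as follows. This is an illustrative sketch: the channels are assumed to be rescaled to [0, 1] before histogramming, and the orientation features (counts of prominent Haar coefficients per receptive field) would be computed analogously.

import numpy as np

def rgb_to_yiq(rgb):
    # rgb: (H, W, 3) array of floats in [0, 1]; standard NTSC RGB-to-YIQ transform
    m = np.array([[0.299,  0.587,  0.114],
                  [0.596, -0.274, -0.322],
                  [0.211, -0.523,  0.312]])
    return rgb @ m.T

def mode_feature(patch, bins=100):
    # dominant value of one receptive field: the mode of a 100-bin histogram
    hist, edges = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    i = int(hist.argmax())
    return 0.5 * (edges[i] + edges[i + 1])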
Fig. 3. Sample images from five classes (columns)

A 15 × 18 receptive field and a 7 × 9 displacement size are applied to G, resulting in an SHM H of size 3 × 4. Since each histogram covers T types of visual “keywords”, the term vector has 12 × T elements. After SVD, the k largest factors are retained to form Ω. The similarity measure used is cosine. The leave-one-out method and a K-nearest-neighbour (K-NN) classifier are adopted. Each of the 347 images is used as an unknown input to the K-NN classifier, using the remaining 346 images as the training set. The classification rate is averaged over all 347 images. For K-NN, the number of nearest neighbours ranged over K = 1, 3, 5, ..., 19 and the best result is selected. Voting is done by summing up the similarity scores of the votes (up to K) from each class, which works better than sums of counts of votes in our empirical study. Table 1 summarizes the results for the different methods compared. The label “ColorHist” denotes the method that uses YIQ color histograms for comparing natural scene images. To maintain compatibility, 100 bins are also used for each of the 3 YIQ histograms, resulting in a 300-dimension vector for each image. Likewise, the result labelled “Wavelets” is produced by comparing visual contents based on the 50 largest-magnitude wavelet coefficients in each of the 128 × 192 HVD planes. The coefficients are quantized into {−1, 0, 1} depending on the signs of the truncated coefficients [6]. The label “CH+W” represents the method that combines those of “ColorHist” and “Wavelets” with equal weights. The label “200-vk” corresponds to the result of using the output
of our proposed architecture, H(a, b, t), with 200 visual “keywords” (vk), which peaks among the number of vk = 20, 30, 40, 50, 60, 80, 100, 120, 200 attempted. Based on 200-vk, SVD was carried out with the number of factors retained, k = 10, 20, 30, 50, 70, 90. The label “200-vk,k=50” shows the best result among the values of k.
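The leave-one-out, similarity-weighted K-NN protocol described above can be sketched as follows (an illustrative Python sketch; the coded descriptions and class labels are assumed to be available as arrays, and the function name is hypothetical):

import numpy as np

def knn_classify(query, codes, labels, K, skip=None):
    # codes: (N, k) array of coded descriptions; labels: class label per document
    sims = (codes @ query) / (np.linalg.norm(codes, axis=1) * np.linalg.norm(query) + 1e-12)
    if skip is not None:
        sims[skip] = -np.inf              # leave-one-out: exclude the query itself
    nearest = np.argsort(sims)[::-1][:K]
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0.0) + sims[i]
    return max(votes, key=votes.get)      # class with the largest summed similarity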
Table 1. Comparison of different methods

Methods        Classif. %
ColorHist        57.1
Wavelets         38.0
CH+W             59.1
200-vk           62.5
200-vk,k=50      66.9
From Table 1, we see that our proposed visual “keywords” and coded descriptions describe and discriminate visual contents in our experiments better than the popular methods that rely on aggregate measures of low-level features. Table 2 shows a breakdown of classification rates for each of the five classes. It is interesting to note that the performance of our proposed method is roughly proportional to the size of the training set in the classes. The streams/waterfalls class seems to be the toughest class for all methods compared, due to its small sample size and the variations in our data. Compared with the histogram-based methods, our method scores better in the classes fields, forests/trees, and snowy mountains. The seas and mountains appear in varying spatial layouts in the coasts/seaside images; thus they tend to favor global aggregate measures more than the regular tessellation of our method used in this paper. With a context-sensitive spatial layout [10], the result will be improved.
Table 2. Class breakdown of classification rates

Class (size)              ColorHist   CH+W   200-vk,k=50
coasts/seaside (59)         71.2      79.7      50.8
fields (95)                 43.2      46.3      75.8
forests/trees (72)          62.5      58.3      69.4
snowy mountains (85)        65.9      68.2      77.6
streams/waterfalls (36)     38.9      38.9      38.9
4 Conclusions
In this paper, we have described a novel visual content description generation architecture. Low-level features of a visual content are progressively abstracted into spatial histograms of visual “keywords” and coded by SVD for effective and efficient similarity matching. Encouraging experimental results on image categorization of natural scenes have been obtained when compared to popular methods that use aggregate measures of low-level features. We will consider supervised learning [12] and other coding schemes in further experimentation.
References
1. Arbib, M.A. (Ed.): The Handbook of Brain Theory and Neural Networks. The MIT Press (1995).
2. Bach, J.R. et al.: Virage image search engine: an open framework for image management. In Storage and Retrieval for Image and Video Databases IV, Proc. SPIE 2670 (1996) 76–87.
3. Bolle, R.M., Yeo, B.L., Yeung, M.M.: Video query: research directions. IBM Journal of Research and Development 42(2) (1998) 233–252.
4. Corel (1998). http://www.corel.com.
5. Deerwester, S. et al.: Indexing by latent semantic analysis. J. of the Am. Soc. for Information Science, 41 (1990) 391–407.
6. Jacobs, C.E., Finkelstein, A., Salesin, D.H.: Fast multiresolution image querying. In Proc. SIGGRAPH’95 (1995).
7. Larkey, L.S., Croft, W.B.: Combining classifiers in text categorization. In Proc. of SIGIR’96 (1996) 289–297.
8. Lewis, D.D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In Proc. of SIGIR’94 (1994) 81–93.
9. Lipson, P., Grimson, E., Sinha, P.: Configuration based scene classification and image indexing. In Proc. of CVPR’97 (1997) 1007–1013.
10. Lim, J.H.: Learnable Visual Keywords for Image Classification (1999, in preparation).
11. Niblack, W. et al.: The QBIC project: querying images by content using color, textures and shapes. Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908 (1993) 13–25.
12. Papageorgiou, P.C., Oren, M., Poggio, T.: A general framework for object detection. In Proc. ICCV (1998).
13. Pentland, A., Picard, R.W., Sclaroff, S.: Photobook: content-based manipulation of image databases. Intl. J. of Computer Vision, 18(3) (1995) 233–254.
14. Ratan, A.L., Grimson, W.E.L.: Training templates for scene classification using a few examples. In Proc. IEEE Workshop on Content-Based Analysis of Images and Video Libraries (1997) 90–97.
15. Rowe, L.A., Boreczky, J.S., Eads, C.A.: Indices for user access to large video databases. Storage and Retrieval for Image and Video Databases II, Proc. SPIE 2185 (1994) 150–161.
Design of the Presentation Language for Distributed Hypermedia System
Michiaki Katsumoto and Shun-ichi Iisaku
Communications Research Laboratory of the Ministry of Posts and Telecommunications
4-2-1 Nukui-Kitamachi, Koganei City, Tokyo 184-8795 Japan
Tel: +81-42-327-6425 Fax: +81-42-327-7129
[email protected]
Abstract. We describe a new control language for our Dynamic Hypermedia System, HMML, which controls multimedia presentations by extending HTML. HTML is a language used for displaying information on the browser; it can display text, images, movies, etc. in a window. If Java or Dynamic HTML is used, then viewing moving objects is also possible. However, these languages are not necessarily capable of scene synchronization and lip synchronization. Moreover, although SMIL provides simple scene synchronization, it does not guarantee QoS requirements. Therefore, a language is needed that provides lip synchronization and complicated scene synchronization while guaranteeing QoS requirements.
1 Introduction
We have designed new presentation models for a next-generation hypermedia system with a sophisticated hypermedia structure. Hardman et al. [4,5] organize hypertext presentations by nodes and links, multimedia presentations by a combination of continuous and discrete media, and hypermedia presentations by an extended hypertext presentation model in which each node organizes one multimedia presentation. However, this definition of a multimedia presentation is inadequate because it does not clearly define the temporal synchronization between continuous media, such as audio and video, and between continuous media and discrete media, such as images, graphics and text, for the presentation scenario. It also does not consider the transmission of scenario-dependent media over a network while maintaining the temporal and spatial relations. Consequently, we defined a hypermedia presentation model as one consisting of several multimedia presentations [1]. In a previous paper, we designed a Hypermedia-on-Demand system (HOD) [2] based on a client-agent-server architecture to provide hypermedia presentations. In addition to this, we provide the control functions for hypermedia presentations [3]. Multimedia information on the Internet can be accessed by using a World Wide Web browser. HTML (HyperText Markup Language) is used for displaying information on the browser [6]. This language can display text, images, movies, etc. in a window. If Java [7] or Dynamic HTML [8] is used, then it is also possible to view moving objects. However, these languages are not necessarily capable of scene and lip synchronization of movies and audio. Moreover, although
SMIL (Synchronized Multimedia Integration Language) provides simple scene synchronization, it does not guarantee QoS requirements [9]. Therefore, a language and its functions are needed for providing lip synchronization and complicated scene synchronization which guarantees QoS requirements. In this paper we describe HMML (Hypermedia Markup Language), which is used for controlling hypermedia presentations, and discuss its control functions.
2 Dynamic Hypermedia System
2.1 The Architecture
The Dynamic Hypermedia System (DHS) is a network-oriented platform for multimedia information networks to provide multimedia information based on hypermedia presentation. Its architecture is made up of three components: client agents, a knowledge agent, and multimedia databases, as shown in Fig. 1. The client agents are located at user stations and provide the users with multimedia presentation capabilities. The knowledge agent manages the links to information units through dynamic linking methods [10] and generates multimedia objects. The multimedia databases (MDB) manage multiple media objects, such as text, image, video and audio data.
Fig. 1. Architecture of Dynamic Hypermedia System
2.2 Presentation Models
We have proposed three presentation models for an advanced information infrastructure: the dynamic hypertext, multimedia and dynamic hypermedia models. These models constitute the next-generation hypermedia information system. They are more sophisticated than the Amsterdam Hypermedia Model [4,5] because they include dynamic linking mechanisms and QoS guarantee functions that make hypermedia information access more flexible. Dynamic Hypertext Model: One fundamental difference between this model and the conventional hypertext model is that this model supports a dynamic link to the next node during user interaction. The next node is linked by dynamic link methods which search for media data to match the user's intellectual background or level of interest in the information. Multimedia Model: The multimedia model is fairly self-explanatory. Several media sources are integrated temporally and spatially to create presentations.
Dynamic Hypermedia Model: This model, shown in Fig. 2, integrates the dynamic hypertext model with the multimedia model. In other words, a node of the dynamic hypermedia model is constituted by the multimedia model.
(MMS: Multimedia Structure; RTP: Reference Time Point; CRTP: Current Reference Time Point; DL: Dynamic Linking; CS: Context Switching)
Fig. 2. Illustration of the hypermedia model.
3 Presentation Control Language
The presentation control language describes the control structure used to present hypermedia presentations with scenarios in the Dynamic Hypermedia System. The presentations are provided based on the scenario, which is interpreted and executed by this language in the multimedia browser of the client agent. For a multimedia scenario, the structure, time control, and navigation of this language are also considered. Moreover, the source of this language is written as text which extends HTML and describes the scenario with consideration of the capabilities of the components, so that the scenario can be read easily. Details of the functions are described below.
3.1 Structure
A multimedia scenario described by the HMML has a hierarchic structure that consists of two or more scenarios, and these scenarios serve as components of the higher-layer scenario. The concept of the hierarchy of a scenario can be extended to higher layers; in the HMML the hierarchy can be extended to any number of layers. The scenario in the DHS defines four layers: the media layer (the 1st layer), scene layer (the 2nd layer), story layer (the 3rd layer), and title layer (the 4th layer). The media layer: In this layer, the behavior of a single medium is described, such as animation objects, images, video, text, and buttons. The scene layer: In this layer, the behavior of media scenarios is described. The head of the scenario of this scene layer is described as a Reference Time Point (RTP) and is used for carrying out navigation. The story layer: In this layer, the behavior of scene scenarios is described. This scenario serves as the description unit in the presentation control language.
The title layer: In this layer, the behavior of two or more stories is described.
3.2 Control of Time
In this section we describe the control of the presentation time of a multimedia scenario. The multimedia scenario has a hierarchic structure and two or more scenarios exist in each hierarchy. Presentation time control exists for each scene and for the overall scenario. The presentation time control for each scene is described as local time information enclosed in the scenario. That is, a single scenario, which is a component of a multimedia scenario, describes the starting time information of each scene as an origin of the local time of the scenario. Suppose that scenario A of the 2nd hierarchy contains scenes 1 and 2 of the 1st hierarchy. The start time of scene 2 is described as T2 in scenario A. When the time of an event of scene 2 is described as t1, the time T of the event in scenario A can be expressed as T = T2 + t1.
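As a minimal illustration of this rule (a hypothetical helper in Python, not part of the HMML specification):

def story_time(scene_start, local_time):
    # T = T2 + t1: an event at local time t1 within a scene that starts at
    # scene_start in the story occurs at story time T
    return scene_start + local_time

# e.g. scene 2 starts at T2 = 30 s in scenario A; an event at t1 = 5 s
# within scene 2 occurs at T = 35 s on scenario A's time line
print(story_time(30.0, 5.0))   # 35.0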
3.3 Navigation
The HMML can also perform navigation according to the described scenario. Navigation denotes temporal and spatial moves between presentations. Two kinds of navigation are specified: temporal navigation and dynamic linking. Temporal navigation: This navigation event moves to the head of a scene or to an RTP within the scenario when a navigation event occurs during scene reproduction, and starts the presentation at that scene or RTP. Dynamic Linking: This navigation event moves to the head of another story when a navigation event occurs during the multimedia presentation.
4 Language Specification
4.1 Definition of a Window
The rectangular viewing area on the display is called a window. There are five kinds of logical windows and four kinds of actual windows; for example, the general window is actually specified by the presentation control language. The display window shows the viewing area of the display itself and can show the size of the viewing area required by the scenario. The general window specifies its height and width by the number of dots. The story window, scene window, and media window specify their position (X, Y). The position of the origin is expressed as the offset from the upper-layer window. The size of a window expresses its height and width by the number of dots.
4.2 Architecture
The HMML describes the information elements by using a structured notation technique and structured tags. The top layer is only allowed the statement: ... .
The statement of a tag is permitted by the 2nd layer: <MEDIA>, <STORY>, and <EVENT_ACTION>. The tag is reflected in the general window. The tag which is inside the general window and the tag shown by <MEDIA> which is outside the story window are described by the 2nd layer; that is, the information of a media object which is displayed on the story window. The 3rd and lower layers are the same. The skeleton model shown in Fig. 3 is an example of a description of a scenario. In this example, the story consists of one scene containing one media scenario.

<MEDIA ID=1 ATTRIBUTE="JPEG">
  <MEDIA_INFO>
<STORY>
  <STORY_INFO>
  <SCENE ID=1>
    <SCENE_INFO>
    <MEDIA ID=2>
      <MEDIA_INFO>
<EVENT ID=7>
  <EVENT_ACTION>
Fig. 3. Example of structuring description.
4.3 Events and Actions
Two kinds of events can be described in the HMML: user events and timer events. This event handling becomes indispensable when performing navigation. A timer event is set up beforehand, when it is described in the scenario, and fires while the scenario is performed; for example, after the end of a story, navigation can be carried out automatically without user interaction. An action describes the operation started when an event occurs. The relationship between an event and an action is determined when a scenario is constituted. That is, a different scenario can be constituted if the association of events and actions differs, even when combining the same scenes in the same sequence. Therefore, the description of an action and the statement of the corresponding relationship between an event and an action are located outside of the <STORY>.
5 Presentation Management Module
5.1 Multimedia Controller
Coupled multimedia controllers, one in the client agent and one in the knowledge agent, work together in one presentation. The multimedia controllers execute the synchronization control based on the scenario. The multimedia controller in the client agent manages and controls all media objects that constitute a presentation, and controls scene synchronization by the message passing method. Moreover, the scenario time, which shows the progress of a presentation, is managed; if a reference-time move event from a user is detected, a control message is transmitted to all media objects and the move to the new current reference time is performed, as shown in Fig. 4. When the QoS parameters of the whole scenario must be modified, the stream management and control module, which supervises the load within the system and the status of the network, negotiates with the other modules according to the status of the shift, and notification of the modification is sent to the multimedia controller. The multimedia controller which receives the notification then notifies each media object of the parameters, maintaining the QoS priorities in the scenario. A presentation is then provided by each media object, maintaining QoS assurance with the new QoS parameters through the QoS maintenance module. In the knowledge agent a transmission schedule is changed if needed.
Fig. 4. Multimedia Controller.
5.2 Hypermedia Controller
The hypermedia controller, within the client agent and the knowledge agent, controls when the multimedia controllers start and terminate, using the message passing method, and controls the context switching based on the author's description in each presentation scenario, as shown in Fig. 5. Moreover, the client agent receives status messages from the multimedia controller in the client agent, which makes it possible for the context control to perform synchronization.
Fig. 5. Hypermedia Controller
5.3 User Events
The events which a user can use to control multimedia presentations within the multimedia presentation and the hypermedia presentation are as follows: start, for starting a new multimedia presentation; pause, for pausing a multimedia presentation; resume, for resuming a paused multimedia presentation; jump, for moving to a reference time point by temporal navigation; quit, for ending a multimedia presentation; and select, for navigation by dynamic linking. These events are carried in the control messages of HMML. For example, with select, a user clicks the media data (objects such as buttons are also included) of the multimedia presentation directly in the hypermedia presentation. This action triggers and performs navigation, from the embedded support, dynamically to the RTP or to the new multimedia presentation.
6 Conclusion
This paper has described a language for controlling multimedia and hypermedia presentations through the presentation control module. The functional validation of HMML using our original browser has now been completed. We will next develop a general-purpose browser.
References
[1] M. Katsumoto, N. Seta, and Y. Shibata, “A Unified Media Synchronization Method for Dynamic Hypermedia System,” Journal of IPSJ, Vol. 37, No. 5, pp. 711-720, May 1996.
[2] M. Katsumoto and S. Iisaku, “Design of Distributed Hypermedia System Based on Hypermedia-on-Demand Architecture,” Journal of IPSJ, Vol. 39, No. 2, Feb. 1998.
[3] M. Katsumoto and S. Iisaku, “Design of the Presentation Controller Functions for Distributed Hypermedia System,” Proc. of ICOIN-12, pp. 206-211, Jan. 1998.
[4] L. Hardman, D.C.A. Bulterman, and G. Rossum, “Links in Hypermedia: the Requirement for Context,” ACM Hypertext ’93, pp. 183-191, Nov. 1993.
[5] L. Hardman, D.C.A. Bulterman, and G. Rossum, “The AMSTERDAM Hypermedia Model: Adding Time and Context to the Dexter Model,” Comm. ACM, Vol. 37, No. 2, pp. 50-62, 1994.
[6] T. Berners-Lee and D. Connolly, “Hypertext Markup Language - 2.0,” IETF RFC 1866, Nov. 1995.
[7] http://java.sun.com/
[8] D. Gulbansen and K. Rawlings, “Special Edition Using Dynamic HTML,” QUE Corporation, 1997.
[9] http://www.w3.org/TR/1998/REC-smil-19980615/
[10] M. Katsumoto, M. Fukuda, and T. Shibata, “Kansei Link Method based on User Model,” Proc. of ICOIN-10, pp. 382-389, 1995.
A Generic Annotation Model for Video Databases
Herwig Rehatschek (1) and Heimo Müller (2)
(1) Institute of Information Systems, JOANNEUM RESEARCH, Steyrergasse 17, A-8010 Graz, Austria
(2) Faculty of Arts, Word & Image Studies, Vrije Universiteit Amsterdam, de Boelelaan 1105, 1081 HV Amsterdam, Netherlands
Abstract: The change from analogue broadcasting to digital MPEG-2 channels among the satellite programs has resulted in new demands on video databases and archives. Digital archives offer, on the one hand, a reduction of storage costs and, on the other hand, enable easy reuse of already existing material. However, searching for appropriate film material in large archives is still a tedious problem. This paper describes a generic annotation model for MPEG movies which enables the user to structure a film in as many hierarchical levels as needed and to annotate any physical or logical part of the film with generically definable attributes. The model was implemented in a prototype system which additionally offers a query and ordering facility via web browser over the Internet.
1 Introduction
An increasing number of satellites offering digital MPEG-2 channels (e.g. DF 1, Astra Service, Premiere Digital, RAI, Intelsat, ...) mark the start of a new age in the distribution of films and videos. This results in an increasing demand for content annotation in order to reuse already existing archive material for cost-effective productions. However, searching for appropriate film material in a large film archive is still a tedious task. Parts of films can only be searched and retrieved if annotations are available. In practice there are many different ways of annotation, depending on the overall approach (annotation based on a thesaurus, keywords, or only free text) and the application domain (broadcast archive, industrial archive, cultural archive). An additional problem arises from the use of different annotation languages and country-specific character sets. When film archives are opened for commercialization or for the public, the awkward handling of analogue film material becomes a problem. Digitization offers a number of advantages, including reduction of storage costs, no progressive decay, fast availability in different qualities (MPEG-1 for previewing purposes, MPEG-2 for sending, ...), reuse and copying of material without loss of quality, and fast access for internal personnel (Intranet) and customers (Internet). Within our implemented prototype system some of these problems are addressed and solved. A major focus was given to the interoperability across different application
domains and the problem of import/conversion of existing annotation data. The cross-platform exchange of annotation records was studied in detail. The system offers three annotation possibilities: a thesaurus-based one, one with generic keywords in combination with free text, and an automatic annotation facility.
2 Related Work
Several efforts have been undertaken to define appropriate data models for storing multimedia data. One model for storing a physical, time-based representation of digital video and audio was introduced by [1]. General concepts for the physical modeling of digital video and audio data are discussed and a specific model for storing Quicktime movies is introduced. The application of the general concepts allows the specific physical modeling of any other video format. The Layered Multimedia Data Model (LMDM) developed by [8] emphasizes the sharing of data components by dividing the process of multimedia application development into smaller pieces. LMDM calls for the separation of data, manipulation and presentation. Neither modeling approach concentrates on the topic of generic film annotation using user-definable attributes and values which can be attached to any physical or logical unit (e.g. an act, scene, shot) of a film. A lot of research has been done on the development of digital video databases and archives. Siemens has implemented the CARAT-ARC system [2], which is an open system for storing, indexing and searching multimedia data. Annotation of data is supported by using either a thesaurus or free text. However, the system is not designed to support off-line units, e.g. outsourcing of annotation and/or encoding to geographically dispersed locations. The VideoSTAR experimental database system [5], which was developed by the Norwegian Institute of Technology, supports storage of media files, virtual documents, video structures and video annotations in four repositories. Content-based querying and retrieval of film parts is achieved by annotation of logical parts of a film (sequence, scene, shot, compound units). Although the relational data model of VideoSTAR offers annotation, it is not generic in the sense that users can define new categories; it is limited to four categories, which can hold free text. There exist several sites offering search for film meta-information and download of movie clips on the Internet. Some just offer an alphabetically ordered list of films with previews, others offer a database system with access to stored film meta-information [7], [4].
3 The Prototype System
This section gives an overview of our prototype system architecture and its high level building blocks. The system is a very large digital video database holding all films in MPEG-2 format. Sources remain stored on Digital Betacam in order to fulfill any special format wishes of customers (e.g. S-VHS). Each film has annotations attached
which allow the search for specific parts or objects (e.g. acts, scenes, shots, actors, ...) in a film. Basically, the system consists of four units (see Figure 1): compression street(s), annotation site(s), a central digital video database, and the web interface for online search and ordering.
Figure 1: High level building blocks of the Digital Film Center
According to Figure 1, the filling process of the database can be described as follows: incoming videos are first encoded at the compression sites in two formats: MPEG-2 for storage at the central video tape archive and resale, and MPEG-1 for low-resolution previews and annotation purposes. The encoded material is then sent together with some film metainformation to a central video database on DLT tapes. The metainformation is stored in SGML [6] format in order to make the system as open as possible. The metainformation of the film is imported, while the MPEG data remains on the DLT tape. The database stores a reference to the tape and the location for later access. Now the film is ready for annotation and can be checked out by an annotation site. For this purpose the MPEG-1 representation together with the already existing film metainformation is sent to an annotation site, again using SGML as the exchange format. Since the compression streets and the annotation sites have a special
SGML-based off-line interface for importing and exporting information and data to the central video database, these units can be built at geographically dispersed locations all over the world. At the annotation site the film is annotated using special annotation software. The annotation is sent back in the same SGML-based format to the central database. Now information about the film and all parts of the film is contained in the video database. This information can be searched by customers via a web interface. Because of the attached annotations, searching for parts of the film or for specific objects within the film becomes possible. Any parts of the films can later be ordered on-line via the web interface.
4 The Generic Annotation Model
The most important and central component of the prototype system is the digital video database. It holds all information related to films and parts of films. Within the database there exist two main views on films: the logical and the physical view. The starting point is the logical film. It has several physical representations and is the target of annotations. This differs from current systems, where in most cases a physical representation of a film is annotated. Both views are modeled in the data scheme of the prototype system. One physical representation is the reference source. When describing differences in terms of annotation of different representations (e.g. different language versions, or evening versus late-night versions), all annotations are made relative (or in reference) to the time codes of the reference version. We want to stress the fact that when annotating a film there exist basic semantics: the temporal static structure of the film (referred to as static annotation) and annotations which vary in their semantics (referred to as dynamic annotation). For example, when annotating a video database of a hospital, person annotation will describe patients; when describing news material, we annotate real-world persons and their historic actions; and in the annotation of a movie, actors/characters of the movie are described. The annotation model of the database therefore defines a model to describe the basic semantics (temporal and logical structure of video) and provides a method to describe the dynamic part of the annotation data. The temporal model allows the construction of any given structure of a film in as many levels as needed. Subdivisions of films into several units are supported (e.g. a film may be divided into several acts; an act may consist of several scenes and a scene may be divided into several shots, where each shot has a special frame of interest). The data model consists of the units parts, sequences and groups. Parts are the smallest units. They address a number of frames (or even just one) and are defined by a start and end time code. Sequences can be defined recursively and can therefore again contain sequences. This allows modeling of as many levels as needed. Besides sequences, groups can be formed, which represent any combination of parts, sequences and also again groups.
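To make the part/sequence/group hierarchy concrete, the following Python sketch models it under our own naming; it is an illustration only, not the data scheme of the prototype.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Part:
    """Smallest unit: a contiguous run of frames, given by start/end time codes."""
    start_tc: str   # e.g. "00:21:23:14"
    end_tc: str

@dataclass
class Sequence:
    """Recursive unit: may contain parts or further sequences (shots, scenes, acts, ...)."""
    children: List[Union[Part, "Sequence"]] = field(default_factory=list)

@dataclass
class Group:
    """Arbitrary, possibly non-contiguous combination of parts, sequences and groups."""
    members: List[Union[Part, Sequence, "Group"]] = field(default_factory=list)

# A film structured into acts > scenes > shots:
shot1 = Part("00:21:23:14", "00:21:47:00")
shot2 = Part("00:21:47:01", "00:22:10:05")
scene = Sequence([shot1, shot2])          # level-0 sequence = scene
act   = Sequence([scene])                 # level-1 sequence = act
trailer_material = Group([shot2, scene])  # non-contiguous selection of key scenes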
Figure 2: Physical representation of an example film
Groups are not required to contain continuous sequences of time codes and are therefore a good instrument to structure a film according to key scenes (e.g. for a trailer or advertisement production). Since a film can have more than one version, these structures can exist for each version and may actually differ. The video database supports versions and can hold a structure for each of them. This is indicated by the version level at the bottom of Figure 3. An example of the temporal structuring of a movie is given in Figure 3. The film "Opernball" is structured in acts, scenes and shots using our hierarchical film data model. In this example, parts represent shots, sequences of first order represent scenes, and sequences of second order represent acts. All entities of the logical and physical structure of a film can be annotated and therefore also be searched. The semantics of such an annotation is defined in the so-called "annotation style file". An annotation style file holds a number of annotation attributes. Annotation attributes can be defined generically, in the sense that the user can define the attribute's name and its type. One annotation style concentrates on one special kind of movie, e.g. a medical film or a documentary film, and therefore has special annotation attributes. For example, for a documentary film some attributes could be "geography / city" (type text) or "animal / mammal / body temperature" (type number). Different styles can be created and stored in the database. Annotation styles are defined using SGML [6] in combination with natural language descriptions. The set of all annotation styles used is called the video object foundation class.
Figure 3: Logical structure of an example film
The generic annotation was implemented within the RDBMS by defining attributes by a name, a detailed description, a data type and a possible default value. Attributes defined in this way can then be assigned to any logical or physical film entities (e.g. parts, sequences, physical representations, etc.). Next to the generic annotation, the system supports a thesaurus-based keyword annotation. The thesaurus can be defined by the user with a special tool and stored within the database. The database supports different thesauri according to different kinds of movies. All annotation and encoded data is stored in the central database, which is accessible over the World Wide Web to customers for searching and ordering. The web interface provides access to the database material for customers. By filling an electronic shopping cart, authorized customers can order the desired film material - which can actually be parts of a film - in the desired quality. The querying possibilities offered support the search for generic annotation attributes as well as free text search. The result represents the parts of the film which have been returned, e.g. for the query "return all parts which contain a table". In addition to the detailed description of the part, including start and end time code and all the generic annotation attributes, authorized customers are able to preview material by clicking on a link within the result. Previews are stored on hard disks for fast access and not on mass storage devices, where the high-quality material of the archive is kept.
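As a rough illustration of this generic mechanism (a sketch under assumed names, not the actual database schema), attribute definitions and their assignment to film entities can be modelled as two simple structures:

# Hypothetical sketch of the generic annotation tables; names and types are illustrative.
attribute_defs = {
    # attribute name: (description, data type, default value)
    "geography/city": ("City shown in the footage", "text", ""),
    "animal/mammal/body temperature": ("Body temperature of the animal", "number", None),
}

annotations = []   # one row per (entity, attribute, value)

def annotate(entity_id, attr_name, value):
    """Attach a value for a previously defined attribute to any film entity."""
    desc, dtype, default = attribute_defs[attr_name]   # the attribute must be defined first
    if dtype == "number":
        value = float(value)
    annotations.append({"entity": entity_id, "attribute": attr_name, "value": value})

annotate("part:42", "geography/city", "Vienna")
annotate("sequence:7", "animal/mammal/body temperature", "37.2")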
5 Results and Conclusions
This paper addressed a digital video database system which allows (1) storage of digital videos and corresponding metainformation, (2) generic annotation of user-defined film structures, and (3) search access to annotation data via a web interface and a standard WWW browser. The video database of the prototype system is designed as a large, geographically dispersed system. Many encoding suites produce MPEG-2 videos, one central video database holds metainformation, annotations and references to the MPEG-2 files stored in a tape robot, and many annotation sites add film annotations to the stored movies. The central database can be accessed via a web interface by a standard WWW browser all over the world. The generic film data model of the system allows the hierarchical structuring of a film in as many levels as needed. This can be done on the one hand for the logical structure (e.g. acts, scenes, shots and frames) and on the other hand for the physical representation of a film. To each of these logical and physical entities annotations can be attached. The generic annotation model is the most remarkable part of the video database. It allows the free definition of annotation attributes with any user-defined name and type. These annotation attributes can be structured in so-called "annotation styles". Different annotation styles can be stored in the video database. One style refers to one specific annotation topic (e.g. medical films, action films, ...). The generic annotation is done by special annotation software which supports the annotator with a graphical user interface and an MPEG-1 preview. A second annotation possibility is thesaurus-keyword based, where the thesaurus can be dynamically created and exchanged. A web interface was developed in order to search the database and download previews. The web interface offers registered users the search for entire films (e.g. title search) and parts of a film. Search results can be collected in a shopping cart and ordered on-line. The quality of the ordered film material can be chosen by the customer. The prototype does not use a proprietary exchange format among the distributed units. All interfaces between the central video database, the annotation software and the encoding suites are SGML-based, which makes the prototype an open system. Imports from and exports to other video database systems, e.g. Media Vault, become possible.
6 Outlook
Currently, annotation styles are defined with SGML and natural language descriptions. In the future, formal specification methods could be used to describe the semantics of the annotation fields and their relations. Future work will stress the development of Video Object Foundation Classes, which describe a framework of basic object semantics, e.g. persons, settings, speech,
movement patterns, and methods of specializing these objects for a specific annotation style. The new member of the MPEG family, called "Multimedia Content Description Interface" (in short ‘MPEG-7’), will extend the limited capabilities of proprietary solutions in identifying existing content notably by specifying a standard set of descriptors that can be used to describe various types of multimedia information. Developments on this standard will be closely monitored and checked for integration into the prototype system.
7 Acknowledgments
This project was partially funded by the European Union ("Digital Film Center" ESPRIT project No. 25075, "VICAR" ESPRIT project No. 24916). Specific thanks go to our colleagues Bernd Jandl and Harald Mayer and to our Greek partners within the project who helped to realize this system.
8 References
[1] Ch. Breiteneder, S. Gibbs, D. Tsichritzis, "Modelling of Audio/Video Data", pp. 322-339, Karlsruhe, Germany, 1992.
[2] R. Depommier, N. Fan, K. Gunaseelan, R. Hjelsvold, "CARAT-ARC: A Scalable and Reliable Digital Media Archiving System", IEEE Int. Conf. on Image Processing, 1997.
[3] P. England, R. Allen, et al., "The Bellcore Video Library Toolkit. Storage and Retrieval for Image and Video Databases", pp. 254-264, San Diego/La Jolla, CA, USA, 1996.
[4] Film.com Inc., January 1998.
[5] R. Hjelsvold, R. Midtstraum, "Databases for Video Information Sharing", Proc. of the IS&T/SPIE, San Jose, CA, Feb. 1995.
[6] International Organization for Standardization, Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), Geneva: ISO, 1986.
[7] The Internet Movie Database, January 1998.
[8] Schloss, Wynblatt, "Providing Definition and Temporal Structure for MM Data", Proc. of the Second ACM Int. Conf. on Multimedia, ACM Press, ISBN 0-89791-686-7, San Francisco, CA, 1994.
Design and Implementation of COIRS (A COncept-Based Image Retrieval System)
Hyungjeong Yang, Hoyoung Kim, and Jaedong Yang
Dept. of Computer Science, Chonbuk National University, Chonju Chonbuk, 561-756, South Korea
Tel: (+82)-652-270-3388, Fax: (+82)-652-270-3403
{hjyang,hykim,jdyang}@jiri.chonbuk.ac.kr
Abstract. In this paper, we design and implement COIRS (COncept-based Image Retrieval System). It is a content-based image retrieval system that indexes and searches for images based on concepts. The concepts are detected by a thesaurus called the triple thesaurus. The triple thesaurus consists of a series of rules defining the concepts. COIRS adopts an image descriptor called a triple to specify the spatial relationships between objects in an image. An image is indexed by a set of triples, each of which is entered into an inverted file pointing to the image. We also develop a query processor to retrieve relevant images by evaluating a user query. The query, formulated in terms of triples, is evaluated by matching its triples with those of the inverted file.
1 Introduction
For the last decade, a large volume of image collections from various sources such as medical diagnosis, the military, the fashion industry, and broadcasting has brought forth a variety of image retrieval techniques. One simple technique is to search images based on manually produced descriptions. However, the sheer number of images makes such a method impractical. Moreover, describing even one image is not a trivial task, since the knowledge encoded in it is in general equivalent to thousands of words [7]. Content-based retrieval techniques are therefore needed to analyze images based on the characteristics of their content. Some generic attributes used for indexing and searching images are color, texture, shape and spatial relationship. QBIC [1], Stars [6], and Photobook [7] are attempts to index images based on these attributes. However, these systems alone may not satisfy user queries if retrieved images turn out to be relevant only when they are conceptually related to the queries. For example, most conventional image retrieval systems fail to retrieve kitchen pictures since they cannot deal with the concept 'kitchen'. To retrieve such images, the systems may ask users to explicitly list the components such as a dining table, a cupboard, and a dishwasher that the kitchen should include
(This work was supported by KOSEF no. 97-0100-1010-3. To whom correspondence should be addressed: [email protected])
together with the explicit specification of their possible spatial configuration. Obviously, it would be tiresome to specify the concept in such a way. Concept-based image retrieval techniques appear to be a unique solution for providing users with a higher-level query interface. In this paper, we describe the design and implementation of COIRS (COncept-based Image Retrieval System). It differs from extant content-based image retrieval systems in that it enables users to query based on concepts, that is, high-level objects identified from a spatial configuration of primitive objects in an image. COIRS adopts an image descriptor called a triple to specify the spatial relationships between objects. All images are therefore indexed by a set of associated triples. A triple thesaurus defines concepts by the logical connective of triples. The triples are used for formulating queries as well as for indexing images. We also develop a query processor to evaluate a query by matching its triples with those of the inverted file.
2 Image Indexing by Triples
An image is represented by a set of ordered triples in COIRS. A triple specifies a spatial relationship between two objects in an image [2,3]. For example, if an object b is located at the north side of another object a, the triple would be <a, b, north>. Fig. 1 shows a symbolized image p1, where a vase (b) with flowers (c) is located on a table (a), with an apple (d) and a bunch of grapes (e) at its east side. The image p1 is now represented by Tp1 = {<a, b, north>, <a, c, north>, <a, d, north>, <a, e, north>, <b, c, north>, <b, d, east>, <b, e, east>, <c, d, southeast>, <c, e, southeast>, <d, e, east>}. We assume one symbol is assigned to one object regardless of the size of the object. We also restrict the spatial relationships to eight directions such as north, southeast, west, etc.
Fig. 1. Symbolized Image p1
No known technique is available for automatically performing object recognition, since the objects to be recognized originate from disparate domains and images contain considerable noise [6]. COIRS provides a visual image indexer to facilitate the manual labeling of objects in an image and the specification of their relative position. It is an assistant tool designed to minimize manual work when
indexing images. To label an object in an image, the user drags a Minimum Bounded Rectangle (MBR) and then enters its name (Fig. 2(a)). As each object is labeled, the generated triples are displayed through the triple viewer. While such manual labeling may incur a considerable overhead, it has some advantages over automatic labeling. One is that even scenes that are extremely difficult to analyze can be indexed. For example, natural scenes containing mountains and rivers can be indexed in terms of triples.
Fig. 2. Visual Image Indexer and Inverted File
The triples produced by the visual image indexer are inserted into an inverted file (Fig. 2(b)). The inverted file consists of triples and links to the images indexed by the triples. We constructed this file based on the technique presented in [4].
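A minimal sketch of this triple indexing, using our own simplified structures rather than COIRS code: each image is reduced to a set of <object, object, direction> triples, and the inverted file maps every triple to the images that contain it.

from collections import defaultdict

# A few of the triples for the symbolized image p1 of Fig. 1
# (a=table, b=vase, c=flowers, d=apple, e=grapes)
p1_triples = {("a", "b", "north"), ("a", "d", "north"), ("b", "c", "north"),
              ("b", "d", "east"), ("c", "d", "southeast"), ("d", "e", "east")}

inverted_file = defaultdict(set)     # triple -> set of image ids

def index_image(image_id, triples):
    for t in triples:
        inverted_file[t].add(image_id)

def lookup(triple):
    return inverted_file.get(triple, set())

index_image("p1", p1_triples)
print(lookup(("b", "d", "east")))    # -> {'p1'}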
3 Recognition of Concept Objects
3.1 Triple Thesaurus
A concept is a composite object into which more than one object is aggregated according to their spatial relationships. Objects other than concepts are primitive objects. A triple thesaurus captures concepts from the logical connective of triples. Let T and O be the set of all triples and the set of all objects, respectively. Then a triple thesaurus C to detect concepts is defined as the following function [8]: C : 2^T → O, i.e., C({t}) = c for t ∈ T and c ∈ O. For example, the two primitive objects b and c in p1 can be combined into a flowervase f, which is a concept: C({<b, c, north>}) = f. Such a thesaurus may be implemented by a CFG (Context Free Grammar), a PDA (pushdown automaton), or rules, which turn out to be the same mechanism. In COIRS, a set of production rules in a CFG is used to detect concepts by YACC (Yet Another Compiler-Compiler).
Yacc Code of Triple Thesaurus

%token NORTH,SOUTH,WEST,EAST
%token NORTHEAST,NORTHWEST,SOUTHEAST,SOUTHWEST
%token TABLE,VASE,FLOWER,APPLE,GRAPE,ORANGE
%token AND,OR
%%
Concept_Object : Flower_vase { add_cobject("flowervase"); }
               | Fruits      { add_cobject("fruits"); }
               | Still_life  { add_cobject("still_life"); }
               ;
Flower_vase : '<' VASE ',' FLOWER ',' NORTH '>'
            ;
Fruits : '<' Fruits ',' Fruits ',' Location '>'
       | Fruit
       ;
Still_life : '<' Flower_vase ',' Fruits ',' Side_below '>' AND
             '<' TABLE ',' Fruits ',' NORTH '>' AND
             '<' TABLE ',' Flower_vase ',' NORTH '>'
           ;
Fruit : APPLE | GRAPE | ORANGE
      ;
Location : NORTH | SOUTH | WEST | EAST
         | NORTHWEST | NORTHEAST | SOUTHWEST | SOUTHEAST
         ;
Side_below : WEST | EAST | SOUTHWEST | SOUTHEAST | SOUTH
           ;

Now the constituents of the concept c, comp(c), are defined as {o1, o2 | C({<o1, o2, r>}) = c}. For example, comp(f) = {b, c}.

3.2 Triple Generation for Concepts
To determine a spatial relationship between a concept and other objects, we now define an ordering of directions. Let c ∈ O be a concept, o, o' ∈ comp(c), and let oj be any object with oj ∈ Op and oj ∉ comp(c). For r, r' ∈ D with <o, oj, r>, <o', oj, r'> ∈ Tp for all such oj: if r is above r' or r = r' in Fig. 3(a), then we say r subsumes r', denoted by r ≥ r' [8].
Fig. 3. Ordering between Directions and 4 Bits Representation
A function GenConceptTriple() generates concept triples to specify spatial relationships between a newly generated concept and other objects in the image.
The spatial relationship between the objects is defined by the GLB (Greatest Lower Bound) / LUB (Lowest Upper Bound). The concept-direction set R is the set of spatial relationships between all constituents of a concept c and an object o in the image. R is obtained by Dir_Comp().

Function to generate concept triples
GenConceptTriple()
Input : A concept object c and other objects in an image
Output : Triples involving c
Begin
  R = Dir_Comp(c)
  if GLB(R) exists then return (c,o,GLB(R))
  else if LUB(R) exists then return (c,o,LUB(R))
  return NULL
End.

A spatial relationship r is represented by four bits (Fig. 3(b)). The 'AND' and 'OR' bit operators are used to calculate the GLB and LUB respectively. We first perform pairwise GLB(R) = AND(r1, r2), where r1, r2 ∈ R. If GLB(R) is non-zero, it is the target direction; if not, LUB(R) = OR(r1, r2) is calculated as an alternative direction. When neither the GLB nor the LUB yields a valid direction, NULL is returned. This means that no representative direction can be obtained. Nonexistence of a GLB or LUB between a concept object and any other object entails that their spatial relationship cannot be defined in terms of the eight directions. Such cases are not generated as triples in COIRS. Since refined frameworks supporting 'same', 'surround' and 'overlap' may require considerably extensive work, we leave them to further research. In p1, to determine a direction between the concept flowervase (f) and the object apple (d), we obtain the concept-direction set R = {east, southeast} between comp(f) = {b, c} and d from Dir_Comp(). Since southeast = 0101 and east = 0100 in Fig. 3(b), GLB(southeast, east) = AND(0101, 0100) = 0100, i.e., east. In other words, since the spatial relationship between f and d may be viewed as east and southeast simultaneously, r = GLB(R) = east is fixed as a representative relationship satisfying both. <f, d, east> is hence added to the triple set. Similarly, {<a, f, north>, <f, d, east>, <f, e, east>} are generated and then added to the triple set. Another advantage of our system is that it provides a level of abstraction in capturing concepts. For example, in Fig. 1, suppose g is defined as a concept 'fruits' by C({<d, e, east>}) = g. The spatial relationship between f and g is then r = GLB(R) = east. Furthermore, we can extract the whole semantics of p1, i.e., a 'still-life' (h) image, if we define C({<a, f, north>, <a, g, north>, <f, g, east>}) = h. This describes that an image where a flower vase (f) and fruits (g) are at the north of a table (a), and g is to the east of f, may be viewed as a still-life image.
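The bitwise computation can be sketched as follows in Python. Only the codes for east (0100) and southeast (0101) are given in the paper; the remaining direction codes below are one consistent assignment we assume for illustration.

# 4-bit direction codes; bits = (north, east, west, south). Only east (0100) and
# southeast (0101) are stated in the text -- the rest are assumed.
N, E, W, S = 0b1000, 0b0100, 0b0010, 0b0001
DIR = {"north": N, "south": S, "east": E, "west": W,
       "northeast": N | E, "northwest": N | W,
       "southeast": S | E, "southwest": S | W}
NAME = {v: k for k, v in DIR.items()}

def representative_direction(R):
    """GLB via bitwise AND; fall back to LUB via bitwise OR (None if neither names a direction)."""
    codes = [DIR[r] for r in R]
    glb = lub = codes[0]
    for c in codes[1:]:
        glb &= c
        lub |= c
    if glb != 0:
        return NAME.get(glb)
    return NAME.get(lub)   # None if the OR is not one of the eight directions

print(representative_direction({"east", "southeast"}))   # -> 'east', as in the flowervase/apple example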
4 Concept-Based Query Evaluation
A query is evaluated by matching the query triples with those of the inverted file. A query may be given in two ways: object icons and triples. Like other graphical user interfaces (GUIs), the icon-based query interface is intuitive and easy to use (Fig. 4(a)). The interface is divided into four parts: a sketchpad, an icon set, a query triple viewer and a result viewer. The query is given by placing icons on the sketchpad from the icon set provided by COIRS. For user convenience, a user is also allowed to directly input an object name by pointing to a place on the sketchpad.
Fig. 4. User Interface
A query may also be issued through a triple-based interface by inputting objects and the spatial relationship between them (Fig. 4(b)). It is composed of three parts: a query triple editor, a query triple viewer, and a result viewer. The query triple editor allows users to construct a simple query or a compound query by using logical connectives such as 'AND' and 'OR'. Once objects and a spatial relationship are entered, the corresponding query triples are displayed through the triple viewer. If the spatial relationship is one of right, left, above, below or side, it is transformed into several triples. For example, if the spatial relationship of a triple is 'below', it is translated into three triples whose spatial relations are southwest, south, and southeast respectively. It is also possible for a query to be formulated in terms of objects alone, without their spatial relationships. Then the query is converted into eight or-related triples having the eight directions. The result viewer shows the result of the query, obtained from the following function to retrieve images.
Retrieve_Images()
input: query triple
output: image ids
Begin
  while (query_triple != EOF)
  Begin
    token = Get_Token(query_triple)
    if monotriple == token then
      set_of_ids = InvertedFileLookup(token)
    else
      logical_con_st = token
  End
  return (CompoundQueryOp(set_of_ids, logical_con_st))
End

Since COIRS can even extract the semantics of a whole image, it is possible for COIRS to retrieve an image by a concept which covers it. For example, to retrieve still-life images, the input would simply be 'still-life' in the 'concept of image' field in the query editor.
5 System Implementation
COIRS was fully implemented with Motif and C++ on top of Sun Solaris OS. It consists of four modules: a visual image indexer, a triple thesaurus, an inverted file and a query processor. The visual image indexer facilitates object labeling and the specification of relative position of objects. The thesaurus captures the concepts by analyzing triples, thereby extracting image semantics. A query is evaluated by matching the triples of the query with an inverted file. Shown in Fig. 5 is the whole configuration of COIRS incorporating the four modules.
Fig. 5. System Configuration of COIRS
There are two approaches to implementing the query processor of COIRS: top-down evaluation and bottom-up evaluation. In top-down evaluation, references to the triple thesaurus are confined to concepts appearing in the triples of user
queries. No other references for trying to detect concepts in images are made. The inverted file, therefore, does not include any triples containing concepts. When evaluating a query, every triple containing concepts is translated into one or more concept-free triples that are semantically equivalent to it. Target images may then be retrieved by searching for the concept-free triples in the inverted file. In contrast, in bottom-up evaluation, every concept is detected and the generated triples involving concepts are inserted into the inverted file prior to query evaluation. The triples in a user query may hence match their direct counterparts in the inverted file. Currently the query processor in COIRS adopts bottom-up evaluation so as not to compromise user response time, avoiding the query processing delay due to query reformulation. However, bottom-up evaluation has the drawback that concept detection may be time-consuming when images contain too many objects. The judgement of which one is better may depend on the characteristics of the application domain.
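The contrast between the two strategies can be sketched as follows (a simplification with a hypothetical thesaurus; the rewrite shown is stricter than the GLB/LUB semantics of Sect. 3.2, which it does not reproduce):

# Hypothetical thesaurus: concept -> defining triples over primitive objects.
THESAURUS = {"flowervase": {("vase", "flower", "north")}}

def index_bottom_up(image_id, primitive_triples, concept_triples, inverted_file):
    """Bottom-up: concept triples (e.g. ('flowervase', 'apple', 'east')) are assumed to
    have been generated at indexing time and are stored next to the primitive triples."""
    for t in primitive_triples | concept_triples:
        inverted_file.setdefault(t, set()).add(image_id)

def expand_top_down(query_triple):
    """Top-down: a query triple whose subject is a concept is rewritten into concept-free
    triples; here the relation to the other object is simply repeated for each constituent."""
    subj, obj, rel = query_triple
    if subj not in THESAURUS:
        return {query_triple}
    constituents = {o for t in THESAURUS[subj] for o in t[:2]}
    return THESAURUS[subj] | {(c, obj, rel) for c in constituents}

print(expand_top_down(("flowervase", "apple", "east")))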
6 Conclusions and Further Work
In this paper, we developed COIRS as an advanced content-based image retrieval system. The main advantages of COIRS are that 1) it is a higher-level image retrieval system in comparison with other systems that retrieve images relying only on syntactical information such as color, shape or texture, and 2) it provides an integrated framework into which extant content-based technologies can be uniformly incorporated. As further research, complementary work on our framework may be needed. First, we should solve the problem of determining ambiguous spatial relationships between objects which cannot be specified in terms of only eight directions. For example, we should remove the difficulty in specifying a direction which may be either east or southeast, but more likely southeast. Introducing fuzzified spatial relationships may be an alternative. Second, the thesaurus introduced in this paper should be developed in greater detail, since it is a core component for capturing image semantics. Rule-based languages such as Prolog or CLIPS may be exploited to construct the thesaurus.
References
1. Ashley, J., et al.: Automatic and Semiautomatic Methods for Image Annotation and Retrieval in QBIC. In: Proceedings of Storage and Retrieval for Image and Video Databases III, Vol. 2420, SPIE (1995) 24-25.
2. Chang, C. C. and Lee, S. Y.: Retrieval of Similar Pictures on Pictorial Databases. In: Pattern Recognition (1991) 675-680.
3. Chang, C. C.: Spatial Match Retrieval of Symbolic Pictures. In: Journal of Information Science and Engineering, Vol. 7 (1991) 405-422.
4. Cook, C. R. and Oldehoeft, R.: A letter-oriented minimal perfect hashing function. In: ACM SIGPLAN Notices 17 (1982) 18-27.
5. Han, J. J., Choi, J. H., Park, J. J. and Yang, J. D.: An Object-based Information Retrieval Model: Toward the Structural Construction of Thesauri. In: Proceedings of International Conference ADL98 (1998) 117-125.
6. Li, John Z. and Ozsu, M. Tamer: STARS: A Spatial Attributes Retrieval System for Images and Videos. In: Proceedings of the 4th International Conference on Multimedia Modeling (MMM'97), Singapore (1997).
7. Pentland, A., Picard, R. W., Scaroff, S.: Photobook: Tools for Content-based Manipulation of Image Databases. In: International Journal of Computer Vision (1996).
8. Yang, J. D. and Yang, H. J.: A Formal Framework for Image Indexing with Triples: Toward a Concept-based Image Retrieval. In: International Journal of Intelligent Systems, submitted (1998).
Automatic Index Expansion for Concept-Based Image Query
Dwi Sutanto and C. H. C. Leung
Communications & Informatics, Victoria University of Technology
Ballarat Road, Melbourne 3001, Victoria, Australia
{dwi,clement}@matilda.vut.edu.au
Abstract. Search effectiveness in an image database is always a trade-off between the indexing cost and semantic richness. A solution that provides a significant degree of semantic richness that simultaneously limits the indexing cost is presented. The query schemes are able to enhance the query speed by adopting a semantically rich structured form for high-level image content information, as well as exploiting the efficiency of conventional database search. Through the use of rules, which can be either pre-defined or dynamically incorporated, a new level of semantic richness can be established, which will eliminate the costly detailed indexing of individual concepts. The query algorithm incorporates rule-based conceptual navigation, customized weighting, incremental indexing and relevance feedback mechanisms to enhance retrieval effectiveness.
1 Introduction
With rapid advances in powerful multimedia computers and the Internet, pictorial query algorithms have attracted significant research attention. There are many different methods and techniques that have been proposed. They largely fall into two categories: concept-based [1,7,8,9,11,14,15,16,19] and content-based [2,3,4,5,6,10,12,13,17]. Concept-based methods are mainly text-based approaches which allow users to post their query either simply using keywords or using a form of natural language. Content-based methods, on the other hand, are pixel-based approaches which allow users to post their query by an example or by image contents (color, shape, texture, etc.). Each type of method has its advantages and disadvantages. Query by example (QBE), for example, is suitable if a user has a similar image at hand, and a query will recall entities having a similar image signature. However, it would not perform well if the image is taken from a different angle, has a different scale, or is placed in a different setting. It is often difficult to query images by their contents, where users have to tell/select a color composition, outline the drawing, select a texture, etc. Because content-based methods are not rich in image semantics, it is difficult to use them to query high-level visual concepts like an image
of a Melbourne Spring Festival. Multiple domain recall is another disadvantage of approaches like QBIC [3,4,5]; e.g. a query for 70% blue color and 30% white color may return an image of a white car parked behind a blue wall, a white book on a blue table, an image of a beach with white sand, etc. For these reasons, it is obvious that text-based queries are preferable to QBE. Text-based queries are also very much faster than QBE and QBIC because text processing only takes a small fraction of the time compared to image processing. Another advantage of text-based queries is the ease with which a user can prepare the query, because they use human language for the queries. In this paper, we develop a text-based query system based on the Ternary Fact Model (TFM) database [1,7,14]. Unlike conventional text-based systems, which rely on keywords for database index and query, TFM has an underlying visual entity-relationship index representation, a rule-based conceptual hierarchy, and other features to support semi-automatic indexing and to enhance query performance using a thesaurus system, a relevance feedback mechanism, and user-tailored weighting components.
2 Image Indexing Paradigm
Our approach to image indexing is developed from the basic Ternary Fact Model [1,7,14], which is based on a textual descriptive approach to representing image contents. The TFM modeling approach has been shown to give competent performance in terms of recall and precision in the retrieval of images. The representation consists of five discrete fact types: elementary facts, modified facts, outline facts, binary facts and ternary facts. Elementary facts are atomic objects in the image that are meaningful to human users, such as apple, book, chair, etc. Modified facts are elementary facts augmented with descriptive properties, such as red apple, thick book, old chair, etc. Outline facts are abstract concepts derived from the image, such as war, party, celebration, etc. Binary facts are relationships linking two elementary or modified facts, such as a boy eats an apple, a book is on the chair, etc. Finally, ternary facts are relationships linking three elementary or modified facts, such as a boy peels a green apple with a knife, a man puts a book on the chair, etc. Table 1 illustrates the main features of the model. It is quite possible to extend the model to link more than three facts; however, it was found that three-fact relationships are sufficient to provide a simple yet adequate representation in most situations [7]. Despite the richness of its representations, TFM still relies on a significant amount of manual work to interpret and index images (due to limitations in current image recognition algorithms), in comparison with pixel-based systems. In this paper we present a mechanism which aims to eliminate part of this manual work by having the computer semi-automatically build high-level indices. We shall achieve this goal by employing a knowledge-based system (KBS) of rules which will automate the process of generating high-level indices (outline facts) from low-level indices (elementary facts). This KBS will also be able to expand and restructure elementary facts and outline facts into a hierarchy. This expansion is depicted in Figure 1.
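For illustration, the five fact types can be rendered as simple data records; this is our own sketch, not the TFM definition itself.

from dataclasses import dataclass

@dataclass
class ElementaryFact:        # atomic object: "apple"
    name: str

@dataclass
class ModifiedFact:          # elementary fact plus descriptive property: "red apple"
    fact: ElementaryFact
    modifier: str

@dataclass
class OutlineFact:           # abstract concept derived from the image: "celebration"
    concept: str

@dataclass
class BinaryFact:            # relationship between two facts: "a boy eats an apple"
    subject: object
    link: str
    obj: object

@dataclass
class TernaryFact:           # relationship among three facts: "a boy peels an apple with a knife"
    subject: object
    link: str
    obj: object
    third: object

fact = TernaryFact(ElementaryFact("boy"), "peels",
                   ModifiedFact(ElementaryFact("apple"), "green"),
                   ElementaryFact("knife"))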
In Figure 1, elementary facts are labeled as low level objects, while outline facts are labeled as medium or high level concepts. Basically, apart from the leftmost set, which contains all the elementary facts, the remaining sets contain outline facts. In Figure 1, a dot represents a fact and a line to the left of the fact relates that particular fact to its components. An outline fact is composed of two or more elementary facts or other outline facts drawn from a lower level. Although Figure 1 only shows one medium level concept, in practice we can have several. From this representation, it is clear that the higher the level of the concept, the fewer facts there are in the set. This is an important characteristic that will be exploited in database search, which will be discussed later.
Figure 1. Index Expansion
The index is entered manually or extracted automatically from the image, by a human or a computer, in terms of atomic objects. Atomic objects are defined as the smallest entities that cannot be decomposed further into components. For example, if a table is defined as an atomic component, then we will not later be able to recognize a table leg or a table top from the database. Therefore, depending on the application, an indexer will have to predetermine to what extent he/she wants to decompose an image object into atomic indices. From these atomic indices, the machine will develop higher level indices using its knowledge-based system. It is the task of the rules to perform the creation of the higher level indices. A rule consists of an IF part, which lists all the conditions that must be satisfied, and a THEN part, which concludes the rule given that the conditions are met. A rule is created by assigning lower level (atomic) objects/indices to the IF part and a higher level object/index to the THEN part. Boolean AND or OR can be used to define the relationship among the atomic objects in the condition of the rule. The AND part is the necessary condition, and the OR part is the optional condition. By validating the rule conditions against existing indices (which could be elementary facts or outline facts) and obtaining higher level indices from the hypothesis of the rule, we can create a new index entry automatically. In other words, we build a higher level index from lower level (atomic) indices that might be directly recognizable from the image. This indexing mechanism avoids inconsistency in human perception of the image concept when the process is performed manually. For retrieval, users can take advantage of these high level indices to speed up the searching time and narrow down
the search space. We illustrate below how to construct a typical rule to define a high level object 'computer'. IF
THEN
there exists a monitor a CPU a keyboard a mouse the OBJECT is a computer
AND AND AND
In turn, we can treat the object 'computer' as an intermediate level index, and then use it as a condition for a higher level object description as in the following example. IF
THEN
there exists a desk a computer a telephone the OBJECT is an office
AND AND AND
In this way, we structure the index representations into a hierarchy. Therefore, several atomic indices will be able to define intermediate indices and several intermediate indices will be able to define higher level indices and so on. The benefit of this method is the reusability of the rules. Once a rule that defines an object is created, it can be used for different images in the database as well as new images.
3 Index Organization
Indices created in the previous Section have to be organized in such a way that will facilitate fast and efficient retrieval for a query. We shall explain this organization through the following example. Suppose that we have four pictures in the database with the following contents:

Table 1. Elementary Indices
Image #1: ribbon, balloon, light, cake, candle, people
Image #2: ribbon, balloon, light, Xmas tree, present, Santa Claus
Image #3: ribbon, flower, car
Image #4: tree, flower, lawn
In the database, these objects will be treated as elementary facts (elementary indices) of the pictures which will be stored in an elementary index table. One or more picture numbers are related to each index entry indicating to which images an index entry belongs. For example, images #1, #2, and #3 are related to index ribbon, as ribbon is present in these images.
Suppose that we have created a knowledge based system using the following rules (the image numbers satisfying each condition and conclusion are shown in brackets):

Rule 1: IF there exists a ribbon [1,2,3] AND a balloon [1,2] AND a light [1,2] THEN the OBJECT is a decoration [1,2]
Rule 2: IF there exists a ribbon [1,2,3] AND a flower [3,4] AND a car [3] THEN the OBJECT is a wedding party [3]
Rule 3: IF there exists a tree [4] AND a flower [3,4] AND lawn [4] THEN the OBJECT is a garden [4]
Rule 4: IF there exists a decoration [1,2] AND a cake [1] AND a candle [1] AND people [1] THEN the OBJECT is a birthday party [1]
Rule 5: IF there exists a decoration [1,2] AND a Christmas tree [2] AND a present [2] AND a 'Santa Claus' [2] THEN the OBJECT is a Christmas event [2]
Upon the execution of these rules, we will generate new indices that represent intermediate or high level indices. These indices will be stored in different tables corresponding to the level of abstraction or the stage of the creation of the index. Table 2. Index Table
Atomic Index Table: ribbon 1,2,3; balloon 1,2; candle 1; cake 1; people 1; Xmas tree 2; present 2; Santa Claus 2; light 1,2; car 3; flower 3,4; tree 4; lawn 4
Intermediate Index Table: decoration 1,2; wedding party 3; garden 4
High Level Index Table: birthday party 1; Xmas event 2
In our algorithm we have to include picture numbers as extended conditions of the rule, because indices are bound to picture numbers. To evaluate the rule, the inference engine requires that all of the conditions are satisfied. It is possible that one rule will satisfy more than one picture; however, all of the rule conditions have to be satisfied for each picture. Table 2 illustrates the index tables corresponding to the above rules.
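The following sketch (hypothetical, not the actual KBS implementation) shows this per-picture evaluation: a rule fires for exactly those pictures that lie in the intersection of its condition image sets, and firing is repeated until no new index entries appear.

atomic_index = {            # elementary index table, as in Tables 1 and 2
    "ribbon": {1, 2, 3}, "balloon": {1, 2}, "light": {1, 2},
    "cake": {1}, "candle": {1}, "people": {1},
    "Xmas tree": {2}, "present": {2}, "Santa Claus": {2},
    "flower": {3, 4}, "car": {3}, "tree": {4}, "lawn": {4},
}

rules = [                   # (conditions, conclusion); optional OR conditions omitted for brevity
    (["ribbon", "balloon", "light"], "decoration"),
    (["ribbon", "flower", "car"], "wedding party"),
    (["tree", "flower", "lawn"], "garden"),
    (["decoration", "cake", "candle", "people"], "birthday party"),
    (["decoration", "Xmas tree", "present", "Santa Claus"], "Christmas event"),
]

index = dict(atomic_index)
changed = True
while changed:              # keep firing rules until no new index entries appear
    changed = False
    for conditions, conclusion in rules:
        images = set.intersection(*(index.get(c, set()) for c in conditions))
        if images and index.get(conclusion) != images:
            index[conclusion] = images
            changed = True

print(index["birthday party"], index["Christmas event"])   # -> {1} {2}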
4 Query Processing
Processing a query for such a database consists of searching and matching between query components and index components. For elementary facts, the program will compare elementary facts from the query and those from the database index. For modified facts, matching has also to be performed on the modifier and elementary fact. For binary and ternary facts, the algorithm also has to verify any existing relationship among the facts. These are the basic operations of the query processor; however, other factors need to be considered so as to achieve a more efficient and effective search. We examine this processing mechanism in more detail below. Weights or significance measures provide users with a means to give priorities to selected query components in order to optimize the results of the query. Applying higher fact weight values will increase the fact recall rate (since these would figure more significantly in the retrieval ranking), while higher modifier and link weight values will increase the precision rate. Depending on the query outcome, the user may want to further tailor the query results by widening or narrowing the scope of the result. Weights are also given for other extended features such as background/foreground image, image category, color, etc. Processing the query commences from user input. The query may be entered using natural language, provided it has a structure similar to the indexing language in the TFM. The query is processed to remove all unwanted components. Because TFM only covers a subset of natural language, full natural language processing is not needed. A query sentence is broken down into several query clauses. Each query clause is processed separately. The results of each of these query clauses will be merged at a later stage to obtain the final result of the query. Each clause phrase is then transformed into the Ternary Fact Model structure, to obtain query components. A summary of the component extraction procedure is sketched below.
1. The first noun identified from a clause is considered as an elementary subject fact candidate.
2. If the elementary subject fact candidate is directly followed by one or more nouns, then the last noun is considered as the actual elementary subject fact.
3. Words (nouns or adjectives) before the elementary subject fact are considered as modifiers.
4. The first noun extracted from the clause after the elementary subject fact will be treated as an elementary object fact candidate. The actual elementary object fact is determined as in step 2.
5. Words between the elementary subject fact phrase and the elementary object fact phrase are considered as a fact linker.
6. The same procedure applies for finding the ternary fact components.
Once we get all of the facts, the search engine is ready to perform a search on the database. The actual search that needs to be conducted is in the elementary/outline fact table, because all the elementary/outline facts are stored in this table, which means that the search result is sufficient to determine whether or not any further search is needed. Thus, if this search does not result in any match, then any further search in other tables will not yield anything. Indices in this table are organized as a one-to-one relationship between elementary facts and image reference numbers indicating to which images the facts belong. For every match, the search engine will record the reference number, which will be used as a pointer to data in other tables. Basically, the search engine will only need to verify which index modifiers, among the set pointed to by the reference numbers from the main search, match the query modifiers. The result of these verifications should return an even smaller set of reference numbers than the previous one. Similarly, the reference numbers from this modified fact table will be used to verify data in the binary facts table, which in turn will be used to verify data in the ternary facts table using the same technique. This algorithm eliminates the searching time needed at higher level tables (modified, binary, and ternary facts tables) using the first level search. We diagrammatically depict the whole process in Figure 2, and give a condensed sketch of the staged filtering below.
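The sketch below uses hypothetical table contents and field names; it is a simplification of the staged filtering, not the actual implementation.

# Each table maps a reference number to its facts (illustrative contents only).
elementary = {"img10": {"dog", "bone"}, "img11": {"dog"}}
modified   = {"img10": {("brown", "dog")}, "img11": {("black", "dog")}}
binary     = {"img10": {("dog", "eat", "bone")}}

def staged_search(subject, modifier=None, link=None, obj=None):
    refs = {r for r, facts in elementary.items() if subject in facts}        # level 1: elementary facts
    if modifier:                                                             # level 2: modified facts
        refs = {r for r in refs if (modifier, subject) in modified.get(r, set())}
    if link and obj:                                                         # level 3: binary facts
        refs = {r for r in refs if (subject, link, obj) in binary.get(r, set())}
    return refs

print(staged_search("dog", modifier="brown", link="eat", obj="bone"))        # -> {'img10'}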
Figure 2. Query Processing
Figure 2 includes a field v associated with each record. This is a value field obtained from the image source which represents the significance measure for each associated fact within the image. This value is also returned by a function within the search engine for every match found. Figure 3 illustrates the query calculations.
P(Q|I_n) = \sum_i W_i P(F_i|I_n)    (1)
where P(Q|I_n) is the significance measure for image I_n, P(F_i|I_n) is the significance value of feature i in image I_n, W_i is the user-specified query weight for feature i, F_i are the query features, and i = 1, ..., number of features.
Figure 3. Query Formula
Upon any match, the query processor will assign hit values for all individual facts and then calculate total hit values for a query clause using Equation (1) [18] for each image recalled. Equation (1) basically adds all weighted features being queried in a given image. The value returned by the formula indicates how closely a recalled image matches a given query.
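A direct transcription of Equation (1) follows; the feature values and weights below are invented for illustration.

def score(query_weights, image_feature_values):
    """P(Q|I_n) = sum_i W_i * P(F_i|I_n); features absent from the image contribute 0."""
    return sum(w * image_feature_values.get(f, 0.0) for f, w in query_weights.items())

query = {"dog": 1.0, "brown": 0.5, "eat bone": 0.8}          # W_i per query feature
images = {"img10": {"dog": 0.6, "brown": 0.6, "eat bone": 0.1},   # P(F_i | I_n) from the index
          "img11": {"dog": 0.7}}

ranking = sorted(images, key=lambda n: score(query, images[n]), reverse=True)
print(ranking)    # images ordered by how closely they match the query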
5 Relevance Feedback
The relevance feedback mechanism allows users to fine-tune their queries by indicating to the system which query results are relevant. Usually this is achieved by clicking on a number of relevant images from the query results. The system will further search the database to find more images which are similar to the users' feedback. To process this feedback, a feedback processor will obtain index data from the selected images to be used as a new query. To keep track of the query, the new query will have to include the initial query as well. This makes possible similarity retrieval based on high-level objects rather than low-level features, as is the case for most other systems.
Figure 4. Feedback Mechanism
The above feedback is beneficial provided that the first query result successfully recalls a number of relevant images. However, sometimes it is necessary to widen or to narrow the query concepts if no relevant images are recalled. This can be done by following the rule network. To support this feature, we include a field to store the rule number for every outline fact deduced during the indexing process. Figure 4 details this mechanism, which illustrates two joining rules, i.e. rules 5 and 7. Suppose
that the query processor recalls entity p. Using this rule network as an example, we could widen the query by following (move forward) the path of rule 7, or narrow the query by following (move backward) the path of rule 5.
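This navigation of the rule network can be sketched as follows, using the hypothetical rules of Figure 4.

rules = {5: (["a", "b", "c"], "p"),      # rule 5: if a, b, c then p
         7: (["p", "q"], "r")}           # rule 7: if p, q then r

def widen(concept):
    """Move forward: broader concepts concluded by rules that use `concept` as a condition."""
    return [concl for conds, concl in rules.values() if concept in conds]

def narrow(concept):
    """Move backward: the components of the rule(s) that conclude `concept`."""
    return [conds for conds, concl in rules.values() if concl == concept]

print(widen("p"))    # -> ['r']              (follow rule 7 forward)
print(narrow("p"))   # -> [['a', 'b', 'c']]  (follow rule 5 backward)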
6 Concluding Remarks
Search effectiveness in an image database is always a trade-off between the indexing cost and semantic richness. To support a high degree of semantic richness, the cost of indexing tends to be high, as significant human involvement is necessary. Some degree of human indexing involvement is inevitable, as automatic recognition algorithms are still unable to provide, even in a limited way, the semantic level required of many applications. Fully automatic indexing of semantically rich contents is unlikely to be achievable in the foreseeable future. We have presented a solution that provides a significant degree of semantic richness while at the same time limiting the indexing cost. Ternary Fact Model query schemes are able to enhance the query speed by adopting a semantically rich canonical form for high-level image content information, and by exploiting the structural efficiency of conventional database search. Through the use of rules, which can be either pre-defined or dynamically incorporated, a new level of semantic richness can be established, which will eliminate the costly indexing of individual concepts. Necessarily, this kind of semi-automatic, high-level indexing entails some degree of uncertainty, which would adversely affect the retrieval precision. This requires the use of weights to indicate the degree of relevance of query outcomes, which are often linked to the measure of reliability of the deductive rules for generating an index entry. However, compared with the cost of manually entering every index entry, such a degree of uncertainty appears to be quite workable and acceptable.
References
1. Sutanto, D. and Leung, C.H.C., "Automatic Image Database Indexing", Proceedings of the Multimedia and Visual Information Systems Technology Workshop, pp. 15-19, October 1997.
2. Gudivada, Venkat N. and Raghavan, Vijay V., "Content-Based Image Retrieval Systems", IEEE Computer, pp. 18-31, 1995.
3. Barber, R. et al., "ULTIMEDIA MANAGER: Query By Image Content and its Applications", IEEE Comput. Soc. Press: Digest of Papers, Spring Compcon '94, pp. 424-429, 1994.
4. Barber, R. et al., "A Guided Tour of Multimedia Systems and Applications: Query by Content for Large On-Line Image Collections", IEEE Computer Society Press, pp. 357-378, 1995.
5. Flickner, Myron et al., "Query by Image and Video Content: The QBIC System", IEEE Computer, Vol. 28, Issue 9, pp. 23-32, September 1995.
6. Campanai, M., Del Bimbo, A., and Nesi, P., "Using 3D Spatial Relationships for Image Retrieval by Contents", Proc. IEEE Workshop on Visual Languages, 1992.
7. Leung, C.H.C. and Zheng, Z.J., “Image Data Modelling for Efficient Content
Indexing”, Proc IEEE International Workshop on Multi-Media Database Management Systems, pp. 143-150, 1995. 8. Yang, Li and Wu, Jiankang, “Towards a Semantic Image Database System,” Data & Knowledge Engineering, Vol. 22, No. 3, pp. 207-227, May 1997. 9. Li, Wen-Syan et al., ”Hierarchical Image Modeling for Object-Based Media Retrieval”, Data & Knowledge Engineering, Vol. 27, No. 2, pp. 139-176, September 1998. 10. Jorgensen, Corinne, “Attributes of Images in Describing Tasks”, Information Processing & Management, Vol. 34, No. 2/3, pp. 161-174, March/May 1998. 11. Shakir, Hussain Sabri, “Context-Sensitive Processing of Semantic Queries in an Image Database System”, Information Processing & Management, Vol. 32, No. 5, pp. 573-600, 1996. 12. Gudivada, Venkat N., “Modeling and Retrieving Images by Content”, Information Processing & Management, Vol. 33, No. 4, pp. 427-452, 1997. 13. Eakins, John P. et al., “Similarity Retrieval of Trademark Images”, IEEE Multimedia, April-June 1998. 14. Leung, C. H. C. and Sutanto, D. “Multimedia Data Modeling and Management for Semantic Content Retrieval”, in Handbook of Multimedia Computing, Furht, B. (Ed.), CRC Press (To Appear). 15. Chua, Tat-Seng et al., “A Concept-Based Image Retrieval System”, Proceedings of the Twenty Seventh Hawaii International Conference on System Sciences, Vol. 3, pp. 590-598, Jan 1994. 16. Chang, S. F. et. al., “Visual Information Retrieval from Large Distributed Online Repositories”, Comm. ACM, Vol. 40, Dec 1997, pp. 63-71. 17. Srihari, Rohini K., “Automatic Indexing and Content-Based Retrieval of Captioned Images”, IEEE Computer, pp. 49-56, September 1995. 18. Hou, Tai Yuan, et al, “Medical Image Retrieval by Spatial Features”, 1992 IEEE International Conference on Systems, Man, and Cybernetics, Vol. 2, pp. 1364-9, October 1992. 19. Grosky, W. I., “Managing Multimedia Information in Database Systems”, Comm. ACM, Vol. 40, Dec 1997, pp. 72-80.
Structured High-Level Indexing of Visual Data Content Audrey M. Tam and Clement H.C. Leung Communications & Informatics Victoria University of Technology, Australia Footscray Campus (FO119), P.O. Box 14428 Melbourne CMC, VIC 8001, Australia Fax:+61 3 9688 4050 {amt,clement}@matilda.vut.edu.au
Abstract. Unstructured manual high-level indexing is too open-ended to be useful. Domain-based classification schemes reduce the variability of index captions and increase the efficiency of manual indexing by limiting the indexer's options. In this paper, we incorporate classification hierarchies into an indexing superstructure of metadata, context and content, incorporating high-level content descriptions based on the ternary fact model. An extended illustration is given to show how metadata can be automatically extracted and can subsequently help to further limit the indexer's options for context and content. Thus, this structure facilitates the indexing of high-level contents and allows semantically rich concepts to be efficiently incorporated. We also propose a form of data mining on this index to determine rules that can be used to semi-automatically (re)classify images.
1
Classification Schemes for High-Level Indexing
Multimedia databases must store a vast amount of information about their data objects – data about the semantic content and data about the data [4,5,6,8]. Both indexing and querying of visual databases can be done on the basis of low-level data content – texture, colour, shape etc. – and high-level semantic content – what the image “means” to humans. Low-level content can be automatically extracted for indexing but high-level content must be indexed manually or semi-automatically. Unstructured manual high-level indexing is too open-ended to be useful: like a Rorschach test, the results say more about the human indexer's viewpoint than about the images [7]. Domain-based classification schemes can reduce the variability of index captions and increase the efficiency of manual indexing by limiting the indexer's options. A classification scheme may resemble the conventional library classification system or it may be structured to accommodate a particular domain. It might include (Is-a) hierarchies of subclasses and (Has-a) hierarchies of components; hence, it need not be a tree structure. For example, in the transport domain, a bike lane is a subclass of on-
road facility and also a component of a road. Classification schemes are able to remove large portions of the search space from consideration. With multiple classification schemes, it would be possible to reduce the set of candidate images to manageable proportions (Figure 1). The leaf nodes in the search trees in Figure 1 would correspond to data object ids, with the links either implemented as pointers or logical references, in which case set operations (e.g., intersection) would need to be incorporated.
Figure 1. Search tree pruning using multiple classification
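To make the pruning idea concrete, here is a small sketch of how candidate sets from two classification schemes could be intersected. The scheme names, class names and image identifiers are invented for illustration and are not from the paper; the point is only that each scheme node maps to the set of object ids below it, and that combining schemes is a set operation.

    # Hypothetical classification schemes: node -> set of data object ids.
    scheme_transport = {
        "on-road facility": {"img01", "img07", "img10", "img23"},
        "bike lane":        {"img07", "img10"},
    }
    scheme_activity = {
        "cycling": {"img05", "img07", "img10", "img31"},
    }

    def candidates(*node_sets):
        """Intersect the candidate sets contributed by several scheme nodes."""
        result = None
        for ids in node_sets:
            result = set(ids) if result is None else result & set(ids)
        return result or set()

    print(candidates(scheme_transport["bike lane"], scheme_activity["cycling"]))
    # {'img07', 'img10'}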
Although classification schemes will help to limit the search, they are seldom sufficiently exhaustive to allow the target images to be pinpointed exactly, in which case the underlying classification scheme can be supplemented by the ternary fact model proposed in [10]. This data model, which forms the basis for the systematic representation of image content, has as its basic building blocks facts – nodes in the underlying classification scheme. These may be modified and linked together in pairs or triplets to express the subject matter of an image. Examples of facts in the transport domain are given below; elementary and outline facts are in upper case, modifiers are in lower case and links are in title case:
• Elementary Facts: TRAM, BICYCLE
• Outline Facts: TRAFFIC JAM, CRITICAL MASS RIDE
• Modified Facts: articulated TRAM, folding BICYCLE
• Binary Fact: tall MAN Riding folding BICYCLE
• Ternary Fact: yellow TAXI Hitting red CAR Towing TRAILER
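The following sketch shows one possible in-memory encoding of such facts; the class layout and field names are our own assumption, chosen only to mirror the modifier/fact/link structure of the examples above.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Fact:
        term: str                       # elementary or outline fact, e.g. "BICYCLE"
        modifier: Optional[str] = None  # e.g. "folding"

    @dataclass
    class LinkedFact:
        subject: Fact
        link1: str                      # e.g. "Riding", "Hitting"
        object1: Fact
        link2: Optional[str] = None     # present only for ternary facts
        object2: Optional[Fact] = None

    # "tall MAN Riding folding BICYCLE" as a binary fact:
    binary = LinkedFact(Fact("MAN", "tall"), "Riding", Fact("BICYCLE", "folding"))
    # "yellow TAXI Hitting red CAR Towing TRAILER" as a ternary fact:
    ternary = LinkedFact(Fact("TAXI", "yellow"), "Hitting", Fact("CAR", "red"),
                         "Towing", Fact("TRAILER"))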
We can incorporate classification hierarchies and high-level content descriptions based on the ternary fact model into an indexing superstructure based on the following idea. Within a particular domain – tourism, sports, transport – the attributes of data objects can be divided into three categories:
• Metadata or data about the data object or its contents, usually not derivable from viewing the data object itself, e.g., materials used in constructing a racing bicycle or the location of a bike path. The media type of the data object is also a kind of metadata. A standard for describing multimedia data is the goal of MPEG-7 (Multimedia Content Description Interface) [11]. In many cases, metadata treat a data object or its contents as entities within the Entity-Relationship framework of database design and store relevant attributes of the entities as structured data fields in either a relational database or an object database.
• Context or the data object's significance in this domain, e.g., a racing bicycle in sports or a bicycle lane in transport.
• Content or the data object itself and the information contained in the data object that is meaningful for this domain. The content of the data object may be distinguished into high-level and low-level contents. In this paper, we concentrate on high-level content, which can be difficult to extract automatically. Additional structure can be imposed on content, e.g., image objects may be judged to be in the foreground, background or middleground.
The distinction between context and content may be expressed by analogy with books. Context is the subject classification of a book and content is the information in a book, which is usually structured into chapters or sections. An image of Montmartre contains artists and easels; a book about Richard Feynman contains an explanation of quantum electrodynamics. Other examples are shown in Figure 2, which also demonstrates how the same image can be indexed differently in two different domains, with different entities linked to the metadata database. The sports domain metadata database contains tables for athletes and sporting equipment while the transport domain metadata database contains tables for bicycling facilities.
[Figure 2 shows an image of Kathy Watt riding on St. Kilda Road indexed in two domains: in the sports domain the context is the racing bicyclist Kathy Watt, with metadata linking to her monocoque bicycle; in the transport domain the context is the bike lane, with the content caption "Female cyclist riding on St. Kilda Rd bike lane".]
Figure 2. Context, content and metadata of same image in different domains
We shall describe a data model that forms the basis for the systematic representation of image content, using domain-based classification schemes to create a structured indexing syntax. This structure facilitates manual indexing of high-level content and, through the use of data mining rules, allows semantically rich concepts to be efficiently incorporated.
2
Example Application
An example application will illustrate the level of detail that may be appropriate for each set of attributes. Tourism is an application domain that relies heavily on images to attract customers. A tourism image database would store photographs of tourist destinations and activities. Retrieved images could be used in the production of brochures for general distribution or tailored to the interests of specific customers.
2.1 Metadata
The metadata of a tourism image includes:
• Administrative data to control access and payment: identifiers of creators and editors, creation and modification data, usage fee codes.
• Data about the image structure for structure-dependent image-analysis operations and for composition into a presentation document: image format, resolution, bits per pixel, compression data.
• Geographical location: region, country, state, city, locality or GPS coordinates.
• Calendar time of the event represented in the image (usually the same as the creation time): month, day, hour, minute, second to the appropriate level of precision. This can be recorded by the camera and may even be part of the image.
• Other domain-dependent data such as hotel tariffs, height of monuments etc.
The indexer would not be required to enter most of this data as it would be archived and updated as part of a conventional database.
The first four types of metadata would be needed for any photographic image. Note that most of these metadata can be (semi-)automatically derived from the image or require the indexer to specify or select only an identifier that links into a conventional database. Linking other existing conventional databases as metadata can extend the utility of the image database. For example, linking to a conservation database can generate environmentally aware captions such as the following: Alpine butterfly on an orange everlasting, Mt Buffalo National Park, Victoria. The ecology of alpine areas could be affected by rising temperatures associated with the greenhouse effect. [2]
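A minimal sketch of how such metadata might be held as structured fields is given below; the field names and types are illustrative assumptions only, not a schema taken from the paper.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Optional

    @dataclass
    class TourismImageMetadata:
        image_id: str
        creator_id: str                 # administrative data
        usage_fee_code: str
        image_format: str               # image-structure data
        resolution: tuple
        bits_per_pixel: int
        region: str                     # geographical location
        gps: Optional[tuple] = None
        event_time: Optional[datetime] = None              # calendar time of the event
        domain_links: dict = field(default_factory=dict)   # e.g. hotel tariffs

    record = TourismImageMetadata(
        image_id="img2175", creator_id="photo-agency-17", usage_fee_code="B",
        image_format="JPEG", resolution=(2048, 1536), bits_per_pixel=24,
        region="Victoria, Australia", event_time=datetime(1998, 10, 17, 9, 30))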
2.2 Context
The context of an image can be selected from the application domain’s classification hierarchies. In our tourism example, the scene might represent a particular celebration, landmark, tourist activity, accommodation etc. Location and time metadata can be used here to prune the classification trees presented to the indexer.
2.3
High-Level Content
The high-level content of interest to the user is entirely domain-dependent. The domain determines the objects and relationships that users are likely to search for and the vocabulary that users are likely to employ when searching for semantic content. Because the context-level classifications remove ambiguities, captions at this level can convey more detail about the significant objects in an image. Because the metadata include geographical and temporal data as well as metadata of significant objects, captions do not need to repeat these data. A significant entity needs only an identifier to link it into the metadata database. We suggest the following objects1 and modifiers for a tourism image database:
• Person := NATIVE|VISITOR2. Possible modifiers for an unnamed person are male|female, old|young|child|infant.
• Construction := DWELLING|BUSINESS|MONUMENT|ROAD. HOUSE and HOTEL are subclasses of DWELLING; SHOP is a subclass of BUSINESS; BRIDGE is a subclass of ROAD; MUSEUM is a subclass of MONUMENT. Possible modifiers would indicate size, age and religion or type of monument.
• Natural attraction := LAND|WATER|SKY. MOUNTAIN, VALLEY and PLAIN are subclasses of LAND; RIVER and LAKE are subclasses of WATER; SUNSET and AURORA BOREALIS are subclasses of SKY. Possible modifiers would indicate size, colour, age and significance (historic, sacred, remote etc.).
• Plant and animal hierarchies can be adapted from the standard biological classifications. Possible modifiers would indicate size, colours, wild|cultivated, native|introduced etc.
• Other objects of interest include VEHICLEs, FOOD, CLOTHING, TOYs, CRAFTWORK, ARTWORK, RELICs etc.
Relationships (binary and ternary facts) would be restricted to interactions among two or three of the permissible objects, and these relationships would be limited to those likely to attract potential tourists. For example, a VISITOR might Talk To, Buy From or Dance With a NATIVE but a tourism database would not contain an image of a NATIVE Eating a VISITOR! Tourism is a people-oriented domain. Other domains, such as building or botany, would be less concerned with describing the people in an image but would enable the description of buildings or plants in much more detail.
1 Each object may have an associated identifier, e.g., NATIVE(4217) or HOTEL(2175), linking it into the metadata database, which contains the object's name and other data.
2 Although “native” and “visitor” could be modifiers of MAN and WOMAN, in a tourism database, the most significant distinction is that between natives and visitors. Gender and age are of lesser importance.
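The sketch below illustrates how such a domain-restricted vocabulary of objects, modifiers and permissible links could be checked when an indexer enters a caption. All names and the whitelist of links are hypothetical illustrations of the idea, not the paper's actual vocabulary tables.

    # Hypothetical controlled vocabulary for the tourism domain.
    OBJECTS = {"NATIVE", "VISITOR", "DWELLING", "MONUMENT", "MOUNTAIN", "RIVER"}
    MODIFIERS = {"male", "female", "old", "young", "historic", "sacred"}
    ALLOWED_LINKS = {("VISITOR", "Talk To", "NATIVE"),
                     ("VISITOR", "Buy From", "NATIVE"),
                     ("VISITOR", "Dance With", "NATIVE")}

    def valid_binary_fact(subj, link, obj, subj_mod=None, obj_mod=None):
        """Accept a caption only if it uses domain objects, modifiers and links."""
        if subj not in OBJECTS or obj not in OBJECTS:
            return False
        if subj_mod and subj_mod not in MODIFIERS:
            return False
        if obj_mod and obj_mod not in MODIFIERS:
            return False
        return (subj, link, obj) in ALLOWED_LINKS

    print(valid_binary_fact("VISITOR", "Buy From", "NATIVE", "young"))  # True
    print(valid_binary_fact("NATIVE", "Eating", "VISITOR"))             # False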
3
Data Mining for Semi-Automatic Indexing
It is useful to distinguish between two types of indexing paradigms for high-level contents: explicit indexing and implicit indexing. The former requires an explicit entering of every index entry, while the latter includes a component for implicit deduction of index items for inclusion. An explicit-indexing scenario might be:
• For a batch of images from one creator, accept or modify default values for administrative and image structure metadata.
• Download digitized images from a scanner or digital camera and enter time and geographic location data. (Electronic enhancements to equipment will eventually enable the automatic capture of these data.)
• Automatically compute low-level signatures of images.
• The indexer describes each image by selecting from context and content options (dynamically pruned by the place and time data); named entities automatically link to corresponding data in the metadata database.
Implicit indexing, on the other hand, may be viewed as a kind of visual data mining. After populating and manually indexing the image database, it should be possible to perform data mining to detect patterns (the rules described in [9]) in the metadata and content of images that could be used to suggest context- and content-level classifications of new images or to reclassify existing images. Data mining on low-level content and on a combination of metadata, high-level and low-level content has the greatest potential for semi-automating context-level indexing. For example, an image containing people buying and selling food and craftwork could be a marketplace if it also contains a combination of colours, textures and shapes that indicate a chaotic environment. The value of conventional data mining stems from the fact that it is impossible to explicitly specify or enter all useful knowledge into the database. In the present context, the value of data mining stems from the impossibility of indexing everything explicitly in an image. Here, the association between the presence of different kinds of image contents would be discovered and indexed automatically. There are three main types of rules that can be used for data mining:
I. High-level contents → Context
II. Low-level signatures → High-level contents
III. High-level contents + Low-level signatures → Context
These types are discussed in greater detail in [9]. We restrict our discussion here to an example of a type I rule that could be used to extend a classification scheme. Returning to our tourism example, imagine that a general interest in eco-tourism leads to an increase in enquiries about bicycle holidays. While these are common in Europe, they are almost unheard of in Australia, so our tourism image database does not contain the concept of “Bicycle Holiday” anywhere in its classification scheme. However, the process of indexing high-level content would have noted the presence
of “VISITOR riding BICYCLE” in images whose context might be “Rhine River” or “Roman road”. Searching for this high-level content caption would yield a candidate set for the new context caption “Bicycle Holiday” (a subclass of Tourist Activity).
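A type I rule of this kind could be approximated as in the sketch below, which scans an assumed index of (image id, context, high-level caption) entries and proposes a new context for every image whose caption contains a trigger phrase; the index layout and helper names are ours, for illustration only.

    # Hypothetical high-level content index: image id -> (context, caption).
    INDEX = {
        "img12": ("Rhine River", "VISITOR riding BICYCLE"),
        "img19": ("Roman road", "VISITOR riding BICYCLE"),
        "img27": ("Tourist Activity", "NATIVE selling CRAFTWORK"),
    }

    def type_i_rule(index, trigger, new_context):
        """High-level contents -> Context: propose `new_context` for matching images."""
        return {img: new_context
                for img, (_, caption) in index.items() if trigger in caption}

    proposals = type_i_rule(INDEX, "VISITOR riding BICYCLE", "Bicycle Holiday")
    print(sorted(proposals))   # ['img12', 'img19'] -> candidates for "Bicycle Holiday"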
4
Query Specification and Composition
Values or ranges for metadata can be specified using an SQL-like language or a form with fill-in text fields and drop-down choice boxes. Certain metadata lend themselves to visual specification: for example, clicking or dragging on a map can indicate geographic locations of interest. The facts in the Ternary Fact model lend themselves readily to SQL description. For example, to retrieve an image with an old man riding a blue bicycle originated after 1997 from a photo database, one might specify:
SELECT object, originator
FROM photo-db
WHERE year > 1997
AND content = “old MAN Riding blue BICYCLE”
Here, some metadata are included but the essential part is the content specification, which may stand by itself or, more typically, be mixed with metadata specifications. The object field may be stored as a long field within the relational table or, more typically, provide a pointer to the data object. Due to the complexity of various Boolean combinations and the ambiguity of textual description, the content field would point to parsing and searching procedures that would require separate database tables [10]. The interface may also provide mechanisms to facilitate visual query composition, as conventional query languages are unable to capture the pictorial character of visual queries. The user could use a feature palette to create icons of the size, shape, colour and texture that they wish to find, with the option of dragging and dropping these icons onto a structure template to specify the layer (foreground etc.) or position in an image of the objects represented by the icons. Many software packages for creating graphics organize images in layers and some domains have standard names for these layers, e.g., in geographical mapping and architecture. Such structure can be incorporated into the query interface at this level.
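As an illustration of the separate parsing step mentioned above, the following sketch splits a content caption into its modifier/fact/link components; the convention that facts are upper case, modifiers lower case and links title case is taken from the ternary fact model described earlier, while the function itself is only an assumed helper (it does not handle multi-word links).

    def parse_caption(caption):
        """Split e.g. 'old MAN Riding blue BICYCLE' into modifiers, facts and links."""
        parsed = {"facts": [], "modifiers": {}, "links": []}
        pending_modifier = None
        for token in caption.split():
            if token.isupper():                       # elementary/outline fact
                parsed["facts"].append(token)
                if pending_modifier:
                    parsed["modifiers"][token] = pending_modifier
                    pending_modifier = None
            elif token.islower():                     # modifier
                pending_modifier = token
            else:                                     # title-case link
                parsed["links"].append(token)
        return parsed

    print(parse_caption("old MAN Riding blue BICYCLE"))
    # {'facts': ['MAN', 'BICYCLE'],
    #  'modifiers': {'MAN': 'old', 'BICYCLE': 'blue'}, 'links': ['Riding']}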
5
Summary and Concluding Remarks
Conventional indexing of textual contents requires a matching of the indexing language and the query language. In the case of visual data content, this is not always possible as there are a variety of aspects of an image that need to be indexed, and these have to be incorporated in different ways. As far as possible, we adopt a systematic approach for structuring the contents and incorporate the same structure within a suitably extended query language such as SQL.
Retrieval of visual information always requires substantial pruning of the search space, and such pruning needs to be achieved by different means. In this study, we incorporate classification hierarchies into an indexing superstructure of metadata, context and content, incorporating high-level content descriptions based on a welldefined data model for image contents. Here, metadata can be automatically extracted and can subsequently help to further limit the indexer's options for context and content. This structure facilitates the indexing of high-level contents and allows semantically rich concepts to be efficiently incorporated. We also incorporate a mechanism for implicit indexing, which may be regarded as a form of data mining on the index to determine rules that may be used to semi-automatically (re)classify images. The method presented forms part of an integrated scheme for the effective retrieval of images based on a spectrum of image characteristics, and it is intended that such a scheme may be implemented for wider usage and experimentation on the Internet.
References
1. D. Adjeroh and K. C. Nwosu, “Multimedia database management requirements and issues”, IEEE Multimedia, Vol. 4, No. 4, 1997, pp. 24-33.
2. Australian Conservation Foundation, Wilderness Diary, 1998, Week 42.
3. S. K. Chang, “Extending visual languages for multimedia”, IEEE Multimedia, Fall 1996, pp. 18-26.
4. Chang, S. F. et al., “Visual Information Retrieval from Large Distributed Online Repositories”, Comm. ACM, Vol. 40, Dec 1997, pp. 63-71.
5. Grosky, W. I., “Managing Multimedia Information in Database Systems”, Comm. ACM, Vol. 40, Dec 1997, pp. 72-80.
6. A. Gupta and R. Jain, “Visual information retrieval”, Comm. ACM, Vol. 40, No. 5, 1997, pp. 70-79.
7. R. Jain, Private communication, 1998.
8. V. Kashyap, K. Shah and A. Sheth, “Metadata for building the MultiMedia Patch Quilt”, Multimedia Database Systems: Issues and Research Directions, S. Jajodia and V. S. Subrahmanian (Eds.), Springer-Verlag, 1995.
9. C. H. C. Leung and D. Sutanto, “Multimedia Data Modeling and Management for Semantic Content Retrieval”, Handbook of Multimedia Computing, B. Furht (Ed.), CRC Press, 1998.
10. C. H. C. Leung and Z. J. Zheng, “Image Data Modelling for Efficient Content Indexing”, Proc. IEEE International Workshop on Multi-media Database Management Systems, New York, August 1995, IEEE Computer Society Press, pp. 143-150.
11. F. Pereira, “MPEG-7: A Standard for Content-Based Audiovisual Description”, Proc. Visual '97, San Diego, Dec 1997, pp. 1-4.
12. W. W. S. So, C. H. C. Leung and Z. J. Zheng, “Analysis and evaluation of search efficiency for image databases”, in Image Databases and Multi-media Search, A. Smeulders and R. Jain (Eds.), World Scientific, 1997.
13. C. H. C. Leung (Ed.), Visual Information Systems, Springer-Verlag Lecture Notes in Computer Science LNCS 1306, Heidelberg, 1997.
Feature Extraction: Issues, New Features, and Symbolic Representation Maziar Palhang and Arcot Sowmya Artificial Intelligence Department School of Computer Science and Engineering The University of New South Wales Sydney, NSW 2052, Australia {maziar,sowmya}@cse.unsw.edu.au
Abstract. Feature extraction is an important part of object model acquisition and object recognition systems. Global features describing properties of whole objects, or local features denoting the constituent parts of objects and their relationships may be used. When a model acquisition or object recognition system requires symbolic input, the features should be represented in symbolic form. Global feature extraction is well-known and oft-reported. This paper discusses the issues involved in the extraction of local features, and presents a method to represent them in symbolic form. Some novel features, specifically between two circular arcs, and a line and a circular arc, are also presented.
1
Introduction
The items of information extracted from images and used for object recognition in an image are called features. These features are often matched to similar features in object models, and should be chosen such that they can uniquely identify objects appearing in the images. Features may be global, representing properties of whole objects, or local, denoting the constituent parts of objects and their relationships. Utilising local features is more appealing since the object may still be recognised in the presence of noise and occlusion. The suitability of features incorporated into object models for later recognition is crucial in the success of the model creation (or model learning) and object recognition system. Ideally, features should explain the relationships among constituent parts of objects and produce enough constraints among them. Lack of suitable features can lead to poor discrimination of an object from others. It is essential to extract and select features which are feasible to compute, increase the performance of the system, and decrease the amount of available information (such as raw measurements from the input) to a manageable size without losing any valuable information [Sch92].
Feature Extraction: Issues, New Features, and Symbolic Representation
419
Symbolic representation of features becomes necessary when the system processing the images demands symbolic input. For example, when creating models of objects using a symbolic learning system, such as TSFOIL [PS97b], the images of sample objects should be represented in symbolic form, which has been considered as a challenge [BP94]. Alternatively, when a symbolic language, such as Prolog, is used to interpret the contents of an image, again the input should be in symbolic form. So far, there have been some attempts to extract features and represent them in symbolic form, such as [CB87, PRSV94, SL94] to name a few. Papers often only describe the kind of features they have used, mostly global, and do not delve into the details. We have been involved in a project to learn object models from images using symbolic learning systems, and found gaps in the feature extraction area, especially for local features. In this paper, we discuss some of these issues which must be considered when extracting local features, then introduce some features used in our investigations, parts of which are novel, and describe their extraction. These features may be used by other researchers in other applications, due to the general nature of the extraction process. In section 2, we introduce different features and their extraction method. Symbolic representation is discussed in section 3, with an example. A discussion on the approach concludes the paper.
2
Finding Relations
The primary information extracted from images consists of edges. In our system, edges are extracted using the Canny operator. These edges are then linked, and partitioned into a series of straight lines and circular arc segments (called arcs henceforth). Before model acquisition or recognition can proceed, relations between lines and arcs should be found. Generally, the relations should be translation, rotation, and scale invariant to have broader applicability. The relations must also constrain the arrangement of segments such that they are easily distinguishable from other arrangements. To derive the relations used in our experiments, we were inspired by past research in the area, especially perceptual organisation results [Low85], and heuristics reported in [BL90, Gri90, Pop95]. A relation which expresses a property of a segment is called a unary relation. The majority of unary relations are not translation, rotation, or scale invariant. Thus, in our experiments they are not commonly used, except the swept-angle of arcs, which is invariant. Binary relations are explained in the subsequent section.
2.1 Binary Relations
A relation which describes a property between two segments is called a binary relation. There are different binary relations that may be considered. These relations are found by exhaustively considering all pairs of segments first. To
make a decision, hand-tuned threshold values have often been used. In an image, there are a large number of segments. Finding all possible binary relations among all these segments makes the search space huge both for model acquisition and recognition, and degrades system performance. Hence, if lsmaller is the length of the smaller segment, llonger the length of the longer segment, and dmin the minimum distance between the two segments, for extracting binary relations we assume: (lsmaller/llonger ≥ 0.25) and (dmin ≤ llonger).
Binary relations between line segments. To find relations between two line segments, different distances are found first. There are eight different distances as shown in Fig. 1(a). Only those distances falling within the line segments are taken into account. This relation itself is not scale invariant, and even if it is normalised, it is not a good feature when there is occlusion. However, it is the basis for extracting other relations, as explained in the following:
Fig. 1. (a) Distances between two lines. (b) The orientation of a line depends on where the origin of the coordinate system is placed. (c) The angle between line i and line j.
– near lines. If the minimum distance of the extreme points to each other is less than 6 pixels, these lines are considered to be near each other. The relation connected has not been considered since, due to noise or other imaging effects, two lines connected in reality may be disconnected in the extracted edges. However, the near lines relation can cover this case as well.
– angle between two lines. To find the angle between two lines, we first find out which pair of extreme points of the two lines are nearer, since the orientation of each line could have two values depending on where the origin of the coordinate system is. For example, if the orientation of a line is 30° or −330° when the origin is at one end, it will be 210° or −150° when the origin is moved to its other end (Fig. 1(b)). Thus, the extreme points which are nearer together are moved to the origin and the angle between the two lines is measured (Fig. 1(c)). The angle is always considered to be positive. Also, angles are measured directionally, that is, the angle between line i and line j, αllij, is the angle that line j should rotate in the counterclockwise direction
to reach line i. For instance, if αllij is 30°, then αllji is 330°. This property helps in discriminating shapes from each other.
– collinear. Suppose the vector f connects two extreme points of line i and line j which are not the nearest extreme points, in the direction of line i to line j. Let the angle between line i and vector f be αlfi (Fig. 1(c)). Then, line i is collinear with line j if: (170° ≤ αllij ≤ 190° OR αllij ≤ 10° OR αllij ≥ 350°) AND (170° ≤ αlfi ≤ 190° OR αlfi ≤ 10° OR αlfi ≥ 350°).
– parallel. Two lines are parallel to each other if: (170° ≤ αllij ≤ 190° OR αllij ≤ 10° OR αllij ≥ 350°) AND (10° < αlfi < 170° OR 190° < αlfi < 350°).
– acute. Line i is acute to line j if: 10° < αllij < 75°.
– right angle. Line i is at right angle to line j if: 75° ≤ αllij ≤ 105°.
– obtuse. Line i is obtuse to line j if: 105° < αllij < 170°.
Binary relations between arc segments. In the same manner as line segments, different distances between two arcs are measured first, as illustrated in Fig. 2(a). Based on these distances, the following relations are found:
Fig. 2. (a) Different distances between two arcs. (b) Different angles between two arcs. (c) Normal and chord vectors of two arcs i and j. Also shown is the vector connecting two nearest extreme points of two arcs, from arc i to arc j.
– near arcs. There are four different distances among the extreme points of two arcs, as shown in Fig. 2(a). If the minimum of these distances is less than 6 pixels, they are considered to be near to each other.
– angle. There are different angles between two arcs. We consider the angles between the lines connecting the centres of the arcs to their corresponding endpoints with respect to each other. This produces four different angles. The centre points of the two arcs are moved to the origin of the coordinate system to measure the angles between these lines (α1, α2, α3, and α4 in Fig. 2(b)). Only the minimum and maximum angles between two arcs are considered.
Fig. 3. Relations between two arcs, (a) hill, (b) tilde, (c) inverse tilde, (d) wave, (e) neck, (f) beak. Left and right shapes show the border conditions, and the middle shapes show the normal case.
– normal angle. The normal angle of an arc i with respect to another arc j (αnnij) is the angle that the normal of arc j should rotate in the counterclockwise direction to reach the normal of arc i (Fig. 2(c)).
– chord normal angle. The chord normal angle of an arc i with respect to another arc j (αcnij) is the angle that the normal of arc j should rotate in the counterclockwise direction to reach the chord of arc i (Fig. 2(c)). In finding the angle of a chord of an arc, the origin is considered at the extreme point which is nearest to the other arc.
– A set of new relations between two arcs. There are quite well-known relations between two lines, such as acute, obtuse, etc.; however, there are no such relations defined for two arcs or for a line and an arc. This motivated us to devise some new features. They are hill, tilde, inverse tilde, wave, neck, and beak, shown in Fig. 3. The equations for finding these relations may be found in the Appendix. The point to notice is that the neck relation and the beak relation need more careful examination, because the available angular relations cannot always separate these two from each other. Such a situation is shown in Fig. 4. Let mij be the vector connecting the nearest extreme points of arc i and arc j in the direction of i to j (Fig. 2(c)), and αcmji be the angle that this vector should rotate in the counterclockwise direction to reach the chord of arc j. This vector helps us to distinguish these two relations from each other, by observing whether the chord of arc j is on the left or right of this vector. An exception is displayed in Fig. 4(c), which can be recognised by checking whether the nearest extreme point of arc j to arc i is inside or outside of the circle of which arc i is a part. This can be found by comparing dj, the distance of this point to the centre of arc i, with the radius ri of arc i.
Binary relations between an arc and a line segment. Different distances may be measured between an arc and a line, as displayed in Fig. 5(a). The following relations are extracted:
– near arc line. There are four different distances among the extreme points of a line and an arc, as shown in Fig. 5(a). If the minimum of these distances is less than 6 pixels, they are considered to be near to each other.
Fig. 4. A situation where a neck relation (a),(c) is not distinguishable from a beak relation (b) by using normal and chord angles alone.
Fig. 5. (a) different distances between a line and an arc, (b) the relations angle normal line, and angle chord line.
– angle. The angle between an arc and a line is considered to be the angle between the line itself and the arc's radius passing through the extreme point of the arc that is nearer to the line. The nearest extreme points of the arc and the line are moved to the origin of the coordinate system to measure this angle. For example, in Fig. 5(a) the points (xsi, ysi) and (xsj, ysj) are near to each other; thus the angle between rj1 and the line is considered to be the angle between the arc and the line.
– angle normal line. The angle through which a line j should rotate in the counterclockwise direction to reach the normal to an arc i, measured based on the nearest extreme points of the two segments, is called the angle normal line between these two segments and is represented by αnlij (Fig. 5(b)).
– angle chord line. The angle through which a line j should rotate in the counterclockwise direction to reach the chord of an arc i, measured based on the nearest extreme points of the two segments, is called the angle chord line between these two segments and is represented by αclij (Fig. 5(b)).
– A set of newly devised features consisting of stick, inverse stick, doo(1), and inverse doo, shown in Fig. 6. The equations for deriving these relations are presented in the Appendix.
(1) Doo is the name of the number 2 in the Farsi language. We chose this name since 2 has this shape in Farsi.
Fig. 6. Relations between an arc and a line, (a) stick, (b) inverse stick, (c) doo, and (d) inverse doo. Left and right shapes show the border conditions, and the middle shapes show the normal case.
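To make the geometric machinery of Sect. 2.1 more tangible, the short sketch below computes the directional angle αll between two line segments from their nearer extreme points and maps it to the acute / right angle / obtuse labels using the thresholds quoted earlier; it is a simplified illustration (it ignores the collinear/parallel tests and the vector f), not the authors' implementation.

    import math

    def directional_angle(line_i, line_j):
        """Angle (degrees, 0-360) that line j must rotate counterclockwise to reach line i.
        Each line is ((x_start, y_start), (x_end, y_end)); the nearer extreme points of
        the two lines are treated as the common origin."""
        (si, ei), (sj, ej) = line_i, line_j
        # choose the pair of extreme points that are closest to each other
        pairs = [(a, b) for a in (si, ei) for b in (sj, ej)]
        oi, oj = min(pairs, key=lambda p: math.dist(p[0], p[1]))
        other_i = ei if oi == si else si
        other_j = ej if oj == sj else sj
        ang_i = math.degrees(math.atan2(other_i[1] - oi[1], other_i[0] - oi[0]))
        ang_j = math.degrees(math.atan2(other_j[1] - oj[1], other_j[0] - oj[0]))
        return (ang_i - ang_j) % 360.0

    def classify(alpha):
        if 10 < alpha < 75:
            return "acute"
        if 75 <= alpha <= 105:
            return "right angle"
        if 105 < alpha < 170:
            return "obtuse"
        return "other (collinear/parallel tests needed)"

    a = directional_angle(((0, 0), (8, 8)), ((0, 0), (10, 0)))
    print(a, classify(a))   # 45.0 acute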
3
Symbolic Representation of the Relations
Once the relations are found, they can be represented in symbolic form. The above relations may be represented as Prolog facts as shown below:
partof(object, seg)   swept angle(seg, no)   parallel(seg, seg)*   collinear(seg, seg)*
acute(seg, seg)   right angle(seg, seg)   obtuse(seg, seg)   near lines(seg, seg)*
hill(seg, seg)*   tilda(seg, seg)*   inv tilda(seg, seg)*   wave(seg, seg)*
beak(seg, seg)*   neck(seg, seg)*   near arcs(seg, seg)   near arc line(seg, seg)*
stick(seg, seg)   inv stick(seg, seg)   doo(seg, seg)   inv doo(seg, seg)
The words object, seg, and no refer to the types of the arguments. For example, if an object o0 has 10 segments s1 to s10, then o0 is of object type and s1 to s10 of seg type. Type no is just a real valued number. Since the angles are directional, it is assumed that the second argument is rotated counterclockwise to reach the first argument in the case of angular binary relations. In the case of symmetric relations, the order does not matter. Considering the angles directionally constrains not only the position of segments in space, but also the different ways in which they can be represented in symbolic form. The relation partof is necessary to create links between an object and its segments, so that the symbolic input processor knows which segment belongs to which object. As an example, in one of our learning experiments, the following rule was created to describe a mug:
mug(A) :- partof(A, B), hill(B, C), hill(C, D), stick(C, E), hill(B, F).
The capital letters are variables standing for the segments and objects. In a system such as Prolog, these variables may be replaced by other terms in checking if an object is a mug or not. More information on learning object models and recognising objects using this approach may be found in [PS97a].
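To illustrate how such symbolic facts could be consumed outside of Prolog, here is a small Python sketch that stores binary relations as tuples and checks the mug rule above by brute-force enumeration of segment bindings; the fact set is invented, and a real system (e.g. Prolog or TSFOIL) would of course use proper unification rather than nested loops.

    from itertools import product

    # Hypothetical fact base for one object o0.
    PARTOF = {("o0", "s1")}
    HILL   = {("s1", "s2"), ("s2", "s3"), ("s1", "s4")}
    STICK  = {("s2", "s5")}

    def is_mug(obj, segments):
        # mug(A) :- partof(A,B), hill(B,C), hill(C,D), stick(C,E), hill(B,F).
        for b, c, d, e, f in product(segments, repeat=5):
            if ((obj, b) in PARTOF and (b, c) in HILL and (c, d) in HILL
                    and (c, e) in STICK and (b, f) in HILL):
                return True
        return False

    print(is_mug("o0", ["s1", "s2", "s3", "s4", "s5"]))   # True with the facts above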
4
Discussion
The importance of the feature extraction stage in building object models and recognising objects was explained. Some important issues and details about extracting local features were pointed out. Due to the lack of abstract relations between two arcs, or a line and an arc, a set of new features, specifically hill, tilde, inverse tilde, wave, neck, beak, stick, inverse stick, doo, and inverse doo, was introduced. The representation of relations in symbolic form was then discussed.
These features may be considered as knowledge that an expert provides to the model-creation or recognition system before model creation or object recognition proceeds (so-called background knowledge). By abstracting the underlying concepts, these features can greatly facilitate recognition and learning. In addition, as symbolic learning systems are not strong enough to handle numeric data well, feature abstraction is necessary for model acquisition. Moreover, since these systems often restrict the length of the rules they create, an abstract feature may replace the simpler features it represents and help to create a rule which might not otherwise be possible. In our experiments, these additional features reduced the learning time and increased the coverage and efficiency of the rules. In addition, they allowed us to learn models of some objects which do not necessarily have straight-line edges. Obviously, this feature repertoire is not strong enough to represent all kinds of objects, especially soft objects and natural objects. Nor do we claim that the heuristics are optimal. However, they can provide a basis that other researchers can use in their own work and possibly improve.
References
[BL90] R. Bergevin and M. D. Levine. Extraction of line drawing features for object recognition. In Proc. of IEEE 10th International Conference on Pattern Recognition, pages 496-501, Atlantic City, New Jersey, USA, June 1990.
[BP94] B. Bhanu and T. A. Poggio. Introduction to the special section on learning in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9):865-868, Sep. 1994.
[CB87] J. H. Connell and M. Brady. Generating and generalizing models of visual objects. Artificial Intelligence, 31:159-183, 1987.
[Gri90] W. E. L. Grimson. Object Recognition by Computer: the role of geometric constraints. MIT Press, 1990.
[Low85] D. G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, 1985.
[Pop95] A. R. Pope. Learning To Recognize Objects in Images: Acquiring and Using Probabilistic Models of Appearance. PhD thesis, Department of Computer Science, The University of British Columbia, Canada, December 1995.
[PRSV94] P. Pellegretti, F. Roli, S. B. Serpico, and G. Vernazza. Supervised learning of descriptions for image recognition purposes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):92-98, January 1994.
[PS97a] M. Palhang and A. Sowmya. Automatic acquisition of object models by relational learning. In C. Leung, editor, Visual Information Systems, volume 1306 of Lecture Notes in Computer Science, pages 239-258. Springer, 1997.
[PS97b] M. Palhang and A. Sowmya. Two stage learning, two stage recognition. In Poster Proc. of the Australian Joint Conference on Artificial Intelligence (AI'97), pages 191-196, Perth, Australia, December 1997.
[Sch92] Robert J. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley and Sons, 1992.
[SL94] A. Sowmya and E. Lee. Generating symbolic descriptions of two-dimensional blocks world. In Proc. of IAPR International Workshop on Machine Vision Applications, pages 65-70, Kawasaki, Japan, December 1994.
Appendix
In this Appendix, the equations to extract the new relations devised are provided.
– hill:
270° ≤ αnnij ≤ 360° AND 0° ≤ αcnij ≤ 90° AND 0° ≤ αnnji ≤ 90° AND 270° ≤ αcnji ≤ 360°
OR 270° ≤ αnnji ≤ 360° AND 0° ≤ αcnji ≤ 90° AND 0° ≤ αnnij ≤ 90° AND 270° ≤ αcnij ≤ 360°
OR 0° ≤ αnnij ≤ 90° AND 90° ≤ αcnij ≤ 180° AND 270° ≤ αnnji ≤ 360° AND 180° ≤ αcnji ≤ 270°
OR 0° ≤ αnnji ≤ 90° AND 90° ≤ αcnji ≤ 180° AND 270° ≤ αnnij ≤ 360° AND 180° ≤ αcnij ≤ 270°
– tilde:
90° ≤ αnnij ≤ 180° AND 180° ≤ αcnij ≤ 270° AND 180° ≤ αnnji ≤ 270° AND 270° ≤ αcnji ≤ 360°
OR 90° ≤ αnnji ≤ 180° AND 180° ≤ αcnji ≤ 270° AND 180° ≤ αnnij ≤ 270° AND 270° ≤ αcnij ≤ 360°
– inverse tilde:
90° ≤ αnnij ≤ 180° AND 0° ≤ αcnij ≤ 90° AND 180° ≤ αnnji ≤ 270° AND 90° ≤ αcnji ≤ 180°
OR 90° ≤ αnnji ≤ 180° AND 0° ≤ αcnji ≤ 90° AND 180° ≤ αnnij ≤ 270° AND 90° ≤ αcnij ≤ 180°
– wave:
270° ≤ αnnij ≤ 360° AND 0° ≤ αcnij ≤ 90° AND 0° ≤ αnnji ≤ 90° AND 90° ≤ αcnji ≤ 180°
OR 270° ≤ αnnji ≤ 360° AND 0° ≤ αcnji ≤ 90° AND 0° ≤ αnnij ≤ 90° AND 90° ≤ αcnij ≤ 180°
OR 0° ≤ αnnij ≤ 90° AND 270° ≤ αcnij ≤ 360° AND 270° ≤ αnnji ≤ 360° AND 180° ≤ αcnji ≤ 270°
OR 0° ≤ αnnji ≤ 90° AND 270° ≤ αcnji ≤ 360° AND 270° ≤ αnnij ≤ 360° AND 180° ≤ αcnij ≤ 270°
– neck:
90° ≤ αnnij ≤ 270° AND 180° ≤ αcnij ≤ 360° AND 90° ≤ αnnji ≤ 270° AND 0° ≤ αcnji ≤ 180° AND αcmji ≤ 180° AND dj > ri
OR 90° ≤ αnnji ≤ 270° AND 180° ≤ αcnji ≤ 360° AND 90° ≤ αnnij ≤ 270° AND 0° ≤ αcnij ≤ 180° AND αcmji > 180° AND dj > ri
– beak:
90° ≤ αnnij < 180° AND 180° ≤ αcnij < 270° AND 180° < αnnji ≤ 270° AND 90° < αcnji ≤ 180°
OR 90° ≤ αnnji < 180° AND 180° ≤ αcnji < 270° AND 180° < αnnij ≤ 270° AND 90° < αcnij ≤ 180°
OR 180° ≤ αnnij ≤ 270° AND 270° ≤ αcnij ≤ 360° AND 90° ≤ αnnji ≤ 180° AND (αcmji ≥ 180° OR (αcmji ≤ 180° AND 0° ≤ αcnji ≤ 90° AND dj < ri))
OR 180° ≤ αnnji ≤ 270° AND 270° ≤ αcnji ≤ 360° AND 90° ≤ αnnij ≤ 180° AND (αcmji < 180° OR (αcmji > 180° AND 0° ≤ αcnij ≤ 90° AND dj < ri))
– stick: 90° < αnlij ≤ 270° AND 0° ≤ αclij ≤ 180° OR αnlij = 90° AND αclij < 180°
– inverse stick: 90° < αnlij < 270° AND 180° < αclij < 360°
– doo: αnlij ≥ 270° AND αnlij ≤ 90° AND 0° ≤ αclij < 180°
– inverse doo: αnlij ≥ 270° AND αnlij ≤ 90° AND 180° ≤ αclij ≤ 360°
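As an example of how these tabulated conditions translate into code, the sketch below implements only the first disjunct of the hill test, assuming the four angles have already been computed in degrees as defined in Sect. 2.1; it is a direct transcription for illustration, not the authors' code.

    def hill_first_case(a_nn_ij, a_cn_ij, a_nn_ji, a_cn_ji):
        """First disjunct of the hill relation between arcs i and j (angles in degrees)."""
        return (270 <= a_nn_ij <= 360 and 0 <= a_cn_ij <= 90 and
                0 <= a_nn_ji <= 90 and 270 <= a_cn_ji <= 360)

    print(hill_first_case(300, 45, 30, 300))    # True: a hill-like configuration
    print(hill_first_case(120, 200, 200, 300))  # False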
Detection of Interest Points for Image Indexation
Stéphane Bres and Jean-Michel Jolion
Laboratoire Reconnaissance de Formes et Vision, Bât 403 INSA, 20, Avenue Albert Einstein, 69621 Villeurbanne Cedex, France
Tel: 33 4 72 43 87 59, Fax: 33 4 72 43 80 97
[email protected] http://rfv.insa-lyon.fr/~jolion
Abstract. This paper addresses the problem of detection and delineation of interest points in images as part of an automatic image and video indexing for search by content purposes project. We propose a novel key point detector based on multiresolution contrast information. We compare this detector to the Plessey feature point detector as well as the detector introduced in the SUSAN project. As we are interested in common database applications, we focus this comparison on robustness versus coding noise like Jpeg noise.
1
Introduction
This paper addresses the problem of detection and delineation of interest points in images and sequences of images. This study(1) is part of our current research in the field of automatic image and video indexing for search-by-content purposes. In the field of image and video indexing, one is often interested in compact features extracted from the signal. More particularly, one of the most popular approaches to large image database search is the iconic request, e.g. find some images similar to the one given as an example. Some now well-known products are available (for instance [Fli 95]) but they are not so powerful, especially because nobody really knows what "similar" means [Aig 96]. Basically, a classic way consists in, first, extracting features from the images of the database, then compacting these features into a reduced set of N indexes. Given an image example, the process is thus to extract features, to project onto the index space and to look for the nearest neighbor based on some particular distance. The features are mostly global ones, like parameters extracted from the colour distribution, or the coefficients in the Fourier or wavelet domains. Another approach is based on interest points. It argues that two signals are similar if they have particular characteristics spatially located in a consistent order. The locations of these particular characteristics are called the interest points or key points.
(1) This work has been supported by the European Community under project INCO 950363 TELESUN and by the Région Rhône-Alpes grant ACTIV.
It is quite easy to understand that using a small set of such points instead of the whole image reduces the amount of data to be processed. Moreover, local information extracted in the neighborhood of these particular points is assumed to be more robust to classic transformations (additive noise, affine transformations including translation, rotation and scale effects, partial visibility, and so on). In this paper, we first introduce in section 2 our model based on the multiresolution contrast energy. Section 3 will discuss current results, compare our detector to the classic differential approach on images, and present further studies that have to be carried out in order to better emphasize this novel approach.
2 Multiresolution Contrast Energy
2.1 A Preliminary Discussion
A very large set of interest point detectors has already been proposed in the literature [Smi 97]. This wide variety of detectors is mainly due to a lack of definition of the concept of interest points. However, most of these works refer to the same literature and basically assume that key points are equivalent to corners or, more generally speaking, to image points characterized by a significant gradient amount in more than one direction. Obviously, a differential framework results from these definitions. The motivations of our work are mainly the disadvantages of the previous works, and we summarize them as follows:
Why should key points be corners? The edge point is a widely used feature for image analysis. Corner points have been used because an image contains too many edge points. However, it is not clear that there is any other justification for this choice. This is why we prefer energy-based responses which do not assume any particular geometric model.
Why use a differential framework? It is very difficult to design an image feature without any reference to some variation measurement. Indeed, as for the visual system, only variations in the image intensities are of importance. However, while the classic way to estimate a local amount of variation is to use a gradient-based, i.e. differential, approach, it is well known that this leads to some problems (a priori model of the signal, choice of the natural scale, directionally dependent values, non-scalar response as the gradient is a vector, and so on). That is why we propose to use a less constrained model, the contrast. This model also accounts for the need for a non-absolute approach. Indeed, although the human visual system cannot accurately determine the absolute level of luminance, contrast differences can be detected quite consistently.
What about scale? It is obvious that we must take care of the scale effect. We argue that the key point extraction must be multiscale instead of a simple accumulation of one-scale extractions. That is why our framework is based on multiresolution operators like those described in [Jol 93].
2.2 Multiresolution Contrast: A Brief Review
Usually, the luminance contrast is defined as C = L/Lb − 1, where L denotes the luminance at a certain location in the image plane and Lb represents the
luminance of the local background. More generally, L and Lb are computed from neighborhoods or receptive fields whose center P is the pixel to be processed, the neighborhood associated with Lb being greater than that of L. The size of the neighborhood is a piece of a priori information in this kind of approach. It is clear that it has to be related to the size of the details to be emphasized in the image. However, rarely is this size unique (this is exactly the same problem as the scale effect of differential operators) for a given image. It is thus interesting to work simultaneously on several sizes for a given point. In [Jol 94], we introduced the contrast pyramid. The pyramid framework allows the manipulation of multiple neighborhood sizes. Let P be a node on level k in an intensity pyramid, e.g. a Gaussian pyramid. Its value Gk(P) denotes the local luminance (i.e. in a local neighborhood whose size is related to the size of the receptive field of P):
Gk(P) = Σ_{M ∈ sons(P)} w(M) · Gk−1(M)    (1)
where w is a normalized weight function which can be tuned to simulate the Gaussian pyramid [Jol 93]. The luminance of the local background is obtained from the luminances of the fathers of P. Thus, the background pyramid is built as follows:
Bk(P) = Σ_{Q ∈ fathers(P)} W(Q) · Gk+1(Q) ≡ Expand[Gk+1](P)    (2)
where W is a normalized weight function which takes into account the way P is used to build the luminance of its fathers. The contrast pyramid is thus defined by
Ck(P) ≡ Gk(P) / Bk(P) for 0 ≤ k ≤ N − 1, and CN(P) ≡ 1    (3)
where N is the size of the pyramid, e.g. the input image I is 2^N × 2^N pixels. It can easily be shown that (C0, ..., CN) is an exact code of the input image I ≡ G0.
2.3 The Multiresolution Minimum Contrast
A key point is characterized, in our approach, by a locally significant amount of contrast. We thus design our indicator as follows. First, we must take into account the non-symmetry of the contrast measure regarding the intensity distribution (we do not get similar values of the contrast for low intensities and for high intensities, as the contrast is defined as a ratio). We also modify this ratio in order to get 0 for a no-contrast situation and a value greater than zero elsewhere. We then use a modified contrast measure:
Ck*(P) = Min( |Gk(P) − Bk(P)| / Bk(P) , |Gk(P) − Bk(P)| / (255 − Bk(P)) )    (4)
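The following sketch computes a simplified version of this modified contrast pyramid with NumPy. To keep it short it uses plain 2×2 block averaging for the reduction and nearest-neighbour duplication for Expand, instead of the weighted kernels w and W of equations (1) and (2); image sizes are assumed to be powers of two.

    import numpy as np

    def reduce_level(g):
        """One REDUCE step: 2x2 block average (a crude stand-in for eq. (1))."""
        return g.reshape(g.shape[0] // 2, 2, g.shape[1] // 2, 2).mean(axis=(1, 3))

    def expand_level(b):
        """One EXPAND step: nearest-neighbour duplication (stand-in for eq. (2))."""
        return np.kron(b, np.ones((2, 2)))

    def min_contrast_pyramid(image):
        """Modified contrast C*_k of eq. (4) for every level of the pyramid."""
        g = [image.astype(float)]
        while g[-1].shape[0] > 1:
            g.append(reduce_level(g[-1]))
        contrasts = []
        for k in range(len(g) - 1):
            b = expand_level(g[k + 1])           # local background B_k
            diff = np.abs(g[k] - b)
            c_star = np.minimum(diff / np.maximum(b, 1e-6),
                                diff / np.maximum(255.0 - b, 1e-6))
            contrasts.append(c_star)
        return contrasts

    levels = min_contrast_pyramid(np.random.randint(0, 256, (64, 64)))
    print([c.shape for c in levels][:3])   # [(64, 64), (32, 32), (16, 16)]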
In the previous approaches, the authors used a local averaging in order to collapse the gradient distribution of values. This step is not required in our approach thanks to the multiresolution scheme.
2.4 Extracting the Multiresolution Key Points
The contrast based key points are the local maxima of the minimum contrast pyramid above a predefined threshold (but as shown later this threshold is only useful to reduce the number of key points). Figure 1 shows an example of these key points for several levels.
Fig. 1. Contrast key points for three consecutive levels of a given image of a portrait with threshold value = 0.2.
The next step consists in collapsing this multiresolution representation of the input image into a compact representation. We propose two classic coarse-to-fine strategies, depending on the kind of map one wants to obtain. First, assume that one wants to build a simplified gray level map which best keeps the relevant information of the input signal based on the multiresolution key points. The scheme is thus as follows: extract the local maxima of the minimum contrast pyramid C* and re-build the input image from this modified code.
G̃k(P) = (Ck(P) if P is a key point, 1 otherwise) · Expand[G̃k+1](P), for k = N − 1, ..., 0    (5)
Figure 2 shows an example of this strategy applied to a portrait image. Another resulting map is made of the locations of the multiresolution key points (we will use it in our comparison with other interest point detectors). So, we use the following two-step coarse-to-fine strategy. First, sum the contrasts of the key points only across the pyramid.
Fig. 2. Compact key point image based on the multiresolution observed points. (a) input image (b) reconstructed image using 0.1 as threshold. (c) the contrast energy map (with threshold = 0). (d) the corresponding final map of key points (enlarged and superimposed on the original signal).
Ẽk(P) = (1/(k+1) · Ck*(P) if P is a key point, 0 otherwise) + Expand[Ẽk+1](P), for k = N − 1, ..., 0, and ẼN = 0    (6)
This leads us to some kind of energy map (see Figure 2c). The coefficient 1/(1+k) is introduced in order to emphasize the local contrasts versus the global ones. Then, we filter out the non-local maxima. This step is required if one wants to get a binary map as shown in Figure 2d.
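A sketch of this coarse-to-fine accumulation and the final non-maxima filtering is given below; it reuses expand_level and the contrast pyramid `levels` from the previous sketch, takes the local maxima above a threshold as key points, and keeps only pixels that dominate their 3×3 neighbourhood in the energy map. It is again a simplified reading of equation (6), not the authors' implementation.

    import numpy as np

    def local_maxima(a):
        """True where a pixel is >= all of its 3x3 neighbours."""
        padded = np.pad(a, 1, mode="edge")
        out = np.ones(a.shape, dtype=bool)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out &= a >= padded[1 + dy:1 + dy + a.shape[0],
                                   1 + dx:1 + dx + a.shape[1]]
        return out

    def energy_map(contrasts, threshold=0.1):
        """Coarse-to-fine accumulation in the spirit of eq. (6); contrasts[k] is C*_k."""
        energy = None
        for k in range(len(contrasts) - 1, -1, -1):
            c = contrasts[k]
            term = np.where((c >= threshold) & local_maxima(c), c / (k + 1.0), 0.0)
            energy = term if energy is None else term + expand_level(energy)
        return energy

    # `levels` and expand_level come from the previous sketch.
    emap = energy_map(levels)
    key_point_map = local_maxima(emap) & (emap > 0)   # binary map as in Figure 2d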
3
Discussion
In this section, we will compare our detector to the improved Harris detector proposed by Schmid in [Sch 97] and to the SUSAN detector proposed in [Smi 97](2). The Harris detector has proved to be very robust with respect to many perturbations such as image rotations, scale changes, and variations of illumination. The SUSAN detector is more appropriate for corner detection and is not strictly related to an a priori differential model.
(2) These operators can be interactively executed at http://rfv.insa-lyon.fr/~jolion. This site also proposes a larger version of this paper as well as other reports related to this topic.
3.1 What Are We Detecting?
The first point we want to look at is the characterization of these new key points. Indeed, we do not use any a priori model, so our approach is less constrained than the differential one, but what can we say about these points? Figure 3 is a classic synthetic image from the French GDR ISIS database. The parameters of the three detectors were tuned in order to get the four corners of the dark rectangle in the lower left part of the image. The results obtained for the Harris detector clearly show that it does not find any vertical or horizontal edges, as predicted by the theory (since it looks for the minimum value of local changes over all directions). However, it suffers from localization defects, as shown in [Smi 97]. The three detectors correctly extract two interest points on the shape located in the upper right part of the image. On the disks, the behaviors are different. The SUSAN detector extracts only points located on the interior disk, Harris extracts points located on both disks (which seems more appropriate) and our detector's result looks like part of the edge points. What can we say about this part of the image? One important point is that our detector seems to be less sensitive to discrete configurations thanks to the multiresolution approach (is there any way to distinguish one point from another on a disk, except that a discrete disk is not a "real" disk?). However, in that case, we get the edge points (not all the points, because of the non-maxima suppression), which is not more appropriate than only a small set of points extracted based on discrete geometric properties. This shows the limitations of this approach to image characterization based on key points. The figure in the lower left part of the image is equivalently characterized by both the Harris and the SUSAN detectors, i.e. by the corners. In our case, we again get more points, located on the edges. These multiple responses are due to the multiresolution behavior of the shape, i.e. in low resolution, both shapes (the rectangle and the triangle) interact. It can be shown, with the same threshold value, that the detector extracts only the four corners of the rectangle if the analysis is limited to the lower levels of the pyramid (0, or 0 and 1). More generally speaking, our detector is not based on any a priori model. That is why it extracts both local corner and edge configurations. However, a corner response is greater than an edge response for a given local variation. The behavior of the detectors on the non-geometric shape in the lower right part of the image is more complex. The Harris detector clearly suffers from delocalization and misses a part of the shape. The others have more similar behaviors; ours seems to extract more details than the SUSAN detector. The localizations of the interest points are good for the three detectors, which was not always the case for the purely geometric and synthetic parts of this image. We will not present more results on non-synthetic images because we do not have a clear ground truth for these images in order to compare, and any result would obviously be subjective and qualitative. We prefer to focus on a particular property the detectors should have.
Fig. 3. Key points extracted on a synthetic image by (a) the Harris detector, (b) the SUSAN detector, (c) our detector.
3.2 Robustness Regarding Coding Noise
When working on visual image management, one has to deal with a new kind of image noise: coding noise. Indeed, most of the images we work with are coded using classic schemes such as JPEG. This coding is very attractive because it produces compact codes and guarantees visually pleasing outputs. However, what is good for the human eye is not always good for a computer algorithm. That is why we are interested in the robustness of our tools when working on coded images. In the case of the key point detectors, we compare the map I corresponding to our image to the map J corresponding to the same image after it has been JPEG encoded and decoded with a given quality between 0 and 100%. In order to compare the two key point maps (the initial one and the JPEG one), we use a classic measure, the figure of merit introduced by Pratt [Pra 78]. This measure takes into account both the variation in the number of key points and their delocalization. Figure 4a shows the result of this experiment on the image of figure 4b. The parameters of the detectors were tuned so as to extract similar numbers of interest points. The Harris detector and ours have broadly similar behaviors; the former is more stable for very low JPEG qualities, while ours gives better results for qualities greater than 75% (which are the ones used in practical applications). Note that even for a 100% quality the corresponding maps are not identical, because even at maximum quality the encoding/decoding process introduces quantization errors, i.e., gray levels differing by ±1. The result we obtained for the SUSAN detector is poor compared to the others. This is due to its model: the detector is based on a distance between grey levels of neighboring pixels. This tool is quite simple, but the underlying idea (similar points are points with very similar grey levels) is not robust under distortions such as those resulting from JPEG coding (block truncation effects, frequency distortions, etc.). It is thus not surprising that this detector is robust only for high-quality coding.
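As an illustration, the comparison of two key point maps with Pratt's figure of merit can be sketched as follows. This is a minimal Python/NumPy sketch under stated assumptions, not the authors' implementation; the function name and the choice of the scaling constant alpha (the classic 1/9) are ours.

```python
import numpy as np

def figure_of_merit(reference_points, detected_points, alpha=1.0 / 9.0):
    """Pratt-style figure of merit between two key point maps.
    Each argument is an (N, 2) array of (x, y) key point coordinates.
    Returns a value in (0, 1]; 1 means the maps coincide."""
    reference_points = np.asarray(reference_points, dtype=float)
    detected_points = np.asarray(detected_points, dtype=float)
    n_max = max(len(reference_points), len(detected_points))
    if n_max == 0:
        return 1.0
    total = 0.0
    for p in detected_points:
        # squared distance from the detected point to the closest reference point
        d2 = ((reference_points - p) ** 2).sum(axis=1).min()
        total += 1.0 / (1.0 + alpha * d2)
    return total / n_max
```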
We also tested these detectors under other distortions (without any statistical protocol, only several tests on classic images):
– additive impulse noise: the SUSAN detector is the most robust, as it is optimized for this kind of noise; the worst is the Harris detector, because of its heavy reliance on derivatives;
– position of the camera relative to the scene: our detector is the best (with Harris very close); the worst is the SUSAN detector.
Fig. 4. (a) Robustness of the SUSAN detector (✷), the Harris detector (•) and ours (×) with respect to the noise introduced by JPEG coding, as a function of coding quality. (b) The image used for the robustness test. This image is free of coding effects.
References
[Aig 96] P. Aigrain, H. Zhang & D. Petkovic (1996) Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review, Multimedia Tools and Applications, 1996.
[Fli 95] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele & P. Yanker (1995) Query by image and video content: The QBIC system, IEEE Computer special issue on content-based picture retrieval systems, 28(9), 23-32.
[Jol 93] J.M. Jolion & A. Rosenfeld (1993) A Pyramid Framework for Early Vision, Kluwer Academic Press.
[Jol 94] J.M. Jolion (1994) Multiresolution Analysis of Contrast in Digital Images (in French), Traitement du Signal, 11(3), 245-255.
[Pra 78] W.K. Pratt (1978) Digital Image Processing, New York, Wiley-Interscience.
[Sch 97] C. Schmid & R. Mohr (1997) Local Grayvalue Invariants for Image Retrieval, IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(5), 530-535.
[Smi 97] S.M. Smith & J.M. Brady (1997) SUSAN - A New Approach to Low Level Image Processing, Int. Journal of Computer Vision, 23(1), 45-78.
Highly Discriminative Invariant Features for Image Matching

Ronald Alferez and Yuan-Fang Wang

Department of Computer Science, University of California, Santa Barbara, CA 93106
{ronald,yfwang}@cs.ucsb.edu

Abstract. In this paper, we present novel image-derived, invariant features that accurately capture both the geometric and color properties of an imaged object. These features can distinguish between objects that have the same general appearance (e.g., different kinds of fish), in addition to the typical task of distinguishing objects from different classes (e.g., fish vs. airplanes). Furthermore, these image features are insensitive to changes in an object's appearance due to rigid-body motion, affine shape deformation, changes of parameterization, perspective distortion, viewpoint change and changes in scene illumination. The new features are readily applicable to searching large image databases for specific images. We present experimental results to demonstrate the validity of the approach, which is robust and tolerant to noise.
1 Introduction
The advent of high-speed networks and inexpensive storage devices makes the construction of large image databases feasible. More and more images are now stored in electronic archives. In line with this, however, is the need for tools to help the user browse and retrieve database images efficiently and effectively. Most existing image indexing and retrieval systems, such as Virage [4], QBIC [5], and Photobook [6], are able to do between-classes retrieval. That is, they can distinguish between images of different classes. For example, an image of a fish as a query retrieves a list of images in the database containing an image similar to a fish (the query and the generated results are classified as belonging to the same class of objects). Images that belong to other classes, such as airplanes, are appropriately excluded from the list. However, these systems do not allow the user to retrieve images that are more specific. In other words, they are unable to perform within-a-class retrieval. For example, the user may want to retrieve all images of rainbow trout (characterized by the number and location of fins, and by the color of their body). Current systems will likely fail with this query, generating lists of images containing various species of fish. The problem is that a rainbow trout appears very similar to other species of fish, and the features adopted by current systems are not descriptive enough to handle this type of scenario. Hence, there is a need for a system that enables within-a-class retrieval, which discriminates between images within the same class of objects. In addition, environmental changes such as an object's pose and lighting should not be a factor in measuring similarity.
To perform within-a-class retrieval in image databases, the system should be able to discriminate between imaged objects that have very similar appearance. The key to building such a system is designing powerful, highly discriminative image features that can capture small variations among objects. These variations, however, should not include changes that are not intrinsic to an object, so that an object that is stretched, for example, should not be distinguished from its original form. Many digital library applications will find within-a-class retrieval particularly useful. Potential scenarios include searching for fish in an aquarium database, leaves and flowers in a botanical image database, and logos in a catalog. Despite the similar appearance of objects within each of these databases, and despite possible changes in pose and scene illumination, our new image features should be able to discriminate between different imaged objects within a database, while correctly matching the same ones. Our contribution is in developing novel image-derived features that enable both between-classes and within-a-class retrievals. Not only do the new features discriminate between imaged objects that look very different, they can also distinguish between imaged objects with very similar appearance. Furthermore, these image features are insensitive to environmental changes such as rigid-body motion, affine shape deformation, changes of parameterization, perspective distortion, viewpoint change and changes in scene illumination. These image features can be applied to image indexing, search and retrieval for large image databases, where high accuracy and environmental insensitivity are issues. Although segmentation (contour extraction) is not addressed, our strategy still has many practical applications, particularly when there is absolute control of the image database (e.g., when the database is a collection of imaged objects photographed against an uncluttered background, such as catalogs), and the object of interest in the query image is pinpointed (or drawn) by a human. We propose invariant features that capture only the essential traits of an image, forming a compact and intrinsic description of an imaged object. Environmental factors such as pose and illumination are ignored. Hence, the approach is more efficient than, say, aspect-based approaches where multiple aspects of the same model have to be remembered. The new invariant features analyze the shape of the object's contour as well as the color characteristics of the enclosed area. The analysis involves projecting the shape or color information onto one of many basis functions of finite, local support (e.g., wavelets, short-time Fourier analysis, and splines). Invariance of the descriptors is achieved by incorporating the projection coefficients into formulations that cancel out many environmental factors. The invariant features produced by the new framework are insensitive to rigid motion, affine shape deformation, changes of parameterization and scene illumination, and/or perspective distortion. Furthermore, they enable a quasi-localized, hierarchical shape and color analysis, which allows for the examination of information at multiple resolution scales. The result is an invariant framework which is more flexible and tolerant to a relatively large degree of noise. Excellent reviews on invariants are presented in [7,8].
2 Technical Rationale
We will illustrate the design of invariant image features using a specific scenario where invariants for curves are sought. For shape invariants, these apply directly to the silhouette (contour) of imaged objects in a database. For illumination invariants, the same technique applies by linearizing internal regions by a characteristic sampling curve and computing invariant color signatures along the characteristic curve. In both cases, the invariant signatures produced can be examined at different resolution scales, making the invariant features both flexible and noise tolerant. The particular basis functions we use in the illustration are the wavelet bases and spline functions. However, the same framework can easily be extended to other bases and to 3D surfaces.

Affine Invariant Parameterization We first look at the problem of point correspondence when attempting to match two curves (or contours) under an affine transformation. For each point selected from one curve, the corresponding point on the other curve has to be properly identified. In defining parameterized curves $c(t) = [x(t), y(t)]^T$, the traditional arc length parameter $t$ is not suitable because it does not transform linearly (i.e., it is not invariant) under an affine transformation. Two parameterizations which do are described in [2]: (1) the affine arc length, defined as $\tau = \int_a^b \sqrt[3]{\dot{x}\ddot{y} - \ddot{x}\dot{y}}\, dt$, where $\dot{x}, \dot{y}$ are the first and $\ddot{x}, \ddot{y}$ the second derivatives with respect to any parameter $t$ (possibly the intrinsic arc length); and (2) the enclosed area parameter, $\sigma = \frac{1}{2}\int_a^b |x\dot{y} - y\dot{x}|\, dt$, which is the area of the triangular region enclosed by the two line segments from the centroid to two contour points $a$ and $b$. Seemingly, a common origin and traversal direction on the contour must also be established. However, it can easily be shown that a difference of starting points is just a phase shift between the invariant signatures of two contours. Similarly, two contours parameterized in opposing directions are just flipped, mirror images of each other. Hence, a match can be chosen that maximizes the cross-correlation between the two signatures. This, together with the use of an affine invariant parameterization, implies that no point correspondence is required when computing the affine invariants of an object's contour.

Rigid Motion and Affine Transform Consider a 2D curve $c(t) = [x(t), y(t)]^T$, where $t$ denotes a parameterization which is invariant under affine transform, and its expansion onto the wavelet basis $\psi_{a,b}(t) = \frac{1}{\sqrt{a}} g\!\left(\frac{t-b}{a}\right)$ [3] as $u_{a,b} = \int c\,\psi_{a,b}\, dt$. If the curve is allowed a general affine transform, we have $c'(t) = m\,c(\pm t + t_0) + \mathbf{t}$, where $m$ is any nonsingular $2 \times 2$ matrix, $\mathbf{t}$ is the translational motion, $t_0$ represents a change of the origin in traversal, and $\pm$ represents the possibility of traversing the curve either counterclockwise or clockwise.¹ It follows that:
$$u'_{a,b} = \int c'\,\psi_{a,b}\, dt = \int \left( m\,c(\pm t + t_0) + \mathbf{t} \right) \psi_{a,b}\, dt = m \int c(t')\, \tfrac{1}{\sqrt{a}}\, g\!\left(\tfrac{\mp(t'-t_0)-b}{a}\right) dt' + \int \mathbf{t}\, \psi_{a,b}\, dt = m \int c(t')\, \tfrac{1}{\sqrt{a}}\, g\!\left(\tfrac{t'-(\pm b + t_0)}{a}\right) dt' = m \int c(t')\, \psi_{a,\pm b + t_0}(t')\, dt' = m\, u_{a,\pm b + t_0}. \qquad (1)$$
¹ In the implementation, the parameter is computed modularly over a closed contour.
Note that we use the wavelet property $\int \psi_{a,b}\, dt = 0$ to simplify the second term in Eq. 1. If $m$ represents a rotation (or the affine transform is a rigid motion of a translation plus a rotation), it is easily seen that an invariant expression (this is just one of many possibilities) can be derived using the ratio expression

$$\frac{|u'_{a,b}|}{|u'_{c,d}|} = \frac{|m\, u_{a,\pm b + t_0}|}{|m\, u_{c,\pm d + t_0}|} = \frac{|u_{a,\pm b + t_0}|}{|u_{c,\pm d + t_0}|}. \qquad (2)$$
The wavelet coefficients $u'_{a,b}$ and $u_{a,\pm b + t_0}$ are functions of the scale $a$ and the displacements $b$ and $\pm b + t_0$. If we fix the scale $a$, by taking the same number of sample points in each curve, we can construct expressions based on correlation coefficients to cancel out the effect of a different traversal starting point ($t_0$) and direction ($\pm t$). Let us define the invariant signature of an object, $f_a(x)$, as:

$$f_a(x) = \frac{|u_{a,x}|}{|u_{a,x+x_0}|}, \quad \text{and} \quad f'_a(x) = \frac{|u'_{a,x}|}{|u'_{a,x+x_0}|} = \frac{|u_{a,\pm x + t_0}|}{|u_{a,\pm(x+x_0)+t_0}|} \qquad (3)$$
where $x_0$ represents a constant value separating the two indices. Then one can easily verify that when the direction of traversal is the same for both contours, $f'_a(x) = \frac{|u_{a,x+t_0}|}{|u_{a,x+x_0+t_0}|} = f_a(x + t_0)$. If the directions are opposite, then $f'_a(x) = \frac{|u_{a,-x+t_0}|}{|u_{a,-x-x_0+t_0}|} = \frac{1}{f_a(-x-x_0+t_0)}$. The correlation coefficient of two signals is defined as $R_{f \cdot g}(\tau) = \frac{\int f(x)\, g(x+\tau)\, dx}{\|f(x)\|\,\|g(x)\|}$. We define the invariant (similarity) measure $I_a(f, f')$ between two objects as

$$I_a(f, f') = \max_{\tau, \tau'} \left\{ R_{f_a(x)\, f'_a(x)}(\tau),\; R_{f_a(x)\, \frac{1}{f'_a(-x)}}(\tau') \right\}. \qquad (4)$$
It can be shown [1] that the invariant measure in Eq. 4 attains the maximum of 1 if two objects are identical but differ in position, orientation, and/or scale. Other invariant features may still be derived where the same technique can be employed to measure similarity, making it independent of the parameterization used. For simplicity, we only show the invariant expressions from this point on. It is known that the area of the triangle formed by any three $u_{a,b}$ changes linearly in an affine transform [7]. Hence, we have the following invariants:²

$$\frac{\begin{vmatrix} u'_{a,b} & u'_{c,d} & u'_{e,f} \\ 1 & 1 & 1 \end{vmatrix}}{\begin{vmatrix} u'_{g,h} & u'_{i,j} & u'_{k,l} \\ 1 & 1 & 1 \end{vmatrix}} = \frac{\begin{vmatrix} u_{a,\pm b+t_0} & u_{c,\pm d+t_0} & u_{e,\pm f+t_0} \\ 1 & 1 & 1 \end{vmatrix}}{\begin{vmatrix} u_{g,\pm h+t_0} & u_{i,\pm j+t_0} & u_{k,\pm l+t_0} \\ 1 & 1 & 1 \end{vmatrix}}. \qquad (5)$$
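To make the construction above concrete, the following Python/NumPy sketch computes the signature of Eq. (3) at a fixed scale and a correlation-based similarity in the spirit of Eq. (4). It is a minimal illustration under stated assumptions, not the authors' implementation: the Mexican-hat mother wavelet, the uniform sampling of the contour in an affine-invariant parameterization, and all function names are our own choices.

```python
import numpy as np

def wavelet_coeffs(contour, scale, g=None):
    """Project a closed 2D contour c(t) (N x 2 array) onto shifted copies of a
    mother wavelet g at a fixed scale, giving u_{a,b} for b = 0..N-1.
    The contour is assumed to be sampled uniformly in an affine-invariant
    parameterization (e.g. the enclosed-area parameter)."""
    contour = np.asarray(contour, dtype=float)
    n = len(contour)
    t = np.arange(n)
    if g is None:
        # Mexican-hat wavelet (second derivative of a Gaussian), zero mean
        g = lambda x: (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)
    coeffs = np.empty((n, 2))
    for b in range(n):
        # circular shift: the parameter is computed modularly over the closed contour
        psi = g((((t - b + n // 2) % n) - n // 2) / scale) / np.sqrt(scale)
        psi -= psi.mean()                      # enforce zero integral numerically
        coeffs[b] = psi @ contour              # u_{a,b}, a 2-vector
    return coeffs

def invariant_signature(contour, scale, x0=5):
    """f_a(x) = |u_{a,x}| / |u_{a,x+x0}|, cf. Eq. (3)."""
    u = wavelet_coeffs(contour, scale)
    mag = np.linalg.norm(u, axis=1) + 1e-12
    return mag / np.roll(mag, -x0)

def similarity(f, f2):
    """Max over circular shifts of the normalized correlation, also testing the
    reversed/inverted signature for opposite traversal (cf. Eq. 4)."""
    def corr(a, b):
        num = np.array([np.dot(a, np.roll(b, s)) for s in range(len(a))])
        return num.max() / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(corr(f, f2), corr(f, 1.0 / f2[::-1]))
```

The signatures of two contours that differ by a rotation, scaling and change of starting point should then produce a similarity close to 1 under this sketch.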
Perspective Transform Allowing an arbitrary viewpoint and large perspective distortion makes the problem much harder, as the projection is a non-linear process involving a division when computing 2D coordinates. Extending the curve to 3D makes it even more difficult. A simplified model is possible, using a parallel or quasi-perspective (affine) model, but this holds only to a certain degree under a small perspective distortion. We provide a more rigorous treatment of perspective invariants. The projection process can be linearized using a tool which is well known in computer graphics: the rational form of a basis function.
² Some may require a smaller number of coefficients. For example, for wavelet bases where $\int \psi_{a,b}\, dt = 0$, Eq. 5 can be simplified so that only four coefficients are used.
We will use NURBS (Non-Uniform Rational B-Splines) for illustration. The rational form of a B-spline function in 2D (3D) is the projection of a non-rational B-spline function in 3D (4D). Specifically, let $C(t) = [X(t), Y(t), Z(t)]^T = \sum_i P_i N_{i,k}(t)$ be a non-rational curve in 3D, where the $P_i$'s are its control vertices and the $N_{i,k}(t)$ are the non-rational spline bases. Its projection in 2D will be:

$$c(t) = \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \begin{bmatrix} \frac{X(t)}{Z(t)} \\ \frac{Y(t)}{Z(t)} \end{bmatrix} = \sum_i p_i R_{i,k}(t), \quad \text{where} \quad R_{i,k}(t) = \frac{Z_i N_{i,k}(t)}{\sum_j Z_j N_{j,k}(t)}, \qquad (6)$$
and the $p_i$'s are the projected control vertices in 2D, and the $R_{i,k}$ are the rational bases. We can now formulate the problem of finding perspective invariants as a curve fitting problem. Intuitively, if a 2D curve results from the projection of a 3D curve, then it should be possible to interpolate the observed 2D curve using the projected control vertices and the rational spline bases and obtain a good fit. If that is not the case, then the curve probably does not come from the projection of the particular 3D curve. Hence, the error in curve fitting is a measure of invariance. (Ideally, the error should be zero.) Perspective projection produces:

$$p_i = \begin{bmatrix} \frac{X_i}{Z_i} \\ \frac{Y_i}{Z_i} \end{bmatrix} = \begin{bmatrix} \frac{r_{11} X_i + r_{12} Y_i + r_{13} Z_i + T_x}{r_{31} X_i + r_{32} Y_i + r_{33} Z_i + T_z} \\ \frac{r_{21} X_i + r_{22} Y_i + r_{23} Z_i + T_y}{r_{31} X_i + r_{32} Y_i + r_{33} Z_i + T_z} \end{bmatrix} \qquad (7)$$

$$R_{i,k} = \frac{(r_{31} X_i + r_{32} Y_i + r_{33} Z_i + T_z)\, N_{i,k}(t)}{\sum_j (r_{31} X_j + r_{32} Y_j + r_{33} Z_j + T_z)\, N_{j,k}(t)}, \qquad (8)$$
where the $r_{ij}$'s and $T_i$'s are the rotation and translation parameters, respectively. The image invariant defined by the goodness of fit is $I = \sum_t \left( d(t) - \sum_i p_i R_{i,k}(t) \right)^2$, where $d(t)$ denotes the distorted image curve. Note that in Eq. 6 the shape of a 2D curve is determined by the projected control vertices and the rational spline bases, both of which are unknown. By using rational bases, our approach minimizes $I$ by a two-step gradient descent which maintains the linearity of the whole formulation and drastically reduces the search effort. We first assume that all $Z_i$'s are equal, which is equivalent to approximating the rational bases by the corresponding non-rational bases. This allows us to estimate the 2D control vertex positions. Affine invariant parameters can be used as an initial estimate for point correspondence, which will be adjusted in succeeding steps to account for perspective foreshortening. Observe that $dI = \sum_i \left( \frac{\partial I}{\partial p_i}\, dp_i + \frac{\partial I}{\partial R_{i,k}}\, dR_{i,k} \right)$, suggesting that the minimization can be broken into two stages: (1) updating the 2D control vertex positions ($dp_i$); and (2) updating the rational bases ($dR_{i,k}$). The estimated 2D control vertex positions are used to constrain the unknown rotation and translation parameters using Eq. 7. A linear formulation results using at least six 2D control vertices estimated from Eq. 6. (For a planar 3D curve, four 2D control vertex positions suffice.) The motion parameters allow the $R_{i,k}$'s to be updated using Eq. 8. The updated $R_{i,k}$'s allow a better prediction of the appearance of the curve in images, and any discrepancy between the predicted and actual appearance of the curve is used in a gradient search to further verify the consistency. The prediction involves updating the parameterization $t$ and the 2D control vertex positions $p_i$, which are then used to estimate the unknown motion parameters through Eq. 7.
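For readers who want to experiment with Eq. (6), the sketch below evaluates rational B-spline bases from non-rational ones via the Cox-de Boor recursion and projects a curve from 2D control vertices. It is an illustrative Python/NumPy sketch under assumed conventions (order-k bases, a knot vector supplied by the caller, parameters inside the valid knot span, and our own function names); it is not the authors' code, and the iterative fitting of Eqs. (7)-(8) is not shown.

```python
import numpy as np

def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for the non-rational B-spline basis N_{i,k}(t)
    (order k, i.e. degree k-1); requires len(knots) >= number_of_vertices + k."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k - 1] - knots[i]
    d2 = knots[i + k] - knots[i + 1]
    if d1 > 0:
        left = (t - knots[i]) / d1 * bspline_basis(i, k - 1, t, knots)
    if d2 > 0:
        right = (knots[i + k] - t) / d2 * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

def rational_basis(t, k, knots, Z):
    """Rational bases R_{i,k}(t) of Eq. 6: Z_i N_{i,k}(t) / sum_j Z_j N_{j,k}(t)."""
    Z = np.asarray(Z, dtype=float)
    N = np.array([bspline_basis(i, k, t, knots) for i in range(len(Z))])
    w = Z * N
    return w / w.sum()   # t is assumed to lie strictly inside the valid knot span

def project_curve(t_samples, k, knots, Z, p2d):
    """Evaluate c(t) = sum_i p_i R_{i,k}(t) for 2D projected control vertices p2d."""
    p2d = np.asarray(p2d, dtype=float)
    return np.array([rational_basis(t, k, knots, Z) @ p2d for t in t_samples])
```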
Hence, a recursive process results which refines the positions of the 2D control vertices, the shapes of the rational spline functions, the parameterization, and the 3D motion parameters, until convergence is achieved.

Variation in Lighting Condition We now consider the case when the imaged objects are illuminated by light sources of different numbers, positions, and types. For simplicity, we will consider three spectral bands of red, green, and blue; generalizing to an $n$-band illumination model is straightforward. Assuming two 2D images differ only by scene illumination (i.e., no geometrical changes), we can linearize interesting (or important) 2D regions by well-known techniques. We can then treat the problem as an illumination invariance problem for points along a curve. In addition, we can include the affine or perspective case, to produce an invariant which is insensitive to both geometric (affine or perspective) and illumination changes. By solving for the deformation and translation parameters from the affine or perspective invariants, we can reconstruct the same transformation for any point or curve between two images. Hence, any curve constructed from one image can be matched, point by point, to its corresponding curve in the transformed image. Illumination invariants for curves can then be applied to verify whether the two image regions, as defined by the curves, are the same. Let $L(t)$ denote the perceived image color distribution along a curve. We have $L(t) = [r(t), g(t), b(t)]^T = \int [f^r(\lambda), f^g(\lambda), f^b(\lambda)]^T s(\lambda, t)\, d\lambda$, where $\lambda$ denotes the wavelength and $f^r(\lambda)$ the sensitivity of the red sensor (similarly for the green and blue channels). We assume a Lambertian model, and that the reflected radiance functions $s(\lambda, t)$ are modeled as a linear combination of a small number of basis functions $s_k(\lambda)$, whence $s(\lambda, t) = \sum_k \alpha_k(t) s_k(\lambda)$, where $s_k(\lambda)$ denotes the $k$-th basis function for representing the reflected radiance properties, and $\alpha_k(t)$ are the space-varying expansion coefficients. Then, using an analysis similar to that employed in the affine case, we have
$$u_{a,b} = \int L\,\psi_{a,b}\, dt = \begin{bmatrix} L^r_1 & L^r_2 & \cdots & L^r_k \\ L^g_1 & L^g_2 & \cdots & L^g_k \\ L^b_1 & L^b_2 & \cdots & L^b_k \end{bmatrix} \begin{bmatrix} v^1_{a,b} \\ \cdots \\ v^k_{a,b} \end{bmatrix} = L_{rgb}\, v_{a,b},$$

where

$$\begin{bmatrix} L^r_k \\ L^g_k \\ L^b_k \end{bmatrix} = \int_\lambda \begin{bmatrix} f^r(\lambda) s_k(\lambda) \\ f^g(\lambda) s_k(\lambda) \\ f^b(\lambda) s_k(\lambda) \end{bmatrix} d\lambda \quad \text{and} \quad v^k_{a,b} = \int_t \alpha_k(t)\, \psi_{a,b}\, dt\,.$$

Similarly,

$$u'_{a,b} = \begin{bmatrix} L'^r_1 & L'^r_2 & \cdots & L'^r_k \\ L'^g_1 & L'^g_2 & \cdots & L'^g_k \\ L'^b_1 & L'^b_2 & \cdots & L'^b_k \end{bmatrix} \begin{bmatrix} v^1_{a,\pm b+t_0} \\ \cdots \\ v^k_{a,\pm b+t_0} \end{bmatrix} = (L'_{rgb})(v_{a,\pm b+t_0})\,.$$

Then it is easily shown that the following expression is invariant under different lighting conditions (similar to Eq. 5):

$$\frac{\left| \left[ u'_{a_1,b_1} \cdots u'_{a_k,b_k} \right] \left[ u'_{a_1,b_1} \cdots u'_{a_k,b_k} \right]^T \right|}{\left| \left[ u'_{c_1,d_1} \cdots u'_{c_k,d_k} \right] \left[ u'_{c_1,d_1} \cdots u'_{c_k,d_k} \right]^T \right|} = \frac{\left| \left[ u_{a_1,\pm b_1+t_0} \cdots u_{a_k,\pm b_k+t_0} \right] \left[ u_{a_1,\pm b_1+t_0} \cdots u_{a_k,\pm b_k+t_0} \right]^T \right|}{\left| \left[ u_{c_1,\pm d_1+t_0} \cdots u_{c_k,\pm d_k+t_0} \right] \left[ u_{c_1,\pm d_1+t_0} \cdots u_{c_k,\pm d_k+t_0} \right]^T \right|} \qquad (9)$$
Fig. 1. (a) Original image, (b) deformed image, (c) extracted original (solid) and deformed (dashed) patterns, and (d) the invariant signatures plotted along the contours.
3 Experimental Results
We conducted various experiments to test the validity of the new invariant features. Each experiment was isolated, individually examining the performance of each image feature. However, the features can potentially be combined to make a powerful image retrieval system that can do within-a-class retrieval.

General Affine Transform with Change of Parameterization Fig. 1 shows (a) a shirt with a dolphin imprint and (b) a deformed version of the same imprint (an affine transformation). The extracted patterns are shown in (c). The second-order B-spline function of a uniform knot vector was used in the basis expansion. The invariant signatures shown in (d), which were aligned by maximizing the cross-correlation, are clearly quite consistent.

Perspective Transform Our formulation, though recursive in nature, is nonetheless linear and achieves fast convergence in our preliminary experiments. The number of iterations needed to verify the invariance was small (about 3 to 4) even for large perspective distortion. In Fig. 2, (a) shows the canonical view of a curve embedded on a curved surface (a cylindrical pail) and (b) another perspective. We extracted the silhouette of the car from both images, and the depth values for the silhouette in the canonical view were computed.
Fig. 2. (a) Canonical view, (b) another perspective, (c) 2D image curve (solid) and the curve derived with perspective invariant fitting (dashed), and (d) their shape signatures.
The curve fit and the invariant signature (after five iterations) thus computed are displayed in Figs. 2(c) and (d), respectively. Our invariance framework produces consistent results for general, non-planar 3D curves, all with a small number of iterations.

Change of Illumination To illustrate the correctness of the invariance formulation under illumination changes, we placed different color filters in front of the light sources used to illuminate the scene and verified the similarity of the illumination invariant signatures. Fig. 3 shows the same cookbook cover under (a) white and (b) red illumination. For simplicity, we randomly defined two circular curves (indicated by the red and green circles) and computed the invariant signatures along these two curves under white and red illumination. It should be noted that the particular example we show here only serves to demonstrate the correctness of the framework; in real applications, we can linearize the image to obtain an invariant signature for the whole image. The invariant profiles computed from the white (solid) and red (dashed) illumination are shown in Fig. 3(c) for the curve defined by the red circle and (d) for the curve defined by the green circle. As can be seen from the figure, the signatures are quite consistent.
Fig. 3. The same cookbook cover under (a) white and (b) red illumination, and the invariant signatures computed under white (solid) and red (dashed) illumination (c) along the red circle and (d) along the green circle.
Fig. 4. Invariant shape descriptors for the original (solid) and deformed, noise-corrupted shapes (dashed) at different scales.

Hierarchical Invariant Analysis The additional degree of freedom in designing the basis function enables a hierarchical shape analysis. Fig. 4(a) shows the
original and noise-corrupted shapes. As shown in Fig. 4(b)-(d), our approach, which analyzes the shape locally at different scales, will eventually discover the similarity, even though the similarity may manifest itself at different levels of detail. In this case, scale 8 produces more consistent signatures than the others.

Future Work The performance of each image feature is very encouraging, prompting us to combine these image features into a powerful image retrieval system that can do within-a-class retrieval. Results will be presented in a future paper. Applications include searching through specialized image databases which contain imaged objects with very similar appearance (e.g., botanical databases and aquarium databases). In fact, these features have already been applied to object recognition experiments where perspective distortion, color variation, noise, and occlusion were all present [1]. In that experiment, the database comprised different models of airplanes, many of which had the same general shape. Perfect recognition was achieved for that particular database and test images.
4 Conclusion
We presented a new framework for computing image-derived, invariant features, ideal for image indexing and retrieval. These features provide high discriminative power and are insensitive to many environmental changes. Preliminary results show promise as a useful tool for searching image databases.
References
1. R. Alferez and Y.F. Wang. Geometric and Illumination Invariants for Object Recognition. IEEE Trans. Pattern Analy. Machine Intell. To appear as a regular paper.
2. K. Arbter, W.E. Snyder, H. Burkhardt, and G. Hirzinger. Application of Affine-Invariant Fourier Descriptors to Recognition of 3-D Objects. IEEE Trans. Pattern Analy. Machine Intell., 12:640-647, 1990.
3. I. Daubechies. Orthonormal Bases of Compactly Supported Wavelets. Commun. Pure Appl. Math., 41:909-960, 1988.
4. Hampapur et al. Virage Video Engine. Proc. of SPIE, Storage and Retrieval for Image and Video Databases V, 3022:188-200, 1997.
5. M. Flickner et al. Query by Image and Video Content: The QBIC System. IEEE Comput., pages 23-32, September 1995.
6. A. Pentland, R.W. Picard, and S. Sclaroff. Photobook: Tools for Content-Based Manipulation of Image Databases. Int. J. Comput. Vision, 18(3):233-254, 1996.
7. T.H. Reiss. Recognizing Planar Objects Using Invariant Image Features. Springer-Verlag, Berlin, 1993.
8. I. Weiss. Geometric Invariants and Object Recognition. Int. J. Comput. Vision, 10(3):207-231, 1993.
Image Retrieval Using Schwarz Representation of One-Dimensional Feature

Xianfeng Ding, Weixing Kong, Changbo Hu, and Songde Ma

National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing, P.R. China
{xfding,wxkong,cbhu,masd}@NLPR.ia.ac.cn
Abstract. Retrieval efficiency and accuracy are two important issues in designing a content-based database retrieval system. In order to retrieve efficiently, we must extract features to build an index. Recently, intensive research has focused on how to extract one-dimensional features and calculate the distance between them, such as the color histogram, the Fourier descriptor, and the image shape spectrum (ISS). We develop a new method to match one-dimensional feature functions in multiscale space using the Schwarz representation. It obtains a closed-form match function and similarity measure instead of relying on traditional optimization. Thus we can calculate the global distance while the local information of the feature functions is matched. In this paper, we use the center distance function of the shape as the feature function. We calculate its Schwarz representation as the index, and calculate the optimal distance as the similarity measure to sort the images. Experimental results show the method's efficiency and accuracy.
1 Introduction

Recent work on content-based image retrieval (CBIR) has shown increasing interest in developing methods capable of retrieving from image databases efficiently and accurately. In order to achieve the desired efficiency and accuracy, simple and easily computed indices must be built into the system. Considering the computational complexity, many researchers extract one-dimensional features of the image as the index, such as the color histogram, the Fourier descriptor, the ISS [13], and so on. Shape is an essential feature of an object. It can be used in retrieval to enhance efficiency and accuracy. Shape representation and matching are two crucial problems. Traditionally, Freeman chain codes, Fourier descriptors, conics and B-splines were used to describe planar curves [3]. T. Boult et al. used superquadrics to represent curves [4]. G. Chuang et al. [5] proposed a wavelet approximation representation. In the area of shape matching, the Hough transform [6] is classical and performs very well in the presence of heavy noise and occlusion, but it cannot handle the matching of the widely occurring deformable shapes. Some authors use deformable models to solve this problem. In real applications, considering space and time consumption, people prefer simple methods to represent and match the shape, such as a shape
factor: some quantity measure (matrix, area, perimeter, etc.). Jagadish [9] proposed an algorithm to represent a shape by a set of rectangles, so that every shape can be mapped into a point in a high-dimensional space and various point access methods (PAM) can be used in shape retrieval. In IBM's QBIC, a similar method was used to represent and retrieve shapes. Some other features are also used; for example, C. Nastar extracted the image shape spectrum (ISS) [13] as the index of an image database. All of these methods concentrate on how to extract one-dimensional features to decrease the computational complexity. How to calculate the similarity measure between one-dimensional features efficiently and effectively is still under discussion. Some researchers consider the global information between features and then use the most natural similarity metrics, such as the Euclidean distance [13]; many similarity measures are discussed in [15]. Some researchers use moments to calculate the similarity. Others exploit local information, but they only match the local peaks of the feature functions [14,16,10]. For example, Mokhtarian et al. [10] extract the maxima of a shape in curvature scale space, then calculate the distance between the index and the query model. None of these methods considers the global distance and the local information at the same time. In this paper we introduce a method to express feature functions in multiscale space, from which we can get a match function indicating the mapping between the index and the query model. We use this mapping to calculate the similarity measure between features. The global distance between the index and the query model is obtained once the local information has been matched. We call this distance the "optimal similarity measure"; with it we can obtain more accurate retrieval results. And because the match function between feature functions is obtained in closed form, the computational complexity is very low. In section 2 we discuss how to match two one-dimensional feature functions using the Schwarz representation. The index building process is discussed in section 3. Finally, experimental results are presented in section 4, along with a discussion of the merits of our approach compared to those in the literature.
2 Match Using Schwarz Representation
The notion of multiscale representation is of crucial importance in signal processing and matching. For example, two partially different signals may look alike at a coarser scale. There are many methods to represent a signal in multiscale analysis, for example the curvature scale space method proposed by Mokhtarian et al. [10]. They obtained very good results by calculating the similarity measure after the local peaks are matched, but they matched the shapes at some fixed scale and did not give a total signal mapping. We will match the one-dimensional feature functions at an unfixed scale. In this section, we introduce a method to match two one-dimensional signals; the reader can find the details in [11]. This method obtains the one-to-one mapping between two signals in closed form without any optimization. The following notation is used throughout this section. C (R) is the field of complex (real) numbers, and R+ is the set of positive real numbers.
$U_r = \{z \mid z \in C, |z| = r\}$ is a circle in $C$ of radius $r$; $U$ is the unit circle. $\Delta_r = \{z \mid z \in C, |z| < r\}$ is a disc in $C$ of radius $r$; $\Delta$ is the unit disc. "$\circ$" denotes the composition of two functions. Let $f_\alpha(\theta): U \to R$ ($\alpha = 1, 2$) be two signals. To match them is to find a linear function $t: R \to R$ and a one-to-one smooth function $w: U \to U$ such that

$$t(f_2(\theta)) = f_1(w(\theta)) \qquad (1)$$

Since it is easy to estimate $t$, we assume without loss of generality that $t = id$ is the identity mapping, i.e., we only need to solve $f_2(\theta) = f_1(w(\theta))$. A signal $f$ and its derivatives at different scales can be described by its Schwarz integral $\tilde{f}(z)$ [11], so we calculate the Schwarz integral of both sides of Eq. (1):

$$\tilde{f}(z) = f(z) + i g(z) = \frac{1}{2\pi} \int_0^{2\pi} \frac{e^{i\varphi} + z}{e^{i\varphi} - z}\, f(e^{i\varphi})\, d\varphi \qquad (2)$$

where $z = r e^{i\theta} \in \Delta$, $r = |z|$ is the scale factor, and $g(z)$ is the harmonic conjugate of $f(z)$. We expand them in Fourier series:
$$f(e^{i\theta}) = \frac{1}{2} a_0 + \sum_{n=1}^{+\infty} (a_n \cos n\theta + b_n \sin n\theta)$$

$$f(z) = f(r e^{i\theta}) = \frac{1}{2} a_0 + \sum_{n=1}^{+\infty} r^n (a_n \cos n\theta + b_n \sin n\theta)$$

$$g(z) = g(r e^{i\theta}) = \sum_{n=1}^{+\infty} r^n (a_n \sin n\theta - b_n \cos n\theta)$$

$$\tilde{f}(z) = \sum_{n=0}^{+\infty} c_n z^n = \sum_{n=0}^{+\infty} c_n r^n e^{in\theta}, \quad \text{where } c_0 = a_0/2,\; c_n = a_n - b_n i \;(n \ge 1)$$
Since $r$ denotes the scale, $\tilde{f}(z)$ represents the information of the signal at each scale $r$ (from 0 to 1). If $r = 0$, $\tilde{f}(z)$ represents information at the coarsest scale, while $r = 1$ is the finest scale. It can be proved that $\tilde{f}(z)$ is an analytic function in the unit disc. We obtain the following equation

$$\tilde{f}_2 = \tilde{f}_1 \circ \tilde{w} \qquad (3)$$

where $\tilde{w}: D_1 \to D_2$ ($D_1, D_2 \subseteq \Delta$) is an analytic bijection. So we can calculate the analytic function $\hat{w}: D_1 \to C$:

$$\hat{w} = \hat{f}_1^{-1} \circ \hat{f}_2 \qquad (4)$$
Then we compute the star-radius $r^*$ of $\hat{w}$ to obtain the scale at which the two signals are most similar; thus we can get the optimal match at the optimal scale. When $w: U \to U$ is defined by

$$\exp(i\theta) \mapsto \exp\!\big(i\,\angle(\hat{w}(r^* \exp(i\theta)))\big), \qquad (5)$$
it gives a one-to-one mapping between the original signals $f_\alpha$ ($\alpha = 1, 2$). Thus we can calculate the matching error $E$ under this one-to-one mapping; we define the match error as $E = E_1 + \lambda E_2$, where
$$E_1 = W_0\, |\tilde{f}_1(0) - \tilde{f}_2(0)| + W_2\, |\tilde{f}_1'(0) / \tilde{f}_2'(0)|$$

$$E_2 = \frac{1}{2\pi} \int_U |\hat{f}_2(z) - \hat{f}_1(w(z))|\, |dz|$$
The similarity measure may be defined as $1/E$. Figure 1 shows the mapping between a circle (a) and an ellipse (b) obtained by the method discussed above.
Figure 1. Match between a circle and an ellipse: (a) the circle, (b) the ellipse, (c) the center distance of (a), (d) the center distance of (b), (e) the one-to-one mapping.
Figure 1 demonstrates that we can get a one-to-one mapping between two shapes using the Schwarz representation; it gives the matching not only between feature points but for the total signal in multiscale space. The general match error can therefore be calculated under this one-to-one mapping, and we can take into account both the local information and the global information at the same time.
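As an illustration of how the closed-form match function of Eqs. (4)-(5) can be evaluated numerically, the following Python/NumPy sketch composes the two power series by evaluation on a circle of radius r*. It is a simplified sketch under assumptions (the inverse-series coefficients and the star-radius are taken as given, coefficients are ordered by ascending power, and the function names are ours); it is not the authors' implementation.

```python
import numpy as np

def evaluate_series(coeffs, z):
    """Evaluate sum_k coeffs[k] * z**k at complex points z using Horner's scheme.
    coeffs[k] is the coefficient of z**k (use 0.0 for absent terms)."""
    result = np.zeros_like(z)
    for c in reversed(coeffs):
        result = result * z + c
    return result

def match_function(inv_coeffs_f1, coeffs_f2, r_star, n_samples=256):
    """Sample the mapping of Eq. (5): exp(i*theta) -> exp(i*angle(w_hat(r* e^{i theta}))),
    where w_hat = f1^{-1} o f2 is evaluated by composing the two series."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    z = r_star * np.exp(1j * theta)
    f2_vals = evaluate_series(coeffs_f2, z)          # \tilde f_2(z)
    w_hat = evaluate_series(inv_coeffs_f1, f2_vals)  # f_1^{-1}(\tilde f_2(z))
    return np.angle(w_hat)                           # mapped parameter for each theta
```

The returned angles can then be compared with the original parameterization to compute the match error E and the similarity 1/E.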
3 Process of Indexing and Retrieval
A simple index can improve retrieval efficiency, while a powerful index can enhance retrieval accuracy. Because a simple index reduces scanning time, an index should consume as little storage as possible; at the same time, it should represent as much information as possible so that the retrieval system can obtain more accurate results. The Schwarz representation describes the signal and its derivatives at different scales, and it can be expanded into a polynomial. We can use a vector to represent the coefficients of the polynomial. This means that the Schwarz representation can capture much information about the shape while consuming very little space.
3.1 Process of Indexing
We index an image using the Schwarz representation by the following steps:

Step 1. Extract a one-dimensional feature. In this paper we use the center distance function of the shape as the feature function $f_1(n)$.

Step 2. Expand the feature function $f_1(n)$ into a Fourier series:
$$f_1(e^{i\theta}) = \frac{1}{2} a_0 + \sum_{n=1}^{+\infty} (a_n \cos n\theta + b_n \sin n\theta) \qquad (6)$$
Then we get the Schwarz integral of the one-dimensional feature function $f_1(n)$ as follows:

$$\tilde{f}_1(z) = \sum_{n=0}^{+\infty} c_n z^n \qquad (7)$$
where $c_0 = a_0/2$, $c_n = a_n - b_n i$ ($n \ge 1$).

Step 3. Compute its inverse function:

$$f_1^{-1}(w) = \frac{1}{2\pi i} \int_{|z|=r} \frac{\tilde{f}_1'(z)\, z}{\tilde{f}_1(z) - w}\, dz \qquad (8)$$

We express it as a polynomial:

$$f_1^{-1}(w) = \sum_{k=1}^{+\infty} a_k w^k \qquad (9)$$

where $a_k = \frac{k!}{2\pi i} \int_{|z|=r} \frac{\tilde{f}_1'(z)\, z}{[\tilde{f}_1(z)]^{k+1}}\, dz$, which can be implemented by the numerical integral

$$a_k = \frac{k!}{2\pi i} \int_0^{2\pi} \frac{r e^{i\theta}}{[\tilde{f}_1(r e^{i\theta})]^{k+1}}\, d\tilde{f}_1(r e^{i\theta}) \qquad (10)$$

Since we only need to sample the angle at double the frequency of the signal, the inverse function can be computed very quickly.

Step 4. The coefficients of the polynomial, $a_k$, are collected into a vector that serves as the index of the image.
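A minimal sketch of Steps 1-2 in Python/NumPy is given below: it samples the center distance function of a contour and computes the Schwarz coefficients c_n from the FFT of that signal (cf. Eqs. 6-7). The function names, the number of coefficients kept, and the FFT-based shortcut are our assumptions; the inverse-series computation of Step 3 is not shown.

```python
import numpy as np

def center_distance_feature(contour):
    """One-dimensional feature f_1(n): distance from the shape centroid to each
    point of the closed contour (an N x 2 array of boundary points)."""
    contour = np.asarray(contour, dtype=float)
    centroid = contour.mean(axis=0)
    return np.linalg.norm(contour - centroid, axis=1)

def schwarz_index(feature, n_coeffs=32):
    """Index vector of Schwarz-integral coefficients c_n (Eq. 7):
    c_0 = a_0/2 and c_n = a_n - i b_n for n >= 1, obtained via the real FFT.
    Requires len(feature)//2 + 1 >= n_coeffs."""
    n = len(feature)
    spectrum = np.fft.rfft(feature)
    c = np.empty(n_coeffs, dtype=complex)
    c[0] = spectrum[0].real / n              # a_0 / 2
    c[1:] = 2.0 * spectrum[1:n_coeffs] / n   # a_n - i b_n
    return c

# Example: index a unit circle sampled at 256 points
theta = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
index = schwarz_index(center_distance_feature(circle))
```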
3.2 Process of Retrieval
The retrieval algorithm should consume as little time as possible. Because we use the Schwarz representation as the index, only a composition of two polynomials is needed to get the one-to-one mapping, while other methods need to perform an optimization to match point pairs. We search the database as follows:

Step 1. Extract the one-dimensional feature function $f_2(n)$ of the query model.

Step 2. Calculate the Schwarz integral $\tilde{f}_2(z)$ of $f_2(n)$ and express it as a polynomial: $\tilde{f}_2(z) = \sum_{n=0}^{+\infty} c_n z^n$.
Step 3. Compose the polynomials $\tilde{f}_2(z)$ and $f_1^{-1}(w)$ to obtain the match function $\hat{w}$ as described in Eq. (4). This can be implemented much faster than an optimization. We calculate the star-radius $r^*$ of $\hat{w}$, then obtain a one-to-one mapping between the feature functions by sampling $\exp(i\theta) \mapsto \exp(i\,\angle(\hat{w}(r^* \exp(i\theta))))$ at the signal frequency.

Step 4. Calculate the similarity measure $1/E$ by numerical integration.

Step 5. Output the $k$ most similar images as the retrieval result.
4 Experimental Results

In order to verify our method, we use the shape image database obtained from the VSSP group of the University of Surrey. There are 1100 images in the database in total. We calculate the center distance of each image as illustrated in Figure 2. A typical one-to-one mapping function is shown in Figure 3. The results of shape retrieval are shown in Figure 4.
Figure 2. The center distance function of the shape.
Figure 3. The typical one-to-one mapping.
Figure 4. Results of shape-based retrieval: (a) the image given by the user, (b) the query results.
5 Conclusion

In this paper we proposed a new method for image database retrieval. Since efficiency and accuracy are both crucial in image retrieval, we must compromise between speed and accuracy. Many retrieval methods focus on speed and therefore calculate the distance between images without matching; others perform matching between some dominant points and then calculate the distance between those dominant points. Compared to the methods described in the literature, our method has the following merits:
1. The computational cost of matching is very low. We can perform matching before calculating the distance between images, so we can obtain an optimal similarity measure.
2. We use both the global and the local information of the feature function. The one-to-one mapping is driven mainly by the local dominant information, but we also get a global mapping, which is very useful in computing the similarity distance.
3. Since the match is calculated in scale space and the scale is located by a closed-form function, we can match signals at different scales without normalization.
This method also has its own limitations; for example, it cannot deal with scale variation and occlusion.
References
1. M. Swain, D. Ballard, Color Indexing, IJCV, 7(1), (1991) 11-32.
2. B.M. Mehtre, M. Kankanhalli et al., Color Matching for Image Retrieval, Pattern Recognition Lett., 16, (1995) 325-331.
3. D.H. Ballard, C.M. Brown, Computer Vision, Prentice Hall, New York, 1982.
4. A.D. Gross, T.E. Boult, Error of Fit Measures for Recovering Parametric Solids, Proc. ICCV, (1988) 690-694.
5. G.C.-H. Chuang, C.-C. Jay Kuo, Wavelet Descriptor of Planar Curves: Theory and Application, IEEE Trans. on IP, 5(1), (1991) 56-70.
6. D.H. Ballard, Generalizing the Hough Transform to Detect Arbitrary Shapes, Pattern Recognition, 13(2), (1981) 111-122.
7. B. Widrow, The Rubber Mask Technique, Part I, Pattern Recognition, 5(3), (1973) 175-211.
8. M. Kass, A. Witkin, et al., Snake: Active Contour Models, IJCV, 1(4), (1988) 321-331.
9. H.V. Jagadish, A Retrieval Technique for Similar Shapes, Proc. ACM SIGMOD Conf. on Management of Data, ACM, New York, (1991) 208-217.
10. F. Mokhtarian, S. Abbasi, J. Kittler, Efficient and Robust Retrieval by Shape Content through Curvature Scale Space, First Inter. Workshop on Image Databases and Multi-Media Search, (1996) 35-42.
11. Q. Yang, S.D. Ma, Schwarz Representation for Matching and Similarity Analysis, Proc. of the Sixth Inter. Conf. on Computer Vision, 1996.
12. Aditya Vailaya, Shape-Based Image Retrieval, PhD thesis, MSU, 1997.
13. Chahab Nastar, The Image Shape Spectrum for Image Retrieval, Research Report, INRIA, 1997.
14. Madirakshi Das, E.M. Riseman, FOCUS: Searching for Multi-Color Objects in a Diverse Image Database, CVPR, 1997.
15. Rangachar Kasturi, Susan H. Strayer, An Evaluation of Color Histogram Based Methods in Video Indexing, research progress report, USP, 1996.
16. Xia Wang, C.-C. Jay Kuo, Color Image Retrieval via Feature-Adaptive Query Processing, SIAM's 45th Anniversary Meeting, Stanford University, CA, July 14-18, 1997.
Invariant Image Retrieval Using Wavelet Maxima Moment

Minh Do, Serge Ayer, and Martin Vetterli

Swiss Federal Institute of Technology, Lausanne (EPFL), Laboratory for Audio-Visual Communications (LCAV), CH-1015 Lausanne, Switzerland
{Minh.Do,Serge.Ayer,Martin.Vetterli}@epfl.ch
Abstract. Wavelets have been shown to be an effective analysis tool for image indexing due to the fact that spatial information and visual features of images could be well captured in just a few dominant wavelet coefficients. A serious problem with current wavelet-based techniques is in the handling of affine transformations in the query image. In this work, to cure the problem of translation variance with wavelet basis transform while keeping a compact representation, the wavelet transform modulus maxima is employed. To measure the similarity between wavelet maxima representations, which is required in the context of image retrieval systems, the difference of moments is used. As a result, each image is indexed by a vector in the wavelet maxima moment space. Those extracted features are shown to be robust in searching for objects independently of position, size, orientation and image background.
1 Introduction
Large and distributed collections of scientific, artistic, and commercial data comprising images, text, audio and video abound in our information-based society. To increase human productivity, however, there must be an effective and precise method for users to search, browse, and interact with these collections, and do so in a timely manner. As a result, image retrieval (IR) has lately been a fast-growing research area. Image feature extraction is a crucial part of any such retrieval system. Current methods for feature extraction suffer from two main problems: first, many methods do not retain any spatial information, and second, the problem of invariance with respect to standard transformations is still unsolved. In this paper we propose a new wavelet-based indexing scheme that can handle variations of translation, scale and rotation in the query image. Results presented here use the "query-by-example" approach, but the method is also ready to be used in systems with hand-drawn sketch queries. The paper is organized as follows. Section 2 discusses the motivation for our work. The proposed method is detailed in Sections 3 and 4. Simulation results are provided in Section 5, which is followed by the conclusion.
2 Motivation
A common approach in most current IR systems is to exploit low-level features such as color, texture and shape, which can be extracted by a machine automatically. While semantic-level retrieval would be more desirable for users, given the current state of technology in image understanding this is still very difficult to achieve. This is especially true when one has to deal with a heterogeneous and unpredictable image collection such as the World Wide Web. Early IR systems such as [2,8] mainly relied on a global feature set extracted from images. For instance, color features are commonly represented by a global histogram. This provides a very simple and efficient representation of images for retrieval purposes. However, the main drawback of this type of system is that it neglects spatial information. In particular, shape is often the most difficult feature to index, and yet it is likely the key feature in an image query. More recent systems have addressed this problem. Spatial information is either expressed explicitly by segmented image regions [9,1,6] or implicitly via dominant wavelet coefficients [4,5,12]. Wavelets have been shown to be a powerful and efficient mathematical tool to process visual information at multiple scales. The main advantage of wavelets is that they allow simultaneously good resolution in time and frequency. Therefore spatial information and visual features can be effectively represented by dominant wavelet coefficients. In addition, the wavelet decomposition provides a very good approximation of images, and its underlying multiresolution mechanism allows the retrieval process to be done progressively over scales. Most wavelet-based image retrieval systems so far have employed traditional, i.e., orthogonal and maximally-decimated, wavelet transforms. These transforms have a serious problem: they can exhibit visual artifacts, mainly due to the lack of translation invariance. For instance, the wavelet coefficients of a translated function $f_\tau(t) = f(t - \tau)$ may be very different from the wavelet coefficients of $f(t)$. The differences can be drastic both within and between subbands. As a result, a simple wavelet-based image retrieval system would not be able to handle affine transformations of the query image. This problem was stated in previous works (e.g., [4]) but, to our knowledge, it has still not received proper treatment. On the other hand, the ability to retrieve images that contain interesting objects at different locations, scales and orientations is often very desirable. It is our intent to address the invariance problem of wavelet-based image retrieval in this work.
3 Wavelet Maxima Transform
As mentioned above, the main drawback of wavelet bases in visual pattern recognition applications is their lack of translation invariance. An obvious remedy to this problem is to apply a non-subsampled wavelet transform which computes all the shifts [11]. However this creates a highly redundant representation and we have to deal with a large amount of redundant feature data.
To reduce the representation size in order to facilitate the retrieval process while maintaining translation invariance, an alternative approach is to use an adaptive sampling scheme. This can be achieved via the wavelet maxima transform [7], where the sampling grid is automatically translated when the signal is translated. For images, inspired by Canny's multiscale edge detector algorithm, the wavelet maxima points are defined as the points where the wavelet transform modulus is locally maximal along the direction of the gradient vector. Formally, define two wavelets that are partial derivatives of a two-dimensional smoothing function $\theta(x, y)$:

$$\psi^1(x, y) = \frac{\partial \theta(x, y)}{\partial x} \quad \text{and} \quad \psi^2(x, y) = \frac{\partial \theta(x, y)}{\partial y} \qquad (1)$$

Let us denote the wavelets at dyadic scales $\{2^j\}_{j \in Z}$ as

$$\psi^k_{2^j}(x, y) = \frac{1}{2^j}\, \psi^k\!\left(\frac{x}{2^j}, \frac{y}{2^j}\right), \quad k = 1, 2 \qquad (2)$$
Then the wavelet transform of $f(x, y)$ at a scale $2^j$ has the following two components:

$$W^k f(2^j, u, v) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, \psi^k_{2^j}(x - u, y - v)\, dx\, dy = \langle f(x, y), \psi^k_{2^j}(x - u, y - v) \rangle, \quad k = 1, 2 \qquad (3)$$
It can be shown [7] that the two components of the wavelet transform given in (3) are proportional to the coordinates of the gradient vector of $f(x, y)$ smoothed by $\theta_{2^j}(x, y)$. We therefore denote the wavelet transform modulus and its angle as:

$$M f(2^j, u, v) = \sqrt{|W^1 f(2^j, u, v)|^2 + |W^2 f(2^j, u, v)|^2} \qquad (4)$$

$$A f(2^j, u, v) = \arctan\!\left( \frac{W^2 f(2^j, u, v)}{W^1 f(2^j, u, v)} \right) \qquad (5)$$
454
Minh Do et al.
transform captures well the edge-based and spatial layout information. Using wavelet maxima only, [7] can reconstruct an image which is visually identical to the original one. This reconstruction power of wavelet maxima indicates the significance of its representation. In addition, the ”denoising” facility in the wavelet maxima domain can be exploited to achieve robustness in retrieving images which contain interesting objects against various image backgrounds.
Fig. 1. Wavelet maxima decomposition. The right hand part shows the wavelet maxima points at scales 2j where j = 6, 3, 1 from top to bottom, respectively (showing from coarse to detail resolutions)
4
Wavelet Maxima Moment
Given a compact and significant representation of images via wavelet maxima transform, the next step is to define a good similarity measurement using that representation. The result of wavelet maxima transform is multiple scale sets of points (visually located at the contours of the image) and their wavelet transform coefficients at those locations. Measuring the similarity directly in this domain is difficult and inefficient. Therefore we need to map this ”scattered” representation into points in a multidimensional space so that the distances could be easily computed. Furthermore, we require this mapping to be invariant with respect to affine transforms. For those reasons, we select the moments representation. Traditionally, moments have been widely used in pattern recognition applications to describe the geometrical shapes of different objects [3]. Difference of moments has also been successfully applied in measuring similarity between image color histograms [10]. For our case, care is needed since we use moments to represent wavelet maxima points which are dense along curves rather than regions (see the normalized moment equation (8)).
Invariant Image Retrieval Using Wavelet Maxima Moment
455
Definition 2. Let us denote Mj is the set of all wavelet maxima points of a given image at the scale 2j . We define the (p + q)th -order moment of the wavelet maxima transform, or wavelet maxima moment for short, of the image as: mjpq =
up v q M f (2j , u, v),
p, q = 0, 1, 2, . . .
(6)
(u,v)∈Mj
where M f (2j , u, v) is defined in (4). The reason for not including the angles Af (2j , u, v) in the moment computation is because they contain information about direction of gradient vectors in the image which is already captured in the locations of the wavelet maxima points. In the sequel the superscript j is used to denote scale index rather than power. First, to obtain translation invariance, we centralize the wavelet maxima points to their center of mass (uj , v j ) where uj = mj10 /mj00 ; v j = mj01 /mj00 . That is, (u − uj )p (v − v j )q M f (2j , u, v) (7) µjpq = (u,v)∈Mj
We furthermore normalize the moments by the number of wavelet maxima points, |M_j|, and by their "spread", (\mu^j_{20} + \mu^j_{02})^{1/2}, to make them invariant to changes of scale. The normalized central moments are defined as:

\eta^j_{pq} = \frac{\mu^j_{pq}/|M_j|}{\left(\mu^j_{20}/|M_j| + \mu^j_{02}/|M_j|\right)^{(p+q)/2}} = \frac{\mu^j_{pq}}{(\mu^j_{20} + \mu^j_{02})^{(p+q)/2}\,|M_j|^{1-(p+q)/2}}    (8)
Note that, unlike when computing moments for regions, in our case we cannot use the moment µ^j_{00} for scale normalization. This is because when the scale of an object is reduced, for example, the number of wavelet maxima points may decrease due to both the reduction in size and the loss of detail at high frequencies. Finally, to add rotation invariance, we compute for each scale the seven invariant moments up to the third order derived in [3], except that the invariant η^j_{20} + η^j_{02} (which is always equal to 1 due to our scale normalization) is replaced by η^j_{00}. The current implementation of our system computes 4 levels of wavelet decomposition at scales 2^j, 1 ≤ j ≤ 4, and 7 invariant moments φ^j_i, 1 ≤ i ≤ 7, for each scale, thus giving a total of 28 real numbers as the signature for each indexed image. For testing, we simply adopt the most commonly used similarity metric, namely the variance-weighted Euclidean distance [2]. The weighting factors are the inverse variances of each vector component, computed over all the images in the database. This normalization brings all components into a comparable range, so that they have approximately the same influence on the overall distance.
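The moment computation of (6)–(8) and the variance-weighted distance can be sketched as follows. This is only an illustration under our own naming (wavelet_maxima_moments, weighted_distance); for brevity it returns two Hu-style invariants rather than the full set of seven used by the system.

```python
import numpy as np

def wavelet_maxima_moments(points, modulus):
    """Normalized wavelet maxima moments (eqs. 6-8) for one scale.
    points : (N, 2) array of (u, v) maxima locations
    modulus: (N,) array of M f(2^j, u, v) values at those locations."""
    u, v = points[:, 0].astype(float), points[:, 1].astype(float)
    w = modulus.astype(float)
    n = len(w)

    def m(p, q):                      # raw moment, eq. (6)
        return np.sum(u**p * v**q * w)

    uc, vc = m(1, 0) / m(0, 0), m(0, 1) / m(0, 0)   # center of mass

    def mu(p, q):                     # central moment, eq. (7)
        return np.sum((u - uc)**p * (v - vc)**q * w)

    def eta(p, q):                    # normalized central moment, eq. (8)
        return mu(p, q) / ((mu(2, 0) + mu(0, 2))**((p + q) / 2)
                           * n**(1 - (p + q) / 2))

    # Two example rotation invariants (Hu [3]); the paper uses seven,
    # with eta20 + eta02 replaced by eta00 as noted above.
    phi1 = eta(0, 0)
    phi2 = (eta(2, 0) - eta(0, 2))**2 + 4 * eta(1, 1)**2
    return np.array([phi1, phi2])

def weighted_distance(sig_a, sig_b, variances):
    """Variance-weighted Euclidean distance between two signatures."""
    return np.sqrt(np.sum((sig_a - sig_b)**2 / variances))
```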
5
Simulation Results
In this section, we evaluate the performance of the proposed method in the query-by-example setting. Since we are particularly interested in the invariance of the extracted features, a test image database was synthetically generated. Figure 2 shows the object library, which consists of twenty different food objects in small images of 89 by 64 pixels. For each object, a class of 10 images was constructed by randomly rotating, scaling and pasting that object onto a randomly selected background. The scaling factor was a uniform random variable between 0.5 and 1. The position of the pasted object was randomly selected, but such that the object would fit entirely inside the image. The backgrounds come from a set of 10 wooden texture images of size 128 by 128 pixels. The test database thus contains 200 grey-level images of size 128 x 128. Each image in the database was used as a query in order to retrieve the other 9 relevant ones. Figure 3 shows an example of retrieval results. The query image is in the top left corner; all other images are ranked in order of similarity to the query image from left to right, top to bottom. In this case, all relevant images are correctly ranked as the top matches, followed by images of very similar shape but different visual details. The retrieval effectiveness evaluation is shown in Figure 4 in comparison with the ideal case. For each number of top retrievals considered (horizontal axis), the average number of images from the same similarity class among them is used to measure the performance (vertical axis). This result is superior compared with [4], where the retrieval performance was reported to drop significantly, by about a factor of five, if the query was translated, scaled and/or rotated.
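The evaluation curve of Fig. 4 can be reproduced, in outline, by ranking the database for every query and counting same-class images among the top matches. The sketch below assumes precomputed signatures and integer class labels and uses the variance-weighted distance described above; all names are ours.

```python
import numpy as np

def retrieval_curve(signatures, labels, max_top=15):
    """Average number of relevant images among the top-N matches,
    for N = 1..max_top, using each image in turn as the query.
    signatures : (n_images, n_features) array; labels : (n_images,) int array."""
    n = len(labels)
    var = signatures.var(axis=0) + 1e-12          # avoid division by zero
    counts = np.zeros(max_top)
    for q in range(n):
        d = np.sqrt((((signatures - signatures[q])**2) / var).sum(axis=1))
        order = np.argsort(d)
        order = order[order != q]                 # drop the query itself
        relevant = (labels[order] == labels[q]).astype(int)
        for top in range(1, max_top + 1):
            counts[top - 1] += relevant[:top].sum()
    return counts / n
```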
Fig. 2. The object library of 20 food images of size 89 x 64.
6
Conclusion
This paper has presented a wavelet-based image retrieval system that is robust in searching for objects independently of position, size, orientation and image background. The proposed feature extraction method is based on the marriage of the wavelet maxima transform and invariant moments. The important point
Fig. 3. Example of retrieval results from the synthetic image database.
Fig. 4. Retrieval performance in comparison with the ideal case. The vertical axis gives the average number of relevant images retrieved and the horizontal axis the number of top matches considered; the solid line is ideal retrieval, the dashed line is retrieval using the wavelet maxima moment.
is that neither a moment nor a wavelet maxima method alone would lead to the good performance we have shown; the combination of the two is the key. This results in an extracted feature set that is compact, invariant to translation, scaling and rotation, and significant, especially for shape and spatial information. However, the retrieval system presented here is mainly based on configuration/shape-related information. This is because the moment computation puts emphasis on the positions of the wavelet maxima or edge points of the image. Extensions for extracting other types of image information from the wavelet maxima transform are being explored. In particular, color-based information can be efficiently extracted from the scaling coefficients, which correspond to a low-resolution version of the original image. Texture can be characterized by a set of energies computed from the wavelet coefficients at each scale and orientation. To conclude, the main advantage of using the wavelet transform in image retrieval applications is that it provides a fast process for decomposing an image into meaningful descriptions.
Acknowledgments The authors would like to thank Wen Liang Hwang, Stephane Mallat and Sifen Zhong for their Wave2 package and Zoran Pečenović for his user interface software.
References 1. C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In IEEE Workshop on Content-based Access of Image and Video Libraries, Puerto Rico, June 1997. 452 2. M. Flickner et al. Query by image and video content: The QBIC system. Computer, pages 23–32, September 1995. 452, 455 3. M.-K. Hu. Visual pattern recognition by moment invariants. IRE Trans. Info. Theory, IT-8:179–187, 1962. 454, 455 4. C.E. Jacobs, A. Finkelstein, and D.H. Salesin. Fast multiresolution image querying. In Computer graphics proceeding of SIGGRAPH, pages 278–280, Los Angeles, 1995. 452, 456 5. K.-C. Liang and C.-C. Jay Kuo. Progressive image indexing and retrieval based on embedded wavelet coding. In IEEE Int. Conf. on Image Proc., 1997. 452 6. W. Y. Ma and B. S. Manjunath. NETRA: A toolbox for navigating large image databases. In IEEE International Conference on Image Processing, 1997. 452 7. S. Mallat and S. Zhong. Characterization of signals from multiscale edges. IEEE Trans. Pattern Anal. Machine Intell., 14:710–732, July 1992. 453, 454 8. A. Pentland, R.W. Piccard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, 18(3):233–254, 1996. 452 9. J.R. Smith and S.-F. Chang. VisualSEEk: a fully automated content-based image query system. In Proc. The Fourth ACM International Multimedia Conference, pages 87–98, November 1996. 452
10. M. Stricker and M. Orengo. Similarity of color images. In Storage and Retrieval for Image and Video Databases III, volume 2420 of SPIE, pages 381–392, 1995. 454 11. M. Vetterli and J. Kovacevic. Wavelets and Subband Coding. Prentice-Hall, Inc, 1995. 452 12. J. Z. Wang, G. Wiederhold, O. Firschein, and S. X. Wei. Wavelet-based image indexing techniques with partial sketch retrieval capability. In Proceedings of 4th ADL Forum, May 1997. 452
Detecting Regular Structures for Invariant Retrieval Dmitry Chetverikov Computer and Automation Research Institute 1111 Budapest, Kende u.13-17, Hungary Phone: (36-1) 209-6510, Fax: (36-1) 466-7503 [email protected]
Abstract. Many of the existing approaches to invariant content-based image retrieval rely on local features, such as color or specific intensity patterns (interest points). In some methods, structural content is introduced by using particular spatial configurations of these features that are typical for the pattern considered. Such approaches are limited in their capability to deal with regular structures when a high degree of invariance is required. Recently, we have proposed a general measure of pattern regularity [2] that is stable under weak perspective viewing of non-flat patterns and under varying illumination. In this paper we apply this measure to the invariant detection of regular structures in aerial imagery.
1
Introduction
This paper addresses the problem of invariant search for structured (repetitive) intensity patterns, e.g., regular textures, in arbitrary scenes. Structure-based image retrieval is an unsolved and challenging task. The basic problem is the computational complexity of the structure detector. Periodicity is not a local property. To find a structure in an image, one has to span at least two periods, which requires long-range operations. Also, one has to precisely align with the periodicity vector, which requires high angular resolution. The task is further complicated by the necessity to tolerate changes in viewing conditions and illumination and, in the case of non-flat structures, shadows and occlusions. Due to its locality and invariance, color (e.g., [5]) is one of the most popular options used for image retrieval. However, color is not a structural property: color-blind people are able to detect and recognize structured patterns. Local grayvalue invariants, such as those based on the local jet [11], can be used for efficient matching of interest point configurations with limited invariance. Appearance-based search methods (e.g., [10]) are mostly applicable to deterministic intensity patterns representing shapes, rather than to statistical, repetitive structures. For reasons of computational feasibility, most retrieval systems, such as QBIC [4], restrict the texture-based search to neighborhood filtering and histograms. A recent exception is ImageRover [12], which applies the Fourier transform to deal with periodic textures. This approach can hardly
be implemented as a filter, which limits its scope to a few distinct, well separated objects per image. A regular structure, flat or non-flat, is perceived by humans as regular under varying viewing conditions, including changes in illumination, local occlusions and shadows. Recently, we proposed a general measure of pattern regularity [2] that can serve as a highly invariant, perceptually motivated feature. In the current pilot study, we use this regularity feature to find arbitrarily oriented, periodic structures in aerial images.
2
The Maximal Regularity Measure
In this section, we sketch our computational definition of regularity proposed in [2], where all technical details are given. Consider an M × N pixel digital image I(m, n) and a spacing vector d = (α, d), with α being the orientation and d the magnitude of the vector. The image is scanned by d. In each position of d, the two points connected by the vector are considered and the occurrences of absolute graylevel differences between them are counted. The origin of d moves on the image raster, while the end of the vector points at a non-integer location. When the origin is in the pixel (m, n), d points at the location (x, y) given by x = n + d cos α, y = m − d sin α. The intensity I(x, y) is obtained by linear interpolation of the four neighboring pixels, then truncated to an integer. Note that α and d are continuous, independent parameters, which makes the proposed regularity measure operational. For a discrete set of spacing vectors d_ij = (α_i, d_j), we compute MEAN(i, j) as the mean value of |I(m, n) − I(x, y)| over the image. Here α_i = ∆α · i and d_j = ∆d · j. We use ∆d = 1, with ∆α being task-dependent. To cope with varying contrast, MEAN(i, j) is normalized by its maximum value max_{ij}{MEAN(i, j)} so that 0 ≤ MEAN(i, j) ≤ 1. The MEAN feature is related to the autocorrelation function. When viewed as a function of d for a given angle, this feature is called the contrast curve and denoted by F(d). Figure 1 shows typical contrast curves for patterns with different degrees of regularity. A periodic pattern has a contrast curve with deep and periodic minima. Our definition of regularity quantifies this property. It also takes into account that the shape of the period can generally be more complex, with local minima that indicate the presence of a hierarchical structure. For each angle α_i, the directional regularity is defined as

REG = (REG_int · REG_pos)^2,    (1)

where the intensity regularity

REG_int = 1 − F_am    (2)

and the position regularity

REG_pos = 1 − \frac{|d_2 − 2 d_1|}{d_2}.    (3)
Fig. 1. Typical contrast curves of a random, a weakly regular and a regular pattern (F(d) plotted against the spacing d).
Here F_am is the absolute minimum of F(d), and d_1 and d_2 (d_1 < d_2) are the positions of the two lowest minima after elimination of false, noisy minima. Finally, the maximal regularity feature MAXREG is defined as the maximum directional regularity over all angles: MAXREG = max_i{REG(i)}. Since MEAN(i, j) is normalized, 0 ≤ MAXREG ≤ 1, with 0 indicating a random and 1 a highly regular pattern. The angular resolution N_a and the maximum spacing d_max are the two basic parameters of the method. It is assumed that d_max extends over at least two periods of the pattern. High angular resolution is necessary for the spacing vector to precisely align with the periodicity vector.
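A direct, unoptimized rendering of the contrast curve and of equations (1)–(3) might look as follows. Function names and the crude minima search are ours, and the interpolation/truncation details of [2] are simplified.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def contrast_curve(img, alpha, d_max):
    """MEAN(alpha, d) for d = 1..d_max at one orientation alpha (radians)."""
    rows, cols = img.shape
    m, n = np.mgrid[0:rows, 0:cols]
    curve = np.empty(d_max)
    for k, d in enumerate(range(1, d_max + 1)):
        x = n + d * np.cos(alpha)               # end point of the spacing vector
        y = m - d * np.sin(alpha)
        ok = (x >= 0) & (x <= cols - 1) & (y >= 0) & (y <= rows - 1)
        vals = map_coordinates(img.astype(float), [y[ok], x[ok]], order=1)
        curve[k] = np.mean(np.abs(img[ok] - vals))
    return curve

def directional_regularity(F):
    """REG for one angle from a normalized contrast curve F (eqs. 1-3)."""
    F_am = F.min()                               # absolute minimum
    # crude local-minima search; a real implementation suppresses noisy minima
    mins = [i for i in range(1, len(F) - 1) if F[i] <= F[i - 1] and F[i] <= F[i + 1]]
    if len(mins) < 2:
        return 0.0
    d1, d2 = sorted(sorted(mins, key=lambda i: F[i])[:2])
    d1, d2 = d1 + 1, d2 + 1                      # spacings are 1-based
    reg_int = 1.0 - F_am
    reg_pos = 1.0 - abs(d2 - 2 * d1) / d2
    return (reg_int * reg_pos) ** 2

def maxreg(img, d_max=20, n_angles=36):
    curves = [contrast_curve(img, a, d_max)
              for a in np.linspace(0, np.pi, n_angles, endpoint=False)]
    top = max(c.max() for c in curves) or 1.0    # normalize MEAN by its maximum
    return max(directional_regularity(c / top) for c in curves)
```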
3
Invariance of Maximal Regularity
The maximal regularity of a flat pattern is invariant under weak perspective, when the size of objects is small compared with the viewing distance. Weak perspective, an approximation widely used in vision research, can be interpreted as an orthographic projection onto the image plane followed by an isotropic scaling [7]. Both transformations preserve periodicity and parallelism. This is sufficient for the invariance of the maximal regularity, because its components are invariant under linear transformations of intensity due to illumination changes. Assume that a structure extends in the third dimension as well, but its size in this dimension is small compared to the other two dimensions. Under weak perspective with varying viewing angle and distance, the periodic elements of a regular structure cast shadows that are also periodic. In the visible parts of the pattern, periodicity and parallelism are still preserved, while intensity may change in a non-linear way. Despite this latter circumstance, the maximal regularity is quite stable, as illustrated in figure 2.
4
Regularity Filtering
We have implemented MAXREG as a filter. Because of the large number of spacings and angles, the computational load of the filter would be extremely high, if not prohibitive. Fortunately, different techniques are available to design an efficient implementation.
Fig. 2. Different views of a non-flat structure and their MAXREG values (0.78, 0.89, 0.88, 0.87, 0.90, 0.85, 0.84, 0.86, 0.83, 0.85).
An image pyramid (e.g., [6]) is a standard tool to achieve 'action at a distance' by bringing points closer to each other, that is, by shortening the periods of visual structures. When using image pyramids for this purpose, attention must be paid to two circumstances. When a structure is viewed from different angles and distances, as is typically the case in image databases, its period changes significantly; the resolution pyramid should accommodate these potential variations. At the same time, a fine structure may be lost when resolution is reduced. For these reasons, a structure should be searched for at several consecutive levels of a pyramid, and the parameters of the detector should be properly tuned to ensure both speed and reliability. The multiresolution approach is only a partial solution to the complexity problem. It can be substantially improved by using run filtering (see, for example, [9]). In a run filter, when the window moves to the next position the output is updated incrementally rather than computed anew from scratch. Additive functions, such as MEAN(i, j), are particularly suitable for a run-filtering implementation since they are easy to update; the autocorrelation function and the Fourier spectrum are less suitable for this purpose. Based on a run-filtering implementation of MEAN(i, j), we have designed a regularity filter which is selective to local regularity computed in a sliding window. The MEAN filter, originally created in the framework of the interaction map research, is presented elsewhere [1,3]; the extension of this filter to MAXREG is straightforward. The regularity filter has three basic parameters. By changing the maximum displacement d_max, one can tune the filter to shorter or longer periods. The window size W_reg exceeds d_max, but is less than the expected structure size. The angular resolution ∆α is a trade-off between speed and precision. In our experiments, we used ∆α = 5°, 10°, 15°.
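The incremental update that makes run filtering attractive for additive features such as MEAN(i, j) can be sketched as below: when the window slides one column, only the leaving and entering columns are touched. This is a generic illustration, not the filter of [1,3]; diff_img stands for a precomputed per-pixel difference image for one spacing vector.

```python
import numpy as np

def run_filter_mean(diff_img, w):
    """Sliding-window mean of a per-pixel difference image, updated
    incrementally column by column instead of recomputed from scratch."""
    rows, cols = diff_img.shape
    out = np.full((rows, cols), np.nan)
    half = w // 2                                 # w is assumed odd
    for r in range(half, rows - half):
        s = diff_img[r - half:r + half + 1, 0:w].sum()   # initial window sum
        out[r, half] = s / (w * w)
        for c in range(half + 1, cols - half):
            # slide right: subtract the column leaving, add the column entering
            s -= diff_img[r - half:r + half + 1, c - half - 1].sum()
            s += diff_img[r - half:r + half + 1, c + half].sum()
            out[r, c] = s / (w * w)
    return out
```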
5
Detecting Regular Structures
The stability of MAXREG under weak perspective makes it useful in those pattern detection tasks that require a high degree of invariance. In particular, regularity can serve as an efficient preselection key for the retrieval of structures in image databases.
The regularity filter discussed in section 4 was used to find regular structures in the RADIUS model board imagery [8], samples of which can be seen in figures 3 and 4. Each of these images contains several periodic structures, flat and non-flat, viewed under weak perspective and varying illumination. Two of these structures, the one shown in figure 2 and the large, periodic linear roof structure, are perceptually dominant. These two non-flat structures appear in all images of the dataset. The first one will be referred to as S1, the second one as S2. The goal of the test was to detect the dominant structures S1 and S2 in a collection of 14 images arbitrarily selected from the RADIUS dataset.
Fig. 3. Phases of structure detection.
The regularity filter may also respond to less prominent but still periodic patterns, as well as to local patterns that are not perceived as regular. We were interested in the selectivity and robustness of the filter when applied to the Gaussian and the Laplacian pyramids. The original resolution of the RADIUS imagery was reduced by a factor of 3, to 433 × 341 pixels. Then, three-level Gaussian and Laplacian pyramids were built using the procedure proposed by Burt [6]. The MAXREG filter was
applied to levels 0, 1 and 2 of the pyramids with the parameters d_max = 15, 17, 12, W_reg = 23, 25, 20 and ∆α = 10°, 5°, 5°, respectively. (Level 0 is the base of a pyramid.) The resulting regularity images were enhanced (consolidated) by a median filter of size d_max × d_max. Finally, each image was thresholded at the regularity value of 0.5 and the detection result was overlaid on the original image; MAXREG ≥ 0.5 indicates medium-regular and highly regular patterns. This structure detection procedure is illustrated in figure 3, where a Gaussian pyramid is processed. In each row the resolution decreases from left to right, with the lower levels zoomed to the base size. The first row displays the consolidated results of regularity filtering. The second row shows the locations detected in the Gaussian pyramid. For comparison, the Laplacian detection results are given in the last row. More examples of detection are shown in figure 4, where the upper row is the Gaussian, the lower row the Laplacian pyramid. An immediate observation is that the Laplacian detector responds to more patterns. To quantify the difference, the statistics of the responses were analyzed.
Fig. 4. Further examples of structure detection.
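In outline, the detection pipeline of this section could be coded as follows. regularity_filter stands for a windowed MAXREG filter (Sect. 4) and is passed in rather than implemented here, and the pyramid construction is a generic smooth-and-subsample step rather than Burt's exact procedure; parameter defaults mirror the values quoted above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter, zoom

def detect_regular_structures(img, reg_filter, levels=3,
                              d_max=(15, 17, 12), w_reg=(23, 25, 20),
                              thresh=0.5):
    """Per-level regularity detection masks for a Gaussian-style pyramid.
    reg_filter(image, d_max, w_reg) is assumed to return a MAXREG map."""
    detections = []
    level_img = img.astype(float)
    for lev in range(levels):
        reg = reg_filter(level_img, d_max[lev], w_reg[lev])
        reg = median_filter(reg, size=d_max[lev])      # consolidate
        detections.append(reg >= thresh)               # threshold at 0.5
        # next pyramid level: smooth, then subsample by 2
        level_img = zoom(gaussian_filter(level_img, sigma=1.0), 0.5, order=1)
    return detections
```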
Table 1 summarizes the detection results in terms of structure indications at different levels of the two pyramids. The columns S1 and S2 are the responses to the two dominant structures, S+ to other periodic patterns. The last column, S−, shows the false positives, that is, indicated locations that are not perceived as regular patterns. For example, in the 14 images structure S1 was detected 7 times on level 1 of the Gaussian pyramid and 14 times (i.e., always) on the same level of the Laplacian pyramid. There was no false response in the Gaussian pyramid, and perceptually less important structures were only indicated at the maximum
resolution. The Laplacian pyramid indicated many more minor structures but gave false responses at all levels.
Table 1. Structure detection results.

            Gaussian                Laplacian
Detected    S1   S2   S+   S-       S1   S2   S+   S-
Level 0     13    2   48    0       14    8   95   84
Level 1      7   12    0    0       14   14   27   10
Level 2      0    9    0    0        5   13    7   11
In table 2 the detection results for the two dominant structures are presented in a different way. This table shows how many times a dominant structure was detected within a single pyramid. For instance, in the Gaussian pyramid S1 was detected at a single level in 57% of the cases and at two levels in 43% of the cases. The empty first column indicates that both dominant structures were detected at least once in each of the pyramids. The structures exhibit themselves through more Laplacian levels, at the cost of frequent false alarms.
Table 2. Detectability of the two dominant structures (%).

          Gaussian                       Laplacian
Ndet      0     1     2     3            0     1     2     3
S1        0.0  57.0  43.0   0.0          0.0   0.0  64.0  36.0
S2        0.0  36.0  64.0   0.0          0.0   7.0  36.0  57.0

6
Conclusion
We have introduced a new, highly invariant maximal regularity feature and used it to detect structures in aerial images. Due to its invariance, the regularity feature is applicable to non-flat patterns. It has been implemented as a run filter, which opens the way for further testing and exploration. Currently, application of the proposed method is limited by its computational cost, which is still high despite the run filtering implementation: regularity filtering of a medium-sized image takes several minutes on an advanced PC. Another current drawback is the limited descriptive power for random and weakly regular patterns.
More research and testing are needed to justify the algorithm and to systematically evaluate its performance, especially as far as 3D invariance, generality, scalability and robustness are concerned. The discriminating power of regularity should be improved by considering its directional distribution REG(i). Earlier, we developed a related method [3] for accurate analysis of pattern anisotropy, symmetry and orientation. We hope that combining regularity with other fundamental structural features of visual patterns will result in a powerful tool for structure description and retrieval.
Acknowledgments: This work is partially supported by grant OTKA T026592.
References 1. D. Chetverikov. Structural filtering with texture feature based interaction maps: Fast algorithm and applications. In Proc. International Conf. on Pattern Recognition, pages 795–799. Vol.II, 1996. 462 2. D. Chetverikov. Pattern regularity as a visual key. In Proc. British Machine Vision Conf., pages 23–32, 1998. 459, 460 3. D. Chetverikov. Texture analysis using feature based pairwise interaction maps. Pattern Recognition, Special Issue on Color and Texture, 1999, in press. 462, 466 4. M. Flickner et al. Query by image and video content: the QBIC system. IEEE Computer Magazine, pages 23–30, 1995. 459 5. T. Gevers and A.W.M. Smeulders. Color Based Object Recognition. In A. Del Bimbo, editor, Lecture Notes in Computer Science, volume 1310, pages 319–327. Springer Verlag, 1997. 459 6. B. J¨ ahne. Digital Image Processing. Springer-Verlag, 1997. 462, 463 7. J.L. Mundy and A. Zisserman. Projective geometry in machine vision. In J.L. Mundy and A. Zisserman, editors, Geometric Invariance in Computer Vision, pages 463–534. MIT Press, 1992. 461 8. University of Washington. RADIUS Model Board Imagery Database I,II. Reference Manual, 1996. 463 9. I. Pitas. Digital Image Processing Algorithms. Prentice Hall, 1993. 462 10. S. Ravela and R. Manmatha. Image retrieval by appearance. In 20t h Intl. Conf. on Research and Development in Information Retrieval, 1997. 459 11. C. Schmid and R. Mohr. Local Grayvalue Invariants for Image Retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 19:530–535, 1997. 459 12. S. Sclaroff, L. Taycher, and M. La Cascia. ImageRover: A Content-Based Image Browser for the World Wide Web. In IEEE Workshop on Content-Based Access of Image and Video Libraries, 1997. 459
Color Image Texture Indexing Niels Nes and Marcos Cordeiro d’Ornellas Intelligent Sensory Information Systems University of Amsterdam - Faculty WINS Kruislaan, 403 - 1098 SJ Amsterdam, The Netherlands {niels,ornellas}@wins.uva.nl http://carol.wins.uva.nl/∼{niels,ornellas}
Abstract. The use of image color information beyond color histograms has been limited in image retrieval. One reason is the lack of an accepted core set of basic operations on color images. With the growing interest in image retrieval applied to color images, new operators with interesting properties have recently been developed. Opening distributions on images based on granulometries constitute an extremely useful tool in morphological tasks, and efficient techniques have been proposed for binary and grayscale images using linear openings. The present study extends the granulometry concept to color images. In addition, it addresses the development of a new morphological approach grounded on particle size distributions for color images and their use as additional textural information to build queries over an image database.
1
Introduction
Multimedia information systems are becoming increasingly popular. They integrate text, images, audio and video and provide desirable applications for users. One example is the image database system. Managing images for efficient retrieval and updating is a growing and challenging need. Recently the interest in color images has grown due to the abundance of such images on the WWW. This new interest has resulted in many new views on the subject. Although color is heavily used as an important feature in image retrieval systems, its use has mostly been limited to color histograms [13] [6] [12]. Other features such as texture and shape are usually computed based on the intensity of the image or on single color channels only. In [7] a technique for color image retrieval is described based on the Hue component only. In [16] wavelet-based methods for texture are described using the separated color channels. In [6] the intensity of a color image is used to compute texture, shape and moment features. The reason for this limited use of the color content is the lack of theory about basic operators applied to color images. Furthermore, the divide & conquer approach does not exploit the correlation between color channels. Multichannel techniques that take this correlation into account have been reported to be more effective in [14] and [4].
Morphological methods like granulometries have been used to characterize size distributions and shapes in binary and grayscale images [10]. In this paper, we extend the notion of grayscale granulometries to color images based on the color opening and closing proposed in [4]. Moreover, we define a color pattern spectrum, i.e. the distribution of object sizes, from which color image retrieval with texture can be computed. The organization of this paper is as follows. Section 2 summarizes the fundamentals of granulometries and size distributions. Section 3 describes the concept of color morphology, which is based on vector ranking concepts. Section 4 discusses color indexing and extends the notion of granulometries to color images with the use of the color pattern spectrum. In section 5 we demonstrate the results obtained using Monet [1] and show the practical use of content-based image indexing running on a database of 6800 images. We conclude with section 6, summarizing the results and further research.
2
Granulometries and Size Distribution
Granulometries are based on the fact that a constant number of particles and constant amount of area or volume, at each scale level, are used to obtain particle size distributions. This idea can be developed further to obtain image signatures [2]. The following definitions are based on [10] and [5]. Definition 1 (Granulometry). A granulometry can be interpreted as a collection of image operators {Ψt }, t > 0, such that Ψt is anti-extensive for all t, Ψt is increasing for all t, and Ψt Ψs = Ψs Ψt = Ψmax{t,s} . It was observed by [15] that the most important example of granulometry is a finite union of decreasing openings (φt ), each by a parameterized convex structuring element B: Ψt (A) = (A◦tB1 ) ∪ (A◦tB2 ) ∪ . . . ∪ (A◦tBn )
(1)
Similarly, anti-granulometries, or granulometries by closings, can be defined as a finite union of increasing closings. Definition 2 (Granulometric Size Distribution or Pattern Spectrum). The granulometric size distribution or pattern spectrum of an image A, with respect to a granulometry {Ψt (A)}, t > 0 is a mapping P SΨt (A) given by: P SΨt (A) = Ω(φt (A)) − Ω(φt−1 (A))
(2)
which is a discrete density. The density is called a granulometric size distribution or pattern spectrum.
2.1
Linear Grayscale Granulometries
Let us denote by N_L(p) and N_R(p) respectively the left and the right neighbors of a pixel p. The effect of an opening by a linear segment L_n, n ≥ 0, on a grayscale image I is described via the following definitions.
Definition 3 (Line Segment). A line segment S, of length l(S), can be interpreted as a set of pixels {p_0, p_1, ..., p_{n−1}} such that for 0 < i < n, p_i = N_R(p_{i−1}).

Definition 4 (Line Maximum). A line maximum M of length l(M) = n in a grayscale image I is a line segment {p_0, p_1, ..., p_{n−1}} such that:

∀i, 0 < i < n, I(p_i) = I(p_0)    (3)

I(N_L(p_0)) < I(p_0),  I(N_R(p_{n−1})) < I(p_0).    (4)
The effect of a line opening of size n on M is that a new plateau of pixels is created at altitude max{I(N_L(p_0)), I(N_R(p_{n−1}))}. This plateau P contains M, and may itself be a maximum of I ◦ L_n.
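For intuition, a linear grayscale granulometry by horizontal line openings and its pattern spectrum can be sketched as follows, using a generic grayscale opening rather than the efficient line-maxima algorithm described above; the spectrum is taken here as the non-negative differences of successive opening volumes.

```python
import numpy as np
from scipy.ndimage import grey_opening

def line_pattern_spectrum(img, max_len):
    """Pattern spectrum from horizontal line openings of increasing length.
    PS[t] is the image 'volume' removed when the opening length grows by one
    (cf. Definition 2)."""
    img = img.astype(float)
    volumes = [img.sum()]
    for n in range(2, max_len + 1):
        opened = grey_opening(img, size=(1, n))   # opening by a 1 x n line
        volumes.append(opened.sum())
    volumes = np.array(volumes)
    return volumes[:-1] - volumes[1:]             # non-negative differences
```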
3
Multivalued Morphology
One of the basic ideas in mathematical morphology is that the set of all images constitutes a lattice. The concept of extrema, i.e. infimum and supremum, stems from this partial ordering relation. If the extrema exist for any collection of images, then that lattice is called a complete lattice. Any morphological operator we apply to color images can be applied to each component separately. This kind of marginal processing is equivalent to the vectorial approach defined by the canonic lattice structure when only extrema operators and their compositions are involved, inducing a totally ordered lattice. However, this marginal procedure fails because every color can be seen as a vector, and the extremum of two such vectors may be a mixture of both colors. Besides, image components are highly correlated. In [8], an approach grounded on vector transformations, followed by marginal ordering, was introduced. An image is coded into another representation by means of a surjective mapping called an h-adjunction. A major drawback in practice is that the extrema of each set of vectors are not necessarily unique. Recently, [3], [14], and [4] succeeded in dealing with this question by ranking vectors, i.e. each vector pixel is represented by a single scalar value. When a bijective mapping is used, it induces a total ordering and clearly determines the extrema of each set of vectors. In this way, it is possible to perform any classical morphological filter on the coded image and decode the result afterwards.
3.1
Ordering Color as Vectors
To extend the vector approach to color images, it is necessary to define an order relation which orders colors as vectors. This imposes a total ordering relationship, achieved by the lexicographical ordering.¹
¹ An ordered pair (i, j) is lexicographically earlier than (i′, j′) if either i < i′, or i = i′ and j ≤ j′. It is lexicographic because it corresponds to the dictionary ordering of two-letter words.
The structuring element for the vector morphological operations defined here is the set g, and the scalar-valued function
used for the reduced ordering is h : R³ → R. The operation of vector dilation is represented by the symbol ⊕_v. The value of the vector dilation of f by g at the point (x, y) is defined as:

(f ⊕_v g)(x, y) ∈ { f(r, s) : (r, s) ∈ g_{(x,y)} }    (5)

h((f ⊕_v g)(x, y)) ≥ h(f(r, s))  ∀(r, s) ∈ g_{(x,y)}    (6)

Similarly, vector erosion is represented by the symbol ⊖_v, and the value of the vector erosion of f by g at the point (x, y) is defined as:

(f ⊖_v g)(x, y) ∈ { f(r, s) : (r, s) ∈ g_{(x,y)} }    (7)

h((f ⊖_v g)(x, y)) ≤ h(f(r, s))  ∀(r, s) ∈ g_{(x,y)}    (8)
Vector opening is defined as the sequence of vector dilation after vector erosion, and vector closing is defined as the sequence of vector erosion after vector dilation. Since the output of the vector filter depends on the scalar-valued function used for reduced ordering, the selection of this function provides flexibility in incorporating spectral information into the multi-valued image representation. When the bit-mix approach [3] is used, the transform h is based on the binary representation of each component of T. Let T ∈ R^M with M components t(i), each one represented on p bits t(i)_j ∈ {0, 1} with j ∈ {1, ..., p}. The considered mapping h can then be written as follows:

h(t) = \sum_{j=1}^{p} 2^{M(p-j)} \sum_{i=1}^{M} 2^{M-i}\, t(i)_j, \qquad t \preceq t' \;\Leftrightarrow\; h(t) \leq h(t')    (9)
All scalar-valued functions lead to a family of images, parameterized by shape, size, and color, which could be useful for image retrieval.
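A possible rendering of the bit-mix ordering of eq. (9) and of a vector dilation by reduced ordering is sketched below. The channel priority (R before G before B) and the square window are assumptions of the sketch, and vector erosion is obtained analogously by taking the minimum rank instead of the maximum.

```python
import numpy as np

def bit_mix_rank(img_rgb, bits=8):
    """Bit-mix scalar rank h(t) of eq. (9): interleave the bits of the color
    components, most significant bit plane first, so the ordering is total."""
    r = img_rgb[..., 0].astype(np.uint32)
    g = img_rgb[..., 1].astype(np.uint32)
    b = img_rgb[..., 2].astype(np.uint32)
    h = np.zeros(img_rgb.shape[:2], dtype=np.uint32)
    for j in range(bits - 1, -1, -1):            # from most to least significant bit
        for channel in (r, g, b):                # channel priority defines the order
            h = (h << 1) | ((channel >> j) & 1)
    return h

def vector_dilation(img_rgb, win=3):
    """Vector dilation: at each pixel, pick the color whose bit-mix rank is
    maximal inside a win x win structuring element (borders left unchanged)."""
    h = bit_mix_rank(img_rgb)
    rows, cols = h.shape
    out = img_rgb.copy()
    pad = win // 2
    for y in range(pad, rows - pad):
        for x in range(pad, cols - pad):
            block = h[y - pad:y + pad + 1, x - pad:x + pad + 1]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            out[y, x] = img_rgb[y - pad + dy, x - pad + dx]
    return out
```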
4
Color Indexing
The extension of grayscale granulometries to color images is firmly grounded in multi-valued morphology. In this way, we can derive a color object size distribution based on color openings. Using these distributions as image descriptors makes it possible to search for images containing similarly sized objects. Since granulometries based on linear openings and closings are not rotation invariant, we apply the same technique in the horizontal, vertical and diagonal directions. The results are merged into one pattern spectrum using the maximum of the three, i.e.

PS_{\Psi_t}(A) = \max\big( PS_{\Psi_t}(h),\, PS_{\Psi_t}(v),\, PS_{\Psi_t}(d) \big)
(10)
One step further is the search for images with similar texture. We could derive a scale-invariant description from the pattern spectrum that describes the texture of the image. We derive this scale-invariant description H, where each H_i is defined by the following equation:

H_i = \sum_{j=i} PS_j \,/\, PS_{i \cdot j}
(11)
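The directional spectra and their merging in eq. (10) could be computed along the following lines, applied to the scalar bit-mix rank image of Section 3. The structuring elements and the grey_opening calls are our choices for illustration, not the system's implementation.

```python
import numpy as np
from scipy.ndimage import grey_opening

def directional_spectrum(rank_img, direction, max_len):
    """Pattern spectrum from line openings of growing length in one direction."""
    vols = [rank_img.astype(float).sum()]
    for n in range(2, max_len + 1):
        if direction == 'h':
            fp = np.ones((1, n), dtype=bool)
        elif direction == 'v':
            fp = np.ones((n, 1), dtype=bool)
        else:                                     # diagonal
            fp = np.eye(n, dtype=bool)
        vols.append(grey_opening(rank_img, footprint=fp).astype(float).sum())
    vols = np.array(vols)
    return vols[:-1] - vols[1:]

def merged_pattern_spectrum(rank_img, max_len):
    """Rotation-robust spectrum: maximum of the three directions (eq. 10)."""
    return np.maximum.reduce([directional_spectrum(rank_img, d, max_len)
                              for d in ('h', 'v', 'd')])
```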
4.1
Color Pattern Spectrum
Texture and color are two important visual cues that convey a large amount of information about surfaces in a scene. Although they share a common role, they have been studied separately in computer vision due to the difficulty that both properties present. Texture is the visual cue due to the repetition of image patterns. It is used in several tasks such as classification of materials, scene segmentation and extraction of surface shape from texture variations. Much work in computer vision has focused on the texture perception problem; psychophysical experiments and neurobiological evidence have provided the basis for the definition of computational models of texture perception [9]. The color visual cue is the result of observing a specific illuminant on a given surface using three different types of sensors. In computer vision, color has been used in region segmentation, image classification, image database retrieval, surface chromatic constancy analysis, etc. The representation of color has been studied with emphasis on constructing perceptual spaces to which computer vision methods can be applied. Several studies have recently been directed at the problem of co-joint representations for texture and color; some difficulties arise from the fact that a three-dimensional color representation is not the best way to represent texture. Grouping the texture and color representations reduces the amount of raw data presented by the image while preserving the information needed for the task at hand. This information reduction has to yield a representation that allows the proposed task to be dealt with computationally. Searching images based on the pattern spectrum requires a comparison method. Image retrieval systems use similarity measures to describe the similarity between two images. The proposed similarity measure is modeled after color histogram intersection and is robust to occlusion in the image. The same robustness is required for color image texture. We define the similarity between two pattern spectra as follows:

S(a, b) = \frac{\sum_{i=0}^{n} \min(a_i, b_i)}{\sum_{i=0}^{n} a_i}    (12)
Many image retrieval operations also require searching on color content. On that account, we use color histograms to describe the image. Furthermore, we integrate the similarity measures obtained from color and texture using a linear combination with adjustable weights, so that the user can easily control the importance of either feature.
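The similarity of eq. (12) and the weighted combination with a color histogram score can be written compactly; the weight w_texture and the function names below are ours.

```python
import numpy as np

def spectrum_similarity(a, b):
    """Pattern-spectrum intersection of eq. (12), modeled after
    color histogram intersection."""
    return np.minimum(a, b).sum() / a.sum()

def combined_score(ps_query, ps_img, hist_query, hist_img, w_texture=0.5):
    """Linear combination of texture (pattern spectrum) and color
    (histogram intersection) similarities with an adjustable weight."""
    s_tex = spectrum_similarity(ps_query, ps_img)
    s_col = np.minimum(hist_query, hist_img).sum() / hist_query.sum()
    return w_texture * s_tex + (1.0 - w_texture) * s_col
```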
5
Experimental Results
The experiments were performed on a database taken from a CD-ROM of 6800 photographs. We calculate the pattern spectra for all twelve combinations of openings and closings in the horizontal, vertical and diagonal directions for the color models RGB and HSI. We used the Monet [1] database system as our
experimentation platform. This database system was extended with an image data type and primitives [11]. Figure 1 shows the results of a query based on histogram intersection as described in [13]. Figure 2 shows the results of the same query-by-example based on color pattern spectra. In both cases the top left image was the one selected, using the HSI model.²
Fig. 1. Histogram intersection results.
² Due to the costs of color printing and the inherent distortions associated with the size reduction and the printing process, the corresponding color plates will be made available through http://carol.wins.uva.nl/∼ornellas/images/visual99.
Fig. 2. Color pattern spectra results.
6
Conclusions and Further Research
Color images should be treated as first-class citizens, not as a special case of grayscale images. The information in the color triplet should not be broken into its separate channels: splitting would waste valuable information. Using operators that preserve this information leads to better feature vectors for image retrieval. We proposed the color pattern spectrum. It turns out to be an interesting retrieval feature, which can be computed efficiently. The experiments show that this texture feature does indeed improve the results of an image retrieval system. As future work, we would like to point out that more features could be defined using these color operators. We would also like to investigate whether the color pattern spectrum could be used to search for partial images in the database.
References 1. P. A. Boncz and M. L. Kersten. Monet: An impressionist sketch of an advanced database system. In Proc. IEEE BIWIT workshop, San Sebastian (Spain), july 1995. 468, 471 2. E. J. Breen and R. Jones. Attribute openings, thinnings, and granulometries. Computer Vision and Image Understanding, 64(3):377–389, 1995. 468 3. J. Chanussot and P. Lambert. Total ordering based on space filling curves for multi-valued morphology. In Proceedings of the International Symposium on Mathematical Morphology (ISMM’98), pages 51–58. Kluwer Academic Publishers, Amsterdam, 1998. 469, 470 4. M. C. d’Ornellas, R. v.d. Boomgaard, and J. Geusebroek. Morphological algorithms for color images based on a generic-programming approach. In Proceedings of the Brazilian Conference on Computer Graphics and Image Processing (SIBGRAPI’98), pages 323–330, Rio de Janeiro, 1998. IEEE Press. 467, 468, 469 5. E. R. Dougherty. Euclidean grayscale granulometries: Representation and umbra inducement. Journal of Mathematical Imaging and Vision, 1(1):7–21, 1992. 468 6. C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. Intelligent Information Systems, 3:231–262, 1994. 467 7. T. Gevers and A. W. M. Smeulders. Evaluating color and shape invariant image indexing for consumer photography. In Proceedings of the First International Conference on Visual Information Systems, pages 293–302, Berlin, 1996. Springer Verlag. 467 8. J. Goutsias, H. J. A. M. Heijmans, and K. Sivakumar. Morphological operators for of image sequences. Computer Vision and Image Understanding, 62:326–346, 1995. 469 9. F. Korn, C. Faloutsos, N. Sidiropoulos, E. Siegel, and Z. Protopapas. Fast nearest neighbor search in medical image databases. In Proceedings of the 22nd VLDB Conference - Bombay, India, pages 224–234, New York, 1996. IEEE Press. 471 10. P. Maragos. Pattern spectrum and multiscale shape representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:701–716, 1989. 468 11. N. Nes, C. van den Berg, and M. Kersten. Database support for image retrieval using spatial-color features. In A. W. M. Smeulders and R. Jain, editors, Image Databases and Multi-media Search, pages 293–300. World Scientific, London, 1997. 472 12. J. R. Smith and S. Chang. Tools and Techniques for Color Image Retrieval. In SPIE Storage and Retrieval for Image and Video Databases IV, No 2670, 1996. 467 13. R. Swain and J. Ballard. Color indexing. International Journal of Computer Vision, 7:513–528, 1991. 467, 472 14. H. Talbot, C. Evans, and R. Jones. Complete ordering and multivariate mathematical morphology: Algorithms and applications. In Proceedings of the International Symposium on Mathematical Morphology (ISMM’98), pages 27–34. Kluwer Academic Publishers, Amsterdam, 1998. 467, 469
15. L. Vincent and E. R. Dougherty. Morphological segmentation for textures and particles. In E. R. Dougherty, editor, Digital Image Processing Methods, pages 43–102. Marcel Dekker, New York, 1994. 468 16. J. Z. Wang, G. Wiederhold, O. Firschein, and S. X. Wei. Wavelet-based image indexing techniques with partial sketch retrieval capability. In Proceedings of the Fourth Forum on Research and Technology Advances in Digital Libraries, pages 323–330, New York, 1997. IEEE Press. 467
Improving Image Classification Using Extended Run Length Features Syed M Rahman, Gour C. Karmaker, and Robert J Bignall Gippsland School of Computing and Information Technology Monash University, Churchill, VIC, Australia 3842 {Syed.Rahman,Bob.Bignall}@infotech.monash.edu.au
Abstract. In this paper we evaluate the performance of self-organising maps (SOM) for image classification using invariant features based on run length alone and also on run length plus run length totals for horizontal runs. Objects were manually separated from an experimental set of natural images. Object classification performance was evaluated by comparing the SOM classifications independently with a manual classification for both of the feature extraction methods. The experimental results showed that image classification using the run length method that included run length totals achieved a recognition rate that was, on average, 4.65 percentage points higher than the recognition rate achieved with the normal run length method. Thus the extended method is promising for practical applications.
1 Introduction
Image classification is a challenging area and is essential in most fields of science and engineering [1]. Image classification is performed on the basis of significant features extracted from the images. These features can be based on different image attributes including colour, texture, sketch, shape, spatial constraints, text, and objective and subjective attributes. One of the most important and challenging tasks of image classification is feature selection. In practice the precision of classification almost entirely depends on the types of features used. Run length may be used to encode the features of an object. Rahman and Haque investigated image ranking using features based on horizontal and vertical run lengths [2]. Run length based features have been used to approximate the shape of images [3] and also in image classification [4]. However, in these previous approaches the run length features were computed from the entire image and they were not independent of translation, rotation or the scale of the objects. In this paper we have further extended the invariant run length features technique by including the total of the run lengths for each horizontal run. The total of the run lengths in a horizontal run equals the total length of all the line segments formed by the intersection of the horizontal line with the image. The inclusion of these aggregated horizontal distances with the run lengths encapsulates shape information into the features along with texture information. The extended features method was evaluated and its performance compared with that of the normal run length method. Classification
was performed using self-organising maps. The objects used in the image database were manually separated from their scenes. The organisation of the paper is as follows. The computation of invariant features is described in section 2. Section 3 deals with self-organising maps (SOM) and the experimental set-up is detailed in section 4. Results and conclusions are presented in section 5.
2 Computation of Invariant Features
A gray level run is defined as a group of successive pixels whose gray level intensities are the same. The run length is the number of pixels contained in a run. A run length histogram is a data structure that contains the frequencies of all run lengths and therefore depicts the gray-level run distribution of an object. It is used to represent the features of an object for two reasons. Firstly, the gray-level distribution varies with the geometric structure of objects with similar texture, so it approximates the shape of the object. Secondly, the gray-level distribution of an object varies with the texture of that object, i.e. its coarseness and contrast [5]. Object recognition may be improved if the feature set contains shape-based features as well as texture-based features. Such run length features are called composite features, as they approximate both the shape and the texture of an object. Gray level intensities with minor differences are regarded as similar because humans cannot discern the illumination difference in such cases. Thus a threshold can be used to reduce the number of discrete gray levels in an object. T is used to represent such a threshold; it denotes the maximum difference between two successive intensities for them to be considered similar during a run length calculation. From experimentation, the value of the threshold T was selected to be 10. The objects are normalised for rotation before the run length feature calculation. Our notation and the algorithm used for computing a run length histogram are described below. The jth run length in the ith row is given as follows. Let T be the threshold, i.e. the maximum difference between the gray level intensities of two adjacent pixels for them to be considered the same. Denote by Rl(i,j) the value of the jth run length in the ith row, with j ∈ (1..Maxlh), i ∈ (1..Maxlv) and y ∈ (1..Maxlh), where Maxlh is the maximum horizontal length of the object when the axis of minimised moment of inertia of the object is parallel with the X axis, and Maxlv is the maximum vertical length of the object under the same orientation. The first run length in row i is then R(i,1) = #{(x,y) | x = i, P(x,y) ∈ Object, y1 <= y <= y2, and for all y', y1 <= y' < y2 implies |P(x,y') − P(x,y'+1)| <= T}, where y < y1 implies P(i,y) is not in the Object, and either P(i,y2) is not in the Object or |P(i,y2−1) − P(i,y2)| > T. For j > 1, the run length is given by equation (1).
Rl(i,j) = #{(x,y) | x = i, P(x,y) ∈ Object, y^j <= y <= y^{j+1}, and for all y', y^j <= y' < y^{j+1} implies |P(x,y') − P(x,y'+1)| <= T}    (1)
where y < y implies that either P(i,y) is not a point in the Object or that |P(i,y -1)j j+1 j+1 j+1 P(i,y )| > T; and that either P(i, y ) is not in the object or |P(i,y )-P(i,y +1)| > T. The normalised Run length is given by the equation Nl(i,j)= ceil((Rl (i,j)/ Maxlh)∗Nf)
(2)
where Nf is the number of features to be extracted from the object. The run length histogram is given by the following equation. H(s)= #{ Nl(i,j) : N| ∀i ,j. Nl(i,j)=s }
(3)
where s∈(1..Nf) The normalised run length histogram is given by the equation H(s)= ceil((H(s)*Sh)/ Maxlv)
(4)
where the constant Sh is assumed to be the standard height of an object. This normalised run length histogram contains Nf features, which are independent of the position, rotation, and size of the object. The techniques utilised in the above algorithm to obtain the invariant features, namely the normalisation of translation, rotation and scale are standard and are described in section 2.2. 2.1 Extension of Run Length Features The run length histogram incorporates an intensity distribution implying texture and to some extent shape information about the object. However, for regions within the object where the frequency of intensity change is high, the shape information is mostly lost. It may be observed that low frequency run lengths tend to be absent unless the change in intensity in a horizontal row is low. In view of this we propose to extend the information from each horizontal row by including with the actual intensity distribution a value which is derived from the set of runs within a horizontal row. It actually provides the total of the distances between adjacent pairs of boundary points in a horizontal row. It may be obtained by the following formula. (5) Rl (i ) = å Rl (i , j ) j The extended run length Rf is then an increased data set incorporating the shape information to a greater extent and is given by the equation Rf = {Rl(i, j), Rl(i) | i∈(1..Maxlv) and j∈(1..Maxhl)} The normalisation technique is similar to the normal run length data.
(6)
2.2
Invariant Attributes
In recent years a lot of research has been done into object recognition based on shape invariants [6] but the problem of invariant detection requires significant further study. Three invariant attributes used are translation, rotation and scale as discussed below. Normalisation is needed to obtain geometrically invariant features in order to adequately compare objects having different position, orientation and size. Translation: The features obtained by the run length method are invariant under translation since the objects are first separated manually from their scenes and then the features are computed from the objects. Rotation: Since the run length features, i.e. histograms, are calculated from the horizontal run lengths, they are sensitive to the orientation of the objects. The features must be independent of rotation. Rotational invariance is accomplished by repositioning the objects along a reference axis so that each object possesses a unique orientation in space. We used the axis of minimised moment of inertia as a reference axis, which is a straight line about which the moment of inertia is minimum and which passes through the unique centre of mass of the object. The orientation of an object can be determined by calculating the angle between the axis of minimised moment of inertia and the X axis. The technique used to determine the reference axis is as follows. The general moment equation for a 2D object can be defined as [7]
m_{st} = \sum_{x} \sum_{y} x^s y^t\, p(x, y)    (7)
where s, t = 0, 1, 2, ... and (s + t) is the order of the moment. From equation (7) the zeroth- and first-order moments may be obtained, as given by the following equations:

m_{00} = \sum_{x} \sum_{y} p(x, y), \quad m_{10} = \sum_{x} \sum_{y} x\, p(x, y), \quad m_{01} = \sum_{x} \sum_{y} y\, p(x, y)
The Cartesian coordinates of the centroid (x̄, ȳ) of the object are obtained from these moments as shown in equation (8):

x̄ = m_{10} / m_{00} \quad \text{and} \quad ȳ = m_{01} / m_{00}    (8)
The central moments (c) which are invariant under translation are defined as follows in terms of ordinary moments.
c_{st} = \sum_{x} \sum_{y} (x − x̄)^s (y − ȳ)^t\, p(x, y)
(9)
The angle between the axis of minimised moment of inertia and the X axis, in terms of the second-order central moments [Hu, 62], is given by equation (10):

β = \frac{1}{2} \tan^{-1}\!\left( \frac{2 c_{11}}{c_{20} − c_{02}} \right)    (10)
Here β represents the orientation of the object with respect to the X axis and is such that −π/4 < β < π/4. The range of values of β can be increased by solving the above equation. The equation for β can then be expressed as in equation (11):
β = \tan^{-1}\!\left( \frac{ −(c_{20} − c_{02}) − \sqrt{(c_{20} − c_{02})^2 + 4 c_{11}^2} }{ 2 c_{11} } \right), \qquad −\frac{π}{2} < β < \frac{π}{2}    (11)
The values of β are determined by rotating the object through various angles and it can be shown that this actually gives the orientation of the object. The object is rotated by -β so that its axis of minimised moment of inertia becomes parallel to the X axis. Due to rotation small background holes may appear in the rotated object, which change its gray level intensity distribution and also degrade the visual perceptual quality of the object. This also changes the run length features of the object. The missing pixels can be approximated by using a dilation operator, which would fill in the missing pixels with the value of the foreground pixels of the object [8]. Dilation solves the problem of background holes but it changes the pixel distribution of the object since it alters the gray level pixel value. This operator also increases the size and modifies the shape of the object [9]. Scaling: The features obtained from the run length method are affected by the size of the object. For this reason scale normalisation is required to make the run length features size invariant. Horizontal run length normalisation is achieved by dividing each horizontal run length by the maximum horizontal length of the object. Similarly the frequency of each horizontal run length of a particular object depends on the value of the maximum vertical length of the object. Run length histogram normalisation is achieved by dividing the frequency of each run by the maximum vertical length of the object.
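The rotation normalisation of equations (7)–(11), followed by the dilation used to fill rotation holes, might be sketched as follows. Note that arctan2 is used instead of a plain arctangent to avoid division by zero, and the blanket 3×3 dilation is a crude stand-in for filling only the missing background pixels; the function name is ours.

```python
import numpy as np
from scipy.ndimage import rotate, grey_dilation

def normalise_orientation(obj):
    """Rotate an object image so that its axis of minimised moment of inertia
    is parallel to the X axis, then dilate to fill rotation holes."""
    ys, xs = np.nonzero(obj)
    p = obj[ys, xs].astype(float)
    m00, m10, m01 = p.sum(), (xs * p).sum(), (ys * p).sum()
    xc, yc = m10 / m00, m01 / m00                  # centroid, eq. (8)
    c11 = ((xs - xc) * (ys - yc) * p).sum()        # central moments, eq. (9)
    c20 = ((xs - xc) ** 2 * p).sum()
    c02 = ((ys - yc) ** 2 * p).sum()
    num = -(c20 - c02) - np.hypot(c20 - c02, 2 * c11)
    beta = np.arctan2(num, 2 * c11)                # eq. (11), arctan2 variant
    rotated = rotate(obj, np.degrees(-beta), reshape=True, order=0)
    return grey_dilation(rotated, size=(3, 3))     # fill background holes
```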
3
Self-Organising Maps
Neural networks are used extensively in the field of image classification as they can easily deal with various types of classification problems [10]. Neural networks rely on knowledge that is usually obtained through a training process. Two types of training
exist, namely supervised and unsupervised training. Correct target information is provided during supervised training but in unsupervised training there is no need to provide the network with target classifications. The training adjusts the weight vectors in such a way that similar objects fall into the same class. Self-organising maps are unsupervised neural networks first introduced by Teuvo Kohonen and have only input and output layers. A self-organising map is also called a topology-preserving map, since it preserves the topological structure amongst the categories it identifies [11]. Thus this network gives a topological ordering of its classes that reflects likenesses amongst the input images in the output classifications. We selected this network to classify the images as it is able to divide the set of images into a specified number of categories.
4
Experimental Setup
The experimental system consists of four stages, namely Object Segmentation, Features Computation, Network Training / Object Classification and Performance Evaluation. An overview of the experimental system is shown in figure 1. The steps are detailed below. •
Object Segmentation: In this step the objects in the scene were separated from their backgrounds. The image database contained 202 real images of different shapes, sizes and orientations.
Fig. 1. Overview of the Experimental System •
• Feature Computation: We calculated 125 normal run length and 125 extended run length invariant features, which were independent of position, scale and orientation. The method used for feature computation has been detailed in section 2.
• Training of Networks and Object Classification: Neural networks assign an object to a class by making use of the knowledge that is gathered during training. Training is therefore a crucial task for neural networks. The network used was trained by independently feeding the invariant features to the input neurons as an input vector. The number of neurons in the input layer was equal to the number of features obtained from an object. The number of neurons in the output layer, i.e. the cluster unit, was taken to be 9, which was equal to the number of classes identified manually. The parameters used in the network are shown in Table 1. The distance function and feature selection were chosen as vanilla (Euclidean) and rotation respectively. A missing value was regarded as an error condition.
Table 1. Parameter values of the Neural Network for both features

Input no.   Output no.   No. of epochs   Neighbourhood size   Initial weight   Learning rate
125         9            10,000          8                    0.5              0.5
The steps used to train the neural network are detailed in [11].
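As a rough illustration of the training loop summarised in Table 1, the following sketch trains a one-dimensional Kohonen map on the 125-element feature vectors. It is only a schematic reconstruction under the stated parameters (125 inputs, 9 output units, learning rate 0.5, neighbourhood size 8, 10,000 epochs); details such as the linear decay schedule and the small jitter on the initial weights are assumptions, not taken from the paper.

```python
import numpy as np

def train_som(features, n_units=9, epochs=10_000, learning_rate=0.5,
              neighbourhood=8, seed=0):
    """Minimal 1-D Kohonen map; features is an (n_objects, 125) array."""
    rng = np.random.default_rng(seed)
    n_features = features.shape[1]
    # Table 1 gives an initial weight of 0.5; the small jitter is our assumption.
    weights = 0.5 + 0.01 * rng.standard_normal((n_units, n_features))
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = learning_rate * frac                        # assumed linear decay
        radius = max(1, int(round(neighbourhood * frac)))
        for x in features:
            winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            lo, hi = max(0, winner - radius), min(n_units, winner + radius + 1)
            weights[lo:hi] += lr * (x - weights[lo:hi])  # pull the neighbourhood towards x
    return weights

def classify(weights, x):
    """Assign an object to the output unit with the nearest weight vector."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))
```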
5
Results and Conclusions
The objects relevant to the query object were classified using a self-organising map by utilising the normal run length and the extended run length methods. The nine objects used as queries in our experiments for both extraction methods are shown in figure 2.
Fig. 2. The nine query objects
The performance of a classification was evaluated by comparing the results obtained from the network with the manual classification. An image was allocated to a manual class on the basis of the meaning of the objects contained in that class. The classification of images in each of the manual groups (Mn, n = 1 to 9) into different calculated groups (Cn, n = 1 to 9) by the SOM network based on the normal run length and extended run length features is shown in Table 2. In this table the values down the main diagonal show the number of correctly classified objects, while off-diagonal numbers whose network class number is less than or equal to the manual group number indicate misclassified objects. The recognition rate was calculated using the following equation:

$$\text{recognition rate} = \frac{\text{TotalObjectRecognisedByNet}}{\text{TotalObjectInTheManualGroup}} \times 100$$
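For example, if a manual group contains 26 objects and the network assigns 23 of them to the correct calculated group, the recognition rate is 23/26 × 100 ≈ 88.5%, which matches the highest extended run length rate reported in Table 3 (the group size of 26 is inferred from the reported rates and is not stated explicitly in the text).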
Table 2. Classification of images with the Neural Network: confusion matrices of the manual groups M1–M9 against the calculated groups C1–C9, for the normal run length features and for the extended run length features.
The average recognition rate using the normal run length method, as shown in Table 3, was calculated to be 58%, compared to 62.65% for the method using extended run length features. The performance of the neural network for some classes reached 100%. Because the manual classification included semantic information, there were some objects in the image classes with significant differences in shape and other features, which resulted in some misclassifications. It may be inferred from the results that run length features are susceptible to such differences. In addition, a self-organising map is highly sensitive to the setting of its parameter values.
Table 3. Calculation of average performance (recognition rate per group, %)

                  G1      G2      G3      G4      G5      G6      G7      G8      G9      Average
Run length        73      36      100     100     33      3.8     86      41      50      58
Ext. Run Length   88.46   90.91   56.67   50      57.14   35.29   33.33   52      100     62.65
References
1. Gudivada, Venkat N. and Raghavan, Vijay V. "Content based image retrieval systems", IEEE Computer, September 1995, pp. 18-22.
2. Rahman, S. M. and Haque, N. "Image ranking using shifted difference", in Proceedings of the ISCA 12th International Conference on Computers and Their Applications, Tempe, Arizona, USA, March 13-15, 1997, pp. 110-113.
3. Rahman, S. M. et al. "Self-Organizing Map for shape based image classification", in Proceedings of the ISCA 13th International Conference on Computers and Their Applications, Honolulu, Hawaii, USA, March 25-27, 1998, pp. 291-294.
4. Gour, C. K., Rahman, S. M. and Bignall, B. "Object ranking using run length invariant features", in Proceedings of the International Symposium on Audio, Video, Image Processing and Intelligent Applications, Baden-Baden, Germany, 17-21 August 1998, pp. 52-56.
5. Tamura, Hideyuki, Mori, Shunji and Yamawaki, Takashi. "Textural features corresponding to visual perception", IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-8, No. 6, June 1978.
6. Mundy, J. L. and Zisserman, A. (Eds.) "Geometric Invariance in Computer Vision", MIT Press, 1992.
7. Hu, Ming-Kuei. "Visual pattern recognition by moment invariants", IRE Transactions on Information Theory, Vol. IT-8, Feb 1962.
8. Gose, Earl et al. "Pattern Recognition and Image Analysis", Prentice Hall PTR, Upper Saddle River, NJ, p. 365, 1996.
9. Awcock, G. J. and Thomas, R. "Applied Image Processing", McGraw-Hill, Inc., p. 171, 1996.
10. Hoekstra, Aarnoud and Duin, Robert P.W. "On the nonlinearity of pattern classifiers", International Conference on Pattern Recognition, Vol. 4, pp. 271-275, Vienna, Austria, August 25-29, 1996.
11. Fausett, Laurene. "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", Prentice-Hall, Inc., 1994.
Feature Extraction Using Fractal Codes Ben A.M. Schouten and Paul M. de Zeeuw Centre for Mathematics and Computer Sciences (CWI) P.O. Box 94079, 1090 GB Amsterdam, The Netherlands Phone: 00 31 20 5929333, Fax: 00 31 20 5924199 {[email protected],Paul.de.Zeeuw}@cwi.nl
Abstract. Fast and successful searching for an object in a multimedia database is a highly desirable functionality. Several approaches to content based retrieval for multimedia databases can be found in the literature [9,10,12,14,17]. The approach we consider is feature extraction. A feature can be seen as a way to present simple information like the texture, color and spatial information of an image, or the pitch and frequency of a sound, etc. In this paper we present a method for feature extraction on texture and spatial similarity, using fractal coding techniques. Our method is based upon the observation that the coefficients describing the fractal code of an image contain very useful information about the structural content of the image. We apply simple statistics on information produced by fractal image coding. The statistics reveal features and require a small amount of storage. Several invariances are a consequence of the methods used: size, global contrast and orientation.
1
Introduction
Automatic indexing and retrieval of images based on content is a challenging research area. What is called content is usually subjective and often depends on the context, domain, etc. This is the reason why content based access is still largely unsolved. High-level content based retrieval requires the use of domain knowledge and is therefore limited to a specific domain. Low-level retrieval techniques are more generic but they can characterize only low-level information such as color, texture, shape, motion etc. We are interested in low-level retrieval techniques for grey scale images, based on texture and spatial (dis)similarity. We wish to locate a set of images related to a given image ("Query by example"). Fractal coding is effective for images having a degree of self-similarity. Here similar means that a given region in an image may be fitted to another region using some affine transformation. This notion of similarity is particularly useful for textured regions. We will make a distinction between three main aspects of texture: symmetry, contrast and coarseness [15]. The spatial similarity features will be based on spatial relationships, like the distance and the angle between the similar regions.
The power of the method lies in its multiresolution nature: retrieval of high-resolution database images with low-resolution, inexact original queries is possible. Retrieval systems for very large databases impose strong demands on the size of the feature vectors, the effectiveness of the indexing techniques, and the efficiency of the searching algorithm. Therefore the features should be simple to compute and be discriminating. It is necessary to develop hierarchical indexing and searching strategies, that is, in subsequent steps one performs an increasingly detailed search on a smaller and smaller subset of the database. By their multiresolution character, fractal coding techniques lend themselves to the construction of hierarchical image indexing and searching schemes. This paper is a first survey of how effective feature extraction based on fractal codes can be. The need for features with discriminating potential and the possibilities offered by hierarchical schemes in this respect gave reason to write this paper.
2 Background
2.1 Fractal Image Coding
Fractal coding is a relatively new technique which emerged from fractal geometry. It has been studied thoroughly by several authors, see e.g. [4,13]. Fractal coding is based on the self-similarity in a picture. This means that small pieces of the picture can be approximated by transformed versions of some other (larger) pieces of the picture. This phenomenon is exploited to extract features that relate to this self-similarity. We give a brief introduction to fractal image coding, cf. [2,6,3]. Without loss of generality, we suppose that the image I measures $2^N \times 2^N$ pixels. We will denote this image area by E. We consider grey scale images and define G = {0, ..., 255}. So $I : E \to G$, $I \in G^E$; $I|_R$ is the restriction of the image to the region R and $I_R := I|_R$. In fractal coding the image I is partitioned into non-overlapping sub-blocks of fixed size, called range blocks. See Figure 1 (courtesy of Dugelay et al. [3]). For every range block R, the fractal encoder searches for another block in the image (a domain block D) that looks similar under an affine transformation. The range blocks are identified by the coordinates of the lower left corner of the block:
$$\mathcal{R} = \left\{ (2^{d} m,\, 2^{d} n) \;\middle|\; 0 \le m, n \le 2^{N-d} - 1,\; m, n \in \mathbb{Z} \right\}.$$
The goal of the coding scheme is to approximate, within a certain tolerance $\epsilon$, the range block R by a certain domain block D of double size: $2^{d+1} \times 2^{d+1}$. The chosen domain block is extracted from a domain pool $\mathcal{D}$. There are several kinds of domain pools; for our survey we use the half-overlapping domain pool
$$\mathcal{D} = \left\{ \left(2^{d+1}\tfrac{m}{2},\, 2^{d+1}\tfrac{n}{2}\right) \;\middle|\; 0 \le m, n \le 2^{N-d} - 1,\; m, n \in \mathbb{Z} \right\}.$$
Fig. 1. Fractal coding in steps

The approximation of the range block by the domain block is done in several steps:
a. The domain block is brought into position by a symmetry operator $V_R$.
b. The grey values of the domain block are tuned by an operator $W_R$; $W_R$ consists of a contrast scaling α and a luminance offset β.
c. The size of the domain block is reduced by 75% (averaging, down sampling).
The essential operator in the scheme is $W_R$. α within $W_R$ is chosen in such a way that $W_R$ is a contraction mapping. The other operators are used to make more fits possible. $W_R$ is an affine mapping of grey values, $W_R : G \to \mathbb{R}$. Given a range block R the coder searches for a domain block $D_R \subset E$ and an affine mapping $W_R$ such that, according to the $l_1$ metric on G,
$$d\big(W_R(I_D), I_R\big) \le \epsilon \qquad (1)$$
In the scheme the above procedure is repeated for all range blocks $R \in \mathcal{R}$. Then, the original image I is by approximation a fixed point of the map
$$W = \bigcup_{R \in \mathcal{R}} W_R.$$
By the Fixed Point Theorem the image can be restored by iterating W in the decoding phase, starting with any picture. This implies that storage of the parameters of the map W is sufficient for the (near) reconstruction of the image.
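To make the encoding step concrete, the sketch below matches one range block against a pool of candidate domain blocks: each domain block is down-sampled to the range size, the contrast scaling α and luminance offset β are fitted by least squares, and the candidate with the smallest l1 error is kept. This is a simplified reconstruction for illustration only; it omits the symmetry operators $V_R$, uses an unconstrained least-squares fit rather than enforcing contractivity, and all function names are our own.

```python
import numpy as np

def downsample(block):
    """Reduce a 2d x 2d domain block to d x d by averaging 2x2 neighbourhoods."""
    return 0.25 * (block[0::2, 0::2] + block[1::2, 0::2] +
                   block[0::2, 1::2] + block[1::2, 1::2])

def fit_grey_map(domain, rng):
    """Least-squares contrast scaling alpha and luminance offset beta
    mapping the (down-sampled) domain grey values onto the range block."""
    alpha, beta = np.polyfit(domain.ravel(), rng.ravel(), 1)   # rng ~ alpha*domain + beta
    return alpha, beta

def best_domain(range_block, domain_blocks, tol):
    """Return (index, alpha, beta, error) of the best matching domain block,
    or None if no candidate meets the tolerance (a 'failure')."""
    best = None
    for k, dom in enumerate(domain_blocks):
        small = downsample(dom)
        alpha, beta = fit_grey_map(small, range_block)
        err = np.mean(np.abs(alpha * small + beta - range_block))   # l1 metric
        if best is None or err < best[3]:
            best = (k, alpha, beta, err)
    return best if best is not None and best[3] <= tol else None
```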
2.2 Quadtrees and Multiresolution
Most fractal coding schemes use a quadtree as a further subdivision of the image. In the first stage of the coding, the image is partitioned into range blocks of fixed size. According to the tolerance (1) there will or will not be a match between a range block and a domain block. This means there are: 1. successes, i.e. range blocks for which an approximation by a domain block has been found, and 2. failures, i.e. range blocks for which no approximation could be found. The procedure in fractal coding is to subdivide the failures into four sub-blocks of 1/4 size. The search for successes will then start again, now only with range blocks of 1/4 size. This "multiresolution" scheme is illustrated in Figure 2. The first level of the quadtree, i = 0, contains range blocks of a fixed size, partitioning the image. The failures at a certain level i are divided into four sub-blocks and illustrated at the next level i + 1. The number of failures per level i is an important feature; we denote it by $f_i$. For convenience we also define $s_i$ as the number of successes at level i.

Fig. 2. Subdivision by a quadtree; failures at level i are illustrated at level i + 1.
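A quadtree driver that records the per-level counts $s_i$ and $f_i$ might look as follows. It reuses the hypothetical best_domain helper sketched above and is only meant to show the bookkeeping, not the actual implementation; for brevity the same domain_pool is passed at every level, whereas in the scheme above the domain blocks are twice the current range-block size.

```python
def split4(block):
    """Split a square block into its four quadrants."""
    h, w = block.shape[0] // 2, block.shape[1] // 2
    return [block[:h, :w], block[:h, w:], block[h:, :w], block[h:, w:]]

def encode_quadtree(range_blocks, domain_pool, tol, levels=5):
    """Returns per-level match records plus the success/failure counts (s_i, f_i)."""
    successes, failures, matches_per_level = [], [], []
    current = list(range_blocks)                      # level 0 range blocks
    for level in range(levels):
        matched, unmatched = [], []
        for block in current:
            result = best_domain(block, domain_pool, tol)
            (matched if result is not None else unmatched).append((block, result))
        matches_per_level.append(matched)
        successes.append(len(matched))                # s_i
        failures.append(len(unmatched))               # f_i
        # each failure is split into four quarter-size sub-blocks for the next level
        current = [quarter for block, _ in unmatched for quarter in split4(block)]
    return matches_per_level, successes, failures
```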
3
Feature Extraction
Today there exist several implementations in multimedia database systems, such as the QBIC system by IBM [10], Photobook developed by the MIT Media Lab [12], and the Virage system developed by Virage Inc. In these, as well as many other approaches, one defines feature vectors of image properties. It is essential that such feature vectors are much smaller in size than the original images, but represent the image content as accurately as possible. Images are considered to be similar if the distance between their corresponding feature vectors, which are supposed to be elements of a given metric space, is small. For
this reason, the discriminating power of the features has to be strong. Features often used are color and texture [10,12]. Furthermore, several authors have suggested to use shape properties [12], or relative position of objects within an image [7,5], called spatial similarity.

3.1 Textural and Spatial Similarity
Texture in images has been recognized as an important aspect of human vision. Fractal image coding gives good results in coding textures, with the exception of statistical texture, which is far from ideal [11]. However, we would like to stress an important issue: compressing an image and featuring an image are quite different goals. There is no a priori reason why a transform used for indexing multimedia databases has to satisfy the same properties as one for compressing images. It can be argued that the inability of fractal coding to handle statistical texture is an advantage. Here we are dealing with low-level feature extraction, without any segmentation. We would like to show that fractal coding can model spatial information without segmentation and create several features for spatial similarity. In these features we would like to express whether similarity between regions is bounded to a certain part of the image, or whether there is a dominating direction between the blocks that are similar. The extracted information is modeled in a way that is independent of the size of the image, a very desirable property. Smaller thumbnails can then be used to retrieve bigger images.
4
Feature Extraction Using Fractal Codes
In our experiments the coding scheme was programmed to use five quadtree levels. In the first step every image is divided into 16 range blocks, regardless of the size of the image. At every level of the quadtree several features are extracted, so that more information about the image is added at every level of the quadtree.

4.1 Texture
We distinguish three features for texture: symmetry, contrast and coarseness. The symmetry feature is modeled by the operator $V_{R_i}$, see Figure 1. $V_{R_i}$ relates a range block to one of the 8 symmetry operators that are used to bring a domain block into position to match this range block at a certain level i. In our experiments we make histograms of several features. In the symmetry histogram, the 8 symmetry operators are denoted along the horizontal axis. The vertical axis shows the fraction with which the various symmetry operations occur at that level of the quadtree, see Figure 4.
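For example, given the list of per-range-block match records at one quadtree level, the symmetry histogram could be accumulated as below. The record layout (a dict with a 'symmetry' index from 0 to 7) is an assumption made for this sketch, not the format used by the authors.

```python
import numpy as np

def symmetry_histogram(matches):
    """Fraction of matches at this level that used each of the 8 symmetry operators."""
    counts = np.zeros(8)
    for match in matches:
        counts[match["symmetry"]] += 1     # 0..3: rotations, 4..7: rotations with a flip
    return counts / max(1, len(matches))
```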
Homogeneity of textural contrast is modeled by the mean and variance of the grey value scaling α. If a domain block at a certain level i is matched with a range block, α is the scaling used on the grey values of the domain block. Again, all features are related to i. The coarseness of texture is modeled by the number of successes $s_i$ at a certain level of the quadtree. So we count the number of ranges for which a proper domain block has been found at a certain level of the quadtree. The depth appears to be very important in this respect. If a lot of large domain blocks can be mapped onto range blocks, the scale of the similarity will likely be coarse.

4.2 Spatial Similarity
For the spatial (dis)similarity present in an image, three features are derived from the fractal code depending on the quadtree level: uniformity, direction and dimension.

Uniformity and direction. The first two spatial features are modeled by one vector $c_i$, depending on the level i of the quadtree. $c_i$ is expressed in terms of its magnitude $l_i$ and angle $\phi_i$: $c_i = (l_i, \phi_i)$. Here $l_i$ measures the distance between a matched range and domain block. The perception is that $l_i$ is bounded if an image consists of different textures dividing the image into several regions. In our histograms this feature is divided into 8 classes; the length is calculated as a fraction of the distance from the lower left corner to the upper right corner of the image. In this way the feature is made size invariant.

Fig. 3. Spatial uniformity and direction.

The spatial direction feature (see Figure 3) measures the angle between the horizontal direction and the direction from the upper left corner of the domain block to the upper left corner of the range block at a certain level i. We choose to represent
these features numerically by vectors in $\mathbb{R}^8$, and graphically by histograms with eight bars, see Figure 4. The spatial dimension feature relates to the Box Counting Dimension [4]. The image I is divided into $16^2$ sub-blocks. For each sub-block we define
$$d_i = \log_2 \frac{f_{i+1}}{f_i}$$
where fi is the number of failures at level i. Figure 5 serves as an example. This feature distinguishes between images that have edges scattered all over the image, and images with a few clear-cut lines.
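A direct transcription of this definition, reusing the per-level failure counts of a sub-block, could look as follows; the guard against zero counts is our own addition and not discussed in the paper.

```python
import numpy as np

def spatial_dimension(failures_per_level):
    """d_i = log2(f_{i+1} / f_i) for one sub-block, given its failure counts per level."""
    dims = []
    for f_i, f_next in zip(failures_per_level, failures_per_level[1:]):
        if f_i > 0 and f_next > 0:          # guard: the ratio is undefined otherwise
            dims.append(np.log2(f_next / f_i))
        else:
            dims.append(0.0)
    return dims
```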
5
Results
In this section we investigate the discriminating power of some features. Here we selected three features: textural symmetry, textural coarseness and spatial uniformity. The example images stem from the Vistex Database of MIT. Figure 4 shows 15 pictures and the corresponding histograms with respect to the selected features. The first column shows the picture itself; the second column shows histograms related to textural symmetry; the eight values on the horizontal axis correspond to the eight symmetry operators that can be distinguished for mapping domains onto ranges. The first four values denote rotation of a square part over 0 (identity), 90, 180 and 270 degrees respectively. The second quartet denotes the same, but with an additional flip of the plane in which the image lies. The vertical axis shows the fraction with which the various symmetry operations occur at that level of the quadtree. Although all features can be extracted at all levels, we present only the histogram relating to the level with the most successes, see Section 2.2. We observe clearly a preference for the 0 and 180 degrees classes in the "Building" images. Other images have a much more even spread over the symmetry operations. Apparently there is a dominating direction in the picture, as could be expected. The feature appears to distinguish between images of man-made and natural environments. The third column shows histograms with respect to textural coarseness. The horizontal axis of this histogram corresponds to the depth of refinement in the quadtree of the encoding procedure. The vertical axis shows the accumulated success rate of the encoding. It is the fraction of all pixels in the original image that are successfully mapped from domain blocks onto range blocks. We observe how the "Metal" image is poorly matched even at large depths, due to the statistical texture present in the image. The "Clouds" image mainly seems to consist of similarity at large scale. "Kiss", which has clouds as a background, shows almost the same histogram.
Fig. 4. Pictures from M.I.T. Database and some features.
”Buildings.0008” and ”Buildings.0009” consist of building structure at a larger scale which is reflected by this feature. The fourth column shows the histogram with respect to spatial uniformity. The horizontal axes shows 8 classes for the distance between matching domains and ranges. ”Kiss” is a nice example of an image that has texture centered in the image, which is reflected in a biased distribution with a preference for short distances. For ”DogCageCity”, ”GrassLand” and ”ValleyWater” the histogram shows two superposed distributions, corresponding to the two main textures. Finally, we show two examples of the spatial dimension feature. Figure 5, exploits the fractal dimension, times 10, of parts of an image at a certain depth. Typically lines and borders yield di ≈ 1 and areas with lots of inner structure
yield $d_i \approx 2$. We observe how it is roughly similar to the geometry and topology of the original images and indeed borders and inner areas can be identified by different values of $d_i$.

Fig. 5. Spatial dimension (×10); "BrickPaint.0001" and "DogCageCity.0002" images.
6
Conclusions and Further Research
The features we studied appear to have discriminating power that relates to human vision. The next question is whether this discriminating power can be used to successfully retrieve an image from a large database. We plan to address this question. The method can be used for a hierarchical search and it combines several desirable properties. The first to mention is that it provides feature extraction and compression at the same time. The second is that the method has proved to be invariant to size, orientation, contrast scalings and luminance offsets. Therefore our method may improve on previous approaches [9,14,17].
Acknowledgments
We would like to thank Prof. M.S. Keane and Dr. H.J.A.M. Heijmans for their advice.
References
1. P. Aigrain, H. Zhang, D. Petkovic, Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review. Multimedia Tools and Applications, 3, pp. 203-223, Kluwer, 1996.
2. G.M. Davis, A Wavelet-Based Analysis of Fractal Image Compression. IEEE Transactions on Image Processing, Vol. 7, No. 2, Feb 1998.
3. J.-L. Dugelay, B. Fasel, V. Paoletti, N. Vallet, www.eurecom.fr/~image/Projet Etudiant 1997/english/codeur.html.
4. Y. Fisher (ed.), Fractal Image Compression: Theory and Application, Springer Verlag, 1994.
5. V.N. Gudivada and V.V. Raghavan, Design and evaluation of algorithms for image retrieval by spatial similarity. ACM Transactions on Information Systems, Vol. 13, No. 2, pp. 115-144, April 1995.
6. A. Jacquin, A Fractal Theory of Iterated Markov Operators with Applications to Digital Image Coding, PhD thesis, Georgia Institute of Technology, August 1989.
7. T. Kato, Database architecture for content-based image retrieval. Proc. of SPIE Conf. on Image Storage and Retrieval Systems, Vol. 1662, pp. 112-123, San Jose, Feb 1992.
8. J.M. Keller and S. Chen, Texture description and segmentation through fractal geometry. Computer Vision, Graphics, and Image Processing, 45:150-166, 1989.
9. J.M. Marie-Julie and H. Essafi, Image Database Indexing and Retrieval Using the Fractal Transform. Proc. of Multimedia Applications, Services and Techniques, pp. 169-182, Springer Verlag, 1997.
10. W. Niblack, R. Barber, W. Equitz, M. Glasman, D. Petkovic, P. Yanker, C. Faloutsos and G. Taubin, The QBIC Project: querying images by content using color, texture and shape. Storage and Retrieval for Image and Video Databases, Vol. 1908, pp. 173-187, 1993.
11. G.E. Oien, R. Hamzaoui and D. Saupe, On the limitations of fractal image texture coding.
12. A. Pentland, R.W. Picard, S. Sclaroff, Photobook: content-based manipulation of image databases. Storage and Retrieval for Image and Video Databases II, Vol. 2185, SPIE, pp. 34-37, San Jose, 1994.
13. D. Saupe and R. Hamzaoui (eds.), ftp://ftp.informatik.uni-freiburg.de/papers/fractal/README.html.
14. Alan D. Sloan, Retrieving Database Contents by Image Recognition: New Fractal Power. Advanced Imaging, 9(5):26-30, 1994.
15. H. Tamura, S. Mori and T. Yamawaki, Texture Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6), June 1978.
16. J.K. Wu and A.D. Narasimhalu, Identifying Faces Using Multiple Retrievals. IEEE Multimedia, 1(2):27-38, 1994.
17. Aidong Zhang, Biao Cheng, Raj Achary and Raghu Menon, Comparison of Wavelet Transforms and Fractal Coding in Texture-based Image Retrieval. Proceedings of the SPIE Conference on Visual Data Exploration and Analysis III, San Jose, January 1996.
Content-Based Image Retrieval Based on Local Affinely Invariant Regions Tinne Tuytelaars1 and Luc Van Gool1,2 1
University of Leuven, ESAT-PSI Kard. Mercierlaan 94, B-3001 Leuven, Belgium {tinne.tuytelaars,luc.vangool}@esat.kuleuven.ac.be 2 Kommunikationstechnik, ETH Zentrum, ETZ CH-8092 Zürich, Switzerland
Abstract. This contribution develops a new technique for content-based image retrieval. Where most existing image retrieval systems mainly focus on color and color distribution or texture, we classify the images based on local invariants. These features represent the image in a very compact way and allow fast comparison and feature matching with images in the database. Using local features makes the system robust to occlusions and changes in the background. Using invariants makes it robust to changes in viewpoint and illumination. Here, “similarity” is given a more narrow interpretation than usual in the database retrieval literature, with two images being similar if they represent the same object or scene. Finding such additional images is the subject of quite a few queries. To be able to deal with large changes in viewpoint, a method to automatically extract local, affinely invariant regions has been developed. As shown by the first experimental results on a database of 100 images, this results in an overall system with very good query results.
1
Introduction
Most existing image retrieval systems mainly focus on color and color distribution. A very popular and well-developed technique is based on color histograms, first developed by Swain and Ballard [11], refined by Funt and Finlayson [2,3], as well as by Healey and Slater [5] to obtain illumination invariance. Others have tried to use texture (e.g. [7]) or shape (e.g. [4,9,1]), but the use of these criteria is mostly limited to a specific application. We have developed a different approach to content-based image retrieval, based on local invariants. These invariants characterise neighbourhoods of interest points. The latter correspond to corners, found with a corner detector (in our case the Harris detector). The idea is to introduce some geometry in the image retrieval process. Using local features makes the system robust to occlusions and changes in the background. Using invariants makes it robust to changes in viewpoint and illumination. Moreover, the use of invariants allows the use of hashing techniques, resulting in a very efficient image retrieval process.
Probably the most related work is that of Schmid and Mohr [10], who have investigated the matching of points between images based on local greylevel invariants and have also applied this for retrieving images from a database. However, their system only deals with invariance under rotations, combined with a scale space to overcome changes in scale between the images. Therefore, the allowed change in viewpoint is rather limited. We have extended these ideas towards invariance under more general transformations. More precisely, we consider invariance under affine geometric transformations (valid if the surface over which the invariant is computed is planar) and under linear changes in intensities in each of the three color bands, i.e. intensities change by a scale factor and offset that may be different for the different color bands (valid in the case of Lambertian surfaces under changing illumination). Several invariants satisfying these constraints can be found in the literature (e.g. [8]). However, their application in this context is not straightforward. If one wants to use geometric invariants, sets of at least four coplanar points are needed. Apart from the combinatorics, this is infeasible since there is no way to check the coplanarity of points based on a single image. In case photometric moment invariants are used, one can assume that surfaces are locally planar. The problem in that case is that the invariants have to be computed over the same regions in both images, and no finite region can be found that remains unchanged under an affine transformation (for instance, a circular region will be transformed into an elliptical one). One way out would be a generalisation of Schmid and Mohr's scale-space approach. However, with four degrees of freedom for affine invariant regions (e.g. parallelograms), this is computationally infeasible. Therefore, we first have to find "affinely invariant regions", i.e. we have to develop a method to find the same region in both images independently. In section 2, we describe a method to find affinely invariant regions in order to handle the geometric distortions between the query image and a corresponding image in the database. Invariant-based image retrieval using these invariant regions is discussed in section 3. These invariants also cover the photometric changes between views. Section 4 shows some experimental results and section 5 concludes the paper with some final remarks.
2
Finding Affinely Invariant Regions
The problem addressed in this section can be summarized as follows: given two images of the same scene, taken from different viewpoints, find one or more regions around a point, such that the same region(s) is (are) found in both images independently, i.e. without using knowledge about the other image. In a different context, Lindeberg et al. [6] have also been looking into this problem. Their approach is situated in the domain of shape from texture, where they apply “affine shape adaptation” of smoothing kernels. In the case of weak isotropy, the region found corresponds to rotationally symmetric smoothing and
rotationally symmetric window functions in the tangent plane to the surface. However, for other cases, their method does not necessarily converge. We consider invariance under affine geometric and photometric changes (i.e. a scale and offset for each spectral band). Thus, we assume that the scene is locally planar and not occluded and that no strong, specular reflections occur. To reduce the complexity of the problem, we restrict ourselves to finding affinely invariant regions for corner points making use of the nearby edges. Also, we make a distinction between two different cases: curved edges and straight edges.

2.1 Case 1: Curved Edges
Let $p = (x_p, y_p)$ be a corner point, and $e_1$ and $e_2$ the edges in its neighbourhood. Then two relative affinely invariant parameters $l_1$ and $l_2$ can be defined for the two edges $e_1$ and $e_2$ in the following way (see also fig. 1):
$$l_i = \int \mathrm{abs}\!\left(\left|\, p_i^{(1)}(s_i) \quad p - p_i(s_i) \,\right|\right) ds_i, \qquad i = 1, 2$$
with $s_i$ an arbitrary curve parameter, $p_i^{(1)}(s_i)$ the first derivative of $p_i(s_i)$ with respect to $s_i$, abs() the absolute value and |..| the determinant. Then, a point $p_1(l_1)$ on one edge can be associated with a point $p_2(l_2)$ on the other edge, such that $l_1 = l_2$. Both $l_1$ and $l_2$ are relative affine invariants, but their ratio $l_1/l_2$ is an absolute affine invariant and the association of a point on one edge with a point on the other edge is also affinely invariant. From now on, we will simply use $l$ when referring to $l_1 = l_2$.

Fig. 1. Based on the edges in the neighbourhood of a corner point, an affinely invariant region can be found as a function of the relative affinely invariant parameter $l_1 = l_2$.

Together, the two points $p_1$ and $p_2$ define a region A for the point p as a function of l: the parallelogram spanned by the vectors $p_1 - p$ and $p_2 - p$. In this way, the problem of finding an affinely invariant region has been reduced to finding a value for l in an affinely invariant way. To this end, we evaluate a function over the region A(l) that reaches its extrema for corresponding values of l, i.e. in an invariant way for both the geometric and photometric changes. We then select the region A(l) for which such a function reaches a local extremum. Since it is not guaranteed that the function will really reach an extremum over the limited l-interval we are looking
at, a set of functions is tested. As a result, more than one region might be found for one corner point. The functions we used in our experiments are:
$$\frac{\int I(x, y)\, dx\, dy}{\int dx\, dy} \qquad \text{and} \qquad \frac{\left|\, p - q \quad p - p_g \,\right|}{\left|\, p - p_1 \quad p - p_2 \,\right|}$$
with I(x, y) the image intensity, $p_g$ the center of gravity of the region, weighted with image intensity, and q the corner of the parallelogram opposite to the corner point p (see figure 1):
$$p_g = \left( \frac{\int I(x, y)\, x\, dx\, dy}{\int I(x, y)\, dx\, dy}, \; \frac{\int I(x, y)\, y\, dx\, dy}{\int I(x, y)\, dx\, dy} \right)$$
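A numerical version of these two quantities over a candidate parallelogram region might look as follows. The sketch assumes the region is available as a boolean mask over the image together with the relevant corner coordinates; the function names are ours and the rasterisation of the parallelogram itself is omitted.

```python
import numpy as np

def centre_of_gravity(image, mask):
    """Intensity-weighted centre of gravity p_g of the masked region, as (x, y)."""
    ys, xs = np.nonzero(mask)
    w = image[ys, xs].astype(float)
    return np.array([np.sum(w * xs), np.sum(w * ys)]) / np.sum(w)

def mean_intensity(image, mask):
    """First function: average intensity over the region."""
    return float(image[mask].mean())

def det2(u, v):
    """2x2 determinant of the column vectors u and v."""
    return u[0] * v[1] - u[1] * v[0]

def area_ratio(p, p1, p2, q, pg):
    """Second function: |det[p-q, p-pg]| / |det[p-p1, p-p2]|."""
    return abs(det2(p - q, p - pg)) / abs(det2(p - p1, p - p2))
```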
2.2 Case 2: Straight Edges
In the case of straight edges, the method described above cannot be applied, since the relative affinely invariant parameter l will be zero. However, since straight edges occur quite often in real images, we cannot just neglect this case. A straightforward extension of the previous technique would then be to search for local extrema of some appropriate function in a two-dimensional search space with the two parameters running over the edges as coordinates, instead of a one-dimensional search space running over l. However, the functions which we have been using do not show a clear, well-defined extremum. Instead, we have some shallow "valleys" of low values (e.g. corresponding to cases where the area in the numerator tends to zero). Instead of taking the (inaccurate) local extremum, we combine two such functions, and take the intersection of the two valleys, as shown in figure 2. Of course, it is not guaranteed that the valleys will indeed intersect. That is why different combinations of invariants are tested (e.g. for different subregions, or for different color bands). Moreover, the special case where two valleys (almost) coincide must be detected and rejected, since the intersection will not be accurate in that case.
Fig. 2. For the straight edges case, the intersection of the "valleys" of two different invariants is used instead of a local extremum.
3
Image Retrieval Based on Local Invariants
Now that we are able to extract affinely invariant regions in an image, local, affine moment invariants can be computed and used to match a query image with images in the database representing the same scene or object. First, we describe what local invariant features we use. Next, the image retrieval process is discussed in more detail.

3.1 Local Invariant Features

As in the region finding step, we consider invariance both under affine geometric changes and linear photometric changes, with different offsets and scale factors for each of the three color bands. For each region, a feature vector is composed, consisting of affine moment invariants. These can be matched quite efficiently with the invariant vectors computed for the regions in the database images, using a hashing technique. In this way, combinatorics can be avoided, reducing the computation time from O(nN) to O(n) (with n the number of regions in the query image and N the total number of regions in all of the database images). As a result, the processing time is independent of the size of the database. As a first invariant – and, as shown by our experiments, a quite distinctive one – we use the "type" of the region found. Here, with "type" we refer to the method used to find the region: was it found using the method for curved edges or with the method for straight edges, and what functions were used? Only if the type of two regions corresponds can they be matched. The other invariants are all affine moment invariants. Care has to be taken, however, that they are sufficiently distinctive. More specifically, one must pay attention to the fact that the region finding process turns some invariants into trivial cases. Due to space limits, a detailed description of the invariants used in our experiments cannot be given here. Therefore, we refer to [8]. Briefly, moment invariants up to the second order are computed, which use each of the three color bands to obtain a higher distinctive power.

3.2 Image Retrieval based on Local Invariant Features

In order to find the corresponding image(s) in the database, a voting mechanism is applied. Each point in the query image is matched to the point in one of the database images whose feature vector yields the smallest Mahalanobis distance. If this distance is smaller than some threshold th, the cross correlation between them is computed (after normalization to a reference region) as a final check to reject false matches. Each match between a point in the query image and a point in one of the database images is then translated into a vote for the corresponding database image. The image that receives the highest number of votes is selected as the query result. From the ratio between the highest number of votes and the second highest number of votes we can derive a "confidence measure" c:
$$c = \begin{cases} \dfrac{n_1}{n_1 + n_2} & \text{if } n_2 < n_1 \\[6pt] \dfrac{1}{N} & \text{if } n_2 = n_1 \end{cases}$$
with $n_1$ the highest number of votes, $n_2$ the second highest number of votes and N the number of images with $n_1$ votes. The higher this value, the more confident we are that the query result is correct. Indeed, if the second highest number of votes is zero, the confidence measure is estimated to be 100%. If, on the other hand, the second highest number of votes is equal to the highest number of votes, the estimated confidence measure is at most 50% and depends on the number of images with the same number of votes.
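The voting stage and the confidence measure can be summarised in a few lines. The code below is a schematic reconstruction; the nearest-neighbour search, the Mahalanobis distance and the cross-correlation check are abstracted behind a hypothetical match_point helper, and only the vote counting and the confidence formula follow the text directly.

```python
from collections import Counter

def retrieve(query_points, database, match_point):
    """Vote for database images; match_point(point, database) -> image id or None."""
    votes = Counter()
    for point in query_points:
        image_id = match_point(point, database)
        if image_id is not None:
            votes[image_id] += 1
    ranked = votes.most_common()
    if not ranked:
        return None, 0.0
    n1 = ranked[0][1]
    n2 = ranked[1][1] if len(ranked) > 1 else 0
    if n2 < n1:
        confidence = n1 / (n1 + n2)
    else:                                    # n2 == n1: ties share the top score
        n_top = sum(1 for _, v in ranked if v == n1)
        confidence = 1.0 / n_top
    return ranked[0][0], confidence
```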
4
Experimental Results
For our experiments, we used sets of two or exceptionally three color images of a scene or object. Each time, one image was kept as a query image, while the other(s) was (were) put in a database which, at the end, contained 50 images of more than 40 different scenes or objects. To this, we added 50 more images from the Corel Professional Photos database that were selected from the packages "Big Apple", "Egypt", "India" and "China", to end up with a total of 100 images. Fig. 3 shows some query images (upper rows) together with the image of the same object or scene in the database (lower rows). Note that there typically is a large difference in viewpoint between both images. Also the illumination conditions may have changed, and parts of the object or scene may have been occluded in one of the images. Nevertheless, for almost 70% of the query images the correct image was retrieved as the image with the highest number of votes. In 95% of the cases, the correct image was among the top seven retrieved images. Examples of images where the system failed are the ones shown in the rightmost column of fig. 3. In the case of the car images, this is due to the lack of texture on the car, combined with specular reflections. For the lower right images, the scene is composed of many different objects, causing many occlusions that mislead the matching process. For all the other images in fig. 3 the correct image was the first one retrieved. Table 1 gives an overview of the results. The ranking as obtained through the voting process must be interpreted as a ranking according to the probability of representing the same object or scene. In contrast to other content-based image retrieval systems, our system cannot be used to rank the database images according to their similarity with the query image. This is caused by the fact that only local similarity is checked, while a human observer usually looks at similarity on a global scale. However, it is our belief that in many applications what the user is really interested in are images with the same objects or scenes rather than a more qualitative similarity.
Fig. 3. Some examples of query images (upper rows) and their corresponding images in the database (lower rows).

5
Summary and Conclusions

In this contribution, a novel approach to content-based image retrieval was worked out, based on local, illumination and viewpoint invariant features. To this end, a method for the automatic extraction of affinely invariant regions was developed. The first results on a database of 100 images showed that the method is promising, and can be considered a good alternative to the existing, color or texture based retrieval systems in cases where images of the same object or scene are searched for. Of course, further testing is still required, both on a larger database and using more complex criteria (based on combinations of our approach and global, color-based techniques, for example).
Acknowledgments TT is a Research Assistant of the Fund for Scientific Research (FWO). Support by IUAP project ’Intelligent Mechatronic Systems’ of the Belgian DWTC and by the Fund for Collective Fundamental Research (FKFO) of the Fund for Scientific Research (FWO) is gratefully acknowledged.
References
1. A. Del Bimbo and P. Pala, Effective Image Retrieval Using Deformable Templates, ICPR96, pp. 120-123, 1996.
2. G. D. Finlayson, S. S. Chatterjee and B. V. Funt, Color Angular Indexing, ECCV96, Vol. II, pp. 16-27, 1996.
Table 1. Image Retrieval Results: For each image (first column), the position at which the correct image from the database was retrieved (second column) together with the confidence ratio (third column) is given.

image        position of      confidence
             correct image    measure
auto         7
bak          3-5
bank         1                0.881
bellavista   1                0.706
bier         1-3              0.333
bloemen      1                0.706
choc         1                0.9
cola         1-4              0.25
deur         6
domus        1                0.65
heroon       3-5
kassei       1                0.538
kasteel      1                0.684
kobe         4-6
kot          6-10
kotp         1                0.545
mask         2-5
masker       14
matras       1                0.929
meter        5-18
display      3-4
model        11-19
plant        1                0.75
poort        5
post         1                0.71
rob          1                0.636
simpsons     1-4              0.25
stof         1                0.805
tank         1                0.611
tankf        1                0.905
tegels       1                0.6
winkels      1                0.739
boor         2-10
pc           1                0.577
muur         1                0.666
koffie       1                0.6
stapel       1                0.565
trio         1                0.848
ikke         1                0.8
tractor      1-2              0.5
dak          1                0.947
3. B. V. Funt and G. D. Finlayson, Color Constant Color Indexing, IEEE PAMI, Vol. 17, No. 5, pp. 522-529, 1995.
4. E. Gary and R. Mehrorta, Shape similarity-based retrieval in image databases, SPIE, Vol. 1662, 1992.
5. G. Healey and D. Slater, Global Color Constancy: Recognition of Objects by Use of Illumination Invariant Properties of Color Distributions, J. Opt. Soc. Am. A, Vol. 11, No. 11, pp. 3003-3010, Nov 1995.
6. J. Gårding and T. Lindeberg, Direct computation of shape cues using scale-adapted spatial derivative operators, IJCV, Vol. 17, No. 2, pp. 163-191, Feb 1996.
7. F. Liu and R. W. Picard, Periodicity, directionality, and randomness: Wold features for image modeling and retrieval, M.I.T. Media Laboratory Perceptual Computing Section Technical Report, No. 320, 1995.
8. F. Mindru, T. Moons, L. Van Gool, Color-based Moment Invariants for the Viewpoint and Illumination Independent Recognition of Planar Color Patterns, to appear at ICAPR, Plymouth, 1998.
9. F. Mokhtarian, S. Abbasi and J. Kittler, Robust and Efficient Shape Indexing through Curvature Scale Space, BMVC96, September 9-12, 1996.
10. C. Schmid, R. Mohr, Local Greyvalue Invariants for Image Retrieval, PAMI, Vol. 19, No. 5, pp. 872-877, May 1997.
11. M. J. Swain and D. H. Ballard, Color Indexing, IJCV, Vol. 7, No. 1, pp. 11-32, 1991.
A Framework for Object-Based Image Retrieval at the Semantic Level Linhui Jia and Leslie Kitchen Computer Vision and Machine Intelligence Lab Department of Computer Science and Software Engineering Melbourne University, Parkville, Vic 3052, Australia {linhui,ljk}@cs.mu.oz.au
Abstract. This paper proposes a framework with essential components and processes for object-based image retrieval based on semantically meaningful classes of objects in images. An instantiation of the framework is presented to show the usage of the framework.
1
Introduction
Psychophysical studies [12,2] show that high-level concepts or semantic meanings of images are the most important cues used by humans to judge similarities between images. However, what current computer vision techniques can automatically extract from images are mostly low-level or middle-level features. These features might convey some information about semantic meanings of images, but there is no simple and direct way of representing semantic meanings by using them. Therefore, developing extractable image features to achieve semantically meaningful retrievals is one of the major challenges in the field of image retrieval [16]. Recently, several systems [3,1,18,20] have been developed to answer this challenge. These systems treat each image as a whole and automatically identify higher-level concepts of images using low-level feature information such as color and texture. They aim at grouping images into general semantically meaningful categories such as indoor vs. outdoor and cities vs. landscape. Consequently, they can provide semantic content-based indices for retrieval and browsing. Our purpose here is to investigate object-based image retrieval (OBIR) methods at the semantic level, which treat the images as containing objects using mainly extractable shape features. Object-based images, here, are used to distinguish those images that contain distinguished objects as their main scene content such as fishes and hand tools, from other images that contain materials, such as beaches and grass, which appear as homogeneous patterns. Object-based images could be domain dependent images or could exist in a category of a general image collection. Because interesting objects in object-based images from different domains are so different, it is not currently feasible to develop techniques that are suitable for object-based image retrieval in many different domains. Our objective in
proposing a framework for object-based image retrieval is to provide guidelines which can be instantiated by adding specific techniques into the framework's components. The image retrieval system so developed is expected to suit the problem domain. This framework can also be used to analyse existing object-based image retrieval systems. The following section describes this framework. An instantiation of it is presented in Section 3. Finally, Section 4 gives a summary.
2
The OBIR Framework

Fig. 1. The structure of the OBIR framework
For the retrieval of object-based images at a semantic level, similarities between images are most likely associated with interesting classes of objects, or
semantic meanings of objects, in the images [15]. Encoding semantics-based classes of object information in an image retrieval system involves solving the hard problem of classifying different classes of visual objects. The key components of the OBIR framework, the classifier training component and the object classification component, are for this purpose. In addition, OBIR contains three other components. Figure 1 illustrates their interaction during three processes in two stages. Of the three processes, the classifier training process and the image insertion process can be carried out in the off-line stage, and they have to be completed before the image query and retrieval process can be performed during the on-line stage. The following subsection is a description of each of the five components. Subsection 2.2 gives the purposes of the two stages with the three processes. Subsection 2.3 mentions a possible pre-processor.

2.1 Functionality of the Five Components
Image processing component. This component is used to separate interesting objects, also called the foreground, from the background for each image, and to extract features from the foreground to represent the image. Foreground/Background (F/B) separation is a fundamental problem in object-based image processing. For images with a relatively uniform background, F/B separation can be performed using thresholding and clustering techniques. For more general images, motion information could be used [19,21]. Also, semi-automatic methods can be provided for extracting interesting objects, since extracting semantically meaningful objects under arbitrary conditions is still an unsolved problem. The next step, after F/B separation, is feature extraction from the foreground. Interesting objects in different domains could be described by different features. The chosen extractable features of the domain object must be the most discriminative features that can distinguish objects of one class from objects of other classes. For example, if the contours of objects can be used to distinguish different classes of objects in a domain, features extracted from object contours can be used. Otherwise, color and texture, or a combination of contour, color, and texture features of regions of objects may be used. This issue needs to be considered when instantiating the framework to design an image retrieval system.

Classifier training component. In object-based image collections, we can assume that the classes of objects in images can be known a priori. Therefore, supervised learning techniques can be employed to generate a classifier from a set of example objects, called training examples. Each training example is described by a set of attributes and a class. The learned classifier can be used to classify a new example (object) without class information into a class. Most supervised learning algorithms require that the examples are represented using a fixed set of attributes. However, sometimes it is very hard to define a fixed set of extractable attributes for an object. In this case, the problem of object classification can be decomposed into sub-problems, such as classification of the parts of the object, which are relatively easy to solve.
For learning a classifier, a subset of images which contain different classes of objects is chosen from the image collection and used as training images. Objects in training images are labeled with appropriate classes, which are usually provided by domain experts. For objects in each of the training images, using the extracted features obtained from the image processing component, the classifier training component builds training data and learns a classifier. There are many learning techniques that can be used in this component to learn a classifier from training data, such as decision tree learning [14], production rule learning [14,10,4], Naive Bayesian classification [7,8], first-order learning [13], and neural networks [11].

Object classification component. This component makes use of the classifier which is learned in the classifier training component to classify objects in images into different classes. In the case of decomposing the original object classification problem into sub-problems, the classification results of the sub-problems need to be combined to form the classification for the original problem.

Image database management component. Using the image processing component and the object classification component, all images in the collection are processed and objects in them are classified into different classes. The internal representations of images are generated using the information on the classes of objects in them. Images with these representations are stored in the image database. The image database management component performs this function. In addition, for efficient retrieval, an image index can be built using the object class information.

Image similarity measurement component. This component defines similarity measures between two images based on their internal representations, or the class information of objects in them. For a given query image, this component finds similar images in the image database and returns them in order of their similarities to the query image.
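As an illustration of the classifier training and object classification components, the sketch below trains a decision tree on labelled feature vectors and applies it to unlabelled objects. The use of scikit-learn and the variable names are assumptions for this example; the framework itself does not prescribe a particular learning algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_object_classifier(training_features, training_labels):
    """Classifier training component: learn a classifier from labelled example objects.
    training_features: (n_objects, n_features) array; training_labels: class per object."""
    classifier = DecisionTreeClassifier()
    classifier.fit(training_features, training_labels)
    return classifier

def classify_objects(classifier, object_features):
    """Object classification component: assign every segmented object to a class."""
    return classifier.predict(np.asarray(object_features))

# A query image is then indexed by the set of object classes it contains, and the
# image similarity measurement component compares images on those class sets.
```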
Tasks of the Two Stages with Three Processes
The goal of the off-line stage is to generate a classifier for object classification through learning from training example objects, and to process images in the collection for image insertion by means of the classifier. During the training process, training images are processed by the image processing component to separate their foreground (interesting objects) from background. Attributes of the interesting objects are then extracted. Each interesting object, or each of its parts, together with its attributes, is then included as a training example in the training data of the classifier training component. Using the training data, a supervised learning technique can be used to learn a classifier. After the classifier is created, images in the collection are processed to insert them into the image database by the image insertion process. In this process, the image processing component takes each image in the collection to extract
objects and their attributes as described above. Then all the objects in the image, together with their attributes, are passed to the object classification component. The object classification component uses the learned classifier to classify these objects or parts of them. In the latter case, the classes of all the parts need to be combined to determine the class of the object. When the class information of objects in an image is generated, the image database management component creates an index for the image based on this information, and then stores the image into the image database. Therefore, instead of the original images, the represented images with their class information are stored in the image database.

Note that only objects of those images in the image collection that are used as training images are labeled (assigned classes) by human experts. All objects in other images in the collection are classified automatically. This reduces the need for human experts' involvement.

On the other hand, for on-line image retrieval, when a query image is given, it is processed first by the image processing component and then by the object classification component, in the same manner as for an image in the image collection. Once the class information of objects in the query image is obtained, it is compared with images in the image database by the image similarity measurement component. Similar images are retrieved and are provided in descending order of their similarities to the query image. In this way, retrieval is performed at the level of object classes. Since each predefined object class can be semantically meaningful, the retrieval can also be carried out at a semantic level.

2.3 Possible Pre-processing
It is worth mentioning that to make the OBIR framework more useful or make it a part of a more general content-based image retrieval system, a pre-processor can be added. The purpose of the pre-processor is to separate the interesting object-based images from other images in an image collection. This pre-processor could be built automatically through classification learning using low-level features such as colors and textures in a similar way to [3,1,18,20]. Using such a pre-processor, relevant object-based images as a subset of a general image collection can be extracted first. Then the framework can be applied to this subset of images. The generation and application of this pre-processor can be thought of as another component of the OBIR framework.
3 An Instantiation of the OBIR Framework
Within the framework described above, an OBIR system [6] has been developed using C on SUN workstations. In this system, contours of objects are expected to contain information for identifying classes of objects in images, and objects are classified by analysing their contours. Therefore, the image processing component of the OBIR system performs object contour segmentation and contour feature (attribute) extraction.
Since it is hard to find a fixed set of attributes for each contour, commonly used classifier learning techniques, such as decision tree learning, cannot be directly used to solve the problem of contour classification. We divide this problem into sub-problems, each of which classifies a segment of a contour. Attributes of each segment are relations between this segment and other segments which satisfy scale, rotation, and translation invariance. All the segments of each object contour in training images, including attributes and class, are sent as training examples to the classifier training component, which uses C4.5 [14] to learn a decision-tree classifier. The object classification component of the system uses the learned classifier to classify segments of an object contour. Then, the class of the object is determined through majority voting among the classes of all the segments of the contour of the object. That is, the most common class of the segments is predicted as the class of the object. If two classes have the same number of segments for an object, the object is assigned the class with the higher summed confidence.

In the image similarity measurement component, we formulate similarity measures which make use of the information of segment classes of objects in each of the two compared images. Therefore, the image similarity measures convey object class information. To make the retrieval efficient, the image database management component indexes all the images in the image database using the object class information in them. For each object class, images are linked together in decreasing order of the percentage of their object segments belonging to the corresponding class. All the off-line and on-line stages, with image training, image insertion, and query image retrieval processes, are the same as described in the framework.

The system is tested on a hand-tool image collection [17]. Figure 2 shows example images from this collection. All the processes in the system are performed automatically, effectively, and efficiently. Experimental results of this system are better than the results reported in [17]. Details of the techniques in this system and experiments are reported in [6]. The system is currently being evaluated on other object-based image databases. This system is suitable for domains in which different types of objects have distinctive shapes which result in distinctive contours.
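The voting rule just described can be made concrete with the following sketch. The (class, confidence) pair representation is an illustrative assumption about the classifier's output, not the system's actual data structure.

```python
# Sketch of the voting step: each contour segment has a predicted class and a
# confidence; the object takes the most common segment class, with ties
# broken by the higher summed confidence.
from collections import defaultdict

def classify_object(segment_predictions):
    counts = defaultdict(int)
    summed_conf = defaultdict(float)
    for cls, conf in segment_predictions:
        counts[cls] += 1
        summed_conf[cls] += conf
    # most segments first; summed confidence breaks ties
    return max(counts, key=lambda c: (counts[c], summed_conf[c]))

# Example: five segments voting over two hypothetical tool classes
print(classify_object([("hammer", 0.9), ("hammer", 0.7),
                       ("wrench", 0.8), ("hammer", 0.6), ("wrench", 0.95)]))
```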
Fig. 2. Some example images from the hand tool image collection

It is observed that, using the above similarity measures, semantically similar images are likely to be ordered at the top of the image similarity order. To further refine the retrievals so that the top N semantically similar images are ordered by visual similarity, where N is defined by the user, we add a point-pattern-matching technique to the image similarity measurement
component to refine the order of the top N images. During the matching in the refinement stage, the point-pattern-matching treats each segment of an object contour as a point in a multi-dimensional space. The number of dimensions is the number of attributes for each segment. The matching between a query image and images in the database is based on minimizing the summed distances between query points (segments) and points (segments) of an image. The smaller the distance is, the more similar the two images are. To reduce the computational cost, some heuristic search methods such as greedy search can be applied. Figure 3 shows, as an example, retrieval results before and after the refinement stage.
Fig. 3. Retrieval results before (top row) and after (bottom row) the refinement stage. Numbers below each image are the distance of this image from the query image. At top left is the query image. Here, N is 6.
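A minimal sketch of the greedy variant of this matching is given below. The attribute vectors and the Euclidean distance are illustrative assumptions; the paper does not specify the exact distance or search strategy used.

```python
# Hedged sketch of the refinement step: segments are points in attribute
# space; two images are compared by greedily pairing each query segment with
# its nearest unused segment of the database image and summing the distances.
import numpy as np

def greedy_match_distance(query_pts, image_pts):
    """query_pts: (n, d) array; image_pts: (m, d) array."""
    unused = list(range(len(image_pts)))
    total = 0.0
    for q in query_pts:
        if not unused:
            break
        dists = [np.linalg.norm(q - image_pts[j]) for j in unused]
        best = int(np.argmin(dists))
        total += dists[best]
        unused.pop(best)
    return total  # smaller summed distance means more visually similar
```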
4 Summary
A framework with the necessary components, processes, and stages for object-based image retrieval is described. Advantages of the framework include that, using supervised learning techniques, an unlabeled image database can be accessed and retrieval can be performed at a predefined, semantically meaningful level of object classes. This framework can be used as the basis for the design of new object-based image retrieval systems in different domains.
Acknowledgements Many thanks to Stan Sclaroff of Boston University for providing the tool images. The first author is supported by a Melbourne University research scholarship and a Melbourne IT scholarship.
References
1. Belongie, S., Carson, C., Greenspan, H., and Malik, J.: Recognition of images in large database using a learning framework. Tech Report 97-939, Computer Science Division, University of California at Berkeley (1997). 501, 505
2. Bernice, E.R., Frese, T., Smith, J., Bouman, C.A., and Kalin, E.: Perceptual image similarity experiments. IS&T/SPIE Conf on Human Vision and Electronic Imaging III (1998). 501 3. Carson, C. and Ogle, V.E.: Storage and retrieval of feature data for a very large online image collection. Data Engineering 19(4) (1996). 501, 505 4. Clark, P. and Niblett, T.: The CN2 induction algorithm. Machine Learning 3 (1989) 261-283. 504 5. G¨ unsel, B. and Tekalp, A.M.: Shape similarity matching for query-by-example. Pattern Recognition 31(7) (1998) 931-944. 6. Jia, L. and Kitchen, L.: Classification-driven object-based image retrieval. To appear in Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Florence, Italy (1999). 505, 506 7. Kononenko, I.: Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In B. Wielinga et al. (Eds.), Current Trends in Knowledge Acquisition. Amsterdam: IOS Press (1990). 504 8. Langley, P. and Sage, S.: Induction of selective Bayesian classifiers. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (1994) 339-406. 504 9. Lee, S.U. Lee and Chung, S.Y.: A comparative performance study of several global thresholding techniques for segmentation. Computer Vision, Graphics, and Image processing, 52 (1990) 171-190. 10. Michalski, R.S. and Chilausky, R.L.: Knowledge acquisition by encoding expert rules versus computer induction from examples: a case study involving soybean pathology. International Journal for Man-Machine Studies 12 (1980) 63-87. 504 11. Murray, A. F.: Applications of Neural Networks. Boston : Kluwer Academic Publishers (1995). 504 12. Papathomas, T.V., Conway, T.E., Cox, I.J., Ghosn, J., Miller, M. L., Minka, T.P., and Yianilos, P.N.: Psychophysical studies of the performance of an image database retrieval system. IS&T/SPIE Conf on Human Vision and Electronic Imaging III (1998). 501 13. Quinlan, J.R.: Learning logic definitions from relations. Machine Learning 5 (1990) 239-266. 504 14. Quinlan, J.R.: C4.5: Program for Machine Learning. San Mateo, CA: Morgan Kaufmann (1993). 504, 506 15. Roth, I. and Bruce, V.: Conceptual Categories. Perception and Representation, Current Issues. Second Edition, Open University Press, Buckingham, Philadelphia (1996). 503 16. Rui, Y., Huang, T.S., and Chang, S.F.: Image retrieval: past, present, and future. Journal of Visual Communication and Image Representation (1998). 501 17. Sclaroff, S.: Deformable prototypes for encoding shape categories in image database. Pattern Recognition 30(4) (1997) 627-641. 506 18. Szummer, M. and Picard, R.W.: Indoor-outdoor image classification. IEEE International Workshop on Content-based Access of Image and video Databases, in conjunction with ICCV’98, Bombay, India, Jan (1998). 501, 505 19. Thompson, W.B.: Combining motion and contrast for segmentation. IEEE Trans. Pattern Anal. Machine Intell. 2 (1980) 543-549. 503 20. Vailaya, A., Jain, A. and Zhang, H.J.: On image classification: City vs. Landscape. Pattern Recognition 31(12) (1998) 1921-1935. 501, 505 21. Wang, J.Y.A. and Adelson, E.H.: Layered representation for motion analysis, Proc. IEEE CVPR’93 . Longer version available as: M.I.T. Media Laboratory Perceptual Computing Technical Report No. 228 (1993). 503
Blobworld: A System for Region-Based Image Indexing and Retrieval Chad Carson, Megan Thomas, Serge Belongie, Joseph M. Hellerstein, and Jitendra Malik EECS Department University of California, Berkeley, CA 94720, USA {carson,mct,sjb,jmh,malik}@eecs.berkeley.edu Abstract. Blobworld is a system for image retrieval based on finding coherent image regions which roughly correspond to objects. Each image is automatically segmented into regions (“blobs”) with associated color and texture descriptors. Querying is based on the attributes of one or two regions of interest, rather than a description of the entire image. In order to make large-scale retrieval feasible, we index the blob descriptions using a tree. Because indexing in the high-dimensional feature space is computationally prohibitive, we use a lower-rank approximation to the high-dimensional distance. Experiments show encouraging results for both querying and indexing.
1 Introduction
From a user’s point of view, the performance of an information retrieval system can be measured by the quality and speed with which it answers the user’s information need. Several factors contribute to overall performance:
– the time required to run each individual query,
– the quality (precision/recall) of each individual query’s results, and
– the understandability of results and ease of refining the query.
These factors should be considered together when designing a system. In addition, image database users generally want to find images based on the objects they contain, not just low-level features such as color and texture [5]; image retrieval systems should be evaluated based on their performance at this task. Current image retrieval systems tend to perform queries quickly but do not succeed in the other two areas. A key reason for the poor quality of query results is that the systems do not look for meaningful image regions corresponding to objects. Additionally, the results are often difficult to understand because the system acts like a black box. Consequently, the process of refining the query may be frustrating. When individual query results are unpredictable, it is difficult to produce a stream of queries that satisfies the user’s need.
In earlier work we described “Blobworld,” a new framework for image retrieval based on segmenting each image into regions (“blobs”) which generally correspond to objects or parts of objects [2]. The segmentation algorithm is fully automatic; there is no parameter tuning or hand pruning of regions. In
this paper we present a complete online system for retrieval in a collection of 10,000 Corel images using this approach. The query system is available at http://elib.cs.berkeley.edu/photos/blobworld. We present results indicating that querying for distinctive objects such as tigers, zebras, and cheetahs using Blobworld produces higher precision than does querying using color and texture histograms of the entire image. In addition, Blobworld false positives are easier to understand because the matching regions are highlighted. This presentation of results means that interpreting and refining the query is more productive with the Blobworld system than with systems that use low-level features from the entire image. The speed of individual queries is also an important factor, so we describe an approach to indexing Blobworld features. We project each color feature vector down to a lower dimensional vector and index the resulting vector. We find that queries that use the index to retrieve several hundred images and then rank those images using the true distance achieve results whose quality closely matches the results of queries that scan the entire database.
We begin this paper by briefly reviewing the current state of image retrieval. In Section 2 we outline the segmentation algorithm, region descriptors, and querying system. In Section 3 we discuss indexing. In Section 4 we present experiments which examine the performance of querying and indexing in Blobworld. We conclude with a brief discussion. For more details than can be presented here due to space constraints, see the complete version of this paper [3] at http://www.cs.berkeley.edu/~carson/papers/visual99.html.

1.1 Related Work
Image retrieval systems based primarily on low-level image features include IBM’s Query by Image Content (QBIC) [6], Photobook [15], Virage [7], VisualSEEk [18], and Chabot [14]. Lipson et al. [12] retrieve images based on spatial and photometric relationships within and across simple image regions derived from low-resolution images. Jacobs et al. [11] use multiresolution wavelet decompositions to perform queries based on iconic matching. Ma and Manjunath [13] perform retrieval based on segmented image regions. Their segmentation requires some parameter tuning and hand pruning of regions. Much research has gone into dimensionality reduction [8] and new index trees [17,19] to cope with the high dimensionality of indices built over color histograms. Work to date has focused on indexing the entire image or user-defined sub-regions, not on indexing automatically created image regions. Our indexing methods are based on those used in QBIC [8].
2 Blobworld
The Blobworld representation is related to the notion of photographic or artistic scene composition. Blobworld is distinct from color-layout matching as in QBIC [6] in that it is designed to find objects or parts of objects; each image
is treated as an ensemble of a few “blobs” representing image regions which are roughly homogeneous with respect to color and texture. Each blob is described by its color distribution and mean texture descriptors. Details of the segmentation algorithm may be found in [2].

2.1 Grouping Pixels into Regions
Each pixel is assigned a vector consisting of color, texture, and position features. The three color features are the coordinates in the L*a*b* color space; we smooth these features in the image to avoid oversegmentation arising from local color variations due to texture. The three texture features are contrast, anisotropy, and polarity, extracted at an automatically selected scale. The position features are simply the (x, y) position of the pixel; including the position generally decreases oversegmentation and leads to smoother regions. We model the distribution of pixels in this 8-D space using mixtures of two to five Gaussians. We use the Expectation-Maximization algorithm [4] to fit the mixture of Gaussians model to the data. To choose the number of Gaussians that best suits the natural number of groups present in the image, we apply the Minimum Description Length (MDL) principle [16]. Once a model is selected, we perform spatial grouping of connected pixels belonging to the same color/texture cluster. Figure 1 shows the segmentation of two sample images.
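A rough sketch of this grouping step is given below, assuming the per-pixel 8-D feature vectors have already been computed. scikit-learn's GaussianMixture (EM) stands in for the paper's EM implementation, and BIC is used here only as a stand-in for the MDL model-selection criterion; the spatial grouping of connected pixels would follow as a separate step.

```python
# Sketch of grouping pixels by EM over a mixture of Gaussians, assuming
# `features` is an (H*W, 8) array of color/texture/position features.
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_pixels(features, height, width, k_range=(2, 3, 4, 5)):
    best_labels, best_score = None, np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=0).fit(features)
        score = gmm.bic(features)          # lower is better, analogous to MDL
        if score < best_score:
            best_score = score
            best_labels = gmm.predict(features)
    return best_labels.reshape(height, width)  # per-pixel cluster map
```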
2.2 Describing the Regions
We store the color histogram over the pixels in each region. This histogram is based on bins with width 20 in each dimension of L*a*b* space. This spacing yields five bins in the L* dimension and ten bins in each of the a* and b* dimensions, for a total of 500 bins. However, not all of these bins are valid; only 218 bins fall in the gamut corresponding to 0 ≤ {R, G, B} ≤ 1. To match the color of two regions, we use the quadratic distance between their histograms x and y, d²_hist(x, y) = (x − y)^T A (x − y) [8]. A = [a_ij] is a symmetric matrix of weights between 0 and 1 representing the similarity between bins i and j based on the distance between the bin centers; adjacent bins have a weight of 0.5. This distance measure allows us to give a high score to two regions with similar colors, even if the colors fall in different histogram bins. For each blob we also store the mean texture contrast and anisotropy.
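The quadratic-form distance above can be written directly as a small function; the weight matrix A is assumed to be precomputed from the L*a*b* bin-center distances as described (1 on the diagonal, 0.5 for adjacent bins).

```python
# Sketch of the quadratic form distance d^2(x, y) = (x - y)^T A (x - y)
# between two 218-bin color histograms; A is assumed precomputed from the
# distances between bin centers.
import numpy as np

def quadratic_histogram_distance(x, y, A):
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ A @ d)
```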
Fig. 1. Sample Blobworld representations. Each blob is shown as an area of constant color.
2.3 Querying in Blobworld
The user composes a query by submitting an image in order to see its Blobworld representation, selecting the relevant blobs to match, and specifying the relative importance of the blob features. We return the best matching images, indicating for each image which set of blobs provided the highest score; this information helps the user refine the query. After reviewing the query results, the user may change the weights or may specify new blobs to match and then issue a new query. (The details of the ranking algorithm may be found in [3].)
3 Indexing
Indices allow the computer to find images relevant to a query without looking at every image in the database. We used R*-trees [1], index structures for data representable as points in an N-dimensional space. R*-trees are not the state of the art for nearest-neighbor search [10] in multiple dimensions; using a newer tree [17,19] would likely speed up the index by a constant factor. However, our basic observations are independent of this index tuning: (i) indexing over blobs can decrease query time without significantly reducing quality, and (ii) indices over blobs can perform better than whole-image indices. We used the GiST framework [9] to experiment with the indices.
Shorter paths from root to leaf in an index tree lead to fewer disk accesses to reach the leaves, and thus faster data retrieval. Node fanout, the number of data entries that can fit in a node (disk page), dictates tree height. High dimensional data requires large data entries and thus low fanout and slow index retrieval. At sufficiently high dimensions fanout becomes so low that query speed using the index is worse than simply scanning the entire database. To avoid this, we need a low-dimensional approximation to the full color feature vectors. Computing the full distance d(x, y) = ((x − y)^T A (x − y))^{1/2} would require storing the entire 218-dimensional histogram and performing the full matrix-vector multiplication. To reduce the storage and computation in the index, we use Singular Value Decomposition to find A_k, the best rank-k approximation to the weight matrix A. We then project x and y into the subspace spanned by the rows of A_k, yielding x_k and y_k. The Euclidean distance ((x_k − y_k)^T (x_k − y_k))^{1/2} is a lower bound on the full distance. Since the singular values σ_{k+1}, . . . , σ_218 are small for our A, this bound is tight. We can thus index the low-dimensional x_k’s and use the Euclidean distance in the index without introducing too much error.
The index aims to match the quality of full queries. We would like the index to retrieve exactly the images that the full query ranks as the best matches. However, there is a quality/time tradeoff: as the index returns more images, the final query results will get better, but the query will take longer.
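One standard way to realize such a projection is sketched below; the exact construction in the paper may differ, but taking the top-k eigen/singular directions of the symmetric, positive-semidefinite weight matrix A gives projected vectors whose Euclidean distance lower-bounds the full quadratic-form distance.

```python
# Hedged sketch of a rank-k projection for indexing. P maps a 218-bin
# histogram to k dimensions so that Euclidean distance in the index never
# exceeds the full quadratic-form distance.
import numpy as np

def make_projector(A, k):
    U, s, _ = np.linalg.svd(A)                       # A symmetric PSD
    return np.sqrt(s[:k])[:, None] * U[:, :k].T      # shape (k, 218)

def project(P, hist):
    return P @ hist                                  # k-D vector for the index

# ||P x - P y||^2 = (x - y)^T U_k S_k U_k^T (x - y) <= (x - y)^T A (x - y),
# so candidates pruned by the index cannot be closer than the bound suggests.
```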
4 Experiments
In order to understand the performance of the Blobworld system, we explored several questions:
– What is the precision of Blobworld queries compared to queries using global color and texture histograms? More specifically, for which queries does Blobworld do better, and for which do global histograms do better?
– How well do the indexing results approximate the results from a full scan of the collection? (According to this measure, false positives returned by the full scan should also be returned by the index.) Here we must consider the dimensionality of the indexed data as well as how indices over blobs compare to indices over whole image histograms.
– What do we lose by using an index instead of a full scan? That is, how does the precision of the indexed query compare to the full-scan precision?
We explore each of these questions in turn in the next three sections.

4.1 Comparison of Blobworld and Global Histograms
We expected that Blobworld querying would perform well in cases where a distinctive object is central to the query. In order to test this hypothesis, we performed 50 queries using both Blobworld and global color and texture histograms. We compared the Blobworld results to a ranking algorithm that used the global color and texture histograms of the same 10,000 images. The color histograms used the same 218 bins as Blobworld, along with the same quadratic distance. For texture histograms, we discretized the two texture features into 21 bins each. In this global image histogram case, we found that color carried most of the useful information; varying the texture weight made little difference to the query results. For each of ten categories we performed five queries using one blob, two blobs, and global histograms. (We used the same feature weights for all queries in a category.) In Figure 2 we plot the average precision for queries in the tiger, cheetah, zebra, and airplane categories; the differences between Blobworld and global histograms show up most clearly in these categories. The results indicate that the categories fall into three groups:
distinctive objects: The color and texture of cheetahs, tigers, and zebras are quite distinctive, and the Blobworld query precision was higher than the global histogram query precision.
distinctive scenes: For most of the airplane images the entire scene is distinctive, but the airplane region itself has quite a common color and texture. Global histograms did better than Blobworld in this category. (We have added an option to allow the user to use the entire background in place of the second blob. Using the background improves the Blobworld performance on these distinctive-scene queries, since it avoids matching, for example, the small regions of sky found in thousands of images in the database.)
other: The two methods perform comparably on the other six categories: bald eagles, black bears, brown bears, elephants, horses, and polar bears. Blobs with the same color and texture as these objects are common in the database, but the overall scene (a general outdoor scene) is also common, so neither
Blobworld nor global histograms has an advantage, given that we used only color and texture. However, histograms can be taken no further, while Blobworld has much room left for improvement. For example, the shapes of animals and airplanes are quite distinctive, and by segmenting the image we have some prospect of describing object shape in a meaningful way. (Using the background improves the Blobworld performance in some of these categories as well.) These results support our hypothesis that Blobworld yields good results when querying for distinctive objects.
[Fig. 2 plot panels, left to right: tigers, cheetahs, zebras, planes; vertical axis: precision; horizontal axis: # retrieved; curves: 1 blob, 2 blobs, histograms.]
Fig. 2. Average precision (fraction of retrieved images which are correct) vs. number of images retrieved for several query types. Solid lines represent full queries; dashes represent indexed queries. The index tracks the corresponding full query quite closely, except for the zebra case, where the indexed Blobworld precision is lower than the full Blobworld precision because we only index color, not texture.
4.2 Comparison of Indexed to Full Queries over Multiple Index Dimensionalities
Index speed improves as the number of dimensions in the index decreases; therefore, we want to find the minimum number of dimensions that the index may use. Query speed improves as the number of images the index must fetch and the full Blobworld algorithm must rank decreases; therefore, we want to find the minimum number of images that the index may return. However, we want to ensure that a query using the index produces nearly the same results as a full Blobworld query. We are also interested in how blob index performance compares to global histogram index performance. For simplicity, we compared the indices based on color feature vectors alone. We measured the recall of the indices using nearest-neighbor search to retrieve and rank images against the top 40 images retrieved by a full Blobworld query or global histogram query over all the images. Figure 3 shows that in the low-dimensional case recall for the blob indices is higher than for the global histogram indices; blob indices approximate the results of full Blobworld queries better than global histogram indices approximate the results of global histogram queries. We
believe this occurs because the blob color histograms, which are derived from relatively uniformly colored regions, cluster better than the global histograms. Retrieving just a few hundred blobs from the five-dimensional blob index gives us most of the images the full Blobworld query ranked highest. Therefore, for the remaining experiments we use the five-dimensional indices.
[Fig. 3 plot: average recall (0 to 1) vs. number of objects (blobs or images) retrieved from index (0 to 800); curves: Blobworld (5D), Blobworld (20D), histograms (5D), histograms (20D).]
Fig. 3. Recall of (1) blob index compared to the top 40 images from the full Blobworld query and (2) global histogram index compared to the top 40 images from the full whole image query. The plots are the average of 200 queries over a database of 10,000 images, or about 61,000 blobs. Results using five dimensions and twenty dimensions are shown.
4.3 Precision of Indexed and Full Queries
We also wanted to test the behavior of the indexing schemes in terms of precision measured against ground truth. In essence, we wanted to see how indexing affects the quality of Blobworld query results. We performed the same queries as in Section 4.1, using the index to reduce the number of “true” comparisons required. We passed the query to an index using the five-dimensional projection and retrieved the nearest 400 database objects (400 blobs for Blobworld, 400 images for global histograms). When indexing two-blob queries, we retrieved the nearest 400 matches to each of the two blobs and returned the union of the two result sets. Figure 2 indicates that the precision of the indexed results (reordered using the “true” matching algorithm) closely mirrors the precision of the full query results. As previously stated, this is the quality goal for the index. Simple timing tests indicate that indexed Blobworld queries run in a third to half of the time of the full query. As the number of images in the collection increases, this speed advantage will become even greater.
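The retrieve-then-re-rank scheme can be sketched as follows; the nearest-neighbour query over the projected 5-D vectors (`index_query`) is an assumed placeholder for the R*-tree/GiST machinery, not its actual interface.

```python
# Sketch of indexed querying: fetch a few hundred candidates by Euclidean
# distance over the projected vectors, then re-rank them with the full
# quadratic-form distance.
import numpy as np

def indexed_query(query_hist, P, A, index_query, db_histograms,
                  n_candidates=400, top_k=40):
    q_low = P @ query_hist
    candidate_ids = index_query(q_low, n_candidates)  # ids of nearest vectors

    def full_dist(i):
        d = query_hist - db_histograms[i]
        return float(d @ A @ d)

    ranked = sorted(candidate_ids, key=full_dist)     # "true" re-ranking
    return ranked[:top_k]
```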
5 Conclusions
The effectiveness of an image retrieval system is determined by three factors: the time to perform an individual query, the quality of the query results, and the ease of understanding the query results and refining the query. We have
shown that Blobworld queries for distinctive objects provide high precision and understandable results because Blobworld is based on finding coherent image regions. We have also shown that Blobworld queries can be indexed to provide fast retrieval while maintaining precision.
Acknowledgments This work was supported by an NSF Digital Library Grant (IRI 94-11334), NSF graduate fellowships for Serge Belongie and Chad Carson, and fellowship stipend support for Megan Thomas from the National Physical Science Consortium and Lawrence Livermore National Laboratory.
References 1. Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM-SIGMOD Int’l Conf. on Management of Data (1990) 322–331 512 2. Belongie, S., Carson, C., Greenspan, H., Malik, J.: Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In Proc. Int. Conf. Comp. Vis. (1998) 509, 511 3. Carson, C., Thomas, M., Belongie, S., Hellerstein, J., Malik, J.: Blobworld: A system for region-based image indexing and retrieval. Technical Report UCB/CSD99-1041. 510, 512 4. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc., Ser. B, 39 (1977) 1–38 511 5. Enser, P.: Query analysis in a visual information retrieval context. J. Doc. and Text Management, 1 (1993) 25–52 509 6. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., et al: Query by image and video content: The QBIC system. IEEE Computer, 28 (Sept. 1995) 23–32 510 7. Gupta, A., Jain, R.: Visual information retrieval. Comm. Assoc. Comp. Mach., 40 (May 1997) 70–79 510 8. Hafner, J., Sawhney, H., Equitz, W., Flickner, M., Niblack, W.: Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. Pattern Analysis and Machine Intelligence, 17 (July 1995) 729–736 510, 511 9. Hellerstein, J. M., Naughton, J., Pfeffer, A.: Generalized search trees for database systems. In Proc. 21st Int. Conf. on Very Large Data Bases (1995) 562–573 512 10. Hjaltason, G., Samet, H.: Ranking in spatial databases. In Proc. 4th Int. Symposium on Large Spatial Databases (1995) 83–95 512 11. Jacobs, C., Finkelstein, A., Salesin, D.: Fast multiresolution image querying. In Proc. SIGGRAPH (1995) 510 12. Lipson, P., Grimson, E., Sinha, P.: Configuration based scene classification and image indexing. In Proc. IEEE Comp. Soc. Conf. Comp. Vis. and Patt. Rec., (1997) 1007–1013 510 13. Ma, W., Manjunath, B.: NeTra: A toolbox for navigating large image databases. In Proc. IEEE Int. Conf. on Image Proc., (1996) 568–571 510 14. Ogle, V., Stonebraker, M.: Chabot: Retrieval from a relational database of images. IEEE Computer, 28 (Sept. 1995) 40–48 510
15. Pentland, A., Picard, R., Sclaroff, S.: Photobook: Content-based manipulation of image databases. Int. J. Comp. Vis., 18 (1996) 233–254 510 16. Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific (1989) 511 17. Berchtold, D. K. S., Kriegel, H.: The x-tree: An index structure for highdimensional data. In Proc. of the 22nd VLDB Conference (1996) 28–39 510, 512 18. Smith, J. R., Chang, S.-F.: Single color extraction and image query. In Proc. IEEE Int. Conf. on Image Processing (1995) 528–531 510 19. White, D., Jain, R.: Similarity indexing with the ss-tree. In Proc. 12th IEEE Int’l Conf. on Data Engineering (1996) 516–523 510, 512
A Physics-Based Approach to Interactive Segmentation Bruce A. Maxwell Swarthmore College, Dept. of Engineering Swarthmore, PA 19081, USA (610)328-8081 (phone) (610)328-8082 (fax) [email protected]
Abstract. Interactive segmentation for image manipulation and composition is becoming increasingly important as digital imaging devices become ubiquitous. Users want to select objects in one image to put them in another. This paper proposes a method for selecting image regions that are likely to correspond to multicolored objects rather than just regions of similar color and/or texture. The method is based upon a physical analysis of neighboring regions using the reflectance ratio along the region borders. This measure provides an illumination and geometry invariant cue to scene coherence between regions of different color. Using this measure and a user-selected seed region, the computer can automatically or interactively grow the region to include neighboring regions of different color that are likely to be part of the same object.
1 Introduction
The tremendous amount of digital imagery now available to anyone with a computer has significantly increased the need for both automatic and interactive image manipulation tools. One of the most important steps in image manipulation is segmentation: selecting coherent areas of the image for manipulation and modification. Automatic segmentation tools have been studied and developed within the computer vision community since its inception. Since computer vision is focused on the use of images by computers, however, interactive segmentation techniques have only recently been developed, mostly in the graphics and medical imaging community where image manipulation for use by people is a primary goal (see, for example [4]).
Previous work on interactive segmentation falls into three categories: boundary following, active contour and volume models, and region growing. Boundary following methods interact with the user as he or she traces a region of interest in an image. The computer may learn the boundary features from the user and/or use precomputed features to snap the traced boundary to the edge of the desired image region. An example of boundary following is the intelligent scissors method of Mortensen and Barrett [10]. Active contour models generally require the user to provide an approximate boundary tracing. Then the contour fits itself onto the image according to the interactions of the model parameters and calculated image features, such as gradient magnitude. Examples of active contour models are snakes in 2D, and balloons in 3D [12][4].
Finally, region growing methods, such as the Adobe Photoshop™ magic wand tool, allow the user to select a seed region and then the computer expands the user-selected region according to a calculated similarity measure [1]. Similarity is normally based on color and/or small-scale texture features, although both Nayar and Bolle and Gevers and Smeulders have shown that the reflectance ratio/color ratio is useful for identifying regions of similarity and region boundaries [11][7]. Each of these methods has benefits and drawbacks. Region growing methods do not require the user to trace a boundary, but they can only select regions that have similar appearance. Therefore, they will not adequately segment a multi-colored object, for example, without further user interaction. Boundary following methods, on the other hand, will allow the user to select a multi-colored area of an image so long as the boundary function is adaptive along its length or the differently colored areas have similar boundary features. Active contours allow the user to provide approximate shape information, but rely heavily on computed features which may vary significantly across a multi-colored region. The active contour method is also harder than the other two methods to turn into an interactive process beyond the initial approximate boundary tracing.
This paper presents a new method of interactive segmentation that is geared towards selecting single or multi-colored coherent surfaces. By using a physical analysis of adjacent regions, the computer hypothesizes whether two adjacent regions are part of the same coherent surface in the scene. Combining this physical analysis with standard region growing segmentation methods allows the computer to grow a region that covers a coherent surface even though that surface contains multiple colors. The user can interact with this process in at least two ways. First, they can select adjacent regions and indicate they are part of the same surface. Second, the computer can query the user when it adds a new section of the image to the selected region to make sure the addition is correct. This process moves interactive segmentation beyond the limitations of current region growing methods and allows the user to select coherent surfaces that will more often correspond to the object of interest rather than a region of similar color or texture.
2 Analysis of Piece-wise Uniform Surfaces
For the computer to make a decision about whether neighboring regions are part of the same surface it must somehow estimate their similarity in the scene. Coherence between adjoining surfaces in a scene, however, does not necessarily map to identifiable coherence in an image. Likewise, incoherent surfaces in a scene do not necessarily map to incoherent regions in an image. As an example of the former, consider a sphere that is half green plastic and half a shiny metal. While the sphere forms a coherent surface in the scene, its projection into an image does not necessarily reflect this coherence. Likewise, a red ball sitting on a red plane will display coherence in color across both regions despite the fact they are incoherent surfaces.
For certain classes of objects we can, however, make predictions about coherence in the scene by analyzing relationships in the image. In particular, we can use physical analysis of appearance to examine surfaces that are either piece-wise uniform, or that vary slowly relative to their image projection. Nayar and Bolle, for example, have
developed the reflectance ratio as an illumination and shape invariant measure of relative albedo [11]. A similar measure, the color ratio, has been used by Gevers and Smeulders for illumination and geometry invariant segmentation and recognition [6]. Breton and Zucker have shown that the shading flow field, which is related to the direction of the gradient of image intensity, is related to surface coherence [3]. Finally, Maxwell and Shafer formulated a test of surface coherence based on three measures: reflectance ratio, direction of the gradient of image intensity, and the smoothness of the image profile across adjacent regions [9].
This work on interactive segmentation is based upon the reflectance ratio similarity measure used by Maxwell and Shafer [9]. At its essence, the reflectance ratio is a measure of the difference in transfer function between two pixels that is invariant to illumination and shape so long as the latter two elements are similar. Nayar and Bolle developed the concept and have shown it to be effective for both segmentation and object recognition [11], as have Gevers and Smeulders [6]. The reflectance ratio was originally defined for intensity images and measures the ratio in albedo between two points. The albedo of a point is the percentage of light reflected by the surface. The principle underlying the reflectance ratio is that two nearby points in an image are likely to be nearby points in the scene. Therefore, they most likely possess similar illumination environments and geometric characteristics, as shown in Fig. 1(a). Nayar and Bolle represent the intensity I of a pixel in an image as
I_i = k ρ_i R(s, v, n)    (1)
where k represents the sensor response and light source intensity, ρ is the albedo of the surface, and R(s, v, n) is a scattering function representing the geometry dependent aspects of the transfer function and exitant illumination field.
Fig. 1: (a) Two nearby points. Note that the geometry is approximately the same for both points. The ratio of the intensities is, therefore, a function of the albedos of the two points. (b) Pixel pairs along the border of two regions. Two regions that are part of the same object should have a constant reflectance ratio along their boundary.
If p1 and p2 are nearby points in a scene, then the geometry dependent functions should be similar, as will the light source brightness and sensor response. Therefore, if we take the ratio of the intensities of two nearby pixels in an image, then we obtain the ratio of albedos, or the reflectance ratio π, as shown in (2).
π = I_1 / I_2 = (k ρ_1 R(s, v, n)) / (k ρ_2 R(s, v, n)) = ρ_1 / ρ_2    (2)
Multiple light sources do not affect the reflectance ratio so long as the assumption of similar geometries and illumination environments holds. A well-behaved version of the reflectance ratio is the difference in intensities divided by their sum, as in (3) [11]:
r = (I_1 − I_2) / (I_1 + I_2)    (3)
Note that the geometry and sensor dependent terms still cancel out. Unlike the ratio in (2), however, this measure of the reflectance ratio ranges from [-1,1]. Given this measure of the difference between the transfer functions of adjacent pixels, how do we use it to measure surface coherence? The basic idea is to look for constant reflectance ratios along region boundaries. If the reflectance ratio along the boundary connecting two regions of different intensity is not constant, then either the shape or illumination are incompatible in addition to the regions’ colors. The algorithm is as follows. As shown in Fig. 1(b), for each border pixel p1i in h1 that borders on h2 the algorithm finds the nearest pixel p2i in h2 and calculates the reflectance ratio between them if both pixels are equal to or brighter than a dark threshold drr. Since the reflectance ratio is designed to work on intensity images, not color images, the algorithm uses the pixel intensity values. Summing up the ratios of all of the sufficiently bright pixel pairs the algorithm calculates the mean reflectance ratio ravg. It then finds the variance of the reflectance ratio along the border. If the regions belong to the same object, the reflectance ratio should be the same for all pixel pairs (p1i,p2i) along the h1,h2 border, regardless of the shape or illumination. The variance is a simple measure of constancy. If h1 and h2 are part of the same object, this variance should be small, with variation existing because of the quantization of pixels, noise in the image, and small-scale texture in the scene. If, however, h1 and h2 are not part of the same object, then the illumination and shape are not guaranteed to be similar for each pixel pair, violating the assumption underlying the reflectance ratio. This should result in a larger variance. We can differentiate between these two cases using a threshold variance and chi-square test. Region pairs with a small variance may be coherent surfaces; region pairs with large variances are not coherent surfaces. This test, therefore, will rule out merging some region pairs and build confidence for merging others [9]. It is important to note that not all discontinuous surfaces will produce a significant variance in the reflectance ratio. However, the expectation is that for most camera positions the reflectance ratio between two different surfaces will vary. The algorithm uses a variance of 0.008, or a standard deviation of 0.09, as a threshold for all of the cases in this experiment. This is approximately 4.5% of the range [-1,1].
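A compact sketch of this border test follows, assuming the border pixel pairs (p1i, p2i) have already been found and that intensities are non-negative. The variance threshold 0.008 follows the text; the dark threshold value used here is an assumption.

```python
# Sketch of the reflectance-ratio coherence test along a shared border.
# `pairs` is a list of (I1, I2) intensity pairs for border pixel pairs;
# pairs darker than `dark_thresh` are skipped.
import numpy as np

def border_coherent(pairs, dark_thresh=20, var_thresh=0.008):
    ratios = [(i1 - i2) / (i1 + i2)            # well-behaved ratio, eq. (3)
              for i1, i2 in pairs
              if i1 >= dark_thresh and i2 >= dark_thresh]
    if not ratios:
        return False                            # nothing bright enough to judge
    variance = float(np.var(ratios))            # constancy of the ratio
    return variance < var_thresh                # small variance: likely coherent
```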
3 Heuristic Segmentation Tools
In addition to the physical analysis, we can use heuristic region merging techniques to improve the results for most cases. Beveridge et al., for example, proposed several useful heuristic measures based on region connectivity and region size [2]. For this work, we used their connectivity score and our own size-based heuristic.
The connectivity score is based on the idea that regions that share a large percentage of their boundary are more likely to be connected than regions that share a small percentage. This heuristic is quite useful with the reflectance ratio estimate, as small segments of a border are likely to have little variation in the reflectance ratio along that border simply because of its length. The connectivity score is based upon the expression in (4), which says that the connectivity score increases with increasing shared boundary:
conn(a, b) = 4 × shared(a, b) / min(length(a), length(b))    (4)
The connectivity score is then bounded in the range [0.5, 1.0], which means that any shared boundary greater than 25% of the smaller region is not penalized.
The size heuristic for the interactive segmentation task is based on the idea that regions should not be too large: in general we don’t want to select more than half the picture if we are trying to select objects in a scene. The purpose is to stop the segmentation tool from selecting the background. The size heuristic is based upon the relationship given in (5), bounded to the range [0.5, 1.0]:
size(a, b) = 1 − sqrt((size(a) + size(b)) / size(image))    (5)
This relationship says that increasing size increases the penalty for merging. The penalty flattens out for any region above half the size of the image. The square root ensures that medium and small regions are not heavily penalized. With the physical and heuristic merging techniques in place, we can now turn to the interactive segmentation and object selection process.
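The two heuristic scores can be written down directly, as sketched below. The clamping to [0.5, 1.0] follows the text; the square-root form of (5) is a reconstruction from the description, so treat its exact placement as an assumption.

```python
# Sketch of the connectivity and size heuristics, clamped to [0.5, 1.0].
import math

def clamp(v, lo=0.5, hi=1.0):
    return max(lo, min(hi, v))

def connectivity_score(shared_len, len_a, len_b):
    # eq. (4): reward regions that share a large fraction of their boundary
    return clamp(4.0 * shared_len / min(len_a, len_b))

def size_score(size_a, size_b, image_size):
    # eq. (5), reconstructed: penalize merges that would cover too much image
    return clamp(1.0 - math.sqrt((size_a + size_b) / image_size))
```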
Fig. 2. Picture of a wooden stop-sign and a green plastic cup
Fig. 3. From left to right: (a) Initial seed region (black pixels), (b) Inner and outer boundaries, (c) Initial grouping of boundary pixels, and (d) final grouping of boundary pixels
Fig. 4. Next region found after the “S”
Fig. 5. Combined “S” and sign region that forms a new seed region
Fig. 6. Final regions selected by the segmentation process after clicking on the “S” (left image) and on the cup (right image)
4 Interactive Segmentation Procedure
The interactive segmentation process is as follows. The session begins when the user clicks somewhere on the surface of interest. For example, in the image of a stop-sign and cup in Fig. 2 the user might click on the “S” in the sign, hoping to select the entire stop-sign. Using this point as a seed area, the computer grows a region using normalized color as the similarity feature. This region, shown in Fig. 3a, becomes the seed region that starts an iterative process of looking for adjacent regions that might be part of the same surface.
Step two is to find a boundary for the seed region and its neighbors. This is accomplished in three stages. First, a grassfire transform of the inner region identifies all pixels that are a distance dI away from the edge of the region. These pixels form the inner border. Second, a grassfire transform of the outer region identifies all pixels that are a distance dO from the background border; these pixels form the outer border. The distance values dI and dO are user-supplied constants that are chosen to avoid pixels that overlap different regions in the scene. Finally, the two borders are related by finding adjacent pixel pairs, one from the in-region border and one from the out-region border. Adjacent pairs are found by following a line perpendicular to the tangent of the in-region border until it hits an out-region border pixel. The borders are shown in Fig. 3b.
Step three uses the reflectance ratio analysis of section 2 to divide the border into regions where the reflectance ratios are similar. This step segments the boundary into sections as in Fig. 3c. The largest of these sections corresponds to the adjacent region that shares the greatest boundary similarity, in a physical sense, with the seed region.
Step four is to grow an adjacent region using the largest exterior boundary section as a seed. In the stop-sign and cup example, this results in the red portion of the stop-sign being selected as shown in Fig. 4.
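A sketch of the border-extraction step follows, using a Euclidean distance transform in place of the grassfire transform; the boolean region mask and the distances dI and dO are assumed inputs, and this is only an approximation of step two, not the paper's implementation.

```python
# Sketch of step two: find inner and outer border pixels at distances dI and
# dO from the region edge. `region_mask` is a boolean array, True inside the
# current seed region.
import numpy as np
from scipy.ndimage import distance_transform_edt

def region_borders(region_mask, d_inner=2, d_outer=2):
    inside = distance_transform_edt(region_mask)      # distance to background
    outside = distance_transform_edt(~region_mask)    # distance to the region
    inner_border = region_mask & (np.rint(inside) == d_inner)
    outer_border = (~region_mask) & (np.rint(outside) == d_outer)
    return inner_border, outer_border
```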
After step four, we can recalculate the segmentation of the inner and outer borders to account for the new region information, resulting in the Fig. 3d. Note that the entire border between the seed region and the adjacent region is now marked as coherent. In step five, we test for coherence between the two regions to see if they are likely to be part of the same surface. This test uses the methods outlined in sections 2 and 3 to determine a match likelihood. If the two regions demonstrate significant coherence, the computer will join them together into a new seed region as shown in Fig. 5. If they do not demonstrate coherence, then the process will either stop, if there are no unexplored sections of the border, or it will repeat from step 4 and try a different adjacent region. If the two regions are joined, then the process will repeat from step one with the aggregate region. When the algorithm cannot find any more coherent adjacent regions it stops and returns the current region. Extracted regions in the stop-sign and cup example image are shown in Fig. 6.
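The overall control loop of steps one through five can be summarized in the following sketch. The helpers named here (grow_region, boundary_sections, coherent) are placeholders for the region growing, border segmentation, and reflectance-ratio/heuristic tests described above; the loop structure is an interpretation of the text, not code from the system.

```python
# High-level sketch of the interactive growing loop (steps 1-5).
def select_surface(image, seed_pixel, grow_region, boundary_sections, coherent):
    region = grow_region(image, seed_pixel)            # step 1: color-based seed
    while True:
        sections = boundary_sections(image, region)    # steps 2-3: split border
        merged = False
        for section in sorted(sections, key=len, reverse=True):
            neighbor = grow_region(image, section)     # step 4: grow a neighbor
            if coherent(image, region, neighbor):      # step 5: coherence test
                region = region | neighbor             # join and start over
                merged = True
                break
        if not merged:
            return region                              # no coherent neighbors left
```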
5 Results and Conclusion
The results shown in Fig. 6 demonstrate that this approach provides a segmentation corresponding more closely to objects in the scene than methods based primarily on color and/or texture. Fig. 7 shows a more complex scene, where the algorithm is able to extract the plastic Lego™ object from a colored planar background. The key to making this approach more robust is twofold. First, as shown in [9], using three measures of similarity is more robust than using just the reflectance ratio. Second, in order to make the region’s edges cleaner a method such as snakes or the intelligent scissors method of [10] could be given the region’s boundary as an input.
Fig. 7. Selecting an object against a complex background. The region on the right was selected by the computer after clicking in the green plastic lego.
The strength of an approach based on a physical analysis of the scene is that it gives the computer a scene-based estimation of similarity which corresponds more closely to how people want to select and manipulate the world.
References
[1] Adobe Systems, Inc., Photoshop™ 4.0, computer program, 1997.
[2] J. R. Beveridge, J. Griffith, R. Kohler, A. Hanson, and E. Riseman, “Segmenting Images Using Localized Histograms and Region Merging,” Int’l J. of Computer Vision, 2, 1989, pp. 311-47.
[3] P. Breton and S. W. Zucker, “Shadows and Shading Flow Fields,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June 1996, pp. 782-789.
[4] S. Cagnoni, A. B. Dobrzeniecki, J. C. Yanch, and R. Poli, “Interactive segmentation of multi-dimensional medical data with contour-based application of genetic algorithms”, in Proceedings of IEEE Int’l Conference on Image Processing, pp. 498-502, 1994.
[5] L. D. Cohen, “On Active Contour Models and Balloons”, CVGIP: Image Understanding, vol. 52, no. 2, pp. 211-218, March 1991.
[6] T. Gevers and A. W. M. Smeulders, “Color Based Object Recognition”, in Proceedings of ICIAP, Florence, Italy, 1997.
[7] T. Gevers and A. W. M. Smeulders, “Edge Steered Region Segmentation by Photometric Color Invariant”, in Proceedings of SCIA, Lappeenranta, Finland, 1997.
[8] L. Lapin, Probability and Statistics for Modern Engineering, PWS Engineering, Boston, 1983.
[9] B. A. Maxwell and S. A. Shafer, “Physics-Based Segmentation: Moving Beyond Color,” in Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, 1996.
[10] E. N. Mortensen and W. A. Barrett, “Intelligent Scissors for Image Composition”, in Proceedings of SIGGRAPH ’95, pp. 191-198, 1995.
[11] S. K. Nayar and R. M. Bolle, “Reflectance Based Object Recognition,” International Journal of Computer Vision, 1996.
[12] D. Terzopoulos, A. Witkin, and M. Kass, “Constraints on Deformable Models: Recovering 3D Shape and Nonrigid Motion”, Artificial Intelligence, 36, pp. 91-123, 1988.
Assessment of Effectiveness of Content Based Image Retrieval Systems Alexander Dimai Communications Technology Laboratory Swiss Federal Institute of Technology, ETH CH - 8092 Zurich, Switzerland [email protected]
Abstract. Performance evaluation is an often neglected topic in content based image retrieval research. Commonly used evaluation measures for effectiveness assessment are reviewed and their shortcomings are discussed. A new evaluation measure is proposed which overcomes some of the shortcomings of existing evaluation measures. The presented performance measure is especially suited for rank based retrieval methods. The measure is obtained fully automatically and can be exploited to investigate different criteria of effectiveness in detail. For better comparison of performance differences, the measure provides a confidence and credibility measure, too. Experiments were carried out to study the properties of the evaluation method.
1 Introduction
The rapid expansion of computer networks and the dramatically falling cost of data storage are making multimedia databases increasingly common. Digital information in the form of images, music, and video is quickly gaining in importance for business and entertainment. Consequently, the growth of multimedia databases creates the need for more effective and efficient search and access techniques, especially of image data. In the last few years large research efforts were undertaken to develop efficient content based image retrieval (CBIR) systems, and a large number of different solutions and algorithms were proposed. However, the comparison of the different contributions is difficult due to the lack of widely accepted quality assessments of the systems. Nevertheless, reliable evaluation methods are required for effective and efficient research. Furthermore, only powerful evaluation methods can ensure that high quality standards for content based descriptors, such as MPEG-7, are met. The evaluation of retrieval algorithms is a very complex task. This complexity is one of the reasons why there is a never ending debate on this issue in information retrieval communities. However, in the CBIR community this discussion seems to be neglected completely. This paper takes up and discusses this important issue. As a contribution to the discussion, a new performance measure is presented which is effective and efficient for comparing rank based image retrieval algorithms.
2
What Should Be Evaluated?
Evaluating a system requires several steps. The following section focuses on two important ones: first, the evaluation criteria have to be defined precisely, and secondly, the assessment measure has to be defined. The first step seems obvious, but in several works a clear definition of the evaluation criteria is missing. Such a clear statement is crucial because the quality of a system can be judged by a large number of different criteria. The criteria for a retrieval system can be grouped into four classes:
Effectiveness: relevance, usefulness of retrieval discrimination, distinctiveness, stability under changes of the query, scalability with increasing data, completeness of results, complexity of query formulation, etc.
Efficiency: time for searching, time for delivering, time for index generation, insertion and deletion time, scalability with the number of requests, storage space for the index, time for generating a query, etc.
Flexibility: suitability for applications, application adaptivity, environment adaptivity, etc.
Others: error resilience, user guidance, presentation of results, etc.
Each class has several sub-criteria, and each of these sub-criteria requires a separate evaluation to obtain an overall evaluation of a system (for an interesting discussion on this topic see [5,7]). Furthermore, several of the sub-criteria can be divided further. For example, the ability of an algorithm to retrieve images can be defined based on the low-level (or visual) content or the high-level (or conceptual) content of an image. A general quality assessment of a retrieval algorithm requires investigations of all these criteria to obtain a precise profile of the strengths and weaknesses of the algorithm. Such profiles can then be used to efficiently select suitable retrieval algorithms for specific application domains. The second step in the evaluation process is to define the assessment measure, which obviously depends on the evaluation criterion. For example, to measure the searching time, an obvious assessment measure is the time in seconds; in such a case, the evaluation measure is relatively unproblematic. Unfortunately, for the majority of the listed criteria such a simple measure cannot be found. For most of them the assessment must be done by subjects, who assess the performance of the system according to the evaluation criteria. For example, subjects have to compile sets of relevant data to determine the recall and precision of a system. Unfortunately, such assessments are generally very labour- and time-intensive for the subjects and researchers. Even worse, the results of such manually generated assessments are often not very reproducible. The time intensity and the poor reproducibility are certainly important reasons why extensive evaluations of CBIR systems are rarely found in the literature.
Fig. 1. Example of a precision-recall graph. Currently, performance differences of models are detected based on such graphs. Right: comparison of the effectiveness of descriptor models A (dotted line) and B (solid line); the graph shows that model A is more effective than model B. Left: comparison of the stability of descriptor models; based on this graph it is difficult to judge whether model A (dotted line) is more stable than model C (solid line). The presented performance measure µ is able to summarize the conclusions drawn from each graph in a single number, as explained in Fig. 2.
3
Existing Methods for Evaluation of Effectiveness
In the following, the assessment of effectiveness, especially the aspect of retrieving relevant data, is studied in more detail. Common assessment measures for this evaluation are precision and recall. Assume the system retrieves n images for a given query, and from these retrieved images only r are relevant with regard to the query. Precision is then defined as the ratio of the number of retrieved relevant data to the number of all retrieved data, i.e. P = r/n. Recall is the ratio of the number of retrieved relevant data to the total number m of relevant data existing in the data set, i.e. R = r/m. Both numbers take on values between 0 and 1, and large values indicate a good performance of the system. The first shortcoming of these numbers is that they are inversely correlated, which makes it difficult to compare systems based on the two numbers. To adapt these measures to CBIR systems which are based on ranked lists, e.g. [9] [8], the cutoff number n is chosen by the operator. In practice the number is chosen such that a result of n images can be scrolled and viewed conveniently. However, the measures are very sensitive to the choice of the cutoff number n. Therefore, the two measures are often computed for different values of n and are then plotted against each other. The graphical results are the so-called precision-recall graphs, as shown in Fig. 1. This simple adaptation of the measures to rank-based methods completely neglects the order of the ranking of the data. To overcome this negligence of the rankings, other measures were proposed: normalized recall and precision using the rank of the relevant data are proposed in [6], and normalized rank sums are used in [8]. However, all recall measures require the total number of relevant data, which is rarely given or very tedious to obtain. Another shortcoming of these measures is that they do not provide any confidence or credibility measure of their significance. Therefore, it is often impossible to judge whether differences in the measures are significant or not.
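As a minimal illustration of these definitions (the function names and the example lists are invented for this sketch, not taken from any particular CBIR system), precision and recall at a cutoff n, and the points of a precision-recall graph, can be computed as follows:

    def precision_recall_at(ranked_ids, relevant_ids, n):
        """Precision and recall after cutting a ranked result list at n items."""
        retrieved = ranked_ids[:n]
        r = sum(1 for img in retrieved if img in relevant_ids)  # number of retrieved relevant items
        return r / n, r / len(relevant_ids)                     # (P, R); recall needs the full relevant set

    def precision_recall_curve(ranked_ids, relevant_ids):
        """One (precision, recall) point per cutoff n = 1..len(ranked_ids)."""
        return [precision_recall_at(ranked_ids, relevant_ids, n)
                for n in range(1, len(ranked_ids) + 1)]

    # Example: of the 5 top-ranked images, 3 belong to the relevant set of size 5.
    ranked = ["img7", "img3", "img9", "img1", "img5", "img2", "img8", "img4", "img6", "img0"]
    relevant = {"img3", "img1", "img5", "img4", "img0"}
    print(precision_recall_at(ranked, relevant, 5))   # (0.6, 0.6)

Sweeping the cutoff n over the whole ranked list yields the points of a precision-recall graph such as the ones sketched in Fig. 1.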
4
Rank-Difference Trend Analysis
Considering all the shortcomings of precision and recall, better assessment measures are required. In the following, the so-called rank difference trend analysis (RDTA) is proposed. The RDTA overcomes some of the shortcomings of recall and precision and follows the guidelines mentioned in section 2. As explained in the following, the RDTA 1) allows one to define the evaluation criteria precisely, 2) regards the ranking of the relevant data, 3) returns only one number for the performance assessment, 4) provides a confidence and a credibility measure for judging performance differences of algorithms more reliably, 5) is insensitive to changes in the set of relevant data, and 6) does not depend on any cutoff number. The basic idea of the RDTA is to statistically compare the rankings of relevant data that are obtained by two retrieval algorithms A and B. The set of relevant data is compiled for each query image into a so-called target set. The criterion used to collect a target set is the evaluation criterion itself; therefore, through the choice of the target sets, different aspects of effectiveness can be investigated. Importantly, the definition of the target sets has to be done only once. Afterwards, the evaluation of systems can be carried out fully automatically without involving any human examiners, which ensures a high degree of reproducibility. The automatic evaluation is done in three steps. First, for each query image the algorithms A and B generate ranked lists of retrieved data. Secondly, the ranks of each image of the target set that corresponds to the query image are determined; this step is executed for all query images and results in two rank sets {r_Ai}i=1..N and {r_Bi}i=1..N for algorithms A and B, respectively. Third, the rank sets {r_Ai}i=1..N and {r_Bi}i=1..N are compared. From the statistical comparison of the rank sets, an assessment measure µ(A, B) is obtained which expresses the relative discrimination of A with respect to B. A positive value µ(A, B) signifies that A is more effective than B; conversely, a negative value signifies that A is less effective than B. The following section explains how the assessment measure µ is derived from the rank sets.
4.1
Statistical Trend Analysis
To obtain the assessment measure µ, it is assumed that the ranks of the target images are correlated and that the correlation can be approximated by a function f of the form r_Ai = f(r_Bi). Further, it is assumed that both algorithms A and B rank each query image itself at rank zero, which directly implies that the function f fulfills the condition f(0) = 0. For the comparison of two algorithms, the rankings at low ranks are especially interesting. Therefore, the linear term of the Taylor expansion of f around zero contains the most valuable information about ranking differences. Accordingly, the rank sets {r_Ai}i=1..N and {r_Bi}i=1..N are compared by linear regression, i.e. r_Ai = b · r_Bi. The parameter b can be interpreted as model A performing, on average, b times worse than model B. A linear model is fitted to the points defined by the "rank" coordinates of each target image
(r_Ai, r_Bi), i = 1..N, to determine the parameter b (see Fig. 2). However, the fitting has to take an important effect into account: for the evaluation, relevant data at high ranks are less important than relevant data at low ranks, and should therefore be weighted less. The weights for each point i are introduced by associating with each rank the variances σ_Ai and σ_Bi, which are functions of the ranks. For simplicity a linear error model was chosen, i.e. σ_Ai = a · r_Ai (e.g. a = 0.3); other error models are currently being investigated. Although the choice of the error model influences the value of the assessment measure and its confidence interval, experiments have shown that it does not alter the qualitative results of the performance measure. The chi-square measure is used as the maximum likelihood estimator, and it is straightforward to write down the χ² merit function for this case:

    χ²(b) = Σ_{i=1}^{N} (r_Ai − b · r_Bi)² / (σ_Ai² + b² · σ_Bi²)                (1)

Fig. 2. Example of rank distributions and the corresponding trend analysis. Each point in the plot represents a target set image, i.e. a relevant image, with coordinates given by the rankings obtained by algorithms A and B, respectively. The slope of the straight line is used to obtain, by reparametrisation, the assessment measure µ. Left: the performance measure µatt for comparing the effectiveness of models A and B in retrieving relevant data based on pre-attentive (or visual) similarity; the result shows that model A performs much better than B. Right: the performance measure µγ for comparing the stability of models A and C under gamma transformation; the result shows that both models are equally stable. [Plots: Rank A vs. Rank B, 0–100, with fitted trend lines; µ = 0.46 ± 0.06 and µ = 0.04 ± 0.05.]

To obtain a performance measure µ that is symmetric, i.e. µ(A, B) = −µ(B, A), and that ranges between [−1, 1], the following parametrisation is chosen: µ = (4/π) · arctan(b) − 1. With this substitution a slope of b = 1 corresponds to µ = 0, i.e. both algorithms rank, on average, equally well. Values of |µ| = 0.2 and |µ| = 0.4 indicate that one algorithm ranks on average 1.4 and 2 times better, respectively, than the other. The global minimum of χ²(µ) and its standard error have to be determined numerically (for details see [4]). Furthermore, this approach allows one to determine a goodness-of-fit measure Q. The measure Q indicates how well the linear model and the error model describe the correlation of the rank sets; without it, there would not be the slightest indication that the parameter µ has any meaning at all (see [4]).
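A minimal numerical sketch of this fit is given below. It follows equation (1) and the parametrisation of µ literally, but the optimizer, the variance floor that keeps the weights finite at rank zero, and the use of N − 1 degrees of freedom for the goodness-of-fit Q are assumptions of this sketch, not details taken from [4]; the standard error of µ, which the method also reports, is omitted here:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import chi2

    def rdta(rank_a, rank_b, a=0.3):
        """Rank-difference trend analysis for one pair of rank sets {r_Ai}, {r_Bi}.
        Fits r_A = b * r_B with rank-dependent variances and returns (mu, Q)."""
        r_a = np.asarray(rank_a, dtype=float)
        r_b = np.asarray(rank_b, dtype=float)
        # Linear error model sigma = a * rank; ranks below 1 get the weight of rank 1
        # so that the weights stay finite (an assumption of this sketch).
        var_a = (a * np.maximum(r_a, 1.0)) ** 2
        var_b = (a * np.maximum(r_b, 1.0)) ** 2

        def chi_square(b):                      # merit function of equation (1)
            return np.sum((r_a - b * r_b) ** 2 / (var_a + b ** 2 * var_b))

        fit = minimize_scalar(chi_square, bounds=(1e-3, 1e3), method="bounded")
        b = fit.x
        mu = 4.0 / np.pi * np.arctan(b) - 1.0   # b = 1  ->  mu = 0 (both algorithms rank alike)
        # Goodness of fit: chance probability of a chi-square at least this large,
        # with one fitted parameter, i.e. N - 1 degrees of freedom assumed here.
        q = chi2.sf(fit.fun, df=len(r_a) - 1)
        return mu, q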
Where the goodness-of-fit is close to zero, i.e. Q < 0.001, the performance measure µ is rejected as inappropriate for performance evaluation. In the experiments this unfortunate case occurred in less than 2% of all comparisons. Two examples of the fit and the obtained measure µ are depicted in Fig. 2. The same algorithms were used to compute the precision-recall graphs shown in Fig. 1. The precision-recall graphs indicate that in the first case one algorithm performs significantly better, while in the second case both algorithms perform nearly the same. A major advantage of the proposed measure µ is that the quintessence of the whole evaluation is pinned down by one number and does not require a whole graph.
4.2
Target Set Definition
The definition of the target set depends on the criterion which has to be evaluated. In the following, the definition is given for two examples.
Stability: The goal of the stability investigations is to study how the different algorithms retrieve images if the image signal is altered or disturbed. Such disturbances occur often in real-world image acquisition, and the behavior of content descriptors under such transformations provides important hints about the applicability of the descriptors. Such target sets can be generated automatically. First, a query image is selected. Then several copies are made and altered by different disturbance levels. All these modified copies form one target set, with the original image as the query image. The signal transformations are, for instance, noise, color shift, color channel scaling, gamma corrections, geometric transformations, etc.
Discrimination: One essential component of an effective retrieval system is the ability to retrieve similar relevant data. For this investigation, the target sets have to be compiled by subjects. The search by subjects for similar images is especially simple if a query image is presented, because the human examiner then uses relative judgment, which is easier and more consistent than absolute judgment [2]. Each subject has to search the database exhaustively for similar images, where similarity is defined by the evaluation criterion, such as visual similarity or similarity based on concepts. Each subject compiled, for each query image and similarity criterion, his or her own personal target set. These personal target sets were then combined into global target sets which represented the relevant image data. Global target sets were formed depending on the level of agreement between the personal target sets: if, for one query image, s out of S subjects considered an image similar, then this image was included in a global target set of level s/S. Experiments were conducted to evaluate at which level such global target sets are meaningful.
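As an illustration of the automatic construction of such a stability target set, the sketch below produces gamma-corrected copies of a query image; the Pillow/NumPy usage and the particular gamma levels are choices of this example, not specifications from the paper:

    import numpy as np
    from PIL import Image

    def gamma_target_set(query_path, gammas=(0.5, 0.7, 0.9, 1.1, 1.3, 1.5)):
        """Build one stability target set: gamma-corrected copies of a single query image."""
        rgb = np.asarray(Image.open(query_path).convert("RGB"), dtype=np.float64) / 255.0
        copies = []
        for g in gammas:
            corrected = np.clip(rgb ** g, 0.0, 1.0)                  # per-channel gamma correction
            copies.append(Image.fromarray((corrected * 255).astype(np.uint8)))
        return copies   # the modified copies form the target set; the original is the query image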
5
Results
Experiments were carried out to investigate the robustness and the sensitivity of the RDTA. For this purpose, a database was used that consists of over 12,000
images. These images contain a wide range of outdoor scenes and objects such as buildings, wildlife, people, textures, satellite images, and paintings. Three different retrieval algorithms were implemented. Model A is the color moment model presented in [8] combined with the texture descriptor presented in [3]. Model B is the color constancy model described in [1]. Model C is the color histogram method suggested by Swain and Ballard [9]. As target sets, the stability target sets and the discriminative target sets were used. For the stability, 15 query images were randomly selected from the database; then 10 copies were made from each query image and altered by gamma correction. The measure µγ is therefore based on 150 points (or images). The same test can also be made for other disturbances such as noise, color shift, etc. For the generation of the discriminative target sets, seven subjects selected their personal target sets for 25 query images based on visual similarity (µatt). For another 10 query images they were requested to select similar images based on shared concepts, e.g. fish scenes. The compiled global target sets were then used to study the performance of the algorithms in extracting relevant images based on concepts (µConcept). First, the stability of the measure under changes of the target sets was tested. For this purpose, the total size N of the target sets for the stability investigation was altered. The measure showed insignificant changes; for example, for the gamma stability the test yields µγ(C, B) = −0.09 ± 0.02 for N = 150 and µγ(C, B) = −0.08 ± 0.02 for N = 75. The measures for the other algorithms behaved similarly. The construction of the global target sets was also studied in more detail: the measure µ was computed for target sets compiled with different agreement levels s/S. An example is listed in Table 1. As in the other cases, an agreement level of 50% to 70% provides stable results. The dependency of µ on the database size M was studied, too. From the original database, ten randomly selected databases of size M were generated and the RDTA was conducted on each. The means and standard deviations of the ten RDTA measures are listed in Table 1. The results, also for the other algorithms, show that at a size of approximately 1,000 images the variation due to different databases is smaller than the uncertainty of the RDTA. Hence, the measure µ is conservative and performs very stably.
6
Summary and Outlook
Performance evaluation of image retrieval systems is an often neglected topic. For evaluating a retrieval system the evaluation criteria have to be defined precisely. Furthermore, appropriate assessment measures are required to evaluate the systems efficiently. It is argued that the commonly used assessment measures, i.e., recall and precision, have several shortcomings. Therefore, a new method was proposed. The proposed method fully automatically evaluates the performance of a rank-based retrieval system. The precise evaluation criteria serve for the compilation
Database Size M   250           500           1000          2000          4000          8000
µatt(A, B)        0.42 ± 0.09   0.43 ± 0.07   0.44 ± 0.03   0.45 ± 0.03   0.45 ± 0.03   0.46 ± 0.3

Agreement Level   100           85            71            57            42            28
µatt(A, B)        0.52 ± 0.05   0.48 ± 0.05   0.48 ± 0.06   0.46 ± 0.06   0.37 ± 0.08   0.41 ± 0.09
µatt(A, C)        0.21 ± 0.03   0.22 ± 0.04   0.20 ± 0.04   0.19 ± 0.04   0.16 ± 0.08   0.17 ± 0.08
Table 1. RDTA results. Top: µ as a function of the database size M; the mean value and standard deviation of the RDTA over ten different databases of size M are listed. Bottom: µ as a function of the agreement level of the target sets.
of the target sets, which has to be done only once. The performance measure also provides a confidence and a goodness-of-fit measure to judge the significance of performance differences. The measure is robust to changes of the target sets. Investigations of the stability of the measure and of the target-set construction were presented. The new evaluation measure was used to develop and efficiently compare novel content-based image descriptors.
Acknowledgments The research was supported partially by the ETH cooperative project on Integrated Image Analysis and Retrieval.
References
1. G. Healey and D. Slater. Global color constancy: recognition of objects by use of illumination invariant properties of color distributions. Journal of the Optical Society of America A, 11(11):3003–3010, Nov 1994. 531
2. M.E. Lesk and G. Salton. Relevance assessments and retrieval system evaluation. In G. Salton, editor, The SMART Retrieval System - Experiments, chapter 26. Prentice-Hall, Englewood Cliffs, NJ, 1971. 530
3. B.S. Manjunath and W.Y. Ma. Texture features for browsing and retrieval of image data. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(8):842–848, 1996. 531
4. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, 1991. 529
5. A. M. Rees. The relevance of relevance to the testing and evaluation of document retrieval systems. In Aslib Proceedings, volume 18, pages 316–324, 1966. 526
6. G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983. 527
7. L. Schamber, M.B. Eisenberg, and M.S. Nilan. A re-examination of relevance: toward a dynamic, situational definition. Information Processing & Management, 26(6):775–776, 1990. 526
8. M. Stricker and A. Dimai. Spectral covariance and fuzzy regions for image indexing. Machine Vision and Applications, 10:66–73, 1997. 527, 531
9. M. J. Swain and D. H. Ballard. Color indexing. Intern. Journal of Computer Vision, 7(1):11–32, 1991. 527, 531
Adapting k-d Trees to Visual Retrieval Rinie Egas, Nies Huijsmans, Michael Lew, and Nicu Sebe Leiden Institute of Advanced Computer Science Leiden University, 2333 CA Leiden, The Netherlands {huijsman mlew nicu}@wi.leidenuniv.nl Abstract. The most frequently occurring problem in image retrieval is find-the-similar-image, which in general is finding the nearest neighbor. From the literature, it is well known that k-d trees are efficient methods of finding nearest neighbors in high dimensional spaces. In this paper we survey the relevant k-d tree literature, and adapt the most promising solution to the problem of image retrieval by finding the best parameters for the bucket size and threshold. We also test the system on the Corel Studio photo database of 18,724 images and measure the user response times and retrieval accuracy.
1
Introduction
Searching for multimedia is one of the outstanding problems for the World Wide Web and for companies which have networked databases. For example, Philips Media uses terabyte databases which are derived from ownership of trademarks, video, images, and audio. The real-world usage of multimedia search programs is to support and enable the re-use and trading of assets. Content based retrieval is a rapidly growing area consisting of advances in interfaces, features, search paradigms, and indexing methods. This paper addresses the indexing problem using the general method of k-d trees because they have been found to have moderate requirements for incremental addition of media and logarithmic access speed, which is especially important as the databases grow from tens of thousands to millions of items.
2
Feature Vector Searching
Considering a collection of feature vectors as a point set S = {p1, p2, . . . , pn} in a d-dimensional space, the problem of retrieving images similar to a given query image is transformed into a k-nearest neighbor searching problem. This nearest neighbor searching problem is to find the point(s) pk in S most similar to a query point q, not necessarily in S, according to some distance measure.
2.1
Linear and Constant Time Searching
The most straightforward way to solve the nearest neighbor searching problem in high-dimensional space is to sequentially compare all points, using the given
distance measure. This linear search takes O(n) time. Another way is to precompute a distance matrix which stores all difference measures:

    dist_{0,0}      dist_{0,1}      ...   dist_{0,n-1}
    ...             ...             ...   ...
    dist_{n-1,0}    dist_{n-1,1}    ...   dist_{n-1,n-1}
where dist_{i,j} is the distance between pi and pj. Note that only the lower half triangle of the matrix is necessary since the matrix is symmetric. The main advantage of this method is that it has constant access time. A disadvantage is that when one wants to use an image from outside the image database as a query image, a new row must be added to the distance matrix. This will be an expensive operation, especially for large image collections. Furthermore, when only m of the smallest distances per column are stored it will not be possible to retrieve more than the m best matches of a query image. In summary, the linear search method requires minimal memory resources, has constant time updating of the image index, but does not scale well for large databases. The distance matrix approach requires large memory resources, has linear time updating of the image index, and is the fastest method regarding access time. In the next section, we introduce and discuss a compromise approach called k-d trees, which requires moderate memory resources, has moderate update times, and has logarithmic access speed.
2.2
The k-d Tree
The k-d tree [1,2] is a binary tree in which each node represents a hyper-rectangle and a hyperplane orthogonal to one of the coordinate axes, which splits the hyper-rectangle into two parts. These two parts are then associated with the two child nodes. Starting with the root of the tree, which represents the entire search space, the search space is partitioned until the number of data points in the hyper-rectangle falls below some given threshold. The leaf nodes of the tree are called buckets, and the threshold that limits the maximum number of data points in a bucket is called the bucket size of the tree. Note that data points are only stored in leaf nodes, not in the internal nodes. In the case of one-dimensional searching, the k-d tree is identical to a binary search tree. Each node contains some partition value by which the data points are partitioned. All data points with values less than the partition value belong to the left child, while those with a larger or equal value belong to the right child. In k dimensions, one dimension is chosen to serve as a discriminator. The search space is partitioned along this discriminator dimension by some partition value. There are several possibilities for choosing both the discriminator and the partition value. We chose the discriminator as the dimension with the largest variance, as suggested in [3]. They found that compared with the standard maximum spread dimension, used in [1], this provided equivalent or slightly better search performance and required less CPU time to build the tree. Furthermore,
we chose the partition value of a node to be the median of the data points in the hyper-rectangle represented by that node along the discriminator dimension, making the k-d tree balanced.

Fig. 1. (a) Data points in 2 dimensions (x, y); (b) k-d tree; N0, N1, and N2 are the partition planes, defined by the discriminators d and the partition values p.

2.3
The Standard k-d Tree Search Algorithm
The standard k-d tree search algorithm was proposed by Arya [5]. At each leaf node visited, the distance between the query point and each data point in the bucket is computed, and the nearest neighbor is updated if this is the closest point seen so far. At each internal node the subtree whose corresponding hyper-rectangle is closer to the query point is visited. Later, the farther subtree is searched if the distance between the query point and the closest point visited so far exceeds the distance between the query point and the corresponding hyper-rectangle. To perform this test, two boundary arrays are maintained in [1] which keep track of the upper and lower boundaries of the hyper-rectangle. The distance between the query point and the closest point visited so far exceeds the distance between the query point and the corresponding hyper-rectangle if the boundaries of the hyper-rectangle overlap the ball centered at the query point with radius equal to the distance between the query point and the closest point visited so far. This test is called the bounds-overlap-ball test. Arya [6] presents a more efficient method to perform the test; he refers to this method as the incremental distance calculation technique. We will use this technique in our search algorithm. A variable CPdist_H is used to keep track of the distance between the query point and the hyper-rectangle H. Furthermore, an array CP_H is maintained to keep track of the distance between the query point and the hyper-rectangle H along each dimension. As the k-d tree is traversed, CPdist_H and the appropriate element of CP_H can be updated incrementally. We will use the example of Figure 1 to show how this is done. Consider node N2 and its two children B2 and B3. Let HN2 be the hyper-rectangle corresponding to node N2, which is partitioned into hyper-rectangles HB2 and HB3. Without loss of generality, assume HB2 is closer to
the query point q than HB3 is. Given CPdist_{HN2} and CP_{HN2}(i), i ∈ [0, d − 1], where CP_H(i) denotes the distance between the query point q and hyper-rectangle H along dimension i, we can update these quantities to get CPdist_H and CP_H(i) for i ∈ [0, d − 1] and H = HB2 or H = HB3. Figure 2(a) shows that CPdist_{HB2} = CPdist_{HN2} and CP_{HB2}(i) = CP_{HN2}(i), i ∈ [0, d − 1]. Also, CP_{HB3}(i) = CP_{HN2}(i) for i ∈ [0, d − 1], i ≠ y. The value of CP_{HB3}(y) is the distance between q and the plane that partitions hyper-rectangle HN2 and can thus be computed with one subtraction. When we use the squared distance as our distance metric, CPdist_{HB3} can be computed as

    CPdist_{HB3} = CPdist_{HN2} − CP_{HN2}(y) · CP_{HN2}(y) + CP_{HB3}(y) · CP_{HB3}(y)

These facts allow us to update CP_H and CPdist_H, for each dimension, in time independent of the dimension. Because the efficiency of the bounds-overlap-ball test used by Friedman et al. [1] degrades with higher dimensions, the incremental distance calculation technique offers better efficiency, especially in high dimensions. Figure 2(b) shows how the k-d tree is traversed when the standard search algorithm is applied to the example data points of Figure 1 with query point q = (3, 6). In the example the number of matches to be returned is two (NN = 2). The buckets show the IDs and distances of the best 2 matches after that bucket has been visited. Because the example data set contains very few data points, all buckets must be searched and no savings are made compared to linear searching.

Fig. 2. (a) Incremental distance calculation technique; dist1 = CPdist_{HB3}, dist2 = CPdist_{HB2}; (b) standard k-d tree search; (c) priority k-d tree search (branch numbering shows how the tree is traversed).

2.4
The Priority k-d Tree Search Algorithm
In the search example of the previous section we saw that the standard k-d tree algorithm came across the nearest neighbors before the search terminated. The complexity of the search algorithm can be reduced by sacrificing this guarantee and interrupting the search before it terminates (say, when the number of visited buckets reaches a certain threshold). In this case, it is desirable to order the
search so that buckets that are more likely to contain the nearest neighbor are visited first. Arya [5] suggests the priority k-d tree search algorithm. This algorithm visits the buckets in increasing order of distance from the query point. This is done by maintaining a priority queue of subtrees, where the priority of a subtree is inversely related to the distance between the query point and the hyper-rectangle corresponding to the subtree (CPdist). Initially the root of the k-d tree is inserted into the priority queue. Then the following procedure is repeatedly carried out. First the subtree with the highest priority is extracted from the queue. This subtree is descended to visit the bucket closest to the query point. In this bucket the nearest neighbor is updated if one or more data points in the bucket are closer than the closest point seen so far. As the subtree is descended, for each node that is visited the farther subtree is inserted into the priority queue. The algorithm terminates when the priority queue is empty, or if the distance from the query point to the hyper-rectangle corresponding to the highest priority subtree is greater than the distance to the closest data point. Figure 2(c) shows how the k-d tree from Figure 1 is traversed when the priority search algorithm is applied to the example data points. Again, the query point is q = (3, 6) and the best 2 matches must be retrieved. The priority search algorithm finds the best 2 matches after visiting 2 buckets, whereas the standard search algorithm needs to visit 3 buckets to find the matches. When using the priority search algorithm, White and Jain [3] propose the use of a threshold that limits the number of buckets that will be visited. Using such a threshold changes the exact nearest neighbor search into an approximate nearest neighbor search. Results of White and Jain [4] and our empirical results show that allowing a very small probability of an incorrect result can provide a significant speedup. In the previous example (Figure 2(c)) a threshold of 2 buckets would have provided a speedup of 2 and would not have affected the results.
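The construction and search described above can be condensed into the following sketch. It is not the authors' implementation: buckets are plain point lists, only the single nearest neighbor is returned, and the bucket size of 19 is simply the value chosen later in Sect. 3.1. The sketch combines the largest-variance discriminator, the median partition value, the incremental CPdist update, and the optional bucket-visit threshold of the priority search:

    import heapq
    import numpy as np

    class Node:
        def __init__(self, dim=None, split=None, left=None, right=None, points=None):
            self.dim, self.split, self.left, self.right, self.points = dim, split, left, right, points

    def build(points, bucket_size=19):
        """points: (n, d) array. Split on the largest-variance dimension at the median
        until a bucket holds at most bucket_size points."""
        if len(points) <= bucket_size:
            return Node(points=points)
        dim = int(np.argmax(points.var(axis=0)))
        split = float(np.median(points[:, dim]))
        mask = points[:, dim] < split
        if not mask.any() or mask.all():               # degenerate split: keep as one bucket
            return Node(points=points)
        return Node(dim, split, build(points[mask], bucket_size), build(points[~mask], bucket_size))

    def priority_search(root, q, max_buckets=None):
        """Visit buckets in increasing distance from q, maintaining per-dimension offsets
        so that the rectangle distance (CPdist) is updated incrementally."""
        q = np.asarray(q, dtype=float)
        best, best_d = None, np.inf
        heap = [(0.0, 0, root, np.zeros(len(q)))]      # (CPdist, tiebreak, node, offsets CP_H(i))
        counter, visited = 1, 0
        while heap:
            rd, _, node, off = heapq.heappop(heap)
            if rd >= best_d:                           # nearest remaining bucket is already too far
                break
            while node.points is None:                 # descend towards the closest bucket
                diff = q[node.dim] - node.split
                near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
                far_off = off.copy()
                far_off[node.dim] = diff
                far_rd = rd - off[node.dim] ** 2 + diff ** 2   # incremental CPdist update
                heapq.heappush(heap, (far_rd, counter, far, far_off))
                counter += 1
                node = near                            # near child keeps rd and off unchanged
            visited += 1
            dists = ((node.points - q) ** 2).sum(axis=1)
            i = int(np.argmin(dists))
            if dists[i] < best_d:
                best_d, best = dists[i], node.points[i]
            if max_buckets is not None and visited >= max_buckets:
                break                                  # approximate search with a bucket threshold
        return best, float(np.sqrt(best_d))

With the KLT-transformed feature vectors of Sect. 2.5, this would be used roughly as priority_search(build(klt_vectors), query_vector, max_buckets=threshold).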
2.5
Integration with the Karhunen-Loève Transform
We saw that in each node the dimension with the largest variance is chosen as the discriminator dimension. White & Jain [3] suggest transforming the feature vector sets using the Karhunen-Loève transform (KLT) before indexing them in a k-d tree. Applying the KLT to the feature vectors results in a vector set whose first elements have the largest variance and whose last elements have the smallest variance. Furthermore, the elements of the KLT vectors are uncorrelated.
Fig. 3. Two test-pairs: monkeys (left), wine (right)
The number of buckets that must be searched to find the best matches decreases significantly when the k-d tree is built using the KLT vectors. We tested the trees on our high-dimensional feature vectors. Those tests showed that using the original feature vectors all buckets are searched (when no threshold is used). Using the KLT vectors to build the tree, only about 50% of the buckets are searched. We therefore will use the KLT vectors to build our k-d trees.
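For completeness, a sketch of such a transform computed with a principal component analysis is shown below (a NumPy-based illustration, not the implementation used for the experiments):

    import numpy as np

    def klt(vectors):
        """Karhunen-Loeve transform of a feature-vector set: project onto the eigenvectors
        of the covariance matrix, ordered by decreasing eigenvalue (variance)."""
        x = np.asarray(vectors, dtype=float)
        mean = x.mean(axis=0)
        cov = np.cov(x - mean, rowvar=False)
        eigval, eigvec = np.linalg.eigh(cov)            # eigenvalues in ascending order
        basis = eigvec[:, np.argsort(eigval)[::-1]]     # largest-variance directions first
        return (x - mean) @ basis, mean, basis          # transformed vectors have uncorrelated elements

    # A query vector q from outside the collection must be mapped with the same mean and basis:
    # q_klt = (q - mean) @ basis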
3
Optimizing k-d Trees for Image Retrieval
In this section, we used the priority KLT k-d trees for indexing the Corel Studio Photo database consisting of 18,724 color images. For testing purposes, 30 test sets were created, each test set consisting of a group of visually similar images, as shown in Figure 3. As features, we used the color histogram and statistical and spectral textures. The color histogram was computed by discretizing the color space into 512 bins; each bin corresponds to the index computed from the most significant 3 bits of R, G, and B, respectively. The statistical texture method (Texture 1) measures the co-occurrence matrix of the gray levels [7], and the spectral texture method (Texture 2) uses the Fourier coefficients [7]. Up to this point, we have described the general theory behind the variations in k-d tree methods. In this section we find the k-d tree parameters (bucket size and threshold) which minimize the access time in the context of image retrieval.

tree height   #buckets   bucket size   buckets searched   time (sec.)
9             512        37            48%                6.5
10            1024       19            45%                6.4
11            2048       10            41%                6.8
12            4096       5             40%                7.5
Table 1. Selecting Tree Height: average results of 20 test queries
3.1
Bucket Size
From our perspective, the k-d tree with the best bucket size is the k-d tree which finds the best k-nn matches in the least time. To minimize the size of the k-d tree we will choose a bucket size that fills the buckets as much as possible, given the number of images. Test results (see [8]) verified that the search time is minimal when the tree is as full as possible. When we use maximally filled trees, the bucket size is defined as follows:

    bucket size(h) = N / 2^h

where N = 18724 and h is the tree height (recall that the binary k-d trees are complete and that the height of a binary tree equals log2(#buckets)). So, we now need to choose the best k-d tree height. We tried k-d trees with different
heights on our COREL image collection. For every height we performed 20 test queries using the Color and Texture 1 algorithms, and we computed the average percentage of buckets that was searched to find the best 10 matches. Table 1 shows the results. The results show that the percentage of buckets searched decreases with the tree height, but the search time increases when the tree gets very large. This can be explained by the fact that both the tree loading time and the tree search overhead (distance computation) increase significantly for large trees. Since the results show that the search time does not vary dramatically for different bucket sizes, we will use 19 as the bucket size of all our k-d trees, regardless of the number of images used.
#images #buckets Color Texture 1 Overall
2000 128 30% 30% 30%
4000 256 30% 25% 27.5%
9000 512 25% 20% 22.5%
18724 1024 20% 20% 20%
Table 2. Selecting Threshold: average results of 20 test queries per tree size
[Fig. 5 plot: number of buckets searched vs. total number of buckets (200–1000), comparing linear search, tree search, and the function 60·log2(#buckets) − 400.]
Fig 5. Linear vs. Tree search
3.2
Threshold
Having chosen a bucket size, we can now choose a threshold that limits the number of buckets visited. Given the size of the search tree we want to know how low we can choose the threshold, while the difference between the tree search results and the exact linear search results is negligible. We created k-d trees with four different heights and we located the lowest threshold for those trees. For every tree size we performed 20 test queries using the Color and Texture 1 algorithms. Similar queries were performed using a linear search. We defined the difference between the tree search results and the linear results to be negligible when the average difference between the best 10, 20 and 30 matches found with the tree search and the best 10, 20 and 30 matches found with linear search was less than 0.5%. Table 2 shows the results. Note that we chose the number of images such that the buckets of the k-d tree were as full as possible. The percentages indicate the thresholds that were used. The Overall row shows the average percentage of the buckets that is searched. This percentage decreases as the trees grow. Figure 5 shows a plot of the number of buckets that must be visited.
Because our tree search function needs to know what threshold to use with a given tree size, we choose the following function to compute the threshold:

    threshold(#buckets) = 60 · log2(#buckets) − 400

This function is also plotted in Figure 5. The plot shows that our function gives a good approximation of the threshold for trees with up to 1024 buckets (19,456 images). When larger image collections are used it will be necessary to update the function.
3.3
User Search Response Times and Retrieval Accuracy
We measured the response time of the system to the user and then gave a breakdown of the time into initialization (program setup time before queries can be made), search (time spent searching the tree), and response time (time including initialization, search, and display of results). In Table 3 the times are given using linear search and the priority KLT k-d search tree, respectively.

                            initialization   search   sum    response time
Linear search   Color       0.21             7.1      7.3    7.5
                Texture 1   0.21             10.7     10.9   12.3
Tree search     Color       0.68             2.3      3.0    3.1
                Texture 1   0.68             3.0      3.7    3.9

Table 3. Search Response Time: average results of 20 test queries per algorithm

Table 4 describes the accuracy of the indexing method for each of the features. The accuracy is measured in terms of the ratio of similar images found in the "Top R" ranks, also known as the normalized recall ratio.

Algorithm   Top 5   Top 10   Top 20   Top 50   Top 100   Top 500
Color       0.72    0.74     0.77     0.79     0.81      0.87
Texture 1   0.46    0.46     0.49     0.55     0.61      0.74
Texture 2   0.36    0.40     0.47     0.50     0.60      0.72

Table 4. Normalized recall ratios using 30 test-sets (78 images) embedded in 18724 images.
4
Conclusions
Solution of the find-the-similar-image problem requires solving the find-the-nearest-neighbor problem. Priority k-d search trees are an efficient method for finding nearest neighbors in high-dimensional search problems. In this paper, we have adapted the search trees to the problem of image retrieval and found the best parameters for minimizing the access time. Tests on image collections with up to 18724 images showed that the threshold value can be approximated by a logarithmic function of the number of feature vectors.
References
1. J.H. Friedman, J.L. Bentley, and R.A. Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software, 3(3), p. 209-226, Sep. 1977 534, 535, 536
2. J.L. Bentley, K-d Trees for Semidynamic Point Sets, in Proc. 6th Ann. ACM Sympos. Comput. Geom., p. 187-197, 1990 534
3. D.A. White and R. Jain, Algorithms and Strategies for Similarity Retrieval, Visual Computing Laboratory, University of California, San Diego, 1996 534, 537
4. D.A. White and R. Jain, Similarity Indexing: Algorithms and Performance, Visual Computing Laboratory, University of California, San Diego, 1997 537
5. S. Arya, Nearest Neighbor Searching and Applications, PhD thesis, Computer Vision Laboratory, University of Maryland, College Park, 1995 535, 537
6. S. Arya and D. Mount, An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions, in Proc. 5th ACM-SIAM Sympos. Discrete Algorithms, p. 573-582, 1994 535
7. R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison-Wesley, 1992 538
8. R. Egas, Benchmarking of Visual Query Algorithms, Internal Report 97-06, Computer Science Dept., Leiden University, 1997 538
Content-Based Image Retrieval Using Self-Organizing Maps Jorma Laaksonen, Markus Koskela, and Erkki Oja Laboratory of Computer and Information Science, Helsinki University of Technology, P.O.BOX 5400, Fin-02015 HUT, Finland {jorma.laaksonen,markus.koskela,erkki.oja}@hut.fi Abstract We have developed an image retrieval system named PicSOM which uses Tree Structured Self-Organizing Maps (TS-SOMs) as the method for retrieving images similar to a given set of reference images. A novel technique introduced in the PicSOM system facilitates automatic combination of the responses from multiple TS-SOMs and their hierarchical levels. This mechanism aims at adapting to the user's preferences in selecting which images resemble each other in the particular sense the user is interested in. The image queries are performed through the World Wide Web and the queries are iteratively refined as the system exposes more images to the user.
1
Introduction
Content-based image retrieval from unannotated image databases has been an object of ongoing research for a long period. Many projects have been started in recent years to research and develop efficient systems for content-based image retrieval. The best-known implementation is probably Query By Image Content (QBIC) [3] developed at the IBM Almaden Research Center. Other notable systems include MIT's Photobook [9] and its more recent version, FourEyes, the search engine family of WebSEEk, VisualSEEk, and MetaSEEk [2], which are all developed at Columbia University, and Virage [1], a commercial content-based search engine developed at Virage Technologies Inc. We have implemented an image-retrieval system called PicSOM, which tries to adapt to the user's preferences regarding the similarity of images using Self-Organizing Maps (SOMs) [5]. The approach is based on the relevance feedback technique [11], in which the human-computer interaction is used to refine subsequent queries to better approximate the need of the user. Some earlier systems have also applied the relevance feedback approach in image retrieval [8,10]. PicSOM uses a SOM variant called Tree Structured Self-Organizing Map (TS-SOM) [6,7] as the image similarity scoring method and a standard World Wide Web browser as the user interface. The implementation of our image-retrieval system is based on a general framework in which the interfaces of co-operating modules are defined. Therefore, the TS-SOM is only one possible choice for the similarity measure. However, the results we have gained so
far, are very promising regarding the potential of the TS-SOM method. As far as the current authors are aware, there have not been notable image retrieval applications based on the SOM. Some preliminary experiments with the SOM have been made previously in [13].
2
Principle of PicSOM
Our method is named PicSOM, which bears similarity to the well-known WEBSOM [4,12] document browsing and exploration tool that can be used in free-text mining. WEBSOM is a means for organizing miscellaneous text documents into meaningful maps for exploration and search. It is based on the SOM, which automatically organizes documents into a two-dimensional grid so that related documents appear close to each other. Up to now, databases of over one million documents have been organized for search using the WEBSOM system. In an analogous manner, we have aimed at developing a tool that utilizes the strong self-organizing power of the SOM in unsupervised statistical data analysis for image retrieval. The features may be chosen separately for each specific task and the system may also use keyword-type textual information for the images. The basic operation of the PicSOM image retrieval is as follows:
1) An interested user connects to the WWW server providing the search engine with her web browser.
2) The system presents a list of databases available to that particular user.
3) After the user has selected the database, the system presents an initial set of tentative images scaled to small thumbnail size. The user then selects the subset of images which best match her expectations and, to some degree of relevance, fit her purposes. Then she hits the "Continue Query" button in her browser, which sends the information on the selected images back to the search engine.
4) The system marks the selected and non-selected images with positive and negative values, respectively, in its internal data structure. Based on this information, the system then presents a new set of images alongside the images selected so far.
5) The user again selects the relevant images, submits this information to the system, and the iteration continues. Hopefully, the fraction of relevant images increases as more images are presented to the user and, finally, one of them is exactly what she was originally looking for.
2.1
Feature Extraction
PicSOM may use one or several types of statistical features for image querying. Separate feature vectors can thus be formed for describing the color, texture, and structure of the images. A separate Tree Structured Self-Organizing Map is then constructed for each feature vector set and these maps are used in parallel to select the best-scoring images. New features can be easily added to the system, as long as the features are calculated from each picture in the database. In our current implementation, the average R-, G-, and B-values are calculated in five separate regions of the image. This division of the image area increases the discriminating power by providing a simple color layout scheme.
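A sketch of this color feature computation is given below; since the exact layout of the five regions is not specified here, a layout of four quadrants plus a central window is assumed purely for illustration:

    import numpy as np
    from PIL import Image

    def color_feature(path):
        """Average R, G, B in five image regions -> a 15-dimensional colour feature.
        The region layout (four quadrants plus a central window) is an assumption
        of this sketch; the text only states that five separate regions are used."""
        img = np.asarray(Image.open(path).convert("RGB"), dtype=float)
        h, w, _ = img.shape
        regions = [img[:h // 2, :w // 2], img[:h // 2, w // 2:],    # top-left, top-right
                   img[h // 2:, :w // 2], img[h // 2:, w // 2:],    # bottom-left, bottom-right
                   img[h // 4:3 * h // 4, w // 4:3 * w // 4]]       # centre window
        return np.concatenate([r.reshape(-1, 3).mean(axis=0) for r in regions])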
The resulting 15-dimensional color feature vector thus not only describes the average color of the image but also gives information on the color composition. The current texture feature vectors in PicSOM are calculated similarly in the same five regions as the color features. The Y-values of the YIQ color representation of every pixel's 8-neighborhood are examined and the estimated probabilities for each neighbor pixel being brighter than the center pixel are used as features. This results in five eight-dimensional vectors which are combined into one 40-dimensional textural feature vector.
2.2
Self-Organizing Map (SOM)
The Self-Organizing Map (SOM) [5] is a neural algorithm widely used to visualize and interpret large high-dimensional data sets. The map consists of a regular grid of neurons. A vector m_i, consisting of features, is associated with each unit i. The map attempts to represent all the available observations with optimal accuracy using a restricted set of models. At the same time, the models become ordered on the grid so that similar models are close to each other and dissimilar models far from each other. Fitting of the model vectors is usually carried out by a sequential regression process, where t = 1, 2, . . . is the step index: For each sample x(t), first the index c = c(x) of the best-matching unit is identified by the condition

    ∀i : ||x(t) − m_c(t)|| ≤ ||x(t) − m_i(t)||                        (1)

After that, all model vectors, or a subset of them that belong to nodes centered around node c(x), are updated as

    m_i(t + 1) = m_i(t) + h(t)_{c(x),i} (x(t) − m_i(t))               (2)

Here h(t)_{c(x),i} is the "neighborhood function", a decreasing function of the distance between the ith and cth nodes on the map grid. This regression is then reiterated over the available samples and the value of h(t)_{c,i} is allowed to decrease in time to guarantee the convergence of the unit vectors m_i.
2.3
Tree Structured SOM (TS-SOM)
The Tree Structured Self-Organizing Map (TS-SOM) [6,7] is a tree-structured vector quantization algorithm that uses SOMs at each of its hierarchical levels. In PicSOM, all TS-SOM maps are two-dimensional. The number of map units increases when moving downwards in the TS-SOM. The search space for the best-matching vector of equation (1) on the underlying SOM layer is restricted to a predefined portion just below the best-matching unit on the above SOM. Therefore, the complexity of the searches in TS-SOM is remarkably lower than if the whole bottommost SOM level were accessed without the tree structure. The computational lightness of TS-SOM facilitates the creation and use of huge SOMs which are used to hold the images stored in the image database. The
feature vectors calculated from the images are used to train the levels of the TS-SOMs beginning from the top level. After the training phase, each unit of the TS-SOMs contains a model vector which may be regarded as the average of all feature vectors mapped to that unit. In PicSOM, we then search the corresponding data set for the feature vector which best matches the stored model vector and associate the corresponding image with that map unit. Consequently, a tree-structured hierarchical representation of all the images in the database is formed. In an ideal situation, there should be a one-to-one correspondence between the images and the TS-SOM units in the bottom level of each map.
2.4
Using Multiple TS-SOMs
Combining the results from several maps can be done in a number of ways. A simple method would be to ask the user to enter weights for different maps and then calculate a weighted average. This, however, requires the user to give information which she normally does not have. Generally, it is a difficult task to give low-level features such weights which would coincide with human’s perception of images at a more conceptual level. Therefore, a better solution is to use the relevance feedback approach. Then, the results of multiple maps are combined automatically, using the implicit information from the user’s responses during the query. The PicSOM system thus tries to learn the user’s preferences from the interaction with her and to set its own responses accordingly. The rationale behind our approach is as follows: If the images selected by the user map close to each other on a certain TS-SOM map, it seems that the corresponding feature performs well on the present query and the relative weight of its opinion should be increased. This can be implemented by marking on the maps the images the user has seen. The units are given positive and negative values depending whether she has selected or rejected the corresponding images. The mutual relations of positively-marked units residing near each other can then be enhanced by convolving the maps with a simple low-pass filtering mask. As a result, areas with many positively marked images spread the positive response to their neighboring map units. The images associated with these units are then good candidates for next images to be shown to the user, if they have not been shown already. The current PicSOM implementation uses convolution masks whose values decrease as the 4-neighbor or “city-block” distance from the mask center increases. The convolution mask size increases as the size of the corresponding SOM layer increases. Figure 1 shows a set of convolved feature maps during a query. The three images on the left represent three map levels on the Tree Structured SOM for the RGB color feature, whereas the convolutions on the right are calculated on the texture map. The sizes of the SOM layers are 4 × 4, 16 × 16, and 64 × 64, from top to bottom. The dark regions have positive and the light regions negative convolved values on the maps. Notice the dark regions in the lower-left corners of the three layers of the left TS-SOM. They indicate that there is a strong response and similarity
between images selected by the user in that particular area of the color feature space.

Figure 1. An example of convolved TS-SOMs for color (left) and texture (right) features. Black corresponds to positive and white to negative convolved values.

2.5
Refining Queries
Initially, the query begins with a set of reference images picked from the top levels of the TS-SOMs in use. For each reference image, the best-matching SOM unit is searched on every layer of all the TS-SOM maps. The selected and rejected images result in positive and negative values on the best-matching units. The positive and negative responses are normalized so that their sum equals zero. Previously positive map units can also be changed to negative as the retrieval process iteration continues. In early stages of the image query, the system tends to present the user images from the upper TS-SOM levels. As soon as the convolutions begin to produce large positive values also on the lower map levels, the images on these levels are shown to the user. The images are therefore gradually picked more and more from the lower map levels as the query is continued. The inherent property of PicSOM to use more than one reference image as the input information for retrievals is important. This feature makes PicSOM differ from other content-based image retrieval systems, such as QBIC, which use only one reference image at a time.
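The map-scoring step of Sections 2.4 and 2.5 can be sketched roughly as follows. The mask radius, its exact weights, and the way the positive and negative responses are normalized to a zero sum are illustrative assumptions of this sketch, not values from the PicSOM implementation:

    import numpy as np

    def score_map(shape, positive_units, negative_units, radius=2):
        """Mark best-matching units of selected (+) and rejected (-) images on one SOM layer,
        normalize the responses towards a zero sum, and spread them with a low-pass mask
        whose weights decay with the city-block distance from the mask centre."""
        marks = np.zeros(shape)
        for (r, c) in positive_units:
            marks[r, c] += 1.0
        for (r, c) in negative_units:
            marks[r, c] -= 1.0
        pos, neg = marks.clip(min=0), -marks.clip(max=0)
        if pos.sum() > 0 and neg.sum() > 0:
            marks = pos / pos.sum() - neg / neg.sum()   # one way to make the responses sum to zero
        # Convolution mask: weight 1 at the centre, decreasing with city-block distance.
        yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        mask = 1.0 / (1.0 + np.abs(yy) + np.abs(xx))
        scores = np.zeros(shape)
        rows, cols = shape
        for r in range(rows):                           # direct 2-D convolution (maps are small)
            for c in range(cols):
                if marks[r, c] != 0.0:
                    r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
                    c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
                    scores[r0:r1, c0:c1] += marks[r, c] * mask[r0 - r + radius:r1 - r + radius,
                                                               c0 - c + radius:c1 - c + radius]
        return scores   # units with the highest scores point to candidate images to show next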
3
Implementation of PicSOM
The issues of the implementation of the PicSOM image retrieval system can be divided into two categories. First, concerning the user interface, we have wanted to make our search engine, at least in principle, available and freely usable to
the public by implementing it in the World Wide Web. The use of a standard web browser also makes the queries on the databases machine independent. Figure 2 shows a screenshot of the current web-based PicSOM user interface, which can be found at http://www.cis.hut.fi/picsom/.

Figure 2. WWW-based user interface of PicSOM. The user has already selected five aircraft images in the previous rounds. The system is displaying ten new images for the user to select from.

Second, the functional components in the server running the search engine have been implemented so that the parts responsible for separate tasks have been isolated into separate processes. The implementation of PicSOM has three separate modular components:
1) picsom.cgi is a CGI/FCGI script which handles the requests and responses from the user's web browser. This includes processing the HTML form, updating the information from previous queries, and executing the other components as needed to complete the requests.
2) picsomctrl is the main program responsible for updating the TS-SOM maps with new positive and negative response values, calculating the convolutions, creating new map images for the next web page, and selecting the best-scoring images to be shown to the user in the next round.
3) picsomctrltohtml creates the HTML contents of the new web pages based on the output generated by the picsomctrl program.
4
Preliminary Quantitative Results
Quantitative measures of the image retrieval performance of a system, or any single feature, are problematic due to human subjectivity. Generally, there exists no definite right answer to an image query as each user has individual expectations.
[Fig. 3 plot: precision (%) vs. number of retrieved images (20–340), with curves for PicSOM, QBIC draw, QBIC color histogram, QBIC color, and QBIC texture.]
Figure 3. Average precisions of PicSOM and QBIC responses when one reference image containing an aircraft is presented to the system. The a priori probability for a correct response is 8 percent.
We have made experiments with an image database of 4350 images. Most of them are color photographs in JPEG format. The images were downloaded from the image collection residing at the Swedish University Network FTP server, located at ftp://ftp.sunet.se/pub/pictures/. We evaluated PicSOM's and QBIC's responses when one image containing an aircraft was given at a time as a reference image to the systems. The average number of relevant images was then calculated for both methods by using a hand-picked subset of the ftp.sunet.se collection which contained 348 aircraft images. This gives an a priori probability of 8.0 percent for correct responses.
Figure 3 shows the average retrieval precisions as functions of the number of images returned from the database. The response of PicSOM was produced by returning the 20 best-scoring images in each iteration step and selecting the relevant images among them. For QBIC, separate responses for four features are shown. As could be expected, QBIC's average performance decreases when the number of returned pictures is increased. This is due to the fact that QBIC's response is always based on a single reference image. In contrast, PicSOM's operation is based on an increasing set of images, and the response becomes more accurate as more images are shown. PicSOM's strength is thus in its ability to use multiple reference images and all available features simultaneously.
5
Future Plans
The next obvious step to increase PicSOM’s retrieval performance is to add better feature representations to replace our current experimental ones. These
will include color histograms, color layout descriptions, shape features, and some more sophisticated texture models. As the PicSOM architecture is designed to be modular and expandable, adding new statistical features is straightforward. We also need to define more quantitative measures which can be used in comparing the performance of the PicSOM system with that of other content-based image retrieval systems. To study our method’s applicability on a larger scale we shall need larger image databases. A vast collection of images is available on the Internet, and we have preliminary plans to use PicSOM as an image search engine for the World Wide Web.
References 1. Bach J. R., Fuller C., Gupta A., et al. The Virage image search engine: An open framework for image management. In Sethi I. K. and Jain R. J., editors, Storage and Retrieval for Image and Video Databases IV, volume 2670 of Proceedings of SPIE, pages 76–87, 1996. 541 2. Chang S.-F., Smith J. R., Beigi M., and Benitez A. Visual information retrieval from large distributed online repositories. Communications of the ACM, 40(12):63– 69, December 1997. 541 3. Flickner M., Sawhney H., Niblack W., et al. Query by image and video content: The QBIC system. IEEE Computer, pages 23–31, September 1995. 541 4. Honkela T., Kaski S., Lagus K., and Kohonen T. WEBSOM—self-organizing maps of document collections. In Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, pages 310–315. Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland, 1997. 542 5. Kohonen T. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer-Verlag, 1997. Second Extended Edition. 541, 543 6. Koikkalainen P. Progress with the tree-structured self-organizing map. In Cohn A. G., editor, 11th European Conf. on Artificial Intelligence. European Committee for Artificial Intelligence (ECCAI), John Wiley & Sons, Ltd., August 1994. 541, 543 7. Koikkalainen P. and Oja E. Self-organizing hierarchical feature maps. In Proceedings of 1990 International Joint Conference on Neural Networks, volume II, pages 279–284, San Diego, CA, 1990. IEEE, INNS. 541, 543 8. Minka T. P. An image database browser that learns from user interaction. Master’s thesis, M.I.T, Cambridge, MA, 1996. 541 9. Pentland A., Picard R. W., and Sclaroff S. Photobook: Tools for content-based manipulation of image databases. In Storage and Retrieval for Image and Video Databases II (SPIE), volume 2185 of SPIE Proceedings Series, San Jose, CA, USA, 1994. 541 10. Rui Y., Huang T. S., and Mehrotra S. Content-based image retrieval with relevance feedback in MARS. In Proc. of IEEE Int. Conf. on Image Processing ’97, pages 815–818, Santa Barbara, California, USA, October 1997. 541 11. Salton G. and McGill M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983. 541 12. WEBSOM - self-organizing maps for internet exploration, http://websom.hut.fi/websom/. 542
13. Zhang H. and Zhong D. A scheme for visual feature based image indexing. In Storage and Retrieval for Image and Video Databases III (SPIE), volume 2420 of SPIE Proceedings Series, San Jose, CA, February 1995. 542
Relevance Feedback and Term Weighting Schemes for Content-Based Image Retrieval David Squire, Wolfgang Müller, and Henning Müller Computer Vision Group, Computer Science Department University of Geneva, rue Général Dufour, 1211 Geneva 4, Switzerland squire,muellerw,[email protected]
Abstract. This paper describes the application of techniques derived from text retrieval research to the content-based querying of image databases. Specifically, the use of inverted files, frequency-based weights and relevance feedback is investigated. The use of inverted files allows very large numbers (≥ O(10^4)) of possible features to be used, since search is limited to the subspace spanned by the features present in the query image(s). Several weighting schemes used in text retrieval are employed, yielding varying results. We suggest possible modifications for their use with image databases. The use of relevance feedback was shown to improve the query results significantly, as measured by precision and recall, for all users.
1
Introduction
In recent years the use of digital image databases has become common, both on the web and for preparing electronic and paper publications. The efficient querying and browsing of large image databases has thus become increasingly important. Content-based retrieval from large text databases has been studied for more than forty years, yet the insights and techniques of text retrieval (TR) have largely been ignored by content-based image retrieval (CBIR) researchers, or reinvented without heeding the prior work. The utility of Relevance Feedback (RF) is long-established [1], yet its application in CBIR systems (CBIRSs) is very recent. Similarly, a great variety of term-weighting approaches has been investigated, both empirically and theoretically [2]. Means of system evaluation have also been thoroughly studied [3], yet Precision and Recall, the usual performance measures, are not widely used in CBIR. TR systems usually treat each possible term (i.e. word) as a dimension of the search space. Spaces with O(10^4) dimensions are thus typical. The key realization is that in such systems both queries and stored objects are sparse: they have only a small subset (O(10^2)) of all possible attributes. Search can thus be restricted to the subspace spanned by the query terms. The data structure which makes this efficient is the Inverted File (IF), described in §3.2. Conversely, considerable effort has been devoted by CBIR researchers to the search for compact image representations (choosing the "right" features), and to the use of techniques such as factor analysis [4] to reduce the feature space dimensionality. Dimensionality
reduction, however, can eliminate the “rare” feature variations which are very useful for creating a specific query. We present a CBIRS which uses an IF, with more than 80000 possible features per image. Using 10 queries for each of 5 users on a test database of 500 images, we compare the effectiveness of a variety of feature-weighting schemes derived from TR (see §4). Modifications to these schemes, specific to CBIR, are suggested. We analyze the performance of these weighting schemes both with and without RF, and that of a typical low-dimensional, nearest-neighbour CBIRS [4], using precision and recall graphs. The TR-inspired weighting schemes are found to improve performance, and the addition of RF makes a still greater difference.
2
Current CBIR Research
CBIR researchers acknowledge that the general computer vision problem remains unsolved: semantic retrieval is impossible. The usual approach is to extract low-level features and to attempt to capture image similarity using some function of them. Most systems employ features based on colour, texture or shape. Features are often computed globally, and contain no spatial information. Some systems allow the user to influence the relative weights of these classes of features.
2.1 Features
By far the most commonly used feature is colour (e.g. [5,6,7]), usually computed in a colour space thought to be "perceptually accurate" (e.g. HSV [7] or CIE [8]). The usual representation is the colour histogram. Histogram intersection is the most frequently used distance measure. A disadvantage is that this takes no account of perceptual similarities between bins. Measures exist which use a matrix of bin similarity coefficients [5], but the choice of coefficients is not obvious, and the cost is quadratic.
Many systems use texture to improve image characterization. A great variety of texture features has been employed: hierarchies of Gabor filters [9]; the Wold features used in Photobook [10]; the coarseness, contrast, and directionality features used in QBIC [5]; and many more.
Shape features are often computed assuming that an image contains only one shape, and are thus best applied to restricted domains. Shape features include: modal matching, applied to isolated fish, rabbits and machine tools [11]; histograms of edge directions, applied to trademarks [6]; and matching of shape components such as corners, line segments or circular arcs [12].
Global features are inadequate for many CBIR tasks: users may be interested in the spatial layout of colours, textures and shapes, or in particular objects. One approach is to use features which retain spatial information, such as wavelet decompositions [13]. Others segment the image into regions, and then extract features such as color and texture from them, as well as spatial properties such as size, location and their relationships to other regions [7,14,15]. This turns CBIR into a labeled graph matching problem.
2.2 Similarity
The meaning of similarity in CBIR is rarely addressed, yet it is vital to do so: human judgments of similarity vary greatly [16]. Image similarity is typically defined using a metric on a feature space. It is often implied that if one chooses the "right" features, proximity in feature space will correspond to perceptual similarity. There are several reasons to doubt this, the most fundamental being the metric assumption. There is evidence that human similarity judgments do not obey the requirements of a metric: "[Self-identity] is somewhat problematic, symmetry is apparently false, and the triangle inequality is hardly compelling" [17, p. 329]. The lack of symmetry is the most important issue: the features which are significant depend on which item is the query. Some attempts have been made to address these problems. Self-organizing maps have been used to cluster texture features according to class labels provided by users [9]. A set-based technique has been applied to learn groupings of similar images from positive and negative examples provided by users [10]. Distance Learning Networks attempt to learn a mapping from feature space to "perceptual similarity space" using human similarity judgment data [18].
2.3 Relevance Feedback
There are two basic approaches to RF: a system can create a composite query from relevant and non-relevant images [19], or it can adjust its similarity metric [8]. Some use the variances of features in the relevant set as a weighting criterion [20]. Whilst related to the variance-based approach, the technique presented here differs in that it can cope with multimodal distributions of relevant features, and with very much greater numbers of possible features.
3 The Viper System
Viper employs more than 80000 simple colour and spatial frequency features, both local and global, extracted at several scales. These are intended to correspond (roughly) to features present in the retina and early visual cortex. The fundamental difference between traditional computer vision and image database applications is that there is a human "in the loop". RF allows a simple classifier to be learnt "on the fly", corresponding to the user's information need.
3.1 Features
Colour features Viper uses a palette of 166 colours, derived by quantizing HSV space into 18 hues, 3 saturations, 3 values and 4 grey levels. Two sets of features are extracted from the quantized image. The first is a colour histogram, with empty bins discarded. The second represents colour layout. Each block in the image (the first being the image itself) is recursively divided into four
http://cuiwww.unige.ch/~vision/Viper/
equal-sized blocks, at four scales. The occurrence of a block with a given mode color is treated as a binary feature. There are thus 56440 possible colour block features, of which each image has 340.
Texture features Gabors have been applied to texture classification and segmentation, as well as more general vision tasks [9,21]. We employ a bank of real, circularly symmetric Gabors, defined by

f_{mn}(x, y) = \frac{1}{2\pi\sigma_m^2} \, e^{-\frac{x^2 + y^2}{2\sigma_m^2}} \cos\big(2\pi(u_{0m} x \cos\theta_n + u_{0m} y \sin\theta_n)\big),   (1)
where m indexes filter scales, n their orientations, and u_{0m} gives the centre frequency. The half-peak radial bandwidth is chosen to be one octave, which determines σ_m. The highest centre frequency is chosen as u_{01} = 0.5, and u_{0(m+1)} = u_{0m}/2. Three scales are used. The four orientations are: θ_0 = 0, θ_{n+1} = θ_n + π/4. The resultant bank of 12 filters gives good coverage of the frequency domain, and little overlap between filters. The mean energy of each filter is computed for each of the smallest blocks in the image. This is quantized into 10 bands. A feature is stored for each filter with energy greater than the lowest band. Of the 27648 such possible features, each image has at most 3072. Histograms of the mean filter outputs are used to represent global texture characteristics.
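A minimal sketch of such a bank, assuming Eq. (1): three scales with the centre frequency halved at each scale, four orientations, and a one-octave bandwidth. The 31 × 31 support and the sigma-to-frequency constant (≈ 0.56/u0) are assumptions, not taken from the paper.

import numpy as np
from scipy.signal import fftconvolve

def gabor(size, sigma, u0, theta):
    # Real, circularly symmetric Gabor: Gaussian envelope times a cosine carrier.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    gauss = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return gauss * np.cos(2 * np.pi * u0 * (x * np.cos(theta) + y * np.sin(theta)))

bank, u0 = [], 0.5                       # highest centre frequency u01 = 0.5
for m in range(3):                       # three scales, u0 halved at each
    sigma = 0.56 / u0                    # assumed one-octave bandwidth relation
    for n in range(4):                   # orientations 0, pi/4, pi/2, 3*pi/4
        bank.append(gabor(31, sigma, u0, n * np.pi / 4))
    u0 /= 2

def block_energies(block, bank):
    # Mean energy of each filter response over an image block.
    return [np.mean(fftconvolve(block, g, mode='same') ** 2) for g in bank]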
3.2 Techniques Derived from Text Retrieval
Inverted files An IF contains an entry for every possible feature consisting of a list of the items which contain that feature. The TR community has developed techniques for building and searching IFs very efficiently [22]. In evaluating a query, only images which contain features present in the query are retrieved. Coupled with appropriate weighting schemes this results in asymmetric similarity measures, in better accord with the psychophysical data (see §2.2).
Feature weighting and relevance feedback As discussed in §2.3, RF can produce a query which better represents a user's information need. We investigate the application of weighting functions used in TR to CBIR. The weighting function can depend upon the term frequency tf_j and collection frequency cf of the feature, as well as its type (block or histogram). The motivation for using tf and cf is very simple: features with high tf characterize an image well; features with high cf do not distinguish that image well from others [2]. We consider a query q containing N images i with relevances R_i ∈ [−1, 1]. The frequency of feature j in the pseudo-image corresponding to q is

tf_{qj} = \frac{1}{N} \sum_{i=1}^{N} tf_{ij} \cdot R_i.   (2)
In this paper, only single-level, positive feedback is used: Ri = 1 for all images in q.
The weighting functions defined in Equations 5–9 are derived from typical TR term weighting functions [2]. Some modifications were necessary since the image features used can not always be treated in the same way as words in documents. All weighting functions make use of a base weight

wf^0_{kqj} = \begin{cases} tf_{qj} & \text{for block features} \\ \mathrm{sgn}(tf_{qj}) \cdot \min(|tf_{qj}|, tf_{kj}) & \text{for histogram features} \end{cases}   (3)

(The second case is a generalized histogram intersection.) Two different logarithmic factors are used, which depend upon cf:

lcf1_j = \begin{cases} \log(1/cf_j) & \text{block} \\ 1 & \text{hist.} \end{cases} \qquad lcf2_j = \begin{cases} \log(1/cf_j - 1 + \varepsilon) & \text{block} \\ 1 & \text{hist.} \end{cases}   (4)

ε is added to avoid overflows. The weighting functions investigated are

best weighted probabilistic: wf^1 = wf^0_{kqj} \cdot \Big(0.5 + 0.5\,\frac{tf_{kj}}{\max_j tf_{kj}}\Big) \cdot lcf2_j   (5)

classical idf: wf^2 = wf^0_{kqj} \cdot lcf1_j^2   (6)

binary term independence: wf^3 = wf^0_{kqj} \cdot lcf2_j   (7)

standard tf: wf^4 = wf^0_{kqj} \cdot \begin{cases} \dfrac{tf_{kj} \cdot tf_{qj}}{\sum_m tf_{km}^2} & \text{block} \\ 1 & \text{hist.} \end{cases}   (8)

coordination level: wf^5 = wf^0_{kqj}   (9)

For each image k, using weighting method l, a score s^l_{kq} is calculated:

s^l_{kq} = \sum_j wf^l_{kqj}.   (10)
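The following sketch (Python, not the Viper code) shows how such scores can be accumulated over an inverted file, so that only images sharing a feature with the query receive a score. The weight used here follows the spirit of the classical idf method (Eq. 6) for block features; the data layout and names are assumptions.

import math
from collections import defaultdict

def score_query(query_tf, inverted_file, collection_freq):
    # query_tf: {feature: tf_qj}; inverted_file: {feature: [(image, tf_kj), ...]};
    # collection_freq: {feature: cf_j in (0, 1]}. Returns {image: score}.
    scores = defaultdict(float)
    for j, tf_qj in query_tf.items():
        lcf1 = math.log(1.0 / collection_freq[j])      # log factor for block features
        for image, tf_kj in inverted_file.get(j, []):
            wf0 = tf_qj                                # base weight for block features
            scores[image] += wf0 * lcf1 ** 2           # classical-idf style weight
    return dict(scores)

# Toy usage with made-up feature identifiers:
ranked = sorted(score_query({'f17': 0.4, 'f903': 0.1},
                            {'f17': [('img3', 2.0), ('img9', 1.0)],
                             'f903': [('img9', 3.0)]},
                            {'f17': 0.01, 'f903': 0.2}).items(),
                key=lambda kv: -kv[1])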
4
Experiments
The performance of Viper was evaluated using a set of 500 heterogeneous colour images provided by Télévision Suisse Romande. Ten images were selected as queries. Five users then examined all 500 images to determine relevant sets for each query.3 Neither the number of images to choose nor the similarity criteria were specified. Each query image was presented to Viper and the top 20 ranked images were returned. Using a "consistent user" assumption, the relevant set for each user for this query was inspected and the set of relevant images present in the top 20 was then submitted as a second, relevance feedback query. This was done for the five weighting schemes (Equations 5–9), meaning that 300 relevance feedback queries were performed.
All users were computer vision researchers, so some bias can be expected.
The performance of Viper was compared with that of a low-dimensional system of the sort commonly used in image retrieval. The system uses a set of 16 colour, segment, arc and region statistics [23]. System performances are compared using precision P and recall R,

P = \frac{r}{N}, \qquad R = \frac{r}{TotRel},   (11)
where N is the number of images retrieved, r is the number of relevant images retrieved, and TotRel is the total number of relevant images in the collection. In general, precision decreases as more images are retrieved. An ideal P vs. R graph has P = 1 ∀ R. Figure 1 shows the performance of the weighting methods averaged over all users and queries. RF improves performance in every case, at all values of recall. This is significant since in general not all relevant images are present in the top 20 after the initial query: the system is not simply returning those images marked relevant with higher rankings.
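A minimal sketch of the computation in Eq. (11), assuming a ranked result list and a user-supplied relevant set (both hypothetical):

def precision_recall(ranked_images, relevant_set, cutoff):
    # Precision P = r / N and recall R = r / TotRel for the top `cutoff` results.
    retrieved = ranked_images[:cutoff]
    r = sum(1 for img in retrieved if img in relevant_set)
    return r / cutoff, r / len(relevant_set)

p, rec = precision_recall(['a', 'c', 'b', 'e', 'd'], {'a', 'b', 'f'}, cutoff=4)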
[Plots for Fig. 1: precision versus recall for weighting methods 1–5 and the low-dimensional system, (a) before feedback and (b) after feedback.]
Fig. 1. Performance of weighting methods averaged over all users and queries.
System performance varied greatly depending on the nature of the query. Some queries are “easy”, in that simple visual features characterize the relevant set. The relevant sets were very similar in these cases, and performance after RF was often perfect (P = 1 ∀ R). Figure 2 shows the performance of Viper on a “hard” query, for which the relevant sets varied greatly in size and composition. The effect of RF is even more dramatic in this case. This is to be expected, since no fixed similarity measure can cope with different relevant sets across users. The best weighting function for this query is method 2 (Equation 6), and this is also the best method averaged over all queries. This is a classic tf · log 1/cf weight, which has been shown to have information theoretic motivation [2].
[Plots for Fig. 2: precision versus recall for weighting methods 1–5 and the low-dimensional system on the "hard" query, (a) before feedback and (b) after feedback.]
Fig. 2. Performance of weighting methods on a “hard” query, averaged across users.
5
Conclusion
We have shown how techniques used in TR (inverted files, relevance feedback and term weighting) can be adapted for use in CBIR. IFs permit the use of very large feature spaces, and experiments show that term weighting and RF result in a system which outperforms a low-dimensional vector-space system at every level of recall.
Acknowledgments This work was supported by the Swiss National Foundation for Scientific Research (grant no. 2000-052426.97).
References 1. Salton, G., Buckley, C.: Improving retrieval performance by relevance feedback. J. of the Am. Soc. for Information Science 41(4) (1990):288–287 549 2. Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval. Information Processing and Management 24(5) (1988):513–523 549, 552, 553, 554 3. Salton, G.: The state of retrieval system evaluation. Information Processing and Management 28(4) (1992):441–450 549 4. Pun, T., Squire, D. M.: Statistical structuring of pictorial databases for contentbased image retrieval systems. Pattern Recognition Letters 17 (1996):1299–1310 549, 550 5. Niblack, W., Barber, R., Equitz, et al.: QBIC project: querying images by content, using color, texture, and shape. In: Niblack, W., ed., Storage and Retrieval for Image and Video Databases, vol. 1908 of SPIE Proc. (Apr. 1993), 173–187 550 6. Jain, A. K., Vailaya, A.: Image retrieval using color and shape. Pattern Recognition 29(8) (Aug. 1996):1233–1244 550
7. Smith, J. R., Chang, S.-F.: Tools and techniques for color image retrieval. In: Sethi, I. K., Jain, R. C., eds., Storage & Retrieval for Image and Video Databases IV , vol. 2670 of IS&T/SPIE Proceedings. San Jose, CA, USA (Mar. 1996), 426–437 550 8. Sclaroff, S., Taycher, L., La Cascia, M.: ImageRover: a content-based browser for the world wide web. In: IEEE Workshop on Content-Based Access of Image and Video Libraries. San Juan, Puerto Rico (Jun. 1997), 2–9 550, 551 9. Ma, W., Manjunath, B.: Texture features and learning similarity. In: CVPR’96 [24], 425–430 550, 551, 552 10. Pentland, A., Picard, R. W., Sclaroff, S.: Photobook: Tools for content-based manipulation of image databases. Intl. J. of Computer Vision 18(3) (Jun. 1996):233– 254 550, 551 11. Sclaroff, S.: Deformable prototypes for encoding shape categories in image databases. Pattern Recognition 30(4) (Apr. 1997):627–642. (special issue on image databases) 550 12. Cohen, S. D., Guibas, L. J.: Shape-based image retrieval using geometric hashing. In: Proc. of the ARPA Image Understanding Workshop (May 1997), 669–674 550 13. Ze Wang, J., Wiederhold, G., Firschein, O., Xin Wei, S.: Wavelet-based image indexing techniques with partial sketch retrieval capability. In: Proc. of the 4th Forum on Research and Technology Advances in Digital Libraries. Washington D.C. (May 1997), 13–24 550 14. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Region-based image querying. In: Proc. of the 1997 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR ’97). San Juan, Puerto Rico (Jun. 1997) 550 15. Ma, W. Y., Deng, Y., Manjunath, B. S.: Tools for texture- and color-based search of images. In: Rogowitz, B. E., Pappas, T. N., eds., Human Vision and Electronic Imaging II , vol. 3016 of SPIE Proc.. San Jose, CA (Feb. 1997), 496–507 550 16. Mokhtarian, F., Abbasi, S., Kittler, J.: Efficient and robust retrieval by shape content through curvature scale space. In: Smeulders and Jain [25], 35–42 551 17. Tversky, A.: Features of similarity. Psychological Rev. 84(4) (Jul. 1977):327–352 551 18. Squire, D. M.: Learning a similarity-based distance measure for image database organization from human partitionings of an image set. In: Proc. of the 4th IEEE Workshop on Applications of Computer Vision (WACV’98). Princeton, NJ, USA (Oct. 1998), 88–93 551 19. Huang, J., Kumar, S. R., Mitra, M.: Combining supervised learning with color correlograms for content-based image retrieval. In: Proc. of The Fifth ACM Intl. Multimedia Conf. (ACM Multimedia 97). Seattle, USA (Nov. 1997), 325–334 551 20. Rui, Y., Huang, T. S., Ortega, M., , Mehrotra, S.: Relevance feedback: A power tool in interactive content-based image retrieval. IEEE Trans. on Circuits and Systems for Video Technology 8(5) (Sep. 1998):644–655 551 21. Jain, A., Healey, G.: A multiscale representation including opponent color features for texture recognition. IEEE Trans. on Image Processing 7(1) (Jan. 1998):124–128 552 22. Witten, I. H., Moffat, A., Bell, T. C.: Managing gigabytes: compressing and indexing documents and images. Van Nostrand Reinhold, 115 Fifth Avenue, New York, NY 10003, USA (1994) 552 23. Squire, D. M., Pun, T.: A comparison of human and machine assessments of image similarity for the organization of image databases. In: Frydrych et al. [26], 51–58 554
24. Proc. of the 1996 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR ’96), San Francisco, California (Jun. 1996) 556 25. Smeulders, A. W. M., Jain, R., eds.: Image Databases and Multi-Media Search, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands (Aug. 1996) 556 26. Frydrych, M., Parkkinen, J., Visa, A., eds.: The 10th Scandinavian Conf. on Image Analysis (SCIA’97), Lappeenranta, Finland (Jun. 1997) 556
Genetic Algorithm for Weights Assignment in Dissimilarity Function for Trademark Retrieval David Yuk-Ming Chan and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong, Shatin, N.T., Hong Kong {ymchan,king}@cse.cuhk.edu.hk
Abstract. Trademark image retrieval is becoming an important application for logo registry, verification, and design. There are two major problems with the current approaches to trademark image retrieval based on shape features. First, researchers often focus on using a single feature, e.g., Fourier descriptors, invariant moments or Zernike moments, without combining them for possibly better results. Second, even if they combine the shape features, the weighting factors assigned to the various shape features are often determined with an ad hoc procedure. Hence, we propose to group different shape features together and suggest a technique to determine suitable weighting factors for the different shape features in trademark image retrieval. In this paper, we use a supervised learning method for finding the weighting factors in the dissimilarity function, integrating five shape features using a genetic algorithm (GA). We tested the learned dissimilarity function using a database of 1360 monochromatic trademarks and the results are promising. The images retrieved by our system agreed well with those obtained by human subjects, and the searching time for each query was less than 1 second.
1
Introduction
A trademark is a complex symbol that is used to distinguish a company's logo from others. Up to now, the number of trademarks has grown to over one million and is growing rapidly. Therefore, the task of registering a new trademark without inadvertently infringing the copyright of an existing trademark has become difficult. Currently, some researchers use a single shape attribute to represent a trademark. The results are not satisfactory, especially for large databases. On the other hand, some suggest using a combination of features; however, the relative importance of the various features is often difficult to determine. Often this is done in an ad hoc manner. To tackle these shortcomings, we propose to integrate several shape features and suggest a technique to learn the weighting factors of the dissimilarity function for trademark retrieval. In our system, five shape features are used to capture the contour and the inner parts of a trademark. Fourier descriptors are used to capture the approximated boundary of a trademark. For the inner parts of a trademark, invariant
moments, Euler number, eccentricity, and circularity are used. To integrate the shape features, a supervised learning method using a genetic algorithm is proposed. The weighting factors of the dissimilarity function learned in this way can then be used for trademark retrieval. The results show that the weighting factors found improve the accuracy of trademark retrieval and that the images retrieved by our system agree well with human perception. The paper is organized as follows. In Section 2, previous work on content-based image retrieval is reviewed. The problem is defined in Section 3. The genetic algorithm for finding the weighting factors is introduced in Section 4. Section 5 describes our trademark retrieval model and the features used. The experimental results are given in Section 6 and Section 7 concludes our report.
2
Literature Review
Various shape representation, matching and similarity measuring methods have been proposed in the past few years. The QBIC (Query By Image Content) system [1] serves as an image database filter which allows queries of large image databases based on visual image content such as color percentages, color layout, and textures occurring in the images. Although the user can use multiple cues as a query, one needs to be well trained before using the QBIC system effectively. The STAR (System for Trademark Archival and Retrieval) system [12,16] uses features based on R, G, and B color components, invariant moments, and Fourier descriptors extracted from manually isolated objects. Kim et al. [11] developed a trademark retrieval system which uses Zernike or pseudo-Zernike moments of an image as a feature set. The retrieval scheme is based on a visually salient feature that dominantly affects the global shape of the trademark. A two-stage approach for trademark retrieval is proposed by Jain et al. [10]. In the first stage, easily computable features like edge angles and moment invariants are used, while in the second stage, the plausible retrievals from the first stage are screened using a deformable template matching process. They proposed to integrate the associated dissimilarity values of two shape-based retrievals; however, they could not provide an effective way of assigning weights to the different shape features. Logo similarity matching based on positive and negative shape features was suggested by Soffer et al. [15]. The proposed methods seem to be accurate, but the testing database size is too small (130 logos) and for each insertion, the representative component of a logo must be selected by the user.
3
Problem Definition
The problem of finding weights for the dissimilarity function for trademark retrieval can be formalized as follows:
Definition 1. An image database DB is defined as

DB = \{I_i\}_{i=1}^{n},   (1)

where I_i is an image in the database.
Definition 2. Given an image I and a set of feature parameters θ = \{θ_i\}_{i=1}^{n}, a feature extraction function f is defined as

f : I × θ → \mathbb{R}^d,   (2)

which extracts a real-valued d-dimensional feature vector.
Definition 3. Let x^I_f be a feature vector of an image I on the basis of a feature extraction function f. Then, the integrated dissimilarity function, D_t, between two images I_1 and I_2 is defined as

D_t(I_1, I_2) = \frac{\sum_{i=1}^{n} w_i D_{f_i}}{\sum_{i=1}^{n} w_i},   (3)

where f_i is a feature extraction function, D_{f_i} is the Euclidean distance between the feature vectors x^{I_1}_{f_i} and x^{I_2}_{f_i}, w_i is the weight assigned to feature vector set i, and n is the number of feature vector sets. Since the use of a single shape attribute for retrieval may not provide enough discriminatory information, the integration of attributes has been suggested in order to increase the accuracy of retrievals [9,10,16].
Definition 4. Given a training database db ⊂ DB, where DB is an image database, a training pair TP is defined as

TP = (I_T, I_S),   (4)

where I_T ∈ db is the target image for a query and I_S ∈ db is the user-defined best-matched image.
Definition 5. Given an integrated dissimilarity function D_t and n training pairs for a given training database db, the total count TC(w) is defined as the number of correct hits given by D_t with the set of weights w when searching in db. For example, given a training pair TP = (I_T, I_S), if I_S is selected by the integrated dissimilarity function D_t, then this is a correct hit.
Definition 6. The set of weights w in a dissimilarity function D_t is defined as w = \{w_i\}_{i=1}^{n}, where w_i is the weight assigned to feature vector set i. The problem is to find w such that TC(w) is maximized, i.e.,

\arg\max_w TC(w).   (5)

4 Finding Weights Using a Genetic Algorithm
A genetic algorithm (GA) is an optimization method based on the evolutionary metaphor. It has been shown to outperform both gradient methods and random search in solving optimization problems with few cost function evaluations [2]. A short introduction to genetic algorithms can be found in [14]. The problem defined in Section 3 is an optimization problem; therefore, we suggest using a genetic algorithm to solve it. The details of our GA are as follows.
4.1 Chromosome Representation
A chromosome representation is used to describe an individual in the population of interest. A chromosome in our GA is defined as c = (w1 , w2 , . . . , wi , . . . , wn ),
(6)
where w_i is the weight assigned to feature vector set i and n is the number of feature vector sets, which is the same as the number of genes in the chromosome. Note that a population P is defined as P = \{c_1, c_2, \ldots, c_i, \ldots, c_{PopSize}\}, where PopSize is the number of individuals in the population and c_i is a chromosome.
4.2 Selection Function
A selection function plays a vital role in a genetic algorithm because it selects individuals to reproduce successive generations. A probabilistic selection is performed based on the individual's fitness such that the better individuals have an increased chance of being selected. The selection method, Roulette wheel, proposed by Holland [5] was used in our implementation. The probability, P_i, for each individual is defined by

P[\text{Individual } i \text{ is chosen}] = \frac{F_i}{\sum_{j=1}^{PopSize} F_j},   (7)

where F_i is equal to the fitness of individual i. A series of N random numbers is generated and compared against the cumulative probability, C_i = \sum_{j=1}^{i} P_j, of the population. If C_{i-1} < U(0, 1) ≤ C_i, the individual is selected.
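A minimal sketch of roulette-wheel selection as in Eq. (7), selecting individuals with probability proportional to their fitness via the cumulative distribution:

import random

def roulette_select(population, fitnesses, n_select):
    total = sum(fitnesses)
    cumulative, acc = [], 0.0
    for f in fitnesses:
        acc += f / total
        cumulative.append(acc)
    cumulative[-1] = 1.0                      # guard against floating-point drift
    selected = []
    for _ in range(n_select):
        u = random.random()                   # U(0, 1)
        for individual, c in zip(population, cumulative):
            if u <= c:                        # C_{i-1} < u <= C_i
                selected.append(individual)
                break
    return selected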
4.3
Genetic Operators
Genetic operators provide the basic searching mechanism of the GA. The operators are used to create new solutions based on existing solutions in the population. The operators used in our implementation include arithmetic crossover, heuristic crossover, simple crossover, boundary mutation, multi-non-uniform mutation, non-uniform mutation and uniform mutation. For more details, please refer to [6].
4.4
Initialization, Termination and Evaluation Function
The initial population is randomly generated. The stopping criterion is a predefined maximum number of generations and the evaluation function is the total count T C(w) defined in Definition 5.
Fig. 1. (a) Original image. (b) Closed image. (c) Approximated boundary extracted.
[Diagram for Fig. 2: query trademark → feature extraction (Fourier descriptors, invariant moments, eccentricity, circularity, Euler number) → integration and matching against the trademark database → retrieved trademarks.]
Fig. 2. The proposed shape-based trademark retrieval system.
5
Image Retrieval Model
In our system, five features that are scale, rotation and translation invariant are chosen to represent a trademark (see Fig. 2). They are Fourier descriptors [3,13] of the approximated boundary, seven invariant moments [7], eccentricity [8], circularity [8] and the Euler number [4]. Since there can be more than one component in a trademark, the image is first connected by applying a closing operator [3] when calculating the Fourier descriptors and circularity. Fig. 1 shows an example of approximated boundary extraction for a trademark. The extracted feature vectors of the images are stored in the database. When a user raises a query, the features of the query trademark are first extracted. The extracted features are then matched linearly with the features in the database and integrated by the dissimilarity function (see Definition 3). The trademarks are then displayed in order of similarity.
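A minimal sketch of the matching step: the integrated dissimilarity of Definition 3 as a weighted, normalised sum of per-feature Euclidean distances, used to rank the database. Feature names and the data layout are illustrative assumptions.

import numpy as np

def integrated_dissimilarity(query_feats, db_feats, weights):
    # query_feats / db_feats: {feature_name: vector}; weights: {feature_name: w_i}.
    num = sum(w * np.linalg.norm(query_feats[f] - db_feats[f])
              for f, w in weights.items())
    return num / sum(weights.values())

def rank_database(query_feats, database, weights):
    # database: {trademark_id: {feature_name: vector}}; returns ids sorted by D_t.
    return sorted(database,
                  key=lambda tid: integrated_dissimilarity(query_feats,
                                                           database[tid], weights))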
6 Experimental Results
6.1 Genetic Algorithm
The proposed genetic algorithm was tested with the following setup. The population size, PopSize, was 30 and the maximum number of iterations was set to 500. In addition, there were 40 training pairs (TP) and the size of the training database, db, was 200. The values of the weights (genes) were bounded by 0
[Plots for Fig. 3: (a) best cost function score versus number of cost function evaluations for the genetic algorithm and random sampling; (b) histogram of target-image rankings in the test sets.]
Fig. 3. (a) Best cost function value versus the number of cost function evaluations for random sampling and GA. (b) The ranking histogram.
and 1. The probability of applying crossover was 0.6 and the probability of applying mutation was 0.05. The experiments were repeated 10 times with different random seeds. Note that partial credit is given to imperfect matches in the evaluation function of the GA. For example, if the target is ranked in the first position, TC(w) is increased by one. If it is ranked in the second position, TC(w) is increased by 0.95. If it is ranked in the third position, TC(w) is increased by 0.9, and so on. This method helps the GA to search for a better solution in a smaller number of iterations.
From Fig. 3(a), the genetic algorithm improved the accuracy of the retrievals by changing the weights. It converged in 500 iterations with a total count of 26.7, which means that on average the distance function ranks the target of a query in a training pair around the eighth position. The performance of the final distance function is further analyzed by the ranking histogram in Fig. 3(b). From the histogram, 70% of the targets were ranked in the top ten positions and 83% of the targets were ranked in the top twenty positions. However, there were several targets ranked very low. This is because trademarks that appear to be perceptually similar need not be similar in their shape. As examples, Fig. 4 shows several pairs of images which may look perceptually similar, but have low image correlation. However, we believe that by using more shape features, the accuracy can be improved. We also tested the dissimilarity function with equal weights (w_1 = w_2 = w_3 = w_4 = w_5 = 0.2) and the total count, TC(w), was 14.3. This is because some shape features match human perception better than others and therefore higher weights should be assigned to them.
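A sketch of the partial-credit fitness described above. The text gives the credits for the first three ranks (1.0, 0.95, 0.9); continuing the 0.05 decrement for lower ranks is an assumption.

def total_count(rankings):
    # rankings: 1-based positions at which each training target was ranked.
    # Credit decreases by 0.05 per rank below the first, floored at zero.
    return sum(max(0.0, 1.0 - 0.05 * (rank - 1)) for rank in rankings)

# e.g. targets ranked 1st, 4th and 12th contribute 1.0 + 0.85 + 0.45
tc = total_count([1, 4, 12])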
Fig. 4. Perceptually similar images that have low image correlation.
Fig. 5. Results of similar-trademark retrieval. The first column contains the query trademarks and the other columns the retrieved similar trademarks.
6.2 Trademark Retrieval
After the GA training has terminated, the chromosome with the highest TC(w) is selected to provide the final weights for retrieval. The weights found were tested with a database of 1360 trademarks on an Ultra 5 machine. The system was developed using Matlab 5.2 under the Unix operating system. To verify the performance of the proposed method, a set of trademarks was used as queries to the trademark database. Several sample query results are shown in Fig. 5. From the results obtained, the trademarks retrieved agreed well with human perception. In addition, the average times over 10 trials for feature extraction and database querying were 2.96 s and 0.08 s respectively. Since shape similarity is a subjective issue, in order to further evaluate the proposed method, five volunteers were asked to perform similarity retrieval based on shape on the database of 1360 images.
Fig. 6. Query trademarks (LHS) and the best matches (RHS) retrieved by the volunteers.

Query Nature    n = 1 (%)  n ≤ 2 (%)  n ≤ 3 (%)  n ≤ 4 (%)  n ≤ 5 (%)  n ≤ 20 (%)  Not retrieved (%)
Normal Query       20         30         50         70         70        100              0
Rotated Query      10         20         40         70         70        100              0
Scaled Query       20         20         50         60         60        100              0
Table 1. Retrieval results on the basis of the dissimilarity function with the weights found by the GA: n refers to the position of the correct retrieval. The last column indicates the percentage of non-retrieved images in the top 20 matches.
Given a query, they were asked to choose the best match from the database. Fig. 6 shows the query trademarks and the best matches retrieved by the volunteers, and Table 1 presents the retrieval results of our system, where n corresponds to the position of the correct retrieval. As observed from the results, our simple shape measure was effective in retrieving rotated and scaled images. This is because the shape features that we chose are all rotation and scaling invariant. Moreover, all the target images were ranked in the top twenty positions. This reveals that the dissimilarity function models human perception quite well.
7
Conclusion and Future Work
In this paper, we suggested integrating different trademark features by using a dissimilarity function for retrieving similar trademarks. A method for finding the weighting factors of the dissimilarity function using a GA has been proposed. The results show that the weighting factors found by the GA improve the accuracy of trademark retrieval. Future research will consider using more shape features, such as Zernike moments or edge angles, to further improve the accuracy of retrieval. The weights of the distance function can be found easily by the proposed GA. Besides, this method can be applied to optimize other general dissimilarity measures (other than a weighted sum of Euclidean distances). This can be easily done by changing
the distance function to be optimized in the GA and using the parameters of the distance function as genes in the chromosome representation of the GA.
References 1. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by Image and Video Content: The QBIC System. Computer, 28(9):23–32, September 1995. 558 2. D. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989. 559 3. R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison-Wesley, 1992. 561 4. R. M. Haralick and L. G. Shapiro. Computer and Robot Vision, volume 2. AddisonWesley, 1993. 561 5. J. Holland. Adaptation in natural and artifical systems. The University of Michigan Press, 1975. 560 6. C. Houck, J. Joines, and M. Kay. A genetic algorithm for function optimization: A matlab implementation. NCSU-IE, 95-09, 1995. 560 7. M. K. Hu. Visual Pattern Recognition by Moment Invariants. IRE Trans. on Information Theory, 8, 1962. 561 8. B. Jahne. Digital Image Processing: Concepts, Algorithms and Scientific Applications. Springer-Verlag, Berlin; New York, 4 edition, 1997. 561 9. A. K. Jain and A. Vailaya. Image Retrieval using Color and Shape. Pattern Recognition, 29(8):1233–1244, 1996. 559 10. A. K. Jain and A. Vailaya. Shape-Based Retrieval: A Case Study with Trademark Image Databases. Pattern Recognition, 31(9):1369–1390, 1998. 558, 559 11. Y. S. Kim and W. Y. Kim. Content-Based Trademark Retrieval System Using Visually Salient feature. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 307–312, 1997. 558 12. C. P. Lam, J. K. Wu, and B. Mehtre. STAR - A System for Trademark Archival and Retrieval. In 2nd Asian Conf. on Computer Vision, volume 3, pages 214–217, 1995. 558 13. E. Persoon and K. S. Fu. Shape discrimination using Fourier descriptors. IEEE Trans. on Systems, Man Cybernetics, 7(2):170–179, 77. 561 14. G. Roth and M. D. Levine. Geometric Primitive Extraction Using a Genetic Algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(9), Sep 1994. 559 15. A. Soffer and H. Samet. Using Negative Shape Features for Logo Similarity Matching. In 14the International Conf. on Pattern Recognition, volume 1, pages 571–573, 1998. 558 16. J. K. Wu, B. M. Mehtre, Y. J. Gao, C. P. Lam, and A. D. Narasimhalu. STAR—A Multimedia Database System For Trademark Registration. In Witold Litwin and Tore Risch, editors, Applications of Databases, First International Conference, volume 819 of Lecture Notes in Computer Science, pages 109–122, Vadstena, Sweden, 21–23 June 1994. Springer. 558, 559
Retrieval of Similar Shapes under Affine Transform Farzin Mokhtarian and Sadegh Abbasi Centre for Vision Speech and Signal Processing Department of Electronic & Electrical Engineering University of Surrey, Guildford, Surrey GU2 5XH, UK Tel: +44-1483-876035, Fax: +44-1483-259554 [email protected] http://www.ee.surrey.ac.uk/Research/VSSP/imagedb/demo.html
Abstract. The application of Curvature Scale Space representation in shape similarity retrieval under affine transformation is addressed in this paper. The maxima of Curvature Scale Space (CSS) image have already been used to represent 2-D shapes in different applications. The representation has shown robustness under the similarity transformations. Scaling, orientation changes, translation and even noise can be easily handled by the representation and its associated matching algorithm. In this paper, we also consider non-uniform scaling and examine the performance of the representation under affine transformations. It is observed that the performance of the method is promising even under severe deformations caused by the transformation. The method is evaluated objectively through a very large classified database and its performance is compared with the performance of two well-known methods, namely Fourier descriptors and moment invariants.
1
Introduction
A number of shape representations have been proposed to recognise shapes under affine transformation. Some of them are extensions of well-known methods such as Fourier descriptors [2] and moment invariants [4][13]. The methods are then tested on a small number of objects for the purpose of object recognition. In both methods, the basic idea is to use a parametrisation which is robust with respect to affine transformation. The arc length representation is not transformed linearly under shear and is therefore replaced by affine length [5]. The shortcomings of the affine length include the need for higher order derivatives, which results in inaccuracy, and inefficiency as a result of computational complexity. Moreover, it can be shown [8] that although the arc length is not preserved, it does not change dramatically under affine transformation. Affine invariant scale space is reviewed in [11]. This curve evolution method is proven to have similar properties to curvature evolution [7], as well as being affine-invariant. However, an explicit shape representation has yet to be introduced based on the theory of affine invariant scale space. The prospective shape
representation might be computationally complex as the definition of the affine curvature involves higher order derivatives. We have already used the maxima of Curvature Scale Space (CSS) image to represent shapes of boundaries in similarity retrieval applications [9][10]. The representation is proved to be robust under similarity transformation which include translation, scaling and changes in orientation. In this paper, we examine the robustness of the representation under general affine transformation which also includes non-uniform scaling. As a result, the shape is deformed and therefore the resulting representation may change. We will show that the performance of the method is promising even in the case of severe deformations. The following is the organisation of the remainder of this paper. In section 2, CSS image is introduced and the CSS matching is briefly explained. Section 3 is about the affine transformation and the way we create our large databases. In section 4, we evaluate the performance of the method in two different ways. The results are then compared to the results of other well-known methods in section 5. The concluding remarks are presented in section 6.
2
Curvature Scale Space Image and the CSS Matching
Consider a parametric vector equation for a curve: r(u) = (x(u), y(u)), where u is an arbitrary parameter. The formula for computing the curvature function can be expressed as:

\kappa(u) = \frac{\dot{x}(u)\ddot{y}(u) - \ddot{x}(u)\dot{y}(u)}{(\dot{x}^2(u) + \dot{y}^2(u))^{3/2}}.   (1)

If g(u, σ), a 1-D Gaussian kernel of width σ, is convolved with each component of the curve, then X(u, σ) and Y(u, σ) represent the components of the resulting curve, Γ_σ:

X(u, σ) = x(u) ∗ g(u, σ),   Y(u, σ) = y(u) ∗ g(u, σ).

It can be shown [10] that the curvature of Γ_σ is given by:

\kappa(u, \sigma) = \frac{X_u(u, \sigma)\, Y_{uu}(u, \sigma) - X_{uu}(u, \sigma)\, Y_u(u, \sigma)}{(X_u(u, \sigma)^2 + Y_u(u, \sigma)^2)^{3/2}}.   (2)
As σ increases, the shape of Γ_σ changes. This process of generating ordered sequences of curves is referred to as the evolution of Γ (Figure 1a). If we calculate the curvature zero crossings of Γ_σ during evolution, we can display the resulting points in the (u, σ) plane, where u is the normalised arc length and σ is the width of the Gaussian kernel. For every σ we have a certain curve Γ_σ which, in turn, has some curvature zero crossing points. As σ increases, Γ_σ becomes smoother and the number of zero crossings decreases.
Fig. 1. a) Shrinkage and smoothing of the curve and the decreasing number of curvature zero crossings during the evolution, from left: σ = 1, 4, 7, 10, 12, 14. b) The CSS image of the shape.
When σ becomes sufficiently high, Γ_σ will be a convex curve with no curvature zero crossings, and we terminate the process of evolution. The result of this process can be represented as a binary image called the CSS image of the curve (see Figure 1). Black points in this image are the locations of curvature zero crossings during the process of evolution. The intersection of every horizontal line with the contours in this image indicates the locations of curvature zero crossings on the corresponding evolved curve Γ_σ. The curve is finally represented by the locations of the maxima of its CSS image contours.
Curvature Scale Space Matching The algorithm used for comparing two sets of maxima, one from the input and the other from one of the models, has been described in [10] and [9]. The algorithm first finds any possible changes in orientation which may have occurred in one of the two shapes. A circular shift is then applied to one of the two sets to compensate for the effects of the change in orientation. The summation of the Euclidean distances between the relevant pairs of maxima is then defined to be the matching value between the two CSS images.
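A minimal sketch (Python, not the authors' code) of how the CSS zero-crossing points can be collected: each coordinate of the closed contour is smoothed with a Gaussian of increasing σ, the curvature of Eq. (2) is evaluated numerically, and the arc-length positions of its sign changes are recorded. The contour sampling and the σ schedule are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def curvature(x, y, sigma):
    # Smooth the coordinates of the closed curve and evaluate Eq. (2) numerically.
    xs = gaussian_filter1d(x, sigma, mode='wrap')
    ys = gaussian_filter1d(y, sigma, mode='wrap')
    xu, yu = np.gradient(xs), np.gradient(ys)
    xuu, yuu = np.gradient(xu), np.gradient(yu)
    return (xu * yuu - xuu * yu) / (xu ** 2 + yu ** 2) ** 1.5

def css_zero_crossings(x, y, sigmas):
    # Return a list of (u, sigma) points making up the CSS image,
    # with u the normalised arc-length position of each zero crossing.
    points = []
    for s in sigmas:
        k = curvature(x, y, s)
        crossings = np.where(np.sign(k) != np.sign(np.roll(k, -1)))[0]
        points.extend((u / len(x), s) for u in crossings)
    return points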
3
Non-uniform Scaling and Our Databases
The general affine transformation can be represented mathematically by the following equation:

x_a(t) = a\,x(t) + b\,y(t) + e, \qquad y_a(t) = c\,x(t) + d\,y(t) + f,   (3)

where x_a(t) and y_a(t) represent the coordinates of the transformed shape. There are six degrees of freedom in this transformation. Translation and uniform scaling are represented by two degrees of freedom, while change in orientation needs just one parameter. The remaining parameter is related to shear. The general affine transformation contains all these transformations. However, it is possible
to apply only one of them at a time. Scaling, change in orientation and shear are represented by the following matrices:

A_{scaling} = \begin{pmatrix} S_x & 0 \\ 0 & S_y \end{pmatrix}, \qquad A_{rotation} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \qquad A_{shear} = \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}.

If S_x is equal to S_y, A_{scaling} represents a uniform scaling. A shape is not deformed under rotation, uniform scaling and translation. However, non-uniform scaling and shear contribute to the shape deformation under a general affine transformation. In this paper, we examine the performance of the CSS representation under non-uniform scaling. We re-write the matrix A_{scaling} as follows:

A_{scaling} = \begin{pmatrix} 1 & 0 \\ 0 & k \end{pmatrix}.

The measure of shape deformation depends on the parameter k in this matrix. Figure 2 shows the effects of affine transformation on shape deformation. In this figure, we have chosen k = 2.5. In order to achieve different transformations, we have changed the orientation of the original shape prior to applying the non-uniform scaling. The values of θ range from 20° to 180°, with 20° intervals. As the figure shows, the deformation is severe for k = 2.5. For larger values of k, e.g. 3 and 4, the deformation is much more severe.
In order to create four different databases, we choose four different values for k: 1.5, 2.5, 3.0 and 4.0. We then apply the transformation to a database of 500 original object contours. From every original object, we obtain 9 transformed shapes with different values of θ. Therefore, each database consists of 500 original and 4500 transformed shapes. We then carry out a series of experiments on these databases to verify the robustness of the CSS image representation under affine transformations. The following section is concerned with these experiments.
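A sketch of how such a transformed contour can be generated as described above: rotate the original contour by θ and then apply the non-uniform scaling diag(1, k). Representing the contour as an (N, 2) array is an assumption.

import numpy as np

def transform_contour(points, theta_deg, k):
    theta = np.radians(theta_deg)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    scaling = np.array([[1.0, 0.0],
                        [0.0, k]])
    # Row-vector points: p -> S R p, written as p @ R^T @ S^T.
    return points @ rotation.T @ scaling.T

# Nine transformed versions per shape: theta = 20, 40, ..., 180 degrees.
def make_versions(points, k):
    return [transform_contour(points, theta, k) for theta in range(20, 181, 20)]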
4
Results
We examined the performance of the representation through two different experiments. The first one was performed on the databases of Section 3. Every original shape was selected as the input query and the first n outputs of the system were observed to see whether the transformed versions of the query were retrieved by the system. We found that for k = 1.5, almost all transformed versions of a shape appear in the first 20 outputs of the system. For k = 2.5, on average 99% of the transformed shapes of the input query are among the first 20 shapes retrieved by the system as similar to the input query. This figure is 98% and 96% for k = 3.0 and k = 4.0, respectively. The results for different values of k and for different numbers of output images are presented in Figure 4a. As this figure shows, the first outputs of the system always include a large portion of the transformed versions of the input query. These results show that the representation can be used in an affine-transformed environment. This fact is also verified by the following evaluation method.
Fig. 2. The deformation of shapes is considerable even with k = 2.5 in nonuniform transform. The original shape is presented in top left. Others represent transformation with k = 2.5 and θ = 20◦ , 40◦ , ..., 160◦ , 180◦.
Objective evaluation with classified database We have already used the idea of classified databases to evaluate a shape representation method in shape similarity retrieval [1]. In this paper we have chosen 10 groups of similar shapes, presented in Figure 3. These are our original shapes and we produce 9 transformed shapes from each original one. As a result, a group with 8 members will have 80 members after adding the transformed shapes. In order to assign a performance measure to the method, we choose every member of each group as the input query and ask the system to find the n shapes most similar to the input from the database. We then observe the number of outputs, m, which are from the same group as the input. The performance measure of the system for this particular input is defined as the ratio of the actual m and the maximum possible value of m, namely m_max. The performance measure of the system for the whole database is the average of the performance measures for each input query.
As seen in Figure 3, the total number of original shapes is 76, and therefore the database population is 760. We chose different values for n and for each case computed the performance measure of the method. The results are presented in Figure 4b. As this figure shows, for lower values of n, the performance measure of the system is considerably high. For example, in the case of k = 1.5 and n = 10, it is 100%. In other words, all of the first 10 outputs of the system belong to the same group as the input. This is a result of the fact that the system ranks the outputs based on their similarity measure to the input, and there is a strong possibility that the most similar models are in the same group as the input query. As n increases, the similarity between the bottom-ranked outputs and the input query decreases and models from other groups may also appear in the output. At the same time, m_max is equal to n for n ≤ 80 and increases with n. For n ≥ 80, m_max does not increase anymore and is fixed at 80; however, m always increases with n and therefore this part of each plot has a positive slope.
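A minimal sketch of this performance measure, assuming m_max = min(n, group size) as described above:

def performance_measure(retrieved, query_group, n):
    # Fraction of the first n outputs that belong to the query's group,
    # normalised by the best achievable value m_max.
    m = sum(1 for shape in retrieved[:n] if shape in query_group)
    m_max = min(n, len(query_group))
    return m / m_max

# Toy usage with hypothetical shape identifiers.
score = performance_measure(['s1', 's7', 'x2', 's3', 'x5', 'x9'],
                            {'s1', 's3', 's7', 's9'}, n=5)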
Fig. 3. Classified database used for objective evaluation. Note that for each original shape, nine transformed versions are also generated.
Table 1. Results of evaluation of our method on classified databases.

          n = 10   n = 20   n = 50   n = 120   n = 150
k = 1.5    100%      94%      83%       79%       81%
k = 2.5     99%      92%      80%       77%       80%
k = 3.0     99%      92%      79%       76%       79%
k = 4.0     98%      90%      77%       76%       79%
These results are also presented in another form in Table 1. As this table shows, the first 10 outputs of the system are almost always in the same group as the input query. These can be either the transformed versions of the input or other members of the group. Although this figure drops slightly when we look at the first 20 outputs of the system, it is still very high. In fact, considering the large degree of deformation, especially in the case of k ≥ 2.5, the results are very promising even in the case of n = 50 or n = 120.
5
Comparison with Other Methods
Fourier descriptors [12] and moment invariants [3][6] have been widely used as shape descriptors in similarity-transform environments. Both methods represent the global appearance of the shape in their most important components. For example, the largest-magnitude component of the Fourier descriptors represents the dimensions of the best-fitting ellipse of the shape. Since affine transformation changes the global appearance of a shape, it is expected that the performance of these methods is negatively affected under the transformation. Modified versions of these methods have been introduced to deal with affine transformation [2][4][13]. They have been used in object recognition applications and for clustering small numbers of shapes. However, it has been shown that the improvements of the modified versions are not very significant in comparison to
[Figure 4 shows two plots of success rate (%) against the number of observed outputs n, for k = 1.5, 2.5, 3.0 and 4.0.]
Fig. 4. a) Identifying transformed versions of the input query. b) Evaluation with classified database.
the conventional versions. Considering the fact that the implementation of the modified versions is not a straightforward task, we decided to examine the conventional versions of these methods and compare the results with the results of our method. We observed that the difference between the performance measure of our method and the performance measure of each of these methods is very large. Even if we allow a 10-15% improvement for the modified versions of the methods, their performance is still well behind that of the CSS representation. We implemented the method described in [12] to represent a shape with its normalised Fourier descriptors. Every original shape and its transformed versions were represented by their first 20 components. The Euclidean distance was used to measure the similarity between two representations. The results for different values of k are presented in two forms in Figure 5a and Table 2. The minimum of the performance measure for different values of k is around 30%, compared to about 70% for the CSS method. At the same time, the slope of the plots after the minimum points is not steep, which means that most of the missing models are not ranked even among the first 200 outputs of the system. For moment invariants, each object is represented by a 12-dimensional feature vector containing two sets of normalised moment invariants [3], one computed from the object boundary and the other from the solid silhouette. The Euclidean distance is used to measure the similarity between different shapes. The results are presented in Figure 5b. Comparing the plots of this Figure and the ones in Figure 4, a large difference between the performance of the CSS representation and this method is observed.
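For reference, a generic normalised-Fourier-descriptor comparison along these lines can be sketched as follows; this is a textbook construction for illustration only and is not claimed to reproduce the exact normalisation of [12].

```python
import numpy as np

def fourier_descriptors(contour, n_coeffs=20):
    """Generic normalised Fourier descriptors of a closed contour given as an
    (N, 2) array of boundary points.  Dropping the DC term gives translation
    invariance, dividing by |F[1]| gives scale invariance, and taking
    magnitudes discards rotation and starting-point phase."""
    z = contour[:, 0] + 1j * contour[:, 1]
    coeffs = np.fft.fft(z)
    return np.abs(coeffs[1:n_coeffs + 1]) / np.abs(coeffs[1])

def fd_distance(contour_a, contour_b, n_coeffs=20):
    """Euclidean distance between the two descriptor vectors."""
    fa = fourier_descriptors(contour_a, n_coeffs)
    fb = fourier_descriptors(contour_b, n_coeffs)
    return np.linalg.norm(fa - fb)
```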
Table 2. Results of evaluation of FD method on classified databases.

          n = 10   n = 20   n = 50   n = 120   n = 150
k = 1.5     93%      82%      61%       56%       63%
k = 2.5     77%      61%      40%       40%       44%
k = 3.0     75%      58%      38%       37%       41%
k = 4.0     72%      55%      35%       34%       38%
6
Conclusion
In this paper we demonstrated that the maxima of the Curvature Scale Space image can be used for shape representation in similarity retrieval applications even under affine transformation. We observed that the method can identify the transformed versions of an input query among a large number of database models. The results of evaluating the method on a very large classified database of original and transformed shapes showed that its performance is promising even under severe deformation of the shapes as a result of affine transformation. The method was compared with two well-known methods, namely Fourier descriptors and moment invariants. The results showed the superiority of our method over these methods.
References

1. S. Abbasi, F. Mokhtarian, and J. Kittler. Shape similarity retrieval using a height adjusted curvature scale space image. In Proceedings of 2nd International Conference on Visual Information Systems, pages 173–180, San Diego, California, USA, 15-17 December 1997. 570
2. K. Arbter et al. Applications of affine-invariant Fourier descriptors to recognition of 3-D objects. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(7):640–646, July 1990. 566, 571
3. S. Dudani, K. Breeding, and R. B. McGhee. Aircraft identification by moment invariants. IEEE Transactions on Computers, C-26:39–45, 1977. 571, 572
4. J. Flusser and T. Suk. Pattern recognition by affine moment invariants. Pattern Recognition, 26(1):167–174, January 1993. 566, 571
5. H. W. Guggenheimer. Differential Geometry. McGraw-Hill, New York, 1963. 566
6. M. K. Hu. Visual pattern recognition by moment invariants. IRE Trans. Information Theory, IT-8:179–187, 1962. 571
7. B. B. Kimia and K. Siddiqi. Geometric heat equation and nonlinear diffusion of shapes and images. Computer Vision and Image Understanding, 64(3):305–332, 1996. 566
8. F. Mokhtarian and S. Abbasi. Shape similarity retrieval under affine transforms through curvature scale space. In Submitted to 10th International Conference on Image Analysis and Processing, Venice, Italy, September 1999. 566
[Figure 5 shows two plots of success rate (%) against the number of observed outputs n, for k = 1.5, 2.5, 3.0 and 4.0: (a) FD results, (b) Moments.]
Fig. 5. The results of objective evaluation for moment invariants and Fourier descriptors. Compare these plots with the results of the CSS representation in Figure 4.
9. F. Mokhtarian, S. Abbasi, and J. Kittler. Efficient and robust retrieval by shape content through curvature scale space. In Proceedings of the First International Workshop on Image Database and Multimedia Search, pages 35–42, Amsterdam, The Netherlands, August 1996. 567, 568
10. F. Mokhtarian, S. Abbasi, and J. Kittler. Robust and efficient shape indexing through curvature scale space. In Proceedings of the British Machine Vision Conference, BMVC'96, volume 1, pages 53–62, Edinburgh, September 1996. 567, 568
11. G. Sapiro and A. Tannenbaum. Affine invariant scale space. International Journal of Computer Vision, 11(1):25–44, 1993. 566
12. T. P. Wallace and P. Wintz. An efficient three-dimensional aircraft recognition algorithm using normalised Fourier descriptors. Computer Graphics and Image Processing, 13:99–126, 1980. 571, 572
13. A. Zhao and J. Chen. Affine curve moment invariants for shape recognition. Pattern Recognition, 30(6):895–901, 1997. 566, 571
Efficient Image Retrieval through Vantage Objects

Jules Vleugels and Remco Veltkamp
Department of Computer Science, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
{jules,Remco.Veltkamp}@cs.uu.nl, http://www.cs.uu.nl/~jules/

Abstract. We describe a new indexing structure for general image retrieval that relies solely on a distance function giving the similarity between two images. For each image object in the database, its distance to a set of m predetermined vantage objects is calculated; the m-vector of these distances specifies a point in the m-dimensional vantage space. The database objects that are similar (in terms of the distance function) to a given query object can be determined by means of an efficient nearest-neighbor search on these points. We demonstrate the viability of our approach through experimental results obtained with a database of about 48,000 hieroglyphic polylines.
1
Introduction
Recent years have seen a growing interest in developing effective methods for searching large image databases. While manual browsing may be adequate for collections of a few hundred images, larger databases require automated tools for search and perusal. Content-based image retrieval [20,21,19,13] is based on certain characteristics of the images; our particular interest lies in approaches based on feature extraction [18]. Such methods perform matching on the contents of the image itself, by comparing features of the query image with those of images stored in the database. Global features are derived from the entire shape; examples of this are roundness, central moment, eccentricity, and major axis orientation [2,7]. We are specifically interested in features with a geometric flavor, such as being able to search a collection of vector images for the occurrence of a query polyline. In this paper, we present a way of efficiently comparing the features of a given query image to a large number of images stored in a database. Our approach works by mapping database objects onto points in an m-dimensional space, in such a way that points that lie close together correspond to images with similar features. This allows us to efficiently retrieve objects that are similar to a given query object by determining the points that lie close to the point corresponding to the query image. The only requisite for our approach is that a similarity distance function be defined on the database objects.
This research was supported by SION project No. 612-21-201: Advanced Multimedia Indexing and Searching (AMIS).
As an application, we implemented our approach on a database of about 48,000 hieroglyphic polylines [1], and show how to efficiently retrieve the hieroglyphics with polylines that are similar (under translation and rotation) to those of a given query hieroglyphic.

1.1 Related Work
The vantage-object structure we describe in this paper is a paradigm to store objects in a data structure such that objects with similar features can be retrieved in an efficient manner. Wolfson [25] describes a different paradigm for rigid-object recognition under different viewing transformations, which he calls Geometric Hashing. All combinations of so-called interest points of the database objects are indexed into a hash table. The interest points of a given query image are compared to those stored in the database through a voting mechanism. A drawback of this paradigm lies in its time and storage bounds: a query with an object containing m interest points (for example, the number of vertices for polylines) has a worst-case complexity of O(m³), and for n planar objects with m interest points each the hash table requires O(nm³) storage to achieve invariance under translation, rotation, and scaling. An important ingredient of the approach described in this paper is an algorithm that determines nearest neighbors among a set of high-dimensional points. Several theoretically efficient solutions [4,5,9,14] have arisen from the Computational Geometry community, but these algorithms are not always equally efficient in practice. Yianilos [26] considers the problem of finding nearest neighbors in general metric spaces. He introduces the vantage-point tree (vp-tree for short) together with associated algorithms. (Uhlmann [23] has independently reported the same structure, calling it a metric tree.) The expected complexity of a vp-tree query is under certain circumstances O(log n), and in practice it appears to perform somewhat better than a k-d tree. The vp-tree is however not guaranteed to be balanced, so its expected complexity may not be met for inputs with particularly unfavorable distributions. It should be emphasized that, despite the similarity in naming, our vantage-object structure has little in common with the vantage-point tree: our vantage-object structure is a data structure used to store objects for efficient similarity retrieval, whereas the vp-tree implements nearest-neighbor queries on point sets. The similarity in naming occurs because both structures categorize database objects in terms of their distances from vantage objects. For completeness, we mention some other nearest-neighbor data structures with similar properties that have been devised with high-dimensional data sets in mind: the SR-tree due to Katayama and Satoh [12], the SS-tree of White and Jain [24], the TV-tree proposed by Lin et al. [15], and the X-tree as reported by Berchtold et al. [6]. The nearest-neighbor step of our algorithm can be replaced by a range search, that is, instead of determining the k nearest neighbors, we could consider all points that lie within some fixed distance from the query point. (This is essentially what is achieved by the distance threshold we introduce in Section 4.4.) There exists a wealth of literature on range searching [16,17,8]; recent research
has considered the problem of achieving good asymptotic complexity as well as favorable running times. One example is the algorithm due to Schwarzkopf and Vleugels [22], which, besides being theoretically efficient, has been explicitly designed to perform well in practice for realistic inputs.
2
Preliminaries
Let A = {A1, . . . , An} be a set of n objects in R². In this paper, we call a continuous function d : A × A → R a distance function on A if and only if for all Ai, Aj, Ak ∈ A with 1 ≤ i, j, k ≤ n:
1. d(Ai, Ai) = 0 (one-way identity),
2. d(Ai, Aj) ≥ 0 (positive), and
3. d(Ai, Aj) ≤ d(Ai, Ak) + d(Ak, Aj) (triangle inequality).
Note that we neither require a distance function to have the usual identity nor to be symmetric, that is, d(Ai, Aj) = 0 does not necessarily imply that Ai = Aj, and moreover d(Ai, Aj) need not be equal to d(Aj, Ai). However, the distance function defined by d should closely adhere to our intuitive notion of shape resemblance for the results of a query to be perceived as resembling the query object. Throughout the paper, a superscripted asterisk ∗ is used for variables that are related to vantage objects; a subscripted question mark ? applies to query-related variables.
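A small sanity check of these three requirements on a sample of objects might look as follows; this is an illustrative sketch only, and note that symmetry and the usual identity are deliberately not tested.

```python
import itertools

def check_distance_function(d, objects, tol=1e-9):
    """Verify the three requirements on a sample of objects; symmetry and the
    usual identity are NOT required and therefore not tested."""
    for a in objects:
        assert abs(d(a, a)) <= tol, "one-way identity violated"
    for a, b in itertools.product(objects, repeat=2):
        assert d(a, b) >= -tol, "positivity violated"
    for a, b, c in itertools.product(objects, repeat=3):
        assert d(a, b) <= d(a, c) + d(c, b) + tol, "triangle inequality violated"
    return True
```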
3
The Vantage Object Structure
Suppose we are given a collection of objects with a distance measure d defined on them. Consider two objects A1 and A2 that are very similar (that is, d(A1, A2) is small), and let A∗ be some third object, which we call a vantage object. Since d satisfies the triangle inequality, |d(A1, A∗) − d(A2, A∗)| will be small as well. In other words, we can measure the resemblance between A1 and A2 by comparing their respective distances from A∗. Note that this correspondence is strictly one way: objects that are similar will have similar distances from a vantage object A∗, but objects with similar distances are not necessarily similar in appearance. (This is not difficult to see by considering the analogy with the natural numbers: even though the distance from 1 to 7 equals that from 13 to 7, the distance from 1 to 13 is much larger.) Given a query object A?, we can thus determine (a superset of) all similar objects by computing the distance of A? to some vantage object A∗ and selecting all objects that achieve similar distance from A∗. More formally, each object Ai corresponds to a point pi in the one-dimensional vantage space defined by A∗, in which the coordinate of pi is given by d(Ai, A∗). As noted above, this selection will also include objects that are not similar to A? but happen to have similar distance to A∗. Moreover, if the set of objects is large, the set of objects with similar distance will most likely grow large as well. Both of these problems can
be relieved by increasing the number of vantage objects to, say, m. Let A∗ = {A∗1, . . . , A∗m} be a set of m vantage objects. Each object Ai ∈ A corresponds to a point pi = (x1, . . . , xm) in the m-dimensional vantage space, such that xj = d(Ai, A∗j). For a given query object A?, we compute its distance to each of the m vantage objects; this defines a point p?. Next, we determine all vantage-space points that are sufficiently close to p?; each of these points corresponds to an object of A. Once again, it is not guaranteed that each of the objects returned is similar to the query object; however, all objects that are sufficiently similar to A? are correctly returned in this manner. To see this, define the object-space neighbors nbors(A?, ε) of A? as the set of objects whose distance from A? is at most ε:

nbors(A?, ε) = {Ai ∈ A | d(Ai, A?) ≤ ε}.

In contrast, let nbors∗(A?, A∗, ε) denote the set of vantage-space neighbors of A?, that is, the objects whose distance from each of the vantage objects differs by at most ε from the corresponding distance of A?:

nbors∗(A?, A∗, ε) = {Ai ∈ A | ∀A∗ ∈ A∗ : |d(Ai, A∗) − d(A?, A∗)| ≤ ε}.

Lemma 1. The set of vantage-space neighbors of a query object A? includes its object-space neighbors, that is, nbors∗(A?, A∗, ε) ⊇ nbors(A?, ε).

Proof. Let Ai be an object with d(Ai, A?) ≤ ε, and A∗ = {A∗1, . . . , A∗m} the set of vantage objects. By the triangle inequality, d(Ai, A∗j) ≤ d(Ai, A?) + d(A?, A∗j) ≤ d(A?, A∗j) + ε holds for each j with 1 ≤ j ≤ m. It follows that Ai is included in the set nbors∗(A?, A∗, ε) as well.

A similarity query among objects thus reduces to a simple nearest-neighbors query among a set of points. To answer such a query, we need only compute the distance of the query object to the m vantage objects, and determine the vantage-space points that are sufficiently close to the resulting query point. The main advantage of this approach is that the computationally most expensive step, computing the similarity measure between the database objects, is performed entirely offline; at runtime we can deal with points rather than the original images. This framework leaves some details to be filled in. For one, it is defined in terms of some distance function and a set of vantage objects, both of which depend on the application at hand. Secondly, we need a means of finding the k nearest neighbors of a query point among a set of m-dimensional points. For now, we limit this discussion to mentioning that several known solutions [4,5,9,14] achieve O(k log n) query time after O(n log n) preprocessing. This gives the following result.

Theorem 1. Let A be a set of n objects, A? a query object, and ε ≥ 0 a constant, and let T denote the time required to compute the distance between any two given objects. For any constant m > 0, we can preprocess A in O(mnT + n log n) time and O(mn) space, such that we can retrieve the objects Ai with d(Ai, A?) ≤ ε in O(mT + k log n) time, where k is the cardinality of nbors∗(A?, A∗, ε).
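A minimal sketch of this pipeline, assuming a SciPy k-d tree as a stand-in for the nearest-neighbor structure used by the authors, could look as follows; the L∞ range query corresponds directly to the nbors∗ set defined above.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_vantage_index(objects, vantage_objects, distance):
    """Offline step: map every database object to its m-vector of distances
    to the vantage objects; all expensive distance computations happen here."""
    points = np.array([[distance(obj, v) for v in vantage_objects]
                       for obj in objects])
    return cKDTree(points)

def vantage_space_neighbors(tree, query_object, vantage_objects, distance, eps):
    """Online step: m distance computations for the query, then an L-infinity
    range query, which returns exactly the nbors* candidate set."""
    q = np.array([distance(query_object, v) for v in vantage_objects])
    return tree.query_ball_point(q, r=eps, p=np.inf)
```

The returned indices identify candidate database objects; a postprocessing step may still re-rank them with the full distance function.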
4
Matching Hieroglyphics
We implemented the algorithm described in the previous section on a collection of approximately 4,700 hieroglyphics with a total of over 48,000 polylines [1]. To apply the framework to this database, we need a matching distance function that measures the distance between two hieroglyphics, and pick a number of vantage objects for the collection. Additionally, the framework requires an efficient algorithm to retrieve the nearest vantage-space neighbor of a given hieroglyphic. Finally, each hieroglyphic consists of a number of polylines, and we need a way to map the query results for the separate polylines to entire hieroglyphics.

4.1 A Matching Distance Function
Arkin et al. [3] describe a matching metric for polygons that is invariant under translation, rotation, and scaling, is defined for both convex and non-convex polygons, and can be computed in time O(c1 c2 log c1 c2), where c1, c2 are the numbers of vertices of the respective polygons. The metric is based on the L2 distance between the turning functions of two polygons. The turning function ΘA(s) of a polygon measures the angle of the counterclockwise tangent as a function of the arc length s, measured from some reference point O on the boundary of A. In other words, ΘA(0) is the angle that the tangent at the reference point O makes with some reference orientation, for example, the x-axis; ΘA(s) keeps track of the turning that takes place as one traces the boundary of A, increasing with left-hand turns and decreasing with right-hand turns. Turning functions are periodic: ΘA(s) = ΘA(s + l) mod 2π, where l is the perimeter length of A. The turning function ΘA(s) of a polygon A is translation invariant by definition, and becomes ΘA(s) + α if the polygon is rotated over α degrees; scaling invariance can be achieved by normalizing the polygons to a standard perimeter length of 1. The distance between two turning functions ΘA and ΘB is defined as

min_{0 ≤ t < 1, 0 ≤ α < 2π} ∫₀¹ |ΘA(s) − ΘB(s + t) + α| ds.
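A much-simplified discrete approximation of this kind of metric is sketched below; it assumes a simple, counter-clockwise polygon, searches the shift t over a grid, and uses the mean difference as the rotation offset α (which is optimal for the squared integrand), so it should not be read as the exact algorithm of Arkin et al. [3].

```python
import numpy as np

def turning_function(polygon, n_samples=256):
    """Discretised turning function of a closed polygon ((N, 2) vertex array):
    the unwrapped tangent direction sampled at equal arc-length positions."""
    pts = np.asarray(polygon, dtype=float)
    closed = np.vstack([pts, pts[:1]])
    seg = np.diff(closed, axis=0)
    seg_len = np.hypot(seg[:, 0], seg[:, 1])
    angles = np.unwrap(np.arctan2(seg[:, 1], seg[:, 0]))
    # normalised arc-length position at which each segment starts
    s_start = np.concatenate([[0.0], np.cumsum(seg_len)[:-1]]) / seg_len.sum()
    s = np.linspace(0.0, 1.0, n_samples, endpoint=False)
    idx = np.searchsorted(s_start, s, side='right') - 1
    return angles[idx]

def turning_distance(poly_a, poly_b, n_samples=256):
    """Brute-force search over the cyclic shift t; for each shift the rotation
    offset alpha is the mean difference.  Assumes simple, counter-clockwise
    polygons, whose total turning over one traversal is 2*pi."""
    ta = turning_function(poly_a, n_samples)
    tb = turning_function(poly_b, n_samples)
    best = np.inf
    for shift in range(n_samples):
        # shifting past the end of the period adds one full turn
        tb_shifted = np.concatenate([tb[shift:], tb[:shift] + 2.0 * np.pi])
        diff = ta - tb_shifted
        diff = diff - diff.mean()
        best = min(best, np.sqrt(np.mean(diff ** 2)))
    return best
```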
In our approach we essentially use the same metric, with some slight adaptations to the case of matching polylines rather than polygons. For one, note that the distance of a sufficiently short line segment l from any polyline A is zero, since l will perfectly match any (longer) line segment of A. We overcome this problem by enforcing a threshold on the number of segments of the database polylines; only polylines with a sufficient number of segments are considered. Also, we cannot simply scale the polylines to some unit length, because we want to be able to match portions of a polyline rather than matching a polyline in its entirety. For example, consider a polyline P and a longer polyline Q that contains P as a sub-polyline. Since P occurs in its entirety in Q, we want the latter to perfectly match the former. If we were to scale both polylines to some unit length, the (turning functions of the) polylines would no longer match very well. In other words, we cannot simply scale the two polylines prior to matching
to achieve scale invariance. At the moment we do not have a viable alternative to this, which implies that our implementation is not scale invariant. This could be overcome by either adapting the above metric to scale invariance or employing a different distance function, such as the one proposed by Cohen and Guibas [10]. Our matching algorithm runs in identical T = O(c1 c2 log c1 c2) time for two polylines, with c1, c2 as before. This brings the entire preprocessing step for n polylines with m vantage polylines to O(mnc² log c), where c is the maximum number of segments to a single polyline; this is O(mn) if we consider c a constant. (In the hieroglyphics database, c is about 400. While this may seem a relatively high number, the average is much lower: about fourteen.)
4.2 Choosing Vantage Objects
Another aspect of our approach is the choice of vantage objects. Ideally, the m vantage objects should differentiate the database objects as well as possible. This means that they should measure different ‘properties’ of the objects; if the vantage objects are not very different from one another, the distance from an object to each vantage object is similar, and little information is gained by adding the extra vantage objects. A possible way to ensure that each vantage object adds relevant discriminating power is by choosing the vantage objects such that they lie maximally far apart. Computing the k mutually furthest points among a set, however, is an exponential problem; we resort to the following heuristic algorithm instead. First, choose an initial vantage object (a polyline, in our case) A∗1 at random. Next, repeatedly pick the next vantage object A∗i by maximizing the minimum distance to the previously chosen vantage objects A∗1, . . . , A∗i−1. While the vantage objects thus chosen will probably not be maximally far apart, they will at least constitute a set that is spread out well. As for the number of vantage polylines we should choose, there is a clear tradeoff: fewer polylines will return more non-relevant polylines as answers to a query, whereas more polylines increase the dimension of the search space and, therefore, the time required to answer a query. Some comparisons for both different values of m and different vantage objects are given in Section 5.
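The greedy selection heuristic can be sketched in a few lines; the index-based bookkeeping is an implementation choice made for this sketch, not taken from the paper.

```python
import random

def choose_vantage_objects(objects, m, distance, seed=None):
    """Greedy farthest-first heuristic: start from a random object, then keep
    adding the object whose minimum distance to the already chosen vantage
    objects is largest."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(objects))]
    while len(chosen) < m:
        candidates = [i for i in range(len(objects)) if i not in chosen]
        best = max(candidates,
                   key=lambda i: min(distance(objects[i], objects[j])
                                     for j in chosen))
        chosen.append(best)
    return [objects[i] for i in chosen]
```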
4.3 Retrieving Nearest Neighbors
The problem of determining nearest neighbors among a set of points is well studied in the field of Computational Geometry, and several efficient solutions exist [5,9,14]. Of particular interest is the approximate-nearest-neighbor algorithm due to Arya et al. [4], who show that the nearest-neighbor problem can be solved particularly efficiently if we weaken the problem formulation to determining approximate nearest neighbors to the query point. For any query point q ∈ Rᵈ and constant ε > 0, a point p ∈ P is a (1 + ε)-approximate nearest neighbor of q if, for all p′ ∈ P, we have d(p, q) ≤ (1 + ε) · d(p′, q). A set of n points in Rᵈ can be preprocessed in O(n log n) time and O(n) space, such that given a query point q ∈ Rᵈ, a constant ε > 0, and a constant integer k ≥ 1, we can compute (1 + ε)-approximations to the k nearest neighbors of q in O(k log n) time. Besides
being theoretically optimal in the worst case, their approach has been observed to be very efficient in practice, even for ε = 0. For m vantage polylines, a single query among n polylines thus takes O(mT + k log n) time, where k is the number of polylines returned, and T is again the time required to compare any two given polylines. With the distance function described previously, this is O(mc² log c + k log n) = O(m + k log n), where c is the (constant) maximum number of segments to a polyline. As a postprocessing step, we explicitly measure the similarity between the query polyline and each of the k polylines returned by the algorithm; this takes an additional O(kc² log c) = O(k) time, and thus does not influence the given bound. This postprocessing step has been omitted in the experimental results presented here, since our main goal is to demonstrate the usefulness of the vantage-object structure by itself, even without further postprocessing of the query results.
4.4 Matching Entire Hieroglyphics
So far, our algorithm description has focused on matching single polylines. The hieroglyphics in our database, however, consist of several (typically about ten) separate polylines. Therefore, we need to devise a strategy to combine the results for multiple polylines into a single result for each hieroglyphic. Given a query hieroglyphic consisting of h polylines, we perform h separate queries, one for each polyline. The polylines returned by these queries are grouped according to the hieroglyphic they are part of; this gives a certain number of matching polylines for each hieroglyphic. Next, the hieroglyphics should be presented to the user in a way that reflects their resemblance to the query hieroglyphic. The most effective way would be to explicitly measure their distance from the query hieroglyphic. As mentioned before, for the purpose of this paper we refrain from doing so, because we want to demonstrate that our vantage-object retrieval algorithm already provides a relevant ordering of the query results. Instead, we sort the matched hieroglyphics using the following simple heuristics. The more polylines of a given hieroglyphic match closely with a polyline of the query hieroglyphic, the better we consider the hieroglyphic to match the query. We sort hieroglyphics with an identical number of matching polylines according to increasing maximum distance of a matched polyline from the corresponding query polyline. Note that it is not a priori clear how many nearest neighbors we want to retrieve for each query point, since the grouping by hieroglyphic is not done until after the retrieval step. We chose to use a threshold on the maximum distance of each vantage-space point from the query point; the polylines for which this distance does not exceed the threshold are returned as answers to the queries.
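The grouping and ranking heuristics described here could be implemented roughly as follows; the data layout (per-polyline hit lists and a polyline-to-hieroglyphic map) is an assumption made for illustration, not the authors' own code.

```python
from collections import defaultdict

def rank_hieroglyphics(per_polyline_hits, polyline_to_glyph):
    """per_polyline_hits: one list per query polyline, each containing
    (database_polyline_id, distance) pairs that passed the distance threshold.
    Hieroglyphics are ranked by (1) the number of distinct matched polylines,
    descending, and (2) the maximum distance among those matches, ascending."""
    matched = defaultdict(dict)          # glyph id -> {polyline id: best distance}
    for hits in per_polyline_hits:
        for polyline_id, dist in hits:
            glyph = polyline_to_glyph[polyline_id]
            previous = matched[glyph].get(polyline_id, float('inf'))
            matched[glyph][polyline_id] = min(dist, previous)
    ranking = sorted(matched.items(),
                     key=lambda item: (-len(item[1]), max(item[1].values())))
    return [glyph for glyph, _ in ranking]
```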
5
Experimental Results
We implemented the algorithm described in the previous sections, using an available implementation of the search algorithm due to Arya et al. [4], on a 400 MHz Pentium II-based workstation.
Table 1. Three example queries together with their 30 best matches.
Some typical queries for m = 6 vantage objects, along with the hieroglyphics returned, are shown in Table 1. For these queries, we set the minimum number of segments that a polyline should consist of to three; only polylines that define at most a single angle are ignored in the matching process. Rather than using a distance threshold as described in the previous section, we here show the 30 best matches to each query. The three queries increase in detail. The first one consists of a simple shape that recurs in numerous hieroglyphics; all query results contain an exact copy of the shape (more or less exact, as the hieroglyphics have been digitized by hand). The second query is more specific, and as a result we get some matches that contain an exact copy of the query bird, followed by a number of hieroglyphics that contain similarly-shaped birds. The last query shown contains a shape that is unique within the database, and the query results resemble the query object to varying degrees. (Note that the intuitive resemblance seems to decrease gradually with the ranking.) As mentioned, the bulk of the computations is performed during the preprocessing steps. For each polyline in the database we need to compute the corresponding vantage-space point, which involves computing its distance from each of the vantage polylines; this takes about twenty-three minutes per vantage polyline. In case we want to add polylines to the database, this computation need only be performed for the newly added polylines. The time required for building the nearest-neighbor data structure depends on the number m of vantage objects, but is, for example, only slightly more than three seconds for m = 10. This step has to be performed only when the image database changes. Finally, the query time
is low due to the intensive preprocessing steps; for example, for 1000 randomly selected query polylines and m = 6 vantage objects, retrieving the 100 nearest neighbors takes 5.4 ms on average. For comparison purposes, we also implemented a trivial algorithm using the same implementation of the matching algorithm. To find the database polylines similar to a given query polyline, we compute the distance of each polyline in the collection from the query and take the best-matching one. Answering a query for a single polyline this way is identical to computing the distance from a vantage polyline, and therefore also takes somewhat over twenty minutes. To match an entire query hieroglyphic, this time should be multiplied by the number of polylines the hieroglyphic consists of. Due to space limitations, we are unable to present detailed results for different numbers of vantage objects. The overall tendency is that the query time does not increase significantly with an increasing number of vantage objects; for example, the average time of 5.4 ms for m = 6 mentioned above becomes 1.8 ms for m = 2. The number of results returned, however, quickly grows if we use fewer vantage objects: for m = 6 and a distance threshold of 0.1, the average number of polylines returned was 42; for m = 2 it was 1039. Also, the relevance of the query results appears to be much better if we increase the number of vantage objects.
6
Conclusions
We presented an indexing structure for general image retrieval that relies solely on a distance function giving the similarity between two images, and demonstrated that it can be used to efficiently determine shape-based image similarity. Although the viability of our approach was demonstrated only for a single specific case (retrieving hieroglyphic images), its potential is far more general. In fact, our indexing structure applies to any set of images for which a similarity distance function can be defined, which includes raster images as well as vector images. Since almost all distance calculations are performed during an offline preprocessing step, an advantage of our approach is that it does not rely on the distance function being cheap and/or easy to compute; a query requires only a small number (typically ten or less) of distance calculations. Therefore, if sufficient preprocessing time is available, one can use an elaborate (and possibly expensive) distance function with desirable properties such as robustness against noise, blurring, cracks, and deformations [11].
References

1. The extended library. Centre for Computer-Aided Egyptological Research, Faculty of Theology, Utrecht University, Utrecht, the Netherlands. http://www.ccer.theo.uu.nl/ccer/extlib.html. 576, 579
2. Edoardo Ardizzone, Marco La Cascia, Vito Di Gesù, and Cesare Valentie. Content based indexing of image and video databases by global and shape features. In Proc. Int. Conf. Pattern Recognition, 1996. 575
3. Esther M. Arkin, L. P. Chew, D. P. Huttenlocher, K. Kedem, and Joseph S. B. Mitchell. An efficiently computable metric for comparing polygonal shapes. IEEE Trans. Pattern Anal. Mach. Intell., 13(3):209–216, 1991. 579
4. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. In Proc. 5th ACM-SIAM Sympos. Discrete Algorithms, pages 573–582, 1994. An implementation is available from http://www.cs.umd.edu/~mount/ANN. 576, 578, 580, 581
5. J. L. Bentley. K-d trees for semidynamic point sets. In Proc. 6th Annu. ACM Sympos. Comput. Geom., pages 187–197, 1990. 576, 578, 580
6. S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for higher dimensional data. In Proc. 22nd VLDB Conference, pages 28–39, 1996. 576
7. M. La Cascia and E. Ardizzone. JACOB: Just a content-based query system for video databases. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1996. 575
8. Bernard Chazelle and Emo Welzl. Quasi-optimal range searching in spaces of finite VC-dimension. Discrete Comput. Geom., 4:467–489, 1989. 576
9. K. L. Clarkson. Nearest neighbor queries in metric spaces. In Proc. 29th Annu. ACM Sympos. Theory Comput., pages 609–617, 1997. 576, 578, 580
10. S. D. Cohen and Leonidas J. Guibas. Partial matching of planar polylines under similarity transformations. In Proc. 8th ACM-SIAM Sympos. Discrete Algorithms, pages 777–786, January 1997. 580
11. M. Hagedoorn and R. C. Veltkamp. Measuring resemblance of complex patterns. In Proc. Int. Conf. Discrete Geom. Comput. Imagery, 1999. 583
12. Norio Katayama and Shin'ichi Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In SIGMOD '97, pages 369–380, 1997. 576
13. P. M. Kelly, T. M. Cannon, and D. R. Hush. Query by image example: the CANDID approach. In Proc. SPIE: Storage and Retrieval for Image and Video Databases III, volume 2420, pages 238–248, 1995. 575
14. J. Kleinberg. Two algorithms for nearest-neighbor search in high dimension. In Proc. 29th Annu. ACM Sympos. Theory Comput., pages 599–608, 1997. 576, 578, 580
15. K. I. Lin, H. V. Jagdish, and C. Faloutsos. The TV-tree: An index structure for higher dimensional data. VLDB Journal, 4:517–542, 1994. 576
16. J. Matoušek. Efficient partition trees. Discrete Comput. Geom., 8:315–334, 1992. 576
17. J. Matoušek. Range searching with efficient hierarchical cuttings. Discrete Comput. Geom., 10(2):157–182, 1993. 576
18. Rajiv Mehrotra and James E. Gary. Similar-shape retrieval in shape data management. IEEE Computer, 28:57–62, 1995. 575
19. W. Niblack, R. Barber, W. Equitz, M. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. The QBIC project: Querying images by content using color, texture and shape. Storage Retrieval Image Video Databases, 1908:173–187, 1993. 575
20. Virginia E. Ogle and Michael Stonebraker. Chabot: Retrieval from a relational database of images. IEEE Computer, 28:40–48, 1995. 575
21. A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of image databases. In Proc. SPIE: Storage and Retrieval for Image and Video Databases II, volume 2185, pages 34–47, 1994. 575
22. Otfried Schwarzkopf and Jules Vleugels. Range searching in low-density environments. Inform. Process. Lett., 60:121–127, 1996. 577
23. Jeffrey K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Inform. Process. Lett., 40:175–179, 1991. 576
24. D. A. White and R. Jain. Similarity indexing with the SS-tree. In Proc. 12th IEEE Internat. Conf. Data Engineering, pages 516–523, 1996. 576
25. H. J. Wolfson. Model-based object recognition by geometric hashing. In Proc. 1st Europ. Conf. Comp. Vision, pages 526–536, 1990. 576
26. P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. 4th ACM-SIAM Sympos. Discrete Algorithms, pages 311–321, 1993. 576
Using Pen-Based Outlines for Object-Based Annotation and Image-Based Queries

Lambert Schomaker1, Edward de Leau2, and Louis Vuurpijl1
Nijmegen Institute for Cognition and Information (NICI), P.O. Box 9104, 6500 HE, Nijmegen, The Netherlands
{schomaker,vuurpijl}@nici.kun.nl
[email protected]
http://hwr.nici.kun.nl/
Abstract. A method for image-based queries and search is proposed which is based on the generation of object outlines in images by using the pen, e.g., on color pen computers. The rationale of the approach is based on a survey on user needs, as well as on considerations from the point of view of pattern recognition and machine learning. By exploiting the actual presence of the human users with their perceptual-motor abilities and by storing textually annotated queries, an incrementally learning image retrieval system can be developed. As an initial test domain, sets of photographs of motor bicycles were used. Classification performances are given for outline and bitmap-derived feature sets, based on nearest-neighbour matching, with promising results. The benefit of the approach will be a user-based multimodal annotation of an image database, yielding a gradual improvement in precision and recall over time.
1
Introduction
In the search for image material in large databases, a number of query methods can be used, varying from keyword-based queries in textually annotated image databases to example-based and feature-based pictorial queries using the image content of the individual pictures (Table 1). On the world-wide web (WWW), various experimental approaches are already available [9], from which some lessons can be drawn. Textual methods (Table 1) based on keyword queries (A) to find images are potentially very powerful. However, a textual annotation of images produced by a single content provider, although already very costly by itself, usually does not cover a sufficient number of views or perspectives on the same pictorial material. Most importantly, however, the deixis problem is not solved: What is the exact location and area of a mentioned object in the actual image? This information may be relevant to the user, but is in any case extremely relevant to pattern classifiers in a machine-learning architecture. In text-based context search (B), the query method also consists of keywords, but the annotation is automatically derived from the context of the image within the document [8,6]. This approach may be brittle, because only in some
domains (e.g., science, journalism) there exists something like a grammar for the juxtaposition of text and image which allows for finding meaningful correspondences between textual and pictorial content.
Table 1. Types of queries and matching methods in image-based search

Query                                           Matched with:                                                    Matching algorithm
A. textual (keywords)                           manually provided textual image annotations                      free text and information-retrieval (IR) methods
B. textual (keywords)                           textual and contextual information in the image neighbourhood    free text and IR methods
C. exemplar image                               image bitmap                                                     template matching or feature-based schemes
D. layout structure (e.g., colored rectangles)  image bitmap                                                     texture and color segmentation
E. object outline                               image bitmap, contours or outlines                               feature-based schemes
F. object sketch                                image bitmap                                                     feature-based schemes
The example-based matching methods (C) are usually disappointing for the user because the underlying goal of their query is not to look for "similar looking pictures", but for pictures "with similar object content". A fourth query method (D) consists of layout specification, for instance by means of placing variable-sized rectangles of different color and/or texture on a blank image, representing the query. Category (E), queries based on an object outline, will be the focus of this paper. An outline is defined as a closed figure, drawn by the user around an object on a photograph by means of a pointing device (mouse or pen). A more difficult form of query is represented by (F), where the user is allowed to produce a free sketch of a figure which represents the visual query [4]. Combinations of the above query methods (A) to (F) can be used in a real application. The majority of the currently proposed methods are strongly characterized by a 'technology push'. However, it seems reasonable to derive some constraints from actual user demands before embarking on any development of pattern recognition algorithms. Basic questions are:
– are the users able to produce the queries?
– do the users like the query method?
– what level of classification performance will be acceptable to the user?
– is the system able to explain why a given match has been found?
It is essential to exploit all known constraints, given the difficulties in content-based image retrieval. This means that user-context information such as user goals ("What is the user going to do with retrieved images?"), and system-related constraints (concerning, e.g., the user's software and hardware platform, bandwidth, etc.) should be used to improve the adequacy of the system response. In an on-line WWW survey on image-based search the following user responses were given (Table 2). From this table it can be inferred that users were less
Table 2. User goals in image-based queries: objects, features or textures? Numbers represent the frequency of user responses in this on-line WWW survey (NA = no answer given). Respondents (N=170) were asked to tell a few things related to recent image-based search actions they executed on the WWW.

Question "Did you need an image ..."        Yes    No   NA
"...with a particular object on it?"        114    49    7
"...with a particular color on it?"          25   137    8
"...with a particular texture on it?"        23   137   10
interested in color or texture, but reported needing images containing objects or other content. From this survey, it becomes apparent that users report mainly looking for an object in the image (114/170) and are much less interested in detailed image properties. Furthermore, photographic images seem to be more important than other types of graphic material: it was found that 68% of the given responses concerned photographs (survey data were provided by MSc student Arie Baris). These findings indicate that a successful image-based search method should be developed for photographic material, with a focus on object-recognition methods. However, the seemingly effortless foreground/background segmentation in human visual perception is difficult to realize with current image processing and classification methods. Only under idealized conditions will bottom-up object segmentation based on edge detection and region analysis be possible.
2
Design Considerations
The proposed method is based on a number of ideas, aimed at improving the classification performance and the usability of image-based search methods. The following sections briefly describe the design considerations.

2.1 Focus on object-based representations and queries
Picard [5] makes a distinction between (a) subject, object and action search, (b) syntactic search (layout of images, breakpoints in video streams), (c) mood-related search, and the category (d) "I know what I'm looking for when I see it". Although looking for objects in images ("all horses", "mandolins") does not cover all possible forms of image needs in users, it is probably a very common user goal in an image-retrieval context, as evidenced by our survey as well. Here, we define object as referring both to inanimate and animate objects in images.

2.2 Focus on photographic images with identifiable objects for which a verbal description can be given
Given the user preference for photographic images, it seems useful to focus the image-based retrieval efforts on this category. Since the purpose of the annotation concept presented in this paper is to bootstrap new image-retrieval methods which utilize both textual and pictorial query components, it is essential that a textual description of the object-based image query be added by the user.
2.3 Exploit the presence of human perceptual abilities in the user
In other areas of pattern recognition, the definition of particular classes may be relatively easy, apart from a manageable number of ambiguous cases, such as in speech or handwriting recognition. Moreover, in these fields, well-known databases [1,3] exist which contain truth labels of input patterns. The number of classes is typically limited to a few hundred unique patterns (i.e., characters, phonemes, visemes or words). In content-based image retrieval within an open domain, the number of classes is much larger, and there are not many public databases which are useful for the training of classification algorithms. Human assistance in multimedial annotation is essential in order to obtain a 'bootstrap collection' of object-based samples.

2.4 Exploit human fine motor control
The idea is to ask the user to produce queries by drawing an outline which encloses an object in a given image. The color of the outline should be contrastive with respect to the image content. Coordinates (xt, yt) are recorded. The users are asked to follow the intended object boundaries with some precision. The outline should be a closed shape, and the user may have to guess its path at points of occlusion with other objects in the image. The computer mouse can be used but its accuracy and resolution are limited. A better solution is the electronic pen, in combination with 'electronic paper': an integrated digitizer and LCD screen. A problem may be the availability of a 'seed' image containing the object sought for. The solution is (a) to base the initial search on keywords, using a found image for further search, or (b) to draw an outline by heart.

2.5 Allow for incremental annotation of image material
As in text-based information retrieval, almost every query can be considered as a valuable piece of condensed information which is based on genuine user goals and the user's understanding of the real world. This is especially true for multimedial queries in the form of object outlines. By asking the user to textually annotate the object outline of the query by using the keyboard, speech, or even handwriting recognition, a growing database of object outlines is formed, which is essential for the training of the image classification algorithms. There are other advantages of the proposed approach. The new standards MPEG-4 and MPEG-7 allow for an object-based image description. However, since object segmentation is difficult in an open image domain, there is a bottleneck at the point of creation of the object-annotated multimedia content. Pen-based techniques may be developed in order to alleviate this problem.

2.6 Start with a limited content domain to evaluate these concepts
Although the goals are high, we will constrain the image content domain first, to see whether the concept is fruitful. Images from a technological context have the advantage that a large number of objects and object components can be identified, for which names do exist. The topic chosen here concerns a set of 200 mixed JPEG and GIF photographs of motor bicycles. Within this set, 750 outlines were drawn around image parts in the following classes: exhaust, wheels, engine, frame, pedal, fuel tank, saddle, driver, mirror, license plate, bodyworks, head light, fuel tank lid, light, rear light, totalling 15 object classes with 50 different outline samples of each object (Figure 1).
Fig. 1. A query to find an engine (l) and a few outlines of ”frames” (r)
3
Method
In our approach, a number of query scenarios can be envisaged. For this exploratory study, the image-based queries were performed within two representations: (a) matching the query outline (xk, yk) with all outlines which are present in the database, and (b) matching the image I(x, y) content within the outline (xk, yk) with existing templates in the database. Simple 1-NN matching will be used for both feature categories.

3.1 Features in Outline-Pattern Matching
Based on handwriting recognition research - more specifically the recognition of isolated on-line handwritten characters [7] - we have developed the following feature set for outlines, to be used in image-based queries. Figure 2 shows the used outline representation. A given closed raw contour (Xi, Yi) with points i = [0, Nr − 1], which is derived from time-based measurements from a pointing device (preferably a pen), is resampled to a fixed and sufficient number of samples, yielding (xk, yk) with points k = [0, Ns − 1]. Given the limited bandwidth of the human motor system and the limited amount of time the user will invest in the production of an outline in a dynamical querying condition, the number of curvature maxima along the outline will be limited. Here we have chosen Ns = 100, which would coincide with about ten curvature peaks (→ ten ballistic strokes of 100 ms) in real-time drawing behavior sampled at 100 points/second. This puts a soft upper limit on the complexity of contours. However, it is not necessary to produce, e.g., a meticulously shaped outline of a tree with all its leaves and tiny branches: a global approximation already contains useful information. The center of gravity (µx, µy) is translated to (0, 0), yielding (xk, yk). Then the standard deviation of all radii rk = √(xk² + yk²) is calculated, yielding the rms radius σr. Finally, the outline is normalized to a radius σr of 1 by x̂k = xk/σr and ŷk = yk/σr. The normalized outline (x̂k, ŷk) can then be used for scale- and translation-invariant matching.
Fig. 2. Features in outline pattern matching. The center of gravity is translated to (0, 0), the size is normalized to an rms radius (σr) of one. From the starting point B, the matching process will try both clockwise and counter-clockwise directions, retaining the best result of both match variants. Other normalizations such as left/right or up/down mirroring are optional.

The resulting feature vector (Sl), containing all normalized x and y values, can be compared with any other feature vector (Sm) in a database by using the average squared Euclidean distance for simple nearest-neighbour matching:
∆S = (1/Ns) Σ_{k=0}^{Ns} (Smk − Slk)²     (1)
However, this feature vector will not be sufficient for accurate matching. In particular, what is missed are the curvature details along the curve. For this reason, a second feature vector A is defined, containing the running angle along the outline as (cos(φ), sin(φ)) (Figure 2). This feature vector contains more information about local changes in direction. Also for this feature vector, the distances between unknown and known outlines can be calculated, yielding ∆A, similar to eq. 1. A third feature vector (P) consists of the histogram of angles in the contour, i.e., the probability distribution p(φ), φ being bounded from −π/2 to +π/2. Also for P, the distances between a query and a template can be calculated, yielding ∆P. The matching process entails three further provisions to implement invariance: (1) starting point (B, Fig. 2), (2) tracing order, and (3) horizontal mirroring normalization. Each outline query is matched with a sample outline, at a number of starting points along the curve, in a clockwise and counterclockwise fashion, both for a normal and a horizontally mirrored version. The best match, i.e., the one with the lowest distance, is kept in the hit list.
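A sketch of the resampling, normalization and invariant matching described above is given below, assuming outlines are stored as N×2 coordinate arrays; it follows eq. (1) in spirit but is not the authors' implementation.

```python
import numpy as np

def resample_outline(outline, n_samples=100):
    """Resample a closed outline ((N, 2) array of points) to n_samples points,
    equally spaced along the arc length."""
    pts = np.vstack([outline, outline[:1]])
    seg = np.diff(pts, axis=0)
    arc = np.concatenate([[0.0], np.cumsum(np.hypot(seg[:, 0], seg[:, 1]))])
    s = np.linspace(0.0, arc[-1], n_samples, endpoint=False)
    return np.column_stack([np.interp(s, arc, pts[:, 0]),
                            np.interp(s, arc, pts[:, 1])])

def normalise_outline(outline):
    """Translate the centre of gravity to (0, 0) and scale to rms radius 1."""
    centred = outline - outline.mean(axis=0)
    sigma_r = np.sqrt(np.mean(np.sum(centred ** 2, axis=1)))
    return centred / sigma_r

def outline_distance(sample, query):
    """Eq. (1), evaluated for every starting point, both tracing directions
    and an optional horizontal mirroring of the query; the smallest distance
    over all variants is returned."""
    variants = [query, query[::-1], query * [-1, 1], (query * [-1, 1])[::-1]]
    best = np.inf
    for variant in variants:
        for shift in range(len(sample)):
            rolled = np.roll(variant, shift, axis=0)
            best = min(best, np.mean(np.sum((sample - rolled) ** 2, axis=1)))
    return best

# Both outlines are assumed to have been resampled and normalised first:
# d = outline_distance(normalise_outline(resample_outline(a)),
#                      normalise_outline(resample_outline(b)))
```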
3.2 Within-Outline Image Bitmap Features
The following 68 features are calculated from the pixels within the closed object outline and are intended to capture color, intensity, and texture:
– color centroids: the center of gravity for each of the RGB channels. This gives 6 features: R(x,y), G(x,y) and B(x,y).
– color histogram: the histogram of the occurrence of 8 main colors: black, blue, green, cyan, red, magenta, yellow and white.
– intensity histogram: a histogram for 10 levels of pixel intensity.
– RGB statistics: the minimum and maximum values of each of the RGB channels, and their average and standard deviation (12 features).
– texture descriptors: a table of five textures was used, with five statistical features each (25 features).
– invariant moments: seven statistical high-order moments [2] which are invariant to size and rotation.
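A partial sketch of such within-outline features (colour centroids, intensity histogram and RGB statistics only; texture descriptors and invariant moments are omitted) is given below. It assumes an RGB image array and a boolean mask derived from the outline, and the exact feature definitions are illustrative rather than the authors' own.

```python
import numpy as np

def within_outline_features(image, mask):
    """image: (H, W, 3) uint8 RGB array; mask: (H, W) boolean array that is
    True for pixels inside the closed outline.  Returns colour centroids,
    a 10-bin intensity histogram and per-channel RGB statistics."""
    ys, xs = np.nonzero(mask)
    pixels = image[mask].astype(float)                  # (n_pixels, 3)
    features = []
    # colour centroids: channel-weighted centre of gravity, 2 values per channel
    for c in range(3):
        w = pixels[:, c]
        total = float(w.sum()) or 1.0
        features += [float((xs * w).sum() / total), float((ys * w).sum() / total)]
    # intensity histogram with 10 levels
    intensity = pixels.mean(axis=1)
    hist, _ = np.histogram(intensity, bins=10, range=(0, 255))
    features += (hist / max(hist.sum(), 1)).tolist()
    # per-channel minimum, maximum, mean and standard deviation (12 values)
    for c in range(3):
        channel = pixels[:, c]
        features += [channel.min(), channel.max(), channel.mean(), channel.std()]
    return np.array(features)
```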
4
Results
Table 3 (columns I-III) shows the results for the normalized coordinates (x̂, ŷ), the running angle (cos φ, sin φ) and the histogram of directions p(φ). Results are expressed as the average percentage of hits in a top-10 hit list. The outline coordinates (x̂, ŷ) perform best. The four low-performance classes at the bottom of the table are the almost circular shapes without much difference in the outlines (the different lights and the fuel tank lid). The histogram of angles (column III) yields mediocre results. The rightmost column (IV) shows the results for the feature vector calculated from the image content within the outline curve. A selected subset of 29 out of 68 features was used, based on stochastic optimisation. It can be observed that even after such optimisation, the within-outline image content is somewhat less reliable as a basis for matching than the outline coordinates and running angle, i.e., for the typical objects (motor-bike components) in this database. Partly this is due to trivial factors such as color; partly these results may be caused by the fact that some of the image-based features, such as the invariant moments, are of a rather global nature. More detailed analysis of precision vs recall (from P1 to P50) revealed no conflicting results.
Table 3. Classification performance of outline matching. Results are represented as the average percentage of correct hits in the top-10 hit list (P10), averaged over n = 50 outline instances per class, of which each was used as a probe in nearest-neighbour matching. The query itself was excluded from the matching process. The total number of patterns is 750. Results for three groups of outline features and a group of image-based features are presented.

Query            I. P10 (%)   II. P10 (%)      III. P10 (%)   IV. P10 (%)
                 (x̂, ŷ)       (cos φ, sin φ)   p(φ)           image-based
wheels              77.6          81.8             36.0           58.2
exhaust             75.4          79.4             34.0           34.6
engine              57.0          51.4             31.6           49.6
frame               52.0          33.8             38.8           69.4
pedal               47.4          47.2             22.8           33.0
driver              43.6          43.4             20.2           50.2
saddle              41.4          39.2             15.0           20.2
fuel tank           41.4          43.2             23.2           22.8
mirror              40.6          39.8             11.2           22.4
license plate       36.0          47.8             30.2           21.8
bodywork            31.0          26.6             14.4           22.4
head light          30.6          38.2             13.2           30.4
fuel tank lid       29.6          35.8             25.8           23.4
light               21.6          19.4             11.0           27.4
rear light          14.8          14.8              9.0           33.0

5
Discussion
Other matching procedures currently under study are kNN and nearest-centroid variants. A well-known property of image-based retrieval is the fact that the set of classes varies over time. For this reason, a neural-network solution for the problem as a whole is not suitable. However, for particular sub-domain problems like face detection and recognition, specialized neural-network classifiers may be more attractive than simple distance-based schemes. One of the reasons why this simple scheme without localized affine normalizations and/or perspective transform yields reasonable results may be the fact that photographers generate a limited number of 'canonical views' of objects, according to perceptual and artistic rules. Therefore, the number and range of camera attitudes towards the object, in this case a motorcycle, is limited. A related issue is the variation in object query shape for a given object. Current research in our group is focused on both the spatial variability of outline queries and their relation to the raw object edges, as well as the textual variations produced in the object annotation by the users. Using the proposed system concept for the collection of a large number of object-based outlines, a training set will be created for the development of autonomous object classification. The availability of a large outline base for the training of an object-based image retrieval system at the level of both preprocessing (i.e., knowledge-based edge detection) and classification may ultimately result in considerable improvements in automatic object recognition in this application area.
References
1. Guyon, I., Schomaker, L., Plamondon, R., Liberman, R. and Janet, S.: Unipen project of on-line data exchange and recognizer benchmarks. Proceedings of the 12th International Conference on Pattern Recognition, ICPR'94, Jerusalem, Israel. IAPR-IEEE (1994) 29–33
2. Hu, M-K.: Visual Pattern Recognition by Moment Invariants. IRE Transactions on Information Theory IT-8 (1962) 179–187
3. Lamel, L.F., Kasel, R.H. and Seneff, S.: Speech database development: Design and analysis of the acoustic-phonetic corpus. Proceedings of the DARPA Speech Recognition Workshop (1987) 26–32
4. Lopresti, D., Tomkins, A. and Zhou, J.: Algorithms for matching hand-drawn sketches. In: Downton, A.C. & Impedovo, S. (eds.): Progress in Handwriting Recognition. London: World Scientific (1997) 69–74
5. Picard, R.W.: Light-years from Lena: Video and Image Libraries of the Future. Proceedings of the International Conference on Image Processing (ICIP), Oct '95, Washington DC, USA. Vol. I (1995) 310–313
6. Rowe, N.C. and Frew, B.: Automatic caption localization for photographs on world wide web pages. Information Processing & Management 34(1) (1998) 95–107
7. Schomaker, L.R.B.: Using stroke- or character-based self-organizing maps in the recognition of on-line, connected cursive script. Pattern Recognition 26(3) (1993) 443–450
8. Srihari, R.K.: Visually searching the Web for content. IEEE Computer 28(9) (1995) 49–56
9. WWW reference page to image-based retrieval methods: http://hwr.nici.kun.nl/~profile/ibir/
Interactive Query Formulation for Object Search

Theo Gevers and Arnold W.M. Smeulders

ISIS, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{gevers,smeulders}@wins.uva.nl

Abstract. Snakes provide high-level information in the form of continuity constraints and minimum-energy constraints related to the contour shape and image features. In this paper, we aim at using color invariant gradient information as image features to guide the deformation process. We focus on using color snakes for interactive image segmentation for the purpose of content-based image retrieval. The key idea is to select appropriate subimages of objects (instead of the entire image) on which the image object search will be conducted. After image segmentation, to achieve accurate image object search, weights are assigned to the image features of the selected subimages in accordance with their importance, i.e. having high feature frequencies but low overall collection frequencies. Experiments show that the proposed color invariant snake successfully finds object material contours while discounting other "accidental" edge types (e.g. shadow, shading and highlight transitions). Furthermore, experiments show that object search using subimages with weighted features yields high retrieval accuracy. The object search scheme is at http://www.wins.uva.nl/research/isis/zomax/.
1 Introduction
Active contour models can assist the process of image segmentation. The active contour method that has attracted the most attention is the snake [4]. In this paper, we focus on using snakes for interactive image segmentation for the purpose of content-based image retrieval by query-by-example. The key idea is to allow the user to specify in an interactive way salient subimages of objects on which the image object search will be based. In this way, confounding and misleading image information is discarded. It is well known that the RGB values obtained by a color camera will be negatively affected (color values will shift in RGB-color space) by the image-forming process [2,3]. Therefore, in this paper, a snake-based image segmentation method is proposed on the basis of physics considerations, leading to color image features which are robust to the imaging conditions. The paper is organized as follows. In Section 2, we review the properties and behavior of snakes. In Section 3, we propose robust color invariant gradients on which the color snake is based. Experiments are conducted with the color snake on various color images in Section 4. Subimage object search is discussed in Section 5.
2 Active Contour Methods: Snakes
An active contour is a deformable curve

v(t) = [x(t), y(t)],  t ∈ [0, 1]   (1)

that moves through the spatial domain of an image I to minimize an energy functional E. The energy E associated with the curve is a weighted sum of internal and external energies:

E = α Eint + β Eext   (2)

where α and β are appropriate weights. The internal energy of a snake measures the desired properties of the contour's shape. In order to obtain smooth and physically feasible results, an elasticity and a smoothness constraint are defined in the internal energy as follows:

Eint = ( ∫t ( ||v'(t)||^2 + ||v''(t)||^2 ) dt ) ( ∫t ||v'(t)|| dt )   (3)

where v'(t) and v''(t) denote the first and second derivatives of the curve with respect to t, and are measures for the elasticity and smoothness, respectively. The external energy is derived from the image in such a way that the snake is attracted to certain image features. In most snake-type techniques the intensity gradient is considered as the primary image feature, leading to the following external term:

Eext = − ∫t ||∇I(x, y)|| dt   (4)

where the gradient image ||∇I(x, y)|| is usually derived from the intensity image through Gaussian derivatives. Whereas the traditional active contour methods use image intensity discontinuities to guide the deformation process, we focus on using color invariant gradient information instead.
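For a digital image, the external energy field of eq. 4 can be precomputed once from Gaussian-derivative gradients; the snake then samples this field along the contour. The sketch below assumes a grey-value image and a hypothetical smoothing scale sigma; it is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def external_energy(image, sigma=2.0):
    # E_ext(x, y) = -||grad I(x, y)||, gradient taken through Gaussian derivatives
    img = image.astype(float)
    gy = ndimage.gaussian_filter(img, sigma, order=(1, 0))   # derivative along rows (y)
    gx = ndimage.gaussian_filter(img, sigma, order=(0, 1))   # derivative along columns (x)
    return -np.hypot(gx, gy)
```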
3 Color Invariant Snakes
Let the color gradient be denoted by ∇C; the color-based external energy term is then

Eext = − ∫t ||∇C(x, y)|| dt   (5)
Our aim is to obtain robust color gradients ∇C discounting shadows, shading and highlights. In Section 3.1, we briefly discuss the color invariant models which we have recently proposed, see [1] for example. Then, various alternatives for ∇C are given in Section 3.2.
3.1 Color Invariant Models
Assuming dichromatic reflectance and white illumination, it is shown in [1] that normalized color rgb and the newly proposed color models c1c2c3 are invariant to a change in viewing direction, object geometry and illumination. In this paper, we focus on c1c2c3, defined as follows [1]:

c1 = arctan( R / max{G, B} ),  c2 = arctan( G / max{R, B} ),  c3 = arctan( B / max{R, G} )   (6)

Further, it is shown that l1l2l3 is also invariant to highlights, defined as follows [1]:

l1 = |R − G| / (|R − G| + |R − B| + |G − B|),
l2 = |R − B| / (|R − G| + |R − B| + |G − B|),
l3 = |G − B| / (|R − G| + |R − B| + |G − B|)   (7)
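A direct NumPy transcription of eqs. (6) and (7) is sketched below; the small epsilon added to the denominators to avoid division by zero is an implementation assumption, not part of the paper's definitions.

```python
import numpy as np

def c1c2c3(rgb):
    # eq. (6): each channel over the max of the other two, passed through arctan
    R, G, B = rgb[..., 0].astype(float), rgb[..., 1].astype(float), rgb[..., 2].astype(float)
    eps = 1e-9                                   # assumed guard against division by zero
    c1 = np.arctan(R / (np.maximum(G, B) + eps))
    c2 = np.arctan(G / (np.maximum(R, B) + eps))
    c3 = np.arctan(B / (np.maximum(R, G) + eps))
    return np.stack([c1, c2, c3], axis=-1)

def l1l2l3(rgb):
    # eq. (7): highlight-invariant models based on pairwise channel differences
    R, G, B = rgb[..., 0].astype(float), rgb[..., 1].astype(float), rgb[..., 2].astype(float)
    denom = np.abs(R - G) + np.abs(R - B) + np.abs(G - B) + 1e-9
    return np.stack([np.abs(R - G) / denom,
                     np.abs(R - B) / denom,
                     np.abs(G - B) / denom], axis=-1)
```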
3.2 Color Invariant Gradients
In the previous section, color models were discussed which are invariant under varying imaging conditions. According to eq. 5, we need to define the color gradient ∇C(x, y) for the various color models.

Gradients in multi-valued images. We follow the principled way to compute gradients in vector images as described by Silvano di Zenzo [5] and further used in [7], which is summarized as follows. Let Θ(x1, x2) : R² → R^m be an m-band image with components Θi(x1, x2) : R² → R for i = 1, 2, ..., m. For color images we have m = 3. Hence, at a given image location the image value is a vector in R^m. The difference at two nearby points P = (x1⁰, x2⁰) and Q = (x1¹, x2¹) is given by ∆Θ = Θ(P) − Θ(Q). Considering an infinitesimal displacement, the difference becomes the differential dΘ = Σ_{i=1}^{2} (∂Θ/∂xi) dxi and its squared norm is given by:

dΘ² = Σ_{i=1}^{2} Σ_{k=1}^{2} (∂Θ/∂xi · ∂Θ/∂xk) dxi dxk = Σ_{i=1}^{2} Σ_{k=1}^{2} gik dxi dxk = (dx1, dx2) [ g11 g12 ; g21 g22 ] (dx1, dx2)^T   (8)

where gik := ∂Θ/∂xi · ∂Θ/∂xk. The extrema of this quadratic form are obtained in the direction of the eigenvectors of the matrix [gik], and the values at these locations correspond with the eigenvalues, given by:

λ± = ( g11 + g22 ± sqrt( (g11 − g22)² + 4 g12² ) ) / 2   (9)

with corresponding eigenvectors given by (cos θ±, sin θ±), where θ+ = (1/2) arctan( 2 g12 / (g11 − g22) ) and θ− = θ+ + π/2.
Hence, the direction of the minimal and maximal changes at a given image location is expressed by the eigenvectors θ− and θ+ respectively, and the corresponding magnitude is given by the eigenvalues λ− and λ+ respectively. Note that λ− may be different from zero and that the strength of a multi-valued edge should be expressed by how λ+ compares to λ−, for example by the subtraction λ+ − λ− as proposed by [7], which will be used to define gradients in multi-valued color invariant images in the next section.

Gradients in multi-valued color invariant images. In this section, we propose color invariant gradients based on the multi-band approach described in the previous section. The color gradient for RGB is as follows:

∇C_RGB = λ+^RGB − λ−^RGB   (10)

for λ±^RGB = ( g11^RGB + g22^RGB ± sqrt( (g11^RGB − g22^RGB)² + 4 (g12^RGB)² ) ) / 2, where
g11^RGB = |∂R/∂x|² + |∂G/∂x|² + |∂B/∂x|²,
g22^RGB = |∂R/∂y|² + |∂G/∂y|² + |∂B/∂y|²,
g12^RGB = ∂R/∂x ∂R/∂y + ∂G/∂x ∂G/∂y + ∂B/∂x ∂B/∂y.

Further, we propose that the color invariant gradient (based on c1c2c3) for matte objects is given by:

∇C_c1c2c3 = λ+^c1c2c3 − λ−^c1c2c3   (11)

for λ±^c1c2c3 = ( g11^c1c2c3 + g22^c1c2c3 ± sqrt( (g11^c1c2c3 − g22^c1c2c3)² + 4 (g12^c1c2c3)² ) ) / 2, where
g11^c1c2c3 = |∂c1/∂x|² + |∂c2/∂x|² + |∂c3/∂x|²,
g22^c1c2c3 = |∂c1/∂y|² + |∂c2/∂y|² + |∂c3/∂y|²,
g12^c1c2c3 = ∂c1/∂x ∂c1/∂y + ∂c2/∂x ∂c2/∂y + ∂c3/∂x ∂c3/∂y.

Similarly, we propose that the color invariant gradient (based on l1l2l3) for shiny objects is given by:

∇C_l1l2l3 = λ+^l1l2l3 − λ−^l1l2l3   (12)

for λ±^l1l2l3 = ( g11^l1l2l3 + g22^l1l2l3 ± sqrt( (g11^l1l2l3 − g22^l1l2l3)² + 4 (g12^l1l2l3)² ) ) / 2, where
g11^l1l2l3 = |∂l1/∂x|² + |∂l2/∂x|² + |∂l3/∂x|²,
g22^l1l2l3 = |∂l1/∂y|² + |∂l2/∂y|² + |∂l3/∂y|²,
g12^l1l2l3 = ∂l1/∂x ∂l1/∂y + ∂l2/∂x ∂l2/∂y + ∂l3/∂x ∂l3/∂y.

To evaluate the performance of the color snake differentiated for the various color gradient fields, the methods are compared, in the next section, on color images taken from full-color objects in real-world scenes.
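The three gradient fields of eqs. (10)–(12) share the same structure: compute g11, g22 and g12 from per-band derivatives and return λ+ − λ−. A hedged sketch is given below; the Gaussian-derivative scale sigma is an assumed implementation detail, and the same function yields ∇C_RGB, ∇C_c1c2c3 or ∇C_l1l2l3 depending on whether the raw RGB bands or the invariant bands of Section 3.1 are passed in.

```python
import numpy as np
from scipy import ndimage

def multiband_edge_strength(bands, sigma=1.0):
    # bands: (H, W, m) multi-band image; returns lambda_plus - lambda_minus per pixel
    bands = bands.astype(float)
    g11 = g22 = g12 = 0.0
    for i in range(bands.shape[-1]):
        dy = ndimage.gaussian_filter(bands[..., i], sigma, order=(1, 0))
        dx = ndimage.gaussian_filter(bands[..., i], sigma, order=(0, 1))
        g11 = g11 + dx * dx
        g22 = g22 + dy * dy
        g12 = g12 + dx * dy
    root = np.sqrt((g11 - g22) ** 2 + 4.0 * g12 ** 2)
    lam_plus = 0.5 * (g11 + g22 + root)
    lam_minus = 0.5 * (g11 + g22 - root)
    return lam_plus - lam_minus     # the gradient field for the chosen color model
```

Used with the c1c2c3 or l1l2l3 transforms sketched earlier, this single routine produces all three gradient fields compared in the experiments.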
4 Experiments: Color Snakes
The objects considered during the experiments were recorded in 3 RGB-colors with the aid of the SONY XC-003P CCD color camera (3 chips) and the Matrox Magic Color frame grabber. The digitization was done in 8 bits per color. Two light sources of average day-light color were used to illuminate the objects in the scene. The size of the images is 128×128. In the experiments, the same weights have been used for the shape and image feature constraints.
Fig. 1 and Fig. 2. From top left to bottom right: a. Color image with ground-truth denoted by the white contour. b. The initial contour as specified by the user (the white contour). c. Snake segmentation result based on intensity gradient field ∇I. d. Result based on RGB gradient field ∇C_RGB. e. Result based on c1c2c3 gradient field ∇C_c1c2c3. f. Result based on l1l2l3 gradient field ∇C_l1l2l3.
Figure 1.a and 2.a show respectively an image of a matte cube and a shiny (red) ball against a homogeneous background. The ground-truth (the true boundary) is given by the white contour. The images are clearly contaminated by shadows, shading and highlights. Note that the cube and ball are painted homogeneously. Further, in Figure 1.b and 2.b the initial contours are shown as specified by the user (the white contour). As one can see, the snake segmentation results based on the intensity gradient, denoted by ∇I, and the RGB color gradient, denoted by ∇C_RGB, are negatively affected by shadows and shading due to the varying shape of the object. In fact, for these gradient fields it is not clear to which boundaries the snake contour should be pulled. As a consequence, the final contour is biased and poorly defined. In contrast, the final contours obtained by the snake method based on ∇C_c1c2c3 and ∇C_l1l2l3 gradient information are nicely pulled towards the true boundary and hence correspond neatly with the material transition. Figure 3 shows images containing an air-balloon and a lion. These images come from the Corel Stock Photo Libraries. Again, the snake segmentation results based on intensity I and color RGB gradients are poorly defined. The final contours obtained by the snake method based on ∇C_c1c2c3 and ∇C_l1l2l3 gradient information are nicely pulled towards the true edge and hence correspond again to the object color transition.
Fig. 3. Corel images. From left to right: a. Color image. b. The initial contour. c–f. Snake segmentation result based on gradient field ∇I, ∇C_RGB, ∇C_c1c2c3, and ∇C_l1l2l3.
5 Object Search
In this section, we consider interactive image segmentation in the context of content-based object search by image example. The basic idea of object search by image example is to extract characteristic features from target images, which are then matched with those of the query image. These features are typically derived from shape, texture or color properties of query and target images. After matching, images are ordered with respect to the query image according to their similarity measure. To be precise, let an image I be represented by its image feature vector of the form I = (f0, wI0; f1, wI1; ...; ft, wIt) and a typical query Q by Q = (f0, wQ0; f1, wQ1; ...; ft, wQt), where wIk (or wQk) represents the weight of image feature fk in image I (or query Q), and t image features are used for image object search. The weights are assumed to be between 0 and 1. Weights can be assigned corresponding to the feature frequency ff as defined by:

wi = ff   (13)
giving the well-known histogram form, where ff (feature frequency) is the frequency of occurrence of the image feature values in the image or query. However, for accurate image object search, it is desirable to assign weights in accordance with the importance of the image features. To that end, the image feature weights used for both images and queries are computed as the product of the feature frequency and the inverse collection frequency factor, defined by [6]:

wi = ( 0.5 + 0.5 ff / max{ff} ) log( N / n )   (14)
where N is the number of images in the database and n denotes the number of images to which a feature value is assigned. In this way, features are emphasized which have high feature frequencies but low overall collection frequencies. Given the weighted vector representation, the following query-image similarity measure is used:

S(Q, I) = Σ_{k=1}^{t} min{wQk, wIk} / Σ_{k=1}^{t} wQk   (15)

Images are ranked with respect to this similarity measure S().
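Equations (13)–(15) amount to a tf-idf-style weighting followed by a normalized histogram intersection. The sketch below illustrates this under the assumption that feature frequencies and per-feature collection counts are available as plain arrays; all names are illustrative, not the paper's own code.

```python
import numpy as np

def weighted_similarity(q_ff, i_ff, n_images_with_feature, n_total_images):
    # q_ff, i_ff: length-t feature-frequency vectors of query and image (eq. 13)
    # n_images_with_feature: length-t counts n; n_total_images: N
    def weights(ff):
        ff = np.asarray(ff, float)
        icf = np.log(n_total_images / np.maximum(n_images_with_feature, 1))
        return (0.5 + 0.5 * ff / max(ff.max(), 1e-9)) * icf     # eq. (14)
    wq, wi = weights(q_ff), weights(i_ff)
    return float(np.sum(np.minimum(wq, wi)) / np.sum(wq))       # eq. (15)
```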
6 Object Search: Experiments
In this section, we report on the retrieval accuracy of the object search process. The database consists of N1 = 500 reference images of multicolored 3-D man-made objects. A second, independent set (the test set) of recordings was made of randomly chosen objects already in the database. These objects, N2 = 70 in number, were recorded again, one per image, with a new, arbitrary position and orientation with respect to the camera: some recorded upside down, some rotated, some at different distances. The dataset can be experienced within the Pic2Seek system on-line at http://www.wins.uva.nl/research/isis/zomax/. For each image, feature vectors are constructed based on the c1c2c3 gradient field ∇C_c1c2c3 with 8 bits resolution. By using ∇C_c1c2c3, in theory, pixels on a homogeneously colored patch will be discarded during feature vector formation, while pixels along the same color edge will accumulate in the same feature. For example, using eq. (13) as feature weighting, the total frequency for a particular feature represents a measure of the length of the color (invariant) edge. As a measure of retrieval quality, let rank rQi denote the position of the correct match for query image Qi, i = 1, ..., N2, in the ordered list of N1 match values. The cumulative percentile of test images producing a rank smaller than or equal to j is defined as

X(j) = ( (1/N2) Σ_{k=1}^{j} η(rQi = k) ) · 100%,

where η(rQi = k) reads as the number of test images having rank k. In Figure 4, the accumulated ranking percentile is shown for (A) retrieval based on the entire image with unweighted color features (cf. eq. (13)), (B) retrieval based on subimages of (only) the object after snake segmentation with unweighted color features (cf. eq. (13)), and (C) retrieval based on subimages of (only) the object after snake segmentation with weighted color features (cf. eq. (14)). From the results of Figure 4, we can observe that the discriminative power of subimage-based object search with weighted features is higher than that of the other approaches. As expected, content-based image retrieval using unweighted features on the whole image has the worst performance.
Fig. 4. The accumulated ranking percentile X(j) plotted against ranking j, for j ≤ 10: (A) whole image without weighted features, (B) subimage (color snake) without weighted features, (C) subimage (color snake) with weighted features.

7 Conclusion

In this paper, interactive image segmentation is considered for the purpose of content-based image retrieval. We have proposed the use of color invariant gradient information to guide the deformation process, in order to obtain snake boundaries which correspond to material boundaries in images, discounting the disturbing influences of surface orientation, illumination, shadows and highlights. Experimental results show that the proposed color invariant snake successfully finds object material contours while discounting other "accidental" edge types (e.g. shadow and highlight transitions). Furthermore, experiments show that object search using subimages with weighted features yields high retrieval accuracy.
References
1. Gevers, T. and Smeulders, A. W. M.: Image Indexing using Composite Color and Shape Invariant Features. ICCV, Bombay, India (1998)
2. Bajcsy, R., Lee, S. W. and Leonardis, A.: Color Image Segmentation with Detection of Highlights and Local Illumination Induced by Inter-reflections. In: IEEE 10th ICPR'90, Atlantic City, NJ (1990) 785–790
3. Klinker, G. J., Shafer, A. and Kanada, T.: A Physical Approach to Color Image Understanding. Int. J. of Comp. Vision 4 (1990) 7–38
4. Kass, M., Witkin, A. and Terzopoulos, D.: Snakes: Active Contour Models. International Journal of Computer Vision 1(4) (1988) 321–331
5. di Zenzo, S.: Gradient of a Multi-images. CVGIP 33 (1986) 116–125
6. Salton, G. and Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management (1988)
7. Sapiro, G. and Ringach, D. L.: Anisotropic Diffusion of Multi-valued Images with Applications to Color Filtering. IEEE PAMI 5(11) (1996) 1582–1586
Automatic Deformable Shape Segmentation for Image Database Search Applications

Lifeng Liu and Stan Sclaroff

Computer Science Dept., Boston University, 111 Cummington Street, Boston, MA 02215, USA
liulf,[email protected]

Abstract. A method for shape based image database indexing is described. Deformable shape templates are used to group color image regions into globally consistent configurations. A statistical shape model is used to enforce the prior probabilities on global, parametric deformations for each object class. The segmentation is determined in part by the minimum description length (MDL) principle. Once trained, the system autonomously segments deformed shapes from the background, while not merging them with adjacent objects or shadows. The formulation can be used to group image regions based on any image homogeneity predicate; e.g., texture, color, or motion. Preliminary experiments in color segmentation and shape-based retrieval are reported.
1 Introduction
Retrieval by shape is considered to be one of the more difficult aspects of content-based image database search. A major part of the problem is that many techniques assume that shapes have already been segmented from the background, or that a human operator has encircled the object via an active contour. Such assumptions are unworkable in applications where automatic indexing is required. In this paper, a new region-based approach is proposed that automatically segments deformable shapes from images. Deformable shape templates are used to group color image regions into globally consistent configurations. A statistical shape model is used to enforce the prior probabilities on global, parametric deformations for each object class. The segmentation is determined in part by the minimum description length (MDL) principle. The method includes two stages: over-segmentation using a traditional region segmentation algorithm, followed by deformable model-based evaluation of various region grouping hypotheses. During the second stage, region merging, deformable model fitting, and global consistency checking are executed simultaneously. The approach is general, in that it can be used to group image regions based on texture measures, color, or other image features. Once trained, the system autonomously segments objects from the background, while not merging them with adjacent objects of similar image color. The resulting recovered parametric model descriptions can then be used directly in shape-based search of image databases. The system was tested on a number of different shape classes and results are encouraging.
2 Background
Segmentation using low-level techniques, such as region growing, edge detection, and mathematical morphology operations, requires a considerable amount of interactive guidance in order to get satisfactory results. Automating these model-free approaches is difficult because of shape complexity, illumination, inter-reflection, shadows, and variability within and across individual objects. One solution strategy is to exploit prior knowledge to sufficiently constrain the segmentation problem. For instance, a model-based segmentation scheme can be used to reduce the complexity of region grouping. Due to shape deformation and variation within object classes, a simple rigid model-based approach will break down in general. This realization has led to the use of deformable contour models in image segmentation [11] and in shape-based image retrieval [8,5]. The snake formulation can be extended to include a term that enforces homogeneous properties over the region during region growing [7,9,15]. This region-based approach tends to be more robust with respect to model initialization and noisy data. However, it requires hand-placement of the initial model, or a user-specified seed point on the interior of the region. One proposed solution is to scatter many region seeds at random over the image, followed by segmentation guided via Bayes/MDL criteria [10,12,19]. Unfortunately, the above-mentioned techniques will make mistakes in merging regions, even in constrained contexts. This is because local constraints are in general insufficient. To gain a more reliable segmentation, global consistency must be enforced [17]: the best partitioning is the one that globally and consistently explains the greatest portion of the sensed data. Finding the globally consistent or MDL image labeling is impractical in general due to the computational complexity of global optimization algorithms [13]. This leads to the use of parallel algorithms [12] or algorithms that instead find an approximately optimal solution [2,4,6,10,14,16,18,19].
3 Model Formulation
In our system, a deformable model is used to guide the grouping of image regions. A shape model is specified in terms of global warping functions applied to a closed polygon, hereafter referred to as a template. The global warping can be generic, and is controlled by a vector of warping parameters, a. To demonstrate the approach, we implemented a system that uses quadratic polynomials to model global deformation due to stretching, shearing, bending, and tapering. Assume that the distribution of shape parameters for a particular shape category can be modeled as a multi-dimensional normal distribution. The distribution is characterized by its mean ā and covariance matrix Σ. For a given deformation parameter vector a, the sufficient statistic for characterizing likelihood is the Mahalanobis distance:

Edeform = ã^T Σ^{-1} ã,   (1)

where ã = a − ā. As will be described, ā and Σ are acquired via supervised learning.
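In code, eq. (1) together with the supervised estimation of ā and Σ is only a few lines; the sketch below assumes a matrix of fitted warp-parameter vectors from training examples (one row per example) and uses NumPy's sample mean and covariance as stand-ins for the system's own learning step.

```python
import numpy as np

def learn_shape_prior(training_params):
    # training_params: (n_examples, n_params) matrix of fitted warp parameter vectors
    a_mean = training_params.mean(axis=0)
    cov = np.cov(training_params, rowvar=False)
    return a_mean, cov

def deformation_energy(a, a_mean, cov):
    # eq. (1): E_deform = (a - a_mean)^T Sigma^{-1} (a - a_mean)
    diff = np.asarray(a, float) - a_mean
    return float(diff @ np.linalg.inv(cov) @ diff)
```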
3.1 Model Fitting
One important step in the image partitioning procedure is to fit each region grouping hypothesis gi with deformable models from the object library. Fitting minimizes a function that includes the deformation term of Eq. 1 and two additional terms: a) area overlap between model and region grouping, and b) color compatibility of the regions included in the grouping:

E(gi) = Ecolor + α Earea + β Edeform.   (2)

The scalars α and β control the importance of the three terms. The color compatibility term Ecolor is simply the norm of the color covariance matrix for pixels within the region grouping. The region/model area overlap term is computed as Earea = SG Sm / Sc², where SG is the area of the region grouping hypothesis, Sm is the area of the deformed model, and Sc is the common area between the regions and the deformed model. By using degree of overlap in our cost measure, we can avoid measuring distances between region boundaries and corresponding model control points. Hence we can avoid the problem of finding direct correspondence between landmark points, which is not easy in the presence of large deformations. Model fitting is accomplished by minimizing Eq. 2. In our system, we employ the downhill-simplex method [13] because it requires only function evaluations, not derivatives. Though it is not very efficient in terms of the number of function evaluations that it requires, it is still suitable for our application since it is fully automatic and reliable. The procedure is accelerated via a multiscale approach.
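A sketch of this fitting step is shown below, using SciPy's Nelder-Mead (downhill simplex) routine. The `deform_template` callable, which renders the warped template as a binary mask, and the constant color term are hypothetical placeholders for the system's own components; the area term follows the SG Sm / Sc² form reconstructed above.

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(region_mask, deform_template, a0, a_mean, cov, alpha, beta, e_color):
    # region_mask: boolean mask of the region grouping hypothesis g_i
    # deform_template(a): returns a boolean mask of the template warped by parameters a
    S_G = float(region_mask.sum())

    def cost(a):
        model_mask = deform_template(a)
        S_m = float(model_mask.sum())
        S_c = float(np.logical_and(region_mask, model_mask).sum())
        e_area = S_G * S_m / max(S_c, 1.0) ** 2            # area-overlap term
        diff = np.asarray(a, float) - a_mean
        e_deform = float(diff @ np.linalg.inv(cov) @ diff)  # Mahalanobis term, eq. (1)
        return e_color + alpha * e_area + beta * e_deform   # eq. (2)

    return minimize(cost, a0, method="Nelder-Mead")         # downhill simplex
```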
3.2 Model Training
In the current system, the template is defined by the operator as a polygonal model. During model training, the system is presented with a collection of color images. These images are first over-segmented via a traditional color region segmentation algorithm [1,13]. In the first few training images, the operator is asked to mark candidate regions that belong to the same object. The system then merges the regions and uses downhill-simplex method to minimize the cost function in Eq. 2, thereby matching the template to the training regions in a particular image. This process is repeated for all images in the training set. As more training data is processed, the system can then semi-automate training. The system can take a “first guess” at the correct region grouping and present it to the operator for approval [13].
4 Automatic Image Segmentation
Once trained, the deformable model guides the grouping and merging of color regions. The process begins with over-segmentation of the color input image [1,13]. An edge map is also computed via standard image processing methods. Using this over-segmentation, candidate regions are matched with models based on their color band-rate feature [3].
There are two major constraints used in the selection of candidate groupings. The first constraint is a spatial constraint: every region in a grouping hypothesis should be adjacent to another region in the same group. The second constraint is a region boundary compatibility constraint [13]: if the boundary between two regions is "strong," then they cannot be combined in the same group. The system then tests various combinations of candidate region groupings for each model. The goal is to find the optimal, model-based partitioning of the image. In theory, the system should exhaustively test all possible combinations of the candidate regions, and select the best ones for merging; however, the computational complexity of such exhaustive testing is exponential, and the problem of finding the best group is NP-complete. To make the problem tractable, we have tested a number of approximation strategies for finding a globally consistent labeling of the image [13]. In the global consistency strategy, for any possible partitioning of the image, we compute a global cost value for the whole configuration:

E = Σ_{i=1}^{n} ri E(gi) + γn,   (3)
where ri is the ratio of the ith group area to the total area, E(gi) is the deformation cost for group gi, n is the number of groupings in the current image partitioning, and γ is a constant factor. In our experiments, γ = 0.04. The first term measures the model compatibility over all groupings in the image partition. The second term corresponds to the code length (number of models employed); it enforces a minimum description length criterion [12,13].
4.1 Highest Confidence First
A deterministic algorithm, highest confidence first (HCF), can be used to improve convergence speed [4,10]. The HCF algorithm as applied to our problem is as follows:
1. Initialize the region grouping configuration such that every region in the over-segmented image is in its own distinct group gi.
2. Fit models to each region grouping gi. Compute the global cost Eo via Eq. 3. Save this configuration as the best found so far, Co.
3. Set Em to a very large value.
4. For each pair of adjacent groups gi, gj in the current configuration, compute the global cost E2 that would result if gi and gj were merged. If E2 < Em, then set Em = E2 and save this merged configuration Cm. After this step, Cm is the configuration with minimum merging cost for merging any pair of groups in the current configuration.
5. Use the merged configuration Cm as the new configuration. If Em < Eo, then set Eo = Em and save this new configuration as the best found so far, Co = Cm.
6. If all groups are merged into one, terminate and output the best configuration Co and its cost value Eo. Otherwise, go to step 3.
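The greedy structure of HCF is easy to express in code. The sketch below follows the steps listed above; the `adjacent` and `global_cost` callables (the latter evaluating Eq. 3, including model fitting for each group) are hypothetical stand-ins for the system's own routines.

```python
import itertools

def hcf_segmentation(initial_groups, adjacent, global_cost):
    # initial_groups: list of region-index sets (one region per group, step 1)
    # adjacent(gi, gj): True if the two groups share a boundary
    # global_cost(groups): evaluates Eq. (3) for a configuration
    config = [set(g) for g in initial_groups]
    best_config, best_cost = list(config), global_cost(config)      # step 2
    while len(config) > 1:
        merged_cost, merged_config = float("inf"), None              # step 3
        for i, j in itertools.combinations(range(len(config)), 2):   # step 4
            if not adjacent(config[i], config[j]):
                continue
            trial = [g for k, g in enumerate(config) if k not in (i, j)]
            trial.append(config[i] | config[j])
            c = global_cost(trial)
            if c < merged_cost:
                merged_cost, merged_config = c, trial
        if merged_config is None:        # no adjacent pairs left to merge
            break
        config = merged_config                                       # step 5
        if merged_cost < best_cost:
            best_cost, best_config = merged_cost, list(config)
    return best_config, best_cost                                    # step 6
```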
Fig. 1. Two deformable template models employed in our experiments: (a) fish model, (b) banana model. The initial polygonal model was defined by the user, and then trained as described in Sec. 3.2.
In our experience, the computational complexity of HCF is generally less than that needed to obtain similar quality segmentation results via the simulated annealing algorithm [13]. In HCF the number of different merging configurations tested is O(n²), where n is the number of regions in the image. This is because some results from the previous iteration can be reused in the next. Specifically, at each iteration (except the first), the algorithm need only compute the pairwise merging cost between all groups gi and the newly-merged group from the previous iteration.
5 Results
The aforementioned segmentation method was implemented and tested on hundreds of images from a number of different classes of cluttered color imagery: images of fruit, vegetables, and leaves collected under controlled lab conditions, and images of fish obtained from the world wide web. Due to space limitations, only two examples can be shown. The first example shows segmentation results for five examples of fish images obtained from the world wide web. The fish model used in segmentation is shown in Fig. 1(a), and was trained using about 60 training images. The test images were excluded from the training set. The original color images are shown in the first column of Fig. 2, followed by the over-segmented images used as input to the merging algorithm. The third column shows the models recovered in finding the best merging configuration obtained via HCF. Finally, the last column depicts the corresponding model-based merging of image regions. As can be seen, the method accurately recovered a deformable model description of each fish in the image. Only in one case (Fig. 2(a)) was the orientation of some of the models incorrectly estimated. Despite clutter, deformation, and partial occlusions, performance was quite satisfactory. In the next example, we show the approach as employed in an image retrieval application. We demonstrate the approach using a simple banana shape model that was trained using 40 example images of bananas at varying orientations and scales. These training images were not contained in our test image data set. All images in the test data set were then segmented using the trained model as described in Sec. 4. The recovered model deformation parameters a for the selected region grouping hypotheses were stored in the index for each image. If the image had multiple yellow objects, then the system stored a list of model descriptions for that image.
Fig. 2. Example segmentation for images of fish. The original color images are shown in the first column, followed by the over-segmented images used as input to the merging algorithm. The third column shows the recovered deformable models for the best merging configuration obtained via HCF. Finally, the last column depicts the model-based merging of regions.
Once descriptions are precomputed, shape-based queries can be answered in interactive time. An example search with our system is shown in Fig. 3. The user selected the image shown in Fig. 3(0). The system retrieved images that had similar shapes, here shown in rank order (1-14). The most similar shapes are other bent bananas of similar aspect ratio. Yellow squash shapes were ranked less similar. The corresponding region grouping is shown below each of the original images in the figure. Note that the system correctly grouped regions despite shadows, lighting conditions, and deformation. Especially notable are cases where multiple yellow shapes are abutting each other (Fig. 3(3,7,12,14)). Due to the use of model-based region merging, our system is able to avoid merging similarly colored, adjacent but separate objects. The approach is also adept at avoiding merging objects with their similarly-colored shadows.
Fig. 3. Image retrieval example. The user selected an example image (0). The system retrieved shapes found in the database and displayed them in rank similarity order (1-14). The segmented shape is shown below each original database image. If an image contained more than one yellow shape, it is shown more than once in the retrieval (once per shape). Note that the most similar shapes are other bent bananas of similar aspect ratio. Yellow squash shapes were ranked less similar.
6 Conclusion
As seen in the examples of the previous section, the shape-based region merging algorithm can produce satisfactory results. The algorithm can detect the whole object correctly, while at the same time, avoid merging objects with background and/or shadows, or merging adjacent multiple objects. A statistical shape model is used in finding a globally-consistent labeling of the image, as determined in part by the minimum description length (MDL) principle. The formulation is general, in that it can be used to group image regions based on a general image homogeneity predicate; e.g., texture, color, or motion. The major issue is the computation time required to obtain a segmentation result. This led to the evaluation of different methods for obtaining approximate, globally optimal region groupings [13]. The method of choice is based upon the highest confidence first (HCF) algorithm.
In most previous approaches, initial model placement is either given by the operator, or by exhaustively testing the model in all orientations, scales, and deformations centered at every pixel in the image. The region-based approach proposed in this paper significantly reduces the need to test all model positions. Once trained, our system is fully-automatic. Therefore, it is well-suited to image database indexing applications. Each selected region grouping hypothesis has a recovered shape model associated with it. As has been demonstrated, these model parameters can be used directly in recognition and shape comparison.
References
1. J.R. Beveridge, J.S. Griffith, R.R. Kohler, A.R. Hanson, and E.M. Riseman. Segmenting images using localized histograms and region merging. IJCV, 2(3):311–352, 1989.
2. G. Bongiovanni and P. Crescenzi. Parallel simulated annealing for shape detection. CVIU, 61(1):60–69, 1995.
3. M. H. Brill. Can color-space transformation improve color constancy other than von Kries? SPIE, Human Vision, Visual Processing, and Digital Display IV, 1913:485–492, 1993.
4. P. B. Chou and C. M. Brown. The theory and practice of Bayesian image labeling. IJCV, 4(3):185–210, 1990.
5. A. DelBimbo and P. Pala. Visual image retrieval by elastic matching of user sketches. PAMI, 19(2):121–132, 1997.
6. R. P. Grzeszczuk and D. N. Levin. Brownian strings: segmenting images with stochastically deformable contours. PAMI, 19(10):1100–1114, 1997.
7. J. Ivins and J. Porrill. Active-region models for segmenting textures and colors. I&VC, 13(5):431–438, 1995.
8. A. K. Jain, Y. Zhong, and S. Lakshmanan. Object matching using deformable templates. PAMI, 18(3):267–278, 1996.
9. T. N. Jones and D. N. Metaxas. Image segmentation based on the integration of pixel affinity and deformable models. Proc. CVPR, pp. 330–337, 1998.
10. T. Kanungo, B. Dom, W. Niblack, and D. Steele. A fast algorithm for MDL-based multi-band image segmentation. Proc. CVPR, pp. 609–616, 1994.
11. M. Kass, A.P. Witkin, and D. Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988.
12. Y. G. Leclerc. Constructing simple and stable descriptions for image partitioning. IJCV, 3(1):73–102, 1989.
13. L. Liu and S. Sclaroff. Deformable shape detection and description via model-based region grouping. Technical report, CS TR 98-017, Boston U., Nov. 1998.
14. D. Noll and W. Von Seelen. Object recognition by deterministic annealing. I&VC, 15(11):855–860, 1997.
15. R. Ronfard. Region-based strategies for active contour models. IJCV, 13(2):229–251, 1994.
16. G. Storvik. Bayesian approach to dynamic contours through stochastic sampling and simulated annealing. PAMI, 16(10):976–986, 1994.
17. T. M. Strat. Natural Object Recognition. Springer-Verlag, 1992.
18. J. P. Wang. Stochastic relaxation on partitions with connected components and its application to image segmentation. PAMI, 20(6):619–636, 1998.
19. S. C. Zhu and A. Yuille. Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. PAMI, 18(9):884–900, 1996.
A Multiscale Turning Angle Representation of Object Shapes for Image Retrieval

Giancarlo Iannizzotto (1) and Lorenzo Vita (2)

(1) Department of Mathematics, University of Messina, C.da Papardo, Salita Sperone, 98166 Messina, Italy
[email protected]
(2) Istituto di Informatica e Telecomunicazioni, University of Catania, Viale A. Doria 6, 95025 Catania, Italy
[email protected]

Abstract. In this paper, we present a Multiscale Turning Angle representation and a pseudo-distance developed to compare the shapes of objects in 2-D images. Both the representation and the pseudo-distance are devised to obtain results very similar to those of a human operator; in particular, the problem of noisy contours is dealt with. A prototype application has been developed to test the algorithms, and experimental results are presented.
1 Introduction
Comparing the shapes of the objects contained in digital images is one of the major issues in image analysis [1,2]. Very often the purpose of this comparison is not to establish the exact identity of the shapes (exact matching), but to determine a level of similarity among the shapes (an evaluation of how much the shapes resemble each other). This category of applications generally responds to queries like: "Search for all the shapes in a database that are similar (within a certain level) to a given shape" (range queries), or: "Find the first n shapes in the database that are most similar to a given shape" (nearest-neighbor queries). In both cases we need to sort the whole database according to the level of similarity with the shape specified as a reference. Thus, the selection of the comparison technique greatly affects both the effectiveness and the performance of the system. The representation used for the shape greatly affects the features of the retrieval system, too: for example, if the comparison has to ignore the size, the rotations and the translations of objects, it is better to select invariant functions or those having at least a "simple" behavior under the variation of such parameters. In particular, the techniques using functions of the curvilinear abscissa have proven to have many of the listed features, and Turning Angles have found a certain favor in the field of image retrieval [3,4]. Although Turning Angles can give a high level of discrimination among shapes and have some interesting properties that have been well described in the literature [5,6], Turning Angles are very sensitive to small variations in the contour [7], and are not invariant to rotation.
In this paper we present a technique for the representation and comparison of shapes based on a modified, multiscale version of Turning Angles, below called the multiscale TA representation. The described representation is invariant to rotation, translation, and change of size. Besides, since it is multiscale, it is very robust to small contour variations. The method of comparison used is based on dynamic programming, and has been designed for the effective management of the features of the TA representation. The described techniques have been tested for shape-based retrieval on a database of about 250 images. Interesting results are presented at the end of the paper. In the following paragraphs we provide a quick definition of the Turning Angle shape representation and a survey of problems with the Turning Angle representation (par. 2). We then present our solution, and the results obtained in our experiments (par. 3 and 4). Finally, we draw some conclusions and outline some future developments (par. 5).
2 Some Problems with the Turning Angle Representation
Let us assume that we fix, along the curve that forms the contour of an object, a curvilinear abscissa s, a sense (for example counterclockwise), and a reference point where the abscissa is given the value 0. Let us define, for each contour point, a function θ(s) of the abscissa s, as the angle formed with a reference axis (for example, the x axis) by the counterclockwise tangent to the path covered in moving from the current point to the following one. In other words, θ(s) measures the turning angle with reference to the direction of the x axis when moving from the current point to the following one [3]. The sequence of the values assumed by θ(s) in every point of the contour is the Turning Angle representation of the contour. The description of the objects thus obtained is invariant to translations of the objects, and enables us to easily deal with their variations in area and perimeter. Besides, this description is represented by a one-dimensional (and not two-dimensional) signal; this makes its treatment much simpler. The Turning Angle representation of a contour is a vector of n elements, where n is the number of points forming the contour to which it refers. Several techniques can be used, according to the needs in terms of accuracy in the representation of contours, of speed, and of selectivity, to obtain vectors all having the same size (scaling with reference to the perimeter [5], even or uneven sampling, preliminary fitting of contours with parametric curves [4,7]). However, not all techniques of comparison require the vectors to have the same size, but this condition simplifies the process and, if adequately exploited, can improve its accuracy. The Turning Angle representation is not invariant to rotation: if we rotate a shape, we add a vertical shift (an offset) to the graph of the Turning Angle. This feature introduces another variable into the comparison stage, and is generally undesirable.
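For a contour given as an ordered list of points, the Turning Angle vector defined above can be computed directly; the sketch below (NumPy assumed) measures each tangent direction against the x axis, as in the definition.

```python
import numpy as np

def turning_angle(contour):
    # contour: (N, 2) array of (x, y) points in counterclockwise order along the closed curve
    d = np.roll(contour, -1, axis=0) - contour     # vector from each point to the next
    return np.arctan2(d[:, 1], d[:, 0])            # theta(s) for every contour point
```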
Several approaches to this problem have been tried: among others, the normalization of the Turning Angle vector with respect to some parameter (e.g. the median value, the mean value, etc.) [7], and the identification of the axes of the object [4].
The second problem to be dealt with is the uncertainty in the selection of the origin point of the curvilinear abscissa s on the contour. If we select a different point as origin, the representation obtained from the same object changes considerably. The function that we want to use for the comparison (the distance function) must therefore be able to recognize the similarity (or the equality) between the two objects, notwithstanding the fact that we have selected different origins of the curvilinear abscissa for them. This problem can be regarded as that of finding the optimal alignment between the two sets of samples representing the two contours.
The third problem concerns the high sensitivity of the Turning Angle function to small variations in the contour. A small perturbation of the contour causes a much wider perturbation in the Turning Angle; only the duration (number of samples involved) of the perturbation is the same in the domain of the Cartesian coordinates and in that of the Turning Angle function. Let us consider two objects, O1 and O2, whose representations in terms of Turning Angles are respectively the two vectors A and B (for simplicity, let us assume that the lengths of the two vectors have been made the same). The two objects are exactly the same, except for a short part of the second object that has been slightly changed (corrupted by "noise"). If we use the Euclidean distance for the comparison between the two vectors,

D = Σ_i (A_i − B_i)²   (1)
we obtain a distance between the two objects that can be very high (although they differ in only a few points), due to the fact that it sums the contributions given by all the points of the two vectors. A good solution would be to filter the contributions given by the individual points to the distance, in order to reduce the weight of the single points affected by noise. An example of this technique has been presented in [7]. The fourth problem concerns the case of an inexact correspondence between the two sequences of samples. For example, if object O1 has a slightly longer side (even by a single sample) in comparison with object O2 and, for the rest, the two objects are the same, as soon as the process of comparison meets the "spare" sample it will lose the alignment between the two sequences. Even if all the other samples are equal, the correspondence will be irremediably lost, and the two shapes will appear very different. Allowing for warping while searching for the correspondence can solve this problem: if the two sequences are not similar at an index i, the algorithm should try to skip that point and to identify a new realignment a bit further on. Of course, this involves a cost that will affect the total distance, and will depend on the number and the width of the realignment jumps made [4].
3 The Proposed Solution
Below we will describe the solution which we have developed for the multiscale representation of 2-D shapes, starting from the Turning Angle representation, and the technique of comparison, based on dynamic programming.
3.1 The Multiscale TA Representation
We have dealt with the rotation problem by slightly changing the definition of the Turning Angle, in order to avoid the direct dependence of the representation on the orientation of the object with respect to the axes of the image. The axis with respect to which the Turning Angle is calculated is no longer the x axis of the image, but an axis with the direction and sense of the counterclockwise tangent to the curve at the point of origin of the curvilinear abscissa. The angle obtained is always between −2π and 2π, due to a modulus operation. To distinguish it from the Turning Angle, this representation will be indicated as TA below. With regard to the selection of the starting point, that is, the point of curvilinear abscissa 0, the solution that we have adopted is similar to the one proposed in [4]. The main axis of the shape and the axis perpendicular to it are determined. Then, among the points common to the shape and the perpendicular axis, the one closest to the barycentre of the shape is selected. In the case of almost symmetric shapes, for which there might be some ambiguity in the determination of the starting point, other possible starting points are selected (up to three), and for each of them a different TA vector is calculated. The comparison with other shapes is then done for each of the TA vectors, and the shortest of the resulting distances is selected. In the multiscale representation, the level of accuracy with which the details of the shape are represented increases or decreases with the variation of the scale. An example of the multiscale representation can be seen in Figure 1. Two triangles are shown. The second of them (Fig. 1.d) differs from the first one only in a small portion of the contour, affected by noise (bottom left). For each of the two triangles, the graphs of the respective TA representations are shown at the highest detail, scale 0 (Fig. 1.b,e), and at scale 16, which is definitely less detailed (Fig. 1.f). As we can notice, when the scale increases, the representations of the two triangles clearly converge. In Figure 2 we show the graphs of the difference (sample by sample) between the TA representations of the two triangles, for the two scales (Figure 2.a: scale 0, Figure 2.b: scale 16). The multiscale representation has been obtained through recursive convolutions of the TA vector with a Gaussian kernel [8] of width equal to 5. As we will see in the next paragraph, this principle (which is very similar to the one described in [9]) has enabled us to obtain an effective method of comparison in the presence of noise.
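The multiscale TA representation can be obtained with a few lines of code: the TA vector is repeatedly smoothed with a circular Gaussian kernel, keeping one vector per scale. The sketch below uses SciPy's 1-D Gaussian filter; the standard deviation chosen to approximate the paper's width-5 kernel is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def multiscale_ta(ta, n_scales=17, sigma=1.0):
    # ta: 1-D TA vector of a closed contour; returns one smoothed vector per scale
    scales = [np.asarray(ta, float)]
    for _ in range(n_scales - 1):
        # recursive convolution with a Gaussian kernel; "wrap" keeps the contour circular
        scales.append(gaussian_filter1d(scales[-1], sigma, mode="wrap"))
    return scales   # scales[0] is the full-detail representation, scales[-1] the coarsest
```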
Fig. 1. Two slightly different triangles and their TA representations at scale 1 and scale 16.
Fig. 2. Graphs of the per-sample differences between the TA representations of the two triangles of Fig. 1 (see text).
3.2 The Matching Algorithm
The matching algorithm is based on the definition of a pseudo-distance function, developed through techniques of dynamic programming. A similar approach to the matching problem can be found in [4]. Given two Turning Angle vectors (Θ1 and Θ2), the function calculates, sample by sample, the absolute value of the difference between the two vectors, and sums the results to obtain a scalar value. While calculating the differences, the algorithm determines the optimal alignment between the two vectors. Moreover, a sample of one of the two vectors can match more than a single sample of the other (and vice versa). Of course, this "warping" operation increases the distance by a "penalty" value for each jump made. As in all algorithms of this kind, matching always proceeds monotonically. Unlike the technique used in [4], in our algorithm the lengths of the two vectors Θ1 and Θ2 can be different, as long as one is at most double the other. This means that we do not always need to normalize the two shapes with reference to the perimeter, but can, for example, normalize with reference to the area. In several cases, this can lead to a more accurate comparison [7]. The efficiency of the matching process can benefit from the multiscale nature of the selected representation when a large number of images must be compared. A first stage at a high scale (> 20) and low resolution enables us to quickly discard all the shapes whose distance from the query shape is higher than a given threshold. A second stage at full resolution and scale 0 enables a more accurate selection among the remaining shapes. This approach, called "quick and dirty", was successfully used in the past in several systems of image retrieval [2,7,10].
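The sketch below gives one possible dynamic-programming formulation of such a pseudo-distance: absolute per-sample TA differences are accumulated along a monotonic alignment, and every warping step (one sample of one vector matching several of the other) adds a penalty. The recurrence and the penalty value are assumptions standing in for the paper's own formulation, which additionally handles the starting-point candidates and the scale hierarchy.

```python
import numpy as np

def ta_pseudo_distance(theta1, theta2, penalty=0.5):
    # theta1, theta2: 1-D TA vectors (lengths may differ)
    n, m = len(theta1), len(theta2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(theta1[i - 1] - theta2[j - 1])
            D[i, j] = min(D[i - 1, j - 1] + d,          # one-to-one match
                          D[i - 1, j] + d + penalty,    # warp: repeat a sample of theta2
                          D[i, j - 1] + d + penalty)    # warp: repeat a sample of theta1
    return float(D[n, m])
```

In a two-stage "quick and dirty" setting, this routine would first be run on coarse-scale TA vectors to prune the database, and then on the scale-0 vectors of the surviving candidates.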
4 Experimental Results
The comparison technique described so far has been tested by developing an application for shape-based image retrieval. The database has been filled with about 250 color and greyscale images; for the retrieval, only the shape of the outer contour of the biggest object in every image is used. The segmentation of the images is performed by the Amoeba algorithm [11,12]. The interface for the system has been developed under Windows95, while the server has been developed in a Linux environment with Postgres95. In Fig. 3 the response to a query is shown. The query figure (a triangle with its starting point at the extreme left) is not present in the database. All the similar shapes, sorted according to their distance from the triangle of the query, are therefore retrieved. Note that even though both the starting point and the orientation of the shapes vary, they are still recognized as responding to the query (they are the first ones in the list retrieved). Even a triangle partially affected by noise has been correctly recognized as similar to the other ones. For the experiment shown in Fig. 3, the database was filled with geometric shapes only. The ability of the proposed system to treat complex shapes and real images is shown in Fig. 4. This time the database has been filled with 256-color images representing objects with different shapes. Once more, the image used in the query is not present in the database, and two images very similar to it, together with other less similar ones, are present. The images containing an object with a shape similar to that represented in the query are in the first places of the list of images retrieved.
5
Conclusions and Further Work
In this paper we have presented a technique for the comparison of planar shapes, based on a multiscale representation of the shapes and on a matching algorithm built on dynamic programming. In the first sections of the paper we briefly described the problems connected with comparing shapes represented by the Turning Angle. In the subsequent sections these problems were addressed, and solutions were presented for each of
Fig. 3. Results for the “Triangle” query.
Fig. 4. Results for the “Dragon” query.
them. We then presented experimental results obtained with a system for shape-based image retrieval that implements the techniques described. During the tests, the system gave very satisfactory results when compared with human perception; its disadvantages are the lack of indexing, the restriction to a single, singly connected object per image, and the high computational complexity of the pseudo-distance algorithm. We are currently working on new approaches to the matching problem with lower computation times (currently about 5 seconds to match 250 shapes of about 128 points each, without "quick and dirty" filtering, on a Pentium-100 PC running the Linux operating system).
References 1. S. K. Chang and A. Hsu. Image information systems: Where do we go from here? IEEE Trans. on Knowledge and Data Engineering, 4 No. 5:431–442, October 1992. 609 2. C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3, July 1994. 609, 614 3. Brian Scassellati, Sophoclis Alexopoulos, and Myron Flickner. Retrieving images by 2d shape: A comparison of computation methods with human perceptual judgments. In SPIE Conference on Storage and Retrieval for Image and Video Databases, volume 2185, pages 2–14, February 1994. 609, 610 4. W. Niblack and J. Yin. A pseudo distance measure for 2d shapes based on turning angle. In International Conf. on Image Processing, Washington, DC, USA, 1995. 609, 610, 611, 612, 613 5. E.M. Arkin, L.P. Chew, D.P.Huttenlocher, K.Kedem, and J.S.B. Mitchell. An efficiently computable metric for comparing polygonal shapes. IEEE Transactions on PAMI, 13(3), March 1991. 609, 610 6. E. R. Davies. Machine Vision, 2nd Edition. Academic Press, 1997. 609 7. G. Iannizzotto, A. Puliafito, and L. Vita. A new shape distance for content based image retrieval. In Multimedia Modeling, MMM96, Tolouse (France), November 1996. 609, 610, 611, 614 8. R. Jain, R. Kasturi, and B. G. Schunck. Machine Vision. McGraw-Hill, 1995. 612 9. F. Mokhtarian and A. K. Mackworth. A theory of multiscale, curvature-based shape representation for planar curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14 (8):789–805, August 1992. 612 10. F. Mokhtarian. Design and implementation of a content-based image retrieval tool. In Proceedings of European Conference on Computer Vision, volume I, pages 566–578, Cambridge, UK, 1996. 614 11. G. Iannizzotto and L. Vita. A fast, accurate method to segment and retrieve object contours in real images. In International Conf. on Image Processing, Lausanne, Switzerland, 1996. 614 12. G. Iannizzotto and L. Vita. Fast and accurate edge-based segmentation with no contour smoothing in 2-d real images. to be published in IEEE Transactions on Image Processing. 614
Contour-Based Shape Similarity Longin Jan Latecki and Rolf Lakämper Institut für Angewandte Mathematik, Universität Hamburg, Bundesstr. 55, 20146 Hamburg {latecki,lakaemper}@math.uni-hamburg.de
Abstract. A similarity measure for silhouettes of 2D objects is presented, and its properties are analyzed with respect to retrieval of similar objects in an image database. Our measure profits from a novel approach to subdivision of objects into parts of visual form. To compute our similarity measure, we first establish the best possible correspondence of visual parts, which is based on a correspondence of convex boundary arcs. Then the similarity between corresponding arcs is computed and aggregated. We applied our similarity measure to shape matching of object contours in various image databases and compared it to well-known approaches in the literature. The experimental results justify that our shape matching procedure gives an intuitive shape correspondence and is stable with respect to noise distortions.
1
Introduction
With the recent increase in image and multimedia databases, there has been increased research in developing and applying shape similarity measures, e.g., see [3]. In computer vision there is a long history of work in shape representation and shape similarity. However, since in image databases the object classes are generally unknown a priori, a universal, non-parametric shape representation is necessary. By this we mean approaches in which the entire object shape is described uniformly without any restrictions on its shape. A shape similarity measure useful for shape-based retrieval in image databases should be in accord with our visual perception. This important property leads to the following requirements: 1. It should permit recognition of perceptually similar objects that are not mathematically identical. 2. It should abstract from distortions (e.g.,digitization and segmentation noise). 3. It should respect visual parts of objects. 4. It should not depend on scale, orientation, and position of objects. If we want to apply a shape similarity measure to distributed image databases, e.g., in Internet, where the object classes are generally unknown a priori, it is necessary that 5. a shape similarity measure is universal, in the sense that it allows us to identify or distinguish objects of arbitrary shapes. Dionysius P. Huijsmans, Arnold W.M. Smeulders (Eds.): VISUAL’99, LNCS 1614, pp. 617–625, 1999. c Springer-Verlag Berlin Heidelberg 1999
In this paper we present a shape similarity measure that satisfies requirements 1-5. We demonstrate this by theoretical considerations, experimental results, and by comparison to the existing similarity measures. An application of our measure to retrieval of similar objects in a database of object contours is demonstrated in Figure 1, where the user query is given by a graphical sketch.
Fig. 1. Shape for similar objects based on our similarity measure. Since contours of objects in digital images are distorted by digitization noise and segmentation errors, it is desirable to neglect these distortions while at the same time preserving the perceptual appearance at a level sufficient for object recognition. Thus, before our similarity measure is applied, the shape of the objects is simplified by a novel curve evolution method. This allows us • to reduce the influence of noise and • to simplify the shape by removing irrelevant shape features without changing relevant shape features, which contributes in a significant way to the fact that our measure satisfies requirements 1 and 2. A few stages of our novel curve evolution are illustrated in Figure 2. It is described in [10] and on our www site [11]. Our curve evolution method achieves the task of shape simplification in a parameter-free way, i.e., the process of evolution does not require any parameters, and for the stop condition we can determine a universal stage of the curve evolution (which does not depend on the given shape or on the amount of distortion) at which the distortions are eliminated and the perceptual appearance is sufficient for robust object recognition. This is an important feature of
our curve evolution method for applications in image databases, since it implies that shape simplification is achieved automatically.
Fig. 2. A few stages of our curve evolution. (a) is a distorted version of the contour on http://www.ee.surrey.ac.uk/Research/VSSP/imagedb/demo.html.
Our shape similarity measure profits from a subdivision of objects into parts of visual form. There is strong evidence for part-based representations in human vision, see e.g., [15] and [5]. According to Siddiqi et al. [15], "Part-based representations allow for recognition that is robust in the presence of occlusion, movement, deletion, or growth of portions of an object." It is a simple and natural observation that maximal convex parts of objects determine visual parts. The fact that visual parts are somehow related to convexity has been noticed in the literature, e.g., Basri et al. [2], Hoffman and Richards [4], and Koenderink and Doorn [7]. Although the observation that visual parts are "nearly convex shapes" is very natural, the main problem is to determine the meaning of "nearly" in this context. Many significant visual parts are not convex in the mathematical sense, since a visual part may have small concavities, e.g., the small concavities caused by the fingers of a human arm. Thus, a natural and simple idea is to compute significant convex parts while neglecting small concavities. Our solution to this problem is presented in [10].
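To make the notion of maximal convex and concave boundary arcs concrete, the following Python sketch splits a closed polygon into such arcs using the sign of the turn at each vertex. It only illustrates the convconc(C) decomposition used in the next section; the treatment of small concavities, which in the paper is handled by the curve evolution of [10], is ignored, and all names are assumptions.

```python
import numpy as np

def convconc(polygon):
    """Split a closed polygon into maximal convex or concave boundary arcs.

    polygon: (N, 2) array of vertices in counterclockwise order.
    Each arc is returned as a list of vertex indices; consecutive arcs share
    their boundary vertex. Illustrative sketch, not the authors' code.
    """
    pts = np.asarray(polygon, dtype=float)
    n = len(pts)
    # Sign of the turn at each vertex: +1 convex, -1 concave (ccw orientation).
    signs = []
    for i in range(n):
        a, b, c = pts[i - 1], pts[i], pts[(i + 1) % n]
        signs.append(1 if np.cross(b - a, c - b) >= 0 else -1)

    arcs, current = [], [0]
    for i in range(1, n):
        current.append(i)
        if signs[i] != signs[i - 1]:
            arcs.append(current)        # close the arc at the shared vertex
            current = [i]
    arcs.append(current)
    # (For brevity, the wrap-around merge of the first and last arc is omitted.)
    return arcs
```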
2
Shape Similarity Measure
Our similarity measure profits from the decomposition into visual parts based on convex boundary arcs. The key idea is to find the right correspondence of the visual parts. We assume that a single visual part (i.e., a convex arc) of one curve can correspond to a sequence of consecutive convex and concave arcs of the second curve. Thus, the correspondence can be a one-to-many or a many-to-one mapping, but never a many-to-many mapping. This assumption is justified by the fact that a single visual part should match its noisy versions, which can be composed of sequences of consecutive convex and concave arcs, or by the fact
that a visual part obtained at a higher stage of evolution should match the arc it originates from. Since maximal convex arcs determine visual parts, this assumption guarantees preservation of visual parts (without explicitly computing them). In this section, we assume that polygonal curves are simple, i.e., they have no self-intersections, and that they are closed. We also assume that polygonal curves are traversed in the counterclockwise direction. Let convconc(C) denote the set of all maximal convex or concave subarcs of a polygonal curve C. Then the order of traversal induces an order on the arcs in convconc(C). Since a simple one-to-one comparison of the maximal convex/concave arcs of two polygonal curves is of little use, because the curves may consist of different numbers of such arcs and even similar shapes may have different small features, we join maximal arcs together to form groups: A group g of a curve C is the union of a (non-empty) consecutive sequence of arcs in convconc(C). Thus, g is also a subarc of C. We denote by groups(C) the set of all groups of C; we have convconc(C) ⊆ groups(C). A grouping G for a curve C is an ordered set of consecutive groups G = (g0, ..., gn−1) for some n ≥ 0 such that gi ∩ g(i+1 mod n) is a single line segment for i = 0, ..., n − 1.
Since any two consecutive groups intersect in exactly one line segment, the whole curve C is covered by G. We denote the set of all possible groupings G of a curve C by G(C). Given two curves C1, C2, we say that groupings G1 ∈ G(C1) and G2 ∈ G(C2) correspond if there exists a bijection f : G1 → G2 such that 1. f preserves the order of groups and 2. for all x ∈ G1, either x ∈ convconc(C1) or f(x) ∈ convconc(C2). We call the bijection f a correspondence between G1 and G2. We denote the set of all corresponding pairs in G(C1) × G(C2) by C(C1, C2). The condition that f is a bijection means that both curves are decomposed into the same number of groups. Condition (2) means that at least one of the corresponding groups x ∈ G1 or f(x) ∈ G2 is a maximal (convex or concave) arc. The reason is that we want to allow one-to-many or many-to-one mappings between maximal arcs, but never many-to-many mappings. A similarity measure for curves C1, C2 is defined as
Sc(C1, C2) = min{ Σx∈G1 Sa(x, f(G1,G2)(x)) : (G1, G2) ∈ C(C1, C2) },   (1)
where f(G1,G2) is the correspondence between G1 and G2 and Sa is a similarity measure for arcs. The similarity measure for arcs Sa(A1, A2) is given by the L1 distance of the tangent space representations of A1 and A2. Due to the limited
number of pages, we cannot give the details of the tangent space representation here. They can be found in [9]. The similarity measure defined in (1) is computed using dynamic programming. Our approach is a modification of the standard techniques for string comparison (see e.g., [8]). Numerous experimental results show that it leads to intuitive arc correspondences, e.g., see Figure 3. We have applied the shape similarity measure in (1) to automatic object indexing and searching in image databases. The experimental results are described in Section 3.
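The dynamic program behind (1) can be sketched as follows. Arcs are assumed to be given as point lists, the arc similarity Sa is passed in as a callback, the rotation over possible start arcs and the shared-segment bookkeeping are omitted, and the group-size bound and all names are illustrative assumptions rather than the authors' implementation.

```python
from functools import lru_cache

def curve_similarity(arcs1, arcs2, arc_cost, max_group=3):
    """Sketch of the grouping correspondence minimized in Eq. (1).

    arcs1, arcs2: sequences of maximal convex/concave arcs, each arc a list
    of contour points. arc_cost plays the role of Sa and accepts an arc or a
    merged group of arcs on either side.
    """
    n, m = len(arcs1), len(arcs2)

    def merge(arcs):                       # a group is just the concatenated subarc
        return [p for a in arcs for p in a]

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == n and j == m:
            return 0.0
        if i == n or j == m:
            return float("inf")
        candidates = []
        for k in range(1, max_group + 1):
            # one maximal arc of C1 corresponds to a group of k arcs of C2 ...
            if j + k <= m:
                candidates.append(arc_cost(merge(arcs1[i:i + 1]),
                                           merge(arcs2[j:j + k])) + best(i + 1, j + k))
            # ... or a group of k arcs of C1 corresponds to one maximal arc of C2
            if i + k <= n:
                candidates.append(arc_cost(merge(arcs1[i:i + k]),
                                           merge(arcs2[j:j + 1])) + best(i + k, j + 1))
        return min(candidates)

    return best(0, 0)
```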
Fig. 3. The corresponding arcs are labeled by the same numbers.
3
Comparison to Known Similarity Measures
We concentrate on comparison to universal similarity measures that are translation, rotation, reflection, and scaling invariant. This excludes, for example, the Hausdorff distance (Huttenlocher et al. [6]), which is universal but not rotation, reflection, and scaling invariant. We compared the results of our approach with the approach presented in Siddiqi et al. [14], which is based on a hierarchical structure of shocks in skeletons of 2D objects. In this approach, object shape is represented as a graph of shocks, and the similarity of objects is determined by a similarity measure on the corresponding shock graphs proposed in [14]. Although the shape representation in [14] is not based on boundary curves, the results of our similarity measure are very similar to the results in [14]. The comparison results can be found on our www site [11] and in [9]. Further, we compared our approach to the retrieval of similar objects with the similarity measure based on curvature scale space of Mokhtarian et al. [12]. The curvature scale space representation is obtained by curve evolution guided by a diffusion equation [13]. The similarity measure in [12] is applied to a database of marine animals in which every image contains one animal. We applied our similarity measure to the same database. The results are very similar but not
identical to the results in [12]. The details can be found on our www-side [11] and in [9]. In this section we present in more detail a comparison of our approach to the shape similarity measures given in Basri et al. [2,1]. The reason is that Basri et al. not only propose similarity measures but also simple test experiments to verify their properties. We demonstrate that our measure yields desirable results in accord with the experiment proposed in [2] using images scanned from [2]. Thus, our test images contain additionally digitization as well as segmentation noise. In [2,1], the desirable properties of similarity measures are illustrated and tested on three proposed similarity measures: spring model, linear model, and continuous deformation model. These models measure deformation energy needed to obtain one object from the other. The calculation of deformation energy is based on (best possible) correspondence of boundary points and local distortions of corresponding points as a function of local curvature differences. Thus, the calculation of the three measures requires local computation of curvature. In Table 1 in [2,1] an experiment with regular polygons is demonstrated. The intuitive idea is that the more vertices a regular polygon has, the more similar to a circle it is. The results of our similarity measure on the images from Table 1 are shown in Figure 4. It can be easily observed that our measure yields the desirable results.
Fig. 4. The results of our similarity measure on the test images in Table 1 in [1]. Basri et al. [1] further argue that similarity measures should be sensitive to the part structure of objects. To check this property, they suggest that bending an object at a part boundary should imply a smaller change than bending it in the middle of a part. This property of our measure is illustrated in Figure 5. The similarity measures in Basri et al. [2] are obtained as an integral of local distortions between corresponding contour points. The authors themselves point out a counter-intuitive performance of their measures when applied to the objects in the first row of Figure 6 (Figure 17 in [2]). The H-shaped contour (a) is compared to two different distortions of it. Although shape (b) appears more similar to (a) than shape (c), the amount of local distortion needed to obtain (b) and (c) from (a) is the same. Therefore, all three measures presented in Basri et al. [2] yield that shapes (b) and (c) are equally similar to (a). Basri et al.
Fig. 5. The results of our similarity measure on Table 2 in Basri et al. [2,1].
argue that this counter-intuitive performance is due to the fact that their measures are based on a contour representation of shapes, and suggest using an area-based representation to overcome this problem. We do not agree that the counter-intuitive performance of the measures in Basri et al. [2] is due to the contour representation. The performance of our measure clearly shows that this is not the case: our similarity measure is based on a contour representation and gives similarity values in accord with our visual perception. Our measure yields Sc((a), (b)) = 4 and Sc((a), (c)) = 8, i.e., (b) is more similar to (a) than (c). The main difference is that our measure is not based on local properties, i.e., it is not based on a correspondence of contour points and their local properties, but on a correspondence of contour parts. We suspect that the problem with the contour-based deformations in [2] is due to the local correspondence of contour points and to the local computation of deformations. Further, Basri et al. [2] point out a serious generic problem with area-based representations. In the second row of Figure 6 (Figure 18 in [2]), shape (a) appears to be more similar to shape (b) than to shape (c). Yet there is nearly no distortion of the regions enclosed by the contours in (a) and (c), while shape (b) has a highly distorted interior in comparison to (a). Thus, any approach based on area distortion would counter-intuitively predict that (a) is more similar to (c) than to (b). Again, our similarity measure yields results in accord with our visual perception: Sc((a), (b)) = 35 and Sc((a), (c)) = 47. Since we use a contour-based representation, we do not run into this generic problem of area-based representations. Additionally, local curvature computation on digital curves is very unreliable and can lead to qualitatively different results for functions based on it. For example, Table 1 in [2] shows similarity values of various regular polygons in comparison to a circle. For the similarity measure based on eq. 5 in [2] (the continuous deformation model), the values for real data differ significantly from the values for synthetic data. This implies qualitatively different results: the regular polygon most similar to the circle for real data is the 7-gon, while the most similar polygon for synthetic data is a triangle. This difference is due
to unreliable local curvature computation for real data. We recall that no local curvature computation is necessary for our measure.
Fig. 6. Our similarity measure yields results in accord with visual perception. First row: Sc((a), (b)) = 4, Sc((a), (c)) = 8. Second row: Sc((a), (b)) = 35, Sc((a), (c)) = 47.
References 1. R. Basri, L. Costa, D. Geiger, and D. Jacobs. Determining the similarity of deformable shapes. In Proc. IEEE Workshop on Physics-Based Modeling in Computer Vision, pages 135–143, 1995. 622, 623 2. R. Basri, L. Costa, D. Geiger, and D. Jacobs. Determining the similarity of deformable shapes. Vision Research, to appear; http://www.wisdom.weizmann.ac.il/ .ronen/. 619, 622, 623 3. D. Forsyth, J. Malik, and R. Wilensky. Searching for digital pictures. Scientific American, pages 88–93, June 1997. 617 4. D. D. Hoffman and W. A. Richards. Parts of recognition. Cognition, 18:65–96, 1984. 619 5. D. D. Hoffman and M. Singh. Salience of visual parts. Cognition, 63:29–78, 1997. 619 6. D. Huttenlocher, G. Klanderman, and W. Rucklidge. Comparing images using the Hausdorff distance. IEEE Trans. PAMI, 15:850–863, 1993. 621 7. J. J. Koenderink and A. J. Doorn. The shape of smooth objects and the way contours end. Perception, 11:129–137, 1981. 619 8. J. B. Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM Review, 25:201–237, 1983. 621 9. L. J. Latecki and R. Lak¨ amper. Shape similarity measure based on correspondence of visual parts. IEEE Trans. Pattern Analysis and Machine Intelligence, submitted. 621, 622 10. L. J. Latecki and R. Lak¨ amper. Convexity rule for shape decomposition based on discrete contour evolution. Computer Vision and Image Understanding, to appear. 618, 619 11. L. J. Latecki, R. Lak¨ amper, and U. Eckhardt. http://www.math.unihamburg.de/home/ lakaemper/shape. 618, 621, 622 12. F. Mokhtarian, S. Abbasi, and J. Kittler. Efficient and robust retrieval by shape content through curvature scale space. In A. W. M. Smeulders and R. Jain, editors, Image Databases and Multi-Media
Search, pages 51–58. World Scientific Publishing, Singapore, 1997; http://www.ee.surrey.ac.uk/Research/VSSP/imagedb/demo.html. 621, 622 13. F. Mokhtarian and A. K. Mackworth. A theory of multiscale, curvature-based shape representation for planar curves. IEEE Trans. PAMI, 14:789–805, 1992. 621 14. K. Siddiqi, A. Shokoufandeh, S. J. Dickinson, and S. W. Zucker. Shock graphs and shape matching. Int. J. of Computer Vision, to appear; http://www.cim.mcgill.ca/ siddiqi/journal.html. 621 15. K. Siddiqi, K. Tresness, and B. B. Kimia. Parts of visual form: Ecological and psychophysical aspects. In Proc. IAPR’s Int. W. on Visual Form, Capri, 1994. 619
Computing Dissimilarity between Hand-Drawn Sketches and Digitized Images Folco Banfi, Rolf Ingold Informatics Institute, University of Fribourg, Switzerland E-mail: {folco.banfi,rolf.ingold}@unifr.ch ABSTRACT This paper presents a new technique for computing dissimilarity between a sketch drawn by a user and a digitized image. Our method is based on local features extracted after performing a segmentation, whose results are stored in a multilayer structure. Dissimilarity is then computed between layers of the structure.
1 Introduction The appearance of large image databases, added to the high cost of human production of descriptive data for content-based retrieval [1] calls for efficient techniques for automatically extracting this data. This information can be used to improve the results of traditional search based for example on keyword indexing, whose limits are wellknown. In our work we suppose that the image database is queried using a sketch of the sought image drawn by the user with a simple paint tool (henceforth "hand-drawn sketch"). In this paper we present a new way of computing dissimilarity between digitized images and hand-drawn sketches; this dissimilarity measure is based on local features extracted after segmentation and stored in a hierarchical data structure. The context of our work is explained in Section 2, and the image structure is defined in Section 3. The feature extraction is presented in Section 4, while the comparison between structures is explained in Section 5. Section 6 introduces a way to rank the images according to their dissimilarity scores. Finally, in Section 7 we list some results we obtained in our work and in Section 8 we give some insights into what is to come in the future. 2 Context of our Work Over the last few years, progress in the automatic extraction of primitive (as opposed to logical or semantic [2]) features such as boundaries and shapes, has moved the focus of research towards similarity-based retrieval. Among the features differentiating similarity-based retrieval systems we have the kind of the query image, which can be a complete or partial digitized image [3][4][8][9], or a freely drawn sketch [4][5][6][7][10]. These systems require (dis)similarity measures between images, which should rely on color similarity, texture similarity, shape similarity, spatial similarity and object presence analysis [1]. We decided to focus on query-by-similarity with a sketch depicting the whole tarDionysius P. Huijsmans, Arnold W.M. Smeulders (Eds.): VISUAL’99, LNCS 1614, pp. 625-632, 1999. c Springer-Verlag Heidelberg Berlin 1999
get image as the query, and we developed a dissimilarity measure, based on local features, which could be used in conjunction with classical image retrieval methods to improve the search results. Sketches are drawn with a simple paint tool but an unrestricted choice of colors. 3 The Image Structure An image A is represented as a multi-layered structure (Plate 1), whose n-th layer represents the segmented image containing exactly n regions (henceforth "clusters"). When moving from layer A n to layer A n + 1 , n – 1 clusters remain unchanged, while one is split into two new clusters. It is wise to organize the structure so that regions corresponding to the most important areas of the sketch appear first. This way, a rough similarity measure between images can be obtained by comparing their first layers. In our case (see Section 5) this organization of the structure also allows us to lower the complexity of the computation in a significant way. 3.1 The Layer Let A be an image. Its n-th layer A n contains the list of the clusters obtained by seg1 2 n menting the image in exactly n clusters, labeled A n, A n, …, A n . 3.2 Cluster Features Cluster features must allow the computation of a meaningful dissimilarity between clusters. A cluster is represented by its average color, its geometric centroid, its size and its geometric eigensystem. The choice of average color, geometric centroid and cluster size as features were straightforward. The geometric eigensystem gives an approximation of the shape of the cluster, since we compare their shapes by looking at the eigenvectors and eigenvalues, representing clusters as ellipses (a similar technique is used in [11]). The more the clusters’ ellipses are similar, the more the clusters’ shapes are supposed to be similar. This method shows inaccuracies only when the clusters are extremely scattered (see Figure 1).
FIGURE 1. Despite being extremely different visually, the vectors describing the shape of the two clusters share the same orientation.
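For concreteness, the following Python sketch computes the four cluster features named above from a cluster's pixel coordinates and colors; the function name and the returned field names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def cluster_features(pixels, colors):
    """Features of one cluster: average color, centroid, size, eigensystem.

    pixels: (N, 2) array of the cluster's pixel coordinates; colors: (N, 3)
    array of their RGB values. The eigensystem of the coordinate covariance
    approximates the cluster by an ellipse, as described in the text.
    """
    pixels = np.asarray(pixels, dtype=float)
    colors = np.asarray(colors, dtype=float)
    cov = np.cov(pixels, rowvar=False)          # 2x2 geometric covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # ellipse axes and orientation
    return {
        "avg_color": colors.mean(axis=0),
        "centroid": pixels.mean(axis=0),
        "size": len(pixels),
        "eigenvalues": eigvals,
        "eigenvectors": eigvecs,
    }
```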
4 Feature Extraction Before we present the feature extraction, it’s necessary to define our quality criteria:
• the feature extraction must produce comparable structures when applied on sketches and digitized images;
• the structure shouldn’t change significantly if small transformations (such as luminosity changes, translations, rotations and scaling) are applied to the image;
• under the same initial conditions, the extracted structure must always be the same.
The image structure is built by iteratively merging the clusters obtained after segmenting the image. Clusters correspond to "areas of almost uniform color" in the original image. Actually, since objects tend to have a uniform hue regardless of surface reflections or shadows [13], clusters in our model are more uniform in hue than in luminosity and/or saturation. 4.1 Segmentation Our quality criteria led us to the algorithm that is described in this paragraph. The obtained results were compared to the results of the Batchelor-Wilkins and ISODATA clustering methods [17]. The algorithm starts by extracting the initial clusters, then iteratively selects two clusters to be merged until only one cluster is left. This agglomerative method was chosen because of its short computing time [16]. The result is a partition of the image into not necessarily connected regions. 4.1.1 Selection of the Initial Clusters Two techniques have been implemented:
• computing the image’s histogram after reducing the color space to 8³
= 512 colors, then taking the histogram’s peaks as the cluster centers; parsing the image and creating a new cluster center every time the color of the current point is sufficiently distant from the ones of the already created cluster centers.
Both techniques (described in [17]) produce an initial set of not necessarily connected clusters based uniquely on the color information, thus going against the usual definition of segmentation [14]. Non-connected clusters are allowed because they can represent a single entity (like flowers in a field). The disadvantage is that objects of the same color, no matter how distant they are, belong to the same cluster. Separating such objects requires solving the cluster validation problem [15]. 4.1.2 The Merging Phase Our algorithm iteratively merges the two clusters whose average colors are the closest. While this method might appear very primitive, it often gives similar segmentation results for the sketch and the target image because a good sketch contains the same colors as the target image. Merging two clusters X 1, X 2 generates a cluster X that contains all their points. The other clusters are not affected.
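As a rough illustration of the merging phase just described, the following Python sketch repeatedly merges the two clusters whose average colors are closest and records one layer per step. The data layout (a label image plus a dictionary of mean colors) and all names are assumptions made for the example, not the system's implementation.

```python
import numpy as np

def build_structure(labels, colors):
    """Agglomerative merging of the initial clusters into a layered structure.

    labels: per-pixel cluster ids of the initial segmentation.
    colors: dict id -> mean RGB of that cluster.
    Returns the list of layers; layer n contains exactly n clusters.
    """
    labels = np.asarray(labels).copy()
    colors = {k: np.asarray(v, dtype=float) for k, v in colors.items()}
    layers = [dict(colors)]
    while len(colors) > 1:
        ids = list(colors)
        # pick the pair of clusters with minimal distance between mean colors
        i, j = min(((a, b) for a in ids for b in ids if a < b),
                   key=lambda p: np.linalg.norm(colors[p[0]] - colors[p[1]]))
        ni = np.count_nonzero(labels == i)
        nj = np.count_nonzero(labels == j)
        colors[i] = (colors[i] * ni + colors[j] * nj) / (ni + nj)   # merged mean color
        labels[labels == j] = i
        del colors[j]
        layers.append(dict(colors))
    return list(reversed(layers))
```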
5 The Matching Problem After the feature extraction we have a sketch S whose image structure contains mS layers ( S = { S1, …, SmS } ) and a candidate image Ci whose image structure contains mi layers ( Ci = { Ci1, …, Cimi } ). The algorithm computing the dissimilarity must take into account the following factors: • the complexity of the algorithm must be kept as low as possible; this is achieved by "carrying over" most of the result from a layer to the following one; • the structures extracted do not in general have the same number of layers; the matching starts by comparing S1 and Ci1, then moves to the following layers until the last layer of both images is reached (example: if S has 2 layers and Ci has 4, the comparisons are ( S1, Ci1 ), ( S2, Ci2 ), ( S2, Ci3 ), ( S2, Ci4 ) ). We will now introduce the basic concepts behind our matching algorithm. A description of the algorithm itself will then follow. Finally, we will discuss the outcome of the algorithm and its complexity. 5.1 Concepts Behind the Matching Algorithm In order to explain our matching algorithm, we need to introduce two important concepts: the cluster dissimilarity measure and the Free Clusters Pool. 5.1.1 Cluster Dissimilarity Cluster dissimilarity is computed as a weighted sum:
d clust(X1, X2) = ( d s(X1, X2)·α1 + d c(X1, X2)·α2 + d g(X1, X2)·α3 + d h(X1, X2)·α4 ) / (α1 + α2 + α3 + α4)
d c is the distance between the clusters’ average colors; d g is the distance between the clusters’ centroids; d s(X1, X2) = |s1 − s2| / (δ + (1 − δ)·(s1 + s2)), where s1, s2 are the sizes of clusters X1 and X2 and δ is empirically set to 1/3, is the difference between the clusters’ sizes; d h is the difference between the shapes, computed using the clusters’ eigenvalues (a1, b1) and (a2, b2) and the angle α between the eigenvectors:
d h(X1, X2) = min( min(a1, a2·cos²α + b2·sin²α) · min(b1, b2·cos²α + a2·sin²α), min(a2, a1·cos²α + b1·sin²α) · min(b2, b1·cos²α + a1·sin²α) ) / max(a1·b1, a2·b2)
Computing Dissimilarity between Hand-drawn Sketches and Digitized Images
629
d s, d c, d g, d h, d clust are all between 0 and 1.
5.1.2 The Free Clusters Pool 1
m
1
m
Let S k = { S k , …, S k } and C k = { C ik, …, C ik } be the compared layers, and let t(1)
u(1)
t (r )
u(r)
t(r + 1)
M k = { ( S k , C ik ), …, ( S k , C ik ), ( S k ( ∅,
u(r + 1) C ik ),
…, ( ∅,
t (m)
, ∅ ), …, ( S k
, ∅ ),
u(m) C ik ) }
be a configuration. Clusters for which a similar enough cluster in the other image was t(1) t (r ) u(1) u(r) found are "linked" (it is the case for S k , …, S k , C ik , …, C ik ), while the others are "free" and are contained in a set called the Free Clusters Pool (FCP). Both S and C i have a FCP. Since the clusters in the FCPs are the only ones for which a matching cluster must be found, the number of possible configurations depends on the number of clusters in the FCPs. This way, the complexity of the algorithm can be reduced. Every time the algorithm moves to the following layer, the new clusters are added to their image’s FCP. On the other hand, once two similar enough clusters are found, they are removed from the respective FCPs. 5.2 Description of The Matching Algorithm The algorithm can be split into two main steps:
• search of the best solution in the current layer; • transition from the current layer to the following one. 5.2.1 Search of the Best Solution in the Current Layer This step begins after the transition from the previous layer to the current one. At this point some clusters are linked, while the others are in the FCPs. In this phase we must: • compute the dissimilarity between the clusters that have just been added to the FCP and all the clusters of the other image’s FCP (the rest of the dissimilarities are carried over from the previous layer); • compute a penalty for every cluster contained in one of the FCPs, which will be used in case no similar enough cluster in the other image is found; • select the cluster pairs whose dissimilarity is below a given threshold τ ; • find the configuration that minimizes the dissimilarity d config : r
d config(M k) =
∑ ( d clust(Sk
t( j)
m
u( j)
t( j)
u( j)
, C ik ) ⋅ max(s(S k ), s(C ik )) ) +
j=1
∑
t( j)
u( j)
( p(S k ) + p(C ik ) )
j = r+1
where p(X ) is the penalty attached to cluster X and s(X ) is the size of cluster X . 5.2.2 Moving from one layer to another This stage of the algorithm handles the changes occurring when passing from one layer to its successor. In each image one cluster splits, and the new clusters must be added to the FCPs. The splitting clusters are either free or linked. In the first case, no special
630
Folco Banfi and Rolf Ingold
care is necessary, and the new clusters go into the image’s FCP. In the second one, the cluster of the other image with which the splitting cluster is linked must go into its image’s FCP. The algorithm can now perform a new search of the best solution. 5.3 The Result The matching algorithm applied on the sketch S and a candidate C i produces a list of chosen configurations M i = { M i1, …, M ili } , where l i = max(m S, m i) , along with a list of dissimilarity scores D i = { D i1, …, D ili } . 5.4 The Complexity Issue As we said before, complexity is an important issue, as the number of possible configurations grows exponentially. This quantity depends on the number of plausible cluster pairs, which is kept small by using the threshold τ . If τ is well-chosen, the number of configurations can be considered polynomial. 6 Ranking the Images Computing the dissimilarity between a sketch S and a collection of images C = { C 1, …, C n } produces n sets of dissimilarity scores D = { D 1, …, D n } , with D i = { D i1, …, D il } and l i = max(m S, m i) . From these results, we can extract max(l i) i local rankings R q = { ( D v ( 1 )q, C v(1)q ), …, ( D v(γ )q, C v(γ )q ) } , where γ is the number of candidates that have a dissimilarity value for the q-th layer. Since not all images C i appear in every local ranking, a global ranking is needed. To obtain such a ranking, every image C i is given a score depending on the dissimilarities obtained, as well as the position of the image in the local rankings. Images are then sorted according to this score. More details can be found in [17]. 7 Experiments Experiments were performed on a collection containing 3238 24-bit color images taken from an image database published by ClickArt, all reduced to 150x120 pixels and sharing the same orientation. Using only the "brush", "fill" and "pen" tools in Adobe Photoshop, three users produced 5, 6 and 6 sketches respectively. These were compared to all the images in the database. Results of our approach were compared to those obtained by sorting the database images according to the dissimilarity with the sketch computed by an inhouse dissimilarity measure based on global features, as shown in Table 1. More results can be found in [17].
Computing Dissimilarity between Hand-drawn Sketches and Digitized Images
631
Target position 1st 2nd 3rd-10th 11th-20th 21st-50th Beyond User 1 5 (6) 0 (0) 1 (1) 0 (0) 0 (0) 1 (0) User 2 5 (3) 0 (0) 0 (1) 0 (0) 0 (0) 0 (1) User 3 1 (2) 0 (0) 1 (4) 0 (0) 1 (0) 3 (0) TABLE 1. Comparison between results obtained with our local features dissimilarity measure and the in-house dissimilarity measure based on global features (in parentheses).
Results for the two approaches are extremely similar, with a slight edge for the one based on global features. Bad results can be provoked by: • not sufficiently detailed sketches; • poor performance of the segmentation algorithm, as explained in Paragraph 4.1.1. 8 Conclusions and Future Work Similarity-based retrieval systems require effective (dis)similarity measures in order to yield good results. In this paper we presented a new way of computing dissimilarity between images, based on local features automatically extracted from the images. Results are extremely similar to those obtained with a similar measure based on global features. Current works [5][6][7] go towards comparison of digitized images with partial sketches, something that can’t be done with methods based on global features, while our technique needs only minor changes in order to be able to perform such comparisons. We are already working in this direction. Bibliography 1
2 3 4 5
6
7
Philippe Aigrain, Hongjiang Zhang, Dragutin Petkovic, Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review, Multimedia Tools and Applications, No. 3, pp. 179-202, 1996. Venkat N. Guivada, Vijay V. Raghavan, Content-Based Image Retrieval Systems, IEEE Computer, Vol. 28, No. 9, pp. 18-22, 1995. Anil K. Jain, Aditya Vailaya, Image Retrieval Using Color and Shape, Pattern Recognition, Vol. 29, No. 8, pp. 1233-1244, 1996. Charles E. Jacobs, Adam Finkelstein, David H.Salesin, Fast Multiresolution Image Querying, SIGGRAPH 95 Proceedings, pp. 277-286, 1995. A. Del Bimbo, M. Mugnaini, P. Pala, F. Turco, L. Verzucoli, Image Retrieval by Color Regions, Proceedings of International Conference on Image Analysis and Processing,, pp. 181-187, 1997. Y. Alp Aslandogan, Chuck Thier, Clement T. Yu, Jon Zou and Naphtali Rishe, Using Semantic Content and WordNet in Image Retrieval, proc. SIGIR 97, Philadelphia, USA, pp. 286295. Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele and Peter
632
8 9 10
11
12 13 14 15 16 17
Folco Banfi and Rolf Ingold Janker, Query by Image and Video Content, the QBIC System, IEEE Computer, Vol. 28, No. 9, pp. 23-32, 1995. S.Ravela, R. Manmatha, Image Retrieval by Appearance, Proceedings of SIGIR97, Philadelphia, USA, pp. 278-285, 1997. A. Pentland, R.W. Picard and S. Sclaroff, Photobook: Content-Based Manipulation of Image Databases, International Journal of Computer Vision, Vol. 18, No. 3, pp. 233-254, 1996. Alberto Del Bimbo, Pietro Pala, Visual Image Retrieval by Elastic Matching of User Sketches, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, pp. 121-132, 1997. Chad Carson, Serge Belongie, Hayit Greenspan and Jitendra Malik, Region-Based Image Querying, Workshop on Content-Based Access of Image and Video Libraries, associated with the Conference on Computer Vision and Pattern Recognition, June 20, 1997, San Juan, PUR. Marco La Cascia, Edoardo Ardizzone, JACOB: Just a Content-Based Query System for Video Databases, in Proc. ICASSP-96, 1996. A.Moghaddamzadeh and N. Bourbakis, A Fuzzy Region Growing Approach for Segmentation of Color Images, Pattern Recognition, Vol. 30, No. 6, pp.861-881, 1997. Nikhil R. Pal, Sankar K. Pal, A Review on Image Segmentation Techniques, Pattern Recognition, Vol. 26, No. 9, pp.1277-1294, 1993. Richard O. Duda, Peter E. Hart, Classifications and Scene Analysis, 1973, Wiley, New York, USA. Helmuth Spaeth, Cluster Analysis Algorithms, 1980, Ellis Horwood, Chichester, UK. Folco Banfi, Rolf Ingold, Computing Dissimilarity Between Hand-Drawn Sketches and Digitized Images, Technical Report 98-21, University of Fribourg, 1998; available at http:// www-iiuf.unifr.ch/~banfi/banfi/caine.ps.
Document Generation and Picture Retrieval Kees van Deemter Information Technology Research Institute (ITRI) University of Brighton, Lewes Road, Watts Bldng. Brighton, BN2 4GJ, United Kingdom [email protected]
Abstract. Picture retrieval can be used by a document generation system for the inclusion of ‘photographic’ pictures in the documents generated by the system. An algorithm is described that makes use of a library of pictures, each of which is associated with a set of logical representations to facilitate retrieval. The paper focuses on the problems stemming from the twin notions of pictorial underspecificity and overspecificity. Keywords: Document Generation, Picture Retrieval, Formal Semantics
1
Pictures in Generated Documents
Simple textual documents can be automatically generated from the representations in a general-purpose database using natural language generation (nlg) techniques (e.g. [9]). The present paper discusses how generation procedures can be extended in such a way that the documents generated contain pictures as well as text. Our presentation focuses on pharmaceutical Patient Information Leaflets (pils), as exemplified by the leaflets in [1]. Around 60% of these leaflets contain pictures. Most of them are complex sketches depicting actions. Such pictures, loosely called ‘photographic’ in style, are difficult to break down, compositionally, into syntactic parts that have a well-defined semantic interpretation. Consequently, it is difficult to see how they could be generated from smaller parts (cf. [10]). Luckily there is also little need for genuine generation of pictures in this area: The number of pictures used by any one pharmaceutical company is limited. Moreover, the same picture tends to be used in several leaflets. This suggests that a document generation program could select pictures from a library, in which each picture is coupled with a representation of its meaning. The program would allow an author to specify the content of each leaflet and, for a given ‘chunk’ of information, determine whether it is in need of pictorial illustration.
2
WYSIWYM for Text Generation
Elsewhere (e.g. [6]) a new knowledge-editing method called ‘wysiwym editing’ has been introduced and motivated. Wysiwym editing allows a domain expert Dionysius P. Huijsmans, Arnold W.M. Smeulders (Eds.): VISUAL’99, LNCS 1614, pp. 632–640, 1999. c Springer-Verlag Berlin Heidelberg 1999
to edit a knowledge base (kb) reliably by interacting with a feedback text, generated by the system, which presents both the knowledge already defined and the options for extending and modifying it. Knowledge is added or modified by menu-based choices which directly affect the knowledge base; the result is displayed to the author by means of an automatically generated feedback text: thus ‘What You See Is What You Meant’. Wysiwym instantiates a general recent trend in dialogue systems towards moving some of the initiative from the user to the system, allowing such systems to avoid ‘open’ (i.e., unconstrained) input. The present paper is concerned with applications of wysiwym to document generation: The kb created with the help of wysiwym is used as input to an natural language generation (nlg) program, producing as output a document of some sort (e.g., a user manual), for the benefit of an end user. Applications of wysiwym to document generation make use of two different nlg systems. One produces feedback texts (for the author) and the other output texts (for an end user). Let us take the drafter system [6] as an illustrative example. By interacting with the feedback texts generated by drafter, the author defines a procedure for performing a task, e.g. saving a document in a word processor. When a new knowledge base is created, a procedure instance is created, e.g. proc1. The permanent part of the kb (i.e., the T box) specifies that every procedure has two attributes: a goal, and a method. This information is conveyed to the author through a feedback text: ‘Achieve this goal by applying this method’. Not yet defined attributes are shown through anchors. A boldface anchor indicates that the attribute must be specified. An italicized anchor indicates that the attribute is optional. All anchors are mouse-sensitive. By clicking on an anchor, the author obtains a pop-up menu listing the permissible values of the attribute; by selecting one of these options, the author updates the knowledge base. Clicking on this goal yields a pop-up menu that lists all the types of actions that the system knows about: e.g., choose, click, close, create, save, schedule, start, etc., from which the author selects ‘save’. The program responds by creating a new instance, of type save, and adding it to the knowledge base as the value of the goal attribute on proc1: procedure(proc1). goal(proc1, save1). save(save1). From the updated knowledge base, the generator produces a new feedback text Save this data by applying this method. including an anchor representing the undefined actee attribute on the save1 instance. By continuing to make choices at anchors, the author might expand the knowledge base in the following sequence: – Save the document by applying this method. – Save the document by performing this action (further actions).
– Save the document by clicking on this object (further actions). – Save the document by clicking on the button with this label (further actions). – Save the document by clicking on the Save button (further actions). At this point the knowledge base is potentially complete (no boldface anchors remain), so an output text can be generated and incorporated into the software manual. For example, ‘To save the document, you may click on the Save button’. One wysiwym application that is currently under development has the creation of Patient Information Leaflets (pils) as its domain. The present, limited version of this system allows authors to enter information about possible side-effects of taking a medicine. The dialogue starts by a very general feedback text, e.g. ‘There is a situation’, whereupon the author can choose to expand the anchor ‘a situation’ as either an atomic or a logically complex statement. Feedback texts generated include sentences such as ‘If you are either pregnant or allergic to penicillin, then tell your doctor’. It is this ‘pils’ version of wysiwym that we will have in mind in the remainder of this paper.
3
An Extension: WYSIWYM for Text Plus Pictures
We will be concerned with an extension of wysiwym which is used to create output documents that contain pictures as well as words. In accordance with what we find in the majority of pictures in the pils corpus, we will simplify and assume that pictures never add information that is not expressed in the text. Here is an example from the corpus: 1. Unscrew the cap and squeeze a small amount of ointment, about the size of a match-head, on to your little finger. 2. Apply ointment to the inside of one nostril. 3. Repeat for the other nostril.
Clause 1 is illustrated by an image depicting the squeezing of ointment; clause 2 is illustrated by a picture showing a finger entering the left nostril (Figure 1); clause 3 is illustrated by an analogous picture involving the right nostril (Figure 2). How can these pictures be included in generated texts? We will allow the author to indicate, for a given mouse-sensitive stretch s of text, whether s would benefit from pictorial illustration. If so, the system will search its library to find a picture that matches the meaning of s. This might be done using an existing classification scheme (e.g. [8]). The present paper, however, proposes to use a more precise, logically oriented classification scheme. If a nonempty set of matching pictures is found, one picture is chosen from this set. If the set is empty, the system responds that no illustration is possible yet and stores the request. In the future, more sophisticated extensions of wysiwym may be built in which the system itself can decide when to use pictures. In the system now under development, the author always activates a stretch of text corresponding to an instance in the knowledge base.
Fig. 1. Illustration for clause 2
Fig. 2. Illustration for clause 3
Representations express what information each picture intends to convey. Embellishments are omitted. Things that are only depicted because it would be difficult not to depict them (such as the exact size or position of an object) are also omitted. In the cases that we will be concerned with, this means that the representations will focus on how an action of a given type are carried out. It has been observed that photographic pictures express ‘vivid’ information and that this information can be expressed by a conjunction of positive literals [4]. In what follows, we will use a notation that matches this observation, using a fragment of predicate logic that is easily translated into the semantic networks used in existing wysiwym systems. Thus, we will write ϕ(x, y) in the representation associated with a picture to assert that there are x and y such that ϕ(x, y) is true. Thus, all variables in these representation are interpreted as if they were governed by an existential quantifier taking scope over (at least) the entire representation. Assume is the part of the database for which a pictorial illustration is requested. For simplicity, we assume here that denotes an action. The representations in the kb of a wysiwym can be rendered in logical notation as follows: type0 (e) & role1 (e) = x1 & ... & rolen (e) = xn & type1 (x1 ) & ... & typen (xn ),
where each of e, x1 , .., xn is either a variable or a constant. This notation reflects the reliance, in the semantic nets used in drafter-ii, on instances, types, and attribute/value pairs. Each instance has one (1-ary) property, called its type, and can be the argument of any number of attributes, whose values are instances again. Instances are rendered as variables or constants, while types are denoted by the predicates typei ; attributes are denoted by the functions rolei (e.g. Actor, Actee, Target, as in functional approaches to grammar). In the case of an action e, its type, type0 , corresponds roughly with the meaning of a verb, saying what kind of action e is; The values of these attributes (xi in the formula above), each of which can be either a variable or a constant, can be of any type typei (e.g., an action) and each of them can have other attributes, etc. For example, suppose equals the following representation in the kb: Squeeze(e) & Actor(e) = Reader & Actee(e) = z & Ointment(z) & Quantity(z) = Small & Source(e) = t & T ube(t) & T arget(e) = u & LittleF inger(u) & Owner(u) = Reader corresponding with clause 1; e, z, t, and u are variables; Reader and Small are constants. How can be used to select an appropriate picture? Two rules suggest themselves: one approaching from ‘above’ and one from ‘below’: Rule A: Use the logically weakest picture whose representation logically implies . Rule B: Use the logically strongest picture whose representation is logically implied by . Logical strength is determined on the basis of the representations alone. Determining whether ϕ logically implies ψ, where each of the two formulas is either an instance in the kb or a semantic representation of a picture, is not difficult, given the fact that both are conjunctions of positive literals.
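Because both the kb fragment and the picture representations are conjunctions of positive literals, the implication test reduces to finding a substitution for the variables of the implied formula that maps each of its literals onto a literal of the other formula. The following Python sketch makes this concrete; the tuple encoding of literals and the lowercase-variable convention are assumptions made for the example, not the wysiwym kb format.

```python
from itertools import product

def implies(phi, psi):
    """Does the conjunction of positive literals phi logically imply psi?

    Literals are tuples like ("Squeeze", "e") or ("Actor", "e", "Reader");
    lowercase arguments are existential variables, capitalized ones constants.
    phi implies psi iff some substitution of psi's variables maps every
    literal of psi onto a literal of phi (brute-force check).
    """
    def is_var(t):
        return t[0].islower()

    psi_vars = sorted({t for lit in psi for t in lit[1:] if is_var(t)})
    phi_terms = sorted({t for lit in phi for t in lit[1:]})

    for values in product(phi_terms, repeat=len(psi_vars)):
        subst = dict(zip(psi_vars, values))
        mapped = {(lit[0],) + tuple(subst.get(t, t) for t in lit[1:]) for lit in psi}
        if mapped <= set(phi):
            return True
    return False

# Example: the kb fragment implies the weaker picture representation.
kb = {("Squeeze", "e"), ("Actor", "e", "Reader"), ("Actee", "e", "z"), ("Ointment", "z")}
pic = {("Squeeze", "x"), ("Actee", "x", "y")}
print(implies(kb, pic))   # True
```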
4
Pictorial Under- and Overspecificity
We will use the term ‘underspecificity’ to highlight the fact that a picture can sometimes contain less information than might be expected on the basis of its function. Conversely, we will speak of overspecificity when a picture contains information in addition to the information that would be expected. The two phenomena will be discussed in turn. For more details and discussion, see [2]. 4.1
The Problem of Pictorial Underspecificity
Suppose Rule A is used to choose a picture that matches the part of the kb: Rule A: Use the logically weakest picture whose representation logically implies .
Let us apply this Rule to the first part of the example. Observe that the rule requires all the relevant information to be expressed in the picture, allowing some additional information. It requires, therefore, that the representation of Figure 3 contains all the information in including 1. 2. 3. 4.
LittleFinger(u) Owner(u)=Reader Actor(e)=Reader Ointment(z)
Fig. 3. Illustration for clause 1
But it seems clear that none of these facts are expressed by the picture: (1) The depicted finger could be an index finger; (2) it could belong to anyone; (3) anyone could do the squeezing; and (4) what is being squeezed could be a cream instead of an ointment, for example. Clearly, a picture can illustrate an item of information without expressing all the information in it. Thus, Rule A fails to implement the notion of a pictorial illustration. Rule B, by contrast, does not require that all the relevant information is expressed by the picture, as long as no additional information (i.e., information not in ) is expressed. As a result, pictorial underspecificity is no problem for Rule B, which will choose the most informative picture that contains nothing that is not in . 4.2
The Problem of Pictorial Overspecificity
Using Rule B, let us assume that these are appropriate feedback texts for (2) and (3), followed by the statements in the kb that correspond with (2) and (3b): 2. Apply the ointment to the inside of one nostril 3a. Squeeze another small quantity of ointment from the tube on to one of your little fingers 3b. Apply the ointment to the inside of the other nostril.
1 : Apply(e2 ) & Actor(e2 ) = Reader & T arget(e2 ) = m & N ostril(m) 2 : Apply(e3 ) & Actor(e2 ) = Reader & T arget(e3 ) = n & N ostril(n) & N ostril(m) & m = n
Suppose Fig.1 and Fig.2 are associated with the following representations: Picture 1: Apply(e ) & T arget(e ) = a & Lef tN ostril(a) Picture 2: Apply(e ) & T arget(e ) = b & RightN ostril(b). Then Rule B would fail to select any of these two pictures as applicable to 1 , and the same is true for 2 . The reason is that Rule B, while allowing a picture to omit any information from the relevant part of the kb, does not allow a picture to express anything in addition to what the the relevant part of the kb requires. So neither 1 nor 2 can be illustrated by means of a picture whose representation gives away which nostril is depicted. The source of the problem is the difficulty of depicting an arbitrary nostril; and if a picture could do this, it would be even more difficult for the next picture to depict ‘the other’ nostril. 4.3
Ways of Overcoming both Problems
In this section, a way of dealing with the problem of pictorial overspecificity is described that does makes use of Rule B without requiring us to alter the representations in the knowledge base. The first step is to keep different representations for the two pictures, but to leave out the bit saying which is which: Pic(1): Apply(e ) & T arget(e ) = a & N ostril(a) Pic(2): Apply(e ) & T arget(e ) = b & N ostril(b) As a result, either picture can be used to illustrate 1 , so an arbitrary one of them will be selected. To account for ‘specific’ cases, where the system needs to know which is which (e.g. ‘If you feel a pain in your left arm, your heart may be the cause of the problem’), as well as for the nonspecific cases discussed above, the library needs to contain two semantic representations for each picture: one with and one without left/right information: Pic(1, nonspec): Apply(e ) & T arget(e ) = a & N ostril(a) Pic(1, spec): Apply(e ) & T arget(e ) = a & Lef tN ostril(a) Pic(2, nonspec): Apply(e ) & T arget(e ) = b & N ostril(b) Pic(2, spec): Apply(e ) & T arget(e ) = b & RightN ostril(b)
Another problem with the approach outlined above can be illustrated using clause (3b) of the original example, which referred to the inside of 'the other nostril'. Suppose the system has chosen the left nostril to illustrate clause (2) (i.e., the information kb1), using Fig. 1 and its nonspecific representation Pic(1, nonspec). Then, clearly, Rule B predicts that kb2 can be illustrated by Fig. 1 as well, so both nostrils threaten to be illustrated by the same picture. This problem can be solved by letting illustration affect representation. When a picture comes to illustrate something, this allows us to identify parts of the picture as depicting specific things. This information causes the picture to be more expressive. It may be captured by adding equations to the representation saying which variables in it must be unified with which variables in the kb to establish that the latter implies the former. For example, to establish that F(x) & G(x) & H(z) logically implies G(y) & F(y), x and y are unified, leading to the equation x = y. Let us return to our example to show the implications of this idea.
Claim: Pic(2, nonspec) cannot illustrate kb1 and kb2 in the same document.
Suppose, firstly, Pic(2, nonspec) is used to illustrate kb1. Then two equations are added to Pic(2, nonspec), resulting in
Pic(2, nonspec)′ = Pic(2, nonspec) & e′ = e2 & a = m.
Suppose, after that, Pic(2, nonspec)′ is used to illustrate kb2. Then two more equations are added to Pic(2, nonspec)′, resulting in
Pic(2, nonspec)′′ = Pic(2, nonspec) & e′ = e2 & e′ = e3 & a = m & a = n.
Given the statement m ≠ n in kb2, the combination a = m & a = n is inconsistent. The same result would be obtained if kb1 and kb2 were illustrated in reverse order. qed
This solves the problems in the area of under- and overspecificity that are associated with Rule B.
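The bookkeeping behind this proof can also be illustrated with a small sketch. It is not the implementation of the prototype mentioned below; the distinctness constraint m ≠ n and the variable names are taken from the example above, and everything else (class and method names) is hypothetical.

import java.util.*;

/** Toy illustration of how accumulated unification equations can make a
 *  picture representation inconsistent. */
public class IllustrationBindings {
    private final Map<String, String> bindings = new HashMap<>();   // picture variable -> kb variable
    private final Set<String> distinctPairs;                        // pairs stated to be distinct, e.g. "m|n"

    IllustrationBindings(Set<String> distinctPairs) { this.distinctPairs = distinctPairs; }

    /** Adds the equation var = kbVar; returns false if this makes the representation inconsistent. */
    boolean addEquation(String var, String kbVar) {
        String earlier = bindings.putIfAbsent(var, kbVar);
        if (earlier == null || earlier.equals(kbVar)) return true;
        return !(distinctPairs.contains(earlier + "|" + kbVar) || distinctPairs.contains(kbVar + "|" + earlier));
    }

    public static void main(String[] args) {
        IllustrationBindings pic2 = new IllustrationBindings(Set.of("m|n")); // kb2 states m != n
        System.out.println(pic2.addEquation("a", "m"));  // true: illustrating kb1 is fine
        System.out.println(pic2.addEquation("a", "n"));  // false: a = m and a = n clash with m != n
    }
}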
5 Information Retrieval
This paper has outlined an approach to document generation that makes crucial use of picture retrieval. The resulting retrieval task, however, differs considerably from the task facing classical document retrieval systems:
– The number of objects in our search space is tiny (around 10² rather than 10⁶ as in classical retrieval), making the task of creating logical representations less burdensome [3]. This is even more so because one (wysiwym) interface is employed to create meaning representations for both pictures and text.
– Wysiwym guarantees relations of perfect logical implication, making the use of probabilistic modeling [7] unnecessary.
– As we have seen, the fundamental logical relation underlying retrieval is now not D → Q, as in classical IR (where Q is the query and D is the document answering it), but Q → D (where Q is the information in the kb and D is the representation of the picture). 'Specificity' – in our case, this is D → Q – does not play a role: we don't care how much other information the kb contains, as long as it contains all the information in the picture [5].
A first version of a prototype applying the main ideas in this paper to the pills domain has been implemented by Richard Power in February 1999. Various related issues, such as proper placement on the page and the problems associated with sequences of pictures, have not been taken into account so far.
References
1. ABPI: The Association of the British Pharmaceutical Industry, 1996-1997 ABPI Compendium of Patient Information Leaflets (1997)
2. K. van Deemter: Retrieving Pictures for Document Generation. In: Proc. of 14th Twente Workshop on Language Technology, on Multimedia Information Retrieval (TWLT14), University of Twente, The Netherlands (1998)
3. P.G.B. Enser: Progress in Documentation; Pictorial Information Retrieval. Journal of Documentation, Vol. 51(2) (1995) pp. 126-170
4. H.J. Levesque: Making Believers out of Computers. Artificial Intelligence 30 (1986) pp. 81-108
5. J. Nie: An IR model based on modal logic. Information Processing and Management, 25(5):477-491 (1989)
6. R. Power, D. Scott: Multilingual Authoring using Feedback Texts. In: Proc. of COLING/ACL conference, Montreal (1998)
7. C.J. van Rijsbergen: Towards an information logic. In: Proc. ACM SIGIR (1989)
8. H. van de Waal (completed and edited by L.D. Couprie, R.H. Fuchs, E. Tholen, G. Vellekoop, a.o.): Iconclass; An iconographic classification system. Amsterdam (17 vols), 1973-1985. ISBN 0-7204-8264-X
9. H. Uszkoreit: Language Generation. In: Ronald A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue (eds.) Survey of the State of the Art in Human Language Technology (1996)
10. W. Wahlster, E. André, W. Finkler, H.-J. Profitlich, and Th. Rist: Plan-based Integration of Natural Language and Graphics Generation. Artificial Intelligence 63 (1993) pp. 387-427
FLORES: A JAVA Based Image Database for Ornamentals
Gerie van der Heijden, Gerrit Polder, and Jan Wouter van Eck
CPRO-DLO, P.O. Box 16, NL-6700 AA Wageningen, the Netherlands
{g.w.a.m.vanderheijden,g.polder,j.w.vaneck}@cpro.dlo.nl
Abstract. Flores is a database system for finding and retrieving plant varieties with similar appearance. The system allows color images to be segmented interactively using JAVA Applets. The alpha channel of the image is thereby used for the segmentation result, i.e. it contains the object. The image with its object mask is transferred to a server system, which extracts information from the object in the image. The information can be color histograms or a set of size, shape, texture and color features. The set of features depends on the type of object, as indicated by the user. The histogram is binarized and, using fast algorithms, matched with the binarized database histograms. For the feature set, nearest neighbors are searched in a multi-dimensional feature space using an SR-tree algorithm. The server returns information on the most similar objects, such as the similarity, color image and object mask, to the Applet.
1 Introduction
Nowadays, databases containing images are rather standard. However, most of these systems do not make use of the pictorial information: images are stored as binary large objects and are non-searchable items in the database. Many attempts are being made to retrieve information from the database using the pictorial information itself. Systems like Virage [3] and Chabot [5] use the information of whole images with fast techniques to compare the images. A nice example is the Virage implementation in the Alta-Vista search engine. However, in some domains more specific information about objects is necessary to obtain reliable matches. An example is the trademark database, where the interest concerns the trademark only and not the whole image. Systems which allow the use of object-specific information are QBIC [1] and ARBIRS [2]. In this paper we describe a system for variety registration of ornamentals which requires detailed features of the flowers. For variety registration, a breeder requests Breeders' Rights (a kind of patent) for a new candidate variety. A major criterion for granting Plant Breeders' Rights is that the candidate variety has to be distinct from all existing varieties worldwide. Since the number of varieties is large, especially in ornamental crops, there is no single crop expert who has complete knowledge of all varieties. Therefore the ornamental crop expert uses an image database with pictures of the reference varieties, combined with some keywords. At the moment the crop expert can search using keywords, like "yellow" and "rose", but this still gives many
hundreds (or thousands) of hits. It would be very helpful if the system could retrieve and order the pictures from the database by decreasing similarity, using very detailed features derived from expert knowledge. Such a system should preferably fulfil the following demands:
1. The system should allow a new (local) image to be compared with images of a reference collection. The images may have various backgrounds, and may contain different objects in one image, like leaves and flowers.
2. The system should be able to handle all crops in a generic way, but it should also be possible to implement feature extraction and matching specific to each crop/plant part. Feature extraction and matching may have a general basis for flowers, but implementation details will depend on the crop, e.g. rose or tulip.
3. The system should also contain other variety information, like administrative data.
4. It should be possible/easy to share and exchange the information with different registration authorities in different countries.
5. It should be possible to obtain an overall similarity for a variety, combining the similarities for different representations (images) of the variety, e.g. a top view and side view of a flower, but also a view of a leaf and the total plant appearance.
Clearly, because of items 1 and 2, an object-specific approach is required. The system should extract different features and apply different matching algorithms, depending on the type of object. Also, because of the many different kinds of objects and the required detail, the user should be able to control/inspect the segmentation: i.e. client-driven segmentation is required. For combining the image information with other information, a simple unique key can suffice for a mapping between image/object and variety information in a relational database. This allows separate development of the relational database and the image database. Since the information should be exchanged by administrative systems in different countries (item 4), a good separation between image database and relational database is once more a prerequisite. Furthermore, item 4 calls for the use of an Internet/Intranet approach, which in combination with interactive segmentation requires an Applet-based approach in a browser. Item 5 can be seen as an extension of combining and weighting the similarities. Although this does not seem difficult in theory, it has certain implications which are difficult to solve in practice. An image database system which has an object-specific approach is ARBIRS, described and developed by Gong [2]. ARBIRS has a standard module for image segmentation and texture regions. Although it does have the object-specific approach, and is even capable of handling and querying multiple objects in one image, it does not distinguish between different types of objects. Therefore it only has one type of feature extraction and matching and does not allow for user-controlled segmentation. QBIC has an object module in the total system and also allows user-driven segmentation, but this is not possible in an Internet browser [1]. Furthermore, QBIC does not offer the possibility to assign different features and matching functions to different kinds of objects. In this paper we propose a system which allows the handling of different kinds of objects, using interactive segmentation, across a network. The setup of the system is described in section 2. In section 3 we present results and discuss the future of the system.
2 System Setup
To allow a flexible environment with multiple users in a dynamic, continuously changing internet/intranet environment, we use a web-browser. The use of a web-browser, in combination with client-driven segmentation, requires the use of Java Applets. Since we have a large image library written in C, based on Scil-Image [6], we decided to use the Java Applets only for interactive segmentation. Afterwards the image is sent to the server, where the remaining processing (feature extraction, matching) is done, using the Java Native Interface (JNI) between Java and C. The system consists of two main parts:
1. Client site: takes care of picture loading, (interactive) segmentation, sending the object to the server for matching, and showing the resulting similar images;
2. Server site: takes care of feature extraction, feature matching/similarity calculation, sorting by similarity and sending the results to the client.
A flowchart of the system, showing all components and the communication order, is shown in Fig. 1. The components of the system are described in the next paragraphs.
Fig. 1. A flow chart of the Flores system. The numbers indicate the flow order and are further explained in the text.
2.1 Client Site
The system starts after the user asks for the Flores.html document in his web-browser (nr 1 in Fig. 1). The html-document, containing the applet, is sent to the client (nr 2 and 3). After the applet has started, it asks for a local image file to load. Since the Java Development Kit 1.1 (JDK) has a security mechanism (the so-called sandbox model) which does not allow access to the local file system for unsigned applets, it is less suitable for our purposes. Therefore we have chosen JDK 1.2, which has a very nice policy tool with which the user can explicitly indicate which operations a specific applet is allowed to perform on which files or directories. For more information on Java security mechanisms we refer to http://java.sun.com/products/jdk/1.2/docs/guide/security/.
Interactive Segmentation. An image can have multiple objects, which can even overlap, e.g. the object "whole plant" and its sub-object "flower". Therefore the user should be able to indicate which part (object) of the image should be matched with the objects in the database. The system has several tools for selecting an object in the image, using (semi-)interactive segmentation. The first one is thresholding. The three color histograms (RGB) are shown on the screen. If the user clicks on a pixel in the image, thresholds are set in the histogram around the R, G, and B values of the selected point. With a (user-adjustable) fuzziness parameter, the range around the R, G, and B values can be automatically adjusted, hence introducing a lower and upper limit for each of the three color components. Furthermore, the user can adjust the upper and lower limits interactively by using sliders. Only the image pixels falling within the limits of the thresholds are selected. The result of any change is shown in the image, using the alpha channel of the image as a mask, i.e. setting alpha to 0 for masked pixels (transparent, taking on the color of the background) or to 255 for selected pixels (opaque, original color). Since it is often impossible to obtain a 100% correct segmentation with thresholding, several tools are available to make changes to the mask, e.g. to add or delete certain parts of the mask using the mouse, or to invert the mask. A very useful tool is propagation within the mask plane from a user-selected point. The result is that all unmasked pixels which are not in the connected region around this pixel are masked, so that a single object is obtained. Another segmentation tool is region growing, where a color region growing algorithm is applied with a user-selected pixel as seed. A fuzziness parameter can be adjusted by the user to allow a smaller or larger color distance between candidate pixels and the start pixel. The applet provides options to add or delete selected regions.
Database matching. The user can select different image matching options and, having obtained a good segmentation result, send the image to the server. The options include the type of object (flower, plant, leaf, whole image) and the type of matching (color histograms, combination of features). Before the applet can send the image, it sends a message to a servlet via the http-port (nr 4 in Fig. 1). The servlet starts a socket and a server program at that socket (nr 5 in Fig. 1) and returns the port number to the applet (nr 6 in Fig. 1). Next, the applet sends the image information as a serialized object over the port to the server program (nr 7 in Fig. 1).
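As an illustration of the thresholding step only, the following sketch sets the alpha channel of a BufferedImage from a clicked seed pixel and a fuzziness value. It assumes an image of type TYPE_INT_ARGB and a single fixed band of plus/minus fuzziness per channel; the applet described above additionally lets the user adjust the limits with sliders, which is not reproduced here.

import java.awt.image.BufferedImage;

/** Illustrative RGB thresholding around a clicked pixel, writing the result into
 *  the alpha channel (0 = masked/transparent, 255 = selected/opaque).
 *  Assumes img is of type BufferedImage.TYPE_INT_ARGB. */
public class ThresholdSegmenter {
    public static void segment(BufferedImage img, int seedX, int seedY, int fuzziness) {
        int seed = img.getRGB(seedX, seedY);
        int rLo = ((seed >> 16) & 0xFF) - fuzziness, rHi = ((seed >> 16) & 0xFF) + fuzziness;
        int gLo = ((seed >> 8) & 0xFF) - fuzziness,  gHi = ((seed >> 8) & 0xFF) + fuzziness;
        int bLo = (seed & 0xFF) - fuzziness,         bHi = (seed & 0xFF) + fuzziness;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int p = img.getRGB(x, y);
                int r = (p >> 16) & 0xFF, g = (p >> 8) & 0xFF, b = p & 0xFF;
                boolean in = r >= rLo && r <= rHi && g >= gLo && g <= gHi && b >= bLo && b <= bHi;
                int alpha = in ? 255 : 0;                         // selected pixels keep their original color
                img.setRGB(x, y, (alpha << 24) | (p & 0x00FFFFFF));
            }
        }
    }
}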
Showing the results. After the server-side program is ready, it sends information back over the port to the applet (nr 11 in Fig. 1). The information contains 25 records of objects, in order of decreasing similarity. Each record contains information on the unique id of the object in the database, the similarity with the user-supplied object, and the URLs of the color image and of the mask image. Besides, it contains a unique key for obtaining additional variety information in the relational database (nr 12 in Fig. 1). The images are shown in a Java image-button panel with the relevant information. By clicking on an image, this image is copied to a separate image frame, which can subsequently be processed (segmented/matched) as described above.
2.2 Server Site
The Java Flores server, started by a servlet, listens to a port. After it has received the image over the port, the image is converted to the image structure suitable for the image processing library Scil-Image [6]. The mask is converted to a binary image. Subsequently the image library is called through the Java Native Interface (nr 8 in Fig. 1), using the color image, the binary image and the matching option as parameters.
Histogram based matching. The image library extracts specific information which is used for the matching process. One of the options is a binarized normalized rg-histogram: for each RGB value, the normalized r (= R/(R+G+B)) and the normalized g (= G/(R+G+B)) are calculated. These values are stored in a two-dimensional histogram with 32x32 bins. After the histogram is calculated, it is binarized by thresholding: if a bin has more than a certain percentage (e.g. 1%) of the object pixels, the bin is set to 1, otherwise it becomes 0. This binary histogram with 32x32 bins can be stored in only 32 integers (4-byte words). A graphical representation of the binarized histograms of rose flowers is shown in Figure 2. The histogram is compared with the stored histograms in the image database (nr 9 in Fig. 1). This is done by calculating a similarity measure between two histograms. First the logical AND of the two histograms is taken. If the two histograms are similar, the number of bins with 1 will be large. Therefore the number of bins with 1 can be regarded as a measure of similarity. However, this measure strongly depends on the spread of colors, and therefore we apply a normalization by dividing it by the number of bins with 1 in the logical OR of the two histograms. This similarity measure can be calculated very fast, since the AND/OR operations can be done in parallel (32 bits) and the bit-counting can be done by table lookup (e.g. a table of 256 entries allows 8-bit parallel bit-counting). The array of similarity measures is sorted in decreasing order using the standard quicksort algorithm. The matching algorithm returns the list of objects, ordered by similarity. For the first 25 objects (or any other number), other relevant information like the unique key in the relational database is retrieved. A special class containing all this information is filled and returned to the client (nr 10 and 11 in Fig. 1).
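A minimal sketch of the binarized normalized rg-histogram and of the AND/OR similarity follows. The 32x32 binning, the packing into 32 integers and the 1% threshold are taken from the description above; Integer.bitCount stands in for the table-lookup bit counting, and the pixel/mask array layout is assumed for illustration.

/** Illustrative binarized normalized rg-histogram (32x32 bins packed into 32 ints)
 *  and the AND/OR similarity described above. rgb[i] holds a packed RGB value,
 *  mask[i] is true for object pixels (assumed layout). */
public class RgHistogram {

    /** One int per histogram row; bit g of row r is set if bin (r, g) holds at least 1% of the object pixels. */
    public static int[] binarizedHistogram(int[] rgb, boolean[] mask) {
        int[][] counts = new int[32][32];
        int objectPixels = 0;
        for (int i = 0; i < rgb.length; i++) {
            if (!mask[i]) continue;
            int r = (rgb[i] >> 16) & 0xFF, g = (rgb[i] >> 8) & 0xFF, b = rgb[i] & 0xFF;
            int sum = r + g + b;
            if (sum == 0) continue;                        // skip pure black to avoid division by zero
            int rBin = Math.min(31, 32 * r / sum);         // normalized r = R/(R+G+B), quantized to 32 bins
            int gBin = Math.min(31, 32 * g / sum);
            counts[rBin][gBin]++;
            objectPixels++;
        }
        int[] packed = new int[32];
        for (int rBin = 0; rBin < 32; rBin++)
            for (int gBin = 0; gBin < 32; gBin++)
                if (counts[rBin][gBin] * 100 >= objectPixels)   // the 1% threshold from the text
                    packed[rBin] |= (1 << gBin);
        return packed;
    }

    /** Similarity = |AND| / |OR|, computed with 32-bit parallel bit operations. */
    public static double similarity(int[] h1, int[] h2) {
        int andBits = 0, orBits = 0;
        for (int i = 0; i < 32; i++) {
            andBits += Integer.bitCount(h1[i] & h2[i]);
            orBits  += Integer.bitCount(h1[i] | h2[i]);
        }
        return orBits == 0 ? 0.0 : (double) andBits / orBits;
    }
}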
Fig. 2. Flowers of 10 different rose varieties, with their binary normalized rg-histograms. The normalized rg-histograms have the red axis horizontal and the green axis vertical. Each dot represents the presence of a color in the image.
Since this measure is very fast to calculate, it can be used over the complete database without any restriction on the type of object, i.e. matching based on histogram similarity is done over all objects.
Features based matching. Depending on the type of object, which is indicated by the user, various features are calculated. Since features are specific for each type, we will not go into detail about the kinds of features used in Flores. They may include, for example, the area and center of mass (x, y) relative to image size, the length/width ratio, a shape factor (perimeter²/(4·π·area)), several moments, and the mode of red, green and blue (i.e. the value of the bin with the highest count in the histogram). After calculation, the features are weighted to optimize discrimination between varieties. Since several flowers are available for each variety, a multivariate discriminant analysis is applied to maximize the between-variety variance over the within-variety variance using a linear combination of the features. The latent root vectors and adjustments from this statistical analysis are used to provide weights for a linear recombination of the features. The new set of recombined and weighted features is used as the feature space for searching nearest neighbors. The most similar objects in this multi-dimensional space are found using the SR-tree algorithm [4]. The SR (sphere/rectangle) tree is an index structure for
high-dimensional nearest neighbor queries. It is an adaptation of the SS-tree, which outperforms the R*-tree. The bounding spheres used in the SS-tree occupy larger volumes than bounding rectangles. This problem is overcome in the SR-tree, and it outperforms both the R*-tree and the SS-tree. For more details we refer to [2,4]. Note that for every type of object a separate tree is built and matching is done only within the specified type of object.
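The following sketch illustrates only the matching step: the raw features are linearly recombined with precomputed discriminant-analysis weights (assumed to be given) and the nearest stored object is found by a brute-force scan, which merely stands in for the SR-tree index [4] actually used in Flores; all names are illustrative.

/** Illustrative feature-space matching with a precomputed weight matrix and a linear scan. */
public class FeatureMatcher {

    /** y = W x : linear recombination of the raw feature vector. */
    static double[] recombine(double[][] weights, double[] features) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++)
            for (int j = 0; j < features.length; j++)
                out[i] += weights[i][j] * features[j];
        return out;
    }

    /** Returns the index of the most similar stored object (smallest Euclidean distance). */
    static int nearest(double[][] storedRecombined, double[] queryRecombined) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < storedRecombined.length; i++) {
            double d = 0;
            for (int j = 0; j < queryRecombined.length; j++) {
                double diff = storedRecombined[i][j] - queryRecombined[j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }
}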
3 Results and Discussion
At the moment the (prototype) database contains images of flowers of eighty rose varieties. At least three flowers per variety are available. The images are recorded under controlled conditions, so within the database all pictures can be compared. The server is a Sun Sparc computer equipped with an Apache http-server, with the JServ servlet module. The image library is based on Scil-Image 1.4 [6]. Internet Explorer and Netscape are used as clients. JDK 1.2 is used for the Java applets, since this offers tools to access the local environment from within a browser. The current version opens a local port for exchanging data between client and server. This has severe limitations when using firewalls, since firewalls generally prohibit the opening of user-defined ports. A way to circumvent these problems is to use Remote Method Invocation (RMI) of Java, since this protocol will then automatically use tunneling over the http-port. By allowing user-driven segmentation in the applet, providing semi-automatic tools, the user can quickly select the correct object, masking the remaining part of the image. In this way a crucial step towards optimal matching is taken. Having the system perform the segmentation automatically could lead to incorrect segmentation of the object, and this would seriously distort the reliability of the system. The advantage of the object-based approach is that it allows very detailed and precise object-specific feature extraction and matching. We use the human brain for the semantics (determining the kind of object, controlling segmentation), and the computer for the quantitative comparison. The information that the user already has (basic knowledge of the kind of object) is supplied to the system, and the system can then focus on detailed quantitative information that is difficult to obtain without the aid of a computer, e.g. detailed shape, color and texture information. By showing the resulting most similar images to the user, the user is still in control of the final decision. For comparing the histogram based matching methods, a Principal Coordinates plot is used, which shows the distance (dissimilarity) of all individual objects in the best possible two-dimensional way. From these plots, it could be concluded that RGB-histograms and normalized rg-histograms outperformed other color spaces like HSI and YUV. As a measure of retrieval accuracy, the ranking of the most similar of the other images of the same variety is used. For this purpose, the database was limited to 70 varieties represented by three images each. Each image was compared with the remaining 209 images. In a random situation, the expected ranking of the most similar image of the same variety would be 70 (using Monte Carlo simulation). For the
SR-tree, this value was 2.03. In 71% of the cases, the most similar other image of the same variety had ranking 1. We expect that these values can be improved by further optimization of the feature set. Currently, the system is being optimized for rose flowers, enhancing the feature extraction and matching process using the SR-tree. Furthermore, the system is being extended to other flower types. The system provides for a direct link to a relational database containing the variety information corresponding with the image (object). The system is not capable of combining several images to obtain an overall similarity for a variety. At first glance this seems a simple matter of using weighting coefficients to form a linear combination of the individual similarities, even though it may be difficult to find the right coefficients. This is indeed true if all similarities are available, as in the histogram matching. However, the main problem is in using search trees. Then only the k1 nearest images for one view of a variety are retrieved, which is a different set from the k2 nearest images for another view of the variety. Combining the sets of several views may lead to an incorrect ranking of overall similarity and even to a (nearly) empty set. Retrieving more images from the SR-tree (depending on the number of views to be combined) will cost more time and will never give a guarantee that the resulting combined set will be correct. A practical solution is to use a threshold for each view, such that a variety with a view having a similarity below the threshold can never be regarded as a similar variety, even if all other views are (nearly) identical. This would extend the search considerably and probably reduce the advantages of the SR-tree unacceptably. Therefore, combining similarities from different images using tree search algorithms is still an open question for the authors.
4 Conclusions
We have developed an object-dependent system for image matching of ornamental varieties, in which the feature extraction and matching depend on the type of object. It has user-driven segmentation tools in a cross-platform environment, using JAVA Applets. Furthermore, the system provides a direct link with a relational database. The system at this moment has no means of combining similarities from different images of the same variety to determine an overall similarity.
References
1. Flickner, M., et al.: Query by image content: the QBIC-system. IEEE Computer (1995) 23-32
2. Gong, Y.: Intelligent image databases. Toward advanced image retrieval. Kluwer Academic Publishers (1998)
3. Gupta, A., Jain, R.: Visual Information Retrieval. Comm. of the ACM 40(5) (1997) 70-79
4. Katayama, N., Satoh, S.: The SR-tree: an index structure for high-dimensional nearest neighbor queries. Proceedings of the ACM SIGMOD, Tucson, Arizona (1997)
5. Ogle, V.E., Stonebraker, M.: Chabot: retrieval from a relational database of images. IEEE Computer (1995) 40-48
6. Scil-Image user manual. Version 1.4. TNO Institute of Applied Physics, Delft, NL (1997)
Pictorial Portrait Indexing Using View-Based Eigen-Eyes
C. Saraceno, M. Reiter, P. Kammerer, E. Zolda, and W. Kropatsch
PRIP, Vienna University of Technology, Treitlstr. 3, 183/2, A-1040 Vienna, AUSTRIA
Tel: (+43-1) 58801-18351, Fax: (+43-1) 505-4668
{saraceno,rei,paul,zolda,krw}@prip.tuwien.ac.at
Abstract. A hierarchical database indexing method for a pictorial portrait database is presented, which is based on Principal Component Analysis (PCA). The description incorporates the eyes, as the most salient region in the portraits. The algorithm has been tested on 600 portrait miniatures of the Austrian National Library.
1 Introduction
Today, computer technology allows for large collections of digital images organized in databases. The complexity and the diversity of the material, however, make any browsing or retrieval task particularly difficult. To face this challenge, a number of automated methods for indexing images are being developed. In addition, there is an increasing interest in specifying standardized descriptions of various types of multimedia information. Descriptions must be associated with the data to allow fast and efficient searching for material that is of interest to the user. This effort is being conducted within the activities of the new standard MPEG-7 (Multimedia Content Description Interface) [2]. In our work, the idea is to support art historians or non-expert users by automatically structuring the database and by associating a set of descriptors with the data. This can be used not only to efficiently retrieve a specific portrait, but also to exploit correlations of specific features among portraits. This will help in defining or exploring similarities in the work of a certain painter or in distinguishing the painting styles of different artists [6]. Classical face recognition techniques look for facial features that comprise information relevant for the discrimination of faces and that are at the same time insensitive to different illumination conditions, viewpoints, facial expressions, etc. Among the best-known approaches to face recognition, considerable effort has been devoted to eigenspace representations [3,4]. Instead of being explicitly modeled, the facial features are extracted by PCA from a training set of representative images. In [3] the term "Eigenfaces" is used to denote the components of the feature space used for classification.
This work was supported by the Austrian Science Foundation (FWF) under grants P12028-MAT and S7000-MAT.
In the case of portrait miniatures, the artist paints using his or her own perception of reality, which makes the recognition process even more difficult. Furthermore, in a portrait a face can change slightly due to the style of the painter or to a different age of the painted person, whereas the eyes remain mostly similar between different portraits of the same person. For this reason, and in order to minimize the variation due to spurious factors such as partial occlusion, different periods, etc., we are using eigen-eyes instead of considering whole faces. A view-based representation which takes into account different eye poses and pupil positions allows for a more accurate description of our data. The detection of the eyes in an image, together with the orientation of each eye, immediately gives the number and the locations of the faces within the image. In the next section, a hierarchical structure for our image database is presented. The view-based eigen-eye approach used for indexing is explained in section 3. Simulation results are presented in section 4, whereas conclusions are drawn in the last section of the paper.
2 The Hierarchical Database Structure
In order to generate indices that can be used to access an image database, a description of each image is necessary. A first wave of attempts described in the literature focused on indexing image databases based on color, texture and shape [8]. Although they provide information on the data, they do not lead to a complete and efficient description of the images. The key point is to establish a hierarchical data structure (a relational or a transition graph) which makes it possible to handle different levels of abstraction of the visual data. A hierarchical organization of portrait material can, at a low level, characterize the images using low-level features such as color histograms, shape, eigen-representations, etc. For each exploited feature, nodes can be created so as to group images having similar feature values, such as the same eye orientation or shape. In our implementation, a table lists all exploited features (as shown in Fig. 1.c). This table can also be used whenever a query on low-level features is posed to the system. At a higher level, the images can be characterized based on content-based analysis. Information such as the presence of faces, the number of faces represented in the image, their location and orientation, etc. can all be stored at this level of abstraction. At this level, nodes are created considering combinations of low- and/or high-level features. For example, a node can represent a group of images having frontal eyes (see Fig. 1.b), where eigen-features can be used to group eyes with the same orientation. In our implementation the middle level of the hierarchy contains two types of nodes: nodes which group together portraits having the same number of painted faces, and nodes which group portraits with the same eye orientation. For each of the two classes, a table is created containing a description of the features and nodes representing specific features of that class. The table which groups portraits with the same number of faces is identified using eigen-features (see section 3).
Links connect each node to the corresponding features of the low layer. The highest level of the hierarchy can contain information such as the name of the painter, the period in which the portrait was painted, the name of the painted person, etc. (see Fig. 1.a). A table for each type of node is created. Each table contains the class description (e.g. "painters table"), and nodes representing an instance of the class (e.g. "Bencini"). Each node has links to nodes of the middle layer. Links can also exist between nodes of different tables of the same layer. For example, a link between the painter "Bencini" and the person "Maria Carolina" is created because Bencini is the author of one of Maria Carolina's portraits. This approach allows the portrait miniatures to be organized for any subsequent analysis, classification or indexing task. In the example given in Fig. 1, the database can be accessed by requesting the portrait of a specific person. Starting with the retrieved portrait, other portraits having, for example, a similar eye orientation, or painted by the same person, can also be retrieved. The advantage of such a representation is, of course, to have faster access to specific information, allowing efficient navigation through the visual data. Operations such as coding and retrieval for specialized queries may all benefit from this type of representation. In the following section the role of the eigen-features is explained, together with the algorithm that automatically creates the tables and nodes of the low and middle levels of the hierarchy.
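A minimal sketch of how such a three-layer node/link structure might be represented is given below; the class, field and label names are illustrative and not taken from the paper.

import java.util.*;

/** Illustrative three-layer hierarchy: high-level nodes (painters, persons, periods),
 *  middle-level nodes (number of faces, eye orientation) and low-level feature nodes,
 *  connected by links within and across layers. */
public class PortraitHierarchy {
    enum Layer { HIGH, MIDDLE, LOW }

    static class Node {
        final String label;
        final Layer layer;
        final List<Node> links = new ArrayList<>();
        Node(String label, Layer layer) { this.label = label; this.layer = layer; }
        void linkTo(Node other) { links.add(other); other.links.add(this); }
    }

    public static void main(String[] args) {
        Node bencini = new Node("Bencini", Layer.HIGH);
        Node mariaCarolina = new Node("Maria Carolina", Layer.HIGH);
        Node frontalEyes = new Node("frontal eye orientation", Layer.MIDDLE);
        Node eigenspace = new Node("frontal/center-pupil eigenspace", Layer.LOW);

        bencini.linkTo(mariaCarolina);      // link within the high layer: painter painted this person
        mariaCarolina.linkTo(frontalEyes);  // link from high to middle layer
        frontalEyes.linkTo(eigenspace);     // link from middle to low layer

        // navigate from a middle-layer group to its neighbours
        frontalEyes.links.forEach(n -> System.out.println(n.layer + ": " + n.label));
    }
}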
3 The Eigen Approach
Subspace methods and eigenspace decomposition have been used widely for object representation and object class modeling (e.g. the generic human face [3,5]). Essentially, a model is acquired by Principal Component Analysis (PCA) [1] of a set of example images (templates). PCA extracts the low-dimensional subspaces that capture the main linear correlations among the high-dimensional image data. In painted portraits, the eyes are the most important features for recognizing the depicted person as well as the period and the artist of the miniature. In the case of portrait miniatures, after sketching the contour of the face on ivory, in the first step the artist tries to visualize the characteristic personality of the painted person by completely elaborating the eyes. In the following, the view-based eigen-eye approach is presented.
3.1 Eigenspace Decomposition
Fig. 1. (a) High layer (b) Middle layer (c) Low layer with mean images of the training samples
Fig. 2. Connections between layers
Given a training set Tx = {x1, ..., xNT} of NT image vectors of N²-dimensionality, the eigenvectors Φ = {e1, e2, ..., eNT} and the eigenvalues λi of the covariance matrix Σ can be obtained by eigen-decomposition (or singular value decomposition). Since usually NT << N², Σ is a singular matrix of rank at most NT − 1. The Karhunen-Loeve transform (KLT) y of a mean-normalized image vector x̃ = x − µ(Tx) is defined by
y = Φ^t x̃    (1)
The original image x can be obtained using the KL-basis Φ by
x = Φy + µ(Tx) = Σ_{i=1}^{NT−1} ei yi + µ(Tx)    (2)
In PCA the KLT representation y is truncated, approximating x using only the eigenvectors corresponding to the k highest eigenvalues (the first k eigenvectors):
x ≈ Σ_{i=1}^{k} ei yi + µ(Tx)    (3)
The KLT is optimal in the sense that the mean square error between the truncated representation and the actual data is minimized. The number of principal components needed to represent x to a sufficient degree of accuracy depends on the correlation among the images in Tx. We use a fast pattern matching algorithm utilizing multiple view-based eigenspaces and the Fourier transform [7], which allows a fast calculation of the eigen-projections y and of the normalized correlation
C(xi, xe) = (xi^t xe) / (‖xi‖ ‖xe‖)    (4)
between a part of the image xi and its projection xe onto the particular eigenspace. We normalize geometrically by scaling and rotating the images according to 2 feature points (eye corners). Moreover, the data have to be categorized in a way that eliminates unimportant factors that might influence the representation. This can be done by using a view-based eigenspace [4], where PCA is performed on every subset of images according to certain categories such as
left/middle/right pupil and left/right/frontal view. The view-based approach allows a more accurate description of the set of images, by multiple independent subspaces. The images in our database do not have many variations in face orientation. About 90% of them have a "three-quarter view", either from the left or the right side. A characteristic of these miniatures is that the view of the eyes does not necessarily coincide with the view of the face. As an example, three-quarter faces can also have frontal eyes. This characteristic must be considered whenever an algorithm for face recognition is applied to painted faces. In our case the eigen-eyes are sensitive to different positions of the pupil, different orientations of the eye and different shading. For this reason, the images have been categorized manually according to these factors and an eigenspace for each category has been calculated. In particular, we consider three different pupil positions (left, center and right) and five different eye orientations (left profile, right profile, three-quarter right, three-quarter left and frontal) for a total of eleven different eigenspaces, as shown in Fig. 1.c. (In the case of profile eyes, only one pupil position is considered.) In order to index the portrait miniatures according to eye orientation and number of faces per picture, an eigenspace for each of the eleven aforementioned classes must be determined. The eigenspaces will be used to detect the eyes and their orientation, as explained in the following section.
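A small illustration of equations (1)-(4) follows, assuming that the mean vector and the eigenvectors of one eigenspace have been computed offline. Plain double arrays stand in for whatever matrix representation the authors used, and the normalization of the correlation follows the reconstruction of eq. (4) given above; it is a sketch, not the authors' implementation.

/** Illustrative projection of a mean-normalized eye patch onto the first k eigenvectors,
 *  reconstruction from the truncated representation, and the normalized correlation. */
public class EigenEyes {

    /** y_i = e_i^t (x - mu) for the first k eigenvectors (eq. 1, truncated as in eq. 3). */
    static double[] project(double[] x, double[] mean, double[][] eigenvectors, int k) {
        double[] y = new double[k];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < x.length; j++)
                y[i] += eigenvectors[i][j] * (x[j] - mean[j]);
        return y;
    }

    /** x_e = sum_i e_i y_i + mu : reconstruction from the truncated representation (eq. 3). */
    static double[] reconstruct(double[] y, double[] mean, double[][] eigenvectors) {
        double[] xe = mean.clone();
        for (int i = 0; i < y.length; i++)
            for (int j = 0; j < xe.length; j++)
                xe[j] += eigenvectors[i][j] * y[i];
        return xe;
    }

    /** Normalized correlation C(x_i, x_e) between an image patch and its eigenspace reconstruction (eq. 4). */
    static double normalizedCorrelation(double[] xi, double[] xe) {
        double dot = 0, ni = 0, ne = 0;
        for (int j = 0; j < xi.length; j++) {
            dot += xi[j] * xe[j];
            ni += xi[j] * xi[j];
            ne += xe[j] * xe[j];
        }
        return dot / (Math.sqrt(ni) * Math.sqrt(ne));
    }
}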
3.2 Portrait Miniature Indexing Using Eigen-Eyes
At this point, all images of the database are analyzed using all eigenspaces. Different resolutions and rotations of the test images are considered by rotating the input image from -30 to 30 degrees in 10-degree steps and scaling with 5 scaling factors in 10% steps. At each position of the image, the maximum of the normalized correlation between the subimage at this position and its projections onto the different eigenspaces is put into a canonical correlation map. Eye locations, poses, rotation angle and scale can be obtained by thresholding the correlation map. The images can then be grouped according to eye orientation, i.e. frontal, three-quarter left, three-quarter right, profile left and profile right. One image will be in more than one group if different eye orientations are detected within the image. Each group of eyes can be indexed using the eigenvectors. In particular, frontal and three-quarter eyes are represented by the eigenvectors of the three eigenspaces (one for each pupil position), whereas the profile eyes, having only one possible pupil position, are represented by the eigenvectors belonging to one eigenspace. Eigen-eyes also make it possible to determine the number of faces within a portrait. The technique to determine the number of faces analyzes the number of detected eyes and their orientations. One face is detected if one of the following cases occurs: 1. only one eye is detected; 2. two eyes are detected, which can be frontal and/or three-quarter.
In the more general case, N faces are detected if the following holds: A of the detected eyes are profile eyes and 2 × B of them are three-quarter and/or frontal eyes, where A + B = N and A, B ≥ 0. In fact, for profile faces only a profile eye is visible, whereas three-quarter or frontal faces can have either two frontal eyes, two three-quarter eyes, or one frontal eye and one three-quarter eye. In the following section, results of the eye detection and indexing are presented.
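The counting rule can be sketched as follows; the handling of a single leftover frontal or three-quarter eye (which still indicates one face, as in case 1 above) is an assumption of this sketch, and the names are illustrative.

/** Illustrative face counting from detected eyes: each profile eye yields one face,
 *  and the frontal/three-quarter eyes are grouped in pairs (a leftover single eye
 *  still indicates one face, as in the one-eye case above). */
public class FaceCounter {
    static int countFaces(int profileEyes, int frontalOrThreeQuarterEyes) {
        int pairedFaces = frontalOrThreeQuarterEyes / 2;
        int leftover = frontalOrThreeQuarterEyes % 2;   // a single remaining eye -> one face (assumed)
        return profileEyes + pairedFaces + leftover;
    }

    public static void main(String[] args) {
        System.out.println(countFaces(1, 2));  // one profile face + one frontal/three-quarter face = 2
    }
}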
4 Results
586 portrait miniatures have been tested. Most of them have three-quarter faces with frontal eyes. The eye orientations have been classified by art historian experts. Only 18 miniatures have more than one painted face. As an example, Fig. 3 shows a typical input image, the resulting correlation values, the correlation map, and the resulting eye locations.
Fig. 3. (a) Input Image (b) Correlation values (c) Indices of eye classes (d) Thresholded correlation values
In Table 1, the results of the detection are presented. Most of the frontal eyes were correctly detected. Only 8 of them were detected as three-quarter left or right eyes. 42 three-quarter eyes were detected as frontal. 3 right profile eyes were detected as three-quarter right, whereas two left profile eyes were detected as three-quarter left eyes. However, these errors did not have drastic consequences during the indexing procedure. In fact, the number of faces within a portrait does not change if, for example, a three-quarter eye is detected as a frontal eye.
                      # of eyes   Correctly detected   Wrongly detected
frontal                  822             814                 42
three-quarter right      130              94                  9
three-quarter left       166             160                  4
profile right             15              12                  0
profile left              16              14                  0
Table 1. Eye detection results
5 Conclusions
In this paper, a hierarchical structure for organizing miniature portrait databases has been presented. Three main layers have been defined. In the high layer, classifications performed by art historian experts are stored. In the middle layer, a content-based analysis of the images is used to index the database. This is performed by utilizing combinations of the low-level features stored in the low layer. Furthermore, an algorithm for indexing art-historical portrait miniatures using Principal Component Analysis (PCA) has been proposed. Eleven different eigenspaces have been created, based on eye orientation and pupil position. These eigenspaces are then used to detect eyes in portrait miniatures. The eigenspaces also allow the orientation of the eyes to be determined. This information has been used in the middle layer of the hierarchy for automatic content extraction (such as the presence and number of faces and their locations).
References
1. Keinosuke Fukunaga: Statistical Pattern Recognition. Computer Science and Scientific Computing. Academic Press, Inc., 2nd edition, 1990
2. ISO/IEC JTC1/SC29/WG11, MPEG Requirements Group: MPEG-7: Context and Objectives. Doc. ISO/MPEG N1733, Stockholm, July 1997
3. Matthew Turk and Alex Pentland: Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 4(1):71-86, 1991
4. A. Pentland, B. Moghaddam, and T. Starner: View-based and modular eigenspaces for face recognition. Proceedings of the Conf. on Computer Vision and Pattern Recognition, 1994
5. B. Moghaddam and A. Pentland: Probabilistic Visual Learning for Object Representation. IEEE Trans. PAMI, 19(7):696, 1997
6. R. Sablatnig, P. Kammerer, and E. Zolda: Hierarchical classification of paintings using face- and brush stroke models. In: 14th Int'l Conference on Pattern Recognition, Brisbane, Australia, August 17-20, pages 474-476, 1998
7. M. Uenohara and T. Kanade: Use of Fourier and Karhunen-Loeve Decomposition for Fast Pattern Matching With a Large Set of Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(8):891-897, 1997
8. W. Niblack, R. Barber, et al.: The QBIC project: querying images by content using color, texture and shape. In: Proc. of the SPIE Conf. on Vis. Commun. and Image Proc., Feb. 1994
Image Retrieval Using Fuzzy Triples
Seon Ho Jeong, Jae Dong Yang, Hyung Jeong Yang, and Jae Hun Choi
Dept. of Computer Science, Chonbuk National University, Chonju Chonbuk, 561-756, South Korea
Tel: (+82)-652-270-3388, Fax: (+82)-652-270-3403
{shjeong,jdyang,hjyang,jhchoi}@jiri.chonbuk.ac.kr
Abstract. This paper proposes a concept-based and inexact image match retrieval technique based on fuzzy triples. The most general method adopted to index and retrieve images based on their spatial structure may be the triple framework. However, this framework has two significant drawbacks: one is that it cannot support concept-based image retrieval, and the other is that it fails to deal with an inexact match among directions. Presented in this paper is an image retrieval technique based on fuzzy triples which makes the concept-based and inexact match possible. For the concept-based match, we employ a set of fuzzy membership functions structured like a thesaurus, whereas for the inexact match, we introduce a k-weight function to quantify the similarity between directions. In addition, we analyze the retrieval effectiveness of our framework with regard to the degree of conceptual matching and inexact matching.
1 Introduction
The explosion of image information in diverse domains has created the need for new approaches that help users effectively retrieve images based on their visual contents such as color, shape and texture. Content-based retrieval is characterized by its ability to retrieve relevant images based on their contents. QBIC [7] and Virage [1] are attempts to perform image retrieval as well as image indexing based on color, texture, shape or a combination of them. Another approach to content-based image retrieval is to use 2D strings (two-dimensional strings) to represent images [3]. The 2D strings may be viewed as the symbolic projection along the x- and y-directions. The 2-D string is not only a compact representation of the real image, but also an ideal representation suitable for formulating picture queries. [3][6] feature research activities based on the 2-D string. C. C. Chang and S. Y. Lee [2] made the 2-D string structure more readable by transforming 2D strings into semantically equivalent triples. This technique largely focuses on the problem of indexing. In particular, they index an image
This work was supported by the KOSEF under grant no. 97-0100-1010-3.
To whom correspondence should be addressed: [email protected]
in terms of triples specifying the spatial relationship between its constituent objects. The triples generated as indexes are then entered into an inverted file. Similarly, triples are also extracted from a given query image. Evaluation of user queries involves matching the triples with those of the inverted file. This framework is a novel index mechanism in that it allows us to simplify the spatial structure of images, guaranteeing fast retrieval times. However, there are two significant drawbacks in the framework: one is that it cannot accommodate concept-based image retrieval, and the other is that it does not deal with an inexact match among directions. While such content-based image retrieval systems try to exactly specify spatial relationships and visual contents for indexing and retrieving images, they fail to retrieve conceptually related images. To retrieve such images, we should take into account concepts implying semantics buried in images. Additionally, 8 directions are not sufficient to fully specify the spatial relationships. To compensate for these problems, we propose fuzzy triples to make the inexact match and the concept-based match possible. For the concept-based match, we employ a set of fuzzy membership functions structured like a thesaurus, whereas for the inexact match, we introduce a k-weight function to quantify the similarity between directions. The two facilities are uniformly supported by the fuzzy matching [9][10] performed when evaluating queries. We also provide experimental results analyzing the effectiveness of the fuzzy triple framework.
2 Image Retrieval by Fuzzy Triples
Before introducing fuzzy triples, we need to briefly explain the triple-based indexing technique adopted in [2]. This technique deals with symbolic images. The spatial relationship between two distinct objects may be obtained by analyzing their directions. Note that directional relationships are influenced by the angle of view and the orientation of objects. We can specify these relationships as one of an ordered direction set D = {east, northeast, north, northwest, west, southwest, south, southeast} or as an angle in [0°, 359°]. In addition, when the order is important, the i-th element of D is denoted by di. We refer to our images as symbolic representations, in which each object is represented by a unique symbol. An image is therefore represented by a set of triples involving the symbols. To specify the spatial relationship between objects in a symbolic image, we begin by defining an ordered triple.
Definition 1. Let IDB (Image Database) be the whole set of images in an image retrieval system and Op be the set of all objects possibly occurring in an image p ∈ IDB. Then the set of triples for p, Tp, is given by
Tp = {t = <oi, oj, rij> | rij ∈ D ∪ [0°, 359°], oi, oj ∈ Op}.
The generated triples are then inserted into an inverted file together with links to their images. This file is used in retrieving images to match a given
user query. Its structure was originally proposed in [4]. To define the similarity between objects and relevant concepts, including more general semantics, we provide the following definition.
Definition 2. A name function fNAME is defined as fNAME : O → N, where N is a set of terms.
From now on, we refer to concepts as broader terms in a thesaurus. Using these concepts, it becomes possible to retrieve images conceptually related to a user query. Definition 3 refines our fuzzy triples with this name function to support concept-based matching.
Definition 3. Let Tp be a fuzzy triple set for p. Then it is defined as
Tp = {t = <fNAME(oi), fNAME(oj), rij> | rij ∈ D ∪ [0°, 359°], oi, oj ∈ Op}.
We may treat a term as a fuzzy set characterized by its membership function, and we call it a fuzzy linguistic term, or briefly a fuzzy term. Note that we use the term fuzzy triple in Definition 3. The reason is that the object names participating in the triples are specified by fuzzy terms. As will be seen in section 4, it is the fuzzy term that makes the concept-based match possible.
3 K-Weight Function as a Membership Function of Directions
The direction relationships between objects may be represented by angle degrees. In this paper, to support an inexact match between the 8 directions and the corresponding angle degrees, we introduce a k-weight function [11] that is symmetric but not always convex. In particular, we will use the k-weight function as a membership function. The following membership functions for directions are provided, where east corresponds to 0° (i.e., µeast(0°) = 1) and the angle increases counterclockwise - for example, northeast and north correspond to 45° and 90°, respectively.
Definition 4. Let bi = 45·i for i = 0, ..., 7 and c = 1/2025^k. Then the membership functions µdi for di ∈ D, i = 0, ..., 7, are defined as
µdi(x) = c × (45² − (x − bi)²)^k for i = 1, ..., 7,
µd0(x) = c × (45² − (x − b0)²)^k, where b0 = 0° if 0° ≤ x ≤ 45° and b0 = 360° if 315° ≤ x < 360°.
It is apparent that each of the membership function values tends to decrease as the corresponding angle grows distant from its center.
For example, at a point x near 45°, the value of µeast(x) is lower than that of µnortheast(x), while µnortheast(x) is lower than µnorth(x) near 90°. The membership function value at 22.5° - the central point between east and northeast - is approximately 0.5. In order to normalize the k-weight function, c is fixed at 1/2025^k for a real number k. Since the shape of the functions depends on the k value, and the retrieval effectiveness of our framework may depend on the characteristics of the shapes, we need to adjust the k value appropriately. For k = 2.5, the membership function value is very close to 0.5 at the midpoint between east and northeast, 22.5°. Since this coincides with our intuition, we define the k value as 2.5. We thus obtain the eight direction functions drawn for k = 2.5, as shown in Fig. 1.
Fig. 1. Membership functions for the 8 directions
Fig. 2. Sharper k-weight functions
To make the variation of the 8 direction functions sharper, we may need to modify our functions as shown in Fig. 2. Though that may drastically enhance the retrieval effectiveness of the functions, we avoid a further detailed discussion here, since developing functions that maximize retrieval effectiveness is not our main concern. The functions are introduced as an example of designing a membership function to quantify the compatibility between a direction and the corresponding angle degrees.
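A small numerical check of Definition 4 for k = 2.5 is given below. Treating angles outside the 45° band around a direction as having membership 0 is an assumption of this sketch, and the class and method names are illustrative.

/** Illustrative evaluation of the k-weight membership functions of Definition 4 for k = 2.5.
 *  Angles are in degrees; directions are indexed 0..7 (east = 0, counterclockwise in 45-degree steps). */
public class KWeight {
    static final double K = 2.5;

    static double membership(int direction, double angle) {
        double b = 45.0 * direction;
        if (direction == 0 && angle >= 315.0) b = 360.0;        // wrap-around case for east
        double diff = Math.abs(angle - b);
        if (diff > 45.0) return 0.0;                            // outside the 45-degree support (assumed)
        return Math.pow((2025.0 - diff * diff) / 2025.0, K);    // c = 1/2025^k and 45 squared = 2025
    }

    public static void main(String[] args) {
        System.out.println(membership(0, 0.0));    // 1.0  : east at 0 degrees
        System.out.println(membership(0, 22.5));   // ~0.49: midpoint between east and northeast
        System.out.println(membership(1, 22.5));   // ~0.49: northeast gives the same value by symmetry
    }
}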
4 Concept-Based Match Using a Thesaurus
In text-based information retrieval systems, a thesaurus is mainly used to enhance retrieval effectiveness when processing user queries. A thesaurus consists of concept nodes and links representing the weighted relationships between them, such as broader term (BT), narrower term (NT), and related term (RT). However, the thesaurus is adopted in our paper for a quite different purpose. The conceptual distance between terms is captured by the integrated membership functions of terms in the thesaurus, which have a hierarchical yet nested structure. In Fig. 3 the structure is represented by a fuzzy graph specifying fuzzy membership values.
For example, 'furniture' is a fuzzy term taking 'table', 'home appliance', 'audio' and 'chair' as its members (or instances), and each of them is in turn a fuzzy term. Additionally, to use the thesaurus to match all triples conceptually related with each other, we need composed membership functions for obtaining membership values between two indirectly related fuzzy terms.
Fig. 3. Membership function of fuzzy terms
Definition 5 gives this composed membership function.
Definition 5. Let F be a fuzzy term. Then µF(c) = max_i(min(µai(c), αi)) for all c ∈ U, where U is the set of terms in the thesaurus and µF(ai) = αi ∈ [0, 1] [14].
For example, if we want to know the degree of conceptual closeness between furniture and radio, Definition 5 and Fig. 3 help to calculate it.
Example 1. The degree of conceptual closeness between furniture and radio is calculated by
µfurniture(radio) = max( min(µhome appliance(radio), µfurniture(home appliance)), min(µaudio(radio), µfurniture(audio)) )
= max(min(0.97, 0.4), min(0.7, 0.5)) = max(0.4, 0.5) = 0.5
Our thesaurus has the limitation that it cannot capture any composite concept formed by an aggregation of terms along with their spatial relationships. For example, a concept 'city' may be an aggregation of buildings, cars and people, which have specific spatial relations with each other. In other work [9], we solved this limitation by introducing another thesaurus called the triple thesaurus. It consists of rules to detect composite concepts defined in terms of combinations of more than one triple. However, to simplify the content of this paper, we omit the detection of such concepts. Now we are in a position to formally define our image retrieval system, including the fuzzy triples and the term thesaurus.
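The max-min composition of Definition 5 can be sketched over a toy thesaurus as follows, assuming the thesaurus is acyclic; the membership degrees are those of Example 1, and all class and method names are illustrative.

import java.util.*;

/** Illustrative max-min composition over a small fuzzy term thesaurus. */
public class FuzzyThesaurus {
    // direct membership degrees: parent term -> (member term -> degree)
    private final Map<String, Map<String, Double>> members = new HashMap<>();

    void add(String parent, String member, double degree) {
        members.computeIfAbsent(parent, k -> new HashMap<>()).put(member, degree);
    }

    /** mu_F(c): the direct degree if present, otherwise the max over intermediate terms a_i
     *  of min(mu_F(a_i), mu_{a_i}(c)), applied recursively (assumes an acyclic thesaurus). */
    double membership(String f, String c) {
        Map<String, Double> direct = members.getOrDefault(f, Map.of());
        double best = direct.getOrDefault(c, 0.0);
        for (Map.Entry<String, Double> ai : direct.entrySet()) {
            double viaAi = Math.min(ai.getValue(), membership(ai.getKey(), c));
            best = Math.max(best, viaAi);
        }
        return best;
    }

    public static void main(String[] args) {
        FuzzyThesaurus t = new FuzzyThesaurus();
        t.add("furniture", "home appliance", 0.4);
        t.add("furniture", "audio", 0.5);
        t.add("home appliance", "radio", 0.97);
        t.add("audio", "radio", 0.7);
        System.out.println(t.membership("furniture", "radio"));  // 0.5, as in Example 1
    }
}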
Definition 6. An image information retrieval system IIR is defined as follows: IIR = <IDB, Tr, Mfunc, In>, where IDB is a set of images, Tr is the set of all Tp's for p ∈ IDB, Mfunc is a set of fuzzy membership functions including the fuzzy term thesaurus and the k-weight functions, and In is an inverted file. The evaluation of user queries for retrieving images is made by transforming them into equivalent triples and then matching these against the triples in the inverted file. The following example illustrates simple query processing.
Example 2. Let the query be given as "search for images where an audio is on the furniture, or lies between the east and northeast sides of the furniture at 20◦". Then Q1 = Qm1 ∨ Qm2, where Qm1 = <furniture, audio, north> and Qm2 = <furniture, audio, 20>.
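A hedged sketch of how the query in Example 2 might be represented and decomposed; the triple and query classes below are illustrative data structures, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class FuzzyTriple:
    """A fuzzy triple <subject, object, spatial relation>; the relation may be
    a direction term or an explicit angle in degrees."""
    subject: str
    obj: str
    relation: Union[str, float]

# Example 2: "an audio is on the furniture, or lies between east and northeast
# of the furniture at 20 degrees" becomes Q1 = Qm1 OR Qm2.
Qm1 = FuzzyTriple("furniture", "audio", "north")
Qm2 = FuzzyTriple("furniture", "audio", 20)
Q1 = ("OR", Qm1, Qm2)
print(Q1)
```

In an actual evaluation, each query triple would be scored against the stored triples using the thesaurus membership for the subject/object terms and the k-weight membership for the spatial relation.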
5
Experimental Results
In this experiment, we used around 100 interior images scanned from magazines. We asked six postgraduate students to judge the similarity between a given query and the images in the collection by assigning one of two values: related or unrelated. Four of the students are members of an image retrieval research group, and the other two are members of an information retrieval research group. Four queries were chosen for the two types of collections. Two of them are mono queries, and the rest are compound queries involving logical connectives such as AND and OR. In processing user queries, we use the term thesaurus to enhance retrieval effectiveness when performing a concept-based match. For a user query including terms that are not exactly matched, the system retrieves images relevant to the user's intent by replacing those terms with other thesaurus terms conceptually related to them. Terms in a query are translated into the equivalent fuzzy terms and then redefined into fuzzy triples. For example, suppose a user tries to retrieve images containing furniture. Fuzzy terms such as table and chair are conceptually close to it, while fuzzy terms such as home appliance and audio are only weakly related. Therefore, to preserve precision when redefining the user query and thus enhance retrieval effectiveness, terms are not extended if the membership function takes any value below 0.5. This avoids extending terms that are semantically insignificant. In this experimental work, we aimed to test the retrieval effectiveness on a variety of image collections, using precision and recall measures. We calculate the mean precision and recall of retrieval over the four queries in each experiment. Note that the same queries are also posed to the six postgraduate students. If a student marks an image from the collection as related to the query, the image is considered relevant. The following standard definition is used for measuring recall and precision.
Definition 7. Let p be the number of all images that are relevant to the query, r be the total number of images retrieved, and q be the number of relevant images retrieved. Then Recall = R = (q/p) × 100 and Precision = P = (q/r) × 100. Table 1 gives the results from our experiments.
Table 1. Recall/Precision Table
We conducted four experiments on the same image collection by posing four different queries. Each entry in Table 1 is the mean recall and precision over the four queries in each experiment. Table 1 shows that retrieval with a concept-based match achieves better recall than retrieval without one, and that the recall is improved while precision is largely preserved. When using the concept-based match, the average recall and precision are approximately 68% and 85%, respectively. On the other hand, without the concept-based match, they are 34% and 90%, respectively. As a result, the overall improvement in average recall is 34 percentage points, at the cost of a 5 percentage point drop in average precision.
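A quick check of Definition 7 and of the averages reported above; the helper below is just the standard recall/precision computation, and the per-query counts used in the example are hypothetical.

```python
def recall_precision(p, r, q):
    """Definition 7: p = relevant images, r = images retrieved,
    q = relevant images retrieved. Returns (Recall %, Precision %)."""
    return 100.0 * q / p, 100.0 * q / r

# Hypothetical counts: 20 relevant images, 16 retrieved, 13 of them relevant.
print(recall_precision(p=20, r=16, q=13))   # (65.0, 81.25)

# Averages reported in the text: the concept-based match raises mean recall
# from 34% to 68% while mean precision drops from 90% to 85%.
print(68 - 34, 90 - 85)                     # 34 percentage points gained, 5 lost
```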
6
Conclusion and Further Work
In this paper, we developed an image retrieval technique based on a new data type called the fuzzy triple, which makes inexact and concept-based image retrieval possible. A k-weight function was introduced for the precise specification of spatial relationships with angle degrees. To support conceptually related image retrieval, we used a term thesaurus that represents the degrees of relationship among concepts as membership functions. The fuzzy triple provides a formal specification that is conceptually simple yet powerful, in that it makes concept-based image retrieval possible while accommodating current content-based image retrieval technologies that do not use triples. We used the term thesaurus to enhance retrieval effectiveness when performing a concept-based match between terms in a query and their counterparts in the fuzzy triples of target images. Thanks to the thesaurus, we achieved an improvement in recall while preserving precision.
As further research, complementary work on our technique is needed. First, the thesaurus introduced in this paper should be developed in greater detail, since it is a core component for redefining fuzzy terms. Moreover, in this experimental work we used only the central point of rectangular objects whose boundaries are parallel to the x- and y-axes; we should therefore also take the size of objects into account to analyze the spatial relationship between them exactly. Second, we should extend the retrieval strategies for extracting composite objects that carry significant semantics of images.
References
1. Bach, J.R., Charles F., Gupta, A., Hampapur, A., Horowitz, B., Humphrey, R., Jain, R., Shu, C-F.: The Virage Image Search Engine: An Open framework for image management. http://www.virage.com
2. Chang, C.C., Lee, S.Y.: Retrieval of Similar Pictures on Pictorial Databases. Pattern Recognition (1991) 675-680
3. Chang, S.K., Shi, Q.Y., Yan, C.W.: Iconic indexing by 2D string. IEEE Transactions on Pattern Analysis (1987) 413-428
4. Cook, C.R., Oldehoeft, R.: A Letter-Oriented Minimal Perfect Hashing Function. ACM SIGPLAN Notices 17 (1972) 18-27
5. Corridoni, J.M., Bimbo, A.D., Magistris, S.D.: Querying and Retrieving Pictorial Data Using Semantics Induced by Colour Quality and Arrangement. Proc. of the Int. Conf. on Multimedia Computing and Systems, Hiroshima, Japan (1996) 219-222
6. Lee, S.Y., Hsu, F.J.: Spatial Reasoning and Similarity Retrieval of Images Using 2D C-string Knowledge Representation. Pattern Recognition 25 (1992) 305-318
7. Pentland, A., Picard, R.W., Scaroff, S.: Photobook: Tools for Content-based Manipulation of Image Databases. Int. J. of Computer Vision (1996)
8. Salton, G., McGill, M.J. (eds): Introduction to Modern Information Retrieval. McGraw-Hill (1987)
9. Takahasi, T., Shima, N., Kishino, F.: An image retrieval method using queries on spatial relationships. J. of Information Processing 15 (1992) 441-449
10. Yang, J.D., Yang, H.J.: A Formal Framework for Image Indexing with Triples: Toward a Concept-Based Image Retrieval. To appear in Int. J. of Intelligent Systems (http://jiri.chonbuk.ac.kr/~jdyang) (1999)
11. Yang, J.D.: FMP: A Fuzzy Match Framework for Rule-Based Programming. Data & Knowledge Engineering 24 (1997) 183-203
12. Wand, M.P., Jones, M.C. (eds): Kernel Smoothing. Chapman & Hall (1995)
Variable-Bit-Length Coding: An Effective Coding Method
S. Sahni, B. C. Vemuri, F. Chen, and C. Kapoor
CISE Department, University of Florida, Gainesville, FL 32611
{sahni, vemuri}@cise.ufl.edu
Abstract. We propose a new coding scheme for lossless compression. This scheme, variable-bit-length coding, stores different grayscale values using a different number of bits. The compression performance of variable-bit-length coding coupled with a preprocessing method called remapping is compared with the compression performance of well-known coding methods such as Huffman coding, arithmetic coding, and LZW coding. On our benchmark suite of 10 images, remapping with variable-bit-length coding obtained maximum compression more often than did any other coding method.
1
Introduction
A variety of techniques, including coding, interpolation, and transform methods (e.g., wavelet methods), have been proposed for the lossless compression of two-dimensional images [1,3]. In this paper, we focus on coding methods alone. Coding methods such as Huffman coding [2], arithmetic coding [5], and Ziv-Lempel-Welch (LZW) coding [6,4] work only on one-dimensional data. Therefore, to compress a two-dimensional image using one of these coding schemes, we must first linearize the two-dimensional image. In this paper, we propose a new coding method, variable-bit-length (VBL) coding, which stores different gray values using a different number of bits. This coding method may be coupled with a mapping method that maps the gray scale values in an image into a contiguous set of values beginning at 0. The combination of mapping and VBL coding provides a coding scheme that often outperforms all popular coding schemes. In Section 2, we discuss linearization methods. Our mapping scheme is discussed in Section 3, and our variable-bit-length coding scheme is developed in Section 4. Experimental results comparing all the schemes discussed in this paper are presented in Section 5. The images used in our experimental studies include 6 natural images and 4 brain MR images. All of the images are 256 × 256 with 8 bits/pixel except the image sag1, which uses 12 bits/pixel.
This research was supported, in part, by the National Institutes of Health under grant R01LM05944-03.
2
Linearization Schemes
When coding schemes such as Huffman coding, arithmetic coding, and Ziv-Lempel-Welch coding are used to compress a two-dimensional image, the image must first be converted into a one-dimensional sequence. This conversion is referred to as linearization. Coding schemes such as Huffman coding depend only on the frequency of occurrence of different gray values. Since linearization does not affect this frequency, coding schemes in this category are unaffected by the particular method used to linearize a two-dimensional image. On the other hand, coding schemes such as arithmetic coding and Ziv-Lempel-Welch coding depend on the relative order of gray scale values and so are sensitive to the linearization method used. Natural images have local and global redundancy. Local redundancy causes a given neighborhood in the image to exhibit coherence or correlation (referred to as smoothness of data). Some linearization schemes are more effective at keeping pixels that are close in the two-dimensional image close in the one-dimensional sequence. These schemes are therefore more effective in preserving the local redundancy of the image and are expected to yield better compression when coupled with a coding scheme that can take advantage of local redundancy. Some of the more popular linearization schemes are given below; each scans the image pixels in some order to produce the one-dimensional sequence (a short sketch of two of these scans follows Table 1).
1. Row-Major Scan: The image is scanned row by row from top to bottom, and from left to right within each row.
2. Column-Major Scan: The image is scanned column by column from left to right, and from top to bottom within each column.
3. Diagonal Scan: The image is scanned along the antidiagonals (i.e., lines with constant row plus column value) beginning with the top-most antidiagonal. Each antidiagonal is scanned from the left bottom corner to the right top corner.
4. Snake-like Row-Major Scan: This is a variant of the row-major scan method described above. In this method, the image is scanned row by row from top to bottom, and the rows are alternately scanned from left to right and from right to left. The top-most row is scanned from left to right (as in fig. 1 (a)). Snake-like variants of column-major and diagonal scans can be defined in a similar manner (see fig. 1 (b)).
5. Spiral Scan: The image is scanned from the outside to the inside, tracing out a spiral curve starting from the top left corner of the image and proceeding clockwise (see fig. 1 (c)).
6. Peano-Hilbert Scan: This scan method is due to Peano and Hilbert, and is best described recursively as in fig. 1 (d). This method requires the image to be a 2^k × 2^k image. When k is odd, the scan path starts at the leftmost pixel of the first row and ends at the leftmost pixel of the bottom row. When k is even, the path starts at the leftmost pixel of the first row and ends at the right-most pixel of this row. In a Peano-Hilbert scan, the image is scanned
Fig. 1. (a) Snake-like row-major scan path; (b) snake-like diagonal scan path; (c) spiral scan path; (d) Peano-Hilbert scan path
quadrant by quadrant. The scan path for a 2^k × 2^k image for k = 1, 2, and 3 is shown in fig. 1 (d). To determine the effect of the linearization method on the compression ratio attained by a coding scheme, we compressed our test set of 10 images using both the Unix compression utilities gzip and compress. Both of these are based on the LZW coding method. Table 1 gives the compression ratios achieved by gzip. Different linearization schemes result in different compression ratios. For example, the compression ratios for man range from 1.31 to 1.40, and those for brain1 range from 1.63 to 1.75. Although no linearization method provided the highest compression for all images, the Peano-Hilbert scan did best most often.
Table 1. Comparison of different linearization schemes with/without mapping using Gzip

Image    row major  diagonal  snake  spiral  peano
lenna    1.18       1.18      1.17   1.19    1.20
man      1.39       1.31      1.39   1.40    1.39
chall    1.43       1.39      1.43   1.43    1.47
coral    1.32       1.26      1.32   1.28    1.29
shuttle  1.44       1.43      1.44   1.42    1.47
sphere   1.30       1.29      1.30   1.33    1.31
brain1   1.75       1.63      1.75   1.69    1.67
slice15  1.73       1.63      1.73   1.68    1.66
head     1.80       1.75      1.80   1.78    1.77
sag1     1.68       1.66      1.69   1.74    1.71
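To make the scan orders concrete, here is a small sketch of two of the linearization schemes listed above (snake-like row-major and spiral); the Peano-Hilbert scan is omitted for brevity.

```python
def snake_row_major(image):
    """Snake-like row-major scan: rows top to bottom, alternating direction."""
    seq = []
    for i, row in enumerate(image):
        seq.extend(row if i % 2 == 0 else reversed(row))
    return seq

def spiral(image):
    """Clockwise spiral scan starting at the top-left corner."""
    seq, rows = [], [list(r) for r in image]
    while rows:
        seq.extend(rows.pop(0))                      # top row, left to right
        if rows and rows[0]:
            for r in rows:                           # right column, downwards
                seq.append(r.pop())
        if rows:
            seq.extend(reversed(rows.pop()))         # bottom row, right to left
        if rows and rows[0]:
            for r in reversed(rows):                 # left column, upwards
                seq.append(r.pop(0))
    return seq

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(snake_row_major(img))  # [1, 2, 3, 6, 5, 4, 7, 8, 9]
print(spiral(img))           # [1, 2, 3, 6, 9, 8, 7, 4, 5]
```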
Table 2 gives the compression ratios achieved by compress on our sample suite of 10 images. Here too, we see a variation in the attained compression ratio depending on the linearization method used. Since compress did not provide better compression than gzip on any of our test images, we shall not report further results using compress.

Table 2. Comparison of different linearization schemes with/without mapping using the Compress Command

Image    row major  diagonal  snake  spiral  peano
lenna    1.07       1.08      1.07   1.11    1.10
man      1.30       1.23      1.31   1.32    1.32
chall    1.32       1.29      1.33   1.37    1.38
coral    1.21       1.15      1.21   1.21    1.20
shuttle  1.37       1.29      1.37   1.33    1.36
sphere   1.17       1.17      1.18   1.22    1.21
brain1   1.70       1.57      1.70   1.65    1.63
slice15  1.67       1.56      1.67   1.64    1.62
head     1.79       1.71      1.78   1.77    1.75
sag1     1.58       1.55      1.59   1.61    1.60

3
Gray Level Mapping
Generally, images do not use all the gray values within the dynamic range determined by the resolution (bits/pixel). For example, the pixels of an 8 bit per pixel image may use only 170 of the 256 possible gray values. In gray level mapping, the gray values of an image are mapped to the range 0 to n, where n + 1 is the number of gray values actually present. For example, suppose we have an image whose pixels have the gray levels 1, 3, 5, 7. These values are mapped to the range 0-3 using a mapping table map[0 : 3] = [1, 3, 5, 7]; map[i] can then be used to remap the gray value i to its original value. From the mapped image and the mapping table, we can reconstruct the original image. Gray level mapping can help to improve compression by increasing the correlation between adjacent pixel values. Coding methods such as Huffman coding and LZW coding are insensitive to the actual pixel values. Therefore, the compression ratios obtained by these methods cannot be improved using the mapping method. Consequently, gray scale mapping is not recommended as a preprocessor for either gzip or compress. However, arithmetic coding and the variable-bit-length coding method of Section 4 are sensitive to the actual values, and their compression performance is affected by mapping gray values from one range into another.
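A minimal sketch of the gray level mapping just described, producing both the mapped image and the table needed to undo the mapping.

```python
def gray_level_map(pixels):
    """Map the gray values actually present to 0..n and return the mapping
    table map[0:n+1] so the original image can be reconstructed."""
    table = sorted(set(pixels))                    # e.g. [1, 3, 5, 7]
    index = {g: i for i, g in enumerate(table)}    # 1->0, 3->1, 5->2, 7->3
    mapped = [index[g] for g in pixels]
    return mapped, table

def gray_level_unmap(mapped, table):
    """Reconstruct the original gray values: map[i] gives the original value."""
    return [table[i] for i in mapped]

pixels = [1, 7, 3, 3, 5, 1]
mapped, table = gray_level_map(pixels)
print(mapped, table)                     # [0, 3, 1, 1, 2, 0] [1, 3, 5, 7]
print(gray_level_unmap(mapped, table))   # [1, 7, 3, 3, 5, 1]
```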
Additionally, the performance of compression methods such as wavelet and predictive (or interpolating) schemes, which work directly on the two-dimensional image, is affected by the mapping method just described. Table 3 gives the compression ratios achieved by arithmetic coding on the suite of 10 images shown in fig. 2 and fig. 3. Using mapping as a preprocessor to arithmetic coding resulted in a slight improvement in the compression obtained for 4 of the 10 images; on one image, there was a slight reduction in the compression ratio.

Table 3. Comparison of arithmetic coding on row-major images with/without mapping

image    raw   mapped
lenna    1.13  1.13
man      1.23  1.22
chall    1.33  1.34
coral    1.27  1.28
shuttle  1.32  1.32
sphere   1.29  1.29
brain1   1.71  1.74
slice15  1.70  1.70
head     1.93  1.93
sag1     1.95  1.96

4
Variable Bit Length Coding
In a traditionally stored source file, each symbol is stored using the same number of bits (usually 8). In variable bit length (VBL) coding, different symbols are stored using a different number of bits. Suppose that our source file is a linearized image in which each pixel has a gray value between 0 and 255; the source file is then stored using 8 bits per pixel. However, the gray values 0 and 1 need only one bit each; the values 2 and 3 need two bits each; the values 4, 5, 6, and 7 need only three bits each; and so forth. In the VBL representation, each gray value is stored using the minimum number of bits it requires. To decode the compacted bit representation, we need to know the number of bits used for each pixel. Rather than store this number with each pixel, a run-length coding method is used: the pixels are divided into segments such that the gray values in each segment require the same number of bits. Unfortunately, for typical images the space needed to store the segment lengths, the bits per pixel in each segment, and the compacted gray values often exceeds the space needed to store the gray values in fixed length format. However, we can combine adjacent segments using the strategy given below. First, we summarize the steps in VBL coding.
Create Segments. The source symbols are divided into segments such that the symbols in each segment require the same number of bits. Each segment is a contiguous chunk of symbols, and segments are limited to 2^s symbols (best results were obtained with s = 8). If there are more than 2^s contiguous symbols with the same bit requirement, they are represented by two or more segments.
Create Files. Three files, SegmentLength, BitsPerPixel, and Symbols, are created. The first of these files contains the length (minus one) of the segments created in step 1. Each entry in this file is s bits long. The file BitsPerPixel gives the number of bits (minus one) used to store each pixel in the segment. Each entry in this file is d bits long (for gray values in the range 0 through 255, d is 3). The file Symbols is a binary string of the symbols stored in the variable bit format.
Compress Files. Each of the three files created in step 2 is compressed/coded to reduce its space requirements.
The compression ratio that we can achieve using VBL coding depends very much on the presence of long segments that require a small number of bits. Suppose that, following step 1, we have n segments. The length of a segment and the bits per pixel for that segment are referred to as the segment header. Each segment header needs s + d bits of space. Let li and bi, respectively, denote the length and bits per symbol of segment i. The space needed to store the symbols of segment i is li ∗ bi. The total space required for the three files created in step 1 is (s + d) ∗ n + Σ (i = 1 to n) li ∗ bi. The space requirements can be reduced by combining some pairs of adjacent segments into one. If segments i and i + 1 are combined, then the combined segment has length li + li+1, and each pixel now has to be stored using max{bi, bi+1} bits. Although this technique increases the space needed by the file Symbols, it reduces the number of headers by one. Let sq be the space requirement for an optimal combining of the first q segments, and define s0 = 0. For an instance with i > 0 segments, suppose that, in an optimal combining C, segment i is combined with segments i − 1, i − 2, ..., i − r + 1 but not with segment i − r. The space si needed by the optimal combining C is the space needed by segments 1 through i − r, plus lsum(i − r + 1, i) ∗ bmax(i − r + 1, i) + 11, where lsum(a, b) = Σ (j = a to b) lj and bmax(a, b) = max{ba, ..., bb}. If segments 1 through i − r were not combined optimally in C, then we could change their combining to one with a smaller space requirement and hence reduce the space requirement of C. So, in an optimal combining C, segments 1 through i − r must also be combined optimally. With this observation, the space requirement for C becomes si = si−r + lsum(i − r + 1, i) ∗ bmax(i − r + 1, i) + 11. The only possibilities for r are the numbers 1 through i for which lsum does not exceed 2^s (recall that segment lengths are limited to 2^s). Although we do not know which of these is the case, we do know that since C has minimum space requirement, r must yield the minimum space requirement over all choices. So we get the recurrence
si = min { si−r + lsum(i − r + 1, i) ∗ bmax(i − r + 1, i) } + s + d

where the minimum is taken over all r with 1 ≤ r ≤ i and lsum(i − r + 1, i) ≤ 2^s.
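A hedged sketch of this dynamic programming formulation, using s = 8 and d = 3 (an 11-bit header) as in the text; the segment-creation step is included for context, and the helper names are ours, not the paper's.

```python
def bits_needed(v):
    """Minimum number of bits to store gray value v (values 0 and 1 need 1 bit)."""
    return max(1, v.bit_length())

def make_segments(values, s=8):
    """Step 1: contiguous runs with equal bit requirement, each at most 2**s long."""
    segs = []                      # list of (length, bits)
    for v in values:
        b = bits_needed(v)
        if segs and segs[-1][1] == b and segs[-1][0] < 2 ** s:
            segs[-1] = (segs[-1][0] + 1, b)
        else:
            segs.append((1, b))
    return segs

def optimal_combined_space(segs, s=8, d=3):
    """Dynamic program for the recurrence: best[i] is the minimum space (in bits)
    for an optimal combining of the first i segments."""
    header = s + d
    n = len(segs)
    best = [0] + [float("inf")] * n
    for i in range(1, n + 1):
        lsum, bmax = 0, 0
        for r in range(1, i + 1):              # combine segments i-r+1 .. i
            length, bits = segs[i - r]
            lsum += length
            bmax = max(bmax, bits)
            if lsum > 2 ** s:
                break
            best[i] = min(best[i], best[i - r] + lsum * bmax + header)
    return best[n]

values = [0, 1, 1, 200, 190, 3, 2, 2, 2]
segs = make_segments(values)
print(segs)                          # [(3, 1), (2, 8), (4, 2)]
print(optimal_combined_space(segs))  # total bits after optimal combining
```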
Using this dynamic programming formulation, we can determine the optimal way to combine segments. Once this has been determined, the segments created in step 1 are combined and the three files of step 2 are created. Decoding is quite straightforward. Table 4 gives the compression ratios achieved by the VBL coding method. For this method, using mapping as a preprocessor can make a dramatic impact on the achieved compression. For example, the compression ratio achieved for brain1 linearized using a Peano-Hilbert scan is 1.42 without mapping and 1.86 with mapping. Notice that VBL coding did best on 9 of the 10 images when used in conjunction with mapping and the Peano-Hilbert linearization scheme.
Table 4. Comparison of different linearization schemes with/without mapping using VBL Coding

         row major    diagonal     snake        spiral       peano
Image    raw  mapped  raw  mapped  raw  mapped  raw  mapped  raw  mapped
lenna    1.10 1.16    1.10 1.16    1.10 1.16    1.12 1.19    1.13 1.21
man      1.32 1.35    1.28 1.31    1.32 1.35    1.33 1.37    1.37 1.39
chall    1.24 1.39    1.21 1.35    1.24 1.39    1.26 1.41    1.30 1.45
coral    1.23 1.33    1.19 1.29    1.23 1.33    1.22 1.32    1.26 1.37
shuttle  1.32 1.37    1.28 1.32    1.33 1.37    1.30 1.34    1.35 1.39
sphere   1.06 1.19    1.05 1.18    1.06 1.20    1.09 1.21    1.10 1.22
brain1   1.41 1.83    1.36 1.77    1.41 1.83    1.41 1.83    1.42 1.86
slice15  1.73 1.84    1.66 1.78    1.73 1.84    1.72 1.84    1.76 1.87
head     1.87 2.01    1.82 1.96    1.87 2.01    1.88 2.02    1.90 2.05
sag1     2.11 2.09    2.07 2.05    2.11 2.09    2.17 2.15    2.20 2.17

5
Summary Results
Table 5 gives the best compression ratio obtained by the various coding methods for each of our 10 test images. To obtain the best ratio, we applied each of the linearization schemes of Section 2 to both the raw and the mapped image. Results using the mapping preprocessor are indicated by *. VBL coding did best on 6 of our 10 images and gzip did best on the remaining 4. VBL coding remains best on 6 of the 10 images even if we consistently use the VBL method with mapping and the Peano-Hilbert linearization scheme.
Table 5. Comparison of the best results of the coding schemes

Image    Gzip  Compress  Huffman  Arithmetic  VBL
lenna    1.20  1.11      1.09     1.17        1.21*
man      1.40  1.32      1.09     1.25        1.39*
chall    1.47  1.38      1.17     1.34*       1.45*
coral    1.32  1.21      1.16     1.28*       1.37*
shuttle  1.47  1.37      1.18     1.38        1.39*
sphere   1.33  1.22      1.25     1.30        1.22*
brain1   1.75  1.70      1.60     1.80*       1.86*
slice15  1.73  1.67      1.57     1.79        1.87*
head     1.80  1.79      1.79     2.01        2.05*
sag1     1.74  1.61      1.60     2.08*       2.20

6
Conclusions
We have proposed the use of a mapping preprocessor and a new coding method called VBL coding. Our experiments show that the mapping preprocessor often enhances the performance of VBL coding, and that the VBL coding scheme coupled with mapping and the Peano-Hilbert linearization method often outperforms the Huffman and arithmetic coding methods when applied to raw or mapped linearized images.
References
1. R. J. Clarke, Digital Compression of Still Images and Video, New York, Academic Press, 1995. 665
2. D. Huffman, "A Method for the Construction of Minimum Redundancy Codes", Proc. IRE, Vol. 40, pp. 1098-1101, 1952. 665
3. Weidong Kou, Digital Image Compression - Algorithms and Standards, Kluwer Academic Publishers, 1995. 665
4. T. Welch, "A Technique for High-Performance Data Compression", IEEE Computer, June 1984, 8-19. 665
5. I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression", Commun. ACM, vol. 30, pp. 520-540, June 1987. 665
6. J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Trans. on Information Theory, Vol. 24, No. 5, pp. 530-536, 1978. 665
Fig. 2. Natural Images (a) lenna (b) man (c) chall (d) coral (e) shuttle (f) sphere
Fig. 3. Medical Images (a) brain1 (b) brain2 (c) slice15 (d) sag0
Block-Constrained Fractal Coding Scheme for Image Retrieval
Zhiyong Wang 1,2, Zheru Chi 1, Da Deng 2, and Yinlin Yu 2
1 Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong. Tel: (852) 2766 6219, Fax: (852) 2362 8439, [email protected]
2 Department of Electronic and Communication Engineering, South China University of Technology, GuangZhou City, 510641, GuangDong Province, P. R. China
Abstract. Fractal coding has proved useful for image compression. In this paper, we present a block-constrained fractal coding scheme and a matching strategy for content-based image retrieval. In our coding scheme, an image is partitioned into non-overlapping blocks of a size close to that of an iconic image, and fractal codes are generated for each block independently. In the similarity measure of fractal codes, an improved nona-tree decomposition scheme is adopted to avoid matching the fractal codes globally, in order to reduce the computational complexity. Our experimental results show that our approach is effective for image retrieval. Keywords: Fractal coding, Image coding, Iterated function systems, Content-based image retrieval.
1
Introduction
In recent years, more and more applications, such as digital libraries, geographical map and medical image management, require effective and efficient means of accessing images based on their true contents. Many retrieval approaches based on visual features, such as shape, color, texture, and the spatial structure of images, have been proposed [1,2]. In spite of a few reported application systems, visual feature extraction in general remains a challenging problem. In fact, in order to describe an image precisely, we must seek schemes that extract the intrinsic features of images, a task which is quite similar to image compression to some extent. The task of image compression is to eliminate the redundancy in an image so that the image can be represented with compact codes. Fractal coding provides a promising approach for representing the image content with compact codes. In addition, the spatial relationship among the various objects in an image is reflected in the transformation codes by the nature of the coding scheme. It is recognized that the spatial structure of an image provides very important information about its content. By using fractal codes, we can represent an image without extracting visual features explicitly.
The potential of fractal image compression for image retrieval was observed by Sloan [3]; however, the computational complexity of his approach is high. Further investigations made by Zhang et al. indicated that fractal coding is effective for content-based image retrieval [4,5]. For the purpose of image retrieval, we propose a modified fractal coding scheme, termed block-constrained fractal coding, which constrains the domains to be searched for a given range to the block containing that range, thereby avoiding a global domain search. With this treatment, the computational complexity is reduced significantly, so that on-line fractal coding can be implemented. In this paper, we also propose a realistic matching strategy for content-based image retrieval based on the proposed block-constrained fractal coding. The organization of this paper is as follows. In the next section, we present our block-constrained fractal coding scheme. Section 3 discusses the similarity measure between the fractal codes of two images. A matching strategy is presented in Section 4. Experimental results with discussions are given in Section 5. Finally, concluding remarks are drawn in Section 6.
2
Block-Constrained Fractal Coding
Since Barnsley recognized the potential of Iterated Function Systems (IFS) for computer graphics applications and proposed fractal image compression [6], more and more attention has been drawn to this promising approach [7,8,9]. Fractal image compression is based on the mathematical results of IFS; a formal mathematical description of IFS can be found in Jacquin's paper [9]. In fractal image coding, for each given range R within an image, the fractal encoder seeks a domain D of the same image such that, according to a certain metric such as the mean-squared error, the transformation W(D) is the best approximation of the range R. In matrix form, the transformation Wi between range Ri and domain Di is determined by

        ( x )   ( ai  bi  0  ) ( x )   ( ei )
   Wi   ( y ) = ( ci  di  0  ) ( y ) + ( fi )
        ( z )   ( 0   0   si ) ( z )   ( oi )

where si controls the contrast, oi controls the luminance offset, and z denotes the pixel gray value at position (x, y) in Di. According to fractal theory, W must be contractive. Following the above coding scheme, we can represent an image with a set of transformations, termed fractal codes, from which the image can be reconstructed in a few iterations. Obviously, the above fractal coding scheme is extremely time consuming, although it can achieve a large compression ratio. The main part of the computing time is spent on searching the domains globally in order to find the best match for a range. Many improvements have been proposed to reduce the encoding time [10,7,8]. However, the computation time is still a big obstacle for practical applications of fractal coding in image retrieval.
Obviously, the smaller an image is, the more quickly its fractal codes can be generated. The most effective way of reducing the search time is to reduce the number of domains to be searched. For a given range, we propose to constrain the candidate domain blocks to a region that contains the range. We term our approach the block-constrained fractal coding scheme. The image is first partitioned into non-overlapping blocks of equal size, where the block size is not smaller than four times the area of the range. Each of these blocks is encoded independently. If the sizes of the block and the range are selected properly, the encoding operation can be performed in real time, and therefore no extra information needs to be stored in the image database. Block-constrained fractal coding thus has two clear advantages: shorter computing time and less storage space (all codes can be obtained in real time). In Section 3, the effectiveness and efficiency of our method are discussed.
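A hedged sketch of the block-constrained search: each block is encoded independently, and the candidate domain for every range is drawn only from the block that contains it. The brute-force least-squares fit, the omission of isometries and quantization, and the single-domain choice (N = 2S, the setting used later in Section 4.2) are simplifications of ours, not the paper's exact encoder.

```python
import numpy as np

def encode_block(block, range_size=8):
    """Encode one constrained block: search domains inside the same block only
    (with N = 2*S there is a single domain, the block averaged down to S x S)."""
    codes = []
    S = range_size
    domain = block.reshape(S, 2, S, 2).mean(axis=(1, 3))   # 2x2 averaging
    d = domain.flatten()
    A = np.vstack([d, np.ones_like(d)]).T
    for i in range(0, block.shape[0], S):
        for j in range(0, block.shape[1], S):
            r = block[i:i + S, j:j + S].flatten()
            # Least-squares contrast s and offset o such that r ~ s*d + o.
            (s, o), *_ = np.linalg.lstsq(A, r, rcond=None)
            codes.append((i, j, round(float(s), 2), round(float(o), 1)))
    return codes

def encode_image(image, block_size=16):
    """Partition the image into non-overlapping blocks and encode each one."""
    codes = {}
    for bi in range(0, image.shape[0], block_size):
        for bj in range(0, image.shape[1], block_size):
            block = image[bi:bi + block_size, bj:bj + block_size].astype(float)
            codes[(bi, bj)] = encode_block(block)
    return codes

image = np.random.randint(0, 256, (128, 128))
print(len(encode_image(image)))   # 64 blocks of 16 x 16, four range codes each
```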
3
Similarity Measurement with Fractal Codes
By the scheme discussed in the last section, the image content can be uniquely determined by the fractal codes. As discussed in [4], more identical fractal codes suggest more identical blocks, and therefore the two images are more similar. Consequently, the percentage of identical fractal codes from two images, termed the matching rate MR, can measure the similarity of two images, with a higher matching rate indicating a greater degree of similarity. The definition of the matching rate is reasonable and very simple. Sloan proposed to use the straightforward juxtaposition of an iconic image with each image in the database [3]. In his approach, the similarity between two images is assessed by noting the frequent choice of a block in one image as the domain for another image. However, the computational complexity of Sloan's coding scheme is high. Matching the fractal codes of an iconic image against the segments of a database image depends very much on subimages that are actually not identical in the two images and is therefore not reliable [4]: many ranges may choose their domains outside the intersection of the two images. As a result, the retrieval might miss some database images that are actually similar to the given icon. In the joint fractal coding scheme presented by Zhang et al. [5], a weighted similarity measure is proposed; however, the weights need to be adjusted carefully or determined by experiments in order to improve its reliability. In our approach, we propose to measure the similarity of two images directly based on the matching rate of their fractal codes, which can be computed more efficiently. Although the definition of our matching rate is based on Zhang's similarity measure and has a similar disadvantage, the block-constrained coding scheme makes the number of identical blocks as large as possible, so that the matching rate reflects the similarity between images better.
4
Matching Strategy for Image Retrieval
A simple approach for image retrieval with fractal coding is to encode both a database image and an iconic image and then compute the matching rate
between the fractal codes of the two images. This direct method may work correctly under ideal conditions, but the following two situations must be dealt with separately.
1. The iconic image is not aligned properly with the blocks of a database image, as shown in Figure 1. In this case, the two images may have a very low matching rate even though their common subimage is very large.
2. Two fractal codes may be identical although their range and domain blocks are completely unrelated, because each block is encoded independently in the block-constrained coding scheme.
For situation 1, we tailor some pixels off the iconic image, which is similar to the tailoring operation in [4]; the only difference is that we align the constrained block properly, whereas Zhang et al. align the range block. An example of tailoring an iconic image is shown in Figure 1. The original iconic image is a subimage of a database image (shadowed); however, the two images do not have identical blocks. After the black region of the iconic image is removed, the tailored iconic image has four blocks that are identical to blocks of the database image. We can conclude that if the icon is a subimage of a database image, there must be a tailored icon with block(s) that are identical to block(s) in the database image. In this paper, we assume that the size of an iconic image is not smaller than the size of four blocks.
Fig. 1. Tailoring of an iconic image: (a) a database image; (b) an iconic image; (c) the iconic image after tailoring.
4.1
Matching Strategy
As mentioned above, with our block-constrained fractal coding scheme it is not unusual for two unrelated ranges to have identical fractal codes. It is therefore not wise to match the fractal codes of the iconic image globally against those of the database image. We can reduce the number of irrelevant matches by constraining the matching to segments of the database image whose size is similar to the icon size. As a compromise between higher matching accuracy and faster retrieval, the improved nona-tree decomposition scheme is adopted [2]. We assume that the root, that is, the whole image, is at the 0-th level. There are then at most (2^(i+1) − 1)^2 segments instead of 9^i segments at the i-th level. For example, there are 81 segments of size 64 × 64 in a 256 × 256 image with the nona-tree decomposition scheme [4], but there are only 49 segments with our
improved nona-tree decomposition [2]. As a result, the retrieval time can be shortened significantly. In the approach proposed by Zhang et al. [4,5], all segments along a branch from the root to a leaf have to be encoded. In our approach, if we choose the size of the constrained blocks properly, a database image is encoded only once with the partitioned blocks, because we can obtain the fractal codes of every segment from the structural relationships of the nona-tree. That is, our approach reduces the retrieval time. In the matching process, we can obtain the matching rate between each segment of a database image and the iconic image without decomposing the image into segments explicitly, again by using the structural relationships of the nona-tree. We actually perform matching between the whole image and the iconic image. When a matched fractal code exists, we determine which segment(s) the corresponding range is located in, according to the location of the range, and then accumulate its contribution to the matching rate of the corresponding segment(s). That is, the matching rate between each segment of the database image and the icon can be obtained by matching the fractal code of each range in the image only once, which reduces the retrieval time significantly. We take the highest matching rate between the icon and the segments as the matching rate between the icon and the image. When performing matching between a pair of blocks, we only compare their fractal codes and need not compare the blocks pixel by pixel, which reduces the computing time.
4.2
Image Retrieval
When performing retrieval, we make the block size equal to the domain size. Let the range size be S × S and the block (domain) size N × N with N = 2 × S; this means that for each block we only need to find the fractal codes of four ranges against a single domain. We observe that the main part of the time is spent on comparing fractal codes, which depends on the number of ranges. We can reduce the number of ranges by increasing the range size. However, a larger block size resulting from an increased range size (the area of the former is four times that of the latter) will increase the number of comparisons needed to deal with the alignment problem discussed in Section 4. In our experiments, we set S = 8, so N = 16. In general, the size of the segments in an image is set to 64 × 64 or 32 × 32. If the size of the iconic image is close to 32 × 32, then the segment size is set to 32 × 32; otherwise, it is set to 64 × 64, because the larger the segments, the more irrelevant matches are involved. When retrieval with a larger iconic image is requested, the iconic image is first decomposed into several segments of size 64 × 64, and the average matching rate of all the iconic segments is then taken as the matching rate of the whole iconic image. We can thus perform retrieval using an iconic image of any size. In the following discussion, we assume that the iconic image need not be decomposed. Figure 2 shows the retrieval process of our scheme.
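A hedged sketch of the matching-rate computation over segments. The code representation (keyed by range position) and the loose notion of "identical code" used here are simplifying assumptions, and the explicit per-segment dictionaries stand in for the implicit nona-tree bookkeeping described above.

```python
def matching_rate(segment_codes, icon_codes):
    """MR: fraction of the icon's fractal codes that also appear, at the same
    relative range position, among the segment's codes (a simplified reading)."""
    matches = sum(1 for pos, code in icon_codes.items()
                  if segment_codes.get(pos) == code)
    return matches / len(icon_codes)

def best_segment(image_codes_by_segment, icon_codes):
    """Take the highest matching rate over all segments as the image's score."""
    return max(matching_rate(seg, icon_codes)
               for seg in image_codes_by_segment.values())

# Hypothetical codes keyed by range position within the segment / icon.
icon = {(0, 0): (0.5, 10.0), (0, 8): (0.7, 3.0),
        (8, 0): (0.5, 10.0), (8, 8): (0.6, 1.0)}
seg_a = dict(icon)                                   # identical segment
seg_b = {**icon, (8, 8): (0.9, 2.0)}                 # one differing code
print(best_segment({"a": seg_a, "b": seg_b}, icon))  # 1.0
```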
Fig. 2. Flow chart of our image retrieval scheme.
5
Experimental Results and Discussion
In our experiments, 40 grey scale images of natural scenery were first tailored to the size of 128 × 128 to evaluate our retrieval approach. Their scaled-down versions are shown in Figure 3. Iconic images of different sizes were extracted from these images. The experimental results show that the images containing an iconic image always obtain the highest matching rate, as expected.
Fig. 3. Test images.
Fig. 4. (a) The iconic image I1; (b) the retrieved image M1.
Let us consider an iconic image I1 of size 64 × 64 extracted from the image M1 as shown in Figure 4. I1 starts at location (28, 28) of M1. M1 was implicitly
decomposed into 9 segments of size 64 × 64 when matching fractal codes. With the first four rows and the first four columns removed, the highest matching rate 0.609 between the icon I1 and image M1 is obtained.
Fig. 5. (a) Iconic image I2; (b)-(d) retrieved images.
Another experimental result is shown in Figure 5. The icon I2, of size 60 × 60, is extracted from the image M2 at starting location (25, 25). The first three retrieved images, with the top matching rates of 0.796, 0.469, and 0.453, are shown in Figures 5 (b), 5 (c), and 5 (d), respectively. All of these images have a similar subimage of sky, which indicates that our retrieval scheme can also retrieve similar images based on an iconic image.
Fig. 6. (a) Iconic image I3; (b) retrieved images.
Our method can also handle large icons. Figure 6 shows the experimental result for the icon I3 of size 100 × 100, taken from the database image M3 at starting location (28, 28); the matching rate is 1.0. The above experimental results indicate that our approach is effective for image retrieval.
6
Conclusion
In this paper we have presented a block-constrained fractal coding scheme and a matching strategy for image retrieval. Based on the proposed coding scheme, the encoder can obtain the fractal codes of an image in real time, which eliminates the
necessity of storing these fractal codes in the database. The matching strategy based on an improved nona-tree decomposition scheme also makes the retrieval process more efficient. Our retrieval approach obtains satisfactory retrieval results and compares favorably with two other methods, a pixel-matching-based method and the method proposed by Zhang et al. However, the retrieval time for a large database is still too long, and more research will be pursued in order to improve the efficiency of the method.
Acknowledgment The work described in this paper was substantially supported by a grant from the Hong Kong Polytechnic University (Project No. P173).
References
1. M. D. Marsicoi, L. Cinque, and S. Levialdi. Indexing pictorial documents by their content: A survey of current techniques. Image and Vision Computing, 15, 1997. 673
2. Edward Remias, Gholamhosein Sheikholeslmai, and Aidong Zhang. Block-oriented image decomposition and retrieval in image database systems. In Proceedings of the 1996 International Workshop on Multi-media Database Management Systems, pages 85-92, Blue Mountain Lake, New York, Aug. 1996. 673, 676, 677
3. A. D. Sloan. Retrieving database contents by image recognition: New fractal power. Advanced Imaging, 9(5), 5 1994. 674, 675
4. Aidong Zhang, Biao Cheng, and Raj Acharya. An approach to query-by-texture in image database systems. In Proceedings of the SPIE Conference on Digital Image Storage and Archiving Systems, pages 338-349, Philadelphia, USA, Oct. 1995. 674, 675, 676
5. Aidong Zhang, Biao Cheng, Raj Acharya, and Raghu Menon. Comparison of wavelet transforms and fractal coding in texture-based image retrieval. In Proceedings of the SPIE Conference on Visual Data Exploration and Analysis III, San Jose, Jan. 1996. 674
6. M. F. Barnsley and L. P. Hurd. Fractal image compression. AK Peters, Wellesley, Mass, 1993. 674
7. A. E. Jacquin. Fractal image coding: A review. Proceedings of the IEEE, 80(10), Oct. 1993. 674
8. Y. Fisher. Fractal image compression: theory and application. Springer-Verlag, New York, 1995. 674
9. A. E. Jacquin. Image coding based on a fractal theory of iterated contractive image transformation. IEEE Transactions on Image Processing, 1(1), Jan. 1992. 674
10. D. Saupe and R. Hamzaoui. A review of the fractal image compression literature. Computer Graphics, 28(4), 1994. 674
Efficient Algorithms for Lossless Compression of 2D/3D Images
F. Chen, S. Sahni, and B. C. Vemuri
CISE Department, University of Florida, Gainesville, FL 32611
{sahni,vemuri}@cise.ufl.edu
Abstract. We propose enhancements to the 2D lossless image compression method embodied in CALIC. The enhanced version of CALIC obtained better compression than obtained by CALIC on 37 of the 45 images in our benchmark suite. The two methods tied on 6 of the remaining 8 images. This benchmark suite includes medical, natural, and man-made images. We also propose a lossless compression method for 3D images. Our method employs motion estimation and obtained better compression than competing wavelet-based lossless compression methods on all 8 3D medical images in our benchmark suite.
1
Introduction
Image compression plays a very important role in applications like tele-videoconferencing, remote sensing, document and medical imaging, and facsimile transmission, which depend on the efficient manipulation, storage, and transmission of binary, gray scale, and color images. Image compression techniques may be classified as either lossy or lossless. Lossy compression methods are able to obtain high compression ratios (size of original image / size of compressed image), but the original image can be reconstructed only approximately from the compressed image. Lossless compression methods obtain much lower compression ratios than obtained by lossy compression methods. However, from the compressed image, we can recover the exact original image. In this paper, we are concerned solely with lossless compression of 2D and 3D images. A recent survey and evaluation of coding, spatial, and transform methods for 2D image compression appears in [3]. As concluded in [3], the best compression ratios are obtained by the compression system called CALIC [1]. CALIC is a context based adaptive lossless image coder that was developed by Wu and Memon [1]. CALIC uses a gradient-based non-linear prediction to get a lossy image and a residual image. Then, it uses arithmetic coding to encode the residuals based on the conditional probability of the symbols in different contexts. CALIC also has a mechanism to automatically trigger a binary mode which is
This research was supported, in part, by the National Institutes of Health under grant R01LM05944-03. Contact Author: Sartaj Sahni (phone: 352-392-1527, fax: 352-392-1220)
used to code uniform and/or binary subregions of the image. In this paper, we propose enhancements to CALIC. Although these enhancements do not affect the runtime of CALIC, they result in a greater amount of compression. We also propose a motion-based lossless compression scheme for 3D images. This scheme utilizes the 2D image registration algorithm of [2]. Although we also generalize CALIC so that 3D images may be compressed, this generalization does not result in better compression than obtained by our motion-based compression method. For the experimental evaluation of our enhanced CALIC method (ECALIC), we use a benchmark suite of 45 2D images. This suite includes 11 ISO test images, 16 medical images, 9 NASA images, and a mixed bag of 9 additional images. Most of the images are 256 × 256 8-bit gray scale images. A few are of a different size, and one image has a fair amount of text embedded. For the evaluation of the 3D compression methods, we use the suite of 8 medical images used in [9]. These images have a resolution of 8 bits/pixel and are composed of a varying number of slices (between 16 and 192). Each slice is a 256 × 256 image.
2
Enhanced CALIC
2.1
Adaptive Binary Mode
CALIC switches into a binary mode when it is in a pixel neighborhood where the pixels have at most two distinct values. Although no compression overhead is incurred to switch into the binary mode, an escape symbol is inserted into the compressed file when switching out of binary mode. The number of pixels coded between the time CALIC switches into binary mode and the time it switches out of this mode is called the length of the binary mode. When the length of the binary mode is large, the overhead of the escape symbol is compensated for by the savings obtained from being in binary mode. However, when the length of the binary mode is small, better compression is obtained by not switching into binary mode. Thus, binary mode improves the compression performance only for uniform or nearly uniform images and for natural images, which are expected to have large binary mode lengths. CALIC works well for images with large smooth areas because the average length of the binary mode is large and CALIC can amortize the overhead of encoding escape symbols. CALIC also works well for many textured images like lenna, which have very few nearly uniform areas, for which binary mode is not triggered and hence no overhead is incurred. When an image contains a fast changing textured background or when an image has a lot of noise, binary mode is triggered thousands of times, resulting in very short binary mode lengths. For these images, the overhead of encoding an escape symbol each time we exit binary mode degrades the compression performance of CALIC. When CALIC is in binary mode, the context model of prediction errors is not updated. This means that no prediction occurs on binary mode pixels, and thus no prediction error is accumulated in the model for the future computation of e(Q(∆), B). Therefore, for normal images where the average length of the binary mode is small, we lose numerous opportunities to train the context model via
omission of the prediction step in binary mode. Since the context model can only be updated on the fly, the earlier the model reaches its steady state, the more accurate the subsequent error estimations will be. The above analysis leads us to introduce a preprocessing step in which we scan the image to decide whether or not to enable binary mode globally. This decision is made by computing the average length of the binary mode and the number of escape symbols that would need to be introduced. If the average length of the binary mode is less than 3, or if there are too many escapes (≥ 2% of the total number of pixels in the image), then binary mode is disabled during compression. If binary mode is enabled, the context models are also updated when the binary mode length is less than 5, so as to increase the chances of early training of the context model.
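A minimal sketch of the global enable/disable decision described above, assuming a prior scan of the image has produced the list of binary-mode run lengths.

```python
def enable_binary_mode(binary_run_lengths, total_pixels):
    """Disable binary mode globally if the average run length is < 3 or the
    number of escape symbols (one per run) reaches 2% of the pixels."""
    if not binary_run_lengths:
        return False                       # binary mode never triggered
    escapes = len(binary_run_lengths)
    average = sum(binary_run_lengths) / escapes
    return average >= 3 and escapes < 0.02 * total_pixels

# Example: many very short runs on a noisy 256 x 256 image -> disabled.
print(enable_binary_mode([2, 1, 3, 2] * 500, 256 * 256))   # False
```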
2.2
Enhanced Error Prediction
In CALIC, the context for predicting the error e[i, j] = I[i, j] − İ[i, j] is defined as a tuple C(Q(∆), B), where ∆ is the least squares estimator of the prediction errors and B is the texture pattern of the neighbors. Since the errors associated with the neighbors are also good indicators of the current prediction error, we add one more parameter to the context model, namely the average of the prediction errors at the neighbors. This average is defined as eneighbor = |ew + en|/2, where ew and en are the prediction errors of the neighbors to the west and north of the current pixel. The context now becomes a triplet, C(Q(∆), B, eneighbor). eneighbor is quantized into five levels, with the quantizer cutoffs set to (3, 8, 15, 45). The number of contexts therefore becomes five times that used in CALIC.
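A sketch of the extra context component: the neighbor-error average quantized into five levels with the cutoffs (3, 8, 15, 45) given above. The way the triplet is packed into a single index, and the sizes used for the other two components, are illustrative assumptions, not CALIC's actual values.

```python
import bisect

E_CUTOFFS = (3, 8, 15, 45)   # quantizer cutoffs from the text -> 5 levels

def quantize_neighbor_error(ew, en):
    """e_neighbor = |ew + en| / 2, quantized to a level in 0..4."""
    e_neighbor = abs(ew + en) / 2
    return bisect.bisect_right(E_CUTOFFS, e_neighbor)

def context_index(q_delta, texture_b, ew, en, num_q=8, num_b=256):
    """Illustrative flat index for the triplet C(Q(delta), B, e_neighbor)."""
    level = quantize_neighbor_error(ew, en)
    return (level * num_q + q_delta) * num_b + texture_b

print(quantize_neighbor_error(4, 5))        # |4+5|/2 = 4.5 -> level 1
print(context_index(3, 0b10110001, 4, 5))
```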
2.3
Adaptive Histogram Truncation
CALIC codes the final errors, e = I − İ − ē, where ē is the conditional error estimate ē(Q(∆), B), using arithmetic coding. Since large errors occur with small frequency, we have a severe zero-frequency problem in the arithmetic coding. This is particularly true for sharp conditional error probabilities p(e|δ) with small δ = Q(∆). CALIC uses a histogram tail truncation method to reduce the number of symbols in the arithmetic coding. It limits the size of each conditional error histogram to some value Nd, 1 ≤ d < L, such that a majority of the errors to be coded under the coding context δ = Q(∆) fall into the range of Nd. For any error e in this coding context that is larger than Nd, a truncation occurs and the symbol Nd is used to encode the escape character that represents the truncation. Following this, the truncated information e − (Nd − 1) is assimilated into the following class of prediction errors with the coding context (δ + 1). In CALIC, Nd, 1 ≤ d < L, is fixed and selected empirically. Instead of using a fixed histogram tail truncation, we propose an adaptive histogram tail truncation that truncates the low-frequency tail of the histogram according to the current error histogram. This eliminates the unnecessary zero frequency counts of symbols during the arithmetic coding without incurring a large overhead. Theoretically, there is an optimal point where truncation yields the best compression ratio. But, since it is difficult to analyze quantitatively the
relationship between the frequency and the codeword length of a given symbol in adaptive arithmetic coding, we use a simple criterion to obtain a truncation point. Specifically, we first use a two-dimensional counter X to count the frequency of an error in a certain context; for example, X(δ, val) represents the number of errors that have the value val in the context δ. Then, prior to the encoding, we search each histogram from the low-frequency end (the tail) to the high-frequency end. Whenever we encounter a frequency greater than a threshold T, we stop and set the truncation point. Since the error histograms are usually monotonically decreasing, such a truncation retains the significant entries of each histogram and cuts off the histogram tail with very low frequency entries. Although using such a simple criterion might seem inadequate for optimal truncation, experiments indicate that this criterion is more effective and efficient than other criteria such as the entropy.
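A sketch of the truncation-point search just described: scan the conditional error histogram from its low-frequency tail toward the head and cut where the frequency first exceeds a threshold T (the threshold value below is a placeholder, not the paper's setting).

```python
def truncation_point(histogram, threshold):
    """histogram[val] = frequency of error value `val` in one coding context.
    Scan from the tail (largest values) toward the head and return the first
    index whose frequency exceeds `threshold`; symbols above it are truncated."""
    for val in range(len(histogram) - 1, -1, -1):
        if histogram[val] > threshold:
            return val
    return 0

# Example: a roughly monotone histogram with a long sparse tail.
hist = [500, 300, 120, 60, 22, 9, 4, 1, 0, 1, 0, 0, 1]
print(truncation_point(hist, threshold=5))   # 5 -> keep values 0..5
```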
2.4
Experimental Evaluation of Enhanced CALIC
Enhanced CALIC (ECALIC) includes adaptive binary mode, enhanced error prediction, and adaptive histogram truncation. We have compared the compression effectiveness of ECALIC with that of competing lossless compression methods, including JPEG [4], LOCO-I/JPEG-LS [8], S+P [6], and CALIC [1]. JPEG and LOCO-I/JPEG-LS are the ISO 2D lossless image compression standards. S+P is an effective transform-domain algorithm based on the principle of the wavelet transform. CALIC is the context-based spatial-domain algorithm upon which ECALIC is based; as reported in [3], CALIC gives the best compression of the known lossless compression methods. For our experiments, each of the cited methods was applied both to the raw images in our 45-image benchmark suite and to a preprocessed version of these images. The preprocessing [5] maps the grayscale values that occur in an image into a contiguous range of integral values; the mapping table is stored along with the compressed image to enable reconstruction of the original image. Because of space considerations, detailed compression results are presented for only 20 of our 45 test images (Table 1). The best compression result for each image is marked with a †. The performance of the mapping preprocessor is inconsistent: on some images (e.g., the ocean images w6 and w7), the preprocessing doubles the compression ratio obtained by CALIC and increases the ratio for ECALIC by 75%; on other images (e.g., man), there is a small reduction in the compression ratio. ECALIC and CALIC generally provide a higher compression ratio than any of the other schemes. In fact, CALIC and ECALIC were outperformed on only one of our test images (fprint). CALIC and ECALIC performed better than S+P and JPEG on all the test images. They performed better than LOCO-I/JPEG-LS on 42 of the 45 images. ECALIC did better than CALIC on 37 of our 45 images; CALIC did slightly better than ECALIC on only 2 of our 45 images; the two methods tied on the remaining 6 images.
Table 1. Performance of 2D Compression Schemes

Image     JPEG         JPEG-LS      S+P          CALIC        ECALIC
          w     w/o    w     w/o    w     w/o    w     w/o    w     w/o
lenna     1.56  1.56   1.75  1.76   1.76  1.77   1.82  1.83†  1.82  1.82
man       2.02  2.04   2.69  2.71   2.63  2.65   2.74  2.77   2.77  2.80†
chal      1.93  1.59   2.28  1.85   2.21  1.79   2.35† 1.90   2.35† 1.91
coral     1.67  1.49   1.88  1.64   1.87  1.64   1.93† 1.68   1.93† 1.68
shuttle   1.89  1.80   2.19  2.05   2.13  2.01   2.24† 2.12   2.24† 2.11
sphere    1.58  1.59   1.84  1.85   1.84  1.85   1.89  1.89   1.91  1.92†
mri       2.16  2.16   2.67  2.69   2.76  2.76   2.76  2.78   2.87  2.89†
ct        3.46  3.51   6.03  6.05   5.27  5.27   6.69  6.71   6.92  6.95†
air1      1.34  1.34   1.45  1.45   1.44  1.45   1.49† 1.49†  1.47  1.48
finger    1.34  1.34   1.41  1.41   1.45  1.45   1.46  1.47†  1.47† 1.47†
cmpnd1    2.91  2.87   5.76  6.12   3.32  3.33   6.19  6.19   6.27  6.32†
heart     2.01  2.03   2.83  2.88   2.66  2.70   3.01  2.96   3.07† 3.02
brain1    1.86  1.49   2.09  1.63   2.16  1.68   2.21  1.71   2.25† 1.71
head      1.92  1.92   2.15  2.16   2.20  2.20   2.26  2.23   2.31† 2.26
slice0    2.16  2.17   2.52  2.52   2.54  2.54   2.65  2.60   2.73† 2.66
skull     2.00  2.01   2.79  2.82   2.57  2.59   2.91  2.94   2.84  2.97†
carotid   2.73  2.76   4.43  4.51   3.95  4.01   4.67  4.75   4.78  4.88†
aperts    3.43  3.47   6.81  6.95   5.79  5.89   6.97  7.13   7.33  7.50†
w6        4.63  2.41   6.66  2.69   5.15  1.76   7.45  4.24   7.68† 4.69
w7        4.66  2.41   6.77  2.69   5.21  1.77   7.22  3.83   7.52† 4.22
fprint    1.72  1.36   2.42† 1.77   1.88  1.39   1.99  1.47   2.02  1.49
w and w/o indicate the algorithm with and without mapping
3
Lossless 3D Image Compression
Several of today’s diagnostic imaging techniques, such as computed tomography (CT), magnetic resonance (MR), positron emission tomography (PET), and single photon emission computed tomography (SPECT), produce a three-dimensional volume of the object being imaged, represented by multiple two-dimensional slices. These images may be compressed independently on a slice-by-slice basis. However, such a two-dimensional approach does not exploit the dependencies that exist among all three dimensions. Since the image slices are cross sections that are adjacent to one another, they are partially correlated. An alternative approach to compressing the sequence of slices is to view the slices as a sequence of moving frames. The third dimension can be treated as the time axis, and motion analysis techniques from the computer vision literature can be applied to determine the motion between consecutive frames. The motion estimator acts as a predictor for the next slice. The error between the estimate of the next slice and the actual values of the next slice may be viewed as a 2D image, which can be compressed using 2D compression methods.
Yet another approach is to consider the set of slices as a 3D volume and use prediction or 3D frequency transform compression methods. For example, we could use the lossless 3D wavelet compression algorithm of Bilgin [9]. This algorithm first decomposes the image data into subbands using a 3D integer wavelet transform; it then uses a generalization of the zerotree coding scheme [7] together with context-based adaptive arithmetic coding to encode the subband coefficients.
3.1 Motion Based Compression Algorithm
Prediction using a motion estimator is called motion compensation. It allows the temporal redundancy that exists in a sequence of images to be removed through motion estimation. If the motion vector at each pixel location is (u, v), then the motion error is given by e(x, y) = I(x, y; n) − I(x − u, y − v; n − 1), where n is the slice or frame number. The task of the encoder is to encode the motion errors. Our proposed algorithm comprises the following steps: 1. Motion estimation The motion from one slice to the next can be represented by a geometric transformation on one slice. A 2D affine transformation is an example of a geometric transformation; it includes rotation, translation and scaling. The affine motion model is defined by u(x, y) = t0 x + t1 y + t2 − x and v(x, y) = t3 x + t4 y + t5 − y, where T = (t0, ..., t5)T performs a global transformation. So, given two 2D images, a motion estimation algorithm will compute a 2D affine transform vector T. To reduce the computational burden, the transform T is computed only at a subset of the image grid called the control point grid. The transformation can then be interpolated at other locations using a B-Spline representation for the control grid. In our motion-based 3D compression algorithm we used the robust and efficient estimator proposed in [2]. 2. Compute the motion error We use the motion transform vector T to compute E = T (In−1 ) − In
(1)
which is the motion error between the predicted value T (In−1 ) of slice n and the actual value of slice n. Specifically, the following steps are used: (a) Apply the motion estimation algorithm on the control points to get the motion vector (u, v) at the control points, (b) Use bilinear interpolation to get the motion vectors at other locations of the image grid. (c) Compute the difference (i.e., the motion error) E between the transformed image Iˆn = T (In−1 ) and the target image In .
3. Encode the motion error The error is the difference E between two images as defined in (1). The first slice of a 3D image is compressed using a 2D compression scheme such as ECALIC (Section 2). The motion estimation scheme is applied to all other slices. The motion vector and motion errors are stored for each of these remaining slices. Since the motion parameters are six floating point numbers, the storage space they require is very small. A sketch of this prediction and encoding loop is given below.
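A rough Python sketch of this pipeline follows. It uses a single global affine transform per slice rather than the control-point grid with B-Spline interpolation, and it leaves the estimator of [2] and the 2D coder as black boxes, so it illustrates the structure of the algorithm rather than the authors' implementation:

import numpy as np
from scipy.ndimage import affine_transform

def predict_slice(prev_slice, t):
    # Warp slice n-1 with the affine motion parameters t = (t0, ..., t5).
    # Note: scipy's affine_transform maps output (row, col) coordinates through
    # (A, offset) into the input image, so depending on the convention used for
    # t the matrix may have to be inverted; this sketch glosses over that detail.
    A = np.array([[t[0], t[1]], [t[3], t[4]]])
    offset = np.array([t[2], t[5]])
    return affine_transform(prev_slice, A, offset=offset, order=1, mode='nearest')

def motion_error(prev_slice, cur_slice, t):
    # E = T(I_{n-1}) - I_n, the residual that is entropy coded.
    return predict_slice(prev_slice, t).astype(np.int32) - cur_slice.astype(np.int32)

def compress_volume(slices, estimate_affine, encode_2d):
    # slices: list of 2D arrays; estimate_affine and encode_2d stand in for the
    # estimator of [2] and a 2D coder such as ECALIC.
    streams = [encode_2d(slices[0])]                   # first slice: plain 2D coding
    for n in range(1, len(slices)):
        t = estimate_affine(slices[n - 1], slices[n])  # six motion parameters
        streams.append((t, encode_2d(motion_error(slices[n - 1], slices[n], t))))
    return streams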
3.2 3D Context-Based Compression Algorithm
This algorithm is an extension of our proposed 2D lossless compression algorithm ECALIC. We introduce a third dimension in the prediction and context modeling steps. The details of the extension are omitted from this paper.
3.3 Evaluation of 3D Methods
For the evaluation of the 3D methods, we consider the two versions of the wavelet method proposed in [9]. The methods wavelet-based(a) and wavelet-based(b) differ in the block size used. The results for the schemes wavelet-based(a) and wavelet-based(b) are from Bilgin [9]. The method wavelet-based(a) uses a 2-level dyadic decomposition on blocks of 16 slices, while wavelet-based(b) uses a 3-level dyadic decomposition on the entire volume. As pointed out in Bilgin [9], no single transform performs best over the entire data set; therefore, for comparison purposes, we include the best results from all the transforms considered in [9]. Table 2 shows that our motion-based scheme gives the best compression on 7 of the 8 test images; our context-based scheme yields the best result on the remaining image. The wavelet-based schemes tie with our motion-based scheme on 1 image.
Table 2. Comparison of 3D compression methods

Image          # slices  wavelet-based(a)  wavelet-based(b)  motion-based  context-based
CT skull       192       3.671             3.981             4.092†        3.910
CT wrist       176       6.522             7.022†            7.022†        6.447
CT carotid     64        5.497             5.743             6.070†        6.041
CT Aperts      96        8.489             8.966             9.875†        9.741
MR liver t1    48        3.442             3.624             3.663†        3.280
MR liver t2el  48        4.568             4.823             4.921†        4.405
MR sag head    16        3.644             3.502             4.063         4.093†
MR ped chest   64        4.021             4.267             4.304†        4.100
† indicates the best result
From the space complexity point of view, our context-based algorithm has the advantage that during each step of compression at most three slices of data are brought into memory, whereas, in the wavelet-based algorithm (b), the whole volume needs to be loaded into memory in order to apply the wavelet transform. When the method wavelet-based(a) is used, blocks of 16 slices are read in and compressed. This scheme, however, does not give as much compression as does wavelet-based(b). Unlike the wavelet schemes, we do not require that the number of slices be a power of 2.
4
Conclusion
In this paper, we have proposed enhancements to the 2D lossless compression method CALIC. The enhanced version, ECALIC, outperforms CALIC on 37 of our 45 test images, and ties with CALIC on 6 of the remaining 8 images. We have also proposed two lossless 3D image compression algorithms: one is motion-based, the other context-based. The motion-based algorithm outperformed the wavelet-based algorithms of [9] on 7 of the 8 data sets used; it tied on the 8th data set. Besides providing better compression, our motion-based scheme requires less computer memory than the competing methods of [9].
References
1. X. Wu, N. Memon, "CALIC - a context based adaptive lossless image codec," Proc. of 1996 International Conference on Acoustics, Speech, and Signal Processing, pp. 1890-1893, 1996.
2. R. Szeliski and J. Coughlan, "Hierarchical spline-based image registration," IEEE Conf. Comput. Vision Patt. Recog., pp. 194-201, Seattle, WA, June 1994.
3. B. C. Vemuri, S. Sahni, et al., "State of the art lossless image compression algorithms," University of Florida, 1998.
4. G. K. Wallace, "The JPEG still picture compression standard," Communications of the ACM, vol. 34, pp. 30-44, April 1991.
5. S. Sahni, B. C. Vemuri, F. Chen, and C. Kapoor, "Variable-bit-length coding: an effective coding method," University of Florida, 1998.
6. A. Said, W. Pearlman, "A new, fast, and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. on Circuits and Systems for Video Technology, vol. 6, no. 3, June 1996.
7. J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. on Signal Processing, vol. 41, no. 12, December 1993.
8. M. J. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: a low complexity, context-based lossless image compression algorithm," Proc. of 1996 Data Compression Conference, pp. 140-149, 1996.
9. A. Bilgin, G. Zweig, M. W. Marcellin, "Efficient lossless coding of medical image volumes using reversible integer wavelet transforms," Proc. 1998 Data Compression Conference, March 1998, Snowbird, Utah.
LucentVision™: A System for Enhanced Sports Viewing Gopal Sarma Pingali, Yves Jean, and Ingrid Carlbom Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA Phone: 908 582 6544 Fax: 908 582 6632 [email protected]
Abstract. LucentVision™ is a networked visual information system that archives sports action in real time using visual processing. LucentVision provides a variety of visual and textual content-based queries on the archived information, and presents the query results in multiple forms including animated visualization of court coverage and virtual-replay of action from arbitrary viewpoints. LucentVision is the first system to process sports video in real time to provide action summaries in live broadcasts. This paper describes the architecture of the system and results from its use in international tennis tournaments.
1
Introduction
Visual information systems can significantly enhance viewers’ experience of sports by extracting motion and geometry information that is hidden in video. Some efforts towards this goal include [4,2,5,7,1,8,3]. This paper presents the architecture of LucentVision – a system we have developed for enhanced sports viewing – and some results of running the system. LucentVision uses real-time visual processing on multiple synchronized video streams to extract and summarize action in sporting events. A key difference between earlier systems and this system is the use of real-time automatic visual tracking. LucentVision archives action summaries in the form of 3D motion trajectories of the essential elements of the game (players, ball) along with video clips and score information in a database. The system provides several query mechanisms, including score-based and motion-based queries, to derive action summaries for a single player or comparisons across multiple players over any particular game, part of a game, or across multiple games. The system uses visual representations to show performance summaries at different levels of abstraction. Examples include statistics such as distance covered, speed and acceleration; visualizations such as court coverage maps and animations of the evolution of such maps; video replays of relevant pieces of action; and virtual replays of the ball which can be watched from any arbitrary viewpoint. The system architecture developed in this paper is applicable to a variety of sports. The focus of this paper will be on the current implementation for the game of tennis.
2
LucentVision Architecture Overview
Figure 1 shows the architecture of the system. The inputs to the system consist of multiple synchronized video streams which are processed by visual tracking and video compression subsystems. The visual tracking subsystem tracks specific action elements of interest in the sport. In the case of tennis, the motion of each player as well as the three-dimensional motion of the ball is tracked. The output of the tracking subsystem consists of motion trajectories of the players and the ball. A database system stores the motion trajectories and compressed video streams. It also stores the corresponding game scores, provided through an input GUI. An output Application Program Interface (API) provides the link between client programs and the archived information in the database. The tracking, compression and database subsystems along with the Input GUI and the Output API comprise the LucentVision server.
3
Tracking Subsystem
The key component of LucentVision is a real-time tracking subsystem that tracks the motion of the players and the ball. The inputs to this subsystem are streams of video from cameras covering the sporting event and the outputs are motion trajectories of the players and the ball which are represented as a sequence of spatio-temporal coordinates (three-dimensional space coordinates and time). Real-time tracking is challenging because of the non-rigid motion of the players, the speed and the small size of the ball, and changing lighting conditions. The player tracker segments out foreground/motion regions using differencing operations, tracks local features in the segmented regions, and dynamically clusters the motion of unstable local features to form a stable motion trajectory corresponding to the centroid of the player. The player tracker has been extensively tested under a variety of outdoor and indoor lighting conditions. The ball tracker uses a combination of motion and color segmentation to estimate ball position at high speeds. The color segmentation takes advantage of the well-defined hue and saturation of the ball in HSV space. Further details on the player and ball tracking are given in [6]. The trajectories output by the tracking subsystem are a very compact representation of the important motion content in the video. Given the spatio-temporal trajectories, we can compute the position, direction of travel, distance covered, speed, and acceleration at any instant, thus allowing content-based queries depending on any of these attributes.
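As an illustration, the attributes listed above follow directly from the stored (x, y, z, t) samples. The following sketch is ours, not the production code, and assumes the trajectory is an N x 4 array:

import numpy as np

def trajectory_stats(traj):
    # traj: array of shape (N, 4) holding (x, y, z, t) samples of a player or the ball.
    pos, t = traj[:, :3], traj[:, 3]
    steps = np.diff(pos, axis=0)          # displacement between consecutive samples
    dt = np.diff(t)
    dist = np.linalg.norm(steps, axis=1)
    speed = dist / dt                     # instantaneous speed per interval
    accel = np.diff(speed) / dt[1:]       # change of speed per interval
    return dist.sum(), speed, accel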
4
Database Organization
The system uses a relational database to store the configuration and rules for each match and system configuration parameters, besides the score, extracted motion trajectories and video clips of action throughout the match. A match configuration table is used to store, for each match, the names of the players, the player who serves first and on which side of the court each player is located
at the beginning of the match. The latter information is important as players change sides several times during the course of a match. The system also stores the parameters of the match indicating, for example, the number of sets, whether tiebreaks are used, and the court geometry (singles or doubles). It also stores camera calibration parameters for the tracking and video capture cameras. The system maintains the state of a match for every point in the match in tables corresponding to the match, games, sets and points. In tennis, a match consists of three or five sets, a set consists of games, and a game consists of points. The state of the match for any point is given by the score at that point, the locations of the players on the court, and the serving player. The system stores extracted motion trajectories and video clips for every point. A unique point ID is used to relate the trajectory or video clip for a point to the state of the match at that point. An input graphical user interface controls the insertion/update of data in the database system and enables tracking and video capture on feeds from the cameras. At the end of a point, the GUI automatically determines the match state and score and inserts this information along with the trajectories into the database.
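One way to picture these tables is the following sqlite sketch; the table and column names are our assumptions for illustration, since the actual schema is not given in the paper:

import sqlite3

conn = sqlite3.connect("lucentvision.db")
conn.executescript("""
CREATE TABLE match_config(match_id INTEGER PRIMARY KEY,
                          player_a TEXT, player_b TEXT,
                          first_server TEXT, initial_sides TEXT,
                          num_sets INTEGER, tiebreaks INTEGER,
                          court_geometry TEXT);
CREATE TABLE point(point_id INTEGER PRIMARY KEY, match_id INTEGER,
                   set_no INTEGER, game_no INTEGER, score TEXT, server TEXT);
CREATE TABLE trajectory(point_id INTEGER, subject TEXT,  -- 'player A', 'player B', 'ball'
                        x REAL, y REAL, z REAL, t REAL);
CREATE TABLE video_clip(point_id INTEGER, path TEXT);
""")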
5
Output API
The API provides numerous functions to derive the trajectories and video streams for a particular match, across different matches, or different players. The API supports both score-based queries (“get all the trajectories for player A for service games that he won”) and visual content based queries (“get all the video clips for points when player B approached the net”). Once connected via a wired or wireless internet link, the client acquires all of the parameters for the current match (in progress or pending) and may use one of the Output Composer models. The Output Composer supports a variety of outputs which can be tailored to the client’s application needs, processing power, and bandwidth of the communication link with the LucentVision server. Some examples of outputs from the Output Composer are court coverage maps, replays of the ball motion in a virtual environment, statistics such as player speed, distance traveled and acceleration, and video replays of any part of the match. For clients with an abundance of computational resources we provide a “heavy” client API. The API calls process raw trajectory and score information on the client to produce derivative data such as the LucentVision Map shown in figures 3 to 5 and 7 to 9. The “light” client routines acquire match data and derivatives with little client-side processing, for example, the current score or total distance traveled by a player. The “streaming” API handles the streaming of match data to the client. Streaming is useful for live-update scenarios where the client is automatically handed data as soon as the server acquires it. For example, the stream objects can be positional information for driving representations of a ball and players on a virtual court.
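A visual content-based query such as "get all the video clips for points when player B approached the net" then reduces to a predicate over the stored trajectories. The sketch below is ours; the net position, the margin and the database accessor are assumptions made for illustration:

def approached_net(traj, net_y=11.885, margin=2.0):
    # True if the trajectory comes within `margin` metres of the net line,
    # here assumed to lie at y = net_y in court coordinates.
    return any(abs(y - net_y) < margin for x, y, z, t in traj)

def net_points(db, player):
    # Return the point ids of all points in which the player approached the net.
    return [pid for pid, traj in db.trajectories(player) if approached_net(traj)]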
Furthermore, LucentVision provides a cache mechanism to ensure that an application has fast access to results of previous queries besides the most recent update. The caching mechanism is realized on either the client or the server. Our API supports either model because “heavy” clients can afford local caching while “light” clients may lack the caching resources but can benefit from its use on the server side.
6
Results
In this section, we focus on results obtained by running the system in international tournaments. Thus far, the system has been run in three international tennis tournaments – both outdoors (ATP Championship at Cincinnati, USA, in August 1998) and indoors (ATP Championship at Stuttgart, Germany, October 1998, and ATP World Championship at Hannover, Germany, November 1998). Figure 2 shows a still from a video clip archived at the championship in Cincinnati. The motion trajectory of the player for the current point is overlaid on the image. Such motion trajectories and video clips are stored in real time for every point in a match. During the tournaments, LucentVision was used to obtain real-time visualizations, called LucentVision Maps™, showing player court coverage patterns. In addition, the system computed statistics such as the distance covered by each player, the average speed of the player and the peak speed. The LucentVision Maps, annotated with statistics, were integrated into worldwide broadcasts and were an integral part of the live commentaries. Figure 3 shows the LucentVision Map from the semifinal of the ATP World Championship. The rest of the LucentVision Maps shown in this paper are from this tournament. The LucentVision Map shows the court coverage patterns for the players, with one player shown on each half of the court. A player’s activity on both sides of the court is combined and displayed on one side. Color is used to represent the time a player spends in different parts of the court, with red indicating most time, followed by yellow, green and blue. The LucentVision Maps are useful in analyzing player performance and strategy. In figure 3, it is seen that Sampras, shown on the left, approached the net much more than Corretja, who spent most of his time close to the baseline. It is also seen that both players spent more time on their left-hand (backhand) side than on their right-hand (forehand) side. Figure 3 also shows the match score, the total distance covered by each player, and their average speeds. Figure 4 shows the LucentVision Map for the other semifinal of the tournament. The map again highlights the difference between the aggressive net-approaching style of Henman and the baseline play of Moya. Figure 5 shows the strikingly similar baseline play of both Moya and Corretja in the World Championship final match. It is seen that each player covered over 10 km in this five-set match. Additional information was obtained by the LucentVision system in this match to contrast the two players. Figure 6 shows how the average speeds for each player changed from set to set in the course of the match, which lasted over four hours.
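A coverage map of this kind is essentially a time-weighted occupancy histogram over the court. The following sketch illustrates the idea; the grid resolution and court dimensions are our assumptions, not the broadcast implementation:

import numpy as np

def coverage_map(traj, court_size=(10.97, 23.77), cells=(20, 40)):
    # Accumulate the time a player spends in each cell of a grid laid over the
    # court; the result can be colour-mapped (red = most time) for display.
    grid = np.zeros(cells)
    for (x0, y0, z0, t0), (x1, y1, z1, t1) in zip(traj[:-1], traj[1:]):
        i = min(int(x0 / court_size[0] * cells[0]), cells[0] - 1)
        j = min(int(y0 / court_size[1] * cells[1]), cells[1] - 1)
        grid[i, j] += t1 - t0             # dwell time attributed to the current cell
    return grid / max(grid.sum(), 1e-9)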
Fig. 1. Architecture of the LucentVision system (server side: video streams feeding the trackers and the compression subsystem, an input GUI supplying the score, camera parameters, match parameters and court geometry, and a database storing trajectories; client side: the Output Composer delivering occupancy maps, statistics, virtual replays, video replays and historical comparisons)
Fig. 2. Still from a video clip showing the player motion trajectory obtained by the LucentVision system
Fig. 3. LucentVision Map for the Sampras−Corretja semifinal match (match score 6−4 3−6 6−7; Sampras: 3.61 km, 22.01 km/h; Corretja: 3.63 km, 21.88 km/h)
Fig. 4. LucentVision Map for the Henman−Moya semifinal match (match score 4−6 6−3 5−7; Henman: 3.01 km, 21.61 km/h; Moya: 3.16 km, 22.56 km/h)
Fig. 5. LucentVision Map for the five−set final match (match score 6−3 6−3 5−7 3−6 5−7; Moya: 10.21 km, 21.94 km/h; Corretja: 10.07 km, 21.57 km/h)
Fig. 6. Graph showing the changing average speeds of the players (in km/h, plotted against set number 1–5) across the five sets in the final match.
Fig. 7. LucentVision Map for an Agassi−Corretja match
Fig. 8. LucentVision Map for Set 1 of a Sampras−Moya match
Fig. 9. LucentVision Map for Set 2 of a Sampras−Moya match
Moya slowed down after the second set while Corretja, who was slower than Moya for most of the match, sped up significantly in the crucial fifth set to win the championship. LucentVision was used in this manner to provide a variety of information in real time based on the match situation and the broadcaster’s needs. Figure 7 shows the map for a match between Agassi and Corretja. The map not only shows that both players were playing close to the baseline for most of the time, but also highlights the subtle distinction that Agassi played predominantly inside the baseline while Corretja played more outside the baseline. The LucentVision system provides information for any subset of a match. Figures 8 and 9 show the maps for two individual sets in a match between Sampras and Moya. It is seen how Moya, who was playing at the baseline in the first set, significantly changed his strategy in the second set and approached the net.
7
Conclusion
The LucentVision system uses a combination of real-time tracking, computer graphics, visualization, database, and networking technologies to enhance viewers’ appreciation of the strategy and athleticism involved in a sport and to increase viewers’ sense of presence in a sporting environment. The philosophy behind the system is to capture the activity in an environment and its geometry in real time through visual means. The architecture supports a number of sports and other surveillance applications such as analysis of customer activity in retail environments and security monitoring, and paves the way for the emerging paradigm of immersive telepresence.
References
1. Thomas Bebie and Hanspeter Bieri. Soccerman - reconstructing soccer games from video sequences. In Proceedings of the International Conference on Image Processing, 1998.
2. Y. Gong, L.T. Sin, C.H. Chuan, H. Zhang, and M. Sakauchi. Automatic parsing of tv soccer programs. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 167–174, 1995.
3. Praja Inc. Praja actionsnaps! http://www.actionsnaps.com, 1998.
4. S. Intille and A. Bobick. Visual tracking using closed worlds. In Proceedings of the Fifth International Conference on Computer Vision, pages 672–678, 1995.
5. P.H. Kelly, A. Katkere, D.Y. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain. An architecture for multiple perspective interactive video. In Proceedings of ACM Multimedia ’95, pages 201–212, 1995.
6. Gopal Pingali, Yves Jean, and Ingrid Carlbom. Real-time tracking for enhanced sports broadcasts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 260–265, 1998.
7. G. Sudhir, J.C.M. Lee, and A.K. Jain. Automatic classification of tennis video for high-level content-based retrieval. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Databases (CAIVD’98), 1998.
8. Orad Hi-Tec Systems. Virtual replay. http://www.orad.co.il/sport/index.htm, 1998.
Building 3D Models of Vehicles for Computer Vision Roberto Fraile and Stephen J. Maybank Department of Computer Science The University of Reading, Reading RG6 6AY, UK P.O. Box 225, Whiteknights [email protected]
Abstract. We present a technique to build and manipulate three dimensional models of vehicles, to be used by a computer vision system. These tools will help find a suitable class of models for the vehicles and their trajectories, restricted to the shapes of cars (symmetry, regularity) and to their trajectories (ground plane constraint), moving in a fixed scene. The models consist of a variable number of facets in three dimensional space, varying over time, each with a corresponding objective (evaluation) function.
1
Introduction
We have created a tool to build and manipulate three dimensional models of vehicles, using images taken from a fixed, calibrated, off-the-shelf camera. Our aim is to develop techniques for vehicle detection and tracking. The approach considered is top down, in which a set of hypotheses, chosen from a family of three dimensional models, is matched against the raw images. Previous applications of this model based, top down approach were intended to solve the problem of real time detection and tracking of vehicles. Sullivan et al. [8] used edge based, fully featured vehicle models, which were compared with each image using an objective function defined on the pixels surrounding the projected edges [2]. Kalman filtering and other types of filters [7] were used as prediction engines for the hypothesis generation. For top-down techniques to work robustly, accurate initial hypotheses are required for both the pose of the vehicle and its shape. The main problem is that it is hard to find accurate geometric models for the shapes of vehicles. Deformable models (for example [3] and [4]) require a continuous model of the object. In our approach we use parameterised models, in which the number of parameters can vary. We can increase the amount of information in the model, but at the cost of a higher number of parameters. Model construction and fitting are implemented in a computer program which provides a test bed for assessing the performance of the different models.
2
Model and Knowledge Based Vehicle Detection
Models can contain varying amounts of implicit knowledge. Too little knowledge leads to hypothesis spaces which are too large, while too much knowledge may include wrong assumptions, which compromise image matching. Independently of the quality and quantity of knowledge, top-down vision systems have two key elements: 1. An objective function f that associates a value to each hypothesis x, depending on the match between x and the input images. 2. A search algorithm over the space of hypotheses, to find one that produces “a good value” for f . A “good value” is normally a maximum of f on the space of all valid hypotheses. The hypotheses can take the form of geometric models for the shape, combined with kinetic models for the trajectory. Not all features are useful for detection, tracking or comparison throughout an image sequence. The vehicle may show different sides, produce different contrasts in different images, or have different apparent sizes, etc. A different number of parameters may apply in each situation. Therefore, we model the three dimensional appearance of the vehicle, not its shape.
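In schematic form these two elements amount to nothing more than the following loop; the objective f and the hypothesis set are placeholders, and in practice the hypothesis space is continuous and far too large for exhaustive enumeration, which is why the prediction engines cited above are needed:

def best_hypothesis(hypotheses, f, images):
    # Return the shape/pose/trajectory hypothesis x that maximises the match
    # between x and the input images, as measured by the objective function f.
    return max(hypotheses, key=lambda x: f(x, images))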
3
Four Dimensional Models
Four dimensional models, which span both space and time, are to be used for our research in vision. Two main trends in solid modelling are the boundary and the constructive representations [6]. For our purpose, we need a three dimensional boundary representation of the object, which is directly related to the description of the appearance. We load a portion of the video sequence into memory, to ensure that it is available for repeated computation of the objective function. Shape Each shape model consists of a family of triangles (not necessarily a mesh, see Fig. 2). These triangles represent parts of the surface which are of interest, in that they are visible in the image. Trajectory The models for the trajectory can be given either as a sequence of rigid transformations that link the triangles between frames, or, at a higher level, as a continuous function of time, describing those transformations [5]. The advantage of considering the trajectory model and the three dimensional shape model together is that we can define a single objective function. We can combine information from complementary sources and make an informed decision about the shape and the trajectory together.
The bounding surface, in our implementation, has the form of a family of triangular facets in 3D. Since a mesh of triangles would imply continuity in the bounding surface of the appearance being modelled, and we want to avoid that assumption, neighbouring triangles may have gaps between them where the surface appearance is not modelled. The number of triangular facets to be used can vary enormously, and finer detail is penalised with a higher dimensional search space. In future work we will tackle the problem of comparing models with different numbers of facets.
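One possible encoding of this representation as data structures is sketched below; it is our illustration, not the authors' implementation:

from dataclasses import dataclass
from typing import List, Tuple

Point3 = Tuple[float, float, float]

@dataclass
class Facet:
    # One triangle of the appearance model; gaps between facets are allowed.
    vertices: Tuple[Point3, Point3, Point3]

@dataclass
class RigidMotion:
    # Rotation about the vertical axis and translation on the ground plane.
    angle: float
    translation: Tuple[float, float]

@dataclass
class VehicleModel:
    facets: List[Facet]             # the shape: a family of triangles
    trajectory: List[RigidMotion]   # one rigid transformation per frame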
4
Describing the Shape of Vehicles in the Space
Here we describe a simple technique for helping human users specify models of triangles in space. It allows the human user to estimate the 3D position of each vertex of a model. It also provides visual support to double-check and refine the accuracy, by displaying different instances of the model, projected over different images of the same object. Given the images and the calibration of the camera, points are fully described by their projection on the image plane and the projection of their orthogonal shadow on the ground plane. The human user provides a height h, and a point v in the image plane (see Fig. 1). A “projection” map P, from the three dimensional space to the image plane, is known a priori. The program finds the point w = (wx, wy, wz) in space, with coordinate wz = h, such that P w = v. This method is used to help the human user choose intuitively the height of a point when moving the pointer on the window: the orthogonal shadow is calculated and drawn in the right place, under the pointer. In Fig. 1 the user has located the upper cross at the projection of a prominent point on the vehicle. The user then supplies an estimate of the height of the 3D point above the ground plane. The system uses the known camera calibration and the position of the cross to compute the position of the foot of the perpendicular from the 3D point to the ground plane. The foot is projected into the image and the projection marked automatically by the lower cross. The visual interface is used to provide point coordinates, in the three dimensional space, of the vertices of the triangles which describe the three dimensional shape of the vehicles. This method has two main advantages: 1. It does not require a priori accurate physical measurements, such as the height and length of the vehicle, which usually are unknown, but can be obtained intuitively using the images and given the calibration of the camera. 2. Constraints such as perpendicularity or orthogonality between the different parts of the vehicle are not assumed implicitly. Existing alternatives, such as auxiliary lines (in the same way as most popular CAD systems for planar drawings), or grids of reference, are more complicated and include in different degrees assumptions about measurements or perpendicularity.
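With a calibrated camera this computation is a small linear solve. The sketch below assumes that the projection map is available as a 3x4 homogeneous camera matrix P, which is an assumption made for illustration:

import numpy as np

def backproject_at_height(P, v, h):
    # Find the world point w = (wx, wy, h) whose projection under the 3x4 camera
    # matrix P is the image point v = (vx, vy).  Since P @ (wx, wy, h, 1) is
    # proportional to (vx, vy, 1), eliminating the scale gives two linear
    # equations in the two unknowns wx and wy.
    vx, vy = v
    r1 = P[0] - vx * P[2]
    r2 = P[1] - vy * P[2]
    A = np.array([[r1[0], r1[1]], [r2[0], r2[1]]])
    b = -np.array([r1[2] * h + r1[3], r2[2] * h + r2[3]])
    wx, wy = np.linalg.solve(A, b)
    return np.array([wx, wy, h])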
Fig. 1. A point on the vehicle (upper cross) and its projection onto the ground plane (lower cross).
5
Applications
In our case, even low level visual modelling of vehicles is quite fast. Only the parts of the vehicles that can be seen during the sequence need to be modelled. Additional assumptions about the shape, for example symmetry, may lead to full models.
Fig. 2. First frame showing the projections of the 3D triangles
Figures 2, 3 and 4 show a Rover 216 for which a family of triangles has been defined by hand using the point-and-click technique described above. The family is the same for all the frames. In the first frame (Fig. 2), most of the triangles were described, except those of the bonnet, which were located only in the final image. The model works for all the frames of the sequence.
Fig. 3. Intermediate frame
In these experiments the trajectory has been specified as a rigid movement of the model on the ground plane (a rotation and a translation) for each frame. These triangles are being used to segment the image and as input to the objective function currently under development. As a consequence, there is a relation between the accuracy of the model and the amount of information available to the objective function. It may be the case, for a vehicle a long way off or when the images have very low quality, that one or two facets fit the object more efficiently than a comprehensive geometric model.
Fig. 4. The vertices of the bonnet were described in this frame
The triangles are divided into groups such that the triangles in each group correspond to regions with the same expected grey level. The background is considered as a special case of non-variable pixels over time. Edges, highlights or shadows are not modelled in this application.
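The objective function itself is still under development (see Sect. 6). Purely to illustrate how this grouping might be exploited, a hypothesis could be scored by how uniform the image is within each group of projected triangles; the following is a speculative sketch of ours, not the authors' function:

import numpy as np

def uniformity_score(image, group_masks):
    # group_masks: one boolean mask per grey-level group, marking the pixels
    # covered by the projections of that group's triangles.  Lower within-group
    # variance means the hypothesis explains the image better.
    return -sum(float(np.var(image[mask])) for mask in group_masks if mask.any())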
6
Conclusions
There is a huge variety of candidates for models [1], objective functions [2] and search algorithms [8]. The main requirements are that they are computable, discriminate between good and bad models, and produce acceptable search spaces. In order to fulfil these requirements, we are developing a tool that includes all steps: model based description of the shape and the trajectory, and an objective function. We are currently developing a new objective function that makes use of all the pixels of the image. Including the trajectory and the shape in a single model increases the number of parameters. Another disadvantage is that the images need to be buffered before being processed: that is, this method works off-line. The buffering lasts only for a few frames, which is sufficient for most surveillance applications. Future work will include innovative parametric models for the trajectory, which will reduce the size of the parameter space by taking account of the physical properties of moving vehicles [5]. Shadows, edges and highlights can be built up out of these low level models: shadows are projections of triangles onto the ground plane, edges can be defined as the segment that two triangles sharing two vertices have in common, and short-lived highlights can be modelled as triangles belonging to special groups that last for only a few frames, depending on the motion and the orientation of the light source.
References
1. B. A. Barsky. Computer Graphics and Geometric Modelling Using Beta-Splines. Springer, 1986.
2. K. Brisdon. Hypothesis verification using iconic matching. PhD thesis, Department of Computer Science, The University of Reading, 1990.
3. G. Celniker and D. Gossard. Deformable curve and surface finite-elements for free-form shape design. In T. W. Sederberg, editor, Computer Graphics, volume 25, pages 257–266. SIGGRAPH, Addison-Wesley, July 1991.
4. J. M. Ferryman, A. D. Worrall, G. D. Sullivan, and K. D. Baker. Visual surveillance using deformable models of vehicles. Robotics and Autonomous Systems, 19:315–335, 1997.
5. R. Fraile and S. J. Maybank. Vehicle trajectory approximation and classification. In P. H. Lewis and M. S. Nixon, editors, British Machine Vision Conference, 1998.
6. C. Hoffman and J. R. Rossignac. A road map to solid modeling. Visualization and Computer Graphics, 2(1):3–10, March 1996.
7. S. J. Maybank, A. D. Worrall, and G. D. Sullivan. A filter for visual tracking based on a stochastic model for driver behaviour. In B. Buxton and R. Cipolla, editors, Computer Vision, ECCV96, volume 1065 of Lecture Notes in Computer Science, pages 540–549. Springer, 1996.
8. G. D. Sullivan. A priori knowledge in vision. In D. Vernon, editor, Computer Vision: Craft, Engineering and Science, ESPRIT Basic Research, pages 58–79. Springer, 1994.
Integrating Applications into Interactive Virtual Environments Alberto Biancardi and Vincenzo Moccia DIS and INFM (Pavia Research Unit) Università di Pavia, via Ferrata, 1, I-27100 Pavia, Italy Tel: +39.0382.505372, Fax: +39.0382.505373 [email protected] [email protected] http://vision.unipv.it/ Abstract. In this paper we describe Siru, a virtual reality development environment for the Mac OS. Unlike existing virtual reality authoring tools, Siru features unique inter-application communication capabilities, and thus can be used to develop customised solutions having sophisticated three-dimensional interfaces and integrating functionalities from existing application programs. We present the tool itself and a few sample applications. We also discuss some technology choices that have been made, and look forward to possible future improvements.
1
Introduction
When dealing with real-world metaphors or the representation of complex processes, 3D graphics in general and virtual reality (VR) in particular lower the amount of abstraction required by the interface, thus easing access to information and widening the expected audience of software products. This is why 3D graphics and VR are becoming increasingly important in computer applications. While initially confined to military and flight simulation, VR technology is now applied to many new applications, such as prototyping of engineering designs, architectural walk-throughs, on-line computer-based training or interactive product marketing demonstrations. To achieve high levels of realism, these applications must meet various performance parameters, especially as far as human perception factors are concerned. The rapid increase in performance of 3D graphics on inexpensive PC platforms, however, has made virtual environment (VE) interfaces feasible enough to be used successfully in a number of concrete situations [8]. Currently available VR related software tools belong to one of three categories: modellers, for creating individual 3D objects to be incorporated into VEs; scene builders, used to place previously generated objects in a virtual space; and viewers, which allow users to explore such virtual spaces and share them with other people. Sometimes all of these features are provided within one integrated development kit. Although it comes with scene building and viewing capabilities, Siru does not exactly fit any of the categories above. While most VR tools today focus
on self-contained worlds, Siru allows developers to integrate functionalities from existing software. Objects in Siru VEs not only communicate with each other, but also interact and exchange data with the system software and with external independent application programs, either local or running on networked computers. This is obtained by attaching a script to each VE element to define its behaviour and by using a system-wide scripting language to glue together applications [2,6]. Although a few similar tools exist [4,5], they are intended as three-dimensional graphics simulators or rapid prototyping systems rather than development environments for effective software solutions, addressing programmers who need to combine features provided by multiple applications and who also want to endow the result with innovative 3D interfaces.
2
A Flexible Framework
Siru should be regarded as a framework for the development of interactive customised solutions. People who should benefit from Siru include: HCI professionals, who need a tool for quick prototyping of innovative interaction or design techniques; 3D interface designers, who will exploit Siru’s excellent inter-application communication (IAC) capabilities; solution developers, who need to integrate capabilities from different application programs and system software services; multimedia application creators, who are willing to embed content within 3D interactive virtual environments and do not want to start development from scratch; and scientists or researchers who wish to experience new paradigms in simulation and data visualisation. From an applicative point of view Siru may play different roles: – It is a tool for integrating and customising applications. Siru can be used as a basis to perform tasks involving many applications. A single object can send instructions to one application, get the resulting data, and then send the data to one or more additional applications. Controlled applications can be on any computer on a given network. You can do this today using other tools; however, Siru provides developers with a media-rich creative environment that makes new solutions possible, and lets users interact with their data in new ways. – It is a three-dimensional interface builder. Traditional VR tools lack system-wide effectiveness, and though you can enrich VRML [9] worlds with Java functionality, this is far from being straightforward. Siru objects can be made to interact with system software and other applications with little or no effort at all. Thus, developers and power users are allowed to create 3D interfaces to system services or existing application programs. Further, the ability to create movies from a user session makes it possible to use Siru as a tool for the analysis and assessment of interface usability. – It can be used for data visualisation and presentation. Data can be represented in a three-dimensional space, which does definitely make sense in a wide range of circumstances. Again, such purpose can probably be achieved with ordinary virtual reality description languages. However, Siru makes it
possible to link 3D representations with the actual data in a simple and elegant way; the representations evolve in real time as the related data change, even while they are being explored, and even across a network. – It is a tool for creating linked virtual worlds. Different worlds can be run and explored by different users on separate networked computers. Each world can be made to interact with any other; actions performed by one user in one world may have effects on other users’ experiences. Information on each user’s activity can be gathered, processed and logged, or even exploited to modify the related VEs, so that users face new and more stimulating situations as they get acquainted with the old ones.
3
Basic Elements
The Siru development environment is founded on an object-oriented approach. Each project is made up of a collection of interactive objects having a set of attributes called properties. Some properties affect the way objects appear and behave, and can be modified at any given time either in scripts or through the provided graphical property browsers; other properties are available for reading only, and provide status information. One property deserving special mention is an object’s script, in which handlers are defined for all messages relevant to the object itself, if any. Messages are usually generated in response to user actions; however, objects can actually send messages to each other and thus work collaboratively to achieve some goal. Scripts can be composed in the integrated script editor or using an external tool. Advanced features such as message passing and delegation are fully supported. Further, since objects belong to a containment hierarchy, inheritance is also supported, therefore providing a powerful programming model which is suitable for creating vast environments and implementing complex algorithms. Siru objects belong to four classes: models, sounds, lights and views. – Models are three-dimensional geometric objects; they make up the visible part of a VE. The spatial location, scale and rotation angles of each model are fully editable. Further, any image file can be applied to the surface of a model as a texture. The model itself, however, must be created in advance using an external modelling tool. – Sounds provide audio capabilities. To make virtual experiences more realistic, Siru provides spatial filtering for sounds, so that they appear to be emanating from a specific location and distance from the user, possibly moving in space too. – Lights are used to provide illumination to model surfaces. Siru supports multiple light sources in a given scene. Four types of lights are defined, all sharing some basic properties such as brightness and colour. – Views maintain the information necessary to render scenes from a VE. Each view is essentially a collection of a single camera and a set of attributes determining a method of projecting the virtual scene onto a 2D plane. The camera location also affects the way localised sounds are heard. The number of views open at a given time is limited only by available memory. Siru allows
developers to generate movie files from views, thus recording all the activities taking place in a VE. 3.1
Building a Virtual Environment
Building a static VE is just as easy as creating a few objects and setting their locations in space. Making the objects interactive is simply a matter of writing scripts that define arbitrarily complex behaviours. In order to get acquainted with some of Siru’s most basic features, let us examine the fundamental steps needed to build a simple VE. Suppose we want to create a virtual garden: we are going to need an apple tree, an apple, some flowers, and a suitable background noise (a river, for instance). To keep things simple, we will not use any of Siru’s IAC capabilities for now. After launching Siru, we create a new VE from scratch by selecting the New Project File item from the File menu. Next, we open a view by issuing the appropriate menu command. Views are needed to explore a VE, but, of course, they are also helpful when authoring one. For the sake of simplicity, let us have just one view in our virtual garden. Now we could start placing objects in our virtual space. However, to actually see them, we need to add some light first. Sophisticated lighting techniques are available in Siru. However, a simple ambient light casting no shadows will be enough for our garden; let us set its colour to a pale yellow, to resemble sunlight. Creating lights and setting their properties is just a matter of selecting a few menu items. Everything is ready now to add our apple tree. We choose New Model from the Object menu and specify the file containing the model description: the tree appears suddenly, provided its location falls within the current field of view. In the property browser window we are given a chance to modify the tree’s location, scale and rotation. Similarly, we add some flowers (Fig. 1).
Fig. 1. Editing object properties
Note, however, that an external modelling tool is needed to draw the 3D models and write files containing their description. Adding the background sound is just as straightforward. We select the New Sound menu item, specify the sound file and set the sound status to looping in the property browser. Since we want this to be perceived as background noise, we also set the source mode property to unfiltered; thus, spatial filtering will not be applied, and we will not need to specify a location for this sound. Now, suppose we want the apple tree to drop an apple when clicked. To achieve this behaviour, we add a new model (the apple) and hide it by setting
its enabled property to false; then, we set the script of the apple tree to something like the following:
on mousedown
  set location of model "Apple" of document "Garden" to initLoc
  set enabled of model "Apple" of document "Garden" to true
  set currentLoc to initLoc
  -- initLoc and finalLoc (defined elsewhere) hold the start and end positions of
  -- the falling apple, so item 3 (the height) decreases from start to end
  repeat with i from (item 3 of initLoc) to (item 3 of finalLoc) by -0.1
    set item 3 of currentLoc to i
    set location of model "Apple" of document "Garden" to currentLoc
  end repeat
  set enabled of model "Apple" of document "Garden" to false
end mousedown
Our garden is now ready and can be explored freely.
4
Connecting Applications
Unlike most VR tools, Siru can act as a foundation to perform complex tasks involving external independent application programs; it can be regarded as a general-purpose three-dimensional interface builder, providing a creative media-rich environment for the development of customised solutions. Siru VEs can be seen as an interaction layer where a developer can place hooks to evolving data or external programs. Thus, Siru provides a high level approach to application integration, considerably speeding up the development and update of prototypes. Its interaction with applications usually consists of either exchanging data or applying some kind of external functionality to existing data; the opposite direction is possible, too: external application programs can directly control the environment by executing simple scripts that, for instance, create or dispose of scene objects, change object or view properties, modify the behaviour of objects by changing their special script properties, . . . We will now see two examples demonstrating why these techniques are important, and what they can be concretely used for. 4.1
The Sales Trend Example
The virtual garden example shows how to build a self-contained application. Now we will see how to obtain data from external sources, and how to display such data in a 3D environment. In the sales trend example, information about the sales turnover from an imaginary commercial activity is retrieved from a spreadsheet and then used to create a 3D representation. We monitor the sales of five items: a toy plane, a laundry detergent bottle, a chair, a teapot, and an
umbrella. For each item, a 3D model is displayed whose size is proportional to the sales data. So, if 40 chairs and 20 teapots were sold, the chair model will be scaled so as to appear twice as big as the teapot model (fig. 2 on the left). Information is retrieved periodically from the spreadsheet application; after each update, the user is alerted with a ring. Getting a cell’s value from the spreadsheet is as easy as writing the following AppleScript command: get cell cellName of spreadsheet ¬ of document "Sales Spreadsheet" and addressing it to the spreadsheet application from within some script. In our example, the only significant script is the one associated with the ring sound, which does all the necessary work, including waiting for the update period to expire, getting data from the spreadsheet application, and scaling the models that represent the sales items as needed.
Fig. 2. Applicative scenarios: sales and car parking examples
4.2
The Car Parking Example
In the car parking example (fig. 2 on the right), two-way communication is established between a virtual parking area and a database (DB) system storing information about parked cars. Whenever a new record is added to the DB, a new car appears in the virtual parking area, located in the spot that was assigned to it by the DB system. Removing a record from the DB causes the corresponding virtual car to disappear. Further, each time a car gets clicked in the virtual parking area, a query is run on the DB to find the corresponding record, and all available information about that car is displayed. The car license plate is used as a key to keep a link between Siru models and records in the DB system. A possible extension of this example is to supply additional browsing points (e.g. at the counter) to help customers find their car by showing a virtual walk-through inside the parking area: Siru already has all the required functionalities to handle such an extension.
4.3
More Ideas
Thanks to the IAC capabilities of Siru it is possible to use third-party programs to monitor and control physical devices. Thus a virtual representation of some real place can be created and virtual objects can be linked to their real counterparts. When the latter change, the former are affected, and vice versa. Along these lines complete control and supervision systems could be developed. Virtual panels could be created to operate complex machinery; feedback would be obtained through suitable transducers. Further, 3D representations could be used to examine components located in inaccessible or dangerous environments. Realistic simulations could be done on the virtual system before actually linking it to the physical one.
5
Design and Implementation Issues
Siru integrates different technologies (3D graphics, speech, audio and video) in a runtime environment founded on Apple’s Open Scripting Architecture (OSA) [7]. Multimedia content is embedded within object properties while object message handlers are composed using AppleScript – a dynamic, object-oriented language which supports inheritance, delegation and compiled libraries, and features an easy English-like syntax. Other OSA-compliant scripting languages can be used as well. Messages are dispatched through the Apple Event mechanism, which is the standard IAC protocol on the Mac OS [2]. Thus, Siru objects are also able to interact and exchange data with other application programs or the system software itself. Further, since Apple Events also work across networks, communication and interaction with remote computers or users is allowed as well, and does not require any special handling. While designing Siru, we decided that supporting AppleScript would result in significant benefits, and the idea of providing a built-in scripting language was discarded very early. Actually, most of Siru’s powerful capabilities depend on adopting the OSA, which allows several applications to be controlled from a single script and does not require users to learn a new language for each application. One more exciting feature of the OSA is a recording mechanism that takes much of the work out of creating scripts. When recording is turned on, you can perform actions in a recordable application and the corresponding instructions in the AppleScript language will be created automatically. One relevant concern in designing Siru was to assess the feasibility of building convincing interactive VEs on low-cost hardware. As we expected, real time 3D graphics turned out to be very CPU intensive. We considered alternative solutions, such as the QuickTime VR (QTVR) technology [3], which allows 360-degree panoramic views to be created from a set of photographs; however, the fixed point of view paradigm adopted by QTVR proved quite unsatisfactory, allowing for only minimal interaction. Also, we found that the effect resulting from the integration of photorealistic QTVR scenes with other media was aesthetically unpleasant. Eventually, we preferred dynamically rendered 3D graphics, sacrificing speed for the sake of interaction and flexibility. We adopted Apple’s QuickDraw 3D
imaging technology [1], which offers reasonably fast rendering and provides a hardware abstraction layer that allows system software to utilise a wide variety of acceleration hardware without code changes.
6 Conclusions
Siru is a tool that integrates, within interactive 3D interfaces, functionality not found in any single software package, creating new ways to manipulate information on computers. The example applications presented in this paper show the effectiveness of our approach. However, Siru is not yet a finished product. Future development will aim at improving usability and speed; more sophisticated interaction techniques will also be added to make a larger number of solutions achievable. Siru is free and can be requested from the authors or obtained by writing to [email protected].
References
1. Apple Computer, Inc.: 3D Graphics Programming With QuickDraw 3D. Addison-Wesley Publishing Company, Reading MA (1995) 710
2. Apple Computer, Inc.: Inside Macintosh: Interapplication Communication. Addison-Wesley Publishing Company, Reading MA (1993) 704, 709
3. Apple Computer, Inc.: Virtual Reality Programming With QuickTime VR 2.1. Apple Technical Publications, Cupertino CA (1997) 709
4. Ayers, M., Zeleznik, R.: The Lego interface toolkit papers: virtual reality (TechNote). Proceedings of the ACM Symposium on User Interface Software and Technology (1996) 97–98 704
5. Conway, M., Pausch, R., Gossweiler, R., Burnette, T.: Alice: a rapid prototyping system for building virtual environments. Proceedings of ACM CHI'94 Conference on Human Factors in Computing Systems (1994) 295–296 704
6. Ousterhout, J. K.: Scripting: Higher-Level Programming for the 21st Century. Computer 31 (1998) 23–30 704
7. Smith, P. G.: Programming for flexibility: the Open Scripting Architecture. develop, the Apple technical journal 18 (1994) 26–40 709
8. The Policy Studies Institute of London: Virtual reality: the technology and its applications. Information Market Observatory, Luxembourg (1995) Available at http://www.echo.lu/ 703
9. The Virtual Reality Modeling Language. ISO/IEC DIS 14772-1 (1997) Available at http://www.vrml.org/VRML97/DIS/ 704
Structural Sensitivity for Large-Scale Line-Pattern Recognition Benoit Huet and Edwin R. Hancock Department of Computer Science University of York, York, YO10 5DD, UK
Abstract. This paper provides a detailed sensitivity analysis for the problem of recognising line patterns from large structural libraries. The analysis focuses on the characterization of two different recognition strategies. The first is histogram-based while the second uses feature-sets. In the former case comparison is based on the Bhattacharyya distance between histograms, while in the latter case the feature-sets are compared using a probabilistic variant of the Hausdorff distance. We study the two algorithms under line-dropout, line fragmentation, line addition and line end-point position errors. The analysis reveals that while the histogram-based method is most sensitive to the addition of line segments and end-point position errors, the set-based method is most sensitive to line dropout.
1 Introduction
The recognition of objects from large libraries is a problem of pivotal importance in image retrieval [9,7,6,1]. The topic has attracted massive interest over the past decade. Most of the literature has focussed on using low-level image characteristics such as colour [9], texture [2] or local feature orientation [6] for the purposes of recognition. One of the most efficient ways to realise recognition is to encode the distribution of image characteristics in a histogram [9]. Recognition is achieved by comparing the histogram for the query and those for the images residing in the library. In a recent series of papers, we have embarked on a more ambitious programme of work where we have attempted large-scale object recognition from structural libraries rather than image libraries [5,4,3]. Specifically, we have shown how line-patterns segmented from 2D images can be recognised using a variety of structural summaries. We have looked at three different image representations and have investigated ways of recognising objects by comparing the representations. The simplest structural representation is a relational histogram. This is a variant of the pairwise geometric histogram [10] where Euclidean invariant relative attributes are binned provided that the line primitives are connected by an edge of a nearest neighbour graph [5]. Although relatively crude, object recognition via histogram comparison does not require explicit correspondences to be identified between individual line tokens. A more sophisticated representation is to store the set of pairwise attributes for the edges of the nearest-neighbour
graph. Different sets of attributes can be compared using a fuzzy variant of the Hausdorff distance [4]. Here the problem of finding explicit correspondences between the elements of the set is circumvented. The final method is to use an efficient graph-matching technique to ensure that the pattern of correspondences is consistent [3]. It is important to stress that as the recognition strategy becomes more sophisticated, so the computational overheads increase. We have viewed the application of these different recognition strategies as a sequential refinement process. The idea is to commence by limiting the set of possible recognition hypotheses with a coarse histogram search. The candidates are then refined on the basis of the fuzzy Hausdorff distance and finally verified by detailed graph-matching. The critical question that underpins this strategy is how much pruning of the data-base can be effected in the histogram comparison step without leading to an unacceptably high probability of rejecting the true match. The answer to this question is one of noise sensitivity. Provided that the line patterns are not subjected to undue corruption, then the initial cut can be quite severe. The aim in this paper is to provide an analysis of the two hypothesis refinement steps to better understand their noise sensitivity characteristics. We consider four corruption processes. The first of these is positional jitter. The second is the addition of clutter. The third is line dropout. The fourth and final process is that of line-fragmentation. We illustrate that the most destructive process is the addition of clutter. Based on this analysis, we provide ROC curves that can be used to set the rejection cutoff for both the relational histogram and the fuzzy Hausdorff distance. With this information to hand the processes can be integrated so as to deliver a pruned set of hypotheses which is both conservative and parsimonious.
2 Object Representation
We are interested in line-pattern recognition. The raw information available for each line segment is its orientation (angle with respect to the horizontal axis) and its length (see figure 1). To illustrate how the pairwise feature attributes are computed, suppose that we denote the line segments indexed (ab) and (cd) by the vectors xab and xcd respectively. The vectors are directed away from their point of intersection. The relative angle attribute is given by
\theta_{x_{ab},x_{cd}} = \arccos\left[ \frac{x_{ab} \cdot x_{cd}}{|x_{ab}|\,|x_{cd}|} \right]
From the relative angle we compute the directed relative angle. This is an extension of the attribute used by Thacker et al. [10], and consists of giving the relative angle a positive sign if the direction of the angle from the baseline xab to its pair xcd is clockwise and a negative sign if it is counter-clockwise. This extends the range of angles describing pairs of segments from [0, π] to [−π, π] and therefore reduces the indexation errors associated with angular ambiguities. In order to describe the relative position between a pair of segments and resolve the local shape ambiguities produced by the relative angle attribute, we introduce a second attribute. The directed relative position ϑxab,xcd is represented by the normalised length ratio between the oriented baseline vector xab and the
Fig. 1. Geometry for shape representation
vector xib joining the end (b) of the baseline segment (ab) to the intersection of the segment pair (cd):
\vartheta_{x_{ab},x_{cd}} = \left[ \frac{1}{2} + \frac{D_{ib}}{D_{ab}} \right]^{-1}
The physical range of this attribute is (0, 1]. A relative position of 0 describes parallel segments, while a relative position of 1 indicates that the two segments intersect at the middle point of the baseline. We aim to augment the pairwise attributes with constraints provided by the edge-set of the N-nearest neighbour graph. Accordingly, we represent the sets of line-patterns as 4-tuples of the form G = (V, E, U, B). Here the line-segments extracted from an image are indexed by the set V. More formally, the set V represents the nodes of our nearest-neighbour graph. The edge-set of this graph, E ⊂ V × V, is constructed as follows. For each node in turn, we create an edge to the N line-segments that have the closest distances. Associated with the nodes and edges of the N-nearest neighbour graph are unary and binary attributes. The unary attributes are defined on the nodes of the graph and are represented by the set U = {(φi, li); i ∈ V}. Specifically, the attributes are the line-orientation φi and the line-length li. By contrast, the binary attributes are defined over the edge-set of the graph. The attribute set B = {(θi,j, ϑi,j); (i, j) ∈ E ⊆ V × V} consists of the set of pairwise geometric attributes for line-pairs connected by an edge in the N-nearest neighbour graph. We are concerned with attempting to recognise a single line-pattern Gm = (Vm, Em, Um, Bm), or model, in a data-base of possible alternatives. The alternative data-patterns are denoted by Gd = (Vd, Ed, Ud, Bd), ∀d ∈ D, where D is the index-set of the data-base.
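To make the two pairwise attributes concrete, the sketch below computes them for a pair of 2D segments. It is only an illustrative reading of the definitions above: the way the vectors are directed away from the intersection point and the clockwise sign convention are our assumptions, not the authors' implementation.

```python
import numpy as np

def _cross2(u, v):
    """Scalar z-component of the 2D cross product."""
    return u[0] * v[1] - u[1] * v[0]

def pairwise_attributes(a, b, c, d):
    """Directed relative angle and directed relative position for the
    segment pair (ab), (cd); a, b, c, d are 2D end points."""
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    r, s = b - a, d - c
    denom = _cross2(r, s)
    if np.isclose(denom, 0.0):
        return None                       # parallel segments
    t = _cross2(c - a, s) / denom
    p = a + t * r                         # intersection of the two supporting lines

    # Vectors directed away from the intersection point (assumed convention).
    x_ab, x_cd = b - p, d - p
    cos_ang = np.dot(x_ab, x_cd) / (np.linalg.norm(x_ab) * np.linalg.norm(x_cd))
    theta = np.arccos(np.clip(cos_ang, -1.0, 1.0))
    if _cross2(x_ab, x_cd) > 0:           # counter-clockwise rotation
        theta = -theta                    # negative sign, as described in the text

    # Directed relative position: [1/2 + D_ib / D_ab]^(-1).
    D_ab = np.linalg.norm(b - a)
    D_ib = np.linalg.norm(b - p)
    vartheta = 1.0 / (0.5 + D_ib / D_ab)
    return theta, vartheta
```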
3 Relational Histograms
With the edge-set of the nearest neighbour graph to hand, we can construct the structurally gated geometric histogram [5]. The bin-incrementing process can be formally described as follows. Let i and j be two segments extracted from the raw image. The angle and position attributes θi,j and ϑi,j are binned provided the two segments are connected by an edge, i.e. (i, j) ∈ E. If this condition is met then the bin H(α, β) spanning the two attributes is incremented as follows:
H(\alpha,\beta) \leftarrow \begin{cases} H(\alpha,\beta) + 1 & \text{if } (i,j) \in E \text{ and } \theta_{i,j} \in A_\alpha \text{ and } \vartheta_{i,j} \in R_\beta \\ H(\alpha,\beta) & \text{otherwise} \end{cases}
where Aα is the range of directed relative angle attributes spanned by the αth horizontal histogram-bin and Rβ is the range of directed relative position
spanned by the βth vertical histogram bin. Each histogram contains nA relative angle bins and nR length ratio bins. The data-base is queried by computing the Bhattacharyya distance or histogram correlation. Suppose that hm is the normalised relational histogram for the query image and hd is the normalised histogram for the image indexed d in the data-base; then the Bhattacharyya distance is given by
R(G_d, G_m) = -\ln \sum_{\alpha=1}^{n_A} \sum_{\beta=1}^{n_R} \sqrt{h_d(\alpha,\beta) \times h_m(\alpha,\beta)}
The best-matched line pattern is the one that satisfies the condition
G_{d^*} = \arg\min_{G_d \in D} R(G_d, G_m)    (1)
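As an illustration of the histogram route, the following sketch bins precomputed (θ, ϑ) attribute pairs for the gated edges and compares two normalised histograms with the Bhattacharyya distance. The bin counts and the small constant added inside the logarithm are our assumptions; the paper does not specify them here.

```python
import numpy as np

def relational_histogram(edge_attributes, n_A=16, n_R=16):
    """Structurally gated geometric histogram from a list of (theta, vartheta)
    pairs, one per edge of the N-nearest-neighbour graph; theta is assumed to
    lie in [-pi, pi] and vartheta in (0, 1]."""
    H = np.zeros((n_A, n_R))
    for theta, vartheta in edge_attributes:
        alpha = min(int((theta + np.pi) / (2 * np.pi) * n_A), n_A - 1)
        beta = min(int(vartheta * n_R), n_R - 1)
        H[alpha, beta] += 1.0
    return H / max(H.sum(), 1.0)          # normalised histogram

def bhattacharyya(h_m, h_d):
    """Bhattacharyya distance between two normalised relational histograms."""
    return -np.log(np.sum(np.sqrt(h_m * h_d)) + 1e-12)

# Recognition (Eq. (1)): the best-matched pattern minimises the distance, e.g.
#   best = min(database_histograms, key=lambda h_d: bhattacharyya(h_query, h_d))
```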
4 Feature Sets
The second recognition strategy involves comparing the pairwise feature sets for the line-patterns. We measure the pattern similarity using pairwise attribute relations defined on the edges of the nearest-neighbour graph. Suppose that the set of nodes connected to the model-graph node I is C_I^m = {J | (I, J) ∈ E_m}. The corresponding set of data-graph nodes connected to the node i is C_i^d = {j | (i, j) ∈ E_d}. With these ingredients, the consistency criterion which combines evidence for the match of the graph Gm onto Gd is
Q(G_d, G_m) = \frac{1}{|V_m| \times |V_d|} \sum_{i \in V_d} \sum_{I \in V_m} \frac{1}{|C_I^m|} \frac{1}{|C_i^d|} \sum_{J \in C_I^m} \sum_{j \in C_i^d} P\left( (i,j) \rightarrow (I,J) \mid v^m_{I,J}, v^d_{i,j} \right)
The probabilistic ingredients of the evidence-combining formula need further explanation. The a posteriori probability P((i,j) → (I,J) | v^m_{I,J}, v^d_{i,j}) represents the evidence for the match of the model-graph edge (I, J) onto the data-graph edge (i, j) provided by the corresponding pair of attribute relations v^m_{I,J} and v^d_{i,j}. In practice, these relations are the angle difference θi,j and the length ratio ϑi,j defined in Section 2. We assume that the conditional prior can be modelled as follows:
P\left( (i,j) \rightarrow (I,J) \mid v^m_{I,J}, v^d_{i,j} \right) = \Gamma_\sigma\left( \| v^m_{I,J} - v^d_{i,j} \| \right)    (2)
where Γσ(·) is a distance weighting function. In a previous study [4] we have shown that the most effective weighting kernel is a Gaussian of the form
\Gamma_\sigma(\rho) = \exp\left[ -\left( \frac{\rho}{\sigma} \right)^2 \right]
We now consider how to simplify the computation of relational consistency. We commence by considering the inner sum over the nodes in the model-graph neighbourhood C_I^m. Rather than averaging the edge-compatibilities over the
entire set of feasible edge-wise associations, we limit the sum to the contribution of maximum probability. Similarly, we limit the sum over the node-wise associations in the model graph by considering only the matched neighbourhood of maximum compatibility. With these restrictions, the process of maximising the Bayesian consistency measure is equivalent to maximising the following relational-similarity measure:
Q(G_d, G_m) = \sum_{i \in V_d} \max_{I \in V_m} \sum_{j \in C_i^d} \max_{J \in C_I^m} \Gamma_\sigma\left( \| v^m_{I,J} - v^d_{i,j} \| \right)    (3)
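The following sketch evaluates the simplified similarity measure of Eq. (3) for two attributed nearest-neighbour graphs; the dictionary-of-edge-attribute-vectors representation and the kernel width σ are our assumptions.

```python
import numpy as np

def gaussian_kernel(rho, sigma=1.0):
    """Distance-weighting kernel of the form exp[-(rho/sigma)^2]."""
    return np.exp(-(rho / sigma) ** 2)

def feature_set_similarity(model, data, sigma=1.0):
    """Eq. (3): `model` and `data` map each node to the list of attribute
    vectors (theta, vartheta) on its nearest-neighbour edges, e.g.
    {node_id: [np.array([theta, vartheta]), ...]}."""
    Q = 0.0
    for data_edges in data.values():
        if not data_edges:
            continue
        best = 0.0
        for model_edges in model.values():
            if not model_edges:
                continue
            # For every data edge keep only the best-matching model edge,
            # then keep the model node whose neighbourhood matches best.
            contrib = sum(
                max(gaussian_kernel(np.linalg.norm(v_m - v_d), sigma)
                    for v_m in model_edges)
                for v_d in data_edges)
            best = max(best, contrib)
        Q += best
    return Q

# Recognition (Eq. (4)): the data-base pattern with the largest Q is reported.
```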
With the similarity measure to hand, the best-matched line pattern is the one which satisfies the condition
G_{d^*} = \arg\max_{G_d \in D} Q(G_d, G_m)    (4)
5 Recognition Experiments
We provide some examples to illustrate the qualitative orderings that result from the two recognition experiments. The data-base used in our study consists of 2500 line-patterns segmented from a variety of images. There are three classes of image contained within the data-base: trademarks and logos, letters of the alphabet of different sizes and orientations, and 25 aerial images. There are 5 different segmentations of each aerial image. We have a digital map for a road network contained in two of the images. Since the aerial images are obtained using a line-scan process, they are subject to barrel distortion and are deformed with respect to the map. Figures 2 and 3 compare the recognition rankings obtained from the database. In each case the left-hand panel is the result of using relational histograms while the right-hand panel is the result of using feature-sets. In each panel the thumbnails are ordered from left-to-right and from top-to-bottom according to decreasing rank. In Figure 2 we show an example of querying the data-base with the letter A. In the case of the feature-sets, the 12 occurrences of the letter A are ranked at the top of the order. It is interesting to note that the noisy versions of the letter are ranked in positions 11 and 12. In the case of the relational histograms the letter A's are more dispersed. The letters K and V disrupt the ordering. Finally, Figure 3 shows the result of querying the data-base with the digital map. In the case of the feature-sets, the eight segmentations of the two images containing the road-pattern are recalled in the top-ranked positions. In the case of the relational histogram, five of the segmentations are top-ranked. Another segmentation is ranked ninth and one segmentation falls outside the top 16.
6 Sensitivity Analysis
The aim in this section is to investigate the sensitivity of the two recognition strategies to the systematics of the line-segmentation process. To this end we
Fig. 2. The result of querying the data-base with the letter “A”
have simulated the segmentation errors that can occur when line-segments are extracted from realistic image data. Specifically, the different processes that we have investigated are:
– Extra lines: Additional lines with random lengths and angles are created at random locations.
– Missing lines: A fraction of line-segments are deleted at random locations.
– Split lines: A predefined fraction of line-segments have been split into two.
– Segment end-point errors: Random displacements are introduced in the end-point positions for a predefined fraction of lines. The distribution of end-point errors is Gaussian with a standard deviation of 4 pixels.
– Combined errors: Here we have mixed the four different segment errors described above in equal proportion.
The performance measure used in our sensitivity analysis is the retrieval accuracy. This is the fraction of queries that return a correct recognition. We query the data-base with line patterns that are known to have a number of counterparts. Here the query pattern is a distorted version of the target in the data-base. An example is furnished by the digital map described earlier, which is a barrel-distorted version of the target. Figure 4 compares the retrieval accuracy as a function of the fraction of lines that are subjected to segmentation errors. In the case of the relational histogram (Figure 4a) performance does not degrade until the fraction of errors exceeds 20%. The most destructive types of error are line-splitting, line segment end-point errors and the addition of extra lines. Line-splitting introduces additional combinatorial background that swamps the query pattern. The method is significantly less sensitive to missing lines and performs
Fig. 3. The result of querying the data-base with the digital map
well under combined errors. In the case of the feature-sets (Figure 4b) the overall performance is much better. At large error fractions it is only missing lines that limit the effectiveness of the technique. However, the onset of errors occurs when as few as 40% of the lines are deleted. The line-patterns are least sensitive to segment end-point errors. In the case of both line-addition and line-splitting there is an onset of errors when the fraction of segment errors is about 20 percent. However, at larger fractions of segmentation errors the overall effect is significantly less marked than in the case of line-deletions.
[Figure 4 panels: (a) Relational histograms and (b) Feature sets plot the accuracy of retrieval against the percentage of lines affected by noise; (c) Ranking plots the worst ranking position against the percentage of lines affected by noise. Each panel carries curves for extra lines, missing lines, split lines, end-point errors and combined errors.]
Fig. 4. Effect of various kinds of noise on the retrieval performance (a), (b); worst ranking position (c).
We now turn our attention to how the two recognition strategies may be integrated. The idea is to use the relational histogram as a filter that can be applied to the data-base to limit the search via feature-set comparison. The important issue is therefore the rank threshold that can be applied to the histogram similarity measure. The threshold should be set such that the probability of false rejection is low while the number of images that remain to be verified is small.
To address this question we have conducted the following experiment. We have constructed a data-base of some 2500 line-patterns. The data-base contains several groups of images which are variations of the same object. Each group contains 10 variations. In Figure 4(c) we show the result of querying the data-base with an object selected from each group. The plot shows the worst ranked member of the group as a function of the amount of added image noise. The plot shows a different curve for each of the five different noise types listed above. The main conclusion to be drawn from this plot is that additional lines and end-point segment errors have the most disruptive effect on the ordering of the rankings. However, provided that less than 20% of the line-segments are subject to error, then the data-base can be pruned to 1% of its original size using the relational histogram comparison. If a target pruning rate of 25% is desired then the noise-level can be as high as 75%.
7 Discussion and Conclusion
The main contribution of this paper has been to demonstrate some of the noise sensitivity systematics that limit the retrieval accuracy that can be achieved with two simple line-pattern recognition schemes. The first is based on pairwise geometric histogram comparison. The second involves comparing the set of pairwise geometric attributes. Our study reveals that the two methods have rather different noise systematics. The histogram-based method is most sensitive to noise processes that swamp the existing pattern. These include the addition of clutter and the fragmentation of existing lines. The feature-set based method, on the other hand, is relatively insensitive to the addition of line segments. However, it is more sensitive to the deletion of line segments.
References
1. T. Gevers and A. Smeulders. Image indexing using composite color and shape invariant features. IEEE ICCV'98, pages 576–581, 1998. 711
2. G. L. Gimelfarb and A. K. Jain. On retrieving textured images from an image database. Pattern Recognition, 29(9):1461–1483, 1996. 711
3. B. Huet, A. D. J. Cross, and E. R. Hancock. Graph matching for shape retrieval. Advances in Neural Information Processing Systems 11, edited by M. J. Kearns, S. A. Solla and D. A. Cohn, MIT Press, 1998, to appear (available May 1999). 711, 712
4. B. Huet and E. R. Hancock. Fuzzy relational distance for large-scale object recognition. IEEE CVPR'98, pages 138–143, June 1998. 711, 712, 714
5. B. Huet and E. R. Hancock. Relational histograms for shape indexing. IEEE ICCV'98, pages 563–569, Jan 1998. 711, 713
6. A. K. Jain and A. Vailaya. Image retrieval using color and shape. Pattern Recognition, 29(8):1233–1244, 1996. 711
7. R. W. Picard. Light-years from lena: Video and image libraries of the future. IEEE ICIP'95, 1:310–313, 1995. 711
8. W. J. Rucklidge. Locating objects using the Hausdorff distance. IEEE ICCV'95, pages 457–464, 1995.
9. M. J. Swain and D. H. Ballard. Indexing via colour histograms. IEEE ICCV'90, pages 390–393, 1990. 711
10. N. A. Thacker, P. A. Riocreux, and R. B. Yates. Assessing the completeness properties of pairwise geometric histograms. Image and Vision Computing, 13(5):423–429, June 1995. 711, 712
11. R. Wilson and E. R. Hancock. Structural matching by discrete relaxation. IEEE PAMI, 19(6):634–648, June 1997.
Complex Visual Activity Recognition Using a Temporally Ordered Database
Shailendra Bhonsle1, Amarnath Gupta2, Simone Santini1, Marcel Worring3, and Ramesh Jain1
1 Visual Computing Laboratory, University of California San Diego
2 San Diego Supercomputer Center
3 Intelligent Sensory Information Systems, University of Amsterdam
Abstract. We propose using a temporally ordered database for complex visual activity recognition. We use a temporal precedence relation together with the assumption of fixed bounded temporal uncertainty of occurrence time of an atomic activity and comparatively large temporal extent of the complex activity. Under these conditions we identify the temporal structure of complex activities as a semiorder and design a database that has semiorder as its data model. A query algebra is then defined for this data model.
1 Introduction
In this paper we present some issues related to the design of a database system for the storage and recognition of activities from video data. Automatic processing of video data in order to understand the behavior of people and objects in the video is a very active area of research, with ramifications covering fields as diverse as understanding film semantics [8] and automatic surveillance [3]. A typical activity understanding system can be conceptually divided into two parts. First, suitable video analysis algorithms extract features to represent certain low level (i.e. pre-semantic) aspects of the behavior of the objects. Typical features extracted in this phase include the trajectory of objects [4], their size, shape and (in the case of non-rigid objects like people) posture. Following the video analysis phase, the features are taken in by activity recognition modules, in charge of recognition and categorization, which classify the activities inferred from the low level features into some predefined semantic taxonomy. Traditionally, the latter problem has been solved by ad-hoc recognition modules, typically one module for each activity to be recognized. For instance, a (hypothetical) system for the analysis of activities during a soccer game would include a "foul" recognizer, a "goal" recognizer, an "attack" recognizer, and so on. More often than not, the low level features per se would not lead to the recognition of an activity. Rather, the pattern of change of the low level features would reveal the activity. For instance, recognition of the "foul" activity would
include: (a) recognition that two players are close to each other and on a collision course, (b) recognition that one of the two players suddenly stops, and (c) recognition that, after the event, all or most of the other players also stop. A system organized along these lines would be extremely inflexible and inextensible, since the detection of a new activity would entail the coding of a new recognition module. In addition, such a system would make an extension to the management of historical data problematic. Vision systems are not well equipped to manage large amounts of data. However, the development of effective low level video processing algorithms, together with the reduced cost (and therefore increased deployment possibilities) of video equipment, makes the capacity to manage large databases crucial for the success of activity recognition and analysis systems. The capacity to manage large databases, and the consequent access to large repositories of historical data, allow systems to answer new and interesting questions involving statistically deviant behaviors. Consider, for instance, a surveillance system. In many cases it is impossible to define exactly what kind of suspicious behavior we are interested in. It is possible, however, to analyze the average behavior of individuals and to identify all behaviors that deviate from the norm.
We propose to solve the problems mentioned above by including a full-fledged temporal database into activity recognition systems. In our scheme, a synthesized description of the low level features is stored in a database. A suitable temporal query language allows us to interrogate the database, detecting the pattern of changes in the features that constitute the activities. More specifically, we call activity a temporally congruent sequence of actions that can be assigned a semantic denotation independently of the behavior of the object before and after the activity. An activity is not atomic, but can be decomposed into a series of events. An event is a change in the state of the object that happens at a well defined instant in time. As an example, consider the activity "turning left." A walking person will in general keep turning left for a finite amount of time, but the activity can be decomposed into the two events "beginning to turn left" and "finishing to turn left," both happening at well defined time instants.
We also assume that there is some uncertainty associated with the determination of events. Consider, for instance, the event "beginning to turn left." Detection of this event requires a finite approximation of the second derivative of the trajectory of an object, and this approximation will require computing the difference between trajectory points at different time instants. A suitable algorithm will keep computing the approximation of the second derivative and, when its value exceeds a certain threshold, will signal a "beginning to turn left" event. It is in general impossible to determine when, within the approximation interval, the event actually took place. We assume that the uncertainty is bounded by a constant ∆ which is the same for all the events recognized by the system. Additionally, this bound on the temporal uncertainty of atomic activities is small compared to the extent of the complex activity. In this paper we use the temporal
binary precedence relation ≤∆ defined for events x and y as x ≤∆ y if and only if the times of occurrence of these activities are separated by a duration greater than ∆. This relation imposes a semiorder structure [5] on the set of events. The database that we have designed supports this temporal structure as its data model. The architectural model of recognition for this category of complex visual activities consists of a visual processing subsystem, a transducer subsystem and a database subsystem [3]. The transducer recognizes an event and assigns a domain-dependent symbol to it. The event symbols together with their attributes are then inserted into the database. Example systems for symbolic processing in activity recognition are provided in [1]. The use of partial orders to model concurrency is studied in [7], while [2,6] are examples of the design of partially ordered databases. The main difference between our data model and that of others lies in the fixing of the temporal relation between events and in the identification of a specific class of partial orders, namely semiorders. With an appropriate semantic constraint in the data model, our approach provides a computationally tractable class of partial-order algorithms for important activity-recognition-related query operations.
2 Event Recognition
The complex activity recognition architecture [3] has a visual processing subsystem that is used to extract features of the visual entities. These features are stored in a relational database, called the logbook. The transducer subsystem uses feature data from the logbook, applies event recognition algorithms and associates event symbols with the recognized events. The symbols together with their parameters are stored in the database subsystem. The transducer subsystem acts as a bridge between the visual processing subsystem, which deals with signal processing, and the database subsystem, which deals with symbolic processing. The transducer consists of one or more networks of change and state detectors [3]. The modules detect the current state of visual entities and their change. We associate symbols from a domain-dependent vocabulary of atomic activities with the various state transitions. This vocabulary is maintained by the transducer subsystem and it provides flexibility with respect to the set of atomic activities that need detection. The system provides some general purpose state transition detectors (e.g. detection of an object entering or leaving a predefined region, accelerating, stopping, turning, and so on), and an interface to include more domain specific state transition detectors. The atomic activity symbols along with the occurrence timestamps and other parameters for recognized atomic activities are sent to the activity recognition database. There is an uncertainty interval associated with the occurrence timestamp of an atomic activity. We make the assumption here that for all atomic activities this uncertainty interval is bounded above by a temporal duration ∆. This duration dictates that two atomic activities occur concurrently whenever their timestamps are less than or equal to ∆ apart.
3 Complex Activity Recognition Database
We are mainly concerned here with defining a database that can handle the temporal ordering of atomic activities. In the following subsection we briefly describe a semiorder data model for our temporally ordered database. One of the requirements for our database is that, while it provides the temporal ordering defined by the semiorder data model, it should be possible to destroy the temporal ordering, treat the database as a relational (or relational-temporal) database and issue appropriate relational queries. The query language provides operators to achieve this; it also provides operators to manipulate sets of temporal semiorders and includes a semiorder pattern definition language.
3.1 Semiorder Data Model
Consider the binary temporal relationship ≤∆ between two atomic activities x, y ∈ V, where V is a set of activity nodes. x ≤∆ y means that x occurs before y and the occurrence timestamps of these two activities are separated by more than the duration ∆, which is a fixed constant. This relation is irreflexive, non-symmetric and transitive and hence defines a partially ordered set ⟨V, ≤∆⟩. Two nodes a, b ∈ V that are incomparable under ≤∆ are denoted as a ∥ b. We denote by x+y the suborder of ≤∆ that consists of two subsets of distinct nodes S1 and S2 of V such that |S1| = x and |S2| = y and ∀a ∈ S1, ∀b ∈ S2, a ∥ b, and the nodes of S1 and S2 individually form chains. The characteristics of the relation ≤∆ ensure that induced suborders 2 + 2 and 1 + 3 do not occur in ⟨V, ≤∆⟩, thus giving the set the structure of a semiorder [5]. Two activity nodes a, b ∈ V such that a ∥ b are said to occur simultaneously. Simultaneity is not transitive. The class of semiorders subsumes two special cases. It includes the class of weak orders (orders where induced suborders 1 + 2 do not occur) and the class of total orders (orders where induced suborders 1 + 1 do not occur). For convenience we will denote an unordered set as ⟨V, ∅⟩. A labeled semiorder is a semiorder where each node has a label from a domain Σ, assigned using a function µ : V → Σ. A labeled semiorder S is defined as the tuple S = ⟨V, Σ, µ, ≤∆⟩. We will use a labeled semiorder as our data model with the following two provisions:
1. Σ represents the domain of a set of named attributes A1, . . . , AN with their respective data types. For instance, in a database we can decide to label every event with a record containing the time at which the event occurred, the position in space where the event occurred, and an integer that identifies the object that generated the event. In this case Σ is the domain containing tuples that conform to the following scheme: [T : int; x : double; y : double; z : double; id : int]. Σ is the domain of tuples with attributes A1, . . . , AN.
2. (Semantic constraint.) There is an ordering of A1, . . . , AN defined such that in any labeled semiorder, whenever two nodes x, y ∈ V and x ∥ y, µ(x) and µ(y) are lexicographically totally ordered. Here lexicographic ordering is over some fixed encoding of attribute values into any totally ordered set, possibly the set of integers. For convenience the semantic constraint for any labeled semiorder will be represented as a lexicographic linear extension [5] λ of the semiorder, where incomparable elements are ordered lexicographically. The semantic constraint helps make some (iso)morphism-related query language operators computationally tractable.
Formally, our database contains two datatypes, semiorders and sets, and the schema of the database is a set of named order relations. An order relation is a tuple O = ⟨V, Σ, µ, ≤∆⟩, where Σ is the set of labels of the events. In addition, we have an ordering of the attributes in Σ that we use for the lexicographic order λ. Hence our database is populated by sets of labeled semiorders, one set for each order relation. We give an example from the video surveillance and monitoring domain. The vision subsystem extracts the centroid, bounding box, color of the bounding box, and other information for various moving objects in a visually monitored environment. The centroid-related activities include entering or exiting a predefined region, start of left turn, start of jumping, two objects coming close, etc. We define the following schema, consisting of two order relations:
O1 (Centroid Activity)
Σ: time, ObjId, activity, position, region
≤∆: ∆ = 2 time units
Attribute order: time, ObjId, activity, position, region

O2 (Color Activity)
Σ: time, ObjId, activity, position, region, color
≤∆: ∆ = 2 time units
Attribute order: time, ObjId, activity, position, region, color
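As a minimal sketch of the data model, the fragment below encodes atomic-activity nodes and the ≤∆ precedence relation, assuming the ∆ = 2 time-unit bound of the example schema; the attribute subset and the activity names are illustrative only.

```python
from dataclasses import dataclass

DELTA = 2  # temporal uncertainty bound of the example schema, in time units

@dataclass(frozen=True)
class Event:
    """An atomic-activity node; an illustrative subset of the schema attributes."""
    time: int
    obj_id: int
    activity: str
    region: int

def precedes(x: Event, y: Event, delta: int = DELTA) -> bool:
    """x <=_Delta y: x occurs before y by more than delta."""
    return y.time - x.time > delta

def concurrent(x: Event, y: Event, delta: int = DELTA) -> bool:
    """Incomparable nodes: timestamps at most delta apart."""
    return abs(x.time - y.time) <= delta

# With DELTA = 2, events at times 3 and 5 are concurrent,
# while an event at time 3 precedes one at time 6.
e1 = Event(3, 1, "enter_region", 9)
e2 = Event(5, 2, "start_left_turn", 9)
e3 = Event(6, 1, "stop", 9)
assert concurrent(e1, e2) and precedes(e1, e3)
```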
Note that the value of Σ is fixed for a given order relation but could be different across different order relations. Also note that the data model has a few implicit keywords, like NULL, which represents an empty semiorder, and NOW, which denotes the current time.
3.2 Query Language
We have defined an algebra over the semiorder data model which forms the basis of our query language. The query language includes a semiorder pattern definition language. This language is used to define semiorder patterns that are used together with the algebra operators. We provide a flatten operator that helps
us treat the whole database as a relational database, so that queries using relational operators can be given to our semiorder database. In addition to aggregate operations that are commonly defined for relational databases, we have aggregate operations that operate over the ordered sets. The complex activity recognition database is a set of semiorders. Unless explicitly combined (using the disjoint union operator) or rearranged (using the rearrangement operator), the algebra operations apply to each semiorder contained in the set individually. In the following we informally define some of the operators of our algebra.
Selection (σC(R)), Projection (πA(R)), Renaming (δA→B(R)), Join (R1 ✶ R2). These operators are extensions of the usual operators of relational algebra. The selection operator takes a set of semiorders and retrieves those nodes whose labels satisfy the selection condition C. The result is a set of semiorders. The projection operator takes a set of semiorders and restricts the node labels to the set of attribute names A specified along with the operator. Both the selection and the projection operators preserve the semantic constraint by deriving the lexicographic linear extension (λ) of the resulting sets of semiorders from the input order relation instance. This is an important condition for projection, since the subset of attributes of Σ that are associated as node labels may not be orderable on their own. Renaming replaces a given attribute name by another name. It is usually used in conjunction with the join operator, which creates new order relations. The new value of ∆ for R1 ✶ R2 (where R1 and R2 are order relations) is taken to be the minimum of the two ∆s for R1 and R2. Since we are dealing exclusively with temporal occurs-before relations, there is always a timestamp attribute associated with every node label in a database instance. If x is a node in R1 and y is a node in R2, then a node is created in the resulting order relation if and only if the difference of the timestamps of x and y is within the new value of ∆. The newly assigned timestamp is the minimum of the two timestamps.
Disjoint Union (∪). The disjoint union operator takes a set of semiorders and produces a set containing a single semiorder. The resulting semiorder derives its lexicographic linear extension (λ) from the original lexicographic linear extensions.
Rearrange. The rearrange operator takes a set of semiorders and produces another set of semiorders. The resulting semiorders are ordered according to the lexicographically increasing values of the supplied attribute(s). The rearrange operator, along with morphism operators and iterators, can produce a variety of different results on the same database instance.
Flatten (Λ). This operator accepts a set of semiorders and produces an unordered set of the labels of the semiorder nodes. It makes sure that the resulting set of tuples has
no duplicates. Once the order is destroyed, it cannot be recovered, as the resulting set does not keep any ordering information, including that of λ. The result is a relational set of tuples to which any relational operation can be applied.
Morphism Operators. This is an important class of operators for manipulating the order information contained within the database. There are five of these operators: I, SI, SI_T, SI_D and M. The I operator is used to find whether two semiorders are isomorphic. X SI Y will extract the most recent suborders of semiorders of Y that are isomorphic to X. Here, and in all of these operators, Y is a set of semiorders and X is a singleton set containing one semiorder (possibly produced by using disjoint union). The result is a set of semiorders. X SI_D Y is defined similarly and returns all node-disjoint suborders of semiorders of Y that are isomorphic to X. X SI_T Y returns all isomorphic suborders of semiorders of Y that are temporally disjoint. X M Y extracts the largest suborder that is common to both X and Y. The algebra contains a few other operators to manipulate order information, such as finding the first set of nodes, the last set of nodes, etc. We are currently investigating the complexity of including iterators over semiorders. The query language, besides having algebraic operators and the semiorder pattern definition language, has aggregate functions that work over the semiorders. Examples are functions to find the temporal extent of semiorders, their width and height, etc.
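The sketch below gives a rough, relational reading of three of these operators over label dictionaries; it is only illustrative, and it omits the maintenance of the lexicographic extension λ as well as all of the morphism operators.

```python
def select(semiorders, predicate):
    """Selection: keep the nodes whose labels satisfy the condition.  Each
    semiorder is modelled as a list of label dicts carrying a 'time' key,
    so the <=_Delta ordering can always be re-derived from the timestamps."""
    return [[node for node in so if predicate(node)] for so in semiorders]

def project(semiorders, attributes):
    """Projection: restrict node labels to the named attributes; 'time' is
    kept here so the temporal semiorder is not lost."""
    keep = set(attributes) | {"time"}
    return [[{k: v for k, v in node.items() if k in keep} for node in so]
            for so in semiorders]

def flatten(semiorders):
    """Flatten: destroy the ordering and return a duplicate-free set of
    label tuples, to which ordinary relational operators can be applied."""
    return {tuple(sorted(node.items())) for so in semiorders for node in so}
```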
4 Complex Visual Activity Recognition Queries
Here we provide a few examples of queries over the schema defined in section 3.1. Since we have not described the pattern description language, we will only describe the patterns in English.
1. Find the common activities of object 1 and object 2 in region 9:
Y = πactivity(σObjId=1,region=9(X));
Z = πactivity(σObjId=2,region=9(X));
A = (Y) M (Z)
Here Y and Z are singleton sets of semiorders whose node labels have just the activity name as attribute. The answer A is a singleton set containing a semiorder which describes the largest set of common activities that object 1 and object 2 performed in region 9 of the environment. In this query only the temporal order in which activities are done is important. There may be many related queries that can extract temporally constrained sets of activities.
2. Find the objects that never visited region 9:
P = pattern for (activity = Enter region, region = 9)
X = Λ(πObjId(Z))
Y = Λ(πObjId(P SI_D Z))
A = X − Y
This query illustrates the use of the negation operator of relational algebra.
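A crude, set-based reading of these two queries is sketched below over flat event records; the morphism operator M and the pattern language are not reproduced, so only the shared activity names and the relational negation are illustrated.

```python
def common_activities(events, region=9, obj_a=1, obj_b=2):
    """Rough analogue of query 1: the temporal structure captured by M is
    ignored and only the shared activity names are returned."""
    def acts(obj):
        return {e["activity"] for e in events
                if e["ObjId"] == obj and e["region"] == region}
    return acts(obj_a) & acts(obj_b)

def objects_never_in_region(events, region=9):
    """Query 2 via the relational reading obtained after flattening."""
    all_objects = {e["ObjId"] for e in events}
    visited = {e["ObjId"] for e in events
               if e["activity"] == "enter_region" and e["region"] == region}
    return all_objects - visited
```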
5 Conclusions
We have identified the temporal extent, uncertainty and decomposability properties of a general category of complex visual activities and have proposed using a database for the recognition of complex activities in this category. We have also identified the exact temporal relation and the corresponding semiorder data model that the database should support. The database provides flexibility in the recognition of a variety of complex activities. Issues like real-time complex activity recognition will be addressed within the context of a concrete implementation. The notions of spatial extent, uncertainty and decomposability should also be defined and used for the recognition of certain classes of complex visual activities. We do not currently have such spatial modeling in our database design, although this is an important research topic that we plan to consider in the future.
References
1. A. Del Bimbo, E. Vicario, and D. Zingoni. Symbolic description and visual querying of image sequences using spatio-temporal logic. IEEE Transactions on Knowledge and Data Engineering, 7(4), 1994. 721
2. S. Grumbach and T. Milo. An algebra for pomsets. In ICDT '95, pages 191–207. Springer-Verlag, 1995. 721
3. Amarnath Gupta, Shailendra Bhonsle, Simone Santini, and Ramesh Jain. An event management architecture for activity recognition in a multistream video database. In Proceedings of the 1998 Image Understanding Workshop, Monterey, CA, November 1998. 719, 721
4. Ivana Mikic, Simone Santini, and Ramesh Jain. Video processing and integration from multiple cameras. In Proceedings of the 1998 Image Understanding Workshop, Monterey, CA, November 1998. 719
5. R. H. Mohring. Computationally tractable classes of ordered sets. In I. Rival, editor, Algorithms and Order. Kluwer Academic, 1989. 721, 722, 723
6. Wilfred Ng and Mark Levine. An extension of SQL to support ordered domains in relational databases. In Proceedings of the 1997 International Database Engineering and Applications Symposium, Montreal, Que., Canada, 25-27 Aug., pages 358–367, 1997. 721
7. V. R. Pratt. Modelling concurrency with partial orders. International Journal of Parallel Programming, 15(1), 1986. 721
8. Nuno Vasconcelos and Andrew Lippman. Towards semantically meaningful feature spaces for the characterization of video content. In Proceedings of International Conference on Image Processing, Santa Barbara, CA, USA, 26-29 Oct., pages 25–28, 1997. 719
Image Database Assisted Classification
Simone Santini1, Marcel Worring2, Edd Hunter1, Valentina Kouznetsova1, Michael Goldbaum3, and Adam Hoover1
1 Visual Computing Lab, University of California San Diego
2 Intelligent Sensory Information Systems, University of Amsterdam
3 Department of Ophthalmology, University of California San Diego
Abstract. Image similarity can be defined in a number of different semantic contexts. At the lowest common denominator, images may be classified as similar according to geometric properties, such as color and shape distributions. At the mid-level, a deeper image similarity may be defined according to semantic properties, such as scene content or description. We propose an even higher level of image similarity, in which domain knowledge is used to reason about semantic properties, and similarity is based on the results of reasoning. At this level, images with only slightly different (or similar) semantic descriptions may be classified as radically different (or similar), based upon the execution of the domain knowledge. For demonstration, we show experiments performed on a small database of 300 images of the retina, classified according to fourteen diagnoses.
1 Introduction
Image databases aim at retrieving images that have a certain meaning for the user asking the query. Finding an image with the right meaning is provably a difficult problem. Classification techniques attach meaning to the images by categorizing them into a fixed set of classes. Image databases avoid categorization by defining appropriate similarity measures between pairs of images, and by ranking the answers by similarity with the query. The underlying assumption is that image similarity will induce a soft categorization of some significance. Image databases can be classified according to the level of abstraction and the amount of domain knowledge used in computing the similarity. A possible taxonomy of approaches along a semantic axis is reported in fig. 1. A number of image databases assume that meaningful similarity can be expressed by a distance measure in a geometric feature space, obtained from the image data through purely data-driven image processing operations that retrieve shape, texture, and color features [1,2]. This assumption is approximately valid under certain circumstances, typically if the database is a collection of disparate images not connected by a particular field or purpose, and retrieval is incidental.
Fig. 1. Overview of the different spaces in which similarity can be defined.
However, in domains where images play an integral role in daily practice, there is a lot more to meaning than those simple features describing image content. Meaning is then largely dependent on the context and purpose of image retrieval. Therefore, systems use low level visual features as a basis for a reasoning subsystem trying to extract higher level semantics meaningful in the specific application domain [7]. Other systems try to apply some image processing operations to the visual features, in order to transform them into other visual features semantically more meaningful in the domain of discourse [5,6]. Both approaches result in what we call a Visual Semantic Space. The difference between the two is the amount of domain knowledge required and the way in which knowledge steers the extraction process. In the case of visual processing, knowledge only dictates the nature of the features to be extracted, while in reasoning it determines the methods and algorithms. In this paper, we take the idea one step further, and use a reasoning system on top of the visual semantic space. The output of this reasoning system defines features in a domain semantic space. The specific domain we consider is that of retina images. In this domain, the image similarity of interest is diagnostic: two images are similar if the same diagnosis can be ascribed to them with the same confidence. Our approach consists of two steps. In the first step we derive visual labels in the visual semantic space which are of interest in this particular domain. In the current system, the labels are assigned by an expert. They are all based on pure visual information and could hence potentially be derived from the image in an automatic way using domain specific techniques. In the second step, we use a Bayesian network, whose weights were set using domain knowledge. The output of the network consists of the marginal probabilities for each of the possible classifications. It can be used for image classification without the help of a database. In our case, however, we use the vector of marginal probabilities as a feature vector to form the domain semantic space and define a novel probability-based measure for comparing two feature vectors to establish the required similarity in this space. The rationale for using image database techniques to assist the classification is that in certain cases the output of the classifier may not be discriminant enough to allow for a sharp classification. However, it might happen that there
are images in the database with the same pattern of probabilities. We retrieve these images and assign their label to the unknown image. Ideally, this method should help decide dubious cases while retaining the cases in which the label is decidable. The paper is organized as follows. Section 2 introduces the semantic spaces, the associated reasoning methods, and their definitions of similarity. Section 3 reports results with the proposed method and compares performance at the different levels.
2 Methods
2.1 Semantic Spaces
In our application the domain semantic space is formed by a set of 14 relevant diagnoses. The visual semantic space contains a set of 44 visual cues sufficient for discriminating amongst those fourteen diagnoses. These 44 cues were determined by questioning expert ophthalmologists about the information they were looking for while observing an image for diagnostic purposes. The cues are symbolic, and each one of them takes values in a small unordered set of possible values. As an example, the visual semantic feature microaneurysm or dot hemorrhage takes values from {absent, few anywhere, many anywhere}. The number of possible values is cue dependent and varies between two and eight. Additionally, any cue may take on the value "unknown" if for a specific image it cannot be identified [3]. Separating the two semantic spaces allows us to separate the identification of visual cues from the judgment of the causes of the findings. The findings are based entirely on appearance, while the judgment process takes into account previously learned knowledge and expert training. As a practical example of the difference between the two spaces, one of the authors, who has worked on retinal images for two years but has no medical training or experience, is capable of assigning the right values to the semantic visual cues with fairly high accuracy, but he is incapable of making diagnoses.
2.2 Image Database
Our database consists of 300 retinal images, digitized from 35mm slide film and stored at 605 × 700 × 24-bit (RGB) resolution. The retinal expert on our team determined, by hand, the set of diagnoses for each image in domain semantic space. Since diagnoses are not mutually exclusive, any individual image may have more than one diagnosis. This often occurs when a primary diagnosis is accompanied by one or more secondary diagnoses. Of our 300 images, 250 have one diagnosis, 46 have two diagnoses, and 4 have three diagnoses. Example images are shown in fig. 2. It is important to notice that in this domain simple features in geometric feature space (color histograms, etc.) are quite meaningless. To the untrained eye, all images already look more or less the same. Summarizing the data using geometric features only makes them more similar.
Fig. 2. Example images in the database
2.3 Similarity in Visual Semantic Space
Defining the similarity of a pair of images in our semantic visual space requires comparing two vectors containing symbolic values. The set of admissible symbolic values for an element does not have a direct relation to a numeric value. In fact the different values do not necessarily have an ordering. These problems motivate the following similarity metric. Let F = {F1, F2, . . . , FM} represent a feature vector, consisting of M symbolic elements. Given two feature vectors FA and FB, the distance d(FA, FB) between them is defined as:
d(F_A, F_B) = \sum_{i=1}^{M} \begin{cases} 1 & F_{Ai} \neq F_{Bi} \\ 0 & F_{Ai} = F_{Bi} \end{cases}    (1)
Note that if all features could only assume two values this would reduce to the Hamming distance. Using this metric, the similarity of two images is proportional to the number of semantic features that have the same symbolic value for both images.
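A direct transcription of Eq. (1) is trivial but may help clarify how symbolic cue vectors are compared; the example cue values below are illustrative.

```python
def symbolic_distance(f_a, f_b):
    """Eq. (1): count the cue positions at which the two symbolic feature
    vectors disagree (a generalised Hamming distance)."""
    assert len(f_a) == len(f_b)
    return sum(1 for a, b in zip(f_a, f_b) if a != b)

# Example: one disagreeing cue out of three.
assert symbolic_distance(["absent", "few anywhere", "unknown"],
                         ["absent", "many anywhere", "unknown"]) == 1
```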
2.4 Reasoning in Visual Semantic Space
Obtaining values in domain semantic space requires a reasoning process based on expert knowledge. In this paper a Bayesian network based approach is followed. The Bayesian network computes probabilities for classifications based upon Bayes' rule:
P(d_i \mid m) = \frac{P(m \mid d_i) P(d_i)}{\sum_{j=1}^{N} P(m \mid d_j) P(d_j)}    (2)
where m is the vector of 44 visual semantic cues, and di is the i-th element out of the N = 14 diagnoses. For a Bayesian network to operate, it must be supplied with the set of probabilities P(m|di). These are supplied by the expert in the application domain, and are commonly called beliefs. For our application, given an average of three
values for each manifestation, this seemingly requires 44 × 3 × 16 ≈ 2300 estimated probabilities. However, many of the beliefs have a value of zero, and so may be supplied implicitly. Additionally, each probability P(m|di) defines a combined probability
P(m \mid d_i) = P(m_1 = s_1 \text{ AND } m_2 = s_2 \text{ AND } m_3 = s_3 \ldots)    (3)
where each value sk is any allowable state value for the feature mk. Rather than supply these combined probabilities, which can also be difficult to estimate, individual probabilities P(mj|di) may be supplied and, assuming mutual independence, combined as follows:
P(m \mid d_i) = P(m_1 \mid d_i)\, P(m_2 \mid d_i)\, P(m_3 \mid d_i) \cdots P(m_{44} \mid d_i)    (4)
Finally, the prior probabilities P (di ) are also supplied by the expert.
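The authors rely on a commercial Bayesian-network tool [4]; as a hedged illustration of Eqs. (2) and (4) only, the naive-Bayes style sketch below combines the individual beliefs and priors into marginal posteriors. The lookup structures `beliefs` and `priors` are our assumptions, not the tool's interface.

```python
import numpy as np

def posteriors(cue_values, beliefs, priors):
    """Return P(d | m) for every diagnosis d, combining the cue-wise beliefs
    P(m_j = value | d) under the mutual-independence assumption of Eq. (4).

    `beliefs[d][j]` is a dict mapping the observed value of cue j to its
    conditional probability, and `priors[d]` is P(d)."""
    diagnoses = list(priors)
    scores = np.array([
        priors[d] * np.prod([beliefs[d][j].get(value, 0.0)
                             for j, value in enumerate(cue_values)])
        for d in diagnoses])
    total = scores.sum()
    return dict(zip(diagnoses, scores / total if total > 0 else scores))
```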
Fig. 3. A graph of the Bayes network. Cues and diagnoses are represented by nodes, while links denote non-zero conditional probabilities.
We used commercially available tools [4] to construct and operate the Bayesian network. A graph of the structure of the Bayesian network in our application is shown in fig. 3 (only non-zero links are represented). For each diagnosis di and manifestation mj we have
\sum_{k=1}^{K_j} P(m_j = s_k \mid d_i) = 1.0    (5)
where Kj is the number of possible states for the manifestation mj. The network and the associated probabilities define the domain knowledge we utilize. It should
be noted that one of the nodes in the network is the age of the patient. Clearly this is a non-visual cue which is important for finding the proper diagnosis. Given an image with an unknown diagnosis, and its visual semantic features, the Bayesian network computes the probabilities for each individual diagnosis using eq. 2, given the set of manifestations. As indicated earlier, a doctor classifies the image into a limited set of diagnoses only. In order to separate the list of derived probabilities into a set of likely diagnoses and a set which are not likely, we perform an adaptive clustering. A threshold is found which maximizes the Fisher criterion of class separation (µ1 − µ2)²/(σ1² + σ2²), where µ and σ² are the sample means and variances of the probabilities of the two respective output categories. To perform the clustering, the output list of probabilities is sorted. The threshold is taken at the maximum value of the criterion encountered while incrementally adding diagnoses from the unlikely category to the likely category, in sorted order. Since the number of diagnoses per image is limited to three for this application, the output is in any event limited to between the one and three most likely diagnoses.
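A sketch of this adaptive clustering is given below; the handling of ties and of degenerate variances, and the way the three-diagnosis cap is enforced, are our assumptions.

```python
import numpy as np

def likely_diagnoses(marginals, max_out=3):
    """Split the sorted marginal probabilities into 'likely' and 'unlikely'
    groups at the split maximising (mu1 - mu2)^2 / (sigma1^2 + sigma2^2)."""
    items = sorted(marginals.items(), key=lambda kv: kv[1], reverse=True)
    values = np.array([v for _, v in items])
    best_k, best_score = 1, -np.inf
    for k in range(1, min(max_out, len(values) - 1) + 1):
        likely, unlikely = values[:k], values[k:]
        denom = likely.var() + unlikely.var()
        score = ((likely.mean() - unlikely.mean()) ** 2 / denom
                 if denom > 0 else np.inf)
        if score > best_score:
            best_score, best_k = score, k
    return [d for d, _ in items[:best_k]]
```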
2.5 Similarity in Domain Semantic Space
The output of the Bayesian network can be considered as a 14-dimensional feature vector, and used for indexing the database of images. Now, given a query image, the 14 marginal probabilities of the diagnoses are used to retrieve images with similar marginal probabilities from the database. The diagnoses of these images are returned by the system. The rationale behind this choice is that sometimes the output of the Bayesian network is not sufficient to make a clear choice regarding the diagnoses to be assigned to an unknown image. In other words, classes may not be well separated. In these cases, however, the pattern of probabilities can be indicative of the diagnoses, and finding an image in the database with a similar pattern of probabilities can give us the right result. Formally, let I_i be the i-th image in the database, D(I_i) the set of diagnoses associated with I_i, and p_i = B(I_i) the output of the Bayesian network when image I_i is presented as input. We define a distance measure \delta between outputs of the Bayesian network, \delta(B(I_i), B(I_j)) being the distance between the i-th and the j-th image in the database. When the unknown image J is presented, we determine a permutation \pi of the database such that

\delta(B(J), B(I_{\pi_i})) \leq \delta(B(J), B(I_{\pi_{i+1}}))   (6)
to rank the images, and retain the K images closest to the query: \{I_{\pi_1}, \ldots, I_{\pi_K}\}. The union of the diagnoses of these images will be assumed as the set of diagnoses of the unknown image:

D(J) = \bigcup_{i=1}^{K} D(I_{\pi_i})   (7)
The definition of the distance function \delta is obviously important to assure the correct behavior of the system. We again use a function that can be seen as a generalization of the Hamming distance. If 0 \leq p \leq 1 is a probability value, we define its negation as \bar{p} = 1 - p. Given two vectors of marginal probabilities x and y, we define their distance as

\delta(x, y) = \frac{1}{N}\,(x \cdot \bar{y} + \bar{x} \cdot y)   (8)
The normalization factor guarantees that 0 \leq \delta \leq 1. It is immediate to see that if the elements of x and y are in \{0, 1\}, \delta is the Hamming distance between them. This distance also has another interesting interpretation. Consider a single component of the vectors, x_i and y_i, and assume that the "true" values of those components can only be 0 or 1 (i.e. a disease is either present or not). Because of uncertainty, the actual values of x_i and y_i are in [0, 1]. In this case

x_i \bar{y}_i + \bar{x}_i y_i = x_i (1 - y_i) + (1 - x_i)\, y_i   (9)
is the probability that x_i and y_i are different. The value of K should be chosen using engineering judgment. High values of K will increase the number of true diagnoses incorporated in the answer of the system; that is, increasing the value of K will reduce the number of false negatives. At the same time, however, increasing the value of K will increase the number of false positives. In all our experiments we used K = 1, considering only the database image closest to the query.
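A sketch of the distance of eq. (8) and the retrieval of eqs. (6)–(7); vector handling with NumPy, function names illustrative:

```python
import numpy as np

def delta(x, y):
    """Generalized Hamming distance between marginal-probability vectors, eq. (8)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (np.dot(x, 1.0 - y) + np.dot(1.0 - x, y)) / len(x)

def retrieve_diagnoses(query_probs, db_probs, db_diagnoses, k=1):
    """Rank database images by delta to the query and return the union of the
    diagnoses of the k closest images, eqs. (6) and (7)."""
    order = np.argsort([delta(query_probs, p) for p in db_probs])
    result = set()
    for i in order[:k]:
        result |= set(db_diagnoses[i])
    return result
```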
3 Results
Let C define a classification for an image, consisting of a set of one or more diagnoses:

C = \{D_i, \ldots\}   (10)

where each D_i is one of the 14 diagnosis codes. Let C_1 and C_2 define two different classifications for an image. Typically, C_1 will be our "ground truth" (expert classification) and C_2 the classification obtained with one of the three methods above. We define the quality of the match as

Q = \frac{|C_1 \cap C_2|}{|C_2|}   (11)
A value Q = 0 means that no correct diagnosis was reported; a value Q = 1 means that all correct diagnoses and no extra diagnoses were reported. Note that the normalization factor is |C_2| and not |C_1|, to penalize giving too many diagnoses. We considered our image database and performed a rotation (leave-one-out) experiment: each one of the images was in turn removed from the database and considered as the unknown image. The values of Q were collected for all images, and their average computed.
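A small sketch of this evaluation (the classify argument stands in for any of the three classification methods above; the names are illustrative):

```python
def match_quality(truth, predicted):
    """Q = |C1 ∩ C2| / |C2|, eq. (11); penalises over-generation of diagnoses."""
    pred = set(predicted)
    return len(set(truth) & pred) / len(pred) if pred else 0.0

def rotation_average(images, truths, classify):
    """Leave each image out in turn, classify it against the rest, average Q."""
    scores = []
    for i, img in enumerate(images):
        rest = images[:i] + images[i + 1:]
        scores.append(match_quality(truths[i], classify(img, rest)))
    return sum(scores) / len(scores)
```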
For nearest neighbors in visual semantic space this yielded a value of 0.52, and for the Bayesian classifier 0.53. The method using the Bayesian classifier and the database (search in the domain semantic space) yielded 0.57. The variance was approximately 0.16 in all three cases.
4 Discussion
In this paper we have proposed a new framework for performing database searches by introducing a semantically meaningful feature space. A reasoning system, such as a Bayesian network, can provide this. Reasoning alone does not always provide sufficient information to classify images. In these cases, comparing the pattern of marginal probabilities with that of images classified earlier can aid in proper classification. The new similarity measure we defined generalizes the Hamming distance of binary patterns. The results indicate that the performance of the nearest neighbor and Bayesian classifiers is indistinguishable, while there is some evidence that the combination of the classifier and the database yields improved results. It is noted that the improvement is small. We hypothesize that the complexity of the semantic network is not on par with the small database of 300 images. Furthermore, the results are only as good as the coverage of the database. If we give the system an image with certain diseases, and the database contains no image with the same diseases, we will not be able to obtain a correct answer. Thus, we can expect the performance of the system to increase with the size of the database.
Acknowledgments We gratefully acknowledge Prof. Ken Kreutz Delgado for the many fruitful discussions and for suggesting the generalized Hamming distance.
References
1. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the QBIC system. IEEE Computer, 28(9), 1995. 727
2. A. Gupta and R. Jain. Visual information retrieval. Communications of the ACM, 40(5):70–79, 1997. 727
3. A. Hoover, M. Goldbaum, A. Taylor, J. Boyd, T. Nelson, S. Burgess, G. Celikkol, and R. Jain. Schema for standardized description of digital ocular fundus image contents. In ARVO Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1998. Abstract. 729
4. F. Jensen. HUGIN API Reference Manual, Version 3.1. Hugin Expert A/S, 1997. 731
5. V.E. Ogle and M. Stonebraker. Chabot: retrieval from a relational database of images. IEEE Computer, 28(9), 1995. 728
6. G.W.A.M. van der Heijden and M. Worring. Domain concept to feature mapping for a plant variety image database. In A.W.M. Smeulders and R. Jain, editors, Image Databases and Multimedia Search, volume 8 of Series on Software Engineering and Knowledge Engineering, pages 301–308. World Scientific, 1997. 728
7. N. Vasconcelos and A. Lippman. A Bayesian framework for semantic content characterization. In Proceedings of the CVPR, pages 566–571, 1998. 728
A Visual Processing System for Facial Prediction
Changsheng Xu1, Jiankang Wu1, and Songde Ma2
1 Kent Ridge Digital Labs, 21 Heng Mui Keng Terrace, Singapore 119613, Republic of Singapore, {xucs, jiankang}@krdl.org.sg
2 National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P.R. China, [email protected]
Abstract. In this paper, we describe practical 2D and 3D visual processing systems for facial prediction in the surgical correction of dento-maxillofacial deformities. For the 2D system, we use adaptive filtering and edge detection techniques together with artificial intelligence approaches to locate landmarks and make diagnostic decisions on the facial cephalogram and photo. For the 3D system, we employ the active laser triangulation principle to acquire range data of the human face and reconstruct a high-quality 3D face image. The 2D system realizes the automation from landmark location to parameter measurement on the cephalogram, while the 3D system makes the preoperative diagnostic design of orthognathic surgery more visual and accurate.
1 Introduction
Prediction of facial appearance after orthognathic surgery is one of the important steps in the surgical correction of dento-maxillofacial deformities. Traditionally, however, not only landmark location but also facial prediction was completed manually by drawing on the cephalogram, which was time-consuming and of unsatisfactory accuracy. The application of image processing and computer vision in orthognathic surgery provides a new method that makes the diagnosis, treatment planning and prediction analysis more accurate and quicker. In this paper, a visual 2D and 3D automatic facial image acquiring and processing system is presented using image processing and computer vision techniques. In the 2D facial prediction system, a kind of adaptive Kalman filtering approach for color
noise, based on the correlative method of the system output, is proposed to solve the color noise problems in images [1]. This approach does not require knowledge of the statistical properties of the noise, so it avoids complicated computation. An optimal scale for the edge detector is derived from optimal edge detection theory by introducing scale-space theory [2]. An automatic threshold is introduced in the process of edge detection. Based on the optimal scale and the automatic threshold, a fast adaptive edge detecting algorithm is proposed. This algorithm strikes a compromise between the precision of edge detection and the effectiveness of noise removal. In the 3D facial acquiring and reconstructing system, a laser projector emits a laser stripe onto the patient's face and a CCD camera detects it. A transformation matrix relates the laser image in the CCD camera to the corresponding absolute position in space. By rotating the laser projector and CCD camera, we can obtain the whole 3D data of the patient's face. In order to reconstruct a high-quality 3D image and improve the visual effect for the dentist's diagnosis, we also register the 2D gray camera image with the range data. The 2D and 3D visual facial image processing systems not only realize the automation from landmark location to parameter measurement on the cephalogram but also make the preoperative diagnostic design of orthognathic surgery more accurate.
2 2D Facial Processing and Predicting System
The 2D system can be described as an intelligent image processing system that uses the data processing ability of computers to emulate the human vision process. It contains image preprocessing (smoothing, filtering), image feature extraction (recognition, location) and image content understanding (measurement, analysis). The diagram of the 2D system is shown in Fig.1.
Fig.1. Diagram of the system structure: X-ray film/face input, image filtering, edge detection, landmark location, measurement, data output, diagnosis design, analogue operation, and facial prediction processing.
This system can reduce the artificial errors in cephalometric analysis and eliminate the sources of error in both cephalometric tracing and landmark location. It makes cephalometric analysis simpler, quicker and more accurate in orthognathic surgery. Fig.2 shows the original image of a cephalometric film. The filtered and edge detected images are shown in Fig.3 and Fig.4 respectively. Fig.5 shows the image in which landmarks were located automatically. Fig.6 shows the facial image of a patient before operation, the predicted result, and after operation.
Fig.2. Original image
Fig.3. Filtered image
Fig.4. Edge detected image
Fig.5. Landmark image
Fig.6. Facial image: (a) before operation, (b) prediction, (c) after operation
3 3D Facial Acquiring and Reconstructing System
The 3D system is based on the active triangulation principle. The typical configuration is shown in Fig.7. A single camera is aligned along the z-axis with the center of the lens located at (0,0,0). At a baseline distance b to the left of the camera is a light projector sending out a beam or plane of light at a variable angle θ relative to the x-axis baseline. The point (x, y, z) is projected into the digitized image at the pixel (u, v). The measured quantities (u, v, θ) are used to compute (x, y, z) as follows:
x = \frac{b\,u}{f \cot\theta - u}   (1)

y = \frac{b\,v}{f \cot\theta - u}   (2)

z = \frac{b\,f}{f \cot\theta - u}   (3)
Fig.7. Geometry for active triangulation
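A minimal sketch of eqs. (1)–(3); θ is assumed to be given in radians, and the variable names follow the text:

```python
import math

def triangulate(u, v, theta, b, f):
    """Recover (x, y, z) from an image pixel (u, v) and projector angle theta,
    eqs. (1)-(3); b is the baseline, f the focal length."""
    denom = f / math.tan(theta) - u        # f*cot(theta) - u
    x = b * u / denom
    y = b * v / denom
    z = b * f / denom
    return x, y, z
```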
In our system, we project a laser stripe onto the human face, so all the points on the vertical laser line can be ranged at the same time, which accelerates the acquisition procedure. The system consists of four parts: the optical apparatus, the mechanic apparatus, two circuit boards and a computer that controls the whole system, as Fig.8 shows. The optical apparatus includes two CCD cameras and a laser projector. The reason for using two cameras is to eliminate the missing-parts problem, which occurs when occlusion prevents detection by one of the cameras. In order to acquire the range data of the whole face and keep the patient comfortable, the patient sits still in the scene and the optical apparatus rotates around the patient. The mechanic apparatus has an arm which is driven by an electric motor. When the motor is working, the arm rotates around the patient. The optical apparatus is fixed on the arm. One of the circuit boards is a control board which controls the working state of the mechanic apparatus. The other is an image grabbing board which transfers image data from the CCD cameras. The two boards are inserted into the PC's slots. The computer is the control center; it controls the whole data acquisition process and performs the 3D reconstruction of the human face. In the calibration process, we determine the camera focal lengths and the translation and orientation of the projection system with respect to the global coordinate system. This process needs to be accomplished only once for each setting of parameters.
Fig.8. System configuration
The calibration is performed by locating a few 3-D points with known global coordinates and their corresponding image points. The rotation, scaling and perspective projection can be described by a single matrix A. Assume the global coordinate system is O-XYZ and the camera coordinate system is o-uv. In our system, we choose points in the laser plane and make the O-XY plane of the global coordinate system overlap the laser plane, so the Z coordinates of all these points are zero. Then the matrix A is a 3×3 matrix and the transformation can be described as:
\rho \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}   (4)
Assume the known global coordinate of one point is (X_i, Y_i, 0) and its corresponding camera coordinate is (u_i, v_i). After rearranging the unknown elements of A into a vector, the relationship can be described as:

Q A = 0   (5)
where

Q = \begin{bmatrix} X_i & Y_i & 1 & 0 & 0 & 0 & -X_i u_i & -Y_i u_i & -u_i \\ 0 & 0 & 0 & X_i & Y_i & 1 & -X_i v_i & -Y_i v_i & -v_i \end{bmatrix}   (6)

A = [A_{11}\ A_{12}\ A_{13}\ A_{21}\ A_{22}\ A_{23}\ A_{31}\ A_{32}\ A_{33}]^T   (7)
These equations represent two equations in the eight unknown transformation elements A_ij (A is defined up to scale). By applying these equations to n (n > 4) non-collinear known locations in the object space and their corresponding image points, the unknown transformation coefficients A_ij can be computed. We designed a special reference object on which some particular points are marked. The distances between the points and the images of these points in the CCD camera can be identified, enabling the transformation matrix to be constructed. Because the laser plane passes strictly through the rotation axis of the mechanic apparatus, the transformation matrix is the same at every angle, so we can calibrate at one angle and apply the resulting matrix to all other angles. The procedure for acquiring face range data is divided into two steps. First, the laser projector is turned on and projects the laser stripe onto the human face. The mechanic arm starts from a "zero" position and rotates clockwise by 180°. Each of the two cameras grabs an image each time the arm rotates by 1°. After the arm completes its rotation we have 180 images of laser stripes from 180 different angles. Second, the laser projector is turned off and the arm rotates anti-clockwise. Each camera again grabs an image each time the arm rotates by 1°, so we obtain a gray level image from each angle. The coordinate of each point on the laser stripe can be computed using the elements of the perspective projection transformation matrix. We build two coordinate systems to calculate the 3D coordinates. One is the fixed world coordinate system O-X_wY_wZ_w, whose O-X_wY_w plane overlaps the laser plane at angle 0°. The other is a mobile coordinate system O-X_mY_mZ_m, whose O-X_mY_m plane moves with the laser projector and keeps overlapping the laser plane at each angle, as Fig.9 shows.
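A sketch of the calibration step: stacking the two rows of eq. (6) for n reference points and taking the null-space (least-squares) solution of QA = 0 with an SVD. The SVD solver is an assumption; the paper does not state how the homogeneous system is solved:

```python
import numpy as np

def estimate_A(world_pts, image_pts):
    """Estimate the 3x3 transform A (up to scale) from n >= 4 point pairs.

    world_pts: list of (X, Y) points in the laser plane (Z = 0)
    image_pts: list of corresponding (u, v) pixels
    """
    rows = []
    for (X, Y), (u, v) in zip(world_pts, image_pts):
        rows.append([X, Y, 1, 0, 0, 0, -X * u, -Y * u, -u])
        rows.append([0, 0, 0, X, Y, 1, -X * v, -Y * v, -v])
    Q = np.asarray(rows, dtype=float)
    # The right singular vector of the smallest singular value minimises ||Q A||.
    _, _, vt = np.linalg.svd(Q)
    return vt[-1].reshape(3, 3)
```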
Assume the coordinate of a point in the camera image is (u, v). We can calculate the corresponding mobile coordinate (X_m, Y_m, Z_m)^T from

\rho \begin{bmatrix} X_m \\ Y_m \\ 1 \end{bmatrix} = A^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}   (8)

and Z_m = 0. Then the world coordinate of the point can be calculated from X_w = X_m cos θ, Z_w = X_m sin θ, Y_w = Y_m. After calculating all the points in the laser stripe at each angle, we get the whole face range data. Then we can reconstruct the 3-D face image. Fig.10(a) is an original face image. Fig.10(b) is the 3-D face stripe image and Fig.10(c) is the corresponding 3-D image which uses a light model. In order to enhance the display effect, we register the gray image with the range image and make the final image more like the real face photo. Fig.10(d) and Fig.10(e) are two final 3-D face images from different directions.
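A sketch of eq. (8) and the rotation into world coordinates; A is the calibrated matrix from eqs. (4)–(7) and theta the arm angle in radians:

```python
import math
import numpy as np

def pixel_to_world(u, v, A, theta):
    """Map an image pixel on the laser stripe to world coordinates.

    Solves rho*[Xm, Ym, 1]^T = A^{-1} [u, v, 1]^T (eq. 8), sets Zm = 0 and
    rotates the mobile frame by the arm angle theta."""
    m = np.linalg.inv(A) @ np.array([u, v, 1.0])
    xm, ym = m[0] / m[2], m[1] / m[2]     # divide out the scale factor rho
    xw = xm * math.cos(theta)
    zw = xm * math.sin(theta)
    yw = ym
    return xw, yw, zw
```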
Fig.10. Experimental results
4 Conclusion
The application of image processing and computer vision techniques in orthognathic surgery provides a new method for improving the diagnosis and treatment of dento-maxillofacial deformities. The 2D and 3D facial image processing systems have shown good predictive accuracy and reliability in clinical application, with the following characteristics. (1) The 2D system realizes the automation from landmark location to parameter measurement in the cephalogram and accurately predicts the postoperative changes. (2) The 2D system can simulate the whole procedure of orthognathic surgery. The predicted facial appearances of the patients help surgeon-patient communication and make the surgery plan more reasonable and feasible. (3) The 3D system can accurately acquire the whole face range data at high speed, and the scanning speed and accuracy can be further improved by using a camera with higher speed and resolution. (4) The final 3-D registered image looks more like a real photo and is easy for the dentist to observe and use to make diagnostic decisions.
Acknowledgment Dr. Zhang Xiao is thanked for his kind help and for providing important access to cephalograms.
References
1. Xu, C.S., Ma, S.D., Adaptive Kalman Filtering Approach of Color Noise in Cephalometric Image, High Technology Letters, Vol.3, No.2, (1997) 8-12
2. Xu, C.S., Ma, S.D., Adaptive Edge Detecting Approach Based on Scale-Space Theory, IEEE Proc. of IMTC/97, Vol.1, Ottawa, Canada, (1997) 130-133
3. Xu, C.S., Xu, Z.M., Edge-Preserving Recursive Noise-Removing Algorithm and Its Applications in Image Processing, Journal of Tsinghua University, Vol.36, No.8, (1996) 24-28
4. Xu, C.S., Xu, Z.M., Application of Kalman Filter in Automatic Cephalometric Analysis System, Journal of Pattern Recognition and Artificial Intelligence, Vol.9, No.2, (1996) 130-137
5. Cardillo, J., Sid-Ahmed, M.A., An image processing system for locating craniofacial landmarks, IEEE Trans. on Medical Imaging, Vol.13, No.2, (1994) 275-289
6. Ji, A., Leu, M.C., Design of optical triangulation devices, Optics and Laser Technology, Vol.21, No.5, (1989) 335-338
7. Akute, T., Negishi, Y., Development of an automatic 3-D shape measuring system using a new auto-focusing method, Measurement, Vol.9, No.3, (1991) 98-102
8. Tang, S., Humg, Y.Y., Fast profilometer for the automatic measurement of 3-D object shapes, Appl. Opt., Vol.29, No.10, (1990) 3012-3018
9. Clarke, T.A., The use of optical triangulation for high speed acquisition of cross section or profiles of structures, Photogrammetric Record, Vol.13, No.7, (1990) 523-532
Semi-interactive Structure and Fault Analysis of (111)7x7 Silicon Micrographs
Panagiotis Androutsos1, Harry E. Ruda2, and Anastasios N. Venetsanopoulos1
1 Department of Electrical & Computer Engineering, University of Toronto, Digital Signal & Image Processing Lab, 10 King's College Road, Toronto, Ontario, M5S 3G4, Canada, {oracle,anv}@dsp.toronto.edu, WWW: http://www.dsp.toronto.edu
2 Department of Metallurgy and Materials Science, University of Toronto, Electronic Materials Group, 184 College St., Toronto, Ontario, M5S 3E4, Canada, [email protected], WWW: http://www.utoronto.ca/ emg
Abstract. A new technique by which the electron micrographs of (111)7x7 Silicon are analyzed is discussed. In contrast to the conventional manner by which pseudocolor is introduced into normally gray scale surface scans, this method performs a high-level, knowledge based analysis to provide the viewer with additional information about the silicon sample at hand. Namely, blob recognition and analysis, as well as a priori knowledge of (111)7x7 Silicon can be utilized to delineate structural patterns and detect fault locations. The conveyance of information such as this is of much more consequence to an investigator interested in determining a sample’s uniformity and structure.
1 Introduction
For years, Quantum Physics preached the existence of the atom. It was the advent of Electron Microscopy, however, that provided a major breakthrough by which theory could actually be visualized. In the many years which have passed, many strides forward have been made which enable scientists to perform incredible feats with the tiniest of tools and with the most basic of building blocks. The ability to actually see what is happening at the atomic level is only superseded by one’s knowledge of it, and thus the requirements for imaging have always been of great importance in this field. This intimate relationship that exists between vision and knowledge is one of the factors which contribute to understanding.
1.1 Pseudocolor Micrographs
Traditional methods by which electron micrographs are made more intelligible are based on the fact that the human visual system is able to distinguish between a larger variety of color levels than gray levels. Pseudocoloring involves an operation where a particular feature(s) of an image (or set of images) is mapped to a particular color. As a result, the coding of desired information or properties that are embedded within, and eventually extracted from the image(s), can be conveyed to the viewer in an efficient manner [1]. The advantage of presenting visual information compactly through such a utilization of color is obvious.
1.2 Overview
In the case of surface micrographs, there exists a very large choice of features to focus on. This paper concentrates on the analysis of the repetitive pattern present in (111)7x7 silicon micrographs. A variety of techniques are used to extract relevant information regarding both atomic structure and patterns, as well as atomic discontinuities. Gray level techniques are utilized to obtain a field of landmark shapes, or 'blobs', which are subsequently passed to a high-level, knowledge-based system that performs fault detection and atomic surface structure delineation.
2 Overall System Implementation
Referring to Figure 1, some general statements can be made about the system. First, the input image, which can be in pseudocolor, is converted to gray scale. This is followed by a histogram equalization. A contrast- and brightness-enhanced image is preserved for use as the bottom layer in the final result. Following these gray-level transformations, the image is made into a binary one via a thresholding operation. The result is a field of shapes or blobs which are recursively analyzed for shape, size, etc. This blob analysis and classification is used to extract faults from the micrograph. Once the faults have been removed from the array, the final candidate blobs are analyzed using a knowledge base to delineate the structural lines. Finally, a line drawing algorithm [3] is utilized to generate the last layer. The final output consists of the original micrograph, the surface faults, and an outline of the pattern created by the atoms on the surface.
2.1 Pre-processing
The process of manipulating the gray scale silicon micrograph image for further analysis is a very precarious step in this system. There are three stages involved here. First, histogram equalization of the image’s gray levels promotes visual appeal to the viewer, and also maps the gray levels such that they span the entire range of possible pixel values. Subsequently, the image is made binary via thresholding.
Fig. 1. Overall System Block Diagram
The final preprocessing step involves morphological processing of the resultant blobs. The image is opened using a 3-pixel-wide circular mask. Equation (1) depicts this set operation, where X is the image at hand and A is the morphological structuring element [4]:

X_A = (X \ominus A) \oplus A,   (1)

where

X \ominus A \equiv \{x : A_x \subset X\},   (2)

X \oplus A \equiv \{x : A_x \cap X \neq \emptyset\}.   (3)

This step provides some filtering of spurious data and smoothing of jagged edges, as well as increased separation between shapes. A round mask was chosen since, in general, the landmarks that are being sought are round in nature.
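A sketch of this opening step using scipy.ndimage; the construction of the 3-pixel-wide disk is an assumption about the circular mask described above:

```python
import numpy as np
from scipy.ndimage import binary_opening

def open_blobs(binary_image, radius=1):
    """Morphological opening X_A = (X erode A) dilate A with a disk mask."""
    # Disk of diameter 2*radius + 1 (3 pixels wide for radius = 1).
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = (x * x + y * y) <= radius * radius
    return binary_opening(binary_image, structure=disk)
```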
2.2 Blob Analysis
Shape analysis can require a large amount of computational power. This particular system was programmed in Java (a registered trademark of Sun Microsystems) using the IMAGEnius package [2]. Although Java has some computational overhead which slows down the overall system speed, it was chosen to accommodate interactivity, ease of implementation, and embedded functionality. Analysis of the blob field was performed using a recursive search algorithm. Pixels were linked using 4-connectivity, and area and perimeter were simultaneously calculated.
Fig. 2. Search criteria dialog
Figure 2 depicts a dialog which is used to select the desired match criteria. As shown, a wide variety of criteria can be used. These criteria include measures based on invariant moments [5] (phi1-phi2) as well as a roundness measure [1], whose calculation is shown in Equation (4) and which was used to obtain the results in Section 3:

\gamma = \frac{(\text{perimeter})^2}{4\pi \cdot \text{area}}   (4)
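A sketch of the roundness criterion; the cut-off used to accept a blob as a landmark is an assumed parameter, not taken from the paper:

```python
import math

def roundness(perimeter, area):
    """gamma = perimeter^2 / (4 * pi * area), eq. (4); a circle gives ~1."""
    return perimeter ** 2 / (4.0 * math.pi * area)

def is_landmark(perimeter, area, max_gamma=1.5):
    # Blobs close to circular are kept as atomic landmarks; elongated faults
    # give larger gamma. The cut-off value is an assumption.
    return roundness(perimeter, area) <= max_gamma
```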
2.3 Structure Analysis
Following the blob analysis, a field similar to the one depicted in Figure 3 results. Specifically, the resultant field in Figure 3 was acquired using only the roundness measure as a match criterion. Since, in general, faults are not round and usually take on variable shapes, they can easily be extracted from the image. At this point, it would be very easy to prompt the user for input that would connect any two blobs which he is certain are landmark points. Such information would include the correct periodicity of the pattern, as well as directional information. This user input, however, is not required, since a search algorithm, in co-operation with knowledge-based programming, can be used to extract both atomic distance and directionality. The nearest neighbor landmark
Fig. 3. Post blob analysis result
points can be found by using a growing search window around each landmark point; 3 pixels, 5 pixels, 7 pixels, etc. Figure 4 depicts the growing search window, and the expected positions of blobs for a (111)7x7 Silicon surface.
Fig. 4. Directional and Distance information search algorithm
Upon the detection of another landmark point within the search window, the distance between blob centres is measured, followed by a search in the regions where landmark points would be expected to lie. This process continues until a blob with six correctly positioned neighbors is found, or until all blobs have been examined, in which case the distance and directional information from a blob with five or perhaps four correctly positioned neighbors is used.
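A sketch of the growing-window neighbor search, with blob centres given as (row, col) pixel coordinates; the brute-force distance test is illustrative rather than the authors' implementation:

```python
def nearest_neighbor(center, centers, max_half_width=50):
    """Grow a square window (3, 5, 7, ... pixels wide) around `center` until
    another blob centre falls inside it; return that centre or None."""
    cx, cy = center
    for half in range(1, max_half_width + 1):       # window width = 2*half + 1
        hits = [c for c in centers
                if c != center
                and abs(c[0] - cx) <= half and abs(c[1] - cy) <= half]
        if hits:
            # Closest hit by Chebyshev distance; ties resolved arbitrarily.
            return min(hits, key=lambda c: max(abs(c[0] - cx), abs(c[1] - cy)))
    return None
```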
3 Results
An example of good fault extraction and line emphasis for silicon pattern delineation can be seen in Figure 5 (input image courtesy of OMICRON Vakuumphysik GmbH). The individual fault and structure line layers can be seen in Figure 6. Figure 8 depicts a second sample result. Hand-analysis of the original micrograph found a total of 113 possible correct connection lines for structure delineation. This ideal result can be seen in Figure 7. The result in Figure 5 depicts 87 correctly detected connections, with zero false connections.
Fig. 5. Input image and sample result. Image courtesy of OMICRON Vakuumphysik GmbH
Fig. 6. Fault layer and structure line layer
The missing structural lines result from the fact that surface faults that incorporate atomic landmarks are excluded from the set of blobs used to delineate structure. The total number of faults present in the original micrograph of Figure 5 is 12. The system was able to detect a total of 10 of these faults with zero false detections. The undetected faults were the two high intensity areas present near the upper-right and central-left portions of the original image. These faults, which can be interpreted as spurious atoms, were not detected because fault detection is based on the analysis of low-intensity blobs (atomic landmarks) rather than high-intensity blobs. Incorporating analysis of brightly colored blobs for improved fault detection would become an unwieldy task due to the sheer number of distinct bright shapes within the image.
4 Conclusions
The pattern that exists within the silicon structure is immediately evident in the final output. Color is utilized to provide meaningful information about the structure rather than to make the image easier to look at.
Fig. 7. Hand analysis depicting entire set of connection lines, and missing connection lines for the analysis in Figure 5
Fig. 8. Additional input image and sample result. Image courtesy of OMICRON Vakuumphysik GmbH
The blue lines clearly show where the silicon pattern exists, and the red shapes outline the locations of faults. Since a search is performed within a small vicinity for a landmark point, this algorithm will work well in situations where drift has occurred during acquisition and the resultant micrograph is less than ideal. Extending the system to incorporate data interpolation and extrapolation would improve the amount of structural delineation. This would be a relatively easy task, since a priori knowledge about the silicon structure, coupled with information extracted from the image with respect to directionality and atomic distance (in image pixels), would enable the creation of additional structure lines extending from detected landmark points with fewer than the maximum number of connections. Further work on this system can be done to examine the effects of utilizing different matching criteria as well as combinations of matching criteria with varying weights. Overall, the results show that a micrograph processed using this system conveys
a greater amount of information to the viewer than a traditional pseudocolored image for the purpose of intelligibility and/or visual appeal.
References
1. Jain, Anil K., Fundamentals of Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ, 1989. 746, 748
2. Androutsos, P., Androutsos, D., Plataniotis, K.N., Venetsanopoulos, A.N., Hands-on Education in Image Processing with Java, IEEE Conference on Multimedia Computing and Systems '99, Florence, Italy, Submitted Nov. 1998. 748
3. Foley, James W., Computer Graphics: Principles and Practice. Addison-Wesley, New York, 1996. 746
4. Sanwine, S. J., The Colour Image Processing Handbook. Chapman & Hall, London, 1998. 747
5. G. Lu, Communication and Computing for Distributed Multimedia. Artech House, Boston, 1996. 748
6. Williams, David B., Images of Materials. Oxford University Press, New York, 1991.
Using Wavelet Transforms to Match Photographs of Individual Sperm Whales Identified by the Contour of the Trailing Edge of the Fluke
R. Huele1 and J. N. Ciano2
1 Centre of Environmental Science, Leiden University, P.O. Box 9518, 2300 RA Leiden, The Netherlands, Tel +31 71 527 7477, Fax +31 71 527 7434, [email protected]
2 Florida Department of Environmental Protection, Endangered and Threatened Species, Northeast Field Station, 7825 Baymeadows Way, Suite 200B, Jacksonville, FL 32256, Tel +1 904 448-4300 ext. 229, Fax +1 904 448-4366, [email protected]
Abstract. Taking the wavelet transform of the trailing edge contour as metric and using cross correlation as measure of similarity successfully assists in matching different photographs of identified individual sperm whales. Given a photograph of a sperm whale fluke as input, the algorithm orders a collection of photographs by similarity to the given fluke contour. Applied to a set of 293 photographs taken in Bleik Canyon, Norway, the algorithm correctly presented 40 pairs among the first five candidates, of which only 24 were found by human observers. Five known matches were not among the first five candidates.
Introduction
Some species of marine mammals have characteristic markings that make it possible to identify individuals visually from photographs taken during observations in the field [2,4,8,10,13,19,20,21]. Sperm whales (Physeter macrocephalus) can often be individually identified by the sufficiently unchanging marks on the trailing edge of the flukes [1,7,9,23,28]. World-wide, thousands of photographs of sperm whale flukes have been taken in support of ethological and population dynamics research. The resulting collections are ordered by landmarking, either roughly [1] or by a more detailed system [18,28]. Landmarking can perform remarkably well under constraints of hard- and software, but indexing and retrieval of material can become very time consuming [3]. Moreover, indices based on landmarking are not always independent of the operator and ambiguity may be introduced by the use of not clearly demarcated categories.
A method of automated matching of different photographs of an individual can speed up and improve research. It will make it possible to match different collections and is essential for the proposed compilation of a North Atlantic Sperm Whale Catalogue. Both will widen the possibilities for research into the population dynamics of sperm whales [5,17,21,29]. The increasing availability of complex hardware on the consumer market and the recent successes of wavelet analysis [14,15,24,25,26,27] suggested it might be possible to design such a matching algorithm. Independent confirmation of the identity of individuals, in the form of DNA analysis or sound recordings, is only rarely available, so the final decision on identity will have to be based on the human eye. The proposed algorithm presents five candidates, ordered by likelihood of matching a given photograph, and so effectively acts as a filter to reduce the number of photographs to be examined. In contrast to most medical and industrial applications, the photographs are taken while conditions of lighting and background are not under control. This opens the possibility that the method can be used for the identification of other objects identifiable by a one-dimensional signal against a noisy background.
Material
The matching algorithm was calibrated on two collections of photographs. One set, to be named here set A, consists of 65 images, representing the sperm whales identified at Bleik Canyon, Andenes, Norway during the 1998 field season. The other set, to be called set B, is a collection of 228 photographs of the ventral surface and trailing edge of sperm whale flukes that had been acquired previously, in the period 1989 - 1997. All photographs used for this test were considered of an acceptable quality for matching by human observers [1,29]. The photographs were stored as grey-level pictures in TIFF format. Set A has a mean resolution of 220 by 496. Set B had a mean resolution of 470 by 1586, but was downsampled in order to reduce the number of columns to 1000 or less, while preserving the ratio of height to width. The photographs were all taken during whale watching tours, at the moment the sperm whale started a dive and extended its fluke into the air for a few seconds. Each photograph shows a fluke, more or less submerged in the water, surrounded by waves and sometimes sky. The contrast of the pictures varies, as does the angle from which the picture is taken. All pictures are taken from the ventral side, because whale watching protocol prescribes approaching whales from the rear.
One author (JNC), having experience in photo identification of sperm whales, visually found the matches within collection A and between the collections A and B. Collection B was presented as consisting of photographs of unique individuals and supposedly contained no matches. The other author (RH), having no experience in photo identification, tried finding these matches with the proposed algorithm. The success of the algorithm was originally defined as the percentage of known matches that the algorithm would find. Each match found, either by the human or the computer, was checked by at least two independent researchers.
Methods
The algorithm consecutively extracts the signal of the trailing edge contour from a photograph, represents the signal as a wavelet transform and calculates a measure of similarity between the photographs. The five photographs having the highest measure of similarity to a given photograph are presented as candidates for matching. If a match was found visually between the given fluke and one or more of the five candidates, this was counted as a success. If no matches were found visually between the given fluke and the first five candidates, this was counted as a failure.
Fig. 1. The procedure. (1): the original photograph, (2): the result of thresholding, (3): the result of binary operations, (4): the extracted contour, (5): the normalised contour, (6): the wavelet transform of the contour.
For each image, the grey-level histogram was calculated and the threshold boundary value was determined by minimising the Kittler-Illingworth function [11,12,16]. The resulting binary image is a picture of both the fluke and noise caused by waves, clouds and an occasional seagull. The noise characteristically showed a large width-to-height ratio and was largely removed by the operation of opening. The resulting largest dark area, ordered by size and representing at least 70% of the total black area, was interpreted as the silhouette of the fluke. From the silhouette the trailing edge was extracted as a one-dimensional signal, by finding the topmost pixels, excluding those that were more than six times the standard deviation away from the mean horizontal value. The resulting contour was interpreted as a one-dimensional signal and represented as a complex series. The contour was normalised to minimise the effect of pitch, roll and yaw. The central notch was found as the minimum value in the middle third of the series. Dividing by complex numbers oriented both halves of the contour to the horizontal. Interpolation normalised the contour to a set of 512 real numbers between 0 and 1. The tips are located at (0,0) and (512,0); the central notch was located at (256,0) and (257,0). The contour was transformed into the coefficients of scale 100 of the continuous wavelet transform, using the Daubechies wavelet of order 4. The coefficients, representing a given trailing edge, were used as an index to the photograph. A measure of similarity between two photographs was defined as the maximum of the cross correlation coefficients of the two series of wavelet coefficients. Taking the maximum made the measure relatively insensitive to phase shifts caused by variations in the extracted contour. A photograph to be matched was first indexed as a series of wavelet coefficients and then compared by brute force to all other photos in the collection. The five with the highest cross correlation were presented as candidates for matching, the final decision depending on the verdict of two independent observers. All procedures were coded in Matlab, using the toolboxes for image processing, signal processing and wavelets.
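A sketch of the indexing and matching steps in Python (the original was coded in Matlab). The single-scale continuous wavelet transform is approximated here by convolving the normalised 512-sample contour with a stretched, sampled Daubechies-4 wavelet; the function names, the sampling of the wavelet and the normalisation of the cross correlation are assumptions:

```python
import numpy as np
import pywt

def cwt_at_scale(signal, scale=100, wavelet="db4"):
    """Approximate the continuous wavelet transform of a 1-D signal at one
    scale by convolution with a stretched, sampled mother wavelet."""
    _, psi, x = pywt.Wavelet(wavelet).wavefun(level=10)
    n = int(scale * (x[-1] - x[0]))                    # stretch to the scale
    kernel = np.interp(np.linspace(x[0], x[-1], n), x, psi) / np.sqrt(scale)
    return np.convolve(np.asarray(signal, float), kernel[::-1], mode="same")

def similarity(ca, cb):
    """Maximum of the cross correlation between two coefficient series."""
    a = ca - ca.mean()
    b = cb - cb.mean()
    a /= np.linalg.norm(a) + 1e-12
    b /= np.linalg.norm(b) + 1e-12
    return float(np.correlate(a, b, mode="full").max())

def rank_candidates(query_contour, database_contours, top_k=5):
    """Brute-force search: return indices of the top_k most similar contours."""
    q = cwt_at_scale(query_contour)
    scores = [similarity(q, cwt_at_scale(c)) for c in database_contours]
    return np.argsort(scores)[::-1][:top_k]
```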
Results
In collection A of 65 photos from the 1998 field season, three matches were identified, of which only one was previously known. One known match was not identified by the algorithm. In collection B of 228 images from former years, 7 matches were found under different landmarking categories, even though the collection was presented as consisting of unique individuals only and the categories were supposed to be exclusive. Between the two sets A and B, 32 matches were found, of which only 24 were known. Two matches between A and B were not identified among the first five candidates. Matching the other way round, that is
presenting an image from collection B and identifying a match among the first 5 candidates from collection A, resulted in the same set of matching pairs. Of the total of 45 matches now known to exist, 32 were identified as first candidate, three as second candidate, one as third candidate, four as fourth candidate and the remaining 5 matches were not among the first five candidates.
Fig.2. The nearest neighbour candidates of fluke no. 816. The flukes no. 996 and 859 were both confirmed as being identical to 816, though not previously identified as such by the human researcher.
Conclusion
The proposed algorithm did present more matching pairs among the first five candidates than originally found by the human researchers, thus invalidating the original set-up of the experiment. Seen from the negative side, it has to be concluded that the number of matches in the collections is unknown, so that no conclusion can be drawn on the degree of success of the algorithm and no prediction can be made on its performance on larger datasets. Seen from the positive side, it seems that finding matching photographs in collections is so difficult for humans that the algorithm can offer welcome assistance.
Discussion
Extraction of the signal of the contour performs satisfactorily, though it has to be kept in mind that these sets were screened beforehand for visual quality. Some objective measure of photographic quality to preselect images would be helpful, as low-contrast images tend to correlate with nearly all others and overwhelm the list of candidates. Rather unexpectedly, it proved to be effectively impossible to construct a reliable test set. In the absence of an objective measure of identity, the human eye will have to decide if two photographed flukes are or are not from the same individual. Finding matches in even relatively small collections of photographs seems to be extremely hard. The main obstacle is the lack of an ordinal index. A collection ordered according to an unambiguous ordinal index provides certainty that a certain contour is not present in the collection. An ordinal index would also speed up retrieval by orders of magnitude and would simplify retrieval by hand. Lacking an ordinal index, retrieval based on the wavelet transform seems to provide satisfying results, even though it is not quite clear why the algorithm works. It is intriguing why the relatively low frequencies of scale 100 effectively code the contour, while human researchers seem to discriminate by the higher frequencies of the notches.
Acknowledgements This work would not have been possible without the guides, assistants and volunteers at the Whalecenter in Andenes, Norway, who devoted time, energy, and effort to the photo-identification of sperm whales at Bleik Canyon during many seasons. Roar Jørgensen assisted in the field, and also in the lab. Erland Letteval, Tuula Sarvas and Vivi Fleming organised and made available the material of the years 1989 - 1997. Hans van den Berg gave invaluable support on wavelet analysis, and Nies Huijsmans offered useful suggestions on image processing. Peter van der Gulik has been an untiring guide into the world of marine mammal science. Jonathan Gordon and Lisa Steiner, both of IFAW, provided photographic material for calibration of the algorithm. The authors would also like to thank the staff and administration of Whalesafari Ltd., Andenes, and extend a special note of gratitude to vessel crews: Captain Geir Maan, Captain Glenn Maan, Captain Kjetil Maan, and Arne T.H. Andreasen of M/S Reine; and to Captain Terje Sletten, Gunnar Maan, Roy Pettersen, Guro Sletten and Jan Hansen of M/S Andford.
References
1. Arnbom, Tom. Individual Identification of Sperm Whales. In: Rep. Int. Whal. Commn. 37. (1987) 201-204. 2. Bannister, J.L. Report on the Assessment of Computer-aided Photographic Identification of Humpback Whales, Western Australia: Pilot Study and Related Items. Unpublished report
to the Australian Nature Conservation Agency. (address: Western Australian Museum, Perth, Western Australia 6000 Australia) (1996) 13 pp 3. Bearzi, Giovanni: Photo-identification: matching procedures. In: Notarbartolo di Sciara, Giuseppe, Evens, Peter, Politi, Elena: ECS Newsletter no 23, Special Issue. (1994) 27-28. 4. Beck, Cathy A., Reid, James P.: An Automated Photo-identificatioin Catalog for Studies of the Life History of the Florida Manatee. US National Biological Service Information and Technology Report 1, (1995) 120-134. 5. Calambokidis, J., Cubbage, J.C., Steiger, G.H., Balcomb, K.C. and Bloedel, P. Population estimates of humpback whales in the Gulf of the Farallones, California. Reports to the International Whaling Commission (special issue 12) (1990) 325-333. 6. Castleman Kenneth R. Digital Image Processing. Prentice Hall, Upper Saddle River, New Jersey. (1996) 470-483 7. Childerhouse, S.J., Dawson S.M.: Stability of Fluke Marks used in individual photoidentification of male Sperm Whales at Kaikoura, New Zealand. In: Marine Mammal Science 12(3). (1996) 447-451. 8. Cooper, Bruce. Automated Identification of Southern Right Whales. Honours Thesis in Information Technology, University of Western Australia. (1994) 9. Dufault, Susan, Whitehead, Hal.: An Assessment of Changes with Time in the Marking Patterns used for Photoidentification of individual Sperm Whales, Physeter Macrocephalus. In: Marine Mammal Science 11(3). (1995) 335-343. 10. Dott, Hector, Best, Peter B. and Elmé Breytenbach. Computer-assisted Matching of Right Whale Callosity Patterns. Paper SC/45/0 18 presented to the International Whaling Commission Scientific Committee. (1993) 12pp. 11. Gonzalez, Rafael C., Woods, Richard E. Digital Image Processing. Addison Wesley Publishing Company. (1993) 443-457 12. Haralick, Robert M., Shapiro Linda G.: Computer and Robot Vision, Vol 1. Addison Wesley Publishing Company. (1992) 13-58 13. Hiby, Lex and Lovell, Phil. Computer Aided Matching of Natural Markings: A Prototype System for Grey Seals. Reports to the International Whaling Commission (special issue 12): (1990) 57-61. 14. Huele, Ruben, Udo de Haes, Helias: Identification of Individual Sperm Whales by Wavelet Transform of the Trailing Edge of the Flukes. In: Marine Mammal Science 14(1). (1998) 143-145. 15. Jacobs, Charles E., Finkelstein, Adam and Salesin, David H. Fast Multiresolution Image Querying. University of Washington, Seattle. Technical report UW-CSE-95-01-06. (1995) 10pp. 16. Jähne, Bernd. Digital Image Processing, Concepts, Algorithms and Scientific Applications. Third Edition. Springer Verlag, Berlin Heidelberg New York (1995) 200208 17. Katona, Steven K. and Beard, Judith A. Population Size, Migrations and Feeding Aggregations of the Humpback Whale (Megaptera Novaeangliae) in the Western North Atlantic Ocean. Reports to the International Whaling Commission (special issue 12) (1990) 295-305. 18. Letteval, Erland. Report to ANCRU: Description of the Fluke-key and division of Sections. AnCRU, Andenes (1998) 19. Lovell, Phil and Hiby, Lex. Automated Photo-identification or right whales and blue whales. Paper SC/42/PS5 presented to the International Whaling Commission Scientific Committee. (1990) 28pp. 20. Mizroch, S.A., Beard, J. and Lynde, M. Computer assisted photo-identification of humpback whales. Reports to the International Whaling Commission (special issue 12) (1990) 63-70.
21. Mizroch, S.A. and G.P Donovan, eds. Individual Recognition of Cetaceans: Use of Photoidentification and Other Techniques to Estimate Population Parameters. Rep. Int. Whal. Commn. Spec. Issue No. 12. (1990) 1-17. 22. Mizroch, S.A., Hobbes, R., Mattila, D., Baraff, L.S., and Higashi, N. A new survey protocol for capture-recapture studies in humpback whale winter grounds. . Paper SC/48/0 18 presented to the International Whaling Commission Scientific Committee. (1996) 14pp. 23. Palacios, Daniel M., Mate, Bruce R.: Attack by False Killer Whales (pseudorca crassidens) on Sperm Whales in the Galapagos Islands. In: Marine Mammal Science 12(4) (1996) 582-587. 24. Starck, J.-L., Murtagh, F., Bijaoui, A.: Image Processing and Data Analysis, The Multiscale Approach. Cambridge University Press (1998) 120-151 25. Stollnitz, Eric J., DeRose, Tony D. and Salesin, David H. 1994. Wavelets for Computer Graphics, Theory and Applications. Morgan Kaufmann Publishers, Inc. San Fransisco, California. (1996) 43-57 26. Strang, Gilbert and Truong Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press. (1996) 362-364. 27. White, R.J., Prentice, H.C., and Verwijst, Theo. Automated image acquisiton and morphometric description. Canadian Journal of Botany. 66 (1988) 450-459. 28. Whitehead, Hal. Computer Assisted Individual Identification of Sperm Whale Flukes. Reports to the International Whaling Commission (special issue 12) (1990) 71-77. 29. Whitehead, H. Assessing Sperm Whale Polulations Using Natural Markings: Recent Progress. In: Hammond, P.S., Mizroch, S.A., Donovan, G.P, Individual Recognition of Cetaceans: Use of Photo-Identification and Other Techniques to Estimate Population Parameters. Internation Whaling Commission, Cambridge UK. (1990) 377-382
From Gaze to Focus of Attention
Rainer Stiefelhagen1, Michael Finke2, Jie Yang2, and Alex Waibel1,2
1 Universität Karlsruhe, Computer Science, ILKD, Am Fasanengarten 5, 76131 Karlsruhe, Germany, [email protected], http://werner.ira.uka.de
2 Carnegie Mellon University, Computer Science Department, 5000 Forbes Avenue, Pittsburgh, PA, USA, {fimkem,yang+,ahw}@cs.cmu.edu, http://is.cs.cmu.edu
Abstract. Identifying human gaze or eye-movement ultimately serves the purpose of identifying an individual's focus of attention. Knowing a person's object of interest helps us communicate effectively with other humans by allowing us to identify our conversants' interests, state of mind, and/or intentions. In this paper we propose to track the focus of attention of several participants in a meeting. Attention does not necessarily coincide with gaze, as it is a perceptual variable, as opposed to a physical one (eye or head positioning). Automatic tracking of the focus of attention is therefore achieved by modeling both the person's head movements and the relative locations of probable targets of interest in a room. Over video sequences taken in a meeting situation, the focus of attention could be identified up to 98% of the time.
1 Introduction
During face-to-face communication such as discussions or meetings, humans not only use verbal means, but also a variety of visual cues for communication. For example, people use gestures, look at each other, and monitor each other's facial expressions during a conversation. In this research we are interested in tracking at whom or what a person is looking during a meeting. The first step towards this goal is to find out in which direction a person is looking, i.e. his/her gaze. Whereas a person's gaze is determined by his head pose as well as his eye gaze, we only consider head pose as the indicator of gaze in this paper. Related work on estimating human head pose can be categorized into two approaches: model-based and example-based. In model-based approaches usually a number of facial features, such as eyes, nostrils and lip corners, have to be located. Knowing the relative positions of these facial features, the head pose can be computed [2,8,3]. Detecting the facial features, however, is a challenging problem and tracking is likely to fail. Example-based approaches either use some kind of function approximation technique, such as neural networks [1,7,6], or a face database [4] to encode example images. The head pose of new
images is then estimated using the function approximator, such as the neural networks, or by matching novel images to the examples in the database. With example-based approaches usually no facial landmark detection is needed; instead, the whole facial image is used for classification. In the Interactive Systems Lab, we have worked on both approaches. We employed purely neural-network-based [7] and model-based approaches to estimate a user's head pose [8]. We also demonstrated that a hybrid approach could enhance the robustness of a model-based system [9]. In this paper, we extend the neural network approach to estimating the head pose in a more unrestricted situation. A major contribution of this paper is to use hidden Markov models (HMMs) to detect a user's focus of attention from an observed sequence of gaze estimates. We are not only interested in which direction a user is looking during the meeting, but also want to know at whom or what he is looking. This requires a way of incorporating knowledge about the world into the system to interpret the observed data. HMMs can provide an integrated framework for probabilistically interpreting observed signals over time. We have incorporated knowledge about the meeting situation, i.e. the approximate locations of participants in the meeting, into the HMMs by initializing the states of person-dependent HMMs appropriately. We apply these HMMs to track at whom the participants in a meeting are looking. The feasibility of the proposed approach has been evaluated by experimental results. The remainder of the paper is organized as follows: section 2 describes the neural network based head pose estimation approach. In section 3 we introduce the idea of interpreting an observed sequence of gaze directions to find a user's focus of attention in each frame, define the underlying probability model, and give experimental results. We summarize the paper in section 4.
2 Estimating Head Pose with Neural Nets
The main advantage of using neural networks to estimate head pose, as compared to using a model-based approach, is their robustness: with model-based approaches to head pose estimation [2,8,3], head pose is computed by finding correspondences between facial landmark points (such as eyes, nostrils, lip corners) in the image and their respective locations in a head model. Therefore these approaches rely on correctly tracking a minimum number of facial landmark points in the image, which is a difficult task and is likely to fail. On the other hand, the neural-network-based approach does not require tracking detailed facial features because the whole facial region is used for estimating the user's head pose. In our approach we are using neural networks to estimate pan and tilt of a person's head, given automatically extracted and preprocessed facial images as input to the neural net. Our approach is similar to the approach described by Schiele et al. [7]. However, the system described in [7] estimated only head rotation in the pan direction. In this research we use neural networks to estimate head rotation in both pan and tilt directions. In addition, we have studied two different image preprocessing approaches. Rae et al. [6] describe a user-dependent
Fig. 1. Example images taken during data collection, as used for training and testing of the neural nets
neural-network-based system to estimate the pan and tilt of a person’s head. In their approach, color segmentation, ellipse fitting and Gabor filtering on a segmented face are used for preprocessing. They report an average accuracy of 9 degrees for pan and 7 degrees for tilt for one user with a user-dependent system. In the remainder of this section we describe our neural-net-based approach to estimating a user’s head pose (pan and tilt).
2.1 Data Collection Setup
During data collection, the person we collected data from had to sit on a chair at a specific location in the room, with his eyes at a height of approximately 130 cm. A video camera to record the images was placed on a tripod at a distance of one meter and at a height of one meter. We placed marks on three walls and the floor at which the user had to look one after another. The marks were placed in such a way that the user had to look in specific, well-known directions; they ranged from -90 degrees to +90 degrees for pan, with one mark every ten degrees, and from +15 degrees to -60 degrees for tilt, with one mark every 15 degrees. Once the user was looking at a mark, he could press a mouse button, and 5 images were recorded together with the labels indicating the current head pose. We collected data from 14 male and 2 female subjects. Approximately half of the subjects were wearing glasses.
2.2 Preprocessing of Images
We investigated two different preprocessing approaches: using normalized grayscale images of the user’s face as the input to the neural nets, and applying edge detection to the images before feeding them into the nets. To locate and extract the faces from the collected images, we used a statistical skin color model [10]. The largest skin-colored region in the input image was selected as the face. In the first preprocessing approach, histogram normalization was applied to the grayscale face images as a means of normalizing against different lighting conditions. No additional feature extraction was performed; the normalized grayscale images were downsampled to a fixed size of 20x30 pixels and then used as input to the nets.
Fig. 2. Preprocessed images for Person A and Person B: normalized grayscale, horizontal edge and vertical edge image (from left to right)
In the second approach, we applied a horizontal and a vertical edge operator plus thresholding to the facial grayscale images. The resulting edge images were then downsampled to 20x30 pixels and were both used as input to the neural nets. Figure 2 shows the corresponding preprocessed facial images of the two persons depicted in Figure 1. From left to right, the normalized grayscale image and the horizontal and vertical edge images are displayed.
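As a minimal illustration of this preprocessing (leaving aside the skin-colour face localisation), the sketch below performs histogram normalisation, horizontal and vertical edge filtering with thresholding, and downsampling to the 20x30 retina size. The paper does not name the edge operator or the threshold, so the Sobel kernels and the threshold value used here are our assumptions.

```python
import numpy as np
from scipy import ndimage

def preprocess_face(gray, out_shape=(30, 20), edge_thresh=0.2):
    """Produce the three 20x30 net inputs from a cropped grayscale face.

    gray: 2-D float array with values in [0, 1]. Combining a normalized
    grayscale image with thresholded horizontal and vertical edge images
    follows the text; the Sobel kernels and the threshold are illustrative.
    """
    # Histogram equalization as a crude normalization against lighting.
    hist, bins = np.histogram(gray.ravel(), bins=256, range=(0.0, 1.0))
    cdf = hist.cumsum() / hist.sum()
    equalized = np.interp(gray.ravel(), bins[:-1], cdf).reshape(gray.shape)

    # Horizontal and vertical edge maps, thresholded to binary images.
    edges_h = np.abs(ndimage.sobel(gray, axis=0))
    edges_v = np.abs(ndimage.sobel(gray, axis=1))
    edges_h = (edges_h / (edges_h.max() + 1e-8)) > edge_thresh
    edges_v = (edges_v / (edges_v.max() + 1e-8)) > edge_thresh

    # Downsample everything to the fixed retina size.
    def resize(img):
        zoom = (out_shape[0] / img.shape[0], out_shape[1] / img.shape[1])
        return ndimage.zoom(img.astype(float), zoom, order=1)

    return [resize(equalized), resize(edges_h), resize(edges_v)]
```

Depending on the experiment, either the grayscale image alone, the two edge images, or all three arrays are concatenated to form the network input.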
2.3 ANN Architecture
We trained separate nets to estimate the pan and tilt of a person’s head. Training was done using a multilayer perceptron architecture with one hidden layer and standard backpropagation with a momentum term. The output layer of the net estimating pan consisted of 19 units representing 19 different angles (-90, -80, ..., +80, +90 degrees). The output layer of the tilt-estimating net consisted of 6 units representing the tilt angles +15, 0, -15, ..., -60 degrees. For both nets we used a Gaussian output representation: not only is the single correct output unit activated during training, but its neighbours also receive some training activation, decreasing with the distance from the correct label. The input retina of the neural nets varied between 20x30 units and 3x20x30 units, depending on the number and types of input images that we used for training (see 2.4).
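The Gaussian output representation can be written down compactly: the target vector peaks at the unit for the labelled angle and neighbouring units receive activation that decays with angular distance. A small sketch follows; the width of the Gaussian is our assumption, as the paper does not give it.

```python
import numpy as np

# Output units: one per 10-degree pan step and one per 15-degree tilt step.
PAN_ANGLES = np.arange(-90, 91, 10)       # 19 units
TILT_ANGLES = np.arange(15, -61, -15)     # 6 units: +15, 0, ..., -60

def gaussian_target(true_angle, unit_angles, sigma=10.0):
    """Training target with a Gaussian bump around the correct unit.

    sigma (in degrees) controls how much activation the neighbours of the
    correct output unit receive; its value here is an illustrative choice.
    """
    d = unit_angles - true_angle
    target = np.exp(-0.5 * (d / sigma) ** 2)
    return target / target.max()

# Example: a head pan of +20 degrees mostly activates the +20 unit,
# with smaller activations at +10 and +30.
print(np.round(gaussian_target(20.0, PAN_ANGLES), 2))
```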
2.4 Training and Results
We trained separate user independent neural nets to estimate pan and tilt. The neural nets were trained on data from twelve subjects from our database and evaluated on the remaining four other subjects. The data for each user consisted of 570 images, which results in a training set size of 6840 images and a test set size of 2280 images. As input to the neural nets, we have evaluated three different approaches: 1) Using histogram normalized grayscale images as input to the nets. 2) Using horizontal and vertical edge images as input and 3) using both, normalized grayscale plus horizontal and vertical edge images as input. Table 1 summarizes the results that we obtained using the different types of input images. When using
Table 1. Person independent results (mean error in degrees) using different preprocessing of input images. Training was done on twelve users, testing on four other users.

Net Input            Pan    Tilt
Grayscale            12.0   13.5
Edges                14.0   13.5
Edges + Grayscale     9.0   12.9
normalized grayscale images as input we obtained a mean error of 12.0 degrees for pan and 13.5 degrees for tilt on our four-user test set. With horizontal and vertical edge images as input, a slightly worse accuracy for estimating the pan was obtained. Using both the normalized grayscale image and the edge images as input to the neural net significantly increased the accuracy and led to mean errors of 9.0 degrees and 12.9 degrees for pan and tilt, respectively. These results show that it is indeed feasible to train a person-independent neural-net-based system for head pose estimation. In fact, the obtained results are only slightly worse than results obtained with a user-dependent neural-net-based system as described by Rae et al. [6]. Compared to their results, we did not observe serious degradation on data from new users; on the contrary, our results indicate that the neural nets can generalize well to new users.
3 Modelling Focus of Attention Using Hidden Markov Models
The idea of this research is to map the variable observed over time, namely the gaze direction, to discrete states of what the person is looking at, i.e. his focus of attention. Hidden Markov Models (HMMs) can provide an integrated framework for probabilistically interpreting observed signals over time. In our model, looking at a certain target is modelled as being in a certain state of the HMM, and the observed gaze estimates are considered to be probabilistic functions of the different states. Given this model and an observation sequence of gaze directions, as provided by the neural nets, it is then possible to find the most likely sequence of HMM states that produced the observations. Interpreting being in a certain state as looking at a certain target, it is now possible to estimate a person’s focus of attention in each frame. Furthermore, we can iteratively reestimate the parameters of the HMM so as to maximize the likelihood of the observed gaze directions, leading to more accurate estimates of the foci of attention. We have tested our models on image sequences recorded from a meeting. In the meeting, four people were sitting around a table, talking to and looking at each other and sometimes looking down at the table. Figure 3 shows two example images taken during data collection of the meeting. For two of the speakers we then estimated their gaze trajectories with the neural nets described in the
Fig. 3. Example images from “meeting” data as used for HMM evaluation
previous section. For each user we applied an HMM to detect his focus of attention, given the observed gaze directions over time.
3.1 HMM Design
Knowing that there were four people sitting around a table, we modelled the targets for each person P as the following four states: P is looking at the person sitting to his right, P is looking at the person to his left, P is looking at the person in front of him, and P is looking down at the table. In our model the observable symbols of each state are the pose estimation results as given by the neural nets, that is, the pan and tilt angles $\omega_{pan}$ and $\omega_{tilt}$. We have parameterized the state-dependent observation probabilities $B = b_i(\omega)$ for each state $i$, where $i \in \{\mathrm{left}, \mathrm{right}, \mathrm{center}, \mathrm{table}\}$, as two-dimensional Gaussian distributions with diagonal covariance matrices. Assuming that we know the approximate positions of the meeting participants relative to each other, we initialized the observation probability distributions of the different states with the means of the Gaussians set to the expected viewing angle when looking at the corresponding target. The transition matrix $A = (a_{ij})$ was initialized to have high transition probabilities for remaining in the same state ($a_{ii} = 0.6$) and uniformly distributed transition probabilities for all other transitions. The initial state distribution was chosen to be uniform.
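Under these modelling choices the initial HMM parameters can be written down directly. The sketch below is an illustration rather than the authors' code: the pan/tilt means per target are invented numbers for one assumed seating geometry (in the paper they come from the known approximate positions of the participants), and the variances are likewise assumed; the diagonal-covariance Gaussians, the 0.6 self-transition probability and the uniform initial distribution follow the text.

```python
import numpy as np

STATES = ["left", "right", "center", "table"]

# Illustrative expected viewing angles (pan, tilt) in degrees for each target.
MEANS = {"left": (-60.0, 0.0), "right": (60.0, 0.0),
         "center": (0.0, 0.0), "table": (0.0, -45.0)}

# Diagonal covariances (pan and tilt variances); illustrative values.
VARS = {s: (15.0 ** 2, 15.0 ** 2) for s in STATES}

# Self-transition probability 0.6, remainder spread uniformly over the others.
N = len(STATES)
A = np.full((N, N), 0.4 / (N - 1))
np.fill_diagonal(A, 0.6)

# Uniform initial state distribution.
pi = np.full(N, 1.0 / N)

def obs_logprob(omega, state):
    """log b_i(omega) for omega = (pan, tilt) under a diagonal Gaussian."""
    mu = np.array(MEANS[state])
    var = np.array(VARS[state])
    d = np.array(omega) - mu
    return float(-0.5 * np.sum(d * d / var + np.log(2 * np.pi * var)))
```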
3.2 Probabilistic Model
Let $O = \omega_1 \omega_2 \cdots \omega_T$ be the sequence of gaze direction observations $\omega_t = (\omega_{pan,t}, \omega_{tilt,t})$ as predicted by the neural nets. The probability of the observation sequence given the HMM is the sum over all possible state sequences $q$:
$$p(O) = \sum_q p(O, q) = \sum_q p(O \mid q)\, p(q) = \sum_q \prod_t p(\omega_t \mid q_t)\, p(q_t \mid q_{t-1}) = \sum_q \prod_t b_{q_t}(\omega_t)\, a_{q_{t-1}, q_t}.$$
To find the single best state sequence of foci of attention, $q^* = q_1 \ldots q_T$, for a given observation sequence, we need to find $\max_q p(O, q)$. This can be efficiently computed by the Viterbi algorithm [5]. Thus, given the HMM and the
Table 2. Percentage of falsely labelled frames without using the HMM and with using HMM before and after parameter reestimation

Seq.   no HMM   HMM, no reest.   HMM, reest.
A      9.4 %    5.4 %            1.8 %
B      11.6 %   8.8 %            3.8 %
observation sequence of gaze directions, we can efficiently find the sequence of foci of attention using the Viterbi algorithm. So far we have considered the HMM to be initialized by knowledge about the setup of the meeting. It is furthermore possible to adapt the model parameters $\lambda = (A, B)$ of the HMM so as to maximize $p(O \mid \lambda)$. This can be done in the EM (Expectation-Maximization) framework by iteratively computing the most likely state sequence and adapting the model parameters as follows:
– means:
$$\hat\mu_{pan}(i) = E_i(\omega_{pan}) = \frac{\sum_t \phi_{i,t}\, \omega_{pan,t}}{\sum_t \phi_{i,t}}, \qquad \hat\mu_{tilt}(i) = E_i(\omega_{tilt}) = \frac{\sum_t \phi_{i,t}\, \omega_{tilt,t}}{\sum_t \phi_{i,t}}, \quad \text{where } \phi_{i,t} = \begin{cases} 1 & : q_t = i \\ 0 & : \text{otherwise} \end{cases}$$
– variances:
$$\sigma^2_{pan}(i) = E_i(\omega^2_{pan}) - \left(E_i(\omega_{pan})\right)^2, \qquad \sigma^2_{tilt}(i) = E_i(\omega^2_{tilt}) - \left(E_i(\omega_{tilt})\right)^2$$
– transition probabilities:
$$a_{i,j} = \frac{\text{number of transitions from state } i \text{ to } j}{\sum_t \phi_{i,t}}$$
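A compact sketch of the decode/reestimate loop described above: Viterbi gives the best state sequence for the current parameters, and the means, variances and transition probabilities are then recomputed from that sequence, in the style of Viterbi training. The functions take the model as arguments (for instance the initialization sketched in section 3.1); the small smoothing floor on transition counts and the row normalisation are our simplifications, not a transcription of the authors' implementation.

```python
import numpy as np

def viterbi(observations, states, A, pi, obs_logprob):
    """Most likely state sequence for a sequence of (pan, tilt) observations."""
    T, N = len(observations), len(states)
    logA, logpi = np.log(A), np.log(pi)
    delta = np.zeros((T, N))
    backptr = np.zeros((T, N), dtype=int)
    for i, s in enumerate(states):
        delta[0, i] = logpi[i] + obs_logprob(observations[0], s)
    for t in range(1, T):
        for j, s in enumerate(states):
            scores = delta[t - 1] + logA[:, j]
            backptr[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[backptr[t, j]] + obs_logprob(observations[t], s)
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

def reestimate(observations, path, states):
    """Recompute Gaussian means/variances and transition probabilities
    from the decoded state sequence, following the update formulas above."""
    obs = np.asarray(observations, dtype=float)
    N = len(states)
    means, variances = {}, {}
    for i, s in enumerate(states):
        mask = np.array([q == i for q in path])
        if mask.any():
            means[s] = obs[mask].mean(axis=0)
            variances[s] = obs[mask].var(axis=0) + 1e-6
    counts = np.full((N, N), 1e-3)            # small floor to avoid zeros
    for q_prev, q_next in zip(path[:-1], path[1:]):
        counts[q_prev, q_next] += 1.0
    A_new = counts / counts.sum(axis=1, keepdims=True)
    return means, variances, A_new
```

Iterating viterbi and reestimate a few times corresponds to the parameter adaptation whose effect is reported in Table 2.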
3.3 Results
To evaluate the performance of the proposed model, we compared the state sequence given by the Viterbi decoding to hand-made labels of where the person was looking. Both of the evaluated sequences contained 500 frames and lasted about one and a half minutes each. We evaluated the performance of the HMM without model parameter adaptation and with automatic parameter adaptation. Furthermore, we evaluated the results obtained by directly mapping the output of the neural nets to the different viewing targets. Table 2 reports the obtained results. It can be seen that, compared to directly using the output of the neural nets, a significant error reduction can already be obtained by using an HMM without parameter adaptation on top of the ANN output. Using parameter reestimation, however, the error can be further reduced by a factor of two to three on our evaluation sequences.
4 Conclusion
In this paper we have addressed the problem of tracking a person’s focus of attention during a meeting. We have proposed the use of an HMM framework to detect the focus of attention from a trajectory of gaze observations, and have evaluated the proposed approach on two video sequences that were taken during a meeting. The obtained results show the feasibility of our approach. Compared to hand-made labels, accuracies of 96% and 98% were obtained with the HMM-based estimation of focus of attention. To estimate a person’s gaze we have trained neural networks to estimate head pose from facial images. Using a combination of normalized grayscale images and horizontal and vertical edge images of faces as input to the neural nets, we obtained accuracies of 9.0 degrees and 12.9 degrees for pan and tilt, respectively, on a test set of four users who were not in the training set of the neural nets.
References 1. D. Beymer, A. Shashua, and T. Poggio. Example-based image analysis and synthesis. In Proceedings of Siggraph’94, 1994. 761 2. Andrew H. Gee and Roberto Cipolla. Non-intrusive gaze tracking for humancomputer interaction. In Proc. Mechatronics and Machine Vision in Practise, pages 112–117, 1994. 761, 762 3. T.S. Jebara and A. Pentland. Parametrized structure from motion for 3d adaptive feedback tracking of faces. In Proceedings of Computer Vision and Pattern Recognition, 1997. 761, 762 4. A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for face recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994. 761 5. Lawrence R. Rabiner. Readings in Speech Recognition, chapter A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, pages 267– 295. Morgan Kaufmann, 1989. 766 6. Robert Rae and Helge J. Ritter. Recognition of human head orientation based on artificial neural networks. IEEE Transactions on neural networks, 9(2):257–265, March 1998. 761, 762, 765 7. Bernt Schiele and Alex Waibel. Gaze tracking based on face-color. In International Workshop on Automatic Face- and Gesture-Recognition, pages 344–348, 1995. 761, 762 8. Rainer Stiefelhagen, Jie Yang, and Alex Waibel. A model-based gaze tracking system. In Proceedings of IEEE International Joint Symposia on Intelligence and Systems, pages 304 – 310, 1996. 761, 762 9. Rainer Stiefelhagen, Jie Yang, and Alex Waibel. Towards tracking interaction between people. In Intelligent Environments. Papers from the 1998 AAAI Spring Symposium, Technical Report SS-98-02, pages 123–127, Menlo Park, California 94025, March 1998. AAAI, AAAI Press. 762 10. Jie Yang and Alex Waibel. A real-time face tracker. In Proceedings of WACV, pages 142–147, 1996. 763
Automatic Interpretation Based on Robust Segmentation and Shape-Extraction Greet Frederix and Eric J. Pauwels ESAT-PSI, Dept. of Electrical Eng. K.U.Leuven, K. Mercierlaan 94, B-3001 Leuven, Belgium Phone: + 32 - 16 - 321706, Fax: + 32 - 16 - 321986 {Eric.Pauwels,Greet.Frederix}@esat.kuleuven.ac.be
Abstract. We report on preliminary but promising experiments that attempt to obtain automatic annotation of (parts of) real images by using non-parametric clustering to identify salient regions, followed by a limb-characterization algorithm applied to the contours of the regions.
1 Introduction
The rapidly growing interest in content-based image access and retrieval (CBIR) for multi-media libraries has caused a resurgence in the activities relating to intermediate level processing in computer vision. Extensive experimentation over the last few years has shown that matching natural images solely on the basis of global similarities is often too crude to produce satisfactory results. What is required is some form of perceptually relevant segmentation that allows one to identify a (small) number of salient image-regions which can then serve as the basis for more discerning region-based matching. For the problems at hand, saliency is defined in terms of features that capture essential visual qualities such as colour, texture or shape-characteristics. This means that when an image is mapped into the appropriate feature-space, salient regions (by their very definition) will stand out from the rest of the data and can more readily be identified. Therefore, from an abstract point of view, segmentation can be interpreted as a problem of selecting appropriate features, followed by cluster-detection in feature-space. In fact, both steps are but two aspects of the same problem, as a particular feature-space is deemed appropriate whenever it shows pronounced clusters. Indeed, if mapping the pixels into the feature-space lumps them all together, this particular set of features is obviously of little use. Having established the relevance of unsupervised clustering, we will in the first part of this paper outline a robust, versatile non-parametric clustering algorithm that is able to meet the challenges set by the highly unbalanced and convoluted clusters that are rife in image-processing applications. Experiments on natural images confirm that it can be used to extract saliency and produce semantically meaningful segmentation. In the second part of this paper we will argue that
Post-Doctoral Research Fellow, Fund for Scientific Research (F.W.O.), Belgium.
CBIR can contribute significantly to the problem of image-understanding. Indeed, if segmentation allows us to partition an image into perceptually salient regions, we can then use CBIR-based similarity measures to match (parts of) the image to regions in other images. If the image-database is already partially annotated, this matching can be used to automatically propagate annotations to new images.
2 Non-parametric Clustering for Segmentation
Clustering based on non-parametric density-estimation The complexity of the clusters encountered in intermediate-level processing means that classical clustering-algorithms such as k-means or Gaussian mixture models often perform very poorly; hence our choice of non-parametric density estimation as the core of the clustering-algorithm. To meet the requirement of completely unsupervised segmentation we propose two new non-parametric cluster-validity measures which can be combined to pick an optimal clustering from a family of clusterings obtained by density-estimation. Recall that clustering based on non-parametric density-estimation starts from the construction of a data-density f through convolution of the dataset by a density-kernel Kσ (where σ measures the spread of the kernel). After convolution, candidate clusters are identified by using gradient ascent to pinpoint local maxima of the density f. However, unless the clustering parameter σ is preset within a fairly narrow range, this procedure will result in either too many or too few clusters, and it is very tricky to pick acceptable clustering parameters. For this reason we have taken a different route. We pick a value for σ which is small (with respect to the range of the dataset) and, as before, proceed to identify candidate clusters by locating local maxima of the density f. This will result in an over-estimation of the number of clusters, carving up the dataset into a collection of relatively small “clumps” centered around local maxima. Next, we construct a hierarchical family of derived clusterings by using the data-density to systematically merge neighbouring clumps. Notice how this is very similar to the tree constructed in the case of hierarchical clustering, but with the crucial difference that the merging is based on the density, rather than on the distance, thus eliminating the unwelcome chaining-effect that vexes hierarchical clustering. Now, in order to pick out the most satisfactory clustering we will discuss indices of cluster-validity that directly assign a performance-score to every proposed clustering of the data. Non-parametric measures for cluster-validity There is no shortage of indices that measure some sort of grouping-quality. Some of the most successful are the silhouette coefficient [3], the Hubert-coefficient, the intra- over inter-variation quotient and the BD-index, introduced by Bailey and Dubes [2]. However, all of these coefficients compare inter- versus intra-cluster variability and tend to favour configurations with ball-shaped well-separated clusters. Irregularly shaped clusters are problematic. It is for this reason that we have opted
to restrict our attention to non-parametric indices which don’t suffer the abovementioned drawbacks. As a “cluster” is a relatively well-connected region of high data-density that is isolated, we introduce the following two non-parametric measures that quantify these qualitative descriptions for a given clustering of the dataset (for more details we refer to [5]). 1. Isolation is measured in our algorithm by the k-nearest neighbour norm (NN-norm). More precisely, for fixed k (the precise value of which is not very critical), the k-nearest neighbour norm νk (x) of a data-point x is defined to be the fraction of the k nearest neighbours of x that have the same clusterlabel as x. Obviously, if we have a satisfactory clustering and x is taken well within a cluster, then νk (x) ≈ 1. However, even nearby the boundary of a well-defined cluster we can still expect νk (x) ≈ 1, since most of the nearest neighbours will be located well within the interior of the cluster. Only when a bad clustering has artificially broken a densely populated region into two or more parts, we’ll see that νk (x) is significantly smaller along the “faultline”. Averaging over the dataset yields a measure of the homogeneity for the total clustering. This quality-measure for clustering captures the fact that a cluster should be isolated with respect to the rest of the data. Furthermore, unlike most of the other criteria discussed above, it does not favour a particular cluster-structure, and is therefore very robust with respect to variations in the cluster-geometry of the cluster. However, this index doesn’t notice whenever two clusters are merged, even if they are well-separated. For this reason we need the next criterion which penalizes clusterings that erroneously lump together widely separated clusters. 2. Connectivity relates to the fact that for any two points in the same cluster, there always is a path connecting both, along which the data-density remains relatively high. In our algorithm we quantify this by choosing at random two points in the same cluster and connecting them by a straight line. We then pick a testpoint t halfway along this connecting line and subject it to gradient ascent to seek out its local density maximum. However, the constraint is that during its evolution the distance of this testpoint to either of the two “anchor-points” should remain roughly equal (to avoid that the testpoint converges to one of the anchor-points). In case the cluster has a curved shape, this allows the testpoint to position itself along the high-density crescent connecting the anchor-points. The data-density at the final position of the testpoint (averaged over a number of random choices for the anchor-points) can be used as a connectivity-indicator C (the so-called C-norm). Clearly, if the clustering lumps together two well-separated clusters, many of these testpoints will get stuck in the void between the high-density regions, thus lowering the value of the index. Combining cluster-validity indices to select a clustering In order to get a satisfactory clustering-result one has to try and maximise both indices simultaneously, trading off one agaist the other. The problem is further compounded by the fact that the relevant information is captured primarily by the
way these indices change, rather than by their specific values. Typically, the NN-norm will decrease as the number of clusters grows, while the connectivity index tends to increase, but both trends will usually exhibit a sudden transition after which they more or less level off. However, as it is tricky to reliably identify such a “knee” in a graph, we go about it differently. First, in order to make the indices directly comparable, we compute their robust Z-scores, defined by $Z(\xi_i) = (\xi_i - \mathrm{median}(\xi))/\mathrm{MAD}(\xi)$, where $\xi = \{\xi_1, \ldots, \xi_n\}$ represents the whole sample and MAD stands for the median absolute deviation. Next, let $L_p$ be the labeling for the $p$th clustering in the above-defined hierarchical tree, i.e. $L_p$ maps each datapoint $x$ to its corresponding cluster label $L_p(x)$, and let $N_p$ and $C_p$ be the corresponding NN-norm and C-norm, respectively. The (robust) Z-score for the $p$th clustering is then defined to be $Z_p = Z(N_p) + Z(C_p)$, and among the possible clusterings listed in the tree we pick the one which maximizes this robust Z-score. We refer to the segmented colour images in this paper for an application of this technique to colour segmentation.
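To make the selection step concrete, the sketch below computes the k-nearest-neighbour isolation norm, a simplified connectivity index, and the robust Z-score combination over a family of candidate labellings. It is an illustration rather than the authors' algorithm: in particular, the connectivity norm here evaluates a kernel density at the raw midpoint of random same-cluster pairs instead of running the constrained gradient ascent described in the text, and the kernel width, k and number of sampled pairs are assumed values.

```python
import numpy as np

def nn_norm(X, labels, k=10):
    """Average fraction of each point's k nearest neighbours sharing its label."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nbrs = np.argsort(D, axis=1)[:, :k]
    return float((labels[nbrs] == labels[:, None]).mean())

def c_norm(X, labels, sigma=0.5, n_pairs=200, seed=0):
    """Simplified connectivity index: mean kernel density at midpoints of
    random same-cluster pairs (a stand-in for the constrained gradient ascent)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    cluster_ids = np.unique(labels)
    dens = []
    for _ in range(n_pairs):
        c = cluster_ids[rng.integers(len(cluster_ids))]
        idx = np.flatnonzero(labels == c)
        if len(idx) < 2:
            continue
        a, b = rng.choice(idx, size=2, replace=False)
        mid = 0.5 * (X[a] + X[b])
        d2 = np.sum((X - mid) ** 2, axis=1)
        dens.append(np.mean(np.exp(-0.5 * d2 / sigma ** 2)))
    return float(np.mean(dens))

def robust_z(values):
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med)) + 1e-12
    return (values - med) / mad

def best_clustering(X, labellings, k=10):
    """Pick the labelling in the hierarchy maximizing Z(NN-norm) + Z(C-norm)."""
    nn = [nn_norm(X, lab, k) for lab in labellings]
    cc = [c_norm(X, lab) for lab in labellings]
    scores = robust_z(nn) + robust_z(cc)
    return labellings[int(np.argmax(scores))], scores
```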
3 From Segmentation to Interpretation
Once clustering has been used to extract perceptually salient regions, recognition is the next logical step. It is often possible to use the average feature-values (e.g. average colour) over the segmented region to get perceptually relevant information. However, in many cases the shape of the region is also highly informative. In order to test our ideas we looked at a database of images of barnyard animals (100 images). Due to the complexity inherent to these natural images, one cannot expect the segmentation result to be perfect: variations in colour and texture, or occlusion and the like, mean that in most cases only pieces of the contours delineating the regions will have an easily recognizable shape. For this reason, we divide the contour up into meaningful parts, along the lines initiated in [1], and extended in [7], and more recently [4]. Unfortunately, most of the work in the cited papers deals with idealized artificial shapes for which the complications are less severe. To be able to use this part-based approach for the recognition of segmented regions in real images, we combined and extended various elements in the aforementioned references to develop a CBIR-based recognition system that is beginning to be able to recognise salient parts in real images. More precisely, after using clustering to segment an image into a small number of regions, we extract the central region of interest and construct its contour. Straightforward postprocessing ensures that the result is a single, topologically simple contour, i.e. one component without self-intersection. Next we identify salient visual parts (i.e. so-called limbs) by systematically working through the following steps: 1. Curve-evolution: First of all, we create a scale-space of curves by applying the discrete curve-evolution expounded in [4]. This evolution systematically simplifies the shape of the curve by a gradual and principled averaging of curvature-variations until a simple convex shape emerges. By keeping track
of the “survival-time” of each of the points in the polygonal approximation a natural hierarchy of saliency is created. 2. Limb-extraction: This hierarchy established among the points on the extracted contour can be used to identify salient visual parts (limbs). In particular, we proceed in two waves: First, we look at successive negative curvature points along the contour that flank convex arcs of the (simplified) contour. (Convex arcs in the simplified contour correspond to arcs that are “essentially” convex in the original contour). Connecting these successive negative curvature points creates a list of limb-candidates from which the final limbs are chosen based on a measure of continuation (cfr. [7]). The idea is that the line-segment that separates the limb from the body should be relatively short and fit in well (curvature-wise) with the curve-segments that flank the putative limb. Secondly, once we have removed limbs sandwiched between successive negative curvature points, we extend the work in the above-mentioned papers and look for so-called tail-like limbs. These are visual parts that are defined by only one negative curvature point, but enjoy an excellent continuation. An example is the elephant’s trunk in Fig. 2. 3. Data-encoding and search: Once this procedure is completed, we can construct a tree that represents the contour by subdividing it into limbs and sublimbs. In most cases, at least one of these limbs is highly salient and characteristic. For instance, in a collection of images of barnyard animals we found that, occlusion and bending notwithstanding, a horse’s head and neck are highly recognisable. To capitalize on this observed saliency, we compute a small set of geometric indices for these limbs. More specifically, we determine their relative size (with respect to the trunk), number of sizable dents, the elongation (ratio of long to short axis) and the bending-angle. 4. Interpretation: Contrary to most interpretation-systems, we do not try to develop a rule-based decision system that extracts from the segmented regions a rule-set for identification. Rather, we start from the assumption that part of the images in the database are already annotated (prior knowledge). Confronted with a new image the system will first use clustering to segment it (e.g. on the basis of colour), whereupon the contour of the region(s) of interest are processed and encoded as detailed above. By retrieving from the annotated part of the database those visual parts that have a similar shape, in conjunction with their annotation, it becomes possible to formulate informed hypotheses about the content of the new image: “if this region looks like other regions, most of which are horses, then this part of the image is probably a horse.” Notice how it is possible to correlate relatively small regions in different images, even if the rest of the images are different. This is impossible if one only considers global similarity. Experiments To test the viability of this approach we took a set of natural images of barnyard animals and segmented them using our cluster-algorithm. The contour of the central region was decomposed in visual parts (“limbs”) as described above. Some examples of the input and results of this procedure can be found in Fig. 3.
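The geometric indices of item 3 and the annotation propagation of item 4 are simple to compute once a limb has been cut off by its separating segment. The sketch below computes a relative size, an elongation (via the principal axes of the limb's points) and a bending angle for a limb given as a polygon, and propagates a label from the closest annotated limb. The paper does not spell out the exact formulas for these indices or the distance used in Table 1, so the definitions here are plausible stand-ins for illustration only.

```python
import numpy as np

def polygon_area(poly):
    """Shoelace area of a closed polygon given as an (n, 2) array."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def limb_indices(limb, body):
    """Geometric indices for a limb polygon relative to the body polygon.

    relative size, elongation and bending angle loosely mirror the indices
    listed in the text; their exact formulas here are illustrative.
    """
    rel_size = polygon_area(limb) / max(polygon_area(body), 1e-9)

    # Elongation: ratio of the principal-axis standard deviations.
    centred = limb - limb.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centred.T))
    elongation = np.sqrt(max(eigvals) / max(min(eigvals), 1e-9))

    # Bending angle: angle at the limb's midpoint between its two endpoints.
    p0, pm, p1 = limb[0], limb[len(limb) // 2], limb[-1]
    v0, v1 = p0 - pm, p1 - pm
    cosang = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1) + 1e-9)
    bending = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

    return np.array([rel_size, elongation, bending])

def propagate_annotation(query_indices, annotated):
    """annotated: list of (index_vector, label) pairs from already-annotated
    limbs; return the label of the closest one, as in item 4."""
    dists = [np.linalg.norm(query_indices - np.asarray(v)) for v, _ in annotated]
    return annotated[int(np.argmin(dists))][1]
```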
Fig. 1. Application of the non-parametric clustering algorithm to segmentation of natural colour-images. Left: input image, Right: segmented image in mean and/or false colours. Recall that the number of clusters is automatically determined by the algorithm.
Fig. 2. Extracted limbs for three cartoon-figures and one contour obtained by the clustered-based segmentation of a mare and her foal. The complexity of the latter contour, caused by the foal partially occluding the mare, eloquently makes the case for part-based identification.
Fig. 3. Input images and extracted contours of the central object found by cluster-based segmentation. Limbs are identified using the algorithm described in the text. Letters refer to the table of mutual distances between the limbs.
From this figure it is obvious that although the shapes of most limbs are fairly variable and irregular, some of them are nevertheless very typical and therefore highly recognisable (e.g. horses' heads). This fact is illustrated in the table below, where the similarities between a number of limbs (based on the geometric indices specified in item 3) are shown. Notice how all limbs that represent horses' heads cluster together. If limb A were obtained from a new input image, while the other limbs (B through M) were already annotated, this table would suggest that the new image probably shows a horse.
      A   B   C   D   E   F   G   H   I   J   K   L   M
A     0   4   3   5   2   2   1  27  27  24  24  22  43
B     4   0   6   0   2   3   5  36  36  36  28  22  50
C     3   6   0   4   2   1   4  15  16  14  14  15  35
D     5   0   4   0   1   2   6  35  33  32  32  29  58
E     2   2   2   1   0   1   1  22  21  21  18  17  43
F     2   3   1   2   1   0   4  23  23  22  19  17  40
G     1   5   4   6   1   4   0  26  25  21  28  29  52
H    27  36  15  35  22  23  26   0   1   1   6  16  28
I    27  36  16  33  21  23  25   1   0   0  10  22  38
J    24  36  14  32  21  22  21   1   0   0  12  24  39
K    24  28  14  32  18  19  28   6  10  12   0   3   9
L    22  22  15  29  17  17  29  16  22  24   3   0   6
M    43  50  35  58  43  40  52  28  38  39   9   6   0
Table 1. Table of mutual distances between labeled “limbs” (see Figs. 2-3). Notice how limbs that represent horses' heads and necks (A-G) cluster together. A similar pattern is observed for cows' heads (H-J), fore- and hindlimbs (K and L), and a tail (M). If part G of the cartoon horse carries the annotation “horse's head”, this distance table can be used to propagate annotations to the test images.
References
1. D.D. Hoffman and W.A. Richards: Parts of recognition. Cognition, Vol. 18, pp. 65-96, 1985. 772
2. A.K. Jain and R.C. Dubes: Algorithms for Clustering Data. Prentice Hall, 1988. 770
3. Leonard Kaufman and Peter J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis. J. Wiley and Sons, 1990. 770
4. L.J. Latecki and R. Lakämper: Convexity Rule for Shape Decomposition Based on Discrete Contour Evolution. To appear in Int. J. of Computer Vision and Image Understanding. 772
5. E.J. Pauwels and G. Frederix: Non-parametric Clustering for Segmentation and Grouping. Proc. VLBV'98, Beckman Institute, Urbana-Champaign, Oct. 1998, pp. 133-136. 771
6. J. Shi and J. Malik: Normalized Cuts and Image Segmentation. Proc. IEEE Conf. on Comp. Vision and Pattern Recognition, San Juan, Puerto Rico, June 1997.
7. K. Siddiqi and B. Kimia: Parts of Visual Form: Computational Aspects. IEEE Trans. PAMI, Vol. 17, No. 3, March 1995. 772, 773
A Pre-filter Enabling Fast Frontal Face Detection Stephen C. Y. Chan and Paul H. Lewis Multimedia Research Group University of Southampton, Zepler Building, Highfield, Southampton S017 1BJ. {scyc96r,phl}@ecs.soton.ac.uk
Abstract. We present a novel pre-filtering technique that identifies probable frontal illuminated face regions in colour images regardless of translation, orientation, and scale. The face candidate regions are normalised and provide the basis for face verification using published face detection algorithms. The technique focuses on a fast search strategy to locate potential eye-pairs in an image or video frame. The eye-pair candidates indicate areas that may contain faces. Scale and orientation are inferred from the eye-pairs, and a neural network is used to confirm the normalised face candidates.
1 Introduction
Detecting the presence of human faces can provide important cues for many image and video analysis tasks [1]. We are interested in enhancing multimedia tasks, in particular content-based video browsing, retrieval and navigation, but face detection and location is also used as a pre-requisite for face analysis tasks such as recognition and expression interpretation. The problem is non-trivial because faces encoded in visual data can appear in any pose, position, orientation, and scale. The task is further compounded by problems associated with illumination variation and noise. Some of the most robust available techniques for face detection are computationally intensive, applying their elaborate detection algorithms at many scales and orientations in all possible positions in each image. The aim of this paper is to present a pre-filtering technique which can identify, relatively quickly, regions in an image or video frame likely to contain human faces. Face candidates are detected regardless of position, orientation, and scale, but initially we have assumed full frontal illuminated faces. The paper is presented in the following manner: Section 2 describes some related work contributing to face detection; Section 3 presents an overview of the pre-filtering technique; Sub-sections 3.1 to 3.4 present the technique in detail; Section 4 reports results of some experimental work; and Section 5 gives the conclusions.
Stephen Chan would like to acknowledge support from the EPSRC.
2 Related Work
Detecting frontal profile faces has been investigated using a variety of different approaches. Recently, the detection of faces with the head in varying poses has been reported by Yow et al. [12]. Their approach detects features using spatial filters, and forms face candidates using geometric and grey-level constraints. A probabilistic framework then evaluates the face candidates for true faces. Chow et al. [2] detect facial features to isolate faces in a constrained manner. Chen et al. [1] use colour characteristics to detect faces in images against complex backgrounds. A neural network trained to recognise skin coloured pixels is used to isolate areas of skin, and eventually forms candidate face regions. Face regions are processed for lips to verify the existence of faces. Techniques that use motion to isolate areas of an image, together with an analysis of the colours that make up facial features, have also been reported by Choong et al. [5]. Dai et al. [3] use colour to hypothesise the location of faces, where faces are evaluated as a texture that is based on a set of inequalities derived from a Space Grey Level Dependency (SGLD) matrix, described in Haralick et al. [4]. A computationally expensive, but arguably the most robust, approach to face detection is proposed by Rowley et al. [10]. A small input window is passed over every part of an image, and a neural network filter is used to establish whether or not a face is present. Scale invariance is achieved by sub-sampling each image at different resolutions, and searching each of the sub-images. A rotation-invariant version of this, also by Rowley et al. [8], is achieved by estimating the angle of the sub-image within the input window. The sub-image is then de-rotated and presented to the neural network for classification. There is a growing amount of literature concerned with verifying the existence of faces at a given location. However, the fast and automatic location of face candidate regions, as a pre-filtering operation, is important if rapid and reliable face detection is to be achieved in video applications.
3 Overview of the Technique
This paper proposes a pre-filtering technique which rapidly identifies locations in video frames where preliminary evidence suggests a face may be sited. A more elaborate and established technique is then used to confirm or deny the existence of faces at these locations. The pre-filtering technique is based on the fact that, for frontal illuminated faces, the eyes are usually a prominent feature of the face [6,7,11]. They have a spatial distribution that is roughly related to other facial features such as the nose and mouth. The distance between a pair of eyes gives an indication of the size of the face, and the positions of the eyes can be used to estimate the orientation. Using this premise the technique generates regions that are most likely to contain faces. These regions are then verified, in turn, to test whether a face
actually exists. Generating these regions relies on detecting possible pairs of eyes (eye-pairs) that may or may not belong to a face. The eye-pairs inherently provide information about the location, orientation and scale of potential faces. Square regions around the eye-pairs are used to establish the area that may contain the rest of the face. These areas are then normalised so that they represent possible upright faces. A suitable face verification technique can then be used to verify the captured areas to confirm the existence of faces. The current system uses a neural network for the final face verification stage and is based on the approach of Rowley et al. [10]. Figure 1 illustrates the individual stages of the pre-filtering process and face verification.
Fig. 1. The stages in isolating regions that may contain faces: video frame or image → region detection → eye-pair generation → face area extraction → face verification.
3.1 Region detection
The initial stage receives an image and segments it into regions. Each of these regions is evaluated in turn to see whether it satisfies certain criteria pertaining to eyes. The visual input is segmented by remapping pixel values to a voxelised RGB colour-space; mapping the colours of an image to their representative voxels produces homogeneous regions. The segmentation process has complexity O(n), where n is the number of image pixels, and it is ideal for applications where speed is a concern. It can be efficiently implemented by reducing the colour bits in the red, green and blue colour channels of each pixel using bit masks; our system uses only the first significant bit. Dark regions are extracted from the segmented image by visiting every dark pixel (seed pixel) and flood filling surrounding areas that have the same voxel number as the current seed pixel. During the flood filling process, the number of flooded pixels is counted and the extreme co-ordinates of the fill are preserved. To reduce the computational complexity of the next stage, each region is evaluated with a set of heuristics that determine whether it could be a potential eye region. The heuristics are as follows; it should be noted that the parameter values are not critical, as they are only used to eliminate candidate regions whose properties are sufficiently different from those of an eye region. Definitions: w and h are the width and height of the segmented image in pixel units; Rn, where n is the region number, denotes a region in the set of regions R; Rn.width is its width in pixels; Rn.height is its height in pixels; Rn.aspect is its aspect ratio, defined as Rn.width / Rn.height; Rn.numberofpixels is the number of pixels the region occupies; Rn.homogeneity is a measure of homogeneity, defined as Rn.numberofpixels / (Rn.width * Rn.height).
1. Elimination of regions that are too small or too large: 1 < Rn.width < 0.5w and 1 < Rn.height < 0.5h.
2. Regions associated with eyes have a limited range of aspect ratios: 1/7 < Rn.aspect < 7.0.
3. This criterion determines how much the region covers its minimum enclosing rectangle: Rn.homogeneity > 0.5.
We found that smoothing the image reduced noise and produced better results. A 3x3 smoothing mask was convolved with the input before the region detection process was initiated. Segmentation of the filtered input produced smoother regions, and a reduction in false positive eye regions was recorded. Figure 2 illustrates an image passing through the stages described in this section.
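As a rough illustration of this stage (not the authors' implementation), the sketch below smooths the image, quantises each channel to its most significant bit, extracts dark connected regions, and keeps only those passing the three heuristics. Treating a region as "dark" when all three retained bits are zero, and using connected-component labelling in place of an explicit flood fill, are our assumptions; the bit-mask quantisation, the 3x3 smoothing and the thresholds follow the text.

```python
import numpy as np
from scipy import ndimage

def candidate_eye_regions(rgb):
    """Return bounding boxes (row, col, height, width) of dark regions that
    pass the three eye heuristics.

    rgb: (H, W, 3) uint8 image. Keeping only the most significant bit of each
    channel implements the voxelised colour quantisation described above.
    """
    h, w = rgb.shape[:2]
    smoothed = ndimage.uniform_filter(rgb.astype(float), size=(3, 3, 1))
    quantised = smoothed.astype(np.uint8) & 0x80      # top bit per channel
    dark = np.all(quantised == 0, axis=2)             # candidate eye pixels

    # Flood fill is equivalent here to connected-component labelling.
    labels, _ = ndimage.label(dark)
    boxes = []
    for lab, sl in enumerate(ndimage.find_objects(labels), start=1):
        rh, rw = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
        npix = int((labels[sl] == lab).sum())
        aspect = rw / max(rh, 1)
        homogeneity = npix / float(rw * rh)
        if not (1 < rw < 0.5 * w and 1 < rh < 0.5 * h):   # heuristic 1
            continue
        if not (1.0 / 7.0 < aspect < 7.0):                # heuristic 2
            continue
        if homogeneity <= 0.5:                            # heuristic 3
            continue
        boxes.append((sl[0].start, sl[1].start, rh, rw))
    return boxes
```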
Fig. 2. Images displaying the results of each sub-stage during region detection (input image, smoothing, segmentation, region generation, region filtering). The filtered regions are indicated by the rectangles in the region-filtering image.
3.2 Eye-Pair Generation
The eye-pair generation process attempts to pair together regions that may potentially belong to a face. Given that there are n regions after the region detection stage, the number of possible eye-pairs is (n² − n)/2. It is desirable to reduce the number of eye-pairs by comparing regions with other regions using a set of eye-pair heuristics. Again, parameters are not critical and were obtained from observations of a wide variety of images containing faces. The algorithm is as follows:
Definitions: distance_x(Rj, Rk) is the horizontal distance between the centres of regions Rj and Rk; distance_y(Rj, Rk) is the vertical distance between the centres of regions Rj and Rk.
For all possible eye-pairs (Rj, Rk):
  if distance_x(Rj, Rk) > distance_y(Rj, Rk) then
    relative_width = Rj.width / Rk.width
    region_aspect1 = Rj.width / Rj.height
    region_aspect2 = Rk.width / Rk.height
    sum_of_widths = Rj.width + Rk.width
  else
    relative_width = Rj.height / Rk.height
    region_aspect1 = Rj.height / Rj.width
    region_aspect2 = Rk.height / Rk.width
    sum_of_widths = Rj.height + Rk.height
  endif
  if 0.2 < relative_width < 5.0
     and k1 * sum_of_widths < region_distance < k2 * sum_of_widths
     and 0.8 < region_aspect1 < 7.0 and 0.8 < region_aspect2 < 7.0 then
    Store eye-pair (Rj, Rk)
The condition distance_x(Rj, Rk) > distance_y(Rj, Rk) determines whether the eye-pair (Rj, Rk) is more horizontal or vertical. The reason for having such a condition is that the aspect ratios can then be calculated roughly relative to the vertical position of a face, where the width of a region relates to the width of an eye region of an upright face; an input image with a face on its side will have eye regions whose width is the actual height of the eyes in the image. The term relative_width ensures that no two regions have greatly exaggerated size differences, since regions belonging to the same face should not vary by orders of magnitude; illumination will affect the size of eye regions in the segmentation process and thus a range is considered. The terms region_aspect1 and region_aspect2 ensure that the eye regions are approximately in line with each other; this eliminates eye-pairs with one eye region in a horizontal position and one in a vertical position. The condition k1 * sum_of_widths < region_distance < k2 * sum_of_widths, where k1 < k2, ensures that the distance between an eye-pair is not exaggerated relative to the size of the eye regions; here sum_of_widths, taken relative to the upright face position, gives a measure of the size of the eye regions.
3.3 Face Area Extraction
The resulting eye-pairs possess information that allows rotation and scale invariance of faces. This stage takes each eye-pair and extracts a square region which covers the main facial features (eyes, nose, mouth). Figure 3a presents the definition of the square region. Two square regions must be extracted to achieve full rotation invariance: the eye-pair forms an imaginary line between the two squares, and both areas on either side of it must be taken into account. Figures 3(b)-(g) show a face image and all the captured areas on both sides of the generated eye-pairs. The captured face candidate areas are rotationally normalised. Our implementation captures face candidates which are rotationally normalised on the fly; this is achieved by scanning pixels parallel to the eye-pair and remapping them to an orthogonal grid with the same pixel dimensions as the pre-determined square capture area.
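The "scan parallel to the eye-pair" remapping is essentially an inverse affine warp: each pixel of the output square is mapped back into the source image along the eye-pair axis and its perpendicular. The sketch below illustrates this; the proportions used for the capture square (side 2d, centred between the eyes, starting 0.5d above the eye line, with d the eye separation) are our reading of the dimension labels in Figure 3a rather than values stated in the text, and nearest-neighbour sampling is used for brevity.

```python
import numpy as np

def extract_face_candidate(image, eye_left, eye_right, out_size=64):
    """Capture a rotation-normalised square face candidate from an eye-pair.

    image: 2-D grayscale array; eye_left / eye_right: (x, y) region centres.
    The square side of 2d and the 0.5d offsets are assumptions based on the
    capture-area figure.
    """
    el, er = np.asarray(eye_left, float), np.asarray(eye_right, float)
    d = np.linalg.norm(er - el)
    u = (er - el) / d                        # unit vector along the eye-pair
    v = np.array([-u[1], u[0]])              # perpendicular (towards the mouth)
    origin = el - 0.5 * d * u - 0.5 * d * v  # top-left corner of the square

    out = np.zeros((out_size, out_size), dtype=image.dtype)
    step = 2.0 * d / out_size
    h, w = image.shape
    for r in range(out_size):
        for c in range(out_size):
            p = origin + (c * step) * u + (r * step) * v
            x, y = int(round(p[0])), int(round(p[1]))
            if 0 <= x < w and 0 <= y < h:
                out[r, c] = image[y, x]
    return out

# The second candidate, on the other side of the eye line, can be captured
# by negating the perpendicular vector v.
```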
3.4 Face Verification
The face candidate images captured in the previous stage now present us with a pattern classification problem for upright frontal faces. We use a neural network based on the work by Rowley et al. [10] to classify each face candidate subimage. They use a 20 x 20 pixel window moved over the entire image and perform filtering functions to enhance the image viewed by the window, before it is passed to a neural classifier.
Fig. 3. An image with captured face candidates based on eye-pairs: (a) the capture area definition (dimensions given as multiples of the eye-pair distance d), (b) the overlayed capture masks, and (c)-(g) the captured candidates; each of the columns (c)-(g) shows the two images captured for one eye-pair.
Rowley et al. pre-process the input image by correcting the lighting and then performing histogram equalisation to improve the contrast. Our system only needs to perform histogram equalisation on the face candidate images, since we have initially assumed frontal illuminated faces. Video frames and scanned images were used to generate training patterns. Training the network used visual data generated from the pre-filtering process, where over 450 representative faces were manually selected. False positives generated by the neural network were augmented to the non-faces training set, and the network was retrained. Face candidates are resized to 20 x 20 pixels, greyscaled, and histogram equalised before being mapped to the trained neural network; the output is thresholded to give a binary decision, face or non-face.
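The verification step therefore reduces to a few operations per candidate. A schematic sketch follows; the network is represented by a placeholder object with a predict method standing in for the trained classifier, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from scipy import ndimage

def verify_face(candidate, net, threshold=0.5):
    """Binary face / non-face decision for a captured candidate region.

    candidate: 2-D grayscale array (any size); net: any object exposing
    net.predict(x) -> float in [0, 1] for a flattened 20x20 input, a stand-in
    for the trained neural classifier. The threshold value is illustrative.
    """
    zoom = (20.0 / candidate.shape[0], 20.0 / candidate.shape[1])
    small = ndimage.zoom(candidate.astype(float), zoom, order=1)

    # Histogram equalisation to improve contrast, as described in the text.
    hist, bins = np.histogram(small.ravel(), bins=256)
    cdf = hist.cumsum() / hist.sum()
    equalised = np.interp(small.ravel(), bins[:-1], cdf)

    score = float(net.predict(equalised))     # network output in [0, 1]
    return score > threshold
```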
4 Experimental Results
Our system uses 24-bit colour images or video frames and is being developed on a Pentium 133 MHz machine running Linux. Each frame is mapped to a 300 x 300 pixel frame buffer before any processing takes place. Figure 4 shows various frontal illuminated face views, where located faces are signified with a box that also indicates the orientation. When a general database containing over 400 faces was used, including many very small faces, the pre-filtering algorithm detected eye-pairs for 53% of the total faces. In order to test the algorithm more fairly, a subset of the original database was established with 103 images containing at least one full frontal face. No face was less than 60 pixels across, but apart from this lower limit on size, faces could appear at any scale or orientation.
Fig. 4. Representative examples of faces found by the system. Each image shows 4 numbers: the number of faces in the image, the number of detected faces, the number of false positives, and the number of eye-pairs.
The total number of faces contained in the images was 128. After running the pre-filtering algorithm, eye-pairs were found for 75% of the faces and of these 65% were correctly confirmed as faces by the neural net verifier. This result was obtained with our initial implementation of the verifier and it is expected that the proportion correctly verified will improve with more careful training. On average about 70 eye-pairs were found per image, which is an important statistic since it is this number which determines the number of applications of the computationally intensive neural net verification algorithm in our approach. The benefits of the approach are clear when it is recalled that, in Rowley et al.’s original approach [9], neural nets are applied 193737 times per image with a processing time of 590 seconds on a Sparc 20 although they describe modifications which give a small degradation and a processing time of 24 seconds. Currently our
approach is averaging about one image per second on a Pentium 133. Rowley reports detection rates of between 78.9% and 90.5% and although these are higher than our current rates, we believe that the speed improvement in our approach shows substantial promise when working towards real-time applications.
5 Conclusions
We have developed a pre-filtering technique for face detection which provides an order of magnitude improvement in processing time on the method described by Rowley et al. Our pre-filtering technique can currently detect 75% of eye-pairs belonging to faces in a test database containing full frontal faces of reasonable size. We believe that, although parameters in the algorithm are not critical, it will be possible to extend the cases considered in order to improve the robustness of the technique.
References 1. C. Chen and S. P. Chiang. Detection of human faces in colour images. In Vision Image Signal Processing, volume 144 of 6, pages 384–388. IEE, 1997. 777 2. Gloria Chow and Xiaobo Li. Towards a system for automatic facial feature detection. Pattern Recognition, 26:1739–1755, 1993. 3. Ying Dai and Yasuaki Nakano. Face-texture model based on sgld and its application in face detection in a color scene. Pattern Recognition Society, 29(6):1007– 1017, 1996. 4. Robert M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 3:610–621, 1973. 5. Choong Hwan Lee, Jun Sung Kim, and Hyu Ho Park. Automatic human face location in a complex background using motion and color information. Pattern Recognition, 29(11):1877–1889, 1996. 6. David E. Benn Mark S. Nixon and John N. Carter. Robust eye centre extraction using the hough transform. In 1st International Conference on Audio-and VideoBased Biometric Person Authentication, Lecture Notes in Computer Science, pages 3–9, 1997. 778 7. Daniel Reisfeld and Yehezkel Yeshurun. Preprocessing of face images: Detection of features and pose normalization. Computer Vision and Image Understanding, 71(3):413–430, September 1998. 778 8. Henry A. Rowley, Shumeet Baluja, , and Takeo Kanade. Rotation invariant neural network-based face detection. Technical report, CMU CS Technical Report CMUCS-97-201, 1997. 9. Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Human face detection in visual scenes. Technical report, CMU-CS-95-158R, Carnegie Mellon University, http://www.cs.cmu.edu/ har/faces.html, November 1995. 783 10. Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection. In Transactions On Pattern Analysis And Machine Intelligence, volume 20 of 1, pages 23–38. IEEE, January 1998.
11. Li-Qun Xu, Dave Machin, and Phil Sheppard. A novel approach to real-time nonintrusive gaze finding. In BMV, Southampton, volume 2, pages 428–437, 1998. 778 12. Kin Choong Yow and Roberto Cipolla. Feature-based human face detection. Image And Vision Computing, 15(9):713–735, 1997.
A Technique for Generating Graphical Abstractions of Program Data Structures Camil Demetrescu1 and Irene Finocchi2 1
Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza”, Via Salaria 113, 00198 Roma, Italy Tel. +39-6-4991-8442 [email protected] 2 Dipartimento di Scienze dell’Informazione Università di Roma “La Sapienza”, Via Salaria 113, 00198 Roma, Italy Tel. +39-6-4991-8308 [email protected]
Abstract. Representing abstract data structures in a real programming language is a key step of algorithm implementation and often requires programmers to introduce language-dependent details irrelevant for both a high-level analysis of the code and algorithm comprehension. In this paper we present a logic-based technique for recovering from the loss of abstraction related to the implementation process in order to create intuitive high-level pictorial representations of data structures, useful for program debugging, research and educational purposes.
1 Introduction
In the last few years there has been growing interest in taking advantage of the visual capabilities of modern computing systems for representing, through images, information from several application domains. Indeed, considerable effort has been devoted to exploring the effectiveness of pictorial representations of code and data structures in the fields of software visualization and algorithm animation (see [6]). In particular, since data structures have a natural graphical interpretation, the use of computer-generated images is extremely attractive for displaying their features, the information they contain and their temporal evolution. This seems very useful both for the debugging of programs and for research and educational purposes. One of the earliest experiments in this area led to the development of the system Incense (see [5]), which is able to automatically generate natural graphical displays of data structures represented in a Pascal-like language by directly accessing the compiler's symbol table and choosing a layout for variables according to their types. The visualization of abstract data structures (e.g. digraphs and queues), as opposed to concrete ones (namely, those found in program source code), is the
This author was partially supported by EU ESPRIT Long Term Research Project ALCOM-IT under contract no. 20244.
basic idea behind the system UWPI (see [4]), which analyzes the operations performed by a Pascal program on its concrete data structures and suggests plausible abstractions for them, chosen from a fixed set. High-level debugging of programs could take great advantage of visualization capabilities, yet most modern conventional debuggers are basically text-oriented and rely on direct built-in displays of program variables. For example, the Metrowerks CodeWarrior debugger provides several low-level representations of numeric variables (decimal, hexadecimal, etc.) and allows programmers to interact with disclosure triangles to examine structures' fields and to recursively follow pointed objects. Two fundamental criteria for evaluating systems for visualizing data structures are the level of abstraction of the pictorial representations they produce and their degree of automation. In [6] three levels of abstraction are considered:
– direct representations, typical of debuggers, are obtained by mapping information explicitly stored in the program's data structures directly onto a picture;
– structural representations are achieved by hiding and encapsulating irrelevant details of concrete data structures;
– synthesized representations emphasize aspects of data structures not explicitly coded in the program, but deduced from it.
Unfortunately, the abstraction and automation requirements appear to conflict: systems that automatically produce visualizations usually gather shallow information from the program's source code and are not able to recover the original meaning of data structures, perhaps lost during the algorithm's implementation. Hence, programming the visual interpretation of data structures through additional code seems necessary in order to obtain customized structural and synthesized representations, but this requires additional effort from the programmer. In this paper we address the visualization of data structures through a programmable logic-based interpretation of their meaning. Due to lack of space, we focus our attention on synthesized representations, which seem the most difficult to realize. The method we propose has been used in the development of the algorithm animation system Leonardo, detailed in [2]. The paper is organized as follows. After describing the logic-based visualization framework that is the backbone of our approach (section 2), in sections 3 and 4 we introduce the concept of abstraction recovery and present techniques for visualizing both indexed and linked representations of graphs and trees, easily extendable to other kinds of data structures. We conclude with some remarks about the advantages and disadvantages of our approach.
2
The Visualization Framework
In figure 1 we propose a logic-based architecture for visualizing information extracted from concrete data structures. The diagram highlights two main phases of the visualization process. The first step consists of augmenting an underlying program with declarations about the abstract interpretation of its
data structures. The second step covers: 1) the execution of the underlying program; 2) the generation of high-level data structures from the concrete ones according to the user's declarations; 3) their visualization by means of rendering libraries, which specify the objects' default retinal features and their layout.

Fig. 1. Logic-based architecture for visualizing data structures (underlying and augmented program, concrete data structures, predicates interpreter, visualizer and rendering libraries, spanning the underlying program computation, the abstraction recovery computation and the image rendering computation)

In the sequel we will assume that we are dealing with underlying C programs and with declarations specified as predicates in a logic-based language called Alpha (see [3] for details). An Alpha predicate is a boolean function with "by value" or "by name" arguments, computed according to a Prolog-like backtracking mechanism that allows it to return different values in its "by name" parameters on repeated calls. From the point of view of a user interested in programming a visualization, an augmented C program can be created by embedding in the text of a C program the definitions of Alpha standard predicates, which have a fixed predefined signature that allows them to be recognized and computed on the visualizer's demand. Standard predicates are classified into constructors and descriptors. The former concern the declaration of abstract objects (graphs, lists, queues, etc.) and of their sub-objects (vertices, edges, items, etc.). The latter are optional and declare the objects' retinal features, such as the color of vertices. Moreover, predicate definitions may refer to variables of the underlying program, making their output dependent on information stored in concrete data structures. From the point of view of the visualization system, the effective generation of a picture starts by computing the standard predicates defined by the user and by collecting their output values into high-level data structures (abstraction recovery computation). These are then accessed directly by the visualizer, which maps them onto a graphical representation (rendering computation). The visualization process is triggered by update requests to the visualizer, generated either on the user's demand or automatically. In the second case, requests may be issued either at regular intervals of time or as a consequence of dynamic modifications to concrete data structures performed by the underlying program. The latter option, supported by the tool described in [2], requires a complex software technology that is often hard to implement, but offers a powerful mechanism for high-level visual debugging of programs: indeed, if the consistency between im-
ages and program execution is automatically maintained, wrong actions of the program can be easily detected.
3
Abstraction Recovery
Identifying suitable data structures and representing them in a chosen programming language are two key steps in the design and implementation of algorithms. Unfortunately, the concrete representation of abstract data structures often requires programmers to introduce language-dependent details that are irrelevant for a high-level analysis of the code, and it causes an undesirable loss of abstraction: information about the meaning of concrete data structures and their usage does not usually appear in the code, but remains part of the programmer's know-how. Nevertheless, our interest in visualization is focused on the ability to convey essential information and to recover from this loss of abstraction. As an example, let us consider a directed graph G(V, A) concretely represented in C by means of its adjacency matrix (see [1]):

struct AdjMatrix {
    int  n;
    char m[100][100];
} g;
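For concreteness, the underlying program might populate g in the usual way; the following fragment is our own illustration (the sample graph and the function name are hypothetical, not taken from the paper):

#include <string.h>

/* Assumes the declaration of struct AdjMatrix ... g shown above.      */
/* Builds a small directed cycle 0 -> 1 -> 2 -> 3 -> 0 as sample data. */
void build_example_graph(void)
{
    memset(&g, 0, sizeof g);   /* no arcs initially       */
    g.n = 4;                   /* nodes V = {0, 1, 2, 3}  */
    g.m[0][1] = 1;
    g.m[1][2] = 1;
    g.m[2][3] = 1;
    g.m[3][0] = 1;
}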
According to a usual convention, the variable g may be interpreted as an instance of a directed graph, with V = {0, . . . , g.n − 1} ⊆ {0, . . . , 99} and A = {(x, y) ∈ V² : g.m[x][y] ≠ 0}. The following Alpha declarations translate this piece of information into a computer-usable form:

Graph(Out 1);
Directed(1);
Node(Out N,1) For N: InRange(N,0,g.n-1);
Arc(X,Y,1) If g.m[X][Y]!=0;
They declare that there is a graph with label 1, that it is directed, that its nodes are identified by the numbers in the range [0, . . . , g.n − 1], and that there is an arc (x, y) if and only if g.m[x][y] ≠ 0. Observe that InRange is a predefined Alpha predicate able to enumerate all integer values in a given range. Moreover, the predicates Node and Arc refer to the variable g of the underlying program. In our framework, standard predicates are computed by an interpreter in response to a sequence of requests issued by the visualizer according to a precise query algorithm. In figure 2 we give a possible fragment of a query algorithm that invokes the predicates Graph, Directed, Node and Arc. Note that the predicates Graph and Node are enumerative, i.e., they may return different values on subsequent calls thanks to the backtracking-based computation mechanism provided by the Alpha language. This is an extremely powerful feature for compactly specifying sets of values.
G ← ∅
while (Graph(g)=true) do begin
  G ← G ∪ {g}
  if (Directed(g)=true) then dg ← true else dg ← false
  Vg ← ∅
  while (Node(n,g)=true) do Vg ← Vg ∪ {n}
  Ag ← ∅
  for all (x, y) ∈ Vg × Vg
    if (Arc(x,y,g)=true) then Ag ← Ag ∪ {(x, y)}
end
Fig. 2. Query algorithm that invokes the predicates Graph, Directed, Node and Arc

The visualizer uses the previous query algorithm fragment to build the high-level data structures G, dg, Vg and Ag, ∀g ∈ G, containing the labels of the declared graphs, their type (directed or undirected), their nodes and their arcs, respectively. Then, it may use a graph drawing algorithm to produce a geometric layout for each declared graph. If any of the standard predicates Graph, Directed, Node or Arc has not been defined in the augmented program, the interpreter assumes it is false by default. This choice gives the visualizer great flexibility, allowing it to provide default values for any piece of information left undefined by the user. Our approach, based on logic assertions, appears very powerful for highlighting formal properties of data structures and for conveying synthesized information into images. For example, consider the following declarations:

Graph(Out 2);
Node(Out N,2) For N: Node(N,1);
Arc(X,Y,2) Assign S In { S=0; for (int i=0; i<g.n; i++) S += g.m[X][i]*g.m[Y][i]; } If S==0;
They declare a new undirected graph, labelled 2, that provides a different graphical interpretation of the same structure g. In particular, its vertices are the same as in graph 1, but there is an edge {X, Y} if and only if the X-th row and the Y-th row of g.m are orthogonal. Figure 3, generated by the system Leonardo, shows a direct representation of a possible instance of the variable g and the graphs obtained from it through the foregoing declarations.

Fig. 3. Direct and synthesized visualizations of information stored in a matrix

The temporal complexity of the query algorithm becomes a critical point when dealing with large data structures. In our example, if n is the number of nodes stored in g.n, graph 1 is built in time O(n²), since each call to the predicates that define it takes time O(1), while graph 2 is built in time O(n³), because each computation of Arc(x, y, 2) requires O(n) steps for the orthogonality check.

3.1
Comparing Different Query Algorithms
The proposed algorithm for querying graph-related standard predicates is not the only one possible. In particular, since it builds the set of arcs Ag by computing Arc(x, y, g) for each pair of nodes (x, y), it appears well suited to extracting information from matrices, but it implies a quadratic lower bound on the temporal complexity, which is undesirable for other kinds of concrete graph representations, such as adjacency lists or lists of arcs.
Table 1. Different standard predicates for declaring arcs and related query algorithms (each signature is followed by the query algorithm fragment that invokes it)

Arc(X,Y,G):
  Ag ← ∅
  for all (x, y) ∈ Vg × Vg
    if (Arc(x,y,g)=true) then Ag ← Ag ∪ {(x, y)}

AdjList(X,Out Y,G):
  Ag ← ∅
  for all x ∈ Vg
    while (AdjList(x,y,g)=true) do Ag ← Ag ∪ {(x, y)}

InList(Out X,Y,G):
  Ag ← ∅
  for all y ∈ Vg
    while (InList(x,y,g)=true) do Ag ← Ag ∪ {(x, y)}

ArcList(Out X,Out Y,G):
  Ag ← ∅
  while (ArcList(x,y,g)=true) do Ag ← Ag ∪ {(x, y)}
A good solution to this problem is to provide various ways to specify the same piece of information, that is, different standard predicates playing the same role but invoked according to different query algorithms. In table 1 we show four standard predicates for declaring the arcs of a graph g and the query algorithm fragments for computing them, each presupposing a different behaviour of the invoked predicate: Arc(x, y, g) must simply perform a test on an input pair (x, y) provided by the visualizer; AdjList(x,Out y,g) takes a node x and is assumed
to enumerate all nodes y such that (x, y) ∈ Ag; InList(Out x,y,g) is similar, but receives y and must enumerate all nodes x such that (x, y) ∈ Ag; finally, ArcList(Out x,Out y,g) takes no input node and is assumed to enumerate all pairs of nodes to be added to Ag. Note that the given algorithms may be merged together, allowing the user to declare arcs in different - even mixed - ways, according to his/her particular needs. An accurate choice may considerably reduce the temporal complexity of the abstraction operation, as in the example given in table 2.
Table 2. Comparing the use of Arc and ArcList for visualizing a list of arcs (n = |Vg|, m = |Ag|)

C definition of a list of arcs:
  struct Arcs { int n,m; struct { int start, end; } l[100]; } g;

Definition of Arc (query algorithm complexity: O(n² · m)):
  Arc(X,Y,1) For M:InRange(M,0,g.m-1)
             If X==g.l[M].start && Y==g.l[M].end;

Definition of ArcList (query algorithm complexity: O(m)):
  ArcList(Out X,Out Y,1) For M:InRange(M,0,g.m-1)
                         Assign X=g.l[M].start Y=g.l[M].end;

4
Linked vs. Indexed Data Structures
So far we have considered only indexed data structures, and the task of extracting information from them was quite simple: in our first example we enumerated nodes as numerical objects and we used them in the definition of the predicate Arc as indices for accessing an adjacency matrix. Dealing with linked data structures is a little more difficult, due to their sequential - and not random - access: the whole data structure usually needs to be visited to allow the visualizer to gather its items. In the following we consider a common example, i.e. the binary tree:

struct node {
    int key;
    struct node *left, *right;
} *root;
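For illustration only (the insertion routine below is ours, not part of the paper), such a tree could be built by the underlying program through ordinary pointer manipulation, e.g. by repeated insertion into a binary search tree:

#include <stdlib.h>

/* Assumes the declaration of struct node and the global *root shown above. */
void insert_key(int key)
{
    struct node **p = &root;
    while (*p != NULL)                    /* descend to a free leaf position */
        p = (key < (*p)->key) ? &(*p)->left : &(*p)->right;
    *p = (struct node *) malloc(sizeof **p);
    (*p)->key  = key;
    (*p)->left = (*p)->right = NULL;
}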
As the number of nodes in the tree is not known in advance, a possible solution is to define an auxiliary (non-standard) predicate to collect its nodes:

PreVisit(R,Out N) If R!=NULL Assign N=R                                  Moreover
                  If R!=NULL For N:PreVisit(((struct node*)R)->left ,N)  Moreover
                  If R!=NULL For N:PreVisit(((struct node*)R)->right,N)  ;
PreVisit returns the pointers to all the items of the tree rooted at R according to a recursive scheme; thus, tree nodes can be enumerated by simply invoking it with input parameter root. The rest of the code for visualizing the tree is given below:

Tree(Out 1);
Node(Out N,1) For N:PreVisit(root,N);
AdjList(X,Out Y,1) Assign Y=((struct node*)X)->left  Moreover
                   Assign Y=((struct node*)X)->right ;
5
Conclusions
In this paper we presented an architecture for creating intuitive high-level visualizations of concrete data structures. In particular, we focused our attention on the use of logic-based techniques for recovering from the loss of abstraction related to the implementation process. Relevant features of our approach are:
– freedom of representation: there are no limitations on the type of concrete data structures;
– freedom of interpretation: the same variable may be interpreted in several ways, leading to different pictorial representations; this is achieved by uncoupling concrete data structures from high-level ones;
– the possibility of logical reasoning on data structures: formal properties can be easily visualized.
We presented some examples concerning the visualization of graphs and trees, yet the same ideas hold for other kinds of abstract data structures, too (e.g. queues, lists, stacks). We considered the temporal complexity of the abstraction recovery process, as it is a critical point when dealing with large data structures, and we showed that an accurate choice of predicates may reduce it. The reader interested in this approach can find further information on the Internet at: http://www.dis.uniroma1.it/~demetres/Leonardo/.
References
1. Cormen, T.H., Leiserson, C.E., Rivest, R.L. (1990), Introduction to Algorithms, MIT Press, Cambridge, MA.
2. Crescenzi, P., Demetrescu, C., Finocchi, I., Petreschi, R. (1997), Leonardo: a software visualization system, Proceedings WAE'97, pp. 146-155.
3. Demetrescu, C., Finocchi, I. (1998), A general-purpose logic-based visualization framework, Proceedings WSCG'99, pp. 55-62.
4. Henry, R.R., Whaley, K.M., Forstall, B. (1990), The University of Washington Illustrating Compiler, Proceedings of the ACM SIGPLAN'90 Conference on Programming Language Design and Implementation, 223-233, New York: ACM.
5. Myers, B.A. (1983), Incense: a system for displaying data structures, Computer Graphics, 17(3): 115-125.
6. Roman, G.C., Cox, K.C. (1993), A taxonomy of program visualization systems, Computer, 26, 11-24.
Visual Presentations in Multimedia Learning: Conditions that Overload Visual Working Memory Roxana Moreno and Richard E. Mayer University of California, Santa Barbara Psychology Department, Santa Barbara, CA 93106, U.S.A. {Moreno,Mayer}@psych.ucsb.edu
Abstract. How should we design visual presentations to explain how a complex system works? One promising approach involves multimedia presentation of explanations in visual and verbal formats, such as presenting a computer-generated animation synchronized with narration or on-screen text. In a review of three studies, we found evidence that presenting a verbal explanation of how a system works with an animation does not insure that students will understand the explanation unless research-based cognitive principles are applied to the design. The first two studies revealed a split-attention effect, in which students learned better when the instructional material did not require them to split their attention between multiple visual sources of information. The third study revealed a modality effect, in which students learned better when verbal input was presented auditorily as speech rather than visually as text. The results support two cognitive principles of multimedia learning.
1 Introduction The purpose of this paper is to propose a set of instructional design principles for visual presentations, as derived from a review of recent empirical studies on multimedia learning. In all studies, students were presented with verbal and non-verbal visual information and their learning from the multimedia lesson was compared to that of students who were presented with identical graphics and animations but who, instead of viewing on-screen text, listened to a narration. In defining multimedia learning it is useful to distinguish among media, mode and modality. Media refers to the system used to present instruction, such as a book-based medium or a computer. Mode refers to the format used to represent the lesson, such as words versus pictures. Modality refers to the information processing channel used by the learner to process the information, such as auditory versus visual [5]. Of particular interest for the present review is the study of how specific combinations of modes and modalities may affect students' learning of scientific explanations, such as when we combine visual-verbal material (i.e., text) or auditory-verbal material (i.e., narration) with visual-non-verbal materials (i.e., graphics, video or animations).
In all studies, after viewing a multimedia presentation, students had to complete a series of tests aimed to assess their retention and learning. Participants were asked to write down as much of the material as they could remember (retention test), to give names for parts of the animation (matching test), and to apply what they have learned to solve new problems (transfer test). Based on the results of our studies, two design principles will be proposed: the split-attention principle, and the modality principle.
2 Issue 1: A Split-Attention Effect How should verbal information be presented to students to enhance learning from animations: auditorily as speech or visually as on-screen text? In order to answer this question, Mayer and Moreno [7] asked students to view an animation depicting a complex system (the process of lightning formation, or how a car's braking system works), either along with concurrent narration (Group AN) or along with concurrent on-screen text (Group AT). Our goal was to test a dual-processing theory of multimedia learning based on the following assumptions: (a) working memory includes an auditory working memory and a visual working memory, analogous to the phonological loop and visuo-spatial sketch pad, respectively, in Baddeley's [1,2] theory of working memory; (b) each working memory store has a limited capacity, consistent with Sweller's [3,13,14] cognitive load theory; (c) meaningful learning occurs when a learner retains relevant information in each store, organizes the information in each store into a coherent representation, and makes connections between corresponding representations in each store, analogous to the cognitive processes of selecting, organizing, and integrating in Mayer's generative theory of multimedia learning [5,9]; and (d) connections can be made only if corresponding pictorial and verbal information is in working memory at the same time, corresponding to referential connections in Paivio's [4,12] dual-coding theory. Congruent with this dual-processing theory of multimedia learning, visuallypresented information is processed--at least initially--in visual working memory whereas auditorily-presented information is processed--at least initially--in auditory working memory. For example, in reading text, the words may initially be represented in visual working memory and then be translated into sounds in auditory working memory. As shown in Figure 1, in the AN treatment, students represent the animation in visual working memory and represent the corresponding narration in auditory working memory. Because they can hold corresponding pictorial and verbal representations in working memory at the same time, students in group AN are better able to build referential connections between them. In the AT treatment, students try to represent both the animation and the on-screen text in visual working memory. Although some of the visually-represented text eventually may be translated into an acoustic modality for auditory working memory, visual working memory is likely to become overloaded. Students in group AT must process all incoming information--at least initially--through their visual working memory. Given the limited resources students have for visual information processing,
using a visual modality to present both pictorial and verbal information can create an overload situation for the learner. If students pay full attention to on-line text they may miss some of the crucial images in the animation, but if they pay full attention to the animation they may miss some of the on-line text. Because they may not be able to hold corresponding pictorial and verbal representations in working memory at the same time, students in group AT are less able to build connections between these representations.
Fig. 1. A dual-processing model of multimedia learning. From Mayer & Moreno [7].
Therefore, dual-processing theory predicts that students in group AT perform more poorly than students in group AN on retention, matching , and transfer tests. The predictions are based on the idea that AT students may not have encoded as much of the visual material as AN students, may not have been able to build as many referential connections between corresponding pictorial and verbal information as AN students, and may not have been able to construct a coherent mental model of the system as well as AN students. Method and Results. Seventy eight college students who lacked knowledge of meteorology participated in the study of lightning formation, and 68 college students who had low knowledge of car mechanics participated in the study of a car’s braking system. All participants first viewed the animation with either concurrent narration in a male voice describing the major steps in the respective domain (Group AN) or concurrent on-screen text involving the same words and presentation timing (Group AT). Then, all students took the retention, transfer and matching tests. Figures 2 and 3 show the proportion of correct answers on the retention, matching and transfer tests for the AN and AT groups who viewed the lightning and car's braking system animation, respectively.
Fig. 2. Proportion correct on retention, matching and transfer tests for two groups--Lightning study. From Mayer & Moreno [7].
In the lightning presentation, group AN recalled significantly (p< .001) more, correctly matched significantly (p < .01) more elements on diagrams, and generated significantly (p< .001) more correct solutions than Group AT. Similarly, in the car braking presentation, group AN recalled significantly (p< .05) more, correctly matched significantly (p < .05) more elements on diagrams, and generated significantly (p< .01) more correct solutions than Group AT. These results are consistent with the predictions of the dual-processing hypothesis and allow us to infer the first instructional design principle, called the split-attention principle by the cognitive load theory [3,11]. Split-Attention Principle. Students learn better when the instructional material does not require them to split their attention between multiple sources of mutually referring information.
Fig. 3. Proportion correct on retention, matching and transfer tests for two groups--Car braking study. From Mayer & Moreno [7].
3 Issue 2: The Role of Modality Why do students learn better when verbal information is presented auditorily as speech rather than visually as on-screen text? Our first two studies showed that students who learn with concurrent narration and animation outperform those who learn with concurrent on-screen text and animation [7]. However, this type of concurrent multimedia presentation forces the text groups to hold material from one source of information (verbal or non-verbal) in working memory before attending to the other source. Therefore, the narration group might have had the advantage of being able to attend to both sources simultaneously, and the superior performance might disappear with sequential multimedia presentations, where verbal and non-verbal materials are presented one after the other. The purpose of our third study [10] was to test whether the advantage of narration over on-screen text resides in a modality principle. If this is the case, then the advantage for auditory-visual presentations should not disappear when they are made sequential, that is, when the graphics or animation are presented either before or following the narration or on-screen text. Method and Results. The participants were 137 college students who lacked knowledge of meteorology. They first viewed the animation in one of the following six conditions. First, and similar to our first two studies, one group of students viewed concurrently on-screen text while viewing the animation (TT) and a second group of students listened concurrently to a narration while viewing the animation (NN). In
addition to the concurrent groups, four groups of sequential presentations were included. Students listened to a narration preceding the corresponding portion of the animation (NA), listened to the narration following the animation (AN), read the onscreen text preceding the animation (TA), or read the on-screen text following the animation (AT). After viewing the animation, all students took retention, transfer and matching tests. Figure 4 shows the proportion of correct answers on the retention, transfer and matching tests for the NN, AN, NA, AT, TA and TT groups.
Fig. 4. Proportion correct on retention, transfer and matching tests for six groups. From Moreno and Mayer [10].
The text groups (TT, AT, and TA) scored significantly lower than the narration groups (NN, AN, and NA) in verbal recall (p < .001), problem solving transfer (p < .001), and matching (p < .005). These results reflect a modality effect. Within each modality group, the simultaneous and sequential groups only showed a significant difference in their performance for matching tests (p < .05). This finding might be interpreted as an example of split-attention, where presenting two competing visual materials simultaneously has negative effects on the association of verbal and visual materials in a multimedia presentation. These results are consistent with prior studies on text and diagrams [11], and allow us to infer a second instructional design principle--the Modality Principle. Modality Principle. Students learn better when the verbal information is presented auditorily as speech rather than visually as on-screen text both for concurrent and sequential presentations.
4 General Discussion These results provide an important empirical test of a dual-processing theory of working memory within the domain of multimedia learning according to which students will learn better in multimedia environments when words and pictures are presented in separate modalities than in the same modality. When pictures and words are both presented visually (i.e., a split-attention situation), learners are able to select
fewer pieces of relevant information because visual working memory is overloaded. When words and pictures are presented in separate modalities, visual working memory can be used to hold representations of pictures and auditory working memory can be used to hold representations of words. The robustness of these results was evident in two different domains (meteorology and mechanics) across three different studies. Although multimedia learning offers great potential educational opportunities through the presentation of rich visual information such as graphics, animation, and movies, computer-based instructional materials are usually based on what current technology advances can do rather than on research-based principles of how students learn with technology. Multimedia environments allow students to work easily with verbal and non-verbal representations of complex systems. They also allow the use of different modalities to present the same information. The present review demonstrates that presenting a verbal explanation of how a system works with complex graphics does not insure that students will remember or understand the explanation unless research-based principles are applied to the design. Our first two studies showed that students learn better from designs that do not present simultaneous mutually-referring visual information. The split-attention principle emphasizes the need to present animation with auditory speech rather than on-screen text. Presenting an animation with simultaneous on-screen text forces students to hold one source of the visual materials in working memory while attending to the other source, creating a high cognitive load. In our third study, evidence was found for a modality principle, where students learn better if the verbal material is presented auditorily rather than visually even in sequential presentations. It showed that the advantage of narration presentations over on-screen text presentations does not disappear when both groups are forced to hold the information contained in one source of the materials before attending to the other. These results suggest not only that more information is likely to be held in both auditory and visual working memory rather than in just one, but that the combination of auditory verbal materials with visual non-verbal materials may create deeper understanding than the combination of visual verbal and non-verbal materials. This study calls attention to the need to broaden the goals of instructional designers of visual presentations. The design of multimedia presentations should be guided by the goal of presenting information that is relevant, and in a way that fosters active cognitive processing in the learner. Focusing solely on the first goal--presenting relevant information--can lead to presentations such as the one given to the AT groups in our studies, where visual working memory is likely to become overloaded. When working memory becomes overloaded, the opportunities for active cognitive processing are reduced. Focusing on both goals--presenting relevant information in ways that promote active learning--can lead to presentations such as the one given to the AN groups in our studies, where working memory is less likely to become overloaded. An important consideration in the design of multimedia presentations is whether to accompany animations with auditorily-presented or visually-presented words.
The most important practical implication of this study is that animations should be accompanied by narration rather than by on-screen text. This implication is particularly important in light of the increasing use of animations and on-screen text both in
courseware and on the world wide web. These results cast serious doubts on the implicit assumption that the modality of words is irrelevant when designing multimedia presentations. These results should not be taken as a blanket rejection of the use of text captions with graphics. On the contrary, in a series of studies on text and illustrations about how devices work carried out in our lab at Santa Barbara, the results have consistently shown that students learn more productively when text is presented within corresponding illustrations rather than when text and illustrations are presented on separate pages [6,5,8,9]. Similarly, in a series of studies on worked-out geometry problem examples, Sweller and his colleagues have shown that students learn better when text explanations are presented on the sheet with geometry problems than separately [13,14]. Overall, these studies provide ample evidence for the benefits of presenting short captions or text summaries with illustrations.
References
1. Baddeley, A.D.: Working memory. Oxford, England: Oxford University Press (1986)
2. Baddeley, A.: Working memory. Science, Vol. 255 (1992) 556-559
3. Chandler, P. & Sweller, J.: The split-attention effect as a factor in the design of instruction. British Journal of Educational Psychology, Vol. 62 (1992) 233-246
4. Clark, J. M. & Paivio, A.: Dual coding theory and education. Educational Psychology Review, Vol. 3 (1991) 149-210
5. Mayer, R. E.: Multimedia learning: Are we asking the right questions? Educational Psychologist, Vol. 32 (1997) 1-19
6. Mayer, R. E.: Systematic thinking fostered by illustrations in scientific text. Journal of Educational Psychology, Vol. 81 (1989) 240-246
7. Mayer, R. E. & Moreno, R.: A split-attention effect in multimedia learning: Evidence for dual processing systems in working memory. Journal of Educational Psychology, Vol. 90 (1998) 312-320
8. Mayer, R. E. & Gallini, J. K.: When is an illustration worth ten thousand words? Journal of Educational Psychology, Vol. 82 (1990) 715-726
9. Mayer, R. E., Steinhoff, K., Bower, G. & Mars, R.: A generative theory of textbook design: Using annotated illustrations to foster meaningful learning of science text. Educational Technology Research and Development, Vol. 43 (1995) 31-43
10. Moreno, R. & Mayer, R. E.: Cognitive principles of multimedia learning: the role of modality and contiguity. Journal of Educational Psychology (in press)
11. Mousavi, S. Y., Low, R., & Sweller, J.: Reducing cognitive load by mixing auditory and visual presentation modes. Journal of Educational Psychology, Vol. 87 (1995) 319-334
12. Paivio, A.: Mental representation: A dual coding approach. Oxford, England: Oxford University Press (1986)
13. Tarmizi, R. & Sweller, J.: Guidance during mathematical problem solving. Journal of Educational Psychology, Vol. 80 (1988) 424-436
14. Ward, M. & Sweller, J.: Structuring effective worked out examples. Cognition and Instruction, Vol. 7 (1990) 1-39
Visualization of Spatial Neuroanatomical Data Cyrus Shahabi, Ali Esmail Dashti, Gully Burns, Shahram Ghandeharizadeh, Ning Jiang, and Larry W. Swanson Department of Computer Science & Department of Biological Sciences USC Brain Project & Integrated Media Systems Center University of Southern California, Los Angeles, California 90089-0781, U.S.A. {cshahabi,dashti,shahram,njiang}@cs.usc.edu {gully,lswanson}@mizar.usc.edu
1
Introduction
Research on the design, development, management, and usage of database systems has traditionally focused on business-like applications. However, concepts developed for such applications fail to support the diverse needs of scientific and biomedical applications, which require the support of an extraordinarily large range of multimedia data formats. Moreover, the quality and progress of scientific endeavors depends in part on the ability of researchers to share and exchange large amounts of visual data with one another efficiently [1]. In this paper, we describe our efforts, as part of the USC Brain Project (a collaboration between neuroscience and database researchers to realize a digital collaborative environment), in developing a number of visualization and database tools that help neuroscientists share and visualize neuroscientific images. We report on the development of data visualization tools for spatial analysis of neuroanatomical data. Neuroanatomical data is analyzed by neuroscientists in order to understand the behavior of brain cells, where the brain is made up of a large number of individual cells (or neurons) and glial cells. The task of neuroscience is to explain how the brain organizes these units to control behavior and how, in turn, the environment influences the brain. To understand the brain and its behavior, it is necessary to appreciate how the nervous system is organized functionally and anatomically. Our focus here is on developing visualization tools to understand the anatomical organization of brain cells. The remainder of this paper is organized as follows. In Sec. 2, we describe the functionality required by the target application domain in detail. Sec. 3 provides descriptions of the tools developed to support the functionality of the application domain and discusses their challenges. In Sec. 4, we show how a combination of the tools can be used to define a standard template for sharing neuroscience information among scientists. Here, we specifically focus on consolidating the relevant contents of digital journal publications with neuroanatomical data. Finally, Sec. 5 concludes this paper by describing our future work.
2
Neuroanatomical Application
We interpret the spatial structure of neuroanatomical data visually. Analysis of patterns of cellular properties in brain tissue with quantitative statistical methods can be used to make objective interpretations. The use of these analyses is limited due to the complexity of the brain and the inherent difficulties of obtaining quantitative neuroanatomical data, so that data visualization is unlikely to be superseded by quantitative statistical analyses among the majority of neuroanatomists. Moreover, a visual representation can be regarded as a standard requirement for all neuroanatomical data, and, below, we show how to represent quantitative statistical data in conjunction with neuroanatomical images.
Fig. 1. The four stages of neuroanatomical data interpretation
The neuroanatomical data that will be considered in this paper are Phaseolus vulgaris leucoagglutinin (PHAL) immunohistochemical tract-tracing data. These are by no means representative of all different types of neuroanatomical data, but represent a starting point that we will use to eventually generalize from. The interpretation of the PHAL tract-tracing data has four stages, where each stage consists of data in a different physical or computational form, see Figure 1. The first stage involves histological slides. These are 30µm thick slices of rat brain tissue mounted on glass slides that can be examined by the use of a light microscope. This data contains a huge amount of information and subsequent stages progressively simplify the data. In the next stage, the data is transferred to Swanson's rat atlas plates by drawing the individual PHAL-stained fibers [2], where the brain atlas consists of drawings of cell-group and fiber tract boundaries from celloidin-embedded Nissl sections. Since very little information about the function, anatomy, and pathology of specific parts of the human brain is available, links to similar issues in animal research become useful. Therefore, the Swanson atlas was prepared from a rat brain sectioned in the coronal (frontal or transverse) plane. From 556 serial sections, 73 levels were chosen and illustrated as representative of the entire rat brain. The process of superimposing data on the atlas requires a high level of expertise and patience for several reasons. The orientation of the plane of section of the tissue does not correspond exactly to that of the atlas. The cutting and fixing procedures cause unpredictable nonlinear distortions of the tissue. The experimenter is forced to perform some degree of subjective interpretation when
performing this task by drawing the data on each atlas plate. If sufficient care is taken with this procedure, the end product is a highly detailed and accurate representation of the labeling pattern in the histological slide, but the procedure is extremely time-consuming. The next stage of processing is building summaries of sets of connections in a two-dimensional representation of the brain called a flatmap. These diagrams have been designed to preserve as much of the topographical organization of the brain as possible in a simple two-dimensional representation. Thus, data involving several brain structures can be represented in a convenient two-dimensional figure combined with an implicit representation of the position of these structures. These flatmaps are derived from Swanson's atlas of the rat brain. The final level of interpretation of PHAL tract-tracing data is the logical circuit diagram. These diagrams describe the organization of brain systems under study in terms of the connections between structures. They are summaries of large numbers of PHAL and other tract-tracing experiments, and, typically, do not involve any explicit representation of the structure of the tissue itself. As conceptual tools, logical circuit diagrams are widely used throughout the entire scope of neuroanatomical research. They represent the end product of neuroanatomical tract-tracing research: a conceptual framework for the organization of neural systems. In order to extract the logical circuit diagram (i.e., final level) from a series of two-dimensional flatmaps (i.e., third level), a neuroscientist is required to visualize the results of many experiments at different levels of analysis. Therefore, a database of the information generated at all four stages and tools to visualize and manage the data is required. Two of the four stages in this process may be ameliorated through the use of tools described in this paper: a) the Neuroanatomical Registration Viewer (or NeuARt) is concerned with the stage involving expert drawings of histological slides, and b) the NeuroScholar knowledge-base management system is concerned with the final stage involving high-level interpretations of the data. The quality of information at each stage can be improved tremendously if information from other stages is also accessible. For example, when looking at data from a specific area, it is easier to see which other data may be relevant to the questions under consideration. In contrast, users examining high-level interpretations may find it extremely useful to zoom in to the fine details that may either support or refute global schemes. These interactions between stages can be accomplished by interaction between these tools.
3
Neuroanatomical Data Visualization Tools
In this section, we start by describing NeuARt: Neuroanatomical Registration Viewer. NeuARt is an application designed to help neuroanatomists manage, store, query, browse, and share both Swanson’s atlas and other experimental data (as described above). We focus on an important component of NeuARt, Spatial Query Manager (SQM), which provides spatial correspondences between regions of the brain atlas and experimental data. Subsequently, we describe Spatial Index Manager (SIM). SIM is an application designed to help neuroanatomists build
the necessary index structures for supporting the spatial queries imposed by SQM. For each component, we describe its basic features, see Figure 2.
Fig. 2. NeuARt and SIM system architecture
3.1
NeuARt: Neuroanatomical Registration Viewer
NeuARt is designed as a client-server architecture, where it consists of two main modules: a data viewer module and a data management module. We chose this modular design to simplify future modifications to the user interface and to simplify porting of the application to different data management modules. The data viewer module resides at the client side and is responsible for the data visualization task. It contains a graphical user interface that is described below. The NeuArt data management module consists of a database management system (DBMS), a database schema, and a data server. The first two components reside at the server side while the data server is on the client side. The data server manages the interface between the viewer and the DBMS and caches large images on the client side. In our prototype, we have used the Informix Universal Server v9.12 as the DBMS, because of its object relational capabilities. The data server is a Java application and it communicates with the database server through the Java API, a library of Java classes provided by Informix. It provides access to the database and methods for issuing spatial and SQL queries and retrieving results. From each client’s data server, Remote Method Invocation (RMI) is used to open connection to the database server. The data viewer module consists of a viewer server and eight user interface (UI) managers (where each manager is a pop-up window). The viewer server is a centralized server for all the interactions among the UI managers, and the interaction between the data viewer module and the data management module. The UI managers are: the display manager, SQM, query manager, results manager, active set manager, level manager, anatomical terms manager, and viewer manager. The viewer module provides neuroscientists with a friendly
user-interface, and it is based on a so-called "two-pass paradigm" [3]. The two-pass paradigm works as follows. In the first pass, the user identifies a set of textual and spatial fields to query the database. The data management module returns partial information on a set of experiments that satisfy the query based on these spatial and textual fields. In the second pass, the user identifies the set of experiments he/she is interested in for complete consideration. For each identified experiment, the data management module returns the complete set of experimental data (i.e., all textual data and images) to the viewer module. Hence, the images and other large multimedia data are only required during the second pass. The display manager is the focal point of user interactions within NeuARt. The display manager allows the user to: a) display and control the selected atlas and the image overlays, b) control the display of the other UI managers, and c) specify spatial queries, see Figure 2. The user may spatially query the atlas structures and/or query the combination of the atlas structures and the overlay data using the Display Manager. To specify spatial queries, the user may use SQM, which is designed to support spatial queries on both the atlas images and experimental data. SQM extends the NeuARt user interface, permitting the user to: a) point at a structure and see the name and corresponding information about the structure (including the list of publications with experiments on that structure), and b) select an area (as a rectangle or a circle) and find all the experiments that are contained in, contain, or overlap with the selected area. SQM achieves its spatial query functionality by utilizing the Java 2D API on the user interface side and the Informix spatial datablade on the database server side. In addition, it utilizes topological information generated by SIM for both atlas images and experimental data. The query manager is used to specify textual attributes, such as experimental protocols, laboratories, authors, and other textual attributes, to query the database. After submitting a query with the query manager and the spatial query tools of the display manager, the database returns results to the result manager (via the viewer server). The result manager formats the results into a list of experiments. While scrolling through these descriptions, the user may select and add experiments to the active list manager for the second pass of the two-pass paradigm (see above). The active set manager enables the user to control the presentation of the data overlays on atlas levels. The level manager allows the user to traverse the atlas in the third dimension (z-axis). It allows for two types of traversal: atlas-level traversal and experiment-level traversal. In the former, for each button push, the Level Manager jumps one atlas level forward or backward. In the latter, for each button push, the Level Manager jumps forward or backward to an atlas level that contains data layers.
3.2
SIM: Spatial Index Manager
The topological structures of Swanson’s atlas and of experimental data should be stored in databases, with their spatial relationships explicitly expressed. The Swanson atlas consists of a set of seventy-three electronic drawings in Adobe Illustrator. The curves and lines of the drawings delineate brain structures, but
the topological structure of many of the constituent spline curves does not fully enclose their respective nuclei in a topologically consistent manner. Some regions lie in areas without complete boundaries, and so the exact location of their borders remains unclear. To solve this problem, SIM was developed to impose a mathematical topographical structure onto the atlas drawings, by using a combination of automation and expert user intervention with a topological mapping program. This process converts the atlas drawings into "intelligent templates" in which every point "knows" both the spatial extent and the name of the region that contains it. This "knowledge" is then inherited by any regional data registered against the atlas, and thus supports spatial queries anchored by references to particular brain regions, spatial features, or 3D coordinates. The current version of SIM is implemented in the Java language. Similar to NeuARt's data server, it communicates with the Informix Universal Server via RMI (see Figure 2). It stores the identified topological structures in Informix spatial datablade format. Two major functions of SIM are:
– Free Hand Drawing: This function allows users to identify objects by drawing polygons around them free-hand, labeling them, and storing them into the database. Through this function we can impose topological structures on both the atlas and the experimental data.
– Fill Function & Automatic Boundary Generation: This function can semi-automatically identify objects with closed structures and store them as polygons into the database system. This is achieved by filling a selected closed structure with a certain color and then automatically detecting the convex hull of the colored region. Another function of this module is to check whether a free-hand-drawn polygon is closed or not.
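The following fragment is our own sketch of the kind of point-in-region test that such intelligent templates enable, using the standard even-odd ray-casting rule; it is an illustration only, not the actual NeuARt or Informix datablade implementation:

/* Returns 1 if the point (px,py) lies inside the polygon whose vertices are
   (x[i], y[i]), i = 0..n-1, given in order; 0 otherwise (even-odd rule). */
int point_in_region(double px, double py, const double x[], const double y[], int n)
{
    int i, j, inside = 0;
    for (i = 0, j = n - 1; i < n; j = i++) {
        /* does the horizontal ray from (px,py) cross the edge (j,i)? */
        if (((y[i] > py) != (y[j] > py)) &&
            (px < (x[j] - x[i]) * (py - y[i]) / (y[j] - y[i]) + x[i]))
            inside = !inside;
    }
    return inside;
}

A region label attached to each such polygon is enough to answer the "point at a structure and see its name" query described above.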
4
Standard for Data Consolidation
The emergence of neuroinformatics as a discipline has prompted the need for a standardization and coordination of neuroanatomical terminology and coordinate systems. These are cornerstones of effective information sharing among scientists and applications. At present, brain atlases provide the main practical standardized global maps of neural tissue. Here, we briefly describe how it is possible to employ Swanson's atlas through NeuARt as a means to consolidate neuroscience data, which includes neuroanatomical and neurochemical data, as well as journal publications. As a direct result of such an interconnection and consolidation, many neuroinformatics navigation scenarios will become feasible. For example, a neuroscientist can start data navigation from a repository of digital publications, select a paper, and then request to zoom into Swanson's atlas to see the corresponding brain structures discussed in the experimental section of the paper. Alternatively, he/she might start by navigating Swanson's atlas and then request to view all the publications available about a specific brain region. The link between domain-specific knowledge and spatially-distributed experimental data is generated through the use of a common set of named objects (i.e., the names of brain regions and fiber pathways from Swanson's atlas). All
knowledge stored in NeuroScholar is translated explicitly to this nomenclature and can be represented in the context of the atlas. SIM provides a topological structure for each named area in terms of its spatial properties, thus providing a mechanism for translating spatially-distributed drawings into the atlas scheme.
4.1
NeuroScholar: A Knowledge Base System
Interpretations of Neuroanatomical data are typically represented in the published literature. The task of constructing a globally consistent account of the neural connections of the system is made extremely difficult for many reasons: the literature is huge; much of the data is incomplete, error-prone and largely qualitative; and finally, neuroanatomical nomenclature is extremely disparate. We challenge this task with a knowledge-base management system called NeuroScholar. Recently, several large collections of connection data have been con-
Fig. 3. Visualizing neuroanatomical data in NeuroScholar: a) a schematic view of the structures; b) spatial indexing of neuroanatomical data
structed into databases, so that the network of inter-area connections can be analyzed with mathematical methods [4,5]. These studies are concerned with systems of between thirty and one hundred brain structures and may be considered to be an overview of the literature from the collator’s viewpoint. With the exception of work in the rat [5], the original descriptions of the connection data are not represented in the collation, so that users must reread the cited publications in order to verify the interpretations made by the collator. In all cases, the published descriptions of connections were intuitively translated into a single global parcellation scheme that had been adopted by the collator. NeuroScholar is more powerful than these previous databases of neuroanatomical connection information [5] in two ways. First, it uses an object-oriented data model to represent the conceptual framework of neuroanatomical experimentation in detail. Rather than representing a neural connection as a high-level point-to-point description, we incorporate the physical parameters of neuronal populations into our description, this is illustrated in Figure 3(a). This approach allows us to model neurobiological concepts realistically. Second, the system can differentiate between different types of knowledge (i.e., data that has been organized in a
coherent framework and represented in the context of similar or conflicting data), and represents subjective interpretations of authors in the database structure. This domain-based knowledge consists of textual descriptions of spatial phenomena. The power of this software can be augmented by embedding it into NeuARt, see Figure 3(b). This figure shows the location of the injection site in a tract-tracing experiment [6]. On closer examination, it would be possible to place the injection site in the position shown on the right-hand figure. This polygon lies mostly in a region of the brain called the "zona incerta", in contrast with the authors' account, which places it in the "lateral hypothalamic area". Such a discrepancy would make the correct interpretation of this data impossible without the use of spatial indexing. It is immediately apparent from these figures that the structure of the rat brain is extremely complex, and neuroanatomists would benefit tremendously from having access to domain-based information while viewing drawings in NeuARt. Thus, within NeuARt, a user may send queries directly to NeuroScholar to query a specified area's inputs, or outputs, or any other aspect of published information concerning that structure; this may include descriptions of an area's physiological properties, or its high-level function (e.g., 'spatial navigation system') as reported in the literature.
5
Conclusion
We have described a neuroanatomical visualization tool to navigate through brain structures while monitoring related data generated as a result of experiments or from published literature. This tool consists of many components including a platform-independent graphical user interface, an object-relational database system, a knowledge base system to reason about published literature, and a number of spatial components to reason about the topological structures of the brain and its relevant data. Currently, we are working on different techniques to represent, query, and manage three-dimensional structures of the brain (i.e., brain volumes) through many levels of 2-dimensional structures.
References
1. Dashti, A.E., Ghandeharizadeh, S., Stone, J., Swanson, L.W., Thompson, R.H.: Database Challenges and Solutions in Neuroscientific Applications. NeuroImage Journal (1997)
2. Swanson, L.W.: Brain Maps: Structure of the Rat Brain. 2nd edn. Elsevier Science Publishers B.V., Amsterdam (1998)
3. Shahabi, C., Dashti, A.E., Ghandeharizadeh, S.: Profile Aware Retrieval Optimizer for Continuous Media. Proceedings of the World Automation Congress (1998)
4. Young, M.P., Scannell, J.W., Burns, G.A., Blakemore, C.: Analysis of Connectivity: Neural Systems in the Cerebral Cortex. Reviews in the Neurosciences, Vol. 5, No. 3 (1994) 227-250
5. Burns, G.: Neural Connectivity of the Rat: Theory, Methods and Applications. Physiology Department, Oxford University (1997)
6. Allen, G.V., Cechetto, D.F.: Functional and Anatomical Organization of Cardiovascular Pressor and Depressor Sites in the Lateral Hypothalamic Area. Journal of Comparative Neurology, Vol. 330, No. 30 (1993) 421-438
Visualization of the Cortical Potential Field by Medical Imaging Data Fusion

Marie C. Erie 1, C. Henry Chu 1, and Robert D. Sidman 2

1 Center for Advanced Computer Studies, The University of Southwestern Louisiana, Lafayette, LA 70504, U.S.A.
2 Department of Mathematics, The University of Southwestern Louisiana, Lafayette, LA 70504, U.S.A.
Abstract. We describe the visualization of the potential field on the scalp and on the cortical surface. The surfaces are derived from magnetic resonance imaging data and the potential fields are reconstructed from electroencephalography data. The visualization tool is validated with clinical and cognitive application studies.
1 Introduction and Problem Background
Visualization tools provide insight for users to deal with the abundance of data available in our information age. An important application of visualization is in medical imaging, where many modalities have been developed for different organs and applications. Integration of different imaging modalities for diagnostics of the human brain, for example, has the potential to improve neuroscientific tasks such as noninvasive localization of epileptic spikes and seizures. Among the many modalities available, electroencephalography (EEG) has the advantages of low cost, wide availability, and millisecond temporal resolution. The disadvantage of EEG is its limited spatial resolution, due to the limited number of sampling sites and to the smearing and attenuation of the voltage by the skull and other media surrounding the sources of the EEG. Integrating EEG data with the structural and anatomical information provided by magnetic resonance imaging (MRI) data offers the promise of source localization in clinically useful cases, such as the identification of critical brain tissue for resection in medically intractable epilepsy. To address this need, we explore the use of visualization tools based on the reconstruction of the potential field on a cortical surface derived from MRI data.
2 Visualization Methods
The Cortical Imaging Technique (CIT) [1] is among a number of algorithms for reconstructing the potential field at or below the cortical surface that have been developed recently to improve the spatial resolution of EEG imaging. The CIT models the head as a hemisphere, and reconstructs the potential field inside the hemisphere based on scalp-recorded voltages as boundary conditions. In CIT,
the reconstruction problem is framed as an inward harmonic continuation problem. We first construct a hemispherical layer of N weighted, radially oriented unit current dipoles, D_1, …, D_N, such that the potential function of the theoretical layer takes on the same surface values, v_1, …, v_M, at the M surface sites, A_1, …, A_M. Weighting numbers w_1, …, w_N are calculated to satisfy the M equations ∑_{i=1}^{N} w_i V(D_i, A_j) = v_j, for j = 1, …, M, as follows. The quantities V(D_i, A_j) are the potentials generated by the unit dipole D_i at surface site A_j, and v_j is the measured referential voltage at the jth scalp recording site. In practice, M typically has values such as 16, 32, or 64; the number of source dipoles is usually set to 160, 280, or higher, depending on the configuration of the dipoles. Since M < N, this system has an infinite number of solutions. Nevertheless, it is possible to find ŵ_i, the unique solution of minimum Euclidean norm, via a singular value decomposition of the matrix associated with the system of equations. Once the weights of the unit current dipoles are determined, one can "image" the cortical potential at any radius using the forward computation v̂_l = ∑_{i=1}^{N} ŵ_i V(D_i, C_l), for l = 1, …, L, where the quantities V(D_i, C_l) are the potentials generated by the unit dipole D_i at the imaged site C_l.

The potential field, such as one recovered by the CIT, is typically displayed as isocontours of interpolated voltages plotted inside a unit circle. The user is assumed to be viewing the cortex, modeled as a hemisphere, from the top, with the left and right ears along the horizontal axis. Three-dimensional graphics methods allow one to interactively view the hemisphere as a 3D object with its surface color mapped to the voltages. A more intuitive presentation is to render the potential field on the cortical or the scalp surface. The cortical and scalp surfaces have to be extracted from a different modality, such as MRI data.

Volume data, such as MRI data, are 3D entities that contain a set of samples, each of which represents a value of some property of the data at a 3D location. Volume data are obtained by sampling, simulation, or modeling techniques [2]. There are two classes of techniques for visualizing volume data: volume rendering and surface rendering. Volume rendering techniques [3] map the data directly into an image without the intermediate step of surface fitting. Images are formed by sampling rays projected through the volume data; hence, both the interior and the surface of each object in the data are considered. The first step of surface rendering techniques is typically the generation of isosurfaces, which are taken to be representations of volume objects. The surfaces are then rendered to form images. Volume rendering techniques are better at preserving the information in the volume data than surface rendering techniques, at the cost of increased algorithm complexity. Surface rendering techniques are preferred when the application requires fast rendering, or when only the exterior of an object is to be visualized. Our application requires rendering that is sufficiently fast to facilitate animation of the sequence of cortical potentials derived from time-series EEG data. These potentials are to be rendered on a cortical surface; hence our tool is based on surface rendering.
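Before turning to surface extraction, the dipole-fitting step described above can be made concrete. The sketch below assumes the Eigen C++ library (an assumption; the paper does not name one) and assumes the M x N matrix of unit-dipole potentials V(D_i, A_j) has already been computed from the hemispherical head model; Eigen's SVD-based solve returns the least-squares solution of smallest Euclidean norm, i.e., the ŵ of the text.

#include <Eigen/Dense>

// Minimum-norm weights for the underdetermined system sum_i w_i V(D_i, A_j) = v_j.
// V is the M x N matrix of unit-dipole potentials at the M recording sites and
// v holds the M measured scalp voltages; with M < N the SVD-based solve returns
// the least-squares solution of smallest Euclidean norm.
Eigen::VectorXd fitDipoleWeights(const Eigen::MatrixXd& V, const Eigen::VectorXd& v)
{
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(V, Eigen::ComputeThinU | Eigen::ComputeThinV);
    return svd.solve(v);                      // w_hat
}

// Forward computation: image the potential at L chosen sites C_l, given the
// L x N matrix Vc of unit-dipole potentials V(D_i, C_l) at those sites.
Eigen::VectorXd imagePotential(const Eigen::MatrixXd& Vc, const Eigen::VectorXd& wHat)
{
    return Vc * wHat;                         // v_hat_l = sum_i w_hat_i V(D_i, C_l)
}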
Volume data of the brain structure, segmented from MRI data, are used to obtain a mesh surface via the marching cubes algorithm [4], which is the most widely used isosurface rendering algorithm. The user specifies an intensity value as the threshold for the surface that needs to be visualized. A cube marches through the volume data, and at each location a decision is made whether a surface patch should be placed inside the cube. At each location, there are eight values, one at each of the cube's vertices. If a vertex value is not less than the threshold, the vertex is assigned a value of one; otherwise it is assigned a value of zero. This operation determines the topology of the surface. The locations of the intersections of the surface with each edge are then determined. Subsequently, the gradient of the original data is computed and used for shading the object.

Separate cortical and scalp surfaces were rendered by color mapping the respective potential fields computed by the CIT. This was implemented by extending the Visualization Toolkit C++ class library [2]. Specifically, a new marching cubes class was defined with methods that compute the potential at each mesh vertex. The CIT takes into account the attenuation of the cortical potential by the highly resistive skull layer in deriving the cortical potential from scalp-recorded data. Potential values are computed on a hemispherical surface, using a set of dipole sources located on another hemispherical surface interior to the first hemisphere, and thus closer to the true source of the scalp-recorded data. The potentials are then "projected" onto the non-ideal surface of the cortex, or scalp, as the case may be. This is accomplished by associating each computed potential with the corresponding vertex's scalar attribute. The visualization pipeline can be executed for several time steps to present the dynamics of spike events in the EEG time-series recordings.

Since the EEG data and the MRI data are acquired separately in their own coordinate systems, we first have to align them. Best-fitting spheres are fit to the MRI-derived scalp and to the electrode locations. The center of the scalp-fitted sphere is used as the center of the head model. The coordinate axes of the MRI data and those of the electrodes are then aligned to form a unified coordinate system for the forward computation.
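A pipeline of this kind can be sketched directly with VTK. The fragment below is a simplified illustration, not the authors' implementation: it builds a synthetic spherical test volume in place of segmented MRI data, uses the stock vtkMarchingCubes filter rather than the extended marching cubes class described above, and substitutes a trivial citPotentialAt() stub for the CIT forward computation; the threshold, scalar range, and hue range are arbitrary example values.

#include <vtkSmartPointer.h>
#include <vtkImageData.h>
#include <vtkMarchingCubes.h>
#include <vtkPolyData.h>
#include <vtkPointData.h>
#include <vtkDoubleArray.h>
#include <vtkLookupTable.h>
#include <vtkPolyDataMapper.h>
#include <vtkActor.h>
#include <vtkRenderer.h>
#include <vtkRenderWindow.h>
#include <vtkRenderWindowInteractor.h>
#include <cmath>

// Stand-in for the CIT forward computation of the previous section: any
// function returning a voltage for a 3-D surface point would do here.
static double citPotentialAt(const double p[3]) { return 10.0 * p[2]; }

int main()
{
    // Synthetic test volume (a ball of high intensity); in the real tool this
    // would be the MRI volume segmented to the cortex or scalp.
    auto volume = vtkSmartPointer<vtkImageData>::New();
    volume->SetDimensions(64, 64, 64);
    volume->AllocateScalars(VTK_DOUBLE, 1);
    for (int z = 0; z < 64; ++z)
        for (int y = 0; y < 64; ++y)
            for (int x = 0; x < 64; ++x) {
                double r = std::sqrt((x - 32.0) * (x - 32.0) + (y - 32.0) * (y - 32.0) + (z - 32.0) * (z - 32.0));
                volume->SetScalarComponentFromDouble(x, y, z, 0, 100.0 - r);
            }

    // 1. Marching cubes extracts the isosurface at a user-chosen threshold.
    auto mc = vtkSmartPointer<vtkMarchingCubes>::New();
    mc->SetInputData(volume);
    mc->SetValue(0, 75.0);                       // intensity threshold (example value)
    mc->Update();
    vtkPolyData* mesh = mc->GetOutput();

    // 2. Associate a CIT-derived potential with every mesh vertex as its scalar attribute.
    auto potentials = vtkSmartPointer<vtkDoubleArray>::New();
    potentials->SetName("potential");
    potentials->SetNumberOfValues(mesh->GetNumberOfPoints());
    for (vtkIdType i = 0; i < mesh->GetNumberOfPoints(); ++i) {
        double p[3];
        mesh->GetPoint(i, p);
        potentials->SetValue(i, citPotentialAt(p));
    }
    mesh->GetPointData()->SetScalars(potentials);

    // 3. Render the surface with the scalars mapped through a color table
    //    (low values toward red, high values toward blue).
    auto lut = vtkSmartPointer<vtkLookupTable>::New();
    lut->SetHueRange(0.0, 0.667);
    auto mapper = vtkSmartPointer<vtkPolyDataMapper>::New();
    mapper->SetInputData(mesh);
    mapper->SetLookupTable(lut);
    mapper->SetScalarRange(-100.0, 200.0);       // microvolts (example range)
    auto actor = vtkSmartPointer<vtkActor>::New();
    actor->SetMapper(mapper);

    auto renderer = vtkSmartPointer<vtkRenderer>::New();
    renderer->AddActor(actor);
    auto window = vtkSmartPointer<vtkRenderWindow>::New();
    window->AddRenderer(renderer);
    auto interactor = vtkSmartPointer<vtkRenderWindowInteractor>::New();
    interactor->SetRenderWindow(window);
    window->Render();
    interactor->Start();
    return 0;
}

The essential step is the middle one: once the CIT potential is the scalar attribute of each mesh vertex, the standard mapper and lookup-table machinery produces the color-mapped surface, and re-running that loop for successive EEG time steps yields the animation described above.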
3 Results
A set of 16-channel EEG data sampled at 100 Hz and digitized to 12 bits of resolution was used to reconstruct the potential field. The data came from a 45-year-old male whose MRI showed a right anterior temporal lobe lesion and who suffered from complex partial seizures. The original CIT analysis was for the purpose of noninvasive localization of epileptic foci. An MRI data set of 53 slices was acquired with an intra-slice spatial resolution of 1 mm and an inter-slice resolution of 4 mm. The best-fitting sphere to the MRI-derived scalp is shown in Figure 1. In Figure 2, we show two configurations of the hemispherical dipole distributions. The left panel shows the 160-dipole configuration traditionally used in most CIT-related publications. On the right panel, the 280
dipoles are evenly distributed on the hemisphere. These figures also indicate the dipole size and direction. The color as well as the length indicates the relative dipole weights, with "inward" red dipoles having the most negative weights and "outward" blue dipoles having the most positive weights. The equally distributed source dipole configuration was used to reconstruct the potential map. In Figure 3, we show the time evolution over four consecutive time points, in increments of 10 milliseconds. The color map represents a voltage range of -97.88 microvolts to 199.95 microvolts. Positive voltages are shown in shades of blue, and negative voltages are shown in shades of red. In Figure 4, the reconstructed potential maps on the scalp and on the cortical surface are shown. The voltage range on the cortical surface is from -63.34 to 199.95 microvolts, while that on the scalp is from -20.08 to 84.29 microvolts.

We conducted a second study using visual evoked potential data. In this study, we recorded a subject's response to a visual pattern-flash stimulation. The visual stimulus is a wedge pattern flashed in one of the four quadrants of a screen. Figure 5 shows all four wedge stimuli, which were oriented radially from the center of the screen. Individually, these provided stimuli to the subject's upper right (UR), upper left (UL), lower right (LR), and lower left (LL) visual fields. The reconstructed visual evoked potential at the cortical surface shows the brain's response to specific stimulus patterns. We can validate our visualization tool to a certain extent based on what is known about the visual system pathway. There are three peaks of voltage values, with alternating polarities, after the stimulus. In Figure 6, we show the response to an upper left stimulus at the first two peaks. In Figure 7, we show the response at the third (positive) peak to the four stimuli.
4 Discussion
The visualization tool facilitates comparisons of CIT analysis parameters such as the number and configuration of source dipoles. Using the tool, we found that for the 16-channel data we used, the 280-dipole configuration was visually equivalent to the higher resolution of 1011 dipoles. Figure 2 vividly illustrates the undersampling of source dipoles in the classical 160-dipole configuration compared with the equal-distribution configuration. The dipole weights are also displayed in Figure 2. Although the dipole layer is a construct mainly for enhancing the potential voltage map, visualizations of these distributions, and especially of their time evolution, may offer support for estimating the general location of foci. The time points chosen for Figure 3 are near the peak of an epileptiform discharge. The high level of activity in the right temporal region can be noted. Figure 4 shows the CIT-reconstructed potential on the MRI-derived scalp and cortex using the 280-dipole source distribution. Although we do see the activity in the right anterior temporal lobe region of the scalp, we see a smaller focus of this activity on the cortex.
Figures 6 and 7 show that the responses, as elucidated by the reconstructed potential on the scalp surface, correspond to the expected responses to the visual stimuli. To briefly summarize, we developed a visualization tool that combines the temporal resolution of EEG data with the spatial resolution of MRI data. The tool was validated using case studies from clinical and cognitive applications.
Acknowledgments This work was supported in part by a Louisiana Board of Regents Graduate Fellowship to M.C.E. and by the U.S. Department of Energy under grant no. DE-FG02–97ER1220. The authors thank Marty Ford, Mark Pfeiger, and Steve Sands, all of NeuroScan Labs, for their contribution of the VER data and technical communications. They further thank Todd Preuss of the USL New Iberia Research Center for his helpful comments.
References
1. R. D. Sidman, "A method for simulating intracerebral potential fields: The cortical imaging technique," Journal of Clinical Neurophysiology, vol. 8, no. 4, pp. 432-441, 1991.
2. W. Schroeder, K. Martin, and B. Lorensen, The Visualization Toolkit, Prentice Hall, Englewood Cliffs, N.J., 1996.
3. A. Kaufman, D. Cohen, and R. Yagel, "Volume graphics," IEEE Computer, vol. 26, no. 7, pp. 51-64, 1993.
4. W. E. Lorensen and H. E. Cline, "Marching cubes: A high resolution 3D surface construction algorithm," in Computer Graphics, vol. 21, no. 4, pp. 163-170, 1987.
Figure 1. The best fitting sphere to the MRI-derived scalp.
Figure 2. The “Classical” (left) and equally distributed (right) source dipole configurations used in CIT.
Figure 3. Time evolution of potential voltages on an MRI-derived cortical surface. The four time steps are shown clockwise from the top left corner.
Figure 4. The CIT-reconstructed potential on the MRI-derived scalp (left) and cortex (right) using the 280-dipole source distribution.
Figure 5. The visual stimuli. In a visual evoked response study, one of the four quadrants is flashed and the subject’s response voltages on the scalp level are recorded.
Figure 6. Response to an upper left stimulus at 104 ms (left) and at 166 ms (right) post-stimulus.
Figure 7. Responses at 212 ms post-stimulus. The stimulus was the upper left, the upper right, the lower right, and the lower left (clockwise from top left) quadrant of the screen.
Applying Visualization Research Towards Design

Paul Janecek

Laboratoire de Recherche en Informatique, Bâtiment 490, Université de Paris-Sud, 91405 Orsay Cedex, France
[email protected]

Abstract. The range of information visualization research efforts and taxonomies presents a confusing array of techniques and dimensions for analysis. In this paper, we build upon an existing model to create a framework for comparing and analyzing previous research. This analysis has several benefits: first, it suggests refinements to the operator model; second, it shows where previous research can be applied to the design of new visualization systems and toolkits; and third, it allows us to compare different taxonomies to find where they overlap or lack dimensionality.
1. Introduction
Over the past decade a bewildering number of information visualization techniques have been developed and applied across a wide range of domains. Researchers have also developed a wide range of visualization taxonomies [2,3,5,6,11,12,14,20,21,22] and classifications of the design space [4,8,9,13,17,18] to map out the similarities and differences between these techniques. The purpose of these research efforts is to help designers understand the range of alternative implementations, and their strengths and weaknesses for a particular task. They can also aid researchers in determining the fundamental differences between techniques so they can evaluate their effectiveness and suggest future designs. However, there are at least two problems that face a designer in applying these research efforts: first, determining how to place this research into a design context; and second, understanding how these numerous research efforts are related, overlap, or conflict. This paper uses a recently developed model of the visualization process to answer these two questions.

This paper is organized as follows. Section 2 presents the operator state model, discusses its usefulness in the design of visualization systems, and describes several refinements of the model to aid in a higher-level analysis of tasks and interaction. Section 3 uses the operator model as a framework to place previous research into the design context, and explains the potential benefits of this analysis. In Section 4 we discuss our conclusions.
2. The Operator State Model
Chi and Riedl [5] recently suggested an operator state model for analyzing information visualization systems. The elements in this model are explained in Fig. 1.
Data Stage: the data in its raw form (e.g., database, document collection)
Data Stage Operator (DSO): operators that leave data in the same form (e.g., filters, sort algorithms)
Data Transform Operator (DTO): operators that transform data into another form (e.g., mapping into a data structure)
Analytical Abstraction (AA) Stage: data in a form that can be analyzed and processed (e.g., data structure in application)
Analytical Abstraction Stage Operator (AASO): operators that process the data within this stage (e.g., dimension reduction, aggregation)
Visualization Transform Operator (VTO): operators that transform data into a graphical model (e.g., mapping data values to coordinate sets)
Visualization Abstraction (VA) Stage: the graphical model (e.g., a scene graph)
Visualization Abstraction Stage Operator (VASO): operators that process the graphical model (e.g., layout algorithms, mapping objects to graphical attributes)
Visualization Mapping Transform Operator (VMTO): operators that transform a graphical model into a view (e.g., lighting model, camera focal attributes, rendering)
View Stage: the rendered image used in the interface
View Stage Operator (VSO): operators that manipulate the view within this stage (e.g., translation, rotation of image)
Fig. 1. The Operator State Model [5]. Nodes are data states, and edges are transform operators. The author modified the VASO and VMTO operators as described in the text
The model is a network of data states (nodes) and transformation operators (edges) that explicitly model the flow of information from a data source to a view, similar in many ways to a traditional visualization pipeline. Chi and Riedl discuss a number of ways in which this model is powerful for designers. For example, the model explicitly shows the role of operators on data values and their related view(s), making apparent the semantics of different operations. The model also helps a designer understand the breadth of applicability of operators within and between domains, and to explore different implementation choices. One important advantage of a network model is that it is possible to have multiple paths through the states and operators, which could represent multiple views of the same data set. In the original model, all graphical mappings occurred in the VMTO. We slightly modified the model to clearly distinguish between transforms that affect the graphical model (VASO) and those that render a view (VMTO). This supports a finer-grained analysis of the differences between visualization techniques. For example, Fig. 2 shows three hypothetical visualizations of a collection of web pages. In this model, the first difference in the visualization process can be clearly traced to the mapping between data and graphical models (VTO). The Cone Tree [16] and Tree-Map [10], which are hierarchical visualization techniques, would use a breadth-first traversal of the data network to create a graphical tree model. SeeNet [1], on the other hand, transforms the data network of pages to a network of graphical objects. The second difference is in their layout (VASO). The Cone Tree constructs a 3D model of the tree, the Tree-Map uses a space-filling layout, and SeeNet positions the nodes according to their associated geographical locations. The final views are
then rendered from these graphical models by the VMTO. This simple example clearly shows some of the similarities and differences of these three visualizations.
Fig. 2. An operator model of three views: Cone Tree, TreeMap, and SeeNet
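The flow of Fig. 2 can also be written down as composable operators. The following is a hypothetical C++ sketch (the state types and operator aliases are invented for illustration and are not from [5]): each state is a data type, each operator is a function between states, and a path through the network is simply a composition of operators.

#include <array>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins for the four data states of the operator model.
struct Data { std::vector<std::string> pages; };                   // raw collection of web pages
struct AA   { std::vector<std::pair<int, int>> links; };           // analytical abstraction: link graph
struct VA   { std::vector<std::array<float, 3>> nodePositions; };  // visualization abstraction: graphical model
struct View { std::vector<unsigned char> pixels; };                // rendered image

// Operators are plain functions between states, mirroring the edges of Fig. 1.
using DTO  = std::function<AA(const Data&)>;    // e.g., map pages into a link graph
using VTO  = std::function<VA(const AA&)>;      // e.g., breadth-first traversal into a tree of glyphs
using VASO = std::function<VA(const VA&)>;      // e.g., cone-tree, tree-map, or geographic layout
using VMTO = std::function<View(const VA&)>;    // e.g., rendering

// One path through the network of states and operators.
View visualize(const Data& d, const DTO& dto, const VTO& vto,
               const VASO& layout, const VMTO& render)
{
    return render(layout(vto(dto(d))));
}

Calling visualize twice with the same DTO, VTO, and VMTO but different VASO layout operators corresponds to the Cone Tree and Tree-Map paths of Fig. 2, which share everything upstream of the layout.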
There are two main weaknesses in this model. The first is the lack of a higher-level framework for analyzing tasks. The rest of this section presents an extension of the operator model to support this type of analysis. The second weakness is the model's lack of detail within states and across operator types. In Section 3 we discuss how previous research can be used to refine the model.

Chi and Riedl [5] described several properties of operators that can be used in analyzing their semantics (i.e., functional/operational, view/value, and breadth). They also discussed the relationship between the position of an operator in the model and its effect on the view or the value. Information flows from the data stage to the view stage during the creation of a visualization, but interaction with the model is in the opposite direction. We suggest that the operator model can be used as a framework for the analysis of tasks and higher-level interaction by explicitly mapping these higher-level operations onto their associated operators and data objects in the visualization system. We refer to this relationship to position as the depth of an operator or data object, and define it as its distance from the view into the model. For example, Fig. 3 shows how depth can be related to different semantics of a delete operation.
DSO: Delete data object in database
AASO: Delete data object in AA
VASO: Delete graphical object in VA
VSO: Delete portion of image in view
Fig. 3. Depth of interaction with operators
This is a slight extension to the model that can help a designer to explore the mapping of a task to different stages, and how this changes the underlying semantics. This also allows us to map the task and interaction classifications of previous research efforts onto the operator model. In the next section, we use this framework to suggest how previous research can be applied to the design.
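As a concrete reading of the delete example in Fig. 3, the sketch below (again hypothetical; the stage classes and method names are invented for illustration) dispatches a single user-level delete gesture to an operator at a chosen depth. Only the deepest binding touches the original data, which is precisely the change in semantics that the depth dimension captures.

#include <cstdio>

// Hypothetical stages with just enough behaviour to show the four delete semantics.
struct DataSource { void deleteRow(int id)    { std::printf("DSO: delete row %d in the database\n", id); } };
struct AAStage    { void removeRecord(int id) { std::printf("AASO: drop record %d from the abstraction\n", id); } };
struct VAStage    { void removeGlyph(int id)  { std::printf("VASO: remove glyph %d from the graphical model\n", id); } };
struct ViewStage  { void erasePixels(int id)  { std::printf("VSO: blank the pixels of item %d only\n", id); } };

enum class Depth { View, VisualAbstraction, AnalyticalAbstraction, Data };

struct Pipeline { DataSource source; AAStage aa; VAStage va; ViewStage view; };

// The same user gesture ("delete the selected item") bound to operators of
// increasing depth; only the deepest variant changes the original data.
void deleteSelection(Pipeline& p, int id, Depth depth)
{
    switch (depth) {
        case Depth::View:                  p.view.erasePixels(id);  break;
        case Depth::VisualAbstraction:     p.va.removeGlyph(id);    break;
        case Depth::AnalyticalAbstraction: p.aa.removeRecord(id);   break;
        case Depth::Data:                  p.source.deleteRow(id);  break;
    }
}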
3. A Taxonomy of Information Visualization Research
As mentioned earlier, a weakness of the operator model is its lack of detail within states and across operator types. For example, [5] used the operator model to analyze the semantics of a number of visualization techniques. However, no comparison was made across different visualization techniques to explore similarities or the reusability of operators, or to develop a taxonomy of operators. Additionally, although the operator model is derived from [3], it lacks the detailed analysis of the data and representation states supported by that model. In this section, we place previous taxonomies of information visualization techniques into the context of the operator model. This analysis has several benefits: first, it suggests refinements to the states and operators of the model; second, it suggests where previous research can be applied in the design of visualization systems; and third, it allows comparison of different research efforts to find where they overlap or lack dimensionality.

As an example, we will place the taxonomies of Shneiderman [20] and Card & Mackinlay [3] into the context of the operator model, and demonstrate how this could be useful for a designer. The taxonomies use different dimensions to characterize visualizations: the first [20] uses the dimensions of data type and task; the second [3] uses data type, visual vocabulary, and interaction. Although the first dimensions are similar, [20] suggests a high-level grouping of data sets (1D, 2D, 3D, multi-dimensional, tree, network, and temporal) and [3] suggests a low-level analysis by the dimensionality of a data value (nominal, ordinal, quantitative, spatial, geographic, network). These classifications can be used separately to group and compare the Data and AA stages across visualizations. The second dimension of [3], visual vocabulary, is composed of marks (such as Points and Lines), their retinal properties (such as Color and Size), and their position in space and time. These groupings could be used to analyze the operators that create and manipulate the graphical model in the VA stage, as well as the rendered image in the View stage. This low-level detail also supports an analysis of the cognitive "fit" between data and their representation [22, 4, 13, 17].
Fig. 4. Tasks and Interaction mapped onto the Operator Framework. Legend: Details on Demand, Extract, Filter, History, Multidimensional Scaling, Pan, Relate, Sort, Zoom
Both taxonomies discuss interaction, but again in different terms. [20] describes a set of tasks (overview, zoom, filter, details on demand, relate, history, and extract). We can map these general tasks into the model to explore different interaction semantics, as shown in Fig. 4. For example, a VSO zoom suggests a magnification of the view, whereas a VASO zoom suggests a change in the graphical model (such as the animation associated with selecting a node in a Cone Tree [16]). An AASO "zoom" might add information from the data model, and a DSO "zoom" could open the original data source. The analysis of [3] characterizes interaction as a mapping from a
view or widget to an operator and its associated data variable. They discuss navigation operations, such as pan and zoom, and three types of data functions: filter, sort, and multidimensional scaling (MDS). Interactions with the data functions are eventually mapped to changes in graphical objects (VASO), as shown in Fig. 4. The taxonomy of [20] is high-level, and does not support the detailed analysis of a visualization that [3] does. However, its dimensions are general enough to easily group similar visualizations, and lead to interesting explorations of the design space as demonstrated with the set of tasks. This example shows how placing previous research into the context of the operator model can offer insights into operator semantics and alternative designs. The rest of this section expands on this analysis to include other taxonomies and research from the area of automated presentation techniques. Fig. 5 presents three research areas that have been placed into the operator framework: visualization taxonomies, automated presentation techniques, and distortion taxonomies. The rows for Task and Interaction at the bottom of the table are dimensions that should be mapped separately into the framework as in Fig. 4.
[Figure 5 charts, for each of the surveyed research efforts (Shneiderman [20], Bruley [2], Zhang [22], Keim [11], Tweedie [21], Card [3], Chi [5], Mackinlay [13], Roth [17], Casner [4], Golovchinsky [9], Roth [18], Goldstein [8], Leung [12], Noik [14]), which of the model's states and operators (Data, DSO, DTO, AA, AASO, VTO, VA, VASO, VMTO, View, VSO) and which of the Task and Interaction dimensions it characterizes; the columns are grouped into taxonomies, automated presentation systems, and distortion techniques.]
Fig. 5. A Taxonomy of Information Visualization Research. The rows are the states or transforms in the operator model (see Fig. 1), and the columns are previous research (by first author). Darkened squares indicate that the research characterizes the state or operator of the model. (For example, Shneiderman, the first column, characterizes visualizations by the dimensions of data type and task)
The first group in Fig. 5 presents a number of visualization taxonomies in order of increasing dimensional coverage. [20] and [3] were discussed previously. The table highlights areas that have received no research attention (such as lighting and rendering) and areas that have received little focus, such as the VTO. [11] discusses a range of methods for reducing the dimensionality of data sets for visualization, and [22] discusses the cognitive "fit" between data and their graphical representations.
The second group in Fig. 5 is Automated Presentation Systems, ordered chronologically. The general goal of these systems is to automatically design an optimal representation based on features of the given data set. To accomplish this, these systems must formally characterize the data space, the representation space, a mapping between the two, and a metric for evaluating the resulting designs. These in-depth analyses of the design space are important resources for designers.

The third group in Fig. 5 is taxonomies of distortion techniques. [12] characterizes distortions by their view-based magnification functions. [14] creates a taxonomy of both view- and data-based techniques. The operator model is particularly effective at clarifying the underlying differences between techniques that have similar results, such as graphical fisheye distortions [19] and data-based fisheye distortions [7].
4. Conclusion
In this paper, we refined the operator state model [5] of visualization systems to support higher-level analyses of interaction. We then placed a number of previous taxonomies of information visualization into the context of the model. The states and operators of the framework suggest where research can be applied to design, and also allow us to compare the coverage of different research efforts. As an example, we discussed how the dimensions of two taxonomies, [20] and [3], can be mapped into the framework, and the specific insights into design that this analysis offers. Future improvements to this taxonomy should begin with an analysis of the range of possibilities in each dimension to develop a clearer distinction both within and between different operators and states.
References
1. Becker, R.A., Eick, S.G., and Wilks, A.R. Visualizing Network Data. IEEE Transactions on Visualization and Computer Graphics, pp. 16-28, March 1995.
2. Bruley, C., and Genoud, P. Contribution à une Taxonomie des Représentations Graphiques de l'Information. In Proc. IHM '98, pp. 19-26, 1998.
3. Card, S.K., and Mackinlay, J.D. The Structure of the Information Visualization Design Space. In Proc. Information Visualization Symposium '97, pp. 92-99, 1997.
4. Casner, S.M. A Task-Analytic Approach to the Automated Design of Graphic Presentations. ACM Transactions on Graphics, pp. 111-151, April 1991.
5. Chi, E.H., and Riedl, J.T. An Operator Interaction Framework for Visualization Systems. In Proc. Information Visualization Symposium '98, pp. 1-8, 1998.
6. Chua, M.C., and Roth, S.F. On the Semantics of Interactive Visualizations. In Proc. IEEE Information Visualization '96, pp. 29-36, 1996.
7. Furnas, G.W. Generalized Fisheye Views. In Proc. CHI '86, pp. 16-23, 1986.
8. Goldstein, J., Roth, S.F., Kolojejchick, J., et al. A Framework for Knowledge-Based, Interactive Data Exploration. Journal of Visual Languages and Computing, pp. 339-363, December 1994.
9. Golovchinsky, G., Kamps, T., and Reichenberger, K. Subverting Structure: Data-driven Diagram Generation. In Proc. IEEE Visualization '95, pp. 217-223, 1995.
10. Johnson, B., and Shneiderman, B. Tree-Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures. In Proc. IEEE Visualization '91, pp. 284-291, 1991.
11. Keim, D.A. Visual Techniques for Exploring Databases. Invited Tutorial, Int. Conf. on Knowledge Discovery in Databases, KDD '97, Newport Beach, 1997.
12. Leung, Y.K., and Apperley, M.D. A Review and Taxonomy of Distortion-Oriented Presentation Techniques. ACM Transactions on Computer-Human Interaction, pp. 126-160, June 1994.
13. Mackinlay, J.D. Automating the Design of Graphical Presentations of Relational Information. ACM Transactions on Graphics, pp. 110-141, April 1986.
14. Noik, E.G. A Space of Presentation Emphasis Techniques for Visualizing Graphs. In Proc. Graphics Interface '94, pp. 225-233, 1994.
15. Noik, E.G. Layout-independent Fisheye Views of Nested Graphs. In Proc. Visual Languages '93, pp. 336-341, 1993.
16. Robertson, G.G., Mackinlay, J.D., and Card, S.K. Cone Trees: Animated 3D Visualizations of Hierarchical Information. In Proc. CHI '91, pp. 189-194, 1991.
17. Roth, S.F., and Mattis, J. Data Characterization for Intelligent Graphics Presentation. In Proc. CHI '90, pp. 193-200, 1990.
18. Roth, S.F., and Mattis, J. Automating the Presentation of Information. In Proc. IEEE Conf. on AI Application, pp. 90-97, 1991.
19. Sarkar, M., et al. Stretching the Rubber Sheet: A Metaphor for Viewing Large Layouts on Small Screens. In Proc. UIST '93, pp. 81-91, 1993.
20. Shneiderman, B. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proc. IEEE Symposium on Visual Languages '96, pp. 336-343, 1996.
21. Tweedie, L. Characterizing Interactive Externalizations. In Proc. CHI '97, pp. 375-382, 1997.
22. Zhang, J. A representational analysis of relational information displays. Int. J. Human-Computer Studies, volume 45, pp. 59-74, 1996.
Author Index
Abbasi, S., 566 Aksoy, S., 341 Alferez, P., 435 Amghar, Y., 37 Androutsos, D., 76 Androutsos, P., 745 Ardizzone, E., 283 Aufure-Portier, M.-A., 325 Ayer, S., 451 Baldi, G., 171 Banfi, F., 625 Belongie, S., 509 Bhandarkar, S.M., 269 Bhonsle, S., 719 Biancardi, A., 703 Biemond, J., 229 Bignall, R.J., 475 Binefa, X., 237 Bolle, R.M., 15 Bonhomme, C., 325 Boujemaa, N., 115 Bouthemy, P., 221, 245, 261 Bres, S., 427 Buijs, J.M., 131 Bull, D.R., 333 Bunjamin, F., 187 Burns, G., 801 Canagarajah, C.N., 333 Carlbom, I., 689 Carson, C., 509 Chakrabarti, K., 68 Chan, D.Y.-M., 557 Chan, S.C.Y., 777 Chang, S.-K., 19 Chbeir, R., 37 Chen, F., 665, 681 Chetverikov, D., 459 Chi, Z., 673 Cho, J., 203 Choi, J.H., 657 Chu, C.H., 809 Ciano, J.N., 753 Ciocca, G., 107
Colombo, C., 171 Costagliola, G., 19 Dashti, A.E., 801 Deemter, K. van, 632 Del Bimbo, A., 171 Demetrescu, C., 785 Deng, D., 673 Di Sciascio, E., 123 Dimai, A., 525 Ding, X., 277, 443 Do, M., 451 Dubois, T., 261 Eberman, B., 195 Eck, J.W. van, 641 Egas, R., 533 Erie, M.C., 809 Fablet, R., 221 Ferro, A., 51 Fidler, B., 195 Finke, M., 761 Finocchi, I., 785 Fischer, S., 253 Flory, A., 37 Fraile, R., 697 Frederix, G., 769 Gagliardi, I., 358 Gallo, G., 51 Garcia, C., 245 Gelgon, M., 261 Gevers, T., 593 Ghandeharizadeh, S., 801 Giugno, R., 51 Goldbaum, M., 727 Gool, L. Van, 493 Gupta, A., 719 Hampapur, A., 15 Hancock, E.R., 711 Hanjalic, A., 229 Haralick, R.M., 341 Heijden, G. van der, 641 Helfman, J.I., 163
Hellerstein, J.M., 509 Hemmje, M., 1 Hibino, S.L., 139 Hiroike, A., 155 Hoover, A., 727 Hu, C., 443 Huele, R., 753 Huet, B., 711 Huijsmans, D.P., 533 Hunter, E., 727 Iannizzotto, G., 609 Iannucci, R., 195 Iisaku, S.-i., 375 Iizuka, Y., 91 Ikonomakis, N., 99 Ingold, R., 625 Isobe, S., 91 Iwerks, G.S., 317 Jain, R., 719 Janecek, P., 817 Jean, Y., 689 Jeong, S.H., 657 Jia, L., 501 Jiang, N., 801 Joerg, C., 195 Jolion, J.-M., 427 Jungert, E., 19 Kammerer, P., 649 Kapoor, C., 665 Karmaker, G.C., 475 Katsumoto, M., 375 Khombhadia, A.A., 269 Kim, H., 391 King, I., 557 Kitchen, L., 501 Kong, W., 277, 443 Konstantinou, V., 211 Kontothanassis, L., 195 Koskela, M., 541 Kouznetsova, V., 727 Kovalcin, D.E., 195 Kropatsch, W., 649 Laaksonen, J., 541 Lagendijk, R.L., 229 Lakaemper, R., 617 Latecki, L.J., 617
Leau, E. de, 585 Leissler, M., 1 Leung, C.H.C., 399, 409 Lew, M.S., 131, 533 Lewis, P.H., 777 Li, Y., 307 Liao, M., 307 Liebsch, W., 187 Lim, J.-H., 367 Lindley, C.A., 83, 299 Liu, L., 601 Lodato, C., 283 Lopes, S., 283 Lu, H., 277, 307 Lu, H.B., 291 Ma, S., 277, 307, 349, 443, 735 Makai, B., 187 Malik, J., 509 Malki, J., 115 Maruyama, T., 91 Maxwell, B.A., 517 Maybank, S.J., 697 Mayer, R.E., 793 McKenzie, E., 43 Meddes, J., 43 Mehrotra, S., 68 Mingolla, G., 123 Moccia, V., 703 Mokhtarian, F., 566 Mongiello, M., 123 Moreno, P., 195 Moreno, R., 793 Mori, Y., 155 Mueller, H., 383, 549 Mueller, K., 187 Mueller, W., 549 Mukherjea, S., 203 Musha, Y., 155 Nastar, C., 115 Nes, N., 467 Neuhold, E.J., 1 Nikolov, S.G., 333 Ohm, J.-R., 187 Oja, E., 541 Ornellas, M.C. d’, 467 Ortega, M., 68 Palhang, M., 418
Pan, C., 349 Paquet, E., 179 Pauwels, E.J., 769 Pingali, G.S., 689 Plataniotis, K.N., 76, 99 Polder, G., 641 Porkaew, K., 68 Psarrou, A., 211 Radeva, P., 237 Rahman, S.M., 475 Rehatschek, H., 383 Reiter, M., 649 Rimac, I., 253 Rioux, M., 179 Ronfard, R., 245 Ruda, H.E., 745 Sánchez, J.M., 237 Saberdest, B., 187 Sahni, S., 665, 681 Samet, H., 60, 317 Santini, S., 719, 727 Saraceno, C., 649 Schettini, R., 107 Schomaker, L., 585 Schouten, B.A.M., 483 Sclaroff, S., 601 Sebe, N., 533 Shahabi, C., 801 Shiohara, H., 91 Sidman, R.D., 809 Smeulders, A.W.M., 147, 593 Soffer, A., 60 Sowmya, A., 418 Squire, D., 549 Srinivasan, U., 299 Stanchev, P.L., 29 Steinmetz, R., 253 Stiefelhagen, R., 761 Sugimoto, A., 155 Sutanto, D., 399
Swain, M.J., 195 Swanson, L.W., 801 Tam, A.M., 409 Thomas, M., 509 Tuytelaars, T., 493 Tziritas, G., 245 Van Thong, J.-M., 195 Veltkamp, R., 575 Vemuri, B.C., 665, 681 Venau, E., 245 Vendrig, J., 147 Venetsanopoulos, A.N., 76, 99, 745 Vercoustre, A.-M., 83 Vetterli, M., 451 Vita, L., 609 Vitrià, J., 237 Vleugels, J., 575 Vuurpijl, L., 585 Waibel, A., 761 Wang, Y.-F., 435 Wang, Z., 673 Warke, Y.S., 269 Winter, A., 115 Worring, M., 147, 719, 727 Wu, J., 735 Xu, C., 735 Yang, H., 391 Yang, H.J., 657 Yang, J., 391, 761 Yang, J.D., 657 Yu, Y., 673 Zeeuw, P.M. de, 483 Zhang, Y.J., 291 Zier, D., 187 Zolda, E., 649 Zonta, B., 358 Zugaj, D., 245