Fig. 2. UML class diagram for an ideal implementation of the VizIR class framework. The key element is the class "Query", which contains the methods for query generation and execution. Each query consists of a number of "QueryLayer" elements that implement exactly one feature each. All feature classes – MPEG-7 descriptors as well as all others – are derived from the interface "Feature" and contain methods for descriptor extraction ("extractFeature()"), serialization ("FeatureToRaw()", "RawToFeature()", etc.) and distance measurement ("calculateDistance()"). Feature classes take their media content from instances of the class "MediaContent". The result of each query is a set of media objects (represented as MediaContent objects), which is stored in a "ResultSet" object. Finally, the methods of the class "DatabaseManager" encapsulate the database access.
The latter two evaluation cycles have to be performed in usability labs. A combination of different observation methods and devices – such as eye-trackers and video observation – is necessary to collect objective data (e.g. eye movements) as well as subjective data (e.g. verbal expressions). By analyzing and comparing these data, cost and benefit assessments of existing systems, with special focus on the system to be developed, become possible. The VizIR prototype will be based on a standard relational database. Fig. 1 gives an overview of its tables and relations for media and feature storage. Fig. 2 outlines the likely class structure of the VizIR prototype. To a certain extent this class framework follows the architecture of IBM's QBIC system [8], but it largely differs from QBIC in its server/client-independent classes.
Similarly to QBIC, the database access is hidden from the feature programmer and the layout of all feature classes is predefined by the interface "Feature". Concluding this sketch of the VizIR prototype's system architecture, we outline several aspects of the application and data distribution. Modern CORBA-based programming environments like the Java environment permit the network-independent distribution of applications, objects and methods (in Java through the Remote Method Invocation library) to increase the performance of an application by load balancing and multi-threading. If VizIR is implemented in Java, the objects for querying could be implemented as JavaBeans, feature extraction functions with RMI, database management through servlets and user interfaces as applets. Database distribution could be realized through standard replication mechanisms and database access through JDBC.
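To make the class layout described above and in the caption of Fig. 2 more concrete, the following is a minimal Java sketch of what the "Feature" interface and its surrounding query classes could look like. The method names are taken from the figure caption; all signatures, the weighting scheme and the simplified Query and QueryLayer bodies are assumptions made for illustration, not the actual VizIR API.

```java
// Hypothetical Java sketch of the class layout in Fig. 2 (signatures assumed).
import java.util.ArrayList;
import java.util.List;

interface Feature {
    void extractFeature(MediaContent content);   // descriptor extraction
    byte[] featureToRaw();                       // serialization for database storage
    void rawToFeature(byte[] raw);               // deserialization
    double calculateDistance(Feature other);     // distance measurement
}

// Placeholder for the media access class of the framework.
class MediaContent { /* pixel and frame access omitted */ }

// A query layer implements exactly one feature and weights its contribution.
class QueryLayer {
    final Feature queryFeature;   // descriptor extracted from the query example(s)
    final double weight;

    QueryLayer(Feature queryFeature, double weight) {
        this.queryFeature = queryFeature;
        this.weight = weight;
    }
}

// "Query" combines the layers; DatabaseManager and ResultSet are omitted here.
class Query {
    private final List<QueryLayer> layers = new ArrayList<>();

    void addLayer(QueryLayer layer) { layers.add(layer); }

    // Weighted sum of per-feature distances against a candidate's
    // pre-extracted descriptors (one per layer, in the same order).
    double distance(List<Feature> candidateFeatures) {
        double d = 0.0;
        for (int i = 0; i < layers.size(); i++) {
            QueryLayer layer = layers.get(i);
            d += layer.weight * layer.queryFeature.calculateDistance(candidateFeatures.get(i));
        }
        return d;
    }
}
```

A concrete feature class (e.g. an MPEG-7 color descriptor) would then implement Feature and could be added to a query as one QueryLayer per feature.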
5 Implementation
The major question concerning the implementation of the VizIR prototype is the choice of the programming environment. At this point in time, when MPEG-21 is still far out of sight, there are three major alternatives that support image and video processing to choose from:
– Java and the Java Media Framework (JMF; [10])
– The emerging Open Media Library standard (OpenML) of the Khronos Group [17]
– Microsoft DirectX (namely DirectShow) or its successor in the .NET environment [6]
All of these environments offer comprehensive video processing capabilities and are based on modern, object-oriented programming paradigms. DirectX is platform-dependent and a commercial product. For .NET, Microsoft has recently initiated the development of a Linux version, but it is expected that this version will not be available before summer 2002 and will still have to be purchased. Additionally, it is unlikely that versions for other operating systems (SunOS, OpenBSD, IRIX, etc.) will be developed. Therefore, in the following discussion we will concentrate on the first two alternatives: JMF and OpenML. JMF is a platform-dependent add-on to the Java SDK, which is currently available in a full version for SunOS and Windows (implementations by SUN and IBM) as well as Linux (implementation by Blackdown), and in a Java version with fewer features for all other operating systems that have Java Virtual Machine implementations. JMF is free and extensible. OpenML is an initiative of the Khronos Group (a consortium of companies with expert knowledge in video processing, including Intel, SGI and SUN) that standardizes a C interface for multimedia production. OpenML includes OpenGL for 3D and 2D vector graphics, extensions to OpenGL for synchronization, the MLdc library for video and audio rendering, and the 'OpenML core' for media processing (confusingly, the media processing part of OpenML is named OpenML as well; therefore we will use the term 'OpenML-mp' for the media processing capabilities below). The first reference implementation of OpenML for Windows was announced for winter 2001.
Among the concepts that are implemented similarly in JMF and OpenML-mp are the following:
– Synchronization: a media object's time base (JMF: TimeBase object, OpenML-mp: Media Stream Counter) is derived from a single global time base (JMF: SystemTimeBase object, OpenML-mp: Unadjusted System Time).
– Streaming: both environments do not manipulate media data as a continuous stream but instead as discrete segments in buffer elements.
– Processing control: JMF uses Control objects and OpenML-mp uses messages for this purpose.
Other important media processing concepts are implemented differently in JMF and OpenML-mp:
– Processing chains: in JMF, real processing chains with parallel processing can be defined (one instance for one media track is called a CodecChain). In OpenML-mp processing operations, data always flows from the application to a single processor (called a Transcoder) through a pipe and back.
– Data flow: JMF distinguishes between data sources (including capture devices, RTP servers and files) and data sinks. OpenML-mp handles all I/O devices in the same way (as so-called Jacks).
The major advantages of OpenML-mp are:
– Integration of OpenGL, the platform-independent open standard for 3D graphics.
– A low-level C API that will probably be supported by the decisive video hardware manufacturers and should have a superior processing performance.
– The rendering engine of OpenML (MLdc) seems to have a more elaborate design than the JMF Renderer components. In particular, it can be expected that the genlock mechanism of MLdc will prevent lost-sync phenomena, which usually occur in JMF when rendering media content with audio and video tracks that are longer than ten minutes.
– OpenML-mp defines more parameters for video formats and is more closely related to professional video formats (DVCPRO, D1, etc.) and television formats (NTSC, PAL, HDTV, etc.).
On the other hand, the major disadvantages of OpenML are:
– It is not embedded in a CASE environment like Java is for JMF. Therefore application development requires more resources and longer development cycles.
– OpenML is not object-oriented and includes no mechanism for parallel media processing.
The major drawbacks of JMF are:
– Lower processing performance because of the high-level architecture of the Java Virtual Machine. This can be mitigated by the integration of native C code through the Java Native Interface.
– Limited video hardware and video format support: JMF has problems with accessing certain video codecs and capture devices and with transcoding of some video formats.
The outstanding features of JMF are:
– Full Java integration. The Java SDK includes comprehensive methods for distributed and parallel programming, database access and I/O processing. Additionally, professional CASE tools exist for software engineering with Java.
– JMF is free software and reference implementations exist for a number of operating systems. JMF version 2.0 is a co-production of SUN and IBM; in version 1.0 Intel was involved as well.
– JMF is extensible. Additional codecs, multiplexers and other components can be added by the application programmer.
The major demands of the VizIR project are the need for a free and bug-free media processing environment that supports distributed software engineering and has a distinct and robust structure. Matters like processing performance and extended hardware support are secondary for this project. Therefore, the authors think that JMF is currently the right choice for the implementation. Design and implementation will follow a UML-based incremental design process with prototyping, because UML is state of the art in software engineering and because of the valuable positive effect of rapid prototyping on the employees' motivation. Standard statistical packages and Perl scripts will be used for performance evaluation; Self-organizing Maps [11], Adaptive Resonance Theory (ART) neural networks and genetic algorithms will be used for tasks like pattern matching and (heuristic) optimization (as in [4]).
6 Conclusion
The major outcomes of the open VizIR project can be summarized as follows:
– An open class framework of methods for feature extraction, distance calculation, user interface components and querying.
– Evaluated user interface methods for content-based visual retrieval.
– A system prototype for the refinement of the basic methods and interface paradigms.
– Carefully selected evaluation sets for groups of features (color, texture, shape, motion, etc.) with human-rated co-similarity values.
– Evaluation results for the methods of the MPEG-7 standard, the authors' earlier content-based retrieval projects and all other promising methods.
The authors would like to invite interested research institutions to join the discussion and participate in the design and implementation of the open VizIR project.
References
1. Barnsley, M.F., Hurd, L.P., Gustavus, M.A.: Fractal video compression. Proc. of IEEE Computer Society International Conference, Compcon Spring (1992)
2. Barros, J., French, J., Martin, W.: Using the triangle inequality to reduce the number of comparisons required for similarity based retrieval. SPIE Transactions (1996)
3. Breiteneder, C., Eidenberger, H.: Automatic Query Generation for Content-based Image Retrieval. Proc. of IEEE Multimedia Conference, New York (2000)
4. Breiteneder, C., Eidenberger, H.: Performance-optimized feature ordering for Content-based Image Retrieval. Proc. European Signal Processing Conference, Tampere (2000)
5. Chua, T., Ruan, L.: A Video Retrieval and Sequencing System. ACM Transactions on Information Systems, Vol. 13, No. 4 (1995) 373-407
6. DirectX: msdn.microsoft.com/library/default.asp?url=/library/en-us/wcegmm/htm/dshow.asp
7. Fels, S., Mase, K.: Interactive Video Cubism. Proc. of ACM International Conference on Information and Knowledge Management, Kansas City (1999) 78-82
8. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., Yanker, P.: Query by Image and Video Content: The QBIC System. IEEE Computer (1995)
9. Frei, H., Meienberg, S., Schäuble, P.: The Perils of Interpreting Recall and Precision. In: Fuhr, N. (ed.): Information Retrieval, Springer, Berlin (1991) 1-10
10. Java Media Framework Home Page: java.sun.com/products/java-media/jmf/index.html
11. Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J.: SOM-PAK: The Self-organizing Map Program Package. Helsinki (1995)
12. Lasfar, A., Mouline, S., Aboutajdine, D., Cherifi, H.: Content-Based Retrieval in Fractal Coded Image Databases. Proc. of Visual Information and Information Systems Conference, Amsterdam (1999)
13. Lin, F., Picard, R. W.: Periodicity, directionality, and randomness: Wold features for image modelling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (1996)
14. MPEG-7 standard: working papers. www.cselt.it/mpeg/working_documents.htm#mpeg-7
15. Nastar, C., Mitschke, M., Meilhac, C.: Efficient Query Refinement for Image Retrieval. Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1998)
16. Oomoto, E., Tanaka, K.: OVID: design and implementation of a video-object database system. IEEE Transactions on Knowledge and Data Engineering (1993)
17. OpenML: www.khronos.org/frameset.htm
18. Osgood, C. E. et al.: The Measurement of Meaning. University of Illinois, Urbana (1971)
19. Payne, J. S., Hepplewhite, L., Stonham, T. J.: Evaluating content-based image retrieval techniques using perceptually based metrics. SPIE Proc., Vol. 3647 (1999) 122-133
20. Pentland, A., Picard, R. W., Sclaroff, S.: Photobook: Content-Based Manipulation of Image Databases. SPIE Storage and Retrieval Image and Video Databases II (1994)
21. Rui, Y., Huang, T., Chang, S.: Image Retrieval: Past, Present and Future. Proc. of International Symposium on Multimedia Information Processing, Taiwan (1997)
22. Santini, S., Jain, R.: Beyond Query By Example. ACM Multimedia (1998)
23. Santini, S., Jain, R.: Similarity Measures. IEEE Transactions on Pattern Analysis and Machine Intelligence (1999)
24. Santini, S., Jain, R.: Integrated browsing and querying for image databases. IEEE Multimedia, Vol. 3, No. 7 (2000) 26-39
25. Sheikholeslami, G., Chang, W., Zhang, A.: Semantic Clustering and Querying on Heterogeneous Features for Visual Data. ACM Multimedia (1998)
26. Smith, J. R., Chang, S.: VisualSEEk: a fully automated content-based image query system. ACM Multimedia (1996)
27. Wood, M., Campbell, N., Thomas, B.: Iterative Refinement by Relevance Feedback in Content-Based Digital Image Retrieval. ACM Multimedia (1998)
28. Wu, J. K., Lam, C. P., Mehtre, B. M., Gao, Y. J., Desai Narasimhalu, A.: Content-Based Retrieval for Trademark Registration. Multimedia Tools and Applications, Vol. 3, No. 3 (1996) 245-267
Feature Extraction and a Database Strategy for Video Fingerprinting
Job Oostveen, Ton Kalker, and Jaap Haitsma
Philips Research, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands
[email protected], [email protected], [email protected]
Abstract. This paper presents the concept of video fingerprinting as a tool for video identification. As such, video fingerprinting is an important tool for persistent identification as proposed in MPEG-21. Applications range from video monitoring on broadcast channels to filtering on peer-to-peer networks to meta-data restoration in large digital libraries. We present considerations and a technique for (i) extracting essential perceptual features from moving image sequences and (ii) identifying any sufficiently long unknown video segment by efficiently matching the fingerprint of the short segment with a large database of pre-computed fingerprints.
1 Introduction
This paper presents a method for the identification of video. The objective is to identify video objects not by comparing perceptual similarity of the video objects themselves (which might be computationally expensive), but by comparing short digests, also called fingerprints, of the video content. These digests mimic the characteristics of regular human fingerprints. Firstly, it is (in general) impossible to derive from the fingerprint other relevant personal characteristics. Secondly, comparing fingerprints is sufficient to decide whether two persons are the same or not. Thirdly, fingerprint comparison is a statistical process, not a test for mathematical equality: it is only required that fingerprints are sufficiently similar to decide whether or not they belong to the same person (proximity matching).
1.1 Classification
Fingerprint methods can be categorized in two main classes, viz. the class of methods based on semantical features and the class of methods based on non-semantical features. The former class builds fingerprints from high-level features, such as those commonly used for retrieval. Typical examples include scene boundaries and color histograms. The latter class builds fingerprints from more general perceptual invariants that do not necessarily have a semantical interpretation. A typical example in this class is differential block luminance (see also Section 2). For both classes it holds that (small) fingerprints can be used to establish perceptual equality of (large) video objects. It should be noted that a feature extraction method for fingerprinting must be quite different from the methods used for
video retrieval. In retrieval, the features must facilitate searching for video clips that somehow look similar to the query, or that contain objects similar to those in the query. In fingerprinting the requirement is to identify clips that are perceptually the same, except for quality differences or the effects of other video processing. Therefore, the features for fingerprinting need to be far more discriminatory, but they do not necessarily need to be semantical. Consider the example of identification of content in a multimedia database. Suppose one is viewing a scene from a movie and would like to know from which movie the clip originates. One way of finding out is by comparing the scene to all fragments of the same size of all movies in the database. Obviously, this is totally infeasible in case of a large database: even a short video scene is represented by a large amount of bytes and potentially these have to be compared to the whole database. Thus, for this to work, one needs to store a large amount of easily accessible data and all these data have to be compared with the video scene to be identified. Therefore, there is both a storage problem (the database) and a computational problem (matching large amounts of data). Both problems can be alleviated by reducing the number of bits needed to represent the video scenes: fewer bits need to be stored and fewer bits need to be used in the comparison. One possible way to achieve this is by using video compression. However, because it is not needed to reconstruct the video from the representation, at least theoretically it is possible to use fewer bits for identification than for encoding. Moreover, perceptually comparing compressed video streams is a computationally expensive operation. A more practical option is to use a video compression scheme that is geared towards identification, more specifically to use a fingerprinting scheme. Video identification can then be achieved by storing the fingerprints of all relevant fragments in a database. Upon reception of an unknown fragment, its fingerprint is computed and compared to those in the database. This search (based on inexact pattern matching) is still a burdensome task, but it is feasible on current-day PCs.
1.2 Relation to Cryptography
We will now first discuss the concept of cryptographic hash functions and show how we approach the concept of fingerprints as an adaptation of cryptographic hash functions. Hash functions are a well-known concept in cryptography [8]. A cryptographic hash, also called message digest or digital signature, is in essence a short summary of a long message. Hash functions take a message of arbitrary size as input and produce a small bit string, usually of fixed size: the hash or hash value. Hash functions are widely used as a practical means to verify, with high probability, the integrity of (bitwise) large objects. The typical requirements for a hash function are twofold:
1. For each message M, the hash value H = h(M) is easily computable;
2. The probability that two messages lead to the same hash is small.
As a meaningful hash function maps large messages to small hash values, such a function is necessarily many-to-one. Therefore, collisions do occur. However, the probability of hitting upon two messages with the same hash value should be minimal. This usually means that the hash values for all allowed messages have a uniform distribution. For an n-bit hash value the probability of a collision is then equal to 2^-n. Cryptographic hash functions are usually required to be one-way, i.e., it should be difficult for a given hash value H to find a message which has H as its hash value. As a result such functions are bit-sensitive: flipping a single bit in the message changes the hash completely. The topic of this paper, fingerprinting for video identification, is about functions which show a strong analogy to cryptographic hash functions, but that are explicitly not bit-sensitive and are applicable to audio-visual data. Whereas cryptographic hashes are an efficient tool to establish mathematical equality of large objects, audio-visual fingerprint functions serve as a tool to establish perceptual similarity of (usually large) audio-visual objects. In other words, fingerprints should capture the perceptually essential parts of audio-visual content. In direct analogy with cryptographic hash functions, one would expect a fingerprint function to be defined as a function that maps perceptually similar objects to the same bit string value. However, it is well known that perceptual similarity is not a transitive relationship. Therefore, a more convenient and practical definition reads as follows: a fingerprint function is a function that (i) maps (usually bitwise large) audio-visual objects to (usually bitwise small) bit strings (fingerprints) such that perceptually small changes lead to small differences in the fingerprint and (ii) such that perceptually very different objects lead with very high probability to very different fingerprints.
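To illustrate the bit sensitivity that distinguishes cryptographic hashes from the perceptual fingerprints discussed here, the following small Java sketch hashes a message and a copy with a single flipped bit and counts the differing digest bits. SHA-256 is used only as a convenient example of a cryptographic hash; the choice of algorithm is not part of the paper.

```java
import java.security.MessageDigest;

public class HashBitSensitivity {
    public static void main(String[] args) throws Exception {
        byte[] m1 = "a long video-like message".getBytes("UTF-8");
        byte[] m2 = m1.clone();
        m2[0] ^= 0x01;                      // flip a single bit of the message

        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] h1 = sha.digest(m1);
        byte[] h2 = sha.digest(m2);         // digest() resets the state in between

        // Count the differing bits between the two 256-bit digests.
        int diff = 0;
        for (int i = 0; i < h1.length; i++) {
            diff += Integer.bitCount((h1[i] ^ h2[i]) & 0xFF);
        }
        // Typically around half of the 256 bits differ: a one-bit change in the
        // input changes the hash completely, which is exactly what a perceptual
        // fingerprint function must not do.
        System.out.println("Differing bits: " + diff + " of " + (8 * h1.length));
    }
}
```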
1.3 Fingerprinting Approaches
The scientific community seems to be favouring the terminology 'fingerprint', and for that reason this is the terminology that will be used in this paper. However, it is doubtful whether or not this is the best choice. For instance, the term fingerprinting is also used in the watermarking community, where it denotes the active embedding of tracing information. Although the literature on fingerprinting is still limited, in particular for video, some progress has already been reported. Among others, algorithms for still image fingerprinting have been published by Abdel-Mottaleb et al. [1], by J. Fridrich [5], by R. Venkatesan et al. [12,11] and by Schneider and Chang [10]. A number of algorithms for audio fingerprinting have been published; see [6] and the references therein. A number of papers present algorithms for video fingerprinting. Cheung and Zakhor [4] are concerned with estimating the number of copies (possibly at different quality levels) of video clips on the web. Hampapur and Bolle [7] present an indexing system based on feature extraction from key frames. Cryptographic hashes operate on the basis of a complete message. As such, it is impossible to check the integrity or obtain the identity of a part of the message. For video fingerprinting this is an undesirable property, as it means that it is impossible to identify short clips out of a longer clip. Also, for integrity checking, one would like to be able to localize distortions.
For this reason, it is not always appropriate to create a global fingerprint for the whole of an audio-visual object. Instead, we propose to use a fingerprint stream of locally computed fingerprint bits (also referred to as sub-fingerprints): per time unit, a number of bits are extracted from the content. In this way, it is also possible to identify smaller sections of the original. In a typical identification scenario, the full fingerprint stream is stored in the database. Upon reception of a video, the fingerprint values are extracted from a short section, say with a duration of 1 second. The result, which we call a fingerprint block, is then matched to all blocks of the same size in the database. If the fingerprint block matches a part of the fingerprint stream of some material, it is identified as that specific part of the corresponding video. If there is no sufficiently close match, the process repeats by extracting a next fingerprint block and attempting to match it. The description above reveals two important complexity aspects of a full-fledged fingerprinting system. The first complexity aspect concerns fingerprint extraction, the second concerns the matching process. In a typical application the fingerprint extraction client has only limited resources. Moreover, the bandwidth to the fingerprint matching engine is severely restricted. It follows that in many applications it is required that fingerprint extraction is low complexity and that the size of the fingerprint is either small or at least sufficiently compressible. This observation already rules out the use of semantic fingerprints in many cases, as these tend to be computationally intensive. The fingerprint matching server is in its most basic form a gigantic sliding correlator: for an optimal decision a target fingerprint block needs to be matched against all fingerprint blocks of similar length in the database. Even for simple matching functions (such as bit error rate), this sliding correlation becomes infeasible if the fingerprint database is sufficiently large. For a practical fingerprint matching engine it is essential that the proximity matching problem is dealt with in an appropriate manner, either by including ingredients that allow hierarchical searching [6], by careful preparation of the fingerprint database [3], or both. Both types of complexity are already well recognized in the field of audio fingerprinting; see for example the recent RIAA/IFPI call [9].
1.4 Overview
In this paper we introduce an algorithm for robust video fingerprinting that has very modest feature extraction complexity, a well-designed matching engine and a good performance with respect to robustness. We will present some general considerations in the design of such a video fingerprinting algorithm, with a focus on building a video identification tool. In Section 2 we introduce the algorithm and discuss a number of the issues in designing such an algorithm. Section 3 contains the design of a suitable database structure. In Section 4 we will summarize our results and indicate directions for future research.
2 Feature Extraction
In this section, we present a feature extraction algorithm for robust video fingerprinting and we discuss some of the choices and considerations in the design of such an algorithm.
Fig. 1. Block diagram of the differential block luminance algorithm (frames are divided into blocks; the mean luminance of each block is computed and then filtered and quantized).
The first question to be asked is in which domain to extract the features. In audio, very clearly, the frequency domain optimally represents the perceptual characteristics. In video, however, it is less clear which domain to use. For complexity reasons it is preferable to avoid complex operations, like DCT or DFT transformations. Therefore, we choose to compute features in the spatio-temporal domain. Moreover, to allow easy feature extraction from most compressed video streams as well, we choose features which can be easily computed from block-based DCT coefficients. Based on these considerations, the proposed algorithm is based on a simple statistic, the mean luminance, computed over relatively large regions. This is also the approach taken by Abdel-Mottaleb [1]. We choose our regions in a fairly simple way: the example algorithm in this paper uses a fixed number of blocks per frame. In this way, the algorithm is automatically resistant to changes in resolution. To ease the discussion, we introduce some terminology. The bits extracted from a frame will be referred to as sub-fingerprints. A fingerprint block then denotes a fixed number of sub-fingerprints from consecutive frames. Our goal is to be able to identify short video clips and moreover to localize the clip inside the movie from which it originates. In order to do this, we need to extract features which contain sufficient high-frequency content in the temporal direction. If the features are more or less constant over a relatively large number of frames, then it is impossible to localize the clip exactly inside the movie. For this reason, we take differences of corresponding features extracted from subsequent frames. Automatically, this makes the system robust to (slow) global changes in luminance. To arrive at our desired simple binary features, we only retain the sign of the computed differences. This immediately implies robustness to luminance offsets and to contrast modifications. To decrease the complexity
of measuring the distance between two fingerprints (the matching process), a binary fingerprint also offers considerable advantages. That is, we can compare fingerprints on a bit-by-bit basis, using the Hamming distance as a distance measure. Summarizing, we discard all magnitude information from the extracted filter output values, and only retain the sign. The introduction of differentiation in the temporal direction leads to a problem in the case of still scenes. If a video scene is effectively a prolonged still image, the temporal differentiation is completely determined by noise, and therefore the extracted bits are very unreliable. Conceptually, what one would like is that fingerprints do not change while the video is unchanged. One way to achieve this is by using a conditional fingerprint extraction procedure. This means that a frame is only considered for fingerprint computation if it differs sufficiently from the last frame from which a fingerprint was extracted [2]. This approach leads, however, to a far more difficult matching procedure: the matching needs to be resistant to the fact that the fingerprint extracted from a processed version of a clip may have a different number of sub-fingerprints than the original. Another possibility is to use a different temporal filter which does not completely suppress mean luminance (DC). This can be achieved in a very simple manner by replacing the earlier proposed FIR filter kernel [−1 1] by [−α 1], where α is a value slightly smaller than 1. Using this filter the extracted fingerprint will be constant in still scenes (and even still regions of a scene), whereas in regions with motion the fingerprint is determined by the difference between luminance values in consecutive frames. In addition to the differentiation in the time domain, we can also apply a spatial differentiation (or, more generally, a high-pass filter) to the features extracted from one frame. In this way, the correlation between bits extracted from the same frame is also decreased significantly. Secondly, application of the spatial filter avoids a bias in the overall extracted bits, which would occur if the new temporal filter were applied directly to the extracted mean luminance values (see Footnote 1). For our experiments, the results of which will be presented below, we have used the following algorithm.
1. Each frame is divided in a grid of R rows and C columns, resulting in R × C blocks. For each of these blocks, the mean of the luminance values of the pixels is computed. The mean luminance of block (r, c) in frame p is denoted F(r, c, p) for r = 1, ..., R and c = 1, ..., C.
2. We visualise the computed mean luminance values from the previous step as frames consisting of R × C "pixels". On this sequence of low-resolution gray-scale images, we apply a spatial filter with kernel [−1 1] (i.e. taking differences between neighbouring blocks in the same row), and a temporal filter with kernel [−α 1], as explained above.
3. The sign of the resulting value constitutes the fingerprint bit B(r, c, p) for block (r, c) in frame p. Note that due to the spatial filtering operation in the previous step, the value of c ranges from 1 to C − 1 (but still, r = 1, ..., R). Thus, per frame we derive R × (C − 1) fingerprint bits.
Footnote 1: Without spatial differentiation, the fingerprint values before quantization would have a larger probability of being positive than negative.
Summarizing, and more precisely, we have for r = 1, ..., R and c = 1, ..., C − 1:

B(r, c, p) = 1 if Q(r, c, p) ≥ 0, and B(r, c, p) = 0 if Q(r, c, p) < 0,

where

Q(r, c, p) = (F(r, c+1, p) − F(r, c, p)) − α (F(r, c+1, p−1) − F(r, c, p−1)).

We call this algorithm "differential block luminance". A block diagram describing it is depicted in Figure 1. These features have a number of important advantages:
– Only a limited number of bits is needed to uniquely identify short video clips with a low false positive probability.
– The feature extraction algorithm has a very low complexity and it may be adapted to operate directly on the compressed domain, without a need for complete decoding.
– The robustness of these features with respect to geometry-preserving operations is very good.
A disadvantage may be that for certain applications the robustness with respect to geometric operations (like zoom & crop) may not be sufficient. Experimental robustness results are presented in Section 2.1, below. For our experiments we used α = 0.95 and R = 4, C = 9. This leads to a fingerprint size of 32 bits per frame, and a block size of 120 × 80 pixels for NTSC video material. Matching is done on the basis of fingerprint bits extracted from 30 consecutive frames, i.e., 30 × 32 = 960 bits.
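The following Java sketch implements the differential block luminance extraction as specified above (block means F(r, c, p), spatial kernel [−1 1] along each row, temporal kernel [−α 1], sign quantization). The frame representation and the packing of the 32 bits into an integer are assumptions made for illustration; they are not taken from the authors' implementation.

```java
/** Minimal sketch of differential block luminance extraction.
 *  frame[y][x] holds luminance values; R rows and C columns of blocks. */
public class DifferentialBlockLuminance {

    /** Mean luminance F(r, c, p) of each block of one frame. */
    static double[][] blockMeans(int[][] frame, int R, int C) {
        int h = frame.length, w = frame[0].length;
        double[][] F = new double[R][C];
        for (int r = 0; r < R; r++) {
            for (int c = 0; c < C; c++) {
                int y0 = r * h / R, y1 = (r + 1) * h / R;
                int x0 = c * w / C, x1 = (c + 1) * w / C;
                double sum = 0.0;
                for (int y = y0; y < y1; y++)
                    for (int x = x0; x < x1; x++)
                        sum += frame[y][x];
                F[r][c] = sum / ((y1 - y0) * (x1 - x0));
            }
        }
        return F;
    }

    /** Sub-fingerprint bits B(r, c, p) of frame p, given the block means of the
     *  previous frame. With R = 4 and C = 9 this yields 4 x 8 = 32 bits,
     *  packed here into one int (packing order is an assumption). */
    static int subFingerprint(double[][] Fprev, double[][] Fcur, double alpha) {
        int R = Fcur.length, C = Fcur[0].length;
        int bits = 0, bit = 0;
        for (int r = 0; r < R; r++) {
            for (int c = 0; c < C - 1; c++) {
                double q = (Fcur[r][c + 1] - Fcur[r][c])
                         - alpha * (Fprev[r][c + 1] - Fprev[r][c]);
                if (q >= 0) bits |= (1 << bit);   // B(r,c,p) = 1 iff Q(r,c,p) >= 0
                bit++;
            }
        }
        return bits;
    }
}
```

With R = 4 and C = 9, subFingerprint returns the 32 fingerprint bits of one frame; collecting 30 consecutive return values yields one 960-bit fingerprint block.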
2.1 Experimental Results
Extensive experiments with the algorithm described above are planned for the near future. In this article we report on the results of some initial tests. We have used six 10-second clips, taken from a number of movies and television broadcasts (with a resolution of 480 lines and 720 pixels per line). From these clips, we extracted the fingerprints. These are used as "the database". Subsequently, we processed the clips, and investigated how this influences the extracted fingerprint. The test included the following processing:
1. MPEG-2 encoding at 4 Mbit/second;
2. median filtering using 3 × 3 neighbourhoods;
3. luminance-histogram equalisation;
4. shifting the images vertically over k lines (k = 1, 2, 3, 4, 8, 12, 16, 20, 24, 32);
5. scaling the images horizontally, with a scaling factor between 80% and 120%, with steps of 2%.
Fig. 2. Robustness w.r.t. horizontal scaling (left graph: bit error rate versus horizontal scale factor) and vertical shifts (right graph: bit error rate versus vertical shift in lines).
The results for scaling and shifting are shown in Figure 2. The other results are reported below:
MPEG-2 encoding: 11.8%
median filtering: 2.7%
histogram equalisation: 2.9%
The results indicate that the method is very robust against all processing which is done on a local basis, like for instance MPEG compression or median filtering. In general the alterations created by these processes average out within the blocks. Processing which changes the video more in a global fashion is more difficult to withstand. For instance, global geometric operations like scaling and shifting lead to far higher bit error rates. This behaviour stems from the resulting misalignment of the blocks. A higher robustness could be obtained by using larger blocks, but this would reduce the discriminative power of the fingerprint.
3 Database Strategy
Matching the extracted fingerprints to the fingerprints in a large database is a non-trivial task since it is well known that proximity matching does not scale nicely to very large databases (recall that the extracted fingerprint values may have many bit errors). We will illustrate this with some numbers, based on using the proposed fingerprinting scheme (as described in Section 2), in a broadcast monitoring scenario. Consider a database containing news clips with a total duration of 4 weeks (i.e., 4×7×24 = 672 hours of video material). This corresponds to almost 300 megabytes of fingerprints. If we now extract a fingerprint block (e.g. corresponding to 1 second of video, which results in 30 sub-fingerprints) from an unknown news broadcast, we would like to determine which position in the 672 hours of stored news clips it matches best. In other words we want to find the position in these 672 hours where the bit error rate is minimal. This
can be done by brute force matching, but this will take around 72 million comparisons. Moreover, the number of comparisons increases linearly with the size of the database. We propose to use a more efficient strategy, which is depicted in Figure 3.

Fig. 3. Database layout: a lookup table over all 2^32 possible sub-fingerprint values (0x00000000 through 0xFFFFFFFF) points to the clips and positions where each value occurs; the extracted fingerprint block is matched against the fingerprint blocks at these candidate positions.

Instead of matching the complete fingerprint block, we first look at only a single sub-fingerprint at a time and assume that occasionally this 32-bit bit-string contains no errors. We start by creating a lookup table (LUT) for all possible 32-bit words, and we let the entries in the table point to the video clip and the position(s) within that clip where this 32-bit word occurs as a sub-fingerprint. Since this word can occur at multiple positions in multiple clips, the pointers are stored in a linked list. In this way one 32-bit word is associated with multiple pointers to clips and positions. The approach that we take bears a lot of similarity to inverted file techniques, as commonly used in text retrieval applications. Our lookup table is basically an index describing for each sub-fingerprint (word) at which location in which clip it occurs. The main difference with text retrieval is that, due to processing of the video, we need to adapt our search strategy to the fact that sub-fingerprints will frequently contain (possibly many) erroneous bits. By inspecting the lookup table for each of the 30 extracted sub-fingerprints, a list of candidate clips and positions is generated. With the assumption that occasionally a single sub-fingerprint is free of bit errors, it is easy to determine whether or not all the 30 sub-fingerprints in the fingerprint block match one of
126
Job Oostveen, Ton Kalker, and Jaap Haitsma
the candidate clips and positions. This is done by calculating the bit error rate of the extracted fingerprint block with respect to the corresponding fingerprint blocks of the candidate clips and positions. The candidate clip and position with the lowest error rate is selected as the best match, provided that this error rate is below an appropriate threshold. Otherwise the database reports that the search could not find a valid best match. Note that in practice, once a clip is identified, it is only necessary to check whether or not the fingerprints of the remainder of the clip belong to the best match already found. As soon as the fingerprints no longer match, a full structured search is again initiated. Let us give an example of the described search method by taking a look at Figure 3. The last extracted fingerprint value is 0x00000001. The LUT in the database points only to a certain position in clip 1. Let's say that this position is position p. We now calculate the bit error rate between the extracted fingerprint block and the block of clip 1 from position p−29 until position p. If the two blocks match sufficiently closely, then it is very likely that the extracted fingerprint originates from clip 1. However, if the two blocks are very different, then either the clip is not in the database or the extracted sub-fingerprint contains an error. Let's assume that the latter occurred. We then try the second-to-last extracted sub-fingerprint (0x00000000). This one has two possible candidate positions, one in clip 2 and one in clip 1. Assuming that the comparison of the extracted fingerprint block with the corresponding database fingerprint block of clip 2 yields a bit error rate below the threshold, we identify the video clip as originating from clip 2. If not, we repeat the same procedure for the remaining 28 sub-fingerprints. We need to verify that our assumption that every fingerprint block contains an error-free sub-fingerprint is actually a reasonable one. Experiments indicate that this is actually the case for all reasonable types of processing. With the above method, we only compare the fingerprint blocks to those blocks in the database which correspond exactly in at least one of their sub-fingerprints. This makes the search much faster compared to exhaustive search or any pivot-based strategy [3], and this makes it possible to efficiently search in very large databases. This increased search speed comes at the cost of possibly not finding a match, even if there is a matching fingerprint block in the database. More precisely, this is the case if all of the sub-fingerprints have at least one erroneous bit, but at the same time the overall bit error rate is below the threshold. We can decrease the probability of missed identifications by using bit reliability information. The fingerprint bits are computed by taking the sign of a real-valued number. The absolute value of this number can be taken as a reliability measure of the correctness of the bit: the sign of a value close to zero is assumed to be less robust than the sign of a very large number. In this way, we can declare q of the bits in the fingerprint unreliable. To decrease the probability of a missed recognition, we toggle those q bits, thus creating 2^q candidate sub-fingerprints. We then do an efficient matching, as described above, with all of these sub-fingerprints. If one of these leads to a match, then the database fingerprint block is compared with the originally extracted fingerprint.
If the resulting bit error rate of this final comparison is again below the threshold, then we have a successful identification.
Note that in this way the reliability information is used to generate more candidates in the comparison procedure, but that it has no influence on the final bit error rate. In [6] we have described a method for audio fingerprinting. The database strategy described there is the same as the one in this paper, except for some of the parameter values (in the case of audio, matching is done based on fingerprint blocks which consist of 256 sub-fingerprints, corresponding to 3 seconds of audio). With this audio database we have carried out extensive experiments that show the technical and economical feasibility of scaling this approach to very large databases, containing for instance a few million songs. An important figure of merit for a fingerprinting method is the false positive probability: the probability that two randomly selected video clips are declared similar by the method. Under the assumption that the extracted fingerprint bits are independent random variables with equal probability of being 0 or 1, it is possible to compute a general formula for the false positive probability. Let a fingerprint block consist of R sub-fingerprints and let each sub-fingerprint consist of C bits. Then for two randomly selected fingerprint blocks, the number of bits in which the two blocks correspond is binomially (n, p) distributed with parameters n = RC and p = 1/2. As RC is large, we can approximate this distribution by a normal distribution with mean µ = np = RC/2 and variance σ² = np(1 − p) = RC/4. Given a fingerprint block B1, the probability that less than a fraction α of the bits of a randomly selected second fingerprint block B2 is different from the corresponding bits of B1 equals

Pf(α) = (1/√(2π)) ∫_{(1−2α)√n}^{∞} e^{−x²/2} dx = (1/2) erfc( (1−2α)√n / √2 ).

Based on this formula, we can set our threshold for detection. In our experiments we used n = 960. Setting the threshold α = 0.3 (i.e., declaring two clips similar if their fingerprint blocks are different in at most 30% of the bit positions), the argument of the erfc is about 8.8 and the false positive probability is computed to be in the order of 10^-35. In practice the actual false positive probability will be significantly higher due to correlation between the bits in a fingerprint block. Currently, we are in the process of studying the correlation structure experimentally, and adapting our theoretical false positive analysis accordingly.
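The following Java sketch outlines the lookup-table based search described in this section: an index from 32-bit sub-fingerprint values to (clip, position) pairs, and an identification routine that gathers candidate positions from the LUT and keeps the candidate whose fingerprint block has the lowest bit error rate, subject to the threshold α. The data structures are simplified and the toggling of unreliable bits is omitted; class and method names are invented for illustration.

```java
import java.util.*;

/** Simplified sketch of the fingerprint database strategy of Section 3. */
public class FingerprintDatabase {
    /** Fingerprint streams of all stored clips: clipId -> sub-fingerprints. */
    private final List<int[]> clips = new ArrayList<>();
    /** Lookup table: 32-bit sub-fingerprint -> positions {clipId, pos} where it occurs. */
    private final Map<Integer, List<int[]>> lut = new HashMap<>();

    public void addClip(int[] subFingerprints) {
        int clipId = clips.size();
        clips.add(subFingerprints);
        for (int pos = 0; pos < subFingerprints.length; pos++) {
            lut.computeIfAbsent(subFingerprints[pos], k -> new ArrayList<>())
               .add(new int[] { clipId, pos });
        }
    }

    /** Match a query block (e.g. 30 sub-fingerprints = 960 bits) against the
     *  database; returns {clipId, startPosition}, or null if no candidate has
     *  a bit error rate below the threshold (e.g. alpha = 0.30). */
    public int[] identify(int[] block, double berThreshold) {
        int totalBits = 32 * block.length;
        int bestErrors = Integer.MAX_VALUE;
        int[] best = null;
        for (int i = 0; i < block.length; i++) {
            List<int[]> candidates = lut.get(block[i]);
            if (candidates == null) continue;   // this sub-fingerprint contains errors
            for (int[] cand : candidates) {
                int clipId = cand[0], start = cand[1] - i;  // align block start
                int[] stored = clips.get(clipId);
                if (start < 0 || start + block.length > stored.length) continue;
                int errors = 0;
                for (int j = 0; j < block.length; j++) {
                    errors += Integer.bitCount(block[j] ^ stored[start + j]);
                }
                if (errors < bestErrors) {
                    bestErrors = errors;
                    best = new int[] { clipId, start };
                }
            }
        }
        return (best != null && bestErrors < berThreshold * totalBits) ? best : null;
    }
}
```

With 30 sub-fingerprints of 32 bits each, a threshold of α = 0.3 corresponds to accepting a best match with at most 288 erroneous bits out of 960.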
4 Conclusions
In this paper we have presented fingerprinting technology for video identification. The methodology is based on the functional similarity between fingerprints and cryptographic hashes. We have introduced a feature extraction algorithm, the design of which was driven by minimal extraction complexity. The resulting algorithm is referred to as differential block luminance. Secondly we have outlined a structure for very efficiently searching in a large fingerprint database. The combination of these feature extraction and database algorithms results in a robust and very efficient fingerprinting system. Future research will
be mainly focusing on extracting even more robust features, still under the constraint of limited complexity of the extractor and manageable fingerprint database complexity.
References
1. M. Abdel-Mottaleb, G. Vaithilingam, and S. Krishnamachari. Signature-based image identification. In SPIE Conference on Multimedia Systems and Applications II, Boston, USA, 1999.
2. J. Bancroft. Fingerprinting: Monitoring the use of media assets, 2000. Omnibus Systems Limited, white paper. See http://www.advanced-broadcast.com/.
3. E. Chavez, J. Marroquin, and G. Navarro. Fixed queries array: A fast and economical data structure for proximity searching. Multimedia Tools and Applications, 14:113–135, 2001.
4. S.S. Cheung and A. Zakhor. Video similarity detection with video signature clustering. In Proc. 8th International Conference on Image Processing, volume 2, pages 649–652, Thessaloniki, Greece, 2001.
5. J. Fridrich. Robust bit extraction from images. In Proc. IEEE ICMCS'99, volume 2, pages 536–540, Florence, Italy, 1999.
6. J. Haitsma, T. Kalker, and J. Oostveen. Robust audio hashing for content identification. In International Workshop on Content-Based Multimedia Indexing, Brescia, Italy, 2001. Accepted.
7. A. Hampapur and R.M. Bolle. Feature based indexing for media tracking. In Proc. International Conference on Multimedia and Expo 2000 (ICME-2000), volume 3, pages 1709–1712, 2000.
8. A.J. Menezes, S.A. Vanstone, and P.C. van Oorschot. Handbook of Applied Cryptography. CRC Press, 1996.
9. RIAA-IFPI. Request for information on audio fingerprinting technologies, 2001. http://www.ifpi.org/site-content/press/20010615.html, http://www.riaa.com/pdf/RIAA IFPI Fingerprinting RFI.pdf.
10. M. Schneider and S.F. Chang. A robust content based digital signature for image authentication. In Proceedings of the International Conference on Image Processing (ICIP) 1996, volume 3, pages 227–230, 1996.
11. R. Venkatesan and M.H. Jakubowski. Image hashing. In DIMACS Conference on Intellectual Property Protection, Piscataway, NJ, USA, 2000.
12. R. Venkatesan, S.M. Koon, M.H. Jakubowski, and P. Moulin. Robust image hashing. In Proceedings of the International Conference on Image Processing (ICIP), 2000.
ImageGrouper: Search, Annotate and Organize Images by Groups
Munehiro Nakazato1, Lubomir Manola2, and Thomas S. Huang1
1 Beckman Institute, University of Illinois at Urbana-Champaign, 405 N. Mathews Ave., Urbana, IL 61801, USA
{nakazato,huang}@ifp.uiuc.edu
2 School of Electrical Engineering, University of Belgrade
[email protected]
Abstract. In Content-based Image Retrieval (CBIR), trial-and-error query is essential for successful retrieval. Unfortunately, the traditional user interfaces are not suitable for trying different combinations of query examples. This is because, first, these systems assume query examples are added incrementally. Second, the query specification and result display are done on the same workspace. Once the user removes an image from the query examples, the image may disappear from the user interface. In addition, it is difficult to combine the results of different queries. In this paper, we propose a new interface for Content-based Image Retrieval named ImageGrouper. In our system, the users can interactively compare different combinations of query examples by dragging and grouping images on the workspace (Query-by-Group). Because the query results are displayed on another pane, the user can quickly review the results. Combining different queries is also easy. Furthermore, the concept of "image groups" is also applied to annotating and organizing a large number of images.
1 Introduction
Many researchers have proposed ways to find an image in large image databases. We can divide these approaches into two types of interactions: browsing and searching. In image browsing, the users look through the entire collection. In most systems, the images are clustered in a hierarchical manner and the user can traverse the hierarchy by zooming and panning [3][4][10][16]. In [16], browsing and searching are integrated so that the user can switch back and forth between browsing and searching. Meanwhile, an enormous amount of research has been done on Content-Based Image Retrieval (CBIR) [7][18][24]. In CBIR systems, the user searches images by visual similarity, i.e. low-level image features such as color [25], texture [23] and structure [27]. They are automatically extracted from images and indexed in the database. Then, the system computes the similarity between the images based on these features. The most popular method of CBIR interaction is Query-by-Examples. In this method, the users select example images (as positive or negative) and ask the system to retrieve visually similar images. In addition, in order to improve the retrieval further, CBIR systems often employ Relevance Feedback [18][19], in which the users
can refine the search incrementally by giving feedback on the result of the previous query. In this paper, we propose a new user interface for digital image retrieval and organization, named ImageGrouper. In ImageGrouper, a new concept, Query-by-Groups, is introduced for Content-based Image Retrieval (CBIR). The users construct queries by making groups of images. The groups are easily created by dragging images on the interface. Because the image groups can be easily reorganized, flexible retrieval is achieved. Moreover, with similar operations, the user can effectively annotate and organize a large number of images. In the next section, we discuss how groups are used for image retrieval. Then, the following sections describe the use of image groups for image annotation and organization.
2 User Interface Support for Content-Based Image Retrieval
2.1 Current Approaches: Incremental Search
Not much research has been done on user interface support for Content-based Image Retrieval (CBIR) systems [16][20]. Figure 1 shows a typical GUI for a CBIR system that supports Query-by-Examples. Here, a number of images are aligned in a grid. In the beginning, the system displays randomly selected images. Effective ways to align the images are studied in [17]. In some cases, they are images found by browsing or keyword-based search. Under each image, a slide bar is attached so that the user can tell the system which images are relevant. If the user thinks an image is relevant, s/he moves the slider to the right. If s/he thinks an image is not relevant and should be avoided, s/he moves the slider to the left. The amount of slider movement represents the degree of relevance (or irrelevance).
Fig. 1. Typical GUI for CBIR systems.
Fig. 2. Example of "More is not necessarily better" (query and result columns). The left is the case of one example, the right is the case of two examples.
In some systems, the user selects example images by clicking check boxes or by clicking on the images [6]. In these cases, the degrees are not specified. When the "Query" button is pressed, the system computes the similarity between the selected images and the database images, then retrieves the N most similar images. The grid images are replaced with the retrieved images. These images are ordered based on the degree of similarity. If the user finds additional relevant images in the result set, s/he selects them as new query examples. If a highly irrelevant image appears in the result set, the user can select it as a negative example. Then, the user presses "Query" again. The user can repeat this process until s/he is satisfied. This process is called relevance feedback [18][19]. Moreover, in many systems, the users are allowed to directly weight the importance of image features such as color and texture. In [22], Smeulders et al. classified Query by Image Example and Query by Group Example into two different categories. From the user interface viewpoint, however, these two are very similar. The only difference is whether the user is allowed to select multiple images or not. In this paper, we classify both approaches as the Query-by-Examples method. Instead, we use the term "Query-by-Groups" to refer to our new model of query specification described later. The Query-by-Example approach has several drawbacks. First of all, these systems assume that the more query examples are available, the better the result we can get. Therefore, the users are supposed to search images incrementally by adding new example images from the result of the previous query. However, this assumption is not always true. Additional examples may contain undesired features and degrade the retrieval performance. Figure 2 shows an example of a situation where more query examples lead to worse results. In this example, the user is trying to retrieve pictures of cars. The left column shows the query result when only one image of a "car" is used as a query example. The right column shows the result of two query examples. The results are ordered based on the similarity ranks. In both cases, the same relevance feedback algorithm (Section 5.2 and [19]) was used and tested on a Corel image set of 17,000 images. In this example, even though the additional example image looks visually good to human eyes, it introduces undesirable features into the query. Thus, no car image appears in the top 8 images. An image of a car appears at rank 13 for the first time. This example is not a special case. It happens often in image retrieval and confuses the users. This problem happens because of the semantic gap [20][22] between the high-level concept in the user's mind and the extracted features of the images. Furthermore, finding good combinations of query examples is very difficult because image features are numerical values that are impossible for humans to estimate. The only way to find the right combination is trial and error. Otherwise, the user can be trapped in a small part of the image database [16]. Unfortunately, the traditional user interfaces were designed for incremental search and are poorly suited for trial-and-error queries. This is because in these systems, query specification and result display must be done on the same workspace. Once the user removes an image from the query examples during relevance
feedback loops, the image may disappear from the user interface. Thus, it is awkward to bring it back later for another query. Second, the traditional interface does not allow the user to put aside the query results for later use. This type of interaction is desired because the users are not necessarily looking for only one type of image. The users' interest may change during retrieval. This behavior is known as berry picking [2] and has been observed for text document retrieval by O'Day and Jeffries [15]. Moreover, because of the semantic gap [20][22] mentioned above, users often need to make more than one query to satisfy their needs [2]. For instance, a user may be looking for images of "beautiful flowers." The database may contain many different "flower" images. These images might be completely different in terms of low-level visual features. Thus, the user needs to retrieve "beautiful flowers" as a collection of different types of images. Finally, in some cases it is better for the user to start from a general concept of objects and narrow down to specific ones. For example, suppose the user is looking for images of "red cars." Because image retrieval systems use various image features [23][27] as well as colors [25], even cars with different colors may have many common features with "red cars." In this case, it is better to start by collecting images of "cars of any color." Once enough car images are collected, the user can specify "red cars" as positive examples, and other cars as negative examples. Current interfaces for CBIR systems, however, do not support these types of query behavior. Another interesting approach for Query by Examples was proposed by Santini et al. [20]. In their El Niño system, the user specifies a query by the mutual distance between example images. The user drags images on the workspace so that the more similar images (in the user's mind) are located closer to each other. The system then reorganizes the images' locations reflecting the user's intent. There are two drawbacks in the El Niño system. First, it is unknown to the users how close similar images should be located and how far apart negative examples should be from good examples. It may take a while for the user to learn "the metric system" used in this interface. The second problem is that, like in traditional interfaces, query specification and result display are done on the same workspace. Thus, the user's previous decision (in the form of the mutual distances between the images) is overridden by the system when it displays the results. This makes trial-and-error query difficult. Given the analogue nature of this interface, trial-and-error support might be essential. Even if the user gets an unsatisfactory result, there is no way to redo the query with a slightly different configuration. No experimental results are provided in the paper.
2.2 Query-by-Groups
We are developing a new user interface for CBIR systems named ImageGrouper. In this system, a new concept, Query-by-Groups, is introduced. The Query-by-Groups mode is an extension of the Query-by-Example mode described above. The major difference is that while Query-by-Example handles the images individually, in Query-by-Groups a "group of images" is considered as the basic unit of the query. Figure 3 shows the display layout of ImageGrouper. The interface is divided into two panes. The left pane is the ResultView, which displays the results of content-based retrieval, keyword-based retrieval, and random retrieval.
Fig. 3. The ImageGrouper interface: the Result View (left pane) and the GroupPalette (right pane), with positive, negative, and neutral group boxes and the popup menu.
retrieval, keyword-based retrieval, and random retrieval. This is similar to the traditional GUI except that there are no sliders or buttons under the images. The right pane is the GroupPalette, where the user manages individual images and image groups. In order to create an image group, the user first drags one or more images from the ResultView into the GroupPalette, then encloses the images by drawing a rectangle (box), just as in drawing applications. All the images within the group box become members of this group. Any number of groups can be created in the palette. The user can move images from one group to another at any moment. In addition, groups can overlap each other, i.e., each image can belong to multiple groups. To remove an image from a group, the user simply drags it out of the box. When the right mouse button is pressed on a group box, a popup menu appears so that the user can give query properties (positive, negative, or neutral) to the group. The properties of groups can be changed at any moment, and the colors of the corresponding boxes change accordingly. To retrieve images based on these groups, the user presses the "Query" button placed at the top of the window (Figure 3). Then, the system retrieves new images that are similar to the images in the positive groups while avoiding images similar to the negative groups. The result images are displayed in the ResultView. When a group is specified as neutral (displayed as a white box), it does not contribute to the search at the moment; it can be turned into a positive or negative group later for another retrieval. If a group is positive (displayed as a blue box), the system uses the common features among the images in the group. On the other hand, if a group is given the negative (red box) property, the common features in the group are used as negative feedback. The user can specify multiple groups as positive or negative; in this case, these groups are merged into one group, i.e., the union of the groups is taken. The details of the algorithm are described in Section 5.2.
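To make the group semantics concrete, the following minimal Python sketch models groups with positive, negative, and neutral properties and merges them into the positive and negative example sets used for retrieval. The class and function names are illustrative only and are not taken from the ImageGrouper implementation.

```python
# Minimal sketch (not the ImageGrouper source): image groups carry a query
# property, and a query is built from the union of all positive and all
# negative groups, as described above. Names are illustrative only.
from dataclasses import dataclass, field
from enum import Enum

class GroupProperty(Enum):
    POSITIVE = "positive"   # blue box: common features used as positive feedback
    NEGATIVE = "negative"   # red box: common features used as negative feedback
    NEUTRAL = "neutral"     # white box: ignored for the current retrieval

@dataclass
class ImageGroup:
    name: str
    prop: GroupProperty = GroupProperty.NEUTRAL
    images: set = field(default_factory=set)   # image ids; an image may belong to many groups

def build_query(groups):
    """Merge groups into one positive and one negative example set (their unions)."""
    positives, negatives = set(), set()
    for g in groups:
        if g.prop is GroupProperty.POSITIVE:
            positives |= g.images
        elif g.prop is GroupProperty.NEGATIVE:
            negatives |= g.images
    return positives, negatives

# Example: three flower images as positives, one distractor as a negative.
flowers = ImageGroup("flowers", GroupProperty.POSITIVE, {101, 102, 103})
junk = ImageGroup("not flowers", GroupProperty.NEGATIVE, {250})
print(build_query([flowers, junk]))   # ({101, 102, 103}, {250})
```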
In the example shown in Figure 3, the user is retrieving images of "flowers." In the GroupPalette, three flower images are grouped as a positive group. On the right of this group, a red box represents a negative group that consists of only one image. Below the "flowers" group, there is a neutral group (white box), which is not used for retrieval at this moment. Images can also be moved outside of any group in order to temporarily remove them from the groups. The gestural operations of ImageGrouper are similar to the file operations of a window-based OS. Furthermore, because the user's mission is to collect images, the operation "dragging images into a box" naturally matches the user's cognitive state. 2.3 Flexible Image Retrieval The main advantage of Query-by-Groups is flexibility. Trial and Error Query by Mouse Dragging. In ImageGrouper, images can easily be moved between the groups by mouse drags. In addition, the neutral groups and the space outside of any group in the palette can be used as a storage area [8] for images that are not used at the moment; they can be reused later for another query. This makes trial and error with relevance feedback easier. The user can quickly explore different combinations of query examples by dragging images into or out of a box. Moreover, the query specification that the user made is preserved and visible in the palette. Thus, it is easy to modify the previous decision when the query result is not satisfactory. Groups in a Group. ImageGrouper allows the users to create a new group within a group (Groups in a Group). With this method, the user begins by collecting relatively generic images first, then narrows down to more specific images. Figure 4 shows an example of Groups in a Group. Here, the user is looking for "red cars." When s/he does not yet have enough examples, however, the best way to start is to retrieve images of "cars of any color," because these images may have many common features with red car images even though their color features are different. The large white box is a group for "cars of any color." Once the user has found enough car images, s/he can narrow the search down to red cars only. To do so, the user divides the collected images into two sub-groups by creating two new boxes, one for red cars and one for other cars, and then specifies the red car group as positive and the other car group as negative. In Figure 4, the smaller left (blue, i.e., positive) box is the group of red cars and the right (red, i.e., negative) box is the group of non-red cars. This narrowing-down search is not possible in conventional CBIR systems. 2.4 Experiment on Trial and Error Query In order to examine the effect of ImageGrouper's trial-and-error query, we compared the query performance of our system with that of a traditional incremental approach (Figure 1). In this experiment, we used the Corel photo stock of 17,000 images as the data set. For both interfaces, the same image features and relevance feedback algorithm (described in Section 5.2) were used. For the traditional interface, the top 30 images are displayed and examined by the user in each relevance feedback iteration. For ImageGrouper, the top 20 images are displayed in the ResultView. Only one positive group and one neutral group are created for this
Fig. 4. Groups in a group.
Fig. 5. Overlap between groups. Two images in the overlapped region contain both mountain and cloud.
When keyword search is integrated with CBIR, as in our system and [16], keyword-based search can be used to find the initial query examples for content-based search. With this scheme, the user does not have to annotate all images. In any case, it is very important to provide easy and quick ways to annotate text on a large number of images. 3.1 Current Approaches for Text Annotation The most primitive way to annotate is to select an image and then type in keywords. Because this interaction requires the user to use the mouse and keyboard repeatedly in turn, it is too frustrating for a large image database. Several researchers have proposed smarter user interfaces for keyword annotation of images. In the bulk annotation method of FotoFile [9], the user selects multiple images on the display, selects several attribute/value pairs from a menu, and then presses the "Annotate" button. The user can therefore add the same set of keywords to many images at the same time. To retrieve images, the user selects entries from the menu and then presses the "Search" button. Because of this visual and gestural symmetry [9], the user needs to learn only one tool for both annotation and retrieval. PhotoFinder [21] introduced a drag-and-drop method, where the user selects a label from a scrolling list and drags it directly onto an image. Because the labels remain visible at the designated location on the images and these locations are stored in the database, the labels can be used as "captions" as well as for keyword-based search. For example, the user can annotate the name of a person directly on his/her portrait in the image, so that other users can associate the person with his/her name. When the user needs new words to annotate, s/he adds them to the scrolling list. Because the user drags keywords onto individual images, bulk annotation is not supported in this system.
3.2 Annotation by Groups Most home users do not want to annotate images one by one, especially when the number of images is large. In many cases, the same set of keywords is enough for several images. For example, a user may just want to annotate "My Roman Holiday, 1997" on all images taken in Rome. Annotating the same keywords repeatedly is painful enough to discourage him/her from using the system. ImageGrouper introduces the Annotation-by-Groups method, where keywords are annotated not on individual images but on groups. As in Query-by-Groups, the user first creates a group of images by dragging images from the ResultView into the GroupPalette and drawing a rectangle around them. In order to give keywords to the group, the user opens the Group Information Window by selecting "About This Group" from the pop-up menu (Figure 3). In this window, an arbitrary number of words can be added. Because the users can annotate the same keywords on a number of images simultaneously, annotation becomes much faster and less error prone. Although Annotation-by-Groups is similar to the bulk annotation of FotoFile [9], it has several advantages, described below. Annotating New Images with the Same Keywords. In bulk annotation [9], once the user has finished annotating keywords to some images, there is no fast way to give the same annotation to another image later; the user has to repeat the same steps (i.e., select images, select keywords from the list, then press "Annotate"). This is awkward when the user has to add a large number of keywords. Meanwhile, in Annotation-by-Groups, the system attaches annotations not to each image but to groups. Therefore, by dragging new images into an existing group, the same keywords are automatically given to them. The user does not have to type the same words again. Hierarchical Annotation with Groups in a Group. In ImageGrouper, the user can annotate images hierarchically using the Groups in a Group method described above (Figure 4). For example, the user may want to add the new keyword "Trevi Fountain" to only a part of the image group that has been labeled "My Roman Holiday, 97." This is easily done by creating a new sub-group within the group and annotating only that sub-group. In order to annotate hierarchically in FotoFile [9] with bulk annotation, the user has to select some of the images that are already annotated and then annotate them again with more keywords. ImageGrouper, on the other hand, allows the user to visually construct a hierarchy in the GroupPalette first and then edit keywords in the Group Information Window. This method is more intuitive and less error prone. Overlap between Images. An image often contains multiple objects or people. In such cases, the image can be referred to in more than one context. ImageGrouper supports such multiple references by allowing overlaps between image groups, i.e., an image can belong to multiple groups at the same time. For example, in Figure 5, there are two image groups: "Cloud" and "Mountain." Because some images contain both cloud and mountain, these images belong to both groups and are automatically referred to as "Cloud and Mountain." This concept is not supported in other systems.
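The following sketch illustrates, under our own naming, how Annotation-by-Groups can be realized as a data structure: keywords live on groups, an image inherits the keywords of every group that contains it, and Groups in a Group gives hierarchical annotation by walking up the enclosing groups. This is an illustration of the idea, not the ImageGrouper code.

```python
# Sketch of Annotation-by-Groups (illustrative, not the ImageGrouper code):
# keywords are attached to groups, and an image inherits the keywords of every
# group (and enclosing parent group) that contains it, so overlapping groups
# and Groups-in-a-Group give multiple and hierarchical annotations for free.
class AnnotationGroup:
    def __init__(self, keywords, parent=None):
        self.keywords = set(keywords)
        self.parent = parent          # enclosing group, if this is a sub-group
        self.images = set()           # image ids dragged into this group box

    def add(self, image_id):
        self.images.add(image_id)     # a new image automatically gets the group's keywords

    def all_keywords(self):
        kw, g = set(), self
        while g is not None:          # walk up the Groups-in-a-Group hierarchy
            kw |= g.keywords
            g = g.parent
        return kw

def keywords_of(image_id, groups):
    """Effective annotation of an image: union over all groups that contain it."""
    kw = set()
    for g in groups:
        if image_id in g.images:
            kw |= g.all_keywords()
    return kw

roman = AnnotationGroup({"My Roman Holiday", "1997"})
trevi = AnnotationGroup({"Trevi Fountain"}, parent=roman)   # sub-group inside 'roman'
roman.add(1); trevi.add(2); roman.add(2)
print(keywords_of(2, [roman, trevi]))  # {'My Roman Holiday', '1997', 'Trevi Fountain'}
```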
4
Organizing Images by Groups
In the previous two sections, we described how ImageGrouper supports content-based query as well as keyword annotation. These features are closely related and complementary to each other. In order to annotate images, the user can first collect visually similar images using content-based retrieval with Query-by-Groups, and then attach textual information to the group of collected images. From then on, the user can quickly retrieve the same images using keyword-based search. Conversely, the results of a keyword-based search can be used as a starting point for content-based search. This method is especially useful when the image database is only partially annotated or when the user is searching for images based on visual appearance only. 4.1 Photo Albums and Group Icons As described above, ImageGrouper allows groups to overlap. In addition, the user can attach textual information to these groups. Therefore, groups in ImageGrouper can be used to organize pictures as "photo albums" [9]. Similar concepts are proposed in FotoFile [9] and Ricoh's Storytelling system [1]; in both systems, albums are used for "slide shows" to tell stories to other users. In ImageGrouper, the user can convert a group into a group icon. When the user selects "Iconify" from the popup menu (Figure 3), the images in the group disappear and a new icon for the group appears in the GroupPalette. When the group overlaps another group, the images in the overlapped region remain in the display. Furthermore, the users can manipulate these group icons just as they handle individual images: they can drag the group icons anywhere in the palette, and the icons can even be moved into another group box, realizing groups in a group. Finally, group icons themselves can be used as examples for content-based query. A group icon can be used as an independent query example or combined with other images and groups. In order to use a group icon as a normal query group, the user right-clicks the icon to open a popup menu and then selects "relevant," "irrelevant," or "neutral." In order to combine a group icon with other example images, the user simply draws a new rectangle and drags them into it. The Organize-by-Groups method described here is partially inspired by the Digital Library Integrated Task Environment (DLITE) [5]. In DLITE, each text document as well as the search results is visually represented by an icon, and the user can directly manipulate those documents in a workcenter (direct manipulation). In [8], Jones proposed another graphical tool for query specification, named VQuery, in which the user specifies the query by creating Venn diagrams; the number of matched documents is displayed in the center of each circle. While DLITE and VQuery are systems for text documents, the idea of direct manipulation [5] applies even more naturally to image databases. In a text document database, it is difficult to determine the contents of documents from their icons, so the user has to open another window to investigate the details [5] (in the case of DLITE, a web browser is opened). In image databases, on the other hand, the images themselves (or their thumbnails) can be used for direct manipulation. Therefore, instant judgment by the user is possible [16][22].
5
Implementation
A prototype of ImageGrouper is implemented as a client-server system, which consists of User Interface Clients and a Query Server. They communicate via the HyperText Transfer Protocol (HTTP). 5.1 The User Interface Client The user interface client of ImageGrouper is implemented as a Java2 Applet with the Swing API (Figure 3). Thus, the users can use the system through Web browsers on various platforms such as Windows, Linux, Unix, and Mac OS X. The client interacts with the user and determines his/her interests from the group information or the keyword input. When the "Query" button is pressed, it sends this information to the server, then receives the result from the server and displays it in the ResultView. Because the client is implemented in a multi-threaded manner, it remains responsive while it is downloading images. Thus, the user can drag a new image into the palette as soon as it appears in the ResultView. Note that the user interface of ImageGrouper is independent of the relevance feedback algorithms [18][19] and the extracted image features (described below). Thus, as long as the communication protocols are compatible, the user interface clients can access any image database server with various algorithms and image features. Although the retrieval performance depends on the underlying algorithms and image features used, the usability of ImageGrouper is not affected by those factors. 5.2 The Query Server The Query Server stores all the image files and their low-level visual features. These visual features are extracted and indexed in advance. When the server receives a request from a client, it computes the weights of the features and compares the user-selected images with the images in the database. Then, the server sends back the IDs of the k most similar images. The server is implemented as a Java Servlet that runs on the Apache Web Server and the Jakarta Tomcat Servlet container. It is written in Java and C++. In addition, the server is implemented as a stateless server, i.e., the server does not hold any information about the clients. This design allows different types of clients, such as the traditional user interface [13] (Figure 1) and the 3D Virtual Reality interface [14], to access the same server simultaneously. For home users who wish to organize and retrieve images locally on their PCs' hard disks, ImageGrouper can be configured as a standalone application, in which the user interface and the query server reside on the same machine and communicate directly without a Web server. Image Features. As the visual features for content-based image retrieval, we use three types of features: color, texture, and edge structure. For color features, the HSV color space is used; we extract the first two moments (mean and standard deviation) from each of the HSV channels [25], so the total number of color features is six. For texture, each image is passed through a wavelet filter bank [23] in which the images are decomposed into 10 de-correlated sub-bands, and for each sub-band the standard deviation of the wavelet coefficients is extracted.
Therefore, the total number of texture features is 10. For edge structures, we used the Water-Fill edge detector [27] to extract image structures: we first pass the original images through the edge detector to generate their corresponding edge maps, and eighteen (18) elements are then extracted from each edge map. Relevance Feedback Algorithm. The similarity ranking is computed as follows. First, the system computes the similarity of each image with respect to only one of the features. For each feature i (i ∈ {color, texture, structure}), the system computes a query vector q_i based on the positive and negative examples specified by the user. Then, it calculates the feature distance g_{ni} between each image n and the query vector,

g_{ni} = (p_{ni} - q_i)^T W_i (p_{ni} - q_i),    (1)
where p_{ni} is the feature vector of image n for feature i. For the computation of the distance matrix W_i, we used Biased Discriminant Analysis (BDA); the details of BDA are described in [26]. After the feature distances are computed, the system combines the feature distances g_{ni} into the total distance d_n. The total distance of image n is a weighted sum of the g_{ni},

d_n = u^T g_n,    (2)
where g_n = [g_{n1}, ..., g_{nI}] and I is the total number of features (in our case, I = 3). The optimal solution for the feature weighting vector u = [u_1, ..., u_I] is derived by Rui et al. [19] as

u_i = Σ_{j=1}^{I} √f_j / √f_i,    (3)
where f_i = Σ_{n=1}^{N} g_{ni} and N is the number of positive examples. This gives a higher weight to a feature whose total distance is small: if the positive examples are similar with respect to a certain feature, that feature receives a higher weight. Finally, the images in the database are ranked by their total distance, and the system returns the k most similar images.
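The ranking procedure of Eqs. (1)-(3) can be sketched compactly as follows. The snippet assumes that the per-feature weight matrices W_i have already been obtained (e.g., by BDA [26]; identity matrices are used as stand-ins here), that feature vectors are NumPy arrays, and that Eq. (1) is the quadratic form as reconstructed above; all names are ours.

```python
# Sketch of the ranking in Eqs. (1)-(3). The BDA step that produces the weight
# matrices W_i is outside this snippet; identity matrices are used as a stand-in.
import numpy as np

def rank_images(features, query_vectors, weight_matrices, positives, k=20):
    """
    features[i]        : (n_images, d_i) matrix of feature i for all images
    query_vectors[i]   : (d_i,) query vector q_i for feature i
    weight_matrices[i] : (d_i, d_i) matrix W_i (e.g. from BDA)
    positives          : indices of the positive example images
    """
    I = len(features)
    n = features[0].shape[0]
    g = np.zeros((n, I))
    for i in range(I):                                   # Eq. (1): per-feature distance
        diff = features[i] - query_vectors[i]
        g[:, i] = np.einsum("nd,de,ne->n", diff, weight_matrices[i], diff)
    f = g[positives].sum(axis=0)                         # f_i summed over positive examples
    u = np.sqrt(f).sum() / np.sqrt(f + 1e-12)            # Eq. (3): u_i = sum_j sqrt(f_j)/sqrt(f_i)
    d = g @ u                                            # Eq. (2): d_n = u^T g_n
    return np.argsort(d)[:k]                             # k most similar images

# Toy usage: 3 features (color=6, texture=10, structure=18 dims), 100 images.
rng = np.random.default_rng(0)
feats = [rng.random((100, d)) for d in (6, 10, 18)]
qs = [f[[0, 1]].mean(axis=0) for f in feats]             # query vector from two positives
Ws = [np.eye(f.shape[1]) for f in feats]                 # stand-in for BDA matrices
print(rank_images(feats, qs, Ws, positives=[0, 1], k=8))
```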
6
Future Work
We plan to evaluate our system further with respect to both usability and query performance. In particular, we will investigate the effect of the Groups in a Group query described in Section 2.3. As mentioned in [11], the traditional precision/recall measure is not very suitable for evaluating interactive retrieval systems; therefore, we may need to consider more appropriate evaluation methods for the system [12][22]. Next, in the current system, when more than one group is selected as positive, they are merged into one group, i.e., all images in those groups are considered as positive examples. We are investigating a scheme where different positive groups are treated as different classes of examples [28]. In addition, for advanced users, we are going to add support for group-wise feature selection. Although our system automatically determines the feature weights, advanced users might know which features are important for their query. Thus, we will allow the users to specify which features should be considered for each group: some groups might be important in terms of color features only, while others might be important in terms of structure. Finally, because the implementation of
ImageGrouper does not depend on underlying retrieval technologies, it can be used as a benchmarking tool [12] for various image retrieval systems.
7
Conclusion
In this paper, we presented ImageGrouper, a new user interface for digital image retrieval and organization. In this system, the users search, annotate, and organize digital images by groups. ImageGrouper has several advantages regarding image retrieval, text annotation, and image organization. First, in content-based image retrieval (CBIR), predicting a good combination of query examples is very difficult; thus, trial and error is essential for successful retrieval. However, previous systems assume incremental search and do not support trial-and-error search. The Query-by-Groups concept in ImageGrouper, on the other hand, allows the user to try different combinations of query examples quickly and easily. We showed that this lightweight operation helps the users achieve a higher recall rate. Second, the Groups in a Group configuration makes narrowing-down search possible. This method helps the user find both positive and negative examples and provides him/her with more choices. Next, typing text information for a large number of images is very tedious and time consuming. The Annotation-by-Groups method relieves users of this task by allowing them to annotate multiple images at the same time. The Groups in a Group method realizes hierarchical annotation, which was difficult in previous systems. Moreover, by allowing groups to overlap each other, ImageGrouper further reduces typing. In addition, our concept of image groups also applies to organizing image collections: a group in the GroupPalette can be shrunk into a small icon, and these group icons can be used as "photo albums" which can be directly manipulated and organized by the users. Finally, these three concepts (Query-by-Groups, Annotation-by-Groups, and Organize-by-Groups) share similar gestural operations, i.e., dragging images and drawing a rectangle around them. Thus, once the user has learned one task, s/he can easily adapt to the other tasks. Operations in ImageGrouper are also similar to the file operations used on Windows and Macintosh computers as well as in most drawing programs. Therefore, the user can easily learn to use our system.
Acknowledgement This work was supported in part by National Science Foundation Grant CDA 9624396.
References 1. Balabanovic, M., Chu, L.L. and Wolff, G.J. Storytelling with Digital Photographs. In CHI'00, 2000. 2. Bates, M.J. The design of browsing and berrypicking techniques for the on-line search interface. Online Review, 13(5), pp. 407-431, 1989. 3. Bederson, B.B. Quantum Treemaps and Bubblemaps for a Zoomable Image Browser. HCIL Tech Report #2001-10, University of Maryland, College Park, MD 20742. 4. Chen, J-Y., Bouman, C.A., and Dalton, J.C. Hierarchical Browsing and Search of Large Image Databases. In IEEE Trans. on Image Processing, Vol. 9, No. 3, pp. 442-455, March 2000.
5. Cousins, S.B., et al. The Digital Library Integrated Task Environment (DLITE). In 2nd ACM International Conference on Digital Libraries, 1997. 6. Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T.V. and Yianilos, P.N. The Bayesian Image Retrieval System, PicHunter: Theory, Implementation, and Psychophysical Experiments. In IEEE Transactions on Image Processing, Vol. 9, No. 1, January 2000. 7. Flickner, M., Sawhney, H., et al. Query by Image and Video Content: The QBIC System. In IEEE Computer, Vol. 28, No. 9, pp. 23-32, September 1995. 8. Jones, S. Graphical Query Specification and Dynamic Result Previews for a Digital Library. In UIST'98, 1998. 9. Kuchinsky, A., Pering, C., Creech, M.L., Freeze, D., Serra, B. and Gwizdka, J. FotoFile: A Consumer Multimedia Organization and Retrieval System. In CHI'99, 1999. 10. Laaksonen, J., Koskela, M. and Oja, E. Content-based image retrieval using self-organizing maps. In Proc. of 3rd Intl. Conf. on Visual Information and Information Systems, 1999. 11. Lagergren, E. and Over, P. Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment. In ACM SIGIR'98, 1998. 12. Müller, H., et al. Automated Benchmarking in Content-based Image Retrieval. In Proc. of IEEE International Conference on Multimedia and Expo 2001, August 2001. 13. Nakazato, M., et al. UIUC Image Retrieval System for JAVA, available at http://chopin.ifp.uiuc.edu:8080. 14. Nakazato, M. and Huang, T.S. 3D MARS: Immersive Virtual Reality for Content-based Image Retrieval. In Proc. of IEEE International Conference on Multimedia and Expo 2001. 15. O'Day, V.L. and Jeffries, R. Orienteering in an information landscape: how information-seekers get from here to there. In INTERCHI '93, 1993. 16. Pecenovic, Z., Do, M-N., Vetterli, M. and Pu, P. Integrated Browsing and Searching of Large Image Collections. In Proc. of Fourth Intl. Conf. on Visual Information Systems, Nov. 2000. 17. Rodden, K., Basalaj, W., Sinclair, D. and Wood, K. Does Organization by Similarity Assist Image Browsing? In CHI'01, 2001. 18. Rui, Y., Huang, T.S., Ortega, M. and Mehrotra, S. Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval. In IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, Sept. 1998. 19. Rui, Y. and Huang, T.S. Optimizing Learning in Image Retrieval. In IEEE CVPR '00, 2000. 20. Santini, S. and Jain, R. Integrated Browsing and Querying for Image Databases. IEEE Multimedia, Vol. 7, No. 3, 2000, pp. 26-39. 21. Shneiderman, B. and Kang, H. Direct Annotation: A Drag-and-Drop Strategy for Labeling Photos. In Proc. of the IEEE Intl. Conf. on Information Visualization (IV'00), 2000. 22. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A. and Jain, R. Content-based Image Retrieval at the End of the Early Years. In IEEE PAMI, Vol. 22, No. 12, December 2000. 23. Smith, J.R. and Chang, S-F. Transform features for texture classification and discrimination in large image databases. In Proc. of IEEE Intl. Conf. on Image Processing, 1994. 24. Smith, J.R. and Chang, S-F. VisualSEEk: a fully automated content-based image query system. In ACM Multimedia'96, 1996. 25. Stricker, M. and Orengo, M. Similarity of Color Images. In Proc. of SPIE, Vol. 2420 (Storage and Retrieval of Image and Video Databases III), SPIE Press, Feb. 1995. 26. Zhou, X. and Huang, T.S. A Generalized Relevance Feedback Scheme for Image Retrieval. In Proc. of SPIE Vol. 4210: Internet Multimedia Management Systems, 6-7 November 2000. 27. Zhou, X.S. and Huang, T.
S. Edge-based structural features for content-based image retrieval. Pattern Recognition Letters, Special issue on Image and Video Indexing, 2000. 28. Zhou, X.S., Petrovic, N. and Huang, T.S. Comparing Discriminating Transformations and SVM for Learning during Multimedia Retrieval. In ACM Multimedia '01, 2001.
Toward a Personalized CBIR System*
Chih-Yi Chiu1, Hsin-Chih Lin2,**, and Shi-Nine Yang1
1 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300
{cychiu,snyang}@cs.nthu.edu.tw
2 Department of Information Management, Chang Jung Christian University, Tainan, Taiwan 711
hclin@mail.cju.edu.tw
Abstract. A personalized CBIR system based on a unified framework of fuzzy logic is proposed in this study. The user preference in image retrieval can be captured and stored in a personal profile. Thus, images that appeal to the user can be effectively retrieved. Our system provides users with textual descriptions, visual examples, and relevance feedback in a query. The query can be expressed in a query description language, which is characterized by the proposed syntactic rules and semantic rules. In our system, the semantic gap problem can be eliminated by the use of linguistic terms, which are represented as fuzzy membership functions. The syntactic rules refer to the way that linguistic terms are generated, whereas the semantic rules refer to the way that the membership function of each linguistic term is generated. The problem of human perception subjectivity can be eliminated by the proposed profile updating and feature re-weighting methods. Experimental results demonstrate the effectiveness of our system.
1
Introduction
Content-based image retrieval (CBIR) has received much research interest recently [1-4]. However, several problems prevent CBIR systems from becoming popular. Two examples of these problems are [3-4]: (1) the semantic gap between image features and human perceptions in characterizing an image, and (2) the human perception subjectivity in finding target images. Most CBIR systems provide users with query-by-example and/or query-by-sketch schemes. Since the features extracted from the query are low-level, it is not easy for users to supply a suitable example/sketch in the query. If a query fails to reflect the user preference, the retrieval results may be unsatisfactory. To capture the user preference in image retrieval, relevance feedback provides a useful scheme [5-6]. However, since the features extracted from feedback examples are also low-level, the user may take many feedback iterations to find a target image [7].
* This study was partially supported by the National Science Council, R.O.C., under Grant NSC90-2213-E-309-004 and by the Ministry of Education, R.O.C., under Grant 89-E-FA04-1-4. ** Corresponding author.
To overcome the above-mentioned problems, a personalized CBIR system based on a unified framework of fuzzy logic is proposed in this study. Our system consists of two major phases: (1) database creation and (2) query comparison, as shown in Fig. 1. The database creation phase deals with the methods for feature extraction and linguistic term generation. In this study, Tamura features [8] are used as our texture representation. To eliminate the semantic gap problem in image retrieval, we propose an unsupervised fuzzy clustering algorithm to generate linguistic terms and their membership functions. The linguistic terms provide textual descriptions that abstract human perceptions of images, whereas the membership functions measure the similarity between a query and each database image. The query comparison phase deals with the methods for query parsing, profile updating, feature re-weighting, similarity function inference, and similarity computation. To eliminate the problem of human perception subjectivity in image retrieval, we propose profile updating and feature re-weighting methods to capture the user preference at each (relevance) feedback. The user preference is stored in a personal profile, so that images that appeal to the user can be effectively retrieved.
Fig. 1. The system overview: (a) database creation; (b) query comparison.
2
Database Creation
2.1
Feature Extraction
Our texture features should have the following characteristics. (1) The features characterize low-level texture properties. (2) These properties are perceptually meaningful; humans can easily interpret these properties by textual descriptions. In this study, six Tamura features [8], including coarseness, contrast, directionality, line-likeness, regularity, and roughness, are used to test the system performance.
2.2
Linguistic Term Generation
In this study, degrees of appearance on each feature are interpreted as five linguistic terms, as summarized in Table 1. Each linguistic term is represented as a membership function and is further defined by the proposed syntactic rules (Table 2) and semantic rules (Table 3). The syntactic rules refer to the way that linguistic terms are generated, whereas the semantic rules refer to the way that the membership function of each linguistic term is generated. In this study, the sigmoidal function is used to formulate the membership functions. The membership functions of the linguistic terms on each feature are generated as follows.

Table 1. Linguistic terms for the six features.
Coarseness: very fine, fine, medium coarse, coarse, very coarse
Contrast: very low, low, medium contrast, high, very high
Directionality: very non-directional, non-directional, medium directional, directional, very directional
Line-likeness: very blob-like, blob-like, medium line-like, line-like, very line-like
Regularity: very irregular, irregular, medium regular, regular, very regular
Roughness: very smooth, smooth, medium rough, rough, very rough
Table 2. Syntactic rules.
QueryDescriptionLanguage ::= {QueryExpression ⊕ Connective}
QueryExpression ::= <empty> | TextualDescription | VisualExample
TextualDescription ::= Negation ⊕ Hedge ⊕ LinguisticTerm
VisualExample ::= Negation ⊕ Hedge ⊕ RelevanceAdjective ⊕ TamuraFeature ⊕ #ExampleID
Negation ::= <empty> | 'not'
Hedge ::= <empty> | 'more or less' | 'quite' | 'extremely'
LinguisticTerm ::= 'very fine' | 'fine' | 'medium coarse' | 'coarse' | 'very coarse' | … | 'very smooth' | 'smooth' | 'medium rough' | 'rough' | 'very rough'
TamuraFeature ::= 'coarseness' | 'contrast' | 'directionality' | 'line-likeness' | 'regularity' | 'roughness'
RelevanceAdjective ::= 'relevant' | 'irrelevant'
Connective ::= <empty> | 'and' | 'or'
Algorithm 1. Unsupervised Fuzzy Clustering.
Input: Data sequence (f_1, f_2, ..., f_n), where f_i denotes the value of a feature in the ith database image, and n is the number of database images.
Output: Five membership functions P_1, P_2, ..., P_5 on the feature.
Step 1. Set c_0 = 0, c_6 = 1, and c_j = j/6, j = 1, 2, ..., 5, where c_0 and c_6 are the two bounds of the universe and c_1, c_2, ..., c_5 denote the centers of the five linguistic terms.
Table 3. Semantic rules.
Semantic rules for the membership function µ_Q, where Q is a query expression on a feature:
• LinguisticTerm ⇒ µ_Q(v) = P_j(v), where v is the feature value of the image example and P_j(v) is defined in Eq. (1) (Q is a textual description).
• #ExampleID ⇒ µ_Q(v) = K(v) = 1/(1 + e^{-a(v-b)}) · 1/(1 + e^{-c(v-d)}), where a, b, c, d are the parameters of the membership function K (Q is a set of image examples).
• Hedge ⇒ µ_{Q^h}(v) = [µ_Q(v)]^h
• 'not' ⇒ µ_{¬Q}(v) = 1 − µ_Q(v)
• 'and' ⇒ µ_{Q1∧Q2}(v) = min[µ_{Q1}(v), µ_{Q2}(v)]
• 'or' ⇒ µ_{Q1∨Q2}(v) = max[µ_{Q1}(v), µ_{Q2}(v)]
Step 2. Set the membership matrix U = 0. For each datum f_i, update each element u_{i,j} using one of the following rules:
Rule 1. If f_i ≤ c_1, set u_{i,1} = 1 and u_{i,j≠1} = 0.
Rule 2. If c_j < f_i ≤ c_{j+1}, set u_{i,j} = (c_{j+1} − f_i)/(c_{j+1} − c_j), u_{i,j+1} = 1 − u_{i,j}, and u_{i,k≠j,j+1} = 0.
Rule 3. If f_i > c_5, set u_{i,j≠5} = 0 and u_{i,5} = 1.
Step 3. Compute c_1, c_2, ..., c_5 using c_j = (Σ_{i=1}^{n} u_{i,j} f_i) / (Σ_{i=1}^{n} u_{i,j}). If the change of any c_j exceeds a given threshold, go to Step 2.
Step 4. The membership function P_j(v) of the j-th linguistic term is defined as

P_j(v) = 1/(1 + e^{-a(v-b)}) · 1/(1 + e^{-c(v-d)}),    (1)

where v is the feature value, a = k/(c_j − c_{j−1}), b = (c_j + c_{j−1})/2, c = −k/(c_{j+1} − c_j), d = (c_j + c_{j+1})/2, and k > 0. The parameters a, b, c, d are stored in the personal profile.
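A direct transcription of Algorithm 1 into Python is sketched below. It assumes the feature values have been normalized to the universe [0, 1]; the convergence threshold, the iteration cap, and the constant k are choices of this sketch rather than values given in the paper.

```python
# Sketch of Algorithm 1: five cluster centres on [0, 1], iteratively refined,
# then sigmoidal-product membership functions (Eq. 1) built from the centres.
# Feature values are assumed to be normalized to [0, 1].
import numpy as np

def fuzzy_linguistic_terms(values, k=6.0, tol=1e-4, max_iter=100):
    f = np.asarray(values, dtype=float)
    c = np.array([j / 6.0 for j in range(1, 6)])          # Step 1: initial centres c1..c5
    for _ in range(max_iter):
        u = np.zeros((len(f), 5))                         # Step 2: membership matrix U
        for i, fi in enumerate(f):
            if fi <= c[0]:
                u[i, 0] = 1.0                             # Rule 1
            elif fi > c[4]:
                u[i, 4] = 1.0                             # Rule 3
            else:
                j = np.searchsorted(c, fi) - 1            # c_j < f_i <= c_{j+1}
                u[i, j] = (c[j + 1] - fi) / (c[j + 1] - c[j])   # Rule 2
                u[i, j + 1] = 1.0 - u[i, j]
        new_c = (u * f[:, None]).sum(axis=0) / np.maximum(u.sum(axis=0), 1e-12)  # Step 3
        if np.max(np.abs(new_c - c)) < tol:
            c = new_c
            break
        c = new_c
    centres = np.concatenate(([0.0], c, [1.0]))            # c0 = 0, c6 = 1
    def make_P(j):                                         # Step 4 / Eq. (1)
        a = k / (centres[j] - centres[j - 1])
        b = (centres[j] + centres[j - 1]) / 2.0
        cc = -k / (centres[j + 1] - centres[j])
        d = (centres[j] + centres[j + 1]) / 2.0
        return lambda v: 1.0 / (1.0 + np.exp(-a * (v - b))) / (1.0 + np.exp(-cc * (v - d)))
    return [make_P(j) for j in range(1, 6)]                # P1..P5

P = fuzzy_linguistic_terms(np.random.default_rng(1).random(500))
print([round(float(P[j](0.5)), 3) for j in range(5)])      # memberships of v = 0.5
```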
3
Query Comparison
3.1
Query Parsing
In this study, a query is defined as a logical combination of query expressions over all features. The query is expressed in a query description language, which is characterized by the proposed syntactic rules (Table 2) and semantic rules (Table 3), and is parsed accordingly.
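The sketch below shows how a parsed query expression could be evaluated with the Table 3 semantics (hedge as an exponent, 'not' as complement, 'and'/'or' as min/max). The tuple-based parse tree and the numeric hedge exponents are assumptions of this sketch, not part of the paper's parser.

```python
# Sketch of evaluating a parsed query expression with the Table 3 semantics.
# The nested-tuple representation of the parse tree and the hedge exponents
# below are assumptions, not the paper's.
HEDGES = {None: 1.0, "more or less": 0.5, "quite": 1.25, "extremely": 2.0}  # illustrative exponents

def evaluate(expr, v, terms):
    """
    expr  : ('term', hedge, negated, name) | ('and'|'or', left, right)
    v     : feature value of a database image
    terms : dict name -> membership function (P_j or K)
    """
    op = expr[0]
    if op == "term":
        _, hedge, negated, name = expr
        mu = terms[name](v) ** HEDGES[hedge]          # Hedge rule: mu^h
        return 1.0 - mu if negated else mu            # 'not' rule
    _, left, right = expr
    a, b = evaluate(left, v, terms), evaluate(right, v, terms)
    return min(a, b) if op == "and" else max(a, b)    # 'and'/'or' rules

# "very fine and not (quite) coarse" against toy membership functions.
terms = {"very fine": lambda v: max(0.0, 1.0 - 6.0 * v),
         "coarse": lambda v: min(1.0, max(0.0, 3.0 * v - 1.0))}
query = ("and", ("term", None, False, "very fine"), ("term", "quite", True, "coarse"))
print(round(evaluate(query, 0.1, terms), 3))
```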
3.2
Profile Updating
Suppose a user has posed a query. If the retrieval results are unsatisfactory, the user may pose feedback examples for the next retrieval. At each feedback, the personal profile, i.e., the parameters of the membership functions, is updated as follows. For relevant examples, the weighted average center x̄ of these examples is computed, and the previous membership function is pulled toward the center; we define an error function E = [1 − µ'(x̄)]², where µ' is the previous membership function on the feature. For irrelevant examples, the previous membership function is pushed away from these examples individually; we define an error function E = Σ_j [0 − µ'(f_j)]², where f_j is the feature value (on the feature) of the j-th irrelevant example. To minimize E, the gradient descent method is used as follows:

∆φ = −η (∂E/∂φ),

where φ is a parameter of µ', η is the learning rate, and φ + ∆φ is the updated parameter in the personal profile. Fig. 2 illustrates the underlying idea.
Fig. 2. Updating the membership function through relevance feedbacks: the membership function is pulled toward the weighted average center of the relevant examples and pushed away from the irrelevant examples.
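The profile update can be sketched as follows. The membership function uses the sigmoidal form with parameters (a, b, c, d); for brevity the gradients ∂E/∂φ are approximated numerically here, which is an implementation shortcut of this sketch rather than the paper's analytic derivation.

```python
# Sketch of the profile update: pull the membership function towards the
# weighted average centre of the relevant examples and push it away from each
# irrelevant example, by gradient descent on the parameters (a, b, c, d).
# Numerical gradients are an implementation detail of this sketch only.
import numpy as np

def membership(params, v):
    a, b, c, d = params
    return 1.0 / (1.0 + np.exp(-a * (v - b))) / (1.0 + np.exp(-c * (v - d)))

def update_profile(params, relevant, irrelevant, weights=None, eta=0.5, eps=1e-5):
    relevant = np.asarray(relevant, dtype=float)
    w = np.ones_like(relevant) if weights is None else np.asarray(weights, float)
    x_bar = float((w * relevant).sum() / w.sum())           # weighted average centre

    def error(p):
        e = (1.0 - membership(p, x_bar)) ** 2               # relevant: target mu = 1
        e += sum((0.0 - membership(p, f)) ** 2 for f in irrelevant)  # irrelevant: target 0
        return e

    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(4):                                       # numerical dE/dphi
        step = np.zeros_like(params); step[i] = eps
        grad[i] = (error(params + step) - error(params - step)) / (2 * eps)
    return params - eta * grad                               # phi <- phi + delta(phi)

p0 = np.array([30.0, 0.3, -30.0, 0.6])                       # a, b, c, d from the profile
p1 = update_profile(p0, relevant=[0.52, 0.55], irrelevant=[0.1])
print(np.round(p1, 3))
```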
3.3
Feature Re-weighting
Suppose a user has posed a query. After several feedbacks, the user's emphasis on each feature can be evaluated from the feedback history. We propose the following feature re-weighting algorithm to fine-tune the weight of each feature in image retrieval.
Algorithm 2. Feature Re-weighting.
Input: A series of the previous k weights, denoted W^(k), and the query expression Q on a feature.
Output: A series of k + 1 weights W^(k+1), and the similarity between Q and v on the feature, denoted s_Q(v).
Step 1. If there is no relevant example in Q, set the parameter κ = 1. Otherwise let κ = cos(σ × π/2), where σ is the standard deviation of the relevant examples.
Step 2. Update W^(k) to W^(k+1) as follows:

W^(k+1)_{k+1} = ακ + Σ_{i=1}^{k} β_i^(k) × W_i^(k),

where β^(k) is a series of decreasing coefficients, each of which denotes the corresponding importance in W^(k), and α + Σ β_i^(k) = 1.
Step 3. In the parse tree of the query, two query expressions are combined by a connective c. Let v denote the feature value of a database image. The weighted similarity between Q and v is computed as follows:

s_Q(v) = 1 − W^(k+1)_{k+1} × [1 − µ_Q(v)]   if c = 'and',
s_Q(v) = W^(k+1)_{k+1} × µ_Q(v)   if c = 'or',    (2)

where µ_Q(v) is the membership value of Q for v. Computations of the membership value will be discussed in Sections 3.4 and 3.5.
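A sketch of Algorithm 2 follows. The paper only requires the coefficients β to decrease and α + Σβ_i = 1; the geometric choice of β and the value of α below are assumptions of this sketch.

```python
# Sketch of Algorithm 2 (feature re-weighting). The geometric choice of the
# decreasing coefficients beta and of alpha is an assumption of this sketch;
# the paper only requires alpha + sum(beta) = 1 with beta decreasing.
import numpy as np

def reweight(prev_weights, relevant_values, alpha=0.5):
    """prev_weights: list of the previous k weights W^(k); returns W^(k+1)."""
    rel = np.asarray(relevant_values, dtype=float)
    kappa = 1.0 if rel.size == 0 else np.cos(rel.std() * np.pi / 2.0)   # Step 1
    k = len(prev_weights)
    if k == 0:
        return [alpha * kappa]                                           # no history yet
    beta = np.array([0.5 ** (i + 1) for i in range(k)])                  # decreasing
    beta = (1.0 - alpha) * beta / beta.sum()                             # alpha + sum(beta) = 1
    new_w = alpha * kappa + float((beta * np.asarray(prev_weights)).sum())  # Step 2
    return list(prev_weights) + [new_w]

def weighted_similarity(weight, mu, connective):
    """Step 3 / Eq. (2): combine the membership value with the feature weight."""
    return 1.0 - weight * (1.0 - mu) if connective == "and" else weight * mu

W = reweight([], relevant_values=[0.42, 0.45, 0.47])
W = reweight(W, relevant_values=[0.43, 0.46])
print(W, round(weighted_similarity(W[-1], 0.8, "and"), 3))
```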
3.4
Similarity Function Inference
After the personal profile is updated or the features are re-weighted, new similarity functions must be inferred to reflect the user preference. The inference method is as follows:
Type 1. If Q = <empty>, set µ_Q(v) = 0.
Type 2. If Q is a textual description, set µ_Q(v) = (−1)^{N+1}[N − P_j^h(v)], where P_j is defined in Eq. (1) and h is a hedge; N = 1 if Q is a negative expression, else N = 0.
Type 3. Q is a set of n visual examples. If there is no relevant example in Q, set µ_Q(v) = 0. Otherwise, compute the weighted average center x̄ and the standard deviation σ on the feature and define the membership function as

µ_Q(v) = (−1)^{N+1}[N − K^h(v)],

where K is defined in Table 3 and a = k/(σ + δ), b = x̄ − (σ + δ), c = −a, d = x̄ + (σ + δ), δ > 0, and k > 0. Note that the parameters of µ_Q are stored in the personal profile. Each feature has its own membership functions and an equal feature weight at a new search. The weighted similarity between a query and each database image on the feature is computed using
Eq. 2. Finally, the total similarity function for the query can be inferred through min-max compositions of all weighted similarity functions on each feature. If the previous query on a feature consisted of textual descriptions or visual examples, the current query expression on the feature is treated as a relevance feedback. We use the gradient descent method to modify the membership functions on each feature from the feedback history. Again, the total similarity function is inferred through min-max compositions of all weighted similarity functions. 3.5
Similarity Computation
Let D be a collection of database images and V be the set of feature values for an arbitrary database image. The similarity between the query and each database image is denoted as a fuzzy set A in D: A = {(V, S(V)) | V ∈ D} = Σ_{V∈D} S(V)/V,
where S is the total similarity function inferred from the query, and S(V) is the similarity between the query and the database image V. Our system computes the fuzzy set A and outputs the ranked images according to the similarity in descending order. The user can browse the results and feed relevant/irrelevant examples in the next retrieval if necessary.
4
Experimental Results
Our database contains 1444 texture images collected from the Corel Gallery Collection. Fig. 3a shows the results for the query "very fine ∧ very directional ∧ very regular." The retrieved images are displayed in descending similarity order from left to right and top to bottom. Fig. 3b shows the results when we select the second, fifth, and eighth images in Fig. 3a as relevant examples. To measure the system performance, we use 450 texture images as testing data. The original 50 512×512 texture images are obtained from MIT VisTex; each image is partitioned into nine 170×170 non-overlapping sub-images, which are regarded as relevant images. Fig. 4a shows the PR graph for a conjunction of all queries with feature re-weighting. The increase in precision and recall is largest at the first feedback; this fast convergence is a desirable property. Fig. 4b shows the PR graph for the same queries as in Fig. 4a but without feature re-weighting. Clearly, the performance with feature re-weighting outperforms that without feature re-weighting.
5
Conclusions and Future Work
A personalized CBIR system is proposed in this study. The methods for generating linguistic terms, updating the personal profile, re-weighting features, inferring similarity functions, and computing the similarity are all based on a unified framework of fuzzy logic. According to the experimental results, the semantic gap problem can be
bridged through the use of linguistic terms. The problem of human perception subjectivity can be solved through our profile updating and feature re-weighting algorithms. Besides remedying these problems, our personalized CBIR system can achieve higher accuracy in image retrieval. The PR graphs strongly support the above-mentioned claims.
Fig. 3. (a) Retrieval results for the query "very fine ∧ very directional ∧ very regular;" (b) retrieval results for the three relevant examples from Fig. 3a.
Fig. 4. (a) PR graph with feature re-weighting; (b) PR graph without feature re-weighting. The curves correspond to 0, 1, 2, and 3 relevance feedback iterations.
For future work, we will explore efficient multidimensional indexing techniques to make our system scalable for large image collections. Another important aspect is putting our system into practice. For example, textile pattern retrieval may be a promising application in the future.
References 1. Aigrain, P., Zhang, H. J., Petkovic, D.: Content-Based Representation and Retrieval of Visual Media: A State-of-The-Art Review. Multimedia Tools and Applications 3 (1996) 179-202 2. Idris, F., Panchanathan, S.: Review of Image and Video Indexing Techniques. Journal of Visual Communication and Image Representation 8 (1997) 146-166 3. Rui, Y., Huang, T. S., Chang, S. F.: Image Retrieval: Current Techniques, Promising Directions, and Open Issues. Journal of Visual Communication and Image Representation 10 (1999) 39-62 4. Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1349-1380 5. Minka, T. P., Picard, R. W.: Interactive Learning with a Society of Models. Pattern Recognition 30 (1997) 565-582 6. Rui, Y., Huang, T. S., Mehrotra, S.: Content-Based Image Retrieval with Relevance Feedback in MARS. IEEE International Conference on Image Processing, Vol. 2, Santa Barbara, CA, USA (1997) 815-818 7. Lu, Y., Hu, C., Zhu, X., Zhang, H. J., Yang, Q.: A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems. ACM International Conference on Multimedia, Los Angeles, CA, USA (2000) 31-37 8. Tamura, H., Mori, S., Yamawaki, T.: Texture Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man, and Cybernetics 8 (1978) 460-473
An Efficient Storage Organization for Multimedia Databases
Philip K.C. Tse1 and Clement H.C. Leung2
1 Department of Electrical and Electronic Engineering, University of Hong Kong, Pokfulam Road, Hong Kong SAR, China. ptse@eee.hku.hk
2 School of Communications and Informatics, Victoria University, P.O. Box 14428, MCMC, Vic 8001, Australia. clement@matilda.vu.edu.au
Abstract. Multimedia databases may require storage space so large that magnetic disks become neither practical nor economical. Hierarchical storage systems provide extensive storage capacity for multimedia data at very economical cost, but the long access latency of tertiary storage devices and the large disk buffers required make them infeasible for multimedia databases and visual information systems. In this paper, we investigate the data striping method for heterogeneous multimedia data streams on HSS. First, we have found that the multimedia objects should be striped across all media units to achieve the highest system throughput and the smallest disk buffer consumption. Second, we have proved a feasibility condition for accepting concurrent streams. We have carried out experiments to study its performance, and it is observed that the concurrent striping method can significantly increase the system throughput, reduce the stream response time, and lower the need for disk buffers, offering considerable advantages and flexibility.
1
Introduction
Visual and Multimedia Information Systems (VIS) need to capture, process, store, and maintain a variety of information sources such as text, sound, graphics, images and video [18]. Such a system may be viewed at different levels: a user-transparent multimedia operating system with specific applications sitting on top of it (Fig. 1). The application layer always includes a multimedia database management system, which will rely on a suitable storage structure to support its operation. Multimedia databases need to store a variety of types of data. Popular or frequently accessed multimedia objects may reside permanently in the disks together with metadata, indexes, and other files. Cold multimedia objects and transaction log files are stored on tertiary store. Only the first portion of each object resides in disks. We focus on the retrieval of cold multimedia objects in this paper.
2
The Performance Problem and Relationship with Other Works
Most computer systems store their on-line data on disks, but storing huge amounts of multimedia data on disks is expensive. Multi-level hierarchical storage systems (HSS) provide large capacity at a more economical cost than disk-only systems [1].
However, such a storage structure invariably includes the long access latency of data held in tertiary storage devices [4].
Fig. 1. The performance of multimedia information systems is determined by the underlying storage structure: the multimedia DBMS in the application layer sits on an HSS storage structure, above the multimedia OS and hardware.
Traditionally, tertiary storage devices store each object in its entirety on the media units using the non-striping method. When a burst of streams arrives, response time deteriorates because the streams are served in serial order. This is thus inefficient for multimedia databases, where multiple objects are often accessed simultaneously. The simple striping method and the time-slice scheduling algorithm have been proposed to reduce the stream response time using extra switching [9, 16]. However, the extra switching overheads and the contention for exchange erode system throughput. Hence, these methods are appropriate only under light load conditions. The new concurrent striping method was shown to be efficient for homogeneous streams [30, 32]. We extend the concurrent striping method to handle heterogeneous streams in this paper. Multimedia objects may either be staged or pipelined from tertiary storage devices [28, 31]. We consider only the more efficient pipelining methods in this paper. 2.1
Relationship with Other Works
The continuous display requirement is necessary to guarantee that multimedia data streams can be displayed without interruption. In [24], data blocks of multimedia streams are interleaved using the Storage Pattern Altering policy with a fixed transfer rate over both the media and gap blocks on optical disks. We generalize this interleaving placement method by interleaving streams over the temporal domain instead of the spatial domain. This allows the feasibility condition to be applied to more general storage devices and arbitrary scheduling methods. Many techniques for storing multimedia data strips on disk arrays have been studied in the literature. Data distribution and replication are studied in [6, 26, 33]. Data striping in disk-only systems is analyzed in [2]. Constraint placement methods in [8, 13, 20] provide sufficient throughput for multimedia data retrieval on disks. Our method is the first constraint allocation method on HSS. Much research on the delivery of multimedia data has been done. Piggybacking and patching methods in [3, 11, 12], the multi-casting protocols in [17, 23], intelligent cache management techniques in [21], and proxy server studies in [10, 22, 25, 34]
reduce the need for repetitive delivery of the same objects from the server. Quality of service guarantees over the network are studied in [15, 19, 27]. Some data striping methods on HSS have been proposed [7, 29]. Placement on the tertiary storage devices is optimized for random accesses but multimedia streams retrieve data continuously. In [5], a parallel striping method is studied, and the performance of random workload and the optimal strip width on simple striping systems are considered in [14]. The possibility of striping across all tapes is somehow excluded from the study. We shall describe the concurrent striping method and concurrent streams management in the next Section. We then establish the feasibility conditions in Section 4. We shall present the system performance in Section 5 and the experimental results in Section 6. This paper is concluded in Section 7.
3
Concurrent Striping
In the concurrent striping method, we divide the media units into several groups, one group per tertiary drive, and then arrange the media units in a fixed sequence. Each multimedia data object is partitioned into a number of segments. We assume that each segment is a logical unit that can be displayed for a fixed time after the previous segment has been displayed. We also assume that each object is accessed sequentially only, in a fixed order. The segments are then placed on the media units following this sequence, with one segment per media unit. Each object should have all its segments placed together. When multimedia objects are accessed, the multimedia DBMS initiates new streams to access the data objects. A new stream is accepted only if the maximum number of concurrent streams has not yet been reached; otherwise, the new stream is placed in a stream queue (Fig. 2). Once accepted, a new stream is created; it sends two requests to every tertiary drive and waits. The tertiary drives access data independently, and an accepted stream starts to display data after at least one request has completed at each drive. Each tertiary drive keeps the waiting requests in two queues: the first queue holds waiting requests that access segments on the current media unit, while the second queue holds requests that access data on other media units. The order in which requests are served is controlled by the SCAN scheduling policy. The robot arm serves the exchange requests in a round-robin manner.
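The placement rule can be sketched as follows; how the fixed sequence of media units interleaves across the drive groups is not specified above, so the round-robin interleaving here is an assumption of the sketch.

```python
# Sketch of the placement rule described above: media units are grouped one
# group per tertiary drive and ordered in a fixed sequence, and consecutive
# segments of an object are placed on consecutive media units in that
# sequence (one segment per unit). The interleaved ordering is an assumption.
def stripe_object(n_segments, n_drives, units_per_drive, start_unit=0):
    """Return [(drive, media_unit_within_drive, segment_index), ...]."""
    total_units = n_drives * units_per_drive
    placement = []
    for seg in range(n_segments):
        unit = (start_unit + seg) % total_units     # fixed global sequence of media units
        drive = unit % n_drives                     # interleave the sequence across drives
        local_unit = unit // n_drives               # media unit within that drive's group
        placement.append((drive, local_unit, seg))  # wraps around if segments > units
    return placement

# An object with 8 segments striped over 3 drives with 4 media units each:
for drive, unit, seg in stripe_object(8, n_drives=3, units_per_drive=4):
    print(f"segment {seg} -> drive {drive}, media unit {unit}")
```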
4
Feasibility Conditions
The notations in Table 1 will be used in studying the feasibility conditions. We assume that each stream seeks with an overhead of S seconds and transfers a segment using M seconds. After that, the stream suspends data retrieval for G seconds. Each segment can display for δ seconds. A multimedia stream (M, δ) is acceptable if and only if it satisfies the continuous display requirement: S + M ≤ δ.
(1)
Fig. 2. Concurrent Streams Management
This continuous display requirement must be maintained over a finite period of time. It can temporarily be violated by satisfying requests in advance and keeping the retrieved data in read-ahead buffers. The average ratio of transfer time to display time must, however, be maintained over a finite period of time.

Table 1. Notations
S: access overheads
M: transfer time
G: gap time
δ: display time

4.1
Homogeneous Streams
Multimedia streams are considered homogeneous if all streams have the same display time period δ. Let n streams be characterized by (M_1, δ), (M_2, δ), ..., (M_n, δ). Let S_i be the access overhead time in serving each stream and G_i be the time gap of the ith stream, for i = 1 to n. By the definition of the time gap, we have S_i + M_i + G_i ≤ δ.
(2)
Corollary 1: n streams can be concurrent if and only if S_1 + M_1 + S_2 + M_2 + … + S_n + M_n ≤ δ.
(3)
Due to space limits, the proofs of Corollaries 1 and 2 are omitted here; their validity follows directly as special cases of Corollary 3.
4.2
Heterogeneous Streams
Multimedia streams are considered heterogeneous when their cycle periods are different. Let n streams be characterized by (M_1, δ_1), (M_2, δ_2), ..., (M_n, δ_n) such that not all δ_i are the same. Let S_1 to S_n be the access overhead times in serving each stream. Corollary 2: n streams can be concurrent if and only if
(S_1 + M_1)/δ_1 + (S_2 + M_2)/δ_2 + ... + (S_n + M_n)/δ_n ≤ 1.    (4)

4.3
Heterogeneous Streams with Multiple Devices
When multiple devices are available, the devices may serve the streams independently or in parallel. When the streams are served in parallel, the devices are treated as a single device with different access overheads and transfer rate. When the streams are served independently, one request is served by one device at a time. We assume that the requests can be distributed evenly to the p devices; otherwise, some devices may be overloaded while others are underutilized. Corollary 3: n streams can be concurrent on p independent devices if and only if
(S_1 + M_1)/δ_1 + (S_2 + M_2)/δ_2 + ... + (S_n + M_n)/δ_n ≤ p.    (5)
Proof: If n streams are concurrently served by p devices, then there exists a finite time period δ such that kj requests of the jth streams are served by p devices. By the continuous display requirement, this time period should not exceed the display time of each stream. We have
δ ≤ kjδj , ⇒
j = 1, 2, …, n, (6)
kj 1 ≤ , δ δj
j = 1, 2, …, n.
Since the total retrieval time of all requests must be less than the service time of the p devices over the time period δ, we have,
Σ_{j=1}^{n} k_j (S_j + M_j) ≤ pδ,
⇒ Σ_{j=1}^{n} k_j (S_j + M_j)/δ ≤ p.    (7)
Substituting 1/δ_j ≤ k_j/δ from Eq. (6), we obtain

Σ_{j=1}^{n} (S_j + M_j)/δ_j ≤ p.
Hence, the necessary part is proved. Conversely, we let δ = δ_1 δ_2 … δ_n and let k_j ∈ R such that

k_j = δ/δ_j,   j = 1, 2, …, n,    (8)
⇒ k_j/δ = 1/δ_j,   j = 1, 2, …, n.

Substituting k_j/δ = 1/δ_j from Eq. (8) into the necessity condition, we have

Σ_{j=1}^{n} k_j (S_j + M_j)/δ ≤ p,
⇒ Σ_{j=1}^{n} k_j (S_j + M_j) ≤ pδ.    (9)
Since all terms are positive, we can remove all except the i-th term from Σ_{j=1}^{n} k_j (S_j + M_j). Hence, we obtain

k_i (S_i + M_i) ≤ p k_i δ_i,   i = 1, 2, …, n,
⇒ (S_i + M_i) ≤ p δ_i,   i = 1, 2, …, n.    (10)
That is, requests of the i-th stream can be served within time period δ_i by p devices. As long as the requests are distributed evenly to the devices, the continuous display requirements of all streams are fulfilled. Therefore, the n streams can be accepted and served concurrently.
∎
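As a hedged illustration (not from the paper), Corollary 3 translates directly into an admission test: a new stream is accepted only while the summed ratio (S_j + M_j)/δ_j of all streams, including the candidate, stays at or below the number of independent devices p. The numeric values below are arbitrary.

```python
# Admission test derived from Corollary 3; parameter names follow Table 1.

def can_admit(active, candidate, p):
    """active: list of (S, M, delta) tuples; candidate: (S, M, delta)."""
    load = sum((S + M) / delta for S, M, delta in active + [candidate])
    return load <= p

streams = [(2.0, 8.0, 30.0), (1.5, 6.0, 20.0)]   # illustrative values (seconds)
print(can_admit(streams, (2.0, 10.0, 25.0), p=2))
```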
5 System Performance
To display the streams without starvation, the storage system must retrieve each segment before it is due for display. In the concurrent striping method, the maximum
number of requests that can appear between two consecutive requests of the same stream is less than s. If D drives are serving s streams, each accessing segments of size X, then the continuous display requirement is

DX/δ_j ≥ ω + s(α + X/τ),    (11)
where ω, α, and τ are the media exchange time, reposition time, and data transfer rate of the storage devices, respectively, and δ_j is the display bandwidth of the j-th stream. Since one segment is retrieved for each stream per media exchange in the concurrent striping method, the system throughput is
DsX / (ω + s(α + X/τ)).    (12)
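The following sketch (our illustration, with made-up parameter values loosely inspired by Table 2) evaluates Eqs. (11) and (12): it finds the largest s for which D·X/δ_j still covers the round time ω + s(α + X/τ), and reports the corresponding throughput. The display bandwidth and mean reposition time are assumptions, not figures from the paper.

```python
def round_time(s, X, omega, alpha, tau):
    # time for one service round: one media exchange plus s repositions and transfers
    return omega + s * (alpha + X / tau)

def max_streams(D, X, delta_j, omega, alpha, tau):
    # largest s for which the continuous display requirement, Eq. (11), still holds
    s = 0
    while D * X / delta_j >= round_time(s + 1, X, omega, alpha, tau):
        s += 1
    return s

def throughput(D, s, X, omega, alpha, tau):
    # Eq. (12): data delivered per round divided by the round time
    return D * s * X / round_time(s, X, omega, alpha, tau)

delta_j = 1.4                          # assumed display bandwidth, MB/s
X = 10 * 60 * delta_j                  # a 10-minute segment at that bandwidth, MB
omega, alpha, tau = 55.0, 60.0, 14.5   # exchange time, assumed mean reposition time, transfer rate
s = max_streams(D=3, X=X, delta_j=delta_j, omega=omega, alpha=alpha, tau=tau)
print(s, round(throughput(3, s, X, omega, alpha, tau), 1), "MB/s")
```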
Disk buffers are required to store data that are retrieved from tertiary storage faster than they are consumed. Let E[B] be the time that the tertiary drives spend in serving each group of concurrent requests; the disk buffer size for the j-th stream using the concurrent striping method is
rX − (r δ_j / D) E[B].    (13)
Let E[G] be the expected stream service time; the disk buffer size for the j-th stream using the non-striping method and the parallel striping method is
rZ − δ_j E[G].    (14)

6 Experimental Results
We have created a simulation system to study the storage system performance of a robotic tape library. The media exchange time, reposition length and segment size are randomly generated for each request according to a uniform distribution with ±10% deviation from the mean value. New streams arrive randomly at the system according to the mean stream arrival rate. The other simulation parameters are listed in Table 2.

Table 2. Simulation Parameters
Number of streams: 200 streams
Stream arrival rate: 5 to 60 per hr
No. of tertiary drives: 3
Media exchange time: 55 seconds
Reposition rate: 0.06 sec/inch
Max reposition length: 2000 inches
Segment length: 10 minutes
Transfer rate: 14.5 MB/sec
6.1 Number of Displaying Streams
When the segment size increases, more displaying streams are allowed in both striping methods, whereas the number of displaying streams is almost unchanged in the non-striping method. The concurrent striping method can serve more streams when the segment length is longer (Fig. 3). If the maximum number of concurrent streams is limited by the continuous display requirement in Eq. (11), no starvation occurs. Otherwise, the number of starving requests would increase rapidly.
Fig. 3. Maximum Concurrent Streams
6.2 Maximum System Throughput
The maximum system throughput reflects the ability to clear requests from the waiting queues. The maximum throughput of the concurrent striping method (high concurrency) is always higher than that of the other methods (Fig. 4). The system throughputs of all methods increase when larger segments are used, for three reasons. First, fewer exchanges and repositions are required for larger segments, resulting in lower overhead. Second, larger segments are displayed for a longer time, so more concurrent streams can be accepted to share the same media exchange overhead. Third, the full reposition length is shared under SCAN scheduling among more concurrent streams, so the mean reposition time, and thus the overhead, is reduced. Therefore, the maximum system throughput is higher.

6.3 Stream Response Time
The stream response time indicates the quality of service to users (Fig. 5). The stream response time is dominated by the start-up latency at low stream arrival rates, but by the queue waiting time at high stream arrival rates. At low stream arrival rates, the concurrent striping method responds more slowly than the other two methods, since the drives may be in the middle of a round and new streams need to wait for the media unit containing the first required segment to be exchanged. At fast stream arrivals, the concurrent striping method responds faster than the other methods. As the queue grows, the response time increases rapidly; since the concurrent striping method has the highest throughput, it serves requests the fastest. Therefore, the concurrent striping method reduces stream response time under heavy loads.
Fig. 4. Maximum System Throughput (MB/sec versus segment length in minutes; parallel striping, non-striping and high concurrency, predicted and measured)
Fig. 5. Mean Stream Response Time (seconds versus stream arrival rate per hour; parallel striping, non-striping and high concurrency, predicted and measured)
6.4 Disk Buffer Space
The disk buffer size indicates the amount of resources required by each method (Fig. 6). The largest disk buffer space is used by the non-striping method, which retrieves data well before they are due for display. In both striping methods, the segments reside on different media units. At low stream arrival rates, multiple media exchanges are required to retrieve each object, resulting in lower data retrieval throughput per stream and smaller disk buffers. At fast stream arrivals, more streams are served concurrently in the concurrent striping method. As the segments for each stream are retrieved discontinuously, each object is retrieved at a slower pace and less data are moved to the disk. Thus, the disk buffer size per stream drops in the concurrent striping method.
Fig. 6. Disk Buffer Size (buffer size per stream in MB versus stream arrival rate per hour; parallel striping, non-striping and high concurrency, predicted and measured)
7 Summary and Conclusion
The use of HSS will be inevitable for large multimedia databases in future systems. The main concerns in using these systems are their relatively poor response characteristics and large resource consumption. The concurrent striping method addresses these problems by sharing the switching overheads in HSS among concurrent streams. We have provided a feasibility condition for serving heterogeneous streams on a number of devices based on their access overheads and media transfer rates. The concurrent striping method has several advantages. The first is that its system throughput is higher than that of existing methods. The second is that it can serve more streams than the non-striping method with limited disk buffer space. The third is that new streams respond faster under heavy loads, which is very often the practical operating condition of multimedia databases. These advantages make the concurrent striping method the most efficient storage organization for supporting the operation of multimedia databases and visual information systems.
Unsupervised Categorization for Image Database Overview

Bertrand Le Saux and Nozha Boujemaa

INRIA, Imedia Research Group, BP 105, F-78153 Le Chesnay, France
[email protected]
http://www-rocq.inria.fr/ lesaux
Abstract. We introduce a new robust approach to categorize image databases: Adaptative Robust Competition (ARC). Providing the best overview of an image database helps users browse large image collections. Estimating the distribution of image categories and finding their most descriptive prototypes are the two main issues of image database categorization. Each image is represented by a high-dimensional signature in the feature space. A principal component analysis is performed for every feature to reduce dimensionality. Image database overview by categorization is computed in challenging conditions, since clusters overlap and the number of clusters is unknown. Clustering is performed by minimizing a Competitive Agglomeration objective function with an extra noise cluster collecting outliers.
1 Introduction
Over the last few years, partly due to the development of the Internet, more and more multimedia documents that include digital images have been produced and exchanged. However, locating a target image in a large collection has become a crucial problem. The usual way to solve it consists in describing images by keywords. Because this is a human operation, this method suffers from subjectivity and text ambiguity and requires considerable time to manually annotate a whole database. By image analysis, images can be indexed by automatic descriptions which depend only on their objective visual content. Content-Based Image Retrieval (CBIR) has therefore become a highly active research field. The usual scenario of CBIR is query by example, which consists in retrieving images of the database similar to a given one. The purpose of browsing is to help the user find his image query by providing first the best overview of the database. Since the database cannot be presented entirely, a limited number of key images have to be chosen. This means we have to find the most informative images, which allow the user to know what the database contains. The main issue is to estimate the distribution (usually multi-modal) of image categories. Then we need the most representative image for each category. Practically, this is a critical point in the scenario of content-based query by example: the "page zero" problem. Existing systems often begin by presenting either randomly chosen images or keywords. In the first case, some categories are missed, and some images can be visually redundant. The user has to pick several random subsets to find an image corresponding to the one he has in mind. Only then can the query by example be
performed. In the second case, images are manually annotated with keywords, and the first query is processed using keywords. Thus there is a need for presenting a summary of the database to the user. A popular way to find partitions in complex data is the prototype-based clustering algorithm. The fuzzy version (Fuzzy C-Means [1]) has been constantly improved for twenty years by the use of the Mahalanobis distance [2], the adjunction of a noise cluster [3] or the competitive agglomeration algorithm [4][5]. A few attempts to organize and browse image databases have been made: Brunelli and Mich [6], Medasani and Krishnapuram [7] and Frigui et al. [8]. A key point of categorization is the input data representation. A set of signatures (color, texture and shape) describes the visual appearance of the image. The content-based categorization should be performed by clustering these signatures. This operation is computed in challenging conditions. The feature space is high-dimensional: computations are affected by the curse of dimensionality. The number of clusters in the image database is unknown. Natural categories have various shapes (sometimes hyper-ellipsoidal but often more complex), they overlap and they have various densities. The paper is organized as follows: §2 presents the background of our work. Our method is presented in section 3. The results on image databases are discussed and compared with other clustering methods in section 4, and section 5 summarizes our concluding remarks.
2 Background
The Competitive Agglomeration (CA) algorithm [4] is a fuzzy partitional algorithm which does not require the number of clusters to be specified. Let X = {x_i | i ∈ {1, …, N}} be a set of N vectors representing the images. Let B = {β_j | j ∈ {1, …, C}} represent the prototypes of the C clusters. The CA algorithm minimizes the following objective function:

J = Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)² d²(x_i, β_j) − α Σ_{j=1}^{C} [Σ_{i=1}^{N} u_ji]²    (1)

constrained by:

Σ_{j=1}^{C} u_ji = 1,   for i ∈ {1, …, N}.    (2)
d²(x_i, β_j) represents the distance from an image signature x_i to a cluster prototype β_j. The choice of the distance depends on the type of clusters to be detected. For spherical clusters, the Euclidean distance is used. u_ji is the membership of x_i to a cluster j. The first term is the standard FCM objective function [1]: the sum of weighted square distances. It allows us to control the shape and compactness of clusters. The second term (the sum of squares of the clusters' cardinalities) allows us to control the number of clusters. By minimizing both terms together, the data set is partitioned into the optimal number of clusters while clusters are selected to minimize the sum of intra-cluster distances.
The cardinality of a cluster is defined as the sum of the memberships of each image to this cluster:

N_s = Σ_{i=1}^{N} u_si.    (3)

The membership can be written as:

u_st = u_st^FCM + u_st^Bias,    (4)

where:

u_st^FCM = [1/d²(x_t, β_s)] / Σ_{j=1}^{C} [1/d²(x_t, β_j)],    (5)

and:

u_st^Bias = (α / d²(x_t, β_s)) · (N_s − [Σ_{j=1}^{C} (1/d²(x_t, β_j)) N_j] / [Σ_{j=1}^{C} 1/d²(x_t, β_j)]).    (6)
The first term in equation (4) is the membership term of the FCM algorithm and takes into account only relative distances to the clusters. The second term is a bias term which is negative for low-cardinality clusters and positive for strong clusters. This bias term leads to a reduction of the cardinality of spurious clusters, which are discarded if their cardinality drops below a threshold. As a result only good clusters are conserved. α should provide a balance [4] between the two terms of (1), so α at iteration k is defined by:

α(k) = η_0 exp(−k/τ) · [Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)² d²(x_i, β_j)] / [Σ_{j=1}^{C} (Σ_{i=1}^{N} u_ji)²].    (7)

α is weighted by a factor which decreases exponentially along the iterations. In the first iterations, the second term of equation (1) dominates, so the number of clusters drops rapidly. Then, when the optimal number of clusters is found, the first term dominates and the CA algorithm seeks the best partition of the signatures.
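To make the update rules concrete, here is a compact numpy sketch (not the authors' implementation) of one CA membership update following Eqs. (3)-(7); η_0 and τ are the annealing constants of Eq. (7) and, like the toy data, are chosen arbitrarily here.

```python
import numpy as np

def ca_membership_update(X, B, U, k, eta0=5.0, tau=10.0):
    # X: (N, d) signatures, B: (C, d) prototypes, U: current (C, N) memberships
    d2 = ((X[None, :, :] - B[:, None, :]) ** 2).sum(axis=2)   # d^2(x_i, beta_j)
    d2 = np.maximum(d2, 1e-12)                                # avoid division by zero
    inv = 1.0 / d2
    u_fcm = inv / inv.sum(axis=0, keepdims=True)              # Eq. (5)
    card = U.sum(axis=1)                                      # Eq. (3), N_j
    alpha = eta0 * np.exp(-k / tau) * (
        (U ** 2 * d2).sum() / (card ** 2).sum())              # Eq. (7)
    weighted_card = (inv * card[:, None]).sum(axis=0) / inv.sum(axis=0)
    u_bias = (alpha / d2) * (card[:, None] - weighted_card[None, :])  # Eq. (6)
    return u_fcm + u_bias                                     # Eq. (4)

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
B = rng.normal(size=(4, 5))
U = np.full((4, 100), 0.25)
U = ca_membership_update(X, B, U, k=1)
```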
3 Adaptative Robust Competition (ARC)

3.1 Dimensionality Reduction

A signature space has been built for a 1440-image database (Columbia Object Image Library [9]). It contains 1440 gray-scale images representing 20 objects, where each object is shot every 5 degrees. This feature space is high-dimensional and contains three signatures:

1. Intensity distribution (16-D): the gray level histogram.
2. Texture (8-D): the Fourier power spectrum is used to describe the spatial frequency of the image [10].
3. Shape and structure (128-D): the correlogram of the edge-orientation histogram (in the same way as the color correlogram presented in [11]).
Fig. 1. Distribution of gray level histograms for the Columbia database on the three principal components
The whole space is not necessary to distinguish images. To prevent clustering from becoming computationally expensive, a principal component analysis is performed to reduce the dimensionality. For each feature, only the first principal components are kept. To visualize the problems raised by the categorization of image databases, the distribution of image signatures is shown in figure 1. This figure presents the subspace corresponding to the three principal components of the gray level histogram feature. Each natural category is represented with a different color. Two main problems appear: categories overlap, and natural categories have different and various shapes.
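A minimal sketch of the per-feature reduction described above, assuming each feature is reduced independently by PCA via a plain SVD; the number of kept components and the random data are illustrative only.

```python
import numpy as np

def pca_reduce(feature_matrix, n_components):
    """feature_matrix: (N, d) signatures for one feature; returns (N, n_components)."""
    centered = feature_matrix - feature_matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
signatures = {
    "histogram": rng.random((1440, 16)),
    "texture": rng.random((1440, 8)),
    "shape": rng.random((1440, 128)),
}
reduced = {name: pca_reduce(m, n_components=3) for name, m in signatures.items()}
```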
3.2 Adaptative Competition

α is the weighting factor of the competition process. In equation (7), α is chosen according to the objective function and has the same value and effect for each cluster. However, during the process, α influences the computation of memberships in equations (4) and (6). The term u_st^Bias increases or decreases the membership u_st of data point x_t to cluster s according to the cardinality of the cluster, causing that cluster to be conserved or discarded respectively. Since clusters may have different compactness, the problem is to attenuate the effect of u_st^Bias for loose clusters, in order not to discard them too rapidly. We introduce an average distance for each cluster s:

d²_moy(s) = Σ_{i=1}^{N} (u_si)² d²(x_i, β_s) / Σ_{i=1}^{N} (u_si)²,   for 1 ≤ s ≤ C.    (8)
And an average distance for the whole set of image signatures:

d²_moy = Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)² d²(x_i, β_j) / Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)².    (9)

Then, α in equation (6) is expressed as:

α_s(k) = (d²_moy / d²_moy(s)) α(k),   for 1 ≤ s ≤ C.    (10)

The ratio d²_moy/d²_moy(s) is lower than 1 for loose clusters, so the effect of u_st^Bias is attenuated: the cardinality of the cluster is reduced slowly. On the contrary, d²_moy/d²_moy(s) is greater than 1 for compact clusters, so both the memberships to these clusters and their cardinalities are increased: they are more resistant in the competition process. Hence we build an adaptative competition process given by α_s(k) for each cluster s.

3.3 Robust Clustering
A solution to deal with noisy data and outliers is to capture all the noise signatures in a single cluster [3]. A virtual noise prototype is defined, which is always at the same distance δ from every point in the data-set. Let this noise cluster be the first cluster, with its noise prototype denoted β_1. So we have:

d²(x_i, β_1) = δ².    (11)

Then the objective function (1) has to be minimized under the following particular conditions:

– Distances for the good clusters j are defined by:

d²(x_i, β_j) = (x_i − β_j)^T A_j (x_i − β_j),   for 2 ≤ j ≤ C,    (12)

where the A_j are positive definite matrices. If A_j is the identity matrix, the distance is the Euclidean distance, and the prototypes of clusters j for 2 ≤ j ≤ C are:

β_j = Σ_{i=1}^{N} (u_ji)² x_i / Σ_{i=1}^{N} (u_ji)².    (13)

– For the noise cluster j = 1, the distance is given by (11). The noise distance δ has to be specified. It varies from one image database to another, so it is based on data-set statistical information. It is computed as the average distance between image signatures and good cluster prototypes:

δ² = δ_0² Σ_{j=2}^{C} Σ_{i=1}^{N} d²(x_i, β_j) / (N(C − 1)).    (14)

The noise cluster is then supposed to catch outliers that are at an equal mean distance from all cluster prototypes. Initially, δ cannot be computed using this formula, since
distances are not yet computed. It is just initialized to δ_0, and the noise cluster becomes significant after a few iterations. δ_0 is a factor which can be used to enlarge or reduce the size of the noise cluster; in the results presented here, δ_0 = 1. The new ARC algorithm, using adaptative competitive agglomeration and a noise cluster, can now be summarized:

Fix the maximum number of clusters C.
Initialize randomly the prototypes for 2 ≤ j ≤ C.
Initialize memberships with equal probability for each image to belong to each cluster.
Compute initial cardinalities for 2 ≤ j ≤ C using equation (3).
Repeat
  Compute d²(x_i, β_j) using (11) for j = 1 and (12) for 2 ≤ j ≤ C.
  Compute α_j for 1 ≤ j ≤ C using equations (10) and (7).
  Compute memberships u_ji using equation (4) for each cluster and each signature.
  Compute cardinalities N_j for 2 ≤ j ≤ C using equation (3).
  For 2 ≤ j ≤ C, if N_j < threshold, discard cluster j. Update the number of clusters C.
  Update prototypes using equation (13).
  Update the noise distance δ using equation (14).
Until (prototypes stabilize).

Hence a new clustering algorithm is proposed. The next two points address two problems raised by image database categorization.

3.4 Choice of Distance for Good Clusters
What would be the most appropriate choice for (12)? The image signatures are composed of different features which describe different attributes. The distance between signatures is defined as the weighted sum of partial distances for each feature 1 ≤ f ≤ F:

d(x_i, β_j) = Σ_{f=1}^{F} w_{j,f} d_f(x_i, β_j).    (15)

For each feature, the natural categories in image databases have various shapes, most often hyper-ellipsoidal, and overlap each other. To retrieve such clusters, the Euclidean distance is not appropriate, so the Mahalanobis distance [2] is used to discriminate image signatures. For clusters 2 ≤ j ≤ C, partial distances for feature f are computed using:

d_f(x_i, β_j) = |C_{j,f}|^{1/p_f} (x_{i,f} − β_{j,f})^T C_{j,f}^{−1} (x_{i,f} − β_{j,f}),    (16)

where x_{i,f} and β_{j,f} are the restrictions of image signature x_i and cluster prototype β_j to the feature f. p_f is the dimension of both x_{i,f} and β_{j,f}: it is the dimension of the subspace corresponding to feature f. C_{j,f} is the covariance matrix (of dimension p_f × p_f) of cluster j for the feature f:

C_{j,f} = Σ_{i=1}^{N} (u_ji)² (x_{i,f} − β_{j,f})(x_{i,f} − β_{j,f})^T / Σ_{i=1}^{N} (u_ji)².    (17)
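A small numpy illustration (ours, not the paper's code) of Eqs. (16)-(17): the fuzzy covariance matrix of a cluster restricted to one feature, and the volume-normalised Mahalanobis partial distance it induces. The data and memberships are random placeholders.

```python
import numpy as np

def fuzzy_covariance(Xf, u_j, beta_jf):
    """Xf: (N, p_f) restrictions to feature f; u_j: (N,) memberships. Eq. (17)."""
    diff = Xf - beta_jf
    w = u_j ** 2
    return (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / w.sum()

def mahalanobis_partial(x_f, beta_jf, C_jf):
    # Eq. (16): volume-normalised Mahalanobis distance for one feature subspace
    p_f = len(x_f)
    diff = x_f - beta_jf
    return np.linalg.det(C_jf) ** (1.0 / p_f) * diff @ np.linalg.solve(C_jf, diff)

rng = np.random.default_rng(0)
Xf = rng.normal(size=(200, 8))          # e.g. the 8-D texture feature
u_j = rng.random(200)
beta = (u_j[:, None] ** 2 * Xf).sum(axis=0) / (u_j ** 2).sum()   # Eq. (13) restricted to f
C = fuzzy_covariance(Xf, u_j, beta)
print(mahalanobis_partial(Xf[0], beta, C))
```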
3.5 Normalization of Features
The problem is to compute the weights w_{j,f} used in equation (15). The features have different orders of magnitude and different dimensions, so the distance over all features cannot be defined as a simple sum of partial distances. The idea is to learn the weights during the clustering process. Ordered Weighted Averaging [12] is used, as proposed in [8]. First, partial distances are sorted in ascending order. For each feature f, the rank of the corresponding partial distance is obtained:

r_f = rank(d_f(x_i, β_j)).    (18)

And the weight at iteration k > 0 is updated using:

w_{j,f}^{(k)} = w_{j,f}^{(k−1)} + 2(F − r_f) / (F(F + 1)).    (19)
It has two positive effects. First, features with small values are weighted with a higher weight than those with large values, so the sum of partial distances is balanced. Secondly, since the weights are computed during the clustering process, if some images are found to be similar according to one feature, their partial distance will be small and the effect of this feature will be accentuated: this allows finding a cluster which contains images similar according to a single main feature.

3.6 Algorithm Outline

Fix the maximum number of clusters C.
Initialize randomly the prototypes for 2 ≤ j ≤ C.
Initialize memberships with equal probability for each image to belong to each cluster.
Initialize feature weights uniformly for each cluster 2 ≤ j ≤ C.
Compute initial cardinalities for 2 ≤ j ≤ C.
Repeat
  Compute the covariance matrices for 2 ≤ j ≤ C and feature subsets 1 ≤ f ≤ F using (17).
  Compute d²(x_i, β_j) using (11) for j = 1 and (16) for 2 ≤ j ≤ C.
  Update the weights for clusters 2 ≤ j ≤ C using (19) for each feature.
  Compute α_j for 1 ≤ j ≤ C using equations (10) and (7).
  Compute memberships u_ji using equation (4) for each cluster and each signature.
  Compute cardinalities N_j for 2 ≤ j ≤ C.
  For 2 ≤ j ≤ C, if N_j < threshold, discard cluster j. Update the number of clusters C.
  Update prototypes using equation (13).
  Update the noise distance δ using equation (14).
Until (prototypes stabilize).
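As a small illustration of the rank-based weight update of Eqs. (18)-(19) used in the outline above (a sketch under our own naming, not the authors' code): features whose partial distances are small receive the larger weight increments.

```python
import numpy as np

def owa_weight_update(weights, partial_distances):
    """weights, partial_distances: (F,) arrays for one cluster and one signature."""
    F = len(weights)
    ranks = np.argsort(np.argsort(partial_distances)) + 1   # rank 1 = smallest distance
    return weights + 2.0 * (F - ranks) / (F * (F + 1))      # Eq. (19)

w = np.full(3, 1.0 / 3.0)                    # uniform initial weights, F = 3 features
d_partial = np.array([0.2, 1.5, 0.7])        # e.g. histogram, texture, shape distances
print(owa_weight_update(w, d_partial))
```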
Fig. 2. left: ground truth: the 20 objects of the Columbia database, right: Summary obtained with ARC algorithm
Fig. 3. left: Prototypes of clusters obtained with SOON algorithm, right: Prototypes of clusters obtained with CA algorithm
4 Results and Discussion

The ARC algorithm is compared with two other clustering algorithms: the basic CA algorithm [4] and the Self-Organization of Oscillator Network (SOON) algorithm [8]. The SOON algorithm can be summarized as follows:

1. Each image signature is associated with an oscillator characterized by a phase variable that belongs to [0, 1].
2. Whenever an oscillator's phase reaches 1, it resets to 0 and the other oscillators' phases are either increased or decreased according to a similarity function.
Table 1. This matrix shows how many pictures of each object belong to a cluster obtained with ARC. Object 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Cluster 1 72 . . . . . . . . . . . . . . . . . . . 2 . 3 1 1 . . . . . . 2 . 3 . . . . . . . 3 . . 48 . 4 4 . . . 5 . . . . . . . . 4 . 4 . 3 4 70 . . . 15 . . . . . . . 13 . . . . 5 . . . . 32 . . . . . 1 . . . . . . . . . 6 . . . . . . . . . . . . . . . . . . . . 7 . . . . 3 . 67 . . . 12 . . . . . . . . . 8 . . . . 2 . 5 57 . . 1 . . . . . . . . . 9 . . . . 13 . . . 70 5 . . . . . . . . . . 10 . . . . . . . . . . . . . . . . . . . . 11 . 9 . . . . . . . 1 51 . . . . . . . . . 12 . . . . 3 . . . . 5 . 72 . . . . . . . . 13 . 22 . . . . . . . . 5 . 21 . . . . . . . 13 . 12 . . . . . . . . . . 48 . . . . . . . 14 . . . . . 1 . . . . . . . 72 . . . . 1 . 15 . . . . . . . . . . . . . . 72 . . . . . 16 . . . . . 2 . . . . . . . . . 59 . . . . 17 . . . . . . . . . . . . . . . . 72 . . . 18 . . . . . . . . . . . . . . . . . 72 . . 19 . . 18 . 2 35 . . . 14 . . . . . . . . 26 . 19 . . . 1 2 16 . . . 16 . . . . . . . . 23 . 19 . . 11 . 1 14 . . . . . . . . . . . . 19 . 20 . . . . . . . . . 2 . . . . . . . . . 72 noise . 23 5 . 10 . . . 2 24 . . . . . . . . . .
3. Oscillators begin to clump together in small groups. Within each group, oscillators are phase-locked. After a few cycles, existing groups get bigger by absorbing other oscillators and merging with other groups.
4. Eventually, the system reaches a stable state where the image signatures are organized into the optimal number of stable groups.

For each category, a prototype is chosen according to the following steps:
• The average value of each feature is computed over the images.
• Then, the average of all images defines a virtual prototype.
• The real prototype is the image nearest to the virtual one.

The ground truth of the Columbia database is shown in figure 2. The three summaries are presented in figures 2 and 3. Nearly all the natural categories are retrieved with the three methods. But with the SOON or CA algorithms, some categories are split into several clusters, so several prototypes are redundant. Our method provides a better summary with less redundancy. Tables 1 and 2 present the membership matrices of objects to clusters, which describe the content of each cluster. Since the simple CA algorithm has no cluster to collect ambiguous image signatures, clusters obtained with this method are noisy. Besides the main natural category retrieved in a cluster, there are always other images which belong to a neighbouring cluster or to a wide-spread cluster. This problem is solved with both other methods. With the ARC or SOON algorithms, more than a third of the categories are perfectly clustered, i.e. all the images of a single category are grouped in a single cluster.
Table 2. The left matrix shows how many pictures of each object belong to a cluster obtained with CA and the right matrix shows the result of the same experiment with SOON. Object 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Cluster 1 42 . . 4 . . . 1 . 2 6 . . . . . . . . . 1 30 . . . . . . 9 . . 1 . . . . . . . . . 2 . 35 . . . . 3 1 . . 1 . . . . . . . . . 3 . . 8 . . 30 . . . . . . . . . . . . 26 . 3 . . 10 . . . . . . 1 . . . . . . . . 10 . 4 . 1 2 31 22 . . 1 3 3 . . . . . . . . . . 5 . . . . 10 . 5 . . 54 3 . . . . . . . . . 6 . . . . . . . . . . . . . . . . . . . . 7 . . . . 1 . 61 . . . . . . . . . . 14 . . 8 . . . . 2 . . 21 19 . . . . . . . . . . 44 9 . . . . 5 . . 19 47 . . . . . . . . . . . 10 . . . . . . . . . . . . . . . . . . . . 11 . 5 . . 1 . . 3 . . 49 . . . . . . . . . 12 . . . . 12 . . . . . . 72 . . . . . . . . 13 . 17 . . . . . . . . 6 . 72 . . . . . . . 14 . . . . . . . 6 . . . . . 72 . . . . . . 15 . . . . . . 1 . . . . . . . 33 . . . . . 15 . . . . . . 2 . . . 4 . . . 39 . . . . . 16 . 13 . 37 . . . 12 . . 2 . . . . 72 . . . . 17 . . . . 1 . . . . . . . . . . . 72 . . . 18 . . . . 10 . . . . 3 . . . . . . . 29 . . 18 . . . . . . . . . 1 . . . . . . . 29 . . 19 . . 40 . 8 25 . . . 8 . . . . . . . . 26 . 19 . . 12 . . 17 . . . . . . . . . . . . 10 . 20 . . . . . . . . 3 . . . . . . . . . . 28 Object 1 2 Cluster 1 21 . 1 51 . 2 . . 3 . . 4 . . 5 . . 5 . . 6 . . 6 . . 7 . . 8 . . 8 . . 9 . . 10 . . 10 . . 10 . . 11 . . 12 . . 13 . . 14 . . 15 . . 16 . . 17 . . 18 . . 18 . . 19 . . 20 . . noise . 72
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 . . . 7 . . . 4 40 . . . . . . . . . . . . . . . . 2 . 19
. . . . 72 . . . . . . . . . . . . . . . . . . . . . . .
. . . . . 15 19 . . . . . . . . . . . . . . . . . . . . 38
. . . 6 . . . 5 43 . . . . . . . . . . . . . . . . 3 . 15
. . . . . . . . . . . . . . . . . . . . . . . . . . . 72
. . . . . . . . . . 16 40 . . . . . . . . . . . . . . . 16
. . . . . . . . . . . . 14 . . . . . . . . . . . . . . 57
. . . . . . . . . . . . . 10 16 10 . . . . . . . . . . . 36
. . . . . . . . . . . . . . . . 26 . . . . . . . . . . 46
. . . . . . . . . . . . . . . . . 72 . . . . . . . . . .
. . . . . . . . . . . . . . . . . . 13 . . . . . . . . .
. . . . . . . . . . . . . . . . . . . 71 . . . . . . . 1
. . . . . . . . . . . . . . . . . . . . 72 . . . . . . .
. . . . . . . . . . . . . . . . . . . . . 72 . . . . . .
. . . . . . . . . . . . . . . . . . . . . . 72 . . . . .
. . . . . . . . . . . . . . . . . . . . . . . 39 33 . . .
. . . . . . . 6 42 . . . . . . . . . . . . . . . . 5 . 19
. . . . . . . . . . . . . . . . . . . . . . . . . . 72 .
Fig. 4. left: cluster of object ‘drugs package’ obtained by ARC, and right: cluster of object ‘drugs package’ obtained by CA algorithm
Fig. 5. cluster of object ‘drugs package’ obtained by SOON algorithm
The other natural categories present more variation among their images, so they are more difficult to retrieve. Let us consider one of these categories: the images representing the drug package 'tylenol'. It presents several difficulties: it is wide-spread, and another category which represents another drug package is very similar. The cluster formed with the CA algorithm contains 71 images but only 47 images of the good category (see figure 4). The cluster formed with the SOON algorithm has no noise but contains only 14 images (among 72) (figure 5). With our method, a cluster of 88 images is found, with 18 noisy images and 70 good images. The CA algorithm suffers from the noisy data, which prevent it from finding the good clusters. On the contrary, the SOON algorithm rejects a lot of images into the noise cluster: the good clusters are thus pure, but more than a quarter of the database is considered as noise. Since whole categories can be rejected (table 2 shows that 2 complete categories of the Columbia database are in the noise cluster), the image database is not well represented. The ARC method avoids these drawbacks. It finds clusters which contain almost all images of the natural category, with only a small amount of noise. The noise cluster contains only really ambiguous images which would otherwise affect the results by biasing the clustering process.
5 Conclusion
We have presented a new unsupervised and adaptative clustering algorithm to categorize image databases: ARC. When the prototypes of each category are picked and collected together, it provides a summary of the image database. It addresses the problems raised by image database browsing, and more specifically the "page zero" problem. It computes the optimal number of clusters in the dataset. It assigns outliers and ambiguous image signatures to a noise cluster, to prevent them from biasing the categorization process. Finally, it uses an appropriate distance to retrieve clusters of various shapes and densities.
References
1. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press (1981)
2. Gustafson, E.E., Kessel, W.C.: Fuzzy clustering with a fuzzy covariance matrix. In: IEEE CDC, San Diego, California (1979) 761–766
3. Dave, R.N.: Characterization and detection of noise in clustering. Pattern Recognition Letters 12 (1991) 657–664
4. Frigui, H., Krishnapuram, R.: Clustering by competitive agglomeration. Pattern Recognition 30 (1997) 1109–1119
5. Boujemaa, N.: On competitive unsupervized clustering. In: Proc. of ICPR'2000, Barcelona, Spain (2000)
6. Brunelli, R., Mich, O.: Image retrieval by examples. IEEE Transactions on Multimedia 2 (2000) 164–171
7. Medasani, S., Krishnapuram, R.: Categorization of image databases for efficient retrieval using robust mixture decomposition. In: Proc. of the IEEE Workshop on Content Based Access of Images and Video Libraries, Santa Barbara, California (1998) 50–54
8. Frigui, H., Boujemaa, N., Lim, S.A.: Unsupervised clustering and feature discrimination with application to image database categorization. In: NAFIPS, Vancouver, Canada (2001)
9. Nene, S.A., Nayar, S.K., Murase, H.: Columbia object image library (coil20). Technical report, Department of Computer Science, Columbia University, http://www.cs.columbia.edu/CAVE/ (1996)
10. Niemann, H.: Pattern Analysis and Understanding. Springer, Heidelberg (1990)
11. Huang, J., Kumar, S.R., Mitra, M., Zu, W.J.: Spatial color indexing and applications. In: ICCV, Bombay, India (1998)
12. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decision making. Systems, Man and Cybernetics 18 (1988) 183–190
A Data-Flow Approach to Visual Querying in Large Spatial Databases

Andrew J. Morris¹, Alia I. Abdelmoty², Baher A. El-Geresy¹, and Christopher B. Jones²

¹ School of Computing, University of Glamorgan, Treforest, Wales, CF37 1DL, UK
² Department of Computer Science, Cardiff University, Cardiff, Wales, CF24 3XF, UK

Abstract. In this paper a visual approach to querying in large spatial databases is presented. A diagrammatic technique utilising a data flow metaphor is used to express different kinds of spatial and non-spatial constraints. Basic filters are designed to represent the various types of queries in such systems. Icons for different types of spatial relations are used to denote the filters. Different granularities of the relations are presented in a hierarchical fashion when selecting the spatial constraints. The language constructs are presented in detail and examples are used to demonstrate the expressiveness of the approach in representing different kinds of queries, including spatial joins and composite spatial queries.
1 Introduction
Large spatial databases, such as Computer Aided Design and Manufacture (CAD/CAM), Geographic Information Systems (GIS) and medical and biological databases, are characterised by the need to represent and manipulate a large number of spatial objects and spatial relationships. Unlike traditional databases, most concepts in those systems have spatial representations and are therefore naturally represented using a visual approach. GIS are a major example of spatial databases with a large number of application domains, including environmental, transportation and utility mapping. Geographic objects, usually stored in the form of maps, may be complex, formed by grouping other features, and may have more than one spatial representation which changes over time. For example, a road object can be represented by a set of lines forming its edges or by a set of areas between its boundaries. Users of current GIS are expected to be non-experts in the geographic domain as well as possibly casual users of database systems. Alternative design strategies for query interfaces, besides the traditional command-line interfaces, are sought to produce more effective GIS and to enhance their usability. The current generation of GIS have mostly textual interfaces or menu-driven ones that allow some enhanced expression of the textual queries [Ege91]. Problems with textual query languages have long been recognised [Gou93], including the need to know the structure of the database schema before writing a query as well as problems of semantic and syntactic errors. Problems are compounded in a geographic database where geographic features can be represented by more
than one geometric representation and the semantics and granularity of spatial relations may differ across systems and application domains. In this paper, the focus is primarily on the process of query formulation. A visual approach is proposed to facilitate query expression in those systems. The approach addresses some of the basic manipulation issues, namely, the explicit representation of the spatial types of geographic features and the qualitative representation of spatial relationships. A diagrammatic technique is designed around the concept of a filter to represent constraints and implemented using direct manipulation. Filters, represented by icons, denote spatial and non-spatial constraints. Spatial constraints are computed through the application of spatial operators on one spatial entity, e.g. calculating the area of a polygon, or on more than one spatial entity, e.g. testing whether a point object is inside a polygon object. Different granularities of binary spatial filters are used and may be defined in the language; for example, a general line-cross-area relationship may be specialised to indicate the number of points the two objects share, etc. The concept of a filter is used consistently to construct complex queries from any number of sub-queries. The aim is to provide a methodology for a non-expert user to formulate and read relatively complex queries in spatial databases. Notations are used to distinguish query (and sub-query) results, to provide means of storing query history as well as to provide a mechanism for query reuse. A prototype of the approach has been implemented and evaluation experiments are currently underway. GIS are the main examples used in this paper. However, the approach proposed may be applied to other types of spatial databases. The paper is structured as follows. Section 2 lists some general requirements and problems identified for query interfaces to spatial databases. A discussion of related work is presented in section 3. In section 4, the data flow approach is first described and the language constructs are then presented in detail. This is followed in section 5 by an overview of the implementation and evaluation of the produced interface, concluding with a summary in section 6.
2 General Requirements and Identified Problems
Several issues related to the design of query interfaces to spatial databases are identified as follows. Some of these issues can be addressed at the language design level, while others need to be addressed at the implementation level of the query interface. Issues arising due to the spatial nature of the database include:

Representation of Spatial Objects: Geographic objects have associated spatial representations to define their shape and size. Objects may be associated with more than one spatial representation in the database to handle different map scales or different application needs. Spatial representations of objects determine and limit the types of spatial relationships that they may be involved in. Explicit representation of the geometric type(s) of geographic features is needed to allow the user to express appropriate constraints over their locations.
Fig. 1. Types of overlap relationship between two spatial regions.
Spatial operations and joins: It is difficult for a non-expert user to realise all the possible spatial operations that may be applied to a geographic object or the possible spatial relationships that may be computed over sets of geographic objects. The semantics of the operations and relationships are implicit in their names. Those names may not have unique meanings for all users and are dependent on their implementation in the specific system in use. For example, an overlap relationship between two regions may be generalised to encompass the inside relationship in one implementation, or may be specific to mean only partial coverage in another, as shown in figure 1. In this paper a visual, qualitative representation of spatial operations and relationships is proposed to facilitate their direct recognition and correct use. Also, different granularities of spatial relationships need to be explicitly defined to express different levels of coarse and detailed spatial constraints.
Composite spatial constraints: Multiple spatial constraints are used in query expressions. Again, the semantics of the composite relation may be vague, especially when constraints are combined using the binary logical operators And and Or. Means of visualising composite spatial relations would therefore be useful, e.g. "Object1 is north-of Object2 and close to it but outside a buffer of 10 m from Object3".
Self spatial joins: Problems with the expression of self joins were noted earlier in traditional databases [Wel85]. The same is true in spatial databases, but is complicated by the use of spatial constraints in the join, e.g. "Find all the roads that intersect type A roads".
Query History: Visualising the results of sub-queries during the process of query formulation is useful, as users tend to create new queries by reworking a previous query or using parts thereof; this suggests the inclusion of query history.
Other general database issues include parenthesis complexity when specifying the order of Boolean operators with parentheses as the query grows [Wel85, JC88, MGP98], problems when using the Boolean logic operators And and Or, and common syntactic errors such as omitting quotation marks around data values where required [Wel85] and applying numeric operators to non-numeric fields. The approach proposed in this paper attempts to handle some of the above issues that can be addressed at the language design level. Other issues are left to the implementation stage of the query interface.
3 Related Work
Querying interfaces to GIS can be broadly categorised into textual interfaces and non-textual interfaces. Several text-based extensions to SQL have been
proposed (e.g. [Ege91, IP87, RS99]). Spatial extensions to SQL inherit the same problems of textual query languages to traditional databases. Typing commands can be tiring and error prone [EB95], with difficult syntax that is tedious to use [Ege97]. In [Gou93] it was noted that users can spend more time thinking about command tools than thinking of the task that they have set out to complete. The Query-by-Example model [Zlo77] has also been explored in several works. QPE [CF80] and PICQUERY [JC88] are examples of such extensions. Users formulate queries by entering examples of possible results into appropriate columns on empty tables of the relations to be considered. Form-based extensions often do not release the user from having to perform complicated operations in expressing the queries nor from having to understand the schema structure. Also, complex queries usually need to be typed into a condition box that is similar to the WHERE clause of an SQL statement. Visual languages have been defined as languages that support the systematic use of visual expressions to convey meaning [Cha90]. A great deal of work is already being carried out to devise such languages for traditional and object-oriented databases in an attempt to bridge the gap of usability for users. Iconic, diagrammatic, graph-based and multi-modal approaches are noted. Lee and Chin [LC95] proposed an iconic language, where icons are used to represent objects and processes. A query is expressed by building an iconic diagram of a spatial configuration. Difficulties with this approach arise from the fact that objects in a query expression need to be explicitly specified along with their associated class and attributes, which renders the language cumbersome for the casual user [Ege97]. Sketch-based languages are interesting examples of the visual approach. In the CIGALES system proposed by Mainguenaud and Portier [MP90], users are allowed to sketch a query by first selecting an icon of a spatial relationship and then drawing the query in the "working area" of the interface. LVIS is an extension to CIGALES [PB99] where an attempt is made to provide the functionality of a query language. Egenhofer [Ege97] and Blaser [Bla98] have also proposed a sketch-based approach where a sketch of the query is drawn by the user and interpreted by the system. A set of query results is presented to the user including exact and near matches. Sketch-based approaches are suitable for expressing similarity-based queries to spatial databases and can become complex to use in a general context when composite queries are built. Also, they either assume that users are able to sketch a query and express spatial relationships in a drawing or rely on different modalities for offering the user guidance in developing the sketch. Exact queries can be generally ambiguous due to several possible interpretations of the visual representation.
4 Language Description
Query diagrams are constructed using filters, represented by icons, between data input and output elements. Queries are visualised by a flow of information that
may be filtered or refined. The approach is based on, but substantially modifies and extends, an early example of a filter flow metaphor proposed by Young and Shneiderman [YS93]. In [YS93] a single relation was used over which users could select the attributes to constrain. The metaphor of water flowing through a series of pipes was used and the layout of the pipes indicated the binary logic operators of And and Or. Line thickness was used to illustrate the amount of flow, or data, passing through the pipes and attribute menus were displayed on the lines to indicate the constraints. Join operations were not expressed in [YS93] nor were there indications of means of handling query results. The idea was simply presented using one relation as input. The idea was later used by Murray et al. [MPG98] to devise a visual approach to querying object-oriented databases. In this paper, the basic idea of data flow between data source and results is utilised. The concept of a filter between both source and result is introduced to indicate the type of constraint expressed, whether non-spatial or spatial, as well as the type of the spatial constraint in the latter case. Spatial and non-spatial join operations are also expressed consistently. Graphical notations for intermediate query results are used to allow for tracing query histories and reuse of queries (and sub-queries). In what follows the query constructs are described in detail.

4.1 Database Schema
Consider the following object classes to be used as an example schema:

County (cname: string, geometry: polygon, area: float, population: integer, other-geometry: point)
Town (tname: string, geometry: polygon, area: float, town-twin: string, tpopulation: integer, county: county)
Road (rname: string, geometry: line, rtype: string, rcounty: string, rsurface: string)
Supermarket (sname: string, geometry: point, town: string, onroad: string)
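Purely as an illustration (not part of the paper's prototype), the example schema can be written down as Python dataclasses, which makes the attribute types and the point/line/polygon geometry of each class explicit; the geometry type aliases are our own simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Line = List[Point]
Polygon = List[Point]

@dataclass
class County:
    cname: str
    geometry: Polygon
    area: float
    population: int
    other_geometry: Point      # alternative representation for small-scale maps

@dataclass
class Town:
    tname: str
    geometry: Polygon
    area: float
    town_twin: str
    tpopulation: int
    county: County

@dataclass
class Road:
    rname: str
    geometry: Line
    rtype: str
    rcounty: str
    rsurface: str

@dataclass
class Supermarket:
    sname: str
    geometry: Point
    town: str
    onroad: str
```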
In figure 2, object classes are depicted using a rectangular box containing the name of the class and an icon representing its spatial data type, whether point, line, polygon or any other composite spatial data type defined in the database, e.g. a network. This offers the user initial knowledge of the spatial representation associated with the feature. A thick edge on the icon box is used if the object has more than one spatial representation in the database. Switching between representations is possible by clicking on the icon box. For example, a County object is represented by a polygon to depict its actual shape and by a point for manipulation on smaller scale maps. All other information pertaining to the class is accessible when the user selects the class and then chooses to view its attributes. At this point we are not primarily concerned about how the database schema is depicted, but we focus on the aspect of query visualisation. As queries are constructed, the extent of the class chosen as input to the query will flow through filters to be refined according to the constraints placed on it. Results from a query or a sub query contain the new filtered extents, and these can be used to provide access to the intermediate results as well as final results of a query or as input to other sections of the query.
Fig. 2. Example Schema. The basic spatial representation of the objects is depicted in the icons.
Fig. 3. a) An aspatial filter and a spatial filter. b) Depicting query results. "Select All From Road Where Road.rtype = 'motorway'". c) A spatial filter in a simple query construct.
A basic query skeleton consists of data input and data output elements and a filter in between. Every input object will have a related result object that can be displayed in the case of spatial objects.

4.2 Filters
Filters or constraints in a query are applied to the non-spatial (aspatial) properties of a feature as well as to its spatial properties and location. Hence, two general icons are used to represent both types of filters, as shown in figure 3. Figure 3(a) demonstrates a non-spatial filter, depicted by an A (for (stored) Attributes) symbol, and a spatial filter, depicted by the "coordinates" symbol. The non-spatial filter represents constraints over the stored attributes and the location filter represents constraints that need to be computed over the spatial location of the object. After indicating the type of filter requested, the specific condition that the filter represents is built up dynamically by guiding the user to choose from menus of attributes, operators and values; the condition is then stored with the filter and may be displayed beside the icon as shown in the figure. Several filters may be used together to build more complex conditions and queries, as will be shown in the following examples.

4.3 Query Results
The initial type of the data is defined by the extent that flows into the query. It is this type that will be passed along the data flow diagram, depicted by downward
Fig. 4. (a) Filters joined by And. (b) Filters joined by Or. (c) Visualisation of multiple filters. ”Display all the motorway roads with asphalt road surface or all the roads whose length is > 50.”
pointing arrows to the results. The type of the flow is not altered by the query constraints. The only way the type of flow can be altered is when it flows into a results box. The results of the query are depicted, as shown in figure 3(b), by a double-edged rectangular box with the class name along with any particular attributes selected to appear in the results. By default the result of the query is displayed if the object has a spatial representation. The results box can be examined at any time of query formulation and its content displayed as a map and/or by listing the resulting object properties. If none of the attributes has been selected for listing, then the default is to view all the attributes of the class. An English expression of the query producing the result box is also available for examination through the result box as shown in the figure. 4.4
Simple Query Constructs
The example in figure 3 demonstrates a simple filter to restrict the results based on a non-spatial condition. Other operators may be used, e.g.=, >, <, like, etc. Also, spatial (unary) operators may be used to filter the results, e.g. area, volume, perimeter/boundary, etc. An example of using a spatial filter is shown in figure 3(c). Simple queries may be combined using Boolean expressions. Figure 4 represents the different cases. The flow will pass through only when the constraint is satisfied. In figure 4(a), multiple constraints are shown in series to represent constraints joined by And. The flow will pass through only when both constraints are satisfied. In figure 4(b), parallel arrangement of the filters is used to indicate that the flow will pass through when either or both constraints are satisfied. Any number of constraints may be joined together by binary logic operators as shown in the example in figure 4(c) where three constraints are used. Negated constraints are depicted by the filters in figure 5(a). Not may be applied to individual constraints or to a group of constraints. Filters may be joined in any order as explained. An example of a query with negated constraints is shown in figure 5(b).
Fig. 5. (a) Negation of non-spatial and spatial filters. (b) Visualisation of the And, Or and Not operators.
Fig. 6. (a) Non-Spatial join filter. (b) Spatial join filter (c) Example query of a spatial join. Specific relationship icon replaces general spatial join to indicate the cross relationship.
4.5 Joins
Two kinds of join operations are possible in spatial databases namely, non-spatial joins and spatial joins. Both types are represented coherently in the language. Spatial joins are expressions of spatial relationships between spatial objects in the database. Examples of spatial join queries are: Display all the motorway objects crossing Mid Glamorgan, and Display all the towns north of Cardiff within South Glamorgan. Filter notations are modified to indicate the join operation as shown in figure 6(a) and (b). A join filter is associated with more than one object type. A result box is associated with every joined object class and linked to the join filter. An example of a spatial join query is shown in figure 6(c). The query finds all the motorway roads that cross counties with population more than 50,000. Note that the result box from the join operation has been modified to reflect the contents of the join table. More than one object type has been produced, in this case, roads and counties that satisfy the join condition will be displayed on the result map.
Fig. 7. Examples of symbols for some spatial relationships [CFO93]; (A) for area, (L) for line and (P) for point.
Fig. 8. Composite query. Find the supermarkets within a buffer of 0.5 km of a motorway or are outside and north-of a town whose population is greater than 10000.
A symbol of the spatial relationship sought is used to replace the “coordinate” symbol in the spatial join filter. A choice of possible spatial joins is available depending on the spatial data types of the objects joined. In the last example, all the possible relationships between line (for roads) and polygons (for counties) will be available. Spatial relationships may be classified between topological, directional and proximal. Relationships are grouped in hierarchical fashion to allow the use of finer granularities of relationships. Examples of hierarchies of topological and directional relationships are shown in figure 7. Qualitative proximal relationships, such as near and far are vague unless they explicitly reflect a pre-defined range of measures. Hence, using proximal relationships requires an indication of the measure of proximity required, e.g. within a distance of x m. Multiple spatial joins may be expressed similarly either with the same object type, e.g. to find the supermarkets outside and north of towns, or with more than one object type, e.g. to find the supermarkets north of towns and within a buffer of 5 km. from motorways as shown in figure 8.
5 Implementation
So far, the proposed language has been described independently of its implementation. In this section, an outline of the interface prototype to the language
Fig. 9. The Query Formulation Window.
is presented. The implementation of the interface aims to address some of the issues relating to schema visualisation, structuring of query results, operator assistance in general, including guided query expression, feedback and restriction of user choice to valid options during query formulation. A prototype of the interface is implemented in Delphi. A test spatial data set is stored in a relational database, linked to the query interface. The query interface window is shown in figure 9. Input data sets are selected in a Schema visualisation window. The query is formulated, in a guided fashion, using a collection of filters, including spatial, aspatial, negated and various types of spatial join filters. The interface is context-sensitive and allows only possible filters and choices to be presented to the user at the different stages of query formulation. A spatial-SQL interpretation of the flow diagram is produced and compiled to produce the result data set presented on the result window. Evaluation tests for both the language and interface have been designed and are being conducted using two categories of users, namely, users with some experience of using GIS systems and users with no prior knowledge of GIS. The evaluation test for the language makes use of the "PICTIVE" approach [Mul93] where the language elements are simulated using Post-It notes and a whiteboard.
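The translation from the flow diagram to spatial-SQL can be illustrated with a small sketch. The Python fragment below is a hypothetical illustration only (the prototype itself is written in Delphi and its internal representation is not described here): filters placed in series map to AND, parallel branches map to OR, and negated filters wrap their condition in NOT.

```python
# Hypothetical sketch (not the Delphi prototype): serializing a filter-flow
# query into an extended-SQL string. Class and operator names are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class Filter:
    condition: str          # e.g. "rtype = 'motorway'" or "length(geometry) > 50"
    negated: bool = False

    def to_sql(self) -> str:
        return f"NOT ({self.condition})" if self.negated else f"({self.condition})"


def serial_filters(filters: List[Filter]) -> str:
    """Filters placed in series on the flow diagram are joined by AND."""
    return " AND ".join(f.to_sql() for f in filters)


def parallel_branches(branches: List[List[Filter]]) -> str:
    """Parallel branches of the flow diagram are joined by OR."""
    return " OR ".join(f"({serial_filters(b)})" for b in branches)


def flow_to_sql(source: str, branches: List[List[Filter]]) -> str:
    return f"SELECT * FROM {source} WHERE {parallel_branches(branches)}"


if __name__ == "__main__":
    # "Display all the motorway roads with asphalt surface or all roads longer than 50"
    q = flow_to_sql("Road", [
        [Filter("rtype = 'motorway'"), Filter("rsurface = 'Asphalt'")],
        [Filter("length(geometry) > 50")],
    ])
    print(q)
```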
6 Conclusions
In this paper a visual approach to querying spatial databases is proposed. Examples from the GIS domain have been used throughout to demonstrate the expressiveness of the language. The design of the language tried to address several requirements and problems associated with query interfaces to spatial databases. The following is a summary of the design aspects. – Icons were used to represent the geographic features with explicit indication of their underlying spatial representation, thus offering the user a direct indication to the data type being manipulated.
– A data flow metaphor is used consistently to describe different types of query conditions namely, non-spatial and spatial constraints as well as negated constraints and spatial and non-spatial joins. – Concise representation of the metaphor was used to join multiple constraints when dealing with one object in join operations. – Intermediate results are preserved and could be queried at any point of the query formulation process and hence the query history is also preserved. – Nested and complex queries are built consistently. The consistent use of the metaphor is intended to simplify the learning process for the user and should make the query expression process easier and the query expression more readable. The approach is aimed at casual and non expert users, or at expert domain users who are not familiar with query languages to databases. The implementation of the language aims to cater for different levels of user expertise. Visual queries are parsed and translated to extended SQL queries that are linked to a GIS for evaluation.
References

Bla98. A. Blaser. Geo-Spatial Sketches, Technical Report. Technical report, National Centre of Geographical Information Analysis: University of Maine, Orono, 1998.
CF80. N.S. Chang and K.S. Fu. Query-by-Pictorial Example. IEEE Transactions on Software Engineering, 6(6):519–524, 1980.
CFO93. E. Clementini, P.D. Felice, and P.V. Oosterom. A Small Set of Formal Topological Relationships for End-User Interaction. In Advances in Spatial Databases - Third International Symposium, SSD'93, pages 277–295. Springer Verlag, 1993.
Cha90. S.K. Chang. Principles of Visual Programming Systems. Englewood Cliffs: Prentice Hall, 1990.
EB95. M.J. Egenhofer and H.T. Burns. Visual Map Algebra: a direct-manipulation user interface for GIS. In Proceedings of the Third IFIP 2.6 Working Conference on Visual Database Systems 3, pages 235–253. Chapman and Hall, 1995.
Ege91. M.J. Egenhofer. Extending SQL for cartographic display. Cartography and Geographical Information Systems, 18(4):230–245, 1991.
Ege97. M.J. Egenhofer. Query Processing in Spatial Query by Sketch. Journal of Visual Languages and Computing, 8:403–424, 1997.
Gou93. M. Gould. Two Views of the Interface. In D. Medyckyj-Scott and H.M. Hearnshaw, editors, Human Factors in GIS, pages 101–110. Bellhaven Press, 1993.
IP87. K. Ingram and W. Phillips. Geographic information processing using an SQL based query language. In Proceedings of AUTO-CARTO 8, pages 326–335, 1987.
JC88. T. Joseph and A.F. Cardena. PICQUERY: A High Level Query Language for Pictorial Database Management. IEEE Transactions on Software Engineering, 14(5):630–638, 1988.
LC95. Y.C. Lee and F.L. Chin. An Iconic Query Language for Topological Relationships in GIS. International Journal of Geographical Information Systems, 9(1):24–46, 1995.
MGP98. N. Murray, C. Goble, and N. Paton. Kaleidoscape: A 3D Environment for Querying ODMG Compliant Databases. In Proceedings of Visual Databases 4, pages 85–101. Chapman and Hall, 1998.
MP90. M. Mainguenaud and M.A. Portier. CIGALES: A Graphical Query Language for Geographical Information Systems. In Proceedings of the 4th International Symposium on Spatial Data Handling, pages 393–404. University of Zurich, Switzerland, 1990.
MPG98. N. Murray, N. Paton, and C. Goble. Kaleidoquery: A Visual Query Language for Object Databases. In Proceedings of Advanced Visual Interfaces, pages 247–257. ACM Press, 1998.
Mul93. M. Muller. PICTIVE: Democratizing the Dynamics of the Design Session. In Participatory Design: Principles and Practices, pages 211–237. Lawrence Erlbaum Associates, 1993.
PB99. M.A.A. Portier and C. Bonhomme. A High Level Visual Language for Spatial Data Management. In Proceedings of Visual '99, pages 325–332. Springer Verlag, 1999.
RS99. S. Ravada and J. Sharma. Oracle8i Spatial: Experiences with Extensible Database. In SSD'99, pages 355–359. Springer Verlag, 1999.
Wel85. C. Welty. Correcting User Errors in SQL. International Journal of Man-Machine Studies, 22:463–477, 1985.
YS93. D. Young and B. Shneiderman. A Graphical Filter/Flow Representation of Boolean Queries: A Prototype Implementation and Evaluation. Journal of the American Society for Information Science, 44(6):327–339, 1993.
Zlo77. M.M. Zloof. Query-by-Example: A Database Language. IBM Systems Journal, 16(4):324–343, 1977.
MEDIMAGE – A Multimedia Database Management System for Alzheimer's Disease Patients

Peter L. Stanchev (Kettering University, Flint, Michigan 48504, USA) and Farshad Fotouhi (Wayne State University, Detroit, Michigan 48202, USA)
Abstract. Different brain databases are used by many researchers, such as: (1) the database of anatomic MRI brain scans of children across a wide range of ages, which serves as a resource for the pediatric neuroimaging research community [6]; (2) the Brigham RAD Teaching Case Database of the Department of Radiology, Brigham and Women's Hospital, Harvard Medical School [2]; and (3) the BrainWeb Simulated Brain Database of a normal brain and a brain affected by multiple sclerosis [3]. In this paper, we present MEDIMAGE – a multimedia database for Alzheimer's disease patients. It contains imaging, text and voice data and is used to find correlations of brain atrophy in Alzheimer's patients with different demographic factors.
1 Introduction
We determined topographic selectivity and diagnostic utility of brain atrophy in probable Alzheimer’s disease (AD) and correlations with demographic factors such as age, sex, and education. A medical multimedia database management system MEDIMAGE was developed for supporting this work. Its architecture is based on the image database models [4, 7]. The system design is motivated by the major need to manage and access multimedia information on the analysis of the brain data. The database links magnetic resonance (MR) images to patient data in a way that permits the use to view and query medical information using alphanumeric, and feature-based predicates. The visualization permits the user to view or annotate the query results in various ways. These results support the wide variety of data types and presentation methods required by neuroradiologists. The database gives us the possibility for data mining and defining interesting findings.
2 The MEDIMAGE System
The MEDIMAGE system architecture is presented in Figure 1.
Fig. 1. The MEDIMAGE system architecture
2.1 MEDIMAGE System Databases
In the MEDIMAGE system there are four databases: 1. MEDIMAGE MR Database. For brain volume calculation we store a two-spinecho sequence covering the whole brain. 58 T2-weithed 3 mm slices are obtained with half-Fourier sampling, 192 phase-encoding steps, TR/TE of 3000/30, 80 ms, and a field-of-view of 20 cm. The slices are contiguous and interleaved. We collect and store also 124 T1-weighted images using TR/TE of 35/5 msec, flip angle of 35 degrees. Finally we collect patients and scanner information such as: acquisition date, image identification number and name, image modality device parameters, image magnification, etc. 2. MEDIMAGE Segmented and 3D reconstructed database. This is the collection of process magnetic resonance images – segmented and 3D rendered. 3. MEDIMAGE Test database. The test date includes patient’s results from the standard tests for Alzheimer’s disease and related disorders. 4. MEDIMAGE Radiologist comments database. This data are in two types: text and voice. They contain the radiologist findings.
2.2 MEDIMAGE MR Image Processing Tools
In the MEDIMAGE system there are three main tools for image processing. 1. MEDIMAGE MR Image Segmentation tools. These tools include bifeature segmentation tool and ventrical and sulcal CSF volume calculation tool. The CSF denotes the fluid inside the brain. • Bifeature segmentation tool. Segmentation of the MR images into GM (gray matter), white matter (WM) and CSF is perform in the following way: thirty points per compartment (15 per hemisphere) are sampled simultaneously from the proton density and T2-weigted images. The sample index slice is the most inferior slice above the level of the orbits where the anterior horns of the lateral ventricles could be seen. Using a nonparametric statistic algorithm (k-nearest neighbors supervised classification) the sample points are used to derive a “classificator” that determined the most probable tissue type for each voxel. • Ventrical and sulcal CSF volume calculation tool. A train observer places a box encompassing the ventricles to define the ventrical CSF. Subtraction the ventical from the total CSF provided a separate estimate of the sulcal CSF. 2. MEDIMAGE MR 3D reconstruction tools. These tools include total brain capacity measurement and region of interest definition tools. • Total brain capacity measurement tool. A 3D surface rendering technique is used to obtain accurate lobal demarcation. The T2-weighted images are first “edited” using intensity thresholds and tracing limit lines on each slice to remove nonbrain structures. The whole brain volume, which included brain stamp and cerebellum, is then calculated from the edit brain as an index of the total intracranial capacity and is used in the standardization procedures to correct for brain size. A 3D reconstruction is computed. • Region of interest definition tool. Using anatomical landmarks and a priori geometric rules accepted by neuroanatomic convention, the frontal, parietal, temporal, and occipital lob are demarcated manner. The vovels of the lobar region of interest is used to mask the segmented images, enabling quantification of different tissue compartments for each lobe. 3. MEDIMAGE MR Measurement tools. These tools include hippocampal volume determination tool. • Hippocampal volume determination tool. Sagical images are used to define the anterior and posterior and end points of the structure. Then they are reformatted into coronal slices perpendicular to the longitudinal axis of the hippocampal formation. Then the hippocampal perimeter is traced for each hemisphere. The demarcated area is multiplied by slice thickness to obtain the hippocampal volume in the slice. 2.3
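The bifeature segmentation step above rests on a k-nearest-neighbour vote over sampled (proton density, T2) feature pairs. The Python fragment below is a minimal sketch of that idea only; the feature values, cluster centres and k are invented for illustration and are not the parameters used by MEDIMAGE.

```python
# Minimal k-NN tissue-labelling sketch: each voxel carries a (proton-density, T2)
# pair and receives the majority label of its k closest training samples.
import numpy as np


def knn_classify(train_feats, train_labels, voxel_feats, k=5):
    """train_feats: (N, 2) sampled points; voxel_feats: (M, 2) voxels to label."""
    # Squared Euclidean distances between every voxel and every training sample.
    d2 = ((voxel_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]          # indices of the k neighbours
    votes = train_labels[nearest]                    # (M, k) label votes
    # Majority vote per voxel (labels are small non-negative integers).
    return np.array([np.bincount(v).argmax() for v in votes])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 30 samples per compartment: 0 = GM, 1 = WM, 2 = CSF (toy clusters).
    centers = np.array([[100.0, 90.0], [140.0, 60.0], [60.0, 160.0]])
    train = np.vstack([c + rng.normal(0, 5, (30, 2)) for c in centers])
    labels = np.repeat(np.arange(3), 30)
    voxels = rng.uniform(40, 170, (10, 2))
    print(knn_classify(train, labels, voxels))
```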
MEDIMAGE Database Management Tools
In the MEDIMAGE database management system there are definition, storage, manipulation and viewing tools.
1. MEDIMAGE Definition Tools. Those tools are used for defining the structure of the four databases. All of them are using relational model. 2. MEDIMAGE Storage Tools. These are tools allowing entering, deletion and updating of the data in the system. 3. MEDIMAGE Manipulation Tools. Those tools allow: image retrieval based on alphanumeric, and feature-based predicates and numerical, text, voice and statistic data retrieval. • Image retrieval. The images are searched by their image description representation, and it is based on similarity retrieval. Let a query be converted in an image description Q(q1, q2, …, qn) and an image in the image database has the description I(x1, x2, …, xn). Then the retrieval value (RV) between Q and I is defined as: RVQ(I) = Σi = 1, …,n (wi * sim(qi, xi)), where wi (i = 1,2, …, n) is the weight th specifying the importance of the i parameter in the image description and th sim(qi, xi) is the similarity between the i parameter of the query image and database image and is calculated in different way according to the qi, xi values. There are alphanumeric and feature-based predicates. • Numerical, text, voice and statistic data retrieval. A lot statistical function are available in the system allowing to make data mining using the obtain measurements and correlated them with different demographic factors. 4. MEDIMAGE Viewing Tools. Those tools allow viewing images and text, numerical and voice data from the four databases supported by the system.
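As a small illustration of the retrieval-value formula RV_Q(I) = Σ_i w_i * sim(q_i, x_i) quoted in item 3 above, the sketch below assumes two simple per-parameter similarity functions (exact match for symbolic parameters, normalised difference for numeric ones); the actual system computes sim() differently depending on the type of each parameter.

```python
# Illustrative retrieval-value computation; the sim() definitions are assumptions.
def sim(q, x):
    if isinstance(q, str) or isinstance(x, str):
        return 1.0 if q == x else 0.0            # exact match for symbolic values
    span = max(abs(q), abs(x), 1e-9)
    return 1.0 - abs(q - x) / span               # normalised difference for numbers


def retrieval_value(query, image, weights):
    return sum(w * sim(q, x) for q, x, w in zip(query, image, weights))


if __name__ == "__main__":
    query = ["T2", 0.35, 1200.0]                 # modality, CSF fraction, volume (toy)
    image = ["T2", 0.30, 1100.0]
    weights = [0.2, 0.5, 0.3]
    print(round(retrieval_value(query, image, weights), 3))
```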
3 Results Obtained with the MEDIMAGE System
The results of some of the image processing tools are given in Figures 2-7. Result from the statistical analysis applied to MR images in 32 patients with probable AD and 20 age- and sex-matched normal control subjects find the following findings. Group differences emerged in gray and white matter compartments particularly in parietal and temporal lobes. Logistic regression demonstrated that larger parietal and temporal ventricular CSF compartments and smaller temporal gray matter predicted AD group membership with an area under the receiver operating characteristic curve of 0.92. On multiple regression analysis using age, sex, education, duration, and severity of cognitive decline to predict regional atrophy in the AD subjects, sex consistently entered the model for the frontal, temporal, and parietal ventricular compartments. In the parietal region, for example, sex accounted for 27% of the variance in the parietal CSF compartment and years of education accounted for an additional 15%, with women showing less ventricular enlargement and individuals with more years of education showing more ventricular enlargement in this region. Topographic selectivity of atrophic changes can be detected using quantitative volumetry and can differentiate AD from normal aging. Quantification of tissue volumes in vulnerable regions offers the potential for monitoring longitudinal change in response to treatment.
Fig. 2. Bifeature segmentation (from the TE = 30 ms, TR = 3000 ms and TE = 80 ms, TR = 3000 ms images)
Fig. 3. Ventricular and Sulcal CSF Separation
Fig. 4. Brain Editing
Fig. 5. 3D Brain Reconstruction
Fig. 6. Region Definition
Fig. 7. Hippocampal Volume Calculation
4 Conclusions
The MEDIMAGE system was developed in the Sunnybrook Health Science Centre, Toronto, Canada, on SUN Microsystems workstations. It uses GE scanner software and the ANALYSE and SCILIMAGE packages. The medical findings are described in detail in [5]. The main advantages of the proposed MEDIMAGE system are:
• Generality. The system could easily be modified for other medical image collections. The system was also used for corpus callosum calculations [1].
• Practical applicability. The results obtained with the system define essential medical findings.
The main conclusion from using the system is that content-based image retrieval is not an essential part of such a system. Data mining algorithms play the essential roles in similar systems.
References 1. Black SE., Moffat SD., Yu DC, Parker J., Stanchev P., Bronskill M., “Callosal atrophy correlates with temporal lobe volume and mental status in Alzheimer's disease.” Canadian Journal of Neurological Sciences. 27(3), 2000 Aug., pp. 204-209. 2. Brigham RAD Teaching Case Database Department of Radiology, Brigham and Women's Hospital Harvard Medical School http://brighamrad.harvard.edu/education/online/tcd/tcd.html 3. C.A. Cocosco, V. Kollokian, R.K.-S. Kwan, A.C. Evans: "BrainWeb: Online Interface to a 3D MRI Simulated Brain Database", NeuroImage, vol.5, no.4, part 2/4, S425, 1997 - Proceedings of 3-rd International Conference on Functional Mapping of the Human Brain, Copenhagen, May 1997. 4. Grosky W., Stanchev P., “Object-Oriented Image Database Model”, 16th International Conference on Computers and Their Applications (CATA-2001), March 28-30, 2001, Seattle, Washington, pp. 94-97. 5. Kidron D., Black SE., Stanchev P., Buck B., Szalai JP., Parker J., Szekely C., Bronskill MJ., “Quantitative MR volumetry in Alzheimer's disease. Topographic markers and the effects of sex and education”, Neurology. 49(6):1504-12, 1997 Dec. 6. Pediatric Study Centers (PSC) for a MRI Study of Normal Brain Development http://grants.nih.gov/grants/guide/noticefiles/not98-114.html 7. Stanchev, P., “General Image Database Model,” Visual Information and Information Systems, Proceedings of the Third Conference on Visual Information Systems, Huijsmans, D. Smeulders A., (Eds.) Lecture Notes in Computer Science, Volume 1614 (1999), pp. 29-36.
Life after Video Coding Standards: Rate Shaping and Error Concealment

Trista Pei-chun Chen and Tsuhan Chen (Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA) and Yuh-Feng Hsu (Computer and Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu 310, Taiwan)
Abstract. Is there life after video coding standards? One might think that research has no room to advance with the video coding standards already defined. On the contrary, exciting research opportunities arise after the standards are specified. In this paper, we introduce two standard-related research areas: rate shaping and error concealment, as examples of interesting research that finds its context in standards. Experiment results are also shown.
1 Introduction
What are standards? Standards define a common language that different parties can communicate with each other effectively. An analogy to the video coding standard is the language. Only with the language, Shakespeare could create his work and we can appreciate the beautiful masterpiece of his. Similarly, video coding standards define the bitstream syntax, which enables the video encoder and the decoder to communicate. With the syntax and decoding procedure defined, interesting research areas such as encoder optimization, decoder post-processing, integration with the network transport and so on, are opened up. In other words, standards allow for advanced video coding research fields to be developed and coding algorithms to be compared on a common ground. In this paper, we consider H.263 [1] as the video coding standard example. Similar ideas can also be built on other standards such as MPEG-4 [2]. Two research areas: rate shaping [3] and error concealment [4] (Fig. 1), are introduced for networked video transport. First, we introduce rate shaping to perform joint source-channel coding. Video transport is very challenging given the strict bandwidth requirement and possibly high channel error rate (or packet loss rate). Through standards such as the real-time control protocol (RTCP, part of the real-time transport protocol (RTP)) [5], the encoder can obtain network condition information. The rate shaper uses such information to
shape the coded video bitstream before sending it to the network. The video transport thus delivers the video bitstream with better quality and utilizes the network bandwidth more efficiently.

Fig. 1. System of video transport over network
Second, we present error concealment with updating mixture of principle components. In a networked video application, even with good network design and video encoder, the video bitstream can be corrupted and become un-decodable at the receiver end. Error concealment is useful in such a scenario. We introduce in particular a model-based approach with updating mixture of principle components as the model. The User Datagram Protocol (UDP) [6] sequence number is used to inform the video decoder to perform error concealment. In addition to the two areas introduced, research areas such as video traffic modeling would not be relevant without the standards being defined. Prior work on video traffic modeling can be found in [7], [8], [9], [10], and [11]. This paper is organized as follows. In Section 2, we adopt the rate shaping technique to perform joint source-channel coding. In Section 3, updating mixture of principle components is shown to perform very well in the error concealment application. We conclude this paper in Section 4.
2 Adaptive Joint Source-Channel Coding Using Rate Shaping
Video transmission is challenging in nature because it has high data rate compared to other data types/media such as text or audio. In addition, the channel bandwidth limit and error prone characteristics also impose constraints and difficulties on video transport. A joint source-channel coding approach is needed to adapt the video bitstream to different channel conditions. We propose a joint source-channel coding scheme (Fig. 2) based on the concept of rate shaping to accomplish the task of video transmission. The video sequence is first source coded followed by channel coding. Popular source coding methods are H.263 [1], MPEG-4 [2], etc. Example channel coding methods are Reed-Solomon codes, BCH codes, and the recent turbo codes [12], [13]. Source coding refers to “scalable encoder/decoder” in Fig. 2 and channel coding refers to “error correction coding
(ECC) encoder/decoder” in Fig. 2. The source and channel coded video bitstream then passes through the rate shaper to fit the channel bandwidth requirement while achieving the best reconstructed video quality.

Fig. 2. System diagram of the joint source-channel coder: (a) encoder; (b) decoder
2.1 Rate Shaping
After the video sequence has been source and channel coded, the rate shaper then decides which portions of the encoded video bitstream will be sent. Let us consider the case where the video sequence is scalable coded into two layers: one base layer and one enhancement layer. Each of the two layers is error correction coded with different error correction capability. Thus, there are four segments in the video bitstream: the source-coding segment of the base layer bitstream (lower left segment of Fig. 3 (f)), the channel-coding segment of the base layer bitstream (lower right segment of Fig. 3 (f)), the source-coding segment of the enhancement layer bitstream (upper left segment of Fig. 3 (f)), and the channel-coding segment of the enhancement layer bitstream (upper right segment of Fig. 3 (f)). The rate shaper will decide which of the four segments to send. In the two-layer case, there are totally six valid combinations of segments (Fig. 3 (a)~(f)). We call each valid combination a state. Each state is represented by a pair of integers (x, y ) , where x is the number of source-coding segments chosen counting from the base layer and y is the number of channel-coding segments counting from the base layer. x and y satisfy the relationship of x ≥ y .
Fig. 3. Valid states: (a) State (0,0); (b) State (1,0); (c) State (1,1); (d) State (2,0); (e) State (2,1); (f) State (2,2)
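For a bitstream with an arbitrary number of layers, the valid states can be enumerated directly from the constraint x ≥ y; the short Python sketch below (illustration only, not part of the paper) reproduces the six states of Fig. 3 for the two-layer case.

```python
# Enumerate the valid (x, y) states: x source-coding segments and y channel-coding
# segments are kept, counting from the base layer, with x >= y.
def valid_states(num_layers):
    return [(x, y) for x in range(num_layers + 1) for y in range(x + 1)]


if __name__ == "__main__":
    print(valid_states(2))   # [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]
```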
The decision of the rate shaper can be optimized given the rate-distortion map, or R-D map, of each coding unit. A coding unit can be a frame, a macroblock, etc., depending on the granularity of the decision. The R-D maps vary with different channel error conditions. Given the R-D map of each coding unit with a different constellation of states (Fig. 4), the rate shaper finds the state with the minimal distortion under certain bandwidth constraint “B”. In the example of Fig. 4, State (1,1) of Unit 1 and State (2,0) of Unit 2 are chosen. Such decision is made on each of the coding unit given the bandwidth constraint “B” of that unit.
Fig. 4. R-D maps of coding units: (a) Unit 1; (b) Unit 2; (c) Unit 3 and so on
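The per-unit decision can be sketched as follows: among the states of one coding unit, keep those whose rate fits the unit budget B and pick the one with minimum distortion. The rate-distortion numbers in this Python fragment are invented for illustration.

```python
# Illustrative per-unit state selection over an R-D map (Fig. 4); numbers are toy values.
def best_state(rd_map, budget):
    """rd_map: dict state -> (rate, distortion). Returns the chosen state."""
    feasible = {s: rd for s, rd in rd_map.items() if rd[0] <= budget}
    if not feasible:
        return min(rd_map, key=lambda s: rd_map[s][0])   # fall back to the cheapest state
    return min(feasible, key=lambda s: feasible[s][1])   # minimum distortion within budget


if __name__ == "__main__":
    unit1 = {(0, 0): (0, 90), (1, 0): (30, 60), (1, 1): (45, 40),
             (2, 0): (55, 35), (2, 1): (70, 20), (2, 2): (85, 12)}
    print(best_state(unit1, budget=60))   # -> (2, 0) for this toy map
```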
Consider taking a frame as a coding unit. Video bitstream is typically coded with variable bit rate in order to maintain constant video quality. To minimize the overall distortion for a group of pictures/frames (GOP), it is not enough to choose the state for each frame based on the equally allocated bandwidth to every frame. We will introduce a smart rate shaping scheme that allocates different bandwidth to each frame in a GOP. The rate shaping scheme is based on the discrete rate-distortion combination algorithm. 2.2
Discrete Rate-Distortion Combination Algorithm
Assume there are F frames in a GOP and the total bandwidth constraint for these F frames is C. Let x(i) be the state chosen for frame i and let D_i,x(i) and R_i,x(i) be the resulting distortion and rate at frame i respectively. The goal of the rate shaper is to:

minimize   Σ_{i=1..F} D_i,x(i)                                   (1)

subject to   Σ_{i=1..F} R_i,x(i) ≤ C                             (2)
In principle, this optimization problem can be accomplished using Dynamic Programming [14], [15], [16]. The trellis diagram is formed with the x-axis being the frame index i , y-axis being the cumulative rate at frame i , and the cost function of the trellis being the distortion. If there are S states at each frame, the number of nodes at Frame i = F will be S F (if none of the cumulative rates are the same). This method is too computationally intensive. If the number of states, S , is large, the R-D map becomes a continuous curve. The Lagrangian Optimization method [16], [17], [18] can be used to solve this optimization problem. However, Lagrangian Optimization method cannot reach the states that do not reside on the convex hull of the R-D curve. In this paper, we introduce a new discrete rate-distortion combination algorithm as follows:
1. At each frame, eliminate the state in the map if there exists some other state that is smaller in rate and smaller in distortion than the one considered. This corresponds to eliminating states in the upper right corner of the map (Fig. 5 (a)).
2. At each frame i, eliminate State b if R_ia < R_ib < R_ic and (D_ib − D_ia)/(R_ib − R_ia) < (D_ic − D_ib)/(R_ic − R_ib), where State a and State c are two neighboring states of State b. This corresponds to eliminating states that are on the upper right side of any line connecting two states. For example, State b is on the upper right side of the line connecting State a and State c (Fig. 5 (b)). Thus, State b is eliminated.
3. Label the remaining states in each frame from the state with the lowest rate, State 1, to the state with the highest rate. Let us denote the current decision of state at Frame i as State u(i). Start from u(i) = 1 for all frames. The rate shaper examines the next state u(i)+1 of each frame and finds the one that gives the largest ratio of distortion decrease over rate increase compared to the current state u(i). If Frame τ is chosen, increase u(τ) by one. As an example, let us look at two frames, Frame m and Frame n in Fig. 5 (c). Current states are represented as gray dots and the next states as black dots. We can see that updating u(m) gives a larger ratio than updating u(n). Thus, the rate shaper updates u(m).
4. Continue Step 3 until the total rate meets C or will exceed C with any more update of u(i). If C is met, we are done.
5. If the bandwidth constraint is not yet met after Step 4, reconsider the states that were eliminated by Step 2. For each frame, re-label all the states from the state with the lowest rate to the state with the highest rate, and let u(i) denote the current state. Choose the frame with the next state giving the most distortion decrease compared to the current state. If Frame τ is chosen, increase u(τ) by one.
6. Continue Step 5 until the total rate meets C or exceeds C with more update of u(i).
Fig. 5. Discrete R-D combination: (a) Step 1; (b) Step 2; (c) Step 3
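A compact Python sketch of Steps 1 to 4 is given below (illustrative only). It prunes dominated and above-the-chord states, then greedily promotes, frame by frame, the state change with the largest distortion-decrease-to-rate-increase ratio until the GOP budget C would be exceeded; the Step 5/6 revisit of pruned states is omitted for brevity, and the R-D numbers are invented.

```python
# Simplified discrete R-D combination (Steps 1-4); rd[i] lists (rate, distortion)
# states for frame i, all values are toy numbers.
def prune(states):
    states = sorted(states)                      # ascending rate
    kept, best_d = [], float("inf")              # Step 1: drop dominated states
    for r, d in states:
        if d < best_d:
            kept.append((r, d))
            best_d = d
    hull = []                                    # Step 2: keep the lower convex hull
    for p in kept:
        while len(hull) >= 2:
            (r1, d1), (r2, d2) = hull[-2], hull[-1]
            if (d2 - d1) * (p[0] - r2) >= (p[1] - d2) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def allocate(rd, C):
    hull = [prune(s) for s in rd]
    u = [0] * len(rd)                            # Step 3: start from the cheapest state
    total = sum(hull[i][0][0] for i in range(len(rd)))
    while True:
        best, best_ratio = None, 0.0
        for i in range(len(rd)):
            if u[i] + 1 < len(hull[i]):
                r0, d0 = hull[i][u[i]]
                r1, d1 = hull[i][u[i] + 1]
                if total - r0 + r1 <= C and (d0 - d1) / (r1 - r0) > best_ratio:
                    best, best_ratio = i, (d0 - d1) / (r1 - r0)
        if best is None:                         # Step 4: stop when C would be exceeded
            return u, total
        r0, _ = hull[best][u[best]]
        u[best] += 1
        total += hull[best][u[best]][0] - r0


if __name__ == "__main__":
    rd = [[(10, 80), (30, 50), (60, 20)], [(15, 70), (40, 30), (90, 10)]]
    print(allocate(rd, C=100))
```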
2.3 Experiment
We compare four methods: (M1) transmits a single non-scalable and non-ECC coded video bitstream; (M2), proposed by Vass and Zhuang [19], switches between State (1, 1) and State (2, 0) depending on the channel error rate; (M3) allocates the same bit
budget to each frame and chooses the state that gives the best R-D performance for each frame; (M4) is the proposed method that dynamically allocates the bit budget to each frame in a GOP and chooses the state that gives the best overall performance in a GOP, using the algorithm shown in Sect. 2.2. Each GOP has F = 5 frames. The test video sequence is “stefan.yuv” in QCIF (quarter common intermediate format). The bandwidth and channel error rate vary over time and are simulated as AR(1) processes. The bandwidth ranges from 4k bits/frame to 1024k bits/frame; and the channel error rate ranges from 10 −0.5 to 10 −6.0 . The performance is shown in mean square error (MSE) versus the GOP number as in Fig. 6. In the case that all four methods satisfy the bandwidth constraint, the average MSE of all four methods are 10050, 5356, 2091, and 1946 respectively. The proposed M4 has the minimum distortion among all. In addition, let us compare M1 and M2 with M3 and M4. Since M1 and M2 do not have the R-D maps in mind, the network could randomly discard the bitstream sent by these two methods. The resulting MSE performance of M1 and M2 are bad. On the other hand, M3 and M4 are more intelligent in knowing that the bitstream could be non-decodable if the channel error rate is high and thus decide to allocate the bit budget to the channel-coding segments of the video bitstream. 4
Fig. 6. MSE performance of four rate shaping methods
3 Updating Mixture of Principle Components for Error Concealment
When transmitting video data over networks, the video data could suffer from losses. Error concealment is a way to recover or conceal the loss information due to the transmission errors. Through error concealment, the reconstructed video quality can be improved at the decoder end. Projection onto convex sets (POCS) [20] is one of the most well known frameworks to perform error concealment. Error concealment based on POCS is to formulate each constraint about the unknowns as a convex set. The optimal solution is obtained by recursively projecting a previous solution onto each convex set. For error concealment, the projections of data refer to (1) projecting the data with some losses to a model that is built on error-free
data, and (2) replacing data in the loss portion with the reconstructed data. The success of a POCS algorithm relies on the model onto which the data is projected. We propose in this paper updating mixture of principle components (UMPC) to model the non-stationary as well as the multi-modal nature of the data. It has been proposed that the mixture of principle components (MPC) [21] can represent the video data with a multi-modal probability distribution. For example, face images in a video sequence can have different poses, expressions, or even changes in the characters. It is thus natural to use a multi-modal probability distribution to describe the video data. In addition, the statistics of the data may change over time as proposed by updating principle components (UPC) [22]. By combining the strengths of both MPC and UPC, we propose UMPC that captures both the non-stationary and the multi-modal characteristics of the data precisely.

3.1 Updating Mixture of Principle Components
Given a set of data, we try to model the data with minimum representation error. We specifically consider multi-modal data as illustrated in Fig. 7 (a). The data are clustered to multiple components (two components in this example) in a multidimensional space. As mentioned, the data can be non-stationary, i.e., the stochastic properties of the data are time-varying. At time n , the data are clustered as Fig. 7 (a) and at time n′ , the data are clustered as Fig. 7 (b). The mean of each component is shifting and the most representative axes of each component are also rotating.
Fig. 7. Multi-modal data at (a) time n (b) time n′
At any time instant, we attempt to represent the data as a weighted sum of the mean and principle axes of each component. As time proceeds, the model changes its mean and principle axes of each component. The representation error of the model at time instant n should have less contribution from data that are further away in time from the current one. The optimization formula can be written as follows: (3)
The notations are organized as follows:
At any time instant n, this is to minimize the weighted reconstruction error with the choice of means, the sets of eigenvectors, and the set of weights. The reconstruction errors contributed by previous data are weighted by powers of the decay factor α. The solution to this problem is obtained by iteratively determining weights, means and sets of eigenvectors respectively while fixing the other parameters. That is, we optimize the weights for each data using the previous means and sets of eigenvectors. After updating the weights, we optimize the means and the eigenvectors accordingly. The next iteration starts again in updating the weights and so on. The iterative process is repeated until the parameters converge. At the next time instant n + 1, the parameters of time instant n are used as the initial parameter values. Then the process of iteratively determining weights, means and sets of eigenvectors starts again. The mean m_q^(n) of mixture component q at time n is:

m_q^(n) = (1 − w_nq² / Σ_{i=0..∞} α^i w_{n−i,q}²) m_q^(n−1) + (w_nq / Σ_{i=0..∞} α^i w_{n−i,q}²) (x_n − Σ_{j=1..M, j≠q} w_nj x̂_nj)        (4)
The covariance matrix C_r^(n) of mixture component r at time n is:

C_r^(n) = α C_r^(n−1) + (1 − α) { w_nr [(x_n − m_r) x_nᵀ + x_n (x_n − m_r)ᵀ]
    − Σ_{j=1..M} w_nj w_nr [(x_n − m_r) m_jᵀ + m_j (x_n − m_r)ᵀ]
    − Σ_{j=1..M, j≠r} w_nj w_nr Σ_{k=1..P} u_jkᵀ(x_n − m_j) [(x_n − m_r) u_jkᵀ + u_jk (x_n − m_r)ᵀ]
    − w_nr² (x_n − m_r)(x_n − m_r)ᵀ }        (5)
To complete one iteration with determination of means, covariance matrix and weights, the solution for the weights follows from the linear system

[2 X̂_iᵀ X̂_i   1 ; 1ᵀ   0] [w_i ; λ] = [2 X̂_iᵀ x_i ; 1]        (6)

where 1 = [1 ⋯ 1]ᵀ is an M × 1 vector. We see that both MPC and UPC are special cases of UMPC with α → 1 and M = 1 respectively.
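Equation (6) is the standard augmented system for a least-squares fit under a sum-to-one constraint and can be solved directly; the Python sketch below assumes the per-component reconstructions are stacked as the columns of a matrix X̂, and the dimensions are illustrative.

```python
# Illustrative constrained weight solve: minimise ||x - Xhat @ w||^2 s.t. sum(w) = 1.
import numpy as np


def solve_weights(Xhat, x):
    """Xhat: (d, M) per-component reconstructions; x: (d,) data vector."""
    d, M = Xhat.shape
    A = np.zeros((M + 1, M + 1))
    A[:M, :M] = 2.0 * Xhat.T @ Xhat
    A[:M, M] = 1.0                      # Lagrange-multiplier column
    A[M, :M] = 1.0                      # sum-to-one constraint row
    b = np.concatenate([2.0 * Xhat.T @ x, [1.0]])
    sol = np.linalg.solve(A, b)
    return sol[:M]                      # drop the multiplier


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Xhat = rng.normal(size=(16, 3))     # 3 components, 16-dimensional data
    x = 0.6 * Xhat[:, 0] + 0.3 * Xhat[:, 1] + 0.1 * Xhat[:, 2]
    print(np.round(solve_weights(Xhat, x), 3))   # approximately [0.6, 0.3, 0.1]
```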
3.2 Error Concealment with UMPC
With object based video coding standards such as MPEG-4 [2], the region of interest (ROI) information is available. A model based error concealment approach can use such ROI information and build a better error concealment mechanism. Fig. 8 shows two video frames with ROI specified. In this case, ROI can also be obtained by face trackers such as [23].
Fig. 8. Two video frames with object specified
When the video decoder receives a frame of video with an error-free ROI, it uses the data in the ROI to update the existing UMPC with the processes described in Sect. 3.1. When the video decoder receives a frame of video with corrupted macroblocks (MB) in the ROI, it uses UMPC to reconstruct the corrupted ROI. In Fig. 9, we use three mixture components (the 1st, 2nd, and 3rd) to illustrate the idea of UMPC for error concealment.
Fig. 9. UMPC for error concealment
The corrupted ROI is first reconstructed by each individual mixture component. The resulting reconstructed ROI is formed by linearly combining the three individually reconstructed ROI. The weights for the linear combination are inversely proportional to the reconstruction error of each individually reconstructed ROI. After the reconstruction of the ROI with UMPC is done, the corrupted MB are replaced with the corresponding data in the reconstructed ROI just obtained. The process of reconstruction with UMPC and replacement of the corrupted MB is repeated iteratively until the final reconstruction result is satisfactory.
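The loop just described can be sketched as follows (Python, illustrative only): each component's reconstruction is a projection onto its mean and principal axes, the blended estimate uses weights inversely proportional to the per-component error, and only the lost macroblock positions are overwritten before the next iteration. The component models and loss mask are invented for the example.

```python
# Condensed POCS-style concealment sketch with a mixture of components.
import numpy as np


def reconstruct(roi, components):
    """components: list of (mean, U) with U holding eigenvectors as columns."""
    recons, errors = [], []
    for mean, U in components:
        proj = mean + U @ (U.T @ (roi - mean))
        recons.append(proj)
        errors.append(np.linalg.norm(roi - proj) + 1e-9)
    w = 1.0 / np.array(errors)
    w /= w.sum()                                   # inverse-error weighting
    return sum(wi * ri for wi, ri in zip(w, recons))


def conceal(roi, lost_mask, components, iters=10):
    """roi: flattened region of interest; lost_mask: True where data was lost."""
    est = roi.copy()
    est[lost_mask] = est[~lost_mask].mean()        # crude initial fill
    for _ in range(iters):
        full = reconstruct(est, components)
        est[lost_mask] = full[lost_mask]           # keep the received pixels as-is
    return est


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    d, M, P = 64, 3, 2
    comps = [(rng.normal(size=d), np.linalg.qr(rng.normal(size=(d, P)))[0]) for _ in range(M)]
    clean = comps[0][0] + comps[0][1] @ np.array([1.5, -0.7])
    lost = np.zeros(d, dtype=bool)
    lost[10:26] = True                             # a run of lost coefficients
    print(np.abs(conceal(clean, lost, comps) - clean)[lost].max())
```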
3.3 Experiment
The test video sequence is recorded from a TV program. The video codec used is H.263 [1]. Some frames of this video sequence are shown in Fig. 8. We use a two-state Markov chain [24] to simulate the bursty error that corrupts the MB, as shown in Fig. 10. "Good" and "Bad" correspond to the error-free and erroneous state respectively. The overall error rate ε is related to the transition probabilities p and q by ε = p / (p + q). We use ε = 0.05 and p = 0.01 in the experiment.
Fig. 10. Two state Markov chain for MB error simulation
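A minimal sketch of this two-state (Gilbert) loss model is shown below; with ε = 0.05 and p = 0.01 as above, the implied q is 0.19. The function and its arguments are illustrative and not part of the paper's software.

```python
# Two-state Markov macroblock-loss model of Fig. 10: stay in Good/Bad with
# probabilities 1-p / 1-q, giving an overall loss rate eps = p / (p + q).
import random


def mb_loss_pattern(num_mb, eps=0.05, p=0.01, seed=0):
    q = p * (1.0 - eps) / eps                  # from eps = p / (p + q)
    rng = random.Random(seed)
    state_bad, losses = False, []
    for _ in range(num_mb):
        if state_bad:
            state_bad = rng.random() >= q      # leave Bad with probability q
        else:
            state_bad = rng.random() < p       # enter Bad with probability p
        losses.append(state_bad)
    return losses


if __name__ == "__main__":
    pattern = mb_loss_pattern(10000)
    print(sum(pattern) / len(pattern))         # close to eps = 0.05
```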
There are two sets of experiments: Intra and Inter. In the Intra coded scenario, we compare three cases: (1) none: no error concealment takes place. When the MB is corrupted, the MB content is lost; (2) MPC: error concealment with MPC as the model. The number of mixture components M are three and the number of eigenvectors P for each mixture components are two; (3) UMPC: error concealment with UMPC as the model with M = 3 and P = 2 . The decay factor is α is 0.9 . In the Inter coded scenario, we also compare three cases: (1) MC: error concealment using motion compensation; (2) MPC: error concealment with MPC as the model operated on motion compensated data; (3) UMPC: error concealment with UMPC as the model on operated motion compensated data. Fig. 11 shows the means of UMPC at two different time instances. It shows that the model captures three main poses of the face images. Since there is a change of characters, UMPC captures such change and we can see that the means describe more on the second character at th Frame 60 .
Fig. 12 and Fig. 13 show the decoded video frames without and with the error concealment. Fig. 12 (a) shows a complete loss of MB content when the MB data is lost. Fig. 12 (b) shows that the decoder successfully recovers the MB content with the corrupted ROI projected onto the UMPC model. Fig. 13 (a) shows the MB content being
recovered by motion compensation when the MB data is lost. The face is blocky because of the error in motion compensation. Fig. 13 (b) shows that the decoder successfully recovers the MB content inside the ROI with the motion compensated ROI projected onto the UMPC model.
Fig. 11. Means for UMPC at Frame 20 and 60
Fig. 12. Error concealment for the Intra coding scenario: (a) no concealment; (b) concealment with UMPC
Fig. 13. Error concealment for the Inter coding scenario with: (a) motion compensation; (b) motion compensation and UMPC
The PSNR performance of the decoded video frames is summarized in Table 1. In both the Intra and Inter scenarios, error concealment with UMPC performs the best.

Table 1. Error concealment performance (PSNR, dB) of four models at INTRA and INTER coded scenarios

        None (Intra) / MC (Inter)   MPC       UMPC
Intra   15.5519                     29.3563   30.6657
Inter   21.4007                     21.7276   22.3484
4 Conclusion
We presented two research areas: rate shaping and error concealment, that find their relevance after video coding standards are defined. With rate shaping and error concealment, we can improve the quality of service of networked video. We showed that exciting new research areas are opened up after the standards are specified.
References 1. ITU-T Recommendation H.263, January 27, 1998 2. Motion Pictures Experts Group, "Overview of the MPEG-4 Standard", ISO/IEC JTC1/SC29/WG11 N2459, 1998 3. Trista Pei-chun Chen and Tsuhan Chen, “Adaptive Joint Source-Channel Coding using Rate Shaping”, to appear in ICASSP 2002 4. Trista Pei-chun Chen and Tsuhan Chen, “Updating Mixture of Principle Components for Error Concealment”, submitted to ICIP 2002 5. H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson: “RTP: A transport protocol for real-time applications”, RFC1889, Jan. 1996. ftp://ftp.isi.edu/in-notes/rfc1990.txt 6. J. Postel, “User Datagram Protocol“, RFC 768, Aug. 1980. http://www.ietf.org/rfc/rfc768.txt 7. Trista Pei-chun Chen and Tsuhan Chen, “Markov Modulated Punctured Autoregressive Processes for Traffic and Channel Modeling”, submitted to Packet Video 2002 8. D. M. Lucantoni, M. F. Neuts, and A. R. Reibman, “Method for Performance Evaluation of VBR Video Traffic Models”, IEEE/ACM Transactions on Networking, 2(2), 176-180, April 1994 9. P. R. Jelenkovic, A. A. Lazar, and N. Semret, “The Effect of Multiple Time Scales and Subexponentiality in MPEG Video Streams on Queuing Behavior”, IEEE Journal on Selected Areas in Communications, 15(6), 1052-1071 10. M. M. Krunz, A. M. Makowski, “Modeling Video Traffic using M/G/ ∞ Input Processes: A Compromise between Markovian and LRD Models”, IEEE Journals on Selected Areas in Communications, 16(5), 733-748, 1998 11. Deepak S. Turaga and Tsuhan Chen, “Hierarchical Modeling of Variable Bit Rate Video Sources”, Packet Video 2001 12. S. Lin, D. J. Costello, Jr., Error Control Coding: Fundamentals and Application, PrenticeHall 13. S. Wicker, Error Control Systems for Digital Communication and Storage, Prentice-Hall, 1995 14. B. Bellman, Dynamic Programming, Prentice-Hall, 1987 15. G. D. Forney, “The Viterbi Algorithm”. Proc. of the IEEE, 268-278, March 1973 16. A. Ortega and K. Ramchandran, “Rate-Distortion Methods for Image and Video Compression”. IEEE Signal Processing Magazine, 15(6), 23-50 17. H. Everett, “Generalized Lagrange Multiplier Method for Solving Problems of Optimum Allocation of Resources”. Operations Research, 399-417, 1963 18. Y. Shoham and A. Gersho, “Efficient Bit Allocation for an Arbitrary Set of Quantizers”. IEEE Trans. ASSP, 1445-1453, Sep 1988
19. J. Vass and X. Zhuang, “Adaptive and Integrated Video Communication System Utilizing Novel Compression, Error Control, and Packetization Strategies for Mobile Wireless Environments”, Packet Video 2000 20. H. Sub and W. Kwok, “Concealment of Damaged Block Transform Coded Images using Projections Onto Convex Sets”, IEEE Trans. Image Processing, Vol. 4, 470-477, April 1995 21. D. S. Turaga, Ph.D. Thesis, Carnegie Mellon University, July 2001 22. X. Liu and T. Chen, "Shot Boundary Detection Using Temporal Statistics Modeling", to be appeared in ICASSP 2002 23. J. Huang and T. Chen, "Tracking of Multiple Faces for Human-Computer Interfaces and Virtual Environments", ICME 2000 24. M. Yajnik, S. Moon, J. Kurose, D. Towsley, “Measurement and modeling of the temporal dependence in packet loss”, IEEE INFOCOM, 345-52, March 1999
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion

Yuh-Reuy Lee and Chia-Wen Lin (Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621, Taiwan) and Cheng-Chien Kao (Computer & Communications Research Lab, Industrial Technology Research Institute, Hsinchu 310, Taiwan)
Abstract. Video transcoding is an efficient way for rate adaptation and format conversion in various networked video applications. Several transcoder architectures have been proposed to achieve fast processing. Recently, thanks to its relatively low complexity, the DCT-domain transcoding schemes have become very attractive. In this paper, we investigate efficient architectures for video downscaling in the DCT domain. We propose an efficient method for composing downscaled motion vectors and determining coding modes. We also present a fast algorithm to extract partial DCT coefficients in the DCT-MC operation and a simplified cascaded DCT-domain video transcoder architecture.
1 Introduction
With the rapid advance of multimedia and networking technologies, multimedia services, such as teleconferencing, video-on-demand, and distance learning have become more and more popular in our daily life. In these applications, it is often needed to adapt the bit-rate of a coded video bit-stream to the available bandwidth over heterogeneous network environments [1]. Dynamic bit-rate conversions can be achieved using the scalable coding schemes provided in current video coding standards [2]. However, it can only provide a limited number of levels of scalability (say, up to three levels in the MPEG standards) of video quality, due to the limit on the number of enhancement layers. In many networked multimedia applications, a much finer scaling capability is desirable. Recently, fine-granular scalable (FGS) coding schemes have been proposed in the MPEG-4 standard to support a fine bit-rate adaptation and limited temporal/spatial format conversions. However, the video decoder requires additional functionality to decode the enhancement layers in the FGS encoded bit-streams. Video transcoding is a process of converting a previously compressed video bitstream into another bit-stream with a lower bit-rate, a different display format (e.g.,
downscaling), or a different coding method (e.g., the conversion between H.26x and MPEGx, or adding error resilience), etc. To achieve the goal of universal multimedia access (UMA), the video contents need to be adapted to various channel conditions and user equipment capabilities. Spatial resolution reduction [5-9] is one of the key issues for providing UMA in many networked multimedia applications. In realizing transcoders, the computational complexity and picture quality are usually the two most important concerns and need to be traded off to meet various requirements in practical applications. The computational complexity is very critical in real-time applications. A straightforward realization of video transcoders is to cascade a decoder followed by an encoder as shown in Fig. 1. This cascaded architecture is flexible and can be used for bit-rate adaptation and spatial and temporal resolution-conversion without drift. It is, however, very computationally intensive for real-time applications, even though the motion-vectors and coding-modes of the incoming bit-stream can be reused for fast processing. Incoming bitstream
Fig. 1. Cascaded pixel-domain transcoder
For efficient realization of video transcoders, several fast architectures have been proposed in the literature [2-11, 14-15]. In [10], a simplified pixel-domain transcoder (SPDT) was proposed to reduce the computational complexity of the cascade transcoder by reusing motion vectors and merging the decoding and encoding process and eliminating the IDCT and MC (Motion Compensation) operations. [11] proposed a simplified DCT-domain transcoder (SDDT) by performing the motion-compensation in the DCT-domain [12] so that no DCT/IDCT operation is required. This simplification imposes a constraint that this architecture cannot be used for spatial or temporal resolution conversion and GOP structure conversion, that requires new motion vectors. Moreover, it cannot adopt some useful techniques, which may need to change the motion vectors and/or coding modes, for optimizing the performance in transcoding such as motion vector refinement [14]. The cascaded pixel-domain transcoder is drift-
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
209
free and does not have the aforementioned constraints. However, its computational complexity is still high though the motion estimation doesn’t need to be performed. In this paper, we investigate efficient realizations of video downscaling in the DCT domain. We also propose efficient methods for composing downscaled motion vectors and determining coding modes. We also present a fast algorithm to extract partial DCT coefficients in the DCT-MC operation and a simplified cascaded DCT-domain video transcoder architecture. The rest of this paper is organized as follows. In section 2, we discuss existing transcoder architectures, especially the DCT-domain transcoder for spatial downscaling. In section 3, we investigate efficient methods for implementing downsizing and motion compensation in the DCT domain. Finally, the result is summarized in section 4.
2 Cascaded DCT-Domain Transcoder for Spatial Resolution Downscaling To overcome the constraints of the SDDT, we propose to use the Cascaded DCTDomain Transcoder (CDDT) architecture which first appeared in [6]. The CDDT can avoid the DCT and IDCT computations required in the pixel-domain architectures as well as preserve the flexibility of changing motion vectors, coding modes as in the CPDT. Referring to Figure 1, by using the linearity property of the DCT transform (i.e., DCT(A+B) = DCT(A) + DCT(B)), the DCT block can be moved out from the encoder loop to form the equivalent architecture in Fig. 2(a). Each combination of IDCT, pixel-domain motion compensation, and DCT as enclosed by the broken lines is equivalent to a DCT-domain MC (DCT-MC) peration. Therefore we can derive the equivalent cascaded DCT-domain transcoder architecture as shown in Fig. 2(b). The MC-DCT operation shown in Fig. 3 can be interpreted as computing the coefficients of the target DCT block B from the coefficients of its four neighboring DCT blocks, Bi, i = 1 to 4, where B = DCT(b) and Bi = DCT(bi) are the 8×8 blocks of the DCT coefficients of the associated pixel-domain blocks b and bi of the image data. A close-form solution to computing the DCT coefficients in the DCT-MC operation was firstly proposed in [12] as follows. 4
B = ∑ H hi Bi H wi
(1)
i =1
where wi and hi ∈ {1,2,…7}. H h and H w are constant geometric transform matrii i ces defined by the height and width of each subblock generated by the intersection of bi with b. Direct computation of Eq. (1) requires 8 matrix multiplications and 3 matrix additions. Note that, the following equalities holds for the geometric transform matrices: H h = H h , H h = H h , H w = H w , and H w = H w . Using these 1 2 3 4 1 3 2 4 equalities, the number of operations in Eq. (1) can be reduced to 6 matrix multiplica-
210
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
tions and 3 matrix additions. Moreover, since H h and H w are deterministic, they i i can be pre-computed and then pre-stored in memory. Therefore, no additional DCT computation is required for the computation of Eq. (1).
ENCODE R
DECODE R Incoming Bitstream
IQ1
IDCT1
MV 1
Outgoing Bitstream
Q2
DCT
IQ2
MC
DCT
F
IDCT2
MC
F
DCT-MC 1
DCT-MC 2 MV 2
(a)
Incoming Bitstream
DECODE R
ENCODE R
IQ1
Outgoing Bitstream
Q2
IQ2
DCT-MC 1
MV 1 DCT-MC 2
MV 2
(b) Fig. 2. (a) An equivalent transform of the cascaded pixel domain transcoder; (b) cascaded DCTdomain transcoder
w1
B2
B1 h1
B
B3
B4
Fig. 3. DCT-domain motion compensation
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
211
SEQUENCE: FOREMAN-QCIF 42 Simplified DCT-domain Cascaded pixel-domain Cascaded DCT-domain
Average PSNR (dB)
40
38
36
34
32
30
32
64 Bitrate (Kbps)
96
(a) SEQUENCE: CARPHONE-QCIF Simplified DCT-domain Cascaded pixel-domain Cascaded DCT-domain
42
Average PSNR (dB)
40
38
36
34
32
30
32
64 Bitrate (Kbps)
96
(b) Fig. 4. Performance comparison of average PSNR with three different transcoders. the incoming sequence was encoded at 128 kb/s, and transcoded to 96 kb/s, 64 kb/s, and 32 kb/s, respectively for: (a) “foreman” sequence; (b) “carphone” sequence
We compare the PSNR performance of CPDT, SDDT, and CDDT in Fig. 4. Two test sequences: “foreman” and “carphone” were used for simulation. Each incoming sequence was encoded at 128 Kbps and transcoded into 96, 64,and 32 Kbps, respectively. It is interesting to observe that, though all the three transcoding architectures are mathematically equivalent by assuming that motion compensation is a linear operation, DCT and IDCT can cancel out each other, and DCT/IDCT has distributive property, the performance are quite different. The CPDT architecture outperforms the other two. Though the performance of the DCT-domain transcoders is not as ggod as the SPDT, the main advantage of the DCT-domain transcoders lies on the existing efficient algorithms for fast DCT-domain transcoding [10,11,18,19], which make them
212
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
very attractive. For spatial resolution downscaling, we propose to use the cascaded DCT-domain transcoder shown in Fig. 5. This transcoder can be divided into four main functional blocks: decoder, downscaler, encoder, and MV composer, where all the operations are done in the DCT domain. In the following, we will investigate efficient schemes for DCT-domain downscaling. DECODER Incoming Bitstream
ENCODER DCT-domain downscaling
VLD +IQ1
Outgoing Bitstream
Q2
IQ2
DCT-MC 1
MV 1 DCT-MC 2 MV Composition
v$
MV 2
Fig. 5. Proposed DCT-domain spatial resolution down-conversion transcoder
3
Algorithms for DCT-Domain Spatial Resolution Downscaling
3.1
DCT-Domain Motion Compensation with Spatial Downscaling
Consider the spatial downscaling problem illustrated in Fig. 6, where b1, b2, b3, b4 are the four original 8×8 blocks, and b is the 8×8 downsized block. In the pixel domain, the downscaling operation is to extract one representative pixel (e.g., the average) out of each 2x2 pixels. In the following, we will discuss two schemes for spatial downscaling in the DCT domain which may be adopted in our DCT-domain downscaling transcoder.
b1 8x8
b2 8x8
b3 8x8
b4 8x8
downscaling
b 8x8
Fig. 6. Spatial resolution down-conversion
A. Filtering + Subsampling Pixel averaging is the simplest way to achieving the downscaling, which can be implemented using the bilinear interpolation expressed below [6,14].
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
213
4
b = ∑ hibi g i
(2)
q 4×8 t t h1 = h 2 = g1 = g3 = 04×8 h = h = g t = gt = 04×8 4 2 4 q 3 4×8
(3)
i=1
The filter matrices, hi and
gi , are
where
0 0 0 0 0 0.5 0.5 0 0 0 0.5 0.5 0 0 0 0 , and 0 is a 4×8 zero matrix. 4×8 q 4×8 = 0 0 0 0 0.5 0.5 0 0 0 0 0 0 0 0.5 0.5 0 The above bilinear interpolation procedure can be performed in the DCT domain directly to obtain the DCT coefficients of the downsized block (i.e., B = DCT(b)) as follows: 4
4
i =1
i =1
B = ∑ DCT(h i ) DCT(bi ) DCT(g i ) = ∑ H i Bi Gi
(4)
Other filtering methods with a larger number of filter taps in hi and g i may achieve better performance than the bilinear interpolation. However, the complexity may increase in pixel-domain implementations due to the increase in the filter length. Nevertheless, the DCT-domain implementation cost will be close to the bilinear interpolation, since in Eq. (4) Hi and Gi can be precomputed and stored, thus no extra cost will be incurred. B. DCT Decimation It was proposed in [13,14] a DCT decimation scheme that extracts the 4x4 lowfrequency DCT coefficients from the four original blocks b1-b4, then performs 4x4 IDCT to obtain four 4x4 subblocks, and finally combine the four subblocks into an 8x8 blocks. This approach was shown to achieve significant performance improvement over the filtering schemes [14]. [8] interpreted the DCT decimation as basis vectors resampling, and presented a compressed-domain approach for the DCT decimation as described below. Let B1, B2, B3, and B4, represent the four original 8×8 blocks; Bˆ1 , Bˆ 2 , Bˆ3 and Bˆ 4 be the four 4×4 low-frequency sub-blocks of B1, B2, B3, and B4, respectively; b$ 1 b$ 2 b$ i = IDCT( Bˆi ) , i = 1, …, 4. Then b$ = is the downscaled version of b$ 3 b$ 4 8×8
214
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
def b b2 = DCT(bˆ) from Bˆ , Bˆ , Bˆ and Bˆ , we can use the . To compute B b= 1 1 2 3 4 b3 b 4 16×16
following expression:
ˆ t Bˆ = TbT b$ 1 b$ 2 TLt = [TL TR ] $ $ t b 3 b 4 TR 1T T t B 2T T t T t B 4 4 4 = [TL TR ] 4 Lt t t T4 B 3T4 T4 B 4T4 TR 1 (T T t )t + (T T t ) B 2 (T T t )t + (T T t ) B 3 (T T t )t = (TLT4t ) B L 4 L 4 R 4 R 4 L 4 t t t +(T T ) B 4 (T T ) R 4
(5)
R 4
In addition to the above formulation, [8] also proposed a decomposition method to convert Eq. (5) into a new form so that matrices in the matrix multiplications become more sparse to reduce the computation. 3.2
Motiov Vector Composition and Mode Decision
After downscaling, the motion vectors need to be re-estimated and scaled to obtain a correct value. Full-rang motion re-estimation is computationally too expensive, thus not suited to practical applications. Several methods were proposed for fast composing the downscaled MVs based on the motion information of the original frame [7,14,17]. In [14], three methods for composing new motion vectors for the downsized video were compared: median filtering, averaging, and majority voting. It was shown in [14] that the median filtering scheme outperforms the other two. We propose to generalize the media filtering scheme to find the activity-weighted median of the four original vectors: v1, v2, v3, v4. In our method the distance between each vector and the rest is calculated as the sum of the activity-weighted distances as follows:
di =
1 ACTi
4
∑ v −v j =1 j ≠i
i
j
(6)
where the MB activity can be the squared or absolute sum of DCT coefficients, the number of nonzero DCT coefficients, or simply the DC value. In our method, we adopted the squared sum of DCT coefficients of MB as the activity measure. The activity-weighted median is obtained by finding the vector with the least distance from all. That is
v=
1 arg min di vi ∈{v1 , v2 , v3 , v4 } 2
(7)
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
215
Fig. 7 shows the PSNR comparison of three motion vector composition scheme: 2 activity-weighted median (denoted by DCT-coef ), the maximum DC method in [17] (denoted by DC-Max), and the average vector scheme (denoted by MEAN). The simulation result that the activity-weighted media outperforms the other two.
(a)
(b) Fig. 7. PSNR performance comparison of three motion vector composition schemes. The input sequences: (a) “foreman” sequence; (b) “news” sequence, are transcoder form 256 Kbps, 10fps into 64 Kbps, 10fps
After the down-conversion, the MB coding modes also need to be re-determined. In our method, the rules for determining the code modes are as follows: (1) If at least one of the four original MBs is intra-coded, then the mode for the downscaled MB is set as Intra. (2) If all the four original MBs are inter-coded, the resulting downscaled MB will also be inter-coded.
216
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
(3) If at least one original MB is skipped, and the reset are inter-coded, the resulting downscaled MB will be inter-coded. (4) If all the four original MBs are skipped, the resulting downscaled MB will also be skipped. Note, the motion vectors of skipped MBs are set to zero.
3.3 Computation Reduction in Proposed Cascaded DCT-Domain Downscaling Transcoder In Fig. 4, the two DCT-MCs are the most expensive operation. In our previous work [18], we showed that for each 8×8 DCT block, usually only a small number of lowfrequency coefficients are significant. Therefore we can use the fast significant coefficients extraction scheme proposed in [18] to reduce the computation for DCT-MC. The concept of significant coefficients extraction is illustrated in Fig. 8, where only partial coefficients (i.e., n ≤ 8) of the target block need to be computed. n1×n1
n2×n2 n×n
B1
B2 n4×n4
n3×n3
B3
Bˆ
B
B4
Fig. 8. Computation reduction for DCT-MC using significant coefficients extraction
The DCT-domain down-conversion transcoder can be further simplified by moving the downscaling operation into the decoder loop so that the decoder only needs to decode one quarter of the original picture size. Fig. 9 depicts the proposed simplified architecture. With this architecture both the computation and memory cost will be reduced significantly. However, similar to the down-conversion architectures in [20,21], this simplified transcoder will result in drift errors due to the mismatch in the frame stores between the front-end encoder and the reduced-resolution decoder loop of the transcoder. Several approaches have been presented to mitigate the drift problem [20,21], which may introduce some extra complexity. In MPEG video, since the drift in B frames will not result in error propagation, a feasible approach is to perform fullresolution decoding for I and P frames, and quarter-resolution decoding for B frames.
4
Summary
In this paper, we presented architectures for implementing spatial downscaling video transcoders in the DCT domain and efficient methods for implementing DCT-domain motion compensation with downscaling. We proposed an activity-weighted median
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
217
filtering scheme for composing the downscaled motion vectors, and also a method for determining the decision mode. We have also presented efficient schemes for reducing the computational cost of the downscaling trancoder. DECODER Incoming Bitstream
ENCODER
VLD +IQ1
Outgoing Bitstream
Q2 Downscaled DCT-MC 1
IQ2
MV 1 DCT-MC 2 MV Composition
v$
MV 2
Fig. 9. Simplified DCT-domain spatial resolution down-conversion transcoder
References 1. 2. 3. 4. 5. 6. 7. 8. 9.
10. 11.
12.
Moura, J., Jasinschi, R., Shiojiri-H, H., Lin, C.: Scalable Video Coding over Heterogeneous Networks. Proc. SPIE 2602 (1996) 294-306 Ghanbari, M.: Two-Layer Coding of Video Signals for VBR Networks. IEEE J. Select. Areas Commun. 7 (1989) 771-781 Sun, H., Kwok, W., Zdepski, J. W.: Architecture for MPEG Compressed Bitstream Scaling. IEEE Trans. Circuits Syst. Video Technol. 6 (1996) 191-199 Eleftheriadis, A. Anastassiou, D.: Constrained and General Dynamic Rate Shaping of Compressed Digital Video. Proc. IEEE Int. Conf. Image Processing (1995) Hu, Q., Panchanathan, s.: Image/Video Spatial Scalability in Compressed Domain. IEEE Trans. Ind. Electron. 45 (1998) 23–31 Zhu, W., Yang, K., Beacken, M.: CIF-to-QCIF Video Bitstream Down-Conversion in the DCT Domain. Bell Labs technical journal 3 (1998) 21-29 Yin, P., Wu, M., Liu, B.: Video Transcoding by Reducing Spatial Resolution. Proc. IEEE Int. Conf. Image Processin (2000) R. Dugad and N. Ahuja, “A Fast Scheme for Image Size Change in the Compressed Domain. IEEE Trans. Circuit Syst. Video Technol. 11 (2001) 461-474 N. Merhav and V. Bhaskaran, “Fast Algorithms for DCT-Domain Image Down-Sampling and for Inverse Motion Compensation. IEEE Trans. Circuits Syst. Video Technol. 7 (1997) 468–476 Keesman, g. et al.: Transcoding of MPEG Bitstreams. Signal Processing: Image Commun. 8 (1996) 481-500 Assuncao, P. A. A., Ghanbari, M.: A Frequency-Domain Video Transcoder for Dynamic Bit-rate Reduction of MPEG-2 Bit Streams. IEEE Trans. Circuits Syst. Video Technol. 8 (1998) 953-967 Chang, S. F., Messerschmitt, D. G.: Manipulation and Compositing of MC-DCT Compressed Video. IEEE J. Select. Areas Commun. (1995) 1-11
218 13. 14.
15. 16.
17.
18. 19.
20. 21.
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao Tan, K. H., Ghanbari, M.: Layered Image Coding Using the DCT Pyramid. IEEE Trans. Image Processing 4 (1995) 512-516 Shanableh T., Ghanbari, M.: Heterogeneous Video Transcoding to Llower Spatiotemporal Resolutions and Different Encoding Formats. IEEE Trans. on Multimedia 2 (2000) 101-110 Shanableh T., Ghanbari, M.: Transcoding Architectures for DCT-Domain Heterogeneous Video Transcoding. Proc. IEEE Int. Conf. Image Processing (2001) Seo, K., Kim J.: Fast Motion Vector Refinement for MPEG-1 to MPEG-4 Transcoding with Spatial Down-sampling in DCT Domain. Proc. IEEE Int. Conf. Image Processing (2001) 469-472 17 Chen, M.-J., M.-C. Chu, M.-C., Lo, S.-Y.: Motion Vector Composition Algorithm for Spatial Scalability in Compressed Video. IEEE Trans. Consumer Electronics 47 (2001) 319-325 18 Lin, C.-W., Lee, Y.-R.: Fast Algorithms for DCT Domain Video Transcoding. Proc. IEEE Int. Conf. Image Processing (2001) 421-424 19 Song, J., Yeo, B.-L.: A Fast Algorithm for DCT-Domain Inverse Motion Compensation based on Shared Information in a Macroblock. IEEE Trans. Circuits Syst. Video Technol. 10 (2000) 767-775 20 Vetro, A., Sun, H., DaGraca, P., Poon, T.: Minimum Drift Architectures for Threelayer Scalable DTV Decoding. IEEE Trans. Consumer Electronics 44 (1998) 21 Vetro, A., Sun, H.: Frequency Domain Down-Conversion Using an Optimal Motion Compensation Scheme. Int’l Journal of Imaging Systems & Technology 9 (1998)
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast of Streaming Video Chin-Ying Kuo1, Chen-Lung Chan1, Vincent Hsu2, and Jia-Shung Wang1 1
Department of Computer Science, National Tsing Hua University, HsinChu, Taiwan _QVHVNW[ERKa$GWRXLYIHYX[ 2 Computer & Communications Research Laboratories, Industrial Technology Research Institute, HsinChu, Taiwan ZLWY$MXVMSVKX[
Abstract. Modern multimedia services usually distribute their contents by means of streaming. In most systems, the point-to-point delivery model is adopted but also known as less efficient. To extent scalability, some services apply periodic broadcast to provide an efficient platform that is independent of the number of clients. These periodic broadcast services can significantly improve performance, however, they require a large amount of client buffers also be inadequate to run on heterogeneous networks. In this paper, we propose a novel periodic broadcast scheme that requires less buffer capacity. We also integrate a receiver-driven channel adjustment adaptation to adjust the transmission rate for each client.
1 Introduction Streaming is the typical technology used to provide various real-time multimedia services. The primary benefit of streaming is processing playback without downloading the entire video in advance. In this architecture, the content server packetizes the video into packets and transmits them to clients. Each client merely acquires a small playback buffer to compose successive video packets they received from the networks and composes these packets to video frames for playing. Although streaming technology is flexible, it cannot support a large-scale system because each client must demand a server stream. Point-to-point communication is known inefficient, so some novel services apply broadcast or multicast to raise scalability. In conventional broadcast systems, each video is continuously broadcasted on the networks. The transfer rate of a video equals to its consumption rate and no additional buffer space is required at the client side. This scheme is efficient but inflexible because long waiting time may be required if the client requests just after the start of broadcasting. The waiting time in this case is almost the same as the playback duration. To reduce such delay, some straightforward schemes allocate multiple channels S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 219–228, 2002. © Springer-Verlag Berlin Heidelberg 2002
220
Chin-Ying Kuo et al.
to broadcast a popular video. For example, if we allocate three video channels for an 84-minute video, we can partition the whole video into three segments and broadcast these segments periodically in distinct channels. As Fig. 1 displays, the maximum waiting time can be significantly reduced to 28 minutes. time Channel 0
S1
S1
S1
….
Channel 1
S2
S2
S2
….
Channel 2
S3
S3
S3
….
28 minutes S 1 : the first 28 minutes of the video S 2 : the second 28 minutes of the video S 3 : the final 28 minutes of the video
Fig. 1. Broadcasting with multiple channels.
Broadcast-based multimedia delivery is an interesting topic, and many data broadcasting schemes [1–8] are proposed nowadays. We first discuss the concept of fast data broadcasting scheme [7]. The primary contribution of fast data broadcasting is reducing the initial delay of playback. However, a huge client buffer is required to store segments that cannot be immediately played out. Suppose k channels are allocated for a video with length L. The sequence {C0, C1, …, CK-1} represents the k channels correspondingly. The bandwidth of each channel equals to the consumption rate of the video. Besides, the video is equally divided into N segments, where N = 2k - 1. Suppose Si represents the ith segment of the video, so the entire video can be constituted as S1 · S2 ·…· SN. We allocate the channel Ci for segments {Sa, …,Sb}, where i = i i+1 i 0, 1, …, k-1, a = 2 , and b = 2 - 1. Within the channel Ci, these 2 data segments are broadcasted periodically. As Fig. 2 indicates, the video is partitioned into 7 segments and then is broadcasted on 3 channels. We observe that the viewer's initial delay (noted as d) is reduced to 12 minutes. Comparing with the previous broadcast scheme which waiting time equals 28 minutes, the fast data broadcasting is much more intelligent. L (the whole movie)
S1 S2 d
······
S7
Channel 0 S 1 S 1 S 1 S 1 S 1 S 1 S 1 S 1 · · · Channel 1 S 2 S 3 S 2 S 3 S 2 S 3 S 2 S 3 · · · Channel 2 S 4 S 5 S 6 S 7 S 4 S 5 S 6 S 7 · · · Fast Service (Needs buffer) Service without buffer
Fig. 2. An example of fast data broadcasting (k=3).
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast
221
Although fast data broadcasting reduces the waiting time, extensive buffer requirement (about 50% per video) at the client side requires more cost on equipment. In addition, before applying fast data broadcasting scheme, the service provider must predict the popularity of each video. We should allocate more channels for popular videos. If the prediction is not accurate or the popularity changes in the future, the allocation will be wasteful. To overcome this drawback, adaptive fast data broadcasting scheme [8] is proposed. If the video was not requested for a long time, the server will attempt to release channels allocated for this video if possible. The newly free channel can be used by other popular videos therefore the efficiency can be enhanced. And if the video is demanded again, the server allocates new channels for it. With adaptive data broadcasting scheme, the system can be more flexible. Although fast data broadcasting and adaptive fast data broadcasting are interesting, they are not efficient enough. We propose a novel dynamic data broadcast scheme in this study. In our scheme, both viewer’s waiting time and storage requirement are reduced. In addition, the popularity of a video is used to determine the bandwidth allocation by modifying the channel allocation. Moreover, when some videos are going to be on-line or off-line, the system will intelligently determine an appropriate channel allocation for them. RR 11
1 0 M b p s GG22 5 0 0 k b p s
3 0 0 k b p s
1 0 M b p s SS
RR 2 2
GG11 1 0 M b p s RR 33
Fig. 3. A heterogeneous network.
Although periodic broadcast provide an efficient platform for multimedia delivery, the available network bandwidth for each client usually substantially varies in Internet. As depicted in Fig. 3, server S transmits a video with 10 Mbps. For receiver R3, a perfect video service is available since R3 has sufficient bandwidth to receive all data packets of the video. However, a bottleneck is observed between two gateways G1 and G2, thus, both receiver R1 and R2 would loss many data packets so they cannot enjoy the playback smoothly. Applying receiver-driven bandwidth adaptation to adjust the transmission rate to meet different clients’ network capacities is a well-known approach. The general receiver-driven bandwidth adaptation integrates a multi-layered coding algorithm with a layered transmission system. In layered coding algorithm, it encodes a video into multiple layers including one base layer (denoted as layer one) and several enhanced layers (denoted as layer 2, layer 3, …etc.). By subscribing numbers of layers depending on its network bandwidth, each client receives the best quality of the video that the network can deliver. McCanne, Jacobson and Vetterli [9] proposed a receiver-driven layered multicast (RLM) scheme by extending the multiple
222
Chin-Ying Kuo et al.
group framework with a rate-adaptation protocol. Thus, the transmission of different layered signals over heterogeneous networks is possible. In this scheme, a receiver searches for the optimal level of subscription by two rules:
• •
Drop one layer when congestion occurs. Add one layer when receive successfully.
After perform rate-adaptation on the case in Fig. 3, we have the flow in Fig. 4 Suppose the source S transmits three layers of video by 200 kbps, 300 kbps, 500 kbps, respectively. Because network bandwidth between S and R3 is high, R3 can successfully subscribe all three layers and enjoys the highest video quality. However, since only 500 kbps capacity is available on G2, R1 and R2 cannot receive the entire three layers. At G2, the third layer will be dropped then R1 can only subscribe two layers. For R2, because the network bandwidth is only 300 kbps, it must drop the second layer and subscribe the base layer only. However, the RLM scheme treats each stream independently. If multiple streams pass the same bottleneck link (which are called sharing streams), they may compete for the limited bandwidth because they do not know the sharing status. This may cause unfairness of subscription level of different streams. Therefore, flexible bandwidth allocation adapted to receivers is necessary to share the bandwidth. One approach named Multiple Streams Controller (MSC) was proposed in [10]. In this scheme, it is an RLM-based method with MSC at every client end. It can dynamically adjust the subscription level owing to the available bandwidth. RR 11
1 0 Mb p s G G22 5 0 0 k b p s
3 0 0 k b p s
1 0 Mb p s SS
RR 22
GG1 1 1 0 Mb p s RR 3 3
Fig. 4. Layer subscription.
Bandwidth adaptation schemes described above are developed over multi-layered coded streaming system. However, the implementation of layered coding is still not popular even though the standard of MPEG-4 supports multi-layered coding. Without multi-layered coding, re-encoding the source media into streams with various qualities in server or intermediate nodes is another solution. In these designs, transcoders and additional buffer spaces are required. The buffer is employed to store input streams temporally, and the transcoders are used to re-encode video streams stored in the buffer to output streams with various bit-rate. Each client continues probing the network and sends messages containing the status to the corresponding intermediate node. When the server or intermediate nodes receive these messages, they determine the number of streams that the transcoder should generate and then forward these
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast
223
streams to clients. Although transcoding shows a candidate solution while lacking layered coding, the computation complexity in intermediate nodes is expensive if the service scale substantially extends. Does video quality be the only metrics that impacts network bandwidth? The answer is generally yes in end-to-end transmission systems, but not absolutely in periodic broadcast. The bandwidth requirement in periodic broadcast is proportional to the number of channels, so adjusting transmission quality implies changing the number of channels. Furthermore, the quality of streams can also be referred as waiting time and client buffer size of a video in periodic broadcast. Therefore, the concept of receiverdriven bandwidth adaptation can be easily transformed to periodic broadcast. This is our primary target of this study. The rest of this paper is organized as follows. Section 2 describes the broadcast scheme we proposed. Section 3 introduces the integration of our broadcast scheme and a receiver-driven channel adjustment adaptation. Conclusion is then made in Section 4.
2 Our Broadcast Scheme In most periodic broadcast schemes, the permutation of segments to be broadcasted in each channel is determined initially. These schemes usually apply formulas to assign each segment to appropriate channel. For example, fast data broadcast scheme assigns 1, 2, 4, … segments to the first, second, third, … channels, respectively. Although periodic broadcasting schemes can serve a popular video with shorter viewer’s waiting time, large amount of storage requirements at client end is necessary. Assume the video length is L and the consumption rate is b. In fast data broadcasting, client buffer usage is varied from 0 to about 0.5*L*b. The buffer utilization varies too significantly. If the buffer can be utilized more evenly, we can reduce the buffer requirement in the k worst case. In fast data broadcasting, it divides a video into 2 – 1 segments where k is the number of channels. In order to reduce the receiver’s buffer requirements, we hope to allocate one additional channel to improve the flexibility of segment delivery. We k-4 define a threshold of the buffer size as 0.15*L*b. In this case, at most 2 segments size will be required at each client side. If the number of channel is less than 4, no buffer is needed for a receiver. Since the client buffer size is controlled under 0.15*L*b, if a receiver’s buffer requirement exceeds 0.15*L*b, we can use the additional channel to assign segments into different time slots. Thus, buffer usage of each receiver is evenly. In the case that we have k channels, C0, C1, …, Ck-1, for a video of length L. Each channel has bandwidth b, which is assumed the same as the consumption rate of a k-1 video. The video is divided equally into N segments, where N = 2 – 1. Let Si denote the ith segment, the video is constituted as (S1, S2, …, SN). Let Bc denote the maximum k-4 buffer requirement at the client end, where Bc = 2 segments. Suppose there is at least one request at each time interval. First, a segment Si is assigned to a free channel if it must be played immediately. If some channels are idle, we assign segments which will
224
Chin-Ying Kuo et al.
be played later into these empty channels. The corresponding clients must store these segments in their buffer. If there is no new request at some time interval, the latest allocated channel can be released. t C
Playing Buffered segment segment
0
S 1 d
0
V
0
S
1
V V
0
S S
2
t 0+ d C
0
C
1
S
1
S
1
S
2
1
S
1
2
t 0+ 2 d C
0
C
1
S
1
S
1
S
1
S
2
S
3
V V V
S S S
0 1 2
3
S S
2 1
3 3
t 0+ 3 d C
0
C
1
C
2
S
1
S S
1
S
2
S
1
S
3
S
2
S
4
V V V V
1
S S S S
0 1 2 3
4
S S S
3 2 1
4 3 2
t 0+ 4 d C C
0 1
C
2
C
3
S
1
S S
1 2
S S
1 3
S S S
1 2 4
S S
1 5
S
6
S
4
V V V V V
0 1 2 3 4
S S S S S
5 4 3 2 1
S S S S S
6 5 4 4 4
Fig. 5. An example of our data broadcast schedule.
Consider the example displayed in Fig. 5, the video is divided into 7 segments and 4 channels are available. At t0, a new channel C0 is allocated for the video and the first segment S1 is assigned into C0 to serve the viewerV0. Since the new viewerV1 issues at t0 +d, the segment S1 is assigned into C0 again. In addition, we allocate a new channel C1 to transmit S2 for servicing V0. However, the operation in V1 is more complex. V1 must play S1 directly from the network and save S2 into the buffer for future playback. To serve V2 at t0+2d, we still must assign S1 into C0. At the same time, V1 reads S2 from local disk because S2 has been stored at t0 +d, so we need not broadcast S2. The only segment that must be broadcasted now is S3. We observe that only two channels are required at t0 +2d. When the scheme proceeds to t0+3d, only three channels are required because S3 for V1 was already stored in the buffer. By the same procedure, we observe the system required only two channels at t0+4d. Since S4 and S6 will be played
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast
225
by V2 and V0 later, we can assign them to C2 and C3 now. If we do not apply this assignment, V0 and V2 will cause the system allocate too many channels when they play these segments. In this example, we utilize at most 4 channels at server side and 1segment buffer at client side (about 0.143*L*b). Our scheme can amazingly reduce the buffer requirement. In our scheme, the channels can be dynamically allocated and deallocated. Fig. 6 shows the situation if there is no request between t0 + 6d and t0 + 7d. Since no new request issues, we can release the latest allocated channel C3 at t0 + 7d. In addition, only two segments S2 and S3 are required immediately, so we can assign S7 in the empty channels. t 0 + 6d C0
S1
C1
S1
S1
S1
S1
S1
S1
S2
S3
S4
S5
S2
S5
S2
S6
S3
S6
S4
S7
S4
C2 C3
t 0 + 7d C0 C1 C2
S1
S1
S1
S1
S1
S1
S1
S2
S2
S3
S4
S5
S2
S5
S3
S2
S6
S3
S6
S7
S4
S7
S4
Fig. 6. A condition to release a channel.
Consider the example displayed in Fig. 5, the video is divided into 7 segments and 4 channels are available. At t0, a new channel C0 is allocated for the video and the first segment S1 is assigned into C0 to serve the viewerV0. Since the new viewerV1 issues at t0 +d, the segment S1 is assigned into C0 again. In addition, we allocate a new channel C1 to transmit S2 for servicing V0. However, the operation in V1 is more complex. V1 must play S1 directly from the network and save S2 into the buffer for future playback. To serve V2 at t0+2d, we still must assign S1 into C0. At the same time, V1 reads S2 from local disk because S2 has been stored at t0 +d, so we need not broadcast S2. The only segment that must be broadcasted now is S3. We observe that only two channels are required at t0 +2d. When the scheme proceeds to t0+3d, only three channels are required because S3 for V1 was already stored in the buffer. By the same procedure, we observe the system required only two channels at t0+4d. Since S4 and S6 will be played by V2 and V0 later, we can assign them to C2 and C3 now. If we do not apply this assignment, V0 and V2 will cause the system allocate too many channels when they play these segments. In this example, we utilize at most 4 channels at server side and 1segment buffer at client side (about 0.143*L*b). Our scheme can amazingly reduce the buffer requirement. In our scheme, the channels can be dynamically allocated and deallocated. Fig. 6 shows the situation if there is no request between t0 + 6d and t0 + 7d. Since no new request issues, we can release the latest allocated channel C3 at t0 +
226
Chin-Ying Kuo et al.
7d. In addition, only two segments S2 and S3 are required immediately, so we can assign S7 in the empty channels.
3 Channel Adjustment In periodic data broadcasting scheme, all clients are served with the same video quality. However, practical networks are usually heterogeneous, so we cannot assume that each client can enjoy the same transmission quality. As we described previously, the requirement of a receiver-driven bandwidth adaptation scheme for data broadcasting is emergent. In this paper, we propose a "channel adjustment" process to approach receiver-driven concept on dynamic data broadcasting scheduling. Consider a video is transmitted to clients in different networks. These clients must calculate the loss rate of this video while taking the requiring data. The server collects the information of the loss rate in clients and determines the appropriate number of channels. If more than half clients are in congestion, the channel adjustment process should be activated to reduce the number of channels. The network traffic can be reduced correspondingly. The concept of our channel adjustment is described in the follows. 15
15
Suppose a hot video is divided into 15 segments (S 1 ~ S 15 ) and transmitted by 5 video channels (C0 ~ C4) on a server end. Suppose congestion happens in most clients, thus, one channel should be released to reduce network traffic. Since the number of 7
channels is decreased to 4 now, the video must be re-divided into 7 segments (S 1 ~ 7
S 7 ). All on-line views must not be delayed while the number of channel decreases. Assume our adjustment starts at H0. We first find the least common multiplier (l.c.m.) of the segment numbers, 7 and 15 in both conditions. Since the least common multi105
plier of 7 and 15 is 105, we virtually divide the video into 105 segments (S 1 105 105
~S
). Table 1 shows the mapping between these segments, and Fig. 7 displays the
example of such channel adjustment. Suppose S
15 13
is necessarily transmitted at H0 to
7 S1
serve previous viewers (Vp). In addition, is also required now to serve new view105 ers. Since we virtually divide segments into S in channel adjustment process, seg7 15 105 ments of S and S can be served as S . Thus, although these segments differ in their sizes, they still can be received by clients without overlap by applying our segment mapping process. In addition, if free blocks are available (the dotted-rectangle in Fig. 7), we can put segments which will be required by client to it. As Fig. 8 displays, S and S
15 15
Thus, S
will be required by Vp and we can assign both S 105 92
~S
105 99
are assigned to channel C1 and S
105 100
15 14
~S
and S 105 105
15 15
15 14
to free blocks.
are assigned to chan-
nel C2. Because the channel adjustment is easy, we can make it transparent to the dynamic data broadcasting. The channel adjustment process completes after all viewers
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast 7
227
receive all segments in original S successfully. Since only 4 channels are required in the case that a video is divided into 7 segments, one video channel can be released from now on. Therefore, the network bandwidth is successfully reduced. Table 1. Least common multiplier for sub-segments mapping. 105
The number of divided segments
Mapping to S n
7
S i = S (i−1)*105/7+1 ~ S i*105 / 7
7
15
S i = S (i−1)*105/15+1 ~ S i*105 /15
7 (S i , i = 1~7)
15
15 (S i , i = 1~15)
105
105
105
105
H0 C0
S
C1
S
7 1
15 13
C2 C3 C4 H0
Ma p p i n g
C 0 S11 0 5 S 120 5
S1150 5
C 1 S18055 S 18065 C2
S
10 5 91
C3 C4 S
x i
S
x j
•
x
: Broadcasting successive segments from S i + 1 to S
x j− 1
.
: no data to broadcast Fig. 7. An example of channel adjustment. H C
0
0
S 11 0
C
1
S
C
2
S
C
3
C
4
5
10 5 8 5 10 5 10 0
S
10 5 2
S S S
1 0 5 9 1 10 5 10 0
S 19 02 5
S
1 0 5 9 8
Fig. 8. An example of free block assignment.
S
10 5 15 1 0 5 9 9
228
Chin-Ying Kuo et al.
4 Conclusion We introduce a concept of receiver-driven bandwidth control scheme called channel adjustment on dynamic periodic broadcast scheduling for real-time video service. The primary technology used in our scheme is a dynamic periodic broadcast scheduling. In our scheme, the service scalability is significantly extended via periodic broadcast. Furthermore, the novel channel adjustment proposed in this study can extend our system to heterogeneous clients. The same as other periodic broadcast schemes, we partition each popular video into numbers of segments and then broadcast these segments on distinct channels with different frequencies. The originality of our scheme is dynamically adjusting the broadcast schedule to reduce the requirement of client buffer. The buffer space that each client requires is less than 15 percent of the entire video. In addition, our scheme also provides a flexible platform for developing the feature named channel adjustment. With channel adjustment, each client can request a video with different number of channels depending on its available bandwidth. Allocating more channels implies less initial delay and less buffer requirement. We do not actually modify the playback quality but still can provide different services for heterogeneous clients.
References 1. S. Viswanathan and T. Imielinski, "Metropolitan area video-on-demand service using pyramid broadcasting," Multimedia Systems, vol. 4(4), pp. 197-208, August 1996. 2. C. C. Aggarwal, J. L. Wolf, and P. S. Yu, “A permutation-based pyramid broadcasting scheme for video-on-demand systems,” in Proc. IEEE Int.Conf. Multimedia Computing and Systems, pp. 118–126, June 1996. 3. L.-S. Juhn and L.-M. Tseng, “Harmonic broadcasting for video-on-demand service,” IEEE Transactions on Broadcasting, vol. 43, pp. 268–271, Sept. 1997. 4. L.-S. Juhn and L.-M. Tseng, “Enhanced harmonic data broadcasting and receiving scheme for popular video service,” IEEE Trans. Consumer Electronics, vol. 44, no. 4, pp.343-346, May 1998. 5. L.-S. Juhn and L.-M. Tseng, “Staircase data broadcasting and receiving scheme for hot video service,” IEEE Trans. Consumer Electronics, vol. 43, no. 4, pp.1110-1117, Nov. 1997 6. K. A. Hua and S. Sheu, “Skyscraper broadcasting: A new broadcasting scheme for metropolitan video-on-demand,” ACM SIGCOMM, Sept. 1997 7. L.-S. Juhn and L.-M. Tseng, “Fast data broadcasting and receiving scheme for popular video service,” IEEE Trans. Broadcasting, vol. 44, no. 1, pp. 100-105, Mar 1998. 8. L.-S. Juhn and L.-M. Tseng, “Adaptive fast data broadcasting scheme for video-on-demand service,” IEEE Trans. Broadcasting, vol. 44, no. 2, pp. 182-185, June 1998. 9. S. McCanne, V. Jacobson, and M. Vetterli, ”Receiver-driven Layered Multicast,” Proceeding of ACM SIGCOMM ’96, Aug. 1996 10. M. Kawada, H. Morikawa, T. Aoyama, “Cooperative inter-stream rate control scheme for layered multicast,” Applications and the Internet, Proceedings. Symposium on, 2001, pp. 147 -154
Video Object Hyper-Links for Streaming Applications Daniel Gatica-Perez1 , Zhi Zhou1 , Ming-Ting Sun1 , and Vincent Hsu2 1
Department of Electrical Engineering, University of Washington Seattle, WA 98195 USA 2 CCL/ITRI Taiwan
Abstract. In video streaming applications, people usually rely on the traditional VCR functionalities to reach segments of interest. However, in many situations, the focus of the people are particular objects. Video object (VO) hyper-linking, i.e., the creation of non-sequential links between video segments where an object of interest appears, constitutes a highly desirable browsing feature that extends the traditional video structure representation. In this paper we present an approach for VO hyper-linking generation based on video structuring, definition of objects of interest, and automatic object localization in the video structure. We also discussed its use in a video streaming platform to provide objectbased VCR functionalities.
1
Introduction
Due to the vast amount of video contents, effective video browsing and retrieval tools are critical for the success of multimedia applications. In current video streaming applications, people usually rely on VCR functionalities (fast-forward, fast-backward, and random-access) to access segments of video of interest. However, in many situations, the ultimate level of desired access is the object. For browsing, people may like to jump to the next “object of interest” or fastforward but only display those scenes involving the “object of interest”. For retrieval, users may like to find an object in a sequence, or to find a video sequence containing certain video objects. The development of such non-sequential, content-based access tools has a direct impact on digital libraries, amateur and professional content-generation, and media delivery applications [8]. VO hyper-linking constitutes a desirable feature that extends the traditional video structure representation, and some schemes for their generation have been recently proposed [5], [2], [13]. Such approaches follow a segmentation and region matching paradigm, based on (1) the extration of salient regions (in terms of color, motion or depth) from each scene depicted in a video shot, (2) the representation of such regions by a set of features, and (3) the search for correspondences among region features in all the shots that compose a video clip. In particular, the work in [2] generates hyper-links for moving objects, and the work in [13] does so for depth-layered regions in stereoscopic video. In [9], face S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 229–238, 2002. c Springer-Verlag Berlin Heidelberg 2002
230
Daniel Gatica-Perez et al.
Fig. 1. Video Tree Structure. The root, intermediate, and column leaf nodes of the tree represent the video clip, the clusters, and the shots, respectively. Each image on a column leaf corresponds to frames extracted from each subshot.
detection algorithms [15] were used to generate video hyper-links of faces. However, in spite of the current progress [12], automatic segmentation of arbitrary objects continues to be an open problem. In this paper, we present an approach for VO hyper-linking generation, and discuss its application for video streaming with object-based VCR functionalities. After video structure creation, hyper-links are generated by object definition, and automatic object localization in the video structure. The object localization algorithm first extracts parametric and non-parametric color models of the object, and then searches in a configuration space for the instance that is the most similar to the object model, allowing for detection of non-rigid objects in presence of partial occlusion, and camera motion. As part of a video streaming platform, users can define objects, and then fast-forward, fast-reverse, or random-access based on the object defined. The paper is organized as follows. Section 2 discusses the VO hyper-linking generation approach. Results are described in Section 3. Section 4 describes a streaming video platform with support for object-based VCR functionalities. Section 5 provides some concluding remarks.
2 2.1
VO Hyper-link Generation Video Structure Generation
A summarized video structure or Table of Contents (TOC) (Fig. 1), consisting of representative frames extracted from video, cluster, shot, and subshot levels, is generated with the algorithms described in [6]. The TOC reduces the number of frames where the object of interest will be searched to a manageable number. Users can specify objects of interest to generate hyper-links, by drawing a bounding box on any representative frame.
Video Object Hyper-Links for Streaming Applications
2.2
231
Object Localization as Deterministic Search
Object localization constitutes a fundamental problem in computer vision [15], [10], [18], [16], [3]. In pattern theory terms [7], [16], given a template (the image ¯ ¯ ⊂ R2 , any other image I(x) that contains the of an object) I(x) with support D 2 object (with support D ⊂ R ) can be considered as generated from the template I¯ by a transformation TX of the template into the image, ¯ ¯ I(x) = I(TX (x)), x ∈ D,
(1)
where TX is parameterized by X over a configuration space X . In practice, Eq. 1 becomes only an approximation, due to modeling errors, noise, etc. In a deterministic formulation, localizing the template in a scene consists of finding ˆ ∈ X that minimizes a similarity measure d(·), the configuration X ˆ = arg min dX = arg min d(I(TX (x), I(x)). ¯ X X∈X
X∈X
(2)
We represent the outlines of objects by bounding boxes, and restrict the configuration space X to a quantized subspace of the planar affine transformation space, with three degrees of freedom that model translation and scaling. While far from representing complex object shapes and motions, the simplified X is useful to locate targets. The interior of an object could be approximately transformed by pixel interpolation using the scale parameter. Alternatively, one can define a similarity measure that depends not directly on the images, but on image representations that are both translation and scale invariant, so ˆ = arg min d(f (I(TX (x)), f (I(x))). ¯ X X∈X
(3)
With this formulation, the issues to define are f , d, the search strategy, and a mechanism to declare when the objects is not present in the scene. 2.3
Reducing the Search Space with Color Likelihood Ratios
Pixel-wise classification based on parametric models of object/background color distributions has been used for image segmentation [1] and tracking [14]. We use such representation to guide the search process. In the representative frames from which the object is to be searched, let y represent an observed color feature vector for a given pixel x. Given a single foreground object, the distribution of y for such frame is a mixture p(y|Θ) =
p(Oi )p(y|Oi , θi ),
(4)
i∈{F,B}
where F and B stand for foreground and background, p(Oi ) is the prior probability of pixel x belonging to object Oi ( i p(Oi ) = 1), and p(y|Oi , θi ) is the
232
Daniel Gatica-Perez et al.
a
b
c
Fig. 2. Extraction of candidate configurations. Dancing Girls sequence. (a) Frames extracted from the video clips (the object has been defined by a bounding box). (b) Log-likelihood ratio image for learned foreground and background color models. Lighter gray tones indicate higher probability of a pixel to belong to the object. (c) Binarized image after decision. White regions will be used to generate candidate configurations.
conditional pdf of observations given object Oi , parameterized by θi (Θ = {θi }). Each conditional pdf is in turn modeled with a Gaussian mixture [11], p(y|Oi , θi ) =
M
p(wj )p(y|wj , θij ),
(5)
j=1
where p(wj ) denotes the prior probability of the j-th component, and the conditional p(y|wj , θij ) = N (µij , Σij ) is a multivariate Gaussian with full covariance matrix. In absence of prior knowledge p(OF ) = p(OB ), and Bayesian decision theory establishes that each pixel can be optimally associated (in the MAP sense) to foreground or background by evaluating the likelihood ratio p(y|OF , θF ) H>F 1 p(y|OB , θB ) H
(6)
The likelihood functions are on-line estimated using the Expectation- Maximization (EM) algorithm, the standard procedure for Maximum Likelihood parameter estimation [11]. Additionally, model selection is automatically estimated using the Minimum Description Length (MDL) principle. RGB models are estimated when a new object is defined, and then applied to the set of representative frames in the video summary. An example is shown in Fig. 2. Only those pixels whose colors match the object color distribution are chosen as candidate search configurations. Finally, as the background color distribution is likely to change from shot to shot (possibly rendering low values
Video Object Hyper-Links for Streaming Applications
233
for p(y|OB , θB )) probabilities are thresholded to ensure that candidate configurations truly correspond to object colors. 2.4
Localization Using Bhattacharyya Coefficient
We use the color pdf of the interior of the configuration X ∈ X as the function ¯ and f (IX ) denote the color pdfs of the object and the f (·) in Eq. 3. Let f (I) configuration X, respectively. As discussed in [4], measuring similarity among two distributions can be defined as maximizing the Bayes error associated with them. The Bhattacharyya coefficient is a measure related to the Bayes error defined by ¯ f (IX )) = (f (I(x))f ¯ ρX = ρ(f (I), (IX (x)))1/2 dx (7) and can be used to define a metric ¯ fˆ(IX )))1/2 dX = (1 − ρ(fˆ(I),
(8)
when the pdfs f (·) are represented by discrete densities fˆ(·). The discrete pdfs for model and candidate configuration are directly estimated by normalizing color histograms (3-D RGB, 8 × 8 × 8 bins). Except for quantization effects, this color discrete density estimate is translation and scale invariant, unlike other representations, like color coocurrence histograms [3], which are translation invariant but not scale-invariant. In the search, the translation component is quantized by a factor of 4 in each direction, and the scaling component is quantized to 5 different scales ranging between 0.5 and 2. If a whole QSIF image was to be searched, the number possible configurations would be 6600. We only search those positions with high likelihood as indicated in white regions in Fig. 2(c). Finally, the decision on the presence of the object is based on thresholding of the Bhattacharyya coefficient. 2.5
Video Hyper-link Generation
Hyper-links are constructed based on object detection/absence for each shot. If links are desired to the subshot level, the described object localization has to be applied on each of the leave frames in the TOC. Video browsing will occur by displaying the subshots for which the object was localized. Alternatively, hyperlinks could be required only at higher levels of the hierarchy (shot, cluster). In that case, the object localization algorithm processes subshot frame leaves until it detects an object, and then jumps to the next shot or cluster, thus requiring less processing in average.
3
Results
Fig. 3 illustrates the results obtained in the Girls video, captured with a moving hand-held camera. One can observe that the algorithm has been able to detect
234
Daniel Gatica-Perez et al.
a
b
c Fig. 3. Object localization. Girls video sequence. (a), (b) and (c) illustrate the object localization process for three different user-defined video objects.
the user-specified objects correctly, in presence of partial occlusion and change of size. Another detection example is shown in Fig. 4. We observe that detection of the object of interest has been correct, but other regions whose features can
Video Object Hyper-Links for Streaming Applications
235
Fig. 4. Object Localization. Wedding video sequence.
Fig. 5. VO hyper-link generation. The frames where the object has been detected are highlighted.
not be discriminated as different are incorrectly labeled as object. Several issues are currently under study for object localization improvement, including the use of illumination-invariant object color models, the use of additional features, and the definition of a decision mechanism based on probability models of positive and negative examples. Hyper-links are created, and the leaves in the TOC that contain the object are highlighted in the GUI, as shown in Fig. 5, allowing for fast browsing in the video structure besides the capability for video playing. The computational complexity is dependent on object size. In the current implementation without any optimization, it takes five seconds to search among 3000 configurations per
236
Daniel Gatica-Perez et al.
Fig. 6. Block diagram of streaming video system.
QSIF image, on a Pentium III, 600 MHz PC. By off-line generation of the main objects in a video clip, the system can provide real time object-based browsing capabilities.
4
A Streaming Video System Supporting Table of Contents and Object-Based VCR Functionalities
A block diagram of a streaming video system is shown in Fig. 6. The system has a typical Server/Client structure. The video sequences are encoded in MPEG-4 and stored in the server with the associated metadata files. The system supports the conventional VCR functionalities such as Play, Pause, Random Access, Step Forward, Fast Forward, and Fast Reverse, plus the video TOC. The VCR functionalities are implemented as discussed in [10]. For simple implementation, we use I-pictures for random access, fast-forward, and fast-reverse. We are incorporating the object-based VCR functionalities into the system. In actual applications, the client connects to the remote server over an IP network and selects the video stream of interest. Two types of logical channels are established between the server and the client: the control channel and the data channel. The TOC and the VCR commands are transmitted in the control channel, while the video packets are transmitted in the data channel. The server sends the TOC of the requested video sequence to the client. The TOC, containing clusters of the key frames of the sequence, is displayed as shown in Fig. 7. The client can choose to play from the beginning of the video sequence, or click on a frame in the TOC to start playing from that particular segment. The VCR and Hyperlinking Manager receives the commands and retrieves the corresponding part of the video sequence, which is then sent by the Stream Manager to the client for decoding and displaying. The key frames in the TOC are mapped to the closest I-pictures to allow easy decoding. During the play of the video, the user can use the conventional VCR functionalities (e.g. fast-forward, fast-reverse) to manipulate the play of the video. The user can also stop the video and jump to another key frame of interest in the TOC. With the incorporation of the object-based VCR functionalities, the user will be able to stop the video at any frame, define an object of interest in the frame, and use the object-based VCR functionalities through the support of the automatically generated VO hyper-links.
Fig. 7. Streaming video with VCR functionalities and Table Of Contents
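To make the key-frame-to-I-picture mapping concrete, the following Python sketch maps a requested TOC key frame (given as a frame index) to the nearest random-access point. It is only an illustration of the idea described above; the function name and the assumption that the server keeps a sorted list of I-picture indices are ours, not details of the described system.

import bisect

def nearest_i_picture(key_frame_index, i_picture_indices):
    """Map a TOC key frame to the closest I-picture so that the client can
    start decoding without needing earlier reference pictures.

    i_picture_indices: sorted list of frame indices encoded as I-pictures
    (assumed here to be recorded by the server at encoding time).
    """
    pos = bisect.bisect_left(i_picture_indices, key_frame_index)
    candidates = []
    if pos > 0:
        candidates.append(i_picture_indices[pos - 1])
    if pos < len(i_picture_indices):
        candidates.append(i_picture_indices[pos])
    # Choose the random-access point with the smallest temporal distance.
    return min(candidates, key=lambda i: abs(i - key_frame_index))

# Example: with an I-picture every 15 frames, key frame 143 maps to frame 150.
print(nearest_i_picture(143, list(range(0, 300, 15))))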
5
Conclusions
We have presented a methodology to create video object hyper-links for object-based video streaming applications. Although the obtained results are encouraging, we acknowledge that object localization is a hard problem, and current efforts are directed at improving discrimination. We have implemented a streaming video system with Table of Contents and VCR functionality support, and are incorporating the object-based VCR functionality features into the system.
Acknowledgements
The video sequences used in this study belong to the Eastman-Kodak Home Video Database.
References
1. S. Belongie, C. Carson, H. Greenspan, and J. Malik, “Color and Texture Image Segmentation Using the Expectation-Maximization Algorithm and Its Application to Content-Based Image Retrieval,” in Proc. IEEE Int. Conf. Comp. Vis., Bombay, Jan. 1998.
2. P. Bouthemy, Y. Dufournaud, R. Fablet, R. Mohr, S. Peleg, and A. Zomet, “Video Hyper-links Creation for Content-Based Browsing and Navigation,” in Proc. Workshop on Content-Based Multimedia Indexing, Toulouse, France, October 1999.
3. P. Chang and J. Krumm, “Object Recognition with Color Coocurrence Histograms,” in Proc. IEEE Int. Conf. on CVPR, Fort Collins, CO, June 1998.
4. D. Comaniciu, V. Ramesh, and P. Meer, “Real-Time Tracking of Non-Rigid Objects using Mean Shift,” in Proc. IEEE Conf. on Comp. Vis. and Patt. Rec., Hilton Head Island, S.C., June 2000.
5. Y. Deng and B. S. Manjunath, “Netra-V: Toward an Object-Based Video Representation,” IEEE Trans. on CSVT, Vol. 8, No. 5, pp. 616-627, Sep. 1998.
6. D. Gatica-Perez, M.-T. Sun, and A. Loui, “Consumer Video Structuring by Probabilistic Merging of Video Segments,” in Proc. IEEE Int. Conf. on Multimedia and Expo, Tokyo, Aug. 2001.
7. U. Grenander, Lectures in Pattern Theory, Springer, 1976-1981.
8. C.W. Lin, J. Zhou, J. Youn, and M.T. Sun, “MPEG Video Streaming with VCR Functionality,” IEEE Trans. on CSVT, Vol. 11, No. 3, pp. 415-425, Mar. 2001.
9. W.-Y. Ma and H.J. Zhang, “An Indexing and Browsing System for Home Video,” in Proc. EUSIPCO, European Conference on Signal Processing, Patras, Greece, 2000, pp. 131-134.
10. J. MacCormick and A. Blake, “A probabilistic contour discriminant for object localisation,” in Proc. IEEE Int. Conf. Computer Vision, pp. 390-395, 1998.
11. G.J. MacLachlan and D. Peel, Finite Mixture Models, John Wiley and Sons, N.Y., 2000.
12. M. Meila and J. Shi, “A random walks view of spectral segmentation,” in Proc. Eighth Int. Workshop on AI and Stats, Jan. 2001.
13. K. Ntalianis, A. Doulamis, N. Doulamis, and S. Kollias, “Non-Sequential Video Structuring Based on Video Object Linking: An Efficient Tool for Video Browsing and Indexing,” in Proc. IEEE Int. Conf. Image Processing, Thessaloniki, Greece, October 2001.
14. Y. Raja, S. McKenna, and S. Gong, “Colour Model Selection and Adaptation in Dynamic Scenes,” in Proc. ECCV, 1998.
15. H. Rowley, S. Baluja, and T. Kanade, “Human Face Detection in Visual Scenes,” Tech. report CMU-CS-95-158R, Computer Science Department, Carnegie Mellon University, November 1995.
16. J. Sullivan, A. Blake, M. Isard, and J. MacCormick, “Object Localization by Bayesian Correlation,” in Proc. IEEE Int. Conf. Computer Vision, pp. 1068-1075, 1999.
17. H.J. Zhang, “Content-based Video Browsing and Retrieval,” in B. Fuhrt, Ed., Handbook of Multimedia Computing, CRC Press, Boca Raton, 1999, pp. 255-280.
18. Y. Zhong and A. K. Jain, “Object Localization Using Color, Texture and Shape,” in Proc. Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Venice, pp. 279-294, May 1997.
Scalable Hierarchical Summarization of News Using Fidelity in MPEG-7 Description Scheme
Jung-Rim Kim, Seong Soo Chun, Seok-jin Oh, and Sanghoon Sull
School of Electrical Engineering, Korea University, 1 Anam-dong 5ga, Songbuk-gu, Seoul, Korea
{jrkim,sschun,osj,sull}@mpeg.korea.ac.kr
Abstract. The notion of fidelity is an attribute in MPEG-7 FDIS (Final Draft International Standard [1]) that can be used for scalable hierarchical summarization and search [2]. The fidelity is the information on how well a parent key frame represents its child key frames in a tree-structured key frame hierarchy [1-5]. The use of fidelity was demonstrated for scalable hierarchical summarization [2] based on low-level features such as color, but temporal information was not used. The content of videos such as news and golf is temporally well structured, and it is desirable to utilize such information. In this paper, we demonstrate the use of fidelity for the summarization of well-structured news by using temporal information as well as low-level features.
1
Introduction
Network speeds and bandwidths continue to grow, so there are many more opportunities to access multimedia data. Although improvements are being made on the Internet, the size of multimedia data is often too large to deliver in full. Because of this problem, the study of multimedia compression, transfer, and indexing has become important. Also, since the amount of multimedia data increases quickly, it is necessary to be able to search and access/navigate it easily. Content-based retrieval has been extensively researched through various indexing schemes, but research on multimedia access is still insufficient and under development. One of the useful methods for access and navigation of multimedia content is summarization. Summarization helps us understand the whole content of a video by showing a set of key frames/clips representing the whole video. This functionality is especially useful since the size of a video is very large in general and thus users might not want to spend much time watching the whole video. Furthermore, it might be difficult to deliver the whole video with limited bandwidth. Therefore, there is a need for scalable summarization schemes, and the MPEG-7 MDS (Multimedia Description Scheme) has been developed to provide such schemes.
Among a variety of video contents, news content is typically structured by time or by topics and thus can be hierarchically well described by the scalable summarization scheme for easy browsing and summarization. The scalable description allows users to select the parts of the news video depending upon their preference or the available bandwidth. In this paper, we describe the notion of fidelity in the MPEG-7 Summarization Description Scheme (DS) and propose an efficient method for scalable hierarchical news video summarization based on the MPEG-7 DS. This paper is organized as follows. Section 2 introduces related work and the notion of fidelity. Section 3 describes the proposed notion of fidelity and the algorithm for news summarization, and Section 4 demonstrates the experimental results for scalable summarization of news. Finally, Section 5 provides the conclusions of the paper.
2
Related Work and Fidelity
2.1
Related Work
Recently, there have been several approaches to video summarization [6-9]. D. DeMenthon et al. [6] proposed scalable summarization using curve simplification. They developed a method for summarizing video by splitting a trajectory curve in the high-dimensional feature space for the key frames. Y. Gong et al. [7] proposed an optimal video summarization algorithm using the singular value decomposition of the feature vectors. S. Uchihashi et al. [8] introduced a video summarizing scheme using shot importance. As the shot length becomes longer, the importance of the shot is assumed to be larger, and as it becomes shorter, its importance diminishes. Mark T. Maybury et al. [9] showed summarization of broadcast news using audio, visual information and closed-captioned text. They summarized news video by key frame selection through the audio and video correlation, and annotated the summarized frames using closed-captioned text. MPEG-7 also provides video summarization schemes in the MDS. The multimedia content description in MPEG-7 is divided into two parts. One part is the description of the structural aspects of the content, which describes the audio-visual content from the viewpoint of its structure. It represents the spatial, temporal or spatiotemporal structure of the audio-visual content and can be described on the basis of perceptual features using MPEG-7 Descriptors for color, texture, shape, motion and audio features, and semantic information using Textual Annotations. The other part is the description of the conceptual aspects of the content, which describes the audio-visual content from the viewpoint of real-world semantics and conceptual notions. It involves entities such as objects, events, abstract concepts and relationships. Based on such descriptions, MPEG-7 gives description schemes for navigation and access of multimedia content that facilitate browsing and retrieval of audio-visual content by defining summaries, partitions and decompositions, and variations of audio-visual material. A brief explanation of fidelity in the MPEG-7 description schemes is given in the following section, to be used for scalable hierarchical summarization.
2.2
Fidelity
The fidelity is the information on how well a parent key frame represents its child key frames in a tree-structured key frame hierarchy [1-5]. We can construct the tree-structured hierarchy of key frames shown in Fig. 1 based on the relationship between a parent key frame and its children using fidelity. The definition of the fidelity e_α of a node α having the parent node p_α is proposed in [2] as

e_α = 1 − max_{x ∈ T_α} d(p_α, x),    (1)

where d(·) denotes a normalized distance/dissimilarity from 0 to 1 and T_α is the hierarchy rooted at the node α.
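To make equation (1) concrete, here is a minimal Python sketch of the fidelity computation; the tree representation (a dict of child lists) and the distance function are illustrative assumptions, not the data structures used by the authors.

def subtree_nodes(tree, node):
    """Collect all nodes of the hierarchy T_alpha rooted at `node`."""
    nodes = [node]
    for child in tree.get(node, []):
        nodes.extend(subtree_nodes(tree, child))
    return nodes

def fidelity(tree, parent, node, distance):
    """e_alpha = 1 - max over x in T_alpha of d(p_alpha, x), cf. equation (1).

    `distance` is any dissimilarity between two key frames, normalized to [0, 1].
    """
    return 1.0 - max(distance(parent, x) for x in subtree_nodes(tree, node))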
Fig. 1. An Example of the Key Frame Hierarchy with Fidelity (root A at level 0; nodes B, C, D with fidelities e_B, e_C, e_D at level 1; nodes E-L with fidelities e_E-e_L at level 2)
3
Scalable Hierarchical Summarization of News Using Fidelity
In this section, we describe the use of fidelity, based on both the low-level features and the temporal information of news, to construct the key frame hierarchy.
3.1
Scalable Hierarchical Algorithm Using Fidelity
The scalable summarization algorithm using fidelity is a max-cut finding algorithm proposed in [2], [5]. This algorithm maximizes the minimum edge cost cut by the cut-line in the hierarchy, so that the fidelity after splitting the hierarchy becomes maximal. It can be summarized as follows: the root node is inserted into the summarization set K first, and then a node β not in K with the minimum fidelity on an edge between itself and a node α in K is inserted into K. The inner loop of the algorithm is repeated until the number of elements in K becomes equal to the number specified by the user.
add root_node to K
while card(K) < n {
    let <α, β> be a least cost edge such that α ∈ K and β ∉ K
    add β to K
}
Fig. 2. Scalable Hierarchical Summarization Algorithm
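The loop in Fig. 2 can be sketched in a few lines of Python; the hierarchy is assumed to be given as parent-to-children links with a precomputed fidelity per node (its edge cost to the parent), which is an assumption made here for illustration.

def summarize(children, fidelities, root, n):
    """Greedy max-cut selection: repeatedly add the node outside K that is
    reached from K by the least-cost (lowest-fidelity) edge, cf. Fig. 2.

    children:   dict mapping a node to its list of child nodes
    fidelities: dict mapping a node to its fidelity (edge cost to its parent)
    """
    K = [root]
    while len(K) < n:
        # Candidate edges go from a node in K to one of its children not yet in K.
        frontier = [c for node in K for c in children.get(node, []) if c not in K]
        beta = min(frontier, key=lambda c: fidelities[c])
        K.append(beta)
    return K

# Example matching the text: with e_B < e_C < e_D, a 2-frame summary is [A, B].
children = {"A": ["B", "C", "D"]}
fidelities = {"B": 0.2, "C": 0.5, "D": 0.7}
print(summarize(children, fidelities, "A", 2))  # ['A', 'B']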
For example, suppose e_B < e_C < e_D in Fig. 1. By the max-cut algorithm in Fig. 2, we can choose two nodes that represent the whole hierarchy with maximal fidelity. At first, the root node, A, is selected, and then we have the chance to choose one of the three nodes B, C, and D. By the algorithm, we should select node B because of the above condition e_B < e_C < e_D. Then, we obtain two hierarchies rooted at A and B, maximizing the fidelity of the hierarchies.
3.2
Hierarchy for News Summarization
Sull et al. proposed a key frame hierarchy with fidelity using only low-level features of the key frames based on equation (1) [2]. However, since such low-level features cannot represent the conceptual aspect of the contents well, the result sometimes does not represent a semantically meaningful summarization. Structured news content often carries a degree of importance related to time: the news content is structured as two major parts, composed of an anchor shot, where one or two anchors report events, and the event shots between two successive anchor shots. Furthermore, the headline news shown at the beginning is important relative to the upcoming news stories shown thereafter. In general, the importance of news decreases as it approaches the end of the news. Taking both anchor shots and their time into account, we can show the contents or information of the news more effectively. Instead of equation (1), we propose a fidelity e′_α at a node α as follows:

e′_α = w e_α + (1 − w) e_α(t),    (2)

where w is the weight for the fidelity based on the low-level feature, in the range from 0 to 1, and e_α(t) is the temporal fidelity at a node α given by the time position/code or by the temporal distribution between a parent frame and its descendants as

e_α(t) = α(t) / τ,    (3)

where α(t) is the time code of α and τ is the total time length of the video content. The e_α(t) increases as the temporal position of α increases, allowing us to obtain temporally earlier nodes first when applying the summarization algorithm shown in Fig. 2. Also, in [2] the temporal order was not considered for clustering, and thus the temporal relationship between a parent key frame and its child key frames was ignored.
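A short Python sketch of equations (2) and (3) follows; the argument names and the convention of expressing the time code and total length in frames are assumptions made for illustration.

def temporal_fidelity(time_code, total_length):
    """e_alpha(t) = alpha(t) / tau, cf. equation (3)."""
    return time_code / float(total_length)

def combined_fidelity(low_level_fidelity, time_code, total_length, w=0.2):
    """e'_alpha = w * e_alpha + (1 - w) * e_alpha(t), cf. equation (2).

    w = 0.2 mirrors the experimentally chosen weight mentioned in Section 4.1;
    w = 0 reduces to the purely temporal fidelity used for the anchor shots.
    """
    return w * low_level_fidelity + (1.0 - w) * temporal_fidelity(time_code, total_length)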
For example, the key frames in the bottom level consist of three shots, {E, F}, {G, H, I, J}, and {K, L}, and their parent key frames are B, C, and D, respectively, as shown in Fig. 1. The key frame B represents I, though there is little temporal relationship between them. If this scheme is applied to news content, the anchor shots, whose low-level features are almost the same, will be classified into the same cluster. For an effective summarization, it is desirable to initially extract the anchor shots and the event shots between two successive anchor shots, and then construct the key frame hierarchy. The key frame hierarchy is typically constructed by a 4-level bottom-up method. The bottom level, level 3, consists of key frames, and level 2 consists of the anchor frames and the key event frames representing the event frames between two successive anchor frames. The key event frames are positioned at level 1 using only the temporal fidelity, and the key frame that represents the whole video becomes the root. The overall algorithm is shown in Fig. 3, and Fig. 4 is an example hierarchy based on this algorithm.
Detect shots.
Extract key frames in each shot using the low-level feature vector (level 3).
Separate the anchor shots and the event shots.
Cluster the successive anchor and event shots, respectively, using the algorithm proposed in [2] (level 2).
Extract all event frames from level 2 (level 1).
Set the key frame (for example, the title frame) to the root key frame of the hierarchy (level 0).
Fig. 3. An algorithm for News Summarization
Fig. 4. An Example of the Key Frame Hierarchy of News (An: Anchor shot, En: Event shot)
4
Experimental Results
In this section, we describe the experimental results for the key frame hierarchy and the summarization of news using our proposed algorithm.
4.1
Key Frame Hierarchy
In our current implementation, we use the DC luminance projection introduced in [10] as the low-level feature vector for key frame extraction, anchor shot detection, and clustering. The luminance projections (l_n^r, l_m^c) of the nth row and the mth column of an M×N DC image f are, respectively,

l_n^r(f) = Σ_{m=1}^{M} Lum{f(m, n)},    l_m^c(f) = Σ_{n=1}^{N} Lum{f(m, n)}.    (4)

The distance/dissimilarity function, normalized to [0, 1], is also defined as

d(f_i, f_j) = (1/K) ( Σ_{n=1}^{N} |l_n^r(f_i) − l_n^r(f_j)| + Σ_{m=1}^{M} |l_m^c(f_i) − l_m^c(f_j)| ),    (5)

where K is a normalizing constant. Using the above feature vector and distance/dissimilarity function, we detect shot boundaries and extract the key frame set R satisfying the following condition:

R = { f_i ∈ S | d(f_i, f_{i−1}) ≤ ε_k, i = 1, 2, 3, ... },    (6)

where S is the whole video frame set and ε_k is the distortion threshold used to extract shot boundaries. We also apply equation (7) to detect a set A of anchor frames f_k in R:

A = { f_k ∈ R | d(f_a, f_k) ≤ ε_a },    (7)

where f_a is the reference anchor frame that a user selects and ε_a is the distortion threshold used to detect anchor frames. Since the fidelity based on the low-level feature for each anchor shot is almost 1, it is meaningless. So we applied only the temporal fidelity, i.e. w = 0, to the fidelity of the anchor shots, and we experimentally set w to 0.2 or a smaller value in equation (2) to construct a hierarchy for the event frames, such that the low-level feature does not affect the temporal order too much. The experimental results applied to two videos are shown in Table 1.
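The following sketch mirrors equations (4), (5) and (7) on DC images stored as NumPy arrays; the array indexing convention, the normalizing constant and the threshold value are assumptions for illustration only.

import numpy as np

def luminance_projections(dc_image):
    """Row and column luminance projections of a DC image, cf. equation (4).

    dc_image is assumed to be an array of luminance DC values indexed [n, m]
    (row n, column m), so rows sum over the M columns and columns over the N rows.
    """
    rows = dc_image.sum(axis=1)   # l^r_n(f)
    cols = dc_image.sum(axis=0)   # l^c_m(f)
    return rows, cols

def projection_distance(f_i, f_j, K):
    """Normalized distance/dissimilarity between two DC images, cf. equation (5)."""
    r_i, c_i = luminance_projections(f_i)
    r_j, c_j = luminance_projections(f_j)
    return (np.abs(r_i - r_j).sum() + np.abs(c_i - c_j).sum()) / K

def detect_anchor_frames(key_frames, reference_anchor, eps_a, K):
    """Key frames close to a user-selected reference anchor frame, cf. equation (7)."""
    return [f for f in key_frames
            if projection_distance(reference_anchor, f, K) <= eps_a]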
4.2
Summarization
Using the algorithm shown in Fig. 2 [2], we summarize two news videos by using 9 and 18 frames. Figures 5 (a) and (b) are the experimental results from the key frame hierarchy using only the low-level feature, and Fig. 5 (c) and (d) show the results using the low-level feature as well as temporal information. As shown in Fig. 5 (a) and (b), the 9-frame summarization and the 18-frame summarization do not show a good semantic relationship between them, but in Fig. 5 (c) and (d), the 9-frame summarization gives a storyboard consisting purely of events, and in the 18-frame summarization result the anchor frames appear between almost all event frames.
Table 1. Video Contents and Their Frames in each Level

Video   Length    Level 0   Level 1   Level 2   Level 3
News1   27m 38s   1         10        19        210
News2   27m 36s   1         9         18        138
News3   27m 39s   1         10        19        154
Fig. 5. Summarization results from the key frame hierarchy using fidelity. (a) 9-frame summarization of News1 based on low-level feature; (b) 18-frame summarization of News1 based on low-level feature; (c) 9-frame summarization of News1 based on low-level feature and temporal information; (d) 18-frame summarization of News1 based on low-level feature and temporal information
5
Conclusion
In this paper, we described the use of fidelity in the MPEG-7 MDS for scalable hierarchical summarization of news. Based on a fidelity that combines both low-level features and temporal information, we constructed a semantically meaningful key frame hierarchy consisting of anchor and event frames, demonstrating the feasibility of our approach.
References
1. ISO/IEC 15938-5 FDIS Information Technology -- Multimedia Content Description Interface - Part 5 Multimedia Description Schemes. ISO/IEC JTC1/SC29/WG11 N4206 (2001)
2. Sull, S., Kim, J.-R., Kim, Y., Chang, H.S., Lee, S.U.: Scalable hierarchical video summary and search. Proceedings of SPIE 2001, Vol. 4315, Storage and Retrieval for Media Database 2001, San Jose (2001) 553-561
3. Overview of the MPEG-7 standard. ISO/IEC JTC1/SC29/WG11 N4031, Singapore (2001)
4. Efficient and effective search and browsing using fidelity. ISO/IEC JTC1/SC29/WG11 M5101, La Baule (1999)
5. Improved notion of the fidelity for efficient browsing. ISO/IEC JTC1/SC29/WG11 M5442, Maui (1999)
6. DeMenthon, D., Kobla, V., Doermann, D.: Video summarization by curve simplification. Proceedings of ACM International Conference on Multimedia (1998) 211-218
7. Gong, Y., Liu, X.: Generating optimal video summaries. Proceedings of IEEE International Conference on Multimedia and Expo 2000, Vol. 3 (2000) 1559-1562
8. Uchihashi, S., Foote, J.: Summarizing video using a shot importance measure and a frame-packing algorithm. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 6 (1999) 3041-3044
9. Maybury, M. T., Merlino, A. E.: Multimedia summaries of broadcast news. Proceedings of Intelligent Information Systems (1997) 442-449
10. Chang, H. S., Sull, S., Lee, S. U.: Efficient video indexing scheme for content-based retrieval. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8 (1999) 1269-1279
MPEG-7 Descriptors in Content-Based Image Retrieval with PicSOM System
Markus Koskela, Jorma Laaksonen, and Erkki Oja
Laboratory of Computer and Information Science, Helsinki University of Technology, P.O. BOX 5400, 02015 HUT, Finland
{markus.koskela,jorma.laaksonen,erkki.oja}@hut.fi
Abstract. The MPEG-7 standard is emerging as both a general framework for content description and a collection of specific, agreed-upon content descriptors. We have developed a neural, self-organizing technique for content-based image retrieval. In this paper, we apply the visual content descriptors provided by MPEG-7 in our PicSOM system and compare our own image indexing technique with a reference system based on vector quantization. The results of our experiments show that the MPEG-7-defined content descriptors can be used as such in the PicSOM system even though Euclidean distance calculation, inherently used in the PicSOM system, is not optimal for all of them. Also, the results indicate that the PicSOM technique is a bit slower than the reference system in starting to find relevant images. However, when the strong relevance feedback mechanism of PicSOM begins to function, its retrieval precision exceeds that of the reference system.
1
Introduction
Content-based image retrieval (CBIR) differs from many of its neighboring research disciplines in computer vision due to one notable fact: human subjectivity cannot totally be isolated from the use and evaluation of CBIR systems. This is manifested by difficulties in setting fair comparisons between CBIR systems and in interpreting their results. These problems have hindered the researchers from doing comprehensive evaluations of different CBIR techniques. We have developed a neural-network-based CBIR system named PicSOM [1,2]. The name stems from “picture” and the Self-Organizing Map (SOM). The SOM [3] is used for unsupervised, self-organizing, and topology-preserving mapping from the image descriptor space to a two-dimensional lattice, or grid, of artificial neural units. The PicSOM system is built upon two fundamental principles of CBIR, namely query by pictorial example and relevance feedback [4]. Until now, there have not existed widely-accepted standards for description of the visual contents of images. MPEG-7 [5] is the first thorough attempt in this direction. The appearance of the standard will affect the research on CBIR techniques in some important aspects. First, when some common building blocks become shared by different CBIR systems, comparative studies between them
will become easier to perform. As the MPEG-7 Experimentation Model (XM) [6] has become publicly available, we have been able to test the suitability of MPEG-7-defined image content descriptors with the PicSOM system. We have thus replaced our earlier, non-standard descriptors with those defined in the MPEG-7 standard and available in XM.
2
PicSOM System
The PicSOM image retrieval system [1,2] is a framework for research on algorithms and methods for content-based image retrieval. The methodological novelty of PicSOM is to use several Self-Organizing Maps [3] in parallel for retrieving relevant images from a database. These parallel SOMs have been trained with separate data sets obtained from the image data with different feature extraction techniques. The different SOMs and their underlying feature extraction schemes impose different similarity functions on the images. Every image query is unique and each user of a CBIR system has her own transient view of image similarity and relevance. Therefore, a system structure capable of holding many simultaneous similarity representations can adapt to different kinds of retrieval tasks. In the PicSOM approach, the system is able to discover those of the parallel Self-Organizing Maps that provide the most valuable information for each individual query instance. A more detailed description of the PicSOM system and results of earlier experiments performed with it can be found in [1,2]. The PicSOM home page including a working demonstration of the system for public access is located at http://www.cis.hut.fi/picsom.
2.1
Tree Structured Self-Organizing Maps
The main image indexing method used in the PicSOM system is the SelfOrganizing Map (SOM) [3]. The SOM defines an elastic, topology-preserving grid of points that is fitted to the input space. It can thus be used to visualize multidimensional data, usually on a two-dimensional grid. The map attempts to represent all the available observations with an optimal accuracy by using a restricted set of models. Instead of the standard SOM version, PicSOM uses a special form of the algorithm, the Tree Structured Self-Organizing Map (TS-SOM) [7]. The hierarchical TS-SOM structure is useful for large SOMs in the training phase. In the standard SOM, each model vector has to be compared with the input vector in finding the best-matching unit (BMU). This makes the time complexity of the search O(n), where n is the number of SOM units. With the TS-SOM one can, however, follow the hierarchical structure and reduce the complexity of the search to O(log n). This reduction can be achieved by first training a smaller SOM and then creating a larger one below it so that the search for the BMU on the larger map is always restricted to a fixed area below the already-found BMU and its nearest neighbors on the above map.
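The hierarchical BMU search described above could be sketched in Python roughly as follows; the layer data structures, the projection of a unit onto the next level and the exact shape of the restricted search window are our assumptions for illustration (the text only states that a 10×10 area below the BMU is searched).

import numpy as np

def bmu_among(x, codebook, unit_indices):
    """Index (from unit_indices) of the model vector closest to x in Euclidean distance."""
    dists = [np.linalg.norm(x - codebook[i]) for i in unit_indices]
    return unit_indices[int(np.argmin(dists))]

def ts_som_bmu(x, layers, window=10):
    """Hierarchical BMU search: full search on the top map, restricted search below.

    layers: list of (codebook, side) pairs ordered from the 4x4 map down to the
            256x256 map; codebook has shape (side*side, dim), units in row-major order.
    window: side length of the restricted search area on each lower map
            (10 corresponds to the 10x10 area mentioned in the text).
    """
    codebook, side = layers[0]
    bmu = bmu_among(x, codebook, list(range(side * side)))
    for codebook, side in layers[1:]:
        prev_side = side // 4                       # map sides grow 4 -> 16 -> 64 -> 256
        r, c = divmod(bmu, prev_side)
        r, c = 4 * r, 4 * c                         # project the BMU onto the larger grid
        half = window // 2
        rows = range(max(0, r - half), min(side, r + half))
        cols = range(max(0, c - half), min(side, c + half))
        candidates = [i * side + j for i in rows for j in cols]
        bmu = bmu_among(x, codebook, candidates)
    return bmu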
Fig. 1. The surface of a 16×16-sized TS-SOM level trained with the MPEG-7 Edge Histogram descriptor.
In the experiments described in this paper, we have used four-level TS-SOMs whose layer sizes have been 4×4, 16×16, 64×64, 256×256 units. In the training of the lower SOM levels, the search for the BMU has been restricted to the 10×10-sized neuron area below the BMU on the above level. Every image has been used 100 times for training each of the TS-SOM levels. After training each TS-SOM hierarchical level, that level is fixed and each neural unit on it is given a visual label from the database image nearest to it. This is illustrated in Figure 1, where MPEG-7 Edge Histogram descriptor has been used as the feature. The images are the visual labels on the surface of the 16×16-sized TS-SOM layer. It can be seen that, e.g., there are many ships in the top-left corner of the map surface, standing people and dolls beside the ships, and buildings in the bottom-left corner. Visually – and also semantically – similar images have thus been mapped near each other on the map.
2.2
Self-Organizing Relevance Feedback
The relevance feedback mechanism of PicSOM, implemented by using several parallel SOMs, is a crucial element of the retrieval engine. Only a short overview is presented here, see [2] for a more comprehensive treatment.
Fig. 2. An example of how a SOM surface, on which the images selected and rejected by the user are shown with white and black marks, respectively, is convolved with a low-pass filter.
Each image seen by the user of the system is graded by her as either relevant or irrelevant. All these images and their associated relevance grades are then projected on all the SOM surfaces. This process forms on the maps areas where there are 1) many relevant images mapped in same or nearby SOM units, or 2) relevant and irrelevant images mixed, or 3) only irrelevant images, or 4) no graded images at all. Of the above cases, 1) and 3) indicate that the corresponding content descriptor agrees well with the user’s conception on the relevance of the images. Whereas, case 2) is an indication that the content descriptor cannot distinguish between relevant and irrelevant images. When we assume that similar images are located near each other on the SOM surfaces, we are motivated to spread the relevance information placed in the SOM units also to the neighboring units. This is implemented in PicSOM by low-pass filtering the map surfaces. All relevant images are first given equal positive weight inversely proportional to the number of relevant images. Likewise, irrelevant images receive negative weights that are inversely proportional to the number of irrelevant images. The overall sum of these relevance values is thus zero. The values are then summed in the BMUs of the images and the resulting sparse value fields are low-pass filtered. Figure 2 illustrates how the positive and negative responses, displayed with white and black map units, respectively, are first mapped on a SOM surface and how the responses are expanded in the convolution. Content descriptors that fail to coincide with the user’s conceptions produce lower qualification values than those descriptors that match the user’s expectations. As a consequence, the different content descriptors do not need to be explicitly weighted as the system automatically takes care of weighting their opinions. In the actual implementation, we search on each SOM for a fixed number, say 100, map locations with unseen images having the highest qualification values. After removing duplicate images, the second stage of processing is carried out. Now, the qualification values of all images in this combined set are summed up on all used SOMs to obtain the final qualification values for these images. Then, 20 images with the highest qualification values are returned as the result of the query round. In the experiments described in this paper, the queries are always started with an image that belongs to the image class in question. Therefore, we neglected
the TS-SOM hierarchy and considered exclusively the bottommost TS-SOM levels. This mode of operation is motivated by the chosen query type, since it is justifiable to start the retrieval near the initial reference image. This can be seen as depth first search. However, the hierarchical representation of the image database produced by a TS-SOM is useful in visual browsing. The successive map levels can be regarded as providing increasing resolution for database inspection. In our earlier experiments, e.g. [1,8,2], there was no initial example image to start the query with and the queries began with initial breadth first search using the visual labels and the TS-SOM structure.
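As a compact illustration of the relevance-weighting and low-pass filtering step of Section 2.2, the sketch below builds one qualification-value field for a single SOM surface; the box filter and its size are placeholders standing in for the convolution actually used in PicSOM.

import numpy as np
from scipy.ndimage import uniform_filter

def qualification_map(map_side, relevant_bmus, irrelevant_bmus, filter_size=5):
    """Build a qualification-value field on one SOM surface.

    relevant_bmus / irrelevant_bmus: (row, col) BMU coordinates of the images
    the user has graded. Positive and negative weights sum to zero overall,
    and the sparse field is then low-pass filtered to spread the relevance
    information to neighboring map units.
    """
    field = np.zeros((map_side, map_side))
    if relevant_bmus:
        for r, c in relevant_bmus:
            field[r, c] += 1.0 / len(relevant_bmus)
    if irrelevant_bmus:
        for r, c in irrelevant_bmus:
            field[r, c] -= 1.0 / len(irrelevant_bmus)
    # Low-pass filtering expands the sparse responses over the map surface
    # (a uniform box filter stands in for the convolution illustrated in Fig. 2).
    return uniform_filter(field, size=filter_size, mode="constant")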
2.3
Vector-Quantization-Based Reference Method
There exists a wide range of distinct techniques for indexing images based on their feature descriptors. One alternative method to the SOM is to first use quantization to prune the database and then utilize a more exhaustive method to decide the final images to be returned. For the first part, there exist two alternative quantization techniques, namely scalar quantization (SQ) and vector quantization (VQ). With either of these techniques, the feature vectors are divided into subsets in which the vectors resemble each other. In the case of scalar quantization the resemblance is with respect to one component of the feature vector, whereas resemblance in vector quantization means that the feature vectors are similar as a whole. In our previous experiments [8], we have found out that scalar quantization gives bad retrieval results. The justification for vector quantization in image retrieval is that unseen images which have fallen into the same quantization bins as the relevant-marked reference images are good candidates for the next reference images to be displayed to the user. Also, the SOM algorithm can be seen as a special case of vector quantization. When using the model vectors of the SOM units in vector quantization, one ignores the topological ordering provided by the map lattice and characterizes the similarity of two images only by whether they are mapped in the same VQ bin. By ignoring the topology, however, we dismiss the most significant portion of the data organization provided by the SOM. A well-known VQ method is the K-means or Linde-Buzo-Gray (LBG) vector quantization [9]. According to [8], LBG quantization yields better CBIR performance than the SOM used as a pure vector quantizer. This is understandable as the SOM algorithm can be regarded as a trade-off between two objectives, namely clustering and topological ordering. Consequently, we will use LBG quantization in the reference system of the experiments. The choice for the number of quantization bins is a significant parameter for the VQ algorithm. Using too few bins results in too broad image clusters to be useful whereas with too many bins the information about the relevancy of images fails to generalize to other images. Generally, the number of bins should be smaller than the number of neurons on the largest SOM layer of the TS-SOM. In the experiments, we have used 4096 VQ bins, which coincides with the size of the second bottommost TS-SOM levels. This results in 14.6 images per VQ
bin, on the average, for the used database of 59 995 images. Another significant parameter is the number of candidate images that are taken into consideration from each of the parallel vector quantizers. Different selection policies lead again either to breadth first or depth first searches. In our implementation, we rank the VQ bins of each quantizer in the descending order determined by the proportion of relevant images of all graded images in them. Then, we select 100 yet unseen images from the bins in that order. After the vector quantization stage, the set of potential images has been greatly reduced and more demanding processing techniques can be applied to all the remaining candidate images. Now, one possible method – also applied in our reference system – is to rank the images based on their properly-weighted cumulative distances to all already-found relevant images in the original feature space. Finally, as in the PicSOM method, we display 20 best-scoring images to the user. In [8], it was found out that the VQ method benefits from this extra processing stage. As calculating distance in a possibly very high-dimensional space is a computationally heavy operation, the vector quantization can thus be seen to act as a preprocessor which prunes a large database as much as it is necessary before the actual image similarity assessment is carried out.
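The bin-ranking step described above might look roughly like the following; the data structures, the scoring of bins with no graded images, and the candidate count are illustrative assumptions rather than details of the actual reference system.

def candidate_images_from_vq(bins, relevant, irrelevant, seen, n_candidates=100):
    """Rank VQ bins by the fraction of relevant images among the graded images
    in each bin, then collect unseen images from the bins in that order.

    bins: dict mapping a bin id to the list of image ids quantized into it.
    relevant, irrelevant, seen: sets of image ids.
    """
    def score(images):
        graded = [i for i in images if i in relevant or i in irrelevant]
        if not graded:
            return 0.0  # bins without graded images are ranked last (an assumption)
        return sum(1 for i in graded if i in relevant) / float(len(graded))

    ranked = sorted(bins.values(), key=score, reverse=True)
    candidates = []
    for bin_images in ranked:
        for image in bin_images:
            if image not in seen and image not in candidates:
                candidates.append(image)
                if len(candidates) == n_candidates:
                    return candidates
    return candidates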
3
Experiments
The performance of a CBIR system can be evaluated in many different ways. Even though the interpretation of the contents of images is always casual and ambiguous, some kind of ground truth classification of images must be performed in order to automate the evaluation process. In the simplest case – employed also here – some image classes are formed by first selecting verbal criteria for membership in a class and then assigning the corresponding Boolean membership value for each image in the database. In this manner, a set of ground truth image classes, not necessarily non-overlapping, can be formed and then used in the evaluation.
3.1
Performance Measures and Evaluation Scheme
All features can be studied separately and independently from the others for their capability to map visually similar images near each other. These kinds of feature-wise assessments, however, have severe limitations because they are not related to the operation of the entire CBIR system as a whole. In particular, they do not take any relevance feedback mechanism into account. Therefore, it is preferable to use evaluation methods based on the actual usage of the system. If the size of the database, N, is large enough, we can assume that there is an upper limit N_T of images (N_T ≪ N) the user is willing to browse. The system should thus demonstrate its talent within this number of images. In our setting, each image in a class C is “shown” to the system one at a time as an initial image to start the query with. The mission of the CBIR system is then to return as many similar images as possible. In order to obtain results that do not depend
on the particular image used in starting the iteration, the experiment needs to be repeated over every image in C. This results in a leave-one-out type of testing of the target class; the effective size of the class becomes N_C − 1 instead of N_C, and the a priori probability of the class is ρ_C = (N_C − 1)/(N − 1). We have chosen to show the evolution of precision as a function of recall during the iterative image retrieval process. Precision and recall are intuitive performance measures that suit also the case of non-exhaustive browsing. When not the whole database but only a smaller number N_T ≪ N of images is browsed through, the recall value is very unlikely to reach the value one. Instead, the final value R(N_T) – as well as P(N_T) – reflects the total number of relevant images found thus far. The intermediate values of P(t) first display the initial accuracy of the CBIR system and then how the relevance feedback mechanism is able to adapt to the class. With an effective relevance feedback mechanism, it is to be expected that P(t) first increases and then turns to decrease when a notable fraction of relevant images have already been shown. In our experiments, we have normalized the precision value by dividing it by the a priori probability ρ_C of the class and therefore call it relative precision. This makes the comparison of the recall–precision curves of different image classes somewhat commensurable and more convenient because relative precision values relate to the relative advantage the CBIR system produces over random browsing.
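For clarity, these measures reduce to a few lines of Python; this is only a restatement of the definitions above, with argument names chosen here for illustration.

def recall(n_relevant_found, class_size):
    """Fraction of the (leave-one-out) target class retrieved so far."""
    return n_relevant_found / float(class_size - 1)

def relative_precision(n_relevant_found, n_shown, class_size, database_size):
    """Precision divided by the a priori probability rho_C = (N_C - 1) / (N - 1)."""
    precision = n_relevant_found / float(n_shown)
    a_priori = (class_size - 1) / float(database_size - 1)
    return precision / a_priori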
3.2
Database and Ground Truth Classes
We have used images from the Corel Gallery 1 000 000 product in our evaluations. The database contains 59 995 color photographs originally packed with a wavelet compression and then locally converted in JPEG format with a utility provided by Corel. The size of each image is either 384×256 or 256×384 pixels. The images have been grouped by Corel in thematic groups and also keywords are available. However, we found these image groups and keywords rather inconsistent and, therefore, created for the experiments six manually-picked ground truth image sets with tighter membership criteria. All image sets were gathered by a single subject. The used sets and membership criteria were:
– faces, 1115 images (a priori probability 1.85%), where the main target of the image is a human head which has both eyes visible and the head fills at least 1/9 of the image area.
– cars, 864 images (1.44%), where the main target of the image is a car, at least one side of the car has to be completely shown in the image, and its body to fill at least 1/9 of the image area.
– planes, 292 images (0.49%), where all airplane images have been accepted.
– sunsets, 663 images (1.11%), where the image contains a sunset with the sun clearly visible in the image.
– houses, 526 images (0.88%), where the main target of the image is a single house, not severely obstructed, and it fills at least 1/16 of the image area.
– horses, 486 images (0.81%), where the main target of the image is one or more horses, shown completely in the image.
3.3
MPEG-7 Content Descriptors
MPEG-7 [5] is an ISO/IEC standard developed by the Moving Pictures Expert Group. MPEG-7 aims at standardizing the description of multimedia content data. It defines a standard set of descriptors that can be used to describe various types of multimedia information. The standard is not aimed at any particular application area; instead it is designed to support as broad a range of applications as possible. Still, one of the main application areas of MPEG-7 technology will undoubtedly be to extend the current modest search capabilities for multimedia data for creating effective digital libraries. As such, MPEG-7 is the first serious attempt to specify a standard set of descriptors for various types of multimedia information and standard ways to define other descriptions as well as structures of descriptions and their relationships. As a non-normative part of the standard, a software Experimentation Model (XM) [6] has been released for public use. The XM is the framework for all reference code of the MPEG-7 standard. In the scope of our work, the most relevant part of XM is the implementation of a set of MPEG-7-defined still image descriptors. At the time of this writing, XM is in its version 5.3 and not all description schemes have yet been reported to be working properly. Therefore, we have used only a subset of MPEG-7 content descriptors for still images in these experiments. The used descriptors were Scalable Color, Dominant Color, Color Structure, Color Layout, Edge Histogram, and Region Shape. The MPEG-7 standard defines not only the descriptors but also special metrics to be used with the descriptors when calculating the similarity between images. However, we use Euclidean metrics in comparing the descriptors because the training of the SOMs and the creation of the vector quantization prototypes are based on minimizing a square-form error criterion. Only in the case of the Dominant Color descriptor has this necessitated a slight modification in the use of the descriptor. The original Dominant Color descriptor of XM is variable-sized, i.e., the length of the descriptor varies depending on the count of dominant colors found. Because this could not be fitted into the PicSOM system, we used only the two most dominant colors or duplicated the most dominant color if only one was found. Also, we did not make use of the color percentage information. These two changes do not make our approach incompatible with other uses of the Dominant Color descriptor.
3.4
Results
Our experiments were two-fold. First, we wanted to study which of the four color descriptors would be the best one to be used together with the one texture and one shape descriptors in the table. Second, we wanted to compare the performance of our PicSOM system with that of the vector-quantization-based variant. We performed two sets of experiments in which the first question was addressed in the first set and the second question in both sets. We performed 48 computer runs in the first set of experiments. Each run was characterized by the combination of the method (PicSOM / VQ), color feature (Dominant Color / Scalable Color / Color Layout / Color Structure)
and the image class (faces / cars / planes / sunsets / houses / horses). Each experiment was repeated as many times as there were images in the image class in question; the recall and relative precision values were recorded for each such instance and finally averaged. 20 images were shown at each iteration round, which resulted in 50 rounds when N_T was set to 1000 images. Both recall and relative precision were recorded after each query iteration. Figure 3 shows, as a representative selection, the recall–relative precision curves of three of the studied image classes (faces, cars, and planes). Qualitatively similar behavior is observed with the three other classes as well. The recorded values are shown with symbols and connected with lines. The following observations can be made from the resulting recall–relative precision curves. First, none of the tested color descriptors seems to dominate the other descriptors and on different image classes the results of different color descriptors often vary considerably. Regardless of the used retrieval method (PicSOM or VQ), Color Structure seems to perform best with faces and using Scalable Color yields best results with planes and horses. With the other classes (cars, sunsets, houses), naming a single best color descriptor is not as straightforward. The second observation is that, in general, if a particular color descriptor works well for a particular image class, it does so with both retrieval algorithms. Third, the PicSOM method more often obtains better precision than the VQ method when comparing the same descriptor sets, although the difference is rather small. Also, in the end, PicSOM has in a majority of cases reached a higher recall level. The last observation here is that the difference between the precision of the best and the worst sets of descriptors is larger with the VQ method than with PicSOM. This can be observed, e.g., in the planes column of Figure 3.
Fig. 3. Recall–relative precision plots of the performance of different color descriptors and the two CBIR techniques (columns: faces, cars, planes; rows: PicSOM and VQ). In all cases also Edge Histogram and Region Shape descriptors have been used.
Fig. 4. Recall–relative precision plots of the performance of the two CBIR techniques when all four color descriptors were used simultaneously together with Edge Histogram and Region Shape descriptors (one panel per class: faces, cars, planes, sunsets, houses, horses).
As the final outcome of the experiment, it can be stated that the relevance feedback mechanism of PicSOM is clearly superior to that of VQ’s. The VQ retrieval has good initial precision but after a few rounds, when PicSOM’s relevance feedback begins to have an effect, retrieval precision with PicSOM is in all cases higher. The houses class can be regarded as a draw and a failure for both methods with the given set of content descriptors. One can also compare the curves of Figure 3 and the curves in the upper row of Figure 4 for an important observation. It can be seen that the PicSOM method is, when using all descriptors simultaneously (Figure 4), able to follow and even exceed the path of the best recall–relative precision curve for the four alternative single color descriptors (Figure 3). This behavior is present in all cases, also with the image classes not shown in Figure 3, and can be interpreted as an indication that the automatic weighting of features is working properly and additional, inferior, descriptors do not degrade the results. On the contrary, the VQ method fails to do the same and the VQ recall–relative precision curves in Figure 4 resemble more the average than the maximum value of the corresponding VQ curves in Figure 3. As a consequence, the VQ technique is clearly more dependent on the proper selection of used features than the PicSOM technique.
4
Conclusions
In this paper, we have described our content-based image retrieval system named PicSOM and shown that MPEG-7-defined content descriptors can be successfully used with it. The PicSOM system is based on using Self-Organizing Maps in implementing relevance feedback from the user of the system. As the system uses many parallel SOMs, each trained with separate content descriptors, it is straightforward to use any kind of features. Due to PicSOM’s ability to automatically weight and combine the responses of the different descriptors, one can make use of any number of content descriptors without the need to weight them manually. As a consequence, the PicSOM system is well-suited for operation with MPEG-7 which also allows the definition and addition of any number of new content descriptors. In the experiments we compared the performances of four different color descriptors available in the MPEG-7 Experimentation Model software. The results of that experiment showed that no single color descriptor was the best one for all of our six hand-picked image classes. That result was no surprise, it merely emphasizes the need to use many different types of content descriptors in parallel. In an experiment where we used all the available color descriptors, the PicSOM system indeed was able to automatically reach and even exceed the best recall–precision levels obtained earlier with preselection of features. This is a very desirable property, as it suggests that we can initiate queries with a large number of parallel descriptors and the PicSOM systems focuses on the descriptors which provide the most useful information for the particular query instance.
We also compared the performance of the self-organizing relevance feedback technique of PicSOM with that of a vector-quantization-based reference system. The results showed that in the beginning of queries, PicSOM starts with a bit lower precision rate. Later, when its strong relevance feedback mechanism has enough data to process, PicSOM outperforms the reference technique. In the future, we plan to study how the retrieval precision in the beginning of PicSOM queries could be improved to the level attained by the VQ technique in the experiments.
Acknowledgments
This work was supported by the Finnish Centre of Excellence Programme (2000–2005) of the Academy of Finland, project New information processing principles, 44886.
References
1. Laaksonen, J.T., Koskela, J.M., Laakso, S.P., Oja, E.: PicSOM - Content-based image retrieval with self-organizing maps. Pattern Recognition Letters 21 (2000) 1199–1207
2. Laaksonen, J., Koskela, M., Laakso, S., Oja, E.: Self-organizing maps as a relevance feedback technique in content-based image retrieval. Pattern Analysis & Applications 4 (2001) 140–152
3. Kohonen, T.: Self-Organizing Maps. Third edn. Volume 30 of Springer Series in Information Sciences. Springer-Verlag (2001)
4. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. Computer Science Series. McGraw-Hill (1983)
5. MPEG: Overview of the MPEG-7 standard (version 5.0) (2001) ISO/IEC JTC1/SC29/WG11 N4031.
6. MPEG: MPEG-7 visual part of the eXperimentation Model (version 9.0) (2001) ISO/IEC JTC1/SC29/WG11 N3914.
7. Koikkalainen, P., Oja, E.: Self-organizing hierarchical feature maps. In: Proc. IJCNN-90, International Joint Conference on Neural Networks, Washington, DC. Volume II., Piscataway, NJ, IEEE Service Center (1990) 279–285
8. Koskela, M., Laaksonen, J., Oja, E.: Comparison of techniques for content-based image retrieval. In: Proceedings of 12th Scandinavian Conference on Image Analysis (SCIA 2001), Bergen, Norway (2001) 579–586
9. Linde, Y., Buzo, A., Gray, R.: An algorithm for vector quantizer design. IEEE Transactions on Communications COM-28 (1980) 84–95
Fast Text Caption Localization on Video Using Visual Rhythm
Seong Soo Chun1, Hyeokman Kim2, Jung-Rim Kim1, Sangwook Oh1, and Sanghoon Sull1
1 School of Electrical Engineering, Korea University, Seoul, Korea
{sschun,jrkim,osu,sull}@mpeg.korea.ac.kr
2 School of Computer Science, Kookmin University, Seoul, Korea
hmkim@cs.kookmin.ac.kr
Abstract. In this paper, a fast DCT-based algorithm is proposed to efficiently locate text captions embedded in specific areas of a video sequence through the visual rhythm, which can be constructed quickly by sampling certain portions of a DC image sequence and temporally accumulating the samples along time. Our proposed approach is based on the observation that text captions carrying important information suitable for indexing often appear in specific areas of video frames, from which the sampling strategies for a visual rhythm are derived. Our method then uses a combination of contrast and temporal coherence information on the visual rhythm to detect text frames such that each detected text frame represents consecutive frames containing identical text strings, thus significantly reducing the number of frames that need to be examined for text localization in a video sequence. It then utilizes several important properties of text captions to locate the text caption in the detected frames.
1
Introduction
With rapid advances in digital technology, the amount of multimedia information available continues to grow. As multimedia contents become readily available, archiving, searching, indexing and locating desired content in large volumes of multimedia, containing images and video in addition to the textual information, will become even more difficult. One important source of information that can be obtained from image and video is the text contained therein. The video can be easily indexed if access to this textual information content is available. They provide clear semantics of video, and are extremely useful in deducing the contents of video. A large number of methods have been extensively studied in recent years to detect text in uncompressed images and video. Ohya et al. [1] perform character extraction by local thresholding and detect character candidate regions by evaluating gray level difference between adjacent regions. Hauptmann and Smith [2] use the spatial context of text and high contrast of text regions in scene images to merge large numbers of horizontal and vertical edges in spatial proximity to detect text. Shim et al. [3] use a generalized region labeling algorithm to find homogeneous regions for text. Wu et al.
[4] use texture analysis to detect and segment texts as regions of distinctive texture, using a pyramid technique for handling text fonts of different sizes. Lienhart [5] provides a split-and-merge algorithm based on characteristics of artificial text to segment text. Li et al. [6] used wavelet analysis and employed a multi-frame coherence approach to cluster edges into rectangular shapes. Sato et al. [7] adopted a multi-frame integration technique to separate static text from moving background. A few methods have also been proposed to detect text regions in the compressed domain. Yeo and Liu [8] propose a method for the detection of text caption events in video by modified scene change detection, which cannot handle captions that gradually enter or disappear from frames. Zhong et al. [9] examined the horizontal variations of AC values in DCT to locate text frames and examined the vertical intensity variation within the text regions to extract the final text frames. Zhang and Chua [10] derived a binarized gradient energy representation directly from DCT coefficients which are subject to constraints on text properties and temporal coherence to locate text. However, none of them exploits the temporal coherence of text, useful for reducing processing time by not applying all steps (detection, localization, and OCR) to every frame, which results in duplicates of the same text string in the database. The main contribution of this paper is to develop an efficient and fast compressed DCT domain method to locate text captions in specific areas in digital video through a visual rhythm [12], an abstraction of video that is constructed by sampling a certain group of pixels of each frame and by temporally accumulating the samples along time. Our method uses a combination of contrast and temporal coherence information on the visual rhythm to detect text frames such that each detected text frame represents consecutive frames containing identical text strings, thus significantly reducing the number of text frames that need to be examined for text localization in a video sequence. It then utilizes several important properties of text captions to locate text captions in the detected frames. The visual rhythm constructed for text localization also serves as a visual feature to efficiently detect scene changes. This paper is organized as follows: Section 2 gives a brief description of the visual rhythm. Section 3 describes the proposed text frame detection and text caption localization algorithm. Section 4 describes experimental results. In Section 5, we give concluding remarks.
2 Related Work

2.1 Visual Rhythm

For the design of an efficient real-time text caption detector, we resort to using only a portion of the original video. This partial video must retain most, if not all, text caption information. We claim that a visual rhythm, defined below, satisfies this requirement. Let f_DC(x,y,t) be the pixel value at location (x,y) of an arbitrary W x H DC image [11], which consists of the DC coefficients of frame t. Using the sequence of DC images of a video, called the DC sequence, we define a visual rhythm, VR, of the video V as follows:
VR = {f_VR(z,t)} = {f_DC(x(z), y(z), t)},    (1)
where x(z) and y(z) are one-dimensional functions of the independent variable z. Thus, the visual rhythm is a two-dimensional image whose vertical z axis consists of a certain group of pixels from each DC image, with the samples accumulated along time on the horizontal t axis. That is, the visual rhythm is a two-dimensional image consisting of pixels sampled from three-dimensional data (the DC sequence). The visual rhythm is also an important visual feature that can be utilized to detect scene changes [12]. The sampling strategy, x(z) and y(z), must be carefully chosen for a visual rhythm to retain text caption information. We define x(z) and y(z) as in Equation (2), where W and H are the width and the height of a DC image, respectively. Figure 1 illustrates the sampling strategy of the DC sequence for the construction of the visual rhythm. The diagonal pixels of a frame, from the bottom-left corner to the top-right corner, are sampled for 0 ≤ z < H.
Fig. 1. Representation for regions of text appearance
Fig. 2. The vertical line of a visual rhythm obtained by sampling the pixels of a DC sequence
2.2 Fast Generation of Visual Rhythm

Many compression schemes use the discrete cosine transform (DCT) for intra-frame encoding. Thus, the construction of a visual rhythm is possible without the inverse DCT: we simply extract the DC coefficients of each frame. As for the P- and B-frames of MPEG, algorithms for determining the DC images from inter-frame compressed P- and B-frames of MPEG-1 [11] and MPEG-2 [13] have already been developed. Therefore, it is possible to generate a visual rhythm quickly, at least for DCT-based compression schemes such as Motion JPEG and MPEG videos.
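As an illustration of this step, the following sketch builds a visual rhythm from DC images that are assumed to be already extracted as numpy arrays. The diagonal x(z), y(z) mapping used here is an assumption, since the exact sampling path of Equation (2) could not be recovered from the source.

import numpy as np

def visual_rhythm(dc_images):
    """Build a visual rhythm by sampling the diagonal of each DC image.

    dc_images: iterable of H x W arrays (one DC image per frame).
    Returns a 2-D array whose vertical axis is the sample index z and whose
    horizontal axis is time t, as in the definition of VR.
    """
    columns = []
    for dc in dc_images:
        h, w = dc.shape
        z = np.arange(h)
        # Diagonal from the bottom-left corner to the top-right corner:
        # one pixel is taken from every row of the DC image.
        x = (z * (w - 1)) // max(h - 1, 1)
        y = (h - 1) - z
        columns.append(dc[y, x])
    return np.stack(columns, axis=1)

# Example with synthetic DC images (30 frames of 30x40 DC coefficients).
frames = [np.random.randint(0, 256, (30, 40)) for _ in range(30)]
vr = visual_rhythm(frames)   # shape (30, 30): z rows, t columns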
3 Proposed Strategy

3.1 Text Frame Detection

A text frame is defined as a video frame that contains one or more text captions. Since a text caption usually appears in a number of consecutive frames, we propose an algorithm that detects a representative text frame from the consecutive frames containing identical text strings, to avoid unnecessary text caption localization for identical text strings. The text frame detection algorithm detects text frames based on the following characteristics of text captions within video:
♦ Characters in a single text caption are mostly uniform in color.
♦ Text captions contrast with their background.
♦ Text captions remain in a scene for a number of consecutive frames.
On the visual rhythm obtained from a DC sequence, the pixels corresponding to a text caption manifest themselves as long horizontal lines with high contrast against their background. Hence, horizontal lines on the visual rhythm with high contrast against their background are mostly due to text strings, and they give us clues about where and when each text string appears within the video. The pixel value of the horizontal line on the visual rhythm also gives a clue about the pixel value of the text caption in the DC image, allowing a simple algorithm for text caption localization within the frame. To detect potential text frames, any horizontal edge detection method can be used on the visual rhythm. In our experiment we used the Prewitt edge operator with the convolution kernel

[ -1 -1 -1 ]
[  0  0  0 ]
[  1  1  1 ]
on the visual rhythm to obtain VR_edge(z,t) as follows:

VR_edge(z,t) = Σ_{i=-1}^{1} Σ_{j=-1}^{1} w_{i,j} f_VR(z+j, t+i).    (3)
To obtain text lines, which we define as horizontal lines on the visual rhythm with high contrast against their background and possibly formed by text captions, the pixels whose VR_edge(z,t) value is greater than a threshold τ (we set τ = 150 in our experiment) and whose value f_VR(z,t) is uniform are connected in the horizontal direction. Text lines lasting shorter than a specific amount of time are not considered, since text usually remains in the scene for a number of consecutive frames. From observations of various types of video material, the shortest captions appear to be active for at least two seconds, which translates into a text line with a frame length of 60 if the video is digitized at 30 frames per second. Thus, text lines shorter than 2 seconds can be eliminated. The resulting set of text lines has the form

LINE_k = [z_k, t_k^start, t_k^end],  k = 1, ..., N_LINE,    (4)

where [z_k, t_k^start, t_k^end] denotes the z coordinate, the beginning frame and the end frame of the occurrence of text line LINE_k on the visual rhythm, respectively. The text lines are ordered by increasing starting frame number,

t_1^start ≤ t_2^start ≤ ... ≤ t_{N_LINE}^start.    (5)
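A minimal sketch of this text-line extraction, assuming the visual rhythm f_VR and the Prewitt response VR_edge are available as numpy arrays. The uniformity test on f_VR is reduced to a simple tolerance parameter here, which is an assumption; the threshold and minimum length follow the text.

import numpy as np

def extract_text_lines(vr, vr_edge, tau=150, min_len=60, tol=10):
    """Find horizontal runs of strong edges with near-constant pixel value.

    vr      : visual rhythm f_VR(z, t)
    vr_edge : edge response VR_edge(z, t) from the Prewitt operator
    tau     : edge threshold (150 in the paper's experiment)
    min_len : minimum duration in frames (2 s at 30 fps -> 60)
    tol     : allowed variation of f_VR along a run (assumed parameter)
    Returns a list of (z, t_start, t_end) triples ordered by t_start.
    """
    lines = []
    Z, T = vr.shape
    for z in range(Z):
        t = 0
        while t < T:
            if vr_edge[z, t] > tau:
                start, ref = t, vr[z, t]
                while t < T and vr_edge[z, t] > tau and abs(vr[z, t] - ref) <= tol:
                    t += 1
                if t - start >= min_len:
                    lines.append((z, start, t - 1))
            else:
                t += 1
    return sorted(lines, key=lambda ln: ln[1])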
Figure 5(b) shows the binarized representation of the text lines, possibly formed by text captions, obtained from the visual rhythm in Figure 5(a). Frames that do not fall within the temporal duration of any LINE_k do not contain a text caption and are omitted from further consideration as text frame candidates. Once the frames without text have been excluded, it is highly probable that the remaining frames of the video contain text captions.
However, it would be very inefficient to perform the text caption localization repeatedly for the same text caption remaining on the screen over multiple frames. Since each text line possibly represents a single text caption, we only need to access a single frame to extract its corresponding text. Therefore, the number of text frames to be examined for text caption localization can be minimized by obtaining a maximum cardinality collection of disjoint intervals of text lines through the following algorithm:

n ← 0
SET = {k : 1 ≤ k ≤ N_LINE}
WHILE (SET ≠ ∅) {
    e = min(t_k^end : k ∈ SET);
    A = {k ∈ SET | t_k^start < e};
    F_n = (max(t_k^start : k ∈ A) + e) / 2;
    n++;
    SET ← SET − A;
}

Fig. 3. Pseudo-code to find the minimal number of text frames
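A runnable rendering of the pseudo-code of Fig. 3, under the assumption that the text lines are supplied as (z, t_start, t_end) triples such as those produced by the previous stage; the helper name select_text_frames is ours.

def select_text_frames(lines):
    """Greedy selection of a minimal set of frames covering all text lines.

    lines: list of (z, t_start, t_end) triples.
    Returns the frame numbers F_0, ..., F_{n-1} to pass on to localization.
    """
    remaining = set(range(len(lines)))
    frames = []
    while remaining:
        # Earliest end time among the remaining text lines.
        e = min(lines[k][2] for k in remaining)
        # All remaining lines that have already started before that end time.
        a = {k for k in remaining if lines[k][1] < e}
        # Representative frame: midpoint between the latest start in A and e.
        frames.append((max(lines[k][1] for k in a) + e) // 2)
        remaining -= a
    return frames

# Example: three overlapping captions and one isolated caption.
lines = [(10, 0, 120), (25, 30, 150), (40, 60, 200), (12, 400, 520)]
print(select_text_frames(lines))   # two frames suffice here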
where F_j is the j-th frame to be accessed for text caption localization as the final output of the text frame detection stage, with j < n.
3.2 Text Caption Localization

The text caption localization stage spatially localizes text captions within a frame. Let f_DC(x,y,t) be the pixel value at (x,y) of the DC image of frame t. From the visual rhythm obtained by the sampling strategy of Equation (2), we can observe that LINE_k is possibly formed by a portion of a character located at (x,y) = (x(z_k), y(z_k)) in the frames between t_k^start and t_k^end, with pixel values f_VR(z_k, t) where t_k^start < t < t_k^end. Furthermore, if a portion of a character is located at (x,y) = (x(z_k), y(z_k)) within a DC image, it can be assumed that portions of characters belonging to the same text caption appear along y = y(z_k), because text captions are usually horizontally aligned. Therefore, the text line information obtained from the text frame detection stage can be used to approximate the location of text within the frame, and enables an algorithm that focuses on a specific area of the frame. For each of the detected frames F_j, we verify whether LINE_k, with t_k^start < F_j < t_k^end, is formed by portions of a text string located along y = y(z_k). For the text line LINE_k, we first cluster the pixels with pixel value f_VR(z_k, t) from the pixels of the horizontal scanline y = y(z_k), using a 4-connected clustering algorithm, to form
text candidate regions in frame F_j, where t_k^start < t, F_j < t_k^end. From each of the clustered regions, the top-most coordinate is computed and collected in an alignment histogram H_T, whose bins correspond to the row numbers of the DC image, as illustrated in Figure 4. H_B is computed in the same way using the bottom-most coordinates of each region. We declare the existence of an upper boundary B_T of a text caption if at least 50% of the elements in H_T are contained within three or fewer adjacent histogram bins. The lower boundary B_B is computed in the same way using H_B. The height of the localized text caption can thus be obtained. To find the width of the caption text, regions wider than 1.5 times the height are first discarded. From the final set of regions, the following criterion is used to merge regions corresponding to characters and obtain the width of the text caption:

♦ Two regions, A and B, are merged if the gap between A and B is less than 3 times the height.
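A simplified sketch of the boundary test and the merging rule described above. The candidate character regions are assumed to be given already (the 4-connected clustering itself is not shown), and the helper names are ours; the 50%-within-three-bins rule and the 1.5x/3x height thresholds follow the text.

def vertical_boundary(rows, window=3, ratio=0.5):
    """Return a boundary row if at least `ratio` of the values fall inside
    `window` adjacent histogram bins, else None."""
    if not rows:
        return None
    lo, hi = min(rows), max(rows)
    counts = {r: rows.count(r) for r in range(lo, hi + 1)}
    for start in range(lo, hi + 1):
        hits = sum(counts.get(r, 0) for r in range(start, start + window))
        if hits >= ratio * len(rows):
            return start
    return None

def caption_width(regions, height):
    """Merge character regions into one caption run.

    regions: list of (left, right) extents of candidate character regions.
    height : caption height obtained from the upper/lower boundaries.
    """
    # Discard regions wider than 1.5 * height (unlikely single characters).
    kept = sorted((l, r) for l, r in regions if (r - l) <= 1.5 * height)
    if not kept:
        return None
    left, right = kept[0]
    for l, r in kept[1:]:
        if l - right < 3 * height:      # merge if the gap is small enough
            right = max(right, r)
        else:
            break                       # a separate caption starts here
    return left, right

tops = [12, 12, 13, 12, 30, 12]                  # top rows of candidate regions
print(vertical_boundary(tops))                   # -> 12 (upper boundary B_T)
print(caption_width([(5, 12), (15, 22), (60, 70)], height=10))  # -> (5, 22)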
We can thus verify whether LINE_k is formed by a text caption and, if so, localize the text caption, which appears along the duration of LINE_k and does not have to be verified again.
Fig. 4. Computation of the upper and lower boundaries B_T and B_B of a text caption through the alignment histograms H_T and H_B.
Since several text lines can be formed by the same text caption, the whole localization process is skipped when LINE_k and its corresponding horizontal scanline y = y(z_k) intersect a text caption already localized from a previous text line. Figure 6 shows an example of a localized text caption. The usefulness of this text caption localization stage is that it is inexpensive and fast, robustly supplying bounding boxes around text captions along with their temporal information.
4 Experimental Results

4.1 Environment of the Experiment

To evaluate the performance of the proposed method, we tested it on various types of MPEG video clips: 1) a news broadcast clip (14 m 52 s), which covered a variety of events including outdoor and newsroom news programs and a weather forecast, 2) sports clips of a golf lesson (37 m 21 s) and baseball (22 m 4 s), and 3) a commercial clip (7 m 35 s), which contains various embedded captions and credits.

4.2 Performance Evaluation of the Proposed Algorithm

Table 1 shows the results of the proposed algorithm. The second row of Table 1 gives the total number of text captions present in each category. The next row is the count of correctly identified text captions. The total numbers of false positives and false negatives are stated in the next two rows. Finally, the recall and precision in each case are stated. Our proposed text caption localization has an overall average recall of about 80% and a precision of 86%.

4.3 Computational Time of the Proposed Algorithm

The processing speed of the proposed caption localization method is high, since it works on only a few of the pixels sampled from the entire video in the compressed domain, whereas conventional approaches operate on all pixels of a video. Table 2 shows the processing time of each stage on a Pentium III 500 MHz. It took approximately 7 minutes to produce the visual rhythm of the video clips, corresponding to a total length of approximately 1 hour and 20 minutes. From the visual rhythm of the video clips, it took about 22 seconds to detect the potential text frames subject to text caption localization as the final result of the text frame detection stage. From the detected text frames, it took approximately 2 minutes in total to locate the text captions. Thus the whole process took about 9 minutes.
5 Conclusions

The proposed algorithm for localizing text captions proved to be very fast by exploiting text caption characteristics on the visual rhythm. Moving text captions and captions embedded at locations other than the assumed ones resulted in a rather low average recall rate of 80%, since our algorithm locates only static text captions at the assumed locations. It took 9 minutes to localize the text captions and their temporal durations for 1 hour and 22 minutes of video. This includes the construction time of the visual rhythm, which can also be used to detect scene changes for video indexing
with very little processing. The proposed method also reduces the time for OCR, since identical text captions appearing in consecutive frames are not fed into the OCR repeatedly.
Fig. 5. Characteristics of text captions on the visual rhythm: (a) visual rhythm of the video material; (b) text lines representing text captions
Fig. 6. Results of text caption localization

Table 1. Recall and precision for text caption localization

Video Type              News    Sports   Commercials
Distinct text caption     55      302         37
True Pos                  44      241         30
False Pos                  8       40          4
False Neg                 11       61          7
Recall (%)              80.0     79.8       81.1
Precision (%)           84.6     85.8       88.2
Table 2. Execution time of visual rhythm construction, text frame detection and text caption localization

Video Type              News      Sports      Commercials   Total
Duration                14m 52s   59m 25s     7m 35s        1h 21m 52s
Visual Rhythm           1m 16s    5m 6s       38s           7m
Detection Time          3s        17.21s      1.42s         21.63s
Localization Time       16s       1m 12s      8s            1m 36s
Total Processing Time   1m 35s    6m 35.21s   47.42s        8m 57.63s
References
1. Ohya, J., Shio, A., Akamatsu, S.: Recognizing Characters in Scene Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16 (1994) 214-220
2. Hauptmann, A., Smith, M.: Text, Speech, and Vision for Video Segmentation: The Informedia Project. AAAI Symposium on Computational Models for Integrating Language and Vision (1995)
3. Shim, J., Dorai, C., Bolle, R.: Automatic Text Extraction from Video for Content-Based Annotation and Retrieval. IEEE International Conference on Pattern Recognition, Vol. 1 (1998) 618-620
4. Wu, V., Manmatha, R., Riseman, E.: Finding Text in Images. Proceedings of the 2nd ACM International Conference on Digital Libraries (1997) 3-12
5. Lienhart, R.: Automatic Text Recognition for Video Indexing. Proceedings of ACM Multimedia (1996) 11-20
6. Li, H., Doermann, D., Kia, O.: Automatic Text Detection and Tracking in Digital Video. IEEE Transactions on Image Processing, Vol. 9 (2000) 147-156
7. Sato, T., Kanade, T., Hughes, E., Smith, M., Satoh, S.: Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions. ACM Multimedia Systems, Vol. 7 (1998) 385-394
8. Yeo, B.L., Liu, B.: Visual Content Highlighting via Automatic Extraction of Embedded Captions on MPEG Compressed Video. IS&T/SPIE Symposium on Electronic Imaging: Digital Video Compression (1996)
9. Zhong, Y., Karu, K., Jain, A.: Automatic Caption Localization in Compressed Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22 (2000) 385-392
10. Zhang, Y., Chua, T.: Detection of Text Captions in Compressed Domain Video. Proceedings of Multimedia Information Retrieval, ACM Multimedia (2000) 201-204
11. Yeo, B.L., Liu, B.: Rapid Scene Analysis on Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5 (1995) 533-544
12. Kim, H., Lee, J., Song, S.M.: An Efficient Graphical Shot Verifier Incorporating Visual Rhythm. Proceedings of IEEE International Conference on Multimedia Computing and Systems (1999) 827-834
13. Song, J., Yeo, B.L.: Spatially Reduced Image Extraction from MPEG-2 Video: Fast Algorithms and Application. Proceedings of SPIE Storage and Retrieval for Image and Video Databases VI, Vol. 3312 (1998) 92-107
A New Digital Watermarking Technique for Video

Kuan-Ting Shen and Ling-Hwei Chen

Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan

Abstract. Data hiding and digital watermarking are nowadays among the most important issues for digital multimedia. Data hiding techniques can be used for covert communication, while digital watermarking can be used for protecting digital media content. Many techniques have been developed for embedding data into various multimedia media. In this paper, we propose a method for embedding a digital watermark into uncompressed video. It uses the relationship among the DC components in several successive frames to hide data. Since DC components do not vary much after a DCT-based lossy compression algorithm, this approach is able to resist such compression. Experimental results demonstrate that the proposed method is robust to MPEG coding.
1. Introduction

Information hiding, a technique for embedding data into a given medium without being noticed, has become more and more important recently. Owing to the high-speed, high-capacity transmission provided by the Internet, digital multimedia content is widely spread. Although digital multimedia technologies bring Internet users many new applications and services, content owners are afraid of losing their income due to the increase in illegal copies. Since any unauthorized user can easily make perfect copies of digital multimedia, techniques for protecting the copyright of digital media are now an urgent demand for these content owners. To resolve the rightful ownership of multimedia content, copyright information can be embedded into the content by applying some sort of information hiding technique. This kind of information hiding technique is the so-called "digital watermarking". The development of digital watermarking techniques has advanced in recent years. A number of works have embedded watermarks into digital image content. These watermarking schemes can also be applied to videos by treating each single frame in a video as a still image [1-5]. Since a video sequence consists of successive still images, it is quite easy to apply a data-hiding algorithm for still images to videos. But data hiding algorithms for images consider only the nature of still images. For digital videos, there is a large amount of inter-frame redundancy. The most popular video compression standard, MPEG, adopts the technique called motion compensation for removing the inter-frame redundancy in order to gain a better compression ratio. In MPEG, only the I-frames use the same compression technique as that for still images, while the other frames, called P- and B-frames, are coded using the motion-compensated predictive coding scheme. Most digital watermarking schemes, such as DEW proposed by Langelaar et al. [8][9], embed watermarks only in the I-frames of an MPEG coded video. Since most frames in an MPEG coded video are B
and P frames, it is quite uneconomical if watermarks embedded in a video can only survive in I-frames. So it is desirable to develop an approach that can use not only I-frames but also B- and P-frames for embedding data into videos. In order to solve the problem that only I-frames can be used for embedding data, some approaches have been proposed that use motion vectors for hiding data in videos [6][7]. Since these vectors are generated in the motion compensation process, it is reasonable to expect that these data will survive in P- and B-frames. But this kind of method depends heavily on the MPEG standard, so these methods cannot be used if a video is not coded in MPEG format. In this paper, an approach that embeds data into uncompressed videos by using the relationship between frames is proposed. Since a video consists of a sequence of images, it is reasonable to embed data using properties shared between frames. This method is robust to MPEG compression and suitable for digital watermarking applications.
2. Watermarking on Uncompressed Videos

2.1 Embedding Digital Watermark Using DC Components
The proposed method for digital watermarking uses the relationship between the DC components in two successive frames. Instead of hiding data in every frame of a video sequence, only some frames are used for embedding watermark information, because the distortions caused by modifying DC components are large. In order to minimize the distortion caused by the embedding, not every frame in the host video can be used for embedding watermark information. Fig. 1 shows the embedding procedure of the proposed method.
Fig. 1. Embedding procedure using the DC difference between two consecutive frames.
In the first step of the embedding procedure, embedding pairs are selected by a key k. The key k, chosen by the user, is used to determine the distance between two embedding pairs. An embedding pair consists of two consecutive host frames, and the distance between two embedding pairs is generated by using k as a random seed. In order to guarantee the embedding strength, the number of embedding pairs must be sufficient. The embedding strength is also chosen by the user. Under this constraint, the distance d_i that is randomly generated from k cannot exceed a maximum distance based on the embedding strength. With this scheme, attackers cannot easily find all frames that contain the watermark. Fig. 2 shows an example of the embedding structure of a video using this embedding scheme.
Fig. 2. Structure of an embedded video sequence with d1, d2, d3… determined by key k.
After the embedding pairs are selected, both host frames in an embedding pair are divided into 8x8 blocks. Each block in the host frames is transformed using the discrete cosine transform (DCT), and the DC component of each transformed block is extracted. These DC components are the basis for embedding the watermark information into the host video. To increase the strength of the embedded watermark, some redundancy must be added to the embedding signal. The watermark is therefore first spread by a chip-rate cr, and the spread signal is then modulated by a binary pseudo-random noise p generated from the same key k that is used in embedding pair selection. The following equations show the relation between the watermark and the embedding signal:

s_i = w_j,  where j · cr ≤ i < (j+1) · cr,
p_i ∈ {0, 1},
e_i = s_i ⊕ p_i,    (1)

where w_j is the original watermark signal, s_i the spread signal, p_i the pseudo-random noise, and e_i the resulting bit-string for embedding.
In order to embed a watermark bit, the relation between the DC component of a block in one host frame and the DC component of its corresponding block in the other frame must meet the following condition:

DC_A < DC_B for embedding a bit "0",
DC_A > DC_B for embedding a bit "1",    (2)

where DC_A is the DC coefficient of a block in host frame A and DC_B the DC coefficient of the corresponding block in host frame B.
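A minimal sketch of the spreading, modulation and DC-ordering steps of Eqs. (1) and (2), operating on pre-computed DC arrays of one embedding pair. The helper names, the adjustment margin delta and the symmetric way both frames are nudged are our assumptions; the skip threshold t1 follows the description in the next paragraph.

import numpy as np

def prepare_bits(watermark, cr, key):
    """Spread each watermark bit cr times and XOR with a key-seeded PN sequence."""
    rng = np.random.default_rng(key)
    spread = np.repeat(np.asarray(watermark, dtype=np.uint8), cr)   # s_i
    pn = rng.integers(0, 2, size=spread.size, dtype=np.uint8)       # p_i
    return spread ^ pn                                              # e_i

def embed_pair(dc_a, dc_b, bits, t1=200.0, delta=4.0):
    """Enforce DC_A < DC_B for bit 0 and DC_A > DC_B for bit 1, block by block.

    dc_a, dc_b: 1-D float arrays of DC coefficients of corresponding blocks in
                the two host frames of one embedding pair (modified in place).
    t1        : blocks whose DC difference already exceeds t1 are skipped.
    delta     : margin used when a pair has to be adjusted (an assumption).
    """
    k = 0
    for i in range(dc_a.size):
        if k >= bits.size:
            break
        if abs(dc_a[i] - dc_b[i]) > t1:
            continue                      # too sensitive to modify, skip block
        mid = (dc_a[i] + dc_b[i]) / 2.0
        if bits[k] == 0:                  # need DC_A < DC_B
            dc_a[i], dc_b[i] = min(dc_a[i], mid - delta), max(dc_b[i], mid + delta)
        else:                             # need DC_A > DC_B
            dc_a[i], dc_b[i] = max(dc_a[i], mid + delta), min(dc_b[i], mid - delta)
        k += 1
    return k                              # number of bits actually embedded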
For example, the DC component in host frame A must be less than its corresponding DC in host frame B to embed a bit "0". If the relation between the two DC components does not match the watermark bit, the DC components in the transformed host frames are changed accordingly. In order to minimize the changes to the DC components of a single frame, both host frames are modified during embedding. Since DC coefficients are sensitive to larger modifications, a block does not embed any watermark bit if the difference between its two DC components is larger than a threshold t1. To increase the embedding strength, two embedding pairs embed the same bit-string. Finally, the inverse DCT is applied to transform the embedded frames back to the spatial domain.

2.2 The Watermark Extraction Procedure

The watermark extraction procedure is similar to the embedding procedure. First, the embedding pairs are selected using the key k that was used in the embedding procedure. Then all host frames in an embedding pair are transformed into the DCT domain. The watermark is extracted using the DC coefficients of the transformed embedded frames. With the condition listed in the previous section, an embedded bit can be extracted from the relationship between a DC in host frame A and its corresponding DC in host frame B. To reconstruct the original watermark signal, a pseudo-random bit-string is also needed; this too is generated from the key k. The extracted bit-string is then demodulated with this pseudo-random bit-string. From the embedding procedure described in the previous section, we know that each watermark bit was spread cr times and that each embedding bit-string was embedded repeatedly into two embedding pairs. So a watermark bit can be reconstructed from the cr*2 bits of the bit-strings extracted from the two embedding pairs: if there are more 1's than 0's among these cr*2 bits, a watermark bit "1" is extracted.
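A matching extraction sketch, assuming the same key, chip rate, block order and skip threshold as at embedding time; votes from the embedding pairs are pooled before the majority decision, which is one reading of the cr*2-bit rule. The function name is ours.

import numpy as np

def extract_watermark(pairs, n_bits, cr, key, t1=200.0):
    """Recover the watermark from the DC arrays of the embedding pairs.

    pairs: list of (dc_a, dc_b) arrays, one tuple per embedding pair.
    Returns an array of n_bits recovered watermark bits (majority vote).
    """
    rng = np.random.default_rng(key)
    pn = rng.integers(0, 2, size=n_bits * cr, dtype=np.uint8)
    votes = np.zeros(n_bits, dtype=int)
    counts = np.zeros(n_bits, dtype=int)
    for dc_a, dc_b in pairs:
        k = 0
        for i in range(dc_a.size):
            if k >= n_bits * cr:
                break
            if abs(dc_a[i] - dc_b[i]) > t1:
                continue                       # this block was skipped at embed time
            e = 1 if dc_a[i] > dc_b[i] else 0  # read the embedded bit e_i
            s = e ^ pn[k]                      # demodulate back to the spread bit
            votes[k // cr] += s
            counts[k // cr] += 1
            k += 1
    return (votes * 2 > counts).astype(np.uint8)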
3. Experimental Results

The proposed watermarking method embeds the watermark in the DCT domain of a video. Its behavior is similar to that of the DEW algorithm proposed by Langelaar et al. [8][9]. We modified Langelaar's algorithm in order to apply it to raw videos. The original algorithm embeds watermarks in each I-frame of an MPEG compressed video, and a bit is embedded in each DCT-transformed 8x8 block. The modified version embeds the watermark in every frame of a video using the same idea, and the watermark is extracted from the I-frames. In this section, both the modified DEW and our proposed method are examined. The three videos used for testing the two watermarking schemes are shown in Fig. 3. Each test video consists of 70 frames, and the resolution of each frame is 352 by 240 pixels. The size of an embedding subset in the DEW is set to 16, and the cut-off frequency of the DEW in this experiment is 32. In the experiment with the proposed method, the percentage of embedding pairs among all pairs is 30%. Figures 4 and 5 illustrate the visual results produced by our proposed method and by the DEW.
Fig. 3. Test videos for the watermark method.
Fig. 4. Experimental results of our proposed method. (a) and (b) are the original embedding pair. (c) and (d) are the resulting embedding pair.
Fig. 5. Experimental results of DEW.
To test the robustness of these two methods against MPEG coding, the embedded videos are compressed with an MPEG-2 coder and the watermark is then extracted from the decompressed video. The length of the watermark signature is
1000 bits. After the watermark is extracted, it is compared to the original one and the bit error rate is measured. The results are listed in Table 1.

Table 1. Bit error rates

                         DEW     Proposed method
Weather (1 Mbps)         7.4%    0.9%
Weather (2 Mbps)         2.5%    0.4%
Table tennis (1 Mbps)    4%      3.6%
Table tennis (2 Mbps)    2%      2%
News (1 Mbps)            3%      3.2%
News (2 Mbps)            2.5%    1.3%
From the results shown above, visual artifacts occur in the edge components of a frame when using the DEW, while with the proposed method the intensity of some plain areas in an embedded frame changes slightly after embedding, because the DC components are used for embedding the watermark bits. These changes are, however, hard to notice while playing the video. When the modified DEW is used for embedding, the watermark bits no longer exist in the frames that are coded as B- or P-frames by an MPEG coder, so the watermark is extracted only from the decompressed I-frames, and the bit error rates are measured accordingly. From Table 1, the accuracy of the watermark bits using the proposed embedding scheme is better than that of the DEW. Although the proposed scheme performs worse on fast-moving videos, its accuracy and quality are still better than those of the DEW.
4. Summary

In this paper, a method for embedding a digital watermark into uncompressed video sequences is proposed. It uses the relationship between the DC components of corresponding blocks in two successive frames to hide watermark information. Although modifying DC components can be perceptually visible, it is more robust than using higher-frequency coefficients for embedding data. Since changes within a single frame of a video sequence may not be noticed, it is still quite suitable to embed information into these DC components. The experimental results show that the embedded watermark can be extracted even if the embedded video is compressed by MPEG.
References
1. F. Hartung and B. Girod, "Watermarking of Uncompressed and Compressed Video," Signal Processing, Vol. 66, No. 3, pp. 283-301, May 1998.
2. D. Kim and S. Park, "A Robust Video Watermarking Method," 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), Vol. 2, pp. 763-766, 2000.
3. C. Busch and W. Funk, "Digital Watermarking: From Concepts to Real-Time Video Applications," IEEE Computer Graphics and Applications, Vol. 19, Issue 1, pp. 25-35, Jan.-Feb. 1999.
4. T. Chung, M. Hung, Y. Oh, D. Shin and S.-H. Park, "Digital Watermarking for Copyright Protection of MPEG2 Compressed Video," IEEE Transactions on Consumer Electronics, Vol. 44, Issue 3, pp. 895-901, Jun. 1998.
5. C. Hsu and J. Wu, "DCT-based Watermarking for Video," IEEE Transactions on Consumer Electronics, Vol. 44, Issue 1, Feb. 1998.
6. F. Jordan, M. Kutter, and T. Ebrahimi, "Proposal of a Watermarking Technique for Hiding/Retrieving Data in Compressed and Uncompressed Video," ISO/IEC Doc. JTC1/SC29/WG11 MPEG97/M2281, July 1997.
7. J. Song and K. J. R. Liu, "A Data Embedding Scheme for H.263 Compatible Video Coding," Proceedings of the 1999 IEEE International Symposium on Circuits and Systems (ISCAS '99), Vol. 4, pp. 390-393, May-June 1999.
8. G. C. Langelaar, R. L. Lagendijk, and J. Biemond, "Real-time Labeling Methods for MPEG Compressed Video," 18th Symposium on Information Theory in the Benelux, Veldhoven, The Netherlands, May 1997.
9. G. C. Langelaar and R. L. Lagendijk, "Optimal Differential Energy Watermarking of DCT Encoded Images and Video," IEEE Transactions on Image Processing, Vol. 10, Issue 1, Jan. 2001.
Automatic Closed Caption Detection and Font Size Differentiation in MPEG Video*

Duan-Yu Chen, Ming-Ho Hsiao, and Suh-Yin Lee

Department of Computer Science and Information Engineering, National Chiao Tung University, 1001 Ta-Hsueh Rd., Hsinchu, Taiwan
{dychen, mhhsiao, sylee}@csie.nctu.edu.tw

* The research is partially supported by the Lee & MTI Center, National Chiao-Tung University, Taiwan, and the National Science Council, Taiwan.
Abstract. In this paper, a novel approach to automatic closed caption detection and font size differentiation among localized text regions in the I-frames of MPEG videos is proposed. The approach consists of five modules: video segmentation, shot selection, caption frame detection, caption localization and font size differentiation. Rather than examining scene cuts frame by frame, the video segmentation module first verifies video streams GOP by GOP and then finds the actual scene boundaries at the frame level. Tennis videos are selected as the case study, and the shot selection module is designed to automatically select a specific type of shot for further closed caption detection. Noise among potential captions is filtered out based on its long-term consistency over consecutive frames. Once the general closed captions are localized, the specific caption of interest is selected using the font size differentiation module. The detected closed captions can support video structuring, video browsing, high-level video indexing and video content description in MPEG-7. Experimental results show the effectiveness and feasibility of the proposed scheme.
1 Introduction

With the increasing amount of digital video in education, entertainment and other multimedia applications, there is an urgent demand for tools that give users an efficient way to acquire desired video data. Users therefore need a content-based mechanism to support efficient searching, browsing and retrieval. The need for content-based multimedia retrieval motivates research on feature extraction from text, image, audio and video information. Textual information, however, is more semantically meaningful and has attracted increasing research on text caption detection in video frames [1][3][6]. With video compression techniques maturing, many videos are stored in compressed form, and accordingly more and more research focuses on feature extraction from compressed videos, especially in MPEG format. Edge features are extracted directly from MPEG compressed videos to detect scene changes [5], and captions are processed and inserted into compressed video frames [7]. Features such as
chrominance, shape and frequency are directly extracted from MPEG videos to detect face regions [1][3]. In addition, the studies in [2][4] focus on closed caption detection in compressed videos, and large closed captions are their main concern. In general, however, the text captions appearing in shots of sports competitions are relatively very small, which makes it more difficult to detect and localize them. Therefore, in this paper we propose a novel approach to detect small text captions and also differentiate their font size for further applications. The approach consists of five components: GOP-based video segmentation, shot selection, caption frame detection, text caption localization and filtering, and font size differentiation. Our previous GOP-based video segmentation approach [8] is used to effectively segment the video. Furthermore, DCT DC-based shot selection is designed to identify specific shots. Caption frames are detected in the specific shots by computing the variation of the DCT AC energy in both the horizontal and vertical directions. In addition, we locate the text caption by the proposed method of weighted horizontal-vertical DCT AC coefficients and merge regions by morphological operations. To achieve a more robust text caption localization result, each candidate text caption is further verified by computing its long-term consistency, which is estimated over the backward shot, the forward shot and the shot itself. Once text captions are localized, we differentiate the font size of each text caption based on the variation of the DCT AC energy in the vertical direction. The rest of the paper is organized as follows. Section 2 presents an overview of the proposed scheme and Section 3 describes the GOP-based video segmentation. The proposed approach to text caption localization is illustrated in Section 4. Section 5 shows the experimental results, and the conclusion and future work are given in Section 6.
Fig. 1. Overview of the proposed scheme.
2 Overview of the Proposed Scheme

Fig. 1 shows the architecture of the proposed scheme. The test videos are compressed in MPEG-2 format and tennis is selected as the case study. First, video streams are segmented into shots by using our previous GOP-based
scene change detection approach [8]. To speed up video segmentation, the module first checks the video stream GOP by GOP, rather than examining scene cuts frame by frame, and then finds the actual scene boundaries at the frame level. The shots that contain the view of the tennis court are identified automatically from the variation of the DCT DC coefficients of I-frames and are selected for further caption detection. However, the closed captions, for example the scoreboard in a tennis-court clip, do not appear throughout the whole shot. Besides, the closed captions in tennis videos are generally very small. Therefore, we propose a mechanism to detect caption frames in the specific shots and also differentiate the font size for automatic text caption selection.
Fig. 2. GOP-based scene change detection.
3 Video Segmentation and Shot Selection

3.1 Scene Change Detection

Video data is segmented into clips that serve as logical units called "shots" or "scenes". Fig. 2 illustrates our proposed GOP-based scene change detection approach [8]. In the MPEG-2 format [9], the GOP layer is the random access point and contains a GOP header and a series of encoded pictures including I-, P- and B-frames. The size of a GOP is about 10 to 20 frames, which is less than the minimum duration between two consecutive scene changes (about 20 frames) [10]. In the approach, we first detect possible occurrences of scene changes GOP by GOP (inter-GOP). The difference between each consecutive GOP pair is computed by comparing the I-frames of the two GOPs. If the difference of the DC
coefficients between these two I-frames is larger than a threshold, there may be a scene change between the two GOPs; the GOP that contains the scene change frame is thus located. In the second step, intra-GOP scene change detection, we further use the ratio of forward and backward motion vectors to find the actual scene change frame within the GOP. The segmentation results obtained with this approach [8] are encouraging and show that the scene change detection is efficient for video segmentation.
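A minimal sketch of the inter-GOP step only, assuming the I-frame DC coefficients of each GOP are already extracted as arrays; the intra-GOP step based on the ratio of forward and backward motion vectors is not shown, and the function name is ours.

import numpy as np

def candidate_scene_change_gops(i_frame_dcs, threshold):
    """Step 1: flag GOP pairs whose I-frame DC images differ strongly.

    i_frame_dcs: list of DC-coefficient arrays, one per GOP (its I-frame).
    Returns indices g such that a scene change may lie between GOP g and g+1.
    """
    flagged = []
    for g in range(len(i_frame_dcs) - 1):
        diff = np.abs(i_frame_dcs[g + 1].astype(float) - i_frame_dcs[g]).sum()
        if diff > threshold:
            flagged.append(g)
    return flagged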
3.2 Shot Selection
Once the boundary of each shot is detected, the video sequence is segmented into shots consisting of advertisement clips, close-ups and tennis-court views. The tennis-court clips are our focus in the subsequent processing and analysis. Hence, a scene identification approach is proposed to recognize the clips of the tennis-court type. We observe that the variation of the intensity of a tennis-court frame is very small throughout the whole clip and that the intensity variance of consecutive frames is very similar. In contrast, the intensity of advertisements and close-ups varies significantly in each frame, and the difference of the intensity variance between two neighboring frames is relatively large. Therefore, the DC coefficients of each I-frame are extracted to represent the intensity values and are used to compute the intensity variance of the I-frames. In addition to the intensity variance of each I-frame, the variance of each shot is computed as the shot feature. The definitions of the frame variance and the shot variance are given in Eq. (1) and Eq. (2), where DC_{i,j} denotes the j-th block of the i-th frame and N represents the total number of blocks in a frame.
FVar^DC_{s,i} is the intensity variance of frame i in shot s, and the variance of shot s is expressed by SVar_s, where M is the total number of frames in shot s. The variation of the intensity variance of each I-frame in a video sequence from frame 0 to frame 1965 is exhibited in Fig. 3. In this video sequence there are four tennis-court clips (marked by the dotted ellipses), several close-up clips (marked by the dotted rectangles), and a final advertisement clip (marked by the dotted circle). From Fig. 3 we can see that the intensity variance of the tennis-court type is very small and very stable throughout the whole clip. Thus, a tennis-court clip can be identified and selected by the characteristic that its intensity variance is stable within each individual frame and stable over the whole shot.

FVar^DC_{s,i} = Σ_{j=1}^{N} DC_{i,j}^2 / N − ( Σ_{j=1}^{N} DC_{i,j} / N )^2    (1)

SVar_s = Σ_{i=1}^{M} (FVar^DC_{s,i})^2 / M − ( Σ_{i=1}^{M} FVar^DC_{s,i} / M )^2    (2)
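A direct rendering of Eqs. (1) and (2), assuming the DC coefficients of each I-frame are available as flat numpy arrays; a court-view shot is expected to yield small, stable values for both quantities. The thresholds in the usage comment are placeholders.

import numpy as np

def frame_dc_variance(dc_blocks):
    """FVar^DC: variance of the DC coefficients of one I-frame (Eq. 1)."""
    dc = np.asarray(dc_blocks, dtype=float)
    return (dc ** 2).mean() - dc.mean() ** 2

def shot_dc_variance(shot_iframes):
    """SVar: variance of the per-frame variances over one shot (Eq. 2)."""
    fvars = np.array([frame_dc_variance(f) for f in shot_iframes])
    return (fvars ** 2).mean() - fvars.mean() ** 2

# A shot is kept as a court view when both quantities stay small, e.g.:
# keep = all(frame_dc_variance(f) < frame_thr for f in shot) and \
#        shot_dc_variance(shot) < shot_thr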
Fig. 3. Variation of the I-frame DC value of a video sequence (frame 0 to frame 1965)
Fig. 4. The approach of text caption localization
4 Text Caption Localization

In this section, the scheme of text caption detection and font size differentiation is described. The diagram of caption localization is shown in Fig. 4. In general, text captions do not always appear in consecutive frames. Therefore, we propose a caption frame detection algorithm to detect frames that may contain captions. In the caption detection process, the DCT AC coefficients of I-frames in the MPEG-2 video
are extracted and used to compute the energy variation in the horizontal and vertical directions of each 8x8 block. Potential caption regions are indicated by the proposed weighted horizontal-vertical AC coefficients, and these regions are merged or removed by morphological operations. For more accurate text caption localization, the spatio-temporal relationship over consecutive frames is utilized: we compute the long-term consistency of each candidate caption region by referring to certain I-frames of the forward and backward shots. However, the localized text captions may contain the scoreboard, the logo of a channel or some billboard, and the scoreboard is what viewers are most interested in. Therefore, based on the observation that these different types of text captions differ in font size, we propose an algorithm to discriminate font size within the localized captions. The details of caption frame detection are described in Subsection 4.1, the approach to closed caption localization is shown in Subsection 4.2, and Subsection 4.3 presents the font size differentiation algorithm.
Fig. 5. Original frame divided into 6 sub-regions
4.1 Caption Frame Detection

Caption frame detection is a necessary step before text caption localization, because captions may disappear in some frames and then appear again subsequently. Therefore, we should first identify the frames in which captions might be present before detecting potential captions. In general, however, the font size of the scoreboard appearing in shots of sports competitions is very small. Hence, the variation of the AC energy of the entire frame cannot be used to measure the possibility of the presence of a caption that is relatively small in size. Therefore, each I-frame in the specific shots is divided into sub-regions, say 6, as shown in Fig. 5. The variation of the AC coefficients of each sub-region is measured by Eq. (3), where FVar^AC_{s,i} means the variance of the AC coefficients of the i-th frame in shot s, and AC_{h/v,j} are the horizontal AC coefficients from AC_{0,1} to AC_{0,7} and the vertical AC coefficients from AC_{1,0} to AC_{7,0}. The DCT AC coefficients used in the computation of the gradient energy are shown in Fig. 6.
FVar^AC_{s,i} = Σ Σ_{j=1}^{N} AC^2_{h/v,j} / N − ( Σ Σ_{j=1}^{N} AC_{h/v,j} / N )^2    (3)
The result of caption frame detection based on sub-regions is demonstrated in Fig. 7, and a threshold δ (3000) is predefined to decide whether a caption is present or not. We can see that the curve of the DCT AC variance of region 1 drops abruptly at the 18th I-frame and rises at the 39th I-frame, while the curves of regions 2 to 6 remain stable, since the scoreboard is absent from region 1 between the 18th and the 39th I-frame.

Fig. 6. DCT AC coefficients used in text caption detection (the DC term plus the first row AC_{0,1}..AC_{0,7} and the first column AC_{1,0}..AC_{7,0} of the 8x8 block)
Fig. 7. Demonstration of caption frame detection
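A minimal sketch of this sub-region test, assuming the first-row and first-column AC coefficients of every block have already been extracted from the bitstream; the threshold δ = 3000 follows the text and the helper names are ours.

import numpy as np

def subregion_ac_variance(ac_per_block):
    """Eq. (3): ac_per_block has shape (N, 14), one row per 8x8 block holding
    its first-row (AC_{0,1..7}) and first-column (AC_{1..7,0}) coefficients;
    N is the number of blocks in the sub-region."""
    ac = np.asarray(ac_per_block, dtype=float)
    n = ac.shape[0]
    return (ac ** 2).sum() / n - (ac.sum() / n) ** 2

def caption_subregions(subregions, delta=3000.0):
    """Return the indices of sub-regions whose AC variance exceeds delta,
    i.e. the sub-regions in which a (small) caption is likely present."""
    return [i for i, blocks in enumerate(subregions)
            if subregion_ac_variance(blocks) > delta]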
Fig. 8. Illustration of intermediate results for caption detection: (a) original frame; (b) closed caption detection; (c) filtering based on long-term consistency
4.2 Closed Caption Localization

Once the caption frames are detected, we can locate the potential caption regions by using the horizontal and vertical DCT AC coefficients individually to compute the gradient energy in the horizontal and vertical directions, respectively. We can observe that text captions generally appear in rectangular form and that the AC energy in the horizontal direction is larger than that in the vertical direction, since the distance between the letters of a word is fairly small while the distance between two rows of text is relatively large. Therefore, we assign more weight to the horizontal coefficients than to the vertical coefficients; the weight assignment is shown in Eq. (4). Here we select three I-frames (first, middle and last) of each shot for caption localization and set w_h to 0.7 and w_v to 0.3; the result of potential caption region detection is shown in Fig. 8(b).
E = (w_h H)^2 + (w_v V)^2,

H = Σ_{h1 ≤ h ≤ h2} AC_{0,h},  h1 = 1, h2 = 7,

V = Σ_{v1 ≤ v ≤ v2} AC_{v,0},  v1 = 1, v2 = 7.    (4)
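A sketch of Eq. (4) for a single 8x8 block of DCT coefficients, with w_h = 0.7 and w_v = 0.3 as in the text. Whether E additionally takes a square root is not clear from the printed formula, so the expression is kept exactly as shown; the function name is ours.

import numpy as np

def weighted_hv_energy(block, wh=0.7, wv=0.3):
    """Eq. (4): weighted horizontal/vertical AC energy of one 8x8 DCT block."""
    b = np.asarray(block, dtype=float)
    H = b[0, 1:8].sum()        # AC_{0,1} .. AC_{0,7}
    V = b[1:8, 0].sum()        # AC_{1,0} .. AC_{7,0}
    return (wh * H) ** 2 + (wv * V) ** 2

# Blocks whose energy E exceeds a threshold are marked as potential caption
# blocks and then cleaned up with the 1x5 morphological operator.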
In Fig. 8, the original frame is shown in Fig. 8(a), and in Fig. 8(b) we can see that although the scoreboard and the trademark in the upper part of the frame are both indicated, there are still some noisy regions. Therefore, we adopt a morphological operator of size 1x5 blocks to filter out some of the noise; afterwards the remaining caption regions are clustered and further verified by computing the long-term consistency. We select another two I-frames as consistency references: the last I-frame of the backward shot and the first I-frame of the forward shot. One possible measurement of the long-term coherence of potential regions is that if a potential caption region appears more than three times among the five I-frames, the region may be a real text caption. The result is demonstrated in Fig. 8(c).
Fig. 9. The sub-block B_sub-block is interpolated from its two neighboring 8x8 blocks B_t and B_b

4.3 Font Size Differentiation
From Fig. 8(c), we can see that the scoreboard in the upper left corner and the trademark in the upper right corner are both successfully detected. Since viewers are interested in the scores during the game, separating out the captions of the scoreboard is our concern. Hence, we propose an approach to automatically discriminate the font size as a support for scoreboard selection. To obtain the font size, we compute the gradient energy in the vertical direction instead of the horizontal direction, since the blank space between two text rows is generally larger than the blank space between two letters, and hence the variation of the gradient energy in the vertical direction presents a more regular pattern. In addition, to achieve a more accurate estimation of the periodicity, we compute the DCT coefficients of the 8x8 sub-block between two neighboring blocks; the sub-block is obtained by Eq. (5), and an example is shown in Fig. 9, where B_sub-block is the desired sub-block, B_t and B_b are the top and bottom 8x8 blocks neighboring B_sub-block, and I_w0 and I_w1 are identity matrices of dimension w0 x w0 and w1 x w1, respectively. Here we set both w0 and w1 to 8, which means that the sub-block B_sub-block is in the middle position between blocks B_t and B_b.
B_sub-block = [ 0  I_w0 ; 0  0 ] B_t + [ 0  0 ; I_w1  0 ] B_b    (5)
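The following is one pixel-domain reading of Eq. (5), in which the sub-block takes the bottom w0 rows of B_t and the top w1 rows of B_b; since the printed equation is garbled, the selector matrices and the choice w0 = w1 = 4 (which places the sub-block exactly midway between the two blocks) are assumptions.

import numpy as np

def middle_sub_block(bt, bb, w0=4, w1=4):
    """Build the sub-block that straddles two vertically adjacent 8x8 blocks,
    taking the bottom w0 rows of B_t and the top w1 rows of B_b."""
    upper = np.zeros((8, 8)); upper[:w0, 8 - w0:] = np.eye(w0)  # bottom w0 rows of B_t
    lower = np.zeros((8, 8)); lower[8 - w1:, :w1] = np.eye(w1)  # top w1 rows of B_b
    return upper @ bt + lower @ bb

bt = np.arange(64).reshape(8, 8)
bb = np.arange(64, 128).reshape(8, 8)
sub = middle_sub_block(bt, bb)   # rows 4-7 of bt followed by rows 0-3 of bb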
To discriminate the font size, we compute the periodicity and the variation of the AC energy in the vertical direction. The results of the font size analysis of the scoreboard and the trademark are demonstrated in Fig. 11 and Fig. 12, where T means the average distance and V represents the variance of T within each column. A local minimum of the AC energy curve is regarded as a low-textured region, i.e. the blank space between two text rows, and hence we compute the average distance T of the blank spaces by finding the intervals between local minima of the curve. Examples of the scoreboard and the trademark are shown in Fig. 10. We compute T for the first five block columns, because the first part of the localized scoreboard consists of five block columns and several non-text blocks separate the second part of the scoreboard. We select the part of the text region in which the height of the block columns is consistent for the font size computation; hence, in Fig. 10(b), all block columns of the trademark are selected. From Fig. 11 and Fig. 12, we can see that the average distance T of the scoreboard is about 2.2, which is smaller than the 2.9 of the trademark. Besides, the variance of the row distance of the blank spaces within each column is 0.05 for the scoreboard, which is also smaller than the 0.8 of the trademark. Hence, we can correctly discriminate the scoreboard, since its font size is smaller and more regular than that of the trademark.
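A minimal sketch of the T and V computation, assuming the per-column vertical AC energy curves of a localized caption are given; for simplicity the spacings are pooled over all columns before the variance is taken, which differs slightly from the per-column variance described above. The helper names are ours.

import numpy as np

def local_minima(curve):
    """Indices of strict local minima of a 1-D curve of AC energies."""
    c = np.asarray(curve, dtype=float)
    return [i for i in range(1, len(c) - 1) if c[i] < c[i - 1] and c[i] < c[i + 1]]

def font_size_statistics(columns):
    """Average spacing T between local minima and its variance V over the
    block columns of a localized caption (smaller, more regular spacing
    indicates the small-font scoreboard)."""
    spacings = []
    for col in columns:
        m = local_minima(col)
        spacings.extend(np.diff(m))
    spacings = np.asarray(spacings, dtype=float)
    if spacings.size == 0:
        return float("nan"), float("nan")
    return spacings.mean(), spacings.var()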
Fig. 10. Examples of the text captions: (a) scoreboard; (b) trademark
Fig. 11. Variation of AC energy of the scoreboard (T = 2.2, V = 0.05)
Fig. 12. Variation of AC energy of the trademark (T = 2.9, V = 0.8)
5 Experimental Results and Discussion

In the experiment, we use tennis videos recorded from the Star Sports TV channel and encode them in MPEG-2 format with the GOP structure IBBPBBPBBPBBPBB at 30 fps. The length of the test video is about 50 minutes, and 903 I-frames are caption frames. In total, 42183 blocks in the 903 caption frames contain text. The results of caption frame detection and text caption localization are evaluated in terms of precision and recall. The experimental result of caption frame detection is shown in Table 1: the precision and recall are both 100%, which shows that the proposed sub-region AC energy computation is effective for detecting small-font text captions. The result of text caption localization is shown in Table 2: 40635 text blocks are detected correctly, 347 blocks are falsely detected and 395 text blocks are missed. The precision is about 99% and the recall about 96%. The good performance lies in the techniques employed in the mechanism: the weighted horizontal-vertical AC coefficients, and the long-term consistency of the text caption over consecutive frames, which improves the accuracy of the detection results. Some text blocks are missed because the background of the text caption is transparent and changes with the scene while the camera moves; in this case, if the texture of the background is similar to the text caption, the letters of the caption do not produce a large variation in gradient energy and some text blocks are missed.

Table 1. Performance of caption frame detection

Ground truth of caption I-frames   Correctly detected frames   Falsely detected frames   Missed frames   Missed rate
903                                903                         0                         0               0%
Table 2. Performance of text caption localization after caption frame detection

Ground truth of text blocks   Correctly detected blocks   Falsely detected blocks   Missed blocks   Precision   Recall   Missed rate
42183 blocks                  40635                       347                       395             99%         96%      0.94%
6 Conclusion and Future Work
In this paper, we propose a novel approach to automatically select specific shots, detect caption frames, locate text captions and differentiate font size in MPEG compressed videos. The GOP-based video segmentation of our previous research is used to effectively segment the video into shots, and DCT DC-based shot selection is designed to identify specific shots. Caption frames are detected in the specific shots by computing the variation of the DCT AC energy in both the horizontal and vertical directions. Furthermore, we locate the text captions with the proposed weighted horizontal-vertical DCT AC coefficient scheme and region merging by morphological operations. To achieve a more accurate text caption localization result, we further verify each candidate text caption by computing its long-term consistency, estimated over the backward referencing shot, the forward referencing shot and the shot itself. Once text captions are localized, we differentiate the font size of each text caption based on the variation of the DCT AC energy in the vertical direction. In this way, we can automatically select the text captions that users care about, to support video browsing, video editing and video structuring. In the future, we will investigate video OCR to recognize the localized text captions, so as to support high-level feature extraction, semantic event detection and metadata generation for video content descriptions in MPEG-7.
References
1. H. Wang and S. F. Chang, "A Highly Efficient System for Automatic Face Region Detection in MPEG Video," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 4, Aug. 1997, pp. 615-628.
2. Y. Zhong, H. Zhang and A. K. Jain, "Automatic Caption Localization in Compressed Video," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 4, Apr. 2000, pp. 385-392.
3. H. Luo and A. Eleftheriadis, "On Face Detection in the Compressed Domain," Proc. of ACM Multimedia 2000, pp. 285-294.
4. Y. Zhang and T. S. Chua, "Detection of Text Captions in Compressed Domain Video," Proc. of ACM Multimedia Workshop, 2000, pp. 201-204.
5. S. W. Lee, Y. M. Kim and S. W. Choi, "Fast Scene Change Detection using Direct Feature Extraction from MPEG Compressed Videos," IEEE Transactions on Multimedia, Vol. 2, No. 4, Dec. 2000, pp. 240-254.
6. X. Chen and H. Zhang, "Text Area Detection from Video Frames," Proc. of 2nd IEEE Pacific Rim Conference on Multimedia, Oct. 2001, pp. 222-228.
7. J. Nang, O. Kwon and S. Hong, "Caption Processing for MPEG Video in MC-DCT Compressed Domain," Proc. of ACM Multimedia Workshop, 2000, pp. 211-214.
8. S. Y. Lee, J. L. Lian and D. Y. Chen, "Video Summary and Browsing Based on Story-Unit for Video-on-Demand Service," Proc. International Conference on ICICS, Oct. 2001.
9. J. L. Mitchell, W. B. Pennebaker, C. E. Fogg and D. J. LeGall, "MPEG Video Compression Standard," Chapman & Hall, NY, USA, 1997.
10. J. Meng, Y. Juan and S. F. Chang, "Scene Change Detection in a MPEG Compressed Video Sequence," Proc. IS&T/SPIE, Vol. 2419, 1995, pp. 14-25.
Motion Activity Based Shot Identification and Closed Caption Detection for Video Structuring*

Duan-Yu Chen, Shu-Jiuan Lin, and Suh-Yin Lee

Department of Computer Science and Information Engineering, National Chiao Tung University, 1001 Ta-Hsueh Rd., Hsinchu, Taiwan
{dychen, shujiuan, sylee}@csie.nctu.edu.tw

* The research is partially supported by the Lee & MTI Center, National Chiao-Tung University, Taiwan, and the National Science Council, Taiwan.
Abstract. In this paper, we propose a novel approach to generate a table of video content based on shot description by motion activity and closed captions in MPEG-2 video streams. Videos are segmented into shots by a GOP-based approach, and shot identification is used to classify the segmented shots. Specific shots of interest are selected, and the proposed closed caption detection approach is used to detect captions in these shots. To speed up scene change detection, instead of examining scene cuts frame by frame, the GOP-based approach first checks the video stream GOP by GOP and then finds the actual scene boundaries at the frame level. The segmented shots containing closed captions are identified by the proposed object-based motion activity descriptor. The SOM (Self-Organizing Map) algorithm is used to filter out noise in the caption localization process. Once captions are localized in the recognized shots, we create the table of video content based on the hierarchical structure of story units, consecutive shots and captioned frames. The experimental results show the effectiveness of the proposed approach and reveal the feasibility of the hierarchical structuring of video content.
1 Introduction
More and more video information in digital form is available around the world. The number of users and the amount of information are growing at a very rapid rate. Content-based indexing provides users with natural and friendly query, searching, browsing and retrieval. The need for content-based multimedia retrieval motivates research on feature extraction from the information contained in text, image, audio and video. However, textual information is the most semantically meaningful among the various types of information. Recently, increasing research has focused on feature extraction in the compressed video domain, especially in MPEG format, because many videos are already stored in compressed form due to the mature compression technology. Edge features are directly extracted from MPEG compressed videos to detect scene changes [5] and captions are processed and inserted into compressed video frames [7]. Features like chrominance and shape are directly extracted from MPEG videos to detect face
* The research is partially supported by the Lee & MTI Center, National Chiao-Tung University, Taiwan, and the National Science Council, Taiwan.
regions [1][3]. In the area of text caption detection, the detection of closed captions in large font size in compressed videos has been the research focus [2][4]. However, the font size of text captions appearing in shots of sports competitions is in general very small, which makes it more difficult to detect and localize the text captions. In addition, Lu and Tan [6] proposed a video structuring scheme, which classifies video shots by color features and global motion information. However, video shot classification based on object information would be more semantically meaningful. In order to support high-level semantic retrieval of video content, in this paper we propose a novel approach that structures videos utilizing closed captions and object-based motion activity descriptors. The mechanism consists of four components: GOP-based video segmentation, shot identification, closed caption detection and video structuring. The rest of the paper is organized as follows. Section 2 presents the overview of the proposed scheme. Section 3 shows the GOP-based scene change detection and Section 4 describes the motion activity based shot identification. The component of closed caption localization is introduced in Section 5. Section 6 illustrates the experimental results, and the conclusion and future work are given in Section 7.
Fig. 1. The architecture of motion activity based video structuring: MPEG-2 video streams are processed by GOP-based scene change detection, motion activity based scene identification, closed caption localization and SOM-based noise filtering to produce the table of video content.
2 Overview of the Proposed Scheme
Fig. 1 shows the mechanism of the proposed scheme. The testing videos are formatted in MPEG-2 and the sport of volleyball is selected as the case study. First, video streams are segmented into shots by using our previously proposed GOP-based scene change detection [8]. Instead of working frame by frame, this module of video segmentation checks video streams GOP by GOP and then finds the actual scene change boundaries at the frame level. The segmented shots are identified and described by the
MPEG-7 descriptor [7]. Thus the type of each shot of the volleyball videos can be recognized and encoded in the descriptor. The structure of volleyball videos consists of various types of shots, “service”, “full-court view” and “close-up” and the service shot is the leading shot in the volleyball competition. Thus, our focus is to recognize and select service shots to localize the closed caption to support video structuring. Therefore, the module of motion activity based shot identification is used to distinguish the types of shots and the specific shots of interest can be automatically recognized and selected for further analysis. Furthermore, the module of closed caption localization is designed based on SOM (Self-Organization Map) to localize the scoreboard, whose caption size is fairly small. The text in the localized closed caption of scoreboard is used to support video structuring. Finally, the table of video content is built by the key frames that contain the scoreboard and also by the semantic shots identified by the motion activity descriptor.
Fig. 2. GOP-based scene change detection. Step 1 (inter-GOP scene change detection) calculates the difference between each consecutive GOP pair and checks whether it exceeds a threshold; if so, Step 2 (intra-GOP scene change detection) finds the actual scene change frame within the GOP.
3 Scene Change Detection
Video data is segmented into meaningful clips that serve as logical units called "shots" or "scenes". Fig. 2 illustrates our proposed GOP-based scene change detection approach [8]. In the MPEG-2 format [9], the GOP layer is a random access point and contains a GOP header and a series of encoded pictures including I-, P- and B-frames. The size of a GOP is about 10 to 20 frames, which is usually less than the minimum duration between two consecutive scene changes (about 20 frames) [10]. We first detect possible occurrences of scene changes GOP by GOP (inter-GOP). The difference between each consecutive GOP pair is computed by comparing the I-
frames in each consecutive GOP pair. If the difference of the DC coefficients between these two I-frames is larger than the threshold, then there may be a scene change between these two GOPs. Hence, the GOP that contains the scene change frame is located. In the second step – intra-GOP scene change detection – we further use the ratio of forward and backward motion vectors to locate the actual frame of the scene change within a GOP. The experimental results on real long videos in [8] are encouraging and show that the scene change detection is efficient for video segmentation.
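As a rough illustration of the two-step procedure, the following Python sketch assumes the I-frame DC coefficients and per-frame motion-vector counts have already been extracted from the bitstream; the data layout, helper names and thresholds are our own assumptions, not the authors' implementation.

```python
# Minimal sketch of two-step GOP-based scene change detection.
# Each GOP is assumed to be a dict with an "i_frame_dc" array and a list of
# "frames" carrying forward/backward motion-vector counts; illustrative only.

def inter_gop_candidates(gops, dc_threshold):
    """Step 1: flag GOP pairs whose I-frame DC images differ strongly."""
    candidates = []
    for k in range(len(gops) - 1):
        dc_a, dc_b = gops[k]["i_frame_dc"], gops[k + 1]["i_frame_dc"]
        diff = sum(abs(a - b) for a, b in zip(dc_a, dc_b))
        if diff > dc_threshold:
            candidates.append(k + 1)          # scene change may lie in GOP k+1
    return candidates

def intra_gop_change_frame(gop, ratio_threshold=2.0):
    """Step 2: locate the frame where backward vectors dominate forward ones."""
    for idx, frame in enumerate(gop["frames"]):
        fwd = max(frame.get("forward_mv", 0), 1)   # avoid division by zero
        bwd = frame.get("backward_mv", 0)
        if bwd / fwd > ratio_threshold:
            return idx                             # candidate scene change frame
    return None

def detect_scene_changes(gops, dc_threshold=1000):
    cuts = []
    for g in inter_gop_candidates(gops, dc_threshold):
        frame_idx = intra_gop_change_frame(gops[g])
        if frame_idx is not None:
            cuts.append((g, frame_idx))
    return cuts
```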
4 Shot Identification
In this section, the approach of shot identification based on object motion activity is introduced. The method for detecting significant moving objects is illustrated in Subsection 4.1, and Subsection 4.2 shows the motion activity descriptor. Shot identification based on the descriptor is presented in Subsection 4.3.
4.1 Moving Object Detection
For computation efficiency, only the motion vectors of P-frames are used for object detection, since in a video at 30 fps consecutive P-frames separated by two or three B-frames are in general still similar and do not vary too much. Therefore, it is sufficient to use the motion information of P-frames only to detect moving objects. However, the motion vectors of P-frames or B-frames obtained via motion estimation in MPEG-2 may not exactly represent the actual motion in a frame. For a macroblock, a good match is found among its neighbors in the reference frame, but this does not mean that the macroblock matches exactly the correct position in its reference frame. Hence, in order to achieve more robust analysis, it is necessary to eliminate noisy motion vectors before the process of motion vector clustering. Motion vectors of relatively small or approximately zero magnitude are recognized as noise and hence are not taken into account; motion vectors with larger magnitude are more reliable. For low computational complexity, the average magnitude of the motion vectors of inter-coded macroblocks is computed and selected as the threshold to filter out motion vectors of smaller magnitude, i.e. noise. After noisy motion vectors are filtered out, motion vectors of similar magnitude and direction are clustered into the same group (the object) by applying a region growing approach. The details can be found in [7].
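The following Python sketch illustrates this noise filtering and region-growing grouping on a grid of macroblock motion vectors; the grid layout, tolerances and helper names are our own assumptions for illustration, not the method of [7].

```python
import math

# Sketch of motion-vector noise filtering and region growing over a
# macroblock grid; mv_grid[r][c] is an (mvx, mvy) pair or None for
# intra-coded blocks. Layout and tolerances are illustrative assumptions.

def magnitude(mv):
    return math.hypot(mv[0], mv[1])

def filter_noise(mv_grid):
    mags = [magnitude(mv) for row in mv_grid for mv in row if mv is not None]
    avg = sum(mags) / max(len(mags), 1)        # average magnitude as threshold
    return [[mv if mv is not None and magnitude(mv) > avg else None
             for mv in row] for row in mv_grid]

def grow_objects(mv_grid, mag_tol=4.0, ang_tol=0.5):
    """Group neighboring vectors of similar magnitude and direction."""
    rows, cols = len(mv_grid), len(mv_grid[0])
    labels = [[None] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mv_grid[r][c] is None or labels[r][c] is not None:
                continue
            stack, members = [(r, c)], []
            labels[r][c] = len(objects)
            while stack:
                y, x = stack.pop()
                members.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols and \
                       mv_grid[ny][nx] is not None and labels[ny][nx] is None:
                        a, b = mv_grid[y][x], mv_grid[ny][nx]
                        if abs(magnitude(a) - magnitude(b)) < mag_tol and \
                           abs(math.atan2(*a) - math.atan2(*b)) < ang_tol:
                            labels[ny][nx] = len(objects)
                            stack.append((ny, nx))
            objects.append(members)
    return objects
```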
4.2 Motion Activity Descriptor – 2D Histogram
2D-histogram is computed for each P-frame. The horizontal axis of the X-histogram (Y-histogram) is the quantized X-coordinate (Y-coordinate) in a P-frame. In the experiments, the X- and Y-coordinates are quantized into a and b bins according to the aspect ratio of the frame. The workflow of 2D-histogram generation is shown in Fig. 3. Initially, the object size is estimated before bin assignment. If the object size is
larger than the predefined unit size (frame size/(a·b)), the object is weighted and accumulated by Eq. (1). Bin^x_{i,j} denotes the j-th bin of the X-histogram in frame i, Acc^x_{i,j,α} denotes the accumulated value of object α in frame i for the X-histogram, and Obj is the number of objects in frame i.

Bin^x_{i,j} = Σ_{α=1}^{Obj} Acc^x_{i,j,α}
Acc^x_{i,j,α} = 1, if object size ≤ frame size/(a·b);
Acc^x_{i,j,α} = (size of object α)·(a·b)/frame size, otherwise.   (1)
By utilizing the statistics of the 2D-histogram, the spatial distribution of moving objects in each P-frame is characterized. In addition, spatial relationships within the moving objects are also approximately shown in the X-Y histogram pair, since each moving object is assigned to a histogram bin according to the X-Y coordinate of its center position. Objects belonging to the same coordinate interval are grouped into the same bins, and hence the distance between two object groups can be represented by the differences between the associated bins.
Fig. 3. The workflow of motion activity descriptor: for each moving object, the object size is compared with a predefined unit; objects are weighted accordingly and accumulated into the corresponding histogram bins, yielding the 2D-histogram.
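A small Python sketch of the weighted X-histogram accumulation of Eq. (1) follows; object sizes are in pixels and the object/frame data layout is our own assumption.

```python
# Sketch of the weighted X-histogram of Eq. (1). Each object is a dict with
# its center x-coordinate and size in pixels; a is the number of X bins.
# The data layout is an assumption for illustration.

def x_histogram(objects, frame_w, frame_h, a=15, b=15):
    frame_size = frame_w * frame_h
    unit_size = frame_size / (a * b)
    bins = [0.0] * a
    for obj in objects:
        j = min(int(obj["cx"] / frame_w * a), a - 1)    # bin of object center
        if obj["size"] <= unit_size:
            acc = 1.0                                    # small object counts once
        else:
            acc = obj["size"] * a * b / frame_size       # weight by relative size
        bins[j] += acc
    return bins

# Example: two small objects on the left, one large object on the right
objs = [{"cx": 40, "size": 500}, {"cx": 80, "size": 700}, {"cx": 600, "size": 9000}]
print(x_histogram(objs, frame_w=704, frame_h=480))
```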
4.3 Shot Identification Algorithm
The concept of the shot identification algorithm is shown in Fig. 4. We can see that the characteristic of the service shots is that one or few objects appear in the left or the right side of the frame and more objects appear in the other side of the frame. In the shot of full-court view, generally the number of objects of the left part and the number of objects of the right part are balanced and the difference of the number of objects
Motion Activity Based Shot Identification
293
between them is relatively smaller than that of a service shot. In the closed-up shots, there is a large object near the middle position of the frame. Therefore, based on this concept, we can distinguish these major shot types of volleyball videos. In the algorithm, we use only the X-histogram as the descriptor of each shot. The details of the algorithm are described as follows.
Fig. 4. Key frames of shots: (a) Service, (b) Full-court view, (c) Closed-up.
Shot Identification Algorithm
Input: Segmented shots {Shot_1, Shot_2, ..., Shot_s}
Output: Shot types {ST_1, ST_2, ..., ST_s}, where the type of shot i is ST_i ∈ {S, F, C} (S: Service, F: Full-court view, C: Closed-up)

1. The X-coordinate is divided into a = 15 bins.
2. Motion activity descriptor generation for each shot. If the size of Shot_s is greater than 12 P-frames, then generate two descriptors MD_1st and MD_2nd; else generate one descriptor MD_one:
   MD_1st = (2/|Shot_s|) Σ_{i=1}^{Mid} MD_i,   MD_2nd = (2/|Shot_s|) Σ_{i=Mid+1}^{|Shot_s|} MD_i,   Mid = |Shot_s|/2,
   MD_one = (1/|Shot_s|) Σ_{i=1}^{|Shot_s|} MD_i,
   where MD_i is the feature vector (Bin^x_{i,1}, Bin^x_{i,2}, ..., Bin^x_{i,j}) of Subsection 4.2. MD_1st is the first descriptor and MD_2nd is the second descriptor of Shot_s.
3. Compute the maximum bin value (MBV) and its corresponding bin (MB). Find the bins (LB) whose values are greater than the defined threshold Γ:
   If 0 < MBV < 3, then Γ = MBV;
   Else if 3 ≤ MBV < 5, then Γ = MBV − 1;
   Else if 5 ≤ MBV < 10, then Γ = 5;
   Else if 10 ≤ MBV < 20, then Γ = 10;
   Else if 20 ≤ MBV < 25, then Γ = MBV − 10;
   Else if 25 ≤ MBV < 30, then Γ = MBV − 15;
   Else if 30 ≤ MBV < 40, then Γ = MBV − 17;
   Else if 40 ≤ MBV < 50, then Γ = MBV − 20;
   Else (50 ≤ MBV), Γ = MBV − 25.
4. Prescription: if the number of bins LB in a shot is greater than half the number of bins (HNB), then the shot may belong to type F or S; otherwise, the shot may belong to type C.
   Left bins of the descriptor: Bin_0 to Bin_6; medium bin: Bin_7; right bins of the descriptor: Bin_8 to Bin_14.
   LBL: LB ∈ [Bin_0, Bin_6]; LBR: LB ∈ [Bin_8, Bin_14];
   MBVL: MBV of LBL; MBVR: MBV of LBR;
   MBL: MB ∈ [Bin_0, Bin_6]; MBR: MB ∈ [Bin_8, Bin_14].
5. If there is one descriptor only:
   Case 1 (number of LB ≥ HNB): if MBR − MBL < HNB, then shot ∈ type F; else shot ∈ type C.
   Case 2 (number of LB < HNB): if the number of LBR or the number of LBL equals 0, then shot ∈ type C; else shot ∈ type S.
6. If there are two descriptors:
   If the MBs of MD_1st and MD_2nd are near the medium bin (i.e. each MB ∈ [Bin_6, Bin_8]):
     Case 1: if the numbers of LB of MD_1st and of MD_2nd are both smaller than HNB, then shot ∈ type C.
     Case 2: if the number of LB of MD_1st or of MD_2nd is greater than HNB, then shot ∈ type F; else shot ∈ type S.
   Else:
     Case 1: if the numbers of LB of MD_1st and of MD_2nd are both smaller than HNB: if LBL_1st + LBR_1st ≥ 3 or LBL_2nd + LBR_2nd ≥ 3, then shot ∈ type S; else shot ∈ type C.
     Case 2: if the number of LB of MD_1st or of MD_2nd is greater than HNB: if one of the two MBs is close to the medium bin, then shot ∈ type C; else shot ∈ type S.
7. Generate the type of each shot.
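As a rough illustration of steps 3-5, here is a Python sketch of the threshold Γ lookup and the single-descriptor classification; the helper names and the way the strongest left/right bins are picked are our own reading, and the two-descriptor case is omitted.

```python
# Rough sketch of steps 3-5 for the single-descriptor case of the shot
# identification algorithm; names are illustrative and the two-descriptor
# case is omitted.

def gamma_threshold(mbv):
    """Threshold Γ as a function of the maximum bin value (step 3)."""
    if mbv < 3:
        return mbv
    if mbv < 5:
        return mbv - 1
    if mbv < 10:
        return 5
    if mbv < 20:
        return 10
    if mbv < 25:
        return mbv - 10
    if mbv < 30:
        return mbv - 15
    if mbv < 40:
        return mbv - 17
    if mbv < 50:
        return mbv - 20
    return mbv - 25

def classify_one_descriptor(md, a=15):
    """md is the averaged X-histogram (a bins); returns 'S', 'F' or 'C'."""
    hnb = a // 2                                     # half the number of bins
    mbv = max(md)
    gamma = gamma_threshold(mbv)
    lb = [j for j, v in enumerate(md) if v > gamma]  # bins above Γ
    lbl = [j for j in lb if j <= 6]                  # left bins Bin0..Bin6
    lbr = [j for j in lb if j >= 8]                  # right bins Bin8..Bin14
    if len(lb) >= hnb:                               # case 1
        mbl = max(range(0, 7), key=lambda j: md[j])  # strongest left bin
        mbr = max(range(8, a), key=lambda j: md[j])  # strongest right bin
        return "F" if mbr - mbl < hnb else "C"
    return "C" if not lbl or not lbr else "S"        # case 2
```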
5 Closed Caption Localization
Fig. 5 shows the proposed scheme of closed caption localization in frames. First, we compute the horizontal gradient energy from the DCT AC coefficients to filter out some noise. The next step is to remove some noisy regions by a morphological operation. Once the candidate caption regions are detected, we utilize the SOM-based algorithm to filter out non-caption regions. The details of the closed caption detection are described in Subsection 5.1 and the algorithm of SOM-based filtering is given in Subsection 5.2.
Fig. 5. The approach of closed caption localization in frames: horizontal gradient energy filtering, followed by a morphological operation and SOM-based candidate caption filtering.
5.1 Closed Caption Detection
Once service shots are identified, we apply the proposed closed caption detection to further localize the closed captions in these shots, such as the scoreboard and the channel trademark. We use the DCT AC coefficients shown in Fig. 6 to compute the horizontal and vertical gradient energy. The horizontal AC coefficients from AC_{0,1} to AC_{0,7} are used to compute the horizontal gradient energy by Eq. (2). The horizontal gradient energy E_h of each 8x8 block is the first filter for noise elimination: if E_h of a block is greater than a predefined threshold, the block is regarded as a potential caption block; otherwise, if E_h of a block is smaller than the threshold, the block is removed.

E_h = Σ_{j=1}^{7} AC_{0,j}   (2)
However, different shots may have different lighting conditions, which is reflected in the contrast of the frames, even over the whole shot. Different contrast in turn affects the choice of the threshold, and closed caption detection might fail for this reason. Therefore, we adopt an adaptive threshold decision to overcome this problem. The threshold T is computed by Eq. (3), where γ is an adjustable factor, SVar_s represents the average horizontal gradient energy of shot s, FVar^{AC}_{s,i} means the horizontal gradient energy of frame i in shot s, and AC_h denotes a horizontal DCT AC coefficient from AC_{0,1} to AC_{0,7}. A higher value of FVar^{AC}_{s,i} means a higher contrast of frame i, and we can thus remove noisy regions more easily in a frame of higher gradient energy. Therefore, we set a lower weight for frames with higher contrast and a higher weight for frames with lower contrast. In this way, we can remove most of the noisy regions; an example is demonstrated in Fig. 7(b).

T = γ × SVar_s,  where γ = 3.2 if FVar^{AC}_{s,i} < SVar_s, and γ = 2.4 if FVar^{AC}_{s,i} ≥ SVar_s   (3)

SVar_s = (1/M) Σ_{i=1}^{M} FVar^{AC}_{s,i}

FVar^{AC}_{s,i} = ΣΣ_{j=1}^{N} AC_{h,j}^2 / N − ( ΣΣ_{j=1}^{N} AC_{h,j} / N )^2   (the outer sum runs over the blocks of frame i)
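A minimal Python sketch of the block energy of Eq. (2) and the adaptive threshold of Eq. (3) follows; the input layout (a list of frames, each a list of blocks holding the first-row AC coefficients) and the variance computation are our own assumptions.

```python
# Sketch of Eq. (2) and Eq. (3): horizontal gradient energy per 8x8 block and
# the adaptive threshold per frame. frame_blocks[b] is assumed to hold the
# coefficients AC_{0,1}..AC_{0,7} of block b; illustrative only.

def block_energy(ac_row):
    return sum(abs(c) for c in ac_row)            # E_h of one block, Eq. (2)

def frame_variance(blocks):
    vals = [c for blk in blocks for c in blk]     # horizontal AC coefficients
    n = len(vals) or 1
    mean = sum(vals) / n
    return sum(v * v for v in vals) / n - mean * mean    # FVar-like variance

def adaptive_threshold(blocks, shot_avg):
    fvar = frame_variance(blocks)
    gamma = 3.2 if fvar < shot_avg else 2.4       # Eq. (3)
    return gamma * shot_avg

def candidate_blocks(frame_blocks, shot_frames):
    shot_avg = sum(frame_variance(f) for f in shot_frames) / len(shot_frames)
    t = adaptive_threshold(frame_blocks, shot_avg)
    return [i for i, blk in enumerate(frame_blocks) if block_energy(blk) > t]
```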
Fig. 6. DCT AC coefficients used in text caption detection: the DC term, the first-row coefficients AC_{0,1}–AC_{0,7} (horizontal) and the first-column coefficients AC_{1,0}–AC_{7,0} (vertical).
After eliminating most of the noisy regions, there still have many small separated regions in which they are either very close or faraway. Some regions are supposed to be connected, like the scoreboard and the channel trademark. Hence, we need to perform the task of regions merging and remove some isolated ones. Therefore, a morphological operator 1x3 blocks is used to merge the regions that the distance in between is smaller than 3 blocks and furthermore the regions of size smaller than 3 blocks are eliminated. The result of applying morphological operation is shown in Fig. 7(c) and we can see that many small and isolated regions are filtered out and the caption regions are merged together. However, some background regions that have large horizontal gradient energy are still present after morphological operation. Hence, we propose an algorithm that is based on the concept of SOM (SelfOrganization Map) [11] to further differentiate the foreground captions and background high textured regions. 5.2 SOM-Based Noise Filtering SOM-Based Noise Filtering Algorithm Input: Candidate regions after morphological operation Ψ = { R1 , R2 , … , Rn }
Motion Activity Based Shot Identification
297
Output: Closed caption regions 1. 2.
Initially, set threshold T = 70 and cluster number j=0. For each candidate region Ri , compute the average horizontal-vertical gradient energy
Ei that is weighted by wh and wv . Here we set wh to 0.6 and wv to
0.4. n is the number of regions in Ψ. 7 7 1 n wh ∑ AC0,u + wv ∑ AC v ,0 ∑ n j =1 u =1 v =1 3. For each region Ri ∈ Ψ
Ei =
If i = 1, j=j+1, assign
Ri to cluster C j
Else if there is a cluster C such that where k ∈ [1,j] and assign
Dk =
Dk
T and
Dk is minimal among { Dk },
Dk is defined in Eq. (5)
Ri to C Ck Ck 2 ∑ ∑ Ei − E j C k ( C k − 1) i =1 j =i +1
Else j=j+1, create a new cluster 4.
(4)
Set T = T – 11 Select the cluster
(5)
C j and assign Ri to C j
Ck (say Chigh ) that has the largest average gradient energy
E avg ,k computed by Eq. (6)
E avg ,k 5.
1 = Ck
Ck
∑E i =1
(6)
i
If the gradient energy
E avg ,k of Chigh is greater than T, then reset Ψ = Chigh .
Go to step 3. Else 6.
Go to step 6. The cluster Chigh is the set of closed captions.
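A compact Python sketch of this iterative clustering follows; it assumes the per-region energies E_i have already been computed and uses simple lists for clusters. It is an illustrative reading of the algorithm, not the authors' implementation, and a small guard is added to ensure termination.

```python
# Sketch of the SOM-style noise filtering: cluster regions by their gradient
# energies, lower the threshold each pass, and keep the highest-energy
# cluster. Assumes `energies` maps region id -> E_i from Eq. (4).

def pair_distance(cluster, energies):
    """Average pairwise energy difference within a cluster (Eq. (5))."""
    es = [energies[r] for r in cluster]
    if len(es) < 2:
        return 0.0
    total = sum(abs(a - b) for i, a in enumerate(es) for b in es[i + 1:])
    return 2.0 * total / (len(es) * (len(es) - 1))

def filter_captions(energies, t=70.0, t_step=11.0, max_iter=10):
    regions = list(energies)
    for _ in range(max_iter):                  # guard added to ensure termination
        clusters = []
        for r in regions:
            best = None
            for c in clusters:
                d = pair_distance(c + [r], energies)
                if d <= t and (best is None or d < best[1]):
                    best = (c, d)
            if best is not None:
                best[0].append(r)
            else:
                clusters.append([r])
        t -= t_step
        avg = lambda c: sum(energies[r] for r in c) / len(c)   # Eq. (6)
        high = max(clusters, key=avg)
        if avg(high) <= t or set(high) == set(regions):
            return high                        # highest-energy cluster = captions
        regions = high                         # otherwise iterate on it
    return regions
```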
In the algorithm, we give more weight to the horizontal DCT AC coefficients because closed captions generally appear in rectangular form, and the AC energy in the horizontal direction is larger than that in the vertical direction, since the letters of each word are fairly close together while the distance between two rows of text is relatively large. Furthermore, the SOM-based candidate region clustering is iterated until the gradient energy E_avg,k of the cluster C_high is smaller than the threshold T. Based on the experiments, T is initially set to 70 and is decreased to T − 11 in step 4. By this method, we can automatically find the set of closed captions; the method relies on the fact that closed captions are foreground elements
added after filming. Therefore, the added closed captions are clearer than the background and have larger gradient energy. After the step of SOM-based noise filtering, each closed caption region is dilated by one block row. The result is shown in Fig. 7(e), where we can see that regions belonging to the same closed caption are merged.
Fig. 7. Demonstration of the closed caption localization (a) Original I-frame (b) Result after filtering by horizontal gradient energy (c) Result after morphological operation (d) Result after filtering by SOM-based algorithm (e) Result after dilation.
6 Experimental Results and Analysis
In the experiment, we recorded volleyball videos from the TV channel VL Sports and encoded them in the MPEG-2 format, in which the GOP structure is IBBPBBPBBPBBPBB and the frame rate is 30 fps. The length of the video is about one hour and we obtain 163 shots of service, full-court view competition and closed-up. To measure the performance of the proposed scheme, we evaluate precision and recall for the approach of shot identification and the algorithm of closed caption detection. Table 1 shows the experimental result of the shot identification; we can see that the precision for all three kinds of shots is at least 92%. Moreover, the recall value
of the closed-up type is up to 98%. The recall value of the full-court type is only 87%, because the camera zooms in to capture scenes where players spike near the net; in this case the scene consists of a large portion of the net and is regarded as a closed-up shot. Although the recall value of the full-court shots is not higher than 90%, the overall accuracy of shot identification is still very good.

Table 1. Result of shot identification

Shot type    Ground Truth   Detections   Correct   False   Missed   Precision   Recall
Closed-up    58             62           57        5       1        92%         98%
Service      53             52           49        3       4        94%         92%
Full Court   52             49           45        4       7        92%         87%
In Table 2, the result of closed caption localization is presented. There are 98 closed captions containing the scoreboard and the trademark in the testing video, and 107 potential captions are detected, of which 98 localized regions are real closed captions. We can see that the recall value is 100% and the precision is about 92%. The number of false detections is 9; this is because the background may contain advertisement regions whose gradient energy is relatively high compared with the scoreboard and the channel trademark. In that case, such a region is assigned to the same cluster as the closed captions since its gradient energy is very similar to the energy of the scoreboard.

Table 2. Result of closed caption localization

Ground Truth   Detections   Correct   Precision   Recall
98             107          98        91.59%      100%
The system interface for video structuring is shown in Fig. 8 and Fig. 9. Fig. 8 shows the user interface and Fig. 9 presents all shots of full-court view when the user clicks the option “show all shots” in the “F shot” field. The system interface is composed based on the video structure, which is organized in the temporal order of the key frames of closed caption, service, full-court view and closed-up. In addition, the caption frame is the leading frame, so that the scoreboard is shown to users first. In this way, users can browse the video content efficiently by looking up the score of the competition and selecting the shots that they are interested in.
7 Conclusion and Future Work
In this paper, we propose a novel approach to generate the table of video content based on shot description by motion activity and closed captions in MPEG-2 compressed video. We adopt the approach of GOP-based video segmentation to effectively segment videos into shots, and these shots are described and identified by the object-based motion activity descriptor. Experimental results show that the proposed scheme performs well in recognizing several kinds of shots of volleyball videos. In addition, we design an algorithm to localize the closed captions in the identified specific shots and to effectively filter out non-caption regions. Through the user interface, based on the table of content created from closed captions and semantically meaningful shots, we support users in browsing videos more clearly and efficiently.
Fig. 8. Video structure of caption frames, shots of service, full-court view, and closed-up
Fig. 9. The interface shows other shots of full-court view in the bottom area
In the future, we will investigate video OCR to recognize the localized closed captions to support automatic metadata generation, such as the name of a team in sports videos, the name of the leading character in movies, or an important person in other kinds of videos. In addition, we can also support semantic event detection and description for automatic descriptor and description scheme production in MPEG-7.
References
1. H. Wang and S. F. Chang, “A Highly Efficient System for Automatic Face Region Detection in MPEG Video,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 4, Aug. 1997.
2. Y. Zhong, H. Zhang and A. K. Jain, “Automatic Caption Localization in Compressed Video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 4, Apr. 2000.
3. H. Luo and A. Eleftheriadis, “On Face Detection in the Compressed Domain,” Proc. ACM Multimedia 2000, pp. 285-294, 2000.
4. Y. Zhang and T. S. Chua, “Detection of Text Captions in Compressed Domain Video,” Proc. ACM Multimedia Workshop, CA, USA, pp. 201-204, 2000.
5. S. W. Lee, Y. M. Kim and S. W. Choi, “Fast Scene Change Detection using Direct Feature Extraction from MPEG Compressed Videos,” IEEE Transactions on Multimedia, Vol. 2, No. 4, Dec. 2000.
6. H. Lu and Y. P. Tan, “Sports Video Analysis and Structuring,” Proc. IEEE 4th Workshop on Multimedia Signal Processing, pp. 45-50, 2001.
7. D. Y. Chen and S. Y. Lee, “Object-Based Motion Activity Description in MPEG-7 for MPEG Compressed Video,” Proc. of the 5th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2001), Vol. 6, pp. 252-255, July 2001.
8. S. Y. Lee, J. L. Lian and D. Y. Chen, “Video Summary and Browsing Based on Story-Unit for Video-on-Demand Service,” Proc. International Conference on ICICS, Oct. 2001.
9. J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall, MPEG Video Compression Standard, Chapman & Hall, NY, USA, 1997.
10. J. Meng, Y. Juan and S. F. Chang, “Scene Change Detection in a MPEG Compressed Video Sequence,” Proc. IS&T/SPIE, Vol. 2419, pp. 14-25, 1995.
11. T. Kohonen, “The Self-Organizing Map,” Neurocomputing, Vol. 21, pp. 1-6, 1998.
Visualizing the Construction of Generic Bills of Material Peter Y. Wu, Kai A. Olsen, and Per Saetre University of Pittsburgh Molde, Norway
Abstract. Assemble-to-order production industries today must be customer-oriented, and offer a large variety of product options to compete in the global market place. A Generic Bill of Material (GBOM) is designed to describe the component structures for a family of products in one data model. The specific Bill of Material for any particular product variant can then be generated on demand. Several different approaches to the GBOM model have been demonstrated to be reasonably effective, but the construction and verification of GBOMs remain difficult. This paper formulates a set of principle requirements for the GBOM model to support visualization and manipulation, aimed at ease of composition and editing of GBOM models. Based on this framework, the different approaches to the GBOM model are reviewed. A new GBOM model is presented, and the GBOM system briefly introduced. The visual environment for the construction of GBOM models is discussed.
1. Introduction
In the past decade, the challenge of global competition has forced the production industry in discrete assembly to become much more customer-oriented. Product variety has become one of the keys to gaining market share: automobile and other assemble-to-order industries face the need to manage myriads of product variants [1][2]. To manage the myriads of product variants and constantly changing product options, there have been several research efforts to define a generic Bill of Material (GBOM), for example [3][4][5][7]. The GBOM is a data model designed to describe the component structures for a family of products, and to generate the specific Bill of Material for any particular product variant on demand. Quite a few GBOM models were developed with a focus on solving problems in various aspects of production management. Some focused on materials planning [3], others on flexibility and ease of specification for the customer [4]. Yet the construction of a complete GBOM model remains relatively difficult. The need for easy construction of the structure becomes more pronounced when today’s market place demands even faster response to changes in customer preferences. Good visualization tools for GBOM construction and verification are simply rare, if not non-existent. We studied various GBOM models disseminated in the literature, and report in this paper a set of principle requirements for models to support ease of construction and verification. We will present these requirements in the next section. Based on the framework of these requirements, we review the various GBOM models. These principle requirements are also the key to facilitating visualization and direct manipulation of GBOM models. We then present the design of our GBOM model, and briefly sketch the graphical editor used in our GBOM system.
2. The Principle Requirements
In this section, we first present a set of principle requirements concerning the GBOM model as the idealized goals for a GBOM system. These principle requirements set a framework to review the related GBOM approaches disseminated in the literature, and distinguish the model put into use for visualization in our system. The principle requirements are the following:
(1) Genericity: The GBOM model must be able to describe similar components as variants of a generic component. That is, the GBOM structure should emphasize the commonality of the components, and should be able to generate any specific variants on demand. This will simplify the management and maintenance of BOM structures.
(2) Specification Support: The GBOM model must support a specification process to allow the user to arrive at any legal product variant of the user’s choice. The process may run on a platform to support an execution driven by the GBOM model. The specific BOM generated should be immediately usable in other parts of the information system (such as production control), and should support data exchange along the appropriate supply chain (such as purchasing).
(3) Child Independence: A GBOM describes a set of product or component variants that another GBOM may include as one of its components. The principle of abstraction requires that different variants must be independent of the products where they are included. This allows different product GBOMs to utilize the same component structures.
(4) Parental Restriction: A product GBOM including a component (GBOM) should be able to restrict the legal choices of the component variants. This requirement is actually in concert with the child independence requirement in (3) above so that product and component GBOMs may work together. In fact, (3) and (4) together allow the GBOM model to support cutting and pasting when presented visually on the screen for direct manipulation.
(5) Iteration Support: The GBOM model should support an iteration process to generate all acceptable variants matching some given criteria of requirements about the product. The process may run on a platform to support the execution driven by the GBOM model with the given criteria. Iteration support allows the partially specified GBOM to be visually presented and manipulated in a graphical environment.
(6) Composition Support: Composing a new GBOM is not at all straightforward. While it is desirable to define similar components as variants of one generic component, too much variation can make the GBOM too intricate. Product engineers may even want to remove unnecessary differences in products and components through re-design, in order to achieve simplicity in the GBOM. Creation and maintenance of GBOM models in actual use is necessarily a long-term effort that requires many iterations. The GBOM model must support the process of composing and editing of GBOM models in the visualization system.
3. Review of GBOM Models
In production process planning, the traditional approach to handling product variants is the use of planning modules [3]. Each attribute with variant attribute values controls a planning module. When each of these attributes is given an acceptable value, the planning modules will then determine the implied requirements for the included components. VanVeen and Wortmann analyzed the conditions assumed in the modular approach, and showed that these conditions may not be valid in an environment with many complex products [5]. The planning modules exhibit certain genericity, but the model does not fulfill the principle requirements.
Schonsleben introduced the “Variant Generator” – a data model to present all variants of each component in one Bill of Material [6]. A set of mutually independent parameters represents all the attributes which may take variant values. We specify a product variant by assigning the proper values to these parameters. In practice, these parameters are rarely mutually independent, and there is no mechanism to help avoid improper combinations of these parameter values. Therefore, the Variant Generator does not have specification support, nor does it support parental restriction.
VanVeen improved the Variant Generator and introduced the GBOM concept [7]. The GBOM model allows the parents to restrict the variability of each child, with a complex definition of the inclusion relationship between components. An elaborate conversion function is needed to determine which child component is selected for each variant of the parent component. While the specification process is difficult, the major weakness is the need to specify the complete set of parameter values in advance to determine the product variant before the GBOM is even exploded. The approach does not offer good specification support.
Hegge and Wortmann introduced a set of inheritance rules to the parent-child relationship in the Variant Generator [8]. These rules implicitly make the choice for each parent variant to select the child variant. The inclusion of a component becomes implicit in the model, and the inheritance rules have complicated semantics. The approach shares the same limitation as the one proposed by vanVeen.
Bottema and van der Tang described a product configuration system in which the user makes choices of variant features during the explosion process [9]. However, the system is not able to validate these choices until the end of the process. There is no explicit GBOM model to demonstrate genericity, and the specification support is insufficient.
Chung and Fischer designed an object-oriented GBOM model [10][11], utilizing the subclass construct to represent the inclusion relationships, so that included components may inherit the features of their parents for the determination of component variants. However, components inheriting features of the parent would violate the child independence requirement, and would not allow the same components to share the same GBOM across different products.
Olsen and Saetre presented a procedural approach and described a GBOM model that supports an elaborate product specification process [4][12]. The procedural programming language works out a highly practicable GBOM model, with composition support. However, for implementation convenience, the programming language approach must balance between the requirements of child independence and parental restriction, allowing a component to impose a constraint on another
component. The drawback is a model with difficulty in supporting visualization for manipulation.
Bertrand et al. developed the “hierarchical pseudo Bill of Material” – a GBOM model which supports both parental restriction and child independence [13]. The focus was on material planning and production scheduling, and the model did not demonstrate iteration and composition support.
Our approach primarily builds on these ideas but we also attempt to make our GBOM model visual and declarative. We maintain all the principle requirements as aforementioned for the GBOM model. For instance, we must enforce the child independence requirement in the model to support cutting and pasting in the graphical presentation of GBOM models. We also enforce the parental restriction requirement, so that a component will not be allowed to impose a constraint on another component unless it is included as a sub-component. If there is a necessary constraint between two different components, it must be specified at a higher level in the GBOM structure, referring to both components as sub-components. To validate a user choice of value for a variant attribute in a component, the specification process must then traverse up and down the GBOM structure to check on all possible constraints. When we can enforce the discipline in the GBOM model to meet the stated requirements, we believe that model-driven software will bring a new level of flexibility and ease of construction and verification of GBOM, similar to other graphical definition tools for object modeling [14][15][16].
4. The GBOM Model
Figure 1 depicts the structure of our GBOM model. In the GBOM model, a GBOM consists of two main parts: the HEAD and the BODY. The HEAD has a descriptive name, and all the attributes and the corresponding permissible values for each attribute. The set of attributes always includes the Part Number, the value of which uniquely identifies the GBOM model. The HEAD also specifies the constraints applied to the attribute values, so that only those attribute values that satisfy all the constraints are acceptable in some product variant described by the GBOM. The BODY of the GBOM model specifies the inclusion of components. There may be a condition specified for the inclusion of a component: the component is included only when the specified condition is met. If no condition is specified, the component is always included. In addition to the condition for inclusion, there is also a set of restrictions the product may impose on the component to be included. These restrictions apply to the attributes and the corresponding permissible values for the component GBOM. Both the condition and the restrictions may refer to attribute values of other components. This allows the specification of cross-component constraints applying to two different components. The GBOM model is also designed for visual display in a graphical editor, to support easy composing and editing. In a graphical editor, the user can choose to display different levels of detail in the GBOM models, and optionally explode any particular inclusion. When the GBOM model satisfies all the stated requirements, the graphical editor also supports easy cutting and pasting of GBOM models visually presented to the user.
Fig. 1. Structure of the GBOM model: a product GBOM consists of a HEAD (descriptive name, Part Number, attributes with permissible values, constraints) and a BODY that includes component GBOMs, each with an optional inclusion condition and a set of restrictions.
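As an illustration of this structure, a minimal Python sketch of the HEAD/BODY layout follows; the class and field names are our own assumptions, not part of the paper's model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Minimal sketch of the HEAD/BODY structure of a GBOM; names are illustrative.

@dataclass
class Inclusion:
    component: "GBOM"
    condition: Optional[Callable[[Dict[str, str]], bool]] = None  # include when true
    restrictions: Dict[str, List[str]] = field(default_factory=dict)  # parental restriction

@dataclass
class GBOM:
    name: str
    part_number: str
    attributes: Dict[str, List[str]]              # attribute -> permissible values (HEAD)
    constraints: List[Callable[[Dict[str, str]], bool]] = field(default_factory=list)
    body: List[Inclusion] = field(default_factory=list)   # component inclusions (BODY)

    def legal(self, choice: Dict[str, str]) -> bool:
        """Check a variant choice against permissible values and constraints."""
        return all(choice.get(a) in vals for a, vals in self.attributes.items()) \
            and all(c(choice) for c in self.constraints)
```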
Figure 2 on the following page illustrates a simple example of the GBOM model of a stool, and how it is related to the GBOM components of seat, base, cushion, chipboard, and different seat covers. The stool is part number 200. The color can be red or blue, and the material can be wood or aluminum. It consists of two components: the seat and the base. The seat is part number 400, and there are three different colors for the seat: red, blue, and white, but the stool will restrict the choice of colors for the seat to that of the stool: red or blue. The seat consists of the cushion, the chipboard, and one of three kinds of seat covers. The color of the seat determines which seat cover to include, which is specified in the condition for inclusion. Incidentally, the different seat covers identified by different part numbers can of course be represented in one GBOM, with the choice of different colors. The stool also includes a base. The base is of three different types: wood, aluminum, and plastic, and also three different heights. But the stool GBOM will only include the base of the type specified in the material of the stool, and the height of 60.
5. The Visual Environment
The GBOM system comprises several other components. Although not all of them relate directly to the visual construction of GBOM models, the following list briefly explains some of them to provide the context for describing the visual environment for constructing GBOM models.

The GBOM System
GBSpec: Platform to run the specification process, generating the specific BOM for the product variant chosen by the user.
GBMatch: Platform to run the match process, testing the validity of requirements or constraints against a GBOM model.
GBEdit: Graphical editor to compose and edit GBOM models.
GBGen: Application to convert a specific BOM into a GBOM model, for further editing.
Figure 2 shows the stool GBOM graphically: the stool (Part# 200) includes the seat (Part# 400) with the restriction Color = this.Color, and the base (Part# 500) with the restrictions Type = this.Material and Height = 60; the seat in turn includes the cushion (Part# 410), the chipboard (Part# 420) and one of the seat covers (Part# 451 Red, 452 Blue, 453 White) depending on its Color. The textual GBOM definitions given in the figure are:

component $200 is
  name(stool);
  Color(Red|Blue);
  Material(aluminum|wood);
end component;
body $200 is
  include $400 with
    Color(this.Color);
  end include;
  include $500 with
    Type(this.Material);
    Height(60);
  end include;
end body;

component $400 is
  name(seat);
  Color(Red|Blue|White);
end component;
body $400 is
  include $410;
  include $420;
  case Color is
    when Red: include $451;
    when Blue: include $452;
    when White: include $453;
  end case;
end body;

component $500 is
  name(base);
  Type(aluminum|wood|plastic);
  Height(60|80|100);
end component;

component $410 is name(cushion); end component;
component $420 is name(chipboard); end component;
component $451 is name(seat cover); Color:Red; end component;
component $452 is name(seat cover); Color:Blue; end component;
component $453 is name(seat cover); Color:White; end component;

Fig. 2. GBOM model of a stool
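To make the specification idea concrete, here is a hedged Python sketch of how a GBSpec-like explosion might walk such a structure and emit a specific BOM for a chosen variant; it reuses the classes sketched earlier, auto-picks the first legal value for each child attribute instead of asking the user, and is not the paper's implementation.

```python
# Hedged sketch of a GBSpec-like explosion: given a user's attribute choices,
# walk the BODY, apply parental restrictions, and emit the specific BOM.
# Builds on the GBOM/Inclusion classes sketched earlier; illustrative only.

def explode(gbom, choice, bom=None, indent=0):
    bom = [] if bom is None else bom
    if not gbom.legal(choice):
        raise ValueError(f"illegal variant for {gbom.name}: {choice}")
    bom.append("  " * indent + f"{gbom.name} (Part# {gbom.part_number}) {choice}")
    for inc in gbom.body:
        if inc.condition and not inc.condition(choice):
            continue                                   # inclusion condition not met
        child_choice = {}
        for attr, values in inc.component.attributes.items():
            allowed = inc.restrictions.get(attr, values)   # parental restriction
            # resolve "this.<attr>" style references against the parent choice
            allowed = [choice.get(v[5:], v) if isinstance(v, str) and v.startswith("this.")
                       else v for v in allowed]
            child_choice[attr] = allowed[0]            # pick first legal value (sketch)
        explode(inc.component, child_choice, bom, indent + 1)
    return bom
```

For the stool example, calling explode with the choice {"Color": "Red", "Material": "wood"} would, under this sketch's first-legal-value policy, list the stool, a red seat with its cushion, chipboard and red seat cover, and a wooden base of height 60.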
In the back-end, the GBOM system is supported by a persistent store for existing GBOM models, which is the GBOM database. The GBOM definition tool is GBEdit - a graphical editor for composing and editing GBOM models. Figure 3 depicts the graphical presentation of GBOM models for editing. Our current design uses a context-sensitive pop-up menu in the graphical
editor. On the main canvas, the pop-up menu supports the addition of a new component GBOM, or brings up a browse window to select a model from the GBOM database. Upon creation, a GBOM component allows key entry of its component name, and a unique Part Number is generated. The user can, however, change the generated Part Number. Each component GBOM is displayed in a sub-window, which can be expanded or collapsed. When collapsed, the GBOM is displayed only with the component name. The pop-up menu from the top section with the component name supports expanding or collapsing the window. In a semi-expanded form, the middle section is shown, listing the attributes and the possible values each attribute can take, as well as any additional constraints. The middle section constitutes the HEAD of the GBOM model. The pop-up menu from inside the middle section allows entry of new attributes and editing of existing attributes, as well as of the constraints. The fully expanded GBOM window shows the lower section, presenting the GBOM sub-components and the conditions and requirements for including each. The pop-up menu from the lower section supports editing of the conditions and requirements, along with the inclusion of any sub-component by inserting a link. The lower section constitutes the BODY of the GBOM model.
Fig. 3. The GBOM Graphical Editor, showing the stool GBOM (Part Number 200) expanded with its seat component (Part Number 400) and the restriction Color == this.Color.
The graphical editor makes the construction of GBOM models easier. The convenience to expand or collapse a GBOM model on display makes the complicated GBOM structure easier to deal with, since the user can perceive the hierarchical composition thus implied in the GBOM model. The key relies on these two requirements: Child Independence and Parental Restriction, listed amongst the principle requirements in Section 2. The graphical editor can also conveniently support cutting and pasting collections of component GBOM models in GBEdit. Furthermore, in the GBOM system, every GBOM model supports the GBSpec process and the GBMatch process. These features would become helpful aids in the GBEdit environment for the construction and verification when composing GBOM models. GBSpec allows the user to investigate possible product variants, while GBMatch provides a convenient way to verify that the stated requirements for
inclusion of a sub-component GBOM are valid. In the visual environment of GBEdit, the user can trace a conflict in including certain sub-components up and down the hierarchy to the level where the cause of the conflict occurs. Again, this feature makes use of the two principle requirements: Child Independence and Parental Restriction. Moreover, since the GBOM models may also include information about the cost and availability of each component, the GBOM system can be very useful for supply chain management planning. GBGen is installation specific, and is not particularly pertinent to the visual environment.
6. Summary
We presented a new GBOM model and sketched the GBOM system to facilitate its visual construction and verification. We formulated the set of six principle requirements for the GBOM model to support visual presentation and manipulation. These requirements are briefly summarized below:
• Genericity: The GBOM model captures all the BOM information of its variants.
• Specification Support: The system must support a product specification process.
• Child Independence: The GBOM for a component is independent of the product GBOM that includes it as a component.
• Parental Restriction: A product GBOM is allowed to restrict the variability of a component GBOM included as its component.
• Iteration Support: The system should support search for specific products matching certain specified criteria about the product.
• Composition Support: The system should support tools and processes for the generation, composition, and editing of GBOM models.
Based on the framework of these requirements, we reviewed the various approaches to the GBOM model, and then presented our GBOM model that fulfills all the principle requirements aforementioned. We briefly described the GBOM system, and then proceeded to describe further the visual environment for the composition of GBOM models in the system. The two principle requirements – Child Independence and Parental Restriction – foster a hierarchical structure of the GBOM model to facilitate easy cutting and pasting in the visual environment, and verification of the inclusion of sub-components through tracing up and down such a hierarchy.
References
1. Erens F.J., Hegge H.M.H. “Manufacturing and sales coordination for product variety,” International Journal of Production Economics 37: 83-99, 1994.
2. Pine B.J. II, Mass Customization: the new frontier in business competition, Harvard Business School Press, Boston, 1993.
3. Vollmann T.E., Berry W.L., Whybark D.C. “Advanced concepts in master production scheduling,” Manufacturing Planning and Control Systems, Dow Jones-Irwin, 1992.
4. Olsen K.A., Saetre P., Thorstenson A. “A Procedure-Oriented Generic Bill of Materials,” Computers & Industrial Engineering 32(1): 29-45, 1997.
5. VanVeen, E.A., Wortmann, J.C. “New developments in generative bill of material processing systems,” Production Planning & Control 3(3): 327-335, 1992.
6. Schonsleben, P. Flexible Production Planning and Data Structuring on Computer, CW-Publikationen, Munchen, 1985.
7. VanVeen, E.A. Modelling product structures by generic bills-of-material, Elsevier Science Publishers, Amsterdam, 1992.
8. Hegge, H.M.H., Wortmann, J.C. “Generic bill-of-material: a new product model,” International Journal of Production Economics 23: 117-128, 1991.
9. Bottema, A., van der Tang, F. “A product configurator as key decision support system,” IFIP Transactions B-7: Applications in Technology, 71-92, North-Holland, 1992.
10. Chung, Y., Fischer, G.W. “Illustration of object-oriented databases for the structure of a bill of materials,” Computers in Industry 19: 257-270, 1992.
11. Chung, Y., Fischer, G.W. “A conceptual structure and issues for an object-oriented bill of materials data model,” Computers and Industrial Engineering 26(2): 321-339, 1994.
12. Olsen, K.A., Saetre, P. “Describing products as programs,” International Journal of Production Economics 56-57: 495-502, 1998.
13. Bertrand, J.W.M., Zuijderwijk, M., Hegge, H.M.H. “Using hierarchical pseudo bills of material for customer order acceptance and optimal material replenishment in assemble to order manufacturing of non-modular products,” International Journal of Production Economics 66: 171-184, 2000.
14. Gangopadhyay, D., Wu, P.Y. “An object-based approach to Medical Process Automation,” 17th Annual Symposium on Computer Application in Medical Care, Washington, D.C., McGraw-Hill, 910-914, 1993.
15. Wu, P.Y. “Visual Capacity Modeling and Interactive Decision Support for Production Planning,” Computer Technology Solutions Conference, Detroit, Michigan, Society of Manufacturing Engineers, September 1999.
16. Wu, P.Y. “Visualizing Capacity and Load in Production Planning,” Information Visualization 2001, London, England, IEEE Computer Society, July 2001.
Data and Knowledge Visualization in Knowledge Discovery Process TrongDung Nguyen, TuBao Ho, and DucDung Nguyen Japan Advanced Institute of Science and Technology Ishikawa, 923-1292 Japan
Abstract. The purpose of our work described in this paper is to develop and put a synergistic visualization of data and knowledge into the knowledge discovery process in order to support an active participation of the user. We introduce the knowledge discovery system D2MS in which several visualization techniques of data and knowledge are developed and integrated into the steps of the knowledge discovery process. Keywords: model selection, knowledge discovery process, data and knowledge visualization, the user’s participation.
1 Introduction
Knowledge discovery in databases (KDD) – the rapidly growing interdisciplinary field of computing that evolves from its roots in database management, statistics, and machine learning – aims at finding useful knowledge from large databases. The process of knowledge discovery is complicated and should be seen inherently as a process containing several iterative and interactive steps: (1) understanding the application domain, (2) data preprocessing, (3) data mining, (4) postprocessing, and (5) applying discovered knowledge. The problem of model selection in KDD – choosing appropriate discovered models or algorithms and their settings for obtaining such models in a given application – is difficult and non-trivial because it requires empirical comparative evaluation of discovered models and meta-knowledge on models/algorithms. In our view, model selection should be semiautomatic and it requires an effective collaboration between the user and the discovery system. In such a collaboration, visualization has an indispensable role because it can give a deep understanding of complicated models that the user cannot have if using only performance metrics. The goal of this work is to develop a research system for knowledge discovery with support for model selection and visualization called D2MS. The system has two main contributions to visual knowledge discovery. First is its efficient visualizers for large multidimensional databases, discovered rules, and hierarchical structures, as well as a synergistic visualization of data and knowledge. In particular, the novel visualization technique T2.5D (Trees 2.5 Dimensions) for large hierarchical structures can be seen as an alternative to powerful techniques for representing large hierarchical structures such as cone trees [13] or hyperbolic
trees [2]. Second is its tight integration of the visualizers with functions in each step of the knowledge discovery process for supporting the model selection purpose.
Fig. 1. Conceptual architecture of D2MS
2 Data and Knowledge Visualization in D2MS
Figure 1 presents the conceptual architecture of D2MS. The system consists of eight modules: Graphical user interface, Data interface, Data processing, Data mining, Postprocessing and Evaluation, Plan management, Visualization, and Application. For preprocessing tasks, D2MS has algorithms for discretization of continuous attributes (three methods based on entropy, rough sets, and k-means clustering), filling in missing values (three cluster-based methods for supervised numerical and categorical data, and unsupervised data), and the feature selection method SFG [6]. For data mining tasks, D2MS consists of k-nearest neighbors, Bayesian classification, the decision tree induction method CABRO [7], CABROrule [9] and LUPC [3] for rule induction, and the conceptual clustering method OSHAM [1]. For postprocessing and evaluation tasks, D2MS has k-fold stratified cross validation integrated with the data mining programs, and exportation of discovered models into spreadsheet or readable forms for the visualization programs. The visualization module is linked to most other modules in D2MS, in particular those directly concerned with model selection. It currently consists of a data visualizer, a rule visualizer, and a tree visualizer (for hierarchical structures). These visualizers are integrated with most of the methods mentioned above in preprocessing, data mining, and postprocessing. We describe each visualizer with a focus on its techniques and on how it is linked to the steps in the KDD process.

2.1 Data Visualization
We have chosen the parallel coordinates technique for visualizing 2D tabular datasets defined by n rows and p columns.
Fig. 2. Data visualization in D2MS
Viewing Original Data. This view gives the user a rough idea about the distribution of data over the values of each attribute; in particular, the colors of different classes can in many cases show clearly how the classes differ from each other.

Summarizing Data. The key idea of this view is not to view original data points but to view their summaries on parallel attributes. Like WinViz [4], D2MS uses bar charts in place of attribute values on each axis. The bar charts on each axis have the same overall height (depending on the number of possible attribute values) and different widths that signify the frequencies of the attribute values. The top-right window in Figure 2 shows the summaries of the stomach cancer data.

Querying Data. This view serves hypothesis generation and hypothesis testing by the user. There are three types of queries: (i) based on a value of the class attribute, where the query determines the subset of all instances belonging to the indicated class; (ii) based on a value of a descriptive attribute, where the query determines the subset of all instances having this value; (iii) based on a conjunction of attribute-value pairs, where the query determines the subset of all instances satisfying this conjunction. The queries can be specified by just using point-and-click. The subset of instances matching a query is visualized in the data viewing mode and in the data summarizing mode. The grey regions on each
axis show the proportions of the specified instances over the values of this attribute (as shown in the bottom-right window in Figure 2).

Data Visualization in the KDD Process. With the above three views of data, D2MS integrates data visualization into the different KDD steps by displaying and interactively changing these views of data at any time. For example, many discretization algorithms provide alternative solutions for dividing a numerical attribute into intervals, and a visual data query on the discretized attribute and the class attribute can give insights for the decision.
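To illustrate the query-and-summarize idea, here is a small Python sketch of a conjunctive attribute-value query over a tabular dataset with per-attribute frequency summaries; the data layout and attribute names are our own assumptions, not D2MS code.

```python
from collections import Counter

# Sketch of a conjunctive attribute-value query with per-attribute value
# summaries, in the spirit of the data visualizer; not D2MS code.

def query(rows, conditions):
    """Return rows matching a conjunction of attribute-value pairs."""
    return [r for r in rows if all(r.get(a) == v for a, v in conditions.items())]

def summarize(rows, attributes):
    """Frequencies of each attribute's values, e.g. widths of the bar charts."""
    return {a: Counter(r.get(a) for r in rows) for a in attributes}

data = [
    {"age": "young", "smoker": "yes", "class": "ill"},
    {"age": "old", "smoker": "no", "class": "healthy"},
    {"age": "old", "smoker": "yes", "class": "ill"},
]
matched = query(data, {"smoker": "yes"})
print(summarize(matched, ["age", "class"]))   # proportions shown on each axis
```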
2.2 Rule Visualization
A rule is a pattern relating several attribute-value pairs to a subset of instances. What matters in visualizing a rule is how this local structure is viewed in relation to the whole dataset, and how the view supports the user's evaluation of the rule's interestingness. D2MS's rule visualizer allows the user to visualize rules of the form antecedent → consequent, where the antecedent is a conjunction of attribute-value pairs and the consequent is a conjunction of attribute-value pairs in the case of association rules, or a value of the class attribute in the case of prediction rules. A rule is simply displayed on the subset of parallel coordinates involved in its antecedent and consequent. D2MS's rule visualizer has the following functions.

Viewing Rules. Each rule is displayed as a polyline that goes through the axes containing the attribute-value pairs occurring in the antecedent of the rule and leading to its consequent, which is displayed in a different color. In the case of prediction rules, the ratio associated with each class on the class attribute corresponds to the number of instances of that class covered by the rule over the total number of instances in that class. This view gives a first impression of the rule's quality.

Viewing Rules and Data. The subset of instances covered by a rule is visualized together with the rule, either on the parallel coordinates or as summaries on the parallel coordinates. From this subset of instances, the user can see the set of rules each of which covers some of these instances, or the user can smoothly change the values of an attribute in the rule to see other related candidate rules.

Rule Visualization in the KDD Process. These views support the user in evaluating the quality of a rule together with other measures such as its coverage and accuracy. For example, of two rules predicting a target class with the same support and confidence, the one that wrongly covers more instances belonging to classes other than the target class would be considered worse. Figure 3 illustrates rule visualization in D2MS, where the top-left and bottom-left windows display a discovered rule, and the top-right and bottom-right windows show the instances covered by that rule.
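The per-class ratios mentioned under "Viewing Rules" can be computed directly from the data. Below is a small, hypothetical sketch of that computation (the field names and toy data are assumptions); the rule visualizer displays these ratios on the class axis.

def rule_covers(instance, antecedent):
    """True if the instance satisfies every attribute-value pair of the rule."""
    return all(instance.get(a) == v for a, v in antecedent.items())

def class_coverage(rows, antecedent, class_attr="class"):
    """For each class value, the fraction of that class covered by the rule:
    |covered instances of the class| / |instances of the class|."""
    totals, covered = {}, {}
    for row in rows:
        c = row[class_attr]
        totals[c] = totals.get(c, 0) + 1
        if rule_covers(row, antecedent):
            covered[c] = covered.get(c, 0) + 1
    return {c: covered.get(c, 0) / totals[c] for c in totals}

# Example: how well does the antecedent {stage: III} separate the target class?
data = [
    {"stage": "III", "class": "dead_within_90_days"},
    {"stage": "III", "class": "alive"},
    {"stage": "I",   "class": "alive"},
]
print(class_coverage(data, {"stage": "III"}))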
Fig. 3. Rule Visualization
2.3 Tree Visualization
D2MS provides several visualization techniques that allow the user to visualize large hierarchical structures effectively. The tightly-coupled views simultaneously display a hierarchy in normal size and in tiny size, which allows the user to determine the field-of-view quickly and to pan to the region of interest. The fish-eye view distorts the magnified image so that the center of interest is displayed at high magnification while the rest of the image is progressively compressed. In addition, the new technique T2.5D [8] is implemented in D2MS for visualizing very large hierarchical structures.

Different Modes of Viewing Hierarchical Structures. The D2MS tree visualizer provides multiple views of trees and other hierarchical structures. Figure 4 illustrates some of them.

– Tightly-coupled views: The global view (on the left) shows the tree structure with all nodes at the same small size and without labels, and therefore it can display a tree fully, or a large part of it, depending on the tree size. The detailed view (on the right) shows the tree structure and the nodes with their labels, together with operations to display node information.

– Customizing views: Initially, according to the user's choice, the tree is displayed either fully or with only the root node and its direct sub-nodes. The tree can then be collapsed or expanded, partially or fully, from the root or from any intermediate node.
Fig. 4. Different views of trees in D2MS
– Tiny mode with fish-eye view: The tiny mode uses the screen space much more efficiently to visualize the tree structure; on it the user can quickly determine the field-of-view and pan to the region of interest. The fish-eye view distorts the magnified image so that the center of interest is displayed at high magnification while the rest of the image is progressively compressed (a minimal sketch of one such distortion function follows this list).
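The distortion can be realized with a simple one-dimensional transform of node coordinates. The formulation below is a common, generic fisheye function and is an assumption on our part; the paper does not give the exact formula used in D2MS.

def fisheye(x, focus, bound, d=3.0):
    """Map coordinate x so that the region around `focus` is magnified and
    the rest is compressed toward `bound` (the screen edge on x's side of the
    focus). The parameter d controls the amount of distortion."""
    if bound == focus:
        return focus
    t = (x - focus) / (bound - focus)   # normalized distance in [0, 1]
    g = (d + 1) * t / (d * t + 1)       # magnify near 0, compress near 1
    return focus + g * (bound - focus)

# Node x-coordinates on a canvas 100 units wide, with the focus at x = 40.
for x in (41, 45, 60, 95):
    print(x, "->", round(fisheye(x, focus=40, bound=100), 1))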
Trees 2.5 Dimensions. The user may find it difficult to navigate a very large hierarchy even with tightly-coupled and fish-eye views. To overcome this difficulty, we have been developing a new technique called T2.5D (which stands for Trees 2.5 Dimensions). T2.5D is inspired by the work of Reingold and Tilford [12], which draws tidy trees in reasonable time and storage. Unlike tightly-coupled and fish-eye views, which can be seen as location-based views (views of the objects in a region), T2.5D can be seen as a relation-based view (a view of related objects). The starting point of T2.5D is the observation that a large tree consists of many subtrees that usually do not need to be viewed simultaneously. The key idea of T2.5D is to represent a large tree in a virtual 3D space (subtrees are overlapped to reduce the occupied space) while each subtree of interest is displayed in a 2D space.
Fig. 5. T2.5D provides views in between 2D and 3D
To this end, T2.5D determines the fixed position of each subtree (its root node) on the X and Y axes and, in addition, dynamically computes a Z-order for this subtree on an imaginary Z axis. A subtree with a given Z-order is displayed "above" its siblings that have higher Z-orders. When a tree is visualized and navigated, at each moment T2.5D sets to zero the Z-order of all nodes on the path from the root to the node in focus. The active wide path to the node in focus, which contains all nodes on the path from the root to this node together with their siblings, is displayed at the front of the screen in highlighted colors to give the user a clear view. The other parts of the tree remain in the background to provide an image of the overall structure. With the Z-order, T2.5D gives the user the impression that trees are drawn in a 3D space. The user can easily change the active wide path by choosing another node in focus [8]. We have experimented with T2.5D on various real and artificial datasets. It has been verified that T2.5D handles trees with more than 20,000 nodes well, and that more than 1,000 nodes can be displayed together on the screen [8]. Figure 5 illustrates a pruned tree of 1,795 nodes learned from the stomach cancer data and drawn by T2.5D (note that the original color screen gives a better view than this black-and-white reproduction).
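The Z-order scheme just described can be sketched as follows. This is a simplified reconstruction from the description above, with hypothetical data structures; the actual T2.5D layout additionally computes the X and Y positions with a Reingold-Tilford-style algorithm [12].

def path_to_root(node, parent):
    """Nodes on the path from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def z_orders(children, parent, focus):
    """Assign a Z-order to every node: 0 for nodes on the root-to-focus path,
    1 for their siblings (the 'active wide path'), 2 for everything else.
    Lower Z is drawn in front. Simplified sketch of the scheme in the text."""
    on_path = set(path_to_root(focus, parent))
    wide = set(on_path)
    for n in on_path:
        if n in parent:                      # add the siblings of each path node
            wide.update(children[parent[n]])
    return {node: 0 if node in on_path else (1 if node in wide else 2)
            for node in children}

# Tiny example tree: root -> a, b; a -> a1, a2; focus on a1.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": [], "a1": [], "a2": []}
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a"}
print(z_orders(children, parent, focus="a1"))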
3 A Case-Study
This section illustrates the utility of D2MS's synergistic visualization of data and knowledge in extracting knowledge from a stomach cancer dataset.
3.1 The Stomach Cancer Dataset
The stomach cancer dataset, collected at the National Cancer Center in Tokyo during the period 1962-1991, is a very precious source for research. It contains data on 7,520 patients, originally described by 83 numeric and categorical attributes. The top-left window in Figure 6 shows the first few attributes of some instances in the original database, while the middle-left window shows its visualization. One problem is to use the attributes describing patient information before the operation to predict the patient's status after the operation. The domain experts are particularly interested in finding predictive and descriptive rules for the class of patients who "died within 90 days" after the operation, one of five classes in total.
3.2 Visual Knowledge Discovery from Stomach Cancer Data
The task of extracting prediction and description rules for the class of patients who died within 90 days after the operation is difficult because there is no sharp boundary between this class and the others. The data mining methods C4.5, See5.0, CBA, and Rosetta have been applied to this task. However, the obtained results were far from expectations: the rules have low support and confidence and concern only a small part of the patients in the target class.

Visual Interactive Preprocessing. Much effort has been put into preprocessing the stomach cancer data. The first preprocessing task carried out by the domain experts was to discretize the continuous attributes. The two discretization methods available in D2MS, based on entropy and on rough sets, yield two different ways of discretizing the continuous attributes. The dataset derived with the entropy-based method ignores many attributes and often has few discretized values for the remaining ones. The dataset derived with the rough-set-based method discretizes the continuous attributes into more values and does not ignore any attribute. The middle-left and bottom-left windows in Figure 6 visualize the original stomach data, and the middle-right and bottom-right windows illustrate one discretization solution among the trials.

The second preprocessing task was to select subsets of the 83 attributes that are most relevant to the discovery target. Two feature selection methods have been used for this task: the manual KJ method, which is popular in Japan, and the feature selection algorithm SFG [6]. SFG orders attributes by information gain, and the user can choose different subsets of attributes in decreasing order of information gain. These methods yield different candidate subsets, leaving the user to decide which is most appropriate. Data visualization, in particular the querying mode in D2MS, has supported the domain experts in this task.
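To make the SFG-style ranking concrete, the sketch below orders attributes by information gain with respect to the class attribute. This is the standard information-gain computation rather than SFG's actual implementation, and the attribute names and toy data are purely illustrative.

import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, class_attr="class"):
    """H(class) - sum_v P(attr = v) * H(class | attr = v)."""
    labels = [r[class_attr] for r in rows]
    by_value = defaultdict(list)
    for r in rows:
        by_value[r[attr]].append(r[class_attr])
    cond = sum(len(part) / len(rows) * entropy(part) for part in by_value.values())
    return entropy(labels) - cond

def rank_attributes(rows, attributes, class_attr="class"):
    """Attributes in decreasing order of information gain (SFG-style ranking)."""
    return sorted(attributes,
                  key=lambda a: information_gain(rows, a, class_attr),
                  reverse=True)

data = [
    {"stage": "III", "sex": "male",   "class": "dead_within_90_days"},
    {"stage": "III", "sex": "female", "class": "dead_within_90_days"},
    {"stage": "I",   "sex": "male",   "class": "alive"},
    {"stage": "I",   "sex": "female", "class": "alive"},
]
print(rank_attributes(data, ["stage", "sex"]))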
Fig. 6. Visual interactive discretization and feature selection
The key point here is to support generating and testing hypotheses by changing views of the data, querying the data to gain insights into the importance of attributes, and so on. The middle-right and bottom-right windows in Figure 6 illustrate the interactive visualization used in this task. For example, the colors of the polylines distributed over the axes in the middle-right window may suggest how significant an attribute is for the prediction task.

Visual Mining of Rules and Decision Trees. We have applied the two methods LUPC and CABRO in D2MS to mine rules and decision trees from the stomach data, with visualization playing a significant role. Visual LUPC is effective in mining rules from minority classes [3]. It allows the user to try generating different candidate sets of rules under different parameter settings, then visually evaluate the results and adjust the parameters until appropriate results are achieved. For example, the domain experts took the default values for the number of candidate attribute-value pairs, η = 100, and the number of candidate rules, γ = 20, while varying the two parameters for the minimum accuracy of a rule, α, and the minimum coverage of a rule, β. Given α and β, there are two search modes, biased toward accuracy or toward coverage. In the former, LUPC finds rules for the target class with accuracy as high as possible while the coverage remains equal to or greater than β. In the latter, rules are found with coverage as large as possible while the accuracy satisfies the threshold α.
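The interplay of the α and β thresholds and the two search biases can be pictured with a deliberately simplified filter over candidate rules. This is not the actual LUPC search, which generates its own candidates; the rule representation, parameter names, and toy data below are illustrative assumptions only.

def accuracy_and_coverage(rule, rows, class_attr="class"):
    """Coverage: target-class instances covered by the rule.
    Accuracy: covered target-class instances / all covered instances."""
    antecedent, target = rule
    covered = [r for r in rows
               if all(r.get(a) == v for a, v in antecedent.items())]
    hits = sum(1 for r in covered if r[class_attr] == target)
    acc = hits / len(covered) if covered else 0.0
    return acc, hits

def select_rules(candidates, rows, alpha, beta, bias="accuracy"):
    """Keep rules with accuracy >= alpha and coverage >= beta, then order them
    by the chosen bias (accuracy-first or coverage-first), mirroring the two
    modes described above. Simplified sketch, not the LUPC algorithm itself."""
    scored = [(rule, *accuracy_and_coverage(rule, rows)) for rule in candidates]
    kept = [(rule, acc, cov) for rule, acc, cov in scored
            if acc >= alpha and cov >= beta]
    key = (lambda t: (t[1], t[2])) if bias == "accuracy" else (lambda t: (t[2], t[1]))
    return sorted(kept, key=key, reverse=True)

data = [
    {"stage": "III", "class": "dead_within_90_days"},
    {"stage": "III", "class": "alive"},
    {"stage": "I",   "class": "alive"},
]
rules = [({"stage": "III"}, "dead_within_90_days")]
print(select_rules(rules, data, alpha=0.5, beta=1, bias="coverage"))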
Table 1. LUPC finds more and higher-quality rules.

Quality of rules                 Number of discovered rules
                                 See5.0  CBA-CB  CBA-CAR  Rosetta  LUPC
cover ≥ 7,  accuracy ≥ 80%          2       1       1        5      14
cover ≥ 7,  accuracy ≥ 100%         0       0       0        0       4
cover ≥ 10, accuracy ≥ 50%          1       0       1        0      61
cover ≥ 10, accuracy ≥ 60%          0       0       0        0      40
cover ≥ 10, accuracy ≥ 70%          0       0       0        0      21
The most important activity in interactive mining with LUPC is the evaluation of the discovered rules, in which the user is helped by the D2MS rule visualizer. In addition to the overall performance metrics that LUPC provides for the obtained rule set and for each rule, such as its support and confidence, the user can point and click to examine each rule: the subset of instances it covers, what happens if a condition in the antecedent changes its value to another value of the same attribute, how its errors are distributed over the other classes, and so on.

Table 1 compares the number of rules discovered by See5.0, CBA (CBA-CB and CBA-CAR) [5], Rosetta, and LUPC (columns 2-6), according to the required minimum coverage and accuracy of the rules shown in the first column. Thanks to its support for model selection and visualization, LUPC allows us to find more and higher-quality rules than the other systems. For example, See5.0, CBA, and Rosetta found only 2, 1 (1), and 5 rules, respectively, each of which covers at least 7 cases of the class "died within 90 days" with accuracy equal to or greater than 80%. Clearly, these rules characterize only a small part of the target class, which contains 302 cases in total. LUPC allows us to discover 22 rules with such thresholds. With the requirement of finding rules that cover at least 10 cases, See5.0, CBA, and Rosetta found almost no such rules, except CBA-CAR in the case of accuracy equal to or greater than 50%. Under this condition, LUPC shows its advantage through its ability to discover many such high-quality rules, thanks to the model selection support in D2MS. As analyzed in more detail in [3], See5.0 induces rules with an average error of 30.5% on the test data (a random 30% of the stomach cancer data), but with a very high false positive rate of 98.9% (i.e., the rules found for this class wrongly recognize 98.9% of the cases from other classes; a minimal sketch of this computation follows at the end of this subsection). Similarly, CBA and Rosetta also give poor results on the class "died within 90 days", even when they produce a large number of rules under small thresholds.

The visual CABRO has been used to learn decision trees from the stomach data. For each candidate model, the CABRO tree visualizer graphically displays the corresponding pruned tree, its size, and its prediction error rate. It offers the user multiple views of these trials and makes it easy to compare their results in order to make a final selection of the techniques and models of interest. Figure 5 shows a screen of T2.5D and of model selection by D2MS on the stomach cancer data.
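As referenced above, the false-positive rate can be computed as the fraction of non-target test instances that are wrongly covered by at least one rule for the target class. A minimal sketch, again with a hypothetical rule representation and toy data:

def false_positive_rate(rules, rows, target, class_attr="class"):
    """Fraction of non-target instances covered by at least one target-class
    rule, i.e. the rate at which the rule set wrongly 'recognizes' other
    classes (illustrative sketch only)."""
    def covered(row):
        return any(all(row.get(a) == v for a, v in antecedent.items())
                   for antecedent, _ in rules)
    others = [r for r in rows if r[class_attr] != target]
    if not others:
        return 0.0
    return sum(1 for r in others if covered(r)) / len(others)

test = [
    {"stage": "III", "class": "alive"},
    {"stage": "I",   "class": "alive"},
    {"stage": "III", "class": "dead_within_90_days"},
]
rules = [({"stage": "III"}, "dead_within_90_days")]
print(false_positive_rate(rules, test, target="dead_within_90_days"))  # 0.5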
4 Conclusion
We have presented the knowledge discovery system D2MS, whose support for model selection is integrated with visualization. We emphasize the crucial role of the user's participation in the model selection process of knowledge discovery and have developed the data, rule, and tree visualizers in D2MS to support such participation. Our basic idea is to use the right visualization techniques in the right places, and to integrate visualization into the steps of the knowledge discovery process. D2MS with its visualization support has been used, and has shown its advantages, in extracting knowledge in a real-world application on stomach cancer data.
References

1. Ho, T.B., "Knowledge Discovery from Unsupervised Data in Support of Decision Making", Knowledge Based Systems: Techniques and Applications, C.T. Leondes (Ed.), Academic Press, pp. 435-461, 2000.
2. Lamping, J. and Rao, R., "The Hyperbolic Browser: A Focus + Context Technique for Visualizing Large Hierarchies", Journal of Visual Languages and Computing, 7(1), pp. 33-55, 1997.
3. Ho, T.B., Nguyen, D.D., and Kawasaki, S., "Mining Prediction Rules from Minority Classes", 14th International Conference on Applications of Prolog (INAP2001), International Workshop on Rule-Based Data Mining RBDM2001, Tokyo, October 20-22, 2001.
4. Lee, H.Y., Ong, H.L., and Quek, L.H., "Exploiting Visualization in Knowledge Discovery", Proc. of First Inter. Conf. on Knowledge Discovery and Data Mining, pp. 198-203, 1995.
5. Liu, B., Hsu, W., and Ma, Y., "Integrating Classification and Association Rule Mining", Fourth Inter. Conf. on Knowledge Discovery and Data Mining KDD'98, pp. 80-86, 1998.
6. Liu, H. and Motoda, H., Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, 1998.
7. Nguyen, T.D. and Ho, T.B., "An Interactive Graphic System for Decision Tree Induction", Journal of Japanese Society for Artificial Intelligence, Vol. 14, No. 1, pp. 131-138, 1999.
8. Nguyen, T.D., Ho, T.B., and Shimodaira, H., "A Visualization Tool for Interactive Learning of Large Decision Trees", Twelfth IEEE Inter. Conf. on Tools with Artificial Intelligence ICTAI'2000, pp. 28-35, 2000.
9. Nguyen, T.D., Ho, T.B., and Shimodaira, H., "A Scalable Algorithm for Rule Post-Pruning of Large Decision Trees", Fifth Pacific-Asia Conf. on Knowledge Discovery and Data Mining PAKDD'01, LNAI 2035, Springer, pp. 467-476, 2001.
10. Ohrn, A., Rosetta Technical Reference Manual, Norwegian University of Science and Technology, 1999.
11. Quinlan, J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
12. Reingold, E.M. and Tilford, J.S., "Tidier Drawings of Trees", IEEE Transactions on Software Engineering, Vol. SE-7, No. 2, pp. 223-228, 1981.
13. Robertson, G.G., Mackinlay, J.D., and Card, S.K., "Cone Trees: Animated 3D Visualization of Hierarchical Information", ACM Conf. on Human Factors in Computing Systems, pp. 189-194, 1991.