SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING
The Kluwer International Series on ADVANCES IN DATABASE SYSTEMS Series Editor
Ahmed K. Elmagarmid Purdue University West Lafayette, IN 47907
Other books in the Series:
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dado Vrsalovic; ISBN: 0-7923-7840-7
ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras; ISBN: 0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil Jajodia, Binto George; ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
INTERCONNECTING HETEROGENEOUS INFORMATION SYSTEMS, Athman Bouguettaya, Boualem Benatallah, Ahmed Elmagarmid; ISBN: 0-7923-8216-1
FOUNDATIONS OF KNOWLEDGE SYSTEMS: With Applications to Databases and Agents, Gerd Wagner; ISBN: 0-7923-8212-9
DATABASE RECOVERY, Vijay Kumar, Sang H. Son; ISBN: 0-7923-8192-0
PARALLEL, OBJECT-ORIENTED, AND ACTIVE KNOWLEDGE BASE SYSTEMS, Ioannis Vlahavas, Nick Bassiliades; ISBN: 0-7923-8117-3
DATA MANAGEMENT FOR MOBILE COMPUTING, Evaggelia Pitoura, George Samaras; ISBN: 0-7923-8053-3
MINING VERY LARGE DATABASES WITH PARALLEL PROCESSING, Alex A. Freitas, Simon H. Lavington; ISBN: 0-7923-8048-7
INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS, Elisa Bertino, Beng Chin Ooi, Ron Sacks-Davis, Kian-Lee Tan, Justin Zobel, Boris Shidlovsky, Barbara Catania; ISBN: 0-7923-9985-4
INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES, Thomas Mueck, Martin L. Polaschek; ISBN: 0-7923-9971-4
DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS, Nabil Adam, Aryya Gangopadhyay; ISBN: 0-7923-9924-2
VIDEO DATABASE SYSTEMS: Issues, Products, and Applications, Ahmed K. Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed; ISBN: 0-7923-9872-6
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING by
Shu-Ching Chen School of Computer Science Florida International University
R. L. Kashyap School of Electrical and Computer Engineering Purdue University Arif Ghafoor School of Electrical and Computer Engineering Purdue University
KLUWER ACADEMIC PUBLISHERS New York / Boston / Dordrecht / London / Moscow
eBook ISBN: 0-306-47029-2
Print ISBN: 0-792-37888-1
©2002 Kluwer Academic Publishers
New York, Boston, Dordrecht, London, Moscow

All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.

Created in the United States of America

Visit Kluwer Online at: http://www.kluweronline.com
and Kluwer's eBookstore at: http://www.ebooks.kluweronline.com
Contents

List of Figures
List of Tables
Preface
Acknowledgments

1. INTRODUCTION
   1. Introduction
   2. Multimedia Information Applications
   3. Issues and Challenges

2. SEMANTIC MODELS FOR MULTIMEDIA INFORMATION SYSTEMS
   1. Introduction
   2. Multimedia Semantic Models

3. MULTIMEDIA DATABASE SEARCHING
   1. Introduction
   2. Image Segmentation
   3. Video Parsing and Segmentation Approaches
   4. Motion Detection and Tracking Approaches
   5. Iconic-Based Grouping and Browsing Approaches
   6. Object Recognition Approaches
   7. Knowledge-Based Event Modeling Approaches
   8. Characteristics of Video Data Modeling
   9. Content-Based Retrieval

4. MULTIMEDIA BROWSING
   1. Introduction
   2. Video Browsing
   3. Key Frame Selections

5. CASE STUDY 1 – AUGMENTED TRANSITION NETWORK (ATN) MODEL
   1. Introduction
   2. Spatial and Temporal Relations of Semantic Objects
   3. Multimedia Presentations
   4. Multimedia Database Searching
   5. Multimedia Browsing
   6. User Interactions and Loops

6. CASE STUDY 2 – OBJECT COMPOSITION PETRI NET (OCPN) MODEL
   1. Introduction
   2. Interval-Based Conceptual Models

7. CONCLUSIONS

References

Index
List of Figures

2.1 Timeline for Multimedia Presentation. t1 to t6 are the time instances. d1 is the time duration between t1 and t2, and so on.
2.2 A timeline example that includes choice objects.
2.3 A timeline tree representation of the interactive scenario in Figure 2.2.
2.4 An OCPN example for Figure 2.1: D1 is the delay for media streams I1 and A1 to display. D2 is the delay for V2 to display.
2.5 An example PNBH Petri net.
2.6 Transition Network for Multimedia Presentation.
3.1 Comparison of video data semantic models.
3.2 (a) tessellations with square tiles of equal size, and (b) tessellations with hexagonal tiles of equal size. This kind of tessellation is independent of the image since it does not consider the information content in the image.
3.3 (a) an example of a quad-tree tessellation, and (b) tiling with arbitrary shapes like the shapes of a jigsaw puzzle.
3.4 A hierarchy of video media stream.
4.1 The Cataloging Phase Architecture [144].
4.2 Augmented Transition Network for video browsing: (a) is the ATN network for a video clip which starts at the state V/. (b)-(d) are part of the subnetworks of (a). (b) models scenes in video clip V1. (c) models shots in scene S1. Key frames for shot T1 are in (d).
5.1 Video frame 1. There are four semantic objects: salesman, box, file holder, and telephone; salesman is the target semantic object. The relative position numbers (as defined in Table 5.1) of the other three semantic objects are 10, 15, and 24, respectively.
5.2 Video frame 52. Semantic object box moves from the left to the front of salesman (from the viewer's point of view).
5.3 Video frame 70. Semantic object box moves from the front to the left of salesman (from the viewer's point of view).
5.4 The corresponding subnetwork for the multimedia input string in Equation 5.1.
5.5 A browsing graph of a mini tour of the Purdue University campus: there are seven sites denoted by Bi, i = 1 . . . 7, that are connected by arcs. A directed arc denotes a one-way selection and a bidirectional arc allows two-way selections.
5.6 ATN for the mini tour of a campus: Seven networks represent seven sites which users can browse. Networks B1/ through B7/ represent the presentations for the sites Purdue Mall, Computer Science Building, Chemical Engineering Building, Potter Library, Union, Electrical Engineering Building, and Mechanical Engineering Building, respectively. Each network begins a presentation with three media streams: a video, a text, and an audio, followed by selections. After a user selects a site, control passes to the corresponding network so that the user can watch the presentation for that site continuously.
5.7 Timelines for presentations P1 and P2: Figures (a), (c), (e), and (f) are the presentation sequence for presentation P1. Figures (b), (d), (e), and (f) are the presentation sequence for presentation P2. Figures (e) and (f) are two timelines for selections B1 and B2, respectively.
5.8 Augmented Transition Network: (a) is the ATN network for two multimedia presentations which start at the states P1/ and P2/, respectively. (b)-(d) are part of the subnetworks of (a). (b) models the semantic objects in video media stream V1, (c) models the semantic objects in image media stream I1, and (d) models the keywords in text media stream T1. In (e), the "Get" procedure accesses an individual media stream. "Display" displays the media streams. "Next_Symbol(Xi)" reads the input symbol Xi. "Next-State" is a procedure to advance to the next state. "Start_time(Xi)" gives the pre-specified starting time of Xi. User thinking time is accounted for by the Delay variable. θ is a parameter.
6.1 (a) Temporal relations represented graphically by a timeline representation. (b) The corresponding Petri net (OCPN) representation for the temporal relations.
6.2 A unified OCPN model.
6.3 The n-ary temporal relations.
6.4 The forward and reverse relations and intervals.
6.5 Partial interval evaluation.
List of Tables

2.1 Classification of Selected Semantic Models.
2.2 Operators in TCSP.
3.1 Comparison of Several Video Parsing Approaches.
3.2 Characteristics of Several Selected Video Data Modeling Approaches.
5.1 Three-dimensional relative positions for semantic objects: The first and the third columns indicate the relative position numbers while the second and the fourth columns are the relative coordinates. (xt, yt, zt) and (xs, ys, zs) represent the X-, Y-, and Z-coordinates of the target and any semantic object, respectively. The "≈" symbol means the difference between two coordinates is within a threshold value.
5.2 Condition and action table: The Get procedure accesses an individual media stream. Get-Symbol is a procedure to read the next input symbol of the multimedia input string. Next-State is a procedure to advance to the next state in the ATN. The Display procedure displays the media streams. θ, τ, and ρ are the parameters.
5.3 The trace of the ATN for the specified browsing sequence.
5.4 The trace of the ATN for presentation P1.
5.5 Continuation of Table 5.4 if B1 is chosen.
5.6 Continuation of Table 5.4 if B2 is chosen.
6.1 Temporal Parameters of the Unified Model in Figure 6.2 (Pα TR Pβ).
6.2 n-ary Temporal Constraints.
6.3 Temporal Parameter Conversions (Pα TR Pβ to Pα-r TRr Pβ-r).
6.4 n-ary Temporal Parameter Conversions.
Preface
The objective of this book is to provide a survey of different models and to cover state-of-the-art techniques for multimedia presentations, multimedia database searching, and multimedia browsing, so that readers can gain an understanding of the issues and challenges of multimedia information systems. As more information sources become available in multimedia systems, the development of abstract models for video, audio, text, and image data becomes very important. The pros and cons of the different models for multimedia information designs are discussed in this book. In addition, this book covers most of the recent work published in prestigious journals and conference proceedings such as IEEE Transactions on Knowledge and Data Engineering, ACM Multimedia Systems Journal, Communications of the ACM, IEEE Computer, IEEE Multimedia, ACM SIGMOD, and so on. This book is aimed at general readers who are interested in the issues, challenges, and ideas underlying the current practice of multimedia presentation, multimedia database searching, and multimedia browsing in multimedia information systems. It will also be of interest to university researchers, scientists, industry professionals, software engineers, graduate students, and undergraduate students who need to become acquainted with multimedia technology, and to all those who wish to gain a detailed technical understanding of what multimedia information systems involve. This book is organized in a way that makes the ideas accessible to readers who are interested in grasping the basics, as well as to those who would like more technical depth.
The first chapter introduces multimedia information applications, the need for the development of multimedia database management systems (MDBMSs), and the important issues and challenges of multimedia systems. With the increasing complexity of real-world multimedia applications, multimedia systems require the management and delivery of extremely large bodies of data at very high rates, and may require delivery under real-time constraints. The applications expected to benefit enormously from multimedia technologies and multimedia information systems include remote collaboration via video teleconferencing, improved simulation methodologies for all disciplines of science and engineering, and better human-computer interfaces. Also, a multimedia system should be able to accommodate the heterogeneity that may exist among the data. Hence, a new design of an MDBMS is required to handle the temporal and spatial requirements, and the rich semantics, of multimedia data such as text, image, audio, and video. The purpose of the design and development of an MDBMS is to efficiently organize, store, manage, and retrieve multimedia information from the underlying multimedia databases. In other words, an MDBMS should have the ability to model the varieties of multimedia data in terms of their structure, behavior, and function. The issues and challenges discussed in this chapter include:
formal semantic modeling techniques
indexing and searching methods
synchronization and integration modeling
formal query languages
data placement schemas
architecture and operating system support
distributed database management
multimedia query support, retrieval, and browsing
The second chapter discusses the temporal relations, the spatial relations, the spatio-temporal relations, and several semantic models for multimedia information systems. As more information sources become available in multimedia systems, the development of abstract semantic models for multimedia information becomes very important. An abstract semantic model should be rich enough to provide a friendly
interface for specifying multimedia presentation synchronization schedules to the users, and should also serve as a good programming data structure for controlling multimedia playback. In other words, the models must be able to support the specification of temporal constraints on multimedia data, and these constraints must be satisfied at runtime. The use of a model that can represent the temporal constraints on multimedia data makes it easier to satisfy these constraints at presentation time. The semantic models can be classified into the following categories:
timeline models
time-interval based models
graphic models
Petri-net models
object-oriented models
language-based models
augmented transition network (ATN) models
Several existing models in each category are introduced in this chapter. Some models are primarily aimed at the synchronization aspects of multimedia data, while others are more concerned with the browsing aspects of the objects. The former models lend themselves readily to an ultimate specification of the database schema. Some models, such as those based on graphs and Petri nets, have the additional advantage of pictorially illustrating synchronization semantics and are suitable for visual orchestration of multimedia presentations. The third chapter introduces the issues in multimedia database searching. Multimedia database searching requires semantic modeling and knowledge representation of the multimedia data. Two criteria are considered to classify the existing approaches to modeling multimedia data, especially video data: the level of abstraction and the granularity of data processing. Based on these two criteria, several classes of approaches employed in modeling video data are compared in this chapter. Some important issues are discussed here. They include:
image segmentation and image segmentation techniques
video segmentation and video parsing
motion detection and tracking approaches
iconic-based grouping and browsing approaches
object recognition approaches
knowledge-based event modeling approaches
content-based retrieval
The fourth chapter discusses the issues in multimedia browsing and introduces several existing multimedia browsing systems. Cataloging and indexing of video is a critical step in enabling intelligent navigation, search, browsing, and viewing of digital video. While the importance of a seamless integration of querying, searching, browsing, and exploration of data in a digital library collection is recognized, this chapter focuses on the challenges associated with video browsing. An increasing number of digital library systems allow users to access not only textual or pictorial documents, but also video data. Digital library applications based on huge amounts of digital video data must be able to satisfy complex semantic information needs, and require efficient browsing and searching mechanisms to extract relevant information. In most cases, users have to browse through parts of the video collection to get the information they want, information that addresses the contents and the meaning of the video documents. Hence, a browsing system has to provide support for this kind of information-intensive work. The browsing systems introduced in this chapter include:
a client/server architecture-based browsing system
the VideoQ browsing system
the CueVideo browsing system
the Informedia browsing system
the augmented transition network (ATN) browsing system
In addition, the importance of key frame selection algorithms is also included, and some of the key frame selection algorithms are discussed in this chapter. Two case studies are given in the fifth and the sixth chapters. The fifth chapter introduces the augmented transition network (ATN) model and the sixth chapter introduces the object composition Petri net (OCPN) model. These two models were proposed by the authors of this book. The ATN model has the capability to model multimedia presentations, multimedia database searching, and multimedia browsing. Also, the temporal, spatial, or spatio-temporal relations of various media streams and semantic objects can be captured by the proposed ATN model. The OCPN model, on the other hand, is based on the logic of temporal intervals and Timed Petri Nets to store, retrieve, and communicate between multimedia objects. These two models are discussed in detail in these two chapters.

SHU-CHING CHEN, R. L. KASHYAP, AND ARIF GHAFOOR
Acknowledgments
Much of the information and insight presented in this book was obtained from recent work published in prestigious journals and conference proceedings such as IEEE Transactions on Knowledge and Data Engineering, ACM Multimedia Systems Journal, Communications of the ACM, IEEE Computer, IEEE Multimedia, ACM SIGMOD, and so on. The authors would like to thank Prof. Mei-Ling Shyu of the Department of Electrical and Computer Engineering, University of Miami, for her pertinent and constructive comments that helped us improve this book significantly. We would like to thank Dr. Srinivas Sista for his valuable suggestions and comments on image/video segmentation. The work presented in this book has been supported, in part, by NSF grants IRI-9619812 and IIS-9974255.
Chapter 1
INTRODUCTION
1. INTRODUCTION
In multimedia systems, a variety of information sources – text, voice, image, audio, animation, and video – are delivered synchronously or asynchronously via more than one device. The important characteristic of such a system is that all of the different media are brought together into one single unit, all controlled by a computer. Normally, multimedia systems require the management and delivery of extremely large bodies of data at very high rates, and may require delivery under real-time constraints. The major challenge for every multimedia system is how to synchronize the various data types from distributed data sources for multimedia presentations [21]. Moreover, considerable semantic heterogeneity may exist among the users of multimedia data due to differences in their perceived interpretation or intended use of the data. Semantic heterogeneity has been a difficult problem in conventional databases, and even today the problem is not clearly understood. From the integration point of view, conventional data modeling techniques lack the ability to manage the composition of multimedia objects in a heterogeneous multimedia database environment. For a large number of multimedia applications, it may be required to integrate the various types of data semantically.

In traditional database management systems (DBMSs), such as relational database systems, only textual and numerical data is stored and managed in the database, and there is no need to consider synchronicity among media. Retrieving data is often based on simple comparisons of text or numerical values, which is no longer adequate for multimedia data. The relational data model has the drawback of losing semantics, which can cause erroneous interpretation of multimedia data. In addition, the relational data model has limited capabilities in modeling the structural and behavioral properties of real-world objects, since complex objects are not directly modeled and the behavioral properties of objects are not explicitly specified in terms of meaningful operations and applicable knowledge rules [164]. Limitations of the relational model are also obvious when semantic modeling of time-dependent multimedia data (video or audio) is considered. The relational models lack facilities for the management of spatio-temporal relations and do not cover all features required for multimedia database retrieval [183]. Although some relational DBMSs have started to support access to multimedia objects in the form of pointers to binary large objects (BLOBs), they are incapable of interactively accessing various portions of objects, since a BLOB is treated as a single entity in its entirety [70].

Object-oriented data models have been proposed as data models that provide a system with better facilities for managing multimedia data [16, 169]. In object-oriented database systems, object-oriented data models offer a number of powerful features such as inheritance, information hiding, polymorphism, and type/class mechanisms, and may include image data. Object-orientation encapsulates data with a set of operations that are applicable to the data, so that there is no need to worry about the heterogeneity of operations caused by different types of data when manipulating the data. In addition, the definition of a composite object – an object consisting of other objects – provides the capability to handle the structural complexity of the data.
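As a minimal illustration (the class names and operations here are hypothetical, not taken from any particular DBMS), the contrast between a BLOB treated as a single entity and a composite object whose portions can be accessed individually can be sketched as follows:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A BLOB-style store treats a media object as one opaque entity:
# the only operations are whole-object reads and writes.
class BlobStore:
    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]          # all-or-nothing retrieval

# A composite object, by contrast, is built from smaller objects and
# supports interactive access to individual portions (here, a frame range).
@dataclass
class Frame:
    index: int
    data: bytes

@dataclass
class VideoObject:
    title: str
    frames: List[Frame] = field(default_factory=list)

    def clip(self, start: int, end: int) -> List[Frame]:
        """Access only a portion of the object."""
        return [f for f in self.frames if start <= f.index < end]

store = BlobStore()
store.put("video1", b"\x00" * 1024)
whole = store.get("video1")              # must retrieve the entire object

video = VideoObject("demo", [Frame(i, bytes([i])) for i in range(10)])
portion = video.clip(3, 6)               # frames 3, 4, and 5 only
```

The BLOB store can only hand back all 1024 bytes at once, while the composite object lets a client ask for exactly the frames it needs.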
However, as indicated in [183], the core of the object-oriented data models lacks facilities for the management of spatio-temporal relations, though it provides a system with operational transparency and enables the definition of the part-of relationship among objects, which may take an arbitrary structure. In other words, the DBMS is still not designed to support multimedia information. Therefore, a multimedia extension is needed to handle the mismatch between multimedia data and conventional object-oriented database management systems [42]. The purpose of the design and development of a multimedia database management system (MDBMS) is to efficiently organize, store, manage, and retrieve multimedia information from the underlying multimedia databases. Due to the heterogeneous nature of multimedia data and their characteristics, it is very unlikely that the design of an MDBMS can follow the footsteps of the design of a traditional DBMS, since the MDBMS must have considerably more capabilities than conventional information-management systems. Some of the important characteristics of multimedia objects are listed as follows [135]:
Multimedia objects are complex and therefore less than completely captured in an MDBMS.
Multimedia objects are audiovisual in nature and are amenable to multiple interpretations.
Multimedia objects are context sensitive.
Queries looking for multimedia objects are likely to use subjective descriptors that are at best fuzzy in their interpretation.
Multimedia objects may be included in fuzzy classes.
Hence, it is suggested that MDBMSs be developed as libraries from which only the minimum required functionalities are compiled to meet the needs of an application [135]. Furthermore, in [106] the authors define ten requirements for modeling multimedia data. These ten requirements are:

1. There is a need to specify incomplete information.
2. There is a need to extend the definition of an individual document beyond the definitions of its type.
3. There is a need to integrate data from various databases and to handle them uniformly.
4. There is a need to describe structured information.
5. There is a need to distinguish between the internal modeling and the external presentation of objects.
6. There is a need to share data among multiple documents.
7. There is a need to create and to control versions.
8. There is a need to include operations.
9. There is a need to handle concurrent access control.
10. There is a need to handle context-free and context-sensitive references.

A multimedia environment should not only display media streams to users but also allow two-way communication between users and the multimedia system. The multimedia environment consists of a multimedia presentation system and a multimedia database system. If a multimedia environment has only a presentation system but not a multimedia database system, then it is like a VCR or a TV without feedback from the user. A multimedia database system allows users to specify queries for information. The information may relate to text data as well as image or video content. By combining a multimedia presentation system and a multimedia database system, users can specify queries which reflect what they want to see or know.
2. MULTIMEDIA INFORMATION APPLICATIONS
The areas expected to benefit enormously from multimedia technologies and multimedia information systems include advanced information management systems for a broad range of applications, remote collaboration via video teleconferencing, improved simulation methodologies for all disciplines of science and engineering, and better human-computer interfaces. Vast libraries of information including arbitrary amounts of text, video, pictures, and sound have the potential to be used more efficiently than the traditional book, record, and tape libraries of today [70]. As examples,
Radical changes in the computing infrastructure, spurred by multimedia computing and communication, will do more than extend the educational system – they will revolutionize it. Technological advances will make classrooms and laboratories much more accessible and effective. Technology, especially computer-assisted instruction (CAI), offers a new and non-threatening way to learn that differs from traditional school learning. CAI has been found to have beneficial effects on achievement in a wide variety of instructional settings. By utilizing CAI, the learner often avoids the frustration and aggravation encountered in a normal classroom setting. Multimedia, which refers to the coordinated, simultaneous use of several media (slides, films, audio tapes, video, text, graphics, sound, etc.), allows interaction and control by the learners and thus leads to CAI. These man-machine interfaces allow a flexible learning environment, since multimedia can do things that are difficult for an instructor to do, such as combining images from a video of a human heart with photographs of stars in deep space and integrating them with computer graphics and text for instructional purposes. The use of emerging multimedia technologies in education causes a major shift in the educational services paradigm that promises major advantages over present analog distance-learning systems. The exponential growth of the Internet has also provided a large and growing number of people with Internet access to information and services. Instant access to globally distributed information has popularized
video-based hypermedia, and thus it is possible to develop authoring and delivery systems for distance education (education-on-demand).
To predict the progression of certain diseases, researchers may need to analyze X-ray computed tomography (image), nuclear magnetic resonance (audio), and patient records (text).
The merchandise of a manufacturer can be advertised by providing audio descriptions, video demonstrations, and prices in textual format.
Banking in the global village becomes reality with multimedia communication systems that connect far-flung regions and integrate all their various marketplaces. In order to provide the highest quality financial products and services to customers, banks must apply multimedia communication systems which foster the collaboration of geographically dispersed people, exchange information and expertise when needed, and provide new links to customers.
The recent advances in computer hardware technology, coupled with computer visualization and image generation research (multimedia research), are significantly affecting TV production methods, which have remained unchanged since the introduction of color television. With the multimedia technologies used in the broadcasting industry, the potential for manipulating recorded images becomes virtually limitless. Moreover, video-on-demand (VOD) service, which eliminates the inflexibility inherent in today's broadcast cable systems, can be provided. A VOD server is a computer system that stores videos in compressed digital form and provides support for the concurrent transmission of different portions of the compressed video data to the various viewers.
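The essential idea behind a VOD server – one stored video, many viewers each receiving a different portion according to their own playback position – can be sketched in miniature (this is a toy illustration, not a real streaming protocol or product):

```python
# A toy sketch of the VOD idea: one stored, compressed video; many
# viewers, each served a different portion concurrently according to
# their own playback position.
class VODServer:
    CHUNK = 4                        # bytes delivered per request (tiny, for illustration)

    def __init__(self, video: bytes) -> None:
        self.video = video
        self.positions = {}          # viewer id -> current byte offset

    def connect(self, viewer: str, start: int = 0) -> None:
        self.positions[viewer] = start

    def next_chunk(self, viewer: str) -> bytes:
        pos = self.positions[viewer]
        chunk = self.video[pos:pos + self.CHUNK]
        self.positions[viewer] = pos + len(chunk)
        return chunk

server = VODServer(b"ABCDEFGHIJKL")
server.connect("alice")              # starts at the beginning of the video
server.connect("bob", start=8)       # joins at a later point in the video
first = server.next_chunk("alice")   # b"ABCD"
later = server.next_chunk("bob")     # b"IJKL"
```

Each viewer's position advances independently, so the same stored object serves many playback points at once, unlike a broadcast channel.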
3. ISSUES AND CHALLENGES
With the increasing complexity of real-world multimedia applications, multimedia systems require the management and delivery of extremely large bodies of data at very high rates, and may require delivery under real-time constraints. For multimedia systems, the ways to incorporate diverse media with diverse characteristics challenge researchers in the community. Many types of real-world knowledge can be represented by describing the interplay among objects (persons, buildings, events, etc.) in the course of time and their relationships in space. An application may need to store and access information about this knowledge, which can be expressed by complex spatio-temporal logic [69].
A multimedia system should be able to accommodate the heterogeneity that may exist among the data. Moreover, a new design of a multimedia database management system (MDBMS) is required to handle the temporal and spatial requirements, and the rich semantics, of multimedia data such as text, image, audio, and video. In other words, an MDBMS should have the ability to model the varieties of multimedia data in terms of their structure, behavior, and function. For example, the model should be able to represent the basic media types (such as text, image, video, and audio) and possibly be able to explicitly capture the spatial, temporal, and spatio-temporal relationships among multimedia data. The temporal requirements are that media need to be synchronous and to be presented at the specified times given at authoring time. The spatial requirement is that the DBMS needs to handle the layout of the media at a certain point in time. For image and video frames, the DBMS needs to keep the relative positions of semantic objects (building, car, etc.). The applications shown in the previous section are just samples of the things possible with the development and use of multimedia databases. As the need for multimedia information systems grows rapidly in various fields, the management of such information is becoming a focal point of research in the database community. This potentially explains the explosion of research in the areas related to the understanding, development, and use of multimedia-related technologies. In general, the storage, transportation, display, and management of multimedia data require more functionality and capability than a conventional DBMS provides, because of the heterogeneous nature of the data. For an MDBMS to serve its expected purpose, the following subsections discuss some of the prominent issues and challenges posed by multimedia data in the design of an MDBMS [2, 69, 70, 135].
3.1
FORMAL SEMANTIC MODELING TECHNIQUES
The central part of a multimedia database system is its data model which isolates user applications from the details of the management and structure of the multimedia database system. The development of an appropriate data model which serves this purpose and has the ability to organize the heterogeneous data types is crucial [2]. Because of the nature of multimedia data, certain considerations are required when selecting the data model. For example, video and image data need special data models for the temporal, spatial, or spatio-temporal relations of the data.
Introduction
7
Multimedia data are audiovisual in nature and can be complex objects that are made up of smaller individual objects (e.g., a complex video stream). In addition, multimedia data are in general multidimensional. Besides the media types such as image, video, audio, graphics, and text, the spatial, temporal, and real-time dimensions also need to be considered. For example, the analysis of a city map encompasses temporal and spatial dimensions at the application level, while using image and text media objects at the lower level. Unlike the data in traditional DBMSs, the multimedia data stored in an MDBMS are usually only poor approximations of the real-world objects they represent. Moreover, multimedia information can be represented in imprecise and incomplete ways in an MDBMS. How to handle the imprecise and incomplete queries issued to an MDBMS to access multimedia data is a challenge for the design of an MDBMS. Hence, the development of formal semantic modeling techniques for multimedia information, especially for video and image data, to provide an adequate and accurate representation of the content of audiovisual information in a database system is a necessity for an MDBMS. A semantic model that forms the basis for the design of an MDBMS should have adequate expressive power. Here, the expressive power of a semantic model indicates the ability to express in such a model many varieties of multimedia objects in terms of their structure, behavior, and function. The structure of an object includes its attributes and its content. The behavior of an object is defined as the set of messages it understands, responds to, and initiates. The function of an object is an explicit definition of the logical role of the object in the world represented by the MDBMS.
In other words, a semantic model should be rich in capabilities for abstracting multimedia information and capturing semantics, and should be able to provide canonical representations of complex images, scenes, and events in terms of objects and their spatio-temporal behaviors. [97] is a special issue of IEEE Transactions on Software Engineering on image databases. In this special issue, an object-oriented database system based on the concept of extensibility, which allows new object classes to be added to the data model, is proposed; a query language called PicQuery, which presents the user with a list of available operators, is designed; a visual query interface to be used as an alternative to traditional query interfaces based on SQL-like languages is proposed; information extraction from images of paper-based maps is discussed; a new method of image compression that uses irreducible covers of maximal rectangles to represent an image is presented; etc.
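The structure/behavior/function decomposition described above can be sketched in code. The following is a minimal illustrative Python sketch, not a design from the book; all class, field, and message names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three aspects of a semantic model discussed
# above: structure (attributes and content), behavior (messages the object
# understands and responds to), and function (its logical role in the
# world represented by the MDBMS). Names here are illustrative only.

@dataclass
class MediaObject:
    # structure: attributes and raw content
    media_type: str                      # "text", "image", "video", or "audio"
    attributes: dict = field(default_factory=dict)
    content: bytes = b""
    # function: the logical role this object plays in the modeled world
    role: str = ""

    # behavior: the messages the object responds to
    def handle(self, message: str):
        if message == "describe":
            return f"{self.media_type} object playing role '{self.role}'"
        if message == "size":
            return len(self.content)
        raise ValueError(f"unknown message: {message}")

clip = MediaObject("video", {"duration_s": 12.5}, b"\x00" * 1024,
                   role="slam-dunk scene")
print(clip.handle("describe"))   # video object playing role 'slam-dunk scene'
print(clip.handle("size"))       # 1024
```

A fuller model would add explicit spatial and temporal relationships among such objects, as the surrounding text requires.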
[20] proposed an MDBMS which consists of an interactive layer, an object composition layer, and a DBMS layer. The top of the architecture is the interactive layer, which provides functionalities such as database maintenance, hypertext navigation, relational query, media editing, and the user interface. The object composition layer is the middle layer of the architecture and is used to handle temporal synchronization and spatial integration. The bottom of the architecture is the DBMS layer, which manages and stores formatted data and unformatted data such as image, video, and audio. [79] proposed a four-layer data model for modeling image data by combining functional and object-oriented models. Image representations and relations, image objects and relations, domain objects and relations, and domain events and relations are captured at the first, second, third, and fourth layers, respectively. Any incomplete similarity-based query is processed by determining which layer(s) should be used for the query. A two-level hierarchical model for representing objects is developed in [77]. In their design, the entire object is represented at the top level and the boundaries are represented at the bottom level. The similarity in boundary segments is used to hypothesize on the top-level objects in their model. Other research on semantic data models for multimedia data includes a model for multimedia documents [125], a multimedia information management system on top of ORION, an object-oriented database system developed at the Microelectronics and Computer Technology Corporation [177], GRIM-DBMS, a DBMS for graphic images [146], a model for temporal multimedia objects [113], an MDBMS which includes temporal attributes and relationships [145], etc.
3.2
INDEXING AND SEARCHING METHODS
Search in multimedia databases can be computationally intensive, especially if content-based retrieval is needed for image and video data stored in compressed or uncompressed form. Hence, there is a need for the design of powerful indexing and searching methods for multimedia data. Similar to the traditional database systems, multimedia data can be retrieved using object identifiers, attributes, keywords, and so on. For image data, various features such as color, shape, texture, spatial information, and so forth, have been used to index images. For video data, the video sequence is separated into constituent video frames and some representative frames are used to index the video data. For audio data, content-based indexing has been used because of the perceptual
and acoustic characteristics of audio data. Another issue is making indexing fast and index storage efficient for easy access. The different media involved in multimedia database systems require special methods for optimal access, indexing, and searching. Also, data may be stored locally in singular database systems or remotely in multiple databases. Therefore, it is essential to have a mechanism for multimedia data and documents that can respond to the various types of querying and browsing requested by different user applications. Indexing in an MDBMS is hierarchical and takes place at multiple levels [135]. At the highest level, the class hierarchy of an application is itself treated as an index. However, the classes in such a class hierarchy are often fuzzy classes or fuzzy sets, which are totally different from the notion of class in object-oriented programming. At the next level, indexes are built on attribute and content values. Indexes on attribute values have been used extensively and successfully in traditional DBMSs, while indexes on content values are still a challenging task for multimedia information systems. Some existing techniques such as quadtree and R-tree indexing are well suited for indexing GIS (Geographic Information System) applications but may not be good for other applications. Indexes on content values can be semantic indexes that codify the content. Such a "coded" index can be used as an index or a code for the purpose of selection and retrieval. Another level of indexing is structural indexes for complex multimedia objects. An example is the structural indexes in a document retrieval system. Based on the structural indexes, a logical part of a document can be retrieved according to its structure. The last level of indexing is to maintain the mapping across the representations of a multimedia object.
The mapping can be either between the physical and logical representations of a multimedia object or between a multimedia object and its compressed form. However, this kind of index is used by only a very small number of users. Unlike the traditional DBMSs, which are designed and implemented to find exact matches between the stored data and the query parameters for precise queries, MDBMSs need to handle incomplete and fuzzy queries. Due to the facts that the representation of multimedia objects in an MDBMS is always less than complete and that any multimedia object is amenable to different interpretations by different users, searching in an MDBMS should support not only exact matches but also approximate matches through neighborhood searches. Some form of classification or grouping of information may be needed to help the search. [38] proposed an intelligent image database system with the use of 2D strings to represent images. The 2D strings are partial orders of
objects of interest in an image from left to right and from bottom to top. Three types of spatial relationships, nonoverlapping, partially overlapping, and completely overlapping, are defined using minimum enclosing rectangles. The concepts of semantic panning and zooming are proposed in [165]. In their approach, multimedia objects are categorized and organized into a tree, with semantic panning referring to traversal across siblings and semantic zooming referring to traversal to either a higher-level or a lower-level object along a given path. Hence, the approach allows the design of a reasonably clear presentation system, since the amount of information or the number of objects to be displayed at any given time can be controlled through a threshold. Another study, presented in [104], uses the spatial distribution of gray levels, spatial frequency, a local correlation measure, and local contrast to index and search graphical objects. These four measures are computed for the input symbol. Users can hand-draw a symbol they want to search for, and the similarity measures are computed against the collection of regularly drawn graphic symbols in the database.

3.3

SYNCHRONIZATION AND INTEGRATION MODELING
A multimedia database system contains heterogeneous media. Unlike traditional databases, which only deal with text information, multimedia databases contain multiple data types such as video, audio, image, etc. Because of this, a multimedia database system needs to consider integration, presentation, and quality of service (QoS) issues. When video and audio multimedia data are presented together, the problems of media integration and synchronization become important. Moreover, audio and video data are time-dependent, so certain amounts of data must be presented within a given time for a natural presentation. QoS is also important, since QoS parameters qualitatively specify the acceptable levels of performance from the user's perspective. Therefore, a good mechanism to integrate the heterogeneous data and to provide synchronous presentation while still meeting other requirements such as QoS is a crucial issue for a multimedia database system. An MDBMS has to be able to handle real-time data. Real-time complex multimedia data impose synchronization constraints on storage and retrieval in an MDBMS and at the same time place demands on both the communication infrastructure and the MDBMS. Real-time multimedia objects are normally large objects which require consideration of disk-layout and distribution constraints, especially when multiple requests need to be serviced concurrently, as in video-on-demand (VOD)
applications. Moreover, information retrieval needs to consider the synchronization constraints for queries. In most cases, the synchronization constraints can be translated into layout constraints within a disk or across several disks. Furthermore, a multimedia system integrates text, images, audio, graphics, animation, and full-motion video in a variety of application environments. For traditional text-based DBMSs, data access and manipulation have advanced considerably. However, information of all sorts in varied formats is highly volatile. For multimedia database systems, the ways to incorporate diverse media with diverse characteristics challenge researchers in the community. An important aspect of a multimedia system is the integration of media retrieved from databases distributed across a network, since all of the different media are brought together into a single unit controlled by the computer. In order to model the rich semantics of multimedia data, models for specifying media synchronization and integration requirements must be developed. In other words, these models should be integrated with the monomedia database schema and subsequently be transformed into a meta-schema to determine the synchronization requirements at retrieval time. Hence, it may be necessary to design object-retrieval algorithms for the operating system and to integrate these models with higher-level information abstractions such as hypermedia or object-oriented models. There is a considerable research focus on the synchronization issues of multimedia data. Berra et al. [20] are among the earliest groups to identify object synchronization as a requirement for developing multimedia databases. They use a modified Petri net, the Object Composition Petri Net (OCPN) model [118, 120], for modeling the synchronization of multimedia data. In their approach, they suggest the use of a "trigger" mechanism to preserve temporal integrity among multiple objects.
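The Petri-net idea behind OCPN-style synchronization can be illustrated with a drastically simplified sketch: playout transitions fire only when all of their input places hold tokens, so neither stream advances past a synchronization point before the other. This is an illustrative toy, not the actual OCPN formalism, and all place and transition names are hypothetical.

```python
# Toy Petri-net-style synchronizer, in the spirit of the OCPN discussion
# above. A transition is enabled only when every input place has a token,
# which enforces the synchronization point between audio and video.

class PetriNet:
    def __init__(self):
        self.tokens = {}            # place name -> token count
        self.transitions = {}       # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)
        for p in inputs + outputs:
            self.tokens.setdefault(p, 0)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.tokens[p] > 0 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise RuntimeError(f"{name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.tokens[p] -= 1
        for p in outputs:
            self.tokens[p] += 1

net = PetriNet()
# Both media segments must finish before scene 2 starts (a sync point).
net.add_transition("play_audio_1", ["audio_ready"], ["audio_done"])
net.add_transition("play_video_1", ["video_ready"], ["video_done"])
net.add_transition("start_scene_2", ["audio_done", "video_done"], ["scene2"])

net.tokens["audio_ready"] = net.tokens["video_ready"] = 1
net.fire("play_audio_1")
print(net.enabled("start_scene_2"))   # False -- video has not finished yet
net.fire("play_video_1")
print(net.enabled("start_scene_2"))   # True
```

The real OCPN additionally attaches durations and resource annotations to places, which this sketch omits.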
Another paper, [162], studies the synchronization properties in multimedia systems. Research on the issues of integration includes the integration of expert systems with multimedia technologies [136], the integration of fuzzy logic and expert system technologies [188], and the use of similarity measures and prototypicality [170], among others.
3.4
FORMAL QUERY LANGUAGES
Users of a multimedia information system often describe their queries vaguely, fuzzily, or subjectively, since they do not have well-defined queries to start with. Besides that, some parts of a query can even be nontextual. Conventional query languages such as SQL are not very suitable
for MDBMSs, since they are geared toward structured attribute data and are not well suited to content-based and fuzzy queries on multimedia data. Hence, audiovisual and interactive query languages are a must for MDBMSs, in the sense that they are able to handle ambiguous or fuzzy query descriptions and to provide audiovisual query-refinement techniques. In such audiovisual and interactive query languages, queries are represented with alphanumeric and audiovisual "words" or linguistic elements and are composed using multimedia objects, their attributes, and their content. From the audiovisual query refinement perspective, as suggested in [84], relevance feedback can be used as the underlying paradigm for query refinement. Generally speaking, formal multimedia query languages are required to provide the capabilities to express arbitrarily complex semantic and spatio-temporal schemas associated with composite multimedia information, and to allow imprecise-match retrieval. These query languages should support the manipulation of content-based and fuzzy functions for nontextual and fuzzy queries on multimedia objects. Furthermore, any query interface for an MDBMS should provide capabilities for both querying and browsing. Browsing techniques may use hypermedia links in a navigational mode or other types of support in a nonnavigational mode. Research related to query languages and processing is reported in [28, 38, 82, 101, 151]. The authors in [38] proposed a query-by-pictorial query language for image/pictorial database systems. A query language which can handle queries on analogue forms of images is proposed by Roussopoulos et al. [151] for pictorial databases. They use dialects of R-trees for indexing the analogue images and use this index to answer queries. An interactive query approach using a layout editor for image databases is proposed in [82].
At the beginning, a user gives the names or labels of the objects of interest. Next, these names or labels are used to import graphic representations of the objects into a layout editor. Then, the user expresses the desired query by repositioning the objects spatially. Finally, the layout editor content is submitted as a query to the system to retrieve the images matching the graphical representations of the objects and their spatial relationships. The images retrieved are shown in ranked order. Joseph and Cardenas [101] designed their query language, called PicQuery, to prompt the user with a list of available operators as a single collection or a group. A large number of operators are defined: two functional operators, seven image manipulation operators, eight
spatial or geometric operators, and ten pattern recognition operators. The functional operators are common functions and statistical functions. The seven image manipulation operators are panning, rotation, zooming, superimposition, masking, color transformation, and projection. The spatial or geometric operators are distance, length, area, line similarity, region similarity, intersection, union, and difference. The pattern recognition operators are edge detection, thresholding, contour drawing, template matching, boundary definition, texture measure, clustering, segmentation, interpolation, and nearest neighbor. Various research efforts on visual query languages are reviewed and a taxonomy is suggested by [18]. The expressive power of graphical query languages is discussed in [34]. Extensions to graphical query languages are suggested in [154].
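A few of the spatial or geometric operator kinds listed above (distance, area, and intersection) can be illustrated on axis-aligned minimum enclosing rectangles. The following is a generic geometry sketch, not the actual PicQuery implementation; the object rectangles are made-up values.

```python
import math

# Illustrative versions of three spatial/geometric operator kinds named
# above, operating on points (x, y) and axis-aligned rectangles
# (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.

def distance(p, q):
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def area(rect):
    """Area of a rectangle."""
    x1, y1, x2, y2 = rect
    return (x2 - x1) * (y2 - y1)

def intersection(a, b):
    """Intersection rectangle of two rectangles, or None if disjoint."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    if x1 < x2 and y1 < y2:
        return (x1, y1, x2, y2)
    return None

building = (0, 0, 10, 5)   # hypothetical minimum enclosing rectangles
car = (8, 3, 14, 7)
print(distance((0, 0), (3, 4)))       # 5.0
print(area(building))                 # 50
print(intersection(building, car))    # (8, 3, 10, 5)
```

Union and difference, as well as the similarity operators, would follow the same pattern but require region (rather than single-rectangle) results.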
3.5
DATA PLACEMENT SCHEMAS
Multimedia objects can be very large and/or real-time. When several large and/or real-time multimedia objects need to be stored on a common set of storage resources, and a minimum number of concurrent requests for such objects must be supported, the development of efficient and optimal data placement schemas for physical storage management, on both single and parallel disk systems, is important. In addition, while placing multimedia data on the disk tracks, it is desirable to reduce the average seek time and maximize the system throughput. These requirements place significant emphasis on designing efficient data placement strategies. The following characteristics of multimedia data influence the design of efficient data placement strategies [111]:

1. multiplicity of media streams
2. continuity of recording and retrieval
3. large quantities of data

With all the euphoria surrounding the potential benefits of multimedia information systems come real technological challenges that push the available hardware, as well as the ingenuity of human thought, to their limits. Several issues are involved in multimedia object storage, such as the huge volumes of multimedia data, the limited available storage space, the limited bandwidths of the storage system and communication channel, and the availability rates of the multimedia data. Some of the hardware problems are [70]:
- Though the storage devices are usable online with the computers, they are not big enough.
- The speed of retrieval from the available storage devices, including disks, is not sufficient to cope with the demands of multimedia applications.
- The speed of storing multimedia data onto the disks is relatively slow.
- The bandwidths of the storage system and communication channel are limited. For example, a single object may demand large portions of bandwidth for extended periods of time. Multimedia data are delay-sensitive, which makes the communication problems even worse.
- Cache memories are too small for multimedia data, though they are precious resources.
Because of these problems, methods to transform, manage, transfer, and distribute multimedia data depend heavily on multimedia object storage mechanisms. Random access memories, magnetic disks, optical tapes, and optical storage are some typical storage devices for multimedia data. Another method is to use data compression schemes together with data transformation. The process is to transform the multimedia data into some transform space to remove the redundancies in the original data, and then to code the transformed data for storage or transmission. A decompression scheme later decodes and re-transforms the data back to its original form [2]. Extensive research efforts on multimedia storage servers using optical disks are reported in [12]. In addition, the demands for efficient resource management and effective storage technologies for multimedia applications, as well as other high-performance applications, are even greater and deserve high priority. A survey of the state of the art in the architectures and algorithms involved in the efficient physical management of large amounts of multimedia data is presented in [111]. Along with an introduction to the issues involved in the storage and retrieval of audio and video data, this paper also provides a starting point for further studies in the area.
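One widely used placement strategy for continuous media, consistent with the multiplicity and continuity characteristics listed above, is round-robin striping of a video's blocks across parallel disks, so that consecutive blocks are read from different disks and the array's aggregate bandwidth can sustain the playback rate. The following is a minimal sketch with arbitrary illustrative block and disk counts, not a placement scheme taken from the cited surveys.

```python
# Round-robin striping sketch: map each block of a continuous-media
# object to (disk, position on that disk). Consecutive blocks land on
# different disks, so concurrent streams can overlap disk accesses.

def stripe(num_blocks, num_disks):
    """Return a placement map: block index -> (disk, position on disk)."""
    placement = {}
    position = [0] * num_disks          # next free slot on each disk
    for b in range(num_blocks):
        d = b % num_disks               # round-robin disk choice
        placement[b] = (d, position[d])
        position[d] += 1
    return placement

layout = stripe(num_blocks=8, num_disks=3)
print(layout[0])   # (0, 0) -- block 0 on disk 0
print(layout[1])   # (1, 0)
print(layout[3])   # (0, 1) -- wraps back to disk 0
```

Real placement schemes additionally account for zone bit recording, seek-time optimization, and fault tolerance, which this sketch ignores.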
3.6
ARCHITECTURE AND OPERATING SYSTEM SUPPORT
Due to the heterogeneity of multimedia information, the design and development of suitable architecture and operating system support is
inevitable. The heterogeneity dictates that the architecture of a general-purpose MDBMS must support a rich set of data-management and computational functionalities and that the operating system must support the real-time requirements of multimedia data. Since some multimedia data require real-time delivery and presentation, a multimedia database system cannot provide all of its functionalities unless the operating system supports real-time continuous multimedia data. Moreover, the I/O hardware needs to be able to support the various media types, and the communication networks need to handle situations such as limited bandwidth and delays. Other issues potentially related to system support are memory management, CPU performance, throughput, etc.
3.7
DISTRIBUTED DATABASE MANAGEMENT
In a networked multimedia environment, extensive coordination and management capabilities are needed among the distributed sites to provide location-transparent access and to support real-time delivery of data to distributed users. Any multimedia database system is distributed in the sense that different media are retrieved from different databases at disparate locations. Also, storage problems often force the information to be placed in different physical storage locations. Given that databases may be distributed, the layout and distribution constraints may not be local to a site. In addition, in order to access multiple but related multimedia objects stored in an MDBMS, synchronization constraints are imposed. To support data retrieval in such a distributed environment, several issues need to be addressed: distributed and parallel query processing, data security, bandwidth limits, and network delays. Some of the early research on the issues for distributed multimedia database systems is discussed in [26, 20]. For example, the distribution of different media objects across multiple servers and the use of an object graph metaphor for planning access and retrieval from multiple servers are suggested for handling distributed multimedia database systems in [20].
3.8
MULTIMEDIA QUERY SUPPORT, RETRIEVAL, AND BROWSING
In traditional database systems, queries are processed using only the available indices. However, query support in a multimedia database system is much more complicated than in a traditional database system.
The reason is that matches in multimedia queries often are not exact matches, so a single query might result in multiple data items in response. Therefore, a browsing interface that lets user applications retrieve any information potentially related to the current results needs to be developed in association with query support. Hence, a mechanism which supports querying, data retrieval, and browsing is an important task for a multimedia database management system. Another important challenge is content-based retrieval in MDBMSs. Content-based retrieval is feasible only after the features are extracted and classified. Low-level image processing involves finding portions of the raw data which match the user's requested pattern. During such processing, features in the user's requested pattern need to be identified and matched against features stored in the database. The problem of pattern matching in image databases has been actively studied for over 20 years [98, 152]. A typical approach is to extract a set of features which concisely describes the given pattern, and then seek these features in the image. However, the extraction of features usually requires the assistance of humans and the placement of certain landmarks or registration marks, and only recently have researchers started addressing the problem of automatic feature extraction for a multimedia database covering a wide variety of objects [147]. Thus, features can be extracted at a coarse level with these marks [135]. If a finer level of feature extraction is desired, the extraction process becomes much more time-consuming. After the feature extraction process, the next steps are feature value normalization and feature classification. Feature value normalization ensures scale independence, and feature classification is used to index multimedia objects.
Since a multimedia object may need to be indexed on several features concurrently, the number of features and the domain size of each feature increase the index size, leading to larger indexes for multimedia objects than for conventional data. There is considerable research work on information retrieval for multimedia data. Content-based retrieval in text documents has been studied for a long time [6, 14, 105, 132, 150, 153]. Fuzzy logic-based retrieval is studied in [25, 131]. Scene retrieval for video databases using temporal condition changes is reported by [1]. A comprehensive survey of techniques, algorithms, and tools addressing content-based analysis and visual content representation is published in [3]. The location, identification, and delivery of digital audio and video in a distributed system for browsing applications have been addressed in [122] using distributed client-server architectures. Interactive browsing and content-based querying of a video database with subsequent playout of selected titles are allowed. In [36], an automated content-based video search system called VideoQ allows searching on a set of visual features and spatio-temporal relationships. The user can search the manually annotated videos by sketch or text, or browse the video shots. The application of direct manipulation techniques to visual representations of the feature space in interface design can be found in [49, 96]. Ishikawa et al. [96] prefer sliders to adjust feature values in the query, while Colombo et al. [49] allow users to employ the functions of a drawing tool to modify sketches and sample images, and use the modified images as a starting point for similarity searches. In [89, 90], a multimedia database management system that is extended with a retrieval engine, an intelligent client buffer strategy, and an admission control module is proposed. The retrieval engine calculates relevance values for the results of a conceptual query by feature aggregation at video shot granularity to offer conceptually content-based access. This engine is embedded in a browsing system architecture. Next, the intelligent client buffer strategy employs these relevance values to enable flexible user interactions during browsing. Finally, the admission control module admits the whole browsing session by predicting the required resources from the query results to reduce the startup delays within a session. More video browsing systems will be introduced in Chapter 4.
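The two post-extraction steps named earlier, feature value normalization for scale independence and approximate matching against the indexed features, can be sketched as follows. The feature values are made up for the example, and this is an illustrative sketch rather than any of the cited systems.

```python
# Sketch of content-based retrieval over extracted features: min-max
# normalize each feature so all features contribute on a comparable
# scale, then answer a query by nearest-neighbor (approximate) match
# instead of the exact match a traditional DBMS would require.

def min_max_normalize(vectors):
    """Scale each feature (column) into [0, 1] across the collection."""
    dims = len(vectors[0])
    lo = [min(v[i] for v in vectors) for i in range(dims)]
    hi = [max(v[i] for v in vectors) for i in range(dims)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]  # avoid divide-by-zero
    return [[(v[i] - lo[i]) / span[i] for i in range(dims)] for v in vectors]

def nearest(query, vectors):
    """Index of the stored vector closest to the query (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vectors)), key=lambda i: dist2(query, vectors[i]))

# Toy collection: (dominant hue, brightness, texture energy) per image.
features = [[10.0, 200.0, 0.1], [12.0, 90.0, 0.8], [200.0, 50.0, 0.4]]
normalized = min_max_normalize(features)
query = normalized[1][:]           # a query very similar to image 1
query[2] += 0.05
print(nearest(query, normalized))  # 1 -- an approximate, not exact, match
```

A production index would replace the linear scan with a multidimensional index structure such as an R-tree.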
Chapter 2
SEMANTIC MODELS FOR MULTIMEDIA INFORMATION SYSTEMS
1.
INTRODUCTION
Unlike traditional database systems, which contain text or numerical data, a multimedia database or information system may contain different media such as text, image, audio, and video. One of the inherent characteristics of multimedia data is heavy time-dependence: the data are usually related by temporal relations which have to be maintained during their playout. Therefore, how to synchronize the various data types for sophisticated multimedia presentations is a major challenge. As more information sources become available in multimedia systems, the development of abstract semantic models for multimedia information becomes very important. An abstract semantic model has two requirements. First, it should be rich enough to provide users with a friendly interface to multimedia presentation synchronization schedules. In other words, the model must be able to support the specification of temporal constraints on multimedia data, and the satisfaction of these constraints must be achievable at runtime. Second, it should be a good programming data structure for implementation to control multimedia playback. The use of a model that can represent the temporal constraints on multimedia data makes it easier to satisfy these constraints at presentation time. In addition, if the temporal relations of multimedia objects are specified by means of a suitable model, they can serve as guidelines to store, retrieve, and present the data. To keep the rich semantic information, the abstract semantic models should also allow users to specify the temporal and spatial requirements at the time of authoring the objects, and to store a great deal of useful information (such as video clip start/end time, start/end frame number,
and semantic objects' relative spatial locations). Though the temporal dependencies of multimedia objects are a key feature, the focus is on the synchronization of different components at presentation time. The most crucial part is that the playout of the various components of a multimedia presentation meets the successive deadlines. For example, the successive samples of an audio record must be played at a very precise rate in order to obtain high-quality sound [21]. To model multimedia presentations, the semantic models should have the ability to check the features specified by users in queries, and to maintain the synchronization and quality of service (QoS) desired. Moreover, the semantic models should be able to model the hierarchy of visual contents so that users can browse and decide on the various scenarios they want to see.
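The authored temporal constraints and the deadline check described above can be sketched as a small data structure: each component carries a start time and duration, and a proposed playout schedule is accepted only if every component starts within some tolerance of its deadline. All names, timings, and the tolerance value are hypothetical.

```python
from dataclasses import dataclass

# Sketch of authored temporal constraints: each media component gets a
# start time and duration (seconds, within the presentation timeline),
# and we verify that a proposed playout schedule meets every deadline.

@dataclass
class Component:
    name: str
    start: float      # authored start time within the presentation
    duration: float

def meets_deadlines(components, actual_starts, tolerance=0.05):
    """True iff every component starts within `tolerance` s of its deadline."""
    return all(
        abs(actual_starts[c.name] - c.start) <= tolerance
        for c in components
    )

presentation = [
    Component("video_clip", start=0.0, duration=30.0),
    Component("narration", start=0.0, duration=30.0),  # lip-sync with video
    Component("caption", start=5.0, duration=10.0),
]

on_time = {"video_clip": 0.0, "narration": 0.02, "caption": 5.01}
late = {"video_clip": 0.0, "narration": 0.4, "caption": 5.01}
print(meets_deadlines(presentation, on_time))  # True
print(meets_deadlines(presentation, late))     # False -- audio drifted 400 ms
```

A real scheduler would check rates continuously during playout rather than only start times, but the same structure carries the authored constraints.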
1.1
TEMPORAL RELATIONS
Temporal modeling is used to construct complex views or to describe events in multimedia data, especially video data. Events can be expressed by interpreting the collective behavior of physical objects over a certain period of time. In a simplistic manner, the behavior can be described by observing the partial or total duration during which an object appears. The relative movement of an object with respect to other objects over the sequence of frames in which it appears is analyzed for event identification. For example, consider a query which searches for the occurrence of a slam-dunk in a sports video clip. Modeling this particular event requires identification of at least two temporal subevents, which include precise tracking of the motions of the player involved in the slam-dunk and of the ball, especially when the ball approaches the hoop. The overall process of composing a slam-dunk event requires a priori specification of multiple temporal subevents [7]. There are two main approaches to representing temporal relations for multimedia data: point-based and interval-based representations.

1. In the point-based temporal relation representation approach, the positions of objects are represented by points on a timeline.

2. In the interval-based temporal relation representation approach, the positions of objects are represented by the intervals of their occurrences. Most studies use this approach to manage the temporal relations between video objects, since representing temporal relations by the interval-based approach is more perceptible to humans than by the point-based approach.

Temporal intervals are extensively used for modeling temporal events [10]. In [10], seven basic binary interval-based temporal relations
Semantic Models for Multimedia Information Systems
which are before, during, equal, meets, overlaps, starts, and finishes, are defined. Temporal intervals consist of time durations characterized by two endpoints, or instants. A time instant is a zero-length moment in time (e.g., 1:00 AM), while a time interval is defined by two time instants representing its endpoints together with the duration between them. The length of a temporal interval is given by the difference of its endpoint values. The relative timing between two intervals can be determined from these endpoints. Many later studies represent the temporal relations in their applications based on these seven temporal relations. For example, in [92], a document is defined with a tree structure where a parent node contains the definition of the temporal relations, while Little and Ghafoor [120] extend these basic binary temporal relations to n-ary temporal relations and discuss their reverse relations. Interval- and instant-based representations are well-studied topics [120]. In OVID [140], a video object is a video frame sequence (a meaningful scene) represented by intervals. A video object may consist of several sequences of continuous video frames. Each video object has its own attributes and attribute values to describe its content. In their study, two operations – merge and overlap – are developed to manipulate video objects. These two operations can also be applied to the textual annotation that denotes the contents of the video objects as part of the definition of each video object. Note that the point-based temporal relations and the interval-based temporal relations can be translated into each other. Comparison of an interval-based query with a point-based query is discussed in [168].
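As an illustration, the seven basic relations can be decided purely from endpoint comparisons. The sketch below, which uses our own encoding of intervals as (start, end) pairs rather than anything from the cited work, classifies the relation of one interval to another:

```python
def allen_relation(a, b):
    """Classify interval a = (start, end) against b using the seven
    basic binary relations of [10]; inverse relations fall through."""
    s1, e1 = a
    s2, e2 = b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 == e2:
        return "equal"
    if s1 == s2 and e1 < e2:
        return "starts"
    if e1 == e2 and s1 > s2:
        return "finishes"
    if s1 > s2 and e1 < e2:
        return "during"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    return "inverse"  # b stands in one of the seven relations to a

# A shot spanning frames 0-50 overlaps a shot spanning frames 30-80:
print(allen_relation((0, 50), (30, 80)))  # overlaps
```

Any pair of intervals falls into exactly one of thirteen relations (the seven above plus the inverses of the six asymmetric ones), which is part of what makes the interval-based representation attractive for indexing.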
1.2
SPATIAL RELATIONS
The capability to capture and model the spatial relations of multimedia data is one of the mandatory features in many multimedia applications. Spatial data objects often cover multi-dimensional spaces and are not well represented by point locations. It is very important for a database system to have an index mechanism to handle spatial data efficiently, as required in computer-aided design, geo-data applications, and multimedia applications. For example, multimedia documents which consist of images, charts, texts, and graphics require the management of spatial relations for layout information [56]. In geographical information systems (GIS), the representation and indexing of the abstract spatial relations are important. Spatial relations can be coded using various knowledge-based techniques. One straightforward way to represent the spatial relation
between two objects is by rectangular coordinates. An object’s spatial position is represented by coordinates and the spatial relation between two objects is calculated mathematically as shown in [78, 123]. A general approach for modeling spatial relations is based on identifying spatial relations among objects using bounding boxes or volumes. The minimal bounding rectangle (MBR) concept in an R-tree is used so that each semantic object is covered by a rectangle. An R-tree [80], which was proposed as a natural extension of B-trees [19, 50], combines the nice features of both B-trees and quadtrees. An R-tree is a height-balanced tree similar to a B-tree. The spatial objects are stored in the leaf level and are not further decomposed into their pictorial primitives, i.e., into quadrants, line streams, or pixels. Three types of topological relations between the MBRs can be identified [39]: (1) non-overlapping rectangles; (2) partly overlapping rectangles; (3) completely overlapping rectangles. For the second and third alternatives, the orthogonal relations proposed by Chang et al. [39] can be used to find the relation between objects. In [40, 53], a 2D string is proposed as an indexing technique for representing a spatial relation between the objects of a picture. The 2D strings represent the abstract positions of the objects and several levels of a coarse-strict relation, where the strictness of direction differs from one level to another. Another approach, proposed in [158, 159], defines a set of binary relations such as left of, right of, above, below, inside, outside, in front of, behind, and overlaps as primitive relations for representing the spatial relations.
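The three topological relations between MBRs can be sketched as a simple classifier on rectangle coordinates. This is a minimal illustration of the idea, assuming rectangles given as (x1, y1, x2, y2) corner pairs; it is not the algorithm of [39]:

```python
def mbr_relation(r1, r2):
    """Classify two MBRs (x1, y1, x2, y2) into the three topological
    relations: non-overlapping, partly overlapping, or completely
    overlapping (one rectangle contains the other)."""
    ax1, ay1, ax2, ay2 = r1
    bx1, by1, bx2, by2 = r2
    # No intersection along either axis.
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return "non-overlapping"
    # Containment in either direction.
    if (ax1 <= bx1 and ay1 <= by1 and bx2 <= ax2 and by2 <= ay2) or \
       (bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2):
        return "completely overlapping"
    return "partly overlapping"

print(mbr_relation((0, 0, 2, 2), (1, 1, 3, 3)))  # partly overlapping
```

For the partly and completely overlapping cases, a finer-grained directional relation (the orthogonal relations of [39]) would then be computed from the same coordinates.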
1.3
SPATIO-TEMPORAL RELATIONS
Generally, semantics and events in video data can be expressed in terms of the interplay among physical objects in space and time. For modeling purposes, spatio-temporal relations among objects can be represented in a suitable indexing structure, which can then be used for query processing. Currently, there are not many studies of the spatio-temporal modeling of multimedia data. There are two ways to represent the spatial and temporal relations of multimedia data. One is to represent them in a consistent manner [59, 166], while the other approach is to represent them independently [95]. Whether or not a model provides a consistent representation for both spatial and temporal relations depends on the application. The advantage of using a consistent representation for both the spatial and temporal relations is that content-based retrieval on both relations is realized in a unified manner.
In [59], both spatial and temporal relations are based on a single set of interval-based primitive relations. In their approach, “spatial events” are used to describe the relative locations of multiple objects. Another study, proposed by [166], discusses the spatio-temporal indexing of multimedia data in an integrated way, where the approach assumes multimedia presentations. For indexing, two approaches have been proposed. One approach separates the spatial index (a 2D R-tree) from the temporal index (a 1D R-tree). The second approach integrates spatio-temporal indexing using 3D R-trees. Another example that models the spatial and temporal relations independently is proposed by [95]. In their approach, the temporal relation is primary and the spatial relation is secondary. Objects are structured with the same temporal relations between objects as proposed in [10]. The spatial relations are defined by spatial composition operations such as about, crop, overlay, overlap, and scale.
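The 3D R-tree idea can be illustrated by treating a spatio-temporal object as a box in (x, y, t) space; a spatio-temporal range query then reduces to box intersection. The encoding below is a hypothetical sketch, not the index structure of [166]:

```python
def st_box(mbr, t_start, t_end):
    """A spatio-temporal object as a 3D box (x1, y1, t1, x2, y2, t2):
    its spatial MBR extruded along the time axis."""
    x1, y1, x2, y2 = mbr
    return (x1, y1, t_start, x2, y2, t_end)

def boxes_intersect(a, b):
    """True if two 3D boxes overlap in space AND time -- the basic
    test a 3D R-tree performs while descending its nodes."""
    return all(a[i] <= b[i + 3] and b[i] <= a[i + 3] for i in range(3))

car = st_box((0, 0, 10, 10), t_start=0, t_end=5)
dog = st_box((5, 5, 20, 20), t_start=4, t_end=9)
print(boxes_intersect(car, dog))  # True: they coexist spatially during [4, 5]
```

The separated alternative would instead probe a 2D R-tree with the spatial extent and a 1D index with the time interval, intersecting the two result sets.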
2.
MULTIMEDIA SEMANTIC MODELS
Many semantic models have been proposed to model the temporal and/or spatial relations among multimedia objects. Based on the underlying paradigm and/or the temporal relationship representation, multimedia semantic models can be classified into the following distinct categories:

1. timeline models [27, 88]: The relationships are represented by a timeline.

2. time-interval based models [120, 140, 172]: The relationships are represented by time intervals.

3. graphic models [30, 33, 114, 181]: The relationships are represented as edges in a graph.

4. petri-net models [8, 37, 117, 118]: The relationships are captured by means of the concept of transitions.

5. object-oriented models [22, 62, 67, 71, 72, 87, 92, 124]: The relationships are modeled as object attributes.

6. language-based models [15, 137, 162, 175]: The relationships are expressed by programming language constructs.

7. augmented transition network (ATN) models [44]: The relationships are captured by means of the concept of transitions.

Some models are primarily aimed at synchronization aspects of the multimedia data while others are more concerned with the browsing aspects
Figure 2.1. Timeline for Multimedia Presentation. t1 to t6 are the time instances. d1 is time duration between t1 and t2 and so on.
of the objects. The former models can easily lend themselves to an ultimate specification of the database schema. Some models, such as those based on graphs and Petri-nets, have the additional advantage of pictorially illustrating synchronization semantics and are suitable for visual orchestration of multimedia presentations. Table 2.1 shows the models with their corresponding categories. The first column of Table 2.1 associates an identifier with each model for later reference. If a model does not have a name, then the name of the first author is used as the identifier for the model. For each model, the most comprehensive or most widely available reference is shown in the second column. Finally, the third column indicates the category in which the model falls.
Table 2.1. Classification of Selected Semantic Models
Figure 2.2. A timeline example that includes choice objects

2.1
TIMELINE MODELS
Blakowski and Huebel [27] proposed a timeline model in which all events are aligned on a single time axis. They use “before,” “after,” or “simultaneous to” to represent the relationships between two events. A media stream in a multimedia presentation is denoted by one or more letters subscripted by digit(s). The letters A, I, T, and V represent audio, image, text, and video media streams, respectively. The subscript digit(s) denote the segment number in the corresponding media stream. For example, V1 denotes video stream segment 1. Figure 2.1 shows a timeline representing a multimedia presentation. The presentation starts at time t1 and ends at time t6. At time t1, media streams V1 (Video 1) and T1 (Text 1) start to play at the same time and continue to play. At time t2, I1 (Image 1) and A1 (Audio 1) begin and overlap with V1 and T1. The duration d1 is the time difference between t1 and t2. V1 and T1 end at time t3, at which T2 starts. The process continues until the end of the presentation. The timeline representation can model the temporal relations of media streams in a multimedia presentation [27]. Every presentation needs to strictly follow the pre-specified sequence. Hirzalla et al. [88] proposed a timeline model (Figure 2.2) to expand the traditional timeline model to include temporal inequalities between events. They also develop a timeline tree (Figure 2.3) to represent an
Figure 2.3. A timeline tree representation of the interactive scenario in Figure 2.2
interactive scenario. This enhanced timeline model treats user actions as media objects. In a traditional timeline, the vertical axis includes only media streams such as text, image, video or audio. In their model, a new type of media object called choice lets users interact with the multimedia presentation.
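A single-axis timeline like the one in Figure 2.1 can be sketched as a list of segments with start and end instants; the concrete times below are invented for illustration:

```python
# Each media segment is a (stream, start, end) triple on one time axis;
# V1/T1/I1/A1/T2 follow the presentation of Figure 2.1 (times made up):
# V1 and T1 start together, I1 and A1 join after duration d1 = 5, and
# T2 replaces V1/T1 at t3 = 20.
timeline = [
    ("V1", 0, 20), ("T1", 0, 20),
    ("I1", 5, 30), ("A1", 5, 30),
    ("T2", 20, 30),
]

def active_at(timeline, t):
    """Streams playing at instant t."""
    return [s for (s, start, end) in timeline if start <= t < end]

print(active_at(timeline, 10))  # ['V1', 'T1', 'I1', 'A1']
```

A choice object in the enhanced model of [88] would appear as just another row on this axis, except that its end instant is fixed only once the user acts.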
2.2
TIME-INTERVAL BASED MODELS

Two temporal models for time-dependent data retrieval in a multimedia database management system have been proposed [120]. These two models are restricted to objects with monotonically increasing playout deadlines and to synchronous data retrieval algorithms. Since interval-based temporal relations are used, only the duration of each data stream is considered, and thus the starting and ending frame numbers of each video clip are not included. Oomoto and Tanaka [140] developed a video-object database system named OVID. Its video-object data model provides interval-inclusion based inheritance and composite video-object operations. A video object is a video frame sequence (a meaningful scene) and is an independent object itself. Each video object has its own attributes and attribute values to describe its content. Since this model only uses frame sequences to represent the interval, the starting time and ending time of each interval are not included in this model. This model is designed to support database queries. Wahl and Rothermel [172] proposed an interval-based temporal model using the temporal equalities and inequalities between events. This model is a high-level abstract model which also works for asynchronous events, so that the starting time can be determined at presentation time. It is designed for all events occurring in the presentation.
2.3
GRAPHIC MODEL
Labeled directed graphs have been extensively used to represent information. The idea of using graph notations to model synchronization requirements is to represent a set of temporally related events as a graph whose nodes denote the events/objects and whose edges capture the temporal relationships among the components of the events. Hypermedia systems are an example. This approach allows one to interlink small information units (data) and provides a powerful capability for users to navigate through a database. Information in such a system is represented as a “page” consisting of a segment of text, graphics codes, executable programs, or even audio/video data. All the pages are linked via a labeled graph, called a hypergraph. The major application of this model is to specify higher level browsing features of multimedia systems. The essence of hypertext is a nonlinear interconnection of information, unlike the sequential access of conventional text. Information is linked via cross-referencing between keywords or subjects to other fragments of information [68]. Various operations, such as updating and querying, can be performed on a hypergraph. Updating means changing the configuration of the graph and the content of the multimedia data. Querying operations include navigating the structure, accessing pages (read or execute), showing the position in the graph, and controlling the side effects. Basically, it is a model for editing and browsing hypertext.
The start and end events are represented by two rectangular nodes, the synchronization events use circular nodes that are put between the start and end events, and the asynchronous events are denoted by circular nodes that are placed above the start events. The edges represent the temporal relations between events. Each edge is labeled with the relation between the events it connects. Examples of labels are simultaneous with and before by 10s. The asynchronous events are events whose time of occurrence or duration cannot be predicted when the specification is generated. Multimedia objects with indeterminate durations are represented by a dashed line connecting their starting and ending events. In order to ensure the consistency, several restrictions are imposed on constraints involving asynchronous events such as the asynchronous events can only appear as
the source event in a synchronization constraint. The following types of constraints are supported:

1. Temporal equalities: require either that two events occur simultaneously or that one precedes the other by a fixed amount of time. For example, the simultaneous with label can be used when two events occur simultaneously, and the before by k label can be used when one event precedes the other by time k.

2. Temporal inequalities: support the expression of indeterminacy. For example, the before by at least k label can be used when an event precedes another by at least some fixed time k, and the before by at least k1 and no more than k2 label can be used when an event precedes another by at least time k1 and at most time k2.

Yeo and Yeung [181] proposed a video browsing model. They developed mechanisms to classify video, find story units, and organize video units using scene transition graphs (STGs). STGs model a cluster of shots as nodes and the transitions between shots in time as edges, so that the video hierarchy and the temporal relations of each video unit are preserved. Therefore, this model provides presentation and browsing capabilities to users. Users can choose a specific scenario to watch by browsing sample video frames at different granularities.
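Firefly's equality and inequality labels can both be read as bounds on the offset between two event times, which suggests a simple consistency check over a set of assigned event times. The sketch below uses hypothetical event names and times, not Firefly's actual data structures:

```python
# A Firefly-style constraint is an edge (src, dst, lo, hi) meaning
# "dst occurs between lo and hi seconds after src":
#   simultaneous with                     -> (0, 0)
#   before by 10s                         -> (10, 10)
#   before by at least k1, no more than k2 -> (k1, k2)
def satisfied(times, constraints):
    """Check every labeled edge against a candidate time assignment."""
    return all(lo <= times[dst] - times[src] <= hi
               for (src, dst, lo, hi) in constraints)

times = {"audio_start": 0, "video_start": 0, "caption_start": 12}
constraints = [
    ("audio_start", "video_start", 0, 0),      # simultaneous with
    ("video_start", "caption_start", 10, 15),  # before by at least 10s, no more than 15s
]
print(satisfied(times, constraints))  # True
```

A runtime scheduler works the other way around: given the edges, it solves for event times that satisfy them, deferring asynchronous source events until they actually occur.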
2.4
PETRI-NET MODELS
Recently, the use of Petri-nets for developing conceptual models and browsing semantics of multimedia objects has been proposed. The reason for adopting Petri-net models [142] is that timed and augmented Petri-nets have traditionally been used to analyze systems with temporal requirements [51, 148, 192]. The idea is to represent the playout of the multimedia objects in a set of temporally related events as places in the net and their temporal relationships as transitions. This type of model has been shown to be quite effective for specifying multimedia synchronization requirements [68]. For example, one such model is used to specify high level (object level) synchronization requirements and is both a graphical and mathematical modeling tool capable of representing concurrency. Little and Ghafoor [118] proposed an Object Composition Petri Net (OCPN) model based on the logic of temporal intervals and Timed Petri Nets to store, retrieve, and communicate between multimedia objects. This model, which is a modification of earlier Petri net models, consists of a set of transitions (bars), a set of places (circles), and a set of directed arcs.
Figure 2.4. An OCPN example for Figure 2.1: D1 is the delay for media streams I1 and A1 to display. D2 is the delay for V2 to display.
OCPN is a network model and a good data structure for controlling the synchronization of a multimedia presentation. A network model can easily show the time flow of a presentation. Therefore, OCPN can serve as a visualization structure for users to understand the presentation sequence of media streams. An OCPN can be graphically represented by means of a bipartite graph. In OCPN, each place (circle) contains the required presentation resource (device), the time required to output the presentation data, and spatial/content information. Each place (circle) is represented by a state node in the OCPN model. The transitions (bars) in the net indicate points of synchronization and processing. The particularly interesting features of this model are the ability to explicitly capture all the necessary temporal relations, and to provide simulation of the presentation in both forward and reverse directions [68]. Moreover, the OCPN model can specify exact presentation-time playout semantics, which is useful in real-time presentation scheduling. Figure 2.4 shows the OCPN for the same example as in Figure 2.1. Media streams V1 and T1 start to display, and I1 and A1 join after a delay denoted by D1. After V1 and T1 finish, T2 starts to display together with I1 and A1. Then V2 joins after a delay D2. OCPN can handle the synchronization and quality of service (QoS) for real-time multimedia presentations. The behavior of an OCPN is described by firing rules. When all input places of a given transition contain a nonblocking token, the transition fires. In the firing of a transition, the token is removed from each input place and added to each output place. After the firing, each output place remains active for its associated duration. The token inserted
in each output place remains blocked until duration elapses. The details of the OCPN model are described in Chapter 6. Many later semantic models are based on Time Petri Nets. Chang et al. [37] and Lin et al. [117] develop TAO (Teleaction object) and OEM (Object Exchange Manager). TAO is a multimedia object with associated hypergraph structure and knowledge structure, and AMS (Active multimedia system) is designed to manage the TAOs. OEM maintains and manages uniform representation and interacts with other system modules. TAO is a conceptual model which can be implemented as objects in an object-oriented system and each TAO has its own private knowledge in the AMS. TAOs are connected by a hypergraph. The multimedia data schema (MDS) is similar to OCPN which controls the synchronization between time-related data streams. Users need to create a hypergraph structure using four different links (annotation link, reference link, location link, and synchronization link) before generating a multimedia data schema. A multimedia communication schema (MCS) is obtained based on MDS for an efficient transmission sequence. This design can let designers specify the necessary actions for different communication delays and different computer hardware limitations. Al-Salqan and Chang [8] develop a model which uses synchronization agents as “smart” distributed objects to deal with scheduling, integrating, and synchronizing distributed multimedia streams. This formal specification model, interoperable Petri nets, describes the agents’ behavior and captures the temporal semantics. It can deal with both accurate and fuzzy scenarios. In another model, called Petri-Net-Based-Hypertext (PNBH), the higher level browsing semantics can be specified. In this model, information units are treated as net places and links as net arcs. Transitions in a PNBH indicate the traversal of links, or the browsing of information fragments. 
Figure 2.5 illustrates a PNBH model consisting of segments of an interactive movie. These segments can be played out in a random order, as selected by the user and restricted by the semantics of the net. Unlike the OCPN model, the net places in PNBH can have multiple outgoing arcs. Hence, nondeterministic and cyclic browsing can be represented.
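The OCPN firing rule described above can be sketched as a single step function over a marking; the place names, durations, and data representation here are hypothetical simplifications:

```python
# A minimal OCPN firing step: a transition (inputs, outputs) fires when
# every input place holds a token that is no longer blocked; firing
# moves the tokens to the output places, where each new token stays
# blocked until the place's duration elapses.
def fire(transition, marking, blocked_until, durations, now):
    """marking: set of places holding a token; blocked_until: place -> time."""
    inputs, outputs = transition
    if not all(p in marking and blocked_until.get(p, 0) <= now for p in inputs):
        return False
    for p in inputs:
        marking.discard(p)
    for p in outputs:
        marking.add(p)
        blocked_until[p] = now + durations[p]  # token blocked for the place's duration
    return True

# V1 and T1 become unblocked at t=20; the transition then starts T2 for 10 units.
marking, blocked = {"V1", "T1"}, {"V1": 20, "T1": 20}
fire((["V1", "T1"], ["T2"]), marking, blocked, {"T2": 10}, now=20)
print(marking)  # {'T2'}
```

Because a transition waits for all of its input tokens, it is exactly the synchronization point between the media streams feeding it, which is why OCPN renders well as a presentation schedule.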
2.5
THE OBJECT-ORIENTED MODELS
The basic idea in this type of model is to represent a real-world thing or concept as an object. An object usually has an identifier, attributes, methods, a pointer to data, etc. [68]. Many research projects have claimed that the object-oriented model is particularly well suited to expressing the data modeling requirements of a set of temporally related
Figure 2.5. An example PNBH Petri net.
events. Under an object-oriented approach, the events are modeled as a set of objects related to each other in several ways. The temporal information for synchronizing multimedia objects is modeled by means of object attributes. Also, the inheritance concept can be used to define a class containing general methods for the synchronization of multimedia objects. Several proposals can be found in the literature in the area of multimedia object-oriented models. Some concern extensions to the Open Document Architecture [138] to handle continuous data [87, 92], some concern the handling of nonconventional data such as sound or images [176], some concern the problem of temporal synchronization [62, 67, 71], some concern the implementation requirements of multimedia applications [73, 124], etc. ODA is a standardized document architecture developed by ISO, CCITT, and ECMA to manage office documents in an open distributed system [138]. There are several advantages to using ODA as the basis for a temporal model. First, ODA can be a universally acceptable framework to model multimedia objects since it is a widely supported international standard. Second, some existing multimedia systems such as the Andrew system [4], MULTOS [133], and others [87, 92] are based on the ODA architecture.
Semantic Models for Multimedia Information Systems
33
In the ODA data model, there is a sharp distinction between the structure and the content of a document. The structure of a document consists of the meta-information about the document content and is useful for the editing and formatting process. The content of a document represents information that should be presented to the users and is partitioned into several content portions. Each content portion contains information of exactly one type. There are two types of structures associated with each document. The logical structure determines how the logical document components (title, introduction, etc.) are related to document content portions. The layout structure defines how the layout components (frames, pages, etc.) are related to document content portions. Both the logical and layout structures are represented as trees in which the objects are the nodes. The leaves of the tree are called basic objects, whereas the internal nodes are called composite objects. A distinct content portion of the document is associated with each basic object, and a set of attributes is associated with each node of the tree to describe the properties of the corresponding object. The notions of object class and document class are supported so that the grouping of objects or documents into classes based on similar properties is allowed.
In the logical structure, both timed and un-timed objects can coexist, depending on whether or not temporal information are fundamental to preserve semantics correctness; while the layout structure consists of timed objects since a multimedia presentation is new time independent. The duration of a basic object is explicitly specified by associating with it an interval representing the object presentation time or implicitly assumed to be that of the content portion associated with it. On the other hand, the duration of a composite object and its synchronization requirements are specified by a new class of attributes. This new class of attributes is called the temporal construction attributes that each temporal construction attribute specifies the temporal relationships occurring among the duration of the object to which it refers, and those of its children in the tree representing the layout or logical structure of the document. The value of a temporal construction attribute comes
from a set of pre-specified values that either are derived from Allen’s relations [10] or are added to increase the expressivity of the model. A general model proposed by [92] is an extension of the ODA architecture to model the synchronization of multimedia objects. Unlike Herrtwich’s model [87], this model deals only with the layout structure of a document. Two basic concepts are used in this model – actions and path expressions. A set of temporally related events is modeled as a set of actions. Example actions are to play video V, to present image I, etc. An action is limited in time by its starting and ending points. Actions are synchronized at some given points in time corresponding to particular events, such as the starting or ending of another action. These events are referred to as synchronization points. In addition, since synchronization can occur only at the starting and/or ending point of each action, actions are divided into two groups: atomic actions and compose actions. The atomic actions do not contain synchronization points, whereas the compose actions are composed of both atomic and compose actions. Each compose action must be logically decomposed into a set of equivalent atomic actions to be synchronized. On the other hand, the path expressions are used to specify the synchronization requirements [32] and the order of the actions composing a set of temporally related events. Path expressions are specified by combining atomic actions with path operators. The order of the actions is based on the priority order of the path operators. Let X and Y be two generic actions, N a positive natural number with a default value 1, and k a positive natural number. The following path operators are supported by the model:

Concurrency: Action X can be executed N times concurrently. It is written as N : X.

Repetition: Action X is executed k times repeatedly. It is written as Xk*.

Selective: Either X or Y is executed.
The selection depends on a condition which is outside the path expression. It is written as X | Y.

Sequential: The execution of Y can start only when X is terminated. It is written as X; Y.

Parallel-first: Actions X and Y are started at the same time. The composite action terminates when the first action between X and Y terminates. It is written as X ∨ Y.
Parallel-last: Actions X and Y are started at the same time. The composite action terminates when both X and Y terminate. It is written as X Λ Y.

The above path operators are shown in decreasing priority order. The concepts of action and path operator must be mapped into the ODA architecture. The concepts of atomic actions and compose actions are similar to the concepts of basic objects and composite objects in ODA. A path expression (in prefix notation) can be transformed into a tree representation where the intermediate nodes represent the path operators and the leaves represent the atomic actions. As mentioned earlier, the layout structure of a document can be represented as a tree with composite objects as the intermediate nodes and basic objects as the leaves. This correspondence allows the semantics of the path expressions to be added to the ODA architecture. For this purpose, a new attribute called the object synchronization type attribute is defined for layout composite objects. The value of this attribute is assumed to be one of the path operators introduced above, with the sequential path operator as the default value. This default value indicates that the immediate subordinate objects are to be executed sequentially. Other attributes such as the content temporal type attribute and the duration attribute are also used. The content temporal type attribute denotes whether the associated content portion is time-dependent or not, whereas the duration attribute specifies the duration of the object presentation. The hyperobject model adopted by the Harmony system integrates the hypertext model with the object-oriented framework to represent the temporal relations among multimedia objects [67]. In their approach, links are used to express the synchronization constraints. A link is represented by a 4-tuple <source_object, conditions, destination_object, message>.
The source_object and the destination_object are the objects to be synchronized, the conditions are a set of synchronization conditions, and the message can take the value play or stop, together with additional parameters. When the conditions are verified, the source_object sends the message to the destination_object, which executes the corresponding method. An object-oriented model proposed by Ghafoor and Day [71] is based on the OCPN model to represent multimedia data. In this model, facilities for spatial and temporal composition of multimedia objects are included. In [62], an object-oriented approach is proposed to model video data and their temporal and spatial information. A video is represented as a collection of objects related by temporal and spatial constraints.
The collection of objects corresponding to a video is modeled by composed objects that are sets of 4-tuples (ci, pi, sti, di), where ci is a component object, pi is the spatial position of ci in the composed object, sti is the starting time of ci, and di is the duration of ci. The OMEGA system relies on an object-oriented data model in which the temporal information associated with the objects is used to compute the precedence and synchronization relationships [124]. To facilitate the presentation of multimedia objects, OMEGA uses the temporal information associated with each object to calculate precedence and synchronization between objects. In order to handle different types of multimedia data, a special class (metaclass) called multimediaclass that covers various multimedia data (images, sound, etc.) is defined. This class is used to handle the large variety of multimedia data. Also, in this model, a multimedia object has attributes, relationships (whose values are references to other objects), components (whose values are references to other object(s) that depend on the referring superordinate object), and methods. Some integrity rules also apply. These include class, instance, subclass, superclass, inheritance, generalization, and aggregation. For example, IS_PART_OF and IS_REFERENCE_OF can be specified between objects in the OMEGA system. In [72, 73, 74], a timeline model based on the notion of active objects is proposed. An active object differs from an ordinary object (i.e., a passive object) in that it may activate methods even if no message is sent to it. In this model, the multimedia objects are the basic building blocks from which specifications can be constructed. A multimedia object is an active object with a number of ports and is an instance of a multimedia class. The multimedia classes are organized into a hierarchy whose root has two classes: ActiveObject and MultimediaObject.
The ports are themselves objects, defining a data structure and operations to check the state of the data structure. The ports also describe the types of data the corresponding object consumes and produces. The class ActiveObject provides methods to control the activity of a multimedia object, such as Start(), Stop(), Pause(), and Resume(). The class MultimediaObject provides methods to convert between two temporal coordinate systems (world coordinates and object coordinates), such as ObjectToWorld() and WorldToObject(); methods that allow multimedia objects to be composed into a single composite object indicating the temporal sequencing of components, such as Translate(), Scale(), and Invert(); and methods to deal with multimedia object synchronization, such as Sync(), SyncInterval(), SyncTolerance(), SyncMode(), Cue(), and Jump().
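As a rough illustration, the ActiveObject/MultimediaObject split can be sketched as follows; the class bodies and the linear time mapping are illustrative assumptions, since [72, 73, 74] describe the interfaces rather than an implementation.

```python
class ActiveObject:
    """An active object may activate methods even when no message is sent to it."""
    def __init__(self):
        self.running = False

    def Start(self):
        self.running = True

    def Stop(self):
        self.running = False

    def Pause(self):
        self.running = False

    def Resume(self):
        self.running = True


class MultimediaObject(ActiveObject):
    """Adds conversion between world and object temporal coordinates."""
    def __init__(self, origin=0.0, scale=1.0):
        super().__init__()
        self.origin = origin  # world time at which object time 0 occurs
        self.scale = scale    # object time units per world time unit

    def WorldToObject(self, t):
        return (t - self.origin) * self.scale

    def ObjectToWorld(self, t):
        return t / self.scale + self.origin


clip = MultimediaObject(origin=10.0, scale=2.0)
clip.Start()
```

The synchronization methods (Sync(), Cue(), etc.) are omitted here; the point is only that temporal coordinate conversion is inherited by every multimedia object.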
Semantic Models for Multimedia Information Systems
2.6 THE LANGUAGE-BASED MODELS
Concurrent languages have been extensively used to specify parallel and distributed process structures. These languages also have the potential to specify multimedia synchronization requirements. Several language-based approaches have been proposed to model multimedia data synchronization requirements by extending conventional concurrent programming languages such as ADA and CSP with synchronization primitives [15, 57, 61, 137, 143, 162, 175]. The major advantage of language-based models is that they can lead directly to an implementation. In [15], the authors proposed the use of the timed communicating sequential processes (TCSP) language as a specification language for a set of temporally related events. TCSP is a temporal extension of the language of communicating sequential processes (CSP) [57]. The syntax and semantics of the TCSP operators are shown in Table 2.2. As can be seen from the table, several operators explicitly depend on time. The asynchronous parallel composition operator allows both of its arguments to evolve concurrently without interacting. The event prefix operator transfers control to P when the event a occurs. The external choice operator denotes a choice between P and Q; the choice is resolved by the first communication, so if the environment is prepared to cooperate with P, control is transferred to P, and vice versa. The skip operator does nothing but terminate immediately. The timed delay operator represents a process that terminates after t time units. The timed prefix operator transfers control to P exactly t time units after the event a has occurred. Finally, the timeout operator transfers control from P to Q if no communication involving P occurs before time t; when t = 0, the t is omitted from the timeout operator. TCSP is powerful enough to express Allen's relations between intervals.
For example, the equals relation can be specified by the following TCSP process (the external choice operator is written here as □):

equals(x,y) = (x.ready → ((y.ready → E1) □ (synch_error → SKIP)))
              □ (y.ready → ((x.ready → E1) □ (synch_error → SKIP)))

E1 = x.present → y.present;
     ((x.free → ((y.free → SKIP) □ (synch_error → SKIP)))
      □ (y.free → ((x.free → SKIP) □ (synch_error → SKIP))))

In this process, x and y need to start and finish at the same time; otherwise, a synch_error occurs. Similarly, the other Allen's relations can be defined, and therefore any given set of temporally related events can be modeled by composing the basic processes corresponding to Allen's relations.
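Outside TCSP, the interval semantics of these relations are easy to check directly; the following sketch (illustrative, not part of [15]) tests Allen's equals and two other relations on (start, end) pairs.

```python
def equals(x, y):
    """Allen's 'equals': the intervals start and finish at the same time.
    Intervals are (start, end) pairs; a mismatch plays the role of synch_error."""
    return x[0] == y[0] and x[1] == y[1]

def before(x, y):
    """Allen's 'before': x finishes strictly before y starts."""
    return x[1] < y[0]

def during(x, y):
    """Allen's 'during': x lies strictly inside y."""
    return y[0] < x[0] and x[1] < y[1]
```

The remaining relations (meets, overlaps, starts, finishes, and their inverses) follow the same pattern of comparisons on endpoints.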
Table 2.2. Operators in TCSP
Steinmetz [162] showed how the traditional synchronization mechanisms of concurrent languages are inadequate for multimedia systems and proposed a new synchronization mechanism called restricted blocking. There are several drawbacks in the traditional mechanisms. For instance, in the CSP mechanism, the first process that reaches the synchronization point must block and wait for its partner, which is not always suitable for multimedia data. With the restricted blocking mechanism, the actions to be taken while waiting for synchronization can be explicitly specified. This extension supports multimedia process synchronization, including semantics for real-time synchronization of multimedia data. In the restricted blocking mode, an object may be forced to wait for another object in order to synchronize if the latter does not arrive in time. The mechanism provides several time operands to express the acceptable synchronization delay, as shown below. These additional features are particularly useful in a multimedia context, where the acceptable degree of synchronization delay varies considerably depending on the media considered.
Timemin: indicates the minimal acceptable delay in the synchronization. The default value is zero.
Timemax: indicates the maximal tolerable delay in the synchronization.
Timeave: indicates the ideal delay in the synchronization. Usually, it assumes a zero value.
Another approach, proposed in [143], was developed in the context of the CCWS project. A parallel language-like notation is used to specify the synchronization requirements, with three operators defined: independent, sequential, and simultaneous. In [61], the author uses the Hypermedia/Time-based Document Structuring Language (HyTime), proposed in [137], to specify the temporal constraints, while in [175], LOTOS (Language Of Temporal Ordering Specification) is used to specify continuous-media synchronization requirements.
2.7 THE AUGMENTED TRANSITION NETWORK (ATN)
The augmented transition network (ATN), developed by Woods [178], has been used in natural language understanding and question answering systems for both text and speech. Chen and Kashyap [43, 44] proposed the ATN as a semantic model for multimedia presentations, multimedia database searching, the temporal, spatial, or spatio-temporal relations of various media streams and semantic objects, and multimedia browsing. A transition network consists of nodes (states) and arcs. Each state has a state name and each arc has an arc label. Each arc label represents the media streams to be displayed during a time duration. Therefore, time intervals can be represented by transition networks. In a transition network, a new state is created whenever there is any change of media streams in the presentation. There are two situations for such a change:
1. A media stream finishes displaying;
2. A new media stream joins the display.

Figure 2.6. Transition Network for Multimedia Presentation.
A multimedia presentation consists of a sequence of media streams displayed together or separately across time. The arcs in an ATN represent the time flow from one state to another. When an ATN is used for language understanding, its input is a sentence, which consists of a sequence of words in linear order. In a multimedia presentation, where user interactions such as user selections and loops are allowed, sentences cannot be used as the inputs of an ATN. Instead, the inputs of ATNs are modeled by multimedia input strings. A multimedia input string consists of several input symbols, each of which represents the media streams to be displayed during a time interval. In a multimedia input string, the symbol “&” between media streams indicates that those media streams are displayed concurrently. For example, Figure 2.6 shows a transition network for Figure 2.1. There are six states and five arcs, which represent six time instants and five time durations, respectively. State names appear in the circles to indicate the presentation status. State name P/ denotes the beginning of the transition network (presentation) and state name P/X1 denotes the state after X1 has been read. The labels Xi are used for convenience; in fact, X1 could be replaced by V1 and T1. State name P/X5 is the final state of the transition network, indicating the end of the presentation. State P/Xi represents
that presentation P has just finished displaying Xi, and the presentation can proceed without knowing the complete history of the past. There are five media stream combinations, one for each time duration:
1. Duration d1: V1 and T1.
2. Duration d2: V1, T1, I1, and A1.
3. Duration d3: T2, I1, and A1.
4. Duration d4: V2, T2, I1, and A1.
5. Duration d5: V2 and A1.
Each arc label Xi in Figure 2.6 represents the media stream combination for the corresponding duration. For example, arc label X1 represents media streams V1 and T1 displayed together during duration d1. A new arc is created when new media streams I1 and A1 overlap with V1 and T1. However, only media streams are included in the transition networks. In order to model the spatio-temporal relations of semantic objects, subnetworks are developed to model media streams such as images, video frames, and keywords in texts, forming a recursive transition network (RTN). Subnetworks in ATNs model the detailed information of media streams such as video, image, and text. A new state is created in a subnetwork whenever the number of semantic objects changes or a semantic object's spatial location changes in the input symbol. For a single image, the subnetwork has only two state nodes and one arc, since the number and spatial locations of semantic objects do not change. The advantage of subnetworks is that coarse-grained media streams and fine-grained semantic objects can be separated. The transition network, which contains media streams, gives users a high-level (coarse-grained) view of which media streams are displayed during each duration, while the subnetworks represent the low-level (fine-grained) contents of images, video frames, or texts. If semantic objects were included in the transition network itself, it would be difficult to understand, since media streams and semantic objects would be mixed together.
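The transition network of Figure 2.6 can be sketched as a small table of arc labels; the run_presentation helper below is an illustrative driver, not part of the ATN formalism.

```python
# Arc labels of Figure 2.6: the media stream combination for each duration.
arcs = {
    "X1": {"V1", "T1"},
    "X2": {"V1", "T1", "I1", "A1"},
    "X3": {"T2", "I1", "A1"},
    "X4": {"V2", "T2", "I1", "A1"},
    "X5": {"V2", "A1"},
}

def run_presentation(symbols):
    """Read a sequence of arc labels and return the state reached,
    e.g. 'P/X3' after reading X1, X2, X3."""
    state = "P/"  # start state of the presentation
    for symbol in symbols:
        if symbol not in arcs:
            raise ValueError(f"unknown arc label {symbol}")
        state = "P/" + symbol
    return state
```

Reading all five arc labels drives the network from P/ to the final state P/X5.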
The inputs of the subnetworks are multimedia input strings too. The multimedia input string is used to model the temporal, spatial, or spatio-temporal relations of semantic objects. A semantic object is an object that appears in a video frame or image such as a “car”. Multimedia input strings provide an efficient means for iconic indexing of the temporal/spatial relations of media streams and semantic
objects. A multimedia input string consists of one or more input symbols. An input symbol represents the media streams that are displayed at the same time within a given duration. These input symbols are also the arc labels in an ATN; each arc label is a substring of a multimedia input string. By using multimedia input strings, one can model user interactions and loops. An ATN and its subnetworks are used to represent the appearing sequence of media streams and semantic objects. In this design, a presentation is driven by a multimedia input string, and each subnetwork has its own multimedia input string. Users can issue queries using a high-level language such as SQL. A query is then translated into a multimedia input string to be matched against the multimedia input strings in the subnetworks. Multimedia database queries related to images, video frames, or text can be answered by analyzing the corresponding subnetworks [44]. Multimedia database searching thus becomes a substring matching between the query and the multimedia input string; in other words, database queries relative to text, image, and video can be answered via substring matching on the subnetworks. Moreover, video is very popular in many applications such as education and training, video conferencing, video-on-demand (VOD), news services, and so on. Traditionally, when users want to search for certain content in a video, they need to fast forward or rewind to get a quick overview of the parts of interest on the video tape. Multimedia browsing gives users the flexibility to select any part of the presentation they prefer to see. Since ATNs can model user interactions and loops, the designer can design a presentation with selections so that users can utilize the choices to browse or watch the same part of the presentation more than once. Therefore, ATNs provide three major capabilities: multimedia presentations, temporal/spatial multimedia database searching, and multimedia browsing.
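The substring-matching view of database searching can be sketched as follows; this is a simplified illustration of the idea in [44], with the parsing and matching rules reduced to set containment.

```python
def parse_input_string(s):
    """Split a multimedia input string into input symbols; '&' joins
    media streams displayed concurrently within one symbol."""
    return [frozenset(symbol.split("&")) for symbol in s.split()]

def query_matches(query, input_string):
    """Searching as substring matching: the query's symbols must occur
    consecutively, each contained in the corresponding input symbol."""
    q = parse_input_string(query)
    s = parse_input_string(input_string)
    return any(
        all(q[j] <= s[i + j] for j in range(len(q)))
        for i in range(len(s) - len(q) + 1)
    )

# Multimedia input string for the presentation of Figure 2.6.
presentation = "V1&T1 V1&T1&I1&A1 T2&I1&A1 V2&T2&I1&A1 V2&A1"
```

For example, the query "T2&I1" matches (both streams are displayed together during d3 and d4), while "V1&V2" does not, since no duration displays both.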
The ATN and its subnetworks can be included in multimedia database systems controlled by a database management system (DBMS). The designer can construct a general-purpose multimedia presentation using an ATN so that users can watch, browse, and search the presentation according to their own needs.
Chapter 3
MULTIMEDIA DATABASE SEARCHING
1. INTRODUCTION
The key characteristic of multimedia data that makes it different from traditional text-based data is its temporal and spatial dimensions. For video data, video events generally have a high degree of temporal content in addition to the spatial semantics. The semantics and events in video data can be expressed in terms of the interplay among physical objects in space and time. Important considerations in multimedia database searching are the specification of the spatio-temporal semantics and the development of indexing mechanisms. For data modeling purposes, spatio-temporal relationships among multimedia objects can be represented in a suitable indexing structure, which can then be used for query processing. Another critical issue is the semantic heterogeneity that may arise due to differences in the interpretations of information. Semantic heterogeneity has proven to be a difficult problem in conventional databases, and in the context of multimedia data, the problem becomes even more difficult and intractable. Multimedia database searching requires semantic modeling and knowledge representation of the multimedia data. Two criteria are considered to classify the existing approaches to modeling multimedia data, especially video data: level of abstraction and granularity of data processing. Based on these two criteria, several classes of approaches employed in modeling video data are compared (as shown in Figure 3.1 [7]). The level of abstraction goes from low to high depending on the level of supported semantics. It is considered low-level if the supported semantics are low-level semantics, which are more relevant to the machines
Figure 3.1. Comparison of video data semantic models.
than to the users. The level increases as the degree of information content and knowledge extracted from the video data increases. For example, “scene change” has low-level abstraction while “scoring a field goal” has high-level abstraction. The granularity of data processing goes from coarse to fine depending on the method of preprocessing the video data, which is the basis on which semantics are extracted. It is considered coarse granularity if it involves processing video frames as a whole, whereas it is fine granularity if the processing involves detection and identification of objects within a video frame. Methods classified on the right-hand side of Figure 3.1 are considered of fine granularity since they focus on processing video data at the object level. On the other hand, the focus of the models on the left-hand side is on processing video data at the frame level using features. The functionalities and classification of these approaches are discussed in the next few sections.
2. IMAGE SEGMENTATION
Still pictures in the form of photographs and motion pictures in the form of cinema have grown to become a significant part of human visual experience. They are an outgrowth of the desire to effectively record and reproduce various visual experiences. Consequently, several systems were built to achieve these goals. The earliest systems were optical
cameras that used film, for still images as well as moving images. The current state of the art provides images and video as digital data, which can then be stored, transmitted, manipulated, and displayed using digital computers. A digitized image is a rectangular array of numbers obtained by spatially sampling the visual field and subsequently quantizing the sampled intensity values. Each sample is called a pixel and the number obtained is the pixel value. Since a video signal is a sequence of images, a digital video is a sequence of digitized images and can be represented as a temporal sequence of such arrays.
2.1 OVERVIEW
Segmenting an image means dividing it into smaller parts or regions; however, these smaller parts may or may not have any meaning. For example, an image array can be divided into small squares, rectangles, hexagons, or triangles, with each segment having the same shape and the same number of pixels. This is referred to as tessellation, or tiling with a regular shape where all the tiles have equal size. This kind of tessellation is independent of the image since it does not consider the information content in the image. Hence, this segmentation is adopted in standards like JPEG, MPEG-1, MPEG-2, etc. for the compression of images and video. Figure 3.2(a) and Figure 3.2(b) show square and hexagonal tessellations [157].
Figure 3.2. (a) Tessellations with square tiles of equal size, and (b) tessellations with hexagonal tiles of equal size.
A second possibility is to segment the image into tiles with regular shape but variable sizes. Here, the size is chosen based on some criterion depending on the pixel values in the image. Popular examples include partitions using quad-trees and pyramids. Figure 3.3(a) shows an example of a quad-tree tessellation [157]. The sizes of the tiles scale by a factor of 2 in each dimension, and the tile of appropriate size is chosen based on the underlying image data. These tessellations give rise to a very compact description of the tiling information. Another way of dividing the image is to choose arbitrary shapes, as in the case of a jigsaw puzzle. This is tessellation with tiles of irregular shapes and sizes, where the shape and size of the tiles are determined purely from the content of the data. Each arbitrarily shaped tile in this tessellation is referred to as a segment, such as a house, a park, a road, etc. Figure 3.3(b) shows an example of dividing an image based on its content [157]. These partitions are determined based purely on the image data; with a suitable criterion, there is a possibility of obtaining the objects themselves as tiles. The shape description of the tiles is necessary for encoding. This is the preferred tiling for object-based representations.
Figure 3.3. (a) an example of a quad-tree tessellation, and (b) tiling with arbitrary shapes like the shapes of a jigsaw puzzle.
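A quad-tree tessellation like Figure 3.3(a) can be sketched with a recursive split; the variance threshold used as the homogeneity criterion here is one illustrative choice.

```python
import numpy as np

def quadtree(image, x, y, size, threshold, tiles):
    """Recursively split a square region until its pixel variance falls
    below the threshold (or the tile is a single pixel). Tile sizes
    scale by a factor of 2 in each dimension, as in Figure 3.3(a)."""
    block = image[y:y + size, x:x + size]
    if size == 1 or block.var() <= threshold:
        tiles.append((x, y, size))
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            quadtree(image, x + dx, y + dy, half, threshold, tiles)

# 4x4 image: homogeneous except for the top-right quadrant.
img = np.array([[0, 0, 9, 1],
                [0, 0, 2, 8],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
tiles = []
quadtree(img, 0, 0, 4, threshold=0.0, tiles=tiles)
# tiles now holds three homogeneous 2x2 tiles and four 1x1 tiles.
```

The resulting tile list is the compact tiling description mentioned above: each tile is fully determined by its corner and size.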
Most of these segmentation methods were investigated by the image processing and image compression communities with a view to finding efficient representations of image data beyond the simple pixel characterization. The purpose of segmentation may vary widely depending on the application. For example, in image compression, the goal is to represent the image in as few bits as possible and the content is ignored, while in computer vision and pattern recognition, the content is important since each segment represents some meaningful entity.
2.2 IMAGE SEGMENTATION TECHNIQUES
In image segmentation, an input image is partitioned into regions, with each region satisfying some homogeneity criterion based on, for example, intensity values or texture. The generated regions are referred to as classes and, ideally, they should correspond to objects. However, it is often difficult to define mathematically what exactly constitutes an object. Two important properties of the pixels are usually considered in segmentation methods:
1. Intensity value distribution: used to model the pixel values within the regions. The homogeneity criteria specified for the purpose of segmentation frequently use this property.
2. Spatial adjacency criteria: used to model the shapes of the regions. This property is used to impose spatial connectivity on the various regions obtained by a segmentation method.
Several techniques have been proposed for image segmentation, such as histogram-based methods, split-and-merge methods, region-growing methods, clustering-based methods, stochastic-model-based methods, etc. In the histogram-based methods, the histogram of the image intensities is obtained and divided into several intervals based on the peaks and valleys present in it; pixels belonging to each interval are then grouped under the same class. The split-and-merge methods first obtain an initial segmentation by splitting the image and then refine the result by merging spatially adjacent segments with similar properties, or by splitting a region when its property is not within the limits of tolerance. The properties chosen are frequently the mean, variance, moments, and density parameters. The split operation adds missing boundaries and the merge operation eliminates false boundaries and spurious regions. In the region-growing methods, some seed points are chosen initially and the pixels around them are progressively included into the regions based on constraints such as the mean, variance, or color of the pixels.
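The histogram-based method above can be sketched in a few lines; the crude valley-finding rule below is an illustrative stand-in for real peak/valley analysis.

```python
def find_valley(pixels, levels=256):
    """Crude peak/valley analysis: take the two most populated intensity
    levels as peaks and return the least populated level between them."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    peaks = sorted(range(levels), key=lambda g: hist[g], reverse=True)[:2]
    lo, hi = sorted(peaks)
    return min(range(lo, hi + 1), key=lambda g: hist[g])

def histogram_segment(pixels, valley):
    """Group pixels into classes by intensity interval around the valley."""
    return [0 if p < valley else 1 for p in pixels]

pixels = [10] * 5 + [11] + [200] * 5
valley = find_valley(pixels)
labels = histogram_segment(pixels, valley)
```

Note that this grouping ignores spatial adjacency entirely; imposing connectivity on the resulting classes is a separate step, as the two pixel properties above suggest.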
Some methods model the regions using planar or polynomial models and use the model as the homogeneity criterion for region growing. In the clustering-based methods, the properties of individual pixels or groups of pixels are first obtained as features. A feature vector could
include features such as the gray level of a pixel, the color, the spatial coordinates, the texture, the coefficients of a polynomial, etc. Clustering is then performed in the feature space using an appropriate distance metric. The distance metric is usually the Euclidean distance between the feature vectors; the weighted lp norms on the feature vectors are also popular. The clustering results are then mapped back into the image space to obtain a segmentation. When the feature is simply the gray level of a pixel, the corresponding feature space is the histogram of the image. For color images, characterized by the three color components, a three-dimensional histogram can be constructed; the feature space is then three-dimensional, and clusters correspond to pixels with similar color. Clustering in this type of feature space yields a decomposition of the histogram into a few nonoverlapping intervals, and labeling of the clusters results in a multi-thresholding of the image. If the spatial coordinates are also included in the feature vectors, then clustering will impose spatial contiguity on the segmented regions. If texture features are considered, the image is initially divided into small blocks and the texture parameters of each block are extracted; the clustering is then performed in the texture parameter space. In [100], a robust clustering technique is proposed: clusters that constitute a specific fraction of the total data and are enclosed by a minimum-volume ellipsoid in the feature space are extracted. The approach uses the Mahalanobis distance as the distance metric, and the cluster quality is compared with a Gaussian cluster. In the stochastic-model-based methods, the image classes are modeled as random fields and the segmentation problem is posed as a statistical optimization problem.
Compared to the previous techniques, these techniques often provide a more precise characterization of the image classes and generally provide better segmentations when the image classes are complex and otherwise difficult to discriminate by simple low-order statistical measures. Under this category, different types of estimates, like the maximum likelihood (ML) estimate and the maximum a posteriori probability (MAP) estimate, can be used. Frequently, the expectation maximization (EM) algorithm is employed to compute the ML estimates, while the generalized EM (GEM) algorithm is employed to compute the MAP estimates. Other methods like greedy optimization, dynamic programming, and simulated annealing are also used. To date, there are very few methods of image segmentation that address partitioning and obtaining content descriptions of segments simultaneously [29, 112, 155]. In [112], the problem is posed as segmentation of Gibbs random fields and solved using simulated annealing. In [29], the problem is posed as texture segmentation where the textures
are modeled by Gauss-Markov random fields. In [155], the problem is posed as joint estimation of the partition and class parameter variables, and a Bayesian multi-scale image segmentation method is proposed. The method recognizes that the variability of content description depends on the complexity of the image regions and effectively addresses it. The notion of a class is introduced as that which gives rise to different segments with the same content description. In particular, the method partitions the data as well as obtains descriptions of the classes for a large family of parametric models, which are used to describe the content of the classes. Central to the method is the formulation of a cost functional, defined on the space of image partitions and the class description parameters, that can be minimized in a simple manner.
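For the clustering-based methods described earlier, when the feature is just the gray level, clustering reduces to a multi-thresholding of the image; a minimal 1-D k-means sketch (illustrative, and unrelated to the robust method of [100]):

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar features (e.g., pixel gray levels) with k-means using
    Euclidean distance; labeling the clusters yields a multi-thresholding."""
    step = max(1, len(values) // k)
    centers = sorted(values)[::step][:k]  # spread initial centers over the range
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

gray_levels = [0, 1, 2, 1, 0, 200, 201, 199, 200, 202]
centers = kmeans_1d(gray_levels, k=2)
```

Adding spatial coordinates or texture parameters to the feature vectors extends the same procedure to the higher-dimensional feature spaces discussed above.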
3. VIDEO PARSING AND SEGMENTATION APPROACHES
Video segmentation is a very important step for efficient manipulation and access of video data, because a digitized video stream can be several gigabytes in size. A video clip is a temporal sequence of two-dimensional samples of the visual field, with each sample being an image referred to as a frame of the video. The samples are acquired at regular intervals of time, currently 30 frames per second for most standard video capturing devices. This frame rate is sufficient to recreate the perception of continuity of visual experience due to persistence of vision. Video data can be temporally segmented into smaller groups depending on the scene activity, where each group contains several frames. As mentioned in [181], a video clip can be divided into scenes. A scene is a common event or locale which contains a sequential collection of shots. A shot is a basic unit of video production, captured between a record and a stop camera operation. In other words, a shot is considered the smallest group of frames that represents a semantically consistent unit. Figure 3.4 shows the hierarchy of a video clip. At the topmost level is the video clip; a clip contains several scenes at the second level, and each scene contains several shots. Each shot contains contiguous frames, which are at the lowest level of the video hierarchy. In the video segmentation step, the digitized video is split into several smaller pieces (physical video segments). Video parsing employs image processing techniques to extract important features from individual video frames; any significant change in the feature vector over a sequence of frames is used to mark a change in the scene. The process allows a
Figure 3.4. A hierarchy of video media stream (CLIP, SCENE, SHOT, FRAME).
high-level segmentation of video data into several shots. Several scene change detection methods have been proposed in the literature, for example, pixel-level comparison, likelihood ratio, color histogram, χ2 histogram, compressed-domain discrete cosine transform (DCT), etc. Some of these approaches are discussed briefly in the following, and Table 3.1 provides a summary of their relative advantages and limitations [7].
The pixel-level comparison approach:
In this approach, gray-scale values of pixels at corresponding locations in two distinct frames, either consecutive or a fixed distance apart, are subtracted and the absolute value is used as a measure of dissimilarity between the pixel values [134, 189]. If this value exceeds a certain threshold, the pixel gray scale is assumed to have changed; the percentage of pixels that have changed is then the measure of dissimilarity between the frames. This approach is sensitive to several factors such as the noise introduced by the digitization process, object movement, and camera effects. One possible way to limit the effect of such problems is to subdivide the frame into regions and select only certain regions for processing.

Table 3.1. Comparison of Several Video Parsing Approaches

The likelihood ratio comparison approach: In this approach, the frames are divided into blocks, and the blocks of two consecutive frames are compared based on some statistical characteristics of their intensity values (e.g., the mean intensity) [103, 189]. This approach is more robust than the pixel-level comparison approach in the presence of noise and object movement.

The color histogram approach: In this approach, a frame is analyzed by dividing the color space into discrete colors called bins and counting the number of pixels that fall into each bin [134, 189]. A separate histogram is made for the R, G, and B components of the colors present in a frame. Another variation of the color histogram approach uses a normalization step [134] that makes large differences larger and small differences smaller.

The compressed-domain discrete cosine transform (DCT) approach: This approach is computationally efficient and is suitable for compressed video [126, 190].
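The pixel-level and histogram comparisons can be sketched as simple frame-dissimilarity measures; the frames, bin count, and thresholds below are illustrative.

```python
def pixel_change_fraction(f1, f2, pixel_threshold=10):
    """Pixel-level comparison: fraction of corresponding pixels whose
    gray-scale values differ by more than pixel_threshold."""
    changed = sum(1 for a, b in zip(f1, f2) if abs(a - b) > pixel_threshold)
    return changed / len(f1)

def histogram_difference(f1, f2, bins=4, levels=256):
    """Gray-level histogram comparison: sum of absolute bin differences."""
    def hist(frame):
        h = [0] * bins
        for p in frame:
            h[p * bins // levels] += 1
        return h
    return sum(abs(a - b) for a, b in zip(hist(f1), hist(f2)))

frame_a = [10, 12, 11, 10]      # flattened gray-scale frames
frame_b = [10, 13, 11, 10]      # same shot: tiny change
frame_c = [200, 210, 205, 190]  # abrupt scene change
```

A shot boundary would be declared wherever either measure exceeds a chosen threshold; the histogram measure is less sensitive to small object movement than the pixel-level one.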
4. MOTION DETECTION AND TRACKING APPROACHES
Typical video processing applications involve video data coding and transmission, such as video compression and video transmission over broadcast channels and computer networks. Most of the later research work in the field of digital video coding has been directed towards motion estimation and motion compensation, so that high interframe compression ratios can be achieved. The main purpose is to be able to successfully predict the pixels in the next frame based on the pixels in the current frame. Other applications, such as robotic vision, scene understanding, and automatic processing of surveillance data, focus on tracking moving objects and predicting their motion.
4.1 MOTION DETECTION
Capturing motion information about salient objects and persons is a major challenge in video data modeling. Higher-level information about identified objects, related to their motion across a sequence of frames, needs to be extracted. Video segmentation can be performed by effectively utilizing motion information through motion estimation and tracking of objects. Hence, the simplest form of segmentation is to group image regions that have similar motion parameters, since they most likely correspond to a single object. Each segment usually carries some semantic meaning, which is its content. To date, there are two main approaches to the computation of motion from image sequences: optical flow methods and feature matching methods. Optical flow methods are effective when the motion from one image frame to the next is small, while for motion with large changes, the feature matching methods are more appropriate. In the optical flow methods, the optical flow is the distribution of apparent velocities of movement of brightness patterns in an image, from which information about the spatial arrangement of the objects viewed and the rate of change of this arrangement can be obtained. Here, the relative motion of the objects and the viewer is considered. The optical flow quantity has been important in a variety of imaging and vision problems. For example, in computational vision, optical flow can be an input to higher-level vision algorithms performing tasks such as segmentation, tracking, object detection, robot guidance, and recovery of shape information. In motion compensated coding schemes, methods for computing optical flow play an essential part.
Multimedia Database Searching
53
In addition, since the intensity of a pattern remains the same as it moves, the following equation can be used to compute the optical flow:

Ix ẋ + Iy ẏ + It = 0    (3.1)

where I(x, y, t) denotes the image intensity at spatial location (x, y) at time t, Ix and Iy are the partial derivatives of I with respect to the spatial coordinates x and y respectively, and It is the partial derivative with respect to time. The derivatives of x and y with respect to time, (ẋ, ẏ)ᵀ, constitute the optical flow field. However, the computation of the optical flow is sensitive to noise since derivatives are involved. To mitigate the effects of noise, regularization methods are often used. Some of the methods assume that ẋ and ẏ are constant over spatial patches, some of the fast computation methods use multi-scale regularization and multigrid techniques, some of the methods are non-iterative and independent of the size of the image, and so on. For example, a smoothness constraint where the motion field varies smoothly in most parts of the image is proposed in [93]. Moreover, the problem is sometimes posed as an optimization problem where a cost function is optimized using iterative techniques such as gradient descent. Generally speaking, the optical flow methods compute the apparent velocity vector field corresponding to the observed motion of brightness patterns in successive image frames. The automatic computation of the optical flow gives the motion information without any need for correspondence of regions between successive frames. Frequently, it involves solving an optimization problem that involves spatial and temporal derivatives of the image intensity field. The idea of using feature matching based methods for motion estimation centers around the concept of correspondence: these methods are based on finding the correspondence of features between regions in successive frames.
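Returning to the optical flow constraint (3.1): one family of methods mentioned above assumes that ẋ and ẏ are constant over a small spatial patch, in which case (3.1) can be solved in a least-squares sense over all pixels of the patch (the Lucas-Kanade idea). A minimal NumPy sketch; the function name and the test pattern are illustrative, not taken from the cited work:

```python
import numpy as np

def optical_flow_patch(frame1, frame2):
    """Estimate a single (u, v) flow vector for an image patch by a
    least-squares solution of Ix*u + Iy*v + It = 0 over every pixel,
    assuming brightness constancy and small motion between frames."""
    f1 = frame1.astype(float)
    f2 = frame2.astype(float)
    # Spatial derivatives of the first frame and the temporal derivative
    Ix = np.gradient(f1, axis=1)
    Iy = np.gradient(f1, axis=0)
    It = f2 - f1
    # Stack the constraint Ix*u + Iy*v = -It for every pixel in the patch
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # N x 2
    b = -It.ravel()                                  # N
    # Least-squares (minimum-norm) solution of the overdetermined system
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

On a linear intensity ramp shifted horizontally by one pixel, the recovered flow is (1, 0), matching the imposed motion.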
This type of method arises from the fact that, given an object or an image region, one or more features can be extracted from it. It also requires establishing the correspondence between a sparse set of features extracted from frame to frame. Extraction of feature vectors has been well studied in certain specialized settings such as face recognition [41] and character recognition [116]. However, only recently have researchers started addressing the problem of automatic feature extraction for a multimedia database covering a wide variety of objects [147].
The features that have been employed in object tracking are color, histograms, lines, edges, corner points, contours, texture, shape, and the like. These features are mapped into a multidimensional feature space which allows similarity-based retrieval of images [7]. Although there exist several methods for extracting and establishing feature correspondence, the task is difficult and only partial solutions suitable for simplistic situations have been developed. In general, the process is complicated by occlusion, which may cause features to be hidden, false features to be generated, and hidden features to reappear. Much more work needs to be done in this area before the advent of one or more general techniques that can be reliably applied to real imagery. Sensitivity to noise exists even in feature matching based methods, but to a lesser degree than in optical flow methods. Motion information is obtained in the form of transformation parameters from one frame to the other. The simplest example is using pure translations of the features, which is quite popular in video compression techniques. In fact, the block motion vectors that are used in standard video compression techniques like H.261, H.263, MPEG-1, MPEG-2, etc. are simple translations of blocks of size 16 × 16. More sophisticated methods use transformations like affine, perspective, projective, polynomial, etc., but the number of parameters in each of these transformations is different and, accordingly, the number of features considered to estimate the parameters differs. It also becomes increasingly difficult to find the correspondence between the features as the object undergoes more complicated motion. In most cases where the correspondence is difficult, a human user is needed to provide the required control points that can be used to proceed with the matching and motion information extraction. Several approaches have been proposed in the literature for tracking the motion of objects.
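The translational block matching described above can be sketched as an exhaustive search that minimizes the sum of absolute differences; the block size and search range below are illustrative defaults, not normative values from the standards:

```python
import numpy as np

def block_motion_vector(ref, cur, top, left, size=16, search=8):
    """Exhaustive-search block matching: find the translation (dy, dx)
    that best matches a size x size block of `cur` against the reference
    frame, minimising the sum of absolute differences (SAD). This mirrors
    the block motion vectors used by H.261/H.263/MPEG-style coders."""
    block = cur[top:top + size, left:left + size].astype(float)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidate blocks that fall outside the reference frame
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue
            cand = ref[y:y + size, x:x + size].astype(float)
            sad = np.abs(block - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv
```

For a frame whose content has moved down 2 and right 3 pixels, the vector for an interior block points back to (-2, -3) in the reference frame.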
Two main approaches are elaborated here. The first technique uses a modified version of known compression algorithms, such as MPEG, to identify and track the motion of salient objects. In essence, this semantic-based compression approach combines both image processing and image compression techniques. A motion tracking algorithm proposed in [75] uses both the forward and backward motion vectors of macroblocks produced by the MPEG encoding algorithm to generate trajectories for objects. The second technique uses a directed graph model to capture both spatial and temporal attributes of objects and persons. A video semantic directed graph (VSDG) model is proposed in [58]. This model is used to maintain temporal information of objects once they are identified by image processing techniques. This is achieved by specifying the changes in the parameters of a 3D projection associated with the bounding volume of objects in a given sequence of frames.
4.2 OBJECT TRACKING
Unlike the earlier video processing methods that viewed the video data purely as a signal without its semantic meaning, the recent content-based video processing approaches focus on segmenting video frames into regions such that each region, or a group of regions, corresponds to an object that is meaningful to human viewers [54, 64]. One of the emerging applications of video processing is the storage and retrieval of video in multimedia databases and content-based indexing. Example queries for retrieving data from video databases based on content are to retrieve all the video clips where a goal is scored in a soccer game, to retrieve all the video clips where an airplane is flying right to left and executes a roll, and so on. The methods developed for compression, like MPEG-1 and MPEG-2, are examples of the earlier video processing methods, while the object-based representation of video data is being incorporated into standards like MPEG-4 and MPEG-7. In [156], the authors proposed a formulation for video segmentation and object tracking that does not require the supervision of a human user. Each frame in the video is partitioned into different segments, and the segments are combined to form object traces. An algorithm that partitions a video frame and simultaneously obtains the parameters of the underlying classes is also given. The problem of partitioning each frame is posed as a joint estimation of the partition and class parameter variables. By incorporating the partition information of the previous frame into the segmentation process of the current frame, the authors claim that their proposed method implicitly uses the temporal information. In addition, the experimental results in their paper demonstrate that their method succeeds in capturing the object classes even when the objects undergo translations and rotations not in the plane of the image.
5. ICONIC-BASED GROUPING AND BROWSING APPROACHES
Parsed video segments can be grouped together based on some similarity measure of image features possessed by one or more frames representing a shot or a scene, also known as key frames. This can be used to build iconic-based browsing environments. In this case, a key frame of each shot or scene is displayed to the user in order to provide
the information about the objects and possible events present in that shot or scene. In [182], a directed graph is used to portray an overall visual summary of a video clip consisting of different scenes that may appear more than once in the video sequence. In this directed graph, the nodes represent the key frames and the edges denote the temporal relationships between them, giving an overview of the order in which these events have occurred. Another approach, proposed in [46], uses a similarity pyramid to give a hierarchical clustering of all key frames present in the video database. Organization of this similarity pyramid is based on the extracted features and the user's feedback. This scheme can be scaled up to manage large numbers of video sequences and can provide browsing environments for developing digital video libraries.
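The directed-graph summary of [182] can be approximated by a small structure in which nodes are scene labels standing for key frames and edges record observed scene transitions; the class and scene names below are illustrative, not from the cited work:

```python
from collections import defaultdict

class SceneGraph:
    """Directed-graph video summary: nodes are key frames (one per scene),
    and an edge u -> v records that scene v followed scene u somewhere in
    the clip, giving an overview of the order of events."""
    def __init__(self):
        self.edges = defaultdict(set)

    def add_sequence(self, scene_labels):
        # Record one temporal edge per consecutive pair of scenes
        for a, b in zip(scene_labels, scene_labels[1:]):
            self.edges[a].add(b)

    def successors(self, scene):
        return sorted(self.edges[scene])

g = SceneGraph()
# A news clip whose anchor scene appears more than once
g.add_sequence(["anchor", "field", "anchor", "studio"])
```

Because the "anchor" scene repeats, its node accumulates two outgoing edges, which is exactly the repeated-scene case the graph summary is meant to expose.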
6. OBJECT RECOGNITION APPROACHES
The main purpose of object recognition is to identify key objects so that motion analysis, which tracks the relative movements of the objects, can be conducted. The data processing granularity for this approach is fine-grained since it requires the recognition of objects in individual video frames. Therefore, each video frame is analyzed either manually or by some image processing techniques for automatic recognition of objects. Furthermore, motion information can also be used to identify the key objects in video sequences. For example, in [5], motion and temporal information are combined with classical image processing techniques to produce robust results, allowing detection of objects without requiring any a priori assumptions. In [156], a simultaneous partition and class parameter estimation (SPCPE) algorithm that considers the problem of video frame segmentation as a joint estimation of the partition and class parameter variables has been developed and implemented to identify objects and their corresponding spatial relations. Each frame in the video is partitioned into different segments and the segments are combined to form object traces for motion information. Since their method incorporates the partition information of the previous frame into the segmentation process of the current frame, the temporal information is implicitly used. Their method utilizes the motion information to identify the key objects without the need for the supervision of a human user.
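The joint estimation idea behind SPCPE can be illustrated, in greatly simplified form, by alternating between a partition step and a class-parameter step on pixel intensities. The published algorithm uses richer class models than a single mean, so this is only a sketch of the alternation, with all names illustrative:

```python
import numpy as np

def two_class_partition(frame, iters=10):
    """Greatly simplified sketch of joint partition / class-parameter
    estimation: alternate between (a) assigning each pixel to the class
    whose current mean intensity is nearer (the partition variable) and
    (b) re-estimating the two class means (the class parameters)."""
    pixels = frame.astype(float)
    m0, m1 = pixels.min(), pixels.max()          # initial class parameters
    mask = pixels > (m0 + m1) / 2.0              # initial partition
    for _ in range(iters):
        mask = np.abs(pixels - m1) < np.abs(pixels - m0)   # partition step
        if mask.all() or (~mask).all():
            break                                # degenerate partition
        m0, m1 = pixels[~mask].mean(), pixels[mask].mean() # parameter step
    return mask, (m0, m1)
```

On a frame with two flat intensity regions the alternation converges immediately, labeling each region as one class.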
7. KNOWLEDGE-BASED EVENT MODELING APPROACHES
This kind of approach allows higher-level events (i.e., events that are meaningful to the users) to be specified so that the users can construct different views of the video data. For this purpose, knowledge-based techniques are used to model high-level semantics and events in video data. Semantic-based clustering approaches and temporal modeling are extensively used to capture high-level knowledge and semantics.
7.1 SEMANTIC-BASED CLUSTERING
In order to develop high-level semantics based on video parsing and segmentation, scenes are clustered together based on some desired semantics, either automatically or manually. Several approaches have been proposed to build such abstractions. For example, clustering can be based on key objects and other features within each scene, which are identified using image processing techniques or textual information from video captions [46]. Another clustering approach uses domain-specific semantics in the form of sketches or reference frames to identify video segments that are closely related to these frames [160]. Alternatively, the scenes of the segmented video can be examined manually in order to append appropriate textual descriptions. Such descriptions can then be used to develop semantic-based clustering by examining events present in different scenes [7]. In [46], a set of features of video frames along with the motion vectors is used, and data is classified into pseudo-semantic classes corresponding to head-and-shoulders shots, indoor versus outdoor, man-made versus natural, etc. In [160], the well-structured domain of news broadcasting is exploited to build an a priori model consisting of reference frames as a knowledge base that assists in semantic-based classification and then in clustering video segments related to news broadcasts.
7.2 TEMPORAL INTERVAL-BASED MODELING
Several approaches have been proposed in the literature to apply temporal modeling to capture knowledge and semantics in temporal databases. In these approaches, knowledge-based formalisms for event specification of video data are developed, and semantic operators such as logic, set, and spatio-temporal operators are extensively used.

Logic operators: and, or, not, if-then, only-if, equivalent-to, etc.

Set operators: union, intersection, difference, etc.

Spatio-temporal operators: starts, meets, overlaps, equals, finishes, before, during, eventually, always, etc.

In essence, these approaches use different combinations of these semantic operators. For example, in some of the approaches, both the temporal and logical operators are used to develop the spatio-temporal logic for specifying video semantics. Some approaches use the spatio-temporal operators and the set-theoretic operators to specify video events in the form of algebraic expressions. The set-theoretic operators are also used for video production environments [75, 140, 173]. These temporal interval-based approaches can be roughly classified into two categories: the spatio-temporal logic and the algebraic models [7].
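The binary temporal operators listed above can be implemented as a simple classifier over interval endpoints; inverse relations are lumped together here for brevity, so this covers only a subset of the thirteen relations of Allen's interval algebra:

```python
def interval_relation(a, b):
    """Classify the temporal relation between two intervals a = (a1, a2)
    and b = (b1, b2), with a1 < a2 and b1 < b2, using the binary relations
    mentioned above. Inverse relations fall through to 'other'."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:                  return "before"
    if a2 == b1:                 return "meets"
    if a1 == b1 and a2 == b2:    return "equals"
    if a1 == b1 and a2 < b2:     return "starts"
    if a2 == b2 and a1 > b1:     return "finishes"
    if a1 > b1 and a2 < b2:      return "during"
    if a1 < b1 < a2 < b2:        return "overlaps"
    return "other"               # an inverse relation, not distinguished here
```

A generalized relation in the sense of [58] would then be built by checking that consecutive intervals in a sequence all satisfy the same relation.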
7.2.1 SPATIO-TEMPORAL LOGIC
The approaches in this category use symbols to represent the salient objects identified in a scene, and use a sequence of state assertions to represent the scenes. The sequence of state assertions captures the geometric ordering relationships among the projections of the objects in that scene, and specifies the dynamic evolution of these projections in time as the objects move from frame to frame. The assertions are inductively combined through Boolean connectives and temporal operators. The approach presented in [24] uses spatial and temporal operators such as temporal/spatial eventually and temporal/spatial always to model video semantics in an efficient manner. Also, fuzziness and incomplete specification of spatial relationships can be handled by defining multilevel assertions. Another approach, proposed in [58], defines generalized temporal intervals which are based on the binary temporal relations (mentioned above) to model video data. The basis for the generalization is that two consecutive intervals satisfy the same temporal relation. A generalized relation is an n-ary relation that is a permutation among n intervals, labeled 1 through n. The n-ary relations are used to build the video semantics in the form of a hierarchy. For this purpose, simple temporal events are first constructed from spatial events by using the meets n-ary temporal operator. The operands of this meets operator are the spatial events observed from frame to frame, and the observed event is termed a simple temporal event. Identification of simple temporal events requires evaluation of the spatial and motion information of objects, captured in the VSDG model.
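The temporal eventually and always operators over a sequence of frame-level spatial assertions reduce to existential and universal checks. A minimal sketch, where the object names and positions are illustrative rather than taken from the cited work:

```python
def eventually(assertion, frames):
    """Temporal 'eventually': the assertion holds in at least one frame."""
    return any(assertion(f) for f in frames)

def always(assertion, frames):
    """Temporal 'always': the assertion holds in every frame."""
    return all(assertion(f) for f in frames)

# Each frame maps object symbols to projected x-positions; the assertion
# is a geometric ordering relation between two object projections.
frames = [
    {"car": 10, "tree": 40},
    {"car": 30, "tree": 40},
    {"car": 50, "tree": 40},   # the car has passed the tree
]
car_left_of_tree = lambda f: f["car"] < f["tree"]
```

Here the ordering relation holds in some frames but not all, so eventually is satisfied while always is not.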
7.2.2 ALGEBRAIC MODELS
The approaches in this category use temporal operators and set operations to build formalisms that allow semantic modeling as well as editing capabilities for video data. In [75], a set of algebraic operators is defined to allow spatio-temporal modeling and video editing capabilities. In their approach, temporal modeling is carried out by the spatio-temporal operators used in the spatio-temporal logic formalisms. These operators are implemented through functions that map objects and their trajectories into temporal events. In addition, several functions are used to perform various video editing operations such as inserting video clips and extracting video clips and images from other video clips. Another approach, proposed in [173], uses hierarchical abstraction of video expressions to represent scenes and events for indexing and content-based retrieval. A video expression, in its simple form, consists of a sequence of frames representing a meaningful scene. Compound video expressions are constructed from simpler ones through algebraic operations. The algebraic operators are creation, composition, description, etc. The composition operators include several temporal and set operations. The set operators are used to generate complex video expressions (e.g., video segments) according to some desired semantics and description. Content-based retrieval is managed by annotating each video expression with field name and value pairs, defined by the users. The algebraic modeling approach has been extended to develop an object-oriented abstraction of video data as presented in [140]. In their approach, a video object is identical to a video expression in [173] and corresponds to semantically meaningful scenes and events. IS-A generalizations are used to build an object hierarchy that is defined on instances of objects rather than classes of objects, allowing semantically identical video segments to be grouped together.
Inheritance in such an object hierarchy is based on interval inclusion, where some attribute/value pairs of a video object A are inherited by another video object B provided that the raw video data of B is contained in that of A. The composition operations, such as the interval projection, merge, and overlap constructs, are supported by the set operators, which are also used to edit video data and define new instances of video objects.
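The interval-inclusion inheritance rule can be sketched directly: a video object inherits attribute/value pairs from any object whose raw-data interval contains its own. The field names and example objects here are illustrative, not from the cited model:

```python
def inherits(child_interval, parent_interval):
    """Interval-inclusion test: video object B may inherit attribute/value
    pairs from A when B's raw-data interval lies inside A's."""
    (c1, c2), (p1, p2) = child_interval, parent_interval
    return p1 <= c1 and c2 <= p2

def attributes_of(obj, objects):
    """Collect an object's own attributes plus those inherited from every
    object whose interval contains it; own attributes take precedence."""
    attrs = {}
    for other in objects:
        if other is not obj and inherits(obj["interval"], other["interval"]):
            attrs.update(other["attrs"])
    attrs.update(obj["attrs"])
    return attrs

# A whole-scene object and an event contained within its frame interval
scene = {"interval": (0, 100), "attrs": {"location": "stadium"}}
event = {"interval": (20, 30), "attrs": {"action": "goal"}}
```

The contained event object picks up the scene's location attribute alongside its own.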
8. CHARACTERISTICS OF VIDEO DATA MODELING
Table 3.2 lists some important characteristics of the selected video data modeling approaches [7]. From the table, one observation is that most of the approaches that use an automatic mode of capturing video
Table 3.2. Characteristics of Several Selected Video Data Modeling Approaches [table not reproduced]
semantics cannot support high-level abstractions. This is due to the difficulty of capturing concepts that cannot easily be mapped into a set of images and/or spatio-temporal features that can be automatically extracted from video data without human intervention. In [160], the authors use domain knowledge to incorporate high-level semantics into techniques that capture the semantics through automatic parsing.
Another observation is that visual querying and browsing of video data is an important feature that must be provided as part of the database. In particular, algebraic and logical expressions describing spatio-temporal semantics can be difficult to understand and to formulate as queries. Also, in spatio-temporal video data modeling, some degree of imprecision is intrinsic. The approach proposed in [24] uses a visual query facility as an intuitive interface and employs different levels of precision in specifying spatial relationships among objects as a means to manage the imprecision problem, whereas in [59] only the most precise and detailed representations are supported. The algebraic model presented in [75] has a limitation in the sense that it puts the burden on the users to define the semantic functions related to video objects. Furthermore, these functions must be defined in terms of object trajectories. On the other hand, the algebraic approaches proposed in [140, 173] provide flexibility in identifying the desired semantics since they allow the interactive formulation of video semantics by the users. However, their approaches suffer from some shortcomings. First, they can become unmanageable for naive users. Second, they become impractical for large video databases because of the high cost of human interaction.
9. CONTENT-BASED RETRIEVAL
Traditionally, segmentation is relegated to image processing and is considered a low-level operation. For image and video data, the majority of the representation methods, especially the compression methods, treat them as just numbers with some functional or statistical relationship between them. There was no attempt to incorporate content-based description into the compression schemes. The job of image processing is to find the segments or regions that are homogeneous according to some mathematical or statistical criteria. On the other hand, the so-called computer vision or artificial intelligence (AI) methods are concerned with high-level operations dealing with the content of the image, i.e., understanding the meaning behind the segments, or knowledge representation. However, a combined approach based on content-based coding of images and video has recently emerged which tries to represent image or video data using the segments within them, where the content of each segment is characterized mathematically. With the recent trend towards processing images and video based on their content, there is growing interest in content-based segmentation methods. With content-based segmentation methods, users can access images and videos from their respective databases by specifying their content, for example, accessing images depicting a sunset. The desire
for content-based retrieval is very natural for humans. Due to the enormous amounts of data available, browsing an image or video database and viewing each member in search of a desired image or video becomes an extremely tedious and time-consuming task. To facilitate content-based representation, the initial raw data, both images and video, has to be partitioned such that the description of each segment is related to its content. Each object, or group of objects with similar properties, should then be described in such a way that the content-based description of the data can be handled automatically by a computer. The problem of content-based or object-based segmentation for image and video data has therefore gained significant importance. Increasingly accessible digital image and digital video sources have given rise to enormous amounts of data. Moreover, recent progress in hardware has made the management of such large amounts of image, video, and audio data, or a combination of them, commonplace. Because of these facts, retrieval based on content becomes even more relevant and becomes a requirement for database systems. As indicated in [183], the implementation of a content-based retrieval facility cannot be based on one single foundation and requires taking into account several important aspects such as the underlying data model, the spatial and temporal relations, a priori knowledge of the area of interest, and the scheme for representing queries.
9.1 DATA MODEL ASPECT
Most of the conventional database systems are based on relational data models and are designed for storing and managing textual and numerical data. Information retrieval is often based on simple comparisons of textual or numerical values, which is not adequate for multimedia data because information in multimedia systems is highly volatile and semantically rich. Another data model that provides better facilities to handle multimedia data is the object-oriented data model [16, 169]. However, the spatio-temporal relationships of multimedia data are still not well managed in the object-oriented data models. That is, the conventional database systems do not provide enough facilities for storing, managing, and retrieving the contents of multimedia data. Therefore, the underlying data model must be extended with facilities to handle multimedia data. Data access and manipulation for multimedia database systems are more complicated than those of conventional database systems since it is necessary to incorporate diverse media with diverse characteristics. Audio and video data essentially imply a temporal aspect, such that the synchronization between pieces of data can be an element that
needs to be managed by a database system. When text data is superimposed onto video data and is stored separately from the video data, both spatial and temporal relationships need to be managed to define the relation between them. Therefore, the ability to manage the spatiotemporal relationships is one of the important features for multimedia database systems. In addition, the representation of multimedia data is one thing while the contents perceived is another. Hence, the recognition and interpretation of contents become important steps in retrieval. In response to this demand, the database management system requires knowledge to interpret raw data into the contents implied. Knowledge is obtained by evaluating the contents that are associated with the semantics of the data being retrieved.
9.2 SPATIAL AND TEMPORAL RELATION ASPECT
Content-based retrieval can be performed based on the spatial, temporal, or spatio-temporal relationships of the objects, the features (shape, color, texture, etc.) or the motion of the objects, or a combination of the above. Retrieving objects based on their spatial, temporal, or spatio-temporal relationships is very typical for many multimedia applications. For example, the most common applications employing spatial semantics and content-based retrieval are geographical information systems (GIS) and map databases. This type of system is extensively used in urban planning and resource management scenarios. In GIS applications, the representation and indexing of the abstract spatial relationships are very important. The research in [60, 81, 129] discusses the retrieval of geographical objects such as buildings, rivers, and so on from geographical databases. Another example is clinical radiology, in which the relative sizes and positions of objects are critical for medical diagnosis and treatment. Objects can be retrieved by evaluating their shapes, colors, or texture. There are two possible forms of specification for shapes. The first form is to give a photo or graphic of the object in the database that contains the shape to be retrieved. The other form is to draw the shape. For example, the retrieval of a company trademark is performed by giving a hand-drawn graphic or by specifying a registered trademark in [23], and an object of arbitrary shape is retrieved by giving a hand-drawn shape or by cutting and pasting a shape from a painting in [104]. Colors and the spatial distribution of the
objects are often applied for the retrieval of paintings [52, 76, 180]. For example, Gong et al. [76] retrieve objects by specifying an image (e.g., a digitized photo) as an example of colors with a spatial distribution. Each representative plane is divided into nine subplanes in their approach. Texture specification is applied to retrieve objects that have a certain texture on their surfaces. The purpose is to retrieve specific patterns appearing in an image. The QBIC system [65] is a system that allows the retrieval of objects by shape, color, spatial relationships, and texture. One approach to retrieving objects from video data is by the intrinsic features of video data, i.e., the motion of objects appearing in the video. In [36], the authors developed an automated content-based video search system called VideoQ. This system allows a user to specify the trajectory, duration, scaling, color, shape, and texture for retrieving objects. Another system, proposed by [185], retrieves video data by specifying the motion of an object observed in the video data, either by moving the mouse or by changing the size of an object. When using an example motion, the trajectory and velocity are sampled in accordance with the mouse movement. When changing the size of an object, the new size is specified by drawing rectangles along the timeline. Retrieving objects by the trajectory, velocity, and size of an object is another approach for the retrieval of spatio-temporal contents [24], since an example motion is regarded as a sequence of spatial relationships. The positions of objects are specified on a screen, from which the system extracts spatial relationships at a certain point in time. Two or more sets of spatial relationships are defined sequentially in accordance with the time, representing the spatio-temporal relationships of the objects. Under this approach, the motion of an object is specified at discrete time points.
Therefore, their approach focuses on the spatio-temporal correlation of multiple objects rather than the retrieval of the detailed motion of an object.
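A crude stand-in for trajectory-based matching of the VideoQ kind is to resample a sketched query trajectory and a stored object trajectory to the same number of points and compare them point-wise; this is an illustrative simplification, not the cited system's actual measure:

```python
import numpy as np

def trajectory_distance(query, stored, samples=16):
    """Compare two object trajectories (lists of (x, y) points) by
    resampling each to a fixed number of points via linear interpolation
    and taking the mean point-wise Euclidean distance."""
    def resample(traj):
        traj = np.asarray(traj, dtype=float)
        t = np.linspace(0, 1, len(traj))    # original parameterisation
        ts = np.linspace(0, 1, samples)     # common parameterisation
        return np.stack([np.interp(ts, t, traj[:, 0]),
                         np.interp(ts, t, traj[:, 1])], axis=1)
    q, s = resample(query), resample(stored)
    return float(np.linalg.norm(q - s, axis=1).mean())
```

Identical trajectories score zero; a trajectory shifted vertically by one unit scores exactly one, so ranking stored clips by this distance answers a sketched-motion query.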
9.3 SEMANTIC KNOWLEDGE ASPECT
Content-based retrieval can be performed based on the descriptive knowledge or the derivation knowledge of the objects. In some cases, a database is queried by specifying semantic contents, and hence knowledge is required to capture the semantic contents of multimedia data as well as to interpret the query. Knowledge-assisted retrieval plays an important role in multimedia databases because even a single piece of media data has many facets of meaning and content.
The semantic contents of multimedia data can be managed by two approaches. The first one is to annotate an image, a video, or an audio item with text. The other approach is to provide the system with a rule base or a knowledge base, where knowledge or rules are used to extract features from the raw data, to match content, to analyze queries, and so forth. Generally speaking, the second approach is more practical than the first one, even in large multimedia databases, as will be discussed later. 1. There are content-based retrieval examples based on text annotation for images [79, 85, 108] and video data [121, 140, 174]. In these studies, the semantic contents of an image or a video are represented by human-annotated textual descriptions. The textual descriptions provide information which cannot be extracted from an image or a video through image processing techniques, e.g., the name of a river. Therefore, content-based retrieval for images or video data is internally replaced by a keyword retrieval over the annotations. This approach is usually adopted when it is very hard to extract or recognize the target contents from image, video, or audio data. There are two advantages to this approach. First, it can be easily implemented. Secondly, misevaluation of contents will rarely occur. However, several issues need to be addressed. Since text annotations are used for retrieval, the preparation of annotations in association with image, video, and/or audio data is required, and these are often created by humans. When the annotations are created by humans, this approach becomes impractical, especially in large multimedia databases. It can be impractical even in a relatively small multimedia database if the raw data accompanied by annotations is updated frequently. Another problem arises when the degree of an attribute in the target data is used in the annotations, since it is very difficult to keep the annotations consistent.
Humans have different perspectives when specifying annotations for the same object. In addition, the consistency of judgment criteria needs to be well managed throughout database evolution [183]. 2. The rule-based or knowledge-based approach has been applied to text databases in the literature [35, 115, 179]. In these studies, the main interest is in generating cooperative answers [179] or in retrieving semantic contents implied in the text [35, 115]. On the other hand, many studies have used rule-based or knowledge-based approaches in the area of content-based retrieval for multimedia databases, either directly [17, 94, 139, 186] or indirectly [184]. The semantic contents are represented by knowledge directly in the sense that the fundamental property of the contents is the knowledge associated with the feature values and/or the spatial relationships; the semantic contents are represented by knowledge indirectly when the knowledge is defined for the subject of interest indirectly. In [17], it is assumed that the objects appearing in the image are known, and the semantic contents relating to the attribute values of the objects can also be retrieved. Knowledge-based content-based retrieval applied to medical images is studied in [94]. The authors develop a hierarchy called the Type Abstraction Hierarchy (TAH) to relate a general level of concepts to detailed levels with sets of attribute values. In their approach, knowledge refers to the evaluation of the shapes and spatial relationships of the objects (e.g., a tumor), and knowledge for interpreting contents in terms of spatial relationships or semantic contents is constructed by TAHs. In other words, TAHs conceptualize the objects and their semantics and incorporate domain expert knowledge in order to improve search efficiency in radiological databases. Content-based retrieval of images using knowledge that interprets semantic contents in terms of image representations is discussed in [139, 186]. These two approaches retrieve images using image features specific to the content of interest. In [139], the content to be retrieved concerns the meaning of an image. The meaning of the image is represented by a keyword that is defined with a description of image features such as regions of colors and their locations. The semantic contents are represented by the spatial composition of the color regions. The hierarchical relation between the primitive color regions and the semantic contents is captured in a state transition model. In [186], the authors define a pseudo attribute which associates a query condition with domain knowledge describing the contents to be retrieved.
The method for extracting semantic features from multimedia data, the rules that transform query conditions into an internal representation of the same type as the extracted semantic features, and the rules that transform a certain operator into a content-dependent calculus are collectively called the domain knowledge. The so-called "film grammar," the set of camera framing and/or editing techniques commonly used by film directors and editors, is used to define knowledge indirectly in [184]. The semantic expression of the contents of a scene, such as a conversation scene or a tension scene, is associated with the film grammar. Since the film grammar does not directly represent semantic contents (e.g.,
Multimedia Database Searching
conversation), nor does it try to extract objects (e.g., the faces of people for evaluating a conversation scene) from the databases, the knowledge is defined for the subject of interest indirectly. In other words, the semantic contents are not extracted from the video data; rather, the editorial technique is extracted to evaluate the semantic contents. Though using knowledge or rules is more practical than using text annotations, some issues arise and need to be discussed. For example, how to translate a query condition into semantically equivalent expressions and evaluate those expressions with features extracted from the raw data is an important issue in knowledge-based content-based retrieval. Next, how to keep the knowledge base semantically consistent with the database schema also needs to be addressed. A possible solution for this problem is to make the knowledge base and the database schema semantically dependent on each other by integrating them with rules that prescribe the semantic association of one with the other [183].
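The direct knowledge-based style described above, where a keyword is defined by a rule over primitive image features (in the spirit of [139]), can be sketched as follows. This is a hypothetical illustration, not the cited system: the keyword "sunset", the region format, and the 0.2 area threshold are all invented for the example.

```python
# Hedged sketch of rule-based content retrieval: a keyword in the
# knowledge base maps to a rule evaluated over extracted color regions.
# Region fields (color, y position, fractional area) are illustrative.

def region_rule_sunset(regions):
    """A 'sunset' is defined here as a large orange region above a dark region."""
    orange = [r for r in regions if r["color"] == "orange"]
    dark = [r for r in regions if r["color"] == "dark"]
    return any(o["y"] < d["y"] and o["area"] > 0.2
               for o in orange for d in dark)

KNOWLEDGE_BASE = {"sunset": region_rule_sunset}

def retrieve(keyword, database):
    """Return ids of images whose extracted regions satisfy the keyword's rule."""
    rule = KNOWLEDGE_BASE[keyword]
    return [img_id for img_id, regions in database.items() if rule(regions)]

db = {
    "img1": [{"color": "orange", "y": 0.1, "area": 0.3},
             {"color": "dark", "y": 0.8, "area": 0.4}],
    "img2": [{"color": "blue", "y": 0.1, "area": 0.5}],
}
print(retrieve("sunset", db))  # -> ['img1']
```

Keeping the rules in a separate table, as here, is one way to make the knowledge base and the schema easy to evolve together, which is the consistency concern raised above.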
9.4
QUERY REPRESENTATION SCHEME ASPECT
In traditional text-based database systems, a query is often represented as text or a numerical value, and data retrieval is basically performed by relational algebra or simple comparisons of attribute values. Multimedia database systems can also be queried with this approach. For example, a user may issue a query such as: “Find a picture with a red car in front of a building.” However, using keywords to represent graphical features or image attributes such as color does not always work well for multimedia data, since the types of contents derivable from multimedia data are diverse. A promising approach to query specification, “Query-By-Example” (QBE), has been proposed to allow a user to specify a query condition by giving examples [191]. In QBE, the form of representation is closer to that of the data to be retrieved, so it provides a better solution for content-based retrieval. A query for nontextual data is represented, for example, as a rough sketch, a rough painting with colors, or a motion example of trajectory and/or velocity. There are several advantages of using QBE over keywords in query representations [183]. 1. QBE provides an intuitive way of representing user constraints since the query representation corresponds to the features of the data.
2. QBE provides a better way of representing queries for nontextual data since the form of representation is the same as, or at least close to, the features of the data to be retrieved. However, a database may be queried by specifying its semantic contents, which requires the analysis and processing of data semantics by the database system during query evaluation. In QBE, a query is represented by examples of what the user wishes to retrieve, and therefore the data semantics are not analyzed. Also, the QBE approach is not adequate when two or more heterogeneous types of data form the content. For this purpose, another way of representing queries, called “Query-By-Subject/Object”, allows the subjective descriptions of a query to be specified, since a keyword can represent the semantic content well. To implement this type of query representation, the approaches discussed in knowledge-assisted retrieval, i.e., the text annotation and knowledge/rule-based methods, are applied to extract and manage contents.
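The core of QBE, ranking database items by feature similarity to an example rather than matching keywords, can be sketched in a few lines. The color histograms and the L1 distance below are illustrative choices and not the method of any system cited above.

```python
# Minimal Query-By-Example sketch: the query is itself an example
# (a color histogram here), and items are ranked by feature distance.

def l1_distance(h1, h2):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def query_by_example(example_hist, database, k=2):
    """Return the k item ids whose histograms are closest to the example."""
    ranked = sorted(database,
                    key=lambda item: l1_distance(example_hist, database[item]))
    return ranked[:k]

db = {"red_car": [0.7, 0.2, 0.1],
      "blue_sky": [0.1, 0.1, 0.8],
      "sunset": [0.6, 0.3, 0.1]}
print(query_by_example([0.65, 0.25, 0.1], db, k=2))  # -> ['red_car', 'sunset']
```

Note how the query and the stored items share one representation; that shared form is exactly why QBE suits nontextual data, and also why it cannot capture semantics that never appear in the features.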
Chapter 4
MULTIMEDIA BROWSING
1.
INTRODUCTION
With advances in computing power and high-speed networks, digital video is becoming increasingly popular. Large collections of multimedia documents can be found in many application domains such as education and training, the broadcast industry, medical imaging, video conferencing, video-on-demand (VOD), and geographic information systems. However, the idea of what a document is has undergone major changes. Only a few years ago, a document had to be on paper and contained mostly text. Today, a document is a multimedia entity which may contain significantly more visual than textual information. These changes apply not only to documents but to many other areas of visual information management, browsing, storage, and retrieval. Given its role and applications, video will be an integral component of emerging multimedia information systems; consequently, video collections, in the form of both videotape repositories and digital video archives, will be increasingly valuable assets that need to be managed in a new generation of multimedia database systems [181]. To support the use of video materials, retrieval and browsing techniques have to be developed. By capitalizing on the processing of visual information, a multimedia database system can offer advanced content management functions for more effective querying and faster retrieval and delivery of video. Digital library applications, since they are based on huge amounts of digital video data, require efficient browsing and searching mechanisms to extract relevant information. Extracting information from images/videos is time consuming. In order to provide fast
response for real-time applications, information or knowledge needs to be extracted from images/videos item by item in advance and stored for later retrieval. For example, to do spatial reasoning, numerous spatial relations among objects need to be stored [39]. Traditionally, when users want to search for certain content in videos, they need to fast forward or rewind to get a quick overview of the video tape. This is a sequential process, and users cannot jump directly to a specific topic. Although disk storage and computer network technologies are progressing very quickly, disk storage, network speed, and network bandwidth still cannot meet the requirements of distributed multimedia information applications. As digital video libraries become pervasive, finding the right video content is a challenge. The problem with video is that it tends to exist in a very linear space, one frame after the other. Hence, how to organize video data and present the visual content in compact form becomes important in multimedia applications [181]. The combination of possibly thousands of video streams in a multimedia database system and the complexity of video data can easily overwhelm a user who is unfamiliar with the content. Therefore, the capabilities to provide effective ways for users to browse among and within video data are necessary for multimedia database systems. As indicated in [99, 193], playing video data can be thought of as the result of the following process:
Usually, video browsing starts with a query. This query specifies the subjects the user is interested in and makes the browsing more focused and efficient. A query usually ends up with multiple video streams since the user often has vague information needs which can only be described in conceptual terms in the query. Some of the video streams might not be what the user wants. Hence, the user needs to browse through the query results. Browsing is an efficient way to exclude the unwanted results and to examine the content of the possible candidates before a request to play the video is issued. Next, the details of specific logical video streams, including the set of logical video streams and the set of video clip posters, are presented to the user. Finally, a certain logical video segment or the whole logical video stream can be played. During any stage of the video browsing, the previous video query can be refined or a new query can be submitted.
Cataloging and indexing of video is a critical step to enable intelligent navigation, search, browsing, and viewing of digital video [36, 127, 161, 171]. In the context of digital video, the search and browse patterns of users have been classified into two broad categories: subject navigation and interactive browsing. In subject navigation, an initial search generates a set of results which must be browsed, whereas in interactive browsing, a collection of videos is browsed without any prior search [161]. It is obvious that browsing plays a significant role in both categories. While the importance of a seamless integration of querying, searching, browsing, and exploration of data in a digital library collection is recognized, this chapter focuses on the challenges associated with video browsing.
2.
VIDEO BROWSING
An increasing number of digital library systems allow users to access not only textual or pictorial documents, but also video data. Digital library applications based on huge amounts of digital video data must be able to satisfy complex semantic information needs and require efficient browsing and searching mechanisms to extract relevant information [89]. In most cases, users have to browse through parts of the video collection to get the information they want, which addresses the contents and the meaning of the video documents. Hence, a browsing system has to provide support for this kind of information-intensive work. Browsing applications have an impact not only on the required data access functionality, but also on the presentation and interaction capabilities. The main goal of video browsing is to find relevant information quickly for the users of video browsing applications, since users may want to view only the relevant shots of videos. There are two possible goals that a user may pursue in a browsing session depending on the application [55, 89]: 1. Explorative browsing: Users are assumed to have specific subjects in mind about which they want to collect information. 2. Serendipity browsing: Users scan through data without necessarily having concrete goals in mind, and suddenly find interesting but unexpected items. In addition, browsing video surrogates saves user time and storage capacity, and avoids unnecessary downloading of large files. Currently, downloading a video file is very slow. Also, most users inspect only the shots of interest to them. If preview capabilities for video content can be provided, users can select only the relevant video materials and avoid downloading the whole large video file. Previews are useful
for browsing a collection and browsing the results of a search. Therefore, a video browsing system must support interactive browsing by providing continuous playback of video scenes, full VCR functionality, and low startup latency between the playout of shots from several videos within a browsing session [86, 149]. Efforts to support video browsing date back to the early 1990s. Most of the systems extracted key frames at evenly-spaced time intervals and displayed them in chronological order [130]. Those systems do not use semantic information in the video. Content-based solutions started appearing in 1993 [189]. The basic idea is to segment the video using shot-boundary detection algorithms and select one or more frames from each shot. In [13, 182], efforts to analyze video as a rich medium and to leverage advances in areas like text retrieval, computer vision, natural language processing, and speech recognition can be found. The proliferation of retrieval systems indicates that this is a very active area of research. Proposals on system architectures for browsing applications can be found in [122, 89, 90]. The browsing system in [89, 90] is based on a client/server architecture and consists of a user interface, a client buffer, a retrieval engine, an admission control, a video server, and a multimedia database management system. Their system supports continuous presentation of time-dependent media and reduces startup latency. Thus, semantic video browsing capability can be provided. In [36], the VideoQ system is developed. The VideoQ system can be accessed by drawing sketches, typing in query terms, or simply scanning the videos. Visual features and spatio-temporal relationships are used for retrieval. The video shots are cataloged into a subject taxonomy to allow users to navigate, and each video shot is manually annotated to allow users to issue simple keyword searches.
CueVideo is a browsing system that combines manual input with video and speech technologies for automatic content characterization. The CueVideo system integrates voice and manual annotation, attachment of related data, visual content search technologies (QBIC), and novel multi-view storyboard generation. This system allows the user to incorporate the type of semantic information that automatic techniques would fail to obtain [144, 161]. The Informedia project [48, 171] combines speech recognition, image processing, and natural language understanding techniques for processing video automatically in a digital library system. Another browsing system, proposed by [66], enables the calculation of so-called confidence scores by means of audio and video analysis. These confidence scores are used to represent the degree of interest of media data and to assist in browsing activities in terms of
the visualization of the scores, in specifying rate control corresponding to the scores, and in supporting indexing points for random access. Another kind of approach is to build icon-based browsing environments, which can be constructed by means of key frames. Many video browsing models have been proposed to allow users to visualize video content based on user interactions [13, 63, 65, 130, 140, 160, 181]. These models choose representative images using regular time intervals, one image in each shot, all frames with a focus key frame at a specific place, and so on. Key frames extracted from the videos are one of the methods to provide visual surrogates of video data. A good key frame selection approach is very important to provide users such visual cues in browsing. Some key frame selection approaches will be discussed in the next section. Moreover, different levels of representation are needed in a browsing system to provide previews of videos. A browsing system should have the ability to model visual contents at different granularities so that users can browse large video collections quickly. That is, users can get the necessary information faster and the amount of data transmission can be reduced. Also, the browsing system should allow users to browse a video sequence directly based on their interests and to retrieve video materials using database queries. Since video data contains rich semantic information, database queries should allow users to get both the high-level content such as scenes or shots and the low-level content such as the temporal and spatial relations of semantic objects. Here, a semantic object is defined as an object appearing in a video frame, such as a “car.” In [45], a video browsing model based on augmented transition networks (ATNs) is proposed. An ATN and its subnetworks can model video data at different granularities such as scenes, shots, and key frames.
Moreover, user interaction capability and key frame selection mechanism are incorporated into the proposed ATN model for video browsing. In the next subsections, some of the browsing projects proposed in the literature are discussed.
2.1
INFORMEDIA SYSTEM
The Informedia system is a pioneering and successful system for library creation and exploration that integrates different sources of information (such as video, audio, and closed-caption text) present in video data [48, 171]. The Informedia project is conducted at Carnegie Mellon University to improve search and discovery in the video medium. The unit of information retrieval in the Informedia library is the video segment, which may contain a single story. Currently, the Informedia
library contains around 40,000 segments [48]. An information visualization interface has been developed. This interface allows the user to browse the whole result space without the need for the time-consuming and frustrating traversal of a list of results [47]. In their interface design, alternative browsing options in response to the query, such as headlines, thumbnails, filmstrips, and skims, are provided. The headlines, thumbnails, and filmstrips are viewed statically. The skim option plays back a condensed version to communicate the content of the video. In this system, speech recognition, image processing, and natural language understanding techniques are used for video data processing. Speech recognition is used to create a time-aligned transcript of the spoken words and to segment the video into paragraphs by using the Carnegie Mellon University Sphinx speech recognition engine. This engine transcribes the content of the video material, with a word error rate proportional to the amount of processing time devoted to the task. Significant images and words from a paragraph are extracted to produce a short summary of the video. This video summary is called a video skim and it allows effective searching and browsing. In addition, the filmstrip view provides a storyboard for quick reviewing to reduce the need for viewing the entire video paragraph. If closed-caption text exists for a video, it is integrated with the output of the recognizer. The final text transcript is synchronized at the word level to the video through Sphinx processing. Natural language processing and hidden Markov models (HMMs) can be used to perform contextual analysis on the source metadata. HMMs have proven effective for automatically tagging entities in text output from speech recognizers, where such text lacks punctuation cues [110]. Hence, they can be applied to extract information from text metadata, expand aliases, resolve ambiguity, and expand the set of names that can be matched more accurately.
Other descriptors for the Informedia library contents, like metadata, include production notes, automatic topic identification, location information, and user annotations. The user annotations are comments that the user can type or speak pertaining to a specified portion of video. This system allows the user to browse and retrieve video from the Informedia library based on date (i.e., “when”), word occurrence (i.e., “what”), and location (i.e., “where”). These information dimensions can be used in presenting overviews of the video content, in summarizing multiple video segments, and as a query mechanism to find segments dealing with a particular topic of interest. Therefore, the Informedia digital video library interface supports word query, image query, and spatial query.
Figure 4.1. The Cataloging Phase Architecture [144]

2.2

CUEVIDEO BROWSING SYSTEM
The CueVideo browsing system is proposed in [144, 161]; its domain consists primarily of videos of technical talks and presentations of general interest to IBM's Almaden Research community. The CueVideo system combines computer vision, information visualization, and speech recognition technologies together with a user interface designed for rapid filtering and comprehension of video content. Computer vision is used for automated video analysis. Information visualization is used for data visualization. Speech recognition is used for annotation, making related media available through attachments, and supporting user-defined grouping of shots and/or units of higher granularity. The speech interface is designed so that untrained personnel can easily create the metadata associated with a video; the constrained-vocabulary interface captures the structured component of the metadata, and the dictation-mode interface captures the content words. The architecture of the CueVideo system consists of the cataloging and retrieval phases. Figure 4.1 shows the architecture of the cataloging phase. The cataloging phase involves two types of processing: on-line and off-line. As can be seen from Figure 4.1, the command-mode interface and the dictation-mode interface of the speech recognition technology are both used to provide the speech annotation in the on-line processing. The raw data is stored on a network file system, and a relational database is used to store the metadata associated with the
digitized video. In the off-line processing, the MPEG-1 video is taken as the input to a shot-boundary detection module, and the outputs are the video storyboard, JPEG images, and shot statistics. The scene cut detection algorithm contains multi-comparison and dynamic thresholding steps. The multi-comparison step compares not only the consecutive frames but also those that are within a fixed delay. In order to reach a reliable boundary detection, the system considers all the pairs simultaneously. In addition, this system uses an adaptive threshold that increases in an action shot and decreases as the shot relaxes. The retrieval phase provides a web-browser-based interface. Users can search, browse, and view the content of a digital library via this interface. Several options are provided to the users, such as displaying the storyboard and playing selected segments of the video. The CueVideo system combines automated video content characterization techniques with human knowledge. Therefore, domain knowledge is required in the cataloging phase, which increases the cost of cataloging. However, it also enriches the retrieval and browsing experience by enabling adaptable/tailored views of the content, facilitating repurposing of the data, and achieving higher performance of the overall system.
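The two ideas just described, comparing each frame against frames within a fixed delay and adapting the threshold to recent activity, can be sketched as below. This is not the CueVideo algorithm itself: frame "signatures" are single numbers standing in for histograms, and all constants are illustrative.

```python
# Hedged sketch of multi-comparison shot-boundary detection with an
# adaptive threshold: the threshold rises when recent frames were active.

def detect_boundaries(signatures, delay=2, base_threshold=10.0, window=4):
    boundaries = []
    for i in range(1, len(signatures)):
        if boundaries and i - boundaries[-1] <= delay:
            continue  # suppress re-detections right after a cut
        # multi-comparison: max difference against the last `delay` frames
        diffs = [abs(signatures[i] - signatures[j])
                 for j in range(max(0, i - delay), i)]
        # adaptive threshold: average activity over a small recent window
        recent = [abs(signatures[k] - signatures[k - 1])
                  for k in range(max(1, i - window), i)]
        activity = sum(recent) / len(recent) if recent else 0.0
        if max(diffs) > base_threshold + activity:
            boundaries.append(i)
    return boundaries

# A quiet shot, a hard cut at frame 5, then another quiet shot.
sigs = [1, 2, 1, 2, 1, 50, 51, 50, 51]
print(detect_boundaries(sigs))  # -> [5]
```

Raising the threshold by recent activity is what keeps a busy action shot, where consecutive frames already differ a lot, from being shredded into spurious cuts.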
2.3
AUGMENTED TRANSITION NETWORK (ATN) MODEL FOR VIDEO BROWSING
In [45], an abstract semantic model called the augmented transition network (ATN), which can model video data and user interactions, is proposed. In their approach, a video hierarchy starts with a video clip. A video clip contains several scenes, each scene contains several shots, and each shot contains some contiguous frames. They use the following three properties to define a video hierarchy in their model: 1. V = {S1, S2, . . ., SN}, where Si denotes the ith scene and N is the number of scenes in this video clip. Let B(Si) and E(Si) be the starting and ending times of scene Si, respectively. The temporal relation B(S1) < E(S1) < B(S2) < E(S2) < . . . is preserved. 2. Si = {Ti1, Ti2, . . ., Tini}, where Tij is the jth shot in scene Si and ni is the number of shots in Si. Let B(Tij) and E(Tij) be the starting and ending times of shot Tij, where B(Ti1) < E(Ti1) < . . . < B(Tini) < E(Tini). 3. Tij = {Fij1, . . ., Fijlj}, where Fij1 and Fijlj are the starting and ending frames in shot Tij and lj is the number of frames for shot Tij. In property 1, V represents a video clip and contains one or more scenes denoted by S1, S2, and so on. Scenes follow a temporal order. For
Figure 4.2. Augmented Transition Network for video browsing: (a) is the ATN network for a video clip, which starts at state V/. (b)-(d) are parts of the subnetworks of (a): (b) models the scenes in video clip V1, (c) models the shots in scene S1, and the key frames for shot T1 are shown in (d).
example, the ending time of S1 is earlier than the starting time of S2. As shown in property 2, each scene contains some shots, such as T11 to T1n1. Shots also follow a temporal order and there is no time overlap among shots, so B(T11) < E(T11) < . . . < B(T1n1) < E(T1n1). A shot contains some key frames to represent the visual contents and changes in each shot. In property 3, Fijk represents key frame k for shot Tij. An ATN can build up the hierarchy property by using its subnetworks. Figure 4.2 is an example of how to use an ATN and its subnetworks to represent a video hierarchy. An ATN and its subnetworks are capable of segmenting a video clip into different granularities while preserving the temporal relations of the different units.
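The three-level hierarchy of properties 1-3 can be sketched as nested records together with a check that scene and shot intervals respect the stated temporal order (B(S1) < E(S1) < B(S2) < . . .). The field names and the sample clip are illustrative, not taken from [45].

```python
# Hedged sketch of the video hierarchy: a clip is a list of scenes,
# each scene holds its shots, and every level obeys the temporal order.

def intervals_ordered(units):
    """Check B < E within each unit and E of one unit < B of the next."""
    for a, b in zip(units, units[1:]):
        if not (a["begin"] < a["end"] < b["begin"] < b["end"]):
            return False
    return len(units) == 0 or units[0]["begin"] < units[0]["end"]

clip = [  # V = {S1, S2}; each scene holds its shots
    {"begin": 0, "end": 10,
     "shots": [{"begin": 0, "end": 4}, {"begin": 5, "end": 10}]},
    {"begin": 11, "end": 20,
     "shots": [{"begin": 11, "end": 20}]},
]

print(intervals_ordered(clip))                           # -> True
print(all(intervals_ordered(s["shots"]) for s in clip))  # -> True
```

The same ordering check applies unchanged at the scene level and at the shot level, which mirrors how an ATN delegates each level to a subnetwork.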
In Figure 4.2(a), the arc label V1 is the starting state name of its subnetwork in Figure 4.2(b). When the input symbol V1 is read, the name of the state at the head of the arc (V/V1) is pushed onto the top of a push-down store. The control is then passed to the state named on the arc, which is the subnetwork in Figure 4.2(b). In Figure 4.2(b), when the input symbol X1 (S1&S2) is read, two frames which represent the two video scenes S1 and S2 are both displayed for selection. In the original video sequence, S1 appears earlier than S2 since it has a smaller number. The “&” symbol in multimedia input strings is used to denote the concurrent display of S1 and S2. ATNs are capable of modeling user interactions where different selections go to different states, so that users have the opportunity to jump directly to the specific video unit that they want to see. The vertical bars “|” in multimedia input strings, and more than one outgoing arc per state in ATNs, are used to model the “or” condition so that user interactions are allowed. Assume S1 is selected; the input symbol S1 is read. Control is passed to the subnetwork in Figure 4.2(c) with starting state name S1/. The “*” symbol indicates the selection is optional for the users since it may not be activated if users want to stop the browsing. In Figure 4.2(c), when the input symbol T1&T2&T3 is read, three frames T1, T2, and T3, which represent three shots of scene S1, are displayed for selection. If the shot T1 is selected, the control is passed to the subnetwork in Figure 4.2(d) based on the arc symbol T1/. As in Figure 4.2(b), the temporal flow is maintained. For the details of the ATN model, please see Chapter 5.
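The browsing walk just described, where reading an input symbol either follows an arc or pushes a return state and transfers control to a subnetwork, can be sketched as a tiny interpreter. The network table below only loosely mirrors Figure 4.2; the state names, arc encoding, and dictionary representation are all assumptions made for this sketch.

```python
# Hedged sketch of ATN subnetwork traversal with a push-down store.
# state -> {input symbol: ("CALL", subnetwork start state, return state)}
NETWORKS = {
    "V/":  {"V1": ("CALL", "V1/", "V/V1")},
    "V1/": {"S1": ("CALL", "S1/", "V1/S1")},
    "S1/": {"T1": ("CALL", "T1/", "S1/T1")},
    "T1/": {},  # key-frame level; no further subnetworks in this sketch
}

def traverse(start, symbols):
    """Consume input symbols, pushing return states when entering subnetworks."""
    state, stack = start, []
    for sym in symbols:
        arc = NETWORKS[state][sym]
        if isinstance(arc, tuple) and arc[0] == "CALL":
            _, sub_start, return_state = arc
            stack.append(return_state)  # push-down store remembers where to resume
            state = sub_start
        else:
            state = arc  # ordinary transition to the next state
    return state, stack

state, stack = traverse("V/", ["V1", "S1", "T1"])
print(state)  # -> 'T1/'
print(stack)  # -> ['V/V1', 'V1/S1', 'S1/T1']
```

The stack is exactly the push-down store of the text: after browsing the key frames of T1, popping it would resume the shot-level network at S1/T1, then the scene level, and so on back up the hierarchy.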
3.
KEY FRAME SELECTIONS
The retrieval of video data is realized through both visual and textual attributes so that users can browse the retrieved data to select the relevant video content that they want to review and use. To make relevance judgments about the retrieved video content, both the textual attributes (title, keywords, subject, etc.) and visual surrogates of the content are necessary. User queries for both images and videos are formulated in many different ways based on the user's background, training, visual perceptions, and needs. In addition, visual information can be interpreted in various ways, which makes retrieval and browsing by textual descriptions more difficult. For this reason, it is very important that retrieval and browsing by visual information are both provided. However, finding the right methods and types of visual representations for the retrieved video content and providing control mechanisms that allow users to view and manipulate visual information are two major challenges [109].
Videos include verbal and visual information that is spatially, graphically, and temporally spread out. This makes indexing video data more complex than textual data. Typically, indexing covers only the topical or content dependent characteristics. The extra-topical or content independent characteristics of visual information are not indexed. These characteristics include color, texture, or objects represented in a picture that topical indexing would not include, but users may rely on when making relevance judgments [109]. Hence, it is very important to provide the users such visual cues in browsing. For this purpose, key frames extracted from the videos are one of the methods to provide visual surrogates of video data. Video documents are segmented into stories, scenes, shots, and frames. One way to create visual representations is to use automatic key frame extraction. One approach is to display a key frame of each shot or scene to the user in order to provide the information about the objects and possible events present in that shot or scene [182]. Another approach uses a similarity pyramid to give a hierarchical clustering of all key frames present in the video database [46]. In [182], a directed graph is used to portray an overall visual summary of a video clip consisting of different scenes that may appear more than once in the video sequence. In a directed graph, the nodes represent the key frames and the edges denote the temporal relationships between them, giving an overview of the order in which these events have occurred. In [46], a similarity pyramid is organized based on the extracted features and user’s feedback. This scheme can be scaled up to manage large numbers of video sequences and can provide browsing environments for developing digital video libraries. 
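The directed-graph summary described for [182], with key frames as nodes and temporal transitions as edges, can be sketched as follows. The scene labels and the news-style sequence are invented for the example; the point is only that a recurring scene maps to a single shared node.

```python
# Hedged sketch of a directed-graph video summary: each scene label
# becomes one node, and an edge records that one scene followed another,
# so a scene that appears more than once shares a single node.

def build_scene_graph(scene_sequence):
    nodes = set(scene_sequence)
    edges = set(zip(scene_sequence, scene_sequence[1:]))
    return nodes, edges

# anchor -> report -> anchor -> interview: the anchor scene recurs
nodes, edges = build_scene_graph(["anchor", "report", "anchor", "interview"])
print(sorted(nodes))  # -> ['anchor', 'interview', 'report']
print(sorted(edges))  # -> [('anchor', 'interview'), ('anchor', 'report'), ('report', 'anchor')]
```

Because repeated scenes collapse onto one node, the graph is a compact overview of the clip's structure rather than a frame-by-frame timeline, which is what makes it useful as a browsing surrogate.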
Video browsing models proposed in [13, 63, 65, 130, 140, 160, 181] choose representative images using regular time intervals, one image in each shot, all frames with a focus key frame at a specific place, and so on. Choosing key frames based on regular time intervals may miss some important segments, and some segments may have multiple key frames with similar contents. One image in each shot also may not capture the temporal and spatial relations of semantic objects. Showing all key frames may confuse users when too many key frames are displayed at the same time. The easiest way of key frame selection is to choose the first frame of each shot. However, this method may miss some important temporal and spatial changes within each shot. The second way is to include all video frames as key frames, which causes computational and storage problems and may increase users' perception burden. The third way is to choose key frames based on fixed durations. This method is
still not a good mechanism since it may produce many key frames with similar contents. To achieve a balance, Chen et al. [45] proposed a key frame selection mechanism based on the number, temporal, and spatial changes of the semantic objects in the video frames. In their approach, two conditions are checked. The first condition checks whether the number of semantic objects in two contiguous video frames of the same shot changes. The second condition checks whether the temporal and spatial relationships of the semantic objects in two contiguous frames of the same shot change. If either condition is satisfied, the latter frame is selected as a key frame. Under their key frame selection mechanism, the spatio-temporal changes in each shot can be represented by the selected key frames.
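The two conditions above can be sketched as a small filter over a shot. This is only a hedged illustration of the idea, not the mechanism of [45]: frames are dicts mapping object names to (x, y) centroids, only a coarse left-of/above spatial relation is checked, and the car/bus example is invented.

```python
# Hedged sketch: a frame becomes a key frame when the set of semantic
# objects changes, or when their pairwise spatial arrangement changes,
# between two contiguous frames of the same shot.

def relation(a, b):
    """Coarse spatial relation of object a w.r.t. object b."""
    return (a[0] < b[0], a[1] < b[1])  # (is-left-of, is-above)

def relations(frame):
    """Pairwise relations between all objects in a frame."""
    objs = sorted(frame)
    return {(p, q): relation(frame[p], frame[q])
            for p in objs for q in objs if p < q}

def select_key_frames(shot):
    keys = [0]  # keep the first frame as the initial key frame
    for i in range(1, len(shot)):
        prev, cur = shot[i - 1], shot[i]
        if set(prev) != set(cur) or relations(prev) != relations(cur):
            keys.append(i)
    return keys

shot = [
    {"car": (1, 5), "bus": (8, 5)},  # frame 0
    {"car": (2, 5), "bus": (8, 5)},  # frame 1: same objects, same relations
    {"car": (9, 5), "bus": (8, 5)},  # frame 2: car passes the bus
    {"car": (9, 5)},                 # frame 3: bus leaves the frame
]
print(select_key_frames(shot))  # -> [0, 2, 3]
```

Frame 1 is skipped because the car moved without changing any relation, which is exactly how this scheme avoids the near-duplicate key frames that fixed-duration sampling produces.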
Chapter 5
CASE STUDY 1 – AUGMENTED TRANSITION NETWORK (ATN) MODEL
1.
INTRODUCTION
The augmented transition network (ATN) model was developed by Woods in 1970 [178] and has been used in natural language understanding systems and question answering systems for text and speech. Chen and Kashyap [43, 44] proposed a semantic model based on the ATNs to model multimedia presentations, multimedia database searching, and multimedia browsing. The temporal, spatial, or spatio-temporal relations of various media streams and semantic objects can be captured by the proposed ATN model. The arcs in an ATN represent the time flow from one state to another. An ATN can be represented diagrammatically by a labeled directed graph, called a transition graph. The ATN grammar consists of a finite set of nodes (states) connected by labeled directed arcs. An arc represents an allowable transition from the state at its tail to the state at its head, and the labeled arc represents the transition function. An input string is accepted by the grammar if there is a path of transitions which corresponds to the sequence of symbols in the string and which leads from a specified initial state to one of a set of specified final states. States are represented by circles with the state name inside. The state name is used to indicate the presentation being displayed (to the left of the slash) and which media streams have just been displayed. The state name in each state tells all the events that have been accomplished so far. Based on the state name, the users can know how much of the presentation has been displayed. When the control passes to a state, it means all the events before this state are finished. A state node is a breaking point for two different events. For example, in a presentation
sequence, if any media stream is changed then a new state node is created to distinguish these two events. In ATNs, when any media stream begins or ends, a new state is created and an arc connects this new state to the previous state. Therefore, a state node is useful to separate different media stream combinations into different time intervals. The arc symbol in the outgoing arc for each state will be analyzed immediately so that the process can continue and does not need to know the past history of the presentation. Two state nodes are connected by an arc. The arc labels indicate which media streams or semantic objects are involved. Each arc represents a time interval. For example, if an arc label contains media streams then it means these media streams will be displayed at this time interval. To design a multimedia presentation from scratch is a difficult process in today's authoring environment. A subnetwork in an ATN can represent another existing presentation and allow designers to reuse an existing presentation sequence from the archives. Any change in one of the subnetworks will automatically change the presentation which includes these subnetworks. This feature makes the ATN a powerful model for creating a new presentation, similar to the class in the object-oriented paradigm. Also, subnetworks can model keywords in a text media stream so that database queries relative to the keywords in the text can be answered. The formal definitions of an ATN and a subnetwork are given here.
Definition 5.1: An augmented transition network is a 6-tuple (S, I, T, S0, F, Ψ) where
1. S = {S0, S1, . . . , Sn-1} is a finite set of states of the control. Each state represents that all the events before it have been accomplished, so the process does not need to know the past history to continue the presentation.
2. I is a set from which input symbols are chosen. The input string consists of one or more mi and cmi separated by "+", where mi and cmi denote a media stream and the compressed version of that media stream, respectively.
3. T is a condition and action table permitting a sequence of actions and conditions to be specified on each arc. A presentation can be divided into several time durations based on different media stream combinations. Each combination occurrence of media streams is represented by an arc symbol. The media streams in an arc symbol are displayed concurrently during this time duration. Conditions and actions
control the synchronization and quality of service (QoS) of a presentation. Therefore, real-time situations such as network congestion, memory limitation, and user interaction delay can be handled.
4. S0 is the initial state, S0 ∈ S.
5. F is the set of final states.
6. Ψ is the subnetwork of the ATN when the input symbol contains image or video streams.
Definition 5.2: An augmented transition subnetwork is a 3-tuple (s, O, t) where
1. s ∈ S.
2. O is a set from which input symbols are chosen. The input string consists of one or more oi separated by "+", where each oi is a semantic object.
3. t ∈ T.
1.1
FSM
An ATN differs from a finite state automaton in that it permits recursion, so an ATN is a recursive transition network. A finite state machine (FSM) consists of a network of nodes and directed arcs connecting them. The FSM is a simple transition network. Every language that can be described by an FSM can be described by a regular grammar, and vice versa. The nodes correspond to states and the arcs represent the transitions from state to state. Each arc is labeled with a symbol whose input can cause a transition from the state at the tail of the arc to the state at its head. This enables an FSM to model a presentation from the initial state to some final states, or to let users watch the presentation fast forward or in reverse. However, users may want to watch part of a presentation by specifying some features relative to image or video contents prior to a multimedia presentation, and a designer may want to include other presentations in a presentation. These two features require a pushdown mechanism that permits one to suspend the current process and go to another state to analyze a query that involves temporal, spatial, or spatio-temporal relationships. Since an FSM does not have a mechanism to build up a hierarchical structure, it cannot satisfy these two features.
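The acceptance behavior of such a transition network can be sketched in a few lines: an input string is accepted if a path of transitions from the initial state leads to a final state. The states and arc labels below are hypothetical illustrations in the book's naming style, not examples taken from the text.

```python
# A minimal sketch of string acceptance over a labeled transition graph.

def accepts(arcs, initial, finals, symbols):
    """arcs maps (state, arc_label) -> next state; accept if the symbol
    sequence leads from `initial` to a state in `finals`."""
    state = initial
    for sym in symbols:
        if (state, sym) not in arcs:
            return False          # no allowable transition for this symbol
        state = arcs[(state, sym)]
    return state in finals

# Hypothetical presentation P/: play V1, then A1 and T1 together.
arcs = {("P/", "(V1)"): "P/V1", ("P/V1", "(A1&T1)"): "P/V1A1T1"}
accepts(arcs, "P/", {"P/V1A1T1"}, ["(V1)", "(A1&T1)"])   # True
```

The state names encode which media streams have been displayed so far, mirroring the naming convention described in the introduction.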
This weakness can be eliminated by adding a recursive control mechanism to the FSM to form a recursive transition network (RTN). A recursive transition network is similar to an FSM with the following modifications: all states are given names which are then allowed as part of labels on arcs in addition to the normal input symbols. Based on these labels, subnetworks may be created. Each nonterminal symbol consists of a subnetwork which can be used to model the temporal and spatial information of semantic objects for images and video frames and keywords for texts. Three situations can generate subnetworks. In the first situation, when an input symbol contains an image or a video frame, a subnetwork is generated. A new state is created for the subnetwork if there is any change in the number of semantic objects or any change in their relative positions. Therefore, the temporal, spatial, or spatio-temporal relations of the semantic objects are modeled in this subnetwork. In other words, users can choose the scenarios relative to the temporal, spatial, or spatio-temporal relations of the video or image contents that they want to watch via queries. Second, if an input symbol contains a text media stream, the keywords in the text media stream become the input symbols of a subnetwork. A keyword can be a word or a sentence. A new state of the subnetwork is created for each keyword. Keywords are the labels on the arcs. The input symbols of the subnetwork have the same order as the keywords appear in the text. Users can specify criteria based on a keyword or a combination of keywords in the queries. In addition, the information in other databases can be accessed by keywords via the text subnetworks. For example, if a text subnetwork contains the keyword "Purdue University Library," then the Purdue University library database is linked via a query with this keyword. In this design, an ATN can connect multiple existing database systems by passing the control to them.
After exiting the linked database system, the control returns to the ATN. Third, if an ATN wants to include another existing presentation (ATN) as a subnetwork, the initial state name of the existing presentation (ATN) is put as the arc label of the ATN. This allows any existing presentation to be embedded in the current ATN to make a new design easier. The advantage is that the other presentation structure is independent of the current presentation structure. This gives both the designer and the users a clear view of the presentation. Any change in the shared presentation is done in the shared presentation itself. There is no need to modify those presentations which use it as a subnetwork. Before the control is passed to the subnetwork, the state name at the head of the arc is pushed into the push-down store. The analysis then
goes to the subnetwork whose initial state name is part of the arc label. When a final state of the subnetwork is reached, a pop occurs and the control goes back to the state removed from the top of the push-down store. However, the FSM with recursion cannot describe cross-serial dependencies. For example, network delays may cause some media streams not to be displayed to users at the tentative start time, and the preparation time for users to make decisions is unknown when user interactions are provided. In both situations, there is a period of delay which should be propagated to the later presentations. Also, users may specify queries related to semantic objects across several subnetworks. The information in each subnetwork should be kept so that the analysis across multiple subnetworks can be done. For example, the temporal, spatial, or spatio-temporal relations among semantic objects may involve several video subnetworks. The cross-serial dependencies can be obtained by specifying conditions and actions on each arc. In order to let a recursive transition network have the ability to control the synchronization and quality of service (QoS), conditions and actions on each arc can handle real-time situations such as network congestion, memory limitation, user interaction delay, etc. A recursive transition network with conditions and actions on each arc forms an augmented recursive transition network (ATN). The arrangement of states and arcs represents the surface structure of a multimedia presentation sequence. If a user wants to specify a presentation which may be quite different from the surface structure, then the actions permit rearrangements and embeddings, and control the synchronization and quality of service of the original presentation sequence. The cross-serial dependencies are achieved by using variables, which can be used in later actions or subsequent input symbols to refer to their values.
The actions determine additions, subtractions, and changes to the values of variables in terms of the current input symbol and conditions. Conditions provide more sensitive controls on the transitions in ATNs. A condition is a combination of checks involving the feature elements of media streams such as the start time, end time, etc. An action cannot be taken if its condition turns out to be false. Thus, more elaborate restrictions can be imposed on the current input symbol for synchronization and quality of service controls. Also, information can be passed along in an ATN to determine future transitions. The recursive FSM with these additions forms an augmented recursive transition network (ATN).
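The pushdown behavior described above can be sketched compactly: entering a subnetwork suspends the current network, and reaching a subnetwork final state resumes it. In this sketch the recursion stack plays the role of the push-down store; the network names, states, and arc labels are hypothetical illustrations, not the book's examples.

```python
# A minimal sketch of a recursive transition network with subnetwork calls.

def traverse(networks, name, inputs):
    """Return True if `inputs` drives network `name` from its initial
    state to a final state. A nested (subnet, sub_inputs) element is a
    subnetwork call; the Python call stack acts as the push-down store."""
    net = networks[name]
    state = net["initial"]
    for item in inputs:
        if isinstance(item, tuple):                 # descend into a subnetwork
            subnet, sub_inputs = item
            if not traverse(networks, subnet, sub_inputs):
                return False
            item = subnet                           # the arc is labeled with its name
        state = net["arcs"].get((state, item))
        if state is None:
            return False
    return state in net["finals"]

# Main presentation P embeds video subnetwork V1, then shows text T1.
networks = {
    "P": {"initial": "P/", "finals": {"P/V1T1"},
          "arcs": {("P/", "V1"): "P/V1", ("P/V1", "(T1)"): "P/V1T1"}},
    "V1": {"initial": "V1/", "finals": {"V1/X1"},
           "arcs": {("V1/", "(S1&B10)"): "V1/X1"}},
}
traverse(networks, "P", [("V1", ["(S1&B10)"]), "(T1)"])   # True
```

Because the subnetwork is a separate entry in the table, it can be shared by any number of presentations, matching the reuse property discussed earlier.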
1.2
MULTIMEDIA INPUT STRINGS AS INPUTS FOR ATNS
When an ATN is used for language understanding, the input for the ATN is a sentence which consists of a sequence of words in linear order. In a multimedia presentation, when user interactions such as user selections and loops are allowed, sentences cannot be used as the inputs for an ATN. For this purpose, a multimedia input string that consists of one or more media streams is used as an input for an ATN. Each arc in an ATN is labeled with a string containing one or more media streams displayed at the same time. A media stream is represented by a letter subscripted by some digits. The single letter represents the media stream type and the digits are used to denote various media streams of the same media stream type. For example, T1 means a text media stream with identification number one. Multimedia input strings also have the power to model "or" conditions and iterative conditions. Since the heart of an ATN is a finite state automaton, any multimedia input string can be represented by an ATN. Multimedia input strings adopt the notations from regular expressions [107]. Regular expressions are useful descriptors of patterns such as tokens used in a programming language. Regular expressions provide convenient ways of specifying a certain set of strings. In ATNs, multimedia input strings are used to represent the presentation sequences of the temporal media streams, spatio-temporal relations of semantic objects, and keyword compositions. Information can be obtained with low time complexity by analyzing these strings. A multimedia input string goes from left to right, which can represent the time sequence of a multimedia presentation, but it cannot represent the concurrent appearance and spatial location of media streams and semantic objects. In order to give multimedia input strings these two abilities, several modifications are needed. Two levels need to be represented by multimedia input strings.
At the coarse-grained level, the main presentation which involves media streams is modeled. At the fine-grained level, the semantic objects in image or video frames and the keywords in a text media stream are modeled in subnetworks. Each keyword in a text media stream is an arc label in a subnetwork. New states and arcs are created to model each keyword. The details of modeling the coarse-grained level are discussed as follows. Two notations Σ and Δ are used to define multimedia input strings and are defined as follows:
Σ = {A, I, T, V} is the set whose members represent the media type, where A, I, T, V denote audio, image, text, and video, respectively.
Δ = {0, 1, ..., 9} is the set consisting of the ten decimal digits.
Definition 5.3: Each input symbol of a multimedia input string contains one or more media streams enclosed by parentheses and displayed at the same time interval. A media stream is a string which begins with a letter in Σ subscripted by a string of digits in Δ. For example, V1 represents a video media stream whose identification number is one. The following situations can be modeled by a multimedia input string.
Concurrent: The symbol "&" between two media streams indicates these two media streams are displayed concurrently. For example, (T1&V1) represents T1 and V1 being displayed concurrently.
Looping: m+ is the multimedia input string of the positive closure of m, denoting m occurring one or more times. The "+" symbol is used to model loops in a multimedia presentation, letting some part of the presentation be displayed more than once.
Optional: In a multimedia presentation, when the network becomes congested, the originally specified media streams stored on a remote server might not arrive on time. The designer can use the "*" symbol to indicate the media streams which can be dropped in the on-line presentation. For example, (T1&V1*) means T1 and V1 will be displayed but V1 can be dropped if some criteria cannot be met.
Contiguous: Input symbols which are concatenated together are used to represent a multimedia presentation sequence and to form a multimedia input string. Input symbols are displayed from left to right across time sequentially. ab is the multimedia input string of a concatenated with b such that b will be displayed after a is displayed. For example, (A1&T1)(A2&T2) consists of two input symbols (A1&T1) and (A2&T2). These two input symbols are concatenated together to show that the first input symbol (A1&T1) is displayed before the second input symbol (A2&T2).
Alternative: A multimedia input string can model user selections by separating input symbols with the "|" symbol. So, (a|b) is the multimedia input string of a or b. For example, ((A1&T1)|(A2&T2)) denotes either the input symbol (A1&T1) or the input symbol (A2&T2) to be displayed.
Ending: The symbol "$" denotes the end of the presentation.
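Because the notation is regular, a coarse-grained multimedia input string can be tokenized with a small regular expression. The sketch below handles the concurrent "&", optional "*", looping "+", and ending "$" notations; the alternative "|" notation is omitted for brevity, and the parser itself is an illustration, not the book's implementation.

```python
# A minimal tokenizer for multimedia input strings such as (V1)(V1&T1*)+(A1&T1)$
import re

# One input symbol: "(" media streams joined by "&" ")" with an optional "+"
SYMBOL = re.compile(r"\(([A-Z]\d+\*?(?:&[A-Z]\d+\*?)*)\)(\+?)")

def parse(mis):
    """Return a list of (streams, loops) pairs, where streams is a list of
    media stream names such as 'V1' or 'T1*' and loops marks a '+' symbol."""
    return [(streams.split("&"), plus == "+")
            for streams, plus in SYMBOL.findall(mis)]

parse("(V1)(V1&T1*)+(A1&T1)$")
# -> [(['V1'], False), (['V1', 'T1*'], True), (['A1', 'T1'], False)]
```

The "$" ending symbol and any text outside parentheses fall outside the pattern and are simply ignored by this sketch.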
2.
SPATIAL AND TEMPORAL RELATIONS OF SEMANTIC OBJECTS
Here, a spatial object is called a "semantic object" and the minimal bounding rectangle (MBR) concept in an R-tree is adopted so that each semantic object is covered by a rectangle. As mentioned in Chapter 2, there are three types of topological relations between the MBRs. In ATNs, only the first alternative, the nonoverlapping rectangles, is considered.
Definition 5.4: Let O be a set of n semantic objects, O = (o1, o2, ..., on). Associated with each oi, ∀i (1 ≤ i ≤ n), is an MBRi which is a minimal bounding rectangle containing the semantic object. In a 3D space, an entry MBRi is a rectangle between points (xlow, ylow, zlow) and (xhigh, yhigh, zhigh). The centroid is used as a reference point for spatial reasoning.
As mentioned in the previous section, multimedia input strings are used to represent the temporal relations among media streams and semantic objects. In this section, the use of a multimedia input string to represent the spatial relations of semantic objects is described. The following definition shows the notation for the relative positions in multimedia input strings.
Definition 5.5: Each input symbol of a multimedia input string contains one or more semantic objects which are enclosed by parentheses and appear in the same image or video frame. Each semantic object has a unique name which consists of some letters. The relative positions of the semantic objects relative to the target semantic object are represented by numerical subscripts. A superscripted string of digits is used to represent different subcomponents of relation objects if partial or complete overlapping of MBRs occurs. The "&" symbol between two semantic objects is used to denote that the two semantic objects appear in the same image or video frame. This representation is similar to the temporal multimedia input string. One semantic object is chosen to be the target semantic object in each image or video frame.
In order to distinguish the relative positions, three-dimensional spatial relations are developed (as shown in Table 5.1). In this table, twenty-seven numbers are used to distinguish the relative positions of each semantic object relative to the target semantic object. Value 1 is reserved for the target semantic object with coordinates (xt, yt, zt). Let (xs, ys, zs) represent the coordinates of any semantic object. The relative position of a semantic object with respect to the target semantic object is determined by the X-, Y-, and Z-coordinate relations.
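The coordinate comparisons behind these twenty-seven positions can be sketched as follows: each axis compares the object's centroid to the target's within a threshold, yielding 3 × 3 × 3 = 27 sign patterns. The mapping from a sign pattern to the book's position numbers (e.g. 1 for the target, 10 for "left") is given by Table 5.1 and is not reproduced here; the threshold value below is a hypothetical parameter.

```python
# A sketch of the per-axis comparison behind Table 5.1's relative positions.

def axis_relation(s, t, eps=5.0):
    """-1 if s < t, 0 if approximately equal (within eps), +1 if s > t."""
    if abs(s - t) <= eps:
        return 0
    return -1 if s < t else 1

def relative_cell(obj_centroid, target_centroid, eps=5.0):
    """Sign pattern of a semantic object's centroid vs the target's."""
    return tuple(axis_relation(o, t, eps)
                 for o, t in zip(obj_centroid, target_centroid))

# An object strictly left of the target, with Y and Z about the same
# (the sign pattern that Table 5.1 numbers as relative position 10):
relative_cell((10.0, 50.0, 20.0), (40.0, 52.0, 21.0))   # (-1, 0, 0)
```

Using centroids rather than full MBRs is exactly the point-object reduction described in the text below.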
Table 5.1. Three dimensional relative positions for semantic objects: The first and the third columns indicate the relative position numbers while the second and the fourth columns are the relative coordinates. (xt, yt, zt) and (xs, ys, zs) represent the X-, Y-, and Z-coordinates of the target and any semantic object, respectively. The “≈” symbol means the difference between two coordinates is within a threshold value.
For example, relative position number 10 means a semantic object's X-coordinate (xs) is less than the X-coordinate (xt) of the target semantic object, while the Y- and Z-coordinates are approximately the same. In other words, the semantic object is on the left of the target semantic object. More or fewer numbers may be used to divide an image or a video frame into subregions to allow more fuzzy or more precise queries as necessary. The centroid point of each semantic object is used for spatial reasoning so that any semantic object is mapped to a point object. Therefore, the relative position between the target semantic object and a semantic object can be derived based on these centroid points. A multimedia input string then can be formed after the relative positions are obtained. Del Bimbo et al. [24] proposed
a region-based formulation with a rectangular partitioning. Therefore, each object stands over one or more regions. Table 5.1 follows the same principle but directly captures the relative positions among objects. Relative positions are explicitly indicated by numbers to capture the spatial relations and moving history. Different input symbols in multimedia input strings represent different time durations in a video sequence. These input symbols in multimedia input strings are the arc labels for the subnetworks in ATNs. In ATNs, when an arc contains one or more images, video segments, or texts, then one subnetwork with the media stream as the starting state name is created. A new arc and a new state node in a subnetwork, and a new input symbol in a multimedia input string, are created when any relative position of a semantic object changes or the number of semantic objects changes. The subnetwork design is similar to the VSDG model [58] which uses transitions to represent changes in the number of semantic objects. However, in the VSDG model, a new transition is created when the number of semantic objects changes and a motion vector is kept with each node. In this design, in addition to changes in the number of semantic objects, any relative position change among semantic objects is considered, and a state node and an arc in the subnetwork and an input symbol in the multimedia input string are created for this situation. Based on this design, the temporal relations and the relative positions of semantic objects can be obtained, and the moving histories of the semantic objects in the video sequence can be kept. Therefore, substring matching processes using multimedia input strings in database queries can be conducted.
3.
MULTIMEDIA PRESENTATIONS
A multimedia presentation consists of a sequence of media streams displayed together or separately across time. The arcs in an ATN represent the time flow from one state to another. An ATN differs from a finite state automaton in that it permits recursion. Each nonterminal symbol consists of a subnetwork which can be used to model the temporal and spatial information of semantic objects for images and video frames. Originally, an ATN was used for the analysis of natural language sentences, so its input is a sentence composed of words. However, this input format is not suitable for representing a multimedia presentation since several media streams may need to be displayed at the same time, be overlapped, be seen repeatedly, etc. Therefore, multimedia input strings are used as the inputs for ATNs. Each media stream contains a feature set which has all the control information related to the media stream. The definition and the meaning of each element are as follows:
Definition 5.6: Suppose there are n media streams appearing in the input symbols. Each media stream mi has a feature set Fi associated with it, where i = 1 . . . n:
Fi = {tentative_starting_time, tentative_ending_time, starting_frame, ending_frame, window_position_X, window_position_Y, window_size_width, window_size_height, priority}
The meaning of each element is illustrated below:
tentative_starting_time: the desired starting time of the original media stream.
tentative_ending_time: the desired ending time of the original media stream.
starting_frame: the starting video frame number.
ending_frame: the ending video frame number.
window_position_X: the horizontal distance from the upper left corner of the computer screen.
window_position_Y: the vertical distance from the upper left corner of the computer screen.
window_size_width: the window width of the media stream.
window_size_height: the window height of the media stream.
priority: the display priority if several media streams are to be displayed concurrently.
In addition to the recursive transition network, a table consisting of actions and conditions specified on each arc forms an augmented transition network. The advantage of this table is that only the table needs memory space; hence, the multimedia transition network is just a visualization of the data structure which can be embedded in the programming implementation. Conditions and actions on the arcs in ATNs maintain the synchronization and quality of service (QoS) of a multimedia presentation by permitting a sequence of conditions and actions to be specified on each arc. The conditions specify various situations in the multimedia presentation. A condition is a Boolean combination of predicates involving the current input symbol, variable contents, and the QoS. A new input symbol cannot be taken unless the condition evaluates to true (T). More elaborate restrictions can be imposed on the conditions if needed.
For example, if the communication bandwidth is
not enough to transmit all the media streams on time for the presentation, then the action is to get the compressed version of the media streams instead of the raw data. In this way, synchronization can be maintained because all the media streams can arrive on time. In addition, QoS can be specified in the conditions to maintain synchronization. The actions provide a facility for explicitly building the connections among the whole ATN. The variables are the same as symbolic variables in programming languages. They can be used in later actions, perhaps on subsequent arcs. The actions can add or change the contents of the variables, go to the next state, replace the raw media streams with compressed ones, etc. Moreover, information can be passed along in an ATN to determine future transitions. In an interactive multimedia presentation, users may want to see different presentation sequences from the originally specified sequence. Under this design, when a user issues a database query, the specification in the query tries to match the conditions on the arcs. If a condition is matched then the corresponding action is invoked. Different actions can generate different presentation sequences which differ from the original sequence. Table 5.2 shows a simple example of how to use conditions and actions to control the synchronization and quality of service (QoS). The first column contains the input symbols. The second and the third columns show the conditions and the actions, respectively. When the current input symbol X1 (V1&T1) is read, the condition on the bandwidth is first checked to see whether the bandwidth is large enough to transmit media streams V1 and T1. If it is not, then the compressed version of V1 will be transmitted. Then the second condition checks whether the pre-specified duration to display V1 and T1 has been reached. If it has not, the display continues.
The start time is defined to be the time when the display of V1 and T1 starts, and the difference between the current time and the start time is the total display time so far. The third condition is met when the total display time equals the pre-specified duration. In that case, the next input symbol X2 (V1&T1&I1&A1) is read and these four conditions will be checked again. The process continues until the final state (state 5) is reached.
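The condition/action logic of this walkthrough can be sketched as a small per-arc evaluation loop. Table 5.2's exact entries are not reproduced here, so the bandwidth flag, timing values, and action strings below are hypothetical illustrations of the mechanism, not the table's contents.

```python
# A minimal sketch of evaluating one arc's condition/action pairs,
# in the spirit of Table 5.2's walkthrough above.

def step(symbol, bandwidth_ok, elapsed, duration):
    """Apply the arc's conditions in order and return the chosen actions."""
    actions = []
    if not bandwidth_ok:                    # condition 1: bandwidth too low
        actions.append(f"get compressed version of {symbol[0]}")
    if elapsed < duration:                  # condition 2: duration not reached
        actions.append("continue display")
    else:                                   # condition 3: duration reached
        actions.append("read next input symbol; advance to next state")
    return actions

step(("V1", "T1"), bandwidth_ok=False, elapsed=30, duration=30)
# -> ['get compressed version of V1', 'read next input symbol; advance to next state']
```

In the full model the actions would also update variables, which is how the cross-serial dependencies described in Section 1.1 propagate delays into later arcs.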
4.
MULTIMEDIA DATABASE SEARCHING
In a multimedia information environment, users may want to watch part of a presentation by specifying some features relative to image or video content prior to a multimedia presentation, and a designer may want to include other presentations in a presentation. In order to meet these two requirements, ATNs use a pushdown mechanism that permits one to suspend the current process and go to another state in the
Table 5.2. Condition and action table: Get procedure is to access an individual media stream. Get_Symbol is a procedure to read the next input symbol of multimedia input string. Next_State is a procedure to advance to the next state in ATN. Display procedure is to display the media streams. , τ, and ρ are the parameters.
subnetwork to analyze a query that involves temporal, spatial, or spatiotemporal relationships. Subnetworks are separated from the main ATN. Before control is passed to the subnetwork, the state name at the head of the arc is pushed into the push-down store (stack). The analysis then goes to the subnetwork whose initial state name is part of the arc label.
When a final state of the subnetwork is reached, the control goes back to the state removed from the top of the push-down store. The three situations that generate subnetworks were discussed in the first section of this chapter. In ATNs, each image or video frame has a subnetwork which has its own multimedia input string. Subnetworks and their multimedia input strings are created by the designer in advance for a class of applications. Users can issue multimedia database queries using high-level database query languages such as SQL. Each high-level query is then translated into a multimedia input string so that it can be matched with the multimedia input strings of the subnetworks. Therefore, database queries become substring matching processes. A multimedia input string is a left-to-right model which can capture the temporal relations of semantic objects. The semantic objects in the left input symbol appear earlier than those in the right input symbol in a video sequence. The spatial locations of semantic objects also need to be defined so that queries relative to spatial locations can be answered. Hence, the temporal and spatial relations of the semantic objects of a video stream in each input symbol can be modeled by a multimedia input string. User queries can be answered by analyzing the multimedia input string (for example, the movement, the relative spatial location, the appearing sequence, etc. of semantic objects). The spatial location of each semantic object needs to be represented in a symbolic form in order to use multimedia input strings to represent it.
4.1
MULTIMEDIA DATABASE SEARCHING EXAMPLES
Figures 5.1 through 5.3 are three video frames, with frame numbers 1, 52, and 70, which contain four semantic objects: salesman, box, file holder, and telephone. They are represented by the symbols S, B, F, and T, respectively. Each semantic object is surrounded by a minimal bounding rectangle. Let salesman be the target semantic object. In Figure 5.1, the relative position numbers of the other three semantic objects with respect to the target semantic object are 10, 15, and 24, respectively. The semantic object box moves from the left to the front of the target semantic object salesman in Figure 5.2, and moves back to the left in Figure 5.3. The following multimedia input string can be used to represent these three figures:
Figure 5.1. Video frame 1. There are four semantic objects: salesman, box, file holder, and telephone; salesman is the target semantic object. The relative position numbers (as defined in Table 5.1) of the other three semantic objects are 10, 15, and 24, respectively.
Figure 5.2. Video frame 52. Semantic object box moves from the left to the front of salesman (from viewer’s point of view).
Figure 5.3. Video frame 70. Semantic object box moves from the front to the left of salesman (from viewer’s point of view).
X1 (S1&B10&F15&T24) X2 (S1&B3&F15&T24) X3 (S1&B10&F15&T24) (5.1)
S1 in symbol X1 means salesman is the target semantic object. B10 represents that the semantic object box is on the left of salesman, F15
Figure 5.4. The corresponding subnetwork for the multimedia input string in Equation 5.1.
means semantic object file holder is below and to the left of salesman, and so on. B3 in symbol X2 means the relative position of box changes from left to front. Semantic objects file holder and telephone do not change their positions so they have the same relative position numbers in X1, X2, and X3. As can be seen from this example, the multimedia input string can represent not only the relative positions of the semantic objects but also the motion of the semantic objects. For example, the above multimedia input string shows the semantic object box moves from left to front relative to the target semantic object salesman. Figure 5.4 is a subnetwork for multimedia input string shown in Equation 5.1. Therefore, the starting state name for this subnetwork is V1/. As shown in Figure 5.4, there are three arcs with arc labels the same as the three input symbols in Equation 5.1.
4.1.1
A SPATIAL DATABASE QUERY EXAMPLE
In a multimedia information system, the designers can design a general purpose presentation for a class of applications so that it allows users to choose what they prefer to see using database queries. Assume there are several video media streams and Vs is one of them. The multimedia input string for the subnetwork which models Vs is the same as Equation 5.1.
• Query 1: Display the multimedia presentation beginning with a salesman with a box on his right.
In this query, only the spatial locations of two semantic objects, salesman and box, are checked. A user can issue this query using a high-level language. This query is then translated into the following multimedia input string: (S1&B10). (5.2) The subnetworks of an ATN are traversed and the corresponding multimedia input strings are analyzed. The multimedia input string in Equation 5.1 models the subnetwork for Vs. The input symbol X1 in Vs contains the semantic objects salesman (S) and box (B). Let the salesman be the target semantic object. The relative position of the box is to the left of the salesman from a viewer's perspective. By matching Equation 5.2 with Equation 5.1, it can be seen that Vs is the starting video
Case Study 1 – Augmented Transition Network (ATN) Model
clip of the query. When the control is passed back from the subnetwork, then the rest of the multimedia presentation begins to display.
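This matching step can be sketched in code. The helper below is hypothetical (the book does not give an implementation); it treats an input symbol such as S1&B10 as a set of object codes and considers Query 1 satisfied when every code required by the query appears in a stream's first input symbol:

```python
def parse_symbol(symbol):
    """Split an input symbol such as 'S1&B10' into its object codes."""
    return set(symbol.split("&"))

def matches_spatial_query(query_symbol, stream_symbols):
    """Succeeds if every object code required by the query appears in
    the stream's first input symbol (Equation 5.2 against X1 of Vs)."""
    if not stream_symbols:
        return False
    return parse_symbol(query_symbol) <= parse_symbol(stream_symbols[0])

# Vs (Equation 5.1): box left of salesman (B10), then front (B3), then left again.
vs_symbols = ["S1&B10", "S1&B3", "S1&B10"]
print(matches_spatial_query("S1&B10", vs_symbols))  # True: Vs starts the answer
```

Here the subset test stands in for the full traversal of the subnetwork; a real DBMS would walk the arcs of the ATN rather than index into a list.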
4.1.2
A SPATIO-TEMPORAL DATABASE QUERY EXAMPLE
•
Query 2: Find the video clips beginning with a salesman holding a box on his right, moving the box from the right to his front, and ending with moving the box back to his right.
This query involves both the temporal and spatial aspects of two semantic objects: salesman and box. The query is translated into a multimedia input string which is the same as Equation 5.1. Again, each of the subnetworks needs to be checked one by one. As in the previous query, the relative positions to be matched are based on the views that users see. The first condition in this query asks to match the relative position of the box to the left of the salesman. When the subnetwork of Vs is traversed, S1 and B10 tell us that the input symbol X1 satisfies the first condition, in which the box is to the left of the salesman. Next, the relative position of the box moves from the left to the front of the salesman. This is satisfied by the input symbol X2 since B3 indicates that the box is in front of the salesman. Finally, it needs to match the relative position of the box back to the left of the salesman. This condition is exactly the same as the first condition and is satisfied by the input symbol X3. In this query, the semantic object salesman is the target semantic object and his position remains unchanged. From this subsection, it can be seen that after the subnetworks and their multimedia input strings are constructed by the designer, users can issue database queries related to temporal and spatial relations of semantic objects using high-level database query languages. These queries are translated into multimedia input strings to match against the multimedia input strings of the subnetworks that model image and video media streams. Under this design, multimedia database queries related to images or video frames can be answered. The details of the multimedia input strings, the translation from high-level queries to multimedia input strings, and the matching processes are transparent to users. ATNs and multimedia input strings are the internal data structures and representations in a database management system (DBMS).
After users issue queries, the latter processes are handled by the DBMS. Separating the detailed internal processes from users can reduce the burden of users so that the multimedia information system is easy to use.
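Query 2 extends the matching from a single symbol to an ordered sequence. A hypothetical sketch (again, not the book's traversal algorithm) checks each query symbol against the corresponding stream symbol in order:

```python
def parse_symbol(symbol):
    """Split an input symbol such as 'S1&B3' into its object codes."""
    return set(symbol.split("&"))

def matches_sequence(query_symbols, stream_symbols):
    """Each query symbol must be contained, in order, in the
    corresponding stream symbol (Query 2 against Equation 5.1)."""
    if len(query_symbols) != len(stream_symbols):
        return False
    return all(parse_symbol(q) <= parse_symbol(s)
               for q, s in zip(query_symbols, stream_symbols))

vs_symbols = ["S1&B10", "S1&B3", "S1&B10"]   # box: left, front, left again
print(matches_sequence(["S1&B10", "S1&B3", "S1&B10"], vs_symbols))  # True
print(matches_sequence(["S1&B3", "S1&B10", "S1&B10"], vs_symbols))  # False
```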
Figure 5.5. A browsing graph of a mini tour of the Purdue University campus: there are seven sites denoted by Bi, i = 1 . . . 7 that are connected by arcs. A directed arc denotes a one-way selection and a bi-direction arc allows two-way selections.
5.
MULTIMEDIA BROWSING
Figure 5.5 shows a browsing graph of a mini-tour of a campus. The browsing graph consists of a set of nodes and arcs connecting them. There are seven sites in this browsing graph: Purdue Mall, Computer Science, Chemical Engineering, Potter Library, Union, Electrical Engineering, and Mechanical Engineering. Bi, i = 1 . . . 7 are used to represent these seven sites. For each site, a presentation consists of video, text, and audio media streams which are denoted by Vi, Ti, and Ai. A directed arc denotes a one-way selection. For example, there is a directed arc pointing from the Mechanical Engineering building to the Purdue Mall. This means that after a user watches the presentation for the Mechanical Engineering building, he/she can immediately watch the Purdue Mall presentation. However, the opposite direction is inapplicable since there is no directed arc pointing from the Purdue Mall to the Mechanical Engineering building. The bi-directional arcs allow users to go back and forth between two locations, such as the Purdue Mall and the Potter Library. For example, after a user watches the presentation for Purdue Mall, he/she can choose the Computer Science, Chemical Engineering, Potter Library, Union, and Electrical Engineering buildings to watch. He/She can also watch the presentation for the Purdue Mall again.

Figure 5.6. ATN for the mini tour of a campus: Seven networks represent seven sites which users can browse. Networks B1/ through B7/ represent the presentations for sites Purdue Mall, Computer Science Building, Chemical Engineering Building, Potter Library, Union, Electrical Engineering Building and Mechanical Engineering Building, respectively. Each network begins a presentation with three media streams: a video, a text, and an audio, and is followed by selections. After a user selects a site, the control will pass to the corresponding network so that the user can watch the presentation for that site continuously.

Figure 5.6 is the ATN for the browsing graph in Figure 5.5. For simplicity, some state names are not shown in Figure 5.6. Assume the browsing always starts from the Purdue Mall (B1). There are seven networks in Figure 5.6 and each network represents a site. In a user interaction environment, users may start from any site to watch. Before a detailed discussion of how the ATN models multimedia browsing, the arc types together with the notation need to be defined. The following notation and definitions, as in [9], are adopted. Push arc: succeeds only if the named network can be successfully traversed. The state name at the head of the arc will be pushed onto a stack and the control will be passed to the named network.
Pop arc: succeeds and signals the successful end of the network. The topmost state name will be removed from the stack and become the return point. Therefore, the process can continue from this state node. Jump arc: always succeeds. This arc is useful to pass the control to any state node.
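Restricted to the selections named in the text, the browsing graph of Figure 5.5 can be sketched as a successor map. This is a partial, hypothetical reconstruction; the full figure may contain more arcs:

```python
# Partial successor map for Figure 5.5, reconstructed only from the
# choices named in the text; the full figure may contain more arcs.
successors = {
    "B1": {"B1", "B2", "B3", "B4", "B5", "B6"},  # Purdue Mall
    "B2": {"B1", "B2"},                          # Computer Science
    "B3": {"B1", "B2", "B3"},                    # Chemical Engineering
    "B7": {"B1"},                                # Mechanical Engineering (one-way)
}

def can_browse(src, dst):
    """True if a (possibly one-way) arc allows going from src to dst."""
    return dst in successors.get(src, set())

print(can_browse("B7", "B1"))  # True: directed arc into the Purdue Mall
print(can_browse("B1", "B7"))  # False: no arc back
```

A bi-directional arc is simply an entry in both directions, while a directed arc appears in only one of the two sets.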
A detailed trace of the following browsing sequence is used to illustrate how the ATN works. 1. Purdue Mall, 2. Chemical Engineering, 3. Computer Science, 4. Computer Science. In this browsing example, the Purdue Mall is the first to be visited and followed by the Chemical Engineering building. Then the Computer Science building is the next one to be watched. The Computer Science building is viewed one more time and then the tour stops. Table 5.3 shows the trace for this browsing example. Step 1: The current state is in B1/ where B1 represents the Purdue Mall. The arc followed is arc number 1 and the input symbol is V1&T1&A1. This input symbol denotes video 1, text 1, and audio 1 are displayed concurrently. Step 2: Arc number 2 is followed and the input symbol B1&B2&B3&B4&B5&B6 is read so that users can choose a site from site 1 through 6 to watch.
Table 5.3. The trace of ATN for the specified browsing sequence.
Step 3: Based on the browsing sequence as specified above, the Chemical Engineering building (B3) is chosen so that arc number 6 is followed and input symbol B3 is read. Since B3 is a subnetwork name, the state name pointed by arc number 6 (B1/B3) is pushed into a stack. A stack follows the last-in-first-out (LIFO) policy which only allows retrieving the topmost state name first.
Step 4: The control is passed to the subnetwork with starting state name B3/. Arc number 17 is followed and the input symbol V3&T3&A3 is read so that video 3, text 3, and audio 3 are displayed. Step 5: Arc number 18 is followed and the input symbol B1&B2&B3 is read so that users can choose from site 1 through 3 to watch. Step 6: As specified above, the Computer Science building (B2) is the next site to watch so that arc number 21 is followed and the input symbol B2 is read. Since B2 is a subnetwork name, the state name pointed by this arc is pushed into the stack. Therefore, there are two state names in this stack: B3/B2 and B1/B3. Step 7: The control passes to the subnetwork with starting state name B2/. Arc number 10 is followed and the input symbol is V2&T2&A2 so that video 2, text 2, and audio 2 are displayed. Step 8: Arc number 11 is followed and the input symbol B1&B2 is read. Users can choose either site 1 or 2. Step 9: User interactions allow users to interact with multimedia information systems. Users may want to watch some topic recursively or have user loops so that they have the opportunity to select different contents after viewing a previous selection. ATNs allow recursion, that is, a network might have an arc labeled with its own name. This feature allows ATNs to have the ability to model user loops and recursion easily. As specified above, the Computer Science building (B2) is watched again. The state name pointed by arc number 14 (B2/B2) is pushed into the stack so that there are three state names stored in this stack. Step 10: The control passes back to the same subnetwork and video 2, text 2, and audio 2 are displayed concurrently again. Step 11: After Step 10, as specified above, the presentation stops so that arc number 12 is followed with a pop arc label. The topmost state name in the stack (B2/B2) is popped out so that the control passes to the state node with state name B2/B2. The stack has two state names now.
Step 12: Arc number 16 is followed with a pop arc label. Therefore the topmost state name B3/B2 is popped out and the control is passed to it. Step 13: The current state name is B3/B2 and arc number 24 is followed with a pop arc label. The only state name in the stack is popped out and the control is passed to it.
Step 14: The current state is B1/B3 which is a final state (no outgoing arc) so that the browsing stops. ATN and its subnetworks in Figure 5.6 depict the structural hierarchy of the browsing graph in Figure 5.5. User interactions and user loops are modeled using ATN in this example. Under ATNs, user interactions are represented by using more than one outgoing arc with different arc labels for a state node. User loops are modeled by using recursions with arcs labeled by network names. By using the recursion, one can avoid many arcs which point back to the previous state nodes. This makes the whole network structure become less complicated. Moreover, the browsing sequences in Figure 5.5 are preserved by traversing the ATN.
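The push/pop bookkeeping behind Table 5.3 can be sketched as follows. The function is a hypothetical simplification that only tracks the presented sites and the return-state stack; the real ATN also reads input symbols and presents media streams:

```python
def browse(sequence):
    """Present each site in order; every move pushes a return state
    ('current/next') onto the stack, and pop arcs unwind it at the end."""
    stack, trace = [], []
    current = sequence[0]
    trace.append(f"present {current}")
    for nxt in sequence[1:]:
        stack.append(f"{current}/{nxt}")  # push arc: save the return point
        current = nxt
        trace.append(f"present {current}")
    while stack:                           # pop arcs: LIFO unwinding
        trace.append(f"pop to {stack.pop()}")
    return trace

# The browsing sequence traced above: B1, B3, B2, B2.
trace = browse(["B1", "B3", "B2", "B2"])
for step in trace:
    print(step)
```

Running it reproduces the stack contents of Steps 3, 6, and 9 (B1/B3, B3/B2, B2/B2) and the pop order of Steps 11 through 13.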
6.
USER INTERACTIONS AND LOOPS
Figures 5.7(a) through 5.7(f) are the timelines for two multimedia presentations. Figures 5.7(a) and 5.7(b) are the starting timelines for presentations P1 and P2, respectively. In Figure 5.7(a), media streams V1 (video stream 1) and T1 (text 1) start to display at time t1, and media streams I1 (image 1) and A1 (audio stream 1) join at time t2 in presentation P1. These four media streams all end at time t3. In presentation P2, as shown in Figure 5.7(b), V2 and T2 start at t1 and end at t3. Then, presentation P1 goes to the presentation shown in Figure 5.7(c) and presentation P2 continues as shown in Figure 5.7(d). Presentations P1 and P2 join again at time t5, where two choices are provided to allow users to choose based on their preference. A timeline model cannot represent such alternatives, so it is difficult to model this user interaction scenario. As shown in Figure 5.7, the timeline representation cannot directly indicate that Figures 5.7(e) and 5.7(f) are two timelines for different selections. Since user thinking time is unknown in advance, the starting time for media streams V5 and T6 and the starting time for V6 and T8 will not be known until a user makes a choice. Let's assume the user makes a choice at time t6. Figures 5.7(e) and 5.7(f) are the timelines for selections B1 and B2. Selection B1 has V5 and T6 displayed at time t6. These two media streams end at time t7, where T7 and A4 begin to display and end at time t8. Selection B2 has V6 and T8 displayed from time t6 to time t7, and V7 and A5 start at time t7 and end at time t8. If B1 is chosen, the presentation stops. However, if B2 is chosen, it allows the user to make the choice again. The timeline representation cannot reuse the same timelines as in Figures 5.7(e) and 5.7(f) and therefore it needs to create the same information again. Since the number of loops that users will go through in this part is unknown,
Figure 5.7. Timelines for presentation P1 and P2: Figures (a), (c), (e), and (f) are the presentation sequence for presentation P1. Figures (b), (d), (e), and (f) are the presentation sequence for presentation P2. Figures (e) and (f) are two timelines for selections B1 and B2, respectively.
Figure 5.8. Augmented Transition Network: (a) is the ATN network for two multimedia presentations which start at the states P1/ and P2/, respectively. (b)–(d) are part of the subnetworks of (a). (b) models the semantic objects in video media stream V1, (c) models the semantic objects in image media stream I1, and (d) models the keywords in text media stream T1. In (e), the “Get” procedure is to access an individual media stream. “Display” displays the media streams. “Next_Symbol(Xi)” reads the input symbol Xi. The “Next_State” is a procedure to advance to the next state. “Start_time(Xi)” gives the pre-specified starting time of Xi. User thinking time is accounted for by the Delay variable. θ is a parameter.
Table 5.4. The trace of ATN for presentation P1.
it is impractical to use this stand-alone timeline representation to model user loops. When the designer specifies the start and end times of media streams as shown in Figure 5.7, the multimedia input string can be constructed automatically based on the starting and ending times of the media streams. In presentation P1, the multimedia input string is: [5.3]
In presentation P2, the multimedia input string is: [5.4]
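Although Equations 5.3 and 5.4 are not reproduced here, the automatic construction of input symbols from the designer's start and end times (as in Figure 5.7(a)) can be sketched with a hypothetical helper: each interval between consecutive time points yields one symbol listing the streams active in that interval.

```python
def input_symbols(streams):
    """streams: {name: (start, end)}.  Returns one '&'-joined symbol per
    interval between consecutive time points, listing the active streams."""
    points = sorted({t for s, e in streams.values() for t in (s, e)})
    symbols = []
    for lo, hi in zip(points, points[1:]):
        active = sorted(n for n, (s, e) in streams.items() if s <= lo and e >= hi)
        if active:
            symbols.append("&".join(active))
    return symbols

# Figure 5.7(a): V1 and T1 start at t1; I1 and A1 join at t2; all end at t3.
p1_start = {"V1": (1, 3), "T1": (1, 3), "I1": (2, 3), "A1": (2, 3)}
print(input_symbols(p1_start))  # ['T1&V1', 'A1&I1&T1&V1']  (X1, then X2)
```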
As mentioned earlier, a multimedia input string is used to represent the presentation sequence. In presentation P1, the input symbol X1 contains V1 and T1 which start at the same time and play concurrently. Later, I1 and A1 begin and overlap with V1 and T1. Therefore, the input symbol X2 contains the media streams V1, T1, I1, and A1. Each image, video, or text media stream has its own multimedia input string and a subnetwork is created. Figures 5.8(b) to 5.8(d) are part of the subnetworks of P1 to model V1, I1, and T1, respectively. For simplicity, the subnetworks of other image, video, and text media streams are not shown in Figure 5.8. The delay time for I1 and A1 to display does not need to be specified in a multimedia input string explicitly since the multimedia input string is read from left to right so that the time needed to process X1 is the same as the delay time for I1 and A1. The presentation continues until the final state is reached. The “|” symbol represents the alternatives for different selections. The “$” symbol denotes the end of a presentation. Figure 5.8 shows the use of a single ATN to model presentations P1 and P2 which include user interactions and loops. Figure 5.8(a) is an
Table 5.5. Continuation of Table 5.4 if B1 is chosen.

Table 5.6. Continuation of Table 5.4 if B2 is chosen.
ATN to model presentations P1 and P2. P1 and P2 start at different starting states. Table 5.4 is a trace of the ATN for presentation P1 in Figure 5.8 and is used to explain how the ATN works. Step 1: The current state is P1/ and the arc to be followed is arc number 1 with arc label X1. Media streams V1 and T1 are displayed. There is no backup state in the stack. Step 2: The current state is P1/X1, which denotes that X1 has been read in presentation P1. Arc number 2 with arc label X2 is the arc to be followed. X2 consists of media streams V1, T1, I1, and A1. Step 3: In presentation P1, after X1 and X2 are displayed, the ATN reaches state P1/X2, which has two outgoing arcs (arc numbers 4 and 6). Presentation P1 needs to follow arc 4 so that input symbol X4 is read. Media streams V3 and T3 are displayed during this duration. Step 4: The current state is P1/X4 and arc number 5 is followed. Media streams T3 and A2 are displayed. Step 5: In user interactions, user thinking time delays need to be kept so that later presentations can be shifted. The cross-serial dependencies cannot be handled using finite state machines. However, the conditions and actions in ATNs have the ability to model user interactions. In Figure 5.8(a), after the state P1/X5 is reached, the input symbol X8 with two selections B1 and B2 is displayed. Before a choice is made, a thinking time should be kept. A Delay variable is used to represent the delay of the presentation. The “Start_time(Xi)” procedure gives the pre-specified starting time of Xi. The difference between the current time and the pre-specified starting time is the total display time so far. Since a delay time occurs after user interactions, the presentation sequence needs to be shifted by the delay. The process continues until the final state is reached. Figure 5.8(e) shows the detailed condition column and action column for user interactions. As illustrated in Figure 5.8(a), two choices B1 and B2 are provided to let users make their selections. The corresponding action to calculate the delay time is activated in the action column. The calculated delay time is used to shift the starting time of any media stream displayed after the selection. Step 6: If B1 is selected, arc number 9 is followed and media streams V5 and T6 are displayed (as shown in Table 5.5). If B2 is selected, arc number 11 is followed and media streams V6 and T8 are displayed (as shown in Table 5.6). Step 7: In Table 5.5, the current state is P1/X9, arc number 10 is followed, and media streams T7 and A4 are displayed since B1 is selected. In Table 5.6, the current state is P1/X11, arc number 12 is followed, and media streams V7 and A5 are displayed since B2 is selected. Step 8: When B1 is selected, the current state is P1/X10 and the presentation stops. When B2 is selected, arc number 13 is followed and a “Jump” arc label is met. The control is passed back to state P1/X5. Step 9: When B2 is selected, the process goes back to Step 5 to let the user make a choice again.
Steps 5 through 9 model a loop scenario which is represented by a “+” symbol in multimedia input strings [5.3] and [5.4]. The “Jump” action does not advance the input symbol but lets the control go to the pointing state. That means the “Jump” itself is not an input symbol in multimedia input strings. This feature is crucial for the designers who may want some part of the presentation to be seen over and over again. For example, in a computer-aided instruction (CAI) presentation, the teacher may want the students to view some part of the presentation until they become familiar with this presentation part.
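The Delay bookkeeping described in Step 5 and Figure 5.8(e) can be sketched as follows. The function and stream names are hypothetical, and the θ parameter of Figure 5.8(e) is not modeled:

```python
def shift_schedule(schedule, choice_time, prespecified_start):
    """Delay = actual choice time minus the pre-specified starting time
    of the selection symbol; every later start time is shifted by it."""
    delay = choice_time - prespecified_start
    shifted = {stream: start + delay for stream, start in schedule.items()}
    return shifted, delay

# Hypothetical pre-specified starts for streams after the selection point.
later = {"V5": 6, "T6": 6, "T7": 7, "A4": 7}
shifted, delay = shift_schedule(later, choice_time=8, prespecified_start=6)
print(delay)    # 2
print(shifted)  # {'V5': 8, 'T6': 8, 'T7': 9, 'A4': 9}
```

Because the delay is computed only when the user actually chooses, the same subnetwork can be re-entered any number of times, which is how the "+" loop avoids fixing the schedule in advance.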
Case Study 1 – Augmented Transition Network (ATN) Model
109
In Figure 5.8(c), there is only one input symbol for the subnetwork modeling I1. The input symbol is X16 which contains the semantic object salesman. Figure 5.8(d) is the subnetwork for T1. The input symbols for T1 consist of three keywords appearing in the sequence library, computer, and database. Presentation P2 is similar to P1 except that X3 is displayed first, and X6 and X7 are displayed after X3. That is, presentations P1 and P2 share arc number 8. Figure 5.8(e) shows an example of how to use conditions and actions to maintain synchronization and quality of service (QoS). In presentation P1, when the current input symbol X1 (V1&T1) is read, the bandwidth condition is first checked to see whether the bandwidth is enough to transmit these two media streams. If it is not enough then the compressed version of V1 will be transmitted instead of V1. Then the pre-specified duration to display V1 and T1 is checked. If the duration is not reached, the display continues. Otherwise, a new input symbol is read and the control is passed to the next state. The same conditions are checked for other input symbols, too.
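The bandwidth condition/action just described can be sketched minimally; the threshold values and the naming of the compressed version are made up for illustration:

```python
def choose_version(stream, required_kbps, available_kbps):
    """Return the compressed version when bandwidth is insufficient."""
    if available_kbps >= required_kbps:
        return stream
    return stream + "_compressed"

print(choose_version("V1", required_kbps=1500, available_kbps=2000))  # V1
print(choose_version("V1", required_kbps=1500, available_kbps=800))   # V1_compressed
```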
Chapter 6
CASE STUDY 2 – OBJECT COMPOSITION PETRI NET (OCPN) MODEL
1.
INTRODUCTION
An Object Composition Petri Net (OCPN) model is based on the logic of temporal intervals and Timed Petri Nets and was proposed to store, retrieve, and communicate between multimedia objects [118, 120]. An OCPN can be graphically represented by a bipartite graph, in which places are denoted by circles and transitions by bars. The Petri net can be defined as follows [91]:

Definition 6.1: The Petri net is defined as a bipartite directed graph N = {T, P, A}, where
T = {t1, t2, ..., tn} is a set of transitions (bars),
P = {p1, p2, ..., pm} is a set of places (circles), and
A: {T × P} ∪ {P × T} → I is a set of directed arcs, I = {1, 2, ...}.

For simple Petri nets, the firing of a transition is assumed to be an instantaneous event and thus the time from enabling a transition to firing is unspecified and indeterminate. Hence, extensions of the original model are required to represent the concept of nonzero time expenditure in the net. A class of enhanced Petri net models, called Timed Petri nets, has been developed which assigns a firing duration to each transition [91, 148, 192]. The Timed Petri net model is chosen as a representation of the synchronization of multimedia entities because of its desirable
attributes of representation of concurrent and asynchronous events. The Timed Petri net models map well to Markov performance analysis [118]. Another Timed Petri net model, proposed in [51], represents processes by places instead of transitions and is called the augmented Petri net model. Nonnegative execution times are assigned to each place in the net. The notion of instantaneous firing of transitions is preserved and the state of the system is always clearly represented during process execution in this augmented model. The tokens are at all times in places, not transitions. It has the advantage of compactness of representation since either process timing scheme can be used. In addition, an extended Petri net model called the marked Petri net is defined as follows.

Definition 6.2: A marked Petri net is defined as Nm = {T, P, A, M}, which includes a marking M, a mapping from the set of places to the integers that assigns tokens (dots) to each place in the net: M: P → I, I = {0, 1, 2, ...}.
In [118], in order to illustrate the use of multiple media, the authors supplement the augmented Petri net model with resource information associated with the Timed Petri net models [148] into the OCPN model. The OCPN augments the conventional Petri net model with values of time, as durations, and resource utilization on the places in the net. Hence, an OCPN is defined as:

Definition 6.3: OCPN = {T, P, A, D, R, M}, where
D: P → ℝ is the mapping from the set of places to the real numbers (durations), and
R: P → {r1, r2, ..., rK} is the mapping from the set of places to a set of resources.

In an OCPN, each place (circle) contains the required presentation resource (device), the time required to output the presentation data, and spatial/content information. Since a transition is defined to occur instantaneously, places rather than transitions have states, i.e., each place (circle) is represented by a state node in the OCPN model. The transitions (bars) in the net indicate points of synchronization and processing. Associated with the definition of the Petri net is a set of firing rules governing the semantics of the OCPN model. The firing rules are [118]:
1. A transition ti fires immediately when each of its input places contains an unlocked token.
2. Upon firing, the transition ti removes a token from each of its input places and adds a token to each of its output places.
3. After receiving a token, a place pj remains in the active state for the interval specified by the duration τj. During this interval, the token is locked. When the place becomes inactive, or upon expiration of the duration τj, the token becomes unlocked.
Moreover, since the OCPN models are a form of marked graph [51] and possess their characteristics, the places of OCPN models have exactly one incoming arc and one outgoing arc.
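These firing rules can be illustrated with a minimal discrete-event sketch. The data structures are hypothetical simplifications (no resources, greedy firing), with durations attached to places as in Definition 6.3:

```python
def simulate(places, transitions, marking):
    """places: {name: duration}; transitions: list of (inputs, outputs);
    marking: {place: time at which its token unlocks}.  Returns the time
    the last remaining token unlocks."""
    fired = set()
    progress = True
    while progress:
        progress = False
        for i, (ins, outs) in enumerate(transitions):
            if i in fired or not all(p in marking for p in ins):
                continue
            fire_time = max(marking[p] for p in ins)  # rules 1 and 3: wait until all input tokens unlock
            for p in ins:                             # rule 2: remove a token from each input place
                del marking[p]
            for p in outs:                            # rule 2: add a token, locked for the place's duration
                marking[p] = fire_time + places[p]
            fired.add(i)
            progress = True
    return max(marking.values()) if marking else 0.0

# P_alpha (2 s) plays before P_beta (3 s), with a 1 s delay place between.
places = {"p_alpha": 2.0, "p_delay": 1.0, "p_beta": 3.0}
transitions = [(["p_alpha"], ["p_delay"]), (["p_delay"], ["p_beta"])]
print(simulate(places, transitions, {"p_alpha": 2.0}))  # 6.0
```

The "before" chain above mirrors the structure of Figure 6.1(b): a delay place between the two presentation places realizes the gap between the intervals.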
2.
INTERVAL-BASED CONCEPTUAL MODELS
Multimedia data often have time dependencies that must be satisfied at presentation time, such as audio and video data in a motion picture that require time-ordered playout during presentation. Time-dependent data differ from historical data, which do not specifically require timely playout. The primary requirement for supporting time-dependent data playout is the means for identifying temporal relations between multimedia objects. To be useful to both the data presentation system and the multimedia author, these timing relationships must be identified and managed. The task of coordinating sequences of multimedia data requires synchronization among the interacting media as well as within each medium. Synchronization can be applied to the playout of concurrent or sequential streams of data and to the external events generated by a human user, including browsing, querying, and editing multimedia data. The problem of synchronizing time-ordered sequences of data elements is fundamental to multimedia data.
2.1
TIMING RELATIONSHIPS
Time-dependent multimedia objects require special considerations for presentation due to their real-time playout characteristics as data need to be delivered from the storage devices based on a pre-specified schedule. Furthermore, presentation of a single object can occur over an extended duration as in the case of a motion picture. Temporal relations model multimedia presentation by letting each interval represent the presentation of some multimedia data element, such as a still image or an audio segment.
Synchronization requirements for multimedia data have been classified along several dimensions [118, 128, 163]. There are two ways to obtain the timing relationships between the media.
1. Data can have natural or implied time dependencies. Temporal relations are implicitly specified when capturing the media objects. That is, the dependencies can be implied, as in the simultaneous acquisition of voice and video. These data streams are often described as continuous because recorded data elements form a continuum during playout. In other words, elements are played out contiguously in time. The goal is to present the various objects according to the temporal relations captured during the recording process. Enforcing this type of synchronization requires [21]:
• capturing the implicit temporal relations of media objects during their recording
• storing these synchronization relations in a database along with the data to be synchronized
• enforcing these relations at presentation time
2. Data can have synthetic temporal relationships. Temporal relations are explicitly imposed on media objects that are independently captured and stored. Synthetic synchronization is the basis for creating user-defined multimedia presentations. That is, the dependencies can be explicitly formulated, as in the case of a multimedia document with voice-annotated text. The goal is to support flexible and powerful synchronization relations among objects, possibly including user interactions. Enforcing this type of synchronization requires [21]:
• specifying the synchronization constraints among the objects composing the presentation
• enforcing the specified synchronization at runtime
In either situation, the characteristics of each medium and the relationships among them must be established in order to provide synchronization in the presence of vastly different presentation requirements.
Usually, the synchronization points correspond to the change of an image, the end of a voice-annotation, etc. The combination of natural and synthetic time dependencies can describe the overall temporal requirements of any pre-orchestrated multimedia presentation. A multimedia system must preserve the timing relationships among the elements of the object presentation at these synchronization points by the process of multimedia synchronization.
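The capture-store-enforce idea for natural time dependencies can be sketched minimally with a hypothetical helper: each element's capture offset is stored with the data and reused as its playout deadline (a real system must also bound skew and jitter):

```python
def playout_schedule(captured, playout_start):
    """captured: [(element, capture_offset)] recorded at acquisition time.
    Reuses each stored offset as the element's playout deadline."""
    return [(elem, playout_start + offset) for elem, offset in captured]

# Offsets captured during simultaneous voice/video acquisition (made up).
voice_video = [("v_frame0", 0.0), ("a_block0", 0.0), ("v_frame1", 0.033)]
print(playout_schedule(voice_video, playout_start=100.0))
```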
In addition, the tolerance of data to timing skew and jitter during playout varies widely depending on the medium. For example, audio and video data both require tight bounds on the order of hundreds of milliseconds, but they can tolerate different absolute timing requirements during playout, as the human ear can discern dropouts in audio data more readily than the eye can in video. On the other hand, text and image data allow skew on the order of seconds. Based on these differences in tolerance to skew and jitter during playout, two approaches have been proposed to provide synchronous playout of time-dependent data streams: a real-time scheduling approach [119] and an optimistic interval-based process approach [120]. An OCPN can easily be used to specify the synchronization requirements of a set of arbitrarily complicated temporally related events. Places in the net represent the playout of the various objects composing the events and are augmented with the time required for the object presentation and the presentation device. The transitions denote the synchronization points among the playouts of the various objects. In [120], the authors proposed a new conceptual model for capturing and managing these timing relationships. Specifically, the n-ary and reverse temporal relations, along with their temporal constraints, are defined. These new relations are used to ensure a property of monotonically increasing playout deadlines, to facilitate both real-time deadline-driven playout scheduling and optimistic interval-based process playout.
2.2
BASIC TEMPORAL RELATIONS
As introduced in Chapter 2, there are two main approaches to representing temporal relations for multimedia data. One is the point-based representation and the other is the interval-based representation. The temporal interval-based approach is often used to manage the temporal relations between video objects. A time interval is characterized as a nonzero duration of time in any set of units, such as “one day” or “2 seconds”. This is contrasted with a time instant, which is a point structure with zero-length duration, such as “10:00 AM.” Formally, temporal intervals can be defined as follows [11].

Definition 6.4: Let [S, ≤] be a partially ordered set, and let a, b be any two elements of S such that a ≤ b. The set {x | a ≤ x ≤ b} is called an interval of S, denoted by [a, b]. An interval [a, b] of S has the following properties:
1. [a, b] = [c, d] ⟺ a = c and b = d
2. if c, d ∈ [a, b] and e ∈ S and c ≤ e ≤ d, then e ∈ [a, b]
3. #([a, b]) > 1.
Time intervals are described by their endpoints, such as a and b in Definition 6.4 above, and the length of such an interval is given by b − a. The relative timing between two intervals can be determined from these endpoints. By specifying intervals with respect to each other rather than by using endpoints, the intervals can be decoupled from an absolute or instantaneous time reference. This specification leads to the temporal relations. For example, in [83], the author presents a logic of intervals which indicates thirteen distinct ways to relate any two given intervals in time. Using Allen's representations [10], these relations can be presented graphically using a timeline representation and their corresponding Petri net representations, as shown in Figure 6.1(a) and Figure 6.1(b), respectively. One axis indicates the time dimension and the other indicates the space (or resource utilization). The modeling power of OCPNs comes from the fact that for each pair of multimedia objects whose presentation times are related by one of the thirteen temporal relationships, there exists a corresponding OCPN that models their temporal dependencies [118]. Note that the correspondence illustrated in Figure 6.1 and the properties of Petri nets allow arbitrarily elaborate synchronization requirements to be modeled by an OCPN. In fact, the synchronization requirements of a set of temporally related events can always be represented as a binary tree. The leaves of the binary tree are the multimedia objects of the events, and the intermediate nodes represent the temporal relationships. The whole specification can be represented by an OCPN which is constructed from the tree in the following manner. Each node in the tree whose children are leaves is immediately mapped onto an OCPN representing the synchronization requirements for the children, using the correspondence illustrated in Figure 6.1.
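The thirteen ways two intervals can relate reduce to seven non-inverse relations plus their inverses. A sketch that classifies a pair of intervals from their endpoints (relation names per Allen; the six inverse relations are reported generically):

```python
def temporal_relation(a, b):
    """a, b: (start, end) intervals with start < end; returns one of the
    seven non-inverse relation names, or 'inverse' for the other six."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2 and e1 < e2:
        return "starts"
    if s1 > s2 and e1 == e2:
        return "finishes"
    if s1 > s2 and e1 < e2:
        return "during"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    return "inverse"  # recover the name by swapping the interval labels

print(temporal_relation((0, 2), (3, 5)))  # before
print(temporal_relation((0, 2), (2, 5)))  # meets
print(temporal_relation((1, 4), (0, 5)))  # during
```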
The process is then repeated for each intermediate node up to the root, by replacing, in the OCPN corresponding to a given node, the places associated with the children with their corresponding subnets. Define an atomic interval as an interval that cannot be decomposed into subintervals, as in the case of a single frame of a motion picture. In Figure 6.1, let Pα and Pβ be two atomic processes with finite and nontrivial (nonzero) temporal intervals (durations) τα and τβ, respectively. Also, let τδ be the finite delay duration specific to any temporal relation TR, and τTR be the overall duration of a pair of processes. As shown in Figure 6.1(a), only seven of the thirteen relations are drawn, since six of the others are inverses; the equality relation has no inverse, i.e., α equals β is the same as β equals α. For example, after is the inverse relation of before, or equivalently, before⁻¹ is the inverse relation of before. For inverse relations, given any two intervals, it is possible to represent their relations using only the noninverse relations by exchanging the interval labels [118, 120]. For any two atomic

Case Study 2 – Object Composition Petri Net (OCPN) Model

Figure 6.1. (a) Temporal relations represented graphically by a timeline representation. (b) The corresponding Petri net (OCPN) representation for the temporal relations.
Figure 6.2. A unified OCPN model.

Table 6.1. Temporal Parameters of the Unified Model in Figure 6.2 (Pα TR Pβ).
processes and their temporal relation, there is a corresponding OCPN model, as indicated in Figure 6.1(b). The converse is true as well, i.e., for any OCPN model, a corresponding temporal relation can be uniquely identified [118]. Table 6.1 [120] summarizes the temporal parameters for each relation specific to the unified OCPN model of Figure 6.2 [118]. The atomic presentation processes are directly mapped into the Petri net places, and additional delay processes are attached to establish the proper timing relationships. A set of constraints indicating the timing-parameter relationships for the simple binary temporal relations is shown in Table 6.1. These constraints are used to show that the temporal relations of the simple unified OCPN model can be identified uniquely. This functionality proved valuable for describing the temporal component of
composite multimedia objects, as shown in [118]. In particular, these constraints can be used to [120]:
1. identify a temporal relation from the parameters τα, τβ, τδ, and τTR;
2. verify that the parameters satisfy a temporal relation TR;
3. identify the overall interval duration τTR given a temporal relation TR.
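Table 6.1 itself is not reproduced in this text, so the following sketch illustrates use (1) above without claiming to be the published table: it encodes constraint sets for just three relations (before, meets, equals), derived here from the timeline geometry of Figure 6.1(a), and reports which relations a given parameter tuple satisfies.

```python
def identify_relation(t_alpha, t_beta, t_delta, t_TR, eps=1e-9):
    """Return the temporal relations whose (assumed) constraints the parameters satisfy."""
    candidates = {
        # Pα before Pβ: a gap of t_delta separates the two presentations.
        "before": (t_delta > 0
                   and abs(t_TR - (t_alpha + t_delta + t_beta)) < eps),
        # Pα meets Pβ: no gap; the overall duration is the simple sum.
        "meets":  (abs(t_delta) < eps
                   and abs(t_TR - (t_alpha + t_beta)) < eps),
        # Pα equals Pβ: identical durations, no delay.
        "equals": (abs(t_delta) < eps
                   and abs(t_alpha - t_beta) < eps
                   and abs(t_TR - t_alpha) < eps),
    }
    return [name for name, ok in candidates.items() if ok]

print(identify_relation(2, 3, 1, 6))  # ['before']
```

The full table in [120] covers all the relations; the point here is only the shape of the check: each relation imposes an equality/inequality system over (τα, τβ, τδ, τTR).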
2.3 N-ARY TEMPORAL RELATIONS
The basic binary temporal relations are extended to n-ary temporal relations in [120]. Though the binary relations might be sufficient for the temporal characterization of simple or complex multimedia presentations at the level of orchestration, a generalization of the binary temporal relations that captures the relationships among many intervals is required in order to simplify the data structures needed to maintain the synchronization semantics in a database. It can easily be seen that the binary construction process in [118] cannot handle the case in which many objects are to be synchronized by a single kind of temporal relation. Therefore, a new kind of homogeneous temporal relation describing the temporal relations on n objects or intervals is defined as follows [120].

Definition 6.5: Let P be an ordered set of n temporal intervals such that P = {P1, P2, . . . , Pn}. A temporal relation, TR, is called an n-ary temporal relation, denoted TRn, if and only if Pi TR Pi+1, ∀i (1 ≤ i ≤ n − 1).

There are thirteen possible n-ary temporal relations, the same as the binary temporal relations. Seven n-ary temporal relations are shown in Figure 6.3, after eliminating their inverses. As can be seen from Figure 6.3, when n is 2, the n-ary temporal relations simply reduce to the binary temporal relations; in other words, for n = 2 the definition reduces to P1 TR P2. The relative positioning and time dependencies of the intervals are captured by a delay, as is their overall duration. Similarly, there is a set of constraints to be identified for the timing-parameter relationships among intervals in the n-ary case (as shown in Table 6.2 [120]). The duration of each element i (τi) can be determined from Definition 6.5 and by noting that adjacent intervals (the i-th and (i+1)-th) form binary relationships to which the relationships of Table 6.1 can be applied.
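Definition 6.5 reduces to a chain of pairwise checks, which can be sketched as follows. Here `relation` is an assumed helper that names the binary relation between two intervals given as (start, end) pairs; the toy `before` helper covers just one relation for the example.

```python
def is_nary_relation(intervals, tr, relation):
    """True iff P_i TR P_{i+1} holds for every consecutive pair in `intervals`."""
    return all(relation(intervals[i], intervals[i + 1]) == tr
               for i in range(len(intervals) - 1))

def before(p, q):
    # Toy binary-relation helper for the "before" case only.
    return "before" if p[1] < q[0] else "other"

print(is_nary_relation([(0, 1), (2, 3), (5, 6)], "before", before))  # True
```

For n = 2 the loop degenerates to the single check P1 TR P2, matching the remark above.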
Figure 6.3. The n-ary temporal relations.

2.4 REVERSE TEMPORAL RELATIONS
The notion of temporal intervals supports reverse playout activities; in other words, the direction in time of a presentation can be reversed. For this purpose, the reverse temporal relations are proposed [120]. These relations, derived from the forward relations, define the ordering and scheduling required for reverse playout. First, the reverse binary temporal relations are characterized. The reverse binary temporal relations are distinct from the inverse temporal relations, which are obtained by commuting the operands, i.e., a TR b is equivalent to b TR⁻¹ a. This characterization is essential for reverse playout of time-dependent multimedia objects and allows playout in reverse time, defined by changing the direction of time evaluation of a temporal relation. The following definitions characterize reverse temporal intervals and relations.

Table 6.2. n-ary Temporal Constraints.

Definition 6.6: A reverse interval is the negation of a forward interval, i.e., if [a, b] is an interval, then [−b, −a] is the reverse interval. [−b, −a] is clearly an interval since a ≤ b and therefore −b ≤ −a.

Definition 6.7: A reverse temporal relation TRr is defined as the temporal relation formed among reverse temporal intervals. Let [a, b] and [c, d] be two temporal intervals related by TR; then the reverse temporal relation TRr is defined by the temporal relation formed between [−b, −a] and [−d, −c].

The reverse relations are summarized in Figure 6.4. As can be seen from Figure 6.4, the reverse intervals are the reflection of the forward intervals about a point on the time axis. In addition, Table 6.3 summarizes the conversions from the forward temporal parameters (τα, τβ, τδ, and TR) to the reverse temporal parameters (τα–r, τβ–r, τδ–r, and TRr). Table 6.3 is derived from the consistency formulas of Table 6.1 and by inspection of the binary reverse relations of Figure 6.4. Therefore, these parameters represent the new parameters formed by the new relation when viewed in reverse. Similarly, the reverse binary temporal relations can be extended to the reverse n-ary temporal relations. Like the n-ary case, the reverse n-ary temporal relations can be defined as follows.
Figure 6.4. The forward and reverse relations and intervals.
Definition 6.8: Let P be an ordered set of n temporal intervals such that P = {P1, P2, . . . , Pn}, and let TR be a temporal relation with reverse relation TRr. If TRn is an n-ary temporal relation on P, then a temporal relation is called a reverse n-ary temporal relation, and is defined as Pi TRr Pi−1, ∀i (1 < i ≤ n), where TRr can be found from Table 6.3. Table 6.4 shows the reverse temporal parameters given the temporal parameters of an n-ary temporal relation. Ultimately, these parameters enable the reverse presentation timing to be derived from the forward timing parameters.
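Definitions 6.6 and 6.7 translate almost directly into code. The sketch below is illustrative only; the toy `allen` helper covers just before/meets and their inverses, not all thirteen relations.

```python
def reverse_interval(iv):
    """Definition 6.6: the reverse of [a, b] is [-b, -a]."""
    a, b = iv
    return (-b, -a)

def allen(p, q):
    # Toy relation helper (before/meets and their inverses only).
    if p[1] < q[0]:
        return "before"
    if p[1] == q[0]:
        return "meets"
    if q[1] < p[0]:
        return "before^-1"
    if q[1] == p[0]:
        return "meets^-1"
    return "other"

def reverse_relation(p, q, relation):
    """Definition 6.7: the relation formed between the reversed intervals."""
    return relation(reverse_interval(p), reverse_interval(q))

# [0, 1] is before [2, 3]; in reverse time the order flips.
print(reverse_relation((0, 1), (2, 3), allen))  # before^-1
```

This mirrors Table 6.3 in spirit: instead of looking up TRr, it recomputes the relation on the negated endpoints.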
Table 6.3. Temporal Parameter Conversions (Pα TR Pβ to Pα–r TRr Pβ–r).
Table 6.4. n-ary Temporal Parameter Conversions.

2.5 PARTIAL TEMPORAL RELATIONS
The notion of temporal intervals also supports partial playout activities, i.e., the presentation of an object can be started at a midpoint rather than at the beginning or the end. Hence, a further enhancement for multimedia presentation is the ability to play out only a fraction of an object’s overall duration. This operation is typical during audio and video editing, in which a segment is repeatedly started, stopped, and restarted. Another example of partial playout occurs when a viewer stops a motion picture and later restarts it at some intermediate point (or
Figure 6.5. Partial interval evaluation.
perhaps an earlier point, to get a recap). The OCPN models have the partial interval evaluation capability needed to achieve this enhancement. Consider a single temporal interval that represents the overall duration, τTRn, of a complex n-ary relationship, as shown in Figure 6.5. In Figure 6.5, the purpose is to present some fraction of this temporal interval beginning at a relative time called ts. If ts < 0, then it is too soon to consider τTRn. If ts = 0, then the whole interval and the corresponding n-ary intervals must be considered. If 0 < ts < τTRn, then a fractional part of the n-ary relation must be evaluated. If ts > τTRn, then the interval need not be considered for evaluation at all. An atomic interval has no n-ary decomposition, so it can represent the presentation of a single data element. For the fractional part, both non-decomposable and decomposable intervals need to be considered. For a non-decomposable interval, partial evaluation implies that the data is already being presented and need only be terminated upon expiration of the temporal interval. For a decomposable interval, the problem is to determine where to begin evaluation of the subintervals [120].
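The four cases for ts described above can be sketched as follows. The subinterval bookkeeping for the decomposable case is an assumption made for illustration, not the algorithm of [120]: subintervals are given as (start, end) offsets within the overall duration, and those ending after ts are still to be evaluated.

```python
def partial_evaluation(ts, t_total, subintervals=()):
    """Classify how an interval of duration t_total is handled at start time ts."""
    if ts < 0:
        return "too-soon", []                 # not yet time to consider it
    if ts == 0:
        return "whole", list(subintervals)    # evaluate the entire relation
    if ts > t_total:
        return "expired", []                  # need not be considered at all
    # 0 < ts <= t_total: evaluate the fraction from ts onward.
    remaining = [(s, e) for (s, e) in subintervals if e > ts]
    return "partial", remaining

# A decomposable interval of duration 6 with three subintervals:
subs = [(0, 2), (2, 4), (4, 6)]
print(partial_evaluation(3, 6, subs))  # ('partial', [(2, 4), (4, 6)])
```

A non-decomposable interval is simply the degenerate case with no subintervals, matching the observation that it only needs to be terminated when its interval expires.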
In summary, the OCPN is a network model and a good data structure for controlling the synchronization of a multimedia presentation. A network model can easily show the time flow of a presentation; therefore, the OCPN can also serve as a visualization structure that helps users understand the presentation sequence of media streams.
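As a rough illustration of why a Petri-net structure makes a good scheduling data structure, the sketch below models places with durations and transitions that fire when all their input places finish. It is a simplification for this discussion, not the formal OCPN definition of [118]; all names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Place:
    name: str
    duration: float

@dataclass
class Transition:
    inputs: list = field(default_factory=list)   # places that must finish first
    outputs: list = field(default_factory=list)  # places started on firing

def schedule(start_places, transitions):
    """Compute a start time for every place by firing transitions in order."""
    start = {p.name: 0.0 for p in start_places}
    finish = {p.name: p.duration for p in start_places}
    for t in transitions:  # assumes transitions are listed in topological order
        fire_time = max(finish[p.name] for p in t.inputs)
        for p in t.outputs:
            start[p.name] = fire_time
            finish[p.name] = fire_time + p.duration
    return start

# "Pα meets Pβ": β starts exactly when α finishes.
pa, pb = Place("Pa", 2.0), Place("Pb", 3.0)
t = Transition(inputs=[pa], outputs=[pb])
print(schedule([pa], [t]))  # {'Pa': 0.0, 'Pb': 2.0}
```

Inserting an explicit delay place between the two transitions would model "before" instead of "meets", which is exactly how the unified model of Figure 6.2 distinguishes the relations.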
Chapter 7
CONCLUSIONS
With rapid progress in high-speed communication networks, large-capacity storage devices, digitized media, and data compression mechanisms, much attention has been focused on multimedia information techniques and applications. A variety of information sources – text, voice, image, audio, animation, and video – are involved in multimedia applications and must be delivered synchronously or asynchronously via more than one device. The use of emerging multimedia technologies in multimedia applications makes multimedia information systems important. However, multimedia information systems require the management and delivery of extremely large bodies of data at very high rates, and may require delivery under real-time constraints. The major challenge for every multimedia system is how to synchronize the various data types from distributed data sources for multimedia presentations. The different media involved in multimedia information systems require special methods for optimal access, indexing, and searching. In addition, as more information sources become available in multimedia systems, the development of abstract models for video, audio, text, and image data becomes very important. It is essential to have a mechanism for multimedia data that can respond to the various types of querying and browsing requested by different user applications. That is why we have written this book – to survey different models and to cover state-of-the-art techniques for multimedia presentations, multimedia database searching, and multimedia browsing, so that readers can understand the issues and challenges of multimedia information systems. An introduction to multimedia information applications, the need for the development of multimedia database management systems
(MDBMSs), and the important issues and challenges of multimedia systems are given in Chapter 1. Several general issues and challenges such as semantic modeling techniques, indexing and searching methods, synchronization and integration modeling, multimedia query support, information retrieval, multimedia browsing, and the like are discussed. Next, the temporal relations, the spatial relations, and the spatio-temporal relations in a multimedia information system are discussed in Chapter 2. In addition, several existing semantic models are surveyed and classified into a set of categories depending on their model properties and functionality. In Chapter 3, multimedia searching issues and related work are introduced. Since multimedia database searching requires semantic modeling and knowledge representation of the multimedia data, several issues related to images, videos, object recognition, motion detection and object tracking, knowledge-based event modeling, and content-based retrieval are discussed. Recent work in the literature is also included. The issues for multimedia browsing and some existing browsing systems are given in Chapter 4. Since digital library applications involve huge amounts of digital video data and must be able to satisfy complex semantic information needs, efficient browsing and searching mechanisms to extract relevant information are required. That is why we surveyed some existing multimedia browsing systems. The last two chapters consist of two case studies – the augmented transition network (ATN) model and the object composition Petri net (OCPN) model. The ways these two semantic models handle multimedia presentation synchronization; the spatial, temporal, and spatio-temporal relationships among multimedia data; and other modeling capabilities are introduced in those chapters.
We organized this book in a way that makes the ideas accessible to readers interested in either grasping the basics or pursuing greater technical depth. We hope that general readers who are interested in the issues, challenges, and ideas underlying the current practice of multimedia presentation, multimedia database searching, and multimedia browsing in multimedia information systems will benefit from reading this book. Also, since considerable research published in prestigious journals and conferences on multimedia information systems is introduced in this book, we believe that university researchers, scientists, industry professionals, software engineers, and graduate and undergraduate students who are interested in doing research on multimedia information systems can also gain a detailed technical understanding of this emerging area.
References
[1] S. Abe, T. Tanamura, and H. Kasahara, “Scene Retrieval Method for Video Database Applications using Temporal Condition Changes,” Proceedings of the International Workshop on Industrial Applications of Machine Intelligence and Vision, Tokyo, IEEE Computer Society Press, pp. 355-359, 1989. [2] D.A. Adjeroh and K.C. Nwosu, “Multimedia database management – Requirements and Issues,” IEEE Multimedia, pp. 24-33, July-September, 1997. [3] P. Aigrain, H-J. Zhang, and D. Petkovic, “Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review,” Multimedia Tools and Applications, vol. 3, pp. 179-202, 1996. [4] J. Akkerhuis, A. Marks, J. Rosenberg, and M.S. Sherman, “Processable Multimedia Document Interchange Using ODA,” Proc. EUUG Autumn Conf., pp. 167-177, Sept. 1989. [5] A.A. Alatan, E. Tuncel, and L. Onural, “A Rule-Based Method for Object Segmentation in Video Sequences,” Proc. Int’l Conf. Image Processing, vol. 2, pp. 522-525, Oct. 1997. [6] S. Al-Hawamdeh, B.C. Ooi, R. Price, T.H. Tng, Y.H. Ang, and L. Hui, “Nearest Neighbour Searching in a Picture Archive System,” Proceedings of the International Conference on Multimedia Information Systems, pp. 17-29, Singapore, McGraw Hill Book Co., 1991. [7] W. Al-Khatib, Y.F. Day, A. Ghafoor, and P.B. Berra, “Semantic Modeling and Knowledge Representation in Multimedia
Databases,” IEEE Trans. on Knowledge and Data Engineering, vol. 11, no. 1, pp. 64-80, January/February 1999. [8] Y.Y. Al-Salqan and C.K. Chang, “Temporal Relations and Synchronization Agents,” IEEE Multimedia, pp. 30-39, Summer 1996. [9] J. Allen, Natural Language Understanding, The Benjamin/Cummings Publishing Company, Inc., 1995.
[10] J. Allen, “Maintaining Knowledge About Temporal Intervals,” Comm. ACM, vol. 26, no. 11, pp. 832-843, 1983. [11] T.L. Anderson, “Modeling Time at the Conceptual Level,” Improving Database Usability and Responsiveness, P. Scheuermann, Ed., New York: Academic, pp. 273-297, 1982.
[12] D.P. Anderson, R. Govindan, and G. Homsy, “Design and Implementation of a Continuous Media I/O Server,” Proceedings of the Conference on Multimedia Issues in Networking and Operating Systems, 1990. [13] F. Arman, R. Depommer, A. Hsu, and M.Y. Chiu, “Content-Based Browsing of Video Sequences,” ACM Multimedia, pp. 97-103, Aug. 1994.
[14] J. Ashford and P. Willett. Text Retrieval and Document Databases, Chartwell-Bratt, Bromley, 1988. [15] A.F. Ates, M. Bilgic, S. Saito, and B. Sarikaya, “Using Timed CSP for Specification Verification and Simulation of Multimedia Synchronization,” IEEE J. Selected Areas in Comm., vol. 14, no. 1, pp. 126-137, Jan. 1996. [16] M. Atkinson, F. Bancilhon, D. DeWitt, K. Dittrich, D. Maier, and S. Zdonik, “The Object-Oriented Database System Manifesto,” Proc. First Int’l Conf. Deductive and Object-Oriented Databases, pp. 40-57, 1989. [17] J.R. Bach, S. Paul, and R. Jain, “A Visual Information Management System for the Interactive Retrieval of Faces,” IEEE Trans. Knowledge and Data Engineering, vol. 5, no. 4, pp. 619-628, 1993. [18] C. Batini, T. Catarci, M. Costabile, and S. Levialdi, “Visual Query Systems: A Taxonomy,” Proceedings of the 2nd Working Conference on Visual Database Systems, Budapest, International Federation for Information Processing, IFIP Working Group 2.6, pp. 159-173, 1991.
[19] R. Bayer and E. McCreight, “Organization and Maintenance of Large Ordered Indices,” Proc. 1970 ACM-SIGFIDENT Workshop on Data Description and Access, Houston, Texas, pp. 107-141, Nov. 1970. [20] P. Berra, C. Chen, A. Ghafoor, T. Little, and S. Shin, “Architecture for Distributed Multimedia Database Systems,” Comput. Commun., 13, pp. 217-231, 1990. [21] E. Bertino and E. Ferrari, “Temporal Synchronization Models for Multimedia Data,” IEEE Trans. on Knowledge and Data Engineering, vol. 10, no. 4, pp. 612-631, July/August 1998. [22] E. Bertino, E. Ferrari, and M. Stolf, “A System for the Specification and Generation of Multimedia Presentations,” Proc. Third Int’l Workshop Multimedia Information Systems, pp. 83-91, Sept. 1997. [23] A.D. Bimbo, P. Pala, and S. Santini, “Image Retrieval by Elastic Matching of Shapes and Image Patterns,” Proc. IEEE Int’l Conf. Multimedia Computing and Systems, pp. 215-218, June 1996. [24] A.D. Bimbo, E. Vicario, and D. Zingoni, “Symbolic Description and Visual Querying of Image Sequences Using Spatio-Temporal Logic,” IEEE Trans. on Software Engineering, vol. 7, no. 4, pp. 609-621, August 1995. [25] E. Binaghi, I. Gagliardi, and R. Schetini, “Indexing and Fuzzy Logic-Based Retrieval of Color Images,” Proceedings of the 2nd Working Conference on Visual Database Systems, Budapest, International Federation for Information Processing, IFIP Working Group 2.6, pp. 84-97, 1991. [26] E. Blair, D. Hutchinson, and D. Shepherd, “Distributed Systems Support for Heterogeneous Multimedia Environments,” Proceedings of the Conference on Multimedia Issues in Networking and Operating Systems, 1990. [27] G. Blakowski, J. Huebel, and U. Langrehr, “Tools for Specifying and Executing Synchronized Multimedia Presentations,” Proc. 2nd Int’l Workshop on Network and Operating System Support for Digital Audio and Video, pp. 271-279, 1991. [28] G. Bordogna, I. Gagliardi, D. Merelli, P. Mussio, M. Padula, and M.
Protti, “Iconic Queries to Pictorial Data,” Proceedings of the
1989 IEEE Workshop on Visual Languages, Rome, pp. 38-42, 1989. [29] C.A. Bouman and B. Liu, “Multiple Resolution Segmentation of Textured Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 2, pp. 99-113, February 1991. [30] M. Buchanan and P. Zellweger, “Automatically Generating Consistent Schedules for Multimedia Documents,” ACM Multimedia Systems Journal, 1(2), Springer-Verlag, pp. 55-67, 1993. [31] M. Buchanan and P. Zellweger, “Specifying Temporal Behavior in Hypermedia Documents,” Proc. ACM Conf. Hypertext, pp. 262-271, 1992. [32] R.H. Campbell and A.N. Habermann, “The Specification of Process Synchronization by Path Expressions,” G. Goos and J. Hartmanis, eds., Operating Systems, Lecture Notes in Computer Science 16, pp. 89-102, Springer-Verlag, 1974. [33] K.S. Candan, B. Prabhakaran, and V.S. Subrahmanian, “CHIMP: A Framework for Supporting Distributed Multimedia Document Authoring and Presentation,” Proc. ACM Multimedia Conf., pp. 320-340, Nov. 1996. [34] T. Catarci, “On the Expressive Power of Graphical Query Languages,” Proceedings of the 2nd Working Conference on Visual Database Systems, Budapest, International Federation for Information Processing, IFIP Working Group 2.6, pp. 404-414, 1991. [35] A. Celentano, M.G. Fugini, and S. Pozzi, “Knowledge-Based Retrieval of Office Documents,” Proc. 13th Int’l Conf. Research and Development in Information Retrieval, pp. 241-254, Sept. 1990. [36] S.F. Chang, W. Chen, H.J. Meng, H. Sundaram, and D. Zhong, “VideoQ: An Automated Content Based Video Search System Using Visual Cues,” Proc. ACM Multimedia, pp. 313-324, 1997. [37] H.J. Chang, T.Y. Hou, S.K. Chang, “The Management and Application of Teleaction Objects,” ACM Multimedia Systems Journal, vol. 3, pp. 228-237, November 1995. [38] N.S. Chang and K.S. Fu, “Query-by-Pictorial Example,” IEEE Transactions on Software Engineering, 14, pp. 681-688, 1988.
[39] S.K. Chang, C.W. Yan, D.C. Dimitroff, and T. Arndt, “An Intelligent Image Database System,” IEEE Trans. on Software Engineering, vol. 14, no. 5, pp. 681-688, May 1988. [40] S.K. Chang, “Iconic Indexing By 2D String,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, no. 4, pp. 413-428, 1984. [41] R. Chellappa, C.L. Wilson, and S. Sirohey, “Human and Machine Recognition of Faces: A Survey,” Proc. IEEE, vol. 83, no. 5, pp. 705-741, May 1995. [42] C.Y. Roger Chen, D.S. Meliksetian, Martin C-S. Chang, L. J. Liu, “Design of a Multimedia Object-Oriented DBMS,” ACM Multimedia Systems Journal, vol. 3, pp. 217-227, November 1995. [43] S-C. Chen and R. L. Kashyap, “Temporal and Spatial Semantic Models for Multimedia Presentations,” in 1997 International Symposium on Multimedia Information Processing, Dec. 11-13, 1997, pp. 441-446. [44] S-C. Chen and R. L. Kashyap, “A Spatio-Temporal Semantic Model for Multimedia Presentations and Multimedia Database Systems,” IEEE Trans. on Knowledge and Data Engineering, accepted for publication. [45] S-C. Chen, S. Sista, M-L. Shyu, and R.L. Kashyap, “Augmented Transition Networks as Video Browsing Models for Multimedia Databases and Multimedia Information Systems,” the 11th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’99), pp. 175-182, 1999. [46] J.Y. Chen, C. Taskiran, E.J. Delp, and C.A. Bouman, “ViBE: A New Paradigm for Video Database Browsing and Search,” Proc. IEEE Workshop Content-Based Access of Image and Video Libraries, pp. 96-100, 1998. [47] M.G. Christel and D.J. Martin, “Information Visualization Within a Digital Video Library,” J. Intelligent Info. Systems, 11(3), pp. 235-257, 1998. [48] M.G. Christel and A.M. Olligschlaeger, “Interactive Maps for a Digital Video Library,” IEEE Int’l Conf. on Multimedia Computing and Systems, 1999. [49] C. Colombo, A.D. Bimbo, and I. Genovesi, “Interactive Image Retrieval by Color Distributions,” IEEE Multimedia Systems, pp. 255-258, 1998.
[50] D. Comer, “The Ubiquitous B-tree,” Computing Surveys, 11:2, pp. 121-138, June 1979. [51] J.E. Coolahan Jr. and N. Roussopoulos, “Timing Requirements for Time-Driven Systems Using Augmented Petri Nets,” IEEE Trans. Software Engineering, vol. SE-9, pp. 603-616, Sept. 1983. [52] J.M. Corridoni, A.D. Bimbo, S. De Magistris, and E. Vicario, “A Visual Language for Color-Based Painting Retrieval,” Proc. Int’l Symp. Visual Languages, pp. 68-75, 1996. [53] G. Costagliola, M. Tucci, and S.K. Chang, “Representing and Retrieving Symbolic Pictures by Spatial Relations,” Visual Database Systems, vol. II, E. Knuth and L.M. Wegner, eds., Elsevier, pp. 55-65, 1991. [54] J. D. Courtney, “Automatic Video Indexing via Object Motion Analysis,” Pattern Recognition, vol. 30, no. 4, pp. 607-625, 1997. [55] J. Cove and B. Walsh, “Online Text Retrieval via Browsing,” Information Processing and Management, 24(1), 1988. [56] I.F. Cruz and W.T. Lucas, “A Visual Approach to Multimedia Querying and Presentation,” Proc. ACM Multimedia, pp. 109-120, 1997. [57] J. Davies, D.M. Jackson, J.N. Reed, G.M. Reed, A.W. Roscoe, and S.A. Schneider, “Timed CSP: Theory and Practice,” J.W. De Bakker, C. Huizing, W.P. De Roever, and G. Rosenberg, eds., Real-Time: Theory and Practice, Lecture Notes in Computer Science 600, pp. 640-675, Springer-Verlag, 1991. [58] Y.F. Day, S. Dagtas, M. Iino, A. Khokhar, and A. Ghafoor, “Object-Oriented Conceptual Modeling of Video Data,” IEEE 11th International Conference on Data Engineering, Taipei, Taiwan, pp. 401-408, 1995. [59] Y.F. Day, S. Dagtas, M. Iino, A. Khokhar, and A. Ghafoor, “Spatio-Temporal Modeling of Video Data for On-line Object-Oriented Query Processing,” Proc. Int’l Conf. Multimedia Computing and Systems, pp. 98-105, May 1995. [60] M. J. Egenhofer, “Query Processing in Spatial-Query-Sketch,” J. Visual Languages and Computing, vol. 8, no. 4, pp. 403-424, 1997.
[61] R. Erfle, “Specification of Temporal Constraints in Multimedia Documents Using HyTime,” Electronic Publishing, vol. 6, no. 4, pp. 397-411, 1993. [62] M.L. Escobar-Molano and S. Ghandeharizadeh, “A Framework for Conceptualizing Structured Video,” Proc. First Int’l Workshop Multimedia Information Systems, pp. 95-110, Sept. 1995. [63] B. Falchuk and K. Karmouch, “A multimedia news delivery system over an ATM network,” in International Conference on Multimedia Computing and Systems, 1995, pp. 56-63. [64] A.M. Ferman, B. Gunsel and A.M. Tekalp, “Object Based Indexing of MPEG-4 Compressed Video,” Proc. SPIE: VCIP, pp. 953-963, vol. 3024, San Jose, USA, February 1997. [65] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, “Query by Image and Video Content: The QBIC System,” IEEE Computer, vol. 28, no. 9, pp. 23-31, September 1995. [66] J. Foote, J. Boreczky, A. Girgensohn, and L. Wilcox, “An Intelligent Media Browser using Automatic Multimodal Analysis,” ACM Multimedia, 1998. [67] K. Fujikawa, S. Shimojo, T. Matsuura, S. Nishio, and H. Miyahara, “Multimedia Presentation System ‘Harmony’ with Temporal and Active Media,” Proc. Usenix Conf., 1991. [68] A. Ghafoor and P. Bruce Berra, “Multimedia Database Systems,” Lecture Notes in Computer Science, (Advanced Database Systems, Eds. B. Bhargava and N. Adams), vol. 759, pp. 397-411, Springer-Verlag Publisher, 1993. [69] A. Ghafoor, “Multimedia Database Management Systems,” ACM Computing Survey, vol. 27, no. 4, pp. 593-598, December 1995. [70] A. Ghafoor, “Special Issue on Multimedia Database Systems,” guest editor, ACM Multimedia Systems, vol. 3, pp. 179-181, November 1995. [71] A. Ghafoor and Y.F. Day, “Object-Oriented Modeling and Querying of Multimedia Data,” Proc. First Int’l Workshop Multimedia Information Systems, pp. 111-119, Sept. 1995.
[72] S. Gibbs, “Composite Multimedia and Active Objects,” Proc. Int’l Conf. Object-Oriented Programming: Systems, Languages, and Applications, Oct. 1991. [73] S. Gibbs, C. Breiteneder, and D. Tsichritzis, “Audio/Video Databases: An Object-Oriented Approach,” Proc. Ninth Int’l Conf. Data Eng., pp. 381-390, 1993. [74] S. Gibbs, C. Breiteneder, and D. Tsichritzis, “Data Modeling of Time-Based Media,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 91-102, 1994. [75] F. Golshani and N. Dimitrova, “Retrieval and Delivery of Information in Multimedia Database Systems,” Information and Software Technology, vol. 36, no. 4, pp. 235-242, May 1994. [76] Y. Gong, H. Zhang, H.C. Chuan, and M. Sakauchi, “An Image Database System with Content Capturing and Fast Image Indexing Abilities,” Proc. Int’l Conf. Multimedia Computing and Systems, pp. 121-130, May 1994. [77] W.I. Grosky and R. Mehrotra, “Index-Based Object Recognition in Pictorial Data Management,” Computer Vision Graph Image Processing, 52, pp. 416-436, 1990. [78] V.N. Gudivada and G.S. Jung, “An Algorithm for Content-Based Retrieval in Multimedia Databases,” Proc. Int’l Conf. Multimedia Computing and Systems, pp. 193-200, 1996. [79] A. Gupta, T. Weymouth, and R. Jain, “Semantic Queries with Pictures: the VIMSYS Model,” Proc. 17th Int’l Conf. Very Large Databases, pp. 69-79, Barcelona, September 1991. [80] A. Guttman, “R-tree: A Dynamic Index Structure for Spatial Search,” in Proc. ACM SIGMOD, pp. 47-57, June 1984. [81] V. Haarslev and M. Wessel, “Querying GIS With Animated Spatial Sketches,” Proc. Int’l Symp. Visual Languages, pp. 201-208, Sept. 1997. [82] T. Hamano, “A Similarity Retrieval Method for Image Databases using Simple Graphics,” The 1988 IEEE Workshop on Languages for Automation: Symbiotic and Intelligent Robots, IEEE Computer Society Press, pp. 149-154, College Park, MD, 1988.
[83] C.L. Hamblin, “Instants and Intervals,” Proc. 1st Conf. Int’l Society for the Study of Time, J.T. Fraser et al., Eds., New York: Springer-Verlag, pp. 324-331, 1972. [84] D. Harman, “Relevance Feedback Revisited,” Proceedings of the 15th ACM SIGIR, pp. 1-10, Copenhagen, ACM Press, New York, 1992. [85] S.A. Hawamdeh, B.C. Ooi, R. Price, T.H. Tng, Y.H. Ang, and L. Hui, “Nearest Neighbour Searching in a Picture Archive System,” Proc. Int’l Conf. Multimedia Information Systems, pp. 17-33, 1991. [86] T. Helbig and O. Schreyer, “Protocol for browsing in Continuous Data for Cooperative Multi-Server and Multi-Client Applications,” in T. Plagemann and V. Goebel, eds., IDMS, Springer LNCS, 1998. [87] R.G. Herrtwich and L. Delgrossi, “ODA-Based Data Modeling in Multimedia Systems,” Technical Report TR 90-043, Int’l Computer Science Inst., Berkeley, CA., 1990. [88] N. Hirzalla, B. Falchuk, and A. Karmouch, “A Temporal Model for Interactive Multimedia Scenarios,” IEEE Multimedia, pp. 24-31, Fall 1995. [89] S. Hollfelder, A. Everts, and U. Thiel, “Concept-Based Browsing in Video Libraries,” Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, 1998. [90] S. Hollfelder, A. Everts, and U. Thiel, “Designing for Semantic Access: A Video Browsing System,” IEEE Int’l Conf. on Multimedia Computing and Systems, 1999. [91] M.A. Holliday and M.K. Vernon, “A Generalized Timed Petri Net Model for Performance Analysis,” Proc. Int’l Conf. Time Petri Nets, pp. 181-190, 1985. [92] P. Hoepner, “Synchronizing the Presentation of Multimedia Objects – ODA Extensions,” Proc. First Eurographics Workshop: Multimedia Systems, Interaction, and Applications, pp. 87-100, Apr. 1991. [93] B.K.P. Horn and B.G. Schunck, “Determining Optical Flow,” Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[94] C.C. Hsu, W.W. Chu, and R.K. Taira, “A Knowledge-Based Approach for Retrieving Images by Content,” IEEE Trans. Knowledge and Data Engineering, vol. 8, no. 4, pp. 522-532, 1993. [95] M. Iino, Y.F. Day, and A. Ghafoor, “An Object-Oriented Model for Spatio-Temporal Synchronization of Multimedia Information,” Proc. Int’l Conf. Multimedia Computing and Systems, pp. 110-119, May 1994. [96] H. Ishikawa, K. Kubota, Y. Noguchi, K. Kato, M. Ono, N. Yoshizawa, and A. Kanaya, “A Document Warehouse: A Multimedia Database Approach,” in R.R. Wagner, editor, DEXA, 1998. [97] S.S. Iyengar and R.L. Kashyap, “Guest Editor’s Introduction: Image Databases,” IEEE Transactions on Software Engineering, 14, pp. 608-610, 1988. [98] A.K. Jain, Y. Zhong, and S. Lakshmanan, “Object Matching Using Deformable Templates,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 3, pp. 267-278, Mar. 1996. [99] H. Jiang and A.K. Elmagarmid, “WVTDB – A Semantic Content-Based Video Database System on the World Wide Web,” IEEE Transactions on Knowledge and Data Engineering, vol. 10, no. 6, pp. 947-966, November/December 1998. [100] J.M. Jolion, P. Meer, and S. Bataouche, “Robust Clustering with Applications in Computer Vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 791-802, August 1991. [101] T. Joseph and A.F. Cardenas, “PICQUERY: A High-Level Query Language for Pictorial Database Management,” IEEE Transactions on Software Engineering, 14, pp. 630-638, 1988. [102] A. Karmouch and J. Emery, “A Playback Schedule Model for Multimedia Documents,” IEEE Multimedia, vol. 3, pp. 50-61, 1996. [103] R. Kasturi and R. Jain, “Dynamic Vision,” R. Kasturi and R. Jain, eds., Computer Vision, pp. 469-480, IEEE CS Press, 1991. [104] T. Kato, T. Kurita, H. Shimogaki, T. Mizutori, and K. Fujimura, “A Cognitive Approach to Visual Interaction,” Proceedings of the International Conference on Multimedia Information Systems, pp.
109-120, Singapore, McGraw Hill Book Co., 1991.
References
[105] R. Kimberley, Text Retrieval – A Directory of Software, 3rd edition, Gower Publishing Company, 1990.
[106] W. Klas, E.J. Neuhold, and M. Schrefl, “Visual Databases Need Data Models for Multimedia Data,” in T. Kunii, ed., Visual Database Systems, pp. 433-462, North Holland, New York, 1989.
[107] S.C. Kleene, “Representation of Events in Nerve Nets and Finite Automata,” Automata Studies, Princeton University Press, Princeton, N.J., pp. 3-41, 1956.
[108] A. Klinger and A. Pizano, “Visual Structure and Databases,” Visual Database Systems, T.L. Kunii, ed., pp. 3-25, 1989.
[109] A. Komlodi and L. Slaughter, “Visual Video Browsing Interfaces Using Key Frames,” Proceedings of the CHI 98 Conference Summary: Human Factors in Computing Systems, pp. 337-338, 1998.
[110] F. Kubala, R. Schwartz, R. Stone, and R. Weischedel, “Named Entity Extraction from Speech,” Proc. DARPA Workshop on Broadcast News Understanding Systems, Feb. 1998.
[111] T.L. Kunii, Y. Shinagawa, R.M. Paul, M.F. Khan, and A.A. Khokhar, “Issues in Storage and Retrieval of Multimedia Data,” ACM Multimedia Systems, vol. 3, pp. 298-304, November 1995.
[112] S. Lakshmanan and H. Derin, “Simultaneous Parameter Estimation and Segmentation of Gibbs Random Fields Using Simulated Annealing,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 8, pp. 799-813, 1989.
[113] M.K. Leong, S. Sam, and A.D. Narasimhalu, “Towards a Visual Language for an Object-Oriented Multimedia Database,” International Federation for Information Processing (IFIP) TC-2 Working Conference on Visual Databases, pp. 465-496, 1989.
[114] L. Li, A. Karmouch, and N.D. Georganas, “Multimedia Teleorchestra with Independent Sources: Part 2 – Synchronization Algorithms,” ACM/Springer Multimedia Systems, vol. 1, no. 4, pp. 154-165, 1994.
[115] E.B.W. Lieutenant and J.R. Driscoll, “Incorporating a Semantic Analysis into a Document Retrieval Strategy,” Proc. ACM/SIGIR Conf. Research and Development in Information Retrieval, pp. 270-279, Oct. 1991.
[116] J.H. Lim, H.H. Teh, H.C. Lui, and P.Z. Wang, “Stochastic Topology with Elastic Matching for Off-Line Handwritten Character Recognition,” Pattern Recognition Letters, vol. 17, no. 2, pp. 149-154, Feb. 1996.
[117] C.C. Lin, J.X., and S.K. Chang, “Transformation and Exchange of Multimedia Objects in Distributed Multimedia Systems,” ACM Multimedia Systems Journal, vol. 4, pp. 12-29, February 1996.
[118] T.D.C. Little and A. Ghafoor, “Synchronization and Storage Models for Multimedia Objects,” IEEE J. Selected Areas in Commun., vol. 9, pp. 413-427, Apr. 1990.
[119] T.D.C. Little and A. Ghafoor, “Scheduling of Bandwidth-Constrained Multimedia Traffic,” Computer Commun., vol. 15, pp. 381-387, July/Aug. 1992.
[120] T.D.C. Little and A. Ghafoor, “Interval-Based Conceptual Models for Time-Dependent Multimedia Data,” IEEE Trans. on Knowledge and Data Engineering, vol. 5, no. 4, pp. 551-563, Aug. 1993.
[121] T.D.C. Little, G. Ahanger, R.J. Folz, J.F. Gibbon, F.W. Reeve, D.H. Schelleng, and D. Venkatesh, “A Digital On-Demand Video Service Supporting Content-Based Queries,” Proc. ACM Multimedia, pp. 427-436, 1993.
[122] T.D.C. Little, G. Ahanger, H-J. Chen, R.J. Folz, J.F. Gibbon, A. Krishnamurthy, P. Lumbda, M. Ramanathan, and D. Venkatesh, “Selection and Dissemination of Digital Video via the Virtual Video Browser,” Multimedia Tools and Applications, 1(2), 1995.
[123] Z.Q. Liu and J.P. Sun, “Structured Image Retrieval,” J. Visual Languages and Computing, vol. 8, no. 3, pp. 333-357, 1997.
[124] Y. Masunaga, “Design Issues of OMEGA – An Object-Oriented Multimedia Database Management System,” J. Information Processing, vol. 14, pp. 60-74, 1991.
[125] C. Meghini, F. Rabitti, and C. Thanos, “Conceptual Modeling of Multimedia Documents,” IEEE Computer, 24, pp. 23-30, 1991.
[126] J. Meng, Y. Juan, and S.F. Chang, “Scene Change Detection in a MPEG Compressed Video Sequence,” Proc. SPIE, vol. 2419, pp. 14-25, 1995.
[127] J. Meng and S.F. Chang, “CVEPS: A Compressed Video Editing and Parsing System,” Proceedings of ACM Multimedia ’96, p. 43, ACM Press, 1996.
[128] T. Meyer, W. Effelsberg, and R. Steinmetz, “A Taxonomy on Multimedia Synchronization,” Proc. Fourth Int’l Workshop on Future Trends in Distributed Computing Systems, 1993.
[129] B. Meyer, “Pictorial Deduction in Spatial Information Systems,” Proc. Int’l Symp. Visual Languages, pp. 23-30, 1994.
[130] M. Mills, J. Cohen, and Y.Y. Wong, “A Magnifier Tool for Video Data,” Proc. ACM Computer Human Interface (CHI), pp. 93-98, May 1992.
[131] S. Miyamoto, “Two Approaches for Information Retrieval Through Fuzzy Associations,” IEEE Transactions on Systems, Man and Cybernetics, 19, pp. 123-130, 1989.
[132] J. Motiwalla, A.D. Narasimhalu, and S. Christodoulakis, eds., Proceedings of the International Conference on Multimedia Information Systems, Singapore, McGraw Hill Book Co., 1991.
[133] Multimedia Office Filing – The MULTOS Approach, C. Thanos, ed., North-Holland, 1990.
[134] A. Nagasaka and Y. Tanaka, “Automatic Video Indexing and Full Video Search for Object Appearances,” Proc. Second Working Conf. Visual Database Systems, pp. 119-133, IFIP WG 2.6, Oct. 1991.
[135] A.D. Narasimhalu, “Multimedia Databases,” Multimedia Systems, 4: 226-249, 1996.
[136] A.D. Narasimhalu, “A Framework for the Integration of Expert Systems with Multimedia Technologies,” Expert Syst. Appl., 7: 3, pp. 427-439, 1994.
[137] S.R. Newcomb, N.A. Kipp, and V.T. Newcomb, “HyTime – Hypermedia/Time-Based Document Structuring Language,” Comm. ACM, vol. 34, no. 11, pp. 67-83, Nov. 1991.
[138] Office Document Architecture (ODA): An Interchange Format, no. 8613, ISO, 1986.
[139] A. Ono, M. Amano, M. Hakaridani, T. Satou, and M. Sakauchi, “A Flexible Content-Based Image Retrieval System with Combined Scene Description Keyword,” Proc. Int’l Conf. Multimedia Computing and Systems, pp. 201-208, 1996.
[140] E. Oomoto and K. Tanaka, “OVID: Design and Implementation of a Video Object Database System,” IEEE Trans. on Knowledge and Data Engineering, vol. 5, no. 4, pp. 629-643, August 1993.
[141] M.T. Özsu, D. Duane, G. El-Medani, and C. Vittal, “An Object-Oriented Multimedia Database System for a News-on-Demand Application,” ACM Multimedia Systems Journal, vol. 3, pp. 182-203, November 1995.
[142] J.L. Peterson, “Petri Nets,” ACM Comput. Surveys, vol. 9, pp. 223-252, Sept. 1977.
[143] A. Poggio, J. Garcia Luna Aceves, E.J. Craighill, D. Worthington, and J. Hight, “CCWS: A Computer-Based Multimedia Information System,” Computer, vol. 18, no. 10, pp. 92-103, Oct. 1985.
[144] D. Ponceleon, S. Srinivasan, A. Amir, D. Petkovic, and D. Diklic, “Key to Effective Video Retrieval: Effective Cataloging and Browsing,” Proceedings of the 6th ACM Int’l Conf. on Multimedia, pp. 99-107, 1998.
[145] R. Price, M.K. Leong, N. Karta, and A.D. Narasimhalu, “Experiences in the Implementation of a Multimedia DBMS,” ISS Working Paper WP 89-12-0, Institute of Systems Science, National University of Singapore, Singapore, 1989.
[146] R. Rabitti and P. Stanchev, “GRIMDBMS: A GRaphical IMage Database Management System,” in T. Kunii, ed., Visual Database Systems, pp. 415-430, North Holland, New York, 1989.
[147] S. Ravela, R. Manmatha, and E.M. Riseman, “Image Retrieval Using Scale-Space Matching,” Proc. Fourth European Conf. Computer Vision, pp. 273-282, 1996.
[148] R.R. Razouk and C.V. Phelps, “Performance Analysis Using Timed Petri Nets,” Proc. 1984 Int’l Conf. Parallel Processing, pp. 126-129, 1984.
[149] N. Reddy, “Improving Latency in Interactive Video Server,” SPIE MMCN, 1997.
[150] C.J. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
[151] N. Roussopoulos, C. Faloutsos, and T. Sellis, “An Efficient Pictorial Database System for PSQL,” IEEE Transactions on Software Engineering, 14, pp. 639-650, 1988.
[152] T. Sakai, M. Nagao, and S. Fujibayashi, “Line Extraction and Pattern Detection in a Photograph,” Pattern Recognition, vol. 1, no. 3, pp. 233-248, Mar. 1969.
[153] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[154] M. Schneider and T. Trepied, “Extensions for the Graphical Query Language CANDID,” Proceedings of the 2nd Working Conference on Visual Database Systems, Budapest, International Federation for Information Processing, IFIP Working Group 2.6, pp. 189-203, 1991.
[155] S. Sista and R.L. Kashyap, “Bayesian Estimation for Multiscale Image Segmentation,” IEEE International Conference on Acoustics, Speech, and Signal Processing, March 1999.
[156] S. Sista and R.L. Kashyap, “Unsupervised Video Segmentation and Object Tracking,” in ICIP’99, Japan, 1999.
[157] S. Sista, Image and Video Segmentation Using Unsupervised Classification in a Bayesian Setup, Ph.D. thesis, Purdue University, May 1999.
[158] A.P. Sistla, C. Yu, and R. Haddad, “Reasoning About Spatial Relationships in Picture Retrieval Systems,” Proc. Int’l Conf. Very Large Database, pp. 570-581, Sept. 1994.
[159] A.P. Sistla, C. Yu, C. Liu, and K. Liu, “Similarity Based Retrieval of Pictures Using Indices on Spatial Relationships,” Proc. Int’l Conf. Very Large Database, pp. 619-629, Sept. 1995.
[160] S.W. Smoliar and H.J. Zhang, “Content-Based Video Indexing and Retrieval,” IEEE Multimedia, pp. 62-72, Summer 1994.
[161] S. Srinivasan, D. Ponceleon, A. Amir, and D. Petkovic, “What Is in the Video Anyway? In Search of Better Browsing,” IEEE Int’l Conference on Multimedia Computing and Systems, pp. 388-393, 1999.
[162] R. Steinmetz, “Synchronization Properties in Multimedia Systems,” IEEE J. Selected Areas in Commun., vol. 8, no. 3, pp. 401-412, 1990.
[163] R. Steinmetz and T. Meyer, “Multimedia Synchronization Techniques: Experiences Based on Different System Structures,” Proc. Fourth IEEE Computer Society Int’l Workshop Multimedia Comm., pp. 306-314, 1992.
[164] S.Y.W. Su, S.J. Hyun, and H.-H.M. Chen, “Temporal Association Algebra: A Mathematical Foundation for Processing Object-Oriented Temporal Databases,” IEEE Trans. on Knowledge and Data Engineering, vol. 10, no. 3, pp. 389-408, May/June 1998.
[165] M. Tanaka and T. Ichikawa, “A Visual User Interface for Map Information Retrieval Based on Semantic Significance,” IEEE Transactions on Software Engineering, 14, pp. 666-670, 1988.
[166] Y. Theodoridis, M. Vazirgiannis, and T. Sellis, “Spatio-Temporal Indexing for Large Multimedia Applications,” Proc. Int’l Conf. Multimedia Computing and Systems, pp. 441-448, 1996.
[167] H. Thimm and W. Klas, “d-Sets for Optimized Relative Adaptive Playout Management in Distributed Multimedia Database Systems,” IEEE 12th International Conference on Data Engineering, New Orleans, Louisiana, pp. 584-592, 1996.
[168] D. Toman, “Point vs. Interval-Based Query Languages for Temporal Databases,” Proc. Fifth ACM SIGACT/MOD/ART Symp. Principles of Database Systems, pp. 58-67, 1996.
[169] K. Tsuda, K. Yamamoto, M. Hirakawa, and T. Ichikawa, “MORE: An Object-Oriented Data Model with a Facility for Changing Object Structures,” IEEE Transactions on Knowledge and Data Engineering, vol. 3, no. 4, pp. 444-460, 1991.
[170] A. Tversky, “Features of Similarity,” Psychol. Rev., 84, pp. 327-354, 1977.
[171] H. Wactlar, M. Christel, Y. Gong, and A. Hauptmann, “Lessons Learned from Building a Terabyte Digital Video Library,” IEEE Computer, vol. 32, no. 2, 1999.
[172] T. Wahl and K. Rothermel, “Representing Time in Multimedia Systems,” Proc. Int’l Conf. on Multimedia Computing and Systems, CS Press, Los Alamitos, Calif., pp. 538-543, 1994.
[173] R. Weiss, A. Duda, and D.K. Gifford, “Composition and Search with a Video Algebra,” IEEE Multimedia, vol. 2, no. 1, pp. 12-25, Spring 1995.
[174] R. Weiss, A. Duda, and D.K. Gifford, “Content-Based Access to Algebraic Video,” Proc. IEEE Int’l Conf. Multimedia Computing and Systems, pp. 140-151, May 1994.
[175] K.H. Weiss, “Formal Specification and Continuous Media,” Proc. First Int’l Workshop Network and Operating System Support for Digital Audio and Video, pp. 123-127, Nov. 1990.
[176] D. Woelk, W. Kim, and W. Luther, “An Object-Oriented Approach to Multimedia Databases,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 311-325, May 1986.
[177] D. Woelk and W. Kim, “Multimedia Information Management in an Object-Oriented Database System,” Proceedings of the 13th VLDB Conference, pp. 319-329, Brighton, 1987.
[178] W. Woods, “Transition Network Grammars for Natural Language Analysis,” Comm. ACM, 13, pp. 591-602, October 1970.
[179] X. Wu and T. Ichikawa, “KDA: A Knowledge-Based Database Assistant with a Query Guiding Facility,” IEEE Trans. Knowledge and Data Engineering, vol. 4, no. 5, pp. 443-453, 1994.
[180] H. Wynne, T.S. Chua, and H.K. Pung, “An Integrated Color-Spatial Approach to Content-Based Image Retrieval,” Proc. ACM Multimedia, pp. 305-313, 1995.
[181] B-L. Yeo and M.M. Yeung, “Retrieving and Visualizing Video,” Comm. of the ACM, vol. 40, no. 12, pp. 43-52, December 1997.
[182] M.M. Yeung, B-L. Yeo, W. Wolf, and B. Liu, “Video Browsing Using Clustering and Scene Transitions on Compressed Sequences,” Proc. IS&T/SPIE Multimedia Computing and Networking, 1995.
[183] A. Yoshitaka and T. Ichikawa, “A Survey on Content-Based Retrieval for Multimedia Databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 1, pp. 81-93, January/February 1999.
[184] A. Yoshitaka, T. Ishii, M. Hirakawa, and T. Ichikawa, “Content-Based Retrieval of Video Data by the Grammar of the Film,” Proc. Int’l Symp. Visual Languages, pp. 314-321, Sept. 1997.
[185] A. Yoshitaka, Y. Hosoda, M. Yoshimitsu, M. Hirakawa, and T. Ichikawa, “VIOLONE: Video Retrieval by Motion Examples,” J. Visual Languages and Computing, vol. 7, no. 4, pp. 423-443, 1996.
[186] A. Yoshitaka, S. Kishida, M. Hirakawa, and T. Ichikawa, “Knowledge-Assisted Content-Based Retrieval for Multimedia Databases,” IEEE Multimedia, vol. 1, no. 4, pp. 12-21, 1994.
[187] C. Yu, W. Sun, D. Bitton, Q. Yang, R. Bruno, and J. Tullis, “Efficient Placement of Audio on Optical Disks for Real-Time Applications,” Commun. ACM, 32, pp. 862-871, 1989.
[188] L.A. Zadeh, “The Role of Fuzzy Logic in the Management of Uncertainty in Expert Systems,” Fuzzy Sets Sys., 11, pp. 199-227, 1983.
[189] H.J. Zhang, A. Kankanhalli, and S.W. Smoliar, “Automatic Partitioning of Full-Motion Video,” Multimedia Systems, vol. 1, no. 1, pp. 10-28, 1993.
[190] H.J. Zhang, C.Y. Low, Y. Gong, and S.W. Smoliar, “Video Parsing Using Compressed Data,” Proc. SPIE, vol. 2182, pp. 142-149, 1994.
[191] M.M. Zloof, “QBE/OBE: A Language for Office and Business Automation,” Computer, vol. 14, no. 5, pp. 13-22, 1981.
[192] W.M. Zuberek, “Performance Evaluation Using Extended Timed Petri Nets,” Proc. Int’l Conf. Timed Petri Nets, pp. 272-278, 1985.
[193] A.K. Elmagarmid, H. Jiang, et al., Video Database Systems: Issues, Products, and Applications, Kluwer, 1997.
Index
ADA, 37
AMS, 31
Active multimedia system, 31
Artificial intelligence (AI), 61
Asynchronous event, 28, 112
Augmented transition network, 23
ATN, 23, 25, 39, 42, 73, 76–77, 81–82, 84–85, 90
  condition and action table, 91–92
  input symbol, 40, 42, 78, 82–85, 87, 90
  multimedia input string, 41, 78, 86, 88–90, 94, 96, 106
  relative position, 88, 94
  subnetwork, 41, 73, 77, 82–86, 90, 93, 96, 106
  user interaction, 103
  user loop, 103
B-tree, 22
BLOB, 2
Binary large object, 2
Block, 36, 48, 50, 54
CAI, 4, 108
CCITT, 32
CSP, 37–38
Communicating sequential processing, 37
Computer vision, 47, 61, 72, 75
Computer-aided instruction, 108
Computer-assisted instruction, 4
Content-based retrieval, 59, 61
  data model aspect, 62
  query schema aspect, 67
  semantic knowledge aspect, 64
  spatio-temporal aspect, 63
DASH, 33
DBMS, 1–2, 6–9, 11, 42, 97
Database management system, 1, 42, 63, 97
ECMA, 32
Event, 7, 20, 22, 26–28, 34, 43, 49, 55–56, 58–59, 79, 81
Expectation maximization (EM), 48
Explorative browsing, 71
FSM, 83–85
Feature classification, 16
Feature extraction, 16, 53
Feature value normalization, 16
Finite state machine, 83
Firefly, 25, 28
Frame, 8, 21, 41, 44, 49–50, 53, 55–56, 72, 76, 79
GIS, 9, 21, 63
Generalized expectation maximization (GEM), 48
Geographic information system, 9
Geographical information system, 21, 63
Hidden Markov model (HMM), 74
HyTime, 25, 39
ISO, 32
Iconic-based browsing, 55
  directed graph, 56
  similarity pyramid, 56
Image segmentation, 44
  clustering-based, 47
  content description, 48
  histogram-based, 47
  image segmentation, 45, 47
  partition, 48
  region-growing, 47
  split-and-merge, 47
  stochastic-model-based, 47–48
Information visualization, 75
JPEG, 45, 76
Key frame, 8, 55–56, 72–73, 78–79
Knowledge-based event modeling, 56
  algebraic modeling, 58
  semantic-based clustering, 57
  spatio-temporal logic, 58
  temporal interval-based, 57
LOTOS, 25, 39
MBR, 22, 88
MCS, 31
MDBMS, 2–3, 6–12, 14–16
MDS, 31
MPEG, 45, 54–55, 76
Maximum a posteriori probability (MAP), 48
Maximum likelihood (ML), 48
Minimal bounding rectangle, 22, 88, 94
Minimum enclosing rectangle, 10
Motion detection, 51
  feature matching, 52–53
  motion detection, 52
  optical flow, 52
Motion tracking, 51
  motion tracking, 54
  object tracking, 55
Multimedia applications, 4
Multimedia browsing, 42
  ATN model, 76
  CueVideo system, 72, 74
  Informedia system, 72–73
  browsing system, 72–74, 76
  multimedia browsing, 42, 68, 98
  video browsing, 71
Multimedia communication schema, 31
Multimedia data schema, 31
Multimedia database management system, 2, 6
Multimedia database searching, 43, 92
Multimedia presentation, 20, 26, 33, 90
Multimedia semantic model, 19
  ATN model, 23, 39
  Petri-net model, 23, 29
  graphic model, 23, 28
  language-based model, 23, 36
  object-oriented model, 23, 31
  time-interval based model, 23, 27
  timeline model, 23–24
Natural language processing, 72
ODA, 32–35
OEM, 25, 31
OMEGA, 25, 36
OVID, 21, 25, 27
Object composition Petri net, 11
OCPN, 11, 25, 29–30, 111, 118, 124
  binary temporal relation, 117
  n-ary temporal relation, 119
  reverse binary temporal relation, 121
  reverse n-ary temporal relation, 121
Object recognition, 56, 63
Object-oriented model, 2, 7–8, 11, 23, 31–32, 35–36, 62
PNBH, 31
Pattern recognition, 47
Petri net, 111
  Timed Petri net, 111
  marked Petri net, 112
Petri-Net-Based-Hypertext, 31
PicQuery, 7, 12
QBE, 67–68
QBIC, 64
QoS, 10, 20, 30, 83, 85, 91–92, 109
Quadtree, 9, 22
Quality of service, 10, 20, 30, 83, 85, 91–92, 109
Query-By-Example, 67
R-tree, 9, 12, 22–23, 88
RTN, 41, 84
Recursive transition network, 41, 84
Regular expression, 86
SPCPE, 56
SQL, 11, 42, 94
STG, 25, 29
Scene transition graphs, 29
Scene, 7, 49, 57–59, 66, 73, 76, 79
Semantic object, 6, 20, 22, 41, 73, 88, 94
Serendipity browsing, 71
Shot, 16–17, 29, 49, 55, 71–73, 75–76, 79
Simultaneous partition and class parameter estimation, 56
Spatial relations, 6, 21
Spatial requirement, 6
Spatio-temporal relations, 2, 6, 22, 41
Speech recognition, 72, 74–75
Storyboard, 72, 74, 76
Synchronous event, 28
TAO, 25, 31
TCSP, 25, 37
Temporal relations, 6, 20, 113, 119
  interval-based, 20–21, 115
  point-based, 20–21, 115
Temporal requirement, 6
Timed object, 33
Timeline model, 23, 26, 36, 103, 117
Timeline tree, 26
Type Abstraction Hierarchy (TAH), 66
VOD, 5, 10, 42, 69
VSDG, 25, 54, 58, 90
Video hierarchy, 49, 76
Video segmentation, 49
  DCT, 50–51
  color histogram, 50–51
  likelihood ratio, 50
  object tracking, 55
  pixel-level comparison, 50
  scene change detection, 50
  video parsing, 49
  video segmentation, 49, 56
Video-on-demand, 5, 11, 42, 69
VideoQ, 16, 64, 72
About the Authors
Dr. Shu-Ching Chen received his Ph.D. from the School of Electrical and Computer Engineering at Purdue University, West Lafayette, Indiana, USA in 1998. He also received master’s degrees in Computer Science, Electrical Engineering, and Civil Engineering from Purdue University, West Lafayette, IN. He has been an Assistant Professor in the School of Computer Science, Florida International University (FIU) since August 1999. Before joining FIU, he worked as an R&D software engineer at Micro Data Base Systems (MDBS), Inc., IN, USA. His main research interests include distributed multimedia database systems and information systems, information retrieval, object-oriented database systems, data warehousing, data mining, and distributed computing environments for intelligent transportation systems (ITS). He is the program co-chair of the 2nd International Conference on Information Reuse and Integration (IRI-2000). He is a member of the IEEE Computer Society, ACM, and ITE.

Dr. R.L. Kashyap received his Ph.D. in 1966 from Harvard University, Cambridge, Mass. He joined the staff of Purdue University in 1966, where he is currently a Professor of Electrical and Computer Engineering and the Associate Director of the National Science Foundation-supported Engineering Research Center for Intelligent Manufacturing Systems at Purdue. He is currently working on research projects supported by the Office of Naval Research, the Army Research Office, NSF, and several companies such as Cummins Engines. He has directed more than 40 Ph.D. dissertations at Purdue. He has authored one book and more than 300 publications, including 120 archival journal papers in areas such as pattern recognition, random field models, intelligent databases, and intelligent manufacturing systems. He is a Fellow of the IEEE.
Dr. Arif Ghafoor received his B.S. degree in electrical engineering from the University of Engineering and Technology, Lahore, Pakistan, in 1976, and the M.S., M.Phil., and Ph.D. degrees from Columbia University in 1977, 1980, and 1984, respectively. In the spring of 1991, he joined the faculty of the School of Electrical Engineering at Purdue University, where he is now a full professor. Prior to joining Purdue University, he was on the faculty of Syracuse University from 1984 to 1991. His research interests include the design and analysis of parallel and distributed systems, and multimedia information systems. He has published in excess of 100 technical papers in these areas. Currently, he is directing a research laboratory in distributed multimedia systems at Purdue University. His research in these areas has been funded by DARPA, NSF, NYNEX, AT&T, Intel, IBM, Fuji Electric Corporation, and GE. He has served on the program committees of various IEEE and ACM conferences. He is a fellow of the IEEE.