Fig. 2. UML class diagram for an ideal implementation of the VizIR class framework. The key element is the class "Query", which contains the methods for query generation and execution. Each query consists of a number of "QueryLayer" elements that implement exactly one feature each. All feature classes – MPEG-7 descriptors as well as all others – are derived from the interface "Feature" and contain methods for descriptor extraction ("extractFeature()"), serialization ("FeatureToRaw()", "RawToFeature()", etc.) and distance measurement ("calculateDistance()"). Feature classes take their media content from instances of the class "MediaContent". The result of each query is a set of media objects (represented as MediaContent objects), which is stored in a "ResultSet" object. Finally, the methods of the class "DatabaseManager" encapsulate the database access.
The latter two evaluation cycles have to be performed in usability labs. A combination of different observation methods and devices – such as eye-trackers and video observation – is necessary to collect objective data (e.g. eye movements) as well as subjective data (e.g. verbal expressions). By analyzing and comparing these data, cost and benefit assessments of existing systems, with special focus on the system to be developed, become possible. The VizIR prototype will be based on a standard relational database. Fig. 1 gives an overview of its tables and relations for media and feature storage. Fig. 2 outlines the likely class structure of the VizIR prototype. To a certain extent this class framework follows the architecture of IBM's QBIC system [8], but it largely differs from QBIC in its server/client-independent classes.
Similarly to QBIC, the database access is hidden from the feature programmer and the layout of all feature classes is predefined by the interface "Feature". Concluding this sketch of the VizIR prototype's system architecture, we outline several aspects of the application and data distribution. Modern CORBA-based programming environments like the Java environment permit the network-independent distribution of applications, objects and methods (in Java through the Remote Method Invocation library) to increase the performance of an application by load balancing and multi-threading. If VizIR is implemented in Java, the objects for querying could be implemented as JavaBeans, feature extraction functions with RMI, database management through servlets and user interfaces as applets. Database distribution could be realized through standard replication mechanisms and database access through JDBC.
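To make the class layout described above and in the caption of Fig. 2 more concrete, the following is a minimal Java sketch of what the "Feature" interface and its surrounding query classes could look like. The method names are taken from the figure caption; all signatures, the weighting scheme and the simplified Query and QueryLayer bodies are assumptions made for illustration, not the actual VizIR API.

```java
// Hypothetical Java sketch of the class layout in Fig. 2 (signatures assumed).
import java.util.ArrayList;
import java.util.List;

interface Feature {
    void extractFeature(MediaContent content);   // descriptor extraction
    byte[] featureToRaw();                       // serialization for database storage
    void rawToFeature(byte[] raw);               // deserialization
    double calculateDistance(Feature other);     // distance measurement
}

// Placeholder for the media access class of the framework.
class MediaContent { /* pixel and frame access omitted */ }

// A query layer implements exactly one feature and weights its contribution.
class QueryLayer {
    final Feature queryFeature;   // descriptor extracted from the query example(s)
    final double weight;

    QueryLayer(Feature queryFeature, double weight) {
        this.queryFeature = queryFeature;
        this.weight = weight;
    }
}

// "Query" combines the layers; DatabaseManager and ResultSet are omitted here.
class Query {
    private final List<QueryLayer> layers = new ArrayList<>();

    void addLayer(QueryLayer layer) { layers.add(layer); }

    // Weighted sum of per-feature distances against a candidate's
    // pre-extracted descriptors (one per layer, in the same order).
    double distance(List<Feature> candidateFeatures) {
        double d = 0.0;
        for (int i = 0; i < layers.size(); i++) {
            QueryLayer layer = layers.get(i);
            d += layer.weight * layer.queryFeature.calculateDistance(candidateFeatures.get(i));
        }
        return d;
    }
}
```

A concrete feature class (e.g. an MPEG-7 color descriptor) would then implement Feature and could be added to a query as one QueryLayer per feature.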
5 Implementation
The major question concerning the implementation of the VizIR prototype is the choice of the programming environment. At this point in time, when MPEG-21 is still far out of sight, there are three major alternatives that support image and video processing to choose from:
– Java and the Java Media Framework (JMF; [10])
– The emerging Open Media Library standard (OpenML) of the Khronos Group [17]
– Microsoft DirectX (namely DirectShow) or its successor in the .NET environment [6]
All of these environments offer comprehensive video processing capabilities and are based on modern, object-oriented programming paradigms. DirectX is platform-dependent and a commercial product. For .NET, Microsoft has recently initiated the development of a Linux version, but it is expected that this version will not be available before summer 2002 and will still have to be purchased. Additionally, it is unlikely that versions for other operating systems (SunOS, OpenBSD, IRIX, etc.) will be developed. Therefore, in the following discussion we will concentrate on the first two alternatives: JMF and OpenML. JMF is a platform-dependent add-on to the Java SDK, which is currently available in a full version for SunOS and Windows (implementations by SUN and IBM) as well as Linux (implementation by Blackdown), and in a Java version with fewer features for all other operating systems that have Java Virtual Machine implementations. JMF is free and extensible. OpenML is an initiative of the Khronos Group (a consortium of companies with expert knowledge in video processing, including Intel, SGI and SUN) that standardizes a C interface for multimedia production. OpenML includes OpenGL for 3D and 2D vector graphics, extensions to OpenGL for synchronization, the MLdc library for video and audio rendering, and the 'OpenML core' for media processing (confusingly, the media processing part of OpenML is named OpenML as well; therefore we will use the term 'OpenML-mp' for the media processing capabilities below). The first reference implementation of OpenML for Windows was announced for winter 2001.
Among the concepts that are implemented similarly in JMF and OpenML-mp are the following:
– Synchronization: a media object's time base (JMF: TimeBase object, OpenML-mp: Media Stream Counter) is derived from a single global time base (JMF: SystemTimeBase object, OpenML-mp: Unadjusted System Time).
– Streaming: both environments do not manipulate media data as a continuous stream but instead as discrete segments in buffer elements.
– Processing control: JMF uses Control objects and OpenML-mp uses messages for this purpose.
Other important media processing concepts are implemented differently in JMF and OpenML-mp:
– Processing chains: in JMF, real processing chains with parallel processing can be defined (one instance for one media track is called a CodecChain). In OpenML-mp processing operations, data always flows from the application to a single processor (called a Transcoder) through a pipe and back.
– Data flow: JMF distinguishes between data sources (including capture devices, RTP servers and files) and data sinks. OpenML-mp handles all I/O devices in the same way (as so-called Jacks).
The major advantages of OpenML-mp are:
– Integration of OpenGL, the platform-independent open standard for 3D graphics.
– A low-level C API that will probably be supported by the decisive video hardware manufacturers and should have a superior processing performance.
– The rendering engine of OpenML (MLdc) seems to have a more elaborate design than the JMF Renderer components. In particular, it can be expected that the genlock mechanism of MLdc will prevent lost-sync phenomena, which usually occur in JMF when rendering media content with audio and video tracks that are longer than ten minutes.
– OpenML-mp defines more parameters for video formats and is more closely related to professional video formats (DVCPRO, D1, etc.) and television formats (NTSC, PAL, HDTV, etc.).
On the other hand, the major disadvantages of OpenML are:
– It is not embedded in a CASE environment like Java is for JMF. Therefore application development requires more resources and longer development cycles.
– OpenML is not object-oriented and includes no mechanism for parallel media processing.
The major drawbacks of JMF are:
– Lower processing performance because of the high-level architecture of the Java Virtual Machine. This can be mitigated by the integration of native C code through the Java Native Interface.
– Limited video hardware and video format support: JMF has problems with accessing certain video codecs and capture devices and with transcoding of some video formats.
The outstanding features of JMF are:
– Full Java integration. The Java SDK includes comprehensive methods for distributed and parallel programming, database access and I/O processing. Additionally, professional CASE tools exist for software engineering with Java.
– JMF is free software and reference implementations exist for a number of operating systems. JMF version 2.0 is a co-production of SUN and IBM; in version 1.0 Intel was involved as well.
– JMF is extensible. Additional codecs, multiplexers and other components can be added by the application programmer.
The major demands of the VizIR project are the need for a free and bug-free media processing environment that supports distributed software engineering and has a distinct and robust structure. Matters like processing performance and extended hardware support are secondary for this project. Therefore, the authors think that JMF is currently the right choice for the implementation. Design and implementation will follow a UML-based incremental design process with prototyping, because UML is state of the art in software engineering and because of the valuable positive effect of rapid prototyping on the employees' motivation. Standard statistical packages and Perl scripts will be used for performance evaluation; Self-organizing Maps [11], Adaptive Resonance Theory (ART) neural networks and genetic algorithms will be used for tasks like pattern matching and (heuristic) optimization (as in [4]).
6 Conclusion
The major outcomes of the open VizIR project can be summarized as follows:
– An open class framework of methods for feature extraction, distance calculation, user interface components and querying.
– Evaluated user interface methods for content-based visual retrieval.
– A system prototype for the refinement of the basic methods and interface paradigms.
– Carefully selected evaluation sets for groups of features (color, texture, shape, motion, etc.) with human-rated co-similarity values.
– Evaluation results for the methods of the MPEG-7 standard, the authors' earlier content-based retrieval projects and all other promising methods.
The authors would like to invite interested research institutions to join the discussion and participate in the design and implementation of the open VizIR project.
References
1. Barnsley, M.F., Hurd, L.P., Gustavus, M.A.: Fractal video compression. Proc. of IEEE Computer Society International Conference, Compcon Spring (1992)
2. Barros, J., French, J., Martin, W.: Using the triangle inequality to reduce the number of comparisons required for similarity based retrieval. SPIE Transactions (1996)
3. Breiteneder, C., Eidenberger, H.: Automatic Query Generation for Content-based Image Retrieval. Proc. of IEEE Multimedia Conference, New York (2000)
4. Breiteneder, C., Eidenberger, H.: Performance-optimized feature ordering for Content-based Image Retrieval. Proc. European Signal Processing Conference, Tampere (2000)
5. Chua, T., Ruan, L.: A Video Retrieval and Sequencing System. ACM Transactions on Information Systems, Vol. 13, No. 4 (1995) 373-407
6. DirectX: msdn.microsoft.com/library/default.asp?url=/library/en-us/wcegmm/htm/dshow.asp
7. Fels, S., Mase, K.: Interactive Video Cubism. Proc. of ACM International Conference on Information and Knowledge Management, Kansas City (1999) 78-82
8. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., Yanker, P.: Query by Image and Video Content: The QBIC System. IEEE Computer (1995)
9. Frei, H., Meienberg, S., Schäuble, P.: The Perils of Interpreting Recall and Precision. In: Fuhr, N. (ed.): Information Retrieval, Springer, Berlin (1991) 1-10
10. Java Media Framework Home Page: java.sun.com/products/java-media/jmf/index.html
11. Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J.: SOM-PAK: The Self-organizing Map Program Package. Helsinki (1995)
12. Lasfar, A., Mouline, S., Aboutajdine, D., Cherifi, H.: Content-Based Retrieval in Fractal Coded Image Databases. Proc. of Visual Information and Information Systems Conference, Amsterdam (1999)
13. Lin, F., Picard, R. W.: Periodicity, directionality, and randomness: Wold features for image modelling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (1996)
14. MPEG-7 standard: working papers. www.cselt.it/mpeg/working_documents.htm#mpeg-7
15. Nastar, C., Mitschke, M., Meilhac, C.: Efficient Query Refinement for Image Retrieval. Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1998)
16. Oomoto, E., Tanaka, K.: OVID: design and implementation of a video-object database system. IEEE Transactions on Knowledge and Data Engineering (1993)
17. OpenML: www.khronos.org/frameset.htm
18. Osgood, C. E. et al.: The Measurement of Meaning. University of Illinois, Urbana (1971)
19. Payne, J. S., Hepplewhite, L., Stonham, T. J.: Evaluating content-based image retrieval techniques using perceptually based metrics. SPIE Proc., Vol. 3647 (1999) 122-133
20. Pentland, A., Picard, R. W., Sclaroff, S.: Photobook: Content-Based Manipulation of Image Databases. SPIE Storage and Retrieval Image and Video Databases II (1994)
21. Rui, Y., Huang, T., Chang, S.: Image Retrieval: Past, Present and Future. Proc. of International Symposium on Multimedia Information Processing, Taiwan (1997)
22. Santini, S., Jain, R.: Beyond Query By Example. ACM Multimedia (1998)
23. Santini, S., Jain, R.: Similarity Measures. IEEE Transactions on Pattern Analysis and Machine Intelligence (1999)
24. Santini, S., Jain, R.: Integrated browsing and querying for image databases. IEEE Multimedia, Vol. 3, No. 7 (2000) 26-39
25. Sheikholeslami, G., Chang, W., Zhang, A.: Semantic Clustering and Querying on Heterogeneous Features for Visual Data. ACM Multimedia (1998)
26. Smith, J. R., Chang, S.: VisualSEEk: a fully automated content-based image query system. ACM Multimedia (1996)
27. Wood, M., Campbell, N., Thomas, B.: Iterative Refinement by Relevance Feedback in Content-Based Digital Image Retrieval. ACM Multimedia (1998)
28. Wu, J. K., Lam, C. P., Mehtre, B. M., Gao, Y. J., Desai Narasimhalu, A.: Content-Based Retrieval for Trademark Registration. Multimedia Tools and Applications, Vol. 3, No. 3 (1996) 245-267
Feature Extraction and a Database Strategy for Video Fingerprinting
Job Oostveen, Ton Kalker, and Jaap Haitsma
Philips Research, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands
[email protected], [email protected], [email protected]
Abstract. This paper presents the concept of video fingerprinting as a tool for video identification. As such, video fingerprinting is an important tool for persistent identification as proposed in MPEG-21. Applications range from video monitoring on broadcast channels to filtering on peer-to-peer networks to meta-data restoration in large digital libraries. We present considerations and a technique for (i) extracting essential perceptual features from moving image sequences and (ii) identifying any sufficiently long unknown video segment by efficiently matching the fingerprint of the short segment with a large database of pre-computed fingerprints.
1 Introduction
This paper presents a method for the identification of video. The objective is to identify video objects not by comparing perceptual similarity of the video objects themselves (which might be computationally expensive), but by comparing short digests, also called fingerprints, of the video content. These digests mimic the characteristics of regular human fingerprints. Firstly, it is (in general) impossible to derive from the fingerprint other relevant personal characteristics. Secondly, comparing fingerprints is sufficient to decide whether two persons are the same or not. Thirdly, fingerprint comparison is a statistical process, not a test for mathematical equality: it is only required that fingerprints are sufficiently similar to decide whether or not they belong to the same person (proximity matching).
1.1 Classification
Fingerprint methods can be categorized in two main classes, viz. the class of methods based on semantical features and the class of methods based on non-semantical features. The former class builds fingerprints from high-level features, such as those commonly used for retrieval. Typical examples include scene boundaries and color histograms. The latter class builds fingerprints from more general perceptual invariants that do not necessarily have a semantical interpretation. A typical example in this class is differential block luminance (see also Section 2). For both classes it holds that (small) fingerprints can be used to establish perceptual equality of (large) video objects. It should be noted that a feature extraction method for fingerprinting must be quite different from the methods used for
video retrieval. In retrieval, the features must facilitate searching for video clips that somehow look similar to the query, or that contain objects similar to those in the query. In fingerprinting the requirement is to identify clips that are perceptually the same, except for quality differences or the effects of other video processing. Therefore, the features for fingerprinting need to be far more discriminatory, but they do not necessarily need to be semantical. Consider the example of identification of content in a multimedia database. Suppose one is viewing a scene from a movie and would like to know from which movie the clip originates. One way of finding out is by comparing the scene to all fragments of the same size of all movies in the database. Obviously, this is totally infeasible in case of a large database: even a short video scene is represented by a large amount of bytes and potentially these have to be compared to the whole database. Thus, for this to work, one needs to store a large amount of easily accessible data and all these data have to be compared with the video scene to be identified. Therefore, there is both a storage problem (the database) and a computational problem (matching large amounts of data). Both problems can be alleviated by reducing the number of bits needed to represent the video scenes: fewer bits need to be stored and fewer bits need to be used in the comparison. One possible way to achieve this is by using video compression. However, because it is not needed to reconstruct the video from the representation, at least theoretically it is possible to use fewer bits for identification than for encoding. Moreover, perceptually comparing compressed video streams is a computationally expensive operation. A more practical option is to use a video compression scheme that is geared towards identification, more specifically to use a fingerprinting scheme. Video identification can then be achieved by storing the fingerprints of all relevant fragments in a database. Upon reception of an unknown fragment, its fingerprint is computed and compared to those in the database. This search (based on inexact pattern matching) is still a burdensome task, but it is feasible on current-day PCs.
1.2 Relation to Cryptography
We will now first discuss the concept of cryptographic hash functions and show how we approach the concept of fingerprints as an adaptation of cryptographic hash functions. Hash functions are a well-known concept in cryptography [8]. A cryptographic hash, also called message digest or digital signature, is in essence a short summary of a long message. Hash functions take a message of arbitrary size as input and produce a small bit string, usually of fixed size: the hash or hash value. Hash functions are widely used as a practical means to verify, with high probability, the integrity of (bitwise) large objects. The typical requirements for a hash function are twofold:
1. For each message M, the hash value H = h(M) is easily computable;
2. The probability that two messages lead to the same hash is small.
As a meaningful hash function maps large messages to small hash values, such a function is necessarily many-to-one. Therefore, collisions do occur. However, the probability of hitting upon two messages with the same hash value should be minimal. This usually means that the hash values for all allowed messages have a uniform distribution. For an n-bit hash value the probability of a collision is then equal to 2^-n. Cryptographic hash functions are usually required to be one-way, i.e., it should be difficult for a given hash value H to find a message which has H as its hash value. As a result such functions are bit-sensitive: flipping a single bit in the message changes the hash completely. The topic of this paper, fingerprinting for video identification, is about functions which show a strong analogy to cryptographic hash functions, but that are explicitly not bit-sensitive and are applicable to audio-visual data. Whereas cryptographic hashes are an efficient tool to establish mathematical equality of large objects, audio-visual fingerprint functions serve as a tool to establish perceptual similarity of (usually large) audio-visual objects. In other words, fingerprints should capture the perceptually essential parts of audio-visual content. In direct analogy with cryptographic hash functions, one would expect a fingerprint function to be defined as a function that maps perceptually similar objects to the same bit string value. However, it is well known that perceptual similarity is not a transitive relationship. Therefore, a more convenient and practical definition reads as follows: a fingerprint function is a function that (i) maps (usually bitwise large) audio-visual objects to (usually bitwise small) bit strings (fingerprints) such that perceptually small changes lead to small differences in the fingerprint and (ii) such that perceptually very different objects lead with very high probability to very different fingerprints.
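To illustrate the bit sensitivity that distinguishes cryptographic hashes from the perceptual fingerprints discussed here, the following small Java sketch hashes a message and a copy with a single flipped bit and counts the differing digest bits. SHA-256 is used only as a convenient example of a cryptographic hash; the choice of algorithm is not part of the paper.

```java
import java.security.MessageDigest;

public class HashBitSensitivity {
    public static void main(String[] args) throws Exception {
        byte[] m1 = "a long video-like message".getBytes("UTF-8");
        byte[] m2 = m1.clone();
        m2[0] ^= 0x01;                      // flip a single bit of the message

        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] h1 = sha.digest(m1);
        byte[] h2 = sha.digest(m2);         // digest() resets the state in between

        // Count the differing bits between the two 256-bit digests.
        int diff = 0;
        for (int i = 0; i < h1.length; i++) {
            diff += Integer.bitCount((h1[i] ^ h2[i]) & 0xFF);
        }
        // Typically around half of the 256 bits differ: a one-bit change in the
        // input changes the hash completely, which is exactly what a perceptual
        // fingerprint function must not do.
        System.out.println("Differing bits: " + diff + " of " + (8 * h1.length));
    }
}
```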
1.3 Fingerprinting Approaches
The scientific community seems to be favouring the terminology 'fingerprint', and for that reason this is the terminology that will be used in this paper. However, it is doubtful whether or not this is the best choice. For instance, the term fingerprinting is also used in the watermarking community, where it denotes the active embedding of tracing information. Although the literature on fingerprinting is still limited, in particular for video, some progress has already been reported. Among others, algorithms for still image fingerprinting have been published by Abdel-Mottaleb et al. [1], by J. Fridrich [5], by R. Venkatesan et al. [12,11] and by Schneider and Chang [10]. A number of algorithms for audio fingerprinting have been published; see [6] and the references therein. A number of papers present algorithms for video fingerprinting. Cheung and Zakhor [4] are concerned with estimating the number of copies (possibly at different quality levels) of video clips on the web. Hampapur and Bolle [7] present an indexing system based on feature extraction from key frames. Cryptographic hashes operate on the basis of a complete message. As such, it is impossible to check the integrity or obtain the identity of a part of the message. For video fingerprinting this is an undesirable property, as it means that it is impossible to identify short clips out of a longer clip. Also, for integrity checking, one would like to be able to localize distortions.
For this reason, it is not always appropriate to create a global fingerprint for the whole of an audio-visual object. Instead, we propose to use a fingerprint stream of locally computed fingerprint bits (also referred to as sub-fingerprints): per time unit, a number of bits are extracted from the content. In this way, it is also possible to identify smaller sections of the original. In a typical identification scenario, the full fingerprint stream is stored in the database. Upon reception of a video, the fingerprint values are extracted from a short section, say with a duration of 1 second. The result, which we call a fingerprint block, is then matched to all blocks of the same size in the database. If the fingerprint block matches a part of the fingerprint stream of some material, it is identified as that specific part of the corresponding video. If there is no sufficiently close match, the process repeats by extracting a next fingerprint block and attempting to match it. The description above reveals two important complexity aspects of a full-fledged fingerprinting system. The first complexity aspect concerns fingerprint extraction, the second concerns the matching process. In a typical application the fingerprint extraction client has only limited resources. Moreover, the bandwidth to the fingerprint matching engine is severely restricted. It follows that in many applications it is required that fingerprint extraction is low complexity and that the size of the fingerprint is either small or at least sufficiently compressible. This observation already rules out the use of semantic fingerprints in many cases, as these tend to be computationally intensive. The fingerprint matching server is in its most basic form a gigantic sliding correlator: for an optimal decision a target fingerprint block needs to be matched against all fingerprint blocks of similar length in the database. Even for simple matching functions (such as bit error rate), this sliding correlation becomes infeasible if the fingerprint database is sufficiently large. For a practical fingerprint matching engine it is essential that the proximity matching problem is dealt with in an appropriate manner, either by including ingredients that allow hierarchical searching [6], by careful preparation of the fingerprint database [3], or both. Both types of complexity are already well recognized in the field of audio fingerprinting; see for example the recent RIAA/IFPI call [9].
1.4 Overview
In this paper we introduce an algorithm for robust video fingerprinting that has very modest feature extraction complexity, a well-designed matching engine and a good performance with respect to robustness. We will present some general considerations in the design of such a video fingerprinting algorithm, with a focus on building a video identification tool. In Section 2 we introduce the algorithm and discuss a number of the issues in designing such an algorithm. Section 3 contains the design of a suitable database structure. In Section 4 we will summarize our results and indicate directions for future research.
2 Feature Extraction
In this section, we present a feature extraction algorithm for robust video fingerprinting and we discuss some of the choices and considerations in the design of such an algorithm.
Fig. 1. Block diagram of the differential block luminance algorithm (frames are divided into blocks; the mean luminance of each block is computed and then filtered and quantized).
The first question to be asked is in which domain to extract the features. In audio, very clearly, the frequency domain optimally represents the perceptual characteristics. In video, however, it is less clear which domain to use. For complexity reasons it is preferable to avoid complex operations, like DCT or DFT transformations. Therefore, we choose to compute features in the spatio-temporal domain. Moreover, to allow easy feature extraction from most compressed video streams as well, we choose features which can be easily computed from block-based DCT coefficients. Based on these considerations, the proposed algorithm is based on a simple statistic, the mean luminance, computed over relatively large regions. This is also the approach taken by Abdel-Mottaleb [1]. We choose our regions in a fairly simple way: the example algorithm in this paper uses a fixed number of blocks per frame. In this way, the algorithm is automatically resistant to changes in resolution. To ease the discussion, we introduce some terminology. The bits extracted from a frame will be referred to as sub-fingerprints. A fingerprint block then denotes a fixed number of sub-fingerprints from consecutive frames. Our goal is to be able to identify short video clips and moreover to localize the clip inside the movie from which it originates. In order to do this, we need to extract features which contain sufficient high-frequency content in the temporal direction. If the features are more or less constant over a relatively large number of frames, then it is impossible to localize the clip exactly inside the movie. For this reason, we take differences of corresponding features extracted from subsequent frames. Automatically, this makes the system robust to (slow) global changes in luminance. To arrive at our desired simple binary features, we only retain the sign of the computed differences. This immediately implies robustness to luminance offsets and to contrast modifications. To decrease the complexity
of measuring the distance between two fingerprints (the matching process), a binary fingerprint also offers considerable advantages. That is, we can compare fingerprints on a bit-by-bit basis, using the Hamming distance as a distance measure. Summarizing, we discard all magnitude information from the extracted filter output values, and only retain the sign. The introduction of differentiation in the temporal direction leads to a problem in the case of still scenes. If a video scene is effectively a prolonged still image, the temporal differentiation is completely determined by noise, and therefore the extracted bits are very unreliable. Conceptually, what one would like is that fingerprints do not change while the video is unchanged. One way to achieve this is by using a conditional fingerprint extraction procedure. This means that a frame is only considered for fingerprint computation if it differs sufficiently from the last frame from which a fingerprint was extracted [2]. This approach leads, however, to a far more difficult matching procedure: the matching needs to be resistant to the fact that the fingerprint extracted from a processed version of a clip may have a different number of sub-fingerprints than the original. Another possibility is to use a different temporal filter which does not completely suppress mean luminance (DC). This can be achieved in a very simple manner by replacing the earlier proposed FIR filter kernel [−1 1] by [−α 1], where α is a value slightly smaller than 1. Using this filter the extracted fingerprint will be constant in still scenes (and even still regions of a scene), whereas in regions with motion the fingerprint is determined by the difference between luminance values in consecutive frames. In addition to the differentiation in the time domain, we can also apply a spatial differentiation (or, more generally, a high-pass filter) to the features extracted from one frame. In this way, the correlation between bits extracted from the same frame is also decreased significantly. Secondly, application of the spatial filter avoids a bias in the overall extracted bits, which would occur if the new temporal filter were applied directly to the extracted mean luminance values (see Footnote 1). For our experiments, the results of which will be presented below, we have used the following algorithm.
1. Each frame is divided in a grid of R rows and C columns, resulting in R × C blocks. For each of these blocks, the mean of the luminance values of the pixels is computed. The mean luminance of block (r, c) in frame p is denoted F(r, c, p) for r = 1, ..., R and c = 1, ..., C.
2. We visualise the computed mean luminance values from the previous step as frames consisting of R × C "pixels". On this sequence of low-resolution gray-scale images, we apply a spatial filter with kernel [−1 1] (i.e. taking differences between neighbouring blocks in the same row), and a temporal filter with kernel [−α 1], as explained above.
3. The sign of the resulting value constitutes the fingerprint bit B(r, c, p) for block (r, c) in frame p. Note that due to the spatial filtering operation in the previous step, the value of c ranges from 1 to C − 1 (but still, r = 1, ..., R). Thus, per frame we derive R × (C − 1) fingerprint bits.
Footnote 1: Without spatial differentiation, the fingerprint values before quantization would have a larger probability of being positive than negative.
Summarizing, and more precisely, we have for r = 1, ..., R and c = 1, ..., C − 1:

B(r, c, p) = 1 if Q(r, c, p) ≥ 0, and B(r, c, p) = 0 if Q(r, c, p) < 0,

where

Q(r, c, p) = (F(r, c+1, p) − F(r, c, p)) − α (F(r, c+1, p−1) − F(r, c, p−1)).

We call this algorithm "differential block luminance". A block diagram describing it is depicted in Figure 1. These features have a number of important advantages:
– Only a limited number of bits is needed to uniquely identify short video clips with a low false positive probability.
– The feature extraction algorithm has a very low complexity and it may be adapted to operate directly on the compressed domain, without a need for complete decoding.
– The robustness of these features with respect to geometry-preserving operations is very good.
A disadvantage may be that for certain applications the robustness with respect to geometric operations (like zoom & crop) may not be sufficient. Experimental robustness results are presented in Section 2.1, below. For our experiments we used α = 0.95 and R = 4, C = 9. This leads to a fingerprint size of 32 bits per frame, and a block size of 120 × 80 pixels for NTSC video material. Matching is done on the basis of fingerprint bits extracted from 30 consecutive frames, i.e., 30 × 32 = 960 bits.
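The following Java sketch implements the differential block luminance extraction as specified above (block means F(r, c, p), spatial kernel [−1 1] along each row, temporal kernel [−α 1], sign quantization). The frame representation and the packing of the 32 bits into an integer are assumptions made for illustration; they are not taken from the authors' implementation.

```java
/** Minimal sketch of differential block luminance extraction.
 *  frame[y][x] holds luminance values; R rows and C columns of blocks. */
public class DifferentialBlockLuminance {

    /** Mean luminance F(r, c, p) of each block of one frame. */
    static double[][] blockMeans(int[][] frame, int R, int C) {
        int h = frame.length, w = frame[0].length;
        double[][] F = new double[R][C];
        for (int r = 0; r < R; r++) {
            for (int c = 0; c < C; c++) {
                int y0 = r * h / R, y1 = (r + 1) * h / R;
                int x0 = c * w / C, x1 = (c + 1) * w / C;
                double sum = 0.0;
                for (int y = y0; y < y1; y++)
                    for (int x = x0; x < x1; x++)
                        sum += frame[y][x];
                F[r][c] = sum / ((y1 - y0) * (x1 - x0));
            }
        }
        return F;
    }

    /** Sub-fingerprint bits B(r, c, p) of frame p, given the block means of the
     *  previous frame. With R = 4 and C = 9 this yields 4 x 8 = 32 bits,
     *  packed here into one int (packing order is an assumption). */
    static int subFingerprint(double[][] Fprev, double[][] Fcur, double alpha) {
        int R = Fcur.length, C = Fcur[0].length;
        int bits = 0, bit = 0;
        for (int r = 0; r < R; r++) {
            for (int c = 0; c < C - 1; c++) {
                double q = (Fcur[r][c + 1] - Fcur[r][c])
                         - alpha * (Fprev[r][c + 1] - Fprev[r][c]);
                if (q >= 0) bits |= (1 << bit);   // B(r,c,p) = 1 iff Q(r,c,p) >= 0
                bit++;
            }
        }
        return bits;
    }
}
```

With R = 4 and C = 9, subFingerprint returns the 32 fingerprint bits of one frame; collecting 30 consecutive return values yields one 960-bit fingerprint block.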
2.1 Experimental Results
Extensive experiments with the algorithm described above are planned for the near future. In this article we report on the results of some initial tests. We have used six 10-second clips, taken from a number of movies and television broadcasts (with a resolution of 480 lines and 720 pixels per line). From these clips, we extracted the fingerprints. These are used as "the database". Subsequently, we processed the clips, and investigated how this influences the extracted fingerprint. The test included the following processing:
1. MPEG-2 encoding at 4 Mbit/second;
2. median filtering using 3 × 3 neighbourhoods;
3. luminance-histogram equalisation;
4. shifting the images vertically over k lines (k = 1, 2, 3, 4, 8, 12, 16, 20, 24, 32);
5. scaling the images horizontally, with a scaling factor between 80% and 120%, with steps of 2%.
Fig. 2. Robustness w.r.t. horizontal scaling (left graph: bit error rate versus horizontal scale factor) and vertical shifts (right graph: bit error rate versus vertical shift in lines).
The results for scaling and shifting are shown in Figure 2. The other results are reported below:
MPEG-2 encoding: 11.8%
median filtering: 2.7%
histogram equalisation: 2.9%
The results indicate that the method is very robust against all processing which is done on a local basis, like for instance MPEG compression or median filtering. In general the alterations created by these processes average out within the blocks. Processing which changes the video more in a global fashion is more difficult to withstand. For instance, global geometric operations like scaling and shifting lead to far higher bit error rates. This behaviour stems from the resulting misalignment of the blocks. A higher robustness could be obtained by using larger blocks, but this would reduce the discriminative power of the fingerprint.
3 Database Strategy
Matching the extracted fingerprints to the fingerprints in a large database is a non-trivial task since it is well known that proximity matching does not scale nicely to very large databases (recall that the extracted fingerprint values may have many bit errors). We will illustrate this with some numbers, based on using the proposed fingerprinting scheme (as described in Section 2), in a broadcast monitoring scenario. Consider a database containing news clips with a total duration of 4 weeks (i.e., 4×7×24 = 672 hours of video material). This corresponds to almost 300 megabytes of fingerprints. If we now extract a fingerprint block (e.g. corresponding to 1 second of video, which results in 30 sub-fingerprints) from an unknown news broadcast, we would like to determine which position in the 672 hours of stored news clips it matches best. In other words we want to find the position in these 672 hours where the bit error rate is minimal. This
can be done by brute force matching, but this will take around 72 million comparisons. Moreover, the number of comparisons increases linearly with the size of the database. We propose to use a more efficient strategy, which is depicted in Figure 3.

Fig. 3. Database layout: a lookup table over all 2^32 possible sub-fingerprint values (0x00000000 through 0xFFFFFFFF) points to the clips and positions where each value occurs; the extracted fingerprint block is matched against the fingerprint blocks at these candidate positions.

Instead of matching the complete fingerprint block, we first look at only a single sub-fingerprint at a time and assume that occasionally this 32-bit bit-string contains no errors. We start by creating a lookup table (LUT) for all possible 32-bit words, and we let the entries in the table point to the video clip and the position(s) within that clip where this 32-bit word occurs as a sub-fingerprint. Since this word can occur at multiple positions in multiple clips, the pointers are stored in a linked list. In this way one 32-bit word is associated with multiple pointers to clips and positions. The approach that we take bears a lot of similarity to inverted file techniques, as commonly used in text retrieval applications. Our lookup table is basically an index describing for each sub-fingerprint (word) at which location in which clip it occurs. The main difference with text retrieval is that, due to processing of the video, we need to adapt our search strategy to the fact that sub-fingerprints will frequently contain (possibly many) erroneous bits. By inspecting the lookup table for each of the 30 extracted sub-fingerprints, a list of candidate clips and positions is generated. With the assumption that occasionally a single sub-fingerprint is free of bit errors, it is easy to determine whether or not all the 30 sub-fingerprints in the fingerprint block match one of
126
Job Oostveen, Ton Kalker, and Jaap Haitsma
the candidate clips and positions. This is done by calculating the bit error rate of the extracted fingerprint block with respect to the corresponding fingerprint blocks of the candidate clips and positions. The candidate clip and position with the lowest error rate is selected as the best match, provided that this error rate is below an appropriate threshold. Otherwise the database reports that the search could not find a valid best match. Note that in practice, once a clip is identified, it is only necessary to check whether or not the fingerprints of the remainder of the clip belong to the best match already found. As soon as the fingerprints no longer match, a full structured search is again initiated. Let us give an example of the described search method by taking a look at Figure 3. The last extracted fingerprint value is 0x00000001. The LUT in the database points only to a certain position in clip 1. Let's say that this position is position p. We now calculate the bit error rate between the extracted fingerprint block and the block of clip 1 from position p−29 until position p. If the two blocks match sufficiently closely, then it is very likely that the extracted fingerprint originates from clip 1. However, if the two blocks are very different, then either the clip is not in the database or the extracted sub-fingerprint contains an error. Let's assume that the latter occurred. We then try the second-to-last extracted sub-fingerprint (0x00000000). This one has two possible candidate positions, one in clip 2 and one in clip 1. Assuming that the comparison of the extracted fingerprint block with the corresponding database fingerprint block of clip 2 yields a bit error rate below the threshold, we identify the video clip as originating from clip 2. If not, we repeat the same procedure for the remaining 28 sub-fingerprints. We need to verify that our assumption that every fingerprint block contains an error-free sub-fingerprint is actually a reasonable one. Experiments indicate that this is actually the case for all reasonable types of processing. With the above method, we only compare the fingerprint blocks to those blocks in the database which correspond exactly in at least one of their sub-fingerprints. This makes the search much faster compared to exhaustive search or any pivot-based strategy [3], and this makes it possible to efficiently search in very large databases. This increased search speed comes at the cost of possibly not finding a match, even if there is a matching fingerprint block in the database. More precisely, this is the case if all of the sub-fingerprints have at least one erroneous bit, but at the same time the overall bit error rate is below the threshold. We can decrease the probability of missed identifications by using bit reliability information. The fingerprint bits are computed by taking the sign of a real-valued number. The absolute value of this number can be taken as a reliability measure of the correctness of the bit: the sign of a value close to zero is assumed to be less robust than the sign of a very large number. In this way, we can declare q of the bits in the fingerprint unreliable. To decrease the probability of a missed recognition, we toggle those q bits, thus creating 2^q candidate sub-fingerprints. We then do an efficient matching, as described above, with all of these sub-fingerprints. If one of these leads to a match, then the database fingerprint block is compared with the originally extracted fingerprint.
If the resulting bit error rate of this final comparison is again below the threshold, then we have a successful identification.
Note that in this way the reliability information is used to generate more candidates in the comparison procedure, but that it has no influence on the final bit error rate. In [6] we have described a method for audio fingerprinting. The database strategy described there is the same as the one in this paper, except for some of the parameter values (in the case of audio, matching is done based on fingerprint blocks which consist of 256 sub-fingerprints, corresponding to 3 seconds of audio). With this audio database we have carried out extensive experiments that show the technical and economical feasibility of scaling this approach to very large databases, containing for instance a few million songs. An important figure of merit for a fingerprinting method is the false positive probability: the probability that two randomly selected video clips are declared similar by the method. Under the assumption that the extracted fingerprint bits are independent random variables with equal probability of being 0 or 1, it is possible to compute a general formula for the false positive probability. Let a fingerprint block consist of R sub-fingerprints and let each sub-fingerprint consist of C bits. Then for two randomly selected fingerprint blocks, the number of bits in which the two blocks correspond is binomially (n, p) distributed with parameters n = RC and p = 1/2. As RC is large, we can approximate this distribution by a normal distribution with mean µ = np = RC/2 and variance σ² = np(1 − p) = RC/4. Given a fingerprint block B1, the probability that less than a fraction α of the bits of a randomly selected second fingerprint block B2 is different from the corresponding bits of B1 equals

Pf(α) = (1/√(2π)) ∫_{(1−2α)√n}^{∞} e^{−x²/2} dx = (1/2) erfc( (1−2α)√n / √2 ).

Based on this formula, we can set our threshold for detection. In our experiments we used n = 960. Setting the threshold α = 0.3 (i.e., declaring two clips similar if their fingerprint blocks are different in at most 30% of the bit positions), the argument of the erfc is about 8.8 and the false positive probability is computed to be in the order of 10^-35. In practice the actual false positive probability will be significantly higher due to correlation between the bits in a fingerprint block. Currently, we are in the process of studying the correlation structure experimentally, and adapting our theoretical false positive analysis accordingly.
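The following Java sketch outlines the lookup-table based search described in this section: an index from 32-bit sub-fingerprint values to (clip, position) pairs, and an identification routine that gathers candidate positions from the LUT and keeps the candidate whose fingerprint block has the lowest bit error rate, subject to the threshold α. The data structures are simplified and the toggling of unreliable bits is omitted; class and method names are invented for illustration.

```java
import java.util.*;

/** Simplified sketch of the fingerprint database strategy of Section 3. */
public class FingerprintDatabase {
    /** Fingerprint streams of all stored clips: clipId -> sub-fingerprints. */
    private final List<int[]> clips = new ArrayList<>();
    /** Lookup table: 32-bit sub-fingerprint -> positions {clipId, pos} where it occurs. */
    private final Map<Integer, List<int[]>> lut = new HashMap<>();

    public void addClip(int[] subFingerprints) {
        int clipId = clips.size();
        clips.add(subFingerprints);
        for (int pos = 0; pos < subFingerprints.length; pos++) {
            lut.computeIfAbsent(subFingerprints[pos], k -> new ArrayList<>())
               .add(new int[] { clipId, pos });
        }
    }

    /** Match a query block (e.g. 30 sub-fingerprints = 960 bits) against the
     *  database; returns {clipId, startPosition}, or null if no candidate has
     *  a bit error rate below the threshold (e.g. alpha = 0.30). */
    public int[] identify(int[] block, double berThreshold) {
        int totalBits = 32 * block.length;
        int bestErrors = Integer.MAX_VALUE;
        int[] best = null;
        for (int i = 0; i < block.length; i++) {
            List<int[]> candidates = lut.get(block[i]);
            if (candidates == null) continue;   // this sub-fingerprint contains errors
            for (int[] cand : candidates) {
                int clipId = cand[0], start = cand[1] - i;  // align block start
                int[] stored = clips.get(clipId);
                if (start < 0 || start + block.length > stored.length) continue;
                int errors = 0;
                for (int j = 0; j < block.length; j++) {
                    errors += Integer.bitCount(block[j] ^ stored[start + j]);
                }
                if (errors < bestErrors) {
                    bestErrors = errors;
                    best = new int[] { clipId, start };
                }
            }
        }
        return (best != null && bestErrors < berThreshold * totalBits) ? best : null;
    }
}
```

With 30 sub-fingerprints of 32 bits each, a threshold of α = 0.3 corresponds to accepting a best match with at most 288 erroneous bits out of 960.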
4 Conclusions
In this paper we have presented fingerprinting technology for video identification. The methodology is based on the functional similarity between fingerprints and cryptographic hashes. We have introduced a feature extraction algorithm, the design of which was driven by minimal extraction complexity. The resulting algorithm is referred to as differential block luminance. Secondly we have outlined a structure for very efficiently searching in a large fingerprint database. The combination of these feature extraction and database algorithms results in a robust and very efficient fingerprinting system. Future research will
be mainly focusing on extracting even more robust features, still under the constraint of limited complexity of the extractor and manageable fingerprint database complexity.
References
1. M. Abdel-Mottaleb, G. Vaithilingam, and S. Krishnamachari. Signature-based image identification. In SPIE Conference on Multimedia Systems and Applications II, Boston, USA, 1999.
2. J. Bancroft. Fingerprinting: Monitoring the use of media assets, 2000. Omnibus Systems Limited, white paper. See http://www.advanced-broadcast.com/.
3. E. Chavez, J. Marroquin, and G. Navarro. Fixed queries array: A fast and economical data structure for proximity searching. Multimedia Tools and Applications, 14:113–135, 2001.
4. S.S. Cheung and A. Zakhor. Video similarity detection with video signature clustering. In Proc. 8th International Conference on Image Processing, volume 2, pages 649–652, Thessaloniki, Greece, 2001.
5. J. Fridrich. Robust bit extraction from images. In Proc. IEEE ICMCS'99, volume 2, pages 536–540, Florence, Italy, 1999.
6. J. Haitsma, T. Kalker, and J. Oostveen. Robust audio hashing for content identification. In International Workshop on Content-Based Multimedia Indexing, Brescia, Italy, 2001. Accepted.
7. A. Hampapur and R.M. Bolle. Feature based indexing for media tracking. In Proc. International Conference on Multimedia and Expo 2000 (ICME-2000), volume 3, pages 1709–1712, 2000.
8. A.J. Menezes, S.A. Vanstone, and P.C. van Oorschot. Handbook of Applied Cryptography. CRC Press, 1996.
9. RIAA-IFPI. Request for information on audio fingerprinting technologies, 2001. http://www.ifpi.org/site-content/press/20010615.html, http://www.riaa.com/pdf/RIAA IFPI Fingerprinting RFI.pdf.
10. M. Schneider and S.F. Chang. A robust content based digital signature for image authentication. In Proceedings of the International Conference on Image Processing (ICIP) 1996, volume 3, pages 227–230, 1996.
11. R. Venkatesan and M.H. Jakubowski. Image hashing. In DIMACS Conference on Intellectual Property Protection, Piscataway, NJ, USA, 2000.
12. R. Venkatesan, S.M. Koon, M.H. Jakubowski, and P. Moulin. Robust image hashing. In Proceedings of the International Conference on Image Processing (ICIP), 2000.
ImageGrouper: Search, Annotate and Organize Images by Groups
Munehiro Nakazato1, Lubomir Manola2, and Thomas S. Huang1
1 Beckman Institute, University of Illinois at Urbana-Champaign, 405 N. Mathews Ave., Urbana, IL 61801, USA
{nakazato,huang}@ifp.uiuc.edu
2 School of Electrical Engineering, University of Belgrade
[email protected]
Abstract. In Content-based Image Retrieval (CBIR), trial-and-error query is essential for successful retrieval. Unfortunately, the traditional user interfaces are not suitable for trying different combinations of query examples. This is because, first, these systems assume query examples are added incrementally. Second, the query specification and result display are done on the same workspace. Once the user removes an image from the query examples, the image may disappear from the user interface. In addition, it is difficult to combine the results of different queries. In this paper, we propose a new interface for Content-based Image Retrieval named ImageGrouper. In our system, the users can interactively compare different combinations of query examples by dragging and grouping images on the workspace (Query-by-Group). Because the query results are displayed on another pane, the user can quickly review the results. Combining different queries is also easy. Furthermore, the concept of "image groups" is also applied to annotating and organizing a large number of images.
1 Introduction
Many researchers have proposed ways to find an image in large image databases. We can divide these approaches into two types of interactions: browsing and searching. In image browsing, the users look through the entire collection. In most systems, the images are clustered in a hierarchical manner and the user can traverse the hierarchy by zooming and panning [3][4][10][16]. In [16], browsing and searching are integrated so that the user can switch back and forth between browsing and searching. Meanwhile, an enormous amount of research has been done on Content-Based Image Retrieval (CBIR) [7][18][24]. In CBIR systems, the user searches images by visual similarity, i.e. low-level image features such as color [25], texture [23] and structure [27]. They are automatically extracted from images and indexed in the database. Then, the system computes the similarity between the images based on these features. The most popular method of CBIR interaction is Query-by-Examples. In this method, the users select example images (as positive or negative) and ask the system to retrieve visually similar images. In addition, in order to improve the retrieval further, CBIR systems often employ Relevance Feedback [18][19], in which the users
can refine the search incrementally by giving feedback on the result of the previous query. In this paper, we propose a new user interface for digital image retrieval and organization, named ImageGrouper. In ImageGrouper, a new concept, Query-by-Groups, is introduced for Content-based Image Retrieval (CBIR). The users construct queries by making groups of images. The groups are easily created by dragging images on the interface. Because the image groups can be easily reorganized, flexible retrieval is achieved. Moreover, with similar operations, the user can effectively annotate and organize a large number of images. In the next section, we discuss how groups are used for image retrieval. Then, the following sections describe the use of image groups for image annotation and organization.
2 User Interface Support for Content-Based Image Retrieval
2.1 Current Approaches: Incremental Search
Not much research has been done on user interface support for Content-based Image Retrieval (CBIR) systems [16][20]. Figure 1 shows a typical GUI for a CBIR system that supports Query-by-Examples. Here, a number of images are aligned in a grid. In the beginning, the system displays randomly selected images. Effective ways to align the images are studied in [17]. In some cases, they are images found by browsing or keyword-based search. Under each image, a slide bar is attached so that the user can tell the system which images are relevant. If the user thinks an image is relevant, s/he moves the slider to the right. If s/he thinks an image is not relevant and should be avoided, s/he moves the slider to the left. The amount of slider movement represents the degree of relevance (or irrelevance).
Fig. 1. Typical GUI for CBIR systems.
Fig. 2. Example of "More is not necessarily better" (query and result columns). The left is the case of one example, the right is the case of two examples.
In some systems, the user selects example images by clicking check boxes or by clicking on the images [6]. In these cases, the degrees are not specified. When the "Query" button is pressed, the system computes the similarity between the selected images and the database images, then retrieves the N most similar images. The grid images are replaced with the retrieved images. These images are ordered based on the degree of similarity. If the user finds additional relevant images in the result set, s/he selects them as new query examples. If a highly irrelevant image appears in the result set, the user can select it as a negative example. Then, the user presses "Query" again. The user can repeat this process until s/he is satisfied. This process is called relevance feedback [18][19]. Moreover, in many systems, the users are allowed to directly weight the importance of image features such as color and texture. In [22], Smeulders et al. classified Query by Image Example and Query by Group Example into two different categories. From the user interface viewpoint, however, these two are very similar. The only difference is whether the user is allowed to select multiple images or not. In this paper, we classify both approaches as the Query-by-Examples method. Instead, we use the term "Query-by-Groups" to refer to our new model of query specification described later. The Query-by-Example approach has several drawbacks. First of all, these systems assume that the more query examples are available, the better the result we can get. Therefore, the users are supposed to search images incrementally by adding new example images from the result of the previous query. However, this assumption is not always true. Additional examples may contain undesired features and degrade the retrieval performance. Figure 2 shows an example of a situation where more query examples lead to worse results. In this example, the user is trying to retrieve pictures of cars. The left column shows the query result when only one image of a "car" is used as a query example. The right column shows the result of two query examples. The results are ordered based on the similarity ranks. In both cases, the same relevance feedback algorithm (Section 5.2 and [19]) was used and tested on a Corel image set of 17,000 images. In this example, even though the additional example image looks visually good to human eyes, it introduces undesirable features into the query. Thus, no car image appears in the top 8 images. An image of a car appears at rank 13 for the first time. This example is not a special case. It happens often in image retrieval and confuses the users. This problem happens because of the semantic gap [20][22] between the high-level concept in the user's mind and the extracted features of the images. Furthermore, finding good combinations of query examples is very difficult because image features are numerical values that are impossible for humans to estimate. The only way to find the right combination is trial and error. Otherwise, the user can be trapped in a small part of the image database [16]. Unfortunately, the traditional user interfaces were designed for incremental search and are poorly suited for trial-and-error queries. This is because in these systems, query specification and result display must be done on the same workspace. Once the user removes an image from the query examples during relevance
feedback loops, the image may disappear from the user interface. Thus, it is awkward to bring it back later for another query. Second, the traditional interface does not allow the user to put aside the query results for later use. This type of interaction is desired because the users are not necessarily looking for only one type of image. The users' interest may change during retrieval. This behavior is known as berry picking [2] and has been observed for text document retrieval by O'Day and Jeffries [15]. Moreover, because of the semantic gap [20][22] mentioned above, users often need to make more than one query to satisfy their needs [2]. For instance, a user may be looking for images of "beautiful flowers." The database may contain many different "flower" images. These images might be completely different in terms of low-level visual features. Thus, the user needs to retrieve "beautiful flowers" as a collection of different types of images. Finally, in some cases it is better for the user to start from a general concept of objects and narrow down to specific ones. For example, suppose the user is looking for images of "red cars." Because image retrieval systems use various image features [23][27] as well as colors [25], even cars with different colors may have many common features with "red cars." In this case, it is better to start by collecting images of "cars of any color." Once enough car images are collected, the user can specify "red cars" as positive examples, and other cars as negative examples. Current interfaces for CBIR systems, however, do not support these types of query behavior. Another interesting approach for Query by Examples was proposed by Santini et al. [20]. In their El Niño system, the user specifies a query by the mutual distance between example images. The user drags images on the workspace so that the more similar images (in the user's mind) are located closer to each other. The system then reorganizes the images' locations reflecting the user's intent. There are two drawbacks in the El Niño system. First, it is unknown to the users how close similar images should be located and how far apart negative examples should be from good examples. It may take a while for the user to learn "the metric system" used in this interface. The second problem is that, like in traditional interfaces, query specification and result display are done on the same workspace. Thus, the user's previous decision (in the form of the mutual distances between the images) is overridden by the system when it displays the results. This makes trial-and-error query difficult. Given the analogue nature of this interface, trial-and-error support might be essential. Even if the user gets an unsatisfactory result, there is no way to redo the query with a slightly different configuration. No experimental results are provided in the paper.
2.2 Query-by-Groups
We are developing a new user interface for CBIR systems named ImageGrouper. In this system, a new concept, Query-by-Groups, is introduced. The Query-by-Groups mode is an extension of the Query-by-Example mode described above. The major difference is that while Query-by-Example handles the images individually, in Query-by-Groups a "group of images" is considered as the basic unit of the query. Figure 3 shows the display layout of ImageGrouper. The interface is divided into two panes. The left pane is the ResultView, which displays the results of content-based retrieval, keyword-based retrieval, and random retrieval.
Fig. 3. The ImageGrouper interface: the Result View (left pane) and the GroupPalette (right pane), with positive, negative, and neutral group boxes and the popup menu.
retrieval, keyword-based retrieval, and random retrieval. This is similar to the traditional GUI except that there are no sliders or buttons under the images. The right pane is the GroupPalette, where the user manages individual images and image groups. In order to create an image group, the user first drags one or more images from the ResultView into the GroupPalette, then encloses the images by drawing a rectangle (box), just as in drawing applications. All the images within the group box become members of this group. Any number of groups can be created in the palette. The user can move images from one group to another at any moment. In addition, groups can overlap each other, i.e., each image can belong to multiple groups. To remove an image from a group, the user simply drags it out of the box. When the right mouse button is pressed on a group box, a popup menu appears so that the user can give query properties (positive, negative, or neutral) to the group. The properties of groups can be changed at any moment, and the colors of the corresponding boxes change accordingly. To retrieve images based on these groups, the user presses the "Query" button placed at the top of the window (Figure 3). Then, the system retrieves new images that are similar to the images in the positive groups while avoiding images similar to the negative groups. The result images are displayed in the ResultView. When a group is specified as neutral (displayed as a white box), it does not contribute to the search at the moment; it can be turned into a positive or negative group later for another retrieval. If a group is positive (displayed as a blue box), the system uses the common features among the images in the group. On the other hand, if a group is given the negative (red box) property, the common features in the group are used as negative feedback. The user can specify multiple groups as positive or negative; in this case, these groups are merged into one group, i.e., the union of the groups is taken. The details of the algorithm are described in Section 5.2.
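To make the group semantics concrete, the following minimal Python sketch models groups with positive, negative, and neutral properties and merges them into the positive and negative example sets used for retrieval. The class and function names are illustrative only and are not taken from the ImageGrouper implementation.

```python
# Minimal sketch (not the ImageGrouper source): image groups carry a query
# property, and a query is built from the union of all positive and all
# negative groups, as described above. Names are illustrative only.
from dataclasses import dataclass, field
from enum import Enum

class GroupProperty(Enum):
    POSITIVE = "positive"   # blue box: common features used as positive feedback
    NEGATIVE = "negative"   # red box: common features used as negative feedback
    NEUTRAL = "neutral"     # white box: ignored for the current retrieval

@dataclass
class ImageGroup:
    name: str
    prop: GroupProperty = GroupProperty.NEUTRAL
    images: set = field(default_factory=set)   # image ids; an image may belong to many groups

def build_query(groups):
    """Merge groups into one positive and one negative example set (their unions)."""
    positives, negatives = set(), set()
    for g in groups:
        if g.prop is GroupProperty.POSITIVE:
            positives |= g.images
        elif g.prop is GroupProperty.NEGATIVE:
            negatives |= g.images
    return positives, negatives

# Example: three flower images as positives, one distractor as a negative.
flowers = ImageGroup("flowers", GroupProperty.POSITIVE, {101, 102, 103})
junk = ImageGroup("not flowers", GroupProperty.NEGATIVE, {250})
print(build_query([flowers, junk]))   # ({101, 102, 103}, {250})
```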
In the example shown in Figure 3, the user is retrieving images of "flowers." In the GroupPalette, three flower images are grouped as a positive group. On the right of this group, a red box represents a negative group that consists of only one image. Below the "flowers" group, there is a neutral group (white box), which is not used for retrieval at this moment. Images can also be moved outside of any group in order to temporarily remove them from the groups. The gestural operations of ImageGrouper are similar to the file operations of a window-based OS. Furthermore, because the user's mission is to collect images, the operation "dragging images into a box" naturally matches the user's cognitive state. 2.3 Flexible Image Retrieval The main advantage of Query-by-Groups is flexibility. Trial and Error Query by Mouse Dragging. In ImageGrouper, images can easily be moved between the groups by mouse drags. In addition, the neutral groups and the space outside of any group in the palette can be used as a storage area [8] for images that are not used at the moment; they can be reused later for another query. This makes trial and error with relevance feedback easier. The user can quickly explore different combinations of query examples by dragging images into or out of a box. Moreover, the query specification that the user made is preserved and visible in the palette. Thus, it is easy to modify the previous decision when the query result is not satisfactory. Groups in a Group. ImageGrouper allows the users to create a new group within a group (Groups in a Group). With this method, the user begins by collecting relatively generic images first, then narrows down to more specific images. Figure 4 shows an example of Groups in a Group. Here, the user is looking for "red cars." When s/he does not yet have enough examples, however, the best way to start is to retrieve images of "cars of any color," because these images may have many common features with red car images even though their color features are different. The large white box is a group for "cars of any color." Once the user has found enough car images, s/he can narrow the search down to red cars only. To do so, the user divides the collected images into two sub-groups by creating two new boxes, one for red cars and one for other cars, and then specifies the red car group as positive and the other car group as negative. In Figure 4, the smaller left (blue, i.e., positive) box is the group of red cars and the right (red, i.e., negative) box is the group of non-red cars. This narrowing-down search is not possible in conventional CBIR systems. 2.4 Experiment on Trial and Error Query In order to examine the effect of ImageGrouper's trial-and-error query, we compared the query performance of our system with that of a traditional incremental approach (Figure 1). In this experiment, we used the Corel photo stock of 17,000 images as the data set. For both interfaces, the same image features and relevance feedback algorithm (described in Section 5.2) were used. For the traditional interface, the top 30 images are displayed and examined by the user in each relevance feedback iteration. For ImageGrouper, the top 20 images are displayed in the ResultView. Only one positive group and one neutral group are created for this
Fig. 4. Groups in a group.
Fig. 5. Overlap between groups. Two images in the overlapped region contain both mountain and cloud.
When keyword search is integrated with CBIR, as in our system and [16], keyword-based search can be used to find the initial query examples for content-based search. With this scheme, the user does not have to annotate all images. In any case, it is very important to provide easy and quick ways to annotate text on a large number of images. 3.1 Current Approaches for Text Annotation The most primitive way to annotate is to select an image and then type in keywords. Because this interaction requires the user to use the mouse and keyboard repeatedly in turn, it is too frustrating for a large image database. Several researchers have proposed smarter user interfaces for keyword annotation of images. In the bulk annotation method of FotoFile [9], the user selects multiple images on the display, selects several attribute/value pairs from a menu, and then presses the "Annotate" button. The user can therefore add the same set of keywords to many images at the same time. To retrieve images, the user selects entries from the menu and then presses the "Search" button. Because of this visual and gestural symmetry [9], the user needs to learn only one tool for both annotation and retrieval. PhotoFinder [21] introduced a drag-and-drop method, where the user selects a label from a scrolling list and drags it directly onto an image. Because the labels remain visible at the designated location on the images and these locations are stored in the database, the labels can be used as "captions" as well as for keyword-based search. For example, the user can annotate the name of a person directly on his/her portrait in the image, so that other users can associate the person with his/her name. When the user needs new words to annotate, s/he adds them to the scrolling list. Because the user drags keywords onto individual images, bulk annotation is not supported in this system.
3.2 Annotation by Groups Most home users do not want to annotate images one by one, especially when the number of images is large. In many cases, the same set of keywords is enough for several images. For example, a user may just want to annotate "My Roman Holiday, 1997" on all images taken in Rome. Annotating the same keywords repeatedly is painful enough to discourage him/her from using the system. ImageGrouper introduces the Annotation-by-Groups method, where keywords are annotated not on individual images but on groups. As in Query-by-Groups, the user first creates a group of images by dragging images from the ResultView into the GroupPalette and drawing a rectangle around them. In order to give keywords to the group, the user opens the Group Information Window by selecting "About This Group" from the pop-up menu (Figure 3). In this window, an arbitrary number of words can be added. Because the users can annotate the same keywords on a number of images simultaneously, annotation becomes much faster and less error prone. Although Annotation-by-Groups is similar to the bulk annotation of FotoFile [9], it has several advantages, described below. Annotating New Images with the Same Keywords. In bulk annotation [9], once the user has finished annotating keywords to some images, there is no fast way to give the same annotation to another image later; the user has to repeat the same steps (i.e., select images, select keywords from the list, then press "Annotate"). This is awkward when the user has to add a large number of keywords. Meanwhile, in Annotation-by-Groups, the system attaches annotations not to each image but to groups. Therefore, by dragging new images into an existing group, the same keywords are automatically given to them. The user does not have to type the same words again. Hierarchical Annotation with Groups in a Group. In ImageGrouper, the user can annotate images hierarchically using the Groups in a Group method described above (Figure 4). For example, the user may want to add the new keyword "Trevi Fountain" to only a part of the image group that has been labeled "My Roman Holiday, 97." This is easily done by creating a new sub-group within the group and annotating only that sub-group. In order to annotate hierarchically in FotoFile [9] with bulk annotation, the user has to select some of the images that are already annotated and then annotate them again with more keywords. ImageGrouper, on the other hand, allows the user to visually construct a hierarchy in the GroupPalette first and then edit keywords in the Group Information Window. This method is more intuitive and less error prone. Overlap between Images. An image often contains multiple objects or people. In such cases, the image can be referred to in more than one context. ImageGrouper supports such multiple references by allowing overlaps between image groups, i.e., an image can belong to multiple groups at the same time. For example, in Figure 5, there are two image groups: "Cloud" and "Mountain." Because some images contain both cloud and mountain, these images belong to both groups and are automatically referred to as "Cloud and Mountain." This concept is not supported in other systems.
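The following sketch illustrates, under our own naming, how Annotation-by-Groups can be realized as a data structure: keywords live on groups, an image inherits the keywords of every group that contains it, and Groups in a Group gives hierarchical annotation by walking up the enclosing groups. This is an illustration of the idea, not the ImageGrouper code.

```python
# Sketch of Annotation-by-Groups (illustrative, not the ImageGrouper code):
# keywords are attached to groups, and an image inherits the keywords of every
# group (and enclosing parent group) that contains it, so overlapping groups
# and Groups-in-a-Group give multiple and hierarchical annotations for free.
class AnnotationGroup:
    def __init__(self, keywords, parent=None):
        self.keywords = set(keywords)
        self.parent = parent          # enclosing group, if this is a sub-group
        self.images = set()           # image ids dragged into this group box

    def add(self, image_id):
        self.images.add(image_id)     # a new image automatically gets the group's keywords

    def all_keywords(self):
        kw, g = set(), self
        while g is not None:          # walk up the Groups-in-a-Group hierarchy
            kw |= g.keywords
            g = g.parent
        return kw

def keywords_of(image_id, groups):
    """Effective annotation of an image: union over all groups that contain it."""
    kw = set()
    for g in groups:
        if image_id in g.images:
            kw |= g.all_keywords()
    return kw

roman = AnnotationGroup({"My Roman Holiday", "1997"})
trevi = AnnotationGroup({"Trevi Fountain"}, parent=roman)   # sub-group inside 'roman'
roman.add(1); trevi.add(2); roman.add(2)
print(keywords_of(2, [roman, trevi]))  # {'My Roman Holiday', '1997', 'Trevi Fountain'}
```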
4
Organizing Images by Groups
In the previous two sections, we described how ImageGrouper supports content-based query as well as keyword annotation. These features are closely related and complementary to each other. In order to annotate images, the user can first collect visually similar images using content-based retrieval with Query-by-Groups, and then attach textual information to the group of collected images. From then on, the user can quickly retrieve the same images using keyword-based search. Conversely, the results of a keyword-based search can be used as a starting point for content-based search. This method is especially useful when the image database is only partially annotated or when the user is searching for images based on visual appearance only. 4.1 Photo Albums and Group Icons As described above, ImageGrouper allows groups to overlap. In addition, the user can attach textual information to these groups. Therefore, groups in ImageGrouper can be used to organize pictures as "photo albums" [9]. Similar concepts are proposed in FotoFile [9] and Ricoh's Storytelling system [1]; in both systems, albums are used for "slide shows" to tell stories to other users. In ImageGrouper, the user can convert a group into a group icon. When the user selects "Iconify" from the popup menu (Figure 3), the images in the group disappear and a new icon for the group appears in the GroupPalette. When the group overlaps another group, the images in the overlapped region remain in the display. Furthermore, the users can manipulate these group icons just as they handle individual images: they can drag the group icons anywhere in the palette, and the icons can even be moved into another group box, realizing groups in a group. Finally, group icons themselves can be used as examples for content-based query. A group icon can be used as an independent query example or combined with other images and groups. In order to use a group icon as a normal query group, the user right-clicks the icon to open a popup menu and then selects "relevant," "irrelevant," or "neutral." In order to combine a group icon with other example images, the user simply draws a new rectangle and drags them into it. The Organize-by-Groups method described here is partially inspired by the Digital Library Integrated Task Environment (DLITE) [5]. In DLITE, each text document as well as the search results is visually represented by an icon, and the user can directly manipulate those documents in a workcenter (direct manipulation). In [8], Jones proposed another graphical tool for query specification, named VQuery, in which the user specifies the query by creating Venn diagrams; the number of matched documents is displayed in the center of each circle. While DLITE and VQuery are systems for text documents, the idea of direct manipulation [5] applies even more naturally to image databases. In a text document database, it is difficult to determine the contents of documents from their icons, so the user has to open another window to investigate the details [5] (in the case of DLITE, a web browser is opened). In image databases, on the other hand, the images themselves (or their thumbnails) can be used for direct manipulation. Therefore, instant judgment by the user is possible [16][22].
5
Implementation
A prototype of ImageGrouper is implemented as a client-server system, which consists of User Interface Clients and a Query Server. They communicate via the HyperText Transfer Protocol (HTTP). 5.1 The User Interface Client The user interface client of ImageGrouper is implemented as a Java2 Applet with the Swing API (Figure 3). Thus, the users can use the system through Web browsers on various platforms such as Windows, Linux, Unix, and Mac OS X. The client interacts with the user and determines his/her interests from the group information or the keyword input. When the "Query" button is pressed, it sends this information to the server, then receives the result from the server and displays it in the ResultView. Because the client is implemented in a multi-threaded manner, it remains responsive while it is downloading images. Thus, the user can drag a new image into the palette as soon as it appears in the ResultView. Note that the user interface of ImageGrouper is independent of the relevance feedback algorithms [18][19] and the extracted image features (described below). Thus, as long as the communication protocols are compatible, the user interface clients can access any image database server with various algorithms and image features. Although the retrieval performance depends on the underlying algorithms and image features used, the usability of ImageGrouper is not affected by those factors. 5.2 The Query Server The Query Server stores all the image files and their low-level visual features. These visual features are extracted and indexed in advance. When the server receives a request from a client, it computes the weights of the features and compares the user-selected images with the images in the database. Then, the server sends back the IDs of the k most similar images. The server is implemented as a Java Servlet that runs on the Apache Web Server and the Jakarta Tomcat Servlet container. It is written in Java and C++. In addition, the server is implemented as a stateless server, i.e., the server does not hold any information about the clients. This design allows different types of clients, such as the traditional user interface [13] (Figure 1) and the 3D Virtual Reality interface [14], to access the same server simultaneously. For home users who wish to organize and retrieve images locally on their PCs' hard disks, ImageGrouper can be configured as a standalone application, in which the user interface and the query server reside on the same machine and communicate directly without a Web server. Image Features. As the visual features for content-based image retrieval, we use three types of features: color, texture, and edge structure. For color features, the HSV color space is used; we extract the first two moments (mean and standard deviation) from each of the HSV channels [25], so the total number of color features is six. For texture, each image is passed through a wavelet filter bank [23] in which the images are decomposed into 10 de-correlated sub-bands, and for each sub-band the standard deviation of the wavelet coefficients is extracted.
Therefore, the total number of texture features is 10. For edge structures, we used the Water-Fill edge detector [27] to extract image structures: we first pass the original images through the edge detector to generate their corresponding edge maps, and eighteen (18) elements are then extracted from each edge map. Relevance Feedback Algorithm. The similarity ranking is computed as follows. First, the system computes the similarity of each image with respect to only one of the features. For each feature i (i ∈ {color, texture, structure}), the system computes a query vector q_i based on the positive and negative examples specified by the user. Then, it calculates the feature distance g_{ni} between each image n and the query vector,

g_{ni} = (p_{ni} - q_i)^T W_i (p_{ni} - q_i),    (1)
where p_{ni} is the feature vector of image n for feature i. For the computation of the distance matrix W_i, we used Biased Discriminant Analysis (BDA); the details of BDA are described in [26]. After the feature distances are computed, the system combines the feature distances g_{ni} into the total distance d_n. The total distance of image n is a weighted sum of the g_{ni},

d_n = u^T g_n,    (2)
where g_n = [g_{n1}, ..., g_{nI}] and I is the total number of features (in our case, I = 3). The optimal solution for the feature weighting vector u = [u_1, ..., u_I] is derived by Rui et al. [19] as

u_i = Σ_{j=1}^{I} √f_j / √f_i,    (3)
where f_i = Σ_{n=1}^{N} g_{ni} and N is the number of positive examples. This gives a higher weight to a feature whose total distance is small: if the positive examples are similar with respect to a certain feature, that feature receives a higher weight. Finally, the images in the database are ranked by their total distance, and the system returns the k most similar images.
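The ranking procedure of Eqs. (1)-(3) can be sketched compactly as follows. The snippet assumes that the per-feature weight matrices W_i have already been obtained (e.g., by BDA [26]; identity matrices are used as stand-ins here), that feature vectors are NumPy arrays, and that Eq. (1) is the quadratic form as reconstructed above; all names are ours.

```python
# Sketch of the ranking in Eqs. (1)-(3). The BDA step that produces the weight
# matrices W_i is outside this snippet; identity matrices are used as a stand-in.
import numpy as np

def rank_images(features, query_vectors, weight_matrices, positives, k=20):
    """
    features[i]        : (n_images, d_i) matrix of feature i for all images
    query_vectors[i]   : (d_i,) query vector q_i for feature i
    weight_matrices[i] : (d_i, d_i) matrix W_i (e.g. from BDA)
    positives          : indices of the positive example images
    """
    I = len(features)
    n = features[0].shape[0]
    g = np.zeros((n, I))
    for i in range(I):                                   # Eq. (1): per-feature distance
        diff = features[i] - query_vectors[i]
        g[:, i] = np.einsum("nd,de,ne->n", diff, weight_matrices[i], diff)
    f = g[positives].sum(axis=0)                         # f_i summed over positive examples
    u = np.sqrt(f).sum() / np.sqrt(f + 1e-12)            # Eq. (3): u_i = sum_j sqrt(f_j)/sqrt(f_i)
    d = g @ u                                            # Eq. (2): d_n = u^T g_n
    return np.argsort(d)[:k]                             # k most similar images

# Toy usage: 3 features (color=6, texture=10, structure=18 dims), 100 images.
rng = np.random.default_rng(0)
feats = [rng.random((100, d)) for d in (6, 10, 18)]
qs = [f[[0, 1]].mean(axis=0) for f in feats]             # query vector from two positives
Ws = [np.eye(f.shape[1]) for f in feats]                 # stand-in for BDA matrices
print(rank_images(feats, qs, Ws, positives=[0, 1], k=8))
```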
6
Future Work
We plan to evaluate our system further with respect to both usability and query performance. In particular, we will investigate the effect of the Groups in a Group query described in Section 2.3. As mentioned in [11], the traditional precision/recall measure is not very suitable for evaluating interactive retrieval systems; therefore, we may need to consider more appropriate evaluation methods for the system [12][22]. Next, in the current system, when more than one group is selected as positive, they are merged into one group, i.e., all images in those groups are considered as positive examples. We are investigating a scheme where different positive groups are treated as different classes of examples [28]. In addition, for advanced users, we are going to add support for group-wise feature selection. Although our system automatically determines the feature weights, advanced users might know which features are important for their query. Thus, we will allow the users to specify which features should be considered for each group: some groups might be important in terms of color features only, while others might be important in terms of structure. Finally, because the implementation of
ImageGrouper does not depend on underlying retrieval technologies, it can be used as a benchmarking tool [12] for various image retrieval systems.
7
Conclusion
In this paper, we presented ImageGrouper, a new user interface for digital image retrieval and organization. In this system, the users search, annotate, and organize digital images by groups. ImageGrouper has several advantages regarding image retrieval, text annotation, and image organization. First, in content-based image retrieval (CBIR), predicting a good combination of query examples is very difficult; thus, trial and error is essential for successful retrieval. However, previous systems assume incremental search and do not support trial-and-error search. The Query-by-Groups concept in ImageGrouper, on the other hand, allows the user to try different combinations of query examples quickly and easily. We showed that this lightweight operation helps the users achieve a higher recall rate. Second, the Groups in a Group configuration makes narrowing-down search possible. This method helps the user find both positive and negative examples and provides him/her with more choices. Next, typing text information for a large number of images is very tedious and time consuming. The Annotation-by-Groups method relieves users of this task by allowing them to annotate multiple images at the same time. The Groups in a Group method realizes hierarchical annotation, which was difficult in previous systems. Moreover, by allowing groups to overlap each other, ImageGrouper further reduces typing. In addition, our concept of image groups also applies to organizing image collections: a group in the GroupPalette can be shrunk into a small icon, and these group icons can be used as "photo albums" which can be directly manipulated and organized by the users. Finally, these three concepts (Query-by-Groups, Annotation-by-Groups, and Organize-by-Groups) share similar gestural operations, i.e., dragging images and drawing a rectangle around them. Thus, once the user has learned one task, s/he can easily adapt to the other tasks. Operations in ImageGrouper are also similar to the file operations used on Windows and Macintosh computers as well as in most drawing programs. Therefore, the user can easily learn to use our system.
Acknowledgement This work was supported in part by National Science Foundation Grant CDA 9624396.
References 1. Balabanovic, M., Chu, L.L. and Wolff, G.J. Storytelling with Digital Photographs. In CHI'00, 2000. 2. Bates, M.J. The design of browsing and berrypicking techniques for the on-line search interface. Online Review, 13(5), pp. 407-431, 1989. 3. Bederson, B.B. Quantum Treemaps and Bubblemaps for a Zoomable Image Browser. HCIL Tech Report #2001-10, University of Maryland, College Park, MD 20742. 4. Chen, J-Y., Bouman, C.A., and Dalton, J.C. Hierarchical Browsing and Search of Large Image Databases. In IEEE Trans. on Image Processing, Vol. 9, No. 3, pp. 442-455, March 2000.
5. Cousins, S.B., et al. The Digital Library Integrated Task Environment (DLITE). In 2nd ACM International Conference on Digital Libraries, 1997. 6. Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T.V. and Yianilos, P.N. The Bayesian Image Retrieval System, PicHunter: Theory, Implementation, and Psychophysical Experiments. In IEEE Transactions on Image Processing, Vol. 9, No. 1, January 2000. 7. Flickner, M., Sawhney, H., et al. Query by Image and Video Content: The QBIC System. In IEEE Computer, Vol. 28, No. 9, pp. 23-32, September 1995. 8. Jones, S. Graphical Query Specification and Dynamic Result Previews for a Digital Library. In UIST'98, 1998. 9. Kuchinsky, A., Pering, C., Creech, M.L., Freeze, D., Serra, B. and Gwizdka, J. FotoFile: A Consumer Multimedia Organization and Retrieval System. In CHI'99, 1999. 10. Laaksonen, J., Koskela, M. and Oja, E. Content-based image retrieval using self-organizing maps. In Proc. of 3rd Intl. Conf. on Visual Information and Information Systems, 1999. 11. Lagergren, E. and Over, P. Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment. In ACM SIGIR'98, 1998. 12. Müller, H., et al. Automated Benchmarking in Content-based Image Retrieval. In Proc. of IEEE International Conference on Multimedia and Expo 2001, August 2001. 13. Nakazato, M., et al. UIUC Image Retrieval System for JAVA, available at http://chopin.ifp.uiuc.edu:8080. 14. Nakazato, M. and Huang, T.S. 3D MARS: Immersive Virtual Reality for Content-based Image Retrieval. In Proc. of IEEE International Conference on Multimedia and Expo 2001. 15. O'Day, V.L. and Jeffries, R. Orienteering in an information landscape: how information-seekers get from here to there. In INTERCHI '93, 1993. 16. Pecenovic, Z., Do, M-N., Vetterli, M. and Pu, P. Integrated Browsing and Searching of Large Image Collections. In Proc. of Fourth Intl. Conf. on Visual Information Systems, Nov. 2000. 17. Rodden, K., Basalaj, W., Sinclair, D. and Wood, K. Does Organization by Similarity Assist Image Browsing? In CHI'01, 2001. 18. Rui, Y., Huang, T.S., Ortega, M. and Mehrotra, S. Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval. In IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, Sept. 1998. 19. Rui, Y. and Huang, T.S. Optimizing Learning in Image Retrieval. In IEEE CVPR '00, 2000. 20. Santini, S. and Jain, R. Integrated Browsing and Querying for Image Databases. IEEE Multimedia, Vol. 7, No. 3, 2000, pp. 26-39. 21. Shneiderman, B. and Kang, H. Direct Annotation: A Drag-and-Drop Strategy for Labeling Photos. In Proc. of the IEEE Intl. Conf. on Information Visualization (IV'00), 2000. 22. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A. and Jain, R. Content-based Image Retrieval at the End of the Early Years. In IEEE PAMI, Vol. 22, No. 12, December 2000. 23. Smith, J.R. and Chang, S-F. Transform features for texture classification and discrimination in large image databases. In Proc. of IEEE Intl. Conf. on Image Processing, 1994. 24. Smith, J.R. and Chang, S-F. VisualSEEk: a fully automated content-based image query system. In ACM Multimedia'96, 1996. 25. Stricker, M. and Orengo, M. Similarity of Color Images. In Proc. of SPIE, Vol. 2420 (Storage and Retrieval of Image and Video Databases III), SPIE Press, Feb. 1995. 26. Zhou, X. and Huang, T.S. A Generalized Relevance Feedback Scheme for Image Retrieval. In Proc. of SPIE Vol. 4210: Internet Multimedia Management Systems, 6-7 November 2000. 27. Zhou, X.S. and Huang, T.
S. Edge-based structural features for content-based image retrieval. Pattern Recognition Letters, Special issue on Image and Video Indexing, 2000. 28. Zhou, X.S., Petrovic, N. and Huang, T.S. Comparing Discriminating Transformations and SVM for Learning during Multimedia Retrieval. In ACM Multimedia '01, 2001.
Toward a Personalized CBIR System*
Chih-Yi Chiu1, Hsin-Chih Lin2,**, and Shi-Nine Yang1
1 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300
{cychiu,snyang}@cs.nthu.edu.tw
2 Department of Information Management, Chang Jung Christian University, Tainan, Taiwan 711
hclin@mail.cju.edu.tw
Abstract. A personalized CBIR system based on a unified framework of fuzzy logic is proposed in this study. The user preference in image retrieval can be captured and stored in a personal profile. Thus, images that appeal to the user can be effectively retrieved. Our system provides users with textual descriptions, visual examples, and relevance feedback in a query. The query can be expressed in a query description language, which is characterized by the proposed syntactic rules and semantic rules. In our system, the semantic gap problem can be eliminated by the use of linguistic terms, which are represented as fuzzy membership functions. The syntactic rules refer to the way that linguistic terms are generated, whereas the semantic rules refer to the way that the membership function of each linguistic term is generated. The problem of human perception subjectivity can be eliminated by the proposed profile updating and feature re-weighting methods. Experimental results demonstrate the effectiveness of our system.
1
Introduction
Content-based image retrieval (CBIR) has received much research interest recently [1-4]. However, several problems prevent CBIR systems from becoming popular. Two examples of these problems are [3-4]: (1) the semantic gap between image features and human perceptions in characterizing an image, and (2) the human perception subjectivity in finding target images. Most CBIR systems provide users with query-by-example and/or query-by-sketch schemes. Since the features extracted from the query are low-level, it is not easy for users to supply a suitable example/sketch in the query. If a query fails to reflect the user preference, the retrieval results may be unsatisfactory. To capture the user preference in image retrieval, relevance feedback provides a useful scheme [5-6]. However, since the features extracted from feedback examples are also low-level, the user may take many feedback iterations to find a target image [7].
* This study was partially supported by the National Science Council, R.O.C., under Grant NSC90-2213-E-309-004 and by the Ministry of Education, R.O.C., under Grant 89-E-FA04-1-4. ** Corresponding author.
To overcome the above-mentioned problems, a personalized CBIR system based on a unified framework of fuzzy logic is proposed in this study. Our system consists of two major phases: (1) database creation and (2) query comparison, as shown in Fig. 1. The database creation phase deals with the methods for feature extraction and linguistic term generation. In this study, Tamura features [8] are used as our texture representation. To eliminate the semantic gap problem in image retrieval, we propose an unsupervised fuzzy clustering algorithm to generate linguistic terms and their membership functions. The linguistic terms provide textual descriptions that abstract human perceptions of images, whereas the membership functions measure the similarity between a query and each database image. The query comparison phase deals with the methods for query parsing, profile updating, feature re-weighting, similarity function inference, and similarity computation. To eliminate the problem of human perception subjectivity in image retrieval, we propose profile updating and feature re-weighting methods to capture the user preference at each (relevance) feedback. The user preference is stored in a personal profile, so that images that appeal to the user can be effectively retrieved.
Fig. 1. The system overview: (a) database creation; (b) query comparison.
2
Database Creation
2.1
Feature Extraction
Our texture features should have the following characteristics. (1) The features characterize low-level texture properties. (2) These properties are perceptually meaningful; humans can easily interpret these properties by textual descriptions. In this study, six Tamura features [8], including coarseness, contrast, directionality, line-likeness, regularity, and roughness, are used to test the system performance.
2.2
Linguistic Term Generation
In this study, degrees of appearance on each feature are interpreted as five linguistic terms, as summarized in Table 1. Each linguistic term is represented as a membership function and is further defined by the proposed syntactic rules (Table 2) and semantic rules (Table 3). The syntactic rules refer to the way that linguistic terms are generated, whereas the semantic rules refer to the way that the membership function of each linguistic term is generated. In this study, the sigmoidal function is used to formulate the membership functions. The membership functions of the linguistic terms on each feature are generated as follows.

Table 1. Linguistic terms for the six features.
Coarseness: very fine, fine, medium coarse, coarse, very coarse
Contrast: very low, low, medium contrast, high, very high
Directionality: very non-directional, non-directional, medium directional, directional, very directional
Line-likeness: very blob-like, blob-like, medium line-like, line-like, very line-like
Regularity: very irregular, irregular, medium regular, regular, very regular
Roughness: very smooth, smooth, medium rough, rough, very rough
Table 2. Syntactic rules.
QueryDescriptionLanguage ::= {QueryExpression ⊕ Connective}
QueryExpression ::= <empty> | TextualDescription | VisualExample
TextualDescription ::= Negation ⊕ Hedge ⊕ LinguisticTerm
VisualExample ::= Negation ⊕ Hedge ⊕ RelevanceAdjective ⊕ TamuraFeature ⊕ #ExampleID
Negation ::= <empty> | 'not'
Hedge ::= <empty> | 'more or less' | 'quite' | 'extremely'
LinguisticTerm ::= 'very fine' | 'fine' | 'medium coarse' | 'coarse' | 'very coarse' | … | 'very smooth' | 'smooth' | 'medium rough' | 'rough' | 'very rough'
TamuraFeature ::= 'coarseness' | 'contrast' | 'directionality' | 'line-likeness' | 'regularity' | 'roughness'
RelevanceAdjective ::= 'relevant' | 'irrelevant'
Connective ::= <empty> | 'and' | 'or'
Algorithm 1. Unsupervised Fuzzy Clustering.
Input: Data sequence (f_1, f_2, ..., f_n), where f_i denotes the value of a feature in the ith database image, and n is the number of database images.
Output: Five membership functions P_1, P_2, ..., P_5 on the feature.
Step 1. Set c_0 = 0, c_6 = 1, and c_j = j/6, j = 1, 2, ..., 5, where c_0 and c_6 are the two bounds of the universe and c_1, c_2, ..., c_5 denote the centers of the five linguistic terms.
Table 3. Semantic rules.
Semantic rules for the membership function µ_Q, where Q is a query expression on a feature:
• LinguisticTerm ⇒ µ_Q(v) = P_j(v), where v is the feature value of the image example and P_j(v) is defined in Eq. (1) (Q is a textual description).
• #ExampleID ⇒ µ_Q(v) = K(v) = 1/(1 + e^{-a(v-b)}) · 1/(1 + e^{-c(v-d)}), where a, b, c, d are the parameters of the membership function K (Q is a set of image examples).
• Hedge ⇒ µ_{Q^h}(v) = [µ_Q(v)]^h
• 'not' ⇒ µ_{¬Q}(v) = 1 − µ_Q(v)
• 'and' ⇒ µ_{Q1∧Q2}(v) = min[µ_{Q1}(v), µ_{Q2}(v)]
• 'or' ⇒ µ_{Q1∨Q2}(v) = max[µ_{Q1}(v), µ_{Q2}(v)]
Step 2. Set the membership matrix U = 0. For each datum f_i, update each element u_{i,j} using one of the following rules:
Rule 1. If f_i ≤ c_1, set u_{i,1} = 1 and u_{i,j≠1} = 0.
Rule 2. If c_j < f_i ≤ c_{j+1}, set u_{i,j} = (c_{j+1} − f_i)/(c_{j+1} − c_j), u_{i,j+1} = 1 − u_{i,j}, and u_{i,k≠j,j+1} = 0.
Rule 3. If f_i > c_5, set u_{i,j≠5} = 0 and u_{i,5} = 1.
Step 3. Compute c_1, c_2, ..., c_5 using c_j = (Σ_{i=1}^{n} u_{i,j} f_i) / (Σ_{i=1}^{n} u_{i,j}). If the change of any c_j exceeds a given threshold, go to Step 2.
Step 4. The membership function P_j(v) of the j-th linguistic term is defined as

P_j(v) = 1/(1 + e^{-a(v-b)}) · 1/(1 + e^{-c(v-d)}),    (1)

where v is the feature value, a = k/(c_j − c_{j−1}), b = (c_j + c_{j−1})/2, c = −k/(c_{j+1} − c_j), d = (c_j + c_{j+1})/2, and k > 0. The parameters a, b, c, d are stored in the personal profile.
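A direct transcription of Algorithm 1 into Python is sketched below. It assumes the feature values have been normalized to the universe [0, 1]; the convergence threshold, the iteration cap, and the constant k are choices of this sketch rather than values given in the paper.

```python
# Sketch of Algorithm 1: five cluster centres on [0, 1], iteratively refined,
# then sigmoidal-product membership functions (Eq. 1) built from the centres.
# Feature values are assumed to be normalized to [0, 1].
import numpy as np

def fuzzy_linguistic_terms(values, k=6.0, tol=1e-4, max_iter=100):
    f = np.asarray(values, dtype=float)
    c = np.array([j / 6.0 for j in range(1, 6)])          # Step 1: initial centres c1..c5
    for _ in range(max_iter):
        u = np.zeros((len(f), 5))                         # Step 2: membership matrix U
        for i, fi in enumerate(f):
            if fi <= c[0]:
                u[i, 0] = 1.0                             # Rule 1
            elif fi > c[4]:
                u[i, 4] = 1.0                             # Rule 3
            else:
                j = np.searchsorted(c, fi) - 1            # c_j < f_i <= c_{j+1}
                u[i, j] = (c[j + 1] - fi) / (c[j + 1] - c[j])   # Rule 2
                u[i, j + 1] = 1.0 - u[i, j]
        new_c = (u * f[:, None]).sum(axis=0) / np.maximum(u.sum(axis=0), 1e-12)  # Step 3
        if np.max(np.abs(new_c - c)) < tol:
            c = new_c
            break
        c = new_c
    centres = np.concatenate(([0.0], c, [1.0]))            # c0 = 0, c6 = 1
    def make_P(j):                                         # Step 4 / Eq. (1)
        a = k / (centres[j] - centres[j - 1])
        b = (centres[j] + centres[j - 1]) / 2.0
        cc = -k / (centres[j + 1] - centres[j])
        d = (centres[j] + centres[j + 1]) / 2.0
        return lambda v: 1.0 / (1.0 + np.exp(-a * (v - b))) / (1.0 + np.exp(-cc * (v - d)))
    return [make_P(j) for j in range(1, 6)]                # P1..P5

P = fuzzy_linguistic_terms(np.random.default_rng(1).random(500))
print([round(float(P[j](0.5)), 3) for j in range(5)])      # memberships of v = 0.5
```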
3
Query Comparison
3.1
Query Parsing
In this study, a query is defined as a logical combination of query expressions over all features. The query is expressed in a query description language, which is characterized by the proposed syntactic rules (Table 2) and semantic rules (Table 3), and is parsed accordingly.
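The sketch below shows how a parsed query expression could be evaluated with the Table 3 semantics (hedge as an exponent, 'not' as complement, 'and'/'or' as min/max). The tuple-based parse tree and the numeric hedge exponents are assumptions of this sketch, not part of the paper's parser.

```python
# Sketch of evaluating a parsed query expression with the Table 3 semantics.
# The nested-tuple representation of the parse tree and the hedge exponents
# below are assumptions, not the paper's.
HEDGES = {None: 1.0, "more or less": 0.5, "quite": 1.25, "extremely": 2.0}  # illustrative exponents

def evaluate(expr, v, terms):
    """
    expr  : ('term', hedge, negated, name) | ('and'|'or', left, right)
    v     : feature value of a database image
    terms : dict name -> membership function (P_j or K)
    """
    op = expr[0]
    if op == "term":
        _, hedge, negated, name = expr
        mu = terms[name](v) ** HEDGES[hedge]          # Hedge rule: mu^h
        return 1.0 - mu if negated else mu            # 'not' rule
    _, left, right = expr
    a, b = evaluate(left, v, terms), evaluate(right, v, terms)
    return min(a, b) if op == "and" else max(a, b)    # 'and'/'or' rules

# "very fine and not (quite) coarse" against toy membership functions.
terms = {"very fine": lambda v: max(0.0, 1.0 - 6.0 * v),
         "coarse": lambda v: min(1.0, max(0.0, 3.0 * v - 1.0))}
query = ("and", ("term", None, False, "very fine"), ("term", "quite", True, "coarse"))
print(round(evaluate(query, 0.1, terms), 3))
```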
3.2
Profile Updating
Suppose a user has posed a query. If the retrieval results are unsatisfactory, the user may pose feedback examples for the next retrieval. At each feedback, the personal profile, i.e., the parameters of the membership functions, is updated as follows. For relevant examples, the weighted average center x̄ of these examples is computed, and the previous membership function is pulled toward the center; we define an error function E = [1 − µ'(x̄)]², where µ' is the previous membership function on the feature. For irrelevant examples, the previous membership function is pushed away from these examples individually; we define an error function E = Σ_j [0 − µ'(f_j)]², where f_j is the feature value (on the feature) of the j-th irrelevant example. To minimize E, the gradient descent method is used as follows:

∆φ = −η (∂E/∂φ),

where φ is a parameter of µ', η is the learning rate, and φ + ∆φ is the updated parameter in the personal profile. Fig. 2 illustrates the underlying idea.
Fig. 2. Updating the membership function through relevance feedbacks: the membership function is pulled toward the weighted average center of the relevant examples and pushed away from the irrelevant examples.
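The profile update can be sketched as follows. The membership function uses the sigmoidal form with parameters (a, b, c, d); for brevity the gradients ∂E/∂φ are approximated numerically here, which is an implementation shortcut of this sketch rather than the paper's analytic derivation.

```python
# Sketch of the profile update: pull the membership function towards the
# weighted average centre of the relevant examples and push it away from each
# irrelevant example, by gradient descent on the parameters (a, b, c, d).
# Numerical gradients are an implementation detail of this sketch only.
import numpy as np

def membership(params, v):
    a, b, c, d = params
    return 1.0 / (1.0 + np.exp(-a * (v - b))) / (1.0 + np.exp(-c * (v - d)))

def update_profile(params, relevant, irrelevant, weights=None, eta=0.5, eps=1e-5):
    relevant = np.asarray(relevant, dtype=float)
    w = np.ones_like(relevant) if weights is None else np.asarray(weights, float)
    x_bar = float((w * relevant).sum() / w.sum())           # weighted average centre

    def error(p):
        e = (1.0 - membership(p, x_bar)) ** 2               # relevant: target mu = 1
        e += sum((0.0 - membership(p, f)) ** 2 for f in irrelevant)  # irrelevant: target 0
        return e

    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(4):                                       # numerical dE/dphi
        step = np.zeros_like(params); step[i] = eps
        grad[i] = (error(params + step) - error(params - step)) / (2 * eps)
    return params - eta * grad                               # phi <- phi + delta(phi)

p0 = np.array([30.0, 0.3, -30.0, 0.6])                       # a, b, c, d from the profile
p1 = update_profile(p0, relevant=[0.52, 0.55], irrelevant=[0.1])
print(np.round(p1, 3))
```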
3.3
Feature Re-weighting
Suppose a user has posed a query. After several feedbacks, the user's emphasis on each feature can be evaluated from the feedback history. We propose the following feature re-weighting algorithm to fine-tune the weight of each feature in image retrieval.
Algorithm 2. Feature Re-weighting.
Input: A series of the previous k weights, denoted W^(k), and the query expression Q on a feature.
Output: A series of k + 1 weights W^(k+1), and the similarity between Q and v on the feature, denoted s_Q(v).
Step 1. If there is no relevant example in Q, set the parameter κ = 1. Otherwise let κ = cos(σ × π/2), where σ is the standard deviation of the relevant examples.
Step 2. Update W^(k) to W^(k+1) as follows:

W^(k+1)_{k+1} = ακ + Σ_{i=1}^{k} β_i^(k) × W_i^(k),

where β^(k) is a series of decreasing coefficients, each of which denotes the corresponding importance in W^(k), and α + Σ β_i^(k) = 1.
Step 3. In the parse tree of the query, two query expressions are combined by a connective c. Let v denote the feature value of a database image. The weighted similarity between Q and v is computed as follows:

s_Q(v) = 1 − W^(k+1)_{k+1} × [1 − µ_Q(v)]   if c = 'and',
s_Q(v) = W^(k+1)_{k+1} × µ_Q(v)   if c = 'or',    (2)

where µ_Q(v) is the membership value of Q for v. Computations of the membership value will be discussed in Sections 3.4 and 3.5.
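A sketch of Algorithm 2 follows. The paper only requires the coefficients β to decrease and α + Σβ_i = 1; the geometric choice of β and the value of α below are assumptions of this sketch.

```python
# Sketch of Algorithm 2 (feature re-weighting). The geometric choice of the
# decreasing coefficients beta and of alpha is an assumption of this sketch;
# the paper only requires alpha + sum(beta) = 1 with beta decreasing.
import numpy as np

def reweight(prev_weights, relevant_values, alpha=0.5):
    """prev_weights: list of the previous k weights W^(k); returns W^(k+1)."""
    rel = np.asarray(relevant_values, dtype=float)
    kappa = 1.0 if rel.size == 0 else np.cos(rel.std() * np.pi / 2.0)   # Step 1
    k = len(prev_weights)
    if k == 0:
        return [alpha * kappa]                                           # no history yet
    beta = np.array([0.5 ** (i + 1) for i in range(k)])                  # decreasing
    beta = (1.0 - alpha) * beta / beta.sum()                             # alpha + sum(beta) = 1
    new_w = alpha * kappa + float((beta * np.asarray(prev_weights)).sum())  # Step 2
    return list(prev_weights) + [new_w]

def weighted_similarity(weight, mu, connective):
    """Step 3 / Eq. (2): combine the membership value with the feature weight."""
    return 1.0 - weight * (1.0 - mu) if connective == "and" else weight * mu

W = reweight([], relevant_values=[0.42, 0.45, 0.47])
W = reweight(W, relevant_values=[0.43, 0.46])
print(W, round(weighted_similarity(W[-1], 0.8, "and"), 3))
```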
3.4
Similarity Function Inference
After the personal profile is updated or the features are re-weighted, new similarity functions must be inferred to reflect the user preference. The inference method is as follows:
Type 1. If Q = <empty>, set µ_Q(v) = 0.
Type 2. If Q is a textual description, set µ_Q(v) = (−1)^{N+1}[N − P_j^h(v)], where P_j is defined in Eq. (1) and h is a hedge; N = 1 if Q is a negative expression, else N = 0.
Type 3. Q is a set of n visual examples. If there is no relevant example in Q, set µ_Q(v) = 0. Otherwise, compute the weighted average center x̄ and the standard deviation σ on the feature and define the membership function as

µ_Q(v) = (−1)^{N+1}[N − K^h(v)],

where K is defined in Table 3 and a = k/(σ + δ), b = x̄ − (σ + δ), c = −a, d = x̄ + (σ + δ), δ > 0, and k > 0. Note that the parameters of µ_Q are stored in the personal profile. Each feature has its own membership functions and an equal feature weight at a new search. The weighted similarity between a query and each database image on the feature is computed using
Eq. 2. Finally, the total similarity function for the query can be inferred through min-max compositions of all weighted similarity functions on each feature. If the previous query on a feature consisted of textual descriptions or visual examples, the current query expression on the feature is treated as a relevance feedback. We use the gradient descent method to modify the membership functions on each feature from the feedback history. Again, the total similarity function is inferred through min-max compositions of all weighted similarity functions. 3.5
Similarity Computation
Let D be a collection of database images and V be the set of feature values for an arbitrary database image. The similarity between the query and each database image is denoted as a fuzzy set A in D: A = {(V, S(V)) | V ∈ D} = Σ_{V∈D} S(V)/V,
where S is the total similarity function inferred from the query, and S(V) is the similarity between the query and the database image V. Our system computes the fuzzy set A and outputs the ranked images according to the similarity in descending order. The user can browse the results and feed relevant/irrelevant examples in the next retrieval if necessary.
4
Experimental Results
Our database contains 1444 texture images collected from the Corel Gallery Collection. Fig. 3a shows the results for the query "very fine ∧ very directional ∧ very regular." The retrieved images are displayed in descending similarity order from left to right and top to bottom. Fig. 3b shows the results when we select the second, fifth, and eighth images in Fig. 3a as relevant examples. To measure the system performance, we use 450 texture images as testing data. The original 50 512×512 texture images are obtained from MIT VisTex; each image is partitioned into nine 170×170 non-overlapping sub-images, which are regarded as relevant images. Fig. 4a shows the PR graph for a conjunction of all queries with feature re-weighting. The increase in precision and recall is largest at the first feedback; this fast convergence is a desirable property. Fig. 4b shows the PR graph for the same queries as in Fig. 4a but without feature re-weighting. Clearly, the performance with feature re-weighting outperforms that without feature re-weighting.
5
Conclusions and Future Work
A personalized CBIR system is proposed in this study. The methods for generating linguistic terms, updating the personal profile, re-weighting features, inferring similarity functions, and computing the similarity are all based on a unified framework of fuzzy logic. According to the experimental results, the semantic gap problem can be
bridged through the use of linguistic terms. The problem of human perception subjectivity can be solved through our profile updating and feature re-weighting algorithms. Besides remedying these problems, our personalized CBIR system can achieve higher accuracy in image retrieval. The PR graphs strongly support the above-mentioned claims.
Fig. 3. (a) Retrieval results for the query "very fine ∧ very directional ∧ very regular;" (b) retrieval results for the three relevant examples from Fig. 3a.
Fig. 4. (a) PR graph with feature re-weighting; (b) PR graph without feature re-weighting. The curves correspond to 0, 1, 2, and 3 relevance feedback iterations.
For future work, we will explore efficient multidimensional indexing techniques to make our system scalable for large image collections. Another important aspect is putting our system into practice. For example, textile pattern retrieval may be a promising application in the future.
References 1. Aigrain, P., Zhang, H. J., Petkovic, D.: Content-Based Representation and Retrieval of Visual Media: A State-of-The-Art Review. Multimedia Tools and Applications 3 (1996) 179-202 2. Idris, F., Panchanathan, S.: Review of Image and Video Indexing Techniques. Journal of Visual Communication and Image Representation 8 (1997) 146-166 3. Rui, Y., Huang, T. S., Chang, S. F.: Image Retrieval: Current Techniques, Promising Directions, and Open Issues. Journal of Visual Communication and Image Representation 10 (1999) 39-62 4. Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1349-1380 5. Minka, T. P., Picard, R. W.: Interactive Learning with a Society of Models. Pattern Recognition 30 (1997) 565-582 6. Rui, Y., Huang, T. S., Mehrotra, S.: Content-Based Image Retrieval with Relevance Feedback in MARS. IEEE International Conference on Image Processing, Vol. 2, Santa Barbara, CA, USA (1997) 815-818 7. Lu, Y., Hu, C., Zhu, X., Zhang, H. J., Yang, Q.: A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems. ACM International Conference on Multimedia, Los Angeles, CA, USA (2000) 31-37 8. Tamura, H., Mori, S., Yamawaki, T.: Texture Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man, and Cybernetics 8 (1978) 460-473
An Efficient Storage Organization for Multimedia Databases
Philip K.C. Tse1 and Clement H.C. Leung2
1 Department of Electrical and Electronic Engineering, University of Hong Kong, Pokfulam Road, Hong Kong SAR, China. ptse@eee.hku.hk
2 School of Communications and Informatics, Victoria University, P.O. Box 14428, MCMC, Vic 8001, Australia. clement@matilda.vu.edu.au
Abstract. Multimedia databases may require storage space so large that magnetic disks become neither practical nor economical. Hierarchical storage systems provide extensive storage capacity for multimedia data at very economical cost, but the long access latency of tertiary storage devices and the large disk buffers required make them infeasible for multimedia databases and visual information systems. In this paper, we investigate the data striping method for heterogeneous multimedia data streams on HSS. First, we have found that the multimedia objects should be striped across all media units to achieve the highest system throughput and the smallest disk buffer consumption. Second, we have proved a feasibility condition for accepting concurrent streams. We have carried out experiments to study its performance, and it is observed that the concurrent striping method can significantly increase the system throughput, reduce the stream response time, and lower the need for disk buffers, offering considerable advantages and flexibility.
1
Introduction
Visual and Multimedia Information Systems (VIS) need to capture, process, store, and maintain a variety of information sources such as text, sound, graphics, images and video [18]. Such a system may be viewed at different levels: a user-transparent multimedia operating system with specific applications sitting on top of it (Fig. 1). The application layer always includes a multimedia database management system, which will rely on a suitable storage structure to support its operation. Multimedia databases need to store a variety of types of data. Popular or frequently accessed multimedia objects may reside permanently in the disks together with metadata, indexes, and other files. Cold multimedia objects and transaction log files are stored on tertiary store. Only the first portion of each object resides in disks. We focus on the retrieval of cold multimedia objects in this paper.
2
The Performance Problem and Relationship with Other Works
Most computer systems store their on-line data on disks, but storing huge amounts of multimedia data on disks is expensive. Multi-level hierarchical storage systems (HSS) provide large capacity at a more economical cost than disk-only systems [1].
However, such a storage structure invariably includes the long access latency of data held in tertiary storage devices [4].
Fig. 1. The performance of multimedia information systems is determined by the underlying storage structure: the multimedia DBMS in the application layer sits on an HSS storage structure, above the multimedia OS and hardware.
Traditionally, tertiary storage devices store each object in its entirety on the media units using the non-striping method. When a burst of streams arrives, response time deteriorates because the streams are served in serial order. This is thus inefficient for multimedia databases, where multiple objects are often accessed simultaneously. The simple striping method and the time-slice scheduling algorithm have been proposed to reduce the stream response time using extra switching [9, 16]. However, the extra switching overheads and the contention for exchange erode system throughput. Hence, these methods are appropriate only under light load conditions. The new concurrent striping method was shown to be efficient for homogeneous streams [30, 32]. We extend the concurrent striping method to handle heterogeneous streams in this paper. Multimedia objects may either be staged or pipelined from tertiary storage devices [28, 31]. We consider only the more efficient pipelining methods in this paper. 2.1
Relationship with Other Works
The continuous display requirement is necessary to guarantee that multimedia data streams can be displayed without interruption. In [24], data blocks of multimedia streams are interleaved using the Storage Pattern Altering policy with a fixed transfer rate over both the media and gap blocks on optical disks. We generalize this interleaving placement method by interleaving streams over the temporal domain instead of the spatial domain. This allows the feasibility condition to be applied to more general storage devices and arbitrary scheduling methods. Many techniques for storing multimedia data strips on disk arrays have been studied in the literature. Data distribution and replication are studied in [6, 26, 33]. Data striping in disk-only systems is analyzed in [2]. Constraint placement methods in [8, 13, 20] provide sufficient throughput for multimedia data retrieval on disks. Our method is the first constraint allocation method on HSS. Much research on the delivery of multimedia data has been done. Piggybacking and patching methods in [3, 11, 12], the multi-casting protocols in [17, 23], intelligent cache management techniques in [21], and proxy server studies in [10, 22, 25, 34]
reduce the need for repetitive delivery of the same objects from the server. Quality of service guarantees over the network are studied in [15, 19, 27]. Some data striping methods on HSS have been proposed [7, 29]. Placement on the tertiary storage devices is optimized for random accesses but multimedia streams retrieve data continuously. In [5], a parallel striping method is studied, and the performance of random workload and the optimal strip width on simple striping systems are considered in [14]. The possibility of striping across all tapes is somehow excluded from the study. We shall describe the concurrent striping method and concurrent streams management in the next Section. We then establish the feasibility conditions in Section 4. We shall present the system performance in Section 5 and the experimental results in Section 6. This paper is concluded in Section 7.
3
Concurrent Striping
In the concurrent striping method, we divide the media units into several groups, one group per tertiary drive, and then arrange the media units in a fixed sequence. Each multimedia data object is partitioned into a number of segments. We assume that each segment is a logical unit that can be displayed for a fixed time after the previous segment has been displayed. We also assume that each object is accessed sequentially only, in a fixed order. The segments are then placed on the media units following this sequence, with one segment per media unit. Each object should have all its segments placed together. When multimedia objects are accessed, the multimedia DBMS initiates new streams to access the data objects. A new stream is accepted only if the maximum number of concurrent streams has not yet been reached; otherwise, the new stream is placed in a stream queue (Fig. 2). Once accepted, a new stream is created; it sends two requests to every tertiary drive and waits. The tertiary drives access data independently, and an accepted stream starts to display data after at least one request has completed at each drive. Each tertiary drive keeps the waiting requests in two queues: the first queue holds waiting requests that access segments on the current media unit, while the second queue holds requests that access data on other media units. The order in which requests are served is controlled by the SCAN scheduling policy. The robot arm serves the exchange requests in a round-robin manner.
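The placement rule can be sketched as follows; how the fixed sequence of media units interleaves across the drive groups is not specified above, so the round-robin interleaving here is an assumption of the sketch.

```python
# Sketch of the placement rule described above: media units are grouped one
# group per tertiary drive and ordered in a fixed sequence, and consecutive
# segments of an object are placed on consecutive media units in that
# sequence (one segment per unit). The interleaved ordering is an assumption.
def stripe_object(n_segments, n_drives, units_per_drive, start_unit=0):
    """Return [(drive, media_unit_within_drive, segment_index), ...]."""
    total_units = n_drives * units_per_drive
    placement = []
    for seg in range(n_segments):
        unit = (start_unit + seg) % total_units     # fixed global sequence of media units
        drive = unit % n_drives                     # interleave the sequence across drives
        local_unit = unit // n_drives               # media unit within that drive's group
        placement.append((drive, local_unit, seg))  # wraps around if segments > units
    return placement

# An object with 8 segments striped over 3 drives with 4 media units each:
for drive, unit, seg in stripe_object(8, n_drives=3, units_per_drive=4):
    print(f"segment {seg} -> drive {drive}, media unit {unit}")
```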
4
Feasibility Conditions
The notations in Table 1 will be used in studying the feasibility conditions. We assume that each stream seeks with an overhead of S seconds and transfers a segment using M seconds. After that, the stream suspends data retrieval for G seconds. Each segment can display for δ seconds. A multimedia stream (M, δ) is acceptable if and only if it satisfies the continuous display requirement: S + M ≤ δ.
(1)
Fig. 2. Concurrent Streams Management
This continuous display requirement must be maintained over a finite period of time. It can temporarily be violated by satisfying requests in advance and keeping the retrieved data in read-ahead buffers. The average ratio of transfer time to display time must, however, be maintained over a finite period of time.

Table 1. Notations
S: access overheads
M: transfer time
G: gap time
δ: display time

4.1
Homogeneous Streams
Multimedia streams are considered homogeneous if all streams have the same display time period δ. Let n streams be characterized by (M_1, δ), (M_2, δ), ..., (M_n, δ). Let S_i be the access overhead time in serving each stream and G_i be the time gap of the ith stream, for i = 1 to n. By the definition of the time gap, we have S_i + M_i + G_i ≤ δ.
(2)
Corollary 1: n streams can be concurrent if and only if S_1 + M_1 + S_2 + M_2 + … + S_n + M_n ≤ δ.
(3)
Due to space limits, the proofs of Corollaries 1 and 2 are omitted here; their validity follows directly as special cases of Corollary 3.
4.2
Heterogeneous Streams
Multimedia streams are considered heterogeneous when their cycle periods are different. Let n streams be characterized by (M_1, δ_1), (M_2, δ_2), ..., (M_n, δ_n) such that not all δ_i are the same. Let S_1 to S_n be the access overhead times in serving each stream. Corollary 2: n streams can be concurrent if and only if
(S_1 + M_1)/δ_1 + (S_2 + M_2)/δ_2 + ... + (S_n + M_n)/δ_n ≤ 1.    (4)

4.3
Heterogeneous Streams with Multiple Devices
When multiple devices are available, the devices may serve the streams independently or in parallel. When the streams are served in parallel, the devices are treated as a single device with different access overheads and transfer rate. When the streams are served independently, one request is served by one device at a time. We assume that the requests can be distributed evenly to the p devices; otherwise, some devices may be overloaded while others are underutilized. Corollary 3: n streams can be concurrent on p independent devices if and only if
(S_1 + M_1)/δ_1 + (S_2 + M_2)/δ_2 + ... + (S_n + M_n)/δ_n ≤ p.    (5)
Proof: If n streams are concurrently served by p devices, then there exists a finite time period δ such that kj requests of the jth streams are served by p devices. By the continuous display requirement, this time period should not exceed the display time of each stream. We have
δ ≤ kjδj , ⇒
j = 1, 2, …, n, (6)
kj 1 ≤ , δ δj
j = 1, 2, …, n.
Since the total retrieval time of all requests must be less than the service time of the p devices over the time period δ, we have,
Σ_{j=1}^{n} k_j (S_j + M_j) ≤ pδ,
⇒ Σ_{j=1}^{n} k_j (S_j + M_j)/δ ≤ p.    (7)
Substituting 1/δ_j ≤ k_j/δ from Eq. (6), we obtain

Σ_{j=1}^{n} (S_j + M_j)/δ_j ≤ p.
Hence, the necessary part is proved. Conversely, we let δ = δ_1 δ_2 … δ_n and let k_j ∈ R such that

k_j = δ/δ_j,   j = 1, 2, …, n,    (8)
⇒ k_j/δ = 1/δ_j,   j = 1, 2, …, n.

Substituting k_j/δ = 1/δ_j from Eq. (8) into the necessity condition, we have

Σ_{j=1}^{n} k_j (S_j + M_j)/δ ≤ p,
⇒ Σ_{j=1}^{n} k_j (S_j + M_j) ≤ pδ.    (9)
Since all terms are positive, we can remove all except the i-th term from Σ_{j=1}^{n} k_j (S_j + M_j). Hence, we obtain

k_i (S_i + M_i) ≤ p k_i δ_i,   i = 1, 2, …, n,
⇒ (S_i + M_i) ≤ p δ_i,   i = 1, 2, …, n.    (10)
That is, requests of the i-th stream can be served within time period δ_i by p devices. As long as the requests are distributed evenly to the devices, the continuous display requirements of all streams are fulfilled. Therefore, the n streams can be accepted and served concurrently.
∎
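As a hedged illustration (not from the paper), Corollary 3 translates directly into an admission test: a new stream is accepted only while the summed ratio (S_j + M_j)/δ_j of all streams, including the candidate, stays at or below the number of independent devices p. The numeric values below are arbitrary.

```python
# Admission test derived from Corollary 3; parameter names follow Table 1.

def can_admit(active, candidate, p):
    """active: list of (S, M, delta) tuples; candidate: (S, M, delta)."""
    load = sum((S + M) / delta for S, M, delta in active + [candidate])
    return load <= p

streams = [(2.0, 8.0, 30.0), (1.5, 6.0, 20.0)]   # illustrative values (seconds)
print(can_admit(streams, (2.0, 10.0, 25.0), p=2))
```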
5 System Performance
To display the streams without starvation, the storage system must retrieve each segment before it is due for display. In the concurrent striping method, the maximum
number of requests that can appear between two consecutive requests of the same stream is less than s. If D drives are serving s streams, each accessing segments of size X, then the continuous display requirement is

DX/δ_j ≥ ω + s(α + X/τ),    (11)
where ω, α, and τ are the media exchange time, reposition time, and data transfer rate of the storage devices, respectively, and δ_j is the display bandwidth of the j-th stream. Since one segment is retrieved for each stream per media exchange in the concurrent striping method, the system throughput is
DsX / (ω + s(α + X/τ)).    (12)
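The following sketch (our illustration, with made-up parameter values loosely inspired by Table 2) evaluates Eqs. (11) and (12): it finds the largest s for which D·X/δ_j still covers the round time ω + s(α + X/τ), and reports the corresponding throughput. The display bandwidth and mean reposition time are assumptions, not figures from the paper.

```python
def round_time(s, X, omega, alpha, tau):
    # time for one service round: one media exchange plus s repositions and transfers
    return omega + s * (alpha + X / tau)

def max_streams(D, X, delta_j, omega, alpha, tau):
    # largest s for which the continuous display requirement, Eq. (11), still holds
    s = 0
    while D * X / delta_j >= round_time(s + 1, X, omega, alpha, tau):
        s += 1
    return s

def throughput(D, s, X, omega, alpha, tau):
    # Eq. (12): data delivered per round divided by the round time
    return D * s * X / round_time(s, X, omega, alpha, tau)

delta_j = 1.4                          # assumed display bandwidth, MB/s
X = 10 * 60 * delta_j                  # a 10-minute segment at that bandwidth, MB
omega, alpha, tau = 55.0, 60.0, 14.5   # exchange time, assumed mean reposition time, transfer rate
s = max_streams(D=3, X=X, delta_j=delta_j, omega=omega, alpha=alpha, tau=tau)
print(s, round(throughput(3, s, X, omega, alpha, tau), 1), "MB/s")
```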
Disk buffers are required to store data that are retrieved from tertiary storage faster than they are consumed. Let E[B] be the time that the tertiary drives spend in serving each group of concurrent requests; the disk buffer size for the j-th stream using the concurrent striping method is
rX − (r δ_j / D) E[B].    (13)
Let E[G] be the expected stream service time; the disk buffer size for the j-th stream using the non-striping method and the parallel striping method is
rZ − δ_j E[G].    (14)

6 Experimental Results
We have created a simulation system to study the storage system performance of a robotic tape library. The media exchange time, reposition length and segment size are randomly generated for each request according to a uniform distribution with ±10% deviation from the mean value. New streams arrive randomly at the system according to the mean stream arrival rate. The other simulation parameters are listed in Table 2.

Table 2. Simulation Parameters
Number of streams: 200 streams
Stream arrival rate: 5 to 60 per hr
No. of tertiary drives: 3
Media exchange time: 55 seconds
Reposition rate: 0.06 sec/inch
Max reposition length: 2000 inches
Segment length: 10 minutes
Transfer rate: 14.5 MB/sec
6.1 Number of Displaying Streams
When the segment size increases, more displaying streams are allowed in both striping methods, whereas the number of displaying streams is almost unchanged in the non-striping method. The concurrent striping method can serve more streams when the segment length is longer (Fig. 3). If the maximum number of concurrent streams is limited by the continuous display requirement in Eq. (11), no starvation occurs. Otherwise, the number of starving requests would increase rapidly.
Fig. 3. Maximum Concurrent Streams
6.2 Maximum System Throughput
The maximum system throughput reflects the ability to clear requests from the waiting queues. The maximum throughput of the concurrent striping method (high concurrency) is always higher than that of the other methods (Fig. 4). The system throughputs of all methods increase when larger segments are used, for three reasons. First, fewer exchanges and repositions are required for larger segments, resulting in lower overhead. Second, larger segments are displayed for a longer time, so more concurrent streams can be accepted to share the same media exchange overhead. Third, the full reposition length is shared under SCAN scheduling among more concurrent streams, so the mean reposition time, and thus the overhead, is reduced. Therefore, the maximum system throughput is higher.

6.3 Stream Response Time
The stream response time indicates the quality of service to users (Fig. 5). The stream response time is dominated by the start-up latency at low stream arrival rates, but by the queue waiting time at high stream arrival rates. At low stream arrival rates, the concurrent striping method responds more slowly than the other two methods, since the drives may be in the middle of a round and new streams need to wait for the media unit containing the first required segment to be exchanged. At fast stream arrivals, the concurrent striping method responds faster than the other methods. As the queue grows, the response time increases rapidly; since the concurrent striping method has the highest throughput, it serves requests the fastest. Therefore, the concurrent striping method reduces stream response time under heavy loads.
Fig. 4. Maximum System Throughput (MB/sec versus segment length in minutes; parallel striping, non-striping and high concurrency, predicted and measured)
Fig. 5. Mean Stream Response Time (seconds versus stream arrival rate per hour; parallel striping, non-striping and high concurrency, predicted and measured)
6.4 Disk Buffer Space
The disk buffer size indicates the amount of resources required by each method (Fig. 6). The largest disk buffer space is used by the non-striping method, which retrieves data well before they are due for display. In both striping methods, the segments reside on different media units. At low stream arrival rates, multiple media exchanges are required to retrieve each object, resulting in lower data retrieval throughput per stream and smaller disk buffers. At fast stream arrivals, more streams are served concurrently in the concurrent striping method. As the segments for each stream are retrieved discontinuously, each object is retrieved at a slower pace and less data are moved to the disk. Thus, the disk buffer size per stream drops in the concurrent striping method.
Fig. 6. Disk Buffer Size (buffer size per stream in MB versus stream arrival rate per hour; parallel striping, non-striping and high concurrency, predicted and measured)
7 Summary and Conclusion
The use of HSS will be inevitable for large multimedia databases in future systems. The main concerns in using these systems are their relatively poor response characteristics and large resource consumption. The concurrent striping method addresses these problems by sharing the switching overheads in HSS among concurrent streams. We have provided a feasibility condition for serving heterogeneous streams on a number of devices based on their access overheads and media transfer rates. The concurrent striping method has several advantages. The first is that its system throughput is higher than that of existing methods. The second is that it can serve more streams than the non-striping method with limited disk buffer space. The third is that new streams respond faster under heavy loads, which is very often the practical operating condition of multimedia databases. These advantages make the concurrent striping method the most efficient storage organization for supporting the operation of multimedia databases and visual information systems.
Unsupervised Categorization for Image Database Overview

Bertrand Le Saux and Nozha Boujemaa

INRIA, Imedia Research Group, BP 105, F-78153 Le Chesnay, France
[email protected]
http://www-rocq.inria.fr/ lesaux
Abstract. We introduce a new robust approach to categorize image databases: Adaptative Robust Competition (ARC). Providing the best overview of an image database helps users browse large image collections. Estimating the distribution of image categories and finding their most descriptive prototypes are the two main issues of image database categorization. Each image is represented by a high-dimensional signature in the feature space. A principal component analysis is performed for every feature to reduce dimensionality. Image database overview by categorization is computed in challenging conditions, since clusters overlap and the number of clusters is unknown. Clustering is performed by minimizing a Competitive Agglomeration objective function with an extra noise cluster collecting outliers.
1 Introduction
Over the last few years, partly due to the development of the Internet, more and more multimedia documents that include digital images have been produced and exchanged. However, locating a target image in a large collection has become a crucial problem. The usual way to solve it consists in describing images by keywords. Because this is a human operation, this method suffers from subjectivity and text ambiguity and requires considerable time to manually annotate a whole database. By image analysis, images can be indexed by automatic descriptions which depend only on their objective visual content. Content-Based Image Retrieval (CBIR) has therefore become a highly active research field. The usual scenario of CBIR is query by example, which consists in retrieving images of the database similar to a given one. The purpose of browsing is to help the user find his image query by providing first the best overview of the database. Since the database cannot be presented entirely, a limited number of key images have to be chosen. This means we have to find the most informative images, which allow the user to know what the database contains. The main issue is to estimate the distribution (usually multi-modal) of image categories. Then we need the most representative image for each category. Practically, this is a critical point in the scenario of content-based query by example: the "page zero" problem. Existing systems often begin by presenting either randomly chosen images or keywords. In the first case, some categories are missed, and some images can be visually redundant. The user has to pick several random subsets to find an image corresponding to the one he has in mind. Only then can the query by example be
performed. In the second case, images are manually annotated with keywords, and the first query is processed using keywords. Thus there is a need for presenting a summary of the database to the user. A popular way to find partitions in complex data is the prototype-based clustering algorithm. The fuzzy version (Fuzzy C-Means [1]) has been constantly improved for twenty years by the use of the Mahalanobis distance [2], the adjunction of a noise cluster [3] or the competitive agglomeration algorithm [4][5]. A few attempts to organize and browse image databases have been made: Brunelli and Mich [6], Medasani and Krishnapuram [7] and Frigui et al. [8]. A key point of categorization is the input data representation. A set of signatures (color, texture and shape) describes the visual appearance of the image. The content-based categorization should be performed by clustering these signatures. This operation is computed in challenging conditions. The feature space is high-dimensional: computations are affected by the curse of dimensionality. The number of clusters in the image database is unknown. Natural categories have various shapes (sometimes hyper-ellipsoidal but often more complex), they overlap and they have various densities. The paper is organized as follows: §2 presents the background of our work. Our method is presented in section 3. The results on image databases are discussed and compared with other clustering methods in section 4, and section 5 summarizes our concluding remarks.
2 Background
The Competitive Agglomeration (CA) algorithm [4] is a fuzzy partitional algorithm which does not require the number of clusters to be specified. Let X = {x_i | i ∈ {1, …, N}} be a set of N vectors representing the images. Let B = {β_j | j ∈ {1, …, C}} represent the prototypes of the C clusters. The CA algorithm minimizes the following objective function:

J = Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)² d²(x_i, β_j) − α Σ_{j=1}^{C} [Σ_{i=1}^{N} u_ji]²    (1)

constrained by:

Σ_{j=1}^{C} u_ji = 1,   for i ∈ {1, …, N}.    (2)
d²(x_i, β_j) represents the distance from an image signature x_i to a cluster prototype β_j. The choice of the distance depends on the type of clusters to be detected. For spherical clusters, the Euclidean distance is used. u_ji is the membership of x_i to a cluster j. The first term is the standard FCM objective function [1]: the sum of weighted square distances. It allows us to control the shape and compactness of clusters. The second term (the sum of squares of the clusters' cardinalities) allows us to control the number of clusters. By minimizing both terms together, the data set is partitioned into the optimal number of clusters while clusters are selected to minimize the sum of intra-cluster distances.
The cardinality of a cluster is defined as the sum of the memberships of each image to this cluster:

N_s = Σ_{i=1}^{N} u_si.    (3)

The membership can be written as:

u_st = u_st^FCM + u_st^Bias,    (4)

where:

u_st^FCM = [1/d²(x_t, β_s)] / Σ_{j=1}^{C} [1/d²(x_t, β_j)],    (5)

and:

u_st^Bias = (α / d²(x_t, β_s)) · (N_s − [Σ_{j=1}^{C} (1/d²(x_t, β_j)) N_j] / [Σ_{j=1}^{C} 1/d²(x_t, β_j)]).    (6)
The first term in equation (4) is the membership term of the FCM algorithm and takes into account only relative distances to the clusters. The second term is a bias term which is negative for low-cardinality clusters and positive for strong clusters. This bias term leads to a reduction of the cardinality of spurious clusters, which are discarded if their cardinality drops below a threshold. As a result only good clusters are conserved. α should provide a balance [4] between the two terms of (1), so α at iteration k is defined by:

α(k) = η_0 exp(−k/τ) · [Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)² d²(x_i, β_j)] / [Σ_{j=1}^{C} (Σ_{i=1}^{N} u_ji)²].    (7)

α is weighted by a factor which decreases exponentially along the iterations. In the first iterations, the second term of equation (1) dominates, so the number of clusters drops rapidly. Then, when the optimal number of clusters is found, the first term dominates and the CA algorithm seeks the best partition of the signatures.
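To make the update rules concrete, here is a compact numpy sketch (not the authors' implementation) of one CA membership update following Eqs. (3)-(7); η_0 and τ are the annealing constants of Eq. (7) and, like the toy data, are chosen arbitrarily here.

```python
import numpy as np

def ca_membership_update(X, B, U, k, eta0=5.0, tau=10.0):
    # X: (N, d) signatures, B: (C, d) prototypes, U: current (C, N) memberships
    d2 = ((X[None, :, :] - B[:, None, :]) ** 2).sum(axis=2)   # d^2(x_i, beta_j)
    d2 = np.maximum(d2, 1e-12)                                # avoid division by zero
    inv = 1.0 / d2
    u_fcm = inv / inv.sum(axis=0, keepdims=True)              # Eq. (5)
    card = U.sum(axis=1)                                      # Eq. (3), N_j
    alpha = eta0 * np.exp(-k / tau) * (
        (U ** 2 * d2).sum() / (card ** 2).sum())              # Eq. (7)
    weighted_card = (inv * card[:, None]).sum(axis=0) / inv.sum(axis=0)
    u_bias = (alpha / d2) * (card[:, None] - weighted_card[None, :])  # Eq. (6)
    return u_fcm + u_bias                                     # Eq. (4)

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
B = rng.normal(size=(4, 5))
U = np.full((4, 100), 0.25)
U = ca_membership_update(X, B, U, k=1)
```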
3 Adaptative Robust Competition (ARC)

3.1 Dimensionality Reduction

A signature space has been built for a 1440-image database (Columbia Object Image Library [9]). It contains 1440 gray-scale images representing 20 objects, where each object is shot every 5 degrees. This feature space is high-dimensional and contains three signatures:

1. Intensity distribution (16-D): the gray level histogram.
2. Texture (8-D): the Fourier power spectrum is used to describe the spatial frequency of the image [10].
3. Shape and structure (128-D): the correlogram of the edge-orientation histogram (in the same way as the color correlogram presented in [11]).
Fig. 1. Distribution of gray level histograms for the Columbia database on the three principal components
The whole space is not necessary to distinguish images. To prevent clustering from becoming computationally expensive, a principal component analysis is performed to reduce the dimensionality. For each feature, only the first principal components are kept. To visualize the problems raised by the categorization of image databases, the distribution of image signatures is shown in figure 1. This figure presents the subspace corresponding to the three principal components of the gray level histogram feature. Each natural category is represented with a different color. Two main problems appear: categories overlap, and natural categories have different and various shapes.
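A minimal sketch of the per-feature reduction described above, assuming each feature is reduced independently by PCA via a plain SVD; the number of kept components and the random data are illustrative only.

```python
import numpy as np

def pca_reduce(feature_matrix, n_components):
    """feature_matrix: (N, d) signatures for one feature; returns (N, n_components)."""
    centered = feature_matrix - feature_matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
signatures = {
    "histogram": rng.random((1440, 16)),
    "texture": rng.random((1440, 8)),
    "shape": rng.random((1440, 128)),
}
reduced = {name: pca_reduce(m, n_components=3) for name, m in signatures.items()}
```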
3.2 Adaptative Competition

α is the weighting factor of the competition process. In equation (7), α is chosen according to the objective function and has the same value and effect for each cluster. However, during the process, α influences the computation of memberships in equations (4) and (6). The term u_st^Bias increases or decreases the membership u_st of data point x_t to cluster s according to the cardinality of the cluster, causing that cluster to be conserved or discarded respectively. Since clusters may have different compactness, the problem is to attenuate the effect of u_st^Bias for loose clusters, in order not to discard them too rapidly. We introduce an average distance for each cluster s:

d²_moy(s) = Σ_{i=1}^{N} (u_si)² d²(x_i, β_s) / Σ_{i=1}^{N} (u_si)²,   for 1 ≤ s ≤ C.    (8)
And an average distance for the whole set of image signatures:

d²_moy = Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)² d²(x_i, β_j) / Σ_{j=1}^{C} Σ_{i=1}^{N} (u_ji)².    (9)

Then, α in equation (6) is expressed as:

α_s(k) = (d²_moy / d²_moy(s)) α(k),   for 1 ≤ s ≤ C.    (10)

The ratio d²_moy/d²_moy(s) is lower than 1 for loose clusters, so the effect of u_st^Bias is attenuated: the cardinality of the cluster is reduced slowly. On the contrary, d²_moy/d²_moy(s) is greater than 1 for compact clusters, so both the memberships to these clusters and their cardinalities are increased: they are more resistant in the competition process. Hence we build an adaptative competition process given by α_s(k) for each cluster s.

3.3 Robust Clustering
A solution to deal with noisy data and outliers is to capture all the noise signatures in a single cluster [3]. A virtual noise prototype is defined, which is always at the same distance δ from every point in the data-set. Let this noise cluster be the first cluster, with its noise prototype denoted β_1. So we have:

d²(x_i, β_1) = δ².    (11)

Then the objective function (1) has to be minimized under the following particular conditions:

– Distances for the good clusters j are defined by:

d²(x_i, β_j) = (x_i − β_j)^T A_j (x_i − β_j),   for 2 ≤ j ≤ C,    (12)

where the A_j are positive definite matrices. If A_j is the identity matrix, the distance is the Euclidean distance, and the prototypes of clusters j for 2 ≤ j ≤ C are:

β_j = Σ_{i=1}^{N} (u_ji)² x_i / Σ_{i=1}^{N} (u_ji)².    (13)

– For the noise cluster j = 1, the distance is given by (11). The noise distance δ has to be specified. It varies from one image database to another, so it is based on data-set statistical information. It is computed as the average distance between image signatures and good cluster prototypes:

δ² = δ_0² Σ_{j=2}^{C} Σ_{i=1}^{N} d²(x_i, β_j) / (N(C − 1)).    (14)

The noise cluster is then supposed to catch outliers that are at an equal mean distance from all cluster prototypes. Initially, δ cannot be computed using this formula, since
distances are not yet computed. It is just initialized to δ_0, and the noise cluster becomes significant after a few iterations. δ_0 is a factor which can be used to enlarge or reduce the size of the noise cluster; in the results presented here, δ_0 = 1. The new ARC algorithm, using adaptative competitive agglomeration and a noise cluster, can now be summarized:

Fix the maximum number of clusters C.
Initialize randomly the prototypes for 2 ≤ j ≤ C.
Initialize memberships with equal probability for each image to belong to each cluster.
Compute initial cardinalities for 2 ≤ j ≤ C using equation (3).
Repeat
  Compute d²(x_i, β_j) using (11) for j = 1 and (12) for 2 ≤ j ≤ C.
  Compute α_j for 1 ≤ j ≤ C using equations (10) and (7).
  Compute memberships u_ji using equation (4) for each cluster and each signature.
  Compute cardinalities N_j for 2 ≤ j ≤ C using equation (3).
  For 2 ≤ j ≤ C, if N_j < threshold, discard cluster j. Update the number of clusters C.
  Update prototypes using equation (13).
  Update the noise distance δ using equation (14).
Until (prototypes stabilize).

Hence a new clustering algorithm is proposed. The next two points address two problems raised by image database categorization.

3.4 Choice of Distance for Good Clusters
What would be the most appropriate choice for (12)? The image signatures are composed of different features which describe different attributes. The distance between signatures is defined as the weighted sum of partial distances for each feature 1 ≤ f ≤ F:

d(x_i, β_j) = Σ_{f=1}^{F} w_{j,f} d_f(x_i, β_j).    (15)

For each feature, the natural categories in image databases have various shapes, most often hyper-ellipsoidal, and overlap each other. To retrieve such clusters, the Euclidean distance is not appropriate, so the Mahalanobis distance [2] is used to discriminate image signatures. For clusters 2 ≤ j ≤ C, partial distances for feature f are computed using:

d_f(x_i, β_j) = |C_{j,f}|^{1/p_f} (x_{i,f} − β_{j,f})^T C_{j,f}^{−1} (x_{i,f} − β_{j,f}),    (16)

where x_{i,f} and β_{j,f} are the restrictions of image signature x_i and cluster prototype β_j to the feature f. p_f is the dimension of both x_{i,f} and β_{j,f}: it is the dimension of the subspace corresponding to feature f. C_{j,f} is the covariance matrix (of dimension p_f × p_f) of cluster j for the feature f:

C_{j,f} = Σ_{i=1}^{N} (u_ji)² (x_{i,f} − β_{j,f})(x_{i,f} − β_{j,f})^T / Σ_{i=1}^{N} (u_ji)².    (17)
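A small numpy illustration (ours, not the paper's code) of Eqs. (16)-(17): the fuzzy covariance matrix of a cluster restricted to one feature, and the volume-normalised Mahalanobis partial distance it induces. The data and memberships are random placeholders.

```python
import numpy as np

def fuzzy_covariance(Xf, u_j, beta_jf):
    """Xf: (N, p_f) restrictions to feature f; u_j: (N,) memberships. Eq. (17)."""
    diff = Xf - beta_jf
    w = u_j ** 2
    return (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / w.sum()

def mahalanobis_partial(x_f, beta_jf, C_jf):
    # Eq. (16): volume-normalised Mahalanobis distance for one feature subspace
    p_f = len(x_f)
    diff = x_f - beta_jf
    return np.linalg.det(C_jf) ** (1.0 / p_f) * diff @ np.linalg.solve(C_jf, diff)

rng = np.random.default_rng(0)
Xf = rng.normal(size=(200, 8))          # e.g. the 8-D texture feature
u_j = rng.random(200)
beta = (u_j[:, None] ** 2 * Xf).sum(axis=0) / (u_j ** 2).sum()   # Eq. (13) restricted to f
C = fuzzy_covariance(Xf, u_j, beta)
print(mahalanobis_partial(Xf[0], beta, C))
```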
3.5 Normalization of Features
The problem is to compute the weights w_{j,f} used in equation (15). The features have different orders of magnitude and different dimensions, so the distance over all features cannot be defined as a simple sum of partial distances. The idea is to learn the weights during the clustering process. Ordered Weighted Averaging [12] is used, as proposed in [8]. First, partial distances are sorted in ascending order. For each feature f, the rank of the corresponding partial distance is obtained:

r_f = rank(d_f(x_i, β_j)).    (18)

And the weight at iteration k > 0 is updated using:

w_{j,f}^{(k)} = w_{j,f}^{(k−1)} + 2(F − r_f) / (F(F + 1)).    (19)
It has two positive effects. First, features with small values are weighted with a higher weight than those with large values, so the sum of partial distances is balanced. Secondly, since the weights are computed during the clustering process, if some images are found to be similar according to one feature, their partial distance will be small and the effect of this feature will be accentuated: this allows finding a cluster which contains images similar according to a single main feature.

3.6 Algorithm Outline

Fix the maximum number of clusters C.
Initialize randomly the prototypes for 2 ≤ j ≤ C.
Initialize memberships with equal probability for each image to belong to each cluster.
Initialize feature weights uniformly for each cluster 2 ≤ j ≤ C.
Compute initial cardinalities for 2 ≤ j ≤ C.
Repeat
  Compute the covariance matrices for 2 ≤ j ≤ C and feature subsets 1 ≤ f ≤ F using (17).
  Compute d²(x_i, β_j) using (11) for j = 1 and (16) for 2 ≤ j ≤ C.
  Update the weights for clusters 2 ≤ j ≤ C using (19) for each feature.
  Compute α_j for 1 ≤ j ≤ C using equations (10) and (7).
  Compute memberships u_ji using equation (4) for each cluster and each signature.
  Compute cardinalities N_j for 2 ≤ j ≤ C.
  For 2 ≤ j ≤ C, if N_j < threshold, discard cluster j. Update the number of clusters C.
  Update prototypes using equation (13).
  Update the noise distance δ using equation (14).
Until (prototypes stabilize).
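As a small illustration of the rank-based weight update of Eqs. (18)-(19) used in the outline above (a sketch under our own naming, not the authors' code): features whose partial distances are small receive the larger weight increments.

```python
import numpy as np

def owa_weight_update(weights, partial_distances):
    """weights, partial_distances: (F,) arrays for one cluster and one signature."""
    F = len(weights)
    ranks = np.argsort(np.argsort(partial_distances)) + 1   # rank 1 = smallest distance
    return weights + 2.0 * (F - ranks) / (F * (F + 1))      # Eq. (19)

w = np.full(3, 1.0 / 3.0)                    # uniform initial weights, F = 3 features
d_partial = np.array([0.2, 1.5, 0.7])        # e.g. histogram, texture, shape distances
print(owa_weight_update(w, d_partial))
```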
Fig. 2. left: ground truth: the 20 objects of the Columbia database, right: Summary obtained with ARC algorithm
Fig. 3. left: Prototypes of clusters obtained with SOON algorithm, right: Prototypes of clusters obtained with CA algorithm
4 Results and Discussion

The ARC algorithm is compared with two other clustering algorithms: the basic CA algorithm [4] and the Self-Organization of Oscillator Network (SOON) algorithm [8]. The SOON algorithm can be summarized as follows:

1. Each image signature is associated with an oscillator characterized by a phase variable that belongs to [0, 1].
2. Whenever an oscillator's phase reaches 1, it resets to 0 and the other oscillators' phases are either increased or decreased according to a similarity function.
Table 1. This matrix shows how many pictures of each object belong to a cluster obtained with ARC. Object 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Cluster 1 72 . . . . . . . . . . . . . . . . . . . 2 . 3 1 1 . . . . . . 2 . 3 . . . . . . . 3 . . 48 . 4 4 . . . 5 . . . . . . . . 4 . 4 . 3 4 70 . . . 15 . . . . . . . 13 . . . . 5 . . . . 32 . . . . . 1 . . . . . . . . . 6 . . . . . . . . . . . . . . . . . . . . 7 . . . . 3 . 67 . . . 12 . . . . . . . . . 8 . . . . 2 . 5 57 . . 1 . . . . . . . . . 9 . . . . 13 . . . 70 5 . . . . . . . . . . 10 . . . . . . . . . . . . . . . . . . . . 11 . 9 . . . . . . . 1 51 . . . . . . . . . 12 . . . . 3 . . . . 5 . 72 . . . . . . . . 13 . 22 . . . . . . . . 5 . 21 . . . . . . . 13 . 12 . . . . . . . . . . 48 . . . . . . . 14 . . . . . 1 . . . . . . . 72 . . . . 1 . 15 . . . . . . . . . . . . . . 72 . . . . . 16 . . . . . 2 . . . . . . . . . 59 . . . . 17 . . . . . . . . . . . . . . . . 72 . . . 18 . . . . . . . . . . . . . . . . . 72 . . 19 . . 18 . 2 35 . . . 14 . . . . . . . . 26 . 19 . . . 1 2 16 . . . 16 . . . . . . . . 23 . 19 . . 11 . 1 14 . . . . . . . . . . . . 19 . 20 . . . . . . . . . 2 . . . . . . . . . 72 noise . 23 5 . 10 . . . 2 24 . . . . . . . . . .
3. Oscillators begin to clump together in small groups. Within each group, oscillators are phase-locked. After a few cycles, existing groups get bigger by absorbing other oscillators and merging with other groups.
4. Eventually, the system reaches a stable state where the image signatures are organized into the optimal number of stable groups.

For each category, a prototype is chosen according to the following steps:
• The average value of each feature is computed over the images.
• Then, the average of all images defines a virtual prototype.
• The real prototype is the image nearest to the virtual one.

The ground truth of the Columbia database is shown in figure 2. The three summaries are presented in figures 2 and 3. Nearly all the natural categories are retrieved with the three methods. But with the SOON or CA algorithms, some categories are split into several clusters, so several prototypes are redundant. Our method provides a better summary with less redundancy. Tables 1 and 2 present the membership matrices of objects to clusters, which describe the content of each cluster. Since the simple CA algorithm has no cluster to collect ambiguous image signatures, clusters obtained with this method are noisy. Besides the main natural category retrieved in a cluster, there are always other images which belong to a neighbouring cluster or to a wide-spread cluster. This problem is solved with both other methods. With the ARC or SOON algorithms, more than a third of the categories are perfectly clustered, i.e. all the images of a single category are grouped in a single cluster.
Table 2. The left matrix shows how many pictures of each object belong to a cluster obtained with CA and the right matrix shows the result of the same experiment with SOON. Object 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Cluster 1 42 . . 4 . . . 1 . 2 6 . . . . . . . . . 1 30 . . . . . . 9 . . 1 . . . . . . . . . 2 . 35 . . . . 3 1 . . 1 . . . . . . . . . 3 . . 8 . . 30 . . . . . . . . . . . . 26 . 3 . . 10 . . . . . . 1 . . . . . . . . 10 . 4 . 1 2 31 22 . . 1 3 3 . . . . . . . . . . 5 . . . . 10 . 5 . . 54 3 . . . . . . . . . 6 . . . . . . . . . . . . . . . . . . . . 7 . . . . 1 . 61 . . . . . . . . . . 14 . . 8 . . . . 2 . . 21 19 . . . . . . . . . . 44 9 . . . . 5 . . 19 47 . . . . . . . . . . . 10 . . . . . . . . . . . . . . . . . . . . 11 . 5 . . 1 . . 3 . . 49 . . . . . . . . . 12 . . . . 12 . . . . . . 72 . . . . . . . . 13 . 17 . . . . . . . . 6 . 72 . . . . . . . 14 . . . . . . . 6 . . . . . 72 . . . . . . 15 . . . . . . 1 . . . . . . . 33 . . . . . 15 . . . . . . 2 . . . 4 . . . 39 . . . . . 16 . 13 . 37 . . . 12 . . 2 . . . . 72 . . . . 17 . . . . 1 . . . . . . . . . . . 72 . . . 18 . . . . 10 . . . . 3 . . . . . . . 29 . . 18 . . . . . . . . . 1 . . . . . . . 29 . . 19 . . 40 . 8 25 . . . 8 . . . . . . . . 26 . 19 . . 12 . . 17 . . . . . . . . . . . . 10 . 20 . . . . . . . . 3 . . . . . . . . . . 28 Object 1 2 Cluster 1 21 . 1 51 . 2 . . 3 . . 4 . . 5 . . 5 . . 6 . . 6 . . 7 . . 8 . . 8 . . 9 . . 10 . . 10 . . 10 . . 11 . . 12 . . 13 . . 14 . . 15 . . 16 . . 17 . . 18 . . 18 . . 19 . . 20 . . noise . 72
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 . . . 7 . . . 4 40 . . . . . . . . . . . . . . . . 2 . 19
. . . . 72 . . . . . . . . . . . . . . . . . . . . . . .
. . . . . 15 19 . . . . . . . . . . . . . . . . . . . . 38
. . . 6 . . . 5 43 . . . . . . . . . . . . . . . . 3 . 15
. . . . . . . . . . . . . . . . . . . . . . . . . . . 72
. . . . . . . . . . 16 40 . . . . . . . . . . . . . . . 16
. . . . . . . . . . . . 14 . . . . . . . . . . . . . . 57
. . . . . . . . . . . . . 10 16 10 . . . . . . . . . . . 36
. . . . . . . . . . . . . . . . 26 . . . . . . . . . . 46
. . . . . . . . . . . . . . . . . 72 . . . . . . . . . .
. . . . . . . . . . . . . . . . . . 13 . . . . . . . . .
. . . . . . . . . . . . . . . . . . . 71 . . . . . . . 1
. . . . . . . . . . . . . . . . . . . . 72 . . . . . . .
. . . . . . . . . . . . . . . . . . . . . 72 . . . . . .
. . . . . . . . . . . . . . . . . . . . . . 72 . . . . .
. . . . . . . . . . . . . . . . . . . . . . . 39 33 . . .
. . . . . . . 6 42 . . . . . . . . . . . . . . . . 5 . 19
. . . . . . . . . . . . . . . . . . . . . . . . . . 72 .
Fig. 4. left: cluster of object ‘drugs package’ obtained by ARC, and right: cluster of object ‘drugs package’ obtained by CA algorithm
Fig. 5. cluster of object ‘drugs package’ obtained by SOON algorithm
The other natural categories present more variation among their images, so they are more difficult to retrieve. Let us consider one of these categories: the images representing the drug package 'tylenol'. It presents several difficulties: it is wide-spread, and another category which represents another drug package is very similar. The cluster formed with the CA algorithm contains 71 images but only 47 images of the good category (see figure 4). The cluster formed with the SOON algorithm has no noise but contains only 14 images (among 72) (figure 5). With our method, a cluster of 88 images is found, with 18 noisy images and 70 good images. The CA algorithm suffers from the noisy data, which prevent it from finding the good clusters. On the contrary, the SOON algorithm rejects a lot of images into the noise cluster: the good clusters are thus pure, but more than a quarter of the database is considered as noise. Since whole categories can be rejected (table 2 shows that 2 complete categories of the Columbia database are in the noise cluster), the image database is not well represented. The ARC method avoids these drawbacks. It finds clusters which contain almost all images of the natural category, with only a small amount of noise. The noise cluster contains only really ambiguous images which would otherwise affect the results by biasing the clustering process.
5 Conclusion
We have presented a new unsupervised and adaptative clustering algorithm to categorize image databases: ARC. When the prototypes of each category are picked and collected together, it provides a summary of the image database. It addresses the problems raised by image database browsing, and more specifically the "page zero" problem. It computes the optimal number of clusters in the dataset. It assigns outliers and ambiguous image signatures to a noise cluster, to prevent them from biasing the categorization process. Finally, it uses an appropriate distance to retrieve clusters of various shapes and densities.
References
1. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press (1981)
2. Gustafson, E.E., Kessel, W.C.: Fuzzy clustering with a fuzzy covariance matrix. In: IEEE CDC, San Diego, California (1979) 761–766
3. Dave, R.N.: Characterization and detection of noise in clustering. Pattern Recognition Letters 12 (1991) 657–664
4. Frigui, H., Krishnapuram, R.: Clustering by competitive agglomeration. Pattern Recognition 30 (1997) 1109–1119
5. Boujemaa, N.: On competitive unsupervized clustering. In: Proc. of ICPR'2000, Barcelona, Spain (2000)
6. Brunelli, R., Mich, O.: Image retrieval by examples. IEEE Transactions on Multimedia 2 (2000) 164–171
7. Medasani, S., Krishnapuram, R.: Categorization of image databases for efficient retrieval using robust mixture decomposition. In: Proc. of the IEEE Workshop on Content Based Access of Images and Video Libraries, Santa Barbara, California (1998) 50–54
8. Frigui, H., Boujemaa, N., Lim, S.A.: Unsupervised clustering and feature discrimination with application to image database categorization. In: NAFIPS, Vancouver, Canada (2001)
9. Nene, S.A., Nayar, S.K., Murase, H.: Columbia object image library (coil20). Technical report, Department of Computer Science, Columbia University, http://www.cs.columbia.edu/CAVE/ (1996)
10. Niemann, H.: Pattern Analysis and Understanding. Springer, Heidelberg (1990)
11. Huang, J., Kumar, S.R., Mitra, M., Zu, W.J.: Spatial color indexing and applications. In: ICCV, Bombay, India (1998)
12. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decision making. Systems, Man and Cybernetics 18 (1988) 183–190
A Data-Flow Approach to Visual Querying in Large Spatial Databases

Andrew J. Morris¹, Alia I. Abdelmoty², Baher A. El-Geresy¹, and Christopher B. Jones²

¹ School of Computing, University of Glamorgan, Treforest, Wales, CF37 1DL, UK
² Department of Computer Science, Cardiff University, Cardiff, Wales, CF24 3XF, UK

Abstract. In this paper a visual approach to querying in large spatial databases is presented. A diagrammatic technique utilising a data flow metaphor is used to express different kinds of spatial and non-spatial constraints. Basic filters are designed to represent the various types of queries in such systems. Icons for different types of spatial relations are used to denote the filters. Different granularities of the relations are presented in a hierarchical fashion when selecting the spatial constraints. The language constructs are presented in detail and examples are used to demonstrate the expressiveness of the approach in representing different kinds of queries, including spatial joins and composite spatial queries.
1 Introduction
Large spatial databases, such as Computer Aided Design and Manufacture (CAD/CAM), Geographic Information Systems (GIS) and medical and biological databases, are characterised by the need to represent and manipulate a large number of spatial objects and spatial relationships. Unlike traditional databases, most concepts in those systems have spatial representations and are therefore naturally represented using a visual approach. GIS are a major example of spatial databases with a large number of application domains, including environmental, transportation and utility mapping. Geographic objects, usually stored in the form of maps, may be complex, formed by grouping other features, and may have more than one spatial representation which changes over time. For example, a road object can be represented by a set of lines forming its edges or by a set of areas between its boundaries. Users of current GIS are expected to be non-experts in the geographic domain as well as possibly casual users of database systems. Alternative design strategies for query interfaces, besides the traditional command-line interfaces, are sought to produce more effective GIS and to enhance their usability. The current generation of GIS have mostly textual interfaces or menu-driven ones that allow some enhanced expression of the textual queries [Ege91]. Problems with textual query languages have long been recognised [Gou93], including the need to know the structure of the database schema before writing a query as well as problems of semantic and syntactic errors. Problems are compounded in a geographic database where geographic features can be represented by more
than one geometric representation and the semantics and granularity of spatial relations may differ across systems and application domains. In this paper, the focus is primarily on the process of query formulation. A visual approach is proposed to facilitate query expression in those systems. The approach addresses some of the basic manipulation issues, namely, the explicit representation of the spatial types of geographic features and the qualitative representation of spatial relationships. A diagrammatic technique is designed around the concept of a filter to represent constraints and implemented using direct manipulation. Filters, represented by icons, denote spatial and non-spatial constraints. Spatial constraints are computed through the application of spatial operators on one spatial entity, e.g. calculating the area of a polygon, or on more than one spatial entity, e.g. testing whether a point object is inside a polygon object. Different granularities of binary spatial filters are used and may be defined in the language; for example, a general line-cross-area relationship may be specialised to indicate the number of points the two objects share, etc. The concept of a filter is used consistently to construct complex queries from any number of sub-queries. The aim is to provide a methodology for a non-expert user to formulate and read relatively complex queries in spatial databases. Notations are used to distinguish query (and sub-query) results, to provide means of storing query history as well as to provide a mechanism for query reuse. A prototype of the approach has been implemented and evaluation experiments are currently underway. GIS are the main examples used in this paper. However, the approach proposed may be applied to other types of spatial databases. The paper is structured as follows. Section 2 lists some general requirements and problems identified for query interfaces to spatial databases. A discussion of related work is presented in section 3. In section 4, the data flow approach is first described and the language constructs are then presented in detail. This is followed in section 5 by an overview of the implementation and evaluation of the produced interface, concluding with a summary in section 6.
2 General Requirements and Identified Problems
Several issues related to the design of query interfaces to spatial databases are identified as follows. Some of these issues can be addressed at the language design level, while others need to be addressed at the implementation level of the query interface. Issues arising due to the spatial nature of the database include:

Representation of Spatial Objects: Geographic objects have associated spatial representations to define their shape and size. Objects may be associated with more than one spatial representation in the database to handle different map scales or different application needs. Spatial representations of objects determine and limit the types of spatial relationships that they may be involved in. Explicit representation of the geometric type(s) of geographic features is needed to allow the user to express appropriate constraints over their locations.
Fig. 1. Types of overlap relationship between two spatial regions.
Spatial operations and joins: It is difficult for a non-expert user to realise all the possible spatial operations that may be applied to a geographic object or the possible spatial relationships that may be computed over sets of geographic objects. The semantics of the operations and relationships are implicit in their names. Those names may not have unique meanings for all users and are dependent on their implementation in the specific system in use. For example, an overlap relationship between two regions may be generalised to encompass the inside relationship in one implementation, or may be specific to mean only partial coverage in another, as shown in figure 1. In this paper a visual, qualitative representation of spatial operations and relationships is proposed to facilitate their direct recognition and correct use. Also, different granularities of spatial relationships need to be explicitly defined to express different levels of coarse and detailed spatial constraints.
Composite spatial constraints: Multiple spatial constraints are used in query expressions. Again, the semantics of the composite relation may be vague, especially when constraints are combined using the binary logical operators And and Or. Means of visualising composite spatial relations would therefore be useful, e.g. "Object1 is north-of Object2 and close to it but outside a buffer of 10 m from Object3".
Self spatial joins: Problems with the expression of self joins were noted earlier in traditional databases [Wel85]. The same is true in spatial databases, but is complicated by the use of spatial constraints in the join, e.g. "Find all the roads that intersect type A roads".
Query History: Visualising the results of sub-queries during the process of query formulation is useful, as users tend to create new queries by reworking a previous query or using parts thereof; this suggests the inclusion of query history.
Other general database issues include parenthesis complexity when specifying the order of Boolean operators with parentheses as the query grows [Wel85, JC88, MGP98], problems when using the Boolean logic operators And and Or, and common syntactic errors such as omitting quotation marks around data values where required [Wel85] and applying numeric operators to non-numeric fields. The approach proposed in this paper attempts to handle some of the above issues that can be addressed at the language design level. Other issues are left to the implementation stage of the query interface.
3 Related Work
Querying interfaces to GIS can be broadly categorised into textual interfaces and non-textual interfaces. Several text-based extensions to SQL have been
proposed (e.g. [Ege91, IP87, RS99]). Spatial extensions to SQL inherit the same problems of textual query languages to traditional databases. Typing commands can be tiring and error prone [EB95], with difficult syntax that is tedious to use [Ege97]. In [Gou93] it was noted that users can spend more time thinking about command tools than thinking of the task that they have set out to complete. The Query-by-Example model [Zlo77] has also been explored in several works. QPE [CF80] and PICQUERY [JC88] are examples of such extensions. Users formulate queries by entering examples of possible results into appropriate columns on empty tables of the relations to be considered. Form-based extensions often do not release the user from having to perform complicated operations in expressing the queries nor from having to understand the schema structure. Also, complex queries usually need to be typed into a condition box that is similar to the WHERE clause of an SQL statement. Visual languages have been defined as languages that support the systematic use of visual expressions to convey meaning [Cha90]. A great deal of work is already being carried out to devise such languages for traditional and object-oriented databases in an attempt to bridge the gap of usability for users. Iconic, diagrammatic, graph-based and multi-modal approaches are noted. Lee and Chin [LC95] proposed an iconic language, where icons are used to represent objects and processes. A query is expressed by building an iconic diagram of a spatial configuration. Difficulties with this approach arise from the fact that objects in a query expression need to be explicitly specified along with their associated class and attributes, which renders the language cumbersome for the casual user [Ege97]. Sketch-based languages are interesting examples of the visual approach. In the CIGALES system proposed by Mainguenaud and Portier [MP90], users are allowed to sketch a query by first selecting an icon of a spatial relationship and then drawing the query in the "working area" of the interface. LVIS is an extension to CIGALES [PB99] where an attempt is made to provide the functionality of a query language. Egenhofer [Ege97] and Blaser [Bla98] have also proposed a sketch-based approach where a sketch of the query is drawn by the user and interpreted by the system. A set of query results is presented to the user including exact and near matches. Sketch-based approaches are suitable for expressing similarity-based queries to spatial databases and can become complex to use in a general context when composite queries are built. Also, they either assume that users are able to sketch a query and express spatial relationships in a drawing or rely on different modalities for offering the user guidance in developing the sketch. Exact queries can be generally ambiguous due to several possible interpretations of the visual representation.
4 Language Description
Query diagrams are constructed using filters, represented by icons, between data input and output elements. Queries are visualised by a flow of information that
may be filtered or refined. The approach is based on, but substantially modifies and extends, an early example of a filter flow metaphor proposed by Young and Shneiderman [YS93]. In [YS93] a single relation was used over which users could select the attributes to constrain. The metaphor of water flowing through a series of pipes was used and the layout of the pipes indicated the binary logic operators of And and Or. Line thickness was used to illustrate the amount of flow, or data, passing through the pipes and attribute menus were displayed on the lines to indicate the constraints. Join operations were not expressed in [YS93] nor were there indications of means of handling query results. The idea was simply presented using one relation as input. The idea was later used by Murray et al. [MPG98] to devise a visual approach to querying object-oriented databases. In this paper, the basic idea of data flow between data source and results is utilised. The concept of a filter between both source and result is introduced to indicate the type of constraint expressed, whether non-spatial or spatial, as well as the type of the spatial constraint in the latter case. Spatial and non-spatial join operations are also expressed consistently. Graphical notations for intermediate query results are used to allow for tracing query histories and reuse of queries (and sub-queries). In what follows the query constructs are described in detail.

4.1 Database Schema
Consider the following object classes to be used as an example schema:

County (cname: string, geometry: polygon, area: float, population: integer, other-geometry: point)
Town (tname: string, geometry: polygon, area: float, town-twin: string, tpopulation: integer, county: county)
Road (rname: string, geometry: line, rtype: string, rcounty: string, rsurface: string)
Supermarket (sname: string, geometry: point, town: string, onroad: string)
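Purely as an illustration (not part of the paper's prototype), the example schema can be written down as Python dataclasses, which makes the attribute types and the point/line/polygon geometry of each class explicit; the geometry type aliases are our own simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Line = List[Point]
Polygon = List[Point]

@dataclass
class County:
    cname: str
    geometry: Polygon
    area: float
    population: int
    other_geometry: Point      # alternative representation for small-scale maps

@dataclass
class Town:
    tname: str
    geometry: Polygon
    area: float
    town_twin: str
    tpopulation: int
    county: County

@dataclass
class Road:
    rname: str
    geometry: Line
    rtype: str
    rcounty: str
    rsurface: str

@dataclass
class Supermarket:
    sname: str
    geometry: Point
    town: str
    onroad: str
```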
In figure 2, object classes are depicted using a rectangular box containing the name of the class and an icon representing its spatial data type, whether point, line, polygon or any other composite spatial data type defined in the database, e.g. a network. This offers the user initial knowledge of the spatial representation associated with the feature. A thick edge on the icon box is used if the object has more than one spatial representation in the database. Switching between representations is possible by clicking on the icon box. For example, a County object is represented by a polygon to depict its actual shape and by a point for manipulation on smaller scale maps. All other information pertaining to the class is accessible when the user selects the class and then chooses to view its attributes. At this point we are not primarily concerned about how the database schema is depicted, but we focus on the aspect of query visualisation. As queries are constructed, the extent of the class chosen as input to the query will flow through filters to be refined according to the constraints placed on it. Results from a query or a sub query contain the new filtered extents, and these can be used to provide access to the intermediate results as well as final results of a query or as input to other sections of the query.
Fig. 2. Example Schema. The basic spatial representation of the objects is depicted in the icons.
Fig. 3. a) An aspatial filter and a spatial filter. b) Depicting query results. "Select All From Road Where Road.rtype = 'motorway'". c) A spatial filter in a simple query construct.
A basic query skeleton consists of data input and data output elements and a filter in between. Every input object will have a related result object that can be displayed in the case of spatial objects.

4.2 Filters
Filters or constraints in a query are applied to the non-spatial (aspatial) properties of a feature as well as to its spatial properties and location. Hence, two general icons are used to represent both types of filters, as shown in figure 3. Figure 3(a) demonstrates a non-spatial filter, depicted by an A (for (stored) Attributes) symbol, and a spatial filter, depicted by the "coordinates" symbol. The non-spatial filter represents constraints over the stored attributes and the location filter represents constraints that need to be computed over the spatial location of the object. After indicating the type of filter requested, the specific condition that the filter represents is built up dynamically by guiding the user to choose from menus of attributes, operators and values; the condition is then stored with the filter and may be displayed beside the icon as shown in the figure. Several filters may be used together to build more complex conditions and queries, as will be shown in the following examples.

4.3 Query Results
The initial type of the data is defined by the extent that flows into the query. It is this type that will be passed along the data flow diagram, depicted by downward
Fig. 4. (a) Filters joined by And. (b) Filters joined by Or. (c) Visualisation of multiple filters. ”Display all the motorway roads with asphalt road surface or all the roads whose length is > 50.”
pointing arrows to the results. The type of the flow is not altered by the query constraints. The only way the type of flow can be altered is when it flows into a results box. The results of the query are depicted, as shown in figure 3(b), by a double-edged rectangular box with the class name along with any particular attributes selected to appear in the results. By default the result of the query is displayed if the object has a spatial representation. The results box can be examined at any time of query formulation and its content displayed as a map and/or by listing the resulting object properties. If none of the attributes has been selected for listing, then the default is to view all the attributes of the class. An English expression of the query producing the result box is also available for examination through the result box as shown in the figure. 4.4
Simple Query Constructs
The example in figure 3 demonstrates a simple filter to restrict the results based on a non-spatial condition. Other operators may be used, e.g.=, >, <, like, etc. Also, spatial (unary) operators may be used to filter the results, e.g. area, volume, perimeter/boundary, etc. An example of using a spatial filter is shown in figure 3(c). Simple queries may be combined using Boolean expressions. Figure 4 represents the different cases. The flow will pass through only when the constraint is satisfied. In figure 4(a), multiple constraints are shown in series to represent constraints joined by And. The flow will pass through only when both constraints are satisfied. In figure 4(b), parallel arrangement of the filters is used to indicate that the flow will pass through when either or both constraints are satisfied. Any number of constraints may be joined together by binary logic operators as shown in the example in figure 4(c) where three constraints are used. Negated constraints are depicted by the filters in figure 5(a). Not may be applied to individual constraints or to a group of constraints. Filters may be joined in any order as explained. An example of a query with negated constraints is shown in figure 5(b).
Fig. 5. (a) Negation of non-spatial and spatial filters. (b) Visualisation of the And, Or and Not operators.
Fig. 6. (a) Non-Spatial join filter. (b) Spatial join filter (c) Example query of a spatial join. Specific relationship icon replaces general spatial join to indicate the cross relationship.
4.5 Joins
Two kinds of join operations are possible in spatial databases namely, non-spatial joins and spatial joins. Both types are represented coherently in the language. Spatial joins are expressions of spatial relationships between spatial objects in the database. Examples of spatial join queries are: Display all the motorway objects crossing Mid Glamorgan, and Display all the towns north of Cardiff within South Glamorgan. Filter notations are modified to indicate the join operation as shown in figure 6(a) and (b). A join filter is associated with more than one object type. A result box is associated with every joined object class and linked to the join filter. An example of a spatial join query is shown in figure 6(c). The query finds all the motorway roads that cross counties with population more than 50,000. Note that the result box from the join operation has been modified to reflect the contents of the join table. More than one object type has been produced, in this case, roads and counties that satisfy the join condition will be displayed on the result map.
Fig. 7. Examples of symbols for some spatial relationships [CFO93]; (A) for area, (L) for line and (P) for point.
Fig. 8. Composite query. Find the supermarkets within a buffer of 0.5 km of a motorway or are outside and north-of a town whose population is greater than 10000.
A symbol of the spatial relationship sought is used to replace the “coordinate” symbol in the spatial join filter. A choice of possible spatial joins is available depending on the spatial data types of the objects joined. In the last example, all the possible relationships between line (for roads) and polygons (for counties) will be available. Spatial relationships may be classified between topological, directional and proximal. Relationships are grouped in hierarchical fashion to allow the use of finer granularities of relationships. Examples of hierarchies of topological and directional relationships are shown in figure 7. Qualitative proximal relationships, such as near and far are vague unless they explicitly reflect a pre-defined range of measures. Hence, using proximal relationships requires an indication of the measure of proximity required, e.g. within a distance of x m. Multiple spatial joins may be expressed similarly either with the same object type, e.g. to find the supermarkets outside and north of towns, or with more than one object type, e.g. to find the supermarkets north of towns and within a buffer of 5 km. from motorways as shown in figure 8.
5 Implementation
So far, the proposed language has been described independently of its implementation. In this section, an outline of the interface prototype to the language
Fig. 9. The Query Formulation Window.
is presented. The implementation of the interface aims to address some of the issues relating to schema visualisation, structuring of query results, operator assistance in general, including guided query expression, feedback and restriction of user choice to valid options during query formulation. A prototype of the interface is implemented in Delphi. A test spatial data set is stored in a relational database, linked to the query interface. The query interface window is shown in figure 9. Input data sets are selected in a Schema visualisation window. The query is formulated, in a guided fashion, using a collection of filters, including spatial, aspatial, negated and various types of spatial join filters. The interface is context-sensitive and allows only possible filters and choices to be presented to the user at the different stages of query formulation. A spatial-SQL interpretation of the flow diagram is produced and compiled to produce the result data set presented on the result window. Evaluation tests for both the language and interface have been designed and are being conducted using two categories of users, namely, users with some experience of using GIS systems and users with no prior knowledge of GIS. The evaluation test for the language makes use of the "PICTIVE" approach [Mul93] where the language elements are simulated using Post-It notes and a whiteboard.
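The translation from the flow diagram to spatial-SQL can be illustrated with a small sketch. The Python fragment below is a hypothetical illustration only (the prototype itself is written in Delphi and its internal representation is not described here): filters placed in series map to AND, parallel branches map to OR, and negated filters wrap their condition in NOT.

```python
# Hypothetical sketch (not the Delphi prototype): serializing a filter-flow
# query into an extended-SQL string. Class and operator names are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class Filter:
    condition: str          # e.g. "rtype = 'motorway'" or "length(geometry) > 50"
    negated: bool = False

    def to_sql(self) -> str:
        return f"NOT ({self.condition})" if self.negated else f"({self.condition})"


def serial_filters(filters: List[Filter]) -> str:
    """Filters placed in series on the flow diagram are joined by AND."""
    return " AND ".join(f.to_sql() for f in filters)


def parallel_branches(branches: List[List[Filter]]) -> str:
    """Parallel branches of the flow diagram are joined by OR."""
    return " OR ".join(f"({serial_filters(b)})" for b in branches)


def flow_to_sql(source: str, branches: List[List[Filter]]) -> str:
    return f"SELECT * FROM {source} WHERE {parallel_branches(branches)}"


if __name__ == "__main__":
    # "Display all the motorway roads with asphalt surface or all roads longer than 50"
    q = flow_to_sql("Road", [
        [Filter("rtype = 'motorway'"), Filter("rsurface = 'Asphalt'")],
        [Filter("length(geometry) > 50")],
    ])
    print(q)
```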
6 Conclusions
In this paper a visual approach to querying spatial databases is proposed. Examples from the GIS domain have been used throughout to demonstrate the expressiveness of the language. The design of the language tried to address several requirements and problems associated with query interfaces to spatial databases. The following is a summary of the design aspects. – Icons were used to represent the geographic features with explicit indication of their underlying spatial representation, thus offering the user a direct indication to the data type being manipulated.
– A data flow metaphor is used consistently to describe different types of query conditions namely, non-spatial and spatial constraints as well as negated constraints and spatial and non-spatial joins. – Concise representation of the metaphor was used to join multiple constraints when dealing with one object in join operations. – Intermediate results are preserved and could be queried at any point of the query formulation process and hence the query history is also preserved. – Nested and complex queries are built consistently. The consistent use of the metaphor is intended to simplify the learning process for the user and should make the query expression process easier and the query expression more readable. The approach is aimed at casual and non expert users, or at expert domain users who are not familiar with query languages to databases. The implementation of the language aims to cater for different levels of user expertise. Visual queries are parsed and translated to extended SQL queries that are linked to a GIS for evaluation.
References

Bla98. A. Blaser. Geo-Spatial Sketches, Technical Report. Technical report, National Centre of Geographical Information Analysis: University of Maine, Orono, 1998.
CF80. N.S. Chang and K.S. Fu. Query-by-Pictorial Example. IEEE Transactions on Software Engineering, 6(6):519–524, 1980.
CFO93. E. Clementini, P.D. Felice, and P.V. Oosterom. A Small Set of Formal Topological Relationships for End-User Interaction. In Advances in Spatial Databases - Third International Symposium, SSD'93, pages 277–295. Springer Verlag, 1993.
Cha90. S.K. Chang. Principles of Visual Programming Systems. Englewood Cliffs: Prentice Hall, 1990.
EB95. M.J. Egenhofer and H.T. Burns. Visual Map Algebra: a direct-manipulation user interface for GIS. In Proceedings of the Third IFIP 2.6 Working Conference on Visual Database Systems 3, pages 235–253. Chapman and Hall, 1995.
Ege91. M.J. Egenhofer. Extending SQL for cartographic display. Cartography and Geographical Information Systems, 18(4):230–245, 1991.
Ege97. M.J. Egenhofer. Query Processing in Spatial Query by Sketch. Journal of Visual Languages and Computing, 8:403–424, 1997.
Gou93. M. Gould. Two Views of the Interface. In D. Medyckyj-Scott and H.M. Hearnshaw, editors, Human Factors in GIS, pages 101–110. Bellhaven Press, 1993.
IP87. K. Ingram and W. Phillips. Geographic information processing using an SQL based query language. In Proceedings of AUTO-CARTO 8, pages 326–335, 1987.
JC88. T. Joseph and A.F. Cardena. PICQUERY: A High Level Query Language for Pictorial Database Management. IEEE Transactions on Software Engineering, 14(5):630–638, 1988.
LC95. Y.C. Lee and F.L. Chin. An Iconic Query Language for Topological Relationships in GIS. International Journal of Geographical Information Systems, 9(1):24–46, 1995.
MGP98. N. Murray, C. Goble, and N. Paton. Kaleidoscape: A 3D Environment for Querying ODMG Compliant Databases. In Proceedings of Visual Databases 4, pages 85–101. Chapman and Hall, 1998.
MP90. M. Mainguenaud and M.A. Portier. CIGALES: A Graphical Query Language for Geographical Information Systems. In Proceedings of the 4th International Symposium on Spatial Data Handling, pages 393–404. University of Zurich, Switzerland, 1990.
MPG98. N. Murray, N. Paton, and C. Goble. Kaleidoquery: A Visual Query Language for Object Databases. In Proceedings of Advanced Visual Interfaces, pages 247–257. ACM Press, 1998.
Mul93. M. Muller. PICTIVE: Democratizing the Dynamics of the Design Session. In Participatory Design: Principles and Practices, pages 211–237. Lawrence Erlbaum Associates, 1993.
PB99. M.A.A. Portier and C. Bonhomme. A High Level Visual Language for Spatial Data Management. In Proceedings of Visual '99, pages 325–332. Springer Verlag, 1999.
RS99. S. Ravada and J. Sharma. Oracle8i Spatial: Experiences with Extensible Database. In SSD'99, pages 355–359. Springer Verlag, 1999.
Wel85. C. Welty. Correcting User Errors in SQL. International Journal of Man-Machine Studies, 22:463–477, 1985.
YS93. D. Young and B. Shneiderman. A Graphical Filter/Flow Representation of Boolean Queries: A Prototype Implementation and Evaluation. Journal of the American Society for Information Science, 44(6):327–339, 1993.
Zlo77. M.M. Zloof. Query-by-Example: A Database Language. IBM Systems Journal, 16(4):324–343, 1977.
MEDIMAGE – A Multimedia Database Management System for Alzheimer's Disease Patients

Peter L. Stanchev (Kettering University, Flint, Michigan 48504, USA) and Farshad Fotouhi (Wayne State University, Detroit, Michigan 48202, USA)
Abstract. Different brain databases are used by many researchers, such as: (1) the database of anatomic MRI brain scans of children across a wide range of ages, which serves as a resource for the pediatric neuroimaging research community [6]; (2) the Brigham RAD Teaching Case Database of the Department of Radiology, Brigham and Women's Hospital, Harvard Medical School [2]; and (3) the BrainWeb Simulated Brain Database of a normal brain and a brain affected by multiple sclerosis [3]. In this paper, we present MEDIMAGE – a multimedia database for Alzheimer's disease patients. It contains imaging, text and voice data and is used to find correlations of brain atrophy in Alzheimer's patients with different demographic factors.
1 Introduction
We determined topographic selectivity and diagnostic utility of brain atrophy in probable Alzheimer’s disease (AD) and correlations with demographic factors such as age, sex, and education. A medical multimedia database management system MEDIMAGE was developed for supporting this work. Its architecture is based on the image database models [4, 7]. The system design is motivated by the major need to manage and access multimedia information on the analysis of the brain data. The database links magnetic resonance (MR) images to patient data in a way that permits the use to view and query medical information using alphanumeric, and feature-based predicates. The visualization permits the user to view or annotate the query results in various ways. These results support the wide variety of data types and presentation methods required by neuroradiologists. The database gives us the possibility for data mining and defining interesting findings.
2 The MEDIMAGE System
The MEDIMAGE system architecture is presented in Figure 1.
Fig. 1. The MEDIMAGE system architecture
2.1 MEDIMAGE System Databases
In the MEDIMAGE system there are four databases: 1. MEDIMAGE MR Database. For brain volume calculation we store a two-spinecho sequence covering the whole brain. 58 T2-weithed 3 mm slices are obtained with half-Fourier sampling, 192 phase-encoding steps, TR/TE of 3000/30, 80 ms, and a field-of-view of 20 cm. The slices are contiguous and interleaved. We collect and store also 124 T1-weighted images using TR/TE of 35/5 msec, flip angle of 35 degrees. Finally we collect patients and scanner information such as: acquisition date, image identification number and name, image modality device parameters, image magnification, etc. 2. MEDIMAGE Segmented and 3D reconstructed database. This is the collection of process magnetic resonance images – segmented and 3D rendered. 3. MEDIMAGE Test database. The test date includes patient’s results from the standard tests for Alzheimer’s disease and related disorders. 4. MEDIMAGE Radiologist comments database. This data are in two types: text and voice. They contain the radiologist findings.
2.2 MEDIMAGE MR Image Processing Tools
In the MEDIMAGE system there are three main tools for image processing. 1. MEDIMAGE MR Image Segmentation tools. These tools include bifeature segmentation tool and ventrical and sulcal CSF volume calculation tool. The CSF denotes the fluid inside the brain. • Bifeature segmentation tool. Segmentation of the MR images into GM (gray matter), white matter (WM) and CSF is perform in the following way: thirty points per compartment (15 per hemisphere) are sampled simultaneously from the proton density and T2-weigted images. The sample index slice is the most inferior slice above the level of the orbits where the anterior horns of the lateral ventricles could be seen. Using a nonparametric statistic algorithm (k-nearest neighbors supervised classification) the sample points are used to derive a “classificator” that determined the most probable tissue type for each voxel. • Ventrical and sulcal CSF volume calculation tool. A train observer places a box encompassing the ventricles to define the ventrical CSF. Subtraction the ventical from the total CSF provided a separate estimate of the sulcal CSF. 2. MEDIMAGE MR 3D reconstruction tools. These tools include total brain capacity measurement and region of interest definition tools. • Total brain capacity measurement tool. A 3D surface rendering technique is used to obtain accurate lobal demarcation. The T2-weighted images are first “edited” using intensity thresholds and tracing limit lines on each slice to remove nonbrain structures. The whole brain volume, which included brain stamp and cerebellum, is then calculated from the edit brain as an index of the total intracranial capacity and is used in the standardization procedures to correct for brain size. A 3D reconstruction is computed. • Region of interest definition tool. Using anatomical landmarks and a priori geometric rules accepted by neuroanatomic convention, the frontal, parietal, temporal, and occipital lob are demarcated manner. The vovels of the lobar region of interest is used to mask the segmented images, enabling quantification of different tissue compartments for each lobe. 3. MEDIMAGE MR Measurement tools. These tools include hippocampal volume determination tool. • Hippocampal volume determination tool. Sagical images are used to define the anterior and posterior and end points of the structure. Then they are reformatted into coronal slices perpendicular to the longitudinal axis of the hippocampal formation. Then the hippocampal perimeter is traced for each hemisphere. The demarcated area is multiplied by slice thickness to obtain the hippocampal volume in the slice. 2.3
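The bifeature segmentation step above rests on a k-nearest-neighbour vote over sampled (proton density, T2) feature pairs. The Python fragment below is a minimal sketch of that idea only; the feature values, cluster centres and k are invented for illustration and are not the parameters used by MEDIMAGE.

```python
# Minimal k-NN tissue-labelling sketch: each voxel carries a (proton-density, T2)
# pair and receives the majority label of its k closest training samples.
import numpy as np


def knn_classify(train_feats, train_labels, voxel_feats, k=5):
    """train_feats: (N, 2) sampled points; voxel_feats: (M, 2) voxels to label."""
    # Squared Euclidean distances between every voxel and every training sample.
    d2 = ((voxel_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]          # indices of the k neighbours
    votes = train_labels[nearest]                    # (M, k) label votes
    # Majority vote per voxel (labels are small non-negative integers).
    return np.array([np.bincount(v).argmax() for v in votes])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 30 samples per compartment: 0 = GM, 1 = WM, 2 = CSF (toy clusters).
    centers = np.array([[100.0, 90.0], [140.0, 60.0], [60.0, 160.0]])
    train = np.vstack([c + rng.normal(0, 5, (30, 2)) for c in centers])
    labels = np.repeat(np.arange(3), 30)
    voxels = rng.uniform(40, 170, (10, 2))
    print(knn_classify(train, labels, voxels))
```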
MEDIMAGE Database Management Tools
In the MEDIMAGE database management system there are definition, storage, manipulation and viewing tools.
1. MEDIMAGE Definition Tools. Those tools are used for defining the structure of the four databases. All of them are using relational model. 2. MEDIMAGE Storage Tools. These are tools allowing entering, deletion and updating of the data in the system. 3. MEDIMAGE Manipulation Tools. Those tools allow: image retrieval based on alphanumeric, and feature-based predicates and numerical, text, voice and statistic data retrieval. • Image retrieval. The images are searched by their image description representation, and it is based on similarity retrieval. Let a query be converted in an image description Q(q1, q2, …, qn) and an image in the image database has the description I(x1, x2, …, xn). Then the retrieval value (RV) between Q and I is defined as: RVQ(I) = Σi = 1, …,n (wi * sim(qi, xi)), where wi (i = 1,2, …, n) is the weight th specifying the importance of the i parameter in the image description and th sim(qi, xi) is the similarity between the i parameter of the query image and database image and is calculated in different way according to the qi, xi values. There are alphanumeric and feature-based predicates. • Numerical, text, voice and statistic data retrieval. A lot statistical function are available in the system allowing to make data mining using the obtain measurements and correlated them with different demographic factors. 4. MEDIMAGE Viewing Tools. Those tools allow viewing images and text, numerical and voice data from the four databases supported by the system.
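As a small illustration of the retrieval-value formula RV_Q(I) = Σ_i w_i * sim(q_i, x_i) quoted in item 3 above, the sketch below assumes two simple per-parameter similarity functions (exact match for symbolic parameters, normalised difference for numeric ones); the actual system computes sim() differently depending on the type of each parameter.

```python
# Illustrative retrieval-value computation; the sim() definitions are assumptions.
def sim(q, x):
    if isinstance(q, str) or isinstance(x, str):
        return 1.0 if q == x else 0.0            # exact match for symbolic values
    span = max(abs(q), abs(x), 1e-9)
    return 1.0 - abs(q - x) / span               # normalised difference for numbers


def retrieval_value(query, image, weights):
    return sum(w * sim(q, x) for q, x, w in zip(query, image, weights))


if __name__ == "__main__":
    query = ["T2", 0.35, 1200.0]                 # modality, CSF fraction, volume (toy)
    image = ["T2", 0.30, 1100.0]
    weights = [0.2, 0.5, 0.3]
    print(round(retrieval_value(query, image, weights), 3))
```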
3 Results Obtained with the MEDIMAGE System
The results of some of the image processing tools are given in Figures 2-7. Result from the statistical analysis applied to MR images in 32 patients with probable AD and 20 age- and sex-matched normal control subjects find the following findings. Group differences emerged in gray and white matter compartments particularly in parietal and temporal lobes. Logistic regression demonstrated that larger parietal and temporal ventricular CSF compartments and smaller temporal gray matter predicted AD group membership with an area under the receiver operating characteristic curve of 0.92. On multiple regression analysis using age, sex, education, duration, and severity of cognitive decline to predict regional atrophy in the AD subjects, sex consistently entered the model for the frontal, temporal, and parietal ventricular compartments. In the parietal region, for example, sex accounted for 27% of the variance in the parietal CSF compartment and years of education accounted for an additional 15%, with women showing less ventricular enlargement and individuals with more years of education showing more ventricular enlargement in this region. Topographic selectivity of atrophic changes can be detected using quantitative volumetry and can differentiate AD from normal aging. Quantification of tissue volumes in vulnerable regions offers the potential for monitoring longitudinal change in response to treatment.
Fig. 2. Bifeature segmentation (from the TE = 30 ms, TR = 3000 ms and TE = 80 ms, TR = 3000 ms images)
Fig. 3. Ventricular and Sulcal CSF Separation
Fig. 4. Brain Editing
Fig. 5. 3D Brain Reconstruction
Fig. 6. Region Definition
Fig. 7. Hippocampal Volume Calculation
4 Conclusions
The MEDIMAGE system was developed in the Sunnybrook Health Science Centre, Toronto, Canada, on SUN Microsystems workstations. It uses GE scanner software and the ANALYSE and SCILIMAGE packages. The medical findings are described in detail in [5]. The main advantages of the proposed MEDIMAGE system are:
• Generality. The system could easily be modified for other medical image collections. The system was also used for corpus callosum calculations [1].
• Practical applicability. The results obtained with the system define essential medical findings.
The main conclusion from using the system is that content-based image retrieval is not an essential part of such a system. Data mining algorithms play the essential roles in similar systems.
References 1. Black SE., Moffat SD., Yu DC, Parker J., Stanchev P., Bronskill M., “Callosal atrophy correlates with temporal lobe volume and mental status in Alzheimer's disease.” Canadian Journal of Neurological Sciences. 27(3), 2000 Aug., pp. 204-209. 2. Brigham RAD Teaching Case Database Department of Radiology, Brigham and Women's Hospital Harvard Medical School http://brighamrad.harvard.edu/education/online/tcd/tcd.html 3. C.A. Cocosco, V. Kollokian, R.K.-S. Kwan, A.C. Evans: "BrainWeb: Online Interface to a 3D MRI Simulated Brain Database", NeuroImage, vol.5, no.4, part 2/4, S425, 1997 - Proceedings of 3-rd International Conference on Functional Mapping of the Human Brain, Copenhagen, May 1997. 4. Grosky W., Stanchev P., “Object-Oriented Image Database Model”, 16th International Conference on Computers and Their Applications (CATA-2001), March 28-30, 2001, Seattle, Washington, pp. 94-97. 5. Kidron D., Black SE., Stanchev P., Buck B., Szalai JP., Parker J., Szekely C., Bronskill MJ., “Quantitative MR volumetry in Alzheimer's disease. Topographic markers and the effects of sex and education”, Neurology. 49(6):1504-12, 1997 Dec. 6. Pediatric Study Centers (PSC) for a MRI Study of Normal Brain Development http://grants.nih.gov/grants/guide/noticefiles/not98-114.html 7. Stanchev, P., “General Image Database Model,” Visual Information and Information Systems, Proceedings of the Third Conference on Visual Information Systems, Huijsmans, D. Smeulders A., (Eds.) Lecture Notes in Computer Science, Volume 1614 (1999), pp. 29-36.
Life after Video Coding Standards: Rate Shaping and Error Concealment

Trista Pei-chun Chen and Tsuhan Chen (Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA) and Yuh-Feng Hsu (Computer and Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu 310, Taiwan)
Abstract. Is there life after video coding standards? One might think that research has no room to advance with the video coding standards already defined. On the contrary, exciting research opportunities arise after the standards are specified. In this paper, we introduce two standard-related research areas: rate shaping and error concealment, as examples of interesting research that finds its context in standards. Experiment results are also shown.
1 Introduction
What are standards? Standards define a common language that different parties can communicate with each other effectively. An analogy to the video coding standard is the language. Only with the language, Shakespeare could create his work and we can appreciate the beautiful masterpiece of his. Similarly, video coding standards define the bitstream syntax, which enables the video encoder and the decoder to communicate. With the syntax and decoding procedure defined, interesting research areas such as encoder optimization, decoder post-processing, integration with the network transport and so on, are opened up. In other words, standards allow for advanced video coding research fields to be developed and coding algorithms to be compared on a common ground. In this paper, we consider H.263 [1] as the video coding standard example. Similar ideas can also be built on other standards such as MPEG-4 [2]. Two research areas: rate shaping [3] and error concealment [4] (Fig. 1), are introduced for networked video transport. First, we introduce rate shaping to perform joint source-channel coding. Video transport is very challenging given the strict bandwidth requirement and possibly high channel error rate (or packet loss rate). Through standards such as the real-time control protocol (RTCP, part of the real-time transport protocol (RTP)) [5], the encoder can obtain network condition information. The rate shaper uses such information to
shape the coded video bitstream before sending it to the network. The video transport thus delivers the video bitstream with better quality and utilizes the network bandwidth more efficiently.

Fig. 1. System of video transport over network
Second, we present error concealment with updating mixture of principle components. In a networked video application, even with good network design and video encoder, the video bitstream can be corrupted and become un-decodable at the receiver end. Error concealment is useful in such a scenario. We introduce in particular a model-based approach with updating mixture of principle components as the model. The User Datagram Protocol (UDP) [6] sequence number is used to inform the video decoder to perform error concealment. In addition to the two areas introduced, research areas such as video traffic modeling would not be relevant without the standards being defined. Prior work on video traffic modeling can be found in [7], [8], [9], [10], and [11]. This paper is organized as follows. In Section 2, we adopt the rate shaping technique to perform joint source-channel coding. In Section 3, updating mixture of principle components is shown to perform very well in the error concealment application. We conclude this paper in Section 4.
2 Adaptive Joint Source-Channel Coding Using Rate Shaping
Video transmission is challenging in nature because it has high data rate compared to other data types/media such as text or audio. In addition, the channel bandwidth limit and error prone characteristics also impose constraints and difficulties on video transport. A joint source-channel coding approach is needed to adapt the video bitstream to different channel conditions. We propose a joint source-channel coding scheme (Fig. 2) based on the concept of rate shaping to accomplish the task of video transmission. The video sequence is first source coded followed by channel coding. Popular source coding methods are H.263 [1], MPEG-4 [2], etc. Example channel coding methods are Reed-Solomon codes, BCH codes, and the recent turbo codes [12], [13]. Source coding refers to “scalable encoder/decoder” in Fig. 2 and channel coding refers to “error correction coding
(ECC) encoder/decoder” in Fig. 2. The source and channel coded video bitstream then passes through the rate shaper to fit the channel bandwidth requirement while achieving the best reconstructed video quality.

Fig. 2. System diagram of the joint source-channel coder: (a) encoder; (b) decoder
2.1 Rate Shaping
After the video sequence has been source and channel coded, the rate shaper then decides which portions of the encoded video bitstream will be sent. Let us consider the case where the video sequence is scalable coded into two layers: one base layer and one enhancement layer. Each of the two layers is error correction coded with different error correction capability. Thus, there are four segments in the video bitstream: the source-coding segment of the base layer bitstream (lower left segment of Fig. 3 (f)), the channel-coding segment of the base layer bitstream (lower right segment of Fig. 3 (f)), the source-coding segment of the enhancement layer bitstream (upper left segment of Fig. 3 (f)), and the channel-coding segment of the enhancement layer bitstream (upper right segment of Fig. 3 (f)). The rate shaper will decide which of the four segments to send. In the two-layer case, there are totally six valid combinations of segments (Fig. 3 (a)~(f)). We call each valid combination a state. Each state is represented by a pair of integers (x, y ) , where x is the number of source-coding segments chosen counting from the base layer and y is the number of channel-coding segments counting from the base layer. x and y satisfy the relationship of x ≥ y .
Fig. 3. Valid states: (a) State (0,0); (b) State (1,0); (c) State (1,1); (d) State (2,0); (e) State (2,1); (f) State (2,2)
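For a bitstream with an arbitrary number of layers, the valid states can be enumerated directly from the constraint x ≥ y; the short Python sketch below (illustration only, not part of the paper) reproduces the six states of Fig. 3 for the two-layer case.

```python
# Enumerate the valid (x, y) states: x source-coding segments and y channel-coding
# segments are kept, counting from the base layer, with x >= y.
def valid_states(num_layers):
    return [(x, y) for x in range(num_layers + 1) for y in range(x + 1)]


if __name__ == "__main__":
    print(valid_states(2))   # [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]
```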
The decision of the rate shaper can be optimized given the rate-distortion map, or R-D map, of each coding unit. A coding unit can be a frame, a macroblock, etc., depending on the granularity of the decision. The R-D maps vary with different channel error conditions. Given the R-D map of each coding unit with a different constellation of states (Fig. 4), the rate shaper finds the state with the minimal distortion under certain bandwidth constraint “B”. In the example of Fig. 4, State (1,1) of Unit 1 and State (2,0) of Unit 2 are chosen. Such decision is made on each of the coding unit given the bandwidth constraint “B” of that unit.
Fig. 4. R-D maps of coding units: (a) Unit 1; (b) Unit 2; (c) Unit 3 and so on
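The per-unit decision can be sketched as follows: among the states of one coding unit, keep those whose rate fits the unit budget B and pick the one with minimum distortion. The rate-distortion numbers in this Python fragment are invented for illustration.

```python
# Illustrative per-unit state selection over an R-D map (Fig. 4); numbers are toy values.
def best_state(rd_map, budget):
    """rd_map: dict state -> (rate, distortion). Returns the chosen state."""
    feasible = {s: rd for s, rd in rd_map.items() if rd[0] <= budget}
    if not feasible:
        return min(rd_map, key=lambda s: rd_map[s][0])   # fall back to the cheapest state
    return min(feasible, key=lambda s: feasible[s][1])   # minimum distortion within budget


if __name__ == "__main__":
    unit1 = {(0, 0): (0, 90), (1, 0): (30, 60), (1, 1): (45, 40),
             (2, 0): (55, 35), (2, 1): (70, 20), (2, 2): (85, 12)}
    print(best_state(unit1, budget=60))   # -> (2, 0) for this toy map
```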
Consider taking a frame as a coding unit. Video bitstream is typically coded with variable bit rate in order to maintain constant video quality. To minimize the overall distortion for a group of pictures/frames (GOP), it is not enough to choose the state for each frame based on the equally allocated bandwidth to every frame. We will introduce a smart rate shaping scheme that allocates different bandwidth to each frame in a GOP. The rate shaping scheme is based on the discrete rate-distortion combination algorithm. 2.2
Discrete Rate-Distortion Combination Algorithm
Assume there are F frames in a GOP and the total bandwidth constraint for these F frames is C. Let x(i) be the state chosen for frame i and let D_i,x(i) and R_i,x(i) be the resulting distortion and rate at frame i respectively. The goal of the rate shaper is to:

minimize   Σ_{i=1..F} D_i,x(i)                                   (1)

subject to   Σ_{i=1..F} R_i,x(i) ≤ C                             (2)
In principle, this optimization problem can be accomplished using Dynamic Programming [14], [15], [16]. The trellis diagram is formed with the x-axis being the frame index i , y-axis being the cumulative rate at frame i , and the cost function of the trellis being the distortion. If there are S states at each frame, the number of nodes at Frame i = F will be S F (if none of the cumulative rates are the same). This method is too computationally intensive. If the number of states, S , is large, the R-D map becomes a continuous curve. The Lagrangian Optimization method [16], [17], [18] can be used to solve this optimization problem. However, Lagrangian Optimization method cannot reach the states that do not reside on the convex hull of the R-D curve. In this paper, we introduce a new discrete rate-distortion combination algorithm as follows:
1. At each frame, eliminate the state in the map if there exists some other state that is smaller in rate and smaller in distortion than the one considered. This corresponds to eliminating states in the upper right corner of the map (Fig. 5 (a)).
2. At each frame i, eliminate State b if R_ia < R_ib < R_ic and (D_ib − D_ia)/(R_ib − R_ia) < (D_ic − D_ib)/(R_ic − R_ib), where State a and State c are two neighboring states of State b. This corresponds to eliminating states that are on the upper right side of any line connecting two states. For example, State b is on the upper right side of the line connecting State a and State c (Fig. 5 (b)). Thus, State b is eliminated.
3. Label the remaining states in each frame from the state with the lowest rate, State 1, to the state with the highest rate. Let us denote the current decision of state at Frame i as State u(i). Start from u(i) = 1 for all frames. The rate shaper examines the next state u(i)+1 of each frame and finds the one that gives the largest ratio of distortion decrease over rate increase compared to the current state u(i). If Frame τ is chosen, increase u(τ) by one. As an example, let us look at two frames, Frame m and Frame n in Fig. 5 (c). Current states are represented as gray dots and the next states as black dots. We can see that updating u(m) gives a larger ratio than updating u(n). Thus, the rate shaper updates u(m).
4. Continue Step 3 until the total rate meets C or will exceed C with any more update of u(i). If C is met, we are done.
5. If the bandwidth constraint is not yet met after Step 4, reconsider the states that were eliminated by Step 2. For each frame, re-label all the states from the state with the lowest rate to the state with the highest rate, and let u(i) denote the current state. Choose the frame with the next state giving the most distortion decrease compared to the current state. If Frame τ is chosen, increase u(τ) by one.
6. Continue Step 5 until the total rate meets C or exceeds C with more update of u(i).
Fig. 5. Discrete R-D combination: (a) Step 1; (b) Step 2; (c) Step 3
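A compact Python sketch of Steps 1 to 4 is given below (illustrative only). It prunes dominated and above-the-chord states, then greedily promotes, frame by frame, the state change with the largest distortion-decrease-to-rate-increase ratio until the GOP budget C would be exceeded; the Step 5/6 revisit of pruned states is omitted for brevity, and the R-D numbers are invented.

```python
# Simplified discrete R-D combination (Steps 1-4); rd[i] lists (rate, distortion)
# states for frame i, all values are toy numbers.
def prune(states):
    states = sorted(states)                      # ascending rate
    kept, best_d = [], float("inf")              # Step 1: drop dominated states
    for r, d in states:
        if d < best_d:
            kept.append((r, d))
            best_d = d
    hull = []                                    # Step 2: keep the lower convex hull
    for p in kept:
        while len(hull) >= 2:
            (r1, d1), (r2, d2) = hull[-2], hull[-1]
            if (d2 - d1) * (p[0] - r2) >= (p[1] - d2) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def allocate(rd, C):
    hull = [prune(s) for s in rd]
    u = [0] * len(rd)                            # Step 3: start from the cheapest state
    total = sum(hull[i][0][0] for i in range(len(rd)))
    while True:
        best, best_ratio = None, 0.0
        for i in range(len(rd)):
            if u[i] + 1 < len(hull[i]):
                r0, d0 = hull[i][u[i]]
                r1, d1 = hull[i][u[i] + 1]
                if total - r0 + r1 <= C and (d0 - d1) / (r1 - r0) > best_ratio:
                    best, best_ratio = i, (d0 - d1) / (r1 - r0)
        if best is None:                         # Step 4: stop when C would be exceeded
            return u, total
        r0, _ = hull[best][u[best]]
        u[best] += 1
        total += hull[best][u[best]][0] - r0


if __name__ == "__main__":
    rd = [[(10, 80), (30, 50), (60, 20)], [(15, 70), (40, 30), (90, 10)]]
    print(allocate(rd, C=100))
```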
2.3 Experiment
We compare four methods: (M1) transmits a single non-scalable and non-ECC coded video bitstream; (M2), proposed by Vass and Zhuang [19], switches between State (1, 1) and State (2, 0) depending on the channel error rate; (M3) allocates the same bit
budget to each frame and chooses the state that gives the best R-D performance for each frame; (M4) is the proposed method that dynamically allocates the bit budget to each frame in a GOP and chooses the state that gives the best overall performance in a GOP, using the algorithm shown in Sect. 2.2. Each GOP has F = 5 frames. The test video sequence is “stefan.yuv” in QCIF (quarter common intermediate format). The bandwidth and channel error rate vary over time and are simulated as AR(1) processes. The bandwidth ranges from 4k bits/frame to 1024k bits/frame; and the channel error rate ranges from 10 −0.5 to 10 −6.0 . The performance is shown in mean square error (MSE) versus the GOP number as in Fig. 6. In the case that all four methods satisfy the bandwidth constraint, the average MSE of all four methods are 10050, 5356, 2091, and 1946 respectively. The proposed M4 has the minimum distortion among all. In addition, let us compare M1 and M2 with M3 and M4. Since M1 and M2 do not have the R-D maps in mind, the network could randomly discard the bitstream sent by these two methods. The resulting MSE performance of M1 and M2 are bad. On the other hand, M3 and M4 are more intelligent in knowing that the bitstream could be non-decodable if the channel error rate is high and thus decide to allocate the bit budget to the channel-coding segments of the video bitstream. 4
Fig. 6. MSE performance of four rate shaping methods
3 Updating Mixture of Principle Components for Error Concealment
When transmitting video data over networks, the video data could suffer from losses. Error concealment is a way to recover or conceal the loss information due to the transmission errors. Through error concealment, the reconstructed video quality can be improved at the decoder end. Projection onto convex sets (POCS) [20] is one of the most well known frameworks to perform error concealment. Error concealment based on POCS is to formulate each constraint about the unknowns as a convex set. The optimal solution is obtained by recursively projecting a previous solution onto each convex set. For error concealment, the projections of data refer to (1) projecting the data with some losses to a model that is built on error-free
data, and (2) replacing data in the loss portion with the reconstructed data. The success of a POCS algorithm relies on the model onto which the data is projected. We propose in this paper updating mixture of principle components (UMPC) to model the non-stationary as well as the multi-modal nature of the data. It has been proposed that the mixture of principle components (MPC) [21] can represent the video data with a multi-modal probability distribution. For example, face images in a video sequence can have different poses, expressions, or even changes in the characters. It is thus natural to use a multi-modal probability distribution to describe the video data. In addition, the statistics of the data may change over time as proposed by updating principle components (UPC) [22]. By combining the strengths of both MPC and UPC, we propose UMPC that captures both the non-stationary and the multi-modal characteristics of the data precisely.

3.1 Updating Mixture of Principle Components
Given a set of data, we try to model the data with minimum representation error. We specifically consider multi-modal data as illustrated in Fig. 7 (a). The data are clustered to multiple components (two components in this example) in a multidimensional space. As mentioned, the data can be non-stationary, i.e., the stochastic properties of the data are time-varying. At time n , the data are clustered as Fig. 7 (a) and at time n′ , the data are clustered as Fig. 7 (b). The mean of each component is shifting and the most representative axes of each component are also rotating.
Fig. 7. Multi-modal data at (a) time n (b) time n′
At any time instant, we attempt to represent the data as a weighted sum of the mean and principle axes of each component. As time proceeds, the model changes its mean and principle axes of each component. The representation error of the model at time instant n should have less contribution from data that are further away in time from the current one. The optimization formula can be written as follows: (3)
The notations are organized as follows:
At any time instant n, this is to minimize the weighted reconstruction error with the choice of means, the sets of eigenvectors, and the set of weights. The reconstruction errors contributed by previous data are weighted by powers of the decay factor α. The solution to this problem is obtained by iteratively determining weights, means and sets of eigenvectors respectively while fixing the other parameters. That is, we optimize the weights for each data using the previous means and sets of eigenvectors. After updating the weights, we optimize the means and the eigenvectors accordingly. The next iteration starts again in updating the weights and so on. The iterative process is repeated until the parameters converge. At the next time instant n + 1, the parameters of time instant n are used as the initial parameter values. Then the process of iteratively determining weights, means and sets of eigenvectors starts again. The mean m_q^(n) of mixture component q at time n is:

m_q^(n) = (1 − w_nq² / Σ_{i=0..∞} α^i w_{n−i,q}²) m_q^(n−1) + (w_nq / Σ_{i=0..∞} α^i w_{n−i,q}²) (x_n − Σ_{j=1..M, j≠q} w_nj x̂_nj)        (4)
The covariance matrix C_r^(n) of mixture component r at time n is:

C_r^(n) = α C_r^(n−1) + (1 − α) { w_nr [(x_n − m_r) x_nᵀ + x_n (x_n − m_r)ᵀ]
    − Σ_{j=1..M} w_nj w_nr [(x_n − m_r) m_jᵀ + m_j (x_n − m_r)ᵀ]
    − Σ_{j=1..M, j≠r} w_nj w_nr Σ_{k=1..P} u_jkᵀ(x_n − m_j) [(x_n − m_r) u_jkᵀ + u_jk (x_n − m_r)ᵀ]
    − w_nr² (x_n − m_r)(x_n − m_r)ᵀ }        (5)
To complete one iteration with determination of means, covariance matrix and weights, the solution for the weights follows from the linear system

[2 X̂_iᵀ X̂_i   1 ; 1ᵀ   0] [w_i ; λ] = [2 X̂_iᵀ x_i ; 1]        (6)

where 1 = [1 ⋯ 1]ᵀ is an M × 1 vector. We see that both MPC and UPC are special cases of UMPC with α → 1 and M = 1 respectively.
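Equation (6) is the standard augmented system for a least-squares fit under a sum-to-one constraint and can be solved directly; the Python sketch below assumes the per-component reconstructions are stacked as the columns of a matrix X̂, and the dimensions are illustrative.

```python
# Illustrative constrained weight solve: minimise ||x - Xhat @ w||^2 s.t. sum(w) = 1.
import numpy as np


def solve_weights(Xhat, x):
    """Xhat: (d, M) per-component reconstructions; x: (d,) data vector."""
    d, M = Xhat.shape
    A = np.zeros((M + 1, M + 1))
    A[:M, :M] = 2.0 * Xhat.T @ Xhat
    A[:M, M] = 1.0                      # Lagrange-multiplier column
    A[M, :M] = 1.0                      # sum-to-one constraint row
    b = np.concatenate([2.0 * Xhat.T @ x, [1.0]])
    sol = np.linalg.solve(A, b)
    return sol[:M]                      # drop the multiplier


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Xhat = rng.normal(size=(16, 3))     # 3 components, 16-dimensional data
    x = 0.6 * Xhat[:, 0] + 0.3 * Xhat[:, 1] + 0.1 * Xhat[:, 2]
    print(np.round(solve_weights(Xhat, x), 3))   # approximately [0.6, 0.3, 0.1]
```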
3.2 Error Concealment with UMPC
With object based video coding standards such as MPEG-4 [2], the region of interest (ROI) information is available. A model based error concealment approach can use such ROI information and build a better error concealment mechanism. Fig. 8 shows two video frames with ROI specified. In this case, ROI can also be obtained by face trackers such as [23].
Fig. 8. Two video frames with object specified
When the video decoder receives a frame of video with an error-free ROI, it uses the data in the ROI to update the existing UMPC with the processes described in Sect. 3.1. When the video decoder receives a frame of video with corrupted macroblocks (MB) in the ROI, it uses UMPC to reconstruct the corrupted ROI. In Fig. 9, we use three mixture components (the 1st, 2nd, and 3rd) to illustrate the idea of UMPC for error concealment.
Fig. 9. UMPC for error concealment
The corrupted ROI is first reconstructed by each individual mixture component. The resulting reconstructed ROI is formed by linearly combining the three individually reconstructed ROI. The weights for the linear combination are inversely proportional to the reconstruction error of each individually reconstructed ROI. After the reconstruction of the ROI with UMPC is done, the corrupted MB are replaced with the corresponding data in the reconstructed ROI just obtained. The process of reconstruction with UMPC and replacement of the corrupted MB is repeated iteratively until the final reconstruction result is satisfactory.
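The loop just described can be sketched as follows (Python, illustrative only): each component's reconstruction is a projection onto its mean and principal axes, the blended estimate uses weights inversely proportional to the per-component error, and only the lost macroblock positions are overwritten before the next iteration. The component models and loss mask are invented for the example.

```python
# Condensed POCS-style concealment sketch with a mixture of components.
import numpy as np


def reconstruct(roi, components):
    """components: list of (mean, U) with U holding eigenvectors as columns."""
    recons, errors = [], []
    for mean, U in components:
        proj = mean + U @ (U.T @ (roi - mean))
        recons.append(proj)
        errors.append(np.linalg.norm(roi - proj) + 1e-9)
    w = 1.0 / np.array(errors)
    w /= w.sum()                                   # inverse-error weighting
    return sum(wi * ri for wi, ri in zip(w, recons))


def conceal(roi, lost_mask, components, iters=10):
    """roi: flattened region of interest; lost_mask: True where data was lost."""
    est = roi.copy()
    est[lost_mask] = est[~lost_mask].mean()        # crude initial fill
    for _ in range(iters):
        full = reconstruct(est, components)
        est[lost_mask] = full[lost_mask]           # keep the received pixels as-is
    return est


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    d, M, P = 64, 3, 2
    comps = [(rng.normal(size=d), np.linalg.qr(rng.normal(size=(d, P)))[0]) for _ in range(M)]
    clean = comps[0][0] + comps[0][1] @ np.array([1.5, -0.7])
    lost = np.zeros(d, dtype=bool)
    lost[10:26] = True                             # a run of lost coefficients
    print(np.abs(conceal(clean, lost, comps) - clean)[lost].max())
```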
3.3 Experiment
The test video sequence is recorded from a TV program. The video codec used is H.263 [1]. Some frames of this video sequence are shown in Fig. 8. We use a two-state Markov chain [24] to simulate the bursty error that corrupts the MB, as shown in Fig. 10. "Good" and "Bad" correspond to the error-free and erroneous state respectively. The overall error rate ε is related to the transition probabilities p and q by ε = p / (p + q). We use ε = 0.05 and p = 0.01 in the experiment.
Fig. 10. Two state Markov chain for MB error simulation
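A minimal sketch of this two-state (Gilbert) loss model is shown below; with ε = 0.05 and p = 0.01 as above, the implied q is 0.19. The function and its arguments are illustrative and not part of the paper's software.

```python
# Two-state Markov macroblock-loss model of Fig. 10: stay in Good/Bad with
# probabilities 1-p / 1-q, giving an overall loss rate eps = p / (p + q).
import random


def mb_loss_pattern(num_mb, eps=0.05, p=0.01, seed=0):
    q = p * (1.0 - eps) / eps                  # from eps = p / (p + q)
    rng = random.Random(seed)
    state_bad, losses = False, []
    for _ in range(num_mb):
        if state_bad:
            state_bad = rng.random() >= q      # leave Bad with probability q
        else:
            state_bad = rng.random() < p       # enter Bad with probability p
        losses.append(state_bad)
    return losses


if __name__ == "__main__":
    pattern = mb_loss_pattern(10000)
    print(sum(pattern) / len(pattern))         # close to eps = 0.05
```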
There are two sets of experiments: Intra and Inter. In the Intra coded scenario, we compare three cases: (1) none: no error concealment takes place. When the MB is corrupted, the MB content is lost; (2) MPC: error concealment with MPC as the model. The number of mixture components M are three and the number of eigenvectors P for each mixture components are two; (3) UMPC: error concealment with UMPC as the model with M = 3 and P = 2 . The decay factor is α is 0.9 . In the Inter coded scenario, we also compare three cases: (1) MC: error concealment using motion compensation; (2) MPC: error concealment with MPC as the model operated on motion compensated data; (3) UMPC: error concealment with UMPC as the model on operated motion compensated data. Fig. 11 shows the means of UMPC at two different time instances. It shows that the model captures three main poses of the face images. Since there is a change of characters, UMPC captures such change and we can see that the means describe more on the second character at th Frame 60 .
Fig. 12 and Fig. 13 show the decoded video frames without and with the error concealment. Fig. 12 (a) shows a complete loss of MB content when the MB data is lost. Fig. 12 (b) shows that the decoder successfully recovers the MB content with the corrupted ROI projected onto the UMPC model. Fig. 13 (a) shows the MB content being
recovered by motion compensation when the MB data is lost. The face is blocky because of the error in motion compensation. Fig. 13 (b) shows that the decoder successfully recovers the MB content inside the ROI with the motion compensated ROI projected onto the UMPC model.
Fig. 11. Means for UMPC at Frame 20 and 60
Fig. 12. Error concealment for the Intra coding scenario: (a) no concealment; (b) concealment with UMPC
Fig. 13. Error concealment for the Inter coding scenario with: (a) motion compensation; (b) motion compensation and UMPC
The PSNR performance of the decoded video frames is summarized in Table 1. In both the Intra and Inter scenarios, error concealment with UMPC performs the best.

Table 1. Error concealment performance (PSNR, dB) of four models at INTRA and INTER coded scenarios

        None (Intra) / MC (Inter)   MPC       UMPC
Intra   15.5519                     29.3563   30.6657
Inter   21.4007                     21.7276   22.3484
4 Conclusion
We presented two research areas: rate shaping and error concealment, that find their relevance after video coding standards are defined. With rate shaping and error concealment, we can improve the quality of service of networked video. We showed that exciting new research areas are opened up after the standards are specified.
References 1. ITU-T Recommendation H.263, January 27, 1998 2. Motion Pictures Experts Group, "Overview of the MPEG-4 Standard", ISO/IEC JTC1/SC29/WG11 N2459, 1998 3. Trista Pei-chun Chen and Tsuhan Chen, “Adaptive Joint Source-Channel Coding using Rate Shaping”, to appear in ICASSP 2002 4. Trista Pei-chun Chen and Tsuhan Chen, “Updating Mixture of Principle Components for Error Concealment”, submitted to ICIP 2002 5. H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson: “RTP: A transport protocol for real-time applications”, RFC1889, Jan. 1996. ftp://ftp.isi.edu/in-notes/rfc1990.txt 6. J. Postel, “User Datagram Protocol“, RFC 768, Aug. 1980. http://www.ietf.org/rfc/rfc768.txt 7. Trista Pei-chun Chen and Tsuhan Chen, “Markov Modulated Punctured Autoregressive Processes for Traffic and Channel Modeling”, submitted to Packet Video 2002 8. D. M. Lucantoni, M. F. Neuts, and A. R. Reibman, “Method for Performance Evaluation of VBR Video Traffic Models”, IEEE/ACM Transactions on Networking, 2(2), 176-180, April 1994 9. P. R. Jelenkovic, A. A. Lazar, and N. Semret, “The Effect of Multiple Time Scales and Subexponentiality in MPEG Video Streams on Queuing Behavior”, IEEE Journal on Selected Areas in Communications, 15(6), 1052-1071 10. M. M. Krunz, A. M. Makowski, “Modeling Video Traffic using M/G/ ∞ Input Processes: A Compromise between Markovian and LRD Models”, IEEE Journals on Selected Areas in Communications, 16(5), 733-748, 1998 11. Deepak S. Turaga and Tsuhan Chen, “Hierarchical Modeling of Variable Bit Rate Video Sources”, Packet Video 2001 12. S. Lin, D. J. Costello, Jr., Error Control Coding: Fundamentals and Application, PrenticeHall 13. S. Wicker, Error Control Systems for Digital Communication and Storage, Prentice-Hall, 1995 14. B. Bellman, Dynamic Programming, Prentice-Hall, 1987 15. G. D. Forney, “The Viterbi Algorithm”. Proc. of the IEEE, 268-278, March 1973 16. A. Ortega and K. Ramchandran, “Rate-Distortion Methods for Image and Video Compression”. IEEE Signal Processing Magazine, 15(6), 23-50 17. H. Everett, “Generalized Lagrange Multiplier Method for Solving Problems of Optimum Allocation of Resources”. Operations Research, 399-417, 1963 18. Y. Shoham and A. Gersho, “Efficient Bit Allocation for an Arbitrary Set of Quantizers”. IEEE Trans. ASSP, 1445-1453, Sep 1988
19. J. Vass and X. Zhuang, “Adaptive and Integrated Video Communication System Utilizing Novel Compression, Error Control, and Packetization Strategies for Mobile Wireless Environments”, Packet Video 2000 20. H. Sub and W. Kwok, “Concealment of Damaged Block Transform Coded Images using Projections Onto Convex Sets”, IEEE Trans. Image Processing, Vol. 4, 470-477, April 1995 21. D. S. Turaga, Ph.D. Thesis, Carnegie Mellon University, July 2001 22. X. Liu and T. Chen, "Shot Boundary Detection Using Temporal Statistics Modeling", to be appeared in ICASSP 2002 23. J. Huang and T. Chen, "Tracking of Multiple Faces for Human-Computer Interfaces and Virtual Environments", ICME 2000 24. M. Yajnik, S. Moon, J. Kurose, D. Towsley, “Measurement and modeling of the temporal dependence in packet loss”, IEEE INFOCOM, 345-52, March 1999
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion

Yuh-Reuy Lee and Chia-Wen Lin (Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621, Taiwan) and Cheng-Chien Kao (Computer & Communications Research Lab, Industrial Technology Research Institute, Hsinchu 310, Taiwan)
Abstract. Video transcoding is an efficient way for rate adaptation and format conversion in various networked video applications. Several transcoder architectures have been proposed to achieve fast processing. Recently, thanks to its relatively low complexity, the DCT-domain transcoding schemes have become very attractive. In this paper, we investigate efficient architectures for video downscaling in the DCT domain. We propose an efficient method for composing downscaled motion vectors and determining coding modes. We also present a fast algorithm to extract partial DCT coefficients in the DCT-MC operation and a simplified cascaded DCT-domain video transcoder architecture.
1 Introduction
With the rapid advance of multimedia and networking technologies, multimedia services, such as teleconferencing, video-on-demand, and distance learning have become more and more popular in our daily life. In these applications, it is often needed to adapt the bit-rate of a coded video bit-stream to the available bandwidth over heterogeneous network environments [1]. Dynamic bit-rate conversions can be achieved using the scalable coding schemes provided in current video coding standards [2]. However, it can only provide a limited number of levels of scalability (say, up to three levels in the MPEG standards) of video quality, due to the limit on the number of enhancement layers. In many networked multimedia applications, a much finer scaling capability is desirable. Recently, fine-granular scalable (FGS) coding schemes have been proposed in the MPEG-4 standard to support a fine bit-rate adaptation and limited temporal/spatial format conversions. However, the video decoder requires additional functionality to decode the enhancement layers in the FGS encoded bit-streams. Video transcoding is a process of converting a previously compressed video bitstream into another bit-stream with a lower bit-rate, a different display format (e.g.,
downscaling), or a different coding method (e.g., the conversion between H.26x and MPEGx, or adding error resilience), etc. To achieve the goal of universal multimedia access (UMA), the video contents need to be adapted to various channel conditions and user equipment capabilities. Spatial resolution reduction [5-9] is one of the key issues for providing UMA in many networked multimedia applications. In realizing transcoders, the computational complexity and picture quality are usually the two most important concerns and need to be traded off to meet various requirements in practical applications. The computational complexity is very critical in real-time applications. A straightforward realization of video transcoders is to cascade a decoder followed by an encoder as shown in Fig. 1. This cascaded architecture is flexible and can be used for bit-rate adaptation and spatial and temporal resolution-conversion without drift. It is, however, very computationally intensive for real-time applications, even though the motion-vectors and coding-modes of the incoming bit-stream can be reused for fast processing. Incoming bitstream
Fig. 1. Cascaded pixel-domain transcoder
For efficient realization of video transcoders, several fast architectures have been proposed in the literature [2-11, 14-15]. In [10], a simplified pixel-domain transcoder (SPDT) was proposed to reduce the computational complexity of the cascade transcoder by reusing motion vectors and merging the decoding and encoding process and eliminating the IDCT and MC (Motion Compensation) operations. [11] proposed a simplified DCT-domain transcoder (SDDT) by performing the motion-compensation in the DCT-domain [12] so that no DCT/IDCT operation is required. This simplification imposes a constraint that this architecture cannot be used for spatial or temporal resolution conversion and GOP structure conversion, that requires new motion vectors. Moreover, it cannot adopt some useful techniques, which may need to change the motion vectors and/or coding modes, for optimizing the performance in transcoding such as motion vector refinement [14]. The cascaded pixel-domain transcoder is drift-
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
209
free and does not have the aforementioned constraints. However, its computational complexity is still high though the motion estimation doesn’t need to be performed. In this paper, we investigate efficient realizations of video downscaling in the DCT domain. We also propose efficient methods for composing downscaled motion vectors and determining coding modes. We also present a fast algorithm to extract partial DCT coefficients in the DCT-MC operation and a simplified cascaded DCT-domain video transcoder architecture. The rest of this paper is organized as follows. In section 2, we discuss existing transcoder architectures, especially the DCT-domain transcoder for spatial downscaling. In section 3, we investigate efficient methods for implementing downsizing and motion compensation in the DCT domain. Finally, the result is summarized in section 4.
2 Cascaded DCT-Domain Transcoder for Spatial Resolution Downscaling To overcome the constraints of the SDDT, we propose to use the Cascaded DCTDomain Transcoder (CDDT) architecture which first appeared in [6]. The CDDT can avoid the DCT and IDCT computations required in the pixel-domain architectures as well as preserve the flexibility of changing motion vectors, coding modes as in the CPDT. Referring to Figure 1, by using the linearity property of the DCT transform (i.e., DCT(A+B) = DCT(A) + DCT(B)), the DCT block can be moved out from the encoder loop to form the equivalent architecture in Fig. 2(a). Each combination of IDCT, pixel-domain motion compensation, and DCT as enclosed by the broken lines is equivalent to a DCT-domain MC (DCT-MC) peration. Therefore we can derive the equivalent cascaded DCT-domain transcoder architecture as shown in Fig. 2(b). The MC-DCT operation shown in Fig. 3 can be interpreted as computing the coefficients of the target DCT block B from the coefficients of its four neighboring DCT blocks, Bi, i = 1 to 4, where B = DCT(b) and Bi = DCT(bi) are the 8×8 blocks of the DCT coefficients of the associated pixel-domain blocks b and bi of the image data. A close-form solution to computing the DCT coefficients in the DCT-MC operation was firstly proposed in [12] as follows. 4
B = ∑ H hi Bi H wi
(1)
i =1
where wi and hi ∈ {1,2,…7}. H h and H w are constant geometric transform matrii i ces defined by the height and width of each subblock generated by the intersection of bi with b. Direct computation of Eq. (1) requires 8 matrix multiplications and 3 matrix additions. Note that, the following equalities holds for the geometric transform matrices: H h = H h , H h = H h , H w = H w , and H w = H w . Using these 1 2 3 4 1 3 2 4 equalities, the number of operations in Eq. (1) can be reduced to 6 matrix multiplica-
210
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
tions and 3 matrix additions. Moreover, since H h and H w are deterministic, they i i can be pre-computed and then pre-stored in memory. Therefore, no additional DCT computation is required for the computation of Eq. (1).
ENCODE R
DECODE R Incoming Bitstream
IQ1
IDCT1
MV 1
Outgoing Bitstream
Q2
DCT
IQ2
MC
DCT
F
IDCT2
MC
F
DCT-MC 1
DCT-MC 2 MV 2
(a)
Incoming Bitstream
DECODE R
ENCODE R
IQ1
Outgoing Bitstream
Q2
IQ2
DCT-MC 1
MV 1 DCT-MC 2
MV 2
(b) Fig. 2. (a) An equivalent transform of the cascaded pixel domain transcoder; (b) cascaded DCTdomain transcoder
w1
B2
B1 h1
B
B3
B4
Fig. 3. DCT-domain motion compensation
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
211
SEQUENCE: FOREMAN-QCIF 42 Simplified DCT-domain Cascaded pixel-domain Cascaded DCT-domain
Average PSNR (dB)
40
38
36
34
32
30
32
64 Bitrate (Kbps)
96
(a) SEQUENCE: CARPHONE-QCIF Simplified DCT-domain Cascaded pixel-domain Cascaded DCT-domain
42
Average PSNR (dB)
40
38
36
34
32
30
32
64 Bitrate (Kbps)
96
(b) Fig. 4. Performance comparison of average PSNR with three different transcoders. the incoming sequence was encoded at 128 kb/s, and transcoded to 96 kb/s, 64 kb/s, and 32 kb/s, respectively for: (a) “foreman” sequence; (b) “carphone” sequence
We compare the PSNR performance of CPDT, SDDT, and CDDT in Fig. 4. Two test sequences: “foreman” and “carphone” were used for simulation. Each incoming sequence was encoded at 128 Kbps and transcoded into 96, 64,and 32 Kbps, respectively. It is interesting to observe that, though all the three transcoding architectures are mathematically equivalent by assuming that motion compensation is a linear operation, DCT and IDCT can cancel out each other, and DCT/IDCT has distributive property, the performance are quite different. The CPDT architecture outperforms the other two. Though the performance of the DCT-domain transcoders is not as ggod as the SPDT, the main advantage of the DCT-domain transcoders lies on the existing efficient algorithms for fast DCT-domain transcoding [10,11,18,19], which make them
212
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
very attractive. For spatial resolution downscaling, we propose to use the cascaded DCT-domain transcoder shown in Fig. 5. This transcoder can be divided into four main functional blocks: decoder, downscaler, encoder, and MV composer, where all the operations are done in the DCT domain. In the following, we will investigate efficient schemes for DCT-domain downscaling. DECODER Incoming Bitstream
ENCODER DCT-domain downscaling
VLD +IQ1
Outgoing Bitstream
Q2
IQ2
DCT-MC 1
MV 1 DCT-MC 2 MV Composition
v$
MV 2
Fig. 5. Proposed DCT-domain spatial resolution down-conversion transcoder
3
Algorithms for DCT-Domain Spatial Resolution Downscaling
3.1
DCT-Domain Motion Compensation with Spatial Downscaling
Consider the spatial downscaling problem illustrated in Fig. 6, where b1, b2, b3, b4 are the four original 8×8 blocks, and b is the 8×8 downsized block. In the pixel domain, the downscaling operation is to extract one representative pixel (e.g., the average) out of each 2x2 pixels. In the following, we will discuss two schemes for spatial downscaling in the DCT domain which may be adopted in our DCT-domain downscaling transcoder.
b1 8x8
b2 8x8
b3 8x8
b4 8x8
downscaling
b 8x8
Fig. 6. Spatial resolution down-conversion
A. Filtering + Subsampling Pixel averaging is the simplest way to achieving the downscaling, which can be implemented using the bilinear interpolation expressed below [6,14].
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
213
4
b = ∑ hibi g i
(2)
q 4×8 t t h1 = h 2 = g1 = g3 = 04×8 h = h = g t = gt = 04×8 4 2 4 q 3 4×8
(3)
i=1
The filter matrices, hi and
gi , are
where
0 0 0 0 0 0.5 0.5 0 0 0 0.5 0.5 0 0 0 0 , and 0 is a 4×8 zero matrix. 4×8 q 4×8 = 0 0 0 0 0.5 0.5 0 0 0 0 0 0 0 0.5 0.5 0 The above bilinear interpolation procedure can be performed in the DCT domain directly to obtain the DCT coefficients of the downsized block (i.e., B = DCT(b)) as follows: 4
4
i =1
i =1
B = ∑ DCT(h i ) DCT(bi ) DCT(g i ) = ∑ H i Bi Gi
(4)
Other filtering methods with a larger number of filter taps in hi and g i may achieve better performance than the bilinear interpolation. However, the complexity may increase in pixel-domain implementations due to the increase in the filter length. Nevertheless, the DCT-domain implementation cost will be close to the bilinear interpolation, since in Eq. (4) Hi and Gi can be precomputed and stored, thus no extra cost will be incurred. B. DCT Decimation It was proposed in [13,14] a DCT decimation scheme that extracts the 4x4 lowfrequency DCT coefficients from the four original blocks b1-b4, then performs 4x4 IDCT to obtain four 4x4 subblocks, and finally combine the four subblocks into an 8x8 blocks. This approach was shown to achieve significant performance improvement over the filtering schemes [14]. [8] interpreted the DCT decimation as basis vectors resampling, and presented a compressed-domain approach for the DCT decimation as described below. Let B1, B2, B3, and B4, represent the four original 8×8 blocks; Bˆ1 , Bˆ 2 , Bˆ3 and Bˆ 4 be the four 4×4 low-frequency sub-blocks of B1, B2, B3, and B4, respectively; b$ 1 b$ 2 b$ i = IDCT( Bˆi ) , i = 1, …, 4. Then b$ = is the downscaled version of b$ 3 b$ 4 8×8
214
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
def b b2 = DCT(bˆ) from Bˆ , Bˆ , Bˆ and Bˆ , we can use the . To compute B b= 1 1 2 3 4 b3 b 4 16×16
following expression:
ˆ t Bˆ = TbT b$ 1 b$ 2 TLt = [TL TR ] $ $ t b 3 b 4 TR 1T T t B 2T T t T t B 4 4 4 = [TL TR ] 4 Lt t t T4 B 3T4 T4 B 4T4 TR 1 (T T t )t + (T T t ) B 2 (T T t )t + (T T t ) B 3 (T T t )t = (TLT4t ) B L 4 L 4 R 4 R 4 L 4 t t t +(T T ) B 4 (T T ) R 4
(5)
R 4
In addition to the above formulation, [8] also proposed a decomposition method to convert Eq. (5) into a new form so that matrices in the matrix multiplications become more sparse to reduce the computation. 3.2
Motiov Vector Composition and Mode Decision
After downscaling, the motion vectors need to be re-estimated and scaled to obtain a correct value. Full-rang motion re-estimation is computationally too expensive, thus not suited to practical applications. Several methods were proposed for fast composing the downscaled MVs based on the motion information of the original frame [7,14,17]. In [14], three methods for composing new motion vectors for the downsized video were compared: median filtering, averaging, and majority voting. It was shown in [14] that the median filtering scheme outperforms the other two. We propose to generalize the media filtering scheme to find the activity-weighted median of the four original vectors: v1, v2, v3, v4. In our method the distance between each vector and the rest is calculated as the sum of the activity-weighted distances as follows:
di =
1 ACTi
4
∑ v −v j =1 j ≠i
i
j
(6)
where the MB activity can be the squared or absolute sum of DCT coefficients, the number of nonzero DCT coefficients, or simply the DC value. In our method, we adopted the squared sum of DCT coefficients of MB as the activity measure. The activity-weighted median is obtained by finding the vector with the least distance from all. That is
v=
1 arg min di vi ∈{v1 , v2 , v3 , v4 } 2
(7)
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
215
Fig. 7 shows the PSNR comparison of three motion vector composition scheme: 2 activity-weighted median (denoted by DCT-coef ), the maximum DC method in [17] (denoted by DC-Max), and the average vector scheme (denoted by MEAN). The simulation result that the activity-weighted media outperforms the other two.
(a)
(b) Fig. 7. PSNR performance comparison of three motion vector composition schemes. The input sequences: (a) “foreman” sequence; (b) “news” sequence, are transcoder form 256 Kbps, 10fps into 64 Kbps, 10fps
After the down-conversion, the MB coding modes also need to be re-determined. In our method, the rules for determining the code modes are as follows: (1) If at least one of the four original MBs is intra-coded, then the mode for the downscaled MB is set as Intra. (2) If all the four original MBs are inter-coded, the resulting downscaled MB will also be inter-coded.
216
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao
(3) If at least one original MB is skipped, and the reset are inter-coded, the resulting downscaled MB will be inter-coded. (4) If all the four original MBs are skipped, the resulting downscaled MB will also be skipped. Note, the motion vectors of skipped MBs are set to zero.
3.3 Computation Reduction in Proposed Cascaded DCT-Domain Downscaling Transcoder In Fig. 4, the two DCT-MCs are the most expensive operation. In our previous work [18], we showed that for each 8×8 DCT block, usually only a small number of lowfrequency coefficients are significant. Therefore we can use the fast significant coefficients extraction scheme proposed in [18] to reduce the computation for DCT-MC. The concept of significant coefficients extraction is illustrated in Fig. 8, where only partial coefficients (i.e., n ≤ 8) of the target block need to be computed. n1×n1
n2×n2 n×n
B1
B2 n4×n4
n3×n3
B3
Bˆ
B
B4
Fig. 8. Computation reduction for DCT-MC using significant coefficients extraction
The DCT-domain down-conversion transcoder can be further simplified by moving the downscaling operation into the decoder loop so that the decoder only needs to decode one quarter of the original picture size. Fig. 9 depicts the proposed simplified architecture. With this architecture both the computation and memory cost will be reduced significantly. However, similar to the down-conversion architectures in [20,21], this simplified transcoder will result in drift errors due to the mismatch in the frame stores between the front-end encoder and the reduced-resolution decoder loop of the transcoder. Several approaches have been presented to mitigate the drift problem [20,21], which may introduce some extra complexity. In MPEG video, since the drift in B frames will not result in error propagation, a feasible approach is to perform fullresolution decoding for I and P frames, and quarter-resolution decoding for B frames.
4
Summary
In this paper, we presented architectures for implementing spatial downscaling video transcoders in the DCT domain and efficient methods for implementing DCT-domain motion compensation with downscaling. We proposed an activity-weighted median
A DCT-Domain Video Transcoder for Spatial Resolution Downconversion
217
filtering scheme for composing the downscaled motion vectors, and also a method for determining the decision mode. We have also presented efficient schemes for reducing the computational cost of the downscaling trancoder. DECODER Incoming Bitstream
ENCODER
VLD +IQ1
Outgoing Bitstream
Q2 Downscaled DCT-MC 1
IQ2
MV 1 DCT-MC 2 MV Composition
v$
MV 2
Fig. 9. Simplified DCT-domain spatial resolution down-conversion transcoder
References 1. 2. 3. 4. 5. 6. 7. 8. 9.
10. 11.
12.
Moura, J., Jasinschi, R., Shiojiri-H, H., Lin, C.: Scalable Video Coding over Heterogeneous Networks. Proc. SPIE 2602 (1996) 294-306 Ghanbari, M.: Two-Layer Coding of Video Signals for VBR Networks. IEEE J. Select. Areas Commun. 7 (1989) 771-781 Sun, H., Kwok, W., Zdepski, J. W.: Architecture for MPEG Compressed Bitstream Scaling. IEEE Trans. Circuits Syst. Video Technol. 6 (1996) 191-199 Eleftheriadis, A. Anastassiou, D.: Constrained and General Dynamic Rate Shaping of Compressed Digital Video. Proc. IEEE Int. Conf. Image Processing (1995) Hu, Q., Panchanathan, s.: Image/Video Spatial Scalability in Compressed Domain. IEEE Trans. Ind. Electron. 45 (1998) 23–31 Zhu, W., Yang, K., Beacken, M.: CIF-to-QCIF Video Bitstream Down-Conversion in the DCT Domain. Bell Labs technical journal 3 (1998) 21-29 Yin, P., Wu, M., Liu, B.: Video Transcoding by Reducing Spatial Resolution. Proc. IEEE Int. Conf. Image Processin (2000) R. Dugad and N. Ahuja, “A Fast Scheme for Image Size Change in the Compressed Domain. IEEE Trans. Circuit Syst. Video Technol. 11 (2001) 461-474 N. Merhav and V. Bhaskaran, “Fast Algorithms for DCT-Domain Image Down-Sampling and for Inverse Motion Compensation. IEEE Trans. Circuits Syst. Video Technol. 7 (1997) 468–476 Keesman, g. et al.: Transcoding of MPEG Bitstreams. Signal Processing: Image Commun. 8 (1996) 481-500 Assuncao, P. A. A., Ghanbari, M.: A Frequency-Domain Video Transcoder for Dynamic Bit-rate Reduction of MPEG-2 Bit Streams. IEEE Trans. Circuits Syst. Video Technol. 8 (1998) 953-967 Chang, S. F., Messerschmitt, D. G.: Manipulation and Compositing of MC-DCT Compressed Video. IEEE J. Select. Areas Commun. (1995) 1-11
218 13. 14.
15. 16.
17.
18. 19.
20. 21.
Yuh-Reuy Lee, Chia-Wen Lin, and Cheng-Chien Kao Tan, K. H., Ghanbari, M.: Layered Image Coding Using the DCT Pyramid. IEEE Trans. Image Processing 4 (1995) 512-516 Shanableh T., Ghanbari, M.: Heterogeneous Video Transcoding to Llower Spatiotemporal Resolutions and Different Encoding Formats. IEEE Trans. on Multimedia 2 (2000) 101-110 Shanableh T., Ghanbari, M.: Transcoding Architectures for DCT-Domain Heterogeneous Video Transcoding. Proc. IEEE Int. Conf. Image Processing (2001) Seo, K., Kim J.: Fast Motion Vector Refinement for MPEG-1 to MPEG-4 Transcoding with Spatial Down-sampling in DCT Domain. Proc. IEEE Int. Conf. Image Processing (2001) 469-472 17 Chen, M.-J., M.-C. Chu, M.-C., Lo, S.-Y.: Motion Vector Composition Algorithm for Spatial Scalability in Compressed Video. IEEE Trans. Consumer Electronics 47 (2001) 319-325 18 Lin, C.-W., Lee, Y.-R.: Fast Algorithms for DCT Domain Video Transcoding. Proc. IEEE Int. Conf. Image Processing (2001) 421-424 19 Song, J., Yeo, B.-L.: A Fast Algorithm for DCT-Domain Inverse Motion Compensation based on Shared Information in a Macroblock. IEEE Trans. Circuits Syst. Video Technol. 10 (2000) 767-775 20 Vetro, A., Sun, H., DaGraca, P., Poon, T.: Minimum Drift Architectures for Threelayer Scalable DTV Decoding. IEEE Trans. Consumer Electronics 44 (1998) 21 Vetro, A., Sun, H.: Frequency Domain Down-Conversion Using an Optimal Motion Compensation Scheme. Int’l Journal of Imaging Systems & Technology 9 (1998)
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast of Streaming Video Chin-Ying Kuo1, Chen-Lung Chan1, Vincent Hsu2, and Jia-Shung Wang1 1
Department of Computer Science, National Tsing Hua University, HsinChu, Taiwan _QVHVNW[ERKa$GWRXLYIHYX[ 2 Computer & Communications Research Laboratories, Industrial Technology Research Institute, HsinChu, Taiwan ZLWY$MXVMSVKX[
Abstract. Modern multimedia services usually distribute their contents by means of streaming. In most systems, the point-to-point delivery model is adopted but also known as less efficient. To extent scalability, some services apply periodic broadcast to provide an efficient platform that is independent of the number of clients. These periodic broadcast services can significantly improve performance, however, they require a large amount of client buffers also be inadequate to run on heterogeneous networks. In this paper, we propose a novel periodic broadcast scheme that requires less buffer capacity. We also integrate a receiver-driven channel adjustment adaptation to adjust the transmission rate for each client.
1 Introduction Streaming is the typical technology used to provide various real-time multimedia services. The primary benefit of streaming is processing playback without downloading the entire video in advance. In this architecture, the content server packetizes the video into packets and transmits them to clients. Each client merely acquires a small playback buffer to compose successive video packets they received from the networks and composes these packets to video frames for playing. Although streaming technology is flexible, it cannot support a large-scale system because each client must demand a server stream. Point-to-point communication is known inefficient, so some novel services apply broadcast or multicast to raise scalability. In conventional broadcast systems, each video is continuously broadcasted on the networks. The transfer rate of a video equals to its consumption rate and no additional buffer space is required at the client side. This scheme is efficient but inflexible because long waiting time may be required if the client requests just after the start of broadcasting. The waiting time in this case is almost the same as the playback duration. To reduce such delay, some straightforward schemes allocate multiple channels S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 219–228, 2002. © Springer-Verlag Berlin Heidelberg 2002
220
Chin-Ying Kuo et al.
to broadcast a popular video. For example, if we allocate three video channels for an 84-minute video, we can partition the whole video into three segments and broadcast these segments periodically in distinct channels. As Fig. 1 displays, the maximum waiting time can be significantly reduced to 28 minutes. time Channel 0
S1
S1
S1
….
Channel 1
S2
S2
S2
….
Channel 2
S3
S3
S3
….
28 minutes S 1 : the first 28 minutes of the video S 2 : the second 28 minutes of the video S 3 : the final 28 minutes of the video
Fig. 1. Broadcasting with multiple channels.
Broadcast-based multimedia delivery is an interesting topic, and many data broadcasting schemes [1–8] are proposed nowadays. We first discuss the concept of fast data broadcasting scheme [7]. The primary contribution of fast data broadcasting is reducing the initial delay of playback. However, a huge client buffer is required to store segments that cannot be immediately played out. Suppose k channels are allocated for a video with length L. The sequence {C0, C1, …, CK-1} represents the k channels correspondingly. The bandwidth of each channel equals to the consumption rate of the video. Besides, the video is equally divided into N segments, where N = 2k - 1. Suppose Si represents the ith segment of the video, so the entire video can be constituted as S1 · S2 ·…· SN. We allocate the channel Ci for segments {Sa, …,Sb}, where i = i i+1 i 0, 1, …, k-1, a = 2 , and b = 2 - 1. Within the channel Ci, these 2 data segments are broadcasted periodically. As Fig. 2 indicates, the video is partitioned into 7 segments and then is broadcasted on 3 channels. We observe that the viewer's initial delay (noted as d) is reduced to 12 minutes. Comparing with the previous broadcast scheme which waiting time equals 28 minutes, the fast data broadcasting is much more intelligent. L (the whole movie)
S1 S2 d
······
S7
Channel 0 S 1 S 1 S 1 S 1 S 1 S 1 S 1 S 1 · · · Channel 1 S 2 S 3 S 2 S 3 S 2 S 3 S 2 S 3 · · · Channel 2 S 4 S 5 S 6 S 7 S 4 S 5 S 6 S 7 · · · Fast Service (Needs buffer) Service without buffer
Fig. 2. An example of fast data broadcasting (k=3).
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast
221
Although fast data broadcasting reduces the waiting time, extensive buffer requirement (about 50% per video) at the client side requires more cost on equipment. In addition, before applying fast data broadcasting scheme, the service provider must predict the popularity of each video. We should allocate more channels for popular videos. If the prediction is not accurate or the popularity changes in the future, the allocation will be wasteful. To overcome this drawback, adaptive fast data broadcasting scheme [8] is proposed. If the video was not requested for a long time, the server will attempt to release channels allocated for this video if possible. The newly free channel can be used by other popular videos therefore the efficiency can be enhanced. And if the video is demanded again, the server allocates new channels for it. With adaptive data broadcasting scheme, the system can be more flexible. Although fast data broadcasting and adaptive fast data broadcasting are interesting, they are not efficient enough. We propose a novel dynamic data broadcast scheme in this study. In our scheme, both viewer’s waiting time and storage requirement are reduced. In addition, the popularity of a video is used to determine the bandwidth allocation by modifying the channel allocation. Moreover, when some videos are going to be on-line or off-line, the system will intelligently determine an appropriate channel allocation for them. RR 11
1 0 M b p s GG22 5 0 0 k b p s
3 0 0 k b p s
1 0 M b p s SS
RR 2 2
GG11 1 0 M b p s RR 33
Fig. 3. A heterogeneous network.
Although periodic broadcast provide an efficient platform for multimedia delivery, the available network bandwidth for each client usually substantially varies in Internet. As depicted in Fig. 3, server S transmits a video with 10 Mbps. For receiver R3, a perfect video service is available since R3 has sufficient bandwidth to receive all data packets of the video. However, a bottleneck is observed between two gateways G1 and G2, thus, both receiver R1 and R2 would loss many data packets so they cannot enjoy the playback smoothly. Applying receiver-driven bandwidth adaptation to adjust the transmission rate to meet different clients’ network capacities is a well-known approach. The general receiver-driven bandwidth adaptation integrates a multi-layered coding algorithm with a layered transmission system. In layered coding algorithm, it encodes a video into multiple layers including one base layer (denoted as layer one) and several enhanced layers (denoted as layer 2, layer 3, …etc.). By subscribing numbers of layers depending on its network bandwidth, each client receives the best quality of the video that the network can deliver. McCanne, Jacobson and Vetterli [9] proposed a receiver-driven layered multicast (RLM) scheme by extending the multiple
222
Chin-Ying Kuo et al.
group framework with a rate-adaptation protocol. Thus, the transmission of different layered signals over heterogeneous networks is possible. In this scheme, a receiver searches for the optimal level of subscription by two rules:
• •
Drop one layer when congestion occurs. Add one layer when receive successfully.
After perform rate-adaptation on the case in Fig. 3, we have the flow in Fig. 4 Suppose the source S transmits three layers of video by 200 kbps, 300 kbps, 500 kbps, respectively. Because network bandwidth between S and R3 is high, R3 can successfully subscribe all three layers and enjoys the highest video quality. However, since only 500 kbps capacity is available on G2, R1 and R2 cannot receive the entire three layers. At G2, the third layer will be dropped then R1 can only subscribe two layers. For R2, because the network bandwidth is only 300 kbps, it must drop the second layer and subscribe the base layer only. However, the RLM scheme treats each stream independently. If multiple streams pass the same bottleneck link (which are called sharing streams), they may compete for the limited bandwidth because they do not know the sharing status. This may cause unfairness of subscription level of different streams. Therefore, flexible bandwidth allocation adapted to receivers is necessary to share the bandwidth. One approach named Multiple Streams Controller (MSC) was proposed in [10]. In this scheme, it is an RLM-based method with MSC at every client end. It can dynamically adjust the subscription level owing to the available bandwidth. RR 11
1 0 Mb p s G G22 5 0 0 k b p s
3 0 0 k b p s
1 0 Mb p s SS
RR 22
GG1 1 1 0 Mb p s RR 3 3
Fig. 4. Layer subscription.
Bandwidth adaptation schemes described above are developed over multi-layered coded streaming system. However, the implementation of layered coding is still not popular even though the standard of MPEG-4 supports multi-layered coding. Without multi-layered coding, re-encoding the source media into streams with various qualities in server or intermediate nodes is another solution. In these designs, transcoders and additional buffer spaces are required. The buffer is employed to store input streams temporally, and the transcoders are used to re-encode video streams stored in the buffer to output streams with various bit-rate. Each client continues probing the network and sends messages containing the status to the corresponding intermediate node. When the server or intermediate nodes receive these messages, they determine the number of streams that the transcoder should generate and then forward these
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast
223
streams to clients. Although transcoding shows a candidate solution while lacking layered coding, the computation complexity in intermediate nodes is expensive if the service scale substantially extends. Does video quality be the only metrics that impacts network bandwidth? The answer is generally yes in end-to-end transmission systems, but not absolutely in periodic broadcast. The bandwidth requirement in periodic broadcast is proportional to the number of channels, so adjusting transmission quality implies changing the number of channels. Furthermore, the quality of streams can also be referred as waiting time and client buffer size of a video in periodic broadcast. Therefore, the concept of receiverdriven bandwidth adaptation can be easily transformed to periodic broadcast. This is our primary target of this study. The rest of this paper is organized as follows. Section 2 describes the broadcast scheme we proposed. Section 3 introduces the integration of our broadcast scheme and a receiver-driven channel adjustment adaptation. Conclusion is then made in Section 4.
2 Our Broadcast Scheme In most periodic broadcast schemes, the permutation of segments to be broadcasted in each channel is determined initially. These schemes usually apply formulas to assign each segment to appropriate channel. For example, fast data broadcast scheme assigns 1, 2, 4, … segments to the first, second, third, … channels, respectively. Although periodic broadcasting schemes can serve a popular video with shorter viewer’s waiting time, large amount of storage requirements at client end is necessary. Assume the video length is L and the consumption rate is b. In fast data broadcasting, client buffer usage is varied from 0 to about 0.5*L*b. The buffer utilization varies too significantly. If the buffer can be utilized more evenly, we can reduce the buffer requirement in the k worst case. In fast data broadcasting, it divides a video into 2 – 1 segments where k is the number of channels. In order to reduce the receiver’s buffer requirements, we hope to allocate one additional channel to improve the flexibility of segment delivery. We k-4 define a threshold of the buffer size as 0.15*L*b. In this case, at most 2 segments size will be required at each client side. If the number of channel is less than 4, no buffer is needed for a receiver. Since the client buffer size is controlled under 0.15*L*b, if a receiver’s buffer requirement exceeds 0.15*L*b, we can use the additional channel to assign segments into different time slots. Thus, buffer usage of each receiver is evenly. In the case that we have k channels, C0, C1, …, Ck-1, for a video of length L. Each channel has bandwidth b, which is assumed the same as the consumption rate of a k-1 video. The video is divided equally into N segments, where N = 2 – 1. Let Si denote the ith segment, the video is constituted as (S1, S2, …, SN). Let Bc denote the maximum k-4 buffer requirement at the client end, where Bc = 2 segments. Suppose there is at least one request at each time interval. First, a segment Si is assigned to a free channel if it must be played immediately. If some channels are idle, we assign segments which will
224
Chin-Ying Kuo et al.
be played later into these empty channels. The corresponding clients must store these segments in their buffer. If there is no new request at some time interval, the latest allocated channel can be released. t C
Playing Buffered segment segment
0
S 1 d
0
V
0
S
1
V V
0
S S
2
t 0+ d C
0
C
1
S
1
S
1
S
2
1
S
1
2
t 0+ 2 d C
0
C
1
S
1
S
1
S
1
S
2
S
3
V V V
S S S
0 1 2
3
S S
2 1
3 3
t 0+ 3 d C
0
C
1
C
2
S
1
S S
1
S
2
S
1
S
3
S
2
S
4
V V V V
1
S S S S
0 1 2 3
4
S S S
3 2 1
4 3 2
t 0+ 4 d C C
0 1
C
2
C
3
S
1
S S
1 2
S S
1 3
S S S
1 2 4
S S
1 5
S
6
S
4
V V V V V
0 1 2 3 4
S S S S S
5 4 3 2 1
S S S S S
6 5 4 4 4
Fig. 5. An example of our data broadcast schedule.
Consider the example displayed in Fig. 5, the video is divided into 7 segments and 4 channels are available. At t0, a new channel C0 is allocated for the video and the first segment S1 is assigned into C0 to serve the viewerV0. Since the new viewerV1 issues at t0 +d, the segment S1 is assigned into C0 again. In addition, we allocate a new channel C1 to transmit S2 for servicing V0. However, the operation in V1 is more complex. V1 must play S1 directly from the network and save S2 into the buffer for future playback. To serve V2 at t0+2d, we still must assign S1 into C0. At the same time, V1 reads S2 from local disk because S2 has been stored at t0 +d, so we need not broadcast S2. The only segment that must be broadcasted now is S3. We observe that only two channels are required at t0 +2d. When the scheme proceeds to t0+3d, only three channels are required because S3 for V1 was already stored in the buffer. By the same procedure, we observe the system required only two channels at t0+4d. Since S4 and S6 will be played
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast
225
by V2 and V0 later, we can assign them to C2 and C3 now. If we do not apply this assignment, V0 and V2 will cause the system allocate too many channels when they play these segments. In this example, we utilize at most 4 channels at server side and 1segment buffer at client side (about 0.143*L*b). Our scheme can amazingly reduce the buffer requirement. In our scheme, the channels can be dynamically allocated and deallocated. Fig. 6 shows the situation if there is no request between t0 + 6d and t0 + 7d. Since no new request issues, we can release the latest allocated channel C3 at t0 + 7d. In addition, only two segments S2 and S3 are required immediately, so we can assign S7 in the empty channels. t 0 + 6d C0
S1
C1
S1
S1
S1
S1
S1
S1
S2
S3
S4
S5
S2
S5
S2
S6
S3
S6
S4
S7
S4
C2 C3
t 0 + 7d C0 C1 C2
S1
S1
S1
S1
S1
S1
S1
S2
S2
S3
S4
S5
S2
S5
S3
S2
S6
S3
S6
S7
S4
S7
S4
Fig. 6. A condition to release a channel.
Consider the example displayed in Fig. 5, the video is divided into 7 segments and 4 channels are available. At t0, a new channel C0 is allocated for the video and the first segment S1 is assigned into C0 to serve the viewerV0. Since the new viewerV1 issues at t0 +d, the segment S1 is assigned into C0 again. In addition, we allocate a new channel C1 to transmit S2 for servicing V0. However, the operation in V1 is more complex. V1 must play S1 directly from the network and save S2 into the buffer for future playback. To serve V2 at t0+2d, we still must assign S1 into C0. At the same time, V1 reads S2 from local disk because S2 has been stored at t0 +d, so we need not broadcast S2. The only segment that must be broadcasted now is S3. We observe that only two channels are required at t0 +2d. When the scheme proceeds to t0+3d, only three channels are required because S3 for V1 was already stored in the buffer. By the same procedure, we observe the system required only two channels at t0+4d. Since S4 and S6 will be played by V2 and V0 later, we can assign them to C2 and C3 now. If we do not apply this assignment, V0 and V2 will cause the system allocate too many channels when they play these segments. In this example, we utilize at most 4 channels at server side and 1segment buffer at client side (about 0.143*L*b). Our scheme can amazingly reduce the buffer requirement. In our scheme, the channels can be dynamically allocated and deallocated. Fig. 6 shows the situation if there is no request between t0 + 6d and t0 + 7d. Since no new request issues, we can release the latest allocated channel C3 at t0 +
226
Chin-Ying Kuo et al.
7d. In addition, only two segments S2 and S3 are required immediately, so we can assign S7 in the empty channels.
3 Channel Adjustment In periodic data broadcasting scheme, all clients are served with the same video quality. However, practical networks are usually heterogeneous, so we cannot assume that each client can enjoy the same transmission quality. As we described previously, the requirement of a receiver-driven bandwidth adaptation scheme for data broadcasting is emergent. In this paper, we propose a "channel adjustment" process to approach receiver-driven concept on dynamic data broadcasting scheduling. Consider a video is transmitted to clients in different networks. These clients must calculate the loss rate of this video while taking the requiring data. The server collects the information of the loss rate in clients and determines the appropriate number of channels. If more than half clients are in congestion, the channel adjustment process should be activated to reduce the number of channels. The network traffic can be reduced correspondingly. The concept of our channel adjustment is described in the follows. 15
15
Suppose a hot video is divided into 15 segments (S 1 ~ S 15 ) and transmitted by 5 video channels (C0 ~ C4) on a server end. Suppose congestion happens in most clients, thus, one channel should be released to reduce network traffic. Since the number of 7
channels is decreased to 4 now, the video must be re-divided into 7 segments (S 1 ~ 7
S 7 ). All on-line views must not be delayed while the number of channel decreases. Assume our adjustment starts at H0. We first find the least common multiplier (l.c.m.) of the segment numbers, 7 and 15 in both conditions. Since the least common multi105
plier of 7 and 15 is 105, we virtually divide the video into 105 segments (S 1 105 105
~S
). Table 1 shows the mapping between these segments, and Fig. 7 displays the
example of such channel adjustment. Suppose S
15 13
is necessarily transmitted at H0 to
7 S1
serve previous viewers (Vp). In addition, is also required now to serve new view105 ers. Since we virtually divide segments into S in channel adjustment process, seg7 15 105 ments of S and S can be served as S . Thus, although these segments differ in their sizes, they still can be received by clients without overlap by applying our segment mapping process. In addition, if free blocks are available (the dotted-rectangle in Fig. 7), we can put segments which will be required by client to it. As Fig. 8 displays, S and S
15 15
Thus, S
will be required by Vp and we can assign both S 105 92
~S
105 99
are assigned to channel C1 and S
105 100
15 14
~S
and S 105 105
15 15
15 14
to free blocks.
are assigned to chan-
nel C2. Because the channel adjustment is easy, we can make it transparent to the dynamic data broadcasting. The channel adjustment process completes after all viewers
A Receiver-Driven Channel Adjustment Scheme for Periodic Broadcast 7
227
receive all segments in original S successfully. Since only 4 channels are required in the case that a video is divided into 7 segments, one video channel can be released from now on. Therefore, the network bandwidth is successfully reduced. Table 1. Least common multiplier for sub-segments mapping. 105
The number of divided segments
Mapping to S n
7
S i = S (i−1)*105/7+1 ~ S i*105 / 7
7
15
S i = S (i−1)*105/15+1 ~ S i*105 /15
7 (S i , i = 1~7)
15
15 (S i , i = 1~15)
105
105
105
105
H0 C0
S
C1
S
7 1
15 13
C2 C3 C4 H0
Ma p p i n g
C 0 S11 0 5 S 120 5
S1150 5
C 1 S18055 S 18065 C2
S
10 5 91
C3 C4 S
x i
S
x j
•
x
: Broadcasting successive segments from S i + 1 to S
x j− 1
.
: no data to broadcast Fig. 7. An example of channel adjustment. H C
0
0
S 11 0
C
1
S
C
2
S
C
3
C
4
5
10 5 8 5 10 5 10 0
S
10 5 2
S S S
1 0 5 9 1 10 5 10 0
S 19 02 5
S
1 0 5 9 8
Fig. 8. An example of free block assignment.
S
10 5 15 1 0 5 9 9
228
Chin-Ying Kuo et al.
4 Conclusion We introduce a concept of receiver-driven bandwidth control scheme called channel adjustment on dynamic periodic broadcast scheduling for real-time video service. The primary technology used in our scheme is a dynamic periodic broadcast scheduling. In our scheme, the service scalability is significantly extended via periodic broadcast. Furthermore, the novel channel adjustment proposed in this study can extend our system to heterogeneous clients. The same as other periodic broadcast schemes, we partition each popular video into numbers of segments and then broadcast these segments on distinct channels with different frequencies. The originality of our scheme is dynamically adjusting the broadcast schedule to reduce the requirement of client buffer. The buffer space that each client requires is less than 15 percent of the entire video. In addition, our scheme also provides a flexible platform for developing the feature named channel adjustment. With channel adjustment, each client can request a video with different number of channels depending on its available bandwidth. Allocating more channels implies less initial delay and less buffer requirement. We do not actually modify the playback quality but still can provide different services for heterogeneous clients.
References 1. S. Viswanathan and T. Imielinski, "Metropolitan area video-on-demand service using pyramid broadcasting," Multimedia Systems, vol. 4(4), pp. 197-208, August 1996. 2. C. C. Aggarwal, J. L. Wolf, and P. S. Yu, “A permutation-based pyramid broadcasting scheme for video-on-demand systems,” in Proc. IEEE Int.Conf. Multimedia Computing and Systems, pp. 118–126, June 1996. 3. L.-S. Juhn and L.-M. Tseng, “Harmonic broadcasting for video-on-demand service,” IEEE Transactions on Broadcasting, vol. 43, pp. 268–271, Sept. 1997. 4. L.-S. Juhn and L.-M. Tseng, “Enhanced harmonic data broadcasting and receiving scheme for popular video service,” IEEE Trans. Consumer Electronics, vol. 44, no. 4, pp.343-346, May 1998. 5. L.-S. Juhn and L.-M. Tseng, “Staircase data broadcasting and receiving scheme for hot video service,” IEEE Trans. Consumer Electronics, vol. 43, no. 4, pp.1110-1117, Nov. 1997 6. K. A. Hua and S. Sheu, “Skyscraper broadcasting: A new broadcasting scheme for metropolitan video-on-demand,” ACM SIGCOMM, Sept. 1997 7. L.-S. Juhn and L.-M. Tseng, “Fast data broadcasting and receiving scheme for popular video service,” IEEE Trans. Broadcasting, vol. 44, no. 1, pp. 100-105, Mar 1998. 8. L.-S. Juhn and L.-M. Tseng, “Adaptive fast data broadcasting scheme for video-on-demand service,” IEEE Trans. Broadcasting, vol. 44, no. 2, pp. 182-185, June 1998. 9. S. McCanne, V. Jacobson, and M. Vetterli, ”Receiver-driven Layered Multicast,” Proceeding of ACM SIGCOMM ’96, Aug. 1996 10. M. Kawada, H. Morikawa, T. Aoyama, “Cooperative inter-stream rate control scheme for layered multicast,” Applications and the Internet, Proceedings. Symposium on, 2001, pp. 147 -154
Video Object Hyper-Links for Streaming Applications Daniel Gatica-Perez1 , Zhi Zhou1 , Ming-Ting Sun1 , and Vincent Hsu2 1
Department of Electrical Engineering, University of Washington Seattle, WA 98195 USA 2 CCL/ITRI Taiwan
Abstract. In video streaming applications, people usually rely on the traditional VCR functionalities to reach segments of interest. However, in many situations, the focus of the people are particular objects. Video object (VO) hyper-linking, i.e., the creation of non-sequential links between video segments where an object of interest appears, constitutes a highly desirable browsing feature that extends the traditional video structure representation. In this paper we present an approach for VO hyper-linking generation based on video structuring, definition of objects of interest, and automatic object localization in the video structure. We also discussed its use in a video streaming platform to provide objectbased VCR functionalities.
1
Introduction
Due to the vast amount of video contents, effective video browsing and retrieval tools are critical for the success of multimedia applications. In current video streaming applications, people usually rely on VCR functionalities (fast-forward, fast-backward, and random-access) to access segments of video of interest. However, in many situations, the ultimate level of desired access is the object. For browsing, people may like to jump to the next “object of interest” or fastforward but only display those scenes involving the “object of interest”. For retrieval, users may like to find an object in a sequence, or to find a video sequence containing certain video objects. The development of such non-sequential, content-based access tools has a direct impact on digital libraries, amateur and professional content-generation, and media delivery applications [8]. VO hyper-linking constitutes a desirable feature that extends the traditional video structure representation, and some schemes for their generation have been recently proposed [5], [2], [13]. Such approaches follow a segmentation and region matching paradigm, based on (1) the extration of salient regions (in terms of color, motion or depth) from each scene depicted in a video shot, (2) the representation of such regions by a set of features, and (3) the search for correspondences among region features in all the shots that compose a video clip. In particular, the work in [2] generates hyper-links for moving objects, and the work in [13] does so for depth-layered regions in stereoscopic video. In [9], face S.-K. Chang, Z. Chen, and S.-Y. Lee (Eds.): VISUAL 2002, LNCS 2314, pp. 229–238, 2002. c Springer-Verlag Berlin Heidelberg 2002
230
Daniel Gatica-Perez et al.
Fig. 1. Video Tree Structure. The root, intermediate, and column leaf nodes of the tree represent the video clip, the clusters, and the shots, respectively. Each image on a column leaf corresponds to frames extracted from each subshot.
detection algorithms [15] were used to generate video hyper-links of faces. However, in spite of the current progress [12], automatic segmentation of arbitrary objects continues to be an open problem. In this paper, we present an approach for VO hyper-linking generation, and discuss its application for video streaming with object-based VCR functionalities. After video structure creation, hyper-links are generated by object definition, and automatic object localization in the video structure. The object localization algorithm first extracts parametric and non-parametric color models of the object, and then searches in a configuration space for the instance that is the most similar to the object model, allowing for detection of non-rigid objects in presence of partial occlusion, and camera motion. As part of a video streaming platform, users can define objects, and then fast-forward, fast-reverse, or random-access based on the object defined. The paper is organized as follows. Section 2 discusses the VO hyper-linking generation approach. Results are described in Section 3. Section 4 describes a streaming video platform with support for object-based VCR functionalities. Section 5 provides some concluding remarks.
2 2.1
VO Hyper-link Generation Video Structure Generation
A summarized video structure or Table of Contents (TOC) (Fig. 1), consisting of representative frames extracted from video, cluster, shot, and subshot levels, is generated with the algorithms described in [6]. The TOC reduces the number of frames where the object of interest will be searched to a manageable number. Users can specify objects of interest to generate hyper-links, by drawing a bounding box on any representative frame.
Video Object Hyper-Links for Streaming Applications
2.2
231
Object Localization as Deterministic Search
Object localization constitutes a fundamental problem in computer vision [15], [10], [18], [16], [3]. In pattern theory terms [7], [16], given a template (the image ¯ ¯ ⊂ R2 , any other image I(x) that contains the of an object) I(x) with support D 2 object (with support D ⊂ R ) can be considered as generated from the template I¯ by a transformation TX of the template into the image, ¯ ¯ I(x) = I(TX (x)), x ∈ D,
(1)
where TX is parameterized by X over a configuration space X . In practice, Eq. 1 becomes only an approximation, due to modeling errors, noise, etc. In a deterministic formulation, localizing the template in a scene consists of finding ˆ ∈ X that minimizes a similarity measure d(·), the configuration X ˆ = arg min dX = arg min d(I(TX (x), I(x)). ¯ X X∈X
X∈X
(2)
We represent the outlines of objects by bounding boxes, and restrict the configuration space X to a quantized subspace of the planar affine transformation space, with three degrees of freedom that model translation and scaling. While far from representing complex object shapes and motions, the simplified X is useful to locate targets. The interior of an object could be approximately transformed by pixel interpolation using the scale parameter. Alternatively, one can define a similarity measure that depends not directly on the images, but on image representations that are both translation and scale invariant, so ˆ = arg min d(f (I(TX (x)), f (I(x))). ¯ X X∈X
(3)
With this formulation, the issues to define are f , d, the search strategy, and a mechanism to declare when the objects is not present in the scene. 2.3
Reducing the Search Space with Color Likelihood Ratios
Pixel-wise classification based on parametric models of object/background color distributions has been used for image segmentation [1] and tracking [14]. We use such representation to guide the search process. In the representative frames from which the object is to be searched, let y represent an observed color feature vector for a given pixel x. Given a single foreground object, the distribution of y for such frame is a mixture p(y|Θ) =
p(Oi )p(y|Oi , θi ),
(4)
i∈{F,B}
where F and B stand for foreground and background, p(Oi ) is the prior probability of pixel x belonging to object Oi ( i p(Oi ) = 1), and p(y|Oi , θi ) is the
232
Daniel Gatica-Perez et al.
a
b
c
Fig. 2. Extraction of candidate configurations. Dancing Girls sequence. (a) Frames extracted from the video clips (the object has been defined by a bounding box). (b) Log-likelihood ratio image for learned foreground and background color models. Lighter gray tones indicate higher probability of a pixel to belong to the object. (c) Binarized image after decision. White regions will be used to generate candidate configurations.
conditional pdf of observations given object Oi , parameterized by θi (Θ = {θi }). Each conditional pdf is in turn modeled with a Gaussian mixture [11], p(y|Oi , θi ) =
M
p(wj )p(y|wj , θij ),
(5)
j=1
where p(wj ) denotes the prior probability of the j-th component, and the conditional p(y|wj , θij ) = N (µij , Σij ) is a multivariate Gaussian with full covariance matrix. In absence of prior knowledge p(OF ) = p(OB ), and Bayesian decision theory establishes that each pixel can be optimally associated (in the MAP sense) to foreground or background by evaluating the likelihood ratio p(y|OF , θF ) H>F 1 p(y|OB , θB ) H
(6)
The likelihood functions are on-line estimated using the Expectation- Maximization (EM) algorithm, the standard procedure for Maximum Likelihood parameter estimation [11]. Additionally, model selection is automatically estimated using the Minimum Description Length (MDL) principle. RGB models are estimated when a new object is defined, and then applied to the set of representative frames in the video summary. An example is shown in Fig. 2. Only those pixels whose colors match the object color distribution are chosen as candidate search configurations. Finally, as the background color distribution is likely to change from shot to shot (possibly rendering low values
Video Object Hyper-Links for Streaming Applications
233
for p(y|OB , θB )) probabilities are thresholded to ensure that candidate configurations truly correspond to object colors. 2.4
Localization Using Bhattacharyya Coefficient
We use the color pdf of the interior of the configuration X ∈ X as the function ¯ and f (IX ) denote the color pdfs of the object and the f (·) in Eq. 3. Let f (I) configuration X, respectively. As discussed in [4], measuring similarity among two distributions can be defined as maximizing the Bayes error associated with them. The Bhattacharyya coefficient is a measure related to the Bayes error defined by ¯ f (IX )) = (f (I(x))f ¯ ρX = ρ(f (I), (IX (x)))1/2 dx (7) and can be used to define a metric ¯ fˆ(IX )))1/2 dX = (1 − ρ(fˆ(I),
(8)
when the pdfs f (·) are represented by discrete densities fˆ(·). The discrete pdfs for model and candidate configuration are directly estimated by normalizing color histograms (3-D RGB, 8 × 8 × 8 bins). Except for quantization effects, this color discrete density estimate is translation and scale invariant, unlike other representations, like color coocurrence histograms [3], which are translation invariant but not scale-invariant. In the search, the translation component is quantized by a factor of 4 in each direction, and the scaling component is quantized to 5 different scales ranging between 0.5 and 2. If a whole QSIF image was to be searched, the number possible configurations would be 6600. We only search those positions with high likelihood as indicated in white regions in Fig. 2(c). Finally, the decision on the presence of the object is based on thresholding of the Bhattacharyya coefficient. 2.5
Video Hyper-link Generation
Hyper-links are constructed based on object detection/absence for each shot. If links are desired to the subshot level, the described object localization has to be applied on each of the leave frames in the TOC. Video browsing will occur by displaying the subshots for which the object was localized. Alternatively, hyperlinks could be required only at higher levels of the hierarchy (shot, cluster). In that case, the object localization algorithm processes subshot frame leaves until it detects an object, and then jumps to the next shot or cluster, thus requiring less processing in average.
3
Results
Fig. 3 illustrates the results obtained in the Girls video, captured with a moving hand-held camera. One can observe that the algorithm has been able to detect
234
Daniel Gatica-Perez et al.
a
b
c Fig. 3. Object localization. Girls video sequence. (a), (b) and (c) illustrate the object localization process for three different user-defined video objects.
the user-specified objects correctly, in presence of partial occlusion and change of size. Another detection example is shown in Fig. 4. We observe that detection of the object of interest has been correct, but other regions whose features can
Video Object Hyper-Links for Streaming Applications
235
Fig. 4. Object Localization. Wedding video sequence.
Fig. 5. VO hyper-link generation. The frames where the object has been detected are highlighted.
not be discriminated as different are incorrectly labeled as object. Several issues are currently under study for object localization improvement, including the use of illumination-invariant object color models, the use of additional features, and the definition of a decision mechanism based on probability models of positive and negative examples. Hyper-links are created, and the leaves in the TOC that contain the object are highlighted in the GUI, as shown in Fig. 5, allowing for fast browsing in the video structure besides the capability for video playing. The computational complexity is dependent on object size. In the current implementation without any optimization, it takes five seconds to search among 3000 configurations per
236
Daniel Gatica-Perez et al.
Fig. 6. Block diagram of streaming video system.
QSIF image, on a Pentium III, 600 MHz PC. By off-line generation of the main objects in a video clip, the system can provide real time object-based browsing capabilities.
4
A Streaming Video System Supporting Table of Contents and Object-Based VCR Functionalities
A block diagram of a streaming video system is shown in Fig. 6. The system has a typical Server/Client structure. The video sequences are encoded in MPEG-4 and stored in the server with the associated metadata files. The system supports the conventional VCR functionalities such as Play, Pause, Random Access, Step Forward, Fast Forward, and Fast Reverse, plus the video TOC. The VCR functionalities are implemented as discussed in [10]. For simple implementation, we use I-pictures for random access, fast-forward, and fast-reverse. We are incorporating the object-based VCR functionalities into the system. In actual applications, the client connects to the remote server over an IP network and selects the video stream of interest. Two types of logical channels are established between the server and the client: the control channel and the data channel. The TOC and the VCR commands are transmitted in the control channel, while the video packets are transmitted in the data channel. The server sends the TOC of the requested video sequence to the client. The TOC, containing clusters of the key frames of the sequence, is displayed as shown in Fig. 7. The client can choose to play from the beginning of the video sequence, or click on a frame in the TOC to start playing from that particular segment. The VCR and Hyperlinking Manager receives the commands and retrieves the corresponding part of the video sequence, which is then sent by the Stream Manager to the client for decoding and displaying. The key frames in the TOC are mapped to the closest I-pictures to allow easy decoding. During the play of the video, the user can use the conventional VCR functionalities (e.g. fast-forward, fast-reverse) to manipulate the play of the video. The user can also stop the video and jump to another key frame of interest in the TOC. With the incorporation of the object-based VCR functionalities, the user will be able to stop the video at any frame, define an object of interest in the frame, and use the object-based VCR functionalities through the support of the automatically generated VO hyper-links.
Fig. 7. Streaming video with VCR functionalities and Table Of Contents
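To make the key-frame-to-I-picture mapping concrete, the following Python sketch maps a requested TOC key frame (given as a frame index) to the nearest random-access point. It is only an illustration of the idea described above; the function name and the assumption that the server keeps a sorted list of I-picture indices are ours, not details of the described system.

import bisect

def nearest_i_picture(key_frame_index, i_picture_indices):
    """Map a TOC key frame to the closest I-picture so that the client can
    start decoding without needing earlier reference pictures.

    i_picture_indices: sorted list of frame indices encoded as I-pictures
    (assumed here to be recorded by the server at encoding time).
    """
    pos = bisect.bisect_left(i_picture_indices, key_frame_index)
    candidates = []
    if pos > 0:
        candidates.append(i_picture_indices[pos - 1])
    if pos < len(i_picture_indices):
        candidates.append(i_picture_indices[pos])
    # Choose the random-access point with the smallest temporal distance.
    return min(candidates, key=lambda i: abs(i - key_frame_index))

# Example: with an I-picture every 15 frames, key frame 143 maps to frame 150.
print(nearest_i_picture(143, list(range(0, 300, 15))))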
5
Conclusions
We have presented a methodology to create video object hyper-links for object-based video streaming applications. Although the obtained results are encouraging, we acknowledge that object localization is a hard problem, and current efforts are directed at improving discrimination. We have implemented a streaming video system with Table of Contents and VCR functionality support, and are incorporating the object-based VCR functionality features into the system.
Acknowledgements
The video sequences used in this study belong to the Eastman-Kodak Home Video Database.
References
1. S. Belongie, C. Carson, H. Greenspan, and J. Malik, “Color and Texture Image Segmentation Using the Expectation-Maximization Algorithm and Its Application to Content-Based Image Retrieval,” in Proc. IEEE Int. Conf. Comp. Vis., Bombay, Jan. 1998.
2. P. Bouthemy, Y. Dufournaud, R. Fablet, R. Mohr, S. Peleg, and A. Zomet, “Video Hyper-links Creation for Content-Based Browsing and Navigation,” in Proc. Workshop on Content-Based Multimedia Indexing, Toulouse, France, October 1999.
3. P. Chang and J. Krumm, “Object Recognition with Color Coocurrence Histograms,” in Proc. IEEE Int. Conf. on CVPR, Fort Collins, CO, June 1998.
4. D. Comaniciu, V. Ramesh, and P. Meer, “Real-Time Tracking of Non-Rigid Objects using Mean Shift,” in Proc. IEEE Conf. on Comp. Vis. and Patt. Rec., Hilton Head Island, S.C., June 2000.
5. Y. Deng and B. S. Manjunath, “Netra-V: Toward an Object-Based Video Representation,” IEEE Trans. on CSVT, Vol. 8, No. 5, pp. 616-627, Sep. 1998.
6. D. Gatica-Perez, M.-T. Sun, and A. Loui, “Consumer Video Structuring by Probabilistic Merging of Video Segments,” in Proc. IEEE Int. Conf. on Multimedia and Expo, Tokyo, Aug. 2001.
7. U. Grenander, Lectures in Pattern Theory, Springer, 1976-1981.
8. C.W. Lin, J. Zhou, J. Youn, and M.T. Sun, “MPEG Video Streaming with VCR Functionality,” IEEE Trans. on CSVT, Vol. 11, No. 3, pp. 415-425, Mar. 2001.
9. W.-Y. Ma and H.J. Zhang, “An Indexing and Browsing System for Home Video,” in Proc. EUSIPCO, European Conference on Signal Processing, Patras, Greece, 2000, pp. 131-134.
10. J. MacCormick and A. Blake, “A probabilistic contour discriminant for object localisation,” in Proc. IEEE Int. Conf. Computer Vision, pp. 390-395, 1998.
11. G.J. MacLachlan and D. Peel, Finite Mixture Models, John Wiley and Sons, N.Y., 2000.
12. M. Meila and J. Shi, “A random walks view of spectral segmentation,” in Proc. Eighth Int. Workshop on AI and Stats, Jan. 2001.
13. K. Ntalianis, A. Doulamis, N. Doulamis, and S. Kollias, “Non-Sequential Video Structuring Based on Video Object Linking: An Efficient Tool for Video Browsing and Indexing,” in Proc. IEEE Int. Conf. Image Processing, Thessaloniki, Greece, October 2001.
14. Y. Raja, S. McKenna, and S. Gong, “Colour Model Selection and Adaptation in Dynamic Scenes,” in Proc. ECCV, 1998.
15. H. Rowley, S. Baluja, and T. Kanade, “Human Face Detection in Visual Scenes,” Tech. report CMU-CS-95-158R, Computer Science Department, Carnegie Mellon University, November 1995.
16. J. Sullivan, A. Blake, M. Isard, and J. MacCormick, “Object Localization by Bayesian Correlation,” in Proc. IEEE Int. Conf. Computer Vision, pp. 1068-1075, 1999.
17. H.J. Zhang, “Content-based Video Browsing and Retrieval,” in B. Fuhrt, Ed., Handbook of Multimedia Computing, CRC Press, Boca Raton, 1999, pp. 255-280.
18. Y. Zhong and A. K. Jain, “Object Localization Using Color, Texture and Shape,” in Proc. Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Venice, pp. 279-294, May 1997.
Scalable Hierarchical Summarization of News Using Fidelity in MPEG-7 Description Scheme
Jung-Rim Kim, Seong Soo Chun, Seok-jin Oh, and Sanghoon Sull
School of Electrical Engineering, Korea University, 1 Anam-dong 5ga, Songbuk-gu, Seoul, Korea
{jrkim,sschun,osj,sull}@mpeg.korea.ac.kr
Abstract. The notion of fidelity is an attribute in MPEG-7 FDIS (Final Draft International Standard [1]) that can be used for scalable hierarchical summarization and search [2]. The fidelity is the information on how well a parent key frame represents its child key frames in a tree-structured key frame hierarchy [1-5]. The use of fidelity was demonstrated for scalable hierarchical summarization [2] based on low-level features such as color, but temporal information was not used. The content of videos such as news and golf is temporally well structured, and it is desirable to utilize such information. In this paper, we demonstrate the use of fidelity for the summarization of well-structured news by using temporal information as well as low-level features.
1
Introduction
Network speeds and bandwidths continue to grow, so there are many more opportunities to access multimedia data. Although improvements are being made on the Internet, the size of multimedia data is often too large to deliver in full. Because of this problem, the study of multimedia compression, transfer, and indexing has become important. Also, since the amount of multimedia data increases quickly, it is necessary to be able to search and access/navigate it easily. Content-based retrieval has been extensively researched through various indexing schemes, but research on multimedia access is still insufficient and under development. One of the useful methods for access and navigation of multimedia content is summarization. Summarization helps us understand the whole content of a video by showing a set of key frames/clips representing the whole video. This functionality is especially useful since the size of a video is very large in general and thus users might not want to spend much time watching the whole video. Furthermore, it might be difficult to deliver the whole video with limited bandwidth. Therefore, there is a need for scalable summarization schemes, and the MPEG-7 MDS (Multimedia Description Scheme) has been developed to provide such schemes.
Among a variety of video contents, news content is typically structured by time or by topics and thus can be hierarchically well described by the scalable summarization scheme for easy browsing and summarization. The scalable description allows users to select the parts of the news video depending upon their preference or the available bandwidth. In this paper, we describe the notion of fidelity in the MPEG-7 Summarization Description Scheme (DS) and propose an efficient method for scalable hierarchical news video summarization based on the MPEG-7 DS. This paper is organized as follows. Section 2 introduces related work and the notion of fidelity. Section 3 describes the proposed notion of fidelity and the algorithm for news summarization, and Section 4 demonstrates the experimental results for scalable summarization of news. Finally, Section 5 provides the conclusions of the paper.
2
Related Work and Fidelity
2.1
Related Work
Recently, there have been several approaches to video summarization [6-9]. D. DeMenthon et al. [6] proposed scalable summarization using curve simplification. They developed a method for summarizing video by splitting a trajectory curve in the high-dimensional feature space for the key frames. Y. Gong et al. [7] proposed an optimal video summarization algorithm using the singular value decomposition of the feature vectors. S. Uchihashi et al. [8] introduced a video summarizing scheme using shot importance. As the shot length becomes longer, the importance of the shot is assumed to be larger, and as it becomes shorter, its importance diminishes. Mark T. Maybury et al. [9] showed summarization of broadcast news using audio, visual information and closed-captioned text. They summarized news video by key frame selection through the audio and video correlation, and annotated the summarized frames using closed-captioned text. MPEG-7 also provides video summarization schemes in the MDS. The multimedia content description in MPEG-7 is divided into two parts. One part is the description of the structural aspects of the content, which describes the audio-visual content from the viewpoint of its structure. It represents the spatial, temporal or spatiotemporal structure of the audio-visual content and can be described on the basis of perceptual features using MPEG-7 Descriptors for color, texture, shape, motion and audio features, and semantic information using Textual Annotations. The other part is the description of the conceptual aspects of the content, which describes the audio-visual content from the viewpoint of real-world semantics and conceptual notions. It involves entities such as objects, events, abstract concepts and relationships. Based on such descriptions, MPEG-7 gives description schemes for navigation and access of multimedia content that facilitate browsing and retrieval of audio-visual content by defining summaries, partitions and decompositions, and variations of audio-visual material. A brief explanation of fidelity in the MPEG-7 description schemes is given in the following section, to be used for scalable hierarchical summarization.
2.2
Fidelity
The fidelity is the information on how well a parent key frame represents its child key frames in a tree-structured key frame hierarchy [1-5]. We can construct the tree-structured hierarchy of key frames shown in Fig. 1 based on the relationship between a parent key frame and its children using fidelity. The definition of the fidelity e_α of a node α having the parent node p_α is proposed in [2] as

e_α = 1 − max_{x ∈ T_α} d(p_α, x),    (1)

where d(·) denotes a normalized distance/dissimilarity from 0 to 1 and T_α is the hierarchy rooted at the node α.
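To make equation (1) concrete, here is a minimal Python sketch of the fidelity computation; the tree representation (a dict of child lists) and the distance function are illustrative assumptions, not the data structures used by the authors.

def subtree_nodes(tree, node):
    """Collect all nodes of the hierarchy T_alpha rooted at `node`."""
    nodes = [node]
    for child in tree.get(node, []):
        nodes.extend(subtree_nodes(tree, child))
    return nodes

def fidelity(tree, parent, node, distance):
    """e_alpha = 1 - max over x in T_alpha of d(p_alpha, x), cf. equation (1).

    `distance` is any dissimilarity between two key frames, normalized to [0, 1].
    """
    return 1.0 - max(distance(parent, x) for x in subtree_nodes(tree, node))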
Fig. 1. An Example of the Key Frame Hierarchy with Fidelity (root A at level 0; nodes B, C, D with fidelities e_B, e_C, e_D at level 1; nodes E-L with fidelities e_E-e_L at level 2)
3
Scalable Hierarchical Summarization of News Using Fidelity
In this section, we describe the use of fidelity, based on both the low-level features and the temporal information of news, to construct the key frame hierarchy.
3.1
Scalable Hierarchical Algorithm Using Fidelity
The scalable summarization algorithm using fidelity is a max-cut finding algorithm proposed in [2], [5]. This algorithm maximizes the minimum edge cost cut by the cut-line in the hierarchy, so that the fidelity after splitting the hierarchy becomes maximal. It can be summarized as follows: the root node is inserted into the summarization set K first, and then a node β not in K with the minimum fidelity on an edge between itself and a node α in K is inserted into K. The inner loop of the algorithm is repeated until the number of elements in K becomes equal to the number specified by the user.
add root_node to K
while card(K) < n {
    let <α, β> be a least cost edge such that α ∈ K and β ∉ K
    add β to K
}
Fig. 2. Scalable Hierarchical Summarization Algorithm
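The loop in Fig. 2 can be sketched in a few lines of Python; the hierarchy is assumed to be given as parent-to-children links with a precomputed fidelity per node (its edge cost to the parent), which is an assumption made here for illustration.

def summarize(children, fidelities, root, n):
    """Greedy max-cut selection: repeatedly add the node outside K that is
    reached from K by the least-cost (lowest-fidelity) edge, cf. Fig. 2.

    children:   dict mapping a node to its list of child nodes
    fidelities: dict mapping a node to its fidelity (edge cost to its parent)
    """
    K = [root]
    while len(K) < n:
        # Candidate edges go from a node in K to one of its children not yet in K.
        frontier = [c for node in K for c in children.get(node, []) if c not in K]
        beta = min(frontier, key=lambda c: fidelities[c])
        K.append(beta)
    return K

# Example matching the text: with e_B < e_C < e_D, a 2-frame summary is [A, B].
children = {"A": ["B", "C", "D"]}
fidelities = {"B": 0.2, "C": 0.5, "D": 0.7}
print(summarize(children, fidelities, "A", 2))  # ['A', 'B']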
For example, suppose e_B < e_C < e_D in Fig. 1. By the max-cut algorithm in Fig. 2, we can choose two nodes that represent the whole hierarchy with maximal fidelity. At first, the root node, A, is selected, and then we have the chance to choose one of the three nodes B, C, and D. By the algorithm, we should select node B because of the above condition e_B < e_C < e_D. Then, we obtain two hierarchies rooted at A and B, maximizing the fidelity of the hierarchies.
3.2
Hierarchy for News Summarization
Sull et al. proposed a key frame hierarchy with fidelity using only low-level features of the key frames based on equation (1) [2]. However, since such low-level features cannot represent the conceptual aspect of the contents well, the result sometimes does not represent a semantically meaningful summarization. Structured news content often carries a degree of importance related to time: the news content is structured as two major parts, composed of an anchor shot, where one or two anchors report events, and the event shots between two successive anchor shots. Furthermore, the headline news shown at the beginning is important relative to the upcoming news stories shown thereafter. In general, the importance of news decreases as it approaches the end of the news. Taking both anchor shots and their time into account, we can show the contents or information of the news more effectively. Instead of equation (1), we propose a fidelity e′_α at a node α as follows:

e′_α = w e_α + (1 − w) e_α(t),    (2)

where w is the weight for the fidelity based on the low-level feature, in the range from 0 to 1, and e_α(t) is the temporal fidelity at a node α given by the time position/code or by the temporal distribution between a parent frame and its descendants as

e_α(t) = α(t) / τ,    (3)

where α(t) is the time code of α and τ is the total time length of the video content. The e_α(t) increases as the temporal position of α increases, allowing us to obtain temporally earlier nodes first when applying the summarization algorithm shown in Fig. 2. Also, in [2] the temporal order was not considered for clustering, and thus the temporal relationship between a parent key frame and its child key frames was ignored.
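A short Python sketch of equations (2) and (3) follows; the argument names and the convention of expressing the time code and total length in frames are assumptions made for illustration.

def temporal_fidelity(time_code, total_length):
    """e_alpha(t) = alpha(t) / tau, cf. equation (3)."""
    return time_code / float(total_length)

def combined_fidelity(low_level_fidelity, time_code, total_length, w=0.2):
    """e'_alpha = w * e_alpha + (1 - w) * e_alpha(t), cf. equation (2).

    w = 0.2 mirrors the experimentally chosen weight mentioned in Section 4.1;
    w = 0 reduces to the purely temporal fidelity used for the anchor shots.
    """
    return w * low_level_fidelity + (1.0 - w) * temporal_fidelity(time_code, total_length)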
For example, the key frames in the bottom level consist of three shots, {E, F}, {G, H, I, J}, and {K, L}, and their parent key frames are B, C, and D, respectively, as shown in Fig. 1. The key frame B represents I, though there is little temporal relationship between them. If this scheme is applied to news content, the anchor shots, whose low-level features are almost the same, will be classified into the same cluster. For an effective summarization, it is desirable to initially extract the anchor shots and the event shots between two successive anchor shots, and then construct the key frame hierarchy. The key frame hierarchy is typically constructed by a 4-level bottom-up method. The bottom level, level 3, consists of key frames, and level 2 consists of the anchor frames and the key event frames representing the event frames between two successive anchor frames. The key event frames are positioned at level 1 using only the temporal fidelity, and the key frame that represents the whole video becomes the root. The overall algorithm is shown in Fig. 3, and Fig. 4 is an example hierarchy based on this algorithm.
Detect shots.
Extract key frames in each shot using the low-level feature vector (level 3).
Separate the anchor shots and the event shots.
Cluster the successive anchor and event shots, respectively, using the algorithm proposed in [2] (level 2).
Extract all event frames from level 2 (level 1).
Set the key frame (for example, the title frame) to the root key frame of the hierarchy (level 0).
Fig. 3. An algorithm for News Summarization
Fig. 4. An Example of the Key Frame Hierarchy of News (An: Anchor shot, En: Event shot)
4
Experimental Results
In this section, we describe the experimental results for the key frame hierarchy and the summarization of news using our proposed algorithm.
4.1
Key Frame Hierarchy
In our current implementation, we use the DC luminance projection introduced in [10] as the low-level feature vector for key frame extraction, anchor shot detection, and clustering. The luminance projections (l_n^r, l_m^c) of the nth row and the mth column of an M×N DC image f are, respectively,

l_n^r(f) = Σ_{m=1}^{M} Lum{f(m, n)},    l_m^c(f) = Σ_{n=1}^{N} Lum{f(m, n)}.    (4)

The distance/dissimilarity function, normalized to [0, 1], is also defined as

d(f_i, f_j) = (1/K) ( Σ_{n=1}^{N} |l_n^r(f_i) − l_n^r(f_j)| + Σ_{m=1}^{M} |l_m^c(f_i) − l_m^c(f_j)| ),    (5)

where K is a normalizing constant. Using the above feature vector and distance/dissimilarity function, we detect shot boundaries and extract the key frame set R satisfying the following condition:

R = { f_i ∈ S | d(f_i, f_{i−1}) ≤ ε_k, i = 1, 2, 3, ... },    (6)

where S is the whole video frame set and ε_k is the distortion threshold used to extract shot boundaries. We also apply equation (7) to detect a set A of anchor frames f_k in R:

A = { f_k ∈ R | d(f_a, f_k) ≤ ε_a },    (7)

where f_a is the reference anchor frame that a user selects and ε_a is the distortion threshold used to detect anchor frames. Since the fidelity based on the low-level feature for each anchor shot is almost 1, it is meaningless. So we applied only the temporal fidelity, i.e. w = 0, to the fidelity of the anchor shots, and we experimentally set w to 0.2 or a smaller value in equation (2) to construct a hierarchy for the event frames, such that the low-level feature does not affect the temporal order too much. The experimental results applied to two videos are shown in Table 1.
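The following sketch mirrors equations (4), (5) and (7) on DC images stored as NumPy arrays; the array indexing convention, the normalizing constant and the threshold value are assumptions for illustration only.

import numpy as np

def luminance_projections(dc_image):
    """Row and column luminance projections of a DC image, cf. equation (4).

    dc_image is assumed to be an array of luminance DC values indexed [n, m]
    (row n, column m), so rows sum over the M columns and columns over the N rows.
    """
    rows = dc_image.sum(axis=1)   # l^r_n(f)
    cols = dc_image.sum(axis=0)   # l^c_m(f)
    return rows, cols

def projection_distance(f_i, f_j, K):
    """Normalized distance/dissimilarity between two DC images, cf. equation (5)."""
    r_i, c_i = luminance_projections(f_i)
    r_j, c_j = luminance_projections(f_j)
    return (np.abs(r_i - r_j).sum() + np.abs(c_i - c_j).sum()) / K

def detect_anchor_frames(key_frames, reference_anchor, eps_a, K):
    """Key frames close to a user-selected reference anchor frame, cf. equation (7)."""
    return [f for f in key_frames
            if projection_distance(reference_anchor, f, K) <= eps_a]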
4.2
Summarization
Using the algorithm shown in Fig. 2 [2], we summarize two news videos by using 9 and 18 frames. Figures 5 (a) and (b) are the experimental results from the key frame hierarchy using only the low-level feature, and Fig. 5 (c) and (d) show the results using the low-level feature as well as temporal information. As shown in Fig. 5 (a) and (b), the 9-frame summarization and the 18-frame summarization do not show a good semantic relationship between them, but in Fig. 5 (c) and (d), the 9-frame summarization gives a storyboard consisting purely of events, and in the 18-frame summarization result the anchor frames appear between almost all event frames.
Table 1. Video Contents and Their Frames in each Level

Video   Length    Level 0   Level 1   Level 2   Level 3
News1   27m 38s   1         10        19        210
News2   27m 36s   1         9         18        138
News3   27m 39s   1         10        19        154
Fig. 5. Summarization results from the key frame hierarchy using fidelity. (a) 9-frame summarization of News1 based on low-level feature; (b) 18-frame summarization of News1 based on low-level feature; (c) 9-frame summarization of News1 based on low-level feature and temporal information; (d) 18-frame summarization of News1 based on low-level feature and temporal information
5
Conclusion
In this paper, we described the use of fidelity in the MPEG-7 MDS for scalable hierarchical summarization of news. Based on a fidelity that combines both low-level features and temporal information, we constructed a semantically meaningful key frame hierarchy consisting of anchor and event frames, demonstrating the feasibility of our approach.
References
1. ISO/IEC 15938-5 FDIS Information Technology -- Multimedia Content Description Interface - Part 5 Multimedia Description Schemes. ISO/IEC JTC1/SC29/WG11 N4206 (2001)
2. Sull, S., Kim, J.-R., Kim, Y., Chang, H.S., Lee, S.U.: Scalable hierarchical video summary and search. Proceedings of SPIE 2001, Vol. 4315, Storage and Retrieval for Media Database 2001, San Jose (2001) 553-561
3. Overview of the MPEG-7 standard. ISO/IEC JTC1/SC29/WG11 N4031, Singapore (2001)
4. Efficient and effective search and browsing using fidelity. ISO/IEC JTC1/SC29/WG11 M5101, La Baule (1999)
5. Improved notion of the fidelity for efficient browsing. ISO/IEC JTC1/SC29/WG11 M5442, Maui (1999)
6. DeMenthon, D., Kobla, V., Doermann, D.: Video summarization by curve simplification. Proceedings of ACM International Conference on Multimedia (1998) 211-218
7. Gong, Y., Liu, X.: Generating optimal video summaries. Proceedings of IEEE International Conference on Multimedia and Expo 2000, Vol. 3 (2000) 1559-1562
8. Uchihashi, S., Foote, J.: Summarizing video using a shot importance measure and a frame-packing algorithm. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 6 (1999) 3041-3044
9. Maybury, M. T., Merlino, A. E.: Multimedia summaries of broadcast news. Proceedings of Intelligent Information Systems (1997) 442-449
10. Chang, H. S., Sull, S., Lee, S. U.: Efficient video indexing scheme for content-based retrieval. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8 (1999) 1269-1279
MPEG-7 Descriptors in Content-Based Image Retrieval with PicSOM System
Markus Koskela, Jorma Laaksonen, and Erkki Oja
Laboratory of Computer and Information Science, Helsinki University of Technology, P.O. BOX 5400, 02015 HUT, Finland
{markus.koskela,jorma.laaksonen,erkki.oja}@hut.fi
Abstract. The MPEG-7 standard is emerging as both a general framework for content description and a collection of specific, agreed-upon content descriptors. We have developed a neural, self-organizing technique for content-based image retrieval. In this paper, we apply the visual content descriptors provided by MPEG-7 in our PicSOM system and compare our own image indexing technique with a reference system based on vector quantization. The results of our experiments show that the MPEG-7-defined content descriptors can be used as such in the PicSOM system even though Euclidean distance calculation, inherently used in the PicSOM system, is not optimal for all of them. Also, the results indicate that the PicSOM technique is a bit slower than the reference system in starting to find relevant images. However, when the strong relevance feedback mechanism of PicSOM begins to function, its retrieval precision exceeds that of the reference system.
1
Introduction
Content-based image retrieval (CBIR) differs from many of its neighboring research disciplines in computer vision due to one notable fact: human subjectivity cannot totally be isolated from the use and evaluation of CBIR systems. This is manifested by difficulties in setting fair comparisons between CBIR systems and in interpreting their results. These problems have hindered the researchers from doing comprehensive evaluations of different CBIR techniques. We have developed a neural-network-based CBIR system named PicSOM [1,2]. The name stems from “picture” and the Self-Organizing Map (SOM). The SOM [3] is used for unsupervised, self-organizing, and topology-preserving mapping from the image descriptor space to a two-dimensional lattice, or grid, of artificial neural units. The PicSOM system is built upon two fundamental principles of CBIR, namely query by pictorial example and relevance feedback [4]. Until now, there have not existed widely-accepted standards for description of the visual contents of images. MPEG-7 [5] is the first thorough attempt in this direction. The appearance of the standard will affect the research on CBIR techniques in some important aspects. First, when some common building blocks become shared by different CBIR systems, comparative studies between them
will become easier to perform. As the MPEG-7 Experimentation Model (XM) [6] has become publicly available, we have been able to test the suitability of MPEG-7-defined image content descriptors with the PicSOM system. We have thus replaced our earlier, non-standard descriptors with those defined in the MPEG-7 standard and available in XM.
2
PicSOM System
The PicSOM image retrieval system [1,2] is a framework for research on algorithms and methods for content-based image retrieval. The methodological novelty of PicSOM is to use several Self-Organizing Maps [3] in parallel for retrieving relevant images from a database. These parallel SOMs have been trained with separate data sets obtained from the image data with different feature extraction techniques. The different SOMs and their underlying feature extraction schemes impose different similarity functions on the images. Every image query is unique and each user of a CBIR system has her own transient view of image similarity and relevance. Therefore, a system structure capable of holding many simultaneous similarity representations can adapt to different kinds of retrieval tasks. In the PicSOM approach, the system is able to discover those of the parallel Self-Organizing Maps that provide the most valuable information for each individual query instance. A more detailed description of the PicSOM system and results of earlier experiments performed with it can be found in [1,2]. The PicSOM home page including a working demonstration of the system for public access is located at http://www.cis.hut.fi/picsom.
2.1
Tree Structured Self-Organizing Maps
The main image indexing method used in the PicSOM system is the SelfOrganizing Map (SOM) [3]. The SOM defines an elastic, topology-preserving grid of points that is fitted to the input space. It can thus be used to visualize multidimensional data, usually on a two-dimensional grid. The map attempts to represent all the available observations with an optimal accuracy by using a restricted set of models. Instead of the standard SOM version, PicSOM uses a special form of the algorithm, the Tree Structured Self-Organizing Map (TS-SOM) [7]. The hierarchical TS-SOM structure is useful for large SOMs in the training phase. In the standard SOM, each model vector has to be compared with the input vector in finding the best-matching unit (BMU). This makes the time complexity of the search O(n), where n is the number of SOM units. With the TS-SOM one can, however, follow the hierarchical structure and reduce the complexity of the search to O(log n). This reduction can be achieved by first training a smaller SOM and then creating a larger one below it so that the search for the BMU on the larger map is always restricted to a fixed area below the already-found BMU and its nearest neighbors on the above map.
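The hierarchical BMU search described above could be sketched in Python roughly as follows; the layer data structures, the projection of a unit onto the next level and the exact shape of the restricted search window are our assumptions for illustration (the text only states that a 10×10 area below the BMU is searched).

import numpy as np

def bmu_among(x, codebook, unit_indices):
    """Index (from unit_indices) of the model vector closest to x in Euclidean distance."""
    dists = [np.linalg.norm(x - codebook[i]) for i in unit_indices]
    return unit_indices[int(np.argmin(dists))]

def ts_som_bmu(x, layers, window=10):
    """Hierarchical BMU search: full search on the top map, restricted search below.

    layers: list of (codebook, side) pairs ordered from the 4x4 map down to the
            256x256 map; codebook has shape (side*side, dim), units in row-major order.
    window: side length of the restricted search area on each lower map
            (10 corresponds to the 10x10 area mentioned in the text).
    """
    codebook, side = layers[0]
    bmu = bmu_among(x, codebook, list(range(side * side)))
    for codebook, side in layers[1:]:
        prev_side = side // 4                       # map sides grow 4 -> 16 -> 64 -> 256
        r, c = divmod(bmu, prev_side)
        r, c = 4 * r, 4 * c                         # project the BMU onto the larger grid
        half = window // 2
        rows = range(max(0, r - half), min(side, r + half))
        cols = range(max(0, c - half), min(side, c + half))
        candidates = [i * side + j for i in rows for j in cols]
        bmu = bmu_among(x, codebook, candidates)
    return bmu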
Fig. 1. The surface of a 16×16-sized TS-SOM level trained with the MPEG-7 Edge Histogram descriptor.
In the experiments described in this paper, we have used four-level TS-SOMs whose layer sizes have been 4×4, 16×16, 64×64, 256×256 units. In the training of the lower SOM levels, the search for the BMU has been restricted to the 10×10-sized neuron area below the BMU on the above level. Every image has been used 100 times for training each of the TS-SOM levels. After training each TS-SOM hierarchical level, that level is fixed and each neural unit on it is given a visual label from the database image nearest to it. This is illustrated in Figure 1, where MPEG-7 Edge Histogram descriptor has been used as the feature. The images are the visual labels on the surface of the 16×16-sized TS-SOM layer. It can be seen that, e.g., there are many ships in the top-left corner of the map surface, standing people and dolls beside the ships, and buildings in the bottom-left corner. Visually – and also semantically – similar images have thus been mapped near each other on the map.
2.2
Self-Organizing Relevance Feedback
The relevance feedback mechanism of PicSOM, implemented by using several parallel SOMs, is a crucial element of the retrieval engine. Only a short overview is presented here, see [2] for a more comprehensive treatment.
Fig. 2. An example of how a SOM surface, on which the images selected and rejected by the user are shown with white and black marks, respectively, is convolved with a low-pass filter.
Each image seen by the user of the system is graded by her as either relevant or irrelevant. All these images and their associated relevance grades are then projected on all the SOM surfaces. This process forms on the maps areas where there are 1) many relevant images mapped in same or nearby SOM units, or 2) relevant and irrelevant images mixed, or 3) only irrelevant images, or 4) no graded images at all. Of the above cases, 1) and 3) indicate that the corresponding content descriptor agrees well with the user’s conception on the relevance of the images. Whereas, case 2) is an indication that the content descriptor cannot distinguish between relevant and irrelevant images. When we assume that similar images are located near each other on the SOM surfaces, we are motivated to spread the relevance information placed in the SOM units also to the neighboring units. This is implemented in PicSOM by low-pass filtering the map surfaces. All relevant images are first given equal positive weight inversely proportional to the number of relevant images. Likewise, irrelevant images receive negative weights that are inversely proportional to the number of irrelevant images. The overall sum of these relevance values is thus zero. The values are then summed in the BMUs of the images and the resulting sparse value fields are low-pass filtered. Figure 2 illustrates how the positive and negative responses, displayed with white and black map units, respectively, are first mapped on a SOM surface and how the responses are expanded in the convolution. Content descriptors that fail to coincide with the user’s conceptions produce lower qualification values than those descriptors that match the user’s expectations. As a consequence, the different content descriptors do not need to be explicitly weighted as the system automatically takes care of weighting their opinions. In the actual implementation, we search on each SOM for a fixed number, say 100, map locations with unseen images having the highest qualification values. After removing duplicate images, the second stage of processing is carried out. Now, the qualification values of all images in this combined set are summed up on all used SOMs to obtain the final qualification values for these images. Then, 20 images with the highest qualification values are returned as the result of the query round. In the experiments described in this paper, the queries are always started with an image that belongs to the image class in question. Therefore, we neglected
the TS-SOM hierarchy and considered exclusively the bottommost TS-SOM levels. This mode of operation is motivated by the chosen query type, since it is justifiable to start the retrieval near the initial reference image. This can be seen as depth first search. However, the hierarchical representation of the image database produced by a TS-SOM is useful in visual browsing. The successive map levels can be regarded as providing increasing resolution for database inspection. In our earlier experiments, e.g. [1,8,2], there was no initial example image to start the query with and the queries began with initial breadth first search using the visual labels and the TS-SOM structure.
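As a compact illustration of the relevance-weighting and low-pass filtering step of Section 2.2, the sketch below builds one qualification-value field for a single SOM surface; the box filter and its size are placeholders standing in for the convolution actually used in PicSOM.

import numpy as np
from scipy.ndimage import uniform_filter

def qualification_map(map_side, relevant_bmus, irrelevant_bmus, filter_size=5):
    """Build a qualification-value field on one SOM surface.

    relevant_bmus / irrelevant_bmus: (row, col) BMU coordinates of the images
    the user has graded. Positive and negative weights sum to zero overall,
    and the sparse field is then low-pass filtered to spread the relevance
    information to neighboring map units.
    """
    field = np.zeros((map_side, map_side))
    if relevant_bmus:
        for r, c in relevant_bmus:
            field[r, c] += 1.0 / len(relevant_bmus)
    if irrelevant_bmus:
        for r, c in irrelevant_bmus:
            field[r, c] -= 1.0 / len(irrelevant_bmus)
    # Low-pass filtering expands the sparse responses over the map surface
    # (a uniform box filter stands in for the convolution illustrated in Fig. 2).
    return uniform_filter(field, size=filter_size, mode="constant")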
2.3
Vector-Quantization-Based Reference Method
There exists a wide range of distinct techniques for indexing images based on their feature descriptors. One alternative method to the SOM is to first use quantization to prune the database and then utilize a more exhaustive method to decide the final images to be returned. For the first part, there exist two alternative quantization techniques, namely scalar quantization (SQ) and vector quantization (VQ). With either of these techniques, the feature vectors are divided into subsets in which the vectors resemble each other. In the case of scalar quantization the resemblance is with respect to one component of the feature vector, whereas resemblance in vector quantization means that the feature vectors are similar as a whole. In our previous experiments [8], we have found out that scalar quantization gives bad retrieval results. The justification for vector quantization in image retrieval is that unseen images which have fallen into the same quantization bins as the relevant-marked reference images are good candidates for the next reference images to be displayed to the user. Also, the SOM algorithm can be seen as a special case of vector quantization. When using the model vectors of the SOM units in vector quantization, one ignores the topological ordering provided by the map lattice and characterizes the similarity of two images only by whether they are mapped in the same VQ bin. By ignoring the topology, however, we dismiss the most significant portion of the data organization provided by the SOM. A well-known VQ method is the K-means or Linde-Buzo-Gray (LBG) vector quantization [9]. According to [8], LBG quantization yields better CBIR performance than the SOM used as a pure vector quantizer. This is understandable as the SOM algorithm can be regarded as a trade-off between two objectives, namely clustering and topological ordering. Consequently, we will use LBG quantization in the reference system of the experiments. The choice for the number of quantization bins is a significant parameter for the VQ algorithm. Using too few bins results in too broad image clusters to be useful whereas with too many bins the information about the relevancy of images fails to generalize to other images. Generally, the number of bins should be smaller than the number of neurons on the largest SOM layer of the TS-SOM. In the experiments, we have used 4096 VQ bins, which coincides with the size of the second bottommost TS-SOM levels. This results in 14.6 images per VQ
bin, on the average, for the used database of 59 995 images. Another significant parameter is the number of candidate images that are taken into consideration from each of the parallel vector quantizers. Different selection policies lead again either to breadth first or depth first searches. In our implementation, we rank the VQ bins of each quantizer in the descending order determined by the proportion of relevant images of all graded images in them. Then, we select 100 yet unseen images from the bins in that order. After the vector quantization stage, the set of potential images has been greatly reduced and more demanding processing techniques can be applied to all the remaining candidate images. Now, one possible method – also applied in our reference system – is to rank the images based on their properly-weighted cumulative distances to all already-found relevant images in the original feature space. Finally, as in the PicSOM method, we display 20 best-scoring images to the user. In [8], it was found out that the VQ method benefits from this extra processing stage. As calculating distance in a possibly very high-dimensional space is a computationally heavy operation, the vector quantization can thus be seen to act as a preprocessor which prunes a large database as much as it is necessary before the actual image similarity assessment is carried out.
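The bin-ranking step described above might look roughly like the following; the data structures, the scoring of bins with no graded images, and the candidate count are illustrative assumptions rather than details of the actual reference system.

def candidate_images_from_vq(bins, relevant, irrelevant, seen, n_candidates=100):
    """Rank VQ bins by the fraction of relevant images among the graded images
    in each bin, then collect unseen images from the bins in that order.

    bins: dict mapping a bin id to the list of image ids quantized into it.
    relevant, irrelevant, seen: sets of image ids.
    """
    def score(images):
        graded = [i for i in images if i in relevant or i in irrelevant]
        if not graded:
            return 0.0  # bins without graded images are ranked last (an assumption)
        return sum(1 for i in graded if i in relevant) / float(len(graded))

    ranked = sorted(bins.values(), key=score, reverse=True)
    candidates = []
    for bin_images in ranked:
        for image in bin_images:
            if image not in seen and image not in candidates:
                candidates.append(image)
                if len(candidates) == n_candidates:
                    return candidates
    return candidates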
3
Experiments
The performance of a CBIR system can be evaluated in many different ways. Even though the interpretation of the contents of images is always casual and ambiguous, some kind of ground truth classification of images must be performed in order to automate the evaluation process. In the simplest case – employed also here – some image classes are formed by first selecting verbal criteria for membership in a class and then assigning the corresponding Boolean membership value for each image in the database. In this manner, a set of ground truth image classes, not necessarily non-overlapping, can be formed and then used in the evaluation.
3.1
Performance Measures and Evaluation Scheme
All features can be studied separately and independently from the others for their capability to map visually similar images near each other. These kinds of feature-wise assessments, however, have severe limitations because they are not related to the operation of the entire CBIR system as a whole. In particular, they do not take any relevance feedback mechanism into account. Therefore, it is preferable to use evaluation methods based on the actual usage of the system. If the size of the database, N, is large enough, we can assume that there is an upper limit N_T of images (N_T ≪ N) the user is willing to browse. The system should thus demonstrate its talent within this number of images. In our setting, each image in a class C is “shown” to the system one at a time as an initial image to start the query with. The mission of the CBIR system is then to return as many similar images as possible. In order to obtain results that do not depend
on the particular image used in starting the iteration, the experiment needs to be repeated over every image in C. This results in a leave-one-out type of testing of the target class; the effective size of the class becomes N_C − 1 instead of N_C, and the a priori probability of the class is ρ_C = (N_C − 1)/(N − 1). We have chosen to show the evolution of precision as a function of recall during the iterative image retrieval process. Precision and recall are intuitive performance measures that suit also the case of non-exhaustive browsing. When not the whole database but only a smaller number N_T ≪ N of images is browsed through, the recall value is very unlikely to reach the value one. Instead, the final value R(N_T) – as well as P(N_T) – reflects the total number of relevant images found thus far. The intermediate values of P(t) first display the initial accuracy of the CBIR system and then how the relevance feedback mechanism is able to adapt to the class. With an effective relevance feedback mechanism, it is to be expected that P(t) first increases and then turns to decrease when a notable fraction of relevant images have already been shown. In our experiments, we have normalized the precision value by dividing it by the a priori probability ρ_C of the class and therefore call it relative precision. This makes the comparison of the recall–precision curves of different image classes somewhat commensurable and more convenient because relative precision values relate to the relative advantage the CBIR system produces over random browsing.
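For clarity, these measures reduce to a few lines of Python; this is only a restatement of the definitions above, with argument names chosen here for illustration.

def recall(n_relevant_found, class_size):
    """Fraction of the (leave-one-out) target class retrieved so far."""
    return n_relevant_found / float(class_size - 1)

def relative_precision(n_relevant_found, n_shown, class_size, database_size):
    """Precision divided by the a priori probability rho_C = (N_C - 1) / (N - 1)."""
    precision = n_relevant_found / float(n_shown)
    a_priori = (class_size - 1) / float(database_size - 1)
    return precision / a_priori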
3.2
Database and Ground Truth Classes
We have used images from the Corel Gallery 1 000 000 product in our evaluations. The database contains 59 995 color photographs originally packed with a wavelet compression and then locally converted in JPEG format with a utility provided by Corel. The size of each image is either 384×256 or 256×384 pixels. The images have been grouped by Corel in thematic groups and also keywords are available. However, we found these image groups and keywords rather inconsistent and, therefore, created for the experiments six manually-picked ground truth image sets with tighter membership criteria. All image sets were gathered by a single subject. The used sets and membership criteria were:
– faces, 1115 images (a priori probability 1.85%), where the main target of the image is a human head which has both eyes visible and the head fills at least 1/9 of the image area.
– cars, 864 images (1.44%), where the main target of the image is a car, at least one side of the car has to be completely shown in the image, and its body to fill at least 1/9 of the image area.
– planes, 292 images (0.49%), where all airplane images have been accepted.
– sunsets, 663 images (1.11%), where the image contains a sunset with the sun clearly visible in the image.
– houses, 526 images (0.88%), where the main target of the image is a single house, not severely obstructed, and it fills at least 1/16 of the image area.
– horses, 486 images (0.81%), where the main target of the image is one or more horses, shown completely in the image.
3.3
MPEG-7 Content Descriptors
MPEG-7 [5] is an ISO/IEC standard developed by the Moving Pictures Expert Group. MPEG-7 aims at standardizing the description of multimedia content data. It defines a standard set of descriptors that can be used to describe various types of multimedia information. The standard is not aimed at any particular application area; instead it is designed to support as broad a range of applications as possible. Still, one of the main application areas of MPEG-7 technology will undoubtedly be to extend the current modest search capabilities for multimedia data for creating effective digital libraries. As such, MPEG-7 is the first serious attempt to specify a standard set of descriptors for various types of multimedia information and standard ways to define other descriptions as well as structures of descriptions and their relationships. As a non-normative part of the standard, a software Experimentation Model (XM) [6] has been released for public use. The XM is the framework for all reference code of the MPEG-7 standard. In the scope of our work, the most relevant part of XM is the implementation of a set of MPEG-7-defined still image descriptors. At the time of this writing, XM is in its version 5.3 and not all description schemes have yet been reported to be working properly. Therefore, we have used only a subset of MPEG-7 content descriptors for still images in these experiments. The used descriptors were Scalable Color, Dominant Color, Color Structure, Color Layout, Edge Histogram, and Region Shape. The MPEG-7 standard defines not only the descriptors but also special metrics to be used with the descriptors when calculating the similarity between images. However, we use Euclidean metrics in comparing the descriptors because the training of the SOMs and the creation of the vector quantization prototypes are based on minimizing a square-form error criterion. Only in the case of the Dominant Color descriptor has this necessitated a slight modification in the use of the descriptor. The original Dominant Color descriptor of XM is variable-sized, i.e., the length of the descriptor varies depending on the count of dominant colors found. Because this could not be fitted into the PicSOM system, we used only the two most dominant colors or duplicated the most dominant color if only one was found. Also, we did not make use of the color percentage information. These two changes do not make our approach incompatible with other uses of the Dominant Color descriptor.
3.4
Results
Our experiments were two-fold. First, we wanted to study which of the four color descriptors would be the best one to be used together with the one texture and one shape descriptors in the table. Second, we wanted to compare the performance of our PicSOM system with that of the vector-quantization-based variant. We performed two sets of experiments in which the first question was addressed in the first set and the second question in both sets. We performed 48 computer runs in the first set of experiments. Each run was characterized by the combination of the method (PicSOM / VQ), color feature (Dominant Color / Scalable Color / Color Layout / Color Structure)
and the image class (faces / cars / planes / sunsets / houses / horses). Each experiment was repeated as many times as there were images in the image class in question; the recall and relative precision values were recorded for each such instance and finally averaged. 20 images were shown at each iteration round, which resulted in 50 rounds when N_T was set to 1000 images. Both recall and relative precision were recorded after each query iteration. Figure 3 shows, as a representative selection, the recall–relative precision curves of three of the studied image classes (faces, cars, and planes). Qualitatively similar behavior is observed with the three other classes as well. The recorded values are shown with symbols and connected with lines. The following observations can be made from the resulting recall–relative precision curves. First, none of the tested color descriptors seems to dominate the other descriptors and on different image classes the results of different color descriptors often vary considerably. Regardless of the used retrieval method (PicSOM or VQ), Color Structure seems to perform best with faces and using Scalable Color yields best results with planes and horses. With the other classes (cars, sunsets, houses), naming a single best color descriptor is not as straightforward. The second observation is that, in general, if a particular color descriptor works well for a particular image class, it does so with both retrieval algorithms. Third, the PicSOM method more often obtains better precision than the VQ method when comparing the same descriptor sets, although the difference is rather small. Also, in the end, PicSOM has in a majority of cases reached a higher recall level. The last observation here is that the difference between the precision of the best and the worst sets of descriptors is larger with the VQ method than with PicSOM. This can be observed, e.g., in the planes column of Figure 3.
Fig. 3. Recall–relative precision plots of the performance of different color descriptors and the two CBIR techniques (columns: faces, cars, planes; rows: PicSOM and VQ). In all cases also Edge Histogram and Region Shape descriptors have been used.
Fig. 4. Recall–relative precision plots of the performance of the two CBIR techniques when all four color descriptors were used simultaneously together with Edge Histogram and Region Shape descriptors (one panel per class: faces, cars, planes, sunsets, houses, horses).
As the final outcome of the experiment, it can be stated that the relevance feedback mechanism of PicSOM is clearly superior to that of VQ’s. The VQ retrieval has good initial precision but after a few rounds, when PicSOM’s relevance feedback begins to have an effect, retrieval precision with PicSOM is in all cases higher. The houses class can be regarded as a draw and a failure for both methods with the given set of content descriptors. One can also compare the curves of Figure 3 and the curves in the upper row of Figure 4 for an important observation. It can be seen that the PicSOM method is, when using all descriptors simultaneously (Figure 4), able to follow and even exceed the path of the best recall–relative precision curve for the four alternative single color descriptors (Figure 3). This behavior is present in all cases, also with the image classes not shown in Figure 3, and can be interpreted as an indication that the automatic weighting of features is working properly and additional, inferior, descriptors do not degrade the results. On the contrary, the VQ method fails to do the same and the VQ recall–relative precision curves in Figure 4 resemble more the average than the maximum value of the corresponding VQ curves in Figure 3. As a consequence, the VQ technique is clearly more dependent on the proper selection of used features than the PicSOM technique.
4
Conclusions
In this paper, we have described our content-based image retrieval system named PicSOM and shown that MPEG-7-defined content descriptors can be successfully used with it. The PicSOM system is based on using Self-Organizing Maps in implementing relevance feedback from the user of the system. As the system uses many parallel SOMs, each trained with separate content descriptors, it is straightforward to use any kind of features. Due to PicSOM’s ability to automatically weight and combine the responses of the different descriptors, one can make use of any number of content descriptors without the need to weight them manually. As a consequence, the PicSOM system is well-suited for operation with MPEG-7 which also allows the definition and addition of any number of new content descriptors. In the experiments we compared the performances of four different color descriptors available in the MPEG-7 Experimentation Model software. The results of that experiment showed that no single color descriptor was the best one for all of our six hand-picked image classes. That result was no surprise, it merely emphasizes the need to use many different types of content descriptors in parallel. In an experiment where we used all the available color descriptors, the PicSOM system indeed was able to automatically reach and even exceed the best recall–precision levels obtained earlier with preselection of features. This is a very desirable property, as it suggests that we can initiate queries with a large number of parallel descriptors and the PicSOM systems focuses on the descriptors which provide the most useful information for the particular query instance.
We also compared the performance of the self-organizing relevance feedback technique of PicSOM with that of a vector-quantization-based reference system. The results showed that in the beginning of queries, PicSOM starts with a bit lower precision rate. Later, when its strong relevance feedback mechanism has enough data to process, PicSOM outperforms the reference technique. In the future, we plan to study how the retrieval precision in the beginning of PicSOM queries could be improved to the level attained by the VQ technique in the experiments.
Acknowledgments
This work was supported by the Finnish Centre of Excellence Programme (2000–2005) of the Academy of Finland, project New information processing principles, 44886.
References
1. Laaksonen, J.T., Koskela, J.M., Laakso, S.P., Oja, E.: PicSOM - Content-based image retrieval with self-organizing maps. Pattern Recognition Letters 21 (2000) 1199–1207
2. Laaksonen, J., Koskela, M., Laakso, S., Oja, E.: Self-organizing maps as a relevance feedback technique in content-based image retrieval. Pattern Analysis & Applications 4 (2001) 140–152
3. Kohonen, T.: Self-Organizing Maps. Third edn. Volume 30 of Springer Series in Information Sciences. Springer-Verlag (2001)
4. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. Computer Science Series. McGraw-Hill (1983)
5. MPEG: Overview of the MPEG-7 standard (version 5.0) (2001) ISO/IEC JTC1/SC29/WG11 N4031.
6. MPEG: MPEG-7 visual part of the eXperimentation Model (version 9.0) (2001) ISO/IEC JTC1/SC29/WG11 N3914.
7. Koikkalainen, P., Oja, E.: Self-organizing hierarchical feature maps. In: Proc. IJCNN-90, International Joint Conference on Neural Networks, Washington, DC. Volume II., Piscataway, NJ, IEEE Service Center (1990) 279–285
8. Koskela, M., Laaksonen, J., Oja, E.: Comparison of techniques for content-based image retrieval. In: Proceedings of 12th Scandinavian Conference on Image Analysis (SCIA 2001), Bergen, Norway (2001) 579–586
9. Linde, Y., Buzo, A., Gray, R.: An algorithm for vector quantizer design. IEEE Transactions on Communications COM-28 (1980) 84–95
Fast Text Caption Localization on Video Using Visual Rhythm
Seong Soo Chun1, Hyeokman Kim2, Jung-Rim Kim1, Sangwook Oh1, and Sanghoon Sull1
1 School of Electrical Engineering, Korea University, Seoul, Korea
{sschun,jrkim,osu,sull}@mpeg.korea.ac.kr
2 School of Computer Science, Kookmin University, Seoul, Korea
hmkim@cs.kookmin.ac.kr
Abstract. In this paper, a fast DCT-based algorithm is proposed to efficiently locate text captions embedded in specific areas of a video sequence through the visual rhythm, which can be constructed quickly by sampling certain portions of a DC image sequence and temporally accumulating the samples along time. Our proposed approach is based on the observation that text captions carrying important information suitable for indexing often appear in specific areas of video frames, from which the sampling strategies for a visual rhythm are derived. Our method then uses a combination of contrast and temporal coherence information on the visual rhythm to detect text frames such that each detected text frame represents consecutive frames containing identical text strings, thus significantly reducing the number of frames that need to be examined for text localization in a video sequence. It then utilizes several important properties of text captions to locate the text caption in the detected frames.
1
Introduction
With rapid advances in digital technology, the amount of multimedia information available continues to grow. As multimedia contents become readily available, archiving, searching, indexing and locating desired content in large volumes of multimedia, containing images and video in addition to the textual information, will become even more difficult. One important source of information that can be obtained from image and video is the text contained therein. The video can be easily indexed if access to this textual information content is available. They provide clear semantics of video, and are extremely useful in deducing the contents of video. A large number of methods have been extensively studied in recent years to detect text in uncompressed images and video. Ohya et al. [1] perform character extraction by local thresholding and detect character candidate regions by evaluating gray level difference between adjacent regions. Hauptmann and Smith [2] use the spatial context of text and high contrast of text regions in scene images to merge large numbers of horizontal and vertical edges in spatial proximity to detect text. Shim et al. [3] use a generalized region labeling algorithm to find homogeneous regions for text. Wu et al.
[4] use texture analysis to detect and segment texts as regions of distinctive texture, using a pyramid technique for handling text fonts of different sizes. Lienhart [5] provides a split-and-merge algorithm based on characteristics of artificial text to segment text. Li et al. [6] used wavelet analysis and employed a multi-frame coherence approach to cluster edges into rectangular shapes. Sato et al. [7] adopted a multi-frame integration technique to separate static text from moving background. A few methods have also been proposed to detect text regions in the compressed domain. Yeo and Liu [8] propose a method for the detection of text caption events in video by modified scene change detection, which cannot handle captions that gradually enter or disappear from frames. Zhong et al. [9] examined the horizontal variations of AC values in DCT to locate text frames and examined the vertical intensity variation within the text regions to extract the final text frames. Zhang and Chua [10] derived a binarized gradient energy representation directly from DCT coefficients which are subject to constraints on text properties and temporal coherence to locate text. However, none of them exploits the temporal coherence of text, useful for reducing processing time by not applying all steps (detection, localization, and OCR) to every frame, which results in duplicates of the same text string in the database. The main contribution of this paper is to develop an efficient and fast compressed DCT domain method to locate text captions in specific areas in digital video through a visual rhythm [12], an abstraction of video that is constructed by sampling a certain group of pixels of each frame and by temporally accumulating the samples along time. Our method uses a combination of contrast and temporal coherence information on the visual rhythm to detect text frames such that each detected text frame represents consecutive frames containing identical text strings, thus significantly reducing the number of text frames that need to be examined for text localization in a video sequence. It then utilizes several important properties of text captions to locate text captions in the detected frames. The visual rhythm constructed for text localization also serves as a visual feature to efficiently detect scene changes. This paper is organized as follows: Section 2 gives a brief description of the visual rhythm. Section 3 describes the proposed text frame detection and text caption localization algorithm. Section 4 describes experimental results. In Section 5, we give concluding remarks.
2 Related Work

2.1 Visual Rhythm

For the design of an efficient real-time text caption detector, we resort to using only a portion of the original video. This partial video must retain most, if not all, text caption information. We claim that a visual rhythm, defined below, satisfies this requirement. Let f_DC(x,y,t) be the pixel value at location (x,y) of an arbitrary W x H DC image [11], which consists of the DC coefficients of frame t. Using the sequence of DC images of a video, called the DC sequence, we define a visual rhythm, VR, of the video V as follows:
VR = {f_VR(z,t)} = {f_DC(x(z), y(z), t)},    (1)
where x(z) and y(z) are one-dimensional functions of the independent variable z. Thus, the visual rhythm is a two-dimensional image whose vertical z axis consists of a certain group of pixels from each DC image, with the samples accumulated along time on the horizontal t axis. That is, the visual rhythm is a two-dimensional image consisting of pixels sampled from three-dimensional data (the DC sequence). The visual rhythm is also an important visual feature that can be utilized to detect scene changes [12]. The sampling strategy, x(z) and y(z), must be carefully chosen for a visual rhythm to retain text caption information. We define x(z) and y(z) as in Equation (2), where W and H are the width and the height of a DC image, respectively. Figure 1 illustrates the sampling strategy of the DC sequence for the construction of the visual rhythm. The diagonal pixels of a frame, from the bottom-left corner to the top-right corner, are sampled for 0 ≤ z < H.
Fig. 1. Representation for regions of text appearance
Fig. 2. The vertical line of a visual rhythm obtained by sampling the pixels of a DC sequence
2.2 Fast Generation of Visual Rhythm

Many compression schemes use the discrete cosine transform (DCT) for intra-frame encoding. Thus, the construction of a visual rhythm is possible without the inverse DCT: we simply extract the DC coefficients of each frame. As for the P- and B-frames of MPEG, algorithms for determining the DC images from inter-frame compressed P- and B-frames of MPEG-1 [11] and MPEG-2 [13] have already been developed. Therefore, it is possible to generate a visual rhythm quickly, at least for DCT-based compression schemes such as Motion JPEG and MPEG videos.
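As an illustration of this step, the following sketch builds a visual rhythm from DC images that are assumed to be already extracted as numpy arrays. The diagonal x(z), y(z) mapping used here is an assumption, since the exact sampling path of Equation (2) could not be recovered from the source.

import numpy as np

def visual_rhythm(dc_images):
    """Build a visual rhythm by sampling the diagonal of each DC image.

    dc_images: iterable of H x W arrays (one DC image per frame).
    Returns a 2-D array whose vertical axis is the sample index z and whose
    horizontal axis is time t, as in the definition of VR.
    """
    columns = []
    for dc in dc_images:
        h, w = dc.shape
        z = np.arange(h)
        # Diagonal from the bottom-left corner to the top-right corner:
        # one pixel is taken from every row of the DC image.
        x = (z * (w - 1)) // max(h - 1, 1)
        y = (h - 1) - z
        columns.append(dc[y, x])
    return np.stack(columns, axis=1)

# Example with synthetic DC images (30 frames of 30x40 DC coefficients).
frames = [np.random.randint(0, 256, (30, 40)) for _ in range(30)]
vr = visual_rhythm(frames)   # shape (30, 30): z rows, t columns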
3 Proposed Strategy

3.1 Text Frame Detection

A text frame is defined as a video frame that contains one or more text captions. Since a text caption usually appears in a number of consecutive frames, we propose an algorithm that detects a representative text frame from the consecutive frames containing identical text strings, to avoid unnecessary text caption localization for identical text strings. The text frame detection algorithm detects text frames based on the following characteristics of text captions within video:
♦ Characters in a single text caption are mostly uniform in color.
♦ Text captions contrast with their background.
♦ Text captions remain in a scene for a number of consecutive frames.
On the visual rhythm obtained from a DC sequence, the pixels corresponding to a text caption manifest themselves as long horizontal lines with high contrast against their background. Hence, horizontal lines on the visual rhythm with high contrast against their background are mostly due to text strings, and they give us clues about where and when each text string appears within the video. The pixel value of the horizontal line on the visual rhythm also gives a clue about the pixel value of the text caption in the DC image, allowing a simple algorithm for text caption localization within the frame. To detect potential text frames, any horizontal edge detection method can be used on the visual rhythm. In our experiment we used the Prewitt edge operator with the convolution kernel

[ -1 -1 -1 ]
[  0  0  0 ]
[  1  1  1 ]
on the visual rhythm to obtain VR_edge(z,t) as follows:

VR_edge(z,t) = Σ_{i=-1}^{1} Σ_{j=-1}^{1} w_{i,j} f_VR(z+j, t+i).    (3)
To obtain text lines, which we define as horizontal lines on the visual rhythm with high contrast against their background and possibly formed by text captions, the pixels whose VR_edge(z,t) value is greater than a threshold τ (we set τ = 150 in our experiment) and whose value f_VR(z,t) is uniform are connected in the horizontal direction. Text lines lasting shorter than a specific amount of time are not considered, since text usually remains in the scene for a number of consecutive frames. From observations of various types of video material, the shortest captions appear to be active for at least two seconds, which translates into a text line with a frame length of 60 if the video is digitized at 30 frames per second. Thus, text lines shorter than 2 seconds can be eliminated. The resulting set of text lines has the form

LINE_k = [z_k, t_k^start, t_k^end],  k = 1, ..., N_LINE,    (4)

where [z_k, t_k^start, t_k^end] denotes the z coordinate, the beginning frame and the end frame of the occurrence of text line LINE_k on the visual rhythm, respectively. The text lines are ordered by increasing starting frame number,

t_1^start ≤ t_2^start ≤ ... ≤ t_{N_LINE}^start.    (5)
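A minimal sketch of this text-line extraction, assuming the visual rhythm f_VR and the Prewitt response VR_edge are available as numpy arrays. The uniformity test on f_VR is reduced to a simple tolerance parameter here, which is an assumption; the threshold and minimum length follow the text.

import numpy as np

def extract_text_lines(vr, vr_edge, tau=150, min_len=60, tol=10):
    """Find horizontal runs of strong edges with near-constant pixel value.

    vr      : visual rhythm f_VR(z, t)
    vr_edge : edge response VR_edge(z, t) from the Prewitt operator
    tau     : edge threshold (150 in the paper's experiment)
    min_len : minimum duration in frames (2 s at 30 fps -> 60)
    tol     : allowed variation of f_VR along a run (assumed parameter)
    Returns a list of (z, t_start, t_end) triples ordered by t_start.
    """
    lines = []
    Z, T = vr.shape
    for z in range(Z):
        t = 0
        while t < T:
            if vr_edge[z, t] > tau:
                start, ref = t, vr[z, t]
                while t < T and vr_edge[z, t] > tau and abs(vr[z, t] - ref) <= tol:
                    t += 1
                if t - start >= min_len:
                    lines.append((z, start, t - 1))
            else:
                t += 1
    return sorted(lines, key=lambda ln: ln[1])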
Figure 5(b) shows the binarized representation of the text lines, possibly formed by text captions, obtained from the visual rhythm in Figure 5(a). Frames that do not fall within the temporal duration of any LINE_k do not contain a text caption and are omitted from further consideration as text frame candidates. Once the frames without text have been excluded, it is highly probable that the remaining frames of the video contain text captions.
However, it would be very inefficient to perform the text caption localization repeatedly for the same text caption remaining on the screen over multiple frames. Since each text line possibly represents a single text caption, we only need to access a single frame to extract its corresponding text. Therefore, the number of text frames to be examined for text caption localization can be minimized by obtaining a maximum cardinality collection of disjoint intervals of text lines through the following algorithm:

n ← 0
SET = {k : 1 ≤ k ≤ N_LINE}
WHILE (SET ≠ ∅) {
    e = min(t_k^end : k ∈ SET);
    A = {k ∈ SET | t_k^start < e};
    F_n = (max(t_k^start : k ∈ A) + e) / 2;
    n++;
    SET ← SET − A;
}

Fig. 3. Pseudo-code to find the minimal number of text frames
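A runnable rendering of the pseudo-code of Fig. 3, under the assumption that the text lines are supplied as (z, t_start, t_end) triples such as those produced by the previous stage; the helper name select_text_frames is ours.

def select_text_frames(lines):
    """Greedy selection of a minimal set of frames covering all text lines.

    lines: list of (z, t_start, t_end) triples.
    Returns the frame numbers F_0, ..., F_{n-1} to pass on to localization.
    """
    remaining = set(range(len(lines)))
    frames = []
    while remaining:
        # Earliest end time among the remaining text lines.
        e = min(lines[k][2] for k in remaining)
        # All remaining lines that have already started before that end time.
        a = {k for k in remaining if lines[k][1] < e}
        # Representative frame: midpoint between the latest start in A and e.
        frames.append((max(lines[k][1] for k in a) + e) // 2)
        remaining -= a
    return frames

# Example: three overlapping captions and one isolated caption.
lines = [(10, 0, 120), (25, 30, 150), (40, 60, 200), (12, 400, 520)]
print(select_text_frames(lines))   # two frames suffice here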
where F_j is the j-th frame to be accessed for text caption localization as the final output of the text frame detection stage, with j < n.
3.2 Text Caption Localization

The text caption localization stage spatially localizes text captions within a frame. Let f_DC(x,y,t) be the pixel value at (x,y) of the DC image of frame t. From the visual rhythm obtained by the sampling strategy of Equation (2), we can observe that LINE_k is possibly formed by a portion of a character located at (x,y) = (x(z_k), y(z_k)) in the frames between t_k^start and t_k^end, with pixel values f_VR(z_k, t) where t_k^start < t < t_k^end. Furthermore, if a portion of a character is located at (x,y) = (x(z_k), y(z_k)) within a DC image, it can be assumed that portions of characters belonging to the same text caption appear along y = y(z_k), because text captions are usually horizontally aligned. Therefore, the text line information obtained from the text frame detection stage can be used to approximate the location of text within the frame, and enables an algorithm that focuses on a specific area of the frame. For each of the detected frames F_j, we verify whether LINE_k, with t_k^start < F_j < t_k^end, is formed by portions of a text string located along y = y(z_k). For the text line LINE_k, we first cluster the pixels with pixel value f_VR(z_k, t) from the pixels of the horizontal scanline y = y(z_k), using a 4-connected clustering algorithm, to form
text candidate regions in frame F_j, where t_k^start < t, F_j < t_k^end. From each of the clustered regions, the top-most coordinate is computed and collected in an alignment histogram H_T, whose bins correspond to the row numbers of the DC image, as illustrated in Figure 4. H_B is computed in the same way using the bottom-most coordinates of each region. We declare the existence of an upper boundary B_T of a text caption if at least 50% of the elements in H_T are contained within three or fewer adjacent histogram bins. The lower boundary B_B is computed in the same way using H_B. The height of the localized text caption can thus be obtained. To find the width of the caption text, regions wider than 1.5 times the height are first discarded. From the final set of regions, the following criterion is used to merge regions corresponding to characters and obtain the width of the text caption:

♦ Two regions, A and B, are merged if the gap between A and B is less than 3 times the height.
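A simplified sketch of the boundary test and the merging rule described above. The candidate character regions are assumed to be given already (the 4-connected clustering itself is not shown), and the helper names are ours; the 50%-within-three-bins rule and the 1.5x/3x height thresholds follow the text.

def vertical_boundary(rows, window=3, ratio=0.5):
    """Return a boundary row if at least `ratio` of the values fall inside
    `window` adjacent histogram bins, else None."""
    if not rows:
        return None
    lo, hi = min(rows), max(rows)
    counts = {r: rows.count(r) for r in range(lo, hi + 1)}
    for start in range(lo, hi + 1):
        hits = sum(counts.get(r, 0) for r in range(start, start + window))
        if hits >= ratio * len(rows):
            return start
    return None

def caption_width(regions, height):
    """Merge character regions into one caption run.

    regions: list of (left, right) extents of candidate character regions.
    height : caption height obtained from the upper/lower boundaries.
    """
    # Discard regions wider than 1.5 * height (unlikely single characters).
    kept = sorted((l, r) for l, r in regions if (r - l) <= 1.5 * height)
    if not kept:
        return None
    left, right = kept[0]
    for l, r in kept[1:]:
        if l - right < 3 * height:      # merge if the gap is small enough
            right = max(right, r)
        else:
            break                       # a separate caption starts here
    return left, right

tops = [12, 12, 13, 12, 30, 12]                  # top rows of candidate regions
print(vertical_boundary(tops))                   # -> 12 (upper boundary B_T)
print(caption_width([(5, 12), (15, 22), (60, 70)], height=10))  # -> (5, 22)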
We can thus verify whether LINE_k is formed by a text caption and, if so, localize the text caption, which appears along the duration of LINE_k and does not have to be verified again.
Fig. 4. Computation of the upper and lower boundaries B_T and B_B of a text caption through the alignment histograms H_T and H_B.
Since several text lines can be formed by the same text caption, the whole localization process is skipped when LINE_k and its corresponding horizontal scanline y = y(z_k) intersect a text caption already localized from a previous text line. Figure 6 shows an example of a localized text caption. The usefulness of this text caption localization stage is that it is inexpensive and fast, robustly supplying bounding boxes around text captions along with their temporal information.
4 Experimental Results

4.1 Environment of the Experiment

To evaluate the performance of the proposed method, we tested it on various types of MPEG video clips: 1) a news broadcast clip (14 m 52 s), which covered a variety of events including outdoor and newsroom news programs and a weather forecast, 2) sports clips of a golf lesson (37 m 21 s) and baseball (22 m 4 s), and 3) a commercial clip (7 m 35 s), which contains various embedded captions and credits.

4.2 Performance Evaluation of the Proposed Algorithm

Table 1 shows the results of the proposed algorithm. The second row of Table 1 gives the total number of text captions present in each category. The next row is the count of correctly identified text captions. The total numbers of false positives and false negatives are stated in the next two rows. Finally, the recall and precision in each case are stated. Our proposed text caption localization has an overall average recall of about 80% and a precision of 86%.

4.3 Computational Time of the Proposed Algorithm

The processing speed of the proposed caption localization method is high, since it works on only a few of the pixels sampled from the entire video in the compressed domain, whereas conventional approaches operate on all pixels of a video. Table 2 shows the processing time of each stage on a Pentium III 500 MHz. It took approximately 7 minutes to produce the visual rhythm of the video clips, corresponding to a total length of approximately 1 hour and 20 minutes. From the visual rhythm of the video clips, it took about 22 seconds to detect the potential text frames subject to text caption localization as the final result of the text frame detection stage. From the detected text frames, it took approximately 2 minutes in total to locate the text captions. Thus the whole process took about 9 minutes.
5 Conclusions

The proposed algorithm for localizing text captions proved to be very fast by exploiting text caption characteristics on the visual rhythm. Moving text captions and captions embedded at locations other than the assumed ones resulted in a rather low average recall rate of 80%, since our algorithm locates only static text captions at the assumed locations. It took 9 minutes to localize the text captions and their temporal durations for 1 hour and 22 minutes of video. This includes the construction time of the visual rhythm, which can also be used to detect scene changes for video indexing
with very little processing. The proposed method also reduces the time for OCR, since identical text captions appearing in consecutive frames are not fed into the OCR repeatedly.
Fig. 5. Characteristics of text captions on the visual rhythm: (a) visual rhythm of the video material; (b) text lines representing text captions
Fig. 6. Results of text caption localization

Table 1. Recall and precision for text caption localization

Video Type              News    Sports   Commercials
Distinct text caption     55      302         37
True Pos                  44      241         30
False Pos                  8       40          4
False Neg                 11       61          7
Recall (%)              80.0     79.8       81.1
Precision (%)           84.6     85.8       88.2
Table 2. Execution time of visual rhythm construction, text frame detection and text caption localization

Video Type              News      Sports      Commercials   Total
Duration                14m 52s   59m 25s     7m 35s        1h 21m 52s
Visual Rhythm           1m 16s    5m 6s       38s           7m
Detection Time          3s        17.21s      1.42s         21.63s
Localization Time       16s       1m 12s      8s            1m 36s
Total Processing Time   1m 35s    6m 35.21s   47.42s        8m 57.63s
References
1. Ohya, J., Shio, A., Akamatsu, S.: Recognizing Characters in Scene Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16 (1994) 214-220
2. Hauptmann, A., Smith, M.: Text, Speech, and Vision for Video Segmentation: The Informedia Project. AAAI Symposium on Computational Models for Integrating Language and Vision (1995)
3. Shim, J., Dorai, C., Bolle, R.: Automatic Text Extraction from Video for Content-Based Annotation and Retrieval. IEEE International Conference on Pattern Recognition, Vol. 1 (1998) 618-620
4. Wu, V., Manmatha, R., Riseman, E.: Finding Text in Images. Proceedings of the 2nd ACM International Conference on Digital Libraries (1997) 3-12
5. Lienhart, R.: Automatic Text Recognition for Video Indexing. Proceedings of ACM Multimedia (1996) 11-20
6. Li, H., Doermann, D., Kia, O.: Automatic Text Detection and Tracking in Digital Video. IEEE Transactions on Image Processing, Vol. 9 (2000) 147-156
7. Sato, T., Kanade, T., Hughes, E., Smith, M., Satoh, S.: Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions. ACM Multimedia Systems, Vol. 7 (1998) 385-394
8. Yeo, B.L., Liu, B.: Visual Content Highlighting via Automatic Extraction of Embedded Captions on MPEG Compressed Video. IS&T/SPIE Symposium on Electronic Imaging: Digital Video Compression (1996)
9. Zhong, Y., Karu, K., Jain, A.: Automatic Caption Localization in Compressed Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22 (2000) 385-392
10. Zhang, Y., Chua, T.: Detection of Text Captions in Compressed Domain Video. Proceedings of Multimedia Information Retrieval, ACM Multimedia (2000) 201-204
11. Yeo, B.L., Liu, B.: Rapid Scene Analysis on Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5 (1995) 533-544
12. Kim, H., Lee, J., Song, S.M.: An Efficient Graphical Shot Verifier Incorporating Visual Rhythm. Proceedings of IEEE International Conference on Multimedia Computing and Systems (1999) 827-834
13. Song, J., Yeo, B.L.: Spatially Reduced Image Extraction from MPEG-2 Video: Fast Algorithms and Application. Proceedings of SPIE Storage and Retrieval for Image and Video Databases VI, Vol. 3312 (1998) 92-107
A New Digital Watermarking Technique for Video

Kuan-Ting Shen and Ling-Hwei Chen

Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan

Abstract. Data hiding and digital watermarking are nowadays among the most important issues for digital multimedia. Data hiding techniques can be used for covert communication, while digital watermarking can be used for protecting digital media content. Many techniques have been developed for embedding data into various multimedia media. In this paper, we propose a method for embedding a digital watermark into uncompressed video. It uses the relationship among the DC components in several successive frames to hide data. Since DC components do not vary much after a DCT-based lossy compression algorithm, this approach is able to resist such compression. Experimental results demonstrate that the proposed method is robust to MPEG coding.
1. Introduction

Information hiding, a technique for embedding data into a given medium without being noticed, has become more and more important recently. Owing to the high-speed, high-capacity transmission provided by the Internet, digital multimedia content is widely spread. Although digital multimedia technologies bring Internet users many new applications and services, content owners are afraid of losing their income due to the increase in illegal copies. Since any unauthorized user can easily make perfect copies of digital multimedia, techniques for protecting the copyright of digital media are now an urgent demand for these content owners. To resolve the rightful ownership of multimedia content, copyright information can be embedded into the content by applying some sort of information hiding technique. This kind of information hiding technique is the so-called "digital watermarking". The development of digital watermarking techniques has advanced in recent years. A number of works have embedded watermarks into digital image content. These watermarking schemes can also be applied to videos by treating each single frame in a video as a still image [1-5]. Since a video sequence consists of successive still images, it is quite easy to apply a data-hiding algorithm for still images to videos. But data hiding algorithms for images consider only the nature of still images. For digital videos, there is a large amount of inter-frame redundancy. The most popular video compression standard, MPEG, adopts the technique called motion compensation for removing the inter-frame redundancy in order to gain a better compression ratio. In MPEG, only the I-frames use the same compression technique as that for still images, while the other frames, called P- and B-frames, are coded using the motion-compensated predictive coding scheme. Most digital watermarking schemes, such as DEW proposed by Langelaar et al. [8][9], embed watermarks only in the I-frames of an MPEG coded video. Since most frames in an MPEG coded video are B
and P frames, it is quite uneconomical if watermarks embedded in a video can only survive in I-frames. So it is desirable to develop an approach that can use not only I-frames but also B- and P-frames for embedding data into videos. In order to solve the problem that only I-frames can be used for embedding data, some approaches have been proposed that use motion vectors for hiding data in videos [6][7]. Since these vectors are generated in the motion compensation process, it is reasonable to expect that these data will survive in P- and B-frames. But this kind of method depends heavily on the MPEG standard, so these methods cannot be used if a video is not coded in MPEG format. In this paper, an approach that embeds data into uncompressed videos by using the relationship between frames is proposed. Since a video consists of a sequence of images, it is reasonable to embed data using properties shared between frames. This method is robust to MPEG compression and suitable for digital watermarking applications.
2. Watermarking on Uncompressed Videos

2.1 Embedding Digital Watermark Using DC Components
The proposed method for digital watermarking uses the relationship between the DC components in two successive frames. Instead of hiding data in every frame of a video sequence, only some frames are used for embedding watermark information, because the distortions caused by modifying DC components are large. In order to minimize the distortion caused by the embedding, not every frame in the host video can be used for embedding watermark information. Fig. 1 shows the embedding procedure of the proposed method.
Fig. 1. Embedding procedure using the DC difference between two consecutive frames.
In the first step of the embedding procedure, embedding pairs are selected by a key k. The key k, chosen by the user, is used to determine the distance between two embedding pairs. An embedding pair consists of two consecutive host frames, and the distance between two embedding pairs is generated by using k as a random seed. In order to guarantee the embedding strength, the number of embedding pairs must be sufficient. The embedding strength is also chosen by the user. Under this constraint, the distance d_i that is randomly generated from k cannot exceed a maximum distance based on the embedding strength. With this scheme, attackers cannot easily find all frames that contain the watermark. Fig. 2 shows an example of the embedding structure of a video using this embedding scheme.
Fig. 2. Structure of an embedded video sequence with d1, d2, d3… determined by key k.
After the embedding pairs are selected, both host frames in an embedding pair are divided into 8x8 blocks. Each block in the host frames is transformed using the discrete cosine transform (DCT), and the DC component of each transformed block is extracted. These DC components are the basis for embedding the watermark information into the host video. To increase the strength of the embedded watermark, some redundancy must be added to the embedding signal. The watermark is therefore first spread by a chip-rate cr, and the spread signal is then modulated by a binary pseudo-random noise p generated from the same key k that is used in embedding pair selection. The following equations show the relation between the watermark and the embedding signal:

s_i = w_j,  where j · cr ≤ i < (j+1) · cr,
p_i ∈ {0, 1},
e_i = s_i ⊕ p_i,    (1)

where w_j is the original watermark signal, s_i the spread signal, p_i the pseudo-random noise, and e_i the resulting bit-string for embedding.
In order to embed a watermark bit, the relation between the DC component of a block in one host frame and the DC component of its corresponding block in the other frame must meet the following condition:

DC_A < DC_B for embedding a bit "0",
DC_A > DC_B for embedding a bit "1",    (2)

where DC_A is the DC coefficient of a block in host frame A and DC_B the DC coefficient of the corresponding block in host frame B.
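A minimal sketch of the spreading, modulation and DC-ordering steps of Eqs. (1) and (2), operating on pre-computed DC arrays of one embedding pair. The helper names, the adjustment margin delta and the symmetric way both frames are nudged are our assumptions; the skip threshold t1 follows the description in the next paragraph.

import numpy as np

def prepare_bits(watermark, cr, key):
    """Spread each watermark bit cr times and XOR with a key-seeded PN sequence."""
    rng = np.random.default_rng(key)
    spread = np.repeat(np.asarray(watermark, dtype=np.uint8), cr)   # s_i
    pn = rng.integers(0, 2, size=spread.size, dtype=np.uint8)       # p_i
    return spread ^ pn                                              # e_i

def embed_pair(dc_a, dc_b, bits, t1=200.0, delta=4.0):
    """Enforce DC_A < DC_B for bit 0 and DC_A > DC_B for bit 1, block by block.

    dc_a, dc_b: 1-D float arrays of DC coefficients of corresponding blocks in
                the two host frames of one embedding pair (modified in place).
    t1        : blocks whose DC difference already exceeds t1 are skipped.
    delta     : margin used when a pair has to be adjusted (an assumption).
    """
    k = 0
    for i in range(dc_a.size):
        if k >= bits.size:
            break
        if abs(dc_a[i] - dc_b[i]) > t1:
            continue                      # too sensitive to modify, skip block
        mid = (dc_a[i] + dc_b[i]) / 2.0
        if bits[k] == 0:                  # need DC_A < DC_B
            dc_a[i], dc_b[i] = min(dc_a[i], mid - delta), max(dc_b[i], mid + delta)
        else:                             # need DC_A > DC_B
            dc_a[i], dc_b[i] = max(dc_a[i], mid + delta), min(dc_b[i], mid - delta)
        k += 1
    return k                              # number of bits actually embedded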
For example, the DC component in host frame A must be less than its corresponding DC in host frame B to embed a bit "0". If the relation between the two DC components does not match the watermark bit, the DC components in the transformed host frames are changed accordingly. In order to minimize the changes to the DC components of a single frame, both host frames are modified during embedding. Since DC coefficients are sensitive to larger modifications, a block does not embed any watermark bit if the difference between its two DC components is larger than a threshold t1. To increase the embedding strength, two embedding pairs embed the same bit-string. Finally, the inverse DCT is applied to transform the embedded frames back to the spatial domain.

2.2 The Watermark Extraction Procedure

The watermark extraction procedure is similar to the embedding procedure. First, the embedding pairs are selected using the key k that was used in the embedding procedure. Then all host frames in an embedding pair are transformed into the DCT domain. The watermark is extracted using the DC coefficients of the transformed embedded frames. With the condition listed in the previous section, an embedded bit can be extracted from the relationship between a DC in host frame A and its corresponding DC in host frame B. To reconstruct the original watermark signal, a pseudo-random bit-string is also needed; this too is generated from the key k. The extracted bit-string is then demodulated with this pseudo-random bit-string. From the embedding procedure described in the previous section, we know that each watermark bit was spread cr times and that each embedding bit-string was embedded repeatedly into two embedding pairs. So a watermark bit can be reconstructed from the cr*2 bits of the bit-strings extracted from the two embedding pairs: if there are more 1's than 0's among these cr*2 bits, a watermark bit "1" is extracted.
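A matching extraction sketch, assuming the same key, chip rate, block order and skip threshold as at embedding time; votes from the embedding pairs are pooled before the majority decision, which is one reading of the cr*2-bit rule. The function name is ours.

import numpy as np

def extract_watermark(pairs, n_bits, cr, key, t1=200.0):
    """Recover the watermark from the DC arrays of the embedding pairs.

    pairs: list of (dc_a, dc_b) arrays, one tuple per embedding pair.
    Returns an array of n_bits recovered watermark bits (majority vote).
    """
    rng = np.random.default_rng(key)
    pn = rng.integers(0, 2, size=n_bits * cr, dtype=np.uint8)
    votes = np.zeros(n_bits, dtype=int)
    counts = np.zeros(n_bits, dtype=int)
    for dc_a, dc_b in pairs:
        k = 0
        for i in range(dc_a.size):
            if k >= n_bits * cr:
                break
            if abs(dc_a[i] - dc_b[i]) > t1:
                continue                       # this block was skipped at embed time
            e = 1 if dc_a[i] > dc_b[i] else 0  # read the embedded bit e_i
            s = e ^ pn[k]                      # demodulate back to the spread bit
            votes[k // cr] += s
            counts[k // cr] += 1
            k += 1
    return (votes * 2 > counts).astype(np.uint8)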
3. Experimental Results

The proposed watermarking method embeds the watermark in the DCT domain of a video. Its behavior is similar to that of the DEW algorithm proposed by Langelaar et al. [8][9]. We modified Langelaar's algorithm in order to apply it to raw videos. The original algorithm embeds watermarks in each I-frame of an MPEG compressed video, and a bit is embedded in each DCT-transformed 8x8 block. The modified version embeds the watermark in every frame of a video using the same idea, and the watermark is extracted from the I-frames. In this section, both the modified DEW and our proposed method are examined. The three videos used for testing the two watermarking schemes are shown in Fig. 3. Each test video consists of 70 frames, and the resolution of each frame is 352 by 240 pixels. The size of an embedding subset in the DEW is set to 16, and the cut-off frequency of the DEW in this experiment is 32. In the experiment with the proposed method, the percentage of embedding pairs among all pairs is 30%. Figures 4 and 5 illustrate the visual results produced by our proposed method and by the DEW.
Fig. 3. Test videos for the watermark method.
Fig. 4. Experimental results of our proposed method. (a) and (b) are the original embedding pair. (c) and (d) are the resulting embedding pair.
Fig. 5. Experimental results of DEW.
To test the robustness of these two methods against MPEG coding, the embedded videos are compressed with an MPEG-2 coder and the watermark is then extracted from the decompressed video. The length of the watermark signature is
1000 bits. After the watermark is extracted, it is compared to the original one and the bit error rate is measured. The results are listed in Table 1.

Table 1. Bit error rates

                         DEW     Proposed method
Weather (1 Mbps)         7.4%    0.9%
Weather (2 Mbps)         2.5%    0.4%
Table tennis (1 Mbps)    4%      3.6%
Table tennis (2 Mbps)    2%      2%
News (1 Mbps)            3%      3.2%
News (2 Mbps)            2.5%    1.3%
From the results shown above, visual artifacts occur in the edge components of a frame when using the DEW, while with the proposed method the intensity of some plain areas in an embedded frame changes slightly after embedding, because the DC components are used for embedding the watermark bits. These changes are, however, hard to notice while playing the video. When the modified DEW is used for embedding, the watermark bits no longer exist in the frames that are coded as B- or P-frames by an MPEG coder, so the watermark is extracted only from the decompressed I-frames, and the bit error rates are measured accordingly. From Table 1, the accuracy of the watermark bits using the proposed embedding scheme is better than that of the DEW. Although the proposed scheme performs worse on fast-moving videos, its accuracy and quality are still better than those of the DEW.
4. Summary

In this paper, a method for embedding a digital watermark into uncompressed video sequences is proposed. It uses the relationship between the DC components of corresponding blocks in two successive frames to hide watermark information. Although modifying DC components can be perceptually visible, it is more robust than using higher-frequency coefficients for embedding data. Since changes within a single frame of a video sequence may not be noticed, it is still quite suitable to embed information into these DC components. The experimental results show that the embedded watermark can be extracted even if the embedded video is compressed by MPEG.
References
1. F. Hartung and B. Girod, "Watermarking of Uncompressed and Compressed Video," Signal Processing, Vol. 66, No. 3, pp. 283-301, May 1998.
2. D. Kim and S. Park, "A Robust Video Watermarking Method," 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), Vol. 2, pp. 763-766, 2000.
3. C. Busch and W. Funk, "Digital Watermarking: From Concepts to Real-Time Video Applications," IEEE Computer Graphics and Applications, Vol. 19, Issue 1, pp. 25-35, Jan.-Feb. 1999.
4. T. Chung, M. Hung, Y. Oh, D. Shin and S.-H. Park, "Digital Watermarking for Copyright Protection of MPEG2 Compressed Video," IEEE Transactions on Consumer Electronics, Vol. 44, Issue 3, pp. 895-901, Jun. 1998.
5. C. Hsu and J. Wu, "DCT-based Watermarking for Video," IEEE Transactions on Consumer Electronics, Vol. 44, Issue 1, Feb. 1998.
6. F. Jordan, M. Kutter, and T. Ebrahimi, "Proposal of a Watermarking Technique for Hiding/Retrieving Data in Compressed and Uncompressed Video," ISO/IEC Doc. JTC1/SC29/WG11 MPEG97/M2281, July 1997.
7. J. Song and K. J. R. Liu, "A Data Embedding Scheme for H.263 Compatible Video Coding," Proceedings of the 1999 IEEE International Symposium on Circuits and Systems (ISCAS '99), Vol. 4, pp. 390-393, May-June 1999.
8. G. C. Langelaar, R. L. Lagendijk, and J. Biemond, "Real-time Labeling Methods for MPEG Compressed Video," 18th Symposium on Information Theory in the Benelux, Veldhoven, The Netherlands, May 1997.
9. G. C. Langelaar and R. L. Lagendijk, "Optimal Differential Energy Watermarking of DCT Encoded Images and Video," IEEE Transactions on Image Processing, Vol. 10, Issue 1, Jan. 2001.
Automatic Closed Caption Detection and Font Size Differentiation in MPEG Video*

Duan-Yu Chen, Ming-Ho Hsiao, and Suh-Yin Lee

Department of Computer Science and Information Engineering, National Chiao Tung University, 1001 Ta-Hsueh Rd., Hsinchu, Taiwan
{dychen, mhhsiao, sylee}@csie.nctu.edu.tw

* The research is partially supported by the Lee & MTI Center, National Chiao-Tung University, Taiwan, and the National Science Council, Taiwan.
Abstract. In this paper, a novel approach to automatic closed caption detection and font size differentiation among localized text regions in the I-frames of MPEG videos is proposed. The approach consists of five modules: video segmentation, shot selection, caption frame detection, caption localization and font size differentiation. Rather than examining scene cuts frame by frame, the video segmentation module first verifies video streams GOP by GOP and then finds the actual scene boundaries at the frame level. Tennis videos are selected as the case study, and the shot selection module is designed to automatically select a specific type of shot for further closed caption detection. Noise among potential captions is filtered out based on its long-term consistency over consecutive frames. Once the general closed captions are localized, the specific caption of interest is selected using the font size differentiation module. The detected closed captions can support video structuring, video browsing, high-level video indexing and video content description in MPEG-7. Experimental results show the effectiveness and feasibility of the proposed scheme.
1 Introduction

With the increasing amount of digital video in education, entertainment and other multimedia applications, there is an urgent demand for tools that give users an efficient way to acquire desired video data. Users therefore need a content-based mechanism to support efficient searching, browsing and retrieval. The need for content-based multimedia retrieval motivates research on feature extraction from text, image, audio and video information. Textual information, however, is more semantically meaningful and has attracted increasing research on text caption detection in video frames [1][3][6]. With video compression techniques maturing, many videos are stored in compressed form, and accordingly more and more research focuses on feature extraction from compressed videos, especially in MPEG format. Edge features are extracted directly from MPEG compressed videos to detect scene changes [5], and captions are processed and inserted into compressed video frames [7]. Features such as
chrominance, shape and frequency are directly extracted from MPEG videos to detect face regions [1][3]. In addition, the studies in [2][4] focus on closed caption detection in compressed videos, and large closed captions are their main concern. In general, however, the text captions appearing in shots of sports competitions are relatively very small, which makes it more difficult to detect and localize them. Therefore, in this paper we propose a novel approach to detect small text captions and also differentiate their font size for further applications. The approach consists of five components: GOP-based video segmentation, shot selection, caption frame detection, text caption localization and filtering, and font size differentiation. Our previous GOP-based video segmentation approach [8] is used to effectively segment the video. Furthermore, DCT DC-based shot selection is designed to identify specific shots. Caption frames are detected in the specific shots by computing the variation of the DCT AC energy in both the horizontal and vertical directions. In addition, we locate the text caption by the proposed method of weighted horizontal-vertical DCT AC coefficients and merge regions by morphological operations. To achieve a more robust text caption localization result, each candidate text caption is further verified by computing its long-term consistency, which is estimated over the backward shot, the forward shot and the shot itself. Once text captions are localized, we differentiate the font size of each text caption based on the variation of the DCT AC energy in the vertical direction. The rest of the paper is organized as follows. Section 2 presents an overview of the proposed scheme and Section 3 describes the GOP-based video segmentation. The proposed approach to text caption localization is illustrated in Section 4. Section 5 shows the experimental results, and the conclusion and future work are given in Section 6.
Fig. 1. Overview of the proposed scheme.
2 Overview of the Proposed Scheme

Fig. 1 shows the architecture of the proposed scheme. The test videos are compressed in MPEG-2 format and tennis is selected as the case study. First, video streams are segmented into shots by using our previous GOP-based
scene change detection approach [8]. To speed up video segmentation, the module first checks the video stream GOP by GOP, rather than examining scene cuts frame by frame, and then finds the actual scene boundaries at the frame level. The shots that contain the view of the tennis court are identified automatically from the variation of the DCT DC coefficients of I-frames and are selected for further caption detection. However, the closed captions, for example the scoreboard in a tennis-court clip, do not appear throughout the whole shot. Besides, the closed captions in tennis videos are generally very small. Therefore, we propose a mechanism to detect caption frames in the specific shots and also differentiate the font size for automatic text caption selection.
Fig. 2. GOP-based scene change detection.
3 Video Segmentation and Shot Selection

3.1 Scene Change Detection

Video data is segmented into clips that serve as logical units called "shots" or "scenes". Fig. 2 illustrates our proposed GOP-based scene change detection approach [8]. In the MPEG-2 format [9], the GOP layer is the random access point and contains a GOP header and a series of encoded pictures including I-, P- and B-frames. The size of a GOP is about 10 to 20 frames, which is less than the minimum duration between two consecutive scene changes (about 20 frames) [10]. In the approach, we first detect possible occurrences of scene changes GOP by GOP (inter-GOP). The difference between each consecutive GOP pair is computed by comparing the I-frames of the two GOPs. If the difference of the DC
coefficients between these two I-frames is larger than a threshold, there may be a scene change between the two GOPs; the GOP that contains the scene change frame is thus located. In the second step, intra-GOP scene change detection, we further use the ratio of forward and backward motion vectors to find the actual scene change frame within the GOP. The segmentation results obtained with this approach [8] are encouraging and show that the scene change detection is efficient for video segmentation.
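A minimal sketch of the inter-GOP step only, assuming the I-frame DC coefficients of each GOP are already extracted as arrays; the intra-GOP step based on the ratio of forward and backward motion vectors is not shown, and the function name is ours.

import numpy as np

def candidate_scene_change_gops(i_frame_dcs, threshold):
    """Step 1: flag GOP pairs whose I-frame DC images differ strongly.

    i_frame_dcs: list of DC-coefficient arrays, one per GOP (its I-frame).
    Returns indices g such that a scene change may lie between GOP g and g+1.
    """
    flagged = []
    for g in range(len(i_frame_dcs) - 1):
        diff = np.abs(i_frame_dcs[g + 1].astype(float) - i_frame_dcs[g]).sum()
        if diff > threshold:
            flagged.append(g)
    return flagged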
3.2 Shot Selection
Once the boundary of each shot is detected, the video sequence is segmented into shots consisting of advertisement clips, close-ups and tennis-court views. The tennis-court clips are our focus in the subsequent processing and analysis. Hence, a scene identification approach is proposed to recognize the clips of the tennis-court type. We observe that the variation of the intensity of a tennis-court frame is very small throughout the whole clip and that the intensity variance of consecutive frames is very similar. In contrast, the intensity of advertisements and close-ups varies significantly in each frame, and the difference of the intensity variance between two neighboring frames is relatively large. Therefore, the DC coefficients of each I-frame are extracted to represent the intensity values and are used to compute the intensity variance of the I-frames. In addition to the intensity variance of each I-frame, the variance of each shot is computed as the shot feature. The definitions of the frame variance and the shot variance are given in Eq. (1) and Eq. (2), where DC_{i,j} denotes the j-th block of the i-th frame and N represents the total number of blocks in a frame.
FVar^DC_{s,i} is the intensity variance of frame i in shot s, and the variance of shot s is expressed by SVar_s, where M is the total number of frames in shot s. The variation of the intensity variance of each I-frame in a video sequence from frame 0 to frame 1965 is exhibited in Fig. 3. In this video sequence there are four tennis-court clips (marked by the dotted ellipses), several close-up clips (marked by the dotted rectangles), and a final advertisement clip (marked by the dotted circle). From Fig. 3 we can see that the intensity variance of the tennis-court type is very small and very stable throughout the whole clip. Thus, a tennis-court clip can be identified and selected by the characteristic that its intensity variance is stable within each individual frame and stable over the whole shot.

FVar^DC_{s,i} = Σ_{j=1}^{N} DC_{i,j}^2 / N − ( Σ_{j=1}^{N} DC_{i,j} / N )^2    (1)

SVar_s = Σ_{i=1}^{M} (FVar^DC_{s,i})^2 / M − ( Σ_{i=1}^{M} FVar^DC_{s,i} / M )^2    (2)
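A direct rendering of Eqs. (1) and (2), assuming the DC coefficients of each I-frame are available as flat numpy arrays; a court-view shot is expected to yield small, stable values for both quantities. The thresholds in the usage comment are placeholders.

import numpy as np

def frame_dc_variance(dc_blocks):
    """FVar^DC: variance of the DC coefficients of one I-frame (Eq. 1)."""
    dc = np.asarray(dc_blocks, dtype=float)
    return (dc ** 2).mean() - dc.mean() ** 2

def shot_dc_variance(shot_iframes):
    """SVar: variance of the per-frame variances over one shot (Eq. 2)."""
    fvars = np.array([frame_dc_variance(f) for f in shot_iframes])
    return (fvars ** 2).mean() - fvars.mean() ** 2

# A shot is kept as a court view when both quantities stay small, e.g.:
# keep = all(frame_dc_variance(f) < frame_thr for f in shot) and \
#        shot_dc_variance(shot) < shot_thr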
Fig. 3. Variation of the I-frame DC value of a video sequence (frame 0 to frame 1965)
Fig. 4. The approach of text caption localization
4 Text Caption Localization

In this section, the scheme of text caption detection and font size differentiation is described. The diagram of caption localization is shown in Fig. 4. In general, text captions do not always appear in consecutive frames. Therefore, we propose a caption frame detection algorithm to detect frames that may contain captions. In the caption detection process, the DCT AC coefficients of I-frames in the MPEG-2 video
are extracted and used to compute the energy variation in the horizontal and vertical directions of each 8x8 block. Potential caption regions are indicated by the proposed weighted horizontal-vertical AC coefficients, and these regions are merged or removed by morphological operations. For more accurate text caption localization, the spatio-temporal relationship over consecutive frames is utilized: we compute the long-term consistency of each candidate caption region by referring to certain I-frames of the forward and backward shots. However, the localized text captions may contain the scoreboard, the logo of a channel or some billboard, and the scoreboard is what viewers are most interested in. Therefore, based on the observation that these different types of text captions differ in font size, we propose an algorithm to discriminate font size within the localized captions. The details of caption frame detection are described in Subsection 4.1, the approach to closed caption localization is shown in Subsection 4.2, and Subsection 4.3 presents the font size differentiation algorithm.
Fig. 5. Original frame divided into 6 sub-regions
4.1 Caption Frame Detection

Caption frame detection is a necessary step before text caption localization, because captions may disappear in some frames and then appear again subsequently. Therefore, we should first identify the frames in which captions might be present before detecting potential captions. In general, however, the font size of the scoreboard appearing in shots of sports competitions is very small. Hence, the variation of the AC energy of the entire frame cannot be used to measure the possibility of the presence of a caption that is relatively small in size. Therefore, each I-frame in the specific shots is divided into sub-regions, say 6, as shown in Fig. 5. The variation of the AC coefficients of each sub-region is measured by Eq. (3), where FVar^AC_{s,i} means the variance of the AC coefficients of the i-th frame in shot s, and AC_{h/v,j} are the horizontal AC coefficients from AC_{0,1} to AC_{0,7} and the vertical AC coefficients from AC_{1,0} to AC_{7,0}. The DCT AC coefficients used in the computation of the gradient energy are shown in Fig. 6.
FVar^AC_{s,i} = Σ Σ_{j=1}^{N} AC^2_{h/v,j} / N − ( Σ Σ_{j=1}^{N} AC_{h/v,j} / N )^2    (3)
The result of caption frame detection based on sub-regions is demonstrated in Fig. 7, and a threshold δ (3000) is predefined to decide whether a caption is present or not. We can see that the curve of the DCT AC variance of region 1 drops abruptly at the 18th I-frame and rises at the 39th I-frame, while the curves of regions 2 to 6 remain stable, since the scoreboard is absent from region 1 between the 18th and the 39th I-frame.

Fig. 6. DCT AC coefficients used in text caption detection (the DC term plus the first row AC_{0,1}..AC_{0,7} and the first column AC_{1,0}..AC_{7,0} of the 8x8 block)
Fig. 7. Demonstration of caption frame detection
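A minimal sketch of this sub-region test, assuming the first-row and first-column AC coefficients of every block have already been extracted from the bitstream; the threshold δ = 3000 follows the text and the helper names are ours.

import numpy as np

def subregion_ac_variance(ac_per_block):
    """Eq. (3): ac_per_block has shape (N, 14), one row per 8x8 block holding
    its first-row (AC_{0,1..7}) and first-column (AC_{1..7,0}) coefficients;
    N is the number of blocks in the sub-region."""
    ac = np.asarray(ac_per_block, dtype=float)
    n = ac.shape[0]
    return (ac ** 2).sum() / n - (ac.sum() / n) ** 2

def caption_subregions(subregions, delta=3000.0):
    """Return the indices of sub-regions whose AC variance exceeds delta,
    i.e. the sub-regions in which a (small) caption is likely present."""
    return [i for i, blocks in enumerate(subregions)
            if subregion_ac_variance(blocks) > delta]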
Fig. 8. Illustration of intermediate results for caption detection: (a) original frame; (b) closed caption detection; (c) filtering based on long-term consistency
4.2 Closed Caption Localization

Once the caption frames are detected, we can locate the potential caption regions by using the horizontal and vertical DCT AC coefficients individually to compute the gradient energy in the horizontal and vertical directions, respectively. We can observe that text captions generally appear in rectangular form and that the AC energy in the horizontal direction is larger than that in the vertical direction, since the distance between the letters of a word is fairly small while the distance between two rows of text is relatively large. Therefore, we assign more weight to the horizontal coefficients than to the vertical coefficients; the weight assignment is shown in Eq. (4). Here we select three I-frames (first, middle and last) of each shot for caption localization and set w_h to 0.7 and w_v to 0.3; the result of potential caption region detection is shown in Fig. 8(b).
E = (w_h H)^2 + (w_v V)^2,

H = Σ_{h1 ≤ h ≤ h2} AC_{0,h},  h1 = 1, h2 = 7,

V = Σ_{v1 ≤ v ≤ v2} AC_{v,0},  v1 = 1, v2 = 7.    (4)
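A sketch of Eq. (4) for a single 8x8 block of DCT coefficients, with w_h = 0.7 and w_v = 0.3 as in the text. Whether E additionally takes a square root is not clear from the printed formula, so the expression is kept exactly as shown; the function name is ours.

import numpy as np

def weighted_hv_energy(block, wh=0.7, wv=0.3):
    """Eq. (4): weighted horizontal/vertical AC energy of one 8x8 DCT block."""
    b = np.asarray(block, dtype=float)
    H = b[0, 1:8].sum()        # AC_{0,1} .. AC_{0,7}
    V = b[1:8, 0].sum()        # AC_{1,0} .. AC_{7,0}
    return (wh * H) ** 2 + (wv * V) ** 2

# Blocks whose energy E exceeds a threshold are marked as potential caption
# blocks and then cleaned up with the 1x5 morphological operator.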
In Fig. 8, the original frame is shown in Fig. 8(a), and in Fig. 8(b) we can see that although the scoreboard and the trademark in the upper part of the frame are both indicated, there are still some noisy regions. Therefore, we adopt a morphological operator of size 1x5 blocks to filter out some of the noise; afterwards the remaining caption regions are clustered and further verified by computing the long-term consistency. We select another two I-frames as consistency references: the last I-frame of the backward shot and the first I-frame of the forward shot. One possible measurement of the long-term coherence of potential regions is that if a potential caption region appears more than three times among the five I-frames, the region may be a real text caption. The result is demonstrated in Fig. 8(c).
Fig. 9. The sub-block B_sub-block is interpolated from its two neighboring 8x8 blocks B_t and B_b

4.3 Font Size Differentiation
From Fig. 8(c), we can see that the scoreboard in the upper left corner and the trademark in the upper right corner are both successfully detected. Since viewers are interested in the scores during the game, separating out the captions of the scoreboard is our concern. Hence, we propose an approach to automatically discriminate the font size as a support for scoreboard selection. To obtain the font size, we compute the gradient energy in the vertical direction instead of the horizontal direction, since the blank space between two text rows is generally larger than the blank space between two letters, and hence the variation of the gradient energy in the vertical direction presents a more regular pattern. In addition, to achieve a more accurate estimation of the periodicity, we compute the DCT coefficients of the 8x8 sub-block between two neighboring blocks; the sub-block is obtained by Eq. (5), and an example is shown in Fig. 9, where B_sub-block is the desired sub-block, B_t and B_b are the top and bottom 8x8 blocks neighboring B_sub-block, and I_w0 and I_w1 are identity matrices of dimension w0 x w0 and w1 x w1, respectively. Here we set both w0 and w1 to 8, which means that the sub-block B_sub-block is in the middle position between blocks B_t and B_b.
B_sub-block = [ 0  I_w0 ; 0  0 ] B_t + [ 0  0 ; I_w1  0 ] B_b    (5)
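The following is one pixel-domain reading of Eq. (5), in which the sub-block takes the bottom w0 rows of B_t and the top w1 rows of B_b; since the printed equation is garbled, the selector matrices and the choice w0 = w1 = 4 (which places the sub-block exactly midway between the two blocks) are assumptions.

import numpy as np

def middle_sub_block(bt, bb, w0=4, w1=4):
    """Build the sub-block that straddles two vertically adjacent 8x8 blocks,
    taking the bottom w0 rows of B_t and the top w1 rows of B_b."""
    upper = np.zeros((8, 8)); upper[:w0, 8 - w0:] = np.eye(w0)  # bottom w0 rows of B_t
    lower = np.zeros((8, 8)); lower[8 - w1:, :w1] = np.eye(w1)  # top w1 rows of B_b
    return upper @ bt + lower @ bb

bt = np.arange(64).reshape(8, 8)
bb = np.arange(64, 128).reshape(8, 8)
sub = middle_sub_block(bt, bb)   # rows 4-7 of bt followed by rows 0-3 of bb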
To discriminate the font size, we compute the periodicity and the variation of the AC energy in the vertical direction. The results of the font size analysis of the scoreboard and the trademark are demonstrated in Fig. 11 and Fig. 12, where T means the average distance and V represents the variance of T within each column. A local minimum of the AC energy curve is regarded as a low-textured region, i.e. the blank space between two text rows, and hence we compute the average distance T of the blank spaces by finding the intervals between local minima of the curve. Examples of the scoreboard and the trademark are shown in Fig. 10. We compute T for the first five block columns, because the first part of the localized scoreboard consists of five block columns and several non-text blocks separate the second part of the scoreboard. We select the part of the text region in which the height of the block columns is consistent for the font size computation; hence, in Fig. 10(b), all block columns of the trademark are selected. From Fig. 11 and Fig. 12, we can see that the average distance T of the scoreboard is about 2.2, which is smaller than the 2.9 of the trademark. Besides, the variance of the row distance of the blank spaces within each column is 0.05 for the scoreboard, which is also smaller than the 0.8 of the trademark. Hence, we can correctly discriminate the scoreboard, since its font size is smaller and more regular than that of the trademark.
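A minimal sketch of the T and V computation, assuming the per-column vertical AC energy curves of a localized caption are given; for simplicity the spacings are pooled over all columns before the variance is taken, which differs slightly from the per-column variance described above. The helper names are ours.

import numpy as np

def local_minima(curve):
    """Indices of strict local minima of a 1-D curve of AC energies."""
    c = np.asarray(curve, dtype=float)
    return [i for i in range(1, len(c) - 1) if c[i] < c[i - 1] and c[i] < c[i + 1]]

def font_size_statistics(columns):
    """Average spacing T between local minima and its variance V over the
    block columns of a localized caption (smaller, more regular spacing
    indicates the small-font scoreboard)."""
    spacings = []
    for col in columns:
        m = local_minima(col)
        spacings.extend(np.diff(m))
    spacings = np.asarray(spacings, dtype=float)
    if spacings.size == 0:
        return float("nan"), float("nan")
    return spacings.mean(), spacings.var()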
Fig. 10. Examples of the text captions: (a) scoreboard; (b) trademark
Fig. 11. Variation of AC energy of the scoreboard (T = 2.2, V = 0.05)
Fig. 12. Variation of AC energy of the trademark (T = 2.9, V = 0.8)
5 Experimental Results and Discussion

In the experiment, we use tennis videos recorded from the Star Sports TV channel and encode them in MPEG-2 format with the GOP structure IBBPBBPBBPBBPBB at 30 fps. The length of the test video is about 50 minutes, and 903 I-frames are caption frames. In total, 42183 blocks in the 903 caption frames contain text. The results of caption frame detection and text caption localization are evaluated in terms of precision and recall. The experimental result of caption frame detection is shown in Table 1: the precision and recall are both 100%, which shows that the proposed sub-region AC energy computation is effective for detecting small-font text captions. The result of text caption localization is shown in Table 2: 40635 text blocks are detected correctly, 347 blocks are falsely detected and 395 text blocks are missed. The precision is about 99% and the recall about 96%. The good performance lies in the techniques employed in the mechanism: the weighted horizontal-vertical AC coefficients, and the long-term consistency of the text caption over consecutive frames, which improves the accuracy of the detection results. Some text blocks are missed because the background of the text caption is transparent and changes with the scene while the camera moves; in this case, if the texture of the background is similar to the text caption, the letters of the caption do not produce a large variation in gradient energy and some text blocks are missed.

Table 1. Performance of caption frame detection

Ground truth of caption I-frames   Correctly detected frames   Falsely detected frames   Missed frames   Missed rate
903                                903                         0                         0               0%
Table 2. Performance of text caption localization after caption frame detection

Ground truth of text blocks   Correctly detected blocks   Falsely detected blocks   Missed blocks   Precision   Recall   Missed rate
42183 blocks                  40635                       347                       395             99%         96%      0.94%
6 Conclusion and Future Work
In this paper, we propose a novel approach to automatically select specific shots, detect caption frames, locate text captions and differentiate font size in MPEG compressed videos. The GOP-based video segmentation of our previous research is used to effectively segment the video into shots, and DCT DC-based shot selection is designed to identify specific shots. Caption frames are detected in the specific shots by computing the variation of the DCT AC energy in both the horizontal and vertical directions. Furthermore, we locate the text captions with the proposed weighted horizontal-vertical DCT AC coefficient scheme and region merging by morphological operations. To achieve a more accurate text caption localization result, we further verify each candidate text caption by computing its long-term consistency, estimated over the backward referencing shot, the forward referencing shot and the shot itself. Once text captions are localized, we differentiate the font size of each text caption based on the variation of the DCT AC energy in the vertical direction. In this way, we can automatically select the text captions that users care about, to support video browsing, video editing and video structuring. In the future, we will investigate video OCR to recognize the localized text captions, so as to support high-level feature extraction, semantic event detection and metadata generation for video content descriptions in MPEG-7.
References
1. H. Wang and S. F. Chang, "A Highly Efficient System for Automatic Face Region Detection in MPEG Video," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 4, Aug. 1997, pp. 615-628.
2. Y. Zhong, H. Zhang and A. K. Jain, "Automatic Caption Localization in Compressed Video," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 4, Apr. 2000, pp. 385-392.
3. H. Luo and A. Eleftheriadis, "On Face Detection in the Compressed Domain," Proc. of ACM Multimedia 2000, pp. 285-294.
4. Y. Zhang and T. S. Chua, "Detection of Text Captions in Compressed Domain Video," Proc. of ACM Multimedia Workshop, 2000, pp. 201-204.
5. S. W. Lee, Y. M. Kim and S. W. Choi, "Fast Scene Change Detection using Direct Feature Extraction from MPEG Compressed Videos," IEEE Transactions on Multimedia, Vol. 2, No. 4, Dec. 2000, pp. 240-254.
6. X. Chen and H. Zhang, "Text Area Detection from Video Frames," Proc. of 2nd IEEE Pacific Rim Conference on Multimedia, Oct. 2001, pp. 222-228.
7. J. Nang, O. Kwon and S. Hong, "Caption Processing for MPEG Video in MC-DCT Compressed Domain," Proc. of ACM Multimedia Workshop, 2000, pp. 211-214.
8. S. Y. Lee, J. L. Lian and D. Y. Chen, "Video Summary and Browsing Based on Story-Unit for Video-on-Demand Service," Proc. International Conference on ICICS, Oct. 2001.
9. J. L. Mitchell, W. B. Pennebaker, C. E. Fogg and D. J. LeGall, "MPEG Video Compression Standard," Chapman & Hall, NY, USA, 1997.
10. J. Meng, Y. Juan and S. F. Chang, "Scene Change Detection in a MPEG Compressed Video Sequence," Proc. IS&T/SPIE, Vol. 2419, 1995, pp. 14-25.
Motion Activity Based Shot Identification and Closed Caption Detection for Video Structuring*

Duan-Yu Chen, Shu-Jiuan Lin, and Suh-Yin Lee

Department of Computer Science and Information Engineering, National Chiao Tung University, 1001 Ta-Hsueh Rd., Hsinchu, Taiwan
{dychen, shujiuan, sylee}@csie.nctu.edu.tw

* The research is partially supported by the Lee & MTI Center, National Chiao-Tung University, Taiwan, and the National Science Council, Taiwan.
Abstract. In this paper, we propose a novel approach to generate a table of video content based on shot description by motion activity and closed captions in MPEG-2 video streams. Videos are segmented into shots by a GOP-based approach, and shot identification is used to classify the segmented shots. Specific shots of interest are selected, and the proposed closed caption detection approach is used to detect captions in these shots. To speed up scene change detection, instead of examining scene cuts frame by frame, the GOP-based approach first checks the video stream GOP by GOP and then finds the actual scene boundaries at the frame level. The segmented shots containing closed captions are identified by the proposed object-based motion activity descriptor. The SOM (Self-Organizing Map) algorithm is used to filter out noise in the caption localization process. Once captions are localized in the recognized shots, we create the table of video content based on the hierarchical structure of story units, consecutive shots and captioned frames. The experimental results show the effectiveness of the proposed approach and reveal the feasibility of the hierarchical structuring of video content.
1 Introduction
More and more video information in digital form is available around the world. The number of users and the amount of information are growing at a very rapid rate. Content-based indexing provides users with natural and friendly query, searching, browsing and retrieval. The need for content-based multimedia retrieval motivates research on feature extraction from the information contained in text, image, audio and video. However, textual information is the most semantically meaningful among the various types of information. Recently, increasing research has focused on feature extraction in the compressed video domain, especially in MPEG format, because many videos are already stored in compressed form due to the mature compression technology. Edge features are directly extracted from MPEG compressed videos to detect scene changes [5] and captions are processed and inserted into compressed video frames [7]. Features like chrominance and shape are directly extracted from MPEG videos to detect face
* The research is partially supported by the Lee & MTI Center, National Chiao-Tung University, Taiwan, and the National Science Council, Taiwan.
regions [1][3]. In the area of text caption detection, the detection of closed captions in large font size in compressed videos has been the research focus [2][4]. However, the font size of text captions appearing in shots of sports competitions is in general very small, which makes it more difficult to detect and localize the text captions. In addition, Lu and Tan [6] proposed a video structuring scheme, which classifies video shots by color features and global motion information. However, video shot classification based on object information would be more semantically meaningful. In order to support high-level semantic retrieval of video content, in this paper we propose a novel approach that structures videos utilizing closed captions and object-based motion activity descriptors. The mechanism consists of four components: GOP-based video segmentation, shot identification, closed caption detection and video structuring. The rest of the paper is organized as follows. Section 2 presents the overview of the proposed scheme. Section 3 shows the GOP-based scene change detection and Section 4 describes the motion activity based shot identification. The component of closed caption localization is introduced in Section 5. Section 6 illustrates the experimental results, and the conclusion and future work are given in Section 7.
Fig. 1. The architecture of motion activity based video structuring: MPEG-2 video streams are processed by GOP-based scene change detection, motion activity based scene identification, closed caption localization and SOM-based noise filtering to produce the table of video content.
2 Overview of the Proposed Scheme
Fig. 1 shows the mechanism of the proposed scheme. The testing videos are formatted in MPEG-2 and the sport of volleyball is selected as the case study. First, video streams are segmented into shots by using our previously proposed GOP-based scene change detection [8]. Instead of working frame by frame, this module of video segmentation checks video streams GOP by GOP and then finds the actual scene change boundaries at the frame level. The segmented shots are identified and described by the
MPEG-7 descriptor [7]. Thus the type of each shot of the volleyball videos can be recognized and encoded in the descriptor. The structure of volleyball videos consists of various types of shots, “service”, “full-court view” and “close-up” and the service shot is the leading shot in the volleyball competition. Thus, our focus is to recognize and select service shots to localize the closed caption to support video structuring. Therefore, the module of motion activity based shot identification is used to distinguish the types of shots and the specific shots of interest can be automatically recognized and selected for further analysis. Furthermore, the module of closed caption localization is designed based on SOM (Self-Organization Map) to localize the scoreboard, whose caption size is fairly small. The text in the localized closed caption of scoreboard is used to support video structuring. Finally, the table of video content is built by the key frames that contain the scoreboard and also by the semantic shots identified by the motion activity descriptor.
Fig. 2. GOP-based scene change detection. Step 1 (inter-GOP scene change detection) calculates the difference between each consecutive GOP pair and checks whether it exceeds a threshold; if so, Step 2 (intra-GOP scene change detection) finds the actual scene change frame within the GOP.
3 Scene Change Detection
Video data is segmented into meaningful clips that serve as logical units called "shots" or "scenes". Fig. 2 illustrates our proposed GOP-based scene change detection approach [8]. In the MPEG-2 format [9], the GOP layer is a random access point and contains a GOP header and a series of encoded pictures including I-, P- and B-frames. The size of a GOP is about 10 to 20 frames, which is usually less than the minimum duration between two consecutive scene changes (about 20 frames) [10]. We first detect possible occurrences of scene changes GOP by GOP (inter-GOP). The difference between each consecutive GOP pair is computed by comparing the I-
frames in each consecutive GOP pair. If the difference of the DC coefficients between these two I-frames is larger than the threshold, then there may be a scene change between these two GOPs. Hence, the GOP that contains the scene change frame is located. In the second step – intra-GOP scene change detection – we further use the ratio of forward and backward motion vectors to locate the actual frame of the scene change within a GOP. The experimental results on real long videos in [8] are encouraging and show that the scene change detection is efficient for video segmentation.
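As a rough illustration of the two-step procedure, the following Python sketch assumes the I-frame DC coefficients and per-frame motion-vector counts have already been extracted from the bitstream; the data layout, helper names and thresholds are our own assumptions, not the authors' implementation.

```python
# Minimal sketch of two-step GOP-based scene change detection.
# Each GOP is assumed to be a dict with an "i_frame_dc" array and a list of
# "frames" carrying forward/backward motion-vector counts; illustrative only.

def inter_gop_candidates(gops, dc_threshold):
    """Step 1: flag GOP pairs whose I-frame DC images differ strongly."""
    candidates = []
    for k in range(len(gops) - 1):
        dc_a, dc_b = gops[k]["i_frame_dc"], gops[k + 1]["i_frame_dc"]
        diff = sum(abs(a - b) for a, b in zip(dc_a, dc_b))
        if diff > dc_threshold:
            candidates.append(k + 1)          # scene change may lie in GOP k+1
    return candidates

def intra_gop_change_frame(gop, ratio_threshold=2.0):
    """Step 2: locate the frame where backward vectors dominate forward ones."""
    for idx, frame in enumerate(gop["frames"]):
        fwd = max(frame.get("forward_mv", 0), 1)   # avoid division by zero
        bwd = frame.get("backward_mv", 0)
        if bwd / fwd > ratio_threshold:
            return idx                             # candidate scene change frame
    return None

def detect_scene_changes(gops, dc_threshold=1000):
    cuts = []
    for g in inter_gop_candidates(gops, dc_threshold):
        frame_idx = intra_gop_change_frame(gops[g])
        if frame_idx is not None:
            cuts.append((g, frame_idx))
    return cuts
```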
4 Shot Identification
In this section, the approach of shot identification based on object motion activity is introduced. The method for detecting significant moving objects is illustrated in Subsection 4.1, and Subsection 4.2 shows the motion activity descriptor. Shot identification based on the descriptor is presented in Subsection 4.3.
4.1 Moving Object Detection
For computation efficiency, only the motion vectors of P-frames are used for object detection, since in a video at 30 fps consecutive P-frames separated by two or three B-frames are in general still similar and do not vary too much. Therefore, it is sufficient to use the motion information of P-frames only to detect moving objects. However, the motion vectors of P-frames or B-frames obtained via motion estimation in MPEG-2 may not exactly represent the actual motion in a frame. For a macroblock, a good match is found among its neighbors in the reference frame, but this does not mean that the macroblock matches exactly the correct position in its reference frame. Hence, in order to achieve more robust analysis, it is necessary to eliminate noisy motion vectors before the process of motion vector clustering. Motion vectors of relatively small or approximately zero magnitude are recognized as noise and hence are not taken into account; motion vectors with larger magnitude are more reliable. For low computational complexity, the average magnitude of the motion vectors of inter-coded macroblocks is computed and selected as the threshold to filter out motion vectors of smaller magnitude, i.e. noise. After noisy motion vectors are filtered out, motion vectors of similar magnitude and direction are clustered into the same group (the object) by applying a region growing approach. The details can be found in [7].
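The following Python sketch illustrates this noise filtering and region-growing grouping on a grid of macroblock motion vectors; the grid layout, tolerances and helper names are our own assumptions for illustration, not the method of [7].

```python
import math

# Sketch of motion-vector noise filtering and region growing over a
# macroblock grid; mv_grid[r][c] is an (mvx, mvy) pair or None for
# intra-coded blocks. Layout and tolerances are illustrative assumptions.

def magnitude(mv):
    return math.hypot(mv[0], mv[1])

def filter_noise(mv_grid):
    mags = [magnitude(mv) for row in mv_grid for mv in row if mv is not None]
    avg = sum(mags) / max(len(mags), 1)        # average magnitude as threshold
    return [[mv if mv is not None and magnitude(mv) > avg else None
             for mv in row] for row in mv_grid]

def grow_objects(mv_grid, mag_tol=4.0, ang_tol=0.5):
    """Group neighboring vectors of similar magnitude and direction."""
    rows, cols = len(mv_grid), len(mv_grid[0])
    labels = [[None] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mv_grid[r][c] is None or labels[r][c] is not None:
                continue
            stack, members = [(r, c)], []
            labels[r][c] = len(objects)
            while stack:
                y, x = stack.pop()
                members.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols and \
                       mv_grid[ny][nx] is not None and labels[ny][nx] is None:
                        a, b = mv_grid[y][x], mv_grid[ny][nx]
                        if abs(magnitude(a) - magnitude(b)) < mag_tol and \
                           abs(math.atan2(*a) - math.atan2(*b)) < ang_tol:
                            labels[ny][nx] = len(objects)
                            stack.append((ny, nx))
            objects.append(members)
    return objects
```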
4.2 Motion Activity Descriptor – 2D Histogram
2D-histogram is computed for each P-frame. The horizontal axis of the X-histogram (Y-histogram) is the quantized X-coordinate (Y-coordinate) in a P-frame. In the experiments, the X- and Y-coordinates are quantized into a and b bins according to the aspect ratio of the frame. The workflow of 2D-histogram generation is shown in Fig. 3. Initially, the object size is estimated before bin assignment. If the object size is
larger than the predefined unit size (frame size/(a·b)), the object is weighted and accumulated by Eq. (1). Bin^x_{i,j} denotes the j-th bin of the X-histogram in frame i, Acc^x_{i,j,α} denotes the accumulated value of object α in frame i for the X-histogram, and Obj is the number of objects in frame i.

Bin^x_{i,j} = Σ_{α=1}^{Obj} Acc^x_{i,j,α}
Acc^x_{i,j,α} = 1, if object size ≤ frame size/(a·b);
Acc^x_{i,j,α} = (size of object α)·(a·b)/frame size, otherwise.   (1)
By utilizing the statistics of the 2D-histogram, the spatial distribution of moving objects in each P-frame is characterized. In addition, spatial relationships within the moving objects are also approximately shown in the X-Y histogram pair, since each moving object is assigned to a histogram bin according to the X-Y coordinate of its center position. Objects belonging to the same coordinate interval are grouped into the same bins, and hence the distance between two object groups can be represented by the differences between the associated bins.
Fig. 3. The workflow of motion activity descriptor: for each moving object, the object size is compared with a predefined unit; objects are weighted accordingly and accumulated into the corresponding histogram bins, yielding the 2D-histogram.
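A small Python sketch of the weighted X-histogram accumulation of Eq. (1) follows; object sizes are in pixels and the object/frame data layout is our own assumption.

```python
# Sketch of the weighted X-histogram of Eq. (1). Each object is a dict with
# its center x-coordinate and size in pixels; a is the number of X bins.
# The data layout is an assumption for illustration.

def x_histogram(objects, frame_w, frame_h, a=15, b=15):
    frame_size = frame_w * frame_h
    unit_size = frame_size / (a * b)
    bins = [0.0] * a
    for obj in objects:
        j = min(int(obj["cx"] / frame_w * a), a - 1)    # bin of object center
        if obj["size"] <= unit_size:
            acc = 1.0                                    # small object counts once
        else:
            acc = obj["size"] * a * b / frame_size       # weight by relative size
        bins[j] += acc
    return bins

# Example: two small objects on the left, one large object on the right
objs = [{"cx": 40, "size": 500}, {"cx": 80, "size": 700}, {"cx": 600, "size": 9000}]
print(x_histogram(objs, frame_w=704, frame_h=480))
```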
4.3 Shot Identification Algorithm
The concept of the shot identification algorithm is shown in Fig. 4. We can see that the characteristic of the service shots is that one or few objects appear in the left or the right side of the frame and more objects appear in the other side of the frame. In the shot of full-court view, generally the number of objects of the left part and the number of objects of the right part are balanced and the difference of the number of objects
Motion Activity Based Shot Identification
293
between them is relatively smaller than that of a service shot. In the closed-up shots, there is a large object near the middle position of the frame. Therefore, based on this concept, we can distinguish these major shot types of volleyball videos. In the algorithm, we use only the X-histogram as the descriptor of each shot. The details of the algorithm are described as follows.
Fig. 4. Key frames of shots: (a) Service, (b) Full-court view, (c) Closed-up.
Shot Identification Algorithm
Input: Segmented shots {Shot_1, Shot_2, ..., Shot_s}
Output: Shot types {ST_1, ST_2, ..., ST_s}, where the type of shot i is ST_i ∈ {S, F, C} (S: Service, F: Full-court view, C: Closed-up)

1. The X-coordinate is divided into a = 15 bins.
2. Motion activity descriptor generation for each shot. If the size of Shot_s is greater than 12 P-frames, then generate two descriptors MD_1st and MD_2nd; else generate one descriptor MD_one:
   MD_1st = (2/|Shot_s|) Σ_{i=1}^{Mid} MD_i,   MD_2nd = (2/|Shot_s|) Σ_{i=Mid+1}^{|Shot_s|} MD_i,   Mid = |Shot_s|/2,
   MD_one = (1/|Shot_s|) Σ_{i=1}^{|Shot_s|} MD_i,
   where MD_i is the feature vector (Bin^x_{i,1}, Bin^x_{i,2}, ..., Bin^x_{i,j}) of Subsection 4.2. MD_1st is the first descriptor and MD_2nd is the second descriptor of Shot_s.
3. Compute the maximum bin value (MBV) and its corresponding bin (MB). Find the bins (LB) whose values are greater than the defined threshold Γ:
   If 0 < MBV < 3, then Γ = MBV;
   Else if 3 ≤ MBV < 5, then Γ = MBV − 1;
   Else if 5 ≤ MBV < 10, then Γ = 5;
   Else if 10 ≤ MBV < 20, then Γ = 10;
   Else if 20 ≤ MBV < 25, then Γ = MBV − 10;
   Else if 25 ≤ MBV < 30, then Γ = MBV − 15;
   Else if 30 ≤ MBV < 40, then Γ = MBV − 17;
   Else if 40 ≤ MBV < 50, then Γ = MBV − 20;
   Else (50 ≤ MBV), Γ = MBV − 25.
4. Prescription: if the number of bins LB in a shot is greater than half the number of bins (HNB), then the shot may belong to type F or S; otherwise, the shot may belong to type C.
   Left bins of the descriptor: Bin_0 to Bin_6; medium bin: Bin_7; right bins of the descriptor: Bin_8 to Bin_14.
   LBL: LB ∈ [Bin_0, Bin_6]; LBR: LB ∈ [Bin_8, Bin_14];
   MBVL: MBV of LBL; MBVR: MBV of LBR;
   MBL: MB ∈ [Bin_0, Bin_6]; MBR: MB ∈ [Bin_8, Bin_14].
5. If there is one descriptor only:
   Case 1 (number of LB ≥ HNB): if MBR − MBL < HNB, then shot ∈ type F; else shot ∈ type C.
   Case 2 (number of LB < HNB): if the number of LBR or the number of LBL equals 0, then shot ∈ type C; else shot ∈ type S.
6. If there are two descriptors:
   If the MBs of MD_1st and MD_2nd are near the medium bin (i.e. each MB ∈ [Bin_6, Bin_8]):
     Case 1: if the numbers of LB of MD_1st and of MD_2nd are both smaller than HNB, then shot ∈ type C.
     Case 2: if the number of LB of MD_1st or of MD_2nd is greater than HNB, then shot ∈ type F; else shot ∈ type S.
   Else:
     Case 1: if the numbers of LB of MD_1st and of MD_2nd are both smaller than HNB: if LBL_1st + LBR_1st ≥ 3 or LBL_2nd + LBR_2nd ≥ 3, then shot ∈ type S; else shot ∈ type C.
     Case 2: if the number of LB of MD_1st or of MD_2nd is greater than HNB: if one of the two MBs is close to the medium bin, then shot ∈ type C; else shot ∈ type S.
7. Generate the type of each shot.
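As a rough illustration of steps 3-5, here is a Python sketch of the threshold Γ lookup and the single-descriptor classification; the helper names and the way the strongest left/right bins are picked are our own reading, and the two-descriptor case is omitted.

```python
# Rough sketch of steps 3-5 for the single-descriptor case of the shot
# identification algorithm; names are illustrative and the two-descriptor
# case is omitted.

def gamma_threshold(mbv):
    """Threshold Γ as a function of the maximum bin value (step 3)."""
    if mbv < 3:
        return mbv
    if mbv < 5:
        return mbv - 1
    if mbv < 10:
        return 5
    if mbv < 20:
        return 10
    if mbv < 25:
        return mbv - 10
    if mbv < 30:
        return mbv - 15
    if mbv < 40:
        return mbv - 17
    if mbv < 50:
        return mbv - 20
    return mbv - 25

def classify_one_descriptor(md, a=15):
    """md is the averaged X-histogram (a bins); returns 'S', 'F' or 'C'."""
    hnb = a // 2                                     # half the number of bins
    mbv = max(md)
    gamma = gamma_threshold(mbv)
    lb = [j for j, v in enumerate(md) if v > gamma]  # bins above Γ
    lbl = [j for j in lb if j <= 6]                  # left bins Bin0..Bin6
    lbr = [j for j in lb if j >= 8]                  # right bins Bin8..Bin14
    if len(lb) >= hnb:                               # case 1
        mbl = max(range(0, 7), key=lambda j: md[j])  # strongest left bin
        mbr = max(range(8, a), key=lambda j: md[j])  # strongest right bin
        return "F" if mbr - mbl < hnb else "C"
    return "C" if not lbl or not lbr else "S"        # case 2
```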
5 Closed Caption Localization
Fig. 5 shows the proposed scheme of closed caption localization in frames. First, we compute the horizontal gradient energy from the DCT AC coefficients to filter out some noise. The next step is to remove some noisy regions by a morphological operation. Once the candidate caption regions are detected, we utilize the SOM-based algorithm to filter out non-caption regions. The details of the closed caption detection are described in Subsection 5.1 and the algorithm of SOM-based filtering is given in Subsection 5.2.
Fig. 5. The approach of closed caption localization in frames: horizontal gradient energy filtering, followed by a morphological operation and SOM-based candidate caption filtering.
5.1 Closed Caption Detection
Once service shots are identified, we apply the proposed closed caption detection to further localize the closed captions in these shots, such as the scoreboard and the channel trademark. We use the DCT AC coefficients shown in Fig. 6 to compute the horizontal and vertical gradient energy. The horizontal AC coefficients from AC_{0,1} to AC_{0,7} are used to compute the horizontal gradient energy by Eq. (2). The horizontal gradient energy E_h of each 8x8 block is the first filter for noise elimination: if E_h of a block is greater than a predefined threshold, the block is regarded as a potential caption block; otherwise, if E_h of a block is smaller than the threshold, the block is removed.

E_h = Σ_{j=1}^{7} AC_{0,j}   (2)
However, different shots may have different lighting conditions, which is reflected in the contrast of the frames, even over the whole shot. Different contrast in turn affects the choice of the threshold, and closed caption detection might fail for this reason. Therefore, we adopt an adaptive threshold decision to overcome this problem. The threshold T is computed by Eq. (3), where γ is an adjustable factor, SVar_s represents the average horizontal gradient energy of shot s, FVar^{AC}_{s,i} means the horizontal gradient energy of frame i in shot s, and AC_h denotes a horizontal DCT AC coefficient from AC_{0,1} to AC_{0,7}. A higher value of FVar^{AC}_{s,i} means a higher contrast of frame i, and we can thus remove noisy regions more easily in a frame of higher gradient energy. Therefore, we set a lower weight for frames with higher contrast and a higher weight for frames with lower contrast. In this way, we can remove most of the noisy regions; an example is demonstrated in Fig. 7(b).

T = γ × SVar_s,  where γ = 3.2 if FVar^{AC}_{s,i} < SVar_s, and γ = 2.4 if FVar^{AC}_{s,i} ≥ SVar_s   (3)

SVar_s = (1/M) Σ_{i=1}^{M} FVar^{AC}_{s,i}

FVar^{AC}_{s,i} = ΣΣ_{j=1}^{N} AC_{h,j}^2 / N − ( ΣΣ_{j=1}^{N} AC_{h,j} / N )^2   (the outer sum runs over the blocks of frame i)
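A minimal Python sketch of the block energy of Eq. (2) and the adaptive threshold of Eq. (3) follows; the input layout (a list of frames, each a list of blocks holding the first-row AC coefficients) and the variance computation are our own assumptions.

```python
# Sketch of Eq. (2) and Eq. (3): horizontal gradient energy per 8x8 block and
# the adaptive threshold per frame. frame_blocks[b] is assumed to hold the
# coefficients AC_{0,1}..AC_{0,7} of block b; illustrative only.

def block_energy(ac_row):
    return sum(abs(c) for c in ac_row)            # E_h of one block, Eq. (2)

def frame_variance(blocks):
    vals = [c for blk in blocks for c in blk]     # horizontal AC coefficients
    n = len(vals) or 1
    mean = sum(vals) / n
    return sum(v * v for v in vals) / n - mean * mean    # FVar-like variance

def adaptive_threshold(blocks, shot_avg):
    fvar = frame_variance(blocks)
    gamma = 3.2 if fvar < shot_avg else 2.4       # Eq. (3)
    return gamma * shot_avg

def candidate_blocks(frame_blocks, shot_frames):
    shot_avg = sum(frame_variance(f) for f in shot_frames) / len(shot_frames)
    t = adaptive_threshold(frame_blocks, shot_avg)
    return [i for i, blk in enumerate(frame_blocks) if block_energy(blk) > t]
```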
Fig. 6. DCT AC coefficients used in text caption detection: the DC term, the first-row coefficients AC_{0,1}–AC_{0,7} (horizontal) and the first-column coefficients AC_{1,0}–AC_{7,0} (vertical).
After eliminating most of the noisy regions, there still have many small separated regions in which they are either very close or faraway. Some regions are supposed to be connected, like the scoreboard and the channel trademark. Hence, we need to perform the task of regions merging and remove some isolated ones. Therefore, a morphological operator 1x3 blocks is used to merge the regions that the distance in between is smaller than 3 blocks and furthermore the regions of size smaller than 3 blocks are eliminated. The result of applying morphological operation is shown in Fig. 7(c) and we can see that many small and isolated regions are filtered out and the caption regions are merged together. However, some background regions that have large horizontal gradient energy are still present after morphological operation. Hence, we propose an algorithm that is based on the concept of SOM (SelfOrganization Map) [11] to further differentiate the foreground captions and background high textured regions. 5.2 SOM-Based Noise Filtering SOM-Based Noise Filtering Algorithm Input: Candidate regions after morphological operation Ψ = { R1 , R2 , … , Rn }
Motion Activity Based Shot Identification
297
Output: Closed caption regions 1. 2.
Initially, set threshold T = 70 and cluster number j=0. For each candidate region Ri , compute the average horizontal-vertical gradient energy
Ei that is weighted by wh and wv . Here we set wh to 0.6 and wv to
0.4. n is the number of regions in Ψ. 7 7 1 n wh ∑ AC0,u + wv ∑ AC v ,0 ∑ n j =1 u =1 v =1 3. For each region Ri ∈ Ψ
Ei =
If i = 1, j=j+1, assign
Ri to cluster C j
Else if there is a cluster C such that where k ∈ [1,j] and assign
Dk =
Dk
T and
Dk is minimal among { Dk },
Dk is defined in Eq. (5)
Ri to C Ck Ck 2 ∑ ∑ Ei − E j C k ( C k − 1) i =1 j =i +1
Else j=j+1, create a new cluster 4.
(4)
Set T = T – 11 Select the cluster
(5)
C j and assign Ri to C j
Ck (say Chigh ) that has the largest average gradient energy
E avg ,k computed by Eq. (6)
E avg ,k 5.
1 = Ck
Ck
∑E i =1
(6)
i
If the gradient energy
E avg ,k of Chigh is greater than T, then reset Ψ = Chigh .
Go to step 3. Else 6.
Go to step 6. The cluster Chigh is the set of closed captions.
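A compact Python sketch of this iterative clustering follows; it assumes the per-region energies E_i have already been computed and uses simple lists for clusters. It is an illustrative reading of the algorithm, not the authors' implementation, and a small guard is added to ensure termination.

```python
# Sketch of the SOM-style noise filtering: cluster regions by their gradient
# energies, lower the threshold each pass, and keep the highest-energy
# cluster. Assumes `energies` maps region id -> E_i from Eq. (4).

def pair_distance(cluster, energies):
    """Average pairwise energy difference within a cluster (Eq. (5))."""
    es = [energies[r] for r in cluster]
    if len(es) < 2:
        return 0.0
    total = sum(abs(a - b) for i, a in enumerate(es) for b in es[i + 1:])
    return 2.0 * total / (len(es) * (len(es) - 1))

def filter_captions(energies, t=70.0, t_step=11.0, max_iter=10):
    regions = list(energies)
    for _ in range(max_iter):                  # guard added to ensure termination
        clusters = []
        for r in regions:
            best = None
            for c in clusters:
                d = pair_distance(c + [r], energies)
                if d <= t and (best is None or d < best[1]):
                    best = (c, d)
            if best is not None:
                best[0].append(r)
            else:
                clusters.append([r])
        t -= t_step
        avg = lambda c: sum(energies[r] for r in c) / len(c)   # Eq. (6)
        high = max(clusters, key=avg)
        if avg(high) <= t or set(high) == set(regions):
            return high                        # highest-energy cluster = captions
        regions = high                         # otherwise iterate on it
    return regions
```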
In the algorithm, we give more weight to the horizontal DCT AC coefficients because closed captions generally appear in rectangular form, and the AC energy in the horizontal direction is larger than that in the vertical direction, since the letters of each word are fairly close together while the distance between two rows of text is relatively large. Furthermore, the SOM-based candidate region clustering is iterated until the gradient energy E_avg,k of the cluster C_high is smaller than the threshold T. Based on the experiments, T is initially set to 70 and is decreased to T − 11 in step 4. By this method, we can automatically find the set of closed captions; the method relies on the fact that closed captions are foreground elements
added after filming. Therefore, the added closed captions are clearer than the background and have larger gradient energy. After the step of SOM-based noise filtering, each closed caption region is dilated by one block row. The result is shown in Fig. 7(e), where we can see that regions belonging to the same closed caption are merged.
Fig. 7. Demonstration of the closed caption localization (a) Original I-frame (b) Result after filtering by horizontal gradient energy (c) Result after morphological operation (d) Result after filtering by SOM-based algorithm (e) Result after dilation.
6 Experimental Results and Analysis
In the experiment, we recorded volleyball videos from the TV channel VL Sports and encoded them in the MPEG-2 format, in which the GOP structure is IBBPBBPBBPBBPBB and the frame rate is 30 fps. The length of the video is about one hour and we obtain 163 shots of service, full-court view competition and closed-up. To measure the performance of the proposed scheme, we evaluate precision and recall for the approach of shot identification and the algorithm of closed caption detection. Table 1 shows the experimental result of the shot identification; we can see that the precision for all three kinds of shots is at least 92%. Moreover, the recall value
of the closed-up type is up to 98%. The recall value of the full-court type is only 87%, because the camera zooms in to capture scenes where players spike near the net; in this case the scene consists of a large portion of the net and is regarded as a closed-up shot. Although the recall value of the full-court shots is not higher than 90%, the overall accuracy of shot identification is still very good.

Table 1. Result of shot identification

Shot type    Ground Truth   Detections   Correct   False   Missed   Precision   Recall
Closed-up    58             62           57        5       1        92%         98%
Service      53             52           49        3       4        94%         92%
Full Court   52             49           45        4       7        92%         87%
In Table 2, the result of closed caption localization is presented. There are 98 closed captions containing the scoreboard and the trademark in the testing video, and 107 potential captions are detected, of which 98 localized regions are real closed captions. We can see that the recall value is 100% and the precision is about 92%. The number of false detections is 9; this is because the background may contain advertisement regions whose gradient energy is relatively high compared with the scoreboard and the channel trademark. In that case, such a region is assigned to the same cluster as the closed captions since its gradient energy is very similar to the energy of the scoreboard.

Table 2. Result of closed caption localization

Ground Truth   Detections   Correct   Precision   Recall
98             107          98        91.59%      100%
The system interface for video structuring is shown in Fig. 8 and Fig. 9. Fig. 8 shows the user interface and Fig. 9 presents all shots of full-court view when the user clicks the option “show all shots” in the “F shot” field. The system interface is composed based on the video structure, which is organized in the temporal order of the key frames of closed caption, service, full-court view and closed-up. In addition, the caption frame is the leading frame, so that the scoreboard is shown to users first. In this way, users can browse the video content efficiently by looking up the score of the competition and selecting the shots that they are interested in.
7 Conclusion and Future Work
In this paper, we propose a novel approach to generate the table of video content based on shot description by motion activity and closed captions in MPEG-2 compressed video. We adopt the approach of GOP-based video segmentation to effectively segment videos into shots, and these shots are described and identified by the object-based motion activity descriptor. Experimental results show that the proposed scheme performs well in recognizing several kinds of shots of volleyball videos. In addition, we design an algorithm to localize the closed captions in the identified specific shots and to effectively filter out non-caption regions. Through the user interface, based on the table of content created from closed captions and semantically meaningful shots, we support users in browsing videos more clearly and efficiently.
Fig. 8. Video structure of caption frames, shots of service, full-court view, and closed-up
Fig. 9. The interface shows other shots of full-court view in the bottom area
In the future, we will investigate video OCR to recognize the localized closed captions to support automatic metadata generation, such as the name of a team in sports videos, the name of the leading character in movies, or an important person in other kinds of videos. In addition, we can also support semantic event detection and description for automatic descriptor and description scheme production in MPEG-7.
References
1. H. Wang and S. F. Chang, “A Highly Efficient System for Automatic Face Region Detection in MPEG Video,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 4, Aug. 1997.
2. Y. Zhong, H. Zhang and A. K. Jain, “Automatic Caption Localization in Compressed Video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 4, Apr. 2000.
3. H. Luo and A. Eleftheriadis, “On Face Detection in the Compressed Domain,” Proc. ACM Multimedia 2000, pp. 285-294, 2000.
4. Y. Zhang and T. S. Chua, “Detection of Text Captions in Compressed Domain Video,” Proc. ACM Multimedia Workshop, CA, USA, pp. 201-204, 2000.
5. S. W. Lee, Y. M. Kim and S. W. Choi, “Fast Scene Change Detection using Direct Feature Extraction from MPEG Compressed Videos,” IEEE Transactions on Multimedia, Vol. 2, No. 4, Dec. 2000.
6. H. Lu and Y. P. Tan, “Sports Video Analysis and Structuring,” Proc. IEEE 4th Workshop on Multimedia Signal Processing, pp. 45-50, 2001.
7. D. Y. Chen and S. Y. Lee, “Object-Based Motion Activity Description in MPEG-7 for MPEG Compressed Video,” Proc. of the 5th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2001), Vol. 6, pp. 252-255, July 2001.
8. S. Y. Lee, J. L. Lian and D. Y. Chen, “Video Summary and Browsing Based on Story-Unit for Video-on-Demand Service,” Proc. International Conference on ICICS, Oct. 2001.
9. J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall, MPEG Video Compression Standard, Chapman & Hall, NY, USA, 1997.
10. J. Meng, Y. Juan and S. F. Chang, “Scene Change Detection in a MPEG Compressed Video Sequence,” Proc. IS&T/SPIE, Vol. 2419, pp. 14-25, 1995.
11. T. Kohonen, “The Self-Organizing Map,” Neurocomputing, Vol. 21, pp. 1-6, 1998.
Visualizing the Construction of Generic Bills of Material Peter Y. Wu, Kai A. Olsen, and Per Saetre University of Pittsburgh Molde, Norway
Abstract. Assemble-to-order production industries today must be customer-oriented, and offer a large variety of product options to compete in the global market place. A Generic Bill of Material (GBOM) is designed to describe the component structures for a family of products in one data model. The specific Bill of Material for any particular product variant can then be generated on demand. Several different approaches to the GBOM model have been demonstrated to be reasonably effective, but the construction and verification of GBOMs remain difficult. This paper formulates a set of principle requirements for the GBOM model to support visualization and manipulation, aimed at ease of composition and editing of GBOM models. Based on this framework, the different approaches to the GBOM model are reviewed. A new GBOM model is presented, and the GBOM system briefly introduced. The visual environment for the construction of GBOM models is discussed.
1. Introduction
In the past decade, the challenge of global competition has forced the production industry in discrete assembly to become much more customer-oriented. Product variety has become one of the keys to gaining market share: automobile and other assemble-to-order industries face the need to manage myriads of product variants [1][2]. To manage the myriads of product variants and constantly changing product options, there have been several research efforts to define a generic Bill of Material (GBOM), for example [3][4][5][7]. The GBOM is a data model designed to describe the component structures for a family of products, and to generate the specific Bill of Material for any particular product variant on demand. Quite a few GBOM models were developed with a focus on solving problems in various aspects of production management. Some focused on materials planning [3], others on flexibility and ease of specification for the customer [4]. Yet the construction of a complete GBOM model remains relatively difficult. The need for easy construction of the structure becomes more pronounced when today’s market place demands even faster response to changes in customer preferences. Good visualization tools for GBOM construction and verification are simply rare, if not non-existent. We studied various GBOM models disseminated in the literature, and report in this paper a set of principle requirements for models to support ease of construction and verification. We will present these requirements in the next section. Based on the framework of these requirements, we review the various GBOM models. These principle requirements are also the key to facilitating visualization and direct manipulation of GBOM models. We then present the design of our GBOM model, and briefly sketch the graphical editor used in our GBOM system.
2. The Principle Requirements
In this section, we first present a set of principle requirements concerning the GBOM model as the idealized goals for a GBOM system. These principle requirements set a framework to review the related GBOM approaches disseminated in the literature, and distinguish the model put into use for visualization in our system. The principle requirements are the following:
(1) Genericity: The GBOM model must be able to describe similar components as variants of a generic component. That is, the GBOM structure should emphasize the commonality of the components, and should be able to generate any specific variants on demand. This will simplify the management and maintenance of BOM structures.
(2) Specification Support: The GBOM model must support a specification process to allow the user to arrive at any legal product variant of the user’s choice. The process may run on a platform to support an execution driven by the GBOM model. The specific BOM generated should be immediately usable in other parts of the information system (such as production control), and should support data exchange along the appropriate supply chain (such as purchasing).
(3) Child Independence: A GBOM describes a set of product or component variants that another GBOM may include as one of its components. The principle of abstraction requires that different variants must be independent of the products where they are included. This allows different product GBOMs to utilize the same component structures.
(4) Parental Restriction: A product GBOM including a component (GBOM) should be able to restrict the legal choices of the component variants. This requirement is actually in concert with the child independence requirement in (3) above so that product and component GBOMs may work together. In fact, (3) and (4) together allow the GBOM model to support cutting and pasting when presented visually on the screen for direct manipulation.
(5) Iteration Support: The GBOM model should support an iteration process to generate all acceptable variants matching some given criteria of requirements about the product. The process may run on a platform to support the execution driven by the GBOM model with the given criteria. Iteration support allows the partially specified GBOM to be visually presented and manipulated in a graphical environment.
(6) Composition Support: Composing a new GBOM is not at all straightforward. While it is desirable to define similar components as variants of one generic component, too much variation can make the GBOM too intricate. Product engineers may even want to remove unnecessary differences in products and components through re-design, in order to achieve simplicity in the GBOM. Creation and maintenance of GBOM models in actual use is necessarily a long-term effort that requires many iterations. The GBOM model must support the process of composing and editing of GBOM models in the visualization system.
3. Review of GBOM Models
In production process planning, the traditional approach to handling product variants is the use of planning modules [3]. Each attribute with variant attribute values controls a planning module. When each of these attributes is given an acceptable value, the planning modules will then determine the implied requirements for the included components. VanVeen and Wortmann analyzed the conditions assumed in the modular approach, and showed that these conditions may not be valid in an environment with many complex products [5]. The planning modules exhibit certain genericity, but the model does not fulfill the principle requirements.
Schonsleben introduced the “Variant Generator” – a data model to present all variants of each component in one Bill of Material [6]. A set of mutually independent parameters represents all the attributes which may take variant values. We specify a product variant by assigning the proper values to these parameters. In practice, these parameters are rarely mutually independent, and there is no mechanism to help avoid improper combinations of these parameter values. Therefore, the Variant Generator does not have specification support, nor does it support parental restriction.
VanVeen improved the Variant Generator and introduced the GBOM concept [7]. The GBOM model allows the parents to restrict the variability of each child, with a complex definition of the inclusion relationship between components. An elaborate conversion function is needed to determine which child component is selected for each variant of the parent component. While the specification process is difficult, the major weakness is the need to specify the complete set of parameter values in advance to determine the product variant before the GBOM is even exploded. The approach does not offer good specification support.
Hegge and Wortmann introduced a set of inheritance rules to the parent-child relationship in the Variant Generator [8]. These rules implicitly make the choice for each parent variant to select the child variant. The inclusion of a component becomes implicit in the model, and the inheritance rules have complicated semantics. The approach shares the same limitation as the one proposed by vanVeen.
Bottema and van der Tang described a product configuration system in which the user makes choices of variant features during the explosion process [9]. However, the system is not able to validate these choices until the end of the process. There is no explicit GBOM model to demonstrate genericity, and the specification support is insufficient.
Chung and Fischer designed an object-oriented GBOM model [10][11], utilizing the subclass construct to represent the inclusion relationships, so that included components may inherit the features of their parents for the determination of component variants. However, components inheriting features of the parent would violate the child independence requirement, and would not allow the same components to share the same GBOM across different products.
Olsen and Saetre presented a procedural approach and described a GBOM model that supports an elaborate product specification process [4][12]. The procedural programming language works out a highly practicable GBOM model, with composition support. However, for implementation convenience, the programming language approach must balance between the requirements of child independence and parental restriction, allowing a component to impose a constraint on another
component. The drawback is a model with difficulty in supporting visualization for manipulation.
Bertrand et al. developed the “hierarchical pseudo Bill of Material” – a GBOM model which supports both parental restriction and child independence [13]. The focus was on material planning and production scheduling, and the model did not demonstrate iteration and composition support.
Our approach primarily builds on these ideas but we also attempt to make our GBOM model visual and declarative. We maintain all the principle requirements as aforementioned for the GBOM model. For instance, we must enforce the child independence requirement in the model to support cutting and pasting in the graphical presentation of GBOM models. We also enforce the parental restriction requirement, so that a component will not be allowed to impose a constraint on another component unless it is included as a sub-component. If there is a necessary constraint between two different components, it must be specified at a higher level in the GBOM structure, referring to both components as sub-components. To validate a user choice of value for a variant attribute in a component, the specification process must then traverse up and down the GBOM structure to check on all possible constraints. When we can enforce the discipline in the GBOM model to meet the stated requirements, we believe that model-driven software will bring a new level of flexibility and ease of construction and verification of GBOM, similar to other graphical definition tools for object modeling [14][15][16].
4. The GBOM Model
Figure 1 depicts the structure of our GBOM model. In the GBOM model, a GBOM consists of two main parts: the HEAD and the BODY. The HEAD has a descriptive name, and all the attributes and the corresponding permissible values for each attribute. The set of attributes always includes the Part Number, the value of which uniquely identifies the GBOM model. The HEAD also specifies the constraints applied to the attribute values, so that only those attribute values that satisfy all the constraints are acceptable in some product variant described by the GBOM. The BODY of the GBOM model specifies the inclusion of components. There may be a condition specified for the inclusion of a component: the component is included only when the specified condition is met. If no condition is specified, the component is always included. In addition to the condition for inclusion, there is also a set of restrictions the product may impose on the component to be included. These restrictions apply to the attributes and the corresponding permissible values for the component GBOM. Both the condition and the restrictions may refer to attribute values of other components. This allows the specification of cross-component constraints applying to two different components. The GBOM model is also designed for visual display in a graphical editor, to support easy composing and editing. In a graphical editor, the user can choose to display different levels of detail in the GBOM models, and optionally explode any particular inclusion. When the GBOM model satisfies all the stated requirements, the graphical editor also supports easy cutting and pasting of GBOM models visually presented to the user.
Fig. 1. Structure of the GBOM model: a product GBOM consists of a HEAD (descriptive name, Part Number, attributes with permissible values, constraints) and a BODY that includes component GBOMs, each with an optional inclusion condition and a set of restrictions.
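As an illustration of this structure, a minimal Python sketch of the HEAD/BODY layout follows; the class and field names are our own assumptions, not part of the paper's model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Minimal sketch of the HEAD/BODY structure of a GBOM; names are illustrative.

@dataclass
class Inclusion:
    component: "GBOM"
    condition: Optional[Callable[[Dict[str, str]], bool]] = None  # include when true
    restrictions: Dict[str, List[str]] = field(default_factory=dict)  # parental restriction

@dataclass
class GBOM:
    name: str
    part_number: str
    attributes: Dict[str, List[str]]              # attribute -> permissible values (HEAD)
    constraints: List[Callable[[Dict[str, str]], bool]] = field(default_factory=list)
    body: List[Inclusion] = field(default_factory=list)   # component inclusions (BODY)

    def legal(self, choice: Dict[str, str]) -> bool:
        """Check a variant choice against permissible values and constraints."""
        return all(choice.get(a) in vals for a, vals in self.attributes.items()) \
            and all(c(choice) for c in self.constraints)
```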
Figure 2 on the following page illustrates a simple example of the GBOM model of a stool, and how it is related to the GBOM components of seat, base, cushion, chipboard, and different seat covers. The stool is part number 200. The color can be red or blue, and the material can be wood or aluminum. It consists of two components: the seat and the base. The seat is part number 400, and there are three different colors for the seat: red, blue, and white, but the stool will restrict the choice of colors for the seat to that of the stool: red or blue. The seat consists of the cushion, the chipboard, and one of three kinds of seat covers. The color of the seat determines which seat cover to include, which is specified in the condition for inclusion. Incidentally, the different seat covers identified by different part numbers can of course be represented in one GBOM, with the choice of different colors. The stool also includes a base. The base is of three different types: wood, aluminum, and plastic, and also three different heights. But the stool GBOM will only include the base of the type specified in the material of the stool, and the height of 60.
5. The Visual Environment
The GBOM system comprises several other components. Although not all of them relate directly to the visual construction of GBOM models, the following list briefly explains some of them to provide the context for describing the visual environment for constructing GBOM models.

The GBOM System
GBSpec: Platform to run the specification process, generating the specific BOM for the product variant chosen by the user.
GBMatch: Platform to run the match process, testing the validity of requirements or constraints against a GBOM model.
GBEdit: Graphical editor to compose and edit GBOM models.
GBGen: Application to convert a specific BOM into a GBOM model, for further editing.
Figure 2 shows the stool GBOM graphically: the stool (Part# 200) includes the seat (Part# 400) with the restriction Color = this.Color, and the base (Part# 500) with the restrictions Type = this.Material and Height = 60; the seat in turn includes the cushion (Part# 410), the chipboard (Part# 420) and one of the seat covers (Part# 451 Red, 452 Blue, 453 White) depending on its Color. The textual GBOM definitions given in the figure are:

component $200 is
  name(stool);
  Color(Red|Blue);
  Material(aluminum|wood);
end component;
body $200 is
  include $400 with
    Color(this.Color);
  end include;
  include $500 with
    Type(this.Material);
    Height(60);
  end include;
end body;

component $400 is
  name(seat);
  Color(Red|Blue|White);
end component;
body $400 is
  include $410;
  include $420;
  case Color is
    when Red: include $451;
    when Blue: include $452;
    when White: include $453;
  end case;
end body;

component $500 is
  name(base);
  Type(aluminum|wood|plastic);
  Height(60|80|100);
end component;

component $410 is name(cushion); end component;
component $420 is name(chipboard); end component;
component $451 is name(seat cover); Color:Red; end component;
component $452 is name(seat cover); Color:Blue; end component;
component $453 is name(seat cover); Color:White; end component;

Fig. 2. GBOM model of a stool
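To make the specification idea concrete, here is a hedged Python sketch of how a GBSpec-like explosion might walk such a structure and emit a specific BOM for a chosen variant; it reuses the classes sketched earlier, auto-picks the first legal value for each child attribute instead of asking the user, and is not the paper's implementation.

```python
# Hedged sketch of a GBSpec-like explosion: given a user's attribute choices,
# walk the BODY, apply parental restrictions, and emit the specific BOM.
# Builds on the GBOM/Inclusion classes sketched earlier; illustrative only.

def explode(gbom, choice, bom=None, indent=0):
    bom = [] if bom is None else bom
    if not gbom.legal(choice):
        raise ValueError(f"illegal variant for {gbom.name}: {choice}")
    bom.append("  " * indent + f"{gbom.name} (Part# {gbom.part_number}) {choice}")
    for inc in gbom.body:
        if inc.condition and not inc.condition(choice):
            continue                                   # inclusion condition not met
        child_choice = {}
        for attr, values in inc.component.attributes.items():
            allowed = inc.restrictions.get(attr, values)   # parental restriction
            # resolve "this.<attr>" style references against the parent choice
            allowed = [choice.get(v[5:], v) if isinstance(v, str) and v.startswith("this.")
                       else v for v in allowed]
            child_choice[attr] = allowed[0]            # pick first legal value (sketch)
        explode(inc.component, child_choice, bom, indent + 1)
    return bom
```

For the stool example, calling explode with the choice {"Color": "Red", "Material": "wood"} would, under this sketch's first-legal-value policy, list the stool, a red seat with its cushion, chipboard and red seat cover, and a wooden base of height 60.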
In the back-end, the GBOM system is supported by a persistent store for existing GBOM models, which is the GBOM database. The GBOM definition tool is GBEdit - a graphical editor for composing and editing GBOM models. Figure 3 depicts the graphical presentation of GBOM models for editing. Our current design uses a context-sensitive pop-up menu in the graphical
editor. On the main canvas, the pop-up menu supports the addition of a new component GBOM, or brings up a browse window to select a model from the GBOM database. Upon creation, a GBOM component allows key entry of its component name, and a unique Part Number is generated. The user can, however, change the generated Part Number. Each component GBOM is displayed in a sub-window, which can be expanded or collapsed. When collapsed, the GBOM is displayed only with the component name. The pop-up menu from the top section with the component name supports expanding or collapsing the window. In a semi-expanded form, the middle section is shown, listing the attributes and the possible values each attribute can take, as well as any additional constraints. The middle section constitutes the HEAD of the GBOM model. The pop-up menu from inside the middle section allows entry of new attributes and editing of existing attributes, as well as of the constraints. The fully expanded GBOM window shows the lower section, presenting the GBOM sub-components and the conditions and requirements for including each. The pop-up menu from the lower section supports editing of the conditions and requirements, along with the inclusion of any sub-component by inserting a link. The lower section constitutes the BODY of the GBOM model.
Fig. 3. The GBOM Graphical Editor, showing the stool GBOM (Part Number 200) expanded with its seat component (Part Number 400) and the restriction Color == this.Color.
The graphical editor makes the construction of GBOM models easier. The convenience to expand or collapse a GBOM model on display makes the complicated GBOM structure easier to deal with, since the user can perceive the hierarchical composition thus implied in the GBOM model. The key relies on these two requirements: Child Independence and Parental Restriction, listed amongst the principle requirements in Section 2. The graphical editor can also conveniently support cutting and pasting collections of component GBOM models in GBEdit. Furthermore, in the GBOM system, every GBOM model supports the GBSpec process and the GBMatch process. These features would become helpful aids in the GBEdit environment for the construction and verification when composing GBOM models. GBSpec allows the user to investigate possible product variants, while GBMatch provides a convenient way to verify that the stated requirements for
inclusion of a sub-component GBOM are valid. In the visual environment of GBEdit, the user can trace a conflict in including certain sub-components up and down the hierarchy to the level where the cause of the conflict occurs. Again, this feature makes use of the two principle requirements: Child Independence and Parental Restriction. Moreover, since the GBOM models may also include information about the cost and availability of each component, the GBOM system can be very useful for supply chain management planning. GBGen is installation specific, and is not particularly pertinent to the visual environment.
6. Summary
We presented a new GBOM model and sketched the GBOM system to facilitate its visual construction and verification. We formulated the set of six principle requirements for the GBOM model to support visual presentation and manipulation. These requirements are briefly summarized below:
• Genericity: The GBOM model captures all the BOM information of its variants.
• Specification Support: The system must support a product specification process.
• Child Independence: The GBOM for a component is independent of the product GBOM that includes it as a component.
• Parental Restriction: A product GBOM is allowed to restrict the variability of a component GBOM included as its component.
• Iteration Support: The system should support search for specific products matching certain specified criteria about the product.
• Composition Support: The system should support tools and processes for the generation, composition, and editing of GBOM models.
Based on the framework of these requirements, we reviewed the various approaches to the GBOM model, and then presented our GBOM model that fulfills all the principle requirements aforementioned. We briefly described the GBOM system, and then proceeded to describe further the visual environment for the composition of GBOM models in the system. The two principle requirements – Child Independence and Parental Restriction – foster a hierarchical structure of the GBOM model to facilitate easy cutting and pasting in the visual environment, and verification of the inclusion of sub-components through tracing up and down such a hierarchy.
References
1. Erens F.J., Hegge H.M.H. “Manufacturing and sales coordination for product variety,” International Journal of Production Economics 37: 83-99, 1994.
2. Pine B.J. II, Mass Customization: the new frontier in business competition, Harvard Business School Press, Boston, 1993.
3. Vollmann T.E., Berry W.L., Whybark D.C. “Advanced concepts in master production scheduling,” Manufacturing Planning and Control Systems, Dow Jones-Irwin, 1992.
4. Olsen K.A., Saetre P., Thorstenson A. “A Procedure-Oriented Generic Bill of Materials,” Computers & Industrial Engineering 32(1): 29-45, 1997.
5. VanVeen, E.A., Wortmann, J.C. “New developments in generative bill of material processing systems,” Production Planning & Control 3(3): 327-335, 1992.
6. Schonsleben, P. Flexible Production Planning and Data Structuring on Computer, CW-Publikationen, Munchen, 1985.
7. VanVeen, E.A. Modelling product structures by generic bills-of-material, Elsevier Science Publishers, Amsterdam, 1992.
8. Hegge, H.M.H., Wortmann, J.C. “Generic bill-of-material: a new product model,” International Journal of Production Economics 23: 117-128, 1991.
9. Bottema, A., van der Tang, F. “A product configurator as key decision support system,” IFIP Transactions B-7: Applications in Technology, 71-92, North-Holland, 1992.
10. Chung, Y., Fischer, G.W. “Illustration of object-oriented databases for the structure of a bill of materials,” Computers in Industry 19: 257-270, 1992.
11. Chung, Y., Fischer, G.W. “A conceptual structure and issues for an object-oriented bill of materials data model,” Computers and Industrial Engineering 26(2): 321-339, 1994.
12. Olsen, K.A., Saetre, P. “Describing products as programs,” International Journal of Production Economics 56-57: 495-502, 1998.
13. Bertrand, J.W.M., Zuijderwijk, M., Hegge, H.M.H. “Using hierarchical pseudo bills of material for customer order acceptance and optimal material replenishment in assemble to order manufacturing of non-modular products,” International Journal of Production Economics 66: 171-184, 2000.
14. Gangopadhyay, D., Wu, P.Y. “An object-based approach to Medical Process Automation,” 17th Annual Symposium on Computer Application in Medical Care, Washington, D.C., McGraw-Hill, 910-914, 1993.
15. Wu, P.Y. “Visual Capacity Modeling and Interactive Decision Support for Production Planning,” Computer Technology Solutions Conference, Detroit, Michigan, Society of Manufacturing Engineers, September 1999.
16. Wu, P.Y. “Visualizing Capacity and Load in Production Planning,” Information Visualization 2001, London, England, IEEE Computer Society, July 2001.
Data and Knowledge Visualization in Knowledge Discovery Process TrongDung Nguyen, TuBao Ho, and DucDung Nguyen Japan Advanced Institute of Science and Technology Ishikawa, 923-1292 Japan
Abstract. The purpose of our work described in this paper is to develop and put a synergistic visualization of data and knowledge into the knowledge discovery process in order to support an active participation of the user. We introduce the knowledge discovery system D2MS in which several visualization techniques of data and knowledge are developed and integrated into the steps of the knowledge discovery process. Keywords: model selection, knowledge discovery process, data and knowledge visualization, the user’s participation.
1 Introduction
Knowledge discovery in databases (KDD) – the rapidly growing interdisciplinary field of computing that evolves from its roots in database management, statistics, and machine learning – aims at finding useful knowledge from large databases. The process of knowledge discovery is complicated and should be seen inherently as a process containing several iterative and interactive steps: (1) understanding the application domain, (2) data preprocessing, (3) data mining, (4) postprocessing, and (5) applying discovered knowledge. The problem of model selection in KDD – choosing appropriate discovered models or algorithms and their settings for obtaining such models in a given application – is difficult and non-trivial because it requires empirical comparative evaluation of discovered models and meta-knowledge on models/algorithms. In our view, model selection should be semiautomatic and it requires an effective collaboration between the user and the discovery system. In such a collaboration, visualization has an indispensable role because it can give a deep understanding of complicated models that the user cannot have if using only performance metrics. The goal of this work is to develop a research system for knowledge discovery with support for model selection and visualization called D2MS. The system has two main contributions to visual knowledge discovery. First is its efficient visualizers for large multidimensional databases, discovered rules, and hierarchical structures, as well as a synergistic visualization of data and knowledge. In particular, the novel visualization technique T2.5D (Trees 2.5 Dimensions) for large hierarchical structures can be seen as an alternative to powerful techniques for representing large hierarchical structures such as cone trees [13] or hyperbolic
trees [2]. Second is its tight integration of the visualizers with functions in each step of the knowledge discovery process for supporting the model selection purpose.
Fig. 1. Conceptual architecture of D2MS
2 Data and Knowledge Visualization in D2MS
Figure 1 presents the conceptual architecture of D2MS. The system consists of eight modules: Graphical user interface, Data interface, Data processing, Data mining, Postprocessing and Evaluation, Plan management, Visualization, and Application. For preprocessing tasks, D2MS has algorithms for discretization of continuous attributes (three methods based on entropy, rough sets, and k-means clustering), filling in missing values (three cluster-based methods for supervised numerical and categorical data, and unsupervised data), and the feature selection method SFG [6]. For data mining tasks, D2MS consists of k-nearest neighbors, Bayesian classification, the decision tree induction method CABRO [7], CABROrule [9] and LUPC [3] for rule induction, and the conceptual clustering method OSHAM [1]. For postprocessing and evaluation tasks, D2MS has k-fold stratified cross validation integrated with the data mining programs, and exportation of discovered models into spreadsheet or readable forms for the visualization programs. The visualization module is linked to most other modules in D2MS, in particular those directly concerned with model selection. It currently consists of a data visualizer, a rule visualizer, and a tree visualizer (for hierarchical structures). These visualizers are integrated with most of the methods mentioned above in preprocessing, data mining, and postprocessing. We describe each visualizer with a focus on its techniques and on how it is linked to the steps in the KDD process.

2.1 Data Visualization
We have chosen the parallel coordinates technique for visualizing 2D tabular datasets defined by n rows and p columns.
Fig. 2. Data visualization in D2MS
Viewing Original Data. This view gives the user a rough idea about the distribution of data over the values of each attribute; in particular, the colors of different classes can in many cases show clearly how the classes differ from each other.

Summarizing Data. The key idea of this view is not to view original data points but to view their summaries on parallel attributes. Like WinViz [4], D2MS uses bar charts in place of attribute values on each axis. The bar charts on each axis have the same overall height (depending on the number of possible attribute values) and different widths that signify the frequencies of the attribute values. The top-right window in Figure 2 shows the summaries of the stomach cancer data.

Querying Data. This view serves hypothesis generation and hypothesis testing by the user. There are three types of queries: (i) based on a value of the class attribute, where the query determines the subset of all instances belonging to the indicated class; (ii) based on a value of a descriptive attribute, where the query determines the subset of all instances having this value; (iii) based on a conjunction of attribute-value pairs, where the query determines the subset of all instances satisfying this conjunction. The queries can be specified by just using point-and-click. The subset of instances matching a query is visualized in the data viewing mode and in the data summarizing mode. The grey regions on each
axis show the proportions of the specified instances over the values of this attribute (as shown in the bottom-right window in Figure 2).

Data Visualization in the KDD Process. With the above three views of data, D2MS integrates data visualization into the different KDD steps by displaying and interactively changing these views of data at any time. For example, many discretization algorithms provide alternative solutions for dividing a numerical attribute into intervals, and a visual data query on the discretized attribute and the class attribute can give insights for the decision.
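To illustrate the query-and-summarize idea, here is a small Python sketch of a conjunctive attribute-value query over a tabular dataset with per-attribute frequency summaries; the data layout and attribute names are our own assumptions, not D2MS code.

```python
from collections import Counter

# Sketch of a conjunctive attribute-value query with per-attribute value
# summaries, in the spirit of the data visualizer; not D2MS code.

def query(rows, conditions):
    """Return rows matching a conjunction of attribute-value pairs."""
    return [r for r in rows if all(r.get(a) == v for a, v in conditions.items())]

def summarize(rows, attributes):
    """Frequencies of each attribute's values, e.g. widths of the bar charts."""
    return {a: Counter(r.get(a) for r in rows) for a in attributes}

data = [
    {"age": "young", "smoker": "yes", "class": "ill"},
    {"age": "old", "smoker": "no", "class": "healthy"},
    {"age": "old", "smoker": "yes", "class": "ill"},
]
matched = query(data, {"smoker": "yes"})
print(summarize(matched, ["age", "class"]))   # proportions shown on each axis
```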
2.2 Rule Visualization
A rule is a pattern relating several attribute-value pairs to a subset of instances. What matters in visualizing a rule is how this local structure is viewed in relation to the whole dataset, and how the view supports the user's evaluation of the rule's interestingness. D2MS's rule visualizer allows the user to visualize rules of the form antecedent → consequent, where the antecedent is a conjunction of attribute-value pairs and the consequent is a conjunction of attribute-value pairs in the case of association rules, or a value of the class attribute in the case of prediction rules. A rule is simply displayed on the subset of parallel coordinates involved in its antecedent and consequent. D2MS's rule visualizer has the following functions.

Viewing Rules. Each rule is displayed as a polyline that goes through the axes containing the attribute-value pairs occurring in the antecedent of the rule and leading to its consequent, which is displayed in a different color. In the case of prediction rules, the ratio associated with each class on the class attribute corresponds to the number of instances of that class covered by the rule over the total number of instances in that class. This view gives a first impression of the rule's quality.

Viewing Rules and Data. The subset of instances covered by a rule is visualized together with the rule, either on the parallel coordinates or as summaries on the parallel coordinates. From this subset of instances, the user can see the set of rules each of which covers some of these instances, or the user can smoothly change the values of an attribute in the rule to see other related candidate rules.

Rule Visualization in the KDD Process. These views support the user in evaluating the quality of a rule together with other measures such as its coverage and accuracy. For example, of two rules predicting a target class with the same support and confidence, the one that wrongly covers more instances belonging to classes other than the target class would be considered worse. Figure 3 illustrates rule visualization in D2MS, where the top-left and bottom-left windows display a discovered rule, and the top-right and bottom-right windows show the instances covered by that rule.
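The per-class ratios mentioned under "Viewing Rules" can be computed directly from the data. Below is a small, hypothetical sketch of that computation (the field names and toy data are assumptions); the rule visualizer displays these ratios on the class axis.

def rule_covers(instance, antecedent):
    """True if the instance satisfies every attribute-value pair of the rule."""
    return all(instance.get(a) == v for a, v in antecedent.items())

def class_coverage(rows, antecedent, class_attr="class"):
    """For each class value, the fraction of that class covered by the rule:
    |covered instances of the class| / |instances of the class|."""
    totals, covered = {}, {}
    for row in rows:
        c = row[class_attr]
        totals[c] = totals.get(c, 0) + 1
        if rule_covers(row, antecedent):
            covered[c] = covered.get(c, 0) + 1
    return {c: covered.get(c, 0) / totals[c] for c in totals}

# Example: how well does the antecedent {stage: III} separate the target class?
data = [
    {"stage": "III", "class": "dead_within_90_days"},
    {"stage": "III", "class": "alive"},
    {"stage": "I",   "class": "alive"},
]
print(class_coverage(data, {"stage": "III"}))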
Fig. 3. Rule Visualization
2.3 Tree Visualization
D2MS provides several visualization techniques that allow the user to visualize large hierarchical structures effectively. The tightly-coupled views simultaneously display a hierarchy in normal size and in tiny size, which allows the user to determine the field-of-view quickly and to pan to the region of interest. The fish-eye view distorts the magnified image so that the center of interest is displayed at high magnification while the rest of the image is progressively compressed. In addition, the new technique T2.5D [8] is implemented in D2MS for visualizing very large hierarchical structures.

Different Modes of Viewing Hierarchical Structures. The D2MS tree visualizer provides multiple views of trees and other hierarchical structures. Figure 4 illustrates some of them.

– Tightly-coupled views: The global view (on the left) shows the tree structure with all nodes at the same small size and without labels, and therefore it can display a tree fully, or a large part of it, depending on the tree size. The detailed view (on the right) shows the tree structure and the nodes with their labels, together with operations to display node information.

– Customizing views: Initially, according to the user's choice, the tree is displayed either fully or with only the root node and its direct sub-nodes. The tree can then be collapsed or expanded, partially or fully, from the root or from any intermediate node.
Fig. 4. Different views of trees in D2MS
– Tiny mode with fish-eye view: The tiny mode uses the screen space much more efficiently to visualize the tree structure; on it the user can quickly determine the field-of-view and pan to the region of interest. The fish-eye view distorts the magnified image so that the center of interest is displayed at high magnification while the rest of the image is progressively compressed (a minimal sketch of one such distortion function follows this list).
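The distortion can be realized with a simple one-dimensional transform of node coordinates. The formulation below is a common, generic fisheye function and is an assumption on our part; the paper does not give the exact formula used in D2MS.

def fisheye(x, focus, bound, d=3.0):
    """Map coordinate x so that the region around `focus` is magnified and
    the rest is compressed toward `bound` (the screen edge on x's side of the
    focus). The parameter d controls the amount of distortion."""
    if bound == focus:
        return focus
    t = (x - focus) / (bound - focus)   # normalized distance in [0, 1]
    g = (d + 1) * t / (d * t + 1)       # magnify near 0, compress near 1
    return focus + g * (bound - focus)

# Node x-coordinates on a canvas 100 units wide, with the focus at x = 40.
for x in (41, 45, 60, 95):
    print(x, "->", round(fisheye(x, focus=40, bound=100), 1))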
Trees 2.5 Dimensions. The user may find it difficult to navigate a very large hierarchy even with tightly-coupled and fish-eye views. To overcome this difficulty, we have been developing a new technique called T2.5D (which stands for Trees 2.5 Dimensions). T2.5D is inspired by the work of Reingold and Tilford [12], which draws tidy trees in reasonable time and storage. Unlike tightly-coupled and fish-eye views, which can be seen as location-based views (views of the objects in a region), T2.5D can be seen as a relation-based view (a view of related objects). The starting point of T2.5D is the observation that a large tree consists of many subtrees that usually do not need to be viewed simultaneously. The key idea of T2.5D is to represent a large tree in a virtual 3D space (subtrees are overlapped to reduce the occupied space) while each subtree of interest is displayed in a 2D space.
Fig. 5. T2.5D provides views in between 2D and 3D
To this end, T2.5D determines the fixed position of each subtree (its root node) on the X and Y axes and, in addition, dynamically computes a Z-order for this subtree on an imaginary Z axis. A subtree with a given Z-order is displayed "above" its siblings that have higher Z-orders. When a tree is visualized and navigated, at each moment T2.5D sets to zero the Z-order of all nodes on the path from the root to the node in focus. The active wide path to the node in focus, which contains all nodes on the path from the root to this node together with their siblings, is displayed at the front of the screen in highlighted colors to give the user a clear view. The other parts of the tree remain in the background to provide an image of the overall structure. With the Z-order, T2.5D gives the user the impression that trees are drawn in a 3D space. The user can easily change the active wide path by choosing another node in focus [8]. We have experimented with T2.5D on various real and artificial datasets. It has been verified that T2.5D handles trees with more than 20,000 nodes well, and that more than 1,000 nodes can be displayed together on the screen [8]. Figure 5 illustrates a pruned tree of 1,795 nodes learned from the stomach cancer data and drawn by T2.5D (note that the original color screen gives a better view than this black-and-white reproduction).
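The Z-order scheme just described can be sketched as follows. This is a simplified reconstruction from the description above, with hypothetical data structures; the actual T2.5D layout additionally computes the X and Y positions with a Reingold-Tilford-style algorithm [12].

def path_to_root(node, parent):
    """Nodes on the path from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def z_orders(children, parent, focus):
    """Assign a Z-order to every node: 0 for nodes on the root-to-focus path,
    1 for their siblings (the 'active wide path'), 2 for everything else.
    Lower Z is drawn in front. Simplified sketch of the scheme in the text."""
    on_path = set(path_to_root(focus, parent))
    wide = set(on_path)
    for n in on_path:
        if n in parent:                      # add the siblings of each path node
            wide.update(children[parent[n]])
    return {node: 0 if node in on_path else (1 if node in wide else 2)
            for node in children}

# Tiny example tree: root -> a, b; a -> a1, a2; focus on a1.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": [], "a1": [], "a2": []}
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a"}
print(z_orders(children, parent, focus="a1"))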
3 A Case-Study
This section illustrates the utility of D2MS's synergistic visualization of data and knowledge in extracting knowledge from a stomach cancer dataset.
3.1 The Stomach Cancer Dataset
The stomach cancer dataset, collected at the National Cancer Center in Tokyo during the period 1962-1991, is a very precious source for research. It contains data on 7,520 patients, originally described by 83 numeric and categorical attributes. The top-left window in Figure 6 shows the first few attributes of some instances in the original database, while the middle-left window shows its visualization. One problem is to use the attributes describing patient information before the operation to predict the patient's status after the operation. The domain experts are particularly interested in finding predictive and descriptive rules for the class of patients who "died within 90 days" after the operation, one of five classes in total.
3.2 Visual Knowledge Discovery from Stomach Cancer Data
The task of extracting prediction and description rules for the class of patients who died within 90 days after the operation is difficult because there is no sharp boundary between this class and the others. The data mining methods C4.5, See5.0, CBA, and Rosetta have been applied to this task. However, the obtained results were far from expectations: the rules have low support and confidence and concern only a small part of the patients in the target class.

Visual Interactive Preprocessing. Much effort has been put into preprocessing the stomach cancer data. The first preprocessing task carried out by the domain experts was to discretize the continuous attributes. The two discretization methods available in D2MS, based on entropy and on rough sets, yield two different ways of discretizing the continuous attributes. The dataset derived with the entropy-based method ignores many attributes and often has few discretized values for the remaining ones. The dataset derived with the rough-set-based method discretizes the continuous attributes into more values and does not ignore any attribute. The middle-left and bottom-left windows in Figure 6 visualize the original stomach data, and the middle-right and bottom-right windows illustrate one discretization solution among the trials.

The second preprocessing task was to select subsets of the 83 attributes that are most relevant to the discovery target. Two feature selection methods have been used for this task: the manual KJ method, which is popular in Japan, and the feature selection algorithm SFG [6]. SFG orders attributes by information gain, and the user can choose different subsets of attributes in decreasing order of information gain. These methods yield different candidate subsets, leaving the user to decide which is most appropriate. Data visualization, in particular the querying mode in D2MS, has supported the domain experts in this task.
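To make the SFG-style ranking concrete, the sketch below orders attributes by information gain with respect to the class attribute. This is the standard information-gain computation rather than SFG's actual implementation, and the attribute names and toy data are purely illustrative.

import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, class_attr="class"):
    """H(class) - sum_v P(attr = v) * H(class | attr = v)."""
    labels = [r[class_attr] for r in rows]
    by_value = defaultdict(list)
    for r in rows:
        by_value[r[attr]].append(r[class_attr])
    cond = sum(len(part) / len(rows) * entropy(part) for part in by_value.values())
    return entropy(labels) - cond

def rank_attributes(rows, attributes, class_attr="class"):
    """Attributes in decreasing order of information gain (SFG-style ranking)."""
    return sorted(attributes,
                  key=lambda a: information_gain(rows, a, class_attr),
                  reverse=True)

data = [
    {"stage": "III", "sex": "male",   "class": "dead_within_90_days"},
    {"stage": "III", "sex": "female", "class": "dead_within_90_days"},
    {"stage": "I",   "sex": "male",   "class": "alive"},
    {"stage": "I",   "sex": "female", "class": "alive"},
]
print(rank_attributes(data, ["stage", "sex"]))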
Fig. 6. Visual interactive discretization and feature selection
The key point here is to support generating and testing hypotheses by changing views of the data, querying the data to gain insights into the importance of attributes, and so on. The middle-right and bottom-right windows in Figure 6 illustrate the interactive visualization used in this task. For example, the colors of the polylines distributed over the axes in the middle-right window may suggest how significant an attribute is for the prediction task.

Visual Mining of Rules and Decision Trees. We have applied the two methods LUPC and CABRO in D2MS to mine rules and decision trees from the stomach data, with visualization playing a significant role. Visual LUPC is effective in mining rules from minority classes [3]. It allows the user to try generating different candidate sets of rules under different parameter settings, then visually evaluate the results and adjust the parameters until appropriate results are achieved. For example, the domain experts took the default values for the number of candidate attribute-value pairs, η = 100, and the number of candidate rules, γ = 20, while varying the two parameters for the minimum accuracy of a rule, α, and the minimum coverage of a rule, β. Given α and β, there are two search modes, biased toward accuracy or toward coverage. In the former, LUPC finds rules for the target class with accuracy as high as possible while the coverage remains equal to or greater than β. In the latter, rules are found with coverage as large as possible while the accuracy satisfies the threshold α.
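The interplay of the α and β thresholds and the two search biases can be pictured with a deliberately simplified filter over candidate rules. This is not the actual LUPC search, which generates its own candidates; the rule representation, parameter names, and toy data below are illustrative assumptions only.

def accuracy_and_coverage(rule, rows, class_attr="class"):
    """Coverage: target-class instances covered by the rule.
    Accuracy: covered target-class instances / all covered instances."""
    antecedent, target = rule
    covered = [r for r in rows
               if all(r.get(a) == v for a, v in antecedent.items())]
    hits = sum(1 for r in covered if r[class_attr] == target)
    acc = hits / len(covered) if covered else 0.0
    return acc, hits

def select_rules(candidates, rows, alpha, beta, bias="accuracy"):
    """Keep rules with accuracy >= alpha and coverage >= beta, then order them
    by the chosen bias (accuracy-first or coverage-first), mirroring the two
    modes described above. Simplified sketch, not the LUPC algorithm itself."""
    scored = [(rule, *accuracy_and_coverage(rule, rows)) for rule in candidates]
    kept = [(rule, acc, cov) for rule, acc, cov in scored
            if acc >= alpha and cov >= beta]
    key = (lambda t: (t[1], t[2])) if bias == "accuracy" else (lambda t: (t[2], t[1]))
    return sorted(kept, key=key, reverse=True)

data = [
    {"stage": "III", "class": "dead_within_90_days"},
    {"stage": "III", "class": "alive"},
    {"stage": "I",   "class": "alive"},
]
rules = [({"stage": "III"}, "dead_within_90_days")]
print(select_rules(rules, data, alpha=0.5, beta=1, bias="coverage"))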
Table 1. LUPC finds more and higher-quality rules.

Quality of rules                 Number of discovered rules
                                 See5.0  CBA-CB  CBA-CAR  Rosetta  LUPC
cover ≥ 7,  accuracy ≥ 80%          2       1       1        5      14
cover ≥ 7,  accuracy ≥ 100%         0       0       0        0       4
cover ≥ 10, accuracy ≥ 50%          1       0       1        0      61
cover ≥ 10, accuracy ≥ 60%          0       0       0        0      40
cover ≥ 10, accuracy ≥ 70%          0       0       0        0      21
The most important activity in interactive mining with LUPC is the evaluation of the discovered rules, in which the user is helped by the D2MS rule visualizer. In addition to the overall performance metrics that LUPC provides for the obtained rule set and for each rule, such as its support and confidence, the user can point and click to examine each rule: the subset of instances it covers, what happens if a condition in the antecedent changes its value to another value of the same attribute, how its errors are distributed over the other classes, and so on.

Table 1 compares the number of rules discovered by See5.0, CBA (CBA-CB and CBA-CAR) [5], Rosetta, and LUPC (columns 2-6), according to the required minimum coverage and accuracy of the rules shown in the first column. Thanks to its support for model selection and visualization, LUPC allows us to find more and higher-quality rules than the other systems. For example, See5.0, CBA, and Rosetta found only 2, 1 (1), and 5 rules, respectively, each of which covers at least 7 cases of the class "died within 90 days" with accuracy equal to or greater than 80%. Clearly, these rules characterize only a small part of the target class, which contains 302 cases in total. LUPC allows us to discover 22 rules with such thresholds. With the requirement of finding rules that cover at least 10 cases, See5.0, CBA, and Rosetta found almost no such rules, except CBA-CAR in the case of accuracy equal to or greater than 50%. Under this condition, LUPC shows its advantage through its ability to discover many such high-quality rules, thanks to the model selection support in D2MS. As analyzed in more detail in [3], See5.0 induces rules with an average error of 30.5% on the test data (a random 30% of the stomach cancer data), but with a very high false positive rate of 98.9% (i.e., the rules found for this class wrongly recognize 98.9% of the cases from other classes; a minimal sketch of this computation follows at the end of this subsection). Similarly, CBA and Rosetta also give poor results on the class "died within 90 days", even when they produce a large number of rules under small thresholds.

The visual CABRO has been used to learn decision trees from the stomach data. For each candidate model, the CABRO tree visualizer graphically displays the corresponding pruned tree, its size, and its prediction error rate. It offers the user multiple views of these trials and makes it easy to compare their results in order to make a final selection of the techniques and models of interest. Figure 5 shows a screen of T2.5D and of model selection by D2MS on the stomach cancer data.
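As referenced above, the false-positive rate can be computed as the fraction of non-target test instances that are wrongly covered by at least one rule for the target class. A minimal sketch, again with a hypothetical rule representation and toy data:

def false_positive_rate(rules, rows, target, class_attr="class"):
    """Fraction of non-target instances covered by at least one target-class
    rule, i.e. the rate at which the rule set wrongly 'recognizes' other
    classes (illustrative sketch only)."""
    def covered(row):
        return any(all(row.get(a) == v for a, v in antecedent.items())
                   for antecedent, _ in rules)
    others = [r for r in rows if r[class_attr] != target]
    if not others:
        return 0.0
    return sum(1 for r in others if covered(r)) / len(others)

test = [
    {"stage": "III", "class": "alive"},
    {"stage": "I",   "class": "alive"},
    {"stage": "III", "class": "dead_within_90_days"},
]
rules = [({"stage": "III"}, "dead_within_90_days")]
print(false_positive_rate(rules, test, target="dead_within_90_days"))  # 0.5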
4 Conclusion
We have presented the knowledge discovery system D2MS, whose support for model selection is integrated with visualization. We emphasize the crucial role of the user's participation in the model selection process of knowledge discovery and have developed the data, rule, and tree visualizers in D2MS to support such participation. Our basic idea is to use the right visualization techniques in the right places, and to integrate visualization into the steps of the knowledge discovery process. D2MS with its visualization support has been used, and has shown its advantages, in extracting knowledge in a real-world application on stomach cancer data.
References

1. Ho, T.B., "Knowledge Discovery from Unsupervised Data in Support of Decision Making", Knowledge Based Systems: Techniques and Applications, C.T. Leondes (Ed.), Academic Press, pp. 435-461, 2000.
2. Lamping, J. and Rao, R., "The Hyperbolic Browser: A Focus + Context Technique for Visualizing Large Hierarchies", Journal of Visual Languages and Computing, 7(1), pp. 33-55, 1997.
3. Ho, T.B., Nguyen, D.D., and Kawasaki, S., "Mining Prediction Rules from Minority Classes", 14th International Conference on Applications of Prolog (INAP2001), International Workshop on Rule-Based Data Mining RBDM2001, Tokyo, October 20-22, 2001.
4. Lee, H.Y., Ong, H.L., and Quek, L.H., "Exploiting Visualization in Knowledge Discovery", Proc. of First Inter. Conf. on Knowledge Discovery and Data Mining, pp. 198-203, 1995.
5. Liu, B., Hsu, W., and Ma, Y., "Integrating Classification and Association Rule Mining", Fourth Inter. Conf. on Knowledge Discovery and Data Mining KDD'98, pp. 80-86, 1998.
6. Liu, H. and Motoda, H., Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, 1998.
7. Nguyen, T.D. and Ho, T.B., "An Interactive Graphic System for Decision Tree Induction", Journal of Japanese Society for Artificial Intelligence, Vol. 14, No. 1, pp. 131-138, 1999.
8. Nguyen, T.D., Ho, T.B., and Shimodaira, H., "A Visualization Tool for Interactive Learning of Large Decision Trees", Twelfth IEEE Inter. Conf. on Tools with Artificial Intelligence ICTAI'2000, pp. 28-35, 2000.
9. Nguyen, T.D., Ho, T.B., and Shimodaira, H., "A Scalable Algorithm for Rule Post-Pruning of Large Decision Trees", Fifth Pacific-Asia Conf. on Knowledge Discovery and Data Mining PAKDD'01, LNAI 2035, Springer, pp. 467-476, 2001.
10. Ohrn, A., Rosetta Technical Reference Manual, Norwegian University of Science and Technology, 1999.
11. Quinlan, J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
12. Reingold, E.M. and Tilford, J.S., "Tidier Drawings of Trees", IEEE Transactions on Software Engineering, Vol. SE-7, No. 2, pp. 223-228, 1981.
13. Robertson, G.G., Mackinlay, J.D., and Card, S.K., "Cone Trees: Animated 3D Visualization of Hierarchical Information", ACM Conf. on Human Factors in Computing Systems, pp. 189-194, 1991.