SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE*

Editors: H. Bunke (Univ. Bern, Switzerland), P. S. P. Wang (Northeastern Univ., USA)

Vol. 38: New Approaches to Fuzzy Modeling and Control — Design and Analysis (M. Margaliot and G. Langholz)
Vol. 39: Artificial Intelligence Techniques in Breast Cancer Diagnosis and Prognosis (Eds. A. Jain, A. Jain, S. Jain and L. Jain)
Vol. 40: Texture Analysis in Machine Vision (Ed. M. K. Pietikainen)
Vol. 41: Neuro-Fuzzy Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 42: Invariants for Pattern Recognition and Classification (Ed. M. A. Rodrigues)
Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)
Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang)
Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli)
Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang)
Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)
Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)
Vol. 50: Empirical Evaluation Methods in Computer Vision (Eds. H. I. Christensen and P. J. Phillips)
Vol. 51: Automatic Diatom Identification (Eds. H. du Buf and M. M. Bayer)
Vol. 52: Advances in Image Processing and Understanding — A Festschrift for Thomas S. Huang (Eds. A. C. Bovik, C. W. Chen and D. Goldgof)
Vol. 53: Soft Computing Approach to Pattern Recognition and Image Processing (Eds. A. Ghosh and S. K. Pal)
Vol. 54: Fundamentals of Robotics — Linking Perception to Action (M. Xie)

*For the complete list of titles in this series, please write to the Publisher.
Series in Machine Perception and Artificial Intelligence — Vol. 55

WEB DOCUMENT ANALYSIS
Challenges and Opportunities

Editors
Apostolos Antonacopoulos (University of Liverpool, UK)
Jianying Hu (IBM T. J. Watson Research Center, USA)

World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-582-7
Printed by MultiPrint Services
PREFACE

With the ever-increasing use of the Web, a growing number of documents are published and accessed on-line. The emerging issues pose new challenges for Document Analysis. Although the development of XML and the new initiatives on the Semantic Web aim to improve the machine-readability of web documents, they are not likely to eliminate the need for content analysis. This is particularly true for the kind of web documents created as web publications (vs. services) where visual appearance is critical. Such content analysis is crucial for applications such as information extraction, web mining, summarization, content re-purposing for mobile and multi-modal access, and web security. The need is evident for discussions to identify the role of Document Analysis in this new technical landscape.

This book is a collection of chapters including state-of-the-art reviews of challenges and opportunities as well as research papers by leading researchers in the field. These chapters are assembled into five parts, reflecting the diverse and interdisciplinary nature of this field. The book starts with Part I, Content Extraction and Web Mining, where four different research groups discuss the application of graph theory, machine learning and natural language processing to the analysis, extraction and mining of web content. Part II deals with issues involved in adaptive content delivery to devices of varying screen size and access modality, particularly mobile devices. Part III focuses on the analysis and management of one of the most common structured elements in web documents — tables. Part IV includes three chapters on issues related to images found in web documents, including text extraction from web images and image search on the web. Finally, the book is concluded in Part V with discussions of new opportunities for Document Analysis in the web domain, including human interactive proofs for web security, the exploitation of web resources for document analysis experiments, the expansion of the concept of "documents" to include multimedia documents, and areas where what has been learnt from traditional Document Image Analysis can be applied to the web domain.
It is our hope that this book will set the scene in the emerging field of Web Document Analysis and stimulate new ideas, new collaborations and new research activities in this important area. We would like to extend our gratitude to Horst Bunke who encouraged and supported us unfailingly in putting together this book. We are also grateful to Ian Seldrup of World Scientific for his helpful guidance and for looking after the final stages of the production. Last but certainly not least, we wish to express our warmest thanks to the Authors, without whose interesting work this book would not have materialised.
Apostolos Antonacopoulos and Jianying Hu
CONTENTS

Preface

Part I. Content Extraction and Web Mining
Ch. 1. Clustering of Web Documents Using a Graph Model (A. Schenker, M. Last, H. Bunke and A. Kandel)
Ch. 2. Applications of Graph Probing to Web Document Analysis (D. Lopresti and G. Wilfong)
Ch. 3. Web Structure Analysis for Information Mining (V. Lakshmi, A. H. Tan and C. L. Tan)
Ch. 4. Natural Language Processing for Web Document Analysis (M. Kunze and D. Rosner)

Part II. Document Analysis for Adaptive Content Delivery
Ch. 5. Reflowable Document Images (T. M. Breuel, W. C. Janssen, K. Popat and H. S. Baird)
Ch. 6. Extraction and Management of Content from HTML Documents (H. Alam, R. Hartono and A. F. R. Rahman)
Ch. 7. HTML Page Analysis Based on Visual Cues (Y. Yang, Y. Chen and H. J. Zhang)

Part III. Table Understanding on the Web
Ch. 8. Automatic Table Detection in HTML Documents (Y. Wang and J. Hu)
Ch. 9. A Wrapper Induction System for Complex Documents and its Application to Tabular Data on the Web (W. W. Cohen, M. Hurst and L. S. Jensen)
Ch. 10. Extracting Attributes and their Values from Web Pages (M. Yoshida, K. Torisawa and J. Tsujii)

Part IV. Web Image Analysis and Retrieval
Ch. 11. A Fuzzy Approach to Text Segmentation in Web Images Based on Human Colour Perception (A. Antonacopoulos and D. Karatzas)
Ch. 12. Searching for Images on the Web Using Textual Metadata (E. V. Munson and Y. Tsymbalenko)
Ch. 13. An Anatomy of a Large-Scale Image Search Engine (W.-C. Lai, E. Y. Chang and K.-T. Cheng)

Part V. New Opportunities
Ch. 14. Web Security and Document Image Analysis (H. S. Baird and K. Popat)
Ch. 15. Exploiting WWW Resources in Experimental Document Analysis Research (D. Lopresti)
Ch. 16. Structured Media for Authoring Multimedia Documents (T. Tran-Thuong and C. Roisin)
Ch. 17. Document Analysis Revisited for Web Documents (R. Ingold and C. Vanoirbeek)

Author Index
Part I. Content Extraction and Web Mining
CHAPTER 1

CLUSTERING OF WEB DOCUMENTS USING A GRAPH MODEL

Adam Schenker¹, Mark Last², Horst Bunke³, and Abraham Kandel¹

¹ Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave. ENB 118, Tampa, FL 33620, USA
E-mail: aschenke, [email protected]

² Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
E-mail: [email protected]

³ Institut für Informatik und angewandte Mathematik, Department of Computer Science, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland
E-mail: [email protected]
In this chapter we enhance the representation of web documents by utilizing graphs instead of vectors. In typical content-based representations of web documents based on the popular vector model, the structural (term adjacency and term location) information cannot be used for clustering. We have created a new framework for extending traditional numerical vector-based clustering algorithms to work with graphs. This approach is demonstrated by an extended version of the classical k-means clustering algorithm which uses the maximum common subgraph distance measure and the concept of median graphs in the place of the usual distance and centroid calculations, respectively. An interesting feature of our approach is that the determination of the maximum common subgraph for measuring graph similarity, which is an NP-Complete problem, becomes polynomial time with our graph representation. By applying this graph-based k-means algorithm to the graph model we demonstrate a superior performance when clustering a collection of web documents.

1. Introduction

In the field of machine learning, clustering has been a useful and active area of research for some time. In clustering, the goal is to separate a given group of data items (the data set) into groups (called clusters) such that items in the same cluster are similar to each other and dissimilar to the items in other clusters.
Unlike the supervised methods of classification, no labeled examples are provided for training. Clustering of web documents is an important problem for two major reasons. First, clustering a document collection into categories enables it to be more easily browsed and used. Automatic categorization is especially important for the World Wide Web with its huge number of dynamic (time varying) documents and diversity of topics; such features make it extremely difficult to classify pages manually as we might do with small document corpora related to a single field or topic. Second, clustering can improve the performance of search and retrieval on a document collection. Hierarchical clustering methods, for example, are used often for this purpose.1

When representing documents for clustering, a vector model is typically used.2 In this model, each possible term that can appear in a document becomes a feature dimension. The value assigned to each dimension of a document may indicate the number of times the corresponding term appears in it. This model is simple and allows the use of traditional clustering methods that deal with numerical feature vectors in a Euclidean feature space. However, it discards information such as the order in which the terms appear, where in the document the terms appear, how close the terms are to each other, and so forth. By keeping this kind of structural information we could possibly improve the performance of the clustering.

The problem is that traditional clustering methods are often restricted to working on purely numeric feature vectors. This comes from the need to compute distances between data items or to calculate some representative of a cluster of items (i.e. a centroid or center of a cluster), both of which are easily accomplished in a Euclidean space. Thus either the original data needs to be converted to a vector of numeric values by discarding possibly useful structural information (which is what we are doing when using the vector model to represent documents) or we need to develop new, customized algorithms for the specific representation.

We deal with this problem by introducing an extension of a classical clustering method that allows us to work with graphs as fundamental data structures instead of being limited to vectors of numeric values. Our approach has two main benefits. First, it allows us to keep the inherent structure of the original documents by modeling each document as a graph, rather than having to arrive at numeric feature vectors that contain only term frequencies. Second, we do not need to develop new clustering algorithms completely from scratch: we can apply straightforward extensions to go from classical clustering algorithms that use numerical vectors to those that deal with graphs. In this chapter we will describe a k-means clustering algorithm that utilizes graphs instead of vectors and illustrate its usefulness by applying it to the problem of clustering a
collection of web documents. We will show how web documents can be modeled as graphs and then clustered using our method. Experimental results will be given and compared with previous results reported for the same web data set based on a traditional vector representation.

The chapter is organized as follows. In Sec. 2 we introduce the mathematical foundations we will use for clustering with graphs. In Sec. 3, we extend the classical k-means algorithm to use graphs instead of numerical vectors. In Sec. 4 we will describe a web page data set and its representation by the graph model. In Sec. 5 we present experimental results and a comparison with previous results from clustering the same web documents when using a vector model and classical k-means algorithms. Conclusions are given in Sec. 6.

2. Graphs: Formal Notation

Graphs are a mathematical formalism for dealing with structured entities and systems. In basic terms a graph consists of vertices (or nodes), which correspond to some objects or components. Graphs also contain edges, which indicate the relationships between the vertices. The first definition we have is that of the graph itself. Each data item (document) in the data set we are clustering will be represented by such a graph:

Definition 1. A graph3,4 G is formally defined by a 4-tuple (quadruple) G = (V, E, α, β), where V is a set of vertices (also called nodes), E ⊆ V × V is a set of edges connecting the vertices, α: V → Σ_V is a function labeling the vertices, and β: E → Σ_E is a function labeling the edges (Σ_V and Σ_E being the sets of labels that can appear on the nodes and edges, respectively).

The next definition we have is that of a subgraph. One graph is a subgraph of another graph if it exists as a part of the larger graph:

Definition 2. A graph G₁ = (V₁, E₁, α₁, β₁) is a subgraph5 of a graph G₂ = (V₂, E₂, α₂, β₂), denoted G₁ ⊆ G₂, if V₁ ⊆ V₂, E₁ ⊆ E₂, α₁(x) = α₂(x) ∀x ∈ V₁, and β₁((x, y)) = β₂((x, y)) ∀(x, y) ∈ E₁.

Next we have the important concept of the maximum common subgraph, or mcs for short, which is the largest subgraph a pair of graphs have in common:

Definition 3. A graph G is a maximum common subgraph5 (mcs) of graphs G₁ and G₂, denoted mcs(G₁, G₂), if: (1) G ⊆ G₁, (2) G ⊆ G₂, and (3) there is no other
subgraph G′ (G′ ⊆ G₁, G′ ⊆ G₂) such that |G′| > |G|. (Here |G| is intended to convey the "size" of the graph G; usually it is taken to mean |V|, i.e. the number of vertices in the graph.)

Using these definitions, a method for computing the distance between two graphs using the maximum common subgraph has been proposed:

$$d(G_1, G_2) = 1 - \frac{|\mathrm{mcs}(G_1, G_2)|}{\max(|G_1|, |G_2|)} \qquad (1)$$

where G₁ and G₂ are graphs, mcs(G₁, G₂) is their maximum common subgraph, max(...) is the standard numerical maximum operation, and |...| denotes the size of the graph as we mentioned in Definition 3.6

This distance measure has four important properties.3 First, it is restricted to producing a number in the interval [0, 1]. Second, the distance is 0 only when the two graphs are identical. Third, the distance between two graphs is symmetric. Fourth, it obeys the triangle inequality, which ensures the distance measure behaves in an intuitive way. For example, if we have two dissimilar objects (i.e. there is a large distance between them) the triangle inequality implies that a third object which is similar (i.e. has a small distance) to one of those objects must be dissimilar to the other.

Methods for computing the mcs are presented in the literature.7,8 In the general case the computation of mcs is NP-Complete, but as we will see later in the chapter, for our graph representation the computation of mcs is polynomial time due to the existence of unique node labels in the considered application.

Other distance measures which are also based on the maximum common subgraph have been suggested. For example, Wallis et al. have introduced a different metric which is not as heavily influenced by the size of the larger graph.9 Fernandez and Valiente combine the maximum common subgraph and the minimum common supergraph in their proposed distance measure.10 However, Eq. 1 is the "classic" version and the one we will use in our implementation and experiments. As yet there are no reported findings to indicate which distance measure is most appropriate for various applications, and this is a topic we will investigate in future research. However, the distance measure of Eq. 1 has the advantage that it requires the least number of computations when compared to the other two distance measures we mentioned above.
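A minimal sketch of Eq. 1 in code (the mcs size is passed in as a plain number here; how it can be computed in polynomial time for our graph representation is shown later in the chapter):

```python
# Sketch of the mcs-based distance of Eq. 1.  Sizes are supplied as numbers;
# any definition of graph "size" (see Definitions 3 and 5) can be plugged in.

def mcs_distance(size_g1, size_g2, size_mcs):
    """d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|); the result lies in [0, 1]."""
    return 1.0 - size_mcs / max(size_g1, size_g2)

print(mcs_distance(12, 20, 8))   # example values: 1 - 8/20 = 0.6
```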
Finally we need to introduce the concept of the median of a set of graphs. We define this formally as:

Definition 4. The median of a set of n graphs,11 S = {G₁, G₂, ..., G_n}, is a graph Ḡ such that Ḡ has the smallest average distance to all elements in S:

$$\bar{G} = \arg\min_{G \in S} \left\{ \frac{1}{n} \sum_{s \in S} \mathrm{dist}(s, G) \right\} \qquad (2)$$

Here S is the set of n graphs (and thus |S| = n) and Ḡ is the median. The median is defined to be a graph in set S. Thus the median of a set of graphs is the graph from that set which has the minimum average distance to all the other graphs in the set. The distance dist(...) is computed from Eq. 1 above. There also exist the concepts of the generalized median and weighted mean, where we don't require that Ḡ be a member of S.11,12 However, the related computational procedures are much more demanding and we do not consider them in the context of this chapter. Note that the implementation of Eq. 2 requires only O(n²) graph distance computations and then finding the minimum among those distances.
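The median of Eq. 2 is equally direct to express in code (a minimal sketch; `dist` stands for any graph distance measure such as Eq. 1):

```python
# Minimal sketch of the median graph of Eq. 2: the member of the set with the
# smallest average (equivalently, smallest total) distance to all members.
# As noted above, this costs O(n^2) distance computations for n graphs.

def median_graph(graphs, dist):
    return min(graphs, key=lambda g: sum(dist(g, s) for s in graphs))
```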
3. The Extended k-Means Clustering Algorithm

With our formal notation now in hand, we are ready to describe our framework for extending classical clustering methods which rely on Euclidean distance. The extension is surprisingly simple. First, any distance calculation between data items is accomplished with a graph-theoretical distance measure, such as that of Eq. 1. Second, since it is necessary to compute the distance between data items and cluster centers, it follows that the cluster centers (centroids) must also be graphs if we are to use a method such as that in Eq. 1. Therefore, we compute the representative "centroid" of a cluster as the median graph of the set of graphs in that cluster (Eq. 2). We will now show a specific example of this extension to illustrate the technique.

To avoid any confusion, we should briefly emphasize here the difference between our method and the family of "traditional" graph-theoretic clustering algorithms.1,13 In the typical graph clustering case, all the data to be clustered is represented as a single graph where the vertices are the data items and the edge weights indicate the similarity between items. This graph is then partitioned to create groups of connected components (clusters). In our method, each data item to be clustered is represented by a graph. These graphs are then clustered using some clustering algorithm (in this case, k-means) utilizing the distance and median computations previously defined in lieu of the traditional Euclidean distance and centroid calculations.

The k-means clustering algorithm is a simple and straightforward method for clustering data.14 The basic algorithm is given in Fig. 1. This method is applicable to purely numerical data when using Euclidean distance and centroid calculations. The usual paradigm is to represent each data item, which consists of m numeric values, as a vector in the space ℝ^m. In this case the distances between two data items are computed using the Euclidean distance in m dimensions and the centroids are computed to be the mean of the data in the cluster.

Inputs: the set of n data items and a parameter k, defining the number of clusters to create
Outputs: the centroids of the clusters (represented as numerical vectors) and for each data item the cluster (an integer in [1, k]) it belongs to
Step 1. Assign each data item (vector) randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the centroids of each cluster.
Step 3. Given the new centroids, assign each data item to be in the cluster of its closest centroid.
Step 4. Re-compute the centroids as in Step 2. Repeat Steps 3 and 4 until the centroids do not change.

Fig. 1. The basic k-means clustering algorithm.

However, now that we have a distance measure for graphs (Eq. 1) and a method of determining a representative of a set of graphs (the median, Eq. 2), we can apply the same method to data sets whose elements are graphs rather than vectors. The k-means algorithm extended to operate on graphs is given in Fig. 2.

Inputs: the set of n data items (represented by graphs) and a parameter k, defining the number of clusters to create
Outputs: the centroids of the clusters (represented as graphs) and for each data item the cluster (an integer in [1, k]) it belongs to
Step 1. Assign each data item (graph) randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the median of the set of graphs for each cluster using Eq. 2.
Step 3. Given the new medians, assign each data item to be in the cluster of its closest median (as determined by distance using Eq. 1).
Step 4. Re-compute the medians as in Step 2. Repeat Steps 3 and 4 until the medians do not change.

Fig. 2. The k-means algorithm for using graphs.
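A minimal sketch of the algorithm of Fig. 2, parameterised by a graph distance function such as Eq. 1 (the function signature and the handling of empty clusters are our own choices, not prescribed by the chapter):

```python
import random

def graph_kmeans(graphs, k, dist, max_iter=100):
    """Extended k-means of Fig. 2: distances via `dist`, representatives via the
    median graph of Eq. 2.  Returns the cluster medians and the assignment."""
    # Step 1: random initial assignment of each graph to a cluster 0..k-1
    assignment = [random.randrange(k) for _ in graphs]
    medians = [None] * k
    for _ in range(max_iter):
        # Steps 2 and 4: the median graph of each cluster is its representative
        new_medians = []
        for c in range(k):
            members = [g for g, a in zip(graphs, assignment) if a == c]
            if members:
                new_medians.append(min(members,
                                       key=lambda g: sum(dist(g, s) for s in members)))
            else:
                new_medians.append(medians[c])   # empty cluster: keep previous median
        # Step 3: reassign each graph to the cluster of its closest median
        assignment = [min((c for c in range(k) if new_medians[c] is not None),
                          key=lambda c: dist(g, new_medians[c]))
                      for g in graphs]
        if new_medians == medians:               # stop when the medians do not change
            break
        medians = new_medians
    return medians, assignment
```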
4. Clustering of Web Documents using the Graph Model

In order to demonstrate the performance and possible benefits of the graph-based approach, we have applied the extended k-means algorithm to the clustering of a collection of web documents. Some research into performing
clustering of web pages is reported in the literature.15-18 Similarity of web pages represented by graphs has been discussed in a recent work by Lopresti and Wilfong.19 Their approach differs from ours in that they extract numerical features from the graphs (such as node degree and vertex frequency) to determine page similarity rather than comparing the actual graphs; they also use a graph representation based on the syntactical structure of the HTML parse tree rather than the textual content of the pages. However, the work we are most interested in for evaluation purposes is that of Strehl et al.20 In that paper, the authors compared the performance of different clustering methods on web page data sets. This paper is especially important to the current work, since it presents baseline results for a variety of standard clustering methods including the classical k-means using different similarity measures.

The data set we will be using is the Yahoo "K" series (available at ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata), which was one of the data sets used by Strehl et al. in their experiments.20 This data set consists of 2,340 Yahoo news pages downloaded from www.yahoo.com in their original HTML format. Each page is assigned to one of 20 categories based on its content, such as "technology", "sports" or "health". Although a pre-processed version of the data set is also available in the form of a term-document matrix and a list of stemmed words, we are using the original documents in order to capture their inherent structural information using graphs.

We represent each web document as a graph using the following method:

• Each term (word) appearing in the web document, except for stop words (see below), becomes a vertex in the graph representing that document. This is accomplished by labeling each node (using the node labeling function α, see Definition 1) with the term it represents. Note that we create only a single vertex for each word even if a word appears more than once in the text. Thus each vertex in the graph represents a unique word and is labeled with a unique term not used to label any other node. If word a immediately precedes word b somewhere in a "section" s of the web document (see below), then there is a directed edge from the vertex corresponding to a to the vertex corresponding to b with an edge label s. We take into account certain punctuation (such as a period) and do not create an edge when these are present between two words.

• We have defined three "sections" for the web pages. First, we have the section title, which contains the text in the document's TITLE tag and any provided keywords (meta-data). Second, we have the section link, which is text appearing in clickable links on the page. Third, we have the section text, which comprises any of the readable text in the document (this includes link text but not title and keyword text). We perform removal of stop words, such as "the", "and", "of", etc., which are generally not useful in conveying information, by removing the corresponding nodes and their incident edges. We also perform simple stemming by checking for common alternate forms of words, such as the plural form.

• We remove the most infrequently occurring words on each page, leaving at most m nodes per graph (m being a user provided parameter). This is similar to a dimensionality reduction process for vector representations.
This form of knowledge representation is a type of semantic network, where nodes in the graph are objects and labeled edges indicate the relationships between objects.21 The conceptual graph is a type of semantic network sometimes used in information retrieval.22 With conceptual graphs, terms or phrases related to documents appear as nodes. The types of relations (edge labels) include synonym, part-whole, antonym, and so forth. Conceptual graphs are used to indicate meaning-oriented relationships between concepts, whereas our method indicates structural relationships that exist between terms in a web document. We give a simple example of our graph representation of a web document in Fig. 3. The ovals indicate nodes and their corresponding term labels. The edges are labeled according to title (TI), link (L), or text (TX). The document represented by the example has the title "YAHOO NEWS", a link whose text reads "MORE NEWS", and text containing "REUTERS NEWS SERVICE REPORTS". This novel method of document representation is somewhat similar to that of directed acyclic word graphs (or DAWGs); however, our nodes represent words rather than letters, our model allows for cycles in the graphs, and the edges are labeled.
Fig. 3. An example graph representation of a web document.
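A small sketch of how the Fig. 3 example could be assembled under the representation described above (the set-based data structures are ours; the chapter does not prescribe a particular implementation):

```python
# Sketch of the graph representation applied to the running example.  A graph
# is held as a set of unique term-labelled nodes plus a set of directed,
# section-labelled edges; stop-word removal and stemming are omitted, and for
# brevity link text is not also added to the "text" section here, although the
# section definitions above say that it would be.

def add_section(nodes, edges, words, section):
    for a, b in zip(words, words[1:]):
        nodes.update([a, b])
        edges.add((a, b, section))     # directed edge a -> b, labelled by the section

nodes, edges = set(), set()
add_section(nodes, edges, "YAHOO NEWS".split(), "TI")                    # title
add_section(nodes, edges, "MORE NEWS".split(), "L")                      # link text
add_section(nodes, edges, "REUTERS NEWS SERVICE REPORTS".split(), "TX")  # text

print(sorted(nodes))   # ['MORE', 'NEWS', 'REPORTS', 'REUTERS', 'SERVICE', 'YAHOO']
print(sorted(edges))   # e.g. ('MORE', 'NEWS', 'L'), ('YAHOO', 'NEWS', 'TI'), ...
```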
When determining the size of a graph representing a web document (Definition 3) we use the following method:

Definition 5. The size of a graph G = (V, E, α, β), denoted |G|, is defined as |G| = |V| + |E|.

Recall that the typical definition is simply |G| = |V|. However, for this application it is detrimental to ignore the contribution of the edges, which indicate the number of phrases identified in the text. Further, it is possible to have more than one edge between two nodes, since we label the edges separately according to the document section in which the terms are adjacent.

Before moving on to the experiments, we mention an interesting feature this model of representing documents has on the time complexity of determining the distance between two graphs (Eq. 1). In the distance calculation we are using the maximum common subgraph; the determination of this in the general case is known to be an NP-Complete problem.24 However, our graphs for this application have the following property:

$$\forall x, y \in V:\ \alpha(x) = \alpha(y) \ \text{if and only if}\ x = y \qquad (3)$$

In other words, each node in a graph has a unique label assigned to it, namely the term it represents. No two nodes in a graph will have the same label. Thus the maximum common subgraph G_m = (V_m, E_m, α_m, β_m) of a pair of graphs G₁ and G₂ can be created using the following method:

Step 1. Create the set of vertices: V_m = {x | x ∈ V₁ and x ∈ V₂ and α₁(x) = α₂(x)}.
Step 2. Create the set of edges: E_m = {(x, y) | x, y ∈ V_m and β₁((x, y)) = β₂((x, y))}.
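In code, this two-step construction might look as follows (a sketch reusing the set-based representation from the earlier example; with unique node labels no subgraph search is required):

```python
# Sketch of the two-step mcs construction for graphs with unique node labels.
# Nodes are term labels; edges are (a, b, section) triples as in the earlier
# sketch, so the vertex correspondence between the two graphs is forced.

def mcs(nodes1, edges1, nodes2, edges2):
    # Step 1: the mcs vertices are the terms common to both graphs.
    common_nodes = nodes1 & nodes2
    # Step 2: keep an edge when an identically labelled edge (same endpoints,
    # same section label) is present in both original graphs.
    common_edges = {(a, b, s) for (a, b, s) in edges1 & edges2
                    if a in common_nodes and b in common_nodes}
    return common_nodes, common_edges

def graph_size(nodes, edges):
    return len(nodes) + len(edges)     # Definition 5: |G| = |V| + |E|

# The distance of Eq. 1 can then be obtained with the earlier helper, e.g.
# mcs_distance(graph_size(n1, e1), graph_size(n2, e2), graph_size(*mcs(n1, e1, n2, e2)))
```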
The first step states that the set of vertices in the maximum common subgraph is just the intersection of the sets of terms of both graphs. Each term in the intersection becomes a node in the maximum common subgraph. The second step creates the edges by examining the set of nodes created in the previous step. We examine all pairs of nodes in the set; if both nodes contain an edge between them in both original graphs and share a common label, then we add the edge to the maximum common subgraph. Note that this is different from the concept of induced maximum common subgraph, where nodes are added only if they are connected by an edge in both original graphs. If there is a common subset of nodes but different edge configurations in the original graphs, we still add the nodes using our method. We also note that in document clustering the nodes, which represent terms, are much more important than the edges, which only indicate the relationships between the terms (i.e., followed by).

We see that the complexity of this method is O(|V₁||V₂|) for the first step and O(|V_mcs|²) for the second step. Thus it is O(|V₁||V₂| + |V_mcs|²) ≤ O(|V|² + |V_mcs|²) = O(|V|²) overall if we substitute |V| = max(|V₁|, |V₂|).

5. Experimental Results

In order to compare clustering methods with differing distance measures, Strehl et al. proposed the use of an information-theoretic measure of clustering performance.20 This measurement is given as:

$$\Lambda^{M} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{g} n_i^{(j)} \log_{k \cdot g}\!\left( \frac{n_i^{(j)}\, n}{\sum_{l=1}^{k} n_l^{(j)} \; \sum_{l=1}^{g} n_i^{(l)}} \right) \qquad (4)$$
where n is the number of data items, k is the desired number of clusters, g is the actual number of categories, and n_i^{(j)} is the number of items in cluster i classified to be category j. The above measure is, in fact, mutual information25 normalized by the sum of its maximum values (log k and log g) and it represents the overall degree of agreement between the clustering and the categorization.

In an attempt to adhere to the methodology of the original experiments, which used the vector model approach, we have selected a sample of 800 documents from the total collection of 2,340 and have fixed the desired number of clusters to be k = 40 (two times the number of categories), which is the same number of clusters used in the original experiment. Strehl et al. used this number of clusters "since this seemed to be the more natural number of clusters as indicated by preliminary
runs and visualisation." The results for our method using different numbers of maximum nodes per graph and the original results from Strehl et al. for vector-based k-means and a random baseline assignment are given in Table 1 (higher mutual information is better); results from our method are shown in bold. Each row gives the average of 10 experiments using the same 800 item data sample. The variation in results between runs comes from the random initialization in the first step of the k-means algorithm. We used t-tests to evaluate the statistical significance of our results as compared with the best reported vector-based k-means method (Extended Jaccard Similarity). Confidences less than 0.950 are marked with a "-". The same performance data is plotted graphically in Fig. 4.

In Fig. 5 we show the execution times for performing a single clustering of the document collection when using 5, 50, 100, and 150 nodes per graph. These results were obtained on a 733 MHz single processor Power Macintosh G4 with 384 megabytes of physical memory running Mac OS X. The clustering took 7.13 minutes at 5 nodes per graph and 288.18 minutes for 150 nodes per graph. Unfortunately, no execution time data is available for comparison from the original experiments in Strehl et al.

Table 1. Results of our experiments compared with results from Strehl et al. [Only the row labels survive in this copy: the graph-based method at various maximum numbers of nodes per graph, Extended Jaccard Similarity, Pearson Correlation, Cosine Measure, Random (baseline), and Euclidean; the mutual information values themselves are not reproduced.]
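For reference, the measure of Eq. 4, as reconstructed above, might be computed along the following lines (a sketch; the function name and the contingency-table input layout are ours):

```python
# Sketch of the mutual-information performance measure of Eq. 4 (as
# reconstructed above).  n_ij[i][j] is the number of items of category j
# placed in cluster i; logarithms are taken to base k*g so that the result
# is normalised by log k + log g.
import math

def mutual_information(n_ij):
    k, g = len(n_ij), len(n_ij[0])
    n = sum(sum(row) for row in n_ij)
    cluster_sizes = [sum(row) for row in n_ij]           # items per cluster
    category_sizes = [sum(col) for col in zip(*n_ij)]    # items per category
    total = 0.0
    for i in range(k):
        for j in range(g):
            if n_ij[i][j] > 0:
                total += n_ij[i][j] * math.log(
                    n_ij[i][j] * n / (cluster_sizes[i] * category_sizes[j]),
                    k * g)
    return total / n
```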
From Fig. 4 we see that the mutual information generally tends to increase as we allow larger and larger graphs. This makes sense since the larger graphs incorporate more information. On the figure we indicated the values of mutual information from the original experiments for three out of the five methods from Table 1. Euclidean is the classical k-means with a Euclidean distance measure. Random baseline is simply a random assignment of data to clusters; it is used to
provide a baseline for comparison. We would expect any algorithm to perform better than Random, but we see the Euclidean k-means did not. Finally, Jaccard is k-means using the Extended Jaccard Similarity.2,20 It was the best performing of all the k-means methods reported in the original experiment so we have omitted cosine similarity and Pearson correlation on the chart for clarity.

It is not a surprising result to see Euclidean distance perform poorly when using the vector model for representing documents, as it does not have the property of vector length invariance. Because of this, documents with similar term frequency proportions but differences in overall total frequency have large distances between them even though they are supposed to be considered similar. For example, if we were interested in the topic "data mining", a document where the terms "data" and "mining" each appeared 10 times and a document where both terms each appeared 1,000 times are considered to be identical when we have the length invariance property (i.e. their distance is 0). It is only the relative proportion between the terms that is of interest when determining the document's content, since there are often large variances in total term frequency even for documents related to the same topic. Here both documents contain an equal proportion of the terms "data" and "mining". If the term "mining" occurred much more frequently than "data", we would expect the document to be related to a different topic (e.g., "gold mining"). Under Euclidean distance these two documents would have a large distance (i.e. be considered dissimilar) due to the fact that the difference in total frequency (10 vs. 1,000) is large. This is why distance measures with the length invariance property (such as the cosine measure, which measures the cosine of the angle between two feature vectors) are often used in these types of applications in lieu of standard Euclidean distance.

We see that even with only 5 nodes per graph our method outperforms both Euclidean k-means and the random baseline; as we increased the number of nodes per graph the performance approaches that of the other k-means methods until it exceeded even the best k-means method reported at 75 nodes per graph or more. For comparison, the original experiment used a term-document matrix where each vector had 2,903 dimensions. We note a general increasing trend in performance as we allow for larger graphs, which would be consistent with the increase in information that occurs as we introduce new terms (nodes) and phrases (edges) in the graphs. However, the performance improvement is not always strictly proportional with the increase in graph size. For example, the improvement from 60 to 75 is greater than the improvement from 75 to 90 even though we are adding 15 new nodes in each case. This may be due to the fact that the extra nodes added when we increase the graph size, while they are frequently occurring terms, may not always provide information that is useful for discriminating between the documents and in actuality may hinder performance by introducing extraneous data. A future improvement may be to find better methods of selecting the nodes to be used in each graph rather than relying strictly on term frequency.
15
frequently occurring terms, may not always provide information that is useful for discriminating between the documents and in actuality may hinder performance by introducing extraneous data. A future improvement may be to find better methods of selecting the nodes to be used in each graph rather than relying strictly on term frequency.
Fig. 4. Mutual Information as a function of the maximum number of vertices per graph.
Fig. 5. Clustering time as a function of the maximum number of vertices per graph.
6. Conclusions

In this chapter we showed how it is possible to cluster web documents using a graph representation rather than a vector representation. A graph representation allows us to retain structural information such as where terms are located in a document and the order in which terms appear — information which is mostly discarded when using the typical vector model approach. Given a graph model of web documents, we can apply traditional clustering techniques such as k-means by performing an extension from Euclidean distance and centroid calculations to graph distance and median graphs, respectively. To demonstrate the performance of the extended k-means method with our graph representation of web documents, we performed experiments on a web document collection and compared with previous results of clustering using k-means when utilizing a vector model for the same documents. We have discovered the following from our experiments:

• Our method outperformed the baseline random assignment method and the vector-based k-means method using Euclidean distance, even in the case of maximum dimensionality reduction using 5 nodes per graph.

• As the maximum number of nodes allowed per graph became larger, the performance of our method generally increased. This reflects an increase in the amount of information in the graphs as we add nodes and edges.

• Our method outperformed all the k-means clustering methods (Euclidean distance, cosine measure, Pearson correlation, and Jaccard similarity) described in Strehl et al.20 when we allowed 75 nodes per graph or more. We believe this reflects the information retained by the graph representation which is not present when using the vector model approach.
We have many avenues to explore for future work. We have shown one graph distance measure here, but others have been proposed. We will perform experiments with other graph distance measures and compare clustering performance. We can also attempt to create a more elaborate graph representation for web documents. For example, we can recognize more document sections, connect words that appear in the same sentence or paragraph, and so on. Such representations could capture even more information, possibly leading to better performance. It is also possible to apply our technique to structured text, such as XML documents and software
programs, and we intend to investigate clustering collections of source code using our method. We also wish to extend other clustering algorithms to work with graphs, such as hierarchical agglomerative clustering and fuzzy c-means.

Acknowledgments

This work was supported in part by the National Institute for Systems Test and Productivity at the University of South Florida under U.S. Space and Naval Warfare Systems Command grant number N00039-01-1-2248.

References
1. A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys, Vol. 31, No. 3, 1999, pp. 264-323.
2. G. Salton, Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, 1989.
3. H. Bunke, X. Jiang, and A. Kandel, "On the minimum common supergraph of two graphs", Computing, Vol. 65, 2000, pp. 13-25.
4. J. T. L. Wang, K. Zhang, and G.-W. Chirn, "Algorithms for approximate graph matching", Information Sciences, Vol. 82, 1995, pp. 45-74.
5. H. Bunke, "On a relation between graph edit distance and maximum common subgraph", Pattern Recognition Letters, Vol. 18, 1997, pp. 689-694.
6. H. Bunke and K. Shearer, "A graph distance metric based on the maximal common subgraph", Pattern Recognition Letters, Vol. 19, 1998, pp. 255-259.
7. G. Levi, "A note on the derivation of maximal common subgraphs of two directed or undirected graphs", Calcolo, Vol. 9, 1972, pp. 341-354.
8. J. J. McGregor, "Backtrack search algorithms and the maximal common subgraph problem", Software Practice and Experience, Vol. 12, 1982, pp. 23-34.
9. W. D. Wallis, P. Shoubridge, M. Kraetzl, and D. Ray, "Graph distances using graph union", Pattern Recognition Letters, Vol. 22, 2001, pp. 701-704.
10. M.-L. Fernandez and G. Valiente, "A graph distance metric combining maximum common subgraph and minimum common supergraph", Pattern Recognition Letters, Vol. 22, 2001, pp. 753-758.
11. H. Bunke, S. Günter, and X. Jiang, "Towards bridging the gap between statistical and structural pattern recognition: two new concepts in graph matching", in Advances in Pattern Recognition — ICAPR 2001, eds. S. Singh, N. Murshed, and W. Kropatsch, Springer-Verlag, 2001.
12. H. Bunke and A. Kandel, "Mean and maximum common subgraph of two graphs", Pattern Recognition Letters, Vol. 21, 2000, pp. 163-168.
13. J. G. Augustson and J. Minker, "An analysis of some graph theoretical cluster techniques", Journal of the Association for Computing Machinery, Vol. 17, No. 4, 1970, pp. 571-588.
14. T. M. Mitchell, Machine Learning, McGraw-Hill, Boston, 1997.
15. D. Boley, M. Gini, R. Gross, E. H. Han, K. Hastings, G. Karypis, B. Mobasher, and J. Moore, "Partitioning-based clustering for web document categorization", Decision Support Systems, Vol. 27, 1999, pp. 329-341.
16. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, "Extracting large-scale knowledge bases from the web", Proceedings of the 25th International Conference on Very Large Databases, 1999, pp. 639-649.
17. S. A. Macskassy, A. Banerjee, B. D. Davison, and H. Hirsh, "Human performance on clustering web pages: a preliminary study", Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 264-268.
18. O. Zamir and O. Etzioni, "Web document clustering: a feasibility demonstration", Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 46-54.
19. D. Lopresti and G. Wilfong, "Applications of graph probing to web document analysis", Proceedings of the 1st International Workshop on Web Document Analysis (WDA'2001), Seattle, USA, September 2001 (ISBN: 0-9541148-0-9), also at http://www.csc.liv.ac.uk/~wda2001, pp. 51-54.
20. A. Strehl, J. Ghosh, and R. Mooney, "Impact of similarity measures on web-page clustering", AAAI-2000: Workshop of Artificial Intelligence for Web Search, 2000, pp. 58-64.
21. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, Upper Saddle River, 1995.
22. X. Lu, "Document retrieval: a structural approach", Information Processing and Management, Vol. 26, No. 2, 1990, pp. 209-218.
23. M. Crochemore and R. Verin, "Direct construction of compact directed acyclic word graphs", in CPM 97, eds. A. Apostolico and J. Hein, Springer-Verlag, 1997.
24. B. T. Messmer and H. Bunke, "A new algorithm for error-tolerant subgraph isomorphism detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 5, 1998, pp. 493-504.
25. T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, 1991.
CHAPTER 2

APPLICATIONS OF GRAPH PROBING TO WEB DOCUMENT ANALYSIS

Daniel Lopresti and Gordon Wilfong
Bell Labs, Lucent Technologies Inc.
600 Mountain Avenue, Murray Hill, NJ 07974 USA
E-mail: {dpl,gtw}@research.bell-labs.com

In this chapter, we describe our steps towards adapting a new approach for graph comparison known as graph probing to collections of semi-structured documents (e.g., Web pages coded in HTML). We consider both the comparison of two graphs in their entirety, as well as determining whether one graph contains a subgraph that closely matches the other. A formalism is presented that allows us to prove graph probing yields a lower bound on the true edit distance between graphs. Results from several experimental studies demonstrate the applicability of the approach, showing that graph probing can distinguish the kinds of similarity of interest and that it can be computed efficiently.

1. Introduction

Graphs are a fundamental representation in much of computer science, including the analysis of both traditional and Web documents. Algorithms for higher-level document understanding tasks often use graphs to encode logical structure. HTML pages are usually regarded as tree-structured, while the WWW itself is an enormous, dynamic multigraph. Much work on attempting to extract information from Web pages makes explicit or implicit use of graph representations.1-8 It follows, then, that possessing the ability to compare two graphs can be essential, as demonstrated in such applications as query-by-structure, information extraction, performance evaluation etc.

Because most problems relating to graph comparison have no known efficient, guaranteed-optimal solution, researchers have developed a wide range of heuristics. Recently, we have begun to explore an intuitive, easy-to-implement
scheme for comparing graphs that we call graph probing. As shown in Fig. 1, each of the two graphs under study is placed inside a "black box" capable of evaluating a set of graph-oriented operations (e.g., counting the number of vertices labeled in a certain way) and then subjected to a series of simple queries. A measure of the graphs' dissimilarity is the degree to which their responses to the probes disagree.
Fig. 1. Overview of graph probing.
In this chapter, we examine in detail the graph probing paradigm we first put forth in the context of our work on table understanding,9-11 where it played an important role in evaluating the performance of the recognition techniques under development. We later extended the approach to the analysis of HTML-coded Web pages from the perspective of information retrieval.5,6 The preliminary experimental results reported in those earlier papers are substantially augmented herein and accompanied by more extensive explanations and new analyses.

We begin with a discussion of past work relating to the problem of graph comparison, both in the context of Web documents and more generally. In Sec. 3, we present a formalism showing that graph probing provides a lower bound on the true edit distance between two graphs, and that with a minor change we can derive a similar bound for subgraph matching. To examine how well the approach might work in practice, we provide results from several experimental studies in Sec. 4. Finally, we offer our conclusions and topics for future research in Sec. 5.
2. Related Work

Graph comparison is a widespread yet challenging problem, so it should come as no surprise that many researchers have proposed heuristics and/or solutions designed for special cases. It is not our intent to survey the field exhaustively, but rather to identify certain representative papers, especially those most closely related to the approach we are about to describe. A more comprehensive overview can be found in a recent paper by Bunke.12 Jolion offers opinions on research trends in graph matching.13

Before beginning, it is important to distinguish between the exact and approximate matching problems. The former is typically called graph isomorphism, while the latter is often phrased in terms of a minimum-cost sequence of basic editing operations (e.g., the insertion and deletion of vertices and edges) that accounts for the observed differences between the two graphs and which defines the notion of edit distance. These viewpoints are in fact complementary; it should be clear how a solution to the approximate matching problem could be helpful in solving the isomorphism problem. Moreover, since graphs that are sufficiently similar most likely contain subgraphs that are identical, subgraph isomorphism can be seen as facilitating approximate matching. A formal connection between these two concepts was established by Bunke.14

In the context of graph editing, another vital distinction arises with respect to two particular quantities that may be of interest: the actual sequence of operations needed to edit one graph into the other, and the cost of such a sequence. The former is useful in attempting to understand the differences between the graphs and why they may have arisen, while the latter provides a concrete measure of similarity. Given a minimum-cost sequence of editing operations, calculating the corresponding edit distance is straightforward. While the converse is not true, the edit distance by itself is still extremely valuable, especially if it can be computed much more rapidly or for larger graphs than would otherwise be possible using procedures that return the operations.

Much prior work has focused on the graph isomorphism problem (i.e., finding an exact correspondence between two graphs) and its variants. The complexity of graph isomorphism remains open and, unfortunately, all known algorithms for its solution have worst-case exponential running times.15 Heuristics for determining isomorphism often rely on the concept of a vertex invariant, that is, a value f(v) assigned to each vertex v such that under any isomorphism I, if I(v) = v′ then f(v) = f(v′). One such
invariant is the degree of a vertex (or the in- and out-degrees, if the graph is directed). Indeed, Nauty, an effective software package for computing graph isomorphism,16,17 relies on vertex invariants. In general, such heuristics can fail in a catastrophic manner.18 On the other hand, it has been shown that for random graphs, there is a simple linear time test for checking if two graphs are isomorphic based on the degrees of the vertices, and this test succeeds with high probability.19

Other research aims at speeding up the computation for database searches. Lazarescu et al. propose a machine learning approach to building decision trees for eliminating from further consideration graphs that cannot possibly be isomorphic to a given query graph.20 While they employ a similar set of features to the ones we use, they do not consider the approximate matching problem or subgraph problems. Bunke and Messmer present a decision-tree-based precomputation scheme for solving the subgraph isomorphism problem, although their data structure can be exponential in the size of the input graphs in the worst case.21,22

Valiente and Martinez describe an approach for subgraph pattern matching based on finding homomorphic images of every connected component in the query.23 Again, the worst-case time complexity is exponential, but such features could also perhaps be incorporated in the measures we are about to present.

Turning to graph edit distance, we note there have been a large number of papers written on the subject and its many applications. These can be divided into two categories. In the first we have procedures that are guaranteed to find an optimal solution but may, in the worst case, require exponential time, such as the early work by Pu and colleagues.24,25 The second category includes heuristics that may not necessarily return an optimal match, but that have polynomial running times as in, for example, the Bayesian framework posed by Myers et al.26 Frequently these papers focus on search strategies intended to speed up the computation when certain conditions are satisfied, making the understanding and implementation of the algorithms more difficult.

Papadopoulos and Manolopoulos discuss an idea that is philosophically quite similar to ours.27 However, they focus on a single invariant: vertex degree. It is clear this is not sufficient for catching all of the interesting differences that can arise between HTML documents. Moreover, their histogram technique is applied only to the problem of comparing complete graphs, whereas we wish to examine the subgraph matching problem as well.
Instead of trying to solve the problem for graphs in general, some leeway can be had by limiting the discussion to trees, for which efficient comparison algorithms are known. Schlieder and Naumann consider a problem closely related to ours: error-tolerant embedding of trees to judge the similarity of XML documents.8 Likewise, Dubois et al. write about tree embedding for searching databases of semi-structured multimedia documents.3

The WebMARS system, as presented by Ortega-Binderberger et al.,7 models Web documents via their parse trees. Queries are likewise treated as trees, although they encode hierarchy for individual object types (e.g., text, images) and do not represent the same sorts of inter-object relationships that the mark-up in Web documents encodes. Matching only takes place between the leaves of the query tree and all possible "chains" in the document tree (i.e., paths leading in the direction from the root to a leaf). The match values are then propagated upwards towards the root of the query tree over edges that can be weighted to reflect the importance of that particular component. Hence, there is an asymmetry between queries and documents. In any case, the graph model we would like to support is more general than simple trees, allowing both cross- and back-edges.

Finally, the flat-file comparison of Web pages, effectively ignoring their hierarchical structure, has been handled by Douglis et al. through the application of string (as opposed to graph) matching techniques.28,29
3. A Formalism for Graph Probing

In this section we formalize the concept of graph probing as a way of quantifying graph similarity. Our goal is to relate probing to a more rigorous but harder-to-compute graph edit distance model.

Let G₁ = (V₁, E₁) and G₂ = (V₂, E₂) be two directed attribute graphs, that is, directed graphs where the vertices and edges are potentially labeled by a type. For example, vertices might be labeled as corresponding to HTML structure tags (e.g., section heading, paragraph, table) as well as their associated content, while the labels for edges represent relationships between structures (e.g., contains, next, hypertext reference).

Now introduce a graph editing model that allows the following basic operations: (1) delete an edge, (2) insert an edge, (3) delete an isolated vertex, (4) insert an isolated vertex, (5) change the type of an edge, (6) change the type of a vertex. The edges and vertices created through insertions can be assigned any type initially. It should be clear that such operations can be used to edit any graph into any other graph. The minimum number of
operations needed to edit G₁ into G₂ is the graph edit distance, dist(G₁, G₂). There is no known algorithm for efficiently computing this distance in general.

Consider a probing procedure that asks the following kinds of questions: "How many vertices with a specific combination of incoming and outgoing edges are present in graph G?" Suppose there are a different edge labels l_1, ..., l_a. The edge structure of a given vertex can then be represented as a 2a-tuple of non-negative integers, (x_1, ..., x_a, y_1, ..., y_a), if the vertex has exactly x_i incoming edges labeled l_i and exactly y_j outgoing edges labeled l_j for 1 ≤ i, j ≤ a. Then a typical probe will have the form: "How many vertices with edge structure (x_1, ..., x_a, y_1, ..., y_a) are present in graph G?" We call these Class 1c probes.ᵃ

This is sufficient for detecting many kinds of changes to an HTML document, but note that the content can be altered without affecting the graph structure of a Web page. Hence, we also need a class of probes focusing on just the vertices and their types: "How many vertices labeled in a specific way are present in graph G?" These are known as Class 2 probes.

Let PR₁c collect the responses for vertex in- and out-degrees and their respective edge types, and let PR₂ collect the responses for vertex types. Define

$$\mathrm{probe}(G_1, G_2) = \left| PR_{1c}(G_1) - PR_{1c}(G_2) \right| + \left| PR_2(G_1) - PR_2(G_2) \right| \qquad (1)$$
That is, probe is the magnitude of the difference between the two sets of probing results (the L1 norm, or the sum of the absolute values of the differences between the entries in the two vectors). We then have:

Theorem 1: Under the directed attribute graph model and its associated edit model, probe is a lower bound, within a factor of 1/4, on the true edit distance between any two graphs. That is, (1/4) · probe(G1, G2) ≤ dist(G1, G2).
The proof of this result follows from a simple case analysis. The operations that cause the largest possible disparity between edit distance and the graph probing measure are the deletion or insertion of an edge. Any one such edit may cause as many as, but no more than, four of the edge structure probes to differ by one.
a Class 1a and Class 1b probes correspond to two graph models we will not be considering here: undirected and unlabeled directed graphs, respectively.
An example is illustrated in Fig. 2, where the comparison of the two probe vectors PR1c and PR2 yields a value of four as opposed to the true edit distance which is three (corresponding to the operations listed in the figure).
Fig. 2. Probing example for the directed attribute graph model.
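To make the probe computation concrete, the following small sketch (our own illustration in Python, not the authors' implementation; the graph encoding and labels are invented for the example) builds the Class 1c and Class 2 probe responses and evaluates Eq. (1) as an L1 difference over all observed probes.

from collections import Counter

def probe_vectors(vertices, edges):
    # vertices: {vertex_id: vertex_type}; edges: [(src, dst, edge_type), ...]
    # Class 1c: count vertices by their labeled in-/out-degree signature
    # (a sparse stand-in for the 2a-tuples described in the text).
    signature = {v: Counter() for v in vertices}
    for src, dst, etype in edges:
        signature[src][("out", etype)] += 1
        signature[dst][("in", etype)] += 1
    class_1c = Counter(tuple(sorted(sig.items())) for sig in signature.values())
    # Class 2: count vertices by their type.
    class_2 = Counter(vertices.values())
    return class_1c, class_2

def l1_diff(c1, c2):
    return sum(abs(c1[k] - c2[k]) for k in set(c1) | set(c2))

def probe_distance(g1, g2):
    p1c_a, p2_a = probe_vectors(*g1)
    p1c_b, p2_b = probe_vectors(*g2)
    return l1_diff(p1c_a, p1c_b) + l1_diff(p2_a, p2_b)   # Eq. (1)

# Two tiny graphs differing only in the type of a single edge:
g1 = ({1: "p", 2: "table", 3: "p"}, [(1, 2, "next"), (2, 3, "next")])
g2 = ({1: "p", 2: "table", 3: "p"}, [(1, 2, "next"), (2, 3, "contains")])
print(probe_distance(g1, g2))   # prints 4

In this toy case a single edge-type change (edit distance 1) perturbs the edge-structure signatures of two vertices, giving a probe distance of 4; the bound of Theorem 1, (1/4) · 4 ≤ 1, holds with equality.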
The precomputation needed for each graph is as follows. Computing the edge structures of all the vertices takes total time O(|E| + a|V|). These |V| 2a-tuples can then be lexicographically sorted in O(a(δ + |V|)) time, where δ is the maximum number of edges incident on any vertex. Then a simple pass through the sorted list allows us to compute the number of vertices in each of the (non-empty) classes in additional time O(a|V|). Thus the total precomputation time is O(a(δ + |V|) + |E|). We remark that a and δ are likely to be small constants, in which case the time complexity becomes O(|V| + |E|). The quantity probe(G1, G2) defined by Eq. 1 is referred to as the graph probing distance between G1 and G2. It provides an approximation to the true edit distance between two graphs, which would be too expensive to compute in the most general case. Another useful observation is that the graph probing measure satisfies the triangle inequality. In other words, probe(G1, G2) ≤ probe(G1, G3) + probe(G3, G2) for all graphs G1, G2, and G3. Graph probing distance is also non-negative and symmetric (i.e., probe(G1, G2) = probe(G2, G1)), thus satisfying three of the four conditions of a metric space. The remaining condition, separation, is violated, however, as probe(G1, G2) = 0 does not
imply that G1 and G2 are identical (isomorphic). Hence, graph probing distance forms what is commonly referred to as a "pseudo-metric" space. When comparing two graphs in their entirety, it suffices to correlate their responses to the probes and measure the disagreement. For the problem of subgraph matching, however, we cannot expect to be able to compare directly the outputs for the larger graph to those for the smaller. For example, consider a query graph consisting of a single table that corresponds to a table in some database document, but where that document also contains dozens of other tables. There should be no penalty for having to account for the unrelated tables if our goal is to quantify the similarity between the query and some subgraph of the target; the fact that there is a good match for the one table is enough. To assure that the Class 2 probes are invariant across subgraph isomorphism, the manner in which the results are compared must account for this. Clearly, if the query graph contains a certain number of vertices labeled in a given way, the target graph may possibly contain an isomorphic subgraph so long as it contains at least that many vertices labeled in the same way. If the target graph has fewer such vertices, we know there cannot be a subgraph isomorphism, although there may still be a good approximate matching. Similarly, the way in which the Class 1c probes are correlated needs to be modified as well, since the vertices present in the query graph may have fewer incoming and outgoing edges than their corresponding vertices in an isomorphic subgraph of a larger graph. Hence, when computing the difference between the two sets of probing results under Eq. 1, instead of the standard per-element difference |n1,j - n2,j|, we use max(0, n1,j - n2,j), where ni,j is the value of the j-th feature for graph Gi, i ∈ {1, 2}, and G1 is the query graph being matched to some subgraph of G2. Through a straightforward extension to the previous discussion, we can define a subgraph edit distance, subdist, and a subgraph probing distance, subprobe. As in the case of regular graph matching, we can show a lower bound result:

Theorem 2: Under the directed attribute graph model and its associated edit model, subprobe is a lower bound, within a factor of 1/2, on the true subgraph edit distance between any two graphs. That is, (1/2) · subprobe(G1, G2) ≤ subdist(G1, G2).
The bound in this case is stronger (a factor of 1/2 versus a factor of 1/4) because the worst-case edit can affect the results of at most two probes.
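For illustration only (this is our paraphrase, not the published code), the asymmetric comparison can be written as a one-sided difference: a probe is penalized only when the query graph G1 demands more occurrences than the target G2 supplies.

def subprobe_distance(query_probes, target_probes):
    # query_probes / target_probes: dicts mapping a probe (an edge-structure
    # signature or a vertex type) to its response count.  Using
    # max(0, n1 - n2) means a small query embedded in a much larger page
    # incurs no penalty for the page's unrelated content.
    keys = set(query_probes) | set(target_probes)
    return sum(max(0, query_probes.get(k, 0) - target_probes.get(k, 0))
               for k in keys)

Note that this sketch treats Class 1c and Class 2 probes uniformly; as the text points out, the Class 1c comparison in the full scheme has to be relaxed further, since query vertices may legitimately have fewer incident edges than their counterparts inside a larger graph.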
4. Experimental Evaluation

This chapter describes research that is still very much in progress. As we noted, we have previously implemented a version of graph probing to help in the evaluation of a table understanding system.9-11 By comparing the output of a recognition algorithm to the ground-truth created by a human expert, both of which can be represented as attribute graphs, it becomes possible to quantify the performance of the algorithm. However, the utility of graph probing in that specialized domain is not an assurance that it is an appropriate paradigm for searching databases in a retrieval application, especially since the classes of probes we have at our disposal are also different in this case. In the remainder of this section, we present the results of several empirical studies demonstrating how graph probing might be applied in the analysis of semi-structured documents (i.e., Web pages coded in HTML). We begin by describing the graph model we use. For our test collections, we assembled a random assortment of Web pages with the assistance of a commercial search engine using a procedure to be described shortly. The first test examined the ability of the two probe classes to detect changes as a specific commercial Web page evolved naturally over several days. The second studied the subgraph matching problem by searching for a Web page that had been edited by deleting a significant portion of its content.

4.1. Graph Model
The attribute graph model we employ for HTML documents includes the standard tree-structured hierarchy generated when parsing the tags (the "contains"/"contained-by" relationship). Each tag, or pair of matching start/end tags, corresponds to a vertex in the graph, with any associated content or metadata encoded as attributes. For example, the HTML fragment <font face="Arial">Web Document Analysis</font> would yield one vertex of type font with metadata face="Arial" and content Web Document Analysis. Since the content within a vertex can be arbitrarily long, we hash this data to 32-bit integers to facilitate efficient comparison. Beyond the parse tree, we also make use of the order in which content and the various substructures are encountered (in many cases this corresponds to the natural reading order for the material in question). We
represent this via "next"/"previous" cross-edges that connect vertices at a given level in the hierarchy, rather than assuming an implicit fixed ordering on the children of a vertex as some other researchers have done. Lastly, we record hyperlinks as either back-edges (in the case of targets on the same page) or a distinguished vertex type (in the case of external references). No provision is currently made for incoming links from outside documents. The inline content at each vertex is hashed to a 32-bit integer to facilitate comparison and also stored in full as a separate file. Our parse graph generator is implemented in Tcl/Tk30 and is capable of recovering from the kinds of simple errors that often arise in real-world HTML (e.g., missing end tags). Recall from the previous section that we have defined two probe classes for these kinds of graphs:

Class 1c: These probes examine the vertex and edge structure of the graph by counting in- and out-degrees, tabulating different types of incoming and outgoing edges separately.

Class 2: These probes count the occurrences of a given type of vertex in the graph.
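As a rough sketch of how such a parse graph might be assembled (our own simplification in Python, standing in for the authors' Tcl/Tk generator; hyperlink back-edges and error recovery are omitted, and the vertex and edge type names are illustrative):

import zlib
from html.parser import HTMLParser

class ParseGraphBuilder(HTMLParser):
    # Builds a directed attribute graph: "contains" edges from the tag
    # hierarchy and "next" cross-edges between siblings at the same level.
    def __init__(self):
        super().__init__()
        self.vertices = {0: ("root", {}, 0)}   # id -> (type, attrs, content hash)
        self.edges = []                        # (src, dst, edge type)
        self.stack = [0]                       # id 0 is an artificial document root
        self.last_child = {0: None}
        self.next_id = 1

    def _add_vertex(self, tag, attrs, content=""):
        vid = self.next_id
        self.next_id += 1
        # Hash arbitrary-length content to a 32-bit integer for cheap comparison.
        self.vertices[vid] = (tag, dict(attrs), zlib.crc32(content.encode("utf-8")))
        parent = self.stack[-1]
        self.edges.append((parent, vid, "contains"))
        if self.last_child.get(parent) is not None:
            self.edges.append((self.last_child[parent], vid, "next"))
        self.last_child[parent] = vid
        self.last_child[vid] = None
        return vid

    def handle_starttag(self, tag, attrs):
        self.stack.append(self._add_vertex(tag, attrs))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._add_vertex("#text", [], data.strip())

builder = ParseGraphBuilder()
builder.feed('<p><font face="Arial">Web Document Analysis</font></p>')
print(builder.vertices, builder.edges)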
"Random"
Collections
of Web
Pages
To conduct a convincing evaluation of the graph probing paradigm, we need access to a relatively large assortment of real Web pages that have been authored by many different individuals using a variety of editing tools, along with the ability to assemble collections "on-the-fly," either in an unbiased manner or perhaps satisfying certain criteria. Our approach to this problem is to resort to the Web itself by making use of a commercial search engine, querying it for randomly-chosen search terms and then selecting from the results it returns the random "hits" we shall use as our test documents. The Google search engine31 claims to index over 2 billion Web pages as of the time of this writing. We query Google by picking a random word from the Unix "spell" dictionary, which contains 24,259 words including a number of proper names. Based on this search, we choose one of the pages of hits returned by Google, and from that page we select a pointer to a specific HTML document which we then fetch and subject to further analysis. Our implementation of the Web interface is programmed in Tcl/Tk using the Spynergy Toolkit.32 It takes a total of three HTTP "round-trips" to get the data we require: (1) First, issue a search request using a randomly-chosen keyword and
retrieve the first page of results, which includes a summary of the total number of hits. (2) The hits are grouped by Google 10-per-page. Identify one of these pages of hits at random and retrieve it. (3) Within this page, determine one of the 10 URLs at random and retrieve the page it refers to. We have developed a set of simple "wrappers" to extract the necessary information from the HTML code that is returned in the steps above. For the experiments that follow, we collected 1,000 random Web pages in this fashion. On the occasions when an HTTP fetch timed out (after 30 seconds for the initial connection, and 5 seconds for each subsequent buffer), the search was attempted again using a different term. We also re-ran searches that produced no matches. Since an HTTP request may return an arbitrary document (not necessarily HTML) of any length, or no document at all (in the case of a stale link), we enforced minimum and maximum page sizes and also checked for common patterns that indicate an error has occurred (e.g., "404: page not found"). It is also important to observe that even though Google purports there are millions of hits for some search terms, apparently it will only ever return at most the first 1,000 hits. Hence, the universe we are selecting from is much smaller than the entire Web, although with an upper bound of around 24 million pages (the size of the dictionary times the maximum number of hits Google will return for each search term), it seems sufficient for our purposes.
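In outline, the sampling loop can be sketched as follows (our own illustration; fetch_hits() is a hypothetical placeholder standing in for the search-engine round trips and result-page wrappers, and the size and error checks described above are omitted):

import random

def load_dictionary(path="/usr/share/dict/words"):
    with open(path) as f:
        return [w.strip() for w in f if w.strip()]

def sample_random_page_url(fetch_hits, dictionary, hits_per_page=10, max_hits=1000):
    # fetch_hits(term, page_index) is assumed to return the list of result
    # URLs on one page of hits, or [] when the query produced nothing.
    while True:
        term = random.choice(dictionary)
        first_page = fetch_hits(term, 0)                  # round trip 1
        if not first_page:
            continue                                      # re-run empty searches
        # The real procedure uses the hit count reported on the first page,
        # capped at 1,000; here we simply assume the maximum is available.
        page = fetch_hits(term, random.randrange(max_hits // hits_per_page))  # round trip 2
        if page:
            return random.choice(page)                    # round trip 3 fetches this URL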
4.3. Experiment #1: Full Graph Matching
For the first experiment, we used as our query the May 31, 2001 homepage from The New York Times (http://www.nytimes.com/). This is a relatively complex page, its parse tree containing 1,088 vertices. The results for using graph probing to compare this page to 1,000 random Web pages are shown in Fig. 3. The May 31 page is, of course, a perfect match for itself, but also a very good match for the June 4, 5, and 6 homepages as well (the other examples from The New York Times "seeded" in the collection). All of the random pages returned much larger distances from the query. In the figure, the chart shows the breakdown between Class 1c probes (dark gray) and Class 2 probes (light gray). The former can be equated with the overall structure of the page, while the latter are more closely related to content. Note that the contributions of the two probe components are for the most part evenly balanced for random pages, but that nearly all of the differences
between the various versions of The New York Times pages are due to Class 2 probes. A number of other regularly-updated Web pages we have examined exhibit this same behavior; there are no major structural changes from day to day, although the content is constantly varying.
Fig. 3. Graph probing results for Experiment 1 (full graph comparison measure).
Such a result is only of interest, of course, if graph probing distance can be computed efficiently. Fortunately, this appears feasible as illustrated in Fig. 4. Here we show two sets of datapoints as functions of the number of vertices in the target graph. The upper set (light gray squares) displays the total time it takes to parse the HTML document into an intermediate representation, generate the probing results, and compare them to the query. These times range from a fraction of a second to at most 15 seconds for graphs of up to several thousand vertices.b As indicated earlier, our HTML parser is coded in an interpreted (as opposed to compiled) language. This offers a degree of flexibility, but does exact a performance penalty. Since our focus is on the probing process, we coded those routines in C and measured the run times separately; these are plotted in the lower set of datapoints in the figure (dark gray diamonds). Once parsed, Web pages can be compared using graph probing in about 1/100 of a second, on average.
b All timing results reported in this chapter were recorded on an IBM ThinkPad T23 (1 GHz Pentium III, 256 Mbyte RAM) running the Linux operating system.
Fig. 4. Time to compare Web pages versus number of vertices in parse graph.
In Fig. 5 we examine the number of probes as a function of the size of the graph. As the breakdown shows, the growth in the total number of probes (light gray squares) is dominated by the Class 2 probes. The Class 1c probes (dark gray diamonds) grow rapidly at first, but then start to flatten out dramatically at between 30 and 60 probes. Whether the number of such probes is upper-bounded by a constant as the graphs get increasingly larger is a subject for future research. A more detailed analysis of the statistics for the experiment appears in Tables 1 and 2. In Table 1, we present data on the test collection. Here we can see, for example, that the minimum graph probing distance between one of the random Web pages and the query was 1,302, while the average was 1,933. The majority of the time was spent fetching the data over the Internet: graph probing itself is relatively fast. Statistics for the query and related pages are given in Table 2, where the processing times quoted are averaged over 10 iterations to smooth out any transient anomalies in system performance. It is indeed true that the page for June 4 required significantly less time to parse than the others, even though it is about the same size.
Fig. 5. Number of probes generated versus number of vertices in parse graph.
As there are several possible explanations, this would be an interesting question to explore. The asymmetry between the two probe classes for the graphs in Table 1 versus the graphs in Table 2 is also striking.
Table 1. Statistics for the test collection in Experiment 1. Attributes reported: Google Hits, HTML Size (bytes), Parse Graph Vertices, Class 1c Probes, Class 1c Distance, Class 2 Probes, Class 2 Distance, Overall Probes, Overall Distance, Fetch Time (secs), Parse Time (secs), Probe Time (secs), Compare Time (secs), Overall Time (secs).
Table 2. Statistics for the query and related documents in Experiment 1.

Attribute            | NYT May 31, 2001 | NYT June 4, 2001 | NYT June 5, 2001 | NYT June 6, 2001
HTML Size (bytes)    | 50,239           | 50,311           | 49,625           | 49,413
Parse Graph Vertices | 1,088            | 1,099            | 1,087            | 1,081
Class 1c Probes      | 48               | 48               | 48               | 49
Class 1c Distance    | 0                | 21               | 23               | 27
Class 2 Probes       | 234              | 240              | 235              | 235
Class 2 Distance     | 0                | 205              | 203              | 195
Overall Probes       | 282              | 288              | 283              | 284
Overall Distance     | 0                | 226              | 226              | 222
Parse Time (secs)    | 4.195            | 3.446            | 4.175            | 4.281
Probe Time (secs)    | 0.00776          | 0.00789          | 0.00772          | 0.00772
Compare Time (secs)  | 0.00525          | 0.00384          | 0.00381          | 0.00383
Overall Time (secs)  | 4.21             | 3.46             | 4.19             | 4.29

4.4. Experiment #2: Subgraph Matching
For our second experiment, we used the homepage for the WDA'2001 workshop as it existed on July 12, 2001, a screen snapshot of which is shown in Fig. 6. The corresponding parse tree contained 328 vertices. To create the query, the page was modified by deleting the left sidebar (the listing of the workshop chairs and program committee members). The parse tree for this edited page had 125 vertices. Our intention was, of course, to formulate a query graph that would be a good subgraph match for the original (full) page, but hopefully not for other, random pages on the Web. The results for this experiment, using graph probing with the subgraph comparison measure, are shown in Fig. 7. Here we can see that the two probe classes are easily able to distinguish the original page and the smaller, edited version from the rest of the test collection. The biggest contributors to the measured differences were the Class 2 probes, which suggests that content is the key distinguishing factor, although structural differences were captured as well. In the case of the query and the original page, only a single Class 1c probe disagreed, indicating a nearly perfect match. As before, we provide a more detailed breakdown of the statistics in Tables 3 and 4. Since a relatively small graph can appear to be a good subgraph match for many larger graphs, graph probing using the subgraph comparison measure is less stringent than full matching. This is the reason the distance values in Table 3 are smaller than the corresponding distances for the same collection of pages using the full graph comparison measure (Table 1). Still, there are detectable differences between the parse graphs for random Web pages and the query we created by editing the WDA'2001 homepage.
Fig. 6. Snapshot of the WDA'2001 homepage (http://www.csc.liv.ac.uk/~wda2001/).

Fig. 7. Graph probing results for Experiment 2 (subgraph comparison measure).
Table 3. Statistics for the test collection in Experiment 2. Attributes reported: Google Hits, HTML Size (bytes), Parse Graph Vertices, Class 1c Probes, Class 1c Distance, Class 2 Probes, Class 2 Distance, Overall Probes, Overall Distance, Fetch Time (secs), Parse Time (secs), Probe Time (secs), Compare Time (secs), Overall Time (secs).
Table 4. Statistics for the query and related document in Experiment 2.

Attribute            | WDA July 12, 2001 (Original)
HTML Size (bytes)    | 12,109
Parse Graph Vertices | 328
Class 1c Probes      | 21
Class 1c Distance    | 1
Class 2 Probes       | 118
Class 2 Distance     | 0
Overall Probes       | 139
Overall Distance     | 1
Parse Time (secs)    | 0.542
Probe Time (secs)    | 0.00328
Compare Time (secs)  | 0.00458
Overall Time (secs)  | 0.55
One potential effect that deserves further exploration is whether the use of predefined templates for building Web pages results in a greater degree of structural similarity than would otherwise be expected between two arbitrary, unrelated pages (the WDA'2001 homepage, for example, was created using Microsoft FrontPage, a popular HTML editor). Even in such cases, however, we note that the Class 2 probes will catch differences in the content of the two pages.

5. Conclusions

In this chapter, we have described our initial efforts to adapt the graph probing paradigm to searching databases of graph-structured documents, with a focus on Web pages coded in HTML. We considered both the comparison
of two graphs in their entirety, as well as the subgraph matching problem, and gave some preliminary experimental results showing that graph probing is effective at distinguishing useful kinds of similarity and that it can be computed efficiently. One topic for future research is to examine database indexing techniques that could result in sub-linear search times using this measure. The pseudo-metric space property of full graph probing, and in particular the triangle inequality, may be helpful in this regard. Also, the lower bound arguments we presented relating probing to graph edit distance are based on a simple unit cost model for the latter. It would be interesting to know whether similar bounds can be developed for other, more complicated edit cost functions. Currently, graph probing provides a measure of how similar two graphs are or how similar one graph is to some subgraph of another. It does not, however, yield a mapping from one graph to the other. Clearly, to extract information from semi-structured sources we need to recognize more than the fact that some matching is likely to exist; we must be able to identify the actual correspondence between the graphs. Even so, graph probing can at the very least be used to identify graphs that are a likely match, after which more computationally expensive methods may be run on a smaller collection of graphs to find the best possible matches. In addition to information retrieval, other Web-related applications of graph comparison via probing could include wrapper generation and maintenance1,2 and analysis of HTML-coded tables.4 This would require retargeting our graph probing language to information extraction applications, a task that would be challenging but seems feasible.
6. Acknowledgments
Jianying Hu and Ramanujan Kashi played important roles in the table recognition research which led to the development of the graph probing paradigm. The trademarks mentioned in this chapter are the properties of their respective companies.
References
1. N. Ashish and C. Knoblock, "Wrapper generation for semi-structured Internet sources", In Proceedings of the Workshop on the Management of Semistructured Data, Tucson, AZ, June 1997.
2. W. W. Cohen, "Recognizing structure in Web pages using similarity queries", In Proceedings of the Sixteenth National Conference on Artificial Intelligence, Orlando, FL, 1999, pp. 59-66.
3. D. Dubois, H. Prade, and P. Sedes, "Some uses of fuzzy logic in multimedia databases querying", In Proceedings of the Workshop on Logical and Uncertainty Models for Information Systems, London, England, July 1999.
4. S.-J. Lim and Y.-K. Ng, "An automated approach for retrieving hierarchical data from HTML tables", In Proceedings of the ACM International Conference on Information and Knowledge Management, Kansas City, MO, November 1999, pp. 466-474.
5. D. Lopresti and G. Wilfong, "Applications of graph probing to Web document analysis", In A. Antonacopoulos and J. Hu, editors, Proceedings of the First International Workshop on Web Document Analysis, Seattle, WA, September 2001, pp. 51-54. http://www.csc.liv.ac.uk/~wda2001.
6. D. Lopresti and G. Wilfong, "Comparing semi-structured documents via graph probing", In Proceedings of the Seventh International Workshop on Multimedia Information Systems, Capri, Italy, November 2001, pp. 41-50.
7. M. Ortega-Binderberger, S. Mehrotra, K. Chakrabarti, and K. Porkaew, "WebMARS: A multimedia search engine for full document retrieval and cross media browsing", In Proceedings of the Sixth International Workshop on Advances in Multimedia Information Systems, Chicago, IL, October 2000, pp. 72-81.
8. T. Schlieder and F. Naumann, "Approximate tree embedding for querying XML data", In Proceedings of the ACM SIGIR Workshop on XML and Information Retrieval, Athens, Greece, July 2000.
9. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "A system for understanding and reformulating tables", In Proceedings of the Fourth IAPR International Workshop on Document Analysis Systems, Rio de Janeiro, Brazil, December 2000, pp. 361-372.
10. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "Table structure recognition and its evaluation", In Proceedings of Document Recognition and Retrieval VIII, volume 4307, San Jose, CA, January 2001, pp. 44-55.
11. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "Evaluating the performance of table processing algorithms", International Journal on Document Analysis and Recognition, 4(3), March 2002, pp. 140-153.
12. H. Bunke, "Recent developments in graph matching", In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, Barcelona, Spain, 2000, pp. 117-124.
13. J. M. Jolion, "Graph matching: what are we really talking about?", In Third IAPR Workshop on Graph-based Representations in Pattern Recognition, Ischia, Italy, May 2001. http://rfv.insa-lyon.fr/~jolion/PS/prlcpl.pdf.
14. H. Bunke, "On a relation between graph edit distance and maximum common subgraph", Pattern Recognition Letters, 18, 1997, pp. 689-694.
15. S. Fortin, "The graph isomorphism problem", Department of Computer Science Technical Report TR 96-20, The University of Alberta, July 1996.
16. B. McKay, Nauty User's Guide (Version 1.5). Computer Science Department, Australian National University.
17. B. McKay, "Practical graph isomorphism", Congressus Numerantium, 30, 1981, pp. 45-87.
18. D. G. Corneil and D. G. Kirkpatrick, "A theoretical analysis of various heuristics for the graph isomorphism problem", SIAM Journal on Computing, 9(2), May 1980, pp. 281-297.
19. L. Babai, P. Erdos, and S. M. Selkow, "Random graph isomorphism", SIAM Journal on Computing, 9(3), August 1980, pp. 628-635.
20. M. Lazarescu, H. Bunke, and S. Venkatesh, "Graph matching: Fast candidate elimination using machine learning techniques", In Advances in Pattern Recognition, volume 1876 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, 2000, pp. 236-245.
21. H. Bunke and B. T. Messmer, "Recent advances in graph matching", International Journal of Pattern Recognition and Artificial Intelligence, 11(1), November 1997, pp. 169-203.
22. B. T. Messmer and H. Bunke, "Efficient error-tolerant subgraph isomorphism detection", In Shape, Structure and Pattern Recognition, World Scientific, Singapore, 1995, pp. 231-240.
23. G. Valiente and C. Martinez, "An algorithm for graph pattern-matching", In Proceedings of the Fourth South American Workshop on String Processing, Carleton University Press, 1997, pp. 180-197.
24. A. Sanfeliu and K.-S. Fu, "A distance measure between attributed relational graphs for pattern recognition", IEEE Transactions on Systems, Man, and Cybernetics, 13(3), May/June 1983, pp. 353-362.
25. W.-H. Tsai and K.-S. Fu, "Error-correcting isomorphisms of attributed relational graphs for pattern analysis", IEEE Transactions on Systems, Man, and Cybernetics, 9(12), December 1979, pp. 757-768.
26. R. Myers, R. C. Wilson, and E. R. Hancock, "Bayesian graph edit distance", IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6), June 2000, pp. 628-635.
27. A. N. Papadopoulos and Y. Manolopoulos, "Structure-based similarity search with graph histograms", In Proceedings of the 10th International Workshop on Database & Expert Systems Applications, IEEE Computer Society Press, 1998, pp. 174-178.
28. Y.-F. Chen, F. Douglis, H. Huang, and K.-P. Vo, "TopBlend: An efficient implementation of HtmlDiff in Java", In Proceedings of WebNet 2000 - World Conference on the WWW and Internet, San Antonio, TX, November 2000. http://www.research.att.com/~chen/topblend/paper/.
29. F. Douglis and T. Ball, "Tracking and viewing changes on the Web", In Proceedings of the USENIX Annual Technical Conference, San Diego, CA, January 1996, pp. 165-176. http://www.usenix.org/publications/library/proceedings/sd96/douglis.html.
30. J. K. Ousterhout, Tcl and the Tk Toolkit. Addison-Wesley, Reading, MA, 1994.
31. Google, October 2002. http://www.google.com/.
32. H. Schroeder and M. Doyle, Interactive Web Applications with Tcl/Tk, AP Professional, Chestnut Hill, MA, 1998.
CHAPTER 3 WEB STRUCTURE ANALYSIS FOR INFORMATION MINING
Vijjappu Lakshmi,1 Ah-Hwee Tan,2 and Chew-Lim Tan1
1 School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543
Email: vijjappu, [email protected]
URL: http://www.comp.nus.edu.sg/~tancl
2 Laboratories for Information Technology, 21 Heng Mui Keng Terrace, Singapore 119613
[email protected]
Our approach to extracting information from the Web is to analyze the structural content of web pages through exploiting the latent information given by HTML tags. For each specific extraction task, an object model is created consisting of the salient fields to be extracted and the corresponding extraction rules based on a library of HTML parsing functions. We derive extraction rules for both single-slot and multiple-slot extraction tasks which we illustrate through three sample applications.

1. Introduction

Information extraction (IE) as defined by the Message Understanding Conferences (MUC) refers to the task of, given a document, automatically finding the essential details in the text. For example, given a web page on a seminar announcement, an IE system would extract salient information such as topic, speaker, date, venue, and abstract. A key element of IE systems is the set of text extraction rules that identify the relevant information to be extracted. IE systems have been built with different degrees of success on several kinds of text domains. CRYSTAL1 was a learning-based IE system that took parsed annotated sentences and found patterns for extraction in novel sentences. Webfoot2 was an attempt at general IE that processed fragments by looking at Hypertext Markup Language (HTML) tags. SRV3 was another learning architecture for IE. It took a user-defined feature set together with a set of hand-tagged training documents and learned rules for extraction. Craven et al.4
reported that greater accuracy could be achieved by representing each web page as a node in a graph and each hyperlink as an edge. Cardie5 provided a list of problems of learning-based IE, including the difficulty of obtaining enough training data and the lack of corpora annotated with the appropriate semantics and domain-specific supervisory information. The generation of training examples is complicated because the IE process often requires the output of earlier levels of analysis such as tagging and partial parsing. DiPasquo6 argued that there was inherent information in the layout of each page. Marais and Rodeheffer7 described WebL, and then showed its usage by implementing a meta-search engine that combined search results from the AltaVista and HotBot public-search services. WebL was a high level, object-oriented scripting language that incorporated two novel features: service combinators and a markup algebra. The markup algebra extracted structured and unstructured values from pages for computation, and was based on algebraic operations on sets of markup elements. Nahm and Mooney8 described a system called DiscoTEX that combines IE and data mining methods to perform the text mining task of discovering prediction rules from natural-language corpora. Hence by parsing the HTML formatting, one can improve upon traditional text processing. Recently, the use of wrappers for IE from the Web has been popular. A typical wrapper application extracts the data from web pages that are generated based on predefined HTML templates. The systems generate delimiter-based rules that use linguistic constraints. WIEN9 used only delimiters that immediately preceded and followed the actual data. SoftMealy10 was a wrapper induction algorithm that generated extraction rules expressed as finite-state transducers. The World Wide Web Wrapper Factory11 did extraction by using an HTML parser to construct a parse tree following a Document Object Model. In general, HTML tags can help in many tasks involving natural language processing on the Web. In this paper, we consider the more specific problem of exploiting HTML tags for IE from the Web. The motivation of the work stems from a need for a simpler tool to facilitate the writing of extraction rules for our own applications. We adopt an object model approach to extracting information from HTML pages. An object model, consisting of the salient fields of the web pages and their extraction rules based on an HTML parsing library, represents a projection of a user's interests and requirements on a group of web pages. We derive object models for both single-slot extraction, wherein the extracted fields are independent; and multiple-slot extraction, wherein the extracted fields are related. We present object models and experimental results for three sample domains,
namely a news extraction model and a stock quote extraction model based on single-slot extraction rules, and link extraction based on multiple-slot extraction rules. The rest of this paper is organized as follows: Section 2 will present the system architecture, giving details about the HTML parsing library, the object model and the extraction engine. Section 3 will describe the user interface to facilitate the writing of the object models. Three application examples of the object model, namely, news article extraction, link extraction and stock quote extraction, will then be given in Sections 4 to 6. Section 7 concludes the paper.

2. Object Model Architecture

A web page consists of words, numbers, HTML tags, gif images, advertisements etc. People may not be interested in the entire contents of the page. For example, while reading a news article, topics of interest would be the title, the author, the content of the article, the dateline etc. It would be useful if a user was able to specify what's interesting to him on a web page or a group of similar web pages, together with an easy way to extract them. As all web pages are written in HTML, a library of basic HTML parsing functions becomes crucial for this purpose. The user can use these functions to specify what's interesting to him and extract the portions he is interested in. Some general functions could be devised which can be used across web pages of the same category or content. Hence for each task, a set of functions could be formulated and this can be encapsulated into an object model. The object model is not just a query language, but it also aims to capture a host of attributes of the web objects, including syntax and semantics. For example, many financial web sites have stock quotes. One object model would be to extract stock quotes from web sites which offer stock quotes. The model would thus consist of fields like stock name, bid price, ask price, volume, which need to be extracted from the page, and the parsing algorithm to extract them. The parsing algorithm would be described using the library of basic HTML parsing functions. In such a scenario, the HTML parsing functions serve as the building blocks for writing extraction rules for different object models. Specific object models could be formulated for each category. Thus given a web page, the appropriate object model could be plugged in and the desired output could be generated. Thus, the essential elements of the object model architecture would be:

• HTML Parsing Library: A library of HTML parsing functions serves as the basic building block for writing object models. Functions are provided to manipulate attributes of HTML tags, including text, tables, and links. A subset of the single-slot functions is listed in Table 1.
Multi-slot functions would be prefixed by pat_. A simple user interface for writing extraction rules has been developed. A single-slot function is defined as an extraction task which, on identifying the specified pattern, exits searching the web page. On the other hand, a multi-slot function, on identification of the required task/pattern, recursively searches for similar patterns throughout the page.

• Object Model: The library of basic HTML parsing functions can be used to specify what is interesting to a user and extract the portions they are interested in. Some general functions could be devised which can be used across web pages in the same category. For each task, a set of functions can be formulated and this encapsulation of items of interest and the corresponding extraction algorithm forms an object model.

• Extraction Engine: An extraction engine, central to this architecture, is basically a compiler of the HTML parsing functions. A user specifies the URL of the web page and the object model, i.e., the fields of interest and a set of rules written using the HTML parsing functions to extract them. The extraction engine executes the extraction rules and produces the results in extensible Markup Language (XML) format as shown in Fig. 1.
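As a rough illustration of this architecture (ours, not the system's actual code; parse() and the rule functions are placeholders for the HTML parsing library), an object model can be treated as a mapping from field names to extraction rules, which the engine applies and serializes to XML:

from xml.etree.ElementTree import Element, SubElement, tostring

def run_object_model(object_model, page_html, parse):
    # object_model: {field_name: rule}, where each rule is a function taking
    # the parsed page and returning a string (single-slot) or a list of
    # strings (multi-slot).
    doc = parse(page_html)
    root = Element("result")
    for field, rule in object_model.items():
        values = rule(doc)
        for value in (values if isinstance(values, list) else [values]):
            SubElement(root, field).text = value
    return tostring(root, encoding="unicode")

A news object model, for instance, would map fields such as title and dateline to rules built from library functions of the getTagContent() kind.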
Fig. 1. The extraction engine extracts the key information from web pages according to the extraction rules of the object model.

The extraction engine produces the key information in the form of XML from HTML web pages according to the extraction rules of an object model. Pages returned by search engines, stock quotes listed in financial websites and news articles are common web pages. To test our theory, we formulated three object models. These can further be expanded to other types of web pages.

2.1. HTML Parsing Library

A variety of text extraction tasks exist, ranging from extracting the contact email address, to the list of products, to complex sequential patterns from the page. A text extraction task could thus consist of identifying certain text fragments based
on single or multiple criteria to continuously check for patterns. Hence, based on the nature of extraction, IE systems are characterized as single-slot or multi-slot. Single-slot locates and extracts isolated pieces of text from the document; multi-slot is able to relate the information contained in separate extracted facts and to link these facts together in a case frame. Parsing functions have been drafted to accommodate both cases. The HTML parsing library has parsing functions to extract text, lists, tables and links from an HTML document. Since the presentation of an HTML document could also include cues to text extraction, functions to accommodate the presentation (like fonts, colors) have also been included. A sample list of HTML parsing functions is given in Table 1. The full parsing library can be found in the work of V. Lakshmi.12

Table 1: An illustrative list of HTML parsing functions.
HTML tag | Function
Tag | getTagContent(tag, n) | Contents
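A minimal Python analogue of a single-slot function such as getTagContent(tag, n) might look as follows (our approximation, not the library's actual implementation; it is naive with respect to nested tags of the same name):

import re

def get_tag_content(html, tag, n):
    # Return the text inside the n-th occurrence (1-based) of <tag>...</tag>,
    # or None if there are fewer than n occurrences.
    pattern = re.compile(r"<{0}\b[^>]*>(.*?)</{0}>".format(tag),
                         re.IGNORECASE | re.DOTALL)
    matches = pattern.findall(html)
    if len(matches) < n:
        return None
    # Strip any nested markup from the matched fragment.
    return re.sub(r"<[^>]+>", "", matches[n - 1]).strip()

print(get_tag_content("<p>first</p><p>second</p>", "p", 2))   # -> "second"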
Fig. 9. Output of link extraction.

Extraction rules were derived referring to search engines like Yahoo!,16 Google,17 AltaVista,18 and Lycos.19 These extraction rules were then tested on Excite,20 Netscape,21 LookSmart,22 AOL,23 and GO/InfoSeek,24 and a similar level of performance was obtained. Tables 4 and 5 show the system performance in terms of precision and recall for each individual search engine. The lower recall rates for Yahoo! and Google were due to the fact that they extracted some unnecessary links. On the other hand, GO/Infoseek had a poorer recall because some captions were linked to images. An example output of link extraction from Google is given in Fig. 9.

Table 4: Link extraction performance on the reference set.

  | Search engine | Precision | Recall
1 | Yahoo!        | 100%      | 85.33%
2 | AltaVista     | 100%      | 100%
3 | Google        | 100%      | 85.3%
4 | Lycos         | 100%      | 95.33%
Table 5: Link extraction performance on other search engines.

  | Search engine | Precision | Recall
1 | Excite        | 100%      | 100%
2 | Netscape      | 100%      | 100%
3 | LookSmart     | 100%      | 90%
4 | AOL           | 100%      | 100%
5 | GO/Infoseek   | 100%      | 70%
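The multi-slot behaviour used for link extraction can be approximated by a scanner that collects every (caption, URL) pair on the page instead of stopping at the first match (again our own sketch, not the library's pat_-prefixed functions):

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Multi-slot extraction: collect (anchor text, href) for every link.
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

parser = LinkExtractor()
parser.feed('<a href="http://example.com">Example result</a>')
print(parser.links)   # [('Example result', 'http://example.com')]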
6. Stock Quote Extraction

Tables are an extremely powerful Web design tool for laying out tabular data on a page, or to lay out text and graphics on a web page. Tables provide Web designers ways to add vertical and horizontal structure to a page. Tables consist of three basic components: rows (vertical spacing), columns (horizontal spacing) and cells (the container created when a row and column intersect). Most financial web sites have web pages displaying stock quotes in a table. A stock quote is characterized by the fields Stock Name, Ticker Symbol, High, Low, Close, Volume, Previous Close, Bid, Ask, Earning Per Share, Dividend Per Share, P/E Ratio etc. A typical stock quote table is shown in Table 6. The objective of this model would be, given a stock quote attribute, to retrieve its value. Stock quote tables can be arranged in any of the given formats, shown in Fig. 10 with their corresponding HTML layouts.

Table 6: A stock quote table.
Creative 50
Last Done: 39.40 | Change: -0.700 | %Change: -1.84 | Last Traded: 21-Sep-00, 10:29 | Volume: 32,850
Buy Volume: 850 | Bid: 39.3 | Offer: 39.5 | Sell Volume: 2,550
Open: 39.60 | Prev Closed: 39.10 | Intra-Day Range: 39.10-39.60 | 5-Days Range: 39.00-39.70 | Remarks: -
Charts: Intra-Day, 5-Days | News | Company Info | Broker
Fig. 10. Stock quote table arrangements and the corresponding HTML layouts.
Using the HTML parsing functions, an object model is defined to extract stock quotes from financial web sites. This object model can be formulated easily using the HTML parsing functions for tables, namely getTableData(), as shown in Fig. 11.

Stock Quote Object Model
search_url = http://finance.yahoo.com/q?s=MSFT&d=t
$title =
$LastTrade = getTableData(row,1,"Last Trade")
$Change = getTableData(row,1,"% Change")
$Open = getTableData(row,1,"Open")
$High = getTableData(row,1,"High")
$Low = getTableData(row,1,"Low")
$Exchange = getTableData(row,1,"Exchange")
$Yield = getTableData(row,1,"Yield")

Fig. 11. Stock quote object model.

If the stock quote table is to be extracted as a whole instead of the individual elements, getTableData(Table,"Open",0) can be used. The object model listing would then simply be like:

search_url = http://finance.yahoo.com/q?s=MSFT&d=t
$title =
$StockQuote = getTableData(Table,"Open",0)

Six popular financial web sites, namely Yahoo!Finance,25 CNNfn.com,26 Catcha.com,27 Finance! Lycos Asia,28 Fool.com29 and Quote.com30 were used for testing the above object model. The results obtained are tabulated in Table 7. A screen shot of the stock quote output is shown in Fig. 12.

Table 7: Stock quote tables.
S.No | Web site       | Precision | Recall
1    | Quote.com      | 100%      | 100%
2    | Yahoo! Finance | 82.35%    | 100%
3    | CNNfn.com      | 100%      | 73.68%
4    | Catcha.com     | 100%      | 99.33%
5    | Fool.com       | 100%      | 100%
6    | LycosAsia      | 89.6%     | 100%
As we analyze the above results, the data in the LycosAsia website were arranged in three rows and hence could not be retrieved properly. In fact, the
above object model would not work where information is arranged column-wise. However, the object model can be changed to accommodate other multiple-row arrangements or column-wise layouts. As the user finds new or unexpected layouts in some websites, the user can continue to refine the object model to include additional web structural information to improve the retrieval results.
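For illustration, the row-oriented getTableData(row, 1, label) call used above can be approximated in Python as "find the cell whose text equals the label and return the next cell in the same row" (our own sketch with a simplified signature; the whole-table variant getTableData(Table, "Open", 0) and multi-row or column-wise layouts are not covered here):

import re

def get_table_data(html, label):
    # Return the cell value that follows the cell containing `label`
    # in the same table row, or None if the label is not found.
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, re.IGNORECASE | re.DOTALL):
        cells = [re.sub(r"<[^>]+>", "", c).strip()
                 for c in re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row,
                                     re.IGNORECASE | re.DOTALL)]
        for i, cell in enumerate(cells[:-1]):
            if cell.lower() == label.lower():
                return cells[i + 1]
    return None

html = "<table><tr><td>Open</td><td>39.60</td><td>Prev Closed</td><td>39.10</td></tr></table>"
print(get_table_data(html, "Prev Closed"))   # -> "39.10"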
Fig. 12. Output of stock quote extraction from the extraction engine.
7. Conclusion

This paper presents an object model approach to extracting information from HTML pages. Our object model may be considered a simplified version of a wrapper. It serves as an easy-to-use tool for the user to quickly write rules based on the structure of some sample web pages to extract the needed information. While the object model is also delimiter-based, it is geared towards HTML tags and attributes. Furthermore, its looping construct through the use of multi-slot extraction allows recursive search of the relevant information. Building on this architecture, object models for different types of web pages can be formulated, say for company pages, educational institution pages, online shopping pages etc.
A personalized search engine can also be built by integrating all the object models. The last several years have witnessed a variety of innovations in text extraction research. Several new approaches have been developed, including: Hidden Markov models and other statistical techniques for text modelling; active learning and bootstrapping approaches that reduce the required training data; and using boosting to improve the performance of simple text extraction learners. A "unified theory of extraction" is beginning to emerge, in which a family of learning algorithms can cope with a wide variety of text, from natural text, to highly-stylized telegraphic text, to rigidly structured HTML documents. Incorporating semantics and learning algorithms could be the next step in the object model architecture.

Acknowledgment

This project is supported in part by the Agency for Science, Technology and Research (A*STAR) and the Ministry of Education, Singapore, under the joint research grant, R252-000-071-112/303.

References
1. S. Soderland, D. Fisher, J. Aseltine and W. Lehnert, "Crystal: Inducing a conceptual dictionary", Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), 1995, pp. 1314-1319.
2. S. Soderland, "Learning to extract text-based information from the World Wide Web", Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), 1997, pp. 251-254.
3. D. Freitag, "Information extraction from HTML: Application of a general machine learning approach", Proceedings of the 15th Conference on Artificial Intelligence (AAAI-98), 1998, pp. 517-523.
4. M. Craven, S. Slattery, and K. Nigam, "First-Order Learning for Web Mining", Proceedings of the 10th European Conference on Machine Learning, 1998, pp. 250-255.
5. C. Cardie, "Empirical methods in information extraction", AI Magazine, 1997, pp. 55-79.
6. D. DiPasquo, "Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web", Senior Honors Thesis, School of Computer Science, CMU, 1998.
7. H. Marais and T. Rodeheffer, "Automating the Web with WebL", Dr. Dobb's Journal, January 1999.
8. U. Y. Nahm and R. J. Mooney, "Using Information Extraction to Aid the Discovery of Prediction Rules", Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, Boston, MA, August 2000, pp. 51-58.
9. N. Kushmerick, D. Weld, and R. Doorenbos, "Wrapper induction for information extraction", Proceedings of the 15th International Conference on Artificial Intelligence (IJCAI-97), 1997, pp. 729-735.
10. C. Hsu and M. Dung, "Generating finite-state transducers for semi-structured data extraction from the web", Journal of Information Systems, 23(8), 1998, pp. 521-538.
11. F. Azavant and A. Sahuguet, "The World Wide Web Wrapper Factory", http://db.cis.upenn.edu/W4F/doc.html.
12. V. Lakshmi, "Web Structure Analysis for Information Mining", Master of Science Thesis, School of Computing, National University of Singapore, 2001.
13. CNET, http://news.com.com/.
14. ZDNet, http://www.zdnet.com.com/.
15. Straits Times Interactive, http://straitstimes.asia1.com.sg/.
16. Yahoo, http://www.yahoo.com.
17. Google, http://www.google.com.
18. AltaVista, http://www.altavista.com.
19. Lycos, http://www.lycos.com.
20. Excite, http://www.excite.com.
21. Netscape, http://www.netscape.com.
22. LookSmart, http://www.looksmart.com.
23. AOL, http://www.aol.com.
24. Go/InfoSeek, http://infoseek.go.com/.
25. Yahoo!Finance, http://finance.yahoo.com/.
26. CNNfn.com, http://money.cnn.com/.
27. Catcha.com, http://catcha.com.sg.
28. Finance! Lycos Asia, http://www.lycosasia.com.sg.
29. Fool.com, http://www.fool.com.
30. Quote.com, http://www.quote.com.
CHAPTER 4 NATURAL LANGUAGE PROCESSING FOR WEB DOCUMENT ANALYSIS
Manuela Kunze and Dietmar Rösner
Otto-von-Guericke-Universität Magdeburg, Institut für Wissens- und Sprachverarbeitung
P.O. Box 4120, 39016 Magdeburg, Germany
E-mail: makunze, roesner@iws.cs.uni-magdeburg.de
In this chapter we present an approach to the analysis of web documents — and other electronically available document collections — that is based on the combination of XML technology with NLP techniques. A key issue addressed is to offer end-users a collection of highly interoperable and flexible tools for their experiments with document collections. These tools should be easy to use and as robust as possible. XML is chosen as a uniform encoding for all kinds of data: input and output of modules, process information and linguistic resources. This allows effective sharing and reuse of generic solutions for many tasks (e.g., search, presentation, statistics and transformation).
1. Introduction

In this chapter we present an approach to web document analysis based on techniques from Natural Language Processing (NLP). The WWW is a fast growing source of information. It is populated with hyperlinked multimedia information objects. A large proportion of the documents offered on the WWW consist of text or are multimedia documents with a significant proportion of text.a For web document analysis in general, the analysis of text will have to be complemented with the analysis of other WWW media types: image
a 'Document' is undoubtedly a broader term than 'text'. For ease of presentation we will in this paper — unless otherwise noted — use 'document' interchangeably with 'text' or 'text document'. The broader usage will be marked explicitly or should be obvious from context.
analysis, video interpretation, voice processing etc. In addition, cross-media references and the hypermedia structures need proper treatment. It is necessary to be precise about the purpose of a specific system for the analysis of web documents. In other words: what should be achieved with the system? The different tasks that have been identified for traditional collections of documents are in many cases applicable for web documents as well, although sometimes the naming may be different. For example, the services offered by a web search engine are the equivalent of what has more traditionally been called 'information retrieval' (IR). An incomplete list of applications of web document analysis includes among others information retrieval, information extraction (IE), web mining, text classification, summarisation, machine translation, concept learning etc. Each of these applications has its specific demands. Nevertheless, they share many of their subtasks. In addition, many of the subtasks are shared with traditional document processing. For a given complex application in web document analysis we found it fruitful to distinguish its subtasks into the following three categories:

• subtasks that are primarily WWW specific,
• subtasks that are specific to the application,
• subtasks that are relevant to all NLP approaches to text processing.

This chapter has its focus on the last group of subtasks. Since these subtasks are also shared with other applications of document processing, there is a high potential for reuse and resource sharing for these subtasks. At best, complex applications of document processing should be configurable from a common tool box with generic modules and generic resources in combination with only a small number of dedicated application specific modules. WWW specific subtasks can be classified as being part of the preprocessing stage. Preprocessing in this sense comprises all those operations that finally result in a text document in the format expected as input to the linguistic tools. In other words, aspects of the source document that are irrelevant or distracting for linguistic processing will be abstracted away during preprocessing, and the resulting document will be in a canonical format. If source documents are already equipped with appropriate metadata then some subtasks of preprocessing become reduced to looking up the values of metadata attributes. This includes questions such as:
• What is the natural language used in the text?
• What is the domain of the text?
• Is the text intended to be 'stand alone' or is it only a fragment in a hyperlinked structure?

In the future — with the upcoming 'semantic web'1 — more and more WWW pages will carry metadata with them. For now, preprocessing will in many cases include attempts to answer these questions with techniques for language identification, domain classification or hyperlink tracing. The focus of our work has been to develop the architecture and a core set of components and resources and a document suite. Application scenarios and corresponding example corpora have been and still are the driving force in the ongoing development and improvement of the system. To work with different scenarios will help to avoid ad hoc solutions that are not scalable and generalisable. To move to different domains and/or application scenarios also allows to study the methodological issues of how to port and adapt such a system. Given this background, our results are of a methodological nature on the one hand as well as a set of implemented tools on the other hand. In the following, we will try to construct an architecture for a toolbox for NLP-based web document analysis. We will concentrate on the subtasks to be solved and the problems to be tackled. This will be accompanied in many cases by a sketch of the solution chosen in our system. The solution will be explained with examples of data and results of analyses from a variety of domains. Whenever appropriate, we will highlight shortcomings and open problems or discuss the advantages and disadvantages of alternative design decisions. Emphasis will be placed on elaborating both on the algorithms employed as well as on the resources that have to be provided. When one designs modules for NLP tasks there is a variety of aspects to consider:
• language dependence vs. language independence,
• generality vs. specificity,
• ease of adaptation to other domains, sublanguages etc.,
• efficiency of processing and quality of results,
• robustness,
• learning issues.
Human language processing can be seen as an integrated system, which is primarily driven by semantic and pragmatic concerns. Technological
approximations — as discussed here — suffer from principal shortcomings. It is, for instance, an inherent problem in NLP that a strict separation into completely independent stages of processing is hard to achieve. This is due to the fact that in each stage of processing there are, in many cases, decisions to be made that may need information from a later stage of processing or from the context. As a web document we can define all documents that are available via the internet. In addition to structural data (HTML tags) and other objects on the page (e.g., pictures and audio files), the page will in many cases contain text. This text can be titles or paragraphs. For an effective and complete analysis of web documents we therefore need an approach for the extraction of natural language parts of the documents. Natural language analysis of textual parts of web documents is no different from a normal text analysis. The preprocessing process starts with an attempt to get the pure text parts of the documents. Another approach to information extraction is the exploitation of HTML tags, especially when tables play an important role on a webpage (see the work of Habegger2 for analysis based on HTML tags). The following sections present an XML5-based approach for a toolbox with morphological, syntactic and semantic (and some corpus-based) methods. The system described here is called Document Suite XDOC (XML-based tools for DOCument processing) and works on German documents. Currently, we are also experimenting with English resources for XDOC for the task of text mining from English documents.3
2. Design Principles

The Document Suite XDOC was designed and implemented as a workbench for flexible processing of electronically available documents in German.4 Inside XDOC we exploit XML5 and its accompanying formalisms (e.g., XSLT6) and tools (e.g., xt, saxon) as a unifying framework. All modules in the XDOC system expect XML documents as input and deliver their results in XML format. Having a uniform encoding for all input and output data and for all employed resources is advantageous both for the development team as well as for end-users. Generic modules can be reused for all repetitive tasks like selection of substructures, highlighting of search results, flexible transformation into other representations etc. Sharing of expertise across modules in this way does free developer's energy for the 'real' challenges. It is easier
It is also easier to offer end-users a uniform 'look and feel' across modules, based on the uniform encoding.

2.1. Why XML?
XML — and its precursor SGML — offers a formalism for annotating pieces of (natural language) text. To be more precise, if a piece of text is (as a simple first approximation) seen as a sequence of characters (alphabetic and whitespace characters), then XML allows arbitrary markup to be associated with arbitrary subsequences of contiguous characters. Many linguistic units of interest are represented by strings of contiguous characters (e.g., words, phrases, clauses etc.). It is a straightforward idea to use XML to encode information about such a substring of a text interpreted as a meaningful linguistic unit, and to associate this information directly with the occurrence of the unit in the text. The basic idea of annotation is further supported by XML's well-formedness requirement that XML elements have to be properly nested. This is fully concordant with standard linguistic practice: complex structures are made up from simpler structures covering substrings of the full string in a nested way.

With XML we can reuse generic modules for the selection, highlighting and presentation of relevant information. XSL stylesheets can be exploited to allow different presentations of internal data (in different formats such as HTML or RTF) and of processing results for different target groups; for end-users the internals are in many cases not helpful, whereas developers need them for debugging. The advantage of different presentation formats is that users are already acquainted with the functionality of a browser or a common text editor. Furthermore, in the case of a browser we are relatively independent of the operating system.
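As a concrete illustration of this nesting idea, the following sketch builds a small XML annotation for a tokenised phrase with Python's standard xml.etree.ElementTree. The tag names (PP, PRP, DETD, N) follow the conventions used in the examples later in this chapter, but the snippet itself is only an illustration, not part of XDOC.

```python
# Illustrative only: annotate the phrase "in der Mundhoehle" as a nested
# XML structure (a PP containing a preposition, determiner and noun).
import xml.etree.ElementTree as ET

def annotate_pp(tokens_with_tags):
    """tokens_with_tags: list of (word, tag) pairs forming one prepositional phrase."""
    pp = ET.Element("PP")
    for word, tag in tokens_with_tags:
        el = ET.SubElement(pp, tag)
        el.text = word
    return pp

pp = annotate_pp([("in", "PRP"), ("der", "DETD"), ("Mundhoehle", "N")])
print(ET.tostring(pp, encoding="unicode"))
# <PP><PRP>in</PRP><DETD>der</DETD><N>Mundhoehle</N></PP>
```

Because the markup is properly nested, generic XSLT stylesheets or DOM tools can later select, highlight or transform such units without knowing anything about the linguistic modules that produced them.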
2.2. User Orientation

We are interested in tools and techniques for natural language analysis that help humans with their document processing tasks. The end-users of our applications are domain experts (e.g., medical doctors, engineers etc.). They are interested in having their problems solved, but they are typically neither interested nor trained in computational linguistics. Therefore, the barrier they have to overcome before they can use a computational linguistics or text technology system should be as low as possible. This constraint has consequences for the design of the document suite.

The work in the XDOC project is guided by the following design principles, which have been generalised from a number of experiments and
applications with 'realistic' everyday documents (i.e., email messages, abstracts of scientific papers, technical documentation etc.):

• The tools should be usable for 'realistic' documents. One aspect of 'realistic' documents is that they typically contain domain-specific tokens which are not directly covered by classical lexical categories (e.g., noun, verb etc.). Those tokens are nevertheless often essential for the user of the document (e.g., an enzyme descriptor like EC 4.1.1.17 for a biochemist). Further aspects of 'realistic' documents are:
  - they may always contain 'noisy' elements: errors of different types (typos, syntax errors, missing or superfluous words, ...),
  - they may be written in a domain-specific sublanguage with more or less non-standard syntactic preferences,
  - due to the productiveness of the language, lexical and conceptual gaps may occur.
• The tools should be as robust as possible. In general, it cannot be expected that lexicon information is available for all tokens in a document. This is not only the case for most tokens from 'non-lexical' types — like telephone numbers, enzyme names, material codes etc. — even for lexical types there will always be 'lexical gaps'. These may be caused either by neologisms or simply by starting to process documents from a new application domain with a new sublanguage. In the latter case, lexical items will typically be missing in the lexicon ('lexical gap'), and phrasal structures may or may not be adequately covered by the grammar ('syntax gap').
• The tools should be usable independently but should allow for flexible combination and interoperability.
• The tools should not only be usable by developers but by domain experts as well, without linguistic training. Here again XML and XSLT play a major role: XSL stylesheets can be exploited to allow different presentations of internal data and results for different target groups; for end-users the internals are in many cases not helpful, whereas developers need them for debugging.

We currently experiment with XDOC in a number of application scenarios. These are:

• Knowledge acquisition from technical documentation about casting technology, as support for domain experts in the creation of a
domain-specific knowledge base.
• Extraction of company profiles from WWW pages for an effective search for possible suppliers.
• Analysis of autopsy protocols for statistical summaries of typical injuries in traffic accidents.
• Information extraction from Medline abstracts.

2.3. Portability
The effort needed to port a system like XDOC to another domain and/or another type of application is not easy to determine; it obviously depends on the distance or closeness of the new requirements to the functionality already covered and the resources already available. As a rule of thumb, the more general, domain-independent components are easier to port than the domain-specific ones. The following questions should be answered with respect to the document collection to be processed before starting with a new domain and/or application:

• What is the coverage of the lexicon of the morphosyntactic component?
• What sublanguage is used in the corpus? Can existing grammar modules be re-used, or do we need to develop a new rule set for syntactic structures not yet covered?
• To what extent can domain models be re-used? What has to be modeled from scratch?

Some of these questions can be answered with the help of the system itself (e.g., by checking the ratio of unknown tokens to tokens covered by the lexicon). It is crucial that users get as much support as possible for the tasks of resource creation. In order to ease initial domain modeling as well as lexicon creation we have been experimenting with a 'bootstrapping' approach.7

3. Document Suite XDOC

The Document Suite XDOC is divided into several functional modules (cf. the architecture in Fig. 1). Document processing starts with the preprocessing functions, followed by the functions of the syntactic, semantic and corpus-based modules. In the following subsections we describe the separate functions of the modules.

A crucial point: we favor an engineering approach to NLP. It is not our concern whether these tools or techniques are plausible as components of a human reader's processing of a document. We deliberately take into account
Fig. 1. Functional Architecture of the Document Suite XDOC.
that such tools are approximations at best and will have problems and limitations. We accept that the results of tools taken in isolation may be ambiguous (and sometimes even faulty). We try to avoid erroneous results, but we accept ambiguities that are not resolvable within an isolated module. We try to anticipate these cases and to design the other modules in such a way that they can deal with alternatives and overgeneralisation in their input, while trying to resolve ambiguities and to filter erroneous alternatives.

Table 1. Resolving of Ambiguities.

  ambiguity in ...      resolved by ...
  POS tagger            chart parser
  chart parser          case frame analysis
  semantic tagger       case frame analysis
  phrase detector       analysis of phrase detectors' results
In the Document Suite XDOC most ambiguities can be resolved by subsequent analysis (see Table 1). In the phrase detector, the overall interpretation is based on the analysis of all results from the processed document; the classification backed by the majority of phrases inside a document is selected.

3.1. Preprocessing Module
Even if we concentrate on textual web documents, there are a number of formats and encodings that we may encounter.
HTML is very widespread, and XHTML and even XML have an increasing share on the web. But other formats are available as well: DVI, PostScript, PDF, DOC, RTF etc. And we should not forget that we may have plain text (e.g., in emails).

For linguistic processing, we are primarily interested in the sequence of characters constituting the natural language content of the document (i.e., the plain text). The process of mapping from an arbitrary format of the document source to the input representation for linguistic processing may be conceptualized either as 'stripping off' non-text (i.e., commands, markup etc.) or as extracting text. In either case, this process has to preserve the information from the source that is relevant for further processing. The information to preserve or 'carry over' concerns issues like:

• What is a unit of linguistic relevance?
• Where are the boundaries between units?
• What is the function of those boundaries?

The result of this preprocessing step should preserve and make explicit (e.g., as markup) this structural information in addition to the linguistic content (i.e., the character sequence of the plain text). This requirement is independent of the specific format of the source.
Fig. 2. Preprocessing: Identification of Text Units.
In the preprocessing module, all functions for low-level (token-based) text processing are combined. The other modules are based on the results of these functions. The functions are:

• text unit identification (for HTML: the HTML Cleaner),
• the Structure Tagger,
• the Part-of-Speech (POS) Tagger,
• the Sentence Splitter.
3.1.1. HTML Cleaner

The analysis of web pages is essential for applications such as the automatic generation of company profiles from web pages (see the work of Krotzsch8) or the creation of product taxonomies through web mining.9 For such tasks it is necessary to extract the plain textual parts of a web page. The analysis of HTML tags is relevant when we need information about the structure of the documents, such as titles or tables. In addition, we can extract information from tables with linguistic content by applying NLP analysis to the cells' content.

3.1.2. Structure Tagger

We also accept raw ASCII text without any markup as input. Raw ASCII may be the original source, or it may be the output of converters like 12a, pdf2text etc. In such cases, structure detection tries to uncover linguistic units (e.g., sentences, titles etc.) as candidates for further analysis. A major subtask is to identify the role of punctuation characters.

If the structure of a text document is explicitly available, this may be exploited by subsequent linguistic processing. For example, a unit classified as a title or subtitle will be accepted as a noun phrase, whereas within a paragraph full sentences are expected.

In realistic text, even the detection of possible sentence boundaries needs some care. A period character may not only be used as a full stop, but may as well be part of an abbreviation (e.g., "z.B." — engl.: "e.g." — or "Dr."), be contained in a number (e.g., 3.14), or be used in an email address or in domain-specific tokens. The resources employed are special lexica (e.g., for abbreviations) and finite automata for the reliable detection of tokens from specialized non-lexical categories (e.g., enzyme names, material codes etc.). These resources are used primarily to distinguish between those full stop characters that function as sentence delimiters (tagged with IP) and those that are part of abbreviations or of domain-specific tokens. The information about the function of strings that include a period is tagged in the result (e.g., abbreviation — tagged with ABBR).
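To make the role of these resources concrete, here is a minimal sketch of period disambiguation. The abbreviation list and the regular expressions are illustrative assumptions, not XDOC's actual lexica or automata.

```python
import re

# Illustrative abbreviation lexicon and token patterns (not XDOC's real resources).
ABBREVIATIONS = {"z.B.", "Dr.", "bzw.", "u.a."}
NUMBER = re.compile(r"^\d+\.\d+$")                 # e.g. 3.14
DOMAIN_TOKEN = re.compile(r"^EC(\s|\.)?[\d.]+$")   # e.g. an enzyme descriptor like EC 4.1.1.17

def classify_period_token(token):
    """Return a tag for a token that contains a period."""
    if token in ABBREVIATIONS:
        return "ABBR"              # part of an abbreviation
    if NUMBER.match(token) or DOMAIN_TOKEN.match(token):
        return "TOKEN"             # period belongs to a number or a domain-specific token
    if token.endswith("."):
        return "IP"                # sentence-delimiting full stop
    return "OTHER"

for t in ["z.B.", "3.14", "Blutungen."]:
    print(t, classify_period_token(t))
```

In the real system the corresponding decisions are written back into the XML result, so that later modules can rely on the IP/ABBR distinction instead of re-analysing punctuation.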
3.1.3. POS Tagger

The assignment of part-of-speech information to a token — POS tagging for short — is not only a preparatory step for parsing. The information gained about a document by POS tagging and evaluating its results is valuable in its own right: the ratio of tokens not classifiable by the POS tagger to classified tokens is a direct indication of the degree of lexical coverage.

In principle, a number of approaches are usable for POS tagging (e.g., the work of Brill10). We decided to avoid approaches based on (supervised) learning from tagged corpora, since the costs of creating the necessary training data are likely to be prohibitive for our users (especially in specialized sublanguages). The approach chosen was to make the best use of available resources for German and to enhance them with additional functionality. The tool chosen (MORPHIX11) is not only used in POS tagging but serves as a general morpho-syntactic component for German. The resources employed in XDOC's POS tagger are:

• the lexicon and the inflectional analysis from the morpho-syntactic component MORPHIX, and
• a number of heuristics (e.g., for the classification of tokens not covered in the lexicon).

For German, the morphology component MORPHIX has been developed in a number of projects and is available in different realizations. This component has the advantage that the closed-class lexical items of German — e.g., determiners, prepositions, pronouns etc. — as well as all irregular verbs are fully covered. The coverage of open-class lexical items depends on the amount of lexical coding. Paradigms, such as those for verb conjugation and noun declination, are fully covered; to be able to analyze and generate word forms, however, their roots and their classification need to be included in the MORPHIX lexicon.

We exploit MORPHIX — in addition to its role in syntactic parsing — for POS tagging as well.
If a token in a German text can be morphologically analysed with MORPHIX, the resulting word class categorisation is used as POS information. Note that this classification need not be unique. Since the tokens are analysed in isolation, we often get ambiguous results with multiple analyses. Some examples: the token "der" may either be a determiner (with a number of different combinations of the features case, number and gender) or a relative pronoun; the token "liebe" may be either a verb or an adjective (again with different feature combinations not relevant for POS tagging).

Since we do not expect extensive lexicon coding at the beginning of an XDOC application, some tokens will not receive a MORPHIX analysis. We then employ two techniques. We first try to make use of heuristics that are based on aspects of the tokens that can easily be detected with simple string analysis (e.g., upper-/lowercase, endings etc.) and/or exploitation of the token position relative to sentence boundaries (detected by the structure detection module). If a heuristic yields a classification, the resulting POS class is added together with the name of the employed heuristic (marked as the feature SRC, which stands for source; cf. Example 1). If no heuristic is applicable, we classify the token as a member of the class unknown (tagged with XXX).

Example 1: Unknown Tokens Classified as Noun with Heuristics.
<N SRC="...">Blutanhaftungen</N> an der <N SRC="...">Gekroesewurzel</N>
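A minimal sketch of such fallback heuristics is given below; the heuristic names, the capitalization rule and the adjective endings are illustrative assumptions rather than XDOC's actual rule set.

```python
# Illustrative fallback heuristics for tokens without a MORPHIX analysis.
# Returns (POS class, heuristic name), or ("XXX", None) if nothing applies.
ADJ_ENDINGS = ("ig", "isch", "lich")   # assumed endings, for illustration only

def classify_unknown(token, sentence_initial=False):
    if token[:1].isupper() and not sentence_initial:
        # Capitalised tokens not at sentence start are likely German nouns.
        return "N", "uppercase-heuristic"
    if token.lower().endswith(ADJ_ENDINGS):
        return "ADJ", "ending-heuristic"
    return "XXX", None

print(classify_unknown("Gekroesewurzel"))      # ('N', 'uppercase-heuristic')
print(classify_unknown("ungehoerig"))          # ('ADJ', 'ending-heuristic')
print(classify_unknown("erfolgte", True))      # ('XXX', None)
```

The name of the triggering heuristic would then be recorded in the SRC feature of the resulting annotation, exactly as in Example 1 above.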
To keep the POS tagger fast and simple, the disambiguation between multiple POS classes for a token and the derivation of a possible POS class from context for an unknown token are postponed until syntactic processing. This is in line with our general principle of accepting results with overgeneralisation when a module is applied in isolation (here: POS tagging) and relying on filtering ambiguous results at a later stage of processing (here: exploiting the syntactic context).

3.1.4. Sentence Splitter

The functions of the syntactic parser, the phrase detector and the analyses of the semantic module take sentences as input. The sentence splitter
divides the text to be analysed into a sequence of sentences.

3.2. Syntactic Module
The input data for this module are sentences annotated with POS tags. The module contains methods based on a bottom-up parser; its functions are a syntactic parser and a phrase detector. The basis of these functions is a chart parsing machinery, which works with different rule sets to realize the different functions.

3.2.1. Syntactic Parser

For syntactic parsing we apply a chart parser based on context-free grammar rules augmented with feature structures. Robustness is achieved by allowing as input elements:12

• multiple POS classes,
• unclassified tokens from open word classes, and
• tokens with a POS class, but without or with only partial feature information.

The last case results from heuristics in POS tagging that allow us to assume, for instance, the class noun for a token but do not suffice to detect its full paradigm from the token (note that there are about two dozen different morphosyntactic paradigms for noun declination in German).

For a given input, the parser attempts to find all complete analyses that cover the input. If no such complete analysis is achievable, it tries to combine maximal partial results ('chunks') into structures covering the whole input.

A successful analysis may be based on an assumption about the word class of an initially unclassified token (tagged XXX). This is indicated in the parsing result (feature AS) and can be exploited for learning such classifications from contextual constraints. In a similar way, the successful combination of known feature values from closed-class items (e.g., determiners, prepositions) with underspecified features (written as "_" in attribute-value pairs) in agreement constraints allows the determination of paradigm information from successfully processed occurrences. See Example 2: in the input string, the word "Mundhoehle" (oral cavity) was not available in the lexicon. Because it is uppercase and not at sentence-initial position, it is heuristically classified as a noun. The successful parse of the prepositional phrase (PP) "in der Mundhoehle" allows the derivation of features from
the determiner within the PP (e.g., the lexical gender feminine for "Mundhoehle").

Example 2: Unknown Token Classified as Adjective and Lexical Features Derived Through Contextual Constraints.
kein <XXX AS="ADJ">ungehoeriger</XXX> Inhalt in der Mundhoehle
The processing time grows with the number of rules. Therefore, the grammar used in syntactic parsing is organized in a modular way that allows the addition or removal of groups of rules. This ability to tailor the rule set is also exploited when the sublanguage of a domain contains linguistic structures that are unusual or even ungrammatical in standard German.

3.2.2. Phrase Detector

There are applications where a full syntactic analysis is not necessary. Sometimes it is sufficient to be able to quickly detect key phrases in a document. If the phrases are totally fixed, a string matcher is sufficient. Realistically, however, there will always be some variability in phrasal patterns, so a more flexible phrase matcher is needed. See, for example, the simple phrase "zahlreiche Blutungen in der Gesichtshaut." (English: numerous bleedings in the skin of the face). The word "zahlreiche" can, for instance, be replaced by the word "massenhaft" (English: large quantities of). This can be expressed by the following generalized pattern:

MP-SI: HIGH-QUANTITY N["blutungen"] PRP["in"] DETD N["gesichtshaut"] IP
HIGH-QUANTITY: ADJ["zahlreich"] | ADV["massenhaft"]
The corresponding rule (in XML coding) for the chart parser is defined in Example 3.

Example 3: Excerpt from the 'Pattern Grammar'.
<ELEMENT>HIGH-QUANTITY</ELEMENT>
<ELEMENT CONT="blutungen">N</ELEMENT>
<ELEMENT CONT="in">PRP</ELEMENT>
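The sketch below shows one way such a generalized pattern can be matched against a POS-tagged token sequence. The pattern encoding and the matching routine are illustrative assumptions and not XDOC's chart-parser implementation.

```python
# Illustrative matcher for a generalized phrase pattern over (word, POS) pairs.
# Each pattern element maps allowed POS tags to a required word stem (None = any word).
PATTERN_MP_SI = [
    {"ADJ": "zahlreich", "ADV": "massenhaft"},  # HIGH-QUANTITY alternatives
    {"N": "blutungen"},
    {"PRP": "in"},
    {"DETD": None},                             # any determiner
    {"N": "gesichtshaut"},
    {"IP": None},                               # sentence-final full stop
]

def element_matches(element, word, pos):
    """A token matches if its POS tag is allowed and the word constraint (if any) fits."""
    if pos not in element:
        return False
    required = element[pos]
    return required is None or word.lower().startswith(required)

def match_pattern(pattern, tagged_tokens):
    """Return True if the pattern matches the tagged sentence from its start."""
    if len(tagged_tokens) < len(pattern):
        return False
    return all(element_matches(el, w, p)
               for el, (w, p) in zip(pattern, tagged_tokens))

sentence = [("zahlreiche", "ADJ"), ("Blutungen", "N"), ("in", "PRP"),
            ("der", "DETD"), ("Gesichtshaut", "N"), (".", "IP")]
print(match_pattern(PATTERN_MP_SI, sentence))   # True
```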
The chart parser annotates recognized structures in the document.

3.3. Corpus-Based Module
Currently this module contains one method: the analysis of collocations. In XDOC we not only count co-occurrence data for pairs of simple tokens but also evaluate pairwise occurrences of phrases with phrases or tokens. We analyse which combinations of noun phrases with adjectives, verbs and prepositional phrases are possible in a domain. Co-occurrence data are exploited for the creation of an initial ontology.7
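A minimal sketch of such pairwise counting is shown below; the notion of a 'phrase' is reduced here to pre-extracted strings, which is an illustrative simplification of the corpus-based module.

```python
from collections import Counter
from itertools import combinations

def collocation_counts(sentences):
    """Count how often two units (phrases or tokens) occur in the same sentence.
    `sentences` is a list of lists of already-extracted units."""
    counts = Counter()
    for units in sentences:
        for a, b in combinations(sorted(set(units)), 2):
            counts[(a, b)] += 1
    return counts

corpus = [
    ["Leber", "dunkelrot"],
    ["Leber", "vergroessert"],
    ["Leber", "dunkelrot"],
]
for pair, n in collocation_counts(corpus).most_common(2):
    print(pair, n)
# ('Leber', 'dunkelrot') 2
# ('Leber', 'vergroessert') 1
```

Frequent pairs of this kind are the raw material from which candidate concepts and relations for an initial domain ontology can be proposed.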
3.4. Semantic Module
The semantic module contains methods that are necessary for the semantic interpretation and understanding of natural language texts. In the following, we describe a semantic tagger and a case frame analysis which combines isolated meanings of words into a complex concept structure. Finally, we describe a method for the semantic interpretation of specific syntactic structures.

3.4.1. Semantic Tagger

Most content words in natural language have a number of different readings. The word 'paper' may — among others — refer to a 'publication' or to a 'piece of matter'. A semantic tagger tries to classify content words into their semantic categories (different applications may have different organisations of those categories into taxonomies or ontologies). For this function we expect as input a text tagged with POS tags, and we apply a semantic lexicon. For each token, this semantic lexicon contains a semantic interpretation and a case frame with syntactic valency requirements. Similar to POS tagging, the tokens in the input are annotated with their meanings and a classification into semantic categories such as concepts and relations. Again, it is possible that the classification of a token in isolation is not unique.
In analogy to the POS tagger, a semantic tagger that processes isolated tokens is not able to disambiguate between multiple semantic categorisations. This task is postponed to contextual processing within case frame analysis (cf. below). Example 4 shows the result of the semantic tagging of the sentence Als Folge des Unfalles erfolgte die Herausnahme einer Niere (in English: in consequence of the accident, one kidney was removed). In this case Als and Folge are considered ambiguous. The word form Unfalles is unknown, but we can recognize it as a concept using the information from the POS tagger.

Example 4: Result of the Semantic Tagger.
<OPERATOR TYPE="temporal-gleichzeitig temporal-vorzeitig temporal-nachzeitig komparativ spezifizierend kausal restriktiv">Als</OPERATOR> <CONCEPT TYPE="sequence result">Folge</CONCEPT> des Unfalles erfolgte die Herausnahme <XXX>einer</XXX> Niere.
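A minimal sketch of this lexicon-driven tagging step is given below; the lexicon entries and category names are illustrative assumptions, not the contents of XDOC's semantic lexicon.

```python
# Illustrative semantic lexicon: word form -> list of possible semantic categories.
SEMANTIC_LEXICON = {
    "Folge":       [("CONCEPT", "sequence"), ("CONCEPT", "result")],
    "Herausnahme": [("CONCEPT", "removal")],
    "als":         [("OPERATOR", "temporal"), ("OPERATOR", "kausal")],
}

def semantic_tag(tokens_with_pos):
    """Annotate each token with all candidate semantic categories.
    Unknown nouns are still emitted as concepts; everything else stays untagged (XXX)."""
    result = []
    for word, pos in tokens_with_pos:
        readings = SEMANTIC_LEXICON.get(word) or SEMANTIC_LEXICON.get(word.lower())
        if readings:
            result.append((word, readings))
        elif pos == "N":
            result.append((word, [("CONCEPT", "unknown")]))  # fall back on POS information
        else:
            result.append((word, [("XXX", None)]))
    return result

for word, readings in semantic_tag([("Als", "PRP"), ("Folge", "N"), ("Unfalles", "N")]):
    print(word, readings)
```

As in the real module, ambiguous readings are simply kept side by side and handed on to case frame analysis for resolution.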
Semantic resources are in general more domain-dependent than syntactic ones. Like many other projects, we tried to make use of WordNet (in its German version GermaNet13) as a basis for the semantic lexicon. This proved to be difficult: the categories covered in GermaNet are too abstract to be directly usable in applications. For now, domain-specific semantic lexica have to be encoded manually. This is not a serious problem for experimental prototypes, but for large-scale applications support for the creation of these resources is of vital interest. A planned but not yet worked out approach is to exploit learning techniques to automatically extract subcategorisation information from the results of robust noun phrase and prepositional phrase chunking. Subcategorisation frames for verbs (i.e., capturing what complements a verb takes) are the basis for the subsequent abstraction of case frames.
3.4.2. Case Frame Analysis

Case frame analysis of a token uncovers details about the type of recognized concepts (resolving multiple interpretations) and their possible relations to other concepts. The detection of complex concepts by case frame analysis takes as input the results of syntactic analysis (chart parser) and semantic tagging. The resources are case frames coded in the semantic lexicon; the case frames express the semantic and syntactic constraints on a potential filler of a relation.

Example 5: Example of an Instantiated Case Frame.
<WORD>Herausnahme <SLOTS> BONE N(gen, fak) <ASSIGN_TO> ORGAN N(gen, fak) einer Niere <WORD>Niere
The output is instantiated case frames with roles filled by those syntactic phrases that are consistent with both the semantic and the syntactic constraints. The results are again annotated with XML tags. Example 5 is a partial result of the analysis of the example from the semantic tagger: the concept parser has assigned the phrase "einer Niere" as the filler of the relation 'patient', which requires a concept with the category 'organ' (element ASSIGN_TO) and a syntactic noun phrase in the genitive case as filler (element FORM).

3.4.3. Semantic Interpretation of Syntactic Structure

Another step in the analysis of the relations between tokens can be the interpretation of the syntactic structure of a phrase or sentence. We exploit the syntactic structure of the sublanguage to extract the relations between several tokens. The semantic structure analysis expects as input the results from the chart parser and from semantic tagging. The results are relations between concepts, which are derived from the syntactic structure of phrases or whole sentences. The resource for this analysis is a lexicon with information about the semantic interpretation associated with rules from the context-free grammar used by the chart parser. For example, syntactic structures with nouns and adjectives as constituents typically denote attribute-value structures, which can be expressed as a has-attribute relation; the name of the recognized relation is composed of 'has-' and the meaning of the adjective.

Example 6: Excerpt from Lexicon for Semantic Structure Analysis.
Example 6 shows the entry in the lexicon for the interpretation of adjective-noun structures. The element SYNTACTIC contains the names of the rules in the context-free grammar which describe adjective-noun structures. The general name of a relation is defined with the tag TYPE, and the element FILLER contains the POS tag of the filler of the complete relation name — in this case, the semantic interpretation of the adjective token inside the structure. Consider, for example, a typical phrase from an autopsy report: Leber dunkelrot (in English: liver dark red). From semantic tagging we obtain the following information:

Example 7: Results of Semantic Tagging.
Leber dunkelrot
In this example we can extract the relation 'has-color' between the tokens Leber and dunkelrot. This is an example of a simple semantic relation. Other semantic relations can be described through more complex variations. In these cases, we must consider linguistic structures like modifiers (e.g., etwas), negations (e.g., nicht), coordinations (e.g., Beckengeruest unversehrt und fest gefuegt) and complex noun groups (e.g., Bauchteil der grossen Koerperschlagader).

4. Related Work

In GATE14 the idea of piping simple modules in order to achieve complex functionality was applied to NLP with such a rigid architecture for the first time. The project LT XML15 pioneered XML as a data format for linguistic processing. Both GATE and LT XML were employed for processing English texts. SMES16 was an attempt to develop a toolbox for message extraction from German texts. A disadvantage of SMES that is avoided in XDOC is the lack of a uniform encoding formalism; in other words, users are confronted with different encodings and formats in each module.
5. Conclusion

We have presented an approach to the analysis of web documents — and other electronically available document collections — that is based on the combination of XML technology with NLP techniques. The end-users of the XDOC toolbox are domain experts (e.g., engineers, medical doctors, economists, etc.) who are naive with regard to NLP. For these users, the barrier to overcome before being able to run experiments on their document collections and to get initial results should be as low as possible. Therefore, all modules of XDOC have been designed to be as robust as possible. In the case of gaps in resources (e.g., lexical gaps, missing syntactic structures etc.) the system is geared to exhibit graceful degradation.

We have reported on work in progress. An up-to-date version of XDOC is available for on-line experimentation via http://lima.cs.uni-magdeburg.de:8000. The key issue for the ongoing work is to fully exploit the inherent potential for learning as a side effect of processing large quantities of related documents. Learning will be especially important for the incremental update of XDOC's resources. Initial experiments with corpus-based methods have been promising:

• In combination with results from syntactic analysis (cf. Example 1), detailed paradigm information for entries in the morphosyntactic lexicon can be deduced from varying occurrences of unknown tokens.
• When sentence candidates cannot be given a complete syntactic analysis, the coverings produced from completed constituents (cf. Section 3.2.1) may be exploited for suggesting new grammar rules.
• Abstraction of case frames from multiple occurrences of verbs and their complements may ease the construction of domain-specific semantic lexica.
References

1. World Wide Web Consortium (ed.), "Semantic Web Activity Overview", Boston, USA, URL: http://www.w3.org/2001/sw, 2001.
2. B. Habegger and M. Quafafou, "Multi-Pattern Wrappers for Relation Extraction from the Web", in ECAI 2002, Proceedings of the 15th European Conference on Artificial Intelligence, F. van Harmelen (ed.), IOS Press, Amsterdam, 2002, pp. 395-399.
3. M. Kunze and C. Xiao, "An Approach for Resource Sharing in Multilingual NLP", in T. Vidal and P. Liberatore (eds.): Proceedings of the STarting Artificial Intelligence Researchers Symposium STAIRS 2002, IOS Press, Amsterdam, 2002, pp. 123-124.
4. D. Rosner and M. Kunze, "An XML Based Document Suite", in Proceedings of the 2002 International Conference on Computational Linguistics, Taipei, 2002, pp. 1278-1282.
5. T. Bray, J. Paoli, C. M. Sperberg-McQueen and E. Maler (eds.), "Extensible Markup Language (XML) 1.0 (Second Edition)", World Wide Web Consortium, W3C Recommendation, Boston, USA, URL: http://www.w3.org/TR/2000/REC-xml-20001006, 2000.
6. J. Clark, "XSL Transformations (XSLT) Version 1.0", World Wide Web Consortium, W3C Recommendation, 1999.
7. D. Rosner and M. Kunze, "Exploiting Sublanguage and Domain Characteristics in a Bootstrapping Approach to Lexicon and Ontology Creation", in Proceedings of OntoLex 2002 - Ontologies and Lexical Knowledge Bases at LREC 2002, Las Palmas, 2002, pp. 68-73.
8. S. Krotzsch and D. Rosner, "Towards Extraction of Company Profiles from Webpages", in Proceedings of the 2nd International Workshop on Databases, Documents, and Information Fusion, Karlsruhe, 2002, pp. 29-36.
9. W. Buntine and H. Tirri, "Multi-Faceted Learning for Web Taxonomies", in Proceedings of the 2nd Workshop on Semantic Web Mining, Helsinki, 2002, pp. 52-60.
10. E. Brill, "A Simple Rule-Based Part-of-Speech Tagger", in Proceedings of the Third Conference on Applied Natural Language Processing, 1992, pp. 152-155.
11. W. Finkler and G. Neumann, "A Fast Realization of a Classification-Based Approach to Morphology", in Proceedings of the 4. Österreichische Artificial-Intelligence-Tagung, Wiener Workshop Wissensbasierte Sprachverarbeitung, H. Trost (ed.), Springer Verlag, Berlin, 1988, pp. 11-19.
12. D. Rosner, "Combining Robust Parsing and Lexical Acquisition in the XDOC System", in Proceedings of KONVENS 2000: 5. Konferenz zur Verarbeitung natürlicher Sprache, 2000, pp. 75-80.
13. C. Kunze and L. Lemnitzer, "GermaNet - representation, visualization, application", in Proceedings of LREC 2002, Las Palmas, 2002, pp. 1485-1491.
14. H. Cunningham, "GATE - A General Architecture for Text Engineering", in Computers and the Humanities, Vol. 36, 2002, pp. 223-254.
15. Language Technology Group (LTG), "LT XML version 1.1", http://www.ltg.ed.ac.uk/software/xml/, 1999.
16. G. Neumann, R. Backofen, J. Baur, M. Becker and C. Braun, "An Information Extraction Core System for Real World German Text Processing", in Proceedings of the 5th Conference on Applied Natural Language Processing, Washington, 1997, pp. 208-215.
Part II. Document Analysis for Adaptive Content Delivery
CHAPTER 5
REFLOWABLE DOCUMENT IMAGES
Thomas M. Breuel, William C. Janssen, Kris Popat, and Henry S. Baird
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304, U.S.A.
E-mail: {breuel, janssen, popat, baird}@parc.com

Millions of documents on the Internet exist in page- or image-oriented formats like PDF. Such documents are currently difficult to read on-screen and on handheld devices. This paper describes a system for the automatic analysis of a document image into atomic fragments (e.g. word images) that can be reconstructed or "reflowed" onto a display device of arbitrary size, depth, and aspect ratio. This allows scans and other page-image documents to be viewed effectively on a limited-resolution screen or hand-held computing device, without any errors and losses due to OCR and retypesetting. The methods of image analysis and representation are described.

1. Introduction

Millions of documents on the Internet exist in page- or image-oriented document formats like PDF, PostScript, TIFF, JBIG2, or DjVu.6 While such documents are designed to be read easily when printed on paper, they do not come with the layout information that would allow them to be reflowed to adapt to the smaller window sizes found on handhelds and in web browsers. This chapter, an expanded version of the paper presented at ICPR,4 describes ongoing work on developing document image representations that combine the fidelity and simplicity of image-oriented document formats with the versatility of structured documents.

Perhaps surprisingly, page-oriented formats are not just being used for scanned legacy documents, but continue to be widely used for publishing documents on-line. There are a number of reasons for this. Many documents are still primarily generated in word processing or typesetting
systems for the purpose of printing. While word processor and typesetting inputs are reflowable, they are generally too complex to be used as document interchange formats, and they depend on complex, often proprietary, software systems for viewing. They often also contain sensitive information, like revision and author tags. And they often depend on fonts and other resources for proper display. For web publication, authors generally convert them into a portable, image-oriented format like PDF or PostScript. In contrast, generating formats like PDF or PostScript is very easy, since word processors already need to contain PostScript drivers for printers. Furthermore, pagination and line breaks are important (and often required) in many legal and academic contexts for cross-referencing. And viewers and decoders for page-oriented formats are simpler and widely available.

Unfortunately, page-oriented document formats can be hard to read on-screen (see Fig. 1) or on handheld devices (see Fig. 2). Without the ability to reflow documents during display, reading such documents on a display that is smaller than page size either requires scaling down the image until the text becomes unreadably small, or scrolling around on the page while reading lines or columns. This is not a question of better display technology, but simply a result of the form factor constraints that exist on many physical devices.

HTML, SGML, and XML would seem to offer a third option, combining a simple, open format with the ability to reflow. But these formats suffer from many of the same problems as other structured document formats: they require complex software systems to be rendered, they often depend on proprietary features, and they cannot guarantee that text retains its intended appearance.

What we would like is a system that gives control over generating reflowable documents to the recipient of a document. The obvious thing to try would be to convert the document into symbolic form by performing optical character recognition (OCR), which generates structured and reflowable documents from bitmapped images. However, OCR systems have difficulties with many important kinds of documents, like those containing chemical or mathematical formulas, as well as special fonts. The user of existing OCR systems can never be quite certain that the text they are seeing is accurate. OCR systems also do not retain the exact appearance of text, often substituting fonts. Many of these problems remain even if the OCR system represents difficult-to-recognize characters as bitmaps, as has been proposed by Hong and Srihari.7
Fig. 1. This screenshot illustrates a common problem when trying to read a PDF document in a browser: reading the entire page requires repeated scrolling up and down.
Our approach combines the ability of OCR systems to reflow documents with the fidelity and simplicity of image-based capture and representation. Essentially, we represent documents as a large collection of small image elements. Each element represents a character, word, image, table, or figure. The image elements are associated with layout control information, allowing them to reflow like words in a regular word processing document.

In the rest of this paper, we first describe the image layout analysis we are currently using for creating reflowable image representations. We then discuss issues of representing reflowable content and displaying it using a variety of existing and new display applications and devices.

2. Image and Layout Analysis

Image and layout analysis transforms the raw document image into reflowable components. For general background information on document image and layout analysis, the reader is referred to the handbook edited by Bunke and Wang.5 The techniques used in our system are described in more detail
Fig. 2. Screen shots from attempts to read a document on a handheld using DjVu or Adobe PDF. On the left-hand side, the document can be read in its entirety (the width of the page fits the screen width), but the characters are too small for reading. The middle screenshot shows a DjVu document where the characters are readable (if small), but reading the document requires horizontal scrolling for each line. The right-hand screenshot shows the same document displayed in a PDF viewer. Since the font used for the document was not included in the PDF file and was also unavailable on the handheld, the PDF display program performed font substitution, further degrading the quality of the document image.
in a recent paper by Breuel.3

2.1. Text/Image Segmentation
Line art, diagrams, and other images need to be treated separately from the textual content of a document: image processing, rescaling, and reflowing are different for images and for textual components. It is also important to retain any textual components of an image (labels, annotations) as part of that image, rather than reflowing them with the rest of the text.

There are two stages during which images can be detected. First, grayscale or color images are best detected on the unprocessed input image. This can be done using standard document image segmentation mechanisms, although such mechanisms have not yet been incorporated into the current system. Users can also manually input bounding rectangles for images into the system. Second, images like illustrations and diagrams are also detected as part of connected component analysis and layout analysis. This will be described below.

2.2. Preprocessing
Image analysis begins with adaptive thresholding and binarization. For each pixel, we determine the maximum and minimum values within a region around the pixel using grayscale morphology. If the difference between these two values is smaller than a threshold (determined statistically), the region is judged to contain only white pixels. (If the difference is small in a uniform black image region, it will be erroneously classified as white, but this will happen rarely, as the region size is chosen to be larger than the vast majority of black marks.) If the difference is above the threshold, the region contains both black and white pixels, and the minimum and maximum values represent the black ink and white paper background values, respectively. In the first case, the pixel value is normalized by bringing the estimated white level up to the nominal white level of the display. In the second case, the pixel value is normalized by expanding the range between the estimated white and black levels to the full range between the nominal white and black levels of the display. After this normalization process, a standard thresholding method can be applied.

In the thresholded image, connected components are labeled using a scan algorithm combined with an efficient union-find data structure. Then, a bounding box is determined for each connected component.
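The following sketch shows the flavor of the normalization step; the window size and the statistically determined threshold are illustrative assumptions, and scipy's rank filters stand in for the grayscale morphology described above.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def adaptive_binarize(gray, window=31, contrast_threshold=40):
    """gray: 2-D uint8 array (0 = black, 255 = white). Returns a boolean ink mask."""
    gray = gray.astype(np.float32)
    local_max = maximum_filter(gray, size=window)   # estimate of the paper background
    local_min = minimum_filter(gray, size=window)   # estimate of the ink level
    contrast = local_max - local_min

    normalized = np.full_like(gray, 255.0)          # low-contrast regions become white
    mixed = contrast >= contrast_threshold          # regions containing both ink and paper
    span = np.maximum(contrast[mixed], 1.0)
    normalized[mixed] = (gray[mixed] - local_min[mixed]) / span * 255.0

    return normalized < 128                         # standard global threshold on the result
```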
This results in a collection of usually several thousand connected components per page. Each connected component may represent a single character, a character part, a collection of touching characters, background noise, or part of a line drawing or image. The bounding boxes of these connected components are the basis of the subsequent layout analysis.
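For illustration, the sketch below obtains labeled components and their bounding boxes with scipy, which stands in for the scan/union-find labeling described above; it is not the system's actual implementation.

```python
from scipy.ndimage import label, find_objects

def connected_component_boxes(ink_mask):
    """ink_mask: boolean array from binarization. Returns (x0, y0, x1, y1) boxes."""
    labels, n = label(ink_mask)         # 8-connectivity can be chosen via a structure argument
    boxes = []
    for sl in find_objects(labels):     # one (row slice, column slice) pair per component
        if sl is None:
            continue
        rows, cols = sl
        boxes.append((cols.start, rows.start, cols.stop, rows.stop))
    return boxes
```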
2.3. Layout Analysis
For layout analysis, we are primarily interested in the bounding boxes corresponding to characters in the running text of the document, as well as in a few other page elements like headers, footers, and section headings. We are interested in these particular bounding boxes because they give us important information about the layout of the page that we need for reflowing it. In particular, these bounding boxes and their spatial arrangement can tell us the page rotation and skew, where the column boundaries are, which tokens we should consider for token-based compression, what the reading order is, and how text should flow between different parts of the layout. Bounding boxes that are not found to represent "text" in this filtering operation are not lost, however; they can later be incorporated into the output of the system as graphical elements.

The dimensions of bounding boxes representing body text are found using a simple statistical procedure. If we consider the distribution of heights as a statistical mixture of various components, then for most pages containing text, the largest mixture component comes from lower-case letters at the predominant font size. We use this size to find the x-height of the predominant font, and use this dimension to filter out bounding boxes that are either too small or too large to represent body text or standard headings.

Given a collection of bounding boxes representing text, we are interested in finding text lines and column boundaries. The approach used in the prototype system for identifying text lines and column boundaries relies on a branch-and-bound algorithm that finds maximum likelihood matches against line models under a robust least square error model (equivalently, a Gaussian noise model in the presence of spurious background features).2 Text line models are described by three parameters: the angle and offset of the line, and the descender height. Bounding boxes whose alignment point, the center of their bottom side, rests either on the line or at a distance given by the descender height below it are considered to match the line; matches are penalized by the square of their distance from the model, up to a threshold value ε, usually of the order of five pixels.
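A minimal sketch of the scoring underlying such a match is shown below; the branch-and-bound search itself is omitted, and the parameterization is only an illustrative reading of the description above.

```python
import math

def line_match_score(boxes, angle, offset, descender, eps=5.0):
    """Robust (truncated) squared-error score of a text line model against boxes.
    Each box is (x0, y0, x1, y1) with y growing downwards; the alignment point is
    the center of the bottom side. Lower scores mean better matches."""
    score = 0.0
    for x0, y0, x1, y1 in boxes:
        ax, ay = (x0 + x1) / 2.0, y1
        baseline_y = math.tan(angle) * ax + offset
        # distance to the baseline or to the descender line, whichever is closer
        d = min(abs(ay - baseline_y), abs(ay - (baseline_y + descender)))
        score += min(d, eps) ** 2      # truncation makes the model robust to outliers
    return score
```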
After a text line has been found, the bounding box of all the connected components that participated in the match is computed, and all other connected components that fall within that bounding box are assigned to the same text line; this "sweeps up" punctuation marks, accents, and "i"-dots that would otherwise be missed. Within each text line, multiple bounding boxes whose projections onto the baseline overlap are merged; this results in bounding boxes that predominantly contain one or more characters (as opposed to bounding boxes that contain character parts). The resulting bounding boxes are then ordered by the x-coordinate of their lower left corner to obtain a sequence of character images in reading order. Multiple text lines are found using a greedy strategy, in which first the top match is identified, the bounding boxes that participated in the match are removed from further consideration, and the next best text line is found, until no good text line matches can be identified anymore.

This approach to text line modeling has several advantages over the traditional projection or linking methods. First, due to scanning artifacts, different text lines can have different orientations. Second, by taking into account both the baseline and the descender line, the technique can find text lines that are missed by other text line finders. Third, the matches returned by the method follow the individual text lines more accurately than most other methods.2

Column boundaries are identified in an analogous manner, by finding globally optimal maximum likelihood matches of the centers of the left sides of bounding boxes against a line model. In order to reduce background noise, prior to applying the line finder to the column finding problem, statistics about the distribution of horizontal distances between bounding boxes are used to estimate the inter-character and inter-word spacing (the two largest components in the statistical distribution of horizontal bounding box distances), and bounding boxes for characters are merged into words. This reduces the number of bounding boxes that need to be considered for column matching severalfold and thereby improves the reliability of column boundary detection.

We are currently adapting the system to use the segmentation methods described in a recent paper by Breuel.3 Those methods allow for more reliable identification of column boundaries. They work by first identifying rectangles of empty page background satisfying conditions on their aspect ratio and proximity to textual components. Such whitespace rectangles correspond to column boundaries with high probability.1,3 Text lines are then detected by a constrained text line finding method similar to the one described above.
With either layout analysis method, large connected components that are not part of a text line are treated as images. Such components are merged repeatedly with any connected components that they overlap; the resulting collection of connected components is treated as a single image.

The output of the layout analysis is a collection of text line segments, images, and column boundaries. Unlike other layout analysis methods, for generating reflowable documents we do not require a complete, hierarchical analysis of document structure. All that is left to determine is how to link the text line segments together in reading order. For a single-column document, by enumerating text lines and bounding boxes of images in order of their y-coordinates, we obtain a sequence of characters, whitespace, and images in reading order. For a double-column document, the two columns are treated as if the right column were placed under the left column.

This simple layout analysis algorithm copes with a fairly wide range of commonly found layouts in printed documents and transforms them into a sequence of images that can be reflowed and displayed on a handheld device. In part, a simple algorithm works well in these applications because the requirements of reflowing for a handheld document reader are less stringent than for other layout analysis tasks, like rendering into a word processor. Since the output of the layout analysis will only be used for reflowing and not for editing, no semantic labels need to be attached to text blocks. Text that is misrecognized as an image (for example, a section heading printed in large type) will still reflow properly and often be completely indistinguishable from correctly segmented text. Because the documents are reflowed on a small screen, there is also no user expectation that a rendering of the output of the layout analysis precisely match the layout of the input document. Furthermore, if page elements like headers, footers, or page numbers are incorporated into the output of the layout analysis, users can easily skip them during reading, and such elements may serve as convenient navigational signposts on the handheld device as well. Even some errors in layout analysis, like misattributing a figure caption to the body of a text, can be tolerated by readers.
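A sketch of this reading-order linking is given below; the two-column case assumes a single known column boundary, which is an illustrative simplification.

```python
def reading_order(elements, column_boundary=None):
    """elements: list of (x0, y0, x1, y1) boxes for text lines and images.
    Sort top-to-bottom; with a column boundary, emit the left column first,
    then the right column, as if it were placed underneath."""
    def by_y(items):
        return sorted(items, key=lambda b: b[1])

    if column_boundary is None:
        return by_y(elements)
    left = [b for b in elements if b[0] < column_boundary]
    right = [b for b in elements if b[0] >= column_boundary]
    return by_y(left) + by_y(right)
```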
3. HTML-Based Representations

The result of the decomposition process described above is a sequence of text images and illustrations, along with meta-information about formatting such as paragraph breaks and line justification.
Fig. 3. A page from Jepson's A FLORA OF CALIFORNIA: (a) original; (b) in a web browser with boxes drawn around the image elements; and (c) in a web browser without the boxes.
Fig. 4. The Jepson page rendered by Microsoft's Reader electronic book viewer.
Many existing Web formats are well-suited for the representation of such data. HTML,11 the standard for the World Wide Web, supports the layout, in reading order, of a sequence of image elements, along with formatting information. The successor to HTML, XHTML,10 uses the more rigorous XML syntax but provides the same functionality. A group of commercial electronic-book interests is also defining the Open eBook Publication Structure,12 which also uses the XML syntax, incorporates the XHTML functionality, adds the ability to package multiple XHTML files into a single publication, and provides standards for the addition of document metadata.

Figure 3(a) shows a sample page from Willis Linn Jepson's A FLORA OF CALIFORNIA,8 an important scholarly work in the field of botany, in which typeface and relative type size choices are significant. Once decomposed into image elements, the logical connections between those elements can be represented with HTML. Figure 3(b) shows an HTML representation of the decomposition of the Jepson page rendered in a standard Web browser, with boxes drawn around individual image elements.a
a The illustrations have been manually removed from the original and manually re-inserted into this version. We are investigating ways of doing this automatically.
Figure 3(c) shows that same page, but without the bounding boxes around the elements (the layout has changed slightly as well, since depicting the boxes in (b) takes up some of the available display space). Note that fonts and some characteristics of the original page have been retained, which would not necessarily be the case for an OCR-based transition strategy.
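To make the HTML-based representation concrete, the sketch below emits a reflowable page from a list of word-image files and their layout roles; the file naming and the minimal markup are illustrative assumptions rather than the system's actual output.

```python
from html import escape

def reflowable_html(elements):
    """elements: list of (kind, value) pairs in reading order, where kind is
    'word' (value = path of a small word image), 'para' (paragraph break) or
    'image' (value = path of an illustration)."""
    parts = ["<html><body><p>"]
    for kind, value in elements:
        if kind == "word":
            parts.append('<img src="%s" alt=""> ' % escape(value, quote=True))
        elif kind == "image":
            parts.append('</p><p><img src="%s" alt=""></p><p>' % escape(value, quote=True))
        elif kind == "para":
            parts.append("</p><p>")
    parts.append("</p></body></html>")
    return "".join(parts)

page = reflowable_html([("word", "w0001.png"), ("word", "w0002.png"),
                        ("para", None), ("image", "fig1.png")])
```

Because the browser treats each word image like an inline word, the page reflows automatically to any window width.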
4. Reader Applications

A problem with all of the Web formats is that a decomposed document will typically consist of a very large number of separate files. This configuration tends to strain the capabilities of the underlying technology support platforms. Token-based compression schemes9 can make this problem somewhat more manageable, but cannot really solve it. Most electronic-book document formats, however, alleviate this problem by packaging the image elements together with the layout directives in a single file.

Dozens of electronic book formats exist, among them Microsoft Reader, Adobe Acrobat Reader, Palm Reader, and Gemstar's RCA 1100 and 1200 formats. Our system of decomposed image elements can be supported by most (probably all) of these formats. Figure 4 shows an example of the Jepson page displayed in Microsoft's Reader viewer program. This was achieved by rendering the decomposed elements into the Open eBook Publication Structure, then using the OverDrive ReaderWorks distiller for Reader to create the electronic book. Similar approaches seem to suffice for all of the other popular electronic book formats.

Common electronic book formats, and the associated viewer applications, are optimized for 'books' consisting mainly of text, with occasional images. This can lead to performance problems in both the time and space domains. Microsoft Reader version 1,b for example, running on a 500 MHz Pentium III machine, takes approximately 20 seconds to lay out and display the first page of the Jepson sample shown above. But another system called 'Plucker'c provides timely layout and scrolling of the converted document, and averages only 70KB per converted page. We are currently investigating alternative storage formats and layout/display techniques optimized for this class of electronic book.
b see http://www.microsoft.com/reader/
c see http://www.plkr.org/
5. New Document Formats

We are currently developing new document formats for reflowable content. Rather than devising a completely new format, as systems like PDF and DjVu have attempted, we are building on existing image formats and annotating them with additional layout information. This approach means that we can take advantage of the considerable work that has gone into the development of existing image compression formats and reuse large amounts of existing code.

One format that we are developing is an extension to the Plucker document format. The Plucker format can be understood as a simplified form of HTML, encoded in binary, and packed into a single Palm database file. Plucker already allows images to be incorporated into its documents, but it does not cope well with the large numbers of small images that occur in reflowable document images. Our extension to the Plucker format allows image tags not only to include complete images in a document for display, but also to refer repeatedly to a single image within the document and display different subimages. That is, this new form of Plucker image tag specifies a bounding rectangle in the source image to be inserted into the Plucker text. That way, a single image included in a Plucker document can be used to represent hundreds of different word images. Furthermore, the fact that these included images are much larger than individual word images also helps compression algorithms achieve better compression on them.

A second format we are developing is based on document images in formats like TIFF and DjVu. We augment these document images with a representation of the document layout — a list of bounding boxes in reading order. The layout information that needs to be represented per page is small (a few thousand bytes). The additional information can be kept separate from the document image, allowing the document image to remain completely unchanged. It can also be incorporated into the extended chunks available in formats like TIFF, DjVu, and PDF as part of the image file, resulting in completely self-contained document image files suitable both for display in their original format and for reflowable display.
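As a sketch of the second idea, the snippet below serializes per-page layout information as a small sidecar next to an unmodified document image; the JSON structure and the field names are illustrative assumptions, not the format actually being developed.

```python
import json

def write_layout_sidecar(image_path, pages):
    """pages: list of pages, each a list of dicts like
    {"kind": "word", "box": [x0, y0, x1, y1]} given in reading order."""
    sidecar = {
        "image": image_path,              # the TIFF/DjVu file itself stays untouched
        "pages": [{"elements": page} for page in pages],
    }
    with open(image_path + ".layout.json", "w") as f:
        json.dump(sidecar, f)

write_layout_sidecar("scan.tiff", [[
    {"kind": "word", "box": [120, 80, 210, 110]},
    {"kind": "word", "box": [220, 80, 300, 110]},
]])
```

Such a sidecar is only a few thousand bytes per page, and a viewer that ignores it can still display the original image unchanged.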
6. Summary and Conclusions

As we noted in the beginning, page-oriented electronic document formats show no signs of disappearing, nor do devices whose display area is smaller than a full page of text. None of the three major classes of on-line document formats addresses the need to display page-oriented documents on such
devices: structured document formats like LaTeX or Microsoft Word are difficult to display, and documents are not usually available in that form; image-based formats like TIFF and DjVu require excessive scrolling on the part of the reader; and existing eBook content for PDAs preserves little of the original visual appearance of the document.

Our approach, reflowable document images, combines the best aspects of eBooks and structured documents with the best aspects of image-based formats. Like image-based formats, reflowable document images preserve the appearance of text, fonts, and images and are not subject to OCR errors. But reflowable document images also have many of the capabilities of structured document formats to adapt to different displays and output devices.

This chapter has described work in progress. Currently, the conversion software consists of a number of Java modules that can be invoked from the command line. Clearly, for widespread deployment, a fully automatic system with a convenient user interface is desirable; efforts are underway in our lab to combine the command line modules into such a system. Another ongoing area of work is the development of new viewer applications for handhelds.

A lot of documents have been cumbersome to use both on the web and on handhelds. We believe that in the future, the use of reflowable document images will be very important in making large amounts of on-line documents conveniently accessible both on handhelds and inside web browsers.
Acknowledgments

We are grateful to Dan S. Bloomberg for many of the original ideas around text reflowing leading to this work.
CHAPTER 6

EXTRACTION AND MANAGEMENT OF CONTENT FROM HTML DOCUMENTS
H. Alam, R. Hartono and A. F. R. Rahman*
Document Analysis and Recognition Team (DART), BCL Technologies Inc.,
990 Linden Drive, Suite #203, Santa Clara, CA 95050, USA
E-mail: fuad@bcltechnologies.com
URL: http://www.bcltechnologies.com
In recent times, the way people access information from the web has undergone a transformation. The demand for information to be accessible from anywhere, at any time, has resulted in the introduction of Personal Digital Assistants (PDAs) and cellular phones that are able to browse the web and can be used to find information over wireless connections. However, the small display form factor of these portable devices greatly diminishes the rate at which web sites can be browsed. Efficient algorithms are required to extract the content of web pages and build a faithful reproduction of the original pages on a smaller display with the important content intact.

1. Introduction

The problem of content extraction is important not only from the point of view of managing the amount of content; several other important issues are associated with it, some of which are:
• Viewing any website: Pattern recognition systems that use document analysis techniques can be employed for displaying web pages on small-screen devices by extracting and summarizing their content. These systems have to be generic enough to work with any web site, not only the well laid-out ones.
• High Speed Access: The transformation of web pages has to take place on the fly and therefore should be fast.
• Network Usage: The schemes employed for the transformation should not slow down network traffic by requiring additional resources.
* Corresponding author.
• Easy Configurability: Any such scheme should be easily configurable within existing systems by System Integrators (SIs) and end-users.
• Rapid Deployment: This is also a very important factor in software development and deployment.
• Non-Intrusive Design: Any such translation scheme should be built on top of a web site without modifying the actual web site.
• Multiple Views: The scheme should also allow the SIs and end-users to develop custom views.
2. Research Direction

A wireless PDA is designed to be carried in a person's pocket. The display area of such a device is a fraction of that available on a desktop computer, as shown in Fig. 1 and, graphically, in Fig. 2. For example, Palm devices display only 160 x 160 pixels. Hence, any web page designed for viewing on a 640 x 480 or higher-resolution PC display would be severely compromised on a handheld device. Only HP's half-VGA Jornada palmtops approach desktop resolutions. Various brands of PocketPCs are coming to market with somewhat higher resolutions, but their price point is still prohibitive for mass adoption.

This can be explored further using the case of cellular phones capable of accessing the web. Technical issues such as bandwidth, resolution and formats become extremely important factors with these devices. Present-generation cell-phone data rates are limited to 9600 bps, making bandwidth limitations a significant bottleneck to web deployment. However, in the longer term, 2.5G cellular is intended to be equivalent to v.90 wired modem speeds, and 3G cellular service targets data throughput roughly equivalent to DSL.
In terms of wireless LANs, Bluetooth is expected to achieve 721 kbps, and Intersil is presently shipping 802.11 chipsets that deliver 2 Mbps. Hence, bandwidth may be less of a problem in the future, assuming absolute channel capacity is not overwhelmed by the number of end-users.

However, display size and resolution are more important considerations than cellular bandwidth. Among the current generation of WML-ready GSM cell phones, Nokia's 3310 has a 48 x 84 display; the 7110 displays 96 x 65 pixels. Ericsson's R380s displays 120 x 360. There is a conflict here: smaller devices appeal to consumers, but small displays cannot show much information.

3. Current State of the Art

The current way of browsing the web using a wireless device can be very restrictive. At present, there are three solutions to this problem: handcrafting, transcoding and adaptive re-authoring.

3.1. Handcrafting

Custom web sites are typically crafted by hand by a set of content experts. This process is labor intensive and expensive. Because of the expense, typically less than 1% of web sites are converted to wireless content. Several companies are currently attempting to automate this conversion. The core of this automation requires web code written in XML. Two problems exist with this approach:
• There is no standard XML tagset (Document Type Definition, DTD) in use by vendors.
• XML has been available to web designers for the last 10 years, yet examination of websites shows little use of document structural elements. We hypothesize this is because web designers see themselves as artists rather than programmers.

3.2. Transcoding

Transcoding replaces HTML tags with suitable device-specific tags [1] (HDML, WML, etc.); a toy sketch of such tag substitution appears at the end of this subsection. While this is cost effective, it has two problems:
• Most web pages have a loose repeating visual structure. Sidebars are often the left-most element on a web page. This results in the wireless user getting the same repeating information with every screen, making browsing an unfriendly experience.
• Transcoding sends all the information to the wireless device, making it substantially slow on the wireless network.
Transcoding was introduced in Japan during 1999-2000 and was widely rejected by Japanese users.
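To make the tag-replacement idea concrete, the sketch below shows a deliberately naive transcoder. The tag mapping, function name and the choice of WML-like target tags are assumptions for illustration; real HDML/WML transcoding gateways are considerably more elaborate. Note how every piece of page content is passed through, which is exactly why transcoding tends to flood a slow wireless link.

```python
# Illustrative sketch only: naively rewrite a few HTML tags into WML-like tags.
# The TAG_MAP below is hypothetical, not a standard mapping.
import re

TAG_MAP = {
    "html": "wml", "body": "card", "h1": "big", "h2": "big",
    "b": "b", "i": "i", "br": "br", "p": "p",
}

def naive_transcode(html: str) -> str:
    def replace(match):
        slash, name = match.group(1), match.group(2).lower()
        mapped = TAG_MAP.get(name)
        if mapped is None:
            return ""                      # unknown tag: drop markup, keep content
        return "<" + slash + mapped + ">"
    # Rewrites opening and closing tags; attributes are simply discarded.
    return re.sub(r"<(/?)(\w+)[^>]*>", replace, html)

print(naive_transcode("<html><body><h1>News</h1><p>Story text...</p></body></html>"))
```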
3.3. Adaptive Re-authoring
This approach has attracted huge interest in recent times, principally because it is an attractive alternative to the previous options and because, in theory, it is flexible enough for future extension. Early attempts include the use of a web digestor [2] that explores the structure of a web page to detect 'blocks' of information and uses transcoding techniques to reflow the content. Other attempts include approaches that identify content and then either label the blocks of content or convert them to image thumbnails [3-5], which in turn are linked to the details of the relevant content. Some researchers have approached this problem from the point of view of streamlining web server Quality of Service (QoS) [6] and improving server overload behaviour [7]. Most of the approaches mentioned above perform some type of structural analysis of the web page, primarily exploiting the HTML data structures and pre-defined tagsets.

The Natural Language community, on the other hand, has long been involved in research on textual summarization. The problem of textual summarization, while not completely solved, is well researched and there is a wealth of solutions available in this category [8-13]. Unfortunately, natural language summarization techniques are difficult to apply directly to web pages because of the unique way content is distributed in the HTML representation and the presence of other multimedia components embedded within these pages. More on this topic is discussed later in this chapter.

4. Proposed Approach

The importance of efficient content extraction from HTML pages for wireless web access becomes clear, especially in the context of the issues discussed in the previous section. The approach proposed here addresses this problem in a slightly different way. The proposed system is closer to the adaptive re-authoring approach discussed above, but has a number of distinct features and is more streamlined. Initially, the HTML data structure is exploited to segment the web pages into zones [14, 15]. Once these zones are identified, attribute-based analysis of the content is carried out. This results in the extraction of content that is relevant and important.
4.1. Web Page Segmentation
This method segments web pages into intelligent and relevant sections. It involves an analysis of the structure of the web page. The document is decomposed into a main data structure, usually a tree that follows HTML's table structure. CContent is the root, and CTable nodes are its children; the number of CTable children should equal the number of top-level tables in the hypertext page. Each CTable has CRow children, which in turn may have CCol children. Finally, a CCol can again contain CTable nodes as children; from this point the structure is recursive (Fig. 12), and it extends all the way down to the leaves. Adoption of this structure ensures that the HTML page is split into the units defined within the hierarchy of this tree. The content is inside the CCol nodes; the rest of the structure defines the way the various objects within that structure are related to each other [16].
Fig. 12: Data structure.
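The following sketch mirrors the hierarchy of Fig. 12 with hypothetical Python classes. It is not the authors' implementation; it only illustrates how nested HTML tables map onto the recursive CContent/CTable/CRow/CCol tree and how the content-bearing CCol leaves can be collected in document order.

```python
# Illustrative data structure for the recursive decomposition of Fig. 12.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CCol:
    text: str = ""                                          # content lives here
    tables: List["CTable"] = field(default_factory=list)    # nested tables recurse

@dataclass
class CRow:
    cols: List[CCol] = field(default_factory=list)

@dataclass
class CTable:
    rows: List[CRow] = field(default_factory=list)

@dataclass
class CContent:
    tables: List[CTable] = field(default_factory=list)      # one per top-level table

def leaf_columns(content: CContent) -> List[CCol]:
    """Collect content-bearing CCol leaves depth-first, in document order."""
    out: List[CCol] = []
    def walk(table: CTable) -> None:
        for row in table.rows:
            for col in row.cols:
                if col.tables:
                    for nested in col.tables:
                        walk(nested)
                else:
                    out.append(col)
    for top in content.tables:
        walk(top)
    return out
```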
4.2. Contextual Analysis and Segment Labeling
The decomposed sub-documents are analyzed for their context. This classifies the segments into various classes based on the type and size of content in each node. The current state of the art is a simple method that uses information collected in the decomposition stage, such as the number of non-link words, linked words, content weight, the presence of forms, etc., to devise specific classes. For example, an element composed primarily of text and few links is a "story" node. Similarly, nodes with a very high ratio of linked text compared to non-linked text, and with more linked words per phrase than a threshold, can be labeled "Links". Other classes, such as "Navigation Bars", "Advertisement Bars", "Main Story", "Other Stories", "Forms", "Images" and many more, can be created.

This contextual analysis of each sub-document produces a summarization expressed as a label. The labeling is based on exploiting the visual cues in the web page. During web page segment labeling, a semantic label is applied to each segment. Often a segment has an overriding label provided by the author; in such cases, the task is locating the author-provided label. In other cases, the labels are implied. This is the more difficult task, and currently there is only experimental work on semantic labeling based on content. Visual cues, such as font size, headlines, boldness, color, links, flashing (BLINK), italics (I), emphasis (EM) and underlining, are a good source of information when trying to create an appropriate label for a segment (or sub-document). Various algorithms can be adopted to exploit these parameters. These labels are ultimately used as short summaries for the Table of Content (TOC).
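A minimal sketch of this kind of rule is shown below, assuming the features named above are already available from the decomposition stage. The thresholds and class names are invented for illustration and are not the tuned values of any deployed system.

```python
# Hypothetical heuristic labelling of one segment from decomposition features.
def label_segment(non_link_words: int, linked_words: int,
                  linked_words_per_phrase: float, has_form: bool) -> str:
    if has_form:
        return "Forms"
    total = non_link_words + linked_words
    if total == 0:
        return "Images"
    link_ratio = linked_words / total
    if link_ratio < 0.2 and non_link_words > 50:
        return "Story"
    if link_ratio > 0.8 and linked_words_per_phrase > 3:
        return "Links"
    if link_ratio > 0.8:
        return "Navigation Bar"
    return "Other"

print(label_segment(non_link_words=240, linked_words=12,
                    linked_words_per_phrase=1.5, has_form=False))   # -> Story
```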
4.3. Web-Page Summarization
Since each of these sub-documents is summarized with an "intelligent" summary, the summaries are put together as a summary of the whole document, giving rise to a Table of Content (TOC). Each entry in this TOC points to specific sub-documents within the document.

4.4. Post-processing
Extraction of content from individual zones, however, is not the complete solution. These zones can contain content that is inter-related, and it may not make much sense to display such zones separately. Therefore, the next stage in this process is the analysis of the relationships between these zones. This is achieved in three ways (a small illustrative sketch of the first follows this list).
• Proximity Analysis: This approach involves a relational analysis based on content proximity, which is a measure of how closely related one block of text is to another. The natural order of the zones can sometimes be used as a strong indicator to establish a relationship. In addition, lexical chains [17] are used to create relationships among the various zones.
• Content Classification: Content extracted from individual zones can be classified into various types, and this classification, combined with the context of proximity, can be a powerful tool to establish a logical map between the various zones.
• Content Flow Analysis: Often, due to the structure of HTML documents, a single coherent block of content can be separated into multiple zones. This is usually internal and transparent to the user after rendering by the browser. However, content classification as described above will separate a related block of information, such as a paragraph, into multiple zones. Content Flow Analysis uses content understanding methods to approximate the content flow between zones. This analysis is based on natural language processing involving contextual grammar and vector modeling [18], details of which are beyond the scope of this chapter. This application of knowledge models and information retrieval techniques defines the relationships between previously separated zones in order to combine them into a coherent representation.
Once relationships between the various zones are established, they are used to reflow the content in a more meaningful and efficient manner suited to the requirements of smaller display devices. Various methods [19-21] are applied to combine the information thus collected. Although primarily developed for character recognition, these techniques are generic enough to be applied to this particular task domain with little or no modification.
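As a rough illustration of proximity analysis, the sketch below scores how related two adjacent zones are by simple lexical overlap (a crude stand-in for the lexical-chain analysis cited above) and merges neighbours whose score exceeds a threshold. The scoring function and threshold are assumptions made for this example only.

```python
# Illustrative proximity-based merging of adjacent zones.
from typing import List

def relatedness(zone_a: str, zone_b: str) -> float:
    words_a = {w.lower() for w in zone_a.split() if len(w) > 3}
    words_b = {w.lower() for w in zone_b.split() if len(w) > 3}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))

def merge_adjacent_zones(zones: List[str], threshold: float = 0.3) -> List[str]:
    merged: List[str] = []
    for zone in zones:
        if merged and relatedness(merged[-1], zone) >= threshold:
            merged[-1] = merged[-1] + " " + zone   # treat as one coherent block
        else:
            merged.append(zone)
    return merged
```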
4.5. Overall Summary of the Content Extraction and Display Process
The following are the overall stages required to achieve content extraction from HTML documents:
• Structural Analysis: Analysis of the structure of a web document.
• Decomposition: Decomposing a web document based on the extracted structure.
• Contextual Analysis: Once the document is decomposed into constituent sub-documents, analyze each sub-document for its context.
• Summarization (Labeling): This contextual analysis of each sub-document produces a summarization, which can be expressed as a sentence or sub-sentence (a label) indicating the content of the sub-document.
• Table of Content (TOC): Since each of these sub-documents is summarized with an "intelligent" summary, the summaries can be put together as a summary of the whole document, giving rise to a Table of Content (TOC). Each entry in this TOC points to specific sub-documents within the document.
• Order of TOC: The order in which the TOC is extracted depends on the "natural" order of the sub-documents extracted from the main document. However, this "natural" order is often misleading, as the main "interesting" or "important" message of the document can be lost in the TOC. It is therefore important to analyze the content of each sub-document and display the TOC by re-ordering the entries based on their relative "importance". This is determined by analyzing the visual emphasis of the textual information, such as the various levels of headers, boldness, font size, word frequency and other parameters (a rough sketch of such a scoring scheme follows this list).
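The sketch below illustrates one way such an importance score might be computed from the visual-emphasis cues just listed and used to re-order the TOC. The particular weights and field names are invented for illustration; the chapter does not specify them.

```python
# Hypothetical importance scoring for TOC re-ordering.
from typing import Dict, List, Tuple

def importance(segment: Dict) -> float:
    score = 0.0
    score += {1: 3.0, 2: 2.0, 3: 1.0}.get(segment.get("header_level", 0), 0.0)
    score += 1.5 if segment.get("bold") else 0.0
    score += 0.1 * segment.get("font_size", 0)
    score += 0.01 * segment.get("word_count", 0)
    return score

def order_toc(segments: List[Dict]) -> List[Tuple[str, float]]:
    scored = [(s["label"], importance(s)) for s in segments]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(order_toc([
    {"label": "BCL Computers", "header_level": 1, "bold": True,
     "font_size": 18, "word_count": 120},
    {"label": "Beta Tester wanted", "header_level": 3, "bold": False,
     "font_size": 12, "word_count": 40},
]))
```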
5. Results

The performance of the system is best described using a practical application. Figure 3 shows the first page of the web site www.bcl-computers.com. As can be clearly seen, this web site has a complicated multi-column layout. The content is presented in multiple segments with an implied relationship between these segments.
For example, a story segment might be followed by a segment providing additional links to similar stories. The system analyses the layout and segments within the page and produces a summarized output (Fig. 4). This is the total table of content (TOC). Each member of the TOC represents several segments within the page. Selecting any of these links enables the user to go to the more detailed content associated with that TOC entry. For example, selecting the link "BCL Computers" leads the user to the display shown in Fig. 5. Clearly, the idea here is to keep the content intact; the emphasis is on identifying which segments of the page should be put together as a related segment that can be adequately described by a single label.
There also has to be some way to navigate between the various levels of abstraction. Since the content is in two levels, making the first-level labels (TOC entries) hyperlinks solves this problem gracefully. For example, if the user selects the link "BCL Computers" from Fig. 4, the system shows the output of Fig. 5. In similar fashion, selecting the link "Beta Tester wanted" from the same display (Fig. 4) produces the output of Fig. 6. Figure 7 shows the detailed content associated with the link "PDF SOLUTIONS" in Fig. 4. Not only are the "story" contents summarized; sidebars and navigation links are also summarized. For example, Figures 8-11 show the summarized bars from the page www.bcl-computers.com.
Fig. 6: Detailed content in the 2nd level.
Fig. 7: Detailed content in the 2nd level.
Fig. 8: The summarized top bar.
Fig. 9: A side bar.
Fig. 10: More details.
Fig. 11: The top bar.
6. Discussion

The wireless web is poised to be the next big technological paradigm shift after the PC-based Internet. The technologies of web content management, portability and wireless devices, such as handhelds and cellular phones, are maturing rapidly and their convergence is now in sight. As is already the case in Japan, wireless devices are going to be the primary devices for web access in the USA. In this emerging market, strong software for easy navigation and access to information from the web is of paramount importance. However, accessing the web on these small display devices is a very difficult issue, one that has been the primary cause of the slow adoption of this huge phenomenon.

Let us look at a very simplified scenario. There are currently 1.2 billion web pages in the world. Assuming a skilled designer needs 2 hours to adapt an existing page for wireless access, it will take 2.4 billion work-hours to make the whole web accessible to wireless devices. If a cost of $20 per hour is assumed, this effort requires an investment of around $50 billion. This shows the enormity of the task ahead of us. What we need is a very effective tool that can automatically generate summaries of these pages and index the pages accordingly. Such a solution must eliminate the necessity of low-level coding to convert web sites.

In addition, multi-lingual web summarization is going to be extremely important in the very near future. In the next four years, the far-eastern countries are going to surpass the United States in terms of user numbers for web access and usage. This is a vast market for future web-related revenue. Sustainable mobile commerce is only going to take off when we have a reliable and efficient method of web access from all types of wireless devices using all the major languages of the world.

In general, textual summarization uses various standard techniques such as knowledge-rich methods, exploitation of discourse structures, corpus-based approaches or classical approaches. However, in the case of web pages, the structure imposed on these pages makes the task much more complicated, as simply extracting the text and graphics from the web pages in sequential order often does not make any sense at all. In the approach proposed here, the summarization of web pages is achieved by a two-stage process. In the first stage, a summary of the structure of the web page is produced, creating a tree out of the page with automatically generated sections and sub-sections. In the second stage, textual summarization is performed on those sub-sections. This process, though it solves some of the problems, still has unresolved issues, which are discussed in the rest of this section.
6.1. Web Page Segmentation
The design and rationale of the web page segmentation approach adopted here are straightforward. However, web designers use table structures in a variety of ways. Most of the time, the construct is used as a tool for nicely laid-out displays. In other cases, tightly related content is distributed into separate nodes purely for the sake of visual presentation. The prevalent use of the HTML table construct for layout purposes is complicated by the fact that tables are sometimes used for exactly what they were designed for: tabular data with numbers in complex multiple column/row/hybrid configurations. Simply assessing when a table construct is a true "table" is a very open problem [22, 23].

Two of the important problems in web page segmentation are over-segmentation and under-segmentation. With over-segmentation, related areas are split apart; with under-segmentation, unrelated parts are grouped together. The content in these web pages can include forms, frames, JavaScript-enabled active components and many other artefacts. Over-segmentation tends to split these up and make the overall rendition unrealistic. A form, for example, can start anywhere in the code and end anywhere in the code, and the active components within the form itself can be scattered all over the place. This makes it almost impossible to reconstruct an operational form after segmentation has been carried out. The same argument is true for some other active components commonly found in web pages. One such example is the common practice of using slices of graphics to build a larger graphic. In the summarization process, these tend to be split up, making the summarization incomplete and often useless. This leads to the second problem. Since some type of merging is needed to make sure that related sections are put together, this sometimes puts unrelated sections together, giving rise to under-segmentation. This is still a very open problem.

6.2. Contextual Analysis and Segment Labeling

The approach adopted here avoids generating labels using textual summarization. In recent times, contextual analysis has opened up ways to apply analytical methods to efficiently summarize text. These methods are primarily based on Natural Language (NL) processing [9-12]. In most cases, they are resource intensive in terms of computational requirements and therefore slow. In any practical system of web browsing, the lag time between switching pages has to be less than 1 second to be accepted by the industry. Therefore, resource-intensive solutions, though often superior, are hardly acceptable if they are to be bundled with a commercial web browser.
Though this algorithm is straightforward, in reality it hardly satisfies the above timing criterion. Web designers tend to take a lot of artistic freedom with all these features, and what we end up with is a kaleidoscope of graphics, active multimedia components, forms and other artifacts. Web designers also use frames, often in ways they were not initially intended for. These factors make the task of extracting the correct label very difficult. One of the most difficult problems in this respect is the prevalent use of graphics as text in web documents. Some recent studies found that a substantial number of graphics on the web contain text, and a percentage of this text is not repeated anywhere else on those pages [24-26]. One option is to use OCR to detect the text within the graphics, but the accuracy of current OCR systems makes the task of label extraction very difficult. In the proposed approach, this problem is side-stepped by allowing the graphics to be displayed as is, thereby making the output complete. However, this does not make the task of indexing any easier.

6.3. Web-Page Summarization

The approach adopted here has the following three advantages:
• Coherent Viewing: By providing a table of contents for each web page visited, the system allows end-users to see a summary of the page they are going to visit. This allows the user to coherently view a large web page on a small mini-browser.
• Universal Access: This solution allows any wireless user to coherently view any web site without customization. This makes the entire web viewable to the end-user.
• High Speed Access: By summarizing web pages, the system speeds up access to any web site. Currently there is a 1-second average response lag per page within a web site, assuming good network speed. Wireless network speed is a factor in the performance of this system. Wireless networks in North America currently have a 7-15 second lag in response time, largely due to the lack of 2nd and 3rd generation (2G and 3G) digital wireless networks. The auction of digital wireless bandwidth over the next year should result in the deployment of 2G and 3G networks in North America, and the speed issue will be eased.
6.4. Display Capabilities
The display capabilities of devices can vary widely. Small-screen devices often have lower display capability not only in terms of the number of pixels, but also in whether they support color.
Since, in general, most cell phones and PDAs do not reproduce the range of colors a desktop can display, we need to address how web pages are converted in terms of color. In the approach presented here, we aim to detect the targeted device by detecting what type of browser it supports. Once the device is detected, and if required, color images are converted to black and white after dithering [27]. Images are also size-normalized to fit the screen size. If the device is capable of displaying color, the images are only size-normalized, with color retained.
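A minimal sketch of this adaptation step is shown below using the Pillow imaging library; the chapter itself cites ImageMagick [27], so the library choice, target size and function name here are assumptions for illustration only.

```python
# Illustrative image adaptation: size-normalize, and dither to black and white
# when the target device cannot display color.
from PIL import Image

def adapt_image(path: str, out_path: str, screen=(160, 160), supports_color=False) -> None:
    img = Image.open(path)
    img.thumbnail(screen)       # size-normalize to fit the device screen
    if not supports_color:
        img = img.convert("1")  # mode "1" applies Floyd-Steinberg dithering by default
    img.save(out_path)

adapt_image("banner.png", "banner_palm.png", screen=(160, 160), supports_color=False)
```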
6.5. Language Independence
One of the biggest advantages of the proposed approach is its ability to be multi-lingual. Since most of the summarization depends on structural analysis and visual cues, the solution is language independent. In general, the only problem encountered in switching between languages is the need to handle multi-byte characters, since many far-eastern languages such as Chinese and Japanese use multi-byte representations, whereas English is simply single-byte. However, this is an implementation issue and is not directly related to the summarization process.
6.6. Current State of Research
In general, this new approach shows promising signs of success. The remaining problems, and there are many, need to be addressed within the proposed framework. As web technology acquires more and more new and innovative ways of making sites attractive and efficient, and as more and more web sites become available, the task of summarization will become increasingly important, since we will not have enough time to read everything. It is also essential if we want to make the projected paradigm shift from desktop to handheld devices a reality, since these new-generation devices offer much smaller display areas. Any useful wireless web browsing has to have some form of summarizer associated with its browser, and in that respect the proposed approach might help in achieving this transition.
6.7. Supported Devices
The proposed system works by automatically summarizing live web content on the fly to fit smaller screen devices, such as PDAs and cellular phones with web capability. At the present time, the system supports all PDAs using an HTML 3.2 (and above) browser and also cellular phones using WAP, iMode (NTT DoCoMo), J-Sky (J-Phone) and EZweb (KDDI) formats.
7. Concluding Remarks

This chapter has presented an approach to extracting content from HTML documents based on their structural analysis. Based on this extraction, a classification of the content allows a more efficient representation of the content, in keeping with the importance of and logical relationships between the various zones of the document. This document analysis approach should therefore be able to organize the content into a meaningful, understandable, manageable and useful representation. The chapter has also discussed some of the unsolved challenges in this field and attempted to provide a perspective on the overall state of the art.

References
1. M. Hori, R. Mohan, H. Maruyama and S. Singhal, "Annotation of Web Content for Transcoding", W3C Note, http://www.w3.org/TR/annot/.
2. T. Bickmore, A. Girgensohn and J. W. Sullivan, "Web page filtering and re-authoring for mobile users", The Computer Journal, 42(6), Oxford University Press, 1999, pp. 534-546.
3. H. Zhang, "Adaptive content delivery: A new research area in media computing", Proceedings of the Multimedia Data Storage, Retrieval, Integration and Applications Workshop (MDSRIA), Hong Kong, January 13-15, 2000.
4. H. Zhang, J. Chen and Y. Yang, "Adaptive delivery of HTML contents", 9th World Wide Web Conference, Amsterdam, Netherlands, May 15-19, 2000, pp. 284-289.
5. H. Zhang, J. Chen and Y. Yang, "An adaptive web content delivery system", Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH2000), Trento, Italy, 28-30 August, 2000. Online proceedings at http://ah2000.itc.it/.
6. T. F. Abdelzaher and N. Bhatti, "Web Server QoS Management by Adaptive Content Delivery", Computer Networks, Elsevier, 31(11-16), 1999, pp. 1563-1577.
7. T. F. Abdelzaher and N. Bhatti, "Web content adaptation to improve server overload behavior", Proceedings of the 8th International World Wide Web Conference, Toronto, Canada, 1999, pp. 465-499.
8. S. Corston-Oliver, "Text compaction for display on very small screens", Proceedings of the Workshop on Automatic Summarization, 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Pittsburgh, PA, U.S.A., 2-7 June, 2001.
9. O. Buyukkokten, H. Garcia-Molina and A. Paepcke, "Text Summarization of Web Pages on Handheld Devices", Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Pittsburgh, PA, U.S.A., 2-7 June, 2001.
10. A. Kolcz, V. Prabakarmurthi and J. Kalita, "Summarization as Feature Selection for Text Categorization", Proceedings of the 10th Conference on Information and Knowledge Management (CIKM-01), Atlanta, Georgia, 5-10 November, 2001, pp. 365-370.
11. K. McKeown, J. Klavans, V. Hatzivassiloglou, R. Barzilay and E. Eskin, "Towards Multidocument Summarization by Reformulation: Progress and Prospects", Proceedings of the Meeting of the American Association for Artificial Intelligence / Innovative Applications of Artificial Intelligence (AAAI/IAAI), Orlando, Florida, U.S.A., 18-22 July, 1999, pp. 453-460.
12. M. J. Witbrock and V. O. Mittal, "Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries", Proceedings of the ACM Special Interest Group on Information Retrieval Conference on Research and Development in Information Retrieval (SIGIR), Berkeley, California, U.S.A., 15-19 August, 1999, pp. 315-316.
13. Y. Matsumoto, T. Miyata, T. Nomoto, T. Tokunaga, M. Takeda and M. Obayashi, "Document Analysis and Summarization Workbench", Software Demonstrations at the 38th Annual Meeting of the Association for Computational Linguistics (ACL2000), Demonstration Notes, Hong Kong, 1-8 October, 2000, pp. 22-23.
14. A. F. R. Rahman, H. Alam, R. Hartono and K. Ariyoshi, "Automatic Summarization of Web Content to Smaller Display Devices", Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR01), Seattle, USA, 10-13 September, 2001, pp. 1064-1068.
15. A. F. R. Rahman, H. Alam and R. Hartono, "Content Extraction from HTML Documents", Proceedings of the 1st International Workshop on Web Document Analysis (WDA'2001), Seattle, USA, September 2001 (ISBN: 0-9541148-0-9), pp. 7-10. Also available at http://www.csc.liv.ac.uk/~wda2001.
16. A. F. R. Rahman, H. Alam and R. Hartono, "Understanding the Flow of Content in Summarizing HTML Documents", Proceedings of the Document Layout Interpretation and its Applications Workshop (DLIA01), Seattle, USA, 9 September, 2001. Online proceedings at http://www.science.uva.nl/events/dlia2001/program/index.html.
17. R. Barzilay and M. Elhadad, "Using lexical chains for text summarization", Advances in Automatic Text Summarization, I. Mani and M. T. Maybury (Eds.), MIT Press, 1999, pp. 111-121.
18. H. Alam, "Spoken language generic user interface (SLGUI)", Technical Report AFRL-IF-RS-TR-2000-58, Air Force Research Laboratory, Rome, NY, 2000.
19. A. F. R. Rahman and M. C. Fairhurst, "Introducing new multiple expert decision combination topologies: A case study using recognition of handwritten characters", Proceedings of the 4th International Conference on Document Analysis and Recognition (ICDAR97), Ulm, Germany, 18-20 August, 1997, vol. 2, pp. 886-891.
20. A. F. R. Rahman and M. C. Fairhurst, "Multiple expert classification: A new methodology for parallel decision fusion", International Journal of Document Analysis and Recognition, Springer, 3(1), 2000, pp. 40-55.
21. A. F. R. Rahman and M. C. Fairhurst, "Enhancing consensus in multiple expert decision fusion", IEE Proceedings on Vision, Image and Signal Processing, IEE Press, 147(1), 2000, pp. 39-46.
22. A. Rahman and H. Alam, "Challenges in Web Document Summarization: Some Myths and Reality", Proceedings of the Document Recognition and Retrieval IX Conference, Electronic Imaging, Santa Clara, California, U.S.A., 21-22 January, SPIE 4670-27, 2002.
23. Y. Wang and J. Hu, "A machine learning based approach for table detection on the web", Proceedings of the 11th World Wide Web Conference (WWW2002), Hawaii, U.S.A., 7-11 May, 2002, pp. 242-250.
24. A. Antonacopoulos, D. Karatzas and J. O. Lopez, "Accessing textual information embedded in internet images", Proceedings of the SPIE Internet Imaging Conference, San Jose, California, U.S.A., 24-26 January, 2001, pp. 198-205.
25. D. Lopresti and J. Zhou, "Locating and recognizing text in WWW images", Information Retrieval, Kluwer, 2, 2000, pp. 177-206.
26. T. Kanungo, C. H. Lee and R. Bradford, "What fraction of images on the web contain text?", Proceedings of the Web Document Analysis Workshop (WDA01), Seattle, USA, 8 September, 2001, pp. 43-46. Online proceedings at http://www.csc.liv.ac.uk/~wda2001/.
27. ImageMagick Image Processing Toolkit, http://www.imagemagick.org/.
CHAPTER 7

HTML PAGE ANALYSIS BASED ON VISUAL CUES
Yudong Yang, Yu Chen and HongJiang Zhang
Microsoft Research Asia
No. 49, Zhichun Road, Beijing, China
yangyud@cn.ibm.com, {i-yuchen, hjzhang}@microsoft.com
In this chapter, we present an approach to automatically analyzing the semantic structure of HTML pages based on detecting visual similarities of the content objects on these pages. The approach is based on the observation that, in most web pages, the layout styles of subtitles or records of the same content category are consistent and there are apparent separation boundaries between different categories. Thus, these subtitles should have a similar appearance when rendered in visual browsers, and the different categories can be separated clearly. In our approach, we first measure the visual similarities of HTML content objects. Then we apply a pattern detection algorithm to detect frequent patterns of visual similarity and use a number of heuristics to choose the most probable patterns. By grouping items according to these patterns, we finally build a hierarchical representation (tree) of the HTML document with semantics inferred from "visual consistency". Preliminary experimental results show promising performance of the method on real web pages.

1. Introduction

The World Wide Web has become one of the most important information sources today. Most of the data on the web is available as pages encoded in markup languages such as HTML, intended for visual browsers. As the amount of data on the web grows, locating desired content accurately and accessing it conveniently become pressing requirements. Technologies such as web search engines and ACD (Adaptive Content Delivery) [1-7] are being developed to meet these requirements. These technologies could give better results only if the semantic structures of the documents were available. However, web pages are normally composed for viewing in visual web browsers and lack information on semantic structures.
To extract these structures, document analysis methods are required. For search engines, the goal is to locate and extract the data in web pages while obtaining the data structure. For ACD, the semantic structures are the foundation of adaptation decisions. The different goals of search engines and ACD lead to different approaches.

1.1. Document Analysis for Search Engines

From a search engine's point of view, a web page (HTML document) is a semi-structured data set. The goals of document analysis are: 1) the location and extraction of data from the web page, and 2) the building of the data schema (structure). Hammer et al. [8] use an extractor specification file to program a lexical analyzer to extract the data. The specification file is actually a script that tells the analyzer how to perform the extraction. Table 1 gives an example of a typical specification file with two commands. The first command (lines 1-4) tells the analyzer to get a web page from the specified URL. The second command (lines 5-8) tells the analyzer to extract the temperature information inside the web page according to the specified pattern. The schema (structure) of the data is also specified by the file. As the specification file is usually created manually, according to some specific page, the analyzer can only extract data from web pages with exactly the same framework.
Ashish et al. [9] suggest the use of certain heuristics to automatically create a data extractor from samples of a set of similar web pages. While the created extractors may not be precise enough, the authors provide a facility for the user to interactively correct errors. As a limitation, the extractor can only be applied to the specific web page set.

Having extracted the data, the schema should be built upon it. Nestorov et al. [11] provide a type detection method to infer the inherent data schema (structure).
They assume that the data has been extracted and that the references among individual data objects are known. Based on the reference relationships, a set of heuristics is proposed to classify the data objects into several categories, each corresponding to a data type.

1.2. Document Analysis for Adaptive Content Delivery

While search engines require the data inside web pages, ACD needs to know the semantic structure of the web pages. It then adapts the web content according to the client device capabilities, network conditions and user interests. For web pages, re-authoring is commonly used. The process consists of operations including content analysis, summarization and re-layout. All of these operations depend on an understanding of the documents' semantic structures. This makes document analysis a necessary job. The goal of web page analysis in ACD is to infer the structure of the web page and provide enough information for the re-authoring operations.

Traditional analysis methods leverage structure-representing HTML tags, such as <H1>~<H6>, to infer the page structure. However, HTML as a whole lacks the ability to represent semantic-related content. For various reasons, it was designed with both structural and presentational capabilities in mind, and the two were not clearly separated. (In the first version of HTML most of the tags were structural, but many layout and presentation tags were added in later versions and are widely used today. Some of this history can be found in Siegel's essay [15].) Further widespread misuse of structural HTML tags for layout purposes (e.g. <TABLE>) makes the situation even worse. Cascading Style Sheets (CSS) [21] were later developed as a remedy, but only recently have several popular browsers begun to have better CSS support. The recent W3C recommendation of XML provides a better way to organize data and represent semantic structures. However, most web content is still authored in HTML.

Because of these common misuses, HTML tags are not stable features for analyzing the structures of real-world HTML documents. In this chapter, we introduce an automatic method to extract semantic structures from general HTML pages without requiring a priori knowledge of the web pages. The method uses features derived directly from the layout of HTML pages. As we have observed, it is common for web pages to keep consistent visual styles for parallel subtitles or records in the same content categories, while different categories are separated by apparent visual boundaries. The objective of this approach is to detect these visual cues and construct the hierarchy of categories.
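As a purely illustrative sketch of this idea (not the algorithm developed later in this chapter), sibling HTML objects can be reduced to coarse "style signatures" of their rendered appearance, and maximal runs of objects sharing a signature can be grouped as records of one content category. The feature names below are assumptions.

```python
# Illustrative grouping of sibling objects by repeated visual style.
from itertools import groupby
from typing import Dict, List, Tuple

def style_signature(obj: Dict) -> Tuple:
    # Hypothetical reduction of an object's rendered appearance.
    return (obj.get("font_size"), obj.get("bold"), obj.get("color"), obj.get("indent"))

def group_by_visual_consistency(siblings: List[Dict]) -> List[List[Dict]]:
    groups: List[List[Dict]] = []
    for _, run in groupby(siblings, key=style_signature):
        groups.append(list(run))
    return groups
```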
2. Visual Similarity of HTML Objects

Good organization of contents is an essential factor in good content services. Figure 5 and Fig. 6 on later pages show some examples of typical web pages. From these examples we can see that it is quite common to divide contents into categories, where each category holds records of related subtitles. In addition, records in one category are normally organized in a consistent visual layout style. Boundaries between different categories are marked clearly with different visual styles or separators. As we have said, the basic idea of this approach is to detect these visual cues, which can then be used to detect records and categories.

The first step in visual cue detection is determining the visual similarity of HTML objects. The appearance of HTML objects is defined by factors such as layout and style. Under current W3C recommendations, the layout and style of HTML pages should be defined by CSS. However, for historical reasons, CSS is not very popular yet and most web pages are still patchworks of structural and deprecated [20] presentation tags. Also, for reasons such as tricks used to reduce page size, laziness, mistakes and limited functions in some authoring tools, two HTML objects that look similar in browsers do not necessarily contain the same combination of HTML tags. Different tags, in different orders, may produce the same visual effect. Approaches that rely on tags will not be effective in these complex situations. Therefore, we prefer HTML objects instead of tags to identify the semantic units of the web page.

A web page can be considered as a hierarchy of HTML objects, in which the root corresponds to the whole page, leaf nodes correspond to simple objects, and internal nodes correspond to container objects. In the following discussion, we will use the terms defined below.
• Simple object: An indivisible visual HTML object, which either does not include other HTML tags (such as a paragraph of pure text or certain individual tags) or is the representation of a single embedded media object (such as