SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE*

Editors: H. Bunke (Univ. Bern, Switzerland), P. S. P. Wang (Northeastern Univ., USA)

Vol. 38: New Approaches to Fuzzy Modeling and Control — Design and Analysis (M. Margaliot and G. Langholz)
Vol. 39: Artificial Intelligence Techniques in Breast Cancer Diagnosis and Prognosis (Eds. A. Jain, A. Jain, S. Jain and L. Jain)
Vol. 40: Texture Analysis in Machine Vision (Ed. M. K. Pietikainen)
Vol. 41: Neuro-Fuzzy Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 42: Invariants for Pattern Recognition and Classification (Ed. M. A. Rodrigues)
Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)
Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang)
Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli)
Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang)
Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)
Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)
Vol. 50: Empirical Evaluation Methods in Computer Vision (Eds. H. I. Christensen and P. J. Phillips)
Vol. 51: Automatic Diatom Identification (Eds. H. du Buf and M. M. Bayer)
Vol. 52: Advances in Image Processing and Understanding — A Festschrift for Thomas S. Huang (Eds. A. C. Bovik, C. W. Chen and D. Goldgof)
Vol. 53: Soft Computing Approach to Pattern Recognition and Image Processing (Eds. A. Ghosh and S. K. Pal)
Vol. 54: Fundamentals of Robotics — Linking Perception to Action (M. Xie)

*For the complete list of titles in this series, please write to the Publisher.
Series in Machine Perception and Artificial Intelligence — Vol. 55

WEB DOCUMENT ANALYSIS
Challenges and Opportunities

Editors
Apostolos Antonacopoulos (University of Liverpool, UK)
Jianying Hu (IBM T. J. Watson Research Center, USA)

World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-582-7
Printed by MultiPrint Services
PREFACE

With the ever-increasing use of the Web, a growing number of documents are published and accessed on-line. The emerging issues pose new challenges for Document Analysis. Although the development of XML and the new initiatives on the Semantic Web aim to improve the machine-readability of web documents, they are not likely to eliminate the need for content analysis. This is particularly true for the kind of web documents created as web publications (vs. services) where visual appearance is critical. Such content analysis is crucial for applications such as information extraction, web mining, summarization, content re-purposing for mobile and multi-modal access, and web security. The need is evident for discussions to identify the role of Document Analysis in this new technical landscape.

This book is a collection of chapters including state-of-the-art reviews of challenges and opportunities as well as research papers by leading researchers in the field. These chapters are assembled into five parts, reflecting the diverse and interdisciplinary nature of this field. The book starts with Part I, Content Extraction and Web Mining, where four different research groups discuss the application of graph theory, machine learning and natural language processing to the analysis, extraction and mining of web content. Part II deals with issues involved in adaptive content delivery to devices of varying screen size and access modality, particularly mobile devices. Part III focuses on the analysis and management of one of the most common structured elements in web documents — tables. Part IV includes three chapters on issues related to images found in web documents, including text extraction from web images and image search on the web. Finally, the book is concluded in Part V with discussions of new opportunities for Document Analysis in the web domain, including human interactive proofs for web security, the exploitation of web resources for document analysis experiments, the expansion of the concept of "documents" to include multimedia documents, and areas where what has been learnt from traditional Document Image Analysis can be applied to the web domain.
It is our hope that this book will set the scene in the emerging field of Web Document Analysis and stimulate new ideas, new collaborations and new research activities in this important area. We would like to extend our gratitude to Horst Bunke who encouraged and supported us unfailingly in putting together this book. We are also grateful to Ian Seldrup of World Scientific for his helpful guidance and for looking after the final stages of the production. Last but certainly not least, we wish to express our warmest thanks to the Authors, without whose interesting work this book would not have materialised.
Apostolos Antonacopoulos and Jianying Hu
CONTENTS

Preface

Part I. Content Extraction and Web Mining
Ch. 1. Clustering of Web Documents Using a Graph Model (A. Schenker, M. Last, H. Bunke and A. Kandel)
Ch. 2. Applications of Graph Probing to Web Document Analysis (D. Lopresti and G. Wilfong)
Ch. 3. Web Structure Analysis for Information Mining (V. Lakshmi, A. H. Tan and C. L. Tan)
Ch. 4. Natural Language Processing for Web Document Analysis (M. Kunze and D. Rosner)

Part II. Document Analysis for Adaptive Content Delivery
Ch. 5. Reflowable Document Images (T. M. Breuel, W. C. Janssen, K. Popat and H. S. Baird)
Ch. 6. Extraction and Management of Content from HTML Documents (H. Alam, R. Hartono and A. F. R. Rahman)
Ch. 7. HTML Page Analysis Based on Visual Cues (Y. Yang, Y. Chen and H. J. Zhang)

Part III. Table Understanding on the Web
Ch. 8. Automatic Table Detection in HTML Documents (Y. Wang and J. Hu)
Ch. 9. A Wrapper Induction System for Complex Documents and its Application to Tabular Data on the Web (W. W. Cohen, M. Hurst and L. S. Jensen)
Ch. 10. Extracting Attributes and their Values from Web Pages (M. Yoshida, K. Torisawa and J. Tsujii)

Part IV. Web Image Analysis and Retrieval
Ch. 11. A Fuzzy Approach to Text Segmentation in Web Images Based on Human Colour Perception (A. Antonacopoulos and D. Karatzas)
Ch. 12. Searching for Images on the Web Using Textual Metadata (E. V. Munson and Y. Tsymbalenko)
Ch. 13. An Anatomy of a Large-Scale Image Search Engine (W.-C. Lai, E. Y. Chang and K.-T. Cheng)

Part V. New Opportunities
Ch. 14. Web Security and Document Image Analysis (H. S. Baird and K. Popat)
Ch. 15. Exploiting WWW Resources in Experimental Document Analysis Research (D. Lopresti)
Ch. 16. Structured Media for Authoring Multimedia Documents (T. Tran-Thuong and C. Roisin)
Ch. 17. Document Analysis Revisited for Web Documents (R. Ingold and C. Vanoirbeek)

Author Index
Part I. Content Extraction and Web Mining
CHAPTER 1

CLUSTERING OF WEB DOCUMENTS USING A GRAPH MODEL

Adam Schenker¹, Mark Last², Horst Bunke³, and Abraham Kandel¹

¹ Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave. ENB 118, Tampa, FL 33620, USA
E-mail: aschenke, [email protected]

² Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
E-mail: [email protected]

³ Institut für Informatik und angewandte Mathematik, Department of Computer Science, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland
E-mail: [email protected]
In this chapter we enhance the representation of web documents by utilizing graphs instead of vectors. In typical content-based representations of web documents based on the popular vector model, the structural (term adjacency and term location) information cannot be used for clustering. We have created a new framework for extending traditional numerical vector-based clustering algorithms to work with graphs. This approach is demonstrated by an extended version of the classical k-means clustering algorithm which uses the maximum common subgraph distance measure and the concept of median graphs in the place of the usual distance and centroid calculations, respectively. An interesting feature of our approach is that the determination of the maximum common subgraph for measuring graph similarity, which is an NP-Complete problem, becomes polynomial time with our graph representation. By applying this graph-based k-means algorithm to the graph model we demonstrate a superior performance when clustering a collection of web documents.

1. Introduction

In the field of machine learning, clustering has been a useful and active area of research for some time. In clustering, the goal is to separate a given group of data items (the data set) into groups (called clusters) such that items in the same cluster are similar to each other and dissimilar to the items in other clusters.
Unlike the supervised methods of classification, no labeled examples are provided for training. Clustering of web documents is an important problem for two major reasons. First, clustering a document collection into categories enables it to be more easily browsed and used. Automatic categorization is especially important for the World Wide Web with its huge number of dynamic (time varying) documents and diversity of topics; such features make it extremely difficult to classify pages manually as we might do with small document corpora related to a single field or topic. Second, clustering can improve the performance of search and retrieval on a document collection. Hierarchical clustering methods, for example, are used often for this purpose.1

When representing documents for clustering, a vector model is typically used.2 In this model, each possible term that can appear in a document becomes a feature dimension. The value assigned to each dimension of a document may indicate the number of times the corresponding term appears in it. This model is simple and allows the use of traditional clustering methods that deal with numerical feature vectors in a Euclidean feature space. However, it discards information such as the order in which the terms appear, where in the document the terms appear, how close the terms are to each other, and so forth. By keeping this kind of structural information we could possibly improve the performance of the clustering.

The problem is that traditional clustering methods are often restricted to working on purely numeric feature vectors. This comes from the need to compute distances between data items or to calculate some representative of a cluster of items (i.e. a centroid or center of a cluster), both of which are easily accomplished in a Euclidean space. Thus either the original data needs to be converted to a vector of numeric values by discarding possibly useful structural information (which is what we are doing when using the vector model to represent documents) or we need to develop new, customized algorithms for the specific representation.

We deal with this problem by introducing an extension of a classical clustering method that allows us to work with graphs as fundamental data structures instead of being limited to vectors of numeric values. Our approach has two main benefits. First, it allows us to keep the inherent structure of the original documents by modeling each document as a graph, rather than having to arrive at numeric feature vectors that contain only term frequencies. Second, we do not need to develop new clustering algorithms completely from scratch: we can apply straightforward extensions to go from classical clustering algorithms that use numerical vectors to those that deal with graphs. In this chapter we will describe a k-means clustering algorithm that utilizes graphs instead of vectors and illustrate its usefulness by applying it to the problem of clustering a
collection of web documents. We will show how web documents can be modeled as graphs and then clustered using our method. Experimental results will be given and compared with previous results reported for the same web data set based on a traditional vector representation.

The chapter is organized as follows. In Sec. 2 we introduce the mathematical foundations we will use for clustering with graphs. In Sec. 3, we extend the classical k-means algorithm to use graphs instead of numerical vectors. In Sec. 4 we will describe a web page data set and its representation by the graph model. In Sec. 5 we present experimental results and a comparison with previous results from clustering the same web documents when using a vector model and classical k-means algorithms. Conclusions are given in Sec. 6.

2. Graphs: Formal Notation

Graphs are a mathematical formalism for dealing with structured entities and systems. In basic terms a graph consists of vertices (or nodes), which correspond to some objects or components. Graphs also contain edges, which indicate the relationships between the vertices. The first definition we have is that of the graph itself. Each data item (document) in the data set we are clustering will be represented by such a graph:

Definition 1. A graph3,4 G is formally defined by a 4-tuple (quadruple) G = (V, E, α, β), where V is a set of vertices (also called nodes), E ⊆ V × V is a set of edges connecting the vertices, α: V → Σ_V is a function labeling the vertices, and β: E → Σ_E is a function labeling the edges (Σ_V and Σ_E being the sets of labels that can appear on the nodes and edges, respectively).

The next definition we have is that of a subgraph. One graph is a subgraph of another graph if it exists as a part of the larger graph:

Definition 2. A graph G₁ = (V₁, E₁, α₁, β₁) is a subgraph5 of a graph G₂ = (V₂, E₂, α₂, β₂), denoted G₁ ⊆ G₂, if V₁ ⊆ V₂, E₁ ⊆ E₂, α₁(x) = α₂(x) ∀x ∈ V₁, and β₁((x, y)) = β₂((x, y)) ∀(x, y) ∈ E₁.

Next we have the important concept of the maximum common subgraph, or mcs for short, which is the largest subgraph a pair of graphs have in common:

Definition 3. A graph G is a maximum common subgraph5 (mcs) of graphs G₁ and G₂, denoted mcs(G₁, G₂), if: (1) G ⊆ G₁, (2) G ⊆ G₂, and (3) there is no other
subgraph G′ (G′ ⊆ G₁, G′ ⊆ G₂) such that |G′| > |G|. (Here |G| is intended to convey the "size" of the graph G; usually it is taken to mean |V|, i.e. the number of vertices in the graph.)

Using these definitions, a method for computing the distance between two graphs using the maximum common subgraph has been proposed:

$$d(G_1, G_2) = 1 - \frac{|\mathrm{mcs}(G_1, G_2)|}{\max(|G_1|, |G_2|)} \qquad (1)$$

where G₁ and G₂ are graphs, mcs(G₁, G₂) is their maximum common subgraph, max(...) is the standard numerical maximum operation, and |...| denotes the size of the graph as we mentioned in Definition 3.6

This distance measure has four important properties.3 First, it is restricted to producing a number in the interval [0, 1]. Second, the distance is 0 only when the two graphs are identical. Third, the distance between two graphs is symmetric. Fourth, it obeys the triangle inequality, which ensures the distance measure behaves in an intuitive way. For example, if we have two dissimilar objects (i.e. there is a large distance between them) the triangle inequality implies that a third object which is similar (i.e. has a small distance) to one of those objects must be dissimilar to the other.

Methods for computing the mcs are presented in the literature.7,8 In the general case the computation of mcs is NP-Complete, but as we will see later in the chapter, for our graph representation the computation of mcs is polynomial time due to the existence of unique node labels in the considered application.

Other distance measures which are also based on the maximum common subgraph have been suggested. For example, Wallis et al. have introduced a different metric which is not as heavily influenced by the size of the larger graph.9 Fernandez and Valiente combine the maximum common subgraph and the minimum common supergraph in their proposed distance measure.10 However, Eq. 1 is the "classic" version and the one we will use in our implementation and experiments. As yet there are no reported findings to indicate which distance measure is most appropriate for various applications, and this is a topic we will investigate in future research. However, the distance measure of Eq. 1 has the advantage that it requires the least number of computations when compared to the other two distance measures we mentioned above.
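A minimal sketch of Eq. 1 in code (the mcs size is passed in as a plain number here; how it can be computed in polynomial time for our graph representation is shown later in the chapter):

```python
# Sketch of the mcs-based distance of Eq. 1.  Sizes are supplied as numbers;
# any definition of graph "size" (see Definitions 3 and 5) can be plugged in.

def mcs_distance(size_g1, size_g2, size_mcs):
    """d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|); the result lies in [0, 1]."""
    return 1.0 - size_mcs / max(size_g1, size_g2)

print(mcs_distance(12, 20, 8))   # example values: 1 - 8/20 = 0.6
```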
Finally we need to introduce the concept of the median of a set of graphs. We define this formally as:

Definition 4. The median of a set of n graphs,11 S = {G₁, G₂, ..., G_n}, is a graph Ḡ such that Ḡ has the smallest average distance to all elements in S:

$$\bar{G} = \arg\min_{G \in S} \left\{ \frac{1}{n} \sum_{s \in S} \mathrm{dist}(s, G) \right\} \qquad (2)$$

Here S is the set of n graphs (and thus |S| = n) and Ḡ is the median. The median is defined to be a graph in set S. Thus the median of a set of graphs is the graph from that set which has the minimum average distance to all the other graphs in the set. The distance dist(...) is computed from Eq. 1 above. There also exist the concepts of the generalized median and weighted mean, where we don't require that Ḡ be a member of S.11,12 However, the related computational procedures are much more demanding and we do not consider them in the context of this chapter. Note that the implementation of Eq. 2 requires only O(n²) graph distance computations and then finding the minimum among those distances.
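The median of Eq. 2 is equally direct to express in code (a minimal sketch; `dist` stands for any graph distance measure such as Eq. 1):

```python
# Minimal sketch of the median graph of Eq. 2: the member of the set with the
# smallest average (equivalently, smallest total) distance to all members.
# As noted above, this costs O(n^2) distance computations for n graphs.

def median_graph(graphs, dist):
    return min(graphs, key=lambda g: sum(dist(g, s) for s in graphs))
```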
3. The Extended k-Means Clustering Algorithm

With our formal notation now in hand, we are ready to describe our framework for extending classical clustering methods which rely on Euclidean distance. The extension is surprisingly simple. First, any distance calculation between data items is accomplished with a graph-theoretical distance measure, such as that of Eq. 1. Second, since it is necessary to compute the distance between data items and cluster centers, it follows that the cluster centers (centroids) must also be graphs if we are to use a method such as that in Eq. 1. Therefore, we compute the representative "centroid" of a cluster as the median graph of the set of graphs in that cluster (Eq. 2). We will now show a specific example of this extension to illustrate the technique.

To avoid any confusion, we should briefly emphasize here the difference between our method and the family of "traditional" graph-theoretic clustering algorithms.1,13 In the typical graph clustering case, all the data to be clustered is represented as a single graph where the vertices are the data items and the edge weights indicate the similarity between items. This graph is then partitioned to create groups of connected components (clusters). In our method, each data item to be clustered is represented by a graph. These graphs are then clustered using some clustering algorithm (in this case, k-means) utilizing the distance and median computations previously defined in lieu of the traditional Euclidean distance and centroid calculations.

The k-means clustering algorithm is a simple and straightforward method for clustering data.14 The basic algorithm is given in Fig. 1. This method is applicable to purely numerical data when using Euclidean distance and centroid calculations. The usual paradigm is to represent each data item, which consists of m numeric values, as a vector in the space ℝ^m. In this case the distances between two data items are computed using the Euclidean distance in m dimensions and the centroids are computed to be the mean of the data in the cluster.

Inputs: the set of n data items and a parameter k, defining the number of clusters to create
Outputs: the centroids of the clusters (represented as numerical vectors) and for each data item the cluster (an integer in [1, k]) it belongs to
Step 1. Assign each data item (vector) randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the centroids of each cluster.
Step 3. Given the new centroids, assign each data item to be in the cluster of its closest centroid.
Step 4. Re-compute the centroids as in Step 2. Repeat Steps 3 and 4 until the centroids do not change.

Fig. 1. The basic k-means clustering algorithm.

However, now that we have a distance measure for graphs (Eq. 1) and a method of determining a representative of a set of graphs (the median, Eq. 2), we can apply the same method to data sets whose elements are graphs rather than vectors. The k-means algorithm extended to operate on graphs is given in Fig. 2.

Inputs: the set of n data items (represented by graphs) and a parameter k, defining the number of clusters to create
Outputs: the centroids of the clusters (represented as graphs) and for each data item the cluster (an integer in [1, k]) it belongs to
Step 1. Assign each data item (graph) randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the median of the set of graphs for each cluster using Eq. 2.
Step 3. Given the new medians, assign each data item to be in the cluster of its closest median (as determined by distance using Eq. 1).
Step 4. Re-compute the medians as in Step 2. Repeat Steps 3 and 4 until the medians do not change.

Fig. 2. The k-means algorithm for using graphs.
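A minimal sketch of the algorithm of Fig. 2, parameterised by a graph distance function such as Eq. 1 (the function signature and the handling of empty clusters are our own choices, not prescribed by the chapter):

```python
import random

def graph_kmeans(graphs, k, dist, max_iter=100):
    """Extended k-means of Fig. 2: distances via `dist`, representatives via the
    median graph of Eq. 2.  Returns the cluster medians and the assignment."""
    # Step 1: random initial assignment of each graph to a cluster 0..k-1
    assignment = [random.randrange(k) for _ in graphs]
    medians = [None] * k
    for _ in range(max_iter):
        # Steps 2 and 4: the median graph of each cluster is its representative
        new_medians = []
        for c in range(k):
            members = [g for g, a in zip(graphs, assignment) if a == c]
            if members:
                new_medians.append(min(members,
                                       key=lambda g: sum(dist(g, s) for s in members)))
            else:
                new_medians.append(medians[c])   # empty cluster: keep previous median
        # Step 3: reassign each graph to the cluster of its closest median
        assignment = [min((c for c in range(k) if new_medians[c] is not None),
                          key=lambda c: dist(g, new_medians[c]))
                      for g in graphs]
        if new_medians == medians:               # stop when the medians do not change
            break
        medians = new_medians
    return medians, assignment
```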
4. Clustering of Web Documents using the Graph Model

In order to demonstrate the performance and possible benefits of the graph-based approach, we have applied the extended k-means algorithm to the clustering of a collection of web documents. Some research into performing
clustering of web pages is reported in the literature.15-18 Similarity of web pages represented by graphs has been discussed in a recent work by Lopresti and Wilfong.19 Their approach differs from ours in that they extract numerical features from the graphs (such as node degree and vertex frequency) to determine page similarity rather than comparing the actual graphs; they also use a graph representation based on the syntactical structure of the HTML parse tree rather than the textual content of the pages. However, the work we are most interested in for evaluation purposes is that of Strehl et al.20 In that paper, the authors compared the performance of different clustering methods on web page data sets. This paper is especially important to the current work, since it presents baseline results for a variety of standard clustering methods including the classical k-means using different similarity measures.

The data set we will be using is the Yahoo "K" series (available at ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata), which was one of the data sets used by Strehl et al. in their experiments.20 This data set consists of 2,340 Yahoo news pages downloaded from www.yahoo.com in their original HTML format. Each page is assigned to one of 20 categories based on its content, such as "technology", "sports" or "health". Although a pre-processed version of the data set is also available in the form of a term-document matrix and a list of stemmed words, we are using the original documents in order to capture their inherent structural information using graphs.

We represent each web document as a graph using the following method:

• Each term (word) appearing in the web document, except for stop words (see below), becomes a vertex in the graph representing that document. This is accomplished by labeling each node (using the node labeling function α, see Definition 1) with the term it represents. Note that we create only a single vertex for each word even if a word appears more than once in the text. Thus each vertex in the graph represents a unique word and is labeled with a unique term not used to label any other node. If word a immediately precedes word b somewhere in a "section" s of the web document (see below), then there is a directed edge from the vertex corresponding to a to the vertex corresponding to b with an edge label s. We take into account certain punctuation (such as a period) and do not create an edge when these are present between two words.

• We have defined three "sections" for the web pages. First, we have the section title, which contains the text in the document's TITLE tag and any provided keywords (meta-data). Second, we have the section link, which is text appearing in clickable links on the page. Third, we have the section text, which comprises any of the readable text in the document (this includes link text but not title and keyword text). We perform removal of stop words, such as "the", "and", "of", etc., which are generally not useful in conveying information, by removing the corresponding nodes and their incident edges. We also perform simple stemming by checking for common alternate forms of words, such as the plural form.

• We remove the most infrequently occurring words on each page, leaving at most m nodes per graph (m being a user provided parameter). This is similar to a dimensionality reduction process for vector representations.
This form of knowledge representation is a type of semantic network, where nodes in the graph are objects and labeled edges indicate the relationships between objects.21 The conceptual graph is a type of semantic network sometimes used in information retrieval.22 With conceptual graphs, terms or phrases related to documents appear as nodes. The types of relations (edge labels) include synonym, part-whole, antonym, and so forth. Conceptual graphs are used to indicate meaning-oriented relationships between concepts, whereas our method indicates structural relationships that exist between terms in a web document. We give a simple example of our graph representation of a web document in Fig. 3. The ovals indicate nodes and their corresponding term labels. The edges are labeled according to title (TI), link (L), or text (TX). The document represented by the example has the title "YAHOO NEWS", a link whose text reads "MORE NEWS", and text containing "REUTERS NEWS SERVICE REPORTS". This novel method of document representation is somewhat similar to that of directed acyclic word graphs (or DAWGs); however, our nodes represent words rather than letters, our model allows for cycles in the graphs, and the edges are labeled.
Fig. 3. An example graph representation of a web document.
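A small sketch of how the Fig. 3 example could be assembled under the representation described above (the set-based data structures are ours; the chapter does not prescribe a particular implementation):

```python
# Sketch of the graph representation applied to the running example.  A graph
# is held as a set of unique term-labelled nodes plus a set of directed,
# section-labelled edges; stop-word removal and stemming are omitted, and for
# brevity link text is not also added to the "text" section here, although the
# section definitions above say that it would be.

def add_section(nodes, edges, words, section):
    for a, b in zip(words, words[1:]):
        nodes.update([a, b])
        edges.add((a, b, section))     # directed edge a -> b, labelled by the section

nodes, edges = set(), set()
add_section(nodes, edges, "YAHOO NEWS".split(), "TI")                    # title
add_section(nodes, edges, "MORE NEWS".split(), "L")                      # link text
add_section(nodes, edges, "REUTERS NEWS SERVICE REPORTS".split(), "TX")  # text

print(sorted(nodes))   # ['MORE', 'NEWS', 'REPORTS', 'REUTERS', 'SERVICE', 'YAHOO']
print(sorted(edges))   # e.g. ('MORE', 'NEWS', 'L'), ('YAHOO', 'NEWS', 'TI'), ...
```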
When determining the size of a graph representing a web document (Definition 3) we use the following method:

Definition 5. The size of a graph G = (V, E, α, β), denoted |G|, is defined as |G| = |V| + |E|.

Recall that the typical definition is simply |G| = |V|. However, for this application it is detrimental to ignore the contribution of the edges, which indicate the number of phrases identified in the text. Further, it is possible to have more than one edge between two nodes, since we label the edges separately according to the document section in which the terms are adjacent.

Before moving on to the experiments, we mention an interesting feature this model of representing documents has on the time complexity of determining the distance between two graphs (Eq. 1). In the distance calculation we are using the maximum common subgraph; the determination of this in the general case is known to be an NP-Complete problem.24 However, our graphs for this application have the following property:

$$\forall x, y \in V:\ \alpha(x) = \alpha(y) \ \text{if and only if}\ x = y \qquad (3)$$

In other words, each node in a graph has a unique label assigned to it, namely the term it represents. No two nodes in a graph will have the same label. Thus the maximum common subgraph G_m = (V_m, E_m, α_m, β_m) of a pair of graphs G₁ and G₂ can be created using the following method:

Step 1. Create the set of vertices: V_m = {x | x ∈ V₁ and x ∈ V₂ and α₁(x) = α₂(x)}.
Step 2. Create the set of edges: E_m = {(x, y) | x, y ∈ V_m and β₁((x, y)) = β₂((x, y))}.
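In code, this two-step construction might look as follows (a sketch reusing the set-based representation from the earlier example; with unique node labels no subgraph search is required):

```python
# Sketch of the two-step mcs construction for graphs with unique node labels.
# Nodes are term labels; edges are (a, b, section) triples as in the earlier
# sketch, so the vertex correspondence between the two graphs is forced.

def mcs(nodes1, edges1, nodes2, edges2):
    # Step 1: the mcs vertices are the terms common to both graphs.
    common_nodes = nodes1 & nodes2
    # Step 2: keep an edge when an identically labelled edge (same endpoints,
    # same section label) is present in both original graphs.
    common_edges = {(a, b, s) for (a, b, s) in edges1 & edges2
                    if a in common_nodes and b in common_nodes}
    return common_nodes, common_edges

def graph_size(nodes, edges):
    return len(nodes) + len(edges)     # Definition 5: |G| = |V| + |E|

# The distance of Eq. 1 can then be obtained with the earlier helper, e.g.
# mcs_distance(graph_size(n1, e1), graph_size(n2, e2), graph_size(*mcs(n1, e1, n2, e2)))
```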
The first step states that the set of vertices in the maximum common subgraph is just the intersection of the sets of terms of both graphs. Each term in the intersection becomes a node in the maximum common subgraph. The second step creates the edges by examining the set of nodes created in the previous step. We examine all pairs of nodes in the set; if both nodes contain an edge between them in both original graphs and share a common label, then we add the edge to the maximum common subgraph. Note that this is different from the concept of induced maximum common subgraph, where nodes are added only if they are connected by an edge in both original graphs. If there is a common subset of nodes but different edge configurations in the original graphs, we still add the nodes using our method. We also note that in document clustering the nodes, which represent terms, are much more important than the edges, which only indicate the relationships between the terms (i.e., followed by).

We see that the complexity of this method is O(|V₁||V₂|) for the first step and O(|V_mcs|²) for the second step. Thus it is O(|V₁||V₂| + |V_mcs|²) ≤ O(|V|² + |V_mcs|²) = O(|V|²) overall if we substitute |V| = max(|V₁|, |V₂|).

5. Experimental Results

In order to compare clustering methods with differing distance measures, Strehl et al. proposed the use of an information-theoretic measure of clustering performance.20 This measurement is given as:

$$\Lambda^{M} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{g} n_i^{(j)} \log_{k \cdot g}\!\left( \frac{n_i^{(j)}\, n}{\sum_{l=1}^{k} n_l^{(j)} \; \sum_{l=1}^{g} n_i^{(l)}} \right) \qquad (4)$$
where n is the number of data items, k is the desired number of clusters, g is the actual number of categories, and n_i^{(j)} is the number of items in cluster i classified to be category j. The above measure is, in fact, mutual information25 normalized by the sum of its maximum values (log k and log g) and it represents the overall degree of agreement between the clustering and the categorization.

In an attempt to adhere to the methodology of the original experiments, which used the vector model approach, we have selected a sample of 800 documents from the total collection of 2,340 and have fixed the desired number of clusters to be k = 40 (two times the number of categories), which is the same number of clusters used in the original experiment. Strehl et al. used this number of clusters "since this seemed to be the more natural number of clusters as indicated by preliminary
runs and visualisation." The results for our method using different numbers of maximum nodes per graph and the original results from Strehl et al. for vector-based k-means and a random baseline assignment are given in Table 1 (higher mutual information is better); results from our method are shown in bold. Each row gives the average of 10 experiments using the same 800 item data sample. The variation in results between runs comes from the random initialization in the first step of the k-means algorithm. We used t-tests to evaluate the statistical significance of our results as compared with the best reported vector-based k-means method (Extended Jaccard Similarity). Confidences less than 0.950 are marked with a "-". The same performance data is plotted graphically in Fig. 4.

In Fig. 5 we show the execution times for performing a single clustering of the document collection when using 5, 50, 100, and 150 nodes per graph. These results were obtained on a 733 MHz single processor Power Macintosh G4 with 384 megabytes of physical memory running Mac OS X. The clustering took 7.13 minutes at 5 nodes per graph and 288.18 minutes for 150 nodes per graph. Unfortunately, no execution time data is available for comparison from the original experiments in Strehl et al.

Table 1. Results of our experiments compared with results from Strehl et al. [Only the row labels survive in this copy: the graph-based method at various maximum numbers of nodes per graph, Extended Jaccard Similarity, Pearson Correlation, Cosine Measure, Random (baseline), and Euclidean; the mutual information values themselves are not reproduced.]
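For reference, the measure of Eq. 4, as reconstructed above, might be computed along the following lines (a sketch; the function name and the contingency-table input layout are ours):

```python
# Sketch of the mutual-information performance measure of Eq. 4 (as
# reconstructed above).  n_ij[i][j] is the number of items of category j
# placed in cluster i; logarithms are taken to base k*g so that the result
# is normalised by log k + log g.
import math

def mutual_information(n_ij):
    k, g = len(n_ij), len(n_ij[0])
    n = sum(sum(row) for row in n_ij)
    cluster_sizes = [sum(row) for row in n_ij]           # items per cluster
    category_sizes = [sum(col) for col in zip(*n_ij)]    # items per category
    total = 0.0
    for i in range(k):
        for j in range(g):
            if n_ij[i][j] > 0:
                total += n_ij[i][j] * math.log(
                    n_ij[i][j] * n / (cluster_sizes[i] * category_sizes[j]),
                    k * g)
    return total / n
```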
From Fig. 4 we see that the mutual information generally tends to increase as we allow larger and larger graphs. This makes sense since the larger graphs incorporate more information. On the figure we indicated the values of mutual information from the original experiments for three out of the five methods from Table 1. Euclidean is the classical k-means with a Euclidean distance measure. Random baseline is simply a random assignment of data to clusters; it is used to
provide a baseline for comparison. We would expect any algorithm to perform better than Random, but we see the Euclidean k-means did not. Finally, Jaccard is k-means using the Extended Jaccard Similarity.2,20 It was the best performing of all the k-means methods reported in the original experiment so we have omitted cosine similarity and Pearson correlation on the chart for clarity.

It is not a surprising result to see Euclidean distance perform poorly when using the vector model for representing documents, as it does not have the property of vector length invariance. Because of this, documents with similar term frequency proportions but differences in overall total frequency have large distances between them even though they are supposed to be considered similar. For example, if we were interested in the topic "data mining", a document where the terms "data" and "mining" each appeared 10 times and a document where both terms each appeared 1,000 times are considered to be identical when we have the length invariance property (i.e. their distance is 0). It is only the relative proportion between the terms that is of interest when determining the document's content, since there are often large variances in total term frequency even for documents related to the same topic. Here both documents contain an equal proportion of the terms "data" and "mining". If the term "mining" occurred much more frequently than "data", we would expect the document to be related to a different topic (e.g., "gold mining"). Under Euclidean distance these two documents would have a large distance (i.e. be considered dissimilar) due to the fact that the difference in total frequency (10 vs. 1,000) is large. This is why distance measures with the length invariance property (such as the cosine measure, which measures the cosine of the angle between two feature vectors) are often used in these types of applications in lieu of standard Euclidean distance.

We see that even with only 5 nodes per graph our method outperforms both Euclidean k-means and the random baseline; as we increased the number of nodes per graph the performance approaches that of the other k-means methods until it exceeded even the best k-means method reported at 75 nodes per graph or more. For comparison, the original experiment used a term-document matrix where each vector had 2,903 dimensions. We note a general increasing trend in performance as we allow for larger graphs, which would be consistent with the increase in information that occurs as we introduce new terms (nodes) and phrases (edges) in the graphs. However, the performance improvement is not always strictly proportional with the increase in graph size. For example, the improvement from 60 to 75 is greater than the improvement from 75 to 90 even though we are adding 15 new nodes in each case. This may be due to the fact that the extra nodes added when we increase the graph size, while they are frequently occurring terms, may not always provide information that is useful for discriminating between the documents and in actuality may hinder performance by introducing extraneous data. A future improvement may be to find better methods of selecting the nodes to be used in each graph rather than relying strictly on term frequency.
15
frequently occurring terms, may not always provide information that is useful for discriminating between the documents and in actuality may hinder performance by introducing extraneous data. A future improvement may be to find better methods of selecting the nodes to be used in each graph rather than relying strictly on term frequency.
Fig. 4. Mutual Information as a function of the maximum number of vertices per graph.
Fig. 5. Clustering time as a function of the maximum number of vertices per graph.
6. Conclusions

In this chapter we showed how it is possible to cluster web documents using a graph representation rather than a vector representation. A graph representation allows us to retain structural information such as where terms are located in a document and the order in which terms appear — information which is mostly discarded when using the typical vector model approach. Given a graph model of web documents, we can apply traditional clustering techniques such as k-means by performing an extension from Euclidean distance and centroid calculations to graph distance and median graphs, respectively. To demonstrate the performance of the extended k-means method with our graph representation of web documents, we performed experiments on a web document collection and compared with previous results of clustering using k-means when utilizing a vector model for the same documents. We have discovered the following from our experiments:

• Our method outperformed the baseline random assignment method and the vector-based k-means method using Euclidean distance, even in the case of maximum dimensionality reduction using 5 nodes per graph.

• As the maximum number of nodes allowed per graph became larger, the performance of our method generally increased. This reflects an increase in the amount of information in the graphs as we add nodes and edges.

• Our method outperformed all the k-means clustering methods (Euclidean distance, cosine measure, Pearson correlation, and Jaccard similarity) described in Strehl et al.20 when we allowed 75 nodes per graph or more. We believe this reflects the information retained by the graph representation which is not present when using the vector model approach.
We have many avenues to explore for future work. We have shown one graph distance measure here, but others have been proposed. We will perform experiments with other graph distance measures and compare clustering performance. We can also attempt to create a more elaborate graph representation for web documents. For example, we can recognize more document sections, connect words that appear in the same sentence or paragraph, and so on. Such representations could capture even more information, possibly leading to better performance. It is also possible to apply our technique to structured text, such as XML documents and software
programs, and we intend to investigate clustering collections of source code using our method. We also wish to extend other clustering algorithms to work with graphs, such as hierarchical agglomerative clustering and fuzzy c-means.

Acknowledgments

This work was supported in part by the National Institute for Systems Test and Productivity at the University of South Florida under U.S. Space and Naval Warfare Systems Command grant number N00039-01-1-2248.

References
1. A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys, Vol. 31, No. 3, 1999, pp. 264-323.
2. G. Salton, Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, 1989.
3. H. Bunke, X. Jiang, and A. Kandel, "On the minimum common supergraph of two graphs", Computing, Vol. 65, 2000, pp. 13-25.
4. J. T. L. Wang, K. Zhang, and G.-W. Chirn, "Algorithms for approximate graph matching", Information Sciences, Vol. 82, 1995, pp. 45-74.
5. H. Bunke, "On a relation between graph edit distance and maximum common subgraph", Pattern Recognition Letters, Vol. 18, 1997, pp. 689-694.
6. H. Bunke and K. Shearer, "A graph distance metric based on the maximal common subgraph", Pattern Recognition Letters, Vol. 19, 1998, pp. 255-259.
7. G. Levi, "A note on the derivation of maximal common subgraphs of two directed or undirected graphs", Calcolo, Vol. 9, 1972, pp. 341-354.
8. J. J. McGregor, "Backtrack search algorithms and the maximal common subgraph problem", Software Practice and Experience, Vol. 12, 1982, pp. 23-34.
9. W. D. Wallis, P. Shoubridge, M. Kraetzl, and D. Ray, "Graph distances using graph union", Pattern Recognition Letters, Vol. 22, 2001, pp. 701-704.
10. M.-L. Fernandez and G. Valiente, "A graph distance metric combining maximum common subgraph and minimum common supergraph", Pattern Recognition Letters, Vol. 22, 2001, pp. 753-758.
11. H. Bunke, S. Günter, and X. Jiang, "Towards bridging the gap between statistical and structural pattern recognition: two new concepts in graph matching", in Advances in Pattern Recognition — ICAPR 2001, eds. S. Singh, N. Murshed, and W. Kropatsch, Springer-Verlag, 2001.
12. H. Bunke and A. Kandel, "Mean and maximum common subgraph of two graphs", Pattern Recognition Letters, Vol. 21, 2000, pp. 163-168.
13. J. G. Augustson and J. Minker, "An analysis of some graph theoretical cluster techniques", Journal of the Association for Computing Machinery, Vol. 17, No. 4, 1970, pp. 571-588.
14. T. M. Mitchell, Machine Learning, McGraw-Hill, Boston, 1997.
15. D. Boley, M. Gini, R. Gross, E. H. Han, K. Hastings, G. Karypis, B. Mobasher, and J. Moore, "Partitioning-based clustering for web document categorization", Decision Support Systems, Vol. 27, 1999, pp. 329-341.
16. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, "Extracting large-scale knowledge bases from the web", Proceedings of the 25th International Conference on Very Large Databases, 1999, pp. 639-649.
17. S. A. Macskassy, A. Banerjee, B. D. Davison, and H. Hirsh, "Human performance on clustering web pages: a preliminary study", Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 264-268.
18. O. Zamir and O. Etzioni, "Web document clustering: a feasibility demonstration", Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 46-54.
19. D. Lopresti and G. Wilfong, "Applications of graph probing to web document analysis", Proceedings of the 1st International Workshop on Web Document Analysis (WDA'2001), Seattle, USA, September 2001 (ISBN: 0-9541148-0-9), also at http://www.csc.liv.ac.uk/~wda2001, pp. 51-54.
20. A. Strehl, J. Ghosh, and R. Mooney, "Impact of similarity measures on web-page clustering", AAAI-2000: Workshop of Artificial Intelligence for Web Search, 2000, pp. 58-64.
21. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, Upper Saddle River, 1995.
22. X. Lu, "Document retrieval: a structural approach", Information Processing and Management, Vol. 26, No. 2, 1990, pp. 209-218.
23. M. Crochemore and R. Verin, "Direct construction of compact directed acyclic word graphs", in CPM 97, eds. A. Apostolico and J. Hein, Springer-Verlag, 1997.
24. B. T. Messmer and H. Bunke, "A new algorithm for error-tolerant subgraph isomorphism detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 5, 1998, pp. 493-504.
25. T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, 1991.
CHAPTER 2

APPLICATIONS OF GRAPH PROBING TO WEB DOCUMENT ANALYSIS

Daniel Lopresti and Gordon Wilfong
Bell Labs, Lucent Technologies Inc.
600 Mountain Avenue, Murray Hill, NJ 07974 USA
E-mail: {dpl,gtw}@research.bell-labs.com

In this chapter, we describe our steps towards adapting a new approach for graph comparison known as graph probing to collections of semi-structured documents (e.g., Web pages coded in HTML). We consider both the comparison of two graphs in their entirety, as well as determining whether one graph contains a subgraph that closely matches the other. A formalism is presented that allows us to prove graph probing yields a lower bound on the true edit distance between graphs. Results from several experimental studies demonstrate the applicability of the approach, showing that graph probing can distinguish the kinds of similarity of interest and that it can be computed efficiently.

1. Introduction

Graphs are a fundamental representation in much of computer science, including the analysis of both traditional and Web documents. Algorithms for higher-level document understanding tasks often use graphs to encode logical structure. HTML pages are usually regarded as tree-structured, while the WWW itself is an enormous, dynamic multigraph. Much work on attempting to extract information from Web pages makes explicit or implicit use of graph representations.1-8 It follows, then, that possessing the ability to compare two graphs can be essential, as demonstrated in such applications as query-by-structure, information extraction, performance evaluation etc.

Because most problems relating to graph comparison have no known efficient, guaranteed-optimal solution, researchers have developed a wide range of heuristics. Recently, we have begun to explore an intuitive, easy-to-implement
scheme for comparing graphs that we call graph probing. As shown in Fig. 1, each of the two graphs under study is placed inside a "black box" capable of evaluating a set of graph-oriented operations (e.g., counting the number of vertices labeled in a certain way) and then subjected to a series of simple queries. A measure of the graphs' dissimilarity is the degree to which their responses to the probes disagree.
Fig. 1. Overview of graph probing.
In this chapter, we examine in detail the graph probing paradigm we first put forth in the context of our work on table understanding,9-11 where it played an important role in evaluating the performance of the recognition techniques under development. We later extended the approach to the analysis of HTML-coded Web pages from the perspective of information retrieval.5,6 The preliminary experimental results reported in those earlier papers are substantially augmented herein and accompanied by more extensive explanations and new analyses.

We begin with a discussion of past work relating to the problem of graph comparison, both in the context of Web documents and more generally. In Sec. 3, we present a formalism showing that graph probing provides a lower bound on the true edit distance between two graphs, and that with a minor change we can derive a similar bound for subgraph matching. To examine how well the approach might work in practice, we provide results from several experimental studies in Sec. 4. Finally, we offer our conclusions and topics for future research in Sec. 5.
2. Related Work

Graph comparison is a widespread yet challenging problem, so it should come as no surprise that many researchers have proposed heuristics and/or solutions designed for special cases. It is not our intent to survey the field exhaustively, but rather to identify certain representative papers, especially those most closely related to the approach we are about to describe. A more comprehensive overview can be found in a recent paper by Bunke.12 Jolion offers opinions on research trends in graph matching.13

Before beginning, it is important to distinguish between the exact and approximate matching problems. The former is typically called graph isomorphism, while the latter is often phrased in terms of a minimum-cost sequence of basic editing operations (e.g., the insertion and deletion of vertices and edges) that accounts for the observed differences between the two graphs and which defines the notion of edit distance. These viewpoints are in fact complementary; it should be clear how a solution to the approximate matching problem could be helpful in solving the isomorphism problem. Moreover, since graphs that are sufficiently similar most likely contain subgraphs that are identical, subgraph isomorphism can be seen as facilitating approximate matching. A formal connection between these two concepts was established by Bunke.14

In the context of graph editing, another vital distinction arises with respect to two particular quantities that may be of interest: the actual sequence of operations needed to edit one graph into the other, and the cost of such a sequence. The former is useful in attempting to understand the differences between the graphs and why they may have arisen, while the latter provides a concrete measure of similarity. Given a minimum-cost sequence of editing operations, calculating the corresponding edit distance is straightforward. While the converse is not true, the edit distance by itself is still extremely valuable, especially if it can be computed much more rapidly or for larger graphs than would otherwise be possible using procedures that return the operations.

Much prior work has focused on the graph isomorphism problem (i.e., finding an exact correspondence between two graphs) and its variants. The complexity of graph isomorphism remains open and, unfortunately, all known algorithms for its solution have worst-case exponential running times.15 Heuristics for determining isomorphism often rely on the concept of a vertex invariant, that is, a value f(v) assigned to each vertex v such that under any isomorphism I, if I(v) = v′ then f(v) = f(v′). One such
invariant is the degree of a vertex (or the in- and out-degrees, if the graph is directed). Indeed, Nauty, an effective software package for computing graph isomorphism,16,17 relies on vertex invariants. In general, such heuristics can fail in a catastrophic manner.18 On the other hand, it has been shown that for random graphs, there is a simple linear time test for checking if two graphs are isomorphic based on the degrees of the vertices, and this test succeeds with high probability.19

Other research aims at speeding up the computation for database searches. Lazarescu et al. propose a machine learning approach to building decision trees for eliminating from further consideration graphs that cannot possibly be isomorphic to a given query graph.20 While they employ a similar set of features to the ones we use, they do not consider the approximate matching problem or subgraph problems. Bunke and Messmer present a decision-tree-based precomputation scheme for solving the subgraph isomorphism problem, although their data structure can be exponential in the size of the input graphs in the worst case.21,22

Valiente and Martinez describe an approach for subgraph pattern matching based on finding homomorphic images of every connected component in the query.23 Again, the worst-case time complexity is exponential, but such features could also perhaps be incorporated in the measures we are about to present.

Turning to graph edit distance, we note there have been a large number of papers written on the subject and its many applications. These can be divided into two categories. In the first we have procedures that are guaranteed to find an optimal solution but may, in the worst case, require exponential time, such as the early work by Pu and colleagues.24,25 The second category includes heuristics that may not necessarily return an optimal match, but that have polynomial running times as in, for example, the Bayesian framework posed by Myers et al.26 Frequently these papers focus on search strategies intended to speed up the computation when certain conditions are satisfied, making the understanding and implementation of the algorithms more difficult.

Papadopoulos and Manolopoulos discuss an idea that is philosophically quite similar to ours.27 However, they focus on a single invariant: vertex degree. It is clear this is not sufficient for catching all of the interesting differences that can arise between HTML documents. Moreover, their histogram technique is applied only to the problem of comparing complete graphs, whereas we wish to examine the subgraph matching problem as well.
Instead of trying to solve the problem for graphs in general, some leeway can be had by limiting the discussion to trees, for which efficient comparison algorithms are known. Schlieder and Naumann consider a problem closely related to ours: error-tolerant embedding of trees to judge the similarity of XML documents.8 Likewise, Dubois et al. write about tree embedding for searching databases of semi-structured multimedia documents.3

The WebMARS system, as presented by Ortega-Binderberger et al.,7 models Web documents via their parse trees. Queries are likewise treated as trees, although they encode hierarchy for individual object types (e.g., text, images) and do not represent the same sorts of inter-object relationships that the mark-up in Web documents encodes. Matching only takes place between the leaves of the query tree and all possible "chains" in the document tree (i.e., paths leading in the direction from the root to a leaf). The match values are then propagated upwards towards the root of the query tree over edges that can be weighted to reflect the importance of that particular component. Hence, there is an asymmetry between queries and documents. In any case, the graph model we would like to support is more general than simple trees, allowing both cross- and back-edges.

Finally, the flat-file comparison of Web pages, effectively ignoring their hierarchical structure, has been handled by Douglis et al. through the application of string (as opposed to graph) matching techniques.28,29
3. A Formalism for Graph Probing

In this section we formalize the concept of graph probing as a way of quantifying graph similarity. Our goal is to relate probing to a more rigorous but harder-to-compute graph edit distance model.

Let G₁ = (V₁, E₁) and G₂ = (V₂, E₂) be two directed attribute graphs, that is, directed graphs where the vertices and edges are potentially labeled by a type. For example, vertices might be labeled as corresponding to HTML structure tags (e.g., section heading, paragraph, table) as well as their associated content, while the labels for edges represent relationships between structures (e.g., contains, next, hypertext reference).

Now introduce a graph editing model that allows the following basic operations: (1) delete an edge, (2) insert an edge, (3) delete an isolated vertex, (4) insert an isolated vertex, (5) change the type of an edge, (6) change the type of a vertex. The edges and vertices created through insertions can be assigned any type initially. It should be clear that such operations can be used to edit any graph into any other graph. The minimum number of
operations needed to edit G₁ into G₂ is the graph edit distance, dist(G₁, G₂). There is no known algorithm for efficiently computing this distance in general.

Consider a probing procedure that asks the following kinds of questions: "How many vertices with a specific combination of incoming and outgoing edges are present in graph G?" Suppose there are a different edge labels l_1, ..., l_a. The edge structure of a given vertex can then be represented as a 2a-tuple of non-negative integers, (x_1, ..., x_a, y_1, ..., y_a), if the vertex has exactly x_i incoming edges labeled l_i and exactly y_j outgoing edges labeled l_j for 1 ≤ i, j ≤ a. Then a typical probe will have the form: "How many vertices with edge structure (x_1, ..., x_a, y_1, ..., y_a) are present in graph G?" We call these Class 1c probes.ᵃ

This is sufficient for detecting many kinds of changes to an HTML document, but note that the content can be altered without affecting the graph structure of a Web page. Hence, we also need a class of probes focusing on just the vertices and their types: "How many vertices labeled in a specific way are present in graph G?" These are known as Class 2 probes.

Let PR₁c collect the responses for vertex in- and out-degrees and their respective edge types, and let PR₂ collect the responses for vertex types. Define

$$\mathrm{probe}(G_1, G_2) = \left| PR_{1c}(G_1) - PR_{1c}(G_2) \right| + \left| PR_2(G_1) - PR_2(G_2) \right| \qquad (1)$$
That is, probe is the magnitude of the difference between the two sets of probing results (the L1 norm, or the sum of the absolute values of the differences between the entries in the two vectors). We then have:

Theorem 1: Under the directed attribute graph model and its associated edit model, probe is a lower bound, within a factor of 1/4, on the true edit distance between any two graphs. That is, (1/4) · probe(G1, G2) ≤ dist(G1, G2).
The proof of this result follows from a simple case analysis. The operations that cause the largest possible disparity between edit distance and the graph probing measure are the deletion or insertion of an edge. Any one such edit may cause as many as, but no more than, four of the edge structure probes to differ by one.
a Class 1a and Class 1b probes correspond to two graph models we will not be considering here: undirected and unlabeled directed graphs, respectively.
An example is illustrated in Fig. 2, where the comparison of the two probe vectors PR1c and PR2 yields a value of four as opposed to the true edit distance which is three (corresponding to the operations listed in the figure).
Fig. 2. Probing example for the directed attribute graph model.
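To make the probe computation concrete, the following small sketch (our own illustration in Python, not the authors' implementation; the graph encoding and labels are invented for the example) builds the Class 1c and Class 2 probe responses and evaluates Eq. (1) as an L1 difference over all observed probes.

from collections import Counter

def probe_vectors(vertices, edges):
    # vertices: {vertex_id: vertex_type}; edges: [(src, dst, edge_type), ...]
    # Class 1c: count vertices by their labeled in-/out-degree signature
    # (a sparse stand-in for the 2a-tuples described in the text).
    signature = {v: Counter() for v in vertices}
    for src, dst, etype in edges:
        signature[src][("out", etype)] += 1
        signature[dst][("in", etype)] += 1
    class_1c = Counter(tuple(sorted(sig.items())) for sig in signature.values())
    # Class 2: count vertices by their type.
    class_2 = Counter(vertices.values())
    return class_1c, class_2

def l1_diff(c1, c2):
    return sum(abs(c1[k] - c2[k]) for k in set(c1) | set(c2))

def probe_distance(g1, g2):
    p1c_a, p2_a = probe_vectors(*g1)
    p1c_b, p2_b = probe_vectors(*g2)
    return l1_diff(p1c_a, p1c_b) + l1_diff(p2_a, p2_b)   # Eq. (1)

# Two tiny graphs differing only in the type of a single edge:
g1 = ({1: "p", 2: "table", 3: "p"}, [(1, 2, "next"), (2, 3, "next")])
g2 = ({1: "p", 2: "table", 3: "p"}, [(1, 2, "next"), (2, 3, "contains")])
print(probe_distance(g1, g2))   # prints 4

In this toy case a single edge-type change (edit distance 1) perturbs the edge-structure signatures of two vertices, giving a probe distance of 4; the bound of Theorem 1, (1/4) · 4 ≤ 1, holds with equality.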
The precomputation needed for each graph is as follows. Computing the edge structures of all the vertices takes total time O(|E| + a|V|). These |V| 2a-tuples can then be lexicographically sorted in O(a(δ + |V|)) time, where δ is the maximum number of edges incident on any vertex. Then a simple pass through the sorted list allows us to compute the number of vertices in each of the (non-empty) classes in additional time O(a|V|). Thus the total precomputation time is O(a(δ + |V|) + |E|). We remark that a and δ are likely to be small constants, in which case the time complexity becomes O(|V| + |E|). The quantity probe(G1, G2) defined by Eq. 1 is referred to as the graph probing distance between G1 and G2. It provides an approximation to the true edit distance between two graphs, which would be too expensive to compute in the most general case. Another useful observation is that the graph probing measure satisfies the triangle inequality. In other words, probe(G1, G2) ≤ probe(G1, G3) + probe(G3, G2) for all graphs G1, G2, and G3. Graph probing distance is also non-negative and symmetric (i.e., probe(G1, G2) = probe(G2, G1)), thus satisfying three of the four conditions of a metric space. The remaining condition, separation, is violated, however, as probe(G1, G2) = 0 does not
imply that G1 and G2 are identical (isomorphic). Hence, graph probing distance forms what is commonly referred to as a "pseudo-metric" space. When comparing two graphs in their entirety, it suffices to correlate their responses to the probes and measure the disagreement. For the problem of subgraph matching, however, we cannot expect to be able to compare directly the outputs for the larger graph to those for the smaller. For example, consider a query graph consisting of a single table that corresponds to a table in some database document, but where that document also contains dozens of other tables. There should be no penalty for having to account for the unrelated tables if our goal is to quantify the similarity between the query and some subgraph of the target; the fact that there is a good match for the one table is enough. To assure that the Class 2 probes are invariant across subgraph isomorphism, the manner in which the results are compared must account for this. Clearly, if the query graph contains a certain number of vertices labeled in a given way, the target graph may possibly contain an isomorphic subgraph so long as it contains at least that many vertices labeled in the same way. If the target graph has fewer such vertices, we know there cannot be a subgraph isomorphism, although there may still be a good approximate matching. Similarly, the way in which the Class 1c probes are correlated needs to be modified as well, since the vertices present in the query graph may have fewer incoming and outgoing edges than their corresponding vertices in an isomorphic subgraph of a larger graph. Hence, when computing the difference between the two sets of probing results under Eq. 1, instead of the standard per-element difference |n1,j - n2,j|, we use max(0, n1,j - n2,j), where ni,j is the value of the j-th feature for graph Gi, i ∈ {1, 2}, and G1 is the query graph being matched to some subgraph of G2. Through a straightforward extension to the previous discussion, we can define a subgraph edit distance, subdist, and a subgraph probing distance, subprobe. As in the case of regular graph matching, we can show a lower bound result:

Theorem 2: Under the directed attribute graph model and its associated edit model, subprobe is a lower bound, within a factor of 1/2, on the true subgraph edit distance between any two graphs. That is, (1/2) · subprobe(G1, G2) ≤ subdist(G1, G2).
The bound in this case is stronger (a factor of 1/2 versus a factor of 1/4) because the worst-case edit can affect the results of at most two probes.
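For illustration only (this is our paraphrase, not the published code), the asymmetric comparison can be written as a one-sided difference: a probe is penalized only when the query graph G1 demands more occurrences than the target G2 supplies.

def subprobe_distance(query_probes, target_probes):
    # query_probes / target_probes: dicts mapping a probe (an edge-structure
    # signature or a vertex type) to its response count.  Using
    # max(0, n1 - n2) means a small query embedded in a much larger page
    # incurs no penalty for the page's unrelated content.
    keys = set(query_probes) | set(target_probes)
    return sum(max(0, query_probes.get(k, 0) - target_probes.get(k, 0))
               for k in keys)

Note that this sketch treats Class 1c and Class 2 probes uniformly; as the text points out, the Class 1c comparison in the full scheme has to be relaxed further, since query vertices may legitimately have fewer incident edges than their counterparts inside a larger graph.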
4. Experimental Evaluation

This chapter describes research that is still very much in progress. As we noted, we have previously implemented a version of graph probing to help in the evaluation of a table understanding system.9-11 By comparing the output of a recognition algorithm to the ground-truth created by a human expert, both of which can be represented as attribute graphs, it becomes possible to quantify the performance of the algorithm. However, the utility of graph probing in that specialized domain is not an assurance that it is an appropriate paradigm for searching databases in a retrieval application, especially since the classes of probes we have at our disposal are also different in this case. In the remainder of this section, we present the results of several empirical studies demonstrating how graph probing might be applied in the analysis of semi-structured documents (i.e., Web pages coded in HTML). We begin by describing the graph model we use. For our test collections, we assembled a random assortment of Web pages with the assistance of a commercial search engine using a procedure to be described shortly. The first test examined the ability of the two probe classes to detect changes as a specific commercial Web page evolved naturally over several days. The second studied the subgraph matching problem by searching for a Web page that had been edited by deleting a significant portion of its content.

4.1. Graph Model
The attribute graph model we employ for HTML documents includes the standard tree-structured hierarchy generated when parsing the tags (the "contains"/"contained-by" relationship). Each tag, or pair of matching start/end tags, corresponds to a vertex in the graph, with any associated content or metadata encoded as attributes. For example, the HTML fragment <font face="Arial">Web Document Analysis</font> would yield one vertex of type font with metadata face="Arial" and content Web Document Analysis. Since the content within a vertex can be arbitrarily long, we hash this data to 32-bit integers to facilitate efficient comparison. Beyond the parse tree, we also make use of the order in which content and the various substructures are encountered (in many cases this corresponds to the natural reading order for the material in question). We
represent this via "next"/"previous" cross-edges that connect vertices at a given level in the hierarchy, rather than assuming an implicit fixed ordering on the children of a vertex as some other researchers have done. Lastly, we record hyperlinks as either back-edges (in the case of targets on the same page) or a distinguished vertex type (in the case of external references). No provision is currently made for incoming links from outside documents. The inline content at each vertex is hashed to a 32-bit integer to facilitate comparison and also stored in full as a separate file. Our parse graph generator is implemented in Tcl/Tk30 and is capable of recovering from the kinds of simple errors that often arise in real-world HTML (e.g., missing end tags). Recall from the previous section that we have defined two probe classes for these kinds of graphs:

Class 1c: These probes examine the vertex and edge structure of the graph by counting in- and out-degrees, tabulating different types of incoming and outgoing edges separately.

Class 2: These probes count the occurrences of a given type of vertex in the graph.
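As a rough sketch of how such a parse graph might be assembled (our own simplification in Python, standing in for the authors' Tcl/Tk generator; hyperlink back-edges and error recovery are omitted, and the vertex and edge type names are illustrative):

import zlib
from html.parser import HTMLParser

class ParseGraphBuilder(HTMLParser):
    # Builds a directed attribute graph: "contains" edges from the tag
    # hierarchy and "next" cross-edges between siblings at the same level.
    def __init__(self):
        super().__init__()
        self.vertices = {0: ("root", {}, 0)}   # id -> (type, attrs, content hash)
        self.edges = []                        # (src, dst, edge type)
        self.stack = [0]                       # id 0 is an artificial document root
        self.last_child = {0: None}
        self.next_id = 1

    def _add_vertex(self, tag, attrs, content=""):
        vid = self.next_id
        self.next_id += 1
        # Hash arbitrary-length content to a 32-bit integer for cheap comparison.
        self.vertices[vid] = (tag, dict(attrs), zlib.crc32(content.encode("utf-8")))
        parent = self.stack[-1]
        self.edges.append((parent, vid, "contains"))
        if self.last_child.get(parent) is not None:
            self.edges.append((self.last_child[parent], vid, "next"))
        self.last_child[parent] = vid
        self.last_child[vid] = None
        return vid

    def handle_starttag(self, tag, attrs):
        self.stack.append(self._add_vertex(tag, attrs))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._add_vertex("#text", [], data.strip())

builder = ParseGraphBuilder()
builder.feed('<p><font face="Arial">Web Document Analysis</font></p>')
print(builder.vertices, builder.edges)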
"Random"
Collections
of Web
Pages
To conduct a convincing evaluation of the graph probing paradigm, we need access to a relatively large assortment of real Web pages that have been authored by many different individuals using a variety of editing tools, along with the ability to assemble collections "on-the-fly," either in an unbiased manner or perhaps satisfying certain criteria. Our approach to this problem is to resort to the Web itself by making use of a commercial search engine, querying it for randomly-chosen search terms and then selecting from the results it returns the random "hits" we shall use as our test documents. The Google search engine31 claims to index over 2 billion Web pages as of the time of this writing. We query Google by picking a random word from the Unix "spell" dictionary, which contains 24,259 words including a number of proper names. Based on this search, we choose one of the pages of hits returned by Google, and from that page we select a pointer to a specific HTML document which we then fetch and subject to further analysis. Our implementation of the Web interface is programmed in Tcl/Tk using the Spynergy Toolkit.32 It takes a total of three HTTP "round-trips" to get the data we require: (1) First, issue a search request using a randomly-chosen keyword and
retrieve the first page of results, which includes a summary of the total number of hits. (2) The hits are grouped by Google 10-per-page. Identify one of these pages of hits at random and retrieve it. (3) Within this page, determine one of the 10 URLs at random and retrieve the page it refers to. We have developed a set of simple "wrappers" to extract the necessary information from the HTML code that is returned in the steps above. For the experiments that follow, we collected 1,000 random Web pages in this fashion. On the occasions when an HTTP fetch timed out (after 30 seconds for the initial connection, and 5 seconds for each subsequent buffer), the search was attempted again using a different term. We also re-ran searches that produced no matches. Since an HTTP request may return an arbitrary document (not necessarily HTML) of any length, or no document at all (in the case of a stale link), we enforced minimum and maximum page sizes and also checked for common patterns that indicate an error has occurred (e.g., "404: page not found"). It is also important to observe that even though Google purports there are millions of hits for some search terms, apparently it will only ever return at most the first 1,000 hits. Hence, the universe we are selecting from is much smaller than the entire Web, although with an upper bound of around 24 million pages (the size of the dictionary times the maximum number of hits Google will return for each search term), it seems sufficient for our purposes.
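In outline, the sampling loop can be sketched as follows (our own illustration; fetch_hits() is a hypothetical placeholder standing in for the search-engine round trips and result-page wrappers, and the size and error checks described above are omitted):

import random

def load_dictionary(path="/usr/share/dict/words"):
    with open(path) as f:
        return [w.strip() for w in f if w.strip()]

def sample_random_page_url(fetch_hits, dictionary, hits_per_page=10, max_hits=1000):
    # fetch_hits(term, page_index) is assumed to return the list of result
    # URLs on one page of hits, or [] when the query produced nothing.
    while True:
        term = random.choice(dictionary)
        first_page = fetch_hits(term, 0)                  # round trip 1
        if not first_page:
            continue                                      # re-run empty searches
        # The real procedure uses the hit count reported on the first page,
        # capped at 1,000; here we simply assume the maximum is available.
        page = fetch_hits(term, random.randrange(max_hits // hits_per_page))  # round trip 2
        if page:
            return random.choice(page)                    # round trip 3 fetches this URL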
4.3. Experiment #1: Full Graph Matching
For the first experiment, we used as our query the May 31, 2001 homepage from The New York Times (http://www.nytimes.com/). This is a relatively complex page, its parse tree containing 1,088 vertices. The results for using graph probing to compare this page to 1,000 random Web pages are shown in Fig. 3. The May 31 page is, of course, a perfect match for itself, but also a very good match for the June 4, 5, and 6 homepages as well (the other examples from The New York Times "seeded" in the collection). All of the random pages returned much larger distances from the query. In the figure, the chart shows the breakdown between Class 1c probes (dark gray) and Class 2 probes (light gray). The former can be equated with the overall structure of the page, while the latter are more closely related to content. Note that the contributions of the two probe components are for the most part evenly balanced for random pages, but that nearly all of the differences
between the various versions of The New York Times pages are due to Class 2 probes. A number of other regularly-updated Web pages we have examined exhibit this same behavior; there are no major structural changes from day to day, although the content is constantly varying.
Fig. 3. Graph probing results for Experiment 1 (full graph comparison measure).
Such a result is only of interest, of course, if graph probing distance can be computed efficiently. Fortunately, this appears feasible as illustrated in Fig. 4. Here we show two sets of datapoints as functions of the number of vertices in the target graph. The upper set (light gray squares) displays the total time it takes to parse the HTML document into an intermediate representation, generate the probing results, and compare them to the query. These times range from a fraction of a second to at most 15 seconds for graphs of up to several thousand vertices.b As indicated earlier, our HTML parser is coded in an interpreted (as opposed to compiled) language. This offers a degree of flexibility, but does exact a performance penalty. Since our focus is on the probing process, we coded those routines in C and measured the run times separately; these are plotted in the lower set of datapoints in the figure (dark gray diamonds). Once parsed, Web pages can be compared using graph probing in about 1/100 of a second, on average.
b All timing results reported in this chapter were recorded on an IBM ThinkPad T23 (1 GHz Pentium III, 256 Mbyte RAM) running the Linux operating system.
Fig. 4. Time to compare Web pages versus number of vertices in parse graph.
In Fig. 5 we examine the number of probes as a function of the size of the graph. As the breakdown shows, the growth in the total number of probes (light gray squares) is dominated by the Class 2 probes. The Class 1c probes (dark gray diamonds) grow rapidly at first, but then start to flatten out dramatically at between 30 and 60 probes. Whether the number of such probes is upper-bounded by a constant as the graphs get increasingly larger is a subject for future research. A more detailed analysis of the statistics for the experiment appears in Tables 1 and 2. In Table 1, we present data on the test collection. Here we can see, for example, that the minimum graph probing distance between one of the random Web pages and the query was 1,302, while the average was 1,933. The majority of the time was spent fetching the data over the Internet: graph probing itself is relatively fast. Statistics for the query and related pages are given in Table 2, where the processing times quoted are averaged over 10 iterations to smooth out any transient anomalies in system performance. It is indeed true that the page for June 4 required significantly less time to parse than the others, even though it is about the same size.
Fig. 5. Number of probes generated versus number of vertices in parse graph.
As there are several possible explanations, this would be an interesting question to explore. The asymmetry between the two probe classes for the graphs in Table 1 versus the graphs in Table 2 is also striking.
Table 1. Statistics for the test collection in Experiment 1. Attributes reported: Google Hits, HTML Size (bytes), Parse Graph Vertices, Class 1c Probes, Class 1c Distance, Class 2 Probes, Class 2 Distance, Overall Probes, Overall Distance, Fetch Time (secs), Parse Time (secs), Probe Time (secs), Compare Time (secs), Overall Time (secs).
Table 2. Statistics for the query and related documents in Experiment 1.

Attribute            | NYT May 31, 2001 | NYT June 4, 2001 | NYT June 5, 2001 | NYT June 6, 2001
HTML Size (bytes)    | 50,239           | 50,311           | 49,625           | 49,413
Parse Graph Vertices | 1,088            | 1,099            | 1,087            | 1,081
Class 1c Probes      | 48               | 48               | 48               | 49
Class 1c Distance    | 0                | 21               | 23               | 27
Class 2 Probes       | 234              | 240              | 235              | 235
Class 2 Distance     | 0                | 205              | 203              | 195
Overall Probes       | 282              | 288              | 283              | 284
Overall Distance     | 0                | 226              | 226              | 222
Parse Time (secs)    | 4.195            | 3.446            | 4.175            | 4.281
Probe Time (secs)    | 0.00776          | 0.00789          | 0.00772          | 0.00772
Compare Time (secs)  | 0.00525          | 0.00384          | 0.00381          | 0.00383
Overall Time (secs)  | 4.21             | 3.46             | 4.19             | 4.29

4.4. Experiment #2: Subgraph Matching
For our second experiment, we used the homepage for the WDA'2001 workshop as it existed on July 12, 2001, a screen snapshot of which is shown in Fig. 6. The corresponding parse tree contained 328 vertices. To create the query, the page was modified by deleting the left sidebar (the listing of the workshop chairs and program committee members). The parse tree for this edited page had 125 vertices. Our intention was, of course, to formulate a query graph that would be a good subgraph match for the original (full) page, but hopefully not for other, random pages on the Web. The results for this experiment, using graph probing with the subgraph comparison measure, are shown in Fig. 7. Here we can see that the two probe classes are easily able to distinguish the original page and the smaller, edited version from the rest of the test collection. The biggest contributors to the measured differences were the Class 2 probes, which suggests that content is the key distinguishing factor, although structural differences were captured as well. In the case of the query and the original page, only a single Class 1c probe disagreed, indicating a nearly perfect match. As before, we provide a more detailed breakdown of the statistics in Tables 3 and 4. Since a relatively small graph can appear to be a good subgraph match for many larger graphs, graph probing using the subgraph comparison measure is less stringent than full matching. This is the reason the distance values in Table 3 are smaller than the corresponding distances for the same collection of pages using the full graph comparison measure (Table 1). Still, there are detectable differences between the parse graphs for random Web pages and the query we created by editing the WDA'2001 homepage.
Fig. 6. Snapshot of the WDA'2001 homepage (http://www.csc.liv.ac.uk/~wda2001/).

Fig. 7. Graph probing results for Experiment 2 (subgraph comparison measure).
Table 3. Statistics for the test collection in Experiment 2. Attributes reported: Google Hits, HTML Size (bytes), Parse Graph Vertices, Class 1c Probes, Class 1c Distance, Class 2 Probes, Class 2 Distance, Overall Probes, Overall Distance, Fetch Time (secs), Parse Time (secs), Probe Time (secs), Compare Time (secs), Overall Time (secs).
Table 4. Statistics for the query and related document in Experiment 2.

Attribute            | WDA July 12, 2001 (Original)
HTML Size (bytes)    | 12,109
Parse Graph Vertices | 328
Class 1c Probes      | 21
Class 1c Distance    | 1
Class 2 Probes       | 118
Class 2 Distance     | 0
Overall Probes       | 139
Overall Distance     | 1
Parse Time (secs)    | 0.542
Probe Time (secs)    | 0.00328
Compare Time (secs)  | 0.00458
Overall Time (secs)  | 0.55
One potential effect that deserves further exploration is whether the use of predefined templates for building Web pages results in a greater degree of structural similarity than would otherwise be expected between two arbitrary, unrelated pages (the WDA'2001 homepage, for example, was created using Microsoft FrontPage, a popular HTML editor). Even in such cases, however, we note that the Class 2 probes will catch differences in the content of the two pages.

5. Conclusions

In this chapter, we have described our initial efforts to adapt the graph probing paradigm to searching databases of graph-structured documents, with a focus on Web pages coded in HTML. We considered both the comparison
of two graphs in their entirety, as well as the subgraph matching problem, and gave some preliminary experimental results showing that graph probing is effective at distinguishing useful kinds of similarity and that it can be computed efficiently. One topic for future research is to examine database indexing techniques that could result in sub-linear search times using this measure. The pseudo-metric space property of full graph probing, and in particular the triangle inequality, may be helpful in this regard. Also, the lower bound arguments we presented relating probing to graph edit distance are based on a simple unit cost model for the latter. It would be interesting to know whether similar bounds can be developed for other, more complicated edit cost functions. Currently, graph probing provides a measure of how similar two graphs are or how similar one graph is to some subgraph of another. It does not, however, yield a mapping from one graph to the other. Clearly, to extract information from semi-structured sources we need to recognize more than the fact that some matching is likely to exist; we must be able to identify the actual correspondence between the graphs. Even so, graph probing can at the very least be used to identify graphs that are a likely match, after which more computationally expensive methods may be run on a smaller collection of graphs to find the best possible matches. In addition to information retrieval, other Web-related applications of graph comparison via probing could include wrapper generation and maintenance1,2 and analysis of HTML-coded tables.4 This would require retargeting our graph probing language to information extraction applications, a task that would be challenging but seems feasible.
6. Acknowledgments
Jianying Hu and Ramanujan Kashi played important roles in the table recognition research which led to the development of the graph probing paradigm. The trademarks mentioned in this chapter are the properties of their respective companies.
References
1. N. Ashish and C. Knoblock, "Wrapper generation for semi-structured Internet sources", In Proceedings of the Workshop on the Management of Semistructured Data, Tucson, AZ, June 1997.
2. W. W. Cohen, "Recognizing structure in Web pages using similarity queries", In Proceedings of the Sixteenth National Conference on Artificial Intelligence, Orlando, FL, 1999, pp. 59-66.
3. D. Dubois, H. Prade, and P. Sedes, "Some uses of fuzzy logic in multimedia databases querying", In Proceedings of the Workshop on Logical and Uncertainty Models for Information Systems, London, England, July 1999.
4. S.-J. Lim and Y.-K. Ng, "An automated approach for retrieving hierarchical data from HTML tables", In Proceedings of the ACM International Conference on Information and Knowledge Management, Kansas City, MO, November 1999, pp. 466-474.
5. D. Lopresti and G. Wilfong, "Applications of graph probing to Web document analysis", In A. Antonacopoulos and J. Hu, editors, Proceedings of the First International Workshop on Web Document Analysis, Seattle, WA, September 2001, pp. 51-54. http://www.csc.liv.ac.uk/~wda2001.
6. D. Lopresti and G. Wilfong, "Comparing semi-structured documents via graph probing", In Proceedings of the Seventh International Workshop on Multimedia Information Systems, Capri, Italy, November 2001, pp. 41-50.
7. M. Ortega-Binderberger, S. Mehrotra, K. Chakrabarti, and K. Porkaew, "WebMARS: A multimedia search engine for full document retrieval and cross media browsing", In Proceedings of the Sixth International Workshop on Advances in Multimedia Information Systems, Chicago, IL, October 2000, pp. 72-81.
8. T. Schlieder and F. Naumann, "Approximate tree embedding for querying XML data", In Proceedings of the ACM SIGIR Workshop on XML and Information Retrieval, Athens, Greece, July 2000.
9. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "A system for understanding and reformulating tables", In Proceedings of the Fourth IAPR International Workshop on Document Analysis Systems, Rio de Janeiro, Brazil, December 2000, pp. 361-372.
10. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "Table structure recognition and its evaluation", In Proceedings of Document Recognition and Retrieval VIII, volume 4307, San Jose, CA, January 2001, pp. 44-55.
11. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "Evaluating the performance of table processing algorithms", International Journal on Document Analysis and Recognition, 4(3), March 2002, pp. 140-153.
12. H. Bunke, "Recent developments in graph matching", In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, Barcelona, Spain, 2000, pp. 117-124.
13. J. M. Jolion, "Graph matching: what are we really talking about?", In Third IAPR Workshop on Graph-based Representations in Pattern Recognition, Ischia, Italy, May 2001. http://rfv.insa-lyon.fr/~jolion/PS/prlcpl.pdf.
14. H. Bunke, "On a relation between graph edit distance and maximum common subgraph", Pattern Recognition Letters, 18, 1997, pp. 689-694.
15. S. Fortin, "The graph isomorphism problem", Department of Computer Science Technical Report TR 96-20, The University of Alberta, July 1996.
16. B. McKay, Nauty User's Guide (Version 1.5). Computer Science Department, Australian National University.
17. B. McKay, "Practical graph isomorphism", Congressus Numerantium, 30, 1981, pp. 45-87.
18. D. G. Corneil and D. G. Kirkpatrick, "A theoretical analysis of various heuristics for the graph isomorphism problem", SIAM Journal on Computing, 9(2), May 1980, pp. 281-297.
19. L. Babai, P. Erdos, and S. M. Selkow, "Random graph isomorphism", SIAM Journal on Computing, 9(3), August 1980, pp. 628-635.
20. M. Lazarescu, H. Bunke, and S. Venkatesh, "Graph matching: Fast candidate elimination using machine learning techniques", In Advances in Pattern Recognition, volume 1876 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, 2000, pp. 236-245.
21. H. Bunke and B. T. Messmer, "Recent advances in graph matching", International Journal of Pattern Recognition and Artificial Intelligence, 11(1), November 1997, pp. 169-203.
22. B. T. Messmer and H. Bunke, "Efficient error-tolerant subgraph isomorphism detection", In Shape, Structure and Pattern Recognition, World Scientific, Singapore, 1995, pp. 231-240.
23. G. Valiente and C. Martinez, "An algorithm for graph pattern-matching", In Proceedings of the Fourth South American Workshop on String Processing, Carleton University Press, 1997, pp. 180-197.
24. A. Sanfeliu and K.-S. Fu, "A distance measure between attributed relational graphs for pattern recognition", IEEE Transactions on Systems, Man, and Cybernetics, 13(3), May/June 1983, pp. 353-362.
25. W.-H. Tsai and K.-S. Fu, "Error-correcting isomorphisms of attributed relational graphs for pattern analysis", IEEE Transactions on Systems, Man, and Cybernetics, 9(12), December 1979, pp. 757-768.
26. R. Myers, R. C. Wilson, and E. R. Hancock, "Bayesian graph edit distance", IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6), June 2000, pp. 628-635.
27. A. N. Papadopoulos and Y. Manolopoulos, "Structure-based similarity search with graph histograms", In Proceedings of the 10th International Workshop on Database & Expert Systems Applications, IEEE Computer Society Press, 1998, pp. 174-178.
28. Y.-F. Chen, F. Douglis, H. Huang, and K.-P. Vo, "TopBlend: An efficient implementation of HtmlDiff in Java", In Proceedings of WebNet 2000 - World Conference on the WWW and Internet, San Antonio, TX, November 2000. http://www.research.att.com/~chen/topblend/paper/.
29. F. Douglis and T. Ball, "Tracking and viewing changes on the Web", In Proceedings of the USENIX Annual Technical Conference, San Diego, CA, January 1996, pp. 165-176. http://www.usenix.org/publications/library/proceedings/sd96/douglis.html.
30. J. K. Ousterhout, Tcl and the Tk Toolkit. Addison-Wesley, Reading, MA, 1994.
31. Google, October 2002. http://www.google.com/.
32. H. Schroeder and M. Doyle, Interactive Web Applications with Tcl/Tk, AP Professional, Chestnut Hill, MA, 1998.
CHAPTER 3 WEB STRUCTURE ANALYSIS FOR INFORMATION MINING
Vijjappu Lakshmi,1 Ah-Hwee Tan,2 and Chew-Lim Tan1
1 School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543
Email: vijjappu, [email protected]
URL: http://www.comp.nus.edu.sg/~tancl
2 Laboratories for Information Technology, 21 Heng Mui Keng Terrace, Singapore 119613
[email protected]
Our approach to extracting information from the Web is to analyze the structural content of web pages through exploiting the latent information given by HTML tags. For each specific extraction task, an object model is created consisting of the salient fields to be extracted and the corresponding extraction rules based on a library of HTML parsing functions. We derive extraction rules for both single-slot and multiple-slot extraction tasks which we illustrate through three sample applications.

1. Introduction

Information extraction (IE) as defined by the Message Understanding Conferences (MUC) refers to the task of, given a document, automatically finding the essential details in the text. For example, given a web page on a seminar announcement, an IE system would extract salient information such as topic, speaker, date, venue, and abstract. A key element of IE systems is the set of text extraction rules that identify the relevant information to be extracted. IE systems have been built with different degrees of success on several kinds of text domains. CRYSTAL1 was a learning-based IE system that took parsed annotated sentences and found patterns for extraction in novel sentences. Webfoot2 was an attempt at general IE that processed fragments by looking at Hypertext Markup Language (HTML) tags. SRV3 was another learning architecture for IE. It took a user-defined feature set together with a set of hand-tagged training documents and learned rules for extraction. Craven et al.4
reported that greater accuracy could be achieved by representing each web page as a node in a graph and each hyperlink as an edge. Cardie5 provided a list of problems of learning-based IE, including the difficulty of obtaining enough training data and the lack of corpora annotated with the appropriate semantics and domain-specific supervisory information. The generation of training examples is complicated because the IE process often requires the output of earlier levels of analysis such as tagging and partial parsing. DiPasquo6 argued that there was inherent information in the layout of each page. Marais and Rodeheffer7 described WebL, and then showed its usage by implementing a meta-search engine that combined search results from the AltaVista and HotBot public-search services. WebL was a high level, object-oriented scripting language that incorporated two novel features: service combinators and a markup algebra. The markup algebra extracted structured and unstructured values from pages for computation, and was based on algebraic operations on sets of markup elements. Nahm and Mooney8 described a system called DiscoTEX that combines IE and data mining methods to perform the text mining task of discovering prediction rules from natural-language corpora. Hence by parsing the HTML formatting, one can improve upon traditional text processing. Recently, the use of wrappers for IE from the Web has been popular. A typical wrapper application extracts the data from web pages that are generated based on predefined HTML templates. The systems generate delimiter-based rules that use linguistic constraints. WIEN9 used only delimiters that immediately preceded and followed the actual data. SoftMealy10 was a wrapper induction algorithm that generated extraction rules expressed as finite-state transducers. The World Wide Web Wrapper Factory11 did extraction by using an HTML parser to construct a parse tree following a Document Object Model. In general, HTML tags can help in many tasks involving natural language processing on the Web. In this paper, we consider the more specific problem of exploiting HTML tags for IE from the Web. The motivation of the work stems from a need for a simpler tool to facilitate the writing of extraction rules for our own applications. We adopt an object model approach to extracting information from HTML pages. An object model, consisting of the salient fields of the web pages and their extraction rules based on an HTML parsing library, represents a projection of a user's interests and requirements on a group of web pages. We derive object models for both single-slot extraction, wherein the extracted fields are independent; and multiple-slot extraction, wherein the extracted fields are related. We present object models and experimental results for three sample domains,
namely a news extraction model and a stock quote extraction model based on single-slot extraction rules, and link extraction based on multiple-slot extraction rules. The rest of this paper is organized as follows: Section 2 will present the system architecture, giving details about the HTML parsing library, the object model and the extraction engine. Section 3 will describe the user interface to facilitate the writing of the object models. Three application examples of the object model, namely, news article extraction, link extraction and stock quote extraction, will then be given in Sections 4 to 6. Section 7 concludes the paper.

2. Object Model Architecture

A web page consists of words, numbers, HTML tags, gif images, advertisements etc. People may not be interested in the entire contents of the page. For example, while reading a news article, topics of interest would be the title, the author, the content of the article, the dateline etc. It would be useful if a user was able to specify what's interesting to him on a web page or a group of similar web pages, together with an easy way to extract them. As all web pages are written in HTML, a library of basic HTML parsing functions becomes crucial for this purpose. The user can use these functions to specify what's interesting to him and extract the portions he is interested in. Some general functions could be devised which can be used across web pages of the same category or content. Hence for each task, a set of functions could be formulated and this can be encapsulated into an object model. The object model is not just a query language, but it also aims to capture a host of attributes of the web objects, including syntax and semantics. For example, many financial web sites have stock quotes. One object model would be to extract stock quotes from web sites which offer stock quotes. The model would thus consist of fields like stock name, bid price, ask price, volume, which need to be extracted from the page, and the parsing algorithm to extract them. The parsing algorithm would be described using the library of basic HTML parsing functions. In such a scenario, the HTML parsing functions serve as the building blocks for writing extraction rules for different object models. Specific object models could be formulated for each category. Thus given a web page, the appropriate object model could be plugged in and the desired output could be generated. Thus, the essential elements of the object model architecture would be:

• HTML Parsing Library: A library of HTML parsing functions serves as the basic building block for writing object models. Functions are provided to manipulate attributes of HTML tags, including text, tables, and links. A subset of the single-slot functions is listed in Table 1.
Multi-slot functions would be prefixed by pat_. A simple user interface for writing extraction rules has been developed. A single-slot function is defined as an extraction task which, on identifying the specified pattern, exits searching the web page. On the other hand, a multi-slot function, on identification of the required task/pattern, recursively searches for similar patterns throughout the page.

• Object Model: The library of basic HTML parsing functions can be used to specify what is interesting to a user and extract the portions they are interested in. Some general functions could be devised which can be used across web pages in the same category. For each task, a set of functions can be formulated and this encapsulation of items of interest and the corresponding extraction algorithm forms an object model.

• Extraction Engine: An extraction engine, central to this architecture, is basically a compiler of the HTML parsing functions. A user specifies the URL of the web page and the object model, i.e., the fields of interest and a set of rules written using the HTML parsing functions to extract them. The extraction engine executes the extraction rules and produces the results in extensible Markup Language (XML) format as shown in Fig. 1.
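As a rough illustration of this architecture (ours, not the system's actual code; parse() and the rule functions are placeholders for the HTML parsing library), an object model can be treated as a mapping from field names to extraction rules, which the engine applies and serializes to XML:

from xml.etree.ElementTree import Element, SubElement, tostring

def run_object_model(object_model, page_html, parse):
    # object_model: {field_name: rule}, where each rule is a function taking
    # the parsed page and returning a string (single-slot) or a list of
    # strings (multi-slot).
    doc = parse(page_html)
    root = Element("result")
    for field, rule in object_model.items():
        values = rule(doc)
        for value in (values if isinstance(values, list) else [values]):
            SubElement(root, field).text = value
    return tostring(root, encoding="unicode")

A news object model, for instance, would map fields such as title and dateline to rules built from library functions of the getTagContent() kind.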
Fig. 1. The extraction engine extracts the key information from web pages according to the extraction rules of the object model.

The extraction engine produces the key information in the form of XML from HTML web pages according to the extraction rules of an object model. Pages returned by search engines, stock quotes listed in financial websites and news articles are common web pages. To test our theory, we formulated three object models. These can further be expanded to other types of web pages.

2.1. HTML Parsing Library

A variety of text extraction tasks exist, ranging from extracting the contact email address, to the list of products, to complex sequential patterns from the page. A text extraction task could thus consist of identifying certain text fragments based
on single or multiple criteria to continuously check for patterns. Hence, based on the nature of extraction, IE systems are characterized as single-slot or multi-slot. Single-slot locates and extracts isolated pieces of text from the document; multi-slot is able to relate the information contained in separate extracted facts and to link these facts together in a case frame. Parsing functions have been drafted to accommodate both cases. The HTML parsing library has parsing functions to extract text, lists, tables and links from an HTML document. Since the presentation of an HTML document could also include cues to text extraction, functions to accommodate the presentation (like fonts, colors) have also been included. A sample list of HTML parsing functions is given in Table 1. The full parsing library can be found in the work of V. Lakshmi.12

Table 1: An illustrative list of HTML parsing functions.
HTML tag | Function
Tag | getTagContent(tag, n) | Contents
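A minimal Python analogue of a single-slot function such as getTagContent(tag, n) might look as follows (our approximation, not the library's actual implementation; it is naive with respect to nested tags of the same name):

import re

def get_tag_content(html, tag, n):
    # Return the text inside the n-th occurrence (1-based) of <tag>...</tag>,
    # or None if there are fewer than n occurrences.
    pattern = re.compile(r"<{0}\b[^>]*>(.*?)</{0}>".format(tag),
                         re.IGNORECASE | re.DOTALL)
    matches = pattern.findall(html)
    if len(matches) < n:
        return None
    # Strip any nested markup from the matched fragment.
    return re.sub(r"<[^>]+>", "", matches[n - 1]).strip()

print(get_tag_content("<p>first</p><p>second</p>", "p", 2))   # -> "second"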
Fig. 9. Output of link extraction.

Extraction rules were derived referring to search engines like Yahoo!,16 Google,17 AltaVista,18 and Lycos.19 These extraction rules were then tested on Excite,20 Netscape,21 LookSmart,22 AOL,23 and GO/InfoSeek,24 and a similar level of performance was obtained. Tables 4 and 5 show the system performance in terms of precision and recall for each individual search engine. The lower recall rates for Yahoo! and Google were due to the fact that they extracted some unnecessary links. On the other hand, GO/Infoseek had a poorer recall because some captions were linked to images. An example output of link extraction from Google is given in Fig. 9.

Table 4: Link extraction performance on the reference set.

  | Search engine | Precision | Recall
1 | Yahoo!        | 100%      | 85.33%
2 | AltaVista     | 100%      | 100%
3 | Google        | 100%      | 85.3%
4 | Lycos         | 100%      | 95.33%
Table 5: Link extraction performance on other search engines.

  | Search engine | Precision | Recall
1 | Excite        | 100%      | 100%
2 | Netscape      | 100%      | 100%
3 | LookSmart     | 100%      | 90%
4 | AOL           | 100%      | 100%
5 | GO/Infoseek   | 100%      | 70%
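The multi-slot behaviour used for link extraction can be approximated by a scanner that collects every (caption, URL) pair on the page instead of stopping at the first match (again our own sketch, not the library's pat_-prefixed functions):

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Multi-slot extraction: collect (anchor text, href) for every link.
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

parser = LinkExtractor()
parser.feed('<a href="http://example.com">Example result</a>')
print(parser.links)   # [('Example result', 'http://example.com')]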
6. Stock Quote Extraction

Tables are an extremely powerful Web design tool for laying out tabular data on a page, or to lay out text and graphics on a web page. Tables provide Web designers ways to add vertical and horizontal structure to a page. Tables consist of three basic components: rows (vertical spacing), columns (horizontal spacing) and cells (the container created when a row and column intersect). Most financial web sites have web pages displaying stock quotes in a table. A stock quote is characterized by the fields Stock Name, Ticker Symbol, High, Low, Close, Volume, Previous Close, Bid, Ask, Earning Per Share, Dividend Per Share, P/E Ratio etc. A typical stock quote table is shown in Table 6. The objective of this model would be, given a stock quote attribute, to retrieve its value. Stock quote tables can be arranged in any of the given formats, shown in Fig. 10 with their corresponding HTML layouts.

Table 6: A stock quote table.
Creative 50
Last Done: 39.40 | Change: -0.700 | %Change: -1.84 | Last Traded: 21-Sep-00, 10:29 | Volume: 32,850
Buy Volume: 850 | Bid: 39.3 | Offer: 39.5 | Sell Volume: 2,550
Open: 39.60 | Prev Closed: 39.10 | Intra-Day Range: 39.10-39.60 | 5-Days Range: 39.00-39.70 | Remarks: -
Charts: Intra-Day, 5-Days | News | Company Info | Broker
Fig. 10. Stock quote table arrangements and the corresponding HTML layouts.
Using the HTML parsing functions, an object model is defined to extract stock quotes from financial web sites. This object model can be formulated easily using the HTML parsing functions for tables, namely getTableData(), as shown in Fig. 11.

Stock Quote Object Model
search_url = http://finance.yahoo.com/q?s=MSFT&d=t
$title =
$LastTrade = getTableData(row,1,"Last Trade")
$Change = getTableData(row,1,"% Change")
$Open = getTableData(row,1,"Open")
$High = getTableData(row,1,"High")
$Low = getTableData(row,1,"Low")
$Exchange = getTableData(row,1,"Exchange")
$Yield = getTableData(row,1,"Yield")

Fig. 11. Stock quote object model.

If the stock quote table is to be extracted as a whole instead of the individual elements, getTableData(Table,"Open",0) can be used. The object model listing would then simply be like:

search_url = http://finance.yahoo.com/q?s=MSFT&d=t
$title =
$StockQuote = getTableData(Table,"Open",0)

Six popular financial web sites, namely Yahoo!Finance,25 CNNfn.com,26 Catcha.com,27 Finance! Lycos Asia,28 Fool.com29 and Quote.com30 were used for testing the above object model. The results obtained are tabulated in Table 7. A screen shot of the stock quote output is shown in Fig. 12.

Table 7: Stock quote tables.
S.No | Web site       | Precision | Recall
1    | Quote.com      | 100%      | 100%
2    | Yahoo! Finance | 82.35%    | 100%
3    | CNNfn.com      | 100%      | 73.68%
4    | Catcha.com     | 100%      | 99.33%
5    | Fool.com       | 100%      | 100%
6    | LycosAsia      | 89.6%     | 100%
As we analyze the above results, the data in the LycosAsia website were arranged in three rows and hence could not be retrieved properly. In fact, the
above object model would not work where information is arranged column-wise. However, the object model can be changed to accommodate other multiple-row arrangements or column-wise layouts. As the user finds new or unexpected layouts in some websites, the user can continue to refine the object model to include additional web structural information to improve the retrieval results.
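For illustration, the row-oriented getTableData(row, 1, label) call used above can be approximated in Python as "find the cell whose text equals the label and return the next cell in the same row" (our own sketch with a simplified signature; the whole-table variant getTableData(Table, "Open", 0) and multi-row or column-wise layouts are not covered here):

import re

def get_table_data(html, label):
    # Return the cell value that follows the cell containing `label`
    # in the same table row, or None if the label is not found.
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, re.IGNORECASE | re.DOTALL):
        cells = [re.sub(r"<[^>]+>", "", c).strip()
                 for c in re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row,
                                     re.IGNORECASE | re.DOTALL)]
        for i, cell in enumerate(cells[:-1]):
            if cell.lower() == label.lower():
                return cells[i + 1]
    return None

html = "<table><tr><td>Open</td><td>39.60</td><td>Prev Closed</td><td>39.10</td></tr></table>"
print(get_table_data(html, "Prev Closed"))   # -> "39.10"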
Fig. 12. Output of stock quote extraction from the extraction engine.
7. Conclusion

This paper presents an object model approach to extracting information from HTML pages. Our object model may be considered a simplified version of a wrapper. It serves as an easy-to-use tool for the user to quickly write rules based on the structure of some sample web pages to extract the needed information. While the object model is also delimiter-based, it is geared towards HTML tags and attributes. Furthermore, its looping construct through the use of multi-slot extraction allows recursive search of the relevant information. Building on this architecture, object models for different types of web pages can be formulated, say for company pages, educational institution pages, online shopping pages etc.
A personalized search engine can also be built by integrating all the object models. The last several years have witnessed a variety of innovations in text extraction research. Several new approaches have been developed, including: Hidden Markov models and other statistical techniques for text modelling; active learning and bootstrapping approaches that reduce the required training data; and using boosting to improve the performance of simple text extraction learners. A "unified theory of extraction" is beginning to emerge, in which a family of learning algorithms can cope with a wide variety of text, from natural text, to highly-stylized telegraphic text, to rigidly structured HTML documents. Incorporating semantics and learning algorithms could be the next step in the object model architecture.

Acknowledgment

This project is supported in part by the Agency for Science, Technology and Research (A*STAR) and the Ministry of Education, Singapore, under the joint research grant, R252-000-071-112/303.

References
1. S. Soderland, D. Fisher, J. Aseltine and W. Lehnert, "Crystal: Inducing a conceptual dictionary", Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), 1995, pp. 1314-1319.
2. S. Soderland, "Learning to extract text-based information from the World Wide Web", Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), 1997, pp. 251-254.
3. D. Freitag, "Information extraction from HTML: Application of a general machine learning approach", Proceedings of the 15th Conference on Artificial Intelligence (AAAI-98), 1998, pp. 517-523.
4. M. Craven, S. Slattery, and K. Nigam, "First-Order Learning for Web Mining", Proceedings of the 10th European Conference on Machine Learning, 1998, pp. 250-255.
5. C. Cardie, "Empirical methods in information extraction", AI Magazine, 1997, pp. 55-79.
6. D. DiPasquo, "Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web", Senior Honors Thesis, School of Computer Science, CMU, 1998.
7. H. Marais and T. Rodeheffer, "Automating the Web with WebL", Dr. Dobb's Journal, January 1999.
8. U. Y. Nahm and R. J. Mooney, "Using Information Extraction to Aid the Discovery of Prediction Rules", Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, Boston, MA, August 2000, pp. 51-58.
9. N. Kushmerick, D. Weld, and R. Doorenbos, "Wrapper induction for information extraction", Proceedings of the 15th International Conference on Artificial Intelligence (IJCAI-97), 1997, pp. 729-735.
10. C. Hsu and M. Dung, "Generating finite-state transducers for semi-structured data extraction from the web", Journal of Information Systems, 23(8), 1998, pp. 521-538.
11. F. Azavant and A. Sahuguet, "The World Wide Web Wrapper Factory", http://db.cis.upenn.edu/W4F/doc.html.
12. V. Lakshmi, "Web Structure Analysis for Information Mining", Master of Science Thesis, School of Computing, National University of Singapore, 2001.
13. CNET, http://news.com.com/.
14. ZDNet, http://www.zdnet.com.com/.
15. Straits Times Interactive, http://straitstimes.asia1.com.sg/.
16. Yahoo, http://www.yahoo.com.
17. Google, http://www.google.com.
18. AltaVista, http://www.altavista.com.
19. Lycos, http://www.lycos.com.
20. Excite, http://www.excite.com.
21. Netscape, http://www.netscape.com.
22. LookSmart, http://www.looksmart.com.
23. AOL, http://www.aol.com.
24. Go/InfoSeek, http://infoseek.go.com/.
25. Yahoo!Finance, http://finance.yahoo.com/.
26. CNNfn.com, http://money.cnn.com/.
27. Catcha.com, http://catcha.com.sg.
28. Finance! Lycos Asia, http://www.lycosasia.com.sg.
29. Fool.com, http://www.fool.com.
30. Quote.com, http://www.quote.com.
CHAPTER 4 NATURAL LANGUAGE PROCESSING FOR WEB DOCUMENT ANALYSIS
Manuela Kunze and Dietmar Rösner
Otto-von-Guericke-Universität Magdeburg, Institut für Wissens- und Sprachverarbeitung
P.O. Box 4120, 39016 Magdeburg, Germany
E-mail: makunze, roesner@iws.cs.uni-magdeburg.de
In this chapter we present an approach to the analysis of web documents — and other electronically available document collections — that is based on the combination of XML technology with NLP techniques. A key issue addressed is to offer end-users a collection of highly interoperable and flexible tools for their experiments with document collections. These tools should be easy to use and as robust as possible. XML is chosen as a uniform encoding for all kinds of data: input and output of modules, process information and linguistic resources. This allows effective sharing and reuse of generic solutions for many tasks (e.g., search, presentation, statistics and transformation).
1. Introduction

In this chapter we present an approach to web document analysis based on techniques from Natural Language Processing (NLP). The WWW is a fast growing source of information. It is populated with hyperlinked multimedia information objects. A large proportion of the documents offered on the WWW consist of text or are multimedia documents with a significant proportion of text.a For web document analysis in general, the analysis of text will have to be complemented with the analysis of other WWW media types: image
a 'Document' is undoubtedly a broader term than 'text'. For ease of presentation we will in this paper — unless otherwise noted — use 'document' interchangeably with 'text' or 'text document'. The broader usage will be marked explicitly or should be obvious from context.
analysis, video interpretation, voice processing etc. In addition, cross-media references and the hypermedia structures need proper treatment. It is necessary to be precise about the purpose of a specific system for the analysis of web documents. In other words: what should be achieved with the system? The different tasks that have been identified for traditional collections of documents are in many cases applicable for web documents as well, although sometimes the naming may be different. For example, the services offered by a web search engine are the equivalent of what has more traditionally been called 'information retrieval' (IR). An incomplete list of applications of web document analysis includes among others information retrieval, information extraction (IE), web mining, text classification, summarisation, machine translation, concept learning etc. Each of these applications has its specific demands. Nevertheless, they share many of their subtasks. In addition, many of the subtasks are shared with traditional document processing. For a given complex application in web document analysis we found it fruitful to distinguish its subtasks into the following three categories:

• subtasks that are primarily WWW specific,
• subtasks that are specific to the application,
• subtasks that are relevant to all NLP approaches to text processing.

This chapter has its focus on the last group of subtasks. Since these subtasks are also shared with other applications of document processing, there is a high potential for reuse and resource sharing for these subtasks. At best, complex applications of document processing should be configurable from a common tool box with generic modules and generic resources in combination with only a small number of dedicated application specific modules. WWW specific subtasks can be classified as being part of the preprocessing stage. Preprocessing in this sense comprises all those operations that finally result in a text document in the format expected as input to the linguistic tools. In other words, aspects of the source document that are irrelevant or distracting for linguistic processing will be abstracted away during preprocessing, and the resulting document will be in a canonical format. If source documents are already equipped with appropriate metadata then some subtasks of preprocessing become reduced to looking up the values of metadata attributes. This includes questions such as:
• What is the natural language used in the text?
• What is the domain of the text?
• Is the text intended to be 'stand alone' or is it only a fragment in a hyperlinked structure?

In the future — with the upcoming 'semantic web'1 — more and more WWW pages will carry metadata with them. For now, preprocessing will in many cases include attempts to answer these questions with techniques for language identification, domain classification or hyperlink tracing. The focus of our work has been to develop the architecture and a core set of components and resources and a document suite. Application scenarios and corresponding example corpora have been and still are the driving force in the ongoing development and improvement of the system. To work with different scenarios will help to avoid ad hoc solutions that are not scalable and generalisable. To move to different domains and/or application scenarios also allows to study the methodological issues of how to port and adapt such a system. Given this background, our results are of a methodological nature on the one hand as well as a set of implemented tools on the other hand. In the following, we will try to construct an architecture for a toolbox for NLP-based web document analysis. We will concentrate on the subtasks to be solved and the problems to be tackled. This will be accompanied in many cases by a sketch of the solution chosen in our system. The solution will be explained with examples of data and results of analyses from a variety of domains. Whenever appropriate, we will highlight shortcomings and open problems or discuss the advantages and disadvantages of alternative design decisions. Emphasis will be placed on elaborating both on the algorithms employed as well as on the resources that have to be provided. When one designs modules for NLP tasks there is a variety of aspects to consider:
• language dependence vs. language independence,
• generality vs. specificity,
• ease of adaptation to other domains, sublanguages etc.,
• efficiency of processing and quality of results,
• robustness,
• learning issues.
Human language processing can be seen as an integrated system, which is primarily driven by semantic and pragmatic concerns. Technological
approximations — as discussed here — suffer from principal shortcomings. It is, for instance, an inherent problem in NLP that a strict separation into completely independent stages of processing is hard to achieve. This is due to the fact that in each stage of processing there are, in many cases, decisions to be made that may need information from a later stage of processing or from the context. As a web document we can define all documents that are available via the internet. In addition to structural data (HTML tags) and other objects on the page (e.g., pictures and audio files), the page will in many cases contain text. This text can be titles or paragraphs. For an effective and complete analysis of web documents we therefore need an approach for the extraction of natural language parts of the documents. Natural language analysis of textual parts of web documents is no different from a normal text analysis. The preprocessing process starts with an attempt to get the pure text parts of the documents. Another approach to information extraction is the exploitation of HTML tags, especially when tables play an important role on a webpage (see the work of Habegger2 for analysis based on HTML tags). The following sections present an XML5-based approach for a toolbox with morphological, syntactic and semantic (and some corpus-based) methods. The system described here is called Document Suite XDOC (XML-based tools for DOCument processing) and works on German documents. Currently, we are also experimenting with English resources for XDOC for the task of text mining from English documents.3
2. Design Principles

The Document Suite XDOC was designed and implemented as a workbench for flexible processing of electronically available documents in German.4 Inside XDOC we exploit XML5 and its accompanying formalisms (e.g., XSLT6) and tools (e.g., xt, saxon) as a unifying framework. All modules in the XDOC system expect XML documents as input and deliver their results in XML format. Having a uniform encoding for all input and output data and for all employed resources is advantageous both for the development team as well as for end-users. Generic modules can be reused for all repetitive tasks like selection of substructures, highlighting of search results, flexible transformation into other representations etc. Sharing of expertise across modules in this way does free developer's energy for the 'real' challenges. It is easier
It is also easier to offer end-users a uniform 'look and feel' across modules, based on the uniform encoding.

2.1. Why XML?
XML — and its precursor SGML — offers a formalism for annotating pieces of (natural language) text. To be more precise, if a piece of text is (as a simple first approximation) seen as a sequence of characters (alphabetic and whitespace characters), then XML allows arbitrary markup to be associated with arbitrary subsequences of contiguous characters. Many linguistic units of interest are represented by strings of contiguous characters (e.g., words, phrases, clauses etc.). It is a straightforward idea to use XML to encode information about such a substring of a text interpreted as a meaningful linguistic unit, and to associate this information directly with the occurrence of the unit in the text. The basic idea of annotation is further supported by XML's well-formedness requirement that XML elements have to be properly nested. This is fully concordant with standard linguistic practice: complex structures are made up from simpler structures covering substrings of the full string in a nested way.

With XML we can reuse generic modules for the selection, highlighting and presentation of relevant information. XSL stylesheets can be exploited to allow different presentations of internal data (in different formats such as HTML or RTF) and of processing results for different target groups; for end-users the internals are in many cases not helpful, whereas developers need them for debugging. The advantage of different presentation formats is that users are already acquainted with the functionality of a browser or a common text editor. Furthermore, in the case of a browser we are relatively independent of the operating system.
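As a concrete illustration of this nesting idea, the following sketch builds a small XML annotation for a tokenised phrase with Python's standard xml.etree.ElementTree. The tag names (PP, PRP, DETD, N) follow the conventions used in the examples later in this chapter, but the snippet itself is only an illustration, not part of XDOC.

```python
# Illustrative only: annotate the phrase "in der Mundhoehle" as a nested
# XML structure (a PP containing a preposition, determiner and noun).
import xml.etree.ElementTree as ET

def annotate_pp(tokens_with_tags):
    """tokens_with_tags: list of (word, tag) pairs forming one prepositional phrase."""
    pp = ET.Element("PP")
    for word, tag in tokens_with_tags:
        el = ET.SubElement(pp, tag)
        el.text = word
    return pp

pp = annotate_pp([("in", "PRP"), ("der", "DETD"), ("Mundhoehle", "N")])
print(ET.tostring(pp, encoding="unicode"))
# <PP><PRP>in</PRP><DETD>der</DETD><N>Mundhoehle</N></PP>
```

Because the markup is properly nested, generic XSLT stylesheets or DOM tools can later select, highlight or transform such units without knowing anything about the linguistic modules that produced them.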
2.2. User Orientation

We are interested in tools and techniques for natural language analysis that help humans with their document processing tasks. The end-users of our applications are domain experts (e.g., medical doctors, engineers etc.). They are interested in having their problems solved, but they are typically neither interested nor trained in computational linguistics. Therefore, the barrier they have to overcome before they can use a computational linguistics or text technology system should be as low as possible. This constraint has consequences for the design of the document suite.

The work in the XDOC project is guided by the following design principles, which have been generalised from a number of experiments and
applications with 'realistic' everyday documents (i.e., email messages, abstracts of scientific papers, technical documentation etc.):

• The tools should be usable for 'realistic' documents. One aspect of 'realistic' documents is that they typically contain domain-specific tokens which are not directly covered by classical lexical categories (e.g., noun, verb etc.). Those tokens are nevertheless often essential for the user of the document (e.g., an enzyme descriptor like EC 4.1.1.17 for a biochemist). Further aspects of 'realistic' documents are:
  - they may always contain 'noisy' elements: errors of different types (typos, syntax errors, missing or superfluous words, ...),
  - they may be written in a domain-specific sublanguage with more or less non-standard syntactic preferences,
  - due to the productiveness of the language, lexical and conceptual gaps may occur.
• The tools should be as robust as possible. In general, it cannot be expected that lexicon information is available for all tokens in a document. This is not only the case for most tokens from 'non-lexical' types — like telephone numbers, enzyme names, material codes etc. — even for lexical types there will always be 'lexical gaps'. These may be caused either by neologisms or simply by starting to process documents from a new application domain with a new sublanguage. In the latter case, lexical items will typically be missing in the lexicon ('lexical gap'), and phrasal structures may or may not be adequately covered by the grammar ('syntax gap').
• The tools should be usable independently but should allow for flexible combination and interoperability.
• The tools should not only be usable by developers but by domain experts as well, without linguistic training. Here again XML and XSLT play a major role: XSL stylesheets can be exploited to allow different presentations of internal data and results for different target groups; for end-users the internals are in many cases not helpful, whereas developers need them for debugging.

We currently experiment with XDOC in a number of application scenarios. These are:

• Knowledge acquisition from technical documentation about casting technology, as support for domain experts in the creation of a
domain-specific knowledge base.
• Extraction of company profiles from WWW pages for an effective search for possible suppliers.
• Analysis of autopsy protocols for statistical summaries of typical injuries in traffic accidents.
• Information extraction from Medline abstracts.

2.3. Portability
The effort needed to port a system like XDOC to another domain and/or another type of application is not easy to determine; it obviously depends on the distance or closeness of the new requirements to the functionality already covered and the resources already available. As a rule of thumb, the more general, domain-independent components are easier to port than the domain-specific ones. The following questions should be answered with respect to the document collection to be processed before starting with a new domain and/or application:

• What is the coverage of the lexicon of the morphosyntactic component?
• What sublanguage is used in the corpus? Can existing grammar modules be re-used, or do we need to develop a new rule set for syntactic structures not yet covered?
• To what extent can domain models be re-used? What has to be modeled from scratch?

Some of these questions can be answered with the help of the system itself (e.g., by checking the ratio of unknown tokens to tokens covered by the lexicon). It is crucial that users get as much support as possible for the tasks of resource creation. In order to ease initial domain modeling as well as lexicon creation we have been experimenting with a 'bootstrapping' approach.7

3. Document Suite XDOC

The Document Suite XDOC is divided into several functional modules (cf. the architecture in Fig. 1). Document processing starts with the preprocessing functions, followed by the functions of the syntactic, semantic and corpus-based modules. In the following subsections we describe the separate functions of the modules.

A crucial point: we favor an engineering approach to NLP. It is not our concern whether these tools or techniques are plausible as components of a human reader's processing of a document. We deliberately take into account
Fig. 1. Functional Architecture of the Document Suite XDOC.
that such tools are approximations at best and will have problems and limitations. We accept that the results of tools taken in isolation may be ambiguous (and sometimes even faulty). We try to avoid erroneous results, but we accept ambiguities that are not resolvable within an isolated module. We try to anticipate these cases and to design the other modules in such a way that they can deal with alternatives and overgeneralisation in their input, while trying to resolve ambiguities and to filter erroneous alternatives.

Table 1. Resolving of Ambiguities.

  ambiguity in ...      resolved by ...
  POS tagger            chart parser
  chart parser          case frame analysis
  semantic tagger       case frame analysis
  phrase detector       analysis of phrase detectors' results
In the Document Suite XDOC most ambiguities can be resolved by subsequent analysis (see Table 1). In the phrase detector, the overall interpretation is based on the analysis of all results from the processed document; the classification backed by the majority of phrases inside a document is selected.

3.1. Preprocessing Module
Even if we concentrate on textual web documents, there are a number of formats and encodings that we may encounter.
HTML is very widespread, and XHTML and even XML have an increasing share on the web. But other formats are available as well: DVI, PostScript, PDF, DOC, RTF etc. And we should not forget that we may have plain text (e.g., in emails).

For linguistic processing, we are primarily interested in the sequence of characters constituting the natural language content of the document (i.e., the plain text). The process of mapping from an arbitrary format of the document source to the input representation for linguistic processing may be conceptualized either as 'stripping off' non-text (i.e., commands, markup etc.) or as extracting text. In either case, this process has to preserve the information from the source that is relevant for further processing. The information to preserve or 'carry over' concerns issues like:

• What is a unit of linguistic relevance?
• Where are the boundaries between units?
• What is the function of those boundaries?

The result of this preprocessing step should preserve and make explicit (e.g., as markup) this structural information in addition to the linguistic content (i.e., the character sequence of the plain text). This requirement is independent of the specific format of the source.
Fig. 2. Preprocessing: Identification of Text Units.
In the preprocessing module, all functions for low-level (token-based) text processing are combined. The other modules are based on the results of these functions. The functions are:

• text unit identification (for HTML: the HTML Cleaner),
• the Structure Tagger,
• the Part-of-Speech (POS) Tagger,
• the Sentence Splitter.
3.1.1. HTML Cleaner

The analysis of web pages is essential for applications such as the automatic generation of company profiles from web pages (see the work of Krotzsch8) or the creation of product taxonomies through web mining.9 For such tasks it is necessary to extract the plain textual parts of a web page. The analysis of HTML tags is relevant when we need information about the structure of the documents, such as titles or tables. In addition, we can extract information from tables with linguistic content by applying NLP analysis to the cells' content.

3.1.2. Structure Tagger

We also accept raw ASCII text without any markup as input. Raw ASCII may be the original source, or it may be the output of converters like 12a, pdf2text etc. In such cases, structure detection tries to uncover linguistic units (e.g., sentences, titles etc.) as candidates for further analysis. A major subtask is to identify the role of punctuation characters.

If the structure of a text document is explicitly available, this may be exploited by subsequent linguistic processing. For example, a unit classified as a title or subtitle will be accepted as a noun phrase, whereas within a paragraph full sentences are expected.

In realistic text, even the detection of possible sentence boundaries needs some care. A period character may not only be used as a full stop, but may as well be part of an abbreviation (e.g., "z.B." — engl.: "e.g." — or "Dr."), be contained in a number (e.g., 3.14), or be used in an email address or in domain-specific tokens. The resources employed are special lexica (e.g., for abbreviations) and finite automata for the reliable detection of tokens from specialized non-lexical categories (e.g., enzyme names, material codes etc.). These resources are used primarily to distinguish between those full stop characters that function as sentence delimiters (tagged with IP) and those that are part of abbreviations or of domain-specific tokens. The information about the function of strings that include a period is tagged in the result (e.g., abbreviation — tagged with ABBR).
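To make the role of these resources concrete, here is a minimal sketch of period disambiguation. The abbreviation list and the regular expressions are illustrative assumptions, not XDOC's actual lexica or automata.

```python
import re

# Illustrative abbreviation lexicon and token patterns (not XDOC's real resources).
ABBREVIATIONS = {"z.B.", "Dr.", "bzw.", "u.a."}
NUMBER = re.compile(r"^\d+\.\d+$")                 # e.g. 3.14
DOMAIN_TOKEN = re.compile(r"^EC(\s|\.)?[\d.]+$")   # e.g. an enzyme descriptor like EC 4.1.1.17

def classify_period_token(token):
    """Return a tag for a token that contains a period."""
    if token in ABBREVIATIONS:
        return "ABBR"              # part of an abbreviation
    if NUMBER.match(token) or DOMAIN_TOKEN.match(token):
        return "TOKEN"             # period belongs to a number or a domain-specific token
    if token.endswith("."):
        return "IP"                # sentence-delimiting full stop
    return "OTHER"

for t in ["z.B.", "3.14", "Blutungen."]:
    print(t, classify_period_token(t))
```

In the real system the corresponding decisions are written back into the XML result, so that later modules can rely on the IP/ABBR distinction instead of re-analysing punctuation.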
3.1.3. POS Tagger

The assignment of part-of-speech information to a token — POS tagging for short — is not only a preparatory step for parsing. The information gained about a document by POS tagging and evaluating its results is valuable in its own right: the ratio of tokens not classifiable by the POS tagger to classified tokens is a direct indication of the degree of lexical coverage.

In principle, a number of approaches are usable for POS tagging (e.g., the work of Brill10). We decided to avoid approaches based on (supervised) learning from tagged corpora, since the costs of creating the necessary training data are likely to be prohibitive for our users (especially in specialized sublanguages). The approach chosen was to make the best use of available resources for German and to enhance them with additional functionality. The tool chosen (MORPHIX11) is not only used in POS tagging but serves as a general morpho-syntactic component for German. The resources employed in XDOC's POS tagger are:

• the lexicon and the inflectional analysis from the morpho-syntactic component MORPHIX, and
• a number of heuristics (e.g., for the classification of tokens not covered in the lexicon).

For German, the morphology component MORPHIX has been developed in a number of projects and is available in different realizations. This component has the advantage that the closed-class lexical items of German — e.g., determiners, prepositions, pronouns etc. — as well as all irregular verbs are fully covered. The coverage of open-class lexical items depends on the amount of lexical coding. Paradigms, such as those for verb conjugation and noun declination, are fully covered; to be able to analyze and generate word forms, however, their roots and their classification need to be included in the MORPHIX lexicon.

We exploit MORPHIX — in addition to its role in syntactic parsing — for POS tagging as well.
If a token in a German text can be morphologically analysed with MORPHIX, the resulting word class categorisation is used as POS information. Note that this classification need not be unique. Since the tokens are analysed in isolation, we often get ambiguous results with multiple analyses. Some examples: the token "der" may either be a determiner (with a number of different combinations of the features case, number and gender) or a relative pronoun; the token "liebe" may be either a verb or an adjective (again with different feature combinations not relevant for POS tagging).

Since we do not expect extensive lexicon coding at the beginning of an XDOC application, some tokens will not receive a MORPHIX analysis. We then employ two techniques. We first try to make use of heuristics that are based on aspects of the tokens that can easily be detected with simple string analysis (e.g., upper-/lowercase, endings etc.) and/or exploitation of the token position relative to sentence boundaries (detected by the structure detection module). If a heuristic yields a classification, the resulting POS class is added together with the name of the employed heuristic (marked as the feature SRC, which stands for source; cf. Example 1). If no heuristic is applicable, we classify the token as a member of the class unknown (tagged with XXX).

Example 1: Unknown Tokens Classified as Noun with Heuristics.
<N SRC="...">Blutanhaftungen</N> an der <N SRC="...">Gekroesewurzel</N>
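A minimal sketch of such fallback heuristics is given below; the heuristic names, the capitalization rule and the adjective endings are illustrative assumptions rather than XDOC's actual rule set.

```python
# Illustrative fallback heuristics for tokens without a MORPHIX analysis.
# Returns (POS class, heuristic name), or ("XXX", None) if nothing applies.
ADJ_ENDINGS = ("ig", "isch", "lich")   # assumed endings, for illustration only

def classify_unknown(token, sentence_initial=False):
    if token[:1].isupper() and not sentence_initial:
        # Capitalised tokens not at sentence start are likely German nouns.
        return "N", "uppercase-heuristic"
    if token.lower().endswith(ADJ_ENDINGS):
        return "ADJ", "ending-heuristic"
    return "XXX", None

print(classify_unknown("Gekroesewurzel"))      # ('N', 'uppercase-heuristic')
print(classify_unknown("ungehoerig"))          # ('ADJ', 'ending-heuristic')
print(classify_unknown("erfolgte", True))      # ('XXX', None)
```

The name of the triggering heuristic would then be recorded in the SRC feature of the resulting annotation, exactly as in Example 1 above.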
To keep the POS tagger fast and simple, the disambiguation between multiple POS classes for a token and the derivation of a possible POS class from context for an unknown token are postponed until syntactic processing. This is in line with our general principle of accepting results with overgeneralisation when a module is applied in isolation (here: POS tagging) and relying on filtering ambiguous results at a later stage of processing (here: exploiting the syntactic context).

3.1.4. Sentence Splitter

The functions of the syntactic parser, the phrase detector and the analyses of the semantic module take sentences as input. The sentence splitter
divides the text to be analysed into a sequence of sentences.

3.2. Syntactic Module
The input data for this module are sentences annotated with POS tags. The module contains methods based on a bottom-up parser; its functions are a syntactic parser and a phrase detector. The basis of these functions is a chart parsing machinery, which works with different rule sets to realize the different functions.

3.2.1. Syntactic Parser

For syntactic parsing we apply a chart parser based on context-free grammar rules augmented with feature structures. Robustness is achieved by allowing as input elements:12

• multiple POS classes,
• unclassified tokens from open word classes, and
• tokens with a POS class, but without or with only partial feature information.

The last case results from heuristics in POS tagging that allow us to assume, for instance, the class noun for a token but do not suffice to detect its full paradigm from the token (note that there are about two dozen different morphosyntactic paradigms for noun declination in German).

For a given input, the parser attempts to find all complete analyses that cover the input. If no such complete analysis is achievable, it tries to combine maximal partial results ('chunks') into structures covering the whole input.

A successful analysis may be based on an assumption about the word class of an initially unclassified token (tagged XXX). This is indicated in the parsing result (feature AS) and can be exploited for learning such classifications from contextual constraints. In a similar way, the successful combination of known feature values from closed-class items (e.g., determiners, prepositions) with underspecified features (written as "_" in attribute-value pairs) in agreement constraints allows the determination of paradigm information from successfully processed occurrences. See Example 2: in the input string, the word "Mundhoehle" (oral cavity) was not available in the lexicon. Because it is uppercase and not at sentence-initial position, it is heuristically classified as a noun. The successful parse of the prepositional phrase (PP) "in der Mundhoehle" allows the derivation of features from
the determiner within the PP (e.g., the lexical gender feminine for "Mundhoehle").

Example 2: Unknown Token Classified as Adjective and Lexical Features Derived Through Contextual Constraints.
kein <XXX AS="ADJ">ungehoeriger</XXX> Inhalt in der Mundhoehle
The processing time grows with the number of rules. Therefore, the grammar used in syntactic parsing is organized in a modular way that allows the addition or removal of groups of rules. This ability to tailor the rule set is also exploited when the sublanguage of a domain contains linguistic structures that are unusual or even ungrammatical in standard German.

3.2.2. Phrase Detector

There are applications where a full syntactic analysis is not necessary. Sometimes it is sufficient to be able to quickly detect key phrases in a document. If the phrases are totally fixed, a string matcher is sufficient. Realistically, however, there will always be some variability in phrasal patterns, so a more flexible phrase matcher is needed. See, for example, the simple phrase "zahlreiche Blutungen in der Gesichtshaut." (English: numerous bleedings in the skin of the face). The word "zahlreiche" can, for instance, be replaced by the word "massenhaft" (English: large quantities of). This can be expressed by the following generalized pattern:

MP-SI: HIGH-QUANTITY N["blutungen"] PRP["in"] DETD N["gesichtshaut"] IP
HIGH-QUANTITY: ADJ["zahlreich"] | ADV["massenhaft"]
The corresponding rule (in XML coding) for the chart parser is defined in Example 3.

Example 3: Excerpt from the 'Pattern Grammar'.
<ELEMENT>HIGH-QUANTITY</ELEMENT>
<ELEMENT CONT="blutungen">N</ELEMENT>
<ELEMENT CONT="in">PRP</ELEMENT>
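The sketch below shows one way such a generalized pattern can be matched against a POS-tagged token sequence. The pattern encoding and the matching routine are illustrative assumptions and not XDOC's chart-parser implementation.

```python
# Illustrative matcher for a generalized phrase pattern over (word, POS) pairs.
# Each pattern element maps allowed POS tags to a required word stem (None = any word).
PATTERN_MP_SI = [
    {"ADJ": "zahlreich", "ADV": "massenhaft"},  # HIGH-QUANTITY alternatives
    {"N": "blutungen"},
    {"PRP": "in"},
    {"DETD": None},                             # any determiner
    {"N": "gesichtshaut"},
    {"IP": None},                               # sentence-final full stop
]

def element_matches(element, word, pos):
    """A token matches if its POS tag is allowed and the word constraint (if any) fits."""
    if pos not in element:
        return False
    required = element[pos]
    return required is None or word.lower().startswith(required)

def match_pattern(pattern, tagged_tokens):
    """Return True if the pattern matches the tagged sentence from its start."""
    if len(tagged_tokens) < len(pattern):
        return False
    return all(element_matches(el, w, p)
               for el, (w, p) in zip(pattern, tagged_tokens))

sentence = [("zahlreiche", "ADJ"), ("Blutungen", "N"), ("in", "PRP"),
            ("der", "DETD"), ("Gesichtshaut", "N"), (".", "IP")]
print(match_pattern(PATTERN_MP_SI, sentence))   # True
```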
The chart parser annotates recognized structures in the document.

3.3. Corpus-Based Module
Currently this module contains one method: the analysis of collocations. In XDOC we not only count co-occurrence data for pairs of simple tokens but also evaluate pairwise occurrences of phrases with phrases or tokens. We analyse which combinations of noun phrases with adjectives, verbs and prepositional phrases are possible in a domain. Co-occurrence data are exploited for the creation of an initial ontology.7
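A minimal sketch of such pairwise counting is shown below; the notion of a 'phrase' is reduced here to pre-extracted strings, which is an illustrative simplification of the corpus-based module.

```python
from collections import Counter
from itertools import combinations

def collocation_counts(sentences):
    """Count how often two units (phrases or tokens) occur in the same sentence.
    `sentences` is a list of lists of already-extracted units."""
    counts = Counter()
    for units in sentences:
        for a, b in combinations(sorted(set(units)), 2):
            counts[(a, b)] += 1
    return counts

corpus = [
    ["Leber", "dunkelrot"],
    ["Leber", "vergroessert"],
    ["Leber", "dunkelrot"],
]
for pair, n in collocation_counts(corpus).most_common(2):
    print(pair, n)
# ('Leber', 'dunkelrot') 2
# ('Leber', 'vergroessert') 1
```

Frequent pairs of this kind are the raw material from which candidate concepts and relations for an initial domain ontology can be proposed.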
3.4. Semantic Module
The semantic module contains methods that are necessary for the semantic interpretation and understanding of natural language texts. In the following, we describe a semantic tagger and a case frame analysis which combines isolated meanings of words into a complex concept structure. Finally, we describe a method for the semantic interpretation of specific syntactic structures.

3.4.1. Semantic Tagger

Most content words in natural language have a number of different readings. The word 'paper' may — among others — refer to a 'publication' or to a 'piece of matter'. A semantic tagger tries to classify content words into their semantic categories (different applications may have different organisations of those categories into taxonomies or ontologies). For this function we expect as input a text tagged with POS tags, and we apply a semantic lexicon. For each token, this semantic lexicon contains a semantic interpretation and a case frame with syntactic valency requirements. Similar to POS tagging, the tokens in the input are annotated with their meanings and a classification into semantic categories such as concepts and relations. Again, it is possible that the classification of a token in isolation is not unique.
In analogy to the POS tagger, a semantic tagger that processes isolated tokens is not able to disambiguate between multiple semantic categorisations. This task is postponed to contextual processing within case frame analysis (cf. below). Example 4 shows the result of the semantic tagging of the sentence Als Folge des Unfalles erfolgte die Herausnahme einer Niere (in English: in consequence of the accident, one kidney was removed). In this case Als and Folge are considered ambiguous. The word form Unfalles is unknown, but we can recognize it as a concept using the information from the POS tagger.

Example 4: Result of the Semantic Tagger.
<OPERATOR TYPE="temporal-gleichzeitig temporal-vorzeitig temporal-nachzeitig komparativ spezifizierend kausal restriktiv">Als</OPERATOR> <CONCEPT TYPE="sequence result">Folge</CONCEPT> des Unfalles erfolgte die Herausnahme <XXX>einer</XXX> Niere.
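A minimal sketch of this lexicon-driven tagging step is given below; the lexicon entries and category names are illustrative assumptions, not the contents of XDOC's semantic lexicon.

```python
# Illustrative semantic lexicon: word form -> list of possible semantic categories.
SEMANTIC_LEXICON = {
    "Folge":       [("CONCEPT", "sequence"), ("CONCEPT", "result")],
    "Herausnahme": [("CONCEPT", "removal")],
    "als":         [("OPERATOR", "temporal"), ("OPERATOR", "kausal")],
}

def semantic_tag(tokens_with_pos):
    """Annotate each token with all candidate semantic categories.
    Unknown nouns are still emitted as concepts; everything else stays untagged (XXX)."""
    result = []
    for word, pos in tokens_with_pos:
        readings = SEMANTIC_LEXICON.get(word) or SEMANTIC_LEXICON.get(word.lower())
        if readings:
            result.append((word, readings))
        elif pos == "N":
            result.append((word, [("CONCEPT", "unknown")]))  # fall back on POS information
        else:
            result.append((word, [("XXX", None)]))
    return result

for word, readings in semantic_tag([("Als", "PRP"), ("Folge", "N"), ("Unfalles", "N")]):
    print(word, readings)
```

As in the real module, ambiguous readings are simply kept side by side and handed on to case frame analysis for resolution.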
Semantic resources are in general more domain-dependent than syntactic ones. Like many other projects, we tried to make use of WordNet (in its German version GermaNet13) as a basis for the semantic lexicon. This proved to be difficult: the categories covered in GermaNet are too abstract to be directly usable in applications. For now, domain-specific semantic lexica have to be encoded manually. This is not a serious problem for experimental prototypes, but for large-scale applications support for the creation of these resources is of vital interest. A planned but not yet worked out approach is to exploit learning techniques to automatically extract subcategorisation information from the results of robust noun phrase and prepositional phrase chunking. Subcategorisation frames for verbs (i.e., capturing what complements a verb takes) are the basis for the subsequent abstraction of case frames.
3.4.2. Case Frame Analysis

Case frame analysis of a token uncovers details about the type of recognized concepts (resolving multiple interpretations) and their possible relations to other concepts. The detection of complex concepts by case frame analysis takes as input the results of syntactic analysis (chart parser) and semantic tagging. The resources are case frames coded in the semantic lexicon; the case frames express the semantic and syntactic constraints on a potential filler of a relation.

Example 5: Example of an Instantiated Case Frame.
<WORD>Herausnahme <SLOTS> BONE N(gen, fak) <ASSIGN_TO> ORGAN N(gen, fak) einer Niere <WORD>Niere
The output is instantiated case frames with roles filled by those syntactic phrases that are consistent with both the semantic and the syntactic constraints. The results are again annotated with XML tags. Example 5 is a partial result of the analysis of the example from the semantic tagger: the concept parser has assigned the phrase "einer Niere" as the filler of the relation 'patient', which requires a concept with the category 'organ' (element ASSIGN_TO) and a syntactic noun phrase in the genitive case as filler (element FORM).

3.4.3. Semantic Interpretation of Syntactic Structure

Another step in the analysis of the relations between tokens can be the interpretation of the syntactic structure of a phrase or sentence. We exploit the syntactic structure of the sublanguage to extract the relations between several tokens. The semantic structure analysis expects as input the results from the chart parser and from semantic tagging. The results are relations between concepts, which are derived from the syntactic structure of phrases or whole sentences. The resource for this analysis is a lexicon with information about the semantic interpretation associated with rules from the context-free grammar used by the chart parser. For example, syntactic structures with nouns and adjectives as constituents typically denote attribute-value structures, which can be expressed as a has-attribute relation; the name of the recognized relation is composed of 'has-' and the meaning of the adjective.

Example 6: Excerpt from Lexicon for Semantic Structure Analysis.
Example 6 shows the entry in the lexicon for the interpretation of adjective-noun structures. The element SYNTACTIC contains the names of the rules in the context-free grammar which describe adjective-noun structures. The general name of a relation is defined with the tag TYPE, and the element FILLER contains the POS tag of the filler of the complete relation name — in this case, the semantic interpretation of the adjective token inside the structure. Consider, for example, a typical phrase from an autopsy report: Leber dunkelrot (in English: liver dark red). From semantic tagging we obtain the following information:

Example 7: Results of Semantic Tagging.
Leber dunkelrot
In this example we can extract the relation 'has-color' between the tokens Leber and dunkelrot. This is an example of a simple semantic relation. Other semantic relations can be described through more complex variations. In these cases, we must consider linguistic structures like modifiers (e.g., etwas), negations (e.g., nicht), coordinations (e.g., Beckengeruest unversehrt und fest gefuegt) and complex noun groups (e.g., Bauchteil der grossen Koerperschlagader).

4. Related Work

In GATE14 the idea of piping simple modules in order to achieve complex functionality was applied to NLP with such a rigid architecture for the first time. The project LT XML15 pioneered XML as a data format for linguistic processing. Both GATE and LT XML were employed for processing English texts. SMES16 was an attempt to develop a toolbox for message extraction from German texts. A disadvantage of SMES that is avoided in XDOC is the lack of a uniform encoding formalism; in other words, users are confronted with different encodings and formats in each module.
5. Conclusion

We have presented an approach to the analysis of web documents — and other electronically available document collections — that is based on the combination of XML technology with NLP techniques. The end-users of the XDOC toolbox are domain experts (e.g., engineers, medical doctors, economists, etc.) who are naive with regard to NLP. For these users, the barrier to overcome before being able to run experiments on their document collections and to get initial results should be as low as possible. Therefore, all modules of XDOC have been designed to be as robust as possible. In the case of gaps in resources (e.g., lexical gaps, missing syntactic structures etc.) the system is geared to exhibit graceful degradation.

We have reported on work in progress. An up-to-date version of XDOC is available for on-line experimentation via http://lima.cs.uni-magdeburg.de:8000. The key issue for the ongoing work is to fully exploit the inherent potential for learning as a side effect of processing large quantities of related documents. Learning will be especially important for the incremental update of XDOC's resources. Initial experiments with corpus-based methods have been promising:

• In combination with results from syntactic analysis (cf. Example 1), detailed paradigm information for entries in the morphosyntactic lexicon can be deduced from varying occurrences of unknown tokens.
• When sentence candidates cannot be given a complete syntactic analysis, the coverings produced from completed constituents (cf. Section 3.2.1) may be exploited for suggesting new grammar rules.
• Abstraction of case frames from multiple occurrences of verbs and their complements may ease the construction of domain-specific semantic lexica.
References

1. World Wide Web Consortium (ed.), "Semantic Web Activity Overview", Boston, USA, URL: http://www.w3.org/2001/sw, 2001.
2. B. Habegger and M. Quafafou, "Multi-Pattern Wrappers for Relation Extraction from the Web", in ECAI 2002, Proceedings of the 15th European Conference on Artificial Intelligence, F. van Harmelen (ed.), IOS Press, Amsterdam, 2002, pp. 395-399.
3. M. Kunze and C. Xiao, "An Approach for Resource Sharing in Multilingual NLP", in T. Vidal and P. Liberatore (eds.): Proceedings of the STarting Artificial Intelligence Researchers Symposium STAIRS 2002, IOS Press, Amsterdam, 2002, pp. 123-124.
4. D. Rosner and M. Kunze, "An XML Based Document Suite", in Proceedings of the 2002 International Conference on Computational Linguistics, Taipei, 2002, pp. 1278-1282.
5. T. Bray, J. Paoli, C. M. Sperberg-McQueen and E. Maler (eds.), "Extensible Markup Language (XML) 1.0 (Second Edition)", World Wide Web Consortium, W3C Recommendation, Boston, USA, URL: http://www.w3.org/TR/2000/REC-xml-20001006, 2000.
6. J. Clark, "XSL Transformations (XSLT) Version 1.0", World Wide Web Consortium, W3C Recommendation, 1999.
7. D. Rosner and M. Kunze, "Exploiting Sublanguage and Domain Characteristics in a Bootstrapping Approach to Lexicon and Ontology Creation", in Proceedings of OntoLex 2002 - Ontologies and Lexical Knowledge Bases at LREC 2002, Las Palmas, 2002, pp. 68-73.
8. S. Krotzsch and D. Rosner, "Towards Extraction of Company Profiles from Webpages", in Proceedings of the 2nd International Workshop on Databases, Documents, and Information Fusion, Karlsruhe, 2002, pp. 29-36.
9. W. Buntine and H. Tirri, "Multi-Faceted Learning for Web Taxonomies", in Proceedings of the 2nd Workshop on Semantic Web Mining, Helsinki, 2002, pp. 52-60.
10. E. Brill, "A Simple Rule-Based Part-of-Speech Tagger", in Proceedings of the Third Conference on Applied Natural Language Processing, 1992, pp. 152-155.
11. W. Finkler and G. Neumann, "A Fast Realization of a Classification-Based Approach to Morphology", in Proceedings of the 4. Österreichische Artificial-Intelligence-Tagung, Wiener Workshop Wissensbasierte Sprachverarbeitung, H. Trost (ed.), Springer Verlag, Berlin, 1988, pp. 11-19.
12. D. Rosner, "Combining Robust Parsing and Lexical Acquisition in the XDOC System", in Proceedings of KONVENS 2000: 5. Konferenz zur Verarbeitung natürlicher Sprache, 2000, pp. 75-80.
13. C. Kunze and L. Lemnitzer, "GermaNet - representation, visualization, application", in Proceedings of LREC 2002, Las Palmas, 2002, pp. 1485-1491.
14. H. Cunningham, "GATE - A General Architecture for Text Engineering", in Computers and the Humanities, Vol. 36, 2002, pp. 223-254.
15. Language Technology Group (LTG), "LT XML version 1.1", http://www.ltg.ed.ac.uk/software/xml/, 1999.
16. G. Neumann, R. Backofen, J. Baur, M. Becker and C. Braun, "An Information Extraction Core System for Real World German Text Processing", in Proceedings of the 5th Conference on Applied Natural Language Processing, Washington, 1997, pp. 208-215.
Part II. Document Analysis for Adaptive Content Delivery
CHAPTER 5
REFLOWABLE DOCUMENT IMAGES
Thomas M. Breuel, William C. Janssen, Kris Popat, and Henry S. Baird
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304, U.S.A.
E-mail: {breuel, janssen, popat, baird}@parc.com

Millions of documents on the Internet exist in page- or image-oriented formats like PDF. Such documents are currently difficult to read on-screen and on handheld devices. This paper describes a system for the automatic analysis of a document image into atomic fragments (e.g. word images) that can be reconstructed or "reflowed" onto a display device of arbitrary size, depth, and aspect ratio. This allows scans and other page-image documents to be viewed effectively on a limited-resolution screen or hand-held computing device, without any errors and losses due to OCR and retypesetting. The methods of image analysis and representation are described.

1. Introduction

Millions of documents on the Internet exist in page- or image-oriented document formats like PDF, PostScript, TIFF, JBIG2, or DjVu.6 While such documents are designed to be read easily when printed on paper, they do not come with the layout information that would allow them to be reflowed to adapt to the smaller window sizes found on handhelds and in web browsers. This chapter, an expanded version of the paper presented at ICPR,4 describes ongoing work on developing document image representations that combine the fidelity and simplicity of image-oriented document formats with the versatility of structured documents.

Perhaps surprisingly, page-oriented formats are not just being used for scanned legacy documents, but continue to be widely used for publishing documents on-line. There are a number of reasons for this. Many documents are still primarily generated in word processing or typesetting
systems for the purpose of printing. While word processor and typesetting inputs are reflowable, they are generally too complex to be used as document interchange formats, and they depend on complex, often proprietary, software systems for viewing. They often also contain sensitive information, like revision and author tags. And they often depend on fonts and other resources for proper display. For web publication, authors generally convert them into a portable, image-oriented format like PDF or PostScript. In contrast, generating formats like PDF or PostScript is very easy, since word processors already need to contain PostScript drivers for printers. Furthermore, pagination and line breaks are important (and often required) in many legal and academic contexts for cross-referencing. And viewers and decoders for page-oriented formats are simpler and widely available.

Unfortunately, page-oriented document formats can be hard to read on-screen (see Fig. 1) or on handheld devices (see Fig. 2). Without the ability to reflow documents during display, reading such documents on a display that is smaller than page size either requires scaling down the image until the text becomes unreadably small, or scrolling around on the page while reading lines or columns. This is not a question of better display technology, but simply a result of the form factor constraints that exist on many physical devices.

HTML, SGML, and XML would seem to offer a third option, combining a simple, open format with the ability to reflow. But these formats suffer from many of the same problems as other structured document formats: they require complex software systems to be rendered, they often depend on proprietary features, and they cannot guarantee that text retains its intended appearance.

What we would like is a system that gives control over generating reflowable documents to the recipient of a document. The obvious thing to try would be to convert the document into symbolic form by performing optical character recognition (OCR), which generates structured and reflowable documents from bitmapped images. However, OCR systems have difficulties with many important kinds of documents, like those containing chemical or mathematical formulas, as well as special fonts. The user of existing OCR systems can never be quite certain that the text they are seeing is accurate. OCR systems also do not retain the exact appearance of text, often substituting fonts. Many of these problems remain even if the OCR system represents difficult-to-recognize characters as bitmaps, as has been proposed by Hong and Srihari.7
Fig. 1. This screenshot illustrates a common problem when trying to read a PDF document in a browser: reading the entire page requires repeated scrolling up and down.
Our approach combines the ability of OCR systems to reflow documents with the fidelity and simplicity of image-based capture and representation. Essentially, we represent documents as a large collection of small image elements. Each element represents a character, word, image, table, or figure. The image elements are associated with layout control information, allowing them to reflow like words in a regular word processing document.

In the rest of this paper, we first describe the image layout analysis we are currently using for creating reflowable image representations. We then discuss issues of representing reflowable content and displaying it using a variety of existing and new display applications and devices.

2. Image and Layout Analysis

Image and layout analysis transforms the raw document image into reflowable components. For general background information on document image and layout analysis, the reader is referred to the handbook edited by Bunke and Wang.5 The techniques used in our system are described in more detail
Fig. 2. Screen shots from attempts to read a document on a handheld using DjVu or Adobe PDF. On the left-hand side, the document can be read in its entirety (the width of the page fits the screen width), but the characters are too small for reading. The middle screenshot shows a DjVu document where the characters are readable (if small), but reading the document requires horizontal scrolling for each line. The right-hand screenshot shows the same document displayed in a PDF viewer. Since the font used for the document was not included in the PDF file and was also unavailable on the handheld, the PDF display program performed font substitution, further degrading the quality of the document image.
in a recent paper by Breuel.3

2.1. Text/Image Segmentation
Line art, diagrams, and other images need to be treated separately from the textual content of a document: image processing, rescaling, and reflowing are different for images and for textual components. It is also important to retain any textual components of an image (labels, annotations) as part of that image, rather than reflowing them with the rest of the text.

There are two stages during which images can be detected. First, grayscale or color images are best detected on the unprocessed input image. This can be done using standard document image segmentation mechanisms, although such mechanisms have not yet been incorporated into the current system. Users can also manually input bounding rectangles for images into the system. Second, images like illustrations and diagrams are also detected as part of connected component analysis and layout analysis. This will be described below.

2.2. Preprocessing
Image analysis begins with adaptive thresholding and binarization. For each pixel, we determine the maximum and minimum values within a region around the pixel using grayscale morphology. If the difference between these two values is smaller than a threshold (determined statistically), the region is judged to contain only white pixels. (If the difference is small in a uniform black image region, it will be erroneously classified as white, but this will happen rarely, as the region size is chosen to be larger than the vast majority of black marks.) If the difference is above the threshold, the region contains both black and white pixels, and the minimum and maximum values represent the black ink and white paper background values, respectively. In the first case, the pixel value is normalized by bringing the estimated white level up to the nominal white level of the display. In the second case, the pixel value is normalized by expanding the range between the estimated white and black levels to the full range between the nominal white and black levels of the display. After this normalization process, a standard thresholding method can be applied.

In the thresholded image, connected components are labeled using a scan algorithm combined with an efficient union-find data structure. Then, a bounding box is determined for each connected component.
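The following sketch shows the flavor of the normalization step; the window size and the statistically determined threshold are illustrative assumptions, and scipy's rank filters stand in for the grayscale morphology described above.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def adaptive_binarize(gray, window=31, contrast_threshold=40):
    """gray: 2-D uint8 array (0 = black, 255 = white). Returns a boolean ink mask."""
    gray = gray.astype(np.float32)
    local_max = maximum_filter(gray, size=window)   # estimate of the paper background
    local_min = minimum_filter(gray, size=window)   # estimate of the ink level
    contrast = local_max - local_min

    normalized = np.full_like(gray, 255.0)          # low-contrast regions become white
    mixed = contrast >= contrast_threshold          # regions containing both ink and paper
    span = np.maximum(contrast[mixed], 1.0)
    normalized[mixed] = (gray[mixed] - local_min[mixed]) / span * 255.0

    return normalized < 128                         # standard global threshold on the result
```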
This results in a collection of usually several thousand connected components per page. Each connected component may represent a single character, a character part, a collection of touching characters, background noise, or part of a line drawing or image. The bounding boxes of these connected components are the basis of the subsequent layout analysis.
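For illustration, the sketch below obtains labeled components and their bounding boxes with scipy, which stands in for the scan/union-find labeling described above; it is not the system's actual implementation.

```python
from scipy.ndimage import label, find_objects

def connected_component_boxes(ink_mask):
    """ink_mask: boolean array from binarization. Returns (x0, y0, x1, y1) boxes."""
    labels, n = label(ink_mask)         # 8-connectivity can be chosen via a structure argument
    boxes = []
    for sl in find_objects(labels):     # one (row slice, column slice) pair per component
        if sl is None:
            continue
        rows, cols = sl
        boxes.append((cols.start, rows.start, cols.stop, rows.stop))
    return boxes
```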
2.3. Layout Analysis
For layout analysis, we are primarily interested in the bounding boxes corresponding to characters in the running text of the document, as well as in a few other page elements like headers, footers, and section headings. We are interested in these particular bounding boxes because they give us important information about the layout of the page that we need for reflowing it. In particular, these bounding boxes and their spatial arrangement can tell us the page rotation and skew, where the column boundaries are, which tokens we should consider for token-based compression, what the reading order is, and how text should flow between different parts of the layout. Bounding boxes that are not found to represent "text" in this filtering operation are not lost, however; they can later be incorporated into the output of the system as graphical elements.

The dimensions of bounding boxes representing body text are found using a simple statistical procedure. If we consider the distribution of heights as a statistical mixture of various components, then for most pages containing text, the largest mixture component comes from lower-case letters at the predominant font size. We use this size to find the x-height of the predominant font, and use this dimension to filter out bounding boxes that are either too small or too large to represent body text or standard headings.

Given a collection of bounding boxes representing text, we are interested in finding text lines and column boundaries. The approach used in the prototype system for identifying text lines and column boundaries relies on a branch-and-bound algorithm that finds maximum likelihood matches against line models under a robust least square error model (equivalently, a Gaussian noise model in the presence of spurious background features).2 Text line models are described by three parameters: the angle and offset of the line, and the descender height. Bounding boxes whose alignment point, the center of their bottom side, rests either on the line or at a distance given by the descender height below it are considered to match the line; matches are penalized by the square of their distance from the model, up to a threshold value ε, usually of the order of five pixels.
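A minimal sketch of the scoring underlying such a match is shown below; the branch-and-bound search itself is omitted, and the parameterization is only an illustrative reading of the description above.

```python
import math

def line_match_score(boxes, angle, offset, descender, eps=5.0):
    """Robust (truncated) squared-error score of a text line model against boxes.
    Each box is (x0, y0, x1, y1) with y growing downwards; the alignment point is
    the center of the bottom side. Lower scores mean better matches."""
    score = 0.0
    for x0, y0, x1, y1 in boxes:
        ax, ay = (x0 + x1) / 2.0, y1
        baseline_y = math.tan(angle) * ax + offset
        # distance to the baseline or to the descender line, whichever is closer
        d = min(abs(ay - baseline_y), abs(ay - (baseline_y + descender)))
        score += min(d, eps) ** 2      # truncation makes the model robust to outliers
    return score
```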
After a text line has been found, the bounding box of all the connected components that participated in the match is computed, and all other connected components that fall within that bounding box are assigned to the same text line; this "sweeps up" punctuation marks, accents, and "i"-dots that would otherwise be missed. Within each text line, multiple bounding boxes whose projections onto the baseline overlap are merged; this results in bounding boxes that predominantly contain one or more characters (as opposed to bounding boxes that contain character parts). The resulting bounding boxes are then ordered by the x-coordinate of their lower left corner to obtain a sequence of character images in reading order. Multiple text lines are found using a greedy strategy, in which first the top match is identified, the bounding boxes that participated in the match are removed from further consideration, and the next best text line is found, until no good text line matches can be identified anymore.

This approach to text line modeling has several advantages over the traditional projection or linking methods. First, due to scanning artifacts, different text lines can have different orientations. Second, by taking into account both the baseline and the descender line, the technique can find text lines that are missed by other text line finders. Third, the matches returned by the method follow the individual text lines more accurately than most other methods.2

Column boundaries are identified in an analogous manner, by finding globally optimal maximum likelihood matches of the centers of the left sides of bounding boxes against a line model. In order to reduce background noise, prior to applying the line finder to the column finding problem, statistics about the distribution of horizontal distances between bounding boxes are used to estimate the inter-character and inter-word spacing (the two largest components in the statistical distribution of horizontal bounding box distances), and bounding boxes for characters are merged into words. This reduces the number of bounding boxes that need to be considered for column matching severalfold and thereby improves the reliability of column boundary detection.

We are currently adapting the system to use the segmentation methods described in a recent paper by Breuel.3 Those methods allow for more reliable identification of column boundaries. They work by first identifying rectangles of empty page background satisfying conditions on their aspect ratio and proximity to textual components. Such whitespace rectangles correspond to column boundaries with high probability.1,3 Text lines are then detected by a constrained text line finding method similar to the one described above.
With either layout analysis method, large connected components that are not part of a text line are treated as images. Such components are merged repeatedly with any connected components that they overlap; the resulting collection of connected components is treated as a single image.

The output of the layout analysis is a collection of text line segments, images, and column boundaries. Unlike other layout analysis methods, for generating reflowable documents we do not require a complete, hierarchical analysis of document structure. All that is left to determine is how to link the text line segments together in reading order. For a single-column document, by enumerating text lines and bounding boxes of images in order of their y-coordinates, we obtain a sequence of characters, whitespace, and images in reading order. For a double-column document, the two columns are treated as if the right column were placed under the left column.

This simple layout analysis algorithm copes with a fairly wide range of commonly found layouts in printed documents and transforms them into a sequence of images that can be reflowed and displayed on a handheld device. In part, a simple algorithm works well in these applications because the requirements of reflowing for a handheld document reader are less stringent than for other layout analysis tasks, like rendering into a word processor. Since the output of the layout analysis will only be used for reflowing and not for editing, no semantic labels need to be attached to text blocks. Text that is misrecognized as an image (for example, a section heading printed in large type) will still reflow properly and often be completely indistinguishable from correctly segmented text. Because the documents are reflowed on a small screen, there is also no user expectation that a rendering of the output of the layout analysis precisely match the layout of the input document. Furthermore, if page elements like headers, footers, or page numbers are incorporated into the output of the layout analysis, users can easily skip them during reading, and such elements may serve as convenient navigational signposts on the handheld device as well. Even some errors in layout analysis, like misattributing a figure caption to the body of a text, can be tolerated by readers.
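A sketch of this reading-order linking is given below; the two-column case assumes a single known column boundary, which is an illustrative simplification.

```python
def reading_order(elements, column_boundary=None):
    """elements: list of (x0, y0, x1, y1) boxes for text lines and images.
    Sort top-to-bottom; with a column boundary, emit the left column first,
    then the right column, as if it were placed underneath."""
    def by_y(items):
        return sorted(items, key=lambda b: b[1])

    if column_boundary is None:
        return by_y(elements)
    left = [b for b in elements if b[0] < column_boundary]
    right = [b for b in elements if b[0] >= column_boundary]
    return by_y(left) + by_y(right)
```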
3. HTML-Based Representations

The result of the decomposition process described above is a sequence of text images and illustrations, along with meta-information about formatting such as paragraph breaks and line justification.
Fig. 3. A page from Jepson's A FLORA OF CALIFORNIA: (a) original; (b) in a web browser with boxes drawn around the image elements; and (c) in a web browser without the boxes.
Fig. 4. The Jepson page rendered by Microsoft's Reader electronic book viewer.
Many existing Web formats are well-suited for the representation of such data. HTML,11 the standard for the World Wide Web, supports the layout, in reading order, of a sequence of image elements, along with formatting information. The successor to HTML, XHTML,10 uses the more rigorous XML syntax but provides the same functionality. A group of commercial electronic-book interests is also defining the Open eBook Publication Structure,12 which also uses the XML syntax, incorporates the XHTML functionality, adds the ability to package multiple XHTML files into a single publication, and provides standards for the addition of document metadata.

Figure 3(a) shows a sample page from Willis Linn Jepson's A FLORA OF CALIFORNIA,8 an important scholarly work in the field of botany, in which typeface and relative type size choices are significant. Once decomposed into image elements, the logical connections between those elements can be represented with HTML. Figure 3(b) shows an HTML representation of the decomposition of the Jepson page rendered in a standard Web browser, with boxes drawn around individual image elements.a
a The illustrations have been manually removed from the original and manually re-inserted into this version. We are investigating ways of doing this automatically.
Figure 3(c) shows that same page, but without the bounding boxes around the elements (the layout has changed slightly as well, since depicting the boxes in (b) takes up some of the available display space). Note that fonts and some characteristics of the original page have been retained, which would not necessarily be the case for an OCR-based transition strategy.
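To make the HTML-based representation concrete, the sketch below emits a reflowable page from a list of word-image files and their layout roles; the file naming and the minimal markup are illustrative assumptions rather than the system's actual output.

```python
from html import escape

def reflowable_html(elements):
    """elements: list of (kind, value) pairs in reading order, where kind is
    'word' (value = path of a small word image), 'para' (paragraph break) or
    'image' (value = path of an illustration)."""
    parts = ["<html><body><p>"]
    for kind, value in elements:
        if kind == "word":
            parts.append('<img src="%s" alt=""> ' % escape(value, quote=True))
        elif kind == "image":
            parts.append('</p><p><img src="%s" alt=""></p><p>' % escape(value, quote=True))
        elif kind == "para":
            parts.append("</p><p>")
    parts.append("</p></body></html>")
    return "".join(parts)

page = reflowable_html([("word", "w0001.png"), ("word", "w0002.png"),
                        ("para", None), ("image", "fig1.png")])
```

Because the browser treats each word image like an inline word, the page reflows automatically to any window width.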
4. Reader Applications

A problem with all of the Web formats is that a decomposed document will typically consist of a very large number of separate files. This configuration tends to strain the capabilities of the underlying technology support platforms. Token-based compression schemes9 can make this problem somewhat more manageable, but cannot really solve it. Most electronic-book document formats, however, alleviate this problem by packaging the image elements together with the layout directives in a single file.

Dozens of electronic book formats exist, among them Microsoft Reader, Adobe Acrobat Reader, Palm Reader, and Gemstar's RCA 1100 and 1200 formats. Our system of decomposed image elements can be supported by most (probably all) of these formats. Figure 4 shows an example of the Jepson page displayed in Microsoft's Reader viewer program. This was achieved by rendering the decomposed elements into the Open eBook Publication Structure, then using the OverDrive ReaderWorks distiller for Reader to create the electronic book. Similar approaches seem to suffice for all of the other popular electronic book formats.

Common electronic book formats, and the associated viewer applications, are optimized for 'books' consisting mainly of text, with occasional images. This can lead to performance problems in both the time and space domains. Microsoft Reader version 1,b for example, running on a 500 MHz Pentium III machine, takes approximately 20 seconds to lay out and display the first page of the Jepson sample shown above. But another system called 'Plucker'c provides timely layout and scrolling of the converted document, and averages only 70KB per converted page. We are currently investigating alternative storage formats and layout/display techniques optimized for this class of electronic book.
b see http://www.microsoft.com/reader/
c see http://www.plkr.org/
5. New Document Formats

We are currently developing new document formats for reflowable content. Rather than devising a completely new format, as systems like PDF and DjVu have attempted, we are building on existing image formats and annotating them with additional layout information. This approach means that we can take advantage of the considerable work that has gone into the development of existing image compression formats and reuse large amounts of existing code.

One format that we are developing is an extension to the Plucker document format. The Plucker format can be understood as a simplified form of HTML, encoded in binary, and packed into a single Palm database file. Plucker already allows images to be incorporated into its documents, but it does not cope well with the large numbers of small images that occur in reflowable document images. Our extension to the Plucker format allows image tags not only to include complete images in a document for display, but also to refer repeatedly to a single image within the document and display different subimages. That is, this new form of Plucker image tag specifies a bounding rectangle in the source image to be inserted into the Plucker text. That way, a single image included in a Plucker document can be used to represent hundreds of different word images. Furthermore, the fact that these included images are much larger than individual word images also helps compression algorithms achieve better compression on them.

A second format we are developing is based on document images in formats like TIFF and DjVu. We augment these document images with a representation of the document layout — a list of bounding boxes in reading order. The layout information that needs to be represented per page is small (a few thousand bytes). The additional information can be kept separate from the document image, allowing the document image to remain completely unchanged. It can also be incorporated into the extended chunks available in formats like TIFF, DjVu, and PDF as part of the image file, resulting in completely self-contained document image files suitable both for display in their original format and for reflowable display.
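As a sketch of the second idea, the snippet below serializes per-page layout information as a small sidecar next to an unmodified document image; the JSON structure and the field names are illustrative assumptions, not the format actually being developed.

```python
import json

def write_layout_sidecar(image_path, pages):
    """pages: list of pages, each a list of dicts like
    {"kind": "word", "box": [x0, y0, x1, y1]} given in reading order."""
    sidecar = {
        "image": image_path,              # the TIFF/DjVu file itself stays untouched
        "pages": [{"elements": page} for page in pages],
    }
    with open(image_path + ".layout.json", "w") as f:
        json.dump(sidecar, f)

write_layout_sidecar("scan.tiff", [[
    {"kind": "word", "box": [120, 80, 210, 110]},
    {"kind": "word", "box": [220, 80, 300, 110]},
]])
```

Such a sidecar is only a few thousand bytes per page, and a viewer that ignores it can still display the original image unchanged.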
6. Summary and Conclusions

As we noted in the beginning, page-oriented electronic document formats show no signs of disappearing, nor do devices whose display area is smaller than a full page of text. None of the three major classes of on-line document formats addresses the need to display page-oriented documents on such
devices: structured document formats like LaTeX or Microsoft Word are difficult to display, and documents are not usually available in that form; image-based formats like TIFF and DjVu require excessive scrolling on the part of the reader; and existing eBook content for PDAs preserves little of the original visual appearance of the document.

Our approach, reflowable document images, combines the best aspects of eBooks and structured documents with the best aspects of image-based formats. Like image-based formats, reflowable document images preserve the appearance of text, fonts, and images and are not subject to OCR errors. But reflowable document images also have many of the capabilities of structured document formats to adapt to different displays and output devices.

This chapter has described work in progress. Currently, the conversion software consists of a number of Java modules that can be invoked from the command line. Clearly, for widespread deployment, a fully automatic system with a convenient user interface is desirable; efforts are underway in our lab to combine the command line modules into such a system. Another ongoing area of work is the development of new viewer applications for handhelds.

A lot of documents have been cumbersome to use both on the web and on handhelds. We believe that in the future, the use of reflowable document images will be very important in making large amounts of on-line documents conveniently accessible both on handhelds and inside web browsers.
Acknowledgments

We are grateful to Dan S. Bloomberg for many of the original ideas around text reflowing leading to this work.
CHAPTER 6

EXTRACTION AND MANAGEMENT OF CONTENT FROM HTML DOCUMENTS
H. Alam, R. Hartono and A. F. R. Rahman*
Document Analysis and Recognition Team (DART), BCL Technologies Inc.,
990 Linden Drive, Suite #203, Santa Clara, CA 95050, USA
E-mail: fuad@bcltechnologies.com
URL: http://www.bcltechnologies.com
In recent times, the way people access information from the web has undergone a transformation. The demand for information to be accessible from anywhere, at any time, has resulted in the introduction of Personal Digital Assistants (PDAs) and cellular phones that are able to browse the web and can be used to find information over wireless connections. However, the small display form factor of these portable devices greatly diminishes the rate at which web sites can be browsed. Efficient algorithms are required to extract the content of web pages and build a faithful reproduction of the original pages on a smaller display with the important content intact.

1. Introduction

The problem of content extraction is important not only from the point of view of managing the amount of content; several other important issues are associated with it, some of which are:
• Viewing any website: Pattern recognition systems that use document analysis techniques can be employed for displaying web pages on small-screen devices by extracting and summarizing their content. These systems have to be generic enough to work with any web site, not only the well laid-out ones.
• High Speed Access: The transformation of web pages has to take place on the fly and therefore should be fast.
• Network Usage: The schemes employed for the transformation should not slow down network traffic by requiring additional resources.
* Corresponding author.
• Easy Configurability: Any such scheme should be easily configurable within existing systems by System Integrators (SIs) and end-users.
• Rapid Deployment: This is also a very important factor in software development and deployment.
• Non-Intrusive Design: Any such translation scheme should be built on top of a web site without modifying the actual web site.
• Multiple Views: The scheme should also allow the SIs and end-users to develop custom views.
2. Research Direction

A wireless PDA is designed to be carried in a person's pocket. The display area of such a device is a fraction of that available on a desktop computer, as shown in Fig. 1 and, graphically, in Fig. 2. For example, Palm devices display only 160 x 160 pixels. Hence, any web page designed for viewing on a 640 x 480 or higher-resolution PC display would be severely compromised on a handheld device. Only HP's half-VGA Jornada palmtops approach desktop resolutions. Various brands of PocketPCs are coming to market with somewhat higher resolutions, but their price point is still prohibitive for mass adoption.

This can be explored further using the case of cellular phones capable of accessing the web. Technical issues such as bandwidth, resolution and formats become extremely important factors with these devices. Present-generation cell-phone data rates are limited to 9600 bps, making bandwidth limitations a significant bottleneck to web deployment. However, in the longer term, 2.5G cellular is intended to be equivalent to v.90 wired modem speeds, and 3G cellular service targets data throughput roughly equivalent to DSL.
In terms of wireless LANs, Bluetooth is expected to achieve 721 kbps, and Intersil is presently shipping 802.11 chipsets that deliver 2 Mbps. Hence, bandwidth may be less of a problem in the future, assuming absolute channel capacity is not overwhelmed by the number of end-users.

However, display size and resolution are more important considerations than cellular bandwidth. Among the current generation of WML-ready GSM cell phones, Nokia's 3310 has a 48 x 84 display; the 7110 displays 96 x 65 pixels. Ericsson's R380s displays 120 x 360. There is a conflict here: smaller devices appeal to consumers, but small displays cannot show much information.

3. Current State of the Art

The current way of browsing the web using a wireless device can be very restrictive. At present, there are three solutions to this problem: handcrafting, transcoding and adaptive re-authoring.

3.1. Handcrafting

Custom web sites are typically crafted by hand by a set of content experts. This process is labor intensive and expensive. Because of the expense, typically less than 1% of web sites are converted to wireless content. Several companies are currently attempting to automate this conversion. The core of this automation requires web code written in XML. Two problems exist with this approach:
• There is no standard XML tagset (Document Type Definition, DTD) in use by vendors.
• XML has been available to web designers for the last 10 years, yet examination of websites shows little use of document structural elements. We hypothesize this is because web designers see themselves as artists rather than programmers.

3.2. Transcoding

Transcoding replaces HTML tags with suitable device-specific tags [1] (HDML, WML, etc.); a toy sketch of such tag substitution appears at the end of this subsection. While this is cost effective, it has two problems:
• Most web pages have a loose repeating visual structure. Sidebars are often the left-most element on a web page. This results in the wireless user getting the same repeating information with every screen, making browsing an unfriendly experience.
• Transcoding sends all the information to the wireless device, making it substantially slow on the wireless network.
Transcoding was introduced in Japan during 1999-2000 and was widely rejected by Japanese users.
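To make the tag-replacement idea concrete, the sketch below shows a deliberately naive transcoder. The tag mapping, function name and the choice of WML-like target tags are assumptions for illustration; real HDML/WML transcoding gateways are considerably more elaborate. Note how every piece of page content is passed through, which is exactly why transcoding tends to flood a slow wireless link.

```python
# Illustrative sketch only: naively rewrite a few HTML tags into WML-like tags.
# The TAG_MAP below is hypothetical, not a standard mapping.
import re

TAG_MAP = {
    "html": "wml", "body": "card", "h1": "big", "h2": "big",
    "b": "b", "i": "i", "br": "br", "p": "p",
}

def naive_transcode(html: str) -> str:
    def replace(match):
        slash, name = match.group(1), match.group(2).lower()
        mapped = TAG_MAP.get(name)
        if mapped is None:
            return ""                      # unknown tag: drop markup, keep content
        return "<" + slash + mapped + ">"
    # Rewrites opening and closing tags; attributes are simply discarded.
    return re.sub(r"<(/?)(\w+)[^>]*>", replace, html)

print(naive_transcode("<html><body><h1>News</h1><p>Story text...</p></body></html>"))
```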
3.3. Adaptive Re-authoring
This approach has attracted huge interest in recent times, principally because it is an attractive alternative to the previous options and because, in theory, it is flexible enough for future extension. Early attempts include the use of a web digestor [2] that explores the structure of a web page to detect 'blocks' of information and uses transcoding techniques to reflow the content. Other attempts include approaches that identify content and then either label the blocks of content or convert them to image thumbnails [3-5], which in turn are linked to the details of the relevant content. Some researchers have approached this problem from the point of view of streamlining web server Quality of Service (QoS) [6] and improving server overload behaviour [7]. Most of the approaches mentioned above perform some type of structural analysis of the web page, primarily exploiting the HTML data structures and pre-defined tagsets.

The Natural Language community, on the other hand, has long been involved in research on textual summarization. The problem of textual summarization, while not completely solved, is well researched and there is a wealth of solutions available in this category [8-13]. Unfortunately, natural language summarization techniques are difficult to apply directly to web pages because of the unique way content is distributed in the HTML representation and the presence of other multimedia components embedded within these pages. More on this topic is discussed later in this chapter.

4. Proposed Approach

The importance of efficient content extraction from HTML pages for wireless web access becomes clear, especially in the context of the issues discussed in the previous section. The approach proposed here addresses this problem in a slightly different way. The proposed system is closer to the adaptive re-authoring approach discussed above, but has a number of distinct features and is more streamlined. Initially, the HTML data structure is exploited to segment the web pages into zones [14, 15]. Once these zones are identified, attribute-based analysis of the content is carried out. This results in the extraction of content that is relevant and important.
4.1. Web Page Segmentation
This method segments web pages into intelligent and relevant sections. It involves an analysis of the structure of the web page. The document is decomposed into a main data structure, usually a tree that follows HTML's table structure. CContent is the root, and CTable nodes are its children; the number of CTable children should equal the number of top-level tables in the hypertext page. Each CTable has CRow children, which in turn may have CCol children. Finally, a CCol can again contain CTable nodes as children; from this point the structure is recursive (Fig. 12), and it extends all the way down to the leaves. Adoption of this structure ensures that the HTML page is split into the units defined within the hierarchy of this tree. The content is inside the CCol nodes; the rest of the structure defines the way the various objects within that structure are related to each other [16].
Fig. 12: Data structure.
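The following sketch mirrors the hierarchy of Fig. 12 with hypothetical Python classes. It is not the authors' implementation; it only illustrates how nested HTML tables map onto the recursive CContent/CTable/CRow/CCol tree and how the content-bearing CCol leaves can be collected in document order.

```python
# Illustrative data structure for the recursive decomposition of Fig. 12.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CCol:
    text: str = ""                                          # content lives here
    tables: List["CTable"] = field(default_factory=list)    # nested tables recurse

@dataclass
class CRow:
    cols: List[CCol] = field(default_factory=list)

@dataclass
class CTable:
    rows: List[CRow] = field(default_factory=list)

@dataclass
class CContent:
    tables: List[CTable] = field(default_factory=list)      # one per top-level table

def leaf_columns(content: CContent) -> List[CCol]:
    """Collect content-bearing CCol leaves depth-first, in document order."""
    out: List[CCol] = []
    def walk(table: CTable) -> None:
        for row in table.rows:
            for col in row.cols:
                if col.tables:
                    for nested in col.tables:
                        walk(nested)
                else:
                    out.append(col)
    for top in content.tables:
        walk(top)
    return out
```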
4.2. Contextual Analysis and Segment Labeling
The decomposed sub-documents are analyzed for their context. This classifies the segments into various classes based on the type and size of content in each node. The current state of the art is a simple method that uses information collected in the decomposition stage, such as the number of non-link words, linked words, content weight, the presence of forms, etc., to devise specific classes. For example, an element composed primarily of text and few links is a "story" node. Similarly, nodes with a very high ratio of linked text compared to non-linked text, and with more linked words per phrase than a threshold, can be labeled "Links". Other classes, such as "Navigation Bars", "Advertisement Bars", "Main Story", "Other Stories", "Forms", "Images" and many more, can be created.

This contextual analysis of each sub-document produces a summarization expressed as a label. The labeling is based on exploiting the visual cues in the web page. During web page segment labeling, a semantic label is applied to each segment. Often a segment has an overriding label provided by the author; in such cases, the task is locating the author-provided label. In other cases, the labels are implied. This is the more difficult task, and currently there is only experimental work on semantic labeling based on content. Visual cues, such as font size, headlines, boldness, color, links, flashing (BLINK), italics (I), emphasis (EM) and underlining, are a good source of information when trying to create an appropriate label for a segment (or sub-document). Various algorithms can be adopted to exploit these parameters. These labels are ultimately used as short summaries for the Table of Content (TOC).
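A minimal sketch of this kind of rule is shown below, assuming the features named above are already available from the decomposition stage. The thresholds and class names are invented for illustration and are not the tuned values of any deployed system.

```python
# Hypothetical heuristic labelling of one segment from decomposition features.
def label_segment(non_link_words: int, linked_words: int,
                  linked_words_per_phrase: float, has_form: bool) -> str:
    if has_form:
        return "Forms"
    total = non_link_words + linked_words
    if total == 0:
        return "Images"
    link_ratio = linked_words / total
    if link_ratio < 0.2 and non_link_words > 50:
        return "Story"
    if link_ratio > 0.8 and linked_words_per_phrase > 3:
        return "Links"
    if link_ratio > 0.8:
        return "Navigation Bar"
    return "Other"

print(label_segment(non_link_words=240, linked_words=12,
                    linked_words_per_phrase=1.5, has_form=False))   # -> Story
```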
4.3. Web-Page Summarization
Since each of these sub-documents is summarized with an "intelligent" summary, the summaries are put together as a summary of the whole document, giving rise to a Table of Content (TOC). Each entry in this TOC points to specific sub-documents within the document.

4.4. Post-processing
Extraction of content from individual zones, however, is not the complete solution. These zones can contain content that is inter-related, and it may not make much sense to display such zones separately. Therefore, the next stage in this process is the analysis of the relationships between these zones. This is achieved in three ways (a small illustrative sketch of the first follows this list).
• Proximity Analysis: This approach involves a relational analysis based on content proximity, which is a measure of how closely related one block of text is to another. The natural order of the zones can sometimes be used as a strong indicator to establish a relationship. In addition, lexical chains [17] are used to create relationships among the various zones.
• Content Classification: Content extracted from individual zones can be classified into various types, and this classification, combined with the context of proximity, can be a powerful tool to establish a logical map between the various zones.
• Content Flow Analysis: Often, due to the structure of HTML documents, a single coherent block of content can be separated into multiple zones. This is usually internal and transparent to the user after rendering by the browser. However, content classification as described above will separate a related block of information, such as a paragraph, into multiple zones. Content Flow Analysis uses content understanding methods to approximate the content flow between zones. This analysis is based on natural language processing involving contextual grammar and vector modeling [18], details of which are beyond the scope of this chapter. This application of knowledge models and information retrieval techniques defines the relationships between previously separated zones in order to combine them into a coherent representation.
Once relationships between the various zones are established, they are used to reflow the content in a more meaningful and efficient manner suited to the requirements of smaller display devices. Various methods [19-21] are applied to combine the information thus collected. Although primarily developed for character recognition, these techniques are generic enough to be applied to this particular task domain with little or no modification.
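As a rough illustration of proximity analysis, the sketch below scores how related two adjacent zones are by simple lexical overlap (a crude stand-in for the lexical-chain analysis cited above) and merges neighbours whose score exceeds a threshold. The scoring function and threshold are assumptions made for this example only.

```python
# Illustrative proximity-based merging of adjacent zones.
from typing import List

def relatedness(zone_a: str, zone_b: str) -> float:
    words_a = {w.lower() for w in zone_a.split() if len(w) > 3}
    words_b = {w.lower() for w in zone_b.split() if len(w) > 3}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))

def merge_adjacent_zones(zones: List[str], threshold: float = 0.3) -> List[str]:
    merged: List[str] = []
    for zone in zones:
        if merged and relatedness(merged[-1], zone) >= threshold:
            merged[-1] = merged[-1] + " " + zone   # treat as one coherent block
        else:
            merged.append(zone)
    return merged
```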
4.5. Overall Summary of the Content Extraction and Display Process
The following are the overall stages required to achieve content extraction from HTML documents:
• Structural Analysis: Analysis of the structure of a web document.
• Decomposition: Decomposing a web document based on the extracted structure.
• Contextual Analysis: Once the document is decomposed into constituent sub-documents, analyze each sub-document for its context.
• Summarization (Labeling): This contextual analysis of each sub-document produces a summarization, which can be expressed as a sentence or sub-sentence (a label) indicating the content of the sub-document.
• Table of Content (TOC): Since each of these sub-documents is summarized with an "intelligent" summary, the summaries can be put together as a summary of the whole document, giving rise to a Table of Content (TOC). Each entry in this TOC points to specific sub-documents within the document.
• Order of TOC: The order in which the TOC is extracted depends on the "natural" order of the sub-documents extracted from the main document. However, this "natural" order is often misleading, as the main "interesting" or "important" message of the document can be lost in the TOC. It is therefore important to analyze the content of each sub-document and display the TOC by re-ordering the entries based on their relative "importance". This is determined by analyzing the visual emphasis of the textual information, such as the various levels of headers, boldness, font size, word frequency and other parameters (a rough sketch of such a scoring scheme follows this list).
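The sketch below illustrates one way such an importance score might be computed from the visual-emphasis cues just listed and used to re-order the TOC. The particular weights and field names are invented for illustration; the chapter does not specify them.

```python
# Hypothetical importance scoring for TOC re-ordering.
from typing import Dict, List, Tuple

def importance(segment: Dict) -> float:
    score = 0.0
    score += {1: 3.0, 2: 2.0, 3: 1.0}.get(segment.get("header_level", 0), 0.0)
    score += 1.5 if segment.get("bold") else 0.0
    score += 0.1 * segment.get("font_size", 0)
    score += 0.01 * segment.get("word_count", 0)
    return score

def order_toc(segments: List[Dict]) -> List[Tuple[str, float]]:
    scored = [(s["label"], importance(s)) for s in segments]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(order_toc([
    {"label": "BCL Computers", "header_level": 1, "bold": True,
     "font_size": 18, "word_count": 120},
    {"label": "Beta Tester wanted", "header_level": 3, "bold": False,
     "font_size": 12, "word_count": 40},
]))
```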
5. Results

The performance of the system is best described using a practical application. Figure 3 shows the first page of the web site www.bcl-computers.com. As can be clearly seen, this web site has a complicated multi-column layout. The content is presented in multiple segments with an implied relationship between these segments.
For example, a story segment might be followed by a segment providing additional links to similar stories. The system analyses the layout and segments within the page and produces a summarized output (Fig. 4). This is the total table of content (TOC). Each member of the TOC represents several segments within the page. Selecting any of these links enables the user to go to the more detailed content associated with that TOC entry. For example, selecting the link "BCL Computers" leads the user to the display shown in Fig. 5. Clearly, the idea here is to keep the content intact; the emphasis is on identifying which segments of the page should be put together as a related segment that can be adequately described by a single label.
There also has to be some way to navigate between the various levels of abstraction. Since the content is in two levels, making the first-level labels (TOC entries) hyperlinks solves this problem gracefully. For example, if the user selects the link "BCL Computers" from Fig. 4, the system shows the output of Fig. 5. In similar fashion, selecting the link "Beta Tester wanted" from the same display (Fig. 4) produces the output of Fig. 6. Figure 7 shows the detailed content associated with the link "PDF SOLUTIONS" in Fig. 4. Not only are the "story" contents summarized; sidebars and navigation links are also summarized. For example, Figures 8-11 show the summarized bars from the page www.bcl-computers.com.
Fig. 6: Detailed content in the 2nd level.
Fig. 7: Detailed content in the 2nd level.
Fig. 8: The summarized top bar.
Fig. 9: A side bar.
Fig. 10: More details.
Fig. 11: The top bar.
6. Discussion

The wireless web is poised to be the next big technological paradigm shift after the PC-based Internet. The technologies of web content management, portability and wireless devices, such as handhelds and cellular phones, are maturing rapidly and their convergence is now in sight. As is already the case in Japan, wireless devices are going to be the primary devices for web access in the USA. In this emerging market, strong software for easy navigation and access to information from the web is of paramount importance. However, accessing the web on these small display devices is a very difficult issue, one that has been the primary cause of the slow adoption of this huge phenomenon.

Let us look at a very simplified scenario. There are currently 1.2 billion web pages in the world. Assuming a skilled designer needs 2 hours to adapt an existing page for wireless access, it will take 2.4 billion work-hours to make the whole web accessible to wireless devices. If a cost of $20 per hour is assumed, this effort requires an investment of around $50 billion. This shows the enormity of the task ahead of us. What we need is a very effective tool that can automatically generate summaries of these pages and index the pages accordingly. Such a solution must eliminate the necessity of low-level coding to convert web sites.

In addition, multi-lingual web summarization is going to be extremely important in the very near future. In the next four years, the far-eastern countries are going to surpass the United States in terms of user numbers for web access and usage. This is a vast market for future web-related revenue. Sustainable mobile commerce is only going to take off when we have a reliable and efficient method of web access from all types of wireless devices using all the major languages of the world.

In general, textual summarization uses various standard techniques such as knowledge-rich methods, exploitation of discourse structures, corpus-based approaches or classical approaches. However, in the case of web pages, the structure imposed on these pages makes the task much more complicated, as simply extracting the text and graphics from the web pages in sequential order often does not make any sense at all. In the approach proposed here, the summarization of web pages is achieved by a two-stage process. In the first stage, a summary of the structure of the web page is produced, creating a tree out of the page with automatically generated sections and sub-sections. In the second stage, textual summarization is performed on those sub-sections. This process, though it solves some of the problems, still has unresolved issues, which are discussed in the rest of this section.
6.1. Web Page Segmentation
The design and rationale of the web page segmentation approach adopted here are straightforward. However, web designers use table structures in a variety of ways. Most of the time, the construct is used as a tool for nicely laid-out displays. In other cases, tightly related content is distributed into separate nodes purely for the sake of visual presentation. The prevalent use of the HTML table construct for layout purposes is complicated by the fact that tables are sometimes used for exactly what they were designed for: tabular data with numbers in complex multiple column/row/hybrid configurations. Simply assessing when a table construct is a true "table" is a very open problem [22, 23].

Two of the important problems in web page segmentation are over-segmentation and under-segmentation. With over-segmentation, related areas are split apart; with under-segmentation, unrelated parts are grouped together. The content in these web pages can include forms, frames, JavaScript-enabled active components and many other artefacts. Over-segmentation tends to split these up and make the overall rendition unrealistic. A form, for example, can start anywhere in the code and end anywhere in the code, and the active components within the form itself can be scattered all over the place. This makes it almost impossible to reconstruct an operational form after segmentation has been carried out. The same argument is true for some other active components commonly found in web pages. One such example is the common practice of using slices of graphics to build a larger graphic. In the summarization process, these tend to be split up, making the summarization incomplete and often useless. This leads to the second problem. Since some type of merging is needed to make sure that related sections are put together, this sometimes puts unrelated sections together, giving rise to under-segmentation. This is still a very open problem.

6.2. Contextual Analysis and Segment Labeling

The approach adopted here avoids generating labels using textual summarization. In recent times, contextual analysis has opened up ways to apply analytical methods to efficiently summarize text. These methods are primarily based on Natural Language (NL) processing [9-12]. In most cases, they are resource intensive in terms of computational requirements and therefore slow. In any practical system of web browsing, the lag time between switching pages has to be less than 1 second to be accepted by the industry. Therefore, resource-intensive solutions, though often superior, are hardly acceptable if they are to be bundled with a commercial web browser.
Though this algorithm is straightforward, in reality it hardly satisfies the above timing criterion. Web designers tend to take a lot of artistic freedom with all these features, and what we end up with is a kaleidoscope of graphics, active multimedia components, forms and other artifacts. Web designers also use frames, often in ways they were not initially intended for. These factors make the task of extracting the correct label very difficult. One of the most difficult problems in this respect is the prevalent use of graphics as text in web documents. Some recent studies found that a substantial number of graphics on the web contain text, and a percentage of this text is not repeated anywhere else on those pages [24-26]. One option is to use OCR to detect the text within the graphics, but the accuracy of current OCR systems makes the task of label extraction very difficult. In the proposed approach, this problem is side-stepped by allowing the graphics to be displayed as is, thereby making the output complete. However, this does not make the task of indexing any easier.

6.3. Web-Page Summarization

The approach adopted here has the following three advantages:
• Coherent Viewing: By providing a table of contents for each web page visited, the system allows end-users to see a summary of the page they are going to visit. This allows the user to coherently view a large web page on a small mini-browser.
• Universal Access: This solution allows any wireless user to coherently view any web site without customization. This makes the entire web viewable to the end-user.
• High Speed Access: By summarizing web pages, the system speeds up access to any web site. Currently there is a 1-second average response lag per page within a web site, assuming good network speed. Wireless network speed is a factor in the performance of this system. Wireless networks in North America currently have a 7-15 second lag in response time, largely due to the lack of 2nd and 3rd generation (2G and 3G) digital wireless networks. The auction of digital wireless bandwidth over the next year should result in the deployment of 2G and 3G networks in North America, and the speed issue will be eased.
6.4. Display Capabilities
The display capabilities of devices can vary widely. Small-screen devices often have lower display capability not only in terms of the number of pixels, but also in whether they support color.
Since, in general, most cell phones and PDAs do not reproduce the range of colors a desktop can display, we need to address how web pages are converted in terms of color. In the approach presented here, we aim to detect the targeted device by detecting what type of browser it supports. Once the device is detected, and if required, color images are converted to black and white after dithering [27]. Images are also size-normalized to fit the screen size. If the device is capable of displaying color, the images are only size-normalized, with color retained.
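A minimal sketch of this adaptation step is shown below using the Pillow imaging library; the chapter itself cites ImageMagick [27], so the library choice, target size and function name here are assumptions for illustration only.

```python
# Illustrative image adaptation: size-normalize, and dither to black and white
# when the target device cannot display color.
from PIL import Image

def adapt_image(path: str, out_path: str, screen=(160, 160), supports_color=False) -> None:
    img = Image.open(path)
    img.thumbnail(screen)       # size-normalize to fit the device screen
    if not supports_color:
        img = img.convert("1")  # mode "1" applies Floyd-Steinberg dithering by default
    img.save(out_path)

adapt_image("banner.png", "banner_palm.png", screen=(160, 160), supports_color=False)
```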
6.5. Language Independence
One of the biggest advantages of the proposed approach is its ability to be multi-lingual. Since most of the summarization depends on structural analysis and visual cues, the solution is language independent. In general, the only problem encountered in switching between languages is the need to handle multi-byte characters, since many far-eastern languages such as Chinese and Japanese use multi-byte representations, whereas English is simply single-byte. However, this is an implementation issue and is not directly related to the summarization process.
6.6. Current State of Research
In general, this new approach shows promising signs of success. The remaining problems, and there are many, need to be addressed within the proposed framework. As web technology acquires more and more new and innovative ways of making sites attractive and efficient, and as more and more web sites become available, the task of summarization will become increasingly important, since we will not have enough time to read everything. It is also essential if we want to make the projected paradigm shift from desktop to handheld devices a reality, since these new-generation devices offer much smaller display areas. Any useful wireless web browsing has to have some form of summarizer associated with its browser, and in that respect the proposed approach might help in achieving this transition.
6.7. Supported Devices
The proposed system works by automatically summarizing live web content on the fly to fit smaller screen devices, such as PDAs and cellular phones with web capability. At the present time, the system supports all PDAs using an HTML 3.2 (and above) browser and also cellular phones using WAP, iMode (NTT DoCoMo), J-Sky (J-Phone) and EZweb (KDDI) formats.
7. Concluding Remarks

This chapter has presented an approach to extracting content from HTML documents based on their structural analysis. Based on this extraction, a classification of the content allows a more efficient representation of the content, in keeping with the importance of and logical relationships between the various zones of the document. This document analysis approach should therefore be able to organize the content into a meaningful, understandable, manageable and useful representation. The chapter has also discussed some of the unsolved challenges in this field and attempted to provide a perspective on the overall state of the art.

References
1. M. Hori, R. Mohan, H. Maruyama and S. Singhal, "Annotation of Web Content for Transcoding", W3C Note, http://www.w3.org/TR/annot/.
2. T. Bickmore, A. Girgensohn and J. W. Sullivan, "Web page filtering and re-authoring for mobile users", The Computer Journal, 42(6), Oxford University Press, 1999, pp. 534-546.
3. H. Zhang, "Adaptive content delivery: A new research area in media computing", Proceedings of the Multimedia Data Storage, Retrieval, Integration and Applications Workshop (MDSRIA), Hong Kong, January 13-15, 2000.
4. H. Zhang, J. Chen and Y. Yang, "Adaptive delivery of HTML contents", 9th World Wide Web Conference, Amsterdam, Netherlands, May 15-19, 2000, pp. 284-289.
5. H. Zhang, J. Chen and Y. Yang, "An adaptive web content delivery system", Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH2000), Trento, Italy, 28-30 August, 2000. Online proceedings at http://ah2000.itc.it/.
6. T. F. Abdelzaher and N. Bhatti, "Web Server QoS Management by Adaptive Content Delivery", Computer Networks, Elsevier, 31(11-16), 1999, pp. 1563-1577.
7. T. F. Abdelzaher and N. Bhatti, "Web content adaptation to improve server overload behavior", Proceedings of the 8th International World Wide Web Conference, Toronto, Canada, 1999, pp. 465-499.
8. S. Corston-Oliver, "Text compaction for display on very small screens", Proceedings of the Workshop on Automatic Summarization, 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Pittsburgh, PA, U.S.A., 2-7 June, 2001.
9. O. Buyukkokten, H. Garcia-Molina and A. Paepcke, "Text Summarization of Web Pages on Handheld Devices", Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Pittsburgh, PA, U.S.A., 2-7 June, 2001.
10. A. Kolcz, V. Prabakarmurthi and J. Kalita, "Summarization as Feature Selection for Text Categorization", Proceedings of the 10th Conference on Information and Knowledge Management (CIKM-01), Atlanta, Georgia, 5-10 November, 2001, pp. 365-370.
11. K. McKeown, J. Klavans, V. Hatzivassiloglou, R. Barzilay and E. Eskin, "Towards Multidocument Summarization by Reformulation: Progress and Prospects", Proceedings of the Meeting of the American Association for Artificial Intelligence / Innovative Applications of Artificial Intelligence (AAAI/IAAI), Orlando, Florida, U.S.A., 18-22 July, 1999, pp. 453-460.
12. M. J. Witbrock and V. O. Mittal, "Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries", Proceedings of the ACM Special Interest Group on Information Retrieval Conference on Research and Development in Information Retrieval (SIGIR), Berkeley, California, U.S.A., 15-19 August, 1999, pp. 315-316.
13. Y. Matsumoto, T. Miyata, T. Nomoto, T. Tokunaga, M. Takeda and M. Obayashi, "Document Analysis and Summarization Workbench", Software Demonstrations at the 38th Annual Meeting of the Association for Computational Linguistics (ACL2000), Demonstration Notes, Hong Kong, 1-8 October, 2000, pp. 22-23.
14. A. F. R. Rahman, H. Alam, R. Hartono and K. Ariyoshi, "Automatic Summarization of Web Content to Smaller Display Devices", Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR01), Seattle, USA, 10-13 September, 2001, pp. 1064-1068.
15. A. F. R. Rahman, H. Alam and R. Hartono, "Content Extraction from HTML Documents", Proceedings of the 1st International Workshop on Web Document Analysis (WDA'2001), Seattle, USA, September 2001 (ISBN: 0-9541148-0-9), pp. 7-10. Also available at http://www.csc.liv.ac.uk/~wda2001.
16. A. F. R. Rahman, H. Alam and R. Hartono, "Understanding the Flow of Content in Summarizing HTML Documents", Proceedings of the Document Layout Interpretation and its Applications Workshop (DLIA01), Seattle, USA, 9 September, 2001. Online proceedings at http://www.science.uva.nl/events/dlia2001/program/index.html.
17. R. Barzilay and M. Elhadad, "Using lexical chains for text summarization", Advances in Automatic Text Summarization, I. Mani and M. T. Maybury (Eds.), MIT Press, 1999, pp. 111-121.
18. H. Alam, "Spoken language generic user interface (SLGUI)", Technical Report AFRL-IF-RS-TR-2000-58, Air Force Research Laboratory, Rome, NY, 2000.
19. A. F. R. Rahman and M. C. Fairhurst, "Introducing new multiple expert decision combination topologies: A case study using recognition of handwritten characters", Proceedings of the 4th International Conference on Document Analysis and Recognition (ICDAR97), Ulm, Germany, 18-20 August, 1997, vol. 2, pp. 886-891.
20. A. F. R. Rahman and M. C. Fairhurst, "Multiple expert classification: A new methodology for parallel decision fusion", International Journal of Document Analysis and Recognition, Springer, 3(1), 2000, pp. 40-55.
21. A. F. R. Rahman and M. C. Fairhurst, "Enhancing consensus in multiple expert decision fusion", IEE Proceedings on Vision, Image and Signal Processing, IEE Press, 147(1), 2000, pp. 39-46.
22. A. Rahman and H. Alam, "Challenges in Web Document Summarization: Some Myths and Reality", Proceedings of the Document Recognition and Retrieval IX Conference, Electronic Imaging, Santa Clara, California, U.S.A., 21-22 January, SPIE 4670-27, 2002.
23. Y. Wang and J. Hu, "A machine learning based approach for table detection on the web", Proceedings of the 11th World Wide Web Conference (WWW2002), Hawaii, U.S.A., 7-11 May, 2002, pp. 242-250.
24. A. Antonacopoulos, D. Karatzas and J. O. Lopez, "Accessing textual information embedded in internet images", Proceedings of the SPIE Internet Imaging Conference, San Jose, California, U.S.A., 24-26 January, 2001, pp. 198-205.
25. D. Lopresti and J. Zhou, "Locating and recognizing text in WWW images", Information Retrieval, Kluwer, 2, 2000, pp. 177-206.
26. T. Kanungo, C. H. Lee and R. Bradford, "What fraction of images on the web contain text?", Proceedings of the Web Document Analysis Workshop (WDA01), Seattle, USA, 8 September, 2001, pp. 43-46. Online proceedings at http://www.csc.liv.ac.uk/~wda2001/.
27. ImageMagick Image Processing Toolkit, http://www.imagemagick.org/.
CHAPTER 7

HTML PAGE ANALYSIS BASED ON VISUAL CUES
Yudong Yang, Yu Chen and HongJiang Zhang
Microsoft Research Asia
No. 49, Zhichun Road, Beijing, China
yangyud@cn.ibm.com, {i-yuchen, hjzhang}@microsoft.com
In this chapter, we present an approach to automatically analyzing the semantic structure of HTML pages based on detecting visual similarities of the content objects on these pages. The approach is based on the observation that, in most web pages, the layout styles of subtitles or records of the same content category are consistent and there are apparent separation boundaries between different categories. Thus, these subtitles should have a similar appearance when rendered in visual browsers, and the different categories can be separated clearly. In our approach, we first measure the visual similarities of HTML content objects. Then we apply a pattern detection algorithm to detect frequent patterns of visual similarity and use a number of heuristics to choose the most probable patterns. By grouping items according to these patterns, we finally build a hierarchical representation (tree) of the HTML document with semantics inferred from "visual consistency". Preliminary experimental results show promising performance of the method on real web pages.

1. Introduction

The World Wide Web has become one of the most important information sources today. Most of the data on the web is available as pages encoded in markup languages such as HTML, intended for visual browsers. As the amount of data on the web grows, locating desired content accurately and accessing it conveniently become pressing requirements. Technologies such as web search engines and ACD (Adaptive Content Delivery) [1-7] are being developed to meet these requirements. These technologies could give better results only if the semantic structures of the documents were available. However, web pages are normally composed for viewing in visual web browsers and lack information on semantic structures.
To extract these structures, document analysis methods are required. For search engines, the goal is to locate and extract the data in web pages while obtaining the data structure. For ACD, the semantic structures are the foundation of adaptation decisions. The different goals of search engines and ACD lead to different approaches.

1.1. Document Analysis for Search Engines

From a search engine's point of view, a web page (HTML document) is a semi-structured data set. The goals of document analysis are: 1) the location and extraction of data from the web page, and 2) the building of the data schema (structure). Hammer et al. [8] use an extractor specification file to program a lexical analyzer to extract the data. The specification file is actually a script that tells the analyzer how to perform the extraction. Table 1 gives an example of a typical specification file with two commands. The first command (lines 1-4) tells the analyzer to get a web page from the specified URL. The second command (lines 5-8) tells the analyzer to extract the temperature information inside the web page according to the specified pattern. The schema (structure) of the data is also specified by the file. As the specification file is usually created manually, according to some specific page, the analyzer can only extract data from web pages with exactly the same framework.
Ashish et al. [9] suggest the use of certain heuristics to automatically create a data extractor from samples of a set of similar web pages. While the created extractors may not be precise enough, the authors provide a facility for the user to interactively correct errors. As a limitation, the extractor can only be applied to the specific web page set.

Having extracted the data, the schema should be built upon it. Nestorov et al. [11] provide a type detection method to infer the inherent data schema (structure).
They assume that the data has been extracted and that the references among individual data objects are known. Based on the reference relationships, a set of heuristics is proposed to classify the data objects into several categories, each corresponding to a data type.

1.2. Document Analysis for Adaptive Content Delivery

While search engines require the data inside web pages, ACD needs to know the semantic structure of the web pages. It then adapts the web content according to the client device capabilities, network conditions and user interests. For web pages, re-authoring is commonly used. The process consists of operations including content analysis, summarization and re-layout. All of these operations depend on an understanding of the documents' semantic structures. This makes document analysis a necessary job. The goal of web page analysis in ACD is to infer the structure of the web page and provide enough information for the re-authoring operations.

Traditional analysis methods leverage structure-representing HTML tags, such as <H1>~<H6>, to infer the page structure. However, HTML as a whole lacks the ability to represent semantic-related content. For various reasons, it was designed with both structural and presentational capabilities in mind, and the two were not clearly separated. (In the first version of HTML most of the tags were structural, but many layout and presentation tags were added in later versions and are widely used today. Some of this history can be found in Siegel's essay [15].) Further widespread misuse of structural HTML tags for layout purposes (e.g. <TABLE>) makes the situation even worse. Cascading Style Sheets (CSS) [21] were later developed as a remedy, but only recently have several popular browsers begun to have better CSS support. The recent W3C recommendation of XML provides a better way to organize data and represent semantic structures. However, most web content is still authored in HTML.

Because of these common misuses, HTML tags are not stable features for analyzing the structures of real-world HTML documents. In this chapter, we introduce an automatic method to extract semantic structures from general HTML pages without requiring a priori knowledge of the web pages. The method uses features derived directly from the layout of HTML pages. As we have observed, it is common for web pages to keep consistent visual styles for parallel subtitles or records in the same content categories, while different categories are separated by apparent visual boundaries. The objective of this approach is to detect these visual cues and construct the hierarchy of categories.
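As a purely illustrative sketch of this idea (not the algorithm developed later in this chapter), sibling HTML objects can be reduced to coarse "style signatures" of their rendered appearance, and maximal runs of objects sharing a signature can be grouped as records of one content category. The feature names below are assumptions.

```python
# Illustrative grouping of sibling objects by repeated visual style.
from itertools import groupby
from typing import Dict, List, Tuple

def style_signature(obj: Dict) -> Tuple:
    # Hypothetical reduction of an object's rendered appearance.
    return (obj.get("font_size"), obj.get("bold"), obj.get("color"), obj.get("indent"))

def group_by_visual_consistency(siblings: List[Dict]) -> List[List[Dict]]:
    groups: List[List[Dict]] = []
    for _, run in groupby(siblings, key=style_signature):
        groups.append(list(run))
    return groups
```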
2. Visual Similarity of HTML Objects

Good organization of contents is an essential factor in good content services. Figure 5 and Fig. 6 on later pages show some examples of typical web pages. From these examples we can see that it is quite common to divide contents into categories, where each category holds records of related subtitles. In addition, records in one category are normally organized in a consistent visual layout style. Boundaries between different categories are marked clearly with different visual styles or separators. As we have said, the basic idea of this approach is to detect these visual cues, which can then be used to detect records and categories.

The first step in visual cue detection is determining the visual similarity of HTML objects. The appearance of HTML objects is defined by factors such as layout and style. Under current W3C recommendations, the layout and style of HTML pages should be defined by CSS. However, for historical reasons, CSS is not very popular yet and most web pages are still patchworks of structural and deprecated [20] presentation tags. Also, for reasons such as tricks used to reduce page size, laziness, mistakes and limited functions in some authoring tools, two HTML objects that look similar in browsers do not necessarily contain the same combination of HTML tags. Different tags, in different orders, may produce the same visual effect. Approaches that rely on tags will not be effective in these complex situations. Therefore, we prefer HTML objects instead of tags to identify the semantic units of the web page.

A web page can be considered as a hierarchy of HTML objects, in which the root corresponds to the whole page, leaf nodes correspond to simple objects, and internal nodes correspond to container objects. In the following discussion, we will use the terms defined below.
• Simple object: An indivisible visual HTML object, which either does not include other HTML tags (such as a paragraph of pure text or certain individual tags) or is the representation of a single embedded media object (such as