ADVANCES IN INFORMATION RETRIEVAL
Recent Research from the Center for Intelligent Information Retrieval
THE KLUWER INTERNATIONAL SERIES ON INFORMATION RETRIEVAL
Series Editor: W. Bruce Croft, University of Massachusetts, Amherst
Also in the Series:

MULTIMEDIA INFORMATION RETRIEVAL: Content-Based Information Retrieval from Large Text and Audio Databases, by Peter Schäuble; ISBN: 0-7923-9899-8
INFORMATION RETRIEVAL SYSTEMS, by Gerald Kowalski; ISBN: 0-7923-9926-9
CROSS-LANGUAGE INFORMATION RETRIEVAL, edited by Gregory Grefenstette; ISBN: 0-7923-8122-X
TEXT RETRIEVAL AND FILTERING: Analytic Models of Performance, by Robert M. Losee; ISBN: 0-7923-8177-7
INFORMATION RETRIEVAL: UNCERTAINTY AND LOGICS: Advanced Models for the Representation and Retrieval of Information, by Fabio Crestani, Mounia Lalmas, and Cornelis Joost van Rijsbergen; ISBN: 0-7923-8302-8
DOCUMENT COMPUTING: Technologies for Managing Electronic Document Collections, by Ross Wilkinson, Timothy Arnold-Moore, Michael Fuller, Ron Sacks-Davis, James Thom, and Justin Zobel; ISBN: 0-7923-8357-5
AUTOMATIC INDEXING AND ABSTRACTING OF DOCUMENT TEXTS, by Marie-Francine Moens; ISBN: 0-7923-7793-1
ADVANCES IN INFORMATION RETRIEVAL
Recent Research from the Center for Intelligent Information Retrieval
Edited by W. Bruce Croft University of Massachusetts, Amherst
KLUWER ACADEMIC PUBLISHERS New York / Boston / Dordrecht / London / Moscow
eBook ISBN: 0-306-47019-5
Print ISBN: 0-7923-7812-1
©2002 Kluwer Academic Publishers, New York, Boston, Dordrecht, London, Moscow
Print ©2000 Kluwer Academic Publishers, Massachusetts

All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.

Created in the United States of America

Visit Kluwer Online at: http://kluweronline.com
and Kluwer's eBookstore at: http://ebooks.kluweronline.com
Contents
Preface

Contributing Authors

1 Combining Approaches to Information Retrieval
  W. Bruce Croft
  1 Introduction
  2 Combining Representations
  3 Combining Queries
  4 Combining Ranking Algorithms
  5 Combining Search Systems
  6 Combining Belief
  7 Language Models
  8 Conclusion

2 The Use of Exploratory Data Analysis in Information Retrieval Research
  Warren R. Greiff
  1 Introduction
  2 Exploratory Data Analysis
  3 Weight of Evidence
  4 Analysis of the Relationship between Document Frequency and the Weight of Evidence of Term Occurrence
  5 Probabilistic Modeling of Multiple Sources of Evidence
  6 Conclusions

3 Language Models for Relevance Feedback
  Jay M. Ponte
  1 Introduction
  2 The Language Modeling Approach to IR
  3 Related Work
  4 Query Expansion in the Language Modeling Approach
  5 Discussion and Future Work

4 Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection
  Ron Papka, James Allan
  1 Topic Detection and Tracking
  2 On-line Clustering Algorithms
  3 Experimental Setting
  4 Event Clustering
  5 First Story Detection
  6 Discussion of First Story Detection
  7 Conclusion
  8 Future Work

5 Distributed Information Retrieval
  Jamie Callan
  1 Introduction
  2 Multi-Database Testbeds
  3 Resource Description
  4 Resource Selection
  5 Merging Document Rankings
  6 Acquiring Resource Descriptions
  7 Summary and Conclusions

6 Topic-Based Language Models for Distributed Retrieval
  Jinxi Xu, W. Bruce Croft
  1 Introduction
  2 Topic Models
  3 K-Means Clustering
  4 Four Methods of Distributed Retrieval
  5 Experimental Setup
  6 Global Clustering
  7 Recall-based Retrieval
  8 Distributed Retrieval in Dynamic Environments
  9 More Clusters
  10 Better Choice of Initial Clusters
  11 Local Clustering
  12 Multiple-Topic Representation
  13 Efficiency
  14 Related Work
  15 Conclusion and Future Work

7 The Effect of Collection Organization and Query Locality on Information Retrieval System Performance
  Zhihong Lu, Kathryn S. McKinley
  1 Introduction
  2 Related Work
  3 System Architectures
  4 Configuration with Respect to Collection Organization, Collection Access Skew, and Query Locality
  5 Simulation Model
  6 Experiments
  7 Conclusions

8 Cross-Language Retrieval via Transitive Translation
  Lisa A. Ballesteros
  1 Introduction
  2 Translation Resources
  3 Dictionary Translation and Ambiguity
  4 Resolving Ambiguity
  5 Addressing Limited Resources
  6 Summary

9 Building, Testing, and Applying Concept Hierarchies
  Mark Sanderson, Dawn Lawrie
  1 Introduction
  2 Building a Concept Hierarchy
  3 Presenting a Concept Hierarchy
  4 Evaluating the Structures
  5 Future Work
  6 Conclusions
  Appendix: ANOVA analysis

10 Appearance-Based Global Similarity Retrieval of Images
  S. Ravela, C. Luo
  1 Introduction
  2 Appearance Related Representations
  3 Computing Global Appearance Similarity
  4 Trademark Retrieval
  5 Conclusions and Limitations

Index
Preface
The Center for Intelligent Information Retrieval (CIIR) was formed in the Computer Science Department of the University of Massachusetts, Amherst in 1992. The core support for the Center came from a National Science Foundation State/Industry/University Cooperative Research Center (S/IUCRC) grant, although there had been a sizeable information retrieval (IR) research group for over 10 years prior to that grant. The basic goal of these Centers is to combine basic research, applied research, and technology transfer. The CIIR has been successful in each of these areas, in that it has produced over 270 research papers, has been involved in many successful government and industry collaborations, and has had a significant role in high-visibility Internet sites and start-ups. As a result of these efforts, the CIIR has become known internationally as one of the leading research groups in the area of information retrieval.

The CIIR focuses on research that results in more effective and efficient access and discovery in large, heterogeneous, distributed text and multimedia databases. The scope of the work that is done in the CIIR is broad and goes significantly beyond "traditional" areas of information retrieval such as retrieval models, cross-lingual search, and automatic query expansion. The research includes both low-level systems issues such as the design of protocols and architectures for distributed search, as well as more human-centered topics such as user interface design, visualization and data mining with text, and multimedia retrieval.

The papers in this book contain some of the more recent research results from the CIIR. The first group of papers presents new research related to the retrieval models that underlie IR systems. The first paper, Combining Approaches to Information Retrieval by Croft, discusses retrieval models and strategies for combining evidence from multiple document representations, queries, ranking algorithms and search systems. This has been an important line of research for more than 10 years, and this paper provides a framework for understanding the many experimental results in this area and indicates how recent work on
language models contributes to these results. Greiff's paper, The Use of Exploratory Data Analysis in Information Retrieval Research, introduces a data-driven approach to developing retrieval models and uses this approach to derive a probabilistic ranking formula from an analysis of TREC data. A number of retrieval experiments are used to validate this new model. In the third paper of this group, Language Models for Relevance Feedback, Ponte describes the language modeling approach to IR that he introduced in his thesis work, and then shows how this approach can be used for relevance feedback and filtering environments. A number of experiments demonstrate the effectiveness of this conceptually simple, but potentially very powerful retrieval model. The next paper, Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection by Allan and Papka, describes a relatively new area of research that focuses on detecting significant events in broadcast news. New algorithms and modifications of existing IR techniques are presented and evaluated in the context of this novel task.

The next three papers deal with a range of topics related to distributed information retrieval. In Distributed Information Retrieval, Callan gives an overview of this area of research and summarizes the results related to database description and selection, and merging rankings from multiple systems. Xu and Croft, in their paper Topic-Based Language Models for Distributed Retrieval, present recent results from an approach to describing databases that is based on identifying language models through clustering. Lu and McKinley discuss performance-related issues in their paper The Effect of Collection Organization and Query Locality on Information Retrieval System Performance. They show that the use of database replication combined with selection algorithms can significantly improve the efficiency and scalability of distributed retrieval.

The next paper, Cross-Language Retrieval via Transitive Translation by Ballesteros, discusses the language resources and techniques that are used for IR in multiple languages. The paper describes a series of experiments using a dictionary-based approach to transitive translation and retrieval through an intermediate language. In Building, Testing and Applying Concept Hierarchies, Sanderson and Lawrie describe research in the important new area of summarization. They focus specifically on a technique for constructing a hierarchy of concepts to summarize the contents of a group of documents, such as those retrieved by a query.

The last paper, Appearance-Based Global Similarity Retrieval of Images by Ravela and Luo, presents new research in the important, emerging area of image retrieval. Because many of the techniques used for image indexing and comparison are very different from those used for text retrieval, the paper contains an extensive introduction to those techniques. Retrieval evaluations of a new indexing technique based on image "appearance" are described and discussed.
These papers, like the research in the CIIR, cover a wide variety of topics in the general area of IR. Together, they represent a snapshot of the "state-of-the-art" in information retrieval at the turn of the century and at the end of a decade that has seen the advent of the World-Wide Web. The papers have been written to provide overviews of their subareas and to serve as source material for graduate and undergraduate courses in information retrieval.

Finally, I would like to acknowledge the faculty, staff, and students associated with the CIIR since 1992 who have contributed enormously to its success. In particular, Jean Joyce, Kate Moruzzi, and Glenn Stowell have been instrumental to the Center's operation. I would also like to thank Win Aung, our NSF Program Manager, for his support over the years.

BRUCE CROFT
Contributing Authors
James Allan is an Assistant Professor in the Computer Science Department and Assistant Director of the CIIR at the University of Massachusetts, Amherst. He received the Ph.D. in Computer Science from Cornell University in 1995. His research interests are in information retrieval and organization, as well as topic detection and tracking.

Lisa Ballesteros is an Assistant Professor in the Computer Science Department at Mount Holyoke College, Massachusetts. She received the B.S. degree in Biology from Union College, Schenectady, N.Y. in 1987 and the M.S. degree in Computer Science from the University of Massachusetts, Amherst in 1996. She is currently a Ph.D. candidate in Computer Science at U.Mass. Her major research interest is cross-language retrieval.

Jamie Callan received a Ph.D. in Computer Science from the University of Massachusetts, Amherst, in 1993. He was a Research Assistant Professor at U.Mass. from 1994 to 1999, and was Assistant Director of the CIIR from 1995 to 1999. He has been an Associate Professor at Carnegie Mellon University since the Fall of 1999. Dr. Callan's research interests include software architectures for IR, distributed information retrieval, information filtering, and use of the Internet and IR tools for K-12 education.

W. Bruce Croft is a Professor of Computer Science and Director of the CIIR at the University of Massachusetts, Amherst. He holds a Ph.D. from Cambridge University, England and has been active in information retrieval research since 1975.
Warren Greiff received his Bachelor's Degree in Mathematics from Case Institute of Technology and Master's Degrees from the University of Pennsylvania and Antioch College in Computer Science and Education, respectively. After a number of years working in American industry, he spent 12 years as a professor at the University of the Americas in Puebla, Mexico. In 1994, he returned to the United States to pursue a doctoral degree, which he received from the University of Massachusetts, Amherst in 1999. He is currently pursuing his research interests in statistical techniques applied to natural language processing technology at the Mitre Corporation in Bedford, Massachusetts.

Dawn Lawrie is a Computer Science graduate student at the University of Massachusetts, Amherst. She received her bachelor's degree in Computer Science from Dartmouth College in 1997. Currently she is studying the organization of information and text mining at the CIIR.

Zhihong Lu received her Ph.D. in Computer Science from the University of Massachusetts, Amherst, and is currently working at AT&T on network services. Her dissertation work is on scalable distributed architectures for information retrieval.

Chen Luo is a graduate student at the University of Massachusetts, Amherst. He received his B.S. from the Automation Department of Tsinghua University, China. His research interests are image retrieval, pattern recognition and computer vision.

Kathryn S. McKinley is an Associate Professor at the University of Massachusetts. She received her Ph.D. from Rice University in 1992. Her research interests include compilers, architectures, parallel and distributed computing with a particular focus on memory hierarchy and program analysis, and information retrieval.

Ron Papka is currently the Director of Research and Development at Dataware Technologies. He is also an Adjunct Professor of Computer Science at Smith College. He received his Ph.D. in Computer Science from the University of Massachusetts in 1999, an M.S. from Brown University in 1993, and a B.S. from Columbia University in the same discipline. His research interests include Information Retrieval, Machine Learning, and Object-Oriented Programming.

Jay M. Ponte received his B.S. in Computer Science from Northeastern University in 1989. He received his M.S. and Ph.D. degrees in Computer Science
from the University of Massachusetts in 1996 and 1998 respectively, where he worked in the CIIR. He is currently a Principal Member of Technical Staff at GTE Laboratories where he continues to specialize in information retrieval and statistical natural language processing.

S. Chandu Ravela is a doctoral candidate at the University of Massachusetts, Amherst. Mr. Ravela's research interests are visual media representations, computer vision, scale-space and multi-sensor data abstractions. Mr. Ravela received his B.E. in Computer Science and Engineering from the Regional Engineering College, Trichy, India in 1991 and his M.S. in Computer Science from the University of Massachusetts in 1994.

Mark Sanderson is a lecturer in the Information Science department at the University of Sheffield. Prior to this, he was a postdoctoral researcher in the CIIR at the University of Massachusetts. He completed his Ph.D. on disambiguation and IR at the University of Glasgow in 1997 and works on summarization, hypertext retrieval, and spoken document retrieval.

Jinxi Xu received his Ph.D. in Computer Science from the University of Massachusetts, Amherst in 1997, and was a post-doctoral research associate for two years at the CIIR. He is now a scientist at BBN Technologies in Cambridge, Massachusetts. His research interests include information retrieval and natural language processing.
Chapter 1

COMBINING APPROACHES TO INFORMATION RETRIEVAL

W. Bruce Croft
Department of Computer Science
University of Massachusetts, Amherst
croft@cs.umass.edu
Abstract

The combination of different text representations and search strategies has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the "meta-search" engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.

1 INTRODUCTION
Information retrieval (IR) systems are based, either directly or indirectly, on models of the retrieval process. These retrieval models specify how representations of text documents and information needs should be compared in order to estimate the likelihood that a document will be judged relevant. The estimates of the relevance of documents to a given query are the basis for the document rankings that are now a familiar part of IR systems. Examples of simple models include the probabilistic or Bayes classifier model (Robertson and Sparck Jones, 1976; Van Rijsbergen, 1979) and the vector space model (Salton et al., 1975). Many others have been proposed and are being used (Van Rijsbergen, 1986; Deerwester et al., 1990; Fuhr, 1992; Turtle and Croft, 1992).
As these retrieval models were being developed, many experiments were carried out to test the effectiveness of these approaches. Quite early in these experiments, it was observed that different retrieval models, or alternatively, variations on ranking algorithms, had surprisingly low overlap in the relevant documents that were found, even when the overall effectiveness of the algorithms was similar (e.g. McGill et al., 1979; Croft and Harper, 1979). Similar studies showed that the practice of searching on multiple document representations such as title and abstract or free text and manually assigned index terms was more effective than searching on a single representation (e.g. Fisher and Elchesen, 1972; McGill et al., 1979; Katzer et al., 1982). These, and other studies, suggested that finding all the relevant documents for a given query was beyond the capability of a single simple retrieval model or representation.

The lack of overlap between the relevant documents found by different ranking algorithms and document representations led to two distinct approaches to the development of IR systems and retrieval models. One approach has been to create retrieval models that can explicitly describe and combine multiple sources of evidence about relevance. These models are typically probabilistic and are motivated by the Probability Ranking Principle (Robertson, 1977), which states that optimal retrieval effectiveness is achieved by ranking documents in decreasing order of probability of relevance and that "probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system". The INQUERY system, for example, is based on a probabilistic model that is explicitly designed to combine evidence from multiple representations of documents and information needs (Turtle and Croft, 1991; Callan et al., 1995a). The other approach has been to design systems that can effectively combine the results of multiple searches based on different retrieval models. This combination can be done in a single system architecture (e.g. Croft and Thompson, 1987; Fox and France, 1987) or in a distributed, heterogeneous environment (e.g. Lee, 1995; Voorhees et al., 1995; Callan et al., 1995b). Combining multiple, heterogeneous searches is the basis of the "meta-search" engines on the Web (e.g., MetaCrawler1) and has become increasingly important in multimedia databases (e.g. Fagin, 1996).

The motivation for both these approaches is to improve retrieval effectiveness by combining evidence. Apart from the empirical results, theoretical justification for evidence combination is provided by a Bayesian probabilistic framework (e.g. Pearl, 1988). In this framework, we can describe how our belief in a hypothesis H is incrementally affected by a new piece of evidence e. Specifically, using log-odds:

log O(H|E, e) = log O(H|E) + log L(e|H)

where E is all the evidence seen prior to e, O(H|E) = P(H|E) / P(¬H|E) is the posterior odds on H given evidence E, O(H|E, e) is the odds on H given the new evidence e, and L(e|H) = P(e|H) / P(e|¬H) is the likelihood ratio of evidence e.

1 http://www.metacrawler.com
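To make the update rule concrete, here is a minimal sketch in Python (a worked illustration only; the prior odds and likelihood ratios are invented numbers, not estimates from any IR system):

    import math

    def accumulate_log_odds(prior_log_odds, likelihood_ratios):
        # Each independent piece of evidence adds its log likelihood ratio.
        return prior_log_odds + sum(math.log(lr) for lr in likelihood_ratios)

    def log_odds_to_probability(log_odds):
        odds = math.exp(log_odds)
        return odds / (1.0 + odds)

    # Prior odds of relevance of 1:99, then three pieces of positive evidence
    # with likelihood ratios 5, 3, and 2 (all illustrative values).
    prior = math.log(1.0 / 99.0)
    posterior = accumulate_log_odds(prior, [5.0, 3.0, 2.0])
    print(log_odds_to_probability(posterior))  # roughly 0.23

Note how the single piece of strong evidence (the likelihood ratio of 5) moves the posterior much further than the weaker pieces, matching the discussion that follows.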
This formulation makes it clear that each additional piece of positive evidence (i.e. with likelihood > 1) increases the odds of the hypothesis being true. A piece of evidence with very strong likelihood can have a substantial impact on the odds. In addition, the effect of a large error in the estimation of the likelihood for one piece of evidence can be reduced by additional evidence with smaller errors. In other words, the average error can be smaller with more evidence. This analysis assumes that the evidence is conditionally independent and, therefore, P(e|E, H) = P(e|H). If, however, the new evidence is correlated with the previous evidence, the impact of that new evidence will be reduced. If the new evidence can be directly inferred from the previous evidence, P(e|E, H) = 1 and the probability of the hypothesis being true does not change.

In retrieval models, the hypothesis of relevance (R) is based on the observation of (or evidence about) document (D) contents and a specific query (Q). Estimating P(R|D, Q) could be viewed, then, as accumulating pieces of evidence from the representations of documents and queries, such as additional words or index terms. Accumulating more pieces of evidence should result in more accurate estimates of the probability of relevance, if the evidence is uncorrelated. As we will see, retrieval models often introduce intermediate concepts that make the relationship between observations and hypothesis less direct, but this simple model supports the basic intuition of evidence combination.

Similarly, combining the output of ranking algorithms or search systems can be modeled as a combination of classifiers, which has been shown to reduce classification error (Tumer and Ghosh, 1999). A search system can be viewed as a classifier for the classes relevant and nonrelevant. For a given document, the search system's output corresponds to a probability of that document belonging to the relevant class. In this framework, classification errors reduce retrieval effectiveness. Misclassifying a relevant document reduces recall and misclassifying a nonrelevant document reduces precision. The amount of error reduction from combination depends on the correlation of the classifier outputs, with uncorrelated systems achieving the maximum reduction. We will show that this model provides an explanation for many of the phenomena observed in combination experiments (e.g., Vogt and Cottrell, 1998), such as the increased probability of relevance for a document ranked highly by different systems. It also provides a simple prescription of the conditions for optimum combination.

Despite the popularity of the combination approach (sometimes called "fusion"), what is known about it is scattered among many papers covering different areas of IR. One of the main goals of this paper is to summarize the research
in this area. The summary will show how combination has been applied to many aspects of IR systems, and will discuss the successes and limitations of this research. The most obvious limitation is that there is no clear description of how representations, retrieval algorithms, and search systems should be combined for optimum effectiveness. By comparing and analyzing previous research using the terminology of classifier combination and inference nets, we hope to improve this situation. We will also describe how a new approach to probabilistic retrieval based on language models (Ponte and Croft, 1998; Miller et al., 1999; Berger and Lafferty, 1999) provides mechanisms for representing and combining sources of evidence for IR, and that this approach can be integrated with the inference net model to provide an improved framework for combination.

Although our focus in this paper is primarily on combination techniques for improving retrieval effectiveness, combination has been applied to a number of related tasks, such as filtering (Hull et al., 1996) and categorization (Lewis and Hayes, 1994; Larkey and Croft, 1996), and has been studied in other fields such as machine learning (Mitchell, 1997). We will refer to work in these areas in a number of sections of the paper.

In the following sections, we describe the research that has been done on combination applied to different levels of an IR system. Section 2 describes how different representations of text and documents can be generated and how they have been combined. Section 3 describes research related to combining different representations of information needs (queries). Section 4 describes how ranking algorithms can be combined and the results of that combination. Section 5 describes how the output from different search systems can be combined and the effectiveness of such combinations. In Section 6, we describe some of the retrieval models that have been proposed for combining all the evidence about relevance in a single framework. Finally, in Section 7, we describe the language model approach to retrieval and show how it can be used to support combination.

In discussions of retrieval effectiveness in this paper, we assume familiarity with the standard recall and precision measures used for evaluations of information retrieval techniques (Van Rijsbergen, 1979). Although specific performance improvements are discussed for some experiments, it is in general difficult to compare the results from multiple studies because of the variations in the baselines and test collections that are used. For example, a combination technique that produces a 20% improvement in average precision in one study may not yield any improvement in another study that uses a more effective search as the baseline. For this reason, we focus on general summaries of the research rather than detailed comparisons.
2 COMBINING REPRESENTATIONS

2.1 MANUAL AND AUTOMATIC INDEXING
The use of multiple representations of document content in a single search appears to have started with intermediaries searching bibliographic databases. These databases typically contain the titles, abstracts, and authors of scientific and technical documents, along with other bibliographic information and manually assigned index terms. Index terms are selected from a controlled vocabulary of terms by indexers based on their reading of the abstract. The query languages supported by typical bibliographic search systems allow the searcher to specify a Boolean combination of words, possibly restricted by location, and index terms as the retrieval criterion. An example of this would be (DRUGS in TI) and (DRUG near4 MEMORY) and (SIDE_EFFECTS_DRUG in DE)
This query is designed to find documents about the side effects of drugs related to memory. It specifies that the word “drugs” should be in the title of the document, and that the text of the document should contain the word “drug” within 4 words of the word “memory”, and that the controlled vocabulary term “side_effects_drug” has been used to index the document (i.e. it is present in the descriptor or DE field). Early studies showed the potential effectiveness of this search strategy. For example, Fisher and Elchesen, 1972, showed that searching title words in combination with index terms was better than searching either representation alone. Svenonius, 1986, in a review of research related to controlled vocabulary and text indexing, makes it clear that the two representations have long been thought of, and used, as complementary. A number of major studies, such as the Cranfield tests (Cleverdon, 1967), the SMART experiments (Salton, 1971), and the Cambridge experiments (Sparck Jones, 1974), also used multiple representations of documents but focused on establishing the relative effectiveness of each representation, rather than on the effectiveness of combinations of representations. The Cranfield tests considered 33 different representations and a number of these were combinations of simpler representations. The major classes of representations, however, were considered separately. Specifically, there were representations based on single words from the text of the documents (“free text”), representations based on controlled index terms, and representations based on “concepts”. Some single word representations were combinations of other single word representations, and similarly for index term representations, but there were no representations that were combinations of single words and index terms. The conclusion of the Cranfield study was that single word representations appeared to perform somewhat better than index term and concept representations, but no mention was made of combining them.
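As a rough illustration of how a system might evaluate such a query, consider the following sketch (illustrative Python, not code from any of the systems discussed; the tokenized-document format is an assumption made for the example):

    def term_positions(tokens, term):
        # Positions at which a term occurs in a tokenized field.
        return [i for i, t in enumerate(tokens) if t == term]

    def near(pos_a, pos_b, k):
        # True if some occurrence of one term is within k words of the other.
        return any(abs(a - b) <= k for a in pos_a for b in pos_b)

    def matches(doc):
        # Evaluate (DRUGS in TI) and (DRUG near4 MEMORY) and
        # (SIDE_EFFECTS_DRUG in DE) against a document represented
        # as lower-cased, tokenized fields: {"TI": [...], "TEXT": [...], "DE": [...]}.
        return ("drugs" in doc["TI"]
                and near(term_positions(doc["TEXT"], "drug"),
                         term_positions(doc["TEXT"], "memory"), 4)
                and "side_effects_drug" in doc["DE"])

A document either satisfies the whole Boolean expression or it does not; unlike the ranking models discussed below, no attempt is made to order the matching documents by how well they satisfy it.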
The first large-scale evaluation of combining representations was reported in Katzer et al., 1982. This study was based on a previous study (McGill et al., 1979) that found low overlap in the documents retrieved by different representations for the same queries. Katzer et al. considered representations based on free text and controlled vocabularies. They found that different representations retrieve quite different sets of documents in a Boolean search system. There was low overlap between the relevant document sets retrieved by the representations (about 28% on average) and even lower overlap between all the documents in the retrieved sets (about 13% on average). Despite this, there was little difference in retrieval effectiveness for the representations. In addition, documents with a high probability of relevance had the highest overlap in the retrieved sets and each representation retrieved some unique relevant documents. Using the same data, Rajashekar and Croft, 1995, showed that significant effectiveness improvements could be obtained by combining free text and controlled vocabulary indexing in a probabilistic retrieval system. Turtle, 1990, obtained a similar result with a different set of data. Fox, 1983, carried out combination experiments with controlled vocabulary using a retrieval system based on the vector space model and was also able to improve effectiveness. In each of the experiments using search systems based on ranking, the best results were obtained when the controlled vocabulary representation was treated as weaker evidence than the free text representations. The methods of weighting evidence and choosing weights are important aspects of the overall framework for combining evidence. These frameworks are discussed further in section 6.

Controlled vocabulary terms are only one of the alternative representations of documents that have been studied in combination experiments. Citations, passages, phrases, names and multimedia objects have all been considered as sources of evidence about relevance. We will describe each of these here. In order to use some of these representations in a retrieval system, extended representations of the information need (i.e. the query) will also be needed. For example, controlled vocabulary terms could be manually included in a query formulation, as shown in the example query above. Other techniques for extending the query with alternate representations are possible, such as relevance feedback (e.g., Salton et al., 1983), and these techniques will be discussed further in section 3. Query extension is not, however, required for all representations. Controlled vocabulary terms can be matched directly with free text queries, for example, and retrieval results can be altered by relationships, such as citations, between documents.
2.2 CITATIONS
Citations have long been recognized as an alternative document representation (Salton, 1968; Small, 1973). A number of studies have established that there
is low overlap in the documents found using citation representations compared to those found using word or index term representations (e.g., Pao and Worthen, 1989). Retrieval experiments that combine citations with other representations have established that significant effectiveness benefits can be obtained (Salton, 1974; Fox et al., 1988; Croft et al., 1989; Turtle, 1990). The best results in these studies, which used relatively small data sets, improved average precision by 5-10%. It was also consistently found that the evidence for relevance provided by citations was weaker than that provided by the word-based representation. The citation approach was extended to include hypertext links (Frisse and Cousins, 1989; Croft and Turtle, 1989) and is now being used as the basis for some Web search engines (e.g., Google2). Detailed evaluations of representations based on Web "citations" are not available, but qualitatively the techniques appear to scale well to these very large databases.

2 http://www.google.com
2.3 PASSAGES
The basic premise behind passage retrieval is that some parts (or passages) of a document may be more relevant to a query than other parts. By representing a document as a collection of passages rather than a monolithic block of text, more accurate retrieval may be possible (O'Connor, 1975; O'Connor, 1980). A number of definitions of document passages are possible (Callan, 1994). Discourse passages are based on textual discourse units such as sentences, paragraphs and sections (e.g., Salton et al., 1993; Wilkinson, 1994). Semantic passages are based on similarities of the subject or content of the text (e.g., Hearst and Plaunt, 1993; Mittendorf and Schäuble, 1994). Window passages are based upon a fixed number of words (Callan, 1994).

Almost all of the research related to passages involves retrieval experiments where passage-level representations are combined with global (whole document) representations. For example, Salton et al., 1993, refine rankings based on global similarity using sentence, paragraph and section similarities. Mittendorf and Schäuble, 1994, combine a probabilistic model of relevant text passages with a model of text in general. Callan, 1994, combines global document evidence and window-based passage evidence in a probabilistic framework. Recent research by Kaszkiel and Zobel, 1997, confirms Callan's earlier result that fixed-length, overlapping passages produce the best effectiveness, and that the best window size is between 150-250 words. More than 20% improvements in average precision were obtained in some experiments.
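The window-based approach can be sketched as follows (illustrative Python; the 200-word windows, 100-word overlap, 0.5 mixing weight, and the deliberately simple scoring function are assumptions for the example, not parameters from the studies cited):

    def window_passages(tokens, size=200, overlap=100):
        # Fixed-length, overlapping windows over the document's tokens.
        step = size - overlap
        return [tokens[i:i + size]
                for i in range(0, max(len(tokens) - overlap, 1), step)]

    def overlap_score(tokens, query_terms):
        # Placeholder scoring function: count of query-term occurrences.
        terms = set(query_terms)
        return sum(1 for t in tokens if t in terms)

    def score_with_passages(tokens, query_terms, alpha=0.5):
        # Mix global document evidence with the best passage-level evidence.
        global_score = overlap_score(tokens, query_terms)
        best_passage = max(overlap_score(p, query_terms)
                           for p in window_passages(tokens))
        return alpha * global_score + (1.0 - alpha) * best_passage

In a real system the placeholder scoring function would be replaced by a proper term-weighting scheme, but the structure is the point of the sketch: score the whole document, score each window, and combine the two kinds of evidence.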
2.4 PHRASES AND PROPER NOUNS
Simple noun phrases are an extremely important part of the searcher's vocabulary. Many of the queries submitted to current Web search engines consist of 2-3 words (about 50% in Jansen et al., 1998), and many of those queries are phrases. Phrase representations of documents were used in the earliest bibliographic systems and were evaluated in early studies (Cleverdon, 1967). Salton and Lesk, 1968, reported retrieval experiments that incorporated statistical phrases based on word co-occurrence into the document representation as additional index terms. Fagan, 1989, also studied statistical phrases but ranked documents by a weighted average of the scores from a word representation and a phrase representation. In both of these studies, the effectiveness improvements obtained were mostly small but varied considerably depending on the document collections being used (from -1.8% to 20% improvement in average precision). Fagan's experiments with syntactic methods of recognizing phrases were less successful (Fagan, 1987). Croft et al., 1991, and Callan et al., 1995a, introduced a phrase model that explicitly represents phrasal or proximity representations as additional evidence in a probabilistic framework. This approach yielded results that were somewhat more effective than simple statistical phrases, but not consistently. Bartell et al., 1994, also demonstrated improvements from combining word-based and phrase-based searches.

The so-called named entities found using information extraction (MUC-6, 1995) have also been treated as additional index terms. These are special classes of proper nouns mentioned in document text such as people, companies, organizations, and locations. Callan and Croft, 1993, described how these entities could be incorporated into the retrieval process. For the queries used in this study, the impact on effectiveness was not significant.
2.5 MULTIMEDIA
Documents, in general, can be complex, multimedia objects. We have described some of the representations that can be derived from the text and links associated with the document. Other media such as speech, images, and video may also be used in queries and documents, and should be considered alternative representations. A number of people have described how multimedia objects can be retrieved using associated text, such as captions, surrounding text, or linked text (e.g., Croft and Turtle, 1992; Harmandas et al., 1997). This is the primary representation used by image searching engines on the Web. Increasingly, however, image and video search systems are making use of image processing techniques that help to categorize pictures (e.g., Frankel et al., 1996) or compare images directly (Flickner et al., 1995; Ravela and Manmatha, 1997). These image-based techniques typically use very different data structures and algorithms compared to text-based techniques. As a result, combining the evidence about relevance from text and image representations (and potentially other representations) involves combining the rankings from multiple subsystems. This has been a concern of both the IR community (e.g., Croft et al., 1990; Fuhr, 1990; Callan et al., 1995b) and the multimedia database community (e.g., Fagin, 1996), and will be discussed further in sections 4 through 6.
3 COMBINING QUERIES
We have described various representations of documents that could be used as evidence for relevance. Experiments with combinations of these representations show that, in general, using more than one representation improves retrieval effectiveness. They also show that when one source of evidence is weaker (less predictive of relevance) than the others, this must be reflected in the process of accumulating evidence or effectiveness will suffer. These observations are consistent with the simple probabilistic framework mentioned in the first section.

Estimating relevance, however, involves more than document representations. Queries, which are representations of the searcher's information need, are an important part of the process of calculating P(R|D, Q). Each additional piece of evidence that the query contains about the true information need can make a substantial difference to the retrieval effectiveness. This has long been recognized and is the basis of techniques such as relevance feedback, where user judgments of relevance from an initial ranked list are used to modify the initial query (Salton and McGill, 1983), and query expansion, which involves the automatic addition of new terms to the query (Xu and Croft, 1996; Mitra et al., 1998).

Relevance feedback and query expansion can also be viewed as techniques for creating alternative representations of the information need. Traditional query formulation tools, such as the thesaurus, can be viewed the same way. Even in the earliest retrieval experiments with a thesaurus (Salton and Lesk, 1968) and automatic query expansion using term clustering (Sparck Jones, 1971), the thesaurus classes or term clusters were treated as alternative representations that were combined with word-based queries. As mentioned in the last section, in order to make use of some alternative document representations, the query must at least be partially described using the same representations. Salton et al., 1983, used relevance feedback to add citations to the initial query, and consequently was able to use citations in the document representations to improve the ranking. Crouch et al., 1990, also used feedback to add controlled vocabulary terms to the query. Xu and Croft, 1996, used an automatic query expansion technique to construct a phrase-based representation of the query that was combined with the initial word-based representation using weighted averaging. Callan et al., 1995a, describe
a number of other strategies for automatic construction of alternative query representations.

The idea that there are alternative queries only makes sense with the assumption that there is an underlying information need associated with the searcher. A given query is a noisy and incomplete representation of that information need. By constructing multiple queries, we are able to capture more pieces of evidence about relevance. The best source of information about the information need is, of course, the searcher. A number of studies have looked at or observed the effect of capturing multiple queries from a single searcher or from multiple searchers given the same specification of an information need. McGill et al., 1979, carried out a study of factors affecting ranking algorithms, and noticed that there was surprisingly little overlap between the documents retrieved by different search intermediaries (people who are experts in the use of a particular search system) when they were assigned the same information need as a starting point. Saracevic and Kantor, 1988, also found that when different intermediaries constructed Boolean search formulations based on the same descriptions of the information need, there was little overlap in the retrieved sets. In addition, they observed that the odds of a document being judged relevant were proportional to the number of times it was in a retrieved set.

Based on these studies, Turtle and Croft, 1991, proposed a retrieval model that explicitly incorporated the notion of multiple representations of the information need. They report the results of experiments that combined word-based and Boolean queries to improve retrieval effectiveness. Rajashekar and Croft, 1995, extended this work by combining word-based queries with two other queries based on different types of manual indexing. Combining pairs of query representations produced consistent performance improvements, and a weighted combination of all three achieved the best retrieval effectiveness. Belkin et al., 1993, carried out a more systematic study of the effect of query combination in the same probabilistic framework. They verified that retrieval effectiveness could be substantially improved by query combination, but that the effectiveness of the combination depends on the effectiveness of the individual queries. In other words, queries that provided less evidence about relevance had to have lower weights in the combination to improve performance. Bad query representations could, in fact, reduce effectiveness when combined with better representations. In a subsequent, larger study, Belkin et al., 1995, obtained essentially the same results and compared query combination to the strategy of combining the output of different systems (which they called data fusion). The latter strategy is discussed in the next two sections.
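A minimal sketch of this kind of query combination follows (illustrative Python; the runs, weights, and document identifiers are invented for the example, and the lower weight on the second run reflects the finding above that weaker representations should count for less):

    def combine_query_runs(runs, weights):
        # Weighted sum of document scores across several query representations.
        combined = {}
        for run, weight in zip(runs, weights):
            for doc_id, score in run.items():
                combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
        return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

    # Invented scores for a word-based query and a weaker controlled-vocabulary query.
    word_run = {"d1": 0.9, "d2": 0.4, "d3": 0.2}
    controlled_run = {"d2": 0.8, "d3": 0.5}
    print(combine_query_runs([word_run, controlled_run], weights=[0.7, 0.3]))
    # d1 ranks first, then d2, then d3 (scores about 0.63, 0.52, 0.29).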
4 COMBINING RANKING ALGORITHMS
The next two sections discuss techniques for combining the output of several ranking algorithms. The ranking algorithms can be implemented within the same general framework, such as probabilistic retrieval or the vector space approach, or they can be implemented in different frameworks. The ranking algorithms can also be operating on the same databases, overlapping databases, or totally disjoint databases. In this section, we focus on the combination of the output of ranking algorithms implemented in the same framework and operating on the same data.

Croft and Harper, 1979, noted that a cluster-based ranking algorithm retrieved different relevant documents than a word-based probabilistic algorithm, even though their average performance was very similar. They proposed using clustering as an alternate search strategy when word-based ranking failed. Similar observations were made about the performance of other ranking algorithms, particularly in the TREC evaluations (Harman, 1995). The approach of providing alternative ranking algorithms was incorporated into the design of some experimental retrieval systems, most notably I3R (Croft and Thompson, 1987) and CODER (Fox and France, 1987). Attempts were made to select the best algorithm for a given query using an adaptive network (Croft and Thompson, 1984), and combine the results of multiple ranking algorithms using a plausible inference network (Croft et al., 1989). Turtle and Croft, 1991, showed how a nearest neighbor cluster search could be described and combined with a word-based search in a probabilistic framework.

As mentioned in section 1, combining the output of ranking algorithms can be modeled as combining the output of multiple classifiers (Tumer and Ghosh, 1999). A ranking algorithm defines a classifier for each query, where the classes are associated with relevance and non-relevance (Van Rijsbergen, 1979). These classifiers can be trained using relevance feedback, but typically the only information that is available about the relevant class comes from the query. In fact, the approach of combining multiple representations of the information need, discussed in the last section, is more properly viewed as constructing multiple classifiers (one for each query representation) and combining their output. From this point of view, experiments such as Rajashekar and Croft, 1995, and Belkin et al., 1995, can be viewed as validation of the effectiveness of combining multiple classifiers for IR.

Combining classifiers has been extensively studied in neural network research and the machine learning area in general. Tumer and Ghosh, 1999, provide a good overview of the literature in this area. They point out that given limited training data and a large, noisy "pattern space" (possible document descriptions), variations in weighting, initialization conditions, and the internal structure of the classifier produce different outputs. This is exactly what IR
researchers have observed, even to the extent that variations in the "tf.idf" weighting functions3 can retrieve substantially different documents (Lee, 1995). Tumer and Ghosh observe that simply averaging the output of the classifiers is the most common combining strategy, although more complex strategies such as learning appropriate weighted averages have been evaluated. They analyze the averaging strategy and show that for unbiased, independent classifiers, the "added error" above the Bayes error will be reduced by a factor of N for N classifiers. They model the output of a classifier for a given input (a document) as a combination of the probability distribution for each class and a noise distribution (the error). Reducing the error corresponds to reducing the variance of the noise. Since classification errors correspond to not retrieving relevant documents or retrieving non-relevant documents, reducing this error will improve retrieval effectiveness (Van Rijsbergen, 1979). Tumer and Ghosh also mention that the simple combining strategies are best suited for situations where the classifiers all perform the same task (which is the case for IR), and have comparable success. Simple combination strategies can fail when even one of the classifiers being combined has very poor performance or very uneven performance.

There is some evidence that simple combination strategies such as summing, averaging or weighted averaging may be adequate for IR. For example, most of the experiments described in sections 2 and 3 used these strategies. Weighted averaging was required in cases where one of the classifiers was based on a poor document description (controlled vocabulary terms). Bartell et al., 1994, describe an approach to learning a weighted, linear combination of classifiers for IR. Fox and Shaw, 1994, conducted an evaluation of combination strategies using different retrieval algorithms in a vector space model. They found the best combination strategy consisted of summing the outputs of the retrieval algorithms, which is equivalent to averaging in terms of the final ranking. Hull et al., 1996, compared simple and complex combinations of classifiers for the document filtering problem, which has substantially more training data than is the case for IR. They found that the best improvement in performance came from the simple averaging strategy.

IR classifiers will often not be independent, since they typically use the same document and query representations. Some of the best results in terms of improving retrieval effectiveness have come from combining classifiers based on very different representations, such as the citation experiments described in section 2. Combining classifiers that are very similar, such as those based on minor differences in the "tf.idf" weights, usually does not improve
performance. Lee, 1995, conducted an extensive study of combining retrieval output based on different weighting schemes in the vector space model. He combined these outputs by averaging the normalized scores. His experiments showed that combining classifiers ("retrieval runs" in his paper) based on similar weighting schemes had little impact on performance. Combining classifiers based on substantially different weighting schemes, however, produced significant improvements. Specifically, he found that combining rankings based on cosine normalization of the tf.idf weight with rankings based on other normalization schemes was effective (about 15% improvement in average precision). Hull et al., 1996, found that the gains in performance from combination were limited by the correlation between the classifiers they used. The correlation was caused primarily by using the same training data and was strongest for classifiers that used the same document representations.

The discussion of classifier combination in Tumer and Ghosh assumes that the classifiers have comparable output in that they are trying to make the same decision within the same framework. For probabilistic systems, this means they are all attempting to estimate P(R|D, Q) and we can combine these estimates using simple strategies. The lack of knowledge of prior probabilities and the lack of training data, however, make the accurate estimation of these probabilities difficult and can make the outputs of the classifiers less compatible. Combining a cluster-based retrieval algorithm with a word-based algorithm, for example, can be quite difficult because the numbers produced by these algorithms for ranking may have little relationship to the probabilities of relevance. With sufficient training data, the relationship between these numbers and the probabilities can be learned but this data is usually not available. Incompatibility of classifier output also occurs in the vector space model, as discussed in Lee, 1995. In that paper, the scores produced by different retrieval runs were normalized by the maximum scores for each run in order to improve compatibility. This problem is particularly acute for an approach that combines the output of completely different search systems, with little idea of how the numbers output by those systems are calculated. This situation is discussed in the next section.

3 The "tf.idf" weight is a combination of weights derived from within-document term frequency (tf) and the inverse of the number of documents in the database that contain the term (idf). There are many variations of this weight discussed in the literature (Salton and Buckley, 1988).
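The factor-of-N error reduction for unbiased, independent classifiers discussed above is easy to see in a small simulation (a sketch under simplifying assumptions: Gaussian zero-mean noise and invented parameter values, not a model taken from Tumer and Ghosh):

    import random

    def noisy_output(true_prob, sigma=0.2):
        # One classifier's output: the true probability plus zero-mean noise.
        return true_prob + random.gauss(0.0, sigma)

    def mean_squared_error(n_classifiers, trials=10000, true_prob=0.7):
        # Error of the averaged output of n independent, unbiased classifiers.
        total = 0.0
        for _ in range(trials):
            avg = sum(noisy_output(true_prob)
                      for _ in range(n_classifiers)) / n_classifiers
            total += (avg - true_prob) ** 2
        return total / trials

    for n in (1, 2, 4, 8):
        print(n, round(mean_squared_error(n), 4))
    # The error shrinks roughly as 0.04, 0.02, 0.01, 0.005: a factor of N.

The simulation also makes the caveats visible: if the per-classifier noise terms were correlated, or if one classifier were strongly biased, the averaged error would no longer fall by the full factor of N.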
5 COMBINING SEARCH SYSTEMS
The idea of combining the output of different search systems was introduced during the DARPA TIPSTER project (Harman, 1992) and the associated TREC evaluations (Harman, 1995). These evaluations involve many search systems running the same queries on the same, large databases. The results of these searches are made available for research and a number of studies have been done of combination strategies. Belkin et al., 1995, combined the results of searches from a probabilistic system and a vector space system and showed performance improvements. Lee, 1997, combined the results from six selected retrieval
systems from a TREC evaluation. He investigated the combination strategies used in Fox and Shaw, 1994, and Lee, 1995. Scores from the different retrieval systems were normalized using the maximum and minimum scores according to the formula:

normalized_score = (unnormalized_score - min_score) / (max_score - min_score)

Lee's results showed that the most effective combinations (up to approximately 30% improvement relative to a single search) were between systems that retrieve similar sets of relevant documents, but different sets of nonrelevant documents. This is related to the observation in Lee, 1995, that retrieval algorithms with low overlap in the retrieved sets will, given similar overall performance, produce the best results in combination. Vogt and Cottrell, 1998, in a study of the factors that predict good combination performance, looked at pairwise combinations of all systems (61 of them) from another TREC evaluation. They were able to verify Lee's observation that the best combinations were between systems that retrieve similar sets of relevant documents and dissimilar sets of nonrelevant documents.

These results can be simply explained in terms of uncorrelated classifiers. The sets of documents that are being compared in these studies are the top 1000 documents retrieved by each system for each query. The number of relevant documents for a given query is typically not large (100-200). We would expect, therefore, that many of the relevant documents are retrieved by most systems. This is in fact shown by Lee's analysis of the correlation between the relevant retrieved document sets. Since there are large numbers of nonrelevant documents, we would expect that uncorrelated classifiers (search systems) would retrieve different sets of nonrelevant documents and this is what was observed. We would also expect that uncorrelated systems would produce different rankings of the relevant documents, even when the overlap in the sets of retrieved relevant documents is high. Vogt and Cottrell observed this difference in rankings for good combinations. They also observed, as did Lee, that the best combinations occur when both systems being combined have good performance, although it is possible to get improvement when only one of the systems has good performance. All of these observations are consistent with the statement that the combination with the lowest error occurs when the classifiers are independent and accurate.

Lee, 1997, presented two other results related to combination strategies for different search systems. The first of these results was that combining the outputs of search systems using the ranks rather than the scores of the documents was, in general, not as effective. The exception to that was when the search systems had very different characteristics in terms of the shape of the score-rank curve. This can be interpreted as evidence that the normalized score is usually a better estimator for the probability of relevance than the rank. Using the ranks
is a more drastic form of smoothing that appears to increase error except when the systems being combined have very different scoring characteristics. The second of Lee's results was that the best combination strategy was to sum the normalized scores and then multiply by the number of nonzero scores in the combination. This was better than simply summing the scores by a small but consistent margin. This form of combination heavily favors the documents retrieved by more than one system. A zero score for a document means that it was not retrieved in the top 1000 for that system rather than being a true estimate of the probability of relevance. Given that, the combined estimate of the probability for such a document is likely to have a much higher error than the estimates for documents which have only non-zero scores. A combination strategy that favors these documents could be interpreted, then, as favoring estimates with lower error.

The experiments mentioned previously combined the outputs of multiple search systems using the same database. The outputs of search systems using overlapping or disjoint databases could also be combined. This type of combination has been called collection fusion, distributed IR, or meta-search. Voorhees et al., 1995, report experiments on techniques for learning weights to associate with each of the systems in the combination. The weights are used with the ranked document sets to determine how the documents will be mixed for the final ranking. This approach has some similarity to the rank combination strategies used by Lee, 1997. Callan et al., 1995b, show that using scores weighted by an estimate of the value of the database for the query is substantially better than interleaving ranks. Both Voorhees and Callan used disjoint databases for their experiments. In many practical environments, such as metasearch on the web, the databases used by the search systems will be overlapping. This will result in a situation similar to that described by Lee, 1997, where documents would have a varying number of scores associated with them. Although there are no thorough evaluations of the combination of web search results, it appears that Lee's results may apply in this situation. This means that the best combination strategy may be to normalize the scores from each search engine, sum the normalized scores for each document, and multiply the sum by the number of search engines that returned that document (at a given cutoff). If the overlap between the databases used by the search engines is low (i.e. there are substantial differences in the amount of the web indexed), the last step would be less effective.

The situation of combining the outputs of multiple search systems also applies to multimedia retrieval (Croft et al., 1990; Fagin, 1996; Fagin, 1998). In this case, we are typically combining the output of a text search and the output of one or more ranking algorithms that compare image features such as color distributions or texture. The experimental results discussed above suggest that the scores from these image and text retrieval algorithms should be
combined by normalizing and then summing, potentially taking into account the number of non-zero scores. There is, unfortunately, no current evidence that this is a good choice other than the theoretical argument about classifier combination. Fagin, 1996, develops an algorithm for combining scores in a multimedia database using the standard operators of fuzzy logic, namely min and max. Lee’s experiments do provide evidence that these combination operators perform significantly worse than summing. Ciaccia et al., 1998, also discuss ranking in a multimedia database environment. They are concerned primarily with the efficiency of combination, as is Fagin, and present performance results for a range of combination operators.
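To make the normalization and combination strategies in this section concrete, the following sketch implements the min-max normalization given above together with summing of normalized scores and Lee's variant that multiplies the sum by the number of systems retrieving the document (the strategies named CombSUM and CombMNZ in Fox and Shaw, 1994). The document identifiers and scores are invented, and the code is an illustration of the strategies rather than an implementation from any of the cited systems.

```python
from collections import defaultdict

def min_max_normalize(run):
    """Map raw scores to [0, 1] using the min and max score in the run."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:  # degenerate run: all scores identical
        return {doc: 0.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}

def combine(runs, multiply_by_support=True):
    """Sum normalized scores per document; optionally multiply the sum by
    the number of systems that retrieved the document (CombMNZ vs. CombSUM)."""
    totals = defaultdict(float)
    support = defaultdict(int)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            totals[doc] += score
            support[doc] += 1
    if multiply_by_support:
        return {doc: totals[doc] * support[doc] for doc in totals}
    return dict(totals)

# Hypothetical output of two search systems: document id -> raw score.
run_a = {"d1": 14.2, "d2": 11.0, "d3": 2.5}
run_b = {"d2": 0.92, "d4": 0.55, "d1": 0.13}

# d2 is retrieved by both systems and is favored by the CombMNZ strategy.
for doc, score in sorted(combine([run_a, run_b]).items(),
                         key=lambda item: item[1], reverse=True):
    print(doc, round(score, 3))
```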
6
COMBINING BELIEF
The previous sections have described the results of many different experiments with combining evidence to improve retrieval effectiveness. We have described these experiments in terms of combining evidence about relevance in a single classifier and then combining the outputs of multiple classifiers. Figure 1.1 (derived from Tumer and Ghosh, 1999) shows this overall conceptual view of combination. In this view, multiple representations (called feature sets in the classification literature) are constructed from the raw data in the documents. Both retrieval algorithms and search systems are regarded as classifiers that make use of these representations to calculate the probability of relevance of the documents. Some algorithms and systems combine representations in order to reduce the error of that calculation. The output of the retrieval algorithms or search systems can then be combined to further reduce the error and improve retrieval effectiveness. This simple framework can be used to explain some of the basic results obtained with combination experiments, such as the increased probability of relevance for documents retrieved multiple times by alternative representations and search systems, and the low overlap between the outputs of the best combined systems. According to this view of combination, there are only two requirements for minimizing the classification error and obtaining the best retrieval performance. The first of these is that each individual classifier (retrieval algorithm or system) should be as accurate as possible. This means that each classifier should produce probabilities of relevance with low error. The second requirement is that the classifiers that are combined should be uncorrelated. This means that we do not want to combine classifiers that repeatedly produce the same or similar rankings for documents, regardless of whether those rankings are accurate or inaccurate. Classifiers that use different representations and retrieval algorithms are more likely to be independent. A number of other frameworks have been proposed in the IR literature for providing a formal basis for the combination processes described in Figure 1.1.
Figure 1.1 Combining strategies for retrieval
Some frameworks address the combination of representations for retrieval algorithms, some the combination of retrieval algorithms, and others the combination of search system output. We will discuss the frameworks using these categories.
6.1
FRAMEWORKS FOR COMBINING REPRESENTATIONS
The vector space model has been used as the basis for a number of combination experiments. In this model, documents and queries are characterized by vectors of weighted terms. Fox et al., 1988, proposed using subvectors to describe different "concept types" or representations derived from documents. An overall similarity between a document and a query, which is used to rank the documents, is computed as a linear combination of the similarities for each subvector. For example, if documents were represented using words, authors, and citations, the similarity function would be:

sim(Q, D) = c_word · sim(Q_word, D_word) + c_author · sim(Q_author, D_author) + c_cite · sim(Q_cite, D_cite)
where the c_i values are coefficients. This type of weighted linear combination is identical to that used in the experiments described in previous sections for combining the output of classifiers or search systems. This shows that there is little difference in this framework between combining representations and combining search output.

Fox et al., 1988, used the similarity function to predict relevance and then performed a regression analysis with test collection data to determine the values of the coefficients. Fuhr and Buckley, 1991, also used regression to learn effective combinations of document representations. This work was based on a probabilistic model that estimates P(R|x(t, d)), which is the probability of relevance given a "relevance description" x of the term and document characteristics, instead of P(R|t, d). By using representations based on these characteristics rather than directly on words, more training data is available to estimate the probabilities in the model. They used a least-squared error criterion to compute coefficients of polynomial combinations of the term and document characteristics. Their results showed that a linear combination produced the best overall performance. The characteristics that were used in the relevance description included within-document frequency of terms, the maximum frequency of a term in a document, the number of documents in which a term occurs, the number of documents in the collection, the number of terms in a document, and whether a term occurs in the title or the body of a document.

Gey, 1994, developed a logistic inference model that used logistic regression, which is generally considered more appropriate for estimating probabilities, to compute the coefficients of a formula for the log of the odds of relevance given the presence of a term. The formula was a linear combination of term characteristics in the documents and the queries.

Greiff, 1998, described a probabilistic model developed using exploratory data analysis, which involves looking at large amounts of data about terms, documents and relevance to discover relationships. This approach has some similarity to the regression models described above, but does not make assumptions about the underlying distributions. In Greiff, 1999, he extends his approach to incorporate multiple sources of evidence (representations) based on the Maximum Entropy Principle, which is a way of determining an appropriate probabilistic model given known constraints. The model that results from this approach scores documents using a linear combination of a within-document frequency (tf) component and an inverse document frequency (idf) component for each term that matches the query. Although this is a relatively simple model in terms of the number of representations being combined, the framework he develops is sufficiently general to incorporate any of the representations mentioned previously. Incorporating a new representation involves studying retrieval data involving that representation, and using regression to determine coefficients of a formula that predicts relevance given the evidence provided by the new
representation, conditioned on the evidence provided by the existing representations. This new formula can then be simply added to the linear combination of formulas for other representations.

The common characteristic for all frameworks that use training data and regression, of which Greiff's can be viewed as the most general, is that a retrieval algorithm or search system based on them will produce estimates of probabilities of relevance instead of just normalized similarity values. Other probabilistic systems, such as INQUERY (Callan et al., 1995a), assume that because the goal of the system is to rank documents, the parts of a probabilistic formula that are constant for a given query (such as prior probabilities) can be ignored. In addition, ad-hoc (but effective) formulas are used to calculate parts of the document scores. This means that the numbers produced by these systems are not probabilities. This is a significant disadvantage when it comes to combining the output of a system with the output of other systems. Systems with compatible outputs, and accurate probability estimates, will produce the best combinations assuming they are independent.
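As a rough illustration of these regression-based frameworks, the sketch below fits a logistic regression that predicts relevance from a few term/document characteristics, in the spirit of Gey's logistic inference model, using scikit-learn's LogisticRegression. The three features and the tiny training set are invented for the example; the relevance descriptions used in the cited work were richer than this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: each row holds characteristics of one
# query/document pair (say, summed log tf, summed log idf, and a
# document length measure); y records the relevance judgment.
X = np.array([
    [1.6, 3.2, 0.8],
    [0.0, 1.1, 1.3],
    [2.1, 4.0, 0.9],
    [0.7, 0.5, 1.1],
    [1.9, 3.7, 1.0],
    [0.0, 0.8, 0.7],
])
y = np.array([1, 0, 1, 0, 1, 0])

# Logistic regression models the log-odds of relevance as a linear
# combination of the characteristics.
model = LogisticRegression().fit(X, y)

# The fitted model produces a probability of relevance for a new pair,
# which is the kind of output that combines well with other systems.
new_pair = np.array([[1.2, 2.9, 1.0]])
print("P(relevant | evidence) =", model.predict_proba(new_pair)[0, 1])
```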
6.2
FRAMEWORKS FOR COMBINING RETRIEVAL ALGORITHMS
The inference network framework, developed by Turtle and Croft (Turtle and Croft, 1991; Turtle and Croft, 1992) and implemented as the INQUERY system (Callan et al., 1995a), was explicitly designed for combining multiple representations and retrieval algorithms into an overall estimate of the probability of relevance. This framework uses a Bayesian network (Pearl, 1988) to represent the propositions and dependencies in the probabilistic model (Figure 1.2). The network is divided into two parts: the document network and the query network. The nodes in the document network represent propositions about the observation of documents (D nodes), the contents of documents (T nodes), and representations of the contents (K nodes). Nodes in the query network represent propositions about the representations of queries (K nodes and Q nodes) and satisfaction of the information need (I node). This network model corresponds closely to the overview of combining classifiers in Figure 1.1. The parts of the network that model the raw data in documents, the features extracted from that data, the classifiers that use the features to predict relevance, and the overall combiner for the classifier outputs are labeled in Figure 1.2. In this model, all nodes represent propositions that are binary variables with the values true or false, and the probability of these states for a node is determined by the states of the parent nodes. For node A, the probability that A is true is given by:
Figure 1.2 Bayesian net model of information retrieval
P(A = true) = Σ_S α_S ( ∏_{i∈S} p_i ) ( ∏_{i∉S} (1 − p_i) )

where α_S is a coefficient associated with a particular subset S of the n parent nodes having the state true, and p_i is the probability of parent i having the state true. Some coefficient settings result in very simple but effective combinations of the evidence from parent nodes. For example, if α_S = 0 unless all parents have the state true, this corresponds to a Boolean and. In this case, P(A = true) = ∏_{i=1}^{n} p_i. The most commonly used combination formulas in this framework are the average and the weighted average of the parent probabilities. These formulas are the same as those shown in other research to be the best combination strategies for classifiers and discussed earlier in the paper. The combination formula based on the average of the parent probabilities comes from a coefficient setting where the probability of A being true depends only on the number of parent nodes having the state true. The weighted average comes from a setting where the
probability of A depends on the specific parents that are true. Parents with higher weight have more influence on the state of A. The INQUERY search system provides a number of these "canonical" combination formulas as query operators. The three described above are #and, #sum, and #wsum.

In the INQUERY system, different document representations are combined by constructing nodes corresponding to propositions about each representation (i.e. is this document represented by a particular term from a representation vocabulary) and constructing queries using those representation nodes. The queries for each representation are combined using operators such as #wsum (Rajashekar and Croft, 1995). For example, there may be nodes corresponding to word-based terms and nodes corresponding to controlled vocabulary terms. These nodes are connected to documents by the probabilities that those terms represent the contents of the documents. Query operators are then used to construct a query based on words (qword) and a query based on controlled vocabulary (qcontrol). These queries may be complex combinations of the evidence in those terms. Each of the queries is, in fact, a classifier that could produce an individual ranking of the documents. The two representations are combined by combining the query nodes. If the searcher wanted to weight the word-based representation twice as much as the controlled vocabulary representation, the final INQUERY query would be #wsum(2.0 qword 1.0 qcontrol). This example shows that the inference net framework can be used for most of the combination processes in Figure 1.1. Individual classifiers are built using representation nodes and combination operators, and the output of those classifiers is combined using other operators.

It is possible to represent different retrieval algorithms in the inference net framework by using different combinations of representation nodes (i.e. new operators), new types of representation nodes, and different techniques for computing initial probabilities for representation nodes. The probabilities associated with the query node propositions are computed from the probabilities associated with representation nodes. The probabilities associated with representation nodes, however, can be computed from evidence in the raw data of the documents. For example, a tf.idf formula is used in INQUERY to compute the probability of a word-based representation node for a particular document. Turtle and Croft, 1991, describe how retrieval based on document clustering and hypertext links can be incorporated into the inference net framework by changing the probability estimates for nodes representing linked documents. The advantage of the inference net is that it provides a probabilistic framework for rapidly constructing new classifiers based on different representations and retrieval algorithms and combining their output. Not every retrieval algorithm can be represented in this framework, however. Greiff et al., 1997, describe the class of combination operators that can be computed easily and how these can be used to model a well-known vector space ranking algorithm.
Although this effort was successful in achieving comparable or better effectiveness, it shows the difficulty of modeling even relatively simple retrieval algorithms that are not based on a probabilistic approach. A complex retrieval algorithm based on, for example, a neural net architecture, could not be modeled in an inference net. Another problem with the inference net is that the outputs of the classifiers do not correspond well to real probabilities. This is due to the heuristic estimation formulas used (such as tf.idf), the lack of knowledge of prior probabilities, and the lack of training data. Haines and Croft, 1993, describe how relevance feedback can be used to modify the query network to produce more effective rankings, although their approach does not improve the correspondence between document scores and probabilities. Haines, 1996, discusses how the structure of the inference net could be changed to better accommodate learning, but his approach was difficult to implement in the INQUERY system. A similar comment applies to the general techniques for learning with Bayesian networks (Heckerman et al., 1994).
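The canonical combination formulas described in this section are simple to state in code. The sketch below computes #and, #sum, and #wsum over a set of parent beliefs, following the descriptions above; it is a simplified illustration with invented belief values, not INQUERY's implementation.

```python
import math

def op_and(beliefs):
    """#and: product of the parent probabilities (Boolean conjunction)."""
    return math.prod(beliefs)

def op_sum(beliefs):
    """#sum: average of the parent probabilities."""
    return sum(beliefs) / len(beliefs)

def op_wsum(weights, beliefs):
    """#wsum: weighted average of the parent probabilities."""
    return sum(w * b for w, b in zip(weights, beliefs)) / sum(weights)

# Invented beliefs produced by a word-based query and a controlled
# vocabulary query for one document.
q_word, q_control = 0.62, 0.35

# #wsum(2.0 qword 1.0 qcontrol): word evidence weighted twice as heavily.
print(op_wsum([2.0, 1.0], [q_word, q_control]))  # 0.53
print(op_sum([q_word, q_control]))               # 0.485
print(op_and([q_word, q_control]))               # 0.217
```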
6.3
FRAMEWORKS FOR COMBINING SEARCH SYSTEM OUTPUT
The strategies for combining the output of search systems described in Lee, 1997, are implemented in a simple, heuristic framework. The outputs of the systems are treated as similarity values that are normalized and combined using one of a variety of possible strategies. The specific normalization and combination strategies are selected based solely on empirical evidence, although some of the combination strategies, such as min and max, can be justified with formal arguments. Hull et al., 1996, also experimented with various combination strategies, but in the context of a framework of combining classifier output. They derive a formula for the combination of the output of n classifiers C1, . . . , Cn that are conditionally independent given relevance (R) and non-relevance:
log O(R | C1, . . . , Cn) = Σ_i log O(R | Ci) − (n − 1) log O(R)

where O(·) denotes odds. This formula is then used as justification for the strategy of summing the log-odds numbers derived from the classifier output. There are two problems with this framework. The first is that the conditional independence assumption is not warranted. The second is that P(R|Ci) is not the same as the output of the classifier. The explanation for this is best done in a Bayesian network framework where the Ci are nodes representing the decisions of the document classifiers and R is a node representing the combined decision. The structure of this network is shown in Figure 1.3. Each classifier can be viewed as voting
on the overall decision of the combined network. The output of classifier Ci for a given document is the probability associated with the state Ci = true. For this network,
P(R = true) = Σ_{C1, . . . , Cn} P(R = true | C1, . . . , Cn) P(C1, . . . , Cn)

where C1, . . . , Cn represents a particular configuration of nodes with values true and false, and the summation is over all possible configurations. In the Bayesian network framework, the evidence supporting each classifier's decision is assumed to be independent and

P(C1, . . . , Cn) = ∏_i P(Ci)
This is simply a reformulation of the function for node probability given in section 6.2. As mentioned previously, there are a number of possibilities for calculating P(R|C1, . . . , Cn). If the probability of voting for relevance overall depends on the number of classifiers that vote for relevance, the resulting combining function is the average of the classifier probabilities. We could base the overall vote for relevance on the vote of the classifier with the maximum (or minimum) probability of voting for relevance. This would result in the combining function being max (or min). A number of other combining functions are possible, but they have the common characteristic that determining the probability of relevance involves looking at the collective vote of the classifiers. The overall vote and the probability of that vote cannot be determined by looking at each classifier's vote independently of the others. Thus the assumption of conditional independence is not appropriate.

Hull et al., 1996, also report the interesting result that the probability estimates obtained from the combined classifier were not as accurate as the best individual classifier, even though the combined rankings were significantly better. This is related to the problem of combining similarity scores mentioned by Lee, 1997. Referring to Figure 1.3, each of the classifiers is, in effect, voting on relevance. The probabilities associated with a particular classifier may be inaccurate relative to the true probability of relevance, but still be consistent with respect to ranking the "votes" for that classifier. This would produce an effective ranking overall, but inaccurate probabilities. Indeed, if one of the classifiers were producing very accurate probability estimates, the combined estimates would be worse. The best situation is, of course, when all classifiers are producing reasonably accurate estimates. In that case, the combined estimate, as well as the combined vote, should be more accurate (Tumer and Ghosh, 1999).

Fagin (Fagin, 1996; Fagin, 1998) proposed a framework based on fuzzy logic for combining the output of multiple search systems in a multimedia
Figure 1.3 Combining the output of classifiers in a Bayesian net
database system. In this framework, a query is a Boolean combination of "atomic queries". Each atomic query is of the form "object attribute = value", and the result of an atomic query is a "graded set" or a list of objects with their scores. Fagin combines the results of atomic queries using the min and max combining functions because they are unique in preserving logical equivalence of queries involving conjunction and disjunction. This property is important for the query optimization strategies he develops, but as we have discussed, min and max do not produce effective retrieval compared to averaging the output of the search systems. Query efficiency will continue to increase in importance, however, as multimedia systems are scaled up to accommodate the enormous volume of data being generated. The assumptions that have been used to build large distributed text retrieval systems may also be different in a multimedia environment, requiring changes in the query processing strategies. For these reasons, it is important to consider efficiency when implementing a combining strategy.

The inference network model has also been proposed as a framework for multimedia retrieval (Croft and Turtle, 1992). One of the main problems with incorporating the results of an image retrieval algorithm into a probabilistic combination framework is that the image techniques typically rank images based on distance or similarity scores that are not probabilities. As mentioned in section 5, the scores can be normalized, but this is a completely ad-hoc procedure that does not produce accurate probability estimates. There has recently been work done on object recognition using probabilistic models (Schneiderman and
Kanade, 1998). They develop a formula for P(object|image), the probability of an object (such as a face) being present in an image. In an image retrieval setting, a query typically includes an image or part of an image and possibly some text. The task of the image retrieval component of the system is to find images that are “similar” to the query image. This task can be based on a probabilistic model, as in Schneiderman and Kanade, 1998. One such probabilistic model would be to compute P(target image|database image). This probability could be calculated using a representation based on visual index terms, similar to the text models described above. Alternatively, a probability of generating the target image could be calculated. This approach corresponds to the use of language models for retrieval, as described in the next section.
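To see how the combining functions discussed in this section behave, the sketch below applies averaging, min, max, and a summation of per-classifier log-odds (corrected for the prior, as the conditional independence assumption suggests) to the same set of classifier outputs. The probabilities and the prior are invented; the point is only the mechanics of each combining function.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-x))

def combine_log_odds(probs, prior):
    """Sum per-classifier log-odds, correcting for the prior being
    counted once per classifier (the conditional-independence form)."""
    n = len(probs)
    return inv_logit(sum(logit(p) for p in probs) - (n - 1) * logit(prior))

probs = [0.70, 0.55, 0.65]  # invented classifier outputs for one document
prior = 0.05                # invented prior probability of relevance

print("average: ", sum(probs) / len(probs))
print("min:     ", min(probs))
print("max:     ", max(probs))
print("log-odds:", combine_log_odds(probs, prior))
```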
7
LANGUAGE MODELS
Most probabilistic retrieval models attempt to describe the relationship between documents and index terms by estimating the probability that an index term is “correct” for that document (Fuhr, 1992). This is a difficult probability to explain, and as a result, heuristic tf.idf weights are used in the retrieval algorithms based on these models. In order to avoid these weights and the awkwardness of modeling the correctness of indexing, Ponte and Croft, 1998, proposed a language modeling approach to information retrieval. The phrase “language model” is used by the speech recognition community to refer to a probability distribution that captures the statistical regularities of the generation of language (Jelinek, 1997). Generally speaking, language models for speech attempt to predict the probability of the next word in an ordered sequence. For the purposes of document retrieval, Ponte and Croft modeled occurrences at the document level without regard to sequential effects, although Ponte, 1998, showed that it is possible to model local predictive effects for features such as phrases. Mittendorf and Schauble, 1994, used a similar approach to construct a generative model for retrieval based on document passages. The approach to retrieval described in Ponte and Croft, 1998, is to infer a language model for each document and to estimate the probability of generating the query according to each of these models. Documents are then ranked according to these probabilities. In this approach, collection statistics such as term frequency, document length and document frequency are integral parts of the language model and do not have to be included in an ad hoc manner. The score for a document in the simple unigram model used in Ponte and Croft is given by:
P(Q|D) = ∏_{w∈Q} P(w|D) × ∏_{w∉Q} (1 − P(w|D))
where P(Q|D) is the estimate of the probability that a query can be generated for a particular document, and P(w|D) is the probability of generating a word given a particular document (the language model). Much of the power of this simple model comes from the estimation techniques used for these probabilities, which combine both maximum likelihood estimates and background models. This part of the model benefits directly from the extensive research done on estimation of language models in fields such as speech recognition and machine translation (Manning and Schutze, 1999). More sophisticated models that make use of bigram and even trigram probabilities are described in Ponte, 1998, and are currently being investigated (Miller et al., 1999; Song and Croft, 1999).

The Ponte and Croft model uses a relatively simple definition of relevance that is based on the probability of generating a query text. This definition does not easily describe some of the more complex phenomena involved with information retrieval. The language model approach can, however, be extended to incorporate more general notions of relevance. Berger and Lafferty, 1999, show how a language modeling approach based on machine translation provides a basis for handling synonymy and polysemy. Hofmann, 1999, describes how mixture models based on latent classes can represent documents and queries. The latent classes can be thought of as language models for important topics in a domain. The language model approach can also be integrated with the inference net model, as described later.

For this paper, the important issue is how the language model frameworks for retrieval deal with combination of evidence. In fact, the language model approach is ideally suited to combination. It can readily incorporate new representations, it produces accurate probability estimates, and it can be incorporated into the general Bayesian net framework. Miller et al., 1999, point out that estimating the probability of query generation involves a mixture model that combines a variety of word generation mechanisms. They describe this combination using a Hidden Markov Model with states that represent a unigram language model (P(w|D)), a bigram language model (P(w_n|w_{n−1}, D)), and a model of general English (P(w|English)), and mention other generation processes such as a synonym model and a topic model. Hofmann, 1999, and Berger and Lafferty, 1999, also describe the generation process using mixture models, but with different approaches to representation. Put simply, incorporating a new representation into the language model approach to retrieval involves estimating the language model (probability distribution) for the features of that representation and incorporating that new model into the overall mixture model. The standard technique for calculating the parameters of the mixture model is the EM (Expectation-Maximization) algorithm (McLachlan and Krishnan, 1997). This algorithm, like the regression techniques mentioned earlier, can be applied to training data that is pooled across queries and this, together with techniques for smoothing the maximum likelihood estimates, results in more accurate probability estimates than a system using tf.idf weights without training, such as INQUERY.
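A minimal sketch of query-likelihood scoring with this kind of mixture estimation: the document model interpolates the document's maximum likelihood estimates with a background (collection) model. The interpolation weight and the toy texts are invented for illustration, and the estimator is deliberately simpler than the one used by Ponte and Croft.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|D) under a unigram model whose word probabilities are a
    mixture of the document MLE and a collection background model:
    P(w|D) = lam * P_ml(w|D) + (1 - lam) * P(w|Collection)."""
    doc_tf, col_tf = Counter(doc), Counter(collection)
    doc_len, col_len = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_ml = doc_tf[w] / doc_len
        p_bg = col_tf[w] / col_len
        score += math.log(lam * p_ml + (1 - lam) * p_bg)
    return score

# Toy documents; the collection model is estimated from all of them.
docs = {
    "d1": "language models for retrieval rank documents by query likelihood".split(),
    "d2": "inference networks combine evidence from multiple representations".split(),
}
collection = [w for d in docs.values() for w in d]
query = "query likelihood retrieval".split()

for name, doc in docs.items():
    print(name, round(query_likelihood(query, doc, collection), 3))
```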
Combining representations to produce accurate classifier output is only part of the overall combination process. The need to combine the outputs of multiple classifiers still exists. The other classifiers might be based on alternate language modeling approaches or on completely different retrieval models, as we have described in the last section. To accomplish this level of combination, the language modeling approach can be incorporated into the inference network framework described in section 6.2. Figure 1.4 shows the unigram language model approach represented using a simplified part of the network from Figure 1.2. The W nodes that represent the generation of words by the document language model replace the K nodes representing index terms describing the content of a document. The Q node represents the satisfaction of a particular query. In other words, the inference net computes the value of P(Q is true). In the Ponte and Croft model, the query is simply a list of words. In that model, Q is true when the parent nodes representing words present in the query are true and the words not in the query are false. The document language model gives the probabilities of the true and false states for the W nodes.

More generally, however, we can regard the query as having an underlying language model, similar to documents. This language model is associated with the information need of the searcher and can be described by P(W1, . . . , Wn|Q). This probability is directly related (by Bayes rule) to the probability P(Q|W1, . . . , Wn) that is computed by the inference network. More complex query formulations, such as those used in the INQUERY system, and relevance feedback provide more information about the searcher's underlying language model. This information can be directly incorporated into the inference network version of the language model approach by adding more links between the Q node and the W nodes, and by changing how the evidence from the W nodes is combined at the Q node. For example, if we learn from relevance feedback that W2 is an important word for describing the user's language model, we can assign more weight to this word in calculating P(Q|W1, . . . , Wn). The inference network, therefore, provides a mechanism for comparing the document language model to the searcher's language model. The two language models could also be compared using the Kullback-Leibler divergence or a similar measure (Manning and Schutze, 1999). The advantages of using the inference net mechanism are that it provides a simple method of using the relatively limited information that is known about the searcher's language model in a typical retrieval environment, and it allows the language model approach to be directly combined with other classifiers described in this framework.

In sections 6.2 and 6.3, we described how the inference net model could be used to represent the combination of different retrieval algorithms and search systems. This framework can now be extended to include the language model
Figure 1.4 The language model approach represented in a Bayesian net
approach. The inference net incorporating language models will provide more accurate probability estimates than the inference net based on tf.idf weights. Language models also provide another view of the query formulation process that supports a more direct use of learning than was done in the INQUERY system.
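The Kullback-Leibler comparison mentioned above can also be sketched directly: estimate a unigram model for the query and for a document, then compute the divergence between them. The smoothing floor and the texts below are invented for illustration.

```python
import math
from collections import Counter

def unigram_model(text, vocab, epsilon=1e-6):
    """Maximum likelihood unigram model over a fixed vocabulary, with a
    small floor so that the divergence below stays finite."""
    tf = Counter(text)
    total = len(text) + epsilon * len(vocab)
    return {w: (tf[w] + epsilon) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) = sum over words of p(w) * log(p(w) / q(w))."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())

query = "ranking documents with language models".split()
doc = "language models estimate the probability of generating query text".split()
vocab = set(query) | set(doc)

p_query = unigram_model(query, vocab)
p_doc = unigram_model(doc, vocab)

# A smaller divergence suggests the document model is closer to the
# searcher's (query) language model.
print("KL(query || doc) =", round(kl_divergence(p_query, p_doc), 3))
```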
8
CONCLUSION
It is clear from this survey of the experimental results published over the last twenty years that combination is a strategy that works for IR. Combining representations, retrieval algorithms, queries, and search systems produces, most of the time, better effectiveness than a single system. Sometimes the performance improvement is substantial. This approach to IR can be modeled as combining the output of classifiers. Given some assumptions, this model specifies that the best results will be achieved when the classifiers produce good probability estimates and are independent. Even when they are not independent, some improvement can still be expected from combination (Tumer and Ghosh, 1999). This simple prescription for good performance explains many of the results obtained from previous research, including those involving ad-hoc normalization and combination strategies for rankings based on similarity values. The inference net framework was an attempt to provide a general mechanism for combination in a single search system. This framework is, in fact, an instantiation of the combination of classifiers model. Although it is a probabilistic
model, the difficulty of estimating indexing probabilities led to ad-hoc tf.idf weights being used in the INQUERY implementation of the model. This, and the lack of training data, meant that the output of the system was not probabilities. The language model approach to retrieval can more easily be trained to produce accurate probabilities and can be integrated into the inference net framework. Other probabilistic approaches such as logistic regression or maximum entropy models could also be integrated into this framework.

Do we need to combine multiple approaches to retrieval and multiple search systems? Combination is, after all, expensive in terms of time, space and implementation effort. The combination of classifiers model implies that, given the extremely high-dimensional, noisy data contained in documents and the general lack of training data, many different classifiers for the retrieval problem could be built that are to some degree independent and would produce different, but equally effective, rankings. This means that combination is both inevitable and beneficial. Given that combination will need to be done, we should try to build search systems that use multiple representations to produce accurate output and we should provide a framework for those systems to be combined effectively. The inference net incorporating language models is a candidate for this framework. Providing such a framework will also be an important part of providing scalability for the immense information stores of the future.
Acknowledgments

This material is based on work supported in part by the National Science Foundation under cooperative agreement EEC-9209623. It is also supported in part by the United States Patent and Trademark Office and the Defense Advanced Research Projects Agency/ITO under ARPA order D468, issued by ESC/AXS contract number F19628-95-C-0235, and by SPAWARSYSCEN contract N66001-99-1-8912. Any opinions, findings and conclusions or recommendations expressed in this material are the author's and do not necessarily reflect those of the sponsors.
References

Bartell, B., Cottrell, G., and Belew, R. (1994). Automatic combination of multiple ranked retrieval systems. In Proceedings of the 17th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 173–181.

Belkin, N., Cool, C., Croft, W., and Callan, J. (1993). The effect of multiple query representations on information retrieval system performance. In Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 339–346.

Belkin, N., Kantor, P., Fox, E., and Shaw, J. (1995). Combining the evidence of multiple query representations for information retrieval. Information Processing and Management, 31(3):431–448.
Berger, A. and Lafferty, J. (1999). Information retrieval as statistical translation. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222–229.

Callan, J. (1994). Passage-level evidence in document retrieval. In Proceedings of the 17th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 302–310.

Callan, J. and Croft, W. (1993). An evaluation of query processing strategies using the TIPSTER collection. In Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347–355.

Callan, J., Croft, W., and Broglio, J. (1995a). TREC and TIPSTER experiments with INQUERY. Information Processing and Management, 31(3):327–343.

Callan, J., Lu, Z., and Croft, W. (1995b). Searching distributed collections with inference networks. In Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–28.

Ciaccia, P., Patella, M., and Zezula, P. (1998). Processing complex similarity queries with distance-based access methods. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT), pages 9–23. Springer-Verlag.

Cleverdon, C. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19:173–192.

Croft, W. and Harper, D. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35:285–295.

Croft, W., Krovetz, R., and Turtle, H. (1990). Interactive retrieval of complex documents. Information Processing and Management, 26(5):593–613.

Croft, W., Lucia, T., Cringean, J., and Willett, P. (1989). Retrieving documents by plausible inference: An experimental study. Information Processing and Management, 25(6):599–614.

Croft, W. and Thompson, R. (1984). The use of adaptive mechanisms for selection of search strategies in document retrieval systems. In Proceedings of the 7th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 95–110. Cambridge University Press.

Croft, W. and Thompson, R. (1987). I3R: A new approach to the design of document retrieval systems. Journal of the American Society for Information Science, 38(6):389–404.

Croft, W. and Turtle, H. (1989). A retrieval model incorporating hypertext links. In Proceedings of ACM Hypertext Conference, pages 213–224.

Croft, W. and Turtle, H. (1992). Retrieval of complex objects. In Proceedings of the 3rd International Conference on Extending Database Technology (EDBT), pages 217–229. Springer-Verlag.
Croft, W., Turtle, H., and Lewis, D. (1991). The use of phrases and structured queries in information retrieval. In Proceedings of the 14th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 32–45.

Crouch, C., Crouch, D., and Nareddy, K. (1990). The automatic generation of extended queries. In Proceedings of the 13th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 369–383.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.

Fagan, J. (1987). Experiments in automatic phrase indexing for document retrieval: A comparison of syntactic and non-syntactic methods. PhD thesis, Computer Science Department, Cornell University.

Fagan, J. (1989). The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2):115–132.

Fagin, R. (1996). Combining fuzzy information from multiple systems. In Proceedings of the 15th ACM Conference on Principles of Database Systems (PODS), pages 216–226.

Fagin, R. (1998). Fuzzy queries in multimedia database systems. In Proceedings of the 17th ACM Conference on Principles of Database Systems (PODS), pages 1–10.

Fisher, H. and Elchesen, D. (1972). Effectiveness of combining title words and index terms in machine retrieval searches. Nature, 238:109–110.

Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Lee, D., Petkovic, D., Steele, D., and Yanker, P. (1995). Query by image and video content: The QBIC system. IEEE Computer Magazine, 28(9):23–30.

Fox, E. (1983). Extending the Boolean and vector space models of information retrieval with p-norm queries and multiple concept types. PhD thesis, Computer Science Department, Cornell University.

Fox, E. and France, R. (1987). Architecture of an expert system for composite document analysis, representation, and retrieval. Journal of Approximate Reasoning, 1:151–175.

Fox, E., Nunn, G., and Lee, W. (1988). Coefficients for combining concept classes in a collection. In Proceedings of the 11th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 291–308.

Fox, E. and Shaw, J. (1994). Combination of multiple searches. In Proceedings of the 2nd Text Retrieval Conference (TREC-2), pages 243–252. National Institute of Standards and Technology Special Publication 500-215.
Frankel, C., Swain, M., and Athitsos, V. (1996). WebSeer: An image search engine for the World Wide Web. Technical Report TR-96-14, University of Chicago Computer Science Department.

Frisse, M. and Cousins, S. (1989). Information retrieval from hypertext: Update on the dynamic medical handbook project. In Proceedings of ACM Hypertext Conference, pages 199–212.

Fuhr, N. (1990). A probabilistic framework for vague queries and imprecise information in databases. In Proceedings of the Very Large Database Conference (VLDB), pages 696–707.

Fuhr, N. (1992). Probabilistic models in information retrieval. Computer Journal, 35(3):243–255.

Fuhr, N. and Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3):223–248.

Gey, F. (1994). Inferring probability of relevance using the method of logistic regression. In Proceedings of the 17th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222–231.

Greiff, W. (1998). A theory of term weighting based on exploratory data analysis. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11–19.

Greiff, W. (1999). Maximum entropy, weight of evidence, and information retrieval. PhD thesis, Computer Science Department, University of Massachusetts.

Greiff, W., Croft, W., and Turtle, H. (1997). Computationally tractable probabilistic modeling of Boolean operators. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 119–128.

Haines, D. (1996). Adaptive query modification in a probabilistic information retrieval model. PhD thesis, Computer Science Department, University of Massachusetts.

Haines, D. and Croft, W. (1993). Relevance feedback and inference networks. In Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2–11.

Harman, D. (1992). The DARPA TIPSTER project. ACM SIGIR Forum, 26(2):26–28.

Harman, D. (1995). Overview of the second text retrieval conference (TREC-2). Information Processing and Management, 31(3):271–289.

Harmandas, V., Sanderson, M., and Dunlop, M. (1997). Image retrieval by hypertext links. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 296–303.

Hearst, M. and Plaunt, C. (1993). Subtopic structuring for full-length document access. In Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 59–68.
Heckerman, D., Geiger, D., and Chickering, D. (1994). Learning Bayesian networks: The combination of knowledge and statistical data. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 293–301. Morgan Kaufmann.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57.

Hull, D., Pedersen, J., and Schutze, H. (1996). Method combination for document filtering. In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 279–287.

Jansen, B., Spink, A., and Saracevic, T. (1998). Real life information retrieval: A study of user queries on the Web. SIGIR Forum, 32(1):5–17.

Jelinek, F. (1997). Statistical methods for speech recognition. MIT Press, Cambridge.

Kaszkiel, M. and Zobel, J. (1997). Passage retrieval revisited. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178–185.

Katzer, J., McGill, M., Tessier, J., Frakes, W., and DasGupta, P. (1982). A study of the overlap among document representations. Information Technology: Research and Development, 1(4):261–274.

Larkey, L. and Croft, W. (1996). Combining classifiers in text categorization. In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 289–297.

Lee, J. (1995). Combining multiple evidence from different properties of weighting schemes. In Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 180–188.

Lee, J. (1997). Analyses of multiple evidence combination. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–276.

Lewis, D. and Hayes, P. (1994). Special issue on text categorization. ACM Transactions on Information Systems, 12(3).

Manning, C. and Schutze, H. (1999). Foundations of statistical natural language processing. MIT Press, Cambridge.

McGill, M., Koll, M., and Noreault, T. (1979). An evaluation of factors affecting document ranking by information retrieval systems. Final report for grant NSF-IST-78-10454 to the National Science Foundation, Syracuse University.

McLachlan, G. and Krishnan, T. (1997). The EM algorithm and extensions. Wiley, New York.

Miller, D., Leek, T., and Schwartz, R. (1999). A Hidden Markov Model information retrieval system. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 214–221.

Mitchell, T. (1997). Machine Learning. McGraw-Hill, New York.
Mitra, M., Singhal, A., and Buckley, C. (1998). Improving automatic query expansion. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206–214.

Mittendorf, E. and Schauble, P. (1994). Document and passage retrieval based on Hidden Markov Models. In Proceedings of the 17th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318–327.

MUC-6 (1995). Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann, San Mateo.

O'Connor, J. (1975). Retrieval of answer-sentences and answer figures from papers by text searching. Information Processing and Management, 11(5/7):155–164.

O'Connor, J. (1980). Answer-passage retrieval by text searching. Journal of the American Society for Information Science, 31(4):227–239.

Pao, M. and Worthen, D. (1989). Retrieval effectiveness by semantic and citation searching. Journal of the American Society for Information Science, 40(4):226–235.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Mateo.

Ponte, J. (1998). A Language Modeling Approach to Information Retrieval. PhD thesis, Computer Science Department, University of Massachusetts.

Ponte, J. and Croft, W. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281.

Rajashekar, T. and Croft, W. (1995). Combining automatic and manual index representations in probabilistic retrieval. Journal of the American Society for Information Science, 46(4):272–283.

Ravela, C. and Manmatha, R. (1997). Image retrieval by appearance. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 278–285.

Robertson, S. (1977). The probability ranking principle in information retrieval. Journal of Documentation, 33:294–304.

Robertson, S. and Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146.

Salton, G. (1968). Automatic information organization and retrieval. McGraw-Hill, New York.

Salton, G. (1971). The SMART retrieval system - Experiments in automatic document processing. Prentice-Hall, Englewood Cliffs.

Salton, G. (1974). Automatic indexing using bibliographic citations. Journal of Documentation, 27:98–100.

Salton, G., Allan, J., and Buckley, C. (1993). Approaches to passage retrieval in full text information systems. In Proceedings of the 16th ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 49–56.

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523.

Salton, G., Fox, E., and Voorhees, E. (1983). Advanced feedback methods in information retrieval. Journal of the American Society for Information Science, 36(3):200–210.

Salton, G. and Lesk, M. (1968). Computer evaluation of indexing and text processing. Journal of the ACM, 15:8–36.

Salton, G. and McGill, M. (1983). Introduction to modern information retrieval. McGraw-Hill, New York.

Salton, G., Wong, A., and Yang, C. (1975). A vector space model for automatic indexing. Communications of the ACM, 18:613–620.

Saracevic, T. and Kantor, P. (1988). A study of information seeking and retrieving. Part III. Searchers, searches, overlap. Journal of the American Society for Information Science, 39(3):197–216.

Schneiderman, H. and Kanade, T. (1998). Probabilistic modeling of local appearance and spatial relationships for object recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 45–51.

Small, H. (1973). Co-citation in scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24:265–269.

Song, F. and Croft, W. (1999). A general language model for information retrieval. In Proceedings of the Conference on Information and Knowledge Management (CIKM), pages 316–321.

Sparck Jones, K. (1971). Automatic keyword classification for information retrieval. Butterworths, London.

Sparck Jones, K. (1974). Automatic indexing. Journal of Documentation, 30(4):393–432.

Svenonius, E. (1986). Unanswered questions in the design of controlled vocabularies. Journal of the American Society for Information Science, 37(5):331–340.

Tumer, K. and Ghosh, J. (1999). Linear and order statistics combiners for pattern classification. In Sharkey, A., editor, Combining Artificial Neural Networks, pages 127–162. Springer-Verlag.

Turtle, H. (1990). Inference networks for document retrieval. PhD thesis, Computer Science Department, University of Massachusetts.

Turtle, H. and Croft, W. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187–222.

Turtle, H. and Croft, W. (1992). A comparison of text retrieval models. Computer Journal, 35(3):279–290.
Van Rijsbergen, C. (1979). Information Retrieval. Butterworths, London.

Van Rijsbergen, C. (1986). A non-classical logic for information retrieval. Computer Journal, 29:481–485.

Vogt, C. and Cottrell, G. (1998). Predicting the performance of linearly combined IR systems. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pages 190–196.

Voorhees, E., Gupta, N., and Johnson-Laird, B. (1995). Learning collection fusion strategies. In Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172–179.

Wilkinson, R. (1994). Effective retrieval of structured documents. In Proceedings of the 17th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 311–317.

Xu, J. and Croft, W. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11.
Chapter 2

THE USE OF EXPLORATORY DATA ANALYSIS IN INFORMATION RETRIEVAL RESEARCH

Warren R. Greiff
The MITRE Corporation
Bedford, Massachusetts
greiff@mitre.org
Abstract
We report on a line of work in which techniques of Exploratory Data Analysis (EDA) have been used as a vehicle for better understanding of the issues confronting the researcher in information retrieval (IR). EDA is used for visualizing and studying data for the purpose of uncovering statistical regularities that might not be apparent otherwise. The analysis is carried out in terms of the formal notion of Weight of Evidence (WOE). As a result of this analysis, a novel theory in support of the use of inverse document frequency (idf) for document ranking is presented, and experimental evidence is given in favor of a modification of the classical idf formula motivated by the analysis. This approach is then extended to other sources of evidence commonly used for ranking in information retrieval systems.

1
INTRODUCTION
Information retrieval (IR) systems work because, for most information needs, they are able to order documents of a collection in such a way that documents relevant to the user's query appear, on average, much higher in the ranking than could be expected by chance. They work because they exploit statistical regularities of document collections and user queries. In response to a query, the earliest systems returned the documents that contained the largest number of query terms. This was effective because a document that contains a query term is more likely to be relevant to the user. A document that contains many of the query terms is more likely to be relevant than a document that contains fewer of them. More modern systems also take into consideration the rarity of a query term and its frequency of occurrence in a document. All other things being equal, a document that contains rare query
terms is more likely to be relevant than a document containing more common terms. In the same way, a document containing more occurrences of a query term is more likely to be relevant than a document with fewer occurrences of the same term. Out of four decades of research have evolved IR systems that take advantage of these statistical characteristics.

While modern retrieval systems have developed sophisticated means for exploiting statistical regularities, IR research has not, as a rule, tended to focus directly on the study of these regularities. Although research in IR is varied and a diversity of philosophies, strategies and techniques have been employed, two major trends may be discerned. For the purposes of this discussion, we may refer to these as the engineering and the a priori modeling approaches to IR research and system development.

In the engineering approach, intuitions concerning the nature of document collections and the behavior of users posing queries to an IR system are encoded in a (typically parameterized) retrieval algorithm. Experiments are run comparing a system incorporating this algorithm to a benchmark system; the result of varying parameter settings is studied; alternate versions of the algorithm are tried. If the research is successful, robust improvement to previous retrieval practice is realized and a better understanding of what is needed for effective retrieval performance results. Even in the absence of improved retrieval performance, new insight is often gained into the nature of the IR problem. Much of the progress of information retrieval is due to research of this nature.

The a priori modeling approach adopts a more theoretical, formal line of attack. In this case, the researcher attempts to formalize her intuitions in the form of a priori assumptions, and a theory, typically a probabilistic theory, of information retrieval is developed. From this theory an information retrieval strategy is usually derived. This strategy can then be implemented and tested. The proponents of this approach believe that the field of information retrieval is well served by the development of formal theoretical foundations. Formal theories promote precision in discourse and permit the application of deductive reasoning to the analysis of information retrieval questions. In Cooper's words, probabilistic theories "bring to bear . . . a high degree of theoretical coherence and deductive power" (Cooper, 1994, p. 242). Cooper also emphasizes that a formal approach assists investigators to articulate their important underlying assumptions. "When the underlying theoretical postulates are known and clearly stated," he tells us, "their plausibility can be subjected to analysis" (p. 244).

In contrast to these, the research described in this paper may be considered a data driven approach. The goal of this research is the development of a model of IR document ranking based on observed statistical regularities of retrieval situations. It is similar to much work in probabilistic information retrieval in that the objective is to formally model the probability of a document being judged relevant to a query conditioned on the available evidence. It is significantly
different from both the engineering and a priori modeling approaches in the emphasis that is placed on the study of existing retrieval data. The methodology involves the adaptation of techniques of exploratory data analysis to the specific conditions encountered with regard to document ranking in information retrieval. The data driven methodology centers on the analysis of the weights of evidence in favor of relevance given by various sources of evidence. The end product is a model for weight of evidence from which a ranking formula can be derived directly. We will discuss how the data driven methodology has been applied to the analysis of existing retrieval data and how a probabilistic model has been produced. The result of this analysis is a model of the weight of evidence in favor of relevance for the totality of evidence associated with a query/document pair being evaluated.

The remainder of this paper is organized as follows. Section 2 and Section 3 describe the exploratory data analysis approach and the concept of weight of evidence, both of which are central to this research. Section 4 presents an initial stage of the research in which attention was focused on the relationship between the document frequency for a term and the weight of evidence associated with its occurrence in a document. The result of the analysis is a hypothesis regarding how the classical inverse document frequency formula may be modified to improve retrieval performance. Substantial experimental evidence is given to support the hypothesis. In Section 5, the methodology used for the study of inverse document frequency in Section 4 is extended to deal with multiple sources of evidence. Retrieval data is analyzed and a stochastic model is developed. From the model, a ranking formula is derived. Experimental results are given for tests of the ranking formula which indicate the viability of the overall approach to IR research and system development.
2 EXPLORATORY DATA ANALYSIS
Hartwig and Dearing define EDA as “a state of mind, a way of thinking about data analysis – and also a way of doing it” (Hartwig and Dearing, 1979, p. 9). They advance adherence to two principles. First, that one should be skeptical of data summaries which may disguise the most enlightening characteristics of the phenomenon being investigated. Second, that one must be open to unanticipated patterns in the data, because uncovering such patterns can be, and often is, the most eventful outcome of the analysis. The article on EDA in the International Encyclopedia of Statistics says that it is the “manipulation, summarization, and display of data to make them more comprehensible to human minds, thus uncovering underlying structure in the data and detecting important departures from that structure” (Andrews, 1978). It goes on to point out that “these goals have always been central to statistics
and indeed all scientific inquiry”, but that there has been a renaissance in the latter part of this century. This is due in no small part to the accessibility of powerful electronic computers for the accumulation of voluminous quantities of data, high-speed calculation, and efficient graphical display.

EDA embodies a set of useful methods and strategies, fomented primarily by John W. Tukey (Tukey, 1977). Four distinguishing aspects of this practice, each of which plays an important role in the probability modeling discussed in this paper, are graphical displays, smoothing, the study of residuals, and the transformation of variables.

The emphasis in exploratory data analysis is on making the most of graphical displays of the data, a historical review of which is given in Beniger and Brown, 1978. The human mind is far better at uncovering patterns in visual input than in lists or tables of numbers. Depending solely on the reduction of large quantities of data to a few summary statistics erases most of the message the data have for us. Smoothing and non-parametric regression techniques are used with the objective of separating the component of the data considered to be the signal from that which, for the purposes of the analysis at hand, is to be treated as noise. A residual is the difference between the observed value of a response variable and the value predicted by a given model. By studying graphs of residuals against potential predictor variables, possibilities for extending the model can be explored. Finally, supported by the production of graphical displays, features of the data can be transformed in a variety of ways in order to make underlying regularities in the data more evident.
3 WEIGHT OF EVIDENCE
Formal definition of Weight of Evidence. I. J. Good formally defines the weight in favor of a hypothesis, h, provided by evidence, e, as (Good, 1983b; Good, 1950):

    woe(h : e) = log [ O(h | e) / O(h) ]                                  (2.1)

where

    O(h) = p(h) / p(¬h)                                                   (2.2)

is the prior odds of the hypothesis, h, being true, and

    O(h | e) = p(h | e) / p(¬h | e)                                       (2.3)

is the posterior odds of the hypothesis h being true conditioned on the evidence e having been observed. He believes this is a concept “almost as important as that of probability itself” (Good, 1950, p. 249).
In (Good, 1983a, chap. 4), Good points out that, in various guises, the notion of weight of evidence had appeared in the work of others. As early as 1878, the quantity given in eq. 2.1 appears in the work of the philosopher Charles Sanders Peirce. Good credits Peirce with the original use of the term weight of evidence. More recently, Minsky and Selfridge also refer to this quantity and call it weight of evidence as well (Minsky and Selfridge, 1961). Turing labeled the quantity, O(h | e) / O(h), the factor in favor of the hypothesis h provided by the evidence e, and Harold Jeffreys made much use of the concept, referring to it as support (Jeffreys, 1961). Weight of evidence is related to Keynes's concept of the amount of information, which he defined as the log of p(e ∧ h) / (p(e) p(h)) (Good, 1983a, chap. 11). This is more commonly referred to today as mutual information and is discussed in Section 4.3 in connection with the relationship between weight of evidence and inverse document frequency. Keynes also used the term weight of evidence, but in a different sense from that used by Peirce, Minsky & Selfridge, and Good (Good, 1983a, chap. 15).

A generalization of the concept of weight of evidence as it is defined in eq. 2.1 will also play an important role in what follows. Weight of evidence can be conditional. That is, attention may be restricted to some sub-space of the full event space. The notation that will be used for conditional weight of evidence will be:

    woe(h : e1 | e2) = log [ O(h | e1 ∧ e2) / O(h | e2) ]                 (2.4)

Desiderata for a Concept of Weight of Evidence. Good elucidates three simple, natural desiderata for the formalization of the notion of weight of evidence (Good, 1989; Good, 1983b).

1. Weight of evidence is some function of the likelihoods:

    woe(h : e) = f[ p(e | h), p(e | ¬h) ]                                 (2.5)
For example, let us suppose that a document is evaluated with respect to the query, Clinton impeachment proceedings, and we discover the term Hyde in the document:

    h = document is relevant to the query
    e = Hyde occurs in the document
This first criterion states that the weight of evidence provided by finding Hyde in the document should be some function of the likelihood of finding Hyde in a document that is relevant to the query and the (for this example,
presumably lower) likelihood of finding Hyde in a document that is not relevant to the query.

2. The final (posterior) probability is a function of the initial (prior) probability and the weight of evidence:

    p(h | e) = g[ p(h), woe(h : e) ]                                      (2.6)
In the context of the same example, this states that the probability associated with the document being about the impeachment proceedings after having observed that the document contains the term, Hyde, should be a function of 1) the probability associated with the document being relevant to the query prior to obtaining knowledge as to the terms it contains, and 2) the weight of evidence associated with finding Hyde.

3. Weight of evidence is additive:

    woe(h : e1 ∧ e2) = woe(h : e1) + woe(h : e2 | e1)                     (2.7)
This property states that the weight in favor of a hypothesis provided by two sources of evidence taken together is equal to the weight provided by the first piece of evidence, plus the weight provided by the second piece of evidence, conditioned on our having previously observed the first. The weight of the second piece of evidence is conditioned on the first in the sense that woe(h : e2) is calculated on the subspace corresponding to the event, e1. If, for example, the two pieces of evidence are:

    e1 = Hyde occurs in the document
    e2 = Henry occurs in the document

then the weight of evidence provided by finding both of the terms, Hyde and Henry, in the document is the sum of the evidence given by finding Hyde and the evidence given by finding Henry. The weight of evidence provided by occurrence of the term, Henry, is conditioned on having previously taken into consideration the evidence provided by encountering Hyde in the document.

Starting from these desiderata, Good is able to show that, up to a constant factor, weight of evidence must take the form given in eq. 2.1:

    woe(h : e) = log [ O(h | e) / O(h) ]                                  (2.8)
The constant factor may be absorbed in the base of the logarithm. For the purposes of this discussion all logarithms will be understood to be in base 10, which will make the scales shown on graphs easier to interpret.
Weight of Evidence and Information Retrieval. There is nothing new about using either log-odds or weight of evidence in information retrieval. The term weight that results from the Robertson/Sparck Jones Binary Independence Model (Robertson and Sparck Jones, 1977) is motivated by the desire to determine the log-odds of relevance conditioned on the term occurrence pattern of a document. The weight can be viewed as the difference between the weights of evidence in favor of relevance provided by the occurrence and non-occurrence of the term. Also, the focus of statistical inference based on logistic regression is the probability of the event of interest transformed by the logit function; that is, the log-odds.
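To make the connection concrete, the following is a minimal sketch (our illustration, not code from the chapter) of a weight of evidence computed from a likelihood ratio, and of the RSJ-style term weight viewed as the difference between the weights of evidence for occurrence and non-occurrence. The 0.5 smoothing constant mirrors the Robertson/Sparck Jones estimate mentioned in Section 4.1; function and variable names are ours.

    import math

    def woe(p_e_given_h, p_e_given_not_h):
        """Weight of evidence woe(h : e): log10 of the likelihood ratio."""
        return math.log10(p_e_given_h / p_e_given_not_h)

    def rsj_term_weight(r, R, n, N):
        """RSJ-style weight for a term occurring in r of R relevant
        documents and n of N documents overall, with 0.5 smoothing."""
        p_occ_rel = (r + 0.5) / (R + 1.0)             # p(occ | rel)
        p_occ_nonrel = (n - r + 0.5) / (N - R + 1.0)  # p(occ | not rel)
        # woe for occurrence minus woe for non-occurrence
        return (woe(p_occ_rel, p_occ_nonrel)
                - woe(1.0 - p_occ_rel, 1.0 - p_occ_nonrel))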
4 ANALYSIS OF THE RELATIONSHIP BETWEEN DOCUMENT FREQUENCY AND THE WEIGHT OF EVIDENCE OF TERM OCCURRENCE
This section reports on a study of the relationship between the weight of evidence in favor of relevance provided by the occurrence of a query term in a document and the frequency with which the term is found to occur in the collection as a whole. From the empirical study of retrieval data, statistical regularities are observed. Based on the patterns uncovered, a theory of term weighting is developed. The theory proposes mutual information between term occurrence and relevance as a natural and useful measure of query term quality. We conclude that this measure is correlated with document frequency and use this to derive a theoretical explanation in support of idf weighting which is different from theories that have previously been proposed. The theory developed, in conjunction with the empirical evidence, predicts that a modification of the idf formula should produce improved performance. In Section 4.4, experiments are presented that corroborate this prediction.
4.1 DATA PREPARATION
The study presented here involved data from queries 051-100 from the first Text REtrieval Conference (TREC) and the Associated Press (AP) documents from TREC volume 1 (Harman, 1993). Each data point corresponds to one query term. The query terms were taken from the concepts field of the TREC 1 topics. For the purposes of uncovering underlying statistical regularities, a set of quality query terms was desired that would keep to a minimum the noise in the data to be analyzed. For this reason the concepts field was used. Initially, the plan was for all query terms to be plotted. Two problems immediately presented themselves. First, rare terms are likely to have zero counts and this is problematic. For variables that are functions of log odds, a zero count translates to a (positive or negative) infinite value. One way around the problem is to add a small value to each of the counts of interest (relevant
documents in which term occurs, relevant documents in which term does not occur, non-relevant documents in which term occurs, non-relevant documents in which term does not occur). This is a common approach, taken for instance in Robertson and Sparck Jones, 1977, where, for the purpose of estimating w_RSJ, 0.5 is added to each count. For the purposes of data analysis, however, there is a problem with this approximation. The choice of constant is to a large degree arbitrary. For many of the plots of interest, the shape of the plot at the low frequency end will vary considerably with the value chosen for the constant. Two slightly different choices for the constant value can give a very different overall picture of the data when they are plotted, particularly at the low frequency end. Since our objective is precisely to infer the true shape of the data, this approach is inadequate to our needs.

A second problem is that the variance of the variables we are interested in is large, relative to the effects we hope to uncover. This can be seen clearly, for example, at the left of Figure 2.4, where p(occ | rel) is plotted against log O(occ) for all terms for which it has a finite value. That this variable increases with increasing df is somewhat evident, but subtler details of the relationship are obscured.

In order to confront both of these problems, data points were binned. Query terms were sorted in order of document frequency. Then for some bin size, k, sequences of k query terms (more precisely, k or k + 1 query terms, so that no bin was much smaller than the rest) were grouped together in bins. Each bin was then converted into a single pseudo-term by averaging all counts (number of relevant documents in which term occurs, number of relevant documents in which term does not occur, number of non-relevant documents in which term occurs, number of non-relevant documents in which term does not occur). Calculations of probabilities, weights, etc. were done on the pseudo-terms and these results were then plotted. A bin size of k = 20 was found to be best for our purposes. The plot of binned pseudo-terms corresponding to the left of Figure 2.4 is shown at the right of the same figure. Although we will focus on the binned plots, each of these plots will be displayed alongside its unbinned version, in order that the reader may get a feel for the raw data. It should be kept in mind, however, that points with zero counts are not represented in the unbinned versions. Kernel regression methods (Devroye, 1987; Härdle, 1990; Silverman, 1986) were also explored as an alternative to binning (Greiff, 1999).
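As a concrete illustration of the binning step, the following sketch (our own; it uses fixed-size bins rather than the chapter's balanced k or k + 1 split) sorts terms by document frequency and averages the four counts within each bin to form pseudo-terms:

    def make_pseudo_terms(terms, k=20):
        """terms: list of (df, r_occ, r_no, n_occ, n_no) tuples, where the
        last four fields are the relevant/non-relevant x occurs/does-not-occur
        counts. Returns one pseudo-term per bin of ~k terms, with all five
        fields averaged over the bin."""
        terms = sorted(terms, key=lambda t: t[0])   # sort by document frequency
        bins = [terms[i:i + k] for i in range(0, len(terms), k)]
        pseudo = []
        for b in bins:
            avg = [sum(t[j] for t in b) / len(b) for j in range(5)]
            pseudo.append(tuple(avg))
        return pseudo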
4.2 PLOTTING THE DATA
The data analysis begins by focusing on the components, p(occ | rel) and p(occ | ¬rel), and how these components correlate with document frequency.
Figure 2.2: p(occ | rel) as a function of p(occ)
Because the goal is to compare various document sets of differing sizes, we prefer not to plot the data in terms of absolute document frequencies. Instead, we plot against p(occ), the probability that the term will occur in a document chosen randomly from the collection.

Occurrence in Non-relevant Documents. With respect to p(occ | ¬rel), we see in Figure 2.1 that it is well approximated by p(occ). We will return to analyze the variable, p(occ | ¬rel), in more detail.
Figure 2.3: Histograms for p(occ) and log O(occ)
Occurrence in Relevant Documents. More interesting is Figure 2.2, which shows a plot of p(occ | rel) as a function of p(occ). We see from this scatter plot that as the document frequency (equivalently, probability of term occurrence) gets small (left end of graph), the probability of the term occurring in a relevant document tends to get small (lower on the graph) as well.

A glance at this graph suggests that a re-expression of variables may be indicated. The histogram shown at the left in Figure 2.3 confirms that, as both intuition and Figure 2.2 suggest, the distribution of document frequencies is highly skewed. With this type of skew, a logarithmic transformation is often found to be beneficial (Tukey, 1977). (As mentioned earlier, in order to aid intuitive comprehension, all logarithms in this document are logarithms to the base 10.) Here, we go one step further and re-express the variable as:

    log O(occ) = log [ p(occ) / (1 − p(occ)) ]                            (2.9)
For practical purposes, given typical document frequencies for query terms, the difference between log p(occ) and log O(occ) is negligible. For the development of a general theory, log O(occ) tends to be a preferable scale on which to
work, due to the symmetric treatment it gives to probabilities below and above 0.5, as discussed in Section 3. The histogram at the right in Figure 2.3 shows the distribution of the variable after it has been re-expressed as log O(occ). Of course, our interest in log p(occ) or log O(occ) is further motivated by the knowledge that this statistic is, in fact, known to be a useful indicator of term value. The variable, p(occ | rel), is re-plotted as a function of log O(occ) in Figure 2.4. The plot against the log-odds shows that the decrease in p(occ | rel) continues as document frequency gets smaller and smaller, a fact that was obscured by the bunching together of points below p(occ) ≈ 0.01 in the original plot (Figure 2.2).

p(occ | rel) Relative to p(occ). Despite the transformation of the independent variable, looking at p(occ | rel) directly makes it hard to appreciate the phenomenon of interest. The conditional probability of occurrence is higher for high frequency terms. But high frequency terms are more likely to appear in documents in general. It comes as no great surprise, then, that they are more likely to occur in relevant documents. This is particularly obvious for very high frequency terms as compared to very low frequency terms. Take, for example, two terms: t1 with probability of occurrence, p(t1) = 0.2, and t2 with probability of occurrence, p(t2) = 0.0001. We would expect that p(t1 | rel) is at least 0.2. In contrast, we could hardly expect a term which only appears in one of every ten thousand documents in the collection to appear in as many as two out of ten of the relevant documents. In fact, if the probability of relevance for the query is, say, one in a thousand, simple algebra shows that
it will not be possible for p(t2 | rel) to be greater than 0.1:

    p(t2 | rel) = p(rel | t2) p(t2) / p(rel) ≤ p(t2) / p(rel) = 0.0001 / 0.001 = 0.1

What may be of more interest to us, then, is how much more likely it is for a term to occur in the relevant documents compared to its being found in an arbitrary document of the collection as a whole. Figure 2.5 shows a plot of the ratio p(occ | rel) / p(occ)
as a function of log O(occ). We observe here a clear non-linear increase in this ratio as document frequency decreases. From this plot it is evident that, in general: 1) query terms are more likely to appear in relevant documents than in the collection as a whole, and 2) how much more likely their appearance in relevant documents is correlates inversely with document frequency. The apparent exponential nature of this correlation calls out for the logarithm of p(occ | rel) / p(occ) to be investigated.

Figure 2.5: p(occ | rel)/p(occ) as a function of log O(occ)

Log of the Ratio of p(occ | rel) to p(occ). In Figure 2.6 the log of the ratio p(occ | rel) / p(occ) is plotted against the logarithm of the odds of occurrence in the collection. In the plot, we observe:
• a roughly linear overall increase in log [ p(occ | rel) / p(occ) ] with decreasing log O(occ);
• a stronger linear relationship apparent in the midrange of document frequencies;
• an apparent flattening of this growth at both high and low frequencies.

Figure 2.6: log p(occ | rel)/p(occ) as a function of log O(occ)
A number of comments are in order. First, a clear pattern has emerged that is difficult to attribute to chance. Furthermore, the “reality” of this regularity is corroborated by our inspection of data from other collections included in TREC volumes 1 and 2.

Second, the apparent flattening of the curve at the two extremes is supported by theoretical considerations. At the low-frequency end, we note that:

    log [ p(occ | rel) / p(occ) ] ≤ log [ 1 / p(rel) ] = −log p(rel)      (2.10)

If we assume that, for a given query, the probability of relevance across the entire collection is approximately one in a thousand (for the AP collection, the average probability of relevance over the 50 queries is 0.00085), then log [ p(occ | rel) / p(occ) ] must be below 3.0. We can conclude that, on average, the growth of the log ratio observed between log O(occ) = −1.0 and log O(occ) = −3.0 cannot be sustained for very small document frequencies. It is reasonable to assume that this growth should begin to taper off as log [ p(occ | rel) / p(occ) ] approaches −log p(rel).

The argument is similar at the high frequency end. We can safely assume that, on average, a query term, even a very high frequency query term, is more likely to appear in a relevant document than it is to appear in an arbitrary document of
the collection. Hence, the ratio p(occ | rel) / p(occ) is greater than 1, and its logarithm greater than 0. Since we conclude that log [ p(occ | rel) / p(occ) ] can be expected to be positive at all document frequencies, its rate of descent must taper off at some point before reaching 0. Presumably it approaches 0 asymptotically as the log odds of occurrence goes to ∞ (i.e. the term occurs in all documents). It is reasonable to entertain the hypothesis that this leveling off is what we are observing with the rightmost points in Figure 2.6 (and have observed in plots for other collections as well). We must be cautious, however. The leveling off may, in truth, occur at higher frequencies, the flattening suggested by the few points in question being attributable to chance.

Finally, we note that the quantity:

    log [ p(occ | rel) / p(occ) ]

has connections to information theory. Often referred to as mutual information, it has frequently been used as a measure of variable dependence in information retrieval. It has been used in attempts to include co-occurrence data in models with weaker independence assumptions (van Rijsbergen, 1979); for the purposes of corpus-specific stemming (Croft and Xu, 1995); and for term selection in query expansion based on relevance feedback (Haines and Croft, 1993). It is also often used as a measure of variable dependence in computational linguistics (Church et al., 1991). In a very important sense, it can be taken as a measure of the information about one event provided by the occurrence of another (Fano, 1961). In our context, it can be taken as a measure of the information about relevance provided by the occurrence of a query term. In what follows, we shall adopt the notation MI(occ, rel) for this quantity, which we believe to be an object worthy of attention as a measure of term value in IR research. It will be the main focus in the analysis that follows.
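A small numeric sketch of MI(occ, rel) as a term-quality measure, on the base-10 scale used throughout. The example probabilities are invented for illustration:

    import math

    def mutual_information(p_occ_given_rel, p_occ):
        """MI(occ, rel) = log10 [ p(occ | rel) / p(occ) ]."""
        return math.log10(p_occ_given_rel / p_occ)

    # A rare term concentrated in the relevant documents carries much more
    # information about relevance than a common term:
    print(mutual_information(0.02, 0.0001))   # ~= 2.3
    print(mutual_information(0.25, 0.2))      # ~= 0.1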
4.3 MUTUAL INFORMATION AND IDF
Our interest is in modeling the weight of evidence in favor of relevance provided by the occurrence or non-occurrence of a query term. Presumably, the occurrence of a query term provides positive evidence and its absence is negative evidence. If we will assign a non-zero score only to those terms that appear in a document, this score should be

    woe(rel : occ) − woe(rel : ¬occ)

This quantity, which we shall denote by Δwoe, measures how much more evidence we have in favor of relevance when the term occurs in a document than we do when it is absent. Based on the formal definition of weight of evidence (eq. 2.1), together with that for mutual information, Δwoe can be
expressed as:

    Δwoe = log [ p(occ | rel) / p(occ | ¬rel) ] − log [ p(¬occ | rel) / p(¬occ | ¬rel) ]      (2.11)

By accepting some empirically motivated assumptions concerning query terms, it can be shown that:

1) log p(occ | ¬rel) ≈ log p(occ);
2) −log p(¬occ | rel) is not too big;
3) log p(¬occ | ¬rel) ≈ 0;

and from this, that Δwoe ≈ MI(occ, rel). For details the reader is referred to Greiff, 1999.

There is little question about our ability to infer from the available data that MI(occ, rel) increases with decreasing document frequency. To a first order approximation, we can say that this increase is roughly linear with respect to log O(occ):

    Δwoe ≈ MI(occ, rel) ≈ k2 − k1 log O(occ)                              (2.12)

But k2 can be ignored. By casual inspection of Figure 2.6, we see that any reasonable linear approximation of the plot of MI(occ, rel) = log [ p(occ | rel) / p(occ) ] as a function of log O(occ) will have an intercept value relatively close to 0. We are now left with:

    Δwoe ≈ −k1 log O(occ) = k1 (−log O(occ))                              (2.13)

Once the constant k2 has been eliminated, the remaining constant, k1, becomes irrelevant for the purposes of ranking. It will affect only the scale of the scores obtained, having no effect on the ranking order itself. And so we conclude that the idf formulation,

    idf = −log O(occ) = log [ (1 − p(occ)) / p(occ) ]                     (2.14)

should produce good retrieval performance.
4.4 IMPROVING ON IDF
We argued in Section 4.2 that both theoretical and empirical considerations give reason to assume a flattening of MI(occ, rel) at both ends of the practical spectrum of document frequencies. If we can assume that the “true” form of the function that maps log O(occ) to MI(occ, rel) involves flattening at the extremes, the map to Δwoe will exhibit a similar shape. If we accept the hypothesis that the plot of Figure 2.6 is representative of the general behavior of query terms for the types of queries and collections we study, we should expect improved retrieval performance from a term weighting formula that accounts for the observed flattening.

To test this prediction, we compared retrieval performance of two versions of the INQUERY IR system (Callan et al., 1992) on each of the ad-hoc tasks for TREC 1 through TREC 6 (Harman, 1997). Queries were formed by taking all words from both the title and description. The baseline system used pure idf term weighting with idf = −log O(occ). (Tests with idf = −log p(occ) were also run; for all test sets, performance differences were small, with −log O(occ) outperforming −log p(occ) on all 6 of the test sets.) The test system used a flattened version of idf. For this version, weights were kept at 0 for all values of −log O(occ) below 1.0; increased at the same rate as −log O(occ) from −log O(occ) = 1.0 to −log O(occ) = 3.0; and maintained at a constant value for all terms for which −log O(occ) exceeded 3.0. (A code sketch of this weighting scheme is given at the end of this subsection.)

Table 2.1  3-piece piecewise-linear vs. linear versions of idf

              avg. prec.
             baseline    test     % diff     -/+      sign    wilcoxon
    TREC 1    0.1216    0.1312      7.88    18/32    0.0325     0.0201
    TREC 2    0.0693    0.1021     47.36    10/40    0.0000     0.0000
    TREC 3    0.0676    0.1257     86.03     4/46    0.0000     0.0000
    TREC 4    0.0680    0.1002     47.42    15/34    0.0047     0.0006
    TREC 5    0.0466    0.0688     47.63    17/32    0.0222     0.0006
    TREC 6    0.1185    0.1422     20.01    12/37    0.0002     0.0000
The results of these tests are summarized in Table 2.1. The test version outperforms the baseline system in terms of average precision on all six query sets, by 20% or more on five of the six. The test system also outperforms the baseline system on a majority of queries in each of the six query sets. The “-/+” column gives the number of queries for which the test system performed
below/above baseline. The column labeled “sign” gives the results of the sign test for each query set. Each value indicates the probability of the test version outperforming the baseline on as many of the queries as it did were each system equally likely to outperform the other. The column labeled “wilcoxon” gives the analogous probability according to the Wilcoxon test, taking into account the size of the differences in average precision for each of the queries. The test results showed statistically significant improvement at the 5% level on all test sets according to both statistical measures. Improvement at the 0.1% level was observed in three of the six runs according to the sign test and five of the six according to the Wilcoxon. Improvement was found at all (11) levels of recall on TRECs 2 through 5, at all but the 50% recall level on TREC 1, and at all but the 80% recall level on TREC 6.
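The flattened idf weighting described above is simple to state in code. The following sketch is our own illustration; the elbow values 1.0 and 3.0 are those reported in the text, while the function and parameter names are assumptions:

    import math

    def flattened_idf(df, n_docs, low=1.0, high=3.0):
        """3-piece linear version of idf = -log10 O(occ): zero below `low`,
        increasing with unit slope between `low` and `high`, and constant
        above `high`."""
        p_occ = df / n_docs
        neg_log_odds = -math.log10(p_occ / (1.0 - p_occ))
        if neg_log_odds < low:
            return 0.0
        return min(neg_log_odds, high) - low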
5 PROBABILISTIC MODELING OF MULTIPLE SOURCES OF EVIDENCE
This section presents an analysis of the weight of evidence in favor of relevance offered by query/document features traditionally used for ranking in information retrieval. The predominant objective is to obtain a more precise and rigorous understanding of the relationship these retrieval characteristics have to the probability that a document will be judged relevant. The ultimate goal of this analysis is the development of a retrieval formula, the components of which can be understood in terms of statistical regularities observed in the class of retrieval situations of interest.

A methodology is presented for the analysis of the relationship between query/document characteristics and the probability that a document will be judged relevant to the query. The characteristics that have been studied are coordination level, inverse document frequency, term frequency, and document length. Application of the methodology to a homogeneous collection of documents (1988 news articles from the Associated Press (AP88), taken from volume 2 of the TREC data, evaluated for queries 151-200 from TREC 3 (Harman, 1995)) will serve as the vehicle for exposition of the principal techniques involved.

In what follows we will show how query/document features can be studied, how a model in terms of this evidence can be formulated, and how parameters for it can be determined. The resulting model can be used directly as a scoring mechanism for which the retrieval status values (RSVs) that are produced have a precise probabilistic interpretation. (RSV is a term often used to refer to the score given to a document by a retrieval algorithm; it originated in a paper by Bookstein and Cooper, 1976.) Results will be presented suggesting that the modeling framework, and more important the general approach to the analysis of evidence, developed in this
study may lead to a ranking formula that performs as well as state-of-the-art retrieval formulas that have evolved over the years.
5.1 OVERVIEW OF THE MODELING STRATEGY
The objective of the analysis is to develop a model for the weight of evidence in favor of relevance given by the query/document features under consideration:

    woe(rel : e | Qry = q, *)                                             (2.15)

where the weight of evidence is conditioned on the query being evaluated. To be more precise, the weight of evidence that will be modeled is restricted as well to the subspace corresponding to those query/document pairs for which at least one of the query terms appears in the document. This is indicated by the * in the condition of the weight of evidence given in eq. 2.15. In general, IR systems do not evaluate documents that do not include at least one of the query terms. For that reason, all probabilities and weights of evidence considered in this section will be conditioned on the occurrence of at least one term. We will explicitly include the * in the formulas appearing in the upcoming presentation of the four models, but for the sake of reducing clutter in the notation, it will be left implicit in the sections that follow.

The data analysis will result in the development of four models, which will be denoted by: M0, M1, M2, and M3. The focus of the modeling effort for each of these models will be:

    M0 : log O(rel | Qry = q, *)                                          (2.16)
    M1 : woe(rel : Coord = co | Qry = q, *)                               (2.17)
    M2 : woe(rel : Idf = idf | Coord = co, Qry = q, *)                    (2.18)
    M3 : woe(rel : Tf = tf | Idf = idf, Coord = co, Qry = q, *)           (2.19)
with each model extending the previous one in terms of the constraints imposed on the probability distribution. Although the direct objective at each of the last three steps is to model weight of evidence conditioned on the query, the log-odds in favor of relevance, and hence the probability of relevance, for the query is modeled as well. For example,

    log O(rel | tf, idf, co, q, *) = log O(rel | idf, co, q, *) + woe(rel : tf | idf, co, q, *)

In each case, the conditional probability can be derived from the conditional log-odds by:

    p(rel | . . .) = 10^D / (1 + 10^D),   where D = log O(rel | . . .)

To begin the process, model M0 is developed by simply estimating, for each query, the probability that an arbitrary document will be found to be relevant
to that query. This is described in more detail in Section 5.2. We proceed, in Section 5.3, to analyze evidence corresponding to coordination level. This results in the M1 model of relevance conditioned on the query being evaluated and the number of query terms occurring in the document. In Section 5.4, we see that inverse document frequency is correlated with residual log-odds of relevance, relative to the M1 model. Extension of the model to include idf_i for each of the query terms, i = 1, 2, . . ., produces the M2 model. Finally, analysis of the role of term frequency, discussed in Section 5.5, results in the M3 model. It is this model on which a ranking formula will be based.
5.2 BASE MODEL
A modeling assumption, derived from the Maximum Entropy Principle (Greiff and Ponte, 1999; Greiff, 1999), is that the weight of the query/document evidence, e, shall be considered independent of the query, q. The retrieval status value to be used will then be this weight of evidence, which can be assigned without knowledge of the prior probability of relevance for the query, p(rel | q). However, modeling is best accomplished by including the probability of relevance conditioned only on the query, q, in order to eliminate potential problems due to confounding.

The basis for the analysis is established by first developing a model of the probability of relevance conditioned only on the query being evaluated, p(rel | q). We shall loosely refer to this probability as the prior probability of relevance for query q, in the sense that it is the probability that a randomly selected document will be found relevant to the query before any evidence is observed, that is, before the contents of the document are known. The estimation of this probability is straightforward. From the TREC data, the probability of relevance, p(rel | q), can be estimated using counts as

    p(rel | q) = #rels / #docs

and this can be converted to log-odds. A graph of these data is shown in Figure 2.7. It can be seen in this graph that the prior probability of relevance ranges over almost three orders of magnitude, from a little less than one in ten (log-odds ≈ −1) for query #189, to slightly above one in ten thousand (log-odds ≈ −4) for query #181.

Figure 2.7: (Prior) log-odds of relevance for TREC-3 queries
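A minimal sketch (our own illustration) of the M0 estimate and of the log-odds/probability conversion used throughout this section, with all logarithms to the base 10:

    import math

    def prior_log_odds(n_relevant, n_docs):
        """M0: log10 odds that a random document is relevant to the query."""
        p = n_relevant / n_docs
        return math.log10(p / (1.0 - p))

    def probability_from_log_odds(d):
        """Invert the log-odds: p = 10^D / (1 + 10^D)."""
        return 10.0 ** d / (1.0 + 10.0 ** d)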
5.3 MODELING COORDINATION LEVEL
Using M0 as a base, we model the weight of evidence offered by coordination level.

Calculation of Residual Log-odds. In order to model the weight of evidence offered by coordination level, the data were first grouped by the value of the Coord variable into subsets, C1, C2, . . .
    C_i = { (q, d) ∈ Q × D | Coord(q, d) = i }

where Q is the set of queries and D is the set of documents. For each of these subsets of query/document pairs, an expected number of relevant documents was computed, based on the estimated probabilities of relevance, p̂(rel | Qry = q), for each query:

    r̂_i = Σ_q  n_i,q · p̂(rel | Qry = q)                                  (2.20)

where n_i,q is the number of documents that contain i terms from query q. The probability, p̂(rel | Qry = q), is the conditional probability of relevance given by model M0, which was estimated by counting the fraction of documents relevant to the query. The product, n_i,q · p̂(rel | Qry = q), is an estimate of the number of these n_i,q documents that can be expected to be relevant. The sum of these over all queries is then an estimate of the number of relevant documents in the set C_i. Although somewhat less intuitive, the summation given in eq. 2.20 can also be expressed as:

    r̂_i = Σ_(q,d)∈C_i  p̂(rel | Qry = q)                                  (2.21)

This formulation will be seen to be more useful as this technique is extended to the analysis of idf and tf as sources of evidence.

Accompanying the calculation of r̂_i, the actual number of documents in C_i that are relevant, r_i, can be counted. Both of these can be transformed into log-odds:

    log [ r̂_i / (n_i − r̂_i) ]   and   log [ r_i / (n_i − r_i) ]

where n_i = |C_i| is the number of query/document pairs for which the coordination level is i. The difference between the two:

    res_i = log [ r_i / (n_i − r_i) ] − log [ r̂_i / (n_i − r̂_i) ]
can be viewed as the residual log-odds of relevance: the difference between the observed log-odds of relevance and the log-odds that would be predicted by a model that only uses information about which query is being evaluated.

After the residuals were calculated for each subset of query/document pairs, C_i, a residual plot was produced. Figure 2.8 shows the scatterplot of residuals against coordination level. The lightly shaded bars in the background give a cumulative histogram. The height of a bar at Coord = i indicates the fraction of query/document pairs under consideration for which Coord(q, d) ≤ i. The small circle along the bottom of the graph at Coord = 7 indicates that there were 0 relevant documents in subset C7. Since the observed log-odds is undefined for 0 relevant documents (i.e., log [ r_i / (n_i − r_i) ] = −∞, and hence res_i is undefined), the circle serves to remind us that a point is missing from the plot at this value for the predictor variable.

Figure 2.8: Residual log-odds as a function of coordination level: unsmoothed

Fitting a Regression Line. Motivated by the linearity suggested by Figure 2.8, a weighted linear regression was performed. The point at Coord = 7 represented by the small circle in Figure 2.8 has been ignored for the purposes of the regression. Since the point corresponds to very few documents, its effect on the overall regression would be negligible.

Producing the M1 Model. This regression line can be used to form a model, M1, that takes into account both the query being evaluated and the coordination level. The points marked by small x's at each value, co = 1, 2, . . ., that lie about the line, res = 0, show the difference between the log-odds predicted by the model, M1, and the log-odds actually observed, their proximity to the line, res = 0, demonstrating that the model provides a close fit to the data. The log-odds difference given by the model is equivalent to the weight of evidence in favor of relevance provided by the coordination level, conditioned
on the query:

    woe(rel : co | q) = β₀^Coord + β₁^Coord · co                          (2.22)

This M1 model advances the development of the complete model. The next step will be to use it as a basis for the analysis of the evidence provided by the rarity of a term.
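The residual computation described above is straightforward to express in code. A sketch (our own illustration; the data structures are assumptions) for the coordination-level analysis:

    import math
    from collections import defaultdict

    def coord_residuals(pairs, prior):
        """pairs: iterable of (query, coord_level, is_relevant) triples;
        prior: dict mapping query -> estimated p(rel | q) from model M0.
        Returns dict mapping coord_level -> residual log-odds res_i."""
        n = defaultdict(float)        # n_i: pairs at each coordination level
        r = defaultdict(float)        # r_i: observed relevant documents
        r_hat = defaultdict(float)    # expected relevant documents (eq. 2.21)
        for q, co, rel in pairs:
            n[co] += 1
            r[co] += rel
            r_hat[co] += prior[q]
        res = {}
        for co in n:
            if 0 < r[co] < n[co]:     # observed log-odds defined
                observed = math.log10(r[co] / (n[co] - r[co]))
                expected = math.log10(r_hat[co] / (n[co] - r_hat[co]))
                res[co] = observed - expected
        return res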
5.4 MODELING INVERSE DOCUMENT FREQUENCY
In the previous section, we reported on a study of idf. In that phase of the research, the behavior of idf was analyzed in isolation, independent of other explanatory variables. For the model being developed here, the behavior of idf will be studied in association with other variables of interest. The analysis for idf, and later for tf, as sources of evidence will also depend on the study of residual log-odds of relevance.

Whereas Coord is a feature of query/document pairs, for both idf and tf, data points involve individual query terms as well. For the analysis of coordination level, each pair corresponded to only one data point. For the analysis of idf and tf, each query/document pair corresponds to multiple data points, one for each term appearing in the document. In order that each document, and hence each relevance judgment, be treated equally, each point will be considered weighted by: w(q, d, t) = 1/Coord(q, d). In this way, the 5 points corresponding to a relevant query/document pair with a coordination level of 5 will each receive a weight of 1/5, i.e. each will be considered as 1/5 of a relevant document; 2 points corresponding to a non-relevant document with a coordination level of 2 will be considered as 1/2 of a non-relevant document; in total, 1 relevant and 1 non-relevant document.

In order to study the value of a term's inverse document frequency as a source of evidence, the evidence associated with learning that a specific query term, t, was one of the terms that occurred in the document was studied. The weight of evidence tied to this event was estimated for each of the query terms over all of the queries. In order to carry out this estimation process, the data were first grouped into subsets, I_i = { (q, d, t) ∈ Q × D × T | Qry(t) = q, t = i } (T being the set of query terms), with one subset for each query term. (Occurrences of the same word used in two or more different queries are, for this purpose, considered different terms.) Here again, the actual number of relevant documents can be counted and the fraction of documents that are relevant can be compared
against the expected fraction for each subset. It must be kept in mind that both the observed and expected values are based on counts of entries weighted by the inverse of the coordination level. More precisely,

    r_i = Σ_(q,d,t)∈I_i, (q,d) relevant  w(q, d, t)        n_i = Σ_(q,d,t)∈I_i  w(q, d, t)

The calculation of the expected number of relevant documents for each subset is analogous to that given in eq. 2.21:

    r̂_i = Σ_(q,d,t)∈I_i  w(q, d, t) · p̂(rel | co, q)

where the estimated probability is calculated from the estimated log-odds of relevance:

    log Ô(rel | co, q) = log Ô(rel | q) + wôe(rel : co | q)

with the second term being the weight of evidence predicted by the M1 model.

Figure 2.9: Residual log-odds as a function of idf: unsmoothed

Figure 2.9 shows a scatterplot of these residuals against idf value. Again, small circles are shown at the bottom of the graph for each residual that is undefined because the corresponding term did not appear in any relevant documents. Also visible is a small circle in the upper right hand corner. This corresponds to a term that only appeared in relevant documents. For this term the estimated probability of relevance is 1, giving infinite odds, and hence infinite log-odds, of relevance. The vertical bars in the background give a cumulative histogram for the (weighted) data points.

As discussed in Section 4, the study of the weight of evidence provided by term idf values suggests that the weight of evidence provided by idf is well-modeled by a 3-piece linear function. Review of the general form of residual plots, generated at various levels of smoothing, tended to corroborate
these earlier findings. Together, these two factors motivated the attempt to model woe(rel : idf | co, q) as a 3-piece linear function. In order to realize this, a linear regression was performed to determine parameters for the following linear model:

    woe(rel : idf | co, q) = β₀^IDF + β₁^IDF · idf′                       (2.23)

    idf′ = min(max(idf, 1.0), 2.0)                                        (2.24)

The resulting estimates for the parameters, β₀^IDF and β₁^IDF, yield the model, M2, that minimizes the mean square error of all those models for which the expected value, E[res_i], of the residual is a 3-piece linear function of idf with flat segments at the two extremes, and elbows at idf = 1.0 and idf = 2.0. Regressions were also run with a 4-parameter function, allowing for a general 3-piece linear model (one without the flat-segments restriction). These regressions showed no statistical evidence of non-zero slope in either of the tails, an indication that might have justified consideration of a more general model. Regressions were also run for other settings for the elbows, with values close to 1.0 and 2.0 resulting in the best fit. The 3-piece linear curve shown in Figure 2.10 shows the resulting model imposed on the scatterplot of smoothed residual values.

Figure 2.10: Residual log-odds as a function of idf: smoothed with regression
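A sketch of how such a flat-ended 3-piece linear fit can be computed, using numpy's least-squares routine. This is our own illustration under the stated elbow settings; the chapter does not specify its regression software:

    import numpy as np

    def fit_clamped_idf_model(idf, residual, weights, lo=1.0, hi=2.0):
        """Weighted least-squares fit of res ~ b0 + b1 * clamp(idf, lo, hi).
        idf, residual, weights: 1-D arrays over the (weighted) data points."""
        x = np.clip(idf, lo, hi)                 # flat segments at the extremes
        w = np.sqrt(weights)
        A = np.column_stack([np.ones_like(x), x]) * w[:, None]
        b0, b1 = np.linalg.lstsq(A, residual * w, rcond=None)[0]
        return b0, b1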
5.5 MODELING TERM FREQUENCY
Analysis of the evidence provided by term frequency proceeds along the same lines as that of inverse document frequency. Data points are grouped into subsets TF1, TF2,..., according to the number of occurrences of the query term in the document, Tf (q, d, t). For each subset, TFi, the observed number of relevant documents is determined, and the expected number of relevant documents is
calculated as:

    r̂_i = Σ_(q,d,t)∈TF_i  w(q, d, t) · p̂(rel | idf, co, q)

where p̂(rel | idf, co, q) is calculated from the log-odds of relevance according to model M2:

    log Ô(rel | idf, co, q) = log Ô(rel | q) + wôe(rel : co | q) + wôe(rel : idf | co, q)

Figure 2.11: Residual log-odds as a function of tf: unsmoothed
Figure 2.12: Residual log-odds as a function of tf: smoothed with curve fit
A scatterplot of the resulting residuals is shown in Figure 2.11. A number of transformations of the variables involved were tried. Figure 2.12 shows a plot of the residuals with log(tf) as predictor variable, smoothed to 50 bins. (Some of the original points, in particular the point for tf = 1, i.e. log(tf) = 0, are spread over a number of bins, accounting for several points with the same
coordinates, overlapping one another, on the smoothed version of the graph.) The apparent linearity motivated the application of a simple linear regression. The regression results in a line given by the equation:

    wôe(rel : tf | idf, co, q) = β₀^TF + β₁^TF · log(tf)                  (2.25)

This line is overlaid on the smoothed scatterplot of Figure 2.13. The fit of the curve to the smoothed data on the more natural, unlogged tf scale can be seen in Figure 2.14.

Figure 2.13: Residual log-odds as a function of log(tf): smoothed with regression
Figure 2.14: Residual log-odds as a function of tf: smoothed with regression on original scale
Figure 2.15: Residual log-odds as a function of document length
5.6 MODELING DOCUMENT LENGTH
Many modern retrieval systems normalize term frequencies by document length in some way. The intuition is that a term is more likely to occur a larger number of times the longer a document is. In the vector space model, term frequency is typically normalized by use of the cosine rule (Witten et al., 1994). Here the score is normalized by the Euclidean length of the document vector, which will tend to be greater for longer documents. In INQUERY, the tf component of the ranking formula is:

    tf / (tf + 0.5 + 1.5 · (dl / avg_dl))                                 (2.26)

where dl is document length and avg_dl is the average document length over the entire collection; this also incorporates a form of normalization based on document length.

In order to consider the role of document length in this work, two types of analysis were performed. First, a residual plot was produced with document length as predictor, as it was for the other variables. The data were grouped by document length, and for each group: an expected number of relevant documents was computed, based on M3; the actual number of relevants were
counted; and from these a residual log-odds was calculated. A plot for 50 bins is shown in Figure 2.15. From the plot it does not appear that there is any predictive value associated with document length. A number of other analyses were performed on different data sets with similar results. Based on these results, it was decided not to include document length as a source of evidence in the final model.
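The INQUERY-style normalized tf component of eq. 2.26 is easy to state in code; a sketch (our own illustration) follows:

    def inquery_tf_component(tf, doc_len, avg_doc_len):
        """Normalized tf belief component: grows with tf but is damped
        for documents longer than the collection average."""
        return tf / (tf + 0.5 + 1.5 * (doc_len / avg_doc_len))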
5.7 DISCUSSION
In this section, we review some of the important issues involved in the analysis presented in the previous sections.

Coordination Level as Evidence. The study of coordination level gives convincing evidence that the weight of evidence provided by the number of query terms appearing in the document, Coord = co, is well modeled as a linear function of co. This conclusion is supported by analysis of individual queries over a number of different data sets. Figure 2.16 shows plots of wôe(rel : co | q) as a function of coordination level, co, for all of the TREC 3 queries for which at least one document of AP88 was judged relevant. Remarkable regularity is evidenced by this plot. This is especially true when we take into account that the number of relevant documents corresponding to many of the queries is small, which would cause us to expect substantial variability in the data. Similar regularity was observed for other data sets that were examined.

Figure 2.16: Residual log-odds as a function of coordination level for individual queries

Inverse Document Frequency as Evidence. The modeling of idf is more problematic in two respects. First, the variance of residual log-odds across terms with similar idf values is large. The trend of increasing weight of evidence for increasing idf in Figure 2.9 is clear, although the number of points that are undefined (circles along the bottom of the graph) must not be forgotten. The
trend is also evident in smoothed versions of the plot, where all query terms, including those that do not appear in relevant documents, contribute to the visual effect. Even in smoothed versions, large variance is on display. This large variance makes it difficult to have confidence in the modeling decisions. Second, although the general trend is quite robust, the magnitude of the effect and the exact form of the increase seem to vary considerably across varying collections and query sets. More study will be required to arrive at a more thorough understanding of the nature of the evidence provided by the value of the idf feature.

Term Frequency as Evidence. There are two reasons that speak in favor of a log transformation of the term frequency variable. The shape of the curve shown in Figure 2.11 strongly suggests a sharp decrease in the weight of evidence provided by one additional occurrence of a query term as term frequency increases. All data sets studied exhibited this same behavior. Intuitively, this is what one would expect, and this intuition has been the inspiration for a number of ranking formulas that are used in IR research. The difference between Tf = 1 and Tf = 2 should not be treated as equal to the difference between Tf = 21 and Tf = 22. This is an indication that a log transformation is likely to provide a more appropriate scale on which to analyze the data.

Document Length as Evidence. We have concluded that there is no convincing indication that document length should be included in our model. This goes against much evidence in the experimental literature. Two points can be made. First, there may not be any effect due to document length when attention is restricted to a relatively homogeneous collection of documents such as a collection of news articles. The benefits of document length normalization in an environment such as TREC may be due to differences in behavior across the variety of sub-collections. Second, the distribution of document lengths over a sub-collection is much more uniform than it is over an entire TREC collection. It is reasonable to conjecture that even if there is an effect due to document length, it may be too small to be detected, and perhaps too small to make a difference in retrieval effectiveness, when retrieval is limited to a sub-collection.

The above comments are corroborated to a degree by the graphs shown in Figure 2.17. The two graphs were produced from data corresponding to a combination of the Associated Press articles for 1988 (AP88) and Federal Register documents for the same year (FR88). The graph at the top of this figure shows weight of evidence as a function of term frequency for documents of 600 words or less, whereas the bottom graph shows the same curve for documents of 600 words or more. A comparison of the two graphs indicates
that the rise in weight of evidence is more gradual for the longer documents than it is for the shorter ones.

Figure 2.17: Residual plot for term-frequency for two ranges of document length over a combination of AP88 and FR88 documents
5.8 DEVELOPMENT OF A RANKING FORMULA
In this section we take the M3 model and from it derive a probabilistic ranking formula in terms of weight of evidence. The Probability Ranking Principle (Robertson, 1977) counsels us to rank documents by the probability of relevance. Equivalently, we may rank by the log-odds of relevance. The evidence we have considered to this point is: the coordination value; and, for each of the query terms appearing in the document, the idf and tf values. In terms of this evidence, the log-odds of relevance is given by:

    log O(rel | e, q) = log O(rel | q) + woe(rel : e | q)

There are two steps that need to be taken to convert this to a ranking formula. First we note that we do not expect to have knowledge of the prior odds of relevance, O(rel | q), for the query. However, this value is not needed. Since O(rel | q) is constant for a given query, we can ignore it, and, for the purposes of ranking, simply use the weight of evidence as an RSV, in place of log-odds:

    RSV = woe(rel : e | q)
Greiff, 1999, shows that this can be written as

    woe(rel : e | q) = woe(rel : co | q) + Σ_i [ woe(rel : idf_i | co, q) + woe(rel : tf_i | idf_i, co, q) ]

where the sum is over the query terms appearing in the document. Using the regression equations for coordination level, inverse document frequency, and term frequency, produced to model the weights of evidence in the above expression, yields a retrieval status value in the form of

    RSV = β₀^Coord + β₁^Coord · co + Σ_i [ β₀^IDF + β₁^IDF · idf′_i + β₀^TF + β₁^TF · log(tf_i) ]      (2.27)

In the above formula β₀^Coord can be ignored for the purposes of ranking. The coefficient values corresponding to the regressions which have been fit in the previous sections are given by:

    (β₀^Coord, β₁^Coord) = (−0.66, +0.42)
    (β₀^IDF, β₁^IDF)     = (−0.49, +1.27)
    (β₀^TF, β₁^TF)       = (−0.55, +1.25)

giving the ranking formula:

    RSV = 0.42 · co + Σ_i [ −0.49 + 1.27 · idf′_i − 0.55 + 1.25 · log(tf_i) ]
It is instructive to compare the ranking formula given in eq. 2.27 with formulas commonly used in IR systems. First, the general form of the M3 formula is different from traditional tf-idf formulas. In classic versions of the formula, some function of the document frequency is multiplied by some function of the term frequency and these products are added over all terms appearing in the document. In the M3 version, the idf and tf components are added.
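As a concrete summary of the additive form, here is a sketch of the M3 scoring computation. It is our own illustration: the clamping of idf at 1.0 and 2.0 follows eq. 2.24, the coefficients are those reported above, and the data layout is an assumption:

    import math

    # (beta0, beta1) pairs fit by the regressions in Sections 5.3-5.5
    B_IDF = (-0.49, 1.27)
    B_TF = (-0.55, 1.25)
    B1_COORD = 0.42

    def m3_rsv(matches):
        """matches: list of (idf, tf) pairs, one per query term appearing
        in the document. Returns the additive M3 retrieval status value
        (all logarithms base 10)."""
        co = len(matches)                       # coordination level
        rsv = B1_COORD * co
        for idf, tf in matches:
            idf_clamped = min(max(idf, 1.0), 2.0)
            rsv += B_IDF[0] + B_IDF[1] * idf_clamped      # idf component
            rsv += B_TF[0] + B_TF[1] * math.log10(tf)     # tf component
        return rsv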
A second difference is the introduction of a 3-piece linear function of −log O(occ) as part of the ranking formula. This form of the idf function is a direct result of the data analysis performed.

Performance Evaluation. To test the M3 model, the INQUERY information retrieval system (Callan et al., 1992) was modified to apply a formula based on eq. 2.27 for the RSV calculation. Performance was compared to an unmodified version of INQUERY as a baseline. A series of tests were run with various parameter settings. Figure 2.18 shows an 11-point recall-precision graph for the M3 model using the best parameter settings found. It is compared against the unmodified INQUERY system. The test system is represented by the solid line, with a broken line used for the baseline. Performance is almost identical at all levels of recall. This is encouraging, giving reason to believe that the traditional multiplicative tf-idf formulation may ultimately yield to a probabilistic ranking formula that is founded on observable statistical regularities.

Figure 2.18: Recall-precision graph for TREC 3 queries on AP88

The test results shown in Figure 2.18 correspond to testing the M3 model on the same data set for which it was developed. However, test results on other data sets are promising. Figures 2.19 and 2.20 show results for two other document collections: ZIFF2 and WSJ89. In both cases, performance for the two systems was again quite comparable. The collections involved are similar to the AP88 collection whose analysis is discussed in this paper. The ZIFF2 test was run with the same TREC 3 queries that were used for analysis with the AP88 data. Here, the test system performs slightly better than the baseline system. Interestingly, the test system performs well also on the WSJ89 test, even though both the collection and the query set, queries from TREC 1, were different from those used to develop the model.
The Use of Exploratory Data Analysis in Information Retrieval Research
Figure 2.19
Recall-precision graphs for TREC 3 queries on ZIFF2
Figure 2.20
Recall-precision graphs for TREC 1 queries on WSJ89
6 CONCLUSIONS
In this paper, we have reported on the development of a methodology for the study of evidence used for the ranking of documents in response to the expression of a user's need for information. EDA has been applied to all sources of evidence that participate in the retrieval strategy. This work can fairly be categorized as the first example of what we have termed the data-driven approach.

The methodology based on weight of evidence was applied to an available set of retrieval data. This resulted in the development of a model for the weight of evidence in favor of relevance given by the query/document features that were studied. A ranking formula has been derived directly from the weight-of-evidence model. The derived formula has two important characteristics. The first is that it has a strict probabilistic interpretation: it is the weight of evidence in favor of relevance given by the features of the query/document pair being evaluated. The second is that the components of the formula can be decomposed; both the form and the parameter values correspond to observable regularities of the retrieval data.

A central motivation for this research has been the development of a theoretical framework that allows various forms of evidence to be incorporated in a general retrieval system in a systematic way. It should be possible to apply the techniques developed as a result of this research to other sources of evidence. Alternatively, the same sources of evidence can be analyzed in different retrieval settings. The approach taken here opens the door to a research paradigm that can be brought to bear on the study of all aspects of the information retrieval problem. Directions that are ripe for immediate exploration include the study of other sources of evidence, such as phrases, thesaurus terms, and expanded queries, and the study of retrieval in other languages.
Acknowledgments
This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, and also supported in part by United States Patent and Trademark Office and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235. Any opinions, findings and conclusions or recommendations expressed in this material are the author's and do not necessarily reflect those of the sponsors.
References
Andrews, D. F. (1978). Data analysis, exploratory. In Kruskal, W. H. and Tanur, J. M., editors, International Encyclopedia of Statistics, volume 7, pages 210–218. Free Press, New York.
Beniger, J. R. and Brown, D. L. (1978). Quantitative graphics in statistics: A brief history. The American Statistician, 32(1):1–9.
Bookstein, A. and Cooper, W. (1976). A general mathematical model for information retrieval systems. Library Quarterly, 46(2):153–167.
Callan, J. P., Croft, W. B., and Broglio, J. (1995). TREC and TIPSTER experiments with INQUERY. Information Processing & Management, 31(3):327–343.
Callan, J. P., Croft, W. B., and Harding, S. M. (1992). The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications, pages 78–83.
Church, K., Gale, W., Hanks, P., and Hindle, D. (1991). Using statistics in lexical analysis. In Zernik, U., editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 115–164, Hillsdale, NJ. Lawrence Erlbaum Associates.
Cooper, W. S. (1994). The formalism of probability theory in IR: A foundation or an encumbrance. In Croft, W. B. and van Rijsbergen, C. J., editors, Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 242–248, Dublin, Ireland.
Croft, W. B. and Xu, J. (1995). Corpus-specific stemming using word form co-occurrence. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 147–159, Las Vegas, Nevada.
Devroye, L. (1987). A Course in Density Estimation. Birkhauser, Boston.
Fano, R. M. (1961). Transmission of Information; a Statistical Theory of Communications. MIT Press, Cambridge, MA.
Good, I. J. (1950). Probability and the Weighing of Evidence. Charles Griffin, London.
Good, I. J. (1983a). Good Thinking: The Foundations of Probability and its Applications. University of Minnesota Press, Minneapolis.
Good, I. J. (1983b). Weight of evidence: A brief survey. In Bernardo, J. M., DeGroot, M. H., Lindley, D. V., and Smith, A. F. M., editors, Bayesian Statistics 2, pages 249–269. North-Holland, Amsterdam.
Good, I. J. (1989). Statistical evidence. In Kotz, S. and Johnson, N. L., editors, Encyclopedia of Statistical Sciences, pages 651–656. Wiley.
Greiff, W. R. (1999). Maximum Entropy, Weight of Evidence and Information Retrieval. PhD thesis, University of Massachusetts, Amherst, Massachusetts.
Greiff, W. R. and Ponte, J. (1999). The maximum entropy approach and probabilistic IR models. To appear in ACM Transactions on Information Systems.
Haines, D. and Croft, W. B. (1993). Relevance feedback and inference networks. In Korfhage, R., Rasmussen, E., and Willett, P., editors, Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–203, Pittsburgh, Pa. USA.
Harman, D. (1993). Overview of the first Text REtrieval Conference (TREC-1). In Harman, D. K., editor, The First Text REtrieval Conference (TREC-1), pages 1–20, Gaithersburg, Md. NIST Special Publication 500-207.
Harman, D. (1995). Overview of the third Text REtrieval Conference (TREC-3). In Harman, D. K., editor, The Third Text REtrieval Conference (TREC-3), pages 1–20, Gaithersburg, Md. NIST Special Publication 500-225.
Harman, D. (1997). Overview of the fifth Text REtrieval Conference (TREC-5). In Voorhees, E. M. and Harman, D. K., editors, The Fifth Text REtrieval Conference (TREC-5), pages 1–28, Gaithersburg, Md. NIST Special Publication 500-238.
Hartwig, F. and Dearing, B. E. (1979). Exploratory Data Analysis. Number 07-016 in Sage university paper series: Quantitative applications in the social sciences. Sage Publications, Beverly Hills.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge.
Jeffreys, H. (1961). Theory of Probability. Oxford University Press, Oxford, 3rd edition.
Minsky, M. and Selfridge, O. G. (1961). Learning in random nets. In Cherry, C., editor, Information Theory: Fourth London Symposium, pages 335–347, London. Butterworths.
Neter, J., Wasserman, W., and Kutner, M. H. (1985). Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. R. D. Irwin, Homewood, Ill., 2nd edition.
Robertson, S. E. (1977). The probability ranking principle in IR. Journal of Documentation, 33:294–304.
Robertson, S. E. and Sparck Jones, K. (1977). Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley Publishing Company, Reading, MA.
van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, London, 2nd edition.
Witten, I. H., Moffat, A., and Bell, T. C. (1994). Managing Gigabytes: Compressing and Indexing Documents and Images. van Nostrand Reinhold, New York.
Chapter 3
LANGUAGE MODELS FOR RELEVANCE FEEDBACK
Jay M. Ponte
GTE Laboratories
40 Sylvan Rd
Waltham, MA 02451 USA
ponte@acm.org
Abstract
The language modeling approach to Information Retrieval (IR) is a conceptually simple model of IR originally developed by Ponte and Croft (1998). In this approach, the query is treated as a random event and documents are ranked according to the likelihood that the query would be generated via a language model estimated for each document. The intuition behind this approach is that users have a prototypical document in mind and will choose query terms accordingly. The intuitive appeal of this method is that inferences about the semantic content of documents do not need to be made, resulting in a conceptually simple model. In this paper, techniques for relevance feedback and routing are derived from the language modeling approach in a straightforward manner and their effectiveness is demonstrated empirically. These experiments demonstrate further proof of concept for the language modeling approach to retrieval.
1 INTRODUCTION
The language modeling approach to Information Retrieval (IR), developed by Ponte and Croft (1998), treats the query as a random event generated according to a probability distribution. Intuitively, it is assumed that the end user of an IR system will have an idea of a prototypical document in which he or she is interested and will choose query terms likely to occur in documents similar to that prototype. Viewed this way, one can then think of estimating a model of the term generation probabilities for the query terms for each document and ranking the documents according to the probability of generating the query.
This approach has the intuitive appeal that one ranks documents by much the same process by which the user is presumed to have asked for them. However, as intuitiveness is in the eye of the beholder, perhaps it will be useful to consider some examples where viewing the retrieval problem from the standpoint of language modeling can be beneficial. The main advantage of the language modeling approach is that it does not require inferences as to the semantic content of documents. This makes the model conceptually simple and easy to generalize.

For example, consider a generalization of the information retrieval problem where document boundaries are not predefined. This could occur when, for example, one has a large repository of online audio or video and wishes to retrieve relevant segments. Information retrieval systems are usually implemented to retrieve documents and, as such, use the document level statistics of term frequency (tf) and inverse document frequency (idf). This approach can be very effective for document retrieval but does not have a natural generalization to the case of audio/video segment retrieval (unless one pre-segments the data, which presupposes that there is a single correct segmentation irrespective of the information need of the user). When document boundaries are ill-defined, document level statistics do not have obvious counterparts. However, the language modeling approach does generalize to this case, since one can estimate local term occurrence probabilities from the data even in the absence of document boundaries. How would one accomplish this? If efficiency were of no object, one could compute local probabilities of query generation at each position of each query term using a kernel style estimator (Silverman, 1986). Since the cost of this operation is likely to be prohibitive, one would need to optimize the process, perhaps by precomputing statistics for fixed length blocks and computing the full estimate only for the most likely candidates. So, while the language modeling approach does not offer the solution “out of the box,” it does suggest a clear path to the solution of the problem.

As a second example, consider the retrieval of noisy data such as OCR text or automatically recognized speech transcripts. When one views these problems from the standpoint of traditional vector space or probabilistic IR models, it is not obvious how to proceed. The reason for the difficulty is that the presence or absence of an indexing feature is itself an uncertain event. On the other hand, when viewed as a problem of query term probability estimation, one can utilize probability distributions, e.g., from a speech recognition system, in a fairly straightforward manner. The probability of generating an indexing feature would then need to be estimated taking the feature presence probability into account. Again, the language modeling approach does not offer an immediate solution to this problem and, certainly, much work remains, but the path to the solution is clear since both sources of uncertainty are modeled by probabilities.
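Returning to the first example, a kernel-style local estimate might look like the sketch below. The triangular kernel and the 250-token bandwidth are assumptions made for illustration, not choices taken from the chapter.

    def local_term_probability(occurrence_positions, center, bandwidth=250.0):
        """Kernel-style estimate of a term's generation probability near
        token offset `center` in an unsegmented token stream. Each
        occurrence at position p contributes a triangular kernel weight;
        dividing by the bandwidth normalizes to a per-token rate."""
        weight = sum(max(0.0, 1.0 - abs(p - center) / bandwidth)
                     for p in occurrence_positions)
        return weight / bandwidth

For a term occurring once every k tokens in the neighborhood of `center`, this estimate comes out near 1/k, which is the rate a document-level maximum likelihood estimate would give if a document boundary happened to be drawn there.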
As a final example, we get to the main topic of this paper, relevance feedback or document routing from the perspective of language modeling. Some existing approaches to relevance feedback are quite complex and others are somewhat ad hoc as will be discussed in section 3. On the other hand, the language modeling approach offers an intuitive and even obvious method of performing relevance feedback. That method will be the topic of the remainder of this paper. The organization of the paper is as follows. First, the language modeling approach to ad hoc retrieval is described. Following that, previous approaches to relevance feedback are presented. Finally, the language modeling approach to relevance feedback is described along with empirical results for interactive relevance feedback as well as document routing. The paper closes with conclusions and future directions.
2 THE LANGUAGE MODELING APPROACH TO IR
In the language modeling approach to IR, a language model is inferred for each document and the documents are ranked according to the estimated probability of producing the query under that model. The query generation probability, p(Q|M_d), is the probability of producing the query given the language model of document d. To estimate this probability, the first step is to estimate the probability of each term in the query, starting with the maximum likelihood estimate of the probability of term t in document d:

p_ml(t|M_d) = tf(t,d) / dl_d

where tf(t,d) is the raw term frequency of term t in document d and dl_d is the total number of tokens in document d. A simplifying assumption will be made: given a particular language model, the query terms occur independently. The maximum likelihood estimator then gives rise to the ranking formula ∏_{t∈Q} p_ml(t|M_d) for each document.
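As a minimal sketch of this estimator and the product ranking it induces (function names and data layout are illustrative, not from the chapter):

    from collections import Counter

    def p_ml(term, doc_tokens):
        """Maximum likelihood estimate: tf(t,d) / dl_d."""
        return Counter(doc_tokens)[term] / len(doc_tokens)

    def ml_query_likelihood(query_terms, doc_tokens):
        """Product of p_ml over the query terms. The score is zero if any
        query term is absent from the document, which is exactly the
        problem taken up in Section 2.1."""
        counts = Counter(doc_tokens)
        dl = len(doc_tokens)
        score = 1.0
        for t in query_terms:
            score *= counts[t] / dl
        return score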
2.1 INSUFFICIENT DATA
There are two problems with the maximum likelihood estimator, or perhaps a better way to state this is that there are two symptoms of one underlying problem: insufficient data for reliable maximum likelihood estimation. These two problems (or symptoms of the same problem) will now be examined a little more closely. The obvious practical problem with this estimator is that we do not wish to assign a probability of zero to a document that is missing one or more of the query terms. Doing so would mean that if a user included several synonyms in the query, a document missing even one of them would not be retrieved.
In addition to this practical consideration, from a probabilistic perspective it is a somewhat radical assumption to infer that p(t|M_d) = 0; the fact that we have not seen a term does not make it impossible. Why is that? Recall that we are trying to estimate the term generation probabilities from a hypothesized language model for the document. We don't know the ‘real’ model, and so we do not want to assume that our hypothesized model could not generate a term just because it did not. Instead, we make the assumption that a non-occurring term is possible, but no more likely than what would be expected by chance in the collection, i.e., cf_t/cs, where cf_t is the raw count of term t in the collection and cs is the raw collection size, or the total number of tokens in the collection. This provides us with a more reasonable distribution and circumvents the practical problem.

It should be noted that in homogeneous databases, one may need to use a more careful smoothing estimate of the collection probability since, in some cases, the absence of a very frequently occurring word in a document (i.e., a word with the characteristics of a stopword) could conceivably contribute more to the score of a document in which it does not occur than it would for a document in which it does occur. This is not a problem in the collections studied, as they are heterogeneous in nature, and stopwords have been removed. However, this issue should be addressed in the future to ensure that this approach will be immune to these pathological cases.

The other problem with this estimator was pointed out earlier. If we could get an arbitrarily large sample of data from M_d, we could be reasonably confident in the maximum likelihood estimator. However, we only have a document sized sample from that distribution, and so the variation in the raw counts may partially be accounted for by randomness. In other words, if one were given two documents of equal length generated from the same language model, one would expect these two documents to have different numbers of occurrences of most of the terms due entirely to random variation.
2.2 AVERAGING
Since it is not possible to draw additional data from the same random process that generated each document, one can view the problem with the maximum likelihood estimator as one of insufficient data. To circumvent this problem, we need an estimate computed from a larger amount of data. That estimate is the mean probability of t in documents containing it:

p_avg(t) = ( Σ_{d: t∈d} p_ml(t|M_d) ) / df_t

where df_t is the document frequency of t. This is a more robust statistic in the sense that we have a lot more data from which to estimate it, but it too has
a problem. It cannot be assumed that every document containing t is drawn from the same language model, and so there is some risk in using the mean to estimate p(t|M_d). Furthermore, if the mean were used by itself, there would be no distinction between documents with different term frequencies. In order to benefit from the robustness of this estimator, and to minimize the risk, the mean will be used to moderate the maximum likelihood estimator by combining the two estimates using the geometric distribution as follows:

R_{t,d} = ( 1 / (1 + f̄_t) ) × ( f̄_t / (1 + f̄_t) )^{tf(t,d)}

where f̄_t is the mean term frequency of term t in documents where t occurs, normalized by document length. The intuition behind this formula is that as the tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate. For a somewhat related use of the geometric distribution see Ghosh et al., 1983.

There are several reasons why the geometric function is a good choice in addition to its shape. In the first place, the mean of the distribution is equal to f̄_t, the expected rate of occurrence according to the mean probability. Secondly, the variance of this distribution is larger than the mean; for this reason, the common case of many relatively low rates of occurrence along with relatively few cases of large rates of occurrence is captured by this function. Finally, this function is defined in terms of only the mean and the tf, so it can be computed without adding to the space overhead of the index and in minimal time. At the present time, IR systems need to be able to index gigabytes of text at rates of hundreds of megabytes per hour and to process queries in seconds. Complex estimation techniques that require significant additional indexing time, even if effective, would not be acceptable for real systems. In addition, techniques that require significant additional space, such as storing an additional number per word occurrence, would not be feasible. This makes the geometric distribution an excellent choice from a systems engineering perspective.
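A direct transcription of the risk function, as reconstructed above, into Python (a sketch; the names are illustrative):

    def geometric_risk(tf, mean_tf):
        """Risk of relying on the mean estimate for a term with raw
        frequency `tf` in this document. `mean_tf` is f̄_t, the term's mean
        frequency (normalized by document length) over documents that
        contain it. The mean of this geometric distribution is mean_tf and
        its variance exceeds its mean, as discussed above."""
        return (1.0 / (1.0 + mean_tf)) * (mean_tf / (1.0 + mean_tf)) ** tf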
2.3 COMBINING THE TWO ESTIMATES
The geometric distribution will be used in the calculation of p(Q|M_d), the estimate of the probability of producing the query for a given document model, as follows:

p(Q|M_d) = ∏_{t∈Q} p(t|M_d) × ∏_{t∉Q} (1 − p(t|M_d))

where p(t|M_d) = p_ml(t|M_d)^{(1 − R_{t,d})} × p_avg(t)^{R_{t,d}} if tf(t,d) > 0, and cf_t/cs otherwise.
In this formula, the first term is the probability of producing the terms in the query and the second term is the probability of not producing other terms. This can be regarded as a “background” model for each document that captures the likelihood of other terms for the document. This quantity essentially accounts for other terms that would be better discriminators of the document than the query terms, i.e., terms that would be unlikely to be left out of the query by users interested in the document. This functions in lieu of a more expensive estimation-from-missing-data solution and allows for the tractable calculation of the background probabilities. Also notice the risk function R_{t,d} and the background probability cf_t/cs mentioned earlier. This function is computed for each candidate document and the documents are ranked accordingly. This approach will be applied to the problems of relevance feedback and routing in section 4. Prior to that, some other approaches to these problems will be discussed.
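Putting the estimates together, the following sketch computes the log of p(Q|M_d) under an assumed data layout: `stats` maps every vocabulary term to its mean probability, mean normalized term frequency, and collection frequency. The layout and names are assumptions for illustration.

    import math

    def smoothed_term_prob(tf, dl, mean_prob, mean_tf, cf, cs):
        """p(t|M_d): mix the maximum likelihood estimate with the mean
        probability, weighted by the geometric risk; back off to the
        collection probability cf/cs when the term does not occur."""
        if tf == 0:
            return cf / cs
        risk = (1.0 / (1.0 + mean_tf)) * (mean_tf / (1.0 + mean_tf)) ** tf
        return (tf / dl) ** (1.0 - risk) * mean_prob ** risk

    def log_query_likelihood(query, doc_tf, dl, stats, cs):
        """log p(Q|M_d): query terms contribute log p(t|M_d); every other
        vocabulary term contributes log(1 - p(t|M_d)) as the background."""
        score = 0.0
        for t, (mean_prob, mean_tf, cf) in stats.items():
            p = smoothed_term_prob(doc_tf.get(t, 0), dl,
                                   mean_prob, mean_tf, cf, cs)
            p = min(p, 1.0 - 1e-12)   # keep log(1 - p) finite
            score += math.log(p) if t in query else math.log(1.0 - p)
        return score

In practice one would fold the background product into a precomputed per-document constant rather than looping over the whole vocabulary for every document, but the explicit loop keeps the sketch close to the formula.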
3 RELATED WORK

3.1 THE HARPER AND VAN RIJSBERGEN MODEL
In 1978, Harper and van Rijsbergen developed a method for using relevance information, obtained by relevance feedback, to obtain better estimates for the probability of relevance of a document given the query. This work attempted to correct for the assumption of independence, which the authors did not think was realistic (Harper and van Rijsbergen, 1978). Given complete relevance information, an approximation of the dependence of query terms was defined by the authors by means of a maximal spanning tree. Each node of the tree represented a single query term and the edges between nodes were weighted by a measure of term dependence. Rather than computing the fully connected graph, the authors computed a tree that spanned all of the nodes and that maximized the expected mutual information computed as follows:

I(x_i, x_j) = Σ P(x_i, x_j) log [ P(x_i, x_j) / (P(x_i) P(x_j)) ]
where i and j range over the query terms, P(x_i, x_j) is the probability of term x_i and term x_j occurring in a relevant document, P(x_i) is the probability of term x_i occurring in a relevant document and, likewise, P(x_j) is the probability of term x_j occurring in a relevant document. Note that these probabilities refer to the probability of occurrence of the term or term pair one or more times
in a document. The within document frequency is not accounted for by this measure. Harper and van Rijsbergen did two sets of experiments using this approximation. The first was to determine the upper bound performance of the term dependence model vs. a term independence model using the complete relevance judgments. These experiments showed that under these conditions, the dependency graph did, in fact, yield useful information and resulted in more effective retrieval than the model that did not utilize this information. The second set of experiments used the dependency graph in the presence of a limited number of relevance judgments; both ten and twenty judged documents were tested. In order to compensate for the relatively sparse data, the authors used a Bayesian estimator with Jeffreys' prior, as recommended in van Rijsbergen, 1977, to estimate the probabilities. The authors found that with a limited number of judgments, it was still the case that the dependency graph model yielded better retrieval. It should be noted that the collections used for these experiments were quite small, as larger collections were not available at the time, and the authors cautioned against drawing firm conclusions from these results (Harper and van Rijsbergen, 1978).
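A sketch of this construction, assuming the single-term and pairwise probabilities have already been estimated from the relevant documents; the Prim-style tree builder is a standard choice, not necessarily the authors' implementation:

    import math

    def emim(p_i, p_j, p_ij):
        """Expected mutual information between terms i and j, summed over
        the four joint occurrence/non-occurrence events."""
        total = 0.0
        for pi, pj, pij in [
            (p_i, p_j, p_ij),                          # both occur
            (p_i, 1 - p_j, p_i - p_ij),                # i only
            (1 - p_i, p_j, p_j - p_ij),                # j only
            (1 - p_i, 1 - p_j, 1 - p_i - p_j + p_ij),  # neither
        ]:
            if pij > 0 and pi > 0 and pj > 0:
                total += pij * math.log(pij / (pi * pj))
        return total

    def maximal_spanning_tree(terms, weight):
        """Prim-style maximal spanning tree over the query terms, with
        edges weighted by the dependence measure weight(i, j)."""
        in_tree, edges = {terms[0]}, []
        while len(in_tree) < len(terms):
            i, j = max(((a, b) for a in in_tree
                        for b in terms if b not in in_tree),
                       key=lambda e: weight(*e))
            edges.append((i, j))
            in_tree.add(j)
        return edges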
3.2 THE ROCCHIO METHOD
A commonly used method of relevance feedback is due to Rocchio (Rocchio, 1971). The Rocchio method provides a mechanism for the selection and weighting of expansion terms and is defined as follows:

score(t) = α·q(t) + β · (1/|R|) Σ_{d∈R} w(t) − γ · (1/|R̄|) Σ_{d∈R̄} w(t)    (3.1)

where α is the weight assigned for occurring in the initial query, β is the weight assigned for occurring in relevant documents, γ is the weight assigned for occurring in non-relevant documents, w(t) is a weighting function, generally based on term frequency and/or document frequency, R is the set of documents judged relevant and R̄ is the set of documents judged non-relevant. This formula can be used to rank the terms in the judged documents; the top N can then be added to the query and weighted according to the Rocchio formula. This is a reasonable solution to the problem of relevance feedback that works very well in practice. However, there is no principled way to determine the optimal values of α, β, and γ, so these parameters are generally set empirically. Also, the weighting function w(t) is based on heuristic use of the collection statistics. The method described in Section 4 is derived from the language modeling approach and does not require heuristics or empirically set parameters.
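A sketch of the term-ranking use of eq. 3.1; the default parameter values and the w(t, d) helper signature are illustrative assumptions:

    def rocchio_scores(query_terms, rel_docs, nonrel_docs, w,
                       alpha=1.0, beta=0.75, gamma=0.0):
        """Score candidate expansion terms with the Rocchio formula.
        `w(t, d)` is the tf/idf-style weighting function and each document
        is represented by its set of terms. The alpha/beta/gamma values
        are the empirically set parameters discussed above."""
        candidates = {t for d in rel_docs for t in d} | set(query_terms)
        scores = {}
        for t in candidates:
            in_query = alpha if t in query_terms else 0.0
            rel = beta * sum(w(t, d) for d in rel_docs) / max(len(rel_docs), 1)
            non = gamma * sum(w(t, d) for d in nonrel_docs) / max(len(nonrel_docs), 1)
            scores[t] = in_query + rel - non
        return sorted(candidates, key=scores.get, reverse=True)

The top N terms of the returned ranking are the ones added to the query.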
3.3 THE INQUERY MODEL
The INQUERY inference network model was developed by Turtle (1991); see Figure 3.1 for an example network. The document portion of the network is computed in advance and the query portion is computed at retrieval time. The document side consists of document nodes d1 . . . di, text nodes t1 . . . tj and concept representation nodes r1 . . . rk. The document nodes represent abstract documents: a document may consist of text, figures, images, formatting information, etc. The text nodes represent the textual component of the documents; a given text node corresponds to the observation of the text of a document. For our purposes, we can assume that there is a one-to-one and onto relationship between text nodes and document nodes but, of course, in general that is not necessarily true. Several documents could share a textual component and, conversely, non-textual information can be considered as another source of information within the same model, but these two cases are not relevant to the present discussion.
Figure 3.1 Example inference network.
The concept representation nodes r1 . . . rk are features that can be possessed by the document texts. A link to an r node means that a document is “about” that particular concept. There is some uncertainty to be resolved due, for example,
to differences in word sense, e.g., the word “train” may mean that a document is about trains as a mode of transportation or it may mean that the document is about training employees. This distinction between words and concepts is the key distinction to be made here. The uncertainty of indexing in information retrieval is based on not having a direct representation of concepts. Instead, the probabilities of concepts are estimated from word occurrence statistics.

Query Network. The query side of the network consists of the query concepts, c1 . . . cm, some number of queries, q1 and q2 in this diagram, and I, the information need. The query concepts are the primitive concepts used to represent the query. The concepts will be represented by the words of the query, but again there is a fair amount of uncertainty due to differences in word sense. The query nodes represent individual queries. In this model, the information need of the user can be represented by more than one query, using more than one type of information. The information need itself is known only to the user and needs to be inferred from the queries. There can be an additional layer between the query nodes and the concept nodes representing intermediate operators such as phrases or Boolean operators. The task then is to calculate the belief in the information need of the user given each document. The documents will then be ranked by their belief scores (Turtle, 1991).
Figure 3.2 Close-up view of query network.
Relevance Feedback. Turtle included discussion of relevance feedback; however, the theoretical work of actually adding relevance feedback to the inference network model, and the implementation of the theoretical ideas, was done by Haines (1996). A new type of node was added to the inference network to reflect the relevance judgments provided by the user. These annotation nodes were added to the query side of the inference network and the evidence from the annotations was propagated through the network using message passing. Figure 3.2 shows a close-up view of a portion of the query network from Figure 3.1. Once again, c1 and c2 represent query concept nodes, q1 represents a query and I represents the information need of the user. The relevance judgments will be represented by means of a set of annotation nodes.
Figure 3.3 Annotated query network.
In order to incorporate annotations, each query concept node is annotated using three additional nodes as shown in Figure 3.3. The nodes k1 and k2 represent the proposition that nodes c1 and c2 imply that the information need I has been satisfied. The nodes j1 and j2 represent the observed relevance judgments. The AND nodes are used to require that the query concept occur in the document in question in order for an annotation to have an effect on the score. Haines was able to show that an inference network annotated in this way can be computed efficiently (Haines, 1996), thereby showing that relevance
feedback can be incorporated into the inference network model in practice as well as in theory. The only drawback of this technique is that it requires inferences of considerable complexity. In order to make use of relevance judgments, two additional layers of inference and several new propositions are required. So, while this method has been shown to work, it is more complex than is desirable. Due to the complexity, the implications of this technique for improvement of relevance feedback are not obvious. In contrast, the method of relevance feedback that will be described in section 4 follows more directly from the language modeling approach to information retrieval. This technique is very straightforward and will be shown to work as well as existing methods of relevance feedback.
3.4 EXPONENTIAL MODELS
Beeferman et al., 1997, describe an approach to predicting topic shifts in text using exponential models. The model utilized the ratio of a long range language model to a short range language model as follows:

log [ p_l(x) / p_s(x) ]

where p_l(x) is the probability of seeing word x given the context of the last 500 words and p_s(x) is the probability of seeing word x given the two previous words. The intuition behind this approach is that when a long range language model is not able to predict the next word better than a short range language model, that indicates a shift in topic. A similar ratio based method will be used in this paper for the identification of useful terms for relevance feedback. This method will now be described in detail.
4 QUERY EXPANSION IN THE LANGUAGE MODELING APPROACH
Query expansion techniques have a natural interpretation in the language modeling approach. Recall that the underlying assumption of this approach is that users can choose query terms that are likely to occur in documents in which they would be interested, and that separate these documents from the ones in which they are not interested. Also note that this notion has been developed into a ranking formula by means of probabilistic language models. Since probabilities can be estimated as shown in Section 2.3 to, in some sense, evaluate terms chosen by the user, the same probabilities can be used by the system to ‘choose’ terms in much the same way.
4.1 INTERACTIVE RETRIEVAL WITH RELEVANCE FEEDBACK
Consider the relevance feedback case, where a small number of documents are judged relevant by the user. The relevance of all of the remaining documents is unknown to the system. The system can estimate the probability of producing terms from each document’s probability distribution. From this, a random sample could be drawn from that distribution, but this would be undesirable since, of course, the most common terms would tend to be drawn. However, it is also possible to estimate the probability of each term in the collection as a whole using the term counts. The log ratio of the two can then be used to do as the user does: choose terms likely in documents of interest but unlikely in the collection as a whole.
4.2 DOCUMENT ROUTING
In the routing case, a training collection is available with a large number of relevance judgments, both positive and negative, for a particular query. The task is to use the training collection to construct a query which can then be run on a second collection for which relevance information is not available. Since both relevant and non-relevant documents are known, the ratio method can utilize this additional information by estimating probabilities for both sets. This process is described in more detail in section 4.6. Once again, the task is to choose terms associated with documents of interest and to avoid those associated with other documents. The ratio method provides a simple mechanism to do that.
4.3 THE RATIO METHOD
Incorporating relevance feedback into the language modeling approach turns out to be quite straightforward. Recall that in the language modeling approach, the prototypical user considers the documents of interest and chooses terms that will separate these documents from the rest of the collection. An analogous method will be used to choose terms for relevance feedback. The judged documents can be used to estimate a language model of the documents of interest using the method described in Section 2.3. The probability of a term t in the collection as a whole can be estimated, as before, by cf_t/cs, the collection frequency of t normalized by the total number of tokens in the collection. These two probability models can then be used in a manner similar to that used by Beeferman et al., 1997, for the purposes of topic boundary prediction. Recall from section 3.4 that the method they used was to take the log ratio of a long range language model vs. a short range language model in order to predict topic shifts.
In the relevance feedback case, the two estimates from above will be used in a similar manner to predict useful terms. Terms can then be ranked according to the probability of occurrence according to the relevant document models as compared to the collection as a whole, i.e., terms can be ranked according to the sum of the log ratio of each relevant model vs. the collection model:

score(t) = Σ_{d∈R} log [ p(t|M_d) / (cf_t / cs) ]
where R is the set of relevant documents, p(t|M_d) is the probability of term t given the document model for d as defined in Section 2.3, cf_t is the raw count of term t in the collection and cs is the raw collection size. Terms are ranked according to this ratio and the top N are added to the initial query.
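A sketch of this selection step, assuming each judged-relevant document's model and the collection probabilities are available as dictionaries (an assumed layout, not the chapter's implementation):

    import math

    def ratio_scores(rel_models, coll_prob):
        """Each judged-relevant document model contributes
        log( p(t|M_d) / (cf_t/cs) ) for every term it assigns probability
        to. `rel_models` is a list of term -> p(t|M_d) dictionaries and
        `coll_prob` maps each term to cf_t/cs."""
        scores = {}
        for model in rel_models:
            for t, p in model.items():
                scores[t] = scores.get(t, 0.0) + math.log(p / coll_prob[t])
        return sorted(scores, key=scores.get, reverse=True)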
4.4 EVALUATION
Results are measured using the metrics of recall and precision. Table 3.1 shows the contingency table for retrieval of a set of documents, where r denotes the set of relevant documents, r̄ the non-relevant documents, R the retrieved documents, and R̄ the non-retrieved documents.
Table 3.1 Contingency table for sets of documents.

              r          r̄
    R       R ∩ r      R ∩ r̄
    R̄       R̄ ∩ r      R̄ ∩ r̄
If one were retrieving a set of documents, recall is the probability that a document has been retrieved given that it is a member of the set of relevant documents. Precision is the probability that a document is a member of the relevant set given that it has been retrieved. Defining these measures in terms of the contingency table yields:

Recall = |R ∩ r| / |r|
Precision = |R ∩ r| / |R|
A third measure, fallout, is the probability that a document is not relevant given that it has been retrieved. Fallout, in terms of the contingency table, looks like this:
Fallout = |R ∩ r̄| / |r̄|
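Expressed over document sets, the three measures are direct transcriptions of these definitions:

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant)

    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved)

    def fallout(retrieved, relevant, collection):
        non_relevant = collection - relevant
        return len(retrieved & non_relevant) / len(non_relevant)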
Measuring performance in terms of recall and fallout would be more typical in a classification task. However, fallout is not a very interesting measure for retrieval. Suppose, for example, that a collection consisted of one million documents, and suppose one hundred of them were relevant, which would be a typical proportion. In that case one can do a reasonable optimization for fallout by not retrieving any documents at all. Due to the huge disparity between the numbers of relevant and non-relevant documents, recall and precision are the preferred metrics. For the evaluation of ranked retrieval, precision can be measured at several levels of recall to show the tradeoff. Other measures include the average precision over all relevant documents and precision after R documents have been retrieved for various values of R. Each of these measures will be reported for each of the experiments.
4.5 EXPERIMENTS
The first series of experiments compares the Rocchio method of relevance feedback (Rocchio, 1971) to the language modeling approach with varying numbers of relevance judgments. The experiments were performed on TREC disks 2 and 3 using TREC topics 202-250 (Harman, 1996). For the language model approach, terms are ranked by the log ratio of the probability in the judged relevant set vs. the collection as a whole. For the Rocchio method, the weighting function was tf.idf and no negative feedback was used (γ = 0). The results of feeding back one relevant document and adding five terms to the initial query are shown in Table 3.2. The top of the table lists the interpolated precision at 0-100% recall. The next row of the table shows the uninterpolated average precision. The bottom part of the table shows the precision at increasing numbers of documents. Finally, the row labeled ‘R-Precision’ is the precision after ‘R’ documents have been retrieved; for each query, ‘R’ is chosen to be the number of relevant documents for that query. The language modeling approach is slightly better at the top of the ranking but Rocchio is better at most of the points. The first result column shows the performance of the Rocchio method. The second column shows the corresponding result for the language modeling approach. The third column reports the percent change. Column four is of the form I/D, where I is the count of queries for which performance improved using the new method and D is the count of queries for which performance was different. Column five reports significance values according to the sign test and column six does likewise according to the Wilcoxon test.
Table 3.2 Comparison of Rocchio to the language modeling approach using 1 document and adding 5 terms on TREC queries 202-250 on TREC disks 2 and 3.

                   Rocchio   LMRF     %chng.   I/D     Sign      Wilc.
  Relevant:        6501      6501
  Rel. ret.:       3366      3270     -2.85    16/39   0.1684    0.3004
  Precision
    at 0.00:       0.9694    0.9932   +2.5     2/2     0.2500    undef
    at 0.10:       0.5346    0.4822   -9.8     21/44   0.4402    0.0707
    at 0.20:       0.4209    0.4070   -3.3     22/44   0.5598    0.2997
    at 0.30:       0.3484    0.3082   -11.5    19/43   0.2712    0.1413
    at 0.40:       0.2694    0.2365   -12.2    15/40   0.0769    0.1323
    at 0.50:       0.2156    0.1757   -18.5    12/36   0.0326*   0.0434*
    at 0.60:       0.1478    0.1224   -17.2    9/29    0.0307*   0.0355*
    at 0.70:       0.0890    0.0633   -28.8    7/22    0.0669    0.0412*
    at 0.80:       0.0519    0.028    -45.9    4/12    0.1938    0.0680
    at 0.90:       0.0135    0.0094   -29.9    2/6     0.3438    undef
    at 1.00:       0.0047    0.0060   +26.4    2/3     0.5000    undef
    Avg:           0.2410    0.2147   -10.91   22/49   0.2841    0.0841
  Precision at:
    5 docs:        0.5878    0.5551   -5.6     13/30   0.2923    0.3036
    10 docs:       0.5102    0.4776   -6.4     15/35   0.2498    0.2041
    15 docs:       0.4803    0.4367   -9.1     15/37   0.1620    0.0861
    20 docs:       0.4490    0.4102   -8.6     13/37   0.0494*   0.0873
    30 docs:       0.3891    0.3667   -5.8     21/43   0.5000    0.2024
    100 docs:      0.2537    0.2327   -8.3     19/43   0.2712    0.0961
    200 docs:      0.1835    0.1788   -2.6     20/42   0.4388    0.3399
    500 docs:      0.1119    0.1070   -4.4     19/42   0.3220    0.1976
    1000 docs:     0.0687    0.0667   -2.9     16/39   0.1684    0.3004
  R-Precision:     0.2930    0.2647   -9.66    18/43   0.1802    0.0626
Entries in these two columns marked with a star indicate a statistically significant difference at the 0.05 level. Note that these are one-sided tests. Overall, the two techniques tend to track each other reasonably well, though the Rocchio method outperforms the language modeling approach at most levels of recall and on average. The results of feeding back two relevant documents and adding five terms to the initial query are shown in Table 3.3. Once again, the baseline technique is Rocchio. Notice that the improvements of the language modeling approach over Rocchio are not statistically significant. Again, as with a single document, the two techniques provide similar performance at most levels of recall. Tables 3.4 and 3.5 show results of feeding back ten relevant documents, adding five and ten terms respectively. As in the previous experiments, the baseline technique is Rocchio. Note that for five terms, recall, average precision and precision at several levels of recall are significantly better using the language modeling approach.
Table 3.3 Comparison of Rocchio to the language modeling approach using 2 documents and adding 5 terms on TREC queries 202-250 on TREC disks 2 and 3.

                   Rocchio   LMRF     %chng.   I/D     Sign      Wilc.
  Relevant:        6501      6501
  Rel. ret.:       3650      3590     -1.64    24/44   0.7743    0.4351
  Precision
    at 0.00:       0.9864    0.9908   +0.4     3/3     0.5000    undef
    at 0.10:       0.5679    0.5874   +3.4     22/45   0.6170    0.2882
    at 0.20:       0.4703    0.5072   +7.8     26/46   0.2307    0.1361
    at 0.30:       0.3889    0.4013   +3.2     26/45   0.1856    0.2473
    at 0.40:       0.3115    0.3186   +2.3     23/40   0.2148    0.2726
    at 0.50:       0.2494    0.2652   +6.4     22/39   0.2612    0.2173
    at 0.60:       0.1778    0.1888   +6.2     17/33   0.5000    0.3116
    at 0.70:       0.1215    0.1099   -9.6     7/25    0.0216*   0.1264
    at 0.80:       0.0551    0.0608   +10.4    8/15    0.5000    0.4548
    at 0.90:       0.0156    0.0222   +42.2    4/6     0.3438    undef
    at 1.00:       0.0093    0.0184   +98.6    2/3     0.5000    undef
    Avg:           0.2739    0.2825   +3.16    27/49   0.2841    0.2003
  Precision at:
    5 docs:        0.6735    0.7061   +4.8     21/36   0.2025    0.1853
    10 docs:       0.5939    0.6102   +2.7     24/43   0.2712    0.2710
    15 docs:       0.5388    0.5537   +2.8     21/40   0.4373    0.2218
    20 docs:       0.5092    0.4939   -3.0     18/41   0.2664    0.4053
    30 docs:       0.4456    0.4367   -2.0     19/44   0.2257    0.4744
    100 docs:      0.2890    0.2853   -1.3     25/46   0.7693    0.5737
    200 docs:      0.2067    0.2120   +2.6     27/43   0.0631    0.1509
    500 docs:      0.1225    0.1200   -2.1     24/42   0.8600    0.4502
    1000 docs:     0.0745    0.0733   -1.6     24/44   0.7743    0.4351
  R-Precision:     0.3238    0.3264   +0.79    25/45   0.2757    0.3258
However, it should be noted that ten relevant documents is a large number to expect from a typical user. The results for ten terms are similar. Also note that both techniques show improved results by the addition of more terms, as expected. The main point to take away from this series of experiments is that a relevance feedback technique that follows directly from the language modeling approach works well without any ad hoc additions. Note that in most IR systems, relevance feedback techniques make use of term weighting. No attempt was made to use term weighting in these experiments. This matter is discussed in more detail in section 5.1.
4.6 INFORMATION ROUTING
In the routing task, a training collection with relevance judgments, positive and negative, is provided. The task is to develop a query that will perform well on a test set of new documents.
Table 3.4 Comparison of Rocchio to the language modeling approach using 10 documents and adding 5 terms on TREC queries 202-250 on TREC disks 2 and 3.

                   Rocchio   LMRF     %chng.   I/D     Sign      Wilc.
  Relevant:        6501      6501
  Rel. ret.:       3834      4124     +7.56    28/44   0.0481*   0.0168*
  Precision
    at 0.00:       0.9064    0.9083   +0.2     9/18    0.5927    0.6763
    at 0.10:       0.6134    0.7117   +16.0    32/42   0.0005*   0.0005*
    at 0.20:       0.5317    0.5701   +7.2     28/45   0.0676    0.0431*
    at 0.30:       0.4151    0.4875   +17.4    27/46   0.1510    0.0200*
    at 0.40:       0.3257    0.4054   +24.5    29/46   0.0519    0.0037*
    at 0.50:       0.2706    0.3198   +18.2    24/39   0.0998    0.0194*
    at 0.60:       0.2120    0.2384   +12.5    26/38   0.0168*   0.0355*
    at 0.70:       0.1374    0.1364   -0.7     19/32   0.8923    0.7728
    at 0.80:       0.0731    0.0658   -10.0    12/20   0.8684    0.8047
    at 0.90:       0.0173    0.0175   +1.2     2/8     0.9648    undef
    at 1.00:       0.0103    0.0051   -50.0    0/3     0.1250    undef
    Avg:           0.2943    0.3279   +11.44   31/49   0.0427*   0.0186*
  Precision at:
    5 docs:        0.7061    0.7347   +4.0     14/28   0.5747    0.2545
    10 docs:       0.6306    0.6878   +9.1     23/36   0.0662    0.0176*
    15 docs:       0.5728    0.6218   +8.6     25/33   0.0023*   0.0160*
    20 docs:       0.5337    0.5796   +8.6     29/40   0.0032*   0.0110*
    30 docs:       0.4810    0.5156   +7.2     26/39   0.0266*   0.0230*
    100 docs:      0.3169    0.3408   +7.5     26/41   0.0586    0.0284*
    200 docs:      0.2251    0.2457   +9.2     25/42   0.1400    0.0576
    500 docs:      0.1290    0.1367   +6.0     27/46   0.1510    0.0452*
    1000 docs:     0.0782    0.0842   +7.6     28/44   0.0481*   0.0168*
  R-Precision:     0.3474    0.3773   +8.61    29/43   0.0158*   0.0181*
The language modeling approach to information routing will use similar techniques to those used for relevance feedback. The experiments described will show two different ratio methods. The first is identical to the relevance feedback method described in section 4, where documents were either relevant or unknown. The second method uses models of both the relevant and non-relevant sets of documents. It will be shown that the negative information is very useful in the context of the language modeling approach, just as it is in existing approaches to routing.

Ratio Methods With More Data. In this section, the ratio method will be described in the context of the routing task. In addition, a second ratio method will be developed to make use of the additional information available in the routing task. Empirical results will be shown in section 4.6. As in the relevance feedback case, terms are ranked by the log ratio of the probability in the judged relevant set vs. the collection as a whole.
Table 3.5 Comparison of Rocchio to the language modeling approach using 10 documents and adding 10 terms on TREC queries 202-250 on TREC disks 2 and 3.

                   Rocchio   LMRF     %chng.   I/D     Sign      Wilc.
  Relevant:        6501      6501
  Rel. ret.:       3933      4210     +7.04    35/45   0.0001*   0.0013*
  Precision
    at 0.00:       0.9478    0.9354   -1.3     6/10    0.8281    0.4392
    at 0.10:       0.6882    0.7242   +5.2     24/42   0.2204    0.1482
    at 0.20:       0.5509    0.5873   +6.6     27/45   0.1163    0.1169
    at 0.30:       0.4766    0.4789   +0.5     25/46   0.3294    0.3819
    at 0.40:       0.3608    0.4138   +14.7    30/46   0.0270*   0.0037*
    at 0.50:       0.2829    0.3193   +12.9    30/45   0.0178*   0.0153*
    at 0.60:       0.1973    0.2439   +23.6    28/39   0.0047*   0.0164*
    at 0.70:       0.1242    0.1559   +25.5    22/35   0.0877    0.0442*
    at 0.80:       0.0650    0.0961   +47.9    13/21   0.1917    0.0315*
    at 0.90:       0.0203    0.0263   +29.4    5/9     0.5000    undef
    at 1.00:       0.0094    0.0090   -4.6     0/2     0.2500    undef
    Avg:           0.3138    0.3393   +8.11    33/49   0.0106*   0.0190*
  Precision at:
    5 docs:        0.7469    0.7837   +4.9     13/23   0.3388    0.2421
    10 docs:       0.6653    0.6878   +3.4     20/37   0.3714    0.1709
    15 docs:       0.6150    0.6354   +3.3     22/37   0.1620    0.1066
    20 docs:       0.5755    0.5939   +3.2     27/46   0.1510    0.0987
    30 docs:       0.5150    0.5306   +3.0     23/40   0.2148    0.1519
    100 docs:      0.3300    0.3396   +2.9     31/44   0.0048*   0.0405*
    200 docs:      0.2356    0.2415   +2.5     26/43   0.1110    0.1102
    500 docs:      0.1327    0.1391   +4.9     29/45   0.0362*   0.0346*
    1000 docs:     0.0803    0.0859   +7.0     35/45   0.0001*   0.0013*
  R-Precision:     0.3550    0.3813   +7.43    30/45   0.0178*   0.0431*
These two term sets will be used to construct queries as described below. In the results section, this method will be referred to as ratio 1. The second ratio method, referred to as ratio 2 in the results section, uses the log ratio of the average probability in judged relevant documents vs. the average probability in judged non-relevant documents:

score(t) = log [ ( (1/|R|) Σ_{d∈R} p(t|M_d) ) / ( (1/|R̄|) Σ_{d∈R̄} p(t|M_d) ) ]

where R is the set of judged relevant documents and R̄ is the set of judged non-relevant documents.
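A sketch of ratio method 2; the epsilon guard against zero average probabilities is an implementation assumption, not part of the text:

    import math

    def ratio2_score(term, rel_models, nonrel_models, eps=1e-12):
        """Log ratio of the term's average model probability over judged
        relevant documents to its average over judged non-relevant
        documents."""
        p_rel = sum(m.get(term, 0.0) for m in rel_models) / len(rel_models)
        p_non = sum(m.get(term, 0.0) for m in nonrel_models) / len(nonrel_models)
        return math.log((p_rel + eps) / (p_non + eps))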
Recall that in the relevance feedback case, a small number of documents were known to be relevant but the relevance of the rest of the collection was unknown. In the routing case, complete relevance judgments are available for the training collection. Ratio method 2 makes use of this additional information.

Routing Results and Discussion. Experiments were performed on the TREC 2 routing task. The training data was TREC disks 1 and 2 and the test set was TREC disk 3. The queries were derived from TREC topics 51-100. The initial queries consisted of the terms from the concept fields. These queries were augmented by adding the top 20 terms as ranked by each of the ratio methods. The results for these two augmented query sets are shown in Figure 3.4. Notice that ratio 2 outperforms ratio 1, which is to be expected, since ratio 2 makes use of the non-relevance judgments in addition to the relevance judgments. For purposes of comparison, the official UMass TREC 2 results, which used the Rocchio method, are also included in the graph. Notice that ratio method 1 performs better than the UMass run at the high precision end of the curve, is approximately equal in the middle, and drops below at the high recall end of the curve. Ratio method 2 is better for most of the curve, dropping below the UMass run at the high recall end. Also see Table 3.6.
Figure 3.4 Comparison of ratio methods 1 and 2 on TREC 93 routing task.
Table 3.6 Comparison of ratio methods one and two on TREC 93 routing task.

                   One       Two      %chng.   I/D     Sign      Wilc.
  Relevant:        10485     10485
  Rel. ret.:       7246      7599     +4.87    28/39   0.0047*   0.0039*
  Precision
    at 0.00:       0.8300    0.8525   +2.7     8/14    0.3953    0.1653
    at 0.10:       0.6712    0.7026   +4.7     28/42   0.0218*   0.0047*
    at 0.20:       0.5752    0.6241   +8.5     32/48   0.0147*   0.0020*
    at 0.30:       0.4953    0.5350   +8.0     30/48   0.0557    0.0060*
    at 0.40:       0.4259    0.4605   +8.1     27/50   0.3359    0.0535
    at 0.50:       0.3613    0.3983   +10.2    29/48   0.0967    0.0424*
    at 0.60:       0.2909    0.3229   +11.0    28/44   0.0481*   0.0499*
    at 0.70:       0.2159    0.2411   +11.7    23/37   0.0939    0.0228*
    at 0.80:       0.1273    0.1575   +23.7    18/28   0.0925    0.0065*
    at 0.90:       0.0594    0.0639   +7.7     9/16    0.4018    0.2190
    at 1.00:       0.0055    0.0047   -14.9    1/3     0.5000    undef
    Avg:           0.3543    0.3839   +8.36    34/50   0.0077*   0.0007*
  Precision at:
    5 docs:        0.6880    0.7040   +2.3     12/18   0.1189    0.2040
    10 docs:       0.6680    0.7020   +5.1     18/29   0.1325    0.0536
    15 docs:       0.6440    0.6840   +6.2     22/32   0.0251*   0.0158*
    20 docs:       0.6360    0.6590   +3.6     21/36   0.2025    0.0720
    30 docs:       0.5987    0.6260   +4.6     22/34   0.0607    0.0246*
    100 docs:      0.4612    0.4888   +6.0     29/43   0.0158*   0.0036*
    200 docs:      0.3660    0.3865   +5.6     28/41   0.0138*   0.0063*
    500 docs:      0.2337    0.2452   +4.9     25/41   0.1055    0.0150*
    1000 docs:     0.1449    0.1520   +4.9     28/39   0.0047*   0.0039*
  R-Precision:     0.3901    0.4171   +6.92    29/43   0.0158*   0.0071*
Also notice that the average precision for ratio 2 is competitive with the best systems in TREC to date. For purposes of comparison, the best routing average precision result in any of the TREC evaluations is approximately 41% (Harman, 1996). This best-ever result was obtained using a search-based optimization of term weights. The results presented here did not require any term weighting at all, nor did they use other sources of evidence, such as proximity information, that are often used by systems in TREC. These techniques may improve results further, but this question is left for future work. These results provide more evidence that the language modeling approach is a good model for retrieval, since a technique that follows from this approach yields routing performance competitive with the best results in the field.
5 DISCUSSION AND FUTURE WORK
The experiments already performed indicate that the language modeling approach to relevance feedback and routing is a reasonable alternative to existing
methods. In the future, additional empirical study will need to be done to determine practical considerations such as the number of terms to add to a query. In addition, term weighting methods will be incorporated into the relevance feedback method and additional empirical studies will be performed to determine the utility of these techniques in practice. An interesting question that was not addressed in this study is the relative performance of the Rocchio method vs. the language modeling approach as the number of documents increases. Recall that the Rocchio method performed better with a single document fed back, approximately the same with two documents and not as well with ten documents. It can be conjectured that a Bayesian approach where one starts with a biased model that will be overcome by the document estimates as the number of documents increases would provide the best of both worlds. This matter is left for future work.
5.1 QUERY TERM WEIGHTING
Within-query term weighting is used by most modern retrieval systems. The language modeling approach, as currently defined by Ponte and Croft (1998), does not include query term weighting. However, there are extensions that can be made to the model to incorporate query term weighting in a probabilistically justified manner. Two possibilities will be considered here.

Risk Functions. The current method of probability estimation uses the maximum likelihood probability and the average probability, combined with a geometric risk function. A simple implementation of term weighting is to modify the risk function. This can be accomplished by means of a Bayesian prior over the term distribution which forces the less important terms to be weighted more heavily by the average probability. This makes the documents less distinguishable from each other based on occurrences of the less important terms. For more important terms, the maximum likelihood term will be allowed to dominate, making the more important terms better able to distinguish between documents.

The implementation of this idea would change the ranking formula slightly. For a non-occurring term, the current ranking function estimates the probability as cf_t/cs, where cf_t is the raw count of term t in the collection and cs is the total number of tokens in the collection. The change would be to mix the estimate for non-occurring terms with the mean in the same way the maximum likelihood estimator currently is for terms that do occur.

The intuitive meaning of this idea is that the current risk function treats all terms equally, but with prior knowledge (or prior belief) about the relative importance of terms, one can vary the risk of relying on the mean according to the prior belief of the term importance. For example, suppose a term is deemed to be completely useless; the risk function would be modified so that all of the weight is assigned to the mean. The result is that this term is assigned an equal
probability estimate for every document, causing it to have no effect on the ranking. A stopword such as ‘the’ would fall into this category. One can regard the differences in observed values as pure noise in this case.

User Specified Language Models. Currently, queries are treated as a specific type of text produced by the user. One could also allow the user to specify a language model for the generation of query text. The term weights are equivalent to the generation probabilities of the query model. In other words, one would generate query text according to the probabilities; conceptually, the query is the result of generating a large amount of query text, which could then be processed using the current method. This probably makes more sense in a routing environment where there is enough data from which to estimate the probabilities, but it could conceivably be used for ad hoc retrieval if users have any intuitions about the probabilities.
Acknowledgments
This material is based on work supported in part by the National Science Foundation under cooperative agreement EEC-9209623. It is also supported in part by United States Patent and Trademark Office and by the Defense Advanced Research Projects Agency/ITO under ARPA order D468, issued by ESC/AXS contract number F19628-95-C-0235. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.
References
Beeferman, D., Berger, A., and Lafferty, J. (1997). Text segmentation using exponential models. In Proceedings of Empirical Methods in Natural Language Processing.
Ghosh, M. J., Hwang, T., and Tsui, K. W. (1983). Construction of improved estimators in multiparameter estimation for discrete exponential families. Annals of Statistics, 11:351–367.
Haines, D. (1996). Adaptive query modification in a probabilistic information retrieval model. PhD thesis, Computer Science Department, University of Massachusetts.
Harman, D. (1996). Routing results. In Proceedings of the 4th Text Retrieval Conference (TREC-4), pages A53–A81.
Harper, D. J. and van Rijsbergen, C. J. (1978). An Evaluation of Feedback in Document Retrieval Using Co-occurrence Data. Journal of Documentation, 34(3):189–216.
Ponte, J. and Croft, W. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281.
Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval, chapter 14, pages 313–323. Prentice-Hall Inc.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
Turtle, H. R. (1991). Inference networks for document retrieval. PhD thesis, University of Massachusetts.
van Rijsbergen, C. J. (1977). A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation, pages 106–119.
Chapter 4

TOPIC DETECTION AND TRACKING: EVENT CLUSTERING AS A BASIS FOR FIRST STORY DETECTION

Ron Papka
Dataware Technologies
100 Venture Way, Hadley, MA 01035
rpapka@dataware.com
James Allan
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts, Amherst, MA 01003
allan@cs.umass.edu
Abstract

Topic Detection and Tracking (TDT) is a new research area that investigates the organization of information by event rather than by subject. In this paper, we provide an overview of the TDT research program from its inception to the third phase that is now underway. We also discuss our approach to two of the TDT problems in detail. For event clustering (Detection), we show that classic Information Retrieval clustering techniques can be modified slightly to provide effective solutions. For first story detection, we show that similar methods provide satisfactory results, although substantial work remains. In both cases, we explore solutions that model the temporal relationship between news stories. We also investigate the use of phrase extraction to capture the who, what, when, and where contained in news.
Within Information Retrieval research, texts are usually indexed, retrieved, and organized on the basis of their subjects. Texts are assigned subject headings, a query searches for texts that are “about” a subject, or texts might be grouped
by the subjects they cover. The research in this work takes a different view, focusing on the events that are described by the text rather than the broader subject it covers. This approach allows us to address questions such as "What is the major event discussed within this story?" or "Do these texts discuss the same event?" Of course, not all texts can be reduced to a set of events—books on gardening, papers about mathematics, and tracts discussing bridge building do not necessarily provide a narrative of things that happened, but instead convey general information about a subject. This work will necessarily apply only to texts that have an event focus, such as announcements or news broadcasts.

The CIIR has participated in a multi-site research effort called Topic Detection and Tracking that has been investigating event-based organization of news stories for the past several years. In this study we present the several-year history of that effort to show the range of problems that is being considered. We then present detailed experimental results showing how the traditional notion of "clusters" in Information Retrieval can be used to address two event-based organization problems: Event Clustering and First Story Detection.
1 TOPIC DETECTION AND TRACKING
The work in this study arose out of and is part of the Topic Detection and Tracking (TDT) initiative. The purpose of that effort is to organize broadcast news stories by the real world events that they discuss. News stories are gathered from several sources in parallel—television, radio, and newswire services are all gathered together to create a single stream of constantly arriving news. The first two types of sources are a particularly important part of the TDT project since they require some mechanism for converting the audio stream into text (the television shows might have closed captioning, but it cannot be guaranteed that they will). Automatic speech recognition (ASR) is therefore a critical component of the entire project. In the autumn of 1999, TDT is in its third phase, called TDT-3. TDT-2 ran throughout 1998 and TDT-1 was a pilot study that ran from mid-1996 through 1997. Each phase of TDT has had a different focus and has helped to develop an improved notion of event-based topics for information organization.
1.1 TDT-1, THE PILOT STUDY
The TDT pilot study was a proof-of-concept effort to explore the extent to which state-of-the-art technologies from Information Retrieval could address organization of information by event-based topics. TDT-1 included three research sites—Carnegie Mellon University's (CMU) Language Technology Institute (LTI), Dragon Systems, and the University of Massachusetts at Amherst's (UMass) Center for Intelligent Information Retrieval (CIIR)—and participants
from the Department of Defense. The study started in mid-1996 and concluded with a final meeting in October of 1997 (Department of Defense, 1997).

The first project undertaken by the TDT-1 team was a definition of the problem. An event was defined to be "some unique thing that happens at some point in time" (Allan et al., 1998a). The properties of time and spatial locality are what distinguish an event from the more general subject. For example, "computer virus detected at British Telecom, March 3, 1993," is considered an event, whereas "computer virus outbreaks" is the general subject comprising occurrences of this type of event. This definition was never viewed as entirely satisfactory, but was usable as a guideline for building a small pilot corpus to investigate the issue. That is, it could be used to identify most events, even if no one felt that it was a robust definition.

The team also defined three research problems to be investigated during the course of the pilot study:

1. Segmentation. This is the problem of breaking a set of contiguous news into discrete stories. Recall that an important component of the TDT problem is dealing with broadcast news sources such as television and radio. Unlike newswire sources, the broadcast information does not have clear boundaries between individual stories—how can we detect a shift from one story to another just by looking at the transcript?

2. Detection. This is the problem of identifying when new topics have appeared in the news stream. There were two versions of the detection problem considered, the endpoints of a spectrum of possibilities. An underlying assumption of these tasks is that no news story discusses more than one topic—an assumption that is known to be false on occasion, but is believed to be usually true.

(a) Event clustering. The goal here is to consider a complete corpus of news stories and to partition it into clusters, each of which is about a single news topic, and such that no news story is in more than one cluster. (This task was actually referred to as "Retrospective Detection" within the pilot study.)

(b) First story detection. The task here is to flag the onset of a previously unseen news topic by marking stories as "new" or "old" as the story arrives on the stream. That is, unlike the retrospective case, a story must be identified as novel before the next story can be considered. (This task was actually called On-Line Detection in the pilot study.)

3. Tracking. In this task, a system is given a small number (Nt = 1, 2, 4, 8, or 16) of sample stories that are known to be on the same news topic. The system's job is to monitor the stream of stories that follow that Nt-th story to find all subsequent stories on the same topic.
In addition to defining those tasks, the TDT-1 team also cooperated to create a small evaluation corpus to explore the effectiveness of state-of-the-art techniques to address the problems. This corpus was created from 15,863 news stories, roughly balanced between CNN and Reuters news stories, covering July 1, 1994, through June 30, 1995. The stories were extracted and formatted for TDT-1 by the team. To evaluate the problems above using that corpus, the team also generated a set of 25 news topics by selecting "significant" events from that period, supplemented by a small amount of semi-random sampling. This resulted in topics spanning the Oklahoma City bombing, an asteroid's collision with Jupiter, and the removal of a polyp from Vice President Dan Quayle's nose.

The team employed a two-prong method for assigning relevance judgments between topics and stories. One group of people (split between CMU and Dragon) read every story in the corpus, keeping all 25 topics in mind, and marked whether the story was on topic for each of them. A second group of people (at UMass) used a search engine to look for stories on each of the 25 topics. The two sets of judgments were adjudicated by a team at Dragon, resulting in the final corpus.

Using this corpus, the three research groups individually attacked each of the three tasks outlined above. The approaches were varied, though because they were leveraging existing technologies, they were similar in some core aspects. Evaluation was on the basis of miss and false alarm rates. Another standard performance measure in TDT is the Detection Error Tradeoff (DET) graph (Martin et al., 1997) that shows the relationship between miss and false alarm rates. It is similar in spirit to the precision and recall graphs that are common in Information Retrieval research. The graphs are similar in nature to Receiver Operating Characteristic (ROC) graphs (Swets, 1998), which have been used to evaluate machine classification (Provost and Fawcett, 1998) and medical diagnostic experiments. For simplicity, in this study we are using F1 and cost measures exclusively.

The conclusion of the team was that segmentation was of reasonable quality, tracking effectiveness appeared to be quite high, and both detection problems needed substantial work. In short, the state-of-the-art technologies solved substantial amounts of the problems, but left equally substantial portions to be solved by later research. The TDT-1 pilot study included a small effort toward understanding the impact of speech recognition (ASR) errors on the tasks. Those very small experiments suggested that ASR impacted all of the problems, though tracking the least.

The TDT pilot study showed the value of further research on the problem. Its final report appeared at a workshop on broadcast news transcription (Allan et al., 1998a). TDT-2 was initiated a few months later.
1.2 TDT-2, A FULL EVALUATION
The primary goal of TDT-2 was to create a full-scale evaluation of the TDT tasks begun in the pilot study. TDT-2 ran throughout 1998, and was reported out at the DARPA Broadcast News Workshop (DARPA, 1999). It included almost a dozen participating organizations, including the three pilot sites.

The TDT-2 research team opted to redefine the tasks somewhat. The first change was that the two detection tasks were "merged" to create an on-line version of the event clustering task. The TDT-2 Detection task was still to partition the corpus into clusters, with each story in exactly one cluster, and every cluster representing a discrete news topic. However, unlike the TDT-1 event clustering task, the stories arrived in small groups (corresponding roughly to half-hour news broadcasts) that had to be clustered before the next group could be processed. The segmentation and tracking tasks were maintained as defined earlier, though the introduction of stories in small batches was added in those tasks, too.

The TDT-2 effort also developed revised evaluation methodologies. Detection Error Tradeoff graphs (Martin et al., 1997) remained a primary investigative tool, but miss and false alarm rates were combined to create a cost measure that was the "official" measure, allowing sites to tune algorithms to find the "best" parameters (for that cost function).

One of the most significant changes in TDT-2 was the creation of a large-scale corpus. By the standards of classic Information Retrieval research, the roughly 60,000 stories were minuscule. But the effort to gather and annotate the stories was substantial. Stories were gathered from two television sources, two radio sources, and two newswire sources, in roughly equal numbers of stories per source. The corpus covers January through June of 1998. For the audio sources, the information was gathered as an audio recording, as closed captioned transcription, and as recognized speech from a speech recognition system provided by Dragon. Recognizing speech on roughly 36,000 stories (630 hours) of broadcast news was an impressive feat given existing technology, and provided a major limiting factor on the collection size.

The Linguistic Data Consortium (LDC) developed the TDT-2 corpora, defining true story segmentation, identifying news topics and their definition, and labeling news stories vis-a-vis topics. Approximately 100 news topics were identified by random selection, their scope was defined using carefully crafted "rules of interpretation," and they were labeled by teams of trained relevance assessors who read every story for each of the topics. The exhaustive relevance assessments were clearly another limiting factor on the collection size.

The TDT-2 research teams were given the first four months of TDT-2 news along with approximately 2/3 of the tagged news topics, to use as training. The
final two months and corresponding news topics were held out to be used as blind evaluation material at the end of the project.

The results of the TDT-2 effort were similar to those of the TDT-1 pilot study. Tracking was deemed to be essentially solved—there remains work to be done, but the accuracy is quite good. The same was said about the segmentation task. Detection, on the other hand, clearly needed more work. The ASR version of the corpus allowed the community to do contrastive studies on ASR and closed-caption versions of the corpus. Tracking and segmentation did not appear to be affected by ASR errors, but detection was. The full evaluation confirmed the trends that the pilot study suggested, and showed that substantially more work could be done. So began TDT-3.
1.3 TDT-3, MULTI-LINGUAL TDT
The most recent portion of the TDT initiative is the TDT-3 effort, running throughout 1999. TDT-3 is different from TDT-2 in the range of event-based organizational tasks considered, the introduction of multi-lingual sources, and a new evaluation corpus. More specifically:

1. TDT-3 includes a group of Chinese newswire and radio sources. The goal is to perform all TDT tasks across the languages, not just within each language. That is, the tracking task might be given Nt = 4 sample training stories in Chinese, and be expected to find all subsequent stories, whether they are in English or Chinese.

2. The entire six months of TDT-2 data is used as the training data for TDT-3 (augmented with some Chinese news stories and topic judgments). The evaluation set is an additional three months of news from the end of 1998, across numerous sources.

3. A new task called Story Link Detection was created, wherein a system is presented with two news stories and is asked to determine whether or not the two stories discuss the same news topic. The motivation for introducing this new task was to provide a base task out of which all the others could be created.

4. The first story detection task (called on-line detection in the pilot study) was reintroduced. The event clustering task (called "detection" in TDT-3) was maintained as before.

Preliminary results from TDT-3 suggest that the cross-language component degrades effectiveness somewhat. Effectiveness on the other tasks is similar to the TDT-2 and TDT-1 versions. Results from TDT-3 will be reported in the spring of 2000.
1.4 TDT AT THE CIIR
The CIIR has participated in all three phases of TDT. We spent considerable effort on story segmentation in the pilot study (Ponte and Croft, 1997) but did not explore that problem further—although it was not solved, we did not feel that Information Retrieval techniques would be of greater help. Similarly, we have not expended substantial effort on the tracking task. We feel that its similarity to the Information Filtering and Routing problems makes it equally likely that progress will come from the thriving activity at TREC (Voorhees and Harman, 1998).

The CIIR has instead focused most of its energies on forms of the Detection tasks—specifically first story detection and event clustering. In the remainder of this study, we discuss our efforts to explore the limits of traditional Information Retrieval approaches to solve those problems. We will work within the TDT-1 and TDT-2 frameworks throughout, using the corpora and relevance judgments created for those efforts.

Our approach to the problem of first story detection is strongly related to the problem of on-line event clustering. In particular, one approach to first story detection is to cluster the story stream, and to return to the user the story in each cluster with the lowest time-stamp. Assuming this approach is effective, the problem of first story detection becomes an issue of finding a clustering approach that works well in an on-line environment.
2 ON-LINE CLUSTERING ALGORITHMS
As mentioned above, our solution to first story detection is related to the problem of event clustering, or on-line story clustering. Previous clustering work has been done primarily in a retrospective environment where all the stories are available to the process before clustering begins. In this study we focus on an on-line solution to first story detection—i.e., where the stories arrive a few at a time. We also re-evaluate some of the common approaches to retrospective clustering and analyze their effectiveness in an on-line environment. In this context, the goal of clustering—i.e., grouping news stories—is to gather together those stories that discuss the same news topic. Clustering typically involves these steps: converting stories to a vector of weighted features that represents the content of the story, comparing stories and clusters to one another, and applying a threshold to determine if clusters are sufficiently similar that they should be merged. We refer to the combination of vector and threshold as a classifier. An overview of our use of clustering follows (it is explained in more detail below). For each story, we formulate a separate fixed-length classifier from the most frequent words in each story (excluding stopwords). A classifier’s initial threshold is its similarity value when compared to the story from which
it was created. We assume no subsequent story will exceed this threshold, and so we use it as an initial estimate for the threshold. As new stories arrive on the stream, they are compared to previously formulated classifiers, and clusters are formed based on a comparison strategy. A separate threshold is estimated for each classifier, and stories on the stream that have similarity exceeding the threshold are classified as positive instances of an event, that is, the contents of the story are assumed to discuss the same topic as the story with which the classifier was formulated.
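The overall single-pass loop can be summarized in a short sketch. The helper names (make_classifier, decision, combine) are placeholders for the components defined in the remainder of this section and sketched below; this is an illustration of the process, not the exact implementation used in the experiments.

```python
def online_clustering(stream, make_classifier, decision, combine):
    """Single-pass event clustering: each arriving story joins the
    best-matching cluster, or seeds a new one if no decision score
    is positive."""
    clusters = []  # each cluster is a list of classifiers
    for story in stream:
        best_cluster, best_score = None, 0.0
        for cluster in clusters:
            # combine() applies a comparison strategy, e.g. the maximum
            # of the per-classifier decision scores for single-link
            score = combine([decision(q, story) for q in cluster])
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([make_classifier(story)])  # new cluster seed
        else:
            best_cluster.append(make_classifier(story))
    return clusters
```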
2.1 RELATED CLUSTERING WORK
Several issues pertaining to story clustering have been explored using the agglomerative hierarchical clustering approach (Willett, 1998). Probabilistic approaches to clustering have emerged (van Rijsbergen, 1979; Croft and Harper, 1979; Allan et al., 1998a; Walls et al., 1999), as well as Vector Space Model approaches (Voorhees, 1985; Willett, 1998; Salton, 1989). Most of those algorithms assume that all the stories are available before clustering begins. In an on-line environment, stories are processed sequentially, and each story is either placed into an existing cluster or initiates a new cluster; therefore, many aspects of retrospective approaches are not directly applicable to on-line story processing. In addition, several algorithms require specification of the number of clusters to generate before the clustering process begins (Can and Ozkarahan, 1990). However, these solutions are not applicable to event clustering since we do not know a priori the number of topics that will be encountered during processing. Some work has been done in single-pass clustering where the goal is to build a cluster by making a single pass through the data. In such a setting, the final set of clusters depends greatly upon the order in which the stories are processed, and that has typically been viewed as a disadvantage (van Rijsbergen, 1979). In the TDT setting, however, the order of stories is pre-determined and so its impact on the final clustering is anticipated.
2.2 CREATING STORY VECTORS
Recall that a first step in clustering is to transform stories into a set of weighted features. We represent each story by a vector of features weighted by the INQUERY (Broglio et al., 1994) tf.idf function, where

tf_k = t / (t + 0.5 + 1.5 · (dl / avg_dl))  (4.1)

where t is the number of times the lexical feature appears in the story, dl is the story's length in words, and avg_dl is the average number of terms in a story. (Equation 4.1 is a variant of the function originally introduced by Robertson et al., 1995.) Also,

idf_k = log((C + 0.5) / df) / log(C + 1)  (4.2)

where C is the number of stories in the auxiliary corpus, and df is the number of stories in which the term appears. Both C and df are derived from an auxiliary corpus since they cannot be known in advance in the on-line setting. If the term does not appear in the auxiliary corpus, a default value of 1 is used for df. During processing, a classifier is represented using just the tf information (Equation 4.1). An individual story (when it is compared to existing classifiers) is represented at time j as:

d_j,k = 0.4 + 0.6 · tf_k · idf_k  (4.3)

where k is the index of the word co-occurring in the classifier and story, and idf is calculated using Equation 4.2. Since future word occurrence statistics are unknown in real-time applications, the number of stories within which a word will appear is unknown. Several solutions to this problem have been tested by TDT participants—e.g., estimating df from an auxiliary corpus (Papka et al., 1999), using the df from the current stream (Schultz and Liberman, 1999), or some combination. We continue to estimate df for Equation 4.2 from an auxiliary corpus in a similar domain.

In previous experiments, we evaluated different sources for df and their resulting impact on classification accuracy, and found that idf calculated using story frequencies from TREC volumes 1, 2, and 3, combined with the TREC-4 routing volume ("TREC 123R4"), was a more effective source for df than news-only stories from TREC 123R4 as well as an independent CNN broadcast news corpus. Also, TREC 123R4 was a more effective source for story frequency (the value of df) than using the entire corpus being processed. In general, the overall changes in effectiveness between using different sources for idf and not using idf at all were small at best. Our experience using incremental idf for tracking in the TDT pilot study (Allan et al., 1998b) suggests that constantly updating story frequency can be a useful statistic, though we are unable to use it consistently well.
2.3 COMPARING CLUSTERS
The methodology for comparing a story to a cluster or the contents of two clusters has the greatest effect on the resulting grouping of stories. For example, in agglomerative hierarchical clustering algorithms three common strategies for combining similarity values are known as single-link, complete-link, and group-average (average-link) clustering (van Rijsbergen, 1979; Salton, 1989). In general, the three strategies lead to different clusterings (Voorhees, 1985).
Each of those strategies is associated with retrospective clustering algorithms, and in particular with agglomerative hierarchical clustering, where a similarity matrix between all stories is available before processing begins. In the experiments to follow, we test these strategies in the context of on-line clustering. Each strategy, when applied to the on-line environment, results in a different grouping of stories than when the same strategy is applied to a retrospective environment.
Figure 4.1 On-line Cluster Comparison Strategies.
In the on-line clustering environment, each story is analyzed as it arrives and is either placed into an existing cluster or initiates a new cluster—and thus becomes a cluster seed. Figure 4.1 illustrates the differences the three comparison strategies can have on the classification of a story. In the figure, the current story is C, and six stories have been processed and assigned to clusters R, B, and G. The similarities between C and each of the six stories are represented by the numbers in the clusters, ranging from 0.05 to 0.95 (the higher the number, the more similar). Each strategy assigns a different comparison value for C against the three existing clusters, as shown in the bottom of the figure. For single-link, a cluster is compared to C using the maximum similarity. For complete-link, it is compared using the minimum similarity, and for average-link, the average of all similarities is used. The boxes at the bottom of the figure provide the comparison values for each strategy and indicate which cluster is chosen for that strategy. This process would then continue with the next arriving story.
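The three strategies reduce to how the per-story similarities within a cluster are combined; a sketch:

```python
def combine(scores, strategy):
    """Combine the similarities between the current story and every
    story already assigned to one cluster (the numbers inside a
    circle in Figure 4.1)."""
    if strategy == "single-link":    # most similar cluster member
        return max(scores)
    if strategy == "complete-link":  # least similar cluster member
        return min(scores)
    if strategy == "average-link":   # mean over all cluster members
        return sum(scores) / len(scores)
    raise ValueError(strategy)
```

For a cluster whose members score, say, 0.35 and 0.95 against C, single-link reports 0.95, complete-link 0.35, and average-link 0.65, which is why the three strategies can route the same story to different clusters.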
A similarity value is calculated when comparing a classifier to a story. In the experiments that follow, the similarity between a classifier q formulated at time i when compared to a story arriving on the stream at time j is calculated using the weighted sum operator of INQUERY, which is defined as

sim(q_i, d_j) = (Σ_k q_i,k · d_j,k) / (Σ_k q_i,k)  (4.4)

A story is assumed to discuss the topic represented by the classifier if its similarity to the classifier exceeds a certain threshold.
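A sketch of Equation 4.4, treating a classifier as a mapping from features to tf weights and a story as a mapping from features to the belief scores of Equation 4.3. The 0.4 default for features absent from the story is our assumption, consistent with the 0.4 baseline in Equation 4.3.

```python
def weighted_sum(classifier, story):
    # Equation 4.4: weighted sum over the classifier's features;
    # features absent from the story contribute the minimum belief 0.4
    total = sum(w * story.get(k, 0.4) for k, w in classifier.items())
    return total / sum(classifier.values())
```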
2.4 THRESHOLDS FOR MERGING
In addition to the cluster comparison strategy, a thresholding methodology is needed that affects the decision for generating a new cluster. When we use clustering for first story detection, choosing this threshold is important. From the example in Figure 4.1, choosing a threshold of 0.5 would result in story C initiating a new cluster using the on-line complete-link strategy. The other strategies result in cluster comparison values that are above this threshold, and thus C is assigned to an existing cluster; however, a high enough threshold (e.g., 0.97) would result in C creating a new cluster for all of the strategies.

Time-based thresholds. A side-effect of the timely nature of broadcast news is that stories closer together on the news stream are more likely to discuss related topics than stories farther apart on the stream. When a significant new event occurs, there are usually several stories per day discussing it; over time, coverage of old events is displaced by more recent events. Figure 4.2, for example, depicts the number of stories per day that arrived on CNN broadcast news and AP newswire pertaining to two news topics: an earthquake in Japan and the downing of a US Air Force pilot over Bosnia. Most of the stories pertaining to each of these topics occurred within 10 days, and this appears to be the pattern for many of the topics in the TDT corpora. Therefore, one hypothesis about the domain of broadcast news is that exploiting the time between stories will lead to improved classification accuracy.

Figure 4.2 Daily story counts for two TDT-1 topics.

The threshold model we use is a linear function that controls similarity based on the number of days between the formulation of the classifier and the arrival of the story. During processing, a classifier's actual threshold is recomputed at each time increment—i.e., each time a new story arrives on the stream. For any classifier formulated at time i, its threshold for a story arriving at a later time j is

threshold(q_i, d_j) = 0.4 + T · (sim(q_i, d_i) − 0.4) + E · (date_j − date_i)  (4.5)

where sim(q_i, d_i) is the similarity value between the classifier and the story from which it was formulated (Equation 4.4), and the constant 0.4 is an INQUERY parameter. The value of (date_j − date_i) is the number of days between the arrival of story d_j and the formulation of classifier q_i, and we use the global system parameter E to control the effects of this value. The values for T and E control classification decisions, and our method for finding appropriate settings is discussed below.

Decision Scores. The threshold model described above is what decides for each classifier whether a story is a positive instance. Our confidence in the decision that story d_j is a positive instance of classifier q_i is the extent to which the story exceeds the classifier's threshold. In what follows, the score we use between a story and a classifier is

decision(q_i, d_j) = sim(q_i, d_j) − threshold(q_i, d_j).  (4.6)

A decision score greater than zero implies that d_j is a positive instance of the topic represented by q_i, and it also implies that stories d_i and d_j are similar. In many clustering approaches, the similarity between d_i and d_j is symmetric. Since the stories and classifiers have different representations—d_i is not the same as q_i—the similarity between stories under our approach to clustering is not necessarily symmetric.
3 EXPERIMENTAL SETTING

3.1 DATA
One of the main contributions of the TDT research effort is the creation of event-based corpora comprising newswire and broadcast news sources (discussed earlier). Stories were judged relevant to event-based topics using a ternary scale: non-relevant, having content relevant to the event, or containing only a brief mention of the event in a generally non-relevant story. Consistent with TDT methodology, we remove brief mentions from processing and measure classification using the relevant and non-relevant stories only. The news collected for TDT-1 and TDT-2 is divided into four sets, depicted in Table 4.1.

Table 4.1 TDT corpora and topic information.

corpus      | Dates         | sources | stories | Words/story | events | rels | briefs
TDT-1       | 07/94 - 06/95 | 2       | 15863   | 460         | 25     | 1132 | 250
TDT-2 Train | 01/98 - 02/98 | 6       | 20404   | 341         | 35     | 4159 | 1103
TDT-2 Dev   | 03/98 - 04/98 | 6       | 20462   | 314         | 25     | 608  | 78
TDT-2 Eval  | 05/98 - 06/98 | 6       | 22443   | 333         | 34     | 1883 | 472

3.2 EVALUATION MEASURES
Text classification effectiveness is often based on two measures. It is common for Information Retrieval experiments to be evaluated in terms of recall and precision, where recall is the percent of relevant instances classified correctly, and precision is the percent of relevant instances in the set of stories returned to the user. In TDT, system error rates are used to evaluate text classification. These errors are system misses and false alarms, and the accuracy of a system improves when both types of errors decline. In first story detection, misses occur when the system does not detect a new event, and false alarms occur when the system indicates a story contains a new event when in truth it does not.

It is often desirable to have one measure of effectiveness for cross-system comparisons, or to tune a system for maximum effectiveness. Unfortunately, no measure above uniquely determines the overall effectiveness characteristics of a classification system. Several definitions for single-valued measures have been proposed, and many are reviewed by van Rijsbergen, 1979. One prevalent approach is to evaluate text classification using the F1-measure (Lewis and Gale, 1994), which combines recall and precision as 2PR/(P + R). In TDT-2, a cost function was used to analyze detection effectiveness. The general form of the TDT cost function is

Cost = cost_fa · P(fa) · (1 − P(topic)) + cost_m · P(m) · P(topic)  (4.7)

where P(fa) is the system false alarm rate, P(m) is the miss probability, and P(topic) is the prior probability that a story is relevant to a topic. In TDT-2, cost was defined with P(topic) = 0.02, and the constants cost_fa = cost_m = 1.0.

Because only 25-35 topics in each corpus were judged, we use an evaluation methodology for first story detection that expands the number of experimental trials (this approach was developed for the TDT pilot study). The methodology uses 11 passes through the stream of stories. Assuming that there are 25 topics available for a particular experiment, the goal of the first pass is to mark the first story of each of the 25 topics as "first" and the rest as "not first." Systems process all of the unjudged stories, too, but are not evaluated on their effectiveness on those stories. The second pass excludes those 25 first stories and the goal is to detect the new first story (previously the second story) for each of the 25 topics. The process repeats to skip up to 10 stories for each topic. If a topic contains fewer stories than the number of stories to be skipped in the pass, the topic is excluded from evaluation in that pass.

Most of our experiments used the holdout method (Cohen, 1995; Kohavi, 1995) for parameter setting. Here, the topics from the TDT-1, TDT-2-Train, and TDT-2-Development corpora are used for training, and testing is done on the topics and stories from the TDT-2-Evaluation corpus. We chose a two-tailed sign test with confidence α = 0.05 to determine significance.
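A sketch of the two single-valued measures used throughout this chapter; the TDT-2 constants appear as default arguments:

```python
def f1_measure(precision, recall):
    # F1 combines recall and precision as 2PR / (P + R)
    return 2 * precision * recall / (precision + recall)

def tdt2_cost(p_miss, p_fa, p_topic=0.02, cost_miss=1.0, cost_fa=1.0):
    # Equation 4.7 with the TDT-2 constants as defaults
    return cost_fa * p_fa * (1 - p_topic) + cost_miss * p_miss * p_topic
```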
4 EVENT CLUSTERING
In this section, we test our classifier approach for event clustering. We evaluate the on-line versions of single- and average-link strategies. We also test the effectiveness of the time component (E > 0) in the threshold model.

We evaluate event clustering effectiveness using a standard TDT methodology: we match each truth cluster (topic) with the generated cluster that contains the most matching stories, considering the extraneous stories if necessary. That is, we maximize recall and then use precision within the generated cluster to break ties. The match between truth and generated clusters can then be compared using recall, precision, and F1-measures.

The parameter optimization process involves finding appropriate values for T and E for our threshold model (Equation 4.5). We tested a range of values for each parameter and vector dimensionality over the training topics with known judgments. In the experiments that follow, we used the TDT-1 corpus as well as the automatic speech recognition (ASR) versions of the TDT-2-Train and TDT-2-Development corpora to find parameters. We evaluated the effectiveness of our systems using the ASR version of the TDT-2 evaluation corpus and its topics.

During training, we found that the average-link strategy was comparable in effectiveness to the single-link strategy with the time-based threshold. Both of those strategies were more effective than the single-link strategy without the time factor. No particular classifier dimensionality appears to yield a significant increase in effectiveness. For example, on the TDT-1 corpus using 50-feature classifiers, the on-line average-link approach had an optimal F1 of 0.81, and using 200 features, the on-line single-link+time strategy had an optimal F1 of 0.79.
These measures are comparable to others reported for on-line clustering on this corpus (Yang et al., 1998; Allan et al., 1998a).

The TDT-2 corpora contain news headers and trailers that are not full news stories (e.g., "up next..."). Those short "stories" were omitted from the evaluation—i.e., they were not considered part of the topic, let alone a first story—so it was important that we detect them. We selected the simple strategy of assuming that any story containing fewer than 55 words was a header or trailer, and treating them as if they were not in the collection. (They were assigned a system score indicating non-relevance to all topics.)

We selected thresholds for event clustering by choosing the parameter values that gave optimal pooled average F1-measures in the training data. The evaluation was run using those parameters on the TDT-2-Evaluation corpus. It resulted in the values reported in Table 4.2. Optimizing for TDT-2 cost results in similar numbers, though the cost values are slightly lower, not surprisingly.

Table 4.2 Event clustering results on TDT-2-Evaluation corpus.

On-line Strategy | F1 (SW) | F1 (TW) | Cost (SW) | Cost (TW)
single-link+time | 0.61    | 0.71    | 0.0056    | 0.0062
single-link      | 0.54    | 0.71    | 0.0077    | 0.0072
average-link     | 0.58    | 0.68    | 0.0108    | 0.0062
In Table 4.2, pooled or story-weighted (SW) measures, as well as the mean or topic-weighted (TW) measures, are reported for both F1 and TDT-2 cost. The on-line single-link+time approach appears to cluster event-based topics more effectively than the on-line single-link strategy, which suggests that using the time component of the threshold model is effective. In addition, the single-link+time approach appeared to be the most effective in terms of pooled F1-measure and even more so in terms of pooled TDT-2 cost. However, the average-link approach resulted in the same topic-weighted cost as the single-link+time approach, which suggests that both methods are effective for event clustering.1 These results are comparable to those obtained by other sites participating in the TDT-2 evaluation (DARPA, 1999).

1In TDT-3, new evaluation measures for the detection task are being considered that may provide more insight into the quality of the resulting clusters. There is concern that the matching process between truth and generated clusters does not appropriately measure the capabilities of a system.

We are interested in how event clustering can help first story detection. In the context of clustering news, the problem of first story detection is to find the first story in each cluster—i.e., the stories that become cluster seeds. Our results suggest that, on average, on-line event clustering puts most of the stories
about a topic in the appropriate cluster. In the next section we evaluate on-line clustering as a basis for first story detection, and determine how well different comparison methodologies produce cluster seeds.
5 FIRST STORY DETECTION
Recall that the problem of first story detection is to identify the stories in a stream of news that contain discussion of a new topic, that is, a topic whose event has not been previously reported. In this section, we present on-line solutions to first story detection in which the system indicates whether the current news story contains or does not contain discussion of a new topic before processing the subsequent story. In what follows, we describe the details of our algorithm and experimental results using the corpora and evaluation methodology developed as part of the TDT initiative.

The motivation for our approaches to the problem is to incorporate the salient properties of broadcast news. In particular, we identify the property of time as a distinguishing feature of this domain. We posit that modeling the temporal relationship between stories should result in improved classification. Our event clustering results suggest that this hypothesis is true, and we showed that our approach to event clustering is more effective when this temporal relationship is modeled.

Another property of news is that its content includes the names of people, places, dates, and things, i.e., the who, what, when, and where that are the focus of a topic. Our intuition is that the words in proper noun phrases are important to include in a classifier when using a word co-occurrence model for text classification. We test this intuition by augmenting the classifier formulation process with a natural language parser that finds proper noun phrases in each story. Our results suggest that identifying these phrases leads to improved classification.

In what follows, we use the same representation and classification model we used for event clustering. The contents of each story are formulated into a classifier which is compared to subsequent stories. Our approach to first story detection is that if no classifier comparison results in a positive classification decision for the current story, then the current story has content not previously encountered, and thus it contains discussion of a new topic. The main difference between our approach to first story detection and event clustering is that the emphasis is placed on finding the start of each topic, not grouping stories by topics.
5.1 FIRST STORY DETECTION ALGORITHM
We use a single-pass algorithm for first story detection using the representation and model described earlier for event clustering. We represent the content of each story, which we assume discusses some event-based news topic, as a
classifier. If any existing classifier results in a positive classification of the current story, the story is assumed to discuss the topic represented by the classifier; otherwise the current story contains a new topic. The approach we use for first story detection does not actually merge the stories in a cluster in any way. In our algorithm, a story contains a new topic if the story does not result in a positive decision score with respect to any of the existing classifiers—where there is one classifier for every earlier story.

Our first story detection algorithm is therefore similar to the on-line single-link+time strategy for event clustering that we evaluated above. The data in those experiments suggested that different comparison strategies result in different groupings of stories. This implies that the number of clusters, and thus the number of new topics identified, is different. If we view the creation of a cluster seed as equivalent to detecting a new topic, we find that on-line single-link strategies tend to return more new topics than on-line average-link strategies at optimal parameter settings, which is advantageous in terms of misses, but not for false alarms. In the sections that follow, we compare event clustering strategies and evaluate their effectiveness as approaches to first story detection.
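The resulting single-pass algorithm is compact; a sketch, reusing the decision score of Equation 4.6 (make_classifier and decision are placeholders for the components described above):

```python
def first_story_detection(stream, make_classifier, decision):
    """Flag each arriving story as new if no earlier story's classifier
    produces a positive decision score for it."""
    classifiers = []
    labels = []
    for story in stream:
        is_new = all(decision(q, story) <= 0 for q in classifiers)
        labels.append((story, "new" if is_new else "old"))
        classifiers.append(make_classifier(story))  # one per story
    return labels
```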
5.2 FIRST STORY DETECTION EXPERIMENTS
In this section, we evaluate our first story detection system varying the comparison strategies between stories and classifiers, and varying the dimensionality of the classifiers. We ran a set of experiments analogous to those we ran for event clustering, in which we compared on-line single-link strategies and on-line average-link strategies. Recall that in clustering, comparison values are determined for each cluster using the maximum or average decision scores resulting from the current story being processed. The classifier for the current story becomes a member of the cluster with highest comparison value. If no existing cluster results in a positive comparison value, then the story's classifier initiates a new cluster and the story is assumed to discuss a new topic.

At optimal average effectiveness, first story detection is similar using different numbers of features for classifiers. However, a topic-level comparison across dimensionality revealed that some classifiers benefit from using fewer features while others benefit from more. This analysis also suggested that choosing the appropriate dimensionality for each topic would yield over 50% improvements in F1-measure on the TDT-2-Train and TDT-2-Development corpora. We have yet to find a methodology for determining the appropriate dimensionality for an individual topic automatically, and we look forward to studying this problem in our future work.

Our training runs suggested that the on-line single-link strategy is better at detecting new event-based topics when E > 0, that is, when the time component of the threshold model is used. The on-line single-link strategies, in general,
appear to be more effective than the average-link strategy using optimal parameters. However, all the strategies had lower accuracy on the TDT-2-Train corpus. We believe this was caused by the "Monica Lewinsky Scandal" and the "Asian Financial Crisis" topics, two topics that contained heavy news coverage comprising 10% of the stories in the corpus.

We evaluated our first story detection system on the ASR version of the TDT-2-Evaluation corpus, and the pooled-average results from the 11-pass evaluation methodology are listed in Table 4.3. We used the parameters that on average gave rise to optimal effectiveness in the training data, using 50-feature classifiers. We set T = 0.2 and E = 0.0005 when time is factored into the threshold model, and T = 0.3 and E = 0 otherwise. The results from Table 4.3 indicate a 7.1% improvement in F1-measure, and a 15.6% improvement in TDT-2 cost when the temporal relationship between stories is modeled.2 This suggests the desirability of using the time component of the threshold model for an actual application.

Table 4.3 Comparison of clustering strategies for first story detection (pooled averages, TDT-2-Evaluation corpus, ASR and newswire text).

On-line Strategy | Miss Rate | F/A Rate | Recall | Prec | F1   | TDT-2 Cost
single-link+time | 66%       | 1.69%    | 34%    | 26%  | 0.30 | 0.0297
single-link      | 63%       | 2.31%    | 37%    | 22%  | 0.28 | 0.0352
average-link     | 73%       | 4.22%    | 27%    | 10%  | 0.15 | 0.0559
We also tested the comparison strategies and parameter settings that gave rise to the optimal event clustering results. In the following experiments we evaluate event clustering as the first story detection task. The threshold parameters for these experiments are those obtained from the optimization process for event clustering using F1 as the target utility measure. We used T = 0.26 and E = 0.001 for the on-line single-link+time strategy, and T = 0.33 and E = 0 for the on-line single-link strategy. For on-line average-link, T = 0.15. The results for these experiments are listed in Table 4.4.

2The TDT cost measures in Tables 4.3 through 4.7 are worse than those obtained by a heuristic that decides that no story contains a new topic. For example, this simple approach would yield an F1-measure of 0.0, but a cost of 0.0200 using the TDT-2 cost function parameters applied to Equation 4.7. It should be noted that the cost function parameters were intended for evaluating TDT tracking and detection data, and that different parameters are currently being considered in TDT-3 to address this issue.

The parameters for the on-line clustering strategies were different, and generally lower for first story detection, and we see an improvement in miss rates,
Table 4.4 First story detection using good parameters for event clustering (pooled averages, TDT-2-Evaluation corpus, ASR and newswire text).

On-line Strategy | Miss Rate | F/A Rate | Recall | Prec | F1   | TDT-2 Cost
single-link+time | 41%       | 7.50%    | 59%    | 12%  | 0.20 | 0.0818
single-link      | 35%       | 10.09%   | 65%    | 10%  | 0.17 | 0.1059
average-link     | 65%       | 6.53%    | 35%    | 9%   | 0.14 | 0.0769
but a significant increase in false alarm rates when applying the parameters optimized for event clustering. The improvement in F1 and the decrease in cost resulting from using parameters optimized for first story detection are listed in Table 4.5.

Table 4.5 Percent improvement using best first story detection threshold parameters over best event clustering parameters.

On-line Strategy | F1    | TDT-2 Cost
single-link+time | 50.0% | -63.7%
single-link      | 64.7% | -66.8%
average-link     | 7.1%  | -27.3%
The data in Tables 4.3 to 4.5 suggest that, on average, first story detection improved when different threshold parameters were determined explicitly for the task. These data suggest that good clustering strategies do not necessarily lead to effective first story detection: each task places different emphasis on whether or not the first story is identified. Recall that the on-line average-link strategy was shown to be an effective event clustering strategy, but does not result in finding many new event-based topics. In addition, the on-line single-link strategies need different parameter settings for event clustering and first story detection.
5.3 IMPACT OF ASR TECHNOLOGY
In the following experiments, we evaluate the effects of using text from the automatic speech recognition (ASR) process as compared to human-generated transcripts from closed captioning (CCAP). The ASR process has an expected word error rate of 15%. We replaced the ASR transcriptions with the cleaner CCAP transcriptions. The newswire (NWT) stories remained the same. We used the same parameters for the CCAP+NWT data that were used on the
ASR+NWT data above. The pooled-average results from the 11-pass evaluation methodology are listed in Table 4.6 below.

Table 4.6 Comparison of clustering methodologies for first story detection (pooled averages, TDT-2-Evaluation corpus with closed caption source).

On-line Strategy | Miss Rate | F/A Rate | Recall | Prec | F1   | TDT-2 Cost
single-link+time | 61%       | 1.53%    | 39%    | 31%  | 0.35 | 0.0271
single-link      | 61%       | 1.73%    | 39%    | 29%  | 0.33 | 0.0291
average-link     | 73%       | 3.02%    | 27%    | 14%  | 0.18 | 0.0442
The on-line single-link strategies were more effective than the on-line average-link strategy for first story detection for both the CCAP and ASR sources. Furthermore, using the on-line single-link+time strategy gives rise to the most effective pooled-average results for F1-measure and TDT-2 cost. In Table 4.7, we summarize the percent improvements in effectiveness realized by replacing ASR sources with CCAP sources. For both F1-measure and TDT-2 cost, the CCAP sources result in relatively high percent improvements in effectiveness over the ASR sources for both on-line single- and average-link strategies. In general, we would expect the ASR technology to give rise to more out-of-vocabulary (OOV) words than the manual transcription process, and thus we expect the OOV words to be the cause of an increase in classification error. This analysis suggests that the ASR technology hinders new topic classification using our approaches.

Table 4.7 Percent improvement using CCAP vs. ASR transcriptions.

On-line Strategy | F1    | TDT-2 Cost
single-link+time | 16.7% | -9.4%
single-link      | 17.8% | -17.3%
average-link     | 20.0% | -20.9%

5.4 USING PHRASES
A retrieval system that uses single-word features has its drawbacks. For example, a user interested in stories about The World Bank is not necessarily interested in retrieving stories about the merger that created the world’s largest bank. In situations such as that, specifying the query as a phrase would assist in discriminating between relevant and non-relevant stories. Previous research
in story classification extends the feature space by extracting natural language phrases and more general multi-word features. The utility of multi-word features and their effects on retrieval have gotten mixed reviews in the previous literature. Lewis has documented some of the earlier work pertaining to representation and ambiguity issues arising from the use of phrases. He shows that "[t]he optimal effectiveness of a text representation based on using simple noun phrases... will be less than that of a word-based representation" (Lewis, 1991). In addition, TREC routing and filtering systems using multi-word features do not appear to significantly outperform systems that use only single-word features (Robertson et al., 1995).

These negative conclusions regarding multi-word features are offset by several positive results that have been reported. For example, Fagan reports 2% to 22% improvements in average precision using phrasal indexing (Fagan, 1987). Strzalkowski and Carballo, 1996, describe improvements in ad hoc retrieval using natural language phrases. Boolean features comprising multiple words have been used to improve precision by Hearst and Pedersen, 1996. Papka and Allan, 1998, showed that in the context of massive query expansion (queries with several hundred features), retrieval effectiveness improves by using proximity operators on pairs of words.

There are several methods for extracting phrases. N-gram models based on mutual information metrics are used to find sets of adjacent words that are likely to co-occur within sentences (Brown et al., 1990). Part-of-speech tagging using pre-specified syntactic templates or more complex parsing (Fagan, 1987; Tzoukermann et al., 1997) gives rise to related multiple words comprising noun, verb, and prepositional phrases. Riloff and Lehnert, 1994, use information extraction techniques that build multi-word features as an integral part of their message understanding system.

In what follows we embed a statistical natural language parser into our first story detection system (Charniak, 1996; Charniak, 1999). We use the parser for its ability to detect proper nouns and dates. Our goal was to select features associated with the who, what, where, and when of the story. For example, in the "O.J. Simpson murder trial" topic, we might get the names O. J. Simpson and Nicole Brown Simpson, locations such as Chicago and the Brentwood section, as well as dates and the names of objects.

We tried a series of experiments on the TDT-1 data where we parsed each story and extracted proper nouns and dates. Our hypothesis is that if proper noun phrases and dates are important lexical features, then increasing their weight in a classifier should result in an effectiveness increase for first story detection, and decreasing their weights should result in an effectiveness decrease. We augmented the classifier formulation process using single-word lexical features. After formulating a classifier for each story, we identified the single-word features that were contained in the proper noun phrases contained in the
story from which it was formulated. We initially set the feature weight using the usual weight assignment, and then we increased and decreased the feature weights only for words that appeared in the proper noun phrases collected from the parse. For an experiment using the TDT-1 corpus3 and 200-feature classifiers, we determined that 47% of the words that were contained in the proper noun phrases extracted from the parse appeared in the classifiers. In addition, 21% of the lexical features in the classifiers mapped to words in the proper noun phrases.

In general, we saw that classification improved at false alarm levels below 5% when the weights of the lexical features in the proper noun phrases are doubled. When the weights of these features are halved, classification tends to get worse at false alarm levels below 5%. Unfortunately this improvement does not appear to be stable throughout the entire range of false alarm values. We also observed that effectiveness was reduced when all the words associated with proper noun phrases were removed from the classifier by setting their weights to zero. We saw no further gains from quadrupling weights. We also tested this approach using 25-feature classifiers, and found no apparent improvement from increasing weights in the features mapping to the words of proper noun phrases.

Several feature selection issues associated with natural language phrases became apparent from this experiment. For example, we found that many proper noun phrases were already represented in the classifier. Another problem is that in broadcast news there are many references to news correspondents' names. Reporters use their own names as communication cues within live coverage to signal transitions, so a reporter's name that appears often in the text is likely to be selected for the classifier. It could be that a reporter's name indicates the type of coverage (e.g., Wolf Blitzer often covered politics from the White House) but in general, names of reporters are not good features, and should be removed from a classifier. In addition, there are many dates and proper noun phrases that have no significant bearing on the content of the story, so using all of these features blindly will most likely hurt rather than help effectiveness. Even if these problems were solved, we believe that a more effective use of the parse data needs to be established in order to show significant gains and justify the direct application of this technology to our first story detection system.
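A sketch of the reweighting step, assuming the parser has already produced the set of words appearing in a story's proper noun phrases; the function name and factor parameter are illustrative.

```python
def reweight_proper_nouns(classifier, proper_noun_words, factor=2.0):
    """Scale the weights of classifier features that appear in extracted
    proper noun phrases (factor=2.0 doubles them, 0.5 halves them,
    0.0 removes them, matching the variants tested above)."""
    return {feature: (weight * factor if feature in proper_noun_words
                      else weight)
            for feature, weight in classifier.items()}
```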
3Our attempts to use the parser on the TDT-2 corpora failed. The major problem was that the automatic speech recognition data lacked the punctuation needed for the parser to work. The closed caption data, on the other hand, had too much punctuation. We found many spurious periods that appeared to be used to indicate a pause in the flow of the news story. This additional punctuation also prevented the parser from working.
6 DISCUSSION OF FIRST STORY DETECTION
We presented our algorithm for first story detection, and analyzed extensions to our classification model that incorporate the properties of broadcast news. In particular, we showed a method for modeling the temporal relationship between stories, and a method for incorporating proper noun phrases into the classifier formulation process. Our results suggest that these extensions can be modeled efficiently and lead to improved classification accuracy for first story detection. We evaluated the use of on-line clustering as an approach to first story detection, and we found that the on-line single-link+time strategy appeared to be more effective than the other on-line clustering strategies tested. Our analysis suggests that good strategies and parameter settings for event clustering do not necessarily result in effective first story detection. In addition, we showed that the ASR technology negatively affects classification.
6.1 THE GOOD NEWS
The effectiveness of our approach to first story detection is sufficient for certain applications. We find that on the TDT-2 data, we correctly classify between 35% and 60% of the new event-based topics from broadcast news with relatively low false alarm rates. At the lowest reported false alarm rate of 3.2%, our results suggest that a user will need to read only 650 of the 20,000 stories available over a two-month period. This implies that our system can significantly reduce the workload of someone searching for new topics in large amounts of news.
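As a rough check of that workload figure (our arithmetic, not the chapter's): a false alarm rate of 3.2% over roughly 20,000 mostly non-first stories yields

\[ 0.032 \times 20{,}000 = 640 \]

false alarms, which together with the correctly flagged first stories accounts for the approximately 650 stories a user would need to read.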
6.2 THE BAD NEWS
Though our approach appears to be comparable to other approaches to first story detection, classification using these methods is far from perfect. From the TDT-2 assessment process, we know that assessors of the same topic were in agreement on 90% of the assessments (Graff, 1999), so we would like to see miss rates of 10% at much lower false alarm levels than the ones we measured. A major problem is that even with several extensions to the general word-co-occurrence model, the improvements we are experiencing from the various techniques are incremental. With the exception of modeling the temporal relationship between stories, other approaches, including average-link clustering and additional feature selection, did not appear to have a significant impact on first story detection effectiveness. Another major problem is the limitation of the word-co-occurrence model. When we analyze system misses, which occur when stories containing new topics are labeled as “not new”, we find that at low dimensionality, misses occur because important topic features are not appearing in the classifier, or
are not sufficiently weighted in the classifier. For example, we found that the first story about the “Crash of US Air Flight 427” resulted in a positive decision score from a classifier formulated for an earlier story about the “Crash of US Air Flight 1016”. The distinguishing lexical feature is the flight number, which does not always have a high term frequency in a relevant story. We had similar problems with other disaster events. For example, the first story discussing the “Oklahoma City bombing” resulted in a positive decision score from a classifier formulated from a story discussing the earlier “World Trade Center bombing”. At higher dimensionality, the two bombing topics were separable, but the airline crashes were not. However, as we mentioned previously, choosing the correct dimensionality automatically for a classifier is a hard problem. Other problems appear to be associated with topics that are heavily covered in the news. These topics often span several hundred days and contain significantly more stories than other event-based topics in the corpora. The more stories discussing a topic, the more its effectiveness is weighted in the pooled measures, and the greater the prior probability that false alarms will occur. For example, errors in the “Monica Lewinsky Case”, which has the most stories of all the topics in the TDT-2 corpora, are a major cause of poor effectiveness on the TDT-2-Train corpus. We also noticed that some domains, such as courtroom coverage, led to errors. For example, using the best parameters on the TDT-1 corpus, the system could not distinguish between stories from the “O. J. Simpson Trial” and stories pertaining to other court cases. Different topics in the same country are also problematic. For example, stories related to various topics in Bosnia caused our system to miss “Carter’s Visit to Bosnia”. These examples indicate that the system was unable to detect certain topics that are discussed in the news at different levels of granularity.
7 CONCLUSION
First story detection is an abstract story classification task that we have shown can be reasonably well solved using a single-pass (on-line) approach. Our algorithm for this problem is based on a notion of event identity. Our approach is to represent the event-based topics discussed in each story with a query and a threshold, which together form a classifier for the content of the story. When a story appearing on the news stream is not classified as a positive instance by any of the existing classifiers, we assume the story discusses a new topic. If the story is classified as a positive instance, then we assume the story is similar in content to a previously processed story, and therefore does not discuss a new topic. Our results indicate that we find between 35% and 60% of the stories discussing new topics at relatively low false alarm rates of 1.27%-3.2%.
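The single-pass decision logic just described can be sketched compactly. The similarity function, the fixed base threshold, the linear time adjustment, and the classifier construction below are illustrative placeholders: the chapter's actual classifiers use query formulation and a tuned, time-sensitive threshold model whose exact form is not reproduced here.

    def first_story_detection(stream, build_classifier, score,
                              base_threshold=0.2, time_slope=0.001):
        """Single-pass (on-line) first story detection with a time-adjusted
        threshold: story pairs far apart on the stream must be more similar
        to be judged the same topic.

        stream: iterable of (day, story_text) pairs in arrival order.
        build_classifier: makes a classifier (e.g., a weighted term vector).
        score: similarity between a classifier and a story; higher = closer.
        Returns indices of stories flagged as discussing new topics.
        """
        classifiers, new_story_ids = [], []
        for i, (day, story) in enumerate(stream):
            matched = any(
                score(clf, story) >= base_threshold + time_slope * (day - clf_day)
                for clf_day, clf in classifiers)
            if not matched:
                new_story_ids.append(i)      # no classifier fired: new topic
            # Every story spawns a classifier so later stories can match it.
            classifiers.append((day, build_classifier(story)))
        return new_story_ids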
Our approach to first story detection is very similar in nature to on-line clustering algorithms. The step in a clustering algorithm in which the decision is made to assign a story to an existing cluster or to initiate the formation of a new cluster is equivalent to deciding whether a story discusses a new topic. With this in mind, we explored different cluster comparison strategies, which included on-line single-, average-, and complete-link strategies. Each of these strategies gives rise to a different on-line clustering of the stories, and thus a different set of new topics. We found that our implementation of complete-link did not work well for either first story detection or event clustering. The average-link approach worked well for clustering but not for first story detection, while the on-line single-link+time strategy worked best for both tasks.

The motivation for exploiting the property of time was based on an analysis of the TDT corpora, which suggested that stories closer together on the stream were more likely to discuss the same topic than stories further apart. Time was incorporated as a component of our threshold model, which resulted in improved effectiveness for the on-line single-link strategy when applied to first story detection and event clustering.

In addition, we explored a process that extracts the lexical features that capture the who, what, when, and where contained in news. The process involved a natural language parser that was used to extract proper noun phrases and dates from the stories in the news stream. Experiments with the parser on the TDT-1 corpus indicated that many of the words in the proper noun phrases were already in the classifiers that were being formulated. When we increased the weights of the classifier features that appeared in proper noun phrases and dates, classification accuracy improved, and when we decreased their weights, classification accuracy declined. This suggests that identifying proper noun phrases and dates can help first story detection.

The automatic speech recognition (ASR) process impacted the effectiveness of our approach to first story detection and event clustering. We found that effectiveness improved by between 9.4% and 20.9% for both single- and average-link approaches when the ASR data was replaced by CCAP transcriptions. The impact of the ASR technology was minimal for the TDT Tracking task.

The majority of our experimental runs for all three problems involved an exhaustive search for system parameters, including classifier dimensionality and the parameters for our threshold model. Our current solution for first story detection uses constant dimensionality and two global threshold parameters. Our experiments indicated that a good set of values existed for these parameters in 80% of the topics tested. In addition, we found that if it had been possible to determine optimal dimensionality automatically, we would have realized an effectiveness improvement of over 50% in the classification of new topics.

Unlike ranked retrieval, true text classification tasks require that a system make hard decisions about a story’s relevance. Therefore, threshold parameter
estimation becomes an integral part of the classification approach. We found that the threshold parameters that optimized the pooled average effectiveness measure were similar across the TDT corpora, and thus an averaging of optimal parameters resulted in comparably effective classification relative to other systems evaluated at TDT-2. However, when we compared optimal threshold parameters that worked well for event clustering against those we determined for first story detection, we found that good parameters for first story detection were consistently lower than good parameters for event clustering. We therefore believe that a good clustering approach does not necessarily lead to a good solution for first story detection (at least, when “good” is measured using TDT’s cost functions). Additional evidence suggesting this hypothesis is true was our application of on-line average-link clustering, which was effective for event clustering, but not for first story detection.

We presented several experiments for the event clustering problem, where we compared single-pass clustering solutions that included on-line single-link and average-link cluster comparison strategies. The data suggest that augmenting on-line single-link clustering with a time component was the most effective approach when using automatic and manual transcriptions for broadcast news sources. Other clustering experiments suggested that average-link is comparable to single-link+time when audio sources for news are manually transcribed. However, our single-link+time approach resulted in the lowest story-weighted cost realized by the systems evaluated at TDT-2, and we suggest using single-link+time for on-line clustering of broadcast news because it is both efficient and effective.

In retrospect, the comparisons between our approaches and those of other TDT participants indicate that different views of the tasks lead to different retrieval models that nevertheless result in similar effectiveness. Our systems, in general, were comparable in effectiveness to the best systems for each of the problems, and we attribute this to our efforts in parameter estimation. In addition to similar overall effectiveness, the common element among the systems is the underlying model of word-co-occurrence used to determine when two stories discuss the same topic. We believe this model is the key to the event-based topic classification solutions, and that improvements in effectiveness will come more from modeling the properties of topics than from modifying existing retrieval models.
8 FUTURE WORK
Our plan for future work involves improving the feature selection and extraction methodologies of our approach to first story detection. In several experiments, we found that classifiers did not contain all the features that distinguish an event-based topic from its more general subject-based topic. For example,
distinguishing lexical features such as a flight number and accident location were not necessarily included in each classifier formulated from a plane crash story. Other problems included over-weighting a feature, such as Bosnia, in classifiers that were used to track topics about more specific news coverage. We plan to extend our classifier representation with a model that distinguishes event-level features from subject-level features, perhaps by explicitly incorporating the “rules of interpretation” that define topic scope. Other aspects of feature selection to develop include finding a more effective use of the proper noun phrase data, which should lead to additional effectiveness gains and justify the direct application of natural language parsing technology to our first story detection system.

Our results suggest that cleaner data improves classification effectiveness. In the first story detection and event clustering experiments, we saw improvements in all cluster comparison strategies using the cleaner closed caption data. In addition, removing broadcast news header and trailer snippets resulted in improved effectiveness for both problems. Several text pre-processing steps may yield further improvements to classification accuracy, such as removing reporters’ names and other text that is specific to the broadcast news program. Another example is a processing step that would resolve some of the word errors resulting from the automatic speech recognition process.

There appear to be several applications for event-based classification, and the interest in Topic Detection and Tracking is increasing with each phase of TDT. We expect several new approaches to the problems we discussed here to emerge from this research effort, and we look forward to future TDT comparisons, which may help us understand how improvements in text classification can be made.
Acknowledgments

We thank Victor Lavrenko and Daniella Malin for their help with some of the experiments discussed in this work. This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, the Air Force Office of Scientific Research under grant number F49620-99-1-0138, and SPAWARSYSCEN-SD under grant number N66001-99-1-8912. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsor.
References

Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y. (1998a). Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA
Broadcast News Transcription and Understanding Workshop, pages 194–218.
Allan, J., Papka, R., and Lavrenko, V. (1998b). On-line new event detection and tracking. In Proceedings of ACM SIGIR, pages 37–45.
Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.
Callan, J., Croft, B., and Broglio, J. (1994). TREC and TIPSTER experiments with INQUERY. Information Processing and Management, 31(3):327–343.
Can, F. and Ozkarahan, E. (1990). Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems, 15(4):483–517.
Charniak, E. (1996). Tree-bank grammars. Technical Report CS-96-02, Department of Computer Science, Brown University.
Charniak, E. (1999). Personal communication.
Cohen, P. (1995). Empirical Methods for Artificial Intelligence. The MIT Press, Cambridge, Massachusetts.
Croft, W. and Harper, D. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285–295.
DARPA, editor (1999). Proceedings of the DARPA Broadcast News Workshop, Herndon, Virginia.
Department of Defense (1997). Proceedings of the TDT workshop. University of Maryland, College Park, MD (unpublished).
Fagan, J. (1987). A Comparison of Syntactic and Non-Syntactic Methods. PhD thesis, Department of Computer Science, Cornell University.
Graff, D. (1999). Personal communication.
Hearst, M. and Pedersen, J. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of ACM SIGIR, pages 76–84.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1137–1143.
Lewis, D. (1991). Representations and Learning in Information Retrieval. PhD thesis, Department of Computer and Information Science, University of Massachusetts.
Lewis, D. and Gale, W. (1994). A sequential algorithm for training text classifiers. In Proceedings of ACM SIGIR, pages 3–13.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M. (1997). The DET curve in assessment of detection task performance. In Proceedings of EuroSpeech ’97, pages 1895–1898.
Papka, R. and Allan, J. (1998). Document classification using multiword features. In Proceedings of the ACM International Conference on Information and Knowledge Management, pages 124–131.
Papka, R., Allan, J., and Lavrenko, V. (1999). UMass approaches to detection and tracking at TDT2. In Proceedings of the DARPA Broadcast News Workshop, pages 111–116.
Ponte, J. and Croft, W. (1997). Text segmentation by topic. In Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, pages 113–125.
Provost, F. and Fawcett, T. (1998). Robust classification systems for imprecise environments. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 3–13.
Riloff, E. and Lehnert, W. (1994). Information extraction as a basis for high-precision text classification. ACM Transactions on Information Systems, 12(3):296–333.
Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M., and Gatford, M. (1995). Okapi at TREC-3. In Proceedings of TREC-3, pages 109–126.
Salton, G. (1989). Automatic Text Processing. Addison-Wesley Publishing Co., Reading, MA.
Schultz, J. and Liberman, M. (1999). Topic detection and tracking using idf-weighted cosine coefficient. In Proceedings of the DARPA Broadcast News Workshop, pages 189–192.
Strzalkowski, T. and Carballo, J. P. (1996). Natural language information retrieval: TREC-4 report. In Proceedings of TREC-4, pages 245–258.
Swets, J. (1988). Measuring the accuracy of diagnostic systems. Science, 240:1285–1293.
Tzoukermann, E., Klavans, J., and Jacquemin, C. (1997). Effective use of natural language processing techniques for automatic conflation of multi-word terms: The role of derivational morphology, part of speech tagging, and shallow parsing. In Proceedings of ACM SIGIR, pages 148–155.
van Rijsbergen, C. (1979). Information Retrieval. Butterworths, London.
Voorhees, E. (1985). The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval. PhD thesis, Department of Computer Science, Cornell University, Ithaca, N.Y.
Voorhees, E. and Harman, D., editors (1996–1998). Proceedings of Text REtrieval Conferences (TREC-5 through TREC-7). NIST Special Publications.
Walls, F., Jin, H., Sista, S., and Schwartz, R. (1999). Topic detection in broadcast news. In Proceedings of the DARPA Broadcast News Workshop, pages 193–198.
Willett, P. (1988). Recent trends in hierarchic document clustering: A critical review. Information Processing and Management, 24(5):577–597.
Yang, Y., Pierce, T., and Carbonell, J. (1998). A study on retrospective and on-line event detection. In Proceedings of ACM SIGIR, pages 28–36.
Chapter 5

DISTRIBUTED INFORMATION RETRIEVAL

Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA
callan@cs.cmu.edu
Abstract

A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds, and evaluation methodologies for multi-database experiments. A broad and diverse group of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable.

1 INTRODUCTION
Wide area networks, particularly the Internet, have transformed how people interact with information. Much of the routine information access by the general public is now based on full-text information retrieval, as opposed to more traditional controlled vocabulary indexes. People have easy access to information located around the world, and routinely encounter, consider, and accept or reject information of highly variable quality. Search engines for the Web and large corporate networks are usually based on a single database model of text retrieval, in which documents from around the network are copied to a centralized database, where they are indexed and made searchable. The single database model can be successful if most of the important or valuable information on a network can be copied easily. However, information that cannot be copied is not accessible under the single database
model. Information that is proprietary, that costs money, or that a publisher wishes to control carefully is essentially invisible to the single database model.

The alternative to the single database model is a multi-database model, in which the existence of multiple text databases is modeled explicitly. A central site stores brief descriptions of each database, and a database selection service uses these resource descriptions to identify the database(s) that are most likely to satisfy each information need. The multi-database model can be applied in environments where database contents are proprietary or carefully controlled, or where access is limited, because the central site does not require copies of the documents in each database. In principle, and usually in practice, the multi-database model also scales to large numbers of databases.

The multi-database model of information retrieval reflects the distributed location and control of information in a wide area computer network. However, it is also more complex than the single database model of information retrieval, requiring that several additional problems be addressed:

Resource description: The contents of each text database must be described;

Resource selection: Given an information need and a set of resource descriptions, a decision must be made about which database(s) to search; and

Results merging: The ranked lists returned by each database must be integrated into a single, coherent ranked list.

This set of problems has come to be known as Distributed Information Retrieval.

One problem in evaluating a new research area such as distributed IR is that there may be no accepted experimental methodologies or standard datasets with which to evaluate competing hypotheses or techniques. The creation, development, and evaluation of experimental methodologies and datasets is as important a part of establishing a new research area as the development of new algorithms.

This paper presents the results of research conducted over a five year period that addresses many of the issues arising in distributed IR systems. The paper begins with a discussion of the multi-database datasets that were developed for testing research hypotheses. Section 3 addresses the problem of succinctly describing the contents of each available resource or database. Section 4 presents an algorithm for ranking databases by how well they are likely to satisfy an information need. Section 5 discusses the problem of merging results returned by several different search systems. Section 6 investigates how a distributed IR system acquires resource descriptions for each searchable text database in a multi-party environment. Finally, Section 7 summarizes and concludes.
2 MULTI-DATABASE TESTBEDS
Research on distributed IR can be traced back at least to Marcus, who in the early 1980’s addressed resource description and selection in the EXPERT CONIT system, using expert system technology (Marcus, 1983). However, neither Marcus nor the rest of the research community had access to a sufficiently large experimental testbed with which to study the issues that became important during the 1990’s: How to create solutions that would scale to large numbers of resources, distributed geographically, and managed by many parties.

The creation of the TREC corpora removed this obstacle. The text collections created by the U.S. National Institute of Standards and Technology (NIST) for its TREC conferences (Harman, 1994; Harman, 1995) were sufficiently large and varied that they could be divided into smaller databases that were themselves of reasonable size and heterogeneity. NIST also provided relevance judgements based on the results of running dozens of IR systems on queries derived from well-specified information needs.

The first testbed the UMass Center for Intelligent Information Retrieval (CIIR) produced for distributed IR research was created by dividing three gigabytes of TREC data (NIST CDs 1, 2, and 3) by source and publication date (Callan et al., 1995; Callan, 1999b). This first testbed contained 17 text databases that varied widely in size and characteristics (Table 5.1) (Callan, 1999b). The testbed was convenient to assemble and was an important first step towards gaining experience with resource description and selection. However, it contained few databases, and several of the databases were considerably larger than the databases found in many “real world” environments.
Table 5.1 Summary statistics for three distributed IR testbeds.

Number of    Source            Number of Documents           Megabytes
Databases                      Min      Avg      Max          Min   Avg   Max
17           TREC CDs 1,2,3    6,711    64,010   226,087      35    196   362
100          TREC CDs 1,2,3    752      10,782   39,723       28    33    42
921          TREC VLC          12       8,157    31,703       1     23    31
Several testbeds containing O(100) smaller databases were created to study resource selection in environments containing many databases. All were created by dividing TREC corpora into smaller databases, based on source and publication date. One representative example was the testbed created for TREC-5 (Harman, 1997), in which data on TREC CDs 2 and 4 was partitioned into 98 databases, each about 20 megabytes in size. Testbeds of about 100 databases each were also created based on TREC CDs 1 and 2 (Xu and Callan, 1998),
TREC CDs 2 and 3 (Lu et al., 1996a; Xu and Callan, 1998), and TREC CDs 1, 2, and 3 (French et al., 1999; Callan, 1999a).

A testbed of 921 databases was created by dividing the 20 gigabyte TREC Very Large Corpus (VLC) data into smaller databases (Callan, 1999c; French et al., 1999). Each database contained about 23 megabytes of documents from a single source (Table 5.1), and the ordering of documents within each database was consistent with the original ordering of documents in the TREC VLC corpus. This testbed differed from other, smaller testbeds not only in size, but in composition. 25% of the testbed (5 gigabytes) was traditional TREC data, but the other 75% (15 gigabytes) consisted of Web pages collected by the Internet Archive project in 1997 (Hawking and Thistlewaite, 1999). The relevance judgements were based on a much smaller pool of documents retrieved by a much smaller group of IR systems, thus results on that data must be viewed more cautiously.

Although there are many differences among the testbeds, they share important characteristics. Within a testbed, database sizes vary, whether measured by number of documents, number of words, or number of bytes. Databases in a testbed are more homogeneous than the testbed as a whole, which causes some corpus statistics, for example, inverse document frequency (idf), to vary significantly among databases. Databases also retain a certain degree of heterogeneity, to make it more difficult to distinguish among them. These characteristics are intentional; they are intended to reduce the risk of accidental development of algorithms that are sensitive to the quirks of a particular testbed. As a group, this set of distributed IR testbeds enabled an unusually thorough investigation of distributed IR over a five year period.

Others have also created resource selection testbeds by dividing the TREC data into multiple databases, usually also partitioning the data along source and publication date criteria, for example (Voorhees et al., 1995; Viles and French, 1995; Hawking and Thistlewaite, 1999; French et al., 1998). Indeed, there are few widely available alternative sources of data for creating resource selection testbeds. The alternative data used most widely, created at Stanford as part of research on the GlOSS and gGlOSS resource selection algorithms (Gravano et al., 1994; Gravano and Garcia-Molina, 1995), is large and realistic, but does not provide the same breadth of relevance judgements.
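The testbed construction just described amounts to grouping documents by metadata. A toy sketch of that partitioning (the field names are hypothetical, and real testbed construction also balanced database sizes):

    from collections import defaultdict

    def partition_by_source_and_date(documents):
        """Divide a corpus into databases keyed by (source, year), mirroring
        the source/publication-date partitioning used to build the testbeds."""
        databases = defaultdict(list)
        for doc in documents:   # doc: {'id':..., 'source':..., 'year':...}
            databases[(doc['source'], doc['year'])].append(doc['id'])
        return databases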
3 RESOURCE DESCRIPTION
The first tasks in an environment containing many databases are to discover and represent what each database contains. Discovery and representation are closely related tasks, because the method of discovery plays a major role in determining what can be represented. Historically, representation was addressed first, based
on a principle of deciding first what is desirable to represent, and worrying later about how to acquire that information.

Resource descriptions vary in their complexity and in the effort required to create them. CIIR research was oriented towards environments containing many databases with heterogeneous content. Environments containing many databases, and in which database contents may change often, encourage the use of resource descriptions that can be created automatically. Resource descriptions that must be created and updated manually (e.g., Marcus, 1983; Chakravarthy and Haase, 1995) or that are learned from manual relevance judgements (e.g., Voorhees et al., 1995a) might be difficult or expensive to apply in such environments. Environments containing heterogeneous databases also favor detailed resource descriptions. For example, to describe the Wall Street Journal as a publication of financial and business information ignores the large amount of information it contains about U.S. politics, international affairs, wine, and other information of general interest.

A simple and robust solution is to represent each database by a description consisting of the words that occur in the database, and their frequencies of occurrence (Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al., 1995) or statistics derived from frequencies of occurrence (Voorhees et al., 1995a). We call this type of representation a unigram language model. Unigram language models are compact and can be obtained automatically by examining the documents in a database or the document indexes. They also can be extended easily to include phrases, proper names, and other text features that occur in the database. Resource descriptions based on terms and their frequencies are generally a small fraction of the size of the original text database. The size is proportional to the number of unique terms in the database. Zipf’s law indicates that the rate of vocabulary growth decreases as database size increases (Zipf, 1949), hence the resource descriptions for large databases are a smaller fraction of the database size than the resource descriptions for small databases.
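A minimal sketch of building such a unigram-language-model resource description (our illustration; the tokenizer and the choice to keep raw counts are assumptions, and a real system would also apply the same stopping and stemming as the search engine):

    import re
    from collections import Counter

    def unigram_resource_description(documents):
        """Describe a database by its terms and their frequencies of occurrence."""
        model = Counter()
        for doc in documents:
            model.update(re.findall(r"[a-z0-9]+", doc.lower()))
        return model   # e.g., {'apple': 500, 'stock': 1200, ...}

    # Hypothetical usage over an in-memory database of two documents.
    print(unigram_resource_description(["Apple stock rose.", "Apple shipped units."]))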
4 RESOURCE SELECTION
Given an information need and a set of resource descriptions, how is the system to select which resources to search? The major part of this resource selection problem is ranking resources by how likely they are to satisfy the information need. Our approach is to apply the techniques of document ranking to the problem of resource ranking, using variants of tf.idf approaches (Callan et al., 1995; Lu et al., 1996a). One advantage is that the same query can be used to rank resources and to rank documents.
Figure 5.1 A simple resource selection inference network.
The Bayesian Inference Network model of Information Retrieval can be applied to the process of ranking resources, as illustrated by Figure 5.1. Each resource Ri is represented by a set of representation nodes (indexing terms) rj. An information need is represented by one or more queries (q), which are composed of query concepts (ck) and query operators (not shown in this simple example). The belief P(q|Ri) that the information need represented by query q is satisfied by searching resource Ri is determined by instantiating node Ri and propagating beliefs through the network towards node q. The belief p(rk|Ri) that the representation concept rk is observed given resource Ri is estimated by a variation of tf.idf formulas, shown below.

T = \frac{df}{df + 50 + 150 \cdot cw / avg\_cw}    (5.1)

I = \frac{\log\left(\frac{C + 0.5}{cf}\right)}{\log(C + 1.0)}    (5.2)

p(r_k|R_i) = b + (1 - b) \cdot T \cdot I    (5.3)

where:
df is the number of documents in Ri containing rk,
cw is the number of indexing terms in resource Ri,
avg_cw is the average number of indexing terms in each resource,
C is the number of resources,
cf is the number of resources containing term rk, and
b is the minimum belief component (usually 0.4).

Equation 5.1 is a variation of Robertson’s term frequency (tf) weight (Robertson and Walker, 1994), in which term frequency (tf) is replaced by document
frequency (df), and the constants are scaled by a factor of 100 to accommodate the larger df values (Callan et al., 1995). Equation 5.2 is a variation of Turtle’s scaled idf formula (Turtle, 1990; Turtle and Croft, 1991), in which number of documents is replaced by number of resources (C). Equations 5.1-5.3 have come to be known as the CORI algorithm for ranking databases (French et al., 1998; French et al., 1999; Callan et al., 1999b), although the name CORI was originally intended to apply more broadly, to any use of inference networks for ranking databases (Callan et al., 1995).

The scores p(rj|Ri) accruing from different terms rj are combined according to probabilistic operators modeled in the Bayesian inference network model. INQUERY operators are discussed in detail elsewhere (Turtle, 1990; Turtle and Croft, 1991), so only a few common operators are presented here. The belief p(rj|Ri) is abbreviated pj for readability.
bel_{sum}(Q) = \frac{p_1 + p_2 + \cdots + p_n}{n}    (5.4)

bel_{wsum}(Q) = \frac{w_1 p_1 + w_2 p_2 + \cdots + w_n p_n}{w_1 + w_2 + \cdots + w_n}    (5.5)

bel_{not}(Q) = 1 - p_1    (5.6)

bel_{or}(Q) = 1 - (1 - p_1) \cdots (1 - p_n)    (5.7)

bel_{and}(Q) = p_1 \cdot p_2 \cdots p_n    (5.8)
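A compact sketch of CORI scoring under the reconstruction above. The default constants 50, 150, and b = 0.4 follow the text; everything else, including the use of the mean-belief (sum) operator to combine query terms, is an illustrative assumption rather than the chapter's exact implementation.

    import math

    def cori_belief(df, cw, avg_cw, c, cf, b=0.4):
        """Belief p(rk|Ri) that term rk is observed in resource Ri (Eqs. 5.1-5.3)."""
        t = df / (df + 50 + 150 * cw / avg_cw)             # scaled Robertson tf, using df
        i = math.log((c + 0.5) / cf) / math.log(c + 1.0)   # scaled idf over resources
        return b + (1 - b) * t * i

    def rank_resources(query_terms, resources, c, cf):
        """Score each resource by the mean term belief (the sum operator, Eq. 5.4).

        resources: {name: {'df': {term: doc_count}, 'cw': terms_in_resource}}
        cf: {term: number_of_resources_containing_term};  c: number of resources.
        """
        avg_cw = sum(r['cw'] for r in resources.values()) / len(resources)
        scores = {}
        for name, r in resources.items():
            beliefs = [cori_belief(r['df'].get(t, 0), r['cw'], avg_cw, c, cf[t])
                       for t in query_terms if cf.get(t, 0) > 0]
            scores[name] = sum(beliefs) / len(beliefs) if beliefs else 0.0
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)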
Most INQUERY query operators can be used, without change, for ranking both databases and documents. The exceptions are proximity, passage, and synonym operators (Callan et al., 1995), all of which rely on knowing the locations of each index term in each document. Such information is not included in database resource descriptions due to its size, so these operators are all coerced automatically to a Boolean AND operator. Boolean AND is a weaker constraint than proximity, passage, and synonym operators, but it is the strongest constraint that can be enforced with the information available.

The effectiveness of a resource ranking algorithm can be measured with R(n), a metric intended to be analogous to the recall metric for document ranking. R(n) compares a given database ranking at rank n to a desired database ranking at rank n. The desired database ranking is one in which databases are ordered by the number of relevant documents they contain for a query (Gravano and Garcia-Molina, 1995; Lu et al., 1996b; French et al., 1998). R(n) is defined for a query as follows.
R(n) = \frac{\sum_{i=1}^{n} rg_i}{\sum_{i=1}^{n} rd_i}    (5.9)

where:
rg_i is the number of relevant documents in the i’th-ranked database under the given ranking, and
rd_i is the number of relevant documents in the i’th-ranked database under a desired ranking in which databases are ordered by the number of relevant documents they contain.

Figure 5.2 Effectiveness of the resource ranking algorithm on testbeds with differing numbers of resources.
R(n) measures how well an algorithm ranks databases containing many relevant documents ahead of databases containing few relevant documents.

The CORI database ranking algorithm was tested in a series of experiments on testbeds ranging in size from O(100) to O(1,000) databases. Two of the testbeds were developed at the University of Massachusetts (Callan, 1999a; Callan, 1999c); one was developed at the University of Virginia (French et al., 1998). Results were measured using R(n).

Figure 5.2 shows the effectiveness of the resource ranking algorithm with differing numbers of resources (French et al., 1999). The horizontal axis in these graphs is the percentage of the databases in the testbed that are examined or considered. For example, for all testbeds, the top 10% of the databases contain about 60% as many relevant documents as the top 10% of the databases in the desired ranking (a ranking in which databases are ordered by the number of relevant documents they contain). The accuracy of the resource rankings was remarkably consistent across all three testbeds when 8–100% of the databases are to be searched. The algorithm was most effective on the testbed of 236 databases, but the differences due to testbed size were small. Greater variability was apparent when 0–8% of the databases are to be searched. In this test, accuracy on the testbed of 921 databases was significantly lower than the accuracy on the other databases. It is
unclear whether this difference at low recall (searching 0-8% of the databases) is due to testbed size (100 databases vs 921 databases) or testbed content (produced professionally vs Web pages).

One issue in scaling up this research is that as more databases become available, a smaller percentage of the available data is typically searched for each query. Consequently, as the number of available databases increases, the accuracy of the ranking algorithm must also increase, or else recall will decrease significantly. Some loss of recall is inevitable when many resources contain relevant documents but only a few resources are searched.

Once a set of resources is ranked, resource selection is relatively simple. One can choose to search the top n databases, all databases with a score above some threshold value, or a set of databases satisfying some cost metric (e.g., Fuhr, 1999).
5 MERGING DOCUMENT RANKINGS
After a set of databases is searched, the ranked results from each database must be merged into a single ranking. This task can be difficult because the document rankings and scores produced by each database are based on different corpus statistics and possibly different representations and/or retrieval algorithms; they usually cannot be compared directly. Solutions include computing normalized scores (Kwok et al., 1995; Viles and French, 1995; Kirsch, 1997; Xu and Callan, 1998), estimating normalized scores (Callan et al., 1995; Lu et al., 1996a), and merging based on unnormalized scores (Dumais, 1994). The most accurate solution is to normalize the scores of documents from different databases, either by using global corpus statistics (e.g., Kwok et al., 1995; Viles and French, 1995; Xu and Callan, 1998) or by recomputing document scores at the search client (Kirsch, 1997). However, this solution requires that search systems cooperate, for example by exchanging corpus statistics, or that the search client rerank the documents prior to their display.

Our goal was a solution that required no specific cooperation from search engines, and that imposed few requirements on the search client. Our solution was to estimate normalized document scores, using only information that a resource selection service could observe directly. Several estimation heuristics were investigated. All were based on a combination of the score of the database and the score of the document. All of our heuristics favor documents from databases with high scores, but also enable high-scoring documents from low-scoring databases to be ranked highly. The first heuristic, which was used only briefly (Callan et al., 1995; Allan et al., 1996), is shown in Equation 5.10.

D'' = D \cdot \left(1 + N \cdot \frac{R_i - Avg\_R}{Avg\_R}\right)    (5.10)
where N is the number of resources searched.
The normalized document score D'' is the product of the unnormalized document score D and a database weight that is based on how the database score Ri compares to the average database score Avg_R. This heuristic was effective with a few databases, but is flawed by its use of the number of databases N and the average database score Avg_R. If 100 low-scoring databases with no relevant documents are added to a testbed, N is increased and Avg_R is decreased, which can dramatically change the merged document rankings.

A second heuristic for normalizing database scores was based on the observation that the query constrains the range of scores that the resource ranking algorithm can produce. If T in Equation 5.1 is set to 1.0 for each query term, a score Rmax can be computed for each query. If T is set to 0.0 for each query term, a score Rmin can be computed for each query. These are the highest and lowest scores that the resource ranking algorithm could potentially assign to a database. In practice, the minimum is exact, and the maximum is an overestimate. Rmin and Rmax enable database scores to be normalized with respect to the query instead of with respect to the other databases, as shown in Equation 5.11. This type of normalization produces more stable behavior, because adding databases to a testbed or deleting databases from a testbed does not change the scores of other databases in the testbed. However, it does require a slight modification to the way in which database scores and document scores are combined (Equation 5.12).

R_i' = \frac{R_i - R_{min}}{R_{max} - R_{min}}    (5.11)

D'' = \frac{D + 0.4 \cdot D \cdot R_i'}{1.4}    (5.12)

Equations 5.11 and 5.12 were the core of the INQUERY distributed IR system from 1995-1998. They produced very stable results for most CIIR distributed IR testbeds. However, research projects on language modeling and U.S. Patent data identified an important weakness. Databases that are organized by subject, for example by placing all of the documents about computers in one database, all of the documents about health care in another, etc., produce idf scores, and hence document scores, that are very highly skewed. Documents from databases where a query term is common (probably a good database for the query) tend to have low scores, due to low idf values. Documents from databases where a query term is rare (probably a poor database for the query) tend to have high scores, due to high idf values. When idf statistics are very highly skewed, the normalization provided by Equations 5.11 and 5.12 is insufficient.
Equations 5.14 and 5.15 solve the problem of highly skewed document scores by normalizing a document’s score by the maximum and minimum document scores that could possibly be produced for the query using the corpus statistics in its database.

R_i' = \frac{R_i - R_{min}}{R_{max} - R_{min}}    (5.13)

D' = \frac{D - D_{min_i}}{D_{max_i} - D_{min_i}}    (5.14)

D'' = \frac{D' + 0.4 \cdot D' \cdot R_i'}{1.4}    (5.15)

In INQUERY, Dmaxi for database Ri is calculated by setting the tf component of the tf.idf algorithm to its maximum value (1.0) for each query term; Dmini for database Ri is calculated by setting the tf component of the tf.idf algorithm to its minimum value (0.0) for each query term. Hence Dmaxi and Dmini are estimates of the maximum and minimum scores any document in database Ri could be assigned for the given query.

Equation 5.14 solves the problem of highly skewed idf scores, because it is effective on testbeds with and without highly skewed idf scores. However, it requires cooperation among search engines, because Dmaxi and Dmini must be provided by the search engine when it returns document rankings. An independent resource ranking service cannot calculate those values itself (although it could perhaps estimate them, based on observation over time). It is our goal not to rely upon cooperation among search engines, because cooperation can be unreliable in multi-party environments. Thus, although this variant of the result-merging algorithm is effective, equally effective algorithms that do not require cooperation remain a research goal.

The two variants of the result-merging algorithm are suitable for different environments. The first variant, expressed in Equations 5.11-5.12, requires no cooperation from resource providers, and is effective when corpus statistics are either homogeneous or moderately skewed among databases. The second variant, expressed in Equations 5.13-5.15, is effective when corpus statistics are homogeneous, moderately skewed, and extremely skewed among databases, but it requires resource providers to cooperate by providing Dmaxi and Dmini. The first variant might be appropriate on a wide area network, where cooperation cannot be enforced. The second variant might be appropriate on a local area network within a single organization.
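A sketch of both merging variants as reconstructed above. Because the equation bodies were garbled in the source, treat the constants 0.4 and 1.4 and the exact normalizations as assumptions of this sketch.

    def merge_score_v1(d, r_i, r_min, r_max):
        """First variant (Eqs. 5.11-5.12): normalize the database score by the
        query-dependent score range, then boost the raw document score."""
        r_norm = (r_i - r_min) / (r_max - r_min)
        return (d + 0.4 * d * r_norm) / 1.4

    def merge_score_v2(d, d_min_i, d_max_i, r_i, r_min, r_max):
        """Second variant (Eqs. 5.13-5.15): also normalize the document score
        by the per-database score range, which handles highly skewed idf
        statistics, but requires d_min_i and d_max_i from a cooperating engine."""
        r_norm = (r_i - r_min) / (r_max - r_min)
        d_norm = (d - d_min_i) / (d_max_i - d_min_i)
        return (d_norm + 0.4 * d_norm * r_norm) / 1.4

Documents from all searched databases would then be sorted by the merged score to form the final ranking.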
6 ACQUIRING RESOURCE DESCRIPTIONS
Acquiring resource descriptions can be a difficult problem, especially in a wide-area network containing resources controlled by many parties. One solution is for each resource provider to cooperate by publishing resource descriptions for its document databases. The STARTS protocol, for example,
is a standard format for communicating resource descriptions (Gravano et al., 1996). Solutions that require cooperation are appropriate in controlled environments, such as a single organization, but face problems in multi-party environments such as the Internet. If a resource provider cannot cooperate, or refuses to cooperate, or is deceptive, the cooperative approach fails. Even when providers intend to cooperate, different systems, different assumptions, and different choices (e.g., how to stem words) make resource descriptions produced by different parties incomparable. For example, which database is best for the query ‘Apple’: A database that contains 2,000 occurrences of ‘appl’, a database that contains 500 occurrences of ‘apple’, or a database that contains 50 occurrences of ‘Apple’? The answer requires detailed knowledge about the tokenizing, stopword, stemming, case conversion, and proper name handling performed by each database. Such detail is impractical to communicate, thus cooperative solutions are most appropriate in environments where all parties use the same software and the same parameter settings.

An alternative solution is for the resource selection service to learn what each resource contains by submitting queries and observing the documents that are returned. This technique is called query-based sampling (Du and Callan, 1998; Callan et al., 1999a; Callan and Connell, 1999; Callan et al., 1999b). It is based on the hypothesis that a resource description created from a small sample of text is sufficiently similar to a complete resource description. Query-based sampling requires minimal cooperation (only the ability to run queries and retrieve documents), and it makes no assumptions about how each system operates internally. It also allows different resource selection services to make different decisions about how to represent resources, encouraging development of competing approaches to resource description and selection.

Query-based sampling was tested with experiments that investigate it from several different perspectives: Accuracy of learned language models, accuracy of database rankings, and accuracy of document rankings. These experiments are discussed below.
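A minimal sketch of the query-based sampling loop (our illustration; `search` stands for whatever query interface the uncooperative database exposes, and the seed term, four-document sample, and stopping rule follow the baseline settings described in the next subsection):

    import random
    from collections import Counter

    def query_based_sampling(search, seed_term, target_docs=300, docs_per_query=4):
        """Learn a unigram resource description by querying a database and
        examining only the documents it returns (no index access needed)."""
        learned, seen = Counter(), set()
        term = seed_term
        for _ in range(10 * target_docs):              # simple query budget
            for doc_id, text in search(term, top_n=docs_per_query):
                if doc_id not in seen:
                    seen.add(doc_id)
                    learned.update(text.lower().split())
            if len(seen) >= target_docs or not learned:
                break
            term = random.choice(list(learned))        # next query: random learned term
        return learned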
6.1 ACCURACY OF UNIGRAM LANGUAGE MODELS
The first tests of query-based sampling studied how well the learned language models matched the actual or complete language model of a database. A learned language model is one created from documents that were obtained by query-based sampling. The actual or complete language model is one created by examining every document in the database.

Three text databases were used: CACM, 1988 Wall Street Journal, and the TREC-123 databases. CACM is a small, homogeneous database of scientific abstracts. The 1988 Wall Street Journal is a larger, heterogeneous database of
Table 5.2 Test corpora for query-based sampling experiments.

Name       Size, in   Size, in    Size, in       Size, in       Variety
           bytes      documents   unique terms   total terms
CACM       2MB        3,204       6,468          117,473        homogeneous
WSJ88      104MB      39,904      122,807        9,723,528      heterogeneous
TREC-123   3.2GB      1,078,166   1,134,099      274,198,901    very heterogeneous
American newspaper articles (Harman, 1994). The TREC-123 database is a large, very heterogeneous database of documents from a variety of different sources and timespans (Harman, 1994; Harman, 1995). Their characteristics are summarized in Table 5.2. All three are standard IR test databases.

Unigram language models consist of a vocabulary and term frequency information. The ctf ratio measures how well the learned vocabulary matches the actual vocabulary. The Spearman Rank Correlation Coefficient measures how well the learned term frequencies indicate the frequency of each term in the database.

Ctf ratio is the proportion of term occurrences in the database that are covered by terms in the learned resource description. For a learned vocabulary V' and an actual vocabulary V, ctf ratio is:

\frac{\sum_{i \in V'} ctf_i}{\sum_{i \in V} ctf_i}    (5.16)
where ctfi is the number of times term i occurs in the database (database term frequency, or ctf). A ctf ratio of 80% means that the learned resource description contains the terms that account for 80% of the term occurrences in the database. The Spearman Rank Correlation Coefficient is an accepted metric for comparing two orderings, in this case an ordering of terms by frequency. The Spearman Rank Correlation Coefficient is defined (Press et al., 1992) as:
r_s = \frac{1 - \frac{6}{n^3 - n}\left(\sum_i d_i^2 + \frac{1}{12}\sum_k (f_k^3 - f_k) + \frac{1}{12}\sum_m (g_m^3 - g_m)\right)}{\sqrt{1 - \frac{\sum_k (f_k^3 - f_k)}{n^3 - n}}\;\sqrt{1 - \frac{\sum_m (g_m^3 - g_m)}{n^3 - n}}}    (5.17)
where di is the rank difference of common term i, n is the number of terms, fk is the number of ties in the kth group of ties in the learned resource description, and gm is the number of ties in the mth group of ties in the actual resource
Figure 5.3 Measures of how well a learned resource description matches the actual resource description of a full-text database. (a) Percentage of database word occurrences covered by terms in the learned resource description. (b) Spearman rank correlation coefficient between the term rankings in the learned resource description and the database. (Four documents examined per query.)
description.¹ Two orderings are identical when the rank correlation coefficient is 1. They are uncorrelated when the coefficient is 0, and they are in reverse order when the coefficient is –1.

Prior to comparison with ctf ratio and Spearman Rank Correlation metrics, identical stopword lists and stemming algorithms were applied to the learned and actual language models. Ctf ratios would have been significantly higher if stopwords were retained in the language models.

Query-based sampling supports different sampling strategies, depending upon how query terms are chosen, how many documents are examined from each query, and how often the learned language model is updated with new information. The baseline experiment presented here was based on selecting query terms randomly from the learned language model, examining four documents per query, and updating language models immediately with new information. The initial query term was selected randomly from another convenient resource, in this case, the TREC-123 database. The choice of the initial query term was a source of bias in these experiments. However, preliminary experiments showed that as long as the initial query term returned at least one document, the choice of the initial query term had little effect on the quality of the language model learned.
¹Simpler versions of the Spearman Rank Correlation Coefficient are more common (e.g., Moroney, 1951). However, simpler versions assume that two elements cannot share the same ranking. Term rankings have many terms with identical frequencies, and hence identical rankings.
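For concreteness, the two evaluation metrics can be computed along these lines (our sketch; scipy's spearmanr handles tied ranks with the average-rank convention, which is in the same spirit as, though not identical in form to, Equation 5.17):

    from scipy.stats import spearmanr

    def ctf_ratio(learned_model, actual_model):
        """Fraction of the database's term occurrences covered by the learned
        vocabulary (Eq. 5.16); both models map term -> frequency."""
        covered = sum(actual_model[t] for t in learned_model if t in actual_model)
        return covered / sum(actual_model.values())

    def frequency_rank_correlation(learned_model, actual_model):
        """Spearman correlation of term frequencies over the common vocabulary."""
        common = sorted(set(learned_model) & set(actual_model))
        learned = [learned_model[t] for t in common]
        actual = [actual_model[t] for t in common]
        return spearmanr(learned, actual).correlation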
Experimental results are summarized in Figure 5.3. Figure 5.3a shows that the sampling method quickly identifies the vocabulary that represents 80% of the non-stopword term occurrences in each database. Figure 5.3b shows that the sampling method also quickly learns the relative frequencies of terms in each database. The rate at which resource descriptions converged was independent of database size and heterogeneity.

The results shown here are based on examining the top 4 documents retrieved for each query, but similar results are obtained when 1, 2, 4, 6, 8, and 10 documents are examined per query (Callan et al., 1999a). Smaller samples, for example 1 or 2 documents per query, produced slightly more accurate language models for heterogeneous databases. Larger samples, for example 4 or 6 documents per query, produced slightly faster learning for homogeneous databases. The differences were consistent, but not significant. When nothing is known about the contents of a database, the best strategy is to take small samples, trading off speed for guaranteed accuracy.

Several different approaches to query term selection were tested, including selecting terms from the learned language model using frequency criteria, and selecting terms that appear important in other, presumably similar language models (Callan et al., 1999a; Callan and Connell, 1999). Frequency-based selection was rarely a good choice. Selecting query terms from another language model was only a good choice when that other language model was very similar to the database being sampled; in other words, if one has a good guess about what a database contains, the database can be sampled more efficiently; otherwise, random sampling is best.

The language models for all three databases required about the same number of documents to converge. Database size and heterogeneity had little effect on the rate of convergence. This characteristic is consistent with Zipf’s “law” (Zipf, 1949), which states that the rate at which new terms are found decreases with the number of documents examined. Zipf’s law places no constraints on the order in which documents in a database are examined. Whether documents are selected sequentially or by query-based sampling, only a relatively small number of documents is required to identify most of the vocabulary in a database of documents.
6.2 ACCURACY OF RESOURCE RANKINGS
One might expect relatively accurate language models to produce relatively accurate resource rankings. However, no prior research indicated how much inaccuracy in a language model could be tolerated before resource ranking accuracy deteriorated. A set of experiments was designed to study this issue.

Resource ranking accuracy was studied using the testbed of 100 databases created from TREC CDs 1, 2, and 3 (Section 2).
Figure 5.4 Measures of database ranking accuracy using resource descriptions of varying accuracy. (a) Topics 51-100 (TREC query set INQ026). (b) Topics 101-150 (TREC query set INQ001). (4 documents examined per query. TREC volumes 1, 2, and 3.)
100 complete resource descriptions were created (one per database). 100 learned resource descriptions were also created (one per database). The learned resource descriptions were created using query-based sampling, with query terms selected randomly from the learned language model, and 4 documents examined per query. Each database was sampled with enough queries to yield a specified number of unique documents. Sample sizes of 100, 300, and 700 documents were examined.

Databases were ranked with the CORI database ranking algorithm (Section 4), which normally normalizes document frequency statistics (dfi,j) using the length, in words, of the database (cwj) (Callan et al., 1995). It is not known yet how to estimate database size with query-based sampling. In these experiments, term frequency information (df) was normalized using the length, in words, of the set of documents used to construct the resource description.

Queries were based on TREC topics 51-150 (Harman, 1994). The query sets were INQ001 and INQ026, both created at the CIIR (Callan et al., 1995a). Queries in these query sets are long, complex, and have undergone automatic query expansion. The relevance assessments were the standard TREC relevance assessments supplied by the U.S. National Institute of Standards and Technology (Harman, 1994).

The experimental results are summarized in Figure 5.4. The baselines are the curves showing results with the actual resource descriptions (“complete resource descriptions”). This is the best result that the database ranking algorithm can produce when given a complete description for each database.

Resource rankings produced from learned language models were slightly less accurate than rankings produced from complete language models. However, the difference was small when learned language models were created from
700 and 300 documents. The difference was greater when language models were learned from only 100 documents, but the loss is small compared to the information reduction. Accuracy at “low recall” (only 10-20% of the databases searched) was quite good. These results are consistent with the results presented in Section 6.1. The earlier experiments showed that term rankings in the learned and actual resource descriptions were highly correlated after examining 300 documents. These experiments demonstrate that the degree of correlation is sufficiently high to enable accurate resource ranking.
6.3 ACCURACY OF DOCUMENT RANKINGS
Relatively accurate database rankings are a prerequisite for accurate document rankings, but the degree of accuracy required in the database ranking was not known. In particular, it was not known whether the minor database ranking errors introduced by learned language models would cause small or large errors in document ranking. A set of experiments was designed to study this issue.

Document ranking accuracy was studied using the testbed of 100 databases created from TREC CDs 1, 2, and 3 (Section 2). 100 complete resource descriptions were created (one per database). 100 learned resource descriptions were also created (one per database). The learned resource descriptions were created using query-based sampling, with query terms selected randomly from the learned language model, and 4 documents examined per query. Each database was sampled with enough queries to yield 300 unique documents.

The CORI database selection algorithm ranked databases using either the learned resource descriptions or the complete resource descriptions, as determined by the experimenter. The 10 databases ranked most highly for each query by the database selection algorithm were searched by INQUERY. The number 10 was chosen because it was used in recent research on distributed search (Xu and Callan, 1998; Xu and Croft, 1999). Each searched database returned its most highly ranked 30 documents. Document rankings produced by different databases were merged into a single ranking by INQUERY's default result-merging algorithm (Section 5). Document ranking accuracy was measured by precision at ranks 5, 10, 15, 20, and 30.

The experimental results indicate that distributed retrieval is about as effective with learned resource descriptions as it is with complete resource descriptions (Table 5.3). Precision with one query set (INQ026, topics 51-100) was 6.6% to 8.3% higher using learned descriptions. Precision with the other query set (INQ001, topics 101-150) averaged 2.2% lower using learned descriptions, with a range of -0.3% to -6.0%. Both the improvement and the loss were too small for most people to notice.
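The evaluation measure itself is simple enough to state in code; this is a generic sketch (names ours), where ranking is one query's merged document list and relevant is the set of judged-relevant document identifiers.

def precision_at_ranks(ranking, relevant, cutoffs=(5, 10, 15, 20, 30)):
    # Fraction of the top k merged documents that are relevant.
    return {k: sum(1 for d in ranking[:k] if d in relevant) / k
            for k in cutoffs}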
Table 5.3 Precision of a search system using complete and learned resource descriptions for database selection and result merging. TREC volumes 1, 2, and 3, divided into 100 databases. 10 databases were searched for each query.

                 Topics 51-100 (INQ026 queries)    Topics 101-150 (INQ001 queries)
Document Rank    Complete    Learned               Complete    Learned
 5               0.5800      0.6280 (+8.3%)        0.5960      0.5600 (-6.0%)
10               0.5640      0.6040 (+7.1%)        0.5540      0.5520 (-0.3%)
15               0.5493      0.5853 (+6.6%)        0.5453      0.5307 (-2.7%)
20               0.5470      0.5830 (+6.6%)        0.5360      0.5270 (-1.7%)
30               0.5227      0.5593 (+7.0%)        0.5013      0.4993 (-0.4%)
Table 5.4 Summary statistics for the query sets used with the testbed.

Query Set Name                   TREC Topic Set   TREC Topic Field   Average Length (Words)
Title queries, 51-100            51-100           Title              3
Title queries, 101-150           101-150          Title              4
Description queries, 51-100      51-100           Description        14
Description queries, 101-150     101-150          Description        16
Experiments were also conducted with shorter queries. Sets of queries were created for TREC topics 51-150 using text from the Title fields (Title queries), and sets were created using text from the Description fields (Description queries). Summary characteristics for the query sets are shown in Table 5.4.

Table 5.5 summarizes the results of experiments with shorter queries. The shorter queries produce rankings with lower precision than the long queries (INQ026 and INQ001, Table 5.3), which was expected. The difference in precision between searches done with complete language models and with learned language models is larger than in experiments with longer queries (Table 5.5). The drop in precision was 5-10% with all but one query set; in one test, precision actually improved slightly. These experimental results with short and long queries extend the results of the previous sections, which indicated that using learned resource descriptions to rank databases introduced only a small amount of error into the ranking process. These results demonstrate that the small errors introduced by learned
Table 5.5 The effects of query-based sampling on the CORI database ranking algorithm, as measured by the precision of the document rankings that are produced. 10 databases searched in a 100 database testbed. (a) Title queries. (b) Description queries.

(a) Title queries
Precision     Topics 51-100               Topics 101-150
at Rank       Full      Sampled           Full      Sampled
5 docs        0.4800    0.4520 (-5.8%)    0.4440    0.4440 (0.0%)
10 docs       0.4400    0.4280 (-2.7%)    0.4100    0.3920 (-4.4%)
15 docs       0.4240    0.4067 (-4.1%)    0.3987    0.3627 (-9.0%)
20 docs       0.4070    0.3870 (-4.9%)    0.3740    0.3470 (-7.2%)
30 docs       0.3913    0.3620 (-7.5%)    0.3560    0.3267 (-8.2%)
100 docs      0.3054    0.2748 (-10.0%)   0.2720    0.2576 (-5.3%)

(b) Description queries
Precision     Topics 51-100               Topics 101-150
at Rank       Full      Sampled           Full      Sampled
5 docs        0.4960    0.4840 (-2.4%)    0.4560    0.4920 (+7.9%)
10 docs       0.4660    0.4540 (-2.6%)    0.4260    0.3980 (-6.6%)
15 docs       0.4520    0.4227 (-6.5%)    0.3973    0.3600 (-9.4%)
20 docs       0.4350    0.4080 (-6.2%)    0.3890    0.3430 (-11.8%)
30 docs       0.4273    0.3860 (-9.7%)    0.3733    0.3327 (-10.9%)
100 docs      0.3128    0.2772 (-11.4%)   0.2702    0.2376 (-12.1%)
resource descriptions do not noticeably reduce the accuracy of the final search results.

The accuracy of the document ranking also depends on merging results from different databases accurately. The experimental results indicate that learned resource descriptions support this activity as well. This result is important because INQUERY's result-merging algorithm estimates a normalized document score as a function of the database's score and the document's score with respect to its database. The results indicate not only that databases are ranked appropriately using learned descriptions, but that the scores used to rank them are highly correlated with the scores produced with complete resource descriptions. This is further evidence that query-based sampling produces very accurate resource descriptions.
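The merging step can be sketched as follows. The weighting shown is only an illustrative instance of "a normalized document score as a function of the database's score and the document's score"; it is not INQUERY's exact algorithm, and the helper names are ours.

import heapq

def merge_rankings(db_results, db_scores, top_n=30):
    # db_results: {db_id: [(doc_id, doc_score), ...]} from each searched database
    # db_scores : {db_id: database score} from the database ranking step
    avg = sum(db_scores.values()) / len(db_scores)
    merged = []
    for db_id, docs in db_results.items():
        # Assumed linear weighting: documents from databases scored above
        # the mean are boosted, those below the mean are penalized.
        weight = 1.0 + (db_scores[db_id] - avg) / avg
        for doc_id, score in docs:
            merged.append((score * weight, db_id, doc_id))
    return heapq.nlargest(top_n, merged)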
7 SUMMARY AND CONCLUSIONS
The research reported in this paper addresses many of the problems that arise when full-text information retrieval is applied in environments containing many text databases controlled by many independent parties. The solutions include techniques for acquiring descriptions of resources controlled by uncooperative parties, using resource descriptions to rank text databases by their likelihood
of satisfying a query, and merging the document rankings returned by different text databases. Collectively, these techniques represent an end-to-end solution to the problems that arise in distributed information retrieval.

The distributed IR solutions developed in this paper are effective under a broad set of conditions. The experimental conditions include testbeds with relatively uniform database sizes, testbeds with relatively heterogeneous database sizes, and testbeds ranging in size from O(10) to O(100) databases. The solutions scale to at least O(1,000) databases. The experiments presented in this paper are a representative subset of distributed IR experiments done at the CIIR over a five year period. The core algorithms required little adjustment during that time.

The experimental methodology developed as part of this research was intended to reflect conditions in wide area computer networks. These conditions include minimal cooperation among parties, a complete lack of global corpus information (e.g., idf statistics), a desire to minimize communication costs, and a desire to minimize the number of interactions among parties. Database ranking algorithms were evaluated by how well they identified databases containing the largest number of relevant documents for each query, and by the precision an end-user would see. The intent was to be as "real world" and unforgiving as possible.

In spite of good intentions, weaknesses remain, and these reflect opportunities for future research. The major remaining weakness is the algorithm for merging document rankings produced by different databases. This paper presents two versions of the algorithm. One requires some cooperation among parties; the other does not. Neither algorithm has a strong theoretical basis, and neither algorithm has been tested with document rankings and document scores produced by multiple, disparate search systems, as would be common in the "real world". These weaknesses could be avoided, at some computational cost, by parsing and reranking the documents at the search client. They could also be avoided with a simpler heuristic algorithm, at the cost of a decrease in precision, as in Allan et al., 1996. However, an accurate and efficient solution to this problem remains unknown.

The experimental results with O(100) databases demonstrate the need for additional research on "high precision" database ranking algorithms. Few people can or will search 10% of the databases when many databases are available. The most useful algorithms will be those that are effective when 10 out of 1,000 databases (1%), or 10 out of 10,000 databases (0.1%) are searched. None of the prior research has studied this level of accuracy.

The research reported in this paper represents a large first step towards creating a complete multi-database model of full-text information retrieval. A simple distributed IR system can be built today, based on the algorithms presented here. However, many of the traditional IR tools, such as relevance feedback, have yet
to be applied to multi-database environments. Query expansion greatly improves the ranking of databases (Xu and Callan, 1998), but this result is of only academic interest until there is a general method for creating query expansion databases that accurately represent many other databases. Nobody has shown how to summarize database contents so that a person can browse in an environment containing thousands of databases. These and related problems are likely to represent the next wave of research in distributed information retrieval.
Acknowledgments I thank Margie Connell, Zhihong Lu, Aiqun Du, Hongmin Shu, and Yun Mu for their assistance with the research reported here. This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement EEC-9209623; by the U.S. Patent and Trademark Office and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract F19628-95-C-0235; and by the National Science Foundation, under grant IIS-9873009. Any opinions, findings, conclusions or recommendations expressed in this material are the author’s, and do not necessarily reflect those of the sponsor(s).
References

Allan, J., Ballesteros, L., Callan, J. P., Croft, W. B., and Lu, Z. (1996). Recent experiments with INQUERY. In Harman, D., editor, Proceedings of the Fourth Text REtrieval Conference (TREC-4). National Institute of Standards and Technology Special Publication.

Callan, J. (1999a). Distributed IR testbed definition: trec123-100-bysource-callan99.v2a. Technical report, Language Technologies Institute, Carnegie Mellon University. Available at http://www.cs.cmu.edu/~callan/Data/.

Callan, J. (1999b). Distributed IR testbed definition: trec123-17-bysource-callan99.v1a. Technical report, Language Technologies Institute, Carnegie Mellon University. Available at http://www.cs.cmu.edu/~callan/Data/.

Callan, J. (1999c). Distributed IR testbed definition: trecvlc1-921-bysource-callan99. Technical report, Language Technologies Institute, Carnegie Mellon University. Available at http://www.cs.cmu.edu/~callan/Data/.

Callan, J. and Connell, M. (1999). Query-based sampling of text databases. Technical Report IR-180, Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts.

Callan, J., Connell, M., and Du, A. (1999a). Automatic discovery of language models for text databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 479–490, Philadelphia. ACM.
Callan, J., Powell, A. L., French, J. C., and Connell, M. (1999b). The effects of query-based sampling on automatic database selection algorithms. Technical Report IR-181, Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts.

Callan, J. P., Croft, W. B., and Broglio, J. (1995a). TREC and TIPSTER experiments with INQUERY. Information Processing and Management, 31(3):327–343.

Callan, J. P., Lu, Z., and Croft, W. B. (1995b). Searching distributed collections with inference networks. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–28, Seattle. ACM.

Chakravarthy, A. and Haase, K. (1995). Netserf: Using semantic knowledge to find Internet information archives. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11, Seattle. ACM.

Du, A. and Callan, J. (1998). Probing a collection to discover its language model. Technical Report 98-29, Department of Computer Science, University of Massachusetts.

Dumais, S. T. (1994). Latent semantic indexing (LSI) and TREC-2. In Harman, D. K., editor, The Second Text REtrieval Conference (TREC-2), pages 105–115, Gaithersburg, MD. National Institute of Standards and Technology, Special Publication 500-215.

French, J., Powell, A., Callan, J., Viles, C., Emmitt, T., Prey, K., and Mou, Y. (1999). Comparing the performance of database selection algorithms. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 238–245. ACM.

French, J., Powell, A., Viles, C., Emmitt, T., and Prey, K. (1998). Evaluating database selection techniques: A testbed and experiment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

Fuhr, N. (1999). A decision-theoretic approach to database selection in networked IR. ACM Transactions on Information Systems, 17(3):229–249.

Gravano, L., Chang, K., Garcia-Molina, H., and Paepcke, A. (1996). STARTS: Stanford protocol proposal for Internet retrieval and search. Technical Report SIDL-WP-1996-0043, Computer Science Department, Stanford University.

Gravano, L. and Garcia-Molina, H. (1995). Generalizing GLOSS to vector-space databases and broker hierarchies. In Proceedings of the 21st International Conference on Very Large Databases (VLDB), pages 78–89.

Gravano, L., Garcia-Molina, H., and Tomasic, A. (1994). The effectiveness of GLOSS for the text database discovery problem. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, pages 126–137. ACM. SIGMOD Record 23(2).
Harman, D., editor (1994). The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special Publication 500-215, Gaithersburg, MD.

Harman, D., editor (1995). Proceedings of the Third Text REtrieval Conference (TREC-3). National Institute of Standards and Technology Special Publication 500-225, Gaithersburg, MD.

Harman, D., editor (1997). Proceedings of the Fifth Text REtrieval Conference (TREC-5). National Institute of Standards and Technology Special Publication 500-238, Gaithersburg, MD.

Hawking, D. and Thistlewaite, P. (1999). Methods for information server selection. ACM Transactions on Information Systems, 17(1):40–76.

Kirsch, S. T. (1997). Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents. U.S. Patent 5,659,732.

Kwok, K. L., Grunfeld, L., and Lewis, D. D. (1995). TREC-3 ad-hoc, routing retrieval and thresholding experiments using PIRCS. In Harman, D. K., editor, The Third Text REtrieval Conference (TREC-3), Gaithersburg, MD. National Institute of Standards and Technology, Special Publication 500-225.

Lu, Z., Callan, J., and Croft, W. (1996a). Applying inference networks to multiple collection searching. Technical Report 96-42, Department of Computer Science, University of Massachusetts.

Lu, Z., Callan, J., and Croft, W. (1996b). Measures in collection ranking evaluation. Technical Report 96-39, Department of Computer Science, University of Massachusetts.

Marcus, R. S. (1983). An experimental comparison of the effectiveness of computers and humans as search intermediaries. Journal of the American Society for Information Science, 34:381–404.

Moroney, M. (1951). Facts from figures. Penguin, Baltimore.

Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1992). Numerical Recipes in C: The art of scientific computing. Cambridge University Press.

Robertson, S. and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241, Dublin, Ireland. ACM.

Turtle, H. (1990). Inference networks for document retrieval. Technical Report COINS Report 90-7, Computer and Information Science Department, University of Massachusetts, Amherst, MA 01003.

Turtle, H. R. and Croft, W. B. (1991). Efficient probabilistic inference for text retrieval. In RIAO 3 Conference Proceedings, pages 644–661, Barcelona, Spain.
Viles, C. L. and French, J. C. (1995). Dissemination of collection wide information in a distributed information retrieval system. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 12–20, Seattle. ACM.

Voorhees, E., Gupta, N., and Johnson-Laird, B. (1995a). Learning collection fusion strategies. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172–179, Seattle. ACM.

Voorhees, E. M., Gupta, N. K., and Johnson-Laird, B. (1995b). The collection fusion problem. In Harman, D. K., editor, The Third Text REtrieval Conference (TREC-3), Gaithersburg, MD. National Institute of Standards and Technology, Special Publication 500-225.

Xu, J. and Callan, J. (1998). Effective retrieval of distributed collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 112–120, Melbourne. ACM.

Xu, J. and Croft, W. (1999). Cluster-based language models for distributed retrieval. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 254–261, Berkeley. ACM.

Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Reading, MA.
Chapter 6

TOPIC-BASED LANGUAGE MODELS FOR DISTRIBUTED RETRIEVAL

Jinxi Xu
BBN Technologies
70 Fawcett Street
Cambridge, MA 02138
jxu@bbn.com
W. Bruce Croft
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts, Amherst
Amherst, MA 01003
croft@cs.umass.edu
Abstract
Effective retrieval in a distributed environment is an important but difficult problem. Lack of effectiveness appears to have two major causes. First, existing collection selection algorithms do not work well on heterogeneous collections. Second, relevant documents are scattered over many collections and searching a few collections misses many relevant documents. We propose a topic-oriented approach to distributed retrieval. With this approach, we structure the document set of a distributed retrieval environment around a set of topics. Retrieval for a query involves first selecting the right topics for the query and then dispatching the search process to collections that contain such topics. The content of a topic is characterized by a language model. In environments where the labeling of documents by topics is unavailable, document clustering is employed for topic identification. Based on these ideas, three methods are proposed to suit different environments. We show that all three methods improve effectiveness of distributed retrieval.
1 INTRODUCTION
The motivation for distributed retrieval is threefold. First, it can improve efficiency. By some accounts, the World Wide Web has over 800 million pages with about 6 terabytes of text (Lawrence and Giles, 1999). Creating one central index and searching it for millions of queries per hour is a difficult, if not impossible, task. Distributed retrieval improves scalability and efficiency in this case. Second, documents are often proprietary and are owned by autonomous domains. Distributed retrieval is the only method of simultaneously searching collections owned by several domains. Third, distributed retrieval can potentially improve retrieval effectiveness over centralized retrieval. It is often easier to tell whether a source contains documents relevant to a query than to tell whether a particular document is relevant to the query. Excluding dubious collections from the retrieval process can therefore avoid retrieving some non-relevant documents that a full retrieval would retrieve.

There have been many studies of distributed retrieval in the IR, digital library, database and World Wide Web communities (Callan et al., 1995; Xu and Callan, 1998; Dolin et al., 1996; Gravano et al., 1994; Weiss et al., 1996). A critical problem in distributed retrieval is collection selection. Because a distributed retrieval system may consist of a large number of collections, the only way to ensure timely and economic retrieval is to search a small number of collections which are likely to contain relevant documents for a query. Collection selection is critical for retrieval accuracy for the simple reason that searching the wrong collections, those containing few or no relevant documents, will result in retrieval failure for a query. The other problem is how to merge the retrieval results from different collections. In our opinion, result merging, though important, is a secondary issue. Callan et al., 1995, showed that simple normalizations of document scores from different collections can minimize the impact on retrieval performance. We avoid the issue of result merging in this study by assuming that searching different collections produces comparable document scores. Distributed retrieval in the context of the World Wide Web is known as meta searching, but current meta search engines such as MetaCrawler (http://www.metacrawler.com) typically send a query to a fixed list of popular search engines and do not perform collection selection.

Past research has shown that distributed retrieval is markedly less effective than centralized retrieval (Xu and Callan, 1998). One reason for the lack of effectiveness is that current collection selection algorithms are not effective. The most common technique for collection selection is to represent a collection as a word histogram, which is usually a list of words that occur in the collection and the associated frequencies. The virtual document representation is an example (Callan et al., 1995; Xu and Callan, 1998). Collection selection then consists of simply ranking the word histograms against a query in the same way as ranking
ordinary documents. Such a technique does not work well on heterogeneous collections because different words in a query can match different topics of a collection. In other words, a collection can closely match a query and yet have no relevant documents for the query. In fact, if all collections are sufficiently heterogeneous, matching a query consisting of a few common words against word histograms can even produce random collection selection, because statistically all histograms are almost identical with relation to the query. The other reason for the lack of effectiveness is that relevant documents for a query are scattered over many collections. On some test-beds we found that searching a few collections for a query misses a majority of relevant documents even when collections are optimally chosen. This decreases retrieval effectiveness no matter how good the retrieval algorithm is.

In this study we propose a topic-oriented approach to distributed retrieval. Our goal is to improve distributed retrieval and make it as effective as centralized retrieval. Our approach is based on the notion of topics. Roughly speaking, a topic defines a set of documents about the same or similar subject, such as "sports" or "politics". In many retrieval environments, documents are not clearly labelled by topics. Our solution is to use document clustering (van Rijsbergen, 1979) as an approximation. That is, documents are clustered by content and each cluster is treated as a topic. The content of a topic is characterized by a language model. Language models were originally used in speech recognition to capture statistical regularities of language generation and were recently used successfully for retrieval by Ponte, 1998. We call the language model for a topic a topic model. Given a query, topics are ranked based on how likely it is that the query text was generated by the topic models. For collection selection, a collection is represented by a number of topic models, each of which characterizes the content of one topic in the collection. Collections that contain the highest ranked topics are selected for retrieval for the query. Since our collection selection algorithm is based on how well a query matches a topic, it eliminates many mistakes made by existing algorithms that result from matching the words in a query with different topics of a collection.

In environments where the document set can be freely reorganized, such as a Web search engine, we can make each topic a collection. Since collections and topics are the same, relevant documents for a query will tend to be in a small number of collections. That makes distributed retrieval even more effective. A disadvantage with our approach is that it requires clustering large sets of documents. The computational cost, however, can be made acceptable even on large collections.

In sections 2 and 3, we describe the basic approach in more detail. In section 4, we describe four methods of organizing a distributed retrieval system. One is the old method of distributed retrieval with heterogeneous collections. The other three are new methods proposed in this paper. In section 5 we describe experimental environments. Experimental results are presented in sections
6 to 12. The three new methods are evaluated on TREC3, TREC4 and TREC6, using the old method and centralized retrieval as baselines. In section 14, we discuss related work. The final section summarizes this work and suggests future research.
2 TOPIC MODELS
The topic models used in this study are unigram models. A topic model, i.e., a language model for a topic T, is a probability distribution {p1, p2, ..., pn} over a vocabulary set {w1, w2, ..., wn}, where pi is the frequency with which word wi is used in the text of T when observed with an unlimited amount of data. More complex models such as bigram and trigram models have been used in IR (Miller et al., 1999; Song and Croft, 1999). However, since word order is less important for IR than for other applications such as speech recognition, their advantage over unigram models appears to be limited. We choose to use unigram models for simplicity. Suppose we have a set of available documents D about T; pi is estimated as

    p_i = (f(D, w_i) + 0.01) / (|D| + 0.01n)

where f(D, wi) is the number of occurrences of wi in D, |D| is the size of D in words and n is the vocabulary size. The small value 0.01 prevents zero probabilities, as the Kullback-Leibler divergence described below involves logarithms.

Instead of directly computing probabilities, we use an information theoretic metric, Kullback-Leibler divergence, to measure how well a topic model for topic T predicts a query Q:

    KL(Q, T) = Σ_i (f(Q, w_i) / |Q|) log [ (f(Q, w_i) / |Q|) / p_i ]

where f(Q, wi) is the number of occurrences of wi in Q and |Q| is the length of Q in words. Kullback-Leibler divergence is an important metric in information theory. It has been widely used to measure how well one probability distribution predicts another in many applications such as speech recognition, pattern recognition and so forth. It is a distance metric and falls in [0, ∞). The smaller the value, the better the topic model of T predicts Q. Justification for the metric can be found in textbooks on information theory (Kullback et al., 1987).

We argue that distributed retrieval can potentially be more effective than centralized retrieval. Both document ranking and topic ranking against a query are based on statistical correlation between query words and words in a document or a topic. A document is a small sample of text. Therefore the statistics in
a document are often too sparse to reliably predict how likely it is that the document is relevant to a query. In contrast, we have much more text for a topic and the statistics are more stable. Therefore, we can more reliably predict how likely it is that a topic is the right topic for a query. By excluding clearly unrelated topics from the retrieval process, we can avoid retrieving many of the non-relevant documents which could be ranked highly by a full ranking of all documents. That is the reason we believe distributed retrieval with accurate topic selection can be more effective than centralized retrieval. Whether this can be achieved in practice will be determined experimentally.
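The two formulas above translate directly into code. In this sketch, documents and queries are pre-tokenized lists of words, query words are assumed to occur in the vocabulary, and the function names are ours; topics would be ranked for a query by increasing divergence.

import math
from collections import Counter

def topic_model(docs, vocab):
    # Smoothed unigram probabilities p_i for a topic from its documents.
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())          # |D|, topic size in words
    n = len(vocab)                        # vocabulary size
    return {w: (counts[w] + 0.01) / (total + 0.01 * n) for w in vocab}

def kl_divergence(query, model):
    # How well the topic model predicts the query; smaller is better.
    qlen = len(query)
    return sum((f / qlen) * math.log((f / qlen) / model[w])
               for w, f in Counter(query).items())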
3 K-MEANS CLUSTERING
In this work, topics are approximated by document clustering. We run a clustering algorithm on a set of documents and treat each cluster as a topic. The K-Means algorithm is chosen for its efficiency (Jain and Dubes, 1988). To create k clusters out of n documents, the algorithm starts by taking the first k documents as the initial clusters (seeds). For each of the remaining n – k documents, it finds the closest cluster and adds the document to that cluster. An optional second pass corrects possible mistakes made in the first pass. In the second pass, the algorithm takes the results of the first pass as the initial clusters and walks through the document set one more time. For each document it finds the closest cluster and re-assigns the document to that cluster unless the document is already there. The distance metric to determine the closeness of a document d to a cluster c is the Kullback-Leibler divergence with some modification:

    dist(d, c) = Σ_i (f(d, w_i) / |d|) log [ (f(d, w_i) / |d|) / ((f(c, w_i) + 0.01) / (|c| + 0.01n)) ]

where f(c, wi) is the number of occurrences of word wi in c, f(d, wi) is the number of occurrences of wi in d, |d| is the size of d and |c| is the size of c. The complexity of the algorithm is O(nk).
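A sketch of the two-pass procedure follows, with the modified divergence above used as the distance. Documents are token lists, clusters are term-count bags, and the function names are ours.

import math
from collections import Counter

def kl_distance(doc, cluster, n_vocab):
    # Modified KL divergence between a document and a cluster.
    dlen = len(doc)
    clen = sum(cluster.values())
    return sum((f / dlen) * math.log(
                   (f / dlen) /
                   ((cluster[w] + 0.01) / (clen + 0.01 * n_vocab)))
               for w, f in Counter(doc).items())

def kmeans_two_pass(docs, k, n_vocab):
    # Pass 1: the first k documents seed the clusters; each remaining
    # document is added to its closest cluster.
    clusters = [Counter(d) for d in docs[:k]]
    assign = list(range(k))
    for d in docs[k:]:
        best = min(range(k), key=lambda c: kl_distance(d, clusters[c], n_vocab))
        clusters[best].update(d)
        assign.append(best)
    # Pass 2 (optional): re-walk the document set, moving any document
    # whose closest cluster has changed.
    for i, d in enumerate(docs):
        best = min(range(k), key=lambda c: kl_distance(d, clusters[c], n_vocab))
        if best != assign[i]:
            clusters[assign[i]].subtract(d)
            clusters[best].update(d)
            assign[i] = best
    return clusters, assign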
4 FOUR METHODS OF DISTRIBUTED RETRIEVAL
Four methods of organizing a distributed retrieval system are used in this work. One is the baseline method of distributed retrieval with heterogeneous collections (Callan et al., 1995; Xu and Callan, 1998) that has been used in previous research. The other three are new methods based on the basic ideas discussed in the previous sections. The three new methods are global clustering, local clustering and multiple-topic representation.
4.1 BASELINE DISTRIBUTED RETRIEVAL
The baseline method represents the typical approach to distributed retrieval in previous studies. A distributed retrieval system using this method consists of a number of heterogeneous collections. Documents in one collection are typically from the same source or were written in the same time period. To simulate existing collection selection algorithms, each collection is represented by one topic model, even though it contains documents about different topics. This method is used as a baseline in our experiments. It is illustrated by Figure 6.1.
Figure 6.1 Baseline distributed retrieval

4.2 GLOBAL CLUSTERING
Our first new method is global clustering. We assume that all documents are made available in one central repository for us to manipulate. We can cluster all the documents and make each cluster a separate collection. Each collection therefore contains only one topic. Selecting the right collections for a query is the same as selecting the right topics for the query. This method is illustrated by Figure 6.2.

This method is appropriate for searching very large corpora such as the U.S. Patent and Trademark collection (Larkey, 1998) and Internet search engines, where the collection size can be hundreds of gigabytes or even terabytes. The size of the collections and the volume of queries to process make efficiency a critical issue. Distributed retrieval can improve efficiency because we do not have to search all the collections for each query.

The baseline method described in section 4.1 would partition a large collection into smaller ones according to attributes such as the sources of the documents and the time of document creation. Such partitions are convenient but undesirable for distributed retrieval. Relevant documents for a query may be scattered over many collections. Therefore we have to either search many collections at the cost of efficiency or search fewer collections at the cost of effectiveness. Furthermore, the resulting collections are often heterogeneous in content and make collection selection difficult.
Global clustering produces topic-based collections which are more suitable for distributed retrieval.
Figure 6.2 Distributed retrieval with global clustering
Experimental results show that this method can achieve the most effective retrieval. A disadvantage is that creating many clusters can be expensive. The other problem is that it is not appropriate in environments where documents cannot be made available in one place for reasons such as copyright.
4.3 LOCAL CLUSTERING
Our second new method is local clustering. We assume that a distributed system comprises a number of autonomous subsystems or domains. Documents within subsystems are protected and cannot be made available to a central repository. Within a subsystem, however, there is no limitation on how the documents may be manipulated. For example, we can imagine a federated retrieval system comprising several for-profit retrieval service providers such as WESTLAW, Lexis-Nexis and so forth. Each subsystem can cluster its own documents and make each topic a collection. This method is illustrated by Figure 6.3.
Figure 6.3 Distributed retrieval with local clustering
This method can provide competitive distributed retrieval without assuming full co-operation from the subsystems. The disadvantage is that its performance is slightly worse than that of global clustering.
4.4 MULTIPLE-TOPIC REPRESENTATION
Our third new method is multiple-topic representation. In addition to the constraints in local clustering, we assume that subsystems do not want to physically partition their documents into several collections. A possible reason is that a subsystem has already created a single index and wants to avoid the cost of re-indexing. However, each subsystem is willing to cluster its documents and summarize its collection as a number of topic models for effective collection selection. With this method, a collection corresponds to several topics. Collection selection is based on how well the best topic in a collection matches a query. This method is illustrated by Figure 6.4.
Figure 6.4 Distributed retrieval with multiple-topic representation
The advantage with this approach is that it assumes minimum co-operation from the subsystems. The disadvantage is that it is less effective than global clustering and local clustering. Experiments show, however, that it is still more effective than the baseline method which represents a heterogeneous collection as a single topic.
5 EXPERIMENTAL SETUP
Experiments were carried out on TREC3, TREC4 and TREC6. The TREC3 queries consist of words from the title, description and narrative fields, averaging 34.5 words per query. The TREC4 queries have 7.5 words per query. The TREC6 queries in this study use only the title words, averaging 2.6 words per query. The purpose of using 3 sets of queries of different lengths is to ensure the generality of our results. Table 6.1 shows the statistics of the test sets.

The two baselines used are centralized retrieval and the baseline distributed retrieval method, which creates collections based on document sources and represents a collection as a whole. The sets of collections used for baseline distributed retrieval are:

TREC3-100col-bysource: 100 collections created according to the sources of the documents in TREC3. The number of collections created for a source is proportional to the number of documents in that source. For example, the source DOE contains around 30% of the documents in TREC3 and is therefore split into 30 collections. Each source is evenly divided
Table 6.1 Test collections statistics. Stop words are not included.

collection   query count   size (GB)   document count   words per query   words per document   rel docs per query
TREC3        50            2.2         741,856          34.5              260                  196
TREC4        49            2.0         567,529          7.5               299                  133
TREC6        50            2.2         556,077          2.6               308                  92
into collections. This ensures that all collections have roughly the same number (7,418) of documents.

TREC4-100col-bysource: 100 collections created similarly for TREC4. Each collection has roughly the same number (5,675) of documents.

TREC6-100col-bysource: 100 collections created similarly for TREC6. Again, each collection has roughly the same number (5,560) of documents.

The sets of collections created by global clustering are TREC3-100col-global, TREC4-100col-global, and TREC6-100col-global. Each set was created by running two-pass K-Means clustering on the corresponding document set and has 100 collections.

The set of collections created by local clustering is TREC4-100col-local. Two-pass K-Means clustering was run on each of the six TREC4 sources separately. The total number of collections is 100. The number of collections for a source is proportional to the number of documents in the source.

The set of collections used in the multiple-topic representation experiments is TREC4-10col-bysource. The ten collections are AP88, AP90, FR88, U.S. Patent, SJM91, WSJ90, WSJ91, WSJ92, ZIFF91 and ZIFF92. These are the natural collections in TREC volumes 2 and 3. On average, each collection is represented as 10 topic models. The number of topic models for a collection is proportional to the number of documents in the collection.

Collections are indexed and searched by the INQUERY retrieval system (Broglio et al., 1994). A problem encountered in the experiments is the inverse document frequency (IDF). IDF is intended to give rare terms, which are usually important terms, more credit in retrieval. When collections are created by topics, however, IDF can achieve the opposite effect. Frequent terms in a topic are often important in distinguishing the topic from other topics. To avoid the problem, we modified INQUERY so that it uses global IDF in retrieval. Given a test set (e.g. TREC3), the global IDF of a term is calculated based on the total number of documents in the set that contain the term. By doing so we also avoided the
tricky issue of result merging, which is not the primary concern of this study. Global IDF was used in all experiments in this paper.

The steps to search a set of distributed collections for a query are (1) rank the collections against the query, (2) retrieve the top N documents from each of the best n collections, and (3) merge the retrieval results based on the document scores. Because most people are interested in retrieving a small number of documents, we evaluate the results at a small value of N (30). The major evaluation metric is precision at document cut-offs 5, 10, 15, 20 and 30. Optionally, we consider full recall-precision figures for applications where both recall and precision are important.
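These three steps amount to a short driver loop. In this sketch the collection-ranking and per-collection search functions are supplied by the caller (hypothetical interfaces, not INQUERY's API), and document scores are assumed comparable across collections, which the global IDF described above provides.

def distributed_search(query, collections, rank_collections, search_collection,
                       n_colls=10, n_docs=30):
    # (1) Rank collections against the query, e.g., by Kullback-Leibler
    #     divergence between the query and each collection's topic model(s).
    ranked = rank_collections(query, collections)
    # (2) Retrieve the top n_docs documents from each of the best n_colls;
    #     search_collection returns (doc_id, score) pairs.
    results = []
    for coll in ranked[:n_colls]:
        results.extend(search_collection(coll, query, n_docs))
    # (3) Merge by document score; since scores are comparable across
    #     collections (global IDF), a single sort implements the merge.
    return sorted(results, key=lambda pair: pair[1], reverse=True)[:n_docs]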
6 GLOBAL CLUSTERING

6.1 RESULTS
Tables 6.2, 6.3 and 6.4 compare the baseline results and global clustering on TREC3, TREC4 and TREC6. Ten collections were searched per query in the experiments. Each collection was represented as a language model. The collection selection metric is the Kullback-Leibler divergence. The results show that the baseline distributed retrieval with heterogeneous collections is significantly worse than centralized retrieval, around 30% at all document cutoffs on all three test sets. But when collections are created based on topics, the performance of distributed retrieval is close to centralized retrieval on all three test sets. On TREC4, global clustering outperforms centralized retrieval at document cutoffs 5 and 10. The t-test (Hull, 1993) shows the improvement over centralized retrieval is statistically significant at document cutoff 10 (p=0.05).

Table 6.2 TREC3: comparing centralized retrieval, baseline distributed retrieval and global clustering
            TREC3         TREC3               TREC3
            centralized   100col-bysource     100col-global
 5 docs:    0.6760        0.5000 (-26.0)      0.6680 (-1.2)
10 docs:    0.6080        0.4520 (-25.7)      0.6080 (+0.0)
15 docs:    0.5840        0.4067 (-30.4)      0.5707 (-2.3)
20 docs:    0.5490        0.3770 (-31.3)      0.5320 (-3.1)
30 docs:    0.5120        0.3240 (-36.7)      0.4920 (-3.9)

6.2 DISCUSSION
One reason that global clustering improves distributed retrieval is that it makes the distribution of relevant documents more concentrated. Figure 6.5
Table 6.3 TREC4: comparing centralized retrieval, baseline distributed retrieval and global clustering

            TREC4         TREC4               TREC4
            centralized   100col-bysource     100col-global
 5 docs:    0.5918        0.4245 (-28.3)      0.6204 (+4.8)
10 docs:    0.4918        0.3816 (-22.4)      0.5163 (+5.0)
15 docs:    0.4612        0.3469 (-24.8)      0.4639 (+0.6)
20 docs:    0.4337        0.3122 (-28.0)      0.4194 (-3.3)
30 docs:    0.3714        0.2769 (-25.4)      0.3707 (-0.2)
Table 6.4 TREC6: comparing centralized retrieval, baseline distributed retrieval and global clustering

            TREC6         TREC6               TREC6
            centralized   100col-bysource     100col-global
 5 docs:    0.4440        0.3360 (-24.3)      0.4520 (+1.8)
10 docs:    0.3920        0.2940 (-25.0)      0.3820 (-2.6)
15 docs:    0.3573        0.2560 (-28.4)      0.3533 (-1.1)
20 docs:    0.3330        0.2350 (-29.4)      0.3220 (-3.3)
30 docs:    0.2907        0.1973 (-32.1)      0.2780 (-4.4)
shows the distribution of relevant documents in TREC4-100col-global and in TREC4-100col-bysource. There are four curves in the figure:

bysource-optimal-ranking: Collections in TREC4-100col-bysource are ranked by optimal ranking. That is, the ranking is based on how many relevant documents a collection has for a query. It is the upper bound for any collection selection algorithm.

global-optimal-ranking: Optimal ranking of collections in TREC4-100col-global.

bysource-KL-ranking: Ranking of collections in TREC4-100col-bysource using Kullback-Leibler divergence.

global-KL-ranking: Ranking of collections in TREC4-100col-global using Kullback-Leibler divergence.

On the y-axis, y(i) is the average number of relevant documents in the top i collections. We can see that relevant documents are significantly more densely distributed in TREC4-100col-global than in TREC4-100col-bysource. The TREC4 queries have 133 relevant documents per query on average. With
TREC4-100col-global, the top 10 collections chosen by optimal ranking have 119 relevant documents on average. By comparison, the number is only 65 with TREC4-100col-bysource. That is, given heterogeneous collections, even with perfect collection selection, searching only 10 collections misses over 50% of relevant documents.
Figure 6.5 Comparing the distributions of relevant documents in TREC4-100col-global and TREC4-100col-bysource
The second reason for the improvement is that collection selection is more accurate with topic-based collections than with heterogeneous collections. We measure the accuracy of a collection selection algorithm by comparing it with optimal ranking of collections. On the TREC4-100col-global test-bed, when 10 collections are selected for each query by each method, the optimal ranking finds 119 relevant documents per query and Kullback-Leibler finds 90. This represents a 76% (90/119) accuracy. This shows that with topic-based collections, collection selection is very accurate. On the TREC4-100col-bysource test-bed, the accuracy is only 54% (35 Kullback-Leibler/65 optimal).

Table 6.3 shows that global clustering is more effective than centralized retrieval when 10 documents are retrieved per query on TREC4. The reason for the improvement is that the small number of collections we selected contain most of the relevant documents. Therefore we are able to exclude many non-relevant documents from the top ranked set without removing many relevant documents. If we divide the 490 documents (10 documents per query × 49 queries) retrieved by centralized retrieval into two categories, those that were included and those that were excluded by distributed retrieval, 56% (217 relevant/387 total) of the included are relevant while only 23% (24/103) of the excluded are relevant.

The collections in TREC4-100col-bysource have roughly the same number of documents. The collections in TREC4-100col-global, however, have very different numbers of documents, ranging from 301 to 82,727. One might have the concern that our collection selection method may simply choose the largest collections. To make sure our technique is immune to this problem, we calculated the average number of documents per collection for the collections we searched (10 collections per query). The number is 5,300, which is even slightly smaller than the average (5,675) for the whole set TREC4-100col-global.
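The accuracy figure used in this comparison can be computed as follows; a sketch, assuming rel_counts maps each collection to its number of relevant documents for a query and metric_ranking is the list of collections ordered by the selection metric (names are ours).

def selection_accuracy(metric_ranking, rel_counts, n=10):
    # Relevant documents reachable through the metric's top-n collections,
    # as a fraction of those reachable through the optimal (oracle) top-n.
    # For example, 90 / 119 = 0.76 on the TREC4-100col-global test-bed.
    oracle = sorted(rel_counts, key=rel_counts.get, reverse=True)[:n]
    return (sum(rel_counts[c] for c in metric_ranking[:n]) /
            sum(rel_counts[c] for c in oracle))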
6.3 COLLECTION SELECTION METRICS
The INQUERY retrieval function is very effective for document retrieval, as shown by past TREC results (Voorhees and Harman, 1998). It is, however, less effective than Kullback-Leibler for collection selection. Table 6.5 compares the retrieval performance on TREC4-100col-global when the INQUERY retrieval function instead of Kullback-Leibler was used for collection selection. (The INQUERY-style collection selection was described by Callan et al., 1995; Xu and Callan, 1998.) Ten collections were selected per query by each metric. Using INQUERY for collection selection resulted in a drop in precision at all cutoff levels, with an average drop of 7.6%. The collections selected by INQUERY have an average of 82 relevant documents per query while the ones selected by Kullback-Leibler have 90 per query. The results show that collection selection is different from document retrieval. A good metric for one problem is not necessarily good for the other.

Table 6.5 Comparing Kullback-Leibler and INQUERY for collection selection on TREC4-100col-global
            Kullback-Leibler   INQUERY
 5 docs:    0.6204             0.5510 (-11.2)
10 docs:    0.5163             0.4633 (-10.3)
15 docs:    0.4639             0.4272 (-7.9)
20 docs:    0.4194             0.3980 (-5.1)
30 docs:    0.3707             0.3565 (-3.8)

7 RECALL-BASED RETRIEVAL
In previous experiments, we considered only precision figures when at most 30 documents are returned for a query. While this is reasonable in many retrieval
environments, there are applications for which both recall and precision are important. Here we examine the impact of distributed retrieval on the full recall-precision figures. The experiments were carried out using the TREC4-100col-global test-bed. We examine retrieval performance when 10, 30 and 50 collections are selected for each query.

The results in Table 6.6 show that when 10 collections are selected per query, the average precision is 10% worse than centralized retrieval. The performance change at recall ≤ 0.2 is minimal but is significant at higher recall levels. When 30 collections are selected per query, the average precision is only 3.4% worse than centralized retrieval. There is no significant change in precision at recall ≤ 0.4. When 50 collections were selected, the change in average precision is negligible. There is no significant change in precision at recall ≤ 0.7. Overall, when recall is important, more collections must be searched. But even for recall-based retrieval, distributed retrieval can still exclude 50% of collections (which means less computation and faster retrieval) without affecting performance.
Table 6.6 Full recall-precision figures on TREC4

Recall        centralized   top-10-colls      top-30-colls     top-50-colls
Total number of documents over all queries
Retrieved:    49000         49000 (+0.0)      49000 (+0.0)     49000 (+0.0)
Relevant:     6501          6501 (+0.0)       6501 (+0.0)      6501 (+0.0)
Rel-ret:      3515          3078 (-12.4)      3455 (-1.7)      3526 (+0.3)
Interpolated Recall - Precision Averages at:
0.00          0.7791        0.8187 (+5.1)     0.7933 (+1.8)    0.7804 (+0.2)
0.10          0.5254        0.5358 (+2.0)     0.5208 (-0.9)    0.5288 (+0.6)
0.20          0.4087        0.3959 (-3.1)     0.4103 (+0.4)    0.4107 (+0.5)
0.30          0.3337        0.2846 (-14.7)    0.3300 (-1.1)    0.3314 (-0.7)
0.40          0.2581        0.1968 (-23.8)    0.2446 (-5.2)    0.2618 (+1.4)
0.50          0.2063        0.1409 (-31.7)    0.1728 (-16.2)   0.2021 (-2.0)
0.60          0.1324        0.0999 (-24.5)    0.1216 (-8.2)    0.1287 (-2.8)
0.70          0.0765        0.0449 (-41.3)    0.0679 (-11.2)   0.0736 (-3.8)
0.80          0.0298        0.0130 (-56.4)    0.0241 (-19.1)   0.0254 (-14.8)
0.90          0.0039        0.0001 (-97.4)    0.0033 (-15.4)   0.0034 (-12.8)
1.00          0.0011        0.0000 (-100.0)   0.0001 (-90.9)   0.0012 (+9.1)
Average precision (non-interpolated) over all rel docs
              0.2255        0.2029 (-10.0)    0.2178 (-3.4)    0.2241 (-0.6)
8 DISTRIBUTED RETRIEVAL IN DYNAMIC ENVIRONMENTS
Results in previous sections were obtained using two-pass K-Means clustering. Since it needs to see the last document in a document set before assigning documents to clusters, it is not appropriate for environments where new documents arrive on a regular basis. One-pass K-Means is more suitable in this case. Table 6.7 shows the retrieval results when one-pass K-Means clustering was applied on TREC4. As before, 100 clusters were created and 10 were selected per query. Retrieval performance is very close to centralized retrieval. It means that our technique would also work well in a dynamic environment.

Table 6.7 Comparing centralized retrieval and one-pass K-Means clustering on TREC4

            TREC4         TREC4
            centralized   1-pass-kmeans
 5 docs:    0.5918        0.5714 (-3.4)
10 docs:    0.4918        0.4898 (-0.4)
15 docs:    0.4612        0.4422 (-4.1)
20 docs:    0.4337        0.4082 (-5.9)
30 docs:    0.3714        0.3605 (-2.9)
9 MORE CLUSTERS
Our previous experiments created 100 clusters and searched 10% of the collections per query. In this experiment, we created 1000 clusters on TREC4 and examined the impact on retrieval performance. To make the results comparable, we still searched 10% of the collections, i.e., 100 collections per query. The results in Table 6.8 show that the performance (TREC4-1000col-global) is close to centralized retrieval and is moderately worse than creating 100 clusters. The average number of documents in a cluster is 568, and many clusters have fewer than 100 documents. We suspect the degradation in performance may be due to the small clusters, which may be too small to allow reliable selection. But overall, retrieval performance is relatively robust with respect to cluster size.
10 BETTER CHOICE OF INITIAL CLUSTERS
It is known that with K-Means clustering, the resulting clusters often depend heavily on the initial clusters, or seeds. In previous experiments, we simply chose the first k documents as the initial clusters. That may result in undesirable clusters because some of the documents chosen as seeds may belong to the same topic.
Table 6.8 Retrieval results when 1000 clusters were created on TREC4

            TREC4         TREC4             TREC4
            centralized   100col-global     1000col-global
 5 docs:    0.5918        0.6204 (+4.8)     0.5796 (-2.1)
10 docs:    0.4918        0.5163 (+5.0)     0.4857 (-1.2)
15 docs:    0.4612        0.4639 (+0.6)     0.4558 (-1.2)
20 docs:    0.4337        0.4194 (-3.3)     0.4224 (-2.6)
30 docs:    0.3714        0.3707 (-0.2)     0.3667 (-1.3)
As a better method of choosing initial clusters, we randomly chose 2000 documents from the TREC4 document set and clustered them using the average-link algorithm (Salton, 1989). We generated 100 clusters by properly setting the similarity threshold. Average-link clustering is a slow algorithm but produces very stable clusters. We then used the resulting clusters as the seeds for K-Means clustering. We compared the retrieval results with previous results where the first 100 documents were used as seeds. Disappointingly, there is no significant difference between them. The possible explanation is that the 100 documents used as seeds turned out to be news stories (from AP). It appears that these news articles are fairly random in content and therefore did not cause serious problems in clustering. This approach might make a difference on other test-beds.
11 LOCAL CLUSTERING
In environments where subsystems are autonomous, local clustering is appropriate. Since the number of clusters created is small for each subsystem, the method also scales well. Table 6.9 shows that the retrieval performance of local clustering is only slightly worse than centralized retrieval. Ten collections were searched for each query. Performance at document cutoff 5 is even somewhat (2.1%) better than centralized retrieval. Local clustering is substantially better than the baseline distributed retrieval method. The results demonstrate that some extra work from participating subsystems can significantly improve the performance of a distributed retrieval system. Compared to global clustering (Table 6.3), local clustering is slightly worse in performance. In our opinion, local clustering is a reasonable tradeoff between retrieval effectiveness and implementation complexity.
Table 6.9 TREC4: comparing centralized retrieval, baseline distributed retrieval and local clustering

            TREC4         TREC4               TREC4
            centralized   100col-bysource     100col-local
 5 docs:    0.5918        0.4245 (-28.3)      0.6041 (+2.1)
10 docs:    0.4918        0.3816 (-22.4)      0.4857 (-1.2)
15 docs:    0.4612        0.3469 (-24.8)      0.4381 (-5.0)
20 docs:    0.4337        0.3122 (-28.0)      0.4020 (-7.3)
30 docs:    0.3714        0.2769 (-25.4)      0.3476 (-6.4)

12 MULTIPLE-TOPIC REPRESENTATION

Table 6.10 shows the performance of distributed retrieval with multiple-topic representation. The set of collections used is TREC4-10col-bysource, which has 10 collections. The baseline distributed retrieval method represents
a collection by one topic model while the new method represents a collection by several topic models, as discussed in section 4. The average number of topic models is 10 per collection with the new method. The rank of a collection was determined by the best topic model of that collection for a query. Two collections were searched per query. The retrieval performance of multiple-topic representation is noticeably better than the baseline distributed retrieval at all cutoffs. The improvement is statistically significant at all cutoffs (t-test, p-value < 0.01). The improvement at document cutoff 5 is a substantial 18%. The results show that representing a heterogeneous collection as a number of topics can significantly improve collection selection.

Table 6.10 TREC4: comparing centralized retrieval, baseline distributed retrieval, multiple-topic representation and optimal collection selection
            TREC4         TREC4             TREC4             TREC4
            centralized   10col-bysource    10col-bysource    10col-bysource
                          baseline          multiple-topic    optimal
 5 docs:    0.5918        0.4163 (-29.7)    0.4898 (-17.2)    0.5469 (-7.6)
10 docs:    0.4918        0.3673 (-25.3)    0.3980 (-19.1)    0.4673 (-5.0)
15 docs:    0.4612        0.3347 (-27.4)    0.3646 (-20.9)    0.4463 (-3.2)
20 docs:    0.4337        0.2969 (-31.5)    0.3255 (-24.9)    0.4041 (-6.8)
30 docs:    0.3714        0.2537 (-31.7)    0.2850 (-23.3)    0.3503 (-5.7)
One problem with multiple-topic representation is that relevant documents are still sparsely distributed. Even though we are able to rank the collections more accurately, searching a few collections misses many relevant documents. Therefore the retrieval performance is significantly worse than centralized retrieval and the other two techniques. In fact, even with optimal collection selection based on the number of relevant documents in a collection, the performance is still worse than centralized retrieval (Table 6.10).
13
EFFICIENCY
It took about 6 hours on an Alpha workstation to run the two-pass K-Means algorithm on TREC4 when 100 topics were created. This speed is acceptable for 2 GB collections. Memory usage was around 100 MB and could be reduced with a more careful implementation. When local clustering was performed on the six document sources of TREC4 individually, the time to create 100 topics was 2 hours. While the efficiency of global clustering using the K-Means algorithm is acceptable for 2 GB collections, it may be too slow for very large collections (e.g., 100 GB). The running time of K-Means is proportional to the product of the collection size and the number of topics created. Since larger collections require more topics, the clustering time increases faster than the collection size. There are two possible solutions. One is to partition large collections into chunks of appropriate size (e.g., 2 GB per chunk) and cluster each chunk separately, which limits the number of topics per chunk and hence improves clustering efficiency (a sketch of this chunking strategy appears at the end of this section); the results of local clustering suggest that this solution should be close to global clustering in performance. The other is to use faster clustering algorithms. Some clustering algorithms, such as the Scatter/Gather algorithm and the binary clustering algorithm, run in O(n) or O(n log n) time, where n is the size of the collection. Scatter/Gather has been used to cluster large collections of documents (Cutting et al., 1992). Binary clustering, usually used together with K-Means, has been used in speech recognition for constructing VQ codebooks (Rabiner and Juang, 1993). Since both algorithms emphasize efficiency over the quality of clustering, it remains to be seen how retrieval effectiveness would be affected.
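A sketch of the first solution, under stated assumptions: a fixed 2 GB chunk size, a topics-per-gigabyte rule of thumb, and a cluster_chunk callback standing in for the K-Means pass.

    def cluster_in_chunks(documents, cluster_chunk, chunk_bytes=2 * 1024**3,
                          topics_per_gb=50):
        """Partition a large collection into ~2 GB chunks and cluster each
        chunk separately, so each K-Means run sees a bounded number of
        topics and total time grows roughly linearly with collection size."""
        chunks, current, size = [], [], 0
        for doc_id, text in documents:
            current.append((doc_id, text))
            size += len(text)
            if size >= chunk_bytes:
                chunks.append(current)
                current, size = [], 0
        if current:
            chunks.append(current)

        topics = []
        for chunk in chunks:
            gb = sum(len(t) for _, t in chunk) / 1024**3
            n_topics = max(1, int(topics_per_gb * gb))
            topics.extend(cluster_chunk(chunk, n_topics))  # e.g. a K-Means pass
        return topics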
14
RELATED WORK
Distributed retrieval has been studied under a variety of names, including server selection (Hawking and Thistlewaite, 1999), text database resource discovery (Gravano et al., 1994) and collection selection (Callan et al., 1995). A popular technique for representing collections is to use word histograms (Gravano et al., 1994; Callan et al., 1995; Xu and Callan, 1998). Other techniques include manually describing the content of a collection (Kahle and Medlar, 1991) and knowledge-based techniques (Chakravarthy and Hasse, 1995). Danzig et al., 1991, proposed an approach to organize a document space around retrieval results for a set of common queries. Dolin et al., 1996, proposed an approach to server selection based on term classification to address the vocabulary mismatch between user requests and actual documents. Weiss et al., 1996, proposed an approach to organizing web resources based on content and link clustering. Hawking and Thistlewaite, 1999, proposed a distributed retrieval technique based on lightweight probe queries. Xu and Callan, 1998,
demonstrated that properly expanded queries can improve collection selection. French et al., 1998, discussed issues in evaluating collection selection techniques. Document clustering has been extensively studied in IR as an alternative to ranking-based retrieval and as a tool for browsing (van Rijsbergen, 1979). Recent trends in document clustering include faster algorithms (Cutting et al., 1992; Silverstein and Pedersen, 1997) and clustering query results (Hearst and Pedersen, 1996). The K-Means algorithm was described by Jain and Dubes, 1988. Ponte, 1998, showed that the language modeling approach to retrieval can produce very effective retrieval results. Language modeling was used by Yamron, 1997, for text segmentation; the topic models in this study are similar to the ones in that work.
15
CONCLUSION AND FUTURE WORK
This work proposed a topic-oriented approach to distributed retrieval. Under this general approach, we proposed three methods of organizing a distributed retrieval system, all of which can improve the results of distributed retrieval. One area for future work is to determine how many topics are appropriate for a collection. A second area is to explore the usability of faster clustering algorithms, as we discussed in section 13. A third area is to explore the impact of collection selection on the throughput of a distributed retrieval system. One issue is server balance: independent of the query in question, some collections are more likely to be accessed than others. A naive server allocation strategy that assigns an equal amount of computing resources to all collections can jam some servers while letting others idle. A strategy that allocates resources in proportion to the access rates of the collections will maximize the throughput of the system and reduce retrieval latency. The test collections used in this paper do not have enough queries to reliably establish the access patterns across collections; realistic data such as the query logs of Internet search engines would be valuable for such research.
Acknowledgments
Most of the work laid out in this study was completed when the first author was at the Center for Intelligent Information Retrieval, University of Massachusetts, Amherst, with support from the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235. This work is also supported in part by the professional development program of
BBN Technologies. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.
References
Broglio, J., Callan, J. P., and Croft, W. (1994). An overview of the INQUERY system as used for the TIPSTER project. In Proceedings of the TIPSTER Workshop. Morgan Kaufmann.
Callan, J. P., Lu, Z., and Croft, W. (1995). Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–28.
Chakravarthy, A. and Hasse, K. (1995). NetSerf: Using semantic knowledge to find Internet information archives. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11.
Cutting, D., Karger, D. R., Pedersen, J. O., and Tukey, J. W. (1992). Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Danzig, P., Ahn, J., Noll, J., and Obraczka, K. (1991). Distributed indexing: A scalable mechanism for distributed information retrieval. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 220–229.
Dolin, R., Agrawal, D., Dillon, L., and Abbadi, A. E. (1996). Pharos: a scalable distributed architecture for locating heterogeneous information sources. Technical Report TRCS96-05, Computer Science Department, University of California, Santa Barbara.
French, J., Powell, A., Viles, C., Emmitt, T., and Prey, K. (1998). Evaluating database selection techniques: A testbed and experiment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 121–129.
Gravano, L., Garcia-Molina, H., and Tomasic, A. (1994). The effectiveness of GlOSS for the text database discovery problem. In Proceedings of SIGMOD 94, pages 126–137. ACM.
Hawking, D. and Thistlewaite, P. (1999). Methods for information server selection. ACM Transactions on Information Systems, 17(1):40–76.
Hearst, M. and Pedersen, J. O. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 76–84.
Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 329–338.
Jain, A. and Dubes, R. (1988). Algorithms for Clustering Data. Prentice Hall.
Kahle, B. and Medlar, A. (1991). An information system for corporate users: Wide Area Information Servers. Technical Report TMC199, Thinking Machines Corporation.
Kullback, S., Keegel, J., and Kullback, J. (1987). Topics in Statistical Information Theory. Springer-Verlag.
Larkey, L. (1998). Some issues in the automatic classification of U.S. patents. In Learning for Text Categorization: Papers from the 1998 Workshop, pages 87–90. AAAI Press.
Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400:107–109.
Miller, D., Leek, T., and Schwartz, R. (1999). A hidden Markov model information retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 214–221.
Ponte, J. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281.
Rabiner, L. and Juang, B. (1993). Fundamentals of Speech Recognition. Prentice Hall.
Salton, G. (1989). Automatic Text Processing. Addison Wesley.
Silverstein, C. and Pedersen, J. O. (1997). Almost constant-time clustering of arbitrary corpus subsets. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 60–66.
Song, F. and Croft, W. B. (1999). A general language model for information retrieval. In Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM), pages 316–321.
van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, second edition.
Voorhees, E. and Harman, D., editors (1998). TREC7 Proceedings. NIST.
Weiss, R., Velez, B., Sheldon, M., Namprempre, C., Szilagyi, P., Duda, A., and Gifford, D. (1996). HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Proceedings of the 7th ACM Conference on Hypertext.
Xu, J. and Callan, J. (1998). Effective retrieval with distributed collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 112–120.
Yamron, J. (1997). Topic detection and tracking segmentation task. In Proceedings of the DARPA Topic Detection and Tracking Workshop (unpublished).
Chapter 7

THE EFFECT OF COLLECTION ORGANIZATION AND QUERY LOCALITY ON INFORMATION RETRIEVAL SYSTEM PERFORMANCE

Zhihong Lu
AT&T Network Services
zlu@puma.att.com
Kathryn S. McKinley
Department of Computer Science, University of Massachusetts
mckinley@cs.umass.edu
Abstract The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of text. Collection selection and partial collection replication with replica selection are two such mechanisms that enable IR systems to search a small percentage of data and thus improve performance and scalability. To maintain effectiveness as well as efficiency, IR systems must be configured carefully to consider workload locality and possible collection organizations. We propose IR system architectures that incorporate collection selection and partial replication, and compare configurations using a validated simulator. Locality and collection organization have dramatic effects on performance. For example, we demonstrate with simulation results that collection selection performs especially well when the distribution of queries to collections is uniform and collections are organized by topics, but it suffers when particular collections are “hot.” We find that when queries have even modest locality, configurations that replicate data outperform those that partition data, usually significantly. These results can be used as the basis for IR system designs under a variety of workloads and collection organizations.
1
INTRODUCTION
The rapidly increasing content in distributed information retrieval (IR) systems for unstructured text has motivated new performance-improving techniques such as collection selection (Voorhees et al., 1995; Callan et al., 1995; French et al., 1999; Xu and Croft, 1999) and partial replication with replica selection (Lu and McKinley, 1999a), and confirmed the importance of techniques such as caching (Martin and Russell, 1991; Markatos, 1999). These techniques all seek to decrease query response time by searching less text, while maintaining the same effectiveness (i.e., returning the same number of relevant documents to each query). For a given query, collection selection seeks to reduce the number of collections searched by picking, and searching only, the most relevant subset of a large number of collections. In many systems, the queries repeat and/or are related to the same set of documents; we say these queries have locality. By separately storing the queries with the most locality, caches and partial replicas seek to improve performance by limiting their search to the cache or partial replica, rather than searching the entire collection. Caches have a simple organization and membership test. Caches may store a set of queries, their responses, and the corresponding documents. The membership test is simply: is this query in the cache? With a general cache, if an IR system sees the exact same query or document request repeatedly, it can respond from the cache rather than by repeating query processing or fetching the document from the original source. The vast majority of web caches store only documents and respond only to document requests (Wang, 1999). The IR system can place caches at a variety of locations to improve system performance and availability. Partial replicas are based on the same idea as caches, but are instead searchable sub-collections consisting of the set of documents returned for the most frequent queries (Lu and McKinley, 1999a). Their membership test is: is this query relevant to the partial replica? Partial replicas are coupled with a replica selection function to choose between replicas and the original collection based on content and load. These IR systems increase observed query locality over caching because they do not depend on exact match (Lu and McKinley, 1999b). For example, the replica selector can direct many distinct queries such as "Starr," "Starr Report," "Bill Clinton," and "Monica Lewinsky," all trying to access the Starr Report, to the same replica based on content, while caches must see the exact same query to return the Starr Report. In previous work, we showed replication may increase observed locality by up to 20% compared with caching (Lu and McKinley, 1999b). We leave it to future work to quantify the performance benefit this increase should give, and for the purposes of this work, consider caches to be subsumed by partial replicas.
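The contrast between the two membership tests can be made concrete with a small sketch; the class names, the similarity function, and the threshold below are hypothetical, and real replica selection, as discussed later, also accounts for load.

    class QueryCache:
        """A cache answers only queries it has seen verbatim before."""
        def __init__(self):
            self.results = {}                  # query string -> cached result
        def lookup(self, query):
            return self.results.get(query)    # exact-match membership test

    class PartialReplica:
        """A searchable replica answers any query relevant to its contents."""
        def __init__(self, index, threshold=0.4):
            self.index, self.threshold = index, threshold
        def can_serve(self, query, similarity):
            # Relevance-based membership test: "is this query relevant to me?"
            return similarity(query, self.index) >= self.threshold

Under this contrast, "Starr Report" and "Monica Lewinsky" are distinct cache keys, but both can pass the replica's relevance test.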
Collection selection picks the most relevant collections for a given query out of some larger set (Callan et al., 1995; Voorhees et al., 1995; French et al., 1999; Xu and Croft, 1999). It is distinguished from replica selection because it expects distinct collections and does not require query locality. Because it expects distinct collections and is somewhat influenced by collection size, it is not able to select replicas very well (Lu and McKinley, 1999a). However, it can be used in combination with replica selection. For example, given a number of collections and replicas for them, if the replica selector does not choose a replica, the IR system can use the collection selector to choose among the original collections. Previous work shows how to maintain retrieval accuracy using collection and replica selection (Callan et al., 1995; Voorhees et al., 1995; French et al., 1999; Lu and McKinley, 1999a; Xu and Croft, 1999). In this work, we build on previous results and other system characteristics to configure IR systems to achieve both high retrieval accuracy and good performance (i.e., fast query response time). We use a simulator for our experimental results, which we validated against a running multithreaded version of INQUERY on a shared memory multiprocessor (Callan et al., 1992; Turtle, 1991; Lu, 1999). We use workloads and collection organizations to determine when and how to use collection selection and replication to achieve high performance. We classify collection organizations as by topic, by source, or random, and compare using additional resources either for further partitioning of the collections or for partial replication. We then compare the simulated performance of configurations that achieve the same or similar query accuracy. Our simulation results indicate that the performance of collection selection is highly correlated with the distribution of queries among collections. If queries localize in a few collections, which is likely with a topic or even a source organization, IR system performance with collection selection is similar to that without collection selection. When queries distribute uniformly among the collections, collection selection improves performance significantly. Query locality always enables partial replication to improve performance consistently over collection selection on its own. In situations where collection selection is not applicable, e.g., with randomly distributed collections, partitioning can only improve performance by a small amount, but with query locality, replication yields dramatic improvements. We also demonstrate with a sensitivity study that these results are similar even with increases to the collection size and improved hardware, which indicates these trends are likely to continue to hold in the future. The remainder of this chapter is organized as follows. The next section overviews related work in IR system architectures and scalability, databases, web caching, and collection selection. Section 3 introduces the distributed information retrieval architectures we study, and Section 4 discusses the impact of collection organization and query locality on their configurations. Section 5
briefly describes our simulation model. Section 6 compares the performance of collection partitioning, partial collection replication, and collection selection, discusses the sensitivity of our results, and demonstrates the performance impact of larger disks and faster servers. Section 7 summarizes our results and concludes.
2
RELATED WORK
In this section, we discuss several research topics that are related to this work: architectures for fast, scalable, and effective large scale information retrieval; single query performance; IR versus database systems; IR versus the web; caching; and collection selection.
2.1
SCALABLE IR ARCHITECTURES
In this section, we discuss architectures for parallel and distributed IR systems. Our research combines and extends previous work in distributed IR (Burkowski, 1990; Harman et al., 1991; Couvreur et al., 1994; Burkowski et al., 1995; Cahoon and McKinley, 1996; Hawking, 1997; Hawking et al., 1998; Cahoon et al., 1999), since we model and analyze a complete system architecture with replicas, replica selection, and collection selection under a variety of workloads and conditions. We base our distributed system on INQUERY (Callan et al., 1992; Turtle, 1991), a proven, effective retrieval engine. We also model architectures with very large text collections: up to 1 terabyte of data on 8 INQUERY servers. Much of the prior work on distributed IR architectures has been restricted to small collections, typically less than 1 GB and/or 16 servers. Participants in the TREC Very Large Collection track use collections up to 100 GB, but they only provide query processing times for a single query at a time (Hawking et al., 1998). It is clear that some industrial sites use collections larger than what we simulate, but they choose not to report on them in the literature to maintain their competitive edge. Harman et al., 1991, show the feasibility of a distributed IR system by developing a prototype architecture and performing user testing to demonstrate usefulness. Unlike our research, which emphasizes performance, Harman et al., 1991, do not study efficiency issues, and they use a small text collection (i.e., less than 1 GB). Burkowski, 1990, and Burkowski et al., 1995, report on a simulation study which measures the retrieval performance of a distributed IR system. The experiments explore two strategies for distributing a fixed workload across a small number of servers. The first equally distributes the text collection among all the servers. The second splits servers into two groups, one group for query evaluation and one group for document retrieval. They assume a worst-case workload where each user broadcasts queries to all servers without any
think time. We experiment with larger configurations, and consider collection selection and replicas with replica selection. Couvreur et al., 1994, analyze the performance and cost factors of searching large text collections on parallel systems. They use simulation models to investigate three different hardware architectures and search algorithms: a mainframe system using an inverted-list IR system, a collection of RISC processors using a superimposed IR system, and a special-purpose machine architecture that uses a direct search. The focus of the work is on analyzing the tradeoff between performance and cost. Their results show that the mainframe configuration is the most cost effective. They also suggest that using an inverted-list algorithm on a network of workstations would be beneficial, but they are concerned about the complexity. In their work, each query is evaluated against all collections. Hawking, 1997, designs and implements a parallel information retrieval system, called PADRE97, on a collection of workstations. The basic architecture of PADRE97, which is similar to ours, contains a central process that checks for user commands and broadcasts them to the IR engines on each of the workstations. The central process also merges results before sending a final result back to the user. Hawking presents results for a single workstation and a cluster of workstations using a single 51-term query. A large-scale experiment evaluates query processing on a system with up to 64 workstations, each containing a 10.2 GB collection. The experiment uses four shorter queries of 4 to 16 terms. That work focuses on the speedup of a single query, while our work evaluates performance for a loaded system under a variety of workloads and collection configurations. Cahoon and McKinley, 1996, and Cahoon et al., 1999, report a simulation study of a distributed information retrieval system based on INQUERY. They assume the collections are uniformly distributed, and experiment with collections up to 128 GB using a variety of workloads. They measure performance as a function of system parameters such as client command rate, number of document collections, terms per query, query term frequency, number of answers returned, and command mixture. They demonstrate system organizations for which response time degrades gracefully as the workload increases and performance scales with the number of processors under some realistic workloads. Our work extends theirs by adding replicas to the system, and by considering three different collection configurations as well as query locality.
2.2
HOW TO SEARCH LARGE COLLECTIONS
The TREC conference recently added the Very Large Collection track for evaluating the performance of IR systems on large text collections (Hawking and Thistlewaite, 1997; Hawking et al., 1998). To handle the large collections,
participants use shared-memory multiprocessors and/or distributed architectures. The experiments in TREC-7 use 49 long queries on a 100 GB collection of web documents. The Very Large Collection track summary (Hawking et al., 1998) presents precision and query processing time, but does not provide significant details about each system. The experiments report response times for a single query at a time, rather than for a variety of workloads as we do. None of the systems report results for caching or searchable replicas. Two of the participants present details of their distributed systems elsewhere, but neither provides significant performance evaluations (Burkowski et al., 1995; Brown and Chong, 1997).
2.3
DATABASE VERSUS IR ARCHITECTURES
There is also a large volume of work on architectures for distributed and parallel database systems, including research on performance (e.g., Stonebraker et al., 1983; DeWitt et al., 1986; Mackert and Lohman, 1986; Hagmann and Ferrari, 1986; DeWitt and Gray, 1992; Bell and Grimson, 1992). Although the fields of information retrieval and databases are similar, there are several distinctions which make studying the performance of IR systems unique. A major difference between database systems and information retrieval systems is structured versus unstructured data. With structured data, queries test set membership; with unstructured data, we measure the similarity of queries to documents. The unstructured nature of the data in IR raises questions about how to create large, efficient architectures. Our work attempts to discover some of the factors that affect performance when searching and retrieving unstructured data. Furthermore, the types of common operations in database and IR systems are slightly different. For example, the basic query commands in an IR system test document similarity, and thus differ from queries in a database system, which test set membership. In our IR system, we are not concerned with updates (commit protocols) and concurrency control, which are important issues in distributed database systems. We assume our IR system performs updates offline.
2.4
WEB VERSUS IR ARCHITECTURES
Although commercial information retrieval systems, such as the web search engines AltaVista and Infoseek, exploit parallelism, parallel computers, caching, and other optimizations to support their services, they have not published their hardware and software configurations, which makes comparisons difficult. There are several important differences between the IR technology we discuss here and the web's implementation. We consider a more static collection of unstructured text on a local area network, such as a collection of case law or journal articles. The web, by contrast, has more structured text on a wide area
network whose content is very dynamic. On the web, most caches are built for specific documents, not for querying against as we do here (see Wang, 1999, for a survey of web document caching). Web document caches test set membership to determine if the cache has a requested document, and use a variety of policies for deciding which documents to cache. The search engines do cache queries, but they do so on the server side, rather than elsewhere in the network. Researchers have also used document replication techniques to solve the problem of scale in the web (Katz et al., 1994; Bestavros, 1995; Baentsch et al., 1996). Katz et al., 1994, report a prototype of a scalable web server. They treat several identically configured http servers as a cluster, and use the DNS (Domain Name System) service to distribute http requests across the cluster in a round-robin fashion. Bestavros, 1995, proposes a hierarchical demand-based replication strategy that optimally disseminates information from its producer to servers that are closer to its consumers on the web. The level of dissemination depends on the popularity of that document (relative to other documents in the system) and the expected reduction in traffic that results from its dissemination. Baentsch et al., 1996, implement a replication system called CgR/WLIS (Caching goes Replication/Web Location and Information Service). As the name suggests, CgR/WLIS turns web document caches into replicated servers as needed. In addition, the primary servers forward the data to their replicated servers. A name service, WLIS, is used to manage and resolve different copies of data. Although we also organize replicas as a hierarchy, our work is different from those above, because our system is a retrieval system that supports queries, while their servers contain web documents and support only document fetching.
2.5
CACHING
Caching in distributed IR systems has a long research history (Simpson and Alonso, 1987; Martin et al., 1990; Martin and Russell, 1991; Tomasic and Garcia-Molina, 1992). The client caches data so that operations are not repeatedly sent to the remote server; instead, the client performs frequent operations locally. Caching is most beneficial for systems that are distributed over networks, that evaluate queries slowly, or in which query locality is high. Only Markatos, 1999, caches web queries and their results. This cache, however, requires exact match, whereas we increase locality by determining query similarity to replicas. Markatos, 1999, deals with the dynamic nature of the web by caching for a short period of time. He argues that many web search engines update their databases as infrequently as once a month, and thus caching for a day or less should not affect precision by much. Markatos analyzes a trace from the Excite search engine and uses trace-driven simulations to compare
several cache replacement policies. He shows that medium-sized caches (a few hundred megabytes) can achieve a hit ratio of around 20%, and that effective cache replacement policies should take into account both recency and frequency of access in their replacement decisions. As far as we can find, there exist no other distributed caches for web search engines. Our work is different from caching because we use searchable replicas and a replica selector that selects a partial replica based on content and load, rather than the simple membership test of caching. Compared with caching, selection based on content increases observed locality, and is thus able to offload more work from the servers that process the original collections.
2.6
COLLECTION SELECTION
A number of researchers have been working on how to select the most relevant collections for a given query (Callan et al., 1995; Chakravarthy and Haase, 1995; Danzig et al., 1991; Gravano et al., 1994; Voorhees et al., 1995; Fuhr, 1999; Xu and Croft, 1999). Only our previous work (Lu and McKinley, 1999a; Lu, 1999) considered partial replica selection based on relevance. None of the previous work addresses system configuration based on collection access skew and organization. Danzig et al., 1991, use a hierarchy of brokers to maintain indices for document abstracts as a representation of the contents of primary collections. They support Boolean keyword matching to locate the primary collections. If users' queries do not use keywords in the brokers, they have difficulty finding the right primary collections. Our approach is thus more general. Voorhees et al., 1995, exploit similarity between a new query and relevance judgments for previous queries to compute the number of documents to retrieve from each collection. Netserf extracts structured, disambiguated representations from the queries and matches these query representations to hand-coded representations (Chakravarthy and Haase, 1995). Voorhees et al., 1995, and Netserf require manual intervention, which limits them to relatively static and small collections. Callan et al., 1995, adapt the document inference network to ranking collections by replacing the document node with the collection node. This system is called CORI. CORI stores the collection ranking inference network with document frequencies and term frequencies for each term in each collection. Experiments using CORI with the INQUERY retrieval system and the 3 GB TREC 1+2+3 collection, which is basically organized by source, show that this method can select the top 50% of subcollections and attain similar effectiveness to searching all subcollections. GLOSS uses document frequency information for each collection to estimate whether, and how many, potentially relevant documents are in a collection (Gravano et al., 1994; Gravano and Garcia-Molina, 1995).
The approach is easily applied to large numbers of collections, since it stores only document frequency and total weight information for each term in each collection. French et al., 1999, compare GLOSS with CORI and demonstrate that CORI consistently returns better results while searching fewer collections. GLOSS and CORI both assume documents and collections are distinct and do not select between overlapping collections. Fuhr proposes a decision-theoretic approach to solve the collection selection problem (Fuhr, 1999). He makes decisions using the expected recall-precision curve, which yields the expected number of relevant documents, and uses cost factors for query processing and document delivery. He does not report on effectiveness. Xu and Croft propose cluster-based language models for collection selection (Xu and Croft, 1999). They first apply clustering algorithms to organize documents into collections based on topics, and then apply the approach of Callan et al., 1995, to select the most relevant collections. They find that selecting the top 10% of topic collections can achieve retrieval accuracy comparable to searching all collections. Our previous work on partial replica selection modifies the collection inference network model of Callan et al., 1995, to rank partial replicas and the original collections, proposes a new algorithm for replica selection, and shows that it is effective and improves performance. Our results demonstrate that our new replica selection function can direct more than 85% of replicated queries to a relevant partial replica rather than the original collection, and thus always achieves the same precision, usually in much less time. For unreplicated queries (those that were not used to build the replica), it achieves a precision loss within 8.7% and 14.2% when retrieving the top 30 documents, when the sizes of the replicas range from 2% to 10% of a 2 GB collection and from 0.2% to 1% of a 20 GB collection, respectively. This work instead focuses on comparing the performance of collection selection with partitioning and replicas for different types of collection organizations and user access skew. We suggest configurations of the IR system based on these characteristics to yield high precision. Our results for these configurations show that systems can be organized to yield both high performance and high precision.
3
SYSTEM ARCHITECTURES
This section describes architectures for a distributed information retrieval system based on INQUERY (Callan et al., 1992; Turtle, 1991), as shown in Figure 7.1. The distributed system enables multiple clients to simultaneously access multiple collections over a network. As shown in Figure 7.1(a), the basic
components of the system are a set of clients, a connection broker, and a set of INQUERY servers for storing indexed collections, all connected by a local area network. In this architecture, the connection broker sends every user query to every collection, collects all the responses, and then forwards the result back to the user. To improve system performance and eliminate bottlenecks, we add partial replicas of collections, replica selectors, and collection selectors. Figures 7.1(b) and (c) illustrate two configurations we examine. In Figure 7.1(b), we add a collection selector in the connection broker to select the most relevant collections on a query-by-query basis and restrict the search to the selected collections. In Figure 7.1(c), we build partial replicas of the original collections and add a replica selector in the connection broker to direct as many queries as possible to relevant partial replicas. For the queries the replica selector sends to the original collections, we use a collection selector to restrict the search to the most relevant collections. In the remainder of this section, we describe the functionality of each component and their interactions in more detail.
3.1
CLIENTS
The clients are lightweight processes that provide a user interface to the retrieval system. Clients interact with the distributed IR system by connecting to the connection broker. The clients initiate all work in the system, but perform very little computation. They issue IR commands and then wait for responses. The clients issue the following basic IR commands in our system: natural language query, summary, and document retrieval commands. Our experiments include a mixture of all these commands, but here we only report on query response time (a user query and its summary response). Elsewhere we have shown that document response time has similar performance trends (Lu, 1999).
3.2
COLLECTIONS AND INQUERY SERVERS
A collection is a set of documents the retrieval system is working on. A large collection could contain millions or even billions of documents. For simplicity, we assume there is no overlap between documents in any two collections. We assume that collections are organized by topic, by source, or randomly. Automatically creating a collection suite organized by topic is the focus of document clustering research (Jain and Dubes, 1988; Papka and Allan, 1998). With significant human intervention, existing collections, such as portal web pages, are organized by topic. Topic organization usually requires the system builder to pre-categorize incoming documents, which may be very expensive (Xu and Croft, 1999). Most collections are organized by source, for example, newspapers, journals, and other periodicals. A truly random collection is actually less likely to occur
by accident, since humans like to impose structure. Each of these organizations has an impact on performance and collection access properties, and we elaborate on these points in Section 4.

Figure 7.1 Architectures for distributed information retrieval.

We use the INQUERY retrieval engine version 3.1 as our testbed to index collections and provide IR services such as evaluating queries, obtaining summary information, and retrieving documents (Callan et al., 1992; Turtle, 1991). We refer to the server as the INQUERY server. The INQUERY server accepts a command from the connection broker, processes the command, and returns its result back to the connection broker.
3.3
CONNECTION BROKER
Clients and INQUERY servers communicate via the connection broker. The connection broker is a process that keeps track of all registered clients and INQUERY servers. A client sends a command to the connection broker which forwards it to the appropriate INQUERY servers. The connection broker maintains intermediate results for commands that involve multiple INQUERY servers. When an INQUERY server returns a result, the connection broker merges it with other results. After all INQUERY servers involved in a command return results, the connection broker sends the final result to the client. Besides keeping track of all clients and INQUERY servers, we may also enhance the connection broker to perform collection selection or replica selection.
3.4
COLLECTION SELECTOR
The collection selector automatically chooses the most relevant collections from some set of collections on a query-by-query basis (Callan et al., 1995; French et al., 1999). It maintains a collection selection database with collection-level information for each collection. When the collection selector receives a query, it searches the collection selection database, computes a ranked list of the most relevant collection identifiers, and sends the query to the top-ranked collections. The connection broker further uses the ranking to weight document scores to produce overall document rankings, which it returns to the users (Callan et al., 1995; French et al., 1999).
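As an illustration of this weighting step, a minimal sketch follows; the multiplicative weighting and the data layout are simplifications for exposition, not the exact computation performed by the connection broker.

    def merge_results(collection_scores, per_collection_docs, top_k=30):
        """Weight each document score by its collection's rank score and
        merge everything into a single overall ranking."""
        merged = []
        for coll, docs in per_collection_docs.items():   # docs: (doc_id, score)
            w = collection_scores[coll]
            merged.extend((w * score, doc_id) for doc_id, score in docs)
        merged.sort(reverse=True)
        return merged[:top_k]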
3.5
REPLICA SELECTOR
If the same or related queries repeat, we replicate a portion of the original collections to improve performance. In Figure 7.1(c), we view all original collections as a whole and build a partial replica of the whole. In this study, we use a single replica; in previous work, we found a hierarchy of replicas further improved performance (Lu and McKinley, 1999b). The replica selector directs as many queries as possible to a partial replica based on both relevance and load (Lu and McKinley, 1999a). As opposed to collection selection, which
ranks disjoint collections, replica selection ranks partial replicas against the original collections; the partial replicas are a subset of the original collection. A replica selector maintains a replica selection database and load information for each server. For each query command, the replica selector searches this database and returns either a replica identifier, if there is a relevant replica, or a collection identifier. When the replica selector returns a replica identifier, it sends the query to the replica if the replica is not overloaded; otherwise it uses collection selection as specified above.
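Putting the two selectors together, the routing logic of Figure 7.1(c) might look like the following sketch; the selector interfaces and the overloaded() test are assumed names for illustration, not the system's actual API.

    def route_query(query, replica_selector, collection_selector, servers):
        """Send a query to a relevant, unloaded partial replica when
        possible; otherwise fall back to collection selection over the
        original collections."""
        replica = replica_selector.select(query)   # None if nothing relevant
        if replica is not None and not servers[replica].overloaded():
            return [replica]                       # search the replica only
        # No relevant replica, or it is overloaded: search the top-ranked
        # original collections instead.
        return collection_selector.top_collections(query)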
4
CONFIGURATION WITH RESPECT TO COLLECTION ORGANIZATION, COLLECTION ACCESS SKEW, AND QUERY LOCALITY
Collection selection and partial collection replication with replica selection enable us to search a small percentage of the total data to improve performance. Searching a small percentage of the data may, however, degrade retrieval accuracy, since relevant documents may be missed. We only want to apply these techniques when they will maintain retrieval accuracy. Collection characteristics and query locality directly affect whether each technique is applicable and how to use it. We first discuss collection selection with respect to collection organization and access skew without query locality, and then discuss replication and collection selection with query locality.
4.1
COLLECTION ORGANIZATION AND COLLECTION ACCESS SKEW
In this section, we discuss the relationships between collection access skew and a random, source, or topic collection organization. Collection access skew occurs when queries are relevant to a few collections and collection selection thus concentrates queries in those collections. The more skewed the access pattern, the more demand a few collections receive. Under a uniform access distribution, collection selection distributes commands uniformly over the collections. Since we do not have logs of collection access patterns, we model collection access skew using a Zipf-like function as follows: to select W collections to search at a time, we use W trials. In each trial, we choose a collection from the collections that have not yet been chosen, according to the distribution function

Z(i) = c / i^(1-θ), where c = 1 / Σ_{j=1}^{C} (1/j^(1-θ)) normalizes the distribution, C is the number of collections, and 1 ≤ i ≤ C.

As θ varies from 1 to 0, the probabilities vary from a uniform distribution to a pure Zipf distribution. A sketch of this sampling procedure appears at the end of this section.

Topic collections improve the ability of the collection selector to find relevant collections for a given query, because most relevant documents will concentrate in a few collections. Previous research finds that selecting the top 10% of topic collections can achieve retrieval accuracy comparable to searching all collections (Xu and Croft, 1999). However, a topic collection organization and collection selection alone will exacerbate any collection skew; given a few "hot" topics, collection selection will send queries on these topics to a few collections and thus overload those servers. (Our experiments in Section 6 confirm this intuition.) This situation can occur even without query locality if, for example, an IR server supports related services such as case law collections and the statutes for each state, but most of its users come from one or two states and are interested in case law, though rarely on the same topic. If we have prior knowledge of query patterns or the ability to redistribute data, we may distribute the most popular topics over different collections or servers, which may result in a more uniform access pattern and correspondingly larger performance improvements. When collections are organized randomly, relevant documents for a given query are scattered over the collections. Selecting p% of the collections to search means we will miss (100-p)% of the relevant documents, since collection selection will choose each collection with the same probability. Thus, the effectiveness of collection selection will suffer considerably. Organizing collections by data source falls between random and topic organization, since some sources may cover a given topic more thoroughly than others. In such a configuration, previous research on collection selection finds that selecting the top 50% of collections can achieve retrieval accuracy comparable to searching all collections (Callan et al., 1995).
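For reference, the access-skew model above can be sampled as in the sketch below: W collections are drawn without replacement, each trial weighted by Z(i) over the collections not yet chosen. The function name is ours.

    import random

    def zipf_like_selection(n_collections, w, theta, rng=random):
        """Choose w of n collections with weights Z(i) proportional to
        1 / i**(1 - theta); theta = 1 gives a uniform distribution and
        theta = 0 a pure Zipf distribution."""
        remaining = list(range(1, n_collections + 1))
        chosen = []
        for _ in range(w):
            weights = [1.0 / i ** (1.0 - theta) for i in remaining]
            pick = rng.choices(remaining, weights=weights, k=1)[0]
            chosen.append(pick)
            remaining.remove(pick)
        return chosen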
4.2
QUERY LOCALITY
If users repeatedly issue queries on the same topics, a set of documents will receive more hits, which results in query locality. Partial replication takes advantage of this characteristic to off-load the services on the original collections. The percentage of user requests that partial replicas may serve (D%) determines the performance improvement due to partial replication. Our previous research shows that we can maintain effectiveness with an automatic mechanism that directs queries to either relevant replicas or the original collections (Lu and McKinley, 1999a). There is a correlation between collection access skew and query locality. If query locality is low, collections should tend to be accessed uniformly. If query locality is high, collection access may range from uniform to highly skewed, because popular topics could be scattered over collections or concentrated in a few of them. With locality and topic collections, access skew should be high; with random collections, collection access should be uniform; and with
source collections, access skew will fall somewhere in between. If collection access is highly skewed, query locality should tend to be high also, but is not necessarily so.

Table 7.1 Configuring Effective Replicas and Partitions
4.3
CONFIGURATION
In Table 7.1, we summarize the above discussion. Given some original set of collections, where we assume each collection resides on its own server, we can add an additional server and either further partition the original set of collections onto it or build a partial replica on it. With partitioning, we can use collection selection. With replication, we direct queries to the relevant replica and use collection selection to search the original set of collections for the queries that the replica cannot serve. For collection selection, if we have a topic organization, we assume the collection selector selects the top 25% of the collections; because we assume larger and fewer collections, we use a higher percentage than Xu and Croft, 1999. If we have a source organization, the collection selector searches the top 50%, as found by Callan et al., 1995. With a random organization, we cannot use collection selection. If locality exists, we replicate the retrieved documents for the most frequent queries and their associated index on the additional server, vary the percentage of queries that the replica selector directs to the replica, and perform collection selection at the same percentages as above on the remaining queries. Without locality, we also consider a replica of just the hottest server and load balance with it. Depending on the collection access skew to this server (which is not represented in the table), this configuration may perform as well as or better than partitioning.
5
SIMULATION MODEL
To expedite our investigation of possible system configurations, characteristics of IR collections, and system performance for the distributed information retrieval architectures illustrated in Section 3, we implement a simulator with numerous system parameters and validate it against a prototype implementation (Lu, 1999). In our simulator, we model collections and queries by obtaining statistics from test collections and real query sets. We model query processing and document retrieval by measuring resource usage for each operation in our multithreaded prototype. The hardware we model includes CPUs, disks, the I/O bus, memory, and the network. Our simulator has two types of parameters: system configuration and system measurement parameters. The system configuration parameters include the number of threads, number of disks, number of CPUs, collection size, term frequency distribution, query length, command arrival rate, command mixture ratio, replication percentage, distracting percentage, and selection percentage. The system measurement parameters include query evaluation time, document retrieval time, network time, connection broker time, and time to merge results. The system configuration values determine the system configuration and data layout on the disks, and change with each simulation scenario or group of scenarios. The system measurements change based on the IR system and the underlying hardware platform. For example, in a system with replication, we model the time to select a replica and determine its load. The outputs of the simulator include response times for each IR command, the utilization of each hardware resource, such as the CPU and disk, and the utilization of each IR software system component, such as the INQUERY server and the connection broker. We use YACSIM, a process-oriented discrete-event simulation language, to implement the simulator (Jump, 1993). YACSIM contains a set of data structures and library routines that manage user-created processes and resources as a set of processes. Each process simulates the activity of a thread in the real system by requesting services from resources. For the experiments in this paper, we model the IR system by analyzing a prototype of a distributed multithreaded information retrieval system based on INQUERY and by measuring the resource usage for each operation. We measure resource usage for query evaluation, document/summary retrieval, result merging, connection brokering, and network transmission. We examine TREC collections (up to 20 GB) to obtain system measurements. We obtained the measurements using a client-server IR system based on INQUERY version 3.1 running on a DEC Alpha Server 2100 5/250 with 3 CPUs (clocked at 250 MHz), 1024 MB main memory and 2007 MB of swap space, running Digital UNIX V3.2D-1 (Rev 41).
Our simulation model is simple, yet contains sufficient details to accurately represent the important features of the system, as we have demonstrated through performance validation elsewhere (Lu, 1999).
6
EXPERIMENTS
In this section, we present experiments that demonstrate the performance impact of collection organization and query locality. We use configurations with collection and replica selection as appropriate, with their parameters tuned so that performance improvements do not come at the expense of effectiveness, as discussed in Section 4. As our example, we study 256 GB of data using 9 servers, each of which has four 250 MHz CPUs and 8 disks and can store up to 32 GB of data with its associated indexes. The system architecture is illustrated in Figures 7.1(b) and 7.1(c). We use 8 servers to store the original collections, and the 9th server either to store a 32 GB partial replica or to partition the data further. We view the data on each server as a collection. We include a collection selector and a replica selector in the connection broker, as appropriate. If there is a relevant partial replica, we search that replica instead of the original collections; otherwise we search the original collections with collection selection. We also experiment with faster servers and larger data sizes. We vary the command arrival rate, which indicates light or heavy workloads; the collection access skew, which represents the probability that each collection contains relevant documents for a given query; the selection percentage, which indicates the percentage of collections the collection selector chooses, as appropriate to the collection organization; and the distracting percentage, which indicates the percentage of commands that the replica selector directs to a replica. Table 7.2 presents these parameters, their abbreviations, and the values we use in the experiments of this section. We model command arrival as a Poisson process. In practice, users tend to issue short queries; we use an average of 2 terms per query, and issue query, summary, and document commands with a ratio of 1:1.5:2, as we found in the Thomas logs (see Lu, 1999). We model collection access skew using uniform, Zipf, and Zipf-like functions (see Section 4.1).
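A synthetic workload with these parameters could be generated as in the following sketch; the fixed 2-term queries and the placeholder vocabulary are simplifications of the chapter's shifted negative binomial query-length model, and the function name is ours.

    import random

    def generate_workload(duration_s, arrival_rate, vocabulary, rng=random):
        """Poisson arrivals at arrival_rate commands/sec with the 1:1.5:2
        query:summary:document mixture used in the experiments."""
        kinds = ["query"] * 2 + ["summary"] * 3 + ["document"] * 4  # 1 : 1.5 : 2
        t, events = 0.0, []
        while True:
            t += rng.expovariate(arrival_rate)   # exponential inter-arrivals
            if t >= duration_s:
                break
            kind = rng.choice(kinds)
            terms = rng.sample(vocabulary, 2) if kind == "query" else None
            events.append((t, kind, terms))
        return events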
6.1
COLLECTION ORGANIZATION AND QUERY LOCALITY
For each collection organization, we present two graphs which plot query response time (1) as a function of the distracting percentage (i.e., the percentage of commands that may be satisfied by the replica) when commands arrive at the rate of 10 or 20 per second, and (2) as a function of the command arrival rate. In each graph, we use system organizations appropriate to the collection organization, and vary the collection distribution skew when using collection selection.

Table 7.2 Configuration parameters

Parameter                                             Abbrev.   Values
Command Arrival Rate, Poisson dist. (avg. cmds/sec)   λ         0.1, 2, 4, 6, 8, 12, 14, 16, 18, 20, 25, 30, 35, 40
Command Mixture Ratio (query:summary:document)        Rcm       1:1.5:2
Terms per Query (avg.), shifted neg. binomial dist.   Ntpq      2
Query Term Frequency                                  Dqtf      observed dist. from queries
Number of CPUs per server                             Ncpu      4
Number of Disks per server                            Ndsk      8
Number of Threads per server                          Nth       32
Collection Size                                       Csize     256 GB, 1 TB
Collection Access Skew, Zipf-like function            θ         0.0 (Zipf), 0.3, 1.0 (uniform)
Selection Percentage                                  Psel      25%, 50%, 75%, 100%
Distracting Percentage                                D%        10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%
Speedup                                               Sp        1, 4
Random organization When we randomly partition data over collections, we cannot use the collection selection mechanism, since it would cause significant precision losses. If there is query locality, we can of course use partial replication to hold the documents for the most frequently issued queries. Figure 7.2 compares the performance of simple partitioning versus partial replication given a random collection organization. Figure 7.2(a) illustrates average response time versus distracting percentage when commands arrive at 10 per second. Even when query locality is low (10%), partial replication significantly outperforms partitioning, by over 10%, and with higher locality improves performance by up to 80%. This improvement occurs when the replica satisfies a query, since the system then gets results from 1 server rather than issuing commands to, and coordinating responses from, 8 servers. We also consider using the additional server to replicate the contents of a single, hot server (instead of the hottest documents from all original servers), and load balancing between these two servers. For a random collection organization, this configuration does not bring significant performance improvement compared with 8 servers, because there
is no hot server in the configuration of 8 servers, which makes the replicated server almost useless.

Figure 7.2 Random collection organization; (b) varies the command arrival rate λ (requests per second).

Figure 7.2(b) illustrates average query response time versus command arrival rate for several interesting configurations. The results demonstrate significant performance improvement from adding partial replication. Compared with partitioning the data over all 9 servers, when a replica distracts 10% and 40% of commands, the largest command arrival rate with an average response time under 10 seconds improves by a factor of 1.1 and 1.7, respectively.
Source Organization
When collections are organized by source, we can use collection selection on the collections partitioned over the 9 servers, or, if we have locality, we can replicate 12.5% of the collection on the 9th server. In both cases, we enable collection selection to choose the top 4 servers (near 50%), since that should not degrade effectiveness. Regardless of the query locality, the collection access skew may vary under the source organization, as we discussed in Section 4.1. We experiment with uniform, pure Zipf, and 0.3 Zipf distributions. We also experiment with a configuration that uses the 9th server to replicate the hottest server under the pure Zipf collection access skew. Figure 7.3 illustrates average response times for these configurations. Figure 7.3(a) illustrates average query response time for an arrival rate of λ = 10 commands per second. Notice that without a replica there is a performance difference of almost a factor of 4 between collection selection with a uniform distribution and that with a Zipf distribution. (0.3 Zipf falls between the two.) The performance of collection selection is very sensitive to this distribution. Because it must select the top 50% of collections to maintain effectiveness with a source distribution, collection access skew will tend to overwhelm the popular servers. Figure 7.3(a) also illustrates that with even modest query locality, replication consistently and significantly improves performance over collection selection and partitioning. When collection access is uniform and collection selection experiences its best performance, the improvement due to replication is still above 20%. Also observe that the degree of locality between 0 and 40% has a dramatic effect on performance, and these increases demonstrate that replication may outperform caching. Figure 7.3(b) demonstrates that these results continue to hold for higher workloads. Compared to only using collection selection, when a replica can distract 20% to 40% of commands, it improves the largest command arrival rate with an average response time under 10 seconds by a factor between 1.15 and 1.7. Compared to partitioning and collection selection when collection access skew is a pure Zipf function, replicating the hottest server improves the largest command arrival rate with an average response time under 10 seconds by a factor of around 1.1.
[Figure 7.3 Collections are organized based on sources (selecting the top 4 servers); (b) varying λ (command arrival rate). Axis: Command Arrival Rate (requests per second).]
[Figure 7.4 Collections are organized based on topics (selecting the top 2 servers); (b) varying λ (command arrival rate). Axis: Command Arrival Rate (requests per second).]
Topic Organization
When collections are organized by topic, collection selection can maintain accuracy while confining searches to the top 2 servers (near 25% of collections). Because the load on all the servers is minimized in these organizations, they show the best performance for both collection selection and replication, improving performance by a factor of 5 or more.
This reduction also enables our system to achieve query response times under 10 seconds for dramatically more commands arriving per second. Figure 7.4(a) plots λ = 20 per second and (b) plots λ = 0 to 40, rather than λ = 10 and λ = 0 to 20 as in the previous two sets of experiments. The topic organization clearly attains the best performance but, as we pointed out earlier, is the most difficult for the system builder to create. Because fewer servers are involved, each command requires less work and coordination. The trends in relative performance between partitioning with collection selection and replication that we found with a source organization continue to hold here, but we are able to support much higher workloads when we can restrict searches to only 25% of all collections. Compared to collection selection with a pure Zipf collection access skew, replicating the hottest server improves the largest command arrival rate with an average response time under 10 seconds by a factor of around 1.3, which is better than when collections are organized based on sources, since the difference between the workloads on the hottest server and the second hottest server is larger in the topic organization. Again, for all organizations, the biggest factor affecting performance is the collection access skew, which may be managed if high skew also corresponds to significant query locality.
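The Zipf-like skew itself is easy to sketch. The parameterization below, p(i) proportional to 1/i^(1-θ) with θ = 0.0 for pure Zipf and θ = 1.0 for uniform, is our assumption based on the values in Table 7.2; the authors' exact function is defined in their Section 4.1, which is outside this excerpt.

    # Sketch of a Zipf-like collection access skew. We assume the common
    # form p(i) ~ 1 / i**(1 - theta): theta = 0.0 gives a pure Zipf
    # distribution and theta = 1.0 gives a uniform one. The authors' exact
    # definition may differ in detail.

    def access_probabilities(n_collections, theta):
        raw = [1.0 / i ** (1.0 - theta) for i in range(1, n_collections + 1)]
        total = sum(raw)
        return [r / total for r in raw]

    for theta in (0.0, 0.3, 1.0):
        p = access_probabilities(8, theta)
        print(f"theta={theta:.1f}  hottest={p[0]:.3f}  coldest={p[-1]:.3f}")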
6.2 SENSITIVITY STUDY
This section investigates the sensitivity of our results to larger data sizes and faster servers.
Larger Data Size
In Figure 7.5, we present a set of experiments for 1 terabyte of text. We use 8 servers to store the original collections, each of which stores 128 GB of data. We still view the data on one server as one collection. We use an additional server either to partition the data further or to store 4 copies of a 32 GB replica, since our estimate based on an Excite log shows that a 32 GB replica may hold enough data to satisfy 30% to 40% of queries (Lu, 1999). As shown in Figure 7.5, compared to only using collection selection, using a replica to distract 20% and 40% of commands improves the largest command arrival rate with an average response time under 10 seconds by factors of 1.3 and 2.0, respectively, which is better than what we found in the 256 GB experiments, since we have 4 copies of the replica on the additional server. These results demonstrate that the results from the previous experiments still hold for a larger data size.
[Figure 7.5 Performance for 1 terabyte; (b) topic organization (selecting the top 2 servers). Axis: Command Arrival Rate (requests per second).]
Faster Servers
In Figure 7.6, we present another set of experiments for 1 terabyte of text using faster servers. We still assume 8 servers to handle the original collections, plus an additional server. Each server handles 128 GB of data. In this set of experiments, we assume servers are 4 times as fast. Figure 7.6 demonstrates that the previous results continue to hold for faster servers. Compared to only using collection selection, using a replica to distract between 20% and 40% of commands improves the largest command arrival rate with an average response time under 10 seconds by factors of 1.3 and 2.1. Of course, the faster servers support higher workloads.
7 CONCLUSIONS
This work explores the effect of query locality and collection organization on the design and performance of IR systems. We propose system organizations and functionality that should achieve precision comparable with searching the entire collection. We then compare the performance of these organizations, contrasting collection selection and partial replication under different collection organizations and workloads. We also demonstrate the sensitivity of our results when the collection size increases and the hardware changes. Collection selection improves performance significantly when either collection access is fairly uniform or collections are organized based on topics. In any configuration that uses collection selection, the collection access skew determines the performance to a large degree, and to a degree comparable with locality. Query locality always enables partial replication to improve performance consistently over collection selection with partitioning. Our sensitivity study shows that although increasing the collection size and speeding up the servers changes the response time, neither changes the relative improvements due to partial replication. In the future, we expect to see research making further improvements to the precision of replica and collection selection, with corresponding performance benefits. In our local area network model, the bottlenecks are all at the servers. In the future, we would like to consider a wide area network and issues such as network contention and more significant network delays, perhaps in a complete implementation. We also hope to make more detailed comparisons between searchable replicas with replica selection and simple caching of query responses and documents. This comparison, and further characterization of web queries and document requests, may show that this work is widely applicable to the web.
[Figure 7.6 Performance with faster servers for 1 terabyte; (b) topic organization (selecting the top 2 servers). Axis: Command Arrival Rate (requests per second).]

Acknowledgments
This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, supported in part by the United States Patent and Trademark Office and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235, and also supported in part by grants from Compaq, NSF grant EIA-9726401, and NSF Infrastructure grant CDA-9502639. Kathryn S. McKinley is supported by NSF CAREER award CCR-9624209. Any opinions, findings, and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.
References

Baentsch, M., Molter, G., and Sturm, P. (1996). Introducing application-level replication and naming into today's Web. In Proceedings of the Fifth International World Wide Web Conference, Paris, France. Available at http://www5conf.inria.fr/fich-html/papers/P3/Overview.html.

Bell, D. and Grimson, J. (1992). Distributed Database Systems. Addison-Wesley Publishers.

Bestavros, A. (1995). Demand-based document dissemination to reduce traffic and balance load in distributed information systems. In Proceedings of SPDP '95: The 7th IEEE Symposium on Parallel and Distributed Processing, pages 338–345, San Antonio, Texas.

Brown, E. W. and Chong, H. A. (1997). The GURU system in TREC-6. In Proceedings of the Sixth Text REtrieval Conference (TREC-6), pages 535–540, Gaithersburg, MD.

Burkowski, F., Cormack, G., Clarke, C., and Good, R. (1995). A global search architecture. Technical Report CS-95-12, Department of Computer Science, University of Waterloo, Waterloo, Canada.

Burkowski, F. J. (1990). Retrieval performance of a distributed text database utilizing a parallel process document server. In 1990 International Symposium on Databases in Parallel and Distributed Systems, pages 71–79, Trinity College, Dublin, Ireland.

Cahoon, B. and McKinley, K. S. (1996). Performance evaluation of a distributed architecture for information retrieval. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 110–118, Zurich, Switzerland.

Cahoon, B., McKinley, K. S., and Lu, Z. (1999). Evaluating the performance of distributed architectures for information retrieval using a variety of workloads. ACM Transactions on Information Systems. To appear.

Callan, J. P., Croft, W. B., and Harding, S. M. (1992). The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert System Applications, pages 78–93, Valencia, Spain.

Callan, J. P., Lu, Z., and Croft, W. B. (1995). Searching distributed collections with inference networks. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–29, Seattle, WA.
Chakravarthy, A. and Haase, K. (1995). Netserf: Using semantic knowledge to find internet information archives. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11, Seattle, WA.

Couvreur, T. R., Benzel, R. N., Miller, S. F., Zeitler, D. N., Lee, D. L., Singhai, M., Shivaratri, N., and Wong, W. Y. P. (1994). An analysis of performance and cost factors in searching large text databases using parallel search systems. Journal of the American Society for Information Science, 45(7):443–464.

Danzig, P. B., Ahn, J., Noll, J., and Obraczka, K. (1991). Distributed indexing: A scalable mechanism for distributed information retrieval. In Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 221–229, Chicago, IL.

DeWitt, D., Graefe, G., Kumar, K. B., Gerber, R. H., Heytens, M. L., and Muralikrishna, M. (1986). GAMMA – a high performance dataflow database machine. In Proceedings of the Twelfth International Conference on Very Large Data Bases, pages 228–237, Kyoto, Japan.

DeWitt, D. and Gray, J. (1992). Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85–98.

French, J. C., Powell, A. L., Callan, J., Viles, C. L., Emmitt, T., Prey, K. J., and Mou, Y. (1999). Comparing the performance of database selection algorithms. In Proceedings of the Twenty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 238–245, Berkeley, CA.

Fuhr, N. (1999). A decision-theoretic approach to database selection in networked IR. ACM Transactions on Information Systems, 17(3):229–249.

Gravano, L. and Garcia-Molina, H. (1995). Generalizing GLOSS to vector-space databases and broker hierarchies. In Proceedings of the Twenty-First International Conference on Very Large Data Bases, pages 78–89, Zurich, Switzerland.

Gravano, L., Garcia-Molina, H., and Tomasic, A. (1994). The effectiveness of GLOSS for the text database discovery problem. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 126–137, Minneapolis, MN.

Hagmann, R. B. and Ferrari, D. (1986). Performance analysis of several backend database architectures. ACM Transactions on Database Systems, 11(1):1–26.

Harman, D., McCoy, W., Toense, R., and Candela, G. (1991). Prototyping a distributed information retrieval system that uses statistical ranking. Information Processing & Management, 27(5):449–460.

Hawking, D. (1997). Scalable text retrieval for large digital libraries. In First European Conference on Research and Advanced Technology for Digital Libraries, number 1324 in Lecture Notes in Computer Science, pages 127–145, Pisa, Italy. Springer.
Hawking, D., Craswell, N., and Thistlewaite, P. (1998). Overview of the TREC-7 very large collection track. In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 91–104, Gaithersburg, MD.

Hawking, D. and Thistlewaite, P. (1997). Overview of the TREC-6 very large collection track. In Proceedings of the Sixth Text REtrieval Conference (TREC-6), pages 93–106, Gaithersburg, MD.

Jain, A. and Dubes, R., editors (1988). Algorithms for Clustering Data. Prentice Hall.

Jump, J. R. (1993). YACSIM Reference Manual. Rice University, version 2.1.1 edition.

Katz, E., Butler, M., and McGrath, R. (1994). A scalable HTTP server: the NCSA prototype. Computer Networks and ISDN Systems, 27(2):155–164.

Lu, Z. (1999). Scalable Distributed Architectures for Information Retrieval. PhD thesis, University of Massachusetts at Amherst.

Lu, Z. and McKinley, K. S. (1999a). Partial replica selection based on relevance for information retrieval. In Proceedings of the Twenty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 97–104, Berkeley, CA.

Lu, Z. and McKinley, K. S. (1999b). Searching a terabyte of text using partial replication. Technical Report TR99-50, Department of Computer Science, University of Massachusetts at Amherst.

Mackert, L. F. and Lohman, G. M. (1986). R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the Twelfth International Conference on Very Large Data Bases, pages 149–159, Kyoto, Japan.

Markatos, E. P. (1999). On caching search engine results. Technical Report 241, Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH), Greece.

Martin, T. P., Macleod, I. A., Russell, J. I., Lesse, K., and Foster, B. (1990). A case study of caching strategies for a distributed full text retrieval system. Information Processing & Management, 26(2):227–247.

Martin, T. P. and Russell, J. I. (1991). Data caching strategies for distributed full text retrieval systems. Information Systems, 16(1):1–11.

Papka, R. and Allan, J. (1998). Document classification using multiword features. In Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM), pages 124–131, Bethesda, MD.

Simpson, P. and Alonso, R. (1987). Data caching in information retrieval systems. In Proceedings of the Tenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 296–305, New Orleans, LA.

Stonebraker, M., Woodfill, J., Ranstrom, J., Kalash, J., Arnold, K., and Anderson, E. (1983). Performance analysis of distributed data base systems. In Proceedings of the Third Symposium on Reliability in Distributed Software and Database Systems, pages 135–138, Clearwater Beach, FL.
Tomasic, A. and Garcia-Molina, H. (1992). Caching and database scaling in distributed shared-nothing information retrieval systems. Technical Report STAN-CS-92-1456, Stanford University.

Turtle, H. R. (1991). Inference Networks for Document Retrieval. PhD thesis, University of Massachusetts.

Voorhees, E. M., Gupta, N. K., and Johnson-Laird, B. (1995). Learning collection fusion strategies. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172–179, Seattle, WA.

Wang, J. (1999). A survey of web caching schemes for the internet. Computer Communication Review, 29(5):36–46.

Xu, J. and Croft, W. B. (1999). Cluster-based language models for distributed retrieval. In Proceedings of the Twenty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 254–261, Berkeley, CA.
Chapter 8

CROSS-LANGUAGE RETRIEVAL VIA TRANSITIVE TRANSLATION

Lisa A. Ballesteros
Computer Science Department
Mount Holyoke College
South Hadley, MA
lballest@mtholyoke.edu
Abstract

The growth in availability of multi-lingual data in all areas of the public and private sector is driving an increasing need for systems that facilitate access to multi-lingual resources. Cross-language Retrieval (CLR) technology is a means of addressing this need. A CLR system must overcome two main hurdles to effective cross-language retrieval. First, it must address the ambiguity that arises when trying to map the meaning of text across languages, both within-language ambiguity and cross-language ambiguity. Second, it must incorporate multi-lingual resources that enable it to perform the mapping across languages; the difficulty here is that lexical resources are limited, and for some pairs of languages virtually none exist. This work focuses on a dictionary approach to addressing the problem of limited lexical resources. A dictionary approach is taken since bilingual dictionaries are more prevalent and simpler to apply than other resources. We show that a transitive translation approach, where a third language is employed as an interlingua between the source and target languages, is a viable means of performing CLR between languages for which no bilingual dictionary is available.

1 INTRODUCTION
The rapid increase in the amount of electronic information makes providing effective and efficient access to text and objects associated with text of paramount importance. The task of an Information Retrieval (IR) system is to estimate the degree to which documents in a collection reflect the information expressed in a user query. IR techniques facilitate this access by developing
a representation of the user's information need or query, comparing it to representations of information objects, and retrieving those objects most closely matching that need. Object representations are typically based on the words or vocabulary of a language. This seems to make sense because textual information is conveyed using words. However, language is ambiguous and there are many different ways to express a given idea or concept. This presents difficulties for ascertaining two things needed to estimate how likely it is that a document will match a user's need. The first is to infer what information the writer of a document was trying to convey to the reader. For example, is a document about Macintoshes about fruit or is it about computers? The second is to infer what information need a user is trying to express with a query. In recent years, there has been tremendous growth in the amount of multi-lingual, electronic media. The current boom in economic growth has led more businesses and organizations to expand the boundaries of their organizational structure to include foreign offices and interests. Greater numbers of people are interested in data and information collected in other regions of the world as efforts to address issues of global concern lead to increased multi-national collaboration. In addition, the explosive growth of the Internet and use of the World Wide Web (WWW) makes possible access to information that was once hindered by physical boundaries. The Internet has become a powerful resource for information in areas such as electronic commerce (E-commerce), advertising and marketing, education, research, banking and finance. This growth in availability of multi-lingual data in all areas of the public and private sector is driving an increasing need for systems that facilitate access to multi-lingual resources by people with varying degrees of expertise with foreign languages. Cross-language Retrieval (CLR, also known as translingual retrieval) technology is a means of addressing this need. Cross-language retrieval aims to develop tools that, in response to a query posed in one language (e.g., Spanish), allow the retrieval of documents written in other languages (e.g., Chinese). There are several approaches one could take to solve this problem. Each of them amounts to generating a translation, or more appropriately, an approximate translation of the document into the language of the query or of the query into the language(s) of the documents being searched. Unlike a machine translation system, the goal of a CLR system is not to generate exact, syntactically correct representations of a text in other languages. It is rather to examine the tremendous number of electronic texts and to select and rank those documents that are most likely related to a query written in another language. In addition to addressing the difficulties encountered in a monolingual environment, a CLR system must address two main hurdles to effective cross-language retrieval.
First, it must find a means of addressing an additional level of ambiguity that arises when trying to map the import of a text object across languages. That is, it must address both within-language ambiguity and cross-language ambiguity. Second, it has to incorporate multi-lingual resources that will enable it to perform the mapping across languages. The difficulty here is that there is a limited number of lexical resources, and there are virtually no resources for some pairs of languages. The goal for addressing ambiguity is to sufficiently reduce its effects such that the gist of the original query is preserved, thus enabling relevant documents to be retrieved. Translation ambiguity is a result of erroneous word translations, failure to translate multi-term concepts as phrases, and failure to translate out-of-vocabulary words. Previous work by Ballesteros and Croft, 1996, and Hull and Grefenstette, 1996, showed that ambiguity greatly reduces the effectiveness of cross-language retrieval in comparison to monolingual retrieval with the same queries. This work focuses on a dictionary approach to addressing the problem of limited lexical resources. The availability of lexical resources varies and often depends upon several factors, including the commercial viability of producing them, proprietary rights, and cost. A dictionary approach is taken since bilingual dictionaries are more prevalent and simpler to apply than other resources. More specifically, we show that a transitive translation approach, where a third language is employed as an interlingua between the source and target languages, is a viable means of performing CLR between languages for which no bilingual dictionary is available. (Source refers to the language being translated from; target refers to the language being translated to.)
2 TRANSLATION RESOURCES
The goal of CLR is to compare a query with documents written in another language and to select and rank those documents most likely related to the query. To do this, a CLR system must employ lexical resources from which a mapping from the query language to the document language(s) can be derived. Current research in CLR has relied primarily on two types of resources: aligned corpora and bilingual dictionaries.
2.1 ALIGNED CORPORA
Aligned corpora are multi-lingual collections of related documents. The two types of aligned corpora are referred to as either parallel or comparable. Parallel corpora contain a set of documents written in one language and the translations of those documents into one or more other languages. For example, the UN Corpus (UN, 1999) contains transcripts of UN proceedings written in English and translated to Spanish and/or French. Parallel corpora are typically the result of large scale translation projects commissioned by a group for a particular class of documents. For this reason, parallel corpora tend to be scarce and their coverage domain specific.
Documents in a comparable corpus, rather than being translations of one another, are related by topic. Comparable texts are those written independently in different languages, but that have the same communicative function. The Swiss News Agency (SDA) reports the news in the three primary languages of Switzerland: German, French, and Italian. News stories describing the same events are written independently in each of these languages and thus comprise a comparable corpus. The idea behind the aligned corpus approach is that the words used to describe a particular topic or event will be related semantically across languages. When aligned texts are sufficiently large, statistical methods can be applied to infer the most likely translation equivalents. This approach has been taken by many CLR researchers (Sheridan and Ballerini, 1996; Sheridan et al., 1997; Davis and Dunning, 1995a; Davis and Dunning, 1995b; Landauer and Littman, 1990; Rehder et al., 1997; Picchi and Peters, 1996). Work by Sheridan and Ballerini, 1996, relies on the employment of comparable corpora. In their approach, semantic relationships between groups of words in language1 (German) and language2 (Italian) were inferred by analysis of SDA news articles aligned via language-independent subject codes and dates. The assumption is that the language1 words used to describe a particular event or topic will be related to the language2 words used to describe the same event or topic. In other words, semantic relationships between terms in different languages can be identified by studying their patterns of co-occurrence in comparable documents. The approach works in the following way. A German query is submitted and the most highly ranked German documents are returned along with their Italian counterparts. The Italian terms most frequently occurring in those aligned Italian documents are assumed to be related to the original German query terms and thus comprise an approximate translation of the original query. The Italian query is then submitted to another collection of Italian documents. Results have been promising, and the approach is most effective when the document collection is restricted to a sub-language (Sheridan et al., 1997). The Latent Semantic Indexing (LSI) approach of Littman and others (Rehder et al., 1997) employs parallel corpora. LSI (Furnas et al., 1988) is based on the vector space retrieval model, in which documents are represented as vectors of terms. It differs from the vector space model in that a reduction of the term-by-document matrix is performed via singular value decomposition. The underlying theory is that the dimensions of the resulting matrix are representative of "core" or "basic" concepts of discourse. When applied to a collection of documents and their translations, a bilingual indexing space is created.
It is assumed that the resulting dimensions represent language-independent core concepts. The approach has been effective on small collections for retrieving a document in response to its translation as the query. However, LSI's efficacy for the cross-language task has not been shown for collections of more realistic size. Despite promising results from approaches relying on these resources, it is not a trivial task to employ them. First, there are a limited number of available parallel corpora from which to choose. As mentioned above, they also tend to be domain specific and so may not be as effective for more general topics. In addition, the non-trivial task of identifying more fine-grained alignments than at the document level is necessary for generating mappings at the word level. Comparable corpora may be easier to find when one considers the availability of on-line newspaper articles and other types of text from around the world. However, the real difficulty lies in generating document alignments. In the case of the SDA collection, each article was manually assigned some number of descriptors, each describing some attribute of the article's content. Descriptors identified, for example, such features as country of origin, location, subject, or date. This is not a trivial task, and even when the descriptors have been assigned in advance, further alignments are necessary to achieve mappings that are effective in a CLR environment.
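As a toy-scale illustration of the parallel-corpus LSI idea (not the actual system of Rehder et al., 1997; the vocabulary and counts below are invented), each aligned document pair is stacked into one column over a bilingual vocabulary, the SVD yields a reduced "concept" space, and a query in either language is folded into that space:

    import numpy as np

    # Toy sketch of cross-language LSI (invented data). Each column of A is
    # an aligned English/French document pair stacked over one bilingual
    # vocabulary, so the SVD dimensions mix terms from both languages.

    vocab = ["bank", "money", "river", "banque", "argent", "riviere"]
    A = np.array([
        [2.0, 0.0, 1.0],   # bank
        [1.0, 0.0, 0.0],   # money
        [0.0, 2.0, 0.0],   # river
        [2.0, 0.0, 1.0],   # banque
        [1.0, 0.0, 0.0],   # argent
        [0.0, 2.0, 0.0],   # riviere
    ])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                     # retained "concept" dimensions
    Uk, sk, docs_k = U[:, :k], s[:k], Vt[:k].T

    def fold_in(terms):
        """Project a bag of words (either language) into the concept space."""
        q = np.array([float(terms.count(w)) for w in vocab])
        return (q @ Uk) / sk

    q = fold_in(["argent", "banque"])         # a French query
    sims = docs_k @ q / (np.linalg.norm(docs_k, axis=1)
                         * np.linalg.norm(q) + 1e-12)
    print(sims)   # documents about money/banking score highest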
2.2 DICTIONARIES
Machine Readable Dictionary (MRD) translation has been the starting point for other researchers (Ballesteros and Croft, 1996; Ballesteros and Croft, 1997; Ballesteros and Croft, 1998; Hull and Grefenstette, 1996; Pirkola, 1998). Automatic MRD translation leads to a drop in effectiveness of 40-60% below that of monolingual retrieval. This is due primarily to translation ambiguity. Statistical techniques that are discussed in Section 4 can significantly reduce the effects of ambiguity and bring the effectiveness of cross-language retrieval near the level of monolingual retrieval. Machine readable dictionaries often require some degree of preprocessing before they can be applied to the CLR problem. This is primarily because dictionaries are designed for use by humans. Dictionary mark-up identifies information such as head-words, part-of-speech, and word usage, but may be inconsistent. These inconsistencies are easily differentiated by a person, but they can make computer analysis a challenging task. One can typically find bilingual dictionaries for most commercially important languages. The on-line availability of dictionaries is also increasing. A simple web search found over 70 links to general, multi-lingual dictionaries on one web page alone (http://www.artinternet.fr/city/Biblio/Autres/autres.htm).
Although machine readable dictionaries are becoming more widely available, their coverage and quality vary. One would expect this variability to impact effectiveness, but to what degree has yet to be studied directly. Although dictionaries may also be proprietary or costly, they are more prevalent and simpler to apply than aligned corpora, making them a logical choice of lexical resource. Section 5 explores the feasibility of a transitive approach to translation. This would circumvent the problem of having no lexical resources for a pair of languages and would further increase the significance of the dictionary approach to CLR. However, we first discuss automatic dictionary translation, its inherent ambiguity, and the statistical techniques that have been shown to significantly reduce ambiguity's negative effects.
2.3 MACHINE TRANSLATION SYSTEMS
Machine translation (MT) systems are an important translation resource. MT systems are available for translation between a number of major languages. Developing systems for new languages, however, requires a significant effort (although research is underway to address this issue). Using an MT system to translate entire document databases is not practical, so the obvious approach is to translate queries. Even if an MT system is available for the target language, there is evidence that they often require more context than is available in a typical query for accurate translation. Ballesteros and Croft, 1998, showed in retrieval experiments that dictionary-based techniques outperformed one popular commercial MT system and performed as well as another system. Dictionary-based retrieval systems are, therefore, an important technique for situations where MT systems are not available or not effective.
3 DICTIONARY TRANSLATION AND AMBIGUITY
The natural approach to automatic translation via machine readable dictionary is simple word-by-word replacement. More specifically, each word in the source language is replaced by its translation equivalents in the target language. There are several problems with this approach. First, word-by-word translations are inherently ambiguous. For each headword, a dictionary will list several parts-of-speech, each having one or more related meanings. The resulting translation will contain many incorrect translation equivalents. Previous work (Ballesteros and Croft, 1996; Hull and Grefenstette, 1996) showed that word-by-word translations yield cross-language retrieval effectiveness that is more than 50% below that of monolingual retrieval.
Furthermore, erroneous words are responsible for a significant portion of this drop in effectiveness. Second, queries often contain multi-word concepts that lose their intended meaning when translated word-by-word. Consider the Spanish phrase oso de peluche, meaning teddy bear. The Collins Spanish-English dictionary lists (bear; braggart, bully) and (felt, plush) as the translation equivalents for oso and peluche, respectively (de is a preposition meaning of). It is not possible to reconstruct the correct English translation of the phrase via a word-by-word approach. There are compositional phrases for which the correct translation can be derived word-by-word, but even then it is non-trivial to select the appropriate equivalent for each word. Ballesteros and Croft, 1996, found that failure to translate multi-term concepts as phrases accounted for 31% of the drop in cross-language effectiveness as compared to that of monolingual retrieval. Finally, there are some words which cannot be translated via a dictionary. Dictionaries typically contain vocabulary used in a variety of settings. This broad coverage makes them applicable for translating queries covering a wide variety of topics. However, coverage is generally not deep enough to include many domain-specific words or specialized terminology. Query words that cannot be translated via the dictionary are referred to as out-of-vocabulary (OOV) words. OOV words have been shown to be responsible for up to 23% of the drop in effectiveness for cross-language retrieval.
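A sketch of the word-by-word replacement just described, using a toy lexicon (only the oso/peluche equivalents follow the Collins example above; the rest is invented), shows the failure modes directly: the phrase meaning is lost and OOV words pass through untranslated.

    # Sketch of word-by-word MRD translation (toy lexicon; only the oso and
    # peluche entries follow the Collins example in the text). OOV words are
    # passed through untranslated.

    SP_EN = {
        "oso": ["bear", "braggart", "bully"],
        "de": ["of"],
        "peluche": ["felt", "plush"],
    }

    def word_by_word(terms, lexicon):
        return [lexicon.get(t, [t]) for t in terms]   # OOV: keep source word

    query = ["oso", "de", "peluche", "epidemiologia"]  # last term is OOV
    for term, equivalents in zip(query, word_by_word(query, SP_EN)):
        print(f"{term:>14} -> {equivalents}")
    # The phrase "oso de peluche" (teddy bear) is unrecoverable from the
    # per-word equivalents, and "epidemiologia" is left untranslated.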
4 RESOLVING AMBIGUITY
Despite the negative effect of translation ambiguity on cross-language retrieval effectiveness, ambiguity reduction techniques can be applied to significantly improve effectiveness. These techniques are based on either syntactic analysis or statistical analysis of word co-occurrence. The following sections describe techniques for reducing the effects of the ambiguity associated with simple dictionary translation. When ambiguity reduction is not employed, cross-language retrieval via simple MRD translation is only 40-60% as effective as monolingual retrieval. However, when all of the techniques described below are applied, cross-language retrieval effectiveness can be comparable to that of monolingual retrieval.
4.1 SYNONYMS AND PART-OF-SPEECH
Recall that dictionaries list many possible translation equivalents for each head-word. When used indiscriminately, erroneous words are responsible for much of the ambiguity in a dictionary translation. When it is not possible to automatically distinguish between the correct and incorrect translations of a word, it is possible to reduce the negative impact of the incorrect translations via the synonym operator in the INQUERY retrieval system (Broglio et al., 1994) and via part-of-speech.
We first discuss the effects of applying the synonym operator and then describe the use of part-of-speech. There are two factors related to erroneous word translations that reduce effectiveness. First, because dictionaries often include archaic usages for headwords, query translations can be unduly affected by these rarely used equivalents. Second, query words having the greatest number of translation equivalents tend to be given greater importance. This is an artifact of the way in which document relevance is assessed. The words that a query and document have in common serve as the basis for inferring the likelihood that they are related. Thus a query word replaced with ten translations will have five times as many chances to match document terms as a query word replaced with only two translations. We measure the importance or discriminating power of a word in a collection of documents by a belief score based on two types of term frequency, the tf-score and the idf-score. The tf-score reflects the within-document frequency of a term, while the idf-score is inversely proportional to the number of documents in the collection in which the term occurs. For example, a document containing the word apple with reasonable frequency is a good indication that the document is about apples. However, if the word apple appears in many documents across the collection, it will not have much ability to discriminate between relevant and non-relevant documents about apples. The score for apple in this particular document would get credit for a high tf-score, but would be penalized for occurring frequently throughout the corpus (low idf-score). Infrequent or rare terms have higher idf-scores and tend to have higher belief values. The INQUERY synonym operator treats occurrences of all words within it as occurrences of a single pseudo-term whose term frequencies are the sum of the frequencies of each word in the operator. This de-emphasizes infrequent words and has a disambiguating effect. If the synonym operator is not used, infrequent translations get more weight than more frequent translations due to their higher idf. The correct translation of a query term is generally not an infrequently used word, so in most cases this approach is effective. In addition, the synonym operator reduces ambiguity by normalizing for the variance in the number of translation equivalents across query terms. Without the synonym operator, query words with many translation equivalents would get a higher weight than words with fewer translations, because the alternative to the synonym operator is to give each translation equal weight. In a two-word query ts1 ts2, having one and five translation equivalents respectively, the resulting target-language query, tt11 tt21 tt22 tt23 tt24 tt25, essentially treats the concept described by ts2 as five times as important as ts1. Application of the synonym operator has the effect of treating occurrences of all translations of a word as occurrences of a single concept, therefore normalizing for this variance.
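The pseudo-term behavior of the synonym operator can be sketched as follows. This is a simplification with a generic tf.idf weighting, not INQUERY's actual belief function: the grouped words' term frequencies are summed, and document frequency is computed over their union, so rare erroneous equivalents no longer earn inflated idf.

    import math

    # Sketch of the synonym-operator idea: words inside a #syn group are
    # scored as one pseudo-term whose tf is the sum of member tfs and whose
    # df counts documents containing any member. The tf.idf weighting here
    # is a generic simplification, not INQUERY's actual belief function.

    def pseudo_term_score(doc, collection, synonyms):
        tf = sum(doc.count(w) for w in synonyms)
        df = sum(1 for d in collection if any(w in d for w in synonyms))
        idf = math.log((len(collection) + 1) / (df + 0.5))
        return tf * idf

    collection = [
        ["bear", "cub", "forest"],
        ["bully", "schoolyard"],
        ["bear", "market", "stocks"],
    ]
    syn = ["bear", "braggart", "bully"]   # English equivalents of "oso"
    for doc in collection:
        print(doc, round(pseudo_term_score(doc, collection, syn), 3))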
When queries are expressed in well-formed sentences, syntactic analysis can be employed to tag each query word with its part-of-speech (POS). This enables the replacement of each query word with only those translation equivalents corresponding to its correct POS. Part-of-speech tagging may not eliminate the transfer of archaic translation equivalents, but it typically reduces ambiguity by reducing the number of erroneous terms. The synonym operator is much more effective in general than is the application of POS. This is good news since queries are rarely expressed in syntactically correct sentences. The synonym operator yields improvements in effectiveness of greater than 45% over simple word-by-word translation alone, while POS disambiguation yields improvements of up to 22% for short queries. POS is less effective for long queries since the number of erroneous terms may not be sufficiently reduced to yield significant improvements in effectiveness.
4.2 WORD CO-OCCURRENCE
Words take their meaning from the context in which they are used, and this fact can be exploited to strengthen the effectiveness of a query. In the absence of any other information, if one speaks of "the bank" we are unable to determine whether the reference is to the financial sense of the word or to the land bordering a body of water. However, if one indicates that a deposit must be made, it can be inferred that the reference is to the financial sense. Analysis of word co-occurrence allows us to make assumptions about the intended meanings of query words and thus reduce the effects of ambiguity.

Query Expansion
One means of exploiting word co-occurrence is query expansion (Salton and Buckley, 1990; Attar and Fraenkel, 1977), where the results of previous retrievals are employed to improve the effectiveness of subsequent retrievals. The approach is to modify the query by adding words co-occurring with query terms in documents known or believed to be relevant. This expansion with related words strengthens the query and improves effectiveness. In the experiments described in Section 5, we perform query expansion via the Local Context Analysis (LCA) technique of Xu and Croft, 1996. This differs from other expansion methods in two ways. First, rather than analyzing entire documents, it expands the query with words from the most highly ranked passages. Second, the more frequently a word co-occurs with query terms, the higher it will be ranked; words that occur frequently throughout the corpus are penalized. Query expansion can also be viewed as a technique for smoothing the document representation with language models (Ponte, 1998). In the cross-language environment, where queries may contain many erroneous terms, application of query expansion is based on two assumptions. The first is that related terms will tend to co-occur in documents while unrelated terms will tend not to. The second is that the documents containing the related terms will occur at the top of the ranking.
Earlier work (Ballesteros and Croft, 1996; Ballesteros and Croft, 1997; Ballesteros and Croft, 1998) showed that expansion of cross-language queries is effective at two stages of the translation process: prior to dictionary translation and after dictionary translation. Expansion prior to translation (pre-translation expansion) creates a stronger base query for translation by adding terms that emphasize query concepts. Consider the English query Programs for suppressing or limiting epidemics in Mexico, which translates to Programas para reprimir o limitar epidemias en México. In a TREC experiment, pre-translation expansion led to the addition of cholera, disease, health, and epidemiologist, thus strengthening the intent of the original query. The subsequent query translation contains more terms related to the information need, and this moderates the effects of erroneous term translations. When post-translation expansion is applied to the query above, morbo, morbosidad, and contagio are examples of the words added to the query. In English, these translate to (morbidity, sickness rate), (morbidity, morbidness, unhealthiness, sick-rate), and (infection, contagion, corruption, taint), respectively. Post-translation query expansion has the effect of further reducing the effects of erroneous term translations by adding more context-specific terms.

Disambiguation of Phrases
Co-occurrence analysis can also be applied to the disambiguation of compositional phrases. Recall that compositional phrases are those for which the correct translation can be derived word-by-word. Phrase disambiguation proceeds as follows. First, create sets of words such that a set contains all the translation equivalents for one word in the phrase being translated. Then generate all possible combinations of words containing one equivalent from each set. Each combination is a potential phrase translation. Analyze the frequencies of co-occurrence for each combination of words, selecting the combination that co-occurs with the greatest percentage over what would be expected by chance. For a more detailed description, see Ballesteros and Croft, 1998.
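The phrase-disambiguation procedure can be sketched directly for two-word phrases. The statistic below, observed co-occurrence divided by the count expected by chance, is one plausible reading of "greatest percentage over what would be expected by chance"; the precise measure of Ballesteros and Croft, 1998, may differ, and the documents are invented.

    from itertools import product

    # Sketch of compositional phrase disambiguation: score every combination
    # of one equivalent per source word by how much more often the words
    # co-occur than chance predicts. The statistic (observed / expected) is
    # our reading of the description; the exact measure may differ.

    def over_chance(w1, w2, docs):
        n = len(docs)
        observed = sum(1 for d in docs if w1 in d and w2 in d)
        expected = (sum(1 for d in docs if w1 in d)
                    * sum(1 for d in docs if w2 in d)) / n
        return observed / expected if expected else 0.0

    def best_phrase(equivalent_sets, docs):
        """Pick the best two-word combination from the cross-product."""
        return max(product(*equivalent_sets),
                   key=lambda combo: over_chance(combo[0], combo[1], docs))

    docs = [{"teddy", "bear", "toy"}, {"plush", "felt", "fabric"},
            {"teddy", "bear", "shop"}, {"bully", "school"}]
    print(best_phrase([["bear", "braggart", "bully"],
                       ["teddy", "plush"]], docs))
    # -> ('bear', 'teddy')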
5 ADDRESSING LIMITED RESOURCES
We have discussed how statistical and MRD techniques can be applied to CLR. This approach significantly reduces the effects of translation ambiguity, enabling effective cross-language retrieval in a general environment without relying upon deep linguistic analysis. Our next goal is to develop a means of circumventing the problem of limited availability of linguistic resources. To this end, our approach is applied in the context of transitive translations that are performed between languages for which no bilingual dictionary is available.
In cases where the goal is to perform cross-language retrieval between languages A and C and no bilingual dictionary between A and C is available, transitive translation involves finding a translation path through an intermediate language, or interlingua. In other words, find a language B for which bilingual dictionaries exist between A and B and between B and C. Language B then acts as an interlingua to perform translations between A and C.
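A transitive translation path can be sketched as the composition of two bilingual lexicons (toy entries; the French equivalents are illustrative). Note how the equivalent sets multiply at each hop, which is exactly the extra ambiguity quantified in Section 5.3.

    # Sketch of transitive translation A -> B -> C by composing two bilingual
    # lexicons (toy entries; illustrative only). Equivalent sets multiply at
    # each hop, which is the added ambiguity measured in Section 5.3.

    SP_EN = {"oso": ["bear", "braggart", "bully"]}          # A -> B
    EN_FR = {"bear": ["ours", "supporter"],                 # B -> C
             "braggart": ["fanfaron"],
             "bully": ["tyran"]}

    def transitive(term, a_to_b, b_to_c):
        out = []
        for pivot in a_to_b.get(term, []):
            out.extend(b_to_c.get(pivot, []))
        return out or [term]    # OOV at either hop: keep the source term

    print(transitive("oso", SP_EN, EN_FR))
    # 3 English equivalents fan out to 4 French ones; with real dictionaries
    # the chapter reports roughly 6 x 2 = 12 equivalents per original term.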
5.1 EXPERIMENTAL METHODOLOGY
The main idea behind the following experiments is to simulate a cross-language environment so that the efficacy of a transitive query translation approach can be evaluated and compared to our earlier cross-language work and to monolingual retrieval. To do this, we need a set of queries in a source language, a collection of documents in a target language, and relevance judgments. Relevance judgments indicate which documents in the target language are relevant to each query. We can then evaluate the effectiveness of a translation approach by measuring the ability of the translated queries to retrieve relevant documents. The data for the experiments described here come from the TREC-6 (Voorhees and Harman, 1997) cross-language track. There are three sets of 25 queries: one English, one French, and one Spanish, where the French and Spanish are manual translations of the English. Each query set has relevance judgments for a collection of English and a collection of French documents. There are no relevant documents for some of the French queries, so those queries were removed, leaving 20 corresponding Spanish and French queries. In these experiments, two types of translations are performed. The first generates French queries from Spanish queries via a bilingual dictionary; we refer to this as a bilingual translation. The second type is referred to as a transitive translation. Transitive translations are performed from the Spanish queries, through English as the interlingua, to French. More specifically, Spanish queries are first translated to English with the Collins Spanish-English machine-readable dictionary. The resulting English query is translated to French via the Collins English-French machine-readable dictionary. MRD translations are performed after simple morphological processing of query terms to remove most plural word forms and to replace verb forms with their infinitive forms. The Spanish morphological processor was based on the rules of pluralization and verb conjugation for Spanish. Morphological processing for French was performed via the on-line XEROX morphological analyzer (Xerox, 1998). Recall that translation is not used here to suggest the generation of an exact, syntactically correct representation of a query in another language. The words of a query in one language are merely replaced with the dictionary definitions of those terms in another language. The goal is to generate an approximate translation in which the gist of the original query is preserved.
Each query term is first tagged with its part-of-speech. Spanish queries are tagged with the BBN part-of-speech tagger (BBN, 1999). The TreeTagger system (Schmid, 1994) is used for tagging French. Sequences of nouns and adjective-noun pairs are taken to be phrases and are translated as such. Single terms are replaced with only those translation equivalents corresponding to the term's POS. All translation equivalents for an individual term are wrapped in a synonym operator. Words which are not found in the dictionary are added to the new query without translation. Phrasal translations are performed using a database built from information on phrases and word usage contained in the Collins or Larousse MRD. During phrase translation, the database is searched for a source-language phrase. A hit returns the target-language translation of the source-language phrase. If more than one translation is found, each of them is added to the query. This allows the replacement of a source phrase with its multi-term representation in the target language. When a phrase cannot be found in the database, it is translated word-by-word and disambiguated via the co-occurrence method discussed in Section 4.2. Query expansion is applied at various points in the translation process. First, it is applied prior to replacement in order to strengthen the query for translation. Queries are also expanded after replacement to reduce the negative effects of erroneous term translations. Average precision is used as the basis of evaluation for all experiments. It is unrealistic to expect the user to read many retrieved foreign documents to judge their relevance, so we also report precision at low recall levels. All work in this study is performed using the INQUERY information retrieval system. INQUERY is based on the Bayesian inference net model and is described elsewhere (Turtle and Croft, 1991b; Turtle and Croft, 1991a; Broglio et al., 1994). All significance tests use the Wilcoxon Matched-Pairs Signed-Ranks test and the paired sign test unless otherwise specified. The retrieval environment for French differs slightly from that for English or Spanish. We have both a Spanish and an English stemmer that are used for retrieval in those languages. However, we do not have a French stemmer, so we employ the XEROX Finite-State Morphological Processor to simulate the effect of stemming by expanding query terms with inflectional variants.
5.2 BILINGUAL VS TRANSITIVE TRANSLATION AMBIGUITY
Translation ambiguity greatly reduces cross-language retrieval effectiveness. The following set of experiments is designed to confirm that earlier findings (Ballesteros and Croft, 1996; Ballesteros and Croft, 1997; Ballesteros and Croft, 1998) about bilingual translations between Spanish and English are also true of bilingual translations between Spanish and French.
First, simple bilingual and manual translations to French are generated from the Spanish cross-language queries. Manual translations are generated to simulate near-perfect conditions in which the best translation from the dictionary could always be determined. Later, these results are compared to those for simple transitive translation. Bilingual translations are generated via word-by-word replacement both with and without the use of POS and synonym-operator disambiguation. The two manual translations are performed to measure the degree of ambiguity caused by erroneous word translations and loss of phrasal translations. First is a word-by-word translation where the best single-term translation is selected manually. Second is a best word-by-word translation augmented by manual phrasal translation. Table 8.1 compares the effectiveness of monolingual retrieval with automatic word-by-word translation with and without POS disambiguation, and with manual word-by-word and manual word-by-word plus phrasal translation.

Table 8.1 Average precision and number of relevant documents retrieved: monolingual French (Mono); automatic word-by-word translation (WBW), with SYN, with POS, and with POS and SYN; manual word-by-word translation; and manual word-by-word with phrasal translation.

                      Mono     WBW      WBW      WBW      WBW+POS   Manual    Manual
                                        +SYN     +POS     +SYN      WBW       WBW+Phr
    Relevant Docs:    1098     1098     1098     1098     1098      1098      1098
    Relevant Ret:     730      494      583      533      575       609       605
    Avg Prec:         0.2767   0.1634   0.2013   0.1869   0.2008    0.1961    0.2419
    % Change:         -        -40.9    -27.3    -32.4    -27.4     -29.1     -12.6
    Precision:
      5 docs:         0.5700   0.2900   0.3600   0.3700   0.3600    0.3600    0.4800
      10 docs:        0.5050   0.2350   0.2950   0.3000   0.2850    0.2750    0.3900
      20 docs:        0.4025   0.1950   0.2425   0.2225   0.2275    0.2300    0.3100
      30 docs:        0.3433   0.1717   0.2117   0.1883   0.1983    0.2050    0.2633
      100 docs:       0.1850   0.1145   0.1295   0.1230   0.1280    0.1385    0.1510
The bilingual translations of the queries achieve 60% of monolingual effectiveness. This supports earlier reports showing that cross-language retrieval via automatic word-by-word translation without disambiguation achieves 50-60% of monolingual performance. Word ambiguity accounts for 29% of the loss of effectiveness and failure to translate phrases accounts for 40%. This is also consistent with earlier results. However, combining part-of-speech with synonym-operator disambiguation is less effective here than reported for previous bilingual translations from Spanish to English. The query term statistics given in Table 8.2 explain, at least in part, why this is the case.
For each query set, the table gives the source and target languages, whether POS disambiguation is applied, the average number of words in the translated query, the average number of translation equivalents or definitions per original query term, and the number of original query terms that were recovered via translation. The variances are shown in parentheses.
Table 8.2 Mean (variance) statistics for cross-language query sets: terms per query, definitions per term, undefined terms, and number of original query terms recovered after translation.

    Source     Target     POS   Qry Length        Defs per Term   Undef. Terms   Orig. Terms
    Lang.      Lang.                                                             Recovered
    Spanish    English    no    54.67 (738.98)    6.58 (3.5)      4 (6.81)       4.62
    Spanish    English    yes   46.52 (447.49)    5.62 (21.43)    4 (6.60)       4.67
    Spanish    French     no    16.67 (33.28)     2.11 (3.39)     6 (7.01)       4.19
    Spanish    French     yes   15.05 (61.7)      1.92 (2.28)     6 (7.17)       4.14
When bilingual query translations are compared to their monolingual counterparts, each bilingual set contains roughly the same number of original query terms. In other words, the degree to which the monolingual queries are recovered by bilingual translation is roughly the same for translations to French and to English. However, there are more than three times as many translations per query term for bilingual translations from Spanish to English (CSE) than for Spanish-to-French (CSF) translations. The CSF query terms were replaced with an average of two translations per term, while the CSE query terms were replaced with six translations per term. The probability of introducing erroneous terms is thus greater for translations from Spanish to English. This probability is lower for translations from Spanish to French, whose queries contain only 20-30% as many terms. Given the smaller number of French translation equivalents, normalizing for the variance in translation equivalents via the synonym operator has a similar effect to applying POS to reduce the number of translation equivalents; the effects are not additive under these conditions. However, cross-language retrieval effectiveness of CSF queries with automatic translation via a bilingual dictionary, augmented by POS and synonym disambiguation, achieves 73% of monolingual effectiveness.
5.3 TRANSITIVE TRANSLATION AMBIGUITY
Further analysis reveals that even manual translations introduce ambiguity. There can be many ways to translate a given concept, and thus many ways for the original meaning to decay through repeated translations. This suggests that ambiguity will have an even greater negative effect on transitive translations than on bilingual translations. Table 8.3 shows that this is in fact the case. It compares bilingual translation from Spanish to French with transitive translation from Spanish through English to French. Column one gives recall-precision figures for simple word-by-word bilingual translation via an MRD employing POS. Column two shows those for a simple word-by-word transitive translation from Spanish to English to French, employing POS. In the transitive case, the POS tag of each original Spanish query word is propagated to its English translation equivalents. The English words are then replaced only by the French translation equivalents having the same part of speech. Transitive translation effectiveness is 91.9% below that of the bilingual translation, supporting the assumption that transitive translations are more ambiguous. This makes sense given the statistics in Table 8.2. Each Spanish query word is replaced on average with six English terms, and each English word is replaced on average with two French terms. This results in a more ambiguous transitive translation having roughly twelve translation equivalents per original query term.
Table 8.3 Average precision and number of relevant documents retrieved for bilingual word-by-word translation and transitive word-by-word translation.

| Query | Bilingual | Transitive |
|---|---|---|
| Relevant Docs: | 1098 | 1098 |
| Relevant Ret: | 533 | 216 |
| Avg. Prec.: | 0.1869 | 0.0151 |
| % Change: | | -91.9 |
| Precision at 5 docs: | 0.3700 | 0.0800 |
| 10 docs: | 0.3000 | 0.0500 |
| 20 docs: | 0.2225 | 0.0400 |
| 30 docs: | 0.1883 | 0.0383 |
| 100 docs: | 0.1230 | 0.0320 |
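The transitive procedure just described can be made concrete with a short sketch. The Python code below is our own illustration rather than the chapter's implementation: the dictionary format (a mapping from a (term, POS) pair to its translation equivalents) and the function name are assumptions.

```python
# Illustrative sketch of word-by-word transitive translation with
# POS propagation; the dictionary structures are assumptions, not
# the chapter's actual data formats.

def transitive_translate(query, es_en, en_fr):
    """query: list of (spanish_term, pos) pairs.
    Returns the French transitive translation as a list of words."""
    french = []
    for term, pos in query:
        # Replace the Spanish term with its English equivalents that
        # have a matching part of speech.
        for eng in es_en.get((term, pos), []):
            # The Spanish POS tag is propagated to every English
            # equivalent, so only French equivalents with the same
            # POS are selected in the second step.
            french.extend(en_fr.get((eng, pos), []))
    return french

# Toy example: one Spanish noun fanning out through English, as in
# the fan-out statistics of Table 8.2.
es_en = {("guerra", "N"): ["war", "warfare", "struggle"]}
en_fr = {("war", "N"): ["guerre"], ("warfare", "N"): ["guerre"],
         ("struggle", "N"): ["lutte", "bagarre"]}
print(transitive_translate([("guerra", "N")], es_en, en_fr))
# -> ['guerre', 'guerre', 'lutte', 'bagarre']
```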
5.4 RESOLVING TRANSITIVE TRANSLATION AMBIGUITY
Synonym Operator

The synonym operator is effective for reducing ambiguity when applied to the target-language query following a bilingual translation. We hypothesized that, when applied to the inter-lingual translation, it would reduce the additional ambiguity introduced by transitive translation. Table 8.4 gives recall-precision figures that support this hypothesis. Column one shows simple word-by-word bilingual translation via a Spanish-French bilingual dictionary. Column two shows simple word-by-word transitive translation from Spanish to English to French. The queries for column three were generated in the following way. First, Spanish query words are replaced by their English translation equivalents after wrapping all the equivalents for a particular word in a synonym operator. This creates an English inter-lingual query in which all English translations of a query word are treated equivalently. Then each English word is replaced by its French translation equivalents (recall that the POS for each Spanish query word is propagated to all of its English translation equivalents). The result is that all the translation equivalents of a particular Spanish query term are treated as instances of the same word; in other words, this normalizes for the variance in the number of French translation equivalents across the original Spanish query words. The effect of ambiguity is considerably reduced, raising the effectiveness of transitive translation for fourteen of the twenty queries. This improvement is significant at p ≤ 0.01.
Table 8.4 Average precision and number of relevant documents retrieved for bilingual word-by-word translation, transitive word-by-word translation, and transitive word-by-word translation with synonym operators used at the bilingual stage to group the multiple English translations of a Spanish query term.

| Query | Bilingual WBW+POS | Transitive WBW+POS | Transitive WBW+POS+Bi-SYN |
|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 |
| Relevant Ret: | 494 | 216 | 328 |
| Avg. Prec.: | 0.1869 | 0.0151 | 0.1231 |
| % Change: | | -90.8 | -34.2 |
| Precision at 5 docs: | 0.3700 | 0.0800 | 0.2600 |
| 10 docs: | 0.3000 | 0.0500 | 0.2300 |
| 20 docs: | 0.2225 | 0.0400 | 0.1750 |
| 30 docs: | 0.1883 | 0.0383 | 0.1450 |
| 100 docs: | 0.1230 | 0.0320 | 0.0775 |
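The Bi-SYN construction can be pictured with a short sketch. The code below is a hypothetical illustration, not the chapter's implementation: it shows the net effect of applying the synonym operator at the bilingual stage, namely that every French equivalent descending from one Spanish term ends up inside a single INQUERY #syn operator. The outer #sum combination is an assumption, and POS filtering is omitted for brevity.

```python
# Hedged sketch of Bi-SYN query construction: all translation
# equivalents derived from one Spanish term are grouped in one #syn
# operator, so they are scored as instances of a single word.
# Combining the groups with #sum is an assumption.

def bi_syn_query(spanish_terms, es_en, en_fr):
    groups = []
    for term in spanish_terms:
        french = []
        for eng in es_en.get(term, []):
            # Keep the English word itself if it has no French entry.
            french.extend(en_fr.get(eng, [eng]))
        if french:
            groups.append("#syn(" + " ".join(french) + ")")
    return "#sum(" + " ".join(groups) + ")"

es_en = {"guerra": ["war", "struggle"]}
en_fr = {"war": ["guerre"], "struggle": ["lutte", "bagarre"]}
print(bi_syn_query(["guerra"], es_en, en_fr))
# -> #sum(#syn(guerre lutte bagarre))
```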
Due to the positive effect of applying synonym operators as described above, all subsequent transitive translations are performed in this way.

Phrasal Translation

Failure to translate multi-term concepts as phrases is one of the main factors contributing to translation ambiguity. We hypothesize that this ambiguity will be exacerbated in transitive translations. Consider the following example. The Spanish phrase Segunda Guerra Mundial means Second World War. The Spanish-English MRD lists (second, second meaning, veiled meaning, second gear), (world-wide, universal, world), and (war, warfare, struggle, fight, conflict, billiards) as the translation equivalents for segunda, mundial, and guerra, respectively. When the English words are translated to French and the French words are grouped such that each group corresponds to one of the original Spanish phrase terms, the phrase translation becomes: (seconde, deuxième, second, licence, avec mention bien assez bien, article de second choix), (universel, universelle, du monde, mondial, mondiale), and (guerre, guerre, lutte, bagarre, combat, lutte, conflit, billard). The resulting transitive translation of a three-word phrase is twenty-seven words long. Furthermore, several unrelated terms are included in the translation. One means of reducing the effect of this ambiguity is to translate phrases when they are listed in the dictionary. In the next experiment, we translate phrases via the dictionary where possible; where that is not possible, we disambiguate phrasal translations via the co-occurrence method described in section 4.2. Spanish queries are first part-of-speech tagged and noun phrases are identified. The query is then translated to English, replacing Spanish phrases with English phrasal translations when they are listed in the dictionary. The resulting English query is translated similarly to French. In addition, the phrasal translations are augmented by the INQUERY passage25, phrase, and synonym query operators. The passage25 and phrase operators were shown to be effective for use with phrasal translations in earlier work (Ballesteros and Croft, 1997). The phrase operator works in the following way. The query #phrase(greenhouse effect) indicates that if the words "greenhouse" and "effect" co-occur frequently in the collection, then co-occurrences within three terms of each other are considered when calculating belief scores. If not, the terms are treated as having equal influence on the final result. This allows for the possibility that individual occurrences of the words are evidence of relevance. The passage25 operator requires that words which do not co-occur frequently be located within a small window of 25 words. The synonym operator is applied because we employ the Xerox morphological processor to give the effect of stemming. For example, Segunda Guerra Mundial translates via the English phrase dictionary to Second World War. This English phrase then translates via the French phrase dictionary to Deuxième Guerre Mondial and Second Guerre Mondial.
When the morphological processor is applied, the French translations become Deuxième Guerre Mondial Mondiale and Second Seconder Seconde Guerre Mondial Mondiale, respectively. This results in each three-word phrase being treated as a longer phrase. The synonym operator groups morphological variants to remove this artifact, yielding #passage25(#phrase(Deuxième Guerre #syn(Mondial Mondiale))) and #passage25(#phrase(#syn(Second Seconder Seconde) Guerre #syn(Mondial Mondiale))). Table 8.5 compares monolingual retrieval with bilingual translation and transitive translation, both with phrasal translation via dictionary and co-occurrence disambiguation. Transitive translation effectiveness increases by 15.7% when phrases are detected and translated. In this particular query set, nine of twenty-one Spanish phrases were translatable to English via dictionary. Of those nine English phrases, five were translatable to French via dictionary. The remaining phrases were translated via co-occurrence disambiguation.

Table 8.5 Average precision and number of relevant documents retrieved for monolingual retrieval, bilingual word-by-word translation with phrase dictionary and co-occurrence translation of phrases, and transitive word-by-word translation with phrase dictionary and co-occurrence translation of phrases.
| Query | Mono | Bilingual WBW+Phr+Co | Transitive WBW+Phr+Co |
|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 |
| Relevant Ret: | 730 | 596 | 389 |
| Avg. Prec.: | 0.2767 | 0.2104 | 0.1424 |
| % Change: | | -23.9 | -48.6 |
| Precision at 5 docs: | 0.5700 | 0.3500 | 0.2700 |
| 10 docs: | 0.5050 | 0.3000 | 0.2300 |
| 20 docs: | 0.4025 | 0.2425 | 0.1800 |
| 30 docs: | 0.3433 | 0.2067 | 0.1567 |
| 100 docs: | 0.1850 | 0.1325 | 0.0860 |
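The operator combination described above can be assembled mechanically. The sketch below is a hypothetical illustration of that step: the helper name and the variant table are our own, but the output format follows the #passage25/#phrase/#syn example given in the text.

```python
# Sketch of wrapping a phrasal translation and its morphological
# variants in the INQUERY operators described above. The variants
# dictionary stands in for the output of the morphological processor.

def phrase_query(words, variants):
    parts = []
    for w in words:
        forms = [w] + variants.get(w, [])
        # Group morphological variants so the phrase keeps its
        # original length.
        parts.append("#syn(" + " ".join(forms) + ")" if len(forms) > 1
                     else w)
    # #phrase scores within-three-term co-occurrences when the words
    # co-occur frequently; #passage25 otherwise restricts matches to
    # a 25-word window.
    return "#passage25(#phrase(" + " ".join(parts) + "))"

print(phrase_query(["Deuxième", "Guerre", "Mondial"],
                   {"Mondial": ["Mondiale"]}))
# -> #passage25(#phrase(Deuxième Guerre #syn(Mondial Mondiale)))
```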
When translations are disambiguated only via synonym operators and phrasal translation, transitive translation achieves only 51% of monolingual effectiveness, while bilingual translation achieves 76% of monolingual. Transitive translation is 32% less effective than bilingual translation. Recall from Table 8.2 that the average number of English definitions for a Spanish query term is about six and that there are two French definitions per English query term. This means that transitive translations have about twelve translations per query term. Although there are far more translations per query term in the transitive translation, the
number of original query terms in the resulting queries is about the same. Transitive translation thus yields more ambiguous queries than bilingual translation. Ballesteros and Croft, 1998, showed that the effectiveness of bilingual translations could be brought near the level of monolingual. The next section explores the feasibility of reducing transitive translation ambiguity by first generating the best bilingual translation, that is, by reducing as much interlingual ambiguity as possible prior to the transitive translation phase.

Combining Disambiguation Strategies during Transitive Translation

In the following experiments, transitive translations of the Spanish queries are generated via English, as the interlingua, to French. All disambiguation strategies discussed in section 4 and in Ballesteros and Croft, 1998, are applied at the bilingual stage of the translations. In other words, when translating from Spanish, all disambiguation strategies are applied to generate the least ambiguous English query possible. This disambiguated English query is then used as the base for a transitive translation to French. Bilingual translations are generated via automatic dictionary translation augmented by co-occurrence disambiguation and query expansion. Expansion is applied both prior to and after translation to English. While these approaches work well to reduce the negative effects of ambiguity on retrieval for one level of translation, the effect may not be preserved after an additional level of translation and the further introduction of ambiguity. The mean number of interlingua terms after the bilingual translation to English, employing all disambiguation techniques, is 126.8, while the original query has 7.19. The mean number of original query terms recovered via translation in the bilingual queries is 5.86. Although many of these additional terms aid in disambiguation, there are more than one hundred more query terms after bilingual translation than in the original query. It is not clear what effect this will have on the transitive translation from English to French. Dictionaries often list a phrase or short description as the translation of a head word, especially if the head word is a verb. For example, the translation for rodear, meaning "to surround", includes: to beat about the bush, to go by an indirect route, and to make a detour. Many of these types of translations will not be listed as head words in a dictionary. Translating these multi-term definitions word-by-word may introduce even more ambiguity. To test this possibility, two transitive translations were performed beginning with the bilingual translations as a base: one in which these multi-term definitions were translated word-by-word, and one in which they were translated via co-occurrence disambiguation. Table 8.6 gives recall-precision values for the best results from these two transitive translation approaches and compares them to transitive translation without expansion at the bilingual translation phase. This establishes whether expansion at the bilingual stage adds noise or improves the
final translation. (That is, translate the Spanish queries to English via MRD with the synonym operator, phrase dictionary, and co-occurrence phrasal translation, both with and without any expansion; then translate the resulting queries from English to French.)

Table 8.6 Average precision and number of relevant documents retrieved for transitive translation with no bilingual expansion, transitive translation after bilingual expansion and word-by-word translation of multi-term definitions, and transitive translation after bilingual expansion and co-occurrence disambiguation of multi-term definitions.
| Query | No Biling. Expan. | Biling. Expan. + WBW | Biling. Expan. + Co |
|---|---|---|---|
| Relevant Ret: | 389 | 669 | 679 |
| Avg. Prec.: | 0.1424 | 0.1845 | 0.1784 |
| % Change: | | 29.6 | 25.3 |
| Precision at 5 docs: | 0.2700 | 0.3000 | 0.3000 |
| 10 docs: | 0.2300 | 0.2650 | 0.2700 |
| 20 docs: | 0.1800 | 0.2375 | 0.2350 |
| 30 docs: | 0.1567 | 0.2183 | 0.2217 |
| 100 docs: | 0.0860 | 0.1490 | 0.1465 |
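The co-occurrence disambiguation of multi-term definitions used in the second approach can be sketched as follows. This is a simplified stand-in for the section 4.2 method: cooccur() is a placeholder for whatever corpus association statistic is used, and selecting a single best candidate is an assumption.

```python
# Hedged sketch of co-occurrence disambiguation: among alternative
# translations of one term, keep the candidate whose words co-occur
# most strongly with the translations of the other query terms.

def disambiguate(candidates, context, cooccur):
    """candidates: alternative translations, each a list of words;
    context: translations of the other query terms."""
    def score(option):
        return sum(cooccur(w, c) for w in option for c in context)
    return max(candidates, key=score)

# Toy example with a hand-made co-occurrence table.
counts = {("route", "surround"): 12, ("bush", "surround"): 1}
co = lambda a, b: counts.get((a, b), 0)
print(disambiguate([["bush"], ["route"]], ["surround"], co))
# -> ['route']
```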
As one would expect, there is a significant improvement (at p ≤ 0.019) in effectiveness when we generate the best inter-lingual translation possible prior to performing the transitive translation. However, there is no significant difference between the effectiveness of queries translated with and without co-occurrence disambiguation of multi-term inter-lingual definitions. This could be explained as follows. We have an ambiguous interlingua translation that contains erroneous translations as well as synonymous correct translations. When we attempt to disambiguate further, we reduce the expansion effect generated by correct synonymous translations, and expansion has been shown to reduce the effect of ambiguity. In addition, there may be some natural disambiguation that occurs when many related and unrelated terms come together: one would expect more related or correct terms than unrelated or incorrect terms to co-occur in documents, and thus to mediate the effect of poor translations. Results also show that the reduction in ambiguity exhibited by the bilingual translations to English is not lost after an additional level of translation. Transitive translation effectiveness is 67% of monolingual French. This is reasonable when one considers that the bilingual translations of the Spanish queries, having only one level of translation, achieved 79% of monolingual effectiveness. Bilingual translations to English gained further improvements after query expansion.
The following experiments explore whether transitive translations can be further disambiguated via query expansion.

Transitive Translations and Query Expansion

Transitive translations are more ambiguous than bilingual translations. Post-translation expansion has been shown to significantly reduce the effects of bilingual translation ambiguity. It is applied here after transitive translation to ascertain whether it is effective for further reduction of transitive translation ambiguity. The French documents are less than half as long as either the English or Spanish documents employed in expansion of bilingual query translations (149, 549, and 458 terms, respectively). Passages of size 50, 100, and 200 were generated for the French expansion experiments to find out whether average document length influences effectiveness. Table 8.7 gives recall-precision figures for monolingual retrieval, transitive translation with co-occurrence translation of multi-term definitions, and transitive translation with co-occurrence translation of multi-term definitions plus the best post-transitive-translation expansion runs using passages of size 200, 100, and 50, respectively. Column one shows monolingual effectiveness. Column two shows transitive translation with co-occurrence disambiguation of multi-word inter-lingual definitions and no expansion. Columns three through five show transitive translation with co-occurrence disambiguation of multi-word inter-lingual definitions and the best expansion runs. The passage size, number of passages analyzed, and number of expansion terms chosen are indicated by Py:a-b, where y is the passage size, a is the number of passages, and b is the number of expansion terms. Results in this case show that passage size has no influence and that expansion has no effect on the effectiveness of the transitive translations. These results are not significantly different from expansion of transitive translations without co-occurrence translation of multi-word definitions, shown in Table 8.8; column one of that table shows monolingual retrieval effectiveness, column two shows transitive translation without co-occurrence disambiguation of multi-word inter-lingual definitions and no expansion, and columns three and four show the best expansion runs with a passage size of 200. Query-by-query analysis reveals why this is so. The expansion terms are unrelated to the query content. Query one is about the controversy over Waldheim's WWII activities, but its expansion terms are about agriculture, wine growing, and other unrelated concepts. In fact, only query 13 appears to have many related expansion terms. Query 13 asks about the attitudes of the Arabic countries towards the peace process in the Middle East; its expansion terms include caire, syrie, and arafat. This is unusual judging by previous experience as well as by work published by others using LCA expansion. LCA typically selects better expansion terms than do other expansion techniques.
Table 8.7 Average precision and number of relevant documents retrieved for monolingual retrieval, transitive translation, and transitive translation with post-translation expansion using passages of size 200, 100, and 50.

| Query | Mono | No Exp | P200:10-10 | P100:30-5 | P50:10-5 |
|---|---|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 | 1098 | 1098 |
| Relevant Ret: | 730 | 679 | 676 | 675 | 672 |
| Avg Prec: | 0.2767 | 0.1784 | 0.1778 | 0.1776 | 0.1791 |
| % Change: | | -35.5 | -35.7 | -35.8 | -35.3 |
| Precision at 5 docs: | 0.5700 | 0.3000 | 0.3000 | 0.2900 | 0.3000 |
| 10 docs: | 0.5050 | 0.2700 | 0.2800 | 0.2750 | 0.2700 |
| 20 docs: | 0.4025 | 0.2350 | 0.2275 | 0.2375 | 0.2375 |
| 30 docs: | 0.3433 | 0.2217 | 0.2217 | 0.2133 | 0.2117 |
| 100 docs: | 0.1850 | 0.1465 | 0.1440 | 0.1440 | 0.1445 |
Table 8.8 Average precision and number of relevant documents retrieved for monolingual retrieval and transitive translation without and with post-translation expansion.

| Query | Mono | No Exp | P200:10-10 | P200:30-10 |
|---|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 | 1098 |
| Relevant Ret: | 730 | 669 | 663 | 660 |
| Avg Prec: | 0.2767 | 0.1845 | 0.1801 | 0.1817 |
| % Change: | | -33.3 | -34.9 | -34.3 |
| Precision at 5 docs: | 0.5700 | 0.3000 | 0.3000 | 0.2900 |
| 10 docs: | 0.5050 | 0.2650 | 0.2650 | 0.2650 |
| 20 docs: | 0.4025 | 0.2375 | 0.2425 | 0.2400 |
| 30 docs: | 0.3433 | 0.2183 | 0.2217 | 0.2250 |
| 100 docs: | 0.1850 | 0.1490 | 0.1500 | 0.1490 |
The following section establishes baselines for the bilingual translations of the Spanish queries to French and brings some understanding of the lack of effectiveness of expansion.
5.5 BILINGUAL EXPANSION BASELINES
Bilingual translation of Spanish queries to French without the application of query expansion techniques yields 76% of monolingual effectiveness. It has been shown that ambiguity can be further reduced by the application of query expansion before and after translation.
Pre-translation expansion improves precision while reducing ambiguity by creating a stronger base for translation. The first results of experiments with the bilingual translations of Spanish queries to French, shown in Table 8.9, are disappointing because pre-translation expansion yields no increase in effectiveness. Recall is improved and there is a moderate improvement in average precision at low recall, but overall average precision does not increase after translation. Increasing the number of passages used and/or the number of expansion terms does not affect these results. The table shows bilingual translation, including translation of phrases with the phrase dictionary and co-occurrence disambiguation, both with and without pre-translation expansion. In other words, prior to replacing any Spanish terms with their French equivalents, each query is expanded with the top Spanish words from passages of size 200. Column one shows bilingual translation without expansion. Columns two and three show bilingual translations with pre-translation expansion from the top twenty passages with the top five and ten terms, respectively.

Table 8.9 Average precision and number of relevant documents retrieved for bilingual word-by-word translation with phrase dictionary and co-occurrence translation of phrases, augmented by pre-translation expansion with the top 5 or 10 terms from the top 20 passages.
| Query | No Exp | P200:20-5 | P200:20-10 |
|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 |
| Relevant Ret: | 596 | 628 | 632 |
| Avg Prec: | 0.2104 | 0.2117 | 0.2093 |
| % Change: | | 0.6 | -0.5 |
| Precision at 5 docs: | 0.3500 | 0.3600 | 0.3600 |
| 10 docs: | 0.3000 | 0.2900 | 0.3250 |
| 20 docs: | 0.2425 | 0.2525 | 0.2550 |
| 30 docs: | 0.2067 | 0.2233 | 0.2333 |
| 100 docs: | 0.1325 | 0.1425 | 0.1435 |
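As a rough picture of this pre-translation step, the sketch below scores fixed-size passages by query-term overlap and adds the most frequent terms of the best passages to the source-language query. This is a simplified stand-in for the LCA passage ranking and term-selection formulas actually used; all names are our own.

```python
from collections import Counter

def split_passages(doc, size=200):
    words = doc.split()
    return [words[i:i + size] for i in range(0, len(words), size)]

def pre_translation_expand(query, docs, size=200, n_pass=20, n_terms=5):
    passages = [p for d in docs for p in split_passages(d, size)]
    # Rank passages by a crude query-overlap score, a stand-in for
    # the LCA passage ranking.
    passages.sort(key=lambda p: sum(p.count(q) for q in query),
                  reverse=True)
    top = passages[:n_pass]
    # Add the most frequent non-query terms of the best passages;
    # LCA's term-scoring formula is more sophisticated than this.
    counts = Counter(w for p in top for w in p if w not in query)
    return query + [t for t, _ in counts.most_common(n_terms)]
```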
Pre-translation expansion did increase average precision at low recall levels, which should result in more relevant documents at the top of the ranking. A subsequent round of pre-translation expansion may therefore lead to better expansion terms and thus improved effectiveness. Results in Table 8.10 show that a secondary pre-translation expansion does yield a moderate improvement. After expanding once with the top 5 terms from the top 20 passages, queries were modified further by the addition of the top 10 terms from the top 20 passages. Earlier work showed that post-translation expansion increases effectiveness by reducing the negative effects of poor translations and increasing recall.
Table 8.10 Average precision and number of relevant documents retrieved for bilingual word-by-word translation with phrase dictionary and co-occurrence translation of phrases, augmented by two rounds of pre-translation expansion: first with the top 5 terms from the top 20 passages and then with the top 10 terms from the top 20 passages.

| Query | No Exp. | 2° P200:20-10 |
|---|---|---|
| Relevant Docs: | 1098 | 1098 |
| Relevant Ret: | 596 | 645 |
| Avg Prec: | 0.2104 | 0.2179 |
| % Change: | | 3.5 |
| Precision at 5 docs: | 0.3500 | 0.3900 |
| 10 docs: | 0.3000 | 0.3350 |
| 20 docs: | 0.2425 | 0.2725 |
| 30 docs: | 0.2067 | 0.2333 |
| 100 docs: | 0.1325 | 0.1450 |
However, post-translation expansion is not effective for improving the effectiveness of the bilingual query translations. Table 8.11 gives recall-precision figures for post-translation expansion with the top 50 or 100 terms from the top 100 passages. Results show that recall goes up and precision increases at low recall, but the differences are not significant.
Table 8.11 Average precision and number of relevant documents retrieved for bilingual word-by-word translation with phrase dictionary and co-occurrence translation of phrases, augmented by post-translation expansion with the top 50 and 100 terms from the top 100 passages.

| Query | No Exp. | P200:100-50 | P200:100-100 |
|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 |
| Relevant Ret: | 596 | 608 | 608 |
| Avg Prec: | 0.2104 | 0.2141 | 0.2148 |
| % Change: | | 1.7 | 2.1 |
| Precision at 5 docs: | 0.3500 | 0.3800 | 0.3800 |
| 10 docs: | 0.3000 | 0.3150 | 0.3100 |
| 20 docs: | 0.2425 | 0.2425 | 0.2450 |
| 30 docs: | 0.2067 | 0.2167 | 0.2167 |
| 100 docs: | 0.1325 | 0.1355 | 0.1330 |
Combining pre- and post-translation expansion has been shown to significantly increase cross-language effectiveness while improving both precision and recall. However, expansion of the bilingual translations of the Spanish queries before and after translation to French has not yielded increases in effectiveness comparable to those for other languages. Table 8.12 gives results for combined expansion of these queries. It compares bilingual translation without expansion (column one) to bilingual translation with two levels of pre-translation expansion, without (column two) and with post-translation expansion (columns three through six). It shows that combined expansion does little to improve average precision overall. The moderate improvement that is achieved is due to increased average precision at low recall, which is important in a cross-language environment. Translation without expansion and with combined expansion yields 76% and 80% of monolingual effectiveness, respectively.

Table 8.12 Average precision and number of relevant documents retrieved for bilingual word-by-word translation with phrase dictionary and co-occurrence translation of phrases, compared to the same queries augmented by two levels of pre-translation expansion (2°) with the top 5 terms from the top 20 passages followed by the top 10 terms from the top 20 passages, and the 2° expansion queries augmented by post-translation expansion with the top 50 and 100 terms from the top 20 and 30 passages.
| Query | No Expan. | 2° Pre-LCA | 2° Pre+Post 20-50 | 2° Pre+Post 20-100 | 2° Pre+Post 30-50 | 2° Pre+Post 30-100 |
|---|---|---|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 |
| Relevant Ret: | 596 | 645 | 653 | 647 | 655 | 650 |
| Avg Prec: | 0.2104 | 0.2179 | 0.2175 | 0.2203 | 0.2210 | 0.2205 |
| % Change: | | 3.5 | 3.4 | 4.7 | 5.0 | 4.8 |
| Precision at 5 docs: | 0.3500 | 0.3900 | 0.4000 | 0.4000 | 0.4000 | 0.3900 |
| 10 docs: | 0.3000 | 0.3350 | 0.3300 | 0.3300 | 0.3400 | 0.3450 |
| 20 docs: | 0.2425 | 0.2725 | 0.2925 | 0.3025 | 0.2925 | 0.2975 |
| 30 docs: | 0.2067 | 0.2333 | 0.2683 | 0.2600 | 0.2633 | 0.2617 |
| 100 docs: | 0.1325 | 0.1450 | 0.1540 | 0.1510 | 0.1520 | 0.1525 |
An analysis of query term statistics before and after expansion may explain, in part, why there is so little improvement when expansion is applied to the bilingual queries. Table 8.13 gives statistics for translation with and without pre-translation LCA expansion. Statistics were collected after stemming the English translations and applying the Xerox morphological processor to the French translations. This was done to get a more accurate picture of the stems or words used for retrieval after query processing. Statistics collected prior to query processing, and for other expansion methods, yield similar results.
Table 8.13 Mean (variance) statistics for cross-language query sets after bilingual translation via word-by-word, phrase dictionary, and co-occurrence phrasal disambiguation, with and without pre-translation expansion: terms per query, number of original query terms recovered after translation, and percentage of unique query terms recovered.

| Source Language | LCA | Target Language | Qry Length | Original Terms Recovered | % Unique Terms Recovered |
|---|---|---|---|---|---|
| Spanish | no | English | 44.1 (157.17) | 5.57 (6.72) | 72 |
| Spanish | yes | English | 55.7 (653.06) | 5.71 (6.97) | 74 |
| Spanish | no | French | 17.0 (49.9) | 7.1 (16.47) | 46 |
| Spanish | yes | French | 28.95 (58.05) | 7.0 (15.62) | 46 |
The average number of original query terms recovered does not increase after expansion, although query length nearly doubles. On average, three original query terms are added by the pre-translation expansion, but they are not new or different terms. In fact, the percentage of unique query terms recovered, both with and without expansion, is only 46%. For the bilingual translations of Spanish queries to English, it is 72% before and 74% after expansion. This suggests that the translations of the Spanish queries to English are better to begin with and that expansion terms help in recovering additional query terms, but this is not the case for the translations to French. In fact, manual evaluation of the French translations, including post-translation expansion terms, suggests that they are at best only peripherally related to the query. The pre-translation expansion terms for the French and English query sets are the same, because pre-translation expansion is done in the source language, which is Spanish in both cases. Each set begins with the same Spanish queries, which are translated to French and English, respectively. The problem is that the quality of the expansion terms is affected by the quality of the query translations. Both the Collins Spanish-English and Spanish-French dictionaries contain translations for all but 16 and 17 query terms, respectively. However, the translations to English contain a much higher percentage of the original query terms. The MRDs are of similar quality and size, each having more than 80,000 head words, so the problem appears to be language related. In early work, Salton, 1972, found that query words across languages did not always have a reciprocal relationship; in other words, although a Spanish word w_s may be the best translation for a French word w_f, the best Spanish translation for w_f may not be w_s. In addition, although listing many possible translations for a word
(as the Spanish-English dictionary does) introduces ambiguity, it may also yield an indirect expansion effect that aids in disambiguation, as well as providing a good base for further improvement via post-translation expansion. The numbers of relevant French and English documents for the cross-language Spanish queries are 1098 and 1247, respectively. This 15% difference should not significantly affect the ability of LCA to identify good expansion terms, yet the French expansion terms are not good. Part of the problem appears to be related to translation quality. However, poor translations cannot be the only explanation for the lack of effectiveness of expanding French queries translated both bilingually and transitively. In fact, Table 8.14 reveals that expanded monolingual queries are also no more effective than unexpanded monolingual queries. Column one shows monolingual French retrieval without expansion. Columns two through five show monolingual French retrieval with LCA expansion from the top 100 passages (passage size 200) with the top 100, 200, 30, and 50 terms, respectively.

Table 8.14 Average precision and number of relevant documents retrieved for monolingual retrieval and post-translation expansion of monolingual French queries. LCA expansion terms (30, 50, 100, and 200) were selected from the top 100 passages.
| Query | Monolingual | LCA100-100 | LCA100-200 | LCA100-30 | LCA100-50 |
|---|---|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 | 1098 | 1098 |
| Relevant Ret: | 730 | 744 | 742 | 753 | 749 |
| Avg Prec: | 0.2767 | 0.2752 | 0.2813 | 0.2841 | 0.2852 |
| % Change: | | -0.6 | 1.7 | 2.7 | 3.1 |
| Precision at 5 docs: | 0.5700 | 0.5400 | 0.5600 | 0.5700 | 0.5800 |
| 10 docs: | 0.5050 | 0.5000 | 0.5100 | 0.4950 | 0.5150 |
| 20 docs: | 0.4025 | 0.4125 | 0.4275 | 0.4250 | 0.4100 |
| 30 docs: | 0.3433 | 0.3550 | 0.3650 | 0.3633 | 0.3550 |
| 100 docs: | 0.1850 | 0.2005 | 0.2000 | 0.1985 | 0.1950 |
The question that remains is whether the expansion problem is related to the nature of French or whether it is system related. It seems unlikely to be the former. Query expansion has been applied successfully in a monolingual environment to a number of languages including Spanish, Chinese, and Japanese (Allan et al., 1995; Allan et al., 1996; Han et al., 1994). Although the French cross-language work by Buckley et al., 1997, also failed to improve after expansion, they did apply expansion to their monolingual French runs; however, there is no report of how much improvement was realized over monolingual retrieval without expansion. In addition, in work by Boughanem and Soulé-Dupuy, 1997, expansion increased monolingual French retrieval effectiveness by 11%, providing more evidence that the nature of French is not an obstacle to the application of expansion.
One difference between our monolingual French runs and those mentioned above is that they employed a simple stemmer. Stemming has been shown to improve retrieval effectiveness, which could impact the effectiveness of expansion. Rather than stem, we apply the Xerox morphological processor, which has been shown to work as well as a traditional stemmer for English (Hull, 1996). Our monolingual French runs were 43% more effective with morphological processing than without. However, we have no stemmed monolingual French runs for comparison, so the question of whether expansion is more effective when stemming is employed remains open.
6 SUMMARY
Previous work has shown that statistical techniques are effective for reducing the ambiguity associated with bilingual query translation. The experiments described herein support these results, with one exception: query expansion has not been shown to be effective in reducing either bilingual or transitive translation ambiguity for translations to French, nor was it effective for improving our monolingual French retrieval. Many questions surround the lack of effectiveness of expansion for French queries. Despite our inability to show ambiguity reduction for French translations via expansion, there are still some positive lessons about the feasibility of transitive translation. First, although transitive translation is much more ambiguous than bilingual translation, that ambiguity can be significantly reduced. Table 8.15 illustrates this. It compares bilingual translation without expansion to transitive translation without expansion at the bilingual stage and to transitive translation after application of all ambiguity reduction techniques at the bilingual stage. In other words, Spanish queries are translated to English using the synonym operator, POS and co-occurrence disambiguation, and combined expansion; the resulting English query is then translated to French. This brings transitive translation effectiveness up 30%, to 67% of monolingual effectiveness. Although the results are not directly comparable, this is still as good as or better than the effectiveness reported for these queries via other cross-language approaches based on more complex resources that mapped the source and target languages directly. Second, transitive translations can be as effective as or more effective than their monolingual or bilingual counterparts. Nine queries each of the transitive and bilingual translations are more effective than monolingual, while ten and eleven queries, respectively, are less effective than monolingual. Eleven of the transitive translations are more effective than their bilingual counterparts. Table 8.16 gives query term statistics for the monolingual, bilingual, and transitive query sets (these statistics were collected after morphological processing). It shows that although the transitive translations are longer and more
Table 8.15 Average precision and number of relevant documents retrieved for bilingual translation and for transitive translation without and with expansion at the bilingual stage.

| Query | Bilingual | Transitive, No Bilingual Expansion | Transitive, With Bilingual Expansion |
|---|---|---|---|
| Relevant Docs: | 1098 | 1098 | 1098 |
| Relevant Ret: | 596 | 389 | 669 |
| Avg Prec: | 0.2104 | 0.1424 | 0.1845 |
| % Change: | | -32.4 | -12.3 |
| Precision at 5 docs: | 0.3500 | 0.2700 | 0.3000 |
| 10 docs: | 0.3000 | 0.2300 | 0.2650 |
| 20 docs: | 0.2425 | 0.1800 | 0.2375 |
| 30 docs: | 0.2067 | 0.1567 | 0.2183 |
| 100 docs: | 0.1325 | 0.0860 | 0.1490 |
ambiguous, they still recover more unique original query terms (54.8%) than do bilingual translations (45%). This suggests that it may be possible to combine evidence from transitive translations via several intermediate languages to further reduce the ambiguity associated with this approach.

Table 8.16 Mean (variance) query term statistics for French monolingual, bilingual, and transitive translations: terms per query, number of original query terms recovered by translation, and percentage of unique query terms recovered by translation.
| Type of Translation | Source Language | Target Language | Qry Length | Original Terms Recovered | % Unique Terms Recovered |
|---|---|---|---|---|---|
| Monolingual | French | French | 13.76 (28.56) | N/A | N/A |
| Bilingual | Spanish | French | 17.0 (49.9) | 7.1 (16.47) | 45% |
| Transitive | Spanish | French | 459.2 (21145.6) | 8.33 (23.4) | 54.8% |
Finally, these initial results suggest that transitive translation is a viable approach to cross-language retrieval. The lack of a significant reduction in transitive translation ambiguity via expansion, while disappointing, is inconclusive: expansion also failed to produce significant improvements in monolingual effectiveness. More work must be done to determine whether this is an issue associated with the best way to perform French retrieval (a multi-lingual issue) or whether there is some characteristic of French that has not been
addressed for effective expansion in our implementation. Additional experiments will also need to be done with other sets of languages to support the viability of the transitive approach.
Acknowledgments

This material is based on work supported in part by the National Science Foundation under cooperative agreement EEC-9209623. It is also supported in part by SPAWARSYSCEN contract N66001-99-1-8912. Any opinions, findings, and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.
References

Allan, J., Ballesteros, L., Callan, J., Croft, W., and Lu, Z. (1995). Recent experiments with INQUERY. In Proceedings of the Fourth Text REtrieval Conference (TREC-4). Gaithersburg, MD: National Institute of Standards and Technology.

Allan, J., Callan, J., Croft, W., Ballesteros, L., Broglio, J., Xu, J., and Shu, H. (1996). INQUERY at TREC-5. In Proceedings of the Fifth Text REtrieval Conference (TREC-5). Gaithersburg, MD: National Institute of Standards and Technology.

Attar, R. and Fraenkel, A. S. (1977). Local feedback in full-text retrieval systems. Journal of the Association for Computing Machinery, 24:397–417.

Ballesteros, L. and Croft, W. B. (1996). Dictionary-based methods for cross-lingual information retrieval. In Proceedings of the 7th International DEXA Conference on Database and Expert Systems Applications, pages 791–801.

Ballesteros, L. and Croft, W. B. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th International SIGIR Conference on Research and Development in Information Retrieval, pages 84–91.

Ballesteros, L. and Croft, W. B. (1998). Resolving ambiguity for cross-language retrieval. In Proceedings of the 21st International SIGIR Conference on Research and Development in Information Retrieval, pages 64–71.

BBN. BBN part-of-speech tagger for Spanish. http://www.gte.com/bbnt/ (July 1999).

Boughanem, M. and Soulé-Dupuy, C. (1997). Mercure at TREC-6. In Proceedings of the Sixth Text REtrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology, pages 321–328.

Broglio, J., Callan, J., and Croft, W. (1994). INQUERY system overview. In Proceedings of the TIPSTER Text Program (Phase I), pages 47–67.

Buckley, C., Mitra, M., Walz, J., and Cardie, C. (1997). Using clustering and superconcepts within SMART: TREC-6. In Proceedings of the Sixth Text REtrieval
Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology, pages 107–121.

Davis, M. and Dunning, T. (1995a). Query translation using evolutionary programming for multi-lingual information retrieval. In Proceedings of the Fourth Annual Conference on Evolutionary Programming.

Davis, M. and Dunning, T. (1995b). A TREC evaluation of query translation methods for multi-lingual text retrieval. In Proceedings of the Fourth Text REtrieval Conference (TREC-4). Gaithersburg, MD: National Institute of Standards and Technology, Special Publication 500-236.

Furnas, G., Deerwester, S., Dumais, S., Landauer, T. K., Harshman, R. A., Streeter, L., and Lochbaum, K. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the 11th International SIGIR Conference on Research and Development in Information Retrieval, pages 465–480.

Han, C., Fujii, H., and Croft, W. (1994). Automatic query expansion of Japanese text retrieval. Technical Report TR 95-11, Computer Science Department, University of Massachusetts.

Hull, D. (1996). Stemming algorithms - a case study for detailed evaluation. Journal of the American Society for Information Science, 47:70–84.

Hull, D. A. and Grefenstette, G. (1996). Querying across languages: A dictionary-based approach to multilingual information retrieval. In Proceedings of the 19th International SIGIR Conference on Research and Development in Information Retrieval, pages 49–57.

Landauer, T. K. and Littman, M. L. (1990). Fully automatic cross-language document retrieval. In Proceedings of the Sixth Conference of the UW Centre for the New Oxford English Dictionary and Text Research, pages 31–38.

Picchi, E. and Peters, C. (1996). Cross language information retrieval: A system for comparable corpus querying. In Grefenstette, G., editor, Cross-Language Information Retrieval, chapter 7, pages 81–92. Kluwer Academic Publishers.

Pirkola, A. (1998). The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In Proceedings of the 21st International SIGIR Conference on Research and Development in Information Retrieval, pages 55–63.

Ponte, J. (1998). A Language Modeling Approach to Information Retrieval. PhD thesis, Computer Science Department, University of Massachusetts.

Rehder, B., Littman, M. L., Dumais, S., and Landauer, T. K. (1997). Automatic 3-language cross-language information retrieval with latent semantic indexing. In Proceedings of the Sixth Text REtrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology, pages 233–239.

Salton, G. (1972). Experiments in multi-lingual information retrieval. Technical Report TR 72-154, Computer Science Department, Cornell University.
Salton, G. and Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41:288–297.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing.

Sheridan, P. and Ballerini, J. P. (1996). Experiments in multilingual information retrieval using the SPIDER system. In Proceedings of the 19th International SIGIR Conference on Research and Development in Information Retrieval, pages 58–65.

Sheridan, P., Braschler, M., and Schauble, P. (1997). Cross-language information retrieval in a multilingual legal domain. In Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, pages 253–268.

Turtle, H. R. and Croft, W. B. (1991a). Efficient probabilistic inference for text retrieval. In RIAO 3 Conference Proceedings, pages 644–661.

Turtle, H. R. and Croft, W. B. (1991b). Inference networks for document retrieval. In Proceedings of the 13th International SIGIR Conference on Research and Development in Information Retrieval, pages 1–24.

UN. Linguistic Data Consortium resource: U.N. parallel text. http://www.ldc.upenn.edu/Catalog/LDC94T4A.html (June 1999).

Voorhees, E. and Harman, D., editors (1997). Proceedings of the Sixth Text REtrieval Conference (TREC-6). National Institute of Standards and Technology.

Xerox. Xerox finite-state morphological analyzers. http://www.xrce.xerox.com:80/research/mltt/Tools/morph.html (Dec. 1998).

Xu, J. and Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th International SIGIR Conference on Research and Development in Information Retrieval, pages 4–11.
Chapter 9

BUILDING, TESTING, AND APPLYING CONCEPT HIERARCHIES

Mark Sanderson
Department of Information Studies
University of Sheffield, Western Bank
Sheffield, S10 2TN, UK
+44 114 22 22648
m.sanderson@sheffield.ac.uk
Dawn Lawrie
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
+1 413 547 0728
lawrie@cs.umass.edu
Abstract

A means of automatically deriving a hierarchical organization of concepts from a set of documents, without use of training data or standard clustering techniques, is presented. Using a process that extracts salient words and phrases from the documents, these terms are organized hierarchically using a type of co-occurrence known as subsumption. The resulting structure is displayed as a series of hierarchical menus. When generated from a set of retrieved documents, a user browsing the menus gains an overview of their content in a manner distinct from existing techniques. The methods used to build the structure are simple and appear to be effective. The formation and presentation of the hierarchy is described along with a study of some of its properties, including a preliminary experiment which indicates that users may find the hierarchy a more efficient means of locating relevant documents than the classic method of scanning a ranked document list.

1 INTRODUCTION
Manually constructed subject hierarchies, such as the Dewey Decimal system, the U.S. Patent and Trademark Office categories, or the Yahoo directory
of web sites[1], are successful pre-coordinated ways of organizing documents. By clustering together documents covering similar topics, the hierarchies allow users to locate documents on specific subjects and to gain an idea of the overall topic structure of the underlying collection. Devising a means of automatically deriving a subject classification from a collection of documents and assigning those documents to the classification is undoubtedly one goal of information retrieval (IR) research[2]. The classic automated method of achieving this aim is based on polythetic clustering (Sparck Jones, 1970), where a set of document clusters is derived from a collection, each cluster being defined by a set of words and phrases, referred to here as terms. A document's membership of a cluster is based on its possession of a sufficient fraction of the terms that define the cluster. Hierarchies of clusters can be constructed by re-clustering each initial cluster to produce a second level of more specific clusters and repeating this process recursively until only individual documents remain. This technique has been used to organize document collections (Cutting et al., 1992), sets of retrieved documents (Hearst and Pedersen, 1996), and groupings of web sites (Chen et al., 1998). Clustering has also been used to arrange query expansion terms (Veling and van der Weerd, 1999). Although clustering is successful at grouping documents containing common terms, automatically labeling a cluster is still an active and important research issue. Two common techniques used to label polythetic clusters are showing a list of the cluster's most representative terms and displaying a number of key passages extracted from its most representative documents. Neither method is ideal. To illustrate, the following is a term-based cluster label taken from Hearst and Pedersen, 1996: battery California technology mile state recharge impact official cost hour government. Although one can deduce the topic of the cluster, it is not as concise or as clear a description as the manually generated version given by the authors of the paper: "alternative energy cars". As well as being verbose, the labels can be overly specific. For example, Cutting et al., 1992, show in their paper sample clusters (produced by their system) labeled with both passages and term lists.

[1] www.yahoo.com
[2] It is of course possible to automatically train a classifier on an existing manually created hierarchy and use it to assign documents to the classification (Larkey, 1999; McCallum et al., 1998). However, the hierarchy may be deficient in the range of topics it covers relative to the documents being classified. Therefore, there will be times when it is desirable to automatically derive a classification directly from a collection.
Three of the illustrated clusters were labeled as follows: one was about the Gulf War (mentions of the U.S., Iraq, Kuwait, Saudi Arabia), one was about oil sales and stock markets, and the other was about East and West Germany. They combined the documents in these clusters and re-clustered them to reveal that documents about Pakistan, Trinidad, South Africa, and Liberia were in the three original clusters as well. Based on their labels, it is not immediately clear which of the three clusters these documents would have resided in. Essentially, the labels of a polythetic cluster reveal the cluster's central, focussed theme. As illustrated, it is quite possible for a cluster to hold documents on topics different from that theme. It follows that if the labels are hard to comprehend or in some way misleading, a user's understanding of the formation and content of a cluster will be impaired. This suggests that an alternative means of grouping documents should be sought. Polythetic clustering is not the only form of clustering, as Sparck Jones, 1970, points out; there are also monothetic clusters. Like polythetic clusters, these are defined by a set of terms, but a document's membership of such a cluster is based on its possession of all those terms, not just some fraction as occurs with polythetic clustering. This alternative form of clustering has not proved popular in IR, as monothetic clusters composed of many terms are likely to contain only a few documents. However, monothetic clusters composed of a single word or phrase may produce useful groupings. Clearly, such groupings are different from the polythetic clusters illustrated above; however, this form of cluster does address the two issues of labeling and focus[3]. Labeling is simple: the label is the defining term of the cluster. The focus of the cluster content should be clear, as documents are only members if they contain the cluster's defining term. Therefore, all members of the cluster will, at the very least, mention the topic specified by the term. This could still be confusing if the term is ambiguous; however, this issue will be dealt with later. Given the transparent nature of their composition, it is expected that users will find these clusters easier to understand. Indeed, most users should be familiar with them already, as a single-term monothetic cluster is akin to the set of documents retrieved if that single term were a query. Given the propensity of users to generate short queries (Jansen et al., 1998), this form of document grouping is a common experience for many users.

[3] The distinction between monothetic and polythetic clusters reflects the distinction between the classic view of human categorization and the more recent prototype theories as described in the opening chapter of Lakoff, 1987. Like classic categories, the members of a monothetic cluster are considered equally good members of the cluster because they all share the same attributes. As with prototype theory, some members of a polythetic cluster are regarded as better representatives of the cluster than others due to the different range of attributes members can have. Lakoff argues that prototype theory better models the way humans categorize than the classical approach. One might view this as an argument in favor of polythetic clusters; however, the issue presented here is the understandability of clusters, which is a separate notion from the modeling of human categorization.
A hierarchical organization of single-term monothetic clusters will, in form at least, be similar to existing manually created subject hierarchies, which are a familiar means of organization for most users. Given these anticipated advantages of using single-term monothetic clusters, henceforth called concepts, the task of automatically building a hierarchical organization of these concepts was undertaken. It is this work that is described here. It starts with a review of possible approaches to building a hierarchy, initially examining the utility of a thesaurus and then concentrating on term clustering methods. The means chosen to build the concept hierarchy is then presented, followed by a set of examples illustrating the structure and the technique used to display it. Next, a preliminary user experiment designed to test the properties of the structure is outlined and its results are described. Another method of evaluation is included, which measures the ability to find relevant documents within a hierarchy. Finally, conclusions are drawn and future work is detailed.
2 BUILDING A CONCEPT HIERARCHY
In the introduction, it was established that the goal of this work was to automatically produce, from a collection of documents, a concept hierarchy similar to manually created hierarchies such as the Yahoo categories. This was broken down into five basic principles:

- terms for the hierarchy had to best reflect the topics covered within the documents;
- their organization was such that a parent term referred to a related but more general concept than its children; in other words, the parent's concept subsumed the child's;
- the notion of a parent being more general than its children held transitively for all descendants of the parent;
- a child could have more than one parent; therefore, the structure was a directed acyclic graph (DAG), although it is referred to as a hierarchy here;
- and finally, ambiguous terms were expected to have separate entries in the hierarchy, one for each sense appearing in the documents.

It might be expected that the relatedness between a parent and child would also hold transitively for all the descendants of the parent; however, as pointed out by Woods, 1997, some types of relationships between a general concept and its related, more specific descendants are intransitive. Using an example from Woods, a “ship's captain” is a “profession” and “Captain Ahab” is a “ship's captain”, but the relationship between “Captain Ahab” and the concept
“profession” is less clear. In practice, many parts of a created concept hierarchy may show transitivity in relatedness. With these principles in mind, the building of a hierarchy was addressed, starting with the determination of which sets of documents the hierarchies were to be built from, followed by finding a means of relating terms to each other.
2.1 BUILD IT FROM WHAT?
The final design principle outlined above forced certain choices to be made about the nature of the documents being processed. As the terms of the hierarchy were to be extracted from documents, it was necessary to know the senses in which they were being used. Though a great deal of work has been expended on automatic word sense disambiguation (Yarowsky, 1995; Ng and Lee, 1996), the low accuracy and general lack of availability of such systems effectively precluded the possibility of disambiguating all the words of an arbitrary collection of documents. However, ambiguity could be ignored by choosing to derive concept hierarchies only from sets of documents where ambiguous terms were used in a single sense. For the purposes of this preliminary work, this was achieved by using the top-ranked documents retrieved in response to a query. Because they all have a similarity to the query, the documents would have a commonality between them, meaning that many of the terms within them would be used in the same sense. (A more general solution that avoids the need for queries and retrieved documents is described in Section 5.3.) Working with retrieved documents also meant that the set of documents to be processed was relatively small. This had practical benefits, as speed and complexity issues would not be a significant problem when developing the software to build the hierarchies. The building of summaries and overviews of a retrieved set of documents is an active area of research (Tombros and Sanderson, 1998), and the creation of a concept hierarchy promised to be a novel approach in this area. With the issue of which documents to process resolved, the building of the hierarchy could now be tackled.
2.2 RELATING TERMS
From the outset, it was anticipated that a successful concept hierarchy building process would consist of a collection of techniques, which may vary in complexity, coverage, and accuracy. As a starting point, however, it was decided that a relatively simple approach was required that would act as a base on top of which other more sophisticated techniques could be added later. The planned concept hierarchy was in some ways like the WordNet thesaurus (Miller, 1995): a largely hierarchical organization of terms, organized through
a set of relations (synonym, antonym, hyponym-hypernym (is-a-type-of), and meronym-holonym (has-part/is-part-of)). Therefore, the thesaurus was investigated as a means of relating terms. The WordNet-based term similarity measure of Resnik, 1995, was used to estimate the relatedness of terms. A small informal experiment was conducted to examine the effectiveness of this method, working with terms extracted from fifty sets of retrieved documents and using version 1.6 of WordNet. The main problem encountered was the small number of term pairs actually found to be related in WordNet. Many pairs that appeared to have a strong semantic relationship were unrelated in the thesaurus. For example, the terms “volcanic eruption” and “earthquake”, both forms of natural disaster, have no connection in WordNet, the former being regarded as an event and the latter as a phenomenon. The finding of this small investigation was that the term relationships in WordNet were rarely of any use for the concept hierarchy planned here. What was required was a means of finding broader term relationships that were customized to a particular domain. An obvious area to be examined was term clustering. Methods for relating terms into graph structures based on document co-occurrence (or co-variance) have been used for many years (Doyle, 1961). The application for most of this work is in query expansion, either automatic (Qiu and Frei, 1993) or manual (Thompson and Croft, 1989, Fowler et al., 1992, Bourdoncle, 1997). Term similarity is calculated using some form of statistical measure, such as the Expected Mutual Information Measure (EMIM) described by van Rijsbergen, 1979. To the best of our knowledge, most work in term clustering used relations that were symmetric. Our interest was in producing a concept structure with an ordering from general terms to more specific ones. Forsyth and Rada, 1986, performed such an ordering using the cohesion statistic to measure the degree of association between terms. The number of documents a term occurred in, referred to as its document frequency (DF), determined its generality or specificity: the more documents a term occurred in, the more general it was assumed to be (the validity of this simple approach to generality and specificity is discussed at the end of this section). The authors reported building a small multilevel graph-like structure of terms. Although no testing of its properties was reported, it appeared to be promising. Therefore, it was decided to start with a version of Forsyth’s approach, leaving open the possibility of adopting more sophisticated methods later. Method used. Although it was used to create a hierarchy of terms, Forsyth’s term association method was not originally designed to identify the types of association found in concept hierarchies, where, as was stated at the start of this section, a parent node subsumes the topics of its children. Therefore, it was decided to drop cohesion in favor of a test based on the notion of subsumption.
It is defined as follows: for two terms x and y, x is said to subsume y if the following two conditions hold: P(x|y) = 1 and P(y|x) < 1. In other words, x subsumes y if the documents in which y occurs are a subset of the documents in which x occurs. Because x subsumes y and is the more frequent of the two terms, x is the parent of y in the hierarchy. Although a good number of term pairs were found that adhered to the two subsumption conditions, it was noticed that many were just failing to be included because a few occurrences of the subsumed term, y, did not co-occur with x. Subsequently, the first condition was relaxed and subsumption was redefined as P(x|y) ≥ 0.8, P(y|x) < P(x|y). The value of 0.8 was chosen through informal analysis of subsumption term pairs. The change to the second condition ensures that the term occurring more frequently is the one that subsumes the less frequent. In the rare case of two terms co-occurring with each other exactly, P(y|x) = P(x|y) = 1, the two terms are merged into one monothetic cluster. Subsumption satisfied four of the design principles outlined at the start of this section: as a form of co-occurrence, subsumption provided a means of associating related terms; it did not prevent children from having more than one parent; the DF of terms provided an ordering from general to more specific; and the ordering from general to specific would hold transitively. As will be seen later on, the subsumption process was adapted further in the light of experiences in implementing the system.
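To make the relaxed test concrete, the following sketch computes subsumption pairs from term postings. It is a minimal illustration, not the original implementation: the postings mapping and function name are assumptions made for the example.

    from itertools import combinations

    def subsumption_pairs(postings, threshold=0.8):
        """Return (parent, child) pairs under the relaxed subsumption test.

        postings maps a term to the set of document ids it occurs in.
        x subsumes y when P(x|y) >= threshold and P(y|x) < P(x|y); the
        exact-co-occurrence case P(x|y) = P(y|x) = 1 (a monothetic
        cluster merge) is left to the caller.
        """
        pairs = []
        for x, y in combinations(postings, 2):
            dx, dy = postings[x], postings[y]
            co = len(dx & dy)
            if co == 0:
                continue
            p_x_given_y = co / len(dy)  # fraction of y's documents containing x
            p_y_given_x = co / len(dx)  # fraction of x's documents containing y
            if p_x_given_y >= threshold and p_y_given_x < p_x_given_y:
                pairs.append((x, y))  # x is the parent of y
            elif p_y_given_x >= threshold and p_x_given_y < p_y_given_x:
                pairs.append((y, x))  # y is the parent of x
        return pairs

For example, subsumption_pairs({"disaster": {1, 2, 3, 4, 5}, "earthquake": {2, 3}}) returns [("disaster", "earthquake")], since every document containing "earthquake" also contains "disaster".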
Before moving on to term selection, the validity of using DF for determining the generality or specificity of terms is now addressed. Is DF sufficient. One may wonder how well DF models generality and specificity. There is evidence to indicate that it is sufficient. The DF of a query term is successfully used in IR through the application of Inverse Document Frequency (IDF) weighting. Query terms with a low DF are regarded as being more important than those with a high DF when computing a document ranking. There are a number of interpretations of what IDF is modeling, but in the original paper on this weighting scheme, Sparck Jones, 1972, asserts that IDF models the specificity of a query term. More recently, Caraballo and Charniak, 1999, presented results of an experiment that, amongst other things, examined the specificity of nouns based on their frequency of occurrence in a corpus. Caraballo and Charniak split the nouns they were examining into two groups, dividing them on whether or not they were more general than basic level categories4. They found that DF worked well at determining the specificity and generality of nouns at or below the basic level. But for those above, DF was a much less effective indicator. From these two works, it was expected that DF would provide a reasonable ordering of terms from general to more specific, although for terms one might wish to appear at the top of a concept hierarchy, it may prove less successful. The final issue to be tackled before building the hierarchies was how to select terms from the set of documents from which the hierarchy was to be built.
4 Lakoff provides a detailed description of basic level categories in the second chapter of his book (Lakoff, 1987). Only a very short and incomplete explanation is provided here. Within a hierarchical categorization of things, basic level categories are to be found in the middle levels of the categorization. These are the categories most likely to be encountered and mastered when first learning a particular categorization scheme. For example, when categorizing animals, for most people, the basic level categories are the names of animals such as “dog”, “cat”, “cow”, “snake”, etc. Terms below the basic level are specializations, such as “German Shepherd”, “Siamese”, “Aberdeen Angus”, and “Cobra”. Those above the basic level are more general, possibly esoteric, groupings: clustering “dog”, “cat”, “cow” under the term “mammals”, “snake” under “reptiles”, and “mammals” and “reptiles” under “animate beings”, for example.
2.3 TERM SELECTION
Given that the concept hierarchies were to be derived from a set of documents retrieved in response to a query, there were two clear sources of terms: the documents and the query. The query was expected to be a good source of terms, as it was to be processed and expanded using a proven automatic expansion technique called Local Context Analysis (LCA), which works in the following manner (Xu and Croft, 1996). An initial set of documents is retrieved in response to a query in its original form. The best passages of the top ranked documents are examined to find words and phrases that commonly co-occur with each other across many of the passages. The best of these terms are then added to the query and another retrieval takes place. Xu and Croft, 1996, presented experimental results showing retrieval based on the expanded query producing a higher level of effectiveness than that measured from the first retrieval. From these results, it was anticipated that the expansion phrases were well chosen and would be representative of the topics covered in the retrieved documents. Therefore, all words and phrases generated by LCA were used when constructing the hierarchies. For other words and phrases extracted from the retrieved documents themselves, term selection was a two-stage process: first, identification of the words
and phrases to be extracted, and second, determining which of the extracted terms should be selected for inclusion in the concept hierarchy. Identifying words and phrases. Given that the documents being processed resulted from retrieval, it was decided to extract terms from the best passages of the documents. It was hoped that this would produce terms that reflected the content of the documents with a bias towards the information need expressed in the query. Identification of words from the best passages was a simple process of extracting alphanumeric character sequences delineated by common word separators such as spaces, punctuation marks, etc. The extracted words were then stemmed using Krovetz’s KSTEM system (Krovetz, 1993). Phrases were extracted using an in-house phrase identification process created within the CIIR group at the University of Massachusetts. The process works best when extracting phrases from a number of documents at the same time. It operates as follows. Text is first segmented using a number of phrase separators such as: stop words, irregular verbs, numbers, dates, punctuation, title words (e.g. Mr., Dr., Mrs.), company designators (e.g. Ltd., Co., Corp.), auxiliary verbs or phrases, and format changes (e.g. table fields, font changes). Then the candidate phrases extracted from the text are stored in a lookup table along with their frequency of occurrence in the documents being processed. Next, the words of the candidate phrases are tagged with all their possible Part Of Speech (POS) tags using grammatical information taken from WordNet. Using a set of syntactic rules, the candidate phrases are checked to see if they are syntactically correct. Those that are not are removed from the lookup table. Finally, the frequency of occurrence of the remaining phrases is checked. Those occurring more often than a specified threshold are returned as valid phrases. The remaining phrases are searched to find any that have a sub-string (of significant length) in common. For any found, the longer phrase is removed and its frequency of occurrence added to the shorter phrase’s occurrence value. If this phrase now occurs more often than the threshold, it is returned by the system as a valid phrase. As a final form of normalization, all valid phrases returned are, like individual words, stemmed using Krovetz’s KSTEM stemmer. With all words and phrases extracted from the best passages of the documents, the process of selecting a subset for the concept hierarchy now took place. Selecting “good” terms. Term selection used the classic approach of comparing a term’s frequency of occurrence in the set of retrieved documents with its occurrence in the collection as a whole. Terms that are ‘unusually frequent’ in the retrieved set compared to their use in the collection are selected. The formula used to calculate this value was simply xr/xc, where xr is the frequency
of occurrence of x in the retrieved set and xc is its occurrence in the collection. The extracted words and phrases were each assigned their frequency comparison value and were ranked by this score. The top N terms were selected for inclusion in the concept hierarchy. With the terms selected, the process to create a concept hierarchy could now take place.
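A sketch of this selection step is given below; the two frequency tables are assumed to have been counted already, and the cutoff value is illustrative (the chapter leaves N unspecified).

    def select_terms(retrieved_tf, collection_tf, n=100):
        """Rank terms by the ratio xr/xc and keep the top n.

        retrieved_tf and collection_tf map a term to its frequency of
        occurrence in the retrieved set and in the whole collection,
        respectively.
        """
        scored = [
            (retrieved_tf[t] / collection_tf[t], t)
            for t in retrieved_tf
            if collection_tf.get(t, 0) > 0
        ]
        scored.sort(reverse=True)
        return [term for score, term in scored[:n]]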
2.4 HOW TO BUILD A HIERARCHY
The process to build a concept hierarchy consisted of a number of phases, which are now described. First, occurrence information on the extracted words and phrases was gathered: for each term, a list of all the documents the term occurred in, along with the location of the term within each document. This information was passed on to the subsumption module. Here, each term’s occurrence data was compared to every other term’s data to find subsumption relationships. This was an O(n²) process. All term pairs found to have a subsumption relationship were passed on to a transitivity module. This final process removed extraneous subsumption relationships. For example, if it found that a subsumed b and a subsumed c, but also that b subsumed c, then the a, c pairing was removed because there was a pathway from a to c via b. The output of this module was the data needed to display a concept hierarchy. It was decided to test this method on the 500 top ranked documents retrieved in response to a selection of queries taken from the TREC test collection (Voorhees and Harman, 1998). Retrieval was performed using the INQUERY search engine. After words and phrases were extracted from the documents (on average 12,000 terms from the 500 documents) and their document position information was recorded, the subsumption process took a relatively short time5 and produced 4,500 subsumption term pairs. The concept hierarchies that were generated were examined by one of the authors and, as a result, an ad hoc modification was made to the subsumption process. It was determined that if x subsumed y and y occurred infrequently, this subsumption relationship was less likely to be of interest. Consequently, terms occurring only once or twice in the document collection were not considered for subsumption. With the hierarchy creation process determined, an example structure is now displayed and contrasted with other document clustering methods.
5 On a 266MHz Pentium II computer with 96Mb of RAM running Linux v5.2, the developmental software used to perform the subsumption process took on average 15 seconds per query.
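The transitivity module’s pruning rule can be sketched directly. This is one reading of the description above, with the hierarchy assumed to be held as a set of (parent, child) pairs; only two-step paths are checked, which is exactly the rule stated in the text.

    def remove_extraneous_links(edges):
        """Drop any (a, c) link already implied by a path a -> b -> c."""
        children = {}
        for parent, child in edges:
            children.setdefault(parent, set()).add(child)
        kept = set()
        for a, c in edges:
            # Keep (a, c) only if no intermediate b gives a -> b -> c.
            if not any(c in children.get(b, ()) for b in children.get(a, ()) if b != c):
                kept.add((a, c))
        return kept

With edges {("a", "b"), ("a", "c"), ("b", "c")}, the pair ("a", "c") is removed, matching the example in the text.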
2.5 CREATING A HIERARCHY AND CONTRASTING IT WITH OTHER METHODS
Figure 9.1 shows a fragment (~ 10%) of the concept hierarchy resulting from the 500 documents retrieved in response to TREC topic 230: “Is the automobile industry making an honest effort to develop and produce an electric-powered automobile?”. As can be seen, much of the concept organization is promising, especially under “pollution”. Other term pairs - “average fuel economy standard” and “electric vehicles” or “safety” and “energy” - seem less sensible. Nevertheless, the hierarchy appears to display the desired property of general terms at the top leading to more specific terms below.
Figure 9.1 Fragment of concept hierarchy from TREC topic 230.
According to Hearst and Pedersen, 1996, topic 230 is reminiscent of the topic used to illustrate their Scatter/Gather system’s creation of polythetic clusters. In their paper, Hearst and Pedersen show documents retrieved in response to the query being assigned to one of five clusters, whose topics are (descriptions taken from the paper): 1. “...safety and accidents, auto maker recalls, and a few very short articles”; 2. “alternative energy cars, including battery [cars]”; 3. “sales, economic indicators, and international trade, particularly issues surrounding imports by the U.S.”; 4. “also related to trade, focuses on exports from other countries”; and 5. a final cluster said to act as a “junk” cluster holding those documents difficult to classify. As can be seen, there is little similarity between the polythetic clusters and the hierarchy displayed in Figure 9.1. This should not be surprising, however,
as polythetic document clustering works quite differently from the monothetic clustering used here. Document clustering is based on finding document-wide similarities to form clusters. In Scatter/Gather, a document is assigned to only one cluster (Sparck Jones, 1970, classifies this as an exclusive clustering); consequently, the cluster acts as a summary for that whole document. In contrast, a document can belong to many clusters in a concept hierarchy (which Sparck Jones classifies as overlapping clusters); consequently, each cluster represents one of potentially many themes running through a document. As has already been stated, the organization of terms used in the concept hierarchies is akin to term clustering techniques. To show this similarity, one such system, Refine from AltaVista6 (Bourdoncle, 1997), is illustrated. Publications about this system are somewhat limited (it appears to be based on a combination of term co-occurrence and term co-variance), but as it is publicly available, it is easy to create a term cluster also reminiscent of topic 230. Figure 9.2 shows the output of Refine after entering the query “auto car vehicle electric” (use of the full TREC topic produced poor output). Each node represents a word grouping, which is expanded via a pop-up menu. Remembering that Refine is working from a different document collection (i.e. the web as opposed to TREC), there is more similarity between its output and the presentation in Figure 9.1 than the output of Scatter/Gather. However, the main difference between Refine and the concept hierarchy is in the organization of terms: the layout of the Refine groups has no apparent significance or ordering.
6 www.altavista.com
Figure 9.2 Clustered term structure from Refine.
3 PRESENTING A CONCEPT HIERARCHY
As seen in Figure 9.2, it is possible to lay out a small graph structure on screen; however, the concept hierarchies being generated were much bigger: the fragment in Figure 9.1 showed only one tenth of a typical hierarchy. Laying it all out on screen was judged to be potentially complex, time-consuming, and maybe even impossible given the size of the structure. Therefore, an alternative means of displaying the structure was examined. An informal assessment of a couple of possible layout schemes was conducted. The first was a hierarchical arrangement of bullet points. The second involved creating a series of web pages, one for each monothetic cluster; for each subsumption relationship between a cluster and other related clusters, a hypertext link was added to the cluster’s web page. Neither presentation worked well, but from this study several priorities were determined: it was preferable for the structure to fit onto a single screen to avoid the use of scrolling or change of context;
users should be familiar with the interface components used to present the structure; users should be able to move around the structure easily and quickly; and when at a particular ‘level’ in the hierarchy, users should be able to easily determine the possible paths that led to that level. The means of presentation found to satisfy almost all of these priorities was a hierarchical menu. Because menus only show the current menu plus the path of menus used to get there, the chances of getting the structure to fit in a single screen were higher. Hierarchical menus have been a standard feature of operating systems for many years. Due to their familiarity, users can generally move around menus with relative ease. Hierarchical menus are used to display a strict hierarchy, where a child has a single parent. A child in the concept hierarchies, however, can have multiple parents, and observing that a child has multiple parents can be important information to a user. Unfortunately, there was no immediately obvious solution to this problem. As menus were judged to be a good
means of presentation, the problem was ignored and any child having multiple parents was duplicated and placed under those parents. No link was displayed between the copies of the child. The hierarchical menu system chosen7 was one capable of being displayed within a web browser. Most menu systems are designed to allow a user to get to a known item in a sub-menu as fast as possible without making a mistake. This is generally achieved using delays related to mouse movement, which temporarily prevent the closing of the currently open sub-menu. Such a provision was not helpful for the task required here, as the user was to be encouraged to browse around the entire structure as fast as possible. The menu system obtained did not have such delays and so was well suited to the browsing task. To illustrate the look of the menu system, the sample structure in Figure 9.1 is shown in its menu form in Figure 9.3.
7 The menu software is described at http://www.dhtmlab.com
Figure 9.3 Menu version of structure displayed in Figure 9.1.
3.1 LIMITATIONS OF THE MENUS
Certain limitations in the workings of the chosen menu system, along with restrictions of screen size, meant that additional constraints had to be imposed on the concept hierarchy formation process. The first was caused by the large size of the hierarchy structures. If a term in the hierarchy had a great many parent terms, in this menu system, the child term was duplicated and appeared under each of its parent terms. If the child was itself a parent to a great many other terms, the size of the menus became very large and the menu display code failed. Consequently, an appearance limit was placed on all terms in the hierarchy: any appearing more than a certain number of times (typically 25) were removed completely from the structure. While this action appears somewhat draconian, it was necessary to enable the menu system to function properly. It is worth noting that a better implementation of a hierarchical menu system would in all likelihood avoid this problem. With so much information being displayed, screen space was inevitably an important issue. A limit on the vertical size of a menu was consequently imposed. On a large display, the limit was set to 30 terms per menu. A menu larger than this limit was simply truncated, losing its extra terms. In order to ensure that less important terms were those that were lost, the terms within a menu were sorted based on their DF, as it was found that terms with a high DF appeared to be more important. This ordering can be seen in the examples illustrated in the next section. A final problem was so-called ‘singleton menus’: those containing only one term, such as the “smog” menu in Figure 9.3. A large number of these were found to exist in the created concept hierarchies. As they use up a lot of horizontal screen space, the menu creation procedure was adapted to merge the term of a singleton menu into its parent term and remove the offending menu. With these final adaptations in place, an example of the menu display is now presented.
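The constraints just described amount to a small post-processing pass over the hierarchy. A sketch follows, assuming the hierarchy is held as a parent-to-children mapping; the data layout, function name, and omission of singleton-menu merging are simplifications for illustration.

    def apply_menu_limits(children_of, df, appearance_limit=25, menu_limit=30):
        """Enforce the display constraints described above.

        children_of maps each parent term to a list of child terms; df
        maps a term to its document frequency.  Terms that would appear
        under more than appearance_limit parents are removed entirely;
        each menu is sorted by descending DF and truncated to menu_limit
        entries (the values quoted in the text).
        """
        appearances = {}
        for kids in children_of.values():
            for child in kids:
                appearances[child] = appearances.get(child, 0) + 1
        limited = {}
        for parent, kids in children_of.items():
            kept = [c for c in kids if appearances[c] <= appearance_limit]
            kept.sort(key=lambda c: df.get(c, 0), reverse=True)
            limited[parent] = kept[:menu_limit]
        return limited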
3.2 EXAMPLE HIERARCHY FRAGMENTS
Figure 9.4 A fragment of concept hierarchy from TREC topic 302.
Figure 9.4, Figure 9.5, and Figure 9.6 show three parts of a concept hierarchy, this time generated from TREC topic 302: “Poliomyelitis and Post-Polio: Is the disease of Poliomyelitis (polio) under control in the world?”. The number next to each term is the DF of that term, which, therefore, is the number of
Figure 9.5 Second fragment of concept hierarchy from TREC topic 302.
Figure 9.6 Third fragment of concept hierarchy from TREC topic 302.
documents assigned to that particular monothetic cluster. It is worth noting that at the top level of the hierarchy (the leftmost menu in the figures) the DFs of all the terms on that level add up to more than the 500 documents the hierarchy
was built from. This is an indication that there are documents appearing in more than one place in the hierarchy. It should also be noted that it is possible for documents to be missing entirely from the hierarchy, due to them not containing any of the terms that were subsumed. As can be seen from the three figures, as well as the structure in Figure 9.1, there is a trend of general terms leading to the more specific. One can see that “Salk” (inventor of a polio vaccine) appears both in the “polio” and the “disease vaccine” sections of the hierarchy, both sensible locations for this term. The structure, while initially satisfying, could be improved: in Figure 9.4, for example, “Fauci”, the surname of an AIDS researcher, might have been better categorized under “AIDS” instead of “virus”. Nevertheless, as a first step towards building a concept hierarchy, the structure appeared to be promising.
4 EVALUATING THE STRUCTURES
Evaluating the concept hierarchies presented a challenge; their intended purpose was to provide users with an overview of the topical structure of the documents retrieved in response to a query. Measuring how well something provides an overview was never going to be captured by some objectively derived value. In a paper on user evaluation of Scatter/Gather, Pirolli et al., 1996, reported using a method aimed at testing how well users understood the topical structure of documents after seeing Scatter/Gather clusters. Unfortunately, the test involved asking users to draw a concept hierarchy, something that would inevitably be influenced by having seen the structures generated here. Before taking on a large-scale user study of the hierarchy’s overview capabilities, it was felt that some of the basic properties of the structure should be examined first. Therefore, an experiment was created that addressed the second and third design principles outlined at the start of Section 2: testing the relatedness of a child to its parent, and examining the type of relationship between the two. The details of the experiment are described in an earlier paper (Sanderson and Croft, 1999). The experiment found that approximately 50% of the subsumption relationships within the concept hierarchies examined were judged to be of interest and that the parent term was judged to be more general than its child. This figure compared favorably to concept hierarchies created with a random formation process. Another use of the hierarchies is as an aid to finding relevant documents. Rather than examining a ranked list of documents from a retrieval system, a hierarchy can be used. By using knowledge of the query topic, a person can follow paths in the menus that lead to relevant documents. There are at least two aspects to the problem of finding relevant documents within the hierarchy. One is how long it takes to traverse the hierarchy once it is known where relevant documents are located. If it is found that traversal (in this situation
of having perfect knowledge about relevant documents) is better at locating relevant documents than scanning down a ranked list, then the other aspect of the problem can be studied. This is how easily humans locate the menu pathways that lead to a relevant document.
4.1 THE TRAVERSAL ALGORITHM
Our algorithm estimates the time it takes to find all relevant documents by calculating the total number of menus that must be traversed and the number of documents that must be read. The algorithm aims to find an optimum route through the hierarchy, travelling to the nodes that hold the greatest concentration of relevant documents. Since we begin with the knowledge of where documents are located, our algorithm iterates through all the relevant ones and assigns a path length to each. Any relevant documents not found in the hierarchy (which is possible) are assigned a path length of negative one. The total path length for a hierarchy is the summation of all positive document path lengths. The algorithm follows.

    for each relevant document d {
        if (d has been seen before) { d.path_length = 0 }
        else {
            find all leaf menus containing d
            if (num_leaves > 0) {
                lm = leaf menu with max # relevant docs
                d.path_length = lm.new_menus + lm.total_new_docs
            } else {
                find all menus containing d
                if (num_menus > 0) {
                    m = menu with min # total docs
                    d.path_length = m.new_menus + m.total_new_docs
                } else { d.path_length = -1 }
            }
        }
    }

Figure 9.7 Document path length algorithm
Given that documents often belong to more than one menu, it is necessary to choose which of these will be used when calculating the path. To do this, we break the menus into two groups. The first group consists of leaf menus. These types of menus are favored because they tend to have a smaller number of documents associated with them. Smaller document groups are also likely to be
more homogeneous. From among these leaf menus, we favor the menu with the most relevant documents because we are computing an optimal path. If there are no leaf menus, then all menus containing the document are considered. In this case, we favor menus that contain a small number of documents, since it is unlikely that a human would read more documents than necessary. The path to a relevant document is composed of the previously unexplored menus that are traversed to reach it and the unread documents associated with the final menu. As the documents belonging to a particular menu item are not sorted in any way, it is assumed that users will have to read all new documents in the menu in order to find the relevant one(s). Although this algorithm leads to a succinct analysis of the concept hierarchy, it is worth noting that it contains certain simplifying assumptions. First, all documents are regarded as equal despite the expected variability in document length. Similarly, all menus are treated equally despite the variability in their length. Finally, when computing the path length, documents and menus are treated the same, i.e. the time and effort to read a document is regarded as being the same as that to read a menu.
4.2 EXPERIMENTS
Our experiment makes use of TREC topics 301-350 and associated relevance judgements. We have retrieved 500 documents using INQUERY for each of the 50 queries. We treat the set of 500 documents for a given query as a document set. Concept hierarchies are generated for each document set. Hierarchies are assigned a path length score using the algorithm described above. A lower score denotes a superior hierarchy. We compare our hierarchies to those formed through a random subsumption process. These hierarchies were formed in the same manner as the concept hierarchies (as described in Section 2.4), except that when all terms were compared to all other terms, random selection was used to form parent-child pairs instead of subsumption. Note that the ordering of terms based on frequency of occurrence was still present in this structure. Once all the menus were scored, they were compared on the basis of the average path to a document. This was used instead of doing a straight comparison of the total path length because it was possible that some relevant documents were unreachable. The total path length for a particular hierarchy could end up being shorter simply by leaving out relevant documents. By using the average path length, we neither rewarded nor penalized a hierarchy for excluding relevant documents. It was found empirically that randomly generated hierarchies were more likely to leave relevant documents out of the hierarchy than the true hierarchies. The true concept hierarchies contained no path to a relevant document 1.9% of the time. The random menus contained no path to a relevant document
19.4% of the time. These percentages are based on the number of relevant documents excluded compared to the total number of relevant documents.
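The averaging rule can be stated in a few lines. In this sketch, -1 marks a relevant document with no path in the hierarchy, as in the algorithm of Section 4.1; the handling of the no-reachable-documents case is our assumption.

    def average_path_length(path_lengths):
        """Average path length over reachable relevant documents only.

        Documents with path length -1 (absent from the hierarchy) are
        excluded, so a hierarchy is neither rewarded nor penalized for
        leaving relevant documents out.
        """
        reachable = [p for p in path_lengths if p >= 0]
        if not reachable:
            return float("nan")  # no relevant document was reachable
        return sum(reachable) / len(reachable)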
4.3 RESULTS
Evaluation Relative to Random Hierarchies. Ten randomly generated hierarchies were created for each query. The average relevant document path length was then averaged among the randomly generated hierarchies before comparing the path lengths, so that an average baseline hierarchy could be compared to the true hierarchy. In 41 of the 50 queries, the true hierarchy had a smaller average document path than the baseline hierarchy. When the true hierarchy had a smaller path, it was on average 5.03 units shorter for each relevant document. In 8 of the 50 document sets, the baseline hierarchy had smaller document paths. However, these paths were only 2.15 units shorter on average. The paths were equal in one case where INQUERY retrieved no relevant documents within the document set. Figure 9.8 compares each randomly generated hierarchy to the true hierarchy. The black part of each column represents the number of times that the true hierarchy had a shorter path length than the random ones. The gray part of the column represents the number of times that a random hierarchy had a shorter path than the true hierarchy. In cases where the column has a height less than ten, there were random hierarchies with exactly the same path length as the true hierarchy. Query 305 had no relevant documents, so all hierarchies are equivalent, which is why there is no column. We performed ANOVA (ANalysis Of VAriance) on the data8. To linearize the data for the ANOVA, we performed a log transform on the average path length; the fit of this model indicates that the average path length differed between the systems by a multiplicative factor. We discovered that the path lengths from the random hierarchies were 33% longer (7.11 vs. 5.35) than the path lengths from the concept hierarchies (p < 0.0002). (Path lengths are geometric averages.)
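For readers wishing to reproduce this style of analysis, a sketch using statsmodels follows. The DataFrame layout and column names are assumptions for illustration; this is not the original analysis code.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def path_length_anova(results: pd.DataFrame) -> pd.DataFrame:
        """Two-way ANOVA on log-transformed average path lengths.

        results is expected to hold one row per hierarchy, with columns
        'query' (topic id), 'system' ('concept' or 'random'), and 'path'
        (average relevant-document path length).  Logging the response
        makes the system effect a multiplicative factor, as in the text.
        """
        model = smf.ols("np.log(path) ~ C(query) + C(system)", data=results).fit()
        return sm.stats.anova_lm(model, typ=2)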
Figure 9.8 Comparison of the true hierarchy to ten random ones.
8 See Table 9.A.1 in the Appendix for the ANOVA table.
Evaluation Relative to Ranked List. Since a ranked list is a widely used method of displaying retrieved documents, we also compare our hierarchies to the ranked order that INQUERY generates for the retrieved document set. In order to deal with the difference in the number of relevant documents, we used the same average document path length as was used in comparing the hierarchies. This means that the rank of the lowest ranked relevant document is treated as the total path length of the ranked list. The path length is divided by the total number of relevant documents, since all relevant documents within the set will be ranked. When the scores from the hierarchy are compared to INQUERY’s ranked list, the hierarchy required the user to read fewer documents in 47 of the 50 topics. On average, 224.2 fewer documents and menus were read. In the two cases where INQUERY required fewer documents to be read, the difference in the number of documents read was on average 12.5. Again, the topic where INQUERY returned no relevant documents had the same scores. We performed ANOVA on the average path length data comparing the random hierarchies, the concept hierarchies, and the INQUERY ranked list9. The log transform was the best model again. The randomized path length was 33% longer (7.11 vs. 5.35) than the concept hierarchy path length (p < 0.001 by Honest Significant Difference (HSD) paired comparison). The INQUERY ranked list generated a path length 156% longer than the concept hierarchy path length (13.67 vs. 5.35), and 92% longer than the randomized path length (13.67 vs. 7.11) (p < 1.0E-12). (Again, path lengths are geometric averages.)
9 Table 9.A.2 in the Appendix.
5 FUTURE WORK
As was stated in Section 2.2, the work described so far is only a starting point in the automatic building of a concept hierarchy. A number of potential improvements to the formation process are now presented, followed by a brief discussion of alternative means of presenting the hierarchies, concluding with ideas for their wider use.
5.1 IMPROVING TERM IDENTIFICATION
Currently, a simple phrase extraction process performs identification of concepts within documents; however, there are a number of other utilities created within the field of information extraction (IE) which may improve identification accuracy. A Named Entity Recognizer (NER) is a basic tool used to perform initial text processing in an IE system (Wakao et al., 1996). It locates and types common text forms such as proper nouns, dates/times, money expressions, postal addresses, etc. For proper noun recognition, name lists for people, places, and companies may be used. It is anticipated that use of such a mark-up
256
ADVANCES IN INFORMATION RETRIEVAL
tool will better inform the term selection process by avoiding text types that are unlikely to be good terms, such as email addresses or phone numbers. In addition, new conceptual groupings will be possible based on the NER types, such as the names of people or companies related to a particular term. One other IE tool that will be examined is co-reference resolution. This tool finds different references to the same concept in text. The range of co-references that such a system can tackle is large, but for the purposes of this project only proper name co-references will be resolved (Wakao et al., 1996). For example, determining if, in a document, the name “Dr. Jonas Salk” and the name “Salk” refer to the same person. Successful use of this tool would group multiple references and thus remove duplicates from the concept hierarchy.
5.2 WIDENING THE RANGE OF CONCEPT RELATIONSHIPS
Although subsumption identifies a large number of valid concept relationships relatively accurately, it is believed that a range of other existing methods can be employed to increase this number and to provide validation of existing relationships. The subsumption-based work used so far was found to be successful in providing a set of concepts organized into a hierarchy leading from the most general concepts to the most specific. No attempt was made to locate synonymous relationships. There is a body of work on using forms of statistical co-occurrence to locate such relationships. One such technique is co-variance. Two concepts are said to co-vary when the contexts in which they occur are similar. Grefenstette, 1994, has had success in locating synonym relationships using co-variance. This technique will be applied to the concept hierarchy formation process to group sets of synonymous concepts. Another source of information on synonymous relationships is a thesaurus. Despite the relatively poor utility provided by WordNet in the small investigation outlined in Section 2.2, it was felt that a sufficient number of concepts were successfully related to warrant a re-examination of WordNet. In the concept hierarchy illustrated above, for example, “polio” and “poliomyelitis” are located in different parts of the hierarchy despite being synonyms of each other; use of WordNet would concatenate these two terms into a single concept. In addition, WordNet may also provide some evidence on the generality and specificity of concepts to further improve the hierarchy formation process, particularly for terms above basic level categories, where, as Caraballo and Charniak, 1999, found, DF is a poor indicator of generality or specificity. In addition to use of an existing thesaurus to locate hyponyms/hypernyms and synonyms, a number of corpus-based techniques have been developed to locate such relationships. Hearst, 1998, found that certain key phrases could be an
indicator of such a subsumption-like relation. Three of the phrases she found were “such as”, e.g. “...popular forms of entertainment such as movies...”; “and other”, e.g. “...Julia Roberts, Robert De Niro and other actors...”; and “especially”, e.g. “...most horror films, especially Psycho and The Exorcist.”. Sentences that contained these phrases were parsed to identify the noun phrases being related. Hearst discovered around ten such phrases that were accurate identifiers of the “type-of” relation. However, manual intervention was required for their discovery, and the scope of the noun phrase pairs identified was limited. Hearst suggested using the key phrases to help thesaurus lexicographers search for new relations. This technique could be applied to the formation of the concept hierarchies, and an investigation of its utility will be conducted. In a similar vein to Hearst’s work, a series of key phrases could also be identified to locate terms that are synonyms within the context of a subsuming term. Working from the examples shown above, “Julia Roberts” and “Robert De Niro” are both actors, and “Psycho” and “The Exorcist” are both horror movies. Observing that these terms are components of a list should be a relatively simple task. Two pieces of work on phrase analysis are also promising avenues of research. Grefenstette, 1997, has described a method of phrase classification where, through the use of simple syntactic analysis, he was able to place noun and verb phrases into one of nine classes. He illustrated his ideas by examining all possible phrases containing the word “research”. For example, depending on whether “research” was the head or the modifier of a noun phrase, Grefenstette was able to differentiate types of research (e.g. market research, recent research, scientific research, etc.) from research things (e.g. research project, research program, research center, etc.). No tested application of this classification scheme was reported. Woods, 1997, also used phrase analysis, in addition to a large knowledge base, to organize terms into a concept hierarchy. By locating the head and modifier of noun and verb phrases, Woods was able to make choices on how to classify phrases. For example, in the phrase “car washing”, Woods’ system would identify “car” as the modifier and “washing” as the head of the phrase. This would inform the system to classify the phrase “car washing” under “washing” and not “car”. The success of the technique relied on a large morphological knowledge base of information to help identify phrase components. Woods used the concept hierarchy to automatically expand non-matching terms of a query.
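As an illustration of the pattern idea, a toy matcher for the three cue phrases quoted above is sketched below. Bare word runs stand in for the noun phrases a real implementation would obtain from a parser or chunker, so the patterns are simplifications, not Hearst's actual extraction rules.

    import re

    NP = r"[A-Za-z][A-Za-z ]*?"  # crude stand-in for a noun phrase
    PATTERNS = [
        ("such as", re.compile(rf"({NP}) such as ({NP})(?=[,.;]|$)")),
        ("and other", re.compile(rf"({NP}) and other ({NP})(?=[,.;]|$)")),
        ("especially", re.compile(rf"({NP}), especially ({NP})(?=[,.;]|$)")),
    ]

    def hearst_pairs(sentence):
        """Return candidate (general, specific) pairs found in a sentence."""
        pairs = []
        for name, pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                if name == "and other":  # "X and other Y": Y is the general term
                    pairs.append((match.group(2), match.group(1)))
                else:
                    pairs.append((match.group(1), match.group(2)))
        return pairs

    # hearst_pairs("most horror films, especially Psycho") yields
    # [("most horror films", "Psycho")].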
5.3 CREATING HIERARCHIES WITHOUT QUERIES
In Section 2.1 it was noted that concept hierarchies rely on words being used in the same sense. It is thought that a homogeneous document set provides an environment where word sense disambiguation is not an issue. Using the top ranked documents of a query is one way to achieve the desired environment. An alternative method is to create polythetic clusters of the document set. Concept hierarchies can then be created in cases where there is no query. In fact, the hierarchy becomes a description of the polythetic cluster. The hierarchy does not suffer from the traditional problems of labeling polythetic clusters, which may leave out the topics of sub-clusters because only the most frequent words are used in the description. A concept hierarchy seeks to create a complete description of the document set, and thus creates a complete description of the cluster. Lawrie and Croft, 1999, studied the effectiveness of using clustering as a preprocessing of the document set before creating the hierarchy. It was found that this can expose more relations in a document set than using a single hierarchy for the entire set. However, some relations may be left out because a group of documents that formed a subpart of the initial single hierarchy are clustered into different groups and no longer have a sufficient number of occurrences independently for inclusion in a hierarchy. In the task of finding relevant documents, as described in Section 4.1, creating hierarchies of clusters also provides a faster means of locating relevant documents.
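A sketch of the cluster-first variant is given below, using k-means over tf-idf vectors. build_hierarchy stands for the subsumption-based construction of Section 2 and is passed in rather than defined here; the parameter values are illustrative.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def hierarchies_without_queries(documents, build_hierarchy, k=5):
        """Cluster the document set, then describe each polythetic cluster
        with its own concept hierarchy (after Lawrie and Croft, 1999).

        documents is a list of raw texts; build_hierarchy takes a list of
        documents and returns a hierarchy for them.
        """
        vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        return [
            build_hierarchy([d for d, label in zip(documents, labels) if label == c])
            for c in range(k)
        ]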
5.4 VISUALIZATION OF HIERARCHIES
Currently, presentation of the concept structure is achieved using hierarchical menus. Although simple to manipulate and interpret, this form of visualization loses some of the information held within the structures, the reason (as described in Section 3) being that the hierarchical menus visualize a strict hierarchy, with one parent to each child. The actual data, however, has children possessing multiple parents, which can be important. For example, in a hierarchy built from documents on international conflicts, the child term “war” had two parents, “India” and “Kashmir”. Seeing this shared link helped users’ understanding of the concept organization. Currently, the hierarchical menu system handles this situation by placing a copy of such a child under each of its parents, the hope being that a user will notice the child term under each parent and mentally make the link between them. Alternative visualizations will also be explored. Much work has been conducted on tools to visualize directed acyclic graphs (DAGs), and some are freely available, such as the daVinci system (Fröhlich and Werner, 1994). One of these tools will be selected and applied. It remains to be seen how well these tools will display as large a structure as that currently being generated (each
hierarchy holds several hundred concepts). If the use of these tools proves unsuccessful, an alternative will be to work within the existing menu framework and produce a system whereby any child can be expanded in some alternate manner to show a list of its parents.
5.5 ALTERNATIVE APPLICATIONS FOR THE HIERARCHIES
The work presented here attempts to provide an automatically constructed meaningful categorization of documents that is similar to manually constructed categories. The intended use of this structure is, like that of its manual equivalents, to allow users to locate documents of interest within the hierarchy and to provide users with an overview of the topic structure of the document collection being categorized. The manual topic structures can have additional uses as well: for query expansion and for organizing documents written in foreign languages (Pollitt et al., 1993). Both these alternative uses are now discussed. Query expansion. It is a well known feature of searching that the vast majority of queries submitted to widely available IR systems are very short, typically one or two words in length (Rose and Stevens, 1996). Query expansion, whether through automatic or semi-automatic means (Xu and Croft, 1996, Harman, 1992) or via manual intervention (Magennis and van Rijsbergen, 1997), has been shown to increase the number of relevant documents retrieved. What has not been successful is persuading users to employ these techniques. When presenting automatically extracted expansion terms to users, most systems present these terms in a simple list. The concept hierarchies could be used to sensibly organize these words and phrases to make the range of possible expansion terms easier for users to process. Some related work has already been conducted in this area which indicates this may be a promising line of enquiry. Anick and Tipirneni, 1999, presented a technique that attempted to select terms that reflected the main topical threads running through a collection of documents. To do this, the method looked for terms that had a high “lexical dispersion”: terms that occurred with many other different terms. Anick showed that the terms with the highest lexical dispersion in a collection of Financial Times documents were “market”, “group”, and “company”. He used lexical dispersion to select words and phrases from a set of retrieved documents and present these terms to users as candidates for query expansion. Not only were the terms shown, but all the phrases that those terms were part of were shown as well, through a series of menus. The authors presented an analysis of access logs to a retrieval system using the expansion method. It is unclear in the paper how
often the expansion terms were used, but when they were, expansion appeared to be of use. Taking a less statistical and more NLP-based approach, Bruza and Dennis, 1997, presented their hyperindex system. Working on top of a web search engine, their system parsed the titles of retrieved documents and looked for the query phrase in conjunction with other words linked by certain connectors (“in”, “of”, “with”, “as”, etc.). The new phrases were presented to the user in a structured fashion, showing phrases that were either restrictions or expansions of the existing query. All new query phrases were derived through this simple parsing technique. Titles of documents were used because they were mostly expressed in passive form, which was easier to work with when finding new phrases. The paper claimed that the titles parsed fairly well. No user testing of the system was reported in the paper. Both papers have presented means of structuring query expansion terms, though neither has presented a large user study to examine the utility of their respective techniques. Therefore, although expansion clearly can be presented in this structured form, its utility remains to be determined. If concept hierarchies are to be investigated as a means of query expansion, such a study will have to take place. Use in a cross-language environment. A considerable amount of research has been conducted on the cross-language retrieval problem: retrieval where a query written in what is referred to here as a source language retrieves documents written in what is referred to here as a target language. The best results approach the effectiveness of a monolingual system (Ballesteros and Croft, 1998). The most likely outcome of a user session with a cross-language retrieval system is the need to translate some of the retrieved documents back into the source language. Such a process is usually costly and time consuming. Consequently, it is in the interest of the user of such a system to locate, as accurately as possible, the best set of relevant documents. In a monolingual retrieval system, users refining their query through several retrieval iterations would normally achieve this. In order for users of a cross-language system to conduct a similar refining process, it is necessary for them to be able to assess, at some level, the relevance of the retrieved documents. Since fully automatic document translation is not accurate, one approach is to generate a translated concept hierarchy. Translating a target language concept hierarchy into a source language is not as hard as it might appear at first. As the translation is occurring in the context of a retrieval system, there are certain features that can be taken advantage of. First, there already exists a set of translated terms - those of the query - and these can be exploited. Second, the documents to be retrieved have a degree of
similarity to them, and this quality will also be beneficial. We start by working with the query terms. In the work by Ballesteros and Croft, 1998, a successful method of cross-language retrieval was described, one aspect of which involved the expansion of users’ original queries with other source language terms, which were then translated into the collection language to produce an effective target language query. From this form of retrieval, a reliable mapping exists between the translated terms in the retrieved documents and the expanded query terms. As a starting point, one can build concept hierarchies from these translated terms alone. Because of the existing mapping, further translation is not necessary. Although the resulting hierarchies will be small, they will still be of use to users unable to read the target language documents. As was found with monolingual concept hierarchies, their quality and richness can be improved by including terms found within the documents in addition to those of the query. The accurate translation of the additional terms will be conducted using Dagan’s technique, which is designed to work with minimal translation resources (Dagan et al., 1991). When translating a particular term, the context in which it occurs is used to disambiguate the term. If that term occurs in other retrieved documents, it is reasonable to assume it will be used in the same sense throughout those documents. All the contexts, therefore, can be conjoined to provide more information, making the disambiguation, and therefore the translation, more accurate. It is believed that the translated concept hierarchies show great promise in conveying the topical structure of retrieved documents, and a series of initial attempts are planned for future work.
6 CONCLUSIONS
Through use of a simple term association technique, a method for building concept hierarchies has been presented. The hierarchies were informally compared to other methods that derive structure from collections of documents. From this comparison, it was shown that a hierarchical organization of monothetic clusters is quite different from polythetic document clustering. Through two small-scale experiments, it has been shown that the generated concept hierarchies provide some level of sensible organization of concepts and provide a reasonable means of access to relevant documents.
Acknowledgments The authors wish to thank Russell Swan for his help in the experimental analysis. This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623. It was also supported in part by United States Patent and Trademark Office and Defense Advanced
Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235, and by SPAWARSYSCEN contract N66001-99-1-8912. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsors.
Appendix: ANOVA analysis

Table 9.A.1 Compares concept hierarchies to random ones

            DF     SS        MS        F            P-value
CONSTANT    1      2019.9    2019.9    8416.35025   0
qf          48     220.8     4.5       19.16696     0
sysf        1      3.6111    3.6111    15.04645     0.00011934
ERROR1      489    117.36    0.24

Table 9.A.2 Compares concept hierarchies, random hierarchies, and INQUERY

            DF     SS        MS        F            P-value
CONSTANT    1      2334.4    2334.4    8865.76987   0
qf          48     244.41    5.0919    19.33863     0
sysf        2      24.358    12.179    46.25429     0
ERROR1      537    141.39    0.2633
References Anick, P. and Tipirneni, S. (1999). The paraphrase search assistant: Terminological feedback for iterative information seeking. In Hearst, M., Gey, F., and Tong, R., editors, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 153–159. Ballesteros, L. and Croft, W. (1998). Resolving ambiguity for cross-language retrieval. In Croft, W., Moffat, A., van Rijsbergen, C., Wilkinson, R., and Zobel, J., editors, Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 64–71, Melbourne, Australia. Bourdoncle, F. (1997). Livetopics: recherche visuelle d’information sur l’internet (Livetopics: visual search for information on the internet). In Proceedings of
RIAO (Recherche d’Informations Assistee par Ordinateur - Computer Assisted Information Retrieval), pages 651–654. Bruza, P. and Dennis, S. (1997). Query reformulation on the internet: Empirical data and the hyperindex search engine. In Proceedings of RIAO (Recherche d’Informations Assistee par Ordinateur - Computer Assisted Information Retrieval), pages 488–499. Caraballo, S. and Charniak, E. (1999). Determining the specificity of nouns from text. In Proceedings of the joint SIGDAT conference on empirical methods in natural language processing (EMNLP) and very large corpora (VLC), pages 63–70. Chen, H., Houston, A., Sewell, R., and Schatz, B. (1998). Internet browsing and searching: user evaluations of category map and concept space techniques. Journal of the American Society for Information Science, 49(7):582–603. Cutting, D., Karger, D., Pedersen, J., and Tukey, J. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 318–329, Copenhagen, Denmark. Dagan, I., Itai, A., and Schwall, U. (1991). Two languages are more informative than one. In Proceedings of ACL’91: the 29th Annual Meeting of the Association for Computational Linguistics, pages 130–137. Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the Association for Computing Machinery (ACM), 8(4):553–578. Forsyth, R. and Rada, R. (1986). Adding an edge. In Machine Learning: applications in expert systems and information retrieval, Ellis Horwood series in artificial intelligence, pages 198–212. Ellis Horwood, Chichester; Halsted Press, New York. Fowler, R., Wilson, B., and Fowler, W. (1992). Information navigator: An information system using associative networks for display and retrieval. Technical Report NAG9-551, #92-1, Department of Computer Science, University of Texas - Pan American, Edinburg, TX 78539-2999. Fröhlich, M. and Werner, M. (1994). The graph visualization system daVinci - a user interface for applications. Technical Report 5/94, Department of Computer Science, Universität Bremen, Bremen, Germany. Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers. Grefenstette, G. (1997). Sqlet: Short query linguistic expansion techniques, palliating one-word queries by providing intermediate structure to text. In Proceedings of RIAO (Recherche d’Informations Assistee par Ordinateur - Computer Assisted Information Retrieval), pages 500–509.
Harman, D. (1992). Relevance feedback revisited. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 1-10, Copenhagen, Denmark.
Hearst, M. (1998). Automated discovery of WordNet relations. In Fellbaum, C., editor, WordNet: An electronic lexical database. MIT Press.
Hearst, M. and Pedersen, J. (1996). Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 76-84, Zurich, Switzerland.
Jansen, B., Spink, A., Bateman, J., and Saracevic, T. (1998). Real life information retrieval: A study of user queries on the web. SIGIR Forum: A Publication of the Special Interest Group on Information Retrieval, 32(1):5-17.
Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pages 191-202.
Lakoff, G. (1987). Women, Fire, and Dangerous Things. University of Chicago Press.
Larkey, L. (1999). A patent search and classification system. In Proceedings of the 4th ACM conference on Digital libraries, pages 179-187.
Lawrie, D. and Croft, W. (1999). Discovering and comparing hierarchies. Technical Report IR-183, CIIR, Department of Computer Science, University of Massachusetts, Amherst, MA 01002.
Magennis, M. and van Rijsbergen, C. (1997). The potential and actual effectiveness of interactive query expansion. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, pages 324-332.
McCallum, A., Rosenfeld, R., Mitchell, T., and Ng, A. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Bratko, I. and Dzeroski, S., editors, Machine Learning: Proceedings of the 15th International Conference (ICML '98), pages 359-367. Morgan Kaufmann Publishers.
Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.
Ng, H. and Lee, H. (1996). Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of ACL '96: the 34th Annual Meeting of the Association for Computational Linguistics, volume 34, pages 40-47.
Pirolli, P., Schank, P., Hearst, M., and Diehl, C. (1996). Scatter/gather browsing communicates the topic structure of a very large text collection. In Conference proceedings on Human factors in computing systems (ACM CHI '96), pages 213-220.
Pollitt, A., Ellis, G., Smith, M., Gregory, M., Li, C., and Zangenberg, H. (1993). A common query interface for multilingual document retrieval from databases of the European Community institutions. In Proceedings of the 17th International Online Information Meeting (Online '93), pages 47-61. Learned Information.
Qiu, Y. and Frei, H. (1993). Concept based query expansion. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pages 160-170. ACM Press.
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448-453.
Rose, D. and Stevens, C. (1996). V-Twin: A lightweight engine for interactive use. In NIST Special Publication 500-238: The 5th Text REtrieval Conference (TREC-5), pages 279-290.
Sanderson, M. and Croft, W. (1999). Deriving concept hierarchies from text. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 206-213.
Sparck Jones, K. (1970). Some thoughts on classification for retrieval. Journal of Documentation, 26(2):89-101.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21.
Thompson, R. and Croft, W. (1989). Support for browsing in an intelligent text retrieval system. International Journal of Man-Machine Studies, 30:639-668.
Tombros, A. and Sanderson, M. (1998). Advantages of query-biased summaries in IR. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 2-10.
van Rijsbergen, C. (1979). Information Retrieval. Butterworths, London, second edition.
Veling, A. and van der Weerd, P. (1999). Conceptual grouping in word co-occurrence networks. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), pages 694-699.
Voorhees, E. and Harman, D., editors (1998). The 7th Text REtrieval Conference (TREC-7). Department of Commerce, National Institute of Standards and Technology.
Wakao, T., Gaizauskas, R., and Wilks, Y. (1996). Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), pages 418-423.
Woods, W. (1997). Conceptual indexing: A better way to organize knowledge. Technical Report TR-97-61, Sun Labs, 901 San Antonio Road, Palo Alto, California 94303, USA.
Xu, J. and Croft, W. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 4-11.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 189–196.
Chapter 10

APPEARANCE-BASED GLOBAL SIMILARITY RETRIEVAL OF IMAGES

S. Ravela
Center for Intelligent Information Retrieval
University of Massachusetts, Amherst, MA 01003
ravela@cs.umass.edu
C. Luo
Center for Intelligent Information Retrieval
University of Massachusetts, Amherst, MA 01003
cluo@cs.umass.edu
Abstract
Visual appearance is an important part of judging image similarity. We readily classify objects that share a visual appearance as similar, and reject those that do not. Our hypothesis is that image intensity surface features can be used to compute appearance similarity. In the first part of this paper, a technique to compute global appearance similarity is described. Images are filtered with Gaussian derivatives to compute two features, namely, local curvatures and orientation. Global image similarity is deduced by comparing distributions of these features. This technique is evaluated on a heterogeneous collection of 1600 images. The results support the hypothesis in that images similar in appearance are ranked close together. In the second part of this paper, appearance-based retrieval is applied to trademarks. Trademarks are generally binary images containing a single mark against a texture-less background. While moments have been proposed as a representation, we find that appearance-based retrieval yields better results. Two small databases containing 2,345 parametrically generated shapes, and 10,745 trademarks are used for evaluation. A retrieval system that combines a trademark database containing 68,000 binary images with textual information is discussed. Text and appearance features are jointly (or independently) queried to retrieve images.
1 INTRODUCTION
From personal photo collections to libraries and web pages, digital multimedia collections are becoming more common. Retrieving relevant images from these collections will, in many cases, involve visual similarity. Image retrieval is a hard problem. In text retrieval, a word in the language is a "token" of information that has some meaning. While there are challenges in building effective representations and inference techniques to better satisfy user needs, basic operations such as extracting tokens, representing their statistics, and comparing them are straightforward. In contrast, the basic operations for images are computationally intensive and, in general, unsolved. Assume that a camera is used to acquire an image. Image formation begins when direct and reflected light rays impinge on elements of the camera's CCD array. Elements register a charge corresponding to the brightness and wavelength of the incident illumination, and the pattern of charges, after quantization, forms the image. A digital image is stored as a two-dimensional array of pixels, each containing a color value. Color can be expressed in terms of hue, saturation and intensity (HSI)1. Hue corresponds to the color wavelength on the color wheel. Saturation corresponds to the "depth" of the color; a saturation value of 0 makes the color gray. Intensity corresponds to the brightness of the incident illumination and is seen as a gray-level image. In black and white pictures, only intensity is represented. Images can represent a rich repertoire of visual information, but instead of the word, the basic unit is intensity or color. The intensity value at a pixel has no semantic meaning. Seemingly simple questions, such as object segmentation (how many objects are there in the image and where?) and object recognition (what objects are in the image?), are hard problems. Finding semantically relevant information is even more daunting. For example, consider a query, "find me pictures of Mahatma Gandhi giving a speech". In order to retrieve relevant images, images containing Mahatma Gandhi will have to be found. This recognition problem is challenging, given that Mahatma Gandhi's face could be imaged from a number of different views, subject to occlusions, lighting changes, digitization effects, and color and lens distortions, among others.
One approach to providing visual semantics is to use surrogates such as text descriptions. In web pages for example, descriptions of scenes, bibliographic information, and other forms of metadata are usually available and can be used to provide semantics without resort to the harder content "understanding" issues. In certain other cases, like trademark retrieval, text is also necessary (see Section 4). Textual annotations alone are, however, insufficient. In many cases textual annotations may not be available, difficult to build, and most
1. Other frames, such as red, green and blue (RGB), exist, but HSI is adopted here for convenience.
importantly, they may be inadequate. For example, it is quite unlikely that all the pictures of Mahatma Gandhi giving a speech will be labeled as such. Annotations cannot entirely anticipate the different uses for an image. Thus, image content itself needs to be part of a retrieval system, either independently or in conjunction with other sources.
A more realistic approach for image retrieval is one that treads the middle ground. That is, content is exploited to acquire descriptions that are primarily a function of the signal or statistical characteristics of the content. This approach can be very effective, is usually computationally straightforward, and is somewhat similar to the basic text retrieval approach. It can be described as follows (a sketch of the pipeline appears after this list).

1. Index images by their content while avoiding the segmentation and recognition issues. Several attributes such as color, texture, shape and appearance can be associated with image content.

2. Apply operators to extract "features" that measure statistical properties of the content attribute. For example, a filter tuned to a certain frequency can detect textures of a certain periodicity. Hence textures can be compared.

3. "Compile" features into representations of images. For example, a distribution such as a histogram of the attribute features can be used to represent the images.

4. Retrieve in a query-by-example fashion. A user selects an image as a query and its representation is compared with those in the collection to obtain a ranked list of images.
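To make the four steps concrete, the following sketch (in Python; not from the chapter, and the 64-bin quantization and file-path interface are illustrative assumptions) indexes gray-level images by intensity histograms and ranks a collection against a query by histogram intersection, the comparison popularized by Swain and Ballard, 1991:

    import numpy as np
    from PIL import Image

    def histogram_feature(path, bins=64):
        # Steps 1-3: read an image and compile a normalized intensity histogram.
        img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
        hist, _ = np.histogram(img, bins=bins, range=(0, 255))
        return hist / hist.sum()  # normalize so differently sized images compare

    def rank_collection(query_path, collection_paths):
        # Step 4: query-by-example; higher intersection = more similar.
        q = histogram_feature(query_path)
        scores = [(np.minimum(q, histogram_feature(p)).sum(), p)
                  for p in collection_paths]
        return sorted(scores, reverse=True)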
Many researchers have adopted this overall framework using various content attributes. Color has been the most common of image attributes used for retrieval. Swain and Ballard, 1991, pioneered a technique of representing distributions of color and comparing them for object recognition. Variations of this technique have become popular for image retrieval. Image intensity has been used to construct other attributes. In certain cases, the outline shape of objects can be used. Typically, the image is processed to detect object contours, and these contours are compared to measure similarity. Another attribute is image texture. It is hard to precisely define texture, but an operational definition is that textures are characterized by a periodicity of the pattern repeating itself.
Several systems have been built using color, shape, texture and combinations thereof. Initial work focused on heterogeneous collections, that is, images of various "genres". For example, systems like QBIC (Flickner et al., 1995) and Virage (Bach et al., 1996) allow users to combine color, texture and shape to search a database of general images. However, it is unclear what the purpose of such systems is. First, it is unclear if users are interested in querying general image collections using a color palette (Bach et al., 1996). To support efficient browsing and search, example-based querying might be more significant. Second, it is unclear if the attributes are actually meaningful. Specifically, consider color. A picture of a red and green parrot may be used to retrieve images based on their color similarity. The retrieved images may include other parrots and birds as well as red flowers with green stems and other images. While this is a reasonable result when viewed as a matching problem, clearly it is not a reasonable result for a retrieval system. The problem arises because color does not have a good correlation with semantics when used with general images. Shape-based methods are likely to have the same problem in heterogeneous domains (Flickner et al., 1995; Pentland et al., 1994; Sclaroff, 1996; Mokhtarian et al., 1996) because they require that the database be segmented into objects, and segmentation is an unsolved problem. The QBIC system (Flickner et al., 1995), developed by IBM, attempts to resolve this by matching binary shapes that the user manually outlines for objects in the database. Clearly, this is neither desirable nor practical. Except for certain special domains, all methods based on shape are likely to have the same problem when applied to heterogeneous collections. Although texture is the most effective of these attributes for heterogeneous collections, it is unclear if many texture-based algorithms can be generalized.
Figure 10.1 Similarity in appearance. At a glance, the first two images appear visually similar in contrast to the third.
An attribute that has recently been discovered (Kirby and Sirovich, 1990; Turk and Pentland, 1991) as being useful for visual similarity is appearance. Visual appearance is an important cue. We readily classify images that share an appearance as similar and reject those that do not. As with texture, it is hard to precisely quantify visual appearance. An object's appearance depends not only on its three-dimensional shape, but also on the object's albedo2, the viewpoint from which it is imaged, and a number of other factors. It is non-trivial to separate the different factors constituting an object's appearance from each other. For example, consider the three pictures in Figure 10.1 (wall sequence).

2. Albedo is the fraction of light that is reflected by a body.
The first two scenes appear very similar, while the third appears very dissimilar. The visual similarity between these images can be perceived at a glance, while at the same time, decomposing the image appearance into its component attributes is hard. As an attribute, appearance is distinct from other image attributes in the following ways:

- It is not dependent on color, because appearance similarity is evident in gray-level images.

- It is not dependent on binary shape, because appearance similarity is evident in images containing no distinct shapes.

- It is distinct from texture but somewhat related. Primitive textures need not be present in an image to judge appearance similarity. For example, in the "wall" sequence no detectable texture patterns are present that make the two wall images similar. In contrast, appearance lends itself to textures in the sense that it is inclusive of texture similarity. In practice, the operators used to extract appearance features resemble some classes of texture operators; see Section 2.

Our hypothesis is that the image intensity surface has features that can be used to compute appearance similarity. Finding appearance features is non-trivial. Intensity alone cannot be used because it is intolerant to various factors such as coordinate deformations induced by camera motion or illumination variations induced by lighting changes.
In the first part of this paper, a technique to retrieve images by global appearance similarity and query-by-example is presented. We develop appearance features and show that they can be used in heterogeneous gray-level collections to find images that, as a whole, appear visually similar. In order to compute global appearance similarity, features are extracted from pixel neighborhoods and their distributions over the image are compared. Histograms are used to represent distributions of features, and correlation is used to compare histograms. This technique is demonstrated on two small collections to obtain an evaluation. The first consists of 1600 grey-level images of objects such as cars, faces, apes and other miscellaneous objects. The second collection is a set of 2345 parametrically generated binary (black and white) shapes of ellipses (including circles), triangles, rectangles (including squares), and pentagons. The purpose of this database is threefold: first, to test whether this technique extends to binary images; second, to test whether the proposed method can be used to find similar shapes; and third, to provide a testbed for comparisons.
In the second part of this paper, we apply appearance-based global similarity retrieval to a collection of trademark images. Patent and Trademark Offices have large repositories that are searched for conflicting (similar) trademarks
before one can be awarded to an applicant. One of the factors for determining conflict is visual. A submission in the same category is obviously in conflict if it is visually similar to one that has already been awarded. Determining conflict is labor intensive. Examiners have to look at a large number of trademarks and the associated textual descriptions before making a decision. A system that even partly automates these functions by exploiting text and image information would have significant value. From a research perspective, trademark images may consist of simple geometric designs, pictures of animals, or even complicated designs. Thus, they provide a testbed for image retrieval algorithms.
A collection of 63,718 trademark images with associated text was provided by the US Patent and Trademark Office. This database consists of all the registered trademarks in the United States (at the time the database was provided) which consist only of designs (i.e., there are no words in them). A framework for multi-modal retrieval is applied. All searches begin with a text query, because these images have associated text that can be searched. This provides a solution to the problem "where does the example image come from?" (also referred to as the page zero problem). The INQUERY (Callan et al., 1992) search engine is used to find images whose associated text matches the query. Subsequent searches can use the appearance attribute independently or in conjunction with text. Currently, relevance judgments for this collection are not available. In order to evaluate performance, we also used a collection of 10,745 trademarks from the UK Patent and Trademark Office, provided by Dr. John Eakins at the University of Northumbria at Newcastle, UK. These images come with a set of 24 queries for which relevance judgments have been independently obtained. This collection is tested with exactly the same parameters used to search US PTO trademarks.
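The chapter does not spell out how text and appearance scores are combined; one plausible sketch is a weighted sum of normalized scores (the fusion rule, the weight alpha, and the assumption that both score sets lie in [0, 1] are all illustrative, not the system's documented behavior):

    def fuse_rankings(text_scores, image_scores, alpha=0.5):
        # text_scores / image_scores: dicts mapping image id -> score in [0, 1].
        # Images found by only one modality get 0 from the other (an assumption).
        ids = set(text_scores) | set(image_scores)
        combined = {i: alpha * text_scores.get(i, 0.0)
                       + (1.0 - alpha) * image_scores.get(i, 0.0)
                    for i in ids}
        return sorted(ids, key=combined.get, reverse=True)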
2 APPEARANCE RELATED REPRESENTATIONS
In this section, we examine several techniques that use the image intensity surface to construct features related to appearance. The intensity surface can be viewed from two different perspectives: as a discrete function that extends over a finite space, and as a function of spatial frequencies. Each of these leads to certain classes of appearance features. In the spatial view there are three main classes of features. The first is the intensity at a pixel itself. The second is the integral of intensity over an area. The third is differential features, that is, derivatives of the function and their transformations. Features can also be computed in the frequency domain. This is common with texture-based features. Figure 10.2 shows an example of a one-dimensional signal represented in space and frequency. The signal shown in the top graph, z(x), is generated
Figure 10.2 Space and frequency domain representation.
using the formula z(x) = Σ_{i=1}^{4} cos(2π f_i x), where f_1 = 0.025 Hz, f_2 = 0.05 Hz, f_3 = 0.1 Hz, and f_4 = 0.2 Hz. Here Hertz (Hz) implies cycles per pixel and x has the unit pixels. The information content of the superposed signal z(x) is precisely the frequencies of the corresponding sinusoids. The Fourier transform can be used to extract this, and the bottom graph shows the signal in frequency space. Peak responses are seen at exactly the frequencies that went into constructing the original signal. The Fourier transform F_z(f) registers peaks at the corresponding frequencies3. Space and frequency are canonical representations that easily extend to two-dimensional images. Therefore, each of these views of the image can be used to design features.

3. The user can safely disregard negative values of f and the peaks observed there. Since the original signal is real and even, the Fourier transform is symmetric and real.
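The peaks described above are easy to reproduce numerically; a minimal sketch (the 400-sample signal length is an arbitrary assumption, chosen so each frequency completes a whole number of cycles):

    import numpy as np

    x = np.arange(400)                        # position, in pixels
    freqs = [0.025, 0.05, 0.1, 0.2]           # cycles per pixel
    z = sum(np.cos(2 * np.pi * f * x) for f in freqs)

    F = np.fft.rfft(z)                        # Fourier transform of the real signal
    f_axis = np.fft.rfftfreq(x.size, d=1.0)   # frequency axis, cycles per pixel
    top = f_axis[np.argsort(np.abs(F))[-4:]]  # the four strongest frequency bins
    print(sorted(top))                        # -> [0.025, 0.05, 0.1, 0.2]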
A method that uses the intensity itself as the feature is based on principal component analysis. Kirby and Sirovich, 1990, pioneered the use of principal component analysis as a representation for faces, which was developed into an effective face recognition system by Turk and Pentland, 1991. Another system, by Nayar et al., 1996, was developed for view-based and illumination-invariant (to a degree) object recognition; that is, to recognize an object when it is imaged at different angles and under different lighting. Principal component analysis is used to exploit the redundancy in data to construct a compact (orthonormal) basis for representing them. For images, the technique works as follows. The idea is similar to latent semantic indexing used in text retrieval (Deerwester et al., 1990). An image is treated as a fixed-length vector. Using a few images as training examples, a covariance matrix is constructed and the eigenvectors of this covariance matrix are computed. The eigenvectors describe the best orthonormal space for representing the data. Usually, the number of eigenvectors is much smaller than the size of the vector describing the image. All the images in the database are then projected into this space. A new image is projected into this space and compared with the others. While this technique popularized the notion of an appearance-based representation, it also has some limitations. Images have to be size and intensity normalized, segmented4 and trained. Further, the objects must be correlated to begin with, or the number of principal components will be large. In an attempt to overcome the restriction of "correlated objects", Swets and Weng, 1996, extend the traditional method to a few classes of objects. Distances are computed from these classes first, followed by distance computation within classes. The issue, however, is that these classes are manually determined and training must be performed on each. The approach presented in this paper is different because eigen decompositions are not used to characterize appearance. Further, the method presented uses no learning and does not require constant-sized images. It should be noted that, although learning significantly helps in such applications as face recognition, it may not be feasible in many instances where sufficient examples are not available.
A common technique based on integrative features is moments. These features are especially useful for binary images. Moments can be considered an appearance feature because they describe the image intensity surface. Moments are integrals of the form

m_{pq} = ∫∫ x^p y^q f(x, y) dx dy
Hu, 1962, developed a theory of moment features with the goal of constructing invariants to image deformations. By invariants we mean that they do not change when the image undergoes a coordinate deformation. Specifically, similarity invariants are moments invariant to translation, rotation and scale changes. Similarity invariance may be important when it is desired that rotated and scaled versions of an image, such as a trademark, be considered relevant. Figure 10.3 illustrates the practical implications of similarity invariance. The moment equation described above can be transformed to achieve invariance to similarity deformations. We adopt previous derivations (Hu, 1962; Reiss, 1993; Methre et al., 1997). In Figure 10.3, the first picture is rotated 90° and 45° respectively, and three more versions that are half the size of the first three are shown. Table 10.1 shows the distance between the first picture and the rest. These distances are quite small, implying that the moment representations would rank these images almost equally.

4. The techniques referred to here actually avoid segmentation by considering relatively featureless backgrounds.

Figure 10.3 Similarity invariance in moments. Pictures are Crown copyright.

Table 10.1 Distance between the first picture in Figure 10.3 and the rest using similarity invariant moments

    Picture      2           3           4           5           6
    Distance     3.1923e-21  1.6299e-08  1.0352e-05  1.0352e-05  4.3825e-06
Moments have been used for trademark images by several authors (Methre et al., 1997; Wu et al., 1994; Jain and Vailaya, 1998). The advantage of using moments as a feature is that the image is represented by a few numbers, and hence images can be compared rapidly. Moment invariants have been used in isolation and in conjunction with other features. Jain and Vailaya, 1998, used edge angles and invariant moments to prune trademark collections. They then used template matching to find similarity within the pruned set. Their database was limited to 1100 images. Wu et al., 1994, combine moments with Fourier descriptors and projection profiles. Moments have several limitations. They cannot be used effectively in gray-level images, and do not work well when the image has noise. They work best for solid objects. We experimentally compare our technique with moments in Section 3.
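For reference, similarity-invariant moments of the kind used in the comparisons of Section 3 can be sketched with OpenCV's implementation of Hu's invariants (the log-magnitude scaling and Euclidean distance below are common conventions assumed here, not taken from the chapter):

    import cv2
    import numpy as np

    def hu_vector(binary_img):
        # Seven Hu moment invariants; log-scaled to compress their dynamic range.
        m = cv2.moments(binary_img, binaryImage=True)
        hu = cv2.HuMoments(m).flatten()
        return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

    def moment_distance(img_a, img_b):
        # Smaller distance = more similar under translation, rotation and scale.
        return float(np.linalg.norm(hu_vector(img_a) - hu_vector(img_b)))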
In the context of trademark retrieval, there are systems based on methods other than moments that are worth mentioning. One of the earliest methods was proposed by Kato, 1992. The system used rudimentary features, namely edges and "occupancy" within pixel regions. A more sophisticated system by Eakins et al., 1996, develops trademark retrieval at a more semantic level, where image features such as lines and curves are extracted, grouped by parallelism and connectivity, and then compared for retrieval. The method presented here is intended to be general in that it is applicable to a wide variety of images rather than specifically trademarks.
Appearance features can also be extracted in the frequency domain. As discussed earlier, the frequency domain representation can be obtained by computing the Fourier transform, and it represents the spatial frequencies, that is, the periodicity of patterns in the image. Ma and Manjunath, 1996, use a class of filters called Gabor filters to retrieve images with similar texture. Gabor filters are sine-modulated Gaussian functions, which can be tuned to respond to a bandwidth around a certain center frequency. That is, textures of a certain periodicity can be detected. Figure 10.4 gives an example of how Gabor filters can be used. We will restrict ourselves to one-dimensional signals for simplicity. A Gabor filter is generated using the formula

g(x) = e^{−x²/(2σ²)} cos(2π f x)

The filter has two parameters. The first is the frequency (f) around which it is centered, and the second is the bandwidth of the filter (w). We choose the scale of the Gaussian as σ = 2/w. We "design" three filters, with center frequencies of 0.025 Hz, 0.05 Hz and 0.1 Hz respectively. The bandwidth of these filters is kept at 0.03 Hz. The top row in Figure 10.4 shows the three filters in the spatial domain. Notice that the filter looks like a sinusoid attenuated with a Gaussian envelope. The frequency of this sinusoid determines the center frequency of the filter, and the scale of the Gaussian determines the bandwidth. The next row of plots shows the corresponding amplitudes of the filters in the frequency domain. The third row of plots shows the filters applied to the sinusoidal test function in Figure 10.2. Each filter is applied in the frequency domain5: its Fourier representation is multiplied with the function in the bottom plot of Figure 10.2. An inverse Fourier transform of the filtered outputs produces the last row. These are the signals filtered out of z. These signals are three of the four sinusoids that went into generating the test function z shown in the top plot of Figure 10.2!

5. In the spatial domain, filtering is implemented as a convolution.
Figure 10.4 The design and use of Gabor filters.
This simple example brings out the essence of operating in the frequency domain: take the Fourier transform, apply the filter, and take the inverse Fourier transform. The filter defines the feature that is extracted. If it is assumed that two images are visually similar when they have similar textures, then Gabor filters can be used for comparing textures. Since the types of textures that can be present in the image are not known a priori, filters must be designed to sample the frequency spectrum and bandwidths. Ma and Manjunath, 1996, use filters that are adjacent to each other; that is, they tessellate the frequency spectrum. In two dimensions the orientation of textures is also important; therefore the orientation of the filter is also a parameter. Ma and Manjunath, 1996, use filters oriented in several directions, again by sampling the space of orientations. They apply a total of 24 filters and associate each pixel with a vector of filter responses. A representation of the image is generated by computing the means and variances of the texture vectors over the image. These representations are compared for retrieval.
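The transform-multiply-invert recipe is a few lines of numpy; a sketch using the same test signal and one of the three filters designed above (the filter length and centering are implementation assumptions):

    import numpy as np

    def gabor(x, f, w):
        # 1-D Gabor: Gaussian envelope (sigma = 2/w) modulating a sinusoid at f.
        sigma = 2.0 / w
        return np.exp(-x**2 / (2 * sigma**2)) * np.cos(2 * np.pi * f * x)

    x = np.arange(400)
    z = sum(np.cos(2 * np.pi * f * x) for f in (0.025, 0.05, 0.1, 0.2))

    g = gabor(x - x.mean(), f=0.05, w=0.03)   # filter centered on 0.05 Hz
    filtered = np.fft.irfft(np.fft.rfft(z) * np.fft.rfft(g), n=x.size)
    # 'filtered' is, up to gain and a shift, the 0.05 Hz sinusoid extracted from z.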
Liu and Picard, 1996, use another approach, modeling textures in terms of their periodicity, randomness and directionality. This is called a Wold model. Using this as a basis, the authors try to classify the textures. Similarly, Gorkani and Picard, 1994, attempt to classify scenes as city or country.
3 COMPUTING GLOBAL APPEARANCE SIMILARITY
In the method adopted in this paper, differential features are used as a representation of appearance. A differential feature is a feature computed from the spatial derivatives of an image. Such features are obtained by transforming simple derivatives so that they are invariant or tolerant to factors affecting object appearance, such as rotations, scale and illumination changes. In practice, a differential feature is a vector associated with a point in the image. In this section we examine the use of differential features for appearance-based retrieval. The framework is the following. Pixels are associated with differential features. These features are accumulated to represent images. Representations are compared to rank images.

Spatial derivatives and multi-scale features. The simplest differential feature is a vector of spatial derivatives. For example, given an image I and some point p, the derivatives up to second order can be used as a feature:

V(p) = (I_x, I_y, I_xx, I_xy, I_yy)(p)
Derivatives capture useful statistical information about the image. The first derivatives represent the gradient or "edgeness" of the intensity, and the second derivatives can be used to represent bars. Derivative features (represented as the vector V) are appearance features because they approximate the local shape of the intensity surface in the sense of a Taylor expansion: the Taylor series tells us that the derivatives at a pixel are needed to estimate the value of the intensity in a neighborhood around it. Differential features that are invariant or tolerant to rotations, scale changes and illumination can be computed from the derivatives (Florack, 1993).
A typical method for computing derivatives is finite differences. However, this does not guarantee their stability. It has been argued by Koenderink and van Doorn, 1987, Florack, 1993, and others that the derivatives are guaranteed to be stable if, instead of using finite differences, they are computed by filtering the image with normalized Gaussian derivative filters. A Gaussian derivative is the derivative of the Gaussian function G(r, σ) = e^{−r²/(2σ²)} (up to a normalization constant).
Figure 10.5 Computing derivatives in a stable manner.
Filtering an image with a Gaussian derivative is equivalent to filtering the image with a Gaussian and then computing the derivatives of the filtered image6. For example, the partial derivative in the x direction has the following equivalence: I_x(σ) = ∂(G(σ) ∗ I)/∂x = G_x(σ) ∗ I, where ∗ is convolution and G(r, σ) is the Gaussian. In the following discussion we use these formulations interchangeably.
In Figure 10.5 we illustrate derivative instability due to finite differences. Consider a function that is flat, say with a constant value of 100.

6. Differentiation and filtering commute.
Figure 10.6 Gaussian derivative filters in the frequency domain.
This function is trivially differentiable; its derivatives are zero. To this function we add sinusoidal noise of the form sin(2πx). This function is shown in the top plot of Figure 10.5. In the next plot, derivatives are computed using finite differences. The derivatives start oscillating and are either undamped or weakly damped, hence unstable. In general it can be shown that they can diverge. In the next plot, derivatives are shown after using a smoothing filter that averages adjacent values (a box-car filter). While the derivatives are better, they still exhibit unstable behavior. In contrast, in the bottom plot derivatives are computed after filtering with a Gaussian. The derivatives quickly converge to the right value. Thus, it is important that smoothing be done appropriately. The box-car filter cannot be guaranteed to work well either, because its Fourier transform is a sinc function and noise can leak into the filtered output.
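The contrast between finite differences and Gaussian-filtered derivatives can be reproduced with scipy (the noise period and the scale sigma = 4 are arbitrary assumptions):

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    x = np.arange(512, dtype=float)
    f = 100.0 + np.sin(2 * np.pi * x / 8.0)   # flat function plus sinusoidal noise

    d_finite = np.diff(f)                     # finite differences follow the noise
    d_gauss = gaussian_filter1d(f, sigma=4.0, order=1)  # Gaussian first derivative

    print(np.abs(d_finite).max())             # on the order of the noise amplitude
    print(np.abs(d_gauss).max())              # close to the true derivative, zero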
Figure 10.7 The effect of Gaussian smoothing.
Computing derivatives using Gaussian derivatives implies that the derivatives are computed not on the original image I but on the smoothed image I_σ.
While these derivatives are guaranteed stable, it also implies that the function is band-limited. In order to visualize this, consider the Gaussian derivative in the frequency domain. A Gaussian derivative filter is a band-pass filter centered at a frequency relating to the order of the derivative and with a bandwidth predominantly related to the scale σ. In Figure 10.6 the Gaussian derivatives are shown in the frequency domain. As the derivative order increases, the center frequency shifts to higher frequencies; that is, finer detail is detected. Filtering with a Gaussian derivative at a certain scale, therefore, implies that only a limited band of frequencies is being observed. That is, the function is band-limited. Clearly this is a disadvantage. For example, in Figure 10.7 a texture pattern is smoothed with a Gaussian at three different scales. At the smaller scale, finer details of the image are visible (first picture). At a large enough scale, the texture is smoothed out (last picture). At an appropriate scale, the pattern is best represented (middle picture). A priori the best scale is not known, hence derivatives are generated using Gaussian derivatives at multiple scales; that is, a sampling of the scale-space of the image.
We have been arguing that in order to compute derivatives stably the Gaussian can be used. But any C∞ smooth function will guarantee stability. Further, we have argued that if a Gaussian is used then a multi-scale representation is necessary. However, this is true of any smoothing filter. That is, irrespective of the shape of the filter, a multi-scale representation becomes necessary because all filters are of finite size, hence band-limiting. So the question is, why the Gaussian? It has been shown by several authors (Lindeberg, 1994; Koenderink, 1984; Witkin, 1983; ter Haar Romeny, 1994; Florack, 1993) that, under certain general constraints, the Gaussian filter forms a unique operator for representing an image across the space of scales. The Gaussian guarantees that every structure that is observed at a coarser scale can be related to structures already present at a finer scale. That is, no new structures are introduced as an artifact of the filter. This is a very powerful result that motivates the computation of derivatives using Gaussians. From a practical perspective, a derivative vector is generated at multiple scales and associated with the pixel. This is the basic appearance feature.
It is instructive to draw a comparison between Gaussian filters and Gabor filters, because both act as band-pass filters. The argument of Gaussian (vs.) Gabor has been going on for some years. Gabor filters are attractive because of their conceptual simplicity. The frequency and bandwidth can be independently tuned. In the Gaussian they are coupled: both the order of the derivative and the scale determine the center frequency and bandwidth. However, unlike the Gabor, the Gaussian is attractive from an implementation point of view. Gaussian derivatives form an orthonormal basis and therefore they can be used to compute filter responses in any arbitrary orientation. For example, the Gaussian derivative in the x direction, Gx, and in the y direction, Gy, can be
combined using the steering formula G_θ = cos(θ) Gx + sin(θ) Gy to generate a filter in any arbitrary direction θ (Freeman and Adelson, 1991). Finding the response of an image to any number of directions thus involves two filtering operations, plus steering. Since the Gaussians form a basis, they support the algebraic formulations that go into transforming derivatives into other differential features. In contrast, oriented filters have to be specifically generated for Gabors, making them expensive to implement. Another reason for preferring the Gaussian is that separability7 is easier to implement than for the Gabor, thus making them faster. Finally, in many cases it is possible to synthesize a Gaussian derivative that closely resembles the frequency profile of the Gabor. In practice, such design decisions are not explicitly necessary because the derivatives provide a natural sampling of the frequency space.
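The steering formula translates directly into code; a sketch using scipy's Gaussian-derivative filtering to stand in for Gx and Gy (the scale sigma = 2 is an arbitrary assumption):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def steered_response(img, theta, sigma=2.0):
        # Response to a first Gaussian derivative oriented at angle theta,
        # computed from only the x and y derivative responses.
        gx = gaussian_filter(img, sigma, order=(0, 1))  # derivative along x (columns)
        gy = gaussian_filter(img, sigma, order=(1, 0))  # derivative along y (rows)
        return np.cos(theta) * gx + np.sin(theta) * gy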
Differential features: Curvature and Orientation. There are several features that can be constructed from derivative vectors. The choice of these features depends on several factors, primary among which is a consideration of the factors affecting appearance. It has been shown that local features computed using Gaussian derivative filters can be used for local similarity, i.e., to retrieve parts of images (Schmid and Mohr, 1996; Ravela and Manmatha, 1997). Here we argue that global similarity can be determined by computing local features and comparing distributions of these features. The task is to robustly characterize the 3-dimensional intensity surface (X, Y, Intensity). A 3-dimensional surface is uniquely determined if the local curvatures (the rate of change of the rate of change) everywhere are known. Thus, it is appropriate that one of the features be local curvature. For a three-dimensional surface there are two principal curvatures that can be computed at a point. These are called the normal and tangential curvatures respectively. In fact, principal curvatures are nothing but the second-order spatial derivatives expressed in a coordinate frame determined by the orientation of the local intensity gradient8. The principal curvatures of the intensity surface are invariant to image-plane rotations and monotonic intensity variations and, further, their ratios are, in principle, insensitive to scale variations of the entire image.

7. Separability implies that a two-dimensional filter can be implemented as a series of one-dimensional filtering steps.
8. Proof sketch: Construct a coordinate frame that has one of its axes aligned along the direction of the gradient. Express the second derivative (Hessian) in this frame. The principal curvatures are two of the three distinct terms. A full proof is beyond the scope of this paper.
The normal and tangential curvatures of the 3-D surface (X, Y, Intensity) are defined as (Florack, 1993):

N(p, σ) = (I_x² I_xx + 2 I_x I_y I_xy + I_y² I_yy) / (I_x² + I_y²)

T(p, σ) = (I_y² I_xx − 2 I_x I_y I_xy + I_x² I_yy) / (I_x² + I_y²)

where I_x(p, σ) and I_y(p, σ) are the local derivatives of image I around point p, computed using Gaussian derivatives at scale σ. Similarly, I_xx(·, ·), I_xy(·, ·), and I_yy(·, ·) are the corresponding second derivatives. The normal curvature N and tangential curvature T are then combined into a ratio (Koenderink and van Doorn, 1992) to generate a shape index as follows:

C(p, σ) = arctan[(N + T) / (N − T)]

The index value C is π/2 when N = T; it is undefined when N and T are both zero, and is therefore not computed there. This is interesting because very flat portions of an image (constant or constant slope in intensity) are eliminated. For example, in Figure 10.9 the background in most of these face images is not used. The shape index is rescaled and shifted to the range [0, 1], as is done in Dorai and Jain, 1995. Nastar et al., 1996, also use the shape index for recognition and retrieval. However, their approach uses curvatures computed at a single scale; clearly, this is not enough.
The second feature used is local orientation. Local orientation is the direction of the local gradient. Curvatures are rotationally invariant and do not carry any information about the orientation of objects or textures in the image. This can be important. For example, consider the "wall" sequence: part of what makes these images similar is the roughly similar orientation of structures in them. Orientation is independent of curvature and is stable with respect to scale and illumination changes, but by definition it is rotationally variant. The orientation is simply defined as

θ(p, σ) = atan2(I_y(p, σ), I_x(p, σ))

Note that θ is defined only at those locations where C is defined, and is ignored elsewhere. As with the shape index, θ is rescaled and shifted to lie in the interval [0, 1].
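The features above translate into a short routine; a sketch (scipy Gaussian derivatives; the small epsilon guarding flat regions, and the use of atan2 in place of the arctan of the ratio, are implementation assumptions):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def co1_features(img, sigma, eps=1e-6):
        # Per-pixel shape index C and orientation theta at one scale sigma.
        img = np.asarray(img, dtype=float)
        Ix  = gaussian_filter(img, sigma, order=(0, 1))
        Iy  = gaussian_filter(img, sigma, order=(1, 0))
        Ixx = gaussian_filter(img, sigma, order=(0, 2))
        Iyy = gaussian_filter(img, sigma, order=(2, 0))
        Ixy = gaussian_filter(img, sigma, order=(1, 1))
        g = Ix**2 + Iy**2
        N = (Ix**2 * Ixx + 2 * Ix * Iy * Ixy + Iy**2 * Iyy) / (g + eps)
        T = (Iy**2 * Ixx - 2 * Ix * Iy * Ixy + Ix**2 * Iyy) / (g + eps)
        valid = (np.abs(N) + np.abs(T)) > eps   # drop flat regions where N = T = 0
        C = np.arctan2(N + T, N - T)            # shape index (atan2 avoids division)
        theta = np.arctan2(Iy, Ix)              # local gradient orientation
        # Rescale and shift both features to [0, 1], as described in the text.
        return ((C[valid] + np.pi) / (2 * np.pi),
                (theta[valid] + np.pi) / (2 * np.pi))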
Feature Histograms. Histograms of the shape index and orientation are used to represent the distributions of features over an image. Histograms form a global representation because they capture the distribution of local features. A histogram is one of the simplest ways of estimating a non-parametric distribution. The representation of the image is therefore a set of multi-scale histograms. In this implementation, curvature and orientation are generated at several scales and represented as a one-dimensional record or vector. The representation of image I is the vector

V_i = ⟨ H_C(σ_1) … H_C(σ_n), H_θ(σ_1) … H_θ(σ_n) ⟩

where H_C and H_θ are the curvature (shape index) and orientation histograms respectively. We found that using 5 scales gives good results; the scales are 1 … 4 in steps of half an octave (√2).
Schiele and Crowley, 1996, use histograms of various differential features. However, the difference between the two approaches is that their method uses multi-dimensional histograms of features that do not include curvature. Further, they apply their technique to simple object classes and not to heterogeneous collections. Finally, their representations are computed at a single scale.
Matching feature histograms. Two representations are compared using the normalized cross-covariance, defined as

s(V_q, V_i) = V_q^(m) · V_i^(m) / (‖V_q^(m)‖ ‖V_i^(m)‖)

where V^(m) = V − mean(V). Normalized cross-covariance is similar to cosine correlation, or a dot product of vectors with the mean of the signal removed. Asymptotically, all these measures behave similarly. Other possible measures, such as the Kullback-Leibler (Cover and Thomas, 1991) and Mahalanobis (Mahalanobis, 1936) distances, could be used.

Retrieval. Retrieval is carried out as follows. A query image is selected by a user. The query histogram vector V_q is compared with the database histogram vectors V_i. The images are then ranked by their correlation score and displayed to the user. We call this retrieval algorithm the curvature/orientation, or CO-1, algorithm.
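Putting the pieces together, a sketch of the CO-1 representation and ranking (it reuses co1_features from the previous sketch; the 64-bin histograms are an assumption, as the chapter does not state the bin count):

    import numpy as np

    SCALES = [2 ** (k / 2.0) for k in range(5)]   # 1 ... 4 in half-octave steps

    def co1_vector(img, bins=64):
        # Concatenated multi-scale curvature and orientation histograms.
        parts = []
        for sigma in SCALES:
            C, theta = co1_features(img, sigma)
            for feat in (C, theta):
                h, _ = np.histogram(feat, bins=bins, range=(0.0, 1.0))
                parts.append(h.astype(float))
        return np.concatenate(parts)

    def normalized_cross_covariance(vq, vi):
        # The correlation score used to rank database images against the query.
        a, b = vq - vq.mean(), vi - vi.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query_img, database):            # database: list of (id, image)
        vq = co1_vector(query_img)
        scored = [(normalized_cross_covariance(vq, co1_vector(img)), did)
                  for did, img in database]
        return sorted(scored, reverse=True)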
3.1 EXPERIMENTS
The CO-1 method is tested using two databases. The first is a collection of 1561 assorted gray-level images and the second a collection of 2345 parametrically generated shapes. These two collections are discussed next.
3.2 ASSORTED COLLECTION OF GRAY LEVEL IMAGES
This database has digitized images of cars, steam locomotives, diesel locomotives, apes, faces, people embedded in different background(s), and a small number of other miscellaneous objects such as houses. These images were obtained from the Internet and the Corel photo-cd collection and were taken with several different cameras of unknown parameters, under varying, uncontrolled lighting and viewing geometry. In the following experiments an image is selected and submitted as a query. The objective of this query is stated and the relevant images are decided in advance. Then the retrieval instances are gauged against the stated objective. In general, objectives of the form 'find images similar in appearance to the query' will be posed to the retrieval algorithm. A measure of the performance of the retrieval engine can be obtained by examining the recall/precision table for several queries. Briefly, recall is the proportion of the relevant material actually retrieved, and precision is the proportion of retrieved material that is relevant (van Rijsbergen, 1979). These measures are widely used in the information retrieval community and are adopted here. The retrieved ranks are used to interpolate and extrapolate precision at all recall points; a sketch of this computation follows.
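One standard way to turn retrieved ranks into precision at fixed recall levels is sketched below (the "maximum precision at or above each recall level" interpolation rule is the common convention assumed here):

    def precision_at_recall_points(ranked_ids, relevant_ids, levels=11):
        # Interpolated precision at evenly spaced recall levels (0.0 ... 1.0).
        relevant = set(relevant_ids)
        hits, observed = 0, []
        for k, doc in enumerate(ranked_ids, start=1):
            if doc in relevant:
                hits += 1
                observed.append((hits / len(relevant), hits / k))  # (recall, prec)
        return [max((p for r, p in observed if r >= i / (levels - 1)), default=0.0)
                for i in range(levels)]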
Four example queries are discussed below in Figure 10.8 through Figure 10.11. The left-most image of the top row in each figure is the query and is also the first retrieved. The rest, in row-major order, are seven retrievals depicted in rank order. Note that flat portions of the background are never considered, because the principal curvatures there are very close to zero and therefore do not contribute to the final score. Thus, for example, the flat background in the faces in Figure 10.9 is not used. Notice that visually similar images are retrieved even when there is some change in the background (Figure 10.8). This is because the dominant object contributes most to the histograms. If a single scale is used, poorer results are achieved and the background affects the results more significantly. The results of these examples are discussed below, with the precision over all recall points given in parentheses.

1. Find similar cars (65%). Pictures of cars viewed from similar orientations appear in the top ranks because of the contribution of the orientation histogram. This result also shows that some background variation can be tolerated. The eighth retrieval, although a car, is a mismatch and is not considered relevant.

Figure 10.8 Image retrieval using CO-1: Car

Figure 10.9 Image retrieval using CO-1: Face

2. Find same face (87.4%) and find similar faces: In the face query, the objective is to find the same face. In experiments with a University of Bern face database of 300 faces, with 10 relevant faces each, the average precision over all recall points for all 300 queries was 78%. It should be
noted that the system presented here works well for faces with the same representation and parameters used for all the other queries. There is no specific “tuning” or learning involved to retrieve faces.
Figure 10.10 Image retrieval using CO-1: Ape
3. Find dark-textured apes (64.2%). The ape query also returns several light-textured apes with similar texture. Although these are not mismatches, they are not consistent with the intent of the query, which is to find dark-textured apes.
Figure 10.11 Image retrieval using CO-1: Patas Monkey
4. Find other patas monkeys (47.1%). There are 16 patas monkeys in all, and 9 within a small view variation. However, here the whole image is being matched, so the number of relevant patas monkeys is 16. The precision is low because the method cannot distinguish between light and dark textures, leading to irrelevant images. Note that it finds other dark-textured apes, but those are deemed irrelevant with respect to the query.

The recall/precision result over all queries was 66.3%. While the queries presented here are not "optimal" with respect to the design constraints of global similarity retrieval, they are, however, realistic queries that can be posed to the system. Mismatches can and do occur.
3.3 PARAMETRIC SHAPE COLLECTION
The second collection contains 2345 parametrically generated shape images. The shapes are binary in intensity (0 background, 255 foreground) and contain four types of shapes: ellipses (including circles), triangles, rectangles (including squares) and pentagons. There are several reasons for using this collection. First, to test the performance of the CO-1 algorithm on binary intensities. Binary images do not span the full range of intensity values, so it is interesting to examine the effectiveness of curvature histograms in this case. Second, to examine if appearance similarity is effective for shapes. The predominant visual information depicted in the objects is shape. Thus, one can evaluate the utility of the appearance representation for shape similarity. Third, there are several situations where retrieving binary shape images is necessary.
Trademark images are usually binary, and several of them have distinct shapes (see Figure 10.20). Thus this database provides a feasibility study for applying CO-1 to trademark retrieval. Finally, a parametrically generated collection provides ground truth that can be used to determine shape relevance. Therefore, different algorithms can be systematically compared. In this context it should be noted that moments have been widely used; thus, this collection provides a testbed for comparison.

Table 10.2 Variable parameters for individual shape classes

    Shape     Parameter 1                              Parameter 2                              Comment
    Ellipse   Eccentricity 0.2, 0.4, 0.6, 0.8, 1.0     --                                       Horizontal diameter constant
    Triangle  Angle a1: 30, 60, 90, 120                Angle a2: 30, 60, 90, 120                Base length constant; a1 + a2 ≤ 150
    Quad.     Ratio of top and bottom edge:            Ratio of height to bottom edge:          Base length constant
              0.2, 0.4, 0.6, 0.8, 1.0                  0.6, 0.8, 1.0, 1.2, 1.4
    Pentagon  Upper notch angle: 72, 90, 108, 126, 144 Bottom notches: 90, 105, 120, 135, 150   Base length constant

Table 10.3 Fixed parameters for all shape classes

    Parameter        Comment
    Edge thickness   Solid, or 5, 10, 15 pixels wide
    Inscriptions     One, two or three loops
    Noise            0, 2 or 3 points of noise; points randomly chosen; solid shapes have no noise copies
Four families of shapes are generated, namely ellipses, triangles, quadrilaterals and pentagons. The parameters are chosen so that a small set of images is generated, yet with reasonable diversity for evaluation. Each of these shapes is defined by parameters specific to the family and by parameters that are common across families, enumerated in Table 10.2 and Table 10.3 respectively. Table 10.2 enumerates the parameters specific to each shape. All the ellipses in this collection have a constant horizontal diameter, and the eccentricity is changed to generate the others. Thus shapes from circles to highly eccentric ellipses are generated. All the ellipses are oriented horizontally; that is, the horizontal axis has the longer diameter. Triangles are defined by a base length that is constant and two internal
angles that are variable. Thus acute, obtuse, and right-angled triangles of various angles (shown in the table) are generated. Quadrilaterals are defined by a fixed base (lower) length and by varying both the perpendicular height and the length of the side parallel to the base. Thus various rectangles, squares and trapezoids are generated, but parallelograms and rhombuses are not. Pentagons are defined by a constant base length and two variable angles: the angle of the edges adjoining the base (they are the same) and the angle of the apex. This constructs regular pentagons and pentagons with obtuse angles.

Table 10.4 Per-query average precision over all recall points: Parametric shape collection

    No.  Relevance criteria                                    Relevant  CO-1   Moments
    1    All ellipses and circles                              230       90.6   24.1
    2    All circles                                           46        79.1   16.0
    3    Ellipses with same eccentricity                       46        100.0  24.0
    4    Ellipses with same eccentricity                       46        88.4   17.0
    5    Ellipses with same eccentricity and inscriptions      15        72.7   28.2
    6    Ellipses with same eccentricity and inscriptions      15        86.5   28.8
    7    All pentagons                                         1145      76.8   58.2
    8    All pentagons of same regularity                      46        92.4   14.3
    9    All pentagons of same regularity                      45        91.9   21.7
    10   All pentagons of same regularity                      46        64.0   11.3
    11   All pentagons of same regularity and inscriptions     15        58.6   28.5
    12   All pentagons of same regularity and inscriptions     15        60.4   28.2
    13   All quadrilaterals                                    690       61.9   38.3
    14   All quadrilaterals with same shape                    28        94.2   25.3
    15   All quadrilaterals with same shape                    28        56.3   19.9
    16   All quadrilaterals with same shape                    28        65.0   10.7
    17   All quadrilaterals with same shape and inscriptions   9         50.9   36.9
    18   All quadrilaterals with same shape and inscriptions   9         58.2   36.8
    19   All triangles                                         280       46.0   19.9
    20   All triangles with same angles                        28        100.0  21.0
    21   All triangles with same angles                        28        93.0   21.3
    22   All triangles with same angles                        28        72.9   10.2
    23   All triangles with same angles and inscriptions       9         68.5   37.3
    24   All triangles with same angles and inscriptions       9         67.1   37.0
All shapes are generated with additional parameters constant across families. These parameters are enumerated in Table 10.3. The first is the thickness of an edge. Shapes can be solid, or with varying edge thickness. The second is the number of inscriptions. For example, a pentagon could be inscribed with another pentagon of the same variable parameters. Likewise for other shapes.
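As an illustration of how such a family might be rasterized (hypothetical code; the chapter's actual generator is not given, and "eccentricity" is treated here as the minor-to-major axis ratio so that 1.0 yields a circle):

    import numpy as np

    def ellipse_image(ecc, size=128, diameter=100, noise_points=0, seed=0):
        # Solid binary ellipse with a fixed horizontal diameter (Table 10.2),
        # plus optional random point noise (Table 10.3).
        rng = np.random.default_rng(seed)
        a = diameter / 2.0
        b = a * ecc                          # 'eccentricity' as an axis ratio
        y, x = np.mgrid[:size, :size]
        y, x = y - size / 2.0, x - size / 2.0
        img = np.where((x / a) ** 2 + (y / b) ** 2 <= 1.0, 255, 0).astype(np.uint8)
        for _ in range(noise_points):
            img[rng.integers(size), rng.integers(size)] = rng.integers(256)
        return img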
Table 10.5 Precision at standard recall points for the 24 queries of the Parametric shape collection

    Recall    CO-1   Moments
    0         100.0  100.0
    10        98.9   70.5
    20        95.8   45.1
    30        88.1   24.1
    40        78.3   7.0
    50        74.8   6.4
    60        73.8   6.2
    70        65.7   5.8
    80        58.5   5.7
    90        51.8   5.5
    100       37.1   5.4
    Average   74.8   25.6
Finally, each shape has copies with noise added to it. Noise is added by selecting a set of points randomly (uniform random distribution) and changing its value between 0 and 255. In these set of experiments the CO-1 algorithm is compared with moment invariants. Moments invariant to similarity transformations are generated using the technique shown in Reiss, 1993, and first developed by Hu, 1962. Moment invariants have been used by several authors (Methre et al., 1997; Wu et al., 1994) for shape similarity. Twenty-four queries are posed and relevance is shape related. For example, a query number 1, a Circle was posed with find all circles and ellipses as the relevance criteria. Table 10.4 describes the queries. The first column is the query number. The second column describes the relevance criteria. That is, what is the user looking for. The third depicts the number of images in the collection (out of 2345) that are actually relevant. The fourth column depicts the percentage average precision of this query over all recall points and the fifth the corresponding number for moments. Relevance criteria are designed to isolate an increasingly narrow set of images that can be relevant. We start from a very broad criteria. Given a shape, find all instances of that family (Query 1, 2, 7, 13, 19). For example, given a pentagon find all pentagons. Then, we make it progressively narrower. Given a shape find find the shape with the same variable parameters (Queries 2,3,4,8,9,10,14,15,16,20,21,22). For example, given an ellipse find all ellipses of the same eccentricity (other eccentricities would be invalid, but the same with a different edge thickness or number of inscriptions would not). Finally, the narrowest. Given a shape find all shapes of the same variable parameters
and inscriptions (Queries 5, 6, 11, 12, 17, 18, 23, 24). For example, given a two-loop circle, find all two-loop circles. As can be seen from Table 10.4, the CO-1 technique performs better than moments by a large margin for all of the proposed relevance criteria. The aggregate recall/precision results are summarized in Table 10.5 (a sketch of the computation behind these numbers is given below). Here we discuss the best and worst performing CO-1 and moment queries.
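For concreteness, the following is a minimal sketch of the standard interpolated precision-at-recall computation that underlies Tables 10.4 and 10.5. It is the conventional TREC-style procedure, not necessarily the authors' exact implementation:

```python
import numpy as np

def interpolated_precision(ranked_relevance, num_relevant):
    """Interpolated precision at the 11 standard recall levels (0%..100%).
    `ranked_relevance` is a boolean list over the ranked retrievals."""
    precisions, recalls = [], []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / num_relevant)
    # Precision at recall level r is the best precision at any recall >= r.
    levels = np.linspace(0.0, 1.0, 11)
    return [max([p for p, r in zip(precisions, recalls) if r >= lv],
                default=0.0) for lv in levels]
```

Averaging the eleven values for a single query gives one entry of Table 10.4; averaging each level across all 24 queries gives a column of Table 10.5.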
Figure 10.12  Best performing CO-1 query (20). Top row: top eight CO-1 retrievals. Bottom row: top eight moment retrievals.
Figure 10.12 depicts the top eight ranks for Query 20, the best performing query for CO-1: find all triangles with the same angles (variable parameters) as the query. The CO-1 technique retrieves all three triangles of the same shape. The next two ranks are inscribed triangles whose edges are narrower (5 pixels) compared with the following three (15 pixels). This is to be expected, because at larger scales the image is smoother, leaving little distinction between the inscribed triangle and the thicker triangle. In contrast, the moments technique retrieves completely different shapes in the top ranks. While it can be argued that these images are somewhat visually similar to the query, there are others that are far more similar that have not been retrieved.
Figure 10.13  Worst performing CO-1 query (19). Top row: top eight CO-1 retrievals. Bottom row: top eight moment retrievals.
Figure 10.13 depicts the worst performing CO-1 query (Query 19) and the corresponding moment ranks. The query triangle is the only solid triangle with the given angle parameters; that is, there are no noisy versions of solid objects in the collection (see Table 10.3). The top row, depicting the top eight CO-1
ranks, looks very good. However, several triangles in the collection are obtuse (see Figure 10.15) and were ranked after other shapes, such as the three non-solid pentagons shown in Figure 10.14 (ranks 4, 5, 6, top row). In contrast, the moments technique performs very differently, and poorly, with respect to the shape criterion. This query also illustrates a primary difference between the two techniques. Curvature and orientation are differential features, and hence flatness in the image intensity surface does not contribute to the representation. Thus the solid "painted" query is treated similarly to the "line drawn" triangles. In contrast, moments are integrals of the image intensity function and hence work best when the shapes being compared are all solid. Thus the moments technique retrieves solid polygons instead of other triangles and fails the shape-based criterion for similarity.
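To make the integral-versus-differential contrast concrete, here is a minimal sketch of similarity-invariant moments computed directly from Hu's (1962) definitions. The chapter's actual baseline follows Reiss (1993) and uses the full invariant set; only the first three of Hu's seven invariants are shown, for brevity:

```python
import numpy as np

def hu_moments(img):
    """First three of Hu's (1962) moment invariants, from the definition.
    An illustrative sketch, not the authors' implementation."""
    img = img.astype(float)
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()             # total "mass": every pixel's intensity counts,
    xc = (x * img).sum() / m00  # which is why a solid shape and its outline
    yc = (y * img).sum() / m00  # yield very different moment vectors
    def eta(p, q):
        # Central moment, normalized for translation and scale invariance.
        return ((x - xc)**p * (y - yc)**q * img).sum() / m00**(1 + (p + q) / 2)
    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    return np.array([
        n20 + n02,                            # h1
        (n20 - n02)**2 + 4 * n11**2,          # h2
        (n30 - 3*n12)**2 + (3*n21 - n03)**2,  # h3
    ])
```

Because flat interior regions dominate these integrals, the failure mode discussed above (solid polygons ranked above line-drawn triangles) falls directly out of the definition.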
Figure 10.14  Best performing moment query (07). Top row: top eight CO-1 retrievals. Bottom row: top eight moment retrievals.
Figure 10.15  Worst performing moment query (22). Top row: top eight CO-1 retrievals. Bottom row: top eight moment retrievals.
Figure 10.14 and Figure 10.15 depict the best and worst performing moment queries. The CO-1 technique performs better than moments on both, and these queries show a pattern similar to that explained above. The overall results for all twenty-four queries are presented in Table 10.5, which gives the precision at standard recall points over the 24 queries. The conclusion from this experiment is that, as far as shape-based criteria are concerned, the proposed appearance-based technique (CO-1) is better than a commonly used shape-based representation, namely moments.
4 TRADEMARK RETRIEVAL
The system indexes 63,718 trademarks from the US Patent and Trademark Office in the design-only category. These trademarks are binary images. Text was provided by the Patent and Trademark Office for each image in the design-only category. This information contains specific fields such as the design code, the goods and services provided, the serial number, and the manufacturer, among others. The system for browsing and retrieving trademarks provides a Netscape/Java user interface that allows the user to specify a text query. The queries that can be submitted are combinations of all the free and fielded text allowed within the interface. All images associated with the query are retrieved using the INQUERY (Callan et al., 1992) text search engine. The user can then use any of the example pictures to search for images that are visually similar, or restrict the search to images with relevant text, thereby combining the image and text searches. Image search is done using a variation of the CO-1 algorithm, described below.

Preprocessing. Each binary image in the database is first size-normalized by clipping. The images are then converted to gray-scale and reduced in size.

Computation of histograms. Each processed image is divided into four equal rectangular regions. This differs from constructing a histogram over the pixels of the entire image; in scaling to a large collection, we found that the added degree of spatial resolution significantly improves retrieval performance. The curvature and orientation histograms are computed for each tile at three scales (1, 5, 9). A histogram descriptor of the image is obtained by concatenating all the individual histograms across scales and regions. These two steps are conducted off-line.

Execution. The image search server begins by loading all the histograms into memory. Retrieval takes a few seconds and is done by comparing the query histogram against those of all 63,718 trademarks on the fly. The match scores are ranked and the top N (set to 2000) requested retrievals are returned.
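As a rough illustration of the indexing and matching steps, the sketch below computes a tiled, multi-scale curvature/orientation descriptor and compares two descriptors. The three scales and the 2x2 tiling come from the description above; the bin count, the tanh squashing of curvature, the normalization, and the L1 match score are illustrative assumptions, not the system's actual parameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def co1_descriptor(img, scales=(1, 5, 9), bins=64, grid=2):
    """Tiled multi-scale curvature/orientation histogram descriptor,
    a schematic reconstruction of the steps described in the text."""
    img = img.astype(float)
    h, w = img.shape
    feats = []
    for s in scales:
        # Gaussian derivative filters give robust differential features.
        Ix  = gaussian_filter(img, s, order=(0, 1))
        Iy  = gaussian_filter(img, s, order=(1, 0))
        Ixx = gaussian_filter(img, s, order=(0, 2))
        Iyy = gaussian_filter(img, s, order=(2, 0))
        Ixy = gaussian_filter(img, s, order=(1, 1))
        # Isophote curvature (up to sign convention) and gradient orientation.
        g2 = Ix**2 + Iy**2 + 1e-12
        curvature = (Iy**2 * Ixx - 2*Ix*Iy*Ixy + Ix**2 * Iyy) / g2**1.5
        orientation = np.arctan2(Iy, Ix)
        # One histogram pair per tile of the 2x2 grid (four regions).
        for i in range(grid):
            for j in range(grid):
                ti = slice(i * h // grid, (i + 1) * h // grid)
                tj = slice(j * w // grid, (j + 1) * w // grid)
                hc, _ = np.histogram(np.tanh(curvature[ti, tj]), bins, (-1, 1))
                ho, _ = np.histogram(orientation[ti, tj], bins, (-np.pi, np.pi))
                feats.append(hc / max(hc.sum(), 1))
                feats.append(ho / max(ho.sum(), 1))
    # Concatenating across scales and tiles gives the image's descriptor.
    return np.concatenate(feats)

def match_score(query_desc, db_desc):
    # L1 distance between descriptors (assumed here; smaller = more similar).
    return np.abs(query_desc - db_desc).sum()
```

Ranking a query is then a linear scan of `match_score` over all stored descriptors, consistent with the in-memory, on-the-fly comparison described above.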
4.1 EXAMPLES
Figure 10.16 shows the first page of images returned by the system in response to the user's text query "Apple". The user then queries using Apple Computer's logo (the image in the second row, first column). Images retrieved in response to this query are shown in Figure 10.17. The first eight retrievals are all copies of Apple Computer's trademark (Apple used the same trademark for a number of other goods, and so there are multiple copies of the trademark in the database). Trademarks 9 and 10 look remarkably similar to Apple's trademark. They
Figure 10.16  Retrieval in response to the query "Apple".
are considered valid trademarks because they are used for goods and services in areas other than computers. Trademark 13 is another version of Apple Computer's logo, but with lines in the middle; although somewhat visually different, it is still retrieved in the high ranks. Image 14 is an interesting example of a mistake made by the system: although the image is not of an apple, it has similar distributions of curvature and orientation.

The second example demonstrates combining text and visual appearance for searching. The same apple image used in the previous example is used as the image query, but it is cross-referenced with the text query "computer". That is, the system searches for trademarks that are visually similar to the apple query image but also have the word "computer" associated with them. The results are shown in Figure 10.18. Notice that the first image is the same as the query image. The second image is an actual conflict: it is a logo belonging to the Atlanta Macintosh User's Group. The text describes the image as a peach, but visually one can see how the two images might be confused with each other (which is the basis on which trademark conflicts are adjudicated). This example shows that it does not suffice to go by the text descriptions alone
Figure 10.17  Retrieval in response to the image "Apple". The image used is in the first column, second row of Figure 10.16.
and that image search is useful for trademarks. Notice that the fourth image, which some people describe as an apple and others as a tomato, is also described in the text as an apple.
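A minimal sketch of this combined search, assuming hypothetical `text_engine` and `image_index` interfaces standing in for the INQUERY engine and the CO-1 histogram matcher, respectively:

```python
def combined_search(image_query_id, text_query, image_index, text_engine, n=2000):
    """Appearance search restricted by a text match (interfaces are
    hypothetical; only the overall combination strategy is from the text)."""
    text_hits = set(text_engine.search(text_query))       # ids whose text matches
    ranked = image_index.search(image_query_id, top_n=n)  # (id, score), best first
    # Keep the visual ranking, but only for images that also matched the text.
    return [(doc, score) for doc, score in ranked if doc in text_hits]
```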
4.2 EVALUATION
Relevance judgments are not currently available for the PTO collection. Instead, a collection of 10,745 trademarks is evaluated using the same parameters used to construct the PTO system. These trademarks were provided by Dr. John Eakins at the University of Northumbria at Newcastle, UK. They come with a set of 24 queries for which relevance judgments have been obtained from various examiners. Hence a modest, but realistic, evaluation can be obtained, although the contribution of multi-modality to retrieval effectiveness cannot be gauged. The results for the twenty-four queries are shown in Table 10.6. Below, the results are discussed for the best and worst performing queries, which bring up some issues concerning relevance.
Figure 10.18  The apple image in the first column, second row of Figure 10.16 is combined with the text search "computer".

Table 10.6  Precision at standard recall points for 24 queries of the UK Trademark collection

Recall (%)    CO-1   Moments
     0       100.0    100.0
    10        75.5     51.7
    20        62.3     36.1
    30        60.7     22.5
    40        31.1     15.9
    50        26.6     14.2
    60        19.0      5.4
    70         8.5      3.7
    80         5.5      0.4
    90         5.4      0.3
   100         4.6      0.2
 average      36.3     22.8
Figure 10.19 shows the top eight ranked retrievals for Query 12 using the CO-1 (top row) and moments (bottom row) techniques. The CO-1 technique gives 100% precision. It should be noted that only four images were marked relevant for this query (first four, top row, Figure 10.19); ranks 5, 6, and 7 are also very similar visually, but are not part of the relevant list. In contrast, the moments technique performs poorly. This is also the best performing moment
Figure 10.19  Best performing CO-1 query (12). This is also the best performing moment query. Top two rows: top eight CO-1 retrievals. Bottom two rows: top eight moment retrievals. Images are Crown copyright.
retrieval over all the queries in terms of precision, because a near-identical copy was retrieved. While the third rank can be considered somewhat visually similar, the rest do not bear any resemblance to the query.
Figure 10.20  Relevant set for Query 18. Note that the second retrieval in Figure 10.21 (Image 2047233) is not in this set. Images are Crown copyright.
Figure 10.21  Worst performing CO-1 query (18). Top row: top eight CO-1 retrievals. Bottom row: top eight moment retrievals. Images are Crown copyright.
Figure 10.21 shows the retrievals using CO-1 (top row) and moments (bottom row) for Query 18, and Figure 10.20 shows the images marked relevant for that query. In terms of precision this is the worst performing CO-1 query, with a low average precision of 9.5%. However, some issues emerge. First, the second rank, which is nearly identical to the query, is not on the relevant list. Second, many images considered relevant actually bear very little "visual resemblance" or "shape resemblance" to the query. We would argue that retrievals 6, 7, and 8 (top row, Figure 10.21) are actually more similar visually than, say, the first four pictures in row three of the relevant images (Figure 10.20). Thus the relevance judgments in this case are based on other criteria; appearance-based techniques will not work in these cases.
Figure 10.22  Relevant set for Query 19. Images are Crown copyright.
Figure 10.22 and Figure 10.23 show the relevant images and the ranked retrievals for the worst performing moment query (Query 19). The moment ranks (Figure 10.23, bottom row) show that, except for ranks 6 and 8 and perhaps 2, the retrievals do not bear a significant similarity to the query (Figure 10.22, bottom row, first
Figure 10.23  Worst performing moment query (19). Top row: top eight CO-1 retrievals. Bottom row: top eight moment retrievals. Images are Crown copyright.
picture). In contrast, the CO-1 query performs better: the top eight ranks all have circular shapes, although none with inscribed triangles. All of these images can be considered visually similar, but they are irrelevant. From a matching perspective this is not unexpected; the two flags (triangles) in the query do not provide sufficient discriminating information beyond the circularity of the image to make them prominent. An inspection of the relevant images, however, shows that the flags are what decide relevance.

Overall, the CO-1 technique performs better than moments, though not as well as on the parametric shape collection. As mentioned, in some instances the relevance judgments were incomplete, in some the relevant images bore little visual (or shape) similarity to the query, and in others there were images that bore a gross similarity but did not match the relevance criteria. Given these problems with the relevance judgments, CO-1 performs quite well, but there is still room for improvement.
5 CONCLUSIONS AND LIMITATIONS
In this paper we demonstrated a method by which images can be retrieved globally by appearance. The appearance-based technique deviates from the classical use of the term appearance in image representations, thereby removing the requirement for constant-sized or homogeneous images. The framework presented for developing appearance features is rigorous: differential features can be generated robustly using Gaussian derivative filters, and several features with provable properties of tolerance or invariance with respect to scale, rotation, and illumination changes can be systematically generated. The specific features we chose, curvature and orientation, are two fundamental geometric features that can be associated with surfaces, and as has been shown, they can be used for global similarity in a fairly straightforward manner. The presented technique has produced good results on heterogeneous gray-level images and on binary images. The results with the parametric shape collection show that appearance representations are better suited to shape-based similarity than moments. The technique can also be scaled to large image collections.
300
ADVANCES IN INFORMATION RETRIEVAL
Several issues are also being examined. In some instances it seems that an image that should be ranked higher is not; we believe this to be because of a bias-variance problem in histogram binning, which kernel density estimation techniques can be used to address (a sketch follows below). Currently a histogram is constructed from filter responses at a single scale. However, scale is a local property; different structures in the image manifest at different scales. Thus, if one can estimate a natural scale at which a local curvature should be computed, a representation that is far more compact and tolerant to scale changes can be obtained: each pixel then has a unique set of scales associated with it, and the multi-scale histograms would be ordered by "significant scales" rather than "sampled scales". The parametric shape collection is being expanded, and we plan to evaluate every image in the collection against several more relevance criteria; it is possible to ask questions such as finding all triangles within a 20-degree variation. The multi-modality described here is rudimentary, and we are working on techniques to combine text retrieval and image retrieval in a principled manner. In the context of trademark retrieval, we have been experimenting with relevance feedback as a means of improving retrieval.
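As a hedged illustration of the kernel density estimation idea mentioned above, the following sketch replaces the hard-binned histogram of filter responses with a smooth density sampled on a fixed grid. The bandwidth rule, grid resolution, and value range are illustrative assumptions, not settled design choices:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_feature(responses, bins=64, lo=-1.0, hi=1.0):
    """Smooth density estimate of filter responses, a drop-in replacement
    for a hard-binned histogram. Bandwidth defaults to Scott's rule."""
    grid = np.linspace(lo, hi, bins)
    density = gaussian_kde(responses)(grid)  # evaluate the KDE on the grid
    return density / density.sum()           # normalize like a histogram
```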
Acknowledgments
This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623; in part by the United States Patent and Trademark Office and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235; in part by the Air Force Office of Scientific Research under grant number F49620-99-1-0138; in part by the National Science Foundation under grant number IRI-9619117; and in part by NSF Multimedia CDA-9502639. Any opinions, findings and conclusions or recommendations expressed in this material are the author(s)' and do not necessarily reflect those of the sponsors. We would like to thank R. Manmatha, Tom Michel, Joe Daverin and Dr. John Eakins, University of Northumbria, for helping with this work.
References

Bach, J., Fuller, C., et al. (1996). The Virage image search engine: An open framework for image management. In SPIE Conference on Storage and Retrieval for Still Image and Video Databases IV, pages 133–156.
Callan, J. P., Croft, W. B., and Harding, S. M. (1992). The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert System Applications (DEXA), pages 78–83.
Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley Series in Telecommunications. John Wiley and Sons.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.
Dorai, C. and Jain, A. (1995). COSMOS - a representation scheme for free form surfaces. In Proc. 5th International Conference on Computer Vision, pages 1024–1029.
Eakins, J., Shield, K., and Boardman, J. (1996). ARTISAN: A shape retrieval system based on boundary family indexing. In Sethi, I. and Jain, R., editors, Storage and Retrieval for Still Image and Video Databases IV, volume 2670 of Proc. SPIE, pages 17–28.
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Lee, D., Petkovic, D., Steele, D., and Yanker, P. (1995). Query by image and video content: The QBIC system. IEEE Computer Magazine, pages 23–30.
Florack, L. M. J. (1993). The Syntactic Structure of Scalar Images. PhD thesis, University of Utrecht.
Freeman, W. T. and Adelson, E. H. (1991). The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 13(9):891–906.
Gorkani, M. M. and Picard, R. W. (1994). Texture orientation for sorting photos 'at a glance'. In Proc. 12th International Conference on Pattern Recognition, pages A459–A464.
Hu, M. K. (1962). Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, IT-8:179–187.
Jain, A. K. and Vailaya, A. (1998). Shape-based retrieval: A case study with trademark image databases. Pattern Recognition, 31(9):1369–1390.
Kato, T. (1992). Database architecture for content-based image retrieval. In Jambardino, A. A. and Niblack, W. R., editors, Image Storage and Retrieval Systems, volume 2185 of Proc. SPIE, pages 112–123.
Kirby, M. and Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 12(1):103–108.
Koenderink, J. J. (1984). The structure of images. Biological Cybernetics, 50:363–396.
Koenderink, J. J. and van Doorn, A. J. (1992). Surface shape and curvature scales. Image and Vision Computing, 10(8).
Koenderink, J. J. and van Doorn, A. J. (1987). Representation of local geometry in the visual system. Biological Cybernetics, 55:367–375.
Lindeberg, T. (1994). Scale-Space Theory in Computer Vision. Kluwer Academic Publishers.
Liu, F. and Picard, R. W. (1996). Periodicity, directionality, and randomness: Wold features for image modeling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 18(7):722–733.
Ma, W. Y. and Manjunath, B. S. (1996). Texture-based pattern retrieval from image databases. Multimedia Tools and Applications, 2(1):35–51.
Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Science, India, 12:49–55.
Mehtre, B., Kankanhalli, M., and Lee, W. (1997). Shape measures for content based image retrieval: A comparison. Information Processing and Management, 33(3):319–337.
Mokhtarian, F., Abbasi, S., and Kittler, J. (1996). Efficient and robust retrieval by shape content through curvature scale-space. In First International Workshop on Image Databases and Multi-media Search.
Nastar, C., Moghaddam, B., and Pentland, A. (1996). Generalized image matching: Statistical learning of physically-based deformations. In Buxton, B. and Cipolla, R., editors, Computer Vision - ECCV '96, volume 1 of Lecture Notes in Computer Science, Cambridge, U.K. 4th European Conference on Computer Vision, Springer.
Nayar, S. K., Murase, H., and Nene, S. A. (1996). Parametric appearance representation. In Early Visual Learning. Oxford University Press.
Pentland, A., Picard, R. W., and Sclaroff, S. (1994). Photobook: Tools for content-based manipulation of databases. In Proceedings of Storage and Retrieval for Image and Video Databases II, SPIE, volume 2185, pages 34–47.
Ravela, S. and Manmatha, R. (1997). Image retrieval by appearance. In Proceedings of the 20th International Conference on Research and Development in Information Retrieval (SIGIR'97), pages 278–285.
Reiss, T. H. (1993). Recognizing Planar Objects Using Invariant Image Features, volume 676 of Lecture Notes in Computer Science. Springer-Verlag.
Schiele, B. and Crowley, J. L. (1996). Object recognition using multidimensional receptive field histograms. In Proc. 4th European Conference on Computer Vision, Cambridge, U.K.
Schmid, C. and Mohr, R. (1996). Combining greyvalue invariants with local constraints for object recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 872–877.
Sclaroff, S. (1996). Encoding deformable shape categories for efficient content-based search. In Proceedings of the First International Workshop on Image Databases and Multi-Media Search.
Swain, M. and Ballard, D. (1991). Color indexing. International Journal of Computer Vision, 7(1):11–32.
Swets, D. L. and Weng, J. (1996). Using discriminant eigenfeatures for retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 18:831–836.
ter Haar Romeny, B. M. (1994). Geometry Driven Diffusion in Computer Vision. Kluwer Academic Publishers.
Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:71–86.
van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths.
Witkin, A. P. (1983). Scale-space filtering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1019–1023.
Wu, J., Mehtre, B., Gao, Y., Lam, P., and Narasimhalu, A. (1994). STAR - a multimedia database system for trademark registration. In Lecture Notes in Computer Science: Application of Database, volume 819, pages 109–122.
Index
Albedo, 270
Ambiguity, 209
Appearance, 270
ASR, 115
Bayesian network, 19, 22, 26, 131
Bayesian, 79, 93
Bibliographic databases, 5
Binary Independence Model, 43
Boolean operators, 81
Cache, 174, 179
  document, 179
  query, 174, 179
  web, 178
Choice of seeds in K-Means clustering, 166
Citations, 6
Classification errors, 12
Classification, 86
Classifier, 3, 11, 16, 22, 103
Cluster seeds, 106, 165
Clustering, 11, 97, 104, 155
  efficiency, 168
  monothetic, 237
  polythetic, 236
  term, 246
CO-1 algorithm, 282
Co-reference resolution, 256
Collection organization, 97, 175, 185, 192
Collection partitioning, 181
Collection ranking, 131
Collection Retrieval Inference (CORI) network, 131
Collection selection metrics evaluation, 163
Collection selection, 131, 152, 174–175, 180–182, 185
  performance, 194
Color, 269
Concept hierarchy
  building, 238
  evaluating, 251
  presenting, 246
  visualizing, 246
Controlled vocabulary, 5
Coordination level, 64
CORI algorithm, 131, 133
Corpora
  aligned, 205
  comparable, 205
  parallel, 205
Correlation, 13
Cross-language retrieval, 204, 260
Ctf ratio, 139
Curvature, 282
Database ranking, 131
Database selection, 131
Database
  distributed, 178
Detection Error Tradeoff (DET), 101
Df.icf, 132
Dictionary
  bilingual, 207
  machine readable, 207
Differential features, 278
Directed acyclic graph (DAG), 238, 258
Distributed IR, 127, 174, 177, 179
  evaluation, 133
  testbeds, 129
Distributed retrieval, 151
Distributions of relevant documents, 162
Document length, 63, 65
Document routing, 75, 88
Estimation, 26, 44, 77
Event clustering, 97, 99, 110
Evidence, 2, 16, 26, 40
Exploratory Data Analysis, 39
Exponential models, 83
Face retrieval, 285
Fallout, 85–86
First story detection, 97, 99, 112
Fusion, 3, 15
Fuzzy logic, 23
Gabor filters, 276
Gaussian scale-space, 281
Geometric distribution, 77
Global clustering, 156
  evaluation, 160
Graphical displays, 40
Hierarchical menu, 247
Histogram representation, 283
HSI, 268
Image retrieval, 8, 15, 25, 268
Image semantics, 269
Inference network, 19, 24, 27, 80–82, 131
Information extraction (IE), 255
Information need, 9
Information routing, 89
INQUERY query operators, 133, 210, 219
INQUERY, 2, 19, 27, 80, 163, 175–177, 180, 182, 188, 209, 244
Interlingua, 205, 213
Inverse document frequency (IDF), 51–52, 64, 74, 241
Jeffrey's prior, 79
K-Means clustering, 155
Kullback-Leibler divergence, 27, 154, 163, 284
Language model, 4, 25, 151, 211
  unigram, 26, 130–131, 138, 154
Language modeling, 73–75, 83–84, 86
Latent Semantic Indexing, 206
Links, 7
Local clustering, 157
  evaluation, 166
Machine translation, 208
Maximum Entropy Principle, 18
Maximum likelihood, 75, 93
Merging document rankings, 135
Moment invariants, 274
Multi-database
  evaluation, 133
  model, 127–128
  testbeds, 129
Multi-modal retrieval, 293
Multi-scale features, 280
Multimedia, 8, 15, 24
Multiple-topic representation, 158
  evaluation, 166
Mutual information, 50, 78
Named Entity Recognizer (NER), 255
Neural network, 11, 22
Orientation, 282
Parametric shapes, 287
Part-of-speech, 211, 214
Passages, 7, 25, 211, 242–243
Phrase disambiguation, 212
Phrase identification, 243
Phrases, 8, 116
Precision, 85
Principal component analysis, 273
Probability Ranking Principle, 2, 66
Probability, 73–75, 94
QBIC, 270
Query expansion, 9, 83, 211, 259
Query weights, 93
Query
  access skew, 175, 185, 189, 192
  locality, 174, 185–186, 192
  response time, 174, 178, 182, 189, 192, 195
Query-based sampling, 137
R(n) metric, 133
R-Precision, 86
Ranking formula, 66, 83
Recall, 85
Regression, 18, 26
Relevance feedback, 9, 75, 78, 82–84, 86
Replication, 179, 181, 192
  partial collection, 174, 182, 185–186, 192
  performance, 190, 195
  replica selection, 174–175, 182, 185
Residuals, 40, 55
Resource description, 130
  acquisition, 137
Resource discovery, 137
Resource ranking, 131
Resource selection, 131
Retrieval model, 1, 38, 73
Retrieval Status Value (RSV), 53
Rocchio method, 79, 86
Segmentation, 74
Server-balance, 169
Shape, 269
Smoothing, 40, 76
Spatial derivatives, 278
Spearman Rank Correlation Coefficient, 139
Speech recognition, 115
Stemming, 230, 243
Subsumption, 238
TDT, 97
Term dependence, 79
Term frequency (TF), 60, 65, 74
Texture, 269
TF.IDF, 12, 22, 25, 159, 210
TIPSTER, 13
Topic Detection and Tracking, 97
Topic models, 97, 154
Tracking, 99
Trademark retrieval, 293
Transitive translation, 205, 212
Translingual retrieval, 204
TREC, 13, 43, 86, 91, 158, 177, 180, 188, 244
Very Large Corpus (VLC), 130
Vector space, 13, 17, 21, 63, 206
Weight of evidence, 40, 54
WordNet, 239