Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2534
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Steffen Lange Ken Satoh Carl H. Smith (Eds.)
Discovery Science 5th International Conference, DS 2002 Lübeck, Germany, November 24-26, 2002 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors Steffen Lange Deutsches Forschungszentrum für Künstliche Intelligenz Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany E-mail:
[email protected] Ken Satoh National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan E-mail:
[email protected] Carl H. Smith University of Maryland, Department of Computer Science College Park, Maryland, MD 20742, USA E-mail:
[email protected]
Cataloging-in-Publication Data applied for A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at
.
CR Subject Classification (1998): H.2.8, I.2, H.3, J.1, J.2 ISSN 0302-9743 ISBN 3-540-00188-3 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna e.K. Printed on acid-free paper SPIN: 10871526 06/3142 543210
Preface
This volume contains the papers presented at the 5th International Conference on Discovery Science (DS 2002) held at the Mövenpick Hotel, Lübeck, Germany, November 24–26, 2002. The conference was supported by CorpoBase, DFKI GmbH, and JessenLenz. The conference was collocated with the 13th International Conference on Algorithmic Learning Theory (ALT 2002). Both conferences were held in parallel and shared five invited talks as well as all social events. The combination of ALT 2002 and DS 2002 allowed for a comprehensive treatment of recent developments in computational learning theory and machine learning – some of the cornerstones of discovery science.

In response to the call for papers 76 submissions were received. The program committee selected 17 submissions as regular papers and 29 submissions as poster presentations, of which 27 have been submitted for publication. This selection was based on clarity, significance, and originality, as well as on relevance to the rapidly evolving field of discovery science. The conference provided an open forum for intensive discussions and interchange of new information among researchers working in the new area of discovery science.

The conference focused on the following areas related to discovery: logic for/of knowledge discovery; knowledge discovery by inferences, learning algorithms, and heuristic search; scientific discovery; knowledge discovery in databases; data mining; knowledge discovery in network environments; active mining; inductive logic programming; abductive reasoning; machine learning; constructive programming as discovery; intelligent network agents; knowledge discovery from texts and from unstructured and multimedia data; statistical methods and neural networks for knowledge discovery; data and knowledge visualization; knowledge discovery and human interaction; human factors in knowledge discovery; philosophy and psychology of discovery; chance discovery; and application of knowledge discovery to natural science and social science. The proceedings contain papers from a variety of the above areas, reflecting both the theoretical and the practical aspects of discovery science.

This year's conference was the fifth in a series of annual conferences established in 1998. Continuation of this process is supervised by the DS steering committee consisting of Setsuo Arikawa (Chair, Kyushu Univ., Japan), Klaus P. Jantke (DFKI GmbH, Germany), Masahiko Sato (Kyoto Univ., Japan), Ayumi Shinohara (Kyushu Univ., Japan), Carl H. Smith (Univ. Maryland, USA), and Thomas Zeugmann (Univ. Lübeck, Germany).

This volume consists of three parts. The first part contains the invited talks of ALT 2002 and DS 2002. The invited talks were given by Susumu Hayashi (Kobe Univ., Japan), Rudolf Kruse (Tech. Univ. Magdeburg, Germany), John Shawe-Taylor (Royal Holloway, Univ. London, UK), Gerhard Widmer (Austrian Research Inst. for AI, Austria), and Ian Witten (Univ. Waikato, New Zealand).
Since the invited talks were shared by both conferences, this volume contains the full versions of Rudolf Kruse's and Gerhard Widmer's talks as well as the abstracts of the others. The second part contains the accepted regular papers and the third part contains the written versions of the posters accepted for presentation during the conference. We would like to thank all individuals and institutions who contributed to the success of the conference: the authors of submitted papers, the invited speakers, the sponsors, and Springer-Verlag. We are particularly grateful to the members of the program committee for spending their valuable time reviewing and evaluating the submissions and for participating in online discussions, ensuring that the presentations at the conference were of high technical quality. We are also grateful to the additional external referees for their considerable contribution to this process. Last, but not least, we would like to express our immense gratitude to Andreas Jacoby (Univ. Lübeck, Germany) and Thomas Zeugmann (Univ. Lübeck, Germany), who did a remarkable job as local arrangement chairs for both conferences.
November 2002
Steffen Lange Ken Satoh Carl H. Smith
Organization
Conference Chair Carl H. Smith
University of Maryland, USA
Program Committee Steffen Lange (Co-chair) Ken Satoh (Co-chair) Diane J. Cook Andreas Dengel Peter A. Flach Gunter Grieser Achim Hoffmann Klaus P. Jantke John R. Josephson Pat Langley Bing Liu Heikki Mannila Hiroshi Motoda Stephan Muggleton Ryohei Nakano Yukio Ohsawa Jorge C.G. Ramirez Ayumi Shinohara Stefan Wrobel Kenji Yamanishi
DFKI GmbH, Germany Nat. Institute of Informatics, Japan Univ. of Texas at Arlington, USA DFKI GmbH, Germany Univ. of Bristol, UK TU Darmstadt, Germany UNSW, Australia DFKI GmbH, Germany Ohio State Univ., USA ISLE, USA National Univ., Singapore Helsinki Univ. of Tech., Finland Osaka Univ., Japan Imperial College, UK Nagoya Inst. Tech., Japan Tsukuba Univ., Japan ACS State Healthcare, USA Kyushu Univ., Japan Tech. Univ. Magdeburg, Germany NEC Co. Ltd., Japan
Additional Referees Hiroki Arimura Hideki Asoh Stephan Baumann Michael Boronowsky Nigel Collier Ludger van Elst Jim Farrand Koichi Furukawa Artur Garcez Elias Gyftodimos Jörg Herrmann
Michiel de Hoon Tamas Horvath Hitoshi Iba Ryutaro Ichise Naresh Iyer Markus Junker Thomas Kieninger Jörg Kindermann Asanobu Kitamoto Stefan Klink Willi Klösgen
Mark-A. Krogel Tim Langford Yuko Matsunaga Martin Memmel Satoru Miyano Satoshi Morinaga Yoichi Motomura Heinz Mühlenbein Jochen Nessel Kouzou Ohara Gerhard Paaß Yonghong Peng
Son Bao Pham Kazumi Saito Hiroko Satoh Shinichi Shimozono Jun-ichi Takeuchi Osamu Watanabe Shaomin Wu Seiji Yamada Masayuki Yamamura Thomas Zeugmann Sandra Zilles
Local Arrangements Andreas Jacoby Thomas Zeugmann
Univ. Lübeck, Germany Univ. Lübeck, Germany
Table of Contents
Invited Talks Mathematics Based on Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Susumu Hayashi
Data Mining with Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Rudolf Kruse, Christian Borgelt
On the Eigenspectrum of the Gram Matrix and Its Relationship to the Operator Eigenspectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 John Shawe-Taylor, Chris Williams, Nello Cristianini, Jaz Kandola In Search of the Horowitz Factor: Interim Report on a Musical Discovery Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Gerhard Widmer Learning Structure from Sequences, with Applications in a Digital Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Ian H. Witten
Regular Papers Application of Discovery to Natural Science Discovering Frequent Structured Patterns from String Databases: An Application to Biological Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Luigi Palopoli, Giorgio Terracina Discovery in Hydrating Plaster Using Machine Learning Methods . . . . . . . . 47 Judith E. Devaney, John G. Hagedorn Revising Qualitative Models of Gene Regulation . . . . . . . . . . . . . . . . . . . . . . . 59 Kazumi Saito, Stephen Bay, Pat Langley Knowledge Discovery from Unstructured and Semi-structured Data SEuS: Structure Extraction Using Summaries . . . . . . . . . . . . . . . . . . . . . . . . . 71 Shayan Ghazizadeh, Sudarshan S. Chawathe Discovering Best Variable-Length-Don’t-Care Patterns . . . . . . . . . . . . . . . . . 86 Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa
Meta-learning and Analysis of Machine Learning Algorithms A Study on the Effect of Class Distribution Using Cost-Sensitive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 Kai Ming Ting Model Complexity and Algorithm Selection in Classification . . . . . . . . . . . . . 113 Melanie Hilario Experiments with Projection Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 Tapio Elomaa, J.T. Lindgren Improved Dataset Characterisation for Meta-learning . . . . . . . . . . . . . . . . . . . 141 Yonghong Peng, Peter A. Flach, Carlos Soares, Pavel Brazdil Combining Machine Learning Algorithms Racing Committees for Large Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 Eibe Frank, Geoffrey Holmes, Richard Kirkby, Mark Hall From Ensemble Methods to Comprehensible Models . . . . . . . . . . . . . . . . . . . . 165 C. Ferri, J. Hernández-Orallo, M.J. Ramírez-Quintana Neural Networks and Statistical Methods Learning the Causal Structure of Overlapping Variable Sets . . . . . . . . . . . . 178 David Danks Extraction of Logical Rules from Data by Means of Piecewise-Linear Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 Martin Holeňa Structuring Neural Networks through Bidirectional Clustering of Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 Kazumi Saito, Ryohei Nakano New Approaches to Knowledge Discovery Toward Drawing an Atlas of Hypothesis Classes: Approximating a Hypothesis via Another Hypothesis Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 Osamu Maruyama, Takayoshi Shoudai, Satoru Miyano Datascape Survey Using the Cascade Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Takashi Okada Learning Hierarchical Skills from Observation . . . . . . . . . . . . . . . . . . . . . . . . . 247 Ryutaro Ichise, Daniel Shapiro, Pat Langley
Poster Papers Applications of Knowledge Discovery to Natural Science Image Analysis for Detecting Faulty Spots from Microarray Images . . . . . . 259 Salla Ruosaari, Jaakko Hollmén Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data Using Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . 267 Michiel de Hoon, Seiya Imoto, Satoru Miyano DNA-Tract Curvature Profile Reconstruction: A Fragment Flipping Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275 Daniele Masotti Evolution Map: Modeling State Transition of Typhoon Image Sequences by Spatio-Temporal Clustering . . . . . . . . . . . . . . . . . . . . . . . 283 Asanobu Kitamoto Structure-Sweetness Relationships of Aspartame Derivatives by GUHA . . . 291 Jaroslava Halova, Premysl Zak, Pavel Stopka, Tomoaki Yuzuri, Yukino Abe, Kazuhisa Sakakibara, Hiroko Suezawa, Minoru Hirota Knowledge Discovery from Texts A Hybrid Approach for Chinese Named Entity Recognition . . . . . . . . . . . . . . 297 Xiaoshan Fang, Huanye Sheng Extraction of Word Senses from Human Factors in Knowledge Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 Yoo-Jin Moon, Minkoo Kim, Youngho Hwang, Pankoo Kim, Kijoon Choi Event Pattern Discovery from the Stock Market Bulletin . . . . . . . . . . . . . . . . 310 Fang Li, Huanye Sheng, Dongmo Zhang Email Categorization Using Fast Machine Learning Algorithms . . . . . . . . . . 316 Jihoon Yang, Sung-Yong Park Discovery of Maximal Analogies between Stories . . . . . . . . . . . . . . . . . . . . . . . 324 Makoto Haraguchi, Shigetora Nakano, Masaharu Yoshioka Automatic Wrapper Generation for Multilingual Web Resources . . . . . . . . . 332 Yasuhiro Yamada, Daisuke Ikeda, Sachio Hirokawa Combining Multiple K-Nearest Neighbor Classifiers for Text Classification by Reducts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340 Yongguang Bao, Naohiro Ishii
ARISTA Causal Knowledge Discovery from Texts . . . . . . . . . . . . . . . . . . . . . . 348 John Kontos, Areti Elmaoglou, Ioanna Malagardi Applications of Knowledge Discovery to Social Science Knowledge Discovery as Applied to Music: Will Music Web Retrieval Revolutionize Musicology? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356 Francis Rousseaux, Alain Bonardi Process Mining: Discovering Direct Successors in Process Logs . . . . . . . . . . . 364 Laura Maruster, A.J.M.M. Weijters, W.M.P. van der Aalst, Antal van den Bosch The Emergence of Artificial Creole by the EM Algorithm . . . . . . . . . . . . . . . 374 Makoto Nakamura, Satoshi Tojo Generalized Musical Pattern Discovery by Analogy from Local Viewpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382 Olivier Lartillot Machine Learning Approaches Using Genetic Algorithms-Based Approach for Better Decision Trees: A Computational Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 Zhiwei Fu Handling Feature Ambiguity in Knowledge Discovery from Time Series . . . 398 Frank Höppner A Compositional Framework for Mining Longest Ranges . . . . . . . . . . . . . . . . 406 Haiyan Zhao, Zhenjiang Hu, Masato Takeichi Post-processing Operators for Browsing Large Sets of Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 Alipio Jorge, João Poças, Paulo Azevedo Mining Patterns from Structured Data by Beam-Wise Graph-Based Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 Takashi Matsuda, Hiroshi Motoda, Tetsuya Yoshida, Takashi Washio Feature Selection for Propositionalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430 Mark-A. Krogel, Stefan Wrobel Subspace Clustering Based on Compressibility . . . . . . . . . . . . . . . . . . . . . . . . . 435 Masaki Narahashi, Einoshin Suzuki
New Approaches to Knowledge Discovery The Extra-Theoretical Dimension of Discovery. Extracting Knowledge by Abduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441 Lorenzo Magnani, Matteo Piazza, Riccardo Dossena Discovery Process on the WWW: Analysis Based on a Theory of Scientific Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449 Hitomi Saito, Kazuhisa Miwa Invention vs. Discovery (A Critical Discussion) . . . . . . . . . . . . . . . . . . . . . . . . 457 Carlotta Piscopo, Mauro Birattari Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
Mathematics Based on Learning Susumu Hayashi Kobe University, Rokko-dai, Nada, Kobe 657-8501, Japan, [email protected], http://www.shayashi.jp
Abstract. Learning theoretic aspects of mathematics and logic have been studied by many authors. They study how mathematical and logical objects are algorithmically “learned” (inferred) from finite data. Although the subjects of these studies are mathematical objects, the objective of the studies is learning. In this paper, a mathematics whose foundation itself is learning theoretic will be introduced. It is called Limit-Computable Mathematics. It was originally introduced as a means for “Proof Animation,” which is expected to make interactive formal proof development easier. Although the original objective was not learning theoretic at all, learning theory is indispensable for our research.
The full version of this paper is published in the Proceedings of the 13th International Conference on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence Vol. 2533
Data Mining with Graphical Models Rudolf Kruse and Christian Borgelt Department of Knowledge Processing and Language Engineering Otto-von-Guericke-University of Magdeburg Universitätsplatz 2, D-39106 Magdeburg, Germany {kruse,borgelt}@iws.cs.uni-magdeburg.de
Abstract. Data Mining, or Knowledge Discovery in Databases, is a fairly young research area that has emerged in response to the flood of data we are faced with nowadays. It tries to meet the challenge of developing methods that can help human beings to discover useful patterns in their data. One of these techniques — and definitely one of the most important, because it can be used for such frequent data mining tasks as classifier construction and dependence analysis — is learning graphical models from datasets of sample cases. In this paper we review the ideas underlying graphical models, with a special emphasis on the less well known possibilistic networks. We discuss the main principles of learning graphical models from data and consider briefly some algorithms that have been proposed for this task as well as data preprocessing methods and evaluation measures.
1 Introduction
Today every company stores and processes its data electronically, in production, marketing, stock-keeping or personnel management. The data processing systems used were developed, because it is very important for a company to be able to retrieve certain pieces of information, like the address of a customer, in a fast and reliable way. Today, however, with ever increasing computer power and due to advances in database and software technology, we may think about using electronically stored data not only to retrieve specific information, but also to search for hidden patterns and regularities. If, for example, by analyzing customer receipts a supermarket chain finds out that certain products are frequently bought together, turnover may be increased by placing the products on the shelves of the supermarkets accordingly. Unfortunately, in order to discover such knowledge in databases the retrieval capacities of normal database systems as well as the methods of classical data analysis are often insufficient. With them, we may retrieve arbitrary individual information, compute simple aggregations, or test the hypothesis whether the day of the week has an influence on the product quality. But more general patterns, structures, or regularities go undetected. These patterns, however, are often highly valuable and may be exploited, for instance, to increase sales. As a consequence a new research area has emerged in recent years—often called
Knowledge Discovery in Databases (KDD) or Data Mining (DM)—in which hypotheses and models describing the regularities in a given dataset are generated and tested automatically. The hypotheses and models found in this way can then be used to gain insight into the domain under consideration, to predict its future development, and to support decision making. In this paper we consider two of the most important data mining tasks, namely the construction of classifiers and the analysis of dependences. Among the different methods for these tasks we concentrate on learning a graphical model from a dataset of sample cases. Furthermore, our emphasis is on possibilistic graphical models, which are a powerful tool for the analysis of imprecise data.
2 Graphical Models
An object or a case of a given domain of interest is usually described by a set of attributes. For instance, to describe a car we may use the manufacturer, the model name, the color etc. Depending on the specific object or case under consideration these attributes have certain values, for example, Volkswagen, Golf, red etc. Sometimes only certain combinations of attribute values are possible, for example, because certain special equipment items may not be chosen simultaneously, or certain combinations of attribute values are more frequent than others, for example, red VW Golf are more frequent than yellow BMW Z1. Such possibility or frequency information can be represented as a distribution on the Cartesian product of the attribute domains. That is, to each combination of attribute values we assign its possibility or probability.

Often a very large number of attributes is necessary to describe a given domain of interest appropriately. Since the number of possible value combinations grows exponentially with the number of attributes, it is often impossible to represent this distribution directly, for example, in order to draw inferences. One way to cope with this problem is to construct a graphical model. Graphical models are based on the idea that independences between attributes can be exploited to decompose a high-dimensional distribution into a set of (conditional or marginal) distributions on low-dimensional subspaces. This decomposition (as well as the independences that make it possible) is encoded by a graph: Each node represents an attribute. Edges connect nodes that are directly dependent on each other. In addition, the edges specify the paths on which evidence has to be propagated if inferences are to be drawn.

Since graphical models have been developed first in probability theory and statistics, the best-known approaches originated from this area, namely Bayes networks [Pearl 1988] and Markov networks [Lauritzen and Spiegelhalter 1988]. However, the underlying decomposition principle has been generalized, resulting in the so-called valuation-based networks [Shenoy 1992], and has been transferred to possibility theory [Gebhardt and Kruse 1996]. All of these approaches lead to efficient implementations, for example, HUGIN [Andersen et al. 1989], PULCINELLA [Saffiotti and Umkehrer 1991], PATHFINDER [Heckerman 1991], and POSSINFER [Gebhardt and Kruse 1996].
2.1 Decomposition
The notion of decomposition is probably best-known from relational database theory. Thus it comes as no surprise that relational database theory is closely connected to the theory of graphical models. This connection is based on the notion of a relation being join-decomposable, which is used in relational database systems to decompose high-dimensional relations and thus to store them with less redundancy and (of course) using less storage space. Join-decomposability means that a relation can be reconstructed from certain projections by forming the so-called natural join of these projections. Formally, this can be described as follows: Let U = {A1, ..., An} be a set of attributes with respective domains dom(Ai). Furthermore let rU be a relation over U. Such a relation can be described by its indicator function, which assigns a value of 1 to all tuples that are contained in the relation and a value of 0 to all other tuples. The tuples themselves are represented as conjunctions of the form "A1 = a1 and ... and An = an", which state a value for each attribute. The projection onto a subset M ⊆ U of the attributes can then be defined as the relation

$$ r_M\Big(\bigwedge_{A_i \in M} A_i = a_i\Big) \;=\; \max_{\forall A_j \in U - M:\; a_j \in \mathrm{dom}(A_j)} r_U\Big(\bigwedge_{A_i \in U} A_i = a_i\Big), $$

where the somewhat sloppy notation under the maximum operator is meant to express that the maximum has to be taken over all values of all attributes in the set U − M. With this notation a relation rU is called join-decomposable w.r.t. a family M = {M1, ..., Mm} of subsets of U if and only if

$$ \forall a_1 \in \mathrm{dom}(A_1): \ldots \forall a_n \in \mathrm{dom}(A_n): \quad r_U\Big(\bigwedge_{A_i \in U} A_i = a_i\Big) \;=\; \min_{M \in \mathcal{M}} r_M\Big(\bigwedge_{A_i \in M} A_i = a_i\Big). $$

Note that the minimum of the projections is equivalent to the natural join of relational calculus, justifying the usage of the term "join-decomposable". This decomposition scheme can easily be transferred to the probabilistic case: All we have to do is to replace the projection operation and the natural join by their probabilistic counterparts. Thus we arrive at the decomposition formula

$$ \forall a_1 \in \mathrm{dom}(A_1): \ldots \forall a_n \in \mathrm{dom}(A_n): \quad p_U\Big(\bigwedge_{A_i \in U} A_i = a_i\Big) \;=\; \prod_{M \in \mathcal{M}} \phi_M\Big(\bigwedge_{A_i \in M} A_i = a_i\Big). $$
The functions φM can be computed from the marginal distributions on the attribute sets M . This demonstrates that the computation of a marginal distribution takes the place of the projection operation. These functions are called factor potentials [Castillo et al. 1997]. Alternatively, one may describe a decomposition of a probability distribution by exploiting the (generalized) product rule of probability theory and by using conditional distributions.
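To make the decomposition idea more tangible, here is a minimal sketch (not taken from the paper; the three-attribute toy relation and all names are invented) that checks join-decomposability of a small relation by computing two maximum projections and recombining them with the minimum, i.e. the natural join:

```python
from itertools import product

# Toy relation over attributes A, B, C, given as the set of tuples whose
# indicator value is 1; all other value combinations have indicator value 0.
relation = {("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c2")}
dom_a, dom_b, dom_c = {"a1", "a2"}, {"b1", "b2"}, {"c1", "c2"}

def project(rel, positions):
    """Maximum projection: keep only the listed attribute positions."""
    return {tuple(t[i] for i in positions) for t in rel}

def natural_join(r_ab, r_bc):
    """Minimum combination (natural join) of the projections onto {A, B} and {B, C}."""
    return {(a, b, c)
            for a, b, c in product(dom_a, dom_b, dom_c)
            if (a, b) in r_ab and (b, c) in r_bc}

r_ab = project(relation, (0, 1))   # projection onto {A, B}
r_bc = project(relation, (1, 2))   # projection onto {B, C}
# True here: the relation is join-decomposable w.r.t. {{A, B}, {B, C}}
print(natural_join(r_ab, r_bc) == relation)
```

Replacing the sets by probability tables and the minimum by a product of factor potentials gives the probabilistic scheme described above.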
The possibilistic case is even closer to the relational one, because the decomposition formula is virtually identical. The only difference is that the relations r are replaced by possibility distributions π, i.e., by functions which are not restricted to the values 0 and 1 (like indicator functions), but may take arbitrary values from the interval [0, 1]. In this way a "gradual possibility" is modeled with a generalized indicator function. As a consequence possibilistic graphical models may be seen as "fuzzifications" of relational graphical models. Of course, if such degrees of possibility are introduced, the question of their interpretation arises, because possibility is an inherently two-valued concept. In our research we rely on the context model [Gebhardt and Kruse 1993] to answer this question. However, since the common ways of justifying the maximum and minimum operations are not convincing, we have developed a different justification that is based on the goal of reasoning with graphical models. Details about this justification can be found in [Borgelt and Kruse 2002].
2.2 Graphical Representation
Decompositions can very conveniently be represented by graphs. In the first place, graphs can be used to specify the sets M of attributes underlying the decomposition. How this is done depends on whether the graph is directed or undirected. If it is undirected, the sets M are the maximal cliques of the graph, where a clique is a complete subgraph, which is called maximal if it is not a proper part of another complete subgraph. If the graph is directed, we can be more explicit about the distributions of the decomposition: We can employ conditional distributions, because the direction of the edges allows us to distinguish between conditioned and conditioning attributes. However, in the relational and the possibilistic case no changes result from this, since the conditional distributions are identical to their unconditional analogs (because in these calculi no renormalization is carried out). Secondly, graphs can be used to represent (conditional) dependences and independences via the notion of node separation. What is to be understood by “separation” again depends on whether the graph is directed or undirected. If it is undirected, node separation is defined as follows: If X, Y , and Z are three disjoint sets of nodes, then Z separates X and Y if all paths from a node in X to a node in Y contain a node in Z. For directed acyclic graphs node separation is defined as follows [Pearl 1988]: If X, Y , and Z are three disjoint sets of nodes, then Z separates X and Y if there is no path (disregarding the directionality of the edges) from a node in X to a node in Y along which the following two conditions hold: 1. Every node, at which the edges of the path converge, either is in Z or has a descendant in Z, and 2. every other node is not in Z. With the help of these separation criteria we can define conditional independence graphs: A graph is a conditional independence graph w.r.t. a given (multidimensional) distribution if it captures by node separation only valid conditional
independences. Conditional independence means (for three attributes A, B, and C with A being independent of C given B; the generalization is obvious) that P(A = a, C = c | B = b) = P(A = a | B = b) · P(C = c | B = b) in the probabilistic case and π(A = a, C = c | B = b) = min{π(A = a | B = b), π(C = c | B = b)} in the possibilistic and the relational case. These formulas also indicate that conditional independence and decomposability are closely connected. Formally, this connection is established by theorems, which state that a distribution is decomposable w.r.t. a given graph if the graph is a conditional independence graph. In the probabilistic case such a theorem is usually attributed to [Hammersley and Clifford 1971]. In the possibilistic case an analogous theorem can be proven, although some restrictions have to be introduced on the graphs [Gebhardt 1997, Borgelt and Kruse 2002]. Finally, the graph underlying a graphical model is very useful to derive evidence propagation algorithms, because transmitting evidence information can be implemented by node processors that communicate by sending messages to each other along the edges of the graph. Details about these methods can be found, for instance, in [Castillo et al. 1997].
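As an illustration of the undirected criterion, the following sketch (not from the paper; the small chain graph and all names are invented) tests whether a node set Z separates X from Y by checking that no path avoids Z. The directed criterion (d-separation) additionally needs the converging-node condition stated above and is not shown.

```python
from collections import deque

def separates(adj, X, Y, Z):
    """True if Z separates X from Y in the undirected graph `adj`
    (adjacency dict: node -> set of neighbours); X, Y, Z are disjoint
    node sets, and every path from X to Y must contain a node in Z."""
    blocked = set(Z)
    seen = set(X)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in Y:
            return False            # reached Y without passing through Z
        for nbr in adj[node]:
            if nbr not in blocked and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return True

# chain A - B - C: {B} separates {A} and {C}, the empty set does not
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(separates(adj, {"A"}, {"C"}, {"B"}))   # True
print(separates(adj, {"A"}, {"C"}, set()))   # False
```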
3 Learning Graphical Models from Data
Since a graphical model represents the dependences and independences that hold in a given domain of interest in a very clear way and allows for efficient reasoning, it is a very powerful tool—once it is constructed. However, its construction by human experts can be tedious and time-consuming. As a consequence recent research in graphical models has placed a strong emphasis on learning graphical models from a dataset of sample cases. Although it has been shown that this learning task is NP-hard in general [Chickering et al. 1994], some very successful heuristic algorithms have been developed [Cooper and Herskovits 1992, Heckerman et al. 1995, Gebhardt and Kruse 1995]. However, some of these approaches, especially probabilistic ones, are restricted to learning from precise data. That is, the description of the sample cases must contain neither missing values nor set-valued information. There must be exactly one value for each attribute in each of the sample cases. Unfortunately, this prerequisite is rarely met in applications: Real-world databases are often incomplete and useful imprecise information (sets of values for an attribute) is frequently available (even though it is often neglected, because common database systems cannot handle it adequately). Therefore we face the challenge to extend the existing learning algorithms to incomplete and imprecise data. Research in probabilistic graphical models tries to meet this challenge mainly with the expectation maximization (EM) algorithm [Dempster et al. 1977, Bauer et al. 1997]. In our own research, however, we focus on possibilistic graphical
models, because possibility theory [Dubois and Prade 1988] allows for a very convenient treatment of missing values and imprecise data. For possibilistic networks no iterative procedure like the EM algorithm is necessary, so that considerable gains in efficiency can result [Borgelt and Kruse 2002].
3.1 Learning Principles
There are basically three approaches to learning a graphical model from data:
– Test whether a given distribution is decomposable w.r.t. a given graph.
– Construct a conditional independence graph through conditional independence tests.
– Choose edges based on a measurement of the strength of marginal dependence of attributes.
Unfortunately, none of these approaches is perfect. The first approach fails because the number of possible graphs grows over-exponentially with the number of attributes, so it is impossible to inspect all of these graphs. The second approach usually starts from the strong assumption that the conditional independences can be represented perfectly and may require independence tests of high order, which are sufficiently reliable only if the datasets are very large. Examples in which the third approach yields a suboptimal result can easily be found [Borgelt and Kruse 2002]. Nevertheless, the second and the third approach, enhanced by additional assumptions, lead to good heuristic algorithms, which usually consist of two ingredients:
1. an evaluation measure (to assess the quality of a given model) and
2. a search method (to traverse the space of possible models).
This characterization is apt, even though not all algorithms search the space of possible graphs directly. For instance, some search for conditional independences and some for the best set of parents for a given attribute. Nevertheless, all employ some search method and an evaluation measure.
3.2 Computing Projections
Apart from the ingredients of a learning algorithm for graphical models that are mentioned in the preceding section, we need an operation for a technical task, namely the estimation of the conditional or marginal distributions from a dataset of sample cases. This operation is often neglected, because it is trivial in the relational and the probabilistic case, at least for precise data. In the former it is an operation of relational calculus (namely the relational projection operation, which is why we generally call this operation a projection); in the latter it consists in counting sample cases and computing relative frequencies. Only if imprecise information is present is this operation more complex. In this case the expectation maximization algorithm [Dempster et al. 1977, Bauer et al. 1997] is drawn upon, which can be fairly costly.
In possibility theory the treatment of imprecise information is much simpler, especially if it is based on the context model. In this case each example case can be seen as a context, which allows the imprecision to be handled conveniently within that context. Unfortunately, computing projections in the possibilistic case is also not without problems: there is no simple operation (like simple counting) with which the marginal possibility distribution can be derived directly from the dataset to learn from. A simple example illustrating this can be found in [Borgelt and Kruse 2002]. However, we have developed a preprocessing method, which computes the closure under tuple intersection of the dataset of sample cases. From this closure the marginal distributions can be computed with a simple maximum operation in a highly efficient way [Borgelt and Kruse 2002].
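The closure computation itself can be sketched as follows (an illustration only, with invented toy data; the subsequent maximum-based derivation of the marginal possibility distributions from the closure is not shown here):

```python
from itertools import combinations

def intersect(t1, t2):
    """Attribute-wise intersection of two set-valued tuples; None if the
    intersection is empty for some attribute."""
    result = {}
    for attr in t1:
        common = t1[attr] & t2[attr]
        if not common:
            return None
        result[attr] = common
    return result

def closure_under_intersection(cases):
    """Repeatedly add pairwise intersections until no new tuple appears."""
    def key(t):  # hashable representation of a set-valued tuple
        return tuple(sorted((a, frozenset(v)) for a, v in t.items()))
    closed = {key(t): t for t in cases}
    changed = True
    while changed:
        changed = False
        for t1, t2 in combinations(list(closed.values()), 2):
            t = intersect(t1, t2)
            if t is not None and key(t) not in closed:
                closed[key(t)] = t
                changed = True
    return list(closed.values())

# imprecise sample cases: a set of values encodes an imprecise observation
cases = [{"A": {"a1", "a2"}, "B": {"b1"}},
         {"A": {"a2"}, "B": {"b1", "b2"}}]
print(closure_under_intersection(cases))
```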
3.3 Evaluation Measures
An evaluation measure (or scoring function) serves to assess the quality of a given candidate graphical model w.r.t. a dataset of sample cases, so that it can be determined which model best fits the data. A desirable property of an evaluation measure is decomposability. That is, the quality of the model as a whole should be computable from local scores, for instance, scores for cliques or even scores for single edges. Most evaluation measures that exhibit this property measure the strength of dependence of attributes, because this is necessary for the second as well as the third approach to learning graphical models from data (cf. section 3.1), either to assess whether a conditional independence holds or to find the strongest dependences between attributes. For the probabilistic case there is a large variety of evaluation measures, which are based on a wide range of ideas and which have been developed for very different purposes. In particular all measures that have been developed for the induction of decision trees can be transferred to learning graphical models, even though this possibility is rarely fully recognized and exploited accordingly. In our research we have collected and studied several measures (e.g., information gain (ratio), Gini index, relieff measure, K2 metric and its generalization, minimum description length etc.). This collection together with detailed explanations of the underlying ideas can be found in [Borgelt and Kruse 2002]. Furthermore we have developed an extension of the K2 metric [Cooper and Herskovits 1992, Heckerman et al. 1995] and an extension of a measure that is based on the minimum description length principle [Rissanen 1983]. In these extensions we added a "sensitivity parameter", which enables us to control the tendency to add further edges to the model. Such a parameter has proven highly useful in applications (cf. the application at DaimlerChrysler, briefly described in section 4). Evaluation measures for possibilistic graphical models can be derived in two ways: In the first place, the close connection to relational networks can be exploited by drawing on the notion of an α-cut, which is well known from the theory of fuzzy sets [Kruse et al. 1994]. With this notion possibility distributions can be interpreted as a set of relations, with one relation for each possibility degree α. Then it is easy to see that a possibility distribution is decomposable if and only if each of its α-cuts is decomposable. As a consequence evaluation measures for
possibilistic graphical models can be derived from corresponding measures for relational graphical models by integrating over all possible values α. An example of such a measure is the specificity gain [Gebhardt 1997], which can be derived from the Hartley information gain [Hartley 1928], a measure for relational graphical models. Variants of the specificity gain, which result from different ways of normalizing it, are discussed in [Borgelt and Kruse 2002]. Another possibility to obtain evaluation measures for possibilistic networks is to form analogs of probabilistic measures. In these analogs usually a product is replaced by a minimum and a sum by a maximum. Examples of measures derived in this way can also be found in [Borgelt and Kruse 2002].
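For illustration, here is a minimal sketch of one such decomposable measure, the Shannon information gain (mutual information) of two attributes estimated from precise sample cases; the toy data and names are invented and this is not the authors' implementation:

```python
import math
from collections import Counter

def information_gain(pairs):
    """Shannon information gain (mutual information) of two attributes,
    estimated from a list of observed value pairs (a, b)."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    return sum((n_ab / n) * math.log2(n_ab * n / (marg_a[a] * marg_b[b]))
               for (a, b), n_ab in joint.items())

# toy sample cases over two attributes (e.g. colour and model)
data = [("red", "Golf"), ("red", "Golf"), ("yellow", "Z1"), ("red", "Polo")]
print(information_gain(data))
```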
3.4 Search Methods
The search method used determines which graphs are considered. Since an exhaustive search incurs prohibitively large costs due to the extremely high number of possible graphs, heuristic methods have to be drawn upon. These methods usually restrict the set of considered graphs considerably and use the value of the evaluation measure to guide the search. In addition, they are often greedy w.r.t. the model quality in order to speed up the search. The simplest search method is the construction of an optimal spanning tree for given edge weights. This method was used first by [Chow and Liu 1968] with Shannon information gain providing the edge weights. In the possibilistic case the information gain may be replaced with the abovementioned specificity gain in order to obtain an analogous algorithm [Gebhardt 1997]. However, almost all other measures (probabilistic as well as possibilistic) are usable as well. A straightforward extension of this method is a greedy search for parent nodes in directed graphs, which often starts from a topological order of the attributes that is fixed in advance: At the beginning the evaluation measure is computed for a parentless node. Then parents are added step by step, each time selecting the attribute that yields the highest value of the evaluation measure. The search is terminated if no other parent candidates are available, a user-defined maximum number of parents is reached, or the value of the evaluation measure does not improve anymore. This search method is employed in the K2 algorithm [Cooper and Herskovits 1992] together with the K2 metric as the evaluation measure. Like optimum weight spanning tree construction, this learning approach can easily be transferred to the possibilistic case by replacing the evaluation measure. In our research we have also developed two other search methods. The first starts from an optimal spanning tree (see above) and adds edges if conditional independences that are represented by the tree do not hold. However, the edges that may be added have to satisfy certain constraints, which ensure that the cliques of the resulting graph contain at most three nodes. In addition, these constraints guarantee that the resulting graph has hypertree structure. (A hypertree is an acyclic hypergraph, and in a hypergraph the restriction that an edge connects exactly two nodes is relaxed: A hyperedge may connect an arbitrary number of nodes.) The second method uses the well-known simulated annealing approach to learn a hypertree directly. The main problem in developing this approach
was to find a method for randomly generating and modifying hypertrees that is sufficiently unbiased. These two search methods are highly useful, because they allow us to control the complexity of later inferences with the graphical model at learning time. The reason is that this complexity depends heavily on the size of the hyperedges of the learned hypertree, which can be easily constrained in these approaches.
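The greedy parent search can be sketched generically as follows (an illustration in the spirit of the K2 scheme, not the authors' code; "score" stands for any decomposable evaluation measure such as the K2 metric, and all names are invented):

```python
def greedy_parent_search(node, candidates, score, max_parents=3):
    """Greedy selection of a parent set for `node`: repeatedly add the
    candidate parent that improves the evaluation measure most, and stop
    when nothing improves or the parent limit is reached."""
    parents = []
    best = score(node, parents)
    while len(parents) < max_parents:
        remaining = [c for c in candidates if c not in parents]
        if not remaining:
            break
        gains = [(score(node, parents + [c]), c) for c in remaining]
        new_best, best_cand = max(gains, key=lambda g: g[0])
        if new_best <= best:
            break
        parents.append(best_cand)
        best = new_best
    return parents
```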
4 Application
In a cooperation between the University of Magdeburg and the DaimlerChrysler corporation we had the opportunity to apply our algorithms for learning graphical models to a real-world car database. The objective of the analysis was to uncover possible causes for faults and damages. Although the chosen approach was very simple (we learned a two-layered network with one layer describing the equipment of the car and the other possible faults and damages), it was fairly successful. With a prototype implementation of several learning algorithms, we ran benchmark tests against human expert knowledge. We could easily and efficiently find hints to possible causes, which had taken human experts weeks to discover. The sensitivity parameters which we introduced into two evaluation measures (cf. section 3.3) turned out to be very important for this success.
References [Andersen et al. 1989] S.K. Andersen, K.G. Olesen, F.V. Jensen, and F. Jensen. HUGIN — A Shell for Building Bayesian Belief Universes for Expert Systems. Proc. 11th Int. J. Conf. on Artificial Intelligence (IJCAI’89, Detroit, MI, USA), 1080–1085. Morgan Kaufmann, San Mateo, CA, USA 1989 [Baldwin et al. 1995] J.F. Baldwin, T.P. Martin, and B.W. Pilsworth. FRIL — Fuzzy and Evidential Reasoning in Artificial Intelligence. Research Studies Press/J. Wiley & Sons, Taunton/Chichester, United Kingdom 1995 [Bauer et al. 1997] E. Bauer, D. Koller, and Y. Singer. Update Rules for Parameter Estimation in Bayesian Networks. Proc. 13th Conf. on Uncertainty in Artificial Intelligence (UAI’97, Providence, RI, USA), 3–13. Morgan Kaufmann, San Mateo, CA, USA 1997 [Borgelt and Kruse 2002] C. Borgelt and R. Kruse. Graphical Models — Methods for Data Analysis and Mining. J. Wiley & Sons, Chichester, United Kingdom 2002 [Castillo et al. 1997] E. Castillo, J.M. Gutierrez, and A.S. Hadi. Expert Systems and Probabilistic Network Models. Springer-Verlag, New York, NY, USA 1997 [Chickering et al. 1994] D.M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian Networks is NP-Hard (Technical Report MSR-TR-94-17). Microsoft Research, Advanced Technology Division, Redmond, WA, USA 1994 [Chow and Liu 1968] C.K. Chow and C.N. Liu. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Trans. on Information Theory 14(3):462–467. IEEE Press, Piscataway, NJ, USA 1968 [Cooper and Herskovits 1992] G.F. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 9:309–347. Kluwer, Dordrecht, Netherlands 1992
[Dempster et al. 1977] A.P. Dempster, N. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society (Series B) 39:1–38. Blackwell, Oxford, United Kingdom 1977 [Dubois and Prade 1988] D. Dubois and H. Prade. Possibility Theory. Plenum Press, New York, NY, USA 1988 [Dubois et al. 1996] D. Dubois, H. Prade, and R. Yager, eds. Fuzzy Set Methods in Information Engineering: A Guided Tour of Applications. J. Wiley & Sons, New York, NY, USA 1996 [Gebhardt 1997] J. Gebhardt. Learning from Data: Possibilistic Graphical Models. Habilitation Thesis, University of Braunschweig, Germany 1997 [Gebhardt and Kruse 1993] J. Gebhardt and R. Kruse. The Context Model — An Integrating View of Vagueness and Uncertainty. Int. Journal of Approximate Reasoning 9:283–314. North-Holland, Amsterdam, Netherlands 1993 [Gebhardt and Kruse 1995] J. Gebhardt and R. Kruse. Learning Possibilistic Networks from Data. Proc. 5th Int. Workshop on Artificial Intelligence and Statistics (Fort Lauderdale, FL, USA), 233–244. Springer-Verlag, New York, NY, USA 1995 [Gebhardt and Kruse 1996] J. Gebhardt and R. Kruse. POSSINFER — A Software Tool for Possibilistic Inference. In: [Dubois et al. 1996], 407–418 [Hartley 1928] R.V.L. Hartley. Transmission of Information. The Bell System Technical Journal 7:535–563. Bell Laboratories, Murray Hill, NJ, USA 1928 [Hammersley and Clifford 1971] J.M. Hammersley and P.E. Clifford. Markov Fields on Finite Graphs and Lattices. Unpublished manuscript, 1971. Cited in: [Isham 1981] [Heckerman 1991] D. Heckerman. Probabilistic Similarity Networks. MIT Press, Cambridge, MA, USA 1991 [Heckerman et al. 1995] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning 20:197–243. Kluwer, Dordrecht, Netherlands 1995 [Isham 1981] V. Isham. An Introduction to Spatial Point Processes and Markov Random Fields. Int. Statistical Review 49:21–43. Int. Statistical Institute, Voorburg, Netherlands 1981 [Kruse et al. 1994] R. Kruse, J. Gebhardt, and F. Klawonn. Foundations of Fuzzy Systems, J. Wiley & Sons, Chichester, United Kingdom 1994. [Lauritzen and Spiegelhalter 1988] S.L. Lauritzen and D.J. Spiegelhalter. Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society, Series B, 2(50):157–224. Blackwell, Oxford, United Kingdom 1988 [Pearl 1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, USA 1988 (2nd edition 1992) [Rissanen 1983] J. Rissanen. A Universal Prior for Integers and Estimation by Minimum Description Length. Annals of Statistics 11:416–431. Institute of Mathematical Statistics, Hayward, CA, USA 1983 [Saffiotti and Umkehrer 1991] A. Saffiotti and E. Umkehrer. PULCINELLA: A General Tool for Propagating Uncertainty in Valuation Networks. Proc. 7th Conf. on Uncertainty in Artificial Intelligence (UAI’91, Los Angeles, CA, USA), 323–331. Morgan Kaufmann, San Mateo, CA, USA 1991 [Shenoy 1992] P.P. Shenoy. Valuation-based Systems: A Framework for Managing Uncertainty in Expert Systems. In: [Zadeh and Kacprzyk 1992], 83–104 [Zadeh and Kacprzyk 1992] L.A. Zadeh and J. Kacprzyk. Fuzzy Logic for the Management of Uncertainty. J. Wiley & Sons, New York, NY, USA 1992
On the Eigenspectrum of the Gram Matrix and Its Relationship to the Operator Eigenspectrum John Shawe-Taylor1, Chris Williams2, Nello Cristianini3, and Jaz Kandola1 1 Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK 2 Division of Informatics, University of Edinburgh 3 Department of Statistics, University of California at Davis
Abstract. In this paper we analyze the relationships between the eigenvalues of the m × m Gram matrix K for a kernel k(·, ·) corresponding to a sample x1 , . . . , xm drawn from a density p(x) and the eigenvalues of the corresponding continuous eigenproblem. We bound the differences between the two spectra and provide a performance bound on kernel PCA.
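To make the objects of study concrete, here is a small numpy sketch (not from the paper; the kernel, sample size, and seed are arbitrary choices) that computes the Gram matrix eigenspectrum for a sample; the 1/m-scaled eigenvalues are the empirical counterparts of the operator eigenvalues the abstract refers to:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = k(x_i, x_j) for a Gaussian RBF kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # sample x_1, ..., x_m drawn from p(x)
K = rbf_gram(X)
eig = np.linalg.eigvalsh(K)[::-1]      # eigenspectrum of the Gram matrix
print(eig[:5] / len(X))                # 1/m-scaled leading eigenvalues
```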
The full version of this paper is published in the Proceedings of the 13th International Conference on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence Vol. 2533
In Search of the Horowitz Factor: Interim Report on a Musical Discovery Project Gerhard Widmer Department of Medical Cybernetics and Artificial Intelligence, University of Vienna, Austria, and Austrian Research Institute for Artificial Intelligence, Vienna [email protected]
Abstract. The paper gives an overview of an inter-disciplinary research project whose goal is to elucidate the complex phenomenon of expressive music performance with the help of machine learning and automated discovery methods. The general research questions that guide the project are laid out, and some of the most important results achieved so far are briefly summarized (with an emphasis on the most recent and still very speculative work). A broad view of the discovery process is given, from data acquisition issues through data visualization to inductive model building and pattern discovery. It is shown that it is indeed possible for a machine to make novel and interesting discoveries even in a domain like music. The report closes with a few general lessons learned and with the identification of a number of open and challenging research problems.
1 Introduction
The potential of machine learning and automatic scientific discovery for various branches of science has been convincingly demonstrated in recent years, mainly in the natural sciences ((bio)chemistry, genetics, physics, etc.; e.g., [10,11,17,25,29,30,31]). But is computer-based scientific discovery also possible in less easily quantifiable domains like the arts? This paper presents the latest results of a long-term interdisciplinary research project that uses AI technology to investigate one of the most fascinating — and at the same time highly elusive — phenomena in music: expressive music performance. We study how skilled musicians (concert pianists, in particular) make music ‘come alive’, how they express and communicate their understanding of the musical and emotional content of the pieces by shaping various parameters like tempo, timing, dynamics, articulation, etc. The reader will be taken on a grand tour of a complex discovery enterprise, from the intricacies of data gathering (which already require new AI methods) through novel approaches to data visualization all the way to automated data analysis and inductive learning. It turns out that even a seemingly intangible phenomenon like musical expression can be transformed into something that can be studied formally, and that the computer can indeed discover some fundamental (and sometimes surprising) principles underlying the art of music performance.
The purpose of this paper is to lay out the general research questions that guide the project, to summarize the most important results achieved so far (with an emphasis on the most recent and, in fact, still very speculative work), and to identify challenging open research questions for the field of knowledge discovery and discovery science. In the process, it will become clear that computer-based discovery in a ‘real-world’ domain like music is a complex, multi-stage process where each phase requires both intelligent data analysis techniques and creativity on the part of the researcher. The title of the paper refers to the late Vladimir Horowitz, one of the most famous pianists of the 20th century, who symbolizes like few others the fascination that great performers hold for the general audience. Formally explaining the secret behind the art and magic of such a great master would indeed be an extremely thrilling feat. Needless to say, that is highly unlikely to be possible, and we freely admit that we chose Horowitz’s name mainly for metaphoric purposes, to attract the reader’s attention. The ‘Horowitz Factor’ will not be revealed here (though we have recently also started to work with Horowitz data). Still, we do hope the following description of project and results will capture the reader’s imagination. The paper is organized as follows: section 2 presents the object of our investigations, namely, expressive music performance. Section 3 then gives an overview of the main lines of investigation currently pursued. Two different kinds of fundamental research questions will be identified there, and the two subsequent sections present our current results along these two dimensions. Section 4 is kept rather short (because these results have been published before). The main part of the paper is section 5, which describes a line of research currently under investigation (studying aspects of individual artistic style). Section 6 completes this report by trying to derive some general lessons for discovery science from our work so far, and by identifying a set of challenging research opportunities for further work in this area.
2 Expressive Music Performance
Expressive music performance is the art of shaping a musical piece by continuously varying important parameters like tempo, dynamics, etc. Human musicians do not play a piece of music mechanically, with constant tempo or loudness, exactly as written in the printed music score. Rather, they speed up at some places, slow down at others, stress certain notes or passages by various means, and so on. The most important parameter dimensions available to a performer (a pianist, in particular) are tempo and continuous tempo changes, dynamics (loudness variations), and articulation (the way successive notes are connected). Most of this is not specified in the written score, but at the same time it is absolutely essential for the music to be effective and engaging. The expressive nuances added by an artist are what makes a piece of music come alive (and what makes some performers famous).
Expressive variation is more than just a ‘distortion’ of the original (notated) piece of music. In fact, the opposite is the case: the notated music score is but a small part of the actual music. Not every intended nuance can be captured in a limited formalism such as common music notation, and the composers were and are well aware of this. The performing artist is an indispensable part of the system, and expressive music performance plays a central role in our musical culture. That is what makes it a central object of study in the field of musicology. Our approach to studying this phenomenon is data-driven: we collect recordings of performances of pieces by skilled musicians,1 measure aspects of expressive variation (e.g., the detailed tempo and loudness changes applied by the musicians), and search for patterns in these tempo, dynamics, and articulation data. The goal is interpretable models that characterize and ‘explain’ consistent regularities and patterns (if such should indeed exist). As we will see, that requires methods and algorithms from machine learning, data mining, pattern recognition, but also novel methods of intelligent music processing. Our research is meant to complement recent work (and the set of methods used) in contemporary musicology, which has largely been hypothesis-driven (e.g., [8,9,26,27,28,37]), although some researchers have also taken real data as the starting point of their investigations (e.g., [20,22,23,24]). In any case, our investigations are the most data-intensive empirical studies ever performed in the area of musical performance research.
3 Project Overview
The starting points for the following presentation are two generic types of research questions regarding expressive music performance. First, are there general, fundamental principles of music performance that can be discovered and characterized? Are there general (possibly unconscious and definitely unwritten) rules that all or most performers adhere to? In other words, to what extent can a performer's expressive actions be predicted? And second, is it possible to formally characterize and quantify aspects of individual artistic style? Can we describe formally what constitutes the special art of a Vladimir Horowitz, for instance? The first set of questions thus relates to similarities or commonalities between different performances and different performers, while the second focuses on the differences. The following project presentation is structured according to these two types of questions. Section 4 focuses on the commonalities and briefly recapitulates some of our recent work on learning general performance rules from data. The major part of this report is section 5, which describes currently ongoing (and very preliminary) work on the discovery of stylistic characteristics of great artists. Both of these lines of research are complex enterprises and comprise a number of important steps — from the acquisition and measuring of pertinent data to computer-based discovery proper. The presentation in sections 4 and 5 will be
1 At the moment, we restrict ourselves to classical tonal music, and to the piano.
[Figure 1 gives an overview of the project structure. Starting point: the mystery of expressive music performance. Two types of questions: commonalities / fundamental principles? (Section 4), covering data collection / data extraction from MIDI (4.1), discovery of note-level rules (4.2), learning of multi-level strategies (4.3), and an integrated multi-level model; and differences / individual artistic style? (Section 5), covering data collection / data extraction from audio (5.1), visualization of individual cases - the Worm! (5.2), transformation (segmentation & clustering) (5.3), visualization of global characteristics (5.3), string analysis: characteristic patterns (5.4), and the 'Horowitz Factor'???]
Fig. 1. Structure of project work presented in this paper (with pointers to relevant sections).
structured according to these main steps (see Figure 1). This presentation should give the reader an impression of the complexity of such a discovery process, and the efforts involved in each of the steps of the process.
4 Studying Commonalities: Searching for Fundamental Principles of Music Performance
The question we turn to first is the search for commonalities between different performances and performers: are there consistent patterns that occur in many performances and point to fundamental underlying principles? We are looking for general rules of music performance, and the methods used will come from the area of inductive machine learning. This section is kept rather short and only points to the most important results, because most of this work has already been published elsewhere [32,34, 35,36].
Fig. 2. Dynamics curves (relating to melody notes) of performances of the same piece (Frédéric Chopin, Etude op.10 no.3, E major) by three different Viennese pianists (computed from recordings on a Bösendorfer 290SE computer-monitored grand piano).
4.1 Data Acquisition: Measuring Expressivity in Performances
The first problem is data acquisition. What is required are precise measurements of the tempo, timing, dynamics, and articulation in a performance of a piece by a musician. In principle, we need to measure exactly when and how long and how loud each individual note was played, and how this deviated from the nominal values 'prescribed' in the written music score. Extracting this information with high precision from sound recordings is not possible, for basic signal processing reasons. Instead, our main source of information is special pianos that precisely record each action by a performer. In particular, the Bösendorfer 290SE is a full concert grand piano with a special mechanism that measures every key and pedal movement with high precision, and stores this information in a format similar to MIDI. (The piano also features a mechanical reproduction facility that can reproduce a recorded performance with very high accuracy.) From these measurements, and from a comparison of these with the notes as specified in the written score, every expressive nuance applied by a pianist can be computed. These nuances can be represented as expression curves. For instance, Figure 2 shows dynamics curves — the dynamics patterns produced by three different pianists in performing the same piece. More precisely, each point represents the relative loudness with which a particular melody note was played (relative to an averaged 'standard' loudness); a purely mechanical, unexpressive rendition of the piece would correspond to a perfectly flat horizontal line at y = 1.0. Variations in tempo and articulation can be represented in an analogous way. Figure 2 exhibits some clear common patterns and tendencies in the three performances. Despite individual differences between the recordings, there seem to be common strategies or 'rules' that are followed by the pianists, consciously or unconsciously. Obviously, there is hope for automated discovery algorithms to find some general principles.
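As a minimal illustration of this representation (not the project's actual tooling), a relative dynamics curve can be computed by normalizing measured per-note loudness values by their average; all values below are hypothetical.

```python
# Sketch: compute a relative dynamics curve for the melody notes of a performance.
# Each measured loudness is divided by the average loudness, so a mechanical,
# unexpressive rendition would yield a flat line at 1.0.

def relative_dynamics(loudness_values):
    """loudness_values: measured loudness of successive melody notes (hypothetical units)."""
    avg = sum(loudness_values) / len(loudness_values)
    return [v / avg for v in loudness_values]

# Hypothetical measurements for a short melody fragment:
measured = [58.0, 63.0, 71.0, 66.0, 54.0, 49.0]
print(relative_dynamics(measured))  # values around 1.0; peaks mark stressed notes
```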
4.2 Induction of Note-Level Performance Rules
And some such general principles have indeed been discovered, with the help of a new inductive rule learning algorithm named PLCG [34]. When trained on a large set of example performances (13 complete piano sonatas by W.A. Mozart, played on the Bösendorfer 290SE by the Viennese concert pianist Roland Batik — that is some four hours of music and more than 106,000 played notes), PLCG discovered a small set of 17 quite simple classification rules that predict a surprisingly large number of the note-level choices of the pianist.2 The rules have been published in the musicological literature [35] and have created quite a lot of interest in the musicology world. The surprising aspect is the high number of note-level actions that can be predicted by very few (and mostly very simple) rules. For instance, four rules were discovered that together correctly predict almost 23% of all the situations where the pianist lengthened a note relative to how it was notated (which corresponds to a local slowing down of the tempo).3 To give the reader an impression of the simplicity and generality of the discovered rules, here is an extreme example:
RULE TL2: abstract duration context = equal-longer & metr strength ≤ 1 ⇒ lengthen
"Given two notes of equal duration followed by a longer note, lengthen the note (i.e., play it more slowly) that precedes the final, longer one, if this note is in a metrically weak position ('metrical strength' ≤ 1)."
This is an extremely simple principle that turns out to be surprisingly general and precise: rule TL2 correctly predicts 1,894 cases of local note lengthening, which is 14.12% of all the instances of significant lengthening observed in the training data. The number of incorrect predictions is 588 (2.86% of all the counterexamples), which gives a precision (percentage of correct predictions) of .763. It is remarkable that one simple principle like this is sufficient to predict such a large proportion of observed note lengthenings in a complex corpus such as Mozart sonatas. This is a truly novel discovery; none of the existing theories of expressive performance were aware of this simple pattern.
2 By 'note-level' rules we mean rules that predict how a pianist is going to play a particular note in a piece — slower or faster than notated, louder or softer than its predecessor, staccato or legato. This should be contrasted with higher-level expressive strategies like the shaping of an entire musical phrase (e.g., with a gradual slowing down towards the end) — that will be addressed in section 4.3.
3 It should be clear that a coverage of close to 100% is totally impossible, not only because expressive music performance is not a perfectly deterministic, predictable phenomenon, but also because the level of individual notes is clearly insufficient as a basis for a complete model of performance; musicians think not (only) in terms of single notes, but also in terms of higher-level musical units such as motifs and phrases — see section 4.3.
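To make the form of such a note-level rule and its evaluation concrete, here is a small sketch; the attribute encoding is an assumption of ours, and only the counts (1,894 correct vs. 588 incorrect predictions) are taken from the text above.

```python
# Sketch: rule TL2 expressed as a predicate over a note's local context, plus the
# coverage/precision arithmetic reported above.

def tl2_predicts_lengthen(note):
    """note: dict with the two attributes used by rule TL2 (attribute names assumed)."""
    return (note["abstract_duration_context"] == "equal-longer"
            and note["metrical_strength"] <= 1)

true_positives, false_positives = 1894, 588
precision = true_positives / (true_positives + false_positives)
print(round(precision, 3))  # 0.763

example = {"abstract_duration_context": "equal-longer", "metrical_strength": 1}
print(tl2_predicts_lengthen(example))  # True -> predict local note lengthening
```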
Moreover, experiments revealed that most of these rules are highly general and robust: they carry over to other performers and even music of different styles with virtually no loss of coverage and precision. In fact, when the rules were tested on performances of quite different music (Chopin), they exhibited significantly higher coverage and prediction accuracy than on the original (Mozart) data they had been learned from. What the machine has discovered here really seems to be fundamental performance principles. A detailed discussion of the rules as well as a quantitative evaluation of their coverage and precision can be found in [35]; the learning algorithm PLCG is described and analyzed in [34].
4.3 Multi-level Learning of Performance Strategies
As already mentioned, not all of a performer’s decisions regarding tempo or dynamics can be predicted on a local, note-to-note basis. Musicians understand the music in terms of a multitude of more abstract patterns and structures (e.g., motifs, groups, phrases), and they use tempo and dynamics to ‘shape’ these structures, e.g., by applying a gradual crescendo (growing louder) or decrescendo (growing softer) to entire passages. In fact, music performance is a multi-level phenomenon, with musical structures and performance patterns at various levels embedded in each other. Accordingly, the set of note-level performance rules described above is currently being augmented with a multi-level learning strategy where the computer learns to predict elementary tempo and dynamics ‘shapes’ (like a gradual crescendo-decrescendo) at different levels of the hierarchical musical phrase structure, and combines these predictions with local timing and dynamics modifications predicted by learned note-level models. Preliminary experiments, again with performances of Mozart sonatas, yielded very promising results [36]. Just to give an idea, Figure 3 shows the predictions of the integrated learning algorithm on a test piece after learning from other Mozart sonatas. As can be seen in the lower part of the figure, the system manages to predict not only local patterns, but also higher-level trends (e.g., gradual increases of overall loudness) quite well. Sound examples will be made available at our project home page. (http://www.oefai.at/music). The ultimate goal of this work is an integrated model of expressive performance that combines note-level rules with structure-level strategies. Our current work along these lines focuses on new learning algorithms for discovering and predicting patterns that are hierarchically related to each other.
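One simple way to picture the combination of predictions from several levels is to treat every predicted shape as a relative factor per score position and multiply the factors; this is only an illustrative reading, not the actual learning scheme of [36], and all values below are made up.

```python
# Sketch: combine predicted expressive factors from several phrase levels with
# note-level predictions by multiplying them per score position (illustrative only).

def combine_levels(levels):
    """levels: list of lists, each giving one relative dynamics/tempo factor
    per score position (phrase-level shapes and note-level predictions)."""
    combined = [1.0] * len(levels[0])
    for level in levels:
        combined = [c * f for c, f in zip(combined, level)]
    return combined

phrase_level = [1.00, 1.05, 1.10, 1.05, 1.00]   # gradual crescendo-decrescendo
note_level   = [0.98, 1.02, 1.00, 1.04, 0.95]   # local stresses
print(combine_levels([phrase_level, note_level]))
```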
5 Studying Differences: Trying to Characterize Individual Artistic Style
The second set of questions guiding our research concerns the differences between individual artists. Can one characterize formally what is special about the style of a particular pianist? Contrary to the research on common principles described
Fig. 3. Learner’s predictions for the dynamics curve of Mozart Sonata K.280, 3rd movement, mm. 25–50. Top: dynamics shapes predicted for phrases at four levels; bottom: composite predicted dynamics curve resulting from phrase-level shapes and note-level predictions (grey) vs. pianist’s actual dynamics (black). Line segments at the bottom of each plot indicate hierarchical phrase structure.
in section 4 above, where we used mainly performances by local (though highly skilled) pianists, here we are explicitly interested in studying famous artists. Can we find the 'Horowitz Factor'? This may be the more intriguing question for the general audience, because it involves famous artists. However, the reader must be warned that this is a very difficult question. The following is work in progress, the current results are highly uncertain and incomplete, and the examples given below should be taken as indications of the kinds of things that we hope to discover, rather than truly significant discovery results.
5.1 Data Acquisition: Measuring Expressivity in Audio Recordings
The first major difficulty is data acquisition. With famous pianists, the only source of data is audio recordings, i.e., records and music CDs (we cannot very well invite them all to Vienna to perform on the Bösendorfer 290SE piano). Unfortunately, it is impossible, with current signal processing methods, to extract
precise performance information (start and end times, loudness, etc.) about each individual note directly from audio data. Thus, it will not be possible to perform studies at the same level of detail as those based on MIDI data. In particular, we cannot study how individual notes are played. What is currently possible is to extract tempo and dynamics at the level of the beat.4 That is, we extract those time points from the audio recordings that correspond to beat locations. From the (varying) time intervals between these points, the beat-level tempo and its changes can be computed. Beat-level dynamics is also computed from the audio signal as the overall loudness/amplitude of the signal at the beat times. The hard problem here is automatically detecting and tracking the beat in audio recordings. Indeed, this is an open research problem that forced us to develop a novel beat tracking algorithm [4]. Beat tracking, in a sense, is what human listeners do when they listen to a piece and tap their foot in time with the music. As with many other perception and cognition tasks, what seems easy and natural for a human turns out to be extremely difficult for a machine. The main problems to be solved are (a) detecting the onset times of musical events (notes, chords, etc.) in the audio signal, (b) deciding which of these events carry the beat (that includes determining the basic tempo, i.e., the basic rate at which beats are expected to occur), and (c) tracking the beat through tempo changes. The latter part is extremely difficult in classical music, where the performer may change the tempo drastically — a slowing down by 50% within one second is nothing unusual. It is difficult for a machine to decide whether an extreme change in inter-beat intervals is due to the performer's expressive timing, or whether it indicates that the algorithm's beat hypothesis was wrong. Experimental evaluations showed that our beat tracking algorithm is probably among the best currently available [6]. In systematic experiments with 13 Mozart sonatas, the algorithm achieved a correct tracking rate of over 90%. However, for our investigations we need a tracking accuracy of 100%, so we opted for a semi-automatic, interactive procedure. The beat tracking algorithm was integrated into an interactive computer program5 that takes a piece of music (a sound file), tries to track the beat, displays its beat hypotheses visually on the screen, allows the user to listen to (selected parts of) the tracked piece and modify the beat hypothesis by adding, deleting, or moving beat indicators, and then attempts to re-track the piece based on the updated information. This is still a very laborious process, but it is much more efficient than 'manual' beat tracking.
4 The beat is an abstract concept related to the metrical structure of the music; it corresponds to a kind of quasi-regular pulse that is perceived as such and that structures the music. Essentially, the beat is the time points where listeners would tap their foot along with the music. Tempo, then, is the rate or frequency of the beat and is usually specified in terms of beats per minute.
5 The program has been made publicly available and can be downloaded from http://www.oefai.at/~simon.
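Once beat times have been extracted, beat-level tempo follows directly from the inter-beat intervals; a minimal sketch (the beat times below are hypothetical):

```python
# Sketch: derive beat-level tempo (in bpm) from a list of tracked beat times (seconds).
# The tempo at beat i is computed from the interval to the next beat.

def beat_level_tempo(beat_times):
    return [60.0 / (t2 - t1) for t1, t2 in zip(beat_times, beat_times[1:])]

beats = [0.00, 0.52, 1.02, 1.55, 2.15, 2.80]   # hypothetical tracked beat times
print([round(t, 1) for t in beat_level_tempo(beats)])
# [115.4, 120.0, 113.2, 100.0, 92.3] -- a gradual slowing down
```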
After a recording has been processed in this way, tempo and dynamics at the beat level can be easily computed. That is the input data to the next processing step.
5.2 Data Visualization: The Performance Worm
To facilitate visual analysis and an intuitive musical interpretation of the tempo and dynamics measurements, a new representation and, based on that, a visualization system were developed, using a recent idea and method by the German musicologist Jörg Langner [13]. The two basic ideas are to integrate tempo and dynamics into a single representation, and to compute smoothed trajectories from the beat-level measurements. Given a series of measurement points (tempo or loudness values) over time, the series is smoothed using overlapping Gaussian windows. The motivation for this choice (which, among other things, has to do with issues of musical and visual perception) and the precise method are described in [14]. The smoothed sequences of tempo and dynamics values are then combined into a joint sequence of pairs of coordinates, which represent the development of tempo and dynamics over time as a trajectory in a two-dimensional space, where tempo and dynamics define the axes, and the time dimension is implicit in the trajectory. The method has been implemented in an interactive visualization system that we call the Performance Worm [7]. The Worm can play a given
Fig. 4. Snapshot of the Performance Worm at work: first four bars of Daniel Barenboim's performance of Mozart's F major sonata K.332, 2nd movement. Horizontal axis: tempo in beats per minute (bpm); vertical axis: loudness in sone (a psycho-acoustic measure [38]). Movement to the upper right indicates a speeding up (accelerando) and a loudness increase (crescendo), etc. The darkest point represents the current instant, while instants further in the past appear fainter.
Fig. 5. A complete worm: smoothed tempo-loudness trajectory representing a performance of Mozart’s F major sonata K.332, 2nd movement, by Mitsuko Uchida. Horizontal axis: tempo in beats per minute (bpm); vertical axis: loudness in sone [38].
recording and, in parallel, show the movement of the trajectory in an animated display. Figure 4 shows an example of the Worm at work. This is a very intuitive representation that facilitates a direct appreciation and understanding of the performance strategies applied by different pianists. The subsequent analysis steps will be based on this representation. We are currently investing large efforts into measuring tempo and dynamics in recordings of different pieces by different famous pianists, using the interactive beat tracking system mentioned above. One such data collection that will be referred to in the following consists of five complete piano sonatas by W.A. Mozart (54 separate sections), as played by the five pianists Daniel Barenboim, Glenn Gould, Maria João Pires, András Schiff, and Mitsuko Uchida. Beat-tracking these 54 × 5 recordings took roughly two person-months. Figure 5 shows a complete tempo-loudness trajectory representing a performance of one movement of a piano sonata by Mitsuko Uchida.
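The smoothing-and-pairing step behind such trajectories can be sketched as follows; the Gaussian window width and the beat-level values are assumptions of ours, and the actual procedure of [13,14] differs in detail.

```python
import math

# Sketch: smooth beat-level tempo and loudness series with overlapping Gaussian
# windows and pair them into a tempo-loudness trajectory (one 2-D point per beat).

def gaussian_smooth(values, sigma=2.0):
    smoothed = []
    for i in range(len(values)):
        weights = [math.exp(-((i - j) ** 2) / (2 * sigma ** 2)) for j in range(len(values))]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed

tempo    = [118, 122, 110, 95, 100, 112]   # bpm per beat (hypothetical)
loudness = [22, 25, 27, 24, 20, 21]        # sone per beat (hypothetical)
trajectory = list(zip(gaussian_smooth(tempo), gaussian_smooth(loudness)))
print(trajectory)  # successive (tempo, loudness) points of the 'Worm'
```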
5.3 Transforming the Problem: Segmentation, Clustering, Cluster Visualization
Instead of analyzing the raw tempo-loudness trajectories directly, we chose to pursue an alternative route, namely, to transform the data representation and thus the entire discovery problem into a form that is accessible to common inductive learning and data mining algorithms. To that end, the performance trajectories are segmented into short segments of fixed length (e.g., two beats); these segments are optionally subjected to various normalization operations (e.g., mean and/or variance normalization to abstract away from absolute tempo and loudness and/or absolute pattern size, respectively). The resulting segments are
Fig. 6. A 'Mozart performance alphabet' (cluster prototypes) computed by segmentation and clustering from performances of Mozart piano sonatas by five famous pianists (Daniel Barenboim, Glenn Gould, Maria João Pires, András Schiff, Mitsuko Uchida). To indicate directionality, dots mark the end points of segments.
then grouped into classes of similar patterns via clustering. For each of the resulting clusters, a prototype (centroid) is computed. These prototypes represent a set of typical elementary tempo-loudness patterns that can be used to approximately reconstruct a 'full' trajectory (i.e., a complete performance). In that sense, they can be seen as a simple alphabet of performance (restricted to tempo and dynamics). Figure 6 shows a set of prototypical patterns computed from the above-mentioned set of Mozart sonata performances. The number of alphabets one could compute is potentially infinite, due to a multitude of existing clustering algorithms with their parameters (in particular, most clustering algorithms require an a priori specification of the number of desired clusters) and possible normalization operations. In order to reduce the number of degrees of freedom (and arbitrary decisions to be made), extensive research is currently under way on non-parametric methods for determining an optimal number of clusters. Particular candidates are information-theoretic approaches based on the notion of minimum encoding length [19,15]. The particular clustering shown in Figure 6 was generated by a Self-Organizing Map (SOM) algorithm [12]. A SOM produces a geometric layout of the clusters on a two-dimensional grid or map, attempting to place similar clusters close to each other. That property facilitates a simple, intuitive visualization method. The basic idea, named Smoothed Data Histograms (SDH), is to visualize the cluster distribution in a given data set by estimating the probability density of the high-dimensional data on the map (see [21] for details). Figure 7 shows how this can be used to visualize the frequencies with which certain
Fig. 7. Visualization, based on ‘smoothed data histograms’, of the distribution of stylistic patterns in Mozart performances by four pianists (see Figure 6 for the corresponding cluster map). Bright areas indicate high density.
pianists use elementary expressive patterns (trajectory segments) from the various clusters. The differences are striking: the most common stylistic elements of, say, a Daniel Barenboim seem to be quite different from those of András Schiff, for instance.
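A minimal sketch of the segment-normalize-cluster step described in this section; k-means stands in for the SOM used in the actual experiments, and the trajectory values are placeholders.

```python
from sklearn.cluster import KMeans

# Sketch: cut a tempo-loudness trajectory into fixed-length segments, mean-normalize
# each segment per dimension, and cluster the segments into prototype shapes.

def segment_and_normalize(trajectory, length=2):
    """trajectory: list of (tempo, loudness) points; returns flattened,
    mean-normalized segments (to abstract away from absolute level)."""
    segs = []
    for i in range(0, len(trajectory) - length + 1, length):
        part = trajectory[i:i + length]
        t_mean = sum(p[0] for p in part) / length
        l_mean = sum(p[1] for p in part) / length
        segs.append([v for p in part for v in (p[0] - t_mean, p[1] - l_mean)])
    return segs

trajectory = [(118, 22), (122, 25), (110, 27), (95, 24), (100, 20), (112, 21)]
segments = segment_and_normalize(trajectory, length=2)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(segments)
print(kmeans.labels_)            # cluster membership of each segment
print(kmeans.cluster_centers_)   # the prototype shapes (the 'performance alphabet')
```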
5.4 Structure Discovery in Musical Strings
The SDH cluster visualization method gives some insight into very global aspects of performance style; it does show that there are indeed significant differences between the pianists. But we want to get more detailed insight into characteristic patterns and performance strategies. To that end, another (trivial) transformation is applied to the data. We can take the notion of an alphabet literally and associate each prototypical elementary tempo-dynamics shape (i.e., each cluster prototype) with a letter. A full performance — a complete trajectory in tempo-dynamics space — can be approximated by a sequence of elementary prototypes, i.e., a sequence of letters, i.e., a string. Figure 8 shows a part of a performance of a Mozart sonata movement, coded in terms of such an alphabet. This final transformation step, trivial as it may be, makes it evident that our original musical problem has now been transferred into a quite different world: the world of string analysis. The fields of pattern recognition, machine learning, data mining, etc. have developed a rich set of methods that can find structure in strings and that could now profitably be applied to our musical data.6
6 However, it is also clear that through this long sequence of transformation steps — smoothing, segmentation, normalization, replacing individual elementary patterns by a prototype — a lot of information has been lost. It is not clear at this point whether this reduced data representation still permits truly significant discoveries. In any case, whatever kinds of patterns may be found in this representation will have to be tested for musical significance in the original data.
PSWPKECSNRVTORRWTTOICXPMATPGFRQIFBDHDTPQIEEFECDLGCSQIEETPOVTHGA
RQMEECDSNQQRTPSMFATLHHDXTOIFEATTPLGARWTPPLHRNMFAWHGARQMCDHFAQHH
DNIERTPKFRTPPPKFRWPRWPLRQVTOECSQDJRVTTPQICURWDSKFERQICDHDTTPQEE
CGFEWPIEXPMFEEXTTTHJARQARCDSNQEUVTPNIEXPHHDDSRQIFECXTPOMAVTHDTP
HDSPQFAVHDURWHJEEFBHGFEFARVPLARWTPNIEVTPPTPHRMEARIEXPPTTOQICDIB
TLJBJRRRWSRVTTPLTPNIEETPPXOIFEEECDORTTOQMBURWHFBDTPKFECHJFRTKUT
PMFETORRLXPNQCTPPXSIEBDHHDNRQICDSQEDTTOMCDJQIEECSOTLHDNMARVPPTN
QUVMIBDTPNQWTTLURQIEFEASRTPLJFEEUQBRVTORTOIEAMEDTPPHDNQIDGEHJRR
VTLGARQCHGAVPNRVHDURWTPQIEUVLGFECDDPNQIEEXHHDJFFEURTTLGFEEANQBS
.............................
Fig. 8. Beginning of a performance by Daniel Barenboim (W.A. Mozart, piano sonata K.279 in C major) coded in terms of a simple 24-letter 'performance alphabet' derived from a clustering of performance trajectory segments.
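The coding step itself is straightforward: each segment is assigned the letter of its nearest cluster prototype. A minimal sketch, with hypothetical prototypes and segments:

```python
import string

# Sketch: turn a sequence of (normalized) trajectory segments into a string by
# assigning each segment the letter of its nearest cluster prototype.

def encode(segments, prototypes):
    letters = string.ascii_uppercase
    coded = []
    for seg in segments:
        dists = [sum((a - b) ** 2 for a, b in zip(seg, proto)) for proto in prototypes]
        coded.append(letters[dists.index(min(dists))])
    return "".join(coded)

prototypes = [[-2.0, -1.5, 2.0, 1.5], [3.0, 0.5, -3.0, -0.5]]   # hypothetical alphabet
segments   = [[-1.8, -1.0, 1.8, 1.0], [2.5, 0.4, -2.5, -0.4], [-2.2, -1.2, 2.2, 1.2]]
print(encode(segments, prototypes))   # 'ABA'
```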
There is a multitude of questions one might want to ask of these musical strings. For instance, one might search for local patterns (substrings) that are characteristic of a particular pianist (that will be demonstrated below). One might search for general, frequently occurring substrings (i.e., more extended performance patterns) that are typical components of performances ('stylistic clichés', so to speak). Using such frequent patterns as building blocks, one might try to use machine learning algorithms to induce (partial) grammars of musical performance style, either deterministic (e.g., [18]) or probabilistic (e.g., [2]). A broad range of methods is available. Whether their application produces results that make musical sense remains to be investigated.
A First Step: Discovering Discriminative Frequent Substrings. As a first simple example, let us consider the following question: are there substrings in these musical performance strings that exhibit a certain minimum frequency and that are characteristic of a particular pianist? In data mining terms, these could be called discriminative frequent sequences. Assume we have p pianists. If we represent each pianist by one string built by concatenating all the pianist's performances, the problem can be stated as follows: let S = {S1, ..., Sp} be a set of strings (each representing a pianist), and let occ(Xi, Sj) denote the set of occurrences of some character sequence Xi in a string Sj; let nij = |occ(Xi, Sj)| and Ni = Σj=1,...,p nij (i.e., the total number of occurrences of Xi over all strings Sj ∈ S). Given a pre-defined minimum frequency threshold Θ, the goal then is to find a set of character sequences
X = {Xi : ∃j [ nij ≥ Θ ∧ nik = 0 ∀k ≠ j ]},     (1)
that is, sequences with a certain minimum frequency that occur only in one string (i.e., in performances by one pianist). In reality, sequences that perfectly single
out one pianist from the others will be highly unlikely, so instead of requiring uniqueness of a pattern to a particular pianist, we will be searching for patterns that exhibit a certain level of discriminatory power; let us call these approximately discriminative frequent sequences. As a measure of the discriminatory power of a sequence we will use an entropy-like measure (see below). Approximately discriminative frequent sequences can be found by a simple two-stage algorithm. In the first step, all (maximal) sequences that are frequent overall (i.e., all Xi with Ni > Θ) are found by a straightforward version of the exhaustive level-wise search that is used in data mining for the discovery of association rules [1] and frequent episodes [16]. In the second step, the frequent sequences are sorted in ascending order according to the 'entropy' of their distribution over the pianists, as defined in Equation 2:
E(Xi) = Σj=1,...,p −(nij / Ni) · log2(nij / Ni),     (2)
and then we simply select the first k (the ones with the lowest entropy) from the sorted list as the most discriminative ones. These will tend to be patterns that occur frequently in one or few pianists, and rarely in the others. As we currently have to inspect (and listen to) all these patterns in order to find out which ones are musically interesting (we have no operational mathematical definition of musical ‘interestingness’), this crude method of ‘look at the k most discriminative ones’ is good enough for our current purposes. Also, first searching for all frequent sequences and then selecting the ones with highest discriminative power may seem an inefficient method, but is sufficient for our current situation with strings of only some tens of thousands of characters. It would be straightforward to devise a more efficient algorithm that uses a discriminativity (e.g., entropy) threshold for pruning the search for frequent sequences. In very preliminary experiments with our corpus of Mozart performances (5 sonatas, 54 sections, 5 pianists), coded in a variety of different ‘performance alphabets’, a number of sequences were discovered that appear to be discriminative (at least in our limited corpus of data) and also look like they might be musically interesting. For example, in the alphabet used in the encoding of Figure 8, the sequence FAVT came up as a typical Barenboim pattern, with 7 occurrences in Barenboim’s Mozart performances, 2 in Pires, 1 in Uchida, and 0 in Schiff and Gould. To find out whether such a sequence codes any musically interesting or interpretable behaviour, we can go back to the original data (the tempo-loudness trajectories) and identify the segments of the trajectories coded by the various occurrences of the sequence. As Figure 9 (left part) shows, what is coded by the letter sequence FAVT in Daniel Barenboim’s performances of Mozart is an increase in loudness (a crescendo), followed by a slight tempo increase (accelerando), followed by a decrease in loudness (decrescendo) with more or less constant tempo. That is indeed a rather unusual pattern. In our experience so far, it is quite rare to see a pianist speed up during a loudness maximum. Much
Fig. 9. Two sets of (instantiations of) performance patterns: FAVT sequence typical of Daniel Barenboim (left) and SC pattern (in a different alphabet) from Mitsuko Uchida (right). To indicate directionality, a dot marks the end point of a segment.
more common in such situations are slowings down (ritardandi), which gives a characteristic counterclockwise movement of the Worm (as, e.g., in the right half of Figure 9, which shows instantiations of a pattern that seems characteristic of the style of Mitsuko Uchida (8 occurrences vs. 0 in all the other pianists)). But before being carried away by musical interpretations and hypotheses, we must remind ourselves once more that the current data situation is too limited to draw serious conclusions, and determining the musical significance of such patterns will be a complex problem. The absolute numbers (7 or 8 occurrences of a supposedly typical pattern in a pianist) are too small to support claims regarding statistical significance. Also, we cannot say with certainty that similar patterns do not occur in the performances by the other pianists just because they do not show up as substrings there — they might be coded by a slightly different character sequence! And finally, many alternative performance alphabets could be computed; we currently have no objective criteria for determining the optimal one in any sense. So the reader is cautioned not to take any of the patterns shown here too literally. They are only indicative of the kinds of things we hope to discover with our methods in the (near) future. Whether these findings will indeed be musically relevant can only be hoped for at the moment.
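For concreteness, the two-stage search for approximately discriminative frequent sequences (Equation 2) can be sketched as follows; the substring enumeration here is naive rather than level-wise, and the input strings are toy data, not our performance corpus.

```python
from math import log2

# Sketch: find substrings frequent over all pianists' strings, then rank them by the
# entropy of their distribution over pianists (low entropy = more discriminative).

def count_occurrences(pattern, s):
    return sum(1 for i in range(len(s) - len(pattern) + 1) if s[i:i + len(pattern)] == pattern)

def discriminative_sequences(strings, length, theta, k):
    candidates = {s[i:i + length] for s in strings for i in range(len(s) - length + 1)}
    ranked = []
    for x in candidates:
        counts = [count_occurrences(x, s) for s in strings]
        total = sum(counts)
        if total >= theta:
            entropy = -sum((c / total) * log2(c / total) for c in counts if c > 0)
            ranked.append((entropy, x, counts))
    return sorted(ranked)[:k]     # the k lowest-entropy frequent sequences

pianists = ["FAVTQQFAVTRRFAVT", "QQRRSSQQRR", "SSQQFAVTSS"]   # toy strings, one per pianist
print(discriminative_sequences(pianists, length=4, theta=3, k=3))
```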
6 Conclusions
Expressive music performance is a complex phenomenon, and what has been achieved and discovered so far are only tiny parts of a big mosaic. Still, we do feel that the project has already produced a number of results that are interesting and justify this computer-based discovery approach. Regarding the induction of note-level rules (section 4.2), it seems safe to say that this is the first time a machine has made significant discoveries in music. Some of the rules were new to musicologists and complement other rule-based
musicological models of performance [8]. Extending the rule set with multi-level learning of more abstract patterns, as indicated in section 4.3, will lead to an operational model that can actually produce musically sensible interpretations (whether that has any practical applications is currently open). And our preliminary work on performance visualization and pattern extraction (sections 5.2, 5.3, and 5.4) does seem to indicate that it will also be possible to get completely new kinds of insight into the style of great artists. Along the way, a number of novel methods of potentially general benefit were developed: the beat tracking algorithm and the interactive tempo tracking system [4], the Performance Worm [7] (with possible applications in music education and analysis), and the PLCG rule learning algorithm [34]. Also, we have compiled what is most likely the largest database of high-quality performance measurements, both in the form of MIDI measurements and beat-tracked sound recordings. Such data collections will be an essential asset in future research. However, the list of limitations and open problems is much longer, and it seems to keep growing with every step forward we take. One limiting factor is the measuring problem. Currently, it is only possible to extract rather crude and global information from audio recordings; we cannot get at details like timing, dynamics, and articulation of individual voices or individual notes. This is a challenge for signal analysis and music processing, and we are indeed working on improved intelligent expression extraction methods. On the data analysis and discovery side, the preliminary pattern discovery work described in section 5.4 above has only looked at the strings (i.e., the performance trajectories) themselves, without regard to the underlying music. Obviously, the next step must be to search for systematic relations between patterns in the performance trajectories, and the content and structure of the music that is being played. Also, it is clear that most of the secrets of the great masters' style — if they are at all describable — will not be hidden in the kinds of local patterns that were just described. Important aspects of structural music interpretation and high-level performance strategies will only become visible at higher abstraction levels. These questions promise to be a rich source of challenges for sequence analysis and pattern discovery. Things we are currently thinking about include generalised notions of frequent sequences (e.g., based on a notion of graded letter similarity), methods for discovering frequent subpatterns directly from the numeric time series, unsupervised learning of useful abstract building blocks from sequences, algorithms for inducing partial grammars (which may be allowed to account for only parts of a string), methods for learning the connection between grammar rules and properties of the underlying music, and methods for decomposing complex trajectories into more elementary patterns at different structural levels, which may have different kinds of 'explanations' and thus require different models. We should also look into work on related phenomena in language, specifically, prosody, which might offer useful methods for our musical problem.
In view of all this, this project will most likely never be finished. But much of the beauty of research is in the process, not in the final results, and we do hope that our sponsors share this view and will keep supporting what we believe is an exciting research adventure.
Acknowledgements. The project is made possible by a very generous START Research Prize by the Austrian Federal Government, administered by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung (FWF) (project no. Y99-INF). Additional support for our research on AI, machine learning, scientific discovery, and music is provided by the European project HPRN-CT2000-00115 (MOSART) and the EU COST Action 282 (Knowledge Exploration in Science and Technology). The Austrian Research Institute for Artificial Intelligence acknowledges basic financial support by the Austrian Federal Ministry for Education, Science, and Culture. Thanks to Johannes Fürnkranz for his implementation of the level-wise search algorithm. I am indebted to my project team Simon Dixon, Werner Goebl, Elias Pampalk, and Asmir Tobudic for all their great work.
References
1. Agrawal, R., and Srikant, R. (1994). Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago, Chile.
2. Chen, S.F. (1995). Bayesian Grammar Induction for Language Modeling. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 228–235.
3. Dietterich, T.G. (2000). Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (Eds.), First International Workshop on Multiple Classifier Systems. New York: Springer Verlag.
4. Dixon, S. (2001). Automatic Extraction of Tempo and Beat from Expressive Performances. Journal of New Music Research 30(1), 39–58.
5. Dixon, S. and Cambouropoulos, E. (2000). Beat Tracking with Musical Knowledge. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000), IOS Press, Amsterdam.
6. Dixon, S. (2001). An Empirical Comparison of Tempo Trackers. In Proceedings of the VIII Brazilian Symposium on Computer Music (SBCM'01), Fortaleza, Brazil.
7. Dixon, S., Goebl, W., and Widmer, G. (2002). The Performance Worm: Real Time Visualisation of Expression Based on Langner's Tempo-Loudness Animation. In Proceedings of the International Computer Music Conference (ICMC'2002), Göteborg, Sweden.
8. Friberg, A. (1995). A Quantitative Rule System for Musical Performance. Doctoral Dissertation, Royal Institute of Technology (KTH), Stockholm, Sweden.
9. Gabrielsson, A. (1999). The Performance of Music. In D. Deutsch (Ed.), The Psychology of Music (2nd ed., pp. 501–602). San Diego: Academic Press.
10. Hunter, L. (ed.) (1993). Artificial Intelligence and Molecular Biology. Menlo Park, CA: AAAI Press.
11. King, R.D., Muggleton, S., Lewis, R.A., and Sternberg, M.J.E. (1992). Drug Design by Machine Learning: The Use of Inductive Logic Programming to Model the Structure-activity Relationship of Trimethoprim Analogues Binding to Dihydrofolate Reductase. In Proceedings of the National Academy of Sciences, Vol. 89, pp. 11322–11326.
12. Kohonen, T. (2001). Self-Organizing Maps, 3rd edition. Berlin: Springer Verlag.
13. Langner, J. and Goebl, W. (2002). Representing Expressive Performance in Tempo-Loudness Space. In Proceedings of the ESCOM Conference on Musical Creativity, Liège, Belgium.
14. Langner, J., and Goebl, W. (2002). Visualizing Expressive Performance in Tempo-Loudness Space. Submitted to Computer Music Journal. Preliminary version available on request from the author.
15. Ludl, M. and Widmer, G. (2002). Towards a Simple Clustering Criterion Based on Minimum Length Encoding. In Proceedings of the 13th European Conference on Machine Learning (ECML'02), Helsinki. Berlin: Springer Verlag.
16. Mannila, H., Toivonen, H., and Verkamo, I. (1997). Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery 1(3), 259–289.
17. Muggleton, S., King, R.D., and Sternberg, M.J.E. (1992). Protein Secondary Structure Prediction Using Logic-based Machine Learning. Protein Engineering 5(7), 647–657.
18. Nevill-Manning, C.G. and Witten, I.H. (1997). Identifying Hierarchical Structure in Sequences: A Linear-time Algorithm. Journal of Artificial Intelligence Research 7, 67–82.
19. Oliver, J., Baxter, R., and Wallace, C. (1996). Unsupervised Learning Using MML. In Proceedings of the 13th International Conference on Machine Learning (ICML'96). San Francisco, CA: Morgan Kaufmann.
20. Palmer, C. (1988). Timing in Skilled Piano Performance. Ph.D. Dissertation, Cornell University.
21. Pampalk, E., Rauber, A., and Merkl, D. (2002). Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps. In Proceedings of the International Conference on Artificial Neural Networks (ICANN'2002), Madrid.
22. Repp, B. (1992). Diversity and Commonality in Music Performance: An Analysis of Timing Microstructure in Schumann's 'Träumerei'. Journal of the Acoustical Society of America 92(5), 2546–2568.
23. Repp, B.H. (1998). A Microcosm of Musical Expression: I. Quantitative Analysis of Pianists' Timing in the Initial Measures of Chopin's Etude in E major. Journal of the Acoustical Society of America 104, 1085–1100.
24. Repp, B.H. (1999). A Microcosm of Musical Expression: II. Quantitative Analysis of Pianists' Dynamics in the Initial Measures of Chopin's Etude in E major. Journal of the Acoustical Society of America 105, 1972–1988.
25. Shavlik, J.W., Towell, G., and Noordewier, M. (1992). Using Neural Networks to Refine Biological Knowledge. International Journal of Genome Research 1(1), 81–107.
26. Sundberg, J. (1993). How Can Music Be Expressive? Speech Communication 13, 239–253.
27. Todd, N. (1989). Towards a Cognitive Theory of Expression: The Performance and Perception of Rubato. Contemporary Music Review, vol. 4, pp. 405–416.
28. Todd, N. (1992). The Dynamics of Dynamics: A Model of Musical Expression. Journal of the Acoustical Society of America 91, pp. 3540–3550.
29. Valdés-Pérez, R.E. (1995). Machine Discovery in Chemistry: New Results. Artificial Intelligence 74(1), 191–201.
30. Valdés-Pérez, R.E. (1996). A New Theorem in Particle Physics Enabled by Machine Discovery. Artificial Intelligence 82(1-2), 331–339.
31. Valdés-Pérez, R.E. (1999). Principles of Human-Computer Collaboration for Knowledge Discovery in Science. Artificial Intelligence 107(2), 335–346.
32. Widmer, G. (2001). Using AI and Machine Learning to Study Expressive Music Performance: Project Survey and First Report. AI Communications 14(3), 149–162.
33. Widmer, G. (2001). The Musical Expression Project: A Challenge for Machine Learning and Knowledge Discovery. In Proceedings of the 12th European Conference on Machine Learning (ECML'01), Freiburg, Germany. Berlin: Springer Verlag.
34. Widmer, G. (2001). Discovering Strong Principles of Expressive Music Performance with the PLCG Rule Learning Strategy. In Proceedings of the 12th European Conference on Machine Learning (ECML'01), Freiburg, Germany. Berlin: Springer Verlag.
35. Widmer, G. (2002). Machine Discoveries: A Few Simple, Robust Local Expression Principles. Journal of New Music Research 31(1) (in press).
36. Widmer, G., and Tobudic, A. (2002). Playing Mozart by Analogy: Learning Multi-level Timing and Dynamics Strategies. Journal of New Music Research (to appear).
37. Windsor, L. and Clarke, E. (1997). Expressive Timing and Dynamics in Real and Artificial Musical Performances: Using an Algorithm as an Analytical Tool. Music Perception 15, 127–152.
38. Zwicker, E. and Fastl, H. (2001). Psychoacoustics. Facts and Models. Springer Series in Information Sciences, Vol. 22. Berlin: Springer Verlag.
Learning Structure from Sequences, with Applications in a Digital Library
Ian H. Witten
Department of Computer Science, University of Waikato, Hamilton, New Zealand
[email protected]
Abstract. The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This paper reviews some recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text, and puts it in the context of digital library applications. We describe three areas: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to generic entity extraction, word segmentation, and acronym extraction; and keyphrase extraction.
The full version of this paper is published in the Proceedings of the 13th International Conference on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence Vol. 2533
S. Lange, K. Satoh, and C.H. Smith (Eds.): DS 2002, LNCS 2534, p. 33, 2002. © Springer-Verlag Berlin Heidelberg 2002
Discovering Frequent Structured Patterns from String Databases: An Application to Biological Sequences
Luigi Palopoli and Giorgio Terracina
DIMET - Università di Reggio Calabria, Località Feo di Vito, 89100 Reggio Calabria, Italy
{palopoli,terracina}@ing.unirc.it
Abstract. In recent years, the completion of the human genome sequencing has raised a wide range of new challenging issues involving raw data analysis. In particular, the discovery of information implicitly encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually represented by patterns frequently occurring in the sequences. On the basis of biological observations, a specific class of patterns is becoming particularly interesting: frequent structured patterns. In this respect, it is biologically meaningful to look at both "exact" and "approximate" repetitions of the patterns within the available sequences. This paper gives a contribution in this setting by providing algorithms that allow the discovery of frequent structured patterns, either in "exact" or "approximate" form, present in a collection of input biological sequences.
1 Introduction
A large number of text databases have recently been produced, as the result of both technological improvement and the growth of the Internet, which allows for an "almost everywhere" accessibility of information. Unfortunately, the availability of such large amounts of raw data does not produce "per se" an enlargement of the available "information". This is because such information is often only implicitly encoded in the data. Making this information explicit is a nontrivial task and, usually, it cannot be done by hand. This implies the need to devise automatic methods by which meaningful information can be extracted from string databases. In recent years, a particular class of raw data stored in string databases has been assuming a prominent role in discovery science: genomic data. Indeed, the completion of the human genome sequencing has raised a wide range of new challenging issues involving sequence analysis. Genome databases mainly consist of sets of strings representing DNA or protein sequences (biosequences); unfortunately, most of these strings still remain to be "interpreted".
Corresponding author
S. Lange, K. Satoh, and C.H. Smith (Eds.): DS 2002, LNCS 2534, pp. 34–46, 2002. © Springer-Verlag Berlin Heidelberg 2002
In the context of sequence analysis, pattern discovery has recently assumed a fundamental role. It deals with automatic methods for pattern and motif discovery in biosequences. The PROSITE Data Base [3] is well suited to illustrate the notion of motif and the "discovery process" involved. It is a Data Base of protein families, where each family is described by a regular expression (the motif or pattern) highlighting important and common domains within the sequences in the given protein family. Unfortunately, those motifs have been extracted semi-automatically through the analysis of multiple alignments: a procedure no longer adequate, given the amount of biological data being produced worldwide. In fact, automatic pattern discovery is now more crucial than ever. The task of pattern discovery must be clearly distinguished from that of matching a given pattern in a Data Base. Indeed, in the latter case, we know what we are looking for, while in the former we do not. A particular class of patterns seems to be highly promising in the analysis of biosequence functions. Indeed, it has been observed [9] that, in complex organisms, the relative positions of substrings recognized as motifs are often not random when they participate in biological processes. For instance, the most frequently observed prokaryotic promoter regions are in general composed of two parts positioned approximately 10 and 35 bases, respectively, upstream from the start of transcription. The biological reason for these particular positions is that the two parts are recognized by the same protein. Analogously, in eukaryotic transcription regulation, different portions of sequences, suitably positioned w.r.t. each other, promote or inhibit the transcription in the presence of certain proteins. As a further example, satellites and tandem repeats are repetitive structures, important for identifying genetic markers and for their association with some genetic diseases [4,15]; extraction of patterns common to a set of sequences indicates conservation, which in turn indicates that the (sometimes unknown) function being described by the pattern is important [3,16,17]; promoter sequences and regulatory regions have a specific "high level" structure and their identification is also essential [7,8,13,14]. It is clear that discovering frequent structured patterns is therefore biologically meaningful and particularly important for the recognition of biological functions. As a further complication, looking for exact repetitions of patterns may lead to biologically unsatisfactory results. Indeed, similar biological subsequences may have the same functions in their sequences. It is therefore mandatory to look for structured patterns "representing" a sufficiently high number of strings in the database. This problem is a nontrivial one [6,14] and requires high computational resources. In this paper we propose an efficient solution to the problem of discovering structured motifs in biological databases by exploiting Subword Trees [1,2,5,10,11] as support index structures. In particular, we face four problems: (i) discovering simple structured motifs composed of two highly conserved regions separated by constrained spacers (exact repetitions) (Section 3); (ii) extending the previous solution for extracting complex structured motifs composed of
r > 2 regions (Section 4); (iii) deriving structured motifs with the same structure as in (i) but allowing up to e errors in string repetitions (Section 5); and, finally, (iv) discovering more complex structured motifs allowing up to e errors in string repetitions (Section 6). Before all that, in the following section, we provide some preliminary definitions that will be used throughout the paper.
2 Statement of the Problem
In biological contexts, a motif is a highly conserved region of nucleotide sequences. In other words, given a collection SC of N strings and a string m, m is called a motif for SC if it is repeated in at least q of the N strings in SC. The number q is called the quorum. If we look for exact repetitions of m in SC we say that m is an exact motif. However, in some situations, exact repetitions do not suffice to represent biologically significant pieces of information, because similar substrings may play the same role in biological processes. Therefore, it is often necessary to verify if a string m "represents" a set of strings occurring in SC. In this paper we consider that a string m represents a string m′ with an error e if the Hamming distance between m and m′ (denoted h(m, m′)) is no larger than e;1 m′ is said to be an e-occurrence of m. In this setting, a string m is called an e-motif if it has an e-occurrence in at least q of the N strings in SC. As pointed out in the introduction, it is biologically meaningful to search for frequent patterns structured as two or more parts separated by constrained spacers, either in their exact form or allowing for errors in the match between the pattern and the strings. Before giving a formal definition of the problems we are going to analyze, it is necessary to define the notation exploited in the paper. Given an alphabet Σ, we will denote single symbols in Σ with uppercase letters, whereas lowercase letters identify strings on Σ. The notation m = uv indicates that the string m is obtained by concatenating the (sub)strings u and v. The notation s[k] is used to indicate that the string s has length k. The special symbol X, not present in Σ, is used to identify the "don't care" symbol, that is, a symbol that matches without error with any symbol in Σ, and the expression X(d) denotes a string composed of d symbols X. Now, we define a structured pattern as a pattern of the form p = w1[k1] X(d1) w2[k2] . . . X(dr−1) wr[kr], denoting a string composed of r words such that the i-th word has length ki and di don't care symbols separate wi from wi+1 (1 ≤ i < r). An occurrence of the pattern p in SC is a substring s of a string in SC having the same length as p and such that each word wi of p is equal to the corresponding word swi of s. Analogously, an e-occurrence of the pattern p in SC is a substring e-s of SC having the same length as p and such that each
1 The Hamming distance between two words m and m′ is given by the minimum number of symbol substitutions to be performed on m in order to transform m into m′.
word wi of p has a Hamming distance of at most e from the corresponding word e-swi of e-s. A structured pattern p is, therefore, a structured exact motif if there are at least q strings in SC containing an occurrence of p. Analogously, a pattern p is a structured e-motif if there are at least q strings in SC containing an e-occurrence of p. Obviously, a structured exact motif is also a structured e-motif but not vice versa. In this paper we will address the following problems. Given a collection SC of strings and a quorum q:
Problem 1. Find all structured exact motifs of the form m = w1[k1] X(d) w2[k2].
Problem 2. Find all structured exact motifs of the form m = w1[k1] X(d1) w2[k2] X(d2) . . . X(d(r−1)) wr[kr], for r > 2.
Moreover, given a fixed maximum allowed error e:
Problem 3. Find all structured e-motifs of the form e-m = w1[k1] X(d) w2[k2].
Problem 4. Find all structured e-motifs of the form e-m = w1[k1] X(d1) w2[k2] X(d2) . . . X(d(r−1)) wr[kr], for r > 2.
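To illustrate these definitions, the following sketch checks whether a structured pattern w1 X(d) w2 has an (e-)occurrence in a sequence; it simply restates the definitions and is not the algorithm developed in the next sections.

```python
# Sketch: check whether the structured pattern w1 X(d) w2 has an e-occurrence in a
# sequence (e = 0 gives an exact occurrence). Each word must be within Hamming
# distance e of the corresponding substring, as in the definitions above.

def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def has_e_occurrence(seq, w1, d, w2, e=0):
    total = len(w1) + d + len(w2)
    for i in range(len(seq) - total + 1):
        part1 = seq[i:i + len(w1)]
        part2 = seq[i + len(w1) + d:i + total]
        if hamming(part1, w1) <= e and hamming(part2, w2) <= e:
            return True
    return False

print(has_e_occurrence("ACGTTTGCA", "ACG", 3, "GCA"))        # True (exact occurrence)
print(has_e_occurrence("ACGTTTGGA", "ACG", 3, "GCA", e=1))   # True (one mismatch allowed)
```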
3 Solving Problem 1
3.1 Overview
The algorithm we are going to define for solving Problem 1 exploits Subword Trees [1,2,5,10,11] for storing subwords of the strings in the collection SC possibly containing a structured motif. The basic idea underlying our approach is that of representing the structured patterns, candidates to be structured motifs, using links between subwords lying at distance d in SC. Each pair of subwords lying at distance d is suitably inserted in the Subword Tree and is identified by references to the Tree nodes representing them (node pointers). Links between subwords are expressed as pairs of references. We call these links d-links because they represent links between subwords lying at distance d. As an example, if the Subword Tree nodes v1 and v2 are associated, respectively, with the words w1 and w2 lying at distance d in a certain string of SC, the d-link (v1, v2) is used to represent the pattern w1 X(d) w2. Each d-link is associated with a number of occurrences, stating the number of strings in SC containing the pattern it represents. It is known that Subword Trees in compact form, representing the set of subwords in SC, can be built in linear time and occupy linear space in the size of SC. Differently from other approaches presented in the literature [6,14], the main idea of our approach is that of extracting the motifs during the construction of the Subword Tree. In particular, during the construction of the Subword Tree, we obtain the d-links representing the patterns, and compute the associated numbers of occurrences. The result yielded by our algorithm is the set of d-links having a number of occurrences at least equal to the quorum q, along with the Subword Tree referred to by the d-links.
One important issue to be dealt with in our approach is how and where to represent d-links. Indeed, storing d-links directly into the nodes of the Subword Tree T constructed from SC may lead to non-linear algorithms, which is a situation we wish to avoid. Therefore, we store and manage d-links in a separate support structure, say d-Tree, handled in its turn as a Subword Tree. Moreover, in order to properly exploit the features of Subword Trees, we represent the pointers of a d-link as strings (of 0s and 1s), and a d-link is denoted by the juxtaposition of such strings. For each pattern pi, the number of occurrences associated with the corresponding d-link li is stored into the node vleafi of d-Tree representing li. Note that vleafi is always a leaf node of d-Tree. Since we need to count the number of distinct strings of SC in which pi occurs, a further piece of information, indicating the last string in which pi has been found, must be stored into the node. The algorithm for solving Problem 1 works as follows. It scans the symbols in SC one at a time and, starting from the position of each of them, it considers two subwords w1 and w2 such that w2 lies at distance d from w1. These subwords are inserted into the (main) Subword Tree T and, from the pointers to the nodes associated with these subwords, the algorithm creates a d-link l stating that w1 and w2 have been found at distance d. The algorithm then inserts l into the (support) Subword Tree d-Tree storing all the d-links extracted from SC. If the d-link l is a newly inserted one, its number of occurrences is set to 1; otherwise, it is updated only if it is the first time that the pattern is found in the current string. When the number of occurrences associated with a d-link reaches the quorum q, the pattern identified by that d-link is recognized as a structured exact motif and, therefore, it is added to the set of results.
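As a minimal illustration of this encoding (an illustrative sketch; the integer node identifiers and the fixed pointer width are assumptions, not details fixed by the paper), a d-link can be obtained by juxtaposing fixed-width binary representations of the two node pointers, so that the d-link is itself a string of 0s and 1s that can be inserted into d-Tree:

```python
def convert(p_first, p_second, width=32):
    # Juxtapose fixed-width binary encodings of the two node pointers.
    return format(p_first, f"0{width}b") + format(p_second, f"0{width}b")

link = convert(1042, 77)   # string of 0s and 1s standing for one pattern w1 X(d) w2
```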
3.2 Algorithm P1
In this section we present our algorithm for solving Problem 1. The algorithm receives as input a collection SC of N strings, the lengths k1 and k2 of the subwords w1 and w2, respectively, the distance d, and the quorum q, and returns the set of structured exact motifs w1[k1] X(d) w2[k2] in SC.

Algorithm P1 for solving Problem 1
Input: a collection SC of sequences, and four integers k1, k2, d and q representing, respectively, the length of the first subword, the length of the second subword, the distance, and the quorum.
Output: the set Results of d-links representing structured exact motifs solving Problem 1 and the Subword Tree T obtained from SC.
Type StringSet: Set of Strings;
var T, d-Tree: Subword Tree; pfirst, psecond, pleaf: pointer; str, i, occ: integer; d-link, w1, w2: string; Results: StringSet;
begin
  for str := 1 to NumberOfStrings(SC) do
    for i := 1 to StringLength(SC, str) − (k1 + d + k2) do begin
      w1 := Subword(SC, str, i, k1);
      w2 := Subword(SC, str, i + k1 + d, k2);
      pfirst := InsertString(T, w1);
      psecond := InsertString(T, w2);
      d-link := Convert(pfirst, psecond);
      pleaf := InsertString(d-Tree, d-link);
      if (LastString(pleaf) ≠ str) then begin
        occ := IncrementOccurrences(pleaf);
        LastStringUpdate(pleaf, str);
        if (occ = q) then AddResult(Results, d-link);
      end
    end;
end
Procedures and functions used in the algorithm above are as follows. Function NumberOfStrings receives a collection SC of strings and returns the number of strings it contains. Function StringLength receives as input a collection SC of strings and a string index str; it yields as output the length of the string indexed by str. Function Subword receives a collection SC of strings and three integers str, i, k and yields the subword of length k starting from position i of the string indexed by str in SC. Function InsertString receives a Subword Tree T and a subword w; it inserts w in T and returns the pointer to the node of T associated with the last symbol of w. The function Convert takes as input two pointers pfirst and psecond; it converts each pointer from its numeric to its string representation and juxtaposes the two obtained strings to obtain a single string representing a d-link. Function LastString receives a pointer pleaf to a leaf node of d-Tree and returns the index (stored therein) of the last string that caused updating of the associated number of occurrences (note that both the number of occurrences and the "last string" index are initialized to 0 when the node is created). Function IncrementOccurrences receives a pointer pleaf and increments the number of occurrences stored in the node pointed to by pleaf; the function returns the updated number of occurrences. The procedure LastStringUpdate takes as input a pointer pleaf and a string index str and updates the information stored in the node pointed to by pleaf relative to the last string index that caused updating of the associated number of occurrences. Finally, the procedure AddResult receives a set of strings Results and a string representing a d-link and adds it to the set Results.
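For readers who prefer an executable form, the following rough Python transcription of Algorithm P1 is an illustrative sketch, not the authors' implementation: plain dictionaries keyed by the subwords stand in for the Subword Tree T and for d-Tree, which preserves the single-scan counting logic but not the linear-space guarantees of the tree-based version.

```python
def algorithm_p1(SC, k1, k2, d, q):
    occurrences = {}   # d-link -> (count, index of last string that updated it)
    results = set()
    for str_idx, s in enumerate(SC):
        for i in range(len(s) - (k1 + d + k2) + 1):
            w1 = s[i:i + k1]
            w2 = s[i + k1 + d:i + k1 + d + k2]
            d_link = (w1, w2)              # stands in for Convert(pfirst, psecond)
            count, last = occurrences.get(d_link, (0, -1))
            if last != str_idx:            # count each string at most once
                count += 1
                if count == q:
                    results.add(d_link)
            occurrences[d_link] = (count, str_idx)
    return results

# e.g. algorithm_p1(["ACGTACGT", "ACGAACGT", "TTTTTTTT"], 3, 3, 2, 2)
# returns {("ACG", "CGT")}
```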
3.3 Complexity Issues
In this section we analyze the complexity of Algorithm P1. Let n = |SC| denote the size (number of symbols) of the string collection SC. Our algorithm extracts
all structured exact motifs in just one scan of SC. In particular, it extracts the motifs during the construction of d-Tree. It has been shown that the construction of subword trees in compact form [1,11,2] has linear time complexity in n; therefore, the construction of T is O(n). Since we insert exactly one d-link for each subword extracted from SC, the construction of d-Tree is done in O(n) too. The information stored into each d-Tree leaf node, relative to the number of occurrences of that d-link and the last string that caused an update, can be checked and updated in constant time, since we always have the pointer to that node at our disposal. Thus, the overall time complexity of Algorithm P1 is O(n). Moreover, it requires O(n) additional space for storing the subword trees.
4 Solving Problem 2
4.1 Overview
Problem 2 can be solved in a quite straightforward way by slightly modifying Algorithm P1. The main difference regards the representation and creation of the d-links. As in Algorithm P1, the algorithm for solving Problem 2 extracts all structured exact motifs during the construction of both T and d-Tree in a single scan of SC. In order to handle structured motifs of the form m = w1[k1] X(d1) w2[k2] X(d2) . . . X(d(r−1)) wr[kr] we have to deal with r > 2 subwords at a time. Thus, a d-link representing a pattern of this kind must be a cross link between r nodes of the subword tree T constructed from the strings in SC. The d-links introduced in the previous section can be extended in a natural way to represent this kind of pattern by juxtaposing the r pointers to the Subword Tree nodes associated with the r subwords to be represented, as sketched below. The manipulation of this kind of d-link is performed in the same way as in the previous section; in particular, d-links are stored into a subword tree d-Tree, and each d-link li is associated with information stating the number of occurrences in SC of the pattern represented by li and the index of the last string that caused updating of this number of occurrences. Both these numbers are stored into the leaf node of d-Tree associated with li. Thus, the algorithm for solving Problem 2 works as follows. For each symbol s of SC, it extracts, starting from s, r subwords such that the i-th subword is at distance di from the (i+1)-th subword in SC; it inserts these subwords into the Subword Tree T. From the pointers to the nodes associated with these subwords it creates a d-link l and inserts l into the subword tree d-Tree storing all the d-links extracted from SC. The updating of the number of occurrences of l is carried out in the same way as in Algorithm P1, in order to take into account occurrences of the pattern represented by l in distinct strings of SC.
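A short sketch of this generalization (illustrative only; tuple keys stand in for the pointer-based d-links) shows how the r subwords at the prescribed offsets can be extracted and joined into a single key, after which counting proceeds exactly as in Algorithm P1:

```python
def extract_r_link(s, i, lengths, gaps):
    """Return the tuple of r subwords of s starting at position i, where
    lengths[j] is the length of word j and gaps[j] is the spacer after it."""
    words, pos = [], i
    for j, k in enumerate(lengths):
        words.append(s[pos:pos + k])
        pos += k + (gaps[j] if j < len(gaps) else 0)
    return tuple(words)

# extract_r_link("ACGTTAGGA", 0, [2, 2, 2], [1, 1]) -> ("AC", "TT", "GG")
```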
4.2 Complexity Issues
As for the complexity analysis, the following considerations can be drawn. Given a collection of strings SC such that |SC| = n, from which exact structured patterns of
the form w1[k1] X(d1) w2[k2] X(d2) . . . X(d(r−1)) wr[kr] must be extracted, the number of necessary subword insertions into the subword tree T is O(r · n). The number of d-links generated by the algorithm is O(n); therefore, the overall complexity of the algorithm is O(r · n).
5 Solving Problem 3
5.1 Overview
Solving Problem 3 is considerably more difficult than Problems 1 and 2. This is because the difficulties arising from the structure of the motifs are coupled with the difficulties arising when the exact repetition requirement is relaxed. The former difficulty (i.e., motif structure) is dealt with through the support of d-links, as shown in the previous sections. The latter one requires a more involved analysis. Recall that, given a collection of strings SC, a quorum q and a number of errors e, a structured e-motif e-m is a structured pattern having at least q e-occurrences in SC. An e-occurrence of e-m in SC is a pattern p having the same structure as e-m and such that each of the subwords included in e-m has a Hamming distance at most e from the corresponding subword of p. In this section we propose a technique for extracting structured e-motifs which exploits the concept of e-neighbor, defined next, for boosting the motif extraction process. Given a word w of length k, an alphabet Σ and a maximum allowed number of mismatches e, there are $\sum_{i=1}^{e} \binom{k}{i}(|\Sigma|-1)^i$ different possible words at Hamming distance less than or equal to e from w (i.e., being e-occurrences of w). We want to represent this set of words in as compact a way as possible. First, these words can be grouped according to the positions in which mismatches with w occur; the number of these groups is $\sum_{i=1}^{e} \binom{k}{i}$. By substituting the mismatching symbols with the "don't care" symbol X, each group can be represented by a single word. The words of a group can be obtained by "instantiating" the Xs with specific symbols of the alphabet. Finally, note that, because of the meaning associated with the "don't care" symbol X, all strings containing l ≤ e symbols X are represented by the $\binom{k}{e}$ strings containing exactly e Xs. So, we can focus on those $\binom{k}{e}$ strings, disregarding the remaining $\sum_{i=1}^{e-1} \binom{k}{i}$ ones. We call the set of strings that are obtained from w by substituting exactly e symbols of w by "don't care" symbols the e-neighbor Set of w; each word included in this set is called an e-neighbor of w. Note that the e-neighbor Set of a word w represents the whole set of e-occurrences of w. As an example, consider the word w = AGCT and an error e = 2. The e-neighbor Set of w is {AGXX, AXCX, AXXT, XGCX, XGXT, XXCT}. It is worth observing that if the intersection between the e-neighbor Sets of two words w1 and w2 is not empty, then w1 is an e-occurrence of w2 and vice versa, but the sets of e-occurrences of w1 and w2 might be different. Our solution to Problem 3 consists of two steps:
– determining the occurrences of structured patterns formed by the e-neighbors of the subwords of SC lying at distance d; we will call these patterns e-neighbor
patterns. Exact repetitions of e-neighbors are considered in the computation of the occurrences.
– determining the e-occurrences of each structured pattern pi (candidate to be a structured e-motif); this is done by combining the numbers of occurrences of the e-neighbor patterns generated by pi.
Both structured patterns and e-neighbor patterns are handled and stored, as described in the previous sections, with the support of Subword Trees. However, in order to combine the occurrences of two different e-neighbor patterns for determining the e-occurrences of the corresponding structured pattern, it is not sufficient to manipulate just the number of their occurrences. Indeed, the sets of strings in which two e-neighbor patterns have been found may overlap. Therefore, it is necessary to store, for each e-neighbor pattern, a boolean array of size equal to the number N of strings in the collection SC. This boolean array indicates the strings in the collection containing the corresponding e-neighbor pattern. We exploit an Extended Subword Tree for storing the d-links associated with e-neighbor patterns. This is a standard Subword Tree storing boolean arrays in its leaf nodes in place of simple integers.
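As an illustration (a minimal sketch, not the authors' code), the e-neighbor Set of a word can be enumerated directly by choosing the e positions to replace with the don't care symbol:

```python
from itertools import combinations

def e_neighbors(w, e, dont_care="X"):
    """All strings obtained from w by substituting exactly e positions with X."""
    out = set()
    for positions in combinations(range(len(w)), e):
        chars = list(w)
        for p in positions:
            chars[p] = dont_care
        out.add("".join(chars))
    return out

# e_neighbors("AGCT", 2) == {"AGXX", "AXCX", "AXXT", "XGCX", "XGXT", "XXCT"}
```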
5.2 Algorithm P3

Algorithm P3 for solving Problem 3
Input: a collection SC of N sequences, and five integers k1, k2, d, q and e representing, respectively, the length of the first subword, the length of the second subword, the distance, the quorum, and the maximum number of errors allowed for each subword in the pattern.
Output: the set Results of d-links representing structured e-motifs solving Problem 3 and the Subword Tree T obtained from SC.
Type StringSet: Set of strings;
var str, i: integer; w1, w2, ewi, ewj, d-link, e-d-link: string; Results, e-neighborSet1, e-neighborSet2: StringSet; pfirst, psecond, pleaf: pointer; e-T, T, d-Tree: SubwordTree; e-d-Tree: ExtendedSubwordTree; presence: Array of N bits;
begin
  for str := 1 to NumberOfStrings(SC) do
    for i := 1 to StringLength(SC, str) − (k1 + d + k2) do begin
      w1 := Subword(SC, str, i, k1);
      w2 := Subword(SC, str, i + k1 + d, k2);
      e-neighborSet1 := Extract e-neighbors(w1, e);
      e-neighborSet2 := Extract e-neighbors(w2, e);
      for each ewi ∈ e-neighborSet1 do begin
        pfirst := InsertString(e-T, ewi);
        for each ewj ∈ e-neighborSet2 do begin
          psecond := InsertString(e-T, ewj);
          e-d-link := Convert(pfirst, psecond);
          pleaf := InsertString(e-d-Tree, e-d-link);
          SetPresence(pleaf, str);
        end;
      end;
      pfirst := InsertString(T, w1);
      psecond := InsertString(T, w2);
      d-link := Convert(pfirst, psecond);
      pleaf := InsertString(d-Tree, d-link);
    end;
  for each d-link in d-Tree do begin
    RetrieveSubwords(d-link, d-Tree, w1, w2);
    e-neighborSet1 := Extract e-neighbors(w1, e);
    e-neighborSet2 := Extract e-neighbors(w2, e);
    InitializePresenceArray(presence);
    for each ewi ∈ e-neighborSet1 do begin
      pfirst := FindSubword(e-T, ewi);
      for each ewj ∈ e-neighborSet2 do begin
        psecond := FindSubword(e-T, ewj);
        e-d-link := Convert(pfirst, psecond);
        pleaf := FindSubword(e-d-Tree, e-d-link);
        presence := BitwiseOR(presence, GetPresence(pleaf));
      end;
    end;
    if (NumberOfOccurrences(presence) ≥ q) then AddResult(Results, d-link);
  end;
end
Here, the functions NumberOfStrings, StringLength, Subword, InsertString, Convert and the procedure AddResult are as described in Section 3. Function Extract e-neighbors receives a word w and an integer e and extracts the e-neighbor Set of w. Procedure SetPresence receives a pointer pleaf and a string index str; it sets to true the element of index str of the boolean array stored in the node referred to by pleaf. Procedure RetrieveSubwords takes as input a d-link and the corresponding Subword Tree d-Tree and retrieves from d-Tree the words referred to by the pointers composing the d-link. The procedure InitializePresenceArray simply initializes to false all the elements of the boolean array received as input. Function FindSubword receives a Subword Tree T and a word w and returns the pointer to the node of T associated with w. Function GetPresence takes as input a pointer p and yields the boolean array associated with the node referenced by p. Function BitwiseOR performs a bitwise OR between the elements of the two boolean arrays received as input. Finally, the function NumberOfOccurrences counts the number of true values contained in the boolean array received as input.
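The combination step of the algorithm can be sketched as follows (an illustrative sketch; the presence mapping and the argument names are assumptions): given the two e-neighbor Sets of a candidate pattern, the per-string boolean arrays of all its e-neighbor patterns are ORed together and the true entries are counted.

```python
def e_occurrence_count(neighbors1, neighbors2, presence, N):
    """neighbors1/neighbors2: e-neighbor Sets of the two candidate words;
    presence[(ew1, ew2)]: list of N booleans, one per string of SC, filled
    during the first scan; returns the number of distinct strings covered."""
    combined = [False] * N
    for ew1 in neighbors1:
        for ew2 in neighbors2:
            for idx, present in enumerate(presence.get((ew1, ew2), [])):
                combined[idx] = combined[idx] or present
    return sum(combined)

# The candidate pattern is reported as a structured e-motif when this count
# reaches the quorum q.
```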
5.3 Complexity Issues
Given a collection SC of N strings of average length l, so that lN = |SC|, in this section we analyze the complexity of Algorithm P3 for extracting the set of structured e-motifs of the form w1[k1] X(d) w2[k2]. Retrieving a subword from SC takes constant time. Extracting the e-neighbor Set of a word w of length k takes O(ν(k, e)) time, where ν(k, e) = $\binom{k}{e}$, and the number of e-neighbors returned is O(ν(k, e)). The function SetPresence can be executed in constant time. Thus, the first step of the algorithm requires O(lN ν(k1, e) ν(k2, e)) time. As we have seen in Section 3.3, the maximum number of d-links in d-Tree is lN. Retrieving a pair of subwords from a d-link can be done in constant time, since each d-link provides the pointers to the nodes in T associated with the subword pair. Initializing and performing the bitwise OR of boolean arrays can be done in time O(N). Thus, the second step of the algorithm takes O(lN² ν(k1, e) ν(k2, e)). The overall time complexity of the algorithm is O(lN² ν(k1, e) ν(k2, e)). Finally, the algorithm requires O(lN² ν(k1, e) ν(k2, e)) space.
6 Solving Problem 4
Extending Algorithm P3 to solve Problem 4 is quite straightforward. It is necessary to extend d-links to represent r > 2 pointers, as described in Section 4. Moreover, the construction of e-d-Tree must be modified in such a way as to consider the combinations of r e-neighbors at a time. In particular, given r words from SC such that the word wi lies at distance di from the word w(i+1), it is necessary to: (i) derive the r corresponding e-neighbor Sets; (ii) combine all the e-neighbors belonging to the r e-neighbor Sets; (iii) store the corresponding d-links into e-d-Tree. After these tasks are carried out, determining the e-occurrences of each structured pattern pi is performed in a way similar to that shown for Algorithm P3. Given a collection SC of N strings of average length l, so that lN = |SC|, the complexity of Algorithm P4 for extracting the set of structured e-motifs of the form w1[k1] X(d1) w2[k2] X(d2) . . . X(d(r−1)) wr[kr] is O(lN² ν(k1, e) · · · ν(kr, e)).
7 Discussion
In this section we compare our approach with some other techniques that have recently appeared in the literature. In [12] the problem of finding frequent structured patterns (called dyads in that paper) is demonstrated to be biologically meaningful by a real case study. In that paper, frequent patterns are extracted by enumerating all possible patterns over the alphabet having a particular structure and counting the number of occurrences of those patterns in the database. In [6] a wide range of patterns is considered. Of particular interest, the authors propose a technique for extracting frequent patterns with very complex structures by exploiting pattern trees for representing the pattern space. While
allowing the extraction of patterns with more complex structures than those considered in this paper, these algorithms are worst-case exponential in the length of the patterns. The approach most closely related to our own is that presented in [14]. In that paper Suffix Trees, obtained from the input string set, are exploited to derive valid models, which correspond to the structured e-motifs defined in Section 2. In particular, models are incrementally obtained by recursively traversing (in a depth-first fashion) the (virtual) lexicographic trie M of all possible models of a fixed length k and the (actually built) suffix tree T of the sequences. Two algorithms are presented, both of which jump in the suffix tree to skip the d symbols separating the conserved regions of the patterns and exploit boolean arrays to combine the sets of tuples containing the considered patterns. The approach proposed in [14] does not explicitly consider the case of exact repetitions of the patterns in the set, but only approximate ones. However, the exact case can easily be simulated by allowing e = 0 errors in the match between strings and patterns. In order to obtain structured e-motifs of the form w1[k1] X(d) w2[k2], that is, to solve Problem 3 defined in Section 2, the cleverest algorithm proposed in [14] requires O(N n_k N²(e, k) + N n_{2k+d} N(e, k)), where N is the number of strings in the collection, k = max(k1, k2), n_k ≤ lN indicates the number of Suffix Tree nodes at level k, N(e, k) = $\sum_{i=0}^{e} \binom{k}{i}(|\Sigma|-1)^i \le k^e |\Sigma|^e$ is the number of strings at Hamming distance e from a model, and N n_k N(e, k) is the number of possible models of length k that can be generated. In the case of exact match (i.e., solving Problem 1), the complexity above reduces to O(N n_k + N n_{2k+d}). These measures do not take into account the construction of the Suffix Tree from the strings. Recall that the complexity of our Algorithm P1 for solving Problem 1 (exact repetitions) is O(lN), where lN = |SC|, N is the number of strings in the collection and l is the average length of such strings, whereas the complexity of our Algorithm P3 for solving Problem 3 is O(lN² ν(k1, e) ν(k2, e)), where ν(k, e) = $\binom{k}{e}$. Thus, the main improvements of our approach w.r.t. [14] are the following: (i) Our approach solves Problem 1 during the construction of the support structures; in particular, in Algorithm P1, all structured exact motifs are discovered during the construction of the Subword Tree; conversely, the approach of [14] requires the availability of the Suffix Tree. (ii) The complexity of our approach for solving Problem 3 does not depend on the dimension of the alphabet (this is significant when dealing with large alphabets); indeed, the term ν(k, e) above depends only on k and e, whereas in [14] the term N(e, k) is bounded by k^e |Σ|^e. We are able to achieve this result by exploiting "don't care" symbols and e-neighbors. (iii) Our algorithms do not depend on the number of symbols X(d) separating the words w1[k1] and w2[k2] in the patterns; this improvement is a direct result of the exploitation of d-links. Indeed, d-links allow us to retrieve pairs of words in the Subword Tree lying at distance d simply by analyzing the words stored in the tree. On the contrary, in [14], a word w is related to the other words lying at distance d from it by jumping down d levels in the Suffix Tree, which makes the complexity of that algorithm dependent on d. Moreover, this imposes the exploitation of boolean presence arrays even for solving Problem 1.
References
1. A. Amir, M. Farach, Z. Galil, R. Giancarlo, and K. Park. Dynamic dictionary matching. Journal of Computer and System Sciences, 49:208–222, 1994.
2. A. Apostolico and M. Crochemore. String matching for a deluge survival kit. Handbook of Massive Data Sets, to appear.
3. A. Bairoch. PROSITE: A dictionary of protein sites and patterns. Nucleic Acids Research, 20:2013–2018, 1992.
4. G. Benson. An algorithm for finding tandem repeats of unspecified pattern size. In Proceedings of ACM RECOMB, pages 20–29, 1998.
5. P. Bieganski, J. Riedl, J.V. Carlis, and E.M. Retzel. Generalized suffix trees for biological sequence data: Applications and implementations. In Proc. of the 27th Hawaii Int. Conf. on Systems Science, pages 35–44. IEEE Computer Society Press, 1994.
6. A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology, 5(2):277–304, 1998.
7. Y.M. Fraenkel, Y. Mandel, D. Friedberg, and H. Margalit. Identification of common motifs in unaligned DNA sequences: application to Escherichia coli Lrp regulon. Computer Applied Bioscience, 11:379–387, 1995.
8. D.J. Galas, M. Eggert, and M.S. Waterman. Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter sequences from Escherichia coli. J. of Molecular Biology, 186:117–128, 1985.
9. C.A. Gross, M. Lonetto, and R. Losick. Bacterial sigma factors. Transcriptional Regulation, 1:129–176, 1992.
10. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
11. D. Gusfield, G.M. Landau, and B. Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Information Processing Letters, 41:181–185, 1992.
12. J. van Helden, A.F. Rios, and J. Collado-Vides. Discovering regulatory elements in noncoding sequences by analysis of spaced dyads. Nucleic Acids Research, 28(8):1808–1818, 2000.
13. A. Klingenhofen, K. Frech, K. Quandt, and T. Werner. Functional promoter modules can be detected by formal methods independent of overall sequence similarity. Bioinformatics, 15:180–186, 1999.
14. L. Marsan and M.F. Sagot. Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. Journal of Computational Biology, 7:345–360, 2000.
15. M.F. Sagot and E.W. Myers. Identifying satellites in nucleic acid sequences. In Proc. of ACM RECOMB, pages 234–242, 1998.
16. H.O. Smith, T.M. Annau, and S. Chandrasegaran. Finding sequence motifs in groups of functionally related proteins. In Proc. of the National Academy of Sciences, pages 118–122, U.S.A., 1990.
17. R.L. Tatusov, S.F. Altschul, and E.V. Koonin. Detection of conserved segments in proteins. In Proc. of the National Academy of Sciences, pages 12091–12095, U.S.A., 1994.
Discovery in Hydrating Plaster Using Machine Learning Methods Judith E. Devaney and John G. Hagedorn National Institute of Standards and Technology, Gaithersburg MD, 20899-8951, USA, {judith.devaney,john.hagedorn}@nist.gov, http://math.nist.gov/mcsd/savg
Abstract. We apply multiple machine learning methods to obtain concise rules that are highly predictive of scientifically meaningful classes in hydrating plaster over multiple time periods. We use three dimensional data obtained through X-ray microtomography at greater than one micron resolution per voxel at five times in the hydration process: powder, after 4 hours, 7 hours, 15.5 hours, and after 6 days of hydration. Using statistics based on locality, we create vectors containing eight attributes for subsets of size 100³ of the data and use the autoclass unsupervised classification system to label the attribute vectors into three separate classes. Following this, we use the C5 decision tree software to separate the three classes into two parts: class 0 and 1, and class 0 and 2. We use our locally developed procedural genetic programming system, GPP, to create simple rules for these. The resulting collection of simple rules is tested on a separate 100³ subset of the plaster datasets that had been labeled with their autoclass predictions. The rules were found to have both high sensitivity and high positive predictive value. The classes accurately identify important structural components in the hydrating plaster. Moreover, the rules identify the center of the local distribution as a critical factor in separating the classes.
1 Introduction
Plaster of paris is a widely used material of economic importance [1]. For example, the porcelain industry maintains large numbers of molds whose strength, durability, and ability to absorb water impact the industry's costs [2]. Plaster powder is formed by calcining gypsum (calcium sulfate dihydrate, CaSO4 · 2H2O) to form calcium sulfate hemihydrate (CaSO4 · (1/2)H2O). The solid plaster is then formed by adding water (hydration) to the powder and allowing the mixture to set. The equations are [1]:
Calcination: CaSO4 · 2H2O = CaSO4 · (1/2)H2O + (3/2)H2O
Hydration: CaSO4 · (1/2)H2O + (3/2)H2O = CaSO4 · 2H2O
During hydration, an interlocking network of gypsum crystals forms. See Figure 1 for a scanning electron micrograph (900X) [3] of precipitated gypsum crystals (CaSO4 · 2H2O). This crystalline network is the foundation of the strength,
durability, and absorptivity of the plaster [1]. However, the final properties of the set plaster are dependent on many things such as the water-solid ratio in hydration, impurities in the plaster, additives, temperature, and production conditions [3]. Moreover, these interact. For example, as the water-solid ratio in the hydrating plaster increases, the volume fraction of porosity increases, absorptivity of the plaster increases, but the strength and durability decrease [1]. There is much to learn about plaster; even the form of the kinetic equations (fraction of plaster reacted versus time) is not agreed upon [4][5][6]. Understanding the process of setting plaster as well as being able to predict its final properties is of scientific as well as economic interest.
Fig. 1. A scanning electron micrograph (900X) [3] of precipitated gypsum crystals (CaSO4 · 2H2O).
2 High Resolution Data
Recently, X-ray microtomography has been used to obtain the unprecedented resolution of 0.95 µm per voxel in three dimensional images of hydrating plaster. Commercial grade plaster of paris was mixed with a water-to-solids mass ratio of 1.0 and viewed with X-ray microtomography after 4, 7, 15.5 hours and 6 days. Additionally, a separate sample of plaster powder was imaged. This resulted in five images of plaster of size 1024³. This is gray scale data with each voxel varying from 0 to 255 [7][8].
3 Methodology
We seek simple rules to describe and predict structural components in hydrating plaster such as hydration products (gypsum crystals), unhydrated plaster, and porosity (holes). With materials like cement, it is straightforward to obtain such components through thresholding of the brightness histogram. However, this is
not the case with plaster [8]. Since the problem is not straightforward, we use multiple methods from machine learning to obtain the rules. We use a combination of unsupervised classification, decision trees, and genetic programming to obtain these rules. The rules are developed on a 100³ subset of the data taken at the same place in each dataset. The rules are tested on a completely separate 100³ subset of the data taken at the same place in each dataset.
3.1 Unsupervised Classification
Since the data is unlabeled, we start the discovery process with an unsupervised classifier. We use autoclass [9][10][11][12][13], which has been used successfully for discovery. In the creation of attributes for input to autoclass, we follow the Principle of locality [14], wherein natural laws are viewed as the consequence of small-scale regularities. Since the particle sizes may be as small as a few microns [8], we choose as our scale a 3³ cube centered on each pixel in the image. Using simple statistics on these small cubes, we create eight attributes for each pixel as input vectors to autoclass, as described in the following table.

A0: gray level value of the pixel itself
A1: neighborhood midrange
A2: neighborhood variance about the midrange
A3: neighborhood range
A4: neighborhood minimum
A5: neighborhood maximum
A6: neighborhood median
A7: neighborhood mean
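For concreteness, a rough NumPy sketch of these attributes for one interior voxel follows (an illustrative sketch; the array name and indexing are assumptions, and "variance about the midrange" is read here as the mean squared deviation from the midrange):

```python
import numpy as np

def local_attributes(volume, x, y, z):
    """Eight attributes of the 3x3x3 neighborhood centered on voxel (x, y, z)."""
    nb = volume[x-1:x+2, y-1:y+2, z-1:z+2].astype(float)
    lo, hi = nb.min(), nb.max()
    midrange = (lo + hi) / 2.0
    return np.array([
        float(volume[x, y, z]),          # A0: gray level of the voxel itself
        midrange,                        # A1: neighborhood midrange
        ((nb - midrange) ** 2).mean(),   # A2: variance about the midrange
        hi - lo,                         # A3: neighborhood range
        lo,                              # A4: neighborhood minimum
        hi,                              # A5: neighborhood maximum
        float(np.median(nb)),            # A6: neighborhood median
        nb.mean(),                       # A7: neighborhood mean
    ])
```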
Hence, for the 100³ training subcube, this results in 1,000,000 vectors. Because materials scientists are interested in three classes within hydrating plaster [15], we constrain autoclass to seek three classes. Classification runs were performed for each training subcube for powder, 4 hours, 7 hours, 15.5 hours, and 6 days. Since this data is at a new resolution, we do not have pre-labeled data to compare it with, or experts who can label it. We validate the classification through a visual comparison with the classes obtained. The 100³ dataset used for training is small enough to look at all one hundred 100² images in the dataset and in the classification. A hundred 100² images can be printed on an 8½ by 11 page in a ten by ten array, making comparisons straightforward. Due to space considerations, we reproduce three images from each array here. Each image is taken at z = 0, 30, 60 in the image array. Figure 2 shows the data and classes for the plaster powder. The data is on the left, with z = 0 at the bottom, z = 30 in the middle, and z = 60 at the top. The corresponding classification for each plane is to the right of the data. Class 0 is black, class 1 is gray, and class 2 is white.
Fig. 2. Comparison of three 100² slices of original data with corresponding classification for powder. The data is in the left column, with z = 0 at the bottom, z = 30 in the middle, and z = 60 at the top. The corresponding classification for each plane is to the right of the data.
It is immediately obvious that autoclass has picked up the basic structure of the data. The plaster particles are in class 1. Figure 3 shows the equivalent images for 4, 7, 15.5 hours and 6 days of hydration. Again the structure in the data matches the structure of the classification. In the classified data, Class 2 (the white area) is the pore space, Class 1 (the grey area) identifies the crystalline network and the unhydrated plaster, and Class 0 (the black area) represents the boundary region. Figure 4 shows this difference clearly in renderings of the individual classes in their three dimensional configuration. Plaster hardens with little change in external volume, but since the volume of the hydration products is smaller than that of the original material plus water, voids occur inside [16]. In our classification, the space seems to form around the gypsum crystals.
3.2 Decision Tree
In order to gain better insight into the classifications for each time period, we seek comprehensible representations of the classification algorithms. That is, we wanted to find relatively simple rules to determine each element’s class based on its eight attributes. But autoclass operates in a “black-box” fashion. The algorithm by which it classifies and predicts elements is opaque to the user.
Fig. 3. Comparisons of original data with classification for 4 hours (top left), 7 hours (top right), 15.5 hours (bottom left), and 6 days (bottom right). For each time, three 100² slices of original data are in the left column, with z = 0 at the bottom, z = 30 in the middle, and z = 60 at the top. The corresponding classification for each plane is to the right of the data.
To derive more transparent statements of the classification schemes, we used a decision tree, C5 [17] (the identification of any commercial product or trade name does not imply either endorsement or recommendation by the National Institute of Standards and Technology). C5 is the commercial successor to C4.5 [18], which has been used extensively for learning. Runs of C5 on the autoclass-labeled attributes produced incomprehensible trees with thousands of nodes. However, the component classes in the brightness histograms in Figure 5 indicated that class 2 and class 1 were easily separable. Ten-fold cross-validation on the combined classes 1 and 2 showed that this was the case in four of the five datasets. Three of the datasets yielded single-node decision stumps with fewer than five misclassifications over hundreds of thousands of cases, for powder, 4 hours, and 15.5 hours. All of these branched on attribute A1. The fourth simple classification was for the 7 hour dataset. This also yielded a single-node decision stump; however, this one branched on A7. For uniformity in the final rules across the hydration times, the 7 hour case was rerun requiring C5 to use only A1 to get a single best split on this attribute. The six day dataset did not yield a simple decision tree for the combined classes 1 and 2. So it was also run with A1 as the only attribute to get the best split for input into the next phase, which was genetic programming to obtain complete and simple rules. The attribute A1 is the local midrange. The midrange is a robust estimator of the center of a short-tailed distribution [19]. Since the range is limited for each neighborhood to 0-255, this is the situation here. So all the rules are now of the form: if the center of the local distribution is ≤ x ....
Fig. 4. Three dimensional renderings of individual classes (class 0 is on the left, class 1 is in the middle, and class 2 is on the right) at 4 hours of hydration.
Fig. 5. Brightness histograms for whole datasets as well as component classes for powder, 4 hours, 7 hours, 15.5 hours, and 6 days.

3.3 Genetic Programming
Genetic Programming [20][21][22] is a technique that derives a sequence of operations (chosen from a user-defined set) that have high fitness (as defined by the user) using techniques inspired by evolution. We use GPP (Genetic Programming - Procedural) [23][24], a procedural genetic programming system that we have developed. GPP was used to derive simple, understandable formulae that closely match the original classifications provided by autoclass. Before GPP is used, the class 1/2 decision algorithm is determined by C5 as described above. GPP is then used to derive the class 0/1 and the class 0/2 decision algorithms. The method for using GPP in this problem followed these steps for each desired classification:
– Prepare training data sets from the classified data sets.
– Select parameters, such as the operator set, for the GPP runs.
– Construct a fitness function to measure algorithm effectiveness.
– Execute a set of GPP runs.
– Select the run with results that most closely match the original classification.
– Simplify the GPP-produced algorithm to a succinct form.
All GPP runs were done with the same set of operating parameters. We used a small set of arithmetic and logical operators, a population size of 500, and the maximum number of generations for a single evolution was 50. The fitness function that we used is based on the correlation of the algorithm's classifications with the actual classifications. The fact that this is a two-valued problem simplifies the calculation of the correlation. We use the formulation by Matthews [25], which has been used in the context of genetic programming by Koza [21]. The correlation is given by:
$$\frac{T_p T_n - F_n F_p}{\sqrt{(T_n + F_n)(T_n + F_p)(T_p + F_n)(T_p + F_p)}}$$
where:
– Tp is the number of true positives
– Tn is the number of true negatives
– Fp is the number of false positives
– Fn is the number of false negatives
The correlation is evaluated based on the execution of an algorithm on the appropriate training set. This correlation value varies from -1 to 1, where 1 indicates perfect performance on the training set and -1 indicates a perfectly incorrect performance on the training set. Because these decision algorithms are binary in nature, we can turn a very bad algorithm into a very good algorithm simply by inverting each decision. In terms of the correlation value, this means that we can regard a decision algorithm with a correlation of -0.6 as just as good as an algorithm with correlation +0.6. So our fitness function is the absolute value of the correlation value given above. Each GPP run seeks to evolve an algorithm that maximizes this fitness value. For each decision algorithm to be derived, five hundred GPP runs were made. Each run differed only in the seed to the random number generator. Given the resources available to us, we easily ran all 12 sets of runs over a single night. After completing a set of five hundred runs we evaluated the algorithm generated by each run on the full data sets. This enabled us to select the best performing algorithm. We then use a symbolic computation system (Maple [26]) to reduce the algorithms to simpler forms.
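A compact sketch of this fitness computation (an illustrative sketch, not the GPP code) on paired boolean predictions and autoclass labels:

```python
from math import sqrt

def fitness(predicted, actual):
    """Absolute value of the Matthews correlation between two boolean sequences."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    denom = sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return abs((tp * tn - fn * fp) / denom) if denom else 0.0
```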
4 Results
Relatively simple decision rules were derived for all five time steps and for each of the required classifications. After deriving the rules, we sought to evaluate their effectiveness relative to the original autoclass classification.
4.1 Rules
Here are the classification algorithms that were derived by C5 and GPP. Note that the derived rules are succinct and the entire classification algorithm for a particular time step is quite compact.

Powder
if A1 ≤ 42. then (class is either 2 or 0)
    if .518494A3 + .019318A6 = 2 then class = 0 else class = 2
else (class is either 1 or 0)
    if (7235./A7) ≤ (A7 + A4 + A1) then class = 1 else class = 0

4 hours
if A1 ≤ 27. then (class is either 2 or 0)
    if A6 = 0
        then if 0.5 + .06145A1 = 2 then class = 0 else class = 2
        else if 0.5 + .06145A1 + .003373A3 = 2 then class = 0 else class = 2
else (class is either 1 or 0)
    if (6323.0/A1) < (A0 + A4 + 0.692129A6 + A7) then class = 1 else class = 0

7 hours
if A1 ≤ 42.5 then (class is either 2 or 0)
    if .527467A1 + .027467A7 = 2 then class = 0 else class = 2
else (class is either 1 or 0)
    if 0.5 + 0.008515A7 = 1 then class = 1 else class = 0

15.5 hours
if A1 ≤ 30. then (class is either 2 or 0)
    if .525849(A6 + A3) = 2 then class = 0 else class = 2
else (class is either 1 or 0)
    if A0 = 0
        then if A4 < (5321.0/A7) − (A0 + A7) then class = 0 else class = 1
        else if A4 < (5321.0/A7) − A3 then class = 0 else class = 1

6 days
if A1 ≤ 28. then class = 2
else (class is either 1 or 0)
    if A0 + (A1 − (4643/A7)) > 0 then class = 0 else class = 1
4.2 Evaluation of the Rules
We use sensitivity and positive predictive value [27] as metrics to evaluate our rules. A rule can be optimal with respect to a particular classification in two ways. The rule can be very successful at seeing a class when it is there. This is called its sensitivity. And the rule can be very successful at identifying the class in the presence of other classes. This is called its positive predictive value. Let Tp be the true positives. Let Fp be the false positives. Let Fn be the false negatives. Then:

Sensitivity = Tp / (Tp + Fn)
Positive-Predictive-Value = Tp / (Tp + Fp)
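In code form (an illustrative sketch, not from the paper), the two metrics for a single class are computed from the confusion counts as:

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)

def positive_predictive_value(tp, fp):
    return tp / (tp + fp)
```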
In a confusion matrix, sensitivity is accuracy across a row; positive predictive value is accuracy down a column. We test our classification rules with a completely different 100³ subcube of data from each of the five time periods. To test the rules we first compute the same attribute vectors for the new dataset. Then we use the prediction capability of autoclass to label the vectors. Finally, we use the above rules to create confusion matrices of the predictions for each of the time periods. We derive the sensitivity and positive predictive values for each class and time period. The derived rules are all highly predictive, as shown in the following table.

Dataset     Sensitivity (class 0 / class 1 / class 2)   Positive-Predictive-Value (class 0 / class 1 / class 2)
Powder      0.94 / 0.88 / 0.97                          0.86 / 0.97 / 0.95
4 Hour      0.97 / 0.96 / 0.97                          0.93 / 0.98 / 0.997
7 Hour      0.95 / 0.95 / 0.99                          0.95 / 0.95 / 0.99
15.5 Hour   0.96 / 0.93 / 0.98                          0.93 / 0.96 / 0.996
6 Day       0.97 / 0.96 / 0.996                         0.99 / 0.95 / 0.98

5 Conclusions and Future Work
We have taken unclassified data and created a set of simple rules that accurately predict the class of unseen data for plaster powder, and plaster after 4, 7, 15.5
hours, and 6 days of hydration. This was accomplished using a combination of three machine learning methods, providing results and insight that were not possible with any one of the techniques. Our work on plaster has just begun, however. First, we would like to develop a better method for validating the classifications. One approach is to generate simulated plaster data sets for which proper classifications are known, for example using computer model microstructures designed to mimic the Plaster of Paris system [28]. We will also be working with an expert to label manually small subsets of the X-ray tomography data sets. These labeled data will then be used for training and validation. This will likely result in refinement of our rules. Next we would like to develop equations that accurately predict the class regardless of the time of hydration, i.e. that work over the whole hydration period. We will need additional data to include variations with respect to the parameters that can influence the setting process and the resultant properties of plaster. Finally, we would like to predict physical characteristics of classes with equations, instead of predicting classes. Plaster is an interesting and exciting topic for automated discovery methods. We look forward to extending our study. Acknowledgements. We would like to thank Dale Bentz for his encouragement and support.
References
1. Kingery, W.D., Bowen, H.K., Uhlmann, D.R.: Introduction to Ceramics. John Wiley and Sons, New York (1976)
2. Bullard, J.W.: Personal communication (2002)
3. Clifton, J.R.: Some aspects of the setting and hardening of gypsum plaster. Technical Note 755, NBS (1973)
4. Hand, R.J.: The kinetics of hydration of calcium sulphate hemihydrate: A critical comparison of the models in the literature. Cement and Concrete Research 24 (1994) 885–895
5. Ridge, M.J.: A discussion of the paper: The kinetics of hydration of calcium sulphate hemihydrate: A critical comparison of the models in the literature by R. J. Hand. Cement and Concrete Research 25 (1995) 224
6. Hand, R.J.: A reply to a discussion by M. J. Ridge of the paper: The kinetics of hydration of calcium sulphate hemihydrate: A critical comparison of the models in the literature. Cement and Concrete Research 25 (1995) 225–226
7. Bentz, D.P., Mizell, S., Satterfield, S.G., Devaney, J.E., George, W.L., Ketcham, P.M., Graham, J., Porterfield, J., Quenard, D., Vallee, F., Sallee, H., Boller, E., Baruchel, J.: The Visible Cement Dataset. J. Res. Natl. Inst. Stand. Technol. 107 (2002) 137–148
8. The visible cement dataset (2002) [online]
9. Cheeseman, R., Kelley, J., Self, M., Taylor, W., Freeman, D.: Autoclass: A Bayesian classification system. In: Proceedings of the Fifth International Conference on Machine Learning, San Francisco, CA, Morgan Kaufmann (1988) 65–74
10. Cheeseman, P.: On finding the most probable model. In Shrager, J., Langley, P., eds.: Computational Models of Discovery and Theory Formation. Morgan Kaufmann, San Francisco, CA (1991) 73–96
11. Stutz, J., Cheeseman, P.: Bayesian classification (autoclass): Theory and results. In Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., eds.: Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA (1995)
12. Kanefsky, B., Stutz, J., Cheeseman, P., Taylor, W.: An improved automatic classification of a Landsat/TM image from Kansas (FIFE). Technical Report FIA-94-01, NASA Ames (1994)
13. Goebel, J., Volk, K., Walker, H., Gerbault, P., Cheeseman, P., Self, M., Stutz, J., Taylor, W.: A Bayesian classification of the IRAS LRS Atlas. Astronomy and Astrophysics 222 (1989) L5–L8
14. Reichenbach, H.: Atom and Cosmos. Dover Publications, Inc., Mineola, New York (1932) (First published in 1930 as Atom und Kosmos.)
15. Bentz, D.P.: Personal communication (2002)
16. Sattler, H., Bruckner, H.P.: Changes in volume and density during the hydration of gypsum binders as a function of the quantity of water available. ZKG International 54 (2001) 522
17. C5 (2002) [online]
18. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
19. Crow, E.L., Siddiqui, M.N.: Robust estimation of location. Journal of the American Statistical Association 63 (1967) 363–389
20. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press (1992)
21. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press (1994)
22. Koza, J.R., Andre, D., Bennett III, F.H., Keane, M.: Genetic Programming 3: Darwinian Invention and Problem Solving. Morgan Kaufmann (1999)
23. Hagedorn, J., Devaney, J.: A genetic programming system with a procedural program representation. In: 2001 Genetic and Evolutionary Computation Conference Late Breaking Papers. (2001) 152–159 http://math.nist.gov/mcsd/savg/papers
24. Devaney, J., Hagedorn, J., Nicolas, O., Garg, G., Samson, A., Michel, M.: A genetic programming ecosystem. In: Proceedings 15th International Parallel and Distributed Processing Symposium, IEEE Computer Society (2001) 131 http://math.nist.gov/mcsd/savg/papers
25. Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 405 (1975) 442–451
26. Monagan, M.B., Geddes, K.O., Heal, K.M., Labahn, G., Vorkoetter, S.M., McCarron, J.: Maple 6 Programming Guide. Waterloo Maple Inc., Waterloo, Ontario, Canada (2000)
27. Lathrop, R., Erbdster, T., Smith, R., Winston, P., Smith, T.: Integrating AI with sequence analysis. In Hunter, L., ed.: Artificial Intelligence and Molecular Biology, Cambridge, MA (1993)
28. Meille, S., Garboczi, E.J.: Linear elastic properties of 2-D and 3-D models of porous materials made from elongated objects. Mod. Sim. Mater. Sci. 9 (2001) 1–20
Revising Qualitative Models of Gene Regulation Kazumi Saito, Stephen Bay, and Pat Langley
NTT Communication Science Laboratories, 2-4 Hikaridai, Seika, Soraku, Kyoto 619-0237, Japan, [email protected]; Institute for the Study of Learning and Expertise, 2164 Staunton Court, Palo Alto, CA 94306, USA, [email protected], [email protected]
Abstract. We present an approach to revising qualitative causal models of gene regulation with DNA microarray data. The method combines search through a space of variable orderings with search through a space of parameters on causal links, with weight decay driving the model toward integer values. We illustrate the technique on a model of photosynthesis regulation and associated microarray data. Experiments with synthetic data that varied distance from the target model, noise, and number of training cases suggest the method is robust with respect to these factors. In closing, we suggest directions for future research and discuss related work on inducing causal regulatory models.
1 Introduction and Motivation
Like other sciences, biology requires that its models fit available data. However, as the field moves from a focus on isolated processes to system-level behaviors, developing and evaluating models has become increasingly difficult. This challenge has become especially clear with respect to models of gene regulation, which attempt to explain complex interactions in which the expression levels of some genes influence the expression levels of others. A related challenge concerns a shift in the nature of biological data collection from focused experiments, which involve only a few variables, to cDNA microarrays, which measure thousands of expression levels at the same time. In this paper, we describe an approach that takes advantage of such nonexperimental data to revise existing models of gene regulation. Our method uses these data, combined with knowledge about the domain, to direct search for a model that better explains the observations. We emphasize qualitative causal accounts because biologists typically cast their regulatory models in this form. We focus on model revision, rather than constructing models from scratch, because biologists often have partial models for the systems they study. We begin with a brief review of molecular biology and biochemistry, including the central notion of gene regulation, then present an existing regulatory model of photosynthesis. After this, we describe our method for using microarray data to improve such models, which combines ideas from learning in neural networks
and the notion of minimum description length. Next we report experimental studies of the method that draw on both biological and synthetic data, along with the results of these experiments. In closing, we suggest directions for future research and discuss related work on inducing causal models of gene regulation.
2 Qualitative Causal Models of Gene Regulation
A gene is a fundamental unit of heredity that determines an organism's physical traits. It is an ordered sequence of nucleotides in deoxyribonucleic acid (DNA) located at a specific position on a chromosome. Genes encode functional products, called proteins, that determine the structure, function, and regulation of an organism's cells and tissues. The gene's nucleotide sequence is used to construct proteins through a multi-stage process. In brief, the enzyme RNA polymerase transcribes each gene into a complementary strand of messenger ribonucleic acid (mRNA) using the DNA as a template. Ribosomes then translate the mRNA into a specific sequence of amino acids forming a protein. Transcription is controlled through the RNA polymerase by transcription factors that let it target specific points on the DNA. The transcription factors may themselves be controlled through signalling cascades that relay signals from cellular or extra-cellular events. Typically, a signalling cascade phosphorylates (or dephosphorylates) a transcription factor, changing its conformation (i.e., physical structure) and its ability to bind to the transcription site. Translation is controlled by many different mechanisms, including repressors binding to mRNA that prevent translation into proteins. In our work, we focus on revising biological models that relate external cell signals to changes in gene transcription (as measured by mRNA) and, ultimately, phenotype. Specifically, we look at a model of photosynthesis regulation that is intended to explain why Cyanobacteria bleaches when exposed to high light conditions and how this protects the organism. This model, shown in Figure 1, was adapted from a model provided by a microbiologist (Grossman et al., 2001); that paper describes an initial model for high light response in the Cyanobacterium Synechococcus, which was modified slightly for the Cyanobacterium used in our experiments, Synechocystis PCC6803, by actions such as replacing nblS with its homolog dspA. Each node in the model corresponds to an observable or theoretical variable that denotes a measurable stimulus, gene expression level, or physical characteristic. Each link stands for a causal biological process through which one variable influences another. Solid lines in the figure denote internal processes, while dashes indicate processes connected to the environment. The model states that changes in light level modulate the expression of dspA, a protein hypothesized to serve as a sensor. This in turn regulates NBLR and NBLA expression, which then reduces the number of phycobilisome (PBS) rods that absorb light. The level of PBS is measured photometrically as the organism's greenness. The reduction in PBS protects the organism's health by reducing absorption of light, which can be damaging at high levels.

Fig. 1. Initial model for photosynthesis regulation of wild type Cyanobacteria. (Diagram of signed causal links among Light, dspA, NBLR, NBLA, PBS, RR, cpcB, psbA2, psbA1, Photo, and Health.)

The organism's health
under high light conditions can be measured in terms of the culture density. The sensor dspA impacts health through a second pathway by influencing an unknown response regulator RR, which in turn down regulates expression of the gene products psbA1, psbA2, and cpcB. The first two positively influence the level of photosynthetic activity (Photo) by altering the structure of the photosystem. If left unregulated, this second pathway would also damage the organism in high light conditions. Although the model incorporates quantitative variables, it is qualitative in that it specifies cause and effect but not the exact numerical form of the relationship. For example, one causal link indicates that increases in NBLR will increase NBLA, but it does not specify the form of the relationship, nor does it specify any parameters. The model is both partial and abstract. The biologist who proposed the model made no claim about its completeness and clearly viewed it as a working hypothesis to which additional genes and processes should be added as indicated by new data. Some links are abstract in the sense that they denote entire chains of subprocesses. For example, the link from dspA to NBLR stands for a signaling pathway, the details of which are not relevant at this level of analysis. The model also includes a theoretical variable RR, an unspecified gene (or possibly a set of genes) that acts as an intermediary controller.
3 An Approach to Revising Qualitative Causal Models
In this paper, we represent causal models in terms of linear relationships among variables. That is, each quantitative variable x(i) is represented with an equation of the form

$$x(i) = \sum_{j=1}^{i-1} A(i, j)\, x(j) + b(i) \quad\quad (1)$$
where A(i, j) describes the causal effect of variable x(j) on x(i) and b(i) is an additive constant. The variables in a model are ordered and variable x(i) can only be influenced by those variables that come before it in the causal ordering.
Using matrix form, we can represent the equations for all x(i), i = 1..n, as x = Ax + b. In this formulation, A(i, j) = 0 if i ≤ j, where A(i, j) denotes the element in row i and column j of A. This constraint enforces the causal ordering on the variables. A model is completely specified by an ordering of variables in x and an assignment of values to all elements of A and b that satisfy the above constraints. This defines the space of models that our revision method will consider.

However, we still need some way to map an initial biological model onto this notation. If we let A0 and b0 denote the initial model, then we can transform qualitative models like that in Figure 1 into a matrix A0 by setting A(i, j) = 1 if there is a positive link from variable j to i in the model, A(i, j) = −1 if the link is negative, and A(i, j) = 0 otherwise. We set the vector b0 to zero for all its elements. Given A0, b0, and observations on x, we learn new values for A and b by

1. Picking an initial ordering for variables in x;
2. Learning the best real-valued matrix A according to a score function that penalizes for differences from A0, and is subject to the ordering constraints;
3. Swapping variables in the ordering and going to step 2 (i.e., performing hillclimbing search in the space of variable orderings), continuing until the score obtained no longer improves; and
4. Transforming the real matrix A that has the best score into a discrete version with A(i, j) ∈ {−1, 0, 1} with a thresholding method.

Step 1 in this revision algorithm determines the starting state of the search. Our approach selects a random ordering that is consistent with the partial ordering implied by the initial model. During step 2, our method invokes an approach to equation revision that transforms the equation x = Ax + b into a neural network, revises weights in that network, and then transforms the network back into equations in a fashion similar to that described by Saito et al. (2001). This neural network approach uses a minimum description length (Rissanen, 1989) criterion during training to penalize models that differ from the initial model.

For example, suppose w0 is the parameter vector of the neural network that corresponds to the initial model. We define our revision task as finding a w that lets the network closely replicate the observed data and is also reasonably close to w0. To this end, we consider a communication problem in which a sender wishes to transmit a data set to a receiver using a message of the shortest possible length. However, unlike the standard MDL criterion, we assume that the initial model with w0 is known to the receiver. Namely, we compute the message length with respect to w0 − w, rather than with respect to w. Since we can avoid encoding parameter values equal to the initial ones, this metric prefers the initial model. The new parameters w0 − w are regarded as weights of the neural network, and their initial values are set to zero. Then, in order to obtain a learning result that is reasonably close to the initial model, the network is trained with weight decay, using a method called the MDL regularizer (Saito & Nakano, 1997).
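To make the mapping from a qualitative model to this notation concrete, the following Python sketch builds A0 from signed links and refits each equation of x = Ax + b for a fixed ordering. The variable names, the lam parameter, and the ridge-style penalty toward the initial parameters are illustrative assumptions: the actual method encodes the equations as a neural network, uses the MDL regularizer of Saito and Nakano (1997), and additionally hill-climbs over variable orderings, none of which is reproduced here.

import numpy as np

# Placeholder variables and signed links standing in for the model of Figure 1;
# the full model has eleven variables.
VARIABLES = ["Light", "dspA", "NBLR", "NBLA", "PBS", "Health"]
SIGNED_LINKS = {("Light", "dspA"): +1, ("dspA", "NBLR"): +1,
                ("NBLR", "NBLA"): +1, ("NBLA", "PBS"): -1,
                ("PBS", "Health"): -1}

def qualitative_to_matrix(variables, signed_links):
    """Encode signed causal links as the initial matrix A0 of Equation (1)."""
    index = {v: i for i, v in enumerate(variables)}
    A0 = np.zeros((len(variables), len(variables)))
    for (src, dst), sign in signed_links.items():
        i, j = index[dst], index[src]        # A0(i, j): effect of x(j) on x(i)
        assert i > j, "links must respect the causal ordering"
        A0[i, j] = sign
    return A0

def revise_parameters(A0, X, lam=1.0):
    """Refit A and b for a fixed ordering, pulling the fit toward (A0, 0).

    X is a (samples x variables) matrix whose columns follow the causal
    ordering; the quadratic penalty is a simplified stand-in for the MDL
    regularizer used by the actual method."""
    n_samples, n_vars = X.shape
    A, b = np.zeros_like(A0), np.zeros(n_vars)
    for i in range(n_vars):
        if i == 0:                           # root variable: intercept only
            b[0] = X[:, 0].mean()
            continue
        P = np.hstack([X[:, :i], np.ones((n_samples, 1))])   # parents + bias
        prior = np.append(A0[i, :i], 0.0)                     # b0 is zero
        theta = np.linalg.solve(P.T @ P + lam * np.eye(i + 1),
                                P.T @ X[:, i] + lam * prior)
        A[i, :i], b[i] = theta[:-1], theta[-1]
    return A, b

A0 = qualitative_to_matrix(VARIABLES, SIGNED_LINKS)
X = np.random.randn(25, len(VARIABLES))      # stand-in for normalized microarray data
A, b = revise_parameters(A0, X)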
When the modeling task includes some unobserved variables, like RR in Figure 1, we cannot directly revise links associated with those variables. To cope with such situations, our method adopts a simple forward–backward estimation based on the initial model. If x(i) is an unobserved variable, then its value can be estimated in the forward direction by the equation \hat{x}(i)^{(0)} = \sum_{j} A(i, j)\, x(j) + b(i). On the other hand, if S is a set of observed variables linked directly from x(i), i.e., S = {x(k) : k > i ∧ A(k, i) ≠ 0}, then for x(k) ∈ S, the equation for the backward estimation is \hat{x}(i) = A(k, i)^{-1} ( x(k) − \sum_{j ≠ i} A(k, j)\, x(j) − b(k) ). This lets us estimate the values {\hat{x}(i)^{(1)}, ..., \hat{x}(i)^{(M)}}, where M is the number of elements in S. Finally, our method estimates the value of x(i) as the average of these values using the equation \hat{x}(i) = (M + 1)^{-1} \sum_{m=0}^{M} \hat{x}(i)^{(m)}. One could repeat these two procedures, estimation of the unobserved variables and revision of the parameters, although the current implementation makes only one pass.

As stated above, our method performs gradient search through a space of parameters on causal links, with weight decay driving the model toward integer values. However, the resulting values are not strictly integers. To overcome this problem, in step 4 we employ a simple thresholding method. After sorting the parameter values used to predict one variable x(i), the system uses two thresholds, T_{−1} and T_{+1}, to divide this sorted list into three portions. Parameter value A(i, j) is set to −1 if A(i, j) < T_{−1}, to +1 if A(i, j) > T_{+1}, and to 0 otherwise. Note that T_{−1} ≤ T_{+1}, and we can obtain all possible integer lists with computational complexity O(N^2), where N denotes the number of parameters. Given these integer lists, our method selects the result that minimizes the MDL cost function defined by 0.5 × s × log(MSE) + r × log(N), where s is the number of training samples, r is the number of revised parameters, and MSE is the mean squared error on the samples. The first term of the cost function is a code length for transmitting the data, derived by assuming Gaussian noise for the variables, while the second term is a code length for the revision information, i.e., the product of the number of revised parameters and the cost of encoding an integer to indicate the parameter that is revised.
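As a concrete illustration of the thresholding in step 4, the sketch below enumerates all threshold pairs for one equation and keeps the integer assignment with the lowest MDL cost. The handling of the bias term and the per-equation bookkeeping of r are assumptions made for the sketch, and the forward–backward estimation of unobserved variables is not shown.

import numpy as np
from itertools import combinations_with_replacement

def discretize_row(a_real, a0_row, x_parents, y, s, n_params, b=0.0):
    """Threshold one row of the learned real-valued A into {-1, 0, +1}.

    Tries all threshold pairs T_-1 <= T_+1 drawn from the sorted coefficients
    (O(N^2) candidates, as in the text) and keeps the integer row that
    minimizes 0.5*s*log(MSE) + r*log(N), where r counts parameters that
    differ from the corresponding row a0_row of the initial model."""
    cuts = np.concatenate((np.sort(a_real), [np.inf]))
    best_cost, best_row = np.inf, np.zeros_like(a0_row, dtype=float)
    for lo, hi in combinations_with_replacement(range(len(cuts)), 2):
        t_neg, t_pos = cuts[lo], cuts[hi]            # T_-1 <= T_+1
        row = np.where(a_real < t_neg, -1.0,
                       np.where(a_real > t_pos, 1.0, 0.0))
        mse = np.mean((y - (x_parents @ row + b)) ** 2) + 1e-12
        r = int(np.sum(row != a0_row))               # number of revised parameters
        cost = 0.5 * s * np.log(mse) + r * np.log(n_params)
        if cost < best_cost:
            best_cost, best_row = cost, row
    return best_row

# Hypothetical usage, once per variable:
#   A[i, :i] = discretize_row(A[i, :i], A0[i, :i], X[:, :i], X[:, i],
#                             s=X.shape[0], n_params=len(A) * (len(A) - 1) // 2)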
4 Experimental Studies of the Revision Method
In this section, we describe experimental studies of our revision method. We take a dual approach of evaluating the system using both natural data obtained from microarrays of Cyanobacteria cultures and synthetic data generated from known mathematical models. Natural data lets us evaluate the biological plausibility of changes suggested by our algorithm. However, because we have an extremely limited number of microarrays, it can be difficult to evaluate the reliability of the suggested revisions even if they appear biologically plausible. Therefore, we also used synthetic data to evaluate the robustness and reliability of our approach. Because we can generate synthetic data from a known model, we can measure the sensitivity and reliability of our algorithm in the presence of complicating factors such as errors in the initial model, small sample sizes, and noise.
4.1 Revising the Model of Photosynthesis Regulation
We applied our method to revise the regulatory model of photosynthesis for wild type Cyanobacteria. We have microarray data that include measurements for approximately 300 genes believed to play a role in photosynthesis. For this analysis, we focus on the genes in the model and do not consider links to other genes. The array data were collected at 0, 30, 60, 120, and 360 minutes after high light conditions were introduced, with four replicated measurements at each time point. We treat both RR and Photo, which represents the structure of the photosystem, as unmeasured variables. We currently treat the data as independent samples and ignore their temporal aspect, along with dependencies among the four replicates.

We implemented our method in the C programming language and conducted all experiments on a 1.3 GHz Pentium running Linux. Revising the photosynthesis model took 0.02 seconds of CPU time. For each variable, the observed values were normalized to a mean of zero and a standard deviation of one. Figure 2 shows the revised model, which reflects three changes:

1. dropping the link from dspA to RR;
2. connecting Photo to RR instead of psbA1 and psbA2; and
3. changing the sign of the link from PBS to Health from negative to positive.

The first two changes are difficult to explain from a biological perspective. Because dspA is a light sensor, there should be either a direct or indirect path linking it with the genes cpcB, psbA1, or psbA2. Dropping the link disconnects dspA from those genes and removes it as a possible cause. Also, the structure of the photosystem (Photo) is believed to depend on at least one of psbA1 or psbA2, and connecting Photo only to RR removes psbA1 and psbA2 as parents.2

Changing the sign of the link from PBS to Health is more plausible. The initial model was specified for high light conditions in which excessive light levels damage the organism. However, at lower light levels, increased PBS should aid the organism because it is a vital component in energy production. One explanation suggested by the microbiologist is that light levels during the biological experiment may not have been set correctly and were not high enough to reduce health.
4.2 Robustness of the Revision Approach
We evaluated the robustness of our approach by generating synthetic data from a known model and varying factors of interest. Specifically, we varied the number of training samples, the number of errors in the initial model, the observability of variables, and the noise level. We expected each of these factors to influence the behavior of the revision algorithm.
2 The genes psbA1 and psbA2 encode variants of the D1 protein, a necessary and central component of the Photosystem II reaction center (Wiklund et al., 2001).
[Figure 2 appears here: the revised causal graph over the variables Light, dspA, RR, NBLR, NBLA, PBS, cpcB, psbA1, psbA2, Photo, and Health.]
Fig. 2. Revised model of photosynthesis regulation in Cyanobacteria.
To this end, we generated training data by treating the structure of the model in Figure 1 as the true model. We assumed that each variable was a linear function of its parents with noise added from a random normal distribution. The root causal variable, Light, has no parents and was assigned a random uniform value between 0 and 1. We generated initial models to serve as starting points for revision by randomly adding links to, or deleting links from, the true model in Figure 1. Our dependent measure was the net number of corrections, that is, the number of correct changes minus the number of incorrect changes, suggested by the revision process. For each experimental condition, we generated 20 distinct training sets and averaged the results for this measure.

Figure 3 (a) shows the results from one experimental condition that involved only observable variables and only a small amount of noise (σ = 0.1). The x axis in the graph represents the number of errors in the initial model, whereas the y axis specifies the net number of corrections. The three curves correspond to different size training sets, with the smallest containing only 25 instances and the largest involving 100 observations. In general, the revision method fared quite well, in that it consistently corrected almost all of the errors in the initial model. More data improved this performance, with 100 training cases being enough to give almost perfect results on all 20 runs.

However, other factors can degrade the system's behavior somewhat. Figure 3 (b) shows the results at the same noise level when the variable RR is unobservable but all others are available. Overall, the net number of corrections decreased substantially compared to the fully observable condition. However, the method still has enough power to recover portions of the true model. Figure 3 (c) and (d) show the system's behavior with RR unobserved at higher levels of noise, with σ = 0.2 and σ = 0.4, respectively. The net number of corrections under these conditions is similar to that when σ = 0.1, which suggests that our approach is robust with respect to noise of this type. Note that σ = 0.4 constitutes a rather high noise level in comparison with the range of the variables (e.g., light varies from 0 to 1).

We should also note that the system never suggested changes to the initial model when it was correct (i.e., contained zero errors). This indicates that the revision method is behaving in a conservative manner that is unlikely to make
a good model worse, even in the presence of noise, unobservable variables, and small samples. This in turn suggests that our use of minimum description length is having the desired effect.
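A minimal sketch of the synthetic-data setup just described is given below. It assumes that every true link weight is ±1 (matching the signs in Figure 1) and that randomly added links receive a random sign; the exact weights and error-injection details used in the experiments are not stated in the text, so these are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(0)

def generate_data(A_true, b_true, n_samples, sigma):
    """Sample training data, treating the model structure as ground truth.

    Variables are generated in causal order: the root (index 0, Light) is
    uniform on [0, 1]; every other variable is a linear function of its
    parents plus N(0, sigma^2) noise."""
    n_vars = A_true.shape[0]
    X = np.zeros((n_samples, n_vars))
    X[:, 0] = rng.uniform(0.0, 1.0, size=n_samples)
    for i in range(1, n_vars):
        X[:, i] = (X[:, :i] @ A_true[i, :i] + b_true[i]
                   + rng.normal(0.0, sigma, size=n_samples))
    return X

def corrupt_model(A_true, n_errors):
    """Create an erroneous initial model by randomly adding or deleting links,
    while respecting the causal ordering (only entries below the diagonal)."""
    A0 = A_true.copy()
    n = A0.shape[0]
    lower = [(i, j) for i in range(n) for j in range(i)]
    for i, j in map(tuple, rng.permutation(lower)[:n_errors]):
        A0[i, j] = 0.0 if A0[i, j] != 0 else rng.choice([-1.0, 1.0])
    return A0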
[Figure 3 appears here: four panels (a)–(d), each plotting the net number of corrections (y axis) against the number of errors in the initial model (x axis) for training sets of 25, 50, and 100 samples.]
Fig. 3. Average net number of corrections to the initial model for 25, 50, and 100 samples when (a) all variables are observed and σ = 0.1, (b) the variable RR is unobserved and σ = 0.1, (c) RR is unobserved and σ = 0.2, and (d) RR is unobserved and σ = 0.4.
5 Directions for Future Research
The results from our experiments on Cyanobacteria data were disappointing, as they were difficult to explain from a biological perspective. However, on synthetic data our system was able to improve incorrect initial models even when there were few training samples, unobserved variables, and noise. This suggests that our general approach is feasible, but that we may need to address some of the limitations, chosen by design, in the approach. For instance,
we modeled the relationships between genes as a linear function. Although linear models are desirable because they have few parameters, they cannot model combinatorial effects among genes or thresholds in which a gene's expression must be above a certain level before it can affect other genes. The neural network approach to revision is not limited to linear models, and we could use a more general form to represent relationships between genes.

We also restricted the genes that could appear in the model to a small subset of those measured by the microarray chips. The complete set of data contains about 300 variables, from which we used the 11 variables present in the initial model. Restricting the number of variables involves a tradeoff. Including too many variables for the number of samples makes estimating relationships unreliable because of the multiple hypothesis testing problem (Shaffer, 1995). However, using too few variables increases the likelihood that we have omitted an important variable from the analysis. Future implementations could minimize this problem by including an operator for adding new genes during the revision process and using domain knowledge to select only the most promising candidates for incorporation into the model.

In addition, we should extend our approach to model revision in various other ways. Since transcriptional gene regulation takes time to occur, future systems should search through an expanded space of models that include time delays on links3 and feedback cycles. To handle more complex biological processes, they should also represent and revise models with subsystems that have little interaction with each other. Each of these extensions would benefit from incorporation of additional biological knowledge, cast as taxonomies over both genes and regulatory processes, to constrain the search for improved models.

Finally, we must test our approach on both more regulatory models and more microarray data before we can judge its practical value. Our biologist collaborators are collecting additional data on Cyanobacteria under more variable conditions, which we predict will provide additional power to our revision method. We also plan to evaluate the technique on additional data sets that we have acquired from other biologists, including ones that involve yeast development and lung cancer.
6 Related Research
Although most computational analyses of microarray data rely on clustering to group related genes, we are not the first to focus on inducing causal models of gene regulation. Most research on this topic encodes regulatory models as Bayesian networks with discrete variables (e.g., Friedman et al., 2000; Hartemink, 2002; Ong et al., 2002). Because microarray data are quantitative, this approach often includes a discretization step that may lose important information, whereas our approach deals directly with the observed continuous
3 An alternative is to model the regulation between genes with differential equations.
values.4 These researchers also report methods that construct causal models from scratch, rather than revising an initial model, though some incorporate background knowledge to constrain the search process.

An alternative approach represents hypotheses about gene regulation as linear causal models, which relate continuous variables through a set of linear equations. Such systems evaluate candidate models in terms of their ability to predict constraints among partial correlations, rather than their ability to predict the data directly. Within this framework, some methods (e.g., Saavedra et al., 2001) construct a linear causal model from the ground up, whereas others (e.g., Langley et al., 2002) instead revise an initial model, as in the approach we report here. One advantage of this constraint-based paradigm is that it can infer qualitative models directly, without the need to discretize or fit continuous parameters. In contrast, our technique combines search through a parameter space with weight decay to achieve a similar end.

We should also mention approaches that, although not concerned with gene regulation, also construct causal models in scientific domains. One example comes from Koza et al. (2001), whose method formulates a quantitative model of metabolic processes from synthetic time series about chemical concentrations. Another involves Zupan et al.'s (2001) GenePath, which infers a qualitative genetic network to explain phenotypic results from gene knockout experiments. Mahidadia and Compton (2001) report an interactive system for revising qualitative models from experimental results in neuroendocrinology. Finally, our approach to revising scientific models borrows ideas from Saito et al. (2001), who transform an initial quantitative model into a neural network and utilize weight learning to improve its fit to observations.
7 Conclusions
In this paper, we characterized the task of discovering a qualitative causal model of gene regulation based on data from DNA microarrays. Rather than attempting to construct the model from scratch, we instead assume an existing model has been provided by biologists who want to improve its fit to the data. These models require a causal ordering on variables, links between variables, and signs on these links. We presented an approach to this revision task that combines a hillclimbing search through the space of variable orderings and a gradient descent search for weights on links, with the latter using a weight decay method guided by minimum description length to drive weights to integer values.

We illustrated the method's behavior on a model of photosynthesis regulation in Cyanobacteria, using microarray data from biological experiments. However, our experimental evaluation also relied on synthetic data, which let us vary systematically the distance between the initial and target models, the amount of training data available, and the noise in these data. We found that the method scaled well on each of these dimensions, which suggests that it may prove a useful
4 Imoto et al. (2002) report one way to induce quantitative models of gene regulation within the framework of Bayesian networks.
tool for revising models based on biological data. We noted that our approach has both similarities to, and differences from, other recent techniques for inducing causal models of gene regulation. We must still evaluate the method on other data sets and extend it on various fronts, but our initial experiments on synthetic data have been encouraging.
Acknowledgements. This work was supported by the NASA Biomolecular Systems Research Program and by NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation. We thank Arthur Grossman, Jeff Shrager, and C. J. Tu for the initial model, for microarray data, and for advice on biological plausibility.
References

Friedman, N., Linial, M., Nachman, I., & Peer, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601–620.
Grossman, A. R., Bhaya, D., & He, Q. (2001). Tracking the light environment by Cyanobacteria and the dynamic nature of light harvesting. The Journal of Biological Chemistry, 276, 11449–11452.
Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., & Young, R. A. (2002). Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing, 7, 437–449.
Imoto, S., Goto, T., & Miyano, S. (2002). Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing, 7, 175–186.
Koza, J. R., Mydlowec, W., Lanza, G., Yu, J., & Keane, M. A. (2001). Reverse engineering and automatic synthesis of metabolic pathways from observed data using genetic programming. Pacific Symposium on Biocomputing, 6, 434–445.
Langley, P., Shrager, J., & Saito, K. (in press). Computational discovery of communicable scientific knowledge. In L. Magnani, N. J. Nersessian, & C. Pizzi (Eds.), Logical and computational aspects of model-based reasoning. Dordrecht: Kluwer Academic.
Mahidadia, A., & Compton, P. (2001). Assisting model-discovery in neuroendocrinology. Proceedings of the Fourth International Conference on Discovery Science (pp. 214–227). Washington, D.C.: Springer.
Ong, I. M., Glasner, J., & Page, D. (2002). Modeling regulatory pathways in E. coli from time series expression profiles. Proceedings of the Tenth International Conference on Intelligent Systems for Molecular Biology.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Saavedra, R., Spirtes, P., Scheines, R., Ramsey, J., & Glymour, C. (2001). Issues in learning gene regulation from microarray databases (Tech. Report No. IHMCTR-030101-01). Institute for Human and Machine Cognition, University of West Florida.
Saito, K., Langley, P., Grenager, T., Potter, C., Torregrosa, A., & Klooster, S. A. (2001). Computational revision of quantitative scientific models. Proceedings of the Fourth International Conference on Discovery Science (pp. 336–349). Washington, D.C.: Springer.
Saito, K., & Nakano, R. (1997). MDL regularizer: A new regularizer based on the MDL principle. Proceedings of the 1997 International Conference on Neural Networks (pp. 1833–1838). Houston, Texas.
Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561–584.
Wiklund, R., Salih, G. F., Maenpaa, P., & Jansson, C. (2001). Engineering of the protein environment around the redox-active TyrZ in photosystem II. European Journal of Biochemistry, 268, 5356–5364.
Zupan, B., Bratko, I., Demsar, J., Beck, J. R., Kuspa, A., & Shaulsky, G. (2001). Abductive inference of genetic networks. Proceedings of the Eighth European Conference on Artificial Intelligence in Medicine. Cascais, Portugal.
SEuS: Structure Extraction Using Summaries

Shayan Ghazizadeh and Sudarshan S. Chawathe
University of Maryland, Department of Computer Science, College Park, MD
{shayan,chaw}@cs.umd.edu
Abstract. We study the problem of finding frequent structures in semistructured data (represented as a directed labeled graph). Frequent structures are graphs that are isomorphic to a large number of subgraphs in the data graph. Frequent structures form building blocks for visual exploration and data mining of semistructured data. We overcome the inherent computational complexity of the problem by using a summary data structure to prune the search space and to provide interactive feedback. We present an experimental study of our methods operating on real datasets. The implementation of our methods is capable of operating on datasets that are two to three orders of magnitude larger than those described in prior work.
1 Introduction
In many data mining tasks, an important (and frequently most time-consuming) step is the discovery and enumeration of frequently occurring patterns, which are informally sets of related data items that occur frequently enough to be of potential interest for a detailed data analysis. The precise interpretation of this term depends on the data model, dataset, and application. Perhaps the best studied framework for data mining uses association rules to describe interesting relationships between sets of data items [AIS93]. In this framework, which is typically applied to market basket data (from checkout registers, indicating items purchased together), the critical operation is determining frequent itemsets, which are defined as sets of items that are purchased together often enough to pass a given threshold (called the support). For time series data, an analogous concept is a subsequence of the given series that occurs frequently. This paper defines an analogous concept, called frequent structures, for semistructured data (represented as a labeled directed graph) and presents efficient methods for computing frequent structures in large datasets.

Semistructured data refers to data that has some structure but is difficult to describe with a predefined, rigid schema. The structure of semistructured data is irregular, incomplete, frequently changing, and usually implicit or unknown to the user. Common examples of this type of data include memos, Web pages, documentation, and bibliographies.

Data mining is an iterative process in which a human expert refines the parameters of a data mining system based on intermediate results presented by the mining system. It is unreasonable to expect an expert to select the proper values for mining parameters a priori because such selection requires a detailed knowledge of the data, which is what the mining system is expected to enable. While frequent and meaningful feedback is
important for any data mining system, it is of particular importance when the data is semistructured because, in addition to the data-dependent relationships being unknown a priori, even the schema is not known (and not fixed). Therefore, rapid and frequent feedback to a human expert is a very important requirement for any system that is designed to mine semistructured data. Prior work (discussed in Section 4) on mining such data often falls short on this requirement.

The main idea behind our method, which is called SEuS, is the following three-phase process: In the first phase (summarization), we preprocess the given dataset to produce a concise summary. This summary is an abstraction of the underlying graph data. Our summary is similar to data guides and other (approximate) typing mechanisms for semistructured data [GW97,BDFS97,NUWC97,NAM97]. In the second phase (candidate generation), our method interacts with a human expert to iteratively search for frequent structures and refine the support threshold parameter. Since the search uses only the summary, which typically fits in main memory, it can be performed very rapidly (interactive response times) without any additional disk accesses. Although the results in this phase are approximate (a superset of the final results), they are accurate enough to permit uninteresting structures to be filtered out. When the expert has filtered potential structures using the approximate results of the search phase, an accurate count of the number of occurrences of each potential structure is produced by the third phase (counting).

Users are often willing to sacrifice quality for a faster response. For example, during the preliminary exploration of a dataset, one might prefer to get a quick and approximate insight into the data and base further exploration decisions on this insight. In order to address this need, we introduce an approximate version of our method, called L-SEuS. This method only returns the top-n frequent structures rather than all frequent structures. Due to space limitations we are not able to present the details of this approximate method here. Interested readers can refer to [GC02].

The methods in this paper have three significant advantages over prior work: First, they operate efficiently on datasets that are two to three orders of magnitude larger than those handled by prior work of which we are aware. Second, even for large datasets, our methods provide approximate results very quickly, enabling their use in an interactive exploratory data analysis. Third, for applications and scenarios that are interested in only the frequent structures, but not necessarily their exact frequencies, the most expensive counting phase can be completely skipped, resulting in great performance benefits.

In order to evaluate our ideas, we have implemented our method in a data mining system for (semi)structured data (also called SEuS). In addition to serving as a testbed for our experimental study (Section 3), the system is useful in its own right as a tool for exploring (semi)structured data. We have found it to discover intuitively meaningful structures when applied to datasets from several domains. Our implementation of SEuS uses the Java 2 (J2SE) programming environment and is freely available at http://www.cs.umd.edu/projects/seus/ under the terms of the GNU GPL license. Figure 1 is a screenshot of our system in action. The current set of frequent structures is displayed together with a slider that allows the threshold to be modified.
Given a new value for the threshold, the system computes (in interactive times) the new
Fig. 1. A snapshot of SEuS system
set of frequent structures and presents them as depicted. We have found this iterative process to be very effective in arriving at interesting values of the parameter. The rest of this paper is organized as follows: In Section 2, we define the structure discovery problem formally and present our three-phase solution called SEuS. Sections 2.1, 2.2, and 2.3 describe the summarization, candidate generation, and counting phases. Section 3 summarizes the results of our detailed experimental study. Related work is discussed in Section 4 and we conclude in Section 5.
2 Structure Discovery
SEuS represents semistructured data as a labeled directed graph. In this representation, objects are mapped to vertices and relations between these objects are modeled by edges. A structure is defined to be a connected graph that is isomorphic to at least one subgraph of the database. Figure 2 illustrates the graph representation of a small XML database. Any subgraph of the input database that is isomorphic to a structure is called an instance of that structure. The number of instances of a structure is called the structure's support. (We allow the instances to overlap.) For the data graph in Figure 2, a structure and its three instances are shown in Figure 3. We say a structure is T-frequent if it has a support higher than a given threshold T.

Problem statement (frequent structure discovery): Given the graph representation of a database and a threshold T, find the set of T-frequent structures.
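These definitions can be restated directly in code. The sketch below is not the SEuS implementation (which is written in Java and uses the nauty package for canonical labeling); it simply counts a structure's support with networkx's VF2 matcher, assuming networkx 2.4 or later for subgraph_monomorphisms_iter, and uses hypothetical vertex identifiers for a fragment of the example graph.

import networkx as nx
from networkx.algorithms import isomorphism as iso

def support(data_graph, structure):
    """Count instances of `structure` in `data_graph`: distinct subgraphs of
    the data that are isomorphic to the structure (overlaps allowed).

    subgraph_monomorphisms_iter matches arbitrary -- not only induced --
    subgraphs, which fits the definition of an instance above; vertex and
    edge labels are kept in a 'label' attribute."""
    matcher = iso.DiGraphMatcher(
        data_graph, structure,
        node_match=iso.categorical_node_match("label", None),
        edge_match=iso.categorical_edge_match("label", None))
    instances = set()
    for mapping in matcher.subgraph_monomorphisms_iter():
        # mapping: data vertex -> structure vertex; an instance is identified
        # by the set of data edges onto which the structure's edges map.
        inverse = {s: d for d, s in mapping.items()}
        edges = frozenset((inverse[u], inverse[v]) for u, v in structure.edges())
        instances.add(edges)
    return len(instances)

# A tiny fragment of the bibliography graph of Figure 2 (hypothetical ids).
G = nx.DiGraph()
G.add_node("b1", label="book"); G.add_node("t1", label="title")
G.add_node("b2", label="book"); G.add_node("t3", label="title")
G.add_edge("b1", "t1", label="child"); G.add_edge("b2", "t3", label="child")

S = nx.DiGraph()
S.add_node(0, label="book"); S.add_node(1, label="title")
S.add_edge(0, 1, label="child")

print(support(G, S))   # 2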
[Figure 2 appears here: the graph representation of a small XML bibliography database, with author, book, paper, title, name, year, volume, journal, and conference vertices connected by child, cite, and idref edges.]
Fig. 2. Example input graph
A naive approach for finding frequent structures consists of enumerating all subgraphs, partitioning this set of subgraphs into classes based on graph isomorphism, and returning a representative from the classes with cardinality greater than the support threshold. Unfortunately, the number of subgraphs of a graph database is exponential in the size of the graph. Further, the naive approach tests each pair of these subgraphs for isomorphism in the worst case. Although graph isomorphism is not known to be NP-hard (or in P) [For96], it is a difficult problem and an approach relying on an exponential number of isomorphism tests is unlikely to be practical for large databases.

Given the above, practical systems must use some way to avoid examining all the possible subgraphs and must calculate the support of structures without partitioning the set of all possible subgraphs. Instead of enumerating all of the subgraphs in the beginning, we can use a level-by-level expansion of subgraphs similar to the k-itemset approach adopted in Apriori [AS94] for market basket data. We start from subgraphs of size one (single vertex) and try to expand them by adding more vertices and edges. A subgraph is not expanded any further as soon as we can reason that its support will fall under the threshold, based on the downward closure property: a structure can have a support higher than a threshold only if all of its subgraphs also have a support higher than the threshold. A number of systems have used such a strategy for discovering frequent structures [IWM00,KK01,CH00], along with other heuristics to speed up the process. (See Section 4 for details.)

However, the results reported in these papers, as well as our experiments, suggest that these methods do not scale to very large databases. The main factor hurting the performance of these methods is the need to go through the database to determine the support of each structure. Although the number of structures for which the support has to be calculated has decreased significantly compared to the
[Figure 3 appears here: a small structure and its three instances (Subgraphs 1–3, involving the book vertices b1, b2, and b4) in the example graph.]
Fig. 3. A structure and its three instances
naive approach (due to the use of downward closure properties and other heuristics), the calculation of the support of the remaining structures is still expensive. Further, all of these systems operate in a batch mode: After providing the input database, a user has to wait for the structure discovery process to terminate before any output is produced. There are no intermediate (partial or approximate) results, making exploratory data analysis difficult. This batch mode operation can cause major problems, especially when the user does not have enough domain knowledge to guess proper values for mining parameters (e.g., support threshold).

In order to operate efficiently, SEuS uses data summaries instead of the database itself. Summaries provide a concise representation of a database at the expense of some accuracy. This representation allows our system to approximate the support of a structure without scanning the database. We also use the level-by-level expansion method to discover frequent structures. SEuS has three major phases: The first phase (summarization) is responsible for creating the data summary and is described in Section 2.1. In the second phase (candidate generation), SEuS finds all structures that have an estimated support above the given threshold; it is described in Section 2.2. The second phase reports such candidate structures to the user, and this early feedback is useful for exploratory work. The exact support of structures is determined in the third phase (counting), described in Section 2.3.
2.1 Summarization
We use a data summary to estimate the support of a structure (i.e., the number of subgraphs in the database that are isomorphic to the structure). Our summary is similar in spirit to representative objects, graph schemas, and DataGuides [NUWC97,BDFS97,GW97]. The summary is a graph with the following characteristics. For each distinct vertex label l in the original graph G, the summary graph S has an l-labeled vertex. For each m-labeled edge (v1, v2) in the original graph there is an m-labeled edge (l1, l2)
in S, where l1 and l2 are the labels of v1 and v2, respectively. The summary S also associates a counter with each vertex (and edge) indicating the number of vertices (respectively, edges) in the original graph that it represents. For example, Figure 4 depicts the summary generated for the input graph of Figure 2.
[Figure 4 appears here: the summary graph for the example database, with one vertex per label (author, book, paper, name, title, year, volume, journal, conference) and counters on the vertices and edges.]
Fig. 4. Summary graph
Since all vertices in the database with the same label map to one vertex in the summary, the summary is typically much smaller than the original graph. For example, the graph of Figure 2 has four vertices labeled book, while the summary has only one vertex representing these four vertices. In this simple example, the summary is only slightly smaller than the original data. However, as noted in [GW97], many common datasets are characterized by a high degree of structural replication, giving much greater space savings. (For details, see Table 1 in Section 3.) These space savings come at the cost of reduced accuracy of representation. In particular this summary tells us the labels on possible edges to and from the vertices labeled paper, although they may not all be incident on the same vertex in the original graph. (For example, journal and conference vertices never connect to the same paper vertex, but the summary does not contain this information.)
[Figure 5 appears here: the counting lattice for the paper vertex, with one node per distinct set of outgoing edges and its frequency.]
Fig. 5. Counting Lattice for paper vertex
We can partly overcome this problem by creating a richer summary. Instead of storing only the set of edges leaving a vertex label and their frequencies, we can create a counting lattice (similar to the one used in [NAM97]), L(v), for each vertex v. For every distinct set of edges leaving v, we create a node in L(v) and store the frequency of this set of outgoing edges. For example, consider the vertex label paper in Figure 2. The counting lattice for this vertex is depicted in Figure 5. In the input graph, there are three different types of paper vertices with respect to their outgoing edges. One of them, p3, has a single outgoing edge labeled child leading to a title vertex. Another instance, p2, has two outgoing edges to title and conference vertices. Finally, p1 has four outgoing edges. The lattice represents these three types of vertices with label paper separately, while a simple summary does not distinguish between them. Each node in the lattice also stores the support of the paper vertex type it represents. We call the original summary a level-0 summary and the summary obtained by adding this lattice structure a level-1 summary. Using the level-1 summary, we can reason that there is no paper vertex in the database that connects to both journal and conference vertices, which is not possible using only the level-0 summary.

This process of enriching the summary by differentiating vertices based on the labels of their outgoing edges can be carried further by using the labels of vertices and edges that are reachable using paths of lengths two or more. We refer to such summaries as level-k summaries: a level-k summary differentiates vertices based on labels of edges and vertices on outgoing paths of length k. However, building level-k summaries for k ≥ 2 is considerably more difficult than building level-0 and level-1 summaries. Level-0 summaries are essentially data guides, and level-1 summaries can be built with no additional cost if the file containing the graph edges is sorted by the identifiers of source vertices. For summaries of higher levels, additional passes over the graph are required. Further, our experiments show that level-1 summaries are accurate enough for the datasets we study (see [GC02] for details), so the additional benefit of higher summary levels is unclear. In the rest of this paper, we focus on level-0 and level-1 summaries.

We assume that the graph database is stored on disk as a sequence of edges, sorted in lexicographic order of the source vertex. Clearly, building the level-0 summary requires only a single sequential scan of the edges file. We build the summary incrementally in memory as we scan the file. For an edge (v1, v2, l) we increment the counters associated with the summary nodes representing the labels l1 and l2 of v1 and v2, respectively. Similarly, the counter associated with the summary edge (s(l1), s(l2), l) is incremented, where s(li) denotes the summary node representing label li. (If the summary nodes or edges do not exist, they are created.) Since the edges file is sorted in lexicographic order of the source, we can be sure that we get all of the outgoing edges of a vertex before encountering another source vertex. Therefore, we can create the level-1 summary in the same pass as we build the level-0 summary.

We use a level-0 summary L0 to estimate the support of a structure S as follows: By construction, there is at most one subgraph of L0 (say, S′) that is isomorphic to S. If no such subgraph exists, then the estimated (and actual) support of S is 0.
Otherwise, let C be the set of counters on S′ (i.e., C consists of the counters on the nodes and edges of S′). The support of S is estimated by the minimum value in C. Given our construction of the summary, this estimate is an upper bound on the true support of S. With a level-1
summary L1, we estimate the support of a structure S as follows: For each vertex v of S, let L(v) be the set of lattice nodes in L1 that represent a set of edges that is a superset of the set of out-edges of v. Let c(v) denote the sum of the counters for the nodes in L(v). The support of S is estimated to be min_{v∈S} c(v). This estimate is also an upper bound on the true support of S. Further, it is a tighter bound than that given by the corresponding level-0 summary.
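The following Python sketch restates the level-0 and level-1 summaries and their support estimates in executable form. It is only an illustration of the definitions above, not the SEuS implementation (which is in Java and builds both summaries in a single scan of the sorted edge file); the in-memory dictionaries and the (src, dst, label) edge representation are assumptions of the sketch.

from collections import Counter

def build_level0_summary(node_labels, edges):
    """Level-0 summary: a counter for each vertex label and for each
    (source_label, edge_label, destination_label) triple, as in Figure 4.

    node_labels maps vertex id -> label; edges is an iterable of
    (src_id, dst_id, edge_label) triples."""
    vertex_count = Counter(node_labels.values())
    edge_count = Counter((node_labels[u], lbl, node_labels[v])
                         for u, v, lbl in edges)
    return vertex_count, edge_count

def estimate_support_level0(summary, struct_nodes, struct_edges):
    """Upper-bound estimate: the minimum counter over the structure's labelled
    vertices and edges, or 0 if any of them is absent from the summary.

    struct_nodes maps structure vertex -> label; struct_edges is an iterable
    of (src_vertex, dst_vertex, edge_label) triples."""
    vertex_count, edge_count = summary
    counters = [vertex_count.get(lbl, 0) for lbl in struct_nodes.values()]
    counters += [edge_count.get((struct_nodes[u], lbl, struct_nodes[v]), 0)
                 for u, v, lbl in struct_edges]
    return min(counters) if counters else 0

def build_level1_summary(node_labels, edges):
    """Level-1 summary: for each vertex label, a counter over the distinct
    sets of outgoing (edge_label, destination_label) pairs (the counting
    lattice of Figure 5, with the lattice ordering left implicit)."""
    out = {v: set() for v in node_labels}
    for u, v, lbl in edges:
        out[u].add((lbl, node_labels[v]))
    lattice = {}
    for v, label in node_labels.items():
        lattice.setdefault(label, Counter())[frozenset(out[v])] += 1
    return lattice

def estimate_support_level1(lattice, struct_nodes, struct_edges):
    """Tighter upper bound: min over structure vertices v of the total count
    of lattice nodes whose edge set is a superset of v's out-edges."""
    out = {v: set() for v in struct_nodes}
    for u, w, lbl in struct_edges:
        out[u].add((lbl, struct_nodes[w]))
    counts = []
    for v, label in struct_nodes.items():
        nodes = lattice.get(label, Counter())
        counts.append(sum(c for edge_set, c in nodes.items()
                          if out[v] <= edge_set))
    return min(counts) if counts else 0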
2.2 Candidate Generation
A simplified version of our candidate generation algorithm is outlined in Figure 6: CandidateGeneration(x) returns a list of candidate structures whose estimated support is x or higher. It maintains two lists of structures: open and candidate. In the open list we store structures that have not been processed yet (and that will be checked later). The algorithm begins by adding all structures that consist of only one vertex and pass the support threshold test to the open list. The rest of the algorithm is a loop that repeats until there are no more structures to consider (i.e., the open list is empty). In each iteration, we select a structure (S) from the open list and we use it to generate larger structures (called S's children) by calling the expand subroutine, described below. New child structures that have an estimated support of at least x are added to the open list. The qualifying structures are accumulated in the candidate list, which is returned as the output when the algorithm terminates.

Algorithm CandidateGeneration(threshold)
 1. candidate ← ∅; open ← ∅;
 2. for v ∈ summary and support(v) ≥ threshold
 3.   do create a structure s consisting of a single vertex v;
 4.      open ← open ∪ {s};
 5. while open ≠ ∅
 6.   do S ← any structure in open;
 7.      open ← open − {S}; candidate ← candidate ∪ {S};
 8.      children ← expand(S);
 9.      for c ∈ children
10.        do if support(c) ≥ threshold and c ∉ candidate
11.             then open ← open ∪ {c};

Fig. 6. Simplified Candidate Generation Algorithm
Given a structure S, the expand subroutine produces the set of structures generated by adding a single edge to S (termed the children of S). In the following description of the expand(S) subroutine, we use S(v) to denote the set of vertices in S that have the same label as vertex v in the data graph and V(s) to denote the set of data vertices that have the same label as a vertex s in S. For each vertex s in S, we create the set addable(S, s) of edges leaving some vertex in V(s). This set is easily determined from the data summary: It is the set of out-edges for the summary vertex representing s. (As we shall discuss in Section 3, this ability to generate structures using only the in-memory
summary instead of the disk-resident database results in large savings in running time.) Each edge e = (s, v, l) in addable(S, s) that is not already in S is a candidate for expanding S. If S(v) (the set of vertices with the same label as e's destination vertex) is empty, we add a new vertex x with the same label as v and a new edge (s, x, l) to S. Otherwise, for each x ∈ S(v), if (s, x, l) is not in S, a new structure is created from S and e by adding the edge (s, x, l) (an edge between vertices already in S). If s does not have an l-labeled edge to any of the vertices in S(v), we also add a new structure which is obtained from S by adding a vertex x′ with the same label as v and an edge (s, x′, l).

For example, consider the graph in Figure 2. Let us assume that we want to expand a structure S consisting of a single vertex s labeled author. The set addable(S, s) is {author −child→ book, author −idref→ book, author −child→ name, author −child→ paper} (all the edges that leave an author-labeled vertex in the database). Since S has only one vertex, it can be expanded only by adding these four edges. Using the first edge in the addable set, a new structure is obtained from S by adding a new book-labeled vertex and connecting s to this new vertex by a child edge. The other edges in addable(S, s) give rise to three other structures in this manner.
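A minimal Python sketch of the expand subroutine is shown below, using a tuple-of-labels plus edge-set representation for structures and the out-edge sets of the level-0 summary as the addable edges. It is an assumption-laden illustration: duplicate children that are isomorphic to one another are not detected here (the actual system relies on canonical labeling via nauty for that), and the full loop of Figure 6 would additionally filter children by their estimated support.

def expand(structure, summary_out_edges):
    """Generate the children of `structure` by adding a single edge.

    structure: (labels, edges), where labels is a tuple of vertex labels and
               edges is a frozenset of (src_index, dst_index, edge_label).
    summary_out_edges: dict mapping a vertex label to the
               (edge_label, destination_label) pairs recorded in the summary
               (the addable edges for that label)."""
    labels, edges = structure
    children = []
    for s, s_label in enumerate(labels):
        for edge_label, dst_label in summary_out_edges.get(s_label, ()):
            same_label = [x for x, lbl in enumerate(labels) if lbl == dst_label]
            # Connect s to each existing like-labelled vertex it is not yet
            # linked to by an edge_label edge.
            already_linked = False
            for x in same_label:
                if (s, x, edge_label) in edges:
                    already_linked = True
                else:
                    children.append((labels, edges | {(s, x, edge_label)}))
            # Add a brand-new vertex when no like-labelled vertex exists, or
            # when s has no edge_label edge to any of them.
            if not same_label or not already_linked:
                new_index = len(labels)
                children.append((labels + (dst_label,),
                                 edges | {(s, new_index, edge_label)}))
    return children

# Reproducing the author example above (hypothetical summary contents):
summary_out = {"author": [("child", "book"), ("idref", "book"),
                          ("child", "name"), ("child", "paper")]}
single_author = (("author",), frozenset())
print(len(expand(single_author, summary_out)))   # 4 children, one per addable edge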
2.3 Support Counting
Once the user is satisfied with the structures discovered in the candidate generation phase, she may be interested in finalizing the frequent structure list and getting the exact support of the structures. (Recall that the candidate generation phase provides only a quick, approximate support for each structure, based on the in-memory summary.) This task is performed in the support counting phase, which we describe here.

Let us define the size of a structure to be the number of nodes and edges it contains; we refer to a structure of size k as a k-structure. From the method used for generating candidates (Section 2.2), it follows that for every k-structure S in the candidate list there exists a structure Sp of size k − 1 or k − 2 in the candidate list such that Sp is a subgraph of S. We refer to Sp as the parent of S in this context. Clearly, every instance I of S has a subgraph I′ that is an instance of Sp. Further, I′ differs from I only in having one fewer edge and, optionally, one fewer vertex. We use these properties in the support counting process.

Determining the support of a 1-structure (single vertex) consists of simply counting the number of instances of a like-labeled vertex in the database. During the counting phase, we store not only the support of each structure (as it is determined), but also a set of pointers to that structure's instances on disk. To determine the support of a k-structure S for k > 1, we revisit the instances of its parent Sp using the saved pointers. For each such instance I′, we check whether there is a neighboring edge and, optionally, a node that, when added to I′, generates an instance I of S. If so, I is recorded as an instance of S. This operation of growing an instance I′ of Sp into an instance I of S is similar to the expand operation used in the candidate generation phase; however, there are two differences. First, in the counting phase we expand subgraphs of the database, whereas in the candidate generation phase we expand abstract structures without referring to the disk-resident data (using only the summary). Second, in the counting phase we need to find an edge or vertex in the database to be added to the instance that satisfies the
constraints imposed by the expansion that created the structure (e.g., the label of the edge), whereas in the candidate generation phase we add any possible edges and vertices to the structure.
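A sketch of this instance-growing step is given below, for the case where the child structure differs from its parent by one known edge (and possibly one new vertex). The in-memory adjacency and label dictionaries, and the representation of an instance as a mapping from structure vertices to data vertices, are assumptions of the sketch; the actual system stores instance pointers on disk and removes duplicate instances, which is not shown.

def grow_instances(parent_instances, new_edge, new_vertex_label,
                   data_adj, node_labels):
    """Extend stored instances of the parent structure by the single edge that
    distinguishes the child structure from its parent.

    parent_instances: list of dicts mapping parent structure vertices to
        data vertices.
    new_edge: (src_sv, dst_sv, edge_label) in the child structure's vertices.
    new_vertex_label: label of dst_sv if it is a vertex not in the parent,
        otherwise None.
    data_adj: dict mapping a data vertex to its outgoing (dst, edge_label) pairs.
    node_labels: dict mapping a data vertex to its label."""
    src_sv, dst_sv, edge_label = new_edge
    child_instances = []
    for inst in parent_instances:
        src_dv = inst[src_sv]
        for dst_dv, lbl in data_adj.get(src_dv, ()):
            if lbl != edge_label:
                continue
            if new_vertex_label is None:
                # The added edge connects two vertices already in the instance.
                if inst.get(dst_sv) == dst_dv:
                    child_instances.append(dict(inst))
            elif (node_labels[dst_dv] == new_vertex_label
                  and dst_dv not in inst.values()):
                # The added edge introduces a new, distinct data vertex of the
                # required label.
                grown = dict(inst)
                grown[dst_sv] = dst_dv
                child_instances.append(grown)
    return child_instances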
3 Experimental Evaluation
In order to evaluate the performance of our method we have performed a number of experiments. We have implemented SEuS using the Java 2 (J2SE) programming environment. For graph isomorphism tests, we have used the nauty package [McK02] to derive canonically labeled isomorphic graphs. Since we have two levels of summaries, we append "-Sd" to the system's name to indicate which level of summary has been used in a particular experiment (e.g., SEuS-S0 is the SEuS method using the level-0 summary). In the experiments described below, we have used a PC-class machine with a 900 MHz Intel Pentium III processor and one gigabyte of RAM, running the RedHat 7.1 distribution of GNU/Linux. Where possible, we have compared our results with those for SUBDUE version 4.3 (serial version), which is implemented in the C programming language. Due to space restrictions we are not able to present detailed experimental results here. Extensive results have been presented in [GC02].

Table 1 presents some characteristics of the 13 datasets we have used for our experiments.

Table 1. Datasets used in experiments

Name        Description                     Vertex labels  Edge labels  Summary size  Graph size
Credit-*    Credit card applications        59             20           136           3899-27800
Diabetes-*  Diabetes patient records        7              8            39            4556-8500
Vote        Congressional voting records    4              16           52            8811
Chemical    Chemical compounds              66             4            338           18506
Chess       Chess relational domain         7              12           88            189311
Medical-*   Medical publication citations   75             4            175           4M-10M
Figure 7 compares the running time of SEuS and SUBDUE on the 13 datasets of Table 1. Running times of SEuS using both levels of summaries are depicted here. It is important to notice that the SEuS versions run for a longer time than SUBDUE because they look for all frequent structures, whereas SUBDUE only returns the n most frequent structures (n = 5 in these experiments). The running times of SEuS increase monotonically as the size of the datasets increases. The irregularities in the running time of SUBDUE are due to the fact that, besides the size of a dataset, factors such as the number of vertex and edge labels have a significant effect on the performance of SUBDUE. Referring to Table 1, it is clear that the Credit datasets have many more labels than the Diabetes datasets. Although the Credit-1 and Credit-2 datasets are smaller than the Diabetes datasets, it takes SUBDUE longer to mine them because it tries to expand the subgraphs by all possible edges at each iteration. SUBDUE then decides which isomorphism class is better by considering the number of subgraphs in them and the
size of the subgraphs. (In SUBDUE the sets of isomorphic subgraphs are manipulated as bags of subgraphs.) When there is a large number of different vertex or edge labels, there will be a larger number of subgraphs to choose between, and since SUBDUE accesses the database for each subgraph, the running time increases considerably. The number of edge or vertex labels affects SEuS in a similar way, but since we do not access the main database to find the support of a structure (we use the summary instead), this number does not significantly affect our running time.

SEuS has a summary-generation phase that SUBDUE does not perform. In small datasets this additional effort is comparable to the overall running time. Also note that the running time of SEuS increases if we use the level-1 summary instead of the level-0 summary. This increase in running time is mainly due to the overhead of creating a richer summary. This additional effort results in more accurate estimates (lower overestimation, which yields less wasted time in the counting phase). Finally, we are comparing a Java implementation of SEuS with the C implementation of SUBDUE. While the difference in efficiency of these programming environments is not significant for large datasets, it is a factor for the smaller ones.
Fig. 7. Running time
As the datasets grow, the running time of SUBDUE grows very quickly, while SEuS does not show such a sharp increase. With our experimental setup, we were unable to obtain any results from SUBDUE for datasets larger than 3 MB (after running for 24 hours). For this reason, Figure 7 presents the running time of only the SEuS method for the large datasets. To the best of our knowledge, other complete structure discovery methods cannot handle datasets with sizes comparable to those we have used here. The AGM and FSG methods, presented in [IWM00,KK01], take eight days and 600 seconds, respectively, to process the Chemical dataset, for which SEuS needs only 20 seconds [KK01].
(Unfortunately, we were unable to obtain the FSG system to perform a more detailed comparison.) One should note that for very small thresholds, these methods may perform better: with such thresholds a large number of structures are frequent, so our summary provides little pruning while still introducing the overhead of its construction.

As discussed in Section 1, the SEuS system provides real-time feedback to the user by quickly displaying the frequent structures resulting from different choices of the threshold parameter. This interactive feedback is possible because the time spent in the candidate generation (search) phase is very small. Figure 8 justifies this claim. It depicts the percentage of time used by each of the three phases in processing different datasets. As datasets get larger, the fraction of running time spent on summarizing the graph falls rapidly. Also, the time spent in the candidate generation phase is relatively small. Therefore, our strategy of creating the summary once and running the candidate generation phase multiple times with different input parameters (in order to determine suitable values before proceeding to the expensive counting phase) is very effective.
[Figure 8 appears here: stacked bars showing the percentage of running time spent in the summarize, search, and count phases for each dataset, ordered by increasing size.]

Fig. 8. Time spent in algorithm phases
4 Related Work
Much of the prior work on structure discovery is domain dependent (e.g., [Win75,Lev84, Fis87,Leb87,GLF89,CG92]) and a detailed comparison of these methods appears in [Con94]. We consider only domain independent methods in this paper. The first such system, CLIP, discovers patterns in graphs by expanding and combining patterns discovered in previous iterations [YMI93]. To guide the search, CLIP uses an estimate of the compression resulting from an efficient representation of repetitions of a candidate structure. The estimate is based on a linear-time approximation for graph isomorphism. SUBDUE [CH00] also performs structure discovery on graphs. It uses the minimum description length principle to guide its beam search. SUBDUE uses an inexact graph matching algorithm during the process to find similar structures. SUBDUE discovers structures differently from CLIP. First, SUBDUE produces only single structures evaluated using minimum description length, whereas CLIP produces a set of structures that collectively compress the input graph. CLIP has the ability to grow structures using the merge operator between two previously found structures,
while SUBDUE only expands structures one edge at a time. Our system is similar to SUBDUE with respect to structure expansion. Second, CLIP estimates the compression resulting from using a structure, but SUBDUE performs an expensive exact measurement of compression for each new structure. This expensive task causes the SUBDUE system to be very slow when operating on large databases.

AGM [IWM00] is an Apriori-based algorithm for mining frequent structures which are induced subgraphs of the input graph. The main idea is similar to that used by the market basket analysis algorithm in [AS94]: a (k + 1)-itemset is a candidate frequent itemset only if all of its k-item subsets are frequent. In AGM, a graph of size k + 1 is considered to be a candidate frequent structure only if all its subgraphs of size k are frequent. AGM only considers the induced subgraphs to be candidate frequent structures. (Given a graph G, a subgraph Gs is called an induced subgraph if V(Gs) ⊂ V(G), E(Gs) ⊂ E(G), and ∀u, v ∈ V(Gs), (u, v) ∈ E(Gs) ⇔ (u, v) ∈ E(G).) This restriction reduces the size of the search space, but also means that interesting structures that are not induced subgraphs cannot be detected by AGM. After producing the next generation of candidate frequent structures, AGM counts the frequency of each candidate by scanning the database. As in SUBDUE, this need for a database scan at each generation limits the scalability of this method.

FSG [KK01] is another system that finds all connected subgraphs that appear frequently in a large graph database. Similar to AGM, this system uses the level-by-level expansion adopted in Apriori. The key features of FSG compared to AGM are the following: (1) it uses a sparse graph representation which minimizes storage and computation, (2) there is no restriction on the structures' topology (e.g., the induced-subgraph restriction) other than their connectivity, and (3) it incorporates a number of optimizations for candidate generation and counting which make it more scalable (e.g., transaction ID lists for counting). However, this system still scans the database in order to find the support of the next generation of structures. The experimental results in [KK01] show that FSG is considerably faster than AGM. One should note that AGM and FSG both operate on a transaction database where each transaction is a graph, so that their definition of a frequent structure's support can be applicable. In SEuS we do not have this restriction, and SEuS can be applied to both a transaction database and a large connected graph database. As mentioned in Section 3, for a common Chemical dataset, FSG needs 600 seconds, where SEuS returned the frequent structures in less than 20 seconds.

Asai et al. [AAK+02] propose the FREQT algorithm for discovering frequent structures in semistructured data. FREQT models semistructured data and the frequent structures using labeled ordered trees. The key contribution of this work is the notion of the rightmost expansion, a technique to grow a tree by attaching new nodes only to the rightmost branch of the tree. The authors show that it is sufficient to maintain only the instances of the rightmost leaf to efficiently implement incremental computation of structure frequency. Limiting the search space to ordered trees allows the method to scale almost linearly in the total size of the maximal trees contained in the collection.

In [CYLW02], the authors propose another method for frequent structure discovery in semistructured collections.
In this work, the dataset is a collection of semistructured objects treated as transactions, similar to the FSG method. Motivated by path expressions over semistructured data, the authors represent the objects and patterns as a set
of labeled paths which can include wildcards. After introducing the notion of "weaker than" for comparing a structure's path set with a transaction object, the algorithm discovers the set of all patterns that have a frequency higher than a given threshold. The authors argue that the method is motivated by, and well suited for, collections consisting of similarly structured objects with minor differences. The problem of finding frequent structures is related to the problem of finding implicit structure (or approximate typing) in semistructured databases [NAM97,NAM98]. In type inference, the structures are typically limited to rooted trees and each structure must have a depth of one. Further, the frequency of a structure is not the only metric used in type inference. For instance, a type that occurs infrequently may be important if its occurrences have a very regular structure. Despite these differences, it may be interesting to investigate the possibility of adapting methods from one problem to the other.
5
Conclusion
In this paper, we motivated the need for data mining methods for large semistructured datasets (modeled as labeled graphs with several million nodes and edges). We focused on an important building block for such data mining methods: the task of finding frequent structures, i.e., structures that are isomorphic to a large number of subgraphs of the input graph. We presented the SEuS method, which finds frequent structures efficiently by using a structural summary to estimate structure support. Our method has two main distinguishing features. First, due to its use of a summary data structure, it can operate on datasets that are two to three orders of magnitude larger than those used by prior work. Second, it provides rapid feedback (a delay of only a few seconds) in the form of candidate structures, thus permitting its use in an interactive data exploration system. As ongoing work, we are exploring the application of our method to finding association rules and other correlations in semistructured data. We are also applying our method to the problems of classification and clustering by using frequent structures to build a predictive model.
References
[AAK+02] Tatsuya Asai, Kenji Abe, Shinji Kawasoe, et al. Efficient substructure discovery from large semi-structured data. In Proc. of the Second SIAM International Conference on Data Mining, 2002.
[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. SIGMOD Record, 22(2):207–216, June 1993.
[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases, pages 487–499. Morgan Kaufmann, 1994.
[BDFS97] P. Buneman, S. B. Davidson, M. F. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proc. of the 6th International Conference on Database Theory, 1997.
[CG92] D. Conklin and J. Glasgow. Spatial analogy and subsumption. In Proc. of the Ninth International Conference on Machine Learning, pages 111–116, 1992.
[CH00] D. J. Cook and L. B. Holder. Graph-based data mining. ISTA: Intelligent Systems & their applications, 15, 2000.
[Con94] D. Conklin. Structured concept discovery: Theory and methods. Technical Report 94-366, Queen's University, 1994.
[CYLW02] Gao Cong, Lan Yi, Bing Liu, and Ke Wang. Discovering frequent substructures from hierarchical semi-structured data. In Proc. of the Second SIAM International Conference on Data Mining, 2002.
[Fis87] D. H. Fisher, Jr. Knowledge acquisition via incremental conceptual clustering. Machine Learning, (2):139–172, 1987.
[For96] S. Fortin. The graph isomorphism problem. Technical Report 96-20, University of Alberta, 1996.
[GC02] Shayan Ghazizadeh and Sudarshan Chawathe. Discovering frequent structures using summaries. Technical report, University of Maryland, Computer Science Department, 2002.
[GLF89] J. H. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation. Artificial Intelligence, (40):11–61, 1989.
[GW97] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. of the Twenty-Third International Conference on Very Large Data Bases, pages 436–445, 1997.
[IWM00] A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 13–23, 2000.
[KK01] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of the 1st IEEE Conference on Data Mining, 2001.
[Leb87] M. Lebowitz. Experiments with incremental concept formation: UNIMEM. Machine Learning, (2):103–138, 1987.
[Lev84] R. Levinson. A self-organizing retrieval system for graphs. In Proc. of the National Conference on Artificial Intelligence, pages 203–206, 1984.
[McK02] B. D. McKay. nauty user's guide (version 1.5), 2002.
[NAM97] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. In Proc. of the Workshop on Management of Semistructured Data, 1997.
[NAM98] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting schema from semistructured data. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 295–306, 1998.
[NUWC97] S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative objects: Concise representations of semistructured, hierarchical data. In Proc. of the International Conference on Data Engineering, pages 79–90, 1997.
[Win75] P. H. Winston. Learning structural descriptions from examples. In The Psychology of Computer Vision, pages 157–209, 1975.
[YMI93] K. Yoshida, H. Motoda, and N. Indurkhya. Unifying learning methods by colored digraphs. In Proc. of the International Workshop on Algorithmic Learning Theory, volume 744, pages 342–355, 1993.
Discovering Best Variable-Length-Don’t-Care Patterns Shunsuke Inenaga1 , Hideo Bannai3 , Ayumi Shinohara1,2 , Masayuki Takeda1,2 , and Setsuo Arikawa1 1
Department of Informatics, Kyushu University 33, Fukuoka 812-8581, Japan 2 PRESTO, Japan Science and Technology Corporation (JST) {s-ine, ayumi, takeda, arikawa}@i.kyushu-u.ac.jp 3 Human Genome Center, University of Tokyo, Tokyo 108-8639, Japan [email protected]
Abstract. A variable-length-don’t-care pattern (VLDC pattern) is an element of the set Π = (Σ ∪ {⋆})∗, where Σ is an alphabet and ⋆ is a wildcard matching any string in Σ∗. Given two sets of strings, we consider the problem of finding the VLDC pattern that is the most common to one, and the least common to the other. We present a practical algorithm to find such best VLDC patterns exactly, powerfully sped up by pruning heuristics. We introduce two versions of our algorithm: one employs a pattern matching machine (PMM) whereas the other employs an index structure called the Wildcard Directed Acyclic Word Graph (WDAWG). In addition, we consider a more generalized problem of finding the best pair q, k, where k is the window size that specifies the length of an occurrence of the VLDC pattern q matching a string w. We present three algorithms solving this problem with pruning heuristics, using dynamic programming (DP), PMMs and WDAWGs, respectively. Although the two problems are NP-hard, we experimentally show that our algorithms run remarkably fast.
1
Introduction
A vast amount of data is available today, and discovering useful rules from those data is quite important. Very commonly, information is stored and manipulated as strings. In the context of strings, rules are patterns. Given two sets of strings, often referred to as positive examples and negative examples, it is desired to find the pattern that is the most common to the former and the least common to the latter. This is a critical task in Discovery Science as well as in Machine Learning. A string y is said to be a substring of a string w if there exist strings x, z ∈ Σ∗ such that w = xyz. Substring patterns are possibly the most basic patterns to be used for the separation of two sets S, T of strings. Hirao et al. [8] stated that such best substrings can be found in linear time by constructing the suffix tree for S ∪ T [12,21,7]. They also considered subsequence patterns as rules for separation. A subsequence pattern p is said to match a string w if p can be obtained by removing zero or more characters from w [2]. Despite the fact that finding
the best subsequence patterns to separate two given sets of strings is NP-hard, they proposed an algorithm to solve the problem with practically reasonable performance. More recently, an efficient algorithm to discover the best episode patterns was proposed in [9]. An episode pattern p, k, where p is a string and k is an integer, is said to match a string w if p is a subsequence of a substring u of w with |u| ≤ k [14,6,20]. The problem of finding the best episode patterns is also known to be NP-hard. In this paper, we focus on a pattern containing a wildcard that matches any string. The wildcard is called a variable length don’t care and is denoted by ⋆. A variable-length-don’t-care pattern (VLDC pattern) is an element of Π = (Σ ∪ {⋆})∗, and is also sometimes called a regular pattern as in [19]. When a, b ∈ Σ, ab⋆bb⋆ba is an example of a VLDC pattern and, for instance, matches the string abbbbaaaba with the first and second ⋆’s replaced by b and aaa, respectively. The language L(q) of a pattern q ∈ Π is the set of strings obtained by replacing ⋆’s in q with arbitrary strings. Namely, L(ab⋆bb⋆ba) = {abubbvba | u, v ∈ Σ∗}. The class of this language corresponds to a class of the pattern languages proposed by Angluin [1]. VLDC patterns are a generalization of substring patterns and subsequence patterns. For instance, consider a pattern string abc ∈ Σ∗. The substring matching problem corresponding to this pattern is given by the VLDC pattern ⋆abc⋆. Also, the VLDC pattern ⋆a⋆b⋆c⋆ leads to the subsequence pattern matching problem. This paper is devoted to introducing a practical algorithm to discover the best VLDC pattern to distinguish two given sets S, T of strings. To speed up the algorithm, firstly we restrict the search space by means of pruning heuristics inspired by Morishita and Sese [16]. Secondly, we accelerate the matching phase of the algorithm in two ways, as follows: In [11], we introduced an index structure called the Wildcard Directed Acyclic Word Graph (WDAWG). The WDAWG for a text string w recognizes all possible VLDC patterns matching w, and thus enables us to examine whether a given VLDC pattern q matches w in O(|q|) time. More recently, a space-economical version of its construction algorithm was presented in [10]. We use WDAWGs for quick matching of VLDC patterns. Another approach is to preprocess a given VLDC pattern q, building a DFA accepting L(q). We use it as a pattern matching machine (PMM) which runs over a text string w and determines whether or not q matches w in O(|w|) time. We furthermore propose a generalization of the VLDC pattern matching problem. That is, we introduce an integer k called the window size, which specifies the length of an occurrence of a VLDC pattern that matches w ∈ Σ∗. The introduction of k leads to a generalization of episode patterns as well. Specifying the length of an occurrence of a VLDC pattern is of great significance especially when classifying long strings over a small alphabet, since a short VLDC pattern surely matches most long strings. Therefore, for example, when two sets of biological sequences are given to be separated, this approach is adequate and promising. A pruning heuristic to speed up our algorithm for finding the best pair q, k is also presented. We propose three approaches for computing the best pair, using dynamic programming (DP), PMMs, and WDAWGs, respectively.
We emphasize that this work generalizes and outperforms the work accomplished in [8,9], since it is capable of discovering more advanced and useful patterns. In fact, we show experimental results that demonstrate the accuracy of our algorithms as well as their fast performance. Moreover, we are now incorporating our algorithms into the core of the decision tree generator in BONSAI [17], a powerful machine discovery system. Here we only give the basic ideas of our pruning heuristics, which are rather straightforward extensions of those developed in our previous work [8,9]. Interested readers are invited to refer to our survey report [18].
2
Finding the Best Patterns to Separate Sets of Strings
2.1
Notation
Let N be the set of integers. Let Σ be a finite alphabet. An element of Σ∗ is called a string. The length of a string w is denoted by |w|. The empty string is denoted by ε, that is, |ε| = 0. Strings x, y, and z are said to be a prefix, substring, and suffix of string w = xyz, respectively. The substring of a string w that begins at position i and ends at position j is denoted by w[i : j] for 1 ≤ i ≤ j ≤ |w|. For convenience, let w[i : j] = ε for j < i. The reversal of a string w is denoted by wR, that is, wR = w[n]w[n − 1] . . . w[1] where n = |w|. For a set S ⊆ Σ∗ of strings, the number of strings in S is denoted by |S| and the total length of strings in S is denoted by ‖S‖. Let Π = (Σ ∪ {⋆})∗, where ⋆ is a variable length don’t care matching any string in Σ∗. An element q ∈ Π is a variable-length-don’t-care pattern (VLDC pattern). For example, ⋆a⋆ab⋆ba⋆ is a VLDC pattern with a, b ∈ Σ. We say a VLDC pattern q matches a string w if w can be obtained by replacing ⋆’s in q with some strings. In the running example, the VLDC pattern ⋆a⋆ab⋆ba⋆ matches the string abababbbaa with the ⋆’s replaced by ab, b, b and a, respectively. For any q ∈ Π, |q| denotes the sum of the numbers of characters and ⋆’s in q. 2.2
Finding the Best VLDC Patterns
We write q ⪯ u if u can be obtained by replacing ⋆’s in q with arbitrary elements in Π. Definition 1. For a VLDC pattern q ∈ Π, we define L(q) by L(q) = {w ∈ Σ∗ | q ⪯ w}. According to the above definition, we have the following lemma. Lemma 1. For any q, u ∈ Π, if q ⪯ u, then L(q) ⊇ L(u). Let good be a function from Σ∗ × 2^{Σ∗} × 2^{Σ∗} to the set of real numbers. In what follows, we formulate the problem to solve. Definition 2 (Finding the best VLDC pattern according to good). Input: Two sets S, T ⊆ Σ∗ of strings. Output: A VLDC pattern q ∈ Π that maximizes the score of good(q, S, T).
Intuitively, the score of good(q, S, T) expresses the “goodness” of q in the sense of distinguishing S from T. The definition of good varies with the application. For example, the χ2 value, entropy information gain, and Gini index can be used. Essentially, these statistical measures are defined by the numbers of strings that satisfy the rule specified by q. Any of the above-mentioned measures can be expressed in the following form: good(q, S, T) = f(xq, yq, |S|, |T|), where xq = |S ∩ L(q)| and yq = |T ∩ L(q)|. When S and T are fixed, |S| and |T| are regarded as constants. On this assumption, we abbreviate the notation of the function to f(x, y) in the sequel. We say that a function f from [0, xmax] × [0, ymax] to real numbers is conic if
– for any 0 ≤ y ≤ ymax, there exists an x1 such that
  • f(x, y) ≥ f(x′, y) for any 0 ≤ x < x′ ≤ x1, and
  • f(x, y) ≤ f(x′, y) for any x1 ≤ x < x′ ≤ xmax;
– for any 0 ≤ x ≤ xmax, there exists a y1 such that
  • f(x, y) ≥ f(x, y′) for any 0 ≤ y < y′ ≤ y1, and
  • f(x, y) ≤ f(x, y′) for any y1 ≤ y < y′ ≤ ymax.
In the sequel, we assume that f is conic and can be evaluated in constant time. The optimization problem to be tackled follows. Definition 3 (Finding the best VLDC pattern according to f). Input: Two sets S, T ⊆ Σ∗ of strings. Output: A VLDC pattern q ∈ Π that maximizes the score of f(xq, yq), where xq = |S ∩ L(q)| and yq = |T ∩ L(q)|. The problem is known to be NP-hard [15], and thus we essentially have exponentially many candidates. Therefore, we reduce the number of candidates by using the pruning heuristic inspired by Morishita and Sese [16]. The following lemma derives from the conicality of the function f. Lemma 2 ([8]). For any 0 ≤ x < x′ ≤ xmax and 0 ≤ y < y′ ≤ ymax, we have f(x, y) ≤ max{f(x′, y′), f(x′, 0), f(0, y′), f(0, 0)}. By Lemma 1 and Lemma 2, we have the next lemma, based on which we can perform the pruning heuristic to speed up our algorithm. Lemma 3. For any two VLDC patterns q, u ∈ Π, if q ⪯ u, then f(xu, yu) ≤ max{f(xq, yq), f(xq, 0), f(0, yq), f(0, 0)}.
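As a concrete illustration (not taken from the paper itself), the following Python sketch spells out one possible conic score, the entropy information gain used in the experiments of Section 5, together with the pruning bound on the right-hand side of Lemma 3. The function names and the exact form of the gain computation are our own assumptions; only the Lemma 3 bound is transcribed from the text.

```python
import math

def entropy(p):
    """Binary entropy of a proportion p (0 <= p <= 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(x, y, s_size, t_size):
    """Entropy information gain f(x, y), where x = |S ∩ L(q)| and
    y = |T ∩ L(q)|: split S ∪ T into the strings matched by q and the rest."""
    n = s_size + t_size
    matched, unmatched = x + y, n - (x + y)
    before = entropy(s_size / n)
    after = 0.0
    if matched:
        after += (matched / n) * entropy(x / matched)
    if unmatched:
        after += (unmatched / n) * entropy((s_size - x) / unmatched)
    return before - after

def upper_bound(xq, yq, s_size, t_size):
    """Pruning bound of Lemma 3: no refinement u of q (q ⪯ u) can score
    higher than the maximum of these four corner values."""
    return max(info_gain(xq, yq, s_size, t_size),
               info_gain(xq, 0, s_size, t_size),
               info_gain(0, yq, s_size, t_size),
               info_gain(0, 0, s_size, t_size))
```

During the search, a branch rooted at q can be discarded whenever upper_bound(xq, yq, |S|, |T|) falls below the best score found so far.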
2.3
Finding the Best VLDC Patterns within a Window
We here consider a natural extension of the problem mentioned previously. We introduce an integer k called the window size. Let q ∈ Π and let q[i], q[j] be the first and last characters in q that are not ⋆, respectively, where 1 ≤ i ≤ j ≤ |q|. If q matches w ∈ Σ∗, let w[i′], w[j′] be the characters to which q[i] and q[j] can
correspond, respectively, where 1 ≤ i′ ≤ j′ ≤ |w|. (Note that we might have more than one combination of i′ and j′.) If there exists a pair i′, j′ satisfying j′ − i′ < k, we say that q occurs in w within a window of size k. Then the pair q, k is said to match the string w. Definition 4. For a pair q, k with q ∈ Π and k ∈ N, we define L(q, k) by L(q, k) = {w ∈ Σ∗ | q, k matches w}. According to the above definition, we have the following lemma. Lemma 4. For any q, k and p, j with q, p ∈ Π and k, j ∈ N, if q ⪯ p and j ≤ k, then L(q, k) ⊇ L(p, j). The problem to be tackled is formalized as follows. Definition 5 (Finding the best VLDC pattern and window size according to f). Input: Two sets S, T ⊆ Σ∗ of strings. Output: A pair q, k with q ∈ Π and k ∈ N that maximizes the score of f(xq,k, yq,k), where xq,k = |S ∩ L(q, k)| and yq,k = |T ∩ L(q, k)|. We stress that the value of k is not given beforehand, i.e., we compute not only q but also the k with which the score of the function f is maximum. Therefore, the search space of this problem is Π × N, while that of the problem in Definition 3 is Π. We remark that this problem is also NP-hard. By Lemma 4 and Lemma 2, we obtain the following lemma, which plays a key role in the heuristic used to prune the search tree. Lemma 5. For any q, k and p, j with q, p ∈ Π and k, j ∈ N, if q ⪯ p and j ≤ k, then f(xp,j, yp,j) ≤ max{f(xq,k, yq,k), f(xq,k, 0), f(0, yq,k), f(0, 0)}.
3
Efficient Match of VLDC Patterns
Definition 6 (Counting the matched VLDC patterns). Input: A set S ⊆ Σ∗ of strings. Query: A VLDC pattern q ∈ Π. Answer: The cardinality of the set S ∩ L(q). This is a sub-problem of the one given in Definition 3. It must be answered as fast as possible, since we are given a great many VLDC patterns as queries. Here, we utilize two practical methods which allow us to answer the problem quickly. 3.1
Using a DFA for a VLDC Pattern
Our first idea is to use a deterministic finite-state automaton (DFA) for a pattern. Given a VLDC pattern q ∈ Π, we construct a DFA that accepts L(q) and use it as a pattern matching machine (PMM) which runs over text strings in S. For any q ∈ Π, a DFA can be constructed in O(|q|) time. Lemma 6. Let S ⊆ Σ∗ and q ∈ Π. Then |S ∩ L(q)| can be computed in O(|q|) preprocessing time and in O(‖S‖) running time.
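For intuition only (this is not the authors' PMM construction), a VLDC pattern can also be checked against strings by translating the wildcard ⋆ into the regular-expression wildcard ".*". The sketch below, with the character "*" standing in for ⋆, counts |S ∩ L(q)| in this way; note that a backtracking regex engine does not provide the O(|q|)-preprocessing and O(‖S‖)-matching guarantee of Lemma 6.

```python
import re

def vldc_to_regex(q, star="*"):
    """Translate a VLDC pattern q (with `star` playing the role of the
    wildcard) into an anchored regular expression for L(q)."""
    return "^" + ".*".join(re.escape(piece) for piece in q.split(star)) + "$"

def count_matching(S, q, star="*"):
    """Return |S ∩ L(q)| by running the compiled pattern over every string."""
    prog = re.compile(vldc_to_regex(q, star))
    return sum(1 for w in S if prog.match(w))

# Example: 'ab*bb*ba' matches 'abbbbaaaba' (first * -> 'b', second * -> 'aaa').
assert count_matching(["abbbbaaaba", "ababab"], "ab*bb*ba") == 1
```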
Fig. 1. WDAWG(w) where w = abbab.
3.2
Using Wildcard Directed Acyclic Word Graphs
The second approach is to use an index structure for a text string w ∈ S that recognizes all VLDC patterns matching w. The Directed Acyclic Word Graph (DAWG) is a classical, textbook index structure [5], invented by Blumer et al. in [3]. The DAWG of a string w ∈ Σ∗ is denoted by DAWG(w), and is known to be the smallest deterministic automaton that recognizes all suffixes of w [4]. By means of DAWG(w), we can examine whether or not a given pattern p ∈ Σ∗ is a substring of w in O(|p|) time. Recently, we introduced Minimum All-Suffixes Directed Acyclic Word Graphs (MASDAWGs) [11]. The MASDAWG of a string w ∈ Σ∗, which is denoted by MASDAWG(w), is the minimization of the collection of the DAWGs for all suffixes of w. More precisely, MASDAWG(w) is the smallest automaton with |w| + 1 initial nodes, in which the directed acyclic graph induced by all nodes reachable from the k-th initial node conforms with the DAWG of the k-th suffix of w. Several important applications of MASDAWGs were given in [11], one of which corresponds to a significantly time-efficient solution to the VLDC pattern matching problem. Namely, a variant of MASDAWG(w), called the Wildcard DAWG (WDAWG) of w and denoted by WDAWG(w), was introduced in [11]. WDAWG(w) is the smallest automaton that accepts all VLDC patterns matching w. WDAWG(w) with w = abbab is displayed in Fig. 1. Theorem 1. When |Σ| ≥ 2, the number of nodes of WDAWG(w) for a string w is Θ(|w|²). It is Θ(|w|) for a unary alphabet. Theorem 2. For any string w ∈ Σ∗, WDAWG(w) can be constructed in time linear in its size. For all strings in S ⊆ Σ∗, we construct WDAWGs. Then we obtain the following lemma that is a counterpart of Lemma 6. Lemma 7. Let S ⊆ Σ∗ and q ∈ Π. Let N = Σ_{w∈S} |w|². Then |S ∩ L(q)| can be computed in O(N) preprocessing time and in O(|q|·|S|) running time.
In spite of the quadratic space requirement of WDAWGs, it is meaningful to construct them for the following reasons. Assume that, for a string w in S, a VLDC pattern q has been recognized by WDAWG(w). We then memorize the node at which q was accepted. This allows a rapid search for any VLDC pattern qr with r ∈ Π, since we only need to follow |r| more transitions from the memorized node. Therefore, WDAWGs are especially useful in our situation. Moreover, WDAWGs are also helpful for pruning the search tree. Once we know, by using the WDAWGs, that a VLDC pattern q does not match any string in S, we need not consider any u ∈ Π such that q ⪯ u.
4
How to Compute the Best Window Size
Definition 7 (Computing the best window size according to f). Input: Two sets S, T ⊆ Σ∗ of strings and a VLDC pattern q ∈ Π. Output: An integer k ∈ N that maximizes the score of f(xq,k, yq,k), where xq,k = |S ∩ L(q, k)| and yq,k = |T ∩ L(q, k)|. This is a sub-problem of the one in Definition 5, in which the VLDC pattern is given beforehand. Let ℓ be the length of the longest string in S ∪ T. A short consideration reveals that, as candidates for k, we only have to consider the values from |q| up to ℓ, which results in a rather straightforward solution. In addition, we give a more efficient computation method, whose basic principle originates in [9]. For a string u ∈ Σ∗ and a VLDC pattern q ∈ Π, we define the threshold value θ of q for u by θu,q = min{k ∈ N | u ∈ L(q, k)}. If there is no such value, let θu,q = ∞. Note that u ∉ L(q, k) for any k < θu,q and u ∈ L(q, k) for any k ≥ θu,q. The set of threshold values for q ∈ Π with respect to S ⊆ Σ∗ is defined as ΘS,q = {θu,q | u ∈ S}. A key observation is that the best window size for given S, T ⊆ Σ∗ and a VLDC pattern q ∈ Π can be found in the set ΘS,q ∪ ΘT,q without loss of generality. Thus we can restrict the search space for the best window size to ΘS,q ∪ ΘT,q. It is therefore important to quickly solve the following sub-problem. Definition 8 (Computing the minimum window size). Input: A string w ∈ Σ∗ and a VLDC pattern q ∈ Π. Output: The threshold value θw,q. We here show our three approaches to efficiently solve the above sub-problem. The first is to adopt the standard dynamic programming method. For a string w ∈ Σ∗ with |w| = n and a pattern q ∈ Π with |q| = m, let di,j be the length of the shortest suffix of w[1 : j] that q[1 : i] matches, where 0 ≤ i ≤ m and 0 ≤ j ≤ n. We can compute all di,j's in O(mn) time, based on the following recurrences:
d0,0 = 0,
d0,j = 0 if q[1] = ⋆, and ∞ otherwise, for j ≥ 1,
di,0 = di−1,0 if q[i] = ⋆, and ∞ otherwise, for i ≥ 1, and
di,j = min{di−1,j−1 + 1, di,j−1 + 1, di−1,j} if q[i] = ⋆,
di,j = di−1,j−1 + 1 if q[i] = w[j],
di,j = ∞ otherwise,
for i ≥ 1 and j ≥ 1. Then θw,q = min1≤j≤n {dm,j} if q[m] = ⋆, and θw,q = dm,n otherwise. Remark that if the row dm,j (1 ≤ j ≤ n) is memorized, it will save computation time for any pattern qr with r ∈ Π. The second approach is to preprocess a given VLDC pattern q ∈ Π. We construct a DFA accepting L(q) and another DFA for L(qR), and utilize them as PMMs running over a given string w ∈ Σ∗. If q[1] ≠ ⋆ (q[m] ≠ ⋆, respectively), we have only to compute the shortest prefix (suffix, respectively) of w that q matches and return its length. We now consider the case q[1] = q[m] = ⋆. Firstly, we run the DFA for L(q) over w. Suppose that q is recognized between positions i and j in w, where 1 ≤ i < j ≤ |w| and j − i > |q|. A delicate point is that it is unsure whether w[i : j] corresponds to the shortest occurrence of q ending at position j. How can we find the shortest one? It can be found by running the DFA for L(qR) backward over w from position j. Assume that qR is recognized at position k, where i ≤ k < j − |q|. Then w[k : j] corresponds to the shortest occurrence of q ending at position j. After that, we resume the running of the DFA for L(q) from position k + 1, and continue the above procedure until encountering position |w|. The pair of positions with the shortest distance gives the threshold value θw,q. This method is feasible in O(m) preprocessing time and in O(mn) running time, where m = |q| and n = |w|. The third approach is to preprocess a text string w ∈ Σ∗, i.e., we construct WDAWG(w) and WDAWG(wR). For any w ∈ Σ∗, each and every node of WDAWG(w) can be associated with a position in w [11]. Thus we can perform a procedure similar to the second approach above, which enables us to find the threshold value θw,q. This approach takes O(n) preprocessing time and O(mn) running time, where m = |q| and n = |w|. As a result, we obtain the following: Lemma 8. Let w ∈ Σ∗ and q ∈ Π with |w| = n and |q| = m. The threshold value θw,q can be computed in O(mn) running time.
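The following Python sketch is our own transcription of the first approach, i.e., the dynamic-programming recurrences above, with the character "*" standing in for the wildcard ⋆; it omits the row-memoization optimization mentioned in the text and is not the authors' Objective Caml implementation.

```python
import math

STAR = "*"  # stands in for the variable-length wildcard

def threshold(w, q):
    """Threshold value theta_{w,q}: the smallest window size k such that the
    pair (q, k) matches w, or math.inf if q does not match w at all.
    d[i][j] is the length of the shortest suffix of w[1:j] that q[1:i]
    matches, following the recurrences above (1-based indices shifted)."""
    n, m = len(w), len(q)
    INF = math.inf
    if m == 0:
        return 0 if n == 0 else INF
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for j in range(1, n + 1):                       # d_{0,j}
        d[0][j] = 0 if q[0] == STAR else INF
    for i in range(1, m + 1):                       # d_{i,0}
        d[i][0] = d[i - 1][0] if q[i - 1] == STAR else INF
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if q[i - 1] == STAR:
                d[i][j] = min(d[i - 1][j - 1] + 1, d[i][j - 1] + 1, d[i - 1][j])
            elif q[i - 1] == w[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            # otherwise d[i][j] stays INF
    if n == 0:
        return d[m][0]
    if q[m - 1] == STAR:
        return min(d[m][j] for j in range(1, n + 1))
    return d[m][n]

# For instance, threshold("abbab", "*ab*") == 2, because the core "ab" spans
# a window of size 2, while threshold("abbab", "ba*") == math.inf.
```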
5
Computational Experiments
The algorithms were implemented in the Objective Caml language. All calculations were performed on a desktop PC with dual Xeon 2.2GHz CPUs (though our algorithms only utilize a single CPU) and 1GB of main memory, running Debian Linux. In all the experiments, the entropy information gain is used as the score for which the search is conducted.
[Figure 2: two plots of execution time (secs). Left panel, “Execution time for 100 positive/100 negative completely random data (maxlen = 8)”, plotted against the length of each string in the positive/negative set; right panel, “Execution time for completely random data of length 100 (maxlen = 8)”, plotted against the number of strings in each positive/negative set. Legend: Substring, VLDC: PMM, VLDC: WDAWG, VLDC: WDAWG-sm, VLDC in Window: PMM, VLDC in Window: DP-rm.]
Fig. 2. Execution time (in seconds) for artificial data for: different lengths of the examples (left) different number of examples in each positive/negative set (right). The maximum length of patterns to be searched for is set to 8. WDAWG-sm is matching using the WDAWG with state memoization. DP-rm is matching using the dynamic programming table with row memoization. Only one point is shown for DP-rm in the left graph, since a greater size caused memory swapping, and the computation was not likely to end in a reasonable amount of time.
5.1
Artificial Data
We first tested our algorithms on an artificial dataset. The datasets were created as follows: The alphabet was set to Σ = {a, b, c, d}. We then randomly generate strings over Σ of length l. We created 3 types of datasets: 1) a completely random set, 2) a set where a randomly chosen VLDC pattern ccd⋆a⋆ddad is embedded in the positive examples, and 3) a set where a pair of a VLDC pattern and a window size ccd⋆a⋆ddad, 19 is embedded in the positive examples. In 2) and 3), a randomly generated string is used as a positive example if the pattern matches it, and used as a negative example otherwise, until both positive and negative set sizes are n. Examples for which the set size exceeds n are discarded. Fig. 2 shows the execution times for different l and n, for the completely random dataset. We can see that the execution time grows linearly in n and l as expected, although the effect of pruning seems to take over for VLDC patterns in the left graph, when the length of each sequence is long. Searching for VLDC patterns and window sizes using dynamic programming with row memoization does not perform very well. Fig. 3 shows the execution times for different maximum lengths of VLDC patterns to look for, for the 3 datasets. (The length of a VLDC pattern is defined as the length of the pattern representation, excluding any ⋆’s on the ends.) We can see that the execution time grows exponentially as we increase the maximum pattern length searched for, until the pruning takes effect. The lower left graph in Fig. 3 compares the performance of an exhaustive search, run on the completely random dataset, with searches using the branch-and-bound pruning for the different datasets. The pruning is more effective when it is more likely to have a good solution.
Fig. 3. Execution time (in seconds) for artificial data for different maximum lengths of patterns to be searched for with: completely random data (upper left), VLDC and window size embedded data (upper right), VLDC embedded data (lower left). The lower right graph shows the effect of pruning of the search space for the different data sets, compared to exhaustive search on the completely random dataset.
5.2
Real Data
To show the usefulness of VLDC patterns and windows, we also tested our algorithms on actual protein sequences. We use the data available at http://www.cbs.dtu.dk/services/TargetP/, which consists of protein sequences that are known to contain protein sorting signals, that is, (in many cases) a short amino acid sequence segment that holds the information that enables the protein to be carried to specified compartments inside the cell. The dataset for plant proteins consisted of: 269 sequences with signal peptide (SP), 368 sequences with mitochondrial targeting peptide (mTP), 141 sequences with chloroplast transit peptide (cTP), and 162 “Other” sequences. The average length of the sequences was around 419, and the alphabet is the set of 20 amino acids. Using the signal peptides as positive examples, and all others as negative examples, we searched for the best pair p, k with a maximum length of 10 using PMMs. To limit the alphabet size, we classify the amino acids into 3 classes {0, 1, 2}, according to the hydropathy index [13]. The most hydrophobic amino acids {A, M, C, F, L, V, I} (hydropathy ≥ 0.0) are converted to 0, {P,Y,W,S,T,G} (−3.0 ≤ hydropathy < 0.0) to 1, and {R, K, D, E, N, Q, H}
(hydropathy < −3.0 ) to 2. We obtained the pair 0 00 00000 , 26, which occurs in 213/269 = 79.2% of the sequences with SP, and 26/671 = 3.9% of the other sequences. The calculation took exactly 50 minutes. This pattern can be interpreted as capturing the well known hydrophobic h-region of SP [22]. Also, the VLDC pattern suggests that the match occurs in the first 26 amino acid residues of the protein, which is natural since SP, mTP, cTP are known to be Nterminal sorting signals, that is, they are known to appear near the head of the protein sequence. A best substring search quickly finds the pattern 00000001 in 36 seconds, but only gives us a classifier that matches 152/269 = 56.51% of the SP sequences, and 41/671 = 6.11% of the others. For another example, we use the mTP set as positive examples, and the SP and Other sets as negative examples. This time, we convert the alphabet according to the net charge of the amino acid. Amino acids {D, E} (negative charge) are converted to 0, {K, R} (positive charge) to 1, and the rest {A, L, N, M, F, C, P, Q, S, T, G, W, H, Y, I, V} to 2. The calculation took about 21 minutes and we obtain the pair 2 1 1 2221 , 28 which occurs in 334/368 = 90.76% of the mTP sequences and (73/431 = 16.94%) of the SP and Other sequences. This pattern can also be regarded as capturing existing knowledge about mTPs [23]: They are fairly abundant in K or R, but do not contain much D or E. The pattern also suggests a periodic appearance of K or R, which is a characteristic of an amphiphilic α-helix that mTPs are reported to have. A best substring search finds pattern 212221 in 20 seconds, which gives us a classifier that matches 318/368 = 86.41% of sequences with mTP and 255/431 = 59.16% of the other sequences.
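The two alphabet reductions described above can be written down directly. The following sketch (ours, not from the paper) encodes the hydropathy-based and net-charge-based mappings as lookup tables and applies them to a sequence.

```python
# Reduced alphabets used in this section: hydropathy classes for the SP
# experiment and net-charge classes for the mTP experiment.
HYDROPATHY_CLASS = {**{a: "0" for a in "AMCFLVI"},   # hydropathy >= 0.0
                    **{a: "1" for a in "PYWSTG"},    # -3.0 <= hydropathy < 0.0
                    **{a: "2" for a in "RKDENQH"}}   # hydropathy < -3.0

CHARGE_CLASS = {**{a: "0" for a in "DE"},            # negative charge
                **{a: "1" for a in "KR"},            # positive charge
                **{a: "2" for a in "ALNMFCPQSTGWHYIV"}}  # the rest

def reduce_alphabet(sequence, mapping):
    """Map an amino acid sequence onto the reduced {0, 1, 2} alphabet."""
    return "".join(mapping[a] for a in sequence.upper())

# e.g. reduce_alphabet("MKLV", HYDROPATHY_CLASS) == "0200"
```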
References 1. D. Angluin. Finding patterns common to a set of strings. J. Comput. Sys. Sci., 21:46–62, 1980. 2. R. A. Baeza-Yates. Searching subsequences (note). Theoretical Computer Science, 78(2):363–376, Jan. 1991. 3. A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31–55, 1985. 4. M. Crochemore. Transducers and repetitions. Theoretical Computer Science, 45:63–86, 1986. 5. M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, New York, 1994. 6. G. Das, R. Fleischer, L. Gasieniec, D. Gunopulos, and J. K¨ arkk¨ ainen. Episode matching. In Proc. 8th Annual Symposium on Combinatorial Pattern Matching (CPM’97), volume 1264 of Lecture Notes in Computer Science, pages 12–27. Springer-Verlag, 1997. 7. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York, 1997. 8. M. Hirao, H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa. A practical algorithm to find the best subsequence patterns. In Proc. The Third International Conference on Discovery Science, volume 1967 of Lecture Notes in Artificial Intelligence, pages 141–154. Springer-Verlag, 2000.
9. M. Hirao, S. Inenaga, A. Shinohara, M. Takeda, and S. Arikawa. A practical algorithm to find the best episode patterns. In Proc. The Fourth International Conference on Discovery Science, volume 2226 of Lecture Notes in Artificial Intelligence, pages 435–440. Springer-Verlag, 2001. 10. S. Inenaga, A. Shinohara, M. Takeda, H. Bannai, and S. Arikawa. Space-economical construction of index structures for all suffixes of a string. In Proc. 27th International Symposium on Mathematical Foundations of Computer Science (MFCS’02), Lecture Notes in Computer Science. Springer-Verlag, 2002. To appear. 11. S. Inenaga, M. Takeda, A. Shinohara, H. Hoshino, and S. Arikawa. The minimum dawg for all suffixes of a string and its applications. In Proc. 13th Annual Symposium on Combinatorial Pattern Matching (CPM’02), volume 2373 of Lecture Notes in Computer Science, pages 153–167. Springer-Verlag, 2002. 12. S. R. Kosaraju. Fast pattern matching in trees. In Proc. 30th IEEE Symp. on Foundations of Computer Science, pages 178–183, 1989. 13. J. Kyte and R. Doolittle. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157:105–132, 1982. 14. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episode in sequences. In Proc. 1st International Conference on Knowledge Discovery and Data Mining, pages 210–215. AAAI Press, 1995. 15. S. Miyano, A. Shinohara, and T. Shinohara. Polynomial-time learning of elementary formal systems. New Generation Computing, 18:217–242, 2000. 16. S. Morishita and J. Sese. Traversing itemset lattices with statistical metric pruning. In Proc. of the 19th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 226–236. ACM Press, 2000. 17. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara, and S. Arikawa. Knowledge acquisition from amino acid sequences by machine learning system BONSAI. Transactions of Information Processing Society of Japan, 35(10):2009– 2018, 1994. 18. A. Shinohara, M. Takeda, S. Arikawa, M. Hirao, H. Hoshino, and S. Inenaga. Finding best patterns practically. In Progress in Discovery Science, volume 2281 of Lecture Notes in Artificial Intelligence, pages 307–317. Springer-Verlag, 2002. 19. T. Shinohara. Polynomial-time inference of pattern languages and its applications. In Proc. 7th IBM Symp. Math. Found. Comp. Sci., pages 191–209, 1982. 20. Z. Tron´ıˇcek. Episode matching. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM’01), volume 2089 of Lecture Notes in Computer Science, pages 143–146. Springer-Verlag, 2001. 21. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. 22. G. von Heijne. The signal peptide. J. Membr. Biol., 115:195–201, 1990. 23. G. von Heijne, J. Steppuhn, and R. G. Herrmann. Domain structure of mitochondrial and chloroplast targeting peptides. Eur. J. Biochem., 180:535–545, 1989.
A Study on the Effect of Class Distribution Using Cost-Sensitive Learning Kai Ming Ting Gippsland School of Computing and Information Technology, Monash University, Victoria 3842, Australia. [email protected]
Abstract. This paper investigates the effect of class distribution on the predictive performance of classification models using cost-sensitive learning, rather than the sampling approach employed previously by a similar study. The predictive performance is measured using the cost space representation, which is a dual to the ROC representation. This study shows that distributions which range between the natural distribution and the balanced distribution can also produce the best models, contrary to the finding of the previous study. In addition, we find that the best models are larger in size than those trained using the natural distribution. We also show two different ways to achieve the same effect of the corrected probability estimates proposed by the previous study.
1
Introduction
Since the recent revelation that the best performing model is not obtained from training data whose class distribution matches that of the test data, there has been growing interest in investigating the effect of class distribution on classifier performance. One such recent study was conducted by Weiss & Provost [11] using a sampling approach. Given a test set of a fixed class distribution, the study investigates the conditions under which the best classifiers can be trained from different class distributions. This study has the same aim as the previous study but differs in two important aspects. First, we use cost-sensitive learning instead of the sampling approach. Though the two methods are conceptually equivalent, cost-sensitive learning maintains the same training set while changing the class distribution, whereas the sampling approach actually adds or removes training examples in each class in order to maintain the same training size. As a result, each class distribution does not have exactly the same content as the others, and a substantial amount of available (majority class) data is not used in the highly skewed distribution data sets. None of these problems appear in our approach. We identify these problems as the limiting factor in the previous finding (described later in Section 3.1). Furthermore, the task of determining the best training class distribution(s) also becomes simpler using the cost-sensitive learning approach because only a learning parameter value needs to be altered. Second, we conduct a finer analysis using the cost space representation, in addition to
using a similar coarse performance measure such as the area under ROC curves (AUC) used in the previous study. This paper examines the previous finding [11] that (i) a model trained from the balanced distribution will perform no worse than, or better than, one trained from the natural distribution, and (ii) the best distribution to train a model is generally the one that biases the minority class. We show in this paper that this is only true under certain conditions. On a finer analysis, we show that the finding should be revised to include distributions that range between the natural distribution and the balanced distribution. Using more general performance measures than predictive accuracy, we affirm a previous result [10] which shows that a larger size tree performs better than a tree trained from a commonly used induction method. Unlike the previous work, which uses a different induction method, we show that one can induce a larger and better performing tree using exactly the same induction method. Weiss & Provost [11] show that improved performance in terms of predictive accuracy can be obtained by correcting the probability of each leaf in a decision tree under changed operating conditions. We show in this paper that an equivalent outcome can be obtained by modifying the decision rule instead, leaving the induced model unchanged. We show two ways of achieving that, depending on whether the changed condition is expressed as probability or cost. We begin the next section by describing the cost space representation pertinent to this work. Section 3 reports the experiments and results of examining the utility of the balanced distribution, and whether better performing trees are larger in size. Section 4 gives an analysis of an alternative to the corrected probability. We present a discussion and a summary in the last two sections.
2
Using Cost Curves to Measure Performance
Drummond & Holte [3] introduce the cost space representation in which the expected cost of a classification model is represented explicitly. It is a dual to the ROC representation but has advantages over the latter when a comparison of model performance in terms of the expected cost is required. Let p(a) be the probability (prior) of a given example being in class a, and C(a|b) be the cost of misclassifying a class b example as being in class a. The normalised expected cost, expressed in terms of the true positive rate (TP), false positive rate (FP) and probability-cost function (PCF), is defined [3] as

NE[C] = (1 − TP) PCF(+) + FP (1 − PCF(+))          (1)
      = (1 − TP − FP) PCF(+) + FP,                 (2)

where

TP = (positives correctly classified) / (total positives),
FP = (negatives incorrectly classified) / (total negatives), and
PCF(+) = p(+)C(−|+) / [p(+)C(−|+) + p(−)C(+|−)].
Fig. 1. Cost lines and curves: Part (a) shows a cost line produced using a fixed decision threshold, and part (b) shows the cost lines produced using different decision thresholds from the same model. The order in the legends corresponds to the anti-clockwise order starting from the line on the right. The two diagonal cost lines that intercept at point (0.5, 0.5) are the performance of the default classifiers that always predict + or −, respectively. The cost curve for the model is the lowest envelop of all the cost lines for the model, indicated as the dotted line. Note that not all the cost lines are shown. In our implementation, each curve is sampled at 0.01 interval along P CF (+)-axis.
The performance of a classification model that uses a fixed decision threshold is represented by a pair {TP, FP}. Given the pair, it can be represented as a line in the cost space, which has the normalised expected cost on the y-axis and PCF(+) on the x-axis, as indicated by the linear equation (2). Because both are normalised, they range from 0 to 1. An example of a cost line is shown in Figure 1(a). By varying the decision threshold¹ of the same model, different cost lines are obtained (as shown in Figure 1(b)). The cost curve representing the performance of the model that uses varying decision thresholds is the lowest envelope of all cost lines produced by the model. Provost & Fawcett [6] show an algorithm with which one can obtain all pairs of {TP, FP} in one pass for all the different thresholds of a model. PCF(+) denotes the operating condition under which one wants to use a model for testing. The area under the cost curve is the total expected cost. The difference in area under two curves is a coarse measure of the expected advantage of one model
A decision threshold is the cut-off level used to decide the final prediction of a classification model. In a two-class problem, the final prediction is class positive if the model’s posterior probability of a test example is above the threshold; otherwise it is class negative. When the threshold is changed, the model’s performance also changes.
over another. A model which has the lower normalised expected cost under all operating conditions is said to strictly dominate the other. We are interested in using unbalanced distribution data in this study as such data occurs in many real-world applications. We will denote the minority class as + and the majority class as −. The cost ratio (M) is defined as the ratio of the cost of misclassifying the minority class to the cost of misclassifying the majority class, that is, C(−|+)/C(+|−). When the cost ratio is unity, the normalised expected cost reduces to the error rate and PCF(+) to p(+). While the cost ratio reflects the condition for the given data, one can use this information to influence the model induction process in training. This is what we do by using M as a training parameter in cost-sensitive learning². Rather than using the entire area under the cost curve (AUC) as a measure of performance, we use a variant, AUC*, as a coarse measure of performance³. AUC* is defined as follows. For M > 1, AUC* is the part of the area under the cost curve where PCF(+) ranges between the training prior and 1.0, assuming the changed operating conditions are solely due to changes in cost and the prior is unchanged. This is equivalent to a changed operating condition solely due to changes in the prior with the cost unchanged. For M < 1, AUC* is the other part of the area where PCF(+) ranges between 0.0 and the training prior. We only use the first definition since we are interested in unbalanced data. A finer analysis of model performance requires identifying the operating range in which one model is better than the other. In this case, the normalised expected cost is used instead. We use both measures here. Note that all AUC* values shown in this paper must be multiplied by 10⁻² to obtain the actual values.
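As a small sketch of equations (1) and (2) and of the lowest-envelope construction (our own illustration, not the authors' implementation), the functions below compute PCF(+) from the prior and the cost ratio M, the cost line of a classifier with rates (TP, FP), and the cost curve value over a set of operating points. Only the ratio of the misclassification costs is assumed to matter.

```python
def pcf_positive(p_pos, cost_ratio):
    """PCF(+) from the definition above: p(+)C(-|+) / (p(+)C(-|+) + p(-)C(+|-)),
    where `cost_ratio` is M = C(-|+)/C(+|-)."""
    p_neg = 1.0 - p_pos
    return (p_pos * cost_ratio) / (p_pos * cost_ratio + p_neg)

def normalised_expected_cost(tp, fp, pcf_pos):
    """Cost line of equations (1)-(2) for a classifier with rates (TP, FP)."""
    return (1.0 - tp - fp) * pcf_pos + fp

def cost_curve(points, pcf_pos):
    """Lowest envelope of the cost lines of several (TP, FP) operating points,
    evaluated at a single PCF(+) value."""
    return min(normalised_expected_cost(tp, fp, pcf_pos) for tp, fp in points)

# Sanity check: with a cost ratio of 1, PCF(+) equals the prior p(+).
assert abs(pcf_positive(0.3, 1.0) - 0.3) < 1e-12
```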
3
Experiments
We use two sets of data sets obtained from the UCI repository [1]. In the first set where each data set has the total data size larger than 4000, we split each set into training and testing sets. The large data size allows us to produce learning curves, where each point in the curve is an average over multiple runs using a subset of the training set. In other cases, a single run is conducted for each data set. The main reason is that only a cost curve produced from a single run allows one to identify the threshold of the model for certain operating condition. An average cost curve produced from multiple runs does not allow one to do that. The second set of data sets has size less than 4000. Here we perform 2
² An instance weighting approach to changing the class distribution in decision tree induction has been used by a number of researchers. A description of the approach, which is the approach used in this paper, can be found in [8].
³ Refer to [9] for a discussion of the factors affecting the selection of either AUC or AUC*. AUC* is preferred for the method of generating cost curves used here, from a single model with varying decision thresholds. Also, the difference between AUC and AUC* is a constant for unbalanced data because the model induced using M < 1 is likely to be the default model. The constant approaches zero as the data becomes highly imbalanced.
Table 1. Description of data sets, evaluation methods and M values for balanced distribution.

Data set            % Minority Prior   Size and Method (Train / Test)   M for Balanced Distribution
D01. Coding              50.0            15000 / 5000                        1.00
D02. Abalone             31.7             2784 / 1393                        2.15
D03. Adult               24.1            32561 / 16281                       3.15
D04. Satellite            9.7             4290 / 2145                        9.29
D05. Pendigits            9.6             7494 / 3498                        9.42
D06. Letter-a             3.9            13333 / 6667                       24.64
D07. Nursery              2.6             8640 / 4320                       38.22
D08. Nettalk-s            1.1             3626 / 1812                       89.91
D09. Kr-vs-kp            47.8            10CV of 3196                        1.09
D10. German              30.0            10CV of 1000                        2.33
D11. Splice              24.0            10CV of 3175                        3.17
D12. Solar Flare         15.7            10CV of 1389                        5.37
D13. Hypothyroid          4.8            10CV of 3168                       19.83
a stratified 10-fold cross-validation to produce an average cost curve in order to get a good estimate. Both sets have data sets with percentage of minority class spanning from 50% to about 1%, that is from balanced to highly skewed distributions. The description of the data sets and the evaluation methods are listed in Table 1. There are seven data sets which have more than two classes and they are converted to two classes (marked with ). In all splits or partitions, the class distribution is maintained. Only two-class problems are considered because of the cost space representation’s existing limitation to this class of problems [3]. The last column of Table 1 shows the value of training parameter M used in order to train a model from balanced distribution in each data set. The Appendix shows how this is obtained. We use the decision tree induction algorithm C4.5 [7] and its default setting in all experiments, while taking the following modifications into consideration. The algorithm is modified to take cost ratio into consideration in the training process. For example, with M =2, every minority class training instance will be weighted twice as high as every majority class instance during the tree induction process [8]. Cost-sensitive pruning is done in accordance with the modified weights, though the default pruning method of C4.5 is used unaltered. The Laplace estimate is used to compute the class posterior probability for a leaf of the tree. This allow us to compute all pairs of {T P , F P } for a test set in one pass using the algorithm of Provost & Fawcett [6]. We report the results of pruned trees using the gain ratio criterion4 4
Despite previous claims of better performance using unpruned trees [2] and using the DKM splitting criterion [4], a recent study [9] shows that these claims are a result of using sub optimal models for comparison. When optimal models are used, the outcomes are different.
In the following experiments, a set of selected M values is explored in each data set. We use the term “the best” to refer to the best among those produced from the set of selected M values.
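The Appendix (not reproduced here) shows how the last column of Table 1 is obtained. The following sketch is our own reading of that column together with the instance-weighting description above: the formula for the balanced-distribution M is an assumption we infer from those values (it reproduces, e.g., 2.15 for Abalone), not a statement taken from the paper.

```python
def balanced_distribution_m(minority_prior):
    """Cost ratio M that simulates a balanced training distribution via
    instance weighting.  Assumed form, consistent with the last column of
    Table 1: weight the minority class by p(-)/p(+)."""
    return (1.0 - minority_prior) / minority_prior

def instance_weights(labels, m, minority_label="+"):
    """Per-instance weights used during tree induction: every minority-class
    instance counts m times as much as a majority-class instance."""
    return [m if y == minority_label else 1.0 for y in labels]

# e.g. balanced_distribution_m(0.317) is approximately 2.15 (Abalone, Table 1)
```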
Fig. 2. Learning curves for models trained using different M values. Each curve is a result of training using one M value at different training sizes. Insignificant curves are omitted to enhance readability.
3.1
Trees Trained from Balanced Distribution
In Weiss & Provost’s [11] experiments, they find that “..the optimal distribution generally contains between 50% and 90% minority-class examples. In this situation, the strategy of always allocating half of the training examples to the minority class, while it will not always yield optimal results, will generally lead
to results which are no worse than, and often superior to, those which use the natural class distribution.” We examine their proposition in this section. In terms of AUC*, a learning curve is produced for each M value. The base tree is the one trained with M =1, i.e., natural distribution. Trees of M =2,5,10,20,50,100,1000 and a value corresponding to the balanced distribution (shown in the last column of Table 1) are trained. Each learning curve is produced from 10% to 100% of the entire training set, at 10% intervals. The class distribution is maintained in all cases and the high percentage data always includes the low percentage data. The same testing set is used for all runs in each data set. Each point in the curve is an average over 10 runs, except at 100% where only one run is carried out. The learning curves are shown in Figure 2.
Fig. 3. Learning curves for models trained using different M values (continued).
Observation of the learning curves shows that trees trained from the natural distribution almost always perform worse than a lot of trees trained using a changed distribution biased towards minority class. The only exception is the coding data set in small training sizes, whose natural distribution is balanced. Balanced class distribution appears to be the best (or close to the best) to use when the training size is small. This happens in eight data sets: ≤ 20% in nettalk-stress, ≤ 30% in coding and satellite, ≤ 40% in letter-a, ≤ 60% in pendigits, ≤ 70% in nursery, ≤ 80% in adult, and ≤ 90% in abalone (note that abalone has training size < 3000). For a large training size, usually a M value larger than the one corresponding to the balanced distribution is the best. However, there are some exceptions. First, the natural distribution is better than the balanced distribution in some high percentage training data in satellite, pendigits, and adult. Second, a tree trained with a value of M which is less than that of the balanced distribution can sometimes be better than a tree trained with the balanced distribution. Examples can be found in some cases in adult, pendigits, and nettalk-stress. Overall, this result does not contradict with Weiss & Provost’s finding as their result is equivalent to one low to medium percentage point in our learning curve. Their training size is much smaller than ours because they use sampling; thus, it is restricted by the total number of minority class instances to vary the percentage of minority class from 2% to 95% of the training data in their experimental design.
Fig. 4. Part (a) shows five cost curves, each representing the performance of a model induced using one value of M ; Part (b) shows the same curves plotted with cost ratio as x-axis, instead of P CF (+). Note that only part of the curves are shown in part (b) where cost ratio ranges 1–20. In this case, the best operating range is 1.0–1.5 for the model built using M = 1 and 1.6–100 for the model built using M = 2.
Table 2. The best operating range, in terms of cost ratios, obtained from the trees trained using M =1,2,5,10,20,50,100,1000 and a value corresponding to the balanced class distribution. In kr-vs-kp, M =1 and M =1.09 produce the same tree. The marker indicates the relative position of the M value corresponding to the balanced distribution for each data set, shown in Table 1.
D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13
M=1 1.0-1.5 1.0-1.5 150-200 1-30 1.6-12 1.0-2.0
The best operating range of cost ratios in testing for each M used in training M=2 M=5 M=10 M=20 M=50 M=100 M=1000 1.6-100 1.0-2.2 2.3-3.7 3.8-11 12-100 1.6-3.5 3.6-8.9 9-22 23-55 56-300 1-6.9 7-15 16-300 1-300 210-300 16-140 210-300 6.7-200 31-100 1.0-2.3 2.4-16 1.0-1.5 13-300 1.0-1.7 5.3-11 12-250 2.1-300
M=Bal. 1.0-1.5
1-15 4.7-200 1-30 17-230 1.8-5.2
3.2
Finer Analysis Using Cost Curves
For a finer analysis using the normalised expected cost, cost curves such as those in Figure 4(a) are produced and the best operating range to be used for testing in each of the thirteen data sets is recorded. The result is shown in Table 2. This is equivalent to taking the lowest envelope of all cost curves from the trees trained using M = 1, 2, 5, 10, 20, 50, 100, 1000, and a value corresponding to the balanced distribution. For ease of reference, the operating range is converted from PCF(+) to cost ratio. The Appendix shows the relationship between cost ratios and priors in terms of PCF(+). Figure 4(b) shows example cost curves for the D01:coding data set plotted with cost ratio on the x-axis, from which the result in Table 2 is extracted. Weiss & Provost's finding quoted in the last section can be translated to the following: the best operating range in each data set should lie on the right-hand side of the balanced-distribution marker in Table 2. In contrast, we observe that
• A substantial portion of the range in many data sets lies on the left-hand side of the marker.
• In two unbalanced data sets (D05:pendigits and D13:hypothyroid), the entire range lies on the left-hand side.
• Only in three data sets (D04:satellite, D07:nursery and D08:nettalk-stress) does the entire range lie on the right-hand side. This is disregarding the range where the default classifiers are the best. Note that the best ranges in nursery and nettalk-stress do not begin with 1.0, because the best classifier in those ranges is the default classifier. This result reveals that in highly skewed
distribution data, we are usually not interested in operating at the PCF(+) corresponding to the natural distribution (or any value close to it) because the default classifier turns out to be the best. The default model is also the best model for cost ratios > 200 in nettalk-stress. The two naturally balanced distribution data sets, where the best ranges must lie on the right-hand side, are not counted.
• Trees trained using the changed balanced distribution contribute the best performance in some range in only four data sets (indicated in the last column of Table 2), excluding the two naturally balanced distribution data sets.
Note that only a few trees are required to cover a wide range of cost ratios. One extreme example is pendigits, where a tree trained with M = 5 is the best for the entire range. Other examples are D01:coding, D08:nettalk, D09:kr-vs-kp and D13:hypothyroid; except for the cost ratios close to the natural distribution, only one tree is required for the rest of the range.
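As an illustration of how best operating ranges such as those in Table 2 can be extracted, the sketch below computes the lower envelope of a few cost curves over a grid of operating points. It assumes the normalized expected cost of a classifier with rates (TP, FP) at operating point PCF(+) is (1 − TP)·PCF(+) + FP·(1 − PCF(+)), as in the cost-curve representation of [3]; the models and their (TP, FP) values here are hypothetical.

# Sketch of best-operating-range extraction from cost curves (assumed formula).
def normalized_cost(tp, fp, pcf_pos):
    return (1.0 - tp) * pcf_pos + fp * (1.0 - pcf_pos)

models = {"M=1": (0.60, 0.05), "M=2": (0.75, 0.15), "M=5": (0.90, 0.35)}  # hypothetical
grid = [i / 100.0 for i in range(101)]                # PCF(+) from 0 to 1

best = []
for pcf in grid:
    costs = {m: normalized_cost(tp, fp, pcf) for m, (tp, fp) in models.items()}
    best.append(min(costs, key=costs.get))            # lower envelope at this point

# Contiguous runs of the same winner give that model's best operating range.
ranges, start = [], 0
for i in range(1, len(grid) + 1):
    if i == len(grid) or best[i] != best[start]:
        ranges.append((best[start], grid[start], grid[i - 1]))
        start = i
print(ranges)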
3.3 Are Better Performing Trees Larger in Size?
Webb [10] introduces a tree-grafting method that adds nodes to an existing tree, and the resulting larger trees perform better in terms of accuracy. Using the more general measures AUC* and the expected cost, this section investigates whether a tree that performs better than the one trained from the natural distribution has a larger size. The first type of better performing trees are those trained from the M value with the best AUC*. The second type are those trained from the best M for the natural distribution, in terms of the expected cost. Table 3 shows the tree sizes for trees trained from the natural distribution and for the two types of better performing trees. The M value used to get the best AUC* is shown in the second column. The best M used to train the best performing tree for the natural distribution is shown in the last column. We measure tree size as the total number of leaves predicting either the minority class or the majority class. The trees trained with an M value that produces the best AUC* for imbalanced data sets have tree sizes larger than those trained using the natural distribution in eight data sets. In kr-vs-kp and splice, the best trees are trained from the natural distribution (see the second column in Table 3). The only real exceptions are coding, satellite and nursery. The trees trained with the best M have sizes larger than those trained with M = 1 in six data sets. Because there are four data sets in which the best M is 1 and the best tree in nursery and nettalk-stress is the default classifier, the only real exception is satellite. This result points to an alternative to the tree-grafting method [10] for producing a tree that is larger than the one trained using the intended cost ratio but performs better. Instead of using a different induction algorithm, the method is to find the best M and train a tree from it using exactly the same induction algorithm.
Table 3. Leaf sizes for trees trained from the natural distribution, from the M value with the best AUC*, and from the best M for the natural distribution. Sa and Si are the total numbers of leaves predicting the majority class and the minority class, respectively. Boldface indicates that the sizes are larger than those for M = 1. Note that though a non-trivial classifier is induced using M = 1 in nursery, the default classifier is still the best to operate on a point corresponding to M = 1 or the natural distribution. A dash (–) indicates that the default model is used.

Data set    M for best AUC*   M=1: Sa/Si   Best AUC*: Sa/Si   Best M (nat. dist.): Sa/Si   Best M
Coding      2                 804/766      563/524            804/766                      1
Abalone     5                 18/15        22/21              30/25                        2
Adult       10                285/241      539/628            285/241                      1
Satellite   20                61/46        51/43              51/43                        20
Pendigits   5                 21/19        22/20              22/20                        5
Letter-a    50                22/10        29/19              29/19                        24.64
Nursery     100               38/23        25/18              1/0                          –
Nettalk-s   1000              1/0          132/77             1/0                          –
Kr-vs-kp    1                 9.5/15.9     9.5/15.9           9.5/15.9                     1
German      5                 26.5/25.0    27.5/29.6          35.7/35.1                    2
Splice      1                 23.8/8.4     23.8/8.4           24.7/15.6                    5
S.Flare     10                1.8/0.8      9.0/15.7           10.5/10.5                    2
Hypo        10                4.3/1.5      11.1/7.1           4.3/1.5                      1

4 An Alternative to Corrected Probability: Modifying the Decision Rule
Weiss & Provost [11] show that in order to achieve better predictive accuracy, the estimated probability at a leaf of a decision tree shall be corrected in the following form:

    N+ / (N+ + k N−),   where   k = (p(+)/p(−)) / (p'(+)/p'(−)),        (3)
and Na is the number of class a training examples at the leaf, and p(a) and p'(a) are the prior probabilities for class a in the training set and the testing set, respectively. Instead of correcting the probability at each leaf of a tree, we show here that one can modify the decision rule, leaving the induced tree unchanged. There are two ways to do this, depending on whether the changed condition is expressed as (i) probability or (ii) misclassification cost. Let P = P(+|x) = N+ / (N+ + N−) be the uncorrected probability for class + given an example x. (i) For changed probability, equation (3) can be re-written by dividing the numerator and denominator by N+ + N− and expressed in terms of P and k as follows:
    P / (P + k(1 − P)).

The tree that uses the following decision rule to make a prediction achieves the same effect as equation (3), which uses the fixed decision threshold of 0.5:

    Predict class + if P : k(1 − P) > 1; predict class − otherwise.        (4)
(ii) If the changed condition is expressed as misclassification cost, we show here that using the uncorrected probability and the minimum expected cost (MEC) criterion to make a decision [5,8] can achieve the same effect. Let E(a|x) be the expected cost of predicting class a, given an example x. The MEC criterion for two-class problems is defined as follows:

    Predict class + if E(−|x) : E(+|x) > 1; predict class − otherwise,        (5)

where

    E(−|x) = P(+|x) C'(−|+) = P C'(−|+),        (6)
    E(+|x) = P(−|x) C'(+|−) = (1 − P) C'(+|−).        (7)
Recall that C(a|b) is the cost of misclassifying a class b example as class a during training, and C'(a|b) is the changed cost during testing. Let TP and FP be the true positive rate and the false positive rate. The slope of the line connecting two points {TP1, FP1} and {TP2, FP2} in the ROC representation is defined as follows [6]:

    (TP2 − TP1) / (FP2 − FP1) = p(−)C(+|−) / (p(+)C(−|+)).

Let p' be the changed prior with fixed C, and let C' be the equivalent changed misclassification cost with fixed p, such that

    p'(+)C(−|+) / (p'(−)C(+|−)) = p(+)C'(−|+) / (p(−)C'(+|−)).

Using the relation k in equation (3),

    C'(−|+) / C'(+|−) = (C(−|+) / C(+|−)) · (p'(+)/p'(−)) / (p(+)/p(−)) = (C(−|+) / C(+|−)) · (1/k),

that is,

    C'(−|+) / C'(+|−) = (1/k) · (C(−|+) / C(+|−)).        (8)

Then, substituting equations (6) and (7) into (5) and replacing C'(−|+)/C'(+|−) with its equivalent as in equation (8) gives

    E(−|x) : E(+|x) = (P / (1 − P)) · (C'(−|+) / C'(+|−))        (9)
                    = (P / (1 − P)) · (1/k) · (C(−|+) / C(+|−)).        (10)
Since C(−|+)/C(+|−) is assumed to be unity in Weiss & Provost's formulation, substituting equation (10) into equation (5) reveals that the prediction using the MEC criterion (with the uncorrected probability) is equivalent to that using equation (4), and thus equivalent to that using the corrected probability with the fixed decision threshold of 0.5.
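As a quick numerical check of this equivalence, the following sketch compares the three prediction procedures on a few hypothetical leaf counts and k values: the corrected probability of equation (3) thresholded at 0.5, the modified decision rule (4), and the MEC criterion (5) with unity training costs, i.e. C'(−|+) = 1 and C'(+|−) = k. The counts and k values are made up for illustration.

def predict_corrected(n_pos, n_neg, k):
    p_corr = n_pos / (n_pos + k * n_neg)           # corrected probability, equation (3)
    return 1 if p_corr > 0.5 else 0

def predict_rule(n_pos, n_neg, k):
    P = n_pos / (n_pos + n_neg)                    # uncorrected probability
    return 1 if P > k * (1.0 - P) else 0           # decision rule (4)

def predict_mec(n_pos, n_neg, k):
    P = n_pos / (n_pos + n_neg)
    e_neg = P * 1.0                                # E(-|x) = P * C'(-|+), with C'(-|+) = 1
    e_pos = (1.0 - P) * k                          # E(+|x) = (1-P) * C'(+|-), with C'(+|-) = k
    return 1 if e_neg > e_pos else 0               # MEC criterion (5)

for n_pos, n_neg in [(3, 7), (60, 40), (1, 99), (55, 45)]:
    for k in [0.25, 0.5, 1.0, 2.0, 4.0]:
        a = predict_corrected(n_pos, n_neg, k)
        b = predict_rule(n_pos, n_neg, k)
        c = predict_mec(n_pos, n_neg, k)
        assert a == b == c                         # all three procedures agree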
5 Discussion
It is important to point out that there is a subtle difference between the previous work [11] and this work. Weiss & Provost study the effect of class distribution with different information content in each class distribution, whereas our work studies the effect with exactly the same information content in each class distribution. Though both aim at investigating the conditions under which the best classifiers can be trained from different class distributions, the drivers behind them differ. The former study is driven by the need to purchase a data set with the optimal class distribution under the constraint of a limited fund, assuming there is a large pool of data to purchase from. This study is driven by the need to find an optimal cost (or ROC) curve from a given data set that allows one to use the optimal models under different operating conditions. If one has to first purchase a data set from a large data pool and then build an optimal cost curve, the results of the two studies are complementary and contribute to the following recommendation: purchase a balanced class distribution data set (guided by the result of the former study) and then construct the optimal cost curve using cost-sensitive learning.
6 Summary
We reveal the limitation that the sampling approach imposes on the experimental design, which constrains the finding of the previous research to small data sizes only. Based on a coarse measure similar to that used by the previous study, both studies produce the same finding for small training sizes. However, there are some discrepancies for large training sizes. On a finer analysis using cost curves, the discrepancies become apparent enough to lead us to conclude: to train the best model, one cannot ignore the M values which lie between unity (i.e., the natural distribution) and the value corresponding to the balanced distribution, i.e., those less than the latter. Using the general measures of the normalised expected cost and AUC*, our result in Section 3.3 shows that a larger tree performs better than a tree trained from the natural distribution, where both trees are trained with the same induction algorithm. While this affirms the previous result, the previous work used a different induction algorithm to achieve it. The current work opens an avenue to find a larger and better performing tree using exactly the same induction algorithm. We show that the corrected probability can be better implemented using a modified decision rule, leaving the induced model unchanged. Two ways of
modifying the decision rule are shown to give predictive decisions equivalent to those given by the corrected probability. The two ways differ in whether the changed operating condition is expressed as a probability or as a misclassification cost. The revelation mentioned at the beginning of this paper is implied throughout most parts of this paper without further discussion: a trained model does not operate optimally on data drawn from the same distribution as the training data. The reader is referred to [9] for a discussion of this issue. Acknowledgements. Part of this work was completed when the author was visiting Osaka University, supported by a fellowship provided by the Japan Society for the Promotion of Science. Ross Quinlan provided C4.5. Foster Provost and Gary Weiss pointed out that the results from their study and this work can be complementary. Comments from the anonymous reviewers have helped to improve this paper.
References
1. Blake, C. & Merz, C.J. UCI Repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California (1998).
2. Bradford, J., Kunz, C., Kohavi, R., Brunk, C., & Brodley, C. Pruning decision trees with misclassification costs. Proceedings of the European Conference on Machine Learning (1998) 131–136.
3. Drummond, C. & Holte, R. Explicitly Representing Expected Cost: An Alternative to ROC Representation. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2000) 198–207.
4. Drummond, C. & Holte, R. Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria. Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann (2000) 239–246.
5. Michie, D., Spiegelhalter, D.J., & Taylor, C.C. Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited (1994).
6. Provost, F. & Fawcett, T. Robust Classification for Imprecise Environments. Machine Learning 42 (2001) 203–231.
7. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann (1993).
8. Ting, K.M. An Instance-Weighting Method to Induce Cost-Sensitive Trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 3 (2002) 659–665.
9. Ting, K.M. Issues in Classifier Evaluation using Optimal Cost Curves. Proceedings of the Nineteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann (2002) 642–649.
10. Webb, G. Decision tree grafting from the all-tests-but-one partition. Proceedings of the 16th International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann (1999) 702–707.
11. Weiss, G. & Provost, F. The Effect of Class Distribution on Classifier Learning: An Empirical Study. Technical Report ML-TR-44, Department of Computer Science, Rutgers University (2001).
Appendix: The Relationship between PCF and Cost Ratio

Operating conditions can be expressed as either PCF (or its components) or cost ratios. The original cost curves use PCF to denote the operating conditions [3]. However, there is an advantage to plotting cost curves using cost ratios as the x-axis instead of PCF: because a unity cost ratio corresponds to the natural distribution and does not change from one data set to another, unlike PCF, this allows easy identification of the natural distribution operating condition in the cost curves. On the other hand, PCF(+) = 0.5 always indicates the balanced distribution. The relationship between the cost ratio and PCF(+), including these two special conditions, is given below. Let p and C denote the prior and misclassification cost of an operating condition, respectively. To avoid confusion with cost ratio M as the learning parameter for an induction algorithm, we use cost ratio R to denote the operating condition. By substituting R = C(−|+)/C(+|−) into the definition of PCF from [3]:

    PCF(+) = p(+)C(−|+) / (p(+)C(−|+) + p(−)C(+|−))
           = p(+)R / (p(+)R + p(−)).
    PCF(+) p(−) = p(+) R (1 − PCF(+)).
    R = (PCF(+) / (1 − PCF(+))) · (p(−)/p(+)).
Thus, R corresponds to a changed probability by the factor PCF(+)/(1 − PCF(+)). Three special values are given below.
    R = 1.0 when PCF(+) = p(+).
    R = p(−)/p(+) when PCF(+) = 0.5.
    R = ∞ when PCF(+) = 1.0.

R = 1.0 corresponds to the operating condition that matches the natural distribution. R = p(−)/p(+) or PCF(+) = 0.5 corresponds to the operating condition which is equivalent to the balanced distribution under the unity cost ratio condition, as shown below:

    R = C(−|+)/C(+|−) = p(−)/p(+) = 1,   that is,   p(+) = p(−).
To train a model with the balanced distribution from any given class distribution, simply set the training parameter M = R = p(−)/p(+).
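For convenience, the two directions of this relationship can be written as small helper functions; the prior used in the example below is hypothetical.

# Helpers for the appendix relationship between cost ratio R and PCF(+).
def pcf_pos(R, p_pos):
    """PCF(+) for cost ratio R = C(-|+)/C(+|-) and prior p(+)."""
    p_neg = 1.0 - p_pos
    return p_pos * R / (p_pos * R + p_neg)

def cost_ratio(pcf, p_pos):
    """Cost ratio R corresponding to operating point PCF(+)."""
    p_neg = 1.0 - p_pos
    return (pcf / (1.0 - pcf)) * (p_neg / p_pos)

p = 0.2                                                  # hypothetical minority-class prior
assert abs(pcf_pos(1.0, p) - p) < 1e-12                  # R = 1 gives PCF(+) = p(+)
assert abs(cost_ratio(0.5, p) - (1 - p) / p) < 1e-12     # PCF(+) = 0.5 gives R = p(-)/p(+)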
Model Complexity and Algorithm Selection in Classification

Melanie Hilario
University of Geneva, CSD
CH-1211 Geneva 4, Switzerland
[email protected]
Abstract. Building an effective classifier involves choosing the model class with the appropriate learning bias as well as the right level of complexity within that class. These two aspects have rarely been addressed together: typically, model class (or algorithm) selection is performed on the basis of default settings, while model instance (or complexity) selection is investigated within the confines of a single model class. We study the impact of model complexity on algorithm selection and show how the relative performance of candidate algorithms changes drastically with the choice of complexity parameters.
1 Introduction
The choice of the appropriate model for a given classification task comprises two complementary aspects: model class selection and model instance selection. Model class selection is usually implicit in algorithm selection, while model instance selection assumes a given model class and involves choosing the right level of complexity within that class. Though these two subproblems are intricately related, researchers have tended to focus on one to the detriment of the other. In statistical learning, emphasis is laid on model (instance) selection, defined as creating, within a specific model class, an individual model whose complexity has been fine-tuned to the given data. This viewpoint has found its fullest expression in Vapnik's theory. The principle of structural risk minimization, which defines a trade-off between the quality of the approximation to the data and the complexity of the approximating function, relies on the existence of a structure consisting of nested subsets of functions Sk such that S1 ⊂ S2 ⊂ . . . ⊂ SN [19]. As model complexity is typically governed by one or several "capacity control" parameters, model selection in this precise sense consists in adjusting these parameters in order to generate a model instance adapted to the data. By contrast, the machine learning community lays stress on algorithm selection. In the absence of prior knowledge about the most suitable algorithm, the idea is to cover a wide range of clearly distinct learning biases to increase the chances of pinpointing a promising model class for the learning task at hand. In order to restrict the space of possible hypotheses, it has become a common expedient to evaluate algorithms with their default parameter settings (e.g., [11], [13]
and studies conducted in the Metal project [12]). The implicit assumption seems to be that once a model class with the right bias is found, it would be easier to search this more restricted subspace for the model with the adequate complexity. While intuitively appealing, this strategy runs the risk of prematurely eliminating algorithms which may be highly effective with the proper parameters but which perform poorly with their default settings. As a consequence, one can legitimately question the validity of drawing generic conclusions about an algorithm on the basis of its default behavior. Integrating algorithm selection and model selection is a long-term research goal with many open issues. In this paper, we take a tiny step in that direction by investigating the impact of parameter adjustment on algorithm selection. After a preliminary discussion of the methodology adopted to assess and compare generalization performance (section 2), we describe the experimental setup (section 3) and main observations (section 4). Finally we discuss underlying issues as well as related and future work (sections 5 and 6).
2 Comparing Performance: Weighting vs. Ranking
Predictive accuracy (or error) is perhaps the most popular measure of generalization performance, due mainly to the convenience of having a single metric as a basis for comparison. While it is satisfactory when selecting a classifier for a particular task (and assuming that time and other practical considerations are irrelevant), the accuracy criterion is fraught with problems when comparing across different task domains and levels of difficulty. Clearly, 99% accuracy in a 2-class domain with a 95%-majority class cannot be deemed better than 80% accuracy in a task involving ten equiprobable classes. In addition, selecting a single algorithm to recommend is a delicate undertaking except in fairly rare problems where one clearly dominates all the others. For these reasons, model ranking has been proposed as an alternative to categorical model selection. First, replacing absolute measures of accuracy with performance ranks facilitates transdomain comparisons; another advantage is the greater flexibility offered by a ranked set of promising alternatives rather than a single "winner". How are ranks established? To be at all valid, any ranking strategy should be based on statistically significant differences in the performance measure. We use the McNemar test with the Yates continuity correction to determine if one algorithm's performance is significantly better than another's [5]. Multiple comparisons are handled with the Bonferroni adjustment [8], but there is an additional problem raised by the transition from multiple pairwise comparisons to a common overall ranking. Simply taking the "transitive closure" of the pairwise "better-than" judgments can lead to inconsistencies that have been addressed in Saaty's [18] analytical hierarchy process (AHP). AHP is a comprehensive methodology for multicriteria decision-making. It is composed of three stages: hierarchical structuring of goals and criteria, pairwise comparisons, and synthesis of priorities. Our focus here is the transition from the second to the third phase, in particular the establishment of an overall order of preference based
on pairwise significance tests on the performance measures of candidate algorithms. The underlying scale is a set of positive numbers (typically 1–9) used to quantify judgments concerning a pair of entities. Results of pairwise comparisons of N entities are gathered in an N × N matrix A, where A(i, j) contains a numerical value expressing a relative assessment of i with respect to j, and A(j, i) contains the reciprocal of this value to indicate the inverse assessment. In other words, A(i, j) can be viewed as a ratio wi/wj of the weights given to entities i and j. In the specific task of comparing classifier performance, we use A(i, j) = A(j, i) = 1 if there is no statistically significant difference between algorithms i and j, A(i, j) = 2 if i is significantly better than j, and A(i, j) = 0.5 if i is significantly worse than j. If we knew the precise weight or importance of each of the N items, it would be straightforward to build the comparison matrix (understandably called the positive reciprocal matrix), which would then verify the equality Aw = Nw. But the situation is slightly different: first, to accommodate possible inconsistencies, we replace the constant N by an unknown λ; second, we face the inverse task of recovering the weights from the weight ratios stored in the comparison matrix. The problem is that of solving the equation (A − λI)w = 0 for w; this is no other than the classical eigenvalue problem, where w is the dominant eigenvector of the matrix (and λ = N in case of perfect consistency). We can normalize w by requiring Σi wi = 1 with no risk of altering the weight ratios. The resulting weights give us precisely what we are looking for: a vector of priorities on the N candidate algorithms. It would be trivial to map this vector of weights onto a set of ranks. However, we would lose the relative amplitude of the weights, which is exactly the information needed to solve a problem raised by simple ranking: how can we quantify an improvement or a degradation in a classifier's ranking under varying experimental conditions? The question is more complex than it seems, as ranks are determined by statistically significant differences in performance measures and the number of ranks varies from one situation to another. The absolute value of a classifier's rank is meaningless in itself; an algorithm that is ranked second in a 2-rank scale is certainly worse off than one which is second in a 9-rank scale. To overcome this difficulty we can use the relative rank defined by rank(x)/#ranks. But annoying conundrums subsist, e.g., a classifier ranked first in a 2-rank scale would have the same relative rank as one that is fifth in a 10-rank scale. All of these problems are solved naturally by the use of normalized AHP weights, which are by construction ranks measured on a ratio scale. We therefore propose weighting as an alternative to ranking in algorithm and model evaluation. AHP-based weights have all the known advantages of ranks plus (1) information about distances and relative magnitudes of preferences or priorities, and (2) a fixed interval of [0..1]: knowing the number of candidate algorithms, one can easily draw conclusions from an algorithm's weight; in ranking, one needs to know in addition the effective number of significant ranks.
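A minimal sketch of this weighting step, assuming the pairwise significance outcomes are already available; the example outcomes at the bottom are made up for illustration.

# Sketch: AHP-style weights from pairwise comparison results.
import numpy as np

def ahp_weights(better):
    """better[i][j] = +1 if i is significantly better than j, -1 if worse, 0 otherwise."""
    n = len(better)
    A = np.ones((n, n))                              # positive reciprocal matrix
    for i in range(n):
        for j in range(n):
            if better[i][j] > 0:
                A[i, j], A[j, i] = 2.0, 0.5
    vals, vecs = np.linalg.eig(A)
    w = np.real(vecs[:, np.argmax(np.real(vals))])   # dominant eigenvector
    w = np.abs(w)
    return w / w.sum()                               # normalize so the weights sum to 1

# Three classifiers: 0 and 1 tie, both significantly beat 2.
outcomes = [[0, 0, +1], [0, 0, +1], [-1, -1, 0]]
print(ahp_weights(outcomes))                         # classifier 2 gets the smallest weight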
3 Experimental Setup
We selected 9 learning algorithms with clearly distinct learning biases: Quinlan's C5.0 decision tree (C5T)¹ and rule inducer (C5R) [17], a multilayer perceptron (MLP) and a radial basis function network (RBF) from Clementine 5.0.1 [3], J. Gama's linear discriminant (LDS) and Linear Tree (LTR) [7], an instance-based learner (IB1) and Naive Bayes (NB) from the MLC++ library [9], and Ripper (RIP) [4]. All experiments were done on a set of 70 selected datasets, most of them from the UCI benchmark repository. The same learning algorithms and datasets were used throughout the three experimental phases described below.
3.1 Baseline Construction
The 9 learning algorithms were run with their default parameter settings on the 70 datasets and their error rates estimated using stratified ten-fold cross-validation. Predictions and error rates of the 9 learned classifiers were then input to the weighting procedure, which is summarized as follows. For each dataset:
1. Perform (9 choose 2) = 36 pairwise comparisons using McNemar tests to determine whether each learning algorithm performed significantly better (or worse) than each of the 8 others. Apply the Bonferroni adjustment and set the effective significance level to 0.05/36 ≈ 0.001 to obtain a nominal significance level of 0.05.
2. Build a comparison matrix A from the significance test results, with A(i, j) = 1 if there is no statistically significant difference between classifiers i and j, 2 if i is significantly better than j, and 1/2 if i is significantly worse than j.
3. Compute the eigenvalues and eigenvectors of the comparison matrix; the normalized eigenvector corresponding to the highest eigenvalue represents the weights of the 9 candidate classifiers, as explained in Section 2.
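The pairwise test in step 1 can be sketched as follows, assuming the usual form of the McNemar test with Yates continuity correction applied to the two classifiers' per-instance predictions; the default critical value 10.83 is approximately the 0.001 quantile threshold of the chi-square distribution with one degree of freedom.

# Sketch of one Bonferroni-adjusted pairwise comparison (McNemar with Yates correction).
def mcnemar_better(pred_a, pred_b, truth, chi2_crit=10.83):
    """Return +1 if a is significantly better than b, -1 if worse, 0 otherwise."""
    b = sum(1 for pa, pb, y in zip(pred_a, pred_b, truth) if pa == y and pb != y)
    c = sum(1 for pa, pb, y in zip(pred_a, pred_b, truth) if pa != y and pb == y)
    if b + c == 0:
        return 0
    chi2 = (abs(b - c) - 1.0) ** 2 / (b + c)     # Yates-corrected statistic
    if chi2 <= chi2_crit:
        return 0
    return 1 if b > c else -1

The outcomes of these tests can then be fed directly into the comparison matrix of step 2 (for example, as input to the ahp_weights sketch given in Section 2).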
3.2 Study of Parameter Adjustment
Of the 9 learning algorithms studied, 6 have one or several parameters that control the complexity of the final model. For this study we selected the principal complexity parameter of these algorithms: the degree of pruning severity for decision trees (C5T and LTR) and rules (C5R and RIP), and the number of hidden units in neural networks (MLP and RBF). These 6 parameterizable algorithms will be called the target algorithms in the rest of this paper. The learning algorithms which have no complexity parameters are NB, LDS, and IB1 (an instance-based learner based on a single nearest neighbor). In C5T, C5R, and LTR, the pruning severity is governed by the c parameter, which takes values in (1..99), with a default of 10 for LTR and 25 for C5T and C5R. Increasing this value decreases pruning and yields a larger tree, while decreasing it results in more severe pruning. We chose to sample the landmark values 10, 20, 25, 30, 40, 50, 60, 70, 80, and 90. Ripper's S parameter controls pruning severity in the opposite direction: increasing the default value of 0.5 entails more severe pruning, while decreasing it reduces pruning, yielding more complex rules. In the absence of precise documentation concerning the limits of the S parameter, we tested the following values: 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 10. By default, Clementine builds an RBF network with 20 hidden units; we selected the following values to compare with this default: 1, 5, 10, 30, 40, 50, 100, 200, 300. For MLP, Clementine's default ("quick") strategy generates a variable number of hidden units according to some internal heuristic that is not explicit in the documentation. We explored the following settings: 2, 5, 10, 20, 30, 40, 50, 100, and 200 (2 is the minimal value accepted by Clementine, and 300 was excluded since MLPs are known to require fewer hidden units than RBFNs). Altogether, 4820 ten-fold cross-validation experiments were conducted: 630 in establishing the baseline weights of 9 algorithms on 70 datasets, plus 4190 in studying the selected parameter values of the six target algorithms.

¹ These abbreviations will be used throughout the paper to designate the specific implementations used and not the generic algorithms.
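For reference, the parameter grids just described can be collected in a small configuration sketch; the dictionary layout and key names are ours, while the values are those listed above.

# Complexity-parameter grids explored for the six target algorithms (values from the text).
PARAM_GRID = {
    "C5T": {"c": [10, 20, 25, 30, 40, 50, 60, 70, 80, 90]},          # default c = 25
    "C5R": {"c": [10, 20, 25, 30, 40, 50, 60, 70, 80, 90]},          # default c = 25
    "LTR": {"c": [10, 20, 25, 30, 40, 50, 60, 70, 80, 90]},          # default c = 10
    "RIP": {"S": [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 10]},      # default S = 0.5
    "RBF": {"hidden": [1, 5, 10, 20, 30, 40, 50, 100, 200, 300]},    # default 20 hidden units
    "MLP": {"hidden": [2, 5, 10, 20, 30, 40, 50, 100, 200]},         # default chosen by Clementine
}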
3.3 Comparison of Default and Optimal Weighted Ranks
For each dataset, each target algorithm thus generated a set of classifiers of varying complexity. The optimal model was selected according to criteria that will be specified in Section 4.2. To compare the default and optimal performance of the 6 target algorithms, we gathered the optimal models generated for each dataset together with the baseline models from the non-parameterizable learning algorithms. All 9 classifiers were then reweighted using the same weighting procedure as in the baseline construction phase (Section 3.1). The main difference is that, considering the number of parameterizations explored for the 6 target algorithms² (9 for MLP and 10 for the 5 others), we used a significance level of 0.05/(62 choose 2) ≈ 0.000026 to obtain a nominal significance level of 0.05.
4 Results and Discussion

4.1 Baseline Observations
Table 1 gives a sampling of the weights of the 9 default classifiers on a few selected datasets. The sample datasets are ordered according to the number of distinct weights reflecting the number of significant differences detected in performance levels. The heart dataset in the first row is representative of a set of 13 "equivalent-performance" (henceforth equiperformance) datasets, so called because no algorithm significantly outperformed any other on these datasets. Of the 57 remaining datasets where error rates showed greater variability, 5 datasets produced 2 performance ranks, 31 datasets between 3 and 6, and 14 datasets between 7 and 9. Averaging over the 70 datasets, we obtain an overall weight for

² The numbers of hidden units generated by default in MLP did not represent a distinct setting but coincided or were binned with the 9 nondefault values.
Table 1. Weights of classification algorithms with default parameter settings

Dataset                 C5R     C5T     MLP     RBF     LDS     LTR     IB1     NB      RIP     #Rks
heart                   0.111   0.111   0.111   0.111   0.111   0.111   0.111   0.111   0.111   1
crx                     0.125   0.125   0.125   0.125   0.125   0.125   0.063   0.063   0.125   2
sonar                   0.110   0.110   0.110   0.110   0.110   0.133   0.103   0.103   0.103   3
dermatology             0.117   0.117   0.058   0.117   0.141   0.117   0.117   0.109   0.109   4
glass2                  0.135   0.135   0.067   0.117   0.080   0.117   0.135   0.080   0.135   4
clean2                  0.151   0.151   0.067   0.078   0.097   0.151   0.097   0.057   0.151   5
processed.cleveland_2   0.102   0.095   0.110   0.110   0.142   0.110   0.102   0.120   0.110   5
bands                   0.127   0.117   0.117   0.102   0.069   0.095   0.139   0.117   0.117   6
char                    0.162   0.162   0.096   0.077   0.066   0.162   0.123   0.056   0.096   6
pyrimidines             0.180   0.154   0.103   0.060   0.078   0.103   0.151   0.060   0.111   7
satimage                0.103   0.095   0.142   0.095   0.067   0.153   0.195   0.057   0.095   7
segmentation            0.142   0.142   0.122   0.067   0.079   0.155   0.107   0.057   0.130   8
triazines               0.191   0.164   0.056   0.065   0.089   0.140   0.110   0.076   0.110   8
optdigits               0.097   0.076   0.065   0.123   0.153   0.162   0.179   0.069   0.075   9
waveform40              0.082   0.066   0.139   0.163   0.178   0.131   0.057   0.096   0.088   9

Summary over 70 datasets:
Mean Weight             0.1263  0.1224  0.1122  0.0999  0.1005  0.1232  0.111   0.0917  0.1129
Overall Rank            1       3       5       8       7       2       6       9       4
each algorithm which we can translate into integer ranks for convenience (bottom of table). However, this incurs a loss of information: relying on simple ranks, we see instantly that decision trees and rules (C5R, LTR, C5T, RIP) occupy the top four positions of the performance scale. Weights reveal more subtle differences: for instance, the distance between C5R and LTR is four times that between LTR and C5T while the distance between C5T and RIP is 13 times greater than that between LTR and C5T.
Fig. 1. Mean weights over 70 datasets of learning algorithms with default parameters
Figure 1 visualizes relative weighting leaps which are completely hidden by the notion of ranks. It is however a bit awkward to express relative performance in terms of pairwise differences between algorithm weights. Since the weights of
N candidate algorithms sum to unity, a convenient reference is the equiperformance level, i.e., 1/N (0.111 for our 9 algorithms, just a hair's breadth above the 0.110 line in Fig. 1). The equiperformance line plays the same role as the well-known default accuracy in comparing generalization performance, but it has the clear advantage of being data-independent and therefore usable in cross-domain comparisons. While all 4 decision tree and rule learners are above the equiperformance line in our example, C5R is about 8 times farther from it than RIP. Note that from the viewpoint of both ranks and weights, the neural networks are clearly outperformed by decision trees and rules; MLP and RBF rank 5th and 8th respectively; MLP is barely above the equiperformance line whereas RBF is clearly below it, along with LDS and NB.
4.2 Impact of Parameter Variation
We ran each target algorithm with 10 different parameter settings on each dataset. We then selected the optimal parameter, defined as that which led to the lowest 10-fold cross-validation error³ and, in the rather frequent case of ties, to the smallest model⁴. This section focuses mainly on how each algorithm's performance varies with the complexity parameter; inter-algorithm conclusions will be discussed in the next section. Figure 2 shows the distribution of the optimal parameter for the four decision tree/rule learners. The height of the bar representing a particular parameter setting indicates the number of datasets for which that setting produced the lowest error. In each chart, the figures depicted by the bars sum up to several times the number of datasets due to numerous ties in optimal performance. The darker part of each bar shows the number of datasets for which a given parameter value minimizes both error and complexity. The charts of the recursive partitioning algorithms (C5.0 trees and rules, Ltree) show a clear pattern: simpler models achieve higher generalization accuracy. In C5.0 trees, for instance, a c value of 10 (most severe pruning) sufficed to attain the minimal error in 36 of the 70 datasets. The default parameter of c=25 led to the lowest error on 23 of the 70 datasets, but in 17 of these 23 cases the same error was produced by a smaller c value (which actually produced the same model). On 28 datasets, lower c values (and smaller trees) produced a lower error than the default (e.g., waveform21, waveform40, dna-splice, clean2, and satimage). In these cases, the default parameter was clearly overfitting the data. By contrast, the default did not yield the complexity needed to classify 18 datasets, on which maximal accuracy was attained only with c values ranging from 30 to 70 (e.g., bands, char, sonar, triazines). The behavior of the complexity parameter in C5.0-rules is rather similar, except that the distribution is a bit less sparse at the higher end of the c parameter spectrum.

³ At this stage use of the error rate is reasonable since different models are being compared on the same dataset.
⁴ The model size depends on the representation used and can be used only in comparing different models produced by the same algorithm (e.g., # of leaf nodes in decision trees, # of hidden nodes in neural networks).
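A minimal sketch of this selection criterion, assuming the per-setting results of one algorithm on one dataset are available as (cross-validation error, model size) pairs; the numbers below are made up.

# Lowest cross-validation error, ties broken in favour of the smallest model.
def optimal_setting(results):
    return min(results, key=lambda p: (results[p][0], results[p][1]))

results = {10: (0.121, 31), 25: (0.121, 47), 40: (0.118, 63), 70: (0.118, 88)}
print(optimal_setting(results))   # 40: lowest error, and smaller than the c=70 model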
Fig. 2. Distribution of optimal complexity parameters of C5.0-tree, C5.0-rules, Ltree, and Ripper (clockwise)
In particular, 5 datasets required a c value of 80 and above to attain optimal performance in C5.0-rules, as opposed to none in C5.0 trees (however, the error thus obtained was smaller than that of the C5.0-tree optimal model). The distribution of parameter optima in Ltree is strikingly similar to that of C5.0, with one major difference: the default c value in Ltree is 10, which actually attained the highest accuracy on the majority of the datasets. Finally, Ripper's behavior is clearly atypical; contrary to the right-skewness of the three other distributions, optimality appears to be more uniformly distributed among the different S values; note, however, the two rather distant modes, S=1.75 and S=0.5, the latter being the default setting, which proves optimal in 18 datasets. Figure 3 shows the distribution of the optimal complexity parameters, which correspond roughly to the number of hidden units in Clementine RBF and MLP. In RBF, the default number of hidden units (20) proved optimal in only 13 datasets. In 15 datasets, RBF obtained equal or better accuracy with much simpler networks comprising 1 or 10 hidden units. In 34 datasets, it took a minimum of 100 units to achieve optimal performance, the overall mode being situated at c=200 units. For multilayer perceptrons, we saw in Section 3.2 that Clementine follows a more flexible (albeit more opaque) default strategy in determining the number of hidden units. On the 70 experimental datasets, Clementine's quick strategy
Fig. 3. Distribution of optimal number of hidden units in Clementine RBFN (left) and MLP (right)
produced default values ranging from a minimum of 4 (for simple problems such as iris, breast-cancer-wisconsin, and the monks problems) to upper limits of 61 units for triazine, 66 for dna-splice, and 67 for clean2. The default value turned out to be optimal in only 9 of the 70 datasets. In 16 cases, Clementine selected h values of 5 and 10 whereas 2 hidden units sufficed to achieve equivalent or even significantly better performance. Altogether, the optimal values were smaller than the default values in 28 cases and larger in 33 cases. It is also instructive to compare the distributions of the optimized number of hidden units of MLP and RBF. Figure 3 shows a right-skewed distribution of optimal values (minimal error at minimal size) for MLP, with the highest frequencies between 2 and 10 hidden units. RBF displays a multimodal distribution with modes at or near the extremes. However, values at the higher end (200-300) clearly outnumber those at the lower end (1-10). On approximately half of the datasets the most accurate RBF models have a topology of 100 hidden units or more. This confirms the well-known fact that RBF networks require more hidden units than multilayer perceptrons.
4.3 The Final Picture
After selecting the optimal models produced by the parameterizable algorithms for each dataset, we reapplied the McNemar-Bonferroni-AHP weighting procedure to these classifiers together with those generated by the 3 non-parameterizable algorithms. To give an idea of the changes in performance entailed by parameter selection, Table 2 shows the new weights obtained on the sample datasets of Table 1. The overall picture is summarized at the bottom of the table, which shows the mean weight (and simplified rank) of each algorithm over the 70 datasets. These figures, compared with those in Table 1, reveal a dramatic change in the performance of the neural networks with respect to the baseline. Put briefly, MLP jumps from fifth to the top rank and RBF from rank 8 to rank 5. Expressed more
Table 2. Weights of classification algorithms after parameter optimization

Dataset                 C5R     C5T     MLP     RBF     LDS     LTR     IB1     NB      RIP     #Rks
heart                   0.111   0.111   0.111   0.111   0.111   0.111   0.111   0.111   0.111   1
crx                     0.125   0.125   0.125   0.125   0.125   0.110   0.069   0.069   0.125   3
sonar                   0.111   0.111   0.111   0.122   0.111   0.111   0.111   0.104   0.111   3
dermatology             0.109   0.109   0.129   0.129   0.129   0.109   0.109   0.088   0.088   3
glass2                  0.127   0.127   0.081   0.109   0.102   0.109   0.127   0.081   0.138   4
clean2                  0.132   0.132   0.158   0.115   0.072   0.132   0.079   0.058   0.123   7
processed.cleveland_2   0.111   0.111   0.111   0.111   0.111   0.111   0.111   0.111   0.111   1
bands                   0.119   0.110   0.119   0.119   0.084   0.110   0.119   0.110   0.110   3
char                    0.142   0.154   0.132   0.105   0.067   0.154   0.098   0.057   0.090   8
pyrimidines             0.197   0.120   0.120   0.072   0.072   0.120   0.120   0.058   0.120   4
satimage                0.096   0.096   0.164   0.164   0.067   0.096   0.164   0.058   0.096   4
segmentation            0.131   0.144   0.131   0.067   0.079   0.144   0.115   0.058   0.131   6
triazines               0.191   0.149   0.089   0.065   0.078   0.149   0.104   0.056   0.121   8
optdigits               0.076   0.065   0.166   0.195   0.111   0.130   0.118   0.069   0.069   8
waveform40              0.088   0.067   0.163   0.163   0.163   0.123   0.057   0.088   0.088   5

Summary over 70 datasets:
Mean Weight             0.1200  0.1173  0.1211  0.1138  0.0984  0.1205  0.1048  0.0920  0.1121
Overall Rank            3       4       1       5       8       2       7       9       6
precisely in terms of performance weights, MLP’s distance from the equiperformance level increases by an order of magnitude (from 0.001 to 0.01) while that of RBF shifts from -0.01 to +0.002. Despite its effective default parameter value, LTR still improves with complexity adjustment, outperforming both variants of C5.0. The two rule learners (C5R and RIP) find themselves less favorably weighted despite improved performance, but this is due mainly to the relatively more significant improvement of MLP, RBF, and LTR. Finally, the three nonparameterizable algorithms end up at the bottom of the performance scale. Note, however, that IB1 is much closer to the default performance than LDS and NB (Fig. 4, right); a possible explanation is that IB1, as a high-variance algorithm (as opposed to the other two—high-bias—learners), can adapt more easily to complex problems.
Fig. 4. Mean weights of algorithms with default (left) and optimized (right) parameters
An intriguing result of the final weighting process is the increase in the number of equiperformance cases: the optimized models display no significant difference in performance on 23 datasets (against 13 in the baseline evaluation). This is illustrated in the table by processed.cleveland 2, on which the 5 distinct initial weights flattened out to a single common weight. The trend towards a decrease in significant differences is confirmed in the general case, as can be seen by comparing the rightmost columns of Tables 1 and 2 (boldface). More clearly, the final performance weights of the optimized algorithms no longer display the abrupt leaps observed in the default settings (Fig. 4). There are two possible explanations for this: the convergence of the optimized models’ accuracies in the vicinity of the Bayes error, and/or the extremely stringent significance level of 0.000026 imposed by the Bonferroni adjustment.
5 Discussion and Related Work
The weighting/ranking upheaval entailed by parameter selection demonstrates the inconclusive nature of comparative evaluations based on default settings. By blending model instance selection into algorithm selection, we can more confidently draw conclusions about the specific biases and areas of expertise of the different algorithms. We tried the meta-learning approach to uncover a few clues on why a properly parameterized algorithm outperforms the others on a given dataset. The task was cast as one of binary classification: a dataset is considered to belong to a learning algorithm's area of competence if the algorithm obtained the highest weight (possibly tied with others) on the dataset. The input meta-data consisted of 20 dataset characteristics (e.g., # of instances, classes, and explanatory variables; proportion of missing values, average entropy of predictive variables, class entropy, mutual information between predictive and class variables). Since the objective was to gain insight rather than predictive accuracy, we chose C5.0-rules as our meta-learner. Examples of learned rules are given in the table below. The head designates the algorithm(s) whose success can be explained by the corresponding conditions. Each rule is followed by its coverage and confidence (correctly covered cases).

Head   Conditions                                                   Cover.   Conf.
MLP    P ≤ 20 & mean attribute entropy ≤ 5.24                       34       0.944
LTR    P/K ≤ 16 & default accuracy > 0.53                           37       0.846
C5R    default accuracy > 0.16 & mean attribute entropy ≤ 4.48      49       0.824
C5R    P ≤ 32 & default accuracy > 0.16 & equiv#atts ≤ 1.71         5        0.857
RP     P > 8 & P/K ≤ 16 & default accuracy > 0.53                   31       0.879
NN     default accuracy > 0.68 & mean NMI ≤ 0.01 & dsize ≤ 43890    11       0.923
NN     P ≤ 40 & default accuracy > 0.16 & mean NMI > 0.01           22       0.875
UP     N ≤ 339 & P ≤ 40                                             21       0.783
In the above table, N is the number of instances, K the number of classes, and P the number of predictive attributes. The first four rules explain the performance of the three top-ranking algorithms after parameter optimization. The
most commonly used meta-features are those related to the difficulty of the problem (default accuracy), the dataset size (P, P/K or the attribute/class ratio, and dsize = P × N) and information content (mean attribute entropy and equiv#atts or the ratio of the class entropy to the mean mutual information between class and predictive attributes). Another challenge was to explain the striking correlation in the performance of the three recursive partitioning algorithms: C5R, C5T and LTR obtained the same (highest) weights on 38 out of the 70 datasets. The RP rule shown in the table correctly explains 28 of these using the same meta-features previously described. To explain a similar phenomenon, the equality of weights of MLP and RBF on 39 datasets, the NN rules use an additional characteristic, the mean NMI or average mutual information between attributes and the class variable, normalized by the average number of distinct values per attribute. Finally, it was tempting to find an explanation for equivalent performance, whose practical consequence is the expendability of algorithm or model selection for datasets that satisfy the induced conditions. The UP rule in the table expresses the idea that below a certain dataset size no algorithm can demonstrate its superiority over the others. Datasets correctly covered by this rule include staples of the UCI repository such as balance-scale, iris, heart, hepatitis, and the different versions of the Wisconsin breast-cancer database. While this rule does make sense, an alternative explanation could be, as some have conjectured, that many existing algorithms have been overfitted to these classical UCI benchmarks. The choice of the appropriate model complexity has been a long-standing subject of research in the neural network community. Aside from many futile attempts to discover simple heuristics based on, say, the number of inputs and outputs, researchers have explored two main avenues: (1) strategies for iteratively adjusting model complexity by varying the number of hidden units/layers and/or a regularization term, typically via a double cross-validation loop [2][16]; and (2) so-called self-configuring networks, where the number of hidden units (and layers in certain cases) is dynamically incremented (e.g., [6][1]) or pruned (e.g., [10][14]) during training. However, there has been relatively little work on the impact of complexity parameters in algorithm selection. As mentioned in Section 1, most large-scale comparative studies use default parameters; while performance is determined as much by these parameters as by the algorithms' inherent biases, observed results have usually been attributed to the latter. In a few cases, algorithm comparisons are slightly broadened to take account of parameter effects. For instance, in Lim et al.'s [11] study of 33 classifiers, roughly a third are actually alternative parameterizations of a single learning algorithm; however, these variants were viewed as distinct learning algorithms in the comparative analysis.
6 Limitations and Future Work
In this paper we investigated the impact of model (instance) selection on algorithm selection. We ran 9 learning algorithms with their default parameters on
70 datasets, then experimented with 9-10 different settings on the six algorithms which could induce models of varying complexity. To track performance variations more effectively, we proposed weighting as an alternative to ranking in the comparative study of algorithms and models. We showed that a number of problems raised by the use of ranks are smoothly solved by AHP-based weights. In analyzing the observed results, we used rule meta-learning as an auxiliary tool in explaining certain observations on the basis of dataset and task characteristics. This study is an initial foray into the problem of the interaction between model complexity and algorithm selection, and is of necessity still crude and partial. One obvious limitation is that it focuses on a single complexity parameter for each of the algorithms studied, ignoring essential interactions between different model and runtime parameters (e.g., between the number of hidden units and the width of each unit's region of influence in RBF, or between network topology and the number of training cycles in MLPs). Our experiments have shown clearly that we cannot rely on comparative evaluations based on default parameter settings. At the same time, in many real-world applications we cannot afford the luxury of time-consuming experimentation on each candidate algorithm. Another practical research objective is thus the development of more adequate default strategies for handling parameters in data mining tools. While such automated strategies will never replace empirical fine-tuning to the application data, they will at least provide a more adequate baseline to start from, thus accelerating the modeling process. From the results of this study, neural network algorithms stand to gain most from the development of such strategies. Finally, this study uses Saaty's technique for deriving weights from pairwise comparison matrices, but his analytical hierarchy process has a much broader application scope in decision theory and its full potential has yet to be exploited in data mining. In particular, AHP offers a principled way of synthesizing weighted priorities in the presence of multiple, possibly conflicting, goals and criteria (e.g., generalization performance, computational cost, interpretability, and practicality). It would certainly be interesting to compare the hierarchical approach with data envelopment analysis, another technique from operational research which has been adapted to the integration of multiple criteria in knowledge discovery [15].
References
1. E. Alpaydin. GAL: Networks that grow when they learn and shrink when they forget. Technical Report TR-91-032, International Computer Science Institute, Berkeley, May 1991.
2. C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
3. Clementine. http://www.spss.com.
4. W. W. Cohen. Fast effective rule induction. In Proc. 11th International Conference on Machine Learning, pages 115–123, 1995.
5. T. G. Dietterich. Statistical tests for comparing supervised classification learning algorithms. Technical report, DCS, Oregon State University, 1996.
6. S. E. Fahlman and C. Lebiere. The Cascade-Correlation learning architecture. Technical Report CMU-CS-90-100, Carnegie Mellon University, 1990.
7. J. Gama and P. Brazdil. Linear tree. Intelligent Data Analysis, 3:1–22, 1999.
8. D. D. Jensen and P. R. Cohen. Multiple comparisons in induction algorithms. Machine Learning, 38:309–338, 2000.
9. R. Kohavi, G. John, R. Long, D. Manley, and K. Pfleger. MLC++: A machine learning library in C++. Technical report, CSD, Stanford University, 1994.
10. Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing, 2, pages 598–605, 1990.
11. T. Lim, W. Loh, and Y. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40:35–75, 2000.
12. MetaL Consortium. Project Homepage. http://www.metal-kdd.org/.
13. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
14. M. C. Mozer and P. Smolensky. Skeletonization: a technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky, editor, Advances in Neural Information Processing, 1, pages 107–116, San Mateo, CA, 1989. Morgan Kaufmann.
15. G. Nakhaeizadeh and A. Schnabl. Development of multi-criteria metrics for evaluation of data mining algorithms. In Proc. Third Intl. Conf. on Knowledge Discovery and Data Mining, pages 37–42, 1997.
16. B. D. Ripley. Statistical ideas for selecting network architectures. In B. Kappen and S. Gielen, editors, Neural Networks: Artificial Intelligence and Industrial Applications, pages 183–190, London, 1995. Springer.
17. http://www.rulequest.com.
18. T. L. Saaty. Fundamentals of Decision Making and Priority Theory. RWS Publications, 1994.
19. V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
Experiments with Projection Learning

Tapio Elomaa and J.T. Lindgren
Department of Computer Science
P. O. Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland
{elomaa,jtlindgr}@cs.helsinki.fi
Abstract. Excessive information is known to degrade the classification performance of many machine learning algorithms. Attribute-efficient learning algorithms can tolerate irrelevant attributes without their performance being affected too much. Valiant's projection learning is a way to combine such algorithms so that this desired property is maintained. The archetype attribute-efficient learning algorithm Winnow and, especially, combinations of Winnow have turned out empirically successful in domains containing many attributes. However, projection learning as proposed by Valiant has not yet been evaluated empirically. We study how projection learning relates to using Winnow as such and with an extended set of attributes. We also compare projection learning with decision tree learning and Naïve Bayes on UCI data sets. Projection learning systematically enhances the classification accuracy of Winnow, but the cost in time and space consumption can be high. Balanced Winnow seems to be a better alternative than the basic algorithm for learning the projection hypotheses. However, Balanced Winnow is not well suited for learning the second level (projective disjunction) hypothesis. In classification accuracy, the on-line approach of projection learning does not fall far behind batch algorithms such as decision tree learning and Naïve Bayes on the UCI data sets that we used.
1 Introduction
Redundancy and excess of information are typical in natural information processing. Unfortunately, from the algorithmic point of view, such expendable information is often directly reflected in the efficiency of data processing. Moreover, the quality of the hypothesis is known to degrade for many learning algorithms through the addition of irrelevant data. One approach to tolerating excessive information in algorithmic learning is attribute-efficient learning [6,10,11]. This approach allows a learning algorithm to make a polynomial number of prediction errors in the number of relevant features, but only a subpolynomial number in the total number of features, before converging to the correct concept. The setting for attribute-efficient learning is on-line learning, where the (training) instances are received one at a time, the current hypothesis is used to predict the class of an instance before receiving a reinforcement (the true class of the instance), after which the hypothesis is updated on the basis of the
difference, if any, between the prediction and the reinforcement. In on-line learning a natural goal is to attempt to minimize the number of classification errors of the learner. This learning model is known as the mistake bound model [10]. Attribute-efficiency is particularly important in domains where there are huge numbers of attributes, out of which only a few are relevant [9]. Examples of such domains are text classification [5], natural language processing [8], and object recognition [14]. Attribute-efficiency can typically be guaranteed only for very restricted concept classes. The best-known attribute-efficient learning algorithm is Littlestone's [10] Winnow, a variant of the classical Perceptron algorithm [13]. Other attribute-efficient learning algorithms are also very simple algorithms that alone have limited computational power [18]. Combining linear machines into a neural network [7] or using linear classifiers in the implicit feature space of support vector machines [22] makes these simple and restricted learning machines a powerful method to use. How to combine linear classifiers so that their attribute-efficient properties are maintained is the question that projection learning [20] aims to tackle. Even though variants of Winnow and networks of Winnows (some even using projections to guide the examples to subclassifiers) have proved successful in practical learning tasks [2,5,8,14], the two-level projection learning proposed by Valiant [20] has not been tested empirically. Our aim is to study, on the one hand, how much advantage projection learning brings in contrast to using Winnow alone and, on the other hand, how close projection learning comes to such empirically successful algorithms as the C4.5 decision tree learner [12] and Naïve Bayes [3]. Our study is experimental: we test the different learning algorithms using UCI data sets [1]. Of the projection learning versions that we have tested, the most successful is the one in which the lower level learners are Balanced Winnows [11] and the high-level learner is the standard (positive) Winnow. As projection sets, both single-attribute projections and quadratic ones are useful. A statistically significant improvement over using Winnow alone is recorded on many UCI domains. However, the loss in time- and space-efficiency can be high. The classification accuracy of the on-line method approaches that of decision tree learning and Naïve Bayes. The remainder of this article is organized as follows. In Section 2 we recapitulate the basics of attribute-efficient learning. In particular, the Winnow algorithm is examined. Section 3 concentrates on projection learning. We present our empirical experiments in Section 4. The last section gives the concluding remarks of this study.
2 Mistake-Bounded Learning and Attribute-Efficiency
Let us consider Boolean functions on n variables x1, ..., xn. A concept is a mapping from value assignments (instances) Xn = {0, 1}^n to {0, 1} for some finite n. Those instances that are mapped to 1 are called positive and those mapped to 0 negative. A concept class C is the union of stratified subclasses,
C = ∪_{n≥1} Cn, where c ∈ Cn has domain Xn. An instance x ∈ Xn together with its assigned class y = c(x) makes up an example (x, y). A collection S of (possibly multiply occurring) examples is called a sample. The size of a sample |S| is the total number of examples in it. The underlying concept c determining the classification of instances is the target concept that we try to learn. By ES(L) we denote the number of errors that learning algorithm L makes on sample S with respect to the target concept c.
2.1 Online Learning and Attribute-Efficient Learning
In on-line learning one wants to update the hypothesis incrementally and efficiently while at the same time minimizing the number of erroneous predictions. Many on-line algorithms, like Perceptron, are conservative in the sense that the hypothesis is not updated when the label of the instance was predicted correctly. Only falsely classified instances cause the algorithm to change its guess about the correct concept. Assuming that the on-line learning algorithm will eventually converge to the correct hypothesis, we can set a mistake bound for the learning algorithm [10]. Learning within the mistake bound model is then defined through the number of errors made before converging to the correct concept.

Definition 1. Learning algorithm L learns the concept c within the mistake bound model if for all samples S that are consistent with c it holds that ES(L) ≤ poly(n).

Given examples described by n attributes, it is often the case that only a few of the attributes are relevant in determining the value of the class attribute. Those attributes that have no effect are called irrelevant attributes. If the efficiency of our learning algorithm depends linearly (or through a higher-order dependency) on the number of attributes used to describe the instances, utmost care must be exercised in choosing the variables to represent the data. A more natural approach is to have the learning algorithm efficiently recognize the relevant attributes from among the multitude of possibilities.

Definition 2. Learning algorithm L is attribute-efficient in the mistake bound model if ES(L) ≤ poly(k) · polylog(n), where n is the total number of attributes and k is the number of relevant attributes.

Winnow [10] (see Table 1) is the best-known attribute-efficient learning algorithm. Like Perceptron it processes the given sample S one example (x, y) at a time, updating the weight vector w ∈ R^n each time an example gets misclassified. The algorithm predicts the instance x to be positive if x · w > θ, where θ ∈ R is a threshold value given as an external parameter. If the inner product x · w does not exceed θ, then the prediction is h(x) = 0.

After predicting the class of x its true class y is revealed, and the hyperplane hypothesis defined by the weight vector w is updated. Only falsely classified instances require changing the hypothesis. Each weight wi that corresponds to
Table 1. The linear learning algorithm Winnow [10,11].
Winnow( S, α, θ )
% input: sample S of m pairs (x, y) and positive reals α and θ
% maintain hypothesis: h(x) = sgn(w · x > θ)
{
  initialize w ∈ R^n with positive values;
  for each example (x, y) ∈ S {
    predict h(x) = sgn(w · x > θ);
    for each xi = 1 { update wi ← wi · α^(y−h(x)); }
  }
}
an active attribute xi = 1 in the misclassified instance is changed. A false positive prediction (h(x) = 1 and y = 0) causes the active weights to be demoted and a false negative (h(x) = 0 and y = 1) causes them to be promoted. However, the updates in Winnow are multiplicative rather than additive as in Perceptron. Thus, whenever xi = 1,

  wi ← wi · α^(y−h(x)) = { wi / α, when a false positive is predicted;
                           wi · α, when a false negative is predicted.

For monotone disjunctive concepts on k variables Winnow can be shown to learn the correct concept from n-dimensional inputs with at most 2 + 3k(1 + log n) mistakes before converging to the correct concept, by setting α = 2, θ = n, and the initial value of each weight to 2 [10]. Thus, for this concept class Winnow is attribute-efficient. For the Perceptron algorithm the number of mistakes for monotone disjunctive concepts grows linearly in n [9].

The fact that Winnow can learn monotone disjunctive concepts attribute-efficiently does not, however, mean that it would be attribute-efficient for all linear concept classes. For example, in learning arbitrary linear threshold functions Winnow is not attribute-efficient [15]. It can even make an exponential number of mistakes for some concept classes [16].
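As an illustration of the update rule just described, here is a minimal Python sketch of Winnow with the parameter settings mentioned above (alpha = 2, theta = n, initial weights 2); the function name and the toy monotone disjunction are illustrative choices, not taken from the paper.

import numpy as np

def winnow(examples, n, alpha=2.0, theta=None):
    """Minimal Winnow sketch: multiplicative updates on mistakes only."""
    if theta is None:
        theta = float(n)            # threshold theta = n, as in Littlestone's analysis
    w = np.full(n, 2.0)             # positive initial weights
    mistakes = 0
    for x, y in examples:           # x is a 0/1 vector, y is the true class
        h = 1 if np.dot(w, x) > theta else 0
        if h != y:                  # conservative: update only on mistakes
            mistakes += 1
            active = x == 1
            w[active] *= alpha ** (y - h)   # promote (x alpha) or demote (/ alpha) active weights
    return w, mistakes

# Toy usage: monotone disjunction x1 OR x2 over n = 20 attributes (illustrative only)
rng = np.random.default_rng(0)
n = 20
X = rng.integers(0, 2, size=(200, n))
y = X[:, 0] | X[:, 1]
w, m = winnow(zip(X, y), n)
print("mistakes:", m)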
2.2 Balanced Winnow
Winnow initializes the weight vector w with positive initial values. If some of the weights are never updated, they retain a positive bias in classification. Balanced Winnow [11] neutralizes this effect by maintaining, in addition to w, another weight vector v ∈ R^n, which is updated conversely to w. Thus, in v, attributes that correlate negatively with the positive examples should have high weights. The algorithm predicts the positive class whenever w · x > v · x. Hence, in Balanced Winnow the weights of those attributes that have not been updated cancel each other out. This algorithm does not require the threshold parameter
θ, because classification is decided by the two weight vectors. Taking the negatively correlating attributes into account extends the class of attribute-efficiently learned concepts from monotone to arbitrary disjunctions.
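The description above can be turned into a small Python sketch; since the paper gives Balanced Winnow only at this level of detail, the mirrored multiplicative updates of the two weight vectors below are our reading of "updated conversely to w" and should be treated as an approximation of the algorithm of [11].

import numpy as np

def balanced_winnow(examples, n, alpha=2.0):
    """Sketch of Balanced Winnow: two weight vectors updated in opposite directions."""
    w = np.full(n, 2.0)   # weights capturing positive correlation with the positive class
    v = np.full(n, 2.0)   # weights capturing negative correlation, updated conversely to w
    for x, y in examples:
        h = 1 if np.dot(w, x) > np.dot(v, x) else 0   # no threshold parameter needed
        if h != y:
            active = x == 1
            w[active] *= alpha ** (y - h)      # promote/demote active weights as in Winnow
            v[active] *= alpha ** (h - y)      # converse update of the second weight vector
    return w, v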
3 Projection Learning
Linear classifiers such as the Perceptron and Winnow have serious fundamental and practical limitations in their expressive power, even though they have been successfully applied in some domains [2,5,8,14]. Valiant [20] has proposed projection learning as an approach to enrich the expressive power of attribute-efficient learners without losing their essential properties. The need for more expressive attribute-efficient learners arises in computational models of cognitive systems [19,21].
3.1 Valiant's Projection Learning Algorithm Y
A projection (or a restriction) ρ of Xn is a subset of Xn [20]. Usually a projection is represented by a simple constraint on the original attributes; e.g., x1 = 1 or x2 = x3. For a function f : Xn → {0, 1} the restriction fρ of f is defined as

  fρ(x) = { f(x), when ρ(x) = 1;
            0,    otherwise.

Hence, fρ(x) = ρ(x)f(x). Let R = {ρ1, ..., ρr} be a set of projections. The projection set R could be, for example, the set of quadratic projections {xi xj | 1 ≤ i < j ≤ n}. A projective disjunction over (C, R) is a function c of the form

  c(x) = ρ1(x)c1(x) ∨ ρ2(x)c2(x) ∨ ... ∨ ρm(x)cm(x),

where ρ1, ..., ρm ∈ R, c1, ..., cm ∈ C, and ρi c = ρi ci for each i, 1 ≤ i ≤ m. Thus, it cannot happen for any x that ρi(x) = c(x) = 1, ci(x) = 0, and cj(x) = ρj(x) = 1 for some j ≠ i.

In order to learn a projective disjunction over a concept class C one can, for each ρ, apply a learning algorithm to those examples that satisfy ρ to obtain a hypothesis hρ. A higher-level hypothesis h learns to distinguish the relevant projection hypotheses from those that are not useful in learning the target concept. If the concept class C can be learned attribute-efficiently, then projective disjunctions over it can also be learned efficiently.

The projection learning algorithm Y [20] (see Table 2) uses a learning algorithm A to learn hypotheses for the projections and (possibly) another algorithm B for learning the projective disjunction. For each example (x, y) the algorithm composes a meta-instance z ∈ {0, 1}^r by querying the classification of the projection hypotheses hρ on x whenever x satisfies the restriction ρ; i.e.,

  z = (ρ1(x)hρ1(x), ρ2(x)hρ2(x), ..., ρr(x)hρr(x)).
Table 2. Valiant's [20] projection learning algorithm Y.
YA,B( S, R )
% A and B are learning algorithms
% input: sample S of m pairs (x, y) and a set of projections R
% maintain hypothesis hρ for each ρ ∈ R and
% h for the projective disjunction over R
{
  for each ρ ∈ R { initializeA( hρ ); }
  initializeB( h );
  for each example (x, y) ∈ S {
    for each ρ ∈ R { zρ ← ρ(x) hρ(x); }
    predict h(z);
    for each ρ ∈ R { if ( ρ(x) = 1 ) updateA( hρ, (x, y) ); }
    for each ρ ∈ R { z′ρ ← ρ(x) hρ(x); }
    updateB( h, (z′, y) );
  }
}
This meta-instance is then classified using the hypothesis h for the projective disjunction. Once the prediction is obtained, the projection hypotheses hρ can be updated according to the learning algorithm A that has been used to learn them. The original example (x, y) is used to update each hypothesis corresponding to a satisfied projection; a hypothesis corresponding to an unsatisfied projection does not even have to be evaluated. After the update, a new meta-instance z′ is constructed for updating the hypothesis h corresponding to the projective disjunction. This time the update uses (z′, y) and is based on the learning algorithm B, which is used for learning h.

Valiant [20] has shown that the algorithm Y, using Winnow as the learning algorithms A and B, has a mistake bound O(sk log n + s log r + s) in learning projective disjunctions over a concept class that can be learned attribute-efficiently, where n and r are the numbers of all attributes and projections, respectively, and k and s are the numbers of relevant attributes and projections, respectively. The logarithmic dependency on the total number of projections makes Y projection-efficient in the same sense as Winnow is attribute-efficient. Moreover, it preserves the basic attribute-efficiency.
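The following Python sketch mirrors the pseudocode of Table 2, with Winnow units at both levels as in the Y1 and Y2 variants discussed later; the class and function names, the parameter settings of the units, and the quadratic projection constructor are our own illustrative choices rather than the authors' implementation.

import numpy as np

class WinnowUnit:
    """One Winnow linear unit, usable both for projection hypotheses and the top level."""
    def __init__(self, n, alpha=2.0):
        self.w = np.full(n, 2.0)
        self.theta = float(n)
        self.alpha = alpha
    def predict(self, x):
        return 1 if np.dot(self.w, x) > self.theta else 0
    def update(self, x, y):
        h = self.predict(x)
        if h != y:                       # multiplicative update on active attributes only
            active = x == 1
            self.w[active] *= self.alpha ** (y - h)

def algorithm_Y(examples, n, projections):
    """Sketch of the two-level algorithm Y; `projections` maps an instance x to 0/1."""
    subs = [WinnowUnit(n) for _ in projections]       # hypotheses h_rho
    top = WinnowUnit(len(projections))                # hypothesis h for the projective disjunction
    for x, y in examples:
        z = np.array([p(x) * s.predict(x) for p, s in zip(projections, subs)])
        _ = top.predict(z)                            # the on-line prediction for (x, y)
        for p, s in zip(projections, subs):
            if p(x) == 1:                             # only satisfied projections are updated
                s.update(x, y)
        z2 = np.array([p(x) * s.predict(x) for p, s in zip(projections, subs)])
        top.update(z2, y)                             # update the top level on the new meta-instance
    return subs, top

def quadratic_projections(n):
    """Quadratic projections x_i = x_j = 1 (illustrative construction)."""
    return [lambda x, i=i, j=j: int(x[i] == 1 and x[j] == 1)
            for i in range(n) for j in range(i + 1, n)]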
3.2 On the Efficiency and Expressive Power of Projection Learning
Projection learning can be quite time-consuming in practice. Consider, for example, quadratic projections. There are O(n^2) variable pairs. For each instance x we have to check, in the worst case, whether it satisfies each of these O(n^2) pairs or not. Furthermore, we have to update all projection hypotheses. It is reasonable to
assume that one update takes O(n) time. Thus, the time used by Y with quadratic projections can be as much as O(n^3) per example.

Also the space requirements of Y can be high. Consider, again, quadratic projections. If n = 100, there will be n(n − 1)/2 ≈ 5,000 variable pairs. Initializing a Winnow classifier for each such projection means that roughly 500,000 weights have to be maintained. On the other hand, adding the quadratic projections as new features to the input space would have required only approximately 5,100 features.

Valiant's [20,21] motivation for learning projective disjunctions is to extend the class of (practical) concepts that can be learned attribute-efficiently. A full concept may be too hard to learn attribute-efficiently by Winnow alone, but in practice it may suffice that restricted parts of different concepts are known. In principle projection learning has a very high expressive power when powerful enough projection functions are allowed. One can show that Y can (trivially) learn in principle any consistent concept:

Proposition 1. Assuming a suitable set of projections, the algorithm Y learns any target concept c that is consistent with the sample S, if such a concept exists.

Proof. Let projections ρ1 and ρ2 be such that ρ1(x) = 1 if and only if c(x) = 1 and ρ2(x) = 1 if and only if c(x) = 0. Then all positive examples will be handled by hρ1 and the negative ones by hρ2. Therefore, eventually hρ1(x) = 1 for any x such that ρ1(x) = 1 and hρ2(x) = 0 for any x such that ρ2(x) = 1. Thus, ρ1(x)hρ1(x) ∨ ρ2(x)hρ2(x) ≡ ρ1(x) ∨ 0 ≡ ρ1(x) ≡ c(x).

Projection learning closely resembles stacked generalization [24]. In the latter, too, a main learner is trained on meta-examples composed from the predictions of subclassifiers. However, in Y the projections guide the input to the subclassifiers. Nevertheless, it is natural to assume that some of the observations concerning stacked generalization might hold for projection learning as well. For example, it is known that stacking will overfit if the meta-examples used to train the main classifier are constructed from the training examples that have been used to train the subclassifiers [17].
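The space estimate above can be checked with a couple of lines of Python (illustrative only):

n = 100
pairs = n * (n - 1) // 2        # 4950 quadratic projections, i.e. roughly 5,000
weights = pairs * n             # one n-dimensional Winnow per projection: about 500,000 weights
extended = n + pairs            # input-space extension instead: a few thousand features in total,
                                # in line with the roughly 5,100 quoted above
print(pairs, weights, extended)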
4 Empirical Evaluation
In this section we compare Winnow, Balanced Winnow, and the projection learning algorithm Y to the standard baseline methods, decision tree learning and Naïve Bayes (NB). As the decision tree learning algorithm we use the C4.5 clone j48 [23]. The algorithms are evaluated on well-known UCI datasets [1]. Numeric attributes were discretized using Fayyad and Irani's [4] greedy method, even though the simple equal-width binning method did not significantly degrade the accuracies of the attribute-efficient algorithms. This robustness held even when the number of bins was only roughly guessed. On the other hand, careless discretization caused j48 and NB to perform significantly worse.
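The paper discretises numeric attributes with Fayyad and Irani's method; the simpler equal-width binning mentioned above can be sketched as follows (the function name and bin count are illustrative, not from the paper).

import numpy as np

def equal_width_binning(values, n_bins=10):
    """Discretise a numeric attribute into n_bins equally wide intervals (labels 0..n_bins-1)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # use only the interior edges; the maximum value falls into the last bin
    return np.digitize(values, edges[1:-1])

print(equal_width_binning([0.1, 0.4, 0.5, 2.3, 9.9], n_bins=4))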
Table 3. Winnow (W) versus Balanced Winnow (BW), j48, Naïve Bayes, and Y2B.

Dataset            W      BW        j48       NB        Y2B
Breast cancer      66.87  68.78     73.87 ◦   73.12 ◦   64.43
Breast Wisconsin   95.17  90.45 •   94.69     97.25     96.61
Credit rating      77.12  80.26 ◦   86.17 ◦   85.72 ◦   83.23 ◦
German credit      69.09  70.53 ◦   71.54 ◦   74.22 ◦   70.09 ◦
Heart statlog      77.19  69.74 •   81.56 ◦   83.00 ◦   81.37 ◦
Hepatitis          76.86  79.67 ◦   78.65     83.71 ◦   75.97
Horse colic        75.79  76.22     85.25 ◦   79.23 ◦   79.91 ◦
Ionosphere         83.10  84.25     92.74 ◦   90.43 ◦   89.57 ◦
Kr-vs-kp           59.67  73.62 ◦   99.40 ◦   87.84 ◦   78.06 ◦
Labor              79.93  85.47 ◦   78.43     89.07 ◦   87.10 ◦
Pima diabetes      70.79  64.31 •   74.53 ◦   75.03 ◦   72.84 ◦
Sick               88.14  95.18 ◦   97.79 ◦   96.37 ◦   93.98 ◦
Sonar              73.29  66.43 •   75.23     76.87 ◦   74.75
Vote               91.28  92.12     96.46 ◦   90.19     94.89

◦, • statistically significant improvement or degradation
Another important issue is choosing good learning parameters for the algorithms. To be fair, we apply the usual settings. For the Winnow-based algorithms, this means the values providing attribute-efficiency when learning disjunctions, that is, learning rate α = 2, threshold θ = |x| = n, and equal starting weights wi = 2 for all i ∈ {1, ..., n}. For j48, the decision tree is post-pruned with subtree raising using a confidence threshold of 0.25. In addition, each leaf that does not represent more than one training example is pruned. Our results are based on ten-fold cross-validation with accuracies averaged over ten runs. Statistical significance was measured with the paired t-test at confidence level 0.05.
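A minimal sketch of the kind of significance test used above, a paired t-test at the 0.05 level over matched cross-validation accuracies; the accuracy arrays are placeholders rather than the paper's results.

import numpy as np
from scipy import stats

# Placeholder per-fold accuracies for two learners on the same folds (illustrative values)
acc_a = np.array([0.77, 0.79, 0.75, 0.78, 0.80, 0.76, 0.77, 0.79, 0.78, 0.76])
acc_b = np.array([0.81, 0.83, 0.79, 0.82, 0.84, 0.80, 0.81, 0.82, 0.83, 0.80])

t_stat, p_value = stats.ttest_rel(acc_a, acc_b)   # paired t-test over matched folds
print(t_stat, p_value, p_value < 0.05)            # significant at the 0.05 level?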
4.1 Feasibility of Projection Learning
The question that we consider first is the feasibility of projection learning in the form of algorithm Y. We are particularly interested in the case where some simple set of projections is chosen without domain knowledge. We want to know whether the on-line algorithm Y can compete in prediction accuracy with the well-known learning algorithms in this setting. Here Y2B denotes algorithm Y using the set of quadratic projections as R and the algorithms Winnow and Balanced Winnow as the on-line learners B and A, respectively. The results in Table 3 clearly show that even though the basic attribute-efficient methods are not very accurate on these problems, the more advanced algorithm Y is able to find significantly more accurate hypotheses. It still mostly loses to the baseline algorithms j48 and NB, but attains an overall performance much closer to their level than the linear classifiers alone do.
Table 4. Effect of projection sets and algorithm combinations on Y.

Dataset            W      Y1        Y1B       Y2        Y2B
Breast cancer      66.87  66.38     68.67 ◦   63.88     64.43
Breast Wisconsin   95.17  95.74     95.63     96.14     96.61
Credit rating      77.12  80.23 ◦   82.46 ◦   80.46 ◦   83.23 ◦
German credit      69.09  68.67     70.18     67.73 •   70.09 ◦
Heart statlog      77.19  80.04 ◦   79.70 ◦   81.89 ◦   81.37 ◦
Hepatitis          76.86  71.11 •   77.49     71.62 •   75.97
Horse colic        75.79  78.85 ◦   79.53 ◦   79.39 ◦   79.91 ◦
Ionosphere         83.10  87.30 ◦   88.47 ◦   87.76 ◦   89.57 ◦
Kr-vs-kp           59.67  61.53 ◦   75.80 ◦   62.72 ◦   78.06 ◦
Labor              79.93  76.17 •   83.30 ◦   80.20     87.10 ◦
Pima diabetes      70.79  70.36     70.29     72.45 ◦   72.84 ◦
Sick               88.14  84.35 •   93.83 ◦   84.49 •   93.98 ◦
Sonar              73.29  74.42     74.42     74.93     74.75
Vote               91.28  92.71 ◦   94.32     93.37 ◦   94.89

◦, • statistically significant improvement or degradation
4.2 The Significance of the Choice of Algorithms and Projection Sets
There are at least two open problems related to projection learning: how to choose a good projection set and which algorithms should be used as the learners A and B. To evaluate the significance of the projection sets, we applied algorithm Y also with single-attribute projections (this variant is called Y1). Algorithms Y1 and Y2 use Winnow as the learning algorithm at both levels of projection learning, as suggested by Valiant [20], while Y1B and Y2B have Balanced Winnow in the role of the algorithm A.

From Table 4 it can be seen that single-attribute projections also increase the prediction accuracy. Using quadratic projections is only a somewhat better choice, even though their expressive power is much higher. However, using Balanced Winnow as algorithm A results in significantly better classification accuracies than using basic Winnow. The reason for this is yet to be analyzed. On the other hand, using Balanced Winnow as the algorithm B gives much worse results. This might be due to its ability to express negative correlation of the attributes with the class value. The mistakes of the projection hypotheses can be thought of as noise in the meta-instance z, but in general the prediction of a projection hypothesis does not correlate negatively with the positive class value. Thus, the update rule of Balanced Winnow's two weight vectors causes oscillation.
4.3 Practical Significance of Attribute-Efficiency
To test how important it is in our test domains that the attribute-efficient Winnow learns the projection hypotheses, we put the basic Perceptron in the role of the learning algorithm A in Y. By Y2P we denote this algorithm combination when quadratic projections are used.
Table 5. The effect of using Perceptron instead of Winnow in learning projection hypotheses.

Dataset            W      BW        j48       NB        Y2B       Y2P
Breast cancer      66.87  68.78     73.87 ◦   73.12 ◦   64.43     65.64
Breast Wisconsin   95.17  90.45 •   94.69     97.25     96.61     96.53
Credit rating      77.12  80.26 ◦   86.17 ◦   85.72 ◦   83.23 ◦   82.39
German credit      69.09  70.53 ◦   71.54 ◦   74.22 ◦   70.09 ◦   70.20
Heart statlog      77.19  69.74 •   81.56 ◦   83.00 ◦   81.37 ◦   81.41
Hepatitis          76.86  79.67 ◦   78.65     83.71 ◦   75.97     76.59
Horse colic        75.79  76.22     85.25 ◦   79.23 ◦   79.91 ◦   78.74
Ionosphere         83.10  84.25     92.74 ◦   90.43 ◦   89.57 ◦   89.92
Kr-vs-kp           59.67  73.62 ◦   99.40 ◦   87.84 ◦   78.06 ◦   89.38
Labor              79.93  85.47 ◦   78.43     89.07 ◦   87.10 ◦   87.60
Pima diabetes      70.79  64.31 •   74.53 ◦   75.03 ◦   72.84 ◦   72.72
Sick               88.14  95.18 ◦   97.79 ◦   96.37 ◦   93.98 ◦   94.22
Sonar              73.29  66.43 •   75.23     76.87 ◦   74.75     75.71
Vote               91.28  92.12     96.46 ◦   90.19     94.89     94.23

◦, • statistically significant improvement or degradation
Table 5 gives the results of this experiment. In addition to the results of real interest, those of Y2B and Y2P, we also list the ones for Winnow, Balanced Winnow, j48, and Naïve Bayes for comparison. From these results it is immediately obvious that attribute-efficiency of the projection hypothesis learners is not extremely important on our test domains. More or less similar results are obtained using either linear learner to obtain the projection hypotheses.

Comparing Y2B and Y2P with each other, both record two statistically significant wins and losses. Y2P is better on the domains Kr-vs-kp and Sonar, but loses on Horse colic and Credit rating. Out of these differences only the one on Kr-vs-kp is clearly outstanding. We may conclude that for our test domains Balanced Winnow and Perceptron are equally good projection hypothesis learners. However, let us point out that Perceptron, like Balanced Winnow, is not well suited for learning the meta-level concept.

The fact that both Balanced Winnow and Perceptron are clearly better suited as projection hypothesis learners than the positive Winnow makes us speculate that, when using simple projection sets, it is the expressive power of the subclassifier that is important. Both Balanced Winnow and Perceptron can have a negative weight for an attribute. Thus, their possible classifiers are a superset of those of positive Winnow. We have not yet tried to verify whether this holds more generally.
4.4 Projections versus Input Space Extension
Quadratic projections provide the projection hypotheses with examples in which the respective pair of attributes is true. Adding the pairwise conjunctions of the original attributes to the input space gives a linear learner the ability to use the same information. The input space extension, in this case, gives much better time
Table 6. Winnow using feature space extended by quadratic variables versus the projection methods.

Dataset            W      BW        W+C2      BW+C2     Y1B       Y2B
Breast cancer      66.87  68.78     66.59     67.19     68.67 ◦   64.43
Breast Wisconsin   95.17  90.45 •   95.57     93.42 •   95.63     96.61
Credit rating      77.12  80.26 ◦   79.81 ◦   80.62 ◦   82.46 ◦   83.23 ◦
German credit      69.09  70.53 ◦   69.78     69.79     70.18     70.09 ◦
Heart statlog      77.19  69.74 •   79.52 ◦   74.44     79.70 ◦   81.37 ◦
Hepatitis          76.86  79.67 ◦   76.51     79.58 ◦   77.49     75.97
Horse colic        75.79  76.22     77.98     78.55     79.53 ◦   79.91 ◦
Ionosphere         83.10  84.25     87.15 ◦   87.27 ◦   88.47 ◦   89.57 ◦
Kr-vs-kp           79.93  85.47 ◦   74.13 •   88.03 ◦   83.30 ◦   87.10 ◦
Pima diabetes      70.79  64.31 •   69.88     68.88 •   70.29     72.84 ◦
Sick               88.14  95.18 ◦   89.86     94.47 ◦   93.83 ◦   93.98 ◦
Sonar              73.29  66.43 •   74.05     71.24     74.42     74.75
Vote               91.28  92.12     92.84     92.70     94.32     94.89

◦, • statistically significant improvement or degradation
and space complexity than using projections. However, using higher-order conjunctions would make the dependence of the computational cost on n surpass that of projection learning [20].

Table 6 relates Winnow on the original attributes to having its feature space extended by pairwise conjunctions (+C2) and to algorithm Y using single-attribute and quadratic projections. It is evident that both methods enhance the basic algorithm's accuracy, but projection learning, with only a few exceptions, achieves better results. Based on this, both methods seem feasible alternatives to replace Winnow on some task where it is currently used, if one is willing to pay the additional costs in time and space complexity.

Input space extension does not seem a particularly good choice for decision tree learning and Naïve Bayes. In the problems that we evaluated, the addition of pairwise conjunctions to the input space generally caused j48 and NB to find worse hypotheses. In particular, with an extended input space j48 lost four times to the normal j48 and won only twice, all of these differences being statistically significant. For Naïve Bayes the ratio was 5:1. The reason for this degradation is the redundancy of the new attributes. Clearly, j48 is already able to represent conjunctions within its hypothesis class. For Naïve Bayes the new variables are not independent of the original variables, thus violating the basic independence assumption of Naïve Bayes.
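A minimal sketch of the pairwise-conjunction input space extension (+C2), assuming 0/1 integer attribute vectors; the function name is illustrative.

import numpy as np
from itertools import combinations

def extend_with_conjunctions(X):
    """Append all pairwise conjunctions x_i AND x_j to a 0/1 feature matrix."""
    n = X.shape[1]
    conj = [X[:, i] & X[:, j] for i, j in combinations(range(n), 2)]
    return np.column_stack([X] + conj)

# For n original attributes this adds n(n-1)/2 columns, e.g. 100 original features become 5050.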
4.5 The Running Time of Projection Learning
Let us finally reflect upon the observed time consumption of our learning algorithms. We know that in the worst case projection learning can be extremely inefficient. How does it fare on these “real-world” data sets? The two-level projection learning cannot, of course, be as efficient as running the linear learners
Table 7. Time consumption of our test algorithms (in seconds).

Dataset            W      BW     j48    NB     Y2B
Breast cancer      0.08   0.07   0.12   0.02    2.01
Breast Wisconsin   0.26   0.25   0.23   0.14    1.22
Credit rating      0.28   0.29   0.44   0.15    2.13
German credit      0.44   0.44   0.92   0.26    4.82
Heart statlog      0.07   0.07   0.15   0.06    0.23
Hepatitis          0.03   0.03   0.10   0.03    0.38
Horse colic        0.17   0.17   0.30   0.07    3.58
Ionosphere         0.70   0.68   0.77   0.53   20.33
Kr-vs-kp           1.17   1.18   3.66   0.88   29.23
Labor              0.01   0.01   0.02   0.01    0.49
Pima diabetes      0.22   0.22   0.34   0.15    0.42
Sick               1.29   1.26   4.37   1.03    3.77
Sonar              0.30   0.30   0.55   0.30    3.16
Vote               0.08   0.08   0.20   0.04    0.41
alone. Top-down induction of decision trees and Naïve Bayes are also known to be fast learning approaches. Table 7 lists the average execution times of the learning algorithms in one fold of ten-fold cross-validation. Moreover, the averages have been recorded from five repetitions of the ten times repeated ten-fold cross-validation. The experiments have been run on a laptop computer with a slow 350 MHz processor.

The two on-line linear learners cannot be separated from each other in execution efficiency. Naïve Bayes is even more efficient, and decision tree learning by j48 requires somewhat more time than Winnow and Balanced Winnow do. The average time consumption of Y2B is mostly an order of magnitude more than that of NB. The results for Y2P are very similar to those of Y2B and are, thus, omitted here. In sum, by using projection learning we can raise the classification accuracy of the straightforward linear learners much closer to that of decision trees and Naïve Bayes, but the cost is greatly increased running time and space complexity.
5 Conclusion
Straightforward input space extension by, e.g., conjunctions of attribute pairs can enhance the performance of linear classifiers such as Winnow. However, even better results can be obtained using the provably attribute-efficient classifier combination approach of projection learning. Using Balanced Winnow or Perceptron to learn the projection hypotheses is useful. However, in light of our (limited) experiments, neither should be used as the learning algorithm for the hypothesis of the projective disjunction. The expense of using projection learning is increased time and space requirements. The increase can at times be so large that it may even become prohibitive
in realistic problems involving tens of thousands of attributes. In our experiments the overall performance of projection learning was observed to approach that of decision tree learning and Naïve Bayes. However, projection learning is significantly less efficient than either of these learning algorithms.

We set out to test projection learning on "real-world" data. However, as our experiment with using Perceptron to learn the projection hypotheses demonstrates, attribute-efficiency is not required here. It would be interesting to see the same experiments repeated for domains where the attribute-efficiency of the learner makes a difference.
References
1. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html
2. Blum, A.: Empirical support for Winnow and weighted-majority based algorithms: results on a calendar scheduling domain. Proc. Twelfth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA (1995) 64-72
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Second Edition. John Wiley and Sons, New York, NY (2000)
4. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. Proc. Thirteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA (1993) 1022-1027
5. Golding, A.R., Roth, D.: A Winnow-based approach to context-sensitive spelling correction. Mach. Learn. 34 (1999) 107-130
6. Haussler, D.: Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artif. Intell. 36 (1988) 177-222
7. Haykin, S.: Neural Networks: A Comprehensive Foundation. Second Edition. Prentice Hall, Upper Saddle River, NJ (1999)
8. Khardon, R., Roth, D., Valiant, L.G.: Relational learning for NLP using linear threshold elements. Proc. Sixteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA (1999) 911-919
9. Kivinen, J., Warmuth, M.K., Auer, P.: The Perceptron algorithm versus Winnow: linear versus logarithmic mistake bounds when few inputs are relevant. Artif. Intell. 97 (1997) 325-343
10. Littlestone, N.: Learning quickly when irrelevant attributes abound: a new linear threshold algorithm. Mach. Learn. 2 (1988) 285-318
11. Littlestone, N.: Mistake bounds and logarithmic linear-threshold learning algorithms. Ph.D. Thesis, Report UCSC-CRL-89-11, University of California, Santa Cruz (1989)
12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
13. Rosenblatt, F.: The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Rev. 65 (1958) 386-407
14. Roth, D., Yang, M.-H., Ahuja, N.: Learning to recognize three-dimensional objects. Neural Comput. 14 (2002) 1071-1103
15. Servedio, R.A.: On PAC learning using Winnow, Perceptron, and a Perceptron-like algorithm. Proc. Twelfth Annual Conference on Computational Learning Theory (1999) 296-307
16. Servedio, R.A.: Computational sample complexity and attribute-efficient learning. J. Comput. Syst. Sci. 60 (2000) 161-178
17. Ting, K.M., Witten, I.H.: Stacked generalizations: when does it work? Proc. Fifteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA (1997) 866-873
18. Uehara, R., Tsuchida, K., Wegener, I.: Identification of partial disjunction, parity, and threshold functions. Theor. Comput. Sci. 230 (2000) 131-147
19. Valiant, L.G.: Circuits of the Mind. Oxford University Press, Oxford (1994)
20. Valiant, L.G.: Projection learning. Mach. Learn. 37 (1999) 115-130
21. Valiant, L.G.: A neuroidal architecture for cognitive computation. J. ACM 47 (2000) 854-882
22. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York, NY (1998)
23. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA (2000)
24. Wolpert, D.H.: Stacked generalization. Neural Networks 5 (1992) 241-259
Improved Dataset Characterisation for Meta-learning
Yonghong Peng 1, Peter A. Flach 1, Carlos Soares 2, and Pavel Brazdil 2
1 Department of Computer Science, University of Bristol, UK {yh.peng,peter.flach}@bristol.ac.uk
2 LIACC/Fac. of Economics, University of Porto, Portugal {csoares,pbrazdil}@liacc.up.pt
Abstract. This paper presents new measures, based on an induced decision tree, to characterise datasets for meta-learning in order to select appropriate learning algorithms. The main idea is to capture the characteristics of a dataset from the structural shape and size of the decision tree induced from it. In total, 15 measures are proposed to describe the structure of a decision tree. Their effectiveness is illustrated through extensive experiments, by comparing the results to those obtained with existing data characterisation techniques, including the data characterisation tool (DCT), the most widely used technique in meta-learning, and landmarking, the most recently developed method.
1 Introduction

Extensive research has been performed to develop appropriate machine learning techniques for different data mining problems, and this has led to a proliferation of learning algorithms. However, previous work has shown that no learner is generally better than another: if a learner performs better than another learner in some learning situations, then the first learner must perform worse than the second in other situations [18]. In other words, no single learning algorithm can perform well and uniformly outperform the other algorithms over all data mining tasks. This has been confirmed by the 'no free lunch' theorems [29,30]. The major reasons are that a learning algorithm performs differently on different datasets and that different learning algorithms are implemented with different search heuristics, which results in a variety of 'inductive biases' [15]. In real-world applications, the user needs to select an appropriate learning algorithm according to the mining task to be performed [17,18,1,7,20,12]. An inappropriate choice of algorithm will result in slow convergence, or may even lead to a sub-optimal local minimum.

Meta-learning has been proposed to deal with the issue of algorithm selection [5,8]. One of the aims of meta-learning is to assist the user in determining the most suitable learning algorithm(s) for the problem at hand. The task of meta-learning is to find functions that map datasets to predicted data mining performance (e.g., predictive accuracy, execution time, etc.). To this end meta-learning uses a set of attributes, called meta-attributes, to represent the characteristics of data mining tasks, and searches for correlations between these attributes and the performance of learning algorithms [5,10,12]. Instead of executing all learning algorithms to find the optimal one, meta-learning is performed on the meta-data characterising the data mining tasks.
The effectiveness of meta-learning is largely dependent on the description of tasks (i.e., the meta-attributes). Several techniques have been developed, such as the data characterisation techniques (DCT) [13] used to describe the problem to be analysed, including simple measures (e.g., the number of attributes and the number of classes), statistical measures (e.g., the mean and variance of numerical attributes), and information theory-based measures (e.g., the entropy of classes and attributes). There is, however, still a need to improve the effectiveness of meta-learning by developing more predictive meta-attributes and selecting the most informative ones [9].

The aim of this work is to investigate new methods to characterise a dataset for meta-learning. Previously, Bensusan et al. proposed to capture information from induced decision trees for characterising the learning complexity [3,32]. In [3], they listed 10 measures based on the decision tree, such as the ratio of the number of nodes to the number of attributes and the ratio of the number of nodes to the number of training instances; however, they did not evaluate the performance of these measures. In our recent work, we have re-analysed the characteristics of decision trees and proposed 15 new measures, which focus on characterising the structural properties of a decision tree, e.g., the number of nodes and leaves, statistical measures regarding the distribution of nodes at each level and along each branch, the width and depth of the tree, and the distribution of attributes in the induced decision tree. These measures have been applied to rank 10 learning algorithms. The experimental results show an improvement in ranking performance compared to DCT, which is the most commonly used technique, and landmarking, a recently introduced technique [19,2].

This paper is organised as follows. In Section 2, some related work is introduced, including meta-learning methods for learning algorithm selection and data characterisation. The proposed method for characterising datasets is described in detail in Section 3. Experiments illustrating the effectiveness of the proposed method are described in Section 4. Section 5 concludes the paper and points out interesting possibilities for future work.
2 Related Work

There are two basic tasks in meta-learning: the description of the learning tasks (datasets), and the correlation between the task description and the optimal learning algorithm. The first task is to characterise datasets with meta-attributes, which constitute the meta-data for meta-learning, whilst the second is learning at the meta-level, which develops the meta-knowledge for selecting an appropriate classification algorithm.

For algorithm selection, several meta-learning strategies have been proposed [6,25,26]. In general, there are three options concerning the target. One is to select the best learning algorithm, i.e., the algorithm that is expected to produce the best model for the task. The second is to select a group of learning algorithms including not only the best algorithm but also the algorithms that are not significantly worse than the best one. The third possibility is to rank the learning algorithms according to their predicted performance. The ranking then assists the user in making the final selection of a learning algorithm. This ranking-based meta-learning is the main approach in the Esprit Project MetaL (www.metal-kdd.org).
Ranking the preference order of the algorithms is performed by estimating their performance in mining the associated dataset. In data mining, performance can be measured not only in terms of accuracy but also in terms of time or the understandability of the model. In this paper, we assess performance with the Adjusted Ratio of Ratios (ARR) measure, which combines accuracy and time. ARR measures the advantage of one learning algorithm over another in terms of accuracy and execution time. The user can adjust the importance of accuracy relative to time by a tunable parameter. The 'zoomed ranking' method proposed by Soares [26], based on ARR, is used in this paper for algorithm selection, taking accuracy and execution time into account simultaneously.

The first attempt to characterise datasets in order to predict the performance of classification algorithms was made by Rendell et al. [23]. So far, two main strategies have been developed for characterising a dataset in order to suggest which algorithm is more appropriate for it. The first describes the properties of datasets using statistical and information-theoretic measures. In the second, a dataset is characterised using the performance (e.g., accuracy) of a set of simplified learners; this approach is called landmarking [19,2].

The description of a dataset in terms of its information/statistical properties appeared for the first time within the framework of the STATLOG project [14]. The authors used a set of 15 characteristics, spanning from simple ones, like the number of attributes or the number of examples, to more complex ones, such as the first canonical correlation between the attributes and the class. This set of characteristics was later applied in various studies on the problem of algorithm selection [5,28,27]. They distinguish three categories of dataset characteristics, namely simple, statistical and information theory-based measures. Statistical characteristics are mainly appropriate for continuous attributes, while information theory-based measures are more appropriate for discrete attributes. Linder and Studer [13] provide an extensive list of information and statistical measures of a dataset. They provide a tool for the automatic computation of these characteristics, called the Data Characterisation Tool (DCT). Sohn [27] also uses the STATLOG set as a starting point. After careful evaluation of their properties in a statistical framework, she noticed that some of the characteristics are highly correlated, and she omitted the redundant ones in her study. Furthermore, she introduces new features that are transformations or combinations of the existing measures, like ratios or second powers [27].

An alternative approach to characterising datasets, called landmarking, was proposed in [19,2]. The intuitive idea behind landmarking is that the performance of simple learners, called landmarkers, can be used to predict the performance of the given candidate algorithms. That is, given landmarkers A and B, if we know that landmarker A outperforms landmarker B on the present task, then we could select a learning algorithm that has the same inductive bias as landmarker A to perform this data mining task. It has to be ensured that the chosen landmarkers have quite distinct learning biases. As a closely related approach, Bensusan et al. have also proposed to use information computed from induced decision trees to characterise learning tasks [3,32].
They listed 10 measures based on the unpruned tree but did not evaluate their performance.
3 The Proposed Measures for Describing Data Characteristics

The task of characterising a dataset for meta-learning is to capture information about the learning complexity of the given dataset. This information should enable the prediction of the performance of learning algorithms. It should also be computable within a relatively short time compared to the whole learning process. In this section we introduce new measures that characterise a dataset by measuring a variety of properties of a decision tree induced from it. The main idea is to measure the model complexity through the structure and size of the decision tree, and to use these measures to predict the complexity of other learning algorithms. We employed the standard decision tree learner c5.0tree. There are several reasons for selecting decision trees. The major reason is that the decision tree is one of the most widely used machine learning algorithms, and decision tree induction is deterministic, i.e., the same training set produces the same tree structure.

Definition. A standard tree induced with c5.0 (or possibly ID3 or c4.5) consists of a number of branches, one root, a number of nodes and a number of leaves. A branch is a chain of nodes from the root to a leaf, and each node involves one attribute. The occurrence of an attribute in a tree provides information about the importance of the associated attribute. The tree width is defined by the number of lengthways partitions formed by parallel nodes or leaves from the leftmost to the rightmost node or leaf. The tree level is defined as the breadth-wise partition of the tree at each successive branching, and the tree height is defined by the number of tree levels, as shown in Fig. 1. The length of a branch is defined as the number of nodes in the branch minus one.
Fig. 1. Structure of Decision Tree.
We propose, based on the above notation, to describe a decision tree in terms of the following three aspects: a) the outer profile of the tree; b) statistics for the intra-structure, including tree levels and branches; c) statistics for the tree elements, including nodes and attributes. To describe the outer profile of the tree, the width of the tree (treewidth) and the height of the tree (treeheight) are measured according to the number of nodes in each level and the number of levels, as illustrated in Fig. 1. Also, the number of nodes (NoNode) and the number of leaves (NoLeave) are used to describe the overall properties of a tree. In order to describe the intra-structure of the tree, the number of nodes at each level and the length of each branch are counted. Let us represent them with two vectors
denoted as NoinL = [v1, v2, ..., vl] and LofB = [L1, L2, ..., Lb] respectively, where vi is the number of nodes at the ith level, Lj is the length of the jth branch, and l and b are the number of levels (treeheight) and the number of branches, respectively. Based on NoinL and LofB, the following measures can be generated. The maximum and minimum number of nodes at one level:
  maxLevel = max(v1, v2, ..., vl),   minLevel = min(v1, v2, ..., vl)                                      (1)

(As minLevel is always equal to 1, it is not used.) The mean and standard deviation of the number of nodes per level:

  meanLevel = ( Σ_{i=1..l} vi ) / l,   devLevel = sqrt( Σ_{i=1..l} (vi − meanLevel)^2 / (l − 1) )          (2)
The lengths of the longest and shortest branches:

  LongBranch = max(L1, L2, ..., Lb),   ShortBranch = min(L1, L2, ..., Lb)                                  (3)

The mean and standard deviation of the branch lengths:

  meanBranch = ( Σ_{j=1..b} Lj ) / b,   devBranch = sqrt( Σ_{j=1..b} (Lj − meanBranch)^2 / (b − 1) )       (4)
Besides the distribution of nodes, the frequency with which attributes are used in a tree provides useful information regarding the importance of each attribute. The number of times each attribute is used in the tree is represented by a vector NoAtt = [nAtt1, nAtt2, ..., nAttm], where nAttk is the number of times the kth attribute is used and m is the total number of attributes in the tree. Again, the following measures are used. The maximum and minimum occurrence of attributes:
  maxAtt = max(nAtt1, nAtt2, ..., nAttm),   minAtt = min(nAtt1, nAtt2, ..., nAttm)                         (5)

The mean and standard deviation of the number of occurrences of attributes:

  meanAtt = ( Σ_{i=1..m} nAtti ) / m,   devAtt = sqrt( Σ_{i=1..m} (nAtti − meanAtt)^2 / (m − 1) )          (6)
As a result, a total of 15 meta-attributes (i.e., treewidth, treeheight, NoNode, NoLeave, maxLevel, meanLevel, devLevel, LongBranch, ShortBranch, meanBranch, devBranch, maxAtt, minAtt, meanAtt, devAtt) has been defined.
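To make the computation of these measures concrete, the following rough Python sketch derives them from a fitted tree. Note that the paper induces its trees with c5.0tree, whereas this sketch uses an sklearn CART tree purely as a stand-in, and the interpretation of treewidth (the maximum number of nodes on one level) and of the per-level node counts is our reading of the definitions above rather than the authors' exact procedure.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_meta_features(clf):
    """Compute the 15 tree-shape meta-attributes from a fitted sklearn decision tree."""
    t = clf.tree_
    left, right, feat = t.children_left, t.children_right, t.feature
    is_leaf = left == -1

    # depth of every node, obtained by walking down from the root
    depth = np.zeros(t.node_count, dtype=int)
    stack = [0]
    while stack:
        i = stack.pop()
        for c in (left[i], right[i]):
            if c != -1:
                depth[c] = depth[i] + 1
                stack.append(c)

    branch_lengths = depth[is_leaf]                 # number of edges from root to each leaf
    nodes_per_level = np.bincount(depth[~is_leaf])  # internal nodes on each level
    att_counts = np.bincount(feat[~is_leaf])        # occurrences of each attribute used in the tree
    att_counts = att_counts[att_counts > 0]

    return {
        "NoNode": int((~is_leaf).sum()), "NoLeave": int(is_leaf.sum()),
        "treeheight": int(depth.max()), "treewidth": int(np.bincount(depth).max()),
        "maxLevel": int(nodes_per_level.max()),
        "meanLevel": float(nodes_per_level.mean()), "devLevel": float(nodes_per_level.std(ddof=1)),
        "LongBranch": int(branch_lengths.max()), "ShortBranch": int(branch_lengths.min()),
        "meanBranch": float(branch_lengths.mean()), "devBranch": float(branch_lengths.std(ddof=1)),
        "maxAtt": int(att_counts.max()), "minAtt": int(att_counts.min()),
        "meanAtt": float(att_counts.mean()), "devAtt": float(att_counts.std(ddof=1)),
    }

# Illustrative usage on a toy dataset (the exact values depend on the induced tree)
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree_meta_features(clf))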
4 Experimental Evaluation

In this section we experimentally evaluate the proposed data characteristics. In Section 4.1 we describe our experimental set-up, in Section 4.2 we compare our proposed meta-features with DCT and landmarking, and in Section 4.3 we study the effect of meta-feature selection.

4.1 Experimental Set-up

The meta-learning technique employed in this paper is ranking based on an instance-based learning algorithm. Given a data mining problem (a dataset to analyse), the k-Nearest Neighbour (kNN) algorithm is used to select from the benchmark datasets a subset of k datasets whose characteristics are similar to those of the present dataset according to some distance function. Next, a ranking of the candidate algorithms is generated from their performance on the selected datasets, based on the adjusted ratio of ratios (ARR), a multicriteria evaluation measure that combines accuracy and time. ARR has a parameter that enables the user to adjust the relative importance of accuracy and time according to his particular data mining objective. More details can be found in [26].

To evaluate a recommended ranking, we calculate its similarity to an ideal ranking obtained for the same dataset. The ideal ranking is obtained by estimating the performance of the candidate algorithms using 10-fold cross-validation. Similarity is measured using Spearman's rank correlation coefficient [29]:

  rs = 1 − 6 D^2 / ( n (n^2 − 1) ),   D^2 = Σ_{i=1..n} Di^2 = Σ_{i=1..n} (ri − r̄i)^2                       (7)
where ri and r̄i are the predicted rank and the actual (ideal) rank of algorithm i, respectively. The larger rs is, the better the ranking, with rs = 1 if the predicted ranking is identical to the ideal ranking.
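A small sketch of this evaluation step, computing rs as in Eq. (7) from a predicted and an ideal ranking; the example rankings are made-up placeholders.

import numpy as np

def spearman_rs(predicted_ranks, ideal_ranks):
    """Spearman's rank correlation coefficient as in Eq. (7)."""
    r = np.asarray(predicted_ranks, dtype=float)
    r_bar = np.asarray(ideal_ranks, dtype=float)
    n = len(r)
    d_squared = np.sum((r - r_bar) ** 2)
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

# Placeholder example with 10 algorithms; identical rankings would give rs = 1
ideal = list(range(1, 11))
predicted = [1, 2, 3, 5, 4, 6, 7, 9, 8, 10]
print(spearman_rs(predicted, ideal))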
In our experiments, a total of 10 learning algorithms, including c5.0tree, c5.0boost and c5.0rules [21], Linear Tree (ltree), linear discriminant (lindiscr), the MLC++ Naive Bayes classifier (mlcnb) and instance-based learner (mlcib1) [11], Clementine Multilayer Perceptron (clemMLP), Clementine Radial Basis Function (clemRBFN), and the rule learner ripper, have been evaluated on 47 datasets, mainly from the UCI repository [4]. The error rate and time were estimated using 10-fold cross-validation. The leave-one-out method is used to evaluate the ranking performance, i.e., the 10 learning algorithms are ranked for each dataset on the basis of the other 46 datasets.

4.2 Comparison with DCT and Landmarking

The effect of the newly proposed meta-attributes (called DecT) has been evaluated on the ranking of these 10 learning algorithms. In this section, we compare the ranking performance obtained with DecT (15 meta-attributes) to that obtained with DCT (25 meta-attributes)
and Landmarking (5 meta-attributes). The 25 DCT and 5 Landmarking meta-attributes used are listed in the Appendix. The first experiment ranks the given 10 learning algorithms on the 47 datasets, with the parameters k=10 (meaning the 10 most similar datasets are first selected from the 46 datasets by the kNN algorithm) and Kt=100 (meaning that we are willing to trade 1% in accuracy for a 100 times speed-up or slowdown [26]). The ranking performance is measured with rs (Eq. (7)). The ranking performance obtained using DCT, landmarking and DecT is shown in Fig. 2. The overall average performance for DCT, Landmarking and DecT is 0.613875, 0.634945 and 0.676028, respectively, which demonstrates the improvement of DecT in ranking algorithms compared to DCT and Landmarking.
Fig. 2. Ranking performance for 47 datasets using DCT, landmarking and DecT.
In order to look in more detail at the improvement of DecT over DCT and Landmarking, we performed the ranking experiment using different values of k and Kt. As stated in [26], the parameter Kt represents the relative importance of accuracy and execution time in selecting the learning algorithm (i.e., a higher Kt means that accuracy is more important and time is less important in algorithm selection). Fig. 3 shows that for Kt = {10, 100, 1000}, using DecT improves the performance compared with the use of the DCT and landmarking meta-attributes.
Fig. 3. The ranking performance for different values of Kt.
Fig. 4 shows the ranking performance for different zooming degrees (different values of k), i.e., selecting different numbers of similar datasets on which the ranking is based. From these results, we observe that 1) for all values of k, DecT produces better ranking performance than DCT and landmarking; and 2) the best performance is obtained by selecting 10-25 datasets among the 47 datasets.
Fig. 4. The ranking performance for different values of k.
4.3 Performing Meta-feature Selection

The k-nearest neighbour (kNN) method, employed to select k datasets for ranking the performance of the learning algorithms for a given dataset, is known to be sensitive to irrelevant and redundant features. Using a smaller number of features could help to improve the performance of kNN, as well as reduce the time used in meta-learning. In our experiments, we manually reduced the number of DCT meta-features from 25 to 15 and 8, and compared the results to those obtained with the same number of DecT meta-features. The reduction of the DCT meta-features is performed by removing features thought to be redundant and features having many non-appl (missing or error) values; the reduction of the DecT meta-features is performed by removing redundant features that are highly correlated.
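A minimal sketch of the k-nearest-neighbour dataset selection step referred to above, assuming a matrix of meta-features with one row per benchmark dataset; the rescaling and the Euclidean distance are our assumptions, since the paper only speaks of "some distance function".

import numpy as np

def select_similar_datasets(meta_features, query, k=10):
    """Return the indices of the k benchmark datasets most similar to the query dataset."""
    M = np.asarray(meta_features, dtype=float)
    q = np.asarray(query, dtype=float)
    # rescale each meta-feature to [0, 1] so that no single feature dominates the distance
    lo, hi = M.min(axis=0), M.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Mn, qn = (M - lo) / span, (q - lo) / span
    dist = np.linalg.norm(Mn - qn, axis=1)
    return np.argsort(dist)[:k]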
Fig. 5. Results for reduced meta-features.
The ranking performances for these reduced meta-feature sets are shown in Fig. 5, in which DCT(8), DCT(15) and DecT(8) represent the reduced sets of 8 and 15 DCT meta-features and 8 DecT meta-features, while DCT(25) and DecT(15) represent the full DCT and DecT meta-feature sets, respectively. From Fig. 5, we can observe that feature selection did not significantly influence the performance of either DCT or DecT, and that the latter outperforms the former across the board.
5 Conclusions and Future Work

The meta-learning strategy developed under the framework of MetaL aims at assisting the user in selecting an appropriate learning algorithm for a particular data mining task. Describing the characteristics of a dataset in order to estimate the performance of learning algorithms is the key to developing a successful meta-learning system. In this paper, we proposed new measures to characterise a dataset. The basic idea is to process the dataset with a standard tree induction algorithm, and then to capture information about the dataset's characteristics from the induced decision tree. The decision tree is generated using the standard c5.0tree algorithm. A total of 15 measures, which constitute the meta-attributes for meta-learning, have been proposed for describing different kinds of properties of a decision tree.

The proposed measures have been applied to rank learning algorithms based on accuracy and time. Extensive experimental results have illustrated the improvement in ranking performance obtained with the proposed 15 meta-attributes, compared to the 25 DCT and 5 landmarking meta-features. In order to avoid the effect of redundant or irrelevant features on the performance of kNN learning, we also compared the ranking performance based on selections of 15 DCT meta-features and the full DecT set, and of 8 DCT and 8 DecT meta-features. The results suggest that feature selection does not significantly change the performance of either DCT or DecT.

In other experiments, we observed that combining DCT with DecT, or landmarking with DCT and DecT, did not produce better performance than DecT alone. This is an issue that we are interested in investigating further. The major reason may lie in the use of k-nearest neighbour learning in the zooming-based ranking strategy. One possibility is to test the performance of the combination of DCT, landmarking and DecT in other meta-learning strategies, such as best algorithm selection. Another interesting subject is to look at how the shape and size of the decision tree change with the examples used in tree induction, as it would be useful to capture the data characteristics from a sampled dataset. This is especially important for large datasets.

Acknowledgements. This work is supported by the MetaL project (ESPRIT Reactive LTR 26.357).
References
1. C. E. Brodley. Recursive automatic bias selection for classifier construction. Machine Learning, 20:63-94, 1995.
2. H. Bensusan and C. Giraud-Carrier. Discovering Task Neighbourhoods through Landmark Learning Performances. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 325-330, 2000.
3. H. Bensusan, C. Giraud-Carrier, and C. Kennedy. Higher-order Approach to Meta-learning. The ECML'2000 workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, 109-117, 2000.
4. C. Blake, E. Keogh, C. Merz. www.ics.uci.edu/~mlearn/mlrepository.html. University of California, Irvine, Dept. of Information and Computer Sciences, 1998.
5. P. Brazdil, J. Gama, and R. Henery. Characterizing the Applicability of Classification Algorithms using Meta Level Learning. In Proceedings of the European Conference on Machine Learning, ECML-94, 83-102, 1994.
6. T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924, 1998.
7. E. Gordon and M. desJardin. Evaluation and Selection of Biases. Machine Learning, 20(1-2):5-22, 1995.
8. A. Kalousis and M. Hilario. Model Selection via Meta-learning: a Comparative Study. In Proceedings of the 12th International IEEE Conference on Tools with AI, Vancouver. IEEE Press, 2000.
9. A. Kalousis and M. Hilario. Feature Selection for Meta-Learning. In Proceedings of the 5th Pacific Asia Conference on Knowledge Discovery and Data Mining, 2001.
10. C. Koepf, C. Taylor, and J. Keller. Meta-analysis: Data characterisation for classification and regression on a meta-level. In Antony Unwin, Adalbert Wilhelm, and Ulrike Hofmann, editors, Proceedings of the International Symposium on Data Mining and Statistics, Lyon, France, 2000.
11. R. Kohavi. Scaling up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 202-207, 1996.
12. M. Lagoudakis and M. Littman. Algorithm selection using reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), 511-518, Stanford, CA, 2000.
13. C. Linder and R. Studer. AST: Support for Algorithm Selection with a CBR Approach. Proceedings of the 16th International Conference on Machine Learning, Workshop on Recent Advances in Meta-Learning and Future Work, 1999.
14. D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural Network and Statistical Classification. Ellis Horwood Series in Artificial Intelligence, 1994.
15. T. Mitchell. Machine Learning. McGraw Hill, 1997.
16. S. Salzberg. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery, 1:3, 317-327, 1997.
17. C. Schaffer. Selecting a Classification Method by Cross Validation. Machine Learning, 13, 135-143, 1993.
18. C. Schaffer. Cross-validation, stacking and bi-level stacking: Meta-methods for classification learning. In P. Cheeseman and R. W. Oldford, editors, Selecting Models from Data: Artificial Intelligence and Statistics IV, 51-59, 1994.
Improved Dataset Characterisation for Meta-learning
151
19. B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Tell me who can learn you and I can th tell you who you are: Landmarking various Learning Algorithms. Proceedings of the 17 Int. Conf. on Machine Learning. 743-750, 2000. 20. F. Provost, and B. Buchanan. Inductive policy: The pragmatics of bias selection. Machine Learning, 20:35-61, 1995. 21. J. R. Quinlan. C4.5: Programs for Machine Learning, Morgan Kaufman, 1993. 22. J. R. Quinlan. C5.0: An Informal Tutorial, RuleQuest, www.rulequest.com, 1998. 23. L. Rendell, R. Seshu, and D. Tcheng. Layered Concept Learning and Dynamically Varith able Bias Management. 10 Inter. Join Conference on AI. 308-314, 1987. th 24. C. Schaffer. A Conservation Law for Generalization Performance. Proceedings of the 11 International Conference on Machine Learning, 1994. 25. C. Soares. Ranking Classification Algorithms on Past Performance. Master’s Thesis, Faculty of Economics, University of Porto, 2000. 26. C. Soares. Zoomed Ranking: Selection of Classification Algorithms based on Relevant th Performance Information. Proceedings of the 4 European Conference on Principles of Data Mining and Knowledge Discovery, 126-135, 2000. 27. S. Y. Sohn. Meta Analysis of Classification Algorithms for Pattern Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21, 1137-1144, 1999. 28. L. Todorvoski, and S. Dzeroski. Experiments in Meta-Level Learning with ILP. Proceedth ings of the 3 European Conference on Principles on Data Mining and Knowledge Discovery, 98-106, 1999. 29. A. Webster. Applied Statistics for Business and Economics, Richard D Irwin Inc, 779-784, 1992. 30. D. Wolpert. The lack of a Priori Distinctions between Learning Algorithms. Neural Computation, 8, 1341-1390, 1996. 31. D. Wolpert. The Existence of a Priori Distinctions between Learning Algorithms. Neural Computation, 8, 1391-1420, 1996. 32. H. Bensusan. God doesn’t always shave with Occam’s Razor - learning when and how to prune. In Proceedigs of the 10th European Conference on Machine Learning, 119--
124, Berlin, Germany, 1998.
Appendix

DCT Meta-attributes:
1. Nr_attributes: Number of attributes.
2. Nr_sym_attributes: Number of symbolic attributes.
3. Nr_num_attributes: Number of numerical attributes.
4. Nr_examples: Number of records/examples.
5. Nr_classes: Number of classes.
6. Default_accuracy: The default accuracy.
7. MissingValues_Total: Total number of missing values.
8. Lines_with_MissingValues_Total: Number of examples having missing values.
9. MeanSkew: Mean skewness of the numerical attributes.
10. MeanKurtosis: Mean kurtosis of the numerical attributes.
11. NumAttrsWithOutliers: Number of attributes for which the ratio between the alpha-trimmed standard deviation and the standard deviation is larger than 0.7.
12. MStatistic: Box's M-statistic to test for equality of the covariance matrices of the numerical attributes.
13. MStatDF: Degrees of freedom of the M-statistic.
14. MStatChiSq: Value of the Chi-squared distribution.
15. SDRatio: A transformation of the M-statistic which assesses the information in the covariance structure of the classes.
16. Fract: Relative proportion of the total discrimination power of the first discriminant function.
17. Cancor1: Canonical correlation of the best linear combination of attributes to distinguish between classes.
18. WilksLambda: Discrimination power between the classes.
19. BartlettStatistic: Bartlett's V-statistic to test the significance of the discriminant functions.
20. ClassEntropy: Entropy of the classes.
21. EntropyAttributes: Entropy of the symbolic attributes.
22. MutualInformation: Mutual information between the symbolic attributes and the classes.
23. JointEntropy: Average joint entropy of the symbolic attributes and the classes.
24. Equivalent_nr_of_attrs: Ratio between the class entropy and the average mutual information, providing information about the number of attributes necessary for classification.
25. NoiseSignalRatio: Ratio between noise and signal, indicating the amount of irrelevant information for classification.

Landmarking Meta-features:
1. Naive Bayes
2. Linear discriminant
3. Best node of decision tree
4. Worst node of decision tree
5. Average node of decision tree
Racing Committees for Large Datasets Eibe Frank, Geoffrey Holmes, Richard Kirkby, and Mark Hall Department of Computer Science University of Waikato Hamilton, New Zealand {eibe, geoff, rkirkby, mhall}@cs.waikato.ac.nz
Abstract. This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficiently do so. The basic idea is to split incoming data into chunks and build a committee based on classifiers built from these individual chunks. Our method extends earlier work by introducing a method for adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm’s running time and memory consumption. It also makes it possible to efficiently “race” committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. They also show that pruning is indeed crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results demonstrate that pruning can also improve accuracy.
1 Introduction
The ability to process large datasets becomes more and more important as institutions automatically collect data for the purpose of data mining. This paper addresses the problem of generating classification models from large datasets, where the task is to predict the value of a nominal class given a set of attributes. Many popular learning algorithms for classification models are not directly applicable to large datasets because they are too slow and/or require too much memory. Apart from specialized algorithms for particular classification models, several generic remedies for the above problems have been proposed in the literature. They can be broadly classified into subsampling strategies [8,13] and learning using committee machines [11,3,4,12,14]. Of these two strategies, the latter one appears to be particularly promising because (a) it does not require any data to be discarded when building the classifier, and (b) it allows for incremental learning because the model can be updated when a new chunk of data arrives. The basic idea of committee-based learning for large datasets is to build a committee by splitting the data into chunks, learning a model from each chunk,
and combining the predictions of the different models to form an overall prediction. If the maximum chunk size is kept small, polynomial time algorithms can be applied to induce the individual models in a reasonable amount of time. Working with chunks also makes the process memory-efficient because a chunk can be discarded once it has been processed by the learning scheme. In this paper we focus on using a boosting algorithm for building the committee machines. Boosting has the advantage that it can combine “weak” classifiers into a committee that is significantly more powerful than each individual classifier [5]. This is particularly advantageous in our application because the individual classifiers are built from relatively small samples of data and are therefore necessarily “weak.” The idea of using boosting for large datasets is not new, and appears to have been proposed first by Breiman [3]. The main contribution of this paper is a method for adaptively and efficiently pruning the incrementally built committee of classifiers, which makes the process computationally feasible for large datasets. It also makes it possible to choose an appropriate chunk size among several candidates based on “racing” the candidate solutions. This is important because the correct chunk size cannot be determined a priori. Apart from making the method practical, pruning also has the (desirable) side-effect that the resulting predictions can become more accurate. This paper is structured as follows. In Section 2 we present our method for constructing committees on large datasets. We start with a naive method that does not perform any pruning and then move on to a more practical method that incorporates a pruning strategy. Finally, we discuss how the resulting committees are raced. Section 3 contains experimental results on a collection of benchmark datasets, demonstrating the importance of choosing an appropriate chunk size and using a pruning strategy. Section 4 discusses related work on combining classifiers built from chunks of data. Section 5 summarizes the contributions made in this paper.
2 The Algorithm
We first describe the basic algorithm—called “incremental boosting”—that generates the committee from incoming chunks of data. Then we explain how incremental boosting can be modified to incorporate pruning. Finally we present the racing strategy for pruned committees built from different chunk sizes.

2.1 Incremental Boosting
Standard boosting algorithms implement the following basic strategy. In the first step a prediction model is built from the training data using the underlying “weak” learning algorithm and added to the (initially empty) committee. In the second step the weight associated with each training instance is modified. This two-step process is repeated for a given number of iterations. The resulting
committee is used for prediction by combining the predictions of the individual models. Boosting works well because the individual models complement each other. After a model has been added to the committee, the instances’ weights are changed so that instances that the committee finds “difficult” to classify correctly get a high weight, and those that are “easy” to classify get a low weight. The next model that is built will then focus on the difficult parts of the instance space instead of the easy ones (where difficulty is measured according to the committee built so far). This strategy generates a diverse committee of models, and this diversity appears to be the main reason why boosting works so well in practice.

It turns out that boosting can also be viewed as a statistical estimation procedure called “additive logistic regression.” LogitBoost [6] is a boosting procedure that is a direct implementation of an additive logistic regression method for maximizing the multinomial likelihood of the data given the committee. In contrast to AdaBoost and related algorithms it has the advantage that it is directly applicable to multiclass problems. It jointly optimizes the class probability estimates for the different classes and appears to be more accurate than other multi-class boosting methods [6]. For this reason we chose it as the underlying boosting mechanism for our incremental boosting strategy. LogitBoost assumes that the underlying weak learner is a regression algorithm that attempts to minimize the mean-squared error. This can, for example, be a regression tree learner. For the experimental results reported in this paper we used a learning algorithm for “regression stumps.” Regression stumps are 1-level regression trees. In our implementation these stumps have ternary splits where one branch handles missing attribute values.

The only difference between standard boosting and incremental boosting is that the latter uses a different dataset in each iteration of the boosting algorithm: the incoming training data is split into mutually exclusive “chunks” of the same size and a model is generated for each of these chunks. When a new chunk of data becomes available the existing committee’s predictions for this chunk are used to weight the data and a new model is learned on the weighted chunk and added to the committee. In this fashion a committee of boosted models is incrementally constructed as training data is processed.

Figure 1 depicts the basic algorithm for incremental boosting. This algorithm assumes that new models can be added to the committee until the data is exhausted. This may not be feasible because of memory constraints. In the next section we will discuss a pruning method for reducing the committee’s size. Another drawback of the algorithm is its time complexity, which is quadratic in the number of chunks (and therefore quadratic in the number of training instances). In each iteration i, i − 1 base models are invoked in order to make predictions for chunk Ci (so that its instances can be weighted). Consequently this naive algorithm can only be applied if the chunk size is large relative to the size of the full dataset.
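As an illustration of the ternary regression stumps mentioned above, the following minimal sketch shows one split attribute with separate predictions for the two value ranges and for missing values. The field names are our own invention, not taken from the authors' implementation:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class RegressionStump:
    """A 1-level regression tree with a ternary split (illustrative only)."""
    attribute: int         # index of the split attribute
    threshold: float       # numeric split point
    left_value: float      # prediction when x[attribute] <= threshold
    right_value: float     # prediction when x[attribute] >  threshold
    missing_value: float   # prediction when x[attribute] is missing

    def predict(self, x: Sequence[Optional[float]]) -> float:
        v = x[self.attribute]
        if v is None:
            return self.missing_value
        return self.left_value if v <= self.threshold else self.right_value
```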
START with an empty committee K0
REPEAT
    FOR next data chunk Ci DO BEGIN
        IF (i > 1) THEN weight chunk Ci according to the predictions of K0..i-1
        learn model Mi for chunk Ci and add to committee K0..i-1
    END
UNTIL no more chunks

Fig. 1. Incremental boosting.
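Read as code, the loop of Figure 1 might look like the following sketch. The helper names weight_chunk and fit_weak_model stand in for the LogitBoost reweighting step and the regression-stump learner described above; they are assumptions, not the authors' API:

```python
def incremental_boosting(chunks, fit_weak_model, weight_chunk):
    """chunks yields (X, y) pairs; returns the list of committee members."""
    committee = []                           # K0: the initially empty committee
    for i, (X, y) in enumerate(chunks, start=1):
        if i > 1:                            # weight chunk Ci using K0..i-1
            weights = weight_chunk(committee, X, y)
        else:
            weights = [1.0] * len(y)         # first chunk: uniform weights
        committee.append(fit_weak_model(X, y, weights))
    return committee
```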
2.2 Incremental Boosting with Pruning
To make the algorithm practical it is necessary to reduce the number of committee members that are generated. Preferably this should be done adaptively so that accuracy on future data is not affected negatively. The first design decision concerns which pruning operations to apply. The second problem is how to decide whether pruning should occur. Boosting is a sequential process where new models are built based on data weighted according to the predictions of previous models. Hence it may be detrimental to prune models somewhere in the middle of a committee because subsequent models have been generated by taking the predictions of previous models into account. Consequently the only model that we consider for pruning is the last model in the sequence. This makes pruning a straightforward (and computationally very efficient) procedure: the existing committee is compared to the new committee that has an additional member based on the latest chunk of data. If the former is judged more accurate, the last model is discarded and the boosting process continues with the next chunk of data. Hence the pruning process makes it possible to skip chunks of data that do not contribute positively to the committee’s accuracy. As the experimental results presented in Section 3 show, this is especially useful when small chunks are used to build the committee. The experimental results also show that it is not advisable to stop building the committee when a “bad” chunk of data is encountered because later chunks of data may prove useful and lead to models that improve the committee’s accuracy. The second aspect to pruning is the choice of evaluation criterion. The pruned model needs to be compared to the unpruned one. Pruning should occur if it does not negatively affect the committee’s generalization performance. Fortunately our target application domains share a common property: they exhibit an abundance of data. This means we can be generous and reserve some of the data for pruning. We call this data “validation data.” This data is held completely separate from the data used for training the models. In our implementation the first N instances encountered (where N is the size of the validation dataset) are skipped by the boosting process. Consequently the first chunk of data that generates a potential committee member starts with instance N + 1.
START with an empty committee K0 AND validation data V
REPEAT
    FOR next data chunk Ci DO BEGIN
        IF (i > 1) THEN weight chunk Ci according to the predictions of K0..i-1
        learn model Mi for chunk Ci
        IF (loglikelihood for K0..i-1 + Mi on V > loglikelihood for K0..i-1 on V)
            THEN add Mi to K0..i-1
    END
UNTIL no more chunks

Fig. 2. Incremental boosting with pruning.
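The pruning test itself only needs the loglikelihood of the validation data under the two candidate committees. A minimal sketch, assuming a hypothetical helper predict_proba(committee, x) that returns a mapping from class labels to probability estimates:

```python
import math

def loglikelihood(committee, validation, predict_proba, eps=1e-10):
    """Multinomial loglikelihood of the validation set under the committee."""
    return sum(math.log(max(predict_proba(committee, x).get(y, 0.0), eps))
               for x, y in validation)

def add_if_useful(committee, new_model, validation, predict_proba):
    """Keep the model built from the latest chunk only if it helps on V."""
    before = loglikelihood(committee, validation, predict_proba)
    after = loglikelihood(committee + [new_model], validation, predict_proba)
    if after > before:
        committee.append(new_model)
    return committee
```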
Accuracy on the validation data is the obvious performance measure. However, we found empirically that this is not a good pruning criterion. Preliminary results showed that it results in useful models being skipped because they do not change the accuracy immediately although they do improve accuracy in conjunction with models that are built later in the process. Logistic regression attempts to maximize the likelihood of the data given the model. An alternative candidate for measuring performance is therefore the loglikelihood on the validation data. This measures the accuracy of the class probability estimates that are generated by the committee. It turns out that using the loglikelihood avoids overpruning because it is more sensitive to whether a potential committee member manages to extract useful additional information. The resulting pruning algorithm based on the loglikelihood is depicted in Figure 2.

Pruning reduces the size of the committee according to the properties of the data. Ideally no further models are added to the committee when the information in the data is exhausted. If this is the case there exists an upper bound on the number of models that are generated and the time complexity becomes linear in the number of training instances, allowing very large datasets to be processed effectively. Of course, apart from affecting running time, pruning also reduces the amount of memory that is needed to store the committee.

2.3 Racing Committees
Experimental results show that the performance of the committee varies, sometimes dramatically, with the chunk size. The chunk size should be large enough for each individual committee member to become a reliable predictor. However, as the chunk size increases, returns for each individual committee member diminish. At some point it becomes more productive to increase the diversity of the committee by starting with a new chunk. The best chunk size depends on the properties of the particular dataset and the weak learner used in the boosting process.
Given these observations it appears impossible to determine an appropriate chunk size a priori. Consequently the only sensible strategy is to decide on a range of chunk sizes and to run the different committees corresponding to these different chunk sizes in parallel—i.e. to “race” them off against each other. Then we can keep track of which committee performs best and use the best-performing committee for prediction. Typically the best-performing chunk size changes as more data becomes available. The validation data, which is also used for pruning, can be used to compare the performance of the committees. However, in contrast to pruning, where the loglikelihood is employed to measure performance, here it is more appropriate to use percent correct because we want to use the committee that maximizes percent correct for future data.¹

The question remains as to how many committees to run in parallel and which set of chunk sizes to use. Ultimately this depends on the computing resources available. If the number of committees is constant then the time and space complexity of racing them is the same as the corresponding complexities for its “worst-case” member. Consequently, assuming that pruning works and after a certain number of iterations no further models are added to the committee, the overall time-complexity is linear in the number of instances, and the space complexity is constant.

In our experiments we used the following five chunk sizes: 500, 1,000, 2,000, 4,000, and 8,000. We kept the maximum chunk size relatively small because decision stumps are particularly weak classifiers and the returns on adding more data diminish quickly. Doubling the chunk size from one candidate to the next has the advantage that whenever the committee corresponding to the largest chunk size may have changed this is also true for all the smaller ones, and a comparison on the validation data at this point is fair because all the committees have “seen” the same amount of training data.
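A compact sketch of this racing loop follows; update and accuracy are assumed helpers (the former applies the pruned incremental boosting step of Figure 2 to a completed chunk, the latter computes percent correct on the validation data):

```python
CHUNK_SIZES = [500, 1000, 2000, 4000, 8000]

def race_committees(data_stream, validation, update, accuracy):
    committees = {s: [] for s in CHUNK_SIZES}
    buffers = {s: [] for s in CHUNK_SIZES}
    for example in data_stream:
        for s in CHUNK_SIZES:
            buffers[s].append(example)
            if len(buffers[s]) == s:             # a complete chunk has arrived
                update(committees[s], buffers[s], validation)
                buffers[s] = []                  # chunks can be discarded after use
    # predict with the committee that scores best on the validation data
    return max(committees.values(), key=lambda c: accuracy(c, validation))
```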
3 Experimental Results
To evaluate the performance of racing unpruned and pruned committees we performed experiments on six datasets ranging in size from approximately 30,000 to roughly 500,000 instances. The properties of these datasets are shown in Table 1.² We obtained them from the UCI repositories [1,7]. The kdd cup ’99 data is a reduced version of the full dataset (reduced so that incremental boosting without pruning could be applied to this data). The “Train” column shows the amount of data that is used for training the committee (excluding the validation data). We attempted to set a sufficient amount of data aside for validation and testing to obtain accurate performance estimates. In our experiments the validation set size was set to half the size of the test set.
¹ Of course, if the application domain requires accurate probability estimates, it is more appropriate to use the loglikelihood for choosing the best committee.
² The first three datasets are rather small but we chose to include them in our comparison because of the lack of publicly available large datasets.
Table 1. Datasets and their characteristics

Dataset         Train    Validation  Test    Numeric  Nominal  Classes
anonymous        30211     2500       5000     0       293       2
adult            33842     5000      10000     6         8       2
shuttle          43000     5000      10000     9         0       7
census income   224285    25000      50000     8        33       2
kdd cup '99     475000    25000      50000    34         7      22
covertype       506012    25000      50000    10        44       7
Before we split the data into training, validation, and test data, we randomized it to obtain independent and identically distributed samples.

The first row of Figure 3 shows the results for the anonymous data. The leftmost graph shows percent incorrect on the test set for the unpruned committees as an increasing amount of training data is processed. Points mark the graph corresponding to the committee that performs best on the validation data (i.e. the committee that would be used for prediction at that point under the racing scheme). The middle graph shows the same for the pruned committees, and the rightmost graph shows the committee sizes for the pruned committees.³ The worst-performing chunk size is 8,000 because there is insufficient data to build a large enough committee. The final ranking of committees is the same with and without pruning (and the resulting error rates are comparable). Pruning appears to smooth fluctuations in error on the test data. Pruning also substantially reduces the committee size for small chunk sizes. After 30,000 training instances, 60 models are built without pruning for chunk size 500; with pruning there are only 16 (and it appears that the number of models has reached a plateau). No pruning is done for chunk sizes 4,000 and 8,000. It appears as if pruning should have occurred for chunk size 8,000. However, the loglikelihood on the validation data does increase after models are added and consequently no pruning occurs.

The second row of Figure 3 shows the results for the adult data. Substantial pruning occurs for chunk sizes 500 and 1,000. In both cases it smoothes fluctuations in the performance and results in improved final error. It is interesting to see that the final size of the pruned committee for chunk size 1,000 is larger than the size of the committee for chunk size 500. Pruning appears to behave correctly because the final error is lower for the former.

The shuttle data in the third row of Figure 3 is different from the previous two datasets in that very high accuracy scores can be achieved. The results show that pruning produces substantially smaller committees for chunk sizes 500, 1,000, and 2,000. However, for chunk sizes 500 and 1,000 it also results in fractionally lower accuracy. Choosing the best-performing committee on the validation data under the racing scheme results in approximately the same final error rate both with and without pruning.
³ The size of the unpruned committees is not shown because it increases linearly with the amount of training data.
Table 2. Percent incorrect for standard LogitBoost compared to our racing scheme

Dataset        LogitBoost  #Iterations  Racing w/o pruning  Racing w/ pruning
anonymous        27.00%        60            28.24%              27.56%
adult            13.51%        67            14.58%              14.72%
shuttle           0.01%        86             0.08%               0.07%
census-income     4.43%       448             4.90%               4.93%
The next dataset we consider is census-income. The first row of Figure 4 shows the results. The most striking aspect is the effect of pruning with small chunk sizes. In this domain the fluctuation in error is extreme without pruning. With pruning this erratic behavior disappears and error rates decrease dramatically. Pruning also results in a marginally lower final error rate for the largest chunk sizes. The results also show that the size of the pruned committees starts to level out after a certain amount of training data has been seen. Note that even though chunk size 8,000 results in the most accurate pruned committee on the test data, chunk size 4,000 is chosen for prediction based on superior performance on the validation set. The kdd cup ’99 domain is similar to the shuttle domain in that very high accuracy can be achieved. As in the shuttle domain overpruning occurs for small chunk sizes (however, note that the degradation in performance corresponds to fractions of a percent). Although this is difficult to see on the graphs, pruning marginally improves performance for the largest chunk size (8,000). Under the racing scheme the final performance is approximately the same both with and without pruning. Because the dataset is so large, pruning results in substantial savings in both memory and runtime. The behavior on the largest dataset (third row of Figure 4) is similar to that seen on the census-income dataset. The only difference is that the pruned version chooses chunk size 4,000 on census-income, whereas 8,000 is chosen for covertype. Pruning substantially increases accuracy for chunk sizes 500, 1,000, and 2,000 eliminating the erratic behavior of the unpruned committees. The best-performing committee (both pruned and unpruned) is based on a chunk size of 8,000. Pruning does not improve the accuracy of the final predictor under the racing scheme. However, it does lead to substantial savings in both memory and runtime. The final committee for chunk size 8,000 is less than half the size of the unpruned version. Table 2 compares the final error under the racing scheme to standard LogitBoost (i.e. where the weak learner is applied to the full training set in each boosting iteration). We set the number of iterations for standard LogitBoost to be the same as the number of committee members in the largest unpruned committee (i.e. the one built from chunk size 500). The table does not include results for the two largest datasets because processing them with standard LogitBoost was beyond our computing resources. As might be expected, standard LogitBoost is slightly more accurate on all four test sets. However, the results are very close.
4 Related Work
Breiman [3] appears to have been the first to apply boosting (or “arcing” [2]) to the problem of processing large datasets by using a different subsample in each iteration of the boosting algorithm. He shows that this produces more accurate predictions than using bagging in the same fashion. He also shows that incremental boosting (if used in conjunction with an appropriate subsample size) produces classifiers that are about as accurate as the ones generated by standard boosting applied to the full dataset. However, his work does not address the problem of how to decide which committee members to discard. Fan et al. [4] propose an incremental version of AdaBoost that works in a similar fashion. Their method retains a fixed-size “window” of weak classifiers that contains the k most recently built classifiers. This makes the method applicable to large datasets in terms of memory and time requirements. However, it remains unclear how an appropriate value for k can be determined. Street and Kim [14] propose a variant of bagging for incremental learning based on data chunks that maintains a fixed-size committee. In each iteration it attempts to identify a committee member that should be replaced by the model built from the most recent chunk of data. Because the algorithm is based on bagging (i.e. all data points receive equal weight and a simple majority vote is performed to make a prediction), the algorithm has limited potential to boost the performance of the underlying weak classifiers. Oza and Russell [10] propose incremental versions of bagging and boosting that differ from our work because they require the underlying weak learner to be incremental. The method is of limited use for large datasets if the underlying incremental learning algorithm does not scale linearly in the number of training instances. Unfortunately, the time complexity of most incremental learning algorithms is worse than linear. Prodromidis and Stolfo [11,12] present pruning methods for ensemble classifiers built from different (unweighted) subsets of a dataset. These methods require an unpruned ensemble to be built first before pruning can be applied. The pruned ensemble cannot be updated incrementally as new data arrives. Similarly, Margineantu and Dietterich [9] investigate pruning methods for ensembles built by the standard AdaBoost algorithm (i.e. where a weak classifier is built from the entire dataset in each boosting iteration). Again, their method is applied once an unpruned ensemble has been generated.
5 Conclusions
This paper has presented a method for efficiently processing large datasets using standard learning techniques by wrapping them into an incremental boosting algorithm. The main contribution of this paper is a pruning method for making the procedure efficient in terms of memory and runtime requirements. The accuracy of the resulting committee depends on an appropriately chosen chunk
size. Experimental results obtained by racing candidate solutions based on different chunk sizes demonstrate the effectiveness of our method on six real-world datasets. Although our technique can be used in an online setting, it cannot be applied in domains with concept drift (i.e. where the target concept changes over time) because it assumes that all the incoming data is independent and identically distributed. Acknowledgments. We would like to thank Bernhard Pfahringer for his valuable comments.
References

1. C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases, 1998. [www.ics.uci.edu/~mlearn/MLRepository.html].
2. Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
3. Leo Breiman. Pasting small votes for classification in large databases and on-line. Machine Learning, pages 85–103, 1999.
4. Wei Fan, Salvatore J. Stolfo, and Junxin Zhang. The application of AdaBoost for distributed, scalable and on-line learning. In 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 362–366, 1999.
5. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th Int. Conf. on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.
6. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–374, 2000.
7. S. Hettich and S. D. Bay. The UCI KDD archive, 1999. [http://kdd.ics.uci.edu].
8. G. H. John and P. Langley. Static versus dynamic sampling for data mining. In 2nd ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining, pages 367–370, 1996.
9. D. D. Margineantu and T. G. Dietterich. Pruning adaptive boosting. In Proc. of the 14th Int. Conf. on Machine Learning, pages 211–218, 1997.
10. Nikunj Oza and Stuart Russell. Experimental comparisons of online and batch versions of bagging and boosting. In 7th ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining, pages 359–364, 2001.
11. A. L. Prodromidis, S. J. Stolfo, and P. K. Chan. Pruning classifiers in a distributed meta-learning system. In Proc. of 1st National Conference on New Information Technologies, pages 151–160, 1998.
12. Andreas L. Prodromidis and Salvatore J. Stolfo. Cost complexity-based pruning of ensemble classifiers. Knowledge and Information Systems, 3(4):449–469, 2001.
13. A. Srinivasan. A study of two sampling methods for analysing large datasets with ILP. Data Mining and Knowledge Discovery, 3(1):95–123, 1999.
14. W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In 7th ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining, pages 377–382, 2001.
From Ensemble Methods to Comprehensible Models

C. Ferri, J. Hernández-Orallo, and M.J. Ramírez-Quintana

DSIC, UPV, Camino de Vera s/n, 46020 Valencia, Spain.
{cferri,jorallo,mramirez}@dsic.upv.es
Abstract. Ensemble methods improve accuracy by combining the predictions of a set of different hypotheses. However, there are two important shortcomings associated with ensemble methods. Huge amounts of memory are required to store a set of multiple hypotheses and, more importantly, comprehensibility of a single hypothesis is lost. In this work, we devise a new method to extract one single solution from a hypothesis ensemble without using extra data, based on two main ideas: the selected solution must be similar, semantically, to the combined solution, and this similarity is evaluated through the use of a random dataset. We have implemented the method using shared ensembles, because it allows for an exponential number of potential base hypotheses. We include several experiments showing that the new method selects a single hypothesis with an accuracy which is reasonably close to the combined hypothesis. Keywords: Ensemble Methods, Decision Trees, Comprehensibility in Machine Learning, Classifier Similarity, Randomisation.
1 Introduction
Comprehensibility has been the major advantage that has been advocated for supporting some machine learning methods such as decision tree learning, rule learners or ILP. One major feature of discovery is that it gives insight from the models, properties and theories that can be obtained. A model that is not comprehensible may be useful to obtain good predictions, but it cannot provide knowledge about how predictions are made. With the goal of improving model accuracy, there has been an increasing interest in constructing ensemble methods that combine several hypotheses [4]. The effectiveness of combination is further increased the more diverse and numerous the set of hypotheses is [10]. Decision tree learning (either propositional or relational) is especially benefited by ensemble methods [18,19]. Well-known techniques for generating and combining hypotheses are boosting [9,18], bagging [1,18], randomisation [5], stacking [22] and windowing [17].
This work has been partially supported by CICYT under grant TIC2001-2705-C0301, Generalitat Valenciana under grant GV00-092-14 and Acción Integrada Hispano-Alemana HA2001-0059.
Although ensemble methods significantly increase accuracy, they have some drawbacks, mainly the loss of comprehensibility of the model and the large amount of memory required to store the hypotheses [13]. Recent proposals have shown that memory requirements can be considerably reduced (in [16], a method called miniboosting reduces the ensemble to just three hypotheses, with 40% less of the improvement that would be obtained by a 10-trial AdaBoost). Nonetheless, the comprehensibility of the resulting combined hypothesis is not improved. A combined hypothesis is usually a voting of many hypotheses and it is usually treated as a black box, giving no insight at all. However, one major goal of the methods used in discovery science is comprehensibility.

The question is how to reduce the combination of m hypotheses to one single hypothesis without losing too much accuracy with respect to the combined hypothesis. Instead of using classical methods for selecting one hypothesis, such as the hypothesis with the lowest expected error, or the one with the smallest size (Occam's razor), we will select the single hypothesis that is most similar to the combined hypothesis. This single hypothesis will be called an archetype or representative of the ensemble and can be seen as an 'explanation' of the ensemble. To do this, the main idea is to consider the combination as an oracle that would allow us to measure the similarity of each single hypothesis with respect to this oracle.

More precisely, for a hypothesis or solution h and an unlabelled example e, let us define h(e) as the class or label assigned to e by h. Consider an ensemble of solutions E = h1, h2, ..., hm and a method of combination χ. By Σχ,E we denote the combined solution formed by using the method χ on E. Thus, Σχ,E(e) is the class assigned to e by the combined solution. Now, we can use Σχ,E as an oracle, which, generally, gives better results than any single hypothesis [4]. The question is to select a single hypothesis hi from E such that hi is the most similar (semantically) to the oracle Σχ,E.

This rationale is easy to understand following the representation used in a statistical justification for the construction of good ensembles presented by Dietterich in [4]. A learning algorithm is employed to find different hypotheses {h1, h2, ..., hm} in the hypothesis space or language H. By constructing an ensemble out of all these classifiers, the algorithm can "average" their votes and reduce the risk of choosing a wrong classifier. Figure 1 depicts this situation. The outer curve denotes the hypothesis space H. The inner curve denotes the set of hypotheses that give a reasonably good accuracy on the training data and hence could be generated by the algorithm. The point labelled by F is the true hypothesis.

Fig. 1. Representation of an ensemble of hypotheses
If an ensemble hc is constructed by combining the accurate hypotheses, hc is a good approximation to F. However, hc is an ensemble, which means that it needs to store {h1, h2, ..., h5} and it is not comprehensible. For this reason, we are interested in selecting the single hypothesis from {h1, h2, ..., hm} that would be closest to the combination hc. Following the previous rationale, this single hypothesis would be close to F. In the situation described in Figure 1, we would select h4 as the archetype or representative of the ensemble.

A final question, also pointed out by [4], is that a statistical problem arises when the amount of training data available is too small compared to the size of the hypothesis space H. The selection of a good archetype would not be possible if a sufficient amount of data is not available for comparing the hypotheses. Reserving part of the training data is generally not a good option because it would yield a smaller training dataset and the ensemble would have a lower quality. This problem has a peculiar but simple solution: the generation of random unlabelled datasets.

Although the technique presented in this work is applicable to many kinds of ensemble methods, we will illustrate it with shared ensembles, because the number of hypotheses, in this kind of structure, grows exponentially with respect to the number of iterations. Therefore, there is a much bigger population from which the representative can be extracted.

The paper is organised as follows. First, in section 2, we discuss the use of a similarity measure and we adapt several similarity metrics we will use. Section 3 explains how artificial datasets can be employed to estimate the similarity between every classifier and their combination. Section 4 presents the notion of shared ensemble, its advantages for our goals and how it can be adapted for the selection of the most similar hypothesis with respect to the combination. A thorough experimental evaluation is included in Section 5. Finally, the last section presents the conclusions and proposes some future work.
2 Hypothesis Similarity Metrics
As we have stated, our proposal is to select the single hypothesis which is most similar to the combined one. Consequently, we have to introduce different measures of hypothesis similarity. These metrics and an additional dataset will allow the estimation of a value of similarity between two hypotheses. In the following, we will restrict our discussion to classification problems. Several measures of hypothesis similarity (or diversity) have been considered in the literature with the aim of obtaining an ensemble with high diversity [12]. However, some of these are defined for a set of hypotheses and others for a pair of hypotheses. We are interested in these “pairwise diversity measures”, since we want to compare a single hypothesis with an oracle. However, not all of these measures can be applied here. First, the approach presented by [12] requires the correct class to be known. The additional dataset should be labelled, which means that part of the training set should be reserved for the estimation of similarities. Secondly, some other metrics are only applicable to two classes. As
a result, in what follows, we describe the pairwise metrics that can be estimated by using an unlabelled dataset and that can be used for problems with more than two classes. Given two classifiers h_a and h_b, and an unlabelled dataset with n examples with C classes, we can construct a C × C contingency or confusion matrix M_{i,j} that contains the number of examples e such that h_a(e) = i and h_b(e) = j. With this matrix, we can define the following similarity metrics:

– θ measure: It is just based on the idea of determining the probability of both classifiers agreeing:

  \theta = \frac{1}{n} \sum_{i=1}^{C} M_{i,i}

  Its value is between 0 and 1. An inverse measure, known as discrepancy, is also considered by [12].

– κ measure: The previous metric has the problem that when one class is much more common than the others or there are only two classes, this measure is highly affected by the fact that some predictions may match just by chance. Following [13], we define the Kappa measure, which was originally introduced as the Kappa statistic (κ) [3]. This is just a proper normalisation based on the probability that two classifiers agree by chance:

  \theta_2 = \sum_{i=1}^{C} \left( \frac{\sum_{j=1}^{C} M_{i,j}}{n} \cdot \frac{\sum_{j=1}^{C} M_{j,i}}{n} \right)

  As a result, the Kappa statistic is defined as:

  \kappa = \frac{\theta - \theta_2}{1 - \theta_2}
  Its value is usually between 0 and 1, although a value lower than 0 is possible, meaning that the two classifiers agree less than two random classifiers agree.

– Q measure: The Q measure is defined as follows [12]:

  Q = \frac{\sum_{i=1}^{C} M_{i,i} - \sum_{i,j=1,\, i \neq j}^{C} M_{i,j}}{\sum_{i=1}^{C} M_{i,i} + \sum_{i,j=1,\, i \neq j}^{C} M_{i,j}}
  This value varies between -1 and 1. Note that this measure may have problems if any component of M is 0. Thus it is convenient to apply smoothing to M to compute the measure; we will add 1 to every cell.

Obviously, the larger the reference dataset is, the better the estimate of similarity given by all of the previous metrics. In our case, and since the previous measures use the contingency matrix, we can have huge reference datasets available: random invented datasets.
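A direct transcription of these three measures into code follows. This is a sketch under the assumption that the contingency matrix is given as a list of lists of counts; it is not taken from the SMILES system:

```python
def theta(M, n):
    """Probability that the two classifiers agree."""
    return sum(M[i][i] for i in range(len(M))) / n

def theta2(M, n):
    """Probability that the two classifiers agree by chance."""
    C = range(len(M))
    return sum((sum(M[i][j] for j in C) / n) * (sum(M[j][i] for j in C) / n)
               for i in C)

def kappa(M, n):
    t, t2 = theta(M, n), theta2(M, n)
    return (t - t2) / (1 - t2)

def q_measure(M):
    """Smoothed Q measure: add 1 to every cell to avoid zero components."""
    C = range(len(M))
    S = [[M[i][j] + 1 for j in C] for i in C]
    agree = sum(S[i][i] for i in C)
    disagree = sum(S[i][j] for i in C for j in C if i != j)
    return (agree - disagree) / (agree + disagree)
```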
3 Random Invented Datasets
In many situations, a single hypothesis may be the one which is the most similar to the combined hypothesis with respect to the training set; however, it may not
be the most similar one in general (with respect to other datasets). In some cases, e.g. if we do not use pruning, then all the hypotheses (and hence the combined solution) may have 100% accuracy with respect to the training set, and all the hypotheses are equally “good”. Therefore, it is suitable or even necessary to evaluate similarity with respect to an external (and desirably large) reference dataset. In many cases, however, we cannot reserve part of the training set for this, or it could be counterproductive. The idea then is to use the entire training set to construct the hypotheses and to use a random dataset to select one of them. In this work, we consider that the examples in the training set are equations of the form f (· · ·) = c, where f is a function symbol and c is the class of the term f (· · ·). Given a function f with a arguments, an unlabelled random example is any instance of the term f (X1 , X2 , · · · , Xa ), i.e., any term of the form f (v1 , v2 , · · · , va ) obtained by replacing every attribute Xi by values vi from the attribute domain (attribute type). Note that an unlabelled random example is not an equation (a full example) because we include no information about the correct class. We will use the following technique to generate each random unlabelled example: each attribute Xi of a new example is obtained as the value vi in a different example f (v1 , . . . , vi , . . . , va ) selected from the training set by using a uniform distribution. This procedure of generating instances assumes that all the attributes are independent, and just maintains the probabilities of appearance of the different values observed in each attribute of the training dataset.
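As a rough sketch of this generation procedure (assuming the training set is given as a list of attribute-value rows, with the class column already removed):

```python
import random

def invent_examples(train_rows, n_examples, rng=random):
    """Each attribute value is drawn from a training row chosen independently
    and uniformly for that attribute, so attributes are treated as independent
    while the observed value frequencies of each attribute are preserved."""
    n_attrs = len(train_rows[0])
    return [[rng.choice(train_rows)[i] for i in range(n_attrs)]
            for _ in range(n_examples)]
```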
4 Shared Ensembles
A multi-tree is a data structure that permits the learning of ensembles of trees that share part of their branches. These are called “shared ensembles”. In the particular case of trees, a multi-tree can be based on an AND/OR organisation, where some alternative splits are also explored. Note that a multi-tree is not a forest [10], because a multi-tree shares the common parts of different trees, whereas a forest is just a collection of trees.

In a previous work [6], we presented an algorithm for the induction of multi-trees which is able to obtain several hypotheses, either by looking for the best one or by combining them in order to improve accuracy. To do this, once a node has been selected to be split (an AND-node), the possible splits below (OR-nodes) are evaluated. The best one, according to the splitting criterion, is selected and the rest are suspended and stored. After the first solution is completed, when a new solution is required, one of the suspended nodes is chosen and 'woken', and the tree construction follows under this node. This way, the search space is an AND/OR tree [14] which is traversed, thus producing an increasing number of solutions as the execution time increases. In [7], we presented several methods for growing the multi-tree structure. Since each new solution is built by completing a different alternative OR-node branch, our method differs from other approaches such as the boosting or bagging methods [1,9,18] which would induce a new decision tree for each solution.
Note that in a multi-tree structure there is an exponential number of possible hypotheses with respect to the number of alternative OR-nodes explored. Consequently, although the use of multi-trees for combining hypotheses is more complex, it is more powerful because it allows us to combine many more hypotheses using the same resources. Other previous works have explored the entire search space of the AND/OR tree to make the combination [2], inspired by Context Tree Weighting (CTW) models [20], whereas we only explore a subset of the best trees.
4.1 Shared Ensemble Combination
Given several classifiers that assign a probability to each prediction (also known as soft classifiers) there are several combination methods or fusion strategies that can be applied. Let us denote by p_k(c_j|x) an estimate of the posterior probability that classifier k assigns class c_j for example x. If we consider all the estimates equally reliable we can define several fusion strategies: majority vote, sum or arithmetic mean, product or geometric mean, maximum, minimum and median. Some works have studied which strategy is best. In particular, [11] concludes that, for two-class problems, minimum and maximum are the best strategies, followed by average (arithmetic mean).

In decision tree learning, the p_k(c_j|x) depend on the leaf node where each x falls. More precisely, these probabilities depend on the proportion of training examples of each class that have fallen into each node during training. The reliability of each node usually depends on the cardinality of the node. Let us define a class vector v_{k,j}(x) as the vector of training cases that fall in each node k for each class j. For leaf nodes the values would be the training cases of each class that have fallen into the leaf. To propagate these vectors upwards to internal nodes, we must clarify how to propagate through AND and OR nodes. This is done for each new unlabelled example we want to make a prediction for. For the AND-nodes, the answer is clear: an example can only fall through an AND-node. Hence, the vector would be the one of the child where the example falls. OR-nodes, however, must do a fusion whenever different alternative vectors occur. This is an important difference in shared ensembles: fusion points are distributed all over the multi-tree structure.

We have implemented several fusion strategies. Nonetheless, it is not the goal of this paper to evaluate different methods for combining hypotheses but to select a single hypothesis. Thus, for the sake of simplicity, in this paper we will only use the maximum strategy because it obtains the best performance, according to our own experiments and those of [11].
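As an illustration of the maximum strategy at an OR-node, consider the following simplification of our own (it normalises the class vectors to probability estimates before taking the per-class maximum, and so ignores the cardinality-based reliability mentioned above):

```python
def max_fusion(class_vectors):
    """class_vectors: one list of per-class training counts per alternative child."""
    probs = [[c / sum(v) for c in v] for v in class_vectors if sum(v) > 0]
    n_classes = len(probs[0])
    return [max(p[j] for p in probs) for j in range(n_classes)]

# Example: two alternative subtrees propose the class vectors [8, 2] and [3, 7];
# the fused estimates are [0.8, 0.7], so class 0 would be predicted.
print(max_fusion([[8, 2], [3, 7]]))
```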
4.2 Selecting an Archetype from a Shared Ensemble
In a shared ensemble, we are not interested (because it would be unfeasible) to compute the similarity of each hypothesis with respect to the combined hypothesis, because there would be an exponential number of comparisons. What we are interested in is a measure of similarity for each node with respect to the
combined solution, taking into account only the examples of the invented dataset that fall into a node. The general idea is that, once the multi-tree is constructed, we use its combination to predict the classes for the previously unlabelled invented dataset. Given an example e from the unlabelled invented dataset, this example will fall into different OR-nodes and finally into different leaves, giving different class vectors. Then, the invented dataset is labelled by voting these predictions in the way explained in the previous subsection.

After this step, we can calculate a contingency matrix for each node, in the following way. For each node (internal or leaf), we have a C × C contingency matrix called M, initialised to 0, where C is the number of classes. For each example in the labelled invented dataset, we increment the cell M_{a,b} of each leaf where the example falls by 1, with a being the class predicted by the leaf and b being the class predicted by the combination. When all the examples have been evaluated and the matrices in the leaf nodes have been assigned, then we propagate the matrices upwards as follows:

– For the contingency matrix M of AND-nodes we accumulate the contingency matrices of their m children nodes: (M1 + M2 + · · · + Mm).
– For the contingency matrix M of OR-nodes, the child node with the greater Kappa (or other similarity measure) is selected and its matrix is propagated upwards. The selected node is marked.

This ultimately generates the hypothesis that is most similar to the combined hypothesis, using a particular invented dataset and a given similarity measure.

Fig. 2. Selection of a single decision tree from the multi-tree structure.
Figure 2 shows the selection of one hypothesis from a multi-tree according to the contingency matrix. The AND-nodes are represented with an arc. The leaves are represented by rectangles. First, we fill the matrices of the leaves. Then, we propagate these upwards as has been detailed previously. Finally, when we reach the top of the tree, it is straightforward to extract the solution by simply descending the multi-tree by the marked nodes. In the figure, the marked nodes
are represented by the dashed lines, and the leaves of the selected hypothesis are shadowed. Therefore, we can summarise the approach in five different steps (a sketch of the propagation step is given after the list):

1. Multi-tree generation: The first step consists in the generation of a multi-tree from a training dataset. There are some criteria which affect the quality of the multi-tree: the splitting criterion, the pruning method, and the criterion for the selection of the suspended node to be woken.
2. Invented dataset: In this phase, an unlabelled invented dataset is created by generating a random dataset.
3. Multi-tree combination: The invented dataset is labelled by the combination of the shared ensemble. A method of combination of hypotheses can be specified.
4. Calculation and propagation of contingency matrices: A contingency matrix is assigned to each node of the multi-tree, using the labelled invented dataset and a similarity metric.
5. Selection of a solution: An archetype hypothesis is extracted from the multi-tree by descending the multi-tree through the marked nodes.
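The following sketch illustrates step 4, the bottom-up propagation of contingency matrices. The node attributes children, is_and_node, matrix, and marked are our own assumptions, not the SMILES data structures, and kappa is an assumed helper returning the κ value of a contingency matrix (for example, a wrapper around the kappa function sketched in Section 2, with n taken as the sum of the matrix entries):

```python
def propagate(node, kappa):
    """Fill node.matrix bottom-up and mark, at each OR-node, the child whose
    matrix is most similar to the combined solution according to kappa."""
    if not node.children:                       # leaf: matrix filled beforehand
        return node.matrix
    child_matrices = [propagate(child, kappa) for child in node.children]
    if node.is_and_node:                        # AND-node: sum the children's matrices
        C = len(child_matrices[0])
        node.matrix = [[sum(m[i][j] for m in child_matrices) for j in range(C)]
                       for i in range(C)]
    else:                                       # OR-node: keep the best child and mark it
        best = max(node.children, key=lambda child: kappa(child.matrix))
        best.marked = True
        node.matrix = best.matrix
    return node.matrix
```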
5 Experiments
In this section, we present an experimental evaluation of our approach, as it is implemented in the SMILES system [8]. SMILES is a multi-purpose machine learning system which includes (among many other features) the implementation of a multi-tree learner. For the experiments, we used GainRatio [17] as splitting criterion. We chose a random method [7] for populating the shared ensemble (after a solution is found, a suspended OR-node is woken at random) and we used the maximum strategy for combination. We used several datasets from the UCI dataset repository [15]. Table 1 shows the dataset name, the size in number of examples, the number of classes, the nominal and numerical attributes.

Table 1. Information about datasets used in the experiments.

#   Dataset         Size  Classes  Nom.Attr.  Num.Attr.
1   monks1           566     2         6          0
2   monks2           601     2         6          0
3   monks3           554     2         6          0
4   tic-tac          958     2         8          0
5   house-votes      435     2        16          0
6   post-operative    87     3         7          1
7   balance-scale    625     3         0          4
8   soybean-small     35     4        35          0
9   dermatology      358     6        33          1
10  cars            1728     4         5          0
11  tae              151     3         2          3
12  new-thyroid      215     3         0          5
13  ecoli            336     8         0          7
Since there are many sources of randomness, we have performed the experiments by averaging 10 results of a 10-fold cross-validation. This makes a total of 100 runs (each one with a different multi-tree construction, random dataset and hypothesis selection process) for each pair of method and dataset. In the experiments, we will use the following notation:

– First Solution: this is the solution given by just one hypothesis (the first hypothesis that is obtained). This is similar to C4.5 [17].
– Combined Solution: this is the solution given by combining the results of the ensemble (in our case, the multi-tree, as described in the previous section).
– Archetype Solution: this is the single solution which is most similar to the combined solution.
– Occam Solution: this is the single solution with the lowest number of rules, i.e., the shortest solution.

It is not our purpose to evaluate the improvement of the Combined Solution over the First Solution using shared ensembles. We have done that in previous works [7]. We have not included the results using post-pruning because it does not improve the performance of any of the four kinds of solutions. Our goal is to show that a significant gain can be obtained from the First Solution to the Archetype and Occam methods as the size of the ensemble increases. Another question to be answered is which method for extracting a single solution from an ensemble is better: Archetype or Occam.

5.1 Evaluating Similarity Metrics
Table 2 shows the accuracy for each pair composed of a dataset and a method and the geometric means for each method. The methods studied are First, Combined and Archetype. The latter uses three different similarity metrics κ, θ and Q. The multi-tree has been generated exploring 100 suspended OR-nodes.

Table 2. Comparison between measures of similarity.

#        1st    Comb   Arc. κ  Arc. θ  Arc. Q
1        92.3   100    100     100     100
2        74.8   77.4   76.1    76.2    75.8
3        97.5   97.5   97.6    97.6    97.6
4        78.2   82.7   78.2    78.3    78.5
5        93.6   96.0   94.4    93.9    94.2
6        60.9   66.3   63.8    64.3    61.9
7        76.8   83.1   80.1    80.1    79.8
8        97.3   96.5   96.5    91.0    47.0
9        89.8   93.6   90.6    89.9    74.3
10       89.0   91.0   89.6    89.6    89.3
11       62.9   64.5   61.9    62.9    49.8
12       92.6   92.6   92.8    92.9    91.4
13       77.5   79.9   79.4    78.9    76.7
gmeans   82.41  85.45  83.78   83.45   76.24
As expected, hypothesis combination improves the accuracy w.r.t. the first single tree. The use of the Archetype method also obtains good results.
Table 3. Influence of the size of the invented dataset.

 #        Comb   Arc(10)  Arc(100)  Arc(1000)  Arc(10000)  Arc(100000)
 1        99.8   72.3     93.3      99.8       100         99.9
 2        77.3   64.6     61.0      75.2       76.1        76.2
 3        97.6   82.9     94.5      97.6       97.6        97.6
 4        82.9   65.9     70.3      78.0       78.2        78.6
 5        95.8   73.7     92.4      94.4       94.4        93.8
 6        67.5   69.1     63.6      63.9       63.8        63.5
 7        83.0   62.5     75.4      79.4       80.1        79.9
 8        95.0   68.8     93.3      95.0       96.5        96.5
 9        93.6   45.6     84.7      90.5       90.6        89.9
10        91.0   71.0     75.4      88.1       89.6        89.8
11        63.7   44.3     54.3      59.1       61.9        61.2
12        92.5   73.8     89.3      91.3       92.8        92.6
13        80.0   46.8     73.9      77.9       79.4        79.0
gmeans    85.36  63.57    77.40     82.88      83.78       83.57
On the other hand, the results show that the Archetype method is very dependent on the measure of similarity used: κ seems to be the best metric and Q the worst (it even obtains lower accuracy than the first single hypothesis).
5.2 Influence of the Size of the Invented Dataset
Similarity is approximated through the use of an invented dataset. Let us study the influence of its size, varying from 10 to 100,000 examples. The similarity metric and the size of the multi-tree are fixed to κ and 100 alternative opened OR-trees, respectively. Table 3 shows that in order to obtain a good archetype hypothesis, the similarity metric has to be computed as accurately as possible. Although it depends on the dataset, a size of 10,000 invented examples seems to be sufficient.
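The sketch below illustrates, in the same stand-in setting as before (scikit-learn models instead of the multi-tree), how the κ-agreement between a single hypothesis and the combined oracle can be estimated on invented datasets of growing size; the estimate stabilises as the size grows, which is the effect summarised in Table 3. Models, dataset, and sizes are assumptions for the example.

```python
# Hedged sketch: estimate the kappa-agreement between a single tree and an
# ensemble oracle on invented datasets of increasing size.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

X, y = load_iris(return_X_y=True)
oracle = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
single = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
low, high = X.min(axis=0), X.max(axis=0)
for size in (10, 100, 1_000, 10_000, 100_000):
    X_inv = rng.uniform(low, high, size=(size, X.shape[1]))
    kappa = cohen_kappa_score(oracle.predict(X_inv), single.predict(X_inv))
    print(f"invented size {size:>6}: kappa = {kappa:.3f}")
```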
5.3 Influence of the Size of the Ensemble
The effect of the size of the multi-tree is evaluated in Table 4. In this table, we show the accuracy of the first single solution and the accuracy of the combination, the archetype solution and the Occam solution for multi-trees created by exploring 10, 100, and 1000¹ alternative OR-nodes. We also include the geometric average number of solutions in the multi-tree (#Sol). Note that with 100 OR-nodes, we obtain millions of solutions with much less required memory than 100 non-shared hypotheses. The results are quite encouraging: by simply exploring 10 OR-nodes, the archetype solution surpasses the first solution and the Occam solution. This difference increases as the multi-tree is further populated. This is mainly due to the improvement in the accuracy of the combined solution and the fact that the archetype hypothesis can actually get close to it.
¹ The experiments for datasets 9 and 13 have been performed exploring only 300 and 500 alternative OR-nodes, respectively.
Table 4. Influence of the size of the multi-tree (columns are grouped by the number of explored OR-nodes: 1, 10, 100, and 1000).

 #    1st   | Comb  Arc   Occ   #Sol  | Comb  Arc   Occ   #Sol      | Comb  Arc   Occ   #Sol
            |       (10 OR-nodes)     |       (100 OR-nodes)        |       (1000 OR-nodes)
 1    92.3  | 96.1  96.0  96.5  107   | 100   100   100   8.7x10^8  | 100   100   100   1.6x10^19
 2    74.8  | 74.9  74.3  74.3  148   | 77.4  76.1  72.5  2.6x10^10 | 82.3  82.1  70.4  3.2x10^20
 3    97.5  | 97.7  97.7  97.6   46   | 97.5  97.6  97.5  8.0x10^4  | 97.7  97.7  97.6  7.1x10^14
 4    78.2  | 79.0  78.1  78.3  257   | 82.7  78.2  78.6  2.7x10^12 | 84.6  79.8  79.5  3.1x10^38
 5    93.6  | 94.9  94.2  93.9   63   | 96.0  94.4  93.6  2.6x10^5  | 95.7  94.1  93.9  5.6x10^11
 6    60.9  | 63.8  61.8  60.0   55   | 66.3  63.8  62.3  59674     | 68.5  65.9  62.1  2.1x10^9
 7    76.8  | 77.9  77.2  76.8  131   | 83.1  80.1  76.7  3.4x10^8  | 88.0  83.5  76.8  1.2x10^18
 8    97.3  | 97.0  98.0  97.5   23   | 96.5  96.5  96.8  38737     | 95.0  93.3  96.3  1.8x10^18
 9    89.8  | 91.3  90.6  90.1   92   | 93.6  90.6  90.2  3.3x10^7  | 93.8  91.1  90.8  1.2x10^10
10    89.0  | 89.6  89.1  89.0  151   | 91.0  89.6  89.1  1.7x10^9  | 91.6  90.0  89.1  2.8x10^24
11    62.9  | 62.5  62.3  61.9   97   | 64.5  61.9  62.1  1.5x10^6  | 64.5  60.9  61.1  4.6x10^14
12    92.6  | 93.2  92.6  92.6   26   | 92.6  92.8  93.0  3392      | 90.7  92.6  93.7  6.1x10^7
13    77.5  | 79.1  77.6  77.8   57   | 79.9  79.4  78.4  1134750   | 80.3  78.2  77.0  3.8x10^8
gm.   82.41 | 83.49 82.85 82.55 78.31 | 85.45 83.78 82.91 4.3x10^7  | 86.44 84.49 82.65 6.2x10^14
The Occam solution does not seem to be improved by larger multi-trees. Nevertheless, the Occam hypothesis can also be regarded as a way to obtain more and more compact solutions without losing accuracy.
6 Conclusions
This work has presented a novel method for extracting a single solution from an ensemble of solutions without removing training data for validation. The most closely related work is Quinlan’s miniboosting [16]. However, Quinlan’s method can be considered an ensemble method which generates three trees, followed by a merging stage where a single but quite complex tree could be obtained. Moreover, as he recognises, “although it is (usually) possible to construct a single merged decision tree that induces the same partition as a small ensemble, the tree is so large that it conveys no insight. This is a pity, as insight was the prime motivation for producing a single tree”. We overcome the previous problem by producing a single tree that is based on a selection, using the combination as an oracle. Consequently, the result is a single comprehensible solution, an archetype or representative of the ensemble. As we have shown, the single solution obtained by our method is not 100% equivalent to the combination, but in general it gets reasonably close. On the other hand, it is clear that a similar technique could be used for regression models, using, e.g., minimum squared error as a discrepancy (similarity) metric. From a more general point of view, ensemble methods have been used as an argument in favor of the Epicurus criterion (all consistent models should be retained) and against Occam’s razor, because complex combined hypotheses usually obtain better results than the simplest solution [21]. A counterargument may be that the combination is usually expressed outside the hypothesis language. With our work, we have shown that even inside the hypothesis language, the shortest solution is not the best one.
With regard to future work, a mixture of the archetyping method and Occam’s razor could also be investigated. Another idea is that the oracle does not need to be an internal combined hypothesis but it can be any external source, such as a neural network. Therefore, this could be regarded as a new method to “convert” incomprehensible neural networks (or other models) to comprehensible models (with possibly a slight loss in accuracy), which could also be seen as an ‘explanation’ of the original model. Finally, it is important to clarify that an archetype solution cannot be obtained without an ensemble, and the quality of the representative would depend on the number of individual hypotheses in the ensemble. Note that this number is exponentially increased by the use of shared ensembles, in particular our multi-tree structure, without requiring a huge amount of memory. Nonetheless, a quite interesting open work would be to study specific methods to generate the ensemble (in our case, to construct the multi-tree) or to investigate the combination method that would produce better oracles for the selection of the archetype. Acknowledgements. We would like to thank the anonymous reviewers for suggesting the idea of using the archetype as an explanation of the ensemble.
References 1. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. 2. J.G. Cleary and L.E. Trigg. Experiences with ob1, an optimal bayes decision tree learner. Technical report, Department of Computer Science, Univ. of Waikato, New Zealand, 1998. 3. J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Meas., 20:37–46, 1960. 4. T. G Dietterich. Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems, pages 1–15, 2000. 5. Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, Boosting, and Randomization. Machine Learning, 40(2):139–157, 2000. 6. C. Ferri, J. Hern´ andez, and M.J. Ram´ırez. Induction of Decision Multi-trees using Levin Search. In Int. Conf. on Computational Science, ICCS’02, LNCS, 2002. 7. C. Ferri, J. Hern´ andez, and M.J. Ram´ırez. Learning multiple and different hypotheses. Technical report, Department of Computer Science, Universitat Polit´ecnica de Val´encia, 2002. 8. C. Ferri, J. Hern´ andez, and M.J. Ram´ırez. SMILES system, a multi-purpose learning system. http://www.dsic.upv.es/˜flip/smiles/, 2002. 9. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In the 13th Int. Conf. on Machine Learning (ICML’1996), pages 148–156, 1996. 10. Tim Kam Ho. C4.5 decision forests. In Proc. of 14th Intl. Conf. on Pattern Recognition, Brisbane, Australia, pages 545–549, 1998. 11. Ludmila I. Kuncheva. A Theoretical Study on Six Classifier Fusion Strategies. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(2):281–286, 2002.
12. Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Submitted to Machine Learning, 2002. 13. Dragos D. Margineantu and Thomas G. Dietterich. Pruning adaptive boosting. In 14th Int. Conf. on Machine Learning, pages 211–218. Morgan Kaufmann, 1997. 14. N.J. Nilsson. Artificial Intelligence: a new synthesis. Morgan Kaufmann, 1998. 15. University of California. UCI Machine Learning Repository Content Summary. http://www.ics.uci.edu/˜mlearn/MLSummary.html. 16. J. Quinlan. Miniboosting decision trees. Submitted to JAIR, 1998. 17. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. 18. J. R. Quinlan. Bagging, Boosting, and C4.5. In Proc. of the 13th Nat. Conf. on A.I. and the 8th Innovative Applications of A.I. Conf., pages 725–730. AAAI/MIT Press, 1996. 19. Ross Quinlan. Relational learning and boosting. In Saso Dzeroski and Nada Lavrac, editors, Relational Data Mining, pages 292–306. Springer-Verlag, September 2001. 20. P. Volf and F. Willems. Context maximizing: Finding mdl decision trees. In Symposium on Information Theory in the Benelux, Vol.15, pages 192–200, 1994. 21. Geoffrey I. Webb. Further experimental evidence against the utility of Occam’s razor. Journal of Artificial Intelligence Research, 4:397–417, 1996. 22. David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
Learning the Causal Structure of Overlapping Variable Sets

David Danks

Institute for Human and Machine Cognition, University of West Florida
40 S. Alcaniz St., Pensacola, FL 32501, U.S.A.
[email protected]
Abstract. In many real-world applications of machine learning and data mining techniques, one finds that one must separate the variables under consideration into multiple subsets (perhaps to reduce computational complexity, or because of a shift in focus during data collection and analysis). In this paper, we use the framework of Bayesian networks to examine the problem of integrating the learning outputs for multiple overlapping datasets. In particular, we provide rules for extracting causal information about the true (unknown) Bayesian network from the previously learned (partial) Bayesian networks. We also provide the SLPR algorithm, which efficiently uses these previously learned Bayesian networks to guide learning of the full structure. A complexity analysis of the “worst-case” scenario for the SLPR algorithm reveals that the algorithm is always less complex than a comparable “reference” algorithm (though no absolute optimality proof is known). Although no “expectedcase” analysis is given, the complexity analysis suggests that (given the currently available set of algorithms) one should always use the SLPR algorithm, regardless of the underlying generating structure. The results provided in this paper point to a wide range of open questions, which are briefly discussed.
1 The “Big Picture” Problem
Modern data collection has advanced to the point that the size and complexity of our datasets regularly exceed the computational limits of our algorithms (on modern machines). As a result, analysis is often rendered computationally tractable only when we consider proper subsets of the variables we have measured. In addition, the variables thought to be relevant often change over the course of an investigation, both in the data collection and analysis phases. For example, an unexpected correlation might suggest the need to find an unmeasured common cause. When these changes occur, we want to use as much information as possible from earlier analyses to minimize duplication of effort. Thus, in both of these situations, we must address the distinctive problems (such as integration of outputs and efficient use of prior learning in subsequent learning) that arise for learning on multiple overlapping sets of variables.
There is a further, more practical, motivation. Many social science datasets have overlapping variables but the datapoints are unlabelled (for privacy reasons), so that there is no possible way to create a “complete” dataset. For example, we might have a census dataset and an unemployment dataset, both of which have Income as a variable, but neither of which contains identifiers to be used for creation of a single, integrated dataset. Hence, in these domains where there are substantial practical barriers to creating a unified dataset, the problem of integrating the learning outputs becomes particularly salient.

At the same time as we face these difficulties, we want our analysis techniques to reveal causal relationships. Causal information allows for predictions about the (probabilistic) outcomes of interventions, and is more easily understood by human users of machine learning techniques. Hence, if possible, we want to use a representation that allows for causal inference and prediction.

To better understand these problems, we can try to express them more formally. Let V be the full set of variables under consideration. We assume that the variables are either all discrete or all continuous, though in the former case they need not have the same number of values. Let S1, ..., Sn be (nonempty) subsets of V such that S1 ∩ ... ∩ Sn ≠ ∅. We further assume that, throughout all stages of learning, there is some stationary generating process producing the data, and that we have sufficient data that the sample statistics are the same as the population statistics. In this paper, we will be concerned with the following two questions:
1. If we do not have joint data over V, but we do have the outputs of some reliable, correct learning process (e.g., a machine learning algorithm) on S1, ..., Sn, what can we learn about the relationships among V as a whole? That is, if we can only learn the causal structure of the subsets (because of lack of data), what can we learn (if anything) about the full structure underlying V?
2. If we do have joint data over V, as well as the learning outputs, how can we efficiently learn the full structure for V? That is, how (if at all) can we use learning results over subsets to guide the learning for V as a whole?
For example, we might have three datasets (drawn from the same population) over the following variables:
1. S1 = {Education, ParentalEducation, Income};
2. S2 = {Education, Housing, Income}; and
3. S3 = {Education, Age, NumberOfChildren, Income}.
In this example, the above two questions correspond to: (1) Given just these three datasets, what can be learned about the interrelationships among the six variables (including pairs, like Housing and Age, that do not appear in the same dataset)? and (2) How could we efficiently learn the full causal structure if we actually had a complete dataset?

On one level, it will be surprising if the answer to question 1 is anything other than “nothing.” Any positive answer implies that we can determine something about the relationship between two variables, despite the fact that we have no
dataset that contains all of the (possibly) relevant variables, perhaps including the two target variables. Interestingly, we will find in Section 3 that, despite this restriction, we can still learn something in this situation. Question 2 is more straightforward, since we would expect that some speedup would be possible (since we have already done some learning). An algorithm that uses the initial learning is provided in Section 4.1. The complexity analysis in Section 4.2 then shows that the algorithm is always more efficient than a comparable reference algorithm on the worst-case scenario. That is, for all parameterizations of the worst-case, we always gain an advantage by using prior learning. A question similar in spirit was previously considered by Fienberg and Kim in the context of log-linear models (a type of undirected graph) [5]. They considered the problem of finding the full structure for V ∪ X when we are given each marginal structure for V , conditional on some value of X. Since there are conditions under which log-linear models can model a Bayesian network (the formalism used here), their work shares a similar flavor to the ideas in this paper. It differs, however, by considering variable addition (rather than overlap), focusing on a particular parameterization, and assuming that the full data is (implicitly) available. Before proceeding on to the heart of the paper, we note that, for the purposes of this paper, we will restrict ourselves to the case of exactly two overlapping sets, S 1 and S 2 , but we will not assume that one set is a proper subset of the other.
2 Bayesian Networks
We cannot address the two principal questions of this paper without examining them from within a particular formalism. In this paper, we will use Bayesian networks (or simply Bayes nets). In addition, we will assume that all of the variables in V are discrete. Nothing in this paper hinges on the latter assumption; there are corresponding representations for continuous variables, and the SLPR algorithm (presented in Section 4.1) refers only to (conditional) independence, which is defined for both discrete and continuous variables. The remainder of this section is meant simply as a quick overview of Bayes net terminology and concepts. Several excellent introductions to Bayes nets are [7,11,12,14]; more detailed snapshots of the current state of the field can be found in [7,9]. 2.1
The Bayes Net Formalism
Suppose that V = {V1 , ..., Vn } is a set of random variables, and let v = {v1 , ..., vn } be the values of the variables for some datapoint. A Bayes net for V is composed of two inter-connected elements: – A directed acyclic graph over V ; and – A joint probability distribution (j.p.d.) over the possible joint variable values.
There is a node for each variable in V in the graph, and nodes are (sometimes) connected by an edge with exactly one arrowhead. The edge points from the parent variable to the child variable. We use similar relation terminology (e.g., ancestor, descendant) to describe other variables. So, for example, in the simple graph A → B ← C, A, C are B’s parents, and B is A, C’s child. Every variable is its own ancestor and descendant, so {A, B, C} are B’s ancestors. B is also called a collider in the above graph, since the two edges “collide” at it. A directed path from Vi to Vj is a sequence of the form Vi → ... → Vj , with all edges oriented in the same direction. A trek is a pair of directed paths (possibly one with zero length) that are both from the same variable (called the source). A graph is acyclic if there is no directed path (of non-zero length) from a variable to itself. The graph and j.p.d. are related1 through the following two assumptions: Markov Assumption. A variable A is (probabilistically) independent of all (graphical) non-parental non-descendants, conditional on A’s (graphical) parents. Faithfulness (Stability) Assumption. If variables A and B are (probabilistically) independent conditional on set S, then there is no (graphical) edge between them. Notice that these two assumptions are each other’s converse: the Markov assumption says “No edge implies conditional independence” and the Faithfulness assumption says “Conditional independence implies no edge.” The Markov assumption enables us to decompose the j.p.d. into the product of n simpler probabilities based on the graph. Specifically, if pa (Vi ) is the set of parents of Vi , then the j.p.d. can be expressed as: P (V1 , ..., Vn ) =
\prod_{i=1}^{n} P(V_i | pa(V_i))     (1)
For example, any Markov j.p.d. for the graph X → Y → Z must factor as P (X, Y, Z) = P (X) P (Y |X) P (Z|Y ). The Faithfulness assumption rules out cases in which two pathways exactly cancel each other out. That is, if there is an edge between two variables, they must be associated regardless of conditioning set. For example, suppose that running increases your metabolism, which would normally lead you to lose weight. But suppose it also makes you hungrier, so you eat more and (normally) gain weight. Faithfulness (or rather, a causal version of it) states that these two processes do not exactly cancel out (so that your weight gain or loss is independent of whether you go running). This assumption is necessary to learn the most parsimonious graph(s) for a particular j.p.d., and is assumed by all current Bayes net learning methods (discussed below). 1
I deliberately use the neutral term “related.” Contextual factors determine which component is primary. In most machine learning contexts, we have a j.p.d. and we want to learn the graph; in most expert knowledge contexts, we have a graph and we want to make predictions about the j.p.d. In this paper, we will operate in both contexts: from graph to j.p.d. (and back to graph) in Section 3, and from j.p.d. to graph in Section 4.
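A small worked example of factorisation (1) for the chain X → Y → Z is given below; the probability tables are invented purely for illustration, and the final check confirms the conditional independence of X and Z given Y that the Markov assumption implies.

```python
# Worked example of the Markov factorisation P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)
# for the chain X -> Y -> Z, with invented (illustrative) probability tables.
from itertools import product

P_X = {0: 0.6, 1: 0.4}
P_Y_given_X = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P_Y_given_X[x][y]
P_Z_given_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}  # P_Z_given_Y[y][z]

joint = {(x, y, z): P_X[x] * P_Y_given_X[x][y] * P_Z_given_Y[y][z]
         for x, y, z in product((0, 1), repeat=3)}
assert abs(sum(joint.values()) - 1.0) < 1e-12

# The Markov assumption implies X and Z are independent conditional on Y.
for y in (0, 1):
    p_y = sum(p for (x, yy, z), p in joint.items() if yy == y)
    for x, z in product((0, 1), repeat=2):
        p_xz_y = joint[(x, y, z)] / p_y
        p_x_y = sum(joint[(x, y, zz)] for zz in (0, 1)) / p_y
        p_z_y = sum(joint[(xx, y, z)] for xx in (0, 1)) / p_y
        assert abs(p_xz_y - p_x_y * p_z_y) < 1e-12
print("X is independent of Z given Y, as the factorisation requires")
```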
Most importantly for our present purposes, Bayes nets have proven to be excellent models of causation. In particular, the “asymmetry of intervention” (i.e., an intervention on a variable influences its effects, but not its causes) is quite easily represented in a Bayes net, and we can thus predict the (probabilistic) result of a particular intervention [12,14]. Furthermore, the learning algorithms discussed in the next section have been shown to accurately recover causal structure (in appropriate situations). There are debates about the exact nature of the assumptions needed to extract causal information from data (e.g., [10,15]); we will not deal with those issues in this paper. 2.2
Learning Bayes Nets
Two different strategy types have been used for learning Bayes nets from data: constraint-based search and Bayesian updating. These strategy types differ in process, not in asymptotic behavior, though they do have different performance profiles. Constraint-based procedures (e.g., [3,14]) use patterns of conditional and unconditional independencies and associations in the data to determine the equivalence class of graphs that could possibly have produced that pattern. An equivalence class (in this context) is a set of graphs that all imply (by the Markov and faithfulness conditions) the same conditional and unconditional associations and independencies. The equivalence class output is usually represented as a partially directed graph with associated rules for transforming the output into the full equivalence class. For Bayesian updating (e.g., [2,6,8]), we assign a probability distribution to the space of possible graphs. That distribution encodes our prior beliefs about the likelihood of each graph. As we receive data, we use standard Bayesian updating to revise both the probability distribution over the search space, and also the parameter probability distributions for each possible graph. The output produced by a Bayesian procedure is thus an updated (and asymptotically correct) probability distribution for the space of possible graphs. Bayesian search procedures are more flexible than constraint-based procedures, and often give more useful information. However, if we have n different variables in our system, then the number of possible graphs is at least exponential in n. Therefore, Bayesian search procedures are only practical if we use a series of heuristics, even though we can prove that the use of these heuristics is sometimes incorrect. There are also barriers peculiar to this problem to using Bayesian updating. There is no principled way (at present) to “redistribute” a probability distribution for graphs over S 1 or S 2 into a probability distribution for graphs over V as a whole [3]. In addition, even if we had such a method, we would potentially need to resolve conflicts in the marginal distributions over the shared variables, and there are theorems suggesting that there may not be principled Bayesian solutions for such conflicts [13]. Hence, we will focus on constraint-based procedures for the remainder of this paper. We can now recast the questions from Section 1 in Bayes net terms. Suppose we are attempting to learn some causal structure for a large set of variables V , and suppose V is divided into overlapping subsets A and B. We will primarily
be interested in the three distinct (non-empty) subsets: the shared variables M = A ∩ B, and the unshared variables X = A \ M and Y = B \ M . Further suppose that we have learned the patterns (i.e., the equivalence classes of graphs) that are Markov and faithful to A and B: P attA and P attB , respectively. Let P attV be the (unknown) pattern that is Markov and faithful to (data over) V . The questions from Section 1 can now be re-expressed as: 1. Given only P attA and P attB as inputs, what can we determine about possible edges in P attV ? (discussed in Section 3) 2. Given P attA , P attB , and a sufficiently large dataset of complete data over V , is there an algorithm for learning P attV that is less computationally complex than a standard learning algorithm that uses only the data over V ? (discussed in Section 4) We also introduce one further assumption here: Causal Sufficiency Assumption V contains all common causes of variables in V . This assumption is essentially a closure assumption that says that there are no unobserved common causes (though there can be other unobserved causes of variables in V ). The appropriateness of this assumption is obviously quite contextand knowledge-dependent. It is, however, assumed by all Bayesian learning algorithms, and is regularly assumed in practice by users of constraint-based methods (at least at first). We will assume causal sufficiency for the remainder of this paper.
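As an illustration of the independence tests that constraint-based procedures rely on, the following sketch implements one simple conditional independence test for discrete data: chi-square tests of X against Y within each configuration of the conditioning set, pooled across strata. This particular pooling is only one of many possible choices and is not the test prescribed by any of the algorithms cited above; the function name and the synthetic data are assumptions for the example.

```python
# Hedged sketch of a conditional independence test for discrete data:
# chi-square tests of X vs Y within each configuration of the conditioning
# set T, with statistics and degrees of freedom pooled across strata.
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

def cond_indep(df, x, y, t, alpha=0.05):
    """Return True if x appears independent of y conditional on the columns in t."""
    stat, dof = 0.0, 0
    groups = [df] if not t else (g for _, g in df.groupby(t))
    for g in groups:
        table = pd.crosstab(g[x], g[y])
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue  # this stratum carries no information about dependence
        s, _, d, _ = chi2_contingency(table)
        stat, dof = stat + s, dof + d
    if dof == 0:
        return True
    return chi2.sf(stat, dof) > alpha

# Tiny synthetic check: Z -> X and Z -> Y, so X and Y are dependent
# marginally but independent given Z.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, 5000)
x = (z + rng.integers(0, 2, 5000) > 1).astype(int)
y = (z + rng.integers(0, 2, 5000) > 1).astype(int)
df = pd.DataFrame({"X": x, "Y": y, "Z": z})
print(cond_indep(df, "X", "Y", []), cond_indep(df, "X", "Y", ["Z"]))
```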
3 Integrating Overlapping Bayes Nets

3.1 Edge Removal Rules
In this section, we attempt to determine whether we can learn anything about the structure of (the unknown) P attV simply through consideration of P attA and P attB . At first glance, we might naturally think that nothing at all could be said about the pattern for the full causal structure. Answering the question requires making claims about the presence or absence of edges (i.e., causal relationships) without having any datapoints with values for every variable in V . There seems to be no obvious reason why we should be able to determine anything at all, since the unobserved values seemingly might provide the crucial information for determining whether a particular edge should be present or absent. This intuition is quite reasonable, and does apply to whether an edge should be present; it does not, however, apply to the absence of an edge. Removal of an edge between two variables U and W indicates that they are independent conditional on some set T (namely, the parents of at least one of the variables). Hence, these variables will be non-adjacent in any graph for variables R with T ⊆ R. Since V contains both A and B, we can use the following rule: Edge Removal Rule 1. If U and W are not adjacent in P attA or P attB , then they must also be not adjacent in P attV .
In addition to enabling us to remove some edges from PattV without checking independencies, we can also use this rule to resolve conflicts when U, W ∈ M and U and W are adjacent in PattA, but not PattB (or vice versa), since absence in either sub-pattern implies absence in the full pattern.

The above result essentially allows us to import the independencies encoded in PattA and PattB. More surprisingly, we can sometimes remove edges between variables X ∈ X and Y ∈ Y, even though we do not have a single datapoint that contains values for both variables. Before showing how this process can work, we first prove a more general result. We define the “reachable ancestors” for a variable X relative to a “blocking set” T as:

RA(X, T) = {Y : ∃Z = {Z1, ..., Zn} (possibly empty) s.t. Y → Z1 → ... → Zn → X ∧ ∀i (Zi ∉ T)}

That is, RA(X, T) contains just those variables that have a directed path to X containing no variables in T. Using “X ⊥ Y | T” as shorthand for “X is independent of Y conditional on set T,” we can state the following theorem:

Theorem 1. Assume faithful and Markov data for graph G. If X ⊥ Y | T, then X and Z ∈ RA(Y, T) are non-adjacent in G.

Proof. Proof of contrapositive. Suppose some Z ∈ RA(Y, T) is adjacent to X in G. Then since (by definition of RA(Y, T)) there is an unblocked (by T) directed path from Z to Y, there is an unblocked (by T) trek between X and Y. Therefore, X and Y are associated conditional on T. Q.E.D.

We can use Theorem 1 for our integration problem, because a special case of this theorem is:

Edge Removal Rule 2. If there exists X ∈ X, Y ∈ Y, M ∈ M such that (i) there is a directed path from X to M in PattA involving only variables in X; and (ii) M and Y are not adjacent in PattB, then X and Y are not adjacent in PattV.

The lack of an edge between Y and M in PattB indicates an independence conditional on some subset of Y ∪ M (since PattB is faithful and Markov to the generating process marginalized to Y ∪ M). Therefore, any ancestor of M that is connected only through elements of X (and so only variables in X are candidates) is reachable, and so there cannot be an edge between Y and those variables. Moreover, Edge Removal Rule 2 results in the removal of as many edges as possible based solely on Theorem 1.

If we know something about the parameterization of the Bayes net, then we can sometimes remove more edges. For example, correlations in linear systems obey the generalized trek rule: \rho_{i.j} = \sum_{t \in T} \prod_{k=0}^{t_n - 1} \rho_{k.k+1}, where \rho_{i.j} is the correlation between Vi and Vj, T is the set of all treks between Vi and Vj, and k (in the product) indexes over the variables on a particular trek t. Bayes nets with only binary variables have a similar, though not identical, decomposition [4]. In these cases, we can attempt to remove edges between variables M1, M2 ∈ M by determining whether the observed correlation matches the correlations predicted by the two patterns. In the interest of limiting the assumptions about the underlying processes, we do not pursue these “special case” rules further.
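A hedged sketch of the two edge removal rules is given below. Patterns are represented by adjacency sets plus the edges that are oriented in them and, following the worked example in the next subsection, unoriented pattern edges are treated as possibly directed when searching for the path required by Rule 2. Only edge presence is tracked; orientation of the integrated pattern is not handled.

```python
# Hedged sketch of Edge Removal Rules 1 and 2. Each pattern is given as
# undirected adjacency sets plus the set of oriented edges (u, w) meaning u -> w.
from itertools import combinations

def possibly_directed_path(start, target, adj, oriented, allowed):
    """Path start -> ... -> target that never travels against an oriented edge,
    with intermediate nodes restricted to `allowed` (unoriented pattern edges
    are treated as possibly directed)."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, set()):
            if (nxt, node) in oriented:          # edge points back towards us
                continue
            if nxt == target:
                return True
            if nxt in allowed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def integrate(vars_a, vars_b, adj_a, adj_b, ori_a, ori_b):
    shared = vars_a & vars_b
    only_a, only_b = vars_a - shared, vars_b - shared
    edges = {frozenset(p) for p in combinations(vars_a | vars_b, 2)}   # complete graph
    # Rule 1: absence of an edge in either sub-pattern implies absence overall.
    for vs, adj in ((vars_a, adj_a), (vars_b, adj_b)):
        for u, w in combinations(vs, 2):
            if w not in adj.get(u, set()):
                edges.discard(frozenset((u, w)))
    # Rule 2 (applied in both directions): a (possibly) directed path from X to M
    # inside one unshared side, plus non-adjacency of M and Y on the other side,
    # rules out the edge X - Y.
    for x_side, y_side, adj_own, ori_own, adj_other in (
            (only_a, only_b, adj_a, ori_a, adj_b),
            (only_b, only_a, adj_b, ori_b, adj_a)):
        for m in shared:
            for x in x_side:
                if possibly_directed_path(x, m, adj_own, ori_own, x_side):
                    for y in y_side:
                        if m not in adj_other.get(y, set()):
                            edges.discard(frozenset((x, y)))
    return edges

# Toy example (see the next subsection): true graph U -> X -> Y <- Z,
# A = {U, X, Y} with pattern U - X - Y, B = {X, Y, Z} with pattern X -> Y <- Z.
adj_a = {"U": {"X"}, "X": {"U", "Y"}, "Y": {"X"}}
adj_b = {"X": {"Y"}, "Y": {"X", "Z"}, "Z": {"Y"}}
result = integrate({"U", "X", "Y"}, {"X", "Y", "Z"}, adj_a, adj_b,
                   ori_a=set(), ori_b={("X", "Y"), ("Z", "Y")})
print(sorted(tuple(sorted(e)) for e in result))   # the U - Z edge is removed
```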
3.2 A Toy Example
Suppose the true underlying graph is: U → X → Y ← Z, and that A = {U, X, Y } and B = {X, Y, Z}. If we assume that the data are Markov and faithful to this underlying graph, then we can analytically determine the patterns for A and B. Specifically, P attA = U − X − Y , whose equivalence class is {U → X → Y, U ← X → Y, U ← X ← Y }, and P attB = X → Y ← Z, which has a singleton equivalence class (namely, itself). For P attV , we start with the complete pattern, in which every pair of variables is connected by an undirected edge. By Edge Removal Rule 1, we can remove two edges: U − Y and X − Z. We can also use Edge Removal Rule 2, since there is a directed path from U to X (in this case, involving no other variables), and X and Z are not adjacent in P attB . Hence, there cannot be an edge between U and Z. Thus, by applying both of the edge removal rules (and incorporating orientation information), we get the output pattern: U − X → Y ← Z, which is the correct pattern for the underlying graph (i.e., it describes the equivalence class of which the underlying graph is a member). 3.3
Piecewise Causal Learning
The example in Section 3.2 is, as the section title suggests, simply a toy example. It does, however, point towards situations in which these edge removal rules have non-trivial consequences. In particular, these rules can significantly advance causal learning by computationally limited agents, such as people. Suppose we have a causal system with a relatively large number of variables. Directly learning the structure of this system requires (in some sense) the ability to keep all of the variables “in mind” at one point in time. For agents with limited memory or computational abilities, we might consider a piecewise learning procedure. Specifically, we might consider only single variables, and attempt to learn that variable’s parents (i.e., direct causes) with a standard learning algorithm, without worrying about larger-scale structure. We then must integrate the local causal information, but notice that the above edge removal rules will apply to any case in which the focus variable is later found to be a parent of another variable. That is, we might not have to consider every variable as a possible effect in order to learn substantial portions of the causal structure. To make the applicability more obvious, consider a concrete (though abstract) example. Suppose we have some large set of variables, we choose a variable Y in that set (perhaps at random), and we learn its direct causes: X1 , ..., Xn . Now suppose that we learn that the parents of some other variable Z are: Y, A1 , ..., Am . We can use the edge removal rules to conclude immediately that there are no edges between the X’s and any of the A’s that are independent of Y . Hence, we can potentially get quite close to the true causal structure simply by learning local causal “families” (a child and its parents). This rough strategy is, of course, not fully fleshed out. We must give a better account of exactly how the composition procedure works, as well as what to do when it doesn’t work. In addition, this strategy is designed to aid computationally limited agents, but currently-developed methods for learning the structure of
causal families are not practical for those types of agents. Nevertheless, some theoretical psychological work suggests that humans might actually learn large-scale causal structures in this compositional manner (e.g., [1]).
4 Efficient Learning through Prior Learning

4.1 SLPR Algorithm
We will not typically be as fortunate as in the examples in Sections 3.2 and 3.3, and we will thus be unable to reach the correct answer using only the edge removal rules. Therefore, we now turn to the second problem: what algorithm will most efficiently learn P attV when given P attA and P attB , as well as complete data over V as inputs? Since we have complete data, we know that we can learn the Markov and faithful pattern for the generating process; this question asks how efficiently we can do it. In this section, we will focus on describing an algorithm that is more efficient than learning P attV completely from scratch, and not attempt to prove that it is the most efficient algorithm possible.2 At the very least, we can (sometimes) improve the efficiency of a learning algorithm by using the rules from Section 3.1 as a preprocessor that removes some edges without checking conditional independencies. This change, however, does not take advantage of all of the information in the input patterns. We could also use the algorithms described in [3] (which enable one to add variables to a graph after learning) by treating the variables in either X or Y (whichever is smaller) as the variables added to the graph. Even this more efficient algorithm is almost certainly sub-optimal, however, since it only uses the edge information in one of P attA and P attB . We propose below the SLPR (Structure Learning using Prior Results) algorithm as more efficient than either of the above proposals. Before describing the algorithm, we must define one term and two variables: (i) T is a separating set for A and B iff A ⊥ B|T ; (ii) SepSet (A, B) stores any separating sets for variables A and B discovered in earlier learning; and (iii) Adj (G, X, Y ) is the set of variables adjacent to either X or Y in the graph G, excluding X and Y themselves. The SLPR algorithm takes an “oracle” as input to (i) enable the algorithm to be applicable to any data over which (conditional) independence is defined; and (ii) avoid discussions of particular statistical tests. SLPR Algorithm Inputs: (i) P attA=X∪M ; (ii) P attB=Y ∪M ; (iii) the SepSet functions from prior learning; and (iv) an “oracle” for determining independence relations. Output: P attV , the Markov and faithful pattern for the generating causal structure for V 2
The discussion below will give us reason to think that this algorithm is close to optimal; we have no proof, however, showing that it actually is optimal.
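Before the numbered steps, the sketch below shows one possible bookkeeping for the quantities just defined: the working graph G as a dictionary of adjacency sets, SepSet as a dictionary keyed by unordered pairs, and the Adj(G, X, Y) helper. The representation and the variable names (taken from the example in Section 1) are assumptions for illustration only.

```python
# Hedged sketch of the bookkeeping manipulated by the steps below.
def complete_graph(variables):
    return {v: set(variables) - {v} for v in variables}

def adj(G, x, y):
    """Adj(G, X, Y): variables adjacent to X or Y in G, excluding X and Y."""
    return (G[x] | G[y]) - {x, y}

def remove_edge(G, x, y):
    G[x].discard(y)
    G[y].discard(x)

V = ["Education", "ParentalEducation", "Income", "Housing", "Age"]
G = complete_graph(V)
sepset = {}                      # e.g. sepset[frozenset(("Housing", "Age"))] = {"Income"}
remove_edge(G, "Housing", "Age")
sepset[frozenset(("Housing", "Age"))] = {"Income"}
print(sorted(adj(G, "Housing", "Age")))
```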
1. Form the complete (undirected) graph G over V. For PattA, if some U and W are non-adjacent, then remove the edge between them in G; otherwise, orient the edge in G as it is oriented in PattA. Perform the same operations for PattB and resolve conflicts by: (a) excluding an edge if at least one pattern excludes it; and (b) making an edge unoriented if the two patterns disagree about its orientation.
2. For all X ∈ X, Y ∈ Y, M ∈ M such that (i) there is a directed path from X (or Y) to M involving only nodes in X (Y); and (ii) Y (X) and M are non-adjacent in G, remove the edge between X and Y.
3. n = 0
   repeat TEST SETS OF SIZE n {
     repeat SELECT PAIR AND CHECK INDEP {
       Select an ordered pair of variables (X, Y) such that (i) X ∈ X, Y ∈ Y; (ii) X and Y are adjacent in G; and (iii) |Adj(G, X, Y)| ≥ n.
       If ∃T ⊆ Adj(G, X, Y) such that (i) |T| = n; and (ii) X ⊥ Y | T, then (i) remove X − Y from G; (ii) record T in SepSet(X, Y) and SepSet(Y, X); and (iii) remove all edges between X and Z ∈ RA(Y, T) (and similarly for Y and RA(X, T)).
     } until all ordered pairs (X, Y) that satisfy the preconditions have been tested.
     n = n + 1
   } until |Adj(G, X, Y)| < n for all appropriate ordered pairs (X, Y).
4. For each triple A, B, C such that (i) A, B are adjacent; (ii) B, C are adjacent; and (iii) A, C are not adjacent, orient A − B − C as A → B ← C iff B ∉ SepSet(A, C).
5. For each pair of adjacent variables U, W (except U ∈ X and W ∈ Y, or vice versa), define C as the union of the containing sets (out of X, Y, M) of U and W; that is, C includes X if U ∈ X or W ∈ X, Y if U ∈ Y or W ∈ Y, and M if U ∈ M or W ∈ M. If there exists a path Q between U and W such that (i) Q = {C1, ..., Cm} ∪ {A1, ..., An}, with ∀i (Ci ∈ C), ∀j (Aj ∈ V \ C); (ii) {Aj} ≠ ∅ (though it is possible that {Ci} = ∅); (iii) every Ci (if any) is possibly a collider and every Aj is possibly a non-collider on Q; and (iv) every Ci (if any) is possibly an ancestor of either U or W, then if ∃T ⊆ Adj(G, U, W) such that ∃i (Ai ∈ T) and U ⊥ W | T, remove U − W from G and record T in SepSet(U, W) and SepSet(W, U).
6. Unorient all of the edges in G and reorient using the following two steps (in order):
   a) For each triple A, B, C such that (i) A, B are adjacent; (ii) B, C are adjacent; and (iii) A, C are not adjacent, orient A − B − C as A → B ← C iff B ∉ SepSet(A, C).
   b) Repeatedly apply the following two rules until no more edges can be oriented:
      i. If (i) A → B − C; and (ii) A, C are not adjacent, then orient B − C as B → C.
      ii. If there is a directed path from A to B, and A and B are adjacent, then orient A − B as A → B.
Step 2 is simply an application of Rule 2. Further note that despite the apparent complexity of the preconditions in that rule, it is computationally quite simple. We simply iterate through the nodes in M and, for each node M ∈ M, determine the non-adjacent nodes X ∈ X (and Y ∈ Y), and then recursively remove edges between the non-adjacent nodes and variable parents (not in SepSet(M, X)), starting with adjacent (to M) variables in Y (X). The path-checking in the preconditions for Step 2 occurs implicitly in this procedure.
Step 3 tests the variables that have not previously appeared in the same dataset. For these pairs of variables, X ∈ X and Y ∈ Y, we need to determine whether they are independent, conditional on some (sub)set of the adjacent variables. We only need to consider adjacent variables because, if X and Y are not adjacent in the true generating structure, then they must be independent conditional on the parent set of at least one of them. Since the edges between a variable and its parents are never removed, the variables currently adjacent to X (and similarly for Y) must be a superset of its parents. Hence, we do not need to condition on all variables, but only those that could possibly be parents of one of the variables.³
Step 5 is necessary because of the possibility that some edges in PattA are present only because of the existence of common causes in V that are outside of A (similarly for PattB and B). For example, if the true generating structure contains X1 ← Y → X2, then there will be an incorrect edge between X1 and X2 in PattA (since they are associated regardless of conditioning set). This step removes those edges that remained because of the restricted variable sets of the initial learning. We do not necessarily need to check all pairs of adjacent variables, but only those with at least one (other) path between them that might explain the presence of the edge. The characteristics of such a possible inducing path are rather complicated, but all necessary [3]. However, we only need to find one such path to trigger the conditional independence checks. Therefore, the added computational burden of path-checking is relatively minor.
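A hedged sketch of the core loop of Step 3 is given below: only pairs with X ∈ X and Y ∈ Y are tested, with conditioning sets of increasing size drawn from Adj(G, X, Y). The independence oracle is any callable indep(x, y, T); the propagation of removals through reachable ancestors and the orientation steps are omitted, so this is an illustration of the control structure rather than a full implementation. The toy oracle in the demonstration is an assumption that simply encodes one true separating set.

```python
# Hedged sketch of the core loop of Step 3 of the SLPR algorithm.
from itertools import combinations

def slpr_step3(G, only_x, only_y, sepset, indep):
    """G maps each variable to its adjacency set; mutates G and sepset in place."""
    def adj(x, y):
        return (G[x] | G[y]) - {x, y}
    n = 0
    while True:
        pairs = [(x, y) for x in only_x for y in only_y
                 if y in G[x] and len(adj(x, y)) >= n]
        if not pairs:
            break                       # |Adj(G, X, Y)| < n for all remaining pairs
        for x, y in pairs:
            if y not in G[x]:
                continue                # edge removed while handling an earlier pair
            for T in combinations(sorted(adj(x, y)), n):
                if indep(x, y, set(T)):
                    G[x].discard(y)
                    G[y].discard(x)
                    sepset[frozenset((x, y))] = set(T)
                    break
        n += 1

# Toy demonstration with an "oracle" that knows the single true separating set.
true_sep = {frozenset(("X1", "Y1")): {"M1"}}
def indep(x, y, T):
    return true_sep.get(frozenset((x, y))) == set(T)

G = {v: {"X1", "Y1", "M1"} - {v} for v in ("X1", "Y1", "M1")}
sepset = {}
slpr_step3(G, only_x={"X1"}, only_y={"Y1"}, sepset=sepset, indep=indep)
print(G, sepset)        # the X1 - Y1 edge is removed with SepSet {M1}
```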
4.2 Complexity Analysis
Recall that the original question for Section 4 centered on finding a more efficient algorithm. In this section, we determine the conditions (for the worst-case) under which the SLPR algorithm is more efficient than a comparable constraint-based algorithm.
³ It is actually possible to make this step slightly more efficient, by considering subsets of variables adjacent to X, followed by subsets of variables adjacent to Y (since we only need to condition on the parents of one of the variables). However, this modification is more difficult to express, more difficult to implement, and makes no difference for the “worst-case” scenario discussed in the complexity analysis of Section 4.2.
This complexity analysis focuses only on the number of conditional independence tests we must perform. Orientation is a computationally simple task compared to the conditional independence tests, and so we should expect the latter to dominate the algorithmic complexity. In a slight abuse of notation (that greatly simplifies our formulae), we define |X| = x, |Y| = y, and |M| = m. Hopefully there will be no confusion with the notation introduced in Section 2.1, since we will not be referring to the values of individual variables in this section. Recall that we are interested in determining whether the SLPR algorithm is more efficient than de novo structure learning. An appropriate reference algorithm is the PC algorithm ([14, pp. 84-88]). If we assume that the degree of the true graph is k (i.e., each variable has at most k adjacent variables), then the “worst-case” complexity of the PC algorithm on S is:

2 \binom{x+y+m}{2} \sum_{i=0}^{k} \binom{x+y+m-2}{i}     (2)
To determine the complexity of the SLPR algorithm, we consider each step individually. Since they do not involve any conditional independence tests, we ignore Steps 1, 2, 4, and 6 of the algorithm. Furthermore, since we want to provide the worst case for the SLPR algorithm in this analysis, we will assume that no edges are removed in Step 2, and that no edges are removed in the last stage of Step 3 (in which the removal of one edge can trigger the removal of other edges). Step 3, in which we determine whether there are edges between X ∈ X and Y ∈ Y, has the following worst-case complexity:

xy \sum_{i=0}^{k} \binom{x+y+m-2}{i}     (3)
The complexity of Step 5 is a bit trickier, since it depends crucially on the particular structure of the graph being learned. The worst-case scenario occurs when every pair of variables we consider is involved in a triangle with a variable outside of that pair’s C, since that maximizes the number of checks we must perform. It is unclear whether such a structure can exist for many values of k. Nevertheless, we want to put SLPR in the most difficult position possible, so we assume that this graph structure is possible. We further define a and b to be the maximum degrees of P attA and P attB , respectively. For this graph structure, an upper bound on the number of conditional independence checks in Step 5 is: [ax + by + m (a + b)]
\sum_{i=0}^{k-1} \binom{x+y+m-3}{i}     (4)
The SLPR algorithm is computationally less complex than the PC algorithm if and only if (2) > (3) + (4). Since the summation in (4) is always strictly less than the summation in the other two equations, SLPR is less complex than PC if (but not only if):
2 \binom{x+y+m}{2} > xy + ax + by + m(a + b)     (5)
Surprisingly, this inequality always holds. Clearly, it is least likely to hold when a and b each have their maximal values (x + m − 1 and y + m − 1, respectively). In fact, inequality (5) does not hold if we simply substitute in these values. However, the m(a + b) term on the right-hand side corresponds to the number of possible pairs (in Step 5) involving a variable in M. Therefore, we cannot simply substitute in the maximal values for a and b, since there is an upper limit to the number of adjacent variables: namely, (x + y + m − 1). When we substitute this lesser value in for (a + b), then (5) reduces to xy > 0, which clearly always holds (since we assume that X and Y are always non-empty). Therefore, the SLPR algorithm is always less computationally complex than the PC algorithm in the worst-case, regardless of the specific underlying parameterization.
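A small numeric illustration of this comparison is given below: expressions (2), (3) and (4) are evaluated for some illustrative parameter values, which are chosen arbitrarily and are not taken from the paper.

```python
# Numeric illustration of the worst-case comparison: evaluate expressions
# (2), (3) and (4) for illustrative (arbitrary) parameter values.
from math import comb

def pc_cost(x, y, m, k):                     # expression (2)
    n = x + y + m
    return 2 * comb(n, 2) * sum(comb(n - 2, i) for i in range(k + 1))

def slpr_step3_cost(x, y, m, k):             # expression (3)
    return x * y * sum(comb(x + y + m - 2, i) for i in range(k + 1))

def slpr_step5_cost(x, y, m, k, a, b):       # expression (4)
    return (a * x + b * y + m * (a + b)) * sum(comb(x + y + m - 3, i) for i in range(k))

x, y, m, k = 6, 5, 4, 3
a, b = 4, 4                                  # maximum degrees of the two sub-patterns
pc = pc_cost(x, y, m, k)
slpr = slpr_step3_cost(x, y, m, k) + slpr_step5_cost(x, y, m, k, a, b)
print(f"PC worst case: {pc},  SLPR worst case: {slpr},  SLPR cheaper: {slpr < pc}")
```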
5 Conclusion
Databases now regularly reach terabyte sizes, and their size arises from both large numbers of datapoints, and large numbers of variables. Hence, for practical reasons, our analyses are often restricted to a subset of the variables in the dataset. Moreover, multiple databases can be more usefully shared if we have efficient algorithms for integrating machine learning outputs for the datasets considered in isolation. Multiple datasets might also face practical barriers to integration (e.g., privacy issues). Hence, there are practical (in addition to purely theoretical) reasons to consider the problems associated with integrating and using learning outputs for multiple overlapping sets of variables. In this paper, we provide two rules for edge presence and absence when integrating two Bayes nets. These rules almost certainly do not exhaust the possible rules; the existence of others, however, remains an open question. Also, as noted earlier, there are additional rules if we have some prior knowledge of the parameterization of the network. Given that we often have some domain-specific knowledge about the types of causation under investigation, further investigation of these rules could have substantial practical impact. The SLPR algorithm provided in Section 4.1 also supports the goal of integrating multiple datasets. However, the practical usefulness of the algorithm awaits a more adequate “expected-case” complexity analysis. The usefulness of the algorithm also depends on whether it is in fact faster when path-checking and orientation steps are taken into consideration. Although those steps are much simpler, they might nonetheless add sufficient time to make the PC algorithm faster. In addition, we can ask whether this is the best we can do; are there more efficient algorithms than SLPR? Most importantly, the robustness of the SLPR algorithm should be fully checked using real-world datasets. The algorithm assumes that the data independencies match the independencies in the true underlying generating structure, and this assumption is often violated by real-world data. Further empirical validation is necessary to determine both the scope of the problems that arise when the above assumption is violated, and also the magnitude of “speed-up” benefit
provided by the SLPR algorithm. This further analysis might also suggest an algorithm that drops the Causal Sufficiency Assumption. Despite these remaining open problems, the rules and algorithm presented in this paper provide an important start on the problem of integrating the causal learning of multiple, overlapping datasets.
References 1. Cheng, Patricia. 1997. “From Covariation to Causation: A Causal Power Theory.” Psychological Review, 104: 367-405. 2. Cooper, Gregory F. 2000. “A Bayesian Method for Causal Modeling and Discovery Under Selection.” In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-2000). 3. Danks, David. 2002. “Efficient Integration of Novel Variables in Bayes Net Learning.” Technical report: Institute for Human & Machine Cognition, University of West Florida. 4. Danks, David, and Clark Glymour. 2001. “Linearity Properties of Bayes Nets with Binary Variables.” In J. Breese & D. Koller, eds. Uncertainty in Artificial Intelligence: Proceedings of the 17th Conference (UAI-2001). San Francisco: Morgan Kaufmann. pp. 98-104. 5. Fienberg, Stephen E., and Sung-Ho Kim. 1999. “Combining Conditional LogLinear Structures.” Journal of the American Statistical Association, 94 (445): 229239. 6. Geiger, Dan, David Heckerman, and Christopher Meek. 1996. “Asymptotic Model Selection for Directed Networks with Hidden Variables.” Microsoft Research Technical Report: MSR-TR-96-07. 7. Glymour, Clark, and Gregory F. Cooper, eds. 1999. Computation, Causation, and Discovery. Cambridge, Mass.: AAAI Press and The MIT Press. 8. Heckerman, David, Dan Geiger, and David M. Chickering. 1994. “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data.” Microsoft Research Technical Report: MSR-TR-94-09. 9. Jordan, Michael I., ed. 1998. Learning in Graphical Models. Cambridge, Mass.: The MIT Press. 10. McKim, Vaughn R., and Stephen P. Turner. 1997. Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences. Notre Dame, Ind.: University of Notre Dame Press. 11. Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco: Morgan Kaufmann Publishers. 12. Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press. 13. Seidenfeld, Teddy, Joseph B. Kadane, and Mark J. Schervish. 1989. “On the Shared Preferences of Two Bayesian Decision Makers.” The Journal of Philosophy, 86 (5): 225-244. 14. Spirtes, Peter, Clark Glymour, and Richard Scheines. 1993. Causation, Prediction, and Search. 2nd edition, 2001. Cambridge, Mass.: AAAI Press and The MIT Press. 15. Williamson, Jon. 2001. “Foundations for Bayesian Networks.” Forthcoming in D. Corfield & J. Williamson, eds. Foundations of Bayesianism. Kluwer Applied Logic Series.
Extraction of Logical Rules from Data by Means of Piecewise-Linear Neural Networks

Martin Holeňa

Institute of Computer Science, Academy of Sciences of the Czech Republic
Pod vodárenskou věží 2, CZ-182 07 Praha 8
[email protected], http://www.cs.cas.cz/~martin
Abstract. The extraction of logical rules from data by means of artificial neural networks is receiving increasingly much attention. The meaning the extracted rules may convey is primarily determined by the set of their possible truth values, according to which two basic kinds of rules can be differentiated – Boolean and fuzzy. Though a wide spectrum of theoretical principles has been proposed for ANN-based rule extraction, most of the existing methods still rely mainly on heuristics. Moreover, so far apparently no particular principles have been employed for the extraction of both kinds of rules, what can be a serious drawback when switching between Boolean and fuzzy rules. This paper presents a mathematically well founded approach based on piecewise-linear activation functions, which is suitable for the extraction of both kinds of rules. Basic properties of piecewise-linear neural networks are reviewed, most importantly, the replaceability of suboptimal computable mappings, and the preservation of polyhedra. Based on those results, a complete algorithm for the extraction of Boolean rules with that approach is given. In addition, two modifications of the algorithm are described, relying on different assumptions about the way how the properties of a polyhedron determine the decision to replace the polyhedron with a hyperrectangle. Finally, a biological application in which the presented approach has been successfully employed is briefly sketched.
1 Introduction
The extraction of knowledge from data by means of artificial neural networks (ANNs) is receiving increasingly much attention in connection with data mining and pattern recognition. Actually, already the mapping learned and computed by the network incorporates knowledge about the implications that certain values of the inputs have for the values of the outputs. Usually, however, ANN-based knowledge extraction aims at the better comprehensible representation of those implications as logical rules [5,7,18,24,25,29]. A large number of ANN-based rule extraction methods have already been proposed. They differ with respect to various aspects, the most important among which is the expressive power of the rules, given by the meaning they are able to convey (cf. the classifications of those methods suggested in [2], [6] and [28]). Though the conveyable meaning of the rules depends also on the syntax of the language underlying the considered logic, which allows to differentiate, e.g., propositional and first-order logic rules, it is primarily determined by the set of possible truth values of the rules. According to this criterion, extracted rules can be divided into two main groups:
– Boolean rules, i.e., formulae of the Boolean logic. They can assume only two different truth values, say true and false. The tertium-non-datum axiom of the Boolean logic implies that if a Boolean rule has been evaluated and has not been found true, then it automatically must be false. That is why methods for the extraction of Boolean rules only need to output rules that, within an apriori set of rules to evaluate, have been found valid in the data [5,18,29]. – Fuzzy rules, i.e., formulae of some fuzzy logic, typically of the product logic, Lukasiewicz logic, G¨odel logic, or some combination of those three. Their truth values can be arbitrary elements of some BL-algebra [10]. In the so far proposed methods, that BL-algebra is always the interval [0, 1] or some subalgebra thereof [6, 7,24,26] (for a survey of those methods, see [25]). The existing ANN-based rule extraction methods are based on a wide spectrum of theoretical principles. So far, apparently no particular principles have been employed both for the extraction of Boolean rules and for the extraction of fuzzy rules. That can be a serious drawback when switching between both kinds of rules, since results obtained with methods that do not share common theoretical fundamentals are difficult to compare. That drawback is further increased by the fact that most of the existing methods rely mainly on heuristics, and their underlying theoretical principles are not very deep. In the present paper, a mathematically well founded paradigm is presented that is suitable for the extraction of both Boolean and fuzzy rules. It is the paradigm of piecewise-linear activation functions, which has been proposed independently in [22] and [14] for the extraction of Boolean rules. In the following section, basic properties of multilayer perceptrons with piecewiselinear activation functions are reviewed. Section 3, which forms the core of the paper, explains using such multilayer perceptrons to extract Boolean rules, more precisely implications of the observational logic. That section is complemented by a sketch of a biological application in Section 4, and by the outline, in Section 5, of the main theoretical principle of a method that uses multilayer perceptrons with piecewise-linear activation functions to extract fuzzy rules, more precisely implications of the Lukasiewicz logic.
2
Piecewise-Linear Neural Networks
Though piecewise-linearity can be studied in connection with any kind of artificial neural network that admits continuous activation functions [12,13], this paper restricts attention to multilayer perceptrons (MLPs). The reason is their popularity in practical applications – both in general and in the specific rule extraction context [2,5,17,24,26,29]. To avoid misunderstanding due to differences encountered in the literature, the adopted definition of a multilayer perceptron is made precise.
Definition 2.1. The term multilayer perceptron (MLP), more precisely timeless fully connected multilayer perceptron, denotes the pair
M = ((n_0, n_1, . . . , n_L), f),   (1)
where
(i) (n_0, n_1, . . . , n_L) ∈ IN^{L+1}, L ∈ IN \ {1}, is called the topology of M, and is given by:
– n_0 input neurons,
– n_L output neurons,
– and n_i hidden neurons in the i-th layer, i = 1, . . . , L − 1.
(ii) f : IR → IR is called the activation function of M. Most generally, it is only required to be nonconstant and Borel-measurable. Typically, however, it has various additional properties.
In the sequel, the focus will be on multilayer perceptrons with one layer of hidden neurons and with a sigmoidal piecewise-linear activation function. Such artificial neural networks will be, for simplicity, called piecewise-linear neural networks. A mapping F : IR^{n_0} → IR^{n_L} is said to be computable by M if it fulfills the following composability condition:
(∀i ∈ {1, . . . , L})(∀j ∈ {1, . . . , n_i})(∃ϕ_{i,j} : IR^{n_{i−1}} → IR)(∃w_{i,j} = (w^0_{i,j}, w^1_{i,j}, . . . , w^{n_{i−1}}_{i,j}) ∈ IR^{n_{i−1}+1})(∀z ∈ IR^{n_{i−1}})
[i < L ⇒ ϕ_{i,j}(z) = f(w_{i,j}·(1, z)) & i = L ⇒ ϕ_{L,j}(z) = w_{L,j}·(1, z)]
& (∀x ∈ IR^{n_0})(∃(z_{0,1}, . . . , z_{0,n_0}, z_{1,1}, . . . , z_{L,n_L}) ∈ IR^{n_0+n_1+···+n_L})
[x = (z_{0,1}, . . . , z_{0,n_0}) & F(x) = (z_{L,1}, . . . , z_{L,n_L}) & (∀i ∈ {1, . . . , L})(∀j ∈ {1, . . . , n_i}) z_{i,j} = ϕ_{i,j}(z_{i−1,1}, . . . , z_{i−1,n_{i−1}})],   (2)
where · denotes the dot product of vectors. The parameters w^1_{i,j}, . . . , w^{n_{i−1}}_{i,j}, i = 1, . . . , L, j = 1, . . . , n_i, in (2) are called weights, and the parameters w^0_{i,j}, i = 1, . . . , L, j = 1, . . . , n_i,
are called thresholds.
Notice that the seemingly complicated condition (2) assures that computable mappings are composed in accordance with the architecture of the multilayer perceptron. This property will be employed below to introduce the notion of replaceability of computable mappings, which will play a crucial role in Proposition 2.4. Being a special case of multilayer perceptrons with one hidden layer, piecewise-linear neural networks inherit the attractive approximation capabilities of such MLPs (see, e.g., [16,19,20]). On the other hand, what cannot be directly taken over are the commonly used methods for training an MLP with a given sequence (x_1, y_1), . . . , (x_m, y_m) of input-output pairs, i.e., for finding a mapping F computable by the MLP and fulfilling
E((y_1, . . . , y_m), (F(x_1), . . . , F(x_m))) = min_{F′∈IF} E((y_1, . . . , y_m), (F′(x_1), . . . , F′(x_m))),   (3)
where IF denotes the set of all mappings computable by the MLP, and E is some cost function, typically the Euclidean norm or its square (the "sum of squares" cost function). The difficulty with training in the case of piecewise-linear neural networks is caused by the discontinuity of the partial derivatives of their activation functions with respect to weights and thresholds, which in turn implies the discontinuity of the partial derivatives of E. Such nonsmooth cost functions are admissible neither in the popular backpropagation method [8,27], nor in more sophisticated methods for neural-network training, such as conjugate-gradient methods, quasi-Newton methods, or the Levenberg-Marquardt method [4,8,9].
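For concreteness, the following sketch shows one possible sigmoidal piecewise-linear activation – a ramp clipped to [0, 1] – and the forward computation of a network with one hidden layer in the sense of condition (2). It is only an illustration; the specific activation, topology and all names are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def ramp(t):
    """A sigmoidal piecewise-linear activation: 0 below 0, the identity on [0, 1], 1 above 1."""
    return np.clip(t, 0.0, 1.0)

class PiecewiseLinearMLP:
    """MLP with one hidden layer; hidden units apply f, output units are affine (cf. (2))."""
    def __init__(self, n0, n1, n2, f=ramp, rng=np.random.default_rng(0)):
        self.f = f
        self.W1 = rng.normal(size=(n1, n0))   # hidden weights w^1_{1,j}, ..., w^{n0}_{1,j}
        self.b1 = rng.normal(size=n1)         # hidden thresholds w^0_{1,j}
        self.W2 = rng.normal(size=(n2, n1))   # output weights
        self.b2 = rng.normal(size=n2)         # output thresholds

    def forward(self, x):
        z1 = self.f(self.W1 @ x + self.b1)    # phi_{1,j}(x) = f(w_{1,j} . (1, x))
        return self.W2 @ z1 + self.b2         # phi_{2,j}(z) = w_{2,j} . (1, z)

net = PiecewiseLinearMLP(n0=4, n1=5, n2=2)
print(net.forward(np.zeros(4)))
```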
Fortunately, this problem can be bypassed. Since all algorithms for the optimization of general continuously-differentiable functions are iterative, after a finite number of iterations they in general find only some suboptimal solution of (3). And it can be shown that suboptimal mappings computable by a piecewise-linear neural network are interchangeable with their counterparts computable by an MLP with a continuous sigmoidal activation function, in the sense formulated below in Definitions 2.2 and 2.3 and Proposition 2.4.
Definition 2.2. Let M_1 = ((n_0, n_1, . . . , n_L), f) and M_2 = ((n_0, n_1, . . . , n_L), g) be two MLPs with the same topology (n_0, n_1, . . . , n_L) and with the activation functions f and g, respectively. Let further, for each i = 1, . . . , L and each j = 1, . . . , n_i, ϕ_{i,j} : IR^{n_{i−1}} → [0, 1] be a mapping defined by
(∀z ∈ IR^{n_{i−1}}) ϕ_{i,j}(z) = f(w_{i,j}·z + θ_{i,j}),   (4)
where w_{i,j} ∈ IR^{n_{i−1}}, θ_{i,j} ∈ IR. Finally, let F be a mapping computable by M_1 and such that
(∀x ∈ [0, 1]^{n_0})(∃(z_{0,1}, . . . , z_{L,n_L}) ∈ IR^{n_0+···+n_L}) x = (z_{0,1}, . . . , z_{0,n_0}) & F(x) = (z_{L,1}, . . . , z_{L,n_L}) & (∀i ∈ {1, . . . , L})(∀j ∈ {1, . . . , n_i}) z_{i,j} = ϕ_{i,j}(z_{i−1,1}, . . . , z_{i−1,n_{i−1}}).   (5)
Then the term counterpart of F in M_2 denotes the mapping G computable by M_2 that fulfills
(∀x ∈ [0, 1]^{n_0})(∃(z_{0,1}, . . . , z_{L,n_L}) ∈ IR^{n_0+···+n_L}) x = (z_{0,1}, . . . , z_{0,n_0}) & G(x) = (z_{L,1}, . . . , z_{L,n_L}) & (∀i ∈ {1, . . . , L})(∀j ∈ {1, . . . , n_i}) z_{i,j} = ψ_{i,j}(z_{i−1,1}, . . . , z_{i−1,n_{i−1}}),   (6)
where for each i = 1, . . . , L and each j = 1, . . . , n_i, ψ_{i,j} : IR^{n_{i−1}} → [0, 1] is a mapping defined by
(∀z ∈ IR^{n_{i−1}}) ψ_{i,j}(z) = g(w_{i,j}·z + θ_{i,j}).   (7)
Reciprocally, F is then called the counterpart of G in M_1. For the origin of the conditions (5) and (6), refer to (2).
Definition 2.3. Let M = ((n_0, n_1, . . . , n_L), f) be an MLP, ε > 0, m ∈ IN be constants, and (x_1, y_1), . . . , (x_m, y_m) ∈ IR^{n_0} × IR^{n_L} be a sequence of input-output pairs. Denote by IF the set of all mappings computable by M, and by ‖·‖ the Euclidean norm. Then a mapping F : IR^{n_0} → IR^{n_L} is called:
a) optimal for ((n_0, n_1, . . . , n_L), f) with respect to (x_1, y_1), . . . , (x_m, y_m) if F ∈ IF and
‖(F(x_1) − y_1, . . . , F(x_m) − y_m)‖ = min_{F′∈IF} ‖(F′(x_1) − y_1, . . . , F′(x_m) − y_m)‖   (8)
b) ε-suboptimal for ((n_0, n_1, . . . , n_L), f) with respect to (x_1, y_1), . . . , (x_m, y_m) if
‖(F(x_1) − y_1, . . . , F(x_m) − y_m)‖ < inf_{F′∈IF} ‖(F′(x_1) − y_1, . . . , F′(x_m) − y_m)‖ + ε   (9)
Proposition 2.4. Let M be an MLP with one layer of hidden neurons, a topology (n_i, n_h, n_o) and a continuous sigmoidal activation function f, and let L be a piecewise-linear neural network with the same topology as M and an activation function g. Finally, let ε > δ > 0, (x_1, y_1), . . . , (x_m, y_m) ∈ IR^{n_i} × IR^{n_o}, and let F : IR^{n_i} → IR^{n_o} be a mapping δ-suboptimal for M with respect to the sequence (x_1, y_1), . . . , (x_m, y_m). Then, provided g is close enough to f in C_ς = {f ∈ C(IR) : f is continuous sigmoidal}, the counterpart of F in L is ε-suboptimal for L with respect to (x_1, y_1), . . . , (x_m, y_m).
This proposition provides the possibility to obtain mappings computable by piecewise-linear neural networks without having to develop specific training methods for them. However, caution should be taken if computational complexity is an issue, since piecewise-linear neural networks have a lower Vapnik-Chervonenkis dimension than MLPs with smooth activation functions [21].
The piecewise-linearity of their activation functions implies that locally, in finitely many separate areas, piecewise-linear neural networks behave like linear operators between the input and output space of the network. This, in turn, means that a piecewise-linear neural network transforms linearly-constrained sets in the input space into linearly-constrained sets in the output space, in particular polyhedra into polyhedra and pseudopolyhedra into pseudopolyhedra (for definitions, see [23]). That result, the importance of which will become apparent in the next section, is formulated in the following proposition.
Proposition 2.5. For n ∈ IN, let the symbols P_n and P̃_n denote the set of all polyhedra in IR^n and the set of all pseudopolyhedra in IR^n, respectively. Let further L be a piecewise-linear neural network with a topology (n_i, n_h, n_o) and an activation function f. Finally, let F : IR^{n_i} → IR^{n_o} be a mapping computable by L, and Q ∈ P̃_{n_o}. Then
(∃r ∈ IN)(∃P_1, . . . , P_r ∈ P̃_{n_i}) Q = F(⋃_{j=1}^{r} P_j).   (10)
Moreover, if Q is a polyhedron, then also P_1, . . . , P_r ∈ P_{n_i} are polyhedra.
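Proposition 2.4 suggests a simple workflow that avoids training piecewise-linear networks directly: train with a smooth sigmoid f and then reuse the trained weights with a piecewise-linear g close to f. The sketch below only illustrates that swap; the particular g (the tangent of the logistic function at 0, clipped to [0, 1]) and all names are assumptions of the sketch, not prescriptions of the paper.

```python
import numpy as np

def logistic(t):                       # smooth sigmoidal activation f
    return 1.0 / (1.0 + np.exp(-t))

def pl_logistic(t):                    # piecewise-linear g close to f: clipped tangent at 0
    return np.clip(0.5 + 0.25 * t, 0.0, 1.0)

def forward(x, W1, b1, W2, b2, act):
    """The mapping computable by the MLP ((n_i, n_h, n_o), act) with the given weights."""
    return W2 @ act(W1 @ x + b1) + b2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.uniform(size=3)
F = forward(x, W1, b1, W2, b2, logistic)      # mapping computable by M (smooth f)
G = forward(x, W1, b1, W2, b2, pl_logistic)   # its counterpart in L (piecewise-linear g)
print(F, G)
```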
3
Extraction of Boolean Rules
Suppose that in any pair ((x^1, . . . , x^{n_i}), (y^1, . . . , y^{n_o})) of input-output values used to train a piecewise-linear neural network, the numbers x^1, . . . , x^{n_i} and y^1, . . . , y^{n_o} are values of variables X^1, . . . , X^{n_i} and Y^1, . . . , Y^{n_o} capturing quantifiable properties of objects in the application domain. Then for each P ∈ P̃_{n_i} and each Q ∈ P̃_{n_o},
the statements (X^1, . . . , X^{n_i}) ∈ P and (Y^1, . . . , Y^{n_o}) ∈ Q are Boolean predicates. Consequently, Proposition 2.5 can be restated as a Boolean implication
(X^1, . . . , X^{n_i}) ∈ ⋃_{j=1}^{r} P_j → (Y^1, . . . , Y^{n_o}) ∈ Q,   (11)
which is equivalent to the conjunction of Boolean implications
(X^1, . . . , X^{n_i}) ∈ P_j → (Y^1, . . . , Y^{n_o}) ∈ Q   (12)
for j = 1, . . . , r. The comprehensibility of (12) is hindered by the fact that P_1, . . . , P_r and Q can be quite complicated general polyhedra. Logicians have for a long time been aware of such difficulties with predicates of a higher arity. Therefore, observational logic, i.e., the branch of Boolean logic devoted to the logical treatment of data analysis, basically deals only with monadic calculi, which contain merely unary predicates [11]. Observe that the n_o-ary predicate (Y^1, . . . , Y^{n_o}) ∈ Q in (12) turns into a conjunction of unary predicates ⋀_{k∈O} Y^k ∈ O_k with O = {k : O_k ≠ IR} if Q is a hyperrectangle with projections O_1, . . . , O_{n_o}; similarly also for the n_i-ary predicates (X^1, . . . , X^{n_i}) ∈ P_j, j = 1, . . . , r. Unfortunately, even if the polyhedron Q in (12) is chosen to be a hyperrectangle, the polyhedra P_j, j = 1, . . . , r, in general do not have that property. To arrive at an implication in which also the predicate (X^1, . . . , X^{n_i}) ∈ P_j for some j = 1, . . . , r is a conjunction of unary predicates, it is necessary to replace P_j with a suitable hyperrectangle in IR^{n_i}. The problem of replacing P_j with a suitable hyperrectangle will be tackled by means of the concept of rectangularization.
3.1
Rectangularization of Polyhedra
The first idea is that the decision whether to replace a polyhedron or pseudopolyhedron P with a hyperrectangle H should rely on our dissatisfaction with that part of P that does not belong to H and that part of H that does not belong to P, i.e., on our dissatisfaction with the symmetric difference P ∆ H of the sets P and H. Let us denote that dissatisfaction µ_P(P ∆ H) and make the following assumptions about the way that dissatisfaction determines the replacement decision:
(i) the dissatisfaction is nonnegative (µ_P(P ∆ H) ≥ 0);
(ii) increasing the set P ∆ H leads to an increased dissatisfaction µ_P(P ∆ H);
(iii) the dissatisfaction µ_P(P ∆ H) is minimal among the dissatisfactions µ_P(P ∆ H′) for hyperrectangles H′ in the considered space;
(iv) for P to be replaceable with H, the dissatisfaction µ_P(P ∆ H) must not exceed some prescribed limit;
(v) to be eligible for replacement, P has to cover at least one point of the available data.
The assumptions (i)–(ii) imply that µ_P is a nonnegative monotone measure on the considered space, such that P ∆ H belongs to its domain for any (pseudo)polyhedron P and any hyperrectangle H, e.g., a nonnegative monotone Borel measure. That measure may possibly depend on the (pseudo)polyhedron P to be replaced. If that space is the input space of the neural network, two measures are particularly attractive:
A. The empirical distribution of the input components (x_1, . . . , x_m) of the sequence of input-output pairs (x_1, y_1), . . . , (x_m, y_m) that were used to train the network (this measure does not depend on the (pseudo)polyhedron to be replaced).
B. The conditional empirical distribution of the input components of the training sequence (x_1, y_1), . . . , (x_m, y_m), conditioned by the (pseudo)polyhedron to be replaced (hence, it depends on that (pseudo)polyhedron).
An important property of the measures A. and B. is that for any (pseudo)polyhedron P in the input space, a hyperrectangle H in that space can be found such that the condition (iv) holds. General nonnegative monotone Borel measures do not have this property. Nevertheless, no matter whether it is really one of those measures that plays the role of µ_P for every (pseudo)polyhedron P, the above conditions (i)–(v) together with the results of Section 2 already allow us to formulate an algorithm for the extraction of observational implications from data by means of piecewise-linear neural networks.
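Under measure A, the dissatisfaction µ_P(P ∆ H) is simply the fraction of training inputs lying in exactly one of P and H. The sketch below illustrates this for a polyhedron given by linear inequalities Ax ≤ b; the particular sets, data and names are assumptions of the sketch.

```python
import numpy as np

def in_polyhedron(X, A, b):
    """Membership of each row of X in the polyhedron {x : A x <= b}."""
    return np.all(X @ A.T <= b, axis=1)

def in_hyperrectangle(X, lo, hi):
    """Membership of each row of X in the hyperrectangle [lo_1, hi_1] x ... x [lo_n, hi_n]."""
    return np.all((X >= lo) & (X <= hi), axis=1)

def dissatisfaction_A(X_train, A, b, lo, hi):
    """Measure A applied to P delta H: empirical mass of the symmetric difference."""
    p = in_polyhedron(X_train, A, b)
    h = in_hyperrectangle(X_train, lo, hi)
    return np.mean(p != h)

rng = np.random.default_rng(0)
X_train = rng.uniform(size=(5000, 2))
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])   # triangle: x1 + x2 <= 1, x1 >= 0, x2 >= 0
b = np.array([1.0, 0.0, 0.0])
print(dissatisfaction_A(X_train, A, b, lo=np.array([0.0, 0.0]), hi=np.array([0.5, 0.5])))
```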
3.2
Implemented Algorithm
Input:
– disjoint sets {X^1, . . . , X^{n_i}}, {Y^1, . . . , Y^{n_o}} of real-valued variables capturing properties of objects in the application domain;
– a set of predicates {Y^k ∈ O_k : k ∈ O}, where ∅ ≠ O ⊆ {1, . . . , n_o}, and for each k ∈ O, O_k is an interval different from the whole IR;
– constants n_h ∈ IN, ε > 0;
– a continuous sigmoidal function f;
– a sequence of input-output pairs (x_1, y_1), . . . , (x_m, y_m) ∈ IR^{n_i} × IR^{n_o};
– a system (µ_P)_{P∈P̃_{n_i}} of nonnegative monotone Borel measures on IR^{n_i}.
2. Initialize the set of extracted Boolean rules to ∅.
3. Construct a hyperrectangle H_o in IR^{n_o} such that for each k ∈ O, the k-th projection of H_o is O_k, and if O ≠ {1, . . . , n_o}, then any remaining projection of H_o is the whole IR.
4. Initialize an MLP M = ((n_i, n_h, n_o), f).
5. Train M with (x_1, y_1), . . . , (x_m, y_m), obtaining a mapping F computable by M.
6. For a piecewise-linear g close enough to f in C_ς, construct the counterpart G of F in ((n_i, n_h, n_o), g).
7. Find P_1, . . . , P_r ∈ P̃_{n_i}, r ∈ IN, such that H_o = G(⋃_{j=1}^{r} P_j).
8. For each j = 1, . . . , r such that (∃p ∈ {1, . . . , m}) x_p ∈ P_j and there exists a hyperrectangle H_j in IR^{n_i} fulfilling
µ_{P_j}(P_j ∆ H_j) = min{µ_{P_j}(P_j ∆ H′) : H′ is a hyperrectangle in IR^{n_i}} ≤ ε :   (13)
(a) Find the intervals I_1, . . . , I_{n_i} such that H_j = I_1 × · · · × I_{n_i}.
(b) Define the set I_j = {k : I_k ≠ IR}.
(c) Add the rule ⋀_{k∈I_j} X^k ∈ I_k → ⋀_{k∈O} Y^k ∈ O_k to the set of extracted Boolean rules.
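The control flow of the above algorithm can be summarised in a few lines; in the sketch below the training of the MLP, the construction of the counterpart and the recovery of the polyhedra P_1, . . . , P_r are passed in as callables because they are implementation-specific. It is an illustrative outline under those assumptions, not the author's implementation.

```python
import numpy as np

def extract_boolean_rules(X, Y, O_intervals, train_mlp, counterpart, preimage_polyhedra,
                          rectangularize, eps):
    """Outline of steps 2-8; `train_mlp`, `counterpart`, `preimage_polyhedra` and
    `rectangularize` are hypothetical callables standing in for the respective steps."""
    rules = set()                                             # step 2
    n_o = Y.shape[1]
    full = (-np.inf, np.inf)                                  # stands for the whole IR
    H_o = [O_intervals.get(k, full) for k in range(n_o)]      # step 3
    F = train_mlp(X, Y)                                       # steps 4-5
    G = counterpart(F)                                        # step 6
    for P in preimage_polyhedra(G, H_o):                      # step 7: H_o = G(union of P_j)
        if not any(P.contains(x) for x in X):                 # step 8: P_j must cover a data point
            continue
        H, mu = rectangularize(P)                             # best hyperrectangle and mu_P(P delta H)
        if mu <= eps:                                         # condition (13)
            antecedent = tuple((k, I) for k, I in enumerate(H) if I != full)   # steps 8(a)-(b)
            rules.add((antecedent, tuple(sorted(O_intervals.items()))))        # step 8(c)
    return rules
```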
3.3
Two Modifications
In this subsection, two modifications of the proposed approach will be described. The first of them concerns only the assumption (iv), i.e., the assumption that the dissatisfaction µ_P(P ∆ H′) as a function of hyperrectangles H′ reaches its minimum for the found hyperrectangle H. The proposed modification is motivated by the fact that the search for such a hyperrectangle H suffers from the "curse of dimensionality" phenomenon. For example, finding the hyperrectangle H if µ_P is any of the measures A. or B. requires computing the value µ_P(P ∆ H′) for O(m^{n_i}) different hyperrectangles H′. To eliminate the curse of dimensionality, the proposed modification attempts to reduce the search for a hyperrectangle H in the input space of the network to the search for intervals corresponding to the individual input dimensions (see the sketch at the end of this subsection). To this end, the assumption (iv) is replaced with the following two assumptions:
(iv'a) the dissatisfaction µ_P is, as a nonnegative monotone Borel measure, decomposable into its one-dimensional projections;
(iv'b) every one-dimensional projection of µ_P(P ∆ H) is minimal among the corresponding one-dimensional projections µ_P(P ∆ H′) for hyperrectangles H′ in the considered space.
This modification does not entail any changes to the algorithm described in the preceding subsection. Provided all the measures µ_P for P ∈ P̃_{n_i} given at the input fulfill the assumption (iv'a), the assumption (iv'b) already implies the validity of (13), thus the algorithm works as before. Examples of measures fulfilling (iv'a) are the following pendants of the above introduced measures A. and B.:
A'. The product of the marginal empirical distributions of the input components of the training sequence (x_1, y_1), . . . , (x_m, y_m). This measure does not depend on the (pseudo)polyhedron to be replaced, and if the marginal empirical distributions of the input components of the training sequence are mutually independent, it coincides with the measure A.
B'. The product of the marginal conditional empirical distributions of the input components of the training sequence (x_1, y_1), . . . , (x_m, y_m), conditioned by the (pseudo)polyhedron to be replaced (hence, it depends on that (pseudo)polyhedron). If the marginal conditional empirical distributions of the input components of the training sequence are mutually independent, this measure coincides with the measure B.
Contrary to this first modification, the second modification concerns directly the starting principle of the proposed rectangularization approach, i.e., the principle that the decision whether to replace P with H should rely on our dissatisfaction with the symmetric difference P ∆ H. This modification is based on the point of view that for the choice of a hyperrectangle H in the input space of the network, more important than a particular polyhedron P is the union of polyhedra ⋃_{i=1}^{r} P_i in that space mapped to a polyhedron in the output space according to Proposition 2.5 (which, in particular, may be the union of polyhedra mapped to a hyperrectangle in the output space, as in Step 7 of the presented algorithm). Hence, the starting principle is modified in such a way that the decision whether to replace a particular P_j, j = 1, . . . , r, with a hyperrectangle H should rely on our dissatisfaction with the symmetric difference (⋃_{i=1}^{r} P_i) ∆ H instead
of our dissatisfaction with P_j ∆ H. Since the polyhedra P_i, i = 1, . . . , r, i ≠ j, can be viewed as a context of the polyhedron P_j, a rectangularization according to this modified principle will be called a contextual rectangularization, while rectangularization according to the original principle will be called context-free rectangularization. Provided the assumptions about the way the dissatisfaction determines the decision whether to replace a particular P_j, j = 1, . . . , r, with a hyperrectangle H are as before, i.e., either the assumptions (i)–(v), or the assumptions (i)–(iii), (iv'a), (iv'b) and (v), this second modification entails also a change to the algorithm described in the preceding subsection. Namely, the condition (13) in Step 8 of the algorithm has to be replaced with the condition
µ_{P_j}((⋃_{i=1}^{r} P_i) ∆ H_j) = min{µ_{P_j}((⋃_{i=1}^{r} P_i) ∆ H′) : H′ is a hyperrectangle in IR^{n_i}} ≤ ε.   (14)
The remaining steps of the algorithm are unchanged.
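As announced above, the first modification can be instantiated, for example, with the decomposable measure A': each input dimension is searched separately, with candidate interval endpoints taken from the training coordinates in that dimension, and the marginal empirical mass of the symmetric difference with the projection of the polyhedron onto that dimension is minimized. The sketch below is only such an illustration (the projection being passed in as an interval is an assumption of the sketch), not the paper's exact procedure.

```python
import numpy as np

def best_interval_1d(coords, proj_lo, proj_hi):
    """One-dimensional rectangularization step: among intervals [a, b] with endpoints on the
    data coordinates, pick the one minimizing the marginal empirical mass of the symmetric
    difference with [proj_lo, proj_hi], the projection of the polyhedron onto this dimension."""
    pts = np.sort(coords)
    in_proj = (pts >= proj_lo) & (pts <= proj_hi)
    best_mass, best_int = np.inf, (proj_lo, proj_hi)
    for i, a in enumerate(pts):
        for b in pts[i:]:
            in_int = (pts >= a) & (pts <= b)
            mass = np.mean(in_proj != in_int)
            if mass < best_mass:
                best_mass, best_int = mass, (a, b)
    return best_int, best_mass

coords = np.random.default_rng(0).uniform(size=200)
print(best_interval_1d(coords, 0.2, 0.7))
```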
4
A Biological Application
The algorithm described in the preceding section, including both presented modifications, has been implemented in Matlab and has already been successfully employed in two real-world applications. One of them, belonging to the area of the biology of biocoenoses [15], will now be briefly sketched.
One very efficient way to increase the suitability of rivers for water transport is building groynes. On the other hand, ecologists often fear the changes in the biocoenosis of the river and its banks to which groynes may lead. This is especially true for rivers in the former communist countries, where ecological aspects used to play only a very subordinate role until the eighties. One of the most prominent examples of such rivers is the Czech and East-German river Elbe. However, it is a matter of fact that the complex relationships between the biocoenosis and the ecological factors characterizing a groyne field are still only poorly understood. Therefore, research was performed in 1998–2000 on the Elbe river, with the objective of investigating those relationships and proposing an empirically proven hydrological model capturing them and allowing one to estimate the changes in the biocoenosis that prospective groynes would cause. Five groyne fields typical for the middle part of the river have been chosen near the town of Neuwerben. In those groyne fields, a large amount of empirical data was collected during 1998–1999. The main part of the collected data is formed by nearly 1000 field samples of benthic fauna and more than 1400 field samples of terrestrial fauna. Each sample includes all animals caught in special traps during some prescribed period of time, ranging from several hours to two days. Simultaneously with collecting those samples, various ecological factors were measured in the groyne fields, e.g., oxygen concentration, diameter of ground grains, and glowable proportion of the ground material, whereas others, such as water level and flow velocity, were computed using a hydrodynamic simulation model. The collected data were, first of all, analysed by biologists with respect to the species contained in them. Then some preprocessing was performed, and finally data mining
was applied to the preprocessed data. It is in the data mining of those samples that the above outlined approach to ANN-based extraction of Boolean rules has been employed, complementing methods of exploratory statistical analysis [14,15]. To this end, an MLP with the topology (6, 4, 20) was constructed in the case of terrestrial data, and an MLP with the topology (4, 5, 9) in the case of benthic data. The input neurons of those MLPs corresponded to selected ecological factors, whereas their output neurons corresponded to selected terrestrial or benthic species and tribes, respectively. The size of the hidden layer was chosen empirically, splitting the available data randomly into a training set and a test set, constructing MLPs with 1–10 hidden neurons, and selecting the one that showed the best performance (in terms of the sum of squared errors) on the test data. The selected MLP was finally retrained with all the available data. An example of results obtained for benthic data is shown in Figures 1–2. The example corresponds to the following particular predicate, input to the algorithm 3.2:
SD(Robackia demeierei) > (1/10) MaxSD(Robackia demeierei),   (15)
where SD(Robackia demeierei) is a variable capturing the surface density of the abundance (number of individuals) of the species "Robackia demeierei" in the current sample, and MaxSD(Robackia demeierei) is the maximum of SD(Robackia demeierei) over all collected samples. Figure 1 depicts a two-dimensional cut of the union of polyhedra in IR^{n_i} found by the algorithm 3.2 for (15). Two of those polyhedra were replaceable by hyperrectangles according to (13); the corresponding projection of the replacing hyperrectangles is depicted in Figure 2.
Fig. 1. A two-dimensional cut of the union of polyhedra in the input space found by the algorithm 3.2 for the predicate (15)
Fig. 2. A two-dimensional projection of hyperrectangles that replace those polyhedra from Figure 1 replaceable according to (13)
5
Extraction of Fuzzy Rules
Whereas the truth of a Boolean predicate concerning input or output variables, such as (15), can be determined as soon as particular values of those variables are given, the situation with fuzzy predicates is different. Consider the following fuzzy generalization of (15):
SD(Robackia demeierei) is not negligible.   (16)
To determine a truth value from [0, 1] of that predicate, it must first be interpreted in some model of the calculus. Let us assume that before a neural network is trained, the interpretation has been performed for all fuzzy predicates concerning either input variables or output variables, no matter whether they have been interpreted already during data collection, or the data have been collected crisp and the predicate interpretation added only during a subsequent fuzzification. Then a piecewise-linear neural network can be trained whose input neurons correspond to predicates concerning input variables rather than to the input variables themselves, whose output neurons correspond to predicates concerning output variables, and in any input-output pair ((x^1, . . . , x^{n_i}), (y^1, . . . , y^{n_o})) ∈ IR^{n_i} × IR^{n_o} used for training, the numbers x^1, . . . , x^{n_i}, y^1, . . . , y^{n_o} are truth values of the involved predicates. This is the approach that will be adopted in the present section. The main theoretical result on which it relies is formulated in the following proposition:
Proposition 5.1. Let a piecewise-linear neural network with a topology (n_i, n_h, n_o) and an activation function f be such that its set IF of computable mappings fulfills
(∀F ∈ IF) F([0, 1]^{n_i}) ⊂ [0, 1]^{n_o}.   (17)
Let further P_1, . . . , P_{n_i+1} be monadic predicates of a Lukasiewicz predicate calculus whose object variables contain at least x_1, . . . , x_{n_i} and y. Then for each F = (F_1, . . . , F_{n_o}), j = 1, . . . , n_o and ε > 0:
a) there exists a rational McNaughton function (for definition, see [1]) F̂_{j,ε} approximating F_j|[0, 1]^{n_i} with precision ε,
b) in the considered predicate calculus, an open formula Φ_{F̂_{j,ε}}(x_1, . . . , x_{n_i}, y) can be constructed whose constituent atomic formulae are only P_1(x_1), . . . , P_{n_i}(x_{n_i}), P_{n_i+1}(y) and such that for each model M of the calculus and each evaluation v of x_1, . . . , x_{n_i}, the following functional relationship among evaluations holds:
‖(∃y)Φ_{F̂_{j,ε}}(x_1, . . . , x_{n_i}, y)‖_{M,v} = F̂_{j,ε}(‖x_1‖_{M,v}, . . . , ‖x_{n_i}‖_{M,v}).   (18)
Proposition 5.1 already provides the possibility to develop an algorithm for the extraction, from the considered piecewise-linear neural network, of implications of Lukasiewicz predicate logic such that the truth functions of their antecedents approximate with arbitrary precision the mappings of the network input space to the individual output neurons. Moreover, the algorithm can be derived from the proofs of theorems in [1] and [3] on which the proof of Proposition 5.1 relies. Nevertheless, it is not this algorithm that is actually being developed. The reason is that using the original algorithm directly for rule extraction would mean facing two serious problems:
(i) The predicate P_{n_i+1} in Proposition 5.1 may be quite arbitrary. It does not have to correspond to any variable, therefore it may lack any conveyable meaning.
(ii) The formulae Φ_{F̂_{j,ε}} in Proposition 5.1 may be quite arbitrary. In particular, they may have an arbitrary length and a quite unrestricted syntax, features each of which alone can make the antecedents of the extracted implications incomprehensible.
Therefore, a modified algorithm is currently being developed, still based on Proposition 5.1 but sacrificing the arbitrary precision of the antecedent formulae from that proposition in favour of their comprehensibility.
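As a small illustration of the interpretation step mentioned at the beginning of this section, the sketch below turns crisp measurements into truth values of the predicate (16) via a membership function and evaluates a Lukasiewicz implication on them. The trapezoidal membership function, its parameters and all names are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def not_negligible(sd, max_sd, lo=0.02, hi=0.10):
    """Assumed truth value of 'SD is not negligible': 0 below lo*max_sd,
    1 above hi*max_sd, and linear in between."""
    return np.clip((sd - lo * max_sd) / ((hi - lo) * max_sd), 0.0, 1.0)

def lukasiewicz_implication(a, b):
    """Truth function of the Lukasiewicz implication a -> b."""
    return np.minimum(1.0, 1.0 - a + b)

sd = np.array([0.0, 0.05, 0.3, 1.0])
antecedent = not_negligible(sd, max_sd=1.0)
print(antecedent)
print(lukasiewicz_implication(antecedent, 0.8))
```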
6
Conclusion
This paper attempted to show that multilayer perceptrons with piecewise-linear activation functions are a promising tool for ANN-based extraction of logical rules from data. A rule-extraction approach using such piecewise-linear neural networks has two advantages – it relies on solid theoretical fundamentals to an extent seldom encountered in the area of ANN-based rule extraction, and it can be used for the extraction of both Boolean and fuzzy implications. In this context, it is worth remarking that experience with the approach in its first practical applications has revealed another valuable property – easy visualisation of the obtained results. Notice also that although the proposed rule extraction approach is based on the piecewise-linearity of activation functions, Proposition 2.4 provides the possibility to use this approach also for multilayer perceptrons with general continuous sigmoidal activation functions, e.g., if a trained MLP of such a general kind is simply given and rules are to be extracted from it. Nevertheless, the approach is still far from maturity. Especially the elaboration of the fuzzy case is only at an early stage and is by far not as well developed as the Boolean case. The main issue here is the tradeoff between the accuracy and the comprehensibility of the formulae forming the antecedents of the extracted rules. It is already clear that we cannot have both. An analogue of the easily comprehensible disjunctive Boolean normal form in the Lukasiewicz logic are disjunctions of Schauder hats [3], but the comprehensibility of Schauder hats is hindered by the necessity to construct them on artificially obtained unimodular triangulations instead of semantically motivated partitions. However, to understand how much accuracy we lose due to particular syntactic and complexity restrictions on the antecedent formulae requires a lot of further research.
Acknowledgements. This research has been supported by the grant 201/00/1489, “Soft Computing”, of the Grant Agency of the Czech Republic, and by the grant B2030007, “Neuroinformatics”, of the Grant Agency of the Czech Academy of Sciences.
References
1. S. Aguzzoli and D. Mundici. Weierstrass approximations by Lukasiewicz formulas with one quantified variable. In 31st IEEE International Symposium on Multiple-Valued Logic, 2001.
2. R. Andrews, J. Diederich, and A.B. Tickle. Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems, 8:378–389, 1995.
3. L.O. Cignoli, I.M.L. D'Ottaviano, and D. Mundici. Algebraic Foundations of Many-valued Reasoning. Kluwer Academic Publishers, Dordrecht, 2000.
4. J.E. Dennis and R.B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, Englewood Cliffs, 1983.
5. W. Duch, R. Adamczak, and K. Grabczewski. Extraction of logical rules from neural networks. Neural Processing Letters, 7:211–219, 1998.
6. W. Duch, R. Adamczak, and K. Grabczewski. A new methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 11:277–306, 2000.
7. G.D. Finn. Learning fuzzy rules from data. Neural Computing & Applications, 8:9–24, 1999.
8. M.T. Hagan, H.B. Demuth, and M.H. Beale. Neural Network Design. PWS Publishing, Boston, 1996.
9. M.T. Hagan and M. Menhaj. Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5:989–993, 1994.
10. P. Hájek. Metamathematics of Fuzzy Logic. Kluwer Academic Publishers, Dordrecht, 1998.
11. P. Hájek and T. Havránek. Mechanizing Hypothesis Formation. Springer Verlag, Berlin, 1978.
12. M. Holeňa. Ordering of neural network architectures. Neural Network World, 3:131–160, 1993.
13. M. Holeňa. Lattices of neural network architectures. Neural Network World, 4:435–464, 1994.
14. M. Holeňa. Observational logic integrates data mining based on statistics and neural networks. In D.A. Zighed, J. Komorowski, and J.M. Żytkow, editors, Principles of Data Mining and Knowledge Discovery, pages 440–445. Springer Verlag, Berlin, 2000.
15. M. Holeňa. Mining rules from empirical data with an ecological application. Technical report, Brandenburg University of Technology, Cottbus, 2002. ISBN 3-934934-07-2, 62 pages.
16. K. Hornik, M. Stinchcombe, H. White, and P. Auer. Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Computation, 6:1262–1275, 1994.
17. P. Howes and N. Crook. Using input parameter influences to support the decisions of feedforward neural networks. Neurocomputing, 24:191–206, 1999.
18. M. Ishikawa. Rule extraction by successive regularization. Neural Networks, 13:1171–1183, 2000.
19. V. Kůrková. Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5:501–506, 1992.
20. M. Leshno, V.Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a non-polynomial activation can approximate any function. Neural Networks, 6:861–867, 1993.
21. W. Maass. Bounds for the computational power and learning complexity of analog neural nets. SIAM Journal on Computing, 26:708–732, 1997.
22. F. Maire. Rule-extraction by backpropagation of polyhedra. Neural Networks, 12:717–725, 1999.
23. E.N. Mayoraz. On the complexity of recognizing regions computable by two-layered perceptrons. Annals of Mathematics and Artificial Intelligence, 24:129–153, 1998.
24. S. Mitra, R.K. De, and S.K. Pal. Knowledge-based fuzzy MLP for classification and rule generation. IEEE Transactions on Neural Networks, 8:1338–1350, 1997.
25. S. Mitra and Y. Hayashi. Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE Transactions on Neural Networks, 11:748–768, 2000.
26. D. Nauck, U. Nauck, and R. Kruse. Generating classification rules with the neuro-fuzzy system NEFCLASS. In Proceedings of the Biennial Conference of the North American Fuzzy Information Processing Society NAFIPS'96, pages 466–470, 1996.
27. D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error backpropagation. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 318–362, 1986.
28. A.B. Tickle, R. Andrews, M. Golea, and J. Diederich. The truth will come to light: Directions and challenges in extracting rules from trained artificial neural networks. IEEE Transactions on Neural Networks, 9:1057–1068, 1998.
29. H. Tsukimoto. Extracting rules from trained neural networks. IEEE Transactions on Neural Networks, 11:333–389, 2000.
Structuring Neural Networks through Bidirectional Clustering of Weights
Kazumi Saito¹ and Ryohei Nakano²
¹ NTT Communication Science Laboratories, NTT Corporation, 2-4 Hikaridai, Seika, Soraku, Kyoto 619-0237, Japan
[email protected]
² Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan
[email protected]
Abstract. We present a method for succinctly structuring neural networks having a few thousand weights. Here structuring means weight sharing, where weights in a network are divided into clusters and weights within the same cluster are constrained to have the same value. Our method employs a newly developed weight sharing technique called bidirectional clustering of weights (BCW), together with second-order optimal criteria for both cluster merge and split. Our experiments using two artificial data sets showed that the BCW method works well to find a succinct network structure from an original network having about two thousand weights in both regression and classification problems.
1
Introduction
In knowledge discovery using neural networks, an important and challenging research issue is to automatically find a succinct network structure from data. As a technique for such structuring, we focus on weight sharing [1,5]. Weight sharing means constraining the choice of weight values such that weights in a network are divided into clusters, and weights within the same cluster are constrained to have the same value, called a common weight. If a common weight value is very close to zero, then all the corresponding weights can be removed as irrelevant, which is called weight pruning. By virtue of weight sharing and weight pruning, a neural network will have as simple a structure as possible, which greatly benefits knowledge discovery from data. In fact, there exist several types of important knowledge discovery problems in which a neural network with shared weights plays an essential role. For instance, finding multivariate polynomial-type functions from data plays a central role in many scientific and engineering domains. As one approach to solving this type of regression problem, we have investigated three-layer neural networks [8]. The weight sharing and weight pruning stated above will bring us much clearer solutions to such problems than can be obtained without the technique. As another instance, it is widely recognized that m-of-n rules [12] are useful for solving certain classification problems. The conditional part of an m-of-n rule is
satisfied when at least m of its n conditions are satisfied. By representing each condition as a binary variable, we can naturally express this type of rule as a form of linear threshold function whose corresponding coefficients are the same, as illustrated by the sketch below. In this paper, we present a weight sharing method for succinctly structuring neural networks. This method employs a newly developed technique called bidirectional clustering of weights, together with second-order optimal criteria for both cluster merge and split. Although our method is potentially applicable to a wider variety of network structures, including recurrent neural networks, we focus on three-layer feedforward neural networks in our experiments in order to evaluate its basic capabilities.
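For instance – a toy illustration with assumed names, not code from the paper – a 2-of-6 rule over binary conditions can be written as a threshold test on a sum with identical coefficients:

```python
import numpy as np

def m_of_n(q, m):
    """Truth of an m-of-n rule: at least m of the binary conditions q are satisfied.
    All conditions receive the same coefficient (here 1), i.e. a linear threshold function."""
    q = np.asarray(q)
    return int(q.sum() >= m)          # equivalently: step(1*q_1 + ... + 1*q_n - m)

print(m_of_n([1, 0, 1, 0, 0, 0], m=2))   # 1: two conditions hold
print(m_of_n([1, 0, 0, 0, 0, 0], m=2))   # 0: only one condition holds
```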
2
Weight Sharing Method
2.1
Basic Definitions
Here we introduce some notation to explain our problem formally. Let E(w) be an error function to minimize, where w = (w_1, · · · , w_d, · · · , w_D)^T denotes a vector of weights in a neural network and a^T is the transposed vector of a. Then, we define a set of clusters Ω(G) = {S_1, · · · , S_g, · · · , S_G}, where S_g denotes a set of weights such that S_g ≠ ∅, S_g ∩ S_{g′} = ∅ (g ≠ g′) and S_1 ∪ · · · ∪ S_G = {w_1, · · · , w_D}. Also, we define a vector of common weights u = (u_1, · · · , u_g, · · · , u_G)^T associated with a cluster set Ω(G) such that w_d = u_g if w_d ∈ S_g. Note that û is obtained by training a neural network whose structure is defined by Ω(G).
Now we consider the relation between w and u. Let e^D_d be the D-dimensional unit vector whose elements are all zero except for the d-th element, which is equal to unity. Then the original weight vector w can be expressed by using a D × G transformational matrix A as follows:
w = Au,   A = ( ∑_{d:w_d∈S_1} e^D_d, · · · , ∑_{d:w_d∈S_G} e^D_d ).   (1)
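The relation w = Au of Eq. (1) is easy to write down explicitly; the sketch below (an illustration with 0-based indexing and assumed values) builds A from a cluster assignment and recovers the full weight vector from the common weights.

```python
import numpy as np

def transformation_matrix(cluster_of, G):
    """A is D x G with A[d, g] = 1 iff weight w_d belongs to cluster S_g, as in Eq. (1)."""
    D = len(cluster_of)
    A = np.zeros((D, G))
    A[np.arange(D), cluster_of] = 1.0
    return A

cluster_of = np.array([0, 2, 0, 1, 2])   # weight w_d belongs to cluster S_{cluster_of[d]}
u = np.array([0.5, -1.2, 0.0])           # common weights u_1, ..., u_G
A = transformation_matrix(cluster_of, G=3)
print(A @ u)                             # w = A u, i.e. w_d = u_{cluster_of[d]}
```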
Note that the transformational matrix A is identified with the cluster set Ω(G). Therefore, our clustering problem is to find Ω(G∗) which minimizes E(Au), where G∗ is a predetermined number of clusters. Here we outline the basic idea of our method, called bidirectional clustering of weights (BCW). Since a weight clustering problem will have many local optima, the BCW method is implemented as an iterative method in which cluster merge and split operations are repeated in turn until convergence. In order to obtain good clustering results, the BCW method must be equipped with a reasonable criterion for each operation. To this end, we derive second-order optimal criteria with respect to the error function of neural networks. Incidentally, the BCW is morphologically similar to the SMEM algorithm [13] in that both have split and merge operations to overcome local optima, although the SMEM solves quite a different problem of finding the global maximum likelihood (ML) solution in incomplete data situations.
2.2
Bottom-up Clustering
A one-step bottom-up clustering is to transform Ω(G) into Ω(G − 1) by a merge operation; i.e., a pair of clusters S_g and S_{g′} is merged into a single cluster S_g ∪ S_{g′}. Clearly, we want to select a suitable pair of clusters so as to minimize the increase of the error function. One may think that this could be implemented by direct evaluation, i.e., by training a series of neural networks defined by merging each pair of clusters. Obviously, such an approach would be computationally demanding. As another idea, by focusing on the common weight vector u, we could select the pair of clusters which minimizes (u_g − u_{g′})^2. However, this approach does not directly address minimizing the increase of the error function.
For a given pair of clusters S_g and S_{g′}, we derive the second-order optimal criterion with respect to the error function. The second-order Taylor expansion of E(Au) around u gives the following:
E(A(u + ∆u)) − E(Au) ≈ g(w)^T A∆u + (1/2) ∆u^T A^T H(w) A∆u,   (2)
where g(w) and H(w) denote the gradient and Hessian of the error function with respect to w, respectively. Let û be a trained common weight vector; then from the local optimality condition we have A^T g(Aû) = 0. Now we consider ∆u that minimizes the right-hand side of Eq. (2) under the following constraint imposed by merging S_g and S_{g′}:
(û + ∆u)^T e^G_g = (û + ∆u)^T e^G_{g′}.   (3)
By using the Lagrange multiplier method, we obtain the minimal value of the right-hand side of Eq. (2) as follows, where ŵ = Aû:
min (1/2) ∆u^T A^T H(ŵ) A∆u = (û_g − û_{g′})^2 / ( 2 (e^G_g − e^G_{g′})^T (A^T H(ŵ)A)^{−1} (e^G_g − e^G_{g′}) ).   (4)
This is regarded as the second-order optimal criterion for merging S_g and S_{g′}, and we can define the dissimilarity as follows:
DisSim(S_g, S_{g′}) = (û_g − û_{g′})^2 / ( (e^G_g − e^G_{g′})^T (A^T H(ŵ)A)^{−1} (e^G_g − e^G_{g′}) ).   (5)
Based on the above criterion, the one-step bottom-up clustering with retraining selects the pair of clusters which minimizes DisSim(S_g, S_{g′}), and merges these two clusters into one. After the merge, the network with Ω(G − 1) is retrained.
When we want to merge a few thousand clusters, retraining the network at each step would be computationally demanding. To cope with such cases, we consider a multi-step bottom-up clustering without retraining by using the following average dissimilarity [3]:
AvgDisSim(S_c, S_{c′}) = (1/(n_c n_{c′})) ∑_{S_g ⊂ S_c} ∑_{S_{g′} ⊂ S_{c′}} DisSim(S_g, S_{g′}),   (6)
where n_c is the number of clusters merged into S_c.
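Given the trained common weights û, the transformation matrix A and the Hessian H(ŵ) of the error with respect to the full weight vector, the dissimilarity (5) of every cluster pair can be computed from the reduced G × G matrix A^T H(ŵ) A. A minimal sketch under the assumptions that the Hessian is available and that this reduced matrix is invertible (all values below are made up for illustration):

```python
import numpy as np

def merge_dissimilarities(u_hat, A, H):
    """DisSim(S_g, S_g') of Eq. (5) for all cluster pairs g < g'."""
    G = len(u_hat)
    M_inv = np.linalg.inv(A.T @ H @ A)          # (A^T H(w_hat) A)^{-1}
    dissim = {}
    for g in range(G):
        for gp in range(g + 1, G):
            e = np.zeros(G)
            e[g], e[gp] = 1.0, -1.0             # e^G_g - e^G_g'
            dissim[(g, gp)] = (u_hat[g] - u_hat[gp]) ** 2 / (e @ M_inv @ e)
    return dissim

rng = np.random.default_rng(0)
A = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
R = rng.normal(size=(5, 5))
H = R @ R.T + 5 * np.eye(5)                     # a made-up positive-definite Hessian
print(merge_dissimilarities(np.array([0.5, -1.2, 0.0]), A, H))
```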
2.3
Top-down Clustering
A one-step top-down clustering is to transform Ω(G) into Ω(G + 1) by a split operation; i.e., a single cluster S_g is split into two clusters S′_g and S_{G+1}, where S_g = S′_g ∪ S_{G+1}. In this case, we want to select a suitable cluster and its partition so as to maximize the decrease of the error function. To our knowledge, previous work has not addressed this type of problem in the context of weight sharing.
For a given cluster S_g and its partition {S′_g, S_{G+1}}, we derive the second-order optimal criterion with respect to the error function. Again, let û be a trained common weight vector. Just after the splitting, we have a (G + 1)-dimensional common weight vector v̂ = (û^T, û_g)^T, and a new D × (G + 1) transformational matrix B defined as
B = ( ∑_{d:w_d∈S_1} e^D_d, · · · , ∑_{d:w_d∈S′_g} e^D_d, · · · , ∑_{d:w_d∈S_G} e^D_d, ∑_{d:w_d∈S_{G+1}} e^D_d ).   (7)
The second-order Taylor expansion of E(Bv̂) around v̂ gives the following:
E(B(v̂ + ∆v)) − E(Bv̂) ≈ g(Bv̂)^T B∆v + (1/2) ∆v^T B^T H(Bv̂) B∆v.   (8)
Here we consider ∆v that minimizes the right-hand side of Eq. (8). Its local optimality condition does not hold anymore; i.e., B^T g(Bv̂) ≠ 0. Instead, we have
B^T g(Bv̂) = κf,   κ = g(Bv̂)^T ∑_{d:w_d∈S_{G+1}} e^D_d,   f = e^{G+1}_g − e^{G+1}_{G+1},   (9)
from the following optimality condition on û:
0 = g(Bv̂)^T ∑_{d:w_d∈S_g} e^D_d = g(Bv̂)^T ( ∑_{d:w_d∈S′_g} e^D_d + ∑_{d:w_d∈S_{G+1}} e^D_d ).   (10)
Therefore, by substituting Eq. (9) into Eq. (8), we obtain the minimal value of the right-hand side of Eq. (8) as follows:
min { g(Bv̂)^T B∆v + (1/2) ∆v^T B^T H(Bv̂) B∆v } = −(1/2) κ^2 f^T (B^T H(Bv̂)B)^{−1} f.   (11)
This is regarded as the second-order optimal criterion for splitting S_g into S′_g and S_{G+1}, and we can define the general utility as follows:
GenUtil(S′_g, S_{G+1}) = κ^2 f^T (B^T H(Bv̂)B)^{−1} f.   (12)
Note that the utility values are positive.
When a cluster has m elements, the number of different splittings amounts to (2^m − 2)/2 = 2^{m−1} − 1. This means an exhaustive search suffers from combinatorial explosion. Since we consider a bidirectional search, as shown next, we don't have to do an exhaustive search in the splitting. Thus, a simple splitting will do in our case; that is, the splitting removes only one element (weight) from a cluster. Accordingly, by assuming w_d ∈ S_g, we can define the utility as follows:
Util(S_g, {w_d}) = (g(Bv̂)^T e^D_d)^2 f^T (B^T H(Bv̂)B)^{−1} f.   (13)
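Expression (13) can be evaluated per weight by explicitly forming the matrix B obtained after splitting w_d out of its cluster. The sketch below assumes that the full gradient and Hessian at the shared-weight solution are available; the toy values are made up for illustration.

```python
import numpy as np

def split_utility(d, cluster_of, grad_w, H):
    """Util(S_g, {w_d}) of Eq. (13) for removing weight w_d from its cluster S_g."""
    D, G = len(cluster_of), int(cluster_of.max()) + 1
    g = cluster_of[d]
    assign = cluster_of.copy()
    assign[d] = G                                   # w_d becomes the new cluster S_{G+1}
    B = np.zeros((D, G + 1))
    B[np.arange(D), assign] = 1.0
    f = np.zeros(G + 1)
    f[g], f[G] = 1.0, -1.0                          # f = e^{G+1}_g - e^{G+1}_{G+1}
    kappa = grad_w[d]                               # g(B v_hat)^T e^D_d
    return kappa ** 2 * f @ np.linalg.inv(B.T @ H @ B) @ f

rng = np.random.default_rng(1)
cluster_of = np.array([0, 1, 0, 1, 0, 1])           # every cluster keeps >= 2 weights before a split
R = rng.normal(size=(6, 6))
H = R @ R.T + 5 * np.eye(6)                         # made-up positive-definite Hessian
grad_w = rng.normal(size=6)                         # made-up gradient
print([round(split_utility(d, cluster_of, grad_w, H), 4) for d in range(6)])
```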
Based on the above criterion, the one-step top-down clustering with retraining selects the combination of a cluster S_g and its element w_d which maximizes Util(S_g, {w_d}), and splits such S_g into two clusters S_g − {w_d} and {w_d}. After the splitting, the network with Ω(G + 1) is retrained.
2.4
Bidirectional Clustering of Weights
In general, there exist many local optima for a clustering problem. The single usage of either the bottom-up or the top-down clustering will get stuck at a local optimum. Thus, we consider an iterative usage of both clusterings, proposing the following method called bidirectional clustering of weights (BCW). The initial set of clusters should be Ω(D) = {S_1, · · · , S_D}, where S_d = {w_d}. Note that there are two control parameters G and h: the former denotes the final number of clusters, and the latter is the depth of the bidirectional search.
Bidirectional Clustering of Weights (BCW):
step 1: Compute Ω_1(G) from the initial clusters Ω(D) by performing the (D − G)-step bottom-up clustering without retraining.
step 2: Compute Ω(G + h) from Ω(G) by repeatedly performing the one-step top-down clustering with retraining.
step 3: Compute Ω_2(G) from Ω(G + h) by repeatedly performing the one-step bottom-up clustering with retraining.
step 4: If E(û; Ω_2(G)) ≥ E(û; Ω_1(G)), then stop with Ω_1(G) as the final solution; otherwise, set Ω_1(G) = Ω_2(G) and go to step 2.
Note that the above method always converges, since the search is repeated only when the error value E decreases monotonically. The remaining thing we have to do is to find the optimal number G∗ of common weights, and to determine a reasonable value for h. Some domain knowledge may indicate a reasonable G∗. In general, however, we don't know the optimal G∗ in advance. In such cases a reasonable way to decide G∗ is to find the G that minimizes the generalization error. Here generalization means the performance on new data. When we can assume G is among very small integers, it suffices to employ a brute-force approach of trying G = 1, 2, . . . . When such an approach is too computationally expensive, a more effective method to get the optimal G is a binary search, halving the search area each time, on the assumption that the generalization error is unimodal with respect to G. As for h, the larger h is, the better the solution will be, but the more computation is required. However, there will be a saturation point in this tendency. We adopted h = 10 in the following experiments.
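The outer control structure of the BCW method is short; the sketch below shows only that loop, with the one-step clusterings and the retrained error E(û; Ω) passed in as callables, since those depend on the network and data at hand. It is an illustrative outline under those assumptions, not the authors' implementation.

```python
def bcw(initial_clusters, G, h, multi_step_bottom_up, one_step_top_down,
        one_step_bottom_up, error_of):
    """Bidirectional clustering of weights, steps 1-4; the four callables are hypothetical
    stand-ins for the clustering operations and for the retrained error E(u_hat; Omega)."""
    omega1 = multi_step_bottom_up(initial_clusters, G)    # step 1: D -> G, no retraining
    while True:
        omega = omega1
        for _ in range(h):                                # step 2: G -> G + h, with retraining
            omega = one_step_top_down(omega)
        for _ in range(h):                                # step 3: G + h -> G, with retraining
            omega = one_step_bottom_up(omega)
        if error_of(omega) >= error_of(omega1):           # step 4: stop when E no longer drops
            return omega1
        omega1 = omega
```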
3
Application to Regression Problem
3.1
Polynomial Discovery
As one application of the BCW method, we consider a regression problem to find the following polynomial made of multiple variables:
f(x; w) = w_0 + ∑_{j=1}^{J} w_j ∏_k x_k^{w_{jk}},   (14)
where x = (x_1, · · · , x_k, · · · , x_K)^T (∈ R^K) is a vector of numeric explanatory variables. By assuming x_k > 0, we rewrite it as follows:
f(x; w) = w_0 + ∑_{j=1}^{J} w_j exp( ∑_k w_{jk} ln x_k ).   (15)
The right-hand side can be regarded as a feedforward computation of a three-layer neural network [5] having J hidden units with w_0 as a bias term. Note that the activation function of a hidden unit is exponential. A regression problem requires us to estimate f(x; w) from training data {(x^µ, y^µ) : µ = 1, · · · , N}, where y (∈ R) denotes a numeric target variable. The following sum-of-squared error is employed as an error function for this regression problem:
E(w) = ∑_{µ=1}^{N} (y^µ − f(x^µ; w))^2.   (16)
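Equation (15) can be read directly as a forward pass with exponential hidden units; the sketch below (illustrative only – names, shapes and values are assumptions) computes it for a small example:

```python
import numpy as np

def poly_net(x, w0, w, W):
    """Forward pass of the three-layer network of Eq. (15):
    hidden unit j computes exp(sum_k W[j, k] * ln x_k) = prod_k x_k ** W[j, k]."""
    hidden = np.exp(W @ np.log(x))     # exponential activation, x_k > 0 assumed
    return w0 + w @ hidden

# toy weights realizing y = 2 + 3*x1*x2 + 4*x3*x4*x5
x = np.array([0.5, 0.2, 0.9, 0.4, 0.8])
W = np.array([[1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1]], dtype=float)
print(poly_net(x, w0=2.0, w=np.array([3.0, 4.0]), W=W))
print(2 + 3 * x[0] * x[1] + 4 * x[2] * x[3] * x[4])   # same value
```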
3.2
Experiment Using Artificial Data Set
Our regression problem is to find the following polynomial used in [10]:
y = 2 + 3x_1x_2 + 4x_3x_4x_5.   (17)
Here we introduce lots of irrelevant explanatory variables: 995 irrelevant ones. For each sample, each variable value is randomly generated in the range (0, 1), while the corresponding value of y is calculated following Eq. (17). The size of the training data is 5,000 (N = 5,000), and the size of the test data is 2,000. Gaussian noise with a mean of 0 and a standard deviation of 0.1 was added to the training data only. The initial values for the weights w_{jk} are independently generated according to a normal distribution with a mean of 0 and a standard deviation of 1; the weights w_j are initially set equal to zero, but the bias w_0 is initially set to the average output over all training samples. The iteration is terminated when the gradient vector is sufficiently small, i.e., each element of the vector is less than 10^{−6}. In this experiment, the number of hidden units was set to 2. The optimal number of hidden units can be found by employing a model selection technique such as cross-validation. Moreover, weight sharing was applied only to the weights w_{jk} from the input layer to the hidden layer in a three-layer neural network. Thus, in our regression problem, the true G∗ is two, where the common weight values are 1 and zero.
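The artificial data set described above is easy to regenerate in this spirit; the sketch below follows the stated recipe (1,000 input variables of which 995 are irrelevant, targets from Eq. (17), Gaussian noise on the training targets only) but is an illustration, not the authors' original script:

```python
import numpy as np

def make_data(n_samples, n_vars=1000, noise_sd=0.0, seed=0):
    """Inputs uniform on (0, 1); y = 2 + 3*x1*x2 + 4*x3*x4*x5; only the first five of the
    n_vars variables are relevant, the rest are irrelevant."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_samples, n_vars))
    y = 2 + 3 * X[:, 0] * X[:, 1] + 4 * X[:, 2] * X[:, 3] * X[:, 4]
    if noise_sd > 0:
        y = y + rng.normal(scale=noise_sd, size=n_samples)   # noise on training targets only
    return X, y

X_train, y_train = make_data(5000, noise_sd=0.1, seed=1)     # noisy training set
X_test, y_test = make_data(2000, seed=2)                     # noise-free test set
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```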
Table 1. G∗ comparison for the regression problem
number of clusters: G∗     2        3        5        10
RMSE for training data     0.1007   0.0995   0.0962   0.0959
RMSE for test data         0.0116   0.0206   0.0373   0.0384
Table 2. Comparison of generalization error for the regression problem
run ID number      1       2       3       4       5
neural network     1.426   3.680   4.271   1.395   5.486
SBCW method        0.420   0.012   0.910   0.420   0.906
BCW method         0.012   0.012   0.012   0.012   0.012
3.3
Experimental Results
Table 1 shows how the performance changes with different G∗, where RMSE means root-mean-squared error. We can see that the training RMSE decreases with the growth of G∗, and the case G∗ = 2 shows the best generalization performance. Recall that there is no noise on the test data because we wanted to directly measure the closeness between the true polynomial and the obtained ones. To evaluate the usefulness of the BCW method, we compared its generalization performance with those of a three-layer neural network and a simplified version of the BCW, called SBCW. The SBCW method performs only the multi-step bottom-up clustering and can be regarded as a straightforward version of existing methods such as Towell and Shavlik's clustering method [12] or Hassibi et al.'s network pruning method (OBS) [4]. This is because they did not consider either the second-order optimal criteria for cluster merge and split operations, or bidirectional search for improving the results. Thus, the SBCW method can be regarded as an OBS-like method. Table 2 shows experimental results when G∗ = 2. Here generalization is evaluated by the RMSE for the test data. We can see that the BCW method showed the same performance for each run, while the other two were to some degree behind it. Figure 1 shows how errors change during the bidirectional clustering under the condition that G∗ = 2. The solid and dotted lines indicate RMSEs for training data and test data, respectively. Since the depth of the bidirectional clustering was 10 (h = 10), the top-down clustering was done for the first 10 iterations, the bottom-up clustering was performed for the next 10 iterations, and the first cycle of the BCW was completed at the 20-th iteration. The BCW was terminated at the 40-th iteration since the second cycle could not improve the training error. This experiment shows that the depth h (= 10) was large enough. The following shows the final function obtained after the BCW converged under the condition that G∗ = 2. Note that a near-zero common weight −0.0002 was set equal to zero, and the other common weight obtained after retraining
Fig. 1. Bidirectional clustering for regression problem (RMSE on a logarithmic scale vs. iteration number; solid line: training error, dotted line: generalization error)
Table 3. Computational complexity for regression problem (sec.)
run ID number          1      2      3      4      5      average
networks learning      1687   1579   1862   2134   1287   1709.8
Hessian inversion      342    327    279    291    277    303.2
BCW processing time    240    298    288    103    222    230.2
was 0.997. We can see that the significant weights w_{11}, w_{12}, w_{23}, w_{24}, and w_{25} belong to the same cluster, and that a function almost equivalent to the original was found:
y = 1.995 + 3.006 x_1^{0.997} x_2^{0.997} + 4.007 x_3^{0.997} x_4^{0.997} x_5^{0.997}.   (18)
Table 3 shows the computational complexity of neural network learning, Hessian inversion and processing time for the BCW method; the total CPU time required for each G∗ was about 40 minutes, of which 75% was used for neural network learning. The experiment was done using PCs with 2 GHz Pentium processors.
4
Application to Classification Problem
4.1
m-of-n Rule Discovery
As another type of application of the BCW method, we consider a classification problem which can be effectively solved by using m-of-n rules. As described
previously, the conditional part of an m-of-n rule is satisfied when at least m of its n atomic propositions are satisfied. When m = n, the rule is conjunctive, and when m = 1, the rule is disjunctive. When m < n and m > 1, the rule requires a very complex DNF if one tries to describe it equivalently. We employ the following form of standard three-layer neural networks:
f(x; w) = σ( w_0 + ∑_{j=1}^{J} w_j σ( ∑_k w_{jk} q_k ) ),   (19)
where q = (q_1, · · · , q_k, · · · , q_K) ∈ {0, 1}^K denotes a vector of binary explanatory variables. The activation function of the output unit is sigmoidal, σ(u) = 1/(1 + e^{−u}), so that the output value is confined to the range (0, 1). A classification problem requires us to estimate f from training data {(q^µ, y^µ) : µ = 1, · · · , N}, where y ∈ {0, 1} is a binary target variable. The following cross-entropy error (CE) is employed as an error function for this classification problem:
E(w) = − ∑_{µ=1}^{N} ( y^µ log f(q^µ; w) + (1 − y^µ) log(1 − f(q^µ; w)) ).   (20)
4.2
Experiment Using Extended Monk’s Problem 2
The Monk's problems [11] treat an artificial robot domain, where robots are described by the following six nominal variables:
var1: head-shape ∈ {round, square, octagon}
var2: body-shape ∈ {round, square, octagon}
var3: is-smiling ∈ {yes, no}
var4: holding ∈ {sword, balloon, flag}
var5: jacket-color ∈ {red, yellow, green, blue}
var6: has-tie ∈ {yes, no}
The learning task is binary classification, and there are three Monk's problems, each of which is given by a logical description of a positive class. Here we consider only problem 2, since problems 1 and 3 are given in a standard DNF (disjunctive normal form), but problem 2 is given by the following strict 2-of-6 rule: exactly two of the six nominal variables have their first value. For example, one positive sample is a robot whose head-shape and body-shape are round, which is not smiling, holding no sword, having no tie, and whose jacket-color is not red. Here the binary variables q_k, k = 1, · · · , 17, are ordered as follows: q_1, q_2, and q_3 correspond to the cases head-shape = round, head-shape = square, and head-shape = octagon, respectively, q_4 corresponds to body-shape = round, and so on. Thus, the strict 2-of-6 rule is equivalent to the following equation:
q_1 + q_4 + q_7 + q_9 + q_{12} + q_{16} = 2.   (21)
Table 4. G∗ comparison for the extended Monk's problem 2
number of clusters: G∗         2       3       4       5       10
average CE for training data   0.561   0.403   0.382   0.379   0.367
accuracy for test data         0.819   1.000   1.000   1.000   1.000
Note that such a rule requires a very complex DNF if one tries to describe it equivalently. From the 432 possible samples, the designated 169 samples are used for learning, as instructed in the UCI Machine Learning Repository [2]. In the original problem there is no noise assumed. Many famous machine learning algorithms could not solve this problem well, showing rather poor generalization [11]. We extend the Monk's problem 2 in the following two respects. One is to introduce lots of irrelevant variables, i.e., 100 irrelevant nominal variables having two discrete values, another 100 having three discrete values, and 100 more having four discrete values, which means 900 irrelevant binary (category) variables. Since the Monk's problem 2 has 17 significant binary variables, the total number of binary variables amounts to 917. The other is to introduce 10% noise to make the learning harder. Here 10% noise means a correct binary target value is reversed with probability 0.1. We set the sizes of the training and test data to be 30 times as large as in the original problem by varying the values of the newly added variables. Thus, the size of the training data is 5,070 (N = 5,070), and the size of the test data is 7,890. Note that values for the irrelevant nominal variables were randomly assigned for both training and test data, but the 10% noise was added only to the training data. The number of hidden units was set equal to 2. Again the optimal number of hidden units can be found by employing a model selection technique such as cross-validation. Moreover, weight sharing was restricted to the weights w_{jk} from input to hidden.
4.3
4.3 Experimental Results
Table 4 shows how the performance changes with different G∗, where the average CE means the cross-entropy error per training sample, and generalization performance is evaluated by the accuracy for the test data. We can see that the average CE for the training data decreases monotonically as G∗ increases, but perfect generalization was obtained for G∗ ≥ 3; thus, G∗ = 3 is selected here. Again, recall that there is no noise on the test data because we wanted to directly measure the closeness between the true rule and the obtained ones.

Table 5 compares the generalization performance of a three-layer neural network, the SBCW method with G∗ = 3, and the BCW method with G∗ = 3. Here generalization is evaluated by the accuracy for the test data. We can see that the BCW method showed perfect generalization for each run, while the other two were to some degree behind it.

Table 5. Comparison of generalization accuracy for the extended Monk's problem 2

run ID number    1      2      3      4      5
neural network   0.863  0.841  0.872  0.850  0.864
SBCW method      0.875  0.875  1.000  0.875  0.875
BCW method       1.000  1.000  1.000  1.000  1.000

Figure 2 shows how the accuracy changes during the bidirectional clustering under the condition that G∗ = 2. The solid and dotted lines indicate the accuracy for the training data and test data, respectively. Since the depth of bidirectional clustering was 10 (h = 10), we can see that the BCW was terminated at the second cycle. This experiment also shows that the depth h (= 10) was large enough.

Fig. 2. Bidirectional clustering for the classification problem. (Figure omitted; it plots training and generalization accuracy against the iteration number.)

The following shows the final function obtained by the BCW under the condition that G∗ = 3:

y = σ(2.408 − 4.748 σ(−0.433 + 2.280 Q1 − 1.712 Q2) − 5.410 σ(2.017 − 1.712 Q1)),
Q1 = q1 + q4 + q7 + q9 + q12 + q16,
Q2 = q2 + q3 + q5 + q6 + q8 + q10 + q11 + q13 + q14 + q15 + q17.    (22)
Note that a near-zero common weight, 0.0007, was set equal to zero, and the other weights were obtained by retraining. The underlying strict 2-of-6 rule is nicely represented in this function, as shown below. Note that Q1 + Q2 = 6 and that Q1 should be 2, as shown in Eq. (21). Thus, the first hidden unit is "on" when Q1 > 2.682, because −0.433 + 2.280 Q1 − 1.712 Q2 = −0.433 + 2.280 Q1 − 1.712 (6 − Q1) = −10.705 + 3.992 Q1 > 0. On the other hand, the second hidden unit is "on" when Q1 < 1.178. Hence, the truth value is "false" if and only if either hidden unit is "on", which means the truth value is "true" only when Q1 = 2.

Table 6. Computational complexity for the classification problem (sec.)

run ID number        1    2    3    4    5    average
networks learning    25   26   40   26   27   28.8
Hessian inversion    218  219  220  224  231  222.4
BCW processing time  137  173  131  187  122  150.0

Table 6 shows the computational complexity of neural network learning, Hessian inversion, and the processing time of the BCW method; the average CPU time over 5 runs required for each G∗ was about 6.7 min., and only 7% of it was used for neural network learning. Most of the time was spent computing the inverse of the Hessian matrix. The experiments were carried out on PCs with 2 GHz Pentium processors.
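The claim that the retrained network of Eq. (22) realizes the strict 2-of-6 rule can be checked directly by evaluating the function for every feasible value of Q1. The following is a short verification sketch; the weights are those reported above, and Q2 = 6 − Q1 because exactly one binary indicator per nominal variable equals 1.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

for Q1 in range(7):                      # Q1 can only take integer values 0..6
    Q2 = 6 - Q1
    y = sigmoid(2.408 - 4.748 * sigmoid(-0.433 + 2.280 * Q1 - 1.712 * Q2)
                      - 5.410 * sigmoid(2.017 - 1.712 * Q1))
    print(Q1, round(y, 3), y > 0.5)
# Only Q1 = 2 gives y > 0.5, i.e. the extracted function implements the strict 2-of-6 rule.
```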
5 Related Work
In the neural information processing field, the idea of weight sharing in a layered neural network is known to imitate some aspects of the mammalian visual processing system. Accordingly, the technique of weight sharing was used to build translation invariance into the response of a network for two-dimensional character recognition [1]. In such usage, however, which weights should have the same value was determined in the design process of the receptive fields of a network. Nowlan and Hinton [9] proposed the idea of soft weight sharing, where the distribution of weight values is modelled as a mixture of Gaussians and a learning algorithm automatically decides which weights should be tied together, having the same distribution. However, we would need some sophisticated method to find explicit knowledge from such mixture models. Towell and Shavlik [12] also introduced the idea of weight clustering in their algorithm for extracting rules from a trained neural network, but criteria such as the dissimilarity and utility proposed in this paper were not clearly defined in their clustering.

The dissimilarity introduced in the bottom-up clustering can be considered as an extension of a criterion used in Hessian-based network pruning such as optimal brain damage (OBD) [7] or the optimal brain surgeon (OBS) [4]. The OBS computes the full Hessian H, while the OBD makes the assumption that
H is diagonal. These methods prune a network weight by weight by finding the weight which minimizes the increase in the error function. They show that even small weights may have a substantial effect on the sum-of-squared error. However, they did not suggest any weight clustering as proposed in the present paper. Regularization techniques have been studied extensively to improve generalization performance and to make a network simpler [1]. Regularization provides a framework for suppressing insignificant weights, and rule extraction methods using regularization have been proposed [6]. Little work, however, has addressed situations with a large number of variables.
6 Future Issues
Although we have been encouraged by our results to date, there remain several issues we must solve before our method can become a useful tool for succinctly structuring neural networks. The proposed criteria for cluster merge and split are second-order optimal with respect to the training error function of neural networks. However, ideal criteria should be optimal with respect to generalization performance for unseen samples. To this end, we plan to incorporate a cross-validation procedure into our current criteria. The BCW search depth h was determined rather empirically in our experiments. Since, for some problems, this parameter may play an important role in avoiding poor local minima, our future studies should address determining it adequately. As for the optimal number of weight clusters G∗ and the optimal number of hidden units in neural networks, although we have suggested an approach to determine adequate values in this paper, we need to perform further experiments to evaluate its usefulness and efficiency. Clearly, we also need to evaluate the BCW method on a wider variety of problems. Some problems may require neural networks with more complex structures than simple three-layer feed-forward ones. Although we believe that the BCW method is potentially applicable to such complex networks, this claim must be confirmed by further experiments.
7 Concluding Remarks
In this paper, we presented a new weight sharing method called BCW for automatically structuring neural networks having a few thousand weights in the context of regression and classification problems. The method employs a bidirectional iterative framework to find better solutions to a weight sharing problem. The experiments showed that the BCW worked well in settings with about two thousand weights. In the future we plan to carry out further experiments to evaluate and extend our method.

Acknowledgement. This work was partly supported by NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation.
References
1. C. M. Bishop. Neural networks for pattern recognition. Clarendon Press, Oxford, 1995.
2. C. L. Blake and C. J. Merz. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. 1998.
3. R. O. Duda and P. E. Hart. Pattern classification and scene analysis. John Wiley & Sons, 1973.
4. B. Hassibi, D. G. Stork, and G. Wolf. Optimal brain surgeon and general network pruning. In Proc. IEEE Int. Conf. on Neural Networks, pages 293–299, 1992.
5. S. Haykin. Neural networks – a comprehensive foundation, 2nd edition. Prentice-Hall, 1999.
6. M. Ishikawa. Structural learning and rule discovery. In Knowledge-based Neurocomputing, pages 153–206. MIT Press, 2000.
7. Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, pages 598–605, 1990.
8. R. Nakano and K. Saito. Discovering polynomials to fit multivariate data having numeric and nominal variables. In Progress in Discovery Science, LNAI 2281, pages 482–493, 2002.
9. S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight sharing. Neural Computation, 4(4):473–493, 1992.
10. R. S. Sutton and C. J. Matheus. Learning polynomial functions by feature construction. In Proc. 8th Int. Conf. on Machine Learning, pages 208–212, 1991.
11. S. B. Thrun, J. Bala, et al. The Monk's problems – a performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, CMU, 1991.
12. G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993.
13. N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9):2109–2128, 2000.
Toward Drawing an Atlas of Hypothesis Classes: Approximating a Hypothesis via Another Hypothesis Model

Osamu Maruyama (1), Takayoshi Shoudai (2), and Satoru Miyano (3,4)

(1) Faculty of Mathematics, Kyushu University, [email protected]
(2) Department of Informatics, Kyushu University, [email protected]
(3) Institute of Medical Science, University of Tokyo
(4) Institute for Chemical Research, Kyoto University, [email protected]
Abstract. Computational knowledge discovery can be considered to be a complicated human activity concerned with searching for something new from data with computer systems. The optimization of the entire process of computational knowledge discovery is a big challenge in computer science. If we had an atlas of hypothesis classes which describes prior and basic knowledge on the relative relationships between hypothesis classes, it would be helpful in selecting hypothesis classes to be searched in discovery processes. In this paper, to give a foundation for an atlas of various classes of hypotheses, we have defined a measure of approximation of a hypothesis class C1 to another class C2. The hypotheses we consider here are restricted to m-ary Boolean functions. For 0 ≤ ε ≤ 1, we say that C1 is (1 − ε)-approximated to C2 if, for every distribution D over {0, 1}^m and for each hypothesis h1 ∈ C1, there exists a hypothesis h2 ∈ C2 such that, with probability at most ε, we have h1(x) ≠ h2(x), where x ∈ {0, 1}^m is drawn randomly and independently according to D. Thus, we can use the approximation ratio of C1 to C2 as an index of how similar C1 is to C2. We discuss lower bounds of the approximation ratios among representative classes of hypotheses such as decision lists, decision trees, linear discriminant functions, and so on. This prior knowledge would come in useful when selecting hypothesis classes in the initial stage and the sequential stages involved in the entire discovery process.
1 Introduction
Computational knowledge discovery can be regarded as a complicated human activity concerned with searching for something new from data by exploiting the advantages that computational discovery systems give us (see [1,6,7,11] for example). We thus think that, even though the role of computational systems is important, a discovery process should be fundamentally regarded as a cycle of trial and error driven by human beings.
Cheeseman and Stutz [2], who have devised a clustering system called AutoClass using a Bayesian method, would support our view on the discovery process. Their opinion was formed through their experiences with various kinds of successful discoveries using AutoClass, and is summed up as follows. "The discovery of important structure in data (classes) is rarely a one-shot process of throwing some database at AutoClass (or similar program) and getting back something useful. Instead, discovery of important structure is usually a process of finding classes, interpreting the results, transforming and/or augmenting the data, and repeating the cycle. In other words, the process of discovery of structure in databases is an example of the well known hypothesize-and-test cycle of normal scientific discovery." Their opinion holds not only for AutoClass but also for other knowledge discovery programs. From their opinion, we can recognize that a discovery process is a cycle of trial and error driven by human beings.

Fayyad et al. [4] have given a framework for knowledge discovery in databases (KDD). The KDD process is described as a process starting from target data selection from databases, preprocessing of the target data, transformation of the preprocessed data for data mining systems, pattern generation by data mining systems, and interpretation/evaluation of patterns, finally reaching the creation of new knowledge. They have pointed out that the KDD process can involve iteration and may contain loops between any steps. This also implies that a discovery process is a cycle of trial and error driven by human beings.

The optimization of the entire process of computational knowledge discovery would be feasible in every aspect of the process, even if the process is deeply dependent on human activities. Furthermore, it should be investigated extensively because of the urgent need for better ways of computational knowledge discovery in many fields, including science, commerce, engineering, and so on (see [8] for example). We therefore put our scope on the entire process of knowledge discovery, and focus less on details such as the development of new data mining methods.

One of the important subproblems involved in this entire process optimization is how one can repeatedly select an appropriate class of hypotheses to be searched in the stages of the entire process of knowledge discovery. In a normal discovery process, no one can avoid repeatedly selecting hypothesis classes to be searched. In most of these cases, one would select a hypothesis class to be searched based on experience, knowledge, intuition, and so on. If one has insight into the relationships between hypothesis classes, one can select a class of hypotheses more easily.

The aim of this work is to give prior and basic knowledge on the relationships between various kinds of classes of hypotheses. We therefore define a way of measuring how well a hypothesis class C1 can be approximated to another hypothesis class C2. For 0 ≤ ε ≤ 1, we say that C1 is (1 − ε)-approximated to C2 if, for every distribution D over the input space of the hypotheses and for each hypothesis h1 ∈ C1, there exists a hypothesis h2 ∈ C2 such that, with probability at most ε, we have h1(x) ≠ h2(x), where x is drawn randomly and independently according to D. Informally speaking, the larger the approximation ratio (i.e., 1 − ε) of C1 to C2 is, the more similar to each hypothesis in C1 is a
hypothesis contained in C2. Thus, we can use an approximation ratio as an index of how similar C1 is to C2. An advantage of the approximation measure is described in Section 2.1. The hypotheses we consider here can be regarded as m-ary Boolean functions, i.e., functions from {0, 1}^m to {0, 1}. We then consider classes of hypotheses such as decision lists, decision trees, linear discriminant functions, and so on. Each class is parameterized by a variable; the variables of most classes represent upper bounds on the sizes of hypotheses. We show novel lower bounds of the approximation ratios among these hypothesis classes. This prior knowledge on various kinds of classes of hypotheses would come in useful for selecting hypothesis classes to be applied in the entire discovery process.

This paper is organized as follows. In Section 2, we define the measure of approximation of a hypothesis class to another class and consider its advantage. Section 3 gives the definitions of several hypothesis models. In Section 4, we analyze approximation ratios among those hypothesis classes.
2 Approximation Measure
The hypotheses we consider here are restricted to polynomial-time computable functions f from {0, 1}^m to {0, 1}, for an arbitrary integer m > 0. Note that we use the terms "function" and "hypothesis" interchangeably in this paper. For a subset X ⊆ {0, 1}^m, we denote the set {x ∈ X | f(x) = 1} by f(X). Let D be an arbitrary fixed probability distribution over {0, 1}^m. The probability that x ∈ {0, 1}^m is drawn randomly and independently according to D is denoted by D(x). For X ⊆ {0, 1}^m, we denote D(X) = Σ_{x∈X} D(x).

Suppose that we have two arbitrary hypotheses h1 and h2 over {0, 1}^m. For X ⊆ {0, 1}^m, we define h1 △ h2(X) = {x ∈ X | h1(x) ≠ h2(x)}, which is denoted by h1 △ h2 when X = {0, 1}^m, for brevity. As a measure of the similarity between h1 and h2 over X, we use the symmetric difference between h1 and h2 over X, that is, D(h1 △ h2(X)). Notice that this similarity measure is also used in the PAC-learning model (see [5,9] for example).

Definition 1. Let h1 and h2 be arbitrary hypotheses over {0, 1}^m, let D be a distribution over {0, 1}^m, and let 0 ≤ ε ≤ 1. We say that h1 is (1 − ε)-approximated to h2 over D, denoted by h1 ⇒^{1−ε}_D h2, if ε ≥ D(h1 △ h2).

Thus, h1 ⇒^1_D h2 means that h1 and h2 are identical with respect to D. We will omit the subscript D of h1 ⇒^{1−ε}_D h2 if D is clear from the context.

Next we define the approximation measure of a class of hypotheses to another, based on the approximation measure of a hypothesis to another.
Definition 2. Let C1 and C2 be arbitrary classes of hypotheses from {0, 1}^m to {0, 1}, and let 0 ≤ ε ≤ 1. We say that C1 is (1 − ε)-approximated to C2, denoted by C1 ⇒^{1−ε} C2, if, for every distribution D over {0, 1}^m and for every hypothesis h1 ∈ C1, there exists h2 ∈ C2 such that h1 ⇒^{1−ε}_D h2. The value 1 − ε is called the approximation ratio of C1 to C2.

We have the following lemmas concerning approximation ratios.
Lemma 1. If C1 ⇒^{1−ε} C2 and C2 ⇒^{1−ε′} C3, then C1 ⇒^{1−ε′′} C3, where 1 − ε′′ = max{1 − ε − ε′, 0}.

Let C be a class of hypotheses. We say that C is reversible if, for each hypothesis h1 ∈ C, there is another hypothesis h2 ∈ C such that h1 △ h2({0, 1}^m) = {0, 1}^m. For an arbitrary hypothesis h, for a hypothesis h′ in a reversible class C, and for a distribution D, if D(h △ h′) ≥ 1/2 then there exists an h′′ ∈ C such that D(h △ h′′) ≤ 1/2. Therefore we have the next lemma.

Lemma 2. If C2 is a reversible class, then, for any class C1, we have C1 ⇒^{1−ε} C2 where 1 − ε ≥ 1/2.
2.1 Application of Approximation Ratios
In this subsection, we describe how useful the approximation ratio of one hypothesis class to another is in a discovery process that repeatedly exploits computational hypothesis-generating systems. Let Ct be a class of functions from {0, 1}^m to {0, 1}, and let ht be a function in Ct chosen as a target function. We are given a table T of the input/output behavior (x, ht(x)) of ht for every x ∈ {0, 1}^m, i.e., all possible labeled examples of ht over {0, 1}^m, in the terminology of the PAC-learning model. In addition to T, we are also given the probability distribution D over {0, 1}^m in some way or other. For example, T could be given in the form of a list of labeled examples in which the duplication of an example is allowed and in which, for each x ∈ {0, 1}^m, the number of duplications of an example (x, ht(x)) is proportional to D(x). The problem we must solve is to find, from T and D, a compact representation of a function whose input/output behavior is almost identical to T with respect to D. Thus, no one knows anything about ht and Ct. More precisely, one knows neither the fact that the labeled examples are derived from ht nor the fact that the unknown target function ht belongs to Ct. What we know is just the table T of the input/output behavior of ht over {0, 1}^m and the probability distribution D over {0, 1}^m. This situation would be natural in the process of knowledge discovery from real data.

In a discovery process, one would have to repeatedly select a hypothesis class and search it until one reaches the goal of discovery. Assume that, at a stage of
the process, we have a hypothesis class C and 0 ≤ ε ≤ 1 such that every hypothesis h ∈ C satisfies D(ht △ h) > ε, which would be obtained using exhaustive search methods, or which might be obtained with some approximation scheme. From this piece of information on the unknown target function ht, we can claim that, for every hypothesis class C̃ satisfying C̃ ⇒^{1−ε̃} C, if ε̃ ≤ ε then the target function ht does not belong to C̃, because, if it did, we would have D(ht △ h) ≤ ε̃ for some h ∈ C, which contradicts the assumption that D(ht △ h) > ε and ε̃ ≤ ε. This also implies that C̃ is not equal to Ct. Thus, the larger the gap ε − ε̃ is, the further back in the search order C̃ might better be placed. This seems to be a great advantage of using a map of the approximation ratios of hypothesis classes to others.
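A toy sketch of this pruning argument is given below, assuming the approximation ratios are stored in a simple dictionary keyed by class-name pairs (a hypothetical data structure; the 7/8 and 55/72 values are the ratios to DL(6) proved in Section 4.1 and shown in Fig. 1).

```python
def classes_excluded(searched_class, eps, approx):
    # approx maps (C_tilde, C) to eps_tilde, meaning C_tilde is (1 - eps_tilde)-approximated to C.
    # If every h in C satisfies D(h_t diff h) > eps and eps_tilde <= eps,
    # then the target h_t cannot belong to C_tilde.
    return [c_tilde for (c_tilde, c), eps_tilde in approx.items()
            if c == searched_class and eps_tilde <= eps]

approx = {("DL(7)", "DL(6)"): 1 - 7 / 8,     # ratio 7/8   -> eps_tilde = 1/8
          ("DL(8)", "DL(6)"): 1 - 55 / 72}   # ratio 55/72 -> eps_tilde = 17/72
print(classes_excluded("DL(6)", eps=0.20, approx=approx))   # ['DL(7)'] but not DL(8)
```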
3 Hypothesis Classes
In this section, we describe the classes of functions we deal with in this paper. A hypothesis class is formulated using a hypothesis model C and parameterized using a variable k representing an upper bound on the sizes of functions in C; the resulting class is denoted by C(k). It should be noted here that we have not put any constraints on the relationship between hypothesis classes C(k) and C′(k) which have the same size parameter k.

For an integer k ≥ 0, we denote by DL(k) the class of decision lists [5] with at most k branching nodes, and by DT(k) the class of decision trees [10] with at most k branching nodes. Terminal nodes are assigned a constant function which returns 0 or 1. We assume that a branching node has exactly two arcs, labeled 0 and 1, respectively. When a branching node receives an input x = (x1, ..., xm) ∈ {0, 1}^m, the node looks at a particular bit of x, say xi, and sends x to one of its children through the arc whose label is the same as xi. When an input x is given to a hypothesis in these classes, this process, started at the root node, is repeated recursively until x reaches a terminal node. The value returned by that terminal node becomes the output of the hypothesis. Note that, in the definition of decision lists in [5], for an integer l, each branching node of a decision list is allowed to have a conjunction of at most l literals over x1, ..., xm. On the other hand, in our definition of DL(k), a decision list L is normalized so that each branching node of L tests exactly one bit of the input.

For positive integers l and k, we denote by lCNF(k) the class of Boolean formulae in conjunctive normal form (CNF) of at most k clauses of l literals, and by lDNF(k) the class of Boolean formulae in disjunctive normal form (DNF)
of at most k terms of l literals. Here, a literal is either a variable xi of an input x = (x1, ..., xm) ∈ {0, 1}^m or its negation.

A linear discriminant function fk(x) whose number of variables is at most k can be represented as follows [3]:

f_k(x) = w_0 + Σ_{i=1}^{k} w_i x_{j_i},

where x = (x1, ..., xm) ∈ {0, 1}^m and x_{j_1}, x_{j_2}, ..., x_{j_k} ∈ {x1, x2, ..., xm}. The constant factor w0 is called a threshold weight. We denote by LDF(k) the class of Boolean functions gk(x) defined as

g_k(x) = 1 if f_k(x) ≥ 0, and g_k(x) = 0 otherwise.

For convenience, we also call gk(x) a linear discriminant function of at most k variables.
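As a minimal illustration of these models, the following sketch evaluates a normalized decision list and a linear discriminant function on a binary input. The representation of a decision list as (variable, arc label, output) triples is our own choice for the example, not a definition from the paper.

```python
def eval_decision_list(nodes, default, x):
    # nodes: list of (variable_index, arc_value, output); the input walks down the list
    # and stops at the first branching node whose tested bit equals arc_value.
    for var, arc, out in nodes:
        if x[var] == arc:
            return out
    return default                              # terminal at the end of the list

def eval_ldf(w0, weights, x):
    # weights: list of (w_i, j_i); g_k(x) = 1 iff f_k(x) = w0 + sum w_i * x_{j_i} >= 0.
    return 1 if w0 + sum(w * x[j] for w, j in weights) >= 0 else 0

x = (1, 0, 1)
print(eval_decision_list([(0, 0, 1), (2, 1, 0)], default=1, x=x))  # -> 0
print(eval_ldf(-1.5, [(1.0, 0), (1.0, 2)], x))                     # -> 1
```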
4 Analysis
In this section, we consider the approximation among the classes of hypotheses described in the previous section.

4.1 Decision List
First, we consider the approximation of DL(k + 1) to DL(k). From this result, we can see how the existence of one extra branching node affects the representational ability of decision lists.

Theorem 1. For each positive integer k, DL(k + 1) ⇒^{(k+1)/(k+2)} DL(k).

Proof. Let D be an arbitrary fixed probability distribution over {0, 1}^m. First, we describe how to construct a decision list lk in DL(k) from a decision list lk+1 in DL(k + 1). Note that lk+1 has at most k + 2 terminal nodes. Thus, there is at least one terminal node in lk+1 satisfying D(X) ≤ 1/(k + 2), where X ⊆ {0, 1}^m is the subset of the inputs reaching that terminal node. We choose one such terminal node, and remove it and the branching node directly connected to it from lk+1. The remaining (possibly two) parts are connected by identifying the incoming arc to the removed branching node with its outgoing arc. The label of the new arc is the same as the label of the incoming arc. The resulting decision list is in DL(k) and is denoted by lk. This is the way to construct lk from lk+1. For such decision lists lk and lk+1, we have D(lk △ lk+1) ≤ 1/(k + 2), which completes the proof.
Next we consider the problem of approximating DT(k) to DL(k). In this case, a decision list and a decision tree both have at most k branching nodes; the difference between them is their structure, namely list and tree, respectively.

Theorem 2. For each positive integer k, DT(k) ⇒^{(k+2)/(2(k+1))} DL(k).

Proof. Let D be an arbitrary fixed probability distribution over {0, 1}^m, and let T be an arbitrary decision tree in DT(k). Note that T has at most k + 1 terminal nodes. This fact implies that there is a terminal node of T satisfying D(X) ≥ 1/(k + 1), where X ⊆ {0, 1}^m is the subset of the inputs reaching the terminal node. One such terminal node of T is fixed and denoted by vt, and the path from the root to vt is extracted as the main part of a decision list, denoted by L. To complete L, constant functions have to be assigned to the terminal nodes v of L, except the terminal node derived from vt. Let Pv (and Nv, resp.) be the subset of the inputs x reaching v and satisfying T(x) = 1 (and T(x) = 0, resp.). We assign v a constant function returning 1 if D(Pv) ≥ D(Nv), and a constant function returning 0 otherwise. In this way, the constant functions assigned to the terminal nodes v of L are determined. Let Y be the subset of the inputs reaching those terminal nodes v, i.e., Y = {0, 1}^m − X. It should be noted here that Pr_{x∈Y}[T(x) = L(x)] ≥ 1/2 and Pr_{x∈{0,1}^m}[x ∈ Y] = 1 − D(X). In addition, notice that Pr_{x∈X}[T(x) = L(x)] = 1 and Pr_{x∈{0,1}^m}[x ∈ X] = D(X). Thus, we have

Pr_{x∈{0,1}^m}[T(x) = L(x)] = Pr_{x∈{0,1}^m}[x ∈ X] · Pr_{x∈X}[T(x) = L(x)] + Pr_{x∈{0,1}^m}[x ∈ Y] · Pr_{x∈Y}[T(x) = L(x)] ≥ (1 + D(X))/2 ≥ (k + 2)/(2(k + 1)).
The proof of the next corollary shows the usefulness of Lemma 1.

Corollary 1. For positive integers l and k, lCNF(k) ⇒^{(k̃+2)/(2(k̃+1))} DL(k̃), where k̃ = Σ_{i=1}^{k} l^i.

Proof. First, we have DT(k̃) ⇒^{(k̃+2)/(2(k̃+1))} DL(k̃), which is derived from Theorem 2. Next, we have lCNF(k) ⇒^1 DT(k̃) from Theorem 4. The proof is completed by combining these two facts using Lemma 1.
The next result can be shown in a similar way.

Corollary 2. For positive integers l and k, lDNF(k) ⇒^{(k̃+2)/(2(k̃+1))} DL(k̃), where k̃ = Σ_{i=1}^{k} l^i.

As an example of a map of approximation ratios of hypothesis classes, we here draw a map which shows the approximation ratios of several hypothesis classes to DL(6), given in Fig. 1. Suppose that we are given a distribution D over {0, 1}^m and a target function ht, and that a lower bound 0 ≤ ε ≤ 1 satisfying D(ht △ h) > ε for any hypothesis h ∈ DL(6) is obtained using an exhaustive search method and so on. Applying the result of the discussion in Section 2.1 to the map in Fig. 1, the map tells us that, if ε satisfies 7/8 ≥ 1 − ε > 55/72, then DL(7) does not contain ht, and neither does DL(8) if 55/72 ≥ 1 − ε. In this way, this prior knowledge on the similarities between classes of hypotheses would come in useful for selecting hypothesis classes to be searched in a discovery process.
Fig. 1. A map of the approximation ratios of DL(7), DL(8), DT(6), 2CNF(2) and 2DNF(2) to DL(6). The label of an arrow from a class C1 to a class C2 is the approximation ratio of C1 to C2 which we have proved. A solid line indicates that the label is derived from a concrete approximation; a dashed line means that the label is obtained using Lemma 1. Note that 7/8 > 55/72 > 4/7. (Figure omitted.)
4.2 Decision Tree
In this subsection, we discuss the approximation ratios of several hypothesis classes to decision trees. We start with the approximation of decision trees to themselves.

Theorem 3. For each positive integer k, DT(k + 1) ⇒^{(k+1)/(k+2)} DT(k).

This theorem can be shown in a similar way to Theorem 1. For a positive integer k, we have DL(k) ⇒^1 DT(k) because DL(k) is a subset of DT(k). By applying Lemma 1 to this fact and Theorem 1, we have the next corollary.

Corollary 3. For each positive integer k, DL(k + 1) ⇒^{(k+1)/(k+2)} DT(k).
The next result is used in the proof of Corollary 1.

Theorem 4. For positive integers l and k, lCNF(k) ⇒^1 DT(k̃), where k̃ = Σ_{i=1}^{k} l^i.

A sketch of a proof of this theorem is as follows. Let Fl,k ∈ lCNF(k). Suppose that Fl,k = C1 ∧ ··· ∧ Ck, where Ci = (yi,1 ∨ ··· ∨ yi,l). Note that yi,1, ..., yi,l are literals over x1, ..., xm. We can recursively construct a decision tree Tl,k satisfying Tl,k(x) = Fl,k(x) for each x ∈ {0, 1}^m, as shown in Fig. 2. Let Sl,k be the number of branching nodes in Tl,k. We then have Sl,k = l · Sl,k−1 + l for k > 1, and Sl,1 = l. Thus, Sl,k = Σ_{i=1}^{k} l^i = k̃.

The next theorem can be shown in a similar way to Theorem 4.

Theorem 5. For positive integers l and k, lDNF(k) ⇒^1 DT(k̃), where k̃ = Σ_{i=1}^{k} l^i.
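The recursive construction behind Theorem 4 can be sketched as follows (an illustrative implementation under our own tree representation; literals are (variable, polarity) pairs). Running it on a small 2-CNF confirms both that the tree computes the same function and that the number of branching nodes equals k̃ = Σ l^i.

```python
from itertools import product

def build_tree(clauses):
    # Decision tree for the conjunction of the clauses. A tree is a constant 0/1 or a
    # tuple (literal, subtree_if_satisfied, subtree_if_falsified).
    if not clauses:
        return 1                                 # empty conjunction is true
    first, rest = clauses[0], clauses[1:]
    tree = 0                                     # every literal of the first clause falsified
    for lit in reversed(first):
        tree = (lit, build_tree(rest), tree)     # chain over the literals of the first clause
    return tree

def eval_tree(tree, x):
    while isinstance(tree, tuple):
        (var, pol), sat, unsat = tree
        tree = sat if x[var] == pol else unsat
    return tree

def count_nodes(tree):
    return 0 if not isinstance(tree, tuple) else 1 + count_nodes(tree[1]) + count_nodes(tree[2])

def eval_cnf(clauses, x):
    return int(all(any(x[v] == p for v, p in c) for c in clauses))

l, k, m = 2, 3, 4                                # a 2-CNF with 3 clauses over 4 variables
clauses = [[(0, 1), (1, 0)], [(1, 1), (2, 1)], [(2, 0), (3, 1)]]
tree = build_tree(clauses)
assert all(eval_tree(tree, x) == eval_cnf(clauses, x) for x in product([0, 1], repeat=m))
assert count_nodes(tree) == sum(l ** i for i in range(1, k + 1))   # = k~ = 14
```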
4.3 Linear Discriminant Function
Approximation to non-discrete hypothesis models has not been discussed yet. Decision lists, decision trees, and CNF and DNF formulae are categorized as discrete models; linear discriminant functions, on the other hand, are non-discrete. In this subsection, we focus on the approximation of decision lists to linear discriminant functions.

Theorem 6. For each positive integer k, DL(k) ⇒^1 LDF(k).

Proof. Let D be an arbitrary fixed probability distribution over {0, 1}^m. This can be proved by induction on k, that is, on the upper bound of the number of branching nodes of a decision list in DL(k). The case k = 1 is trivial. Suppose that the statement holds for the positive integers less than or equal to k, and consider the case k + 1. Let lk+1 be a decision list in DL(k + 1). It should be noted here that lk+1 must be of one of the four forms given in Fig. 3. By the induction hypothesis, we have a linear discriminant function fk in LDF(k) satisfying lk ⇒^1_D fk. Let N′ be the maximum of the absolute values of the maximum and minimum values of fk over {0, 1}^m, and let N = N′ + 1. Then, a linear discriminant function fk+1 in LDF(k + 1) is constructed in the following way:
Fig. 2. Decision tree Tl,k. Let Fl,k−1 = C2 ∧ ··· ∧ Ck, and let Tl,k−1 be a decision tree satisfying Tl,k−1(x) = Fl,k−1(x) for every x ∈ {0, 1}^m. We denote the variable of the literal y1,j by x1,j. Note that, if y1,j is a positive literal, i.e., y1,j = x1,j, then the label of the left arc outgoing from the node labeled x1,j is 1 and the label of the right arc is 0. On the other hand, if y1,j is a negative literal, i.e., y1,j = ¬x1,j, then the label of the left arc is 0 and the label of the right arc is 1. A square means a terminal node, and the label on the square is the value returned by the constant function assigned to it. (Figure omitted.)
Fig. 3. lk is a decision list in DL(k). (Figure omitted; it shows the four possible forms (a)–(d) of lk+1, in which the new branching node xi leads to a terminal node on one arc and to lk on the other.)
(a) fk+1 = N (1 − xi) + fk,
(b) fk+1 = N (xi − 1) + fk,
(c) fk+1 = N xi + fk,
(d) fk+1 = −N xi + fk.
It would be trivial that lk+1 ⇒^1_D fk+1, which implies that the statement of this theorem holds for the case k + 1.
The next corollary can be obtained as a consequence of Theorem 6 and Lemma 1.

Corollary 4. Let k be a positive integer, and let 0 ≤ ε ≤ 1. For any class C of hypotheses, if C ⇒^{1−ε} DL(k) then C ⇒^{1−ε} LDF(k).

At present, we can say the following, using the results in Section 4.1:
– DL(k + 1) ⇒^{(k+1)/(k+2)} LDF(k),
– DT(k) ⇒^{(k+2)/(2(k+1))} LDF(k),
– lCNF(k) ⇒^{(k̃+2)/(2(k̃+1))} LDF(k̃),
– lDNF(k) ⇒^{(k̃+2)/(2(k̃+1))} LDF(k̃),
where k̃ = Σ_{i=1}^{k} l^i.
5 Concluding Remarks
Compared with decision lists, decision trees, and linear discriminant functions, the Boolean formulae in lCNF(k) and lDNF(k) seem to be a quite different type of hypothesis model because, in general, they are not reversible. It would be one of the future works to show non-trivial approximation ratios to CNF and DNF formulae.

What we have considered in this paper is to clarify how well a class of hypotheses can be approximated to another class of hypotheses. This problem is quite new and is motivated by the problem of optimizing the entire process of knowledge discovery. As a consequence of this study, we have attained insight on structural similarities between hypothesis models from a theoretical point of view. However, although we have shown lower bounds of the approximation ratios among several hypothesis classes, we have not discussed the tightness of those approximation ratios yet. Attaining tight ratios is future work.

In a real discovery process, although hypothesis classes have so far been selected as search spaces based on experience, knowledge, and intuition, prior knowledge on the approximation ratios among hypothesis models makes it possible to take a more efficient strategy in the selection of hypothesis classes. Providing such prior knowledge on hypothesis models would be one of the ways to contribute to the optimization of the entire process of knowledge discovery. Finding other ways to contribute to that optimization is also future work.

Acknowledgments. We thank anonymous referees and Ayumi Shinohara for valuable comments. Thanks also to Daisuke Shinozaki and Hiroki Sakai for fruitful discussions. This work is in part supported by Grant-in-Aid for Encouragement of Young Scientists and Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Information Science" from MEXT of Japan.
References
1. Brachman, R., and Anand, T. The process of human-centered approach. In Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996, pp. 37–58.
2. Cheeseman, P., and Stutz, J. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
3. Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification, second ed. John Wiley & Sons, Inc., 2001.
4. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From data mining to knowledge discovery in databases. AI Magazine 17, 3 (1996), 37–54.
5. Kearns, M. J., and Vazirani, U. V. An Introduction to Computational Learning Theory. The MIT Press, 1994.
6. Langley, P. The computer-aided discovery of scientific knowledge. In Discovery Science (1998), vol. 1532 of Lecture Notes in Artificial Intelligence, pp. 25–39.
7. Maruyama, O., and Miyano, S. Design aspects of discovery systems. IEICE Transactions on Information and Systems E83-D (2000), 61–70.
8. Munakata, T. Knowledge discovery. Commun. ACM 42 (1999), 26–29.
9. Natarajan, B. K. Machine Learning: A Theoretical Approach. Morgan Kaufmann Publishers, Inc., 1991.
10. Quinlan, J. Induction of decision trees. Machine Learning 1 (1986), 81–106.
11. Valdés-Pérez, R. Principles of human computer collaboration for knowledge discovery. Artificial Intelligence 107 (1999), 335–346.
Datascape Survey Using the Cascade Model

Takashi Okada
Kwansei Gakuin University, Center for Information & Media Studies
1-1-155 Uegahara, Nishinomiya 662-8501, Japan
[email protected]
Abstract. Association rules have the potential to express all kinds of valuable information, but a user often does not know what to do when he or she encounters numerous, unorganized rules. This paper introduces a new concept, the datascape survey. This provides an overview of data, and a way to go into details when necessary. We cannot invoke active user reactions to mining results, unless a user can view the datascape. The aim of this paper is to develop a set of rules that guides the datascape survey. The cascade model was developed from association rule mining, and it has several advantages that allow it to lay the foundation for a better expression of rules. That is, a rule denotes local correlations explicitly, and the strength of a rule is given by the numerical value of the BSS (between-groups sum of squares). This paper gives a brief overview of the cascade model, and proposes a new method of organizing rules. The method arranges rules into principal rules and associated relatives, using the relevance among supporting instances of the rules. Application to a real medical dataset is also discussed.
1 Introduction

Various methods are used to mine characteristic rules, which are useful for recognizing patterns inherent in data. The most popular of these is association rule mining [1], which gave rise to the field of mining itself. However, several problems arise when we use association rules for data analysis. The most common criticism is that there are too many rules, and that the content of most rules is already known. Another well-known difficulty is that a rule does not always help in recognizing a correlation.

'Datascape' is a new word that is proposed in this paper. We do not give a formal definition, but it refers to the image of a scenic view of a dataset from the perspective of the analyst. The datascape could be put into perspective by visualizing the distribution function of the data. However, datasets often have too many variables to inspect all of their visualizations. This is why we use characteristic rule mining to get a useful viewpoint. However, we cannot understand the importance of a specific pattern unless we can view the datascape surrounding the pattern. That is, a datascape survey is essential for invoking an active user reaction.
In this paper, we try to survey datascapes with the help of rules. Here, we devise a discrimination problem, and propose the following necessary conditions for the expression of rules.
(1) Rules must be expressible at variable levels of detail, from concise to detailed.
(2) We need to quantify the scale of the discrimination problem and to know how much of the problem is solved by a given rule.
(3) A rule must provide correlation information among various variables. That is, the correlation between explanatory and dependent variables is not sufficient; correlations among various variables that exist in the supporting instances of a rule should also be depicted.
There is no existing mining methodology that fulfills all the abovementioned conditions. The cascade model developed by the author provides solutions to some of these conditions. Therefore, this paper develops a new set of rules suitable for datascape surveys. The next section briefly introduces the cascade model. We propose several improvements to derive rule expressions in Section 3, and Section 4 applies the results to a medical diagnosis problem.
2 The Cascade Model

2.1 Cascades and Sum of Squares Criterion

The cascade model was originally proposed by the author [5]. It can be considered as an extension of association rule mining. The method creates an itemset lattice in which an [attribute: value] pair is employed as an item to form itemsets. Let us consider the trivial dataset shown in Table 1, which discriminates the Y value using two attributes, A and B. When we construct a lattice using explanation features, the nodes and links of the lattice can be viewed as lakes and waterfalls connecting lakes, respectively, as shown in Figure 1. The height of a lake is assumed to denote the purity of class features, and its area approximates the number of supporting instances for the itemset.
Table 1. Trivial sample data

A   B   Y
a1  b1  p
a2  b1  p
a2  b1  p
a1  b2  n
a1  b2  n
a1  b2  p
a2  b2  n
a2  b2  n

Fig. 1. The cascades expression from the sample data. (Figure omitted.)
The concept behind the cascade model is to select the most powerful waterfalls and to use them as rules. Then, we need to define the power of a waterfall. Gini's definition of SS (sum of squares) for categorical data in (1) provides a framework for the power of a waterfall [2]. Imagine that the instances are divided into G subgroups by the value of an explanation attribute. Then, TSS (total sum of squares) can be decomposed into the sum of WSSg (within-group sum of squares) and BSSg (between-group sum of squares) using (2), if we define BSSg as in (3) [6]. We propose that BSSg be used as a measure of rule strength. The BSS value per instance is called dpot, as defined in (4), and it will be used as a measure of the potential difference of a waterfall.

SS = (n/2) (1 − Σ_a p(a)^2),    (1)
TSS = Σ_{g=1}^{G} (WSS_g + BSS_g),    (2)
BSS_g = (n_g/2) Σ_a (p^g(a) − p^U(a))^2,    (3)
dpot_g = (1/2) Σ_a (p^g(a) − p^U(a))^2.    (4)

In the equations, the superscripts U and g indicate the upper node and the g-th subgroup node, respectively; n is the number of cases supporting a node; and p(a) is the probability of obtaining the value a for the objective attribute.
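A small sketch of these quantities in code is given below (our own helper functions, not the authors' implementation). Applied to the trivial data of Table 1, it reproduces the value BSS = 0.75 for the waterfall from the top node to the [B: b1] node, i.e. rule 1 in Table 2 below.

```python
from collections import Counter

def ss(labels):
    # Gini sum of squares, Eq. (1): (n / 2) * (1 - sum_a p(a)^2).
    n = len(labels)
    return n / 2.0 * (1.0 - sum((c / n) ** 2 for c in Counter(labels).values()))

def bss_dpot(group_labels, upper_labels):
    # BSS_g and dpot_g of Eqs. (3) and (4) for one subgroup against its upper node.
    n_g, n_u = len(group_labels), len(upper_labels)
    cg, cu = Counter(group_labels), Counter(upper_labels)
    values = set(cg) | set(cu)
    dpot = 0.5 * sum((cg[a] / n_g - cu[a] / n_u) ** 2 for a in values)
    return n_g * dpot, dpot

upper = list("ppppnnnn")          # Y at the top node of Table 1: 4 'p' and 4 'n'
group = list("ppp")               # the three instances with B = b1, all labelled 'p'
print(ss(upper))                  # SS at the top node: 2.0
print(bss_dpot(group, upper))     # (0.75, 0.25) -- the BSS matches rule 1 in Table 2
```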
2.2 Rule Link in the Lattice

Powerful links in the lattice are selected and expressed as rules [7]. Figure 2 shows a typical example of a link and its rule expression. Here, the problem contains four explanation attributes, A–D, and an objective attribute Z, all of which take (y, n) values. The itemset at the upper end of the link contains the item [A: y], and another item, [B: y], is added along the link. The items of the other attributes are called veiled items. The small tables to the right of the nodes show the frequencies of the items veiled in the upper node. The corresponding WSS and BSS values are also shown. The textbox at the bottom of Figure 2 shows the derived rule. The large BSS(Z) value is evidence of a strong interaction between the added item and attribute Z, and its distribution change is placed on the RHS of the rule. The added item [B: y] appears as the main condition on the LHS, while the items in the upper node are placed at the end of the LHS as preconditions. When an explanation attribute has a large BSS value, its distribution change is also denoted on the RHS to show the additional dependency. This information is useful for detecting colinearity among variables in the supporting instances of the lower node.

Fig. 2. A sample link, its rule expression, and the distributions of the veiled items. The derived rule reads: IF [B: y] added on [A: y] THEN [Z] BSS=5.40 (.40 .60) ==> (.10 .90), THEN [D] BSS=6.67 (.60 .40) ==> (.93 .07). (Figure omitted.)

It is not necessary for items on the RHS of a rule to reside in the lattice. We only need the itemsets [A: y] and [A: y, B: y] to detect the rule shown in Figure 2, though we have to count the frequencies of the veiled items. This is in sharp contrast to association rule miners, which require the itemset [A: y, B: y, D: y, Z: n] to derive the rule in Figure 2. This property makes it possible to detect powerful links dynamically before constructing the entire lattice.

Combinatorial explosion in the number of nodes is always a problem in lattice-based machine-learning methods. Since an item is expressed in the form [attribute: value], items are very dense in the cascade model, and this problem becomes more serious. However, it is possible to prune the lattice expansion using the abovementioned property, which allows us to derive a rule from a link while avoiding the construction of the entire lattice [8]. In fact, we can find an inequality constraint for the value of BSS(Z). Using this inequality as the pruning criterion, we can find valuable rules even when the number of attributes reaches several thousand.
3 Expression of Rules for a Datascape Survey

3.1 Problems

The cascade model has already given an answer to the quantification problem using the BSS criterion. That is, the SS at the root node is the size of the problem, and the BSS value of a rule shows the part solved by that rule. A rule derived by the model also denotes a local correlation between the main condition and the objective attribute that stands out for the instances selected by the preconditions. Other variables are also denoted if they correlate with the main condition. However, there are no clear ways to select and express sets of effective rules that are useful in a datascape survey.

The simplest way of expressing rules is to list them in decreasing order of their BSS values. For example, eight rules derived from the sample data in Table 1 are shown in Table 2. All waterfalls with nonzero power are included. However, a simple list of rules is insufficient for invoking an active user reaction. The shortcomings are:
1. There will be more than 100 rules if we include enough rules not to miss valuable information.
2. The BSS value of a rule may increase by adding or deleting a precondition clause. Pairs of rules related in this way often appear in the resulting rules independently.
3. Two rules sometimes share most of their supporting instances, although their conditions are completely different. Such information is useful for recognizing local colinearity among variables, but users must devote a lot of effort to identifying the information in a list of rules.
4. We cannot know the total explanation capability of a set of rules. That is, we do not know how many instances the rules have discriminated, and what part of the SS they have explained.

In a previous study, we tried to solve the last problem by using the instance-covering algorithm [5]. Here, rules represented waterfalls in the cascade when we drained all the water from the top lake. Rules were selected so that the maximum SS was explained, although each drop of water was limited in that it could not flow down two waterfalls simultaneously. This method selects rules 1, 2 and 4 in Table 2, the waterfalls illustrated by solid lines in Figure 1. Subsequently, we introduced multiple rule sets to obtain local colinearity information [7]. Repeating the selection of rules from the unemployed links successfully solved problems (3) and (4) above. For example, the first rule set of the sample problem is the same as that mentioned above, and the second set consists of rules 3, 5 and 6 in Table 2. However, a datascape survey using real-world datasets is still difficult, because the number of rules is still large, and because the expressions do not show the relationships among rules explicitly. In the following subsections, we propose a way to effectively express rules. An application involving a medical dataset is discussed in the following section.

Table 2. Rule links derived from a sample dataset

ID  Conditions                     Distribution of Y (p n) / number of instances   BSS
1   IF [B: b1] added on [ ]        (0.5 0.5)/8 ==> (1.00 0.00)/3                   .750
2   IF [B: b2] added on [A: a2]    (0.5 0.5)/4 ==> (0.00 1.00)/2                   .500
3   IF [B: b1] added on [A: a2]    (0.5 0.5)/4 ==> (1.00 0.00)/2                   .500
4   IF [B: b2] added on [ ]        (0.5 0.5)/8 ==> (0.20 0.80)/5                   .450
5   IF [B: b1] added on [A: a1]    (0.5 0.5)/4 ==> (1.00 0.00)/1                   .250
6   IF [B: b2] added on [A: a1]    (0.5 0.5)/4 ==> (0.33 0.67)/3                   .083
7   IF [A: a2] added on [B: b2]    (0.2 0.8)/5 ==> (0.00 1.00)/2                   .080
8   IF [A: a1] added on [B: b2]    (0.2 0.8)/5 ==> (0.33 0.67)/3                   .053
3.2 Optimization of a Rule

The essence of the cascade model is that it recognizes pairs of set-subset instance groups whose connecting link bears a large BSS value. A rule with more power is valuable in itself. Furthermore, the optimization of several rules may converge on a single rule, decreasing the number of rules. A condition consisting of several items is described as [attribute: value-zone], where the value-zone is defined by the lowest and highest values in the case of a numerical attribute, and by a list of values in the case of a nominal attribute. The value-zone of a condition is optimized by adding its neighbor or by cutting the edge of the zone. As there is no directionality in a nominal attribute, we must treat any value as a neighbor or an edge of the value-zone in the optimization.
The steps involved in a search in the neighborhood of a rule candidate consist of (1) optimizing the main condition clause, (2) optimizing any existing precondition clauses, and (3) adding and optimizing a new precondition clause. When the value-zone extends to cover all values of the attribute during the optimization of a precondition clause, the clause can be deleted. The addition of a new precondition includes that of the main condition attribute. Since the search space is huge, we must use a greedy hill-climbing algorithm. The optimization follows the order described above, and the process is repeated until a rule reaches a local maximum BSS value. However, the optimization often results in the inclusion of trivial preconditions that exclude only a very small portion of the instances. Following the principle of Occam's razor, we include only those preconditions that increase the BSS value by more than 5%. We can see the usefulness of optimization when it is applied to the rules shown in Table 2, where the optimization of rules 2 and 5 and of rules 4 and 6 converges on rule 1 and rule 3, respectively.
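A rough sketch of the value-zone neighborhood and the greedy loop is given below; the zone representations and function names are illustrative assumptions, and the BSS evaluation is passed in as a callback rather than reimplemented.

```python
def zone_neighbours(zone, values, nominal=False):
    # Candidate value-zones obtained by adding a neighbour value or cutting an edge.
    if nominal:
        for v in values:                          # any value acts as a neighbour or an edge
            cand = set(zone) ^ {v}                # add it if absent, cut it if present
            if cand:
                yield sorted(cand)
    else:
        lo, hi = zone                             # indices into the sorted list of values
        if lo > 0:
            yield (lo - 1, hi)                    # extend the zone downwards
        if hi < len(values) - 1:
            yield (lo, hi + 1)                    # extend the zone upwards
        if lo < hi:
            yield (lo + 1, hi)                    # cut the lower edge
            yield (lo, hi - 1)                    # cut the upper edge

def hill_climb(zone, values, bss, nominal=False):
    # Greedy hill-climbing over value-zones until a local maximum of BSS is reached.
    best, best_bss = zone, bss(zone)
    improved = True
    while improved:
        improved = False
        for cand in zone_neighbours(best, values, nominal):
            v = bss(cand)
            if v > best_bss:
                best, best_bss, improved = cand, v, True
    return best, best_bss
```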
3.3 Organization into Principal and Relative Rules

The number of rules is very important for invoking an active user reaction. If a user does not trust the rule induction system and a mining result contains more than 30 rules, then the user usually avoids a serious evaluation of the utility of the system. If there are fewer than 10 rules, the user might peruse all the rules to evaluate them. Experts in data analysis expect another outcome: an expert also desires a smaller number of rules, but hopes to read detailed information on request to conduct a datascape survey.

In this section, we propose a way to organize rules into several principal rules and relative rules. Here, a user is expected to look at the principal rules first in a rough datascape survey. If the user wishes to inspect the details of a principal rule, the relative rules guide a minute survey. Then, the problem is defining the relevance of the relationship between two rules. As noted earlier, two rules may share most of their supporting instances, although they are expressed in very different ways. In such a case, we believe that the two rules depict different aspects of a single phenomenon, and that they should be expressed as a principal rule and its relative. Therefore, the definition of relevance should be based on the overlap of their supporting instances. We introduce the following measure, rlv(A, B), to quantify the strength of the relevance of the supporting instances of two rules A and B:
rlv^{UL}(A, B) = max( cnt(A^{UL} ∩ B^{UL}) / cnt(A^{UL}), cnt(A^{UL} ∩ B^{UL}) / cnt(B^{UL}) ),    (5)

where cnt is a function that returns the number of instances in a set, and A^{UL} denotes the set of supporting instances at the node above or below rule A, depending on whether the superscript UL takes the value UP or LOW, respectively. This measure takes the highest value 1.0 when one set is a subset of the other, and the lowest value 0.0 when there is no overlap between the two instance sets.
We set a threshold value, min-rlv, for the above relevance to judge whether two rules are relatives. The relationships between two rules are defined as shown in Table 3, depending on the relevance at the upper and lower nodes.

Table 3. Relationships between two rules

                                  At the upper node
                                  relevant          not relevant
At the lower node  relevant       (a) ULrelative    (b) Lrelative
                   not relevant   (c) Urelative     (d) no relation
(a) ULrelative is the relationship when the rules are relevant at both nodes. Two typical cases are shown on the left side of Fig. 3. In Case (1), two rules share most of their supporting instances and they offer explanations from different viewpoints. This aids inspection by a user. Sometimes two rules differ only in their preconditions. Case (2) shows the relationship when Rule B has precondition clauses in addition to those of Rule A, and we need to consider the use of a related rule. Suppose that Rule A has a larger BSS value than Rule B and we employ Rule A as a principal rule. If Rule B has a larger dpot value than Rule A, Rule B is useful, since it states that the correlation expressed by Rule A is stronger in the region of Rule B. On the contrary, if the dpot of Rule B is less than that of Rule A and the RHSs of both rules lead to the same classification, Rule B is useless. Similarly, if Rule B has a larger BSS value and the RHSs of the two rules are the same, Rule A is useless, because the local correlation found by Rule A is just a consequence of a stronger correlation found by Rule B. These useless relatives may exist even when two rules do not have a strict set-subset relationship. Therefore, the above criterion for judging useless rules is applied when the rlv values at both nodes exceed some parameter (default: 0.85). Useless rules are simply removed from the final rules.

(b) An Lrelative relationship holds when the two instance sets are relevant at the lower node only. A typical example is Case (3) in Fig. 3, where the intersection of A^UP and B^UP is very close to that of A^LOW and B^LOW. Then, it is useful to give explanations using the two rules. Another example of an Lrelative is Case (4), where B^LOW covers only a small part of A^LOW, and Rule B can be used to give detailed information about some segment of A^LOW.

(c) Urelative is the last relationship. The two rules simply dissect the data, as shown in Case (5) of Fig. 3.

Fig. 3. Sample relationships between the supporting instances of two rules, Cases (1)–(5). (Figure omitted.)
Organizing rules into a principal rule and its relatives seems to help a user to understand the data. Given two rules, a stronger and a weaker one, it is reasonable to make the former the principal rule and the latter its relative rule. The exception is Urelative rules, for which it seems suitable to attach pointers to separately placed rules. A principal rule may have several ULrelative and Lrelative rules, which are placed in decreasing order of their Tanimoto coefficients between A^LOW and B^LOW, defined in (6) [14]:

cnt(A^LOW ∩ B^LOW) / ( cnt(A^LOW) + cnt(B^LOW) − cnt(A^LOW ∩ B^LOW) ).    (6)
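The two measures and the classification of Table 3 are simple to compute when the supporting instance sets are available; the following sketch uses Python sets and illustrative rule data (the instance identifiers are made up for the example).

```python
def rlv(a, b):
    # Eq. (5): relevance of two supporting instance sets at one node.
    inter = len(a & b)
    return max(inter / len(a), inter / len(b))

def tanimoto(a, b):
    # Eq. (6): Tanimoto coefficient of two supporting instance sets.
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def relationship(a_up, a_low, b_up, b_low, min_rlv=0.7):
    # Table 3: classify the relation of two rules from the upper- and lower-node relevance.
    upper = rlv(a_up, b_up) > min_rlv
    lower = rlv(a_low, b_low) > min_rlv
    if upper and lower:
        return "ULrelative"
    if lower:
        return "Lrelative"
    if upper:
        return "Urelative"
    return "no relation"

A_up, A_low = set(range(100)), set(range(30))      # rule A: upper/lower supporting instances
B_up, B_low = set(range(80)), set(range(20))       # rule B is nearly a subset of rule A
print(relationship(A_up, A_low, B_up, B_low))      # -> 'ULrelative'
print(tanimoto(A_low, B_low))                      # -> 0.666..., the ordering key for relatives
```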
We summarize the computation process used to organize rules in Algorithm 1. It receives candidate links for rules as an argument. The function optimize-rule performs the procedures described in the previous subsection. In organizing the rules, we select the rule with the largest BSS value as the first principal rule, and judge its relevance to all other rules. All relative rules except useless rules accompany the description of the principal rule. The selection of principal and relative rules is repeated for the rest of the candidate rules until all rules are organized.

create-structured-rules(links)
  ls := sort-by-BSS(links)
  rules := nil
  loop for link in ls
    rule := optimize-rule(link)
    unless member(rule rules)
      push rule to rules
  rules := sort-by-BSS(rules)
  final-rules := nil
  loop for prule in rules
    loop for rule in rest(rules)
      if rule is useless
        remove rule from rules
        skip
      if rlv(prule rule) > min-rlv
        push rule to prule.relatives
        remove rule from rules
    push prule to final-rules
  return final-rules

Algorithm 1
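A Python rendering of Algorithm 1 might look as follows. This is a sketch under assumptions: rule objects are assumed to carry a bss attribute and support equality comparison, and optimize_rule, rlv, and is_useless stand for the operations described above and are passed in as callbacks.

```python
def create_structured_rules(links, optimize_rule, rlv, is_useless, min_rlv=0.7):
    # Optimize every candidate link, then greedily organize the optimized rules
    # into principal rules with attached relatives, strongest BSS first.
    rules = []
    for link in sorted(links, key=lambda l: l.bss, reverse=True):
        rule = optimize_rule(link)
        if rule not in rules:
            rules.append(rule)
    rules.sort(key=lambda r: r.bss, reverse=True)

    final_rules = []
    while rules:
        prule = rules.pop(0)               # the current strongest rule becomes a principal rule
        prule.relatives = []
        remaining = []
        for rule in rules:
            if is_useless(prule, rule):
                continue                   # useless rules are simply dropped
            if rlv(prule, rule) > min_rlv:
                prule.relatives.append(rule)
            else:
                remaining.append(rule)
        rules = remaining
        final_rules.append(prule)
    return final_rules
```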
Table 4. Example of rule organization using min-rlv=0.7

No.  Principal rule  ULrelative rules  Lrelative rule  Urelative rule
1    1               3, 5              none            2, 6
2    2               4                 7               1
3    6               8                 none            1
Suppose that 8 rules in Table 2 are those after optimization. When we organize these rules using the above procedure, only the 3 principal rules shown in Table 4 are identified. Therefore, we can expect the organization of rules to produce a simple and effective datascape survey.
4 Application to Medical Diagnostics

We used the test dataset for meningoencephalitis diagnosis provided at the JSAI KDD Challenge 2001 workshop [13] to examine the capacity of the method proposed in the previous section to perform a datascape survey. For this dataset, it is important to determine whether the disease is bacterial or viral meningitis. It is already known that the diagnosis can be obtained by comparing the numbers of polynuclear and mononuclear cells, but there should be additional information related to the diagnosis. The cascade model has already been used to analyze these data [9]. That analysis produced strong rules based on the number of cells, which contributed to the diagnosis. However, there were many other rules, most of which also included conditions related to the number of cells. We therefore had to re-analyze the dataset excluding the attributes related to cell numbers, to determine additional ways to obtain the diagnosis without the help of cell numbers. The analysis of this dataset thus provides a good test of whether the proposed method can perform a datascape survey. The computations used the same categories and parameters considered in [9].

4.1 Stability of the Derived Rules

First, we examined whether we could obtain a stable set of rules before organizing them into principal and related rules. Table 5 shows the number of rules resulting from changing the parameters thres and min-BSS. As a powerless rule is not interesting, we counted only those rules with BSS values larger than (0.03 * #instances). The lattice expands when we use lower values of thres. A link in the lattice is added to the initial candidates for rule optimization when its BSS value is larger than a threshold value, (min-BSS * #instances). The parameter min-sup, having the same meaning as in association rules, was set to 0.01. The results do not change if we set it to 0.02.

Table 5. Number of candidate links and optimized rules

thres \ min-BSS   0.01         0.02        0.03
0.05              1702 → 97    250 → 19    62 → 5
0.07              582 → 50     110 → 13    32 → 3
0.10              210 → 30     38 → 10     13 → 3
0.15              83 → 19      21 → 8      9 → 3

Each cell shows the number of candidate links to the left of the arrow and the number of optimized rules to the right.
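The candidate-selection step just described can be written compactly; the snippet below is our own paraphrase, with bss and support assumed to return the raw statistics of a link.

def candidate_links(links, n_instances, bss, support, min_bss=0.02, min_sup=0.01):
    """Keep links whose BSS and support exceed the thresholds used in Table 5."""
    return [link for link in links
            if bss(link) > min_bss * n_instances and support(link) >= min_sup]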
We see that the number of candidate links rises steeply with the increase in lattice size and the sensitivity of candidate selection. The same tendency is also observed in the number of optimized rules, but its slope is gentler. Inspection of the individual rules shows that a smaller set of rules is always contained in a larger set of rules, but the former does not always occupy the rules with the highest BSS values in the latter. Therefore, we can say that the greedy optimization of rules starting from many links reaches a relatively small number of rules, which contributes to easier understanding by a user. However, there are many changes in the optimized rules as the lattice expansion and candidate selection parameters change. Let us examine the number of principal and relative rules with changes in the value of the parameter rlv-min. The organization uses the 19 optimized rules obtained with the parameters thres=0.05, min-BSS=0.02, min-sup=0.01. Two rules were excluded as they were judged to be useless. Table 6 shows the conditions, associated distribution changes, and BSS values of the 17 rules.

Table 6. Principal rules and their organization into ULrelative rules
min-rlv: .5 .7 .9
ID
Main condition
Preconditions
1 [Cell_Poly>300] [ ]
Bacteria supports BSS .30 ==>1.0 140Å3014.7
[Cell_Mono=<750] [CSF_PRO>0] [LOC=<1] [BT>36] 3 [CSF_CELL>750] [CSF_PRO=<100] 4 [Cell_Poly>50] [AGE>20] [STIFF>=1]
.29 ==>.81 95Å32 8.57
5 [CRP>3]
.25 ==>.88 102Å176.69
2 [CSF_CELL>750]
[SEIZURE=<0] [FOCAL: 0]
.25 ==>1.0 99Å15 8.38 .28 ==>1.0 67Å11 5.64
6 [CSF_CELL>750] [BT>36] [EEG_FOCUS: 0] .29 ==>.80 [COLD=<5] [SEIZURE=<0] 7 [CSF_CELL>750] .20 ==>.79 [BT>36] [CT_FIND: 1] [AGE>20] [FEVER>0] 8 [BT>39] .23 ==>1.0 [NAUSEA=<0] [CSF_GLU>40] 9 [CRP: 3] [CSF_PRO=<200] .26 ==>.84 [NAUSEA=<3] 10 [CRP>3] .25 ==>.92 [CSF_PRO=<200] [NAUSEA=<3] [EEG_WAVE: 11 [WBC: 4] .28 ==>.91 0] [CSF_GLU>40] [NAUSEA=<3] [STIFF=<3] 12 [CT_FIND: 0] .19 ==>1.0 [LOC_DAT: 0] [CSF_GLU>40] [NAUSEA=<3] [FOCAL: 0] 13 [CRP>1] .22 ==>.83 [ESR=<10] [NAUSEA=<0] [ONSET: 0, 1, 3] 14 [CRP: 2 - 3] .27 ==>.85 [FOCAL: 0] [2020] 15 .52 ==>.03 00] [CSF_CELL>300] [SEIZURE=<0] [BT=<39] 16 [CT_FIND: 0] .23 ==>.91 [STIFF=<2] [EEG_FOCUS: 0] [FEVER=<6] [LOC=<1] 17 [CT_FIND: 0] .27 ==>.85 [BT=<39]
82Å20 5.14 55Å14 4.80 48Å8 4.75 123Å196.43 95Å13 5.84 80Å11 4.42 63Å7 4.58 68Å12 4.50 49Å13 4.38 60Å30 7.00 60Å11 5.02 74Å13 4.31
The dendrogram to the left of Table 6 illustrates the effects of the rlv-min value on the organization of the rules. Here, a branch indicates a ULrelative rule, and a straight line merging branches shows a principal rule. No Lrelative rules were found. There are 14 principal rules at min-rlv=0.9, as seen in the rightmost part of the tree. These merge into fewer principal rules as min-rlv decreases, toward the left end of the tree. We can therefore see the relevance among rules from the tree. The choice of a reasonable relevance threshold is a difficult problem, as it depends on the expert's background knowledge, the aim of the analysis, and the data.

4.2 Interpretation of Organized Rules

An example of a principal rule along with its ULrelative rules is shown in Fig. 4, with min-rlv set to 0.7. The first line shows the numbers of instances before and after the application of the main condition, and the BSS value. The second line is the expression for the main condition and preconditions, followed by the distribution change in the dependent attribute Diag2 in the third line. In this case, the application of the main condition raised the percentage of bacteria from 30% to 100%. When an explanation attribute changes its distribution significantly, its contents are also denoted, as depicted in the following two lines. This principal rule gives a reasonable result, but its content is already recognized by experts. After the description of the principal rule, there appear those of the relative rules. The first line shows that there are 7 ULrelative and 7 Urelative rules. The first and the fifth ULrelatives are displayed as examples.
[Fig. 4 shows the textual output for Rule 1 (IF [Cell_Poly>300] added on [ ]), with its distribution change for Diag2 and the accompanying changes in CSF_CELL and Cell_Poly, followed by its relative rules, including ULrelative rule1-UL1 (main condition on Cell_Poly, added on AGE and STIFF) and rule1-UL5 (main condition on CRP, added on SEIZURE and FOCAL); each relative rule is preceded by its rlv and Tanimoto values and the R&P, R-P, and P-R instance counts at the upper and lower nodes.]
Fig. 4. Sample output of a principal and its relative rules.
The first two lines in a relative-rule section show the relevance of this relative rule to the principal rule. Here, R and P denote the sets of supporting instances for the relative and principal rules, respectively. R&P, R-P, and P-R show the numbers of instances after an intersection and two difference operations, respectively. The rlv and Tanimoto coefficients defined by (5) and (6) give us an estimate of the instances shared by the two rules. In this case, R is a subset of P at the upper node, occupying more than two thirds of P. At the lower node, the shared instances cover more than 70% of P and R. A relative rule itself is expressed like a principal rule. As the attributes of the main condition are the same, this relative rule seems to display a part of the outskirts described by the principal rule. In the fifth relative rule, all the pre- and main conditions are different from those of the principal rule. 40% of the instances of the principal rule are covered by this relative rule. Therefore, this ULrelative rule gives an alternative explanation for a segment of patients supporting the principal rule. The description of a Urelative rule is just the relevance information in the first two lines of any relative rule and a pointer to another principal rule. These organized expressions of rules enabled us to give understandable interpretations that were not reached by the rules identified in the previous work. We give some considerations on the above principal rule that might be useful for interpretation by an expert. We can easily see from the principal rule that very high values of Cell_Poly are directly connected to the bacterial disease, and that they are also correlated with high values of CSF_CELL. However, these phenomena are already well known to medical experts. Leads to new knowledge must therefore be sought in the relative rules. The supporting instances of these relative rules approximate a subset of those of the principal rule in many cases, and they can offer explanations for a segment of patients. At first glance, the conditions of rule 2 in Table 6 seem uninteresting, as the rule uses cell-number attributes. However, the main condition of this rule shows high correlations with several attributes, including SEX, KERNIG, LOC_DAT, WBC, and CT_FIND. Therefore, this rule seems to indicate a cluster of patients awaiting further research. Rule 5 (rule1-UL5 in Fig. 4) is also interesting. Attributes related to cell numbers do not appear in the LHS of this rule. Attributes concerned with this group of patients are LOC_DAT, WBC, and Cell_Poly, along with those in its conditions. Attributes appearing in the conditions of rules 3 and 8 are also expected to be useful for an expert. Attributes with high correlations to the main conditions of these rules are (AGE, BT, CRP) and (BT, WBC, CRP, CT_FIND) in rules 3 and 8, respectively. The same kind of interpretation is possible for other rules, but this goes beyond the scope of this paper.
5 Related Work

It is well recognized that too many rules appear in association mining, and there have been several attempts to improve this situation. A well-known research direction is to remove redundant rules using the concepts of closed itemsets [10, 15] and representative rules [3], but the improvement is very limited from the viewpoint of a data analyst. Clustering of association rules has also been used to obtain a reasonable set of rules [4], but this method lacks the theoretical foundation of the other approaches. Another way to clarify a large number of association rules is to use a visualization technique, as implemented in many kinds of mining software. This provides a type of
datascape, but we believe that it is difficult to view the details of the relationships among rules in this way. Apart from association rules, a decision tree is a set of organized rules that cover all instances once [12]. The relationships among its rules are clear. However, the method only finds local correlations when they reside on the path of the sequential expansion of the tree. Rough set theory also provides a set of rules, and its reducts give explanations of the data from multiple viewpoints [11], but the relationships among these rules are not clear. In conclusion, this paper is the first attempt to examine datascape surveys using an organized set of rules.
6 Conclusions

A datascape, a word that we have coined, seems to be a valuable means of characterizing rules that focus on the discovery of specific knowledge; the importance of viewing a datascape from a distant point has been overlooked. Furthermore, the organization of rules into principal and relative rules opens a way to use them as a guide in a survey of the datascape. Since the method is based on the cascade model, the problems of expressing data correlations and quantifying them have already been solved. Therefore, we anticipate that this method can be used as a general tool for data analysis, although we need more experience in using organized rules to analyze data and in elaborating the expression of rules. Rigorous evaluation and harsh criticism of the resulting rules by experts are expected to improve the method proposed in this paper. A piece of knowledge expressed by a rule makes no sense if it is isolated from the surrounding information. The method proposed in this paper focuses on clarifying the relationships among rules, thus seeking to provide an overview of the datascape. We also need to survey a more detailed datascape within the instances covered by a rule. A visualization of the instances indicated by a rule will work well in some cases. Illustration of multiple results, as in Table 6, may be one way of presenting the results; a user starts his or her inspection from the principal rules obtained using a lower relevance threshold and then descends the tree to inspect a smaller segment of the data. Another method must be devised to express the internal structure of a rule, and work on this aspect is currently in progress.
References
[1] Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. In Proc. ACM SIGMOD (1993) 207-216
[2] Gini, C. W.: Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Universita de Cagliari, 1912. Reviewed in Light, R.J., Margolin, B.H.: An analysis of variance for categorical data. J. Amer. Stat. Assoc. 66, 534-544
[3] Kryszkiewicz, M.: Representative Association Rules and Minimum Condition Maximum Consequence Association Rules. In Zytkow, J.M., Quafalou, M. (eds.): Principles of Data Mining and Knowledge Discovery, PKDD '98, LNCS 1510, Springer, 361-369
[4] Lent, B., Swami, A., Widom, J.: Clustering Association Rules. In Proc. ICDE 1997, IEEE Computer Soc., 220-231
[5] Okada, T.: Finding Discrimination Rules using the Cascade Model. J. Jpn. Soc. Artificial Intelligence, 15, 321-330
[6] Okada, T.: Sum of Squares Decomposition for Categorical Data. Kwansei Gakuin Studies in Computer Science, Vol. 14, 1-6, 1999. http://www.media.kwansei.ac.jp/home/kiyou/kiyou99/kiyou99-e.html
[7] Okada, T.: Rule Induction in Cascade Model based on Sum of Squares Decomposition. In Zytkow, J.M., Rauch, J. (eds.): Principles of Data Mining and Knowledge Discovery, PKDD '99, LNAI 1704, Springer, 468-475
[8] Okada, T.: Efficient Detection of Local Interactions in the Cascade Model. In Terano, T. et al. (eds.): Knowledge Discovery and Data Mining (Proc. PAKDD 2000), LNAI 1805, Springer, 193-203
[9] Okada, T.: Medical Knowledge Discovery on the Meningoencephalitis Diagnosis Studied by the Cascade Model. In Terano, T. et al. (eds.): New Frontiers in Artificial Intelligence. Joint JSAI 2001 Workshop Post-Proceedings, LNCS 2253, Springer, 533-540
[10] Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering Frequent Closed Itemsets for Association Rules. In Proc. 7th Intl. Conf. on Database Theory, 1999, LNCS 1540, 398-416
[11] Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, 1991
[12] Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993
[13] Washio, T.: JSAI KDD Challenge 2001. http://wwwada.ar.sanken.osaka-u.ac.jp/pub/washio/jkdd/jkddcfp.html
[14] Willett, P., Winterman, V.: Quant. Struct. Activ. Relat., Vol. 5, 18
[15] Zaki, M. J.: Generating Non-redundant Association Rules. In Proc. KDD 2000, ACM Press, 34-43
Learning Hierarchical Skills from Observation

Ryutaro Ichise^{1,2}, Daniel Shapiro^1, and Pat Langley^1

^1 Computational Learning Laboratory, Center for the Study of Language and Information, Stanford University, Stanford, CA 94305-4115, USA
^2 National Institute of Informatics, Tokyo 101-8430, Japan
{ichise,dgs,langley}@csli.stanford.edu
Abstract. This paper addresses the problem of learning control skills from observation. In particular, we show how to infer a hierarchical, reactive program that reproduces and explains the observed actions of other agents, specifically the elements that are shared across multiple individuals. We infer these programs using a three-stage process that learns flat unordered rules, combines these rules into a classification hierarchy, and finally translates this structure into a hierarchical reactive program. The resulting program is concise and easy to understand, making it possible to view program induction as a practical technique for knowledge acquisition.
1
Introduction
Physical agents like humans not only execute complex skills but also improve their ability over time. The past decade has seen considerable progress on computational methods for learning such skills and control policies from experience. Much of this research has focused on learning through trial and error exploration, but some has addressed learning by observing behavior of another agent on the task. In particular, research on behavioral cloning (e.g., Sammut, 1996) has shown the ability to learn reactive skills through observation on challenging control problems like flying a plane and driving an automobile. Although such methods can produce policies that predict accurately the desirable control actions, they ignore the fact that complex human skills often have a hierarchical organization. This structure makes the skills more understandable and more transferable to other tasks. In this paper, we present a new approach to learning reactive skills from observation that addresses the issue of inferring their hierarchical structure. We start by specifying the learning task, including the training data and target representation, then present a method for learning hierarchical skills. After this, we report an experimental evaluation of our method that examines the accuracy of the learned program and its similarity to a source program that generated the training cases. In closing, we discuss related work and directions for future research on this topic.
2 The Task of Learning Hierarchical Skills
We define the task of learning skills in terms of its inputs and outputs:
– Given: a trace of agent behavior cast as a sequence of state descriptions and associated actions;
– Find: a program that generates appropriate actions when presented with new states.
Research on behavioral cloning (e.g., Anderson et al., 2000; Sammut, 1996) has already addressed this task, having developed methods that learn reactive skills from observation that are both accurate and comprehensible. However, complex skills can often be decomposed naturally into subproblems, and here we focus on capturing this hierarchical structure in an effort to produce even more concise and understandable policies. We increase the generality of this learned structure by adopting the separation hypothesis (Shapiro & Langley, 2002), which asserts that differences in individual behavior are due to the action of distinct preferences over the same set of skills. For example, we all know how to perform common tasks like driving, but some prefer safer options, and others more reckless ones. This assumption separates the task of program acquisition into two parts, the first involving the structure of skills, and the second a (possibly numeric) representation of preference that explains individual choices. We address the first task here. The separation hypothesis simplifies the task of program acquisition because it implies that we should learn a non-deterministic mapping from the observed situation to a feasible set of actions, instead of aiming for a deterministic characterization of a single agent's behavior. The resulting program will represent fewer distinctions, and should therefore be easier to understand.

2.1 Nature of the Training Data
We assume that the learner observes traces of another agent’s behavior as it executes skills on some control task. As in earlier work on learning skills from observation, these traces consist of a sequence of environmental situations and the corresponding agent action. However, since our goal is to recover a nondeterministic mapping, we consider traces from multiple agents that collectively exhibit the full range of available options. Moreover, since we are learning reactive skills, we transform the observed sequences into an unordered set of training cases, one for each situation. Traditional work in behavioral cloning turns an observational trace into training cases for supervised learning, treating each possible action as a class value. In contrast, we find sets of actions that occur in the same environmental situation and generate training cases that treat each observed set of actions as a class value. This lets us employ standard methods for supervised induction to partition situations into reactive but nondeterministic control policies.
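One way to picture this preprocessing step (our illustration, not the authors' code) is to group identical situations and label each with the set of actions observed for it; situations are assumed to be hashable, e.g. tuples of attribute values.

from collections import defaultdict

def action_set_classes(traces):
    """Turn (situation, action) observations from several agents into training
    cases whose class label is the set of actions seen in that situation."""
    actions_by_situation = defaultdict(set)
    for trace in traces:                       # each trace: [(situation, action), ...]
        for situation, action in trace:
            actions_by_situation[situation].add(action)
    # Each distinct action set becomes one class value for supervised induction.
    return [(situation, frozenset(actions))
            for situation, actions in actions_by_situation.items()]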
2.2 Nature of the Learned Skills
We assume that learned skills are stated in Icarus (Shapiro, 2001), a hierarchical reactive language for specifying the behavior of physical agents that encodes contingent mappings from situations to actions. Like other languages of this kind (Brooks, 1986; Firby, 1989; Georgeff et al., 1985), Icarus interprets programs in a repetitive sense-think-act loop that lets an agent retrieve a relevant action even if the world changes from one cycle of the interpreter to the next. Icarus shares the logical orientation of teleoreactive trees (Nilsson, 1994) and universal plans (Schoppers, 1987), but adds vocabulary for expressing hierarchical intent and non-deterministic choice, as well as tools for problem decomposition found in more general-purpose languages. For example, Icarus supports function call, Prolog-like parameter passing, pattern matching on facts, and recursion. We discuss a simple Icarus program in the following section.

2.3 An Icarus Plan for Driving
An Icarus program is a mechanism for finding a goal-relevant reaction to the situation at hand. The primitive building block, or plan, contains up to three elements: an objective, a set of requirements (or preconditions), and a set of alternate means for accomplishing the objective. Each of these can be instantiated by further Icarus plans, creating a logical hierarchy that terminates with calls to primitive actions or sensors. Icarus evaluates these fields in a situation-dependent order, beginning with the objective field. If the objective is already true in the world, evaluation succeeds and nothing further needs to be done. If the objective is false, the interpreter examines the requirements field to determine if the preconditions for action have been met. If so, evaluation progresses to the means field, which contains alternate methods (subplans or primitive actions) for accomplishing the objective. The means field is the locus of all choice in Icarus. Given a value function that encodes a user's preferences, the system learns to select the alternative that promises the largest expected reward.

Table 1 presents an Icarus plan for freeway driving. The top-level routine, Drive, contains an ordered set of objectives implemented as further subplans. Icarus repetitively evaluates this program, starting with its first clause every execution cycle. The first clause of Drive defines a reaction to an impending collision. If this context applies, Icarus returns the Slam-on-brakes action for application in the world. However, if Emergency-brake is not required, evaluation proceeds to the second clause, which encodes a reaction to trouble ahead, defined as a car traveling slower than the agent in the agent's own lane. This subplan contains multiple options. It lets the agent move one lane to the left, move right, slow down, or cruise at its current speed. Icarus makes a selection based on the long-term expected reward of each alternative. The remainder of the program follows a similar logic as the interpreter considers each clause of Drive in turn. If a clause returns True, the system advances to the next term. If it returns False, Drive would exit with False as its value. However, Icarus supports a third option: a clause can return an action, which becomes the return value of the
Table 1. The Icarus program for freeway driving.
Drive()
  :objective [*not*(Emergency-brake())
              *not*(Avoid-trouble-ahead())
              Get-to-target-speed()
              *not*(Avoid-trouble-behind())
              Cruise()]

Emergency-brake()
  :requires [Time-to-impact() <= 2]
  :means    [Slam-on-brakes()]

Avoid-trouble-ahead()
  :requires [?c = Car-ahead-center()
             Velocity() > Velocity(?c)]
  :means    [Safe-cruise() Safe-slow-down()
             Safe-change-left() Safe-change-right()]

Get-to-target-speed()
  :objective [Near(Velocity(), Target-speed())]
  :means     [Adjust-speed-if-lane-clear()
              Adjust-speed-if-car-in-front()]

Avoid-trouble-behind()          ;; faster car behind
  :requires [?c = Car-behind-center()
             Velocity(?c) > Velocity()]
  :means    [Safe-cruise() Safe-change-right()]

Safe-cruise()
  :requires [Time-to-impact() > 2]
  :means    [Cruise()]

Safe-slow-down()
  :requires [Time-to-impact(-2) > 2]
  :means    [Slow-down()]

Safe-speed-up()
  :requires [Time-to-impact(2) > 2]
  :means    [Speed-up()]

Safe-change-left()
  :requires [Clear-left()]
  :means    [Change-left()]

Safe-change-right()
  :requires [Clear-right()]
  :means    [Change-right()]

Adjust-speed-if-lane-clear()
  :requires [*not*(Car-ahead-center())]
  :means    [Slow-down-if-too-fast() Speed-up-if-too-slow()]

Adjust-speed-if-car-in-front()
  :requires [Car-ahead-center() *not*(Slow-down-if-too-fast())]
  :means    [Speed-up-if-too-slow() Safe-cruise() Safe-slow-down()]

Slow-down-if-too-fast()
  :requires [Velocity() > Target-speed()]
  :means    [Safe-slow-down()]

Speed-up-if-too-slow()
  :requires [Velocity() <= Target-speed()]
  :means    [Safe-speed-up()]

Slam-on-brakes()
  :action [match-speed-ahead()]
enclosing plan. For example, Avoid-trouble-behind might return Change-right, which would become the return value of Drive. Thus, the purpose of an Icarus program is to find action. At each successive iteration, Icarus can return an action from an entirely different portion of Drive. For example, the agent might slam on the brakes on cycle 1, and speed up in service of Get-to-target-speed (a goal-driven plan) on cycle 2. However, if Emergency-brake and Avoid-trouble-ahead do not apply, and the agent is already at its target speed, Icarus might return the Change-right action in service of Avoid-trouble-behind on cycle 3.
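To make this evaluation order concrete, the following is a deliberately simplified, hypothetical interpreter for plans written as Python dictionaries; real Icarus also supports parameter passing, pattern matching, recursion, and reward-guided choice, none of which is modeled here.

def evaluate(plan, choose):
    """Evaluate a simplified Icarus-style plan.
    Returns True (objective holds), False (not applicable), or an action string."""
    if "objective" in plan and plan["objective"]():
        return True                                   # nothing further needs to be done
    if "requires" in plan and not all(test() for test in plan["requires"]):
        return False                                  # preconditions unmet
    if "action" in plan:
        return plan["action"]                         # primitive action
    options = [evaluate(sub, choose) for sub in plan.get("means", [])]
    actions = [opt for opt in options if opt not in (True, False)]
    return choose(actions) if actions else False      # preference-guided selection

# A plan is a dict, e.g. {"requires": [lambda: clear_left()],
#                         "means": [{"action": "Change-left"}]} (hypothetical helpers).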
3
A Method for Learning Hierarchical Skills
Now that we have defined the task, we describe our method for learning hierarchical skills from behavioral traces. Our approach involves three distinct stages. The first induces unordered flat rules using a standard supervised learning technique. The second stage creates a classification hierarchy by combining tests that appear in multiple rules. When viewed as an action generator, this structure resembles a hierarchical program. The third stage transforms this representation
Fig. 1. Operator for promoting conditions.
into an Icarus program and simplifies it by taking advantage of the language's semantics. This section discusses each of the stages in turn.

3.1 Constructing Flat Rules
We employ CN2 (Clark & Boswell, 1991) to obtain rules that summarize the behavior represented by the input data. These rules predict a target class from attribute values, where each class corresponds to a set of available actions. In general, the number of distinct action sets, and thus the classes, will depend on the particular rules that are formed in this manner, but in this work we specified these action sets manually. The net result of this process is an unordered set of production rules.

3.2 Constructing Hierarchies
The second stage of our approach to program induction generates a classification hierarchy. One step in this process involves promoting conditions that appear in multiple rules. Consider the two rules:
– If x and y Then Action1
– If y and z Then Action2
Since the condition y appears in both rules, we can promote it by creating a more abstract rule that tests the common precondition, using a technique borrowed from work on grammar induction (e.g., Langley & Stromsten, 2000). We illustrate this transformation in Figure 1. Here, the labels on arcs denote conditional tests and the leaf nodes denote actions. The black circles indicate choice points, where one (or more) of the subsequent tests apply. These structures are interpreted from the top downwards. For example, the right side of Figure 1 classifies the current situation first by testing y and then, if y holds, by testing x and z (in parallel) to determine which action or actions apply. This structure is similar to the decision trees output by C4.5 (Quinlan, 1993), but more general in that it allows non-exclusive choice. In addition to promoting conditions, our method can promote actions within a classification hierarchy. Figure 2 provides a simple example, where Action2
Fig. 2. Operator for promoting actions.
occurs at all leaf nodes within a given subtree. If the system is guaranteed to reach at least one of the leaf nodes, it associates Action2 with the root node of the subtree, which we depict with a hollow circle. This simplification applies even if the leaf nodes are at an arbitrary depth beneath the root of the subtree. The system uses condition and action promotion to transform the flat rules learned by CN2 into a classification hierarchy. However, there are many possible ways to combine rules by promoting conditions, so we have an opportunity to shape the final classification hierarchy by defining rule-selection heuristics. The key idea is to merge rules with similar actions. In particular, we identify three heuristics that tend to combine rules with similar purposes and isolate rules that represent special cases:
1. Select rules with the same action or same set of actions.
2. Select rules with subset relations among the actions.
3. Select rules with the same conditions.
Our algorithm considers these heuristics in the indicated order. Rules that determine the same target class (action set) have the highest priority for condition promotion. The operation will only be successful, of course, if the rules share conditions. If more than two rules select the same target class, the ones that share the largest number of conditions will be combined. The second heuristic applies if no two rules select the same target class. In this case, the algorithm selects rules whose action sets bear a subset relation, such as "Speed-up" and "Speed-up, Change-right". If a single rule enters into many such pairings, the system combines the ones that share the smallest number of actions on the theory that they express the most cohesive intent. Ties are broken by a similarity metric that maximizes the number of shared and thus promotable conditions. Finally, if no action sets bear subset relations, the system picks rules that share the largest number of conditions. The system combines the pair of rules selected by these heuristics to yield a subtree with shared conditions on its top-level arc. These conditions can enter into further promotion operations, although the remaining conditions cannot be merged with any other rules. This process of rule selection and combination continues to exhaustion, with the system merging top-level conditions to build multi-layered trees. A simple example may help to clarify this algorithm. Consider three rules whose abbreviations are defined in Table 2:
IF TTIA < 52.18 AND TTIA > 1.82 AND CLR = True AND CLL = True
THEN Action = CHR, CHL, CRU, SLO

IF TTIA < 52.18 AND TTIA > 1.82 AND CLR = True AND CLL = False
THEN Action = CHR, CRU, SLO

IF TTIA < 52.18 AND TTIA > 1.82 AND CLR = False AND CLL = False
THEN Action = CRU, SLO
Although no two rules select the same target class, the actions defining the target classes bear subset relations. In this case, the algorithm will select the last two rules because their action sets are the smallest, and it will promote three conditions to obtain a new shared structure. Two of these conditions can be combined with conditions in the first rule, yielding a three-level subtree that represents all three rules. When the process of condition promotion terminates, the system adds a top-level node to represent the choice among subtrees. After this, it simplifies the structure using the action promotion rule shown in Figure 2. This produces the rightmost subtree of the classification structure in Figure 3, which we discuss later in more detail.
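The promotion operator itself is easy to state for a pair of flat rules; the sketch below (ours, not the authors' implementation) factors the shared conditions out of two rules, as in Figure 1.

def promote_conditions(rule_a, rule_b):
    """Merge two flat rules (conditions, action_set) by promoting shared conditions.
    Returns (shared_conditions, residual_rules)."""
    conds_a, actions_a = rule_a
    conds_b, actions_b = rule_b
    shared = conds_a & conds_b                     # tested once, at the new parent node
    residual_a = (conds_a - shared, actions_a)     # remaining tests become child branches
    residual_b = (conds_b - shared, actions_b)
    return shared, [residual_a, residual_b]

# Example from the text: ({"x", "y"}, {"Action1"}) and ({"y", "z"}, {"Action2"})
# share the condition "y"; the children then test "x" and "z" in parallel.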
3.3 Constructing the Icarus Program
We can simplify hierarchical classification structures by translating them into the more powerful Icarus formalism. The key idea is that the first phases of program induction always produce a mutually exclusive classification hierarchy, and thus that the branches can be ordered without loss of generality. We use Icarus to express this order, and its concept of an action as a return value to identify target classes. This lets us simplify the conditions in one branch using the knowledge that a previous branch did not return an action. Consider the fourth and fifth subtrees of the top node in Figure 3. These represent a rule to avoid collisions, and rules that respond to a slower car in front (as discussed above). If Icarus evaluates these in order, it can only reach the fifth branch if the fourth fails to return an action, meaning there is no imminent collision (T T IA > 1.82). We can use this knowledge to simplify the logical tests in the fifth subtree, producing the Icarus subplans labeled R1, R2, R21, and R22 in Table 3. This completes the process of inducing a hierarchical control program from observational traces.
4
An Experiment in Hierarchical Behavior Cloning
Now that we have discussed our method for inducing hierarchical programs, we turn to an experiment designed to evaluate the approach in a simple driving domain. To be specific, we use the Icarus program of Table 1 to generate trace data and employ our induction method to recover a second Icarus program that explains these data. We evaluate the results in terms of the accuracy and efficiency of the recovered program, as well as its conceptual similarity to the source program.
Table 2. Notation used in example rules and hierarchies.

Actions:
  CRU   Cruise
  SLO   Slow Down
  SPE   Speed Up
  MAT   Match Speed Ahead
  CHR   Change Right
  CHL   Change Left

Conditions:
  CAC   Car Ahead Center
  CBC   Car Behind Center
  CLR   Clear Right
  CLL   Clear Left
  TTIA  Time To Impact Ahead
  TTIB  Time To Impact Behind
  VEL   Velocity

4.1 Data on Driving Behavior
We used the Icarus program in Table 1 to generate trace data. Since our goal was to recover the structure of a shared driving skill, we needed the equivalent of data from multiple drivers whose preferences would collectively span the feasible behavior. Instead of creating these agents, we took the simpler approach of directly exercising every control path in the source program, while recording the feature set and the action set available at each time step. This produced a list of situation-action tuples that included every possible action response. We enumerated five values of in-lane separation (both to the car ahead and behind), five values of velocity for each of the three in-lane cars, and the status of the adjacent lane (whether it was clear or not clear). We chose the particular distance and velocity numbers to produce True and False values for the relevant predicates in the driving program (e.g., time to impact ahead, velocity relative to target speed). This procedure also created multiple occurrences of many situation-action tuples (i.e., the mapping from distance and velocity onto time to impact was many to one). The resulting data had nine attributes. Four of these were Boolean, representing the presence or absence of a car in front/back and whether the lanes to the right or left of the agent are clear. The rest were numerical attributes, two representing time to impact with the car ahead or behind, two encoding relative velocity ahead or behind, and the last measuring the agent's own velocity. Our formulation of the driving task assumed six primitive actions. We preprocessed the data to identify sets of these actions that occurred under the same situation. We obtained ten such sets, each containing one to four primitive actions. These sets define a mutually exclusive and collectively exhaustive set of classes for use in program induction.

4.2 Transformation into an Icarus Program
We employed CN2 to transform the behavioral trace obtained from the Icarus source program into a set of flat rules, and further transformed that output into a hierarchical classification structure using the condition and action promotion rules of Section 3.2. This produced the tree shown in Figure 3.
[Fig. 3 depicts the learned classification hierarchy as a tree whose arcs test conditions such as TTIA, TTIB, VEL, CAC, CLL, and CLR, and whose leaves carry action sets drawn from SPE, CRU, SLO, MAT, CHL, and CHR.]
Fig. 3. The classification hierarchy obtained by our method.

Table 3. The Icarus program induced by our method.

Drive()
  :requires [NOT(R1) NOT(R2) NOT(R3) NOT(R4)]

R1()
  :requires [TTIA < 1.82]
  :means    [MAT]

R2()
  :requires [TTIA < 52.18]
  :means    [SLO CRU R21 R22]

R21()
  :requires [CLL = True]
  :means    [CHL]

R22()
  :requires [CLR = True]
  :means    [CHR]

R3()
  :requires [VEL < 56.5]
  :means    [SPE R31]

R31()
  :requires [CAC = True]
  :means    [SLO CRU]

R4()
  :requires [NOT(R41)]
  :means    [CRU R42]

R41()
  :requires [VEL > 67.5]
  :means    [SLO]

R42()
  :requires [CLR = True TTIB < 52.18]
  :means    [CHR]
We simplified this tree by transforming it into an Icarus program via a manual process that we expect to automate in the future. We numbered the branches from left to right and considered them in the order 4,5,3,1,2. This ordering simplified the required conditions. Taken as a whole, these transformations recovered the Icarus program shown in Table 3, completing the task of inducing a hierarchical program from observations.
4.3 Experimental Evaluation
We evaluated our learning method at several stages in the transformation process. In particular, we examined the accuracy of the flat rules induced by CN2 to determine how much of the original behavior they were able to recover. (Since all of the subsequent processing steps preserve information, this corresponded to the accuracy of the final program.) In addition, we evaluated the structure of the learned Icarus program in a more subjective sense, by comparing it against the original Icarus program that generated the data. We measured the accuracy of the rules induced by CN2 using ten-fold cross validation. Each training set contained circa 4300 labeled examples, and for each of these training sets, our method induced a program that had 100% accuracy on the corresponding test set. Moreover, even though the rules induced by the first stage were slightly different across the training runs, the resulting classification hierarchies were identical to the tree in Figure 3. Thus, our heuristics for rule combination regularized the representation. When we compare the learned Icarus program in Table 3 with the original program in Table 1, several interesting features emerge. First, the learned program is simpler. It employs ten Icarus functions, whereas the original program required 14. This was quite surprising, especially since the original code was written by an expert Icarus programmer. Next, the learned program captures much of the natural structure of the driving task; the top-level routines call roughly the same number of functions, and half of those implement identical reactions. Specifically, R1 in Table 3 corresponds to Emergency-brake in Table 1, while R2 represents Avoid-trouble-ahead using a simpler gating condition. Similarly, R4 captures the behavior of Avoid-trouble-behind, although it adds the Slowdown operation found in Get-to-target-speed. R3 represents the remainder of Get-to-target-speed, absent the Slow-down action. The system repackaged these responses in a slightly more efficient way. The only feature missing from the learned program is the idea that maintaining target speed is an objective. We hope to address this issue in the future, as it raises the interesting problem of inferring the teleological structure of plans from observation.
5
Related Work on Control Learning
We have already mentioned in passing some related work on learning control policies, but the previous research on this topic deserves more detailed discussion. The largest body of work focuses on learning from delayed external rewards. Some methods (e.g., Moriarty et al., 1999) carry out direct search through the space of policies, whereas others (e.g., Kaelbling et al., 1996) estimate value functions for state-action pairs. Research in both paradigms emphasizes exploration and learning from trial and error, whereas our approach addresses learning from observed behaviors of another agent. However, the nondeterministic policies acquired in this fashion can be used to constrain and speed learning from delayed reward, as we have shown elsewhere (Shapiro et al., 2001).
Another framework learns control policies from observed behaviors, but draws heavily on domain knowledge to interpret these traces. This paradigm includes some, but not all, approaches to explanation-based learning (e.g., Segre, 1987), learning apprentices (e.g., Mitchell et al., 1985), and programming by demonstration (e.g., Cypher, 1993). The method we have reported for learning from observation relies on less background knowledge than these techniques, and also acquires reactive policies, which are not typically addressed by these paradigms. Our approach is most closely related to a third framework, known as behavioral cloning, that also observes another agent’s behavior, transforms traces into supervised training cases, and induces reactive policies. This approach typically casts learned knowledge as decision trees or logical rules (e.g., Sammut, 1996; Urbancic & Bratko, 1994), but other encodings are possible (Anderson et al., 2000; Pomerleau, 1991). In fact, our method’s first stage takes exactly this approach, but the second stage borrows ideas from work on grammar induction (e.g., Langley & Stromsten, 2000) to develop simpler and more structured representations of its learned skills.
6
Concluding Remarks
This paper has shown that it is possible to learn an accurate and well-structured program from a trace of an agent’s behavior. Our approach extends behavioral cloning techniques by inducing simpler control programs with hierarchical structure that has the potential to make them far easier to understand. Moreover, our emphasis on learning the shared components of skills holds promise for increased generality of the resulting programs. Our technique for learning hierarchical structures employed several heuristics that provided a substantial source of power. In particular, the attempt to combine rules for similar action sets tended to group rules by purpose, while the operation of promoting conditions tended to isolate special cases. Both techniques led to simpler control programs and, presumably, more understandable encodings of reactive policies. We hope to develop these ideas further in future work. For example, we will address the problem of inferring Icarus objective clauses, which is equivalent to learning teleological structure from observed behavior. We also plan to conduct experiments in other problem domains, starting with traces obtained from simulations and/or human behavior. Finally, we intend to automate the process of transforming classification hierarchies into Icarus programs. This will let us search more effectively through the space of hierarchical programs that represent observed skills. Acknowledgements. The Icarus driving program used in this work was developed by the second author under funding from the DaimlerChrysler Research and Technology Center. We thank the anonymous reviewers for comments that improved earlier drafts of the paper.
References
Anderson, C., Draper, B., & Peterson, D. (2000). Behavioral cloning of student pilots with modular neural networks. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 25–32). Stanford: Morgan Kaufmann.
Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2, 14–23.
Clark, P., & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. Proceedings of the European Working Session on Learning (pp. 151–163). Porto.
Cypher, A. (Ed.). (1993). Watch what I do: Programming by demonstration. Cambridge, MA: MIT Press.
Firby, J. (1989). Adaptive execution in complex dynamic worlds. PhD thesis, Department of Computer Science, Yale University, New Haven, CT.
Georgeff, M., Lansky, A., & Bessiere, P. (1985). A procedural logic. Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 516–523). Los Angeles: Morgan Kaufmann.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Langley, P., & Stromsten, S. (2000). Learning context-free grammars with a simplicity bias. Proceedings of the Eleventh European Conference on Machine Learning (pp. 220–228). Barcelona: Springer-Verlag.
Mitchell, T. M., Mahadevan, S., & Steinberg, L. (1985). Leap: A learning apprentice for VLSI design. Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 573–580). Los Angeles: Morgan Kaufmann.
Moriarty, D. E., Schultz, A. C., & Grefenstette, J. J. (1999). Evolutionary algorithms for reinforcement learning. Journal of Artificial Intelligence Research, 11, 241–276.
Nilsson, N. (1994). Teleoreactive programs for agent control. Journal of Artificial Intelligence Research, 1, 139–158.
Pomerleau, D. (1991). Rapidly adapting artificial neural networks for autonomous navigation. Advances in Neural Information Processing Systems 3 (pp. 429–435). San Francisco: Morgan Kaufmann.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.
Sammut, C. (1996). Automatic construction of reactive control systems using symbolic machine learning. Knowledge Engineering Review, 11, 27–42.
Schoppers, M. (1987). Universal plans for reactive robots in unpredictable environments. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 1039–1046). Milan, Italy: Morgan Kaufmann.
Segre, A. (1987). A learning apprentice system for mechanical assembly. Proceedings of the Third IEEE Conference on AI for Applications (pp. 112–117).
Shapiro, D. (2001). Value-driven agents. PhD thesis, Department of Management Science and Engineering, Stanford University, Stanford, CA.
Shapiro, D., & Langley, P. (2002). Separating skills from preference: Using learning to program by reward. Proceedings of the Nineteenth International Conference on Machine Learning (pp. 570–577). Sydney: Morgan Kaufmann.
Shapiro, D., Langley, P., & Shachter, R. (2001). Using background knowledge to speed reinforcement learning in physical agents. Proceedings of the Fifth International Conference on Autonomous Agents (pp. 254–261). Montreal: ACM Press.
Urbancic, T., & Bratko, I. (1994). Reconstructing human skill with machine learning. Proceedings of the Eleventh European Conference on Artificial Intelligence (pp. 498–502). Amsterdam: John Wiley.
Image Analysis for Detecting Faulty Spots from Microarray Images

Salla Ruosaari and Jaakko Hollmén

Helsinki University of Technology, Laboratory of Computer and Information Science,
P.O. Box 5400, 02015 HUT, Finland
[email protected], [email protected]
Abstract. Microarrays allow the monitoring of thousands of genes simultaneously. Before a measure of gene activity of an organism is obtained, however, many stages in the error-prone manual and automated process have to be performed. Without quality control, the resulting measures may, instead of being estimates of gene activity, be due to noise or systematic variation. We address the problem of detecting spots of low quality from the microarray images to prevent them to enter the subsequent analysis. We extract features describing spatial characteristics of the spots on the microarray image and train a classifier using a set of labeled spots. We assess the results for classification of individual spots using ROC analysis and for a compound classification using a non-symmetric cost structure for misclassifications.
1
Introduction
Microarray techniques have enabled the monitoring of thousands of genes simultaneously. These techniques have proven powerful in gene expression profiling for discovering new types of diseases and for predicting or diagnosing the type of a disease based on the gene expression measurements [1]. It is indeed an interesting possibility that we examine all genes of a given organism at the same time and possibly under different conditions. This opens up new ways of making discoveries, assuming that the large amounts of data can be reliably analyzed. The rapidly increasing amount of gene expression data and the complex relationships about the function of the genes has made it more difficult to analyze and understand phenomena behind the data. For these reasons, functional genomics has become an interdisciplinary science involving both biologists and computer scientists. Before estimates of the gene activities are obtained from an organism, a multi-phased process takes place allowing different sources of noise to enter the analysis. Noise is in fact a major issue with microarrays. Low quality measurements have to be detected before subsequent analysis such as clustering is performed and inferences are made. However, the detection of these poor quality spots has not been widely discussed. In this paper, we attempt to provide one solution to this problem.
2 Microarray Technology
The microarray experiments are basically threefold, involving the preparation of the samples of interest, the array construction and sample analysis, and the data handling and interpretation. The microarray itself is simply a glass slide onto which differing single-stranded DNA chains have been attached at fixed loci. The phenomenon that microarrays exploit is the preferential binding of complementary single-stranded sequences. Typically, mRNA extracted from two different samples is brought into contact with the slide as it is washed over the microarray. Hybridization takes place at spots where complementary sequences meet. Therefore, hybridization of certain nucleic acid sequences on the slide indicates the presence of the complementary chain in the samples of interest.

2.1 Two-Sample Competitive Hybridization and Dye Separation
A popular experimental procedure is the monitoring of the mRNA abundance in two samples. When two samples are simultaneously allowed to hybridize with the sequences on the slide, the relative abundance of the hybridized mRNA in the samples can be measured. This measure is assumed to reflect the relative protein-manufacturing activity in the cells. Often a common reference is used, making further comparisons of gene activities, e.g. between individuals, possible. The two samples are labeled with different fluorescent dyes, allowing their separation when excited with the corresponding laser. When the whole slide is scanned, two 16-bit images such as the one in Fig. 1 are obtained, each reflecting the gene activities of the respective sample. The intensities of the image pixels correspond to the level of hybridization of the samples to the DNA sequences on the microarray slide.

2.2 From Digitized Images to Intensity Measures
To get an estimate of the gene activities, the pixels corresponding to the gene spots, and consequently the genes, must be found. The images are segmented or partitioned into foreground (i.e. belonging to a gene) and background regions. The gene activity estimates are then derived from the foreground regions. Many different methods exist including the average intensity of pixels inside some predefined area around the assumed spot center or within an area found by seeded region growing or histogram segmentation. Estimates of the background noise can also be obtained. The estimates can be global, i.e. all genes are assumed to include the same noise or local, i.e. the background estimate is determined individually for all genes or for some set of genes using a (predefined) combination of the pixel intensities outside the area used for gene activity estimation. The gene activity estimation has an impact on the subsequent data analysis and interpretation. If the gene’s measured activity is not due to the activity itself, subsequent analysis using this erroneous estimate will, of course, be misleading. To overcome this, background correction is often done, usually simply by subtracting the background intensity estimates from the gene activity estimates.
Fig. 1. A scanned microarray image and four example spots, which demonstrate possible problems, i.e. spots of varying sizes, scratches, and noise.
Depending on how the gene activity estimate and the background estimate have been derived, the resulting measures may deviate considerably. Image analysis methods using predefined regions, histogram segmentation, or region growing essentially all lead to biased results, even if background correction is used, if the data quality is not taken into consideration. This can be understood by observing Fig. 1. The spots may be of various sizes or contaminated, and can therefore affect the activity estimation when no attention is given to the spatial information. The Mann-Whitney segmentation algorithm may provide better results, as it associates a confidence level with every intensity measurement based on significance [2]. If the noise level on the slides is not constant, measures not due to gene activity may start dominating the results, as most of the genes on typical slides are silent. Background estimations may be even more affected by contamination. In order for the background correction to be effective, the background estimates should be derived iteratively and not by using the same pixels for each spot. Moreover, the most contaminated spots should be excluded from the analysis, as their measures do not reflect the gene activity at all. Replicate measurements may be of help [3], especially when the median of the measures is used in the analysis. To this day, little has been published on data quality related issues. Previously, the effect of the choice of image analysis method has been assessed. It has been shown that background adjustment can substantially reduce the precision of low-intensity spot values, whilst the choice of segmentation procedure has a smaller impact [4]. Measures based on spot size irregularity, signal-to-noise ratio, local background level and variation, and intensity saturation have been used to evaluate spot quality [5]. Experiments on error models for gene array data and expression level estimation from noisy data have been carried out [6]. The intrinsic noise of cells has also been researched [7,8].
3 Detection of Faulty Spots
Our work is based on analyzing real-valued raw 16-bit images with the approximate gene loci known. Each gene spot is searched for within a 31 x 31 environment defined by the gene center locus obtained as a result of previous image segmentation with the QuantArray software. The sizes of these blocks were chosen to allow some inexactness in the gene loci and to be large enough to include valid spot pixels. We apply image analysis techniques in extracting spatial features describing relevant properties of microarray spots [9].

3.1 Defining the Spot Area
The spot area is defined on the basis of raw pixel intensity values and their spatial distribution. We assume that the intensity of the spot pixels deviates from the background intensity in the positive direction. At the initial step, the raw pixels are judged to belong to the spot if their raw intensity is more than 12.5 percent of the maximum pixel intensity found in the 31 x 31 image. This is how histogram segmentation methods work. Here, however, the histogram segmentation forms only the initial step of the segmentation procedure. From these regions, the largest connected block of pixels is picked using eight-connectivity, and pixels inside the area are joined to the area using four-connectivity. This way, we obtain a binary image in which the spot area is differentiated from the background. Examples of these images, which can be regarded as masks for the original intensity images, are shown in Fig. 2.
Fig. 2. The search for the spot area is presented using a non-faulty spot (a–c) and a faulty spot (d–f). The 31 × 31 pixel block around the spot centers (a and d), the corresponding binary image obtained using a threshold of 12.5 percent of the maximum intensity found within this block (b and e), and the largest connected region of the binary image with holes filled (c and f).
3.2
Spatial Features of the Spots
We assume that features extracted from the spot area can be used to describe the quality of the measurement. The features are collected into a feature vector x = [x1 , . . . , x6 ] and are later used to discard redundant low quality data from subsequent analysis. Through the choice of the features, an implicit model for
the spots is defined. The image pixel coordinates are denoted as (h, v) pairs and the individual pixel coordinates with hi and vi, i = 1, . . . , n, n being the number of pixels belonging to the spot in this context. The features we extract are:
– The horizontal range of the spot: x1 = max(|hi − hj|), i ≠ j
– The vertical range of the spot: x2 = max(|vi − vj|), i ≠ j
– The elongation of the spot as the ratio of the eigenvalues: x3 = λ1/λ2
– The circularity of the spot as the ratio between the area of the estimated spot and an ideal circle with the same perimeter: x4 = 4π · Area / (Perimeter)²
– The uniformity of the spot, expressed as the Euclidean distance between the mass center of the binary image and the mass center of the intensity image masked with the binary image: x5 = ‖ (1/n) Σ_{i=1}^{n} (hi, vi) − (1/n) Σ_{i=1}^{n} int_i · (hi, vi) ‖
– The Euclidean distance between the mass center of the binary image and the assumed spot center (hc, vc): x6 = ‖ (1/n) Σ_{i=1}^{n} (hi, vi) − (hc, vc) ‖
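A hedged sketch of how these six features could be computed with NumPy follows (an illustration under the definitions above, not the authors' code). Here `mask` is assumed to be the boolean spot mask, `block` the corresponding intensity image, and `center` the assumed spot center (hc, vc); the intensity-weighted mass center is normalized in this sketch.

```python
import numpy as np

def spot_features(mask, block, center):
    """Feature vector x = [x1, ..., x6] for one spot (illustrative sketch)."""
    h, v = np.nonzero(mask)                       # pixel coordinates of the spot
    coords = np.stack([h, v], axis=1).astype(float)
    x1 = h.max() - h.min()                        # horizontal range
    x2 = v.max() - v.min()                        # vertical range
    eigvals = np.linalg.eigvalsh(np.cov(coords, rowvar=False))
    x3 = eigvals[-1] / (eigvals[0] + 1e-12)       # elongation: ratio of eigenvalues
    area = mask.sum()
    # Perimeter approximated as the number of spot pixels with a non-spot 4-neighbour.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = (mask & ~interior).sum()
    x4 = 4 * np.pi * area / perimeter ** 2        # circularity
    weights = block[h, v] / block[h, v].sum()     # normalised intensity weights
    x5 = np.linalg.norm(coords.mean(axis=0) - (coords * weights[:, None]).sum(axis=0))
    x6 = np.linalg.norm(coords.mean(axis=0) - np.asarray(center, dtype=float))
    return np.array([x1, x2, x3, x4, x5, x6])
```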
3.3 Classification Based on the Spatial Features
As stated earlier, our primary task is to classify microarray spots into the classes faulty and good. This binary class variable ci is predicted on the basis of six features, or input variables, describing relevant properties of the objects to be classified. Having access to n labeled training data, that is, pairs (xi, ci), i = 1, . . . , n, we can train a classification model in order to classify future cases where label information is not available. Based on the assumption that the classes have differing statistical properties in terms of the distributions of the feature variables, we may use the class-conditional approach [10,11]. Supposing we already have a classification model, we may assign a spot to the class to which it is most likely to belong, i.e. the class whose posterior probability is the largest. Using Bayes' rule, this is equivalent to assigning the spot, i.e. the feature vector x derived from it, to the class ci for which the discriminant function gi is the largest, as in cj = arg max_k g_k(xj), where gi(x) = log p(x|ci)p(ci). The underlying distributions p(x|ci) are assumed to be Gaussian. The parameters of the class-conditional distributions, i.e. mean vectors and covariance matrices, are estimated from pre-labeled training data. The prior probabilities are not of concern because the optimal bias is found by observing the misclassification costs.
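As a concrete illustration of such a class-conditional Gaussian discriminant (a sketch under the assumptions above, not the authors' implementation), the following Python/NumPy code assumes a feature matrix X, integer class labels c, and a dictionary of class priors.

```python
import numpy as np

def fit_class_conditionals(X, c):
    """Estimate the mean vector and covariance matrix for each class."""
    params = {}
    for k in np.unique(c):
        Xk = X[c == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return params

def discriminant(x, mean, cov, prior):
    """g(x) = log p(x|c) + log p(c) for a Gaussian class-conditional density."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    log_px = -0.5 * (d * np.log(2 * np.pi) + logdet
                     + diff @ np.linalg.solve(cov, diff))
    return log_px + np.log(prior)

def classify(x, params, priors):
    """Assign x to the class with the largest discriminant value."""
    return max(params, key=lambda k: discriminant(x, *params[k], priors[k]))
```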
3.4 Assessment of Classification Results
Before putting the scheme into practice, it is important to assess its accuracy in detecting faulty spots. We are interested in the following two aspects: first, how often the individual spots are classified correctly and how often the
spots are misclassified in the two possible directions (good as faulty and faulty as good); and second, when combining the results for the three classifications of replicate spot measurements, what is the most beneficial compound result that fulfills our goal. In both approaches, we face the problem of choosing an optimal decision function. The Receiver Operating Characteristic (ROC) curve [12,13] visualizes the trade-off between false alarms and detections, helping the user to choose an optimal decision function. With the ROC curve, we can assess the errors made in the classification of individual spots. However, we are in fact faced with the need to classify three spots that are repeated measurements of the same gene expression, two of which are possibly redundant. We are fundamentally interested in the correct classification of good spots as good (true negative, tn) and faulty spots as faulty (true positive, tp), but the situation is complicated by the fact that classifying good spots as faulty (false positive, fp) is not considered so harmful as long as at least one of the replicate good spots is classified correctly. In contrast, classifying faulty spots as good (false negative, fn) is considered harmful, since measurements of the faulty spots may enter the subsequent analysis. Formulating the above as a matrix of misclassification costs, we get Λ = (λij) = Σ fn / Σ(tn + fn), with the exception λi4 = 1 when i = 1, 2, 3. The entries λij of the cost matrix signify how much cost is incurred when the compound configuration i of three spots is chosen while j is in fact the right choice. For instance, the entry λ41 signifies the cost of classifying the compound classification faulty–faulty–faulty as good–good–good, and therefore a cost of 1 unit is incurred. The order of the outcomes is irrelevant as long as the classification–label pairs match. The cost matrix contains off-diagonal zeros to allow misclassifications of some good spots as long as at least one good spot is classified as good. If a good spot finally enters the subsequent analysis, our goal is fulfilled.
Fig. 3. Classification results presented with a ROC curve (a) and as a function of classification cost with a varying boundary threshold (b).
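The working-point selection described above can be illustrated with a short, hedged sketch (not the authors' code). Here `scores` is assumed to be the difference of the two discriminant values for each spot (larger meaning more likely faulty), `labels` marks the truly faulty spots with 1, and the example cost values are arbitrary placeholders for the asymmetric cost matrix.

```python
import numpy as np

def roc_points(scores, labels):
    """True/false positive rates as the decision threshold on the score is varied."""
    order = np.argsort(-scores)
    labels = labels[order]                      # 1 = faulty, 0 = valid
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    return tp / labels.sum(), fp / (1 - labels).sum()

def best_threshold(scores, labels, cost_fn=10.0, cost_fp=1.0):
    """Working point minimizing an asymmetric misclassification cost (sketch)."""
    thresholds = np.unique(scores)
    costs = [cost_fn * np.sum((scores < t) & (labels == 1)) +   # faulty called good
             cost_fp * np.sum((scores >= t) & (labels == 0))    # good called faulty
             for t in thresholds]
    return thresholds[int(np.argmin(costs))]
```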
4
Experimental Results
The covariance matrices and mean vectors of the class-descriptive normal distributions were estimated from data consisting of 7488 spots. The spots were visually determined to be either valid or faulty, enabling the derivation of the class-separating discriminant functions. Data consisting of 2881 spots, of which 2617 were valid and 264 faulty, was used to test the classifier. Each test spot was considered to be an independent sample. The results are presented with a ROC curve in Fig. 3a. The ROC curve characterizes the diagnostic accuracy of the classifier. The false positive rate is the probability of incorrectly classifying a valid spot and is thus the complement of the specificity of our classifier. Equally, the true positive rate is the probability of correctly classifying a faulty spot. As random guessing would result in a straight line connecting the points (0,0) and (1,1), our performance is much improved over that. As Fig. 3a shows, the true positive rate of our classifier is high even at rather low false positive rates, indicating high sensitivity. However, a perfect classifier would have a true positive rate equal to 1.0. Note that the false positive axis has been scaled from 0 to 0.2. Attaining true positive rates close to one is difficult due to the various sources and types of noise on the array. However, the optimal working point of the classifier can be found by associating costs with the different possible errors that can be made. This was done to assess the quality of replicate spot classification. The spots were considered in triplets, with costs incurred each time an invalid spot is labeled as valid or all valid spots are classified as faulty. The resulting curve is shown in Fig. 3b. Observing Fig. 3b shows that the location of the curve minimum is shifted away from 0. The costs assigned to misclassifications introduce a bias into the class-separating boundaries, as the cost matrix is asymmetric. The classification costs are therefore minimal when the threshold equals about −6. With our data, this is the optimum working point. If a more negative threshold is chosen, more faulty spots become labeled as valid, reducing the sensitivity of the classifier. On the other hand, a more positive threshold reduces the specificity. However, costs are also incurred when the threshold equal to −6 is chosen, because the classifier is imperfect. The non-symmetric slopes of Fig. 3b are due to the different variances of the features derived from valid and faulty spots. As the variance between the valid spots is small, the specificity decreases faster with increasing threshold than the sensitivity does with decreasing threshold, introducing costs. The features derived from high-intensity noise are well separated from those derived from valid spots, whereas the separation between valid spots and spot-like dirt is smaller. The noise spots that are very different from the valid ones become classified as valid only when the threshold is shifted very far away from the unbiased boundary. Thus, the slope is very gentle when moving in the reduced-sensitivity direction.
5
Summary
Microarray technology offers new ways to explore the functions of the genome. For making reliable analyses, the quality aspects of the data have to be taken into account. In this paper, we proposed an automated classification of microarray image spots into the classes faulty and good, based on features derived from the spatial characteristics of the individual spots on the microarray. Assessment was presented for the classification of individual spots using ROC analysis and for the compound classification of replicate measurements using a non-symmetric misclassification cost matrix.
References
1. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.H. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
2. Yidong Chen, Edward R. Dougherty, and Michael L. Bittner. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics, 1997.
3. Mei-Ling Ting Lee, Frank C. Kuo, G.A. Whitmore, and Jeffrey Sklar. Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl Acad. Sci. USA, 2000.
4. Yee Hwa Yang, Michael J. Buckley, Sandrine Dudoit, and Terence P. Speed. Comparison of methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California, Berkeley, December 2000.
5. Xujing Wang, Soumitra Ghosh, and Sun-Wei Guo. Quantitative quality control in microarray image processing and data acquisition. Nucleic Acids Research, 29(15), 2001.
6. Ron Dror. Noise models in gene array analysis. Report in fulfillment of the area exam requirement in the MIT Department of Electrical Engineering and Computer Science, 2001.
7. Mukund Thattai and Alexander van Oudenaarden. Intrinsic noise in gene regulatory networks. Proc. Natl Acad. Sci. USA, 2001.
8. Ertugrul M. Ozbudak, Mukund Thattai, Iren Kurtser, Alan D. Grossman, and Alexander van Oudenaarden. Regulation of noise in the expression of a single gene. Nature Genetics, 2002.
9. Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis and Machine Vision. Chapman & Hall Computing, 1993.
10. David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. Adaptive Computation and Machine Learning Series. MIT Press, 2001.
11. Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, second edition, 2001.
12. J.P. Egan. Signal Detection Theory and ROC Analysis. New York: Academic Press, 1975.
13. John A. Swets. Measuring the accuracy of diagnostic systems. Science, 240:1285–1293, 1988.
Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data Using Differential Equations Michiel de Hoon, Seiya Imoto, and Satoru Miyano Human Genome Center, Institute of Medical Science, University of Tokyo 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan {mdehoon,imoto,miyano}@ims.u-tokyo.ac.jp
Abstract. Spurred by advances in cDNA microarray technology, gene expression data are increasingly becoming available. In time-ordered data, the expression levels are measured at several points in time following some experimental manipulation. A gene regulatory network can be inferred by fitting a linear system of differential equations to the gene expression data. As biologically the gene regulatory network is known to be sparse, we expect most coefficients in such a linear system of differential equations to be zero. In previously proposed methods to infer such a linear system, ad hoc assumptions were made to limit the number of nonzero coefficients in the system. Instead, we propose to infer the degree of sparseness of the gene regulatory network from the data, where we determine which coefficients are nonzero by using Akaike’s Information Criterion.
1
Introduction
The recently developed cDNA microarray technology allows gene expression levels to be measured for the whole genome at the same time. While the amount of available gene expression data has been increasing rapidly, the required mathematical techniques to analyze such data are still in development. In particular, deriving a gene regulatory network from gene expression data has proven to be a difficult task. In time-ordered gene expression measurements, the temporal pattern of gene expression is investigated by measuring the gene expression levels at a small number of points in time. Periodically varying gene expression levels have for instance been measured during the cell cycle of the yeast Saccharomyces cerevisiae [1]. The gene response to a slowly changing environment has been measured during the diauxic shift in the yeast metabolism from anaerobic fermentation to aerobic respiration due to glucose depletion [2]. In other experiments, the temporal gene expression pattern due to an abrupt change in the environment of the organism is measured. As an example, the gene expression response of the cyanobacterium Synechocystis sp. PCC 6803 was measured after a sudden shift in the intensity of external light [3,4].
A number of methods have been proposed to infer gene interactions from gene expression data. In cluster analysis [2,5,6], genes are grouped together based on the similarity between their gene expression profiles. Several measures of similarity can be used, such as the Euclidean distance, correlation, or angle between two gene expression data vectors. Inferring Boolean or Bayesian networks from measured gene expression data has been proposed previously [7,8,9,10,11], as well as modeling gene expression data using an arbitrary system of differential equations [12]. However, a long series of time-ordered gene expression data would be needed to reliably infer such an arbitrary system of differential equations, and such data are often not yet available. Instead, we will consider inferring a linear system of differential equations from gene expression data. This approach maintains the advantages of quantitativeness and causality inherent in differential equations, while being simple enough to be computationally tractable. Previously, modeling biological data with linear differential equations was considered theoretically by Chen [13]. In this model, both the mRNA and the protein concentrations were described by a system of linear differential equations. Such a system can be described as

d x(t) / dt = M · x(t) ,    (1)

in which M is a constant matrix with units of [second]^{−1}, and the vector x(t) contains the mRNA and protein concentrations as a function of time. A matrix element Mij represents the effect of the concentration of mRNA or protein j on the concentration of mRNA or protein i, where [Mij]^{−1} (with units of [second]) corresponds to the typical time it takes for the concentration of j to significantly respond to changes in the concentration of i. To infer the coefficients in the system of differential equations from measured data, Chen suggested to replace the system of differential equations with a system of difference equations, substitute the measured mRNA and protein concentrations, and solve the resulting linear system of equations in order to find the coefficients Mij in the system of linear differential equations. The system is simplified by making the following assumptions:
– mRNA concentrations can only affect the protein concentrations directly;
– protein concentrations can only affect the mRNA concentrations directly;
– one type of mRNA is involved in the production of one type of protein only.
The resulting system of equations is still underdetermined. Using the additional requirement that the gene regulatory network should be sparse, it is shown that the model can be constructed in O(m^{h+1}) time, where m is the number of genes and h is the number of non-zero coefficients allowed for each differential equation in the system [13]. The parameter h is chosen ad hoc. Although describing a gene regulatory network with differential equations is appealing, there is one drawback to this method. For a given parameter h, each column in the matrix M will have exactly h nonzero elements. This means that
every gene or protein in the system affects h other genes or proteins. This has two consequences:
– no genes or proteins can exist at the bottom of a network, as every gene or protein is the parent of h other genes or proteins in the network;
– the inferred network inevitably contains loops.
While feedback loops are likely to exist in gene regulatory networks, this method artificially produces loops instead of determining their existence from the data. In Bayesian networks, on the other hand, no loops are allowed. Bayesian networks rely on the joint probability distribution of the estimated network being decomposable into a product of conditional probability distributions. This decomposition is possible only in the absence of loops. In addition, Bayesian networks tend to contain many parameters, and therefore a large amount of data is needed to estimate such a model. We therefore aim to find a method that allows the existence of loops in the network, but does not dictate their presence. Using equation (1), we also construct a sparse matrix by limiting the number of non-zero coefficients that may appear in the system. However, we do not choose this number ad hoc; instead, we estimate the number of nonzero parameters from the data by using Akaike's Information Criterion (AIC). This enables us to obtain the sparseness of the gene regulatory network from the gene expression data. In contrast to previous methods, the number of gene regulatory pathways is allowed to be different for each gene. Usually, in cDNA microarray experiments only the gene expression levels are found by measuring the corresponding mRNA concentrations, whereas the protein concentrations are unknown. To analyze the results from such experiments, we therefore construct a system of differential equations in which genes are allowed to affect each other directly, since proteins are no longer available in the model to act as an intermediary. The vector x then only contains the mRNA concentrations, and the matrix M describes gene–gene interactions.
2
Method
Consider the gene expression ratios of m genes as a function of time. At a given time t, the expression ratios can be written as a vector x(t) with m entries. The interactions between these genes can be described quantitatively in terms of a system of differential equations. Several forms can be chosen for the differential equations. We have chosen a system of linear differential equations (1), which is the simplest possible model. This equation can be solved as

x(t) = exp(M t) · x0 ,    (2)

in which x0 is the gene expression ratio at time zero. In this equation, the matrix exponential is defined by the Taylor expansion of the exponential function [14]:

exp(A) ≡ Σ_{i=0}^{∞} (1/i!) A^i .    (3)
This definition can be found from the usual Taylor expansion of the exponential of a real number a:

exp(a) = Σ_{i=0}^{∞} (1/i!) a^i ,    (4)

by replacing the multiplication by a matrix dot product. For a 1 × 1 matrix A, equation (3) reduces to equation (4). Notice that in general, exp(A) is not the element-wise exponential of A. Equation (2) frequently occurs in the natural sciences, in particular to describe radioactive decay. In that context, x contains the activity of the radioactive elements, while the matrix M effectively describes the radioactive half-lives of the elements. Since equation (2) is nonlinear in M, it will still be very difficult to solve for M using experimental data. We therefore approximate the differential equation (1) by a difference equation:

∆x / ∆t = M · x ,    (5)

or

x(t + ∆t) − x(t) = ∆t · M · x(t) ,    (6)

similarly to Chen [13]. To this equation, we now add an error ε(t), which will invariably be present in the data:

x(t + ∆t) − x(t) = ∆t · M · x(t) + ε(t) .    (7)
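As a hedged implementation note (not part of the original paper): in Python the matrix exponential of equation (2) is available as scipy.linalg.expm, which is not the same as the element-wise numpy.exp. The toy matrix, initial state and time below are assumptions chosen only to show the distinction.

```python
import numpy as np
from scipy.linalg import expm

M = np.array([[0.0, 0.5], [-0.5, 0.0]])   # toy interaction matrix (assumed)
x0 = np.array([1.0, 0.0])
t = 2.0
x_t = expm(M * t) @ x0        # matrix exponential: solution of dx/dt = M x
elementwise = np.exp(M * t)   # NOT the same thing; shown only for contrast
```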
By using equation (7), we effectively describe a gene expression network in terms of a multidimensional linear Markov model, in which the state of the system at time t + ∆t depends linearly on the state at time t, plus a noise term. We assume that the error has a normal distribution independent of time:

f(ε(t); σ²) = Π_{j=1}^{m} (1/√(2πσ²)) exp( −εj(t)² / (2σ²) )
            = (1/√(2πσ²))^m exp( −ε(t)^T · ε(t) / (2σ²) ) ,    (8)

with a standard deviation σ equal for all genes at all times. The log-likelihood function for a series of time-ordered measurements xi at times ti, i ∈ {1, . . . , n}, at n time points is then

L(M, σ²) = −(nm/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} ε̂i^T · ε̂i ,    (9)

in which we use equation (6) to estimate the error at time ti from the measured data:

ε̂i = xi − xi−1 − (ti − ti−1) · M · xi−1 .    (10)
The maximum likelihood estimate of the variance σ² can be found by maximizing the log-likelihood function with respect to σ². By taking the partial derivative with respect to σ² and setting the result equal to zero, we find

σ̂² = (1/(nm)) Σ_{i=1}^{n} ε̂i^T · ε̂i .    (11)

Substituting this into the log-likelihood function (9) yields

L(M, σ² = σ̂²) = −(nm/2) ln(2πσ̂²) − nm/2 .    (12)
The maximum likelihood estimate M̂ of the matrix M can now be found by minimizing σ̂². By taking the derivative of equation (11) with respect to M, we find that σ̂² is minimized for

M̂ = B · A^{−1} ,    (13)

where the matrices A and B are defined as

A ≡ Σ_{i=1}^{n} (ti − ti−1)² · xi−1 · xi−1^T    (14)

and

B ≡ Σ_{i=1}^{n} (ti − ti−1) · (xi − xi−1) · xi−1^T .    (15)
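A minimal NumPy sketch of this maximum likelihood estimate follows (an illustration under the equations above, not the authors' code). The array X is assumed to hold the n measured expression vectors as rows, ordered by the measurement times t; the residual average is taken over all available residual entries.

```python
import numpy as np

def ml_estimate(X, t):
    """Maximum likelihood estimate of M in x(t_i) - x(t_{i-1}) = dt * M * x(t_{i-1}).

    X : (n, m) array of expression ratios, rows ordered by time.
    t : (n,) array of measurement times.
    """
    dt = np.diff(t)                                    # t_i - t_{i-1}
    Xprev, Xnext = X[:-1], X[1:]
    # A = sum_i dt_i^2 x_{i-1} x_{i-1}^T ; B = sum_i dt_i (x_i - x_{i-1}) x_{i-1}^T
    A = (dt[:, None, None] ** 2 * Xprev[:, :, None] * Xprev[:, None, :]).sum(axis=0)
    B = (dt[:, None, None] * (Xnext - Xprev)[:, :, None] * Xprev[:, None, :]).sum(axis=0)
    M_hat = B @ np.linalg.inv(A)                       # eq. (13)
    resid = Xnext - Xprev - dt[:, None] * (Xprev @ M_hat.T)   # eq. (10) residuals
    sigma2_hat = (resid ** 2).sum() / resid.size              # eq. (11), per entry
    return M_hat, sigma2_hat
```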
In the absence of errors, the estimated matrix M̂ is equal to the true matrix M. We know from biology that the gene regulatory network, and therefore M, is sparse. However, the presence of noise in experiments would cause most or all of the elements in the estimated matrix M̂ to be nonzero, even if the corresponding element in the true matrix M is zero. We can determine if a matrix element is nonzero due to noise by setting it equal to zero and recalculating the total squared error as given in equation (11). If the increase in the total squared error is small, we conclude that the previously calculated value of the matrix element is due to noise. Formally, we can decide if matrix elements should be set to zero using Akaike's Information Criterion [15,16]

AIC = −2 · (log-likelihood of the estimated model) + 2 · (number of estimated parameters) ,    (16)

in which the estimated parameters are σ̂² and the elements of the matrix M̂ that we allow to be nonzero. The AIC avoids overfitting of a model to data by comparing the total error in the estimated model to the number of parameters that was used in the model. The model which has the lowest AIC is then considered to be optimal. The AIC is based on information theory and is widely used for statistical model identification, especially for time series model fitting [17].
Substituting the estimated log-likelihood function from equation (12) into equation (16), we find

AIC = nm ln(2πσ̂²) + nm + 2 · (number of nonzero elements in M̂) .    (17)

From this equation, we see that while the squared error decreases, the AIC may increase as the number of nonzero elements increases. A gene regulatory network can now be estimated using the following procedure. Starting from the measured gene expression levels xi at time points ti, we calculate the matrices A and B as defined in equations (14) and (15). We find the maximum likelihood estimate M̂ of the matrix M from equation (13). The corresponding squared error is found from equations (10) and (11). Equation (17) gives us the AIC for the maximum likelihood estimate of M. We then generate a new matrix M̂ by forcing a set of matrix elements of M̂ equal to zero.
The remaining matrix elements of M̂ are recalculated by minimizing σ̂² using the Lagrangian multiplier technique. We calculate the squared error σ̂² and the AIC for this modified matrix M̂. The matrix M̂, and its corresponding set of zeroed matrix elements, that yields the lowest value for the AIC is then the final estimated gene regulatory network. In typical cDNA microarray experiments, the number of genes is several thousands, of which several tens to hundreds are affected by the experimental manipulation. Due to the size of the matrix M, the number of sets of zeroed matrix elements is extremely large, and an exhaustive search to find the optimal combination of zeroed matrix elements is not feasible. Instead, we propose a greedy search. First, we randomly choose an initial set of matrix elements that we set equal to zero. For every matrix element, we determine if the AIC is reduced if we change the state of the matrix element between zeroed and not zeroed. If the AIC is reduced, we change the state of the matrix element and continue with the next matrix element. This process is stopped if the AIC can no longer be reduced. We then repeat this algorithm many times, starting from different initial sets of zeroed matrix elements. If the algorithm described above yields the same set of zeroed elements several times, we can assume that no other sets of zeroed elements with a lower AIC exist.
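The following Python sketch illustrates the greedy AIC search under one simplification: instead of re-fitting the remaining elements with the Lagrangian multiplier technique, zeroed elements are simply masked out of the unconstrained estimate M_full. All function and parameter names are assumptions made for this illustration, and n here counts residual entries rather than the paper's time points.

```python
import numpy as np

def aic(X, t, M, n_nonzero):
    """AIC of eq. (17) for a (possibly sparsified) coefficient matrix M."""
    dt = np.diff(t)
    resid = X[1:] - X[:-1] - dt[:, None] * (X[:-1] @ M.T)   # eq. (10)
    nm = resid.size                                         # residual entries
    sigma2 = (resid ** 2).sum() / nm                        # eq. (11)
    return nm * np.log(2 * np.pi * sigma2) + nm + 2 * n_nonzero

def greedy_sparsify(X, t, M_full, n_restarts=20, rng=np.random.default_rng(0)):
    """Greedy AIC search over sets of zeroed elements (simplified sketch)."""
    best_M, best_aic = None, np.inf
    m = M_full.shape[0]
    for _ in range(n_restarts):
        mask = rng.random(M_full.shape) < 0.5          # True = element kept nonzero
        current = aic(X, t, np.where(mask, M_full, 0.0), int(mask.sum()))
        improved = True
        while improved:
            improved = False
            for i in range(m):
                for j in range(m):
                    mask[i, j] = ~mask[i, j]           # try toggling this element
                    score = aic(X, t, np.where(mask, M_full, 0.0), int(mask.sum()))
                    if score < current:
                        current, improved = score, True
                    else:
                        mask[i, j] = ~mask[i, j]       # revert the toggle
        if current < best_aic:
            best_aic, best_M = current, np.where(mask, M_full, 0.0)
    return best_M, best_aic
```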
3
Discussion
We have shown a method to infer a gene regulatory network in the form of a linear system of differential equations from measured gene expression data. Due to the limited number of time points at which measurements are typically made, finding a gene regulatory network is usually an underdetermined problem, as more than one network can be found that is consistent with the measured data. Since in biology the resulting gene regulatory network is expected to be sparse, we set most of the matrix elements equal to zero, and infer a network using only
the nonzero elements. The number of nonzero elements, and thus the sparseness of the network, is inferred from the data using Akaike's Information Criterion. Describing a gene network in terms of differential equations has three advantages. First, the set of differential equations describes causal relations between genes: a coefficient Mij of the coefficient matrix represents the effect of gene j on gene i. Second, it describes gene interactions in an explicitly numerical form. Third, because of the large amount of information present in a system of differential equations, other network forms can easily be derived from it. We can also link the inferred network to other analysis or visualization tools, for instance Genomic Object Net [18]. While the method proposed here allows loops to be present in the network, it does not dictate their existence. Loops are only found if the measured data warrant them. Previously described methods to infer gene regulatory networks from gene expression data either artificially generate loops or, in the case of Bayesian network models, do not allow the presence of loops. It should be noted that recently, Dynamic Bayesian Networks have been applied to represent feedback loops [19,20]. In a Dynamic Bayesian Network, nodes of the Bayesian network at time t + ∆t are connected to nodes of the Bayesian network at time t, thereby effectively creating one network for time-independent behavior and another network for time-dependent behavior. A practical example of our method applied to measured gene expression data will appear in the Proceedings of the Pacific Symposium on Biocomputing (PSB 2003).
References
1. Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9 (1998) 3273–3297.
2. DeRisi, J., Iyer, V., Brown, P.: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278 (1997) 680–686.
3. Hihara, Y., Kamei, A., Kanehisa, M., Kaplan, A., Ikeuchi, M.: DNA microarray analysis of cyanobacterial gene expression during acclimation to high light. The Plant Cell 13 (2001) 793–806.
4. De Hoon, M., Imoto, S., Miyano, S.: Statistical analysis of a small set of time-ordered gene expression data using linear splines. Bioinformatics, in press.
5. Eisen, M., Spellman, P., Brown, P., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95 (1998) 14863–14868.
6. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., Golub, T.: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96 (1999) 2907–2912.
7. Liang, S., Fuhrman, S., Somogyi, R.: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Proc. Pac. Symp. on Biocomputing 3 (1998) 18–29.
8. Akutsu, T., Miyano, S., Kuhara, S.: Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics 16 (2000) 727–734.
9. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. J. Comp. Biol. 7 (2000) 601–620.
10. Imoto, S., Goto, T., Miyano, S.: Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Proc. Pac. Symp. on Biocomputing 7 (2002) 175–186.
11. Imoto, S., Sunyong, K., Goto, T., Aburatani, S., Tashiro, K., Kuhara, S., Miyano, S.: Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Proceedings of the IEEE Computer Society Bioinformatics Conference, Stanford, California (2002) 219–227.
12. Sakamoto, E., Iba, H.: Evolutionary inference of a biological network as differential equations by genetic programming. Genome Informatics 12 (2001) 276–277.
13. Chen, T., He, H., Church, G.: Modeling gene expression with differential equations. Proc. Pac. Symp. on Biocomputing 4 (1999) 29–40.
14. Horn, R., Johnson, C.: Matrix Analysis. Cambridge University Press, Cambridge, UK (1999).
15. Akaike, H.: Information theory and an extension of the maximum likelihood principle. Research Memorandum No. 46, Institute of Statistical Mathematics, Tokyo (1971). In Petrov, B. and Csaki, F. (editors): 2nd Int. Symp. on Inf. Theory. Akadémiai Kiadó, Budapest (1973) 267–281.
16. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Automat. Contr. AC-19 (1974) 716–723.
17. Priestley, M.: Spectral Analysis and Time Series. Academic Press, London (1994).
18. Matsuno, H., Doi, A., Hirata, Y., Miyano, S.: XML documentation of biopathways and their simulation in Genomic Object Net. Genome Informatics 12 (2001) 54–62.
19. Smith, V., Jarvis, E., Hartemink, A.: Evaluating functional network inference using simulations of complex biological systems. Bioinformatics 18 (2002) S216–S224.
20. Ong, I., Glasner, J., Page, D.: Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics 18 (2002) S241–S248.
DNA-Tract Curvature Profile Reconstruction: A Fragment Flipping Algorithm Daniele Masotti Dipartimento di Informatica Elettronica e Sistemistica Facolt`a di Ingegneria Universit`a di Bologna, Viale Risorgimento 2, Bologna, Italy
Abstract. At a nanometric level of resolution, DNA molecules can be idealized as a one-dimensional curved line. The curvature along this line is composed of static and dynamic contributions. The former constitute the intrinsic curvature, a vectorial function of the sequence of DNA nucleotides, while the latter, caused by thermal energy, constitute the flexibility. The analysis of intrinsic curvature is a central focus in several biochemical DNA research areas. Unfortunately, observing this sequence-driven chain curvature is a difficult task, because the shape of the molecule is largely affected by thermal energy, i.e. by the flexibility. A recent approach to this problem shows a possible methodology to map the intrinsic curvature along the DNA chain by observing an Atomic Force Microscopy image of a population of the DNA molecule under study. Reconstructing the intrinsic curvature profile requires a computational method to exclude the entropic contributions from the imaged molecule profiles and to detect the fragment orientations on the image. The heuristic-search algorithm we propose can be a solution for these two tasks.
1
Introduction
The most stable conformation of a DNA molecule in solution is a dimer, formed by the association of two single DNA strands. At a nanometric level of resolution this molecule can be idealized as a one-dimensional curved line, in which the curvature values are affected by dynamic contributions, i.e. the flexibility, and by the structural inhomogeneity of the nucleotidic bases along the chain, i.e. the intrinsic curvature. Attempts to separate the intrinsic contributions from the dynamic ones have been made only on particular molecular structures, while the problem is still open for natural molecules. Atomic Force Microscopy can visualize a population of DNA molecules adsorbed on a substratum. Using DNA fragments that share the same nucleotidic sequence, it is possible to image a collection of molecular profiles and to obtain intrinsic values by averaging over the resulting profile population, in order to exclude the dynamic contributions. This task requires recognizing the orientations of the molecules on the image and matching the base sequence with the measured profile for each molecule. The correct profile orientation is hardly recognizable due to the strong noise introduced by the flexibility, and an exhaustive search in the configuration space is computationally too expensive, so we have defined a new heuristic to limit the search space and to exclude the noise in the evaluation of the configurations.
The experimental results show that the proposed method, with a low computational complexity, can get correct spatial orientations of the molecules in the image to allow the mapping of the intrinsic curvature profile along the chain.
2 Background
2.1 Generality on DNA
DNA is a polymer, constituted by an ordered sequence of nucleotides (or nucleotidic bases). The classical double-helix structure (dsDNA), in which two strands are wound around each other, is energetically the more stable conformation in solution, and therefore the favored one. At a microscopic level of resolution, the dsDNA molecule can be idealized as a long filament, whose spatial conformation can be described through the global helical axis. The dsDNA 3D structure depends on many factors, such as base composition and environmental conditions. The DNA sequence contains subtle information on local variations that can become collectively pronounced over large spatial scales. Sequence-dependent variations are the result of the chemical and stereochemical inhomogeneity of the sequence. These structural deviations lead to static and dynamic contributions. The former are mapped into the static curvature of the central axis, i.e. the intrinsic curvature, while the latter into the deformability around that structure, i.e. the flexibility. The classical models often used to describe the entropic elasticity of long polymer molecules are the Freely Jointed Chain (FJC) [14] and the Worm-Like Chain (WLC) [15]. These models consider the DNA strand as homogeneous along the chain, neglecting the structural peculiarities caused by the particular nucleotidic sequence, and thus do not provide the possibility to study the intrinsic curvature. Curvature in dsDNA regions was originally believed to be an intrinsic attribute of only certain short DNA sequences (named A-tracts) [4,5]. More recently, sophisticated models [6,7] can successfully predict DNA curvature under appropriate structural assumptions [8,9]. While for practical purposes it is not crucial which one is correct, elucidating the real origin of DNA bending is a fundamental issue and remains one of the most important tasks of structural biology. Attempts to characterize and separate the effects of static curvature from those of the flexibility have thus far been made only on peculiar DNA constructs with anomalous flexibility [10,11,12]. The problem is still open for a "natural" dsDNA of an arbitrary sequence, as very recently pointed out also by Crothers and coworkers [13].
2.2
DNA Curvature Models
Curvature along a curved line is the first derivative C = dt/dl of the unit tangent t with respect to the distance l along the line (or w.r.t. the base number n of the nucleotide sequence). In our case, it is a vectorial function of the nucleotide sequence and represents the angular deviation of the central backbone (helical axis) between two consecutive base pairs. Without considering environmental perturbations, this term is a function of the sequence only, and it is called the intrinsic curvature C0.
Under thermal perturbations, we also have to consider the contribution of fluctuations. The observed curvature can thus be described as C(n) = C0(n) + f(n), where f(n) is the fluctuation (caused by thermal energy). Due to the relatively high rigidity of DNA, the fluctuations are considered to follow first-order elasticity [2], so their average value vanishes at every sequence position. Therefore, given a statistically significant population of molecules and averaging along the chain, ⟨C(n)⟩ = ⟨C0(n)⟩ + ⟨f(n)⟩ = C0(n) (brackets denote averaging over the molecule population), we obtain the intrinsic curvature value at position n.
2.3 Atomic Force Microscopy and Imaging
Atomic force microscopy (AFM) is a relatively new structural tool of enormous potential importance to structural biology [1]. It works by scanning a fine ceramic or semiconductor tip, a couple of microns long and often less than 100 Å in diameter, over a surface. The tip is located at the free end of a cantilever that is 100 to 200 µm long. Forces between the tip and the sample surface cause the cantilever to bend, or deflect. A detector measures the cantilever deflection as the tip is scanned over the sample, or the sample is scanned under the tip. The measured cantilever deflections allow a map of the surface topography to be generated. In our case DNA molecules have been deposited on a mica substratum, a well-ordered crystalline material from which flat and clean surfaces can be obtained easily, chemically treated to promote DNA adsorption. Imaging was performed and processed with an image-processing software for background noise removal.
3
Image Processing
Using ALEX [16], a processing software for AFM DNA images, molecular profiles have been measured by a semiautomatic method for tracking the molecule contours. From the resulting data set, molecules whose contour length differed by more than 6% from the expected length were left out, in order to remove uninteresting fragments and other molecules. To obtain curvature samples, the molecule profiles were smoothed and fitted to a variable-degree polynomial curve that ensures a squared error smaller than a chosen threshold. The segmental chains were standardized for their length, obtaining v equivalent segments per chain, and the curvature samples were obtained using the vector product of nearest-neighbor chain-oriented segments. The proposed algorithm starts from a recent publication [3] which used averaging over a molecule population to map intrinsic curvature values along a known palindromic DNA fragment. In palindromic molecules the sequence reads the same from either end, so no uncertainty on the sequence orientation can exist. The resulting curvature profile, considered in its modulus, therefore does not require discriminating the molecular orientation on the image. The obtained results (compared with theoretical predictions) have proven the validity of this method, allowing a mapping of both the intrinsic curvature modulus and the flexibility along the considered DNA fragment.
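The population averaging underlying this approach can be illustrated with a very small NumPy sketch (an assumption-laden illustration, not part of the original paper): given a matrix of curvature samples whose rows are molecules already brought into a common orientation, the intrinsic curvature and the flexibility follow from column-wise statistics.

```python
import numpy as np

def curvature_statistics(C):
    """C: (molecules x positions) matrix of curvature samples,
    assumed already aligned in orientation."""
    intrinsic = C.mean(axis=0)     # <C(n)> = C0(n): fluctuations average out
    flexibility = C.std(axis=0)    # spread around C0(n) due to thermal noise
    return intrinsic, flexibility
```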
4 Fragment Flipping Algorithm
4.1 Shapes Considerations
The observation that the DNA molecule profile can be viewed in four different fashions, with respect to how the DNA has been adsorbed on the surface and to the direction of sampling, leads to the identification of four different configurations and curvature profiles that can be measured by AFM microscopy. It is possible to distinguish between the above four different molecule configurations by observing the order (L-R) or the sign (U-D) of their respective curvature values (see Fig. 1A; the arrows show the direction of sampling). When we consider real data, the curvature profile is not so clear, and single-pair series comparisons do not yield meaningful results. The configurations on opposite diagonals in Fig. 1A share the same face of the molecular plane, i.e. they are adsorbed on the same face, and thus they should have exactly identical curvature values. Molecules adsorbed on the opposite face, instead, could have different features, due to the chemical inhomogeneity of the two faces. The thermal perturbation can be seen as a strong noise source that can completely deform the original signal profile and prevent the recognition of particular patterns or the definition of effective similarity measures between two series of values. In order to avoid comparisons between single sequences, a measure of the total state has been defined that does not rely on the similarity between two molecules at a time, but indicates global optimality.
4.2
Curvature-Matrix
Suppose we detect n usable molecular profiles in the AFM image, and impose a number v of equal sampling intervals along every fragment. We can then define the curvature matrix C (n × v), in which every row contains the v curvature samples of one of the n molecules. The observations of the previous section lead to the definition of two elementary operators on row r of matrix C, named OPLR (r, C) and OPUD (r, C). For the first operator we define (t+1)
C (t+1) = OPLR (r, C (t) ) = [cij
(t+1)
where cij
=
]
(1)
(t) cij , i = 0..(n − 1) i = r j = 0..(v − 1) c(t)
r(v−j)
, i = r j = 0..(v − 1)
while for the second one, (t+1)
C (t+1) = OPU D (r, C (t) ) = [cij
(t+1)
where cij
=
] (t) cij , i = 0..(n − 1) i = r j = 0..(v − 1) −c(t) , i = r j = 0..(v − 1) rj)
(2)
DNA-Tract Curvature Profile Reconstruction: A Fragment Flipping Algorithm
279
Fig. 1. A) The four shapes (Left-Upper Right-Upper Left-Down Right-Down) and curvature profiles of an idealized molecule. B) Transition to optimal configuration (with columns variance values and orientation vectors)
in which (t) and (t + 1) mean the transition from one configuration to the successive one of matrix C . In particular the former one corresponds to reversing elements sequence (transition Left ←→ Right Shapes) and the latter one to inverting elements sign (transition Up ←→ Down Shapes) for row r. Tracing the transformations applied to every row, using two n-sized bit vectors (named LR and U D in fig. 1B), it is possible to get a classification of the relative molecular orientations on the image. When the optimal state has been reached, in fact, the two vectors indicate the molecules disposition respect to the two possible degrees of freedom (U ↔ D and L ↔ R) for the whole data-set. But how can we estimate the optimal C state? Without considering thermal fluctuation, the optimal configurations are that ones in which we can observe the mimimal value of curvature variance for each point, i.e. the minimal column variance in matrix C ( see fig. 1B). Also considering thermal noise effects, the optimal state column variance won’t be zero, but equal to the square of the flexibility f in that point, however least respect the other possible states. Following this consideration, the metric chosed in order to define the state optimalv−1 ity is mean value of column variances, M (t) = v1 j=0 σj2 where σj2 is the j column variance. Thus, let {C (y) } be the space of possible configurations, the optimum can be defined by M (0) = min{C (y) } M (y) . From state M (0) we can easily calculate intrinsic curvature and flexibility, respectively averaging column values and computing column standard deviations. Extensive search, in real data problem, has an excessive computational cost. Due the four possible forms, the search space has dimension equal to 4n , where n is rows number, hence we need a heuristic search approach.
280
4.3
D. Masotti
Heuristics
In order to reach local optimum, we can use a simply optimization approach. At every algorithm step, the objective function associate to the actual state, can be calculated in linear time respect to the columns number v. Applying the OPLR operator to row r ∆M = M (t+1) − M (t) , the objective function variation caused by transition C (t) → C (t+1) , can be expressed by v−1
(t) −2 (t) (t) (t) [(c − crj )(ncj − crj )] vn(n − 1) j=0 r(v−j)
∆MLR =
(3)
(t)
where cj is the mean on the jth column of the element of C (t) that can be updated (t+1)
from C (t) to C (t+1) with cj For the OPU D operator
(t)
= cj +
(t)
(t)
cr(v−j) −crj n
.
v−1
∆MU D = (t+1)
(t)
−2 (t) (t) (t) [−2crj (ncj − crj )] vn(n − 1) j=0
(4)
(t)
−2c
= cj + nrj . with cj To reach optimal state we can allow transition that lead to ∆M < 0, without computing M value for each considered state. The search-tree algorithm implementation can be put into practice in different ways, that influence the execution-time and the computational efficacy. Maximum Decrement Transition. Starting from C (t) and appraising all the possible successive states, we choose the transition that leads to the minimum M (t+1) value, i.e. the minimum ∆M . This allows a faster convergence rate toward the optimal state, but imposes a greater computational complexity every step. Greedy Implementation. Starting from C(t) we choose the first examined transition that leads to ∆M < 0. Using this approach each step is quickly performed, while the steps number increases respect previous implementation. Simulated Annealing. With high thermal enery contributions local optimisation heuristics may lead to poor local solution, stopping on an undesired local optimum. To avoid this, we implemented a simulated annealing heuristic search-tree. According to the simulated annealing theory, the moves that lead to a ∆M > 0 are also allowed, but with a probability decreasing with the amount of ∆M and with time ( means algorithm steps S ). The decreasing probability is described by |∆M |S
P (∆M ) = e− T = where T has been fixed to good results after various attempts.
n−1
v−1
i=0
j=0
n2 v
c2ij
value that has given
Fig. 2. A) The experimental results at different noise levels. B) The two cluster averaged curvature profiles compared with theoretical profile of curvature module.
5
Experimental Results
The three different algorithm implementations have been tested on randomly generated data to verify the correctness and efficacy of the method. First a curvature profile was randomly chosen; then a large number of simulated molecules were generated, randomly flipped along the two dimensions (L ↔ R and U ↔ D), and uniformly distributed noise was added. Figure 2A reports the percentage of correct simulated fragment orientations identified by the three algorithm versions, the mean number of performed transitions (T) and the mean number of examined states (S) over the various simulations at different noise levels. Noise is indicated as the ratio between the maximum absolute value of the noise distribution and the maximum curvature value of the employed profile. For every simulation a curvature matrix C(100, 100) was used. The implementation with simulated annealing visibly improves the correct recognition of molecule orientations at high noise levels, at the price of increased computational complexity. The other two implementations have equivalent recognition efficacy, but the greedy implementation is (obviously) the fastest. The method has also been tested on a well-known DNA-tract image. The fragment was selected from the DNA of a prokaryotic cell (Crithidia) and is characterized by a highly curved region near the middle of the sequence, which favors molecular planarity. From a population of 271 fragments, with 230 curvature values sampled per fragment, two different clusters of molecules have been detected, according to the face exposed to the adsorbing surface. The first cluster (in our case composed of shapes UL and DR) corresponds to the more frequent adsorption modality (208 of 271), while the second one (63 of 271) is composed of the less frequent modality (shapes DL and UR).
The plot in Fig. 2B shows the reconstructed intrinsic curvature profiles of the two clusters compared with the theoretical profile of the curvature modulus. The red plot (the larger cluster) is considerably similar to the theory, especially in the peak region, while the green one is mostly dissimilar. This could be caused by a statistically insufficient number of molecules in the cluster, or by inadequate planarity of one face of the molecule. The proposed method seems to be a promising approach for detecting the spatial orientation of DNA fragments in AFM images. It can be useful for correlating sequences with imaged molecular shapes, without using intrusive markers that can invalidate the measures, provided that a statistically significant number of molecules is available. The obtained results show that highly curved regions can be mapped along the chain and that a good reliability in identifying the correct spatial orientation can be reached, even if the limitations due to the required molecular planarity impose restraints on the usable fragments, and further improvements are needed to generalize this work.
References
1. Bustamante, C., Keller, D.J.: Scanning force microscopy in biology. Physics Today 48 (1995) 32–38
2. Landau, L.D., Lifshitz, E.M.: Theory of Elasticity. Pergamon Press, Oxford, NY (1986)
3. Zuccheri, G., Scipioni, A., Cavaliere, V., Gargiulo, G., De Santis, P., Samori, B.: Mapping the intrinsic curvature and flexibility along the DNA chain. PNAS 98 (2001) 3074–3079
4. Hagerman, P.J.: Annu. Rev. Biochem. 58 (1990) 755–781
5. Wu, H.M., Crothers, D.M.: Nature 308 (1984) 509–513
6. Bolshoy, A., McNamara, P.T., Harrington, R.E., Trifonov, E.N.: Proc. Natl. Acad. Sci. 88 (1991) 2312–2316
7. De Santis, P., Palleschi, A., Savino, M., Scipioni, A.: Biochemistry 29 (1990) 9269–9273
8. Dlakic, M., Park, K., Griffith, J.D., Harvey, S.C., Harrington, R.E.: J. Biol. Chem. 271 (1996) 17911–17919
9. Harvey, S.C., Dlakic, M., Griffith, J.D., Harrington, R.E., Park, K., Sprous, D., Zacharias, W.: J. Biomol. Struct. Dynam. 13 (1995) 301–307
10. Rivetti, C., Walker, C., Bustamante, C.: J. Mol. Biol. 280 (1998) 41–59
11. Grove, A., Galeone, A., Mayol, L., Geiduschek, E.P.: J. Mol. Biol. 260 (1996) 120–125
12. Kahn, J.D., Yun, E., Crothers, D.M.: Nature (London) 368 (1994) 163–166
13. Roychoudhury, M., Sitlani, A., Lapham, J., Crothers, D.M.: Proc. Natl. Acad. Sci. USA 97 (2000) 13608–13613
14. Flory, P.J.: Statistical Mechanics of Chain Molecules. Interscience Publishers, New York (1969)
15. Schellman, J.A.: Flexibility of DNA. Biopolymers 13 (1974) 217–226
16. Young, M., Rivetti, C.: ALEX, a software processor-tool for AFM images in MATLAB (MathWorks, Natick, MA) (http://www.mathworks.com/)
Evolution Map: Modeling State Transition of Typhoon Image Sequences by Spatio-Temporal Clustering Asanobu Kitamoto National Institute of Informatics, Tokyo 101–8430, JAPAN [email protected] http://research.nii.ac.jp/˜kitamoto/
Abstract. The purpose of this paper is to analyze the evolution of typhoon cloud patterns in the spatio-temporal domain using statistical learning models. The basic approach is clustering procedures for extracting hidden states of the typhoon, and we also analyze the temporal dynamics of the typhoon in terms of transitions between hidden states. The clustering procedures include both spatial and spatio-temporal clustering procedures, including K-means clustering, Self-Organizing Maps (SOM), Mixture of Gaussians (MoG) and Generative Topographic Mapping (GTM) combined with Hidden Markov Model (HMM). The result of clustering is visualized on the ”Evolution Map” on which we analyze and visualize the temporal structure of the typhoon cloud patterns. The results show that spatio-temporal clustering procedures outperform spatial clustering procedures in capturing the temporal structures of the evolution of the typhoon.
1
Introduction
The evolution of the typhoon is still a mystery even for meteorologists because of the complexity of the physical processes involved in the formation, development and weakening of the typhoon. The challenge of our project is to unveil some of this mystery using informatics-based methods built on a huge collection of satellite typhoon images. In this paper, we deal with a particular aspect of this project, namely the analysis of the evolution of typhoon cloud patterns. Here, clustering procedures are the main tools for extracting meaningful partitions of data from the life cycle of the typhoon. In the spatial domain, we expect clustering procedures to extract prototypical cloud patterns, while in the temporal domain, we expect them to extract characteristic periods of time in the life cycle, such as the developing, mature and weakening stages. The meteorological theory for modeling such stages is still premature, so we try to characterize them from a statistical viewpoint by applying statistical learning methods to a large collection of image data. The result of learning is then visualized on the "Evolution Map," on which we arrange a wide variety of typhoon cloud patterns and analyze the temporal structure of the evolution.
This paper is organized as follows. Section 2 introduces the background and motivation of the problem and our typhoon image collection, which is the basis of our project. Section 3 then addresses research issues and challenges specific to this paper, and Section 4 briefly reviews statistical learning algorithms, in particular spatio-temporal clustering procedures, with their basic results. We then proceed to Section 5 to discuss experimental results, and finally Section 6 concludes the paper.
2
Background and Motivation
2.1
Typhoon and Pattern Recognition
Typhoon analysis and prediction has been one of the most important issues in the meteorology community. At this moment, typhoon analysis still relies on the visual inspection of typhoon cloud patterns on satellite images by human experts. This fact suggests that the complex cloud pattern of the typhoon on satellite images carries information rich enough for making decisions on the intensity of the typhoon. It also indicates that we may be able to formulate this problem as a typical pattern recognition problem with a real-world large-scale application that can be solved by informatics-based methods. Toward this goal, in this paper, we are especially interested in the spatio-temporal modeling of typhoon cloud patterns, which are flexible and change significantly over time. Although the recent development of numerical weather prediction technology has contributed to typhoon prediction, the complexity of the typhoon is still beyond the combination of known mathematical equations, and the realistic simulation of the typhoon is yet to be realized. Hence we concentrate on the pattern recognition of the observed satellite data, from which we try to extract characteristic structures in an inductive way that may lead to the discovery of hidden properties of typhoon cloud patterns.
2.2
2.2 Typhoon Image Collection
Because our approach is a learning-from-past-observations approach, the collection of a large number of data is indispensable for improving the performance. For this purpose, we created a collection of about 41,000 well-framed typhoon images for the northern (30,300) and southern (10,700) hemispheres. Here the term well-framed means that all images are centered and have an equivalent size. For details of the collection and some experimental results, readers are referred to [1]. Here we briefly introduce the image dataset. The images are from the northern hemisphere collection, preprocessed into the form of cloud amount images. We then apply principal component analysis (PCA) for dimensionality reduction, and the final product is a 67-dimensional vector, compared to the original 1024-dimensional vector or the original 512 × 512 typhoon image.1
1 These materials are also available at our Web site http://www.digital-typhoon.org/.
3 Issues and Challenges
3.1 Regime Hypothesis
Meteorological experts have the impression that some atmospheric patterns are observed more frequently than others. Although rigorous validation of this impression is very hard, it raises the hypothesis that there actually are "regimes" (e.g. [2]), or attractors, of atmospheric patterns. In pattern recognition terminology, such regimes roughly correspond to "clusters," and this is the reason we apply clustering procedures to obtain a set of typical typhoon cloud patterns. For clustering, we first represent an instance of a typhoon cloud pattern as a point in a data space (feature space). The life cycle of a typhoon sequence is then represented as a continuous trajectory in that space. Next we apply clustering procedures to those trajectories to obtain the prototypical patterns and sequences of the typhoon. Once those clusters are regarded as hidden states, the temporal dynamics of typhoon cloud patterns can be studied as transitions between hidden states. This naturally leads to the characterization of typical state transitions, but the more interesting discovery is that of anomalous state transitions, because they often indicate unusual changes (e.g. rapid development) which may be related to severe natural disasters.
3.2 Learning of the Manifold
As a result of clustering we obtain the centers of the clusters, but this does not give us any information about the spatial relationships between clusters, which are hard to imagine in a high dimensional feature space. In that case, a useful tool is a (non-)linear projection method that maps each cluster in the high dimensional space onto a point in a lower dimensional space (e.g. two dimensions) while maximally preserving the spatial relationships of the original feature space. Another useful approach is to learn a manifold in the feature space so that it fits the data distribution well. Such a technique can be combined with clustering procedures to learn the clusters and the manifold simultaneously. If the manifold is chosen appropriately, it may correspond to typical paths of the trajectory of the typhoon, or the preferred course of change in terms of the hidden states of the typhoon. Hence, for modeling the life cycle of the typhoon, learning the manifold is an interesting research issue.
4 Methods and Results
Based on the above two viewpoints, we categorize various clustering procedures and apply them to the collection of typhoon images.
4.1 K-Means Clustering
K-means clustering is one of the most popular iterative descent clustering methods [3]. The dissimilarity measure is usually the squared Euclidean distance, and a two-step procedure, 1) the relocation of data vectors to the closest cluster, and 2) the computation of the cluster mean over the data vectors that belong to the cluster, leads to a suboptimal partition of the data vectors. As a basic method, it simply performs clustering without any temporal model or manifold learning. Fig. 1 shows the result of K-means clustering, where each image is the one closest to the center of the respective cluster. Although K-means clustering does not have a built-in mechanism for ordering the clusters, we can apply Sammon's mapping [4] (or multi-dimensional scaling) to obtain a roughly ordered visualization of the clusters in a two-dimensional space.

Fig. 1. Clustering of typhoon images by K-means clustering and SOM. Note the strong effect of topological ordering in the SOM.
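As an illustration of the two-step procedure above, the following sketch clusters the PCA-reduced cloud-amount vectors of Section 2.2 with scikit-learn and picks the image closest to each center as the cluster representative. The cluster count, variable names and use of scikit-learn are illustrative assumptions, not details taken from the paper.

import numpy as np
from sklearn.cluster import KMeans

def cluster_typhoon_vectors(X, n_clusters=25, seed=0):
    """X: array of shape (n_images, 67) of PCA feature vectors (assumed layout)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X)      # alternates relocation and mean updates until convergence
    representatives = []
    for k in range(n_clusters):
        members = np.where(labels == k)[0]
        dist = np.linalg.norm(X[members] - km.cluster_centers_[k], axis=1)
        representatives.append(members[np.argmin(dist)])  # image closest to the center
    return labels, km.cluster_centers_, representatives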
4.2 Self-Organizing Maps (SOM)
Self-Organizing Maps (SOM) [5] can be viewed as a constrained version of K-means clustering in which the prototypes are encouraged to lie on a two-dimensional manifold in the feature space [3]. Hence it has a mechanism for learning the manifold. However, because of the lack of a probabilistic framework, the integration of temporal models into SOM is not straightforward, in spite of some effort toward temporal versions of SOM [5]. As Fig. 1 illustrates, we exploit two types of manifolds: a hyperplane and a toroid. The former is the standard manifold and is usually a good choice if the data vectors are distributed on a two-dimensional manifold. With the latter manifold, on the other hand, we can get rid of the "edge effect," in which the edges of the manifold become artifacts that do not exist in the feature space.
4.3 Mixture of Gaussians (MoG)
To achieve greater extensibility of the model, we introduce probability models into the clustering procedure. We begin with a (finite) mixture density model, in particular the Mixture of Gaussians (MoG) model, where the PDF of each cluster is a multivariate Gaussian distribution. In this paper we pursue a particular form of the MoG model, namely the PDF of cluster i is represented by pi(x) = N(µi, Σd),
where Σd is a diagonal covariance matrix common to all clusters.2 The estimation of these parameters is essentially equivalent to a mixture density estimation problem, and the typical learning algorithm is the EM (expectation-maximization) algorithm.
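A minimal sketch of this estimation step is given below using scikit-learn's EM implementation. Note that the paper shares one diagonal covariance matrix across all clusters, whereas covariance_type="diag" fits a separate diagonal covariance per component, so this is only an approximation of that constraint; the cluster count and variable names are assumptions.

from sklearn.mixture import GaussianMixture

def fit_mog(X, n_components=25, seed=0):
    """Fit a Mixture of Gaussians with diagonal covariances by the EM algorithm."""
    gm = GaussianMixture(n_components=n_components,
                         covariance_type="diag",
                         random_state=seed)
    gm.fit(X)                                   # EM iterations
    return gm.means_, gm.predict(X), gm.predict_proba(X)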
4.4 MoG-HMM
The MoG model itself does not have a built-in mechanism for temporal modeling, but we can combine the MoG model with the Hidden Markov Model (HMM) [6] under a probabilistic framework. Because the PDFs of the MoG components overlap each other, the actual state transitions cannot be reconstructed uniquely from the observation sequence. Hence we regard each cluster as a hidden state and estimate the temporal dynamics of the observation sequence as state transitions in the HMM. In the MoG-HMM model, the states of the HMM are the Gaussian components of the MoG model, with emission probabilities given by the corresponding Gaussians. The parameters of the Gaussians and the HMM are simultaneously optimized by the EM algorithm (for the HMM this is also called the forward-backward algorithm), and the most likely state sequence can then be reconstructed using the Viterbi algorithm. None of these MoG models has a built-in mechanism for ordering the clusters, so we again use Sammon's mapping or multidimensional scaling to obtain an ordered visualization of the clusters.
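The following sketch shows the Viterbi decoding step in isolation: given per-frame log emission probabilities under each Gaussian component, a log transition matrix and a log initial distribution (all assumed to have been estimated by EM), it recovers the most probable hidden-state sequence. This is a generic textbook implementation, not the author's code.

import numpy as np

def viterbi(log_emis, log_A, log_pi):
    """log_emis: (T, K) log p(x_t | state k); log_A: (K, K); log_pi: (K,)."""
    T, K = log_emis.shape
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_emis[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A          # (K, K): previous state x next state
        psi[t] = np.argmax(scores, axis=0)              # best predecessor for each state
        delta[t] = scores[psi[t], np.arange(K)] + log_emis[t]
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):                      # backtrack
        states[t] = psi[t + 1, states[t + 1]]
    return states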
4.5 Generative Topographic Mapping (GTM)
Generative Topographic Mapping (GTM) [7,8] is effectively a constrained MoG model in which the centers of the Gaussians are related through a function. It has a built-in mechanism for learning the manifold, so this method is a kind of probabilistic formulation of the SOM in a more principled framework. We exploit two types of basis functions: radial basis functions and the von Mises distribution [9],

φ(x) = (1 / (2π I0(b))) exp[b cos(x − µ)],   0 < x ≤ 2π,   (1)
where I0(b) is the modified Bessel function of the first kind of order 0, and b > 0 is the scale parameter. The latter basis function corresponds to the toroidal manifold of the SOM. Fig. 2 illustrates some of the results of GTM.
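A direct transcription of the von Mises basis function of Eq. (1) is sketched below; scipy.special.i0 provides the modified Bessel function I0, and the function name is an assumption for illustration.

import numpy as np
from scipy.special import i0

def von_mises_basis(x, mu, b):
    """phi(x) = exp(b * cos(x - mu)) / (2 * pi * I0(b)),  0 < x <= 2*pi."""
    return np.exp(b * np.cos(x - mu)) / (2.0 * np.pi * i0(b))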
4.6 GTM-HMM
Because of the probabilistic framework of the GTM, we can also combine GTM with the HMM to derive the temporal GTM [7]. In this case, all of the learning can be done using the forward-backward (EM) algorithm. This method can be used as a spatio-temporal clustering procedure, and it also has a mechanism for learning the manifold. Nevertheless, GTM-HMM suffers from a high computational complexity in its matrix computations, especially when dealing with a large dataset.
2 We do not use a full covariance matrix due to the lack of observation data compared to the high dimensionality of the feature space.
Fig. 2. (a) The clustering of typhoon images by the planar GTM. (b) Magnification factor on the latent space of GTM. (c) Trajectory of Typhoon 199810 visualized on the latent space of GTM.
Fig. 3. The distribution of the typhoon age for each cluster.
5 Discussion
5.1 Clustering and Typhoon Life Cycle
In Section 3 we referred to the regime hypothesis, which suggests that the clustering of typhoon cloud patterns is actually meaningful. In the previous section we applied several kinds of clustering procedures and obtained a two-dimensional visualization of the variety of cloud patterns. However, beyond visualizing such cloud patterns, we also want to perform some analysis on these clusters. Hence we view the clusters arranged on a two-dimensional space as the "Evolution Map" of the typhoon, and visualize various aspects of the typhoon on the map to get an intuitive understanding of the relationships in the data. For example, an interesting question is whether the clustering results can extract relevant information on the life cycle of the typhoon. To this end, we illustrate in Fig. 3 the list of representative images and the distribution of age for each
cluster. Here age is computed as a linearly normalized age in [0,1] between the start and the end of the life cycle. Some clusters are filled with young images, while the members of other clusters are distributed throughout the life cycle. The idea is to have separate sets of clusters for young and old typhoons and to characterize each cluster by the type of its member typhoons. The difficult part of this clustering is that, since the feature space is high dimensional, most of the data have similar distances to the centers of multiple clusters, and dissimilar data sometimes contaminate one of those clusters by chance. To solve this problem, we need to incorporate temporal models into the clustering procedures, because with additional information on state transitions the cluster assignment of the data becomes more reasonable, taking the history of the cloud patterns into account.
Fig. 4. State transitions of typhoon sequences. Clusters are arranged on a two-dimensional grid, and each state transition is represented as a line between clusters.
5.2 Manifolds and State Transitions
We next study the state transitions on the different evolution maps obtained from the different clustering procedures. Fig. 4 illustrates the state transitions as black lines, where the clusters are arranged on a two-dimensional grid. The state transitions on the SOM evolution map show an ordered pattern: vertical and horizontal transitions are prevalent. On the contrary, the MoG evolution map shows a smoothed pattern, which suggests that the cluster centers are distributed over the space and there is no preferred direction for state transitions. Another interesting observation is that on the MoG-HMM evolution map we can observe a pattern of larger scale. This may be the effect of the temporal model (HMM) used in this clustering procedure, and it may indicate some preferred direction of change underlying typhoon cloud patterns. However, the meteorological study of this structure is left for future work.
6 Conclusion
This paper introduced some preliminary results on the application of various spatio-temporal clustering procedures to a large collection of typhoon images. The regime hypothesis partly justifies the usage of clustering procedures, and more powerful clustering procedures are required for extracting relevant structures from complex typhoon cloud patterns. Future work includes hierarchical clustering procedures, or the hierarchical combination of non-hierarchical clusterings. From the results above, it is clear that fitting a single manifold to the global data distribution does not work well, and the combination of local clustering procedures into a global clustering hierarchy is an interesting direction of research.
References
1. A. Kitamoto. Spatio-temporal data mining for typhoon image collection. Journal of Intelligent Information Systems, 19(1):25-41, 2002.
2. M. Kimoto and M. Ghil. Multiple flow regimes in the northern hemisphere winter. Part I: Methodology and hemispheric regimes. Journal of the Atmospheric Sciences, 50(16):2625-2643, 1993.
3. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
4. J.W. Sammon, Jr. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, 1969.
5. T. Kohonen. Self-Organizing Maps. Springer, second edition, 1997.
6. L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989.
7. C.M. Bishop, G.E. Hinton, and I.G.D. Strachen. GTM through time. Technical Report NCRG/97/005, Neural Computing Research Group, Aston University, 1997.
8. C.M. Bishop, M. Svensén, and C.K.I. Williams. GTM: The generative topographic mapping. Neural Computation, 10:215-234, 1998.
9. M. Evans, N. Hastings, and B. Peacock. Statistical Distributions. John Wiley & Sons, Inc., third edition, 2000.
Structure-Sweetness Relationships of Aspartame Derivatives by GUHA

Jaroslava Halova, Premysl Zak, Pavel Stopka, Tomoaki Yuzuri, Yukino Abe, Kazuhisa Sakakibara, Hiroko Suezawa, and Minoru Hirota

Academy of Sciences of The Czech Republic, Institute of Inorganic Chemistry, CZ 250 68 Rez near Prague, Czech Republic, [email protected], [email protected]
Institute of Computer Science, Pod vodarenskou vezi 2, CZ 182 07, Prague 8, Czech Republic, [email protected]
Yokohama National University, Faculty of Engineering, Tokiwadai, Hodogayaku, Yokohama 240, Japan, [email protected], [email protected]
Abstract. Structure-sweetness relationships of aspartame derivatives have been established using fingerprint descriptors and the GUHA method. GUHA is the acronym for General Unary Hypotheses Automaton. Glucophoric hypotheses on the reasons for the sweetness of aspartame derivatives were generated. Moreover, new results on the topology of the sweetness receptor site have been found. The results were confirmed both by theoretical studies of other authors and by chemical evidence. The new knowledge obtained can be used for tailoring new aspartame analogues as artificial sweeteners.
1 Data Set

The aspartame-derivative-based sweeteners data set has been studied by regression methods in [1]. The sweetness characteristics, determined by tasting samples of artificial sweeteners, were either cardinal (the logarithm of sweetness potency) or nominal variables. Sweetness potency gives the relative sweetness with respect to the sweetness of sucrose. Structure-sweetness relationships, as a special case of structure-property relationships (SPR), have been studied using the Czech GUHA method. Steric parameters were recalculated with the CATALYST RTM software system of Molecular Simulation [2]. Fingerprint descriptors encoded the structure characteristics. Fingerprint descriptors encode the structures by nominal variables in the same manner as fingerprints are encoded in computational dactyloscopy [3]. Moreover, the octanol-water partition coefficient (cardinal) and optical activity (nominal: L, DL, D, no) were taken into account.
2 Principles of GUHA Method

The basic ideas of the GUHA (General Unary Hypotheses Automaton) method were presented in [4] as early as 1966. The starting notion of the method is an object. Each object has properties expressed by the variables ascribed to it. For example, an object can be a man with properties given by the variables sex, age, color of eyes, etc. In order to perform reasonable knowledge discovery we need a set of objects of the same kind that differ in the values of the variables defined on them. The aim of the GUHA method is to generate hypotheses on relations among the properties of the objects that are in some respect interesting. This generation is performed systematically; the machine generates, in some sense, all possible hypotheses and collects the interesting ones. A hypothesis is generally composed of two parts: the antecedent and the succedent. The antecedent and the succedent are tied together by a generalized quantifier, which describes the relation between them. The antecedents and succedents are propositions on the objects in the sense of classical propositional logic, so they are true or false for a particular object. These propositions can be simple or compound, as in propositional logic. Compound propositions (literals) are usually composed with the conjunction connective. Formulation of these propositions is enabled through categorization of the original variables. Given an antecedent and a succedent, the frequencies of the four possible combinations can be computed and expressed in compressed form as the so-called four-fold table (ff-table). A general ff-table looks like this:
ff-table            Succedent    Non (succedent)
Antecedent              a              b
Non (antecedent)        c              d
Here "a" is the number of objects satisfying both the antecedent and the succedent, "b" is the number of objects satisfying the antecedent but not the succedent, and so on. A generalized quantifier is a decision procedure assigning 1 or 0 to each ff-table. If the value is 1, then we accept the hypothesis with this ff-table; if it is 0, then we do not accept it. The basic Fisher generalized quantifier defined and used in GUHA is given by the Fisher exact test known from mathematical statistics. For each hypothesis, the value of the Fisher statistic given by the values a, b, c, and d of the ff-table is computed. Its value, simply said, describes the measure of association between the antecedent and the succedent: the lower the value of the Fisher quantifier, the better the association. In [5] the information content of rules obtained by the mining procedure is proposed, which suggests a promising improvement of the procedure.
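As a small illustration of the Fisher quantifier, the sketch below evaluates a four-fold table with the Fisher exact test from SciPy and computes the Prob characteristic used later in the paper. The counts are made up for demonstration only.

from scipy.stats import fisher_exact

a, b, c, d = 14, 0, 3, 22                       # hypothetical ff-table entries
odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
prob = a / (a + b)                              # cases fulfilling the hypothesis
                                                # among those fulfilling the antecedent
print(prob, p_value)                            # a low p-value means a strong association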
3 Data Preprocessing

The sweetness data set is given by Iwamura [1]. Iwamura performed correlation analysis of structure-sweetness relationships. He omitted some compounds, but even
then the correlation gave rather poor results. We established the structure-sweetness relationships of 39 aspartame-based sweeteners using the Czech GUHA method working with binary data. Sweetness potency is either cardinal, as the ratio of the sweetness of the tested molecule to the sweetness of sucrose, or the sweetness of some compounds was characterized by nominal variables only (TL, tasteless; B, bitter; S, sweet; NS, not sweet). Preprocessing of the cardinal variables is necessary: we divided each cardinal variable into 2-4 almost equifrequential intervals. The structural data of the aspartame derivatives were transformed into binary strings during GUHA processing. Steric parameters of the sweetener molecules (the maximum dimensions of the molecule and the mean molecular volume, important for the sweetness receptor site topology) were recalculated with the CATALYST RTM software system of Molecular Simulation [2].
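The division of a cardinal variable into almost equifrequential intervals can be sketched as follows with pandas; the bin count and data handling are illustrative assumptions rather than details from the original study.

import pandas as pd

def equifrequential_bins(values, n_bins=3):
    """Cut a cardinal variable (e.g. log sweetness potency) into quantile bins
    holding roughly the same number of compounds each."""
    return pd.qcut(pd.Series(values), q=n_bins, duplicates="drop")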
4 The Results of Data Mining

GUHA is used for generating hypotheses of the following type: "If the car is black and is cheaper than 50,000 crowns, then the owner is a widower older than 50." Most variables were nominal or divisible into natural intervals. Now the task is not only to find hypotheses of the type "a methyl substituent in position R1 and an ethyl substituent in position R2 cause a sweetness potency from xx to yy," since such results can depend substantially on the interval division of the variables. Therefore, we should try to find the variable (or combination of variables) affecting sweetness. Our efforts were divided into four phases. The Fisher quantifier [4] was always used as the lead criterion in the search for hypotheses. The second important criterion was Prob [4] (the number of cases fulfilling the hypothesis divided by the number of cases fulfilling the antecedent), which characterizes hypotheses in terms of an implication. All the hypotheses listed below have relative frequency one (i.e., there is no case in the data for which they are not valid). They are sorted according to increasing Fisher parameter, starting from the best hypothesis (the one with the least Fisher parameter), whose interpretation is given below. The hypotheses generated by GUHA were divided into five groups. (SP in the succedent is the sweetness potency, defined as the logarithm of the ratio of the sweetness of the tested substance to that of sucrose.
E.g. SP ∈ (1,2) means that the relevant substances are 10-100 times sweeter than sucrose; SP 2 means that the relevant substances are less than 100 times sweeter than sucrose.)

I. R1 = Me ∧ minM2 6 ⇒ SP ∈ (1,2) (PROB = 1.0, FISHER = 0.257E-5, number of cases = 14)

This glucophoric hypothesis is in accordance with [6], p. 43: When R1 and R2 (hydrophobic groups) are sufficiently dissimilar in size, the sweetness potency is very high. The R1 methyl substituent is the smallest possible hydrophobic group. The condition of a size limit of the sweetener molecules is supported by [6], p. 45: The receptor site may exist in the form of a deep pocket with critical binding sites deep inside.

II. a) MeanVol 320 ∧ NOT R2bra ⇒ SP 2 (PROB = 1.0, FISHER = 0.00002, number of cases = 24)
b) MeanVol 320 ∧ minM2 ∈ (8; 9> ⇒ SP 2 (PROB = 1.0, FISHER = 0.00005, number of cases = 23)
c) MeanVol 320 ∧ wu2 1.9 ⇒ SP 2 (PROB = 1.0, FISHER = 0.00011, number of cases = 22)
The common features of hypotheses II.a, b, c reflect the well-known fact that rather big molecules of sweeteners must be accommodated by the receptor. Hypothesis II.a is in concordance with [6], p. 45: The activity depends on the size and shape of the amino acid ester carboalkoxy and side-chain substituents. NOT R2bra encodes straight R2 substituents, i.e. their shape. Hypothesis II.b means that there is a deep pocket in the receptor to accommodate the sweetener molecule [6,7]. The chemical interpretation of this hypothesis is the following: the maximum dimensions of the molecules must fit the sweetness receptor site geometry (there is a pocket to be fit by medium sweet molecules). Our results are in full accordance with the results of Brickmann based on the calculations of free energy molecular isosurfaces [7].

III. NOT R1nam is COOMe ∧ minM0 12 ⇒ SP 2 (PROB = 1.0, FISHER = 0.00079, number of cases = 19)

This hypothesis is in accordance with [6], p. 43 (see II.a): The activity depends on the size and shape of the amino acid ester carboalkoxy and side-chain substituents. The minimum dimension of the molecule must not exceed 12 to be accommodated by the receptor cavity.

IV. wu1 1.88 ∧ MaxM0 15 ⇒ SP 2
(PROB = 1.0, FISHER = 0.00079, number of cases = 19)

This hypothesis is in accordance with hypothesis II.b, indicating that there is a pocket in the receptor cavity. The second literal is obvious, because 15 is the upper limit of the lowest interval of the MaxM0 dimension of the molecule.

V. a) R1O = No ⇒ SP 2 (PROB = 1.0, FISHER = 0.00079, number of cases = 19)
b) R1O = No ∧ R2COS = Via C ⇒ SP 2 (PROB = 1.0, FISHER = 0.00139, number of cases = 18)
5 Conclusion Chemical interpretation of the most favorite hypothesis I is the following: If molecules of aspartame derivatives fit the receptor site and the R1 substituent is a methyl group, then they are 10-100 times sweeter than sucrose. Apart from the best hypothesis, several others on the reasons of the sweetness of aspartame derivatives have been generated using GUHA method. Some of them represent new chemical knowledge-new glucophores (e.g. straight R2 substituent or substituent R2 bound via ester oxygen). The other hypotheses are in accordance with chemical evidence [6]. Other results were confirmed by independent studies [7]. GUHA method using fingerprint descriptors is generally applicable beyond the scope of structure-property relationships. The wide applicability of GUHA was proven through the study of aspartame derivatives. Acknowledgement. The authors are highly indebted to Professor Jitka Moravcova, Head of Department of Natural Products, Prague School of Chemical Technology, as a domain expert in sweeteners for her invaluable help and encouragement. The main part of the work has been done at Yokohama National University in Japan. Generous support of Japan Society for Promotion of Science is highly appreciated.
References
1. Iwamura, H.: Structure-Sweetness Relationship of L-Aspartyl Dipeptide Analogues. A Receptor Site Topology. J. Med. Chem. 1981, 24, 572-583
2. CATALYST RTM Manual 2.3 Release, Molecular Simulation Inc, Burlington, MA 1995
3. Halova, J., Strouf, O., Zak, P., Sochorova, A., Uchida, N., Yuzuri, T., Sakakibara, K., Hirota, M.: QSAR of Catechol Analogs Against Malignant Melanoma Using Fingerprint Descriptors. Quant. Struct.-Act. Relat. 17, 37-39 (1998)
4. Chytil, M., Hajek, P., Havel, I.: The GUHA method of automated hypotheses generation. Computing, 293-308, 1966
5. Smyth, P., Goodman, R. M.: An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering 4(4) (1992) 301
6. Sweeteners: Discovery, Molecular Design, and Chemoreception. Walters, D.E., Orthoefer, F.T., and DuBois, G.E., Eds., ACS Symposium Series 450, American Chemical Society, Washington DC 1991
7. Brickmann, J., Schmidt, F., Schilling, B., Jaeger, R.: Localization and Quantification of Hydrophobicity: The Molecular Free Energy Density (MOLFESD) Concept and its Application to the Sweetness Recognition. Invited Lecture I4, Proc. Chemometrics V, Masaryk University Brno 1999, Czech Republic
A Hybrid Approach for Chinese Named Entity Recognition

Xiaoshan Fang and Huanye Sheng

Computer Science & Engineering Department, Shanghai Jiao Tong University, Shanghai 200030, China, [email protected]
Shanghai Jiao Tong University, Shanghai 200030, China, [email protected]
Abstract. Handcrafted rule-based systems attain a high level of performance, but constructing rules is time-consuming work and low-frequency patterns are easily neglected. This paper presents a hybrid approach, combining a machine learning method and a rule-based method, to improve the efficiency of our Chinese NE system. We describe a bootstrapping algorithm that extracts patterns and generates semantic lexicons simultaneously. With the new patterns, 14% more person names are extracted by our system.
1 Introduction
Named entity recognition (NE) is a computational linguistics task in which we seek to identify groups of words in a document as falling into one of eight categories: person, location, organization, date, time, percentage, monetary value, and "none of the above". In the taxonomy of computational linguistics tasks, it falls under the domain of information extraction. Information extraction is the task of extracting specific kinds of information from documents, as opposed to the more general task of "document understanding". Handcrafted rule-based systems for named entity recognition have been developed that attain a very high level of performance. Two such systems are publicly available over the web: SRI's FASTUS and TextPro systems. However, constructing rules is time-consuming work. Moreover, even a skilled rule writer will neglect some rules that do not appear frequently. We use a machine learning approach to automatically extract patterns and terms from a corpus. The combination of machine learning and the handcrafted approach improves the system's efficiency. We take Chinese person names as our experimental example. Chinese is not a segmented language and has no upper and lower case. In addition, almost every word can be part of a first name. Therefore Chinese person names are more difficult to recognize than European names.
Figure 1 shows the architecture of our Chinese named entity system. The pre-processing component does the segmentation and part-of-speech tagging. The online named entity recognition component includes a Finite-State Cascades (FSC) module for NE recognition. In the offline acquisition part, we use a supervised method to extract patterns and named entities. The new patterns can be added to the rule set of the online recognition part to extract more named entities, and the new named entities can be added to the lexicon of the POS component.

Fig. 1. The architecture of the Chinese named entity recognition system

Section 2 deals with the algorithm of our Chinese named entity recognition system. In Section 3 the experimental results are presented. Finally, Section 4 draws conclusions about this approach.
2 Named Entity Recognition Algorithm
2.1 Preprocessing

We use the Modern Chinese Automatic Word Segmentation and POS Tagging System [7] as the preprocessing component in our model.

2.2 Finite State Cascades

We utilize Finite-State Cascades (FSC) as the analysis mechanism for named entity extraction because it is fast and reliable. The basic extraction algorithm is described as follows. Each transduction is defined by a set of patterns. A pattern consists of a category and a regular expression. The regular expression is translated into a finite-state automaton. The union of the pattern automata produces a single, deterministic, finite-state level in which each final state is related to a unique pattern. There are several levels in our FSC; the information extracted at a lower level supports the information extraction at higher levels.

2.3 Pattern Extraction Algorithm

Inspired by Hearst (1992, 1998), our procedure for discovering new patterns through corpus exploration is composed of the following eight steps (a sketch of step 2 is given after the list):

1. Collect the context relations for person names, for instance person name and verb, title and person name, person name and adjective.
2. For each context relation, use the high-occurrence pattern to collect a list of terms. For instance, for the relation of title and person name, with the pattern NN+NR, we extract title terms such as "reporter" or "team player". Here NN+NR is a lexico-syntactic pattern found by a rule writer; NN and NR are POS tags in the corpus. NR is a proper noun; NN includes all nouns except proper nouns and temporal nouns.
3. Validate the terms manually.
4. For each term, retrieve the sentences that contain this term and transform these sentences into lexico-syntactic expressions.
5. Generalize the lexico-syntactic expressions extracted in the last step by clustering similar patterns with an algorithm described in [3].
6. Validate the candidate lexico-syntactic expressions.
7. Use the new patterns to extract more person names.
8. Validate the person names and go to step 3.
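The sketch below illustrates step 2 for the title - person name relation: it scans a POS-tagged corpus for the seed pattern NN+NR and counts the NN terms that precede proper nouns. The corpus representation and function names are assumptions for illustration, not the original implementation.

from collections import Counter

def collect_title_terms(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, pos_tag) pairs."""
    terms = Counter()
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if t1 == "NN" and t2 == "NR":   # seed pattern NN+NR (title followed by proper noun)
                terms[w1] += 1              # candidate title term, e.g. "reporter"
    return terms.most_common()              # ranked list, to be validated manually (step 3)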
3 Experiments and Results
We use the Chinese Penn Treebank, published by the Linguistic Data Consortium (LDC), as the training corpus. Five relations are considered: title and person name, e.g. "reporter Huang Chang-rui" (Chinese Treebank, text 325); person name and verb, e.g. "Shou Ye emphasized" (Chinese Treebank, text 318); adjective and person name, e.g. "American" followed by an American person name (Chinese Treebank, text 314); and person name and conjunction, e.g. "Fu Ming-xia and Chi Bing" (Chinese Treebank, text 325). Location names and organization names used before person names, like "Tai Yuan Steel Company Li Shuang-liang", are also useful clues for person name recognition. Based on the method described in Section 2 and the predefined high-frequency pattern NN NR, we learned four new patterns for the title - person name relation from twenty-five texts in the Chinese Penn Treebank. They are NN NR NR, NN NR NR, NN NR ' NR, and NN NR NR. Using all five patterns we extract 120 person names from these texts; 15 of them are new. These new person names can also be used for person name thesaurus construction. Fig. 2 compares the number of person names extracted by the seed pattern alone with the number extracted using all the patterns.
Fig. 2. Person names extracted by the original observed pattern and with the new patterns
From text 301 to text 325 in the Chinese Treebank we have 105 sentences in total that contain these patterns, and 120 person names in total. The seed pattern NN NR occurs 105 times; the other patterns occur 7, 4, and 4 times. The frequencies of each pattern are shown in the chart above. By using the new patterns, the number of person names extracted from the Chinese Treebank increased by about 14.3%.
4 Conclusion
Chinese named entity recognition is more difficult than that for European languages. A machine learning approach can be used to improve a rule-based system's efficiency. Since the Chinese Penn Treebank is not large enough and annotated Chinese corpora are very rare, we will try a co-training method in our future work.
Acknowledgements. Our work is supported by the project COLLATE at DFKI (the German Research Center for Artificial Intelligence) and its Computational Linguistics Department, and by the project "Research on Information Extraction and Template Generation based Multilingual Information Retrieval", funded by the National Natural Science Foundation of China. We would like to give our thanks to Prof. Uszkoreit, Ms. Fiyu Xu, and Mr. Tianfang Yao for their comments on our work.
References
1. Fei Xia: The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0). October 17, 2000.
2. Andrew Borthwick: A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis (1999), New York University, Department of Computer Science, Courant Institute.
3. Finkelstein-Landau, Michal and Morin, Emmanuel (1999): Extracting Semantic Relationships between Terms: Supervised vs. Unsupervised Methods. In Proceedings of the International Workshop on Ontological Engineering on the Global Information Infrastructure, Dagstuhl Castle, Germany, May 1999, pp. 71-80.
4. Emmanuel Morin, Christian Jacquemin: Projecting Corpus-Based Semantic Links on a Thesaurus. ACL 1999, pages 389-390, University of Maryland, June 20-26, 1999.
5. Marti Hearst: Automated Discovery of WordNet Relations. In WordNet: An Electronic Lexical Database, Christiane Fellbaum (ed.), MIT Press, 1998.
6. Marti Hearst, 1992: Automatic acquisition of hyponyms from large text corpora. In COLING'92, pages 539-545, Nantes.
7. Kaiyin Liu: Chinese Text Segmentation and Part of Speech Tagging. Chinese Business Publishing Company, 2000.
8. Douglas Appelt: Introduction to Information Extraction Technology. http://www.ai.sri.com/~appelt/ie-tutorial/IJCAI99.pdf
9. http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
Extraction of Word Senses from Human Factors in Knowledge Discovery

Yoo-Jin Moon, Minkoo Kim, Youngho Hwang, Pankoo Kim, and Kijoon Choi

Hankuk University of Foreign Studies, 270 Imun-dong Tongdaemun-Gu, Seoul 130-791, Korea, [email protected], [email protected]
Ajou University, 5 San Wonchun-Dong Paldal-Gu, Suwon 442-749, Korea, [email protected]
Honam University and Chosun University, Kwangju 506-741, Korea, [email protected], [email protected]
Abstract. The flood of information sometimes makes it difficult to extract useful knowledge from databases, libraries and the WWW. This paper presents an intelligent method for the extraction of word senses from human factors in knowledge discovery, which utilizes the integrated Korean noun and verb networks through the selectional restriction relations in sentences. Integration of the Korean Noun Networks into the SENKOV (Semantic Networks for Korean Verbs) system will play an important role in both computational linguistic applications and psycholinguistic models of language processing.
1 Introduction

The flood of information sometimes makes it difficult to extract useful knowledge from databases, libraries, the WWW, etc. Extraction of useful knowledge will boost cooperative e-commerce, global information communication, knowledge engineering and intelligent information access. Korean has quite a lot of polysemous words compared to other languages, because Chinese characters are read phonetically. Thus the extraction of word senses in Korean has been one of the most popular research themes. In order to solve this problem, semantic networks for verbs and nouns serve in this paper as knowledge bases for the simulation of human psycholinguistic models. They can also play an important role in both computational linguistic applications and psycholinguistic models of language processing.
There are several kinds of semantic networks for verbs: WordNet, Levin Verb Classes and VerbNet in the U.S.A., German WordNet in Germany, EuroWordNet in Europe, and the Korean Noun Networks and SENKOV (Semantic Networks for Korean Verbs) in Korea. It has been a difficult task to prove that semantic networks have been built with valid hierarchical classes and that the semantic networks work properly for the semantic analysis of sentences. This is because the networks are based on dictionaries, concepts, recognition and heuristic methods [1], [2], [3].
2 Literature Review

Many researchers [4], [5], [6] say that statistics-based methods for WSD (word-sense disambiguation) in NLP are moving toward the integration of more linguistic information into probabilistic models, as indicated by how much the Penn Treebank is moving in the direction of annotating not only surface linguistic structure but predicate-argument structure as well. This makes sense, since the value of a probabilistic model is ultimately constrained by how well its underlying structure matches the underlying structure of the phenomenon it is modeling. [7] suggests a combined method of a collocation-based method and a statistics-based method for WSD in machine translation. The combined method calculates co-occurrence similarity knowledge between words using statistical information from a corpus, and ambiguous verbs are disambiguated using the similarity match when the verb-related nouns do not exactly match the collocations specified in the dictionary. It shows about 88% accuracy for Korean verb translation. [8] classifies the set of relations G between a noun and a verb into five grammatical relations as follows:

G = {sbj, obj, loca, inst, modi}
(1)
And he defines the set of co-occurrence verbs Vg(n) for a noun n as follows, where fg(n, v) is the co-occurrence frequency from the corpus between a noun n and a verb v in the grammatical relation g:

Vg(n) = { v | v is a verb such that fg(n, v) ≥ 1 }, where g ∈ G = { sbj, obj, loca, inst, modi }
(2)
The co-occurrence similarity |Vg(n)| is the sum of the co-occurrence frequencies between a noun n and the verbs v in the grammatical relation g:

|Vg(n)| = Σ_{v ∈ Vg(n)} fg(n, v)
(3)
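A small sketch of Eqs. (2) and (3) follows: it builds Vg(n) and |Vg(n)| from corpus co-occurrence counts. The (noun, relation, verb, frequency) tuple format is an assumed representation of the corpus statistics, not the original data structure.

from collections import defaultdict

def cooccurrence_similarity(triples, noun, relation):
    """triples: iterable of (noun, grammatical relation, verb, frequency) tuples."""
    v_g = defaultdict(int)
    for n, g, v, f in triples:
        if n == noun and g == relation and f >= 1:
            v_g[v] += f
    return dict(v_g), sum(v_g.values())     # V_g(n) and |V_g(n)|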
The set of relations G for the co-occurrence similarity |Vg(n)| may contain relations other than those described above, but this paper utilizes the "sbj" and "obj" relations among the set of relations G for the co-occurrence similarity |Vg(n)|. [7] suggests a concept-based method for Korean verb WSD in machine translation, which is a combination of a collocation-based method and an example-based method. The transfer phase in the machine translation system refers to the idiom
dictionary to find translated English words for Korean verbs, and if it fails, it refers to the collocation dictionary to find them. If that fails, a concept-based verb translation is performed. The concept-based verb translation refers to the collocation dictionary once more to find the conceptually closest sense of the input Korean verb, refers to WordNet to calculate word similarities between the input logical constraints and those in the collocation dictionary, and selects the translated verb sense with the maximum word similarity beyond the specified critical value. It shows about 91% accuracy at the critical value 0.4 when applied to 5th grade student textbooks. The information content of a class [9] is defined in the standard way as the negative log likelihood, log 1/p(c). The simplest way to compute the similarity of two classes using this value would be to find the superclass that maximizes information content, that is, to define a similarity measure as follows:

WS(c1, c2) = max[log 1/p(ci)], where {ci} is the set of classes dominating both c1 and c2, and the similarity is set to zero if that set is empty.
(4)
[7] says that the word similarity from noun A to noun B in WordNet can be calculated by measuring how close the common superordinates of the two nouns A and B are, which can be computed by expression (5) below:

WS(A, B) = (number of common superordinates of A and B) × 2 / (number of superordinates of A and B)
(5)
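Reading the denominator of (5) as the total number of superordinates of A plus those of B (a Dice-style overlap), the measure can be sketched as below; the set representation of hypernym classes is an assumption for illustration.

def word_similarity(superordinates_a, superordinates_b):
    """WS(A, B) in [0, 1]: 2 * |common superordinates| / (|supers of A| + |supers of B|)."""
    common = superordinates_a & superordinates_b
    total = len(superordinates_a) + len(superordinates_b)
    return 0.0 if total == 0 else 2.0 * len(common) / total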
WordNet [10], a semantic network for English nouns, is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adverbs and adjectives are implemented in terms of synonym sets (synsets). Each synset represents one underlying lexical concept. WordNet presently contains about 120,000 word forms. WordNet can be viewed as a semantic network which represents hypernyms of English word senses in the form of ISA hierarchies. WordNet does not systematically classify the top nodes of verbs, over-classifies the verbs into similar concepts, and does not distinguish the intransitive verb from the transitive verb. Levin Verb Classes [2], [11], a semantic network for English verbs, contain various syntactically relevant and semantically coherent English verb classes. It takes a semantic classification structure and incorporates the syntactic relationship into the semantic relationship for verbs. Levin classifies approximately 3,000 verbs into 49 verb classes, and each verb class groups meaningfully related verbs together. However, there is little hierarchical organization compared to the number of classes identified. Semantic networks for Korean nouns have been built as sets of ISA hierarchies, called the Korean Noun Networks (KNN) [1]. The ISA hierarchies consist of nodes and edges; the nodes represent synonym sets of Korean nouns and of the English WordNet, and the edges represent hypernymous relations among nodes. In this paper, the KNN are utilized to automatically extract sets of hyponymous concepts. The SENKOV (Semantic Networks for Korean Verbs) system classifies about 700 Korean verbs into 46 verb classes by meaning [12]. It has been implemented on the basis of the definitions in a Korean dictionary, with top nodes of the Levin verb classes, hierar-
chies of WordNet, and heuristics. It attempts to incorporate the syntactic relation into the semantic relation for Korean verbs, and it distinguishes the intransitive verb from the transitive verb.
3 Integration of Semantic Networks for WSD

Integration of semantic networks will contribute to the semantic analysis of NLP and speech recognition. In this section we simulate human psychological models for WSD and intelligently resolve WSD in machine translation by integrating the noun semantic networks into the verb semantic networks for Korean. Fig. 1 illustrates a part of the Database for Integration of Semantic Networks (DISNet).
9.1 2 6 (hang, stake, run, call)
[POS] : [vt]
[SYN] : [S+V+O+L]
[SUBCAT] : [S - nc 1.2.1 (person, individual, human), nc 1.2.2 (animal, animate being, brute)
            V - hang, stake, run, call
            O - nc 1.3 (object, inanimate object, thing) (Eng. hang + nc 1.3 + prep. + L)
              - (life) (Eng. run + a risk)
              - nc 7.5.3 (money and other possessions, medium of exchange) (Eng. stake + nc 7.5.3)
              - nc 2.3.2.8.11 (telephone, telephony) (Eng. call)
            L - nc 5.6 (location)]
9.1 : SENKOV verb class 9.1; POS : part of speech; SYN : syntactic structure; SUBCAT : subcategorization information; nc : hierarchical class of KNN; S : subject, V : verb, O : object, L : location; vt : a transitive verb; Eng. : English
*) Values of SYN and SUBCAT are collected from corpus.
Fig. 1. A Part of DISNet
Fig. 1 describes a part of DISNet. SENKOV verb class 9.1 contains the verb
“2 6 (hang, stake, run, call).” The verb “2 6” has three slots with the following corresponding values:
POS (part of speech): vt
SYN (syntactic structure): S+V+O+L
SUBCAT (subcategorization): [S - nc 1.2.1 (person, individual, human) … ], where the values of SUBCAT are integrated with the hierarchical classes of the KNN.
That is, the selectional restriction on the subject of the verb “2 6” is 'person, individual, human' (noun class 1.2.1) or 'animal, animate being, brute' (noun class 1.2.2), and that on the object is 'life' or 'object, inanimate object, thing' (noun class 1.3), etc. The values of SUBCAT are collected from corpora [3], [4], [13] and mapped to the KNN. For example, the verb “2 ) 6” (the past form of “2 6” (hang, stake, run, call)) in the Korean sentence “& / . % * 3 ! - 2 ) 6 .” might be translated into the English word "hung" rather than "staked", "ran" or "called".

& / . (S: owner) % * (L: on the wall) 3 ! - (O: picture) 2 ) 6 (V: ?).
The predicate of the sentence can be translated into one of four English verbs: hang, stake, run, or call. The object of the sentence is "picture", which belongs to noun class 1.3; this matching of noun classes corresponds to human psycholinguistic models. According to the DISNet entry in Fig. 1, the predicate of the sentence may be translated into the English verb "hang." Thus the above sentence is translated as "The owner hung a picture on the wall." However, the verb “2 ) 6” in the following Korean sentence might be translated into "staked."

1 # 5 4 (S: grandmother) 0 ' * (on the evens) + " , - (O: 50,000 won) 2 ) 6 (V: ?).
The predicate of this sentence, too, can be translated into one of the four English verbs hang, stake, run, or call. The object of the sentence is "50,000 won",4 which belongs to noun class 7.5.3. According to the DISNet entry in Fig. 1, the predicate of the sentence may be translated into the English verb "stake," and the above sentence is translated as "A grandmother staked 50,000 won on the evens." In this paper, the intelligent method of WSD described above, utilizing DISNet, is called the Psycholinguistic Method, because it simulates the way human beings resolve WSD.
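The way a DISNet entry drives the choice between the English senses in the two examples above can be illustrated with a hypothetical lookup table keyed by the KNN class of the object; the dictionary layout and function are assumptions for illustration, and only the class-to-verb pairs come from Fig. 1.

DISNET_OBJECT_RESTRICTIONS = {          # SUBCAT object slots of SENKOV class 9.1 (Fig. 1)
    "nc 1.3": "hang",                   # object, inanimate object, thing
    "life": "run",                      # run a risk
    "nc 7.5.3": "stake",                # money and other possessions
    "nc 2.3.2.8.11": "call",            # telephone, telephony
}

def translate_by_object_class(object_noun_class, default="hang"):
    return DISNET_OBJECT_RESTRICTIONS.get(object_noun_class, default)

# "picture" belongs to nc 1.3 -> "hang"; "50,000 won" belongs to nc 7.5.3 -> "stake"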
4 Unit of Korean currency.
4 Algorithm from Human Factors in Knowledge Discovery

The Psycholinguistic Method for WSD suggested in this paper utilizes DISNet, a collocation dictionary for bilingual translation, the KNN, word similarities, co-occurrence similarities, etc. The algorithm of the Psycholinguistic Method is as follows (a condensed sketch is given after the list):

1. There is a parsed input sentence which contains an ambiguous verb (AV).
2. The algorithm refers to DISNet.
3. It tries to match the predicate-argument structure of AV in the input to that of AV in DISNet.
4. If it succeeds, then return the translated word of AV from DISNet.
5. Otherwise, it tries to match the predicate-argument structure of AV in the input to the hyponymous predicate-argument structure of AV in DISNet.
6. If it succeeds, then return the translated word of AV from DISNet.
7. Otherwise, it refers to the KNN to calculate word similarities, in sequence, between the logical constraint of AV and that of the collocation list. It selects the translated word of AV with the maximum word similarity beyond the critical value 4.0 [7].
8. It refers to statistical information to calculate co-occurrence similarities, in sequence, between the logical constraint of AV and that of the collocation list. It selects the translated word of AV with the maximum co-occurrence similarity beyond the critical value [8].
9. If the results of stage 7 and stage 8 are the same, return the selected word.
10. If the result of stage 7 is not null, return the selected word of stage 7.
11. If the result of stage 8 is not null, return the selected word of stage 8.
12. Return the default translated word of AV.

The logical constraint of the input verb means the object of the Korean input if the logical constraints in the collocation dictionary belong to an object; otherwise, it means the subject of the Korean input. Likewise, the logical constraint of the collocation list means the Korean object or subject of the entry in the collocation list corresponding to the input verb. Stages 2-6 of the Psycholinguistic Method simulate the way human beings resolve WSD. Humans generally consider the predicate-argument structure of AV to disambiguate AV in a sentence, which is simulated in stages 3 and 4. If stages 3 and 4 succeed, the algorithm selects the translated word of AV. If stages 3 and 4 fail, stage 5 considers hyponymous values of the predicate-argument structure of AV, as human beings do. If stages 2-6 do not disambiguate AV in the input sentence, stages 7 and 8 perform concept-based WSD and statistics-based WSD, respectively. Stages 9-11 compare the results of stages 7 and 8 and return the translated word of AV. If stages 9-11 do not find a proper result, stage 12 returns the default translated word of AV.
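A condensed, hypothetical rendering of the twelve stages is sketched below; the helper objects (disnet, knn, stats, collocations) stand for the matching, similarity and lookup components described in the text and are not part of the original system.

def disambiguate_verb(parsed_sentence, av, disnet, collocations, knn, stats, threshold):
    translation = disnet.match_structure(av, parsed_sentence)                 # stages 2-4
    if translation is None:
        translation = disnet.match_hyponymous_structure(av, parsed_sentence)  # stages 5-6
    if translation is not None:
        return translation
    concept = knn.best_translation(av, parsed_sentence, collocations, threshold)   # stage 7
    statistical = stats.best_translation(av, parsed_sentence, collocations)        # stage 8
    if concept is not None and concept == statistical:                             # stage 9
        return concept
    if concept is not None:                                                        # stage 10
        return concept
    if statistical is not None:                                                    # stage 11
        return statistical
    return disnet.default_translation(av)                                          # stage 12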
5 Experiments

The Psycholinguistic Method for WSD has been applied to the KEMT (Korean-English Machine Translation) system for the verbs of middle school textbooks. The experiment was performed under the UNIX operating system on a SUN workstation. The size of the middle school textbooks was about 1.97 MB, the number of ambiguous verbs in the textbooks was 14,539, and the average number of meanings per ambiguous verb was about 2.78. While the statistics-based method performs the verb translation with about 70.8% accuracy, the Psycholinguistic Method achieves about 88.2% accuracy, as illustrated in Table 1.

Table 1. Comparison of the methods for verb translation
Methods            Size of Texts    Accuracy of Verb Translation
Statistics-based   1.97 MB          70.8 %
Concept-based      1.97 MB          79.3 %
Psycholinguistic   1.97 MB          88.2 %
The Concept-based Method refers to the KNN to calculate word similarities between the logical constraint of the ambiguous verb and that of the collocation list. In this stage it can span up to the top nodes of the KNN, which does not correspond to the human psycholinguistic model, whereas the Psycholinguistic Method spans the calculation of word similarities only up to the exact superordinate node of the human psycholinguistic model. Therefore, the Psycholinguistic Method performs more accurate verb translation than the concept-based method and the statistics-based method from the psycholinguistic point of view. Inaccurate verb translations happen when DISNet and the collocation dictionary do not contain the verb as one of their entries. As described above, DISNet provides a knowledge base for relatively accurate and efficient WSD. DISNet can also play an important role in both computational linguistic applications and psycholinguistic models of language processing. Applicable areas of DISNet include the disambiguation of nouns and verbs for NLP and machine translation, writing aids, speech recognition, conversation understanding, abridged sentences, human-computer interfaces, the extraction of co-occurrence information and structure information in information retrieval and summarization, and so on.
6 Conclusions This paper utilized the integrated Korean noun and verb networks for extraction of word-senses from human factors in the Korean sentences, through the selectional restriction relations in sentences. Limitation of this paper is that DISNet has been built only for nouns and verbs and we dealt with WSD of nouns and verbs.
Extraction of Word Senses from Human Factors in Knowledge Discovery
309
The presented Psycholinguistic Method spans the calculation of word similarities only up to the exact superordinate node of the human psycholinguistic model. Therefore, the Psycholinguistic Method performs more accurate verb translation than the concept-based method and the statistics-based method, with the addition of the psycholinguistic view. Integration of the KNN into the SENKOV system provides a knowledge base for relatively accurate and efficient WSD. DISNet can also play an important role in both computational linguistic applications and psycholinguistic models of language processing. Future work is to update and extend DISNet to all Korean verbs and to apply it to NLP.
References
1. Moon, Y.: Design and Implementation of Korean Noun WordNet Based on Semantic Word Concepts. Ph.D. Thesis, Seoul National University, Korea (1996)
2. Levin, B.: English Verb Classes and Alternations: A Preliminary Investigation. The MIT Press (1997)
3. Roland, D.: Verb Subcategorization Frequency Differences between Business-News and Balanced Corpora: The Role of Verb Sense. Proc. of the Workshop on Comparing Corpora in ACL-2000, Hong Kong (2000)
4. Gonzalo, J., Chugur, I., Verdejo, F.: Sense Clusters for Information Retrieval: Evidence from SemCor and the EuroWordNet InterLingual Index. Proc. of the SIGLEX Workshop on Word Senses and Multi-linguality in ACL-2000, Hong Kong (2000)
5. Resnik, P.: Selection and Information: A Class-Based Approach to Lexical Relationship. Ph.D. Thesis, Univ. of Pennsylvania (1993) 105-114
6. Yarowsky, D.: Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proc. of COLING-92 (1992) 454-460
7. Moon, Y., Kim, Y.: Concept-Based Verb Translation in the Korean-English Machine Translation System. Journal of Korea Information Science Society, vol. 22, no. 8, Korea (1995) 1166-1173
8. Yang, J.: Co-occurrence Similarity of Nouns for Ambiguity Resolution in Analyzing Korean Language. Ph.D. Thesis, Seoul National University (1995)
9. Pereira, F., Tishby, N., Lee, L.: Distributed Clustering of English Words. Proc. of ACL-93 (1993)
10. Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database. In Five Papers on WordNet, CSL Report, Cognitive Science Laboratory, Princeton University (1993)
11. Levin, B., Hovav, M.: Unaccusativity: At the Syntax-Lexical Semantics Interface. The MIT Press (1996)
12. Moon, Y.: Design and Implementation of SENKOV System and Its Application to the Selectional Restriction. Proc. of the Workshop MAL in NLPRS (1999) 81-84
13. Shin, J., et al.: Verb Classification Utilizing Clustering Techniques. Proc. of Cognitive Science Society (1999)
Event Pattern Discovery from the Stock Market Bulletin
Fang Li, Huanye Sheng, and Dongmo Zhang

Dept. of Computer Science & Engineering, Shanghai Jiao Tong University, 200030 Shanghai, China
[email protected], [email protected], [email protected]
Abstract. Electronic information grows rapidly as the Internet is widely used in our daily life. In order to identify the exact information for a user query, information extraction is widely researched and investigated. The template, which pertains to events or situations and contains slots that denote who did what to whom, when, and where, is predefined by a template builder. Such fixed templates are the main obstacle preventing information extraction systems from moving out of the laboratory. In this paper, a method to automatically discover event patterns in Chinese from a stock market bulletin is introduced. It is based on a tagged corpus and a domain model. The pattern discovery process is kept independent of the domain model by introducing a link table. The table is the connection between the text surface structure and the semantic deep structure represented by the domain model. The method can be easily adapted to other domains by changing the link table.
1 Introduction
A key component of any IE system is a set of extraction patterns or extraction rules that is used to extract from each document the information relevant to a particular extraction task. Writing extraction patterns is a difficult, time-consuming task. Many research efforts have focused on this task, such as SRV [1], RAPIER [2], WHISK [3] and so on. They rely on the surface structure of text and extract single-slot items (e.g. RAPIER and SRV). Our research aim is to extract patterns based on semantic information without being restricted to some fixed domain. As network and Internet technologies develop, much information can be obtained from the Internet; the stock market is no exception, and all kinds of information about stocks, such as initial public offerings and board meetings, are published online. The Shanghai stock market has been chosen for our research; the web site is http://www.bbs.sh.cn. We extract all kinds of short announcements from the bulletin on this web site as the research corpus and obtain the event patterns of such announcements as a result.
In the following, the architecture of the experimental system is described first; then the method is introduced by an example; finally, some results and conclusions are presented.
2 The Architecture
The architecture of the experimental system is shown in Fig. 1: the Event Pattern Discovery System takes the tagged text (extracted from the WWW), an event keyword input, and the domain model, and outputs patterns.

Fig. 1. The Architecture of the Whole System
The system consists of three components:
• A tagged corpus: the text extracted from the online bulletin of the Shanghai stock market was first tagged by a tagging tool which identifies named entities such as the stock name, the company name, the amount, and the date, integrating a Chinese tagger from Shan Xi University [4]. Then, some errors were corrected by hand. Finally, we obtained a corpus tagged with part-of-speech (POS) tags and different named entity tags.
• A domain model and a link table: the domain model provides semantic information during the discovery process. The link table works as a bridge between the domain model and the process; it aims to make adaptation from one domain to another easy.
• The core of the event pattern discovery: it processes the user keyword, finds the examples related to the event in the tagged corpus, and then returns the identified patterns related to the event.
First, a user inputs the event keyword; the core of the discovery system scans the tagged corpus, chooses all the examples in the corpus that contain this keyword, finds syntactic patterns, and makes some unification if possible. Then the patterns
are adjusted according to the domain model and the link table. Finally, the patterns are output as the final result.
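To make this flow concrete, the following Python sketch illustrates the keyword-driven scan-and-unify step on toy data; it is not the authors' implementation, and the tag names, the unification rule that collapses adjacent repeated tag pairs into a starred group, and the miniature corpus are all illustrative assumptions.

```python
# Toy tagged corpus: each entry is the tag sequence of one announcement.
# Tag names mirror those used later in the paper; the data is made up.
TAGGED_CORPUS = {
    "halt": [
        ["<stockid>", "<stockname>"] * 10 + ["<reason>", "<date>", "<period>"],
        ["<stockid>", "<stockname>"] * 3 + ["<reason>", "<date>", "<period>"],
    ],
}

def collapse(seq):
    """Collapse maximal runs of a repeated tag pair into a starred group."""
    out, i = [], 0
    while i < len(seq):
        pair = seq[i:i + 2]
        j = i
        while len(pair) == 2 and seq[j:j + 2] == pair:
            j += 2
        if j - i >= 4:                      # the pair occurs at least twice
            out.append("{" + "".join(pair) + "}*")
            i = j
        else:
            out.append(seq[i])
            i += 1
    return out

def discover(keyword):
    """Return one unified pattern if all examples of the event agree."""
    examples = [collapse(s) for s in TAGGED_CORPUS.get(keyword, [])]
    if examples and all(e == examples[0] for e in examples):
        return examples[0]
    return None

print(discover("halt"))
# ['{<stockid><stockname>}*', '<reason>', '<date>', '<period>']
```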
3 Event Pattern Discovering
3.1 The Domain Model and Link Table
A domain model provides knowledge about the domain. It is predefined using the Entity-Relationship model. It describes the properties of stocks, the relationship between stocks and their companies, the relationship between stocks and the stock exchanges, and all kinds of events related to the stock market. For a stock, there are the stock number, the stock name, the stock market price, and so on. For the relationship between a stock and its company, there are the P/E ratio, share capital, return on equity, dividend per share, and so on. For the stock exchange, there are many events related to a stock, such as an Initial Public Offering (IPO), halting the trade of a stock due to pending news, and so on. Domain knowledge depends only on the domain and the application. In order to enhance the portability of the event pattern discovery process, a link table is introduced to establish a linkage between the domain model and the process. If the domain changes, the link table changes as well while the whole process remains unchanged. This increases the adaptability of information extraction to some extent, because the patterns can easily be obtained when the domain changes. Actually, the link table is a static mapping between text surface structures and semantic structures. The text was tagged with POS tags and some named entities. The domain model describes the semantic information in the domain; therefore, semantic information is added to the syntactic analysis by referring to the link table. For example, the link table contains the following entries:
Tags            Concepts
Stockid         Stock_id
Stockname       Stock_name
Date            Time point
Prep_d + date   Time period
Conj_e + VP     Reasons for the event
In the domain model, there are events and their attributes. In the IPO event, the time point means the date on which a newly issued stock enters the stock market; in the halting event, the time point means the date on which trade of the stock is halted. A time point thus has different meanings in different events. With the domain model and the link table, those POS tags or named entities become meaningful entities or attributes related to the corresponding event.
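As a rough illustration of how such a link table could be consulted, the following Python sketch maps surface tags to event-specific concepts; the dictionary contents and the per-event interpretation of "Time point" are assumptions based on the description above, not the authors' data structures.

```python
# Link table: text-surface tags -> domain concepts (contents assumed).
LINK_TABLE = {
    "Stockid": "Stock_id",
    "Stockname": "Stock_name",
    "Date": "Time point",
    "Prep_d + date": "Time period",
    "Conj_e + VP": "Reasons for the event",
}

# Event-specific reading of the generic concepts (assumed, cf. Section 3.1).
EVENT_ATTRIBUTES = {
    "IPO": {"Time point": "date the new stock enters the market"},
    "Halting": {"Time point": "date on which trading is halted"},
}

def interpret(tag, event):
    """Turn a surface tag into a meaningful attribute of the given event."""
    concept = LINK_TABLE.get(tag)
    if concept is None:
        return None
    return EVENT_ATTRIBUTES.get(event, {}).get(concept, concept)

print(interpret("Date", "IPO"))             # date the new stock enters the market
print(interpret("Date", "Halting"))         # date on which trading is halted
print(interpret("Conj_e + VP", "Halting"))  # Reasons for the event
```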
3.2 The Process of Event Pattern Discovery
The discovery process is described in the following steps:
1. Get the event keyword from the user input.
2. Look up the domain model and find all the information related to this event.
3. Form the initial event pattern.
4. Search the corpus for examples of the event.
5. Extract those examples with their tagged information, form the syntactic patterns, and make some unification if possible.
6. Synthesize the patterns from the domain model and the examples based on the link table.
7. Output the patterns.
An example is used to describe steps 4–7 of the process. There are two pieces of news extracted from the bulletin on the stock market website:
1. (600102)“莱钢股份”、(600139)“鼎天科技”、(600203)“福日股份”、(600213)“亚星客车”、(600215)“长春经开”、(600229)“青岛碱业”、(600275)“武昌鱼”、(600745)“康赛集团”、(600791)“贵华旅业”、(600872)“中山火炬”因未刊登股东大会决议公告，2001年1月2日停牌一天。(Translation: Some stocks named (600102) “LaiGangGuFen”, (600139) “DingTianKeJi”, … will be halted for one day on Jan. 2, 2001 because they have not released the news after the shareholders’ meeting.)
2. (600216)“浙江医药”、(600695)“大江股份”、(900919)“大江B股”因召开股东大会，2月13日停牌一天。(Translation: Some stocks named (600216) “ZheJiangYiYao”, (600695) “DaJiangGuFen”, (900919) “DaJiang B” will be halted for one day on Feb. 13 because of a shareholders’ meeting.)
These two pieces of news were found in the corpus to be related to the event “Halting”. We extracted the tag patterns as follows:
1. <stockid><stockname><stockid><stockname><stockid><stockname><stockid><stockname><stockid><stockname><stockid><stockname><stockid><stockname><stockid><stockname><stockid><stockname><stockid><stockname>
2. <stockid><stockname><stockid><stockname><stockid><stockname>

Then, we make the unification on the above two results and get only one pattern:

{<stockid><stockname>}*

According to the domain model, there are four attributes related to the halting event: stock_name, time point (when to halt), time period (how long the stock will be
halted), and the reason for halting. From the link table, we know that Conj_e + VP expresses the reason for the event. Therefore, the “halting” event and its pattern are:
Event: the trading of a stock is halted due to some reason.
Pattern: <Stock_id><Stock_name><Reason><Date><Period>
That means that the template of the “halting” event consists of 5 slots: stock_id, stock_name, reason, date, and period. Each slot has its own type.
3.3 Experimental Result and Evaluation
We tested the system on three events: IPO, Halting, and Press Releases. The corpus consists of 23 short passages on halting the trade of some stocks, 20 passages of announcements for IPOs, and 93 passages for Press Releases. The system found some patterns, and we checked whether the patterns are correct for the passages in the corpus. The precision is calculated as follows:
Precision = Number of correct patterns / Number of patterns found
The result is shown in Table 1. Analyzing the result, we found that the main reason for many errors is the ambiguity of the keyword. The precision strongly depends on the event keyword given to the system. For the IPO event, the Chinese keyword is “上市”; it can be used as a modifier, as in “上市公告” (IPO announcement), “上市股票” (IPO stock), and “上市部分” (IPO part), but it can also denote the IPO event itself. Since the event keyword for “halting” is less ambiguous, it gave a good result.
Table 1. Precision of pattern discovery related to three events

Event name      Number of passages   Number of patterns found   Number of correct patterns   Precision
Halting         23                   5                          5                            100 %
IPO             10                   7                          3                            42.85 %
Press Releases  93                   4                          3                            75 %
For the halting event, the slots are: stock_id, stock_name, reason (for halting), date (when to halt), and period (how long to halt); for the IPO event, there are company_name, stock_name, amount_of_stock, and date (of the IPO); for press releases, some slots are identified, such as which listed company, which stock, when, and what kind of news is released. For the detailed content of a press release, it is difficult to extract the
patterns automatically, because it has too much diversity. It could be a challenge in the future.
4 Conclusion
In this paper, event patterns in the bulletin of a stock market are automatically discovered based on a tagged corpus and a domain model. The method has the following three features:
• The domain model provides semantic information for the event pattern discovery; it is separate from the discovery process itself, so the experimental system is easy to adapt to other domains.
• Although the domain model is predefined, it is easy to extend its knowledge.
• The link table is used to establish a bridge between the text surface structure and the deep semantic structure represented by the domain model. The table is easy to update when the domain changes.
However, the method we use is quite simple, and compared with man-made patterns or templates the precision is not always high due to the ambiguity of keywords. The method still needs some improvements to analyze complex sentences and resolve ambiguities in natural language processing.
Acknowledgements. This work was supported by grant No. 60083003 from the National Natural Science Foundation of China.
References
1. Freitag, D.: Information extraction from HTML: Application of a general learning approach. Proceedings of the 15th Conference on Artificial Intelligence (AAAI-98) (1998) 517-523
2. Califf, M., Mooney, R.: Relational learning of pattern-match rules for information extraction. Working Papers of the ACL-97 Workshop in Natural Language Learning (1999) 9-15
3. Soderland, S.: Learning information extraction rules for semi-structured and free text. Machine Learning 34 (1999) 233-272
4. Liu, K.Y.: Automatic Segmentation and Tagging for Chinese Text. The Commercial Press, Beijing, China (2000) (In Chinese)
Email Categorization Using Fast Machine Learning Algorithms Jihoon Yang and Sung-Yong Park Department of Computer Science, Sogang University 1 Shinsoo-Dong, Mapo-Ku, Seoul 121-742, Korea {jhyang, parksy}@ccs.sogang.ac.kr
Abstract. An approach to intelligent email categorization has been proposed using fast machine learning algorithms. The categorization is based on not only the body but also the header of an email message. The metadata (e.g. sender name, organization, etc.) provide additional information that can be exploited and improve the categorization capability. Results of experiments on real email data demonstrate the feasibility of our approach. In particular, it is shown that categorization based only on the header information is comparable or superior to that based on all the information in a message.
1
Introduction
With the proliferation of the Internet and numerous affordable gadgets (e.g. PDAs, cell phones), emails have become an indispensable medium for people to communicate with each other. People can send emails not only to desktop PCs or corporate machines but also to mobile devices, and thus they receive messages regardless of time and place. This has caused a drastic increase in email correspondence and makes people spend a significant amount of time reading their messages. Unfortunately, as email communication becomes prevalent, all kinds of emails are generated. People tend to make email their first choice when they need to talk to someone. A supervisor or leader of a group sends out a message to group members to arrange a meeting. The internal communications department of a company distributes an email message to all employees to remind them of the deadline for timecard submission. These depict situations in which email communication is very efficient while traditional methods (e.g. phone calls) are time-consuming and expensive. Though email has brought us enormous convenience and fast delivery of messages, it has also caused us the trouble of managing a huge influx of data every day. It has become important to distinguish messages of interest from the huge amount of data we receive. For instance, a message from the boss asking for a document might be much more critical than a message from a friend suggesting lunch. To make matters worse, knowing the efficacy and ease of email communication, there exist a number of malicious people trying
This research was supported by the Sogang University Research Grants in 2002.
to hoax innocent people with jokes or even viruses, and salespeople trying to advertise their goods with unsolicited messages. Therefore, it is clearly of interest to design a system that automatically classifies emails. Against this background, we propose an approach to automatic text classification (or categorization; both terms will be used interchangeably in the paper) using machine learning algorithms. We are interested in fast learning algorithms to deal with large amounts of data swiftly. Our domain of interest is email messages, however our approach can be applied to other types of text data as well (e.g. patents). An email can be simply categorized into spams and non-spams. Furthermore, it can be assorted into more detailed categories such as meetings, corporate announcements, and so on. As mentioned previously, additional information (e.g. sender) in addition to the text data in an email is considered for more precise classification. The Rainbow text classification system [1] is adopted in our experiments. Among the learning algorithms in Rainbow, two fast algorithms are chosen and modified for our experimental studies.
2
Rainbow and Learning Algorithms
Rainbow is a freely available program that performs statistical text classification, written by Andrew McCallum and his group at Carnegie Mellon University [1]. Rainbow operates in two steps: 1) read in documents, compute statistics, and write the statistics, the "model", to disk; and 2) perform classification using the model. A variety of machine learning algorithms are deployed in Rainbow, among which the following two algorithms have been used in our work: TFIDF [2] and Naïve Bayes [3,4]. These algorithms are chosen considering their fast learning speed. We explain each algorithm briefly. (Detailed descriptions can be found in the references.) 2.1
TFIDF
TFIDF classifier (TFIDF) is similar to the Rocchio relevance feedback algorithm [5] and the TFIDF word weights described in Section 3.2. First, a prototype vector c is generated for every class c ∈ C, where C is the set of all classes, by combining all feature vectors of the documents d in the class:

c = Σ_{d∈c} d

Then, the classification is to find the prototype vector that gives the largest cosine with the document d which we want to classify:

arg max_{c∈C} cos(d, c) = arg max_{c∈C} (d · c) / (‖d‖ ‖c‖)
This is a very simple yet powerful algorithm, and numerous variants have been proposed in the literature. (See [2] for detailed descriptions on TFIDF classifier and similar approaches.)
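As a rough illustration of the prototype-based classification just described, the following Python sketch builds class prototypes by summing document vectors and classifies by cosine similarity; the toy vectors and labels are invented, and a real system would plug in TFIDF-weighted vectors.

```python
import numpy as np

def train_prototypes(X, y):
    """Build one prototype per class by summing the class's document vectors."""
    prototypes = {}
    for label in set(y):
        prototypes[label] = X[np.array(y) == label].sum(axis=0)
    return prototypes

def classify(doc, prototypes):
    """Return the class whose prototype has the largest cosine with doc."""
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0 else float(np.dot(a, b)) / denom
    return max(prototypes, key=lambda c: cosine(doc, prototypes[c]))

# Tiny synthetic example: 3 documents over a 4-term vocabulary.
X = np.array([[2.0, 0, 1, 0],
              [1.0, 0, 2, 0],
              [0.0, 3, 0, 1]])
y = ["meeting", "meeting", "spam"]
protos = train_prototypes(X, y)
print(classify(np.array([1.0, 0, 1, 0]), protos))  # -> "meeting"
```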
2.2
Naïve Bayes

In the Naïve Bayes classifier (NB), it is assumed that a term's occurrence is independent of the other terms. We want to find the class that gives the highest conditional probability given a document d:

arg max_{c∈C} P(c|d)

By Bayes' rule [3],

P(c|d) = P(d|c) · P(c) / P(d)

It is clear that

P(c) = |c| / Σ_{c′∈C} |c′|

and P(d) can be ignored since it is common to all classes. There are two ways to compute P(d|c) based on the representation: either binary or term-frequency-based. We show how to compute P(d|c) for the latter. (See [6] for the binary case.) Let Nit be the number of occurrences of word wt in document di, and |V| the vocabulary size. Then P(di|c) is the multinomial distribution:

P(di|c) = P(|di|) · |di|! · Π_{t=1}^{|V|} P(wt|c)^{Nit} / Nit!

P(|di|) · |di|! is also common to all classes and thus can be dropped. Finally, the probability of word wt in class c can be estimated from the training data:

P(wt|c) = (1 + Σ_{i=1}^{|D|} Nit · P(c|di)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} Nis · P(c|di))

where D is the training data set.
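The following Python sketch implements a standard multinomial Naïve Bayes with Laplace smoothing; it simplifies the estimate above by using hard class assignments of the training documents (so P(c|di) is 0 or 1), and the toy documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes with Laplace smoothing over word counts."""
    vocab = {w for d in docs for w in d}
    by_class = defaultdict(list)
    for d, c in zip(docs, labels):
        by_class[c].append(d)
    priors = {c: len(ds) / len(docs) for c, ds in by_class.items()}
    cond = {}
    for c, ds in by_class.items():
        counts = Counter(w for d in ds for w in d)
        total = sum(counts.values())
        cond[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, cond, vocab

def classify_nb(doc, priors, cond, vocab):
    """Pick the class maximizing log P(c) + sum_t N_t * log P(w_t|c)."""
    def score(c):
        s = math.log(priors[c])
        for w, n in Counter(doc).items():
            if w in vocab:
                s += n * math.log(cond[c][w])
        return s
    return max(priors, key=score)

docs = [["meeting", "room", "agenda"], ["agenda", "minutes"],
        ["viagra", "free", "free"]]
labels = ["meeting", "meeting", "spam"]
priors, cond, vocab = train_nb(docs, labels)
print(classify_nb(["agenda", "meeting"], priors, cond, vocab))  # -> meeting
```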
3
Experiments
We explain which categories and features have been considered for the emails collected for our experiments, and exhibit the results. 3.1
Email Corpus
Email messages have been collected and manually categorized. Among the various categories we defined, the corporate announcement, meeting, and spam categories were considered in our experiments. 189, 725, and 430 email messages were collected for these categories respectively, of which 60% was used for training and the remaining 40% for testing. There exist messages that belong to more than one category. For instance, a message announcing a group meeting can belong to the meeting as well as the corporate announcement category. For simplicity (though this can be unrealistic), we assume that each message belongs to only one category. (e.g. We excluded meeting messages from corporate announcement even though they belong to that category as well.)
3.2
Representation
Classification of documents necessarily has to involve some analysis of the contents of a document. In the absence of a satisfactory solution to the natural language understanding problem, most current approaches to document retrieval (including Rainbow) use a bag-of-words representation of documents [7]. Thus, a document is represented as a vector of weights for terms (or words) from a vocabulary. There are several possibilities for determining the weights: binary values can be assigned to each term to indicate its presence or absence in a document; or the term frequency can be used to indicate the number of times the term appears in a document; or the term frequency – inverse document frequency can be used to measure the term frequency of a word in a document relative to the entire collection of documents [7]. A document can be processed using stopping and stemming procedures [7,8] to obtain the bag of words. The stopping procedure eliminates all commonly used terms (e.g. a, the, this, that) and the stemming procedure [9] produces a list of representative (root) terms (e.g. play for plays, played, playing). Let d be a document and wi the ith word in d. The term frequency of wi, TF(wi, d), is the number of times wi occurs in d. The document frequency of wi, DF(wi), is the number of documents in which wi occurs at least once. The inverse document frequency of wi, IDF(wi), is defined as IDF(wi) = log(|D| / DF(wi)), where |D| is the total number of documents. Then, the term frequency – inverse document frequency of wi is defined as TF(wi, d) · IDF(wi) [7]. Either the binary values, the term frequency, or the term frequency – inverse document frequency is used in the classifiers chosen in this paper. Emails carry additional information in the header besides the text message (i.e. the email body). For instance, an email header includes the sender, receivers, subject, and the like. A set of additional features can be derived from the header and used together with the body. These additional features have a potential for more accurate classification. For example, we can define additional features for the sender name, sender ID, sender domain name, and sender domain type. If we had a sender "Jihoon Yang <jihoon [email protected]>", we can define the additional features "JihoonYang:SenderName", "jihoon yang:SenderID", "sra:SenderDomainName", and "com:SenderDomainType". (Note that we define header features in the form "Value:FeatureName" in order to construct unique features that do not appear in the body.) Another set of features can be defined for both the email body and the header, especially the subject line. For instance, we can count how many $ characters are included in the message, which might insinuate that the message is spam. We can also count how many special characters appear in the message. We define several such features and represent them as mentioned above. For instance, if we had 10 and 5 $'s appearing in the text and the subject line, we can store those values for the two features ":NoDollars" and ":SubjectNoDollar", respectively. While some of the features can be obtained by simple syntactic analysis, some other features require the application of information extraction techniques (e.g. email addresses, phone numbers). Overall, 43 features have been defined and used in our work.
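A minimal sketch of how such header features could be derived is given below; the feature-name format follows the examples above, while the helper itself, the sample address, and the count feature are illustrative assumptions rather than the authors' actual 43-feature extractor.

```python
from email.utils import parseaddr

def header_features(sender_header, subject):
    """Derive header features in the 'Value:FeatureName' style described
    above (feature names and the sample address are illustrative)."""
    name, address = parseaddr(sender_header)
    features = {}
    if name:
        features[name.replace(" ", "") + ":SenderName"] = 1
    if "@" in address:
        user, _, host = address.partition("@")
        parts = host.split(".")
        features[user + ":SenderID"] = 1
        features[parts[0] + ":SenderDomainName"] = 1
        features[parts[-1] + ":SenderDomainType"] = 1
    # Simple count-style feature over the subject line.
    features["SubjectNoDollar"] = subject.count("$")
    return features

print(header_features("Jihoon Yang <jihoon_yang@sra.com>", "Win $$$ now"))
```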
3.3
Experimental Results
First, the learning algorithms (TFIDF and NB) are trained in Rainbow. (The model generation part in Rainbow is modified to include all the features in Table 1.) After training is done, the "scores" (i.e. similarity or probability) of the training patterns with respect to the classes are computed in Rainbow. Then the thresholds are computed by scanning the sorted list of scores once, so that the trained classifier yields the maximum classification accuracy. In other words, a threshold that is less than the scores of most of the emails in the class and greater than those of most of the emails in other classes is determined. These thresholds, computed for each class, are used in testing. This is an independent training, in contrast to the winner-take-all strategy originally included in Rainbow. Independent training is necessary since a message can belong to more than one class or may not belong to any. One of our goals was to undertake comparative studies between the two learning algorithms. In addition, we intended to figure out how different parts of email messages make a difference in classification. For instance, classification based on subject lines might produce results comparable to classification with all the data in the message. For this purpose, we compared the performance of classifiers trained with different parts of the message. We considered the following five combinations: all (A), header (H), subject line (S), body and subject line (BS), and header without subject line (HS). We also considered the cases where the stemming procedure was applied or not. Furthermore, we compared the performance of the algorithms with all the features from different parts of an email (i.e. A, H, S, BS, HS) against using only the 50 features with the highest information gain [10]. For each experimental setting, we ran each algorithm ten times with different combinations of training and test messages, but maintained the same sizes as mentioned in Section 3.1 (i.e. ten-fold cross-validation). Table 2 exhibits our experimental results. Note that test cases suffixed by MI and T are the ones with mutual-information-based feature selection and stemming, respectively. The entries in the table correspond to means and standard deviations and are shown in the form mean ± standard deviation. The best accuracy, precision, and recall among the two algorithms and five (or four with stemming) feature sets are shown in bold face. Note that there is no row for HS when the stemming procedure is applied since the procedure does not make any difference for a header without the subject line. We observed the following from Table 2.
1. Better performance of NB than TFIDF: NB outperformed TFIDF in almost all cases. This is because TFIDF generally yielded very poor accuracy and recall despite comparable precision.
2. Better performance of TFIDF with feature subset selection: TFIDF produced better performance with feature subset selection by mutual information in all cases except S. In the case of S, the precision with feature subset selection was higher than that without feature selection, but the accuracy and recall were lower. We surmise that the reduced feature set was good for determining the categories precisely for specific emails while it did not include enough terms to cover all the messages in each category.
Table 1. Performance of learning algorithms in different experiments.

                     TFIDF                                  NB
features   accuracy    precision   recall      accuracy    precision   recall
A          77.3±0.3    88.2±0.6    71.2±0.6    94.0±0.2    95.3±0.3    92.1±0.3
BS         71.8±0.3    85.8±0.6    66.1±0.3    94.5±0.2    94.2±0.4    94.3±0.2
H          73.3±0.5    84.8±0.7    69.4±0.9    93.7±0.3    93.1±0.3    93.3±0.3
HS         76.2±0.4    85.4±0.8    73.6±0.5    89.5±0.4    88.9±0.6    90.5±0.3
S          78.7±0.7    90.0±0.4    75.0±0.7    86.6±0.4    89.4±0.4    83.4±0.5
A-MI       85.7±0.4    95.0±0.9    81.2±0.7    93.8±0.4    93.8±0.4    93.4±0.5
BS-MI      79.4±0.4    94.3±0.5    74.4±0.7    90.4±0.3    90.5±0.3    89.9±0.4
H-MI       79.1±0.4    89.4±0.7    77.6±0.5    84.6±0.9    85.3±0.8    85.3±0.9
HS-MI      83.4±0.4    91.9±0.4    83.8±0.5    66.3±4.8    68.5±5.5    61.3±5.8
S-MI       71.0±0.5    96.9±0.6    66.0±0.7    75.0±0.4    82.6±0.3    70.8±0.5
A-T        76.0±0.4    86.0±0.5    69.7±0.5    94.6±0.3    95.1±0.2    92.7±0.4
BS-T       68.5±0.3    83.7±0.6    63.0±0.4    94.3±0.3    93.2±0.4    94.5±0.4
H-T        73.0±0.7    83.5±1.2    68.3±0.8    94.0±0.4    93.3±0.5    93.9±0.4
S-T        79.0±0.5    89.0±0.6    75.7±0.6    87.7±0.2    89.8±0.3    84.7±0.4
A-T-MI     85.3±0.4    94.4±0.8    81.3±0.6    93.2±0.3    92.7±0.4    92.7±0.4
BS-T-MI    78.7±0.5    92.6±1.0    74.3±0.6    90.7±0.3    90.6±0.3    90.2±0.4
H-T-MI     78.4±0.3    89.6±0.5    77.0±0.7    83.1±0.5    84.1±0.5    83.1±0.8
S-T-MI     72.0±0.6    96.0±0.5    67.3±0.8    76.1±0.4    82.8±0.3    72.0±0.7
3. Better performance of NB without feature subset selection: NB performed reasonably consistently and well in the different experimental settings. However, for incomplete data (i.e. H, HS, and S), it worked better without feature subset selection by mutual information.
4. No effect of stemming: Stemming did not make a significant difference in performance for either algorithm, though it decreased the size of the feature set.
5. Good performance with headers: In both algorithms, the performance with H was comparable to that with A or BS. In particular, TFIDF produced high precision (but low recall and accuracy) with only the subject line, as explained above. This means we can get reasonable performance by considering only the header information (or even only the subject line) instead of the entire email message.
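The per-class threshold selection described at the beginning of this subsection — scanning the sorted training scores once and keeping the cut-off that maximizes training accuracy — could look roughly like the following sketch; the scoring interface and the toy numbers are assumptions.

```python
def pick_threshold(scores, is_member):
    """Choose a per-class score threshold maximizing training accuracy by
    scanning the sorted scores once, as described for independent training."""
    paired = sorted(zip(scores, is_member))
    correct = sum(is_member)          # threshold below all scores: accept everything
    best_correct, best_thr = correct, paired[0][0] - 1e-9
    for s, member in paired:
        correct += -1 if member else 1  # raising the cut just above s rejects it
        if correct > best_correct:
            best_correct, best_thr = correct, s + 1e-9
    return best_thr

scores = [0.9, 0.8, 0.35, 0.3, 0.1]            # toy classifier scores for one class
membership = [True, True, False, True, False]  # whether each email is in the class
print(pick_threshold(scores, membership))      # -> just above 0.1 (4/5 correct)
```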
4
Summary and Discussion
An approach to intelligent email categorization has been proposed in this paper in order to cope with the immense influx of information these days. Two machine learning algorithms (TFIDF and NB) were used and their performance was compared. We also studied how different parts of email structure affect the classification capability. Experimental results demonstrate that NB outperforms
TFIDF and yields better performance without feature subset selection (especially when a small number of parts of an email were used), while TFIDF works well with feature subsets based on mutual information. It was also found, at least with our current corpus, that classification with the header was as accurate as that with the entire message, with an even smaller number of features. Some avenues for future work include:
– Experiments with additional algorithms: A number of machine learning algorithms have been proposed and compared in the literature [4,11,12]. For instance, support vector machines [11], though slow, have been claimed to be powerful and have been applied to text classification [13,12]. Experimental studies of SVMs and their comparison with NB and TFIDF will give us additional knowledge on email classification.
– Maintenance of quality data: First, we can collect more data. Our experiments were with fewer than 1,500 messages in three categories, and more than half of the messages belong to the spam class. In order to get more accurate, credible statistics, we need to keep collecting messages. Also, the three categories can be extended to include interesting domains. Current emails belong to only one category, but we can extend this to include messages belonging to multiple categories since there exist many such emails in the real world. Furthermore, categories can be organized into a concept hierarchy, which can be used in classification (e.g. Yahoo). In addition, we can eliminate junk from messages, which is usually unnecessary in classification and can possibly cause confusion and misclassification. For instance, we can remove signatures in messages. Of course, this kind of information can be useful in detecting the authors of the messages, but it is usually not useful in categorization.
– Extensive experimental study: There can be different parameter settings in each algorithm. Our experimental studies are based on the default parameter settings. We can perform extensive experiments with the different settings available in Rainbow. Furthermore, we can try different techniques which are not included in Rainbow (e.g. different methods for feature subset selection). We can also build ensembles of classifiers (e.g. voting, boosting, bagging) and compare their performance.
– Combination with NLP-based information extraction: Categorization can be bolstered with information extracted by natural language processing. For example, if we extract the date, time, place, and duration of a meeting from a message, we know the message is about a meeting even without going through the classification routine. This additional information can boost the performance of text-based classification. Similar to [14], additional domain-dependent phrases can be defined for each category and extracted as well. Rainbow can be mixed with a rule-based system using such additional features.
– Consideration of attachments: Attachments, if they exist, can be exploited. Attachments can simply be processed with email messages in classification or can be classified separately (with respect to the original categories or
new categories). Moreover, attachments can be classified with respect to a different set of categories. For instance, we can define categories by the types of attachments (e.g. Word documents, presentation slides, spreadsheets, postscripts, etc.). The information about the number of attachments and their types can be used in classification.
– Extension to prioritization: A simple approach to prioritization would be to do it based on the categories. This can be extended using the extracted information. That is, we can prioritize a message based on the information we extract. For instance, if we extract an activity or an event with its time and place of occurrence, we can determine the priority of the message. The additional information we defined from the header (e.g. sender) can also be exploited to adjust the priority. Also, machine learning can be applied to this problem.
– Action learning: There can be typical actions associated with each category. For instance, a person might, in general, forward meeting messages to someone else. This kind of knowledge can be used for learning people's "actions" (or behavior).
References
1. McCallum, A.: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow (1996)
2. Joachims, T.: A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical Report CMU-CS-96-118, Carnegie Mellon University, Pittsburgh, PA (1996)
3. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
4. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997)
5. Rocchio, J.: Relevance feedback in information retrieval. In: The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc. (1971) 313–323
6. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: Learning for Text Categorization Workshop, National Conference on Artificial Intelligence (1998)
7. Salton, G.: Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Massachusetts (1989)
8. Korfhage, R.: Information Storage and Retrieval. Wiley, New York (1997)
9. Porter, M.: An algorithm for suffix stripping. Program 14 (1980) 130–137
10. Cover, T., Thomas, J.: Elements of Information Theory. John Wiley (1991)
11. Vapnik, V.: The Nature of Statistical Learning Theory. Springer Verlag (1995)
12. Yang, Y.: A re-examination of text categorization methods. In: Proceedings of the 22nd ACM SIGIR Conference (1999) 42–49
13. Brutlag, J., Meek, C.: Challenges of the email domain for text classification. In: Proceedings of the Seventeenth International Conference on Machine Learning (2000) 103–110
14. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk e-mail. In: Learning for Text Categorization Workshop, National Conference on Artificial Intelligence (1998)
Discovery of Maximal Analogies between Stories Makoto Haraguchi, Shigetora Nakano, and Masaharu Yoshioka Division of Electronics and Information Engineering Hokkaido University N-13 W-8, Sapporo 060-8628, JAPAN { makoto, yoshioka }@db-ei.eng.hokudai.ac.jp
Abstract. Given two documents in the form of texts, we present a notion of maximal analogy representing a generalized event sequence of the documents with a maximal set of events. They are intended to be used as extended indices of documents to automatically organize a document database from various viewpoints. The maximal analogy is defined so as to satisfy a certain consistency condition and a cost condition. Under the consistency condition, a term in an event sequence is generalized to a more abstract term independently of its occurrence positions. The cost condition is introduced so that meaningless similarities between documents are never concluded. As the cost function is monotone, we can present an optimized bottom-up search procedure to discover a maximal analogy under an upper bound of the cost. We also show some experimental results, based on which we discuss our future plans.
1
Introduction
For the last decade, various methodologies for retrieving, organizing and accessing documents in document databases or on computer networks have been developed. Document classification, text summarization, and information retrieval are examples of such techniques. As the amount of documents to be processed is generally very large, indexing systems for documents in terms of keywords are often used because of their efficiency. The family of keywords or index terms is chosen to cover various documents and to distinguish them from each other. However, a set of keywords alone cannot discriminate two or more documents that should be considered as distinct ones. For instance, any keyword-based indexing system does not distinguish "a dog bit a man" from "a man bit a dog", in spite of the fact that they carry completely different stories. We thus need an extended indexing system that can capture such differences not expressed by index terms. From this viewpoint, this paper presents a first step towards such an extended indexing system from the viewpoint of discovery of analogy between stories. Such a system should satisfy the following:
(R1) As a document has various aspects each of which is a story, an index is itself a story.
(R2) Such indices are automatically discovered from documents.
(R3) The problem of what is a significant story in a document depends on each person. So, the indexing varies according to individuals.
(R4) Once such indices are discovered and constructed, documents should be quickly accessible by their indices.
By the term "story" we here mean a plot, an event sequence in a document. As each single sentence in a document roughly corresponds to an event, a given document is itself a story including various sub-stories that are subsequences of the whole event sequence. The problem is to determine an important event (sub-)sequence characterizing the document. A possible standard approach is to evaluate the significance of events by the frequencies and co-occurrences [3] of the words in them. Although such a scheme is quite effective, there may exist words or sentences whose importance we only realize in a particular story extracted from the document. The basic standpoint of this paper is stated as follows. The problem of what the important events in a document are cannot be determined by examining only one document. If some event sequence is regarded as significant from a particular point of view, then we will find another similar document in which a similar event sequence also appears. Conversely, when we find a generalized event sequence common to all the documents which some user or a group of users consider similar, it can be a candidate for an important sequence, and is therefore a possible index of documents. More precisely, we say that an event sequence is common to a set of documents whenever the sequence is a generalization of some sequence in every document. As the act of generalizing event sequences depends on the subsumption relationships between words, we use the EDR electronic dictionary [5]. Furthermore, we consider a concept tree representation of events in Section 2. The notion of a concept tree is a special case of that of concept graphs [2], and can represent case structures of sentences in documents. Given two documents which we regard as similar (R3), a notion of maximal analogy (MA, for short) between the two is introduced in Section 4 to formalize the common generalization of event sequences with a maximal set of events. As an MA is itself an event sequence, it can be an answer for (R1). Although we have not yet designed a query-answering system for documents indexed by MAs, their subsumption checking never involves any combinatorial computation, so the test of whether a document meets an MA is quickly performed (R4). We use a cost condition to exclude too abstract event sequences and to enjoy a pruning technique in the bottom-up construction of MAs presented in Section 4. We first introduce a specific ordering, derived from the structure of event sequences, to control the search. Then, we present an optimized generation of candidates in the sense that the number of candidates generated and tested is minimized. This property will meet (R2) under some improvements. In the present experiment, we suppose short stories of at most 50 sentences. All the problems concerning natural language processing are discussed in the last section, in which we present our experimental results and talk about our future plans.
2
Concept Trees and Their MCSs
After morphological analysis and parsing, each sentence in a document is represented as a rooted tree with words as its nodes and cases (or role symbols) as its edges, where we choose a verb as its root (see Fig. 1 for instance). Since the verbs are first-class entities of events, the tree of words will simply be called an event in Definition 1. Although such a tree of words is normally formalized as a semantic network [4], we consider it as a kind of concept graph [2]. This is simply because we can define an ordering for trees by restricting one for graphs. To examine semantic relationships between concept graphs, we use EDR [5], a machine-readable dictionary. As a word may have two or more concepts as its possible meanings, the dictionary we need must answer what concepts are involved in words and what relationships hold among those concepts. The EDR system supports both kinds of information about Japanese words and concepts. Each concept is designated by a unique identifier called a concept ID. Let Terms be the set of all words and concept IDs in EDR. Then a partial ordering ≺ over Terms can be given as t1 ≺ t2 iff (1) t1 and t2 are both concept IDs and t1 is more special than t2 in the concept dictionary, or (2) t1 is a word and t2 is a concept ID more general than some concept ID associated with t1 in the word dictionary. Based on this partial ordering for terms, we have the following definitions of concept trees and their ordering.

Definition 1. (Concept trees and their paths) Given a set L of role or case symbols, a path of length n is a sequence of roles p = (ℓ1, ..., ℓn), where ℓj ∈ L. The empty path, λ = (), of length 0 is always regarded as a path denoting the root of the tree. A concept tree is then defined as a pair g = (Path(g), termg), and is also called an event, where Path(g) is a finite and prefix-complete set of paths including the empty path, and termg is a term labelling function termg : Path(g) → Terms.

(Concept Tree Ordering) We say that a concept tree gs subsumes another concept tree gi if, for every rooted path p ∈ Path(gs), both p ∈ Path(gi) and termgi(p) ⪯ termgs(p) hold. In this case, we also say that gs is a generalization of gi or that gi is a specialization of gs. Intuitively speaking, a concept tree gs is more general than another tree gi if every path of gs is preserved in gi and has a more general term in gs than in gi. For instance, both trees at the bottom of Fig. 1 are subsumed by the top tree. Now, a minimal common generalization, MCS, of two concept trees is defined similarly to the least common subsumers of concept graphs [2]. Formally, an MCS of g1 and g2 is defined as a tree consisting of the common paths of the gj whose labels are minimal upper bounds of the corresponding paired terms in the gj:

MCS(<g1, g2>) = (Path, λp∈Path . mst({termg1(p), termg2(p)})), where Path = Path(g1) ∩ Path(g2).
[Fig. 1 shows two concept trees rooted at the verb kill — "he (agt) kills a cow (obj) with a butcher_knife (instrument)" and "he (agt) kills a horse (obj) with empty_hand (instrument)", each carrying additional time or place roles — together with their MCS and its skeleton at the top, where obj is labelled domestic_animal = {horse, cow} and instrument is labelled tool = {butcher_knife, empty_hand}.]

Fig. 1. Concept trees, where the top is a MCS of the other two at the bottom
mst(A) is a chosen minimal upper bound of a set of terms A. In this sense, mst is called a choice function. We furthermore consider the MCS's skeleton, SCS(<g1, g2>) = (Path, term_pair), with the same path set as the MCS, to record which terms are paired in the MCS. That is, term_pair(p) = {termg1(p), termg2(p)}. Fig. 1 illustrates an MCS and its skeleton SCS, where an expression of the form t = {t1, t2} means t = mst({t1, t2}).
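To make the MCS computation concrete, here is a small Python sketch over a toy is-a taxonomy standing in for the EDR concept dictionary; the taxonomy, the encoding of trees as path-to-term dictionaries, and the particular choice function mst (lowest shared ancestor) are all assumptions for illustration.

```python
# Toy is-a taxonomy standing in for the EDR concept dictionary (assumed data).
PARENT = {"cow": "domestic_animal", "horse": "domestic_animal",
          "butcher_knife": "tool", "empty_hand": "tool",
          "domestic_animal": "animal", "animal": "entity", "tool": "entity"}

def ancestors(term):
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def mst(terms):
    """A minimal common superior term: the lowest shared ancestor here
    (one possible choice function; the paper allows any minimal upper bound)."""
    common = set.intersection(*(set(ancestors(t)) for t in terms))
    for t in ancestors(next(iter(terms))):   # walk upward from one of the terms
        if t in common:
            return t
    return None

def mcs(g1, g2):
    """MCS of two concept trees given as {path_tuple: term} dictionaries."""
    shared = sorted(set(g1) & set(g2))
    return {p: mst({g1[p], g2[p]}) for p in shared}

t1 = {(): "kill", ("agt",): "he", ("obj",): "cow", ("instrument",): "butcher_knife"}
t2 = {(): "kill", ("agt",): "he", ("obj",): "horse", ("instrument",): "empty_hand"}
print(mcs(t1, t2))
# {(): 'kill', ('agt',): 'he', ('instrument',): 'tool', ('obj',): 'domestic_animal'}
```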
3
Minimal Consistent Common Subsumer
MCS represents a similarity between two events, one from each document. However, the similarity we try to investigate here is one between event sequences. To illustrate the point, let us examine a simple example. Suppose we pick up two events g11 = "A cat chases" and g12 = "The cat tumbles" from one story, and two events g21 = "A dog chases a tortoise" and g22 = "The tortoise tumbles" from another one. The event pairs <g11, g21> and <g12, g22> show the similarity represented by sg1 = MCS(<g11, g21>) and sg2 = MCS(<g12, g22>), respectively:

(sg1) A mammal={cat, dog} chases.
(sg2) The vertebrate={cat, tortoise} tumbles.

In the case of sg1, cat and dog are put in correspondence and generalized to mammal in the concept dictionary. On the other hand, sg2 extends the term pair {cat, tortoise} to animal (vertebrate). Thus the same concept, cat, is generalized to different concepts, mammal and animal, depending on the event pair. This derives from the fact that each MCS as well as each SCS is computed independently of the other event pairs. However, we postulate that each term must be generalized to a unique superior term throughout the generalization process. The reason can be stated as follows: we can interpret each concept in the stories in various ways by taking some superior terms in the dictionary. However, once we fix one viewpoint, the variety of interpretations vanishes, and only one aspect
of the conceptual term, represented by a superior term, will be realized. The superior term should therefore be unique, for each conceptual term, throughout the document. In the present example, cat, dog and tortoise are required to be generalized to animal simultaneously. We find and compute this requirement from the term pairs in the component SCSs, {cat, dog} and {cat, tortoise}. In fact, we regard these term pairs as kinds of equivalence classes, and put them into a new equivalence class by taking their transitive closure. Then, the term pairs in the original component SCSs are replaced with the extended term groups. The SCS sequence thus defined is called an SCCS. A minimal common generalization of paired event sequences (MCCS) is then defined as the SCCS in which each word group is simultaneously replaced with its minimal upper bound. The following is the result of the simultaneous replacement:

(sg1'): An animal={cat, dog, tortoise} chases.
(sg2')=(sg2): The animal={cat, tortoise}={cat, tortoise, dog} tumbles.

In the following definitions, the sequence of paired events <g11, g21> and <g12, g22> is represented by an op-selection.

Definition 2. (op-selection) Each document is defined as an ordered sequence g1, ..., gn of events gj under the order in which they appear in the story. We denote gi < gj whenever gi precedes gj. Then, given two stories Dj, an op-selection θ of D1 and D2 is an order-preserving one-to-one correspondence of events in Dj. That is, θ is a sequence P1, ..., Pk, where Pj = <gj^(1), gj^(2)> ∈ D1 × D2 and gj^(i) < gj+1^(i).

(MCCS and SCCS) Given such an op-selection θ = P1, ..., Pk, let Terms(θ) be the set of all terms in SCS(P1), ..., SCS(Pk). Then, the relation ∼ defined by {termgj^(1)(p) ∼ termgj^(2)(p) | p ∈ PathPj, j = 1, ..., k} is extended to the least equivalence relation ∼θ including ∼. The equivalence class containing both terms termgj^(1)(p) and termgj^(2)(p) is written as ecθ(p). Then, SCCS(θ, j) is defined as (PathPj, λp∈PathPj . ecθ(p)). The skeleton SCCS(θ) is just their sequence. We furthermore associate each C ∈ {ecθ(p) | p ∈ ∪j PathPj} with some chosen minimal superior term mst(C). Then, MCCS(θ, mst) is defined as the sequence of MCCS(θ, mst, j) = (PathPj, λp∈PathPj . mst(ecθ(p))).
4
Maximal Analogy and Its Bottom-up Construction
In this section, we introduce the notion of an MA (Maximal Analogy) and its bottom-up construction algorithm. To save space, we only show the version for two documents. For three or more documents, we iteratively apply the algorithm for two documents. The key property used to define MAs is stated as follows: as a new event pair ep is added to an op-selection θ, it holds that ecθ(p) ⊆ ecθ∪ep(p) for any path p in MCCS(θ). So, the number of steps needed to generalize the terms in ecθ∪ep(p) to their minimal superior term mst(ecθ∪ep(p)) becomes larger.
More precisely, we define the cost as follows, where we consider paths in the Hasse diagram of (Terms, ≺):

gcost(t, t′) = min{ length(p) | p is a path connecting t and t′ },
gcost({t1, ..., tn}, t) = maxj gcost(tj, t), where we suppose tj ⪯ t,
gcost(θ, mst) = max{ gcost([t], mst([t])) | [t] is an equivalence class of ∼θ },
gcost(θ) = minmst gcost(θ, mst).

Given an upper bound parameter gl, if gcost(θ, mst) > gl, we have to make more than gl steps of generalization through the concept dictionary to obtain MCCS(θ, mst), and the terms in MCCS(θ) may be too abstract. We consider such an MCCS(θ) to be of no use, so we regard only op-selections whose generalization cost is at most the upper bound gl.

Definition 3. (Maximal Analogy) Given two documents and an upper bound gl of the generalization level, an op-selection θ is said to be gl-appropriate if it satisfies the cost condition gcost(θ) ≤ gl. Then we say that a gl-appropriate op-selection θ is maximal if there exists no gl-appropriate op-selection θ′ that properly includes θ. In this case, MCCS(θ) is called a MA with its evidence set θ.

The construction of MAs is subject to that of maximal op-selections. In order to find maximal op-selections efficiently, we use the following property, corresponding to the anti-monotonicity of support used in [1].

(Monotonicity of Cost) gcost(θ) ≤ gcost(θ′) if θ ⊆ θ′.

The construction is bottom-up so as to enumerate all the possible op-selections without any duplication. For this purpose, we first introduce a partial ordering on the set of op-selections. Let Dj = g1^(j), ..., gnj^(j) be the whole sequence of events in this order. Then an op-selection θ is rather expressed as a sequence Pi1j1, ..., Pikjk, where Piℓjℓ is the ℓ-th pair, consisting of the iℓ-th event giℓ^(1) in D1 and the jℓ-th event gjℓ^(2) in D2. Piℓjℓ is called a singleton selection. The length k is called the level of θ. Then, the partial ordering ≺ among op-selections is defined by the transitive closure of the following direct successor relation: θ1 ≺ θ2 iff θ1 = θPij and θ2 = θPijPxy for some op-selection θ, Pij, x and y such that i < x and j < y. From the definition, it follows that any θ of level k + 1 has just one direct predecessor θ1 of level k, a prefix of θ. So, by induction on the level k, all the op-selections are enumerated without any duplication according to the ordering ≺. Furthermore, we list only op-selections satisfying the cost condition during the whole enumeration process.

Base step for level 1 op-selections: We list only the singleton op-selections satisfying the cost condition:

OPS(1) = {Pij | gcost(Pij) ≤ gl}.
Inductive step for level k + 1 op-selections: Suppose we have the set OPS(k) of all level-k op-selections satisfying the cost condition. We construct the op-selections of the next level consistent with the condition as follows:

OPS(k + 1) = {θPijPxy | θPij ∈ OPS(k), Pxy ∈ OPS(1), i < x, j < y, gcost(θPijPxy) ≤ gl}.

Note that, in the case of k = 1, θ is the null string.

Termination of the construction: The generation of OPS(k) terminates whenever we find a level ℓ such that OPS(ℓ) = ∅. ℓ is at most min{n1, n2}, where nj is the number of events in the story Dj. The number of selections generated and tested is minimized. To verify this, suppose gcost(θ) exceeds the limit gl for an op-selection θ at level k. θ has its unique generation path θ1 ≺ ... ≺ θk = θ of length k − 1. As gcost is monotone, there exists a least j such that gcost(θj) > gl. θj is generated, tested, and fails the condition, because θi ∈ OPS(i) for any i < j. However, as the predecessor θj is not listed in OPS(j), none of its successors, including θ, is ever generated and tested.
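The base and inductive steps above can be sketched as the following level-wise enumeration; the cost function is passed in as a parameter (here a purely illustrative toy cost), and the final maximality filter by set inclusion is an assumption about how maximal op-selections would be selected from the enumerated ones.

```python
def op_selections(n1, n2, gcost, gl):
    """Level-wise, duplication-free enumeration of all gl-appropriate
    op-selections, following the base and inductive steps above.
    An op-selection is a tuple of event-pair indices (i, j); gcost is a
    user-supplied monotone cost function over such tuples."""
    ops1 = [((i, j),) for i in range(n1) for j in range(n2)
            if gcost(((i, j),)) <= gl]                     # base step: OPS(1)
    found, level = list(ops1), ops1
    while level:                                           # inductive step
        level = [theta + p for theta in level for p in ops1
                 if p[0][0] > theta[-1][0] and p[0][1] > theta[-1][1]
                 and gcost(theta + p) <= gl]
        found.extend(level)
    return found

def maximal(selections):
    """Keep only op-selections not properly included in another one."""
    sets = [set(s) for s in selections]
    return [s for s, ss in zip(selections, sets)
            if not any(ss < other for other in sets)]

# Toy monotone cost: number of pairs whose two indices differ (illustrative).
toy_cost = lambda theta: sum(1 for (i, j) in theta if i != j)
for ma in maximal(op_selections(3, 3, toy_cost, gl=1)):
    print(ma)
```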
5
Experimental Results and Concluding Remarks
In this section, we present some experiments using a Linux PC with 1 GB of memory. We first apply a morphological analyzer and a parser to obtain a case structure for each sentence, and convert it to a concept tree representation. The cases are therefore surface ones, not deep ones. Although the quality of the cases is reflected in our experimental results, we also have another problem concerning the need for semantic analysis, namely the processing of anaphora, the meaning of compound nouns, and so on. We are now researching and examining to what extent such a semantic analysis is necessary to obtain MAs of high quality. The final answer will be found by restricting the kinds of documents. For the present, however, we choose a children's story and a short folktale as input to our algorithm to avoid the serious problems of semantic analysis. Both input stories share the following common plot:
(E1) There are two brothers.
(E2) The younger brother earns some property.
(E3) The elder brother kills the younger brother to snatch the property.
(E4) Then the bone of the younger brother sings a song to reveal the crime.
(E5) As a result, the elder brother is caught and punished.
We made two experiments. The first one was performed on descriptions of 26 events in the stories (written in Japanese). We set the value of the parameter gl to 4. Our algorithm finds an MA of 6 events in which (E1) is correctly recognized. (E4) and a part of (E3) are also realized in the same MA. However, we find no generalized event corresponding to (E2) and (E5). The reasons are as follows: (P1) The punishments are described in different forms, so that we need some inference to draw the same conclusion. (P2) A wild boar should be corresponded
to the valuable property the younger brother possesses, while the generalization of property and boar exceeds the limit of generalization in the concept dictionary. (P1) is generally very hard to solve, as it concerns the problem of interpreting the states caused by actions. Unfortunately, for the present, the authors do not have a good idea of how to solve this. Instead of attacking this issue directly, it will be more realistic to restrict documents to those involving only factual events. Regarding (P2), it will be possible to allow a group of terms that are the role-fillers of important terms in other term groups, even though the former term group needs a generalization beyond the limit. For this purpose, we plan to apply a measure of the importance or significance of terms [3]. Apart from the semantic aspect of our algorithm, we briefly discuss the problem of computational complexity. In a word, for stories of 26 events, the algorithm proceeds successfully, as shown in the following table, where Epair, OPExp and OP are the number of event pairs, the number of op-selections actually expanded, and the approximate number of all possible op-selections, respectively.
Epair    1      2        3        4         5        6         7        8         9
OPExp    39     868      5712     14916     17285    9246      2254     240       0
OP       900    189,225  16*10^6  751*10^6  20*10^9  352*10^9  4*10^12  34*10^12  ...
For documents of 48 events from the same stories as above, memory overflow occurs. However, we have already developed a method to reduce the number of possible event pairings by representing similar events in one document by a single event of the same document. Although there is no space to explain the technique, our improved algorithm succeeds in finding an MA without memory overflow. Based on the experimental results, we are now going to the next step. We suppose a document of 100 factual events obtained by some text summarization using the importance of words. After that, we calculate an MA and extend the term groups so as to relax the cost condition for terms that are related to the important terms. The authors hope that, with this plan, we can overcome the very hard issues (P1) and (P2).
References
1. R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules, Proc. of the 20th Int'l Conf. on Very Large Data Bases, pp. 478–499, 1994.
2. W.W. Cohen, H. Hirsh: The Learnability of Description Logics with Equality Constraints, Machine Learning, Vol. 17, No. 2–3, pp. 169–199, 1996.
3. Y. Ohsawa, N.E. Benson, M. Yachida: KeyGraph: automatic indexing by co-occurrence graph based on building construction metaphor, Proc. of the IEEE International Forum on Research and Technology: Advances in Digital Libraries ADL'98, pp. 12–18, 1998.
4. J. Sowa (ed.): Principles of Semantic Networks, Morgan Kaufmann, 1991.
5. Electronic Dictionary Version 2.0 Technical Guide, TR2007, Japan Electronic Dictionary Research Institute, Ltd. (EDR), 1998. http://www.iijnet.or.jp/edr/
Automatic Wrapper Generation for Multilingual Web Resources Yasuhiro Yamada1 , Daisuke Ikeda2 , and Sachio Hirokawa2 1
Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka 812-8581, Japan [email protected] 2 Computing and Communications Center, Kyushu University, Fukuoka 812-8581, Japan {daisuke, hirokawa}@cc.kyushu-u.ac.jp
Abstract. We present a wrapper generation system to extract the contents of semi-structured documents which contain instances of a record. The generation is done automatically using general assumptions on the structure of instances. The system outputs a set of pairs of left and right delimiters surrounding the instances of a field. In addition to the input documents, our system also receives a set of symbols with which a delimiter must begin or end. Our system treats semi-structured documents simply as strings, so that it does not depend on markup or natural languages. It does not require any training examples showing where the instances are. We show experimental results on both static and dynamic pages gathered from 13 Web sites, marked up in HTML or XML, and written in four natural languages. In addition to the usual contents, the generated wrappers extract useful information hidden in comments or tags which is ignored by other wrapper generation algorithms. Some generated delimiters contain whitespaces or multibyte characters.
1 Introduction
There is useful information hidden in the enormous number of pages on the Web. It is difficult, however, to extract and restructure it because these pages do not have an explicit structure as database systems do. To use pages on the Web like a database system, it is necessary to extract the contents of pages as records or fields. A wrapper is a procedure that extracts instances of records and fields from Web pages. A database consists of records, and a record consists of fields. An instance is an instantiated object of a record or field. For example, in the result pages of a typical search engine, a record is a tuple (page title, caption, URL), and a field is an element of a record. Given the enormous number of pages on the Web, it is hard to generate wrappers manually. Basically, there are three approaches to generating wrappers. The first approach is based on machine learning [6,7] using training examples. A problem of machine learning approaches is that making training examples is too costly. The second approach is to assume that input documents are only HTML and to use knowledge of HTML [2,4]. In [4], record boundaries are determined by a combination of heuristics, one of which is that a boundary is near some specific tags. This
approach does not require any training examples, but it is not applicable to other markup languages. The third approach exploits the regularity of input documents instead of background knowledge or training examples. IEPAD [3] tries to find record separators using the maximum repeat of a string. The data extraction algorithm in [8] also finds regularity in lists in input HTML files. Our system, similarly, determines common parts in the given documents and then finds delimiters on the common parts. A strength of our system is that it finds common parts only roughly and is therefore applicable to data with some irregularity. The authors developed a prototype contents extraction system, called SCOOP [9]. It is based on the very simple idea that frequent substrings of the input documents are useless and are not contents. Like other wrapper generation systems, SCOOP also has the problems described in Section 1.1. The main contribution of this paper is to propose, based on SCOOP, a fully automatic wrapper generation system without any training examples. An input for the system is a set of symbols, called enclosing symbols, and a set of semi-structured documents containing instances of a record. A generated wrapper is an LR wrapper [6,7]. We show experimental results in Section 3. Input files are HTML and XML files gathered from 13 sites, and their contents are written in four languages (Chinese, English, German, and Japanese). A generated wrapper extracts instances of fields with high accuracy. It also extracts useful information hidden in comments or tags which is ignored by other wrapper generation algorithms.
1.1 Our Contributions
Multilingual System: Although Web resources are written in many languages, many other wrapper generation systems are mono- or bilingual. Our system treats input semi-structured documents just as strings, so that it is multilingual1 in two senses: with respect to markup languages and to natural languages. In the near future, XML files will become widespread on the Web. But wrapper generation from XML files has not been considered because such files have explicit structures by nature. Since restructuring semi-structured documents is an important goal of wrapper generation, it is important to generate wrappers from XML files as well. Dynamic and Static Pages: The target of other wrapper generation algorithms is a set of dynamic pages. Dynamic pages are created automatically by database programs or search facilities. Dynamic pages ideally have completely the same template, so such pages seem easy to generate wrappers for. But, in practice, the dynamic pages of a site have some irregularities. This is one of the most difficult problems for wrapper generation systems. Since static pages usually have larger irregularities than dynamic ones, a wrapper generation system which works well for static pages can also be expected to work well for dynamic pages with some irregularities. Therefore, wrappers are important for both static and dynamic pages. SCOOP [9] can make a wrapper from such static pages, but it cannot handle dynamic pages. The presented system is good at both static and dynamic pages.
1 So is SCOOP [9], but its implementation is bilingual (English and Japanese).
The Number of Instances: In an address book, for example, some entries do not have an email address, while others have several. More generally, we must consider the case where different instances of a record have different numbers of instances of a field. In SCOOP [9], instances in a field must be instantiated from different fields; in other words, all people in the address book must have at most one email address. The presented system overcomes this problem.
2 Main Algorithm
Our wrapper generation algorithm receives a set of semi-structured documents including some instances of a record. It treats each semi-structured document just as a string. It also receives El and Er, where El and Er are sets of symbols called enclosing symbols. It outputs a set of rules extracting instances of each field. The algorithm consists of three stages: contents detection, rule extraction, and deleting and integrating rules. In the contents detection stage, it roughly divides each input string into common and uncommon parts. In the rule extraction stage, it extracts a set of rules. Roughly speaking, a rule is a pair of delimiters, called a left delimiter and a right delimiter. A left delimiter is a string ending with a symbol in El and a right delimiter is a string beginning with a symbol in Er. We define the length of a delimiter to be the number of enclosing symbols it contains. A rule is a pair (l, r) of left and right delimiters such that l and r have the same number of occurrences in each input string. In the deleting and integrating rules stage, it deletes useless rules. It is difficult to decide whether a field is useful or not, so we assume that a field is useless if less than half of the input documents have instances of it. Finally, it integrates rules extracting the same string and treats them as a single rule.
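The following short Python sketch illustrates the rule condition just described (a candidate pair of delimiters is kept only if both occur equally often in every input document) together with the extraction of the enclosed instances. The function names and calling conventions are ours, not the paper's implementation.

```python
# A minimal sketch of the rule check described above; is_rule and
# extract_instances are our own names, not the paper's.

def is_rule(left: str, right: str, docs: list[str]) -> bool:
    """A pair (left, right) is kept as a rule only if both delimiters
    occur the same number of times in every input document."""
    return all(doc.count(left) == doc.count(right) for doc in docs)

def extract_instances(left: str, right: str, doc: str) -> list[str]:
    """Extract the strings enclosed between the left and right delimiters."""
    instances, pos = [], 0
    while True:
        start = doc.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = doc.find(right, start)
        if end == -1:
            break
        instances.append(doc[start:end])
        pos = end + len(right)
    return instances
```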
2.1 Contents Detection
In this stage, our wrapper generation algorithm roughly divides each input string into two parts, common and uncommon parts. It utilizes the algorithm FindOptimal developed in [5]. Our algorithm makes full use of the fact that the uncommon parts of semi-structured documents cover the contents well [5]. In [5], it is experimentally shown that, given news articles written in English or Japanese gathered from a news site, FindOptimal extracts contents with high accuracy – more than 97%. The original FindOptimal preprocesses the given strings: it converts successive whitespaces into a single space because whitespaces are ignored when HTML files are displayed by a browser. The current version uses the given strings as they are.
2.2 Rule Extraction
In this stage, the algorithm receives a set of strings, a set of common and uncommon divisions of strings, and a set of enclosing symbols.
For each uncommon part, the algorithm finds the two enclosing symbols le and rb that cover the whole uncommon part and are nearest to it. The first candidate for a left delimiter ends with le and begins with the previous enclosing symbol. Similarly, the first candidate for a right delimiter begins with rb and ends with the next enclosing symbol. If the two candidates have different numbers of occurrences, then the algorithm increases the length of the more frequent candidate. If le (rb) is more frequent than rb (le), then it extends the left (right) candidate to the previous (next) enclosing symbol. It continues this until the numbers of occurrences of the left and right candidates (l, r) are the same. If l and r are the same string or they are a pair of corresponding start and end tags, the algorithm increases the length of both candidates and checks the number of their occurrences again.
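The candidate-growing step above can be sketched as the loop below, assuming two precomputed lists of progressively longer left and right candidates (each extended to the previous or next enclosing symbol). All names are ours, and the sketch omits the special handling of identical strings and corresponding tags.

```python
def balance(left_cands: list[str], right_cands: list[str], docs: list[str]):
    """left_cands / right_cands are candidate delimiters of increasing
    length; extend the more frequent side until both occur equally often."""
    def freq(s: str) -> int:
        # total number of occurrences over all input documents
        return sum(doc.count(s) for doc in docs)

    li = ri = 0
    while li < len(left_cands) and ri < len(right_cands):
        l, r = left_cands[li], right_cands[ri]
        fl, fr = freq(l), freq(r)
        if fl == fr:
            return l, r          # occurrence counts match: candidate rule
        if fl > fr:
            li += 1              # lengthen the left candidate
        else:
            ri += 1              # lengthen the right candidate
    return None                  # no balanced pair found
```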
2.3 Deleting and Integrating Rule
Let R be the set of candidate rules. It is necessary to delete and integrate candidates in R because some of them extract the same string and others are useless. In our setting, a rule is allowed to extract no instances of a field from some input strings. We put a restriction on a rule such that it must extract instances from more than half of the input strings; otherwise the algorithm deletes it from R. Next, it integrates candidates in R that extract the same string from each input string. For example, if two candidates extract the same string from each input string, the algorithm integrates these two candidates and treats them as one rule.
3 Experiments
We implemented the algorithm described in the previous section in Python. The input files are HTML and XML files, and their contents are written in four languages (Chinese, English, German, and Japanese). They are gathered from 13 sites (see Table 1) and the number of all gathered files is 1197. We set El to the set consisting of ">" and a whitespace (space, tab, newline characters), and Er to the set consisting of "<" and a whitespace. To evaluate the results, for each site the authors inspected some HTML/XML sources in advance and created a wrapper manually. Then we compare the two extraction results from the wrappers created manually and automatically. Table 2 and Table 4 show the results for static and dynamic pages. The second column "Field (Accuracy)" gives the attribute names of the fields in the hand-coded wrappers, that is, the fields expected to be extracted, and their accuracies. Ev1 shows the number of fields which the authors overlooked when they created the wrappers manually. Ev2 shows the number of fields extracted wrongly, so we want Ev2 to be small.
Table 1. URL list of the 13 sites described in this section. The fourth column stands for the number of files. We gathered 1197 files from the sites. "Search" and "Mail" in the Type column mean that these pages are result pages of search engines and mail archives, respectively. "News" pages are gathered from online news sites. "Manual" stands for online manual pages. "Database" means that we got data from some public database. "kyushu-u" is now under construction and not public yet.

ID           URL                                        Language  #    Type
HTML
altavista    http://www.altavista.com/                  English   17   Search
freebsd      http://docs.freebsd.org/mail/              English   49   Mail
ftd          http://www.ftd.de/                         German    101  News
java         http://java.sun.com/j2se/1.3/docs/         English   30   Manual
lycos        http://www.lycos.com/                      English   50   Search
peopledaily  http://www.peopledaily.co.jp/              Chinese   127  News
redhat       http://www.redhat.com/mailing-lists/       English   50   Mail
reuters      http://www.reuters.de/                     German    50   News
sankei       http://www.sankei.co.jp/main.htm           Japanese  108  News
yahoo        http://www.yahoo.com/                      English   45   Search
XML
kyushu-u     –                                          Japanese  50   Database
mainichi     http://www.mainichi.co.jp/digital/newsml/  Japanese  470  News
sigmod       http://www.acm.org/sigmod/record/xml/      English   50   Database

3.1 Static Pages
As described in Section 1.1, most other wrapper generation algorithms assume that input documents are created dynamically. Such dynamic pages are created by filling a template, so the common parts created by one template are completely the same. Hence, it is more difficult to create wrappers from static pages than from such dynamic ones. Table 2 shows the results for static pages; our algorithm works well for such pages. Table 3 shows the wrapper created for "mainichi", which is a set of XML files. We can see that the tags in the table are completely different from those of HTML. Our algorithm finds rules for two "date" and two "keyword" fields. We can see in the rules whitespaces which are there just for the readability of the XML sources. In [5,9], successive whitespaces are compressed into a space, so SCOOP in [9] cannot find such a rule. The algorithm fails to find a rule for the "Body text" field. A body text in "mainichi" is in between "\n" and "\n". They are the same, so the system tries to find delimiters with longer lengths. However, the right delimiter is followed by a date, which is variable, so the system fails to find a good right delimiter. The other 0% fields in Table 2 are due to similar problems. Our system succeeds in finding the field "second headline" from "sankei" although the number of its instances varies:
Table 2. Results of static pages

ID           Field (Accuracy)                                                                                      Ev1  Ev2
ftd          page title (0%), headline (100%), summary of article (100%), body (100%)                              6    5
java         classname (90%), date (100%), return (100%), body (100%)                                              1    2
mainichi     headline (100%), date (100%), keyword (100%), body (0%), related word (100%), other headline (100%)   1    0
peopledaily  page title (100%), date (100%), headline (98.4%), body (99.2%)                                        4    2
reuters      headline (100%), date (100%), body (100%)                                                             4    0
sankei       headline (100%), second headline (100%), body (100%)                                                  3    0
Table 3. A part of wrapper created by our system from "mainichi" (Wrapper / Field)
\n\t\t\t\t Date1 \n\t\t\t\t
( Date2 )
\n\t\t\t\t Keyword \n\t\t\t\t
<midasi>\n\t\t\t\t\t\t Keyword2 \n\t\t\t\t\t\t
Body text: can not extract rule
26 files have no instances of "second headline"2, 21 files have two instances, 2 files have three instances, and the other files have one instance. Our system does not mind the number of instances of a field (see Section 1.1). From this data set, the system finds a rule whose left delimiter contains a multibyte character, which is used as the header symbol for a headline. Some useful contents hidden in meta tags or comments are also extracted. An article in "ftd" contains a brief summary of the article in a meta tag, and the date of the article in a comment tag. In Table 2, there are some fields whose accuracies are high but not perfect. The reason for the partial failure lies in the input files. We assume that instances of a field are surrounded by the same pair of strings, but sometimes there are files in which instances are surrounded by other strings. For example, most instances of a field are followed by “
2 In these files, there exist no delimiters for the second headline.
3.2 Dynamic Pages
A typical dynamic page is a search result. We select three major search engines and two mail archives (see Table 4). A search result contains the title of a found page, its URL, and a brief description of the page. A typical page of a mail archive contains the body of a found mail, its subject, and some mail headers. Table 4 shows that the presented algorithm treats such pages well, although SCOOP [9] failed to find rules for them.

Table 4. Results of dynamic pages

ID         Field (Accuracy)                                                                         Ev1  Ev2
altavista  title of page (100%), caption (100%), URL (100%)                                         7    4
lycos      title of page (100%), caption (95.5%), URL (100%)                                        5    3
yahoo      title of page (100%), caption (100%), URL (100%)                                         7    11
freebsd    Date (100%), From (100%), To (93.9%), Subject (100%), Message-ID (100%), content (100%)  0    2
redhat     Subject (100%), content (98%)                                                            3    8

We also have results for the following two databases: "sigmod" and "kyushu-u". They consist of XML files. "sigmod" is gathered from "OrdinaryIssuePage" in "ACM SIGMOD Record: XML Version." A record has the following fields: title of the article, author name(s), volume, number, year, start and end pages. All of these fields are completely found by our algorithm. It also successfully finds the unique ID number in an XML tag and the creation time in a comment. "kyushu-u" stands for a database of academics at Kyushu University. A file corresponds to an academic's record. A record contains his/her name, affiliation, major, mail address, publications, classes and so on. An XML file of this data set has tags including Japanese characters, but this is not a problem for our system. It found rules containing tags of Japanese characters.
4 Conclusion
We presented a simple wrapper generation algorithm. No additional input is necessary except the enclosing symbols, each of which is the first or last letter of a delimiter. The system is suitable for any semi-structured documents. This is due to the simplicity of our system: it treats input documents just as strings and utilizes the regularity of instances. Extraction is successful for both dynamic and static pages, while SCOOP in [9] failed to find rules from dynamic pages because SCOOP depends heavily on FindOptimal in [5] and FindOptimal fails to divide dynamic pages well into common and uncommon parts. Our system also found useful information hidden in comments and in attribute values of tags, such as meta tags. The presented experiments show that whitespaces play an important role in structuring data.
These results contrast with other wrapper generation algorithms because they ignore the inside of tags and comments, as well as whitespaces. However, the system sometimes failed to find delimiters. Typically, this happens when instances of a field are surrounded by a simple pair of corresponding tags. Such corresponding tags frequently appear in HTML documents and have nothing to do with the structure we want to extract. Therefore, we do not treat such tags as a rule. A good solution would be to use tree wrappers instead of string-based wrappers. An important future work is to combine extracted fields into one record. In [8], the same problem was discussed. However, our case is more difficult because the setting in [8] is that each file must have multiple instances of a record. Our wrapper generation system can deal with both the single and the multiple case.
References 1. N. Ashish and C. Knoblock, Wrapper Generation for Semi-structured Internet Sources, Proc. of Workshop on Management of Semistructured Data, 1997. 2. D. Buttler, L. Liu and C. Pu, A Fully Automated Object Extraction System for the World Wide Web, International Conference on Distributed Computing Systems, 2001. 3. C.-H. Chang and S.-C. Lui, IEPAD: Information Extraction Based on Pattern Discovery, Proc. of the Tenth International Conference of World Wide Web (WWW2001), pp. 4–15, 2001. 4. D. W. Embley, Y. Jiang and Y. -K. Ng, Record-Boundary Discovery in Web Documents, Proc. of ACM SIGMOD Conference, pp. 467–478, 1999. 5. D. Ikeda, Y. Yamada and S. Hirokawa, Eliminating Useless Parts in Semistructured Documents using Alternation Counts, Proc. of the Fourth International Conference on Discovery Science, Lecture Notes in Artificial Intelligence, Vol. 2226, pp. 113–127, 2001. 6. N. Kushmerick, D. S. Weld and R. B. Doorenbos, Wrapper Induction for Information Extraction, Intl. Joint Conference on Artificial Intelligence, pp. 729–737, 1997. 7. N. Kushmerick, Wrapper Induction: Efficiency and Expressiveness, Artificial Intelligence, Vol. 118, pp. 15–68, 2000. 8. K. Lerman, C. A. Knoblock and S Minton, Automatic Data Extraction from Lists and Tables in Web Sources, Adaptive Text Extraction and Mining workshop, 2001. 9. Y. Yamada, D. Ikeda and S. Hirokawa, SCOOP: A Record Extractor without Knowledge on Input, Proc. of the Fourth International Conference on Discovery Science, Lecture Notes in Artificial Intelligence, Vol. 2226, pp. 428–487, 2001.
Combining Multiple K-Nearest Neighbor Classifiers for Text Classification by Reducts
Yongguang Bao and Naohiro Ishii
Department of Intelligence and Computer Science, Nagoya Institute of Technology, Nagoya, 466-8555, Japan, {baoyg, ishii}@egg.ics.nitech.ac.jp
Abstract. The basic k-nearest neighbor classifier works well in text classification. However, improving the performance of the classifier is still attractive. Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging or Boosting, that significantly improve classifiers such as decision trees, rule learners, or neural networks. Unfortunately, these combining methods do not improve nearest neighbor classifiers. In this paper we present a new approach that generates multiple reducts based on rough set theory and applies these multiple reducts to improve the performance of the k-nearest neighbor classifier. This paper describes the proposed technique and provides experimental results.
1 Introduction
As the volume of information available on the Internet and corporate intranets continues to increase, there is a growing need for tools for finding, filtering, and managing these resources. The purpose of text classification is to classify text documents into classes automatically based on their contents; it therefore plays an important role in many information management tasks. A number of statistical text learning algorithms and machine learning techniques have been applied to text classification. These text classification algorithms have been used to automatically catalog news articles [1] and web pages [2], learn the reading interests of users [3], and sort electronic mail [4]. The basic k-nearest neighbor (kNN) classifier is one of the simplest methods for classification. It is intuitive, easy to understand, and provides good generalization accuracy for text classification. Recently, researchers have begun paying attention to combining a set of individual classifiers, also known as a multiple model or ensemble approach, with the hope of improving the overall classification accuracy. Unfortunately, many combining methods such as Bagging, Boosting, or Error Correcting Output Coding do not improve the kNN classifier at all. Alternatively, Bay [6] has proposed MFS, a method of combining kNN classifiers using multiple feature subsets. However, Bay has not described a definite way of selecting the features; in other words, MFS has to be built by trial and error. To overcome this weakness, Itqon et al. [7] use the test features instead of
the random feature subsets to combine multiple kNN classifiers. However, the complexity of computing all test features is NP-hard, and the algorithm for computing test features in [7] is infeasible when the number of features is large, as in the text classification problem. In this paper, we present RkNN, an attempt at combining multiple kNN classifiers using reducts instead of random feature subsets or test features. A reduct is the essential part of an information system that can discern all objects discernible by the original information system. Furthermore, multiple reducts can be formulated precisely and in a unified way within the framework of rough set theory. This paper proposes a hybrid technique using rough set theory to generate multiple reducts. These multiple reducts are then used to improve the performance of the k-nearest neighbor classifier.
2 Information Systems and Rough Sets
2.1 Information Systems
An information system is composed of a 4-tuple as follows: S = <U, Q, V, f>, where U is the closed universe, a finite nonempty set of N objects {x1, x2, ..., xN}; Q is a finite nonempty set of n features {q1, q2, ..., qn}; V = ∪_{q∈Q} Vq, where Vq is the domain (value set) of the feature q; and f : U × Q → V is the total decision function, called the information function, such that f(x, q) ∈ Vq for every q ∈ Q, x ∈ U. Any subset P of Q determines a binary relation on U, which will be called an indiscernibility relation, denoted by IND(P), and defined as follows: x IND(P) y if and only if f(x, a) = f(y, a) for every a ∈ P. Obviously IND(P) is an equivalence relation. The family of all equivalence classes of IND(P) will be denoted by U/IND(P) or simply U/P; an equivalence class of IND(P) containing x will be denoted by P(x) or [x]P.
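As a small illustration of U/IND(P), the following Python sketch groups objects that agree on every feature in P. The toy table is invented for the example and does not come from the paper.

```python
from collections import defaultdict

def partition(objects: list[dict], P: list[str]) -> list[list[dict]]:
    """Return U/IND(P): objects fall in the same class exactly when they
    agree on every feature in P."""
    classes = defaultdict(list)
    for x in objects:
        key = tuple(x[q] for q in P)   # (f(x, q) for every q in P)
        classes[key].append(x)
    return list(classes.values())

# toy universe with two features (hypothetical values)
U = [{'a': 1, 'b': 0}, {'a': 1, 'b': 1}, {'a': 1, 'b': 0}]
print(partition(U, ['a']))        # one equivalence class: all agree on a
print(partition(U, ['a', 'b']))   # two equivalence classes
```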
2.2 Reduct
Reduct is a fundamental concept of rough sets. A reduct is the essential part of an information system that can discern all objects discernible by the original information system. Let q ∈ Q. A feature q is dispensable in S if IND(Q − {q}) = IND(Q); otherwise the feature q is indispensable in S. If q is an indispensable feature, deleting it from S will cause S to be inconsistent; otherwise, q can be deleted from S. A set R ⊆ Q of features will be called a reduct of Q if IND(R) = IND(Q) and all features of R are indispensable in S. We denote it by RED(Q) or RED(S).
A feature reduct is a minimal subset of the condition features Q with respect to the decision features D; none of the features of any minimal subset can be eliminated without affecting the essential information. These minimal subsets can discern the decision classes with the same discriminating power as the entire set of condition features. The set of all indispensable features of Q is called the CORE of Q and denoted by CORE(Q): CORE(Q) = ∩RED(Q).
2.3 Discernibility Matrix
In this section we introduce a basic notion – the discernibility matrix – that will help us understand several properties and construct an efficient algorithm to compute reducts. By M(S) we denote an n × n matrix (cij), called the discernibility matrix of S, such that cij = {q ∈ Q : f(xi, q) ≠ f(xj, q)} for i, j = 1, 2, ..., n. Since M(S) is symmetric and cii = ∅ for i = 1, 2, ..., n, we represent M(S) only by the elements in its lower triangle, i.e. the cij with 1 ≤ j < i ≤ n. From the definition of the discernibility matrix M(S) we have the following:
Proposition 1. CORE(S) = {q ∈ Q : cij = {q} for some i, j}.
Proposition 2. Let ∅ ≠ B ⊆ Q. The following conditions are equivalent: (1) for all i, j such that cij ≠ ∅ and 1 ≤ j < i ≤ n we have B ∩ cij ≠ ∅; (2) IND(B) = IND(Q), i.e. B is a superset of a reduct of S.
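A direct Python transcription of the discernibility matrix and of Proposition 1 (the CORE as the union of singleton entries) might look as follows; the representation of objects as dictionaries is our own choice, not the paper's.

```python
def discernibility_matrix(objects: list[dict], features: list[str]):
    """c_ij = set of features on which objects i and j take different values."""
    n = len(objects)
    M = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(i):   # lower triangle only, M is symmetric
            M[i][j] = {q for q in features if objects[i][q] != objects[j][q]}
    return M

def core(M) -> set:
    """By Proposition 1, CORE is the union of all singleton entries of M."""
    return {q for row in M for c in row if len(c) == 1 for q in c}
```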
3 The Proposed System
This paper proposes a hybrid method for text classification. The approach comprises three main stages, as shown in Fig. 1. The keyword acquisition stage reads corpora of documents, locates candidate keywords, estimates their importance, and builds an intermediate dataset of high dimensionality. The feature reduction generation stage examines the dataset, removes redundancy, and generates single or multiple feature reducts. The classifier combination stage uses the multiple reducts to combine the kNN classifiers by a simple voting method.
3.1 Keyword Acquisition
Text classification aims to classify text documents into categories or classes automatically based on their content. While more and more textual information is available online, effective retrieval becomes difficult without good indexing and summarization of document content. Document classification is one solution to
Fig. 1. Data flow through the system: documents of classes 1 to n are fed into keyword acquisition, which builds a high-dimensionality dataset; feature reduction generation produces the feature reducts, which are then used for combining the kNN classifiers.
this problem. Like all classification tasks, it may be tackled either by comparing new documents with previously classified ones (distance-based techniques), or by using rule-based approaches. Perhaps the most commonly used document representation is the so-called vector space model. In the vector space model, a document is represented by a vector of words. Usually, one has a collection of documents which is represented by an M × N word-by-document matrix A, where M is the number of words and N the number of documents; each entry represents the occurrences of a word in a document, i.e., A = (ωik), where ωik is the weight of word i in document k. Since not every word appears in each document, the matrix is usually sparse and M can be very large. Hence, a major characteristic, or difficulty, of the text classification problem is the high dimensionality of the feature space. The keyword acquisition sub-system uses a set of documents as input. Firstly, words are isolated and pre-filtered to avoid very short or long keywords, or keywords that are not words (e.g. long numbers or random sequences of characters). Every word or pair of consecutive words in the text is considered to be a candidate keyword. Then, the following weighting function is used for word indexing to generate a set of keywords for each document:

ωik = log(fik + 1.0) × (1 + (1 / log N) Σ_{j=1..N} (fij / ni) log(fij / ni))
where ωik is the weight of keyword i in document k; N is the total number of documents; ni is the total number of times word i occurs in the whole collection; and fik is the frequency of keyword i in document k. Before a weighted keyword is added to the set of keywords, it passes through two filters: one is a low-pass filter removing words that are so uncommon that they are definitely not good keywords; the other is a high-pass filter that removes far too common words such as auxiliary verbs, articles et cetera. Finally, all weights are normalized before the keyword sets are output.
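A minimal Python sketch of this weighting function is given below; it assumes the collection is available as a list of per-document frequency dictionaries (our own representation) and guards against words that do not occur at all.

```python
import math

def weight(i: str, k: int, doc_freqs: list[dict]) -> float:
    """Entropy-based weight of word i in document k, following the formula
    above; doc_freqs[k][i] is f_ik (naming is ours)."""
    N = len(doc_freqs)
    n_i = sum(d.get(i, 0) for d in doc_freqs)   # total occurrences of word i
    if n_i == 0 or N < 2:
        return 0.0
    f_ik = doc_freqs[k].get(i, 0)
    entropy = sum((d.get(i, 0) / n_i) * math.log(d.get(i, 0) / n_i)
                  for d in doc_freqs if d.get(i, 0) > 0)
    return math.log(f_ik + 1.0) * (1 + entropy / math.log(N))
```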
It must be emphasized that any keyword acquisition approach may be substituted for the one described above, as long as it outputs weighted keywords.
3.2 Generation of Feature Reductions
A reduct is a minimal subset of features which has the same discernibility power as the entire set of condition features. Finding all reducts of an information system is a combinatorially NP-hard computational problem [8]. A reduct uses a minimum number of features and represents a minimal and complete rule set to classify new objects. To classify unseen objects, it is desirable that different reducts use different features as much as possible, that the union of the features in the reducts includes all indispensable features in the database, and that the number of reducts used for classification is minimal. Here, we propose a greedy algorithm to compute a set of reducts which satisfies this requirement. Our algorithm starts with the CORE features; then, through backtracking, multiple reducts are constructed using the discernibility matrix. A reduct is computed by forward stepwise selection and backward stepwise elimination based on the significance values of the features. The algorithm terminates when the union of the features in the reducts includes all the indispensable features in the database or the number of reducts is equal to the number determined by the user. Since rough sets are suited to nominal datasets, we quantise the normalized weight space into 11 values calculated by floor(10ω). Let COMP(B, ADL) denote the comparison procedure: the result of COMP(B, ADL) is 1 if B ∩ cj ≠ ∅ for each element cj of ADL, and 0 otherwise; let m be the parameter giving the number of reducts. We analyze the time complexity of generating one reduct. Let n be the total number of documents and l the total number of words in the whole collection. From [9], the complexity of computing DM is O(l · n2) and the number of elements in DM is n(n − 1)/2. From the analysis of the forward selection process, its complexity is O(l2 · log l · n2). Obviously, the complexity of backward elimination is O(l · n2), and so the complexity of generating one reduct is O(l2 · log l · n2).
3.3 Voting K-Nearest Neighbor Algorithm
For an unknown object u, the classification process of the kNN classifier can be presented as follows: calculate the distances between the unknown object u and all training objects, choose the k nearest neighbor objects, and vote for a class. Each class cj (j = 1, 2, ..., l) obtains a score as follows:

rankcj(u) = ( Σ_{i=n1..nk} sim(d, di) × δ(cj, yi) ) / ( Σ_{i=n1..nk} sim(d, di) ),   where δ(cj, yi) = 1 if cj = yi and 0 if cj ≠ yi.

The classification result for u is the set of classes that satisfy the condition rankcj(u) ≥ θ, where θ is a ranking threshold determined by the user.
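The voting step can be sketched in Python as follows; the representation of the training data as (vector, label) pairs and the similarity function sim are assumptions of the sketch, not the authors' implementation.

```python
def rank(u, training, k, sim):
    """Score each class by the similarity-weighted votes of u's k nearest
    neighbours; training is a list of (vector, label) pairs (our naming)."""
    neighbours = sorted(training, key=lambda t: sim(u, t[0]), reverse=True)[:k]
    total = sum(sim(u, d) for d, _ in neighbours) or 1.0
    scores = {}
    for d, y in neighbours:
        scores[y] = scores.get(y, 0.0) + sim(u, d) / total
    return scores   # classes with score >= theta are returned as the result
```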
Step 1: Create the discernibility matrix DM := [Cij]; CORE = ∪{c ∈ DM : card(c) = 1}; i = 1;
Step 2: While (i ≤ m) do begin
    REDU = CORE; DL = DM − REDU;
    /* forward selection */
    While (DL ≠ ∅) do begin
        Compute the frequency value for each feature q ∈ Q − ∪REDUi;
        Select the feature q with maximum frequency value and add it to REDU;
        Delete the elements dl of DL with q ∈ dl from DL;
    End
    /* backward elimination */
    N = card(REDU − CORE);
    For j = 0 to N − 1 do begin
        Remove aj ∈ REDU − CORE from REDU;
        If COMP(REDU, DM) = 0 Then add aj back to REDU;
    End
    REDUi = REDU; i = i + 1;
End
Fig. 2. Algorithm: Generate Reductions
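A compact Python paraphrase of one pass of the algorithm in Fig. 2 is given below; it works on the nonempty entries of the discernibility matrix and uses frequency counts for forward selection, as described above. The data structures are our own simplification, not the authors' code.

```python
def one_reduct(DM_entries: list, features: list, core: set) -> set:
    """One pass of the reduct construction: start from CORE, add features by
    frequency until every matrix entry is hit, then prune redundant ones."""
    redu = set(core)
    remaining = [c for c in DM_entries if c and not (c & redu)]
    while remaining:                                    # forward selection
        freq = {q: sum(q in c for c in remaining) for q in features}
        best = max(freq, key=freq.get)
        if freq[best] == 0:
            break
        redu.add(best)
        remaining = [c for c in remaining if best not in c]
    for q in sorted(redu - set(core)):                  # backward elimination
        if all((redu - {q}) & c for c in DM_entries if c):
            redu.discard(q)                             # q was redundant
    return redu
```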
The basic framework of the proposed RkNN is illustrated in Figure 3. Suppose that we obtain a set of m reducts {R1, R2, ..., Rm}. RkNN is a combination of m kNN classifiers, one for each reduct. The l-th individual classifier computes the distances and the voting scores, resulting in rankcj(u, Rl). The local scores are then collected into the total score trankcj(u):

trankcj(u) = (1/m) Σ_{i=1..m} rankcj(u, Ri)
Finally, the classes that satisfy the condition of trankcj (u) ≥ θ are the final classification result.
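Combining the per-reduct scores is then a simple average followed by thresholding, as in this sketch; rank_on_reduct is a placeholder for the per-reduct scoring function of the previous section, not a name from the paper.

```python
def combined_rank(u, reducts, rank_on_reduct, theta):
    """Average rank_cj(u, R_i) over the m reducts and keep the classes
    whose total score reaches the threshold theta."""
    totals = {}
    for R in reducts:
        for c, s in rank_on_reduct(u, R).items():
            totals[c] = totals.get(c, 0.0) + s
    m = len(reducts)
    return [c for c, s in totals.items() if s / m >= theta]
```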
4 Experiments and Results
To evaluate the efficiency of our hybrid algorithm, we compared it with the simple kNN algorithm. And to test the efficiency of multiple reducts, we ran our algorithm using one reduct and 5 reducts respectively. The Reuters collection is used in our experiment; it is publicly available at http://www.research.att.com/~lewis/reuters21578.html. The documents in the Reuters collection were collected from the Reuters newswire in 1987. To divide the collection into a training and a test set, the modified Apte ("ModApte") split has been most frequently used. According to the documentation, using the ModApte split leads to a training set consisting of 9603 stories, and a test set consisting of 3299 stories. 135 different categories have been assigned to the Reuters documents.
Fig. 3. Basic framework of RkNN: from the original features F = {f1, f2, ..., fn}, m reducts R1, ..., Rm are generated; a kNN classifier is run using each reduct, and the individual outputs are combined into the final classification decision.
Five different corpora of the Reuters collection were used in our experiment. They are: Cocoa: training/test data number is 41/15. Copper: training/test data number is 31/17. Cpi: training/test data number is 45/26. Rubber: training/test data number is 29/12. Gnp: training/test data number is 49/34. In our experiment, we set the parameters k = 30 and θ = 0.25. The measures of classification effectiveness used here are the micro-averaging of recall, precision, and accuracy.
Table 1. Comparison of RkNN with kNN
Table 1 shows the experimental result. The "Data sets" column shows the data set used in the experiment. "2-class" means the data includes cocoa and copper; "3-class" includes cocoa, copper and cpi; "4-class" includes cocoa, copper, cpi and gnp; "5-class" includes cocoa, copper, cpi, gnp and rubber. The "kNN" column gives the result using the basic k-nearest neighbor algorithm, the "RkNN-1" column the result using only one reduct for the k-nearest neighbor classifier, and the "RkNN-5" column the result of combining the k-nearest neighbor classifiers by 5 reducts. As can be seen
from Table 1, RkNN-1 almost maintains the accuracy of kNN, and the combination of kNN classifiers by 5 reducts can improve the performance of kNN. In the case of "5-class", after preprocessing (removing tags, removing stopwords, performing word stemming), we get 2049 keywords. Through the generation of 5 reducts, we get a total of 87 keywords, which is just 1/23 of all keywords. When we use these 5 reducts to combine the kNN classifiers, the classification time is reduced to 1/15 of that of kNN. What may be concluded from the above is that the hybrid method developed in this paper is an efficient and robust text classifier.
5 Conclusions
In this paper, a hybrid method for text classification has been presented which uses multiple reducts to combine kNN classifiers. A reduct is the essential part of an information system that can discern all objects discernible by the original information system. Multiple reducts can be formulated precisely and in a unified way within the framework of rough set theory. These multiple reducts are used to improve the performance of the k-nearest neighbor classifier. Experimental results show that using several reducts yields a substantial improvement. The system is still in its early stages of research. To further improve the accuracy of the system, computing all reducts is in progress.
References
1. T. Joachims, "Text Classification with Support Vector Machines: Learning with Many Relevant Features", ECML-98, 10th European Conference on Machine Learning, 1998, pp. 170-178.
2. M. Craven, D. Dipasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam & S. Slattery, "Learning to Extract Symbolic Knowledge from the World Wide Web", Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998, pp. 509-516.
3. K. Lang, "Newsweeder: Learning to Filter Netnews", Machine Learning: Proceedings of the Twelfth International Conference (ICML95), 1995, pp. 331-339.
4. Y. Yang, "An Evaluation of Statistical Approaches to Text Classification", Journal of Information Retrieval, 1, 1999, pp. 69-90.
5. A. Skowron & C. Rauszer, "The Discernibility Matrices and Functions in Information Systems", in R. Slowinski (ed.) Intelligent Decision Support - Handbook of Application and Advances of Rough Sets Theory, Kluwer Academic Publishers, Dordrecht, 1992, pp. 331-362.
6. S. D. Bay, "Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets", Intelligent Data Analysis, 3(3), 1999, pp. 191-209.
7. Itqon, S. Kaneko & S. Igarashi, "Combining Multiple k-Nearest Neighbor Classifiers Using Feature Combinations", Journal IECI, 2(3), 2000, pp. 23-319.
8. Y. Bao, S. Aoyama, X. Du, K. Yamada & N. Ishii, "A Rough Set-Based Hybrid Method to Text Categorization", Proc. 2nd International Conf. on Web Information Systems Engineering, Kyoto, Japan, pp. 254-261, Dec. 2001.
ARISTA Causal Knowledge Discovery from Texts
John Kontos, Areti Elmaoglou, and Ioanna Malagardi
Artificial Intelligence Group, Laboratory of Cognitive Science, Department of Philosophy and History of Science, National and Capodistrian University of Athens
[email protected], [email protected]
Abstract. A method is proposed in the present paper for supporting the discovery of causal knowledge by finding causal sentences from a text and chaining them by the operation of our system. The operation of our system called ACkdT relies on the search for sentences containing appropriate natural language phrases. The system consists of two main subsystems. The first subsystem achieves the extraction of knowledge from individual sentences that is similar to traditional information extraction from texts while the second subsystem is based on a causal reasoning process that generates new knowledge by combining knowledge extracted by the first subsystem. In order to speed up the whole knowledge acquisition process a search algorithm is applied on a table of combinations of keywords characterizing the sentences of the text. Our knowledge discovery method is based on the use of our knowledge representation independent method ARISTA that accomplishes causal reasoning “on the fly” directly from text. The application of the method is demonstrated by the use of two examples. The first example concerns pneumonology and is found in a textbook and the second concerns cell apoptosis and is compiled from a collection of MEDLINE paper abstracts related to the recent proposal of a mathematical model of apoptosis.
1 Introduction
A method is proposed in the present paper for discovering causal knowledge by finding the appropriate causal sentences from a text and chaining them by the operation of our system for knowledge discovery from texts. The processing of natural language texts for knowledge acquisition was first presented in [1] by the use of the new representation independent method called ARISTA. This method achieves causal knowledge mining "on the fly" through deductive reasoning performed by the system in response to a user's question and is further elaborated in [2]. Our method is an alternative to the traditional "pipeline" method, recent applications of which are presented in [3] for mining and [4], [5] for information extraction. The main advantage of the ARISTA method is that texts are not translated into any representation formalism and therefore retranslation is avoided whenever new linguistic or extra-linguistic prerequisite knowledge has to be used for improved text understanding. The operation of the system relies on the search for causal chains, which in turn relies on the search for sentences containing appropriate natural language phrases. The system for knowledge discovery from texts we are developing, called ACkdT, consists of two main subsystems. The first subsystem achieves the extraction of knowledge from individual
sentences that is similar to traditional information extraction from texts [7,8] while the second subsystem is based on a reasoning process that generates new knowledge by combining “on the fly” knowledge extracted by the first subsystem but without the use of a template representation. Our knowledge discovery process relies on the search for causal chains that in turn relies on the search for sentences containing appropriate natural language phrases. In order to speed up the whole knowledge acquisition process the search algorithm proposed in [6] may be used for finding the appropriate sentences for chaining. The increase in speed results because the repeated sentence search is made a function of the number of words in the connecting phrases. This number is usually smaller than the number of sentences of the text that may be arbitrarily large.
2 A First Example of Knowledge Discovery from a Scientific Text
An example text that is an extract from a medical physiology book [9] in the domain of pneumonology, and in particular of lung mechanics, enhanced by a few general knowledge sentences, is used as a first illustrative example of knowledge discovery from texts. Our system is able to answer questions from that text that require the chaining of causal knowledge acquired from this text and produce answers that are not explicitly stated in the input texts. The example text contains the following sentences:
1. The decrease of the concentration of surfactant increases the surface tension.
2. Elastic forces rise is caused by volume reduction.
3. The alveolar pressure rise forces air out of the lungs.
4. The alveolar pressure rise is caused by elastic forces.
5. Elastic forces include elastic forces caused by surface tension.
6. Elastic forces caused by surface tension increase as the alveoli become smaller.
7. As the alveoli become smaller, the concentration of surfactant increases.
8. The increase of the concentration of surfactant reduces the surface tension.
9. The reduction of the surface tension opposes the collapse of the alveoli.
The knowledge contained in texts like the one above concerns the causal relations between "processes" such as reduction, rise, increase and collapse that apply to "entities" such as elastic forces, alveolar pressure, surface tension, lungs and alveoli. If the user submits the question: "What process of alveoli forces air out of the lungs?" then the answer "the process in which the alveoli become smaller" is produced automatically by our system after searching the text and discovering the proper causal chain that consists of the sentences 3 to 6. The chain of reasoning for answering this question can be symbolized as below:
a.b.s. → c.s.t. ⊆ e.f. → a.p.r. → a.o.o.l., where
a.b.s. = alveoli become smaller
c.s.t. = elastic forces caused by surface tension
e.f. = elastic forces
a.p.r. = alveolar pressure rise
a.o.o.l. = air out of the lungs
If the user submits the question: "What process of alveoli opposes collapse of alveoli?" then the system gives again the answer "the process in which the alveoli become smaller" but using a different causal chain consisting of the sentences 6 to 9. The processing of the second question requires the definition of “causal polarity” for the proper treatment of verbs like “opposes”. Positive and negative causal polarities have been defined as "+cause" and "-cause" respectively. Both causal chains that are discovered for answering the above two questions have the same starting point i.e. the phrase “the alveoli become smaller” but different end points shown in italic letters.
3 The Combination of Sentences for Knowledge Discovery
Let us consider the answering of the question of the user: "What process of alveoli causes flow of lungs air?" The explanation generated automatically by the system illustrates some combinations:
alveoli become smaller causes increase of elastic forces because
surface tension elastic forces is a kind of elastic forces and
alveoli become smaller causes increase of surface tension elastic forces
alveoli become smaller causes rise of alveolar pressure because
alveoli become smaller causes increase of elastic forces and
elastic forces causes rise of alveolar pressure
alveoli become smaller causes flow of lungs air because
alveoli become smaller causes rise of alveolar pressure and
rise of alveolar pressure causes flow of lungs air
The sentence "The reduction of the surface tension opposes the collapse of the alveoli" uses the verb "opposes", which bears a negative meaning that must be taken into account. After defining the predicates "+cause" and "-cause" for positive and negative relations respectively, the answer to the question: "What process of alveoli opposes collapse of alveoli?" is again "become smaller", while the explanation generated automatically is:
alveoli become smaller +causes reduces of surface tension because
alveoli become smaller +causes increase of surfactant concentration and
increase of surfactant concentration +causes reduces of surface tension
alveoli become smaller -causes collapse of alveoli because
alveoli become smaller +causes reduction of surface tension and
reduction of surface tension -causes collapse of alveoli
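A toy Python sketch of this kind of chaining with causal polarity is shown below; the relation tuples are our shorthand for sentences 7–9 of the example text, not the system's internal representation, and polarities are multiplied along a chain so that a single negative link makes the overall relation negative.

```python
# Toy causal relations paraphrasing sentences 7-9 (our shorthand).
RELATIONS = [
    ('alveoli become smaller', '+', 'increase of surfactant concentration'),
    ('increase of surfactant concentration', '+', 'reduction of surface tension'),
    ('reduction of surface tension', '-', 'collapse of alveoli'),
]

def chains(start, goal, rels, path=None, sign=1):
    """Yield (path, overall polarity); polarities multiply along a chain."""
    path = path or [start]
    for a, s, b in rels:
        if a == start and b not in path:
            new_sign = sign * (1 if s == '+' else -1)
            if b == goal:
                yield path + [b], new_sign
            else:
                yield from chains(b, goal, rels, path + [b], new_sign)

for p, s in chains('alveoli become smaller', 'collapse of alveoli', RELATIONS):
    print(' -> '.join(p), ': overall polarity', '+' if s > 0 else '-')
```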
4 The Search Algorithm
The search algorithm consists of two modules. The first consists of an algorithm of organization of the data and the second consists of the main algorithm. A few of the
words are chosen as keywords that characterize each sentence [6]. The words used as keywords, namely "rise", "tension", "elastic" and "smaller", are denoted by k1, k2, k3, k4, and ordered as k3, k2, k4, k1. The set of keyword combinations together with their occurrence in either the logical LHS (left hand side) or RHS (right hand side) of a sentence is given in Table 1. The logical LHS corresponds to the "cause", the RHS corresponds to the "effect", and the numbers refer to the sentences. The table T that results from our organization algorithm is given as Table 2.
Table 1. List of keyword sets
rise = k1: occurring in LHS of 3 and RHS of 4
tension = k2: occurring in LHS of 9 and RHS of 8
elastic = k3: occurring in LHS of 4 and RHS of 5
smaller = k4: occurring in LHS of 6 and RHS of 7
k3k1: occurring in LHS of none and RHS of 2
k3k2: occurring in LHS of 5 and RHS of 6
The combinations of keywords characterising the sentences of the text presented in section 2 consist of any of the four keywords “rise, tension, elastic, smaller”. The keyword combinations for each sentence are given below:
where "→" stands for the "causes" relation and "⊆" stands for the "is_a" relation.
Table 2. The organized table T
where A(Q) denotes the sentences related to the term Q. As was shown in [6], the number of direct accesses to the organised table T is smaller than or equal to the number of
keywords of a term and therefore the retrieval time is independent of the size of the table. We give below the steps of the algorithm executed for locating the sentence characterized by the keyword combination k3k2:
i=0: p=1
i=1: p=T(1+1)=T(2)=6, because i=1 and X=POS(kr1)=POS(k3)=1
redo: go to the step above, because i=1, i.e. i<m
i=2: p=T(6+1)=T(7)=A(k3k2), because i=2, i.e. i=m, and X=DIST(k2,k3)=1
return: location of sentences 5 and 6, because p is not a pointer, and STOP.
where POS(ki) is the position of ki in an imposed ordering, expressed as an integer from 1 to m, m is the number of keywords ki (0 < i < m+1), and DIST(ki, kj) = POS(ki) − POS(kj) is the distance of the keywords ki, kj with respect to the keyword ordering. In this example the number of accesses to table T is only 2, while for a serial search in the table of sentences the number of accesses would be at least 5. For answering the first question on the pneumonology domain text, the ratio between the two cases is about 3 to 1 in favour of the proposed algorithm.
5 A Second Example of Causal Knowledge Discovery from Texts
The second example text is compiled from the MEDLINE abstracts of the papers used by [10] as references, which amount to 73 items. Most of these papers are used in [10] to support the discovery of a quantitative model of protein oscillations related to cell apoptosis. This model is constructed as a set of differential equations. We are aiming at automating part of such a cognitive process with our system. An illustrative subset of the sentences used in the second example is the following, where the reference numbers with which the authors of [10] refer to the papers are given in parentheses:
The p53 protein is activated by DNA damage. (23)
Expression of Mdm2 is regulated by p53. (32)
Mdm2 increase inhibits p53 activity. (17)
Using these sentences our system discovers the causal negative feedback loop:
DNA damage +causes p53 +causes Mdm2 -causes p53
where +causes means "causes increase" and -causes means "causes decrease or inhibition", by answering the question:
Is there a process loop of p53? This question is internally represented as the Prolog goal "cause(P1,p53,P2,p53,S)", where P1 and P2 are two process names that the system extracts from the texts and that characterize the behaviour of p53. S stands for the overall effect of the feedback loop found, i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to "-" since a positive causal connection is followed by a negative one. The short answer automatically generated by our system ACkdT is:
Yes. The loop is p53 activity -causes p53 production.
The long answer automatically generated by our system ACkdT is:
Using sentence 17 with inference rule IR4
since the DEFAULT process of p53 is <production>
using sentence 32
the EXPLANATION is:
since <increase> is equivalent to <expression>
p53 production -causes activity of p53 because
p53 production +causes expression of Mdm2 and
increase of Mdm2 -causes activity of p53
It should be noted that the combination of sentences (17) and (32) in a causal chain that forms a closed negative feedback loop is based on two facts of prerequisite ontological knowledge. This knowledge is inserted manually in our system as Prolog facts and can be stated as: "the DEFAULT process of p53 is 'production'", or in Prolog "default(p53,production).", and "the process 'increase' is equivalent to the process 'expression'", or in Prolog "equivalent(increase,expression).". In [12] we propose that causal knowledge discovered from texts forms the basis for modeling the systems whose study is reported in the texts analysed by our system, and we describe a new method of system modeling that utilises causal knowledge extracted from different texts. The equations describing the system model are solved with a Prolog program which receives values for its parameters from the text analysis system.
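The sign composition used to classify the loop can be sketched as follows; the three relation tuples paraphrase the example sentences, and the search itself is a simple depth-first walk rather than the system's Prolog inference.

```python
# Relations paraphrasing sentences (23), (32) and (17) (our shorthand).
RELS = [('DNA damage', '+', 'p53'), ('p53', '+', 'Mdm2'), ('Mdm2', '-', 'p53')]

def loop_sign(node, rels, max_len=5):
    """Return the overall sign of a cycle through `node`, or None if no
    cycle of length <= max_len exists; signs multiply along the cycle."""
    stack = [(node, 1, [node])]
    while stack:
        cur, sign, path = stack.pop()
        for a, s, b in rels:
            if a != cur:
                continue
            new_sign = sign * (1 if s == '+' else -1)
            if b == node and len(path) > 1:
                return new_sign                    # closed the loop
            if b not in path and len(path) < max_len:
                stack.append((b, new_sign, path + [b]))
    return None

print(loop_sign('p53', RELS))   # -1: a negative feedback loop
```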
Our final aim is to be able to model biomedical systems by integrating partial knowledge extracted from a number of different texts and to give the user a facility for questioning these models during a collaborative man-machine model discovery or diagnostic procedure. The model based question answering we are aiming at may support both biomedical researchers and medical practitioners. The above analysis of the text fragments of the second example is partially based on the following knowledge, which is also manually inserted as Prolog facts:
kind_of("the","determiner")
kind_of("p53","entity_noun")
kind_of("protein","entity_noun")
kind_of("DNA","entity_noun")
kind_of("Mdm2","entity_noun")
kind_of("is","copula")
kind_of("activated","causal_connector")
kind_of("inhibits","causal_connector")
kind_of("regulated","causal_connector")
kind_of("damage","process")
kind_of("expression","process")
kind_of("increase","process")
kind_of("activity","process")
kind_of("of","preposition")
The above knowledge base mixes general linguistic and domain dependent ontological knowledge about the words occurring in the corpus. In a practical system, of course, these two parts of knowledge will be separate and handled differently.
6 Conclusions
A method for the computer support of causal knowledge discovery is proposed in the present paper and demonstrated by the operation of the system we are developing. The system consists of two main subsystems. The first subsystem achieves the extraction of knowledge from individual sentences, which is similar to traditional information extraction from texts, while the second subsystem is based on a reasoning process that generates new knowledge by combining "on the fly" knowledge extracted by the first subsystem. The application of the proposed method is illustrated using two examples of scientific text. We are now investigating the possibility of extending our system for supporting the discovery of qualitative and quantitative dynamic models of biomedical systems from
texts as an alternative research problem to the one of constructing models from continuous data [11]. Our final aim is to be able to model biomedical systems by integrating partial knowledge extracted from a number of different texts and give the user a facility for questioning these models. The model based question answering we are aiming at may support both biomedical researchers and medical practitioners.
References 1.
Kontos, J., 1992, ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology, Vol. 34, No 9, pp.611-616. 2. Kontos, J. and Malagardi, I., 1999, Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, Vol. 26, No. 2, pp. 103-122, October. 3. Rzhetsky A. et al, 2002, GeneWays: A System for Mining Text and for Integrating Data on Molecular Pathways. http://www.bioinfo.de/isb/gcbo1/talks/rzhetsky/main.html. 4. Blaschke C. et al, 1999, Automatic extraction of biological information from scientific text: protein-protein interactions, 7th International Conference on Information Systems for Molecular Biology. 5. Humphreys, K. et al, 2000, Two Applications on Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures. Proceedings of Pacific Symposium on Biocomputing. Hawai. 6. Kontos, J. and Malagardi, I., 2001, A Search Algorithm for Knowledge Acquisition from Texts. HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens. Hellas. 7. Cowie, J., and Lehnert, W., 1996, Information Extraction. Communications of the ACM. Vol. 39, No. 1, pp. 80-91. 8. Grishman, R., 1997, Information Extraction: Techniques and Challenges. In Pazienza, M. T. Information Extraction. LNAI Tutorial. Springer, pp. 10-27. 9. Guyton, A. C., 1991, Textbook of Medical Physiology. Eighth Edition, An HBJ International Edition. W.B. Saunders. 10. Bar-Or, R. L. et al, 2000, Generation of oscillations by the p53-Mdm2 feedback loop: A theoretical and experimental study. PNAS, Vol. 97, No 21 pp. 11250-11255, October. 11. Langley P. et al, 2002, Inducing Process Models from Continuous Data. Proceedings of the Nineteenth International Conference on Machine Learning. Sydney: Morgan Kaufmann. 12. Kontos J., I. Malagardi and A. Elmaoglou, 2002, System Modeling by Computer using Biomedical Texts. 5th European System Science Congress. Workshop on System Science Methodologies in Artificial Intelligence and Cognitive Science, Heraklion, Crete, Hellas. October.
Knowledge Discovery as Applied to Music: Will Music Web Retrieval Revolutionize Musicology?
Francis Rousseaux1 and Alain Bonardi2
1 IRCAM-CNRS, 1, place Igor-Stravinsky, 75001 Paris, France, +33 1 44.78.48.19, [email protected]
2 Paris 8 University, 2, rue de la Liberté, 93526 Saint-Denis Cedex, France, +33 1 49.40.66.04, [email protected]
Abstract. The present authors share significant experience of participation in R&D projects for designing and implementing interactive systems in the field of digital music: tools for computer-based assistance in browsing sound and music data, assisted analysis and commenting systems for musical works, or assisted interactive environments for the creation of interactive virtual works. In spite of the diversity of their experience and practice, the authors aim at renewing the theoretical framework of musicology which works in the background of their creation or engineering activities, since a massive digital inscription of music allows its objects to be handled by programs, or even to be transformed into programs executable by other programs. Such a framework is only a model, which functions only if it allows us to interpret artefacts and phenomena, producing at the same time a consistent musicological language, a space for the categorisation and evaluation of realisations, as well as a number of objectives for technological deployment and theoretical evolution. But, within a research team, we also require that these models provide capitalisation structures for experience, know-how and acquired knowledge.
1 Introduction
If contemporaneous music exists, shall we consider a contemporaneous musicology? In this new formulation, the question regards the nature of the links between music and musicology during a given epoch. In this sense, the characteristics of 20th century music could suggest that a specific musicology is necessary for this period. Nevertheless, this is not exactly the theme we are trying to develop here, since we are essentially interested in the theoretical framework of the musicological language deployment. It is true that this theoretical framework contains some elements coming from contemporaneous musical practice, but our reflection goes far beyond it, aiming at an epistemological research. We shall use as a basis our own experience with projects of assisted browsing through sound and music collections, computerised analysis and commenting of musical works, or interactive creation of virtual works. It is interesting to identify and make
explicit the implicit theoretical framework which acts in the background of such projects, in order to capitalise experience and know-how. This theoretical framework does not have an aesthetic or historical nature, but is essentially musicological, in the sense that the musicological language is founded on the interpretation of all phenomena present "when we are consciously listening to music", to use Berio's formulation [2].
2 Identification of a Musicology Which Is Both Contingent and Appropriate for the Computational Reason
The theoretical framework we propose is appropriate by construction. Its elaboration is not the result of the transformation of an older framework, but comes from an interpretative and theoretical examination of constructions bearing the marks of present efforts aimed at a successful use of machines for interpretative activities, in the form of collaboration/co-operation with human users. The method of explanation here is that of reverse design: one tries to discover a model in actual use and to elicit it as our model. Even if the theoretical framework is appropriate, it is nevertheless submitted to the computational reason, as distinct from the graphical reason studied by Jack Goody [7]. This means that the framework is prescribed and determined by "digital aspects and computers", in a way which is both obvious and mysterious. Such a theoretical framework may be characterised by "attractors". The meaning of this term in physical models of motion is well known: in a global approach, the position and speed components are combined into the same vector. Depending on given situations, this new composite vector can follow regular trajectories which do not depend on initial conditions, or can present a divergent behaviour. In the framework of the definition of the theoretical basis, we may use the term to explain the stability of the chosen framework, its irreducible character, and its lack of sensitivity with respect to given historical/aesthetic initial points.

2.1 The Attractor of Material Hermeneutics
On the basis of the distance between the singularities of life due to human concerns and their correlatives in terms of "reason" or ratio (differentiation, categorisation, classification, generalisation), one tries to identify the specificity of a computational reason, opposed to the already evoked graphical reason. This approach tries to replace the apology of knowledge in the categorising thought with the study of possible technical conditions for our semiotic mediations, so that the link between thought and technology is much stronger than in other approaches.

2.1.1 The Example of Musical Listening
Musical listening prescribes its modalities, objects and targets. This activity appears as a desire to listen to something, "to listen to something else", which correlates with
either the sudden appearance of the work, imposing an immediate listening, or with the determination of a similarity immediately specifying a difference as an alterity in an analogical relationship with the past. So, our basic principle is that listening is a desire to listen to something even more (to let the experience go on), but also, paradoxically, to listen to something else (to get a new object allowing the experience to persist). The fact that continuity is required generates a need for interruption; the succession (a cognitive difference) determines and prescribes a variation (a difference of type). The need for a consistent succession can be reduced to the specification of an alterity in a similar or analogical relationship with the previous matter. The deployment of listening towards its ideal implies the construction of a musical sequence, in the mode of an elective affinity, which is always critical. Thanks to musical records, immediately attainable via access systems and restitution devices, listening means composing a preferred sequence. Therefore, the question of listening is connected with the question of description and categorisation (cf. Deleuze in Difference and Repetition [5] or Husserl [8] in his Lessons on the Intimate Consciousness of Time, presenting the notion of a retention-protension dyad). As such, it is conditioned by the technological means.

2.1.2 The Example of Musical Creation on Digital Support
Understanding a musical composition on digital support is inscribed in the same logic as the one we have already presented, because if listening contains a composition, composition obviously contains a form of listening. The need is especially that of a consistent succession, in a form elaborated between « permanence and variation », a term which Schaeffer could use in response to Deleuze's « difference and repetition ». Computer-assisted composition tools allow us not only to mobilise immediately and bring together some records, but, moreover, to evaluate immediately the consequences of artistic choices. The combinatorial power of these systems allows us to get very rapidly a representation of a musical idea, and therefore to compare the potential of groups of close ideas which could imply a difficult choice for the composer. The traditional time difference between musical conception and realisation is practically abolished, since the sound result can be simulated immediately after the conception.
2.2 The Attractor of the « Ontologies »
The ontological approach is relational and focuses on the notion of knowledge. It was founded on the "Knowledge Level" assumption enunciated for the first time by Allen Newell in 1982. In this assumption, knowledge plays an essential part since it is on the one hand a set of data handled by machines through semantic networks (handling can ultimately be reduced to calculation carried out by a layer-architecture system characterising von Neumann computers), and on the other hand the key place of human action simulation, depending on the following principle of rationality: "tell me what you want, what you know, what you can do, and I will tell you what you have to do". Basically, this assumption reduces thought to knowledge processing accompanied by a
circuit through knowledge networks, and learning comprises operations acting on the network: edit, rebuild, deploy. It is obvious that the ontological theory strongly assumes the possibility of reducing any interpretation to a logical processing of reified meaning units, and this processing would model in some way the contextual meaning. Thus, ontologies claim to separate two difficulties related to thought: on the one hand the choice and the constitution of thought objects, on the other hand the movement of these objects in the semiotic backgrounds. By proposing knowledge models which structure some reified elements, ontologies lead to thinking of the objects of thought as available off the shelf, and open the possibility of building a meaning from the artificial confrontation of such structures: therefore, the ontological theory agrees perfectly with the virtualities offered on the Web, where such parallels are proposed at every moment and procure something like a cheap form of inter-subjectivity. This approach is assumed by the knowledge elicitation research community. We shall not deny the heuristic value of the "Knowledge Level" assumption, because it allows us to design natively co-operative man-machine systems at this "knowledge level". We have yet to solve the difficult question of eliciting these ontologies, which represents a true paradigm for modelling through elicitation. There is also the problematic question of their dynamic evolution and, finally, the discussion of their cognitive pertinence (the mystery of an intelligent approach to such systems by their users). An example of an ontology in the domain of sound objects is the sound description in the reduced listening mode, proposed by Pierre Schaeffer in his Treatise on Musical Objects [13].
2.3 The Attractor of Algorithmic Calculation
On the basis of computer power, we search for calculable descriptor forms which can define object classes with an interesting interpretation, giving way to potential novel applications. This approach is typical of the practices of the "Moving Picture Experts Group" (MPEG) consortium. The calculation approach has an atomic character and focuses on quantity. It is based on an evaluation of the current state of science and of (algorithmic) calculation techniques, as well as on a kind of market intuition. Since the question is about sound and music, it could be stated as follows: is it possible to find algorithms applicable to digital signals in order to extract from them some values, which could be called descriptor attributes of the signals and assigned mnemonic names in order to represent the signal within a particular application? A descriptor contains a name, a retrieval algorithm, and a list of potential applications increasing its value on the market. This approach is based on the idea that digital inscription is excessive; the result must be "refined" until specific products, responding to specific needs, are obtained. The heuristic character of this approach can open new interpretation fields, prescribed by computer power. Nevertheless, everybody can feel that the approach is due to a
mere purpose of standardisation. This idea assimilates the translation of the problem to an actual solution. The "actual duration" of a sound signal is an example of a psycho-acoustic descriptor. It is the evaluation of the duration during which the signal is meaningful in the perceptual plane. It is calculated on the basis of an energy envelope threshold whose value is proposed by psycho-acoustic studies, and its implementation allows one, for instance, to discriminate between a percussive sound and a sustained sound. This attribute has been promoted as a descriptor by the MPEG-7 standardisation consortium.
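To make this kind of descriptor concrete, the following sketch (in Python) computes an "effective duration" by thresholding the short-time energy envelope of a sampled signal. The frame length, hop size and the -20 dB threshold are illustrative assumptions, not the values fixed by the MPEG-7 specification.

import numpy as np

def effective_duration(signal, sr, frame=1024, hop=512, threshold_db=-20.0):
    # Rough "effective duration": time during which the short-time energy
    # envelope stays above a threshold relative to its peak.
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    if energy.size == 0 or energy.max() == 0:
        return 0.0
    threshold = energy.max() * (10.0 ** (threshold_db / 10.0))
    active = energy >= threshold
    return active.sum() * hop / sr

# A percussive (quickly decaying) sound yields a much shorter effective
# duration than a sustained sound of the same nominal length.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
percussive = np.exp(-30 * t) * np.sin(2 * np.pi * 440 * t)
sustained = np.sin(2 * np.pi * 440 * t)
print(effective_duration(percussive, sr), effective_duration(sustained, sr))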
3 Validity, Expressive Power, and Genericity of the Framework
First, we shall show, via a few examples, that the theoretical framework is spontaneously used in many activities, including the analysis of a contemporary piece of music or the automatic generation of formally constrained musical programs.

3.1 The Question of the Validity of the Theoretical Framework
Let us consider the piece Jupiter composed by Philippe Manoury (1987), written for solo flute and real-time electronics. In this work, the composer anticipates rhythmic interpolation processes from the capture of the onset times of the note events played by the flutist, which are transformed by calculation (progressive interpolation) until a given rhythm is attained. The calculated result is separated in time from the capture: the computer plays these sequences several sections later, after having "listened" to the flutist. The basis of this compositional approach is both technological and musical:
- technical capability of capture;
- storage of events;
- real-time calculation of musical entities without any a priori musical notation, because only the possible conditions are defined before the performance;
- real-time dual modality of the follow-up of musical events, i.e. waiting for and immediately recognising figures emerging from the depth, while triggering the real-time calculations of the previous point.
The calculation approach accounts for all algorithmic processes implemented by Manoury. Computational reason represents the inscription of phenomena in the form of acoustic and musical processing, as well as in the traditional graphical form. The framework of this new chamber music can be explained by the ontologies of man-machine interaction in a situation of musical play. In the case of constrained musical program generators, the calculation approach allows us to extract new stylistic characteristics from the audio records and symbolic files implemented. Besides, the design of such a system necessarily includes a computer support for its dynamics. Finally, the whole is based on the ontologies of the target repertoires and, even more, on musical styles.
3.2 The Question of a Canonical Character
Supposing that we have at our disposal a theoretical framework, both contingent and suitable for computational reason, and also operational and effective as a model, we have to decide on its canonical character: can we reduce this theoretical framework to musicology's traditional framework? This question is important: why, in fact, should we leave the traditional framework? If some persons are necessarily prone to play with these new and difficult concepts, are we not able ultimately to demonstrate that the modelling they produce can be brought back to the preceding framework via a given transformation, whose validity has been clearly established? To answer this question, we shall show the aporias of the theoretical framework traditionally supporting the musicological language, and the irreducible inadequacies making it definitely unable to say anything pertinent about the musicological objects which concern us.
4 Aporias and Inadequacies of the Theoretical Framework Supporting the Traditional Musicological Language
4.1 The Triad of the Traditional Theoretical Framework
It seems that three attractors may be distinguished within the traditional framework:
- The hagiography of the composer or interpreter, presented as an outstanding creator; we have to know and understand his/her origins, life conditions and psychology. Understanding the musical phenomenon is thus connected to understanding a person who composes or plays music. Many positions may be envisaged, from total disjunction, as in Boucourechliev's assertion («everybody may be in love, but Tristan is due to a single person», [3]), to a supposed parallelism, following the track of all cognitive science research on human creativity.
- Organology, born together with a technical perspective where mechanics plays the first role. This is the reason why musical instruments are present in museum collections, such as that of the "Arts et Métiers" in Paris, reflecting the fantastic development of invention in the mechanical field until the second half of the 20th century. Then the very conditions of possible music, at a material level, were envisaged, including all their consequences.
- The attractor of graphical reason and musical theories, from Antiquity until now, which tackles a large number of musical composition parameters, from the constitution of scales, harmony and counterpoint to orchestration, while presenting as a «law» the great model scores of the past. One has to acknowledge the contribution of material inscription to all these theories, and particularly the appropriateness of the support: the two-dimensional plane sheet on which are written x/y scores where pitch (proportional notation: the higher the pitch, the greater the number of necessary new lines) and duration (algebraic notation, where there is a fixed ratio between abstract entities: so, it is said that one crotchet is equal to two quavers) are
marked. Therefore, we find here a graphical reason, which stands out as the basis of all possible theories. Hugues Dufourt [6] upholds that old Greek music, following the example of philosophy, searched for «the one within the multiple, the invariable within the change», whilst Western polyphony and the modern form of mind «require a deployment of time, then a deployment of space». This difference is perfectly understood when examining musical notation: the old Greek notation is only an alphabet expressing pitch, denying time while occupying all of its one-dimensional support. Western notation takes into account the full structure of its support plane, as well as that of two-dimensional forms of thought, opening the way to polyphony.
4.2 Aporias of the Traditional Theoretical Framework
This theoretical framework (based on the three attractors: an "inspired artist", a "mechanical organology", and a "theory founded on graphical reason") allowed the deployment of the musicological language in the West. The three attractors constitute a consistent and unavoidable basis, used by both music teaching and interpretative reflection (or a major part of musical thought). For example, the analysis of Jupiter by Philippe Manoury turns out to be problematic in the traditional theoretical framework, not to mention that Manoury at once displaces the hagiographic attractor by asserting the re-usability of his works (as a MAX/MSP library named PMA-LIB, distributed to all Ircam Forum users) – he himself partially reused elements and programs already developed. There is in digital music a culture of something "already present and ready to be used", which also inspires the composers. The organological attractor may imply that computers are musical instruments, with sensors and actuators, but how could we explain within this framework the deterministic yet undetermined character of note generators? The theoretical attractor is even less able to maintain the graphical reason, and Manoury's notation choices in his successive scores clearly show this: if the composer tries to translate the electronic "parts" into traditional notation (by writing the result of some operators such as the harmonizer), the success is only partial, because processing through, e.g., the frequency shifter acts while leaving no possibility of inscription in the score plane (there is a twofold movement of an infinity of sound partials), and this leads to a notation containing only operator names, as in the Neptune score (1991).
5 Conclusion
We have submitted a new theoretical framework to the discussion; its only merit is that it seems to serve spontaneously (although often implicitly) as a model for researchers and developers immersed in the world of computational reason and digital music. This framework is in itself very problematic, because each attractor contributing to its status as a modelling field raises difficult questions.
However, this new framework can hardly be reduced to the traditional musicological framework, in the same way as computational reason cannot be reduced to a mere graphical reason. We must choose between two attitudes. The first is a hard struggle to mask the "aporia" by introducing computers which will submerge each attractor of the old theoretical framework in a bath of "new technologies"; only a few musicologists, formed by the graphical reason, attempt to stand up to this temptation. The other attitude would be to constitute a musicology suitable for computational reason, but free of a blind dependency on it. One may think that this task is quite extensive, decree that it requires a full epistemological understanding of computational reason (as distinct from graphical reason), and be patient. It will also require a new statement about the essential role of the body in music. But we must also assert that both questions are intimately connected with each other, so that musicology could contribute to clearing up one of the most enthralling questions at the beginning of this century.
References
1. BONARDI, A., Analyse hypermédia de Jupiter de Philippe Manoury, mise en ligne à la médiathèque de l'Ircam, 2000.
2. BERIO, L., Intervista sulla musica, a cura di Rossana Dalmonte, Rome, Laterza, 1983.
3. BOUCOURECHLIEV, A., Le langage musical, Fayard, 1993.
4. CHION, M., Guide des objets sonores, Buchet/Chastel, 1983.
5. DELEUZE, G., Différence et répétition, Paris, Presses Universitaires de France, 1968.
6. DUFOURT, H., Musical creativity, European Society for the Cognitive Sciences of Music, Liège (Belgium), 5-8 April 2002.
7. GOODY, J., The Interface Between the Written and the Oral, New York, Cambridge University Press, 1987.
8. HUSSERL, E., Leçons sur la conscience intime du temps, Paris, Presses Universitaires de France, 1996.
9. NEWELL, A., The Knowledge Level, Journal of Artificial Intelligence 18, 1982.
10. PACHET, F., A Taxonomy of Musical Genres, Actes de la Conférence RIAO'2000, Centre de Hautes Etudes Documentaires, Paris, pp. 1238-1245, 2000.
11. RASTIER, F., Sémantique pour l'analyse, Masson, Paris, 1994.
12. ROUSSEAUX, F., Musical Knowledge and Elective Affinities: Get the Former to Actualise the Latter, European Meetings on Cybernetics and Systems Research, Vienne, 2002.
13. SCHAEFFER, P., Traité des objets musicaux, Paris, Seuil, 1966.
14. SIMONDON, G., Du mode d'existence des objets techniques, Aubier, 1989.
Process Mining: Discovering Direct Successors in Process Logs
Laura Maruster¹, A.J.M.M. (Ton) Weijters¹, W.M.P. (Wil) van der Aalst¹, and Antal van den Bosch²
¹ Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands, {l.maruster, a.j.m.m.weijters, w.m.p.aalst}@tm.tue.nl
² Tilburg University, 5000 LE Tilburg, the Netherlands, [email protected]
Abstract. Workflow management technology requires the existence of explicit process models, i.e. a completely specified workflow design needs to be developed in order to enact a given workflow process. Such a workflow design is time consuming and often subjective and incomplete. We propose a learning method that uses the workflow log, which contains information about the process as it is actually being executed. In our method we use a logistic regression model to discover the direct connections between events of a realistic, incomplete workflow log with noise. Experimental results are used to show the usefulness and limitations of the presented method.
1 Introduction
The managing of complex business processes calls for the development of powerful information systems, able to control and support the flow of work. These systems are called Workflow Management Systems (WfMS), where a WfMS is generally thought of as "a generic software tool, which allows for definition, execution, registration and control of workflows" [1]. Despite the promise of workflow technology, many problems are encountered when applying it. One of the problems is that these systems require a workflow design, i.e. a designer has to construct a detailed model accurately describing the routing of work. The drawback of such an approach is that the workflow design requires a lot of effort from the workflow designers, workers and management, and is time consuming and often subjective and incomplete. Instead of hand-designing the workflow, we propose to collect the information related to the process and to discover the underlying workflow from this information history. We assume that it is possible to record events such that (i) each event refers to a task, (ii) each event refers to a case and (iii) events are totally ordered. We call this information history the workflow log. We use the term process mining for the method of distilling a structured process description from a set of real executions. To illustrate the idea of process mining, consider the workflow log from Table 1. In this example, there are seven cases that have been processed and twelve executed tasks. We can notice the following: for each case, the execution starts with task A and
ends with task L, if C is executed, then E is executed. Also, sometimes we see task H and I after G and H before G. Using the information shown in Table 1, we can discover the process model shown in Figure 1. We represent the model using Petri nets [1]. The Petri net model starts with task A and finishes with task L. After executing A, either task B or task F can be executed. If task F is executed, tasks H and G can be executed in parallel.

Table 1. An example of a workflow log

Case number   Executed tasks
Case 1        AFGHIKL
Case 2        ABCEJL
Case 3        AFHGIKL
Case 4        AFGIHKL
Case 5        ABCEJL
Case 6        ABDJL
Case 7        ABCEJL

Fig. 1. A process model for the log shown in Table 1
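As a concrete illustration of these observations, the following Python sketch encodes the log of Table 1 as plain data and checks the properties mentioned in the text (every trace starts with A and ends with L; whenever C occurs, E occurs as well; H and G appear in both orders). The dictionary representation is only an illustrative convention, not a format prescribed by the paper.

# The workflow log of Table 1: one trace (sequence of tasks) per case.
log = {
    "Case 1": "AFGHIKL", "Case 2": "ABCEJL", "Case 3": "AFHGIKL",
    "Case 4": "AFGIHKL", "Case 5": "ABCEJL", "Case 6": "ABDJL",
    "Case 7": "ABCEJL",
}

# Every case starts with task A and ends with task L.
assert all(t[0] == "A" and t[-1] == "L" for t in log.values())

# If C is executed, then E is executed as well.
assert all("E" in t for t in log.values() if "C" in t)

# H and G appear in both orders, which suggests they are executed in parallel.
orders = {tuple(x for x in t if x in "GH") for t in log.values() if "G" in t}
print(orders)  # {('G', 'H'), ('H', 'G')}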
A parallel execution of tasks H and G means that they can appear in any order. The idea of discovering models from process logs was previously investigated in contexts such as software engineering processes and workflow management [2-9]. Cook and Wolf propose three methods for process discovery in the case of software engineering processes: a finite state-machine method, a neural network and a Markov approach [3]. Their methods focus on sequential processes. They have also provided some specific metrics for the detection of concurrent processes, like entropy, event type counts, periodicity and causality [4]. Herbst and Karagiannis used a hidden Markov model in the context of workflow management, in the case of sequential processes [6,8,9] and concurrent processes [7]. In the works mentioned, the focus was on identifying the dependency relations between events. In [10], a technique for discovering the underlying process from hospital data is presented, under the assumption that the workflow log does not contain any noisy data. A heuristic method that can handle noise is presented in [11]; however, in some situations, the metric used is not robust enough for discovering the complete process. In this paper, the problem of process discovery from process logs is defined as: (i) for each task, find its direct successor task(s), (ii) in the presence of noise and (iii) when the log is incomplete. Knowing the direct successors, a Petri net model can be constructed, but we do not address this subject in the present paper; this issue is presented elsewhere [10,11]. It is realistic to assume that workflow logs contain noise. Different situations can lead to noisy logs, like input errors or missing information (for example, in a hospital environment, a patient starts a treatment in hospital X and continues it in hospital
Y; in the workflow log of hospital Y we cannot see the treatment activities that occurred in hospital X). The novelty of the present approach resides in the fact that we use a global learning approach: we develop a logistic regression model and we find a threshold value that can be used to detect direct successors. As basic material, we use the "dependency/frequency tables", as in [11]. In addition to the "causality metric" used in [11], which indicates the strength of the causal relation between two events, we introduce two other metrics. The content of this paper is organized as follows: in Section 2 we introduce the two new metrics that we use to determine the "direct successor" relationship and we recall the "causality metric" introduced in [11]. The data we use to develop the logistic regression model is presented in Section 3. Section 4 presents the description of the logistic regression model, and two different performance test experiments are presented. The paper concludes with a discussion of the limitations of the current method and addresses future research issues.
2 Succession and Direct Succession
In this section we discuss some issues relating to the notion of succession and we define the concept of direct succession. Furthermore, we describe three succession metrics that we use to determine the direct succession relationship. At the end of this section we give an example of a dependency/frequency table, with the corresponding values of the three metrics.

2.1 The Succession and Direct Succession Relations
Before introducing the definitions of the succession and direct succession relations, we have to define formally the notions of workflow log and workflow trace.

Definition 1 (Workflow trace, Workflow log). Let T be a set of tasks. θ∈T* is a workflow trace and W⊆T* is a workflow log. We denote by #L the count of all traces θ.

An example of a workflow log is given in Table 1. A workflow trace for case 1 is A F G H I K L. For the same workflow log from Table 1, #L = 7.

Definition 2 (Succession relation). Let W be a workflow log over T, i.e. W⊆T*, and let A, B∈T. Then:
- B succeeds A (notation A>WB) if and only if there is a trace θ = t1t2…tn-1 and i ∈ {1,…, n-2} such that θ∈W, ti=A and ti+1=B. In the log from Table 1, A >W F, F >W G, B >W C, H >W G, etc.
- we denote (A>B) = m, m≥0, where m is the number of cases for which the relation A>WB holds. For example, if for the log W the relation A>WB holds 100 times and the relation B>WA holds only 10 times, then (A>B) = 100 and (B>A) = 10.
Definition 3 (Direct succession relation). Let W be a workflow log over T, i.e. W⊆T*, and A, B∈T. Then B directly succeeds A (notation A→WB) if either:
1. (A>B) > 0 and (B>A) = 0, or
2. (A>B) > 0 and (B>A) > 0 and ((A>B) − (B>A)) ≥ σ, with σ > 0.

Let us consider again the Petri net from Figure 1. A pair of two events can be in three possible situations, and the corresponding relations between the events are:
a) if events C and E are in sequence, i.e. (C>E) > 0 and (E>C) = 0, then C>WE and C→WE.
b) if there is a choice between events B and F, i.e. (B>F) = 0 and (F>B) = 0, then neither B >W F nor F >W B holds, and neither B →W F nor F →W B holds.
c) if events G and H are in parallel, i.e. (G>H) > 0 and (H>G) > 0, then G>WH and H>WG, but neither G →W H nor H →W G holds.

The first condition from Definition 3 says that if, for a given workflow log W, only B succeeds A and A never succeeds B, then there is a direct succession between A and B. This will hold if there is no noise in W. However, if there is noise, we have to consider the second condition for direct succession, because both (A>B) > 0 and (B>A) > 0. The problem is to distinguish between the situation when (i) A and B occur in parallel and (ii) A and B are really in a direct succession relation, but there is noise. In the rest of the paper we describe the methodology for finding the threshold value σ. In order to find the threshold value σ, we use three metrics of the succession relation, which are described in the next subsection.
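Definition 3 can be read directly as a small piece of code. The Python sketch below counts the succession relation (A>B) for every pair of tasks in a workflow log and then applies the two conditions of the definition. The log is the one of Table 1, and the value sigma=1 is only an illustrative stand-in: in the paper the decision is ultimately made with the logistic model and the threshold derived in Section 4.

from collections import Counter

def succession_counts(traces):
    # (A>B): how often task B appears directly after task A, over all traces.
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

def directly_succeeds(a, b, counts, sigma=1):
    # Definition 3, taken literally: condition 1 (no reverse occurrences) or
    # condition 2 (count difference of at least sigma).
    ab, ba = counts[(a, b)], counts[(b, a)]
    if ab > 0 and ba == 0:
        return True
    return ab > 0 and ba > 0 and (ab - ba) >= sigma

traces = ["AFGHIKL", "ABCEJL", "AFHGIKL", "AFGIHKL", "ABCEJL", "ABDJL", "ABCEJL"]
counts = succession_counts(traces)
tasks = sorted({t for trace in traces for t in trace})
print({a: [b for b in tasks if directly_succeeds(a, b, counts)] for a in tasks})
# e.g. A -> ['B', 'F'], while the parallel pair G, H is correctly rejected.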
2.2 The Local Metric (LM), Global Metric (GM) and Causality Metric (CM)
The local metric LM. Considering tasks A and B, the local metric LM expresses the tendency of the succession relation by comparing the magnitude of (A>B) versus (B>A). The idea of the LM measure presented below is borrowed from statistics, where it is used to calculate confidence intervals.
LM = P − 1.96 · sqrt( P(1 − P) / (N + 1) ),   where   P = (A>B) / (N + 1)   and   N = (A>B) + (B>A).
We want to know, with a probability of 95%, the likelihood of succession, by comparing the magnitude of (A>B) versus (B>A). For example, if (A>B)=30, (B>A)=1 and (A>C)=60, (C>A)=2, which is the most likely successor of A: B or C? Although both ratios (A>B)/(B>A) and (A>C)/(C>A) equal 30, C is more likely than B to be the successor of A. Our measure gives for A and B a value of LM=0.85 and for A and C a value of LM=0.90, which is in line with our intuition. Let us now consider again the examples from Figure 1. If we suppose that the number of lines in the log is #L=1000, we can have three situations: (i) (C>E)=1000,
(E>C)=0, LM=0.997; (ii) (H>G)=600, (G>H)=400, LM=0.569; (iii) (F>B)=0, (B>F)=0, LM=0. In the sequential case (i), because E always succeeds C, LM≈1. When H and G are in parallel, in case (ii), LM=0.569, a value much smaller than 1. In the case of a choice between F and B, in case (iii), LM=0. In general, we can conclude that LM can have a value (i) close to 1 when there is a clear tendency of succession between X and Y, (ii) in the neighborhood of 0.5 when there is both a succession between X and Y and between Y and X, but a clear tendency cannot be identified, and (iii) zero when there is no succession relation between X and Y.
The global metric GM. The previous measure LM expressed the tendency of succession by comparing the magnitude of (A>B) versus (B>A) at a local level. Let us consider that the number of traces in our log is #L=1000 and the frequencies of events are #A=1000, #B=1000 and #C=1000. We also know that (A>B)=900, (B>A)=0, (A>C)=50 and (C>A)=0. The question is: who is the most likely successor of A, B or C? For B, LM=0.996 and for C, LM=0.942, so we can conclude that both can be considered as successors. However, one can argue that C succeeds A not as frequently, and thus B should be considered the more likely successor. Therefore, we build the GM measure presented below.
GM = ((A>B) − (B>A)) · #L / (#A · #B).
Applying the formula above, we obtain for B as direct successor a value of GM=0.90, while for C, GM=0.05; thus B is more likely to directly succeed A. In conclusion, for determining the likelihood of succession between two events A and B, the GM metric is indeed a global metric because it takes into account the overall frequencies of events A and B, while the LM metric is a local metric because it compares the magnitude of (B>A) with (A>B).
The causality metric CM. The causality metric CM was first introduced in [11]. If, for a given workflow log, when task A occurs, shortly later task B also occurs, it is possible that task A causes the occurrence of task B. The CM metric is computed as follows: if task B occurs after task A and n is the number of events between A and B, then CM is incremented by a factor (δ)^n, where δ is a causality factor, δ∈[0.0, 1.0]. We set δ=0.8. The contribution to CM is maximally 1, if task B appears right after task A and then n=0. Conversely, if task A occurs after task B, CM is decreased by (δ)^n. After processing the whole log, CM is divided by the minimum of the overall frequency of A and the overall frequency of B.
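The three metrics can be computed directly from the counts in a workflow log. The Python sketch below is a straightforward transcription of the formulas above; the CM routine follows the verbal description given in this subsection (add delta**n for each A...B occurrence, subtract it for the reverse order, then normalise by the smaller overall frequency), so details such as pairing every A with every later B within a trace are assumptions rather than the authors' exact implementation.

import math

def lm(ab, ba):
    # Local metric: 95% lower confidence bound on the tendency of A>B over B>A.
    n = ab + ba
    if n == 0:
        return 0.0
    p = ab / (n + 1)
    return p - 1.96 * math.sqrt(p * (1 - p) / (n + 1))

def gm(ab, ba, n_traces, freq_a, freq_b):
    # Global metric: count difference weighted by the overall task frequencies.
    return (ab - ba) * n_traces / (freq_a * freq_b)

def cm(traces, a, b, delta=0.8):
    # Causality metric, per the description in the text (assumed pairing rule).
    score = 0.0
    for trace in traces:
        for i, x in enumerate(trace):
            for j in range(i + 1, len(trace)):
                if x == a and trace[j] == b:
                    score += delta ** (j - i - 1)
                elif x == b and trace[j] == a:
                    score -= delta ** (j - i - 1)
    freq_a = sum(t.count(a) for t in traces)
    freq_b = sum(t.count(b) for t in traces)
    return score / min(freq_a, freq_b) if min(freq_a, freq_b) > 0 else 0.0

print(lm(30, 1), lm(60, 2))          # approx. 0.85 and 0.90, as in the text
print(gm(900, 0, 1000, 1000, 1000))  # 0.90, as in the text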
2.3 The Dependency/Frequency Table
The starting point of our method is the construction of a so-called dependency/frequency (D/F) table from the workflow log information. An excerpt from the D/F table for the Petri net presented in Figure 1 is shown in Table 2. The information contained in the D/F table is: (i) the identifiers of tasks A and B, (ii) the overall frequency of task A (#A), (iii) the overall frequency of task B (#B), (iv) the frequency of task B directly succeeded by another task A (B>A), (v) the frequency of task A directly
succeeded by another task B (A>B), (vi) the frequency of B directly or indirectly succeeded by another task A, but before the next appearance of B (B>>>A), (vii) the frequency of A directly or indirectly succeeded by another task B, but before the next appearance of A (A>>>B), (viii) the local metric LM, (ix) the global metric GM and (x) the causality metric CM. The last column (DS) of Table 2 is discussed in the next section.

Table 2. Example of a D/F table with direct succession (DS column) information. "T" means that task B is a direct successor of task A, while "F" means that B is not a direct successor of A

AB   #A    #B     (B>A)  (A>B)  (B>>>A)  (A>>>B)  LM    GM     CM     DS
ba   536   1000   536    0      536      0        0.00  -1.0   -1.0   F
bb   536   536    0      0      0        0        0.00  0.00   0.00   F
bd   536   279    0      279    0        279      0.99  1.86   1.00   T
bj   536   536    0      0      0        536      0.00  0.00   0.72   F
bl   536   1000   0      0      0        536      0.00  0.00   0.57   F
bc   536   257    0      257    0        257      0.99  1.86   1.00   T
be   536   257    0      0      0        257      0.00  0.00   0.80   F
3 Data Generation
For developing a model that will be used to decide when two events are in a direct succession relation, we need to generate training data that resemble real workflow logs. Our data generation procedure consists of combinations of the following four elements that can vary from workflow to workflow and subsequently affect the workflow log:
- number of event types: we generate Petri nets with 12, 22, 32 and 42 event types.
- amount of information in the workflow log: the amount of information is expressed by varying the number of traces (one trace or line actually represents the processing of one case), starting with 1000, 2000, 3000, etc. and ending with 10000 traces.
- amount of noise: we generate noise by performing four different operations (see the sketch below): (a) delete the head of an event sequence, (b) delete the tail of a sequence, (c) delete a part of the body and (d) interchange two randomly chosen events. During the noise generation process, minimally one event and maximally one third of the sequence is deleted. We generate three levels of noise: 0% noise (the common workflow log), 5% noise and 10% noise (we select 5% and respectively 10% of the original event sequences and apply one of the four noise generation operations described above).
- unbalance in AND/OR splits: in Figure 1, after executing event A, which is an OR-split, there may be an unbalance between executing tasks B and F. For example, 80% of cases will execute task B and only 20% will execute task F. We progressively produced unbalance at different levels.
For each log that resulted from all possible combinations of the four elements mentioned before, we produce a D/F table. In the D/F table a new field is added (the DS column) which records whether there is a direct succession relationship between events A and B (True/False). An example of the D/F table with direct succession information is shown in Table 2. All D/F tables are concatenated into one large final D/F/DS table that will be used to build the logistic regression model.
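As an illustration of the four noise operations, the Python sketch below corrupts a given fraction of the traces in a log. Deleting between one event and one third of the sequence follows the description above; everything else (function name, representation of traces as strings, seeding) is an assumption made for the example.

import random

def add_noise(traces, noise_level, seed=0):
    # Corrupt `noise_level` (e.g. 0.05 or 0.10) of the traces with one of the
    # four operations: delete head, delete tail, delete part of the body, or
    # interchange two randomly chosen events.
    rng = random.Random(seed)
    noisy = list(traces)
    for idx in rng.sample(range(len(noisy)), int(noise_level * len(noisy))):
        trace = list(noisy[idx])
        k = rng.randint(1, max(1, len(trace) // 3))  # 1 .. one third of the events
        op = rng.choice(["head", "tail", "body", "swap"])
        if op == "head":
            trace = trace[k:]
        elif op == "tail":
            trace = trace[:-k]
        elif op == "body":
            start = rng.randrange(max(1, len(trace) - k))
            trace = trace[:start] + trace[start + k:]
        else:  # interchange two randomly chosen events
            i, j = rng.sample(range(len(trace)), 2)
            trace[i], trace[j] = trace[j], trace[i]
        noisy[idx] = "".join(trace)
    return noisy

clean = ["AFGHIKL", "ABCEJL", "AFHGIKL", "AFGIHKL", "ABCEJL", "ABDJL", "ABCEJL"] * 100
print(add_noise(clean, 0.05)[:5])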
4 The Logistic Regression Model
We have to develop a model that can be used to determine when two events A and B are in a direct succession relationship. The idea is to combine the three metrics described earlier and to find a threshold value σ above which two events A and B can be considered to be in the direct succession relationship. In this section we develop a logistic regression model and we perform some validation tests. Logistic regression estimates the probability of a certain dichotomous characteristic to occur. We want to predict whether "events A and B are in a direct succession relationship", which can be True/False. Therefore, we set as dependent variable the DS field from the D/F/DS file. The independent variables are the three metrics that we built earlier, i.e. the global metric GM, the local metric LM and the causality metric CM. In conclusion, we want to obtain a model that, given a certain combination of LM, GM and CM values for two events A and B, predicts the probability π of A and B being in the direct succession relationship. The form of the logistic regression is log( π/(1−π) ) = B0 + B1*LM + B2*GM + B3*CM, where the ratio π/(1−π) represents the odds. For example, if the proportion of direct successors is 0.2, the odds equal 0.25 (0.2/0.8=0.25). The significance of the individual logistic regression coefficients Bi is given by the Wald statistic, which indicates significance in our model; this means that all independent variables have a significant effect on direct succession predictability (Wald tests the null hypothesis that a particular coefficient Bi is zero). The model goodness-of-fit test is a Chi-square function that tests the null hypothesis that none of the independents is linearly related to the log odds of the dependent. It has a value of 108262.186, at probability p<.000, inferring that at least one of the population coefficients differs from zero. The coefficients of the logistic regression model are shown in Table 3.

Table 3. Logistic analysis summary of three succession predictors of the direct succession relation. The discrete dependent variable DS measures the question "are events A and B in a direct succession relationship?"; ** means significant at p<0.01

Variables in the Equation*   B        Wald       df   Sig**   Exp(B)
LM                           6.376    2422.070   1    .000    587.389
GM                           4.324    920.638    1    .000    75.507
CM                           8.654    4490.230   1    .000    5735.643
Constant                     -8.280   4561.956   1    .000    .000
Using the Bi coefficients from Table 3, we can write the following expression for LR (Eq. 1):
LR = -8.280 + 6.376*LM + 4.324*GM + 8.654*CM        (1)
Then the estimated probability π̂ can be calculated with the following formula (Eq. 2):

π̂ = e^LR / (1 + e^LR)        (2)
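Equations (1) and (2) give a complete scoring rule once the coefficients of Table 3 are plugged in; the Python sketch below simply evaluates them for a pair of events and compares the result with the cut value of 0.8 adopted below. The example LM/GM/CM values are taken from rows of Table 2 for illustration.

import math

# Coefficients from Table 3, i.e. Eq. (1).
B0, B_LM, B_GM, B_CM = -8.280, 6.376, 4.324, 8.654

def direct_succession_probability(lm, gm, cm):
    # Eq. (1) and Eq. (2): logistic estimate of the probability that B
    # directly succeeds A, given the three metrics for the pair (A, B).
    lr = B0 + B_LM * lm + B_GM * gm + B_CM * cm
    return math.exp(lr) / (1.0 + math.exp(lr))

for lm, gm, cm in [(0.99, 1.86, 1.00),   # the 'bd' row of Table 2 (DS = T)
                   (0.00, 0.00, 0.72)]:  # the 'bj' row of Table 2 (DS = F)
    p = direct_succession_probability(lm, gm, cm)
    print(round(p, 3), p >= 0.8)  # accept as direct successor at cut value 0.8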
The influence of LM, GM and CM can be detected by looking at the column Exp(B) in Table 3. For example, when CM increases by one unit, the odds that the dependent variable equals 1 increase by a factor of ~5736, when the other variables are controlled. Comparing GM, LM and CM, we can notice that CM is the most important variable in the model. Inspecting the correct and incorrect estimates, we can assess the model performance. Our model predicts the T value of DS in 95.1% of cases and the F value of DS in 99.2% of cases. These values for correct/incorrect estimates are obtained at a cut value of 0.8, i.e. values that exceed 0.8 are counted as correct estimates. We set the cut value at 0.8 because we are interested in knowing the classification score when the estimated probability is high. Because 95% of the events which are in a direct succession relationship are correctly predicted by the model, we conclude that we can set the threshold σ = 0.8. That means that we will accept that there is a direct successor relationship between events A and B if the estimated probability exceeds 0.8. The following step is to test the model performance on test material.
Model testing. We describe two different types of tests: (i) k-fold cross-validation on test material extracted from the learning material and (ii) a model check on completely new test material. K-fold cross-validation (k-fold cv) is a model evaluation method that can be used to see how well a model generalizes to new data of the same type as the training data. The data set is divided into k subsets. Each time, one of the k subsets is used as the test set and the other k−1 subsets are put together to form a training set. Then the average error across all k trials is computed. Every data point gets to be in a test set exactly once, and gets to be in a training set k−1 times. The variance of the resulting estimate is reduced as k is increased. We take k=10. The results of our 10-fold cv give for the 10 training sets an average performance of 95.1 and for the 10 testing sets an average performance of 94.9, so we can conclude that our model will perform well on new data. In order to test the model performance on completely new data, we built a new, more complex Petri net with 33 event types. This new PN has 6 OR-splits, 3 AND-splits and three loops (our training material contains Petri nets with at most one loop). We consider three Petri nets with three different levels of unbalance and, using the formula from Eq. 2, we predict the probability of direct succession for each Petri net. For these three Petri nets, we counted the number of direct successors correctly found with our method. The average of direct successors that were correctly found is 94.3. Therefore we can conclude that even in the case of completely new data, i.e. a workflow log generated by a more complex Petri net, the method performs well in determining the direct successors.
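The cross-validation scheme described above can be written down generically in a few lines; in the Python sketch below, `fit` and `accuracy` are placeholders for fitting the logistic model on a training split and scoring it on a test split, not the authors' actual code.

import random

def k_fold_cv(rows, k, fit, accuracy, seed=0):
    # Each of the k folds serves once as the test set while the remaining
    # k-1 folds form the training set; return the mean test accuracy.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit(train), test))
    return sum(scores) / k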
5 Conclusions and Future Directions
Using the presented method, we developed a model that estimates the probability that two events A and B are in the direct successor relation. The model performance is
good, i.e. 95% of the original direct succession relations were found. However, it is interesting to investigate why the remaining 5% of direct connections were not discovered. Inspecting these cases, we notice that although there is a direct succession relation between events A and B, the value of (A>B) is too small, and subsequently the values of all three metrics are also small. To illustrate such a situation, inspect Figure 1. If we suppose that event H is always processed in 1 time unit, event G in 3 time units and I in 2 time units, and H always finishes its execution before I starts, then we will always see the sequence "AFHGIKL" and never the sequence "AFGIHKL". Although K is the direct successor of H, the method will not find the connection between H and K. In conclusion, we presented a global learning method that uses the information contained in workflow logs to discover the direct successor relations between events. The method is able to find almost all direct connections in the presence of parallelism, noise and an incomplete log. Also, we tested our model on a workflow log generated by a more complex Petri net than the learning material, resulting in a performance close to that of the first experiment. We plan to do future research in several directions. First, because in many applications the workflow log contains a timestamp for each event, we want to use this additional information to improve our model. Second, we want to provide a method to determine the relations between the direct successors and finally to construct the Petri net.
References
[1] W.M.P. van der Aalst. The Application of Petri Nets to Workflow Management. Journal of Circuits, Systems, and Computers, 8(1): 21-66, 1998.
[2] R. Agrawal, D. Gunopulos, and F. Leymann. Mining Process Models from Workflow Logs. In Sixth International Conference on Extended Database Technology, pp. 469-483, 1998.
[3] J.E. Cook and A.L. Wolf. Discovering Models of Software Processes from Event-Based Data. ACM Transactions on Software Engineering and Methodology, 7(3): 215-249, 1998.
[4] J.E. Cook and A.L. Wolf. Event-Based Detection of Concurrency. In Proceedings of the Sixth International Symposium on the Foundations of Software Engineering (FSE-6), Orlando, FL, November 1998, pp. 35-45.
[5] J.E. Cook and A.L. Wolf. Software Process Validation: Quantitatively Measuring the Correspondence of a Process to a Model. ACM Transactions on Software Engineering and Methodology, 8(2): 147-176, 1999.
[6] J. Herbst. A Machine Learning Approach to Workflow Management. In 11th European Conference on Machine Learning, volume 1810 of Lecture Notes in Computer Science, pages 183-194, Springer, Berlin, Germany, 2000.
[7] J. Herbst. Dealing with Concurrency in Workflow Induction. In U. Baake, R. Zobel and M. Al-Akaidi (eds.), European Concurrent Engineering Conference, SCS Europe, Gent, Belgium, 2000.
[8] J. Herbst and D. Karagiannis. An Inductive Approach to the Acquisition and Adaptation of Workflow Models. In M. Ibrahim and B. Drabble (eds.), Proceedings of the IJCAI'99 Workshop on Intelligent Workflow and Process Management: The New Frontier for AI in Business, pp. 52-57, Stockholm, Sweden, August 1999.
[9] J. Herbst and D. Karagiannis. Integrating Machine Learning and Workflow Management to Support Acquisition and Adaptation of Workflow Models. International Journal of Intelligent Systems in Accounting, Finance and Management, 9: 67-92, 2000.
[10] L. Maruster, W.M.P. van der Aalst, T. Weijters, A. van den Bosch, W. Daelemans. Automated Discovery of Workflow Models from Hospital Data. In Kröse, B. et al. (eds.): Proceedings 13th Belgium-Netherlands Conference on Artificial Intelligence (BNAIC'01), 25-26 October 2001, Amsterdam, The Netherlands, pp. 183-190.
[11] T. Weijters, W.M.P. van der Aalst. Process Mining: Discovering Workflow Models from Event-Based Data. In Kröse, B. et al. (eds.): Proceedings 13th Belgium-Netherlands Conference on Artificial Intelligence (BNAIC'01), 25-26 October 2001, Amsterdam, The Netherlands, pp. 283-290.
The Emergence of Artificial Creole by the EM Algorithm
Makoto Nakamura and Satoshi Tojo
Graduate School of Information Sciences, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi-machi, Nomi-gun, Ishikawa, 923-1292, Japan
{mnakamur, tojo}@jaist.ac.jp
Abstract. Studies on evolutionary dynamics of grammar acquisition on the computer have been widely reported in recent years, where an agent learns the grammars of other agents through the exchange of sentences between them. In particular, Nowak et al. [11] generalized an evolutionary theory of language with universal grammar mathematically. In this paper, we propose a model of language evolution for the emergence of creole based on their theory, and try to discover the critical conditions for creolization. In our experimentation, we utilize the inside-outside (EM) algorithm to find the grammar of a new generation. As a result, we contend that creolization is strongly affected by the population of the community of the original language, rather than by the similarity of the original grammars.
1 Introduction
Studies on evolutionary dynamics of grammar acquisition on the computer have often been reported in recent years, where an autonomous and active agent learns the grammars of other agents through the exchange of sentences between them. However, those experimental results hardly seem to reflect the phenomena of language evolution in the real world, because the models and the systems in those reports were too abstract. In this paper, we pay particular attention to a model of language evolution that retains intrinsic linguistic features, and try to discover the critical conditions for the emergence of creole. For us to realize language dynamics on the computer, a rational agent should behave as follows:
– an agent composes messages to other agents with her own grammar,
– listens to and recognizes messages from other agents with her own grammar,
– evaluates and learns grammars of the messages of other agents, and
– leaves offspring proportional to the payoff, that is, the recognition rate of other agents' grammars.
Thus far, Hashimoto et al. [6,7] modeled an evolution of symbolic grammar systems by agents who use a very simple grammar and a simple learning mechanism. However, these attractive reports seem to be no more stories of cognitive agents
than the primitive protocol matching of lower animals is, and were not about the behaviour of human beings but about merely artificial organisms. In order to simulate human language dynamics, we need to make the language system more sophisticated. Because we cannot implement the mechanism of the human learning process, nor that of the diachronic evolution of languages, directly on the computer, the adequacy of a model is difficult to evaluate. Some works hypothesize UG (universal grammar) [4], that is, an innate grammar available to human babies when they begin to learn a language [8]. It is believed that UG is the product of some special neural circuitry within the human brain, which is called 'language organ' by Chomsky, and 'language instinct' by Pinker [13]. The advantage of UG is to restrict the search space of possible candidate grammars. Briscoe [3] reported models of human language acquisition on UG, where each agent had a hierarchical lattice of categories, and a given set of parameters specified a category grammar. Instantiating UG by agents in their model, they could express the evolutionary dynamics of natural language which consisted of eight basic language families in terms of the unmarked, canonical order of verbs (V), subjects (S) and objects (O). Nowak et al. take a middle position between those two extreme models: Hashimoto's simplified model and Briscoe's sophisticated model [11]. They generalized an evolutionary theory of language with UG mathematically. It was assumed that the search space provided by the principles in UG was finite and that all the possible grammars could be enumerated. From this assumption, defining the similarity matrix and the payoff between grammars, they represented the transition of the population of grammars as a differential equation. Consequently, they succeeded in representing an equilibrium of language evolution. Based on Nowak's framework, we will discuss the conditions that allow creolization to occur. As for the learning mechanism of each agent, Nowak et al. assumed a stochastic framework. Similarly, we adopt a statistical learning algorithm called the Expectation Maximization (EM) algorithm. Section 2 describes how creole emergence should be represented mathematically in the model. Section 3 describes our experiments in which an agent utilizes the inside-outside (EM) algorithm for grammar estimation. In Section 4, we discuss our contribution.
2 The Model of Creole Emergence
In this section, we describe how creolization emerges based on the population dynamics for which Nowak et al. [11] proposed a framework. Their purpose is to develop a mathematical theory for the evolutionary and population dynamics of grammar acquisition [8]. In particular, they do not pay attention to the ability of each agent but to the whole behaviour of the population, and in this respect their work differs from other works in which each agent obtains a target grammar by mutual interaction [3,6]. They employed the similarity matrix and the payoff between grammars to represent the differential equation for the population dynamics as mentioned
above. The similarity matrix consists of elements of the probability sij that an agent who uses a grammar Gi may utter a sentence that is compatible with Gj. The Q matrix appears in the equation of the payoff and in the differential equation of the population dynamics, where qij is the probability that a child born to an agent using Gi will develop Gj. Thus, the probability that a child will develop Gi if the parent uses Gi is given by qii. The value qii represents the accuracy of grammar acquisition.¹

Fig. 1. Population dynamics of creolization

Roughly speaking, Nowak's results were: (1) For low accuracy of grammar acquisition (low values of qii), all grammars Gi occur with roughly equal abundance. There is no predominating grammar in the population. (2) For high accuracy of grammar acquisition, the population will converge to a stable equilibrium with one dominant grammar. In the latter case, the dominant grammar is an existing grammar, namely xd(0) > 0, where xd is the relative abundance of agents who use the dominant grammar Gd in the end. According to Bickerton, creole, namely a new language, emerges under peculiar environments [2]. In most cases when multiple languages come into contact, one of them would eventually dominate the others. Nowak's work has not considered the emergence of a new language, and thus there was no concept of creole. If all languages are enumerated, creole must also be included among them. Let us consider a creole grammar Gc. At first, no agent has this grammar, that is, xc(0) = 0. Creolization means that the grammar emerges and eventually dominates the population. We propose a model of creole emergence in terms of population dynamics. This model can be considered as a natural extension of Nowak's work. Creole is defined in the model as a grammar Gc such that:

xc(t) = 0,   xc(t + 1) ≥ θc        (1)
where θ_c is the threshold for admitting the grammar as dominant (see Fig. 1). According to Briscoe [3], the principles restrict all conceivable grammars to the relevant ones, and they are anchored to a natural-language grammar by the parameters for word order. The parameters are three independent binary variables, hence there are eight grammars in the search space. All grammar rules in the principles are represented in Chomsky normal form (CNF). Each parameter is assigned to a specific rule of the grammar to change the order of the non-terminal symbols on the right-hand side of the rule. For example, if the rule "S → VP NP" in a grammar has the parameter value 0, then the rule "S → NP VP" would have the value 1.
¹ The matrix Q = {q_ij} depends on the matrix S = {s_ij} because the latter one defines how close different grammars are to each other [8]. The accuracy of language acquisition also depends on the learning mechanism that is specified by UG.
Each of the eight grammars is named by the value of a set of parameters. Namely, if the values of the parameters (b2, b1, b0) are (1, 0, 0), for example, then the name of the grammar is G_(b2 b1 b0)_2 (the parameter values read as a binary number), that is, G4. In this model, there are 15 (non-recursive) context-free grammar rules as principles, and each value of the parameters (b2, b1, b0) determines the order of the right-hand side of the rules. In Table 1, the parameters in the first column affect the rules of the second column.

Table 1. The rules of CNF.

  b2  S → NP VP        b2  S1 → VP1 NP1
      S → VP               S1 → VP1
      VP → Vi              VP1 → Vi
  b1  VP → Vt NP       b1  VP1 → Vt NP1
  b0  VP → VP1 PP      b1  NP1 → Det N
      NP → NP1             NP1 → N
  b0  NP → NP1 PP      b1  PP → Prp NP1
  b0  NP → NP1 S1

Each agent utilizes the inside-outside algorithm [10] to estimate the probability of each rule. The algorithm is a kind of EM algorithm, which estimates the values of the hidden parameters of a model by a stochastic method, and it is used for grammar acquisition from a plain corpus in natural language processing [9,12]. Through the communication with other agents at time t, an agent memorizes all sentences which it hears from other agents as well as those which it utters itself. Then each agent estimates the application probability of each rule of the grammar and decides the parameter values (see Fig. 1). To learn rules of the other grammars with the algorithm, agents need to own all the possible rules a priori, with low probabilities, at the initial stage. After the estimation, comparing the probabilities of the two contradicting rules assigned to a parameter, each agent adopts the one with the higher probability at t + 1.
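To make the last step concrete, here is a minimal Python sketch of how an agent could derive its parameter values, and hence its grammar index, once rule probabilities have been estimated. The rule names, the probability dictionary and the coding of which word order corresponds to 0 or 1 are illustrative assumptions; the inside-outside estimation itself [10] is not reproduced.

```python
# Hypothetical sketch: deciding (b2, b1, b0) from estimated rule probabilities.
# 'probs' maps a rule (written as a string) to its estimated application
# probability; in the paper these values come from the inside-outside (EM)
# algorithm, which is not reproduced here.

# Each parameter is decided by comparing two contradicting word-order variants:
# the first rule stands for value 0, the second for value 1. For b2 this
# follows the S -> VP NP / S -> NP VP example of Section 2; the codings for
# b1 and b0 are assumptions made for the sake of the example.
CONTRADICTING_RULES = {
    "b2": ("S -> VP NP",   "S -> NP VP"),
    "b1": ("VP -> NP Vt",  "VP -> Vt NP"),
    "b0": ("NP -> PP NP1", "NP -> NP1 PP"),
}

def decide_grammar(probs):
    """Return (b2, b1, b0) and the grammar index G_(b2 b1 b0)_2."""
    bits = []
    for param in ("b2", "b1", "b0"):
        rule0, rule1 = CONTRADICTING_RULES[param]
        # adopt the variant with the higher estimated probability
        bits.append(0 if probs.get(rule0, 0.0) >= probs.get(rule1, 0.0) else 1)
    index = bits[0] * 4 + bits[1] * 2 + bits[2]     # read (b2 b1 b0) as binary
    return tuple(bits), index

# Toy probabilities that favour (b2, b1, b0) = (1, 0, 0), i.e. grammar G4.
toy_probs = {"S -> NP VP": 0.7, "S -> VP NP": 0.3,
             "VP -> NP Vt": 0.8, "VP -> Vt NP": 0.2,
             "NP -> PP NP1": 0.6, "NP -> NP1 PP": 0.4}
print(decide_grammar(toy_probs))    # ((1, 0, 0), 4)
```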
3 The Experiments
In this section, we detail our experiments. We calculate S = {s_ij} and Q = {q_ij} first, and then observe the critical conditions for creolization. The procedure in which agents obtain sentences and learn grammars at time t is as follows:
1. An agent generates a sentence from her own grammar and speaks it in turn to another agent. The listening agent memorizes the sentence in her memory. This is executed once for all the agents.
2. Repeat Step 1 until each agent has memorized 1,000 sentences in total.
3. Each agent estimates the probability of each rule from the sentences in her memory by the inside-outside algorithm, decides the values of the parameters (b2, b1, b0) from the probabilities, and then decides the grammar.
3.1 The Calculation of the S Matrix
First, we calculate the similarity matrix S = {s_ij}, whose element s_ij is the probability that a speaker who uses G_i utters a sentence compatible with G_j, the sentence being derived by rules randomly chosen in proportion to their application probabilities. Each element of S was calculated from 30,000 sentences, and the result is shown in Table 2.
Table 2. The S matrix between grammars in principles.

        G0     G1     G2     G3     G4     G5     G6     G7
  G0  .968   .421   .217   .217   .469   .294   .169   .168
  G1  .396   .969   .186   .207   .224   .468   .160   .165
  G2  .209   .198   .968   .387   .175   .179   .484   .285
  G3  .216   .204   .412   .968   .183   .181   .227   .483
  G4  .485   .223   .181   .179   .968   .411   .207   .216
  G5  .286   .481   .181   .181   .391   .970   .191   .209
  G6  .167   .170   .470   .225   .210   .191   .969   .393
  G7  .171   .167   .293   .465   .216   .221   .418   .970
Table 3. The Q matrix under the simulation run.

        G0     G1     G2     G3     G4     G5     G6     G7
  G0  .516   .082   .114   .041   .126   .040   .060   .021
  G1  .131   .368   .068   .115   .084   .113   .054   .068
  G2  .136   .074   .413   .148   .063   .037   .084   .044
  G3  .033   .095   .081   .479   .024   .076   .055   .158
  G4  .157   .055   .074   .023   .480   .082   .096   .032
  G5  .043   .082   .037   .064   .149   .415   .074   .136
  G6  .069   .054   .114   .085   .115   .068   .364   .133
  G7  .022   .059   .041   .126   .043   .114   .079   .517
Each diagonal element s_ii of S is slightly smaller than 1, because agents occasionally speak sentences using rules of low probability, and such a sentence is recognized only by their own innate grammar. Under these principles, we can claim that F(G_i, G_j) = 0 among all grammars, because contradicting rules for word order always exist. Moreover, each grammar can derive the sentence 'Vi' by S → VP, VP → Vi with probability 0.167, so that each element of the matrix is roughly above this value. The grammars are roughly symmetrical with regard to word order.
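The following sketch illustrates, on a deliberately tiny scale, how an S matrix of this kind can be estimated by sampling: sentences are generated from one grammar and tested for compatibility with another by a CYK recogniser. The two toy grammars, the uniform choice among productions and the number of trials are illustrative assumptions; they are not the fifteen principle rules of Table 1, and the paper samples rules in proportion to their estimated probabilities rather than uniformly.

```python
import random
from collections import defaultdict

# Two hypothetical toy grammars in Chomsky normal form: every production is
# either two non-terminals or a single terminal (lower-case word). They are
# NOT the fifteen principle rules of Table 1; they only differ in the word
# order of the transitive VP rule, mimicking a single word-order parameter.
def make_grammar(verb_object_order):
    vp_rule = ("V", "NP") if verb_object_order else ("NP", "V")
    return {"S":  [("NP", "VP")],
            "VP": [vp_rule, ("sleeps",)],
            "NP": [("alice",), ("bob",)],
            "V":  [("sees",), ("likes",)]}

def generate(grammar, symbol="S"):
    """Generate a random sentence (list of words); productions are chosen
    uniformly here, whereas the paper samples them in proportion to their
    estimated probabilities."""
    if symbol not in grammar:                     # terminal symbol
        return [symbol]
    production = random.choice(grammar[symbol])
    return [word for part in production for word in generate(grammar, part)]

def parses(grammar, words):
    """CYK recogniser: can 'words' be derived from the start symbol S?"""
    terminal, binary = defaultdict(set), defaultdict(set)
    for lhs, productions in grammar.items():
        for p in productions:
            (terminal if len(p) == 1 else binary)[p].add(lhs)
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(terminal[(w,)])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= binary[(b, c)]
    return "S" in table[0][n]

def estimate_s(gi, gj, trials=10000):
    """Monte-Carlo estimate of s_ij (the paper uses 30,000 sentences)."""
    return sum(parses(gj, generate(gi)) for _ in range(trials)) / trials

g_vo, g_ov = make_grammar(True), make_grammar(False)
print(estimate_s(g_vo, g_vo))   # diagonal element: 1.0 for these toy grammars
print(estimate_s(g_vo, g_ov))   # about 0.5: only the intransitive sentence is shared
```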
3.2 Creole Emergence
We distribute eight agents over the eight grammars. The number of possible distributions of the population over the eight grammars is 15!/(8! 7!) = 6,435. We calculated the transitions of agents between grammars for all possible distributions. For θ_c = 1.0 in Equation 1, creole emerged in 18 cases, and for θ_c = 0.5, in 80 cases. We can analyse the relationship between creolization and the number of grammars that the agents use at t. The rate of creolization for each number of grammars is enumerated in Table 4. When θ_c = 1.0, creole emerged only when the agents were classified into three or four languages. When θ_c = 0.5, creole also emerged in other cases, but again especially when the agents were classified into three or four languages. When the agents were distributed over three languages, creole emerged in 19 out of 1,176 combinations, that is, 1.62%; over four languages, in 48 out of 2,450, that is, 1.96%. These two cases account for 84% of all the creolizations. When five or more languages were present, the emergence of a new language was itself unlikely, because most languages were already spoken by at least one agent at t. This result means that creole emerges more easily when agents are exposed to a situation with three or more languages; the contact of two languages tends to converge to one of the two. This result coincides with actual creoles [14].
3.3 Correlation between S and Q Matrices
From the result of the population dynamics in the previous section, we calculated Q = {q_ij} as follows:

q_ij = (the population of G_j users at t + 1 among those who used G_i at t) / (the population of G_i users at t)        (2)
The result Q = {q_ij} is shown in Table 3, where Σ_j q_ij = 1. The diagonal elements q_ii are the highest in each column and row of the Q matrix. This means that agents of the grammar that is dominant at t tend to reuse that grammar at t + 1. The rate at which half or more of the agents use the same (dominant) grammar at t is over 40 percent among all the combinations. Besides the fact that the q_ii's have rather high values, there seems to be no notable difference among the q_ij's (i ≠ j). Therefore, this Q matrix does not carry any information except that agents tend to converge to the grammar that is dominant at t. We could not find any further relationship between the S and Q matrices.

Table 4. The number and the rate of creolization for the number of grammars.

  G       D     Creolization, θ_c = 0.5   Creolization, θ_c = 1.0
  2       196       1 (0.51%)                 0 (0%)
  3      1176      19 (1.62%)                 8 (0.68%)
  4      2450      48 (1.96%)                10 (0.41%)
  5      1960      11 (0.56%)                 0 (0%)
  6       588       1 (0.17%)                 0 (0%)
  Total  6435      80 (1.24%)                18 (0.28%)

  G: the number of grammars.  D: the number of possible distributions.
3.4 Conditions of Creolization
In each experiment, every agent listens to sentences a fixed number of times, and as a result each agent memorizes a number of sentences. Each of these sentences was generated by another agent with her own grammar, so the sentences in an agent's memory can be divided into groups according to the grammars that generated them. Thus, the number of sentences in each group is proportional to the population of users of that grammar. Note that this proportion is approximately identical in the memories of all agents. Because agents learn the grammar of the next generation from this memory, and the grammar ratio of each memory is identical, the grammar tends to converge to a common one. Even in case the grammar does not converge, the population of the most dominant grammar at t consequently increases, or at least remains the same, at t + 1. We can observe the emergence of creole under the condition of Equation 1 in Fig. 2. All the agents are classified into three or four languages; for example, G0, G5 and G6 in (a), the populations of which are quite similar. In this case, a new language that no agent used at t emerges at t + 1. Here, let us consider what features the new language has from the viewpoint of the parameters (b2, b1, b0). As for the parameter b2 in Fig. 2(a), the grammars of two agents have it set to 0, and those of the other six have it set to 1. Therefore, the language of the next generation tends to use the rules with b2 = 1. For the same reason, b1 tends to be 0 and b0 tends to be 0, where '#' denotes a wildcard. Thus, we reason that the grammar of the new generation should have the parameters (b2, b1, b0) = (1, 0, 0), that is, G4. Sample (b) is a case in which the grammars of the agents did not converge to a common grammar, because the parameter b2 could not be fixed. Therefore, creolization is strongly affected by the distribution of the population among different grammars, rather than by the similarity between the original grammars.
Fig. 2. Samples of the result. (a) At t the populations are G0: 2, G5: 3, G6: 3; at t + 1 all eight agents use G4. (b) At t the populations are G1: 3, G2: 1, G6: 1, G7: 3; at t + 1 five agents use G3 and three use G7.
The result can be generalized to include the case where creole does not appear. If there is a dominant grammar group in the population, that grammar survives or attracts more agents in the next generation; however, those who were attracted did not necessarily use similar grammars. Thus, we conclude that what decides the next prevalent language is the majority of each parameter.
4 Discussion
In Nowak et al. [11], the matrix S, which indicates the similarity between two grammars, and the matrix Q, which represents the accuracy of grammar acquisition, played important roles in the population dynamics. To simulate the population dynamics on a computer, the S and Q matrices need to be defined in advance. The S matrix can simply be calculated from the given grammars. However, the definition of the Q matrix seems problematic in that they considered Q = {q_ij} to be diachronically constant. As we have mentioned in the previous section, the rate at which G_i users change their language to G_j in the next generation strongly depends on the balance between x_i and x_j, as well as on the populations of the other grammars. Thus, q_ij should be a function of the abundance of the population: q_ij(x_0(t), x_1(t), ..., x_{n-1}(t)). Table 5 is a sample of the Q matrix which depends on one specific distribution of the population, namely x_1 = 0.375 (3 of 8), x_4 = 0.250 (2 of 8), and x_7 = 0.375 (3 of 8) at a certain time t. Each value of the matrix is the result of ten calculations performed in the same way as in Section 3, with the distribution of the population kept the same. In this case, if the majority decision of the parameters is considered, the expected dominant grammar at t + 1 should be G_(1,0,1)_2, that is, G5. Indeed, the experimental result shows that the values of q_i5 (i = 1, 4, 7) are rather high in Table 5. Although the expected grammar may not always appear in the next generation, we can contend that the Q matrix should depend on the distribution of the x_i's at t. This observation affects the definition of the differential equation of the population dynamics. In this paper, we have discussed the conditions for the emergence of creole based on a mathematical theory of evolutionary and population dynamics.
Table 5. A sample of the Q matrix which depends on a distribution of the population.

         G0      G1      G2      G3      G4      G5      G6      G7
  G1      0    0.067      0       0    0.330   0.600      0       0
  G4      0    0.050      0       0    0.550   0.400      0       0
  G7      0    0.167      0       0       0    0.800      0    0.033

Our contributions are summarized as follows.
– First, we adopted the EM algorithm, which is one of the standard methods for finding grammar rules in large corpora, and showed actual experimental results of a large-scale computer simulation.
– Secondly, we observed the qualitative conditions for creolization, that is, for the emergence of a new language which had not been used by anyone in the previous generation. Our experimental results coincide with linguistic reports in the following two points: one is that creole tends to appear when there is no dominant language in terms of population, and the other is that it emerges more easily when three or more languages come into contact than when only two do [14].
– Thirdly, we calculated the Q matrix for a specific distribution of the population over the various grammars, and predicted that the values of q_ij should depend on the abundance of the population: x_0(t), x_1(t), ..., x_{n-1}(t).
Our future research target is to reconstruct the differential equation of population change so that it incorporates a generation-dependent Q matrix.
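As a small illustration of the majority decision of parameters referred to above, the following sketch predicts the expected dominant grammar from a population distribution; for the distribution of Table 5 it reproduces the G5 prediction. How a tie between parameter values is broken is an assumption of the sketch, not something specified in the paper.

```python
# A sketch of the "majority decision of parameters": grammar G_k is identified
# with the bit pattern (b2, b1, b0) of its index k, as in Section 2, and each
# bit of the predicted grammar is the population-weighted majority value.
def predict_dominant(x):
    """x: dict mapping grammar index (0..7) to relative abundance."""
    bits = []
    for bit in (2, 1, 0):                      # b2, b1, b0
        weight_one = sum(share for g, share in x.items() if (g >> bit) & 1)
        weight_zero = sum(x.values()) - weight_one
        bits.append(1 if weight_one > weight_zero else 0)   # tie -> 0 (assumed)
    return bits[0] * 4 + bits[1] * 2 + bits[2]

# The distribution discussed with Table 5: x1 = 0.375, x4 = 0.250, x7 = 0.375.
print(predict_dominant({1: 0.375, 4: 0.250, 7: 0.375}))     # -> 5, i.e. G5
```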
References
1. Arends, J., Muysken, P., Smith, N. (eds.): Pidgins and Creoles. J. Benjamins Publishing Co. (1995)
2. Bickerton, D.: Language and Species. The University of Chicago Press (1990)
3. Briscoe, E.J.: Grammatical Acquisition and Linguistic Selection. In: Briscoe, T. (ed.): Linguistic Evolution through Language Acquisition: Formal and Computational Models. Cambridge University Press (2002)
4. Chomsky, N.: Lectures on Government and Binding. Dordrecht: Foris (1981)
5. DeGraff, M. (ed.): Language Creation and Language Change. The MIT Press (1999)
6. Hashimoto, T., Ikegami, T.: Evolution of symbolic grammar systems. In: Moran, F., Moreno, A., Merelo, J.J., Chacon, P. (eds.): Advances in Artificial Life. Springer, Berlin (1995) pp. 812-823
7. Hashimoto, T., Ikegami, T.: Emergence of net-grammar in communicating agents. BioSystems 38 (1996) pp. 1-14
8. Komarova, N.L., Niyogi, P., Nowak, M.A.: The Evolutionary Dynamics of Grammar Acquisition. J. Theor. Biol. 209 (2001) 43-59
9. Lari, K., Young, S.J.: The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, Vol. 4 (1990) pp. 35-56
10. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press (1999)
11. Nowak, M.A., Komarova, N.L.: Towards an evolutionary theory of language. Trends in Cognitive Sciences 5(7) (2001) pp. 288-295
12. Pereira, F., Schabes, Y.: Inside-Outside reestimation from partially bracketed corpora. Proceedings of ACL (1992)
13. Pinker, S.: The Language Instinct. W. Morrow & Co. (1994)
14. Todd, L.: Pidgins and Creoles. Routledge & Kegan Paul (1974)
Generalized Musical Pattern Discovery by Analogy from Local Viewpoints

Olivier Lartillot

Ircam - Centre Pompidou, Place Igor-Stravinsky, 75004 Paris, France
[email protected]
Abstract. Musical knowledge discovery, an important issue of digital network processing, is also a crucial question for music. Indeed, music may be considered as a kind of network. A new approach for Musical Pattern Discovery is proposed, which tries to consider musical discourse in a general polyphonic framework. We suggest a new vision of automated pattern analysis that generalizes the multiple viewpoint approach. Sharing the idea that pattern emerges from repetition, analogy-based modeling of music understanding adds the idea of a permanent induction of global hypotheses from local perception. Through a chronological scanning of the score, analogies are inferred between local relationships – namely, notes and intervals – and global structures – namely, patterns – whose paradigms are stored inside an abstract pattern trie. Basic mechanisms for inference of new patterns are described. Such an elastic vision of music enables a generalized understanding of its plastic expression.
1 Music Analysis and Knowledge Discovery
1.1 Music Is a Network
In the realm of digital networks, music takes a pivotal place. Indeed, current and future networks such as the Internet will feature a great amount of musical content that will have to be structured pertinently. That is why much research is now carried out in a new domain called Music Information Retrieval (MIR), focused on the automatic processing of music databases. But music itself, because of its combinational richness, may be considered as a dense and complex network: notes are linked together, forming higher structures which themselves communicate with each other. Musical styles may also be seen as super-networks unifying specific structures inside different pieces as occurrences of an element of a common language. A single musical piece may contain a huge quantity of information, implicitly understandable by listening but so far impossible to analyze and represent explicitly. That is why, even for traditional music analysis, "a new generation of computational techniques and tools is required to support the extraction and the discovery of useful knowledge"¹.
¹ Call for Papers of this conference.
1.2 Musical Pattern Discovery
For a machine to carry out musical analysis by itself, without any a priori knowledge, it would first have to be able to automatically discover simple musical patterns. Much research has recently been carried out in this domain, offering significant results, but without yet resolving the challenge in a general context. In our approach, and in those presented in this paragraph, musical expression is limited to scores. For non-written music, a preliminary – and rather intricate – automatic music transcription would be necessary. Gestalt principles – segmenting the musical discourse where some perceptual distance reaches maximal values – may be used for a broad segmentation of music into large streams and temporal blocks [2]. However, detailed segmentation of music is better ruled by repetition than by Gestalt [3]. In this poster, we will mainly focus on the repetition criterion. As musical patterns are usually not repeated exactly, an algorithm for pattern detection should cope with approximation. One major axis followed by current MIR research proposes to overcome this difficulty by considering multiple music representations (called viewpoints) in parallel [8] [4] [3]. According to this theory, an approximate repetition may be viewed as an exact repetition in one relevant viewpoint. However, this approach cannot explain how a pattern may be considered similar to a slightly stretched version of itself, as in Fig. 1. For this example, the repetition will only be found along the contour viewpoint, where only the slope between successive notes is considered (for this pattern: up then down). But this viewpoint is so loose that non-pertinent repetitions may also be found. Therefore, information selection is not sufficient for approximate repetition discovery. Repetitions should be actively detected, when some similarity distance exceeds a certain threshold of relevance.
Fig. 1. Two similar patterns
In most approaches, including multiple viewpoints, a pattern is considered as a sequence of contiguous elements. In such a restricted vision of music, long motives hidden behind dense accompaniments, for instance, would not be retrieved. We prefer a broader approach, where patterns are considered as sets of notes. The difficulty of such an approach, evidently, is computational complexity: simple segmentations and edit distances are no longer adequate. In many musical pattern detection algorithms, the chosen candidate patterns are those that feature the highest score. This score takes into account several independent characteristics – or "heuristics" – such as pattern length (prefer longer patterns) or number of occurrences (prefer the most frequently occurring patterns) [3]. This paradigm of pattern selection stems from the idea that only
a limited number of relevant characteristics of musical works have to be chosen. The remainder will be discarded following the constraint of result compactness. This strategy, though alleviating the information retrieval process, arbitrarily impedes a complex and subtle understanding of music. That is why we prefer another approach, where selection does not come from competition between candidate patterns, but rather from criteria applied individually to each candidate pattern. In a word, a pattern will not be chosen because it is the most perceived pattern, but because it can be perceived. Therefore, to pattern selection we prefer pattern detection.
2 A Cognitive Approach

2.1 Analogy
We suggest another method, highly developed in contemporary cognitive sciences and probably of high interest here, based on analogy. Indeed, "analogy-making lies at the heart of pattern perception and extrapolation" [5]. Analogy means that the similarity of two entities is hypothesized as soon as one or several resemblances of particular relevant aspects of them are detected. Human understanding heavily relies on such a mechanism, since it progressively constructs a representation of the phenomenon. The cognitive system has to take some risks, to induce – that is, to infer knowledge that is not directly present in the phenomenon – and then to check that this hypothesis is fulfilled. An analogy-based vision of understanding is even more accurate in a musical context. Indeed, music cannot be apprehended as a single object: at each step of music hearing, only a local aspect is presented. Thus the analogy hypothesis of music understanding means that the global music structure is inferred through induction of hypotheses from partial – in particular temporally local – points of view. The temporal characteristic of music implies another important aspect: a chronological order. A temporally oriented approach to musical pattern discovery implies that a pattern will be detected more easily if it is presented very clearly first and then hidden in a complex background than the reverse. Moreover, a distortion of a pattern impairs its detection more when it occurs at its beginning than at its end.
2.2 Induction
Here the pivotal framework of induction modeling proposed by Holland et al. [6] is of high relevance: induction is expressed along a generalized paradigm, founded not on formal logic but rather on a semantic network of concepts. New concepts arise from the interaction – competition and collaboration – of old ones. With each concept is associated not a single probabilistic value but several activation quantities, each relying on a different dimension of relevance, such as past activity, entrenchment of the concept inside the network, specificity with respect to the current context, etc.
This broad framework also includes an understanding of analogy itself. "Often the interpretation of an input requires the simultaneous activation and integration of multiple schemas", that is, of different possible analogs. We propose to adapt such a vision of the induction of analogies to the musical context.
3 The kanthume Framework
Our project aims at building a musical pattern detection system based on a cognitive modeling of the induction of analogies and of the temporal perception of music. In a first approximation, we limit our scope to pitches p, durations d and time onsets t. The score is scanned in temporal order and, for a same time onset, in increasing order of pitch.
3.1 Local Similarities
Local Viewpoint. As we said in paragraph 2.1, an analogy is the inference of an identity relation between two entities knowing some partial identity relation between them. Our discussion about the phenomenon of time in music leads to the idea that our experience of music is a temporal progression of partial points of view consisting of the temporally local perception of it. Thus the analogy hypothesis of music understanding means that the global music structure is inferred through induction of hypotheses from local viewpoints. First of all, it may be remarked that relations between notes play an important role in pattern perception. Indeed, transposition of a pattern – that is, translation of the whole pattern toward higher or lower pitch – does not change the perception of its inner shaping. Moreover, relations between more than two notes may be decomposed into relations between each possible couple of notes, namely intervals (n1, n2), whose parameters are the pitch interval (p2 − p1), the onset interval (t2 − t1) and the duration ratio (d2/d1). Our cognitive approach benefits from another core characteristic of the temporal perception of music. It seems that the perception of the relations – which we called intervals – between the successive notes of a pattern is made possible only by the fact that, when we hear a new note, the last ones that have just been heard remain active in our mind, buffered in a so-called short-term memory. The limited extension of short-term memory should be defined both in terms of time (no more than 15 sec., for instance) and size (no more than 7 elements, for instance). When chronologically scanning the score, for each successive instant t0, the current local musical viewpoint then consists of (a sketch of this construction is given below):
– the current notes n with their intrinsic parameters (p, d, t0),
– the current intervals, that is, relations between current notes, and between current notes and notes of the short-term past.
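A minimal sketch of this construction might look as follows; the note representation is an assumption, and the concrete memory bounds are the illustrative values mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: float      # p (e.g. MIDI pitch)
    duration: float   # d, in seconds (assumed positive)
    onset: float      # t, in seconds

# Illustrative short-term memory bounds (15 seconds, 7 elements).
MAX_AGE, MAX_SIZE = 15.0, 7

def local_viewpoint(all_notes, t0):
    """Return the current notes at time t0 and the intervals (pitch difference,
    onset difference, duration ratio) between short-term-memory or current
    notes and the current notes."""
    current = [n for n in all_notes if n.onset == t0]
    past = [n for n in all_notes if n.onset < t0]
    # keep only the most recent past notes within the memory bounds
    memory = [n for n in sorted(past, key=lambda n: n.onset)[-MAX_SIZE:]
              if t0 - n.onset <= MAX_AGE]
    intervals = [(n2.pitch - n1.pitch, n2.onset - n1.onset, n2.duration / n1.duration)
                 for n1 in memory + current for n2 in current if n1 is not n2]
    return current, intervals
```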
Associative Memory. Each parameter of each element of the local viewpoints – pitch, duration and onset of current notes and intervals – may be a characteristic property that, if retrieved later, will recall the considered element. For example, if two distant notes have the same pitch, the later note may – if some constraints are fulfilled – induce a recall of the former one. Such a remembering mechanism is possible only if there exists an associative memory that links all elements sharing a same characteristic. In our framework, this may be formalized by considering parallel partitions of the set of notes of the score for each possible parameter (pitch, onset, duration). Memorizing a local viewpoint would then simply mean adding the current notes and intervals to the equivalence classes corresponding to each current value of pitch, duration, onset, etc.
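One possible (hypothetical) formalisation of this associative memory is a dictionary of equivalence classes per parameter:

```python
from collections import defaultdict

# One partition (here: a dictionary of equivalence classes) per parameter, so
# that every note or interval sharing a parameter value can later be recalled.
class AssociativeMemory:
    def __init__(self, parameters=("pitch", "duration", "onset")):
        self.classes = {p: defaultdict(list) for p in parameters}

    def memorize(self, element_id, values):
        """values: dict mapping parameter name to its value for this element."""
        for param, value in values.items():
            self.classes[param][value].append(element_id)

    def recall(self, param, value):
        """All previously memorised elements sharing this parameter value."""
        return list(self.classes[param][value])

memory = AssociativeMemory()
memory.memorize("note-1", {"pitch": 60, "duration": 1.0, "onset": 0.0})
memory.memorize("note-7", {"pitch": 60, "duration": 0.5, "onset": 4.0})
print(memory.recall("pitch", 60))   # ['note-1', 'note-7']
```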
3.2 Abstract Pattern Trie
A pattern has been defined as a set of notes that is repeated (exactly or varied) in the score. All these repetitions, including the original first pattern, are considered as occurrences – or actual patterns – of a single abstract pattern. Since an order relation exists in each dimension of the score space (namely pitch and time onset), the notes inside a set of notes may be ordered through an induced order relation, consisting of a main temporal order – which possesses actual cognitive significance – composed with a secondary (arbitrary) pitch order, for formal reasons. In this way, sets of notes may be displayed as strings. As we showed that relations between notes can be decomposed into relations between couples, or intervals, sets of notes may then be displayed as strings of intervals between each successive note. This is called the minimal interval representation. In the following, intervals will be represented in semitones.
Fig. 2. Polyphonic pattern, its corresponding minimal interval representation, and its corresponding branch (with grey nodes) in one possible APT
A collection of patterns – whether musical or not – may be represented as a trie, where each node is a character (a note here) and where, for each branch, the ordered set of notes from the root to the leaf is a pattern. Thanks to this kind of representation, two patterns that have a same prefix will share a same part of a branch. This has cognitive grounds, since patterns cannot be discriminated when only their common prefix is heard. That is why the collection of abstract patterns
– in minimal interval representation – related to a score will be represented inside such a trie, called the abstract pattern trie (APT). With each node of the APT is associated a value, its presence, which indicates the number of actually observed occurrences of the pattern represented by the branch from the root to this node. Each local viewpoint note in the score that terminates one or several actual patterns is linked to their corresponding abstract patterns in the APT. Conversely, with each node of the APT are associated its corresponding actual patterns in the score.
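A minimal sketch of such an abstract pattern trie, with presence counters and back-links to occurrences, could look as follows; the interface names are hypothetical.

```python
# Each node corresponds to a prefix of a pattern in minimal interval
# representation; it carries a 'presence' counter (number of observed
# occurrences of that pattern) and back-links to its occurrences in the score.
class APTNode:
    def __init__(self):
        self.children = {}     # interval (in semitones) -> APTNode
        self.presence = 0
        self.occurrences = []  # e.g. indices of the terminating notes

class AbstractPatternTrie:
    def __init__(self):
        self.root = APTNode()

    def register(self, intervals, end_position):
        """Record one occurrence of the pattern given as a string of intervals.
        Prefix nodes receive their own counts when their (shorter) patterns
        are registered."""
        node = self.root
        for interval in intervals:
            node = node.children.setdefault(interval, APTNode())
        node.presence += 1
        node.occurrences.append(end_position)
        return node

apt = AbstractPatternTrie()
apt.register((+2, -1), end_position=5)   # intervals in semitones
apt.register((+2, -1), end_position=12)  # a second occurrence shares the branch
apt.register((+2, +3), end_position=20)  # shares only the first node
```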
3.3 Pattern Discovery
The pattern discovery mechanism may be described in three progressive steps (see [7] for details).
Pattern Initiating. If, following the associative mechanism described in paragraph 3.1, one of the recalled intervals is sufficiently similar (in the sense explained in paragraph 3.4) to the current interval, a new abstract interval, representing a paradigm of these two similar occurrences, is inferred. As this interval is considered as the beginning of a new pattern, a new node is added to the root of the APT.
Pattern Extending. For each note m in the short-term memory, its associated abstract patterns are recalled (or, we may say, are still active in memory), and so are the analogous patterns (that is, the other instances of these abstract patterns). The current interval, between the considered short-term note m and the current note n, is then compared to the possible continuations of these analogous patterns. When one continuation is found to be sufficiently similar to the current interval (m, n), the abstract pattern is extended by this new interval.
Pattern Confirming. Further occurrences of an already induced pattern are simply detected through a progressive scanning of the corresponding branch of the APT: each successive interval of the occurrence is automatically associated to its corresponding node in the APT.
3.4 Interval Comparison
The similarity of the current interval to an abstract or previous interval is expressed along several dimensions: pitch interval, onset interval, etc. In a first approach, a single interval distance is considered, consisting of the sum of the pitch distance and the onset distance. For intervals to be considered as similar, their distance has to stay below a fixed threshold. Moreover, a long pattern accepts more freely varying continuations, because the context is more firmly grounded. That is, the value of this threshold is increased proportionally to the length of the abstract pattern. The threshold is also increased if the abstract pattern is linked to a great number of occurrences.
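A possible reading of this comparison, with purely illustrative constants (the paper does not give numerical values), is sketched below.

```python
# Illustrative constants, not taken from the paper.
BASE_THRESHOLD = 1.0
LENGTH_BONUS = 0.2       # per interval already in the abstract pattern
OCCURRENCE_BONUS = 0.05  # per known occurrence of the abstract pattern

def interval_distance(a, b):
    """a, b: (pitch_interval, onset_interval) pairs; distance is the sum of
    the pitch distance and the onset distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def similar(current, candidate, pattern_length, occurrences):
    """The threshold grows with the pattern length and its occurrence count."""
    threshold = (BASE_THRESHOLD
                 + LENGTH_BONUS * pattern_length
                 + OCCURRENCE_BONUS * occurrences)
    return interval_distance(current, candidate) <= threshold

print(similar((2, 1.0), (3, 1.5), pattern_length=4, occurrences=3))  # True
```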
3.5 Pattern of Patterns
Now, patterns may themselves be considered as elementary objects, and patterns of patterns may be discovered in a recursive way. The trouble is that, as the definition of pattern used the concept of interval between notes, patterns of patterns need the rather esoteric concept of interval between patterns. A first approach considers patterns of abstract pattern types. A broader approach should compare all internal parameters within the occurrences themselves. If we consider the example of Figure 3, where each box contains a pattern, we can see that a pattern may be either a motive or a chord. Moreover, patterns of patterns are defined here with abstract types (pattern of chords, for instance) and at the same time with internal parameters (same interval between successive chords). Finally, a pattern of patterns of patterns is also possible here because of the selection of pertinent knowledge (here: the repetition of Eb-D).
Fig. 3. Beethoven’s 5th Symphony, beginning
3.6 Semantic Network
In this way, a network linking all the notes with abstract concepts is woven on the score. This can be considered as a kind of semantic network, where knowledge is distributed throughout the network, and where implicit knowledge is deduced from the network configuration itself. This implicit knowledge has to be controlled by the system, or redundant and non-pertinent information would be added and "mud" would overwhelm the network. It is therefore necessary to design mechanisms that avoid the inference of new concepts already subsumed by old ones. That is: do not induce abstract patterns that already exist. Such a mechanism is added at every step of the pattern discovery algorithm presented in paragraph 3.3.
4 Results and Prospective
The algorithm presented in the previous paragraphs is being implemented as a library called kanthume² of the music representation processing software OpenMusic [1]. First, a MIDI file of the musical work to be analyzed is loaded. During
² See kanthume webpage: http://www.ircam.fr/equipes/repmus/lartillot
the analysis, all the inferences induced by the machine are displayed. The first results seem very encouraging and tend to prove the pertinence of the method. Most motives are detected, even when they are transposed or temporally stretched. Repetitions of whole patterns are currently detected only when each successive interval is detected first. Sometimes, however, the first interval of a pattern does not suffice for inducing the presence of this pattern. It would then be preferable to allow the detection of a pattern even after several notes of its beginning have been scanned. This may be envisaged by considering a progressive activation of old recalled intervals: once these activations exceed a certain threshold, the inference is made. For this broader approach, the activation of every past note should be updated at every moment in a parallel way. Here a network approach, closer to actual cognitive modeling, would be necessary. In addition to the problem of designing the cognitive model, there is the question of interface and ergonomics. The result of the analysis has to be displayed graphically as a network of relations above the score itself. Because of its complexity – it cannot be displayed graphically in full and is in fact hard for a human to grasp – this network should not be displayed entirely, but only in part. The user should be able to navigate inside this network by choosing temporal objects and the hierarchical level of the network. To the current first inquiry into Musical Pattern Discovery will then be added the question of musical form and, why not, of Music Theory Discovery. Acknowledgments. This project is carried out in the context of my PhD, directed by Emmanuel Saint-James (LIP6, Paris VI) and Gérard Assayag (Musical Representations, Ircam).
References
1. Assayag, G., Rueda, C., Laurson, M., Agon, C., Delerue, O.: Computer Assisted Composition at Ircam: From PatchWork to OpenMusic. Computer Music Journal 23(3), 59-72
2. Bregman, A.S.: Auditory Scene Analysis. Cambridge, MA: MIT Press (1990)
3. Cambouropoulos, E.: Towards a General Computational Theory of Musical Structure. Ph.D. thesis, The University of Edinburgh (1998)
4. Conklin, D., Anagnostopoulou, C.: Representation and Discovery of Multiple Viewpoint Patterns. International Computer Music Conference (2001)
5. Hofstadter, D.: Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. Basic Books (1995)
6. Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R.: Induction: Processes of Inference, Learning, and Discovery. The MIT Press (1989)
7. Lartillot, O.: Integrating Pattern Matching into an Analogy-Oriented Pattern Discovery Framework. International Symposium on Music Information Retrieval (2002)
8. Smaill, A., Wiggins, G.A., Harris, M.: Hierarchical Music Representation for Composition and Analysis. Computing and the Humanities Journal (1993)
Using Genetic Algorithms-Based Approach for Better Decision Trees: A Computational Study

Zhiwei Fu

Fannie Mae, 4000 Wisconsin Avenue NW, Washington, DC 20016, USA
[email protected]
Abstract. Recently, decision tree algorithms have been widely used in dealing with data mining problems to find out valuable rules and patterns. However, scalability, accuracy and efficiency are significant concerns regarding how to effectively deal with large and complex data sets in the implementation. In this paper, we propose an innovative machine learning approach (we call our approach GAIT), combining genetic algorithm, statistical sampling, and decision tree, to develop intelligent decision trees that can alleviate some of these problems. We design our computational experiments and run GAIT on three different data sets (namely Socio-Olympic data, Westinghouse data, and FAA data) to test its performance against standard decision tree algorithm, neural network classifier, and statistical discriminant technique, respectively. The computational results show that our approach outperforms standard decision tree algorithm profoundly at lower sampling levels, and achieves significantly better results with less effort than both neural network and discriminant classifiers.
1 Introduction

Decision tree algorithms are one of the most popular ways of finding valuable patterns in a data set. Often in data analysis, a subset of the entire data set has to be extracted for analysis; a decision tree can then be generated as a representative of the entire data set. One main drawback of this approach is that the generated tree can only capture general information and may not be good enough to capture the valuable patterns of the data set. To address accuracy, efficiency, and effectiveness, we propose in this paper a genetic algorithm approach to build smart decision trees (called GAIT) that mitigates some of the problems associated with subset-based approaches. Preliminary computational results for the three different data sets show that GAIT performs well, as compared with other approaches.
2 Background

A genetic algorithm (see DeJong 1980 and Goldberg 1988) is an adaptive search technique that was introduced by Holland (1975). One of the key factors affecting the
success and efficiency of a genetic algorithm is an appropriate representation of the problem space. A number of representations have been developed including traditional bit string representations (see Holland 1975), real-valued parameters (see Janikow and Michalewicz 1991), and permutation (see Grefenstette et al., 1985). Genetic algorithms have been used in a variety of applications including combinatorial optimization and knowledge discovery (see Fayyad et al., 1996). Kennedy et al. (1997) first developed a genetic algorithm for decision trees (their program is called CALTROP). Kennedy et al. represented a decision tree as a linear chromosome and applied a genetic algorithm to construct the tree. Their algorithm represented a binary tree by a number of unit subtrees (called a caltrop), each having a root node and two branches. Each caltrop consisted of three integers that represented the root, the left child, and the right child of the subtree. A chromosome was made up of a set of caltrops, each of which was an indivisible unit that could not be split by crossover. We point out that this representation of the tree by the string of caltrops had limitations. Each tree was represented by a string of size 3n (where n was the number of variables). The first string represented the root of the tree. After that, the order of the caltrops did not matter. This representation required that when a variable showed up at different locations in the tree, the course of action following a binary split was identical. As a result, the proposed schemes only represented trees with this specific property and not all decision trees. To overcome this limitation, we represent decision trees as binary (or non-binary) trees and work directly on the tree representation to perform crossover and mutation. This approach has both the advantage of representing the full range of decision trees, as well as being computationally more efficient for implementing a genetic algorithm. In this paper, we restrict our discussion to binary decision trees, but note that all our work easily generalizes to non-binary trees.
3 GAIT Methodology

In GAIT, we pursue intelligent search using a genetic algorithm to extract valuable structures and generate decision trees that are better in classifying the data.

3.1 GAIT

We now describe GAIT, a new genetic algorithm for intelligent decision trees that overcomes the limitations of decision tree algorithms. The flow chart of GAIT is given in Figure 1. First, we generate a set of diverse decision trees from different (not necessarily mutually exclusive) subsets of the original data set by using C4.5 (see Quinlan 1993) on small samples of the data. These decision trees are taken as inputs to GAIT. There are three key issues involved in the genetic algorithm.
1) The initial population. It is important that the initial population contain a wide variety of structures for genetic operations. In GAIT, the initial population is generated by using C4.5 on random samples of the data.
2) Operations. Selection, crossover, and mutation are the three major operations that we use in our algorithm. We determine the subtrees for crossover by random selection. Similarly, mutation is performed by exchanging a subtree or a leaf with a subtree or a leaf within the same tree.
3) Evaluation. In GAIT, we evaluate the fitness of the decision trees by calculating the percentage of correctly classified observations in the validation set.
During evolution, some of the decision trees might not be feasible after crossover and mutation. The validation operation is carried out at the end of each generation in order to improve the computational efficiency. The validation operation of GAIT involves the elimination of any logic violations, as shown in Figure 2. In this example, there is a logic violation between X1>4 and X1<=1 in the tree on the left. Consequently, the subtree associated with X1<=1 would be eliminated (specifically, the branches following X5>7 and X5<=7, shown in Figure 2 before validation, would be logically infeasible) and the subtree associated with X1>1 would be concatenated, since X1>4 implies X1>1. After validation, all feasible candidate decision trees are pruned to increase the ability of the tree to classify other data sets. Pruning starts from the bottom of the tree and examines each non-leaf subtree. Each subtree is replaced with a leaf if the replacement results in equal or lower predicted error rates. We coded our algorithm in Microsoft Visual C++ 5.0 and ran our experiments on a Windows 95 PC with a 400 MHz Pentium II processor and 128 MB of RAM.
Fig. 1. Flow chart of GAIT methodology: generate random samples from the original data set → build decision trees as the initial population → genetic operations (selection, crossover, mutation) → validation & evaluation → output decision trees.
Fig. 2. Validation operation implemented in GAIT before validation (figure on the left) and after the validation (figure on the right).
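The validation step can be pictured with a small sketch (an illustration, not the authors' implementation): internal tests that contradict the constraints accumulated along the path from the root are removed and replaced by their only feasible subtree, as in the X1 > 4 / X1 <= 1 example of Figure 2. Pruning is omitted here.

```python
import math

class Leaf:
    def __init__(self, label):
        self.label = label

class Split:
    def __init__(self, var, threshold, left, right):
        self.var, self.threshold = var, threshold   # left branch: var <= threshold
        self.left, self.right = left, right         # right branch: var > threshold

def validate(node, bounds=None):
    """bounds: var -> (low, high], the region of values that can reach this node.
    A test that can never be satisfied is replaced by its only feasible child."""
    if isinstance(node, Leaf):
        return node
    bounds = dict(bounds or {})
    low, high = bounds.get(node.var, (-math.inf, math.inf))
    if node.threshold <= low:            # 'var <= threshold' can never hold here
        return validate(node.right, bounds)
    if node.threshold >= high:           # 'var > threshold' can never hold here
        return validate(node.left, bounds)
    left_bounds, right_bounds = dict(bounds), dict(bounds)
    left_bounds[node.var] = (low, min(high, node.threshold))
    right_bounds[node.var] = (max(low, node.threshold), high)
    node.left = validate(node.left, left_bounds)
    node.right = validate(node.right, right_bounds)
    return node

# The situation of Figure 2: the X1 > 4 branch contains an X1 <= 1 test.
tree = Split(1, 4, Leaf("A"), Split(1, 1, Leaf("B"), Split(3, 2, Leaf("C"), Leaf("D"))))
tree = validate(tree)    # the infeasible 'X1 <= 1' test disappears; its X1 > 1 subtree is kept
```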
4 Computational Experiments
4.1 Experimental Design

In order to test the performance of GAIT, we designed two experiments on three data sets – Socio-Olympic data, Westinghouse data, and FAA data – to test and compare GAIT's performance against C4.5, a neural network classifier and a statistical discriminant classifier. To evaluate the performance of GAIT, we first run GAIT and C4.5 with sampling on the Socio-Olympic data in Experiment 1. The Westinghouse data and FAA data have a complicated structure, and the classification task is not trivial for standard classification approaches. Thus, in Experiment 2, we run GAIT, the neural network classifier, the C4.5 algorithm, and a traditional statistical discriminant classifier on both data sets, and compare their performance using classification accuracy and computing times.

4.2 Data

4.2.1 SOCIO-OLYMPIC DATA
Condon (1997) collected data from the 1996 Atlanta Summer Olympic Games, in which 11,000 athletes from 197 countries participated in 271 events from 26 sports. The total score for each country was calculated by assigning points to the top eight placing countries for each of the 271 Olympic events. Socio-economic information (e.g., population and national product) was used to predict the total score. In the Socio-Olympic data set, there are seventeen independent variables (e.g., area, population, birth rate, life expectancy, railroads, GNP per capita, imports, electric production, electric consumption, etc.) used in our experiments. We classify the participating countries into three categories based on the total score attained and predicted.

4.2.2 WESTINGHOUSE DATA
The Westinghouse data we used is manufacturing data from Westinghouse Electric Corp. The wire bonding process is a crucial step in manufacturing microcircuits. It uses ultrasonic power to weld a very thin gold wire to areas on a substrate. Currently, wire bonds are often screened for quality by visual inspection and a non-destructive pull-strength test. Both screening methods are costly, time-consuming and not highly reliable. In our experiment, we use ultrasonic data gathered during the production of 3,072 wire bonds at Westinghouse Electric Corp. to construct and test GAIT trees. In the Westinghouse data, there are fifty-four independent variables.

4.2.3 FAA DATA
Inclement weather conditions at an airport may cause a decrease in the arrival capacity of that airport. The number of flights that an airport can accept in any given hour is referred to as the Airport Acceptance Rate (AAR). The data in our experiments were based on observations in 1996 at San Francisco's airport, collected from the Federal
Aviation Agency (FAA). In the FAA data set, there are 812 observations and five independent variables. Our data set contains observations of weather conditions, such as ceiling and visibility, on days where the capacity was reduced, along with the planned airport acceptance rates.

4.3 Parameter Settings

In our experiments, we set the probability of crossover at 100%, the mutation rate at 1%, and the stopping criterion as a number of generations of 50. We use the default settings for the C4.5 algorithm. For the neural network classifier, we adopt three-layer feed-forward systems with different numbers of input, hidden and output nodes correspondingly. We employ Sigmoid activation functions with different slopes, back-propagation learning rules with a decreasing learning rate, and network pruning. In addition, we construct a quadratic discriminant classifier, which is one of the most commonly used forms of discriminant models. As the nearest neighbor method is commonly cited as a basis of comparison with other methods, we used the k-nearest neighbor method (k=3) with the Euclidean distance matrix. The discriminant classifier is constructed and run with the SAS 6.12 program.

4.4 Computational Results

We show the computational results of Experiment 1 in Table 1 and Table 2. It is clear that, for any combination of population size and sampling percentage, the trees generated by GAIT are at least as accurate as the trees generated by C4.5 with sampling. The improvement in accuracy ranges from 0% to 9%. We notice that GAIT trees usually differ from the trees constructed by C4.5, and the difference often extends to the variable at the root node. This could be due to the nature of the search methods. Table 3 and Table 4 show the computing times of C4.5 and GAIT. We observe that GAIT takes longer than C4.5; GAIT's computing time increases in proportion to the population size, and GAIT converges relatively fast.

Table 1. Classification Accuracy of Best C4.5 Tree for different Sampling Percentage and Population Size on the Socio-Olympic Data
  Sampling \ Pop. size    10      50      60      70      80      90     100
     5%                0.6462  0.6462  0.6462  0.6462  0.6462  0.6462  0.6462
    10%                0.8462  0.8462  0.8462  0.8462  0.8462  0.8462  0.8462
    20%                0.8359  0.8462  0.8462  0.8462  0.8462  0.8462  0.8462
    30%                0.8462  0.8769  0.8769  0.8769  0.8769  0.8769  0.8769
    40%                0.8769  0.8872  0.8872  0.8974  0.8974  0.8974  0.8974
    50%                0.8410  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128
    60%                0.8974  0.9026  0.9026  0.9026  0.9128  0.9128  0.9128
Table 2. Classification Accuracy of Best GAIT Tree for different Sampling Percentage and Population Size on the Socio-Olympic Data
  Sampling \ Pop. size    10      50      60      70      80      90     100
     5%                0.7385  0.7385  0.7385  0.7385  0.7385  0.7385  0.7385
    10%                0.8667  0.8769  0.8872  0.8872  0.8974  0.8769  0.8974
    20%                0.8462  0.8923  0.8976  0.8769  0.8769  0.8923  0.8976
    30%                0.8667  0.8821  0.9128  0.9179  0.8872  0.9128  0.9179
    40%                0.8974  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128
    50%                0.8821  0.9128  0.9231  0.9282  0.9128  0.9179  0.9231
    60%                0.9077  0.9077  0.9179  0.9179  0.9231  0.9179  0.9333
Table 3. Computing Time of C4.5 (seconds) for different Sampling Percentage and Population Size on the Socio-Olympic Data
  Sampling \ Pop. size    10      50      60      70      80      90     100
     5%                  0.8     2.6     2.7     3.0     3.4     3.7     4.0
    10%                  0.8     3.4     3.6     3.9     4.6     4.5     5.3
    20%                  0.9     3.7     4.0     4.1     5.1     5.4     5.9
    30%                  1.0     3.9     4.1     4.4     5.7     5.6     6.3
    40%                  1.1     4.0     4.3     4.9     6.0     6.1     6.8
    50%                  1.1     4.2     4.7     5.6     6.1     6.9     7.7
    60%                  1.2     4.5     5.2     6.0     7.0     7.6     8.5
Table 4. Computing Time of GAIT (seconds) for different Sampling Percentage and Population Size on the Socio-Olympic Data
  Sampling \ Pop. size    10      50      60      70      80      90     100
     5%                 14.8    70.6    86.8    98.4   112.4   130.3   144.7
    10%                 15.0    72.9    87.8   101.9   116.8   131.1   146.3
    20%                 15.1    73.2    88.7   102.3   117.4   131.5   146.9
    30%                 15.1    73.6    89.8   102.8   118.3   132.8   147.9
    40%                 15.2    74.6    89.5   103.7   119.0   133.4   148.5
    50%                 15.3    74.7    90.1   104.9   119.8   135.2   149.6
    60%                 15.5    75.3    90.8   105.8   121.5   138.0   150.5
We report the computational results of Experiment 2 in Tables 5, 6, 7, and 8. We see that GAIT, in general, has the best accuracy, and Discriminant (the statistical discriminant classifier) has the worst. We observe from Table 5 and Table 6 that GAIT achieves the best classification accuracy on both data sets. While the accuracy of the NN Classifier (i.e., the neural network classifier) is second only to GAIT, it takes more computing effort (see Table 7 and Table 8). While all the classifiers perform well on the Westinghouse data (see Table 5), only GAIT obtains decent accuracy on the FAA data (see Table 6). Discriminant takes the least computing time, but at the expense of accuracy (see Table 7 and Table 8). We
conducted paired-difference t-tests on classification accuracy, and the p-values indicate that GAIT significantly outperforms all other approaches on both data sets at the 1% significance level. The NN Classifier significantly outperforms the remaining approaches on the Westinghouse data at the 1% significance level. Second to GAIT, C4.5 outperforms the other approaches at the 1% significance level on the FAA data set.

Table 5. Classification Accuracy in Experiment 2 on Westinghouse Data
                   10%     20%     30%     40%     50%     60%     70%
  GAIT            88.92   91.58   93.65   94.80   95.57   95.89   96.22
  NN Classifier   87.92   90.34   91.85   93.28   93.89   94.77   95.20
  Discriminant    82.92   87.04   88.85   90.66   91.48   92.67   93.00
  C4.5            83.23   87.67   89.71   91.45   92.24   93.47   94.02
Table 6. Classification Accuracy in Experiment 2 on FAA Data
                   10%     20%     30%     40%     50%     60%     70%
  GAIT            68.00   77.80   81.86   83.49   85.07   85.70   86.00
  NN Classifier   60.52   66.40   71.52   73.82   75.89   76.26   77.44
  Discriminant    43.50   54.05   63.81   68.68   70.08   71.37   71.80
  C4.5            66.40   72.68   75.78   77.58   78.97   79.66   80.20
Table 7. Computing Time (seconds) in Experiment 2 on Westinghouse Data
                   10%     20%     30%     40%     50%     60%     70%
  GAIT             300     421     522     627     798     897     965
  NN Classifier    403     518     723     895    1180    1769    2145
  Discriminant      53      69      80      97     115     153     214
  C4.5              70      82     101     213     401     584     943
Table 8. Computing Time (seconds) in Experiment 2 on FAA Data
                   10%     20%     30%     40%     50%     60%     70%
  GAIT             190     241     322     438     602     734     805
  NN Classifier    233     284     362     525     823    1198    1680
  Discriminant      37      45      50      63      77      93     125
  C4.5              61      67      79      93     138     190     243
5 Concluding Remarks

We have demonstrated that better decision trees can be built by integrating a well-designed genetic algorithm with decision tree and sampling algorithms. In addition, our
approach produces the same level of accuracy as a standard decision tree algorithm at significantly lower sampling percentages. This indicates that GAIT is likely to scale well and to be effective for large-scale data mining. We also demonstrate the robustness of GAIT on other data sets as compared with standard decision tree algorithm, neural network classifier, and statistical discriminant classifier. In future work, we will run GAIT on large data sets to test its scalability. We will also focus on extracting, aggregating, and enhancing valuable rules discovered by the program.
References
1. Condon, E. (1997). Predicting the Success of Nations in the Summer Olympics Using Neural Networks. Master thesis, University of Maryland, College Park.
2. DeJong, K. (1980). Adaptive system design: A genetic approach. IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-10, No. 9, 566-574.
3. Fayyad, U., Piatetsky-Shapiro, G., Smith, P., & Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. Cambridge, MA: MIT Press.
4. Fu, Z. (2000). Using Genetic Algorithms to Develop Intelligent Decision Trees. Doctoral dissertation, University of Maryland, College Park.
5. Goldberg, D. (1988). Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.
6. Grefenstette, J., Gopal, R., Rosmaita, B., & Van Gucht, D. (1985). Genetic algorithms for the traveling salesman problem. Proceedings of the 1st International Conference on Genetic Algorithms (Pittsburgh, PA, 1985). Hillsdale, NJ: Erlbaum, 160-168.
7. Holland, J. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.
8. Janikow, C., & Michalewicz, Z. (1991). An experimental comparison of binary and floating point representations in genetic algorithms. Proceedings of the 4th International Conference on Genetic Algorithms (San Diego, CA, 1991). San Mateo, CA: Morgan Kaufmann, 31-36.
9. Kennedy, H., Chinniah, C., Bradbeer, P., & Morss, L. (1997). The construction and evaluation of decision trees: A comparison of evolutionary and concept learning methods. Evolutionary Computing, D. Corne and J. Shapiro (eds.), Lecture Notes in Computer Science. Berlin: Springer-Verlag, 147-161.
10. Quinlan, J.R. (1993). C4.5: Programming for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Handling Feature Ambiguity in Knowledge Discovery from Time Series

Frank Höppner

Department of Computer Science
University of Applied Sciences Braunschweig/Wolfenbüttel
Salzdahlumer Strasse 46/48, D-38302 Wolfenbüttel, Germany
[email protected]
Abstract. In knowledge discovery from time series abstractions (like piecewise linear representations) are often preferred over raw data. In most cases it is implicitly assumed that there is a single valid abstraction and that the abstraction method, which is often heuristic in nature, finds this abstraction. We argue that this assumption does not hold in general and that there is need for knowledge discovery methods that pay attention to the ambiguity of features: In a different context, an increasing segment may be considered as (being part of) a decreasing segment. It is not a priori clear which view is correct or meaningful. We show that the relevance of ambiguous features depends on the relevance of the knowledge that can be discovered by using the features. We combine techniques from multiscale signal analysis and interval sequence mining to discover rules about dependencies in multivariate time series.
1 Introduction
Although a complex system is difficult to forecast or model as a whole, such systems (or subsystems thereof) very often cycle through a number of internal states that lead to repetitions of certain patterns in the observed variables. Discovering these patterns may help a human to resolve the underlying causal or temporal relationships and find local prediction rules. Rather than trying to explain the behaviour of the variables globally, we therefore seek local dependencies in the data. Now, consider a time series in some neighbourhood around t and assume an increasing trend in this area. This information is probably more reliable than the value at t alone, because a small amount of white noise will not turn an increasing trend into a decreasing trend. There are, however, other effects than white noise, and thus we must be aware that low-frequency disturbances may have caused this increasing trend, while when looking at a coarser scale (zooming out) we may have a decreasing trend. While our assumption is that there is always a unique sequence of states a system has run through while producing its output, it
This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) under grant no. Kl 648/1.
is extremely difficult to find the segmentation that corresponds to this sequence, because this means that we have to distinguish all kinds of disturbances to recover the unknown true signal. The difficulty is to distinguish between subtle but important features and uninteresting noise (which is not necessarily Gaussian). This is illustrated by the time series in Fig. 1. There are two major peaks, and right before the decreasing flank of each peak there is a small peak (and thus a short increasing trend) in both cases. Is that a coincidence? Or is it important? Do we want a heuristic time series conversion procedure to decide about that? Better not, because if the peaks are falsely discarded, even the most powerful pattern detection mechanism will not be able to recover their importance. On the other hand, we do not want to consider every noisy data point separately, since this would increase the computational cost of the pattern discovery process significantly. The importance of such a small peak can only be shown a posteriori by some interesting pattern that uses the peak. Surprisingly, most of the work in KDD from time series, e.g. [3,8,4], uses a single abstraction of the time series, and thus the decision whether such patterns will be detected or not is implicitly handed over to the abstraction method. In this paper we provide an extension of [4] that takes ambiguity into account.
Fig. 1. A noisy time series.
2 An Overview of the Approach
In this section we outline our approach to knowledge discovery from time series and the contribution of this paper. At the beginning, the time series are converted into a higher-level representation, that is, a labeled interval sequence with labels addressing aspects like slope or curvature. The sequence consists of triples (b, f, s) denoting an interval [b, f] in time for which the description s ∈ S holds (S denotes the set of all possible trends or properties that we distinguish). We use Allen’s interval relationships [1] to describe the relationship between intervals; for example, we say “A meets B” if interval A terminates at the same point in time at which B starts. In the following we denote the set of interval relations as shown in Fig. 2 by I. A temporal pattern of size k is defined by a pair P = (s, R), where s : {1, .., k} → S denotes the labels of the intervals (s(i) is the label of interval #i), and R ∈ I^(k×k) denotes the relationships between the intervals ([bi, fi] and [bj, fj] are in relation R[i, j]). By dim(P) we denote the number of intervals k of the pattern P; we also say that P is a k-pattern. The whole sequence is sorted lexicographically [7] and intervals are numbered according to their position in the sorted list. Finding an embedding π
Fig. 2. Allen’s interval relationships: before [b] / after [a], meets [m] / is-met-by [im], overlaps [o] / is-overlapped-by [io], is-finished-by [if] / finishes [f], contains [c] / during [d], starts [s] / is-started-by [is], equals [=].
that maps interval #i of a pattern P to interval #j in the sequence (π(i) = j) such that all interval relationships hold corresponds to searching an instance of a temporal pattern in the interval sequence. To be considered interesting, a temporal pattern has to be limited in its extension. We therefore choose a maximum duration tmax, which serves as the width of a sliding window which is moved along the sequences. We consider only those pattern instances that can be observed within this window. We define the total time in which the pattern can be observed within the sliding window as the support supp(P) of the pattern P. We restrict our attention to so-called frequent patterns that have a support above a certain threshold. For the efficient support estimation of temporal patterns (see [4,7] for the details), we make use of a simple pruning technique: once we have observed an instance of a temporal pattern, we can stop any further checking until this instance disappears from the window. In order not to miss any instances, we must then be sure that we always find the earliest instance of a pattern, otherwise we lose the soundness: suppose there are two instances at t0 and t1. If we detect the instance at t1 and postpone any further checking until this instance disappears, it may happen that the instance at t0 has also disappeared before we perform the next check, and thus we lose the support in [t0, t1]. To ensure correctness we have shown

Theorem 1. Given two instances φ and ψ of the same pattern P, then π = min(φ, ψ) (i → min(φ(i), ψ(i))) is also an instance of P and bπ ≤ min(bφ, bψ) holds. (For any instance ϑ, bϑ is the point in time when we start to observe ϑ.)

Thus, finding the earliest instance of a pattern corresponds to finding the first mapping π (in a lexicographical ordering of vectors (π(1), π(2), ..., π(dim(P)))), and we make use of this fact in the subpattern test in [7]. There is one important point about our interval sequence that has not been mentioned so far. While we did not require that one interval has ended before another interval starts (which enables us to mix up several sequences into a single one), we required that every labeled interval (bi, fi, s) is maximal in the sense that there is no (bj, fj, s) in the sequence such that [bi, fi] and [bj, fj] intersect:

∀(bi, fi, si), (bj, fj, sj), i < j : fi ≥ bj ⇒ si ≠ sj    (1)
The idea is that whenever (1) is violated, we can merge both intervals and replace them by their union (min(bi , bj ), max(fi , fj ), s). However, in Fig. 1, we have discussed the question whether the decreasing flank of the major peaks shall
be considered as a single decreasing flank within some interval [a, d] or as a sequence of increasing, decreasing, and increasing intervals within [a, b], [b, c], and [c, d] (a < b < c < d). It is (1) that prevents us from composing our interval sequence out of all these intervals, because (1) does not allow [a, d] and [c, d] to have the same label. Unfortunately, we have used (1) in the proof of Theorem 1 – thus we lose the soundness of our approach if (1) is abandoned. In the remainder of this paper we discuss how to recover the soundness and efficiency of the approach when eliminating (1) and considering ambiguous data abstractions.
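Condition (1) can also be enforced mechanically before mining starts. The following minimal sketch is our own illustration under stated assumptions (the function name and the list-of-triples input are not from the paper); it merges intersecting intervals that carry the same label, which is exactly the repair step described after (1).

```python
def enforce_maximality(seq):
    """seq: list of (b, f, s) triples with b <= f.
    Returns a sequence in which no two intervals with the same label
    intersect, i.e. the maximality condition (1) holds."""
    merged = []
    for b, f, s in sorted(seq):            # process by increasing start time
        for k, (b2, f2, s2) in enumerate(merged):
            # same label and intersecting -> violation of (1): merge into the union
            if s2 == s and not (f2 < b or f < b2):
                merged[k] = (min(b, b2), max(f, f2), s)
                break
        else:
            merged.append((b, f, s))
    return sorted(merged)

print(enforce_maximality([(0, 5, "inc"), (4, 9, "inc"), (6, 8, "dec")]))
# [(0, 9, 'inc'), (6, 8, 'dec')]
```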
3 Multiscale Feature Extraction
By “feature” we refer to the labels of the interval sequence, such as increasing or decreasing trend. How can we create an ambiguous abstraction of a time series that uses such features? The transition between increasing and decreasing trend is indicated by a zero-crossing in the first derivative; however, noise introduces many zero-crossings, and the resulting representation would not correspond to the human perception of the profile. By analysing the signal at multiple scales, that is, at different degrees of smoothing, and comparing the abstractions against each other, a concise description of the time series (cf. Fig. 3) by means of an interval tree of scale (cf. Fig. 4) can be developed [9]. Each of the rectangles in the tree defines a qualitative feature (here either increasing or decreasing, indicated by ‘+’ and ‘-’). The horizontal extent of the rectangle defines the valid time interval for this feature and the vertical extent defines the valid range in scale. A large vertical extent indicates that the feature is perceived over a broad range of smoothing filters and thus represents a robust feature, whereas rectangles with small vertical extent correspond to noisy features. Due to lack of space, we neither explain the method in detail nor justify our choice, but refer the reader to [9,2] and [5], respectively.
Fig. 3. Wind strength over 10 days (wind strength plotted against time, 0–250).
Fig. 4. Interval Tree of Scale (time versus scale; rectangles labeled ‘+’ and ‘-’).
4 Handling Ambiguity in the Discovery Process
We can circumvent the elimination of (1) by renaming the labels of all intervals such that they reflect the scale from which they have been extracted. A label s ∈ S is no longer used alone but only in combination with a scale, like s-4 indicating that the description s holds for scale 4. While “A starts A” is not
allowed, “A-2 starts A-4” does not violate (1). However, this approach increases the data volume (if an interval labeled A survives over scales 1 to 5, we have to add it five times to the interval sequence, with labels A-1 to A-5) and does not match identical patterns on different scales (A-4 does not match A-3). Therefore, we prefer not to incorporate the scale in the label. Then it is sufficient to consider only one interval per rectangle in the interval tree of scale. To follow this approach we have to find a new subpattern check. The naive approach is to enumerate all occurrences of a temporal pattern and then yield the one that can be observed first; however, this appears to be inefficient due to the potential combinatorial explosion of possible embeddings. Let us recall how the visibility of a temporal pattern is determined [7]: a pattern becomes visible if the interval bound next to the rightmost interval bound coincides with the right bound of the sliding window, and invisible if the interval bound next to the leftmost interval bound coincides with the left bound of the sliding window. In previous work we have calculated the observation interval a posteriori, given a detected instance π of a pattern P. But since all instances of P have the same qualitative structure and thus the same order of the interval bounds, it is possible to compute in advance which are the two relevant bounds¹. Having identified these two bounds and the corresponding intervals #i and #j of P, we may calculate the observation interval of a potential instance a priori: we only have to know the two intervals that we intend to use for #i and #j rather than the embedding π of the whole pattern. Instead of a brute-force enumeration of all possible embeddings as in the naive approach, we can now organize the search for valid embeddings in such a way that the first match yields the earliest instance: for every pair of labels (s, t) and temporal relationship r, we maintain a list of matching interval pairs in the sliding window. These lists are kept up to date incrementally whenever the content of the sliding window changes. Thus, for any pattern P we easily find all possible pairs of intervals that determine the observation interval. Sorting these pairs by the beginning of the expected observation interval and using this list to control the search for possible embeddings guarantees that we will find the earliest instance first. Besides that, we use the lists of interval pairs to influence the backtracking depth of the subpattern check routine. For the remaining intervals of the pattern, we look for the interval pair with the smallest number of occurrences in the sliding window. The fewer the possibilities, the lower the probability of intensive backtracking.
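The pair lists can be pictured with a few lines of code. The sketch below is only an illustration under our own assumptions (it is not the author's implementation, and the window representation is hypothetical): Allen's relation of one interval with respect to another is derived from the end points using the codes of Fig. 2, and every interval pair in the current window is indexed by (label of the first interval, label of the second interval, relation).

```python
from collections import defaultdict

def allen(a, b):
    """Allen relation of interval a = (b1, f1) with respect to b = (b2, f2),
    using the codes of Fig. 2 ('b' before, 'm' meets, 'o' overlaps, ...)."""
    (b1, f1), (b2, f2) = a, b
    if f1 < b2:  return 'b'                    # before
    if f1 == b2: return 'm'                    # meets
    if b1 < b2 and b2 < f1 < f2: return 'o'    # overlaps
    if b1 < b2 and f1 == f2: return 'if'       # is-finished-by
    if b1 < b2 and f1 > f2:  return 'c'        # contains
    if b1 == b2 and f1 < f2: return 's'        # starts
    if b1 == b2 and f1 == f2: return '='       # equals
    if b1 == b2 and f1 > f2: return 'is'       # is-started-by
    if b1 > b2 and f1 < f2:  return 'd'        # during
    if b1 > b2 and f1 == f2: return 'f'        # finishes
    if b2 < b1 < f2 < f1:    return 'io'       # is-overlapped-by
    if b1 == f2: return 'im'                   # is-met-by
    return 'a'                                 # after

def index_pairs(window):
    """window: list of (b, f, label).
    Returns {(label_A, label_B, relation): [(i, j), ...]}."""
    idx = defaultdict(list)
    for i, (b1, f1, s1) in enumerate(window):
        for j, (b2, f2, s2) in enumerate(window):
            if i != j:
                idx[(s1, s2, allen((b1, f1), (b2, f2)))].append((i, j))
    return idx

print(index_pairs([(0, 4, 'inc'), (4, 7, 'dec'), (2, 9, 'inc')]))
```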
5 Loosely Connected Patterns
With the elimination of (1) more temporal patterns are allowed and thus we have implicitly increased the size of the pattern space. In this section we propose to remove some other patterns from the pattern space to compensate for this increase.
¹ The algorithm is somewhat troublesome but nevertheless straightforward. No sketch is given due to lack of space; contact the author for a detailed report.
In most other approaches where (local) time series similarity is decided with the help of a sliding window (e.g. dynamic time warping), to be considered similar the complete content of one window position has to match the complete content of another window position. In our approach, the window is not that restrictive: it mainly provides an upper bound for the temporal extent of patterns, while the content of the sliding window at different positions has to match only partially – depending on the pattern we are currently looking for. Therefore, it makes sense to experiment with larger sliding windows. Suppose that we have an interesting pattern P. The probability of observing the pattern “P overlaps A” or “P meets C” does not increase significantly with the window width: such patterns can already be observed with a smaller window width, and the number of occurrences increases only slightly as the window width is increased. However, patterns “P before A” or “A before P” become more frequent, simply because it is more and more likely that we can observe some more A instances as the width of the sliding window increases – far apart and without any relationship to P. Thus, a rule P → A does not necessarily indicate a relationship between P and A; it may have arisen from a large window width. We call a temporal pattern P loosely connected (LC-pattern) if the (undirected) graph G = (V, E) is connected, where the set of vertices is given by V = {1, .., dim(P)} and the set of edges is given by E = {(i, j) | R[i, j] ∈ IL} with IL = I \ {after, before}. Note that this definition does not exclude after and before relationships in P in general. Intuitively, a pattern is not loosely connected if we can draw a vertical line through the pattern without intersecting any interval, thereby separating the intervals into two unrelated groups. We want to restrict the pattern enumeration process to LC-patterns. To do this, we have to make sure that during candidate generation all and only LC-patterns are generated. For the pruning of candidate patterns in association mining we use the fact that any subpattern of a frequent pattern is itself a frequent pattern. This is no longer true for loosely connected patterns (consider “A overlaps B, B overlaps C” and the removal of B). However, we have the following observation: given a (loosely) connected pattern P with dim(P) ≥ 2, one can find at least two different LC-subpatterns Q and R with dim(Q) = dim(R) = dim(P) − 1. Among the (k − 1)-subpatterns there are thus at least two subpatterns which are loosely connected. From those we can construct a set of candidate patterns similarly to what has been done before [4]. However, with respect to pruning efficiency we expect the new candidate generation algorithm to generate more candidates than before, because we have fewer LC-subpatterns that can be used for pruning. (The algorithms are omitted due to lack of space; a detailed report is available from the author.)
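The connectivity requirement itself is straightforward to check. The sketch below is our own illustration (not the paper's candidate generation code); it assumes the relation matrix R of a pattern is given with the single-letter codes of Fig. 2 and tests whether the graph G = (V, E) that excludes before/after edges is connected.

```python
from collections import deque

def is_loosely_connected(R):
    """R: k x k matrix of Allen relation codes ('b' = before, 'a' = after, ...).
    Returns True iff the graph with edges for all relations except
    before/after (the set I_L from the text) is connected."""
    k = len(R)
    if k == 0:
        return False
    adj = {i: [j for j in range(k) if j != i and R[i][j] not in ('b', 'a')]
           for i in range(k)}
    seen, queue = {0}, deque([0])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) == k

# "A overlaps B, B before C": C is related to the rest only via before/after,
# so the pattern is not loosely connected.
R = [['=', 'o', 'b'],
     ['io', '=', 'b'],
     ['a', 'a', '=']]
print(is_loosely_connected(R))  # False
```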
6 Evaluation
In our experiments, the number of intervals in a multiscale description was 2–3 times the number of intervals in a single-scale description. Increasing the number of intervals while keeping the set of labels constant may cause a dramatic increase in the number of frequent patterns (many more contains,
starts, finishes, etc. relationships). But on the other hand, only a small percentage of the frequent patterns are loosely connected patterns (the percentage decreases drastically with increasing window width); therefore the reduced pruning efficiency in the case of loosely connected patterns is compensated by the savings during support estimation. The pruning techniques are less efficient: the ratio of frequent patterns among candidate patterns decreases. This is due to the fact that we have fewer loosely connected subpatterns that can be used for pruning. In our application to weather data we obtained much better supported patterns than without considering ambiguous abstractions, which we take as an indication that the previously used abstraction method did not always provide the best segmentation possible. For illustration purposes we demonstrate the method using an artificial example, which has been generated by concatenating noisy squared and unsquared sine waves (sin(2π t/l), t ∈ {0, .., l}, with l varying randomly between 200 and 300). For some τ ∈ {1/16, 7/16, 9/16, 15/16} we randomly added a Gaussian bump (exp(−(t − τ·l)²/100)/h, with h varying randomly between 2 and 3). The sine wave is squared if a Gaussian bump appears at τ = 1/16. Figure 5 shows an excerpt of this time series. The difficulty is to distinguish the important bump (τ = 1/16) from those that have no special meaning (τ ∈ {7/16, 9/16, 15/16}). It is not possible to generate a single abstraction of the time series that contains only the important bumps, since all bumps have the same characteristics. Their importance can only be revealed by means of the rules that can be discovered by using them.
Fig. 5. The “sine wave with bumps” example.
We ran the discovery process (window width 200, suppmin = 5%) with increasing and decreasing segments only. Then we performed specialization of the best discovered rules [6] to get some evidence for useful thresholds on segment lengths. From the histogram of thresholds on interval lengths we identified two major peaks at segment lengths ≈ 50 and ≈ 90. Therefore the process was restarted with refined labels denoting the length of the segment (label prefix “short” if ≤ 50, label prefix “long” if ≥ 90). We specialized the discovered rules that contained a “long-dec” label, and the best rule with “long-dec” in the conclusion was:

If short-dec inc has been observed with a gradient between 0.006–0.009 and a length ≥ 24 for the inc segment, then short-dec inc long-dec will be observed.

The rule has a support of 10% and a confidence of nearly 100%. The premise contains a short decreasing segment (from a Gaussian bump) that meets an increasing segment of the remaining sine wave. A long decreasing segment (≥ 90) is only obtained for an unsquared sine wave; therefore this rule recognizes the relationship between the first bump and the squaredness of the sine curve.
The squared and non-squared sine waves differ slightly in their derivative, and the additional quantitative constraint on the gradient obtained from rule specialization [6] focuses on the non-squared sine wave.
7 Conclusions
In knowledge discovery from time series one has to carefully balance the features of the time series similarity measure used and the computational cost of the discovery process. The proposed approach via labeled interval sequences supports partial similarity of time series segments and can handle gaps, translation, and dilation to some extent. Furthermore, it can be applied not only to univariate but also to multivariate time series, and the representation is close to the human perception of patterns in time series. Ambiguity in time series perception is an important issue, and we have shown in this paper that certain relationships in the data cannot be revealed if only a single abstraction is used. We have extended our approach with the ability to take ambiguity in time series perception into account, a feature that is absent in most competing approaches.
References

[1] J. F. Allen. Maintaining knowledge about temporal intervals. Comm. ACM, 26(11):832–843, 1983.
[2] B. R. Bakshi and G. Stephanopoulos. Reasoning in time: Modelling, analysis, and pattern recognition of temporal process trends. In Advances in Chemical Engineering, volume 22, pages 485–548. Academic Press, Inc., 1995.
[3] G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery from time series. In Proc. of the 4th Int. Conf. on Knowl. Discovery and Data Mining, pages 16–22. AAAI Press, 1998.
[4] F. Höppner. Discovery of temporal patterns – learning rules about the qualitative behaviour of time series. In Proc. of the 5th Europ. Conf. on Principles of Data Mining and Knowl. Discovery, number 2168 in LNAI, pages 192–203, Freiburg, Germany, Sept. 2001. Springer.
[5] F. Höppner. Time series abstraction methods – a survey. In K. Morik, editor, GI Workshop on Knowl. Discovery in Databases, LNI, Dortmund, Germany, Sept. 2002.
[6] F. Höppner and F. Klawonn. Finding informative rules in interval sequences. In Proc. of the 4th Int. Symp. on Intelligent Data Analysis, volume 2189 of LNCS, pages 123–132, Lissabon, Portugal, Sept. 2001. Springer.
[7] F. Höppner and F. Klawonn. Learning rules about the development of variables over time. In C. T. Leondes, editor, Intelligent Systems: Technology and Applications, volume IV, chapter 9, pages 201–228. CRC Press, 2002.
[8] E. J. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In Proc. of the 3rd Int. Conf. on Knowl. Discovery and Data Mining, pages 20–24, 1997.
[9] A. P. Witkin. Scale space filtering. In Proc. of the 8th Int. Joint Conf. on Artificial Intelligence, pages 1019–1022, Karlsruhe, Germany, 1983.
A Compositional Framework for Mining Longest Ranges

Haiyan Zhao¹, Zhenjiang Hu¹,², and Masato Takeichi¹

¹ Department of Information Engineering, School of Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-8656 Tokyo, Japan
{zhhy,hu,takeichi}@ipl.t.u-tokyo.ac.jp
² PRESTO 21, Japan Science and Technology Corporation
Abstract. This paper proposes a compositional framework for discovering interesting range information from huge databases, where a domain specific query language is provided to specify the range of interest, and a general algorithm is given to mine the range specified in this language efficiently. A wide class of longest range problems, including the intensively studied optimized support range problem [FMMT96], can be solved systematically in this framework. Experiments with real world databases show that our framework is efficient not only in theory but also in practice.
1 Introduction
In this paper, we examine a new mining problem, called the longest range problem, for discovering interesting range information from a huge database. It has promising applications in many sectors, such as telecom, finance, and retail. As an example, consider a relation¹, namely callsDetail, in a telecom service provider database containing detailed call information. The attributes of the relation include date, time, src city, src country, dst city, dst country and duration, capturing the relevant information about each call. Some useful knowledge hidden in this relation can be used by the telecom service provider for decision making. Suppose the telecom service provider is interested in offering a discount to customers. In this case, the timing of the discount may be critical for its success. For example, to encourage customers to make long calls, it would be advantageous to offer the discount in a time interval in which the average calling time is not so long, say, less than 6 minutes. And the telecom service provider would want to run this campaign in a time interval that is as long as
¹ A relational database is a set of relations, while a relation (also called table) can be regarded as a set of tuples (or rows/records), and each tuple consists of several attributes (or fields).
possible, that is, to find a longest time interval that satisfies the above conditions. This can be described as

find longest time range from callsDetail s.t. average(duration) ≤ 6.

This kind of problem is practically important, but finding an efficient and correct algorithm is not easy. This has been seen in [FMMT96], which took great pains in solving the optimized range problem, a special case of our longest range problems (see Section 4). Things turn out to be more interesting and difficult if we want to find the longest time interval under a more complicated condition. For the above example, to make the campaign more profitable, besides requiring that the average talking time is not long, the total talking time during the interval should be big enough, say, during one year the sum of duration should be more than 10000 hours:

find longest time range from callsDetail s.t. average(duration) ≤ 6 ∧ sum(duration) ≥ 10000 ∗ 60

However, as far as we know, this kind of problem has not been systematically solved yet, and it remains open how efficiently it can be solved. This paper therefore addresses this kind of problem by making use of our work on program calculation. In fact, the longest range problem is not a new problem at all; if we regard a relation in a database as a list of records (tuples), it boils down to the longest segment problem [Zan92, Jeu93] satisfying a given condition p, which is known in the program calculation community. Unfortunately, there does not always exist an efficient algorithm for computing the longest segment for an arbitrary predicate p [Zan92]. But if p has a particular shape or some nice properties, then a strong optimization can be made, resulting in an efficient algorithm using less than O(n²) time or even linear time. We hence propose a compositional framework to mine longest ranges by providing a class of predicates with nice properties for which efficient solutions can be derived. The main contributions of this paper are as follows.

– We design a general querying language for mining the longest ranges. It is powerful and easy to learn, using a syntax similar to SQL. A wide class of longest range problems, including the optimized support range problem as intensively studied in [FMMT96,SBS99], can be expressed and solved by our language.

– We show that the language can be efficiently implemented by using techniques from program calculation. Moreover, our solution demonstrates that efficient algorithms for solving the longest range problems can be compositionally built according to the structure of the predicates specifying the range properties. This compositional approach is in sharp contrast with the existing case-by-case studies as in [Zan92,Jeu93].
p ::= sum(attr) ≺ c          Sum Property
    | average(attr) ≺ c      Average Property
    | count(attr) ≺ c        Count Property
    | min(attr) ≺ c          Min Property
    | max(attr) ≺ c          Max Property
    | p1 ∧ p2                Conjunction
    | p1 ∨ p2                Disjunction
    | not (p)                Negation

Fig. 1. Predicates for Specifying Range Property (≺ denotes a transitive total order relation, like ≤ or >)
– We apply this system to mining various range information from a POS database of a coffee shop with two years of sales data. The experiments demonstrate that our framework is efficient not only in theory but also in practice, and can be applied in the data mining field.
2 The Range Querying Language
The main statement find in our range query language syntactically resembles the SELECT statement in SQL. It takes the form

find longest attr range from tab where property

in which find, longest, range, from, and where are reserved words. Roughly speaking, it finds the longest range for the attribute attr from the relation tab under the condition given by property. Note that attr must be a numeric attribute of the relation tab. It is evident that both attr and tab come from the database under consideration. However, how to specify property is not as easy as it looks. As discussed in the Introduction, it cannot be guaranteed that there always exists an efficient algorithm to compute the longest range with respect to an arbitrary property. It is thus crucial to define a language that describes suitable properties.
Predicates for Specifying Range Property. Since the properties of a range attribute are often described by the aggregate functions sum, average, count, min, and max in data mining, we design a class of predicates for defining the range properties of interest, as shown in Figure 1, in which ≺ denotes a transitive total order relation, like ≤ or >. Our predicates are classified into two groups:
– five simple aggregate predicates specifying constraints on the aggregate result of an attribute (field), including the range attribute, and
– three composite predicates combining predicates logically.
The former are used to specify simple properties, while the latter are for expressing more complicated properties in a compositional manner. The general form of the simple predicates is

agg(attr) ≺ c

where agg is one of the aggregate functions, ≺ a transitive total order, attr an attribute, and c a constant value. Instantiating ≺ to be ≤, this property means that the aggregate computation over the attribute attr in the range should be no more than the constant c.
Example 1. In the telecom relation callsDetail given in the Introduction, consider finding a longest time interval whose total calling time is less than 300 hours. We can specify this query simply by

find longest time range from callsDetail where sum(duration) ≤ 300 ∗ 60.

Practical usage demands more interesting range properties than the simple aggregate ones. For this purpose, we provide the composite predicates ∧, ∨, and not in our language to combine predicates easily. As their names suggest, not denotes the logical negation of a predicate, ∧ is used to describe the logical conjunction of predicates, while ∨ is for the logical disjunction of predicates. Their precedence is descending from not to ∨.

Example 2. Recall the example in the Introduction, which can be specified in our language by

find longest time range from callsDetail where average(duration) ≤ 6 ∧ sum(duration) ≥ 10000 ∗ 60.

Consequently, our range query language, owing to its compositional feature, is powerful and makes it easy for users to specify a wide class of longest ranges.

Remark. It is worth noting that we discuss only the properties related to computing the range, and omit general selection conditions, which, in fact, can easily be filtered out by preprocessing. For example,

find longest time range from callsDetail where sum(duration) ≤ 18000 ∧ src city = Tokyo ∧ dst city = Lübeck

can be transformed to
find longest time range from callsDetail’ where sum(duration) ≤ 18000,

and callsDetail’ is a view defined in SQL as

SELECT ∗ FROM callsDetail WHERE src city = Tokyo AND dst city = Lübeck
3 Implementing the Range Querying Language
This section outlines how to implement our querying language efficiently. Our result can be summarized in the following theorem.

Theorem 1. The longest range specified in our language can be computed in at most O(n log^(k−1) n) time, provided every f and g used in the definition of the primitive predicates inside property p can be computed in constant time. Here n denotes the number of tuples in tab and k is a constant depending on the definition of property (Lemma 2). ✷

We prove this theorem by giving a concrete implementation, which consists of four phases: (i) bucketing the relation if necessary, (ii) normalizing the range property, (iii) refining the longest range problem, (iv) computing the longest range. The details of the implementation can be found in [ZHM02a]. For reasons of space, we only illustrate how to normalize the range property and what the core of our problem is after normalization. To normalize the range property specified by the user, we first give the following definition.

Definition 1 (Primitive Predicate). A primitive predicate takes the form of

f(head) ≺ g(last),

where head and last indicate the first and the last element of a given range respectively, and f and g are arbitrary functions applied to the end elements. ✷

Semantically, it means that the leftmost element of a range (after applying function f) has a transitive total order relation ≺ with the rightmost element (after applying function g).

Lemma 1. All the simple predicates given in Figure 1 can be represented in the form of a primitive predicate. ✷
We omit the proof of this lemma here, and just demonstrate it by an example. Recall the condition in Example 1: sum(xs) ≤ 18000. How can we eliminate this sum function? The trick is to do preprocessing like si = si−1 + xi to compute every prefix sum of the input list xs and get a new list ss:

xs : [x1, x2, . . . , xh, . . . , xl, . . . ]
ss : [s1, s2, . . . , sh, . . . , sl, . . . ].
Therefore, to compute the sum of a range xs′ = [xh, . . . , xl], we can now use the end elements of xs and the corresponding ones of ss, i.e., xh, sh, xl, sl:

sum(xs′) = xh + (sl − sh).

Thus, sum(xs′) ≤ 18000 is coded as xh − sh ≤ 18000 − sl, a relation between the two end elements of the preprocessed segment. Accordingly, for any list xs, we can do this preprocessing to get a new list with each element xi changed to a pair (xi, si), and reduce the aggregate sum property to a transitive total order relation between the leftmost and rightmost elements of the concerned segment. It is worth noting that this preprocessing does not raise additional cost, by using the accumulation and fusion techniques [Bir84]. With the same trick, the other four aggregate predicates can also be normalized into the primitive form. Accordingly, we further have the following self-evident lemma for the composite property.

Lemma 2 (Disjunctive Normal Form). Any composite predicate can be expressed in its canonical form, that is,

p(x) = p11(x) ∧ p12(x) ∧ . . . ∧ p1k1(x)
     ∨ p21(x) ∧ p22(x) ∧ . . . ∧ p2k2(x)
     ∨ . . . . . .
     ∨ pm1(x) ∧ pm2(x) ∧ . . . ∧ pmkm(x)

where pij is a primitive predicate, and the maximum of k1, k2, . . . , km is exactly the k in Theorem 1. ✷

Lemma 2 shows that a range property specified in our language can be normalized into its canonical form, that is, a disjunction of simpler components, each of which is either a primitive predicate or a conjunction of several primitive ones. If we can address both the primitive and the conjunction case, the result for the disjunction case can be obtained easily by calculating that for each component and selecting the longest one as the result. Thus, the crucial part is how to deal with the conjunction case, with the primitive case as a special case. Semantically, the conjunction case is to compute the longest range that satisfies a number of primitive predicates simultaneously, i.e.,

p(z) = f1(head) ≺ g1(last)
     ∧ f2(head) ≺ g2(last)
     ∧ . . . . . .
     ∧ fk(head) ≺ gk(last)
where k is the number of primitive predicates. If we encapsulate this composite conjunction by tupling all of its primitive components together,

(f1, . . . , fk) (R1, . . . , Rk) (g1, . . . , gk),

with each Ri a transitive total order relation and

(x1, . . . , xk) (R1, . . . , Rk) (y1, . . . , yk) ≡ x1 R1 y1 ∧ · · · ∧ xk Rk yk,

then what we need to implement boils down to the following problem: given a list, compute the length of a longest nonempty range such that the computation on the leftmost element is related to that on the rightmost element by a relation R̄ (which is not necessarily a total order), that is,

f̄(head) R̄ ḡ(last)

where f̄ = (f1, · · · , fk), ḡ = (g1, · · · , gk), and R̄ = (R1, · · · , Rk).
Fortunately, this refined problem can be solved in O(n log^(k−1) n) time by using our algorithm given in [ZHM02b,ZHM02a]. Theorem 1 has accordingly been proved. In particular, for the case k = 1 (the primitive case), the longest range can be computed in linear time.
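For concreteness, the sketch below works out the sum case only. It is our own illustration under stated assumptions and not the algorithm of [ZHM02b,ZHM02a]: the predicate sum(xs[h..l]) ≤ c is first rewritten, as in Lemma 1, into a condition relating the prefix sums at the two ends of the range, and the longest admissible range is then found with a prefix-maximum array and binary search, giving O(n log n) rather than the linear bound mentioned above.

```python
from bisect import bisect_left

def longest_range_sum_at_most(xs, c):
    """Length of a longest contiguous range of xs whose sum is <= c.
    Normalization (Lemma 1): with prefix sums s, sum(xs[h..l]) <= c
    is equivalent to s[h-1] >= s[l] - c, a primitive predicate relating
    a function of the head to a function of the last element."""
    n = len(xs)
    s = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s[i + 1] = s[i] + x
    # prefix maxima of s[0..n-1]; non-decreasing, hence searchable by bisection
    m = [s[0]] * n
    for j in range(1, n):
        m[j] = max(m[j - 1], s[j])
    best = 0
    for l in range(1, n + 1):              # candidate right ends (1-based)
        need = s[l] - c                    # need some j <= l-1 with s[j] >= need
        j = bisect_left(m, need, 0, l)     # leftmost such j, if any
        if j < l:
            best = max(best, l - j)        # range covers elements j+1 .. l
    return best

print(longest_range_sum_at_most([3, 1, 4, 1, 5, 9, 2], 8))  # 3, e.g. the range [3, 1, 4]
```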
4 An Application: Optimized Support Range Problem
We present an application of our framework to practical data mining. As a special case of our longest range problem, the optimized support range problem, first studied in [FMMT96], is very useful for extracting correlated information. For example, the optimized association rule² for callsDetail in the telecom database,

(date ∈ [d1..d2]) ∧ (src city = Tokyo) ⇒ (dst city = Lübeck),

describes that the calls from Tokyo during the dates [d1, d2] are made to Lübeck. Suppose that the telecom service provider wants to offer discounts to Tokyo customers who make calls to Lübeck in a period of consecutive days in which the maximum number of calls from Tokyo are made and a certain minimum percentage of the calls from Tokyo are to Lübeck. This is known as the optimized support range problem, which maximizes the support of the given optimized association rule while the confidence of the rule exceeds a given constant θ. To do this, we first preprocess the original relation by adding a new attribute called support. It is defined according to the above rule by

support = 1 if src city = Tokyo ∧ dst city = Lübeck, and 0 otherwise.
² The definition of optimized association rules and their associated properties support and confidence can be found in [FMMT96].
After bucketing callsDetail according to the range attribute date, it only remains to compute the longest date range in which the average of support is no less than the given θ. Thus, the optimized support range problem is just a special case of the longest range problem, and it can be simply expressed in our language as follows:

find longest date range from callsDetail where average(support) ≥ θ

From Theorem 1, we know that it can be solved in O(n) time.
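To see why a linear-time solution is plausible for this particular query, here is a sketch of our own (again not the algorithm of [ZHM02b,ZHM02a]; the 0/1 support column is the hypothetical one defined above): average(support) ≥ θ over a range is equivalent to a non-negative sum of (support − θ), i.e. to finding the widest pair of prefix-sum indices j < l with s[j] ≤ s[l], which a monotonic stack answers in O(n).

```python
def longest_range_avg_at_least(support, theta):
    """Length of a longest contiguous range of `support` with average >= theta."""
    n = len(support)
    s = [0.0] * (n + 1)                     # prefix sums of (support[i] - theta)
    for i, x in enumerate(support):
        s[i + 1] = s[i] + (x - theta)
    stack = []                              # indices with strictly decreasing s values
    for j in range(n + 1):
        if not stack or s[j] < s[stack[-1]]:
            stack.append(j)
    best = 0
    for l in range(n, 0, -1):               # scan right ends from the right
        while stack and s[stack[-1]] <= s[l]:
            best = max(best, l - stack.pop())
    return best

# 1 if the call matches the rule body, 0 otherwise (hypothetical data)
print(longest_range_avg_at_least([0, 1, 1, 0, 1, 0, 0, 1], theta=0.6))  # 5
```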
5 Conclusion
In this paper, we identify an important class of data mining problems called longest range problems, and propose a compositional framework for solving these problems efficiently. This work is a continuation of our effort to investigate how the program calculation approach can be used in data mining [HCT00], and its promising results confirm the value of this direction. As future work, we want to investigate how to extend our framework to mining multiple range attributes efficiently. For a detailed explanation and experimental results, please refer to [ZHM02a].
References

[Bir84] R. Bird. The promotion and accumulation strategies in transformational programming. ACM Transactions on Programming Languages and Systems, 6(4):487–504, 1984.
[FMMT96] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric attributes. In Proc. ACM PODS’96, pages 182–191, Montreal, Quebec, Canada, 1996.
[HCT00] Z. Hu, W.N. Chin, and M. Takeichi. Calculating a new data mining algorithm for market basket analysis. In Proc. of PADL 2000, LNCS 1753, pages 169–184, Boston, Massachusetts, January 2000. Springer-Verlag.
[Jeu93] J. Jeuring. Theories for Algorithm Calculation. Ph.D. thesis, Faculty of Science, Utrecht University, 1993.
[SBS99] S. Brin, R. Rastogi, and K. Shim. Mining optimized gain rules for numeric attributes. In Proc. of ACM KDD’99, 1999.
[Zan92] H. Zantema. Longest segment problems. Science of Computer Programming, 18:36–66, 1992.
[ZHM02a] H. Zhao, Z. Hu, and M. Takeichi. A compositional framework for mining longest ranges. Technical Report METR 02-05, Department of Mathematical Engineering, Univ. of Tokyo, May 2002. Available at ftp://www.ipl.t.u-tokyo.ac.jp/˜zhhy/pub/metr0205.ps.
[ZHM02b] H. Zhao, Z. Hu, and M. Takeichi. Multidimensional searching trees with minimum attribute. JSSST Computer Software, 19(1):22–28, Jan. 2002.
Post-processing Operators for Browsing Large Sets of Association Rules

Alipio Jorge¹, João Poças², and Paulo Azevedo³

¹ LIACC/FEP, Universidade do Porto, Portugal, [email protected]
² Instituto Nacional de Estatística, Portugal, [email protected]
³ Universidade do Minho, Portugal, [email protected]
Abstract. Association rule engines typically output a very large set of rules. Despite the fact that association rules are regarded as highly comprehensible and useful for data mining and decision support in fields such as marketing, retail, demographics, among others, lengthy outputs may discourage users from using the technique. In this paper we propose a post-processing methodology and tool for browsing/visualizing large sets of association rules. The method is based on a set of operators that transform sets of rules into sets of rules, allowing focusing on interesting regions of the rule space. Each set of rules can be then seen with different graphical representations. The tool is web-based and uses SVG. Association rules are given in PMML.
1 Introduction

Association Rule (AR) discovery [1] is often used, for decision support, in data mining applications like market basket analysis, marketing, retail, and the study of census data, among others. This type of knowledge discovery is adequate when the data mining task has no single concrete objective to fulfil (such as how to discriminate good clients from bad ones), contrary to what happens in classification or regression. Instead, the use of AR allows the decision maker/knowledge seeker to have many different views on the data. There may be a set of general goals (like “what characterizes a good client?”, “which important groups of clients do I have?”, “which products do which clients typically buy?”). Moreover, the decision maker may even find relevant patterns that do not correspond to any question formulated beforehand. This style of data mining is sometimes called “fishing” (for knowledge). Due to the data characterization objectives, association rule discovery algorithms produce a complete set of rules above user-provided thresholds (typically minimal support and minimal confidence, defined in Section 2). This implies that the output is a very large set of rules, which can easily get to the thousands, overwhelming the user. To make things worse, the typical association rule algorithm outputs the list of rules as
¹ This work is supported by the European Union grant IST-1999-11.495 Sol-Eu-Net and the POSI/2001/Class Project sponsored by Fundação Ciência e Tecnologia, FEDER, and the Programa de Financiamento Plurianual de Unidades de I & D.
a long text (even in the case of commercial tools like SPSS Clementine), and lacks post-processing facilities for inspecting the set of produced rules. In this paper we propose a method and tool for browsing and visualizing association rules. The tool reads sets of rules represented in the proposed standard PMML [3]. The complete set of rules can then be browsed by applying operators based on the generality relation between itemsets. The set of rules resulting from each operation can be viewed as a list or can be graphically summarized. This paper is organized as follows: we introduce the basic notions related to association rule discovery and the association rule space; we then describe PEAR, the post-processing environment for association rules; we describe the set of operators and show one example of the use of PEAR, and then proceed to related work and conclusions.
2 Association Rules

An association rule A→B represents a relationship between the sets of items A and B. Each item I is an atom representing a particular object. The relation is characterized by two measures: the support and the confidence of the rule. The support of a rule R within a dataset D, where D itself is a collection of sets of items (or itemsets), is the number of transactions in D that contain all the elements in A ∪ B. The confidence of the rule is the proportion of transactions that contain A ∪ B with respect to the transactions that contain A. Each rule represents a pattern captured in the data. The support is the commonness of that pattern. The confidence measures its predictive ability. The most common algorithm for discovering AR from a dataset D is APRIORI [1]. This algorithm produces all the association rules that can be found in a dataset D above given values of support and confidence, usually referred to as minsup and minconf. APRIORI has many variants with more appealing computational properties, such as PARTITION [6] or DIC [2], but these should produce exactly the same set of rules, as determined by the problem definition and the data.
2.1 The Association Rule Space
The space of itemsets I can be structured in a lattice with the ⊆ relation between sets. The empty itemset ∅ is at the bottom of the lattice and the set of all items is at the top. The ⊆ relation also corresponds to the generality relation between itemsets. To structure the set of rules, we need a number of lattices, each corresponding to one particular itemset that appears as an antecedent, or to one itemset that occurs as a consequent. For example, the rule {a,b,c}→{d,e} belongs to two lattices: the one of the rules with antecedent {a,b,c}, structured by the generality relation over the consequent, and the lattice of rules with {d,e} as a consequent, structured by the generality relation over the antecedents of the rules. We can view this collection of lattices as a grid, where each rule belongs to one intersection of two lattices. The idea behind the rule browsing approach we present is that the user can visit one of these lattices (or part of it) at a time, and take one particular intersection to move into another lattice (set of rules).
3 PEAR: A Web-Based AR Browser

To help the user browse a large set of rules and ultimately find the subset of interesting rules, we developed PEAR (Post-processing Environment for Association Rules). PEAR implements the set of operators described below, which transform one set of rules into another, and provides a number of visualization techniques. PEAR’s server is run under an HTTP server. A client is run on a web browser. Although not currently implemented, multiple clients can potentially run concurrently.
Fig. 1. PEAR screen showing some rules
PEAR operates by loading a PMML representation of the rule set. This initial set is displayed as a web page (Fig. 1). From this page the user can go to other pages containing ordered lists of rules with support and confidence. To move from page (set of rules) to page, the user applies restrictions and operators. The restrictions can be placed on the minimum confidence, minimum support, or on functions of the support and confidence of the itemsets in the rule. Operators can be selected from a list. If it is a {Rule}→{Sets of Rules} operator, the input rule must also be selected. For each page, the user can also select a graphical visualization that summarizes the set of rules on the page. Currently, the available visualizations are the confidence × support plot and confidence/support histograms (Fig. 2). The produced charts are interactive and indicate the rule that corresponds to the point under the mouse.
4 Operators for Sets of Association Rules

The association rule browser helps the user to navigate through the space of rules by viewing one set of rules at a time. Each set of rules corresponds to one page. From one given page the user moves to the following by applying a selected operator to all or some of the rules viewed on the current page. In this section we define the set of operators to apply to sets of association rules.
Fig. 2. PEAR plotting support x confidence points for a subset of rules, and showing a multi-bar histogram
The operators we describe here transform one single rule R ∈ {Rules} into a set of rules RS ∈ {Sets of Rules} and correspond to the currently implemented ones. Other interesting operators may transform one set of rules into another. In the following we describe the operators of the former class.

Antecedent generalization: AntG(A→B) = {A'→B | A' ⊆ A}. This operator produces rules similar to the given one but with a syntactically simpler antecedent. This allows the identification of relevant or irrelevant items in the current rule. In terms of the antecedent lattice, it gives all the rules below the current one with the same consequent.

Antecedent least general generalization: AntLGG(A→B) = {A'→B | A' is obtained by deleting one atom in A}. This operator is a stricter version of AntG. It gives only the rules on the level of the antecedent lattice immediately below the current rule.

Consequent generalization: ConsG(A→B) = {A→B' | B' ⊆ B}.

Consequent least general generalization: ConsLGG(A→B) = {A→B' | B' is obtained by deleting one atom in B}. Similar to AntG and AntLGG respectively, but the simplification is done on the consequent instead of on the antecedent.

Antecedent specialization: AntS(A→B) = {A'→B | A' ⊇ A}. This produces rules with lower support but higher confidence than the current one.
Antecedent least specific specialization: AntLSS(A→B) = {A'→B | A' is obtained by adding one (any) atom to A}. As AntS, but only for the immediate level above on the antecedent lattice.

Consequent specialization: ConsS(A→B) = {A→B' | B' ⊇ B}.

Consequent least specific specialization: ConsLSS(A→B) = {A→B' | B' is obtained by adding one (any) atom to B}. Similar to AntS and AntLSS, but on the consequent.

Focus on antecedent: FAnt(A→B) = {A→C | C is any}. Gives all the rules with the same antecedent. FAnt(R) = ConsG(R) ∪ ConsS(R).

Focus on consequent: FCons(A→B) = {C→B | C is any}. Gives all the rules with the same consequent. FCons(R) = AntG(R) ∪ AntS(R).
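To make the operator semantics concrete, here is a small self-contained illustration of AntG and AntLGG (our own code, not PEAR's; representing a rule as a pair of frozensets and using single-letter item names are assumptions made only for the example). The remaining operators follow the same pattern.

```python
# A rule is represented as (antecedent, consequent), both frozensets of items.

def ant_g(rule, rules):
    """AntG: rules with the same consequent and a more general (subset) antecedent.
    Note that A' <= A includes the rule itself, as in the definition above."""
    a, b = rule
    return [(a2, b2) for (a2, b2) in rules if b2 == b and a2 <= a]

def ant_lgg(rule, rules):
    """AntLGG: same consequent, antecedent obtained by deleting exactly one item."""
    a, b = rule
    return [(a2, b2) for (a2, b2) in rules
            if b2 == b and a2 < a and len(a2) == len(a) - 1]

rules = [
    (frozenset({"P", "I", "G"}), frozenset({"T"})),
    (frozenset({"P", "I"}), frozenset({"T"})),
    (frozenset({"G"}), frozenset({"T"})),
]
r = rules[0]
print(len(ant_g(r, rules)), len(ant_lgg(r, rules)))  # 3 1
```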
5 The Index Page

Our methodology is based on the philosophy of web browsing, page by page, following hyperlinks. The operators implement the hyperlinks between two pages. To start browsing, the user needs an index page. This should include a subset of the rules that summarizes the whole set. In terms of web browsing, it should be a small set of rules that allows getting to any page in a limited number of clicks. A candidate for such a set could be, for example, the smallest rule for each consequent. Each of these rules would represent the lattice on the antecedents of the rules with the same consequent. Since the lattices intersect, we can change to a focus on the antecedent of any rule by applying an appropriate operator. Similarly, we could start with the set of smallest rules for each antecedent. Alternatively, instead of the size, we could consider the support, confidence, or another measure. All these possibilities must be studied and some of them implemented in our system, which currently shows, as the initial page, the set of all rules.
6 One Example

We now describe how the proposed method can be applied to the analysis of downloads from the site of the Portuguese National Institute of Statistics (INE). This site (www.ine.pt/infoline) serves as an electronic store, where the products are tables in digital format with statistics about Portugal. From the web access logs of the site’s HTTP server we produced a set of association rules relating the main thematic categories of the downloaded tables. This is a relatively small set of rules (211) involving 9 items, which serves as an illustrative example. The aims of INE are to improve the usability of the site by discovering which items are typically combined by the same user. The results obtained can be used in the restructuring of the site or in the inclusion of recommendation links on some pages. A similar study could be carried out for lower levels of the category taxonomy. The rules in Fig. 3 show the contents of one index page, with one rule for each consequent (of the 9 items, only 7 appear). The user then finds the rule on “Territory_and_Environment” relevant for structuring the categories on the site. By applying
the ConsG operator, she can drill down the lattice around that rule, obtaining all the rules with a generalized antecedent.

Rule | Sup | Conf
Economics_and_Finance <= Population_and_Social_Conditions & Industry_and_Energy & External_Commerce | 0,038 | 0,94
Commerce_Tourism_and_Services <= Economics_and_Finance & Industry_and_Energy & General_Statistics | 0,036 | 0,93
Industry_and_Energy <= Economics_and_Finance & Commerce_Tourism_and_Services & General_Statistics | 0,043 | 0,77
Territory_and_Environment <= Population_and_Social_Conditions & Industry_and_Energy & General_Statistics | 0,043 | 0,77
General_Statistics <= Commerce_Tourism_and_Services & Industry_and_Energy & Territory_and_Environment | 0,040 | 0,73
External_Commerce <= Economics_and_Finance & Industry_and_Energy & General_Statistics | 0,036 | 0,62
Agriculture_and_Fishing <= Commerce_Tourism_and_Services & Territory_and_Environment & General_Statistics | 0,043 | 0,51

Fig. 3. First page (index)
From here, we can see that “Population_and_Social_Conditions” is not relevantly associated to “Territory_and_Environment”. The user can now, for example, look into rules with “Population_and_Social_Conditions” by applying the FAnt (focus on antecedent) operator (results not shown here). From there she could see what the main associations to this item are.

Rule | Sup | Conf
Territory_and_Environment <= Population_and_Social_Conditions & Industry_and_Energy & General_Statistics | 0,043 | 0,77
Territory_and_Environment <= Population_and_Social_Conditions & Industry_and_Energy | 0,130 | 0,41
Territory_and_Environment <= Population_and_Social_Conditions & General_Statistics | 0,100 | 0,63
Territory_and_Environment <= Industry_and_Energy & General_Statistics | 0,048 | 0,77
Territory_and_Environment <= General_Statistics | 0,140 | 0,54

Fig. 4. Applying the operator ConsG (consequent generalization)
The process would then iterate, allowing the user to follow particular interesting threads in the rule space. Plots and bar charts summarize the rules in one particular page. The user can always return to an index page. The objective is to gain insight on the rule set (and on the data) by examining digestible chunks of rules. What is an interesting or uninteresting rule depends on the application and the knowledge of the user. For more on measures of interestingness see [7].
7 Implementation

Currently, the PEAR server runs under Microsoft Internet Information Server (IIS), but the tool itself is browser independent. The server can also run on any PC with a Microsoft OS using a Personal Web Server.
Fig. 5. General architecture of PEAR (components: PMML document, DOM, AR database, SQL, ASP/DOM operators, web pages of rule sets, SVG visualization, exported PMML)
On the server side, Active Server Pages (ASP) allow the definition of dynamic and interactive web pages [5]. These integrate HTML code with VBScript and implement various functionalities of the system. JavaScript is used for data manipulation on the
client side, for its portability. With JavaScript we create and manipulate PMML documents or SVG (both XML documents) using the Document Object Model (DOM). We use the DOM to read and manipulate the original PMML document (an XML document that represents a data mining model), to export a new PMML document, and also to create and manipulate the graphical visualization [9]. Interactive graphical visualizations in PEAR are implemented as Scalable Vector Graphics (SVG) [10]. This is an XML-based language that specifies vector graphics that can be visualized by a web browser.
7.1 Representing Association Rules with PMML
Predictive Model Markup Language (PMML) is an XML-based language. A PMML document provides a complete non-procedural definition of fully trained data mining models. This way, models can be shared between different applications. The universal, extensible, portable and human readable character of PMML allows users to develop models within one application, and use other applications to visualize, analyze, evaluate or otherwise use the models. PEAR can read an AR model specified in a PMML document. The user will be able to manipulate the AR model, creating a new rule space based on a set of operators, and export a subset of selected rules to a new PMML document. Internally, rules are stored in a relational database. Operators are implemented as SQL queries.
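Since rules are stored in a relational database and operators are implemented as SQL queries, an operator such as FCons can be pictured as a single parameterized query. The fragment below is a hypothetical illustration of that design (the table layout, column names, and itemset serialization are our assumptions, not PEAR's actual schema); the sample rows reuse values from Figs. 3 and 4.

```python
import sqlite3

# Hypothetical schema: one row per rule, itemsets serialized as sorted,
# comma-separated strings (not PEAR's actual schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rules(antecedent TEXT, consequent TEXT, sup REAL, conf REAL)")
con.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    ("Industry_and_Energy,Population_and_Social_Conditions",
     "Territory_and_Environment", 0.130, 0.41),
    ("General_Statistics", "Territory_and_Environment", 0.140, 0.54),
    ("Economics_and_Finance,General_Statistics,Industry_and_Energy",
     "External_Commerce", 0.036, 0.62),
])

def f_cons(consequent):
    """FCons: all rules with the given consequent, ordered by confidence."""
    return con.execute(
        "SELECT antecedent, consequent, sup, conf FROM rules "
        "WHERE consequent = ? ORDER BY conf DESC", (consequent,)).fetchall()

for row in f_cons("Territory_and_Environment"):
    print(row)
```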
8 Related Work

The system DS-WEB [4] uses the same sort of approach as the one we propose here. DS-WEB and PEAR both have the aim of post-processing a large set of AR through web browsing and visualization. DS-WEB relies on the presentation of a reduced set of rules, called direction setting or DS rules; the user can then explore the variations of each one of these DS rules. In our approach, we rely on a set of general operators that can be applied to any rule, including DS rules as defined for DS-WEB. The set of operators we define is based on simple mathematical properties of the itemsets and has a clear and intuitive semantics. PEAR also has the additional possibility of reading AR models as PMML. VizWiz is the non-official name of a PMML interactive model visualizer implemented in Java [11]. It displays graphically many sorts of data mining models. AR are presented as an ordered list together with color bars to indicate the values of measures. The user sets the minimal support and confidence through very intuitive gauges. This visualizer can be used directly in a web browser as a Java plug-in.
9 Future Work and Conclusions

Association rule engines are often rightly accused of overloading the user with very large sets of rules. This applies to any software package, commercial or non-commercial, that we know of.
In this paper we describe a rule post-processing environment that allows the user to browse the rule space, organized by generality, by viewing one relevant set of rules at a time. A set of simple, well-defined operators with an intuitive semantics allows the user to move from one set of rules to another. Each set of rules is presented on a page and can be graphically summarized through the plotting of its numerical properties. It is important to stress that in this approach the user can combine her subjective measure of the interestingness of each rule with the objective measures provided (such as confidence, support, or others). PEAR also follows an open philosophy by reading the set of rules as a PMML model. An important limitation of our work is that visualization techniques are difficult to evaluate, and this one is no exception. The index page concept and its implementation require more work. The available visualization techniques are still limited. In the future we intend to develop metrics to measure the gains of this approach, as well as mechanisms that allow the incorporation of user-defined visualizations and rule selection criteria, such as, for example, the combination of primitive operators.
References

1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A. I.: Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining: 307-328, 1996.
2. Brin, S., Motwani, R., Ullman, J. D., Tsur, S.: Dynamic itemset counting and implication rules for market basket data. SIGMOD Record (ACM Special Interest Group on Management of Data), 26(2):255, 1997. http://citeseer.nj.nec.com/brin97dynamic.html
3. Data Mining Group (PMML development), http://www.dmg.org/
4. Ma, Yiming, Liu, Bing, Wong, Kian (2000): Web for Data Mining: Organizing and Interpreting the Discovered Rules Using the Web. SIGKDD Explorations, ACM SIGKDD, Volume 2, Issue 1, July 2000.
5. Microsoft Web Site (JScript and JavaScript) http://support.microsoft.com and ASP http://msdn.microsoft.com.
6. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. Proc. of 21st Intl. Conf. on Very Large Databases (VLDB), 1995.
7. Silberschatz, A., Tuzhilin, A.: On subjective measures of interestingness in knowledge discovery. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995, 275-281. http://citeseer.nj.nec.com/silberschatz95subjective.html
8. Toivonen, H.: Sampling large databases for association rules. Proc. of 22nd Intl. Conf. on Very Large Databases (VLDB), 1996. http://citeseer.nj.nec.com/toivonen96sampling.html
9. W3C DOM Level 1 specification, http://www.w3.org/DOM/
10. W3C: Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, September 2001, http://www.w3.org/TR/SVG/
11. Wettschereck, D.: A KDDSE-independent PMML Visualizer. In Proc. of IDDM-02, workshop on Integration aspects of Decision Support and Data Mining, (Eds.) Bohanec, M., Mladenic, D., Lavrac, N., associated with the conferences ECML/PKDD 2002.
Mining Patterns from Structured Data by Beam-Wise Graph-Based Induction Takashi Matsuda, Hiroshi Motoda, Tetsuya Yoshida, and Takashi Washio Institute of Scientific and Industrial Research, Osaka University 8-1, Mihogaoka, Ibaraki, Osaka 567-0047, JAPAN
Abstract. A machine learning technique called Graph-Based Induction (GBI) extracts typical patterns from graph data by stepwise pair expansion (pairwise chunking). Because of its greedy search strategy it is very efficient, but it suffers from incompleteness of search. Its search capability is improved without imposing much computational complexity by 1) incorporating a beam search, 2) using a different evaluation function to extract patterns that are more discriminatory than those that simply occur frequently, and 3) adopting canonical labeling to enumerate identical patterns accurately. This new algorithm, called Beam-wise GBI (B-GBI for short), was tested against a small DNA dataset from the UCI repository and shown to be successful in extracting discriminatory substructures.
1 Introduction
Over the last few years there has been a considerable amount of research on data mining in the pursuit of better performance. Better performance includes mining from structured data, which is a new challenge, and there has been little work on this subject. Since structure is represented by proper relations and a graph can easily represent relations, knowledge discovery from graph-structured data poses a general problem for mining from structured data. The majority of widely used methods are for data that has no structure and is represented by attribute-value pairs. Decision trees [14,15] and induction rules [12,3] relate attribute values to target classes. Association rules, often used in data mining, also use this attribute-value pair representation. However, the attribute-value pair representation is not suitable for representing more general data structures, and there are problems that need a more powerful representation. The most powerful representation that can handle relations, and thus structure, is inductive logic programming (ILP) [13], which uses first-order predicate logic. However, in exchange for its rich expressiveness, its time complexity causes problems [6]. A much more efficient approach has recently been proposed that employs version space analysis and limits the type of representation to graph fragments, i.e., linearly connected substructures [5,9]. AGM (Apriori-based Graph Mining) [8] is another recent work that can mine association rules in a given graph dataset.
A graph transaction is represented by an adjacency matrix, and the frequent patterns are mined by an extended Apriori algorithm. AGM can extract all connected/disconnected induced subgraphs by complete search. It is reasonably efficient but, in theory, its computation time increases exponentially with the input graph size and the support threshold. These approaches can use only frequency as the evaluation function. SUBDUE [4], which is probably the closest to our approach, extracts the subgraph which can best compress an input graph based on the MDL principle. The found substructure can be considered a concept. This algorithm is based on a computationally constrained beam search. Graph-Based Induction (GBI) [18,10] is a technique which was devised for discovering typical patterns in general graph data by recursively chunking two adjoining nodes. It can handle graph data having loops (including self-loops) with colored/uncolored nodes and links. There can be more than one link between any two nodes. GBI is very efficient because of its greedy search. GBI can use various evaluation functions based on frequency. It is not, however, suitable for pattern extraction from graph-structured data where many nodes share the same label, because of its greedy recursive chunking without backtracking; it is still effective in extracting patterns from graph-structured data where each node has a distinct label (e.g., World Wide Web browsing data) or where some typical structures exist even if some nodes share the same labels (e.g., chemical structure data containing benzene rings, etc.) [10]. In this paper we first report the improvements made to enhance the search capability without sacrificing efficiency too much by 1) incorporating a beam search, 2) using a different evaluation function to extract patterns that are more discriminatory than those simply occurring frequently, and 3) adopting canonical labeling to enumerate identical patterns accurately. This new algorithm is implemented and now called Beam-wise GBI, B-GBI for short. After that, we report an experiment using a small DNA dataset from the UCI repository and show that the improvements work as intended.
2 Beam-Wise Graph-Based Induction
GBI employs the idea of extracting typical patterns by stepwise pair expansion. "Typicality" is characterized by the pattern's frequency or by the value of some evaluation function of its frequency. The problem of extracting all the isomorphic subgraphs is known to be NP-complete. Thus, GBI aims at extracting only meaningful typical patterns. Its objective is neither finding all the typical patterns nor finding all the frequent patterns. For GBI to find a pattern of interest, all of its subpatterns must also be of interest, because of the nature of repeated chunking. The frequency measure satisfies this monotonicity. However, if the chosen criterion does not satisfy monotonicity, repeated chunking may not find good patterns. This motivated us to improve GBI by allowing the use of two criteria: a frequency measure for chunking, and another criterion for finding discriminatory patterns after chunking. The latter
criterion does not necessarily hold the monotonicity property. Any function that is discriminatory can be used, such as Information Gain [14], Gain Ratio [15], and the Gini Index [2], all of which are based on frequency. To enhance the search capability, a beam search is incorporated into GBI within the framework of greedy search. A certain fixed number of pairs ranked from the top are allowed to be chunked in parallel. To prevent each branch from growing exponentially, the total number of pairs to chunk is fixed at each level of branching. Thus, at any iteration step, there is always a fixed number of chunking operations performed in parallel. The new stepwise pair expansion repeats the following four steps.

Step 1. Extract all the pairs consisting of two connected nodes in all the graphs.

Step 2a. Select all the typical pairs based on the criterion from among the pairs extracted in Step 1, rank them according to the criterion, and register them as typical patterns. If either or both nodes of the selected pairs have already been rewritten (chunked), they are restored to the original patterns before registration.

Step 2b. Select, from among the pairs extracted in Step 1, a fixed number of frequent pairs from the top and register them as the patterns to chunk. If either or both nodes of the selected pairs have already been rewritten (chunked), they are restored to the original patterns before registration. Stop when there is no more pattern to chunk.

Step 3. Replace each of the selected pairs in Step 2b with one node and assign a new label to it. Delete a graph for which no pair is selected and branch (copy) a graph for which more than one pair is selected. Rewrite each remaining graph by replacing all the occurrences of the selected pair in the graph with a node with the newly assigned label. Go back to Step 1.

The output of B-GBI is a set of ranked typical patterns extracted in Step 2a. These patterns are typical in the sense that they are more discriminatory than the non-selected patterns in terms of the criterion used. Another improvement made in conjunction with B-GBI is canonical labeling. GBI assigns a new label to each newly chunked pair. Because it chunks pairs recursively, pairs that have different labels may in fact be the same pattern (subgraph). To identify whether two pairs represent the same pattern or not, each pair is represented by its canonical label [16,7], and only when the labels are the same are they regarded as identical. The basic procedure of canonical labeling is as follows. Nodes in the graph are grouped according to their labels (node colors) and their degrees (number of links attached to the node) and ordered lexicographically. Then an adjacency matrix is created using this node ordering. When the graph is symmetric, the upper triangular elements are concatenated, scanning either horizontally or vertically, to codify the graph. When the graph is asymmetric, all the elements in both triangles are used to codify the graph in a similar way. If there is more than one node with an identical node label
and identical degrees, the ordering which results in the maximum (or minimum) value of the code is searched for. The corresponding code is the canonical label. Let M be the number of nodes in a graph, N be the number of groups of nodes, and p_i (i = 1, 2, ..., N) be the number of nodes within group i. The search space can be reduced from M! to \prod_{i=1}^{N}(p_i!) by using canonical labeling. The code of an adjacency matrix A = (a_{ij}), for the case in which the elements of the upper triangle are vertically concatenated, is defined as

  code(A) = a_{11}\, a_{12}\, a_{22}\, a_{13}\, a_{23}\, a_{33} \cdots a_{1n} \cdots a_{nn}   (1)

          = \sum_{j=1}^{n} \sum_{i=1}^{j} (L+1)^{\left\{\sum_{k=j+1}^{n} k\right\} + j - i}\, a_{ij}.   (2)

Here L is the number of different link labels. It is possible to further prune the search space. We choose the option of vertical concatenation. Elements of the adjacency matrix of higher-ranked nodes form higher elements of the code. Thus, once the locations of the higher-ranked nodes in the adjacency matrix are fixed, the corresponding higher elements of the code are also fixed and are not affected by the order of the elements of lower ranks. This reduces the search space from \prod_{i=1}^{N}(p_i!) to \sum_{i=1}^{N}(p_i!). However, there is still a problem of combinatorial explosion in cases where there are many nodes with the same labels and the same degrees, such as chemical compounds, because the value of p_i becomes large. What we can do is to make the best use of the already determined nodes of higher ranks. Assume that the nodes v_i ∈ V(G) (i = 1, 2, ..., N) are already determined in a graph G. Consider finding the order of the nodes u_i ∈ V(G) (i = 1, 2, ..., k) of the same group that gives the maximum code value. The node that comes to v_{N+1} is the one among the u_i (i = 1, ..., k) that has a link to the node v_1, because the highest element that v_{N+1} can contribute is a_{1,N+1}, and the node that makes this element non-zero, that is, the node that is linked to v_1, gives the maximum code. If more than one node, or no node at all, has a link to v_1, the one that has a link to v_2 comes to v_{N+1}. Repeating this process determines which node comes to v_{N+1}. If no node can be determined after the last comparison at v_N, permutation within the group is needed. Thus, the computational complexity in the worst case is still exponential.
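The grouping and within-group permutation search just described can be made concrete with a small Python sketch. It is a brute-force illustration only, assuming a symmetric (undirected) graph whose links carry small integer labels; the function name and the matrix-based graph encoding are assumptions of this sketch, not the authors' implementation.

from itertools import permutations, product

def canonical_label(adj, node_labels):
    """Return a canonical code for a small labelled, undirected graph.
    adj[i][j] holds the link label between nodes i and j (0 = no link)."""
    n = len(adj)
    degree = [sum(1 for j in range(n) if adj[i][j] and i != j) for i in range(n)]
    # Group node indices by (node label, degree) and order the groups lexicographically.
    keys = sorted(set((node_labels[i], degree[i]) for i in range(n)))
    groups = [[i for i in range(n) if (node_labels[i], degree[i]) == k] for k in keys]

    def code(order):
        # Vertical concatenation of the upper-triangular elements
        # (a11, a12, a22, a13, a23, a33, ...), as in Eq. (1).
        return tuple(adj[order[i]][order[j]] for j in range(n) for i in range(j + 1))

    best = None
    # Permute nodes only within their (label, degree) group: prod_i (p_i!) candidate orderings.
    for choice in product(*(permutations(g) for g in groups)):
        order = [v for g in choice for v in g]
        c = code(order)
        if best is None or c > best:
            best = c
    return best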
3 Experimental Evaluation of B-GBI
The proposed method is tested against the promoter dataset in the UCI Machine Learning Repository [1]. The input features are 57 sequential DNA nucleotides (A, G, T, C) and the total number of instances is 106, including 53 positive instances (sample promoter sequences) and 53 negative instances (non-promoter sequences). This dataset was explained and analyzed in [17]. The data is prepared so that each sequence of nucleotides is aligned at a reference point, which makes it possible to assign the n-th attribute to the n-th nucleotide in the
attribute-value representation. If the data is not properly aligned, standard classifiers such as C4.5 that use the attribute-value representation cannot solve this problem. One of the advantages of the graph representation is that it does not require the data to be aligned at a reference point. In this paper, each sequence is converted to a graph representation assuming that an element interacts with up to 10 elements on both sides (see Fig. 1). Each sequence thus results in a graph with 57 nodes and 515 links. The minimum support for chunking is set at 20% and the Gini index is used as the criterion to select typical patterns.
Fig. 1. Conversion of DNA Sequence Data to a graph
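The conversion of Fig. 1 can be reproduced with a few lines of Python. The snippet below is only an illustration of the construction described in the text (the dictionary/list encoding of the graph is an assumption, not the authors' data structure); it reproduces the 57-node, 515-link count quoted above.

def sequence_to_graph(seq, window=10):
    """Convert a nucleotide sequence into a labelled graph: nodes are positions
    labelled with their nucleotide; an undirected link labelled with the
    distance d (1 <= d <= window) connects every pair of positions at most
    `window` apart."""
    nodes = {i: base for i, base in enumerate(seq)}
    links = [(i, j, j - i)                  # (node, node, link label = distance)
             for i in range(len(seq))
             for j in range(i + 1, min(i + window, len(seq) - 1) + 1)]
    return nodes, links

# A 57-nucleotide sequence yields 57 nodes and 57*10 - (1+2+...+10) = 515 links,
# matching the numbers quoted in the text.
nodes, links = sequence_to_graph("a" * 57)
assert len(nodes) == 57 and len(links) == 515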
The effect of employing canonical labeling was investigated using artificial datasets. The average number of links per node is fixed at 3 and the links are randomly generated. The number of nodes was changed from 10 to 50. The first dataset has 3 labels for both nodes and links, and the second dataset has no labels for either. The beam width was set to 1. Figure 2 shows how the number of extracted patterns differs with and without canonical labeling and what fraction of the time is used for the computation of canonical labeling. In theory the computational complexity of GBI without the canonical labeling algorithm is quadratic, and in practice it is linear in the size of a graph [11]. For the first dataset there is no difference in the number of patterns, and the computation time for canonical labeling is negligible. For the second dataset there is a substantial difference in the number of patterns, but the computation time required for canonical labeling is small. The promoter dataset has 4 node labels and 10 link labels and is closer to the first dataset, so the effect of canonical labeling would probably be small, but this option is used in the following analysis of the promoter dataset. Although not shown in the paper due to space limitations, computation time increases almost linearly with the beam width, which is easily projected. To evaluate how discriminatory the extracted patterns are, they are treated as binary attributes of each sequence to build a classifier with C4.5 [15].
Fig. 2. Effect of canonical labeling on the number of extracted patterns and the computation time for artificial graphs of different size
In [17], C4.5Rule was used instead of C4.5, and the accuracy is averaged over 10 trials of 10-fold cross-validation. Further, 10 trials of windowing are used on the training dataset (9 folds of the whole dataset) to induce 10 rule sets, from which an optimized rule set is constructed. In our experiment, windowing did not give any advantage. Cross-validation is performed 10 times and the results are averaged. The number of patterns extracted by B-GBI is very large, whereas the number of data points is small. Since the extracted patterns are ranked by typicality (selected by the Gini index), only the first n ranked attributes (n = 5, 10, 20, 30, 40, 50) are used. The beam width is varied from 1 to 25. The best result was searched for within this parameter space. It was obtained when n = 40, the beam width is 2, and the number of folds is 103 (leave-one-out). These are summarized in Table 1. The best prediction error rate is 2.8%. The result reported in [17] is 3.8%, which was obtained by KBANN using the M-of-N expression. Although a fair comparison is difficult, it is worth noting that B-GBI, which does not use any domain knowledge, induced rules that are comparable to those of KBANN, which does use the domain knowledge.1 The rules in Fig. 3 correspond to the best result. Unfortunately the patterns in the antecedents do not match the domain theory reported in [17]. Finding good rules (in terms of predictive performance) does not necessarily mean being able to identify the correct domain knowledge. This is probably because the number of examples is very small.
1 KBANN uses domain knowledge to configure an artificial neural network, refines its structure by training, and extracts rules.
Table 1. Summary of prediction error rate for the promoter dataset

Fold   C4.5          C4.5w         ID3           GBI+C4.5 (n=20)  GBI+C4.5 (n=40)  GBI+C4.5 (n=50)
       tree   rules  tree   rules  tree   rules  tree   rules     tree   rules     tree   rules
10     24.4   18.1   18.2   12.3   23.8   13.7   10.2    9.7       6.3    6.3       6.6    6.6
30     18.9   14.2   15.2   12.2   19.7   12.7   11.3   10.3       5.8    5.6       4.8    4.8
50     16.7   12.0   15.0   12.7   18.5   13.2   11.6    9.8       4.9    4.9       4.8    4.9
lvo    16.0    9.4   13.1   12.6   17.0   12.3   11.3    9.4       2.8    2.8       4.7    4.7
If T?T???????T?A = y then Promoter
If C??????T???T?A = y then Promoter
If ATTT = y then Promoter
If C?AA = y then Promoter
If T?A???G?A = y then Promoter
If A?AT?A = y then Promoter
If T?T???????T?A = n and A?T???T??????C = n and ATTT = n and C?AA = n and A?AT?A = n and T?A???G?A = n then non-Promoter

Fig. 3. Classification rules induced by C4.5Rule using B-GBI generated patterns as binary attributes
4 Conclusion
Graph-Based Induction (GBI) is improved in three aspects by incorporating: 1) two criteria, one for chunking and the other a task-specific criterion to extract more discriminatory patterns, 2) a beam search to enhance the search capability, and 3) canonical labeling to accurately count identical patterns. The improved B-GBI is tested against a classification problem of small DNA promoter sequences, and the results indicate that it is possible to extract discriminatory patterns. Immediate future work includes using feature selection methods to filter out less useful patterns and applying B-GBI to a real-world hepatitis data set. Preliminary analysis shows that B-GBI can actually handle graphs with a few thousand nodes and extract discriminatory patterns.
Acknowledgement. This work was partially supported by the grant-in-aid for scientific research on priority area "Active Mining" funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology.
References

1. C. L. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html.
2. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
3. P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261–283, 1989.
4. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32–41, 2000.
5. L. De Raedt and S. Kramer. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. of the 17th International Joint Conference on Artificial Intelligence, pages 853–859, 2001.
6. L. Dehaspe, H. Toivonen, and R. D. King. Finding frequent substructures in chemical compounds. In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining, pages 30–36, 1998.
7. S. Fortin. The graph isomorphism problem, 1996.
8. A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 13–23, 2000.
9. S. Kramer, L. De Raedt, and C. Helma. Molecular feature mining in HIV data. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 136–143, 2001.
10. T. Matsuda, T. Horiuchi, H. Motoda, and T. Washio. Extension of graph-based induction for general graph structured data. In Knowledge Discovery and Data Mining: Current Issues and New Applications, Springer Verlag, LNAI 1805, pages 420–431, 2000.
11. T. Matsuda, H. Motoda, and T. Washio. Graph-based induction and its applications. Advanced Engineering Informatics, 16(2):135–143, 2002.
12. R. S. Michalski. Learning flexible concepts: Fundamental ideas and a method based on two-tiered representation. In Machine Learning, An Artificial Intelligence Approach, 3:63–102, 1990.
13. S. Muggleton and L. de Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19(20):629–679, 1994.
14. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
15. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
16. R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of Graph Theory, 1:339–363, 1977.
17. G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993.
18. K. Yoshida and H. Motoda. CLIP: Concept learning from inference patterns. Journal of Artificial Intelligence, 75(1):63–92, 1995.
Feature Selection for Propositionalization

Mark-A. Krogel (1) and Stefan Wrobel (2,3)

(1) Otto-von-Guericke-Universität, Magdeburg, Germany, [email protected]
(2) FhG AiS, Schloß Birlinghoven, 53754 Sankt Augustin, Germany, [email protected]
(3) Universität Bonn, Informatik III, Römerstr. 164, 53117 Bonn, Germany, [email protected]
Abstract. Following the success of inductive logic programming on structurally complex but small problems, recently there has been strong interest in relational methods that scale to real-world databases. Propositionalization has already been shown to be a particularly promising approach for robustly and effectively handling larger relational data sets. However, the number of propositional features generated here tends to quickly increase, e.g. with the number of relations, with negative effects especially for the efficiency of learning. In this paper, we show that feature selection techniques can significantly increase the efficiency of transformation-based learning without sacrificing accuracy.
1 Introduction
Relational databases are a widespread and commonly used technology for storing information in business, administration, and science, among other areas. They pose special challenges to knowledge discovery because many prominent learning systems can only treat attribute-value or propositional data and leave transformations of relational data into that form to experienced analysts. In contrast, methods from inductive logic programming (ILP) can directly handle problems of relational data analysis. In particular, transformation-based approaches, which automatically transform relational data into a form accessible to propositional learners, have been shown to be a robust and powerful method for handling relational databases, cf. the results of KDD Cup 20011. Our system RELAGGS (RELational AGGregationS) is a variant of such a transformation-based approach that relies on facilities inspired by SQL aggregation functions [2]. The propositionalized data can be expected to usually contain a large number of attributes, many of which are irrelevant in the sense that they do not contribute to the models to be learned. This leads to our proposal to enrich our learner with feature selection methods. We experimentally show, with the help of 6 variants of well-known learning tasks from 4 different domains, that this results in a relational learner with improved time behavior, model complexity, and often also predictive accuracy.
Parts of this work were done while the 2nd author was still at Magdeburg University. http://www.cs.wisc.edu/∼dpage/kddcup2001/
2 Approach
We gave a formal description of a framework for transformation-based learning in [2]. An extension of this framework led to the development of the system RELAGGS as an instance of a system for propositionalization; it is the system used in the experiments reported here. A propositionalization system queries a relational data set in order to construct the features of a single-relational representation of the original data. Usually, those queries concern the existence of features in the original data. RELAGGS extends this by using functions equivalent to the SQL aggregation operators avg, count, max, min, and sum. Feature selection methods are often classified into filters and wrappers [4,6]. While filters choose attributes based on general properties of the data before learning takes place, wrappers intermingle feature selection and learning. The methods for feature selection are also often subdivided into those that judge only single attributes at a time (univariate methods) and those that evaluate and compare whole sets of attributes (multivariate methods). Furthermore, different selection criteria and search strategies can be applied. A useful property of propositionalization is that the resulting table can be the input to different propositional learners. In order to keep this kind of independence, we favor the filtering approach to feature selection here. We use both univariate and multivariate methods in our experiments. Further details are given in the following section.
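As a minimal illustration of the aggregation idea (not of RELAGGS itself), the following pandas sketch flattens a one-to-many relation into per-example features using the five aggregates named above; the table and column names are hypothetical.

import pandas as pd

def propositionalize(target, related, key):
    """For every example in `target`, summarize the rows of the one-to-many
    table `related` that join on `key` with the SQL-style aggregates
    avg, count, max, min and sum, yielding one flat feature row per example."""
    numeric = related.select_dtypes("number").columns.drop(key, errors="ignore")
    aggregated = related.groupby(key)[list(numeric)].agg(["mean", "count", "max", "min", "sum"])
    aggregated.columns = [f"{col}_{fn}" for col, fn in aggregated.columns]
    return target.join(aggregated, on=key)

# Hypothetical toy data: one loan per account, many transactions per account.
loans = pd.DataFrame({"account": [1, 2], "status": ["good", "bad"]})
trans = pd.DataFrame({"account": [1, 1, 2], "amount": [100.0, 50.0, 900.0]})
print(propositionalize(loans, trans, "account"))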
3 Experimental Evaluation

3.1 Data Sets and Learning Tasks
Table 1 provides general information about the data sets used in the experiments.

Table 1. Overview of data sets

Origin      Target attribute     No. of relations  No. of examples  No. of columns in RELAGGS result
ECML98      Partner.class        10                997              824
ECML98      Household.class      10                997              933
PKDD99-00   Loan.status           8                682              937
PKDD99-01   Patient.thrombosis    5                417              297
KDD01       Gene.fctCellGrowth    7                862              895
KDD01       Gene.locNucleus       7                862              895
For the KDD-Sisyphus I Workshop at ECML98, a data set was issued based on a data warehouse of a Swiss insurance company2. The data describe the company's customers or partners, their households, and their insurance contracts. Two learning tasks were provided with the data. We drew two random samples of 997 instances each from the partner and household tables for reasons of WEKA's learning efficiency, cf. below. The PKDD Challenges in 1999 and 2000 offered a data set from a Czech bank3. The data describe accounts, their transactions, orders, and loans, as well as customers, including personal, credit card ownership, and socio-demographic data. A learning task was not explicitly given for the challenges. We compare problematic with non-problematic loans. The PKDD Challenges from 1999 to 2001 provided another data set, originating from a Japanese hospital specializing in the treatment of collagen diseases. We concentrate on the group of patients followed by the hospital and provided with both laboratory and special examination results. We aim at a model to differentiate between patients without and those with thrombosis. The KDD Cup 20014 tasks 2 and 3 asked for the prediction of gene function and gene localization, respectively. We chose two binary tasks here, viz. predicting whether a gene codes for a protein that serves cell growth, cell division, and DNA synthesis, and predicting whether the protein produced by the described gene would be allocated in the nucleus.

3.2 Procedure
The experiments were carried out on a PC with a Pentium III/500 MHz processor and 128 MB RAM. We decided to utilize the data mining system WEKA5. As input to the experiments, we take the results produced by RELAGGS. The main learning algorithm from WEKA that we apply in the experiments is J48, which largely corresponds to C4.5 [5]. J48 is applied directly to the RELAGGS results (subscript "J48"). This is compared to the application of J48 to the RELAGGS results that were first treated with WEKA's attribute selectors CfsSubsetEval with best-first search (superscript "Cfs") and InfoGainAttributeEval with ranker (superscript "IG"). "Cfs" is a representative of filters that consider groups of attributes, while "IG" comes from the group of filters that evaluate single attributes. Both filters showed the best performance, compared to other members of their groups, in a preliminary series of experiments. All WEKA tools are applied with default parameter settings, including stratified 10-fold cross-validation for classification learning. For comparison, we also applied SMO (subscript "SMO"), the support vector machine implementation of WEKA, to the complete RELAGGS results. Here, the number of irrelevant attributes should not be influential wrt. accuracy.

2 http://research.swisslife.ch/kdd-sisyphus/
3 http://lisp.vse.cz/challenge/
4 http://www.cs.wisc.edu/∼dpage/kddcup2001/
5 http://www.cs.waikato.ac.nz/∼ml/weka/
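For the univariate filtering step, a minimal stand-in in the spirit of information-gain ranking is sketched below; it is not WEKA's InfoGainAttributeEval, and the discretization choice and the function names are assumptions of this sketch.

import numpy as np

def information_gain(x, y, bins=10):
    """Univariate information-gain score of one (discretized) numeric attribute x
    with respect to class labels y."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    gain = entropy(y)
    for v in np.unique(x_binned):
        mask = x_binned == v
        gain -= mask.mean() * entropy(y[mask])   # weighted conditional entropy
    return gain

def rank_attributes(X, y, k):
    """Filter approach: keep the k highest-scoring columns of X."""
    scores = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]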
3.3 Results
Table 2 shows, for each of the experimental conditions, the average error across the ten folds and the standard deviation. In many cases, the conditions using WEKA's attribute selectors reach the best results. However, the differences to J48 without feature selection and to SMO do not seem that large.

Table 2. Error rate averages and standard deviations (in percent)

Target               RELAGGS_SMO   RELAGGS_J48   RELAGGS_J48 (Cfs)   RELAGGS_J48 (IG)
Partner.class        13.9 ± 3.8    15.6 ± 3.2    12.6 ± 1.8          15.5 ± 3.5
Household.class      20.7 ± 3.8     9.1 ± 2.9    12.7 ± 2.6          13.1 ± 3.0
Loan.status           9.7 ± 2.0     7.8 ± 3.5     8.5 ± 3.1           5.9 ± 3.2
Patient.thrombosis   14.2 ± 3.3    13.4 ± 4.7    12.5 ± 4.9          14.2 ± 4.6
Gene.fctCellGrowth   16.8 ± 3.1    16.9 ± 3.3    17.2 ± 2.7          15.8 ± 3.1
Gene.locNucleus      12.1 ± 3.4    12.9 ± 2.3    16.7 ± 3.7          12.8 ± 3.0
Table 3 shows the running times for learning, including the feature selection times. The conditions with feature selection methods are more efficient in the majority of cases. The best results are marked in bold.

Table 3. Running times (in sec) for single training runs incl. feature selection times

Target               RELAGGS_SMO   RELAGGS_J48   RELAGGS_J48 (Cfs)   RELAGGS_J48 (IG)
Partner.class        59            26            10                  13
Household.class      94            20            12                  14
Loan.status          25            14            20                   9
Patient.thrombosis    4             4             2                   2
Gene.fctCellGrowth   61            37            16                   7
Gene.locNucleus      38            46            17                  11
In Table 4, tree sizes are given as the absolute numbers of nodes they consist of. Here, "Cfs" consistently reaches the smallest sizes, which should support understandability. The best results are marked in bold.

Table 4. Tree sizes as number of nodes

Target               RELAGGS_J48   RELAGGS_J48 (Cfs)   RELAGGS_J48 (IG)
Partner.class        127             3                  95
Household.class       69             7                  85
Loan.status           37            15                  17
Patient.thrombosis    19             9                  19
Gene.fctCellGrowth    69            11                  45
Gene.locNucleus       57            15                  51
4 Related Work
The approach of RELAGGS is based on the general idea of transforming relational representations of data into representations amenable to efficient propositional learners, as instantiated by LINUS and DINUS, and a number of other systems
[1]. Here, special approaches exist to deal with irrelevant attributes [3]. However, they seem limited to Boolean tables and smaller data sets so far. Feature selection has been dealt with extensively in the literature [4]. The authors of [6] point out that both irrelevant and relevant attributes may cause a deterioration in classification accuracy. However, those experiments did not deal with data originating from propositionalization.
5 Conclusion
In this paper, we propose to combine propositionalization with feature selection. Experiments show that the application of feature selection techniques results in a relational learner with often higher predictive accuracy, lower time demands in most cases, and in part significantly lower model complexity. In our future work, we plan to evaluate strategies that apply propositionalization to subsamples of the data first, followed by feature selection techniques, in order to finally propositionalize larger data sets only wrt. the relevant attributes. Furthermore, we plan to compare performance with other ILP learners.
References

1. S. Kramer, N. Lavrač, and P. A. Flach. Propositionalization Approaches to Relational Data Mining. In N. Lavrač and S. Džeroski, editors, Relational Data Mining. Springer, 2001.
2. M.-A. Krogel and S. Wrobel. Transformation-Based Learning Using Multirelational Aggregation. In C. Rouveirol and M. Sebag, editors, Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP). Springer, 2001.
3. N. Lavrač and P. A. Flach. An extended transformation approach to Inductive Logic Programming. ACM Transactions on Computational Logic, 2(4):458–494, 2001.
4. H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer, 1998.
5. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
6. I. H. Witten and E. Frank. Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
Subspace Clustering Based on Compressibility Masaki Narahashi and Einoshin Suzuki Electrical and Computer Engineering, Yokohama National University, 79-5 Tokiwadai, Hodogaya, Yokohama 240-8501, Japan {narahashi, suzuki}@slab.dnj.ynu.ac.jp
Abstract. In this paper, we propose a subspace clustering method based on compressibility. It is widely accepted that compressibility is deeply related to inductive learning. We have come to believe that compressibility is promising as an evaluation criterion in subspace clustering, and propose SUBCCOM in order to verify this belief. Experimental evaluation employs both artificial and real data sets.
1 Introduction
Clustering [7], which partitions a given set of examples into a set of similar groups called clusters, has a variety of applications such as customer segmentation, similarity search, and exploratory data analysis. In a typical application, however, not all attributes are relevant to clustering. Moreover, a clustering algorithm is typically ineffective when the number of attributes is large, due to the curse of dimensionality [4]. Subspace clustering [2], which also obtains a set of relevant attributes, has been proposed in order to circumvent these problems. It is widely accepted that inductive learning and data compression can be viewed as two sides of the same coin. We have come to believe that compressibility is promising as an evaluation criterion in subspace clustering, and have realized this belief in a system called SUBCCOM (SUBspace Clustering based on COMpressibility). The compressibility measure employed in SUBCCOM is based on BIRCH [9], which represents a fast clustering algorithm based on data squashing [6]. SUBCCOM also employs a heuristic search method, "jumping search", to search effectively for the relevant set of attributes among a large number of candidates.
2 BIRCH
The main stream of conventional data mining research has concerned how to scale up a learning/discovery algorithm to cope with a huge amount of data. Contrary to this approach, data squashing [6] concerns how to scale down such data so that they can be dealt with by a conventional algorithm. BIRCH [9], which we will employ as a basis of our method, represents a fast clustering algorithm based on data squashing. BIRCH takes a training data set x1, x2, ..., xm as input, and outputs its partition γ1, γ2, ..., γc+1, where each of γ1, γ2, ..., γc represents a cluster, and
γc+1 is a set of noise. BIRCH squashes the training data set to obtain a CF (clustering feature) tree, and applies a global clustering algorithm to the squashed examples, each of which is represented by an entry in a leaf of the tree. A CF tree is a height-balanced tree similar to a B+ tree [5]. A node of a CF tree contains, as entries to child nodes, a set of CF vectors, each of which corresponds to an abstracted expression of a set of examples. For a set of examples x_1, x_2, ..., x_{N_φ} to be squashed, a CF vector CF_φ consists of the number N_φ of examples, the add-sum vector \sum_{i=1}^{N_φ} x_i of the examples, and the squared sum \sum_{i=1}^{N_φ} x_i^2 of the attribute values of the examples. Since the CF vector satisfies additivity and can thus be updated incrementally, BIRCH requires only one scan of the training data set. Moreover, various inter-cluster distance measures can be calculated from the corresponding two CF vectors only. This signifies that the original data set need not be stored, and clustering can be performed with the CF vectors only. A CF tree is constructed with a procedure similar to that for a B+ tree. When a new example is read, it follows a path from the root node to a leaf, and the nodes along this path are updated. Selection of an appropriate node in this procedure is based on a distance measure which is specified by the user. In a leaf, the example is assigned to its closest entry if the distance between the new example and the examples of the entry is below a given threshold L.
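The CF vector and its additive update can be captured in a short sketch. The class below is a minimal illustration of the description above, not BIRCH itself; the attribute and method names, and the centroid-based distance used, are assumptions of this sketch.

import numpy as np

class CFVector:
    """Clustering feature (N, add-sum, squared sum) of a set of examples."""
    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.n = 1
        self.ls = x.copy()                   # add-sum (linear sum) of the examples
        self.ss = float((x * x).sum())       # squared sum of the attribute values

    def add(self, other):
        """Additivity: merging two CF vectors is element-wise addition."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def centroid_distance(self, other):
        """One inter-cluster distance computable from the CF vectors alone."""
        return float(np.linalg.norm(self.centroid() - other.centroid()))

def insert_example(entries, x, L):
    """Insertion rule of the text: absorb x into its closest leaf entry only if
    the distance stays below the threshold L, otherwise start a new entry."""
    new = CFVector(x)
    if entries:
        closest = min(entries, key=lambda e: e.centroid_distance(new))
        if closest.centroid_distance(new) < L:
            closest.add(new)
            return
    entries.append(new)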
3 Subspace Clustering

3.1 Description of the Problem
The input of subspace clustering in this paper consists of m examples x1, x2, ..., xm, each of which is represented with n numerical attributes. We denote the set of these n attributes as V. Note that an example xi represents a point in an n-dimensional space. The output of subspace clustering is a partition γ1, γ2, ..., γc of the examples, each element of which represents a cluster, together with a set v (⊆ V) of relevant attributes. Note that the attributes in v represent the subspace in which the clusters exist. A pattern extraction procedure in data mining is typically interactive: the user modifies the mining conditions over a series of applications of a method and investigations of its results. In this paper, we assume that the user specifies the number l of attributes in v.

3.2 Compressibility as an Evaluation Criterion
It is widely accepted that compressibility is deeply related to inductive learning. For instance, the MDL (Minimum Description Length) principle [8], which has been successfully employed as an evaluation criterion in inductive learning, favors a model which effectively compresses the information content of the given data set and the model.
An entry in a leaf of a CF tree, which we explained in the previous section, contains a set of examples which are close to each other. Therefore, the number of entries in leaves is related to the number of dense regions in the example space, and can be used to define an index of compressibility. In this paper, we propose −µ(u) as an evaluation criterion of a subspace u, where µ(u) represents the number of entries in leaves of a corresponding CF tree.

3.3 Heuristic Search for the Relevant Attributes
Note that subsets of attributes form a lattice, of which the bottom and the top are the empty set and V, respectively. We realize the search for relevant attributes as a search for an appropriate subset of V in this lattice, based on the evaluation criterion of the previous section. Forward search and backward search represent brute-force search from the bottom and from the top, respectively. Jumping search, which we propose in this paper, is a heuristic search method which is expected to be time-efficient without sacrificing accuracy. Our method first builds a CF tree for each attribute ai with L = T1, then obtains a set u of attributes whose evaluation criterion value −µ(ai) is not smaller than the average value. In case |u| < l, the remaining attributes with the largest evaluation criterion values are added to u so that |u| becomes l. Finally, starting from u, our method performs a backward search with L = T2. Since a single attribute exhibits low compressibility, T1 is typically set to a much smaller value than T2. We show our algorithm below.

Algorithm: JumpingSearch(x1, x2, ..., xm, l)
Input: data set x1, x2, ..., xm, number of attributes l
Return value: subspace u, entries in leaves of the final CF tree τ
1  For(i = 1; i ≤ n; i++)
2    Build a CF tree using values for ai only, with L = T1
3  u = φ
4  Foreach(ai such that −µ(ai) ≥ the average value of the evaluation criterion)
5    u = u ∪ {ai}
6  If(|u| < l)
7    Add the remaining attributes with the largest evaluation criterion values until |u| becomes l
8  While(|u| > l)
9    Foreach(ai ∈ u)
10     Build a CF tree using values for u − {ai} only, with L = T2
11   u = u − {ai*} where ai* = argmax_ai −µ(u − {ai})
12 Build a CF tree τ using values for u only, with L = T2

Our subspace clustering system, which we call SUBCCOM, first obtains a set of possibly relevant attributes with jumping search, then applies the k-means algorithm to the entries of the leaves of the corresponding CF tree.
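The same procedure can be written as a short Python sketch. The function build_cf_tree used below is a stand-in for a real CF-tree implementation and is assumed to expose the number of leaf entries (i.e., µ) as num_leaf_entries; both names are assumptions of this sketch.

import numpy as np

def jumping_search(X, l, build_cf_tree, T1, T2):
    """Sketch of the jumping search of Sect. 3.3 over the columns of X."""
    n = X.shape[1]
    # Phase 1: score every single attribute with the very small threshold T1.
    mu_single = np.array(
        [build_cf_tree(X[:, [i]], T1).num_leaf_entries for i in range(n)])
    score = -mu_single
    u = [i for i in range(n) if score[i] >= score.mean()]
    # Pad u with the best remaining attributes if it is too small.
    for i in np.argsort(score)[::-1]:
        if len(u) >= l:
            break
        if i not in u:
            u.append(int(i))
    # Phase 2: backward search from u with the larger threshold T2.
    while len(u) > l:
        candidates = {a: -build_cf_tree(X[:, [b for b in u if b != a]], T2).num_leaf_entries
                      for a in u}
        u.remove(max(candidates, key=candidates.get))   # drop the attribute whose removal helps most
    return u, build_cf_tree(X[:, u], T2)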
4 Experimental Evaluation

4.1 Artificial Data Sets
We experimentally evaluated SUBCCOM using artificial data sets. A data set contains five clusters, each of which is generated by a set of one-dimensional normal distributions. A normal distribution here has a mean value which is one of 3.0, 5.0, 7.0, 9.0, or 11.0, and the variance is 1.0. In the experiments, the class attribute was hidden from the clustering algorithm, which enabled us to calculate accuracies of the algorithm. An attribute which is irrelevant to a cluster takes a random value between 0.0 and 20.0. The number n of attributes is 20. Two series of experiments were performed. In the first series, we varied the number of examples over 10000, 20000, 30000, 40000, and 50000 under l = 5. In the second series, we varied the number of relevant attributes l = 3, 4, 5, 6, 7 under m = 10000. Since these data sets favor forward search over backward search, we excluded the latter from the experiments. We generated ten data sets for each condition. We investigated the cases of T1 = 0.01, T2 = 2.0, 3.0 for SUBCCOM and T1 = 1.0 for forward search. In SUBCCOM, we set the branching factor of a CF tree and the number of entries in a leaf to four. Average cluster distance [7] was employed in building a CF tree and in the k-means algorithm. PROCLUS [1], which represents an efficient projected clustering method, was employed for comparison. For a fair comparison, we employed PROCLUS as a subspace clustering method1 for l-dimensional subspaces. Since the k-means algorithm and PROCLUS are non-deterministic, we applied SUBCCOM and PROCLUS 100 times for each data set. The average results for the first and second series of experiments are shown in figure 1 and figure 2, respectively. In the figures, bar charts and line graphs represent computation time and accuracy, respectively.
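One plausible way to generate an artificial data set matching this description is sketched below; where the text leaves details open (how cluster means are assigned, equal cluster sizes), the choices made here are assumptions.

import numpy as np

def make_dataset(m, n=20, l=5, seed=0):
    """Five clusters; each relevant attribute of a cluster follows a normal
    distribution with a mean drawn from {3, 5, 7, 9, 11} and variance 1,
    while irrelevant attributes are uniform on [0, 20]. The hidden class
    label is returned separately so that accuracy can be computed."""
    rng = np.random.default_rng(seed)
    relevant = rng.choice(n, size=l, replace=False)
    X = rng.uniform(0.0, 20.0, size=(m, n))
    y = rng.integers(0, 5, size=m)                       # hidden cluster/class labels
    means = rng.choice([3.0, 5.0, 7.0, 9.0, 11.0], size=(5, l))
    for c in range(5):
        idx = y == c
        X[np.ix_(idx, relevant)] = rng.normal(means[c], 1.0, size=(idx.sum(), l))
    return X, y, relevant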
Fig. 1. Average time (bar chart) and accuracy (line graph) for artificial data sets.

1 PROCLUS has been modified to find an identical set of relevant attributes for all clusters.
Fig. 2. Average time (bar chart) and accuracy (line graph) for artificial data sets.
In terms of accuracy, SUBCCOM (T2 = 2.0) outperformed the other methods in almost all cases. The other methods can be ranked as PROCLUS, SUBCCOM (T2 = 3.0), and forward search. The last method often failed to find an appropriate subspace2. We attribute these good results to appropriate data squashing, which would facilitate a clustering procedure by decreasing the number of examples appropriately. It should also be noted that SUBCCOM systematically searches for the set of relevant attributes by investigating different subsets, while PROCLUS performs a heuristic search which finds the set and the clusters simultaneously. In terms of computation time, forward search was the worst method. SUBCCOM was faster than PROCLUS when the number of examples is small.
Fig. 3. Average time (bar chart) and accuracy (line graph) for a real data set.
2 The strange tendencies of accuracy in figure 1 are due to the small threshold T1 = 1.0, but larger values resulted in degradation of overall performance.
4.2 Real Data Set
As the source of the real data set, we chose the network intrusion data set which was employed in KDD Cup 1999 [3]. The data set consists of 494,021 examples with 41 attributes. The real data set was generated as follows: we selected 10,000 examples for each of the classes normal, neptune, and smurf, and 29 continuous attributes. We set T1 = 0.01 and T2 = 3.0 for SUBCCOM, T1 = 2.0 for forward search, and T2 = 2.0 for backward search. We varied the number of relevant attributes l = 24, 25, 26, 27, and 28, and assumed that there were three clusters. The results are shown in figure 3. SUBCCOM (T2 = 3.0) exhibited the highest average accuracy for l = 27. We attribute these good results to the reason described in the previous section. Results on computation time were similar to those in the previous section.
References 1. Aggarwal, C. C. et al.: Fast Algorithms for Projected Clustering, Proc. 1999 ACM SIGMOD Int’l Conf. on Management of Data (SIGMOD), pp. 61–72 (1999). 2. Agrawal, R. et al.: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proc. 1998 ACM SIGMOD Int’l Conf. on Management of Data (SIGMOD), pp. 94–105 (1998). 3. Bay, S. : UCI KDD Archive, http://kdd.ics.uci.edu/, Dept. of Information and Computer Sci., Univ. of California Irvine. (1999). 4. Bellman, R. E.: Adaptive Control Processes, Princeton Univ. Press, Princeton, N. J. (1961). 5. Comer, D.: The Ubiquitous B-Tree, ACM Computing Surveys, Vol. 11, No. 2, pp. 121–137 (1979). 6. DuMouchel, W. et al.: Squashing Flat Files Flatter, Proc. Fifth ACM Int’l Conf. on Knowledge Discovery and Data Mining (KDD), pp. 6–15 (1999). 7. Kaufman, L. and Rousseeuw, P. J.: Finding Groups in Data, Wiley, New York (1990). 8. Rissanen, J. : Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore (1989). 9. Zhang, T., Ramakrishnan, R., and Livny, M.: BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. 1996 ACM SIGMOD Int’l Conf. on Management of Data (SIGMOD), pp. 103–114 (1996).
The Extra-Theoretical Dimension of Discovery. Extracting Knowledge by Abduction

Lorenzo Magnani (1), Matteo Piazza (2), and Riccardo Dossena (3)

(1) Department of Philosophy and Computational Philosophy Laboratory, University of Pavia, 27100 Pavia, Italy, and Georgia Institute of Technology, Atlanta, GA, 30332, USA, [email protected]
(2) Department of Philosophy and Computational Philosophy Laboratory, University of Pavia, 27100 Pavia, Italy, [email protected]
(3) Department of Philosophy and Computational Philosophy Laboratory, University of Pavia, 27100 Pavia, Italy, [email protected]
Abstract. Science is one of the most creative forms of human reasoning. Recent epistemological and cognitive studies concentrate on the concept of abduction as a means to originate and refine new ideas. Traditional cognitive science accounts of abduction aim at illustrating discovery and creativity processes in terms of theoretical and "internal" aspects, by means of computational simulations and/or abstract cognitive models. A neglected issue, worthy of deeper investigation within artificial intelligence, is that "discovery" is often related to a complex cognitive task involving the use and manipulation of the external world. Concrete manipulation of the external world is a fundamental step in the process of knowledge extraction and hypothesis generation: by a process of manipulative abduction it is possible to build prostheses for human minds, by interacting with external objects and representations in a constructive way, and so by creating implicit knowledge through doing. This kind of embodied and unexpressed knowledge plays a key role in the subsequent processes of scientific comprehension and discovery. This paper aims at illustrating the close relationship between external representations and creative processes in the scientific exploration and understanding of phenomena.
1 Theoretical and Manipulative Abduction
In the twentieth century, Kuhnian ideas about the irrationality of conceptual change and paradigm shift (see [1]) led philosophers of science to distinguish between a logic of discovery and a logic of justification, and to the direct conclusion that a logic of discovery, and hence a rational model of discovery, cannot exist. Today researchers have by and large abandoned this attitude by concentrating on the concept of abduction, pointed out by C.S. Peirce as a fundamental mechanism by which it is possible to account for the introduction of new explanatory
hypotheses in science. Theoretical abduction (see [2,3]) is the process of inferring certain facts and/or laws and hypotheses that render some sentences plausible, that explain or discover some (possibly new) phenomenon or observation; it is the process of reasoning in which explanatory hypotheses are formed and evaluated. There are two main epistemological meanings of the word abduction: 1) abduction that only generates "plausible" hypotheses, and 2) abduction considered as inference to the best explanation, which also evaluates hypotheses. To illustrate from the field of medical knowledge, the discovery of a new disease and the manifestations it causes can be considered the result of a creative abductive inference. Therefore, creative abduction deals with the whole field of the growth of scientific knowledge. This is irrelevant in medical diagnosis, where instead the task is to select from an encyclopedia of pre-stored diagnostic entities. Both kinds of abductive inference are ampliative, because the reasoning involved amplifies, or goes beyond, the information incorporated in the premises. Theoretical abduction certainly illustrates much of what is important in creative abductive reasoning, in humans and in computational programs, but it fails to account for many cases of explanation occurring in the sciences when the exploitation of the environment is crucial. It fails to account for those cases in which we can speak of a "discovery through doing", cases in which new, previously unexpressed, information is codified by means of the manipulation of some external object (epistemic mediators). The concept of manipulative abduction (see [2]) captures a large part of scientists' thinking, where the role of action is central and where the features of this action are implicit and hard to elicit: action can provide otherwise unavailable information that enables the agent to solve problems by performing a suitable abductive process of generation or selection of hypotheses.
2 Manipulating World to Extract Information

2.1 The "Internal" Side of Discovery
Throughout his career it was C.S. Peirce who defended the thesis that, besides deduction and induction1, there is a third mode of inference that constitutes the only method for really improving scientific knowledge, which he called abduction. The many attempts that have been made to model abduction by developing formal tools (see [2]) deal exclusively with selective abduction (for example, diagnostic reasoning) and relate to the idea of preserving consistency. Considering only this "logical" view of abduction does not enable us to say much about creative processes in science and, therefore, about the nomological and most interesting creative aspects of abduction. It mainly refers to the selective (diagnostic) and merely explanatory aspects of reasoning and to the idea that abduction is mainly an inference to the best explanation (see [2]): when used to describe creative events it is either empty or replicates the well-known Gestalt model of radical innovation.
Peirce clearly contrasted abduction with induction and deduction, by using the famous syllogistic model. More details on the differences between abductive and inductive/deductive inferences can be found in [2,4].
For Peirce abduction is an inferential process that includes all the operations whereby hypotheses and theories are constructed. Hence abduction has to be considered as a kind of ampliative inference that, as already stressed, is not logical and truth preserving: indeed valid deduction does not yield any new information, for example new hypotheses previously unknown. If we want to provide a suitable framework for analyzing the most interesting cases of conceptual changes in science we do not have to limit ourselves to the sentential view of theoretical abduction but we have to consider a broader inferential one: the model-based sides of creative abduction (cf. below). From the Peirce’s philosophical point of view, all thinking is in signs, and signs can be icons, indices or symbols. Moreover, all inference is a form of sign activity, where the word sign includes “feeling, image, conception, and other representation” ([5, 5.283]), and, in Kantian words, all synthetic forms of cognition. That is, a considerable part of the thinking activity is model-based. Of course model-based reasoning acquires its peculiar creative relevance when embedded in abductive processes, so that we can individuate a model-based abduction. Hence, it is in terms of model-based abduction (and not in terms of sentential abduction) that we have to think to explain complex processes like scientific conceptual change. Related to the high-level types of scientific conceptual change (see, for instance, [6]) are different varieties of model-based abductions. Following Nersessian (cf. [7]), the term “model-based reasoning” is used to indicate the construction and manipulation of various kinds of representations, not mainly sentential and/or formal, but mental and/or related to external mediators. Obvious examples of model-based reasoning are constructing and manipulating visual representations, thought experiment, and analogical reasoning. Manipulative abduction [2], as a particular instance of model-based reasoning, happens when we are thinking through doing and not only, in a pragmatic sense, about doing. It refers to an extra-theoretical behavior that aims at creating communicable accounts of new experiences to integrate them into previously existing systems of experimental and linguistic (theoretical) practices. In the following sections manipulative abduction will be considered from the perspective of the relationship between unexpressed knowledge and external representations. 2.2
2.2 External Mediators
The power of model-based abduction mainly depends on its ability to extract and render explicit a certain amount of important information, unexpressed at the level of available data. It also has a fundamental role in the process of transformation of knowledge from its tacit to its explicit forms, and in the subsequent knowledge elicitation and use. Let us describe how this happens in the case of “external” model-based processes. As pointed out by M. Polanyi in his epistemological investigation, a large part of knowledge is not explicit, but tacit: we know more than we can tell and we can know nothing without relying upon those things which we may not be able to tell (see [8]).
As Polanyi contends, human beings acquire and use knowledge by actively creating and organizing their own experience. The existence of this kind of cognitive behavior, not merely theoretical, is also attested by the many everyday situations in which humans are perfectly able to perform very efficacious (and habitual) tasks without being immediately able to articulate a conceptual explanation of them (for details see [9]). We can find a similar situation in the process of scientific creativity. Too often, in the cognitive view of science, it has been stressed that conceptual change involves just a theoretical and "internal" replacement of the main concepts. But researchers usually forget that a large part of these processes is instead due to practical and "external" manipulations of some kind, which are a prerequisite to the subsequent work of theoretical arrangement and knowledge creation. When these processes are creative we can speak of manipulative abduction (cf. above). Scientists need a first "rough" and concrete experience of the world to develop their systems, as the cognitive-historical analysis of scientific change (see [10] and [11]) has carefully shown. Usual accounts of discovery neglect the periods of intense and often arduous thinking activity, often performed by means of experiments and manipulative activity on external objects, that prepare the discovery. Traditional examinations of how problem-solving heuristics create new representations in science have analyzed the frequent use of analogical reasoning, imagistic reasoning, and thought experiments from an internal point of view. However, attention has not been focused on those particular kinds of heuristics that resort to extra-theoretical ways of thinking (thinking through doing, cf. [12]) by means of external representations. A central point in the so-called dynamical approach to cognitive science (see [13,14]) is the importance assigned to the "whole" cognitive system: cognitive activity is in fact the result of a complex interplay and simultaneous coevolution, in time, of the states of mind, body, and external environment. A "real" cognitive system results from an epistemic negotiation among people and certain "external" objects and technical artifacts (see [9] and [15]). For example, in the case of a geometrical performance, the cognitive system is not merely the mind-brain of the person performing the geometrical task, but the system consisting of the whole body of the person (cognition is embodied and distributed) plus the external physical representation. In geometrical discovery the whole activity of cognition is located in the system consisting of a human together with diagrams. An external representation can modify the kind of computation that a human agent uses to reason about a problem: the capacity for inner reasoning and thought results from the internalization of originally external forms of representation. External representations are in fact not merely memory aids: they can give people access to knowledge and skills that are unavailable to internal representations; they help researchers to easily identify relevant aspects and to make further inferences; and they constrain the range of possible cognitive outcomes so that some actions are allowed and others forbidden. The mind is limited because of the restricted range of its information processing, the limited power of working memory and attention, and the limited speed of some learning and reasoning
operations; the environment, on the other hand, is intricate because of the huge amount of data, real-time requirements, and uncertainty factors. Consequently, we have to consider the whole system, consisting of both internal and external representations, and their role in optimizing the overall cognitive performance through the distribution of the various subtasks.
3 The Extra-Theoretical Dimension of Discovery
In many cognitive and epistemological situations, we can speak of a sort of implicit information "embodied" in the whole relationship between our mind-body system and suitable external representations: information we can extract, explicitly develop, and transform into knowledge contents in order to solve problems. As we have already stressed, Peirce considers inferential any cognitive activity whatever, not only conscious abstract thought; he also includes perceptual knowledge and subconscious cognitive activity. For instance, in subconscious mental activities visual representations play an immediate role. Peirce gives an interesting example of model-based abduction related to sense activity: "A man can distinguish different textures of cloth by feeling: but not immediately, for he requires to move fingers over the cloth, which shows that he is obliged to compare sensations of one instant with those of another" [5, 5.221]. This surely suggests that abductive movements also have interesting extra-theoretical characters and that there is a role in abductive reasoning for various kinds of manipulations of external objects. Hence all knowing is inferring, and inferring is not instantaneous: it happens in a process that requires an activity of comparison involving many kinds of models over a more or less considerable lapse of time [5, 5.221]. All these considerations suggest, then, that there exists a creative form of thinking through doing,2 as fundamental as the theoretical one: manipulative abduction (see [2] and [3]). As already said, manipulative abduction happens when we are thinking through doing and not only, in a pragmatic sense, about doing. Various templates of manipulative behavior exhibit some regularities. The activity of manipulating external things and representations is highly conjectural and not immediately explanatory: these templates are hypotheses of behavior that abductively enable a kind of epistemic "doing". Some common features of the tacit templates of manipulative abduction that enable us to manipulate things and experiments in science are related to: 1. sensibility to the aspects of the phenomenon which can be regarded as curious or anomalous; 2. preliminary sensibility to the dynamical character of the phenomenon, and not to entities and their properties; 3. referral to experimental manipulations that exploit artificial apparatus to free new possible stable and repeatable sources of information about hidden knowledge and constraints; 4. various contingent ways
2 In this way the cognitive task is achieved on external representations used in lieu of internal ones. Here action performs an epistemic and not merely performatory role, relevant to abductive reasoning.
of epistemic acting: looking from different perspectives, checking the different information available, comparing subsequent events, choosing, discarding, imagining further manipulations, re-ordering and changing relationships in the world by implicitly evaluating the usefulness of a new order. The whole activity of manipulation is devoted to building various external epistemic mediators3 that function as an enormous new source of information and knowledge. Therefore, manipulative abduction represents a kind of redistribution of the epistemic and cognitive effort needed to manage objects and information that cannot be immediately represented or found internally. From the point of view of everyday situations, manipulative abductive reasoning and epistemic mediators exhibit very interesting features (the first three of which can be found in geometrical constructions): 1. action elaborates a simplification of the reasoning task and a redistribution of effort across time (see [9]); 2. action can be useful in the presence of incomplete or inconsistent information; 3. action enables us to build external artifactual models of task mechanisms, instead of the corresponding internal ones, that are adequate to adapt the environment to the agent's needs; 4. action as a control of sense data illustrates how we can change the position of our body (and/or of the external objects) and how we can exploit various kinds of prostheses (Galileo's telescope, technological instruments and interfaces) to obtain various new kinds of stimulation. The external artifactual models are endowed with functional properties as components of a memory system crossing the boundary between person and environment. The cognitive process is distributed between a person (or a group of people) and external representation(s), and is therefore embedded and situated in a society and in a historical culture. An interesting epistemological situation we have recently studied is the one concerning the role of some special epistemic mediators in the field of non-standard analysis. We maintain that in mathematics diagrams play various roles in a typically abductive way. Two of them are central:
– they provide an intuitive and mathematical explanation that helps the understanding of concepts that are difficult to grasp, that appear obscure and/or epistemologically unjustified, or that are not expressible from an intuitive point of view;
– they help create new, previously unknown concepts.
In the construction of mathematical concepts many external representations are exploited, both in terms of diagrams and of symbols. In our research we are interested in diagrams which play an optical role (see [16] and [3]). Optical diagrams play a fundamental explanatory (and didactic) role in removing obstacles and obscurities and in enhancing mathematical knowledge of critical situations. They facilitate new internal representations and new symbolic-propositional achievements.
3 This expression is derived from the cognitive anthropologist Hutchins (see [9]), who coined the expression "mediating structure" to refer to various external tools that can be built to cognitively help the activity of navigating in modern but also in "primitive" settings.
In the example we have studied in the area of the calculus, the extraordinary role of optical diagrams in the interplay between standard and non-standard analysis is emphasized. In the case of our non-standard analysis examples, some new diagrams (microscopes within microscopes) provide new mental representations of the concept of the tangent line in infinitesimally small regions. Hence, external representations which play an "optical" role can be used to give us a better understanding of many critical mathematical situations and, in some cases, to discover (or rediscover) sophisticated properties more easily. The role of an "optical microscope" that shows the behavior of a tangent line is illuminating (see Figure 1 and, for details, [16]).
Fig. 1. An optical diagram shows an infinitesimal neighborhood of the graph of a real function (the magnified view marks the increments ∆x and ∆y and the differential dy).
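The idea the optical diagram makes visible can also be checked numerically: near a point x0, the increment ∆y = f(x0 + ∆x) − f(x0) becomes indistinguishable from the differential dy = f′(x0)∆x as ∆x shrinks, since their gap vanishes faster than ∆x itself. The following is a minimal sketch; the function f and the point x0 are arbitrary illustrative choices, not taken from the paper.

```python
# Sketch: why, under an "optical microscope", the graph of a differentiable
# function looks like its tangent line. f and x0 are illustrative choices.
import math

f = math.sin
fprime = math.cos
x0 = 1.0

for dx in (1e-1, 1e-3, 1e-5):
    delta_y = f(x0 + dx) - f(x0)   # the increment Delta-y
    dy = fprime(x0) * dx           # the differential dy = f'(x0) * Delta-x
    # (Delta-y - dy) / Delta-x tends to 0, so at infinitesimal scale the
    # curve and the tangent line coincide in the magnified view.
    print(dx, abs(delta_y - dy) / dx)
```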
Some diagrams could also play an unveiling role, shedding new light on mathematical structures: it can be hypothesized that such diagrams can lead to further interesting creative results. The optical and unveiling diagrammatic representation of mathematical structures activates direct perceptual operations (for example, identifying how a real function behaves at its points and/or at infinity, and how it actually approaches its limits).
4 Conclusion
It is clear that the manipulation of external objects helps human beings in knowledge extraction and in their creative tasks. We have illustrated the strategic role played by the traditional concept of "implicit knowledge" in terms of the recent cognitive and epistemological concept of manipulative abduction, considered as a particular kind of abduction that exploits external models endowed with delegated cognitive roles and attributes. Abductive manipulations operate on models that are external, and the strategy that organizes the manipulations is unknown a priori. In the case of creative manipulations, of course, the result achieved is also new and adds properties not contained before. Further analysis of these important human skills can increase our knowledge of inferences involving creative, analogical, spatial, and simulative aspects, both in science and in everyday situations.
References
1. Kuhn, T.: The Structure of Scientific Revolutions. University of Chicago Press, Chicago (1962)
2. Magnani, L.: Abduction, Reason, and Science. Processes of Discovery and Explanation. Kluwer Academic/Plenum Publishers, New York (2001)
3. Magnani, L.: Epistemic mediators and model-based discovery in science. In Magnani, L., Nersessian, N., eds.: Model-Based Reasoning: Science, Technology, Values, New York, Kluwer Academic/Plenum Publishers (2002) 305–329
4. Flach, P., Kakas, A., eds.: Abductive and Inductive Reasoning: Essays on Their Relation and Integration. Kluwer Academic Publishers, Dordrecht (2000)
5. Peirce, C.: Collected Papers. Harvard University Press, Cambridge, MA (1931–58). Vols. 1–6 ed. by C. Hartshorne and P. Weiss, vols. 7–8 ed. by A.W. Burks
6. Thagard, P.: Conceptual Revolutions. Princeton University Press, Princeton (1992)
7. Nersessian, N.: Model-based reasoning in conceptual change. In Nersessian, N., Magnani, L., Thagard, P., eds.: Model-based Reasoning in Scientific Discovery, New York, Kluwer Academic/Plenum Publishers (1999) 5–22
8. Polanyi, M.: The Tacit Dimension. Routledge & Kegan Paul, London (1966)
9. Hutchins, E.: Cognition in the Wild. MIT Press, Cambridge, MA (1995)
10. Nersessian, N.: How do scientists think? Capturing the dynamics of conceptual change in science. In Giere, R., ed.: Cognitive Models of Science. Minnesota Studies in the Philosophy of Science, Minneapolis, University of Minnesota Press (1992) 3–44
11. Gooding, D.: Experiment and the Making of Meaning. Kluwer, Dordrecht (1990)
12. Magnani, L.: Thinking through doing, external representations in abductive reasoning. In: AISB 2002 Symposium on AI and Creativity in Arts and Science, London, Imperial College (2002)
13. Port, R., van Gelder, T., eds.: Mind as Motion. Explorations in the Dynamics of Cognition. MIT Press, Cambridge, MA (1995)
14. Magnani, L., Piazza, M.: Morphodynamical abduction: causation by attractors dynamics of explanatory hypotheses in science (2002). Forthcoming in Foundations of Science
15. Norman, D.: Things that Make Us Smart. Defending Human Attributes in the Age of the Machine. Addison-Wesley, Reading, MA (1993)
16. Magnani, L., Dossena, R.: Perceiving the infinite and the infinitesimal world: unveiling and optical diagrams and the construction of mathematical concepts (2002). Forthcoming in Foundations of Science
Discovery Process on the WWW: Analysis Based on a Theory of Scientific Discovery Hitomi Saito and Kazuhisa Miwa Graduate School of Human Informatics, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601 Japan {hitomi,miwa}@cog.human.nagoya-u.ac.jp
Abstract. We investigated the cognitive processes of seeking information on the WWW and the effects of subjects' knowledge and experience on their processes and performance through protocol analysis. We analyzed the protocol data on the basis of the two space model of scientific discovery. The information seeking process on the WWW was considered as a process of searching two kinds of problem spaces: the keyword space and the web space. The experimental results showed that (1) due to the effects of knowledge and experience, the experts' performance in finding a target was higher than that of the novices, (2) when searching each space, the experts searched each space in detail but the novices did not, and (3) when shifting from one space to the other, the experts searched the two spaces alternately, whereas the novices tended to cling to a search of one of the two spaces.
1 Introduction
Recently, with the growth of the WWW [1], various studies have been made to understand information seeking on the WWW. The cognitive approach investigates users' information seeking behavior on the WWW based on theories and models suggested by studies of traditional information systems [2]. However, Jansen et al. [3] argued for the necessity of further studies independent of traditional information systems. Additionally, many studies have confirmed that a user's knowledge and experience affect information seeking behavior [4]. However, the effects of knowledge and experience on information seeking on the WWW have not been confirmed. In this paper, we have two research goals: first, we regard information seeking behavior on the WWW as a problem solving activity and apply a theory of problem solving to information seeking on the WWW in order to explain users' seeking processes; second, we investigate the effects of knowledge and experience on information seeking on the WWW.
2 The Two Space Model of Scientific Discovery
In this study, we utilize the two space model suggested by Simon and Lea for understanding information seeking on the WWW [5]. This model has been widely used in studies of the scientific discovery process [6,7]. This model consists
Fig. 1. The Two Space Model of Information Seeking on the WWW
of a search of two kinds of space: the hypothesis space and the experiment space. In the activity of scientific discovery, people generate a hypothesis, searching the hypothesis space to discover a theory that accounts for the target phenomena. They then search the experiment space to conduct experiments that test the generated hypothesis. Next, we explain the two space model of information seeking on the WWW. The discovery process of seeking information on the WWW has two phases: one in which keywords and search queries are considered, and the other in which information on the WWW, such as search results and web pages, is searched. We define the former as a search in the keyword space and the latter as a search in the web space (Figure 1). The web space is divided into search in the result-of-search space and search in the web-page space. The relation between the keyword space and the web space when seeking information on the WWW corresponds to the relation between the hypothesis space and the experiment space.
3 Experiments
Two experiments were conducted. Experiment 1 was performed as a protocol experiment. Experiment 2 was performed as an additional experiment to confirm the results of Experiment 1. In Experiment 1, twenty graduate students participated in the pre-test. The pre-test included three questionnaires about daily usage of the WWW, information seeking style on the WWW, and knowledge about search engines. We divided the twenty students into a higher group and a lower group based on their average scores in the pre-test. We then designated the five students with the highest scores in the higher group as Experts and the five students with the lowest scores in the lower group as Novices. The subjects were given two search tasks, one general and the other specific. The general task concerned the customs of a traditional South Korean wedding; its solution does not require special knowledge. The specific task concerned a plankton with a peculiar ecology; its solution requires special knowledge. Experiment 2 was conducted as a group experiment. Nineteen freshmen participated in the experiment as part of a university class. After the pre-test and the instructions, the subjects were asked to solve a search task, namely the general task used in Experiment 1.
4 Performance
We integrated the results of Experiment 1 and Experiment 2 and compared the Experts' performance with the Novices' according to success or failure in finding the target. The subjects in Experiment 2 were divided into two Experts and seventeen Novices based on their average scores in the three-part pre-test of Experiment 1. In total, then, there were seven Experts and twenty-two Novices. The result shows that five Experts and two Novices discovered the target. We found a significant difference between the performance of Experts and Novices using the exact probability test (two-tailed, p < .01). We confirmed that the Experts' performance was higher than that of the Novices. This result shows that searchers' knowledge and experience of the WWW influence their performance.
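The reported comparison can be reproduced with an exact test on the 2×2 success table. A minimal sketch follows, assuming SciPy is available; the counts are those stated above (five of seven Experts and two of twenty-two Novices found the target).

```python
# Sketch: two-tailed exact probability (Fisher) test on the success counts
# reported in the text. Assumes SciPy is installed.
from scipy.stats import fisher_exact

#            found target   did not find it
table = [[5, 2],    # Experts  (7 subjects)
         [2, 20]]   # Novices (22 subjects)

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, two-tailed p = {p_value:.4f}")
```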
5 Describing the Subjects' Processes
We next examined subjects’ information seeking processes. Below, we explain our method for describing subjects’ verbal and behavior protocols. 5.1
5.1 Description of Verbal Protocols
Each segment of each subject’s verbalization was labeled based on the following three categories. – Prediction and Evaluation In this category, each segment is classified into a prediction (PD), an evaluation (EV), or other (OT). Prediction (PD) is speculation about a search query, the result of a search, a web page before searching and browsing it. Evaluation (EV) is estimation about a search query, the result of a search, a web page after searching and browsing it. Other segments are classified as other (OT). – Verbalization This category shows a space on which subjects verbalize; each segment is classified into the keyword space (K), the result-of-search space (R), or the web-page space (P). Verbalization about a search query is labeled “K”. Verbalization about the result of a search is labeled “R”. Verbalization about a web page is labeled “P”. – Behavior This category shows a space in which subjects behave; each segment is also classified into the keyword space (K), the result-of-search space (R), or the web-page space (P). Inputting a search query is labeled “K”. Browsing the result of a search is labeled “R”. Browsing a web page is labeled “P”. Each label of each segment is described as (x, y, z). Each variable corresponds to each of the three categories above; x corresponds to Prediction and Evaluation, y corresponds to Verbalization, and z corresponds to Behavior. The values of each variable are denoted as x = PD,EV,OT, y = K,R,T, and z = K,R,T.
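As a concrete, purely illustrative rendering of this coding scheme, one could represent each segment as such a triple and compute time-standardized counts of particular label patterns, roughly as follows; the protocol shown is hypothetical, not data from the experiments.

```python
# Sketch (hypothetical data): (x, y, z) protocol segments and a
# time-standardized count of a given label pattern.
from typing import NamedTuple

PE = {"PD", "EV", "OT"}   # Prediction / Evaluation / Other
SPACE = {"K", "R", "P"}   # keyword / result-of-search / web-page space

class Segment(NamedTuple):
    x: str  # Prediction and Evaluation label
    y: str  # space the subject verbalizes about
    z: str  # space the subject behaves in

def rate(segments, predicate, solution_time_minutes):
    """Number of segments matching `predicate`, standardized by solution time."""
    return sum(1 for s in segments if predicate(s)) / solution_time_minutes

# One subject's (made-up) protocol; solution time 12 minutes.
protocol = [Segment("PD", "K", "K"), Segment("EV", "R", "R"),
            Segment("PD", "K", "P"), Segment("OT", "P", "P")]
assert all(s.x in PE and s.y in SPACE and s.z in SPACE for s in protocol)

# Predictions about the keyword space, i.e. (PD, K, z) summed over z.
print(rate(protocol, lambda s: s.x == "PD" and s.y == "K", 12))
```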
Fig. 2. Example description by the behavior schema
5.2 Description of Behavior
Next, we describe the subjects’ information seeking processes based on a behavior schema proposed in this study. In human problem solving studies, many schemas have been proposed to describe problem solving behavior. The Problem Behavior Graph (PBG), proposed by Newell and Simon in 1972, is well known as one of the most fundamental schema [8]. We constructed our behavior schema based on PBG. Figure 2 shows an example description of E2’s searching behavior based on our behavior schema. In this schema, each subject’s behavior is described as a transition through three search spaces. Two spaces correspond to the keyword space and the web space in Figure 1. The web space is composed of the result-of-search space and the web-page space. Nodes and operators describe searching each space. A node represents a subject’s behavioral state. And an operator shows operation to the node. From the result of categorization of all of the subjects’ operations, we define the six operators.
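A rough way to picture such a description (a sketch only, not the authors' implementation; the operator names below are placeholders, since the six operators are not listed in this summary) is to record each step as a node in one of the three spaces together with the operator applied, and then read off the transitions between spaces:

```python
# Sketch (hypothetical trace): behavior-graph-like steps of one subject.
# Spaces: K (keyword), R (result-of-search), P (web-page).
# Operator names are placeholders, not the paper's six operators.
steps = [("K", "input-query"), ("R", "browse-results"), ("P", "browse-page"),
         ("K", "input-query"), ("R", "browse-results"), ("R", "next-results-page")]

# Shifts between spaces, the kind of alternation discussed in Section 6.3.
shifts = [(a[0], b[0]) for a, b in zip(steps, steps[1:]) if a[0] != b[0]]
print(shifts)        # [('K', 'R'), ('R', 'P'), ('P', 'K'), ('K', 'R')]
print(len(shifts))   # number of space shifts in this trace
```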
6 Analysis of Search Process
We described subjects’ verbalizations and behaviors based on the descriptive method mentioned above. We examined how the subjects searched the keyword space and the web space (the result-of-search space and the web-page space) based on the description. The summaries of the results of our analysis are shown in Table 1. Only one subject, E1, used two or more browsers to seek the target whereas the other subjects always used one browser. This difference seemed to have a significant effect on the result of the analysis. For this reason, E1 was removed from the following analysis.
Table 1. The Result of Subjects’ Search of Individual Space
6.1 Search of Individual Space
The Keyword Space. We analyzed searches in the keyword space from the following two perspectives. Kinds of Search Queries and Keywords. We regard subjects who use many kinds of search queries and keywords as searching the keyword space widely, and therefore counted the number of kinds of search queries and keywords used. These numbers were standardized by each subject's solution time, because the solution times differed across subjects. We found no significant differences in the number of kinds of search queries and keywords between Experts and Novices (see Table 1, columns (a) and (b)). Prediction and Evaluation about the Keyword Space. We examined prediction and evaluation about the keyword space based on the subjects' verbalizations. The verbalizations classified as predictions in the keyword space are described as Σ_{z∈{K,R,P}} (PD, K, z); that is, verbalizations whose label in the Prediction and Evaluation category is "PD", whose label in the Verbalization category is "K", and whose label in the Behavior category is "K", "R", or "P" were counted as predictions. Similarly, the verbalizations classified as evaluations in the keyword space are described as Σ_{z∈{K,R,P}} (EV, K, z). The number of each kind of verbalization was standardized by each subject's solution time. The result shows that the Experts performed prediction more actively than the Novices (U=3, p<.1; see Table 1, column (c)). The Result-of-Search Space. The web space is divided into the result-of-search space and the web-page space. We analyzed searches in the result-of-search space from three perspectives.
Average Number of Search Results Pages Browsed and Number of Web Pages Selected. Browsing a larger number of search results pages is regarded as searching the result-of-search space more deeply. Therefore, we counted the average number of search results pages browsed from a single search. We found that the Experts browsed fewer search results pages than the Novices (U=3, p<.1; see Table 1, column (e)). Next, we counted the number of web pages selected from a single search results page. Selecting a larger number of web pages is regarded as searching the result-of-search space more efficiently. The Experts selected more web pages from search results than the Novices (U=2, p<.1; see Table 1, column (f)). These two results show that the Experts selected more web pages from fewer search results than the Novices. Prediction and Evaluation about the Result-of-Search Space. We examined prediction and evaluation in the result-of-search space based on the subjects' verbalizations. The verbalizations classified as predictions in the result-of-search space are described as Σ_{z∈{K,R,P}} (PD, R, z), and those classified as evaluations as Σ_{z∈{K,R,P}} (EV, R, z). The number of each kind of verbalization was standardized by each subject's solution time. We found no significant differences in prediction and evaluation between Experts and Novices (see Table 1, columns (g) and (h)). The Web-Page Space. We analyzed searches in the web-page space from the following two perspectives. Verbalizations about Keywords in the Web-Page Space. Verbalizations about keywords in the web-page space represent feedback from the web space to the keyword space, and are considered an important criterion for determining whether the subjects interactively searched multiple spaces. Therefore, we counted the number of verbalizations about keywords in the web-page space, described as Σ_{x∈{PD,EV,OT}} (x, K, P). This number was standardized by each subject's solution time. The Experts mentioned keywords in the web-page space much more often than the Novices (U=0, p<.05; see Table 1, column (i)). This result shows that the Experts were more likely to search the two spaces interactively, utilizing information obtained in each of the two search spaces. Prediction and Evaluation about the Web-Page Space. We examined prediction and evaluation in the web-page space based on the subjects' verbalizations. The verbalizations classified as predictions in the web-page space are described as Σ_{z∈{K,R,P}} (PD, P, z), and those classified as evaluations as Σ_{z∈{K,R,P}} (EV, P, z). The number of each kind of verbalization was standardized by each subject's solution time. The Experts performed prediction more actively than the Novices (U=3, p<.1; see Table 1, column (j)).
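The Expert/Novice comparisons above use the Mann-Whitney U test on time-standardized counts. A minimal sketch of such a comparison follows; the per-subject rates are invented for illustration and are not the paper's data, so the resulting U and p are not those reported in Table 1.

```python
# Sketch (hypothetical per-subject rates): Mann-Whitney U comparison of
# Experts (E1 excluded, so 4 subjects) and Novices (5 subjects). Assumes SciPy.
from scipy.stats import mannwhitneyu

experts = [0.9, 1.1, 0.8, 1.3]        # e.g. keyword-space predictions per minute
novices = [0.2, 0.5, 0.4, 0.3, 0.6]

u_stat, p_value = mannwhitneyu(experts, novices, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```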
6.2 Integrative Comparison
So far, we have investigated separately the search in the keyword space and the web space. Next, we compared the information seeking processes of Experts (except for E1) and Novices by integrating both results. Table 2 shows the outline of
Table 2. Integrative Comparison between Experts and Novices
the analysis that we conducted in 6.1. We categorized each subject's search into three types: good search, fair search, and poor search. Each search was labeled based on the ranking of each subject's search in each category; specifically, a subject whose search ranked first through third was labeled as performing a good search, fourth through sixth a fair search, and seventh through ninth a poor search. Table 2 shows that the Experts performed a good search in at least one factor in every space, whereas most Novices did not. These results confirm that the Experts searched every space in detail.
6.3 Transitions among Spaces
Next, we compared Experts’ search with Novices’ search qualitatively based on the subjects’ behavior described by the behavior schema. Figure 3 shows three representative subjects’ behaviors, each of which reflects the differences in the Experts’ and Novices’ search. These processes are described by the behavior schema explained in 5.2 (cf. Figure 2). Subject E2’s process, as representative of Experts’ processes, shows that the Experts moved from one space to the other alternately and that the balance of searching each space is well coordinated. By contrast, the Novices’ processes tended to cling to a search of one of three spaces. Subject N3’s horizontally expanded process indicates that N3 searched the web-page space extremely deeply. Whereas N3 clung to the web-page space, N4 hardly searched the web-page space. N4 repeatedly shuttled between search in the keyword space and search in the result-of-search space.
7 Summary and Conclusion
In this study, we regarded information seeking behavior on the WWW as a problem solving activity and analyzed the behavior in detail by applying a theory of problem solving to the analysis of information seeking on the WWW. We also investigated the effects of the subjects' knowledge and experience on their processes and performance. The experimental results showed that the subjects' knowledge and experience affected both their performance and their processes. The results of our analysis imply that the Experts' strategy of searching multiple spaces alternately has a positive effect on their performance.
Fig. 3. Three Representative Processes of Experts and Novices
References
1. Cyveilance. Sizing the internet. 2000. Retrieved from on 2002-01-22.
2. Chun Wei Choo, Brian Detlor, and Don Turnbull. Information seeking on the web: an integrated model of browsing and searching. In Proceedings of the 61st Annual Meeting of the American Society for Information Science, pages 3–16, 1998.
3. Bernard J. Jansen and Udo Pooch. A review of web searching studies and a framework for future research. Journal of the American Society for Information Science & Technology, 52(3):235–246, 2001.
4. A. G. Sutcliffe, M. Ennis, and S. J. Watkinson. Empirical studies of end-user information searching. Journal of the American Society for Information Science & Technology, 51(13):1221–1231, 2000.
5. Herbert A. Simon and G. Lea. Problem solving and rule induction: A unified view. In Lee W. Gregg, editor, Knowledge and Cognition, pages 105–128. Lawrence Erlbaum, Potomac, MD, 1974.
6. Deepak Kulkarni and Herbert A. Simon. The processes of scientific discovery: The strategy of experimentation. Cognitive Science, 12(2):139–175, 1988.
7. David Klahr. Exploring Science: the Cognition and Development of Discovery Processes. MIT Press, London, 2000.
8. Allen Newell and Herbert A. Simon. Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ, 1972.
Invention vs. Discovery A Critical Discussion Carlotta Piscopo and Mauro Birattari IRIDIA, Université Libre de Bruxelles, Brussels, Belgium {piscopo,mbiro}@iridia.ulb.ac.be
Abstract. The paper proposes an epistemological analysis of the dichotomy discovery/invention. In particular, we argue in favor of the idea that science does not discover facts by induction but rather invents theories that are then checked against experience.
1 Introduction
Reasoning about the status of knowledge has always been integral to both science and philosophy: What is the path that leads from experience to theories? The first modern answer to this question was given in the early age of modern science: According to Francis Bacon, knowledge is obtained directly from experience. More precisely, Nature is ruled by laws and the task of the scientist is simply to discover and describe them. For Bacon, science is an inductive process. Such a position, henceforth termed discoverist, entails the idea that scientific knowledge grows linearly and cumulatively.1 Starting from the 19th century, a new trend in science deeply undermined this idea. The introduction first of non-Euclidean geometries, and then of relativity theory and quantum mechanics, dared to question theories that had been considered certain for centuries. In particular, the fact that mechanics—queen of the sciences—had to undergo a radical revision created a major shock in all scientific domains. This shock clearly also touched the domain of philosophy of science, where serious doubts were raised about the cumulative view of science and about the very idea of scientific discovery. In particular, the concerns about induction previously raised by David Hume [5] came to new life and were followed in the early 20th century by a large number of epistemological analyses that re-proposed similar issues in a more modern language. Among others, Pierre Duhem [2], in a critical analysis of Newton's mechanics, questioned its inductive nature. Further, Bertrand Russell [14], with the famous argument of the inductivist turkey, raised doubts about the reliability of a theory based on induction. Finally, Karl Popper [12,13] firmly rejected any inductivist basis for science and proposed, for the first time in a mature form, the alternative view that considers theories as conjectures.2
1 The view of science as a cumulative process was described by Thomas Kuhn [6] as opposed to his own view, in which science is a process composed of irreconcilable steps.
2 In Logik der Forschung [12] Popper maintains that scientists invent—rather than discover—laws and then check them against experience. In this respect, the title The Logic of Scientific Discovery of the English translation is particularly misleading and seems to suggest the opposite idea: indeed, the German Forschung literally means research rather than discovery.
We will call this second epistemological position inventionist, and the rest of the paper will focus on the dichotomy inventionism/discoverism. Notwithstanding the strong and authoritative criticisms raised against induction, the inductivist—and therefore discoverist—positions boldly reemerged in artificial intelligence. Starting from the first works on expert systems, the very idea was put forward that it is possible to build programs that make scientific discoveries by induction from data. The expert system BACON.1 is a milestone in machine learning and an important example of how the inductivist and discoverist idea has been implemented [8]. Further, the discoverist position emerged through several editorial events, such as a special issue of Artificial Intelligence [15] and our conference, Discovery Science. Altogether, they show the interest of the AI community in the concept of discovery and in the related inductivist view of science. It is worth noticing, however, that recently some sectors of the machine learning community seem to have definitely switched to an inventionist position.3 Our personal background and our specific research interests lead us to accept an inventionist position. Within the conference DS-2002 we intend to take the role of the devil's advocate, bringing to the fore in this community the idea that science is about invention! Clearly, we are not animated in this discussion by the mere sake of argument. On the contrary, with this extended abstract and with our presence at the poster session of DS-2002, we wish to start a fruitful and open discussion. In our opinion, this discussion could help refine our respective positions in full awareness of their philosophical and epistemological implications.
2 Discovery in Science
The discoverist idea is based on the assumption that observation alone is enough to find the laws of Nature. According to this view, an accurate collection and organization of data immediately lets the intrinsic regularities of Nature emerge. The idea that laws are already in Nature and that science is about discovering them traces back to ancient and medieval philosophy. According to such an idea, theories ultimately reflect the very structure of reality.
3 All non-parametric statistical methods, such as the bootstrap [3] and cross-validation [16], do not rest on the hypothesis that the real system under observation belongs to the model space. Indeed, if the system does not belong to the model space, the learned model cannot coincide with the system itself and therefore no discovery is possible. In such a case, the learned model can be at best an approximation of the system and remains something ontologically distinct from the latter. The learned model can therefore be considered only as a useful invention. Vladimir Vapnik [18] is even more explicit about the dichotomy invention/discovery: he directly cites Popper and the concept of falsification. In some sense, the VC-dimension, a key concept of Vapnik's theory, can be seen as a modern and mathematically rigorous version of the concept of the dimension of a theory discussed by Popper [12].
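The footnote's point can be made concrete: in cross-validation a model is retained or discarded purely on its out-of-sample predictive error, with no claim that the model space contains the system that generated the data. The following is a minimal sketch; the data-generating process and the polynomial model class are illustrative choices, not taken from the paper.

```python
# Sketch: k-fold cross-validation judges a model only by its predictions,
# not by any claim that it coincides with the "true" system.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 60)
y = np.exp(x) + 0.05 * rng.standard_normal(x.size)  # "system" outside the polynomial model space

def cv_error(degree, k=5):
    """Mean squared k-fold cross-validation error of a polynomial of given degree."""
    idx = np.arange(x.size)
    errors = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    return float(np.mean(errors))

for degree in (1, 2, 5):
    # The retained polynomial is a useful invention/approximation, not the system.
    print(degree, cv_error(degree))
```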
Francis Bacon, father of the experimental method, embodies such an idea. Bacon's picture of science rests upon the regulative idea that natural laws have to be extracted only from pure empirical data. Accordingly, he conceives in his Novum Organum [1] two distinct phases that should characterize the experimental method. First, the experimenter must put aside all theoretical anticipations, which Bacon [1] colorfully calls idola.4 Second, in the proper experimental phase, data are collected and organized in what he calls tabulae—the forerunners of modern databases. The assumption that natural laws can be extracted simply from experimental data raised the most controversial issue in the whole history of epistemology: the problem of induction. Hume [5] is the first thinker who openly and strongly maintains that empirical laws are not logically entailed by observed data, but are rather subjective conjectures originating from the habit of seeing regularities in repeated events. A century after Hume, John Stuart Mill [9] argues again in favor of the inductivist idea but, aware of Hume's argument, he justifies induction through the extra-scientific assumption that Nature is ordered by deterministic laws. The discoverist idea defines the role of science while preserving its objectivity: science is about diving into empirical data to find the laws of Nature. The price to pay is that the discoverist position has to deal with the problem of induction. The determinism of Nature postulated by Mill [9] is an attempt to solve the problem. Yet it raises other concerns because of its clear metaphysical flavor. As Whitehead [19] pointed out, the belief in the deterministic order of Nature is nothing but a reinterpretation of the medieval belief in a rational God.5
3 Invention in Science
As pointed out in Section 1, the inventionist idea emerged as a result of great historical changes. Non-Euclidean geometries together with relativity theory questioned basic concepts, such as those of space and time, that according to Newton and Kant enjoyed the property of being absolute and necessary. Besides that, relativity theory and quantum mechanics showed that Newtonian mechanics, which had been considered the true description of the universe, was just an approximation leading in some circumstances to incorrect predictions.
4 The term idolum comes from the Greek eidolon, which means image or phantom. By adopting this term, Bacon makes clear that in his view theoretical ideas are misleading and prevent one from reaching pure empirical knowledge, which alone leads to the discovery of the laws of Nature.
5 This idea is not so inconsiderate as it seems, since metaphysical ideas are behind many scientific disciplines, classical mechanics included. Newtonian mechanics rests upon the idea of an "intelligent and powerful Being" that is ultimately responsible for the order of Nature [10]. On the other hand, Leibnizian mechanics supposes that the world we experience is nothing but the one that God chose as the best among many possible others. Through the principle of least action, such an idea carries on to the Euler-Lagrange theory, to the Hamilton-Jacobi theory, and ultimately to all contemporary formulations of mechanics [7].
The scientific and epistemological crisis opened by the refutation of Newton's mechanics undermined the very idea that scientific theories are discovered and justified, once and forever, by inductive processes. As opposed to the discoverist epistemology, a different view was proposed in the early 20th century, according to which scientific theories are bold speculations put forward and maintained as "true" until they fail the test of experience. Accepting a theory as true unless proved false might seem rather weak but, as we will see presently, this appeared to be the only way of sidestepping the problem of induction as raised by Hume. Embodying the inventionist epistemology, Popper puts forward the idea that scientific theories neither come out directly from experience nor are definitively verified by it. Clearly Popper accepts that scientific theories can be suggested by observation, but he firmly denies that experience alone can logically justify a theory. Popper [13] provocatively defines induction as a "myth". According to Popper, when we observe we have an interest, a problem to solve, a viewpoint, and a theory for interpreting the world, which make us search selectively in the huge amount of data obtained from observation [13]. Popper clarifies that scientific theories "were not the digest of observation, but that they were inventions–conjectures boldly put forward for trial, to be eliminated if they clashed with observation" [13]. Such conjectures, which are neither obtained by induction nor verified definitively, can however be falsified by experience. It is on the logical aspect of the theory of knowledge that Popper focuses his attention: "The theory to be developed [...] might be described as the theory of the deductive method of testing [...] I must first make clear the distinction between the psychology of knowledge which deals with empirical facts, and the logic of knowledge which is concerned only with logical relations. [...] the work of the scientist consists in putting forward and testing theories. The initial stage, the act of conceiving or inventing a theory, seems to me neither to call for logical analysis nor to be susceptible of it. The question how it happens that a new idea occurs to a man [...] may be of great interest to empirical psychology; but it is irrelevant to the logical analysis of scientific knowledge. [...] Accordingly I shall distinguish sharply between the process of conceiving a new idea, and the methods and results of examining it logically." [12]. Popper then focuses his discussion on the logical examination of a theory, and therefore on the falsification and refutation of conjectures rather than on their origin: it has been sharply noted [4] that, in spite of the title, in Popper's Conjectures and Refutations [13] the term 'conjecture' does not even appear in the index! By introducing what he calls the asymmetry between verifiability and falsifiability, Popper explains that while a scientific theory cannot be derived from observations, it can be contradicted by them [12]. This is done by a deductive procedure called in logic modus tollens, "through which we can argue from the truth of singular statements to the falsity of universal statements" [12]. By eliminating the recourse to induction as a means to explain both how theories are obtained and how they are tested, Popper solves the problem of induction. Other scientists participated in the debate on induction. The great physicist and mathematician Henri Poincaré is particularly representative. He elaborated
an epistemological conception of science called conventionalism: no scientific theory can aspire to the status of a true representation of the world. Theories are simply useful conventions that science puts forward and uses only because they yield good predictions [11]. Scientific theories are therefore simply stipulations that the scientific community decides by agreement to assume, or eventually abandon, according to their utility.6 With the problem of induction sidestepped, another problem emerges: if theories are seen as inventions [12,13], or alternatively as conventions [11], they lose their character of objectivity. However, Popper's and Poincaré's viewpoints do not entail that theories are completely arbitrary. On the contrary, according to Popper and Poincaré theories are inter-subjective: the objectivity of scientific theories comes from the possibility of their being "inter-subjectively tested" by scientists [12]. The decision about the destiny of theories is left to the scientific community, which is in charge of testing the predictions these theories allow [11].
6 Besides predictive power, other criteria regulate the choice of a theory. Such further criteria, which are typically extra-evidential, might be for instance simplicity (Occam's razor) or conservatism.
4 Conclusion
Assuming that science is about discovering exposes one to the problem of induction. However, if experience is assumed to be the only basis on which scientific theories rest, the objectivity of science can be maintained. On the other hand, assuming the opposite view, according to which science is about inventing theories, protects against the problems related to induction, since the source and the justification of theories do not ultimately rest upon experience, but upon a decision. According to this view, theories are conjectures. In this sense, the objectivity of science cannot be maintained anymore. Nevertheless, scientific theories are not arbitrary, since they must be inter-subjectively testable, and possibly falsified. Acknowledgments. Mauro Birattari acknowledges support from the Metaheuristics Network, a Research and Training Network funded by the Commission of the European Communities under the Improving Human Potential programme, contract number HPRN-CT-1999-00106. The information provided is the sole responsibility of the authors and does not reflect the Community's opinion. The Community is not responsible for any use that might be made of data appearing in this publication.
References
[1] F. Bacon. Novum Organum. 1620. Available in: The Works, B. Montague (ed.), Parry & MacMillan, Philadelphia, PA, USA, 1854.
[2] Pierre Duhem. La Théorie Physique: son Objet et sa Structure. 1906. Available as: The Aim and Structure of Physical Theory, Princeton University Press, Princeton, NJ, USA, 1991.
[3] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, NY, USA, 1993.
[4] P. A. Flach. Conjectures: an inquiry concerning the logic of induction. PhD thesis, Katholieke Universiteit Brabant, Tilburg, The Netherlands, 1995.
[5] D. Hume. A Treatise of Human Nature. 1739. Available as: E. C. Mossner (ed.), Penguin Books, London, United Kingdom, 1986.
[6] T. Kuhn. The Structure of Scientific Revolutions. The University of Chicago Press, Chicago, IL, USA, 3rd edition, 1996 (first edition 1962).
[7] C. Lanczos. The Variational Principles of Mechanics. Dover Publications, New York, NY, USA, 1986.
[8] P. Langley, H. A. Simon, G. L. Bradshaw, and J. M. Zytkow. Scientific Discovery. Computational Explorations of the Creative Processes. MIT Press, Cambridge, MA, USA, 1987.
[9] J. S. Mill. A System of Logic. Longmans Green, London, United Kingdom, 1843.
[10] I. Newton. Mathematical Principles. 1713. Available as: F. Cajori (ed.), University of California Press, Berkeley, CA, USA, 1946.
[11] H. Poincaré. La Science et L'Hypothèse. 1903. Available as: Science and Hypothesis, Dover Publications, New York, NY, USA, 1967.
[12] K. Popper. Logik der Forschung. 1935. Available as: The Logic of Scientific Discovery, Routledge, London, United Kingdom, 1999.
[13] K. Popper. Conjectures and Refutations. Routledge and Kegan Paul, London, United Kingdom, 1963.
[14] B. Russell. The Problems of Philosophy. Williams and Norgate, London, United Kingdom, 1957.
[15] H. A. Simon, R. E. Valdés-Pérez, and D. H. Sleeman. Scientific discovery and simplicity of method. Artificial Intelligence, 91(2):177–181, 1997. Editorial of a Special Issue on Scientific Discovery.
[16] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36(1):111–147, 1974.
[17] R. S. Sutton and A. G. Barto. Reinforcement Learning. An Introduction. MIT Press, Cambridge, MA, USA, 1998.
[18] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, NY, USA, 1998.
[19] A. N. Whitehead. Science and the Modern World. Cambridge University Press, Cambridge, United Kingdom, 1926.
Author Index
Aalst, W.M.P. van der 364 Abe, Yukino 291 Arikawa, Setsuo 86 Azevedo, Paulo 414
Bannai, Hideo 86 Bao, Yongguang 340 Bay, Stephen 59 Birattari, Mauro 457 Bonardi, Alain 356 Borgelt, Christian 2 Bosch, Antal van den 364 Brazdil, Pavel 141 Chawathe, Sudarshan S. 71 Choi, Kijoon 302 Cristianini, Nello 12 Danks, David 178 Devaney, Judith E. 47 Dossena, Riccardo 441 Elmaoglou, Areti 348 Elomaa, Tapio 127 Fang, Xiaoshan 297 Ferri, C. 165 Flach, Peter A. 141 Frank, Eibe 153 Fu, Zhiwei 390 Ghazizadeh, Shayan 71
Hagedorn, John G. 47 Hall, Mark 153 Halova, Jaroslava 291 Haraguchi, Makoto 324 Hayashi, Susumu 1 Hernández-Orallo, J. 165 Hilario, Melanie 113 Hirokawa, Sachio 332 Hirota, Minoru 291 Höppner, Frank 398 Holeňa, Martin 192 Hollmén, Jaakko 259 Holmes, Geoffrey 153
Hoon, Michiel de 267 Hu, Zhenjiang 406 Hwang, Youngho 302 Ichise, Ryutaro 247 Ikeda, Daisuke 332 Imoto, Seiya 267 Inenaga, Shunsuke 86 Ishii, Naohiro 340 Jorge, Alipio 414
Kandola, Jaz 12 Kim, Minkoo 302 Kim, Pankoo 302 Kirkby, Richard 153 Kitamoto, Asanobu 283 Kontos, John 348 Krogel, Mark-A. 430 Kruse, Rudolf 2 Langley, Pat 59, 247 Lartillot, Olivier 382 Li, Fang 310 Lindgren, J.T. 127 Magnani, Lorenzo 441 Malagardi, Ioanna 348 Maruster, Laura 364 Maruyama, Osamu 220 Masotti, Daniele 275 Matsuda, Takashi 422 Miwa, Kazuhisa 449 Miyano, Satoru 220, 267 Moon, Yoo-Jin 302 Motoda, Hiroshi 422 Nakamura, Makoto 374 Nakano, Ryohei 206 Nakano, Shigetora 324 Narahashi, Masaki 435 Okada, Takashi 233
Palopoli, Luigi 34 Park, Sung-Yong 316 Peng, Yonghong 141
Piazza, Matteo 441 Piscopo, Carlotta 457 Poças, João 414 Ramírez-Quintana, M.J. 165 Rousseaux, Francis 356 Ruosaari, Salla 259 Saito, Hitomi 449 Saito, Kazumi 59, 206 Sakakibara, Kazuhisa 291 Shapiro, Daniel 247 Shawe-Taylor, John 12 Sheng, Huanye 297, 310 Shinohara, Ayumi 86 Shoudai, Takayoshi 220 Soares, Carlos 141 Stopka, Pavel 291 Suezawa, Hiroko 291 Suzuki, Einoshin 435 Takeda, Masayuki 86
Takeichi, Masato 406 Terracina, Giorgio 34 Ting, Kai Ming 98 Tojo, Satoshi 374 Washio, Takashi 422 Weijters, A.J.M.M. 364 Widmer, Gerhard 13 Williams, Chris 12 Witten, Ian H. 33 Wrobel, Stefan 430 Yamada, Yasuhiro 332 Yang, Jihoon 316 Yoshida, Tetsuya 422 Yoshioka, Masaharu 324 Yuzuri, Tomoaki 291 Zak, Premysl 291 Zhang, Dongmo 310 Zhao, Haiyan 406