Lecture Notes in Artificial Intelligence Edited by J. G. Carbonell and J. Siekmann
Subseries of Lecture Notes in Computer Science
2835
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Tamás Horváth Akihiro Yamamoto (Eds.)
Inductive Logic Programming 13th International Conference, ILP 2003 Szeged, Hungary, September 29 – October 1, 2003 Proceedings
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors
Tamás Horváth
Fraunhofer Institute for Autonomous Intelligent Systems (AIS)
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
E-mail: [email protected]

Akihiro Yamamoto
Hokkaido University, MemeMedia Laboratory
N 13 W 8, Kita-ku, Sapporo 060-8628, Japan
E-mail: [email protected]
Cataloging-in-Publication Data applied for A catalog record for this book is available from the Library of Congress Bibliographic information published by Die Deutsche Bibliothek Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliographie; detailed bibliographic data is available in the Internet at
.
CR Subject Classification (1998): I.2.3, I.2.6, I.2, D.1.6, F.4.1 ISSN 0302-9743 ISBN 3-540-20144-0 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by Christian Grosche, Hamburg Printed on acid-free paper SPIN: 10956656 06/3142 543210
Preface

The 13th International Conference on Inductive Logic Programming (ILP 2003), organized by the Department of Informatics at the University of Szeged, was held between September 29 and October 1, 2003 in Szeged, Hungary. ILP 2003 was co-located with the Kalmár Workshop on Logic and Computer Science devoted to the work of László Kalmár and to recent results in logic and computer science. This volume contains all full papers presented at ILP 2003, together with the abstracts of the invited lectures by Ross D. King (University of Wales, Aberystwyth) and John W. Lloyd (Australian National University, Canberra).

The ILP conference series, started in 1991, was originally designed to provide an international forum for the presentation and discussion of the latest research results in all areas of learning logic programs. In recent years the scope of ILP has been broadened to cover theoretical, algorithmic, empirical, and applicational aspects of learning in non-propositional logic, multi-relational learning and data mining, and learning from structured and semi-structured data.

The program committee received altogether 58 submissions in response to the call for papers, of which 5 were withdrawn by the authors themselves. Out of the remaining 53 submissions, the program committee selected 23 papers for full presentation at ILP 2003. High reviewing standards were applied for the selection of the papers. For the first time, the “Machine Learning” journal awarded the best student papers. The awards were presented to Marta Arias for her theoretical paper with Roni Khardon: Complexity Parameters for First-Order Classes, and to Kurt Driessens and Thomas Gärtner for their joint algorithmic paper with Jan Ramon: Graph Kernels and Gaussian Processes for Relational Reinforcement Learning.

We are grateful to all those who made ILP 2003 a success. In particular, we would like to thank the authors who submitted papers to ILP 2003, the program committee members and referees for their prompt and thorough work, and the invited speakers for their excellent lectures. Many thanks to the local chair Tibor Gyimóthy and his team Zoltán Alexin, Dóra Csendes, and Tünde Köles for their outstanding organization of the event. We are also grateful to the Kalmár Workshop’s chair Ferenc Gécseg, co-chairs János Csirik and György Turán, and to Balázs Szörényi from the organizing committee for the good cooperation. We also thank the editors of LNCS/LNAI, in particular Alfred Hofmann and Erika Siebert-Cole, for their help in publishing this volume. Last but not least, we gratefully acknowledge the financial support provided by the Department of Informatics at the University of Szeged, by KDNet – The European Knowledge Discovery Network of Excellence, by the “Machine Learning” journal of Kluwer Academic Publishers, and by the Ministry of Education of Hungary.

July 2003
Tamás Horváth and Akihiro Yamamoto
Organization
Executive Committee

Program Chairs:
  Tamás Horváth (University of Bonn and Fraunhofer Institute for Autonomous Intelligent Systems, Germany)
  Akihiro Yamamoto (Hokkaido University, Japan)
Organizing Chair:
  Tibor Gyimóthy (University of Szeged, Hungary)
Organizing Committee:
  Zoltán Alexin (University of Szeged, Hungary)
  Dóra Csendes (University of Szeged, Hungary)
  Tünde Köles (University of Szeged, Hungary)

Program Committee

H. Blockeel (Belgium), J.F. Boulicaut (France), I. Bratko (Slovenia), J. Cussens (UK), L. De Raedt (Germany), S. Džeroski (Slovenia), P. Flach (UK), L. Getoor (USA), L. Holder (USA), K. Inoue (Japan), R. Khardon (USA), J.-U. Kietz (Switzerland), S. Kramer (Germany), N. Lachiche (France), N. Lavrač (Slovenia), J. Lloyd (Australia), E. Martin (Australia), S. Matwin (Canada), T. Miyahara (Japan), S. Muggleton (UK), T. Ozaki (Japan), D. Page (USA), D. Roth (USA), C. Rouveirol (France), L. Saitta (Italy), M. Sebag (France), A. Sharma (Australia), A. Srinivasan (UK), P. Tadepalli (USA), G. Turán (USA, Hungary), S. Wrobel (Germany), G. Zaverucha (Brazil), J.-D. Zucker (France)

Additional Referees

R. Braz, Y. Chevaleyre, V.S. Costa, C. Cumby, K. Driessens, I.C. Dutra, T. Erjavec, T. Gärtner, S. Hoche, S. Igarashi, I. Jónyer, K. Kersting, C. Sakama, J. Struyf, C. Vens
Sponsoring Institutions

Department of Informatics, University of Szeged
KDNet – The European Knowledge Discovery Network of Excellence
“Machine Learning” journal of Kluwer Academic Publishers
Ministry of Education of Hungary
Table of Contents

Invited Papers

A Personal View of How Best to Apply ILP
  Ross D. King
Agents that Reason and Learn
  John W. Lloyd

Research Papers

Mining Model Trees: A Multi-relational Approach
  Annalisa Appice, Michelangelo Ceci, Donato Malerba
Complexity Parameters for First-Order Classes
  Marta Arias, Roni Khardon
A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments
  Anna Atramentov, Hector Leiva, Vasant Honavar
Applying Theory Revision to the Design of Distributed Databases
  Fernanda Baião, Marta Mattoso, Jude Shavlik, Gerson Zaverucha
Disjunctive Learning with a Soft-Clustering Method
  Guillaume Cleuziou, Lionel Martin, Christel Vrain
ILP for Mathematical Discovery
  Simon Colton, Stephen Muggleton
An Exhaustive Matching Procedure for the Improvement of Learning Efficiency
  Nicola Di Mauro, Teresa Maria Altomare Basile, Stefano Ferilli, Floriana Esposito, Nicola Fanizzi
Efficient Data Structures for Inductive Logic Programming
  Nuno Fonseca, Ricardo Rocha, Rui Camacho, Fernando Silva
Graph Kernels and Gaussian Processes for Relational Reinforcement Learning
  Thomas Gärtner, Kurt Driessens, Jan Ramon
On Condensation of a Clause
  Kouichi Hirata
A Comparative Evaluation of Feature Set Evolution Strategies for Multi-relational Boosting
  Susanne Hoche, Stefan Wrobel
Comparative Evaluation of Approaches to Propositionalization
  Mark-A. Krogel, Simon Rawles, Filip Železný, Peter A. Flach, Nada Lavrač, Stefan Wrobel
Ideal Refinement of Descriptions in AL-Log
  Francesca A. Lisi, Donato Malerba
Which First-Order Logic Clauses Can Be Learned Using Genetic Algorithms?
  Flaviu Adrian Mărginean
Improved Distances for Structured Data
  Dimitrios Mavroeidis, Peter A. Flach
Induction of Enzyme Classes from Biological Databases
  Stephen Muggleton, Alireza Tamaddoni-Nezhad, Hiroaki Watanabe
Estimating Maximum Likelihood Parameters for Stochastic Context-Free Graph Grammars
  Tim Oates, Shailesh Doshi, Fang Huang
Induction of the Effects of Actions by Monotonic Methods
  Ramon P. Otero
Hybrid Abductive Inductive Learning: A Generalisation of Progol
  Oliver Ray, Krysia Broda, Alessandra Russo
Query Optimization in Inductive Logic Programming by Reordering Literals
  Jan Struyf, Hendrik Blockeel
Efficient Learning of Unlabeled Term Trees with Contractible Variables from Positive Data
  Yusuke Suzuki, Takayoshi Shoudai, Satoshi Matsumoto, Tomoyuki Uchida
Relational IBL in Music with a New Structural Similarity Measure
  Asmir Tobudic, Gerhard Widmer
An Effective Grammar-Based Compression Algorithm for Tree Structured Data
  Kazunori Yamagata, Tomoyuki Uchida, Takayoshi Shoudai, Yasuaki Nakamura

Author Index
A Personal View of How Best to Apply ILP

Ross D. King

Department of Computer Science, University of Wales, UK
[email protected]
The ancient Celts believed that there were three objects of intellect: the true, the beautiful, and the beneficial. Like all good ideas, ILP combines aspects of all three of these objects. In this talk I focus on how to obtain the benefits of ILP. I describe applications of ILP to drug design, toxicology, protein function prediction, chemical pathway discovery, and automatic scientific discovery. The use of ILP enabled these problems to be computationally represented in a more compact and natural way than would be possible propositionally. ILP also helped to generate rules that domain experts can understand, and significant scientific results have been obtained. The use of ILP came at a computational cost. However, nontechnical reasons are probably the greatest barrier to the wider adoption of ILP. For example, it is difficult to explain the benefits of ILP to domain experts.
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 1–1, 2003. © Springer-Verlag Berlin Heidelberg 2003
Agents that Reason and Learn

John W. Lloyd

Research School of Information Sciences and Engineering
The Australian National University
Canberra, ACT, 0200, Australia
[email protected]
This talk will address the issue of designing architectures for agents that need to be able to adapt to changing circumstances during deployment. From a scientific point of view, the primary challenge is to design agent architectures that seamlessly integrate reasoning and learning capabilities. That this is indeed a challenge is largely due to the fact that the reasoning and knowledge representation capabilities of agents are studied in different subfields of computer science from those in which learning for agents is studied. So far there have been few attempts to integrate these two research themes. In any case, the design of agent architectures is very much an open issue with plenty of scope for new ideas.

The research to be described is being carried out in the context of the Smart Internet Technology Cooperative Research Centre [4], a substantial 7-year Australian research initiative having the overall research goal of making interactions that people have with the Internet much simpler than they are now. One of the research programs in the CRC is concerned with building Internet agents, and one project in that program is concerned with building adaptive agents, the main topic of this talk.

The first attempt in this project at an architecture involves integrating BDI agent architectures for the reasoning component and reinforcement learning for the learning component. The talk will concentrate on a particular aspect of this integration, namely, approximation of the Q-function in reinforcement learning. In seminal work on relational reinforcement learning [1,2], the TILDE decision-tree learning system was employed to approximate the Q-function in various experiments in blocks world. An extremely attractive aspect of the use of a symbolic learning system for function approximation in reinforcement learning is that the functions learned are essentially plans that can be explicitly manipulated for various purposes.

In the research to be described in this talk, the learning system used to approximate the Q-function is Alkemy, a decision-tree learning system with a foundation in higher-order logic [3]. The talk will describe the agent architecture and also progress towards building practical Internet agents. Along the way, a setting for predicate construction in higher-order logic used by Alkemy and some theoretical results concerning the efficient construction of predicates will be presented.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 2–3, 2003. © Springer-Verlag Berlin Heidelberg 2003
References

1. S. Džeroski, L. De Raedt, and H. Blockeel. Relational reinforcement learning. In Proceedings of the 15th International Conference on Machine Learning, ICML’98, pages 136–143. Morgan Kaufmann, 1998.
2. S. Džeroski, L. De Raedt, and K. Driessens. Relational reinforcement learning. Machine Learning, 43:7–52, 2001.
3. J.W. Lloyd. Logic for Learning. Cognitive Technologies. Springer, 2003.
4. Home page of the Smart Internet Technology Cooperative Research Centre. http://www.smartinternet.com.au/.
Mining Model Trees: A Multi-relational Approach

Annalisa Appice, Michelangelo Ceci, and Donato Malerba

Dipartimento di Informatica, Università degli Studi
Via Orabona, 4 - 70126 Bari, Italy
{appice,ceci,malerba}@di.uniba.it
Abstract. In many data mining tools that support regression tasks, training data are stored in a single table containing both the target field (dependent variable) and the attributes (independent variables). Generally, only intra-tuple relationships between the attributes and the target field are found, while inter-tuple relationships are not considered and (inter-table) relationships between several tuples of distinct tables are not even explorable. Disregarding inter-table relationships can be a severe limitation in many real-world applications that involve the prediction of numerical values from data that are naturally organized in a relational model involving several tables (multi-relational model). In this paper, we present a new data mining algorithm, named Mr-SMOTI, which induces model trees from a multi-relational model. A model tree is a tree-structured prediction model whose leaves are associated with multiple linear regression models. The particular feature of Mr-SMOTI is that internal nodes of the induced model tree can be of two types: regression nodes, which add a variable to some multiple linear models according to a stepwise strategy, and split nodes, which perform tests on attributes or the join condition and eventually partition the training set. The induced model tree is a multi-relational pattern that can be represented by means of selection graphs, which can be translated into SQL or, equivalently, into first-order logic expressions.
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 1-21, 2003. © Springer-Verlag Berlin Heidelberg 2003

1 Introduction

Prediction is arguably considered the main goal of data mining, with the greatest potential payoff [28]. The two principal prediction problems are classification and regression. Samples of past experience with known answers (labels) are examined and generalized to future cases. For classification, labels are a finite number of unordered categories. For regression, the answer is a number. Traditionally, in a regression problem sample data are described by a set of m independent variables Xi (both numerical and categorical) and a dependent variable Y, which normally takes values in ℜ. According to the data mining terminology, the Xi's are the attributes, while Y is the target field.

Regression problems have been very well studied in statistics. In general, the model is assumed to be a linear combination of independent variables, and the coefficients of the combination are determined by the method of least squares [4]. Refinements and extensions to non-linear models are also well known and applied in many real-world applications. However, classical statistical methods have several limitations. First, (non-)linear regression models are often hard to understand. Second, all these statistical models are based on the assumption that all independent variables are equally relevant in the whole sample space. Third, the least squares method does not allow prior domain knowledge to be used in the construction of the regression model.

To solve some of these problems, regression tree methods have been developed. Regression trees [3] are supposed to be more comprehensible than classical regression models. They are built top-down by recursively partitioning the sample space. An attribute may be of varying importance for different regions of the sample space. A constant is associated with each leaf of a regression tree, so that the prediction performed by a regression tree is the same for all sample data falling in the same leaf. A generalisation of regression trees is represented by model trees, which associate multiple linear models with each leaf. Hence, different values can be predicted for sample data falling in the same leaf. Some of the model tree induction systems are M5 [23], RETIS [9], M5’ [27], HTL [26], TSIR [17], and SMOTI [18]. The last two systems are characterised by a tree structure with two types of nodes: regression nodes, which perform only straight-line regression, and splitting nodes, which partition the feature space. The multiple linear model associated with each leaf is then the composition of the straight-line regressions reported along the path from the root to the leaf. In [18] some differences between TSIR and SMOTI have been reported, the most important of which is that the method implemented in SMOTI is the only one for which the composition of straight-line regressions found along a path from the root to a leaf can be correctly interpreted as a multiple linear model built stepwise.

All model-tree induction systems reported above work on data represented by m+1 attribute-value pairs.
They input training data from a file, with the exception of SMOTI, which interfaces with a relational database and requires the data set to be stored in a single table. Therefore, all methods implemented in the current model tree induction systems are based on the single-table assumption, according to which all training samples are stored in a single table (or “relation” in database terminology), and there is one row (or “tuple”) in this table for each object of interest [29]. In other words, training data can be described by a fixed set of attributes, each of which can only have a single, primitive value.

The single-table assumption underlying this representation paradigm only allows fairly simple objects to be analysed by means of current model tree systems. More complex structured objects require a representation in a relational database containing multiple tables [12]. (Multi-)relational data mining (MRDM) algorithms and systems are capable of directly dealing with multiple tables or relations as they are found in today’s relational databases [8]. The data taken as input by MRDM systems consist of several tables and not just one table (multi-relational model). Moreover, the patterns output by these systems are relational, that is, they involve multiple relations from a relational database. They are typically stated in a more expressive language than patterns described on a single data table. For instance, subsets of first-order logic are used to express relational patterns. Considering this strong link with logic, it is not surprising that many algorithms for MRDM originate from the field of inductive logic programming (ILP) [15].

In this paper, we present a multi-relational extension of SMOTI. The new system, named Mr-SMOTI, induces relational model trees from structural data possibly described by multiple records in multiple tables. As in the case of SMOTI, induced relational model trees can contain both regression and split nodes. Differently from SMOTI, attributes involved in both types of nodes can belong to different tables of the relational database. The join of these tables is dynamically determined on the basis of the database schema and aims to involve attributes of different relations in order to build a predictive model for a target field.

In the next section we first review related work in order to clarify the innovative aspects of our approach. In Section 3 we briefly introduce the method implemented in SMOTI, which operates under the single-table assumption. In Section 4 we draw on the multi-relational regression framework, based on an extended graphical language (selection graph), to mine relational model trees directly from relational databases through SQL queries. In Section 5 we show how selection graphs can support the stepwise induction of multi-relational model trees from structural data. In Section 6 we present some experimental results. Finally, we draw some conclusions and sketch possible directions of further research.
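The distinction drawn in this introduction between regression trees (a constant at each leaf) and model trees (a linear model at each leaf) can be illustrated with a minimal Python sketch. It is not the SMOTI or Mr-SMOTI algorithm: it only shows prediction over a hand-built tree with split nodes, and the attribute names `x1` and `x2`, the threshold, and all coefficients are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Sample = Dict[str, float]


@dataclass
class Leaf:
    # Function used to predict the target field for samples reaching this leaf.
    model: Callable[[Sample], float]


@dataclass
class Split:
    # Test "attribute <= threshold": left branch if true, right branch otherwise.
    attribute: str
    threshold: float
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def predict(node: Node, x: Sample) -> float:
    """Follow split tests from the root down to a leaf, then apply the leaf model."""
    while isinstance(node, Split):
        node = node.left if x[node.attribute] <= node.threshold else node.right
    return node.model(x)


# Regression tree: every leaf holds a constant (hypothetical values).
regression_tree = Split("x1", 5.0, Leaf(lambda x: 2.0), Leaf(lambda x: 9.0))

# Model tree: every leaf holds a linear model (hypothetical coefficients).
model_tree = Split(
    "x1", 5.0,
    Leaf(lambda x: 1.0 + 0.5 * x["x1"]),
    Leaf(lambda x: 4.0 + 1.0 * x["x1"] - 0.2 * x["x2"]),
)

a = {"x1": 1.0, "x2": 0.0}
b = {"x1": 3.0, "x2": 0.0}  # falls in the same leaf as a

same_leaf_regression = (predict(regression_tree, a), predict(regression_tree, b))
same_leaf_model = (predict(model_tree, a), predict(model_tree, b))
```

With these assumed values, the two samples receive the same prediction from the regression tree but different predictions from the model tree, although they fall in the same leaf in both cases.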
2 Related Work
The problem of mining patterns (e.g. prediction models) over data that reside in multiple tables is generally solved by moulding a relational database into a single-table format on which traditional attribute-value algorithms are able to work [13]. This approach corresponds to the concept of propositionalization in machine learning and has been applied to regression tasks as well. In [7], the DINUS [15] algorithm is applied to transform a Datalog representation of a dynamic system into a propositional form (i.e., attribute-value pairs), so that a classical model tree induction system based on the single-table assumption (e.g. RETIS) can be applied. One way of performing the propositionalization is to create a single relation by deriving attributes from other joined tables. However, this produces an extremely large table with a lot of repeated data, which is difficult to handle. A different approach is the construction of a single central relation that summarises and/or aggregates information found in other tables. This approach also has drawbacks, since information about how the data were originally structured is lost. Therefore, a proper way of explicitly and efficiently dealing with multiple relations is necessary.
The idea of mining relational regression models from multiple relations is not new. In particular, the learning problem of Relational Regression [5] has been formulated in the normal ILP framework. So far two approaches to solving Relational Regression problems have been proposed in ILP. The former uses a separate-and-conquer (or sequential covering) strategy to build a set of Prolog clauses. The latter uses a divide-and-conquer strategy to induce tree-based models and then translates these models into Prolog programs. A system that follows the first approach is FORS [10], while three systems that follow the second approach are SRT [14], S-CART [8], and TILDE-RT [2].
In contrast to DINUS/RETIS, all these systems solve a Relational Regression problem in its original representation and do not require a transformation of the problem. Moreover, they can utilise relational non-determinate background knowledge. SRT generates a series of increasingly complex trees containing a literal (an atomic formula or its negation) or a conjunction of literals in each node, and subsequently returns the best tree according to a criterion based on minimum
Mining Model Trees: A Multi-relational Approach
description length. A numerical constant value is assigned to each leaf. Analogously, S-CART and TILDE-RT follow the general procedure of top-down tree induction [3]. In particular, S-CART recursively builds a binary tree, selecting a possible conjunction of one or more literals in each node as provided by user-defined schemata [25] until a stopping criterion is fulfilled. The value predicted in a node is simply the mean of the values of all examples covered by the node. The algorithm keeps track of the examples in each node and the conjunctions of literals in each path leading to the respective node. This information can be turned into a clausal theory (e.g. a set of first order regression rules).
All these approaches are mostly based on data stored as Prolog facts. Moreover, in real-world applications, where facts correspond to tuples stored in relational databases, some pre-processing is required in order to transform tuples into facts. However, much of the pre-processing, which is often expensive in terms of computation and storage, may be unnecessary, since that part of the hypothesis space may never be explored. In addition, in applications where data can change frequently, pre-processing has to be repeated frequently. This means that little attention has been given to data stored in relational databases and to how knowledge of a data model can help to guide the search process [3, 13]. A solution can be found by combining the achievements of the Knowledge Discovery in Databases (KDD) field on the integration of data mining with database systems with some results reported in the ILP field on how to correctly upgrade propositional data mining algorithms to multi-relational representations. In the next sections we show how to develop a new multi-relational data mining system by upgrading SMOTI to multi-relational representations and by tightly integrating the new system with a relational DBMS, namely Oracle® 9i.
3 Stepwise Model Tree Induction
SMOTI (Stepwise Model Tree Induction) performs the top-down induction of model trees by considering not only a partitioning procedure, but also some intermediate prediction functions [18]¹. This means that there are two types of nodes in the tree: regression nodes and splitting nodes (Fig. 1). The former compute straight-line regressions, while the latter partition the sample space.
[Figure: tree diagram whose nodes are labelled t (Y = a + bX_i), t' (X'_j < α), t'_L (n_L: Y' = c + dX'_u) and t'_R (n_R: Y' = e + fX'_v).]
Fig. 1. A model tree with both a regression node (t) and a splitting node (t’)
¹ The work reported in [18] has been substantially extended and revisited. Details of the improved algorithm implemented in the data mining system KDB2000 are reported in [19].
They pass down training data to their children in two different ways. For a splitting node t, only a subgroup of the N(t) training data in t is passed to each child, with no change to the training cases. For a regression node t, all the data are passed down to its only child, but the values of both
the dependent and independent numeric variables not included in the multiple linear model associated with t are transformed in order to remove the linear effect of those variables already included. Thus, descendants of a regression node operate on a modified training set. Indeed, according to the statistical theory of linear regression [4], the incremental construction of a multiple linear model is made by removing the linear effect of introduced variables each time a new independent variable is added to the model. For instance, let us consider the problem of building a multiple regression model with two independent variables through a sequence of straight-line regressions: $\hat{Y} = a + bX_1 + cX_2$. We start by regressing $Y$ on $X_1$, so that the model $\hat{Y} = a_1 + b_1X_1$ is built. This fitted equation does not predict $Y$ exactly. By adding the new variable $X_2$, the prediction might improve. Instead of starting from scratch and building a model with both $X_1$ and $X_2$, we can build a linear model for $X_2$ given $X_1$: $\hat{X}_2 = a_2 + b_2X_1$. Then we compute the residuals on $X_2$: $X'_2 = X_2 - (a_2 + b_2X_1)$ and on $Y$: $Y' = Y - (a_1 + b_1X_1)$. Finally, we regress $Y'$ on $X'_2$ alone: $\hat{Y}' = a_3 + b_3X'_2$. By substituting the equations of $X'_2$ and $Y'$ in the last equation we have: $\widehat{Y - (a_1 + b_1X_1)} = a_3 + b_3(X_2 - (a_2 + b_2X_1))$. Since $\widehat{Y - (a_1 + b_1X_1)} = \hat{Y} - (a_1 + b_1X_1)$, we have:
$\hat{Y} = (a_3 + a_1 - a_2b_3) + (b_1 - b_2b_3)X_1 + b_3X_2$.
It can be proven that this last model coincides with the first model built, that is, $a = a_3 + a_1 - a_2b_3$, $b = b_1 - b_2b_3$ and $c = b_3$. Therefore, when the first regression line of $Y$ on $X_1$ is built we pass down both the residuals of $Y$ and the residuals of the regression of $X_2$ on $X_1$. This means that we remove the linear effect of the variables already included in the model ($X_1$) from both the response variable ($Y$) and those variables to be selected for the next regression step ($X_2$).
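The equivalence stated above can be checked numerically. The following is a minimal sketch in plain Python, not SMOTI's implementation: the function names and the toy data are assumptions made for illustration. It builds the two-variable model through three straight-line regressions and compares the recombined coefficients with those obtained by solving the normal equations directly.

```python
def simple_regression(x, y):
    """Least-squares straight line y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def stepwise_model(x1, x2, y):
    """Build Y ~ a + b*X1 + c*X2 through three straight-line regressions."""
    a1, b1 = simple_regression(x1, y)     # Y on X1
    a2, b2 = simple_regression(x1, x2)    # X2 on X1
    y_res = [yi - (a1 + b1 * u) for u, yi in zip(x1, y)]    # Y'
    x2_res = [v - (a2 + b2 * u) for u, v in zip(x1, x2)]    # X2'
    a3, b3 = simple_regression(x2_res, y_res)               # Y' on X2'
    return a3 + a1 - a2 * b3, b1 - b2 * b3, b3


def direct_model(x1, x2, y):
    """Same model fitted at once by solving the 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (t - my) for u, t in zip(x1, y))
    s2y = sum((v - m2) * (t - my) for v, t in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b = (s22 * s1y - s12 * s2y) / det
    c = (s11 * s2y - s12 * s1y) / det
    return my - b * m1 - c * m2, b, c


# Toy data (made up): y = 1 + 2*x1 - 3*x2 plus a small fixed perturbation.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]
y = [1 + 2 * u - 3 * v + e for u, v, e in zip(x1, x2, noise)]

stepwise = stepwise_model(x1, x2, y)
direct = direct_model(x1, x2, y)
print(stepwise, direct)
```

The two coefficient triples agree up to floating-point error, which is the property that allows a regression node to pass only residuals down to its child.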
4 Regression Problem in a Multi-relational Framework
Traditional research for a regression task in KDD has focused mainly on propositional techniques involving the attribute-value paradigm. This implies that relationships between fields of one tuple can be found, but not relationships between several tuples of one or more tables. It seems that this is an important limitation, since a relational database consists of a set of tables and a set of associations between pairs of tables. Both tables and associations are known as relations. Each association describes how records in one table relate to records in another table. Most associations correspond to foreign key relations. These relations can be seen as having two directions. One goes from a table where the attribute is primary key to a table where the attribute is foreign key (one-to-many), and the other one is in the reverse way (many-to-one). An object in a relational database can consist of several records fragmented across several tables and connected by associations (Fig. 2). Although the data model can consist of multiple tables, there must be only a single kind of object that is central to the analysis (target table). The assumption is that each record in the target table will correspond to
a single object in the database. Any information pertaining to each object which is stored in other tables can be retrieved by following the associations in the data model. Once the target table has been selected, a particular numeric attribute of that table can be chosen for regression purposes (target attribute). Thus, a multiple regression problem in a multi-relational framework can be defined as follows. Given a schema of a relational database D, a target table T0 and a target attribute Y within the target table T0, the goal is to mine a multi-relational multiple regression model to predict the estimated target attribute Y. Mined models not only involve attribute-value descriptions, but also structural information denoted by the associations in D.
[Figure: data model with five tables. Customer (Id: text, …, CreditLine: real, Agent: text), with CreditLine marked as the target attribute; Agent (Id: text, …, Commission: real); Order (Id: text, Date: date, Client: text); Detail (Id: text, …, Order: text, Article: text); Article (Id: text, …). Associations (1 to N): Agent to Customer, Customer to Order, Order to Detail, Article to Detail.]
Fig. 2. The data model of an example database used in relational regression
Relational regression models induced stepwise as in SMOTI can be expressed in the graphical language of selection graphs. The classical definition of a selection graph is reported in [11, 12, 16]. Nevertheless, we present an extension of this definition in order to make selection graphs more appropriate to our task.
Definition of selection graph. A selection graph G is a directed graph (N, A), such that:
− each node in N is a 4-tuple (T, C, R, s), named selection node, where:
  - T = (X1, X2, …, Xn) is a table in the relational schema D.
  - C is a set of conditions on attributes in T of type T.X’i OP c, where X’i is one of the attributes Xi in T after the removal of the effects of some variables already introduced in the relational regression model through regression nodes, OP is one of the usual comparison operators (<, ≥, in, not in, …) and c is a constant value.
  - R is a set of tuples R = {(RXj, αj, βj) | j = 1, …, l}, where RXj is a regression term already introduced in the multiple linear model, l is the number of such terms, and αj = (αj1, αj2, …, αjn) and βj = (βj1, βj2, …, βjn) are the regression coefficients computed to remove the effect of each term RXj from all numerical attributes in T: $X'_i = X_i - \sum_{j=1}^{l} (\alpha_{ji} + \beta_{ji} \times RX_j)$ for all i = 1, …, n such that Xi is numerical.
  - s is a flag with possible values open or closed.
− A is a set of tuples (p, q, fk, e), where:
  - p and q are selection nodes.
  - fk is a foreign key association between p.T and q.T in the relational schema D (one-to-many or many-to-one).
  - e is a flag with possible values present or absent.
Selection graphs contain at least a node n0 that corresponds to the target table T0. They can be graphically represented by a directed labelled graph (Fig. 3.a). The value of s is expressed by the absence or presence of a cross in the node, representing the values open and closed, respectively. Similarly, the value of e is indicated by the presence (absent value) or absence (present value) of a cross on the corresponding arrow representing the labelled arc. The direction of the arrow (left-to-right and right-to-left) corresponds to the multiplicity of the association fk (one-to-many and many-to-one, respectively). Every arc between the nodes p and q imposes some constraints on how one or more records in the table q.T are related to each record in table p.T, according to the list of conditions in q.C. The association between p.T and q.T induces some grouping (Fig. 3.b) in the records in q.T, and thus selects some records in p.T. In particular, a present arc selects those records that belong to the join between the tables and match the list of conditions. On the other hand, an absent arc corresponds to the negation of the joining condition and the representation of the complementary sets of objects. Intuitively, the tuples in the target table T0 that are explained by a selection graph G are those for which tuples exist or not in linked tables that satisfy the conditions defined for those tables.
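[Figure: (a) selection graph with a Customer node linked to an Order node carrying the condition Date in {02/09/02, 05/09/02}; (b) the grouping of Order records induced for each Customer record; (c) the logic representation listed below.]
begin(model('customer-124')).
customer('124', 'AdamSally', 818.75, 1000, '03').
order('12489', '02/09/02', '124').
order('12500', '05/09/02', '124').
end(model('customer-124')).
begin(model('customer-256')).
customer('256', 'SadamsAnn', 21.5, 1500, '06').
…
end(model('customer-256')).
…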
Fig. 3. (a) Example of selection graph; (b) corresponding grouping and (c) logic representation of objects selected from an instance of the example database
Selection graphs are more intuitive than expressions in SQL or Prolog, because they reflect the structure of the relational data model, and refinements of existing graphs may be defined in terms of the addition or updating of arcs and/or nodes. The given definition of selection graph does not allow recursive relationships to be represented. Therefore a selection graph can be straightforwardly translated into either SQL or first order logic expressions (Fig. 4). In this translation, a sub-graph pointed to by an absent arc becomes a negated inner sub-query.
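The translation into SQL can be sketched in a few lines of Python. This is an illustrative simplification and not the Mr-SMOTI code: the class and function names are hypothetical, the regression list R is omitted, conditions are plain SQL strings already qualified with the node alias, and an absent arc is assumed to point to a single node rather than to a whole sub-graph.

```python
from dataclasses import dataclass, field


@dataclass
class SelectionNode:
    alias: str                                       # e.g. "n0"
    table: str                                       # e.g. "Customer"
    conditions: list = field(default_factory=list)   # the set C, as SQL strings
    is_open: bool = True                             # the flag s


@dataclass
class Arc:
    parent: SelectionNode   # p
    child: SelectionNode    # q
    parent_key: str         # column of p.T used by the association fk
    child_key: str          # column of q.T used by the association fk
    present: bool = True    # the flag e


def to_sql(target, arcs, target_key="ID"):
    """Translate a one-level selection graph into a SQL string."""
    tables = [f"{target.table} {target.alias}"]
    where = list(target.conditions)
    for arc in arcs:
        p, q = arc.parent, arc.child
        if arc.present:
            # Present arc: join the child table and keep its conditions.
            tables.append(f"{q.table} {q.alias}")
            where.append(f"{p.alias}.{arc.parent_key} = {q.alias}.{arc.child_key}")
            where.extend(q.conditions)
        else:
            # Absent arc: negate the whole inner sub-query, not just the condition.
            inner = " AND ".join(q.conditions) or "1 = 1"
            where.append(
                f"{p.alias}.{arc.parent_key} NOT IN "
                f"(SELECT {q.alias}.{arc.child_key} FROM {q.table} {q.alias} WHERE {inner})"
            )
    sql = f"SELECT DISTINCT {target.alias}.{target_key} FROM {', '.join(tables)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


# Customers with at least one order and no order dated 02/09/02.
# Note: Order is a reserved word in SQL and would need quoting in a real DBMS.
n0 = SelectionNode("n0", "Customer")
n1 = SelectionNode("n1", "Order")
n2 = SelectionNode("n2", "Order", ["n2.Date = '02/09/02'"], is_open=False)
query = to_sql(n0, [Arc(n0, n1, "ID", "Client"), Arc(n0, n2, "ID", "Client", present=False)])
print(query)
```

Keeping the arc flag separate from the node conditions is what lets the absent arc wrap the child's conditions in a `NOT IN` sub-query instead of negating each condition in place.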
[Figure: selection graph G, in which the Customer node is linked by a present arc to an Order node and by an absent arc to a second Order node carrying the condition Date in {02/09/02}; the label R = ∅ also appears in the graph.]
a) SELECT n0.ID, n0.Name, n0.Adress, …, n0.Sale, n0.CreditLine, n0.Agent, n1.ID, n1.Date, n1.Client
   FROM Customer n0, Order n1
   WHERE n0.ID = n1.Client AND n0.ID NOT IN
     (SELECT n2.Client FROM Order n2 WHERE n2.Date IN {02/09/02});
b) ← customer(N0_ID, …, N0_Agent), order(N1_ID, N1_Date, N0_ID), ¬(order(N2_ID, N2_Date, N0_ID), N2_Date = 02/09/02).
Fig. 4. (a) SQL and (b) first order logic translation of a selection graph G
5 Multi-relational Stepwise Model Tree Induction
Mr-SMOTI induces model trees whose nodes (regression, split or leaf) involve multi-relational patterns that can be represented with selection graphs; that is, each node of the tree corresponds to a selection graph. Essentially Mr-SMOTI, like the propositional version SMOTI, builds a tree-structured multi-relational regression model by adding split and/or regression nodes through a process of successive refinements of the current selection graph until a stopping criterion is fulfilled and a leaf node is introduced. Thus, the model associated with each leaf is computed by combining all straight-line regressions in the regression refinements along the path from the root to the leaf.
5.1 The Algorithm
Mr-SMOTI is basically a divide-and-conquer algorithm that starts with a root selection graph G containing only the target node n0. This graph corresponds to the entire set of objects of interest in the relational database D (the target table T0). At each step the system chooses the optimal refinement (split or regression) according to a heuristic function. In particular, a split refinement corresponds to either the updating of an existing node by adding a new selection condition or the introduction of a new node and a new arc in the current selection graph. On the other hand, a regression refinement corresponds to the updating of regression terms in existing nodes. The optimal refinement and, in the case of a split, its complement are used to create the regression functions associated with the root of the left and right branch, respectively. This procedure is recursively applied to each branch until a stopping criterion is fulfilled.
  Mr-SMOTI (D: database, G: selection_graph)
  begin
    GS, GR, R: selection_graph;
    T_left, T_right: model_tree;
    GR := optimal_regression_refinement (G, D);
    if stopping_criteria (GR, D) then
      return leaf (GR);
    GS := optimal_split_refinement (G, D);
    R := best_refinement (GR, GS);
    if (R = GR) // the optimal refinement is a regression node
      T_left := Mr-SMOTI (D, R);
      T_right := ∅;
    else // the optimal refinement is a split node
      T_left := Mr-SMOTI (D, R);
      T_right := Mr-SMOTI (D, comp (R));
    return model_tree (R, T_left, T_right).
  end
The functions optimal_split_refinement and optimal_regression_refinement take the current selection graph G corresponding to the current node t and consider every possible split and regression refinement. The choice of which refinements are candidates is determined by the current selection graph G, the structure of the data model in D, and notably by the multiplicity of associations within this data model. The validity of either a splitting refinement (GS) together with its complement (comp(GS)), or a regression refinement (GR), is based on two distinct evaluation measures, σ(GS, comp(GS)) and ρ(GR), respectively. Both σ(GS, comp(GS)) and ρ(GR) are mean square errors (MSE)², therefore they can be compared directly to choose between three different possibilities:
− growing the model tree by adding the node tGR corresponding to the regression refinement GR;
− growing the model tree by adding the nodes tGS and tcomp(GS) corresponding to the splitting refinement GS and its complement comp(GS)³;
− stopping the tree’s growth at the current node t.
Let T be the multi-relational model tree currently built stepwise, G the selection graph associated with the node t in T, and tGS (tcomp(GS)) the left (right) child of t, associated with a split refinement GS (the complementary split refinement comp(GS)) of the selection graph G. Then σ(GS, comp(GS)) is defined as:
$$\sigma(G_S, comp(G_S)) = \frac{N(t_{G_S})}{N(t_{G_S}) + N(t_{comp(G_S)})} R(G_S) + \frac{N(t_{comp(G_S)})}{N(t_{G_S}) + N(t_{comp(G_S)})} R(comp(G_S)),$$
where
$N(t_{G_S})$ ($N(t_{comp(G_S)})$) is the number of training tuples covered by the refinement $G_S$ ($comp(G_S)$), and $R(G_S)$ ($R(comp(G_S))$) is the resubstitution error of the left (right) child, computed as follows:
$$R(G_S) = \frac{1}{N(t_{G_S})} \sum_{j=1}^{N(t_{G_S})} (y_j - \hat{y}_j)^2, \qquad R(comp(G_S)) = \frac{1}{N(t_{comp(G_S)})} \sum_{j=1}^{N(t_{comp(G_S)})} (y_j - \hat{y}_j)^2.$$
Therefore the evaluation measure $\sigma(G_S, comp(G_S))$ is coherently defined on the basis of the partially defined multiple linear regression models $\hat{Y}$ built by combining the best straight-line regression associated with $t_{G_S}$ ($t_{comp(G_S)}$) with all regressions introduced along the path from the root to $t_{G_S}$ ($t_{comp(G_S)}$).
² Mr-SMOTI minimises the square error with respect to the partially constructed regression model.
³ Mr-SMOTI requires that the subsets of target objects belonging to patterns deriving from the same parent by applying some kind of refinement must be complementary. Because of this, the split refinements are introduced together with their complementary refinement.
The evaluation of a regression step $Y = a + bX_i$ at regression refinement $G_R$ cannot be naïvely based on the resubstitution error $R(G_R)$:
$$R(G_R) = \frac{1}{N(t_{G_R})} \sum_{j=1}^{N(t_{G_R})} (y_j - \hat{y}_j)^2,$$
where $t_{G_R}$ is the node representing the regression refinement $G_R$ and $N(t_{G_R})$ is the
number of training tuples covered by the refinement GR. The predicted value ŷj is computed by combining all regression lines introduced in T along the path from the root to tGR. This would result in values of ρ(GR) less than or equal to values of σ(GS, comp(GS)) for splitting refinements involving Xi [18]. Indeed, the splitting test “looks ahead” to the best multiple linear regressions after the current split is performed, while the regression step does not perform such a look-ahead. A fairer comparison is to grow the model tree by one further level, so that the computation of ρ(GR) is based on the best split refinement GS after the current regression refinement is performed. Therefore, ρ(GR) is defined as follows: ρ(GR) = min {R(GR), σ(GS, comp(GS))}. Having defined both σ(GS, comp(GS)) and ρ(GR), the criterion for selecting the best refinement is fully characterised as well. At each step of the induction process, Mr-SMOTI chooses the apparently most promising refinement, according to a greedy strategy.
The function stopping_criteria determines whether the current optimal refinement must be transformed into a leaf according to the minimal number of target objects (minObject) covered by the selection graph associated with the current node and the minimal threshold for the coefficient of determination (minR) of the prediction function built stepwise [4]. This coefficient is a scale-free one-number summary of the strength of the relationship between the independent variables in the actual multiple linear model and the dependent variable. The regression model built by Mr-SMOTI can be viewed as a set of SQL queries associated with each leaf in the tree. These queries predict an estimate of the target attribute according to the multiple model built stepwise. The prediction is averaged by means of a grouping on the target objects.
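The two evaluation measures can be illustrated with a small Python sketch. It is only a numerical illustration under stated assumptions: the function names are hypothetical, the predictions are supplied directly as lists instead of being produced by regression models, and the tie-breaking rule in `best_refinement` (prefer the regression when the two measures are equal) is an assumption, since the text does not specify one.

```python
def mse(y, y_hat):
    """Resubstitution error R: mean square error over the covered tuples."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def sigma(y_left, y_hat_left, y_right, y_hat_right):
    """Size-weighted error of a split refinement and its complement."""
    n_left, n_right = len(y_left), len(y_right)
    total = n_left + n_right
    return (n_left / total) * mse(y_left, y_hat_left) \
        + (n_right / total) * mse(y_right, y_hat_right)


def rho(r_regression, sigma_best_split_after):
    """Look-ahead error of a regression refinement: min{R(GR), sigma(GS, comp(GS))}."""
    return min(r_regression, sigma_best_split_after)


def best_refinement(rho_value, sigma_value):
    """Pick the refinement with the lower error (ties go to the regression: an assumption)."""
    return "regression" if rho_value <= sigma_value else "split"


# Toy numbers (made up): three tuples go left, one goes right.
s = sigma([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [10.0], [8.0])
r = rho(2.0, 1.5)
print(s, r, best_refinement(r, s))
```

Here the left child has error 1/3 and the right child error 4, so the weighted measure is 3/4 × 1/3 + 1/4 × 4 = 1.25, and the split is preferred because the regression's look-ahead error of 1.5 is larger.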
The complementary nature of different branches of a model tree ensures that a given target object cannot be assigned a conflicting model.
5.2 The Refinements
Split refinements are an extension of the refinement operations proposed in [11] to perform a split test in a multi-relational decision tree. Whenever a split is introduced in a model tree, Mr-SMOTI is in fact refining the selection graph associated to the current node, by adding either a condition or an open node linked by a present arc.
Given a selection graph G, the add condition refinement returns the refined selection graph Gs by simply adding a split condition to an open node n i ∈G.N without changing the structure of G. The split condition can be a test on either a continuous or a discrete attribute of the table associated to the node ni . The first is in the form Xi ≤α. The value of α is one of the cut points found by an equal frequency discretization of the ordered distinct values of Xi . A discrete test is in the form Xi in Ui , with Ui a subset of the range of Xi . A greedy strategy as suggested by [20] is used to identify Ui . Initially Ui = ∅ is considered, the possible refinement is obtained by moving one discrete value from the range of Xi to Ui , such that the move results in a better split. The evaluation measure σ( GS , comp(GS )) is computed, therefore a better split decreases σ(Gs , comp(GS )). The process is iterated until there is no improvement in the splits. The add linked node refinement instantiates an association of the data model D by means of a present arc, together with its corresponding table, represented as an open node, and adds these to the selection graph G. Knowledge of the nature and multiplicity is used to guide and optimise this search. Since the investigated associations are foreign key associations, the proposed refinements can have two directions: backward or forward. The former correspond to many-to-one associations, while the latter describe one-to-many associations in the data model. This means that a backward refinement of the selection graph G does not partition the set of target objects covered by G but extends their descriptions (training data) by considering tuples joined in the table which are represented by the new added node. Each split refinement of type add condition or add linked node is introduced together with its complementary refinement. 
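The greedy construction of the subset Ui for a discrete test can be sketched as follows. This is a simplified illustration, not the system's code: the function names are hypothetical, each branch is scored by the mean square error around its mean (instead of the error of the partially built regression models used by σ), and at each pass the single value giving the largest improvement is moved.

```python
def weighted_mse(groups):
    """Size-weighted mean square error of each group around its own mean."""
    total = sum(len(g) for g in groups)
    error = 0.0
    for g in groups:
        if g:
            mean = sum(g) / len(g)
            error += sum((v - mean) ** 2 for v in g)
    return error / total


def greedy_subset(values, targets):
    """Grow U from the empty set, one discrete value at a time, while the split improves."""
    domain = sorted(set(values))
    chosen = set()
    best = weighted_mse([list(targets)])   # error before any split
    improved = True
    while improved:
        improved = False
        best_move = None
        for v in domain:
            if v in chosen:
                continue
            candidate = chosen | {v}
            if len(candidate) == len(domain):
                continue                   # would leave the complement empty
            left = [t for x, t in zip(values, targets) if x in candidate]
            right = [t for x, t in zip(values, targets) if x not in candidate]
            score = weighted_mse([left, right])
            if score < best:
                best, best_move = score, v
        if best_move is not None:
            chosen.add(best_move)
            improved = True
    return chosen, best


# Toy data (made up): value "b" has clearly different targets.
values = ["a", "a", "b", "b", "c", "c"]
targets = [1.0, 1.2, 5.0, 5.2, 1.1, 0.9]
subset, score = greedy_subset(values, targets)
print(subset, score)
```

On this data the first pass moves "b" into the subset and no further move lowers the error, so the resulting test is Xi in {b}.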
Let G be the selection graph associated with the current node t and GS a split refinement of G associated with the left sub-tree of t. The first order logic expression translating GS is: $Q_{G_S} \leftarrow Q_G, conj$,
where QG is the translation of G and conj is the condition corresponding to the split refinement. The complementary refinement comp(GS) associated with the right sub-tree cannot simply be the expression ← QG, ¬conj. Indeed, the selection graphs (queries) of the left and right sub-trees must be complementary: for each object in the current node (QG succeeds) exactly one of the two queries should succeed. Consider in Figure 5 the refinement GS of the selection graph G, obtained by adding a condition on the table Order, which is not the target table. In this case the complement of adding a literal (Date in {02/09/02}) is not equivalent to adding its negation (¬(Date in {02/09/02})) while at the same time switching the branches of T. This is an important difference from the propositional case, where a test and its simple negation generate a partition of the training data. The complementary set associated with the complementary refinement of GS must contain the target objects in G (and the linked information in connected nodes) that are associated with none of the tuples in Order satisfying the refinement condition. In [11], Knobbe et al. propose a complementary refinement named add negative condition that should solve the problem of mutual exclusion between an add condition refinement and its complement. If the node that is being refined does not represent the target table, comp(GS) is built from G by introducing an absent arc from the parent of ni to the clone of the entire sub-graph of G that is rooted in ni. The introduced sub-graph has a root (a clone of the node to be refined) that is a closed
[Figure: (a) partitioning of the Customer and Order records by the condition Date in {02/09/02} and by its complement (Date not in {02/09/02}); (b) the selection graphs G, GS and comp(GS) with the translations listed below.]
G (Customer linked to Order)
  First order expression: QG ← customer(N0_ID, …, N0_Agent), order(N1_ID, N1_Date, N0_ID).
  SQL query: SELECT n0.ID, …, n0.Agent, n1.ID, n1.Date, n1.Client FROM Customer n0, Order n1 WHERE n0.ID = n1.Client
GS (Customer linked to Order with the condition Date in {02/09/02})
  First order expression: QGS ← customer(N0_ID, …, N0_Agent), order(N1_ID, N1_Date, N0_ID), N1_Date = 02/09/02.
  SQL query: SELECT n0.ID, …, n0.Agent, n1.ID, n1.Date, n1.Client FROM Customer n0, Order n1 WHERE n0.ID = n1.Client AND n1.Date = 02/09/02
comp(GS) (Customer linked to Order and, by an absent arc, to a second Order node with the condition Date in {02/09/02})
  First order expression: Qcomp(GS) ← customer(N0_ID, …, N0_Agent), order(N1_ID, N1_Date, N0_ID), ¬(order(N2_ID, N2_Date, N0_ID), N2_Date = 02/09/02).
  SQL query: SELECT n0.ID, …, n0.Agent, n1.ID, n1.Date, n1.Client FROM Customer n0, Order n1 WHERE n0.ID = n1.Client AND n0.ID NOT IN (SELECT n2.Client FROM Order n2 WHERE n2.Date = 02/09/02)
Fig. 5. Explanation of (a) the partitioning of training objects according to (b) a split refinement GS and its complement comp(GS)
node updated with the refinement condition that is not negated. In this way the complementary operation builds a selection graph that negates an entire inner sub-query and not simply a condition. As was observed in [16], this approach fails to build complementary refinements when the node to be refined is not directly connected to the target node. The example in Figure 6 shows that the proposed mechanism can build a refinement GS and a complementary refinement comp(GS) that are not mutually exclusive. To overcome this problem, the complementary refinement comp(GS) should be obtained by adding an absent arc from the target node n0 to the clone of the sub-graph containing the entire join path from n0 to the node to be refined. The introduced sub-graph has a root (a clone of n0) that is a closed node and is updated with the refinement condition that is not negated. A new absent arc is also introduced between the target node and its closed clone. This arc is an instance of the implicit relationship between the primary key of the target table and itself (Figure 7). Similarly, the complementary refinement for an add linked node refinement is treated in the same way as the addition of a negated condition. This means that when the closed node to be added is not directly connected to the target node in G, a procedure similar to that described for complementing an add condition refinement must be followed.
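The difference between negating a condition and negating the inner sub-query can be reproduced on a toy database. The sketch below uses Python's built-in sqlite3 module rather than Oracle, and the rows are illustrative: customers 124 and 256 and orders 12489 and 12500 echo the running example, while order 12600 is made up so that customer 256 has an order.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customer (ID TEXT PRIMARY KEY);
CREATE TABLE "Order" (ID TEXT PRIMARY KEY, Date TEXT,
                      Client TEXT REFERENCES Customer(ID));
INSERT INTO Customer VALUES ('124'), ('256');
INSERT INTO "Order" VALUES ('12489', '02/09/02', '124'),
                           ('12500', '05/09/02', '124'),
                           ('12600', '07/09/02', '256');
""")


def ids(sql):
    """Run a query and return the set of selected target object identifiers."""
    return {row[0] for row in con.execute(sql)}


# Split refinement GS: customers with an order dated 02/09/02.
g_s = ids("""SELECT n0.ID FROM Customer n0, "Order" n1
             WHERE n0.ID = n1.Client AND n1.Date = '02/09/02'""")

# Naive complement: negate only the condition on the joined table.
naive = ids("""SELECT n0.ID FROM Customer n0, "Order" n1
               WHERE n0.ID = n1.Client AND n1.Date <> '02/09/02'""")

# Complement through an absent arc: negate the whole inner sub-query.
comp = ids("""SELECT n0.ID FROM Customer n0, "Order" n1
              WHERE n0.ID = n1.Client AND n0.ID NOT IN
                (SELECT n2.Client FROM "Order" n2 WHERE n2.Date = '02/09/02')""")

print(g_s, naive, comp)
```

Customer 124 has one order on the tested date and one on another date, so it is selected both by GS and by the naive complement; only the sub-query form yields two disjoint sets of target objects.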
[Figure: selection graphs over Customer, Order and Detail. (a) GS adds the condition price ≤ 15 to the Detail node; (b) comp(GS) adds, by an absent arc from Order, a second Detail node with the condition price ≤ 15. The queries and the selected objects are listed below.]
(a) GS
SELECT n0.ID, …, n0.Agent, n1.ID, n1.Date, n1.Client, n2.Id, …, n2.Order, n2.Article FROM Customer n0, Order n1, Detail n2 WHERE n0.ID = n1.Client AND n1.ID = n2.Order AND n2.Price ≤ 15
begin(model('customer-124')).
customer('124', 'Adam Sally', …, '03').
order('12489', '02/09/02', '124').
detail('D123', 12,80, '12489', 'A1').
detail('D124', 15,22, '12489', 'A2').
end(model('customer-124')).
begin(model('customer-256')).
customer('256', 'Sadams Ann', …, '06').
...
end(model('customer-256')).
...
(b) comp(GS)
SELECT n0.ID, …, n0.Agent, n1.ID, n1.Date, n1.Client, n2.Id, …, n2.Order, n2.Article FROM Customer n0, Order n1, Detail n2 WHERE n0.ID = n1.Client AND n1.ID = n2.Order AND n1.ID NOT IN (SELECT n3.Order FROM Detail n3 WHERE n3.Price ≤ 15)
begin(model('customer-124')).
customer('124', 'Adam Sally', …, '03').
order('12500', '05/09/02', '124').
detail('D125', 16,25, '12500', 'A3').
end(model('customer-124')).
…
Fig. 6. Example of (a) refinement (GS) by adding a condition on a node not directly connected to the target node and (b) the corresponding complementary refinement, proposed by Knobbe et al., that does not satisfy the mutual exclusion
[Figure: selection graph over Customer, Order and Detail in which an absent arc from the target node Customer points to a cloned Customer, Order, Detail path whose Detail node carries the condition price ≤ 15.]
SELECT n0.ID, …, n0.Agent, n1.ID, n1.Date, n1.Client, n2.Id, …, n2.Order, n2.Article FROM Customer n0, Order n1, Detail n2 WHERE n0.ID = n1.Client AND n1.ID = n2.Order AND n0.ID NOT IN (SELECT n3.ID FROM Customer n3, Order n4, Detail n5 WHERE n3.ID = n4.Client AND n4.ID = n5.Order AND n5.Price ≤ 15)
Fig. 7. Example of correct complementary refinement when adding a condition on a node not directly connected to the target node
Finally, a regression refinement GR (Figure 8) corresponds to performing a regression step (Y’=αY+βY×n i .T.Xj ’) on the residual of a continuous attribute (n i .T.Xj ’) not yet introduced in the model currently built. Regression coefficients (αY and βY) are estimated according to the values of Y’ and n i .T.Xj ’ of each single tuple selected by the current selection graph G. The regression attribute must belong to a table represented by a node in the parent graph G. For each node, the list of regressions R is updated by adding the regression term (n i .T.Xj ’) introduced in the model and the coefficients α and β computed to update the residuals of all continuous attributes in
the node. According to the evaluation function ρ(GR), a regression refinement includes a look-ahead capability. The complementary refinement of a regression step is empty.
[Figure: selection graph G with nodes n0 (Customer, Rn0 = ∅) and n1 (Agent, Rn1 = ∅), and its regression refinement GR with Rn0 = {(n0.Commission, α1, β1)} and Rn1 = {(n1.Commission, α1, β1)}.]
Regression step: CreditLine = n0.R.α1_CreditLine + n0.R.β1_CreditLine × Commission
Model currently built: SELECT n0.Id, avg(n0.R.α1_CreditLine + β1_CreditLine × n0.Commission) FROM Customer n0, Agent n1 WHERE n0.Agent = n1.Id GROUP BY n0.ID
Fig. 8. Example of a regression refinement GR that performs a regression step on the attribute Agent.Commission. The nodes n0 and n1 are updated with the vectors of coefficients α and β in order to remove the effect of the regression attribute from the continuous attributes in n0.T and n1.T, respectively
6 Experimental Evaluation
Mr-SMOTI has been applied to the biological problems of predicting both the mutagenic activity of molecules [21] and the biodegradability of chemical compounds in water [6]. The mutagenesis dataset consists of 230 molecules divided into two subsets: 188 molecules for which linear regression yields good results and 42 molecules that are regression-unfriendly. In our experiments we used the atom and bond structure of the regression-friendly molecules, adding the boolean indicators Ind1 and Ind2 as one setting (B1) and adding the Lumo and Logp properties to get a second setting (B2). Similarly, the biodegradability dataset consists of 328 chemical molecules structurally described in terms of atoms and bonds. In all the experimental results reported below the thresholds for the stopping criteria are fixed as follows: the minimum number of target objects falling in each internal node must be greater than the square root of the number of target objects in the entire training set, and the determination coefficient in each internal node must be below 0.80. Each dataset is analysed by means of a 10-fold cross-validation. Figure 9 shows the test set performance of Mr-SMOTI and TILDE-RT in both domains, as measured by the Pearson correlation coefficient. The Pearson correlation coefficient (PCC), which is computed as follows:
$$\mathrm{PCC} = \frac{\sum_{j=1..N} (y_j - \bar{y})\,(\hat{y}_j - \bar{\hat{y}})}{\sqrt{\sum_{j=1..N} (y_j - \bar{y})^2 \times \sum_{j=1..N} (\hat{y}_j - \bar{\hat{y}})^2}},$$
[Figure 9: three panels, Mutagenesis (B1), Mutagenesis (B2) and Biodegradability, each plotting PCC (Y axis) against the tree index 1 to 10 (X axis) for Mr-SMOTI and TILDE-RT.]
Fig. 9. Pearson correlation coefficient (Y axis) for multi-relational prediction models induced from the 10-fold cross validated datasets (X axis) of Mutagenesis (B1, B2) and Biodegradability datasets. The comparison concerns two systems: TILDE-RT (black squares) vs. Mr-SMOTI (purple diamonds)
is a measure of how much the value of the target attribute (y_j) in the test objects correlates with the value (ŷ_j) predicted by the induced model. Since the Pearson correlation coefficient does not measure the magnitude of the prediction error, we include several other measures, as proposed by Quinlan [24]. We have evaluated the predictive accuracy on the basis of the average mean square error (Avg.MSE), which is computed as follows:
$$\mathrm{Avg.MSE} = \frac{1}{k} \sum_{V_i \in V} \mathrm{MSE}(V_i) = \frac{1}{k} \sum_{V_i \in V} \frac{1}{N(V_i)} \sum_{j \in V_i} \bigl(y_{ij} - \hat{y}_{ij}(V - V_i)\bigr)^2,$$
where V = {V1, ..., Vk} is a k-fold cross-validation partition of the training data (here k = 10), N(Vi) is the number of target objects in Vi, and ŷij(V − Vi) is the value predicted for the j-th target object in Vi by the prediction model built from V − Vi. The predictive accuracy is also estimated according to the average error (AE), averaged over the 10-fold cross-validation:
$$\mathrm{Avg.AE} = \frac{1}{k} \sum_{V_i \in V} \mathrm{AE}(V_i) = \frac{1}{k} \sum_{V_i \in V} \frac{1}{N(V_i)} \sum_{j \in V_i} \bigl| y_{ij} - \hat{y}_{ij}(V - V_i) \bigr|.$$
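As a minimal sketch of how the three measures above could be computed from cross-validation output, the following Python functions take, for each test fold, the observed values and the values predicted by the model trained on the remaining folds. The function names and the list-of-folds layout are illustrative assumptions and are not part of Mr-SMOTI; the PCC is computed in its usual form, with a square root in the denominator.

```python
import math

def pcc(y, y_hat):
    """Pearson correlation between observed values y and predictions y_hat."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_p)

def avg_mse(folds):
    """folds: list of (y, y_hat) pairs, one pair per test fold V_i."""
    per_fold = [
        sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
        for y, y_hat in folds
    ]
    return sum(per_fold) / len(folds)

def avg_ae(folds):
    """Same layout as avg_mse, with absolute instead of squared errors."""
    per_fold = [
        sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)
        for y, y_hat in folds
    ]
    return sum(per_fold) / len(folds)

# Made-up data for two folds (hypothetical values, for illustration only).
folds = [([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), ([4.0, 6.0], [5.0, 5.0])]
mse = avg_mse(folds)  # (4/3 + 1) / 2
ae = avg_ae(folds)    # (2/3 + 1) / 2
```

Note that each fold contributes equally to the average, regardless of its size N(Vi), exactly as in the two formulas above.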
For the pairwise comparison with TILDE-RT, the non-parametric Wilcoxon two-sample paired signed rank test is used [22], since the number of folds (or “independent” trials) is relatively low and does not justify the application of parametric tests, such as the t-test. To perform the test, we assume that the experimental results of the two compared methods are independent pairs of sample data {(u1, v1), (u2, v2), ..., (un, vn)}. We then rank the absolute values of the differences ui − vi. The Wilcoxon test statistics W+ and W− are the sums of the ranks of the positive and negative differences, respectively. We test the null hypothesis H0: “no difference in distributions” against the two-sided alternative Ha: “there is a difference in distributions”. More formally, the hypotheses are H0: “µu = µv” against Ha: “µu ≠ µv”. Intuitively, H0 is rejected when W+ >> W− or vice versa. Whether W+ should be considered “much greater than” W− depends on the significance level α. The basic assumption of the statistical test is that the two populations have the same continuous distribution (and no ties occur). Since, in our experiments, ui and vi are MSE, W+ >> W− implies that the second method (V) is better than the first one (U). The results of the Wilcoxon signed rank test on the accuracy of the induced multi-relational prediction models are reported in Table 1. The Wilcoxon test statistic W+ (W−) is the sum of the ranks of the positive (negative) differences between TILDE-RT and Mr-SMOTI. Therefore, the smaller W+ (W−), the better for Mr-SMOTI (TILDE-RT). Differences are considered statistically significant when the p-value is less than or equal to α/2.

Table 1. Results of the Wilcoxon signed rank test on the accuracy of the induced models. The best value is in boldface, while the statistically significant values (p ≤ α/2, α = 0.05) are in italics
Dataset            Accuracy   Mr-SMOTI   TILDE-RT   W+   W−   p
Mutagenesis B1     Avg.MSE    1.165      1.197      23   32   0.69
Mutagenesis B1     Avg.AE     0.887      0.986      12   43   0.13
Mutagenesis B2     Avg.MSE    1.118      1.193      15   40   0.23
Mutagenesis B2     Avg.AE     0.845      0.985      11   44   0.10
Biodegradability   Avg.MSE    0.337      0.588       0   55   0.0019
Biodegradability   Avg.AE     0.186      0.363       0   55   0.0019
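The rank computation behind the Wilcoxon statistics can be sketched as follows. This is a minimal illustration and not the procedure used to produce Table 1: the function names are hypothetical, zero differences are dropped and tied absolute differences receive average ranks (two common conventions assumed here), and the two-sided p-value is obtained by enumerating all sign assignments, which is exact only when no ties occur.

```python
from itertools import product

def signed_rank_statistics(u, v):
    """Return (W+, W-, n) for paired samples u and v, using differences u_i - v_i."""
    diffs = [a - b for a, b in zip(u, v) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)

def exact_two_sided_p(w_plus, w_minus, n):
    """Two-sided p-value under H0, assuming the ranks are exactly 1..n (no ties)."""
    w = min(w_plus, w_minus)
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(rank for rank, s in zip(range(1, n + 1), signs) if s) <= w
    )
    return min(1.0, 2 * count / 2 ** n)

# Made-up per-fold errors: method U is better (smaller) on every one of 10 folds.
u = [0.31, 0.35, 0.29, 0.40, 0.33, 0.36, 0.30, 0.38, 0.34, 0.32]
v = [0.52, 0.61, 0.47, 0.66, 0.58, 0.70, 0.49, 0.63, 0.55, 0.60]
w_plus, w_minus, n = signed_rank_statistics(u, v)  # W+ = 0, W- = 55
p_value = exact_two_sided_p(w_plus, w_minus, n)    # 2 / 1024
```

With ten folds the enumeration visits only 1024 sign patterns; for larger samples a normal approximation or a library routine would be the more practical choice.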
Table 2. Number of leaves comparison for the 188 regression-friendly elements of Mutagenesis (B1 and B2 settings) and the 328 elements of Biodegradability

System     Mutagenesis – B1   Mutagenesis – B2   Biodegradability
Mr-SMOTI   14.4               9.2                2
TILDE-RT   11.7               14.9               4.7
Experimental results on tree size are reported in Table 2. Results show that for both the mutagenesis dataset (B2 setting) and the biodegradability dataset Mr-SMOTI builds simpler models (in number of leaves) without losing accuracy.
7 Conclusions
This paper presents a novel approach to mining relational model trees. The proposed algorithm can work effectively when training data are stored in multiple tables of a relational DBMS. Information on the database schema is used to reduce the search space of patterns. Induced relational models are represented by selection graphs, whose definition has been extended in order to describe model trees with either split nodes or regression nodes. The proposed algorithm has been implemented as a module of the system MURENA, which is tightly coupled to the Oracle Database. As future work, we plan to extend the comparison of Mr-SMOTI to other multi-relational data mining systems on a larger set of benchmark datasets. In particular, we plan to apply Mr-SMOTI to the spatial task of supporting the quantitative interpretation of maps and to the analysis of geo-referenced census data [1]. Moreover, we intend to use SQL primitives and parallel database servers to speed up the stepwise construction of multi-relational model trees from data stored in a large database.
Acknowledgments
This work has been supported by the annual Scientific Research Project "Metodi di apprendimento automatico e di data mining per sistemi di conoscenza basati sulla semantica" Year 2003, funded by the University of Bari. The authors thank Hendrik Blockeel for providing the mutagenesis and biodegradability datasets.
References
[1] Appice A., Ceci M., Lanza A., Lisi F.A., and Malerba D.: Discovery of Spatial Association Rules in Georeferenced Census Data: A Relational Mining Approach. Intelligent Data Analysis, special issue on "Mining Official Data" (in press).
[2] Blockeel H.: Top-down induction of first order logical decision trees. Ph.D. thesis, Department of Computer Science, Katholieke Universiteit Leuven, 1998.
[3] Breiman L., Friedman J., Olshen R., and Stone J.: Classification and regression tree. Wadsworth & Brooks, 1984.
[4] Draper N.R., and Smith H.: Applied regression analysis. John Wiley & Sons, 1982.
[5] Dzeroski S.: Numerical Constraints and Learnability in Inductive Logic Programming. Ph.D. thesis, University of Ljubljana, Slovenia, 1995.
[6] Dzeroski S., Blockeel H., Kramer S., Kompare B., Pfahringer B., and Van Laer W.: Experiments in predicting biodegradability. Proceedings of the Ninth International Workshop on Inductive Logic Programming (S. Dzeroski and P. Flach, eds.), LNAI, vol. 1634, Springer, pp. 80-91, 1999.
[7] Dzeroski S., Todoroski L., and Urbancic T.: Handling real numbers in inductive logic programming: A step towards better behavioural clones. In Machine Learning: ECML95, Lavrac N., and Wrobel S. (Eds.), Springer, Berlin Heidelberg New York, 1995.
[8] Dzeroski S., and Lavrac N. (Eds.): Relational Data Mining. Springer-Verlag, 2001.
[9] Karalic A.: Linear regression in regression tree leaves. In Proc. of ISSEK’92 (International School for Synthesis of Expert Knowledge), Bled, Slovenia, 1992.
[10] Karalic A.: First Order regression. Ph.D. thesis, University of Ljubljana, Slovenia, 1995.
[11] Knobbe J., Siebes A., and Van der Wallen D.M.G.: Multi-relational decision tree induction. In Proc. 3rd European Conf. on Principles and Practice of Knowledge Discovery in Databases, PKDD '99, 1999.
[12] Knobbe J., Blockeel H., Siebes A., and Van der Wallen D.M.G.: Multi-relational Data Mining. In Proc. of Benelearn'99, 1999.
[13] Knobbe A.J., Haas M., and Siebes A.: Propositionalisation and aggregates. In Proc. 5th European Conf. on Principles of Data Mining and Knowledge Discovery, Springer-Verlag, 2001.
[14] Kramer S.: Structural regression trees. In Proc. 13th National Conf. on Artificial Intelligence, 1996.
[15] Lavrac N., and Dzeroski S.: Inductive Logic Programming: Techniques and Applications. Ellis Horwood, Chichester, UK, 1994.
[16] Leiva H.A.: MRDTL: A multi-relational decision tree learning algorithm. Master's thesis, University of Iowa, USA, 2002.
[17] Lubinsky D.: Tree Structured Interpretable Regression. In Learning from Data, Fisher D., and Lenz H.J. (Eds.), Lecture Notes in Statistics, 112, Springer, 1994.
[18] Malerba D., Appice A., Ceci M., and Monopoli M.: Trading-off local versus global effects of regression nodes in model trees. In Foundations of Intelligent Systems, 13th International Symposium, ISMIS 2002, Hacid H.S., Ras Z.W., Zighed D.A., and Kodratoff Y. (Eds.), Lecture Notes in Artificial Intelligence, 2366, Springer, Germany, 2002.
[19] Malerba D., Esposito F., Ceci M., and Appice A.: Top-down induction of model trees with regression and splitting nodes. LACAM Technical Report, 2003.
[20] Mehta M., Agrawal R., and Rissanen J.: SLIQ: A fast scalable classifier for data mining. In Proceedings of the Fifth International Conference on Extending Database Technology, 1996.
[21] Muggleton S., Srinivasan A., King R., and Sternberg M.: Biochemical knowledge discovery using Inductive Logic Programming. In Proceedings of the first Conference on Discovery Science, Motoda H. (Ed.), Springer-Verlag, Berlin, 1998.
[22] Orkin M., and Drogin R.: Vital Statistics. McGraw Hill, New York, 1990.
[23] Quinlan J.R.: Learning with continuous classes. In Proceedings AI'92, Adams & Sterling (Eds.), World Scientific, 1992.
[24] Quinlan J.R.: A case study in Machine Learning. In Proceedings ACSC-16, Sixteenth Australian Computer Science Conferences, 1993.
[25] Silverstein G., and Pazzani M.J.: Relational cliches: Constraining constructive induction during relational learning. In Proc. 8th Int. Workshop on Machine Learning, 1991.
[26] Torgo L.: Functional Models for Regression Tree Leaves. In Proceedings of the 14th International Conference (ICML’97), D. Fisher (Ed.), Nashville, Tennessee, 1997.
[27] Wang Y., and Witten I.H.: Inducing Model Trees for Continuous Classes. In Poster Papers of the 9th European Conf. on Machine Learning (ECML’97), M. van Someren and G. Widmer (Eds.), Prague, Czech Republic, 1997.
[28] Weiss S.M., and Indurkhya N.: Predictive Data Mining. A Practical Guide. Morgan Kaufmann, San Francisco, CA, 1998.
[29] Wrobel S.: Inductive logic programming for knowledge discovery in databases. In Dzeroski S., and Lavrac N. (Eds.): Relational Data Mining. Springer-Verlag, pp. 74-101, 2001.
Complexity Parameters for First-Order Classes
Marta Arias and Roni Khardon
Department of Computer Science, Tufts University, Medford, MA 02155, USA
{marias,roni}@cs.tufts.edu
Abstract. We study several complexity parameters for first-order formulas and their suitability for first order learning models. We show that the standard notion of size is not captured by sets of parameters that are used in the literature. We then identify an alternative notion of size and a simple set of parameters that are useful in this sense. Matching VC-dimension lower bounds complete the picture showing that these parameters are indeed crucial.
1 Introduction
Since the introduction of ILP, several theoretical investigations have contributed to characterizing the complexity of learning classes of expressions in first order logic (FOL). While learnability is usually defined using the size of the target concept as the complexity measure, the complexity of algorithms in the literature is usually quantified with other complexity measures. Several authors use standard parameters from first order logic, such as the number of clauses, the number of literals per clause, etc. Others introduce special syntactic parameters such as depth and determinacy, or restrict the structure of clauses or background knowledge in their analysis [21, 8, 17, 6, 14]. A comparison to propositional logic can highlight the difficulty. Work on learnability in propositional logic typically uses the number of propositions n and the size of the formula m as complexity parameters. The situation in FOL differs from the propositional case since we do not have a fixed instance size n and it has proved difficult to get upper bounds directly in terms of the size m. Moreover, several parameters are inter-related, so the value of one affects the other and the resulting picture is less clear. This paper studies notions of size and gives a setting and a set of parameters which are in some sense the right ones for first order learnability. Previous work has provided both lower bounds and upper bounds on the resources required for learnability. While for upper bounds specific algorithms are proposed and analyzed, lower bounds were derived using the notion of Vapnik-Chervonenkis (VC) dimension. VC based bounds apply in several models of learnability including the PAC model and the model of exact learning with queries [9, 20].
This work has been partly supported by NSF Grant IIS-0099446.
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 22–37, 2003. © Springer-Verlag Berlin Heidelberg 2003
Several such lower bounds for first order learnability [4, 15, 19] ignore some parameters and prove exponential or infinite growth w.r.t. other parameters. Thus, [4, 15] show that the complexity may be exponential in the arity of predicates. However, neither paper highlights the fact that the number of literals in the expressions being learned is of the same order (also exponential in arity). The paper [19] shows that the VC dimension is infinite with a single binary predicate but does not highlight the fact that these cases allow for an infinite number of constants (called parameters) whose encoding is not accounted for in the size of expressions.¹ In fact, any such lower bound going beyond the size of expressions must have a hidden unaccounted aspect: since the VC dimension is bounded by the logarithm of the class size, for discrete cases the lower bounds cannot be larger than the size of the learned expressions, assuming a reasonable encoding scheme. Therefore, the question is what constitutes a good set of parameters for first order learnability. Such a set should capture the size and avoid the confusion arising from inter-related parameter sizes. To answer this question we consider a setting where the parameters of the FOL signature (number of predicates, constants, function symbols, arity) are fixed in advance and are therefore numerical constants. The concept class is defined by the other parameters controlling the expressions (number of variables, terms, clauses, etc.). We start our investigation by defining when two sets of parameters are “related” so that polynomial learnability transfers from one to the other. Using this we show that there is no simple answer (set of parameters) if the standard notion of size is used: the standard notion of size for FOL is not polynomially bounded by the natural parameters of FOL. Here we simply count the formula size of an expression.
On the other hand, if we use a more compact representation, where a repeated term is counted only once, then one can derive a polynomial bound for the total size. In fact the bound is O(cl + ct), where c is the number of clauses, l is the largest number of literals in a single clause, and t is the maximal number of terms and subterms in a single clause. With this in mind we also prove that the VC dimension is Ω(cl + ct), so that we get a matching lower bound. Therefore, our results identify a natural separation of the parameters into fixed and variable ones and give a precise characterization of the VC-dimension in this setting. Our interest in the question above arises from an attempt to characterize learning complexity for learning with equivalence and membership queries [1]. Several recent papers have studied this problem in the ILP context. The case of Horn definitions (with a single head) [24] is indeed polynomial in c + l + t. Other results are either not polynomial or rely on syntax based oracles. For example, our results in [2] show that constrained and range-restricted expressions are learnable with complexity polynomial in c + t^v + t^a (we simplify here by ignoring some of the parameters). It is not known whether the exponential dependence on v is necessary or not. Note also that t^a essentially bounds l but may in fact be much larger than l. As pointed out above, VC based bounds cannot resolve this question since they are limited by expression size. The notion of certificate size of concept classes, developed by [13, 12], may provide tools to do so. Characterizing the certificate complexity of first order classes is an interesting direction for future work. Preliminary results solving some cases in propositional logic are reported in [3].

¹ The case here is similar to learning classes with real valued parameters where each number is charged one unit of complexity, but nonetheless the VC-dimension of various concept classes is bounded. The negative result mentioned shows that this does not hold for first order logic except in very restricted cases. The work in [19, 11] identifies syntactic restrictions on formulas, examples, and background knowledge that give bounded VC-dimension in this setting.
2 Preliminaries
We assume familiarity with first-order logic, as given e.g. in [18]. The following gives the basic definitions for concept classes and learnability in this context. We consider the framework of learning from interpretations [7] where examples given to the learner are interpretations. Other frameworks include the standard ILP setting [22] and learning from entailment [10]. While some of our results can be translated to these settings we do not develop the details here.² A concept is a set of interpretations over a fixed and known signature S. The signature determines the variables, function and predicate symbols (and their respective arity) over which formulas are built. We refer to the set of possible interpretations over S as Int(S). The concepts are represented by first-order formulas; the concept represented by a formula ψ is {M | M ⊨ ψ and M ∈ Int(S)}. Syntactically we consider universally quantified first-order Horn expressions. A clause is a disjunction of literals where all variables in the clause are (implicitly) universally quantified. A Horn clause has at most one positive literal. A Horn expression is a conjunction of Horn clauses. Note that any clause can be written as C = (⋀_{n∈Neg} n) → (⋁_{p∈Pos} p), where Neg and Pos are the sets of atoms that appear in negative and positive literals of C, respectively. When doing so we will refer to (⋀_{n∈Neg} n) as the antecedent of C and to (⋁_{p∈Pos} p) as the consequent of C. A clause is range restricted if every term that appears in its consequent also appears in its antecedent. A clause is constrained if every term that appears in its antecedent also appears in its consequent. The size of a concept is the size of the smallest formula representing it. If no such formula exists, then the concept’s size is infinite. Usually the size of a formula is its string length but other notions of size are also possible. Given a concept class C, we define Cm as the concepts in C of size at most m. Naturally, C = ∪_{m≥1} Cm.
² For learning from entailment, one can use generic tools from [16] but this requires a more complicated language, “weakly range restricted expressions”, that makes use of equality.

While our discussion and results are largely independent of the learning model, it will be useful to have a model in mind. We briefly review the model of learning with equivalence and membership queries [1]. Before the learning process starts, a concept is fixed among the possible concepts (the concept class). We refer to this concept as the target concept. The goal of the learner is to output an expression that represents the target concept. The learner (the learning algorithm) has access to an equivalence oracle and a membership oracle that provide information about the target concept. In an equivalence query, the learner presents a hypothesis (in the form of a first-order formula) and the oracle answers Yes if it is a representation of the target concept. Otherwise, it answers No and provides a counterexample, that is, an example interpretation where target and hypothesis disagree. In a membership query, the learner presents an example and the oracle answers Yes or No depending on whether the example presented is part of the target concept. We assume that the learner is given the signature S as input.

Definition 1. The query complexity of a learning algorithm A at any stage in a run is the sum of the sizes of the (i) inputs to equivalence queries, and (ii) inputs to membership queries made up to that stage.

Definition 2. The class C is polynomial-query learnable if there exists a learning algorithm A and a two-variable polynomial p(·, ·) such that, for any positive integer m, and for any unknown target concept c ∈ Cm: (i) A uses membership queries and equivalence queries of the form EQ(h) where h represents a concept in C, (ii) A eventually halts and outputs a string h representing the target concept c, and (iii) at any stage, if n is the size of the longest counterexample received so far in response to an equivalence query, the query complexity of A at that stage does not exceed p(n, m).
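The query protocol and the accounting of Definition 1 can be illustrated with a deliberately simplified toy, sketched below. Everything in it is an assumption made for illustration: concepts are finite sets of integers rather than sets of interpretations, a hypothesis is a set rather than a first-order formula, the size of an example is taken to be 1, the size of a hypothesis is its number of elements (at least 1), and the names `Oracle` and `learn` are hypothetical.

```python
class Oracle:
    """Answers queries about a hidden target concept (here: a finite set)."""

    def __init__(self, target):
        self.target = frozenset(target)
        self.query_complexity = 0  # running sum of the sizes of all query inputs

    def membership(self, example):
        self.query_complexity += 1  # assumption: every example has size 1
        return example in self.target

    def equivalence(self, hypothesis):
        # assumption: the size of a hypothesis is its cardinality (at least 1)
        self.query_complexity += max(1, len(hypothesis))
        disagreement = self.target ^ hypothesis
        # None plays the role of "Yes"; otherwise return a counterexample
        return min(disagreement) if disagreement else None


def learn(oracle):
    """Trivial exact learner: repair the hypothesis one counterexample at a time."""
    hypothesis = set()
    while True:
        counterexample = oracle.equivalence(frozenset(hypothesis))
        if counterexample is None:
            return hypothesis
        if oracle.membership(counterexample):
            hypothesis.add(counterexample)      # target contains it: add
        else:
            hypothesis.discard(counterexample)  # target excludes it: remove


oracle = Oracle({2, 5, 7})
learned = learn(oracle)
```

In this toy the number of equivalence queries grows with the size of the target, which is the kind of dependence that the polynomial p(n, m) of Definition 2 is meant to bound.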
3 Size and Other Complexity Parameters
We first introduce the various parameters:

TreeSize(E): the size of an expression E is the number of occurrences of variables, function and predicate symbols, logical connectives and quantifiers used to write the expression in its usual string form. For example, if H is the following expression with two clauses:

(∀X add(zero, X, X)) ∧ (∀X ∀Y ∀Z add(X, Y, Z) → add(succ(X), Y, succ(Z)))

then TreeSize(H) = 24. This size measure is equivalent to the number of nodes in a tree constructed recursively in the following manner. If the expression is a quantified expression, then put the quantifier in the root (labeled with the quantifier, FORALL or EXISTS), the quantified variable as its left child and the rest of the expression as the right child. If the expression is a conjunction, then add as children to the root (labeled with AND) all its conjuncts. Disjunctions are treated analogously, having OR as the root and the disjuncts as children. For implications the root is labeled with IMPLIES, the left child is the antecedent and the right child is the consequent. With a negation the node is labeled with NOT and the only child is the rest of the expression. For atomic formulas, the root is labeled with the predicate symbol and the children are its arguments. If the expression is a variable, then the root is a leaf labeled with the variable name. For functional terms, the root is the outermost function symbol and the children are its arguments. The corresponding tree for the expression H appears in Figure 1.

[Figure 1: tree for H, rooted at AND, with one FORALL subtree per clause.]
Fig. 1. The tree corresponding to the expression H
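The recursive construction described above can be checked mechanically. The following sketch encodes the tree for H as nested Python tuples of the form (label, child, ...) and counts its nodes; the tuple encoding and the helper names `node` and `tree_size` are assumptions introduced only for this illustration.

```python
def node(label, *children):
    """A tree node is a tuple: its label followed by its children."""
    return (label, *children)

def tree_size(expr):
    """Number of nodes in the tree, i.e. TreeSize of the encoded expression."""
    return 1 + sum(tree_size(child) for child in expr[1:])

X, Y, Z = node("X"), node("Y"), node("Z")
zero = node("zero")

# First clause: FORALL X add(zero, X, X)
clause1 = node("FORALL", X, node("add", zero, X, X))

# Second clause: FORALL X FORALL Y FORALL Z (add(X,Y,Z) IMPLIES add(succ(X),Y,succ(Z)))
clause2 = node("FORALL", X,
          node("FORALL", Y,
          node("FORALL", Z,
               node("IMPLIES",
                    node("add", X, Y, Z),
                    node("add", node("succ", X), Y, node("succ", Z))))))

H = node("AND", clause1, clause2)
size_of_H = tree_size(H)  # 1 (AND) + 6 (first clause) + 17 (second clause) = 24
```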
DAGSize(E): the size of an expression E is the number of nodes in the DAG formed from the tree shown above by having expressions share common sub-structures. We assume that expressions are standardized apart, that is, we avoid re-use of variable names that belong to scopes of different quantifiers. This converts our expression H into the equivalent H′:

(∀X′ add(zero, X′, X′)) ∧ (∀X ∀Y ∀Z add(X, Y, Z) → add(succ(X), Y, succ(Z)))

In the example, since each of the variables X, Y, Z, X′ appears 3 times and these are the only shared structures, we save exactly 8 occurrences and hence DAGSize(H′) = 16. To our knowledge DAGSize has never been used in any of the work done for learning first-order structures. It seems important, however, to distinguish the different characteristics of TreeSize and DAGSize. As we show below, DAGSize is also useful in providing an upper bound on the VC-dimension.

NTerms(E): the maximum number of distinct terms (including sub-terms) occurring in a clause over all the clauses in E. In the example, NTerms(H) = 5, corresponding to the term set in the second clause {X, Y, Z, succ(X), succ(Z)}. We will denote this parameter by t.
NVars(E): the maximum number of distinct variables appearing in a clause over all the clauses in E. In the example, NVars(H) = 3, corresponding to the variable set in the second clause {X, Y, Z}. We will denote this parameter by v.

Depth(E): the maximum depth of any functional term appearing in E. In the example, Depth(H) = 2, corresponding to the deepest term succ(X) (or succ(Z)). We will denote this parameter by d.

NLits(E): the maximum number of literals in a clause over all the clauses in E. In the example, NLits(H) = 2, from the second clause. We will denote this parameter by l.

NPreds(E): the number of distinct predicate symbols appearing in E. In the example, NPreds(H) = 1, corresponding to {add/3}. We will denote this parameter by p.

NFuncs(E): the number of distinct function symbols appearing in E. In the example, NFuncs(H) = 2, corresponding to {zero/0, succ/1}. We will denote this parameter by f.

Arity(E): the largest arity of predicate and function symbols appearing in E. In the example, Arity(H) = 3, corresponding to the predicate add/3. We will denote this parameter by a.

NClauses(E): the number of clauses in E. In our example, NClauses(H) = 2. We will denote this parameter by c.

The following relationships between complexity measures are obvious:
– DAGSize ≤ TreeSize
– NVars ≤ NTerms
– NLits ≤ NPreds · NTerms^Arity
– Depth ≤ NTerms

4 Relating Parameters to Size
While learnability is usually defined in terms of a notion of size, it may be useful to provide bounds using other measures (as various authors have done). We therefore need to extend the definitions of query complexity and learnability to refer to a set of parameters. This is done in a natural way so that query complexity measures each of the parameters, and learnability requires a polynomial bound in every parameter. However, this is not sufficient. We must also identify when such a replacement preserves polynomial learnability. For this we define:

Definition 3. Let C be a class of expressions. Let {C1, ..., Ck} be a list of complexity measures on expressions in C, and let {D1, ..., Dj} be an alternative list of complexity measures on expressions in C. We say that the two lists are polynomially related w.r.t. C if there exist polynomials p1, ..., pk of arity j and polynomials q1, ..., qj of arity k such that for every E ∈ C: (i) for all i = 1, ..., k: Ci(E) ≤ pi(D1(E), ..., Dj(E)), and (ii) for all i = 1, ..., j: Di(E) ≤ qi(C1(E), ..., Ck(E)).

The next proposition shows that alternative complexity measures (or combinations of them) can be used to get learnability results, as long as these measures are polynomially related to the notion of size. For simplicity we assume that both examples and hypotheses are drawn from the same class, as is the case, for example, when learning from entailment. The result for the general case follows along similar lines.

Proposition 1. Let C be a class of first-order expressions. Let C1, ..., Ck be a set of complexity measures that is polynomially related to Size w.r.t. the class C, where Size is some notion of size for the expressions in C. Let p1, ..., pk be polynomials in one argument and q be a polynomial in k arguments that together witness the polynomial relationship w.r.t. C. Suppose that A is a learning algorithm for C with query complexity (with respect to the alternative complexity measures C1, ..., Ck) bounded by polynomials si(c1, ..., ck, c′1, ..., c′k) for i = 1, ..., k, where c1, ..., ck bound the complexity measures C1, ..., Ck for target concepts and c′1, ..., c′k bound the complexity measures for counterexamples received. Then, A is a learning algorithm for C.

Proof. Notice that items (i) and (ii) from the definition of learnability hold trivially since we have assumed that A is a learning algorithm for C working in the same model. We show that item (iii) holds. Namely, there is a polynomial r(·, ·) s.t. at any stage, if n is the size of the longest counterexample received so far in response to an equivalence query, the query complexity of A at that stage does not exceed r(n, m). In the following, f(args) stands for f1(args), ..., fk(args). We define r(n, m) as q(s(p(m), p(n))). Observe that all the functions s1, ..., sk, p1, ..., pk and q are polynomials and hence r is a polynomial, too.
It is left to show that r bounds the query complexity for A. Notice that c ∈ Cm implies that c ∈ Cp(m) because p1(m), ..., pk(m) bound the complexity measures C1, ..., Ck. By hypothesis, the query complexity (for complexity measures C1, ..., Ck) of A is bounded by s(p(m), p(n)). Hence, the query complexity of A is bounded by q(s(p(m), p(n))).

Remark 1. Note that we require polynomial bounds in both directions to guarantee learnability. This is needed for learning with queries and for proper PAC learnability (where the hypothesis class is the same as the concept class), whereas a one sided bound will suffice for PAC predictability. It may be useful to highlight what can go wrong if this does not hold. In Figure 2 we can see three terms: t1 has TreeSize exponential in the depth while its DAGSize is just linear; t2 has both TreeSize and DAGSize exponential in the depth; finally t3 has both TreeSize and DAGSize linear in the depth. Now, if one has an algorithm that learns w.r.t. TreeSize then when learning an expression including t1 the algorithm is allowed to include t2 in a query, but this is not possible for learning w.r.t. DAGSize since the DAGSize of t1 is just polynomial in the depth whereas that of t2 is exponential. On the other hand, if one has an algorithm that learns w.r.t. DAGSize then when learning an expression including t3 the algorithm can use t1 in its query. If we try to use this algorithm to learn w.r.t. TreeSize this query is too large.
[Figure 2: three terms of depth d. t1: a complete tree of f nodes whose leaves are all the constant 1, with TreeSize = Θ(2^d) and DAGSize = Θ(d). t2: a complete tree of f nodes with distinct leaves, with TreeSize = Θ(2^d) and DAGSize = Θ(2^d). t3: a chain of g nodes ending in the constant 1, with TreeSize = Θ(d) and DAGSize = Θ(d).]
Fig. 2. The TreeSize of term t1 is exponentially larger than its DAGSize, whereas for terms t2 and t3 the two sizes are of the same order
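The gap between the two size measures for t1, and its absence for t3, can be reproduced with a short sketch. Terms are encoded as nested tuples (symbol, argument, ...), and DAGSize is computed as the number of structurally distinct subterms, which is what sharing common sub-structures amounts to for a single term. The encoding and the function names are assumptions made for this illustration, and "depth" here means the number of nested function applications.

```python
def tree_size(term):
    """Number of nodes when every occurrence of a subterm is counted."""
    return 1 + sum(tree_size(arg) for arg in term[1:])

def dag_size(term):
    """Number of distinct subterms, i.e. nodes when equal subterms are shared."""
    distinct = set()

    def visit(t):
        if t not in distinct:
            distinct.add(t)
            for arg in t[1:]:
                visit(arg)

    visit(term)
    return len(distinct)

def complete_tree(depth, arity=2):
    """Term like t1: f applied `depth` times, all leaves the constant 1."""
    term = ("1",)
    for _ in range(depth):
        term = ("f",) + (term,) * arity
    return term

def chain(depth):
    """Term like t3: g(g(...g(1)...)) with `depth` applications of g."""
    term = ("1",)
    for _ in range(depth):
        term = ("g", term)
    return term

for d in range(1, 6):
    t1, t3 = complete_tree(d), chain(d)
    print(d, tree_size(t1), dag_size(t1), tree_size(t3), dag_size(t3))
```

For the binary t1 the tree has 2^(d+1) − 1 nodes while the DAG has only d + 1, so a query containing t1 is cheap under DAGSize but exponentially expensive under TreeSize.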
4.1 Parameters for First Order Logic
The question is whether we can find a combination of the other parameters that is polynomially related to size. Suppose that E is a first-order Horn expression s.t. NTerms(E) = t, NVars(E) = v, Depth(E) = d, NLits(E) = l, NPreds(E) = p, NFuncs(E) = f, Arity(E) = a, and NClauses(E) = c.
Using TreeSize: Observe that any term appearing in E has size at most O(a^d). Hence, any atomic formula has size at most 1 + O(a^(d+1)) = O(a^(d+1)) (1 for the predicate symbol, a^(d+1) for the arguments). Hence, any Horn clause can have size no more than 1 + 2v + l · O(a^(d+1)) = O(v + l · a^(d+1)) (1 for the implication symbol in the clause, 2v for the quantifiers and quantified variables, and O(a^(d+1)) for each atom in the clause). Finally, TreeSize(E) = O(cv + c · l · a^(d+1)).
On the other hand, it is clear that all the parameters above are bounded by T reeSize(E). The next theorem shows that the converse does not hold: Theorem 1. T reeSize is not polynomially bounded by any combination of parameters that includes N T erms for classes over signatures with at least one constant and one function symbol of arity at least 2. Proof. We need to find some expression E such that its T reeSize is exponential in N T erms. Let E be the single clause p(t1 ), where t1 is a complete tree of degree a with internal nodes labeled with function symbol f and leaves labeled with constant 1 (see t1 in Figure 2 for tree representation, when a = 2, d = 3): d times a times p(f (.....f (f (f ( 1, .., 1 ), .., f (1, .., 1)), .., f (f (1, .., 1), .., f (1, .., 1)))...))
The complexity measures for E are:
  NTerms(E) = d    NVars(E) = 0      Depth(E) = d
  NLits(E) = 1     NPreds(E) = 1     NFuncs(E) = 2
  Arity(E) = a     NClauses(E) = 1   TreeSize(E) = $\Theta(a^d)$
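The $\Theta(a^d)$ entry follows from counting nodes level by level. As a sketch, assuming depth counts the edges from the root to a leaf (with the other convention the exponent shifts by one), level i of the complete tree holds $a^i$ nodes, so for $a \ge 2$:

```latex
\mathit{TreeSize}(t_1) \;=\; \sum_{i=0}^{d} a^{i} \;=\; \frac{a^{d+1}-1}{a-1} \;=\; \Theta(a^{d}) \qquad (a \ge 2)
```

All nodes on one level denote the same subterm, so the number of distinct subterms grows only linearly in d, which is what separates NTerms from TreeSize here.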
This is a surprising fact that has not been noticed in previous work using these parameters. No polynomial combination of the parameters above can replace tree size.
Remark 2. If we do not allow function symbols of arity greater than 1 then TreeSize = $O(clad)$. Hence, the parameters NClauses, NLits and Depth are polynomially related to TreeSize.
Size with Respect to Arity: On the other hand, exponential lower bounds in terms of arity have been derived when ignoring NLits. These essentially reflect the following fact:
Claim. If the number of literals is ignored then TreeSize and DAGSize are not polynomially bounded by Arity.
Proof. Let p be a predicate of arity a. Let {1, ..., t} be a set of t distinct terms built e.g. by one constant and one unary function. Let P be the set of all different p() atoms built from these terms; $|P| = t^a$. Let $\hat{p}$ be a particular element in P. Let E be the expression $E = P \setminus \{\hat{p}\} \to \hat{p}$. The complexity of E is given by:
  NTerms(E) = t       NVars(E) = 0      Depth(E) = t
  NLits(E) = $t^a$    NPreds(E) = 1     NFuncs(E) = 2
  Arity(E) = a        NClauses(E) = 1
  TreeSize(E) = $\Omega(t^a)$           DAGSize(E) = $\Omega(t^a)$
Hence, the tree size is exponential in the arity when l is ignored.
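The counting step in the claim is easy to replay on a small instance. The sketch below is illustrative only: the constant `1`, the unary function name `s`, and the choice of the last atom as the consequent are assumptions made for the example.

```python
from itertools import product

def ground_terms(t):
    """t distinct ground terms 1, s(1), s(s(1)), ... (names are hypothetical)."""
    terms, current = [], "1"
    for _ in range(t):
        terms.append(current)
        current = f"s({current})"
    return terms

def all_atoms(t, a, pred="p"):
    """Every atom p(u_1, ..., u_a) with arguments drawn from the t ground terms."""
    return [f"{pred}({', '.join(args)})"
            for args in product(ground_terms(t), repeat=a)]

def claim_clause(t, a):
    """The clause P \\ {p_hat} -> p_hat, returned as (antecedent atoms, consequent)."""
    atoms = all_atoms(t, a)
    head = atoms[-1]          # arbitrary choice of the distinguished atom
    return atoms[:-1], head

for a in (1, 2, 3, 4):
    body, head = claim_clause(3, a)
    print(a, len(body) + 1)   # number of literals, which equals 3 ** a
```

With t fixed, each extra argument position multiplies the number of literals by t, while NTerms stays at t.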
Complexity Parameters for First-Order Classes
Using DAGSize: As for TreeSize, DAGSize gives an upper bound for all other parameters. This time the relation in the other direction is also polynomial. Notice that a DAG encodes terms in a smarter way, since multiple occurrences of a term are only "stored" once. Hence, t terms in a clause need only Θ(t) DAGSize. An atomic formula contributes only 1 since its arguments are encoded with the terms. Hence, every clause has size at most O(v + t + l) = O(t + l) and c + l + t ≤ DAGSize(E) = O(ct + cl).
Theorem 2. The complexity measures given by t, l, c (following notation above) are polynomially related to DAGSize.
Notice that the theorem is true for any values of the other parameters. The previous claim shows DAGSize can be exponential in arity but, as the theorem shows, in such a case one of c, l, t must be large as well. It is also interesting to note that several results on learning with queries give upper bounds in terms of $t^a$ and other parameters [4, 25, 23, 2]. While $l \le p \cdot t^a$, these bounds do not directly relate to DAGSize or TreeSize.
5 The VC-Dimension of First Order Classes
We characterize the Vapnik-Chervonenkis dimension (VCDim) of first-order Horn expressions. Our lower bounds hold even for expressions which are both range restricted and constrained. This has direct implications for learnability since it is known that the VCDim provides tight bounds on the number of examples for PAC learning [5] as well as a lower bound for the number of equivalence and membership queries for exact learning [20]. As mentioned in the introduction, it is also known that for a finite class C, we have VCDim(C) ≤ log |C| and, assuming a reasonable encoding scheme, VCDim(C) = O(MaxSize) where MaxSize is the size of the largest concept in the class. The rest of this section shows that for our classes VCDim = Ω(cl + ct).
We start with the necessary definitions [5]. Let I be a set, $H \subseteq 2^I$, and $S \subseteq I$. Then $\Pi_H(S) = \{h \cap S \mid h \in H\}$ is the set of subsets of S that can be obtained by intersection with elements of H. If $|\Pi_H(S)| = 2^{|S|}$, then we say that H shatters S. Finally, VCDim(H) is the size of the largest set shattered by H (or ∞ if arbitrarily large sets are shattered). In our case I is a set of interpretations, and H is some class of first-order Horn expressions interpreted under |=. We identify every h ∈ H with the set of interpretations that satisfy h.
Theorem 3. There exists a set of c interpretations that can be shattered using first-order Horn expressions bounded by NClauses ≤ c, NTerms ≤ log c + 3, NLits = 2, NVars = 0, Depth = log c, Arity = 2, NFuncs = 4 and NPreds = 2.
Proof. We construct a set of c different terms using a function f of arity 2 and three constants 1, 2 and 3 and by forming ground terms of depth log c in the following manner:
$\hat{T} = \{ f(a_1, f(a_2, f(a_3, f(\ldots f(a_{\log c}, 3) \ldots)))) \mid a_i \in \{1, 2\} \text{ for all } 1 \le i \le \log c \}$
Notice that there are exactly $2^{\log c} = c$ such terms. Moreover, every term in $\hat{T}$ is of size 2 log c + 1 and contains at most log c + 3 distinct subterms. Each interpretation $I_{\hat{t}}$ in the set of interpretations I to be shattered contains in its extension a single atom $P(\hat{t})$ where $\hat{t} \in \hat{T}$. Hence, $|I| = |\hat{T}| = c$. In addition, the domain of the interpretation $I_{\hat{t}}$ consists of the Θ(log c) objects corresponding to the subterms appearing in $\hat{t}$ (including itself) and a distinguished object ∗. The function mapping for f is defined to follow the functional structure of the distinguished term $\hat{t}$; undefined entries are mapped to ∗. Notice that any term $t \in \hat{T}$ s.t. $t \ne \hat{t}$ is mapped to the special object ∗ under the interpretation $I_{\hat{t}}$. Now, we define the Horn expression $H_S$ that separates any arbitrary subset S ⊆ I as $H_S = \{ P(\hat{t}) \to F \mid I_{\hat{t}} \in S \}$. Any interpretation in S falsifies one of the clauses in $H_S$, and hence falsifies the whole Horn expression; any interpretation not in S falsifies every clause's antecedent in $H_S$ since the term present in the clause is mapped to the special object ∗, which does not appear in any of the interpretations' extensions.
A VCDim construction in [15] uses a signature that grows with NTerms. The following theorem modifies this construction to use a fixed signature.
Theorem 4. For $l \le t^a$, there exists a set of l interpretations that can be shattered using first-order Horn expressions bounded by NTerms = 2t, NVars ≤ t, Depth = log t, NLits ≤ l, NPreds = 3, NFuncs = 1, Arity ≤ a and NClauses = 1.
Proof. We construct a set of interpretations I that is shattered using first-order Horn expressions with parameters as stated. Fix a and t.
The expressions use a 0-ary predicate F, a unary predicate L and a predicate symbol Q of arity $\log_t l$. Let $Q_{all} = \{ Q(i_1, \ldots, i_{\log_t l}) \mid i_j \in \{1, .., t\} \text{ for all } j = 1, \ldots, \log_t l \}$. Notice that $|Q_{all}| = t^{\log_t l} = l$. Let f be a binary function, and let $\hat{t}$ be the term represented by a binary balanced tree of depth log t whose leaves are labeled by the objects 1 . . . t (in order) and whose internal nodes are labeled by the function symbol f. Such a term contains 2t subterms. The domain for all the interpretations in I includes objects {1, .., t}, an object for each subterm of $\hat{t}$, and a special object ∗. The function mappings for f follow the functional structure of $\hat{t}$ with undefined entries completed by the special domain object ∗. Interpretations include in their extension the atom $L(\hat{t})$ and all the atoms in $Q_{all}$ except one. Hence, there are l interpretations in I.
The expression that separates an arbitrary S ⊆ I is $H_S = C_S \to F$ where $C_S$ is the intersection of the Q() atoms in the extensions of all the interpretations in S plus the atom $L(\hat{t})$ after substituting every domain object j ∈ {1, .., t} by a corresponding variable $x_j$. Suppose I ∈ S. Take the substitution $\{x_j \mapsto j\}$. Then I falsifies $H_S$ because its antecedent $C_S$ is satisfied (it is a subset of the extension of I) and its consequent F is falsified. Suppose on the other hand that I ∉ S. Substitutions other than $\{x_j \mapsto j\}$ will falsify the antecedent of $H_S$ because of the atom $L(\hat{t})$. The clause $H_S$ is satisfied under the substitution $\{x_j \mapsto j\}$ because the "omitted Q" in I's extension is present in $C_S$.
Theorem 5. For $l \le t^a$, there exists a set of cl interpretations that can be shattered using first-order Horn expressions bounded by NClauses ≤ c, NTerms = Θ(log c + t), NLits ≤ l, NVars ≤ t, Depth = Θ(log c + log t), Arity ≤ a, NFuncs = 5 and NPreds = 3.
Proof. Let I be the set shattered in Theorem 4. We create a new set of interpretations $I^+$ of cardinality cl in the following way. We have an additional set of c terms constructed in the same way as in Theorem 3; let us denote this set $\hat{T}_c$. As in Theorem 3, $\hat{T}_c$ contains c distinct terms of depth log c each. We augment the interpretations in the construction of Theorem 4 by associating I ∈ I with a new term in $\hat{T}_c$ (and hence we create c new interpretations in $I^+$ for each old interpretation in I), adding log c new objects and the corresponding functional mappings following the terms' structure, completing undefined entries with the special object ∗. Additionally, we include the atom F(∗) in each of the interpretations' extensions (notice that a term c will evaluate to ∗ in the interpretations which do not have c as their distinguished term). Hence $|I^+| = cl$.
The new expression separating an arbitrary subset $S \subseteq I^+$ is $H_S = \{ C_{S_{\hat{c}}} \to F(\hat{c}) \mid \hat{c} \in \hat{T}_c \}$, where $S_{\hat{c}}$ is the subset of interpretations in S with distinguished term $\hat{c}$ and $C_{S_{\hat{c}}}$ is constructed as in Theorem 4. We finally prove that I falsifies $H_S$ iff I ∈ S. Suppose that $\hat{c}$ is the distinguished term in $\hat{T}_c$ associated to I. Terms $c \ne \hat{c}$ evaluate to ∗ under I, and every clause with consequent other than $F(\hat{c})$ in $H_S$ is hence satisfied. The clause containing $F(\hat{c})$ is falsified iff $I \in S_{\hat{c}}$ by the same reasoning as in Theorem 4.
The next result shows that by varying the number of terms we can shatter arbitrarily large sets with a fixed signature.
Theorem 6. There exists a set of t interpretations that can be shattered using Horn expressions bounded by NClauses = 1, NTerms ≤ 4t, NLits = 2, NVars = 0, Depth = 2 log t + 2, Arity = 2, NFuncs ≤ 9 and NPreds = 2.
Proof. Let t = k log k for some $k \in \mathbb{N}$. Using the same signature as in Theorem 3 we generate a set $\hat{T}$ of k terms of depth log k each. We associate to every interpretation a term in $\hat{T}$ and an index i ∈ {1, .., log k} and we denote by $I_{\hat{t},i}$ the interpretation associated to $(\hat{t}, i) \in \hat{T} \times \{1, .., \log k\}$. Thus, we have a set of interpretations I s.t. $|I| = |\hat{T}| \cdot |\{1, .., \log k\}| = k \log k = t$. Given a subset S ⊆ I, we will construct a big term $TREE_S$ which intuitively associates to every possible term $\hat{t}$ in $\hat{T}$ a set of indices $l_{\hat{t}}$ where $l_{\hat{t}} = \{ i \mid I_{\hat{t},i} \in S \}$. We will then appropriately define the function mappings in each interpretation $I_{\hat{t},i}$ so that the term $TREE_S$ evaluates to a special domain object y iff index i appears in the set of indices for term $\hat{t}$ encoded in $TREE_S$. Each interpretation will include in its extension the atom M(y) so that the clause $H_S = M(TREE_S) \to F$ is falsified by interpretation I iff the term $TREE_S$ evaluates to y under I.
We first describe the structure of the term $TREE_S$. Let $S_{\hat{t}}$ be the subset of S consisting of interpretations $I_{\hat{t},i}$ in S and let $l_{\hat{t}} = \{ i \mid I_{\hat{t},i} \in S_{\hat{t}} \}$. We encode the set $l_{\hat{t}}$ with the term $f_{i_1}(f_{i_2}(\cdots f_{i_{\log k}}(a) \cdots))$ where $i_j = 1$ if $j \in l_{\hat{t}}$ and $i_j = 0$ otherwise. Denote this term by $t_{l_{\hat{t}}}$. As an example, assume log k = 6 and let the set $l_{\hat{t}} = \{1, 4, 5\}$. Then, $t_{l_{\hat{t}}} = f_1(f_0(f_0(f_1(f_1(f_0(a))))))$. Notice that we are using two unary functions $f_0$ and $f_1$ and a constant a. Next we use a binary function g to encode the association between terms $\hat{t}$ and their sets of indices $l_{\hat{t}}$ as $g(\hat{t}, t_{l_{\hat{t}}})$. Finally, $TREE_S$ is constructed as a balanced tree (using binary function h) whose leaves are terms of the form $g(\hat{t}, t_{l_{\hat{t}}})$, for every $\hat{t} \in \hat{T}$.
As an example, suppose k = 4. Then $\hat{T} = \{\hat{t}_1, \hat{t}_2, \hat{t}_3, \hat{t}_4\}$, where $\hat{t}_1 = f(1, f(1, 3))$, $\hat{t}_2 = f(1, f(2, 3))$, $\hat{t}_3 = f(2, f(1, 3))$ and $\hat{t}_4 = f(2, f(2, 3))$. Suppose $S = \{(\hat{t}_1, 1), (\hat{t}_2, 2), (\hat{t}_3, 1), (\hat{t}_3, 2)\}$. Then:
– $l_{\hat{t}_1} = \{1\}$, $l_{\hat{t}_2} = \{2\}$, $l_{\hat{t}_3} = \{1, 2\}$ and $l_{\hat{t}_4} = \{\}$.
– $t_{l_{\hat{t}_1}} = f_1(f_0(a))$, $t_{l_{\hat{t}_2}} = f_0(f_1(a))$, $t_{l_{\hat{t}_3}} = f_1(f_1(a))$ and $t_{l_{\hat{t}_4}} = f_0(f_0(a))$.
– $TREE_S = h(h(g(\hat{t}_1, f_1(f_0(a))), g(\hat{t}_2, f_0(f_1(a)))), h(g(\hat{t}_3, f_1(f_1(a))), g(\hat{t}_4, f_0(f_0(a)))))$
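The encoding used in this example can be generated mechanically. The sketch below renders terms as strings; the function names are ours, the bit convention (an $f_1$ at position j means j belongs to the index set) is read off the worked examples above, and the left-to-right order of the leaves is an assumption.

```python
def encode_index_set(indices, m):
    """Unary term f_{i_1}(f_{i_2}(...f_{i_m}(a)...)) with i_j = 1 iff j is in indices."""
    term = "a"
    for j in range(m, 0, -1):          # build from the innermost position outwards
        bit = 1 if j in indices else 0
        term = f"f{bit}({term})"
    return term

def balanced(leaves):
    """Balanced binary tree over the leaves, with internal nodes labelled h."""
    if len(leaves) == 1:
        return leaves[0]
    mid = len(leaves) // 2
    return f"h({balanced(leaves[:mid])}, {balanced(leaves[mid:])})"

def tree_term(distinguished, index_sets, m):
    """TREE_S: one leaf g(t, encoded index set of t) per distinguished term t."""
    leaves = [f"g({t}, {encode_index_set(index_sets.get(t, set()), m)})"
              for t in distinguished]
    return balanced(leaves)

# The k = 4 example: log k = 2 index positions per distinguished term.
T = ["f(1, f(1, 3))", "f(1, f(2, 3))", "f(2, f(1, 3))", "f(2, f(2, 3))"]
S = {T[0]: {1}, T[1]: {2}, T[2]: {1, 2}}   # T[3] has the empty index set
print(encode_index_set({1, 4, 5}, 6))
print(tree_term(T, S, 2))
```

The resulting term has k leaves, each of size O(log k), which is where the Θ(k log k) subterm count of the construction comes from.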
Let us now describe in detail the domain and function mappings for interpretation $I_{\hat{t},i}$. The domain objects are:
– Three special objects ∗, y, n.
– Up to log k + 3 distinct objects that represent all terms and subterms present in the distinguished term $\hat{t}$.
– Up to 2k + 1 objects representing all the possible terms and subterms of the vector indices $f_{i_1}(f_{i_2}(\cdots f_{i_{\log k}}(a) \cdots))$ for all possible $i_j \in \{0, 1\}$ where 1 ≤ j ≤ log k.
The function mappings are defined as follows:
– The constants 1, 2, 3 potentially appearing in $\hat{t}$ are mapped to objects 1, 2, 3. The mapping for binary function f follows the functional structure of $\hat{t}$, with undefined entries mapped to the special object ∗.
– The constant a is mapped to object a. Unary functions $f_0$ and $f_1$ also mimic the functional structure of terms and subterms of $f_{i_1}(f_{i_2}(\cdots f_{i_{\log k}}(a) \cdots))$ for all possible $i_j \in \{0, 1\}$ where 1 ≤ j ≤ log k.
– The binary function $g(t_1, t_2)$ is mapped to special object y iff $t_1 = \hat{t}$ and the unary function used at depth i in term $t_2$ is $f_1$. Otherwise it is set to special object n. Note that there is a 1-1 correspondence between objects in the interpretations and terms of the form $f_{i_1}(f_{i_2}(\cdots f_{i_{\log k}}(a) \cdots))$ and this function mapping is "legal" in the sense that it does not depend on argument structure.
– Finally, the binary function $h(a_1, a_2)$ is mapped to domain object y iff either $a_1 = y$ or $a_2 = y$; otherwise it is mapped to object n.
Finally, the only atom true in each interpretation is M(y). We prove that $I_{\hat{t},i}$ falsifies $H_S$ iff $I_{\hat{t},i} \in S$. Notice that $I_{\hat{t},i}$ falsifies $H_S$ iff $I_{\hat{t},i}$ satisfies the atom $M(TREE_S)$ iff the term $TREE_S$ is mapped to the domain object y under $I_{\hat{t},i}$ iff some term $g(t_1, t_2)$ is mapped to y iff term $g(\hat{t}, t_2)$ is mapped to y (other terms $g(t_1, t_2)$ where $t_1 \ne \hat{t}$ are mapped to n by construction) iff the unary function used at depth i in term $t_2$ is $f_1$ iff $I_{\hat{t},i} \in S$.
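The chain of equivalences above can be checked exhaustively on a small instance. The following is a minimal sketch, assuming k = 4 (so log k = 2 and there are 8 interpretations and 256 subsets S); the nested-tuple encoding and all function names are ours, domain objects are represented by the terms they stand for, and the unary mappings are simplified to mirror term structure without any depth cap.

```python
from itertools import combinations, product

def distinguished_terms(m):
    """Terms f(a_1, f(a_2, ... f(a_m, 3))) with a_i in {1, 2}, as nested tuples."""
    terms = []
    for choice in product((1, 2), repeat=m):
        t = 3
        for a in reversed(choice):
            t = ("f", a, t)
        terms.append(t)
    return terms

def subterms(t):
    found = {t}
    if isinstance(t, tuple):
        for arg in t[1:]:
            found |= subterms(arg)
    return found

def index_term(indices, m):
    """Unary chain whose j-th symbol from the outside is f1 iff j is in indices."""
    t = "a"
    for j in range(m, 0, -1):
        t = ("f1" if j in indices else "f0", t)
    return t

def tree_term(S, terms, m):
    """TREE_S as a balanced h-tree (len(terms) must be a power of two)."""
    layer = [("g", t, index_term({i for (u, i) in S if u == t}, m)) for t in terms]
    while len(layer) > 1:
        layer = [("h", layer[n], layer[n + 1]) for n in range(0, len(layer), 2)]
    return layer[0]

def evaluate(term, t_hat, i):
    """Value of a ground term in the interpretation I_{t_hat, i}."""
    if not isinstance(term, tuple):
        return term                                  # constants 1, 2, 3 and a
    symbol = term[0]
    args = [evaluate(x, t_hat, i) for x in term[1:]]
    if symbol == "f":                                # follows the structure of t_hat
        obj = ("f", *args)
        return obj if obj in subterms(t_hat) else "*"
    if symbol in ("f0", "f1"):                       # objects mirror the unary terms
        return (symbol, args[0])
    if symbol == "g":
        first, second = args
        for _ in range(i - 1):                       # descend to depth i
            second = second[1]
        return "y" if first == t_hat and second[0] == "f1" else "n"
    if symbol == "h":
        return "y" if "y" in args else "n"
    raise ValueError(symbol)

def falsifies(S, t_hat, i, terms, m):
    """I_{t_hat, i} falsifies M(TREE_S) -> F iff TREE_S evaluates to y."""
    return evaluate(tree_term(S, terms, m), t_hat, i) == "y"

def shatters(m):
    terms = distinguished_terms(m)
    interpretations = [(t, i) for t in terms for i in range(1, m + 1)]
    for r in range(len(interpretations) + 1):
        for chosen in combinations(interpretations, r):
            S = set(chosen)
            for (t, i) in interpretations:
                if falsifies(S, t, i, terms, m) != ((t, i) in S):
                    return False
    return True

print(shatters(2))
```

The check enumerates every subset, so it is only practical for very small k; it is meant as a sanity check of the mappings, not as a proof.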
We finally quantify the complexity of the parameters used in $H_S$: it has 1 clause, 2 literals, no variables, and uses one single term of depth Θ(log k) (that is, O(log t)) which contains Θ(k log k) subterms (that is, Θ(t) subterms) built from 4 constants and 5 function symbols whose maximal arity is 2.
Theorem 7. There exists a set of ct interpretations that can be shattered using Horn expressions bounded by NClauses ≤ c, NTerms = Θ(t + log c), NLits = 2, NVars = 0, Depth = O(log t + log c), Arity = 2, NFuncs ≤ 9 and NPreds = 3.
Proof. We extend the previous construction. Let I be the set shattered in Theorem 6. We create a new set of interpretations $I^+$ of cardinality ct in the following way. We have an additional set of c terms constructed in the same way as in Theorem 3 but using as constants 1, 2, 3 and as binary function g; let us denote this set $\hat{T}_c$. As in Theorem 3, $\hat{T}_c$ contains c distinct terms of depth log c each.
Notice that we can safely re-use the constants 1, 2, 3 and the function g since these are not combined in the previous construction. As before, we augment the interpretations in the construction of Theorem 6 by associating I ∈ I with a new term in $\hat{T}_c$ (and hence we create c new interpretations in $I^+$ for each old interpretation in I), adding log c new objects and the corresponding functional mappings following the term's structure. Hence $|I^+| = ct$. In addition we modify the predicate M, which now has arity 2. The only atom true in I is $M(\hat{c}, y)$ where $\hat{c}$ is the distinguished term in $\hat{T}_c$ associated with I. The new expression separating an arbitrary subset $S \subseteq I^+$ is:
$H_S = \{ M(\hat{c}, TREE_{S_{\hat{c}}}) \to F \mid \hat{c} \in \hat{T}_c \}$,
where $S_{\hat{c}}$ is the subset of interpretations in S with distinguished term $\hat{c}$. We finally prove that I falsifies $H_S$ iff I ∈ S. Suppose that $\hat{c}$ is the distinguished term in $\hat{T}_c$ associated to I. I contains the atom $M(\hat{c}, y)$ in its extension, and every clause $M(c', TREE_{S_{c'}}) \to F$ in $H_S$ s.t. $c' \ne \hat{c}$ is satisfied since term $c'$ does not evaluate to domain object $\hat{c}$ under I. The clause $M(\hat{c}, TREE_{S_{\hat{c}}}) \to F$ is falsified iff $I \in S_{\hat{c}}$ by the same reasoning as in Theorem 6.
It is easy to see that the constructions given above can be modified by adding dummy arguments in the antecedent and consequent so that the expressions used to shatter the given sets are both range restricted and constrained. Thus we get:
Corollary 1. Let S be a signature with at least 9 function symbols, 3 predicates and arity at least 2. The VC Dimension of the class of range restricted and constrained first-order Horn expressions over S with at most c clauses, each using up to l literals and t + log c terms, is Θ(cl + ct).
References
[1] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, April 1988.
[2] Marta Arias and Roni Khardon. Learning closed Horn expressions. Information and Computation, pages 214–240, 2002.
[3] Marta Arias, Roni Khardon, and Rocco A. Servedio. Polynomial certificates for propositional classes. Proceedings of the Conference on Computational Learning Theory, 2003.
[4] Hiroki Arimura. Learning acyclic first-order Horn sentences from entailment. In Proceedings of the International Conference on Algorithmic Learning Theory, Sendai, Japan, 1997. Springer-Verlag. LNAI 1316.
[5] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, October 1989.
[6] W. Cohen. PAC-learning recursive logic programs: Efficient algorithms. Journal of Artificial Intelligence Research, 2:501–539, 1995.
[7] L. De Raedt and S. Dzeroski. First order jk-clausal theories are PAC-learnable. Artificial Intelligence, 70:375–392, 1994.
[8] Sašo Džeroski, Stephen Muggleton, and Stuart Russell. PAC-learnability of determinate logic programs. In David Haussler, editor, Proceedings of the Conference on Computational Learning Theory, pages 128–135, Pittsburgh, PA, July 1992. ACM Press.
[9] Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–251, September 1989.
[10] M. Frazier and L. Pitt. Learning from entailment: An application to propositional Horn sentences. In Proceedings of the International Conference on Machine Learning, pages 120–127, Amherst, MA, 1993. Morgan Kaufmann.
[11] Martin Grohe and György Turán. Learnability and definability in trees and similar structures. In Proceedings of the 19th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pages 645–658. Springer, 2002. LNCS 2285.
[12] T. Hegedus. On generalized teaching dimensions and the query complexity of learning. In Proceedings of the Conference on Computational Learning Theory, pages 108–117, New York, NY, USA, July 1995. ACM Press.
[13] Lisa Hellerstein, Krishnan Pillaipakkamnatt, Vijay Raghavan, and Dawn Wilkins. How many queries are needed to learn? Journal of the ACM, 43(5):840–862, September 1996.
[14] Tamás Horváth and György Turán. Learning logic programs with structured background knowledge. Artificial Intelligence, 128(1-2):31–97, May 2001.
[15] R. Khardon. Learning function free Horn expressions. Machine Learning, 37:241–275, 1999.
[16] R. Khardon. Learning range restricted Horn expressions. In Proceedings of the Fourth European Conference on Computational Learning Theory, pages 111–125, Nordkirchen, Germany, 1999. Springer-Verlag. LNAI 1572.
[17] Jörg-Uwe Kietz and Saso Dzeroski. Inductive logic programming and learnability. SIGART Bulletin, 5(1):22–32, January 1994.
[18] J.W. Lloyd. Foundations of Logic Programming. Springer Verlag, 1987.
[19] W. Maass and Gy. Turan. On learnability and predicate logic (extended abstract). In Proceedings of the 4th Bar-Ilan Symposium on Foundations of AI (BISFAI), 1995.
[20] Wolfgang Maass and György Turán. Lower bound methods and separation results for on-line learning models. Machine Learning, 9:107–145, 1992.
[21] S. Muggleton and C. Feng. Efficient induction of logic programs. In S. Muggleton, editor, Inductive Logic Programming, pages 281–298. Academic Press, 1992.
[22] S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19 and 20:629–680, May 1994.
[23] K. Rao and A. Sattar. Learning from entailment of logic programs with local variables. In Proceedings of the International Conference on Algorithmic Learning Theory, Otzenhausen, Germany, 1998. Springer-Verlag. LNAI 1501.
[24] C. Reddy and P. Tadepalli. Learning Horn definitions with equivalence and membership queries. In International Workshop on Inductive Logic Programming, pages 243–255, Prague, Czech Republic, 1997. Springer. LNAI 1297.
[25] C. Reddy and P. Tadepalli. Learning first order acyclic Horn programs from entailment. In International Conference on Inductive Logic Programming, pages 23–37, Madison, WI, 1998. Springer. LNAI 1446.
A Multi-relational Decision Tree Learning Algorithm - Implementation and Experiments Anna Atramentov, Hector Leiva, and Vasant Honavar Artificial Intelligence Research Laboratory Computer Science Department 226 Atanasoff Hall, Iowa State University Ames, IA 50011-1040, USA {anjuta,aleiva,honavar}@cs.iastate.edu
Abstract. We describe an efficient implementation (MRDTL-2) of the Multi-relational decision tree learning (MRDTL) algorithm [23], which in turn was based on a proposal by Knobbe et al. [19]. We describe some simple techniques for speeding up the calculation of sufficient statistics for decision trees and related hypothesis classes from multi-relational data. Because missing values are fairly common in many real-world applications of data mining, our implementation also includes some simple techniques for dealing with missing values. We describe results of experiments with several real-world data sets from the KDD Cup 2001 data mining competition and PKDD 2001 discovery challenge. Results of our experiments indicate that MRDTL is competitive with the state-of-the-art algorithms for learning classifiers from relational databases.
1 Introduction
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 38–56, 2003. © Springer-Verlag Berlin Heidelberg 2003
Recent advances in high throughput data acquisition, digital storage, and communications technologies have made it possible to gather very large amounts of data in many scientific and commercial domains. Much of this data resides in relational databases. Even when the data repository is not a relational database, it is often convenient to view heterogeneous data sources as if they were a collection of relations [28] for the purpose of extracting and organizing information from multiple sources. Thus, the task of learning from relational data has begun to receive significant attention in the literature [1,19,11,21,22,12,18,26,9,8,14,16].
Knobbe et al. [19] outlined a general framework for multi-relational data mining which exploits structured query language (SQL) to gather the information needed for constructing classifiers (e.g., decision trees) from multi-relational data. Based on this framework, Leiva [23] developed a multi-relational decision tree learning algorithm (MRDTL). Experiments reported by Leiva [23] have shown that decision trees constructed using MRDTL have accuracies that are comparable to those obtained using other algorithms on several multi-relational data sets. However, MRDTL has two significant limitations from the standpoint of multi-relational data mining from large, real-world data sets:
(a) Slow running time: MRDTL (like other algorithms based on the multi-relational data mining framework proposed by Knobbe et al. [19]) uses selection graphs to query the relevant databases to obtain the statistics needed for constructing the classifier. Our experiments with MRDTL on data from KDD Cup 2001 [5] showed that the execution of queries encoded by such selection graphs is a major bottleneck in terms of the running time of the algorithm.
(b) Inability to handle missing attribute values: In multi-relational databases encountered in many real-world applications of data mining, a significant fraction of the data have one or more missing attribute values. For example, in the case of the gene localization task from KDD Cup 2001 [5], 70% of CLASS, 50% of COMPLEX and 50% of MOTIF attribute values are missing. Leiva's implementation of MRDTL [23] treats each missing value as a special value ("missing") and does not include any statistically well-founded techniques for dealing with missing values. Consequently, the accuracy of decision trees constructed using MRDTL is far from satisfactory on classification tasks in which missing attribute values are commonplace. For example, the accuracy of MRDTL on the gene localization task was approximately 50%.
Against this background, this paper describes MRDTL-2, which attempts to overcome these limitations of Leiva's implementation of MRDTL:
(a) MRDTL-2 includes techniques for significantly speeding up some of the most time consuming components of multi-relational data mining algorithms like MRDTL that rely on the use of selection graphs.
(b) MRDTL-2 includes a simple and computationally efficient technique which uses Naive Bayes classifiers for 'filling in' missing attribute values.
These enhancements enable us to apply multi-relational decision tree learning algorithms to significantly more complex classification tasks involving larger data sets and a larger percentage of missing attribute values than was feasible in the case of MRDTL. Our experiments with several classification tasks drawn from KDD Cup 2001 [5], PKDD 2001 [7] and the widely studied Mutagenesis data set [25] show that MRDTL-2
(a) significantly outperforms MRDTL in terms of running time,
(b) yields results that are comparable to the best reported results obtained using multi-relational data mining algorithms, and
(c) compares favorably with feature-based learners that are based on clever propositionalization methods [22].
The rest of the paper is organized as follows: Section 2 reviews the multi-relational data-mining framework; Section 3 describes our implementation of MRDTL-2, a multi-relational decision tree learning algorithm; Section 4 describes the results of our experiments with MRDTL-2 on several representative multi-relational data mining tasks and compares them with the results of other approaches available in the literature; Section 5 concludes with a brief summary and discussion of the main results and an outline of some directions for further research.
2 Multi-relational Data Mining
2.1 Relational Databases
A relational database consists of a set of tables $D = \{X_1, X_2, \ldots, X_n\}$ and a set of associations between pairs of tables. In each table, a row represents the description of one record. A column represents values of some attribute for the records in the table. An attribute A from table X is denoted by X.A.
Definition 1. The domain of the attribute X.A is denoted as DOM(X.A) and is defined as the set of all different values that the records from table X have in the column of attribute A.
Associations between tables are defined through primary and foreign key attributes.
Definition 2. A primary key attribute of table X, denoted as X.ID, has a unique value for each row in this table.
Definition 3. A foreign key attribute in table Y referencing table X, denoted as Y.X_ID, takes values from DOM(X.ID).
An example of a relational database is shown in Figure 1. There are three tables and three associations between tables. The primary keys of the tables GENE, COMPOSITION, and INTERACTION are GENE_ID, C_ID, and I_ID, respectively. Each COMPOSITION record references some GENE record through the foreign key COMPOSITION.GENE_ID, and each INTERACTION record references two GENE records through the foreign keys INTERACTION.GENE_ID1 and INTERACTION.GENE_ID2.
  GENE:        GENE_ID, ESSENTIAL, CHROMOSOME, LOCALIZATION
  COMPOSITION: GENE_ID, CLASS, COMPLEX, PHENOTYPE, MOTIF, C_ID
  INTERACTION: GENE_ID1, GENE_ID2, TYPE, EXPRESSION_CORR, I_ID
Fig. 1. Example database
In this setting the attribute of interest (e.g., class label) is called the target attribute, and the table in which this attribute is stored is called the target table and is denoted by $T_0$. Each record in $T_0$ corresponds to a single object. Additional information about an object is stored in other tables of the database that can be looked up by following the associations between tables.
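The definitions above can be made concrete with a small in-memory database. The following is a sketch only, using Python's standard `sqlite3` module; the column types and all sample rows are invented for illustration and are not taken from the KDD Cup data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GENE (
    GENE_ID      TEXT PRIMARY KEY,
    ESSENTIAL    TEXT,
    CHROMOSOME   INTEGER,
    LOCALIZATION TEXT               -- target attribute; GENE is the target table
);
CREATE TABLE COMPOSITION (
    GENE_ID   TEXT REFERENCES GENE(GENE_ID),
    CLASS     TEXT,
    COMPLEX   TEXT,
    PHENOTYPE TEXT,
    MOTIF     TEXT,
    C_ID      INTEGER PRIMARY KEY
);
CREATE TABLE INTERACTION (
    GENE_ID1        TEXT REFERENCES GENE(GENE_ID),
    GENE_ID2        TEXT REFERENCES GENE(GENE_ID),
    TYPE            TEXT,
    EXPRESSION_CORR REAL,
    I_ID            INTEGER PRIMARY KEY
);
-- Invented sample rows.
INSERT INTO GENE VALUES ('G1', 'yes', 5, 'nucleus'), ('G2', 'no', 11, 'cytoplasm');
INSERT INTO COMPOSITION VALUES ('G1', 'c1', NULL, NULL, 'm1', 1);
INSERT INTO INTERACTION VALUES ('G1', 'G2', 'Genetic', 0.3, 1);
""")

def dom(table, attribute):
    """DOM(X.A): the set of distinct values in column A of table X."""
    rows = conn.execute(f"SELECT DISTINCT {attribute} FROM {table}")
    return {row[0] for row in rows}

# A foreign key takes its values from the domain of the referenced primary key.
print(dom("COMPOSITION", "GENE_ID") <= dom("GENE", "GENE_ID"))

# Following the associations: genes that G1 interacts with.
partners = conn.execute(
    "SELECT g2.GENE_ID, g2.CHROMOSOME FROM GENE g1 "
    "JOIN INTERACTION i ON i.GENE_ID1 = g1.GENE_ID "
    "JOIN GENE g2 ON g2.GENE_ID = i.GENE_ID2 "
    "WHERE g1.GENE_ID = ?", ("G1",)
).fetchall()
print(partners)
```

Each row of GENE is one object to classify; the joins show how additional information about that object is reached through the key attributes.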
2.2 Multi-relational Data Mining Framework
The multi-relational data mining framework is based on the search for interesting patterns in the relational database, where multi-relational patterns can be viewed as "pieces of substructure encountered in the structure of the objects of interest" [19].
Definition 4. ([19]) A multi-relational object is covered by a multi-relational pattern iff the substructure described by the multi-relational pattern, in terms of both attribute-value conditions and structural conditions, occurs at least once in the multi-relational object.
Multi-relational patterns can also be viewed as subsets of the objects from the database having some property. The most interesting subsets are chosen according to some measure (e.g., information gain for a classification task), which guides the search in the space of all patterns. The search for interesting patterns usually proceeds by top-down induction. For each interesting pattern, subpatterns are obtained with the help of a refinement operator, which can be seen as a further division of the set of objects covered by the initial pattern. Top-down induction of interesting patterns proceeds by recursively applying such refinement operators to the best patterns. The multi-relational pattern language is defined in terms of selection graphs and refinements, which are described in the following sections.
2.3 Selection Graphs
Multi-relational patterns are expressed in a graphical language of selection graphs [20]. Definition 5. ([19]) A selection graph S is a directed graph S = (N, E). N represents the set of nodes in S in the form of tuples (X, C, s, f ), where X is a table from D, C is the set of conditions on attributes in X (for example, X.color = ‘red’ or X.salary > 5,000), s is a flag with possible values open and closed, and f is a flag with possible values front and back. E represents edges in S in the form of tuples (p, q, a, e), where p and q are nodes and a is a relation between p and q in the data model (for example, X.ID = Y.X ID), and e is a flag with possible values present and absent. The selection graph should contain at least one node n0 that corresponds to the target table T0 . An example of the selection graph for the data model from Figure 1 is shown in Figure 2. This selection graph corresponds to those GENE(s) that belong to chromosome number 5, that have at least one INTERACTION record of type ’Genetic’ with a corresponding GENE on chromosome number 11, but for which none of the INTERACTION records have type value ’Genetic’ and expression corr value 0.2. In this example the target table is GENE, and within GENE the target attribute is LOCALIZATION. In graphical representation of a selection graph, the value of s is represented by the absence or presence of a cross in the node, representing values open
42
Anna Atramentov et al. INTERACTION
GENE
INTERACTION.gene_id1 = GENE.gene_id
GENE
= id2 ene_ ne_id E.ge ION.g GEN RACT Type E INT
GE N INT E.ge ER ne_i AC d TIO = N.g Chromosome=5 en
e_i
d2
= ’Genetic’
Chromosome=11
INTERACTION
Type = ’Genetic’ and Expression_corr = 0.2
Fig. 2. Selection graph corresponding to those GENEs which belong to chromosome number 5, that have at least one interaction of type ’Genetic’ with a corresponding gene on chromosome number 11 but for which none of the interaction records are of the type ’Genetic’ and expression corr value 0.2 TRANSLATE(S, key) Input Selection graph S, key (primary or foreign) in the target node of the selection graph S Output SQL query for objects covered by selection graph S 1 table list := 2 condition list := 3 join list := 4 for each node i in S do 5 if (i.s = open and i.f = front ) 6 table list.add(i.table name + T + i) 7 for each condition c in i do 8 condition list.add(c) 9 for each edge j in S do 10 if (j.e = present ) 11 if (j.q.s = open and j.q.f = front ) 12 join list.add(j.a) 13 else join list.add(j.p + . + j.p.primary key + not in + TRANSLATE( subgraph(S, j.q), j.q.key) 15 return select distinct + T0 . + key + from + table list + where + join list + and + condition list
Fig. 3. Translation of selection graph into SQL query

and closed, respectively. The value for e, in turn, is indicated by the presence (absent value) or absence (present value) of a cross on the corresponding arrow representing the edge. An edge between nodes p and q chooses the records in the database that match the join condition, a, between the tables, which is defined by the relation between the primary key in p and a foreign key in q, or the other way around. For example, the join condition, a, between table GENE
A Multi-relational Decision Tree Learning Algorithm
and INTERACTION in the selection graph from Figure 2 is GENE.GENE_ID = INTERACTION.GENE_ID2. A present edge between tables p and q combined with a list of conditions, q.C and p.C, selects those objects that match the list of conditions, q.C and p.C, and belong to the join between p and q specified by the join condition, e.a. On the other hand, an absent edge between tables p and q combined with a list of conditions, q.C and p.C, selects those objects that match condition p.C but do not satisfy the following: match q.C and belong to the join between the tables at the same time. Flag f is set to front for those nodes that have no absent edges on their path to n0. For all the other nodes flag f is set to back.

    select distinct T0.gene_id
    from GENE T0, INTERACTION T1, GENE T2
    where T0.gene_id = T1.gene_id2 and T1.gene_id1 = T2.gene_id
      and T0.chromosome = 5 and T1.type = 'Genetic' and T2.chromosome = 11
      and T0.gene_id not in
          ( select T0.gene_id2
            from INTERACTION T0
            where T0.type = 'Genetic' and T0.expression_corr = 0.2 )
Fig. 4. SQL query corresponding to the selection graph in Figure 2

Knobbe et al. [20] introduce the algorithm (Figure 3) for translating a selection graph into an SQL query. This algorithm returns the records in the target table covered by the selection graph. The subgraph(S, j.q) procedure returns the subgraph of the selection graph S starting with the node q as the target node, whose label s is reset to open; it removes the part of the graph that was connected to this node by the edge j and resets all the values of flag f in the resulting selection graph according to the definition of f. The notation j.q.key denotes the name of the attribute (primary or foreign key) in the table q that is associated with the table p in the relation j.a. Using this procedure, the graph in Figure 2 translates to the SQL statement shown in Figure 4.

2.4 Refinements of Selection Graphs
Multi-relational data mining algorithms search for and successively refine interesting patterns, and select promising ones based on some impurity measure (e.g., information gain). The set of refinements introduced by [20] is given below. We will illustrate all the refinements on the selection graph from Figure 2. Labels s and f help identify the nodes that need to be refined. Note that a refinement can only be applied to the open and front nodes in the selection graph S.

(a) Add positive condition. This refinement simply adds a condition c to the set of conditions C in the node that is being refined in the selection graph S, without actually changing the structure of S. For the selection graph from Figure 2, the positive condition expression_corr = 0.5 applied to the node INTERACTION results in the selection graph shown in Figure 5a.
[Figure 5 graphic: panels a) and b), each showing the selection graph of Figure 2 with an INTERACTION node carrying the conditions Type = 'Genetic' and Expression_corr = 0.5.]
Fig. 5. Complement refinements for adding the condition expression_corr = 0.5 to the node INTERACTION in the selection graph from Figure 2: a) positive condition, b) negative condition

(b) Add negative condition (Figure 5b). If the node which is refined is not n0, the corresponding refinement will introduce a new absent edge from the parent of the selection node in question. The condition list of the selection node is copied to the new closed node and extended by the new condition. This node gets copies of the children of the selection node in question, and present edges to those children are added. If the node which is refined represents the target table, the condition is simply negated and added to the current list of conditions for this node. This refinement is the complement of the "add positive condition" refinement, in the sense that it covers those objects from the original selection graph which were not covered by the corresponding "add positive condition" refinement.

(c) Add present edge and open node. This refinement introduces a present edge together with its corresponding table to the selection graph S. For the selection graph from Figure 2, adding an edge from the GENE node to the COMPOSITION node results in the selection graph shown in Figure 6a.

(d) Add absent edge and closed node (Figure 6b). This refinement introduces an absent edge together with its corresponding table to the selection graph S. This refinement is the complement of the "add present edge and open node" refinement, in the sense that it covers those objects from the original selection graph which were not covered by the "add present edge and open node" refinement.

It is important to note that it is only through the "add edge" refinements that the tables in the database are explored. An "add condition" refinement on some attribute of a table can be considered only after the edge to that table has been added to the selection graph.
This raises the question of what happens if the values of the attributes in some table are important for the task but the edge to this table can never be added, i.e., adding the edge does not result in a further split of the data covered by the refined selection graph. Look ahead
[Figure 6 graphic: panels a) and b), each showing the selection graph of Figure 2 extended with a COMPOSITION node attached to a GENE node.]
Fig. 6. Complement refinements for adding edge from GENE node to COMPOSITION node to selection graph from Figure 2: a) adding present edge and open node, b) adding absent edge and closed node
refinements, which are a sequence of several refinements, are used for dealing with this situation. In the case when some refinement does not split the data covered by the selection graph, the next set of refinements is also considered as refinements of the original selection graph.
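The translation of a selection graph into SQL described in this section can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: it assumes the selection graph is a tree rooted at the target node, the `Node` and `Edge` classes and their field names are hypothetical, and the flags s and f are not stored explicitly (a subgraph reached through an absent edge is simply translated as a nested `not in` subquery, as in Figure 4).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    table: str                                       # table X
    conditions: list = field(default_factory=list)   # conditions C, written without alias


@dataclass(eq=False)
class Edge:
    p: Node                 # parent node
    q: Node                 # child node
    p_key: str              # join attribute on the p side
    q_key: str              # join attribute on the q side
    present: bool = True    # flag e: True = present, False = absent


def translate(root, edges, key):
    """Build SQL for the distinct `key` values of `root` covered by the graph."""
    tables, joins, conds = [], [], []

    def visit(node):
        alias = f"T{len(tables)}"
        tables.append(f"{node.table} {alias}")
        conds.extend(f"{alias}.{c}" for c in node.conditions)
        for e in edges:
            if e.p is not node:
                continue
            if e.present:
                child_alias = visit(e.q)
                joins.append(f"{alias}.{e.p_key} = {child_alias}.{e.q_key}")
            else:
                # absent edge: exclude objects that reach the closed subgraph
                sub = translate(e.q, edges, e.q_key)
                joins.append(f"{alias}.{e.p_key} not in ({sub})")
        return alias

    visit(root)
    where = " and ".join(joins + conds)
    sql = f"select distinct T0.{key} from {', '.join(tables)}"
    return f"{sql} where {where}" if where else sql


# The selection graph of Figure 2.
gene5 = Node("GENE", ["chromosome = 5"])
inter = Node("INTERACTION", ["type = 'Genetic'"])
gene11 = Node("GENE", ["chromosome = 11"])
closed = Node("INTERACTION", ["type = 'Genetic'", "expression_corr = 0.2"])
edges = [
    Edge(gene5, inter, "gene_id", "gene_id2"),
    Edge(inter, gene11, "gene_id1", "gene_id"),
    Edge(gene5, closed, "gene_id", "gene_id2", present=False),
]
sql = translate(gene5, edges, "gene_id")
print(sql)
```

The generated statement has the same structure as the query in Figure 4, up to the order of the conjuncts in the where clause.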
3 MRDTL-2: An Efficient Multi-relational Decision Tree Learning Algorithm

3.1 Decision Tree Construction
A multi-relational decision tree learning algorithm constructs a decision tree whose nodes are multi-relational patterns, i.e., selection graphs. MRDTL-2, which we describe below, is based on MRDTL proposed by Leiva [23], which in turn is based on the algorithm described by [20] and the logical decision tree induction algorithm called TILDE proposed by [1]. TILDE uses first-order logic clauses to represent decisions (nodes) in the tree, with the data represented in first-order logic rather than as a collection of records in a relational database. MRDTL follows an approach similar to TILDE's, but deals with records in relational databases. Essentially, MRDTL adds selection graphs as the nodes to the tree through a process of successive refinement until some termination criterion is met (e.g., correct classification of instances in the training set). The choice of refinement to be added at each step is guided by a suitable impurity measure (e.g., information gain). MRDTL starts with the selection graph containing a single node at the root of the tree, which represents the set of all objects of interest in the relational database. This node corresponds to the target table T0. The pseudocode for MRDTL is shown in Figure 7.
    TREE_INDUCTION(D, S)
    Input: Database D, selection graph S
    Output: The root of the tree, T
        R := optimal_refinement(S)
        if stopping_criteria(S)
            return leaf
        else
            T_left := TREE_INDUCTION(D, R(S))
            T_right := TREE_INDUCTION(D, \bar{R}(S))
            return node(T_left, T_right, R)
Fig. 7. MRDTL algorithm
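The recursion of Figure 7 can be illustrated with a small, self-contained Python sketch. It is a deliberate simplification and not the authors' code: the objects covered by a selection graph are held in memory as (record, label) pairs, a refinement R is a hypothetical named predicate, its complement is the negated predicate, and the stopping criterion (no refinement with positive information gain) is an assumption made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def tree_induction(instances, refinements):
    """instances: list of (record, label); refinements: dict name -> predicate(record)."""
    labels = [y for _, y in instances]
    majority = Counter(labels).most_common(1)[0][0]

    def gain(pred):
        left = [y for r, y in instances if pred(r)]        # covered by R(S)
        right = [y for r, y in instances if not pred(r)]   # covered by the complement
        if not left or not right:
            return 0.0
        n = len(labels)
        return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

    # optimal_refinement: the (locally) best refinement by information gain
    best = max(refinements, key=lambda name: gain(refinements[name]), default=None)
    if best is None or gain(refinements[best]) <= 0.0:     # assumed stopping criterion
        return {"leaf": majority}
    pred = refinements[best]
    return {
        "refinement": best,
        "left": tree_induction([(r, y) for r, y in instances if pred(r)], refinements),
        "right": tree_induction([(r, y) for r, y in instances if not pred(r)], refinements),
    }


instances = [
    ({"chromosome": 5, "type": "Genetic"}, "nucleus"),
    ({"chromosome": 5, "type": "Physical"}, "nucleus"),
    ({"chromosome": 11, "type": "Genetic"}, "cytoplasm"),
    ({"chromosome": 11, "type": "Physical"}, "cytoplasm"),
]
refinements = {
    "chromosome = 5": lambda r: r["chromosome"] == 5,
    "type = 'Genetic'": lambda r: r["type"] == "Genetic",
}
tree = tree_induction(instances, refinements)
print(tree)
```

In MRDTL itself the split is evaluated with SQL queries over the database rather than over in-memory lists; the sketch only shows the control flow.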
The function optimal_refinement(S) considers every possible refinement that can be made to the current selection graph S and selects the (locally) optimal refinement (i.e., one that maximizes information gain). Here, $R(S)$ denotes the selection graph resulting from applying the refinement R to the selection graph S, and $\bar{R}(S)$ denotes the application of the complement of R to the selection graph S. Our implementation of MRDTL considers the refinements described in Section 2 as well as the look ahead refinements. The program automatically determines from the relational schema of the current database when look ahead might be needed. When adding an edge does not result in a further split of the data, two-step refinements of the original selection graph are considered.

Each candidate refinement is evaluated in terms of the split of the data induced by the refinement with respect to the target attribute, as in the case of the propositional version of the decision tree learning algorithm [30]. Splits based on numerical attributes are handled using a technique similar to that of the C4.5 algorithm [30] with modifications proposed in [10,29].

Our implementation of MRDTL uses SQL operations to obtain the counts needed for calculating the information gain associated with the refinements. First we show the calculation of the information gain associated with "add condition" refinements. Let X be the table associated with one of the nodes in the current selection graph S, X.A be the attribute to be refined, and $R_{v_j}(S)$ and $\bar{R}_{v_j}(S)$ be the "add condition" X.A = v_j refinement and its complement, respectively. The goal is to calculate the entropies associated with the split based on these two refinements. This requires the following counts: $\mathrm{count}(c_i, R_{v_j}(S))$ and $\mathrm{count}(c_i, \bar{R}_{v_j}(S))$, where $\mathrm{count}(c_i, S)$ is the number of objects covered by the selection graph S which have classification attribute $T_0.\mathit{target\_attribute} = c_i \in \mathrm{DOM}(T_0.\mathit{target\_attribute})$.
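As an illustration of how such counts turn into an information gain, the following Python sketch computes the gain of each "add condition" refinement from a table of (class value, attribute value, count) rows. The complement counts are obtained by subtracting the refinement counts from the counts for S. The function names and the example numbers are hypothetical and chosen only for the illustration.

```python
from math import log2


def entropy(counts):
    """Entropy of a class distribution given as {class value: count}."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(n / total * log2(n / total) for n in counts.values() if n)


def information_gain(count_S, count_R):
    """count_S[c] = count(c, S); count_R[c] = count(c, R(S))."""
    # complement counts: count(c, complement of R) = count(c, S) - count(c, R(S))
    count_Rbar = {c: count_S[c] - count_R.get(c, 0) for c in count_S}
    n = sum(count_S.values())
    n_r = sum(count_R.values())
    n_rbar = sum(count_Rbar.values())
    return entropy(count_S) - n_r / n * entropy(count_R) - n_rbar / n * entropy(count_Rbar)


# Rows as a counting query might return them: (c_i, v_j, count).
rows = [("pos", "red", 5), ("neg", "red", 1), ("pos", "blue", 1), ("neg", "blue", 3)]
count_S = {"pos": 6, "neg": 4}

by_value = {}
for c, v, n in rows:
    by_value.setdefault(v, {})[c] = n

gains = {v: information_gain(count_S, count_R) for v, count_R in by_value.items()}
best_value = max(gains, key=gains.get)
print(gains, best_value)
```

Because the attribute in this toy example has only two values, the refinements for "red" and "blue" are complements of each other and receive the same gain.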
The SQL query shown in Figure 8 returns a list of the necessary counts $\mathrm{count}(c_i, R_{v_j}(S))$ for each possible value $c_i \in \mathrm{DOM}(T_0.\mathit{target\_attribute})$ and $v_j \in \mathrm{DOM}(X.A)$. The rest of the counts needed for the computation of the information gain can be obtained from the formula:

$$\mathrm{count}(c_i, \bar{R}_{v_j}(S)) = \mathrm{count}(c_i, S) - \mathrm{count}(c_i, R_{v_j}(S))$$

    select T0.target_attribute, X.A, count(distinct T0.id)
    from TRANSLATE(S).get_table_list
    where TRANSLATE(S).get_join_list and TRANSLATE(S).get_condition_list
    group by T0.target_attribute, X.A

Fig. 8. SQL query that returns counts of the type $\mathrm{count}(c_i, R_{v_j}(S))$ for each of the possible values $c_i \in \mathrm{DOM}(T_0.\mathit{target\_attribute})$ and $v_j \in \mathrm{DOM}(X.A)$

Consider now the SQL queries needed for calculating the information gain associated with "add edge" refinements. Let X be the table associated with one of the nodes in the current selection graph S, e be the edge to be added from table X to table Y, and $R_e(S)$ and $\bar{R}_e(S)$ be the "add edge" e refinement and its complement, respectively. In order to calculate the entropies associated with the split based on these refinements we need to gather the following counts: $\mathrm{count}(c_i, R_e(S))$ and $\mathrm{count}(c_i, \bar{R}_e(S))$. The SQL query shown in Figure 9 returns a list of the desired counts $\mathrm{count}(c_i, R_e(S))$ for each possible value $c_i \in \mathrm{DOM}(T_0.\mathit{target\_attribute})$.

    select T0.target_attribute, count(distinct T0.id)
    from TRANSLATE(S).get_table_list, Y
    where TRANSLATE(S).get_join_list and TRANSLATE(S).get_condition_list and e.a
    group by T0.target_attribute
Fig. 9. SQL query returning counts of the type $\mathrm{count}(c_i, R_e(S))$ for each possible value $c_i \in \mathrm{DOM}(T_0.\mathit{target\_attribute})$

The rest of the counts needed for the computation of the information gain can be obtained from the formula:

$$\mathrm{count}(c_i, \bar{R}_e(S)) = \mathrm{count}(c_i, S) - \mathrm{count}(c_i, R_e(S))$$

The straightforward implementation of the algorithm based on the description given so far suffers from an efficiency problem which makes its application to complex real-world data sets infeasible in practice. As one gets further down in the decision tree, the selection graph at the corresponding node grows. Thus, as more and more nodes are added to the decision tree, it takes longer and longer to execute the corresponding SQL queries (Figures 8, 9) needed to examine the candidate refinements of the corresponding selection graph. Consequently, the straightforward implementation of MRDTL as described in [23] is too slow to be useful in practice. MRDTL-2 is a more efficient version of MRDTL. It exploits the fact that some of the results of computations that were carried out in the course of adding nodes at higher levels in the decision tree can be reused at lower levels in the
course of refining the tree. Note that the queries from Figures 8 and 9 unnecessarily repeat work done earlier by retrieving the instances covered by the selection graph whenever an existing selection graph is refined. These queries can be significantly simplified by storing the instances covered by the selection graph from the previous iteration in a table, to avoid retrieving them from the database again. Thus, refining an existing selection graph reduces to finding a (locally) optimal split of the relevant set of instances.

We now show how the counts computed by the SQL queries in Figures 8 and 9 are obtained in MRDTL-2. In each iteration of the algorithm, we store the primary keys from all open, front nodes of the selection graph for every object covered by it, together with its classification value. This can be viewed as storing the 'skeletons' of the objects covered by the selection graph, because it stores no attribute information about the records other than their primary keys. The algorithm for generating the SQL query that creates such a table for the selection graph S is shown in Figure 10. We call the resulting table of primary keys the sufficient table for S and denote it by I_S.

    SUF_TABLE(S)
    Input: Selection graph S
    Output: SQL query for creating sufficient table I_S
        table_list, condition_list, join_list := extract_from(TRANSLATE(S))
        primary_key_list := 'T0.target_attribute'
        for each node i in S do
            if (i.s = 'open' and i.f = 'front')
                primary_key_list.add(i.ID)
        return 'create table I_S as ( select ' + primary_key_list + ' from ' + table_list + ' where ' + join_list + ' and ' + condition_list + ' )'
Fig. 10. Algorithm for generating SQL query corresponding to the sufficient table I_S of the selection graph S

Given a sufficient table I_S, we can obtain the counts needed for the calculation of the entropy for the "add condition" refinements as shown in Figure 11, and for the "add edge" refinements as shown in Figure 12. It is easy to see that now the number of tables that need to be joined is not more than 3, whereas the number of tables that need to be joined in Figures 8 and 9 grows with the size of the selection graph. It is this growth that was responsible for the significant performance deterioration of MRDTL as nodes get added to the decision tree. It is important to note that it is inefficient to use the algorithm from Figure 10 in each iteration, since again the size of the query would increase with the growth of the selection graph. It is possible to create the sufficient table for the refined selection graph using only the information about the refinement and the sufficient table of the original selection graph, as shown in Figure 13.
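The role of the sufficient table can be tried out on a toy database with Python's built-in `sqlite3` module. The sketch below is illustrative only: the table contents, the `id` column of INTERACTION, and the column names `T0_id` and `T1_id` are assumptions made for the example. It builds I_S for a two-node selection graph, computes the "add condition" counts from I_S joined with a single table, and then derives the sufficient table of a refined graph from I_S alone.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table GENE (gene_id text primary key, chromosome int, localization text);
create table INTERACTION (id integer primary key, gene_id1 text, gene_id2 text, type text);
insert into GENE values ('g1', 5, 'nucleus'), ('g2', 5, 'cytoplasm'), ('g3', 11, 'nucleus');
insert into INTERACTION values
    (1, 'g3', 'g1', 'Genetic'), (2, 'g3', 'g1', 'Physical'), (3, 'g3', 'g2', 'Physical');

create table I_S as
    select T0.localization as T0_target_attribute, T0.gene_id as T0_id, T1.id as T1_id
    from GENE T0, INTERACTION T1
    where T0.gene_id = T1.gene_id2 and T0.chromosome = 5;
""")

# Counts for "add condition" refinements on INTERACTION.type, using only I_S and X.
counts = con.execute("""
    select I_S.T0_target_attribute, X.type, count(distinct I_S.T0_id)
    from I_S, INTERACTION X
    where I_S.T1_id = X.id
    group by I_S.T0_target_attribute, X.type
    order by 1, 2
""").fetchall()

# Sufficient table of the refined graph (add positive condition type = 'Genetic'),
# built from I_S and the refined table only.
con.execute("""
    create table I_R as
    select I_S.* from I_S, INTERACTION T1
    where T1.id = I_S.T1_id and T1.type = 'Genetic'
""")
refined = con.execute("select * from I_R").fetchall()
print(counts, refined)
```

However large the selection graph becomes, the counting query joins I_S with just the table being refined, which is the point made in the paragraph above.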
    select I_S.T0_target_attribute, X.A, count(distinct I_S.T0_id)
    from I_S, X
    where I_S.X_ID = X.ID
    group by I_S.T0_target_attribute, X.A
Fig. 11. SQL query which returns the counts needed for calculating entropy for the splits based on "add condition" refinements to the node X for the attribute X.A using sufficient table I_S

The simple modifications described above make MRDTL-2 significantly faster to execute compared to MRDTL. This is confirmed by the experimental results presented in Section 4.

3.2 Handling Missing Values
The current implementation of MRDTL-2 incorporates a simple approach to dealing with missing attribute values in the data. A Naive Bayes model for each attribute in a table is built based on the other attributes (excluding the class attribute). Missing attribute values are 'filled in' with the most likely value predicted by the Naive Bayes predictor for the corresponding attribute. Thus, for each record, r, from table X we replace its missing value for the attribute X.A with the following value:

$$v_{NB} = \operatorname*{argmax}_{v_j \in \mathrm{DOM}(X.A)} P(v_j) \prod_{\forall X_l \in D} \; \prod_{\substack{\forall X_l.A_i,\ A_i \neq A,\\ A_i \neq T_0.\mathit{target\_attribute}}} \; \prod_{\substack{\forall r_n \in X_l,\\ r_n \text{ associated with } r}} P(X_l.A_i \mid v_j)$$
Here the first product is taken over the tables in the training database; the second product is taken over all the attributes in that table, except for the target attribute, T0.target_attribute, and the attribute which is being predicted, namely, X.A; the third product is taken over all the records in the table X_l which are connected to the record r through the associations between the tables X and X_l. In the case of a one-to-many relation between the tables X and X_l, one record from table X may have several corresponding records in the table X_l. The value $P(X_l.A_i \mid v_j)$ is defined as the probability that some random element from table X has at least one corresponding record from table X_l. Once the tables in the database are preprocessed in this manner, MRDTL-2 proceeds to build a decision tree from the resulting tables that contain no missing attribute values.

    select I_S.T0_target_attribute, count(distinct I_S.T0_id)
    from I_S, X, Y
    where I_S.X_ID = X.ID and e.a
    group by I_S.T0_target_attribute
Fig. 12. SQL query which returns counts needed for calculating entropy for the splits based on "add edge" e refinements from the node X to the node Y using sufficient table I_S
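The Naive Bayes 'fill in' step of Section 3.2 can be illustrated on a single table. The Python sketch below is a simplified stand-in, not the authors' procedure: it ignores the product over associated records in other tables, the records and attribute names are invented, and the add-one smoothing (parameter `alpha`) is an assumption introduced to avoid zero probabilities.

```python
from collections import Counter, defaultdict


def fit(records, attr, features):
    """Estimate P(v) and P(feature value | v) from records where `attr` is known."""
    prior = Counter()
    cond = defaultdict(Counter)      # (feature, v) -> counts of feature values
    domain = defaultdict(set)        # feature -> observed values
    for r in records:
        for f in features:
            if r[f] is not None:
                domain[f].add(r[f])
        if r[attr] is None:
            continue
        prior[r[attr]] += 1
        for f in features:
            if r[f] is not None:
                cond[(f, r[attr])][r[f]] += 1
    return prior, cond, domain


def most_likely(model, record, features, alpha=1.0):
    """argmax over v of P(v) * prod_f P(record[f] | v), with add-alpha smoothing."""
    prior, cond, domain = model
    total = sum(prior.values())

    def score(v):
        p = prior[v] / total
        for f in features:
            if record[f] is None:
                continue
            seen = cond[(f, v)]
            p *= (seen[record[f]] + alpha) / (sum(seen.values()) + alpha * len(domain[f]))
        return p

    return max(prior, key=score)


def fill_missing(records, attr, features):
    """Replace missing values of `attr` by the Naive Bayes prediction."""
    model = fit(records, attr, features)
    for r in records:
        if r[attr] is None:
            r[attr] = most_likely(model, r, features)
    return records


records = [
    {"chromosome": 5, "phenotype": "A", "essential": "yes"},
    {"chromosome": 5, "phenotype": "A", "essential": "yes"},
    {"chromosome": 11, "phenotype": "B", "essential": "no"},
    {"chromosome": 11, "phenotype": "A", "essential": "no"},
    {"chromosome": 5, "phenotype": "A", "essential": None},   # value to be filled in
]
# The class attribute is deliberately not among the features.
fill_missing(records, "essential", ["chromosome", "phenotype"])
print(records[-1])
```

Extending the sketch to the multi-relational case would add, for every associated table, one factor per attribute and per associated record, following the three nested products in the formula.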
    REFINEMENT_SUF_TABLE(I_S, R)
    Input: Sufficient table I_S for selection graph S, refinement R
    Output: SQL query for sufficient table for R(S)
        table_list := I_S
        condition_list := ''
        join_list := ''
        primary_key_list := primary_keys(I_S)
        if R == add positive condition, c, in table Ti
            table_list += Ti
            condition_list += Ti.c
            join_list += Ti.ID + ' = ' + I_S.Ti_ID
        else if R == add negative condition, c, in table Ti
            condition_list += T0.ID + ' not in ( select distinct ' + I_S.T0_ID + ' from ' + I_S + ', ' + Ti + ' where ' + Ti.c + ' and ' + Ti.ID + ' = ' + I_S.Ti_ID + ' )'
        else if R == add present edge, e, from Ti to Tj
            table_list += Ti + ', ' + Tj
            join_list += Ti.ID + ' = ' + I_S.Ti_ID + ' and ' + e.a
            primary_key_list += Tj.ID
        else if R == add absent edge, e, from Ti to Tj
            condition_list += T0.ID + ' not in ( select distinct ' + I_S.T0_ID + ' from ' + I_S + ', ' + Ti + ', ' + Tj + ' where ' + Ti.ID + ' = ' + I_S.Ti_ID + ' and ' + e.a + ' )'
        return 'create table I_R as ( select ' + primary_key_list + ' from ' + table_list + ' where ' + join_list + ' and ' + condition_list + ' )'
Fig. 13. Algorithm for generating SQL query corresponding to sufficient table I_{R(S)}
In the future it would be interesting to investigate more sophisticated techniques for dealing with missing values.

3.3 Using the Decision Tree for Classification
Before classifying an instance, any missing attribute values are filled in by preprocessing the tables of the database of instances to be classified, using the method described above. The decision tree produced by MRDTL-2, as in the case of MRDTL, can be viewed as a set of SQL queries associated with the selection graphs that correspond to the leaves of the decision tree. Each selection graph (query) has a class label associated with it. If the corresponding node is not a pure node (i.e., it does not unambiguously classify the training instances that match the query), then in our implementation the label associated with the node is based on the classification of the majority of training instances that match the corresponding selection graph. (Alternatively, we could use probabilistic assignment of
labels based on the distribution of class labels among the training instances that match the corresponding selection graph). The complementary nature of the different branches of a decision tree ensures that a given instance will not be assigned conflicting labels. It is also worth noting that it is not necessary to traverse the entire tree in order to classify a new instance; all the constraints on a certain path are stored in the selection graph associated with the corresponding leaf node.
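Viewing the tree as a set of leaf queries can be illustrated with `sqlite3`. In the sketch below the two leaf queries, the class labels, and the toy GENE table are invented for the example; the second leaf is written as the complement of the first, so that every instance matches exactly one leaf, and `majority_label` shows how an impure leaf would receive its label from training counts.

```python
import sqlite3


def majority_label(class_counts):
    """Label of a leaf: the most frequent class among matching training instances."""
    return max(class_counts, key=class_counts.get)


def classify(con, leaves):
    """leaves: list of (sql returning object ids, class label)."""
    labels = {}
    for sql, label in leaves:
        for (obj_id,) in con.execute(sql):
            if obj_id in labels:
                raise ValueError(f"{obj_id} matched more than one leaf")
            labels[obj_id] = label
    return labels


con = sqlite3.connect(":memory:")
con.executescript("""
create table GENE (gene_id text primary key, chromosome int);
insert into GENE values ('g1', 5), ('g2', 11), ('g3', 5);
""")

leaf_query = "select distinct T0.gene_id from GENE T0 where T0.chromosome = 5"
leaves = [
    (leaf_query, majority_label({"nucleus": 3, "cytoplasm": 1})),
    # complementary leaf: all target objects not covered by the first leaf
    (f"select distinct T0.gene_id from GENE T0 where T0.gene_id not in ({leaf_query})",
     majority_label({"nucleus": 1, "cytoplasm": 2})),
]
labels = classify(con, leaves)
print(labels)
```

Each leaf query already contains all the constraints on its path, so no tree traversal is needed at classification time.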
4 Experimental Results
Our experiments focused on three data sets: the mutagenesis database, which has been widely used in Inductive Logic Programming (ILP) research [25]; the data for prediction of protein/gene localization and function from KDD Cup 2001 [17]; and the data for predicting thrombosis from the PKDD 2001 Discovery Challenge [27]. We compared the results we obtained using the MRDTL-2 algorithm with those reported in the literature for the same datasets.

4.1 Mutagenesis Data Set
The entity-relation diagram for the part of the Mutagenesis database [25] we used in our experiments is shown in Figure 14. The data set consists of 230 molecules divided into two subsets: 188 molecules for which linear regression yields good results and 42 molecules that are regression-unfriendly. This database contains descriptions of molecules, and the characteristic to be predicted is their mutagenic activity (ability to cause DNA to mutate), represented by the attribute LABEL in the MOLECULE table.

[Figure 14 graphic: entity-relation diagram with the tables MOLECULE (MOLECULE_ID, LOG_MUT, LOGP, LUGMO, IND1, INDA, LABEL), ATOM (ATOME_ID, MOLECULE_ID, ELEMNT, TYPE, CHARGE) and BOND (MOLECULE_ID, ATOM_ID1, ATOM_ID2, TYPE).]
Fig. 14. Schema of the mutagenesis database

This dataset comes with different levels of background knowledge B0, B1, B2, and B3. In our experiments we chose to use the background knowledge B2 and the regression-friendly subset of the dataset in order to compare the performance of MRDTL-2 with other methods for which experimental results are available in the literature. The results, averaged over ten-fold cross-validation, are shown in Table 1.
Table 1. Experimental results for mutagenesis data

                 accuracy    time with MRDTL-2    time with MRDTL
    mutagenesis  87.5 %      28.45 secs           52.155 secs

Table 2. Experimental results for gene/protein localization prediction task

    localization            accuracy    time with MRDTL-2    time with MRDTL
    accuracy with mvh       76.11 %     202.9 secs           1256.387 secs
    accuracy without mvh    50.14 %     550.76 secs          2257.206 secs
4.2 KDD Cup 2001 Data Set
We also considered the two tasks from the KDD Cup 2001 data mining competition [5]: prediction of gene/protein function and localization. We normalized the data given in the task, which resulted in the schema shown in Figure 1. The resulting database consists of 3 tables, GENE, INTERACTION and COMPOSITION, where the GENE table contains 862 entries in the training set and 381 in the testing set. The gene localization and function tasks present significant challenges because many attribute values in the data set are missing. We conducted experiments both using the technique for handling missing values, denoted mvh in Tables 2 and 3, and not using it (considering each missing value as a special value "missing"). Table 2 summarizes the results we obtained for predicting the GENE.localization value on this set.

In the case of gene/protein function prediction, instances often have several class labels, since a protein may have several functions. MRDTL-2, like its propositional counterpart C4.5, assumes that each instance can be assigned to only one of several non-overlapping classes. To deal with multivalued class attributes, we transformed the problem into one of separately predicting membership in each possible class, i.e., for each possible function label we predicted whether the protein has this function or not. The overall accuracy was obtained from the formula

$$\text{accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{true negative} + \text{false positive} + \text{false negative}}$$

computed over all binary predictions. Table 3 summarizes the results we obtained for predicting the function of the protein.

4.3 PKDD 2001 Discovery Challenge Data Set
The Thrombosis Data from the PKDD 2001 Discovery Challenge Data Set [7] consists of seven tables. PATIENT_INFO contains 1239 records about patients. For our experiments we used 4 other tables (DIAGNOSIS, ANTIBODY_EXAM,
Table 3. Experimental results for gene/protein function prediction task

    function                accuracy    time with MRDTL-2    time with MRDTL
    accuracy with mvh       91.44 %     151.19 secs          307.82 secs
    accuracy without mvh    88.56 %     61.29 secs           118.41 secs

Table 4. Experimental results for Thrombosis data set

                accuracy    time with MRDTL-2    time with MRDTL
    thrombosis  98.1 %      127.75 secs          198.22 secs

Table 5. Comparison of MRDTL-2 performance with the best-known reported results

    dataset         MRDTL accuracy    best reported accuracy    reference
    mutagenesis     87.5 %            86 %                      [31]
    localization    76.11 %           72.1 %                    [5]
    function        91.44 %           93.6 %                    [5]
    thrombosis      98.1 %            99.28 %                   [7]
ANA_PATTERN and THROMBOSIS), which all have a foreign key to the PATIENT_INFO table. There are no other relations between the tables in this dataset. The task is to predict the degree of thrombosis attribute from the ANTIBODY_EXAM table. The results we obtained with 5:2 cross-validation are shown in Table 4. The cross-validation was done by partitioning the set of all records in the ANTIBODY_EXAM table and their corresponding records from other tables into training and test sets.

4.4 Comparative Evaluation
The results of the comparison of MRDTL-2 performance with the best-known reported results for the datasets described above are shown in Table 5.
5 Summary and Discussion
Advances in data acquisition, digital communication, and storage technologies have made it possible to gather and store large volumes of data in digital form. A large fraction of this data resides in relational databases. Even when the data repository is not a relational database, it is possible to extract information from heterogeneous, autonomous, distributed data sources using domain specific ontologies [28]. The result of such data integration is in the form of relational tables. Effective use of such data in data-driven scientific discovery and decision-making calls for sophisticated algorithms for knowledge acquisition from relational databases, that is, multi-relational data mining algorithms. This is an important problem which has begun to receive significant attention in the literature [1,19,11,21,22,12,18,26,9,8,14,16,2,6,13,24]. Learning classifiers from relational databases has also been a focus of the KDD Cup 2001 Competition and the PKDD 2001 Discovery Challenge.

Against this background, this paper describes the design and implementation of MRDTL-2, an algorithm for learning decision tree classifiers from relational databases, which is based on the framework for multi-relational data mining originally proposed by Knobbe et al. [19]. MRDTL-2 extends MRDTL, an algorithm for learning decision tree classifiers described by Leiva [23]. MRDTL-2 includes enhancements that overcome two significant limitations of MRDTL:

(a) Slow running time: MRDTL-2 incorporates methods for speeding up MRDTL. Experiments using several data sets from KDD Cup 2001 and the PKDD 2001 Discovery Challenge show that the proposed methods can significantly reduce the running time of the algorithm, thereby making it possible to apply multi-relational decision tree learning algorithms to far more complex data mining tasks. The proposed methods are potentially applicable to a broad class of multi-relational data mining algorithms based on the framework proposed by Knobbe et al. [19].

(b) Inability to handle missing attribute values: MRDTL-2 includes a simple and computationally efficient technique using Naive Bayes classifiers for 'filling in' missing attribute values, which significantly enhances the applicability of multi-relational decision tree learning algorithms to real-world classification tasks.
Our experiments with several classification tasks drawn from KDD Cup 2001 [5] and the PKDD 2001 Discovery Challenge [7] and the widely studied Mutagenesis data set show that MRDTL-2

(a) significantly outperforms MRDTL in terms of running time,
(b) yields results that are comparable to the best reported results obtained using multi-relational data mining algorithms (Table 5),
(c) compares favorably with feature-based learners that are based on clever propositionalization methods [22].

Work in progress is aimed at:

(a) Incorporation of more sophisticated methods for handling missing attribute values into MRDTL-2.
(b) Incorporation of sophisticated pruning methods or complexity regularization techniques into MRDTL-2 to minimize overfitting and improve generalization.
(c) Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [33].
(d) Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data [4,3,28].
(e) More extensive experimental evaluation of MRDTL-2 on real-world data sets.
(f) Incorporation of more sophisticated methods for evaluation of MRDTL-2 [15].

Acknowledgments

This work is supported in part by an Information Technology Research (ITR) grant (0219699) from the National Science Foundation, and a Biological Information Science and Technology Initiative (BISTI) award (GM066387) from the National Institutes of Health. The paper has benefited from discussions with Doina Caragea and Adrian Silvescu of the Iowa State University Artificial Intelligence Research Laboratory.
References

1. Blockeel, H.: Top-down induction of first order logical decision trees. Department of Computer Science, Katholieke Universiteit Leuven (1998)
2. Blockeel, H., and De Raedt, L.: Relational Knowledge Discovery in Databases. In: Proceedings of the Sixth International Workshop on Inductive Logic Programming, volume 1312 of Lecture Notes in Artificial Intelligence, 199-212, Springer-Verlag (1996)
3. Caragea, D., Silvescu, A., and Honavar, V.: Decision Tree Induction from Distributed, Heterogeneous, Autonomous Data Sources. In: Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA 2003). In press.
4. Caragea, D., Silvescu, A., and Honavar, V.: Toward a Theoretical Framework for Analysis and Synthesis of Agents That Learn from Distributed Dynamic Data Sources. Technical Report. Invited chapter. In: Emerging Neural Architectures Based on Neuroscience, Berlin: Springer-Verlag (2001)
5. Cheng, J., Krogel, M., Sese, J., Hatzis, C., Morishita, S., Hayashi, H., and Page, D.: KDD Cup 2001 Report. In: ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) Explorations, vol. 3, issue 2 (2002)
6. Santos Costa et al.: Query Transformations for Improving the Efficiency of ILP Systems. In: Journal of Machine Learning Research (2002)
7. Coursac, I., Duteil, N., and Lucas, N.: pKDD 2001 Discovery Challenge - Medical Domain. In: PKDD Discovery Challenge 2001, vol. 3, issue 2 (2002)
8. Dehaspe, L., and De Raedt, L.: Mining Association Rules in Multiple Relations. In: Proceedings of the 7th International Workshop on Inductive Logic Programming, vol. 1297, p. 125-132 (1997)
9. Dzeroski, S., and Lavrac, N.: Relational Data Mining. Springer-Verlag (2001)
10. Fayyad, U.M., and Irani, K.B.: On the handling of continuous-valued attributes in decision tree generation. In: Machine Learning, vol. 8 (1992)
11. Friedman, N., Getoor, L., Koller, D., and Pfeffer, A.: Learning probabilistic relational models. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence (1999)
12. Getoor, L.: Multi-relational data mining using probabilistic relational models: research summary. In: Proceedings of the First Workshop in Multi-relational Data Mining (2001)
13. Ito, M., and Ohwada, H.: Efficient Database Access for Implementing a Scalable ILP engine. In: Work-In-Progress Report of the Eleventh International Conference on Inductive Logic Programming (2001)
56
Anna Atramentov et al.
14. Jaeger, M.: Relational Bayesian networks. In: Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI-1997) 15. Jensen, D., and J. Neville: Autocorrelation and Linkage Cause Bias in Evaluation of Relational Learners. In: Proceedings of the 12th International Conference on Inductive Logic Programming (ILP 2002) 16. Karalic and Bratko: First order regression. In: Machine Learning 26, vol. 1997 17. The KDD Cup 2001 dataset: http://www.cs.wisc.edu/$\sim$dpage/kddcup2001/ 18. Kersting, K., and De Raedt, L.: Bayesian Logic Programs. In: Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming (2000) 19. Knobbe, J., Blockeel, H., Siebes, A., and Van der Wallen, D.: Multi-relational Data Mining. In: Proceedings of Benelearn (1999) 20. Knobbe, J., Blockeel, H., Siebes, A., and Van der Wallen, D.: Multi-relational decision tree induction. In: Proceedings of the 3rdEuropean Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD-99 (1999) 21. Koller, D.: Probabilistic Relational Models. In: Proceedings of 9th International Workshop on Inductive Logic Programming (ILP-99) 22. Krogel, M., and Wrobel, S.: Transformation-Based Learning Using Multirelational Aggregation. In: Proceedings of the 11th International Conference on Inductive Logic Programming, vol. 2157 (2001) 23. Hector Ariel Leiva: A multi-relational decision tree learning algorithm. M.S. thesis. Deparment of Computer Science. Iowa State University (2002) 24. Morik, K., and Brockhausen, P.: A multistrategy approach to relational discovery in databases. In: Machine Learning, 27(3), 287-312 (1997) 25. The mutagenesis dataset: http://web.comlab.ox.ac.uk/oucl/research/areas/ machlearn/mutagenesis.html 26. Pfeffer, A.: A Bayesian Language for Cumulative Learning. In: Proceedings of AAAI 2000 Workshop on Learning Statistical Models from Relational Data, AAAI Press (2000) 27. 
The PKDD 2001 Discovery Challenge dataset: http://www.uncc.edu/knowledgediscovery 28. Jaime Reinoso-Castillo: Ontology-driven information extraction and integration from Heterogeneous Distributed Autonomous Data Sources. M.S. Thesis. Department of Computer Science. Iowa State University (2002) 29. Quinlan, R.: Improved Use of Continuous Attributes in C4.5. In: Journal of Artificial Intelligence Research, vol.4 (1996) 30. Quinlan, R.: C4.5: Programs for Machine Learning. In: San Mateo: Morgan Kaufmann (1993) 31. Srinivasan, A., King, R.D., and Muggleton, S.: The role of background knowledge: using a problem from chemistry to examine the performance of an ILP program. Technical Report PRG-TR-08-99, Oxford University Computing Laboratory, Oxford, 1999. 32. Wang, X., Schroeder, D., Dobbs, D., and Honavar, V. (2003). Data-Driven Discovery of Rules for Protein Function Classification Based on Sequence Motifs. Information Sciences. In press. 33. Zhang, J., and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute-Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning. Washington, DC. In press.
Applying Theory Revision to the Design of Distributed Databases

Fernanda Baião¹, Marta Mattoso¹, Jude Shavlik², and Gerson Zaverucha¹

¹ Department of Computer Science – COPPE, Federal University of Rio de Janeiro (UFRJ), PO Box 68511, Rio de Janeiro, RJ 21941-972 Brazil
{baiao,marta,gerson}@cos.ufrj.br
² Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, WI 53706 USA
[email protected]
Abstract. This work presents the application of theory revision to the design of distributed databases to automatically revise a heuristic-based algorithm (called analysis algorithm) through the use of the FORTE system. The analysis algorithm decides the fragmentation technique to be used in each class of the database and its Prolog implementation is provided as the initial domain theory. Fragmentation schemas with previously known performance, obtained from experimental results on top of an object database benchmark, are provided as the set of examples. We show the effectiveness of our approach in finding better fragmentation schemas with improved performance.
1 Introduction
Distributed and parallel processing on database management systems are efficient ways of improving the performance of applications that manipulate large volumes of data. This may be accomplished by removing irrelevant data accessed during the execution of queries and by reducing the data exchange among sites, which are the two main goals of the design of distributed databases [28]. However, in order to improve the performance of these applications, it is very important to design information distribution properly. The distribution design involves making decisions on the fragmentation and placement of data across the sites of a computer network. The first phase of the distribution design is the fragmentation phase, which is the focus of this work. To fragment a class of objects, it is possible to use two basic techniques: horizontal and vertical fragmentation [28], which may be combined and applied in many different ways to define the final fragmentation schema.

The class fragmentation problem in the design of a distributed database is known to be an NP-hard problem [28]. There are a number of works in the literature addressing the horizontal [7, 14, 31] or vertical [6, 15] class fragmentation technique, but not both. Even when the designer decides to apply a horizontal fragmentation algorithm to one class and a vertical fragmentation algorithm to another class, the designer is left with no assistance to make this decision. Our previous work proposed a set of heuristics to drive the choice of the fragmentation technique to be applied in each class of the database schema. Those heuristics were implemented in an algorithm called “analysis algorithm” [2], and were incorporated in a methodology that includes the analysis algorithm and horizontal and vertical class fragmentation algorithms adapted from the literature. Experimental results reported in [3, 4] show applications that were executed 3.4 times faster when applying the fragmentation schema resulting from our methodology, compared to alternative fragmentation schemas proposed by other works in the literature.

Experimental results from real applications can continuously provide heuristics for the design of distributed object databases (DDODB) that may be incorporated in our analysis algorithm. Indeed, we have tried to manually improve the analysis algorithm using experimental results from [23, 25], which required a detailed analysis of each result and manual modifications to the analysis algorithm. However, the formalization of new heuristics from these experiments and their incorporation in the analysis algorithm, while keeping previous heuristics consistent, proved to be an increasingly difficult task.

This work proposes the use of Theory REvisioN on the Design of Distributed Databases (TREND3), showing how it automatically improves our analysis algorithm through the use of the FORTE system [29]. TREND3 is a module of a framework that handles the class fragmentation problem of the design of distributed databases, defined in [5].

There are approaches in the literature addressing the DDODB problem [4, 6, 7, 13, 14, 15, 16, 20, 24, 31]. However, none of them addresses the problem of choosing the most adequate fragmentation technique to be applied to each class of the database schema. Some works have applied machine learning techniques to solve database problems. For example, [8, 9] present an approach for the inductive design of deductive databases, based on the database instances to define some intensional predicates. Also, relational Bayesian networks were used to estimate query selectivity in a query processor [19] and to predict the structure of relational databases [18]. However, considering the design of distributed databases as an application for theory revision is a novel approach.

The paper is organized as follows: in section 2, the design of distributed databases is defined and our framework for the design of distributed databases is described. Theory revision is briefly reviewed in section 3, while in section 4 we show how to improve a DDODB analysis algorithm through the use of the FORTE system. Experimental results on top of the OO7 benchmark [12] are presented in section 5. Finally, section 6 presents some conclusions and future work.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 57-74, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 A Framework for the Design of Distributed Databases
This section defines the problem of designing a distributed database, focusing on the object-oriented model, and presents a framework we propose for the class fragmentation phase of the distribution design.

2.1 The Design of Distributed Databases
The distribution design of a database makes decisions on the fragmentation and placement of data across the sites of a computer network. The first phase of the distribution design is the fragmentation phase, which is the process of isolating into fragments specific data accessed by the most relevant applications that run over the database. In an object-oriented database, data is represented as objects. The set of objects sharing the same structure and behavior defines a class, and classes may be related to each other through relationships. A database schema describes the set of classes and relationships. The UML diagram representing the database schema of the OO7 benchmark [12] is illustrated in Fig. 1. The OO7 benchmark is a generic application on top of a database of design objects assembled through the composition of parts. We may notice, for example, that each composite part is related to N atomic parts (through the “parts” relationship), and that each atomic part “is part of” one composite part.

[Figure 1: UML class diagram. Classes: DesignObject (id, type, buildDate), BaseAssembly, CompositePart, Document (title, text), AtomicPart (x, y), Connection (type, length). Relationship labels: compShared, compPrivate, baseShared, basePrivate, rootPart, isRootPart, parts, isPartOf, documents, documentation, to, from.]

Fig. 1. The OO7 benchmark database schema
Given the schema of the database to be distributed, as in any distribution design methodology, we need to capture the set of operations over the database and quantitative information in order to define a fragmentation schema to be applied on the database schema, which is the goal of the fragmentation phase.
The operations are captured by decomposing the application running over the database, and are classified into selection, projection or navigation operations according to the definitions from [2]. Quantitative information needed includes the cardinality of each class (i.e., its estimated size: small, medium or large) and the execution frequency of each operation.

The fragmentation schema is composed of the choice of a fragmentation technique and the definition of a set of fragments for each class of the database schema. The two basic fragmentation techniques to be applied on a class are horizontal and vertical fragmentation [28]. Vertical fragmentation breaks the class logical structure (its attributes and methods) and distributes them into fragments. Horizontal fragmentation distributes class instances across the fragments. Thus, a horizontal fragment of a class contains a subset of the whole class extension. Horizontal fragmentation is usually subdivided into primary and derived horizontal fragmentation. Primary horizontal fragmentation basically optimizes selection and projection operations, while derived horizontal fragmentation addresses the relationships between classes and improves the performance of navigation operations. It is also possible to apply both vertical and primary horizontal fragmentation techniques to a class simultaneously (which we call hybrid fragmentation) or to apply different fragmentation techniques to different classes in the database schema (which we call mixed fragmentation).

In the object-oriented data model, additional issues increase the difficulty of class fragmentation and turn it into an even more complex problem. Our previous work proposed a set of heuristics implemented by an algorithm (called “analysis algorithm”) [2]. Some examples of the heuristics proposed are “in the case of a selection operation on a class with a large cardinality, this class is indicated for primary horizontal fragmentation”, or “in the case of a projection operation on a class with a large cardinality that is not derived horizontally fragmented, this class is indicated for vertical fragmentation”. The algorithm was also capable of handling conflicts during the fragmentation schema definition.

2.2 The Framework for the Class Fragmentation Problem in the DDODB
The framework we propose for the class fragmentation problem in the design of distributed databases integrates three modules: the DDODB heuristic module, the theory revision module (TREND3) and the DDODB branch-and-bound module (Figure 2). The distribution designer provides input information about the database schema (its semantics – classes and relationships – and additional quantitative information such as the estimated cardinality of each class) and applications (projection, selection and navigation operations) that will be executed over the database. This information is then passed to the DDODB heuristic module. The DDODB heuristic module defines a set of heuristics to design an adequate fragmentation schema for a given database application. The execution of the heuristic module algorithms (analysis algorithm, vertical fragmentation and horizontal fragmentation) will follow this set of heuristics
and quickly output an adequate fragmentation schema to the distribution designer. Previous results using the heuristic module are presented in [2, 3].
[Figure 2: block diagram. Recoverable labels: Distribution Designer; Database Application (Semantics, Operations, quantitative info); DDODB Heuristic Module (AA → VF → HF); Good fragmentation schema; Cost of good fragmentation schema; Analysis Algorithm (Initial Theory); Known fragmentation schemas (Examples); TREND3 Module (FORTE Module); Revised Theory (Improved Analysis Algorithm); DDODB Branch-and-Bound Module (Query Processing Cost Function); “Optimal” fragmentation schema (Examples).]

Fig. 2. The overall framework for the class fragmentation in the DDODB
The set of heuristics implemented by the DDODB heuristic module may be further automatically improved by executing a theory revision process through the use of inductive logic programming (ILP) [27, 29, 34]. This process is called Theory REvisioN on the Design of Distributed Databases (TREND3). The improvement process may be carried out by providing two input parameters to the TREND3 module: the Prolog implementation of the analysis algorithm (representing the initial theory, or the background knowledge) and fragmentation schemas with previously known performances (representing a set of examples). The analysis algorithm is then automatically modified by a theory revision system (called FORTE) so as to produce a revised theory. The revised theory will represent an improved analysis algorithm that will be able to output a fragmentation schema with improved performance, and this revised analysis algorithm will then substitute the original one in the DDODB heuristic module. In [26], it has been pointed out that machine learning algorithms that use background knowledge, thus combining inductive with analytical mechanisms, obtain the benefits of both approaches: better generalization accuracy, smaller number of required training examples, and explanation capability. Additionally, the input information from the distribution designer may be passed to our third module, the DDODB branch-and-bound module. This module represents an alternative approach to the heuristic module, and obtains (at a high execution cost) the best fragmentation schema for a given database application. The branch-and-bound procedure searches for an optimal solution in the space of potentially good fragmentation schemas for an application and outputs its result to the distribution designer. The algorithm bounds its search for the best fragmentation schema by using a query processing cost function during the evaluation of each fragmentation schema in the hypotheses space. 
This cost function, defined in [30], is responsible for estimating the execution cost of queries on top of a distributed database. The resulting
fragmentation schema generated by the heuristic module is used to bound evaluations of fragmentation schemas presenting higher estimated costs. Finally, the resulting fragmentation schema generated by the branch-and-bound algorithm, as well as the fragmentation schemas discarded during the search, may generate examples (positive or negative) to the TREND3 module, thus incorporating the branch-and-bound results into the DDODB heuristic module.
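The bounding step described above can be sketched in Prolog as follows. This is only an illustration of the pruning idea, not the branch-and-bound module itself: the candidate schemas, their cost values, and the predicate names `estimated_cost/2`, `heuristic_schema/1`, `within_bound/1` and `best_schema/1` are hypothetical, and the real cost function is the one defined in [30]. SWI-Prolog is assumed.

```prolog
% Hypothetical estimated query-processing costs for candidate schemas.
% The values are illustrative only; the real cost function is defined in [30].
estimated_cost(schema_a, 120).
estimated_cost(schema_b, 95).
estimated_cost(schema_c, 140).
estimated_cost(schema_d, 80).

% Assumption: schema_b is the schema produced by the heuristic module.
heuristic_schema(schema_b).

% A candidate survives the bound only if its estimated cost does not
% exceed the cost of the heuristic schema.
within_bound(Schema) :-
    heuristic_schema(H),
    estimated_cost(H, Bound),
    estimated_cost(Schema, Cost),
    Cost =< Bound.

% The best schema is a surviving candidate with minimal estimated cost.
best_schema(Best) :-
    within_bound(Best),
    estimated_cost(Best, Cost),
    \+ ( within_bound(Other),
         estimated_cost(Other, OtherCost),
         OtherCost < Cost ).
```

With these illustrative costs, the query `?- best_schema(S).` discards schema_a and schema_c, whose costs exceed the bound of 95, and selects schema_d.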
3 Theory Revision

The theory revision task [34] can be specified as the problem of finding a minimal modification of an initial theory that correctly classifies a set of training examples. Formally, it is defined as shown in Fig. 3.

Given: a target concept C
       a set P of positive instances of C
       a set N of negative instances of C
       a hypothesis language L
       an initial theory T expressed in L describing C
Find:  a revised theory RT expressed in L that is a minimal modification of T such that RT is correct on the instances of both P and N

Fig. 3. The theory revision task
A theory is a set of (function-free) definite program clauses, where a definite program clause is a clause of the form of (1).

$$\alpha \leftarrow \beta_1, \ldots, \beta_n. \qquad (1)$$

where $\alpha, \beta_1, \ldots, \beta_n$ are atomic formulae. A concept is a predicate in a theory for which examples appear in the training set. An instance, or example, is an instantiation (not necessarily ground) of a concept. An instance of the concept “cardinality” is

cardinality( connection, large )

Each instance $i$ has an associated set of facts $F_i$, which gathers all the instances of a concept in the training set. A positive instance should be derivable from the theory augmented with its associated facts, while the negative instances should not. In the DDODB domain, the set of facts defines a particular database schema definition (classes with their cardinalities, relationships – of a specific type – between classes) and the applications (operations with their frequencies, classifications and their accessed classes) that run on the database.
class( atomicPart )
class( compositePart )
cardinality( atomicPart, large )
cardinality( compositePart, small )
relationship( rootPart )
relationshipType( rootPart, '1:1' )
relationshipAccess( rootPart, compositePart, atomicPart )
operation( o1, 100 )
classification( o1, projection )
accessedClasses( o1, [atomicPart] )
The correctness of a theory is defined as follows: given a set P of positive instances and a set N of negative instances, a theory T is correct on these instances if and only if (2) holds.
$$\forall p \in P:\ T \cup F_p \vdash p \qquad \text{and} \qquad \forall p \in N:\ T \cup F_p \nvdash p. \qquad (2)$$
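Condition (2) can be checked mechanically. The sketch below is a minimal illustration under two simplifying assumptions that are not in the text: all instances share one fact base (instead of a separate set of facts per instance), and the theory is a single hypothetical rule `vertical/2` rather than the analysis algorithm. SWI-Prolog is assumed.

```prolog
% Shared facts (assumption: one fact base for all instances, for brevity).
classification(o1, projection).
classification(o2, selection).
cardinality(atomicPart, large).
cardinality(document, small).

% Hypothetical one-rule theory T.
vertical(Op, Class) :-
    classification(Op, projection),
    cardinality(Class, large).

% correct(+Pos, +Neg): the theory proves every positive instance
% and proves no negative instance, as required by condition (2).
correct(Pos, Neg) :-
    forall(member(P, Pos), call(P)),
    forall(member(N, Neg), \+ call(N)).
```

For instance, `?- correct([vertical(o1, atomicPart)], [vertical(o2, atomicPart), vertical(o1, document)]).` succeeds, whereas moving `vertical(o1, atomicPart)` to the negative list makes it fail.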
The revision process of an initial domain theory works by performing a set of modifications on it, in order to obtain a correct revised theory. The modifications performed on a theory are the result of applying revision operators that make small syntactic changes to it. A correct revised theory that is obtained through a minimal modification of the initial theory is achieved by minimizing the number of operations performed. By requiring minimal modification, we mean that the initial theory is assumed to be approximately correct, and therefore the revised theory should be as semantically and syntactically similar to it as possible. Related work [10, 11] presented a detailed comparison among many theory refinement systems, concentrating on theory revision systems, which in general have better results than theory-guided systems. The analysis included systems such as FORTE [29], A3 [33] and PTR+ [22]. That work proposed a framework for classifying theory revision systems and a methodology for evaluating how well an algorithm is able to identify the location of errors independently of its ability to repair them. The performance analysis of the FORTE system, compared to other systems in different domains, demonstrated that it searches a larger space of revised theories, and thus may find a more accurate candidate than either PTR+ or A3. Also, FORTE attempts to repair many more revision points than other systems, because it generates and evaluates more repair candidates. Therefore, the FORTE system was chosen to perform the improvement procedure of our DDODB algorithms. FORTE (First Order Revision of Theories from Examples) is a system for automatically refining first-order Horn-clause knowledge bases. This powerful representation language allows FORTE to work in domains involving relations, such as our DDODB domain. 
FORTE is a theory revision system, in the sense that it modifies incorrect knowledge by applying the "identify and repair" strategy. It performs a hill-climbing search in the hypothesis space, by applying revision operators (both specialization and generalization) to the initial domain theory in an attempt to minimally modify it in order to make it consistent with a set of training examples. By doing that, FORTE preserves as much of the initial theory as possible. Furthermore, revisions are developed and scored using the entire training set, rather than just a single instance, which gives FORTE a better direction than if revisions were developed from single instances. More details on the FORTE system may be found in [10, 29].
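The idea of scoring candidate revisions on the entire training set and greedily keeping the best one can be sketched as follows. This is not FORTE's implementation: the generation of candidates by revision operators is omitted, and the theory encoding (rules as `Head-BodyList` pairs), the example encoding `ex(Label, Goal, Facts)`, and all predicate names are assumptions made for illustration. SWI-Prolog is assumed.

```prolog
% prove(+Theory, +Facts, +Goal): Goal follows from the rules in Theory
% together with the facts associated with an example.
prove(_, Facts, Goal) :-
    member(Goal, Facts).
prove(Theory, Facts, Goal) :-
    member(Rule, Theory),
    copy_term(Rule, Goal-Body),        % fresh variables for each rule use
    prove_all(Theory, Facts, Body).

prove_all(_, _, []).
prove_all(Theory, Facts, [G|Gs]) :-
    prove(Theory, Facts, G),
    prove_all(Theory, Facts, Gs).

% A positive example must be provable; a negative one must not be.
classified(Theory, ex(pos, G, F)) :- once(prove(Theory, F, G)).
classified(Theory, ex(neg, G, F)) :- \+ prove(Theory, F, G).

% Score = number of training examples classified correctly.
score(Theory, Examples, Score) :-
    aggregate_all(count,
                  ( member(E, Examples), classified(Theory, E) ),
                  Score).

% One greedy step: keep the best-scoring candidate, but only if it beats
% the current theory on the whole training set (Candidates is non-empty).
hill_step(Current, Candidates, Examples, Next) :-
    score(Current, Examples, S0),
    findall(S-T, ( member(T, Candidates), score(T, Examples, S) ), Scored),
    max_member(BestS-BestT, Scored),
    (   BestS > S0 -> Next = BestT ; Next = Current ).

% Illustrative data: the candidate adds a rule for medium-sized classes.
demo(Next) :-
    Current = [ v(O, C) - [classification(O, projection), cardinality(C, large)] ],
    Added = [ v(O2, C2) - [classification(O2, projection), cardinality(C2, medium)]
            | Current ],
    Examples = [ ex(pos, v(o1, a), [classification(o1, projection), cardinality(a, medium)]),
                 ex(pos, v(o2, b), [classification(o2, projection), cardinality(b, large)]),
                 ex(neg, v(o3, c), [classification(o3, selection), cardinality(c, large)]) ],
    hill_step(Current, [Added], Examples, Next).
```

In `demo/1` the current theory classifies two of the three examples correctly and the candidate classifies all three, so the step returns the two-rule candidate. Repeating such steps until no candidate improves the score gives a hill-climbing search in the sense described above.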
4 Theory Revision on the Design of Distributed Databases
This section proposes a knowledge-based approach for improving the DDODB analysis algorithm through the use of theory revision. The goal of applying this knowledge-based approach is to automatically incorporate in the analysis algorithm the changes required to obtain better fragmentation schemas. These improvements may be found through additional experiments, so that theory revision can automatically reflect the new heuristics implicit in these new results. In order to apply the FORTE system to the DDODB problem, we had to model and represent all relevant information from the DDODB domain in the way required by FORTE. This basically included representing both our initial domain theory and the set of examples.

4.1 The Initial Domain Theory
In our TREND3 approach, we use our analysis algorithm as the initial domain theory. The overall structure of our set of rules is presented in Fig. 4. The complete Prolog implementation of the analysis algorithm is shown in [1].

[Figure 4: diagram of the rule structure. Recoverable labels: Database Schema; Analysis Phase (classes, operations); Analyze Operation; Choose Fragmentation Method; Choose VF; Choose PHF; Choose DHF; Database.]

Fig. 4. The overall structure of our set of rules for the analysis algorithm
FORTE assumes that the initial domain theory is divided into two files: the “fundamental domain theory” (FDT) file (with predicates that are assumed correct) and the “initial theory to be revised” (THY) file (with predicates subject to the revision process).

The Fundamental Domain Theory. The FDT file contains one clause for each of the attributes and relations used in the examples (which are defined in the DAT file through the predicate example, explained later), plus one clause for each object type. Given a database schema and a set of applications, the objects, their attributes and the relations between objects are fixed and represent all the information that is needed by the analysis algorithm, and therefore need not be revised. FORTE is responsible for decomposing the set of examples and creating extensional definitions for these attributes, relations and objects, which are accessed through the FORTE predicate example. The FDT file contains predicates from the initial domain theory that FORTE is not allowed to revise, and is illustrated in Fig. 5. The predicate navigatesFromTo from Fig. 5 defines whether an operation navigates from one class X to another class Y (or vice versa) in a path expression. Additionally, we had to create the predicates isNotDerivedFragmented and isNotVerticallyFragmented because negated literals (general logic programs) are not addressed by FORTE revision operators.

The Initial Theory To Be Revised. The THY file contains predicates from the initial domain theory for FORTE to revise (i.e., concepts from the analysis algorithm that may be modified), and is illustrated in Fig. 6. Intuitively, the clauses in Fig. 6 choose the fragmentation technique (derived horizontal, primary horizontal, vertical) to be applied to a class of the database schema according to the heuristics proposed in [2]. Hybrid fragmentation arises when both primary horizontal and vertical fragmentations are chosen, since their clauses are not exclusive.

4.2 The Set of Examples
Another essential input needed by FORTE for the theory revision process is the set of examples. For the TREND3 approach, they were derived from experimental results presented in [23, 25, 32] on top of the OO7 benchmark [12]. This benchmark describes a representative object-oriented application and it has been used in many object database management systems to evaluate their performance in centralized environments. Unfortunately, there are no other performance results on top of distributed databases available in the literature, due to security or commercial reasons. Each example represents a choice of the fragmentation technique to be applied to a class in a database schema. Positive/negative instances were generated by the choices that led to good/bad performance results in the distributed database. We obtained a total of 48 instances (19 positive and 29 negative).
/*** Object types that represent the database schema (classes and
     relationships) and are used in the examples ***/
class( X ) :- example( class( X ) ).
relationship( R ) :- example( relationship( R ) ).

/*** Object types that represent the operations (extracted from
     applications) and are used in the examples ***/
operation( O ) :- example( operation( O ) ).

/*** Attributes that qualify object types and are used in the examples ***/
/* attributes for classes */
cardinality( X, C ) :- example( cardinality( X, C ) ).
fragmentation( C, F ) :- example( fragmentation( C, F ) ).
/* attribute for relationships */
relationshipType( R, T ) :- example( relationshipType( R, T ) ).
/* attributes for operations */
frequency( O, F ) :- example( frequency( O, F ) ).
classification( O, C ) :- example( classification( O, C ) ).

/*** Relations between object types that are used in the examples ***/
relationshipAccess( X, Y, Z ) :- example( relationshipAccess( X, Y, Z ) ).
operationAccess( O, L ) :- example( operationAccess( O, L ) ).
navigates( O, X, Y ) :- example( navigates( O, X, Y ) ).

/*** Predicates which appear in the initial theory to be revised,
     but which FORTE is not allowed to revise ***/
isDerivedFragmented( X ) :-
    fragmentation( (_,X), derivedFragmentation ).
isNotDerivedFragmented( X ) :-
    \+ isDerivedFragmented( X ).
isVerticallyFragmented( X ) :-
    fragmentation( X, vertical ).
isNotVerticallyFragmented( X ) :-
    \+ isVerticallyFragmented( X ).
navigatesFromTo( O, X, Y ) :-
    operationAccess( O, ClassPath ),
    member( X, ClassPath ),
    member( Y, ClassPath ),
    navigates( O, X, Y ).
navigatesFromTo( O, X, Y ) :-
    operationAccess( O, ClassPath ),
    member( X, ClassPath ),
    member( Y, ClassPath ),
    navigates( O, Y, X ).

Fig. 5. The fundamental domain theory
chooseDerivedHorizontalFragmentationMethod( Oi, X, Y ) :-
    fdt:classification( Oi, navigation ),
    fdt:navigatesFromTo( Oi, Y, X ),
    fdt:relationshipAccess( Name, X, Y ),
    fdt:relationship( Name ),
    fdt:relationshipType( Name, 'N:1' ),
    fdt:isNotVerticallyFragmented( X ),
    fdt:isNotDerivedFragmented( X ).
chooseDerivedHorizontalFragmentationMethod( Oi, Y, X ) :-
    fdt:classification( Oi, navigation ),
    fdt:navigatesFromTo( Oi, X, Y ),
    fdt:relationshipAccess( Name, X, Y ),
    fdt:relationship( Name ),
    fdt:relationshipType( Name, '1:N' ),
    fdt:isNotVerticallyFragmented( Y ),
    fdt:isNotDerivedFragmented( Y ).
chooseDerivedHorizontalFragmentationMethod( Oi, Y, X ) :-
    fdt:classification( Oi, navigation ),
    fdt:navigatesFromTo( Oi, X, Y ),
    fdt:relationshipAccess( Name, X, Y ),
    fdt:relationship( Name ),
    fdt:relationshipType( Name, '1:1' ),
    fdt:isNotVerticallyFragmented( Y ),
    fdt:isNotDerivedFragmented( Y ).
choosePrimaryHorizontalFragmentationMethod( Oi, X ) :-
    fdt:classification( Oi, selection ),
    fdt:operationAccess( Oi, [X] ),
    fdt:cardinality( X, large ).
chooseVerticalFragmentationMethod( Oi, X ) :-
    fdt:classification( Oi, projection ),
    fdt:operationAccess( Oi, [X|_] ),
    fdt:cardinality( X, large ),
    fdt:isNotDerivedFragmented( X ).

Fig. 6. The initial theory to be revised
The representation of an example in FORTE is an atomic formula as in (3),

example( PositiveInstances, NegativeInstances, Objects, Facts )    (3)

where PositiveInstances (NegativeInstances) is a list of positive (negative) facts of the concept to be learned, Objects is the representation of the application domain (in the DDODB domain, objects are represented as the classes and operations of the current application), and Facts are facts from the fundamental domain theory. Figure 7 shows an example of choosing vertical fragmentation for class atomicPart from the OO7 benchmark application, during the analysis of a projection operation. In the example of Fig. 7, the positive instance is given by the term chooseVerticalFragmentationMethod(o1, atomicPart). There are no negative instances defined. The objects are the sets of classes, relationships and operations of the application, while the facts define existing relations between application objects (e.g., which classes are accessed by each relationship, which operations compose a query, which classes are accessed by each operation). TREND3 examples are passed to FORTE in a data file (DAT). The DAT file contains examples from which FORTE will learn, and also defines execution parameters to guide the FORTE learning process. The complete description of the DAT file for the OO7 benchmark is in [1].
example( [ chooseVerticalFragmentationMethod(o1, atomicPart) ],
         [ ],
         [ class([ [designObject, none, none],
                   [baseAssembly, small, none],
                   [compositePart, small, none],
                   [atomicPart, medium, none],
                   [connection, large, none] ]),
           relationship([ [componentsShared, 'N:N'],
                          [componentsPrivate, '1:N'],
                          [rootPart, '1:1'],
                          [parts, '1:N'],
                          [from, '1:N'],
                          [to, '1:N'] ]),
           operation([ [o1, projection] ]) ],
         facts([ relationshipAccess(compShared, baseAssembly, compositePart),
                 relationshipAccess(compPrivate, baseAssembly, compositePart),
                 relationshipAccess(rootPart, compositePart, atomicPart),
                 relationshipAccess(parts, compositePart, atomicPart),
                 relationshipAccess(from, atomicPart, connection),
                 relationshipAccess(to, atomicPart, connection),
                 query( q1, 100, [o1] ),
                 operationAccess( o1, [atomicPart] ) ]) ).

Fig. 7. A FORTE example from the OO7 benchmark application
5 Experimental Results

This section presents experimental results of TREND3 on top of the OO7 benchmark, showing the effectiveness of our approach in obtaining an analysis algorithm that produces a better fragmentation schema for the OO7 benchmark application. Due to the small number of examples available, and to overcome the overfitting problem during training, we applied a k-fold cross-validation approach to split the input data into disjoint training and test sets and, within that, a t-fold cross-validation approach to split the training data into disjoint training and tuning sets [26, 21]. The revision algorithm monitors the error with respect to the tuning set after each revision, always keeping a copy of the theory with the best tuning-set accuracy, and the saved "best-tuning-set-accuracy" theory is applied to the test set. The experimental methodology built into FORTE, which is currently random resampling, was adapted to follow the one above. The experiments were executed with k = 12 and t = 4. Each run was executed with a training set of 33 instances, a tuning set of 11 instances and a test set of 4 instances, and obtained a revised theory as its final result. In all k runs, the best tuning-set accuracy was 100%.
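The set sizes quoted above follow directly from the nested split of the 48 instances. The short sketch below reproduces that arithmetic; the predicate name `split_sizes/6` is hypothetical, and integer division is assumed, which is exact for these values.

```prolog
% split_sizes(+Total, +K, +T, -Train, -Tune, -Test)
% The outer K-fold split reserves Total/K instances for testing; the
% inner T-fold split reserves 1/T of the remainder for tuning.
split_sizes(Total, K, T, Train, Tune, Test) :-
    Test is Total // K,
    Rest is Total - Test,
    Tune is Rest // T,
    Train is Rest - Tune.
```

The query `?- split_sizes(48, 12, 4, Train, Tune, Test).` yields `Train = 33, Tune = 11, Test = 4`, matching the sizes reported for each run.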
Table 1 shows the results of the execution of 12 independent runs, and therefore each result refers to a different revised theory proposed by FORTE. We verified that all proposed revised DDODB theories were identical, and represented the final revised DDODB theory (Figure 8). By comparing the definitions of the choosePrimaryHorizontalFragmentationMethod and chooseVerticalFragmentationMethod predicates in Figs. 6 and 8, it may be verified that the following revisions were made by FORTE:
1) Rule addition: The following rule was added:
   chooseVerticalFragmentationMethod(A,B) :- cardinality(B,medium), classification(A,projection).
2) Antecedent deletion: The antecedent fdt:cardinality(B,large) was removed from the rule:
   choosePrimaryHorizontalFragmentationMethod(A,B) :- fdt:classification(A,selection), fdt:operationAccess(A,[B]), fdt:cardinality(B,large).
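The effect of these two revisions can be illustrated with a small propositional sketch. It is an assumption-laden simplification, not the FORTE theory itself: the class cardinalities and operation o1 come from the Fig. 7 example, the selection operation `o_sel` is hypothetical, the isNotDerivedFragmented condition of the large-class rule is ignored, and the large-class rule is assumed to be the only vertical rule before revision.

```python
# Facts from the Fig. 7 example, plus one hypothetical selection operation.
CARDINALITY = {"designObject": "none", "baseAssembly": "small",
               "compositePart": "small", "atomicPart": "medium",
               "connection": "large"}
CLASSIFICATION = {"o1": "projection", "o_sel": "selection"}   # o_sel is hypothetical
OPERATION_ACCESS = {"o1": ["atomicPart"], "o_sel": ["atomicPart"]}


def vertical_initial(op, cls):
    """Projection on a large class that is the first class accessed."""
    return (CLASSIFICATION.get(op) == "projection"
            and OPERATION_ACCESS.get(op, [])[:1] == [cls]
            and CARDINALITY.get(cls) == "large")


def vertical_revised(op, cls):
    """Initial rule plus the added rule for medium-sized classes."""
    added_rule = (CARDINALITY.get(cls) == "medium"
                  and CLASSIFICATION.get(op) == "projection")
    return added_rule or vertical_initial(op, cls)


def primary_horizontal(op, cls, require_large):
    """Selection accessing exactly cls; the size test is the deleted antecedent."""
    return (CLASSIFICATION.get(op) == "selection"
            and OPERATION_ACCESS.get(op) == [cls]
            and (not require_large or CARDINALITY.get(cls) == "large"))


print(vertical_initial("o1", "atomicPart"), vertical_revised("o1", "atomicPart"))
print(primary_horizontal("o_sel", "atomicPart", require_large=True),
      primary_horizontal("o_sel", "atomicPart", require_large=False))
```

Under these assumptions the medium-sized class atomicPart is rejected by both initial rules and accepted by both revised ones.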
Intuitively, these modifications show that medium-sized classes are also subject to vertical fragmentation in the case of a projection operation, and that classes may have primary horizontal fragmentation independently of their size. By running both versions of the analysis algorithm on top of the OO7 benchmark application, we notice that class atomicPart is indicated for hybrid fragmentation (primary horizontal + vertical) after the revision (instead of derived horizontal fragmentation), as illustrated in Table 2.
Table 1. Summary of the FORTE execution output
K    Initial Training   Initial Test Set   Final Training   Final Test Set
     Accuracy (%)       Accuracy (%)       Accuracy (%)     Accuracy (%)
1    61.36              100.00             93.18            100.00
2    61.36              100.00             93.18            100.00
3    63.64               75.00             95.45             75.00
4    63.64               75.00             93.18            100.00
5    63.64               75.00             95.45             75.00
6    63.64               75.00             95.45             75.00
7    61.36              100.00             93.18            100.00
8    70.45                0.00             93.18            100.00
9    61.36              100.00             93.18            100.00
10   70.45                0.00             93.18            100.00
11   70.45                0.00             93.18            100.00
12   63.64               75.00             93.18            100.00
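The per-run figures of Table 1 can be condensed into column averages. The short script below is only a reading aid added here (the averages themselves are not reported in the text); the names `ROWS` and `column_means` are illustrative.

```python
# Rows of Table 1, in order K = 1..12:
# (initial training, initial test, final training, final test) accuracy in %.
ROWS = [
    (61.36, 100.00, 93.18, 100.00),
    (61.36, 100.00, 93.18, 100.00),
    (63.64, 75.00, 95.45, 75.00),
    (63.64, 75.00, 93.18, 100.00),
    (63.64, 75.00, 95.45, 75.00),
    (63.64, 75.00, 95.45, 75.00),
    (61.36, 100.00, 93.18, 100.00),
    (70.45, 0.00, 93.18, 100.00),
    (61.36, 100.00, 93.18, 100.00),
    (70.45, 0.00, 93.18, 100.00),
    (70.45, 0.00, 93.18, 100.00),
    (63.64, 75.00, 93.18, 100.00),
]


def column_means(rows):
    """Average each accuracy column over all runs."""
    return [sum(column) / len(column) for column in zip(*rows)]


labels = ["initial training", "initial test", "final training", "final test"]
for label, mean in zip(labels, column_means(ROWS)):
    print(f"{label}: {mean:.2f}")
```

Averaged over the 12 runs, both accuracies rise from roughly 64.6% before revision to roughly 93.75% after it.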
chooseDerivedHorizontalFragmentationMethod(A,B,C) :-
    fdt:classification(A,navigation), fdt:navigatesFromTo(A,C,B),
    fdt:relationshipAccess(D,B,C), fdt:relationship(D),
    fdt:relationshipType(D,N:1),
    fdt:isNotVerticallyFragmented(B), fdt:isNotDerivedFragmented(B).
chooseDerivedHorizontalFragmentationMethod(A,B,C) :-
    fdt:classification(A,navigation), fdt:navigatesFromTo(A,C,B),
    fdt:relationshipAccess(D,C,B), fdt:relationship(D),
    fdt:relationshipType(D,1:N),
    fdt:isNotVerticallyFragmented(B), fdt:isNotDerivedFragmented(B).
chooseDerivedHorizontalFragmentationMethod(A,B,C) :-
    fdt:classification(A,navigation), fdt:navigatesFromTo(A,C,B),
    fdt:relationshipAccess(D,C,B), fdt:relationship(D),
    fdt:relationshipType(D,1:1),
    fdt:isNotVerticallyFragmented(B), fdt:isNotDerivedFragmented(B).
choosePrimaryHorizontalFragmentationMethod(A,B) :-
    fdt:classification(A,selection), fdt:operationAccess(A,[B]).
chooseVerticalFragmentationMethod(A,B) :-
    cardinality(B,medium), classification(A,projection).
chooseVerticalFragmentationMethod(A,B) :-
    fdt:classification(A,projection), fdt:operationAccess(A,[B|C]),
    fdt:cardinality(B,large), fdt:isNotDerivedFragmented(B).
Fig. 8. The revised analysis algorithm

Table 2. Fragmentation techniques chosen by both versions of the analysis algorithm
Class           Initial version       Revised version
baseAssembly    primary horizontal    primary horizontal
compositePart   derived horizontal    derived horizontal
atomicPart      derived horizontal    hybrid
connection      derived horizontal    derived horizontal
We then compared the costs of the resulting fragmentation schemas obtained from the initial and the revised versions of the analysis algorithm, after executing the vertical and horizontal fragmentation algorithms (those algorithms were not considered for the revision process). These costs were calculated according to the cost model from [30], assuming that the query optimizer was able to choose the most efficient way of executing each query (that is, choosing the cheapest of the “naïve-pointer”, value-based join and pointer-based join algorithms). The resulting costs are illustrated in Fig. 9.
[Figure 9: bar chart comparing, for each OO7 operation (Q1, Q2, Q3, T1, T2), the cost obtained with the Initial Analysis Algorithm and with the Revised Analysis Algorithm; vertical axis: CPU and communication costs, from 0 to 20,000.]
Fig. 9. Comparing the costs of the fragmentation schemas obtained from the initial and the revised analysis algorithm
Figure 9 shows the cost of executing each query from the OO7 benchmark application. The total cost of the OO7 benchmark application, according to the frequencies of each operation, can be calculated as:

Cost(OO7) = […] cost(Q1) + […] cost(Q2) + […] cost(Q3) + […] cost(T1) + […] cost(T2)

which produces the following costs for the two versions of the analysis algorithm that are being compared:

Cost_of_InitialVersion(OO7) = […]
Cost_of_RevisedVersion(OO7) = […]
Our results show the effectiveness of the TREND3 approach in revising the analysis algorithm and obtaining a new version that produced a fragmentation schema that reduced the cost (i.e., increased the performance) of the OO7 application by 38%.
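The frequency-weighted total cost and the relative reduction derived from it can be sketched as below. The operation names, frequencies and costs in the usage lines are made-up toy values chosen only to exercise the functions; they are not the OO7 figures, and the helper names are hypothetical.

```python
def total_cost(frequency, cost):
    """Frequency-weighted sum of per-operation costs."""
    return sum(frequency[op] * cost[op] for op in frequency)


def relative_reduction(initial, revised):
    """Fraction by which the revised total is lower than the initial one."""
    return (initial - revised) / initial


# Toy values for illustration only (NOT taken from the OO7 experiments).
toy_frequency = {"opA": 2, "opB": 1}
toy_initial = {"opA": 10, "opB": 30}
toy_revised = {"opA": 5, "opB": 30}

initial_total = total_cost(toy_frequency, toy_initial)
revised_total = total_cost(toy_frequency, toy_revised)
print(initial_total, revised_total, relative_reduction(initial_total, revised_total))
```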
6 Conclusions
Heuristic algorithms are used to address the intractability of the class fragmentation problem in the design of a distributed database, which is known to be an NP-hard problem. However, once such algorithms are defined, it is very difficult to improve them by manually defining and incorporating new heuristics from experimental results while keeping the previous ones consistent. This work presented a knowledge-based approach for automatically improving a heuristic DDODB algorithm through the use of theory revision. This approach is part of a framework that handles the class fragmentation problem of the design of distributed databases. The proposed framework integrates three modules: the DDODB heuristic module, the theory revision module (called TREND3) and the DDODB branch-and-bound module.
The focus of this work was to apply TREND3 to automatically revise the analysis algorithm of the heuristic module, according to experimental results on top of the OO7 benchmark presented as examples. The revised algorithm led to an improvement of 38% in the overall system performance. This shows the effectiveness of our approach in finding a fragmentation schema with improved performance through the use of inductive logic programming. Future work will include applying TREND3 to other applications, and generating examples for the TREND3 module with the branch-and-bound module, to address the lack of performance results on top of distributed databases in the literature. We also intend to enhance the FORTE system to deal with negation as failure, extending the ideas already mentioned in previous works of our group [1,16].

Acknowledgements

The Brazilian authors would like to thank the Brazilian agencies CNPq and FAPERJ for providing financial support for this work, and Savio Leandro Aguiar for helping with the implementation of the experimental methodology in FORTE. Part of this work was done at the Computer Science Department of the University of Wisconsin-Madison, USA, while the authors Fernanda Baião and Gerson Zaverucha were on leave from UFRJ.
References

1. Baião, F. (2001). A Methodology and Algorithms for the Design of Distributed Databases using Theory Revision. Doctoral Thesis, Computer Science Department – COPPE, Federal University of Rio de Janeiro, Brazil. Technical Report ES-565/02 (2002), COPPE/UFRJ.
2. Baião, F., and Mattoso, M. (1998). A Mixed Fragmentation Algorithm for Distributed Object Oriented Databases. Proc Int’l Conf Computing and Information (ICCI'98), Winnipeg, pp. 141-148. Also in: Special Issue of Journal of Computing and Information (JCI), 3(1), ISSN 1201-8511, pp. 141-148.
3. Baião, F., Mattoso, M., and Zaverucha, G. (1998a). Towards an Inductive Design of Distributed Object Oriented Databases. Proc Third IFCIS Conf on Cooperative Information Systems (CoopIS'98), IEEE CS Press, New York, USA, pp. 88-197.
4. Baião, F., Mattoso, M., and Zaverucha, G. (2001). A Distribution Design Methodology for Object DBMS. Submitted in Aug 2000; revised manuscript sent in Nov 2001 to International Journal of Distributed and Parallel Databases, Kluwer Academic Publishers.
5. Baião, F., Mattoso, M., and Zaverucha, G. (2002). A Framework for the Design of Distributed Databases. Workshop on Distributed Data & Structures (WDAS 2002). In: Proceedings in Informatics series, Carleton Scientific.
6. Bellatreche, L., Simonet, A., and Simonet, M. (1996). Vertical Fragmentation in Distributed Object Database Systems with Complex Attributes and Methods. Proc 7th Int’l Workshop Database and Expert Systems Applications (DEXA’96), IEEE Computer Society, Zurich, pp. 15-21.
7. Bellatreche, L., Karlapalem, K., and Simonet, A. (2000). Algorithms and Support for Horizontal Class Partitioning in Object-Oriented Databases. Int’l Journal of Distributed and Parallel Databases, 8(2), Kluwer Academic Publishers, pp. 155-179.
8. Blockeel, H., and De Raedt, L. (1996). Inductive Database Design. Proc Int’l Symposium on Methodologies for Intelligent Systems (ISMIS’96).
9. Blockeel, H., and De Raedt, L. (1998). IsIdd: an Interactive System for Inductive Database Design. Applied Artificial Intelligence, 12(5), pp. 385-420.
10. Brunk, C. (1996). An Investigation of Knowledge Intensive Approaches to Concept Learning and Theory Refinement. PhD Thesis, University of California, Irvine, USA.
11. Brunk, C., and Pazzani, M. (1995). A Linguistically-Based Semantic Bias for Theory Revision. Proc 12th Int’l Conf of Machine Learning.
12. Carey, M., DeWitt, D., and Naughton, J. (1993). The OO7 Benchmark. Proc 1993 ACM SIGMOD 22(2), Washington DC, pp. 12-21.
13. Chen, Y., and Su, S. (1996). Implementation and Evaluation of Parallel Query Processing Algorithms and Data Partitioning Heuristics in Object Oriented Databases. Int’l Journal of Distributed and Parallel Databases, 4(2), Kluwer Academic Publishers, pp. 107-142.
14. Ezeife, C., and Barker, K. (1995). A Comprehensive Approach to Horizontal Class Fragmentation in a Distributed Object Based System. Int’l Journal of Distributed and Parallel Databases, 3(3), Kluwer Academic Publishers, pp. 247-272.
15. Ezeife, C., and Barker, K. (1998). Distributed Object Based Design: Vertical Fragmentation of Classes. Int’l Journal of Distributed and Parallel Databases, 6(4), Kluwer Academic Publishers, pp. 317-350.
16. Fogel, L., and Zaverucha, G. (1998). Normal programs and Multiple Predicate Learning. Proc 8th Int’l Conference on Inductive Logic Programming (ILP’98), Madison, July, LNAI 1446, Springer Verlag, pp. 175-184.
17. Fung, C., Karlapalem, K., and Li, Q. (2002). Object-Oriented Systems: An Evaluation of Vertical Class Partitioning for Query Processing in Object-Oriented Databases. IEEE Transactions on Knowledge and Data Engineering, Sep/Oct, Vol. 14, No. 5.
18. Getoor, L., Friedman, N., Koller, D., and Taskar, B. (2001). Probabilistic Models of Relational Structure. Proc Int’l Conf Machine Learning, Williamstown.
19. Getoor, L., Taskar, B., and Koller, D. (2001). Selectivity Estimation using Probabilistic Models. Proc 2001 ACM SIGMOD, Santa Barbara, CA.
20. Karlapalem, K., Navathe, S., and Morsi, M. (1994). Issues in Distribution Design of Object-Oriented Databases. In M. Özsu et al. (eds.), Distributed Object Management, Morgan Kaufmann Pub Inc., San Francisco, USA.
21. Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the IJCAI 1995, pp. 1137-1145.
22. Koppel, M., Feldman, R., and Segre, A. (1994). Bias-Driven Revision of Logical Domain Theories. Journal of Artificial Intelligence Research, 1, AI Access Foundation and Morgan Kaufmann, pp. 159-208.
23. Lima, F., and Mattoso, M. (1996). Performance Evaluation of Distribution in OODBMS: a Case Study with O2. Proc IX Int'l Conf. on Par & Dist Computing Systems (PDCS'96), ISCA-IEEE, Dijon, pp. 720-726.
24. Maier, D., et al. (1994). Issues in Distributed Object Assembly. In M. Özsu et al. (eds.), Distributed Object Management, Morgan Kaufmann Publishers Inc., San Francisco, USA.
25. Meyer, L., and Mattoso, M. (1998). Parallel query processing in a shared-nothing object database server. Proc 3rd Int’l Meeting on Vector and Parallel Processing (VECPAR'98), Porto, Portugal, pp. 1007-1020.
26. Mitchell, T. (1997). Machine Learning, McGraw-Hill Inc.
27. Muggleton, S., and De Raedt, L. (1994). Inductive logic programming: Theory and methods. Journal of Logic Programming, 19(20), pp. 629-679.
28. Özsu, M., and Valduriez, P. (1999). Principles of Distributed Database Systems. New Jersey, Prentice-Hall, 2nd edition.
29. Richards, B., and Mooney, R. (1995). Refinement of First-Order Horn-Clause Domain Theories. Machine Learning, 19(2), pp. 95-131.
30. Ruberg, G., Baião, F., and Mattoso, M. (2002). Estimating Costs of Path Expression Evaluation in Distributed Object Databases. In: Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA 2002), LNCS v.2453, Springer Verlag, pp. 351-360.
31. Savonnet, M., Terrasse, M., and Yétongnon, K. (1998). Fragtique: A Methodology for Distributing Object Oriented Databases. Proc Int’l Conf Computing and Information (ICCI'98), Winnipeg, pp. 149-156.
32. Tavares, F., Victor, A., and Mattoso, M. (2000). Parallel Processing Evaluation of Path Expressions. Proc XV Brazilian Symposium on Databases, SBC, João Pessoa, Brazil, pp. 49-63.
33. Wogulis, J. (1994). An Approach to Repairing and Evaluating First-Order Theories Containing Multiple Concepts and Negation. Ph.D. Thesis, University of California, Irvine, USA.
34. Wrobel, S. (1996). First Order Theory Refinement. In L. De Raedt (ed.), Advances in Inductive Logic Programming, IOS Press, pp. 14-33.
Disjunctive Learning with a Soft-Clustering Method

Guillaume Cleuziou, Lionel Martin, and Christel Vrain

LIFO, Laboratoire d’Informatique Fondamentale d’Orléans
Faculté des Sciences, Rue Léonard de Vinci B.P. 6759
45067 Orléans cedex2, France
{cleuziou,martin,cv}@lifo.univ-orleans.fr
Abstract. In the case of concept learning from positive and negative examples, it is rarely possible to find a unique discriminating conjunctive rule; in most cases, a disjunctive description is needed. This problem, known as disjunctive learning, is mainly solved by greedy methods, iteratively adding rules until all positive examples are covered. Each rule is determined by discriminating properties, where the discriminating power is computed from the learning set. With these methods, each rule defines a subconcept of the concept to be learned. The final set of subconcepts is then highly dependent on both the learning set and the learning method. In this paper, we propose a different strategy: we first build clusters of similar examples, thus defining subconcepts, and then we characterize each cluster by a unique conjunctive definition. The clustering method relies on a similarity measure designed for examples described in first-order logic. The main particularity of our clustering method is to build “soft clusters”, i.e. to allow some objects to belong to different groups. Once clusters have been built, we learn first-order rules defining the clusters, using a general-to-specific method: each step consists in adding a literal that covers all examples of a group and rejects as many negative examples as possible. This strategy limits some drawbacks of greedy algorithms and induces a strong reduction of the hypothesis space: for each group (subconcept), the search space is reduced to the set of rules that cover all the examples of the group and reject the negative examples of the concept.
1 Introduction
In this paper, we are interested in learning a disjunctive definition of a concept from positive and negative examples. Introducing disjunction into the hypothesis space is important because many concepts are truly disjunctive and their definitions require several rules. A very simple example of such a concept is the concept “parent”, which is in fact composed of two subconcepts, “father” and “mother”, and is defined by two clauses: parent(X, Y) ← mother(X, Y) and parent(X, Y) ← father(X, Y); each rule defines a subconcept of the initial concept.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 75–92, 2003. © Springer-Verlag Berlin Heidelberg 2003
Nevertheless, learning disjunctive concepts leads to several problems. First, there exists a trivial definition of the concept that covers all the positive examples and rejects the negative ones (except when an example is labeled as both positive and negative), namely the disjunction of the positive examples. Secondly, the complexity of the search space increases. To deal with these problems, several algorithms have been developed. For instance, in the system INDUCE [HMS83], R.S. Michalski introduced the star method, which consists in choosing a positive example e (often called the seed), building the star¹ of e w.r.t. the negative examples, choosing the best definition in the star, and iterating the process until all the positive examples are covered. This leads to a disjunctive definition composed of all the descriptions found so far. A similar strategy is applied in the system Progol [Mug95]: a positive example e is chosen and Progol explores the search space defined by the descriptions that are more general than e. In both cases, the result strongly depends on the choice of the seeds.

The system Foil [QCJ95] also searches for a disjunctive definition by iteratively building a conjunctive description that covers positive examples and rejects the negative ones. Nevertheless, the strategy used to build a conjunctive definition differs: Foil starts from the most general clause, namely the clause with an empty body, and iteratively adds a literal to the body of the clause until all the negative examples are rejected. The choice of the literal depends on a heuristic function that computes, for each literal, its gain in terms of the number of positive examples still covered when adding this literal to the clause, the number of positive instantiations and the number of negative instantiations still covered. This strategy suffers from several drawbacks. As mentioned in [QCJ95,SBP93], there exist determinate literals that cover all the positive examples and all the negative ones; such literals thus have a very poor gain and are usually not chosen by the system, although they can be very important for building a good definition, since they allow the introduction of a new variable in the definition. Moreover, the result depends on the heuristic function.

In this paper, we propose another strategy: first, using a similarity measure, we build clusters of examples, thus defining subconcepts. Once subconcepts have been found, we learn a conjunctive description of each subconcept. The set of all rules gives us the definition of the concept. An important point in this work is that the similarity measure is based on a language specified by the user, which makes it possible to capture more complex similarities between examples than properties depending on a single literal.
¹ The star of an example e w.r.t. the negative examples is the set of all the conjunctive descriptions that cover e and reject all the negative examples.

2 Overall Presentation

2.1 Motivations

As already mentioned in the introduction, many disjunctive learning algorithms use a greedy method, iteratively adding rules until all positive examples are covered. Depending on the system, the rules themselves can also be built greedily, starting from a general rule and specializing it by adding at each step the most discriminant literal. Various methods are proposed in the literature to define the discriminative power (denoted by Γ(l, C)) of a literal l for a clause C. Let E+ (resp. E−) be the set of positive (resp. negative) examples and let cov(E, C) be the number of examples² from E covered by the clause C; Γ(l, C) depends at least on cov(E+, C ∪ {l}) and cov(E−, C ∪ {l}), where C ∪ {l} denotes the clause obtained by adding l to the body of C.

The definition of Γ(l, C) plays a central role in this process, and therefore in the set of subconcepts induced. Moreover, Γ(l, C) is highly dependent on the learning sets E+ and E−: small changes in these sets may lead to very different solutions, and thus to different subconcepts. Finally, if h ← p, q is a clause which covers some positive examples and no negative ones, Γ(p, h ←) and Γ(q, h ←) are not necessarily high, as shown in Example 1. The consequence is that it may be impossible for a greedy method to learn the clause h ← p, q because, starting from h ←, neither p nor q seems to be a discriminant literal. This problem is more general than that of determinate literals, already pointed out in [QCJ95,SBP93].

To sum up, the main drawback of greedy methods is that each subconcept is characterized by a clause h ← b1, ..., bn where bi is a highly discriminant literal for h ← b1, ..., bi−1. Instead of having subconcepts depending on the discriminating power function, we propose a two-step process: (1) building sets of similar positive examples, thus defining subconcepts in extension; (2) for each subconcept, searching for a unique rule characterizing it, i.e. a rule that covers all the positive examples of that subconcept and none of the negative examples.
– The main advantage of this approach is to strongly reduce the search space when searching for the subconcept definition: hypotheses are considered only if they cover all the positive examples of the subconcept³. In practice, let G be a subconcept (a set of positive examples) and let us assume that C = h ← b1, ..., bi is the clause under construction. A literal l can be added to the clause only if cov(G, C ∪ {l}) = |G|. The discriminative power is then no longer a combination of 2 criteria (cov(E+, C ∪ {l}) and cov(E−, C ∪ {l})) but is mainly characterized by cov(E−, C ∪ {l}). For example, we can choose to add to C the literal l that minimizes cov(E−, C ∪ {l}) among the set of literals li such that cov(G, C ∪ {li}) = |G|.
– When it is not possible to find a clause that defines a subconcept G, then G is split into new subconcepts and definitions for these subconcepts are looked for. This process is repeated until a clause can be learned for each subconcept. In the worst case, this leads to subconcepts containing a single positive example, an overfitting situation which may also occur with greedy algorithms, but which has not been encountered in our experiments. If no clause can be found for such a subconcept, this means that there is no complete and consistent solution for the initial learning problem.

² In Inductive Logic Programming, the number of instantiations of C that cover an element of E may also be used to define the discriminating power.
³ This constraint can be relaxed to handle noise: hypotheses are considered only if they cover most of the positive examples of the subconcept.

2.2 Concept Decomposition and Clustering
In greedy algorithms the definition of Γ(l, C) plays a central role, whereas in our approach the decomposition of a concept into subconcepts is essential. The goal of this paper is to propose a clustering method that allows the formation of subconcepts. An important point is then the evaluation of our clustering method, i.e., testing whether the decomposition leads to “good” subconcepts. The quality of the decomposition can be considered from different points of view that are not exclusive:
– natural organization: in most cases, a concept can be “naturally” divided into subconcepts: each subconcept is composed of similar examples, which share some characteristic properties.
– simplicity: from a practical point of view, a decomposition is good if it produces a low number of simple (short) rules.
– accuracy: the accuracy of the prediction on new examples is good.

We propose a two-step decomposition method. First, we define a similarity measure and we compute the similarity between each pair of positive examples. Then, we use a clustering algorithm to build possibly overlapping groups of similar objects. To motivate the approach, let us consider the following example. For the sake of simplicity, we consider simple examples which could be expressed in an attribute/value formalism. Nevertheless, as shown in Section 5, our approach can be applied to specific ILP problems.

Example 1. Let us consider background knowledge composed of the following ground atoms: {r(a), r(b), r(f), r(h), s(a), s(b), s(g), s(i), t(c), t(d), t(g), t(i), u(c), u(d), u(f), u(h), v(b), v(c), v(f), r(e), s(e), t(e), u(e), v(e)}. Let us assume that the set of positive examples is {p(a), p(b), p(c), p(d)} and the set of negative examples is {p(f), p(g), p(h), p(i)}.

Considering a greedy algorithm, similar for instance to the Foil algorithm, the construction starts with the clause p(X) ←:
– if we add one of the literals r(X), s(X), t(X) or u(X), we get a clause covering 2 positive examples and 2 negative ones;
– if we add the literal v(X), we get a clause which covers 2 positive examples and 1 negative one.
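The coverage counts quoted in these two items can be checked mechanically. The snippet below is a minimal propositional sketch (each unary predicate is stored as the set of constants it holds for); the names `BK`, `POS`, `NEG` and `coverage` are illustrative and not part of any ILP system.

```python
# Background knowledge of Example 1: predicate -> constants satisfying it.
BK = {
    "r": set("abfhe"),
    "s": set("abgie"),
    "t": set("cdgie"),
    "u": set("cdfhe"),
    "v": set("bcfe"),
}
POS = set("abcd")   # p(a), p(b), p(c), p(d)
NEG = set("fghi")   # p(f), p(g), p(h), p(i)


def coverage(literal):
    """Positive and negative examples covered by p(X) <- literal(X)."""
    return len(BK[literal] & POS), len(BK[literal] & NEG)


for literal in sorted(BK):
    print(literal, coverage(literal))
```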
The literal v(X) is chosen and the final learned program contains at least 3 clauses. However, the concept can be characterized by the two following clauses:
C1 : p(X) ← r(X), s(X)
C2 : p(X) ← t(X), u(X)
which lead to 2 subconcepts: {p(a), p(b)} and {p(c), p(d)}. In order to learn these 2 clauses, our decomposition method should build these 2 subconcepts from the background knowledge, i.e. it should consider that p(a) and p(b) (resp. p(c) and p(d)) are highly similar. We propose to use the similarity measure defined in [MM01]: it is based on the number of properties shared by two objects, from a specified set of properties called the language. To introduce this measure on the example, let us consider a language L containing all the clauses with head p(X) and with a single literal in the body from {r(X), s(X), t(X), u(X), v(X)}. The similarity between two examples ei and ej is then defined by the number of clauses C in L such that either ei and ej are both covered by C or ei and ej are both not covered by C. For example, p(a) and p(b) are both covered by p(X) ← r(X) and by p(X) ← s(X); they are both not covered by p(X) ← t(X) and by p(X) ← u(X). In this case, the similarity between p(a) and p(b) is equal to 4. In the same way, we can compute the similarity between each pair of positive examples, and we get the following matrix:
        p(a)  p(b)  p(c)  p(d)
p(a)     5     4     0     1
p(b)     4     5     1     0
p(c)     0     1     5     4
p(d)     1     0     4     5
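As a check, the matrix above can be recomputed from the background knowledge of Example 1. This is a minimal sketch assuming the one-literal language L described in the text; the helper names are illustrative.

```python
# Background knowledge of Example 1: predicate -> constants satisfying it.
BK = {
    "r": set("abfhe"),
    "s": set("abgie"),
    "t": set("cdgie"),
    "u": set("cdfhe"),
    "v": set("bcfe"),
}


def profile(example):
    """Truth value of each one-literal clause p(X) <- q(X) on the example."""
    return tuple(example in BK[predicate] for predicate in sorted(BK))


def similarity(e1, e2):
    """Number of clauses of L on which the two examples agree."""
    return sum(x == y for x, y in zip(profile(e1), profile(e2)))


examples = "abcd"
matrix = [[similarity(x, y) for y in examples] for x in examples]
for name, row in zip(examples, matrix):
    print(f"p({name})", row)
```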
In this matrix, the 2 expected subconcepts clearly appear. However, in many problems, subconcepts are not disjoint. For instance, let us now add to the previous example the new positive example p(e). With this new example, the similarity becomes:
        p(a)  p(b)  p(e)  p(c)  p(d)
p(a)     5     4     2     0     1
p(b)     4     5     3     1     0
p(e)     2     3     5     3     2
p(c)     0     1     3     5     4
p(d)     1     0     2     4     5
p(d) 1 0 2 4 5
From this matrix, the 2 initial subconcepts {p(a), p(b)} and {p(c), p(d)} still appear, but p(e) can be inserted into both subconcepts. In fact, the new example is covered by the clauses C1 and C2 and then the subconcepts induced by these two clauses are {p(a), p(b), p(e)} and {p(e), p(c), p(d)}. In this paper, we propose a clustering method able to build non disjoint clusters. When applied to this example, it leads to the building of the two clusters {p(a), p(b), p(e)} and {p(e), p(c), p(d)}.
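Given these two overlapping clusters, the general-to-specific step outlined in Section 2.1 (only add literals that cover the whole group, and prefer the one rejecting the most negative examples) recovers C1 and C2. The code below is a simplified propositional sketch of that step, not the authors' system: it handles only unary literals on X, breaks ties alphabetically, and `learn_clause` is a hypothetical name.

```python
# Background knowledge of Example 1: predicate -> constants satisfying it.
BK = {
    "r": set("abfhe"),
    "s": set("abgie"),
    "t": set("cdgie"),
    "u": set("cdfhe"),
    "v": set("bcfe"),
}
NEG = set("fghi")


def learn_clause(group, negatives, bk):
    """Return a list of body literals covering all of `group` and no negative.

    Only literals that cover every example of the group are candidates;
    at each step the candidate leaving the fewest negatives covered is added.
    Returns None when no such clause exists.
    """
    candidates = [lit for lit in sorted(bk) if group <= bk[lit]]
    body, remaining = [], set(negatives)
    while remaining:
        usable = [lit for lit in candidates if lit not in body]
        if not usable:
            return None
        best = min(usable, key=lambda lit: len(remaining & bk[lit]))
        if len(remaining & bk[best]) == len(remaining):
            return None  # no candidate rejects any further negative example
        body.append(best)
        remaining &= bk[best]
    return body


print(learn_clause(set("abe"), NEG, BK))
print(learn_clause(set("cde"), NEG, BK))
print(learn_clause(set("abcd"), NEG, BK))
```

The third call illustrates why the decomposition matters: no single conjunctive clause covers all four original positive examples, so the function reports failure and the group would have to be split.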
Example 2. Let us now consider an example where the observations are described with numeric properties. Let us consider a set of objects {a, b, . . . u} described by two attributes x and y. These objects are shown in Figure 1 and are expressed in a background knowledge BK defining the predicates x (x(U, V) is true when the object U satisfies x = V), y, ≥ and ≤. For instance, BK contains the ground atoms x(a, 1), y(a, 1), x(b, 1), y(b, 2), . . ., ≥(1, 1), ≥(1, 2), . . . Finally, the set of positive examples is {p(a), p(b), . . . , p(j)} and the set of negative examples is {p(l), p(m), . . . , p(u)}.

[Figure 1: scatter plot of the objects a, ..., u in the (x, y) plane, with both axes ranging from 0 to 6.]
Fig. 1. Examples with numerical values

If we allow the constants 1, . . ., 6 to appear in the learned clauses, it will be hard for a greedy algorithm to learn a definition of the target concept. Indeed, starting from p(X) ←, the literals x(X, Y) and y(X, Y) are determinate literals. Starting from the clause p(X) ← x(X, Y) or from the clause p(X) ← y(X, Y), the literals ≥(Y, 1), ≥(Y, 2), . . ., ≥(Y, 6) are not discriminant. In our framework, we can compute the similarity with respect to the language
\[ \bigcup_{v=1..6} \{\, p(X) \leftarrow x(X,Y), \geq(Y,v) \,\} \;\cup\; \bigcup_{v=1..6} \{\, p(X) \leftarrow y(X,Y), \geq(Y,v) \,\}. \]
This language leads to a similarity measure between examples which is close to the Euclidean similarity, and which induces 2 clusters, corresponding to the subconcepts {p(a), p(b), . . . , p(d)} (characterized by the clause p(X) ← x(X, Y), ≤(Y, 2), y(X, Z), ≤(Z, 2)) and {p(e), p(f), . . . , p(j)} (characterized by p(X) ← x(X, Y), ≥(Y, 4), y(X, Z), ≥(Z, 3)). Let us notice that in both Example 1 and Example 2, the clauses that characterize the obtained concepts do not belong to the language used to define the similarity. Experiments on this example are detailed in Section 5.
2.3 Similarity Measure
Different approaches have been proposed to define a similarity measure between objects described in a first-order logic formalism. The common way to build a similarity function consists in first producing a description of the objects. Then, for descriptions based on sets of atoms, the similarity is defined by intersections [Bis92,EW96]; for descriptions based on rules, the similarity between two objects is given by the number of rules that are satisfied by the two objects [Seb97,SS94,MM01]. In any case, two objects are considered to be similar when they share some properties.

We use the similarity measure proposed in [MM01]. This measure is defined with respect to a finite set of clauses, called the language. For a clause C and an example e, we define the function covered(C, e) as follows: covered(C, e) = 1 if e is covered by C, otherwise covered(C, e) = 0. Given a language L, we define L(e_i, e_j) as the set of clauses C such that either e_i and e_j are both covered by C or e_i and e_j are both not covered by C:
\[ L(e_i, e_j) = \{\, C \in L \mid covered(C, e_i) = covered(C, e_j) \,\} \]
The similarity between e_i and e_j, written sim_L(e_i, e_j), is then defined by:
\[ sim_L(e_i, e_j) = \frac{|L(e_i, e_j)|}{|L|} \]
This is the similarity measure we have used in Example 1 (where the unnormalized count |L(e_i, e_j)| is reported). However, it does not take into account the set of negative examples. We propose to extend this measure by giving a weight to the clauses of the language, which depends on the sets of positive and negative examples. More precisely, if E+ and E− are respectively the sets of positive and negative examples, the weight of a clause C is defined by
\[ w(C) = \frac{cov(E^+, C)}{cov(E^+, C) + cov(E^-, C)} \quad \text{if } cov(E^+, C) > 0, \qquad w(C) = 0 \text{ otherwise.} \]
The maximum value of this weight is equal to 1, for clauses which cover only positive examples. The weighted similarity we obtain is then:
\[ w\_sim(e_i, e_j) = \frac{\sum_{C \in L(e_i, e_j)} w(C)}{|L|} \]
It gives a higher weight to clauses that are closer to the target concept. This measure may have a drawback when the language contains a large number of clauses with low weights, since it may reduce the influence of clauses with a high weight. For this reason, we propose a definition of similarity, depending on a threshold α, such that clauses with a weight lower than α are not considered:
\[ w\_sim_\alpha(e_i, e_j) = \frac{\sum_{C \in L(e_i, e_j),\; w(C) > \alpha} w(C)}{|L|} \]
The most general clause, which covers all positive and negative examples, has a weight equal to |E+|/(|E+| + |E−|). All the clauses that have a weight lower than this value have a lower accuracy than the most general clause. For this reason, in practice we use in our experiments w_sim_α with the threshold α = |E+|/(|E+| + |E−|).
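The weighted measures can be illustrated on the data of Example 1, here with p(e) included among the positive examples. The sketch below assumes the same one-literal language as in Example 1 and uses exact fractions to avoid rounding; the names `weight`, `w_sim` and `ALPHA` are illustrative only.

```python
from fractions import Fraction

# Background knowledge of Example 1: predicate -> constants satisfying it.
BK = {
    "r": set("abfhe"),
    "s": set("abgie"),
    "t": set("cdgie"),
    "u": set("cdfhe"),
    "v": set("bcfe"),
}
POS = set("abcde")
NEG = set("fghi")


def weight(literal):
    """w(C) for the clause p(X) <- literal(X)."""
    pos = len(BK[literal] & POS)
    neg = len(BK[literal] & NEG)
    return Fraction(pos, pos + neg) if pos > 0 else Fraction(0)


def w_sim(e1, e2, alpha=None):
    """Weighted similarity; with alpha set, keep only clauses with w(C) > alpha."""
    agreeing = [lit for lit in BK if (e1 in BK[lit]) == (e2 in BK[lit])]
    kept = [weight(lit) for lit in agreeing
            if alpha is None or weight(lit) > alpha]
    return sum(kept, Fraction(0)) / len(BK)


# Threshold used in the experiments: weight of the most general clause.
ALPHA = Fraction(len(POS), len(POS) + len(NEG))

print(w_sim("a", "b"), w_sim("a", "b", ALPHA), w_sim("b", "e", Fraction(3, 5)))
```

On this small example every clause has a weight above |E+|/(|E+| + |E−|), so the threshold only changes the result when a stricter value of α is passed, as in the last call.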
Guillaume Cleuziou et al.
3 The Soft-Clustering Algorithm

3.1 Overview of the Soft-Clustering Algorithm
Clustering is the task of organizing a set of objects into classes, so that similar objects belong to the same cluster and dissimilar ones belong to different clusters. Traditional clustering algorithms can be classified into two main categories:
– hierarchical methods, which build a hierarchy of classes and are illustrated by well-known algorithms such as SAHN⁴ and COBWEB, or by newer ones like CURE and CHAMELEON. Note that in hierarchical methods, each level of the hierarchy forms a partition of the set of observed events.
– partitioning methods, such as the BIRCH, K-MEANS or K-MEDOIDS algorithms, which build a partition of the set of observed events, i.e., a training observation cannot belong to two distinct clusters.
An extensive survey of clustering methods is given in [Ber02]. Clustering algorithms are usually assessed with regard to the following characteristics: time efficiency, handling of outliers, diversity of the cluster shapes, ability to deal with several data types, and quality of the final clusters. However, a common disadvantage of classical methods is that they require the clusters to be disjoint. In the application of clustering we are interested in, namely learning a disjunctive definition of a concept by first clustering its observations, learning disjoint clusters induces a loss of information. Indeed, the main idea of the method we present in this paper is based on the hypothesis that most of the concepts that specify a data set are not clearly separated, and that some objects can be assigned to several clusters.
The soft-clustering approach is a compromise between the hard-clustering methods cited above and fuzzy-clustering methods, which use a fuzzy membership function that gives each element a membership value for each class [BB99]. The fuzzy-clustering approach is well known for the richness of its information description and allows some of the observations to belong to several clusters.
In our approach, we first determine a set of poles that represent strong concepts present in the data and are built from non-shared objects. In a second step, a function is defined for assigning the objects that are not yet covered to one or several clusters.

3.2 The Notion of “Pole”
The notion of pole we define here can be linked to the recent definition of a core given by Ben-Dor and Yakhini in [BDSY99]. Their approach is based on the idea that “ideally, all elements within a cluster should be similar to each other, and not similar to any of others clusters”. They identify small
⁴ SAHN: Sequential Agglomerative Hierarchical Non-overlapping.
Disjunctive Learning with a Soft-Clustering Method
cores in a similarity graph, which makes it possible to approximate an ideal clique graph⁵. These cores are small disjoint subsets of the vertices of the input graph. In our opinion, a clique graph does not represent typical real-world situations: a clique has good properties for expressing the interactions between the elements of a cluster; nevertheless, relations can appear between different clusters. Thus, the elements that form a concept⁶ can be split into two categories: the elements that belong only to that concept (the non-shared objects) and the elements also linked to other concepts (the shared objects). In this way, a pole can be seen as the non-shared part of the concept. Before giving a complete definition of a pole, we first define the notion of similarity graph and describe the method for building a pole.
⁵ A clique is a fully connected subset of vertices.
⁶ The term concept denotes the intuitive notion of cluster.

Definition 1. Let V be a set of elements, S be a similarity matrix (S : V × V → R) and α ∈ R be a threshold. The similarity graph G derived from V (denoted by G(V, E_α)) is an undirected, valued graph such that the set of vertices of G is equal to V and the set E_α of undirected edges is defined by: (v_i, v_j) ∈ E_α and is valued by s iff S(v_i, v_j) ≥ α and S(v_i, v_j) = s.

The threshold α is decisive for the whole process. The higher α is, the fewer edges G(V, E_α) contains. Therefore the number of final clusters is linked to the choice of this threshold. Rather than giving an arbitrary value for α, we notice that several natural thresholds can be used. For instance:

independent of the considered vertices:
(α1) $\alpha = 0$
(α2) $\alpha = \frac{1}{|V| \cdot (|V|-1)} \sum_{v_i \in V} \sum_{v_j \in V \setminus \{v_i\}} S(v_i, v_j)$ (the average value over S)
(α3) α such that $|\{(v_i, v_j) \mid S(v_i, v_j) > \alpha\}| = |\{(v_i, v_j) \mid S(v_i, v_j) < \alpha\}| = \frac{1}{2} |V| \cdot (|V|-1)$ (α is the median value over S)

dependent on the considered vertices:
(α4) $\alpha(i, j) = Max\left(\frac{1}{|V|} \sum_{v_k \in V} S(v_i, v_k);\ \frac{1}{|V|} \sum_{v_k \in V} S(v_j, v_k)\right)$
(α5) $\alpha(i, j) = Max(S(v_i, v_{i,k});\ S(v_j, v_{j,k}))$, where v_{i,k} is the k-th nearest neighbor of v_i (for a given k).

The threshold α1 supposes that the similarity matrix has negative values; two elements are then considered to be in relation if their similarity is positive (α1), or if their similarity is greater than the average similarity (α2). In the case of α3, since the threshold splits the space of potential edges into two equal parts, the similarity graph contains exactly 50% of the edges. In contrast with the first three thresholds, α4 and α5 take into account the situation of the elements compared with all the others in order to define a possible relation between two elements. For instance, α4(v_i, v_j)⁷ considers that v_i is in relation with v_j if v_i (resp. v_j) is nearer to v_j (resp. v_i) than it is, on average, to the other elements of the space. The threshold α5(v_i, v_j) determines the existence of a relation between two elements if each element is among the k nearest neighbors of the other. With the previous definition of a similarity graph, the similarity matrix over {p(a), p(b), p(e), p(c), p(d)} given in Example 1 (section 2.1) leads to the similarity graph induced by the following adjacency matrix:
similarity matrix
        p(a)  p(b)  p(e)  p(c)  p(d)
p(a)     5     4     2     0     1
p(b)     4     5     3     1     0
p(e)     2     3     5     3     2
p(c)     0     1     3     5     4
p(d)     1     0     2     4     5

adjacency matrix
        p(a)  p(b)  p(e)  p(c)  p(d)
p(a)     -     1     1     0     0
p(b)     1     -     1     0     0
p(e)     1     1     -     1     1
p(c)     0     0     1     -     1
p(d)     0     0     1     1     -
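The passage from the similarity matrix to the adjacency matrix can be checked with a few lines of Python. The sketch below is only an illustration: it uses the average threshold (α2) and the "≥ α" rule of Definition 1, and the function names are assumptions of this example.

```python
NAMES = ["p(a)", "p(b)", "p(e)", "p(c)", "p(d)"]
S = [
    [5, 4, 2, 0, 1],
    [4, 5, 3, 1, 0],
    [2, 3, 5, 3, 2],
    [0, 1, 3, 5, 4],
    [1, 0, 2, 4, 5],
]

def average_threshold(S):
    """Threshold alpha_2: average similarity over all ordered pairs of distinct elements."""
    n = len(S)
    total = sum(S[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def similarity_graph(S, alpha):
    """0/1 adjacency matrix of G(V, E_alpha); the diagonal is set to 0."""
    n = len(S)
    return [[1 if i != j and S[i][j] >= alpha else 0 for j in range(n)]
            for i in range(n)]

alpha2 = average_threshold(S)
A = similarity_graph(S, alpha2)
for name, row in zip(NAMES, A):
    print(name, row)
```

With this matrix the average off-diagonal similarity is 2, and the edges kept are exactly those of the adjacency matrix above.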
This adjacency matrix is obtained using α1, α2 or α4; two cliques clearly appear: {p(a), p(b), p(e)} and {p(e), p(c), p(d)}. Because of the small size of this example, the threshold α3 does not exist. The similarity graph obtained with α5 (for k = 2) differs from the previous one by two edges ((p(a), p(e)) and (p(d), p(e)) are not connected); thus four small cliques appear: {p(a), p(b)}, {p(b), p(e)}, {p(e), p(c)}, and {p(c), p(d)}.
⁷ α4 is the threshold retained for our experiments.

The construction of a pole from the similarity graph consists in searching for a maximal clique in the graph, centered on a given point. Because of the computational complexity of the maximum clique problem⁸, efficient heuristics for approximating such a clique have been developed. For a more detailed introduction to heuristics that approximate a maximal clique, see for instance [BBPP99]. In our method we use a sequential greedy heuristic based on the principle “Best in”: it constructs a maximal clique by repeatedly adding the vertex that is the nearest one among the neighbors of the temporary clique. The algorithm that implements this heuristic is given in Table 1. Once all possible cliques are built, final poles are reduced to their non-shared part.
⁸ The maximum-clique search is an NP-complete problem.

Definition 2. Let G be an undirected, valued graph, and let C be a set of vertices of G. A neighbor of C in G is a vertex v of G such that ∀v_j ∈ C, (v, v_j) ∈ E_α. Let V_C denote the set of neighbors of C in G. The nearest neighbor of C in G is a neighbor v of C such that

$$v = Argmax_{v_i \in V_C} \frac{1}{|C|} \sum_{v_j \in C} S(v_i, v_j)$$

i.e., it is the neighbor of C that is the nearest, on average, to all the elements of C.

Table 1. The Best in heuristic for the construction of one clique
Inputs: G(V, E_α) a valued graph, v_s a starting vertex
C ← {v_s}
Build V_C, the set of neighbors of C in G
While V_C is not empty:
    Select v_n, the nearest vertex from C in V_C, where S(v_i, v_j) is the weight of the edge (v_i, v_j) in E_α
    C ← C ∪ {v_n}
    Build V_C, the set of neighbors of C in G
Output: The clique C
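The heuristic of Table 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: vertices are identified by their index, the graph is given as the similarity matrix `S` together with a 0/1 adjacency matrix, and ties are broken by taking the first best candidate. The data is that of Example 1, with rows in the order p(a), p(b), p(e), p(c), p(d).

```python
S = [
    [5, 4, 2, 0, 1],
    [4, 5, 3, 1, 0],
    [2, 3, 5, 3, 2],
    [0, 1, 3, 5, 4],
    [1, 0, 2, 4, 5],
]
ADJ = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
]

def best_in_clique(S, adj, start):
    """Greedy "Best in" construction of one maximal clique containing `start`."""
    clique = [start]

    def neighbors(clique):
        # Neighbors of C (Definition 2): vertices adjacent to every member of C.
        return [v for v in range(len(S))
                if v not in clique and all(adj[v][u] for u in clique)]

    candidates = neighbors(clique)
    while candidates:
        # Nearest neighbor: highest average similarity to the current clique.
        nearest = max(candidates,
                      key=lambda v: sum(S[v][u] for u in clique) / len(clique))
        clique.append(nearest)
        candidates = neighbors(clique)
    return clique

clique_from_a = best_in_clique(S, ADJ, 0)   # start from p(a)
clique_from_d = best_in_clique(S, ADJ, 4)   # start from p(d)
```

Starting from p(a) the sketch first adds p(b), then p(e); starting from p(d) it adds p(c), then p(e), which gives the two cliques discussed in the text.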
3.3 The Relative Function for Assignment
Let us now suppose that we have iteratively built a set of disjoint cliques in the graph G. Each clique thus represents a class, and we now have to assign the remaining vertices of G (those that do not belong to a clique) to one or several classes. We define a boolean function that determines whether an element must be assigned to a class or not. It takes into account the relative proximity between an element and a class, defined as the average similarity between that element and each element belonging to the class. The key idea is that, when assigning an element to a concept, we take the other concepts into account. More formally, this function is defined as follows:

Definition 3. Let V be a set of elements, V' = C_1 ∪ C_2 ∪ · · · ∪ C_m a subset of V with C_i ∩ C_j = Ø for all i ≠ j in {1, . . . , m}, and S a similarity matrix (S : V × V → R). The k-th nearest concept from an element v_i ∈ V is written C_{i,k}. For all k, the relative assignment function of an element v_i to a class C_{i,k} is defined by f(v_i, C_{i,k}) = 1 if all the following conditions hold:
• ∀l < k, f(v_i, C_{i,l}) = 1
• S(v_i, C_{i,k}) > 0
• S(v_i, C_{i,k}) > 1/2 (S(v_i, C_{i,k−1}) + S(v_i, C_{i,k+1})), for 1 < k < m
• S(v_i, C_{i,k}) > 1/2 S(v_i, C_{i,k−1}), for k = m
and f(v_i, C_{i,k}) = 0 otherwise.
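To make Definition 3 concrete, here is a small Python sketch of the relative assignment function. It is an illustrative reading of the definition, not the authors' code: classes are lists of vertex indices, S(v, C) is the average similarity between v and the members of C, and ties between equally near concepts are broken by class order (an assumption of this example).

```python
def avg_sim(S, v, cluster):
    """S(v, C): average similarity between v and the elements of C."""
    return sum(S[v][u] for u in cluster) / len(cluster)

def relative_assignment(S, v, clusters):
    """Return the indices of the classes to which v is assigned (f = 1)."""
    sims = [avg_sim(S, v, c) for c in clusters]
    # order[k-1] is the index of the k-th nearest concept of v.
    order = sorted(range(len(clusters)), key=lambda c: -sims[c])
    m = len(order)
    assigned = []
    for k, c in enumerate(order, start=1):
        ok = sims[c] > 0
        if ok and 1 < k < m:
            ok = sims[c] > 0.5 * (sims[order[k - 2]] + sims[order[k]])
        elif ok and k == m and m > 1:
            ok = sims[c] > 0.5 * sims[order[k - 2]]
        if not ok:
            break  # f is 0 here, hence also for every farther concept
        assigned.append(c)
    return assigned

# Example 1: rows and columns in the order p(a), p(b), p(e), p(c), p(d).
S = [
    [5, 4, 2, 0, 1],
    [4, 5, 3, 1, 0],
    [2, 3, 5, 3, 2],
    [0, 1, 3, 5, 4],
    [1, 0, 2, 4, 5],
]
classes = [[0, 1], [3, 4]]                     # {p(a), p(b)} and {p(c), p(d)}
print(relative_assignment(S, 2, classes))      # classes receiving p(e)
```

For p(e) both average similarities are equal to 2.5, so the element is assigned to both classes, whereas an element with null similarity to every class is assigned to none.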
We note that for the nearest concept (C_{i,1}) we have f(v_i, C_{i,1}) = 1 iff S(v_i, C_{i,1}) > 0. Let us now give an important property of the assignment function f:

$$(P)\quad \forall v_i \in V:\ 0 \le \sum_{k=1}^{m} f(v_i, C_{i,k}) \le m$$
This property shows that, when using f, it is possible that an element (for instance an outlier) is not assigned to any class; conversely, a very central element could be assigned to all the classes. The relative function thus differs from a probabilistic membership, as for instance in [KK93], which characterizes an element according to its distribution over the classes (or poles).

3.4 Description of the Algorithm
In this section, we propose a formal description of the soft-clustering algorithm. It relies on the definitions of pole and assignment function given in the previous sections. The clustering process (Table 2) iterates the step of pole construction until no more starting point is available. Then the remaining elements are assigned to one or several poles, using the assignment function.
Table 2. The Soft-Clustering Algorithm
Input: V the set of elements, S the similarity matrix over V
Initialization: C = Ø, 𝒫 = Ø // 𝒫: the set of poles, C: the set of vertices appearing in poles
Step 1: Construction of the similarity graph G(V, E_α)
Step 2: v_s = startpoint(C, S, G)
Step 3: Build a pole P centered on v_s; 𝒫 ← 𝒫 ∪ {P}, C = C ∪ P
Step 4: v_s = startpoint(C, S, G); if a startpoint v_s is found, GOTO Step 3
Step 5: Each pole P is reduced to its non-shared objects P̃ ⊂ P
Step 6: For each element v_i ∈ V\C, assign v_i to poles of 𝒫 using f, where f is the relative assignment function
Output: A set of overlapping clusters, where each reduced pole P̃_i is extended to P̃_i ∪ {v_j ∈ V\C | f(v_j, P̃_i) = 1}
The startpoint function provides the element of V\C which is “the most distant” from the set C. The result differs depending on whether C is empty or not. Several definitions of the function startpoint can be given. In our experiments, we use the definition startpoint_1, which corresponds to the intuitive idea of “the most distant”:

$$startpoint_1(C, S, G) = \begin{cases} Argmin_{v_i \in V}\ degree(v_i, G) \text{ with } degree(v_i, G) \ge 1 & \text{when } C = \emptyset \\ Argmin_{v_i \in V \setminus C}\ S(v_i, C) & \text{when } C \ne \emptyset \end{cases}$$

where degree(v, G) represents the number of edges of v in G. Another definition could be:

$$startpoint_2(C, S, G) = \begin{cases} Argmin_{v_i \in V}\ \frac{1}{K+1}\left(degree(v_i, G) + \sum_{k=1}^{K} degree(v_{i,k}, G)\right) & \text{when } C = \emptyset \\ Argmin_{v_i \in V \setminus C}\ \frac{1}{K+1}\left(S(v_i, C) + \sum_{k=1}^{K} S(v_{i,k}, C)\right) & \text{when } C \ne \emptyset \end{cases}$$

where v_{i,k} denotes the k-th nearest neighbor of v_i and K is a given constant.

In Example 1, the four elements {p(a), p(b), p(c), p(d)} have a minimal degree equal to 2. If p(a) is randomly chosen as the starting point, the first clique built is P_1 = {p(a), p(b), p(e)}. Then the most distant element from P_1 among {p(c), p(d)} is p(d), and the clique obtained from this vertex is P_2 = {p(e), p(c), p(d)}. Because p(e) is shared by P_1 and P_2, the restriction of the cliques to their non-shared objects provides the two poles P̃_1 = {p(a), p(b)} and P̃_2 = {p(c), p(d)}. Finally, the assignment step allows p(e) to be a member of both P̃_1 and P̃_2. The final clusters are thus the two non-disjoint ones covered by the two rules:
C1 : p(X) ← r(X), s(X)
C2 : p(X) ← t(X), u(X)

The soft-clustering algorithm we propose has several properties that are highly interesting for the application to learning disjunctive concepts: (1) the clusters that are built can overlap, (2) the number of final clusters is not decided a priori, (3) the input of the algorithm is a similarity matrix, thus allowing the method to be applied to very different kinds of data.
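The walkthrough of Example 1 can be reproduced with the following end-to-end Python sketch of the algorithm of Table 2. It is a reading of the description under several stated assumptions, not the authors' implementation: the threshold is passed explicitly (here the average similarity, α2 = 2), startpoint_1 is used, a start point is considered available as long as some vertex outside C has at least one edge, ties are broken by vertex order instead of randomly, and, following the worked example, the assignment step is applied to every vertex that is not in a reduced pole.

```python
def soft_clustering(S, alpha):
    n = len(S)
    # Step 1: similarity graph G(V, E_alpha).
    adj = [[i != j and S[i][j] >= alpha for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]

    def avg(v, group):
        return sum(S[v][u] for u in group) / len(group)

    def build_clique(start):  # "Best in" heuristic
        clique = [start]
        while True:
            cand = [v for v in range(n)
                    if v not in clique and all(adj[v][u] for u in clique)]
            if not cand:
                return clique
            clique.append(max(cand, key=lambda v: avg(v, clique)))

    def startpoint(covered):  # startpoint_1
        cand = [v for v in range(n) if v not in covered and degree[v] >= 1]
        if not cand:
            return None
        if not covered:
            return min(cand, key=lambda v: degree[v])
        return min(cand, key=lambda v: avg(v, covered))

    # Steps 2-4: build cliques until no start point is left.
    cliques, covered = [], []
    v = startpoint(covered)
    while v is not None:
        clique = build_clique(v)
        cliques.append(clique)
        covered.extend(u for u in clique if u not in covered)
        v = startpoint(covered)

    # Step 5: reduce each clique to its non-shared objects.
    poles = [[u for u in c if sum(u in d for d in cliques) == 1] for c in cliques]
    poles = [p for p in poles if p]

    def assign(v):  # relative assignment function f
        sims = [avg(v, p) for p in poles]
        order = sorted(range(len(poles)), key=lambda c: -sims[c])
        m, chosen = len(order), []
        for k, c in enumerate(order, start=1):
            ok = sims[c] > 0
            if ok and 1 < k < m:
                ok = sims[c] > 0.5 * (sims[order[k - 2]] + sims[order[k]])
            elif ok and k == m and m > 1:
                ok = sims[c] > 0.5 * sims[order[k - 2]]
            if not ok:
                break
            chosen.append(c)
        return chosen

    # Step 6: assign the vertices that are in no reduced pole.
    clusters = [set(p) for p in poles]
    in_pole = set().union(*clusters)
    for v in range(n):
        if v not in in_pole:
            for c in assign(v):
                clusters[c].add(v)
    return clusters

# Example 1, rows in the order p(a), p(b), p(e), p(c), p(d).
S = [
    [5, 4, 2, 0, 1],
    [4, 5, 3, 1, 0],
    [2, 3, 5, 3, 2],
    [0, 1, 3, 5, 4],
    [1, 0, 2, 4, 5],
]
clusters = soft_clustering(S, alpha=2)
```

On this matrix the sketch builds the cliques {p(a), p(b), p(e)} and {p(d), p(c), p(e)}, reduces them to the poles {p(a), p(b)} and {p(c), p(d)}, and finally assigns p(e) to both.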
4 General Presentation of the Method
The general learning algorithm is presented in Figure 2. The inputs of the method are: the target concept, specified by positive and negative examples, background knowledge, and a language associated to the similarity measure. We assume that the target concept cannot be characterized by a single clause (otherwise, the algorithm simply outputs this clause). The first step of the method is the computation of the similarity between each pair of positive examples. Then, the similarity matrix is used by the clustering algorithm to produce a set of possibly non-disjoint groups. In some cases, this algorithm produces a large number of groups, or some groups are highly similar. For this reason, we organize these groups hierarchically: we use an average-link agglomerative algorithm producing a tree whose leaves are the groups obtained by the clustering method and whose root represents the entire initial concept. In this tree, each node (group of examples) is either a leaf or has two direct sub-groups.
[Figure: from the initial concept G, a similarity matrix is computed; the soft-clustering algorithm produces the groups G1, G2 and G3; the groups are organized hierarchically (G is split into G1∪G2 and G3, and G1∪G2 into G1 and G2); rules are then learned for the nodes of the hierarchy.]
Fig. 2. Overview of the learning algorithm

Then the method tries to build a clause which characterizes groups (nodes) of the tree, starting with the two sub-groups of the root group. If a clause is found for a group, this clause is added to the learned program; if no clause is found for a group, this process is recursively repeated on its direct sub-groups (if the associated node is not a leaf). Finally, if some groups have not been characterized by a clause, the learning method (Figure 2) is recursively applied on these groups.
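The top-down traversal of the hierarchy described above can be sketched as follows. The node representation (a pair made of a group and a list of zero or two children) and the `learn_clause` callback, which returns a clause or `None`, are hypothetical names introduced for this illustration; the toy tree mimics a situation where one leaf cannot be characterized.

```python
def characterize(node, learn_clause):
    """Return (clauses, leaf groups left uncharacterized) for the subtree at node."""
    group, children = node
    clause = learn_clause(group)
    if clause is not None:
        return [clause], []          # one clause characterizes the whole group
    if not children:
        return [], [group]           # leaf without clause: to be re-clustered
    clauses, failed = [], []
    for child in children:           # otherwise recurse on the direct sub-groups
        c, f = characterize(child, learn_clause)
        clauses += c
        failed += f
    return clauses, failed

def learn_from_tree(root, learn_clause):
    """Start with the sub-groups of the root, which is assumed not to be
    characterizable by a single clause."""
    _, children = root
    clauses, failed = [], []
    for child in children:
        c, f = characterize(child, learn_clause)
        clauses += c
        failed += f
    return clauses, failed

# Toy hierarchy: G is split into G1 and G2∪G3, and G2∪G3 into G2 and G3.
G1, G2, G3 = frozenset({"e1"}), frozenset({"e2"}), frozenset({"e3"})
G23 = G2 | G3
root = (G1 | G23, [(G1, []), (G23, [(G2, []), (G3, [])])])
known = {G1: "clause for G1", G2: "clause for G2"}   # stand-in for a real learner
clauses, failed = learn_from_tree(root, known.get)
```

Here clauses are found for G1 and G2 only, so G3 is returned as a group on which the whole method would be applied again.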
4.1 Learning a Clause
As mentioned above, the decomposition method induces a strong reduction of the search space. Given a set of positive examples G and the set of negative examples E−, we try to build a clause which covers all the examples in G and no negative ones. The learning method we use is a greedy search, starting from the clause p(X1, . . . , Xn) ← (where p is the n-ary target predicate) and adding literals one by one until every negative example is rejected by the clause. Since the clause has to cover each example in G, we choose, among the literals which cover all the examples in G, the one which rejects as many negative examples as possible (no backtracking is needed).
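The greedy search can be illustrated with the following Python sketch. It relies on a strong simplifying assumption that is not in the paper: each candidate literal is represented by the fixed set of examples that satisfy it, and the coverage of a clause body is the intersection of these sets, which ignores the variable bindings between literals of a real first-order clause. The function name and the toy data are likewise hypothetical.

```python
def learn_clause(literals, group, negatives):
    """Greedy construction of a clause body covering `group` and no negative.

    literals: dict mapping a literal name to the set of examples satisfying it.
    Returns the list of chosen literal names, or None if the search fails.
    """
    # Only literals covering every example of the group are admissible.
    candidates = {name: cov for name, cov in literals.items() if group <= cov}
    body, remaining = [], set(negatives)
    while remaining:
        # Pick the literal rejecting the largest number of remaining negatives.
        best = max(candidates,
                   key=lambda name: len(remaining - candidates[name]),
                   default=None)
        if best is None or not (remaining - candidates[best]):
            return None              # no admissible literal rejects anything
        body.append(best)
        remaining &= candidates[best]
        del candidates[best]
    return body

# Toy data: positives a, b and negatives n1, n2, n3.
literals = {
    "r(X)": {"a", "b", "n1"},
    "s(X)": {"a", "b", "n2", "n3"},
    "t(X)": {"a", "n1"},             # does not cover b: never selected
}
body = learn_clause(literals, {"a", "b"}, {"n1", "n2", "n3"})
```

On the toy data the search first selects r(X), which rejects two negatives, then s(X), which rejects the last one, giving the body of a clause of the form p(X) ← r(X), s(X).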
4.2 Complexity
The time complexity of the method comes from the three main steps of the algorithm:
similarity: To compute the similarity matrix, we have to test whether each example is covered or not by each clause of the language L. This requires (|E+| + |E−|) ∗ |L| covering tests; then the similarity between each pair of positive examples is computed, which takes |E+|² ∗ |L| operations.
clustering: The time required for this step is mainly due to the pole construction: for each pole, it needs at most |E+|² operations.
building clauses: The greedy search is reduced by the restriction of the search space.
The space complexity is determined by the number of positive examples, since we have to compute a |E+| × |E+| similarity matrix. If |E+| is too high (more than 1000), we can use a sample of this set: the number of positive examples needed depends mainly on the number of learned clauses. For example, assume that the target concept can be characterized with 10 clauses; with 1000 positive examples we still have an average of 100 examples per group.
5 Experiments
We present here some preliminary experiments with our approach: we have tested the method on examples for which a “natural” decomposition into subconcepts is known, and we compare the results obtained with our clustering method with the expected subconcepts.

Example 1. The first experiment concerns Example 1: if we use the similarity w_sim_α with respect to the language L proposed in Example 1, we get similarity matrices close to those presented in the example. With the first matrix we get the 2 expected subconcepts {p(a), p(b)} and {p(c), p(d)}; the second matrix also gives the 2 expected subconcepts {p(a), p(b), p(e)} and {p(e), p(c), p(d)} (p(e) belongs to both subconcepts).

Example 2. The second experiment concerns Example 2. If we consider the similarity associated to the language

$$L_1 = \bigcup_{v=1..6} (p(X) \leftarrow x(X, Y), \ge(Y, v))\ \cup \bigcup_{v=1..6} (p(X) \leftarrow y(X, Y), \ge(Y, v))$$

we obtain the expected subconcepts: {p(a), p(b), . . . , p(d)} and {p(e), p(f), . . . , p(j)}. If we now consider the similarity associated to the language L2 containing all the clauses having p(X) as head and having at most two literals in the body, the clustering algorithm gives a different result: the produced groups are
G1 = {p(a), p(b), p(c), p(d)}
G2 = {p(e), p(b), p(g), p(j)}
G3 = {p(g), p(h), p(i), p(j)}
The explanation is based on the difference between L1 and L2: the clause p(X) ← x(X, Y), y(X, Y) belongs to L2 but not to L1. This definition is correct since it covers 4 positive examples, {p(a), p(d), p(h), p(i)}, and no negative ones, and thus it has a weight equal to 1. The consequence is that the similarity between p(h) and p(i) is increased and, for instance, the similarity between p(e) and p(h) is reduced. This example shows that when different decompositions exist, this may induce a fragmentation of some subconcepts (particularly when the number of examples is low). For this reason, the result of our soft-clustering algorithm is transformed into a hierarchical soft-clustering one before learning a definition. In this example, G2 and G3 are the most similar groups, so the hierarchical clustering result is:
E+ is divided into G1 and G2,3,
G2,3 is divided into G2 and G3,
where the first level of decomposition leads to the expected result.

Example 3: Ancestor. The following example was proposed in [dRL93]. The background knowledge contains ground atoms with predicates father, mother, male and female over a family of 19 persons. The target concept is ancestor. To compute the similarity between examples, we consider the language containing all the clauses having ancestor(X, Y) as head and having at most two literals in the body. On the complete set of 56 positive examples, our clustering algorithm produces 3 disjoint groups.
G1: the first group is the set {ancestor(Xi, Yi) | Xi is the father of Yi}
G2: the second group is the set {ancestor(Xi, Yi) | Xi is the mother of Yi}
G3: the third group contains all other examples.
The most similar groups are G2 and G3, but no rule can be found to characterize G2 ∪ G3. The algorithm then tries to build a characterization for G1, G2 and G3. A rule is learned for G1 and for G2 but not for G3.
Then the clustering algorithm is applied on the group G3, which produces two disjoint groups associated to the recursive definitions ancestor(X, Y) ← father(X, Z), ancestor(Z, Y) and ancestor(X, Y) ← mother(X, Z), ancestor(Z, Y). Finally, the decomposition produces 4 groups, corresponding to the usual definition of ancestor.

Example 4. The last example is introduced to test the ability to build non-disjoint groups. Consider a graph containing two types of edges, r and s. BK contains an atom r(X, Y) (resp. s(X, Y)) if an edge of type r (resp. s) exists from X to Y. BK contains the following set of ground atoms: {r(b, f), r(b, g), r(f, l), r(g, m), r(k, r), r(l, o), r(m, p), r(m, o), r(j, n), r(a, f), r(e, k), s(a, f), s(e, k), s(j, n), s(c, g), s(c, h), s(c, i), s(d, i), s(d, j), s(h, l), s(h, m), s(h, n), s(i, n), s(n, q)}
The target concept is linked, specified by the set of atoms linked(X, Y) such that there exists a path from X to Y. To compute the similarity between examples, we also consider the language containing all the clauses having linked(X, Y) as head and having at most two literals in the body. In this example, some vertices are linked by different paths. From 44 positive examples, the clustering algorithm produces 5 groups G1, G2, G3, G4 and G5.
– G1 and G2 have 3 common examples: G1 is defined by linked(X, Y) ← r(X, Y) and G2 is defined by linked(X, Y) ← s(X, Y). There are exactly 3 examples linked(X, Y) such that both r(X, Y) and s(X, Y) hold. G1 and G2 have no common objects with other groups.
– G3 has no common examples with other groups. It is defined by the clause linked(X, Y) ← r(X, Z), linked(Z, Y), and G3 contains all the examples covered by this clause, except for some examples which are also covered by the clause linked(X, Y) ← s(X, Z), linked(Z, Y).
– G4 and G5 have one common example and these groups are the most similar. G4 ∪ G5 is characterized by the clause linked(X, Y) ← s(X, Z), linked(Z, Y).
In this example, intersections between groups are not as large as we could have expected. This is mainly due to specificities which increase the similarity between some examples and reduce the possibility for some examples to be assigned to several poles, since the assignment function is based on relative similarities. However, it is preferable to obtain “incomplete” groups since, for each group, we try to learn a unique clause. The decomposition obtained on this example is therefore good and makes it possible to produce a satisfactory program.
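The figures quoted for this example can be checked directly from the background knowledge. The Python sketch below, written for illustration only, computes the pairs (X, Y) such that a path exists from X to Y over the edges of type r and s, and the edges that carry both types; the function and variable names are assumptions of this example.

```python
R_EDGES = [("b", "f"), ("b", "g"), ("f", "l"), ("g", "m"), ("k", "r"),
           ("l", "o"), ("m", "p"), ("m", "o"), ("j", "n"), ("a", "f"),
           ("e", "k")]
S_EDGES = [("a", "f"), ("e", "k"), ("j", "n"), ("c", "g"), ("c", "h"),
           ("c", "i"), ("d", "i"), ("d", "j"), ("h", "l"), ("h", "m"),
           ("h", "n"), ("i", "n"), ("n", "q")]

def linked_pairs(edges):
    """All pairs (x, y) such that a non-empty path leads from x to y."""
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
    pairs = set()
    for start in succ:
        stack, seen = list(succ[start]), set()
        while stack:                      # depth-first traversal from start
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        pairs.update((start, node) for node in seen)
    return pairs

positives = linked_pairs(R_EDGES + S_EDGES)
both = set(R_EDGES) & set(S_EDGES)
print(len(positives), sorted(both))
```

The traversal yields 44 linked pairs, and exactly 3 edges, (a, f), (e, k) and (j, n), are of both types, in agreement with the 3 examples shared by G1 and G2.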
6 Conclusion and Further Works
We have presented a clustering method used to split the set of positive examples of a target concept into groups of similar examples. Each group is supposed to correspond to a subconcept, and we try to learn one clause for each subconcept. This method induces a strong reduction of the search space during the learning process; however, it can be applied only if the clustering algorithm produces “good” groups. To analyze the quality of the obtained groups, we have performed experiments on several examples for which a good decomposition was known. For all these examples, the results obtained with the clustering algorithm correspond to the subconcepts induced by the expected definitions. Moreover, these experiments show that the soft-clustering method is able to produce non-disjoint groups, corresponding to overlapping subconcepts. In future work, we have to test whether this method produces better clauses than usual greedy algorithms. We plan to test the method on real examples such as Mutagenesis: to achieve this, we have to specify a language (associated to the similarity measure) such that numerical and symbolic values have a fair influence. We also plan to study the relationship between the language and the induced similarity. In our experiments, we have considered languages made of simple rules (non-recursive, with at most two literals in the body). This language is only used for the similarity measure; the search space for rules may be much more complex: it may be infinite and contain recursive definitions.
References
BB99. A. Baraldi and P. Blonda. A survey of fuzzy clustering algorithms for pattern recognition. II. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 29:786–801, 1999.
BBPP99. I. Bomze, M. Budinich, P. Pardalos, and M. Pelillo. The maximum clique problem. In D.-Z. Du and P. M. Pardalos, editors, Handbook of Combinatorial Optimization, volume 4. Kluwer Academic Publishers, Boston, MA, 1999.
BDSY99. Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281–297, 1999.
Ber02. Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
Bis92. G. Bisson. Learning in FOL with a similarity measure. In 11th National Conf. on Artificial Intelligence (AAAI), San Jose, CA, pages 82–87. AAAI Press, 1992.
dRL93. L. de Raedt, N. Lavrac, and S. Dzeroski. Multiple predicate learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambéry, France, pages 1037–1043. Springer-Verlag, 1993.
EW96. W. Emde and D. Wettschereck. Relational instance-based learning. In L. Saitta, editor, 13th Int. Conf. on Machine Learning (ICML’96), Bari, Italy, pages 122–130. Morgan & Kaufmann, 1996.
HMS83. William A. Hoff, Ryszard S. Michalski, and Robert E. Stepp. INDUCE 2: A program for learning structural descriptions from examples. Technical Report 904, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1983.
KK93. R. Krishnapuram and J. Keller. A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1(2):98–110, 1993.
MM01. Lionel Martin and Frédéric Moal. A language-based similarity measure. In Machine Learning: ECML 2001, 12th European Conference on Machine Learning, Freiburg, Germany, September 5-7, 2001, Proceedings, volume 2167 of Lecture Notes in Artificial Intelligence, pages 336–347. Springer, 2001.
Mug95. S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):245–286, 1995.
QCJ95. J. R. Quinlan and R. M. Cameron-Jones. Induction of logic programs: FOIL and related systems. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):287–312, 1995.
SBP93. Giovanni Semeraro, Clifford A. Brunk, and Michael J. Pazzani. Traps and pitfalls when learning logical theories: A case study with FOIL and FOCL. Technical Report ICS-TR-93-33, July 1993.
Seb97. M. Sebag. Distance induction in first order logic. In Proceedings of ILP’97, pages 264–272. Springer-Verlag, 1997.
SS94. M. Sebag and M. Schoenauer. Topics in Case-Based Reasoning, volume 837 of LNAI, chapter A Rule-based Similarity Measure, pages 119–130. Springer-Verlag, 1994.
ILP for Mathematical Discovery
Simon Colton and Stephen Muggleton
Department of Computing, Imperial College, London, UK
{sgc,shm}@doc.ic.ac.uk
Abstract. We believe that AI programs written for discovery tasks will need to simultaneously employ a variety of reasoning techniques such as induction, abduction, deduction, calculation and invention. We describe the HR system which performs a novel ILP routine called automated theory formation. This combines inductive and deductive reasoning to form clausal theories consisting of classification rules and association rules. HR generates definitions using a set of production rules, interprets the definitions as classification rules, then uses the success sets of the definitions to induce hypotheses from which it extracts association rules. It uses third party theorem provers and model generators to check whether the association rules are entailed by a set of user supplied axioms. HR has been applied to a range of predictive, descriptive and subgroup discovery tasks in domains of pure mathematics. We describe these applications and how they have led to some interesting mathematical discoveries. Our main aim here is to provide a thorough overview of automated theory formation. A secondary aim is to promote mathematics as a worthy domain for ILP applications, and we provide pointers to mathematical datasets.
1 Introduction
The HR system [4] employs a novel ILP algorithm, which we call automated theory formation, to build a clausal theory from background knowledge. Primarily, it uses concept formation techniques to search a space of range restricted definitions, with each definition being interpreted as a classification rule for the unlabelled examples supplied by the user. Each definition is built from previous ones, and HR performs a heuristic search by ordering the old definitions in terms of an evaluation function specified by the user. In addition, using the success sets of the definitions, HR induces empirical hypotheses about the concepts and extracts association rules from these. It then uses third party automated reasoning software to decide whether each association rule is entailed by a set of axioms, which are also supplied by the user.

HR’s primary mode of operation is to perform descriptive induction, whereby the classification rules and association rules it generates form a theory about the examples and concepts in the background knowledge. After the theory is formed, the user employs various tools to identify and develop the most interesting aspects of it, and to apply these to particular problems. However, in some applications, the user has supplied labelled examples, and HR has also been used for predictive induction and subgroup discovery tasks. Our main domain of application has been mathematics, and HR has made a range of discoveries, some of which have been worthy of publication in the mathematical literature [2]. HR’s power comes both from its internal inductive methods and its ability to use automated theorem provers and model generators for deductive methods.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 93–111, 2003. © Springer-Verlag Berlin Heidelberg 2003

Our main aim here is to provide details of both the algorithm underlying automated theory formation (ATF), and the implementation of ATF in HR. In §2 we describe the input and output to the system, the way it represents knowledge and the way it gains knowledge via theory formation steps. ATF is heavily dependent on the space of definitions HR can search, and in §3, we provide a characterisation of this space and how HR moves around it. In §5, we compare ATF to other ILP algorithms.

A secondary aim is to promote mathematics as a potential domain for machine learning applications, and the use of ILP in particular. The way mathematics is presented in textbooks often makes it appear a largely deductive activity. This disguises the fact that, as in any science, inductive methods play an important part in mathematical reasoning. In §4 we describe four applications of HR to ad-hoc mathematical discovery tasks in number theory and various algebraic domains. We supply pointers to two online data sets which have arisen from these applications, so that our experiments can be validated, and comparisons between ATF and other ILP approaches can be made.

We believe that AI programs written for creative discovery tasks will need to simultaneously employ a variety of reasoning techniques such as induction, abduction, deduction, calculation and invention. By performing both inductive and deductive reasoning, HR is a system which partially satisfies this criterion.
In §6, we speculate on the further integration of reasoning techniques and the potential for such integrated systems in scientific and other domains.
2 Automated Theory Formation
We attempt here to give an overview of how HR forms a theory using the ATF algorithm. To do so, in §2.1, we discuss the inputs to and outputs from the system. In §2.2, we describe how knowledge is represented within the system. In §2.3, we describe how theory formation steps add knowledge to the theory. We will use a session in number theory as a running example.

2.1 Input and Output
The knowledge input to HR consists of information in one of five different formats. Firstly, the user can supply some constants, which may represent different objects such as integers, graphs, groups, etc. Secondly, the user can supply some predicates which describe these constants. For instance, in number theory, background predicates may include a test whether one number divides another. Thirdly, the user may supply a set of axioms, which are taken to be true hypotheses relating some of the predicates in the background knowledge. During theory
ILP for Mathematical Discovery
formation, attempts will be made by an automated theorem prover to determine whether certain association rules are entailed by these axioms. Hence, the axioms are given in the language of the theorem prover being employed, which is usually Otter, a state-of-the-art resolution prover [17]. The predicate names in the axioms must match those in the background knowledge. Fourthly, for predictive tasks, the user may supply a classification of a set of examples, to be used in an evaluation function during theory formation. Finally, the program runs as an any-time algorithm by default, but the user may also supply termination conditions, which are usually application specific, as discussed in §4.
The background predicates and constants are usually supplied in one background theory file, and the axioms in another, so that the same axioms can be used with different background files. The classification of examples and the specification of termination conditions are done on-screen. The background theory and axiom files for a number theory session are given in figure 1. We see that the user has supplied the constants 1 to 10 and four background predicates. The first of these is the predicate of being an integer, which provides typing information for the constants appearing in the theory. The other three background predicates are: (i) leq(I,L), which states that integer L is less than or equal to integer I; (ii) divisor(I,D), which states that integer D is a divisor of integer I; and (iii) multiplication(I,A,B), which states that A * B = I. Note that typing information for each variable in each predicate is required, which is why the background file contains lines such as leq(I,L) -> integer(L). The axioms in integer.hra are obvious relationships between the predicates supplied in the background file.
integer.hrd (background theory)
int001
integer(I)
integer(1).integer(2).integer(3).integer(4).integer(5).
integer(6).integer(7).integer(8).integer(9).integer(10).

int002
leq(I,L)
leq(I,L) -> integer(I)
leq(I,L) -> integer(L)
leq(1,1).leq(2,1).leq(2,2).leq(3,1).leq(3,2).leq(3,3).
leq(4,1).leq(4,2).leq(4,3).leq(4,4). ... leq(10,10).

int003
divisor(I,D)
divisor(I,D) -> integer(I)
divisor(I,D) -> integer(D)
divisor(1,1).divisor(2,1).divisor(2,2).divisor(3,1).divisor(3,3).
divisor(4,1).divisor(4,2).divisor(4,4). ... divisor(10,10).

int004
multiplication(I,A,B)
multiplication(I,A,B) -> integer(I)
multiplication(I,A,B) -> integer(A)
multiplication(I,A,B) -> integer(B)
multiplication(1,1,1).multiplication(2,1,2).multiplication(2,2,1).
multiplication(3,1,3).multiplication(3,3,1).multiplication(4,1,4).
multiplication(4,2,2).multiplication(4,4,1). ... multiplication(10,10,1).

integer.hra (axioms in Otter format)
all a (divisor(a,a)).
all a (leq(a,a)).
all a b (divisor(a,b) -> leq(a,b)).
all a b (leq(a,b) & leq(b,a) <-> a=b).
all a b c (multiplication(a,b,c) -> divisor(a,b)).
all a b c (multiplication(a,b,c) -> divisor(a,c)).
all a b c (multiplication(a,b,c) <-> multiplication(b,a,c)).
Fig. 1. Example input files for number theory
All five types of information are optional to some extent. In particular, in algebraic domains such as group or ring theory, the user may supply just the axioms of the domain, and provide no constants or background predicates. In
Simon Colton and Stephen Muggleton
this case, HR extracts the predicates used to state the axioms into the background knowledge file, and uses the MACE model generator [18] to generate a single model satisfying the axioms. MACE, which uses the same input syntax as Otter, will be used repeatedly during theory formation to disprove various false hypotheses made by HR, and this will lead to more constants being added to the theory. Alternatively, the user may supply no axioms, and only background predicates and constants. In this case, the system would not be able to prove anything, unless the user provided axioms during theory formation, as responses to requests from HR. Note that predicates in the background knowledge file may call third party software, in particular computer algebra systems like Maple [6].
The output from HR is a theory consisting of a set of classification rules and a set of association rules. Each classification rule is expressed as a predicate definition, i.e., a disjunction of program clauses with the same head predicate. Each program clause in the definition is range restricted and of the form:
$\mathit{concept}_C(X_1, \ldots, X_n) \leftarrow p_1(A_{1,1}, \ldots, A_{1,n_1}) \wedge \ldots \wedge p_m(A_{m,1}, \ldots, A_{m,n_m})$
where $C$ is a unique identification number, each $X_i$ is a variable, and each $A_{i,j}$ may be a constant or a variable which may or may not be the same as a head variable. Body literals may be negated, and there are further restrictions so that each definition can be interpreted as a classification rule, as described in §3. Association rules are expressed as range restricted clauses of the form:
$q_0(X_1, \ldots, X_n) \leftarrow q_1(A_{1,1}, \ldots, A_{1,n_1}) \wedge \ldots \wedge q_m(A_{m,1}, \ldots, A_{m,n_m})$
where each $X_i$ and each $A_{i,j}$ are variables as before, and each $q_k$ is either a predicate supplied in the background theory file or one invented by HR. Each body literal may be negated, and the head literal may also be negated. An output file for a short, illustrative, number theory session is given in figure 2.
HR has many modes for its output and – dependent on the application at hand – it can produce more information than that presented in figure 2. However, the clausal information is usually the most important, and we have presented the clausal theory in a Prolog style for clarity. The first four definitions are those given in the input file. Note that HR has added the variable typing information to them, so that it is clear, for instance, in concept2/2 that both variables are integers. This is important information for HR, and in every clause for each classification rule HR produces, there are typing predicates for every variable in the body or head. Following the user-given definitions in the output file, we see that HR has invented a new predicate, called count1/2 which counts the number of divisors of an integer (the τ function in number theory). The first classification rule it has introduced is concept5/2, which checks whether the second variable is the square root of the first variable. The next definition provides a boolean classification into square numbers and non-squares. Following this, concept7/2 uses count1 to count the number of divisors of an integer (there seems to be redundancy here, but count1/2 can be used in other definitions – in fact it is used in concept8). Finally, concept8/1 provides a classification into prime and non-prime numbers, because prime numbers have exactly two divisors.
In the output file, each classification rule is followed by the way in which the rule is interpreted to categorise the constants (in this case, integers). More details about this are given in §3. After the classification predicates, the program has listed the unproved association rules it has found so far. The first of these states that if an integer is a square number (i.e., an integer X for which there is some Y such that multiplication(X,Y,Y)), then it will not be a prime number. The second states that if an integer is a prime number then it cannot be a square number. While both of these are in fact true, they are listed as unproved in the output file. This is because no attempt was made to use Otter, as rules containing counting predicates are usually beyond the scope of what Otter can prove.

# User Given Classification Predicates
concept1(X) :- integer(X).
concept2(X,Y) :- integer(X), integer(Y), leq(X,Y).
concept3(X,Y) :- integer(X), integer(Y), divisor(X,Y).
concept4(X,Y,Z) :- integer(X), integer(Y), integer(Z), multiplication(X,Y,Z).

# Invented Counting Predicates
count1(X,N) :- findall(Y,(integer(Y),divisor(X,Y)),A), length(A,N).

# Invented Classification Predicates
concept5(X,Y) :- integer(X), integer(Y), multiplication(X,Y,Y).
    categorisation: [1][4][9][2,3,5,6,7,8,10]
concept6(X) :- integer(X), integer(Y), multiplication(X,Y,Y).
    categorisation: [1,4,9][2,3,5,6,7,8,10]
concept7(X,N) :- integer(X), integer(N), count1(X,N).
    categorisation: [1][2,3,5,7][4,9][6,8,10]
concept8(X) :- integer(X), integer(N), count1(X,2).
    categorisation: [2,3,5,7][1,4,6,8,9,10]

# Unproved Association Rules
\+ count1(X,2) :- integer(X), integer(Y), multiplication(X,Y,Y).
\+ multiplication(X,Y,Y) :- integer(X), integer(Y), count1(X,2).
Fig. 2. An example output file in number theory
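The categorisation lines in figure 2 can be reproduced with a few lines of Python. The following is only an illustrative sketch, not HR's implementation: the function names mirror the predicates in the figure, success sets are stored as sets of tuples, and the grouping follows the classifying function that §3 defines (constants are placed in the same class when the remaining entries of their tuples agree).

```python
# Illustrative sketch only: plain Python stand-ins for the clauses in figure 2.
CONSTANTS = range(1, 11)  # the integers 1 to 10 from the background file


def divisor(i, d):
    """divisor(I,D): D is a divisor of I."""
    return i % d == 0


def multiplication(i, a, b):
    """multiplication(I,A,B): A * B = I."""
    return a * b == i


def count1(x):
    """count1(X,N): N is the number of constants that divide X."""
    return sum(1 for y in CONSTANTS if divisor(x, y))


# Success sets, stored as sets of tuples of constants.
concept5 = {(x, y) for x in CONSTANTS for y in CONSTANTS if multiplication(x, y, y)}
concept6 = {(x,) for (x, _y) in concept5}
concept7 = {(x, count1(x)) for x in CONSTANTS}
concept8 = {(x,) for x in CONSTANTS if count1(x) == 2}


def categorisation(success_set):
    """Group constants whose tuples agree after the first entry is dropped."""
    classes = {}
    for o in CONSTANTS:
        key = frozenset(t[1:] for t in success_set if t[0] == o)
        classes.setdefault(key, []).append(o)
    return sorted(classes.values())


print(categorisation(concept7))
```

The classes are returned in sorted order, so they may be listed in a different order from figure 2, although the classes themselves are the same.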
2.2 Representation of Knowledge
HR stores the knowledge it generates in frames, a well-known data structure described in [20]. The first slot in each frame contains clausal information, so HR effectively builds up a clausal theory embedded within a frame representation. There are three types of frame:
• Constant frames. The first slot in these frames contains a single ground formula of the form type(constant), where type is the name of a unary predicate which has appeared in the background theory and constant is a constant which has either appeared in the background theory or has been generated by the theory formation process. Each constant will appear in a single constant frame, and hence will be of only one type. For instance, in the example session
described in §2.1, there would be a constant frame for each of the numbers 1 to 10, where the ground formula is integer(1), integer(2), etc.
• Concept frames. The first slot in these frames contains a definition of the form for classification rules described above. The other slots contain the results of calculations related to the definition. In particular, one slot contains the success set of the definition. Another slot contains the classification of the constants in the theory afforded by the definition (see later). At the end of the session in our running example, there are 8 concept frames, and, for example, the 6th of these contains a definition with a single clause of the form:
concept6(X) ← integer(X) ∧ integer(Y) ∧ multiplication(X, Y, Y).
Note that the count1/2 predicate is not stored in a concept frame of its own, but in a slot in the concept frames which require it for their definition.
• Hypothesis frames. The first slot in a hypothesis frame contains one or more association rules in the form described above. The other slots contain information about the hypothesis. In particular, there is a slot describing the status of each association rule as either proved, disproved, or open. Association rules are stored together, as opposed to in separate frames, because the whole can usually be interpreted as more than the sum of the parts. For instance, the two association rules in figure 2 are stored in the first slot of a single hypothesis frame. This is because they were derived from a non-existence hypothesis stating that it is not possible to have an integer which is both square and prime. This information is also recorded in the hypothesis slot, and hence the hypothesis can be presented as the following negative clause, which may be more understandable:
← integer(X), integer(Y), multiplication(X, Y, Y), count1(X, 2)
2.3 Theory Formation Steps
HR constructs theories by performing successive theory formation steps. An individual step may add nothing to the theory, or it may add a new concept frame, a new hypothesis frame, a new constant frame, or some combination of these. At the end of the session, various routines are used to extract and present information from the frames. An outline of an individual step is given in table 1. After checking whether the termination conditions have been satisfied, each step starts by generating a new definition, as prescribed by an agenda. This is done using a production rule to derive a new definition from one (or two) old definitions. This process is described in detail in §3, but for our current purposes, we can see the kinds of definitions HR produces in figure 2. The new definition is checked for inconsistencies, e.g., containing a literal and its negation in the body of a clause. For example, a definition may be produced of the form:
$\mathit{concept}_C(X, Y) \leftarrow \mathit{integer}(X) \wedge \mathit{integer}(Y) \wedge \mathit{leq}(X, Y) \wedge \neg \mathit{leq}(X, Y)$
so that $\mathit{concept}_C$ is trivially unsatisfiable. If, like this one, the definition is not self-consistent, the step is aborted and a new one started.
Table 1. Outline of a theory formation step
Inputs:
    Typed examples E
    Background predicates B
    Axioms A
    Classification of examples C
    Termination conditions T
Outputs:
    New examples N (in constant frames)
    Classification rules R (in concept frames)
    Association rules S (in hypothesis frames)
(1) Check T and stop if satisfied
(2) Choose old definition(s) and production rule from the top of the agenda
(3) Generate new definition D from old definition(s), using production rule
(4) Check the consistency of D and if not consistent, then start new step
(5) Calculate the success set of D
(6) If the success set is empty, then
    (6.1) derive a non-existence hypothesis
    (6.2) extract association rules and add to S
    (6.3) attempt to prove/disprove association rules using A
    (6.4) if disproved, then add counterexample to N, update success sets, go to (7)
          else start new step
(7) If the success set is a repeat, then
    (7.1) derive an equivalence hypothesis
    (7.2) extract association rules and add to S
    (7.3) attempt to prove/disprove association rules using A
    (7.4) if disproved, then add counterexample to N, update success sets, go to (8)
          else start new step
(8) Induce rules from implications
    (8.1) extract association rules and add to S
    (8.2) attempt to prove/disprove association rules using A
(9) Induce rules from near-equivalences and near-implications
    (9.1) extract association rules and add to S
(10) Measure the interestingness of D (possibly using C)
(11) Perform more calculations on D and add it to R
(12) Update and order the agenda
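The branching in steps (4) to (7) of table 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not HR's code: literals are encoded as hypothetical (negated, predicate, args) triples, success sets are plain Python sets, and the theorem proving, counterexample handling and agenda steps are left out.

```python
def is_self_consistent(body):
    """Step (4): reject a body that contains a literal together with its negation.

    Literals are (negated, predicate, args) triples, an encoding assumed here.
    """
    return not any((not neg, pred, args) in body for (neg, pred, args) in body)


def classify_step(body, success_set, known_success_sets):
    """Return which branch of table 1 a newly generated definition falls into."""
    if not is_self_consistent(body):
        return "abort"            # step (4): start a new step
    if not success_set:
        return "non-existence"    # step (6): empty success set
    if success_set in known_success_sets:
        return "equivalence"      # step (7): repeated success set
    return "new concept"          # steps (8)-(12): add to the theory


# The inconsistent definition from the text: leq(X,Y) and its negation.
bad_body = [
    (False, "integer", ("X",)),
    (False, "integer", ("Y",)),
    (False, "leq", ("X", "Y")),
    (True, "leq", ("X", "Y")),
]
print(classify_step(bad_body, set(), []))
```

Running the example prints `abort`, because the body fails the self-consistency check before any success set is considered.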
After the self-consistency check, the success set of the new definition is calculated. For instance, the success set for definition concept6/1 above would be: {concept6(1), concept6(4), concept6(9)}, because, of the numbers 1 to 10, only 1, 4 and 9 are square numbers. If the success set is empty, then this provides evidence for a non-existence hypothesis. That is, HR induces the hypothesis that the definition is inconsistent with the axioms of the domain, and generates some association rules to put into the slot of a new hypothesis frame. The extraction of association rules is done by negating a single body literal (which doesn’t type the variables) and moving it to the head of the rule. In our running example, HR invented the following definition, hoping to add it to the theory:
concept9(X) ← integer(X) ∧ integer(Y) ∧ multiplication(X, Y, Y) ∧ count1(X, 2)
The success set of this definition was empty, so a non-existence hypothesis was induced and a hypothesis frame was added to the theory. In the first slot were put the two association rules which could be extracted, namely:
¬multiplication(X, Y, Y) ← integer(X) ∧ integer(Y) ∧ count1(X, 2)
¬count1(X, 2) ← integer(X) ∧ integer(Y) ∧ multiplication(X, Y, Y)
For each rule extracted, an attempt to prove that it is entailed by the axioms is undertaken, by passing Otter the axioms and the statement of the rule. If the attempt fails, then HR tries to find a counterexample to disprove the rule. In
algebraic domains, this is done using MACE, but in number theory, HR generates integers up to a limit to try as counterexamples. If a counterexample is found, a new constant frame is constructed for it and added to the theory. The success set for every definition is then re-calculated in light of the new constant. This can be done if the user has supplied information about calculations for background predicates (e.g., by supplying Maple code). If no extracted rule is disproved, then the step ends and a new one starts.
If the new success set is not empty, then it is compared to those for every other definition in the theory, and if an exact repeat is found (up to renaming of the head predicate), then an equivalence hypothesis is made. A new hypothesis frame is constructed, and association rules based on the equivalence are added to the first slot. These are derived by making the body of the old definition imply a single (non-typing) literal from the body of the new definition, and vice versa. For example, if these two definitions were hypothesised to be equivalent:
$\mathit{concept}_{old}(X, Y) \leftarrow p(X) \wedge q(Y) \wedge r(X, Y)$
$\mathit{concept}_{new}(X, Y) \leftarrow p(X) \wedge q(Y) \wedge s(X, X, Y)$
then these association rules would be extracted:
$r(X, Y) \leftarrow p(X) \wedge q(Y) \wedge s(X, X, Y)$
$s(X, X, Y) \leftarrow p(X) \wedge q(Y) \wedge r(X, Y)$
In terms of proving and disproving, these association rules are dealt with in the same way as those from non-existence hypotheses. HR also extracts prime implicates using Otter [5], discussion of which is beyond the scope of this paper.
If the success set is not empty, and not a repeat, then the new definition is going to be added to the theory inside a concept frame. Before this happens, attempts to derive some association rules from the new definition are made.
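The two extraction schemes just described (from non-existence and from equivalence hypotheses) can be sketched in Python. This is a minimal sketch under assumptions of our own: literals are hypothetical (negated, predicate, args) triples, the set of typing predicates is passed in explicitly, and literals shared by both bodies are skipped in the equivalence case so that no trivial rule is produced.

```python
def rules_from_non_existence(body, typing):
    """Negate one non-typing literal at a time and move it to the head."""
    rules = []
    for lit in body:
        negated, pred, args = lit
        if pred in typing:
            continue
        head = (not negated, pred, args)
        rest = [other for other in body if other != lit]
        rules.append((head, rest))
    return rules


def rules_from_equivalence(old_body, new_body, typing):
    """Each non-typing literal of one body is implied by the whole other body."""
    rules = []
    for source, target in ((old_body, new_body), (new_body, old_body)):
        for lit in target:
            # Skipping literals already in the source body is an assumption
            # of this sketch, made to avoid trivially true rules.
            if lit[1] not in typing and lit not in source:
                rules.append((lit, list(source)))
    return rules


# The empty-success-set definition from the running example (square and prime).
square_and_prime = [
    (False, "integer", ("X",)),
    (False, "integer", ("Y",)),
    (False, "multiplication", ("X", "Y", "Y")),
    (False, "count1", ("X", 2)),
]
for head, rest in rules_from_non_existence(square_and_prime, {"integer"}):
    print(head, "<-", rest)
```

Applied to the running example, this yields one rule with a negated multiplication head and one with a negated count1 head, matching the two association rules listed in figure 2.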
In particular, the success set of each definition in the theory is checked, and if it is a proper subset or proper superset of the success set for the new definition (up to renaming of the head predicate), then an appropriate implication hypothesis is made. A set of association rules is extracted from any implication found and attempts to prove/disprove them are made as before. Following this, an attempt is made to find old success sets which are nearly the same as the one for the new definition. Such a discovery will lead to a near-equivalence hypothesis being made, and association rules being extracted. Near implications are similarly sought. Even though they have counterexamples (and hence no attempts to prove or disprove them are made), these may still be of interest to the user. In our running example, for instance, HR might next invent the concept of odd numbers, using the divisor predicate thus:
concept9(X) ← integer(X) ∧ ¬divisor(X, 2).
On the invention of this definition, HR would make the near-implication hypothesis that all prime numbers are odd. The number 2 is a counterexample to this, but the association rule may still be of interest to the user. Indeed, we are currently undertaking a project to ‘fix’ faulty hypotheses using abductive methods prescribed by Lakatos [16]. For instance, one such method is to attempt to find a definition already in the theory which covers the counterexamples (and
possibly some more constants), then exclude this definition from the hypothesis statement. Details of the Lakatos project are given in [10]. Theory formation steps end with various calculations being performed using the definition and its success set, with the results put into the slots of a new concept frame. The new definition is put in the first slot and the frame is added to the theory. The calculations are largely undertaken to give an assessment of the ‘interestingness’ of the definition, as prescribed by the user with a weighted sum of measures. HR has more than 20 measures of interestingness, which have been developed for different applications, and we discuss only a few here (see [8] for more details). Some measures calculate intrinsic properties, such as the applicability which calculates the proportion of the constants appearing in the definition’s success set. Other measures are relative, in particular, the novelty of a definition decreases as the categorisation afforded by that definition (as described in the next section) becomes more common in the theory. Other measures are calculated in respect of a labelling of constants supplied by the user. In particular, the coverage of a definition calculates the number of different labels for the constants in the success set of a definition. For details of how HR can use labelling information for predictive induction tasks, see [7]. At the end of the step, all possible ways of developing the new definition are added to the agenda. The agenda is then ordered in terms of the interestingness of the definitions, and the prescription for the next step is taken from the top and carried out.
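To make the weighted sum of measures concrete, here is a small Python sketch. Only the applicability measure follows directly from the description above; the novelty formula (one minus the share of concepts in the theory with the same categorisation) is an assumption chosen for illustration, since the text says only that novelty decreases as a categorisation becomes more common. The function and parameter names are hypothetical.

```python
def applicability(success_set, objects_of_interest):
    """Proportion of the objects of interest that appear in the success set."""
    covered = {t[0] for t in success_set}
    objects = set(objects_of_interest)
    return len(covered & objects) / len(objects)


def novelty(categorisation, theory_categorisations):
    """Assumed formula: one minus the share of concepts with this categorisation."""
    if not theory_categorisations:
        return 1.0
    same = sum(1 for c in theory_categorisations if c == categorisation)
    return 1.0 - same / len(theory_categorisations)


def interestingness(measures, weights):
    """Weighted sum of measure values, with weights supplied by the user."""
    return sum(weights[name] * value for name, value in measures.items())


# The square numbers among 1 to 10 (the success set of concept6).
squares = {(1,), (4,), (9,)}
print(applicability(squares, range(1, 11)))
```

For the square numbers among the integers 1 to 10, three of the ten constants are covered, so the applicability printed is 0.3.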
3 Searching for Definitions
To recap, a clausal theory is formed when frames which embed classification and association rules are added to the theory via theory formation steps. The inductive mechanism is fairly straightforward: the success set of each newly generated definition is checked to see whether (a) it is empty, or (b) it is a repeat. In either case, a hypothesis is made, and association rules are extracted. If neither (a) nor (b) is true, the new definition is added to the theory and interpreted as a classification rule, and association rules are sought, again using the success set of the definition. Clearly, the nature of the definitions produced by HR dictates the form of both the classification rules and the association rules produced.
Exactly how HR forms a definition at the start of each step is determined by a triple ⟨PR, Definitions, Parameterisation⟩, where PR is a production rule (a general technique for constructing new definitions from old ones), Definitions is a vector containing one or two old definitions, and Parameterisation specifies fine details about how PR will make a new definition from Definitions. When HR updates the agenda by adding ways to develop a new definition, it generates possible parameterisations for each of the production rules with respect to the new definition, and puts appropriate triples onto the agenda. How parameterisations are generated, and how the production rules actually operate, is carefully controlled so that HR searches for definitions within a well defined, fairly constrained, space. We describe below how each definition is interpreted as a classification
rule and characterise HR’s search space. Following this, we give some details of the operators which HR uses in this space (the production rules).
Definition 1. Fully typed program clauses
Suppose we have been given typing information about every constant in a theory, for instance the unary predicate integer(4) in the input file of figure 1. We call these predicates the typing predicates, and assume that each constant has only one type. A program clause $C$ with variables $X_1, \ldots, X_m$ (either in the body or the head) is called fully typed if each $X_i$ appears in a single non-negated typing predicate in the body of $C$. We say that the type of a variable is this single predicate. A definition is fully typed if each clause is fully typed and corresponding head variables in each clause are of the same type. Given a fully typed definition, $D$, with head predicate $p(Y_1, \ldots, Y_m)$ then, for a given integer $n$, we call the set of constants which satisfy the typing predicate for $Y_1$ the objects of interest for $D$.
Definition 2. n-connectedness
Suppose we have a program clause $C$ with head predicate $p(X_1, \ldots, X_m)$, where each $X_i$ is a variable. Then, a variable $V$ in the body of $C$ is said to be n-connected if it appears in a literal in the body of $C$ along with another variable, $A$, which is either n-connected or is $X_n$. If every variable in either the body or head of $C$ is n-connected, we say that $C$ is n-connected. Definitions which contain only n-connected clauses are similarly called n-connected.
Note that, as for fully typed definitions, 1-connected definitions (which we will be interested in) are a specialisation of range-restricted definitions. For an example, consider this definition, where p and q are typing predicates:
$\mathit{concept}_C(X, Y) \leftarrow p(X) \wedge q(Y) \wedge p(Z) \wedge r(X, Y) \wedge s(Y, Z)$
This is clearly fully typed, because p(X), q(Y) and p(Z) provide typing information.
It is also 1-connected, because Y is in a body literal with X (variable number 1 in the head), and Z is in a body literal with Y, which is 1-connected.
Definition 3. Classifying function
Suppose we have a fully typed definition, $D$, of arity $n$, with head predicate $p$ and success set $S$. Then, given a constant, $o$, from the objects of interest for $D$, the following specifies the classifying function for $D$:
$$f(o) = \begin{cases} \{\} & \text{if } n = 1 \text{ and } p(o) \notin S; \\ \{\{\}\} & \text{if } n = 1 \text{ and } p(o) \in S; \\ \{(t_1, \ldots, t_{n-1}) : p(o, t_1, \ldots, t_{n-1}) \in S\} & \text{if } n > 1. \end{cases}$$
We build the classification afforded by $D$ by taking each pair of objects of interest, $o_1$ and $o_2$, and putting them in the same class if $f(o_1) = f(o_2)$.
As an example classification, we look at concept7 in figure 2, which is a definition with head predicate of arity 2. It represents the number theory function τ, which counts the number of divisors of an integer. The success set for this is: {(1, 1), (2, 2), (3, 2), (4, 3), (5, 2), (6, 4), (7, 2), (8, 4), (9, 3), (10, 4)}, hence f(1) = {(1)}, and for the numbers 2, 3, 5 and 7, f outputs {(2)}, for the numbers
4 and 9, f outputs {(3)}, and for the numbers 6, 8 and 10, f outputs {(4)}. Hence the classification afforded by concept7 is: [1][2,3,5,7][4,9][6,8,10], as shown in figure 2.
Theorem 1. Suppose we are given a fully typed definition, D, where each head variable appears in at least two distinct body literals. If D is not 1-connected, then there is a literal L in the body of some clause C of D such that L can be removed from C without altering the classification afforded by D.
Proof. Note that the restriction to definitions where all head variables appear in at least two distinct body literals means that removing a literal cannot remove all reference to a head variable in the body (which would make calculating the success set impossible). Given that D is not 1-connected, there must be a clause C′ with a body literal L′ containing only constants or variables which are not 1-connected. Because there is no connection between the first variable in the head of C′ and any variable in L′, the values that those variables take in the success set for C′ will be completely independent of the value taken by the first head variable. This means that the classifying function for D will be independent of these variables, and hence we can take C and L in the theorem statement to be C′ and L′ respectively. □
HR is designed to search a space of function free, fully typed, 1-connected definitions where each head variable appears in at least two distinct body literals. The variable typing is important for HR’s efficiency: given an old definition to build from, for each production rule, HR can tell from the typing predicates alone which parameterisations will produce a definition where a variable is assigned two types (and hence not satisfiable, because each constant is of a single type). Such parameterisations are not put on the agenda. Also, when checking for repeat success sets, HR uses variable typing information to rule out repeats quickly.
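Definition 2 can be checked mechanically with a simple fixed-point computation. The Python sketch below is illustrative only and rests on two assumptions of ours: variables are written as capitalised strings (as in Prolog) and body literals are (predicate, args) pairs, and the n-th head variable is itself treated as n-connected, which is one reading of the definition.

```python
def is_var(term):
    """Assumed convention: variables are capitalised strings, as in Prolog."""
    return isinstance(term, str) and term[:1].isupper()


def connected_variables(head_vars, body, n=1):
    """Variables reachable from the n-th head variable through shared literals."""
    connected = {head_vars[n - 1]}
    changed = True
    while changed:
        changed = False
        for _pred, args in body:
            variables = {a for a in args if is_var(a)}
            # A literal sharing a connected variable connects all its variables.
            if variables & connected and not variables <= connected:
                connected |= variables
                changed = True
    return connected


def is_n_connected(head_vars, body, n=1):
    """True if every variable in the head or body is n-connected."""
    all_vars = set(head_vars) | {a for _pred, args in body for a in args if is_var(a)}
    return all_vars <= connected_variables(head_vars, body, n)


# The example clause following Definition 2, with head variables (X, Y).
example_body = [
    ("p", ("X",)), ("q", ("Y",)), ("p", ("Z",)),
    ("r", ("X", "Y")), ("s", ("Y", "Z")),
]
print(is_n_connected(("X", "Y"), example_body))
```

For the example clause this prints `True`. Removing the literal r(X, Y) breaks the chain from X to Y and Z, so the clause would no longer be 1-connected.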
More importantly, in light of theorem 1, the set of 1-connected definitions is a minimal set with respect to the classifications afforded by them. That is, with this language bias, assuming that the user supplies appropriate background definitions, HR avoids building definitions which are guaranteed to have a literal which is redundant in the generation of the classification. As the main reason HR forms definitions is to interpret them as classification rules, it is important to know that it searches within this minimal set (although we do not claim that it searches all of this space).
HR currently has 12 production rules (PRs) which search this space. We describe four below in some detail, but due to space restrictions, we only provide a brief sketch of the remainder. For brevity, we assume that each old definition contains a single clause, but note that the procedures scale up to full definitions in obvious ways. For more details about the PRs, see [7] or chapter 6 of [4].
• The Exists Production Rule
This builds a new definition from a single old definition. The parameterisation is a list of integers $[k_1, \ldots, k_n]$. The PR takes a copy of the old clause for the new one and then removes variables from the head predicate in each position
$k_i$. The variables are not removed from the body literals. For example, if it used the parameterisation [2, 3], then HR would turn $\mathit{concept}_{old}$ into $\mathit{concept}_{new}$ as follows:
$\mathit{concept}_{old}(X, Y, Z) \leftarrow p(X) \wedge q(Y) \wedge r(Z) \wedge s(X, Y, Z)$
$\mathit{concept}_{new}(X) \leftarrow p(X) \wedge q(Y) \wedge r(Z) \wedge s(X, Y, Z)$
Note that the first variable in the head predicate is never removed, which ensures 1-connectedness of $\mathit{concept}_{new}$, given 1-connectedness of $\mathit{concept}_{old}$.
• The Split Production Rule
This takes a single old definition and instantiates variables to constants in the new definition, removing literals which end up with no variables in them and removing constants from the head literal. The parameterisations are pairs of lists, with the first list corresponding to variable positions in the head, and the second list containing the constants to which the variables will be instantiated. For instance, if HR started with $\mathit{concept}_{old}$ as above, and parameterisation [[2, 3], [dog, cat]], the new definition generated would be:
$\mathit{concept}_{new}(X) \leftarrow p(X) \wedge s(X, dog, cat)$,
because q(dog) and r(cat) would be removed, as they contain no variables. Parameterisations are generated so that the first head variable is never instantiated, to ensure 1-connectedness. Also, HR does not generate parameterisations which would instantiate variables of one type to constants of another type.
• The Size Production Rule
This takes a single old definition and a parameterisation which consists of a list of integers $[i_1, \ldots, i_n]$ representing argument positions in the head predicate. The PR removes the variables from the head in the positions specified by the parameters and adds in a new variable of type integer at the end of the head predicate. Furthermore, if it has not already been invented, HR invents a new predicate of the form $\mathit{count}_{id}$, where $id$ is a unique identification number. This counts the number of distinct tuples of constants in the success set of the old definition. The tuples are constructed by taking the $i_1$-st, $i_2$-nd, etc. variable from each ground formula in the success set of the old definition. We use the standard Prolog findall/3 and length/2 predicates to represent the invented predicate. For example, suppose HR started with $\mathit{concept}_{old}$ above, and the parameterisation [2, 3]. It would first invent this predicate:
$\mathit{count}_C(X, N) \leftarrow \mathit{findall}((Y, Z), (q(Y) \wedge r(Z) \wedge s(X, Y, Z)), A) \wedge \mathit{length}(A, N)$
Note that every literal in the body of the old definition which contains a 2 or 3-connected variable appears in the findall predicate. The new definition would then be generated as:
$\mathit{concept}_{new}(X, N) \leftarrow p(X) \wedge \mathit{integer}(N) \wedge \mathit{count}_C(X, N)$
Parameterisations are never generated which include the first variable, so it is not removed. As the first variable will also appear in the counting predicate, 1-connectedness is guaranteed.
• The Compose Production Rule
This takes two clauses $C_1$ and $C_2$ and conjoins the two sets of body literals together to become the body of the new definition. It then alters the variable names in the literals imported from $C_2$, removes any repeated literals formed in the process, then constructs a suitable head for the new definition. This is a complex PR, and there is no need to go into acute detail here, as an example will suffice. Suppose we start with these two definitions:
$\mathit{concept}_{old1}(X, Y, Z) \leftarrow p(X) \wedge q(Y) \wedge r(Z) \wedge s(X, Y, Z)$
$\mathit{concept}_{old2}(A, B, C) \leftarrow r(A) \wedge q(B) \wedge p(C) \wedge t(A, B, C)$
The PR could produce many different definitions from these, for example:
$\mathit{concept}_{new}(X, Y, Z, C) \leftarrow p(X) \wedge q(Y) \wedge r(Z) \wedge s(X, Y, Z) \wedge p(C) \wedge t(Z, Y, C)$
Note that the parameterisations are generated so that the new definition is 1-connected, and does not have typing conflicts, i.e., the changing of variable names does not cause a variable to have two distinct types.
• Other Production Rules
Some production rules are domain specific. Of the remaining generic ones, the match production rule takes a single old definition and equates variables within it. For instance, the match production rule was used to create concept5 in figure 2. The disjunct rule simply adds the clauses from one definition to the set of disjoined clauses of another one. The negate and forall PRs join two old definitions like the compose rule, but the negate rule negates the entire body from the second definition before adding it, and the forall rule adds an implication sign between the two body conjunctions. Definitions produced by both forall and negate have to be re-written to present them in program clause form.
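The exists, split and size production rules can be illustrated with a short Python sketch. This is not HR's implementation: clauses are represented here as hypothetical (head_vars, body) pairs with body literals as (predicate, args) tuples, variables are assumed to be capitalised strings, and the size rule is shown acting directly on a success set rather than building a counting predicate.

```python
def is_var(term):
    """Assumed convention: variables are capitalised strings, as in Prolog."""
    return isinstance(term, str) and term[:1].isupper()


def exists_rule(clause, positions):
    """Drop the head variables at the given 1-based positions; keep the body."""
    head, body = clause
    if 1 in positions:
        raise ValueError("the first head variable is never removed")
    new_head = tuple(v for i, v in enumerate(head, start=1) if i not in positions)
    return new_head, list(body)


def split_rule(clause, positions, constants):
    """Instantiate head variables to constants and drop variable-free literals."""
    head, body = clause
    if 1 in positions:
        raise ValueError("the first head variable is never instantiated")
    binding = {head[i - 1]: c for i, c in zip(positions, constants)}
    new_head = tuple(v for v in head if v not in binding)
    new_body = []
    for pred, args in body:
        new_args = tuple(binding.get(a, a) for a in args)
        if any(is_var(a) for a in new_args):
            new_body.append((pred, new_args))
    return new_head, new_body


def size_rule(success_set, positions, objects):
    """For each object, count distinct sub-tuples at the given head positions."""
    seen = {o: set() for o in objects}
    for t in success_set:
        if t[0] in seen:
            seen[t[0]].add(tuple(t[i - 1] for i in positions))
    return {(o, len(subs)) for o, subs in seen.items()}


concept_old = (
    ("X", "Y", "Z"),
    [("p", ("X",)), ("q", ("Y",)), ("r", ("Z",)), ("s", ("X", "Y", "Z"))],
)
print(split_rule(concept_old, [2, 3], ["dog", "cat"]))
```

With the parameterisation [[2, 3], [dog, cat]] the sketch returns the head (X) and the body p(X), s(X, dog, cat), as in the split example above. Applying `size_rule` with positions [2] to the success set of divisor over the integers 1 to 10 gives the divisor counts listed for concept7.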
4 Applications to Mathematics
HR has shown some promise for discovery tasks in domains of science other than mathematics. For instance, in [3] we show how HR rediscovers the structural predictor for mutagenesis originally found by Progol [22]. However, it has mainly been applied to fairly ad-hoc tasks in domains of pure mathematics, and has made some interesting discoveries in each case. We look here at four such ad-hoc applications in terms of (a) the problem specification, (b) any extensions to HR required, (c) the termination conditions imposed and (d) the results.

The first task was a descriptive induction task in number theory, very similar to the running example in this paper. We set ourselves the goal of getting HR to invent integer sequences (e.g., prime, square, etc.) which were not already found in the Encyclopedia of Integer Sequences¹ and for HR to give us reasons to believe that the sequences were interesting enough to be submitted to this Encyclopedia.

¹ The recognised repository for sequences (http://www.research.att.com/~njas/sequences).
Simon Colton and Stephen Muggleton
We specified this problem for HR as follows: to terminate after finding a certain number (usually 50-100) of integer sequences (i.e., boolean classification rules over the set of integers) which were not in the Encyclopedia. Moreover, we used HR to present any association rules involving sequence definitions which could not be proved by Otter (those proved by Otter were usually trivially true). HR had to be extended to interact with the Encyclopedia, in order for it to tell whether a sequence was novel. On top of this, we enabled HR to mine the Encyclopedia to find relationships between the sequences it invented and those already in the Encyclopedia. This application turned out to be very fruitful: there are now over 20 sequences in the Encyclopedia which HR invented and supplied interesting conjectures for (which we proved).

As an example, using only the background knowledge given in figure 1 for the integers 1 to 50, HR invented the concept of refactorable numbers, which are such that the number of divisors is itself a divisor (so, 9 is refactorable, because it has 3 divisors, and 3 divides 9). In addition, HR specialised this to define odd refactorable numbers, then made the implication hypothesis that all odd refactorable numbers are perfect squares, a fact we proved, along with others, for a journal paper about refactorable numbers [2]. As an epilogue, we were informed later that, while they were missing from the Encyclopedia, refactorable numbers had already been invented, although none of HR's conjectures about them had been made. We have received no such notification about the other sequences HR invented.

The next task was to fulfil a request by Geoff Sutcliffe for HR to generate first-order theorems for his TPTP library [24]. This library is used to compare automated theorem provers: given a certain amount of time for each theorem, which prover can prove the most.
The task was to generate theorems which differentiate the theorem provers, i.e., find association rules which can be proved by some, but not all, of a set of provers. This was a descriptive induction task, and we ran HR as an any-time algorithm, until it had produced a certain number of theorems. As described in [25], we linked HR to three provers (Bliksem, E, and Spass) via the MathWeb software bus [15] and ran HR until it had produced 12,000 equivalence theorems and each prover had attempted to prove them (an overnight session). In general, the provers found the theorems easy to prove, with each proving roughly all but 70 theorems. However, it was an important result that, for each prover, HR found at least one theorem which that prover could not prove, but the others could. In other experiments, we did not use the provers, and the time saving enabled us to produce more than 40,000 syntactically distinct conjectures in 10 minutes. 184 of these were judged by Geoff Sutcliffe to be of sufficient calibre to be added to the TPTP library. The following is an example group theory theorem which was added:

∀ x, y ((∃ z (z⁻¹ = x ∧ z ∗ y = x) ∧ ∃ u, v (x ∗ u = y ∧ v ∗ x = u ∧ v⁻¹ = x)) ↔ (∃ a, b (inv(a) = x ∧ a ∗ y = x) ∧ b ∗ y = x ∧ inv(b) = y))

As with the Encyclopedia of Integer Sequences, HR remains the only computer program to add to this mathematical database.

The third task was to fulfil a request by Volker Sorge and Andreas Meier to integrate HR with their system in an application to classifying residue classes.
  | 0 1 2 3 4
--+----------
0 | 0 0 0 0 0
1 | 3 0 2 4 1
2 | 1 0 4 3 2
3 | 4 0 1 2 3
4 | 2 0 3 1 4

  | 0 1 2 3 4
--+----------
0 | 0 0 0 0 0
1 | 2 4 1 3 0
2 | 4 3 2 1 0
3 | 1 2 3 4 0
4 | 3 1 4 2 0
Fig. 3. The multiplication tables of two algebraic structures
These are algebraic structures which were generated in abundance by their system. The task was to put them into isomorphic classes (a common problem in pure mathematics), which can be achieved by checking whether pairs of residue classes are isomorphic. When they are isomorphic, it is often not too time consuming to find the isomorphic map. Unfortunately, when they are not isomorphic, all such maps have to be exhausted, and this can take a long time. In such cases, it is often easier to find a property which is true of only one example and then prove in general terms that two algebraic structures differing in this way cannot be isomorphic. Finding the property was HR's job. Hence, each pair of algebras presented HR with a predictive induction task with two examples and the goal of finding a property true of only one. We therefore set HR to stop if such a boolean definition was found, or if 1000 theory formation steps had been carried out. As described in [19], HR was used to discriminate between 817 pairs of non-isomorphic algebras² with 5, 6 and 10 elements, and was successful for 791 pairs (96.8%). As an example, consider the two algebraic structures in figure 3. HR found this property:

∃ x (x ∗ x = x ∧ ∀ y (y ∗ y = x ⇒ y ∗ y = y))

to be true of the second example, but not the first. This states that there exists an element, x, which is idempotent (i.e., x ∗ x = x) such that any other element which squares to give x is itself idempotent. This means that there must be an idempotent element which appears only once on the diagonal.

The final task was to fulfil a request from Ian Miguel and Toby Walsh to use HR to help reformulate constraint satisfaction problems (CSPs) for finding quasigroups. CSP solvers are powerful, general purpose programs for finding assignments of values to variables without breaking certain constraints. Specifying a CSP for efficient search is a highly skilled art, so there has been much research recently into automatically reformulating CSPs.
One possibility is to add more constraints. If a new constraint can be shown to be entailed by the original constraints, it can be added with no loss of generality and is called an implied constraint. If no proof is found, we say the constraint is an induced constraint. We set ourselves the task of finding both implied and induced constraints for a series of quasigroup existence problems. We saw generating implied constraints as a descriptive induction task, and ran HR as an any-time algorithm to produce proved association rules which related concepts in the specification of the CSP. We gave the original specifications to Otter as axioms, so that it could prove the induced rules. Quasigroups are similar to the algebras in figure 3, but they have each element in every row and column. For a special type of quasigroup, known as QG3-quasigroups, we used a CSP solver to generate some examples of small quasigroups. This, along with definitions extracted from the CSP specification, provided the initial data³ for theory formation sessions. As an example of one of many interesting theorems HR found (and Otter proved), we discovered that QG3-quasigroups are anti-Abelian. That is: ∀ x, y (x ∗ y = y ∗ x → x = y). Hence, if two elements commute, they must be the same. This became a powerful implied constraint in the reformulated CSPs.

² This data set is available here: http://www.doc.ic.ac.uk/~sgc/hr/applications/residues.

We approached the problem of generating induced constraints as a subgroup discovery problem (in the machine learning, rather than the mathematical, sense). We gave HR a labelling of the solutions found by the solver, with solutions of the same size labelled the same. Then, using a heuristic search involving the coverage and applicability measures discussed previously, we made HR prefer definitions which had at least one positive in every size category (but were not true of all the quasigroups). We reasoned that, when looking for specialisations, it is a good idea to look for ones with good coverage over the size categories. At the end of the session, we ordered the definitions with respect to the weighted sum, and took the best as induced constraints which specialised the CSP. This enabled us to find quasigroups of larger sizes. As an example, HR invented a property we called left-identity symmetry: ∀ a, b (a ∗ b = b → b ∗ a = a). This also became a powerful constraint in the reformulated CSPs. As discussed in [9], for each of the five types of quasigroup we looked at, we found a reformulation using HR's discoveries which improved efficiency. By combining induced and implied constraints, we often achieved a tenfold increase in solver efficiency. This meant that we could find quasigroups with 2 and 3 more elements than we could with the naive formulation of the CSP.
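Several of the concrete claims in this section are small enough to re-check by brute force. The Python sketch below does so under stated assumptions: the two tables of figure 3 are read row by row (the property tested depends only on the diagonal, so the orientation does not affect the result), the function names are invented for this illustration, and the quasigroup used for the last two properties is subtraction modulo 3, a hypothetical toy example that is not claimed to be a QG3-quasigroup. None of this reflects how HR or Otter work internally.

```python
from math import isqrt

def num_divisors(n):
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def is_refactorable(n):
    """True if the number of divisors of n is itself a divisor of n."""
    return n % num_divisors(n) == 0

# HR's conjecture: every odd refactorable number is a perfect square.
odd_refactorable = [n for n in range(1, 1001, 2) if is_refactorable(n)]
all_squares = all(isqrt(n) ** 2 == n for n in odd_refactorable)

# Tables of figure 3, with table[x][y] standing for x * y.
table1 = [[0, 0, 0, 0, 0], [3, 0, 2, 4, 1], [1, 0, 4, 3, 2],
          [4, 0, 1, 2, 3], [2, 0, 3, 1, 4]]
table2 = [[0, 0, 0, 0, 0], [2, 4, 1, 3, 0], [4, 3, 2, 1, 0],
          [1, 2, 3, 4, 0], [3, 1, 4, 2, 0]]

def idempotent_once_on_diagonal(t):
    """exists x (x*x = x and forall y (y*y = x -> y*y = y))."""
    n = len(t)
    return any(
        t[x][x] == x and all(t[y][y] != x or t[y][y] == y for y in range(n))
        for x in range(n)
    )

def is_anti_abelian(t):
    """forall x, y (x*y = y*x -> x = y)."""
    n = len(t)
    return all(t[x][y] != t[y][x] or x == y for x in range(n) for y in range(n))

def has_left_identity_symmetry(t):
    """forall a, b (a*b = b -> b*a = a)."""
    n = len(t)
    return all(t[a][b] != b or t[b][a] == a for a in range(n) for b in range(n))

# Hypothetical toy tables: subtraction and addition modulo 3.
sub3 = [[(a - b) % 3 for b in range(3)] for a in range(3)]
add3 = [[(a + b) % 3 for b in range(3)] for a in range(3)]

print(is_refactorable(9), all_squares)
print(idempotent_once_on_diagonal(table1), idempotent_once_on_diagonal(table2))
print(is_anti_abelian(sub3), has_left_identity_symmetry(sub3), is_anti_abelian(add3))
```

Checks of this kind only confirm a property on the given finite data; establishing it in general is the role the text assigns to a prover such as Otter.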
5 Comparison with Other ILP Techniques
Although it has been used for predictive tasks, HR has been designed to undertake descriptive induction tasks. In this respect, therefore, it is most similar to the CLAUDIEN [12] and WARMR [13] programs. These systems specify a language bias (DLAB and WARMODE respectively) and search for clauses in this language. This means that fairly arbitrary sets of predicates can be conjoined in clauses, and similarly arbitrary clauses disjoined in definitions (as long as they specify association rules passing some criteria of interestingness). In contrast, while we have characterised the space of definitions HR searches within, each production rule has been derived from looking at how mathematical concepts could be formed, as described in chapter 6 of [4]. Hence, automated theory formation is driven by an underlying goal of developing the most interesting definitions using possibly interesting techniques. In terms of search, therefore, HR more closely resembles predictive ILP algorithms. For instance, a specific to general ILP system such as Progol [21] chooses a clause to generalise because that clause covers more positive examples than the other clauses (and no negative examples). So, while there are still language biases, the emphasis is on building a new clause from a previous one, in much the same way that HR builds a new definition from a previous one. Note that an application-based comparison of HR and Progol is given in [1].

³ This data set is available from http://www.doc.ic.ac.uk/~sgc/hr/applications/quasigroups.

Due to work by Steel et al. [23], HR was extended from a tool for a single relation database to a relational data mining tool, so that multiple input files such as those in figure 1, with definitions relating predicates across files, can be given to HR. However, the data that HR deals with often differs from that given to other ILP systems. In particular, HR can be given very small amounts of data, in some cases just two or three lines describing the axioms of the domain. Also, due to the precise mathematical definitions which generate data, we have not worried particularly about dealing with noisy data. In fact, HR's abilities to make 'near' hypotheses grew from applications to non-mathematical data. There are also no concerns about compression of information as there are in systems such as Progol. This is partly because HR often starts with very few constants (e.g., there are only 12 groups up to size 8), and also because HR is supplied with axioms, hence it can prove the correctness of association rules, without having to worry about overfitting, etc.

The final way in which ATF differs from other ILP algorithms is in the interplay between induction and deduction. Systems such as Progol, which use inverse entailment techniques, think of induction as the inverse of deduction. Hence, every inductive step is taken in such a way that the resulting hypothesis, along with the background knowledge, deductively entails the examples. In contrast, HR induces hypotheses which are supported by the data, but are in no way guaranteed to be entailed by the background predicates and/or the axioms.
For this reason, HR interacts with automated reasoning systems, and is, to the best of our knowledge, the only ILP system to do so. The fact that HR makes faulty hypotheses actually adds to the richness of the theory, because model generators can be employed to find counterexamples, which are added to the theory.
6 Conclusions and Further Work
We have described a novel ILP algorithm called Automated Theory Formation (ATF), which builds clausal theories consisting of classification rules and association rules. This employs concept formation methods to generate definitions from which classification rules are derived. Then, the success sets of the definitions are used to induce non-existence, equivalence and implication hypotheses, from which association rules are extracted. In addition to these inductive methods, ATF also relies upon deductive methods to prove/disprove that the association rules are entailed by a set of user-supplied axioms. We discussed the implementation of this algorithm in the HR system, and characterised the space of definitions that HR searches. HR differs from other descriptive ILP systems in the way it searches for definitions and the way in which it interacts with third-party automated reasoning software. HR has been applied to various discovery tasks in mathematics, and has had some success in this area. In addition to describing ATF and HR, we have endeavoured to promote mathematics as a domain where inductive techniques could be fruitfully employed. To this end, we have supplied pointers to two data sets which arose from our applications.

We aim to continue to improve our model of automated theory formation. In particular, we are currently equipping HR with abductive techniques prescribed by Lakatos [16], and modelling advantages of theory formation within a social setting via a multi-agent version of the system [10]. We are continuing the application of HR to mathematical discovery, e.g., we are currently attempting to use it to re-discover the graph theory results found by the Graffiti program [14]. We are also applying HR to other scientific domains, most notably bioinformatics, e.g., we are currently using HR to induce relationships between concepts in the Gene Ontology [11]. We are also continuing to study how descriptive ILP techniques like ATF can be used to enhance other systems such as theorem provers, constraint solvers and predictive ILP programs. In particular, we are studying how descriptive techniques may be used for preprocessing knowledge.

ATF uses invention, induction, deduction and abduction, and HR interacts with automated theorem provers, model generators, constraint solvers and computer algebra systems to do so. For systems such as HR to behave creatively, we believe that the search they undertake must be in terms of which reasoning technique/program to employ next, rather than search at the object level. We envisage machine learning, theorem proving, constraint solving and planning systems being routinely integrated in ways tailored individually for solving particular problems. We believe that such integration of reasoning systems will provide future AI discovery programs with more power, flexibility and robustness than current implementations.
Acknowledgements

We wish to thank Alireza Tamaddoni-Nezhad, Hiroaki Watanabe, Huma Lodhi, Jung-Wook Bang and Oliver Ray for interesting discussions relating to this work. We also wish to thank the anonymous reviewers for their helpful comments.
References

1. S. Colton. An application-based comparison of Automated Theory Formation and Inductive Logic Programming. Linkoping Electronic Articles in Computer and Information Science (special issue: Proceedings of Machine Intelligence 17), 2000.
2. S. Colton. Refactorable numbers - A machine invention. Journal of Integer Sequences, 2, 1999.
3. S. Colton. Automated theory formation applied to mutagenesis data. In Proceedings of the 1st British-Cuban Workshop on Bioinformatics, 2002.
4. S. Colton. Automated Theory Formation in Pure Mathematics. Springer, 2002.
5. S. Colton. The HR program for theorem generation. In Proceedings of the Eighteenth Conference on Automated Deduction, 2002.
6. S. Colton. Making conjectures about Maple functions. In Proceedings of the 10th Symposium on Integration of Symbolic Computation and Mechanized Reasoning, LNAI 2385, Springer, 2002.
7. S. Colton, A. Bundy, and T. Walsh. Automatic identification of mathematical concepts. In Proceedings of the 17th ICML, 2000.
8. S. Colton, A. Bundy, and T. Walsh. On the notion of interestingness in automated mathematical discovery. IJHCS, 53(3):351–375, 2000.
9. S. Colton and I. Miguel. Constraint generation via automated theory formation. In Proceedings of CP-01, 2001.
10. S. Colton and A. Pease. Lakatos-style methods in automated reasoning. In Proceedings of the IJCAI 2003 workshop on Agents and Reasoning, 2003.
11. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat. Genet., 25:25–29, 2000.
12. L. De Raedt and L. Dehaspe. Clausal discovery. Machine Learning, 26:99–146, 1997.
13. L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3(1):7–36, 1999.
14. S. Fajtlowicz. Conjectures of Graffiti. Discrete Mathematics 72, 23:113–118, 1988.
15. A. Franke and M. Kohlhase. System description: MathWeb. In Proceedings of CADE-16, pages 217–221, 1999.
16. I. Lakatos. Proofs and Refutations. Cambridge University Press, 1976.
17. W. McCune. OTTER user's guide. ANL/90/9, Argonne Labs, 1990.
18. W. McCune. Mace 2 Reference Manual. ANL/MCS-TM-249, Argonne Labs, 2001.
19. A. Meier, V. Sorge, and S. Colton. Employing theory formation to guide proof planning. In Proceedings of the 10th Symposium on Integration of Symbolic Computation and Mechanized Reasoning, LNAI 2385, Springer, 2002.
20. M. Minsky. A framework for representing knowledge. In Brachman and Levesque, editors, Readings in Knowledge Representation. Morgan Kaufmann, 1985.
21. S. Muggleton. Inverse entailment and Progol. New Generation Computing, 13:245–286, 1995.
22. A. Srinivasan, S. Muggleton, R. King, and M. Sternberg. Theories for mutagenicity: A study of first-order and feature based induction. Artificial Intelligence, 85(1,2):277–299, 1996.
23. G. Steel. Cross domain concept formation using HR. Master's thesis, Division of Informatics, University of Edinburgh, 1999.
24. G. Sutcliffe and C. Suttner. The TPTP Problem Library: CNF Release v1.2.1. Journal of Automated Reasoning, 21(2):177–203, 1998.
25. J. Zimmer, A. Franke, S. Colton, and G. Sutcliffe. Integrating HR and tptp2x into MathWeb to compare automated theorem provers. In Proceedings of the CADE 2002 Workshop on Problems and Problem sets, 2002.
An Exhaustive Matching Procedure for the Improvement of Learning Efficiency

Nicola Di Mauro, Teresa Maria Altomare Basile, Stefano Ferilli, Floriana Esposito, and Nicola Fanizzi

Dipartimento di Informatica, Università di Bari
via E. Orabona, 4 - 70125 Bari - Italia
{nicodimauro,asile,ferilli,esposito,fanizzi}@di.uniba.it
Abstract. Efficiency of the first-order logic proof procedure is a major issue when deduction systems are to be used in real environments, both on their own and as a component of larger systems (e.g., learning systems). Hence the need for techniques that can speed up such a process. This paper proposes a new algorithm for matching first-order logic descriptions under θ-subsumption that is able to return the set of all substitutions by which such a relation holds between two clauses, and shows experimental results in support of its performance.
1 Introduction
The induction of logic theories relies intensively on the use of a covering procedure that essentially computes whether a candidate concept definition (a hypothesis) explains a given example. This is the reason why the covering procedure should be carefully designed for an efficient hypothesis evaluation, which in turn means that the underlying matching procedure according to which two descriptions are compared must a fortiori fulfill the very same requirement. When the underlying generalization model is θ-subsumption, which is common for the standard ILP setting [11,12], the complexity of the evaluation is known to be NP-hard [4], unless biases are applied to make the problem tractable. If ILP algorithms have to be applied to real-world problems, the efficiency of the matching procedure is imperative, especially when it has to return (some of) the matching substitutions in case of success (e.g., when they are to be used for performing a resolution step).

A prototypical case is represented by the domain of document understanding, in which the learning task concerns the induction of rules for the detection of the roles and relationships between the numerous logic parts that make up a structured document layout. Testing the performance of ILP algorithms in this domain shows that in this and in similar tasks, where many objects and relations are involved, ILP systems easily reveal their poor efficiency, since the computational time grows exponentially, as expected for the worst case. Indeed, when profiling the execution of our learning system INTHELEX [3] on instances of that task, it was clear that the embedded matching procedure was to blame as the main source of inefficiency.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 112–129, 2003. © Springer-Verlag Berlin Heidelberg 2003
Specifically, the matching procedure embedded in INTHELEX simply exploited the Prolog built-in SLD resolution: indeed, the θ-subsumption test can be cast as a refutation of the hypothesis plus the example description added with the negation of the target classification. The proof is computed through various SLD resolution steps [9], which can be very inefficient in some cases. For instance, given the following two clauses:

h(X) :- p(X,X1), p(X,X2), ..., p(X,Xn), q(Xn).
h(c) :- p(c,c1), p(c,c2), ..., p(c,cm).

SLD-resolution will have to try all m^n possible mappings (backtrackings) before realizing that the former does not match the latter because of the lack of property q. Thus, the greater n and m, the sooner it will not be able to compute subsumption between the two clauses within acceptable time¹. It is clear that, in real-world settings, situations like this are likely to happen, preventing a solution to the problem at hand from being found. Needless to say, this would affect the whole learning task. Indeed, the absence of just one matching solution could stop a whole deduction, which in turn might be needed to reach the overall solution and complete the experimental results on which the researcher draws conclusions.

The above consideration motivated the decision to find a more efficient algorithm for matching, so as to cope with difficult learning problems both in artificial and in real domains. It should be highlighted that, for the purposes of our system, the procedure must solve a search problem, which is more complex than a mere θ-subsumption test, since it is mandatory that every possible substitution is computed and returned. There are two non-negligible motivations for this. The former is that the specializing operator relies on this information to find refinements that are able to rule out a problematic negative example while still covering all the past positive ones [2].
The latter is that the saturation and abstraction operators embedded in INTHELEX require resolution to find all possible instances of sub-concepts hidden in the given observations, in order to explicitly add them to the example descriptions. Further reasons are related to the fact that a clause covering an example in many ways could give a higher confidence in the correct classification of the example itself. Besides, most ILP learning algorithms require a more complex matching procedure that solves a search problem rather than a simple decision problem. Indeed, while the mere covering would suffice for a naive generate-and-test algorithm, in general, the induction of new candidate hypotheses may require the calculation of one or all the substitutions that lead a concept definition to explain an example.

¹ In some complex cases of document understanding the Prolog SLD resolutions required by a single refutation have been observed to last for over 20 days!

Finding all substitutions for matching has been investigated in other fields, such as production systems, theorem proving, and concurrent/parallel implementations of declarative languages. For instance, in functional programming search procedures can often be formalized as functions yielding lists as values. As Wadler [17] points out, if we make the result of a function a list, and regard the items of this list as multiple results, lazy evaluation provides the counterpart of backtracking. In automated theorem proving, tableaux calculi have been introduced, such as the Confluent Connection Calculus, where backtracking is avoided.

We looked at the available literature on the subject, and found that a recent system available was Django, kindly provided to us by its authors, but unfortunately it did not fit our requirement of returning the matching solutions. The previous algorithms were not suitable either, since they return just one solution at a time, thus requiring backtracking for collecting all solutions. But blind backtracking is precisely the source of the above inefficiency, so we decided to avoid it altogether and to have a matching procedure that always goes forward without losing information. No such thing was available in the literature, and this forced us to build one ourselves.

This paper is organized as follows. The next section presents related work in this field; then, Section 3 presents a new matching algorithm, while Section 4 shows experimental results concerning its performance. Lastly, Section 5 draws some conclusions and outlines future work.
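The exponential behaviour described earlier in this section for the clauses h(X) :- p(X,X1), ..., p(X,Xn), q(Xn) and h(c) :- p(c,c1), ..., p(c,cm) can be made tangible with a small Python sketch. It is only a counting model under stated assumptions: the function name naive_attempts is hypothetical, and the loop enumerates candidate bindings of X1..Xn exhaustively rather than simulating a real Prolog engine.

```python
from itertools import product

def naive_attempts(n, m):
    """Count the candidate bindings of X1..Xn examined before failure.

    Every p(X, Xi) literal can be matched against any of the m facts
    p(c, c1..cm); the match fails only at q(Xn), which has no counterpart.
    """
    constants = [f"c{j}" for j in range(1, m + 1)]
    q_facts = set()  # the second clause contains no q(...) literal
    attempts = 0
    for binding in product(constants, repeat=n):  # one value per Xi
        attempts += 1
        if binding[-1] in q_facts:  # q(Xn) would have to hold
            return attempts, True
    return attempts, False

for n, m in [(2, 5), (3, 4), (5, 10)]:
    attempts, matched = naive_attempts(n, m)
    print(n, m, attempts, matched)
```

The number of attempts equals m raised to the power n in every case, which is why even moderate values of n and m make the naive test impractical.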
2 Related Work
The great importance of finding efficient algorithms for matching descriptions under θ-subsumption is reflected by the amount of work carried out so far in this direction in the literature. The latest results have been worked out by Maloberti and Sebag in [10], where the problem of θ-subsumption is tackled by means of a Constraint Satisfaction Problem (CSP) approach. Briefly, a CSP involves a set of variables X_1, ..., X_n, where each X_i has domain dom(X_i), and a set of constraints, specifying the simultaneously admissible values of the variables. A CSP solution assigns to each X_i a value a_i ∈ dom(X_i) such that all constraints are satisfied. CSP algorithms exploit two kinds of heuristics. Reduction heuristics aim at transforming a CSP into an equivalent CSP of lesser complexity by pruning the candidate values for each variable (local consistency). Search heuristics are concerned with the backtracking procedure (look-back heuristics) and with the choice of the next variable and candidate value to consider (look-ahead heuristics): a value is chosen for each variable and then consistency is checked. Since there is no universally efficient heuristic in such a setting, different combinations thereof may be suited to different situations. θ-subsumption is mapped onto a CSP by transforming each literal involved in the hypothesis into a CSP variable with proper constraints encoding the θ-subsumption structure. Specifically, each CSP variable X_p corresponds to a literal in clause C built on predicate symbol p, and dom(X_p) is taken as the set of all literals in the clause D built on the same predicate symbol. A constraint r(X_p, X_q) is set on a variable pair (X_p, X_q) iff the corresponding literals in C share a variable. Given such a representation, different versions of a correct and complete θ-subsumption algorithm, named Django, were built, each implementing different (combinations of) CSP heuristics.
Some exploited look-ahead heuristics to obtain a failure as soon as possible. Others were based on forward checking (propagation of forced assignments, i.e., variables with a single candidate value) and arc-consistency (based on the association, to each pair variable-candidate value, of a signature encoding the literal links, i.e., shared variables, with all other literals). Experiments in hard artificial domains are reported, proving a difference in performance of several orders of magnitude in favor of Django compared to previous algorithms.

In sum, all the work carried out before Django (namely, by Gottlob and Leitsch [6], by Kietz and Lübbe [8] and by Scheffer, Herbrich and Wysotzki [13]) aimed at separating the part of the clause for which finding a matching is straightforward or computationally efficient, limiting as much as possible the complexity of the procedure. All the corresponding techniques rely on backtracking, and try to limit its effect by properly choosing the candidates in each tentative step. Hence, all of them return only the first matching substitution found, even if many exist. By contrast, it is important to note that Django only gives a binary (yes or no) answer to the subsumption test, without providing any matching substitution in case of a positive outcome.
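The CSP encoding of θ-subsumption summarised in this section can be sketched in a few lines of Python. This is an illustration of the encoding only, not of Django: it uses brute-force enumeration with none of the reduction or search heuristics mentioned above. The literal representation (a predicate name with a tuple of argument strings, variables starting with an upper-case letter) and the names is_var, bindings, encode and subsumes are assumptions made for this sketch, and clause heads are assumed to have been matched already.

```python
from itertools import product

def is_var(term):
    return term[:1].isupper()  # Prolog convention: variables start upper-case

def bindings(lit_c, lit_d):
    """Bindings mapping literal lit_c onto ground literal lit_d, or None."""
    (pred_c, args_c), (pred_d, args_d) = lit_c, lit_d
    if pred_c != pred_d or len(args_c) != len(args_d):
        return None
    theta = {}
    for s, t in zip(args_c, args_d):
        if is_var(s):
            if theta.setdefault(s, t) != t:  # same variable, two values
                return None
        elif s != t:  # constants must coincide
            return None
    return theta

def encode(C, D):
    """One CSP variable per literal of C; domain = same-predicate literals of D."""
    domains = [[d for d in D if d[0] == c[0]] for c in C]
    constraints = [
        (i, j)
        for i in range(len(C))
        for j in range(i + 1, len(C))
        if {a for a in C[i][1] if is_var(a)} & {a for a in C[j][1] if is_var(a)}
    ]
    return domains, constraints

def subsumes(C, D):
    """Yes/no answer obtained by exhaustive search over the CSP."""
    domains, constraints = encode(C, D)
    for assignment in product(*domains):
        thetas = [bindings(c, d) for c, d in zip(C, assignment)]
        if any(th is None for th in thetas):
            continue
        if all(
            thetas[i][v] == thetas[j][v]
            for i, j in constraints
            for v in thetas[i].keys() & thetas[j].keys()
        ):
            return True
    return False

C = [("p", ("X", "Y")), ("q", ("Y",))]
D1 = [("p", ("a", "b")), ("p", ("a", "c")), ("q", ("c",))]
D2 = [("p", ("a", "b")), ("q", ("c",))]
print(subsumes(C, D1), subsumes(C, D2))
```

Checking consistency only on pairs of literals that share a variable is sufficient here, because any two literals containing the same variable are linked by a constraint and must therefore agree on its value.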
3 A New Matching Algorithm
The ideas presented in related work aimed, in part, at exploiting particular situations in which the θ-subsumption test can be computed with reduced complexity. However, once the subparts of the given clauses that admit efficient treatment have been handled, the only way forward is to apply classical, complex algorithms, possibly exploiting heuristics to choose the next literal to be unified. In those cases, the CSP approach proves very efficient, but at the cost of not returning (all) the possible substitutions by which the matching holds. Actually, there are cases in which at least one such substitution is needed by the experimenter. Moreover, if all such substitutions are needed (e.g., for performing successive resolution steps), the feeling is that the CSP approach necessarily has to explore the whole search space, thus losing all the advantages on which it bases its efficiency. The proposed algorithm, on the contrary, returns all possible matching substitutions, without performing any backtracking in their computation. Such a feature is important, since the substitutions found can be made available to further matching problems, thus making it possible to perform resolution. Before discussing the new procedure proposed in this paper, it is necessary to give some definitions on which the algorithm is based. In the following, we will assume that C and D are Horn clauses having the same predicate in their head, and that the aim is checking whether a matching exists between C and D, i.e. whether C θ-subsumes D, in which case we are interested in all possible ways (substitutions) by which it happens. Note that D can always be considered ground (i.e., variable-free) without loss of generality. Indeed, in case it is not, a new corresponding clause D′ can be obtained by replacing each of its variables by a new constant not appearing in C nor in D, and it can be proven that C θ-subsumes D iff C θ-subsumes D′.
Definition 1 (Matching Substitution). A matching substitution from a literal l1 to a literal l2 is a substitution µ such that l1µ = l2.
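Definition 1 can be illustrated with a minimal Python sketch. The representation is an assumption made only for illustration: a literal is a (predicate, arguments) pair, variables are strings starting with an uppercase letter, and the target literal is ground. The names is_variable and match_literal are hypothetical and do not come from the paper.

```python
def is_variable(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def match_literal(l1, l2):
    """Return the substitution mu such that l1*mu == l2, or None if there is none.

    Literals are (predicate, args) pairs; l2 is assumed to be ground.
    """
    pred1, args1 = l1
    pred2, args2 = l2
    if pred1 != pred2 or len(args1) != len(args2):
        return None
    mu = {}
    for t1, t2 in zip(args1, args2):
        if is_variable(t1):
            # A variable must be bound to the same constant at every occurrence.
            if mu.setdefault(t1, t2) != t2:
                return None
        elif t1 != t2:
            # Constants in l1 must coincide with those in l2.
            return None
    return mu


print(match_literal(("p", ("X", "Y")), ("p", ("a", "b"))))
print(match_literal(("p", ("X", "X")), ("p", ("a", "b"))))
```

The second call has no matching substitution, since X would have to be bound to both a and b.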
Nicola Di Mauro et al.
The set of all matching substitutions from a literal l ∈ C to some literal in D is denoted by [1]: uni(C, l, D) = {µ | l ∈ C, lµ ∈ D}. Let us start by defining a structure to compactly represent sets of substitutions.
Definition 2 (Multi-substitutions). A multibind is denoted by X → T, where X is a variable and T ≠ ∅ is a set of constants. A multi-substitution is a set of multibinds Θ = {X1 → T1, …, Xn → Tn} ≠ ∅, where ∀i ≠ j : Xi ≠ Xj. Informally, a multibind identifies a set of constants that can be associated to a variable, while a multi-substitution represents in a compact way a set of possible substitutions for a tuple of variables. In particular, a single substitution is represented by a multi-substitution in which each set of constants is a singleton.
Example 1. Θ = {X → {1, 3, 4}, Y → {7}, Z → {2, 9}} is a multi-substitution. It contains 3 multibinds, namely: X → {1, 3, 4}, Y → {7} and Z → {2, 9}.
Given a multi-substitution, the set of all substitutions it represents can be obtained by choosing in all possible ways one constant for each variable among those in the corresponding multibind.
Definition 3 (Split). Given a multi-substitution Θ = {X1 → T1, …, Xn → Tn}, split(Θ) is the set of all substitutions represented by Θ: split(Θ) = { {X1 → c_{i_1}, …, Xn → c_{i_n}} | ∀k = 1 … n : c_{i_k} ∈ Tk ∧ i_k = 1 … |Tk| }.
Example 2. split({X → {1, 3, 4}, Y → {7}, Z → {2, 9}}) = {{X → 1, Y → 7, Z → 2}, {X → 1, Y → 7, Z → 9}, {X → 3, Y → 7, Z → 2}, {X → 3, Y → 7, Z → 9}, {X → 4, Y → 7, Z → 2}, {X → 4, Y → 7, Z → 9}}.
Definition 4 (Union of Multi-substitutions). The union of two multi-substitutions Θ′ = {X → T′, X1 → T1, …, Xn → Tn} and Θ″ = {X → T″, X1 → T1, …, Xn → Tn} is the multi-substitution defined as Θ′ ⊔ Θ″ = {X → T′ ∪ T″} ∪ {Xi → Ti}1≤i≤n.
Informally, the union of two multi-substitutions that are identical but for the multibind referring to one variable is a multi-substitution that inherits the common multibinds and associates to the remaining variable the union of the corresponding sets of constants in the input multi-substitutions. Note that the two input multi-substitutions must be defined on the same set of variables and must differ in at most one multibind.
Example 3. The union of two multi-substitutions Σ = {X → {1, 3}, Y → {7}, Z → {2, 9}} and Θ = {X → {1, 4}, Y → {7}, Z → {2, 9}} is: Σ ⊔ Θ = {X → {1, 3, 4}, Y → {7}, Z → {2, 9}} (the only different multibinds being those referring to variable X).
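Definitions 3 and 4 can be sketched in Python as follows. This is only an illustrative sketch under an assumed representation (a multi-substitution is a dict mapping each variable name to a set of constants); the function names split and union mirror the text, but the code is not taken from the authors' implementation.

```python
from itertools import product


def split(theta):
    """All plain substitutions represented by the multi-substitution theta."""
    variables = sorted(theta)
    # Choose, in all possible ways, one constant for each variable.
    choices = product(*(sorted(theta[v]) for v in variables))
    return [dict(zip(variables, choice)) for choice in choices]


def union(theta1, theta2):
    """Union as in Definition 4, or None where the union is not defined."""
    if theta1.keys() != theta2.keys():
        return None  # must be defined on the same set of variables
    differing = [v for v in theta1 if theta1[v] != theta2[v]]
    if len(differing) > 1:
        return None  # must differ in at most one multibind
    return {v: theta1[v] | theta2[v] for v in theta1}


theta = {"X": {1, 3, 4}, "Y": {7}, "Z": {2, 9}}
print(len(split(theta)))  # one substitution per combination of constants
print(union({"X": {1, 3}, "Y": {7}, "Z": {2, 9}},
            {"X": {1, 4}, "Y": {7}, "Z": {2, 9}}))
```

The two calls reproduce Example 2 (six substitutions) and Example 3 respectively.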
An Exhaustive Matching Procedure
Algorithm 1 merge(S)
Require: S: set of substitutions (each represented as a multi-substitution)
  while ∃u, v ∈ S such that u ≠ v and u ⊔ v = t do
    S := (S \ {u, v}) ∪ {t}
  end while
  return S
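A possible Python rendering of Algorithm 1 is sketched below. Since the algorithm leaves open which pair to merge at each step, this sketch assumes a simple policy (always the first mergeable pair in list order); the dict-of-sets representation and the helper union are likewise assumptions made for illustration.

```python
def union(theta1, theta2):
    """Union of two multi-substitutions, or None where it is not defined."""
    if theta1.keys() != theta2.keys():
        return None
    if sum(theta1[v] != theta2[v] for v in theta1) > 1:
        return None
    return {v: theta1[v] | theta2[v] for v in theta1}


def merge(substitutions):
    """Compress plain substitutions (dicts var -> constant) into multi-substitutions."""
    # Each substitution starts as a multi-substitution made of singletons.
    pool = [{v: frozenset([c]) for v, c in s.items()} for s in substitutions]
    while True:
        # Assumed policy: pick the first pair (in list order) whose union is defined.
        pair = next(((i, j)
                     for i in range(len(pool))
                     for j in range(i + 1, len(pool))
                     if union(pool[i], pool[j]) is not None), None)
        if pair is None:
            return pool
        i, j = pair
        merged = union(pool[i], pool[j])
        pool = [p for k, p in enumerate(pool) if k not in pair] + [merged]


theta = {"X": 1, "Y": 2, "Z": 3}
delta = {"X": 1, "Y": 2, "Z": 4}
sigma = {"X": 1, "Y": 2, "Z": 5}
tau = {"X": 1, "Y": 5, "Z": 3}
print(merge([theta, delta, sigma, tau]))
print(merge([theta, tau, delta, sigma]))
```

With this policy the two calls differ only in the input order, yet they end in two different sets of multi-substitutions, which illustrates that the outcome of merging depends on the order in which pairs are chosen.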
Definition 5 (Merge). Given a set S of substitutions on the same variables, merge(S) is the set of multi-substitutions obtained according to Algorithm 1.
Example 4. merge({{X → 1, Y → 2, Z → 3}, {X → 1, Y → 2, Z → 4}, {X → 1, Y → 2, Z → 5}}) = merge({{X → {1}, Y → {2}, Z → {3, 4}}, {X → {1}, Y → {2}, Z → {5}}}) = {{X → {1}, Y → {2}, Z → {3, 4, 5}}}. This way we can represent 3 substitutions with only one multi-substitution.
The merge procedure is in charge of compressing many substitutions into a smaller number of multi-substitutions. It should be noted that there are cases in which a set of substitutions cannot be merged at all; moreover, the set of multi-substitutions resulting from the merging phase may not be unique. In fact, it may depend on the order in which the two multi-substitutions to be merged are chosen at each step.
Example 5. Let us consider the following substitutions: θ = {X → 1, Y → 2, Z → 3}, δ = {X → 1, Y → 2, Z → 4}, σ = {X → 1, Y → 2, Z → 5}, τ = {X → 1, Y → 5, Z → 3}. One possible merging sequence is (θ ⊔ δ) ⊔ σ, which prevents further merging with τ and yields the following set of multi-substitutions: {{X → {1}, Y → {2}, Z → {3, 4, 5}}, {X → {1}, Y → {5}, Z → {3}}}. Another possibility is first merging θ ⊔ τ and then δ ⊔ σ, which cannot be further merged and hence yield: {{X → {1}, Y → {2, 5}, Z → {3}}, {X → {1}, Y → {2}, Z → {4, 5}}}.
The presented algorithm does not currently specify any particular principle for making such a choice, but this issue is undoubtedly a very interesting one, and deserves a specific study (that would require a paper of its own) in order to understand whether the quality of the result is actually affected by the ordering and, if so, whether there are heuristics that can suggest in what order the multi-substitutions to be merged have to be taken in order to get an optimal result.
Actually, many such heuristics, often clashing with each other, can be developed according to the intended behavior of the system: for instance, some heuristics can be used to make the system faster in recognizing negative outcomes to the matching problem; others can be exploited to optimize the intermediate storage requirements when the matching exists. Preliminary tests on this issue revealed that sorting the constants in the multibinds, in addition to
making it easier to compute their union and intersection, also generally leads to better compression with respect to the random case. Nevertheless, further insight could suggest better strategies that ensure the best compression (overall or at least on average).
Definition 6 (Intersection of Multi-substitutions). The intersection of two multi-substitutions Σ = {X1 → S1, …, Xn → Sn, Y1 → Sn+1, …, Ym → Sn+m} and Θ = {X1 → T1, …, Xn → Tn, Z1 → Tn+1, …, Zl → Tn+l}, where n, m, l ≥ 0 and ∀j, k : Yj ≠ Zk, is the multi-substitution defined as: Σ ⊓ Θ = {Xi → Si ∩ Ti}i=1…n ∪ {Yj → Sn+j}j=1…m ∪ {Zk → Tn+k}k=1…l iff ∀i = 1 … n : Si ∩ Ti ≠ ∅; otherwise it is undefined.
Informally, the intersection of two multi-substitutions is a multi-substitution that inherits the multibinds concerning variables appearing in only one of the starting multi-substitutions, and associates to each variable occurring in both input multi-substitutions the intersection of the corresponding sets of constants (which is required not to be empty).
Example 6. The intersection of two multi-substitutions Σ = {X → {1, 3, 4}, Z → {2, 8, 9}} and Θ = {Y → {7}, Z → {1, 2, 9}} is: Σ ⊓ Θ = {X → {1, 3, 4}, Y → {7}, Z → {2, 9}}. The intersection of Σ = {X → {1, 3, 4}, Z → {8, 9}} and Θ = {Y → {7}, Z → {1, 2}} is undefined.
The above operator is able to check whether two multi-substitutions are compatible (i.e., whether they share at least one of the substitutions they represent). Indeed, given two multi-substitutions Σ and Θ, if Σ ⊓ Θ is undefined, then there must be at least one variable X, common to Σ and Θ, to which the corresponding multibinds associate disjoint sets of constants, which means that there is no constant that both Σ and Θ can associate to X, and hence a common substitution cannot exist either. The operator can be extended to the case of sets of multi-substitutions.
Specifically, given two sets of multi-substitutions S and T, their intersection is defined as the set of multi-substitutions obtained as follows: S ⊓ T = {Σ ⊓ Θ | Σ ∈ S, Θ ∈ T, Σ ⊓ Θ defined}. Note that, whereas a multi-substitution (and hence an intersection of multi-substitutions) is or is not defined, but cannot be empty, a set of multi-substitutions can be empty. Hence, an intersection of sets of multi-substitutions, in particular, can be empty (which happens when all of its composing intersections are undefined).
Proposition 1. Let C = {l1, …, ln} and ∀i = 1 … n : Ti = merge(uni(C, li, D)); let S1 = T1 and ∀i = 2 … n : Si = Si−1 ⊓ Ti. C θ-subsumes D iff Sn ≠ ∅.
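The intersection operator of Definition 6 and its extension to sets can be sketched in Python as below. As before, the dict-of-sets representation and the function names intersect and intersect_sets are assumptions made for illustration; undefined intersections are modeled by None and simply dropped at the set level.

```python
def intersect(sigma, theta):
    """Intersection of two multi-substitutions, or None where it is undefined."""
    result = {}
    for var in sigma.keys() | theta.keys():
        if var in sigma and var in theta:
            common = sigma[var] & theta[var]
            if not common:
                return None  # disjoint constants for a shared variable
            result[var] = common
        else:
            # Variables occurring in only one input are inherited unchanged.
            result[var] = sigma[var] if var in sigma else theta[var]
    return result


def intersect_sets(s, t):
    """All defined pairwise intersections between two lists of multi-substitutions."""
    out = []
    for sigma in s:
        for theta in t:
            r = intersect(sigma, theta)
            if r is not None:
                out.append(r)
    return out


print(intersect({"X": {1, 3, 4}, "Z": {2, 8, 9}}, {"Y": {7}, "Z": {1, 2, 9}}))
print(intersect({"X": {1, 3, 4}, "Z": {8, 9}}, {"Y": {7}, "Z": {1, 2}}))
```

The two calls correspond to the defined and the undefined case of Example 6.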
Algorithm 2 matching(C, D)
Require: C : c0 ← c1, c2, …, cn, D : d0 ← d1, d2, …, dm : clauses
  if ∃θ0 substitution such that c0θ0 = d0 then
    S0 := {θ0};
    for i := 1 to n do
      Si := Si−1 ⊓ merge(uni(C, ci, D))
    end for
  end if
  return (Sn ≠ ∅)
Proof. (⇐) Let us prove (by induction on i) the thesis in the following form: ∀i ∈ {1, …, n} : Si ≠ ∅ ⇒ {l1, …, li} ≤θ D (i.e., ∃θ s.t. {l1, …, li}θ ⊆ D).
[i = 1] ∅ ≠ S1 = T1 ⇒ ∀Θ ∈ T1, ∀θ ∈ split(Θ) : ∃k ∈ D s.t. l1θ = k ∈ D ⇒ {l1}θ = {k} ⊆ D ⇒ {l1} ≤θ D.
[(i − 1) ⇒ i] Si = Si−1 ⊓ Ti ≠ ∅ ⇒ (by definition) ∃Σ ∈ Si−1, Θ ∈ Ti s.t. Σ ⊓ Θ defined ⇒ ∃σ ∈ split(Σ), θ ∈ split(Θ) : σ, θ compatible ⇒ {l1, …, li−1}σ ⊆ D (by hypothesis) ∧ {li}θ ⊆ D (by definition of Ti) ⇒ {l1, …, li−1}σ ∪ {li}θ ⊆ D ⇒ {l1, …, li}σθ ⊆ D. This holds, in particular, for i = n, which yields the thesis.
(⇒) First of all, note that ∀i = 1, …, n : Ti ≠ ∅. Indeed, (ad absurdum) ∃i ∈ {1, …, n} s.t. Ti = ∅ ⇒ ∃li ∈ C s.t. merge(uni(C, li, D)) = ∅ ⇒ uni(C, li, D) = ∅ ⇒ ∄θ, ∄k ∈ D s.t. liθ = k ⇒ C ≰θ D (Absurd!). Suppose (ad absurdum) that Sn = ∅. Then ∃ī s.t. ∀i, j, 1 ≤ i < ī ≤ j ≤ n : Si ≠ ∅ ∧ Sj = ∅. But then Sī = Sī−1 ⊓ Tī = ∅, which implies that no substitution represented by Tī is compatible with any substitution represented by Sī−1. Hence, while clause {l1, …, lī−1} θ-subsumes D (∀θ ∈ split(Σ), ∀Σ ∈ Sī−1), the clause obtained by adding to it literal lī does not. Thus, C cannot θ-subsume D, which is absurd since it does so by hypothesis.
This leads to the θ-subsumption procedure reported in Algorithm 2. It is worth explicitly noting that, if the number of substitutions by which clause C subsumes clause D grows exponentially, and these items are such that no merging can take place at all, it follows that exponential space will be required to keep them all. In this case there is no representation gain (remember that we are anyway forced to compute all solutions), but one could wonder how often this happens in the real world (i.e., in cases not purposely designed). Moreover, as a side-effect, in this case the merge procedure will not enter the loop, thus saving considerable amounts of time.
Example 7.
Let us trace the algorithm on C = h :- p(X1, X2), r(X1, X2). and D = h :- p(a, b), p(c, d), r(a, d). The two heads match by the empty substitution, thus S0 is made up of just the empty substitution.
i = 1: uni(C, p(X1, X2), D) = {{X1/a, X2/b}, {X1/c, X2/d}} that, represented as multi-substitutions, becomes: {{X1 → {a}, X2 → {b}}, {X1 → {c}, X2 → {d}}}. Now, since the union of these two multi-substitutions is not defined, the while loop is not entered and the merge procedure returns the same set unchanged: merge(uni(C, p(X1, X2), D)) = {{X1 → {a}, X2 → {b}}, {X1 → {c}, X2 → {d}}}. By intersecting such a set with S0, we obtain S1 = {{X1 → {a}, X2 → {b}}, {X1 → {c}, X2 → {d}}}.
i = 2: uni(C, r(X1, X2), D) = {{X1/a, X2/d}} that, represented as multi-substitutions, becomes: {{X1 → {a}, X2 → {d}}}. It is just one multi-substitution, thus merge has no effect again: merge(uni(C, r(X1, X2), D)) = {{X1 → {a}, X2 → {d}}}. By trying to intersect such a set with S1, the algorithm fails because both intersections are undefined, since {b} ∩ {d} = ∅ for X2 in the former, and {a} ∩ {c} = ∅ for X1 in the latter. Hence, C does not θ-subsume D, since S2 = ∅.

Table 1. Merge statistics on random problems
          Substitutions   Multi-Substitutions   Compression
Average   176.0149015     43.92929293           0.480653777
St-Dev    452.196913      95.11695761           0.254031445
Min       1               1                     0.001600512
Max       3124            738                   1
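The trace of Example 7 can be reproduced with the following self-contained Python sketch of Algorithm 2. It is an illustration under stated assumptions, not the authors' Prolog or C implementation: clauses are (head, body) pairs of (predicate, arguments) literals, variables are uppercase strings, D is ground, merging always picks the first mergeable pair, and all function names are hypothetical.

```python
def is_variable(term):
    # Assumed convention: variables start with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def match_literal(l1, l2):
    (p1, a1), (p2, a2) = l1, l2
    if p1 != p2 or len(a1) != len(a2):
        return None
    mu = {}
    for s, t in zip(a1, a2):
        if is_variable(s):
            if mu.setdefault(s, t) != t:
                return None
        elif s != t:
            return None
    return mu

def uni(literal, body_d):
    # All matching substitutions from `literal` to some literal of D.
    return [mu for d in body_d if (mu := match_literal(literal, d)) is not None]

def union(t1, t2):
    if t1.keys() != t2.keys() or sum(t1[v] != t2[v] for v in t1) > 1:
        return None
    return {v: t1[v] | t2[v] for v in t1}

def merge(substitutions):
    pool = [{v: frozenset([c]) for v, c in s.items()} for s in substitutions]
    while True:
        pair = next(((i, j) for i in range(len(pool))
                     for j in range(i + 1, len(pool))
                     if union(pool[i], pool[j]) is not None), None)
        if pair is None:
            return pool
        merged = union(pool[pair[0]], pool[pair[1]])
        pool = [p for k, p in enumerate(pool) if k not in pair] + [merged]

def intersect(sigma, theta):
    result = {}
    for var in sigma.keys() | theta.keys():
        if var in sigma and var in theta:
            if not (sigma[var] & theta[var]):
                return None
            result[var] = sigma[var] & theta[var]
        else:
            result[var] = sigma[var] if var in sigma else theta[var]
    return result

def matching(c, d):
    """Multi-substitutions by which c theta-subsumes d (empty list: no subsumption)."""
    (head_c, body_c), (head_d, body_d) = c, d
    theta0 = match_literal(head_c, head_d)
    if theta0 is None:
        return []
    s = merge([theta0])                      # S0 := {theta0}
    for literal in body_c:                   # Si := Si-1 intersected with merge(uni(...))
        t = merge(uni(literal, body_d))
        s = [r for a in s for b in t if (r := intersect(a, b)) is not None]
    return s

C = (("h", ()), [("p", ("X1", "X2")), ("r", ("X1", "X2"))])
D = (("h", ()), [("p", ("a", "b")), ("p", ("c", "d")), ("r", ("a", "d"))])
print(matching(C, D))  # empty: C does not theta-subsume D
```

Replacing r(a, d) by r(c, d) in D makes the second intersection defined for the multi-substitution binding X1 to c and X2 to d, so that a single solution is returned.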
4
Experiments
Some preliminary artificial experiments were specifically designed to assess the compression power of multi-substitutions. First of all, 10000 instances of the following problem were generated and run: having fixed at random a number of variables n ∈ {2, …, 9} and a number of constants m ∈ {2, …, 5}, l of the m^n possible corresponding substitutions were chosen at random and merged. Table 1 reports various statistics about the outcome: in particular, the last column indicates the compression factor, calculated as the number of multi-substitutions over the number of substitutions. It is possible to note that, on average, the multi-substitution structure was able to compress away 52% of the input substitutions, with a maximum of 99.84%. As expected, there were cases in which no merging took place. Two more experiments were designed, in order to better understand if (and how) the compression depends on the number of variables and constants. The reported results correspond to the average value reached by 33 repetitions of each generated problem. First we fixed the number of variables to 2, letting the number of constants m range from 2 to 10. For each problem (2, m) and for each i ∈ {1, …, m^2}, i substitutions were randomly generated and merged. Figure 1 reports the average number of multi-substitutions and the corresponding compression factor in these cases. Conversely, Figure 2 reports the average and the compression factor for a similar artificial problem in which the number of constants was set to 2, and the number of variables n was varied from 2 to 10.
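A single instance of the kind of compression experiment described above could be simulated along the following lines. This is only a sketch under assumptions: it is not the harness used by the authors, the merging order is the simple "first mergeable pair" policy, and compression_factor is a hypothetical helper name. Enumerating all m^n substitutions explicitly is only practical for small n and m.

```python
import random
from itertools import product


def union(t1, t2):
    if t1.keys() != t2.keys() or sum(t1[v] != t2[v] for v in t1) > 1:
        return None
    return {v: t1[v] | t2[v] for v in t1}


def merge(substitutions):
    pool = [{v: frozenset([c]) for v, c in s.items()} for s in substitutions]
    while True:
        pair = next(((i, j) for i in range(len(pool))
                     for j in range(i + 1, len(pool))
                     if union(pool[i], pool[j]) is not None), None)
        if pair is None:
            return pool
        merged = union(pool[pair[0]], pool[pair[1]])
        pool = [p for k, p in enumerate(pool) if k not in pair] + [merged]


def compression_factor(n_vars, n_consts, n_subs, rng):
    """Number of multi-substitutions over number of substitutions for one random instance."""
    variables = [f"X{i}" for i in range(1, n_vars + 1)]
    # All m^n substitutions, of which n_subs distinct ones are drawn at random.
    all_subs = list(product(range(n_consts), repeat=n_vars))
    chosen = rng.sample(all_subs, n_subs)
    subs = [dict(zip(variables, values)) for values in chosen]
    return len(merge(subs)) / n_subs


rng = random.Random(0)
print(compression_factor(3, 3, 12, rng))
```

A factor of 1 means that no merging took place; lower values indicate stronger compression.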
Fig. 1. Average number of multi-substitutions and compression factor for an artificial problem with 2 variables and m ∈ {2, . . . , 10} constants
Fig. 2. Average number of multi-substitutions and compression factor for an artificial problem with 2 constants and n ∈ {2, …, 10} variables
The x-axis of the plots corresponds to the number of substitutions generated, the y-axis to the computed statistic. The curves in the two cases are similar, suggesting a stable compression behavior of the merge (i.e., one depending only on the percentage of the whole set of possible substitutions that is taken into account). Of course, growing parameters (represented by successive curves from left to right) lead to progressively higher values. Then, a prototype of the proposed algorithm was implemented in Prolog, and integrated with INTHELEX (an incremental system for first-order logic learning from examples; refer to [3] for a more detailed description) in order to make it able to handle the problematic situation that occurred in the document dataset. Note that, since INTHELEX relies on the Object Identity (OI) assumption [14], the Prolog implementation slightly modifies the algorithm in order to embed such a bias (the same effect is obtained in SLD-resolution by explicitly adding to each clause the inequality literals expressing OI). To check whether any performance gain was obtained, INTHELEX was run on the document interpretation learning problem first exploiting the classical SLD-resolution procedure provided by Prolog (with proper inequality literals encoding the OI assumption), and then using the new algorithm (modified to embed OI). Note that our primary interest was
just obtaining acceptable and practically manageable runtimes for the learning task, and not checking the compression performance of the multi-substitution representation.
Fig. 3. Sample COLLATE documents: registration card from NFA (top-left), censorship decision from DIF (top-right), censorship card from FAA (bottom-left) and newspaper articles (bottom-right)
The original learning task concerned the EC project COLLATE (Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material), aimed at providing support for archives, researchers and end-users working with digitized historic/cultural material. The chosen sample domain concerns a large corpus of multi-format documents on rare historic film censorship from the 1920s and 1930s, but also includes newspaper articles, photos, stills, posters and film fragments, provided by three major European film archives. Specifically, we considered 4 different kinds of documents: censorship decisions from Deutsches Filminstitut (DIF, in Frankfurt), censorship cards from Film Archive Austria (FAA, in Vienna), registration cards from Národní Filmový
Archiv (NFA, in Prague) and newspaper articles (from all archives), for each of which the aim was inducing rules for recognizing the role of significant layout components such as film title, length, producer, etc. Figure 3 reports some samples of this material. Specifically, for carrying out the comparison, we focused on one specific problem: learning the semantic label object title for the document class dif censorship decision. Such a choice originates from the higher complexity of that label in that kind of document with respect to the others, due to its being placed in the middle of the page, thus having interrelations with more layout blocks than the other components. The resulting experimental dataset, obtained from 36 documents, contained a total of 299 layout blocks, 36 of which were positive instances of the target concept, while all the others were considered as negative examples. The length of the example descriptions ranges between 54 and 263 literals (215 on average), containing information on the size, content type, absolute and relative position of the blocks in a given document. The dataset was evaluated by means of a 10-fold cross validation. It should be noted that the learned theory must be checked for completeness and consistency for each new incoming example, and also on the whole set of processed examples each time a candidate refinement is computed, thus the matching procedure is heavily stressed during the learning and test tasks. The number of variables in the clauses that make up the theory to be taken into account by the matching procedure, which is the critical factor for the exponential growth of the number of substitutions in the experiments, ranged between 50 (for some non-refined clauses) and 4 (for some very refined ones). The results of the comparison reveal a remarkable reduction of computational times when using the proposed algorithm².
² Note that the difference is completely due to the matching algorithms, since the rest of the procedures in the learning system is the same.
Indeed, the average computational cost for learning in the 10 runs using SLD resolution (5297 sec) is much higher than using the new matching procedure (1942 sec), with a difference of 3355 sec on average. Going into more detail, 80% of the runs are in favor of the matching procedure, in which cases the difference grows to 4318 sec, while the average difference in the runs in favor of SLD resolution is limited to only 225 sec. It is noteworthy that in one case SLD resolution took 31572 sec to reach one solution, against 2951 sec for the matching procedure to return all of them. Another interesting remark is that the presented experiment was carried out by SLD resolution in reasonable, though very high, times; as already pointed out, we also faced cases (not taken into account here because we wanted to compute the average times) in which 20 days were still insufficient for it to get a solution. The good results obtained under the OI assumption led us to attempt a comparison in the general case with other state-of-the-art systems. Since, as already pointed out, no algorithm is available in the literature to compute in one step the whole set of substitutions, the choice was between not making a comparison at all, or comparing the new algorithm to Django (the best-performing system in the literature so far for testing θ-subsumption, refer to [10] for a more
detailed explanation). In the second case, it is clear that the challenge was not completely fair for our algorithm, since it always computes the whole set of solutions, whereas Django computes none (it just answers ‘yes’ or ‘no’). Nevertheless, the second option was preferred, according to the principle that a comparison with a faster system could in any case provide useful information on the performance of the new algorithm, if its handicap is properly taken into account. The need for downward-compatibility in the system output forced us to translate the new algorithm’s results into the more generic answers of Django, and hence to interpret them just as ‘yes’ (independently of how many substitutions were computed, which is very unfair for our algorithm) or ‘no’ (if no subsuming substitution exists). Hence, in evaluating the experimental results, one should take into account such a difference, so that a slightly worse performance of the proposed algorithm with respect to Django should be considered an acceptable tradeoff for getting all the solutions whenever they are required by the experimental settings. Of course, the targets of the two algorithms are different, and it is clear that in case a binary answer is sufficient the latter should be used. Specifically, a C language implementation of the new algorithm (this time in the general case, not restricted by the OI bias)³ was made for carrying out the comparison on the same two tasks exploited for evaluating Django by its authors. The former concerned Phase Transition [5], a particularly hard artificial problem purposely designed to study the complexity of matching First Order Logic formulas in a given universe in order to find their models, if any. Thus, this dataset alone should be sufficient to assess the algorithm performance. Nevertheless, it seemed interesting to evaluate it on real-world problems, whose complexity is expected not to reach the previous one.
³ This implementation is publicly available on the Internet at the URL: http://lacam.di.uniba.it:8000/systems/matching/.
The Mutagenesis problem [16] is a good testbed for this purpose. No other experiment was run, because no other dataset is available in the literature having a complexity that allows one to appreciate the power and performance of the two algorithms, thus justifying their exploitation. All the experiments were run on a PC platform equipped with an Intel Celeron 1.3 GHz processor and running the Linux operating system. For both datasets referred to above, the general-case procedure (i.e., without the OI bias) was used. The number of variables in clause C, denoted by n, is chosen according to the directions of the authors that previously exploited them, for the sake of comparison; however, it is worth noting that the number of constants (which constitutes the base of the exponential formula) in the Phase Transition dataset is quite high (up to 50), which contributes to significantly raising the number of possible substitutions (50^10 in some experiments). In the Phase Transition setting, each clause φ is generated from n variables (in a set X) and m binary predicates (in a set P), by first constructing its skeleton ϕs = α1(x1, x2) ∧ … ∧ αn−1(xn−1, xn) (obtained by chaining the n variables through (n − 1) predicates), and then adding to ϕs the remaining (m − n + 1) predicates, whose arguments are randomly, uniformly, and without replacement selected from X. Given Λ, a set of L constants, an example, against
which the subsumption of the generated clause is checked, is built using N literals for each predicate symbol in P, whose arguments are selected uniformly and without replacement from the universe U = Λ × Λ. In such a setting, a matching problem is defined by a 4-tuple (n, m, L, N). As in [10], n was set to 10, m ranges in [10, 60] (actually, a wider range than in [10]) and L ranges in [10, 50]. To limit the total computational cost, N was set to 64 instead of 100: this does not affect the presence of the phase transition phenomenon, but just causes the number of possible substitutions to be smaller, and hence the height of the peaks in the plot to be lower.
Fig. 4. Performance of the proposed algorithm (left-hand side) and of Django (right-hand side) on the Phase Transition problem (logarithm of times expressed in sec)
For each pair (m, L), 33 pairs (hypothesis, example) were constructed, and for each the computational cost was computed. Figure 4 reports the plots of the average θ-subsumption cost over all 33 trials for each single matching problem, measured as the logarithm of the seconds required by our algorithm and by Django on the phase transition dataset. The two plots are similar in that both show their peaks at low values of L and/or m, but the slope is smoother for Django than for the proposed algorithm, whose complexity peaks are more concentrated and rise abruptly. Of course, there is an orders-of-magnitude difference between the two performances (Django’s highest peak is 0.037 sec, whereas our algorithm’s top peak is 155.548 sec), but one has to take into account that the proposed algorithm also returns the whole set of substitutions by which θ-subsumption holds (if any, which means that a ‘yes’ outcome may in fact hide a huge computational effort when the solutions are very dense), and it almost always does this in reasonable time (only 5.93% of computations took more than 1 sec, and only 1.29% took more than 15 sec). Such a consideration can be supported by an estimation of the expected number of solutions (i.e., substitutions) for each matching problem considered in the experiment.
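The instance generation scheme for a matching problem (n, m, L, N) described above might be sketched in Python as follows. This is one possible reading of the description rather than the generator actually used in [5] or [10]: in particular, it is assumed that each non-skeleton literal receives two distinct variables, and the names generate_clause and generate_example, as well as the predicate, variable and constant names, are hypothetical.

```python
import random
from itertools import product


def generate_clause(n, m, rng):
    """Body of a clause over n variables and m binary predicates a1..am (requires m >= n - 1)."""
    variables = [f"X{i}" for i in range(1, n + 1)]
    # Skeleton: chain the n variables through the first n - 1 predicates.
    body = [(f"a{i}", (variables[i - 1], variables[i])) for i in range(1, n)]
    # Remaining m - n + 1 predicates: arguments drawn without replacement from X
    # (assumption: two distinct variables per literal).
    for i in range(n, m + 1):
        body.append((f"a{i}", tuple(rng.sample(variables, 2))))
    return body


def generate_example(m, L, N, rng):
    """Ground example with N literals per predicate over L constants (requires N <= L * L)."""
    constants = [f"c{i}" for i in range(L)]
    universe = list(product(constants, repeat=2))  # U = Lambda x Lambda
    body = []
    for i in range(1, m + 1):
        # N distinct argument pairs for predicate a_i, drawn without replacement.
        body.extend((f"a{i}", pair) for pair in rng.sample(universe, N))
    return body


rng = random.Random(0)
clause = generate_clause(10, 15, rng)
example = generate_example(15, 10, 64, rng)
print(len(clause), len(example))
```

With n = 10, m = 15, L = 10 and N = 64 this yields a clause with 15 literals and an example with 15 × 64 literals.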
According to the procedure for generating problem instances, the first literal in the clause can match any literal in the example, the following (n − 2) ones have one variable partially constrained by the previous ones, and the remaining
(m − n + 1) are completely fixed, because they contain only variables that already appeared in the first part of the formula. Then, the number of solutions for the skeleton ϕs is proportional to:
\[ S = N \cdot \underbrace{\frac{N}{L} \cdots \frac{N}{L}}_{n-2} = \frac{N^{n-1}}{L^{n-2}} \]
The remaining literals decrease this number to:
\[ S \cdot \underbrace{\frac{N}{L^{2}} \cdots \frac{N}{L^{2}}}_{m-n+1} = \frac{N^{n-1}}{L^{n-2}} \left( \frac{N}{L^{2}} \right)^{m-n+1} = \frac{N^{m}}{L^{2m-n}}. \]
Fig. 5. Logarithm of the expected number of solutions
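The estimate N^m / L^(2m−n) can be evaluated numerically with a short Python sketch such as the one below, which also recomputes it step by step from the skeleton term as a consistency check. The function names are hypothetical, and the parameter values in the usage lines are merely picked from the ranges quoted in the text.

```python
import math
from fractions import Fraction


def expected_solutions(n, m, L, N):
    """Exact value of the estimate, built up as in the derivation."""
    skeleton = Fraction(N ** (n - 1), L ** (n - 2))   # S = N^(n-1) / L^(n-2)
    return skeleton * Fraction(N, L * L) ** (m - n + 1)


def log10_expected_solutions(n, m, L, N):
    """log10 of N^m / L^(2m - n), computed without building huge integers."""
    return m * math.log10(N) - (2 * m - n) * math.log10(L)


# Closed form and step-by-step product agree.
assert expected_solutions(10, 20, 10, 64) == Fraction(64 ** 20, 10 ** 30)

for L in (10, 30, 50):
    print(L, round(log10_expected_solutions(10, 20, L, 64), 2))
```

For fixed n, m and N the logarithm decreases as L grows, in line with the estimate being largest for low values of L.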
This estimate, obtained in an informal way, agrees with the theoretical one (expected number of solutions of a CSP) according to [15]⁴. It is interesting to note that the shape of the plot of the logarithm of such a function, reported in Figure 5, resembles that of the proposed algorithm in Figure 4, in that both tend to have a peak for low values of L and m. This suggests, as expected, a proportionality between the computational times of the proposed algorithm and the number of substitutions to be computed. On the other hand, there is no such correspondence between this plot and Django’s performance. Moreover, Django shows a tendency to increase computational times when keeping L low and progressively increasing m, just where our algorithm seems, on the contrary, to show a decrease (coherent with the number of expected solutions).
⁴ Remember that the matching problem can be mapped onto a CSP.
In the Mutagenesis dataset, artificial hypotheses were generated according to the procedure reported in [10]. For given m and n, such a procedure returns a hypothesis made up of m literals bond(Xi, Xj) and involving n variables, where the variables Xi and Xj in each literal are randomly selected among n variables {X1, …, Xn} in such a way that Xi ≠ Xj and the overall hypothesis is linked [7]. The cases in which n > m + 1 were not considered, since it is not
possible to build a clause with m binary literals that contains more than m + 1 variables and that fulfills the imposed linkedness constraint. Specifically, for each (m, n) pair (1 ≤ m ≤ 10, 2 ≤ n ≤ 10), 10 artificial hypotheses were generated and each was checked against all 229 examples provided in the Mutagenesis dataset. Then, the mean performance of each hypothesis on the 229 examples was computed, and finally the computational cost for each (m, n) pair was obtained as the average θ-subsumption cost over all the times of the corresponding 10 hypotheses.
Fig. 6. Performance of the proposed algorithm (left-hand side) and of Django (right-hand side) on the Mutagenesis problem (sec)
Figure 6 reports the performance obtained by our algorithm and by Django on the θ-subsumption tests for the Mutagenesis dataset. Timings are measured in seconds. In the second experiment too, the shape of Django’s performance plot is smoother, while that of the proposed algorithm shows sharper peaks in a generally flat landscape. Again, the proposed algorithm, after an initial increase, suggests a decrease in computational times for increasing values of n (when m is high). It is noticeable that Django shows an increasingly worse performance on the diagonal⁵, while there is no such phenomenon in the left plot of Figure 6. In this case, however, there is no appreciable difference in computational times, since both systems stay far below the 1 sec threshold, which suggests that real-world tasks normally do not present the difficult situations purposely created for the phase transition dataset.
⁵ Such a region corresponds to hypotheses with i literals and i + 1 variables. Such hypotheses are particularly challenging for the θ-subsumption test since their literals form a chain of variables (because of linkedness).
Table 2. Mean time on the Mutagenesis problem for the three algorithms (sec)
SLD        Matching     Django
158.2358   0.01880281   0.00049569
Table 2 reports the mean time needed by the three considered algorithms to get the answer on the Mutagenesis problem (backtracking was forced in SLD in order to obtain all the solutions). The Matching algorithm turned out to be
8415.5 times more efficient than the SLD procedure (such a comparison makes no sense for Django because it just answers ‘yes’ or ‘no’). To give an idea of the effort spent, the mean number of substitutions was 91.21 (obviously, averaged only over positive tests, which are 8.95% of all cases).
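As a quick check, the speed-up factor quoted above follows directly from the mean times reported in Table 2. The variable names in this small Python sketch are chosen only for illustration.

```python
# Mean times (sec) on the Mutagenesis problem, as reported in Table 2.
sld_time = 158.2358
matching_time = 0.01880281

# Ratio between the SLD procedure and the matching algorithm.
speedup = sld_time / matching_time
print(f"{speedup:.1f}")  # prints 8415.5
```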
5
Conclusions and Future Work
This paper proposed a new algorithm for computing the whole set of solutions to the matching problem under θ-subsumption, whose efficiency derives from a proper representation of substitutions that makes it possible to avoid backtracking (which may cause, in particular situations, unacceptable growth of computational times in classical matching mechanisms). Experimental results show that the new procedure significantly improves the performance of the learning system INTHELEX with respect to simple SLD resolution. Also on two different datasets, a real-world one and an artificial one purposely designed to generate hard problems, it was able to carry out its task with high efficiency. Actually, it is not directly comparable to other state-of-the-art systems, since its characteristic of yielding all the possible substitutions by which θ-subsumption holds has no competitors. Nevertheless, a comparison seemed useful to get an idea of the cost in time performance of getting such a plus. The good news is that, even on hard problems, and notwithstanding its harder computational effort, the new algorithm turned out to be in most cases comparable, and in any case at least acceptable, with respect to the best-performing system in the literature. A Prolog version of the algorithm is currently used in INTHELEX, a system for inductive learning from examples. Future work will concern an analysis of the complexity of the presented algorithm, and the definition of heuristics that can further improve its efficiency (e.g., by guiding the choice of the best literal to take into account at any step in order to recognize as soon as possible the impossibility of finding a match).
Acknowledgements
This work was partially funded by the EU project IST-1999-20882 COLLATE “Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material” (URL: http://www.collate.de).
The authors would like to thank Michele Sebag and Jerome Maloberti for making available their system Django, and for the kind suggestions on its use and features. The authors are also grateful to the reviewers for their useful comments and suggestions on how to improve and clarify the ideas presented in this paper.
References
1. N. Eisinger. Subsumption and connection graphs. In J. H. Siekmann, editor, GWAI-81, German Workshop on Artificial Intelligence, Bad Honnef, January 1981, pages 188–198. Springer, Berlin, Heidelberg, 1981.
2. F. Esposito, N. Fanizzi, D. Malerba, and G. Semeraro. Downward refinement of hierarchical datalog theories. In M.I. Sessa and M. Alpuente Frasnedo, editors, Proceedings of the Joint Conference on Declarative Programming - GULP-PRODE’95, pages 148–159. Università degli Studi di Salerno, 1995.
3. F. Esposito, G. Semeraro, N. Fanizzi, and S. Ferilli. Multistrategy Theory Revision: Induction and abduction in INTHELEX. Machine Learning Journal, 38(1/2):133–156, 2000.
4. M.R. Garey and D.S. Johnson. Computers and Intractability. Freeman, San Francisco, 1979.
5. A. Giordana, M. Botta, and L. Saitta. An experimental study of phase transitions in matching. In Dean Thomas, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99-Vol2), pages 1198–1203, S.F., July 31–August 6 1999. Morgan Kaufmann Publishers.
6. G. Gottlob and A. Leitsch. On the efficiency of subsumption algorithms. Journal of the Association for Computing Machinery, 32(2):280–295, 1985.
7. N. Helft. Inductive generalization: A logical framework. In I. Bratko and N. Lavrač, editors, Progress in Machine Learning, pages 149–157, Wilmslow, UK, 1987. Sigma Press.
8. J.-U. Kietz and M. Lübbe. An efficient subsumption algorithm for inductive logic programming. In W. Cohen and H. Hirsh, editors, Proc. Eleventh International Conference on Machine Learning (ML-94), pages 130–138, 1994.
9. J.W. Lloyd. Foundations of Logic Programming. Springer, Berlin, New York, 2nd edition, 1987.
10. J. Maloberti and M. Sebag. θ-subsumption in a constraint satisfaction perspective. In Céline Rouveirol and Michèle Sebag, editors, Inductive Logic Programming, 11th International Conference, ILP 2001, Strasbourg, France, volume 2157, pages 164–178. Springer, September 2001.
11. S.H. Muggleton and L. De Raedt. Inductive logic programming. Journal of Logic Programming: Theory and Methods, 19:629–679, 1994.
12. S.-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming, volume 1228 of Lecture Notes in Artificial Intelligence. Springer, 1997.
13. T. Scheffer, R. Herbrich, and F. Wysotzki. Efficient θ-subsumption based on graph algorithms. In Stephen Muggleton, editor, Proceedings of the 6th International Workshop on Inductive Logic Programming (ILP-96), volume 1314 of LNAI, pages 212–228, Berlin, August 26–28 1997. Springer.
14. G. Semeraro, F. Esposito, D. Malerba, N. Fanizzi, and S. Ferilli. A logic framework for the incremental inductive synthesis of datalog theories. In N. E. Fuchs, editor, Logic Program Synthesis and Transformation, number 1463 in Lecture Notes in Computer Science, pages 300–321. Springer-Verlag, 1998.
15. Barbara M. Smith and Martin E. Dyer. Locating the phase transition in binary constraint satisfaction. Artificial Intelligence, 81(1–2):155–181, 1996.
16. Ashwin Srinivasan, Stephen Muggleton, Michael J.E. Sternberg, and Ross D. King. Theories for mutagenicity: A study in first-order and feature-based induction. Artificial Intelligence, 85(1–2):277–299, 1996.
17. P. Wadler. How to replace failure by a list of successes. In J.-P. Jouannaud, editor, Functional Programming Languages and Computer Architecture, volume 201 of Lecture Notes in Computer Science, pages 113–128. Springer-Verlag, 1985.
Efficient Data Structures for Inductive Logic Programming

Nuno Fonseca¹, Ricardo Rocha¹, Rui Camacho², and Fernando Silva¹

¹ DCC-FC & LIACC, Universidade do Porto, R. do Campo Alegre 823, 4150-180 Porto, Portugal, {nf,fds,ricroc}@ncc.up.pt
² Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal, [email protected]
Abstract. This work aims at improving the scalability of memory usage in Inductive Logic Programming systems. In this context, we propose two efficient data structures: the Trie, used to represent lists and clauses; and the RL-Tree, a novel data structure used to represent clause coverage. We evaluate their performance in the April system using well-known datasets. Initial results show a substantial reduction in memory usage without incurring extra execution time overheads. Our proposal is applicable to any ILP system.
1 Introduction
Inductive Logic Programming (ILP) [1,2] is an established and healthy subfield of Machine Learning. ILP has been successfully applied to problems in several application domains [3]. Nevertheless, it is recognized that limited efficiency and scalability are major obstacles to the increased use of ILP systems in complex applications with large hypothesis spaces.

Research on improving the efficiency of ILP systems has focused on reducing their sequential execution time, either by reducing the number of hypotheses generated (see, e.g., [4,5]) or by efficiently testing candidate hypotheses (see, e.g., [6,7,8,9]). Another line of research, recommended by Page [10] and pursued by several researchers [11,12,13,14,15], is the parallelization of ILP systems.

Another important issue is memory usage, which results from very large and complex search spaces. In this work, we develop techniques to considerably reduce the memory requirements of ILP systems without incurring further execution time overheads. We propose and empirically evaluate two data structures that may be applied to any ILP system.

During execution, an ILP system generates many candidate hypotheses that have many similarities among them. Usually, these similarities correspond to common prefixes among the hypotheses. Blockeel et al. [6] defined a new query-pack technique to exploit this pattern and improve the execution time of ILP systems. We propose the use of the Trie data structure (also known as prefix-trees), which inherently and efficiently exploits the similarities among the hypotheses to reduce memory usage. We also noted that systems like Aleph [16], Indlog [9], and April [17] use a considerable quantity of memory to represent clauses' coverage lists¹, i.e., lists of examples covered by a hypothesis. To deal with this issue, we propose a novel data structure, called RL-Tree, specially designed to efficiently store and manipulate coverage lists.

An interesting observation is that the proposed data structures address the efficient representation of different types of data. Therefore, these data structures can be used in conjunction to maximize the gains in reducing memory usage by ILP systems.

The remainder of the paper is organized as follows. Sections 2 and 3 introduce the Trie and RL-Tree data structures and describe their implementation. In Section 4 we present an empirical evaluation of the impact of the proposed data structures on memory usage and execution time. Finally, in Section 5, we draw some conclusions and propose further work.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 130–145, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Tries
Tries were first proposed by Fredkin [18], the name coming from the central letters of the word retrieval. Tries were originally invented to index dictionaries, and have since been generalized to index terms (see [19] for the use of tries in tabled logic programs and [20,21,22,23] for automated theorem proving and term rewriting). The basic idea behind the trie data structure is to partition a set T of terms based upon their structure so that looking up and inserting these terms can be done efficiently. The trie data structure provides complete discrimination for terms and permits lookup and possibly insertion to be performed in a single pass through a term.

2.1 Applicability
An essential property of the trie structure is that common prefixes are represented only once. The efficiency and memory consumption of a particular trie data structure largely depend on the percentage of terms in T that have common prefixes. For ILP systems, this is an interesting property that we can take advantage of. In ILP, the hypothesis space is structured as a lattice, and hypotheses close to one another in the lattice have a lot of common structure. More specifically, hypotheses in the search space have common prefixes (literals). Not only are the hypotheses similar, but the information associated with them is also similar (e.g., the list of variables in a hypothesis is similar to the lists of variables of nearby hypotheses). This clearly matches the common prefix property of tries. We thus argue that tries form a promising alternative for storing hypotheses and some associated information.

¹ Camacho has observed that Indlog uses around 40% of total memory consumption to represent coverage lists.
2.2 Description
A trie is a tree structure where each different path through the trie data units, the trie nodes, corresponds to a term in T . At the entry point we have the root node. Internal nodes represent symbols in terms and leaf nodes specify the end of terms. Each root-to-leaf path represents a term described by the symbols labeling the nodes traversed. Two terms with common prefixes will branch off from each other at the first distinguishing symbol. When inserting a new term, we start traversing the trie from the root node. Each child node specifies a symbol to be inspected in the input term when reaching that position. A transition is taken if the symbol in the input term at a given position matches a symbol on a child node. Otherwise, a new child node representing the current symbol is added and an outgoing transition from the current node is made to point to the new child node. On reaching the last symbol in the input term, we reach a leaf node in the trie. Figure 1 presents an example for a trie with three terms.
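The insertion procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C implementation: the `TrieNode` class, the `insert` function, and the dictionary used for child lookup are hypothetical, and each term is assumed to be already flattened into its left-to-right symbol sequence.

```python
class TrieNode:
    """Hypothetical trie node: one symbol plus links to parent and children."""
    def __init__(self, symbol, parent=None):
        self.symbol = symbol
        self.parent = parent
        self.children = {}      # symbol -> TrieNode (assumed lookup structure)
        self.is_leaf = False    # marks the end of a stored term

def insert(root, symbols):
    """Insert a flattened term; return its leaf and the number of new nodes."""
    node = root
    created = 0
    for sym in symbols:
        child = node.children.get(sym)
        if child is None:                   # no matching transition: add one
            child = TrieNode(sym, node)
            node.children[sym] = child
            created += 1
        node = child                        # take the transition
    node.is_leaf = True
    return node, created

root = TrieNode(None)
_, n1 = insert(root, ["f/2", "VAR0", "a"])          # f(X,a)
_, n2 = insert(root, ["g/3", "VAR0", "b", "VAR1"])  # g(X,b,Y)
_, n3 = insert(root, ["f/2", "VAR0", "1"])          # f(Y,1)
```

The third insertion allocates a single node, because the prefix `f/2`, `VAR0` is already present in the trie.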
[Figure: the set of terms f(X,a), g(X,b,Y), f(Y,1) and the corresponding trie structure. The root node has children f/2 and g/3; f/2 leads to VAR0, which branches to a and 1; g/3 leads to VAR0, then b, then VAR1.]

Fig. 1. Using tries to represent terms

An important point when using tries to represent terms is the treatment of variables. We follow the formalism proposed by Bachmair et al. [20], where each variable in a term is represented as a distinct constant. Formally, this corresponds to a function, numbervar, from the set of variables in a term t to the sequence of constants $\langle \mathrm{VAR}_0, \mathrm{VAR}_1, \ldots, \mathrm{VAR}_N \rangle$, such that $\mathit{numbervar}_t(X) < \mathit{numbervar}_t(Y)$ if X is encountered before Y in the left-to-right traversal of t. For example, in the term g(X, b, Y), numbervar(X) and numbervar(Y) are respectively VAR0 and VAR1. On the other hand, in the terms f(X, a) and f(Y, 1), numbervar(X) and numbervar(Y) are both VAR0. This is why the child node VAR0 of f/2 in Fig. 1 is common to both terms.
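The variable numbering can be illustrated with a short Python sketch. The term encoding (tuples for compound terms, a `Var` class for variables) and the `flatten` function are assumptions made for this example only; a dictionary is used to remember variables already seen, which is simpler than, and different from, a binding-based scheme inside a Prolog engine.

```python
class Var:
    """Hypothetical representation of a logic variable."""
    def __init__(self, name):
        self.name = name

def flatten(term):
    """Return the left-to-right symbol sequence of a term, numbering variables.

    Compound terms are tuples (functor, arg1, ..., argN); anything else that
    is not a Var is treated as a constant.
    """
    numbering = {}          # variable name -> index of first occurrence
    out = []

    def walk(t):
        if isinstance(t, Var):
            if t.name not in numbering:
                numbering[t.name] = len(numbering)
            out.append("VAR%d" % numbering[t.name])
        elif isinstance(t, tuple):
            functor, args = t[0], t[1:]
            out.append("%s/%d" % (functor, len(args)))
            for a in args:
                walk(a)
        else:
            out.append(str(t))

    walk(term)
    return out

t1 = flatten(("f", Var("X"), "a"))
t2 = flatten(("g", Var("X"), "b", Var("Y")))
t3 = flatten(("f", Var("Y"), 1))
```

Under this numbering, f(X,a) and f(Y,1) both start with the symbols `f/2`, `VAR0`, which is what allows them to share a prefix.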
2.3 Implementation
The trie data structure was implemented in C as a shared library. Since the ILP system we used for testing is implemented in Prolog, we developed an interface to tries as an external Prolog module. Tries are implemented by representing each trie node by a data structure with four fields. The first field stores the symbol for the node. The second and third fields store pointers to the first child node and to the parent node, respectively. The fourth field stores a pointer to the sibling node, in such a way that the outgoing transitions from a node are traced using its first child pointer and by following the list of sibling pointers of this child. Figure 2 illustrates the resulting implementation for the trie presented in Fig. 1.
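As a rough illustration of the four-field node layout, the following Python sketch mirrors the symbol, first-child, parent, and sibling fields. The names `Node`, `find_or_add_child`, and `children` are hypothetical, and linking a new child at the head of the sibling chain is an assumption; the text does not say where new siblings are placed.

```python
class Node:
    """Trie node with the four fields described in the text."""
    __slots__ = ("symbol", "child", "parent", "sibling")

    def __init__(self, symbol, parent=None):
        self.symbol = symbol    # field 1: symbol stored in the node
        self.child = None       # field 2: pointer to the first child
        self.parent = parent    # field 3: pointer to the parent
        self.sibling = None     # field 4: pointer to the next sibling

def find_or_add_child(node, symbol):
    """Follow the first-child pointer, then the sibling chain, sequentially."""
    cur = node.child
    while cur is not None:
        if cur.symbol == symbol:
            return cur
        cur = cur.sibling
    new = Node(symbol, node)
    new.sibling = node.child    # assumption: new nodes go to the chain head
    node.child = new
    return new

def children(node):
    """List the symbols of the outgoing transitions of a node."""
    out, cur = [], node.child
    while cur is not None:
        out.append(cur.symbol)
        cur = cur.sibling
    return out

root = Node(None)
f = find_or_add_child(root, "f/2")
g = find_or_add_child(root, "g/3")
f_again = find_or_add_child(root, "f/2")
```

With this layout every node has a fixed size regardless of how many children it has, at the cost of a sequential scan over siblings.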
[Figure: the trie of Fig. 1 (root node, f/2, g/3, VAR0, VAR0, a, 1, b, VAR1) as implemented with trie nodes.]

Fig. 2. The implementation of the trie in Fig. 1

At the entry point we have the root node. A root node is allocated when we call an open_trie(-R) predicate². This predicate initializes a new trie structure and returns in R a reference to the root node of the new trie. As it is possible to have more than one trie structure simultaneously, when storing new terms we use R to specify the trie where the new terms should be inserted. New terms are stored using a put_trie_entry(+R, +T, -L) predicate. R is the root node of the trie to be used and T is the term to be inserted. The predicate returns in L the reference to the leaf node of the inserted term. For example, to obtain the structure in Fig. 2, the following code can be used:

    open_trie(R).
    put_trie_entry(R,f(X,a),L1).
    put_trie_entry(R,g(X,b,Y),L2).
    put_trie_entry(R,f(Y,1),L3).

² We will use the - symbol to denote output arguments and the + symbol to denote input arguments.
Inserting a term requires, in the worst case, allocating as many nodes as necessary to represent its complete path. On the other hand, inserting a repeated term only requires traversing the trie structure until the corresponding leaf node is reached, without allocating any new node. Searching through a chain of sibling nodes that represent alternative paths is done sequentially. When the chain becomes larger than a threshold value (8 in our implementation), we dynamically index the nodes through a hash table to provide direct node access and therefore optimize the search. Further hash collisions are reduced by dynamically expanding the hash tables.

Recall that variables are standardized using the numbervar function. This standardization is performed while a term is being inserted in a trie. First occurrences of variables are replaced by binding the dereferenced variable cell to the constant returned by numbervar. Using this single binding, repeated occurrences of the same variable are automatically handled, without the need to check whether the variable has been previously encountered. The bindings are undone as soon as the insertion of the term is complete. In this manner, standardization is performed in a single pass through the input term.

To load a term from a trie, we have defined a get_trie_entry(+L, -T) predicate. This predicate returns in T the term whose leaf node is referred to by L. For example, to obtain in T the term referred to by L1 in the previous code, we should call get_trie_entry(L1, T). Starting from the leaf node L1 and following the parent pointers, the symbols in the path from the leaf to the root node are pushed onto Prolog's term stack and the term is constructed. On reaching the root node, T is unified with the constructed term (f(VAR0, a)). When loading a term, the trie nodes for the term in hand are traversed in bottom-up order. The trie structure is not traversed in a top-down manner because the insertion and retrieval of terms is an asynchronous process: new trie nodes may be inserted at any time and anywhere in the trie structure. This induces complex dependencies that limit the efficiency of alternative top-down loading schemes.

Space for a trie can be recovered by invoking a close_trie(+R) predicate, where R refers to the root node of a particular trie, or by invoking a close_all_tries() predicate, which closes all open tries and recovers their space. The current implementation also defines auxiliary predicates to obtain memory consumption statistics and to print tries to the standard output. As a final note, we should mention that besides the atoms, integers, variables, and compound terms (functors) presented in the examples, our implementation also supports terms with floats.
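The bottom-up loading of a term can be sketched as follows. This Python fragment is an illustration only: the `Node` class, `symbols_from_leaf`, and `build_term` are hypothetical names, the term is rebuilt as a string rather than on a Prolog term stack, and symbols of the form `name/arity` are assumed to denote functors.

```python
class Node:
    """Minimal node for this sketch: only symbol and parent are needed."""
    def __init__(self, symbol, parent=None):
        self.symbol = symbol
        self.parent = parent

def symbols_from_leaf(leaf):
    """Follow parent pointers from a leaf up to the root, then reverse."""
    syms = []
    node = leaf
    while node.parent is not None:      # the root node carries no symbol
        syms.append(node.symbol)
        node = node.parent
    syms.reverse()
    return syms

def build_term(symbols):
    """Rebuild a term (as text) from its prefix-order symbol sequence."""
    pos = 0

    def parse():
        nonlocal pos
        sym = symbols[pos]
        pos += 1
        if "/" in sym:                  # assumed functor encoding: name/arity
            name, arity = sym.rsplit("/", 1)
            args = [parse() for _ in range(int(arity))]
            return "%s(%s)" % (name, ",".join(args))
        return sym

    return parse()

root = Node(None)
f = Node("f/2", root)
v = Node("VAR0", f)
leaf = Node("a", v)
term = build_term(symbols_from_leaf(leaf))
```

Because only the parent pointers of the nodes on one path are read, nodes added elsewhere in the trie in the meantime do not affect the result.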
3 RL-Trees
The RL-Tree (RangeList-Tree) data structure is an adaptation of a generic data structure called quadtree [24] that has been used in areas like image processing, computer graphics, and geographic information systems. Quadtree is a term used to represent a class of hierarchical data structures whose common property is that they are based on the principle of recursive decomposition of space. Quadtree-based data structures are differentiated by the type of data that they represent, the principle guiding the decomposition process, and the number of times the space is decomposed. The RL-Tree is designed to store integer intervals (e.g., [1-3] ∪ [10-200]). The goals in the design of the RL-Tree data structure are: efficient data storage; fast insertion and removal; and fast retrieval.

3.1 Applicability
To reduce the time spent on computing clause coverage, some ILP systems, such as Aleph [16], FORTE [25], Indlog [9], and April [17], maintain lists of examples covered (coverage lists) for each hypothesis that is generated during execution. Coverage lists are used in these systems as follows. A hypothesis S is generated by applying a refinement operator to another hypothesis G. Let Cover(G) = {all e ∈ E such that B ∧ G ⊨ e}, where G is a clause, B the background knowledge, and E the set of positive (E+) or negative (E−) examples. Since G is more general than S, Cover(S) ⊆ Cover(G). Taking this into account, when testing the coverage of S it is only necessary to consider the examples in Cover(G), thus reducing the coverage computation time. Cussens [26] extended this scheme by proposing a kind of coverage caching. The coverage lists are permanently stored and reused whenever necessary, so that the coverage of a particular clause needs to be computed only once. Coverage lists reduce the effort in coverage computation at the cost of significantly increasing memory consumption. Efficient data structures should therefore be used to represent coverage lists to minimize memory consumption.

The data structures used to maintain coverage lists in systems like Indlog and Aleph are Prolog lists of integers. For each clause two lists are kept: a list of positive examples covered and a list of negative examples covered. A number is used to represent an example in the list. The positive examples are numbered from 1 to |E+|, and the negative examples from 1 to |E−|. The systems mentioned reduce the size of the coverage lists by transforming a list of numbers into a list of intervals. For instance, consider the coverage list [1, 2, 5, 6, 7, 8, 9, 10] represented as a list of numbers. This list represented as a list of intervals corresponds to [1-2, 5-10].

Using a list of intervals to represent coverage lists is an improvement over lists of numbers, but it still presents some problems. First, the cost of performing basic operations on the interval list is linear in the number of intervals and can be improved. Second, the representation of lists in Prolog is not very efficient regarding memory usage. The RL-Tree data structure was designed to tackle the problems just mentioned: memory usage and execution time. RL-Trees can be used to efficiently represent and manipulate coverage lists, and may be implemented in any ILP system (they are not restricted to ILP systems implemented in Prolog).
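The two ideas above, restricting the coverage test of S to Cover(G) and compressing a list of numbers into intervals, can be sketched in Python. The function names `to_intervals` and `refine_cover` and the toy coverage test are hypothetical; real systems perform these steps inside the Prolog engine.

```python
def to_intervals(numbers):
    """Collapse example numbers into sorted, disjoint (lo, hi) intervals."""
    intervals = []
    for n in sorted(set(numbers)):
        if intervals and n == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], n)   # extend the last interval
        else:
            intervals.append((n, n))                # start a new interval
    return intervals

def refine_cover(parent_cover, covers):
    """Test only the examples already covered by the more general clause G."""
    return [e for e in parent_cover if covers(e)]

cover_g = [1, 2, 5, 6, 7, 8, 9, 10]
# Hypothetical coverage test for the refined clause S: even-numbered examples.
cover_s = refine_cover(cover_g, lambda e: e % 2 == 0)
intervals_g = to_intervals(cover_g)
intervals_s = to_intervals(cover_s)
```

The second result shows the weakness of interval lists: when covered examples are scattered, every number becomes its own interval and nothing is saved.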
3.2 Description
In the design and implementation of the RL-Tree data structure we took the following characteristics into consideration: intervals are disjoint; intervals are defined by adding or removing numbers; and the domain (an integer interval) is known at creation time.

RL-Trees are trees with two distinct types of nodes: list nodes and range nodes. A list node represents a fixed interval of size LI, which is implementation dependent. A range node corresponds to an interval that is subdivided into B subintervals. Each subinterval in a range node can be completely contained in an interval (represented in Black), partially contained in an interval (represented in Gray), or not within any interval (represented in White).

The basic idea behind RL-Trees is to represent a disjoint set of intervals in a domain by recursively partitioning the domain interval into equal subintervals. The number of subintervals B generated in each partition is implementation dependent. The number of partitions performed depends on B, the size of the domain, and the size LI of the list node interval. Since we are using RL-Trees to represent coverage lists, the domain is [1, NE], where NE is the number of positive or negative examples.

The RL-Tree whose domain corresponds to the integer interval [1, N] is denoted as RL-Tree(N). A RL-Tree(N) has the following properties:
– $LN = \lceil N/LI \rceil$ is the maximum number of list nodes in the tree;
– $H = \lceil \log_B(LN) \rceil$ is the maximum height of the tree;
– all list nodes are at depth H;
– the root node interval range is $RI = B^H \cdot LI$;
– all range node interval bounds (except those of the root node) are inferred from the parent node;
– every range node is colored with black, white, and gray;
– only the root node can be completely black or white.
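The first properties can be evaluated directly. The following Python sketch computes LN, H, and RI from N, LI, and B using integer arithmetic only; the function name `rl_tree_shape` is hypothetical, and the defaults LI = 16 and B = 4 are the values the paper reports for its implementation.

```python
def rl_tree_shape(n, li=16, b=4):
    """Return (LN, H, RI) for an RL-Tree(n) with list size li and branching b."""
    ln = -(-n // li)            # ceil(n / li): maximum number of list nodes
    h = 0
    while b ** h < ln:          # smallest h with b**h >= ln, i.e. ceil(log_b(ln))
        h += 1
    ri = b ** h * li            # interval range covered by the root node
    return ln, h, ri

shape_65 = rl_tree_shape(65)
shape_64 = rl_tree_shape(64)
```

For N = 65 this gives LN = 5, H = 2, and RI = 256: a single example beyond 64 forces an extra level, since four list nodes of 16 integers no longer suffice.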
[Figure: RL-Tree(65). Root node [1-65] with subintervals [1-64] and [65-65]; range node [1-64] with subintervals [1-16], [17-32], [33-48], [49-64]; list node [1-16] with the integer 1 marked.]

Fig. 3. Interval [1] represented in a RL-Tree(65)
Consider the RL-Tree with domain [1, 65], also denoted as RL-Tree(65). Figures 3, 4, 5, and 6 show some intervals represented in a RL-Tree(65). In these examples the LI and B parameters were set to 16 and 4, respectively. Figure 3 shows the representation of the interval [1]. Each group of four squares represents a range node. Each square in a range node corresponds to a subinterval. A group of sixteen squares represents a list node. Each square in a list node corresponds to an integer. The top of the tree contains a range node that is associated with the domain ([1, 65]). Using the properties of RL-Trees described earlier, one knows that the maximum height of the RL-Tree(65) is 2 and that the root node range is [1-256]. Each subinterval (square) of the root interval represents an interval of 64 integers. The first square (counting from the left), with range [1-64], contains the interval [1], so it is marked in Gray. The range node corresponding to the range [1-64] has all squares painted in White except the first one, corresponding to range [1-16], because it contains the interval [1]. The list node has only one square marked, the square corresponding to the integer 1. Figure 4 shows the representation of a more complex list of intervals. Note that the number of nodes is the same as in Fig. 3, even though it represents a more complex list of intervals. Figures 5 and 6 show, respectively, a complete and an empty interval representation.
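The colouring of a range node can be reproduced with a small Python sketch. The helper names `colour` and `quadrant_colours` are hypothetical, the covered examples are held in a plain set for clarity, and two points are assumptions read off the figures rather than stated in the text: subintervals are clipped to the domain (so the second root square stands for [65-65]), and squares entirely outside the domain are treated as White.

```python
WHITE, BLACK, GRAY = "W", "B", "G"

def colour(covered, lo, hi):
    """Colour of the integer range [lo, hi] with respect to a set of integers."""
    inside = sum(1 for n in range(lo, hi + 1) if n in covered)
    if inside == 0:
        return WHITE
    if inside == hi - lo + 1:
        return BLACK
    return GRAY

def quadrant_colours(covered, lo, size, domain_max, b=4):
    """Colours of the b equal subintervals of [lo, lo + size - 1]."""
    step = size // b
    colours = []
    for k in range(b):
        sub_lo = lo + k * step
        sub_hi = min(sub_lo + step - 1, domain_max)   # assumption: clip to domain
        if sub_lo > domain_max:
            colours.append(WHITE)                      # assumption: outside domain
        else:
            colours.append(colour(covered, sub_lo, sub_hi))
    return colours

single = {1}
root_single = quadrant_colours(single, 1, 256, 65)
node_single = quadrant_colours(single, 1, 64, 65)

mixed = set(range(1, 33)) | {53, 54, 56, 57, 58, 60, 61, 62, 64, 65}
root_mixed = quadrant_colours(mixed, 1, 256, 65)
node_mixed = quadrant_colours(mixed, 1, 64, 65)
```

In both cases only one Gray square appears per range node, which is why the two trees have the same number of nodes: Black and White squares need no child.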
[Figure: RL-Tree(65). Root node [1-65] with subintervals [1-64] and [65-65]; range node [1-64] with subintervals [1-16], [17-32], [33-48], [49-64]; list node [49-64].]

Fig. 4. Intervals [1, 32] ∪ [53, 54] ∪ [56, 58] ∪ [60, 62] ∪ [64, 65] represented in a RL-Tree(65)
3.3 Implementation
Like the Trie data structure, the RL-Tree data structure was implemented in C as a shared library. Since the ILP system used in the experiments is implemented in Prolog we developed an interface to RL-Tree as an external Prolog module. Like other quadtree data structures [27], a RL-Tree can be implemented with or without pointers. We chose to do a pointerless implementation (using an array) to reduce memory consumption in pointers. The LI and B parameters were set to 16 and 4 respectively. The range node is implemented using 16 bits. Since we divide the intervals by a factor of 4, each range node may have 4 subintervals.
[Figure: RL-Tree(65), root node [1-65].]

Fig. 5. Interval [1,65] represented in a RL-Tree(65)

[Figure: RL-Tree(65), root node [1-65].]

Fig. 6. Interval ∅ represented in a RL-Tree(65)
Each subinterval has a color associated (White, Black, or Gray) that is coded using 2 bits (thus a total of 8 bits are used for the 4 subintervals). The other 8 bits are used to store the number of subnodes of a node. This information is used to improve efficiency by reducing the need to traverse the tree to determine the position, in the array, of a given node. The list nodes use 16 bits. Each bit represents a number (that in turn represents an example). The number interval represented by a list node is inferred from its parent range node. The RL-Tree(N) implemented operations and their complexity (regarding the number of subintervals considered) are:

– Create a RL-Tree: O(1);
– Delete a RL-Tree: O(1);
– Check if a number is in a RL-Tree: O(H);
– Add a number to a RL-Tree: O(H);
– Remove a number from a RL-Tree: O(H).
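The 16-bit node layout can be illustrated with a few bit operations in Python. The text fixes the field widths (four 2-bit colours plus an 8-bit subnode count for range nodes, one bit per integer for list nodes) but not the bit positions or the colour codes, so the layout below, like the function names, is an assumption.

```python
WHITE, BLACK, GRAY = 0, 1, 2    # assumed 2-bit colour codes

def pack_range_node(colours, subnodes):
    """Pack four 2-bit colours (low byte) and a subnode count (high byte)."""
    assert len(colours) == 4 and 0 <= subnodes < 256
    word = subnodes << 8
    for k, c in enumerate(colours):
        word |= c << (2 * k)
    return word

def unpack_range_node(word):
    """Inverse of pack_range_node: return (colours, subnodes)."""
    colours = [(word >> (2 * k)) & 0b11 for k in range(4)]
    return colours, word >> 8

def list_node_add(bits, lo, n):
    """Set the bit for integer n in a list node whose interval starts at lo."""
    return bits | (1 << (n - lo))

def list_node_contains(bits, lo, n):
    """Check the bit for integer n in a list node whose interval starts at lo."""
    return bool((bits >> (n - lo)) & 1)

word = pack_range_node([GRAY, WHITE, WHITE, WHITE], 1)
bits = list_node_add(0, 49, 53)
bits = list_node_add(bits, 49, 64)
```

Both node kinds fit in one 16-bit array slot, which is what makes a pointerless array layout possible.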
The current implementation of RL-Trees uses, in the worst case, $(4^{H+1} - 1)/3$ nodes. The worst case occurs when the tree requires all LN list nodes. Since each node in the tree requires 2 bytes, a RL-Tree(N) will require, in the worst case, approximately $((4^{H+1} - 1)/3) \cdot 2 + C$ bytes, where C is the memory needed to store tree header information. In our implementation C = 20.
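As a worked instance of this bound, the following Python sketch evaluates the formula for a given height. The function name `worst_case_bytes` is hypothetical; the 2 bytes per node and the header size C = 20 are taken from the text.

```python
def worst_case_bytes(h, node_bytes=2, header=20):
    """Return (nodes, bytes) for the worst-case RL-Tree of height h with B = 4."""
    nodes = (4 ** (h + 1) - 1) // 3     # 1 + 4 + ... + 4**h, always an integer
    return nodes, nodes * node_bytes + header

nodes_h2, bytes_h2 = worst_case_bytes(2)
```

For H = 2, the height of RL-Tree(65), the bound is 21 nodes and 62 bytes. The formula counts a full tree of height H, so it is an upper estimate when fewer than $4^H$ list nodes exist.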
4 Experiments and Results
The goal of the experiments was to evaluate the impact of the proposed data structures on execution time and memory usage when dealing with real application problems. We adapted the April ILP system [17] so that it could be executed with support for Tries and/or RL-Trees and applied the system to well-known datasets. For each dataset the system was executed four times, with the following configurations: no Tries and no RL-Trees, Tries and RL-Trees, Tries only, and RL-Trees only.

4.1 Experimental Settings
The experiments were made on an AMD Athlon(tm) MP 2000+ dual-processor PC with 2 GB of memory, running the Linux RedHat (kernel 2.4.20) operating system. We used version 0.5 of the April system and version 4.3.23 of YAP Prolog [28]. The datasets used were downloaded from the Machine Learning repositories of the Universities of Oxford³ and York⁴. The susi dataset was downloaded from the Science University of Tokyo⁵. Table 1 characterizes the datasets in terms of number of positive and negative examples as well as background knowledge size. Furthermore, it shows the April settings used with each dataset. The parameter nodes specifies an upper bound on the number of hypotheses generated during the search for an acceptable hypothesis. The i-depth corresponds to the maximum depth of a literal with respect to the head literal of the hypothesis [29]. Sample defines the number of examples used to induce a clause. The language parameter specifies the maximum number of occurrences of a predicate symbol in a hypothesis [9]. MinPos specifies the minimum number of positive examples that a hypothesis must cover in order to be accepted. Finally, the parameter noise defines the maximum number of negative examples that a hypothesis may cover in order to be accepted.

Table 1. Settings used in the experiments

                  Characterization      April's Settings
Dataset           |E+|   |E−|   |B|     nodes      i   sample   language   minpos   noise
amine uptake       343    343    32     1000       2   20                  50       20
carcinogenesis     162    136    44     1000       3   10       3          20       10
choline            663    663    31     1000       2   all                 50       20
krki               342    658     1     no limit   1   all      2          1        0
mesh              2272    223    29     1000       3   20       3          10       5
multiplication       9     15     3     no limit   2   all      2          1        0
pyrimidines        881    881   244     1000       2   10                  75       20
proteins           848    764    45     1000       2   10                  100      100
train                5      5    10     no limit   2   all      1          1        0
train128           120      5    10     no limit   2   all      1          1        0
susi               252   8979    18     no limit   5   2        1          200      800

³ http://www.comlab.ox.ac.uk/oucl/groups/machlearn/.
⁴ http://www.cs.york.ac.uk/mlg/index.html.
⁵ http://www.ia.noda.sut.ac.jp/ilp.
Note that in order to speed up the experiments we limited the search space of some datasets by setting the parameter nodes to 1000. This reduces the total memory needed to process the dataset. However, since we are comparing memory consumption with and without a given data structure, the estimate we obtain still gives a good idea of the impact of the data structure in reducing memory usage.

4.2 Tries
When activated in April, the Trie data structure stores information about each hypothesis generated. More specifically, it stores the hypothesis (Prolog clause), a list of variables in the clause, a list of unbound variables in the clause's head, and a list of free variables in the clause. Table 2 shows the total number of hypotheses generated (|H|), the execution time, the memory usage, and the impact on execution time and memory usage (given as the ratio between the values obtained when using and when not using Tries). The memory values presented correspond only to the memory used to store information about the hypotheses. The use of tries resulted in an average reduction of 20% in memory consumption with the datasets considered. The train dataset was the only exception, showing a degradation of 25% in memory consumption. This may indicate that the Tries data structure is not adequate for datasets with very small hypothesis spaces. However, memory usage is not a concern for problems with small hypothesis spaces.

Table 2. The impact of Tries

                            Time (sec.)             Memory (bytes)       on/off (%)
Dataset           |H|       off        on           off       on         Time     Memory
amine uptake      66933     357.10     362.40       739316    553412     101.48   74.85
carcinogenesis    142714    506.19     517.76       869888    680212     102.28   78.19
choline           803366    13451.21   13573.24     869736    598344     100.90   68.79
krki              2579      1.11       1.30         62436     50000      117.11   80.08
mesh              283552    3241.62    3267.85      607584    506112     100.80   83.29
multiplication    478       8.91       8.98         164304    105348     100.78   64.11
pyrimidines       372320    5581.35    5602.96      914520    580852     100.38   63.51
proteins          433271    794.03     832.83       759440    595928     104.88   78.46
train             37        0.02       0.02         9260      11612      100.00   125.39
train128          44        0.05       0.06         22224     21392      120.00   96.25
susi              3344      7995.01    7982.82      3655916   1934640    99.84    52.91
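The on/off memory percentages can be recomputed from the byte counts in Table 2. The Python sketch below does this for three rows with integer arithmetic; the function name is hypothetical, and truncating to two decimals (rather than rounding) is an assumption that happens to reproduce the printed values for the rows checked here.

```python
def ratio_percent(on, off):
    """on/off ratio as a percentage, truncated to two decimals."""
    return (on * 10000 // off) / 100

amine = ratio_percent(553412, 739316)      # amine uptake
train = ratio_percent(11612, 9260)         # train
susi = ratio_percent(1934640, 3655916)     # susi
```

A value below 100 means the trie-based run used less memory; train is the one row above 100.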
With Tries, the execution time slightly increased but the overhead is not significant. The krki and train128 datasets are exceptions, although unimportant ones, as the difference in execution time is just a fraction of a second. In summary, the results suggest that the Tries data structure reduces memory consumption with a minimal execution time overhead.

4.3 RL-Trees
Table 3 presents the impact of using RL-Trees in the April system. It shows the total number of hypotheses generated (|H|), the execution time, the memory usage, and the impact on execution time and memory usage (given as the ratio between using RL-Trees and using Prolog range lists). The memory values presented correspond only to the memory used to store coverage lists.

Table 3. The impact of RL-Trees

                            Time (sec.)             Memory (bytes)        rl/list (%)
Dataset           |H|       list       rl           list       rl         Time     Memory
amine uptake      66933     365.74     357.23       5142784    2181658    97.67    42.42
carcinogenesis    142714    508.41     505.61       2972668    1560180    99.44    52.48
choline           803366    13778.29   13617.49     17644032   7520744    98.83    42.62
krki              2579      1.22       1.13         150264     43822      92.62    29.16
mesh              283552    3394.1     3258.21      8286944    4880746    95.99    58.89
multiplication    478       8.89       8.91         35808      35412      100.22   98.89
pyrimidines       372320    5606.97    5460.22      24291608   6568286    97.38    27.03
proteins          433271    805.97     791.92       693868     146344     98.25    21.09
train             37        0.02       0.02         3676       3692       100.00   100.43
train128          44        0.05       0.05         10228      7284       100.00   71.21
susi              3343      8079.51    7927.03      3021356    263098     98.11    8.70
The use of RL-Trees resulted in an average reduction of 50% in memory usage (when compared to Prolog range lists). The only exception to the overall reduction was registered by the train dataset. This is probably a consequence of the reduced number of examples in the dataset. The results indicate that more significant reductions in memory usage were obtained with datasets with a greater number of examples. In general, a considerable reduction in memory usage is achieved with no execution time overhead when using RL-Trees. In fact, an average reduction of 2% in the execution time was obtained.

4.4 Tries and RL-Trees
To evaluate the impact of using Tries and RL-Trees simultaneously, we ran April configured to use both data structures. Table 4 shows the total number of hypotheses generated (|H|), the execution time, the memory usage, and the impact on execution time and memory usage (given as the ratio between using the proposed data structures and not using them). The memory values presented correspond only to the memory used to store coverage lists and information about the hypotheses (stored in the Tries).

Table 4. The impact of Tries and RL-Trees

                            Time (sec.)             Memory (bytes)        on/off (%)
Dataset           |H|       off        on           off        on         Time     Memory
amine uptake      66933     365.74     362.83       5882100    2728174    99.20    46.38
carcinogenesis    142714    508.41     516.51       3842556    2223164    101.59   57.85
choline           803366    13778.29   13651.51     18513768   8090504    99.07    43.69
krki              2579      1.22       1.21         212700     93978      100.82   44.18
mesh              283552    3394.1     3284.33      8894528    5376906    96.76    60.45
multiplication    478       8.91       8.98         200112     140908     100.78   70.41
pyrimidines       372320    5606.97    5501.65      25206128   7132978    98.12    28.29
proteins          433271    805.97     834.76       1453308    740904     103.57   50.98
train             37        0.02       0.02         12936      15264      100.00   117.99
train128          44        0.05       0.05         32452      28564      100.00   88.01
susi              3343      8079.51    7928.98      6677264    2197050    98.13    32.90

The use of both data structures resulted in significant reductions in memory usage. The results indicate that the impact of the data structures tends to be greater in the applications with more examples and with larger search spaces. The train dataset was the only one that consumed more memory when using Tries and RL-Trees. This occurred because the dataset has a very small hypothesis space and the number of examples is also small. Nevertheless, the values obtained are useful because they give an idea of the initial overhead of the proposed data structures. The time overhead experienced is minimal in the smaller datasets (train, train128, multiplication), and nonexistent in the larger datasets.

Table 5 summarizes the impact of the proposed data structures on the total memory usage of the April system. The table shows April's (total) memory usage when using Tries and RL-Trees simultaneously and the reduction ratio when compared to using Prolog range lists and not using Tries. The reduction values obtained are good, especially if we take into account that the biggest reductions (42.15% and 26.57%) were obtained in the datasets with the greatest memory usage. From the reduction values presented we conclude that with small datasets the data structures do not produce major gains, but they also do not introduce significant overheads. On the other hand, the proposed data structures should be used when processing larger datasets, since they can reduce memory consumption significantly.
5
Conclusions
This paper contributes to the effort of improving the efficiency of ILP systems by proposing two data structures: RL-Trees and Tries. The use of these data structures reduces memory consumption without an execution time overhead. The
Table 5. April memory consumption using Tries and RL-Trees

Dataset          Memory (MB)   Reduction (%)
amine uptake     11.02         21.64
carcinogenesis   13.14          9.04
choline          31.28         26.57
krki              2.21          4.02
mesh             24.86         23.83
multiplication    4.18          0.44
pyrimidines      26.53         42.15
proteins         26.22          2.01
train             1.68         -1.50
train128          1.75         -1.11
susi             20            16.66
RL-Tree is a novel data structure designed to efficiently store and manipulate coverage lists. The Trie data structure inherently and efficiently exploits the similarities among the candidate hypotheses generated by ILP systems to reduce memory usage. The data structures were integrated in the April system, an ILP system implemented in Prolog. We have empirically evaluated the use of RL-Trees and Tries, both individually and in conjunction, on well-known datasets. The RL-Tree data structure alone reduced the memory used by coverage lists by half, on average, and slightly reduced the execution time. The Tries data structure alone reduced memory consumption with a minor overhead (approximately 1%) in the execution time. The use of both data structures simultaneously resulted in an overall reduction in memory usage without degrading the execution time. In some datasets, the April system registered very substantial memory reductions (between 20% and 42%) when using Tries and RL-Trees simultaneously. The results indicate that the benefits from using these data structures tend to increase for datasets with larger search spaces and greater numbers of examples. Since the data structures are system independent, we believe that they can also be applied to other ILP systems with positive impact.

In the future we plan to implement operations like intersection and subtraction of two RL-Trees in order to compute the coverage intersection of two clauses more efficiently. We will also identify more items collected during the search that may take advantage of the Tries data structure.

Acknowledgments
The authors thank Vitor Santos Costa for making YAP available and for all the support provided. The work presented in this paper has been partially supported by project APRIL (Project POSI/SRI/40749/2001) and funds granted to LIACC
through the Programa de Financiamento Plurianual, Fundação para a Ciência e Tecnologia and Programa POSI. Nuno Fonseca is funded by the FCT grant SFRH/BD/7045/2001.
Graph Kernels and Gaussian Processes for Relational Reinforcement Learning

Thomas Gärtner (1,2), Kurt Driessens (3), and Jan Ramon (3)

1 Fraunhofer Institut Autonome Intelligente Systeme, Germany [email protected]
2 Department of Computer Science III, University of Bonn, Germany
3 Department of Computer Science, K.U.Leuven, Belgium [email protected] [email protected]
Abstract. Relational reinforcement learning is a Q-learning technique for relational state-action spaces. It aims to enable agents to learn how to act in an environment that has no natural representation as a tuple of constants. In this case, the learning algorithm used to approximate the mapping between state-action pairs and their so-called Q(uality)-value not only has to be very reliable, but also has to be able to handle the relational representation of state-action pairs. In this paper we investigate the use of Gaussian processes to approximate the quality of state-action pairs. In order to employ Gaussian processes in a relational setting we use graph kernels as the covariance function between state-action pairs. Experiments conducted in the blocks world show that Gaussian processes with graph kernels can compete with, and often improve on, regression trees and instance-based regression as a generalisation algorithm for relational reinforcement learning.
1
Introduction
Reinforcement learning [26], in a nutshell, is about controlling an autonomous agent in an unknown environment, often called the state space. The agent has no prior knowledge about the environment and can only obtain some knowledge by acting in that environment. The only information the agent can get about the environment is the state it is currently in and whether it received a reward. The aim of reinforcement learning is to act such that this reward is maximised.

Q-learning [27], one particular form of reinforcement learning, tries to map every state-action pair to a real number (Q-value) reflecting the quality of that action in that state, based on the experience so far. While in small state-action spaces it is possible to represent this mapping extensionally, in large state-action spaces this is not feasible for two reasons: on the one hand, one cannot store the full state-action space; on the other hand, the larger the state-action space gets, the smaller the probability of ever getting back into the same state becomes. For this reason, the extensional representation of the quality mapping is often substituted with an intensional mapping found by a learning algorithm that is able to generalise to unseen states. Ideally, an incrementally learnable regression algorithm is used to learn this mapping.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 146–163, 2003. © Springer-Verlag Berlin Heidelberg 2003

Relational reinforcement learning [9,8] (RRL) is a Q-learning technique that can be applied whenever the state-action space cannot easily be represented by tuples of constants but has an inherently relational representation instead. In this case, explicitly representing the mapping from state-action pairs to Q-values is, in general, not feasible. So far, first-order distance-based algorithms as well as first-order regression trees have been used as learning algorithms to approximate the mapping between state-action pairs and their Q-values.

Kernel methods [24] are among the most successful recent developments within the machine learning community. The computational attractiveness of kernel methods is due to the fact that they can be applied in high-dimensional feature spaces without suffering from the high cost of explicitly computing the feature map. This is possible by using a positive definite kernel $k$ on any set $\mathcal{X}$. For such $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ it is known that a map $\phi : \mathcal{X} \to \mathcal{H}$ into a Hilbert space $\mathcal{H}$ exists such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for all $x, x' \in \mathcal{X}$. Kernel methods have so far successfully been applied to various tasks in attribute-value learning.

Gaussian processes [21] are an incrementally learnable ‘Bayesian’ regression algorithm. Rather than parameterising some set of possible target functions and specifying a prior over these parameters, Gaussian processes directly put a (Gaussian) prior over the function space. A Gaussian process is defined by a mean function and a covariance function, implicitly specifying the prior. The choice of covariance functions is thereby limited only to positive definite kernels.
Graph kernels [13] have recently been introduced as one way of extending the applicability of kernel methods beyond mere attribute-value representations. The idea behind graph kernels is to base the similarity of two graphs on the number of common label sequences. The computation of these label sequences, which may be infinitely long, is made possible by using product graphs and computing the limits of power series of their adjacency matrices.

In this paper we use Gaussian processes to learn the mapping to Q-values. Related work on reinforcement learning with kernel methods is very limited so far.¹ In order to employ Gaussian processes in a relational setting we use graph kernels as the covariance function between state-action pairs. One advantage of using Gaussian processes in RRL is that rather than predicting a single Q-value, they actually return a probability distribution over Q-values. Experiments conducted in the blocks world show that Gaussian processes with graph kernels can compete with, and often improve on, regression trees and instance-based regression as a generalisation algorithm for relational reinforcement learning.

Section 2 briefly presents the relational reinforcement learning framework and discusses some previous implementations of the RRL-system. Section 3 describes kernel methods in general and Gaussian processes in particular. Section 4 proposes graph kernels as covariance functions that are able to deal with the structural nature of state-action pairs in RRL. Section 5 shows how states and actions in the blocks world can be represented by graphs. Section 6 presents some experimental results that compare Gaussian processes with other regression algorithms in RRL. Section 7 concludes and discusses some directions for further work.

¹ In [22] the term ‘kernel’ is not used to refer to a positive definite function but to a probability density function.
2
Relational Reinforcement Learning
Relational reinforcement learning (RRL) [9] is a Q-learning technique that allows structural representations for states and actions. The RRL-system learns through exploration of the state space in a way that is very similar to normal Q-learning algorithms. It starts by running an episode² just like table-based Q-learning, but uses the encountered states, chosen actions and received rewards to generate a set of examples that can then be used to build a Q-function generalisation. These examples use a structural representation of states and actions. To build this generalised Q-function, RRL applies an incremental relational regression engine that can exploit the structural representation of the constructed example set. The resulting Q-function is then used to decide which actions to take in the following episodes. Every new episode can be seen as new experience and is thus used to update the Q-function generalisation. A more formal description of the RRL-algorithm is given in Table 1.

Previous implementations of the RRL-system have used first-order regression trees and relational instance-based regression to build a generalised Q-function. In this work, we suggest using Gaussian processes as a generalisation algorithm for RRL. Gaussian processes not only provide a prediction for unseen examples but can also determine a probability distribution over Q-values. In reinforcement learning, this probability distribution can, for example, very easily be used to determine the exploration strategy. We will compare our new approach with both previous implementations of RRL.

RRL-Tg [8] uses an incremental first-order regression tree algorithm Tg to construct the Q-function. Although the Tg-algorithm is very fast compared to other approaches, its performance depends greatly on the language definition that it uses to construct possible tree refinements. Also, Tg has shown itself to be sensitive to the order in which the (state, action, qvalue)-examples are presented and often needs more training episodes to find a competitive policy.

RRL-Rib [7] uses relational instance-based regression for Q-function generalisation. Instance-based regression offers a robustness to RRL not found in RRL-Tg but requires a first-order distance to be defined between (state, action)-pairs. The definition of a meaningful first-order distance is seldom trivial.

² An ‘episode’ is a sequence of states and actions from an initial state to a terminal state. In each state, the current Q-function is used to decide which action to take.
Table 1. The RRL-algorithm. In the case of the algorithm proposed in this paper, updating Q̂_e means computing the inverse of the covariance matrix of the examples. This can be done incrementally using partitioned inverse equations. More details can be found in Section 3.1 (Gaussian processes).

    Initialise the Q-function hypothesis Q̂_0
    e ← 0
    repeat (for each episode)
        Examples ← ∅
        Generate a starting state s_0
        i ← 0
        repeat (for each step of episode)
            Choose a_i for s_i using the policy derived from the current hypothesis Q̂_e
            Take action a_i, observe r_i and s_{i+1}
            i ← i + 1
        until s_i is terminal
        for j = i − 1 to 0 do
            Generate example x = (s_j, a_j, q̂_j), where q̂_j ← r_j + γ max_a Q̂_e(s_{j+1}, a),
            and add x to Examples
        Update Q̂_e using Examples and an incremental relational regression algorithm to produce Q̂_{e+1}
        e ← e + 1
    until no more episodes
3
Kernel Methods
Kernel methods are among the most successful recent developments within the machine learning community. In this section we first introduce Gaussian processes as an incrementally learnable regression technique and then give a brief introduction to kernel functions.
3.1
Gaussian Processes
Parametric learning techniques build their hypotheses by searching a parameterised function class $\{f_w(\cdot)\}_w$, where $w$ is the parameter vector of the function class. In some cases a single function is chosen and used for future predictions; in other cases combinations of functions from this function class are used. Examples of such parametric learning algorithms are neural networks and radial basis function networks.

Bayesian parametric learning techniques assume some prior distribution over the parameter vectors. Given some training data, they then compute the posterior distribution over parameter vectors by Bayes' rule. Predictions for unseen test data can then be obtained, for example, by marginalising over the parameters. For that, Bayesian methods assume that the distribution modelling the noise between observed target values and true target values is known.

Gaussian processes are a non-parametric Bayesian method. Instead of parameterising the function class and assuming a prior distribution over the parameter vectors, a prior distribution over the function space itself is assumed, i.e., the prior is $P(f(\cdot))$ rather than $P(w)$. This prior, however, can be defined only implicitly. For that, the distribution of target values and the noise distribution are assumed to be normal. To make learning possible, it furthermore has to be assumed that the target values are correlated and that their correlation depends only on the correlation of the corresponding data points.

To specify a Gaussian process one has to define its mean function $\mu(x) = E[Y(x)]$ and its covariance function $C(x, x') = E[(Y(x) - \mu(x))(Y(x') - \mu(x'))]$, where $x, x'$ are instances, $Y(\cdot)$ is a random function, and $E[\cdot]$ is the expectation over $P(Y(\cdot))$. The choice of covariance functions is thereby restricted to positive definite kernel functions.

Let $\{x_1, \ldots, x_n\}$ be the training instances, let $\mathbf{t}$ be the vector $(t_1, \ldots, t_n)$ of corresponding target values, and let $x_{n+1}$ be an unseen test instance. Now, let $\mathbf{C}$ be the covariance matrix of the training instances ($C_{ij} = C(x_i, x_j)$, $1 \le i, j \le n$). Let $\mathbf{k}$ be the vector of covariances between the training instances and the test instance ($k_i = C(x_i, x_{n+1})$, $1 \le i \le n$) and let $\kappa$ be the variance of the test instance ($\kappa = C(x_{n+1}, x_{n+1})$). For simplicity we will assume throughout the paper that the joint Gaussian distribution has a zero mean function. The posterior distribution $P(t_{n+1} \mid \mathbf{t})$ of target values $t_{n+1}$ of the instance $x_{n+1}$ is then a Gaussian distribution with mean and variance
\[ \hat{t}_{n+1} = \mathbf{k}^\top \mathbf{C}^{-1} \mathbf{t}, \qquad \sigma^2_{\hat{t}_{n+1}} = \kappa - \mathbf{k}^\top \mathbf{C}^{-1} \mathbf{k}. \]

Gaussian processes are particularly well suited for reinforcement learning, as the inverse of the covariance matrix $\mathbf{C}$ can be computed incrementally, using the so-called partitioned inverse equations [2]. While computing the inverse directly is of cubic time complexity, incrementing the inverse is only of quadratic time complexity. Also, the probability distribution over target values can be used to determine the exploration strategy in reinforcement learning.
3.2
Kernel Functions
Technically, a kernel $k$ calculates an inner product in some feature space which is, in general, different from the representation space of the instances. The computational attractiveness of kernel methods comes from the fact that quite often a closed form of these ‘feature space inner products’ exists. Instead of performing the expensive transformation step $\phi$ explicitly, a kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$ calculates the inner product directly and performs the feature transformation only implicitly. Whether, for a given function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, a feature transformation $\phi : \mathcal{X} \to \mathcal{H}$ into some Hilbert space $\mathcal{H}$ exists such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for all $x, x' \in \mathcal{X}$ can be checked by verifying that the function is positive definite [1]. This means that any set, whether a linear space or not, that admits a positive definite kernel can be embedded into a linear space. Thus, throughout the paper,
we take ‘valid’ to mean ‘positive definite’. Here then is the definition of a positive definite kernel. ($\mathbb{Z}^+$ is the set of positive integers.)

Definition 1. Let $\mathcal{X}$ be a set. A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel on $\mathcal{X}$ if, for all $n \in \mathbb{Z}^+$, $x_1, \ldots, x_n \in \mathcal{X}$, and $c_1, \ldots, c_n \in \mathbb{R}$, it follows that
\[ \sum_{i,j \in \{1,\ldots,n\}} c_i c_j \, k(x_i, x_j) \ge 0. \]

While it is not always easy to prove positive definiteness for a given kernel, positive definite kernels do have some nice closure properties. In particular, they are closed under sum, direct sum, multiplication by a scalar, product, tensor product, zero extension, pointwise limits, and exponentiation [4,15].
3.3
Kernels for Structured Data
The best known kernel for representation spaces that are not mere attribute-value tuples is the convolution kernel proposed by Haussler [15]. The basic idea of convolution kernels is that the semantics of composite objects can often be captured by a relation $R$ between the object and its parts. The kernel on the object is then made up from kernels defined on different parts. Let $x, x' \in \mathcal{X}$ be the objects and $\vec{x}, \vec{x}' \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_D$ be tuples of parts of these objects. Given the relation $R : (\mathcal{X}_1 \times \cdots \times \mathcal{X}_D) \times \mathcal{X}$ we can define the decomposition $R^{-1}$ as $R^{-1}(x) = \{\vec{x} : R(\vec{x}, x)\}$. With positive definite kernels $k_d : \mathcal{X}_d \times \mathcal{X}_d \to \mathbb{R}$ the convolution kernel is defined as
\[ k_{conv}(x, x') = \sum_{\vec{x} \in R^{-1}(x),\; \vec{x}' \in R^{-1}(x')} \; \prod_{d=1}^{D} k_d(x_d, x'_d) \]
The term ‘convolution kernel’ refers to a class of kernels that can be formulated in the above way. The advantage of convolution kernels is that they are very general and can be applied to many different problems. However, because of that generality, they require a significant amount of work to adapt them to a specific problem, which makes choosing $R$ in ‘real-world’ applications a non-trivial task.

There are other kernel definitions for structured data in the literature; however, these usually focus on a very restricted syntax and are more or less domain specific. Examples are string and tree kernels. Traditionally, string kernels [20] have focused on applications in text mining and measure the similarity of two strings by the number of common (not necessarily contiguous) substrings. These string kernels have not been applied in other domains. However, other string kernels have been defined for other domains, e.g., recognition of translation initiation sites in DNA and mRNA sequences [28]. Again, these kernels have not been applied in other domains. Tree kernels [3] can be applied to ordered trees where the number of children of a node is determined by the label of the node. They compute the similarity of trees based on their common subtrees. Tree kernels have been applied in natural language processing tasks. A kernel for instances represented by terms in a higher-order logic can be found in [14].
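As a rough illustration of the convolution kernel defined earlier in this section, the following sketch sums, over all pairs of decompositions, the product of the part kernels. The toy objects (collections of colour/shape pairs, so D = 2), the `items` decomposition and the `delta` part kernel are assumptions chosen only to keep the example small; they are not taken from the paper.

```python
from itertools import product
from math import prod


def convolution_kernel(x, y, decompose, part_kernels):
    """Sum over all pairs of decompositions of the product of part kernels."""
    total = 0.0
    for parts_x, parts_y in product(decompose(x), decompose(y)):
        total += prod(k(a, b) for k, a, b in zip(part_kernels, parts_x, parts_y))
    return total


def items(obj):
    """Hypothetical decomposition R^{-1}: each item is one tuple of D = 2 parts."""
    return list(obj)


def delta(a, b):
    """Matching kernel on a single part: 1 if the parts are equal, else 0."""
    return 1.0 if a == b else 0.0


# Two toy objects, each a collection of (colour, shape) part tuples.
scene_a = [("red", "cube"), ("blue", "ball")]
scene_b = [("red", "cube"), ("red", "ball")]

k_ab = convolution_kernel(scene_a, scene_b, items, (delta, delta))
k_aa = convolution_kernel(scene_a, scene_a, items, (delta, delta))
```

With matching kernels on every part, the value simply counts the pairs of items that agree in all parts; swapping in softer part kernels changes the similarity without touching `convolution_kernel` itself.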
For an extensive overview of these and other kernels on structured data, the reader is referred to [12]. None of these kernels, however, can be applied to the kind of graphs encountered in our representation of the blocks world (see Section 5). Kernels that can be applied there have independently been introduced in [11] and [18] and will be presented in the next section.
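Before moving on to graph kernels, the Gaussian process equations of Section 3.1 can be tied to the kernels of this section with a short sketch: a kernel serves as covariance function, the inverse covariance matrix is grown one example at a time with the partitioned inverse equations, and predictions use the posterior mean and variance formulas. The class name `IncrementalGP`, the RBF kernel on real numbers and the small `noise` term added to the diagonal are assumptions made for illustration; the paper itself uses graph kernels on state-action pairs.

```python
import math


def rbf(x, y, width=1.0):
    """Gaussian (RBF) kernel on real numbers; an assumed covariance function."""
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))


class IncrementalGP:
    """Zero-mean GP regression keeping C^{-1} up to date incrementally."""

    def __init__(self, kernel=rbf, noise=1e-6):
        self.kernel, self.noise = kernel, noise
        self.xs, self.ts, self.c_inv = [], [], []

    def _k(self, x):
        return [self.kernel(xi, x) for xi in self.xs]

    def _c_inv_times(self, vec):
        return [sum(r * v for r, v in zip(row, vec)) for row in self.c_inv]

    def add(self, x, t):
        """Extend C^{-1} by one row and column in O(n^2) time."""
        k = self._k(x)
        kappa = self.kernel(x, x) + self.noise
        v = self._c_inv_times(k)                       # v = C^{-1} k
        mu = 1.0 / (kappa - sum(ki * vi for ki, vi in zip(k, v)))
        new = [[cij + mu * vi * vj for cij, vj in zip(row, v)] + [-mu * vi]
               for row, vi in zip(self.c_inv, v)]
        new.append([-mu * vi for vi in v] + [mu])
        self.c_inv = new
        self.xs.append(x)
        self.ts.append(t)

    def predict(self, x):
        """Posterior mean k^T C^{-1} t and variance kappa - k^T C^{-1} k."""
        k = self._k(x)
        v = self._c_inv_times(k)
        mean = sum(vi * ti for vi, ti in zip(v, self.ts))
        var = self.kernel(x, x) - sum(ki * vi for ki, vi in zip(k, v))
        return mean, var


gp = IncrementalGP()
for x, t in [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]:
    gp.add(x, t)
mean_seen, var_seen = gp.predict(1.0)    # at a training instance
mean_far, var_far = gp.predict(100.0)    # far from all training instances
```

Near a training instance the variance shrinks towards zero, and far away it returns to the prior variance, which is the property the text suggests exploiting for exploration.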
4
Graph Kernels
Graph kernels are an important means to extend the applicability of kernel methods to structured data. This section gives a brief overview of graphs and graph kernels. For a more in-depth discussion of graphs the reader is referred to [5,19]. For a discussion of different graph kernels see [13].
4.1
Labelled Directed Graphs
Generally, a graph $G$ is described by a finite set of vertices $\mathcal{V}$, a finite set of edges $\mathcal{E}$, and a function $\Psi$. For labelled graphs there is additionally a set of labels $\mathcal{L}$ along with a function label assigning a label to each edge and vertex. We will sometimes assume some enumeration of the vertices and labels in a graph, i.e., $\mathcal{V} = \{\nu_i\}_{i=1}^{n}$ where $n = |\mathcal{V}|$ and $\mathcal{L} = \{\ell_r\}_{r \in \mathbb{N}}$.³ For directed graphs the function $\Psi$ maps each edge to the tuple consisting of its initial and terminal node, $\Psi : \mathcal{E} \to \{(u, v) \in \mathcal{V} \times \mathcal{V}\}$. Edges $e$ in a directed graph for which $\Psi(e) = (v, v)$ are called loops. Two edges $e, e'$ are parallel if $\Psi(e) = \Psi(e')$. Frequently, only graphs without parallel edges are considered. In our application, however, it is important to also consider graphs with parallel edges. To refer to the vertex and edge set of a specific graph we will sometimes use the notation $\mathcal{V}(G)$, $\mathcal{E}(G)$. Wherever we distinguish two graphs by their subscript ($G_i$) or some other symbol ($G'$, $G^*$) the same notation will be used to distinguish their vertex and edge sets.

Some special graphs relevant for the description of graph kernels are walks, paths, and cycles. A walk⁴ $w$ is a sequence of vertices $v_i \in \mathcal{V}$ and edges $e_i \in \mathcal{E}$ with $w = v_1, e_1, v_2, e_2, \ldots, e_n, v_{n+1}$ and $\Psi(e_i) = (v_i, v_{i+1})$. The length of the walk is equal to the number of edges in this sequence, i.e., $n$ in the above case. A path is a walk in which $v_i \ne v_j \Leftrightarrow i \ne j$ and $e_i \ne e_j \Leftrightarrow i \ne j$. A cycle is a path followed by an edge $e_{n+1}$ with $\Psi(e_{n+1}) = (v_{n+1}, v_1)$.
4.2
Graph Degree and Adjacency Matrix
We also need to define some functions describing the neighbourhood of a vertex $v$ in a graph $G$: $\delta^+(v) = \{e \in \mathcal{E} \mid \Psi(e) = (v, u)\}$ and $\delta^-(v) = \{e \in \mathcal{E} \mid \Psi(e) = (u, v)\}$. Here, $|\delta^+(v)|$ is called the outdegree of a vertex and $|\delta^-(v)|$ the indegree.

³ While $\ell_1$ will always be used to denote the same label, $l_1$ is a variable that can take different values, e.g., $\ell_1, \ell_2, \ldots$. The same holds for vertex $\nu_1$ and variable $v_1$.
⁴ What we call ‘walk’ is sometimes called an ‘edge progression’.
Furthermore, the maximal indegree and outdegree are denoted by $\Delta^-(G) = \max\{|\delta^-(v)|, v \in \mathcal{V}\}$ and $\Delta^+(G) = \max\{|\delta^+(v)|, v \in \mathcal{V}\}$, respectively.

For a compact representation of the graph kernel we will later use the adjacency matrix $E$ of a graph. The component $[E]_{ij}$ of this matrix corresponds to the number of edges from vertex $\nu_i$ to $\nu_j$. Replacing the adjacency matrix $E$ by its $n$-th power ($n \in \mathbb{N}$, $n \ge 0$), the interpretation is quite similar: each component $[E^n]_{ij}$ of this matrix gives the number of walks of length $n$ from vertex $\nu_i$ to $\nu_j$. It is clear that the maximal indegree equals the maximal column sum of the adjacency matrix and that the maximal outdegree equals the maximal row sum of the adjacency matrix. For $a \ge \Delta^+(G)\Delta^-(G)$, $a^n$ is an upper bound on each component of the matrix $E^n$. This is useful to determine the convergence properties of some graph kernels.
4.3
Product Graph Kernels
In this section we briefly review one of the graph kernels defined in [13]. Technically, this kernel is based on the idea of counting the number of walks in product graphs. Note that the definitions given here are more complicated than those given in [13], as parallel edges have to be considered here.

Product graphs [16] are a very interesting tool in discrete mathematics. The four most important graph products are the Cartesian, the strong, the direct, and the lexicographic product. While the most fundamental one is the Cartesian product, in our context the direct graph product is the most important one. However, we need to extend its definition to labelled directed graphs. For that we need to define a function $match(l_1, l_2)$ that is ‘true’ if the labels $l_1$ and $l_2$ ‘match’. In the simplest case $match(l_1, l_2) \Leftrightarrow l_1 = l_2$. Now we can define the direct product of two graphs as follows.

Definition 2. We denote the direct product of two graphs $G_1 = (\mathcal{V}_1, \mathcal{E}_1, \Psi_1)$, $G_2 = (\mathcal{V}_2, \mathcal{E}_2, \Psi_2)$ by $G_1 \times G_2$. The vertex set of the direct product is defined as:
\[ \mathcal{V}(G_1 \times G_2) = \{(v_1, v_2) \in \mathcal{V}_1 \times \mathcal{V}_2 : match(label(v_1), label(v_2))\} \]
The edge set is then defined as:
\[ \mathcal{E}(G_1 \times G_2) = \{(e_1, e_2) \in \mathcal{E}_1 \times \mathcal{E}_2 : \exists\, (u_1, u_2), (v_1, v_2) \in \mathcal{V}(G_1 \times G_2) \wedge \Psi_1(e_1) = (u_1, v_1) \wedge \Psi_2(e_2) = (u_2, v_2) \wedge match(label(e_1), label(e_2))\} \]
Given an edge $(e_1, e_2) \in \mathcal{E}(G_1 \times G_2)$ with $\Psi_1(e_1) = (u_1, v_1)$ and $\Psi_2(e_2) = (u_2, v_2)$, the value of $\Psi_{G_1 \times G_2}$ is:
\[ \Psi_{G_1 \times G_2}((e_1, e_2)) = ((u_1, u_2), (v_1, v_2)) \]
The labels of the vertices and edges in graph $G_1 \times G_2$ correspond to the labels in the factors. The graphs $G_1, G_2$ are called the factors of graph $G_1 \times G_2$.
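Definition 2 can be made concrete with a short sketch. The graph encoding used here (a dictionary with a vertex-to-label map and a list of `(initial, terminal, label)` edge triples, so that parallel edges are simply repeated list entries) and the function names are assumptions for illustration, with `match` defaulting to label equality as in the simplest case above.

```python
def direct_product(g1, g2, match=lambda a, b: a == b):
    """Direct product of two labelled directed graphs (parallel edges allowed)."""
    # Product vertices: pairs of vertices whose labels match.
    vertices = {(u, v): lu
                for u, lu in g1["vertices"].items()
                for v, lv in g2["vertices"].items()
                if match(lu, lv)}
    # Product edges: pairs of edges with matching labels whose end points
    # are both product vertices.
    edges = [((u1, u2), (v1, v2), l1)
             for (u1, v1, l1) in g1["edges"]
             for (u2, v2, l2) in g2["edges"]
             if (u1, u2) in vertices and (v1, v2) in vertices and match(l1, l2)]
    return {"vertices": vertices, "edges": edges}


def adjacency_matrix(g):
    """[E]_ij = number of edges from the i-th to the j-th vertex (sorted order)."""
    index = {v: i for i, v in enumerate(sorted(g["vertices"]))}
    matrix = [[0] * len(index) for _ in index]
    for u, v, _ in g["edges"]:
        matrix[index[u]][index[v]] += 1
    return matrix


# Toy factors: a two-block stack and a single block, each standing on a floor.
stack = {"vertices": {"a": "block", "b": "block", "f": "floor"},
         "edges": [("a", "b", "on"), ("b", "f", "on")]}
single = {"vertices": {"x": "block", "g": "floor"},
          "edges": [("x", "g", "on")]}

prod_graph = direct_product(stack, single)
prod_adjacency = adjacency_matrix(prod_graph)
```

Only the edge from block b to the floor survives in the product, because the edge from a to b would have to be paired with an end point that maps a block onto the floor.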
Having introduced product graphs, we are now able to define the product graph kernel.

Definition 3. Let $G_1, G_2$ be two graphs, let $E_\times$ denote the adjacency matrix of their direct product, $E_\times = E(G_1 \times G_2)$, and let $\mathcal{V}_\times$ denote the vertex set of the direct product, $\mathcal{V}_\times = \mathcal{V}(G_1 \times G_2)$. With a sequence of weights $\lambda = \lambda_0, \lambda_1, \ldots$ ($\lambda_i \in \mathbb{R}$; $\lambda_i \ge 0$ for all $i \in \mathbb{N}$) the product graph kernel is defined as
\[ k_\times(G_1, G_2) = \sum_{i,j=1}^{|\mathcal{V}_\times|} \left[ \sum_{n=0}^{\infty} \lambda_n E_\times^n \right]_{ij} \]
if the limit exists.

For the proof that this kernel is positive definite, see [13].⁵ There it is shown that this product graph kernel corresponds to the inner product in a feature space made up by all possible contiguous label sequences in the graph. Each feature value corresponds to the number of walks with such a label sequence, weighted by $\sqrt{\lambda_n}$, where $n$ is the length of the sequence.
4.4
Computing Graph Kernels
To compute this graph kernel, it is necessary to compute the limit of the above matrix power series. We now briefly discuss the exponential weight setting (λi = β^i / i!), for which the limit of the above matrix power series always exists, and the geometric weight setting (λi = γ^i). In relatively sparse graphs, however, it is more practical to actually count the number of walks rather than using the closed forms provided below.

Exponential Series. Similar to the exponential of a scalar value (e^b = 1 + b/1! + b^2/2! + b^3/3! + . . .) the exponential of the square matrix E is defined as

\[ e^{\beta E} = \lim_{n \to \infty} \sum_{i=0}^{n} \frac{(\beta E)^i}{i!} \]

where we use β^0/0! = 1 and E^0 = I. Feasible exponentiation of matrices in general requires diagonalising the matrix. If the matrix E can be diagonalised such that E = T^{-1} D T we can easily calculate arbitrary powers of the matrix as E^n = (T^{-1} D T)^n = T^{-1} D^n T and for a diagonal matrix we can calculate the power component-wise, [D^n]_ii = [D_ii]^n. Thus e^{βE} = T^{-1} e^{βD} T where e^{βD} is calculated component-wise. Once the matrix is diagonalised, computing the exponential matrix can be done in linear time. Matrix diagonalisation is a matrix eigenvalue problem and such methods have roughly cubic time complexity.

⁵ The extension to parallel edges is straightforward.
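The exponential weight setting can be illustrated with a short Python sketch. It is not the diagonalisation route described above: as a simpler stand-in, it truncates the power series after a fixed number of terms, which is an assumption made here for brevity. The function names (`mat_mul`, `matrix_exponential`, `exponential_kernel`), the `terms` parameter and the toy matrix are hypothetical and do not come from the paper.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_exponential(E, beta, terms=30):
    """Approximate e^{beta E} by the first `terms` summands of the series.

    Assumption: a truncated series is used instead of diagonalisation.
    """
    n = len(E)
    # `term` holds (beta E)^i / i!, starting with the identity for i = 0.
    term = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    total = [row[:] for row in term]
    for i in range(1, terms):
        term = [[beta * v / i for v in row] for row in mat_mul(term, E)]
        total = [[t + v for t, v in zip(trow, vrow)]
                 for trow, vrow in zip(total, term)]
    return total


def exponential_kernel(E_prod, beta, terms=30):
    """Sum all entries of e^{beta E_prod} (weights lambda_i = beta^i / i!)."""
    return sum(sum(row) for row in matrix_exponential(E_prod, beta, terms))


# Toy adjacency matrix: two vertices joined by an edge in each direction.
E = [[0, 1], [1, 0]]
k = exponential_kernel(E, 0.5)
```

For this toy matrix the exact value of the sum is 2e^β, so the truncated series can be checked against `2 * math.exp(0.5)`.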
Geometric Series. The geometric series Σ_i γ^i is known to converge if and only if |γ| < 1. In this case the limit is given by lim_{n→∞} Σ_{i=0}^{n} γ^i = 1/(1 − γ). Similarly, we define the geometric series of a matrix as

\[ \lim_{n \to \infty} \sum_{i=0}^{n} \gamma^i E^i \]

if γ < 1/a, where a = ∆+(G)∆−(G) as above. Feasible computation of the limit of a geometric series is possible by inverting the matrix I − γE. To see this, let (I − γE)x = 0, thus γEx = x and (γE)^i x = x. Now, note that (γE)^i → 0 as i → ∞. Therefore x = 0 and I − γE is regular. Then (I − γE)(I + γE + γ^2 E^2 + · · ·) = I and (I − γE)^{-1} = (I + γE + γ^2 E^2 + · · ·) is obvious. Matrix inversion is roughly of cubic time complexity.
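The closed form for the geometric weight setting can be sketched in Python as follows. The sketch inverts I − γE with plain Gauss-Jordan elimination, chosen here only to keep the example dependency-free; the names `invert` and `geometric_kernel` and the toy matrix are hypothetical, and the caller is assumed to pick γ small enough for the series to converge.

```python
def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    # Augment the matrix with the identity: [m | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


def geometric_kernel(E_prod, gamma):
    """Sum all entries of (I - gamma E_prod)^{-1} (weights lambda_i = gamma^i)."""
    n = len(E_prod)
    a = [[(1.0 if i == j else 0.0) - gamma * E_prod[i][j] for j in range(n)]
         for i in range(n)]
    return sum(sum(row) for row in invert(a))


# Toy adjacency matrix: two vertices joined by an edge in each direction.
E = [[0, 1], [1, 0]]
k = geometric_kernel(E, 0.5)
```

For this toy matrix every power E^i has entries summing to 2, so the series gives Σ_i 2 · 0.5^i = 4, which the inverse should match.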
5 Kernel Based RRL in the Blocks World

In this section we first show how the states and actions in the blocks world can be represented as a graph. Then we discuss which kernel is used as the covariance function between blocks worlds.

5.1 State and Action Representation
A blocks world consists of a constant number of identical blocks. Each block is put either on the floor or on another block. On top of each block is either another block, or ‘no block’. Figure 1 illustrates a (state, action)-pair in a blocks world with four blocks in two stacks. The right side of figure 1 shows the graph representation of this blocks world. The vertices of the graph correspond either to a block, the floor, or ‘clear’; where ‘clear’ basically denotes ‘no block’. This is reflected in the labels of the vertices. Each edge labelled ‘on’ (solid arrows) denotes that the block corresponding to its initial vertex is on top of the block corresponding to its terminal vertex. The edge labelled ‘action’ (dashed arrow) denotes the action of putting the block corresponding to its initial vertex on top of the block corresponding to its terminal vertex; in the example “put block 4 on block 3”. The labels ‘a/1’ and ‘a/2’ denote the initial and terminal vertex of the action, respectively.

Fig. 1. Simple example of a blocks world state and action (left) and its representation as a graph (right)

To represent an arbitrary blocks world as a labelled directed graph we proceed as follows. Given the set of blocks numbered 1, . . . , n and the set of stacks 1, . . . , m:

1. The vertex set V of the graph is {ν_0, . . . , ν_{n+1}}.
2. The edge set E of the graph is {e_1, . . . , e_{n+m+1}}.

The node ν_0 will be used to represent the floor, ν_{n+1} will indicate which blocks are clear. Since each block is on top of something and each stack has one clear block, we need n + m edges to represent the blocks world state. Finally, one extra edge is needed to represent the action. For the representation of a state it remains to define the function Ψ:

3. For 1 ≤ i ≤ n, we define Ψ(e_i) = (ν_i, ν_0) if block i is on the floor, and Ψ(e_i) = (ν_i, ν_j) if block i is on top of block j.
4. For n < i ≤ n + m, we define Ψ(e_i) = (ν_{n+1}, ν_j) if block j is the top block of stack i − n.

and the function label:

5. We define:
   – L = 2^{{‘floor’},{‘clear’},{‘block’},{‘on’},{‘a/1’},{‘a/2’}},
   – label(ν_0) = {‘floor’}, label(ν_{n+1}) = {‘clear’},
   – label(ν_i) = {‘block’} (1 ≤ i ≤ n),
   – label(e_i) = {‘on’} (1 ≤ i ≤ n + m).

All that is left now is to represent the action in the graph:

6. We define:
   – Ψ(e_{n+m+1}) = (ν_i, ν_j) if block i is moved to block j,
   – label(ν_i) = label(ν_i) ∪ {‘a/1’},
   – label(ν_j) = label(ν_j) ∪ {‘a/2’},
   – label(e_{n+m+1}) = {‘action’}.
It is clear that this mapping from blocks worlds to graphs is injective. In some cases the ‘goal’ of a blocks world problem is to stack blocks in a given configuration (e.g. “put block 3 on top of block 4”). We then need to represent this in the graph. This is handled in the same way as the action representation, i.e. by an extra edge along with extra ‘g/1’, ‘g/2’, and ‘goal’ labels for the initial block, the terminal block, and the new edge, respectively. Note that by using more than one ‘goal’ edge, we can model arbitrary goal configurations, e.g., “put block 3 on top of block 4 and block 2 on top of block 1”.
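The construction of the (state, action) graph described in this section can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a state is assumed to be given as a list of non-empty stacks, each listed bottom to top, blocks are numbered 1, . . . , n, and the function name `blocks_world_graph` and the tuple-based edge format are hypothetical.

```python
def blocks_world_graph(stacks, action):
    """Build vertex labels and labelled edges for a (state, action) pair.

    stacks: list of non-empty lists of block numbers, each bottom to top
            (assumed input format).
    action: pair (i, j) meaning "put block i on block j".
    Returns (labels, edges): labels maps vertex index to a set of labels,
    edges is a list of (initial vertex, terminal vertex, label set).
    """
    n = sum(len(stack) for stack in stacks)
    floor, clear = 0, n + 1  # vertex 0 is the floor, vertex n+1 is 'clear'

    labels = {floor: {"floor"}, clear: {"clear"}}
    for block in range(1, n + 1):
        labels[block] = {"block"}

    # Record what each block rests on: the floor or the block below it.
    below = {}
    for stack in stacks:
        for pos, block in enumerate(stack):
            below[block] = floor if pos == 0 else stack[pos - 1]

    # n 'on' edges for the blocks, then one 'on' edge per stack from 'clear'.
    edges = [(block, below[block], {"on"}) for block in range(1, n + 1)]
    edges += [(clear, stack[-1], {"on"}) for stack in stacks]

    # One extra edge and two extra vertex labels for the action.
    src, dst = action
    edges.append((src, dst, {"action"}))
    labels[src] = labels[src] | {"a/1"}
    labels[dst] = labels[dst] | {"a/2"}
    return labels, edges


# Three blocks in two stacks: block 2 on block 1, block 3 alone; move 2 onto 3.
labels, edges = blocks_world_graph([[1, 2], [3]], (2, 3))
```

With n = 3 blocks and m = 2 stacks this yields n + m + 1 = 6 edges, in line with the edge count given in the text. A goal could be added in the same way with ‘g/1’, ‘g/2’ and ‘goal’ labels.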
5.2 Blocks World Kernels
In finite state-action spaces Q-learning is guaranteed to converge if the mapping between state-action pairs and Q-values is represented explicitly. One advantage of Gaussian processes is that for particular choices of the covariance function, the representation is explicit. To see this we use the matching kernel kδ as the covariance function between examples (kδ : X × X → R is defined as kδ(x, x′) = 1 if x = x′ and kδ(x, x′) = 0 if x ≠ x′). Let the predicted Q-value be the mean of the distribution over target values, i.e., t̂_{n+1} = k^T C^{-1} t where the variables are used as defined in section 3. Assume the training examples are distinct and the test example is equal to the j-th training example. It then turns out that C = I = C^{-1} where I denotes the identity matrix. As furthermore k is then the vector with all components equal to 0 except the j-th which is equal to 1, it is obvious that t̂_{n+1} = t_j and the representation is thus explicit.

A frequently used kernel function for instances that can be represented by vectors is the Gaussian radial basis function kernel (RBF). Given the bandwidth parameter γ the RBF kernel is defined as: k_rbf(x, x′) = exp(−γ||x − x′||^2). For large enough γ the RBF kernel behaves like the matching kernel. In other words, the parameter γ can be used to regulate the amount of generalisation performed in the Gaussian process algorithm: for very large γ all instances are very different and the Q-function is represented explicitly; for small enough γ all examples are considered very similar and the resulting function is very smooth.

In order to have a similar means to regulate the amount of generalisation in the blocks world setting, we do not use the graph kernel proposed in section 4 directly, but ‘wrap’ it in a Gaussian RBF function. Let k be the graph kernel with exponential weights, then the kernel used in the blocks world is given by

\[ k^*(x, x') = \exp\left[-\gamma\left(k(x, x) - 2k(x, x') + k(x', x')\right)\right] \]
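The wrapping of a base kernel in a Gaussian RBF function can be sketched in a few lines of Python. The sketch below is illustrative only: `rbf_wrap` is a hypothetical helper name, and a plain dot product on vectors stands in for the graph kernel so that the example stays self-contained.

```python
import math


def rbf_wrap(kernel, gamma):
    """Return k*(x, y) = exp(-gamma * (k(x, x) - 2 k(x, y) + k(y, y)))."""
    def wrapped(x, y):
        # Squared distance between x and y in the feature space of `kernel`.
        squared_distance = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
        return math.exp(-gamma * squared_distance)
    return wrapped


def dot(x, y):
    """Stand-in base kernel (assumption): the ordinary dot product."""
    return sum(a * b for a, b in zip(x, y))


k_star = rbf_wrap(dot, gamma=0.5)
same = k_star((1.0, 0.0), (1.0, 0.0))       # identical inputs
different = k_star((1.0, 0.0), (0.0, 1.0))  # squared distance 2
```

With the dot product as base kernel, the bracketed term is the squared Euclidean distance, so the wrapped kernel reduces to the usual RBF kernel. Identical inputs always give the value 1, and larger γ pushes the value for distinct inputs towards 0, which is the matching-kernel behaviour discussed above.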
6 Experiments
In this section we describe the tests used to investigate the utility of Gaussian processes and graph kernels as a regression algorithm for RRL. We use the graph-representation of the encountered (state, action)-pairs and the blocks world kernel as described in the previous section. The RRL-system was trained in worlds where the number of blocks varied between 3 and 5, and given “guided” traces [6] in a world with 10 blocks. The Q-function and the related policy were tested at regular intervals on 100 randomly generated starting states in worlds where the number of blocks varied from 3 to 10 blocks. We evaluated RRL with Gaussian processes on three different goals: stacking all blocks, unstacking all blocks and putting two specific blocks on each other. For each goal we ran five times 1000 episodes with different parameter settings to evaluate their influence on the performance of RRL. After that we chose the best parameter setting and ran another ten times 1000 episodes with different
random initialisations. For the “stack-goal” only 500 episodes are shown, as nothing interesting happens thereafter. The results obtained by this procedure are then used to compare the algorithm proposed in this paper with previous versions of RRL.

6.1 Parameter Influence
The kernel used has two parameters that need to be chosen: the exponential weight β (which we shall refer to as exp in the graphs) and the radial basis function parameter γ (which we shall refer to as rbf). The exp-parameter gives an indication of the importance of long walks in the product graph. Higher exp-values mean a higher weight for long walks. The rbf-parameter gives an indication of the amount of generalisation that should be done. Higher rbf-values mean lower σ-values for the radial basis functions and thus less generalisation. We tested the behaviour of RRL with Gaussian processes on the “stack-goal” with a range of different values for the two parameters. The experiments were all repeated five times with different random seeds. The results are summarised in figure 2. The graph on the left shows that for small exp-values RRL cannot learn the task of stacking all blocks. This makes sense, since we are trying to create a blocks-world-graph which has the longest walk possible, given a certain number of blocks. However, for very large values of exp we have to use equally small values of rbf to avoid numeric overflows in our calculations, which in turn results in non-optimal behaviour. The right side of figure 2 shows the influence of the rbf-parameter. As expected, smaller values result in faster learning, but when choosing too small rbf-values, RRL cannot learn the correct Q-function and does not learn an optimal strategy.
[Plots: “Varying the Exponential Weight” (left; curves for exp=1, exp=10 and exp=100 with rbf=0.1, and exp=1000 with rbf=0.00001) and “Varying the RBF-parameter” (right; curves for exp=10 with rbf=0.001, 0.1, 10 and 100). Axes: Average Reward against Number of Episodes.]

Fig. 2. Comparing parameter influences for the stack goal

For the “unstack-” and “on(A,B)-goal”, the influence of the exp-parameter is smaller as shown in the left sides of figure 3 and figure 4 respectively. For the “unstack-goal” there is even little influence from the rbf-parameter as shown in the right side of figure 3 although it seems that average values work best here as well.
[Plots: “Varying the Exponential Weight” (left; curves for exp=1, exp=10 and exp=100 with rbf=0.1) and “Varying the RBF-parameter” (right; curves for exp=10 with rbf=0.001, 0.1 and 10). Axes: Average Reward against Number of Episodes.]

Fig. 3. Comparing parameter influences for the unstack goal
The results for the “on(A,B)-goal” however, show a large influence of the rbf-parameter (right side of figure 4). In previous work we have always noticed that “on(A,B)” is a hard problem for RRL to solve [8,6]. The results we obtained with RRL-Kbr give an indication why. The learning-curves show that the performance of the resulting policy is very sensitive to the amount of generalisation that is used. The performance of RRL drops rapidly as a result of over- or under-generalisation.
[Plots: “Varying the Exponential Weight” (left; curves for exp=1, exp=10 and exp=100 with rbf=0.1) and “Varying the RBF-parameter” (right; curves for exp=10 with rbf=0.00001, 0.001, 0.1 and 10). Axes: Average Reward against Number of Episodes.]

Fig. 4. Comparing parameter influences for the on(A,B) goal
6.2 Comparison with Previous RRL-Implementations
Figure 5 shows the results of RRL-Kbr on the three blocks world problems in relation to the two previous implementations of RRL, i.e. RRL-Tg and RRL-Rib. For each goal we chose the best parameter settings from the experiments described above and ran another ten times 1000 episodes. These ten runs were initialised with different random seeds than the experiments used to choose the parameters.
[Plots: three panels titled “Stack”, “Unstack” and “On(A,B)”, each with curves for RRL-TG, RRL-RIB and RRL-KBR. Axes: Average Reward against Number of Episodes.]

Fig. 5. Comparing Kernel Based RRL with previous versions

RRL-Kbr clearly outperforms RRL-Tg with respect to the number of episodes needed to reach a certain level of performance. Note that the comparison as given in figure 5 is not entirely fair to RRL-Tg. Although RRL-Tg does need a lot more episodes to reach a given level of performance, it processes these episodes much faster. This advantage is, however, lost when acting in expensive or slow environments. RRL-Kbr performs better than RRL-Rib on the “stack-goal” and obtains comparable results on the “unstack-goal” and on the “on(A,B)-goal”. Our current implementation of RRL-Kbr is competitive with RRL-Rib in computation times and performance. However, a big advantage of RRL-Kbr is the possibility to achieve further improvements with fairly simple modifications, as we will outline in the next section.
7 Conclusions and Future Work
In this paper we proposed Gaussian processes and graph kernels as a new regression algorithm in relational reinforcement learning. Gaussian processes have been chosen as they are able to make probabilistic predictions and can be learned incrementally. The use of graph kernels as the covariance functions allows for a structural representation of states and actions. Experiments in the blocks world show comparable and even better performance for RRL using Gaussian processes when compared to previous implementations: decision tree based RRL and instance based RRL.

As shown in [13] the definition of useful kernel functions on graphs is hard, as most appealing kernel functions cannot be computed in polynomial time. Apart from the graph kernel used in this paper, [13] suggests some other kernels and discusses variants of them. Their applicability in RRL will be investigated in future work.

With graph kernels it is not only possible to apply Gaussian processes in RRL but also other regression algorithms can be used. Future work will investigate how reinforcement techniques such as local linear models [23] and the use of convex hulls to make safe predictions [25] can be applied in RRL.

A promising direction for future work is also to exploit the probabilistic predictions made available in RRL by the algorithm suggested in this paper. The obvious use of these probabilities is to exploit them during exploration. Actions or even entire state-space regions with low confidence on their Q-value predictions could be given a higher exploration priority. This approach is similar to interval based exploration techniques [17] where the upper bound of an estimation interval is used to guide the exploration into highly promising regions of the state-action space. In the case of RRL-Kbr these upper bounds could be replaced with the upper bound of a 90% or 95% confidence interval.
So far, we have not put any selection procedures on the (state, action, Q-value) examples that are passed to the Gaussian processes algorithm by RRL. Another use of the prediction probabilities would be to use them as a filter to limit the examples that need to be processed. This would cause a significant speedup of the regression engine. Other instance selection strategies that might be useful are suggested in [7] and have been successfully applied there in instance based RRL.

Many interesting reinforcement learning problems apart from the blocks world also have an inherently structural nature. To apply Gaussian processes and graph kernels to these problems, the state-action pairs just need to be represented by graphs. Future work will explore such applications.

The performance of the algorithm presented in this paper could be improved by using only an approximate inverse in the Gaussian process. The size of the kernel matrix could be reduced by so-called instance averaging techniques [10]. While the explicit construction of average instances is far from trivial, the kernel between such average instances and test instances can still be computed easily without ever constructing average instances.

In our empirical evaluation, the algorithm presented in this paper proved competitive with or better than the previous implementations of RRL. From our
point of view, however, this is not the biggest advantage of using graph kernels and Gaussian processes in RRL. The biggest advantages are the elegance and potential of our approach. Very good results could be achieved without sophisticated instance selection or averaging strategies. The generalisation ability can be tuned by a single parameter. Probabilistic predictions can be used to guide exploration of the state-action space.

Acknowledgements

Thomas Gärtner is supported in part by the DFG project (WR 40/2-1) Hybride Methoden und Systemarchitekturen für heterogene Informationsräume. Jan Ramon is a post-doctoral fellow of the Katholieke Universiteit Leuven. The authors thank Peter Flach, Tamás Horváth, Stefan Wrobel, and Sašo Džeroski for valuable discussions.
References

1. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 1950.
2. S. Barnett. Matrix Methods for Engineers and Scientists. McGraw-Hill, 1979.
3. M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.
4. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000.
5. R. Diestel. Graph Theory. Springer-Verlag, 2000.
6. K. Driessens and S. Džeroski. Integrating experimentation and guidance in relational reinforcement learning. In C. Sammut and A. Hoffmann, editors, Proceedings of the Nineteenth International Conference on Machine Learning, pages 115–122. Morgan Kaufmann Publishers, Inc, 2002.
7. K. Driessens and J. Ramon. Relational instance based regression for relational reinforcement learning. In Proceedings of the 20th International Conference on Machine Learning (to be published), 2003.
8. K. Driessens, J. Ramon, and H. Blockeel. Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner. In L. De Raedt and P. Flach, editors, Proceedings of the 13th European Conference on Machine Learning, volume 2167 of Lecture Notes in Artificial Intelligence, pages 97–108. Springer-Verlag, 2001.
9. S. Džeroski, L. De Raedt, and H. Blockeel. Relational reinforcement learning. In Proceedings of the 15th International Conference on Machine Learning, pages 136–143. Morgan Kaufmann, 1998.
10. J. Forbes and D. Andre. Representations for learning control policies. In E. de Jong and T. Oates, editors, Proceedings of the ICML-2002 Workshop on Development of Representations, pages 7–14. The University of New South Wales, Sydney, 2002.
11. T. Gärtner. Exponential and geometric kernels for graphs. In NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data, 2002.
12. T. Gärtner. Kernel-based multi-relational data mining. SIGKDD Explorations, 2003.
13. T. Gärtner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the 16th Annual Conference on Computational Learning Theory and the 7th Kernel Workshop, 2003.
14. T. Gärtner, J. W. Lloyd, and P. A. Flach. Kernels for structured data. In Proceedings of the 12th International Conference on Inductive Logic Programming. Springer-Verlag, 2002.
15. D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.
16. W. Imrich and S. Klavžar. Product Graphs: Structure and Recognition. John Wiley, 2000.
17. L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
18. H. Kashima and A. Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, 2002.
19. B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer-Verlag, 2002.
20. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, 2002.
21. D. J. C. MacKay. Introduction to Gaussian processes. Available at http://wol.ra.phy.cam.ac.uk/mackay, 1997.
22. D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.
23. S. Schaal, C. G. Atkeson, and S. Vijayakumar. Real-time robot learning with locally weighted statistical learning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 288–293. IEEE Press, Piscataway, N.J., 2000.
24. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
25. W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the 17th International Conference on Machine Learning, pages 903–910. Morgan Kaufmann, 2000.
26. R. Sutton and A. Barto. Reinforcement Learning: an introduction. The MIT Press, Cambridge, MA, 1998.
27. C. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
28. A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16(9):799–807, 2000.
On Condensation of a Clause

Kouichi Hirata

Department of Artificial Intelligence, Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan
Tel: +81-948-29-7884, Fax: +81-948-29-7601
[email protected]
Abstract. In this paper, we investigate condensation of a clause. First, we extend a substitution graph introduced by Scheffer et al. (1996) to a total matcher graph. Then, we give a correct proof of the relationship between subsumption and the existence of cliques in a total matcher graph. Next, we introduce the concept of width of a clique in a total matcher graph. As a corollary of the above relationship, we show that the minimum condensation of a clause corresponds to the clique with the minimum width in a total matcher graph. Finally, we design a greedy algorithm for finding a condensation of a clause, as an algorithm that finds cliques with as small a width as possible in the total matcher graph of a clause.
1 Introduction
Removing redundancy of clauses is one of the most important research problems in clause-based systems. There exist mainly two types of redundancy, based on logical implication and on subsumption [5,8,12]. The redundancy based on logical implication follows from the existence of self-resolvents up to logical equivalence, and the redundancy based on subsumption follows from the existence of literals up to subsume-equivalence. While removing the former redundancy is known to be undecidable, removing the latter redundancy is decidable. In this paper, we pay our attention to the latter redundancy.

The redundancy based on subsumption is formulated as condensation [8,11,15,16]. For clauses C and D as sets of literals, C is a condensation of D if C ⊆ D and D subsumes C, and C is the minimum condensation of D if C is a condensation of D with the minimum cardinality. Furthermore, C is condensed if there exists no proper subset D ⊂ C such that C and D are subsume-equivalent¹.
This work is partially supported by Japan Society for the Promotion of Science, Grants-in-Aid for Encouragement of Young Scientists (B) 15700137 and for Scientific Research (B) 13558036.

¹ The words reduction and reduced [15,16] are often used instead of the words condensation and condensed [8,11]. Furthermore, we use these words in a sense slightly different from [8].
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 164–179, 2003. © Springer-Verlag Berlin Heidelberg 2003
In Inductive Logic Programming (ILP, for short), it is necessary to deal with the condensation in order to update the hypotheses obtained from the least generalization [15,16], for example. In particular, the condensation is deeply related to PAC-learnability of function-free logic programs [9,10].

Plotkin [15,16] and Joyner [8,11] independently have investigated the condensation and given algorithms to compute the minimum condensation of a clause. Gottlob and Fermüller [8] have improved the algorithm. All of the condensation algorithms are based on the subsumption check, which determines whether or not a clause subsumes another clause. Intuitively, for a given clause C, the condensation algorithm first checks whether or not C subsumes C − {L} for each literal L ∈ C and then updates C to C − {L} if it holds. Since the subsumption check is NP-complete [3,7], the above condensation algorithm is intractable.

From the viewpoint of computational complexity, Gottlob and Fermüller [8] have shown that the problem of determining whether or not a clause is condensed is coNP-complete. Recently, Chekuri and Rajaraman [6] have shown that the problem of minimizing the condensation of a clause is not polynomial-time approximable unless NP = ZPP. In order to avoid the above difficulty of condensation, we adopt membership queries (or subsumption check) as oracles in the framework of query learning for ILP [1,17,18].

On the other hand, Scheffer et al. [19] have introduced a substitution graph as the mapping from a subsumption problem to a certain problem of finding cliques of a fixed size in a graph. They have claimed that a clause subsumes another clause if and only if their substitution graph contains a clique whose size is the cardinality of the former clause. Furthermore, they have designed the exact algorithm of subsumption check based on the dynamic programming of finding cliques in a graph introduced by Carraghan and Pardalos [4].
In this paper, we study the condensation of a clause based on the substitution graph. Note first that we cannot apply the substitution graph to the condensation of a clause directly, because the substitution graph [19] is constructed under the implicit assumption that two clauses share no common variables. Hence, in this paper, we introduce a total matcher that is a substitution allowing trivial bindings and extend the substitution graph to a total matcher graph. Then, we give a correct proof of the above relationship between subsumption and the existence of cliques given in [19], by replacing the substitution graph with the total matcher graph.

Next, we apply the above relationship to condensation of a clause, by constructing a total matcher graph from a given single clause. We introduce the concept of the width of a clique in a total matcher graph. Then, we show that the minimum condensation of a clause is obtained by applying the substitution corresponding to the clique whose size is the cardinality of the clause and whose width is minimum in a total matcher graph.

Finally, we design a greedy algorithm CondWidth for finding a condensation of a clause. The algorithm CondWidth consists of finding cliques with as small a width as possible in the total matcher graph of a given clause and, unlike [8,11,15,16], does not use the subsumption check. Also the algorithm CondWidth is a variant of
the greedy algorithm Ramsey of finding cliques of a graph introduced by Boppana and Halldórsson [2]. While they have designed it as recursive, we design the algorithm CondWidth as iterative, because of the property of a total matcher graph. Then, we present several properties of the algorithm CondWidth.

2 Preliminaries and Previous Works
In this section, we prepare some notions necessary for the later discussion. For details, see [5,12,13,15].

A term, an atom and a literal are defined as usual. A clause is a finite set of literals. All of them are sometimes called expressions. For an expression E, var(E) and |E| denote the set of variables occurring in E and the size of E, that is, the number of symbols appearing in E, respectively. #S denotes the cardinality of a set S.

A substitution is a partial mapping from variables to constant symbols or variables. We will represent substitutions with the Greek letters θ and σ and (when necessary) write them as sets θ = {t1/x1, . . . , tn/xn} where xi is a variable and ti is a term (1 ≤ i ≤ n), where we assume that xi ≠ ti. We denote {x1, . . . , xn} by dom(θ), and call each ti/xi in θ a binding of θ. An empty set ∅ is called an empty substitution and denoted by ε. For an expression E, Eθ denotes the result of replacing each variable xi in E with ti simultaneously. For two substitutions θ = {t1/x1, . . . , tn/xn} and σ = {s1/y1, . . . , sm/ym}, a composition θσ of θ and σ is defined as follows.

\[ \theta\sigma = \{ t_i\sigma/x_i \mid t_i\sigma \neq x_i,\ 1 \le i \le n \} \cup \{ s_j/y_j \mid y_j \notin dom(\theta),\ 1 \le j \le m \} \]

Since it holds that (Eθ)σ = E(θσ) and (θσ)τ = θ(στ) for each expression E and each substitutions θ, σ and τ, we omit the parentheses for compositions of substitutions.

Let C and D be clauses. We say that C subsumes D, denoted by C ⪰ D, if there exists a substitution θ such that Cθ ⊆ D. If C ⪰ D and D ⪰ C, then we call C and D subsume-equivalent and denote this by C ∼ D.

Definition 1. Let C and D be clauses.
1. We say that C is a condensation of D if C ⊆ D and D ⪰ C.
2. We say that C is the minimum condensation of D if C is a condensation of D satisfying that #C ≤ #E for each condensation E of D.
3. We say that C is condensed if just C itself is the condensation of C.

In other words, C is condensed if there exists no proper subset D ⊂ C such that C ⪰ D (or equivalently, C ∼ D).

Proposition 1. Let C and D be clauses. If C is the minimum condensation of D, then C is condensed. However, the converse does not hold in general.
Proof. Suppose that C is the minimum condensation of D but not condensed. Then, there exists a clause E ⊂ C such that C ⪰ E. Since C is a condensation of D, it holds that C ⊆ D and D ⪰ C. Hence, it holds that D ⪰ C ⪰ E and E ⊂ C ⊆ D, so E is a condensation of D. However, this contradicts the assumption that the cardinality of C is minimum.

On the other hand, let C and D be the following clauses.

C = {p(x, y), p(y, x)},
D = {p(x, y), p(y, x), p(u, v), p(v, w), p(w, u)}.

Then, C ⊆ D and C is condensed, but C is not the minimum condensation of D; the minimum condensation of D is D itself.

From the viewpoint of computational complexity, we introduce the following decision problem.

Condensation [8]
Instance: A clause C.
Question: Is C condensed?

Theorem 1 (Gottlob and Fermüller [8]). The problem Condensation is coNP-complete.

Also we introduce the following optimization problem for Condensation.

Minimum Condensation [6]
Instance: A clause C.
Solution: A condensation C′ of C.
Measure: The cardinality of C′, i.e., #C′.

Theorem 2 (Chekuri and Rajaraman [6]). Unless NP = ZPP, for every ε > 0, there exists no polynomial-time approximation algorithm for the problem Minimum Condensation with an approximation guarantee better than m^{1−ε}, where m is the number of literals in the condensed clause.

The above hardness results also hold even if a clause is function-free and contains just positive literals with a single binary predicate symbol. Hence, in the remainder of this paper, we assume that a term is either a constant symbol (denoted by a, b, c, . . . possibly with subscripts) or a variable (denoted by x, y, z, . . . possibly with subscripts).

Finally, we prepare some graph-theoretic notions. A graph G is a pair (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges. A clique of G is a set C ⊆ V such that, for each u, v ∈ C, it holds that (u, v) ∈ E. For a graph G = (V, E) and v ∈ V, N(v) refers to the subgraph induced by the neighbors (the adjacent vertices) of v in G and N̄(v) to the subgraph induced by the non-neighbors of v in G, respectively.
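The notions of subsumption, condensation and condensed clause from Definition 1 can be checked on small function-free clauses with a brute-force Python sketch. It is exponential in the worst case, consistent with the hardness results above, and is meant only as an illustration. The representation is an assumption: a literal is a tuple (predicate, term, ...), and variables are written as strings starting with an upper-case letter, whereas the paper writes them in lower case. All function names are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def match(literal, target, theta):
    """Extend theta so that literal*theta == target, or return None."""
    if literal[0] != target[0] or len(literal) != len(target):
        return None
    theta = dict(theta)
    for s, t in zip(literal[1:], target[1:]):
        if is_var(s):
            # A variable may be bound only once; later uses must agree.
            if theta.setdefault(s, t) != t:
                return None
        elif s != t:
            return None
    return theta


def subsumes(c, d, theta=None):
    """True if some substitution theta satisfies c*theta ⊆ d (backtracking)."""
    theta = {} if theta is None else theta
    c, d = list(c), list(d)
    if not c:
        return True
    first, rest = c[0], c[1:]
    for target in d:
        extended = match(first, target, theta)
        if extended is not None and subsumes(rest, d, extended):
            return True
    return False


def is_condensation(c, d):
    """C is a condensation of D if C ⊆ D and D subsumes C."""
    return set(c) <= set(d) and subsumes(d, c)


def is_condensed(c):
    """C is condensed if it subsumes no proper subset of itself.

    Checking the subsets C - {L} suffices, since every proper subset
    is contained in one of them.
    """
    c = set(c)
    return not any(subsumes(c, c - {lit}) for lit in c)


# The clauses from the proof of Proposition 1, with upper-case variables.
C = {("p", "X", "Y"), ("p", "Y", "X")}
D = C | {("p", "U", "V"), ("p", "V", "W"), ("p", "W", "U")}
```

On these clauses, `is_condensed(C)` and `is_condensed(D)` both hold while `is_condensation(C, D)` does not, which agrees with the counterexample in the proof: C is condensed and contained in D, yet D does not subsume C.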
3 Total Matcher Graph
Scheffer et al. [19] have introduced a mapping from the subsumption problem to the problem of finding a clique of a fixed size in a graph. Based on this mapping, they have designed an algorithm to check whether or not a clause subsumes another clause, using the dynamic algorithm by Carraghan and Pardalos [4] for finding a clique in a graph.

We say that two substitutions θ and τ are strongly compatible if it holds that θτ = τθ, that is, no variable is mapped to different terms by θ and τ. For two literals L and M, a substitution θ is called a matcher of L with M if Lθ = M.

Definition 2. Let C and D be clauses {L1, . . . , Lm} and {M1, . . . , Mn}, respectively. Then, a substitution graph SG_D(C) = (V_D(C), E_D(C)) of C w.r.t. D is defined as follows.
1. V_D(C) = ⋃_{i=1}^{m} ⋃_{j=1}^{n} {θij | Li θij = Mj}.
2. (θij, θkl) ∈ E_D(C) if and only if θij and θkl are strongly compatible and i ≠ k.

Example 1. Let C and D be the following clauses.
C = {p(x, y), p(y, z), p(z, x)} = {L1, L2, L3},
D = {p(a, b), p(b, c), p(c, a), p(d, a)} = {M1, M2, M3, M4}.
Then, V_D(C) is described as follows.

      M1                M2                M3                M4
L1    θ11 = {a/x, b/y}  θ12 = {b/x, c/y}  θ13 = {c/x, a/y}  θ14 = {d/x, a/y}   (L1 θ1i = Mi)
L2    θ21 = {a/y, b/z}  θ22 = {b/y, c/z}  θ23 = {c/y, a/z}  θ24 = {d/y, a/z}   (L2 θ2i = Mi)
L3    θ31 = {a/z, b/x}  θ32 = {b/z, c/x}  θ33 = {c/z, a/x}  θ34 = {d/z, a/x}   (L3 θ3i = Mi)

Hence, we obtain the substitution graph shown in Fig. 1, and we can find the cliques {θ11, θ22, θ33}, {θ12, θ23, θ31} and {θ13, θ21, θ32} of size 3, corresponding to the substitutions {a/x, b/y, c/z}, {b/x, c/y, a/z} and {c/x, a/y, b/z}, respectively.

In order to solve the subsumption problem of clauses efficiently, Scheffer et al. have introduced the following theorem.

Theorem 3 (Scheffer et al. [19]). Let C and D be clauses, where #C = m. Then, C ⪰ D if and only if there exists a clique of size m in SG_D(C). Furthermore, if such a clique is given as {θ1, . . . , θm}, then it holds that Cθ1 · · · θm ⊆ D.

In Theorem 3, however, it is implicitly assumed that C and D share no common variables. Hence, the empty substitution is not considered in constructing the substitution graph [19]. If this assumption does not hold, then neither does Theorem 3, as the following example shows.
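Before turning to that counterexample, Example 1 and Theorem 3 can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: literals are encoded as tuples such as ("p", "x", "y"), the set VARIABLES and the helper names matcher, strongly_compatible and full_cliques are hypothetical and do not come from [19], and the cliques are enumerated by brute force rather than by the algorithm of Carraghan and Pardalos [4].

```python
from itertools import product

VARIABLES = {"x", "y", "z"}  # assumption: any other term is a constant


def matcher(lit, target):
    """Return the substitution theta with lit*theta == target, or None."""
    if lit[0] != target[0] or len(lit) != len(target):
        return None
    theta = {}
    for s, t in zip(lit[1:], target[1:]):
        if s in VARIABLES:
            if theta.setdefault(s, t) != t:  # one variable, two different terms
                return None
        elif s != t:                         # constants must match exactly
            return None
    return theta


def strongly_compatible(theta, tau):
    """No variable is mapped to different terms by theta and tau."""
    return all(tau[v] == t for v, t in theta.items() if v in tau)


def full_cliques(C, D):
    """All cliques of size #C, as tuples (j_1, ..., j_m) of 1-based indices into D."""
    rows = []
    for lit in C:
        row = []
        for j, target in enumerate(D, start=1):
            theta = matcher(lit, target)
            if theta is not None:
                row.append((j, theta))
        rows.append(row)
    cliques = []
    for choice in product(*rows):  # edges require i != k: one node per literal of C
        thetas = [theta for _, theta in choice]
        if all(strongly_compatible(a, b)
               for k, a in enumerate(thetas) for b in thetas[k + 1:]):
            cliques.append(tuple(j for j, _ in choice))
    return cliques


C = [("p", "x", "y"), ("p", "y", "z"), ("p", "z", "x")]
D = [("p", "a", "b"), ("p", "b", "c"), ("p", "c", "a"), ("p", "d", "a")]
print(full_cliques(C, D))
```

For the clauses of Example 1 the sketch reports the index tuples (1, 2, 3), (2, 3, 1) and (3, 1, 2), which correspond to the cliques {θ11, θ22, θ33}, {θ12, θ23, θ31} and {θ13, θ21, θ32} named above. The enumeration is exponential in #C and is meant for small examples only.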
Fig. 1. The substitution graph SG_D(C) in Example 1

Example 2. Let C and D be the following clauses.
C = {p(x, y), p(y, z), p(x, w), p(w, z)} = {L1, L2, L3, L4},
D = {p(x, y), p(y, z), p(x, w)} = {L1, L2, L3}.
Then, V_D(C) is described as follows.

      L1                L2                L3
L1    θ11 = ε           θ12 = {y/x, z/y}  θ13 = {w/y}        (L1 θ1i = Li)
L2    θ21 = {x/y, y/z}  θ22 = ε           θ23 = {x/y, w/z}   (L2 θ2i = Li)
L3    θ31 = {y/w}       θ32 = {y/x, z/w}  θ33 = ε            (L3 θ3i = Li)
L4    θ41 = {x/w, y/z}  θ42 = {y/w}       θ43 = {x/w, w/z}   (L4 θ4i = Li)
Since the empty substitution ε is strongly compatible with every substitution, V1 = {θ11, θ22, θ33, θ43} is a clique of size 4 in SG_D(C). However, the following statement holds.
Cθ11θ22θ33θ43 = {p(x, y), p(y, w), p(x, x), p(x, w)} ⊈ D.
This counterexample to Theorem 3 arises because we deal with substitutions as nodes in a substitution graph. Hence, in this paper, we extend a substitution graph to a total matcher graph.

For literals L and M, a matcher θ of L with M is called total if dom(θ) = var(L), where redundant bindings such as x/x are allowed. If var(L) = ∅ and L = M, then we denote the total matcher of L with M by the empty substitution ε.

Definition 3. For clauses C and D, a total matcher graph MG_D(C) of C w.r.t. D is a substitution graph SG_D(C) all of whose nodes are total matchers.

Example 3. For the clauses C and D in Example 2, V_D(C) is described as follows.

      L1                L2                L3
L1    θ11 = {x/x, y/y}  θ12 = {y/x, z/y}  θ13 = {x/x, w/y}   (L1 θ1i = Li)
L2    θ21 = {x/y, y/z}  θ22 = {y/y, z/z}  θ23 = {x/y, w/z}   (L2 θ2i = Li)
L3    θ31 = {x/x, y/w}  θ32 = {y/x, z/w}  θ33 = {x/x, w/w}   (L3 θ3i = Li)
L4    θ41 = {x/w, y/z}  θ42 = {y/w, z/z}  θ43 = {x/w, w/z}   (L4 θ4i = Li)
Since θ43 and θ22 (or θ33) are not strongly compatible, V1 in Example 2 is no longer a clique of size 4 in MG_D(C), which is shown in Fig. 2. Instead, we can find the unique clique V2 = {θ11, θ22, θ31, θ42} of size 4.

Fig. 2. The total matcher graph MG_D(C) in Example 3
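The difference between Examples 2 and 3 can be reproduced with a short Python sketch. It is an illustration only: the tuple encoding of literals, the flag total, and the helpers matcher, full_cliques and apply_clique are assumptions introduced here. With total=False, identity bindings such as x/x are dropped, as in the substitution graph of Example 2; with total=True they are kept, as in Definition 3. The clique substitution is applied as one simultaneous substitution, which agrees with the composition used in Example 2.

```python
from itertools import product

VARIABLES = {"x", "y", "z", "w"}  # Examples 2 and 3 contain no constants


def matcher(lit, target, total):
    """Matcher of lit with target; identity bindings are kept only if total."""
    if lit[0] != target[0] or len(lit) != len(target):
        return None
    theta = {}
    for s, t in zip(lit[1:], target[1:]):
        if s in VARIABLES:
            if theta.setdefault(s, t) != t:
                return None
        elif s != t:
            return None
    if not total:
        theta = {v: t for v, t in theta.items() if v != t}  # drop x/x
    return theta


def strongly_compatible(theta, tau):
    return all(tau[v] == t for v, t in theta.items() if v in tau)


def full_cliques(C, D, total):
    """Cliques of size #C as tuples of 1-based indices into D."""
    rows = []
    for lit in C:
        row = []
        for j, target in enumerate(D, start=1):
            theta = matcher(lit, target, total)
            if theta is not None:
                row.append((j, theta))
        rows.append(row)
    return [tuple(j for j, _ in choice)
            for choice in product(*rows)
            if all(strongly_compatible(a[1], b[1])
                   for k, a in enumerate(choice) for b in choice[k + 1:])]


def apply_clique(C, D, js, total):
    """Apply the union of the chosen matchers to C as one substitution."""
    merged = {}
    for lit, j in zip(C, js):
        merged.update(matcher(lit, D[j - 1], total))
    return {(lit[0],) + tuple(merged.get(t, t) for t in lit[1:]) for lit in C}


C = [("p", "x", "y"), ("p", "y", "z"), ("p", "x", "w"), ("p", "w", "z")]
D = [("p", "x", "y"), ("p", "y", "z"), ("p", "x", "w")]

v1 = (1, 2, 3, 3)  # {θ11, θ22, θ33, θ43} from Example 2
print(v1 in full_cliques(C, D, total=False))             # a clique of SG_D(C) ...
print(apply_clique(C, D, v1, total=False) <= set(D))     # ... whose image leaves D
print(full_cliques(C, D, total=True))                    # cliques of MG_D(C)
```

Under these assumptions the first two lines print True and False, reproducing the counterexample, and the last line prints the single tuple (1, 2, 1, 2), that is, the clique V2 = {θ11, θ22, θ31, θ42} of Example 3.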
Theorem 4 (Improvement of Theorem 3). Let C and D be clauses, where #C = m. Then, C ⪰ D if and only if there exists a clique of size m in MG_D(C). Furthermore, if such a clique is given as {θ1, . . . , θm}, then it holds that Cθ1 · · · θm ⊆ D.

Proof. Suppose that D = {M1, . . . , Mn}. We show the statement by induction on m. If m = 1, that is, if C = {L1}, then MG_D(C) is of the form ({θ11, . . . , θ1n}, ∅). Note that θ1i is the matcher of L1 with Mi if one exists. Then, there exists a clique {θ1i} of size 1 in MG_D(C) if and only if L1θ1i = Mi, that is, C ⪰ D.

Next, we assume that the statement holds for m and consider the case m + 1. Let C and C′ be clauses {L1, . . . , Lm} and C ∪ {L_{m+1}}, respectively. Suppose that there exists a clique {θ_{1j_1}, . . . , θ_{mj_m}, θ_{m+1 j_{m+1}}} of size m + 1 in MG_D(C′). Then, {θ_{1j_1}, . . . , θ_{mj_m}} is a clique of size m in MG_D(C), and θ_{m+1 j_{m+1}} is strongly compatible with each θ_{ij_i} (1 ≤ i ≤ m). By the induction hypothesis, it holds that Cθ_{1j_1} · · · θ_{mj_m} ⊆ D. By the existence of θ_{m+1 j_{m+1}}, it holds that L_{m+1}θ_{m+1 j_{m+1}} = M_{j_{m+1}}. Note that, by strong compatibility, θ_{m+1 j_{m+1}} and θ_{1j_1} · · · θ_{mj_m} map each variable x to the same term. Hence, it holds that (C ∪ {L_{m+1}})θ_{1j_1} · · · θ_{mj_m}θ_{m+1 j_{m+1}} ⊆ D, that is, C′ ⪰ D.

Conversely, suppose that C′ ⪰ D. Then, there exist total matchers θ_{1j_1}, . . . , θ_{mj_m}, θ_{m+1 j_{m+1}} such that L_iθ_{ij_i} = M_{j_i} (1 ≤ i ≤ m + 1). By the induction hypothesis, θ_{ij_i} and θ_{kj_k} are strongly compatible for each 1 ≤ i ≠ k ≤ m. Suppose that there exists an index l (1 ≤ l ≤ m) such that θ_{m+1 j_{m+1}} and θ_{lj_l} are not strongly compatible. Then, there exists a variable x ∈ dom(θ_{m+1 j_{m+1}}) ∩ dom(θ_{lj_l}) such that xθ_{m+1 j_{m+1}} ≠ xθ_{lj_l}. Since each θ_{ij_i} (1 ≤ i ≤ m + 1) is total, such a variable x is in var(L_{m+1}) ∩ var(L_l), so it holds that L_lθ_{m+1 j_{m+1}} ≠ L_lθ_{lj_l} and L_{m+1}θ_{m+1 j_{m+1}} ≠ L_{m+1}θ_{lj_l}. If θ_{lj_l} is applied to C′ before θ_{m+1 j_{m+1}}, then L_{m+1} is no longer mapped to M_{j_{m+1}}. Also, if θ_{m+1 j_{m+1}} is applied to C′ before θ_{lj_l}, then L_l is no longer mapped to M_{j_l}. Both cases are contradictions. Hence, θ_{m+1 j_{m+1}} and each θ_{ij_i} (1 ≤ i ≤ m) are strongly compatible, so {θ_{1j_1}, . . . , θ_{mj_m}, θ_{m+1 j_{m+1}}} is a clique of size m + 1 in MG_D(C′).
4 Width for Cliques
In order to find the minimum condensation of a clause C, we focus on the total matcher graph MG_C(C) of C w.r.t. C itself. We call it the total matcher graph of C and denote it by MG(C). The vertices and edges of MG(C) are denoted by V(C) and E(C), respectively.

Definition 4. Let C be a clause such that #C = m, MG(C) the total matcher graph of C, and cl a clique {θ_{1j_1}, . . . , θ_{mj_m}} of size m in MG(C). Then, the number of different indices j_i (1 ≤ i ≤ m), i.e., #{j_1, . . . , j_m}, is called the width of cl and denoted by w(cl). Also, the composition θ_{1j_1} · · · θ_{mj_m} of the substitutions in cl is called the clique substitution of cl and denoted by θ_cl.

For a clause C = {L1, . . . , Lm}, there always exists a clique cl0 = {θ11, . . . , θmm} of size m and width m. We call such a cl0 the trivial clique in MG(C). The clique substitution θ_{cl0} of cl0 is the empty substitution. Then, we can characterize the condensed clause and the minimum condensation in terms of the cliques of a total matcher graph. First, we obtain the following corollary.

Corollary 1. Let C be a clause. Then, C is condensed if and only if the only clique of size #C in MG(C) is the trivial clique.
Proof. By Theorem 4, there exists a non-trivial clique cl of size #C in MG(C) if and only if Cθ_cl ⊂ C, that is, C is not condensed.

Theorem 1 intuitively follows from Corollary 1. A clause C is not condensed if and only if there exists a non-trivial clique of size #C in MG(C), which is an instance of the famous NP-complete problem Clique [7]. Hence, Condensation is coNP-complete. Furthermore, we also obtain the following corollary of Theorem 4.

Corollary 2. Let C be a clause and suppose that D ⊆ C. Then, D is the minimum condensation of C if and only if there exists a clique cl of size #C with the minimum width in MG(C) such that D = Cθ_cl.

Proof. Let C be of the form {L1, . . . , Lm}. Suppose that there exists such a clique cl and let cl be {θ_{1j_1}, . . . , θ_{mj_m}}. Since L_iθ_{ij_i} = L_{j_i} for each i (1 ≤ i ≤ m), it holds that Cθ_cl = {L_{j_1}, . . . , L_{j_m}}. Then, the width of cl is equal to #Cθ_cl. Since cl has the minimum width in MG(C) and by Theorem 4, Cθ_cl is a clause with the minimum cardinality in the set {Cτ | ∃τ s.t. Cτ ⊆ C} of clauses.

Conversely, suppose that D is the minimum condensation of C, that is, there exists a substitution θ such that Cθ = D and Cθ has the minimum cardinality in the set {Cτ | ∃τ s.t. Cτ ⊆ C} of clauses. Also suppose that #D = k ≤ m. Since D ⊆ C, for each i (1 ≤ i ≤ m), there exists an index j_i such that L_iθ_{ij_i} = L_{j_i} and θ_{ij_i} is a total matcher of L_i with L_{j_i}. Since #D = k, it holds that #{j_1, . . . , j_m} = k. Since Cθ = D = {L_{j_1}, . . . , L_{j_m}} (which is a representation of D containing duplicated literals), it holds that θ = θ_{1j_1} · · · θ_{mj_m}. Furthermore, by Theorem 4, {θ_{1j_1}, . . . , θ_{mj_m}} is a clique of size m in MG(C). Also, the width of {θ_{1j_1}, . . . , θ_{mj_m}} is k, which is the minimum width, because #D = k and D is the minimum condensation of C.

Example 4. Consider the following clause C.
C = {p(x, x), p(x, y), p(y, x)} = {L1, L2, L3}.
Then, V(C) is described as follows and MG(C) is shown in Fig. 3. Here, ‘−’ denotes the nonexistence of the corresponding total matcher.

      L1                L2                L3
V1    θ11 = {x/x}       −                 −                  (L1 θ1i = Li)
V2    θ21 = {x/x, x/y}  θ22 = {x/x, y/y}  θ23 = {y/x, x/y}   (L2 θ2i = Li)
V3    θ31 = {x/x, x/y}  θ32 = {y/x, x/y}  θ33 = {x/x, y/y}   (L3 θ3i = Li)

There exist the following two cliques cl1 and cl2 of size 3 in MG(C).
cl1 = {θ11, θ22, θ33}, cl2 = {θ11, θ21, θ31}.
Fig. 3. The total matcher graph MG(C) in Example 4
Since w(cl1) = 3 and w(cl2) = 1, Cθ_{cl2} = {p(x, x)} is the minimum condensation of C, where θ_{cl2} = {x/y}.
Example 5. Consider the following clause C.
C = {p(x, x), p(x, y), p(x, z)} = {L1, L2, L3}.
Then, V(C) is described as follows and MG(C) is shown in Fig. 4.

      L1                L2                L3
V1    θ11 = {x/x}       −                 −                  (L1 θ1i = Li)
V2    θ21 = {x/x, x/y}  θ22 = {x/x, y/y}  θ23 = {x/x, z/y}   (L2 θ2i = Li)
V3    θ31 = {x/x, x/z}  θ32 = {x/x, y/z}  θ33 = {x/x, z/z}   (L3 θ3i = Li)

There exist 9 cliques of size 3 in MG(C). Since the clique cl = {θ11, θ21, θ31} has the minimum width 1, Cθ_cl = {p(x, x)} is the minimum condensation of C, where θ_cl = θ11θ21θ31 = {x/y, x/z}.
Fig. 4. The total matcher graph MG(C) in Example 5
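Corollary 2 and Examples 4 and 5 can be checked by exhaustive search. The following Python sketch is an assumption-laden illustration and not part of the method of this paper: it encodes literals as tuples, treats every term as a variable (function-free clauses without constants), and uses the hypothetical helper names total_matcher, compatible and minimum_condensation. It enumerates all cliques of size #C in MG(C), so its running time is exponential in #C.

```python
from itertools import product


def total_matcher(lit, target, variables):
    """Total matcher of lit with target as a dict (x/x bindings kept), or None."""
    if lit[0] != target[0] or len(lit) != len(target):
        return None
    theta = {}
    for s, t in zip(lit[1:], target[1:]):
        if s in variables:
            if theta.setdefault(s, t) != t:
                return None
        elif s != t:
            return None
    return theta


def compatible(theta, tau):
    """Strong compatibility: no variable is sent to two different terms."""
    return all(tau[v] == t for v, t in theta.items() if v in tau)


def minimum_condensation(C):
    """Return (C*theta_cl for a minimum-width clique cl, number of full cliques)."""
    variables = {t for lit in C for t in lit[1:]}  # assumption: no constants
    rows = []
    for lit in C:
        row = []
        for j, target in enumerate(C):
            theta = total_matcher(lit, target, variables)
            if theta is not None:
                row.append((j, theta))
        rows.append(row)
    best, count = None, 0
    for choice in product(*rows):                  # one node from each V_i
        if all(compatible(a[1], b[1])
               for k, a in enumerate(choice) for b in choice[k + 1:]):
            count += 1
            targets = {j for j, _ in choice}       # w(cl) = len(targets)
            if best is None or len(targets) < len(best):
                best = targets
    # By the proof of Corollary 2, C*theta_cl = {L_j1, ..., L_jm}.
    return {C[j] for j in best}, count


example4 = [("p", "x", "x"), ("p", "x", "y"), ("p", "y", "x")]
example5 = [("p", "x", "x"), ("p", "x", "y"), ("p", "x", "z")]
print(minimum_condensation(example4))
print(minimum_condensation(example5))
```

Under these assumptions the sketch finds 2 cliques of size 3 for Example 4 and 9 for Example 5, and returns {p(x, x)} in both cases.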
5 A Greedy Algorithm for Condensation
In this section, we design a greedy algorithm CondWidth for finding a condensation of a clause, shown in Fig. 5. While the previous algorithms for finding the condensation of a clause [8,11,15,16] are based on the subsumption check, CondWidth does not adopt the subsumption check.

The algorithm CondWidth is motivated by the algorithm CliqueRemoval (and Ramsey) for finding clique covers introduced by Boppana and Halldórsson [2] (see Appendix). While Boppana and Halldórsson [2] have designed the algorithms Ramsey and CliqueRemoval as recursive, we design CondWidth as iterative, because it holds that θi1 ∈ N(θ1j) for each i and j (1 ≤ i ≠ j ≤ m). Furthermore, the condition in CondWidth that NB ∩ N(θik) ∩ Vl ≠ ∅ for each l (i + 1 ≤ l ≤ m) in S(i, NB) is necessary to remove θik from the candidates for cliques. If there exists an index l (i + 1 ≤ l ≤ m) such that NB ∩ N(θik) ∩ Vl = ∅, then there exists no clique of size #C in MG(C) containing θik.

CondWidth(C)
  /* C: a clause {L1, . . . , Lm}, θij: a total matcher such that Li θij = Lj */
  construct MG(C) = (V1 ∪ · · · ∪ Vm, E(C)): a total matcher graph of C
  /* Vi = {θi1, . . . , θim}: a partite set of MG(C) */
  CC ← cl0; /* cl0 = {θ11, . . . , θmm}: a trivial clique, where w(cl0) = m */
  for j = 1 to m do begin
    if θ1j ∉ V1 or ∃l (2 ≤ l ≤ m)[N(θ1j) ∩ Vl = ∅] then CCj ← cl0;
    else /* θ1j ∈ V1 and ∀l (2 ≤ l ≤ m)[N(θ1j) ∩ Vl ≠ ∅] */
      CCj ← {θ1j}; NB ← N(θ1j); W ← {j};
      for i = 2 to m do begin
        S(i, NB) ← {k | (1 ≤ k ≤ m) ∧ (θik ∈ NB ∩ Vi) ∧ ∀l (i + 1 ≤ l ≤ m)[NB ∩ N(θik) ∩ Vl ≠ ∅]};
        if S(i, NB) ≠ ∅ then
          select k ∈ S(i, NB) such that #(W ∪ {k}) is minimum;
          CCj ← CCj ∪ {θik}; NB ← NB ∩ N(θik); W ← W ∪ {k};
        else CCj ← cl0; break;
      end /* for */
    end /* if */
    if w(CC) > w(CCj) then CC ← CCj;
  end /* for */
  return Cθ_CC;
Fig. 5. Algorithm CondWidth
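The pseudocode of Fig. 5 can be transcribed into Python roughly as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: literals are tuples, every term is treated as a variable, indices are 0-based, the helper names are hypothetical, and ties among the candidates k ∈ S(i, NB) with the same #(W ∪ {k}) are broken by the smallest k, a choice that Fig. 5 leaves open. The result is returned as {L_j | θij ∈ CC}, which equals Cθ_CC by the proof of Corollary 2.

```python
def total_matcher(lit, target, variables):
    """Total matcher of lit with target as a dict, or None if none exists."""
    if lit[0] != target[0] or len(lit) != len(target):
        return None
    theta = {}
    for s, t in zip(lit[1:], target[1:]):
        if s in variables:
            if theta.setdefault(s, t) != t:
                return None
        elif s != t:
            return None
    return theta


def compatible(theta, tau):
    return all(tau[v] == t for v, t in theta.items() if v in tau)


def cond_width(C):
    m = len(C)
    variables = {t for lit in C for t in lit[1:]}   # assumption: no constants
    # Nodes of MG(C): (i, j) stands for the total matcher theta_ij, if it exists.
    theta = {}
    for i in range(m):
        for j in range(m):
            th = total_matcher(C[i], C[j], variables)
            if th is not None:
                theta[(i, j)] = th
    # N[node]: neighbours of node (strongly compatible nodes of other rows).
    N = {a: {b for b in theta if b[0] != a[0] and compatible(theta[a], theta[b])}
         for a in theta}

    def layer(nodes, l):
        """Restriction of a node set to the partite set V_l."""
        return {n for n in nodes if n[0] == l}

    best = [(i, i) for i in range(m)]               # trivial clique cl0
    for j in range(m):
        start = (0, j)
        if start not in theta or any(not layer(N[start], l) for l in range(1, m)):
            continue                                # CC_j = cl0
        clique, nb, used = [start], set(N[start]), {j}
        for i in range(1, m):
            S = [k for k in range(m)
                 if (i, k) in nb
                 and all(layer(nb & N[(i, k)], l) for l in range(i + 1, m))]
            if not S:
                clique = None                       # CC_j = cl0; break
                break
            # Greedy step; ties broken by smallest k (assumption).
            k = min(S, key=lambda c: (len(used | {c}), c))
            clique.append((i, k))
            nb &= N[(i, k)]
            used.add(k)
        if clique is not None and len({k for _, k in clique}) < len({k for _, k in best}):
            best = clique
    return {C[k] for _, k in best}


print(cond_width([("p", "x", "x"), ("p", "x", "y"), ("p", "y", "x")]))
```

For the clause of Example 4 the sketch returns {p(x, x)}. Because of the fixed tie-breaking rule, it explores only one of the runs that Fig. 5 permits.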
Theorem 5. If CondWidth(C) returns D, then it holds that D C. Proof. If D = Cθcl o , then it holds that D = C, so D C.
Suppose that CondWidth(C) returns D = Cθ_CC and w(CC) < m, where CC is of the form {θ_{1j_1}, . . . , θ_{mj_m}}. Consider the node θ_{ij_i}. By the definition of S(i, NB) and since j_i is selected from S(i, NB), it holds that θ_{ij_i} ∈ NB ∩ Vi. Since the set NB of common neighbors of θ_{1j_1}, . . . , θ_{i−1 j_{i−1}} is updated by the second for-loop in CondWidth, it holds that NB = N(θ_{1j_1}) ∩ · · · ∩ N(θ_{i−1 j_{i−1}}) for i, so {θ_{1j_1}, . . . , θ_{ij_i}} is a clique. Since the same discussion holds for 2 ≤ i ≤ m and by Theorem 4, it holds that Cθ_CC ⊂ C.

Theorem 6. Let C be a clause {L1, . . . , Lm} and suppose that l = max{|Li| | 1 ≤ i ≤ m}. Then, the total running time of CondWidth(C) is in O(lm^4).

Proof. Since we can construct each total matcher of Li with Lj in O(l) time, we can construct V(C) in O(lm^2) time. Since #V(C) ≤ m^2, we can also construct E(C) in O(lm^4) time. Furthermore, the running time of the main loop of CondWidth is in O(lm^3), because, for each j (1 ≤ j ≤ m), we can find the clique containing θ1j by searching a space of size O(lm^2). Hence, the statement holds.

Unfortunately, the following example shows that the algorithm CondWidth does not always return the minimum condensation.

Example 6. Consider the following clause C.
C = {p(x, y), p(y, z), p(u, v), p(v, w)} = {L1, L2, L3, L4}.
Then, V(C) is described as follows.

      L1                L2                L3                L4
V1    θ11 = {x/x, y/y}  θ12 = {y/x, z/y}  θ13 = {u/x, v/y}  θ14 = {v/x, w/y}   (L1 θ1i = Li)
V2    θ21 = {x/y, y/z}  θ22 = {y/y, z/z}  θ23 = {u/y, v/z}  θ24 = {v/y, w/z}   (L2 θ2i = Li)
V3    θ31 = {x/u, y/v}  θ32 = {y/u, z/v}  θ33 = {u/u, v/v}  θ34 = {v/u, w/v}   (L3 θ3i = Li)
V4    θ41 = {x/v, y/w}  θ42 = {y/v, z/w}  θ43 = {u/v, v/w}  θ44 = {v/v, w/w}   (L4 θ4i = Li)
Note that E(C) contains all edges between the nodes of V1 and V3, V1 and V4, V2 and V3, and V2 and V4. Additionally, E(C) also contains the edges (θ11, θ22), (θ13, θ24), (θ31, θ42), and (θ33, θ44). Then, there exist the following two cliques cl1 and cl2 with the minimum width 2 in MG(C).
cl1 = {θ11, θ22, θ31, θ42}, cl2 = {θ13, θ24, θ33, θ44}.
Hence, the clique substitutions θ_{cl1} and θ_{cl2} of cl1 and cl2 are {x/u, y/v, z/w} and {u/x, v/y, w/z}, respectively, so it holds that Cθ_{cl1} = {p(x, y), p(y, z)} and Cθ_{cl2} = {p(u, v), p(v, w)}. On the other hand, we apply CondWidth to the above C.
1. After selecting θ11 and θ22, CondWidth may select θ31 or θ32, because N(θ11) ∩ N(θ22) ∩ V3 = V3. If θ31 is selected, then CondWidth returns Cθ_{cl1}; if θ32 is selected, then CondWidth returns C itself.
2. After selecting θ13 and θ24, CondWidth may select θ33 or θ34, because N(θ13) ∩ N(θ24) ∩ V3 = V3. If θ33 is selected, then CondWidth returns Cθ_{cl2}; if θ34 is selected, then CondWidth returns C itself.

As the generalization of Example 6, we obtain the following proposition.

Proposition 2. There exists a clause C such that the cardinality of the minimum condensation of C is 2 but CondWidth(C) returns C.

Proof. Let C be the following clause.
C = {p(xi, yi), p(yi, zi) | 1 ≤ i ≤ m, xi ≠ yi, yi ≠ zi, zi ≠ xi, xk ≠ xl, yk ≠ yl, zk ≠ zl (1 ≤ k, l ≤ m, k ≠ l)}.
It is obvious that the minimum condensation of C is {p(xi, yi), p(yi, zi)} for some i (1 ≤ i ≤ m). On the other hand, for 1 ≤ i ≤ m, let p(xi, yi) and p(yi, zi) be L_{2i−1} and L_{2i}, respectively. As in Example 6, we deal with C under the order L1, L2, . . . , L_{2m−1}, L_{2m}. Let θij be a total matcher such that Li θij = Lj and Vi a set {θi1, . . . , θi(2m)} (1 ≤ i ≤ 2m). Note that all nodes in V_{2i−1} and V_{2i} are connected to all nodes in V1, . . . , V_{2i−2}, V_{2i+1}, . . . , V_{2m}. Also, θ(2i−1)(2j−1) is connected to θ(2i)(2j). In CondWidth(C), the second for-loop is executed from θ11, θ13, . . . , θ1(2m−1) but not from θ12, θ14, . . . , θ1(2m). Furthermore, after selecting θ1(2j−1) and θ2(2j), CondWidth(C) may select θ3(2j−1) or θ3(2j). If θ3(2j) is selected for each j (1 ≤ j ≤ m), then CC_{2j−1} is set to cl0, so CondWidth(C) returns C itself.

By changing the order of literals in C, we can partially avoid the situation in Proposition 2, as follows.

Example 7. Consider the following clause C, which is subsume-equivalent to C in Example 6.
C = {p(x, y), p(u, v), p(y, z), p(v, w)} = {L1, L2, L3, L4}.
Then, V(C) is described as follows.

      L1                L2                L3                L4
V1    θ11 = {x/x, y/y}  θ12 = {u/x, v/y}  θ13 = {y/x, z/y}  θ14 = {v/x, w/y}   (L1 θ1i = Li)
V2    θ21 = {x/u, y/v}  θ22 = {u/u, v/v}  θ23 = {y/u, z/v}  θ24 = {v/u, w/v}   (L2 θ2i = Li)
V3    θ31 = {x/y, y/z}  θ32 = {u/y, v/z}  θ33 = {y/y, z/z}  θ34 = {v/y, w/z}   (L3 θ3i = Li)
V4    θ41 = {x/v, y/w}  θ42 = {u/v, v/w}  θ43 = {y/v, z/w}  θ44 = {v/v, w/w}   (L4 θ4i = Li)
Note that E(C) contains all edges between the nodes of V1 and V2, V1 and V4, V2 and V3, and V3 and V4. Additionally, E(C) also contains the edges (θ11, θ33), (θ12, θ34), (θ21, θ43), and (θ22, θ44). Then, there exist the following two cliques cl1 and cl2 with the minimum width 2 in MG(C).
cl1 = {θ11, θ21, θ33, θ43}, cl2 = {θ12, θ22, θ34, θ44}.
Hence, the clique substitutions θ_{cl1} and θ_{cl2} of cl1 and cl2 are {x/u, y/v, z/w} and {u/x, v/y, w/z}, respectively, so it holds that Cθ_{cl1} = {p(x, y), p(y, z)} and Cθ_{cl2} = {p(u, v), p(v, w)}. On the other hand, we apply CondWidth to the above C.
1. After selecting θ11 and θ21, CondWidth must select θ33 and θ43, because N(θ11) ∩ N(θ21) ∩ V3 = {θ33} and N(θ11) ∩ N(θ21) ∩ N(θ33) ∩ V4 = {θ43}. Then, CondWidth outputs Cθ_{cl1}.
2. After selecting θ12 and θ22, CondWidth must select θ34 and θ44, because N(θ12) ∩ N(θ22) ∩ V3 = {θ34} and N(θ12) ∩ N(θ22) ∩ N(θ34) ∩ V4 = {θ44}. Then, CondWidth outputs Cθ_{cl2}.

We will report elsewhere how to change the order of literals in order to obtain a near-optimal condensation.
6 Conclusion
In this paper, we have first improved the substitution graph introduced by Scheffer et al. [19] into a total matcher graph, and shown the relationship between a total matcher graph and subsumption, thereby correcting the original theorem in [19]. Next, we have introduced the concept of the width of a clique in a total matcher graph, and observed that the minimum condensation of a clause is obtained by applying the substitution corresponding to a clique whose size is the cardinality of the clause and whose width is minimum in the total matcher graph. Based on this property, we have designed the greedy algorithm CondWidth for finding a condensation of a clause, and presented several properties of CondWidth.

As future work, note first that the results of this paper are insufficient to characterize the algorithm CondWidth. Also, as Proposition 2 shows, we cannot guarantee the performance ratio of CondWidth. Hence, it is future work to characterize CondWidth sufficiently and to improve CondWidth in order to guarantee the performance ratio. In particular, to obtain a near-optimal condensation, it is necessary to determine how to change the order of literals. Furthermore, it is also future work to investigate the relationship between the results of this paper and the nonexistence of ideal refinement operators [15,20].

As stated in Section 1, the condensation is deeply related to ILP, for example in PAC-learning [9,10] and query learning [1,17,18]. The condensation is also useful for updating hypotheses in ILP. Hence, it is also future work to implement the algorithm CondWidth together with the least generalization [15,16], to give empirical results for CondWidth, and to compare them with the empirical results based on [14,19].

Acknowledgment
The author would like to thank the anonymous referees of ILP2003 for valuable comments.
References
1. H. Arimura, Learning acyclic first-order Horn sentences from entailment, in: Proc. 8th International Workshop on Algorithmic Learning Theory, LNAI 1316, 432–445, 1997.
2. R. Boppana, M.M. Halldórsson, Approximating maximum independent sets by excluding subgraphs, BIT 32, 180–196, 1992.
3. L.D. Baxter, The complexity of unification, Doctoral Thesis, Department of Computer Science, University of Waterloo, 1977.
4. R. Carraghan, P. Pardalos, An exact algorithm for the maximum clique problem, Operations Research Letters 9, 375–382, 1990.
5. C.-L. Chang, R.C.-T. Lee, Symbolic logic and mechanical theorem proving, Academic Press, 1973.
6. C. Chekuri, A. Rajaraman, Conjunctive query containment revisited, Theoretical Computer Science 239, 211–229, 2000.
7. M.R. Garey, D.S. Johnson, Computers and intractability: A guide to the theory of NP-completeness, W.H. Freeman and Company, 1979.
8. G. Gottlob, C.G. Fermüller, Removing redundancy from a clause, Artificial Intelligence 61, 263–289, 1993.
9. T. Horváth, R.H. Sloan, G. Turán, Learning logic programs by using the product homomorphism method, in: Proc. 10th Annual Workshop on Computational Learning Theory, 10–20, 1997.
10. T. Horváth, G. Turán, Learning logic programs with structured background knowledge, in L. de Raedt (ed.): Advances in inductive logic programming, IOS Press, 172–191, 1996.
11. W.H. Joyner, Resolution strategies as decision procedures, Journal of the ACM 23, 398–417, 1976.
12. A. Leitsch, The resolution calculus, Springer-Verlag, 1997.
13. J.W. Lloyd, Foundations of logic programming (2nd extended edition), Springer-Verlag, 1987.
14. J. Maloberti, M. Sebag, θ-subsumption in a constraint satisfaction perspective, in: Proc. 11th International Workshop on Inductive Logic Programming, LNAI 2157, 164–178, 2001.
15. S.-H. Nienhuys-Cheng, R. de Wolf, Foundations of inductive logic programming, LNAI 1228, 1997.
16. G.D. Plotkin, A note on inductive generalization, Machine Intelligence 5, 153–163, 1970.
17. C. Reddy, P. Tadepalli, Learning first-order acyclic Horn programs from entailment, in: Proc. 15th International Conference on Machine Learning, 472–480, 1998.
18. C. Reddy, P. Tadepalli, Learning Horn definitions: Theory and application to planning, New Generation Computing 17, 77–98, 1999.
19. T. Scheffer, R. Herbrich, F. Wysotzki, Efficient θ-subsumption based on graph algorithms, in: Proc. 6th International Workshop on Inductive Logic Programming, LNAI 1314, 212–228, 1996.
20. P.R.J. van der Laag, An analysis of refinement operators in inductive logic programming, Ph.D. Thesis, Tinbergen Institute, 1995.
Appendix: Greedy Algorithms Ramsey and CliqueRemoval

Boppana and Halldórsson [2] have designed the greedy algorithms Ramsey and CliqueRemoval, described in Fig. 6, for finding an independent set and a clique cover of a graph. Note that Ramsey finds a clique and an independent set in G, and CliqueRemoval finds an independent set and a clique cover in G. The outputs of Ramsey and CliqueRemoval are the respective pairs. Boppana and Halldórsson [2] have also shown that the performance ratio of the algorithm CliqueRemoval is O(n / log^2 n), where n is the number of nodes in G, which is known to be the best possible.

CliqueRemoval(G) /* G: a graph */
  i ← 1;
  (Ci, Ii) ← Ramsey(G);
  while G ≠ ∅ do begin
    G ← G − Ci;
    i ← i + 1;
    (Ci, Ii) ← Ramsey(G);
  end /* while */
  return (max{Ij | 1 ≤ j ≤ i}, {C1, . . . , Ci});

Ramsey(G) /* G: a graph */
  if G = ∅ then return (∅, ∅);
  select v ∈ G;
  (C1, I1) ← Ramsey(N(v));
  (C2, I2) ← Ramsey(N̄(v));
  return (max{C1 ∪ {v}, C2}, max{I1, I2 ∪ {v}});
Fig. 6. Algorithm Ramsey and CliqueRemoval [2]
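For readers who want to experiment with Fig. 6, the two procedures can be sketched in Python as follows. The sketch makes several assumptions that are not in [2]: the graph is given as a dictionary adj mapping each vertex to its set of neighbors, "max" is read as "the larger of the two sets", the pivot v is chosen as the smallest remaining vertex to make the run deterministic, and the empty clique produced by the final call of Ramsey is not added to the cover.

```python
def ramsey(vertices, adj):
    """Return (clique, independent set) found in the subgraph induced by vertices."""
    if not vertices:
        return set(), set()
    v = min(vertices)                         # assumption: any v would do
    neighbours = vertices & adj[v]            # N(v)
    non_neighbours = vertices - adj[v] - {v}  # non-neighbours of v
    c1, i1 = ramsey(neighbours, adj)
    c2, i2 = ramsey(non_neighbours, adj)
    clique = max(c1 | {v}, c2, key=len)
    independent = max(i1, i2 | {v}, key=len)
    return clique, independent


def clique_removal(adj):
    """Return (largest independent set seen, list of cliques covering the graph)."""
    remaining = set(adj)
    cover, best_independent = [], set()
    clique, independent = ramsey(remaining, adj)
    while remaining:
        best_independent = max(best_independent, independent, key=len)
        cover.append(clique)
        remaining -= clique                   # G <- G - C_i
        clique, independent = ramsey(remaining, adj)
    return best_independent, cover


# A 5-cycle 0-1-2-3-4-0 as a small test graph.
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
independent, cover = clique_removal(adj)
print(independent, cover)
```

The recursion depth of ramsey grows with the number of vertices, so this form is suitable for small graphs only.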
A Comparative Evaluation of Feature Set Evolution Strategies for Multirelational Boosting

Susanne Hoche¹ and Stefan Wrobel²

¹ University of Bonn, Informatik III, Römerstr. 164, 53117 Bonn, Germany, [email protected]
² Fraunhofer AiS, Schloß Birlinghoven, 53754 Sankt Augustin, Germany, [email protected]
Abstract. Boosting has established itself as a successful technique for decreasing the generalization error of classification learners by basing predictions on ensembles of hypotheses. While previous research has shown that this technique can be made to work efficiently even in the context of multirelational learning by using simple learners and active feature selection, such approaches have relied on simple and static methods of determining feature selection ordering a priori and adding features only in a forward manner. In this paper, we investigate whether the distributional information present in boosting can usefully be exploited in the course of learning to reweight features and in fact even to dynamically adapt the feature set by adding the currently most relevant features and removing those that are no longer needed. Preliminary results show that these more informed feature set evolution strategies surprisingly have mixed effects on the number of features ultimately used in the ensemble, and on the resulting classification accuracy.
1 Introduction
Boosting is a well-established method for decreasing the generalization error of classification learners and has been developed into practical algorithms that have demonstrated superior performance on a broad range of application problems in both propositional and multi-relational domains ([5,15,3,14,7]). Instead of searching for one highly accurate prediction rule entirely covering a given set of training examples, boosting algorithms construct ensembles of specialized rules by repeatedly calling a base learner on reweighted versions of the training data. Predictions are based on a combination of all members of the learned ensemble.

Previous work showed that this technique can be efficient even in the context of multirelational learning using simple learners and active feature selection [7,8]. Active feature selection can be embedded into a boosting framework at virtually no extra cost by exploiting the characteristics of boosting itself to actively determine the set of features that is being used in the various iterations of the boosted learner [8]. By monitoring the progress of learning, and incrementally presenting features to the learner only if this appears to be necessary for further learning, we arrive at smaller feature sets and significantly reduced induction times without a deterioration of predictive accuracy.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 180–196, 2003. © Springer-Verlag Berlin Heidelberg 2003

The abovementioned positive effects on feature set size and induction time were achieved with extremely simple and uninformed selection strategies. In the approach introduced in [8], feature weights were determined once at the beginning on the initially uniform example distribution, and no attempt was made to remove features that might have become irrelevant during the course of boosting due to the changes in the underlying example distribution. In this paper, we therefore investigate whether learning results can be further improved by employing the distributional information present in boosting to reweight features and to dynamically adapt the feature set. In addition, we explore the effect of establishing new feature orders based on considering different sorting criteria as well as different subsets of features and examples. A number of different strategies for feature subset evolution are examined and evaluated on several multirelational domains. Interestingly, the empirical evaluation shows that more informed feature selection strategies have mixed effects on the size of feature sets and classification accuracy, indicating that the increase in the power of the weak learner achieved by making better feature sets available might be offset by the induction of mutually contradictory base hypotheses produced by features that are very specific to extremal distributions towards the end of the boosting process.

This paper is organized as follows. In Section 2, we review constrained confidence-rated boosting. Section 3 provides an overview of the simple uninformed approach to active feature selection in the framework of constrained confidence-rated boosting. In Section 4, we present our approach to feature set evolution strategies for multirelational boosting. Our experimental evaluation of the approach is described and discussed in Section 5. In Section 6, we discuss related work, and we conclude in Section 7 with some pointers to future work.
2 Constrained Confidence-Rated ILP-Boosting
Boosting has emerged as a successful method for improving the predictive accuracy of a learning system by combining a set of base classifiers constructed by iterative calls to a base learner into one single hypothesis [5,15,3,14,7]. The idea is to “boost” a weak learner performing only slightly better than random guessing into an arbitrarily accurate learner by constructing an ensemble of base hypotheses and combining them into one final hypothesis. To this end, a base learner is repeatedly called on reweighted versions of a set E of training instances. In each round t of boosting, a probability distribution Dt over E is maintained which models the weight Dit associated with each training example ei in the t-th iteration. Dit indicates the influence of an instance when constructing a base classifier ht . Initially, all instances have equal influence on the construction of a base hypothesis, i.e. the probability distribution D1 is uniform. In each iterative call t to the base learner, a base hypothesis ht is learned based on
182
Susanne Hoche and Stefan Wrobel
E weighted according to the current distribution Dt over E, and used to update the distribution for the next iteration. The weights of misclassified instances are increased while the weights of correctly classified instances are decreased, in order to focus on the examples which have not yet been correctly classified. Finally, all base hypotheses learned are combined into one final hypothesis H by a weighted majority vote of the base hypotheses. In [7], we extended a specific approach to boosting known as constrained confidence-rated boosting, first introduced in [3], to multirelational problems. Combined with an appropriate refinement operator and search heuristics, constrained confidence-rated boosting is an effective approach to producing highly accurate multi-relational models while at the same time ensuring limited complexity of the final ensemble and acceptable induction times, as shown in [7] with the system C 2 RIB (Constrained Confidence-Rated ILP Boosting). Since for the active feature selection strategies developed and evaluated in this paper, we have chosen C 2 RIB as the basic algorithm and point of reference, we provide a summary of the algorithm below; for more detail, the reader is referred to [7]. C 2 RIB accepts as input the total number of iterations of the base learner, and a set E = {(x1 , y1 ), · · · , (xN , yN )} of positive training examples (xi , 1) and negative training examples (xi , −1), where each xi belongs to an instance space X. Additionally, background knowledge may be provided. In each iterative call t of the base learner, a base hypothesis ht is learned on E, based on the current distribution Dt . In the framework of confidence-rated boosting, the prediction of a base hypothesis ht is confidence-rated. A prediction confidence c(ht , E) is assigned to each base hypothesis ht . 
A Comparative Evaluation of Feature Set Evolution Strategies
The sign of c(ht, E) indicates the label predicted by ht to be assigned to an instance, whereas the absolute value of c(ht, E) is interpreted as the confidence in ht’s prediction. This prediction confidence is used to update Dt for the next iteration, and as ht’s vote in the final hypothesis H. The constrained form of confidence-rated boosting which we apply here is such that the base learner is restricted to induce only hypotheses predicting the positive class with a positive prediction confidence for all examples covered by the hypothesis, and to abstain on all examples not covered by it. Additionally, the so-called default hypothesis is admissible, comprising just the target predicate to be learned and covering all examples. The confidence assigned to the default hypothesis conforms to the sign of the class of examples with the largest sum of probabilities according to the current distribution Dt. This weighted majority class may change over the course of iterations, depending on the learned base hypotheses and thus the examples on which the learner is currently focusing. The prediction confidence of the current base hypothesis is used to update the current probability distribution Dt such that misclassified instances will have higher weights in the next iteration of the learner. After the last iteration of the base learner, the final, strong hypothesis is derived from all base hypotheses induced from the training instances. To classify an instance x, the prediction confidences of all hypotheses ht covering x are summed up. If this sum is positive, the strong hypothesis H classifies x as positive; otherwise x is classified as negative.
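The confidence computation, the weight update and the final vote summarized above can be illustrated with a minimal Python sketch. It follows the formulas of Table 1 (steps 1.2b, 1.2c and the output step); the function names, the list-based data layout and the toy example are assumptions made for illustration only, and the base learner itself (rule refinement and search) is not modelled.

```python
import math

def confidence(covered, weights, labels, n):
    """c(C, S) = 1/2 * ln((w+ + 1/(2N)) / (w- + 1/(2N))), summed over covered examples."""
    w_pos = sum(w for c, w, y in zip(covered, weights, labels) if c and y == 1)
    w_neg = sum(w for c, w, y in zip(covered, weights, labels) if c and y == -1)
    smooth = 1.0 / (2 * n)  # smoothing term 1/(2N) from Table 1
    return 0.5 * math.log((w_pos + smooth) / (w_neg + smooth))

def update_distribution(weights, labels, predictions):
    """D_i^{t+1} is proportional to D_i^t / exp(y_i * h_t(x_i)), then normalised."""
    raw = [w / math.exp(y * p) for w, y, p in zip(weights, labels, predictions)]
    total = sum(raw)
    return [r / total for r in raw]

def classify(confidences_of_covering_hypotheses):
    """Final vote: positive iff the summed confidences of covering hypotheses are positive."""
    return 1 if sum(confidences_of_covering_hypotheses) > 0 else -1

# Toy example (hypothetical data): four examples, uniform initial distribution,
# one base hypothesis covering the first three examples.
labels = [1, 1, -1, -1]
weights = [0.25] * 4
covered = [True, True, True, False]
c = confidence(covered, weights, labels, n=4)
predictions = [c if cov else 0.0 for cov in covered]  # abstain (0) when not covered
new_weights = update_distribution(weights, labels, predictions)
```

In this toy run the covered negative example (the third one) is misclassified, so its weight grows, while the weights of the two correctly covered positive examples shrink.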
3 Simple Baseline Strategy of Active Feature Selection
Efficiency and effectiveness of learning crucially depend on the representation of the objects that are used for learning. The inclusion of unnecessary features or, in a multirelational setting, of unnecessary relations, makes learning less efficient and often less effective. Feature selection is therefore a central topic in machine learning research. Using a very simple strategy, it is possible to exploit the characteristics of boosting in order to perform active feature selection, resulting in lower induction times and reduced complexity of the ensemble [8]. Boosting tries to increase the certainty with which examples are classified by the ensemble of hypotheses, which is expressed in terms of the so-called margin. As shown in [8], one can therefore monitor the development of the margin in order to determine when new features or relations might be needed.

The approach presented in [8] uses quite a simple strategy, which we use as a baseline for comparison with the more advanced and informed strategies we develop and evaluate in this paper. The algorithm C²RIB_D described in [8] simply orders the available features in the different relations based on a heuristic relevance measure (mutual information). Boosting is then started with a minimal set of features, and proceeds until the development of the margin indicates that progress is slowing down, at which point the next feature on the list is added to the representation. In the following, we describe in more detail how the simple baseline active feature selection strategy works when integrated into the algorithm of C²RIB described in the preceding section (for a complete description see [8]). A concise description of the algorithm is given in Table 1. References to steps in Table 1 are indicated by “T1:”.
C²RIB_D accepts as input, in addition to the input of C²RIB, the set F of features present in the training examples, sorted in descending order according to some criterion, and a subset F′ of the topmost features of F. In order to actively select features depending on the requirements of the problem, and thus accelerate the learning process of the boosted learner C²RIB without a deterioration of its prediction accuracy, we start the learner with the features in F′ and the relations in which these features occur, monitor the learning progress, and include additional features and relations into the learning process only on demand (T1:1.2d, T1:2). The learning progress is monitored in terms of the development of the training examples’ mean margins. The margin of an example ei = (xi, yi) under an ensemble Ht of base classifiers h1, h2, ..., ht is a real-valued number margin(Ht, ei) ∈ [−1, 1] indicating the amount of disagreement of the classifiers in Ht with respect to ei’s class. For the binary case we deal with here, we can define the margin of ei under Ht as the difference between the sum of the absolute weights of those base classifiers in Ht predicting for ei its correct class yi, and the sum of the
Table 1. C²RIB_D Algorithm

Input:
– The number $T$ of iterations of the base learner
– A set of positive and negative training instances $e_i = (x_i, y_i) \in E = E^+ \cup E^-$, with $y_i = 1$ for $e_i \in E^+$ and $y_i = -1$ for $e_i \in E^-$, and $|E| = N$
– The set $F$ of features present in the examples in $E$, sorted in descending order according to some criterion
– A set $F' \subseteq F$ of the top features in $F$

Let $D$ denote a probability distribution over $E$, with $D_i^t$ the probability of $e_i$ in the $t$-th iteration.

1.1 Set $D_i^1 := \frac{1}{N}$ for $1 \le i \le N$
1.2 For $t = 1 \ldots T$
  (a) $C_t$ = LearnBaseHypothesis($E$, $D^t$)
  (b) $h_t : X \to \mathbb{R}$ is the function
      $$h_t(x) = \begin{cases} c(C_t, E) & \text{if } e = (x, y) \text{ is covered by } C_t \\ 0 & \text{otherwise,} \end{cases}$$
      where $c(C, S)$ is defined as
      $$c(C, S) = \frac{1}{2} \ln \frac{w_+(C, S) + \frac{1}{2N}}{w_-(C, S) + \frac{1}{2N}},$$
      with
      $$w_+(C, S) = \sum_{(x_i, 1) \in S \text{ covered by } C} D_i^t, \qquad w_-(C, S) = \sum_{(x_i, -1) \in S \text{ covered by } C} D_i^t$$
  (c) Update the probability distribution:
      $$D_i^{t'} = \frac{D_i^t}{e^{(y_i \cdot h_t(x_i))}}, \quad \text{and} \quad D_i^{t+1} = \frac{D_i^{t'}}{\sum_i D_i^{t'}}, \quad 1 \le i \le N$$
  (d) If $t > 2$ and $F' \ne F$
      i. Let $H_t := \{h_1, \ldots, h_t\}$, with base classifier $h_k$ of iteration $1 \le k \le t$
      ii. $F'$ = CheckLearningProgress($H_t, t, E, N, F', F$) as detailed below

Output: The final hypothesis
$$H(x) := \operatorname{sign}\Big(\sum_{h_t} h_t(x)\Big) = \operatorname{sign}\Big(\sum_{C_t : (x, y) \text{ covered by } C_t} c(C_t, E)\Big)$$

CheckLearningProgress($H_t, t, E, N, F', F$) returns $F'$

2.1 Compute for $E$ the examples’ average margin $AM_t = \frac{1}{N} \sum_{i=1}^{N} margin(H_t, e_i)$
2.2 Let $gradient(t)$ be the slope of the line determined by the least square fit to the $AM_k$ in $k$, $1 \le k \le t$
2.3 Compute
    $$trend(t) := \begin{cases} \frac{1}{T_l} \sum_{j=1}^{T_l} gradient(t - j) & \text{if } t > T_l \\ \frac{1}{t-2} \sum_{j=2}^{t-1} gradient(j) & \text{if } t \le T_l, \end{cases}$$
    where $T_l$ denotes the number of iterations over which the gradients are averaged
2.4 Compute $ratio(t) := \frac{trend(t)}{gradient(t)}$
2.5 If $t > 3$:
  (a) If $ratio(t - 1)$ exhibits a local maximum, estimate the slowdown in the margins’ improvement in the form of $predict(x) := a \frac{1}{\ln(x)^{b}}$, where $a, b$ are chosen such that $predict(2) = ratio(t - 1)$ and $predict(3) = ratio(t)$; $offset := t - 3$
  (b) If $a, b$ have already been determined, compute $predict(t) := a \frac{1}{\ln(t - offset)^{b}}$
  (c) Else $predict(t) := ratio(t)$
  (d) If $\frac{predict(t)}{ratio(t)} > \alpha$, select the first element $F$ of $F$ that is not yet in $F'$, i.e. the feature with the next greatest mutual information with the training examples’ class; $F' := F' \cup \{F\}$
  (e) Else $F' := F'$
absolute weights of those base classifiers in Ht predicting for ei the incorrect class y ≠ yi [8]. Large positive margins indicate a “confident” correct classification. The more negative a margin is, the more confident an incorrect classification is indicated. Boosting is known to be especially effective at increasing the margins of the training examples [18,6]. By increasing their probabilities, boosting forces the focus on misclassified instances which show small or even negative margins. The learner is forced to search for base hypotheses which correctly classify these hard examples and thus increase their margins. Since the margins are increasing in the course of iterated calls to the base learner, the gradient of the mean margins can be assumed to be positive and can be employed to monitor the quality of the learning process.

For monitoring the learning success, we define in each iteration t of boosting the gradient gradient(t) as the slope of the line determined by the least squares fit to the average margins in each single iteration 1 to t (T1:2.1, T1:2.2). We then average the gradients over the last Tl iterations so as to smooth temporary fluctuations in the margins’ development (T1:2.3), and compute the ratio of the averaged previous gradients and the current gradient (T1:2.4). The margins’ improvement is measured by this ratio, which increases from one iteration to the next as long as the margins increase significantly. As soon as the ratio starts to decrease, an estimate for the slowdown in the margins’ improvements is determined (T1:2.5a). This estimate predicts the expected decrease of the ratio (T1:2.5b, T1:2.5c) and is used to determine when a new feature has to be presented to the learner. Whenever the actual decrease of the ratio exceeds the predicted decrease by a certain threshold, a new feature is included into the learning process (T1:2.5d).
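The monitoring quantities of steps 2.1 to 2.4 in Table 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the margin is normalised by the total absolute weight of the ensemble so that it falls into [−1, 1] (the text does not spell out the normalisation), the least squares slope is computed in closed form, and all function names and the sequence of average margins are hypothetical. The slowdown estimate of step 2.5 is not covered.

```python
def margin(votes, y):
    """Margin of one example; votes are the h_k(x) of the ensemble (0 = abstain).
    Normalising by the total absolute weight is an assumption of this sketch."""
    correct = sum(abs(v) for v in votes if v * y > 0)
    wrong = sum(abs(v) for v in votes if v * y < 0)
    total = sum(abs(v) for v in votes)
    return (correct - wrong) / total if total > 0 else 0.0

def slope(values):
    """Least squares slope of values[k-1] plotted against k = 1..t."""
    t = len(values)
    if t < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(1, t + 1)
    mean_x = sum(xs) / t
    mean_y = sum(values) / t
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, values))
    return sxy / sxx

def gradient(avg_margins, t):
    """gradient(t): slope of the fit to the average margins AM_1..AM_t (step 2.2)."""
    return slope(avg_margins[:t])

def trend(avg_margins, t, t_l):
    """trend(t): mean of previous gradients (step 2.3)."""
    if t > t_l:
        grads = [gradient(avg_margins, t - j) for j in range(1, t_l + 1)]
    else:
        grads = [gradient(avg_margins, j) for j in range(2, t)]
    return sum(grads) / len(grads)

def ratio(avg_margins, t, t_l):
    """ratio(t) = trend(t) / gradient(t) (step 2.4)."""
    return trend(avg_margins, t, t_l) / gradient(avg_margins, t)

# Hypothetical sequence of average margins AM_1..AM_6 with slowing improvement.
am = [0.10, 0.30, 0.45, 0.55, 0.60, 0.62]
current_ratio = ratio(am, t=6, t_l=3)
```

Note that with t > Tl the call requires t − Tl ≥ 2, since a slope cannot be fitted to a single point; how the original system treats this boundary case is not stated in the text.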
4 Feature Set Evolution Strategies for Multirelational Boosting
We investigate whether more complex strategies of feature ordering and selection can further improve the learning results. Instead of relying on the sequence that has been initially determined on the entire training set based on the features’ mutual information with the class, a new feature order is established every time a feature is requested by the base learner, by reweighting features based on the current distribution over the training examples. Such a new feature order can be arrived at by considering:
– the entire training set for feature reweighting, or only the fraction of examples which are misclassified by the current ensemble of base hypotheses;
– the entire original feature set, or just the features which have not been presented to the learner yet;
– not only the mutual information of a feature with the class but also the conditional mutual information of a feature with the class, given the values of other features.
Moreover, features can be simply presented incrementally to the learner or, alternatively, features that are no longer needed can be substituted by the currently most relevant features. Based on these considerations, we investigate several approaches to feature ordering and selection. In the following, we discuss the properties with respect to which we categorize the different strategies summarized in Table 2.

Considered Features (column 3 of Table 2): Boosting is based on a combination of the members of the learned ensemble, each of which is specialized on a certain region of the instance space. As the distribution over the examples changes, the required special knowledge might shift. Thus, instead of adhering to features that have once been determined as useful, one can allow for the removal of features that are no longer needed. We consider the question whether prediction accuracy can be improved by presenting to the learner feature subsets which meet the very current requirements of the learning task.

Considered Examples (column 4 of Table 2): Boosting maintains a probability distribution over the training examples in order to focus the base learner on examples which are misclassified by the current ensemble. This way, the search in the hypothesis space is forced towards hypotheses which correctly classify these hard examples. We consider the question whether predictive accuracy can be improved by presenting to the base learner features which are especially helpful to correctly classify the examples which have been misclassified so far.
Sorting Criterion (column 5 of Table 2): In addition to computing the mutual information of each single feature with the class, we determine, based on their conditional mutual information (CMI), groups of features which optimally separate the given examples. The conditional mutual information between feature Fk and class C, given feature Fj, reflects the amount of information about the class that is obtained when the values of features Fj and Fk are known, and can be defined as

$$CMI(C, F_k \mid F_j) = E(C) - E(C \mid F_k, F_j) = \sum_{i=1}^{m_j} \sum_{l=1}^{m_k} \sum_{c=1}^{|Y|} p(C = c, F_k = f_l, F_j = f_i) \ln \frac{p(C = c, F_k = f_l, F_j = f_i)}{p(C = c)\, p(F_k = f_l, F_j = f_i)} \qquad (1)$$

with |Y | possible classes, mj possible values of feature Fj , and mk possible values of feature Fk . We investigate whether learning results can be improved by taking into account the CMI of the given features. Due to the complexity of computing the CMI of large feature sets, we restrict our investigation to pairs of features.
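As an illustration of Equation (1), the following Python sketch computes the measure for a pair of discrete features. Estimating the probabilities from (optionally weighted) example counts is an assumption of this sketch; the function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def cmi(classes, fk_values, fj_values, weights=None):
    """Sum over (c, f_l, f_i) of p(c, f_l, f_i) * ln(p(c, f_l, f_i) / (p(c) * p(f_l, f_i))).
    Probabilities are estimated from example weights (uniform if none are given)."""
    n = len(classes)
    if weights is None:
        weights = [1.0 / n] * n
    total = sum(weights)
    joint = defaultdict(float)   # p(C = c, Fk = f_l, Fj = f_i)
    p_class = defaultdict(float)  # p(C = c)
    p_pair = defaultdict(float)   # p(Fk = f_l, Fj = f_i)
    for c, fk, fj, w in zip(classes, fk_values, fj_values, weights):
        p = w / total
        joint[(c, fk, fj)] += p
        p_class[c] += p
        p_pair[(fk, fj)] += p
    return sum(p * math.log(p / (p_class[c] * p_pair[(fk, fj)]))
               for (c, fk, fj), p in joint.items() if p > 0)

# Toy example: the class is the XOR of two binary features, so neither feature
# alone is informative, but the pair determines the class completely.
xor_classes = [0, 1, 1, 0]
xor_fk = [0, 0, 1, 1]
xor_fj = [0, 1, 0, 1]
xor_score = cmi(xor_classes, xor_fk, xor_fj)
```

The XOR example shows why scoring pairs of features can rank a feature highly even when its individual mutual information with the class is zero.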
Set Evolution Strategy (column 6 of Table 2): Instead of augmenting the set of features to be considered for refinement, features that are no longer needed can be substituted by the currently most relevant features. We consider the question whether we can improve prediction accuracy by substituting features instead of adding them.
Partitioning Continuous Features (last column of Table 2): Before determining a feature order by computing the features’ (conditional) mutual information with the class, we can discretize the ranges of continuous features in a way that reflects the current distribution over the training examples. To this end, we apply an objective function of the constrained confidence-rated boosting framework (cf. [7]) to partition the continuous range of a feature F such that the training examples are optimally separated by F according to the current distribution.

Table 2. Strategies for feature ordering and selection based on the distributional information present in boosting. The strategies differ with respect to the features and examples, respectively, considered for establishing a new feature order, the applied sorting criterion, the set evolution strategy, and the discretization strategy for continuous features. The n in Set Evolution Strategy “Substitute n” denotes the number of features already presented to the learner.

Group  Version  Considered  Considered     Sorting    Set Evolution        Partitioning
                Features    Examples       Criterion  Strategy             Continuous Features
I      V1       Remaining   All            MI         Add 1                N
       V2       Remaining   Misclassified  MI         Add 1                N
II     V3       Remaining   All            MI         Add 1                Y
       V4       Remaining   Misclassified  MI         Add 1                Y
       V5       All         All            MI         Substitute n, Add 1  Y
       V6       All         Misclassified  MI         Substitute n, Add 1  Y
III    V7       All         All            MI         Substitute 1         Y
       V8       All         Misclassified  MI         Substitute 1         Y
IV     V9       Remaining   All            CMI        Add 1                Y
       V10      Remaining   Misclassified  CMI        Add 1                Y
       V11      All         All            CMI        Add 1                Y
       V12      All         Misclassified  CMI        Add 1                Y
V      V13      Remaining   All            CMI        Substitute 2         Y
       V14      Remaining   Misclassified  CMI        Substitute 2         Y
       V15      All         All            CMI        Substitute 2         Y
       V16      All         Misclassified  CMI        Substitute 2         Y
It is common to all approaches V1 to V16 that, exactly as in the baseline strategy C²RIB_D, the available features in the different relations are initially ordered based on the heuristic relevance measure of mutual information, and that the learner starts with the top two features according to this order. In contrast to the baseline method, every time a feature is requested by the learner, a new feature order is determined based on the current distribution. In all versions except those of group I (V1 and V2), the value ranges of continuous features are partitioned based on the current distributional information and the boosting objective function (cf. [7]).

The versions in groups I to III compute the features’ mutual information with the class. V1 to V4 consider only those features which have not been presented to the learner yet, and augment the set of currently active features with the top feature of the new sequence. In each pair, one version (V1, V3) considers all examples for reweighting, the other one (V2, V4) considers only those which are misclassified by the current ensemble. V5 and V6 consider the entire feature set for reweighting, substitute the features already presented to the learner with the same number of top features from the new sequence, and activate one additional top feature of the new ranking. Again, one version (V5) considers all examples for reweighting, the other one (V6) considers the misclassified ones only. V7 and V8 (group III) determine, based on all examples and the misclassified ones only, respectively, a new sequence of all features and substitute the worst active feature with the top feature of the new sequence. Thus, only two features are active at a time.

In V9 to V12 (group IV), the feature with the highest conditional mutual information with the class, given the feature which was last activated, is presented to the learner in an incremental manner. Versions V9 to V12 account for all possible combinations of features and examples to be considered for reweighting.
In V13 to V16 (group V), the set of currently active features is substituted by the feature F with the highest mutual information with the class, and the feature with the highest conditional mutual information with the class, given F , again accounting for all possible combinations of features and examples to be considered for reweighting. Only two features are active at a time.
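To make the re-ranking step concrete, the following Python sketch outlines one possible reading of the “remaining features, misclassified examples, add 1” combination (as in V2 and V4): the remaining features are scored by their mutual information with the class, estimated from the current example weights restricted to the misclassified examples, and the top-scoring feature is added to the active set. The data layout, the function names and the toy data are assumptions for illustration; discretization of continuous features is omitted.

```python
import math
from collections import defaultdict

def weighted_mi(feature_values, classes, weights):
    """Mutual information of a discrete feature with the class under example weights."""
    total = sum(weights)
    joint, p_f, p_c = defaultdict(float), defaultdict(float), defaultdict(float)
    for f, c, w in zip(feature_values, classes, weights):
        p = w / total
        joint[(f, c)] += p
        p_f[f] += p
        p_c[c] += p
    return sum(p * math.log(p / (p_f[f] * p_c[c]))
               for (f, c), p in joint.items() if p > 0)

def add_next_feature(data, classes, weights, active, misclassified=None):
    """Return the active features plus the best-scoring remaining feature.
    data maps a feature name to its list of values; if misclassified is given,
    only the flagged examples are used for scoring."""
    idx = [i for i in range(len(classes))
           if misclassified is None or misclassified[i]]
    remaining = [f for f in data if f not in active]
    if not remaining or not idx:
        return list(active)

    def score(f):
        return weighted_mi([data[f][i] for i in idx],
                           [classes[i] for i in idx],
                           [weights[i] for i in idx])

    return list(active) + [max(remaining, key=score)]

# Hypothetical toy data: feature "b" separates the misclassified examples perfectly.
data = {"a": [0, 1, 0, 1, 0, 1], "b": [1, 1, 0, 0, 1, 0], "c": [0, 0, 0, 1, 0, 1]}
classes = [1, 1, -1, -1, 1, -1]
weights = [1 / 6] * 6
misclassified = [False, False, True, True, True, True]
new_active = add_next_feature(data, classes, weights, ["a"], misclassified)
```

Passing `misclassified=None` scores the features on all examples instead, which corresponds to the “All” setting of the Considered Examples column in Table 2.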
5 Empirical Evaluation
5.1 Experimental Design
We evaluated the different feature set evolution strategies on a total of six learning problems: two classical ILP domains, Mutagenicity [20] (prediction of mutagenic activity of 188 molecules (description B4)) and QSARs, Quantitative Structure Activity Relationships, [9,10] (prediction of a greater-activity relationship between pairs of compounds based on their structure), one artificial problem, the Eastbound Trains¹ proposed by Ryszard Michalski (prediction of trains’ directions based on their properties), and three general knowledge and
¹ The examples were generated with the Random Train Generator available at http://www-users-cs-york.ac.uk/~stephen/progol.html.
Table 3. Accuracy ± standard deviation, and number of requested features ± standard deviation after 50 iterations for C²RIB_D and the feature set evolution strategies V1 to V16 on several multirelational domains

Group  Version  Measure  KDD01       Mutagenicity  PKDD-A      PKDD-C      QSARs       Trains
       C²RIB_D  Acc.     90.15±7.92  83.50±5.80    86.70±6.64  88.57±2.95  78.76±1.76  80.00±15.32
                Feat.    5.5±3.4     4.4±2.0       5.5±1.8     6.3±1.8     3.4±1.5     5.9±1.9
I      V1       Acc.     91.58±6.32  80.87±9.94    86.70±6.64  88.87±3.35  81.37±2.03  78.33±8.05
                Feat.    6.6±2.2     4.2±1.6       6.9±2.6     5.7±2.6     3.3±1.5     5.8±1.4
       V2       Acc.     90.33±7.73  85.60±4.45    86.70±6.64  88.87±3.35  80.77±2.93  81.67±16.57
                Feat.    5.4±3.9     7.4±1.2       6.1±2.5     8.6±3.0     2.3±0.5     5.7±1.2
II     V3       Acc.     90.29±7.44  85.67±6.03    86.70±6.64  88.72±3.12  81.51±2.39  86.67±10.54
                Feat.    2.5±1.1     5.4±1.4       6.6±1.8     7.0±2.3     4.0±0.0     5.8±1.1
       V4       Acc.     90.57±7.32  86.19±5.01    86.70±6.64  88.87±3.35  80.96±3.45  83.33±11.11
                Feat.    3.5±2.3     4.9±1.7       5.9±2.6     5.7±4.1     6.0±2.8     6.6±1.0
       V5       Acc.     90.12±7.99  86.72±6.64    86.70±6.64  88.87±3.35  77.45±3.09  73.33±11.65
                Feat.    3.7±4.0     5.6±1.8       7.1±2.6     8.4±2.0     5.0±1.4     6.1±1.2
       V6       Acc.     89.87±8.53  87.77±6.08    86.70±6.64  88.87±3.35  80.13±3.19  88.33±13.72
                Feat.    2.4±0.9     5.2±1.8       7.3±1.9     8.9±1.0     3.5±0.6     6.0±0.7
III    V7       Acc.     90.73±7.13  85.14±8.20    86.70±6.64  88.87±3.35  78.14±5.63  81.67±9.46
                Feat.    2.9±1.9     5.8±2.2       6.9±3.0     7.9±3.3     6.2±4.7     7.5±3.4
       V8       Acc.     89.95±8.34  85.60±9.99    86.70±6.64  88.87±3.35  76.73±4.21  85.0±14.59
                Feat.    2.2±0.6     7.4±2.2       8.5±3.2     8.0±3.5     4.4±1.3     6.3±2.0
IV     V9       Acc.     90.61±6.40  82.51±7.35    86.23±6.76  87.46±2.73  78.93±4.77  81.67±9.46
                Feat.    2.4±1.1     4.2±1.7       2.9±1.7     8.7±1.3     2.8±1.0     5.8±1.5
       V10      Acc.     90.65±6.26  88.55±7.07    82.88±7.25  88.65±2.78  80.43±2.61  81.67±9.46
                Feat.    2.5±1.1     5.7±2.2       2.0±0.0     5.8±2.5     5.5±0.7     6.2±0.8
       V11      Acc.     90.75±6.24  84.55±6.35    86.19±7.5   88.72±3.2   79.03±2.56  78.33±11.25
                Feat.    2.2±0.4     3.5±0.7       2.9±1.2     4.6±2.0     2.8±0.8     4.0±0.0
       V12      Acc.     90.31±6.61  82.38±8.03    85.72±7.67  88.97±3.19  78.24±2.02  78.33±8.05
                Feat.    2.2±0.4     4.5±1.1       2.0±0.0     3.3±2.2     3.0±0.7     3.4±0.5
V      V13      Acc.     91.12±6.28  86.19±6.12    82.46±6.74  88.43±2.99  72.77±2.47  80.0±10.54
                Feat.    3.85±2.9    9.6±4.5       10.8±7.2    11.0±7.7    4.4±2.2     12.4±4.2
       V14      Acc.     90.53±6.38  86.29±5.04    87.02±7.64  86.14±2.48  72.75±2.38  76.67±2.61
                Feat.    2.8±2.2     10.0±3.7      2±0.0       5.5±7.0     4.8±3.4     9.2±3.9
       V15      Acc.     89.93±6.71  82.38±8.03    85.83±7.16  88.13±2.66  72.64±2.33  73.33±11.65
                Feat.    2.6±1.3     6.4±3.4       6.6±5.3     13.0±6.9    2.4±0.9     6.8±2.4
       V16      Acc.     90.62±6.18  82.83±6.31    84.18±9.2   86.85±4.29  72.81±2.35  75.0±11.79
                Feat.    2.9±1.8     8.2±3.7       2.7±1.2     2.0±0.0     6.8±5.6     10.2±5.1
data mining tasks, Task A and AC of the PKDD Discovery Challenge 2000 [1] (classification of loans, where Task AC is based on all loans, and Task A only on the closed loans from Task AC), and Task 2 of the KDD Cup 2001 [2] (prediction of gene functions).

The standard version C²RIB_D and each of the versions described in Section 4 are run with T = 50 iterations of the base learner. In all experiments, the threshold α is set to 1.01 (cf. 2.5d in Table 1). The value 1.01 has been empirically determined on the domain of Mutagenicity [20], and has not been modified for subsequent experiments on the other domains in order to ensure proper cross-validation results.

Since we expected more informed feature set evolution strategies to result in more extreme learning curves, we decided to average the gradients of the examples’ mean margin over a smaller number Tl of iterations (cf. 2.3 in Table 1) than in earlier experiments, where we used Tl = 10. To ensure a fair comparison of the uninformed approach and the new strategies, we compared the results of the base case C²RIB_D with Tl = 10 against a new value Tl = 3. In two thirds of our domains, Tl = 3 resulted in a deterioration of classification accuracy, in some cases a rather large one, and in the remaining third only in very slight improvements. Similarly, we compared, for one of the informed feature selection strategies on one of the domains, the learning curves with Tl = 3 and Tl = 10. As expected, the learning curve initially increased significantly more strongly and later dropped significantly more slowly when using a more complex feature ordering and selection strategy. This had the effect that, with Tl = 10, the learning curve’s estimate was lower than the learning curve itself, and consequently no additional features were requested by the learner.
Reducing the number of iterations over which the gradients of the examples’ mean margins are averaged to Tl = 3 has the effect that the different development of the learning curve can be better estimated, and thus features are introduced into the learning process over the course of iterations. Thus, the gradients of the examples’ mean margins are averaged over the last Tl = 10 iterations for the standard version C²RIB_D, and over the last Tl = 3 iterations for the more complex feature set evolution strategies (cf. 2.3 in Table 1).
5.2 Detailed Results and Discussion
The resulting predictive accuracies and the average number of features required by the learner are depicted in Table 3 together with the standard deviations. The predictive accuracy is estimated by 10-fold cross-validation, with the exception of the QSARs domain, where 5-fold cross-validation is used², and the Eastbound Trains, where the data is split into one training and one test set partition, and the results are averaged over 10 iterations of the experiment. The results indicate on the one hand that further improvements can indeed be achieved by using more complex approaches to feature ordering and selection. On the other hand, they clearly show that it has to be considered very carefully which strategy to apply. Combining, for example, substitution of features with
² The folds correspond exactly to the data described in [9].
reordering all features and applying the CMI criterion (as in V15 and V16), seems to lead to inferior results. The dynamics inherent to boosting already cause the underlying learner to direct its attention to the difficult, or extreme, instances. Strategies overly intensifying the focus in this direction most probably tend to misleadingly lay emphasis on a few extreme examples, which leads to inferior results. We discuss this issue in more depth after a thorough discussion of the single strategies’ results summarized in Table 3. We base our discussion of the strategies’ performance on their win-loss-tie record in comparison to the uninformed baseline strategy C²RIB_D. Afterwards, possible explanations will be discussed in the context of all results.

The entries of groups I to III in the upper half of Table 2, which all apply the MI criterion, show an increased prediction confidence over the base case, C²RIB_D, in all but one case. This is especially surprising for the versions V1 and V2 (group I), since they apply a simple reordering of the features not yet presented to the learner based on the examples’ current weights, without prior partitioning of the continuous features’ value ranges. Both strategies add the best feature with respect to the new ranking in a forward manner, and yield a better accuracy than C²RIB_D with feature subsets of about the same size. V2, which uses only the weights of misclassified examples to establish a new ranking, clearly outperforms V1, which considers all examples for reweighting, with respect to classification accuracy. Versions V3 and V4 of group II differ from V1 and V2 in that continuous features are discretized based on the current distribution over the training set. Both strategies are superior to C²RIB_D with respect to classification accuracy, and yield, with one exception, feature subsets of about the same size as or slightly smaller than C²RIB_D.

Again, the version considering the misclassified examples only (V4) is superior to the one considering all examples (V3). For versions V5 and V6 of group II, which consider the entire feature set to establish a new feature ranking, substitute all active features with the same number of currently most relevant features, and add one additional feature, the number of features required by the learner is larger than for the base case in one third, and smaller in one fifth of the cases. Again, considering misclassified examples only for reweighting yields the better results. V5 considers the entire training set, and its prediction confidence is inferior to that of the base case. In contrast, V6, which uses only the weights of misclassified examples to establish a new ranking, is clearly superior to C²RIB_D. An opposite effect can be observed in group III (versions V7 and V8), where all features are considered for a new ranking, and only two features are active at a time. Every time a new order has been determined, the worst active feature is substituted with the top feature of the new sequence. Here, the version considering the entire training set for reweighting (V7) is superior to the case where only the weights of misclassified examples are used to establish a new feature order (V8). In both cases, the classification accuracy is better, but the number of requested features is predominantly larger than for the base case.
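The win-loss-tie comparison against the baseline can be reproduced from the mean accuracies in Table 3 with a few lines of Python. The sketch below assumes that a win, loss or tie is decided by directly comparing mean accuracies per domain; whether the original comparison uses any additional criterion (for instance a significance test) is not stated in the text, and the function name is hypothetical.

```python
def win_loss_tie(strategy, baseline):
    """Count domains where the strategy is better than, worse than, or equal to the baseline."""
    wins = sum(s > b for s, b in zip(strategy, baseline))
    losses = sum(s < b for s, b in zip(strategy, baseline))
    return wins, losses, len(baseline) - wins - losses

# Mean accuracies from Table 3, in the domain order
# KDD01, Mutagenicity, PKDD-A, PKDD-C, QSARs, Trains.
baseline = [90.15, 83.50, 86.70, 88.57, 78.76, 80.00]  # C2RIB_D
v1 = [91.58, 80.87, 86.70, 88.87, 81.37, 78.33]
v2 = [90.33, 85.60, 86.70, 88.87, 80.77, 81.67]

record_v1 = win_loss_tie(v1, baseline)
record_v2 = win_loss_tie(v2, baseline)
```

Under this simple counting, V2 never loses against the baseline, while V1 loses on two domains, which is in line with the observation above that V2 outperforms V1.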
All versions of groups IV and V in Table 2 are based on the CMI criterion. The results show that some of the strategies (V10 and V11) clearly outperform the baseline strategy C²RIB_D both in terms of predictive accuracy and reduction of the number of features required by the learner. The remaining versions, however, not only fail to yield any improvements but mostly deteriorate the learning results in all respects. In all versions of group IV (V9 to V12), the features are presented to the learner in a forward manner, and they all arrive at a smaller number of features requested for learning than the base case. V9 and V10 consider only the remaining features to determine a new feature order. Again, the version considering only the weights of the misclassified examples (V10) is superior to the version using the weights of all examples (V9). V10 clearly outperforms C²RIB_D with respect to accuracy, while V9 is on par with the base case. In both versions, the number of features requested by the learner is smaller than or on par with the base case. In V11 and V12, all features are reordered every time a new feature is requested. As for V7 and V8, considering only the misclassified examples is inferior to considering all examples. V11 clearly outperforms the base case both in terms of accuracy and number of required features. V12 requires smaller feature subsets than C²RIB_D but yields a lower classification accuracy. In the versions of group V (V13 to V16), only two features are active at a time, namely the feature F with the currently highest mutual information with the class, and the feature with the currently highest conditional mutual information with the class, given F. None of these versions yields any improvement with respect to the base case. V14 is on par with C²RIB_D with respect to both accuracy and number of features required for learning. However, the remaining three strategies are clearly inferior to the base case in all respects.

Again, the versions only considering the weights of misclassified examples (V14, V16) yield better results than the strategies using the weights of all examples (V13, V15).

These detailed results indicate that the strategies which establish a new order only for those features which have not been presented to the learner yet, based on the weights of misclassified examples only, and which add the currently best feature to the set of active features (V2, V4, V10) are the most successful strategies, and are clearly superior to the base case. Applying the relevance measure of mutual information of a feature with the class (V2, V4) seems to outperform the use of the CMI measure (V10). The strategy yielding by far the worst results is the combination of considering all features for reordering, all examples for weighting, substitution of the only two active features with the currently best two features, and the relevance measure of CMI (V15). V15 is, with respect to predictive accuracy, inferior to C²RIB_D on all six domains, and requires more features in two thirds of the domains. V5, the only MI-based strategy (groups I to III) which performs worse than the uninformed baseline strategy C²RIB_D, combines exactly the means which lead to the worst results in the CMI groups (IV and V). Thus, one can conjecture that this combination should be avoided. The comparatively
good classification accuracy of V7 seems to contradict this conjecture. However, V7 results in all but one case in a larger number of features required by the learner, which might indicate that the selected features are not optimal. Version V11 differs from the worst strategy, V15, only with respect to the set evolution strategy but yields both higher classification accuracy and smaller feature subsets than C²RIB_D. This provides us with an idea about how “extreme” the single strategies are. We can interpret the “all features, all examples, substitute” strategy (AAS) as the most extreme method. AAS completely deprives the learner of its current equipment and very strongly directs the focus to the present situation. The classification accuracy deteriorates and the number of features required for learning increases as a result of an insufficient equipment of the learner. When instead the best feature is added to the current equipment, the learner concentrates much less on just the current state, and it becomes less likely that the learner only focuses on a few extreme, or misleading, examples. The “remaining features, misclassified examples, add” strategy (RMA), which yields the best results in our experiments, can be interpreted as the least extreme method. RMA is a fairly cautious strategy and does not further intensify the dynamics inherent to boosting. Not only is the prediction accuracy higher but the number of features requested by the learner is also smaller. The request of a large number of features is most likely due to situations where new features are selected based on misleading information. These features seem to be more promising than they really are. Since the chosen equipment is inadequate, the learner soon tries to level out the insufficiency by requesting yet another feature.
5.3 Summary and Implications
As a bottom line, positive effects on the classification accuracy and on the number of features ultimately used for learning can be achieved by applying more informed feature selection strategies which utilize the distributional information provided by boosting without overly intensifying the dynamics inherent to boosting. The most successful strategies are those which add, in a forward manner, from the set of features not yet presented to the learner the one that scores best on the misclassified examples with respect to the MI relevance measure. Strategies which further intensify the dynamics of boosting, i.e. which result in an even stronger focus on only a few extreme examples, should be avoided since they lead to a clear deterioration of the results. One could presume that this deterioration stems from an overfitting effect. However, since C²RIBᴰ employs an effective overfitting avoidance strategy, we rather conjecture that learning is inhibited by focusing too much on features that are very specific to extremal distributions over the training data. Preliminary analysis of the base hypotheses’ prediction confidences and the training examples’ mean margins over the course of iterations indicates that the selection of features which are significant only for a very small fraction E of “extreme” training examples results in the construction of a base hypothesis so unrepresentative of the entire training data that it is immediately leveled out by
the regularization mechanisms of the boosting algorithm. In the next iteration, a contrary base hypothesis is induced which, in turn, forces the learner to concentrate again on E and to repeat the same process, or to request that a new feature order be established. Thus, the learner seems to eventually reach a point where further learning is inhibited.
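The RMA-style selection step summarized in this section can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the function names `weighted_mutual_information` and `pick_next_feature` are hypothetical, features are assumed to be discrete, and the MI estimate simply treats the normalized boosting weights as a probability distribution; the exact relevance computation used in the experiments may differ.

```python
import math
from collections import defaultdict

def weighted_mutual_information(feature_values, labels, weights):
    """MI (in bits) between a discrete feature and the class, where the
    example weights (assumed to come from boosting) define the distribution."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    joint = defaultdict(float)
    p_feature = defaultdict(float)
    p_label = defaultdict(float)
    for x, y, w in zip(feature_values, labels, weights):
        p = w / total
        joint[(x, y)] += p
        p_feature[x] += p
        p_label[y] += p
    return sum(p * math.log2(p / (p_feature[x] * p_label[y]))
               for (x, y), p in joint.items() if p > 0)

def pick_next_feature(columns, labels, weights, misclassified, presented):
    """Rank only the features not yet presented to the learner, using only
    the currently misclassified examples, and return the best-scoring one."""
    idx = [i for i, wrong in enumerate(misclassified) if wrong]
    sub_labels = [labels[i] for i in idx]
    sub_weights = [weights[i] for i in idx]
    scores = {}
    for name, values in columns.items():
        if name in presented:
            continue  # "remaining features" only
        scores[name] = weighted_mutual_information(
            [values[i] for i in idx], sub_labels, sub_weights)
    return max(scores, key=scores.get) if scores else None
```

The returned feature would be added to the active feature set rather than substituted for it, which is the “add” part of RMA.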
6 Related Work
The idea of selecting smaller feature subsets and shifting the bias to a more expressive representation language is common in multi-relational learning. Also, relevance measures based on mutual information are widely used for feature selection. The work most closely related to ours is probably [12], where AdaBoost [16] is combined with molfea, an inductive database for the domain of biochemistry [11]. In [12], AdaBoost is employed to identify particularly difficult examples for which molfea constructs new special-purpose structural features. AdaBoost re-weighting episodes and molfea feature construction episodes are alternated. In each iteration, a new feature constructed by molfea is presented to a propositional learner, the examples are re-weighted in accordance with the base classifier learned by it, and a new feature is constructed by molfea based on the modified weights. In contrast, our approach actively decides when to include new features from the list of ranked existing features, with the central goal of including new features only when absolutely necessary in order to be maximally efficient. This means that in principle the two approaches could easily be combined, for example by calling a generator of new features whenever the list of existing features has been exhausted. Sebban and Nock [17] propose a wrapper model utilizing boosting for feature selection. In their approach, alternative feature subsets are assessed based on the underlying booster’s optimization criterion. The feature subset optimal according to this criterion is then presented as a whole to a learner. Das [4] introduces a Boosting Based Hybrid approach to feature selection. In each iteration of AdaBoost, the feature is selected that has, among all previously unselected features, the highest information gain on the training set weighted according to AdaBoost and the current feature subset. This process terminates when the training error does not decrease further by selecting additional features.
Similar to [17] and [4], we combine heuristic relevance measures and the distributional information present in boosting to determine optimal features. However, we embed the feature selection process into the learning process and thus arrive at an active strategy to feature selection.
7 Conclusion
In this paper, we have investigated informed approaches to feature ordering and selection in the framework of active feature selection and constrained confidence-rated multirelational boosting. Active feature selection can be embedded into a boosting framework at virtually no extra cost by exploiting the characteristics of
boosting itself to actively determine the set of features that is being used in the various iterations of the boosted learner [8]. By monitoring the progress of learning, and incrementally presenting features to the learner only if this appears to be necessary for further learning, one can, even with a simple uninformed approach to feature ordering and selection, arrive at a smaller feature set, and significantly reduced induction times without a deterioration of predictive accuracy. Here, we investigated whether classification accuracy and the number of features used in the learning process can be further improved by making use of the distributional information present in boosting to reweight features and to dynamically adapt the feature set. In addition, we explored the effect of establishing new feature orders based on considering different sorting criteria as well as different subsets of features and examples. The empirical evaluation of several different strategies to feature subset evolution on a number of multirelational domains shows that more informed feature selection strategies have mixed effects on the size of feature sets and classification accuracy. Prediction accuracy can be improved by utilizing the distribution over the training examples maintained by boosting, for example in combination with the heuristic relevance measure of mutual information. Positive effects on the classification accuracy and the number of features ultimately used for learning can be achieved with the relevance measure of conditional mutual information, whereby the features and examples used for reordering and reweighting, respectively, have to be carefully considered in order to avoid the selection of features which are only significant for very few examples and misleading for the overall learning process. 
As a next step, the dynamics in boosting which lead to the induction of mutually contradictory base hypotheses in the presence of powerful feature subsets have to be thoroughly investigated. Moreover, other relevance measures and approaches to feature selection will be investigated.
Acknowledgments
This work was partially supported by DFG (German Science Foundation), projects WR 40/1-3 (“Active Learning”) and WR 40/2-1 (“Hybrid Methods”). We would like to thank Jörg Kaduk for valuable discussions and helpful comments on previous versions of this paper.
References
1. P. Berka. Guide to the financial Data Set. In: A. Siebes and P. Berka, editors, PKDD 2000 Discovery Challenge, 2000.
2. J. Cheng, C. Hatzis, H. Hayashi, M.-A. Krogel, Sh. Morishita, D. Page, and J. Sese. KDD Cup 2001 Report. In: SIGKDD Explorations, 3(2):47-64, 2002.
3. W. Cohen and Y. Singer. A Simple, Fast, and Effective Rule Learner. Proc. of 16th National Conference on Artificial Intelligence, 1999.
4. S. Das. Filters, Wrappers and a Boosting-based Hybrid for Feature Selection. Proc. of 18th International Conference on Machine Learning, 2001.
5. Y. Freund and R.E. Schapire. Experiments with a New Boosting Algorithm. Proc. of 13th International Conference on Machine Learning, 1996.
6. A.J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. Proc. of 15th National Conf. on AI, 1998.
7. S. Hoche and S. Wrobel. Relational Learning Using Constrained Confidence-Rated Boosting. Proc. 11th Int. Conf. on Inductive Logic Programming (ILP), 2001.
8. S. Hoche and S. Wrobel. Scaling Boosting by Margin-Based Inclusion of Features and Relations. Proc. 13th European Conf. on Machine Learning (ECML’02), 2002.
9. R.D. King, S. Muggleton, R.A. Lewis, and M.J.E. Sternberg. Drug design by machine learning: The use of inductive logic programming to model the structure activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proc. of the National Academy of Sciences of the USA 89(23):11322-11326, 1992.
10. R.D. King, A. Srinivasan, and M. Sternberg. Relating chemical activity to structure: An examination of ILP successes. New Generation Computing, Special issue on Inductive Logic Programming 13(3-4):411-434, 1995.
11. S. Kramer and L. De Raedt. Feature construction with version spaces for biochemical applications. Proc. of the 18th ICML, 2001.
12. S. Kramer. Demand-driven Construction of Structural Features in ILP. Proc. 11th Int. Conf. on Inductive Logic Programming (ILP), 2001.
13. W.J. McGill. Multivariate information transmission. IRE Trans. Inf. Theory, 1995.
14. D. Opitz and R. Maclin. Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research 11, pages 169-198, 1999.
15. J.R. Quinlan. Bagging, boosting, and C4.5. Proc. of 14th Nat. Conf. on AI, 1996.
16. R.E. Schapire. Theoretical views of boosting and applications. Proceedings of the 10th International Conference on Algorithmic Learning Theory, 1999.
17. M. Sebban and R. Nock. Contribution of Boosting in Wrapper Models. In: J.M. Zytkow and J. Rauch, eds., Proc. of the PKDD’99, 1999.
18. R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics, 26(5):1651-1686, 1998.
19. C.E. Shannon. A mathematical theory of communication. Bell. Syst. Techn. J., 27:379-423, 1948.
20. A. Srinivasan, S. Muggleton, M.J.E. Sternberg, and R.D. King. Theories for mutagenicity: A study in first-order and feature-based induction. Artificial Intelligence, 1996.
Comparative Evaluation of Approaches to Propositionalization
Mark-A. Krogel¹, Simon Rawles², Filip Železný³,⁴, Peter A. Flach², Nada Lavrač⁵, and Stefan Wrobel⁶,⁷
¹ Otto-von-Guericke-Universität, Magdeburg, Germany, [email protected]
² University of Bristol, Bristol, UK, {peter.flach,simon.rawles}@bristol.ac.uk
³ Czech Technical University, Prague, Czech Republic, [email protected]
⁴ University of Wisconsin, Madison, USA, [email protected]
⁵ Institute Jožef Stefan, Ljubljana, Slovenia, [email protected]
⁶ Fraunhofer AiS, Schloß Birlinghoven, 53754 Sankt Augustin, Germany, [email protected]
⁷ Universität Bonn, Informatik III, Römerstr. 164, 53117 Bonn, Germany, [email protected]
Abstract. Propositionalization has already been shown to be a promising approach for robustly and effectively handling relational data sets for knowledge discovery. In this paper, we compare up-to-date methods for propositionalization from two main groups: logic-oriented and database-oriented techniques. Experiments using several learning tasks, both ILP benchmarks and tasks from recent international data mining competitions, show that both groups have their specific advantages. While logic-oriented methods can handle complex background knowledge and provide expressive first-order models, database-oriented methods can be more efficient, especially on larger data sets. The accuracies obtained vary across tasks in such a way that combining the features produced by both groups appears to be a worthwhile further step.
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 197–214, 2003. © Springer-Verlag Berlin Heidelberg 2003
1 Introduction
Following the initial success of the system LINUS [13], approaches to multi-relational learning based on propositionalization have gained significant new interest in the last few years. In a multi-relational learner based on propositionalization, instead of searching the first-order hypothesis space directly, one uses a transformation module to compute a large number of propositional features and then uses a propositional learner. While such a learner is less powerful in principle than systems that directly search the full first-order hypothesis space, it has turned out that in practice, in many cases, it is sufficient to search a fixed subspace that can be
defined by feature transformations. In addition, basing learning on such transformations offers a potential for enhanced efficiency, which is becoming more and more important for large applications in data mining. Lastly, transforming multi-relational problems into a single-table format allows one to directly use all propositional learning systems, thus making a wider choice of algorithms available. In the past three years, quite a number of different propositionalization learners have been proposed [1,10,9,12,14,17]. While all such learning systems explicitly or implicitly assume that an individual-centered representation is used, the available systems differ in their details. Some of them constrain themselves to features that can be defined in pure logic (existential features), while others, inspired by the database area, include features based on, e.g., aggregation. Unfortunately, in the existing literature, only individual empirical evaluations of each system are available, so it is difficult to see clearly what the advantages and disadvantages of each system are, and on which type of application each one is particularly strong. In this paper, we therefore present the first comparative evaluation of three different multi-relational learning systems based on propositionalization. In particular, we have chosen to compare RSD [17], a subgroup discovery system of which only the feature construction part is of interest here; SINUS, the successor of LINUS and DINUS [13]; and RELAGGS [12], a database-inspired system which adds non-existential features. We give details on each system, and then, in the main part of the paper, provide an extensive empirical evaluation on six popular multi-relational problems. As far as possible, we have taken great care to ensure that all systems use identical background knowledge and declarations to maximize the strength of the empirical results.
Our evaluation shows interesting differences between the involved systems, indicating that each has its own strengths and weaknesses and that none is universally the best. In our discussion, we analyze this outcome and point out which directions of future research appear most promising. The paper is structured as follows. In the following section (Sect. 2), we first recall the basics of propositionalization as used for multi-relational learning. In the subsequent sections, we then discuss each of the three chosen systems individually, first RSD (Sect. 3.1), then SINUS (Sect. 3.2), and finally RELAGGS (Sect. 4). Section 5 is the main part of the paper and presents an empirical evaluation of the approaches. We give details on the domains that were used, explain how the domains were handled by each learning system, and present a detailed comparison of running times and classification accuracies. The results show noticeable differences between the systems, and we discuss the possible reasons for their respective behavior. We finish with a summary and conclusion in Sect. 6, pointing out some areas of further work.
2 Propositionalization
Following [11], we understand propositionalization as a transformation of relational learning problems into attribute-value representations amenable to conventional data mining systems such as C4.5 [21], which can be seen as propositional learners. Attributes are often called features and form the basis for columns in single-table representations of data. Single-table representations and the models that can be learned from them have a strong relationship to propositional logic and its expressive power [7], hence the name for the approaches discussed here. As further pointed out there, propositionalization can mostly be applied in domains with a clear notion of an individual, with learning occurring on the level of individuals only. We focus in this paper on the same kind of learning tasks as [11]:
Given some evidence E (examples, given extensionally either as a set of ground facts or as tuples representing a predicate/relation whose intensional definition is to be learned), and an initial theory B (background knowledge, given either extensionally as a set of ground facts or relational tuples, or as sets of clauses over the set of background predicates/relations),
Find a theory H (hypothesis, in the form of a set of logical clauses) that together with B explains some properties of E.
Usually, hypotheses have to obey certain constraints in order to arrive at hypothesis spaces that can be handled efficiently. These restrictions can introduce different kinds of bias. During propositionalization, features are constructed from relational background knowledge and structural properties of individuals. Results can then serve as input to different propositional learners, e.g. as preferred by the user. Propositionalizations can be either complete or partial (heuristic). In the former case, no information is lost in the process; in the latter, information is lost and the representation change is incomplete: the goal is to automatically generate a small but relevant set of structural features.
Further, general-purpose approaches to propositionalization can be distinguished from special-purpose approaches that could be domain-dependent or applicable to a limited problem class only. In this paper, we focus on general-purpose approaches for partial propositionalization. In partial propositionalization, one is looking for a set of features, where each feature is defined in terms of a corresponding program clause. If the number of features is m, then a propositionalization of the relational learning problem is a set of clauses:
    f_1(X) :- Lit_{1,1}, ..., Lit_{1,n_1}.
    f_2(X) :- Lit_{2,1}, ..., Lit_{2,n_2}.
    ...
    f_m(X) :- Lit_{m,1}, ..., Lit_{m,n_m}.
where each clause defines a feature f_i. The clause body Lit_{i,1}, ..., Lit_{i,n_i} is said to be the definition of feature f_i; these literals are derived from the relational background knowledge. In the clause head f_i(X), the argument X refers to an individual. If
such a clause is called for a particular individual (i.e., if X is bound to some example identifier) and this call succeeds at least once, the corresponding Boolean feature is defined to be “true” for the given example; otherwise, it is defined to be “false”. It is pointed out in [11] that features can also be non-Boolean, requiring a second variable in the head of the clause defining the feature to return the value of the feature. The usual application of features of this kind would be in situations where the second variable has a unique binding. However, variants of those features can also be constructed for non-determinate domains, e.g. using aggregation as described below.
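The semantics of a Boolean feature (true if the defining clause succeeds at least once for the individual) and of an aggregate variant for a non-determinate relation can be illustrated as follows. This is a minimal sketch on hypothetical toy data: the relations `HAS_CAR` and `LONG`, the feature names and the helper `propositionalize` are assumptions made for illustration and do not belong to any of the systems compared here.

```python
# Toy extensional background knowledge (hypothetical, for illustration only).
HAS_CAR = {("t1", "c1"), ("t1", "c2"), ("t2", "c3")}
LONG = {"c1", "c2"}

def f_has_long_car(train):
    """Boolean feature f(X) :- hasCar(X, C), long(C).
    True if the clause body succeeds at least once for the individual."""
    return any(t == train and car in LONG for t, car in HAS_CAR)

def n_long_cars(train):
    """Non-Boolean variant for a non-determinate relation: instead of a
    single binding, aggregate (here: count) all bindings of the body."""
    return sum(1 for t, car in HAS_CAR if t == train and car in LONG)

def propositionalize(individuals, features):
    """Build a single table: one row per individual, one column per feature."""
    return [{"id": x, **{name: f(x) for name, f in features.items()}}
            for x in individuals]

table = propositionalize(
    ["t1", "t2"],
    {"has_long_car": f_has_long_car, "n_long_cars": n_long_cars},
)
```

Each row of `table` can then be handed to any propositional learner.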
3 Logic-Oriented Approaches
The two systems presented next, RSD and SINUS, tackle the propositionalization task by constructing first-order logic features, assuming, as mentioned earlier, that there is a clear notion of a distinguishable individual. In this approach to first-order feature construction, based on [8,11,14], local variables referring to parts of individuals are introduced by the so-called structural predicates. The only place where non-determinacy can occur in individual-centered representations is in structural predicates. Structural predicates introduce new variables. In the proposed language bias for first-order feature construction, a first-order feature is composed of one or more structural predicates introducing a new variable, and of utility predicates as in LINUS [13] (called properties in [8]) that ‘consume’ all new variables by assigning properties to individuals or their parts, represented by variables introduced so far. Utility predicates do not introduce new variables. Although the two systems presented below are based on a common understanding of the notion of a first-order feature, they vary in several aspects. We first give an overview of their basic principles separately and then compare the approaches.
3.1 RSD
RSD was originally designed as a system for relational subgroup discovery [17]. Here we are concerned only with its auxiliary component providing means of first-order feature construction. The RSD implementation in Yap Prolog is publicly available from http://labe.felk.cvut.cz/~zelezny/rsd and is accompanied by a comprehensive user’s manual. To propositionalize data, RSD conducts the following three steps.
1. Identify all first-order literal conjunctions which form a legal feature definition and at the same time comply with user-defined constraints (mode-language). Such features do not contain any constants, and the task can be completed independently of the input data.
2. Extend the feature set by variable instantiations. Certain features are copied several times with some variables substituted by constants detected by inspecting the input data. During this process, some irrelevant features are detected and eliminated.
3. Generate a propositionalized representation of the input data using the generated feature set, i.e., a relational table consisting of binary attributes corresponding to the truth values of features with respect to instances of data.
Syntactical Construction of Features. RSD accepts declarations very similar to those used by the systems Aleph [22] and Progol [19], including variable types, modes, setting a recall parameter etc., used to syntactically constrain the set of possible features. Let us illustrate the language bias declarations by an example on the well-known East-West Trains data domain [18].
– A structural predicate declaration in the East-West trains domain can be defined as follows:
    :-modeb(1, hasCar(+train, -car)).
  where the recall number 1 determines that a feature can address at most one car of a given train. Input variables are labeled by the + sign, and output variables by the - sign.
– Property predicates are those with no output variables.
– A head predicate declaration always contains exactly one variable of the input mode, e.g.,
    :-modeh(1, train(+train)).
Additional settings can also be specified, or they acquire a default value. These are the maximum length of a feature (number of contained literals), maximum variable depth [19] and maximum number of occurrences of a given predicate symbol. RSD produces an exhaustive set of features satisfying the mode and setting declarations. No feature produced by RSD can be decomposed into a conjunction of two features. For example, the feature set based on the following declaration
    :-modeh(1, train(+train)).
    :-modeb(2, hasCar(+train, -car)).
    :-modeb(1, long(+car)).
    :-modeb(1, notSame(+car, +car)).
will contain a feature
    f(A) :- hasCar(A, B), hasCar(A, C), long(B), long(C), notSame(B, C).    (1)
but it will not contain a feature with a body
    hasCar(A, B), hasCar(A, C), long(B), long(C)    (2)
as such an expression would clearly be decomposable into two separate features. In the search for legal feature definitions (corresponding to the exploration of a subsumption tree), RSD uses several pruning rules that often drastically decrease the run time needed to obtain the feature set. For example, a simple calculation is employed to make sure that structural predicates are no longer added to a partially constructed feature definition when not enough places would remain (within the maximum feature length) to hold the property literals consuming all output variables. A detailed treatment of the pruning principles is beyond the scope of this paper and will be reported elsewhere.
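The difference between feature (1) and body (2) can be made concrete with a small connectivity check. The sketch below is an illustrative reading, not RSD’s actual procedure: it assumes that a body is decomposable exactly when its literals fall into more than one group after linking literals that share a local variable (any variable other than the head variable A). The names `components` and `is_decomposable` are hypothetical.

```python
def components(body, head_var="A"):
    """Group body literals that are linked through shared local variables.
    A literal is a (predicate, args) pair; variables start with an
    upper-case letter, and the head variable does not link literals."""
    groups = []  # list of (literals, local variables) pairs
    for pred, args in body:
        local = {a for a in args if a[:1].isupper() and a != head_var}
        merged_lits, merged_vars = [(pred, args)], set(local)
        rest = []
        for lits, vars_ in groups:
            if vars_ & local:          # shares a local variable: merge
                merged_lits = lits + merged_lits
                merged_vars |= vars_
            else:
                rest.append((lits, vars_))
        groups = rest + [(merged_lits, merged_vars)]
    return [lits for lits, _ in groups]

def is_decomposable(body, head_var="A"):
    """More than one group means the body splits into separate features."""
    return len(components(body, head_var)) > 1

BODY_2 = [("hasCar", ("A", "B")), ("hasCar", ("A", "C")),
          ("long", ("B",)), ("long", ("C",))]
FEATURE_1_BODY = BODY_2 + [("notSame", ("B", "C"))]
```

Under this reading, the literal notSame(B, C) is what ties the two cars together in feature (1), whereas body (2) splits into the groups {hasCar(A, B), long(B)} and {hasCar(A, C), long(C)}.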
Extraction of Constants and Filtering Features. The user can utilize the reserved property predicate instantiate/1 to specify a type of variable that should be substituted with a constant during feature construction.¹ For example, consider that the result of the first step is the following feature:
    f1(A) :- hasCar(A, B), hasLoad(B, C), shape(C, D), instantiate(D).    (3)
In the second step, after consulting the input data, f1 will be substituted by a set of features, in each of which the instantiate/1 literal is removed and the D variable is substituted by a constant, making the body of f1 provable in the data. Provided the data contain a train with a rectangle load, the following feature will appear among those created out of f1:
    f11(A) :- hasCar(A, B), hasLoad(B, C), shape(C, rectangle).    (4)
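The instantiation step for feature (3) can be mimicked on toy data as follows. This is a sketch under assumptions: the three relations are hypothetical ground facts, the generated feature names (`f1_1`, `f1_2`) are made up for illustration, and the code only enumerates constants for which the body is provable; it is not RSD’s implementation.

```python
# Hypothetical ground facts for the trains domain (illustration only).
HAS_CAR = [("t1", "c1"), ("t2", "c2")]
HAS_LOAD = [("c1", "l1"), ("c2", "l2")]
SHAPE = [("l1", "rectangle"), ("l2", "circle"), ("l3", "triangle")]

def provable_shapes():
    """Constants D for which hasCar(A,B), hasLoad(B,C), shape(C,D)
    succeeds for at least one train A in the data."""
    return sorted({d
                   for (_a, b) in HAS_CAR
                   for (b2, c) in HAS_LOAD if b2 == b
                   for (c2, d) in SHAPE if c2 == c})

def instantiate_features():
    """One copy of the feature per provable constant, with the
    instantiate/1 literal dropped and D replaced by the constant."""
    return [f"f1_{i}(A) :- hasCar(A,B), hasLoad(B,C), shape(C,{d})."
            for i, d in enumerate(provable_shapes(), start=1)]
```

In this toy data the constant triangle is never used, because load l3 is not attached to any car, so no feature is generated for it.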
A similar principle applies for features with multiple occurrences of instantiate/1 literals. Arguments of these literals within a feature form a set of variables ϑ; only those (complete) instantiations of ϑ making the feature’s body provable on the input database will be considered. Upon the user’s request, the system also repeats this feature expansion process considering the negated version of each constructible feature.² However, not all such features will appear in the resulting set. For the sake of efficiency, we do not perform feature filtering by a separate post-processing procedure, but rather discard certain features already during the feature construction process described above. We keep a currently developed feature f if and only if, simultaneously, (a) no feature has so far been generated that covers (is satisfied for) the same set of instances in the input data as f, (b) f does not cover all instances, and (c) either the fraction of instances covered by f is larger than a user-specified threshold, or the threshold coverage is reached by ¬f.
Creating a Single-Relational Representation. When an appropriate set of features has been generated, RSD can use it to produce a single relational table representing the original data. Currently, the following data formats are supported: a comma-separated text file, a WEKA input file, a CN2 input file, and a file acceptable by the RSD subgroup discovery component [17].
3.2 SINUS
What follows is an overview of the SINUS approach. More detailed information about the particulars of the implementation and its wide set of options can be found on the SINUS website at http://www.cs.bris.ac.uk/home/rawles/sinus.
¹ This is similar to using the # mode in a Progol or Aleph declaration.
² Note also that negations on individual literals can be applied via appropriate declarations in Step 1 of the feature construction.
LINUS. SINUS was first implemented as an intended extension to the original LINUS transformational ILP learner [13]. Work had been done on incorporating feature generation mechanisms into LINUS for structured domains, and SINUS arose from a desire to incorporate this into a modular, transformational ILP system which integrated its propositional learner, including translating induced models back into Prolog form. The original LINUS system had little support for the generation of features as they are discussed here. Transformation was performed by considering only possible applications of background predicates on the arguments of the target relation, taking into account the types of arguments. The clauses it could learn were constrained. The development of DINUS (‘determinate LINUS’) [13] relaxed the bias so that non-constrained clauses could be constructed, given that the clauses involved were determinate. DINUS was also extendable to learn recursive clauses. However, not all real-world structured domains have the determinacy property, and for learning in these kinds of domains, feature generation of the sort discussed here is necessary.
SINUS. SINUS 1.0.3 is implemented in SICStus Prolog and provides an environment for transformational ILP experimentation, taking ground facts and transforming them to standalone Prolog models. The system works by performing a series of distinct and sequential steps. These steps form the functional decomposition of the system into modules, which enables a ‘plug-in’ approach to experimentation: the user can elect to use a number of alternative approaches for each step. For the sake of this comparison, we focus on the propositionalization step, taking into account the nature of the declarations processed before it.
– Processing the input declarations. SINUS takes in a set of declarations for each predicate involved in the facts and the background knowledge.
– Constructing the types. SINUS constructs a set of values for each type from the predicate declarations.
– Feature generation. The first-order features to be used as attributes in the input to the propositional learner are recursively generated.
– Feature reduction. The set of features generated is reduced. For example, irrelevant features may be removed, or a feature quality measure applied.
– Propositionalization. The propositional table of data is prepared internally.
– File output and invocation of the propositional learner. The necessary files are output ready for the learner to use and the user’s chosen learner is invoked from inside SINUS. At present the CN2 [5] and CN2-SD (subgroup discovery) [15] learners are supported, as well as Ripper [6]. The Weka ARFF format may also be used.
– Transformation and output of rules. The models induced by the propositional learner are translated back into Prolog form.
Predicate Declaration and Bias. SINUS uses flattened Prolog clauses together with a definition of that data. This definition takes the form of an adapted
PRD file (as in the first-order Bayesian classifier 1BC [8]), which gives information about each predicate used in the facts and background information. Each predicate is listed in separate sections describing individual, structural and property predicates. From the type information, SINUS constructs the range of possible values for each type. It is therefore not necessary to specify the possible values for each type. Although this means values may appear in test data which did not appear in training data, it makes declarations much simpler and allows the easy incorporation of intensional background knowledge.
Example of a Domain Definition in SINUS. Revisiting the trains example, we could define the domain as follows:
    --INDIVIDUAL
    train 1 train cwa
    --STRUCTURAL
    train2car 2 1:train *:#car * cwa
    car2load 2 1:car 1:#load * cwa
    --PROPERTIES
    cshape 2 car #shape * cwa
    clength 2 car #length * cwa
    cwall 2 car #wall * cwa
    croof 2 car #roof * cwa
    cwheels 2 car #wheels * cwa
    lshape 2 load #shapel * cwa
    lnumber 2 load #numberl * cwa
For each predicate, the name and number of arguments is given. Following that appears a list of the types of each argument in turn.³ Types are defined with symbols describing their status. The # symbol denotes an output argument, and its absence indicates an input argument. In the structural predicates, the 1: and *: prefixes allow the user to define the cardinality of the relationships. The example states that while a train has many cars, a car only has one load. SINUS constructs features left-to-right, starting with a single literal describing the individual. For each new literal, SINUS considers the application of a structural or property predicate given the current bindings of the variables. In the case of structural predicates SINUS introduces new variable(s) for all possible type matches.
In the case of property predicates SINUS substitutes all possible constants belonging to a type of the output argument to form the new candidate literals. The user can constrain the following factors of generated features: the maximum number of literals (MaxL parameter), the maximum number of variables (MaxV parameter) and the maximum number of distinct values a type can take (MaxT parameter).

The character of the feature set produced by SINUS depends principally on the choice of whether and how to reuse variables, i.e. whether to use those variables which have already been consumed during construction of a new literal. We consider three possible cases separately.

³ The remaining * cwa was originally for compatibility with PRD files.

No Reuse of Variables. When predicates are not allowed to reuse variables, 27 features are produced. Some features which differ from previous ones in constants only have been omitted for brevity. The full feature set contains one feature for each constant, such as⁴

    f_aaaa(A) :- train(A),hasCar(A,B),shape(B,bucket).
    f_aaaq(A) :- train(A),hasCar(A,B),hasLoad(B,C),lshape(C,circle).
    f_aaax(A) :- train(A),hasCar(A,B),hasLoad(B,C),lnumber(C,0).

This feature set is compact but represents only simple first-order features: those which test for the existence of an object with a given property somewhere in the complex object.

Reuse of Variables. When all predicates are allowed to reuse variables, the feature set increases to 283. Examples of features generated with reuse enabled include

    f_aaab(A) :- train(A),hasCar(A,B),shape(B,bucket),shape(B,ellipse).    (5)

It can be seen that generation with unrestricted variable reuse in this way is not optimal. Firstly, equivalent literals may appear in different orders and form new clauses, and features which are clearly redundant may be produced. The application of the REDUCE algorithm [16] after feature generation eliminates both these problems. Secondly, the size of the feature set increases rapidly with even slight relaxation of constraints.

Structural Predicates only May Reuse Variables. We can allow only structural predicates to reuse variables. Using this setting, by adapting the declarations with a new property notsame 2 car car * cwa, taking two cars as input, we can introduce more complex features, such as the equivalent of feature f from Expression 1. The feature set has far fewer irrelevant and redundant features, but it can still contain decomposable features, since no explicit check is carried out.
However, with careful choice of the reuse option, the feature generation constraints and the background predicates, a practical and compact feature set can be generated.

⁴ Note a minor difference between the feature notation formalisms in SINUS and RSD: the first listed feature, for instance, would be represented as f1(A):-hasCar(A,B),shape(B,bucket) in RSD.

3.3 Comparing RSD and SINUS

Both systems solve the propositionalization problem in a fundamentally similar way: by first-order feature construction viewed as an exploration of the space of legal feature definitions, while the understanding of a feature is common to both systems. Exhaustive feature generation for propositionalization is problematic in some domains due to the exponential increase in feature set size with a number of factors (number of predicates used, maximum number of literals and variables allowed, number of values possible for each type), and experience has shown there are sometimes difficulties with overfitting, as well as time and memory problems during runtime. Both systems provide facilities to overcome this effect. SINUS allows the user to specify the bounds on the number of literals (feature length), variables and number of values taken by the types in a feature, as described above. RSD can constrain the feature length, the variable depth, number of occurrences of specified predicates and their recall number.

Both systems conduct a recursively implemented search to produce an exhaustive set of features which satisfy the user's declarations. However, unlike RSD, SINUS may produce redundant decomposable features; there is no explicit check. Concerning the utilization of constant values in features, SINUS collects all possible constants for each declared type from the database and proceeds to construct features using the collected constants. Unlike SINUS, RSD first generates a feature set with no constants, guided solely by the syntactical declarations. Specified variables are instantiated in a separate following step, in a way that satisfies constraints related to feature coverage on data.

Both systems are also able to generate the propositionalized form of data in a format accepted by propositional learners including CN2 and those present in the system WEKA. SINUS furthermore provides means to translate the outputs of the propositional learner fed with the generated features back into a predicate form.
To interpret such an output of a learner using RSD’s features, one has to look up the meaning of each used feature in the feature definition file. Summary. RSD puts more stress on the pre-processing stage, in that it allows a fine language declaration (such as by setting bounds on the recall of specific predicates, variable-depth etc.), verifies the undecomposability of features and offers efficiency-oriented improvements (pruning techniques in the feature search, coverage-based feature filtering). On the other hand, SINUS provides added value in the post-processing and interpretation of results obtained from a learner using the generated features, in that it is able to translate the resulting hypotheses back into a predicate form.
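The left-to-right feature construction without variable reuse described for SINUS above can be illustrated with a minimal sketch. This is not the SINUS implementation: the declaration dictionaries, the function name generate_features and the feature naming f1, f2, ... are assumptions made for illustration, and the toy domain is much smaller than the trains declarations. Each feature starts with the literal for the individual; a structural predicate introduces a fresh variable, and a property predicate closes the feature by substituting every constant of its output type.

```python
# Hypothetical toy declarations, loosely modelled on the trains example.
INDIVIDUAL = ("train", "train")                         # (predicate, type)
STRUCTURAL = {"train": [("hasCar", "car")],             # type -> [(predicate, new type)]
              "car": [("hasLoad", "load")]}
PROPERTIES = {"car": {"shape": ["bucket", "ellipse"]},  # type -> {predicate: constants}
              "load": {"lshape": ["circle"], "lnumber": ["0", "1"]}}


def generate_features(max_literals):
    """Enumerate feature bodies left to right; every variable is used only once."""
    bodies = []

    def extend(body, var, var_type, next_var):
        if len(body) >= max_literals:       # no room for another literal
            return
        # Property predicates: substitute every constant of the output type.
        for pred, constants in PROPERTIES.get(var_type, {}).items():
            for const in constants:
                bodies.append(body + [f"{pred}({var},{const})"])
        # Structural predicates: introduce a fresh variable of the target type.
        for pred, new_type in STRUCTURAL.get(var_type, []):
            new_var = chr(ord("A") + next_var)
            extend(body + [f"{pred}({var},{new_var})"], new_var, new_type, next_var + 1)

    pred, var_type = INDIVIDUAL
    extend([f"{pred}(A)"], "A", var_type, 1)
    return [f"f{i}(A) :- " + ",".join(body) + "." for i, body in enumerate(bodies, 1)]


for feature in generate_features(max_literals=4):
    print(feature)
```

The max_literals argument plays the role of a bound on feature length; bounds on the number of variables or on the number of values per type would be added as further checks in extend.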
4 Database-Oriented Approaches
In [12], we presented a framework for approaches to propositionalization and an extension thereof by including the application of aggregation functions, which are widely used in the database area. Our approach is based on ideas from MIDOS [25], and it is called RELAGGS, which stands for relational aggregations. It is
very similar to an approach called Polka developed independently by a different research group [9]. A difference between the two approaches concerns efficiency of the implementation, which was higher for Polka. Indeed, we were inspired by Polka to develop new ideas for RELAGGS. Here, we present this new variant of our approach, implemented with Java and MySQL, with an illustrative example at the end of this section. Besides the focus on aggregation functions, we concentrate on the exploitation of relational database schema information, especially foreign key relationships as a basis for a declarative bias during propositionalization, as well as the usage of optimization techniques as usually applied for relational databases such as indexes. These points led us to the heading for this section and do not constitute differences in principle to the logic-oriented approaches as presented above. Rather, predicate logic can be seen as fundamental to relational databases and their query languages. In the following, we prefer to use database terminology, where a relation (table) as a collection of tuples largely corresponds to ground facts of a logical predicate, and an attribute (column) of a relation to an argument of a predicate, cf. also [14]. A relational database can be depicted as a graph with its relations as nodes and foreign key relationships as edges, conventionally by arrows pointing from the foreign key attribute in the dependent table to the corresponding primary key attribute in the independent table, cf. the examples in Fig. 1 below. The main idea of our approach is that it is possible to summarize non-target relations with respect to the individuals dealt with, or in other words, per example from the target relation. In order to relate non-target relation tuples to the individuals, we propagate the identifiers of the individuals to the non-target tables via foreign key relationships. 
This can be accomplished by comparatively inexpensive joins that use indexes on primary and foreign key attributes. In the current variant of RELAGGS, these joins (as views on the database) are materialized in order to allow for fast aggregation. Aggregation functions are applied to single columns as in [12], and to pairs of columns of single tables. The application of the functions depends on the type of attributes. For numeric attributes, average, minimum, maximum, and sum are computed as in [12], and in addition standard deviations, ranges, and quartiles. For nominal attributes, the different possible values are counted, as in [12]. Here, the user can exclude nominal attributes with high numbers of possible values with the help of the parameter cardinality. Besides numeric and nominal attributes, we now also treat identifier attributes as ordinary numeric or nominal attributes, and date attributes as decomposable nominal attributes, e.g. for counting occurrences of a specific year. Using all these aggregation functions, most features constructed here are not Boolean, as is usual in logic-oriented approaches, but numeric. Note that the RELAGGS approach can be seen as corresponding to the application of appropriate utility functions in a logic-oriented setting as pointed to in [14].
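As a rough illustration of the type-dependent aggregation described above, the following sketch summarizes one column of values belonging to a single individual. It is not the RELAGGS code (which is implemented with Java and MySQL); the function names, the feature-name prefixes and the choice of the population standard deviation are assumptions.

```python
import statistics
from collections import Counter


def aggregate_numeric(values):
    """Summarize one numeric column for a single individual."""
    summary = {
        "avg": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "range": max(values) - min(values),
        # Population standard deviation; whether the sample or the population
        # variant is intended is not stated in the text (assumption).
        "stddev": statistics.pstdev(values),
    }
    if len(values) >= 2:  # guard: quartiles of a single value are skipped
        q1, q2, q3 = statistics.quantiles(values, n=4)
        summary.update({"q1": q1, "q2": q2, "q3": q3})
    return summary


def aggregate_nominal(values, possible_values, cardinality):
    """Count each possible value unless the attribute exceeds the cardinality bound."""
    if len(possible_values) > cardinality:
        return {}  # attribute excluded by the user-set parameter
    counts = Counter(values)
    return {f"count_{v}": counts.get(v, 0) for v in possible_values}


print(aggregate_numeric([10, 20, 30, 40]))
print(aggregate_nominal(["a", "b", "a"], ["a", "b", "c"], cardinality=5))
```

The resulting dictionaries would become numeric columns of the propositionalized table, one row per individual.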
[Figure 1, top: District (77), Loan (682), Order (6,471), Account (4,500), Client (5,369), Trans (1,056,320), Disp (5,369), Card (892). Bottom: Account_new (682), Card_new (36), Client_new (827), Disp_new (827), Loan (682), District_new (1,509), Order_new (1,513), Trans_new (54,694).]

Fig. 1. Top: The PKDD 1999/2000 challenges financial data set: Relations as rectangles with relation names and tuple numbers in parentheses, arrows indicate foreign-key relationships [2]. Bottom: Relations after identifier propagation
Example 1 (A PKDD Data Set). Figure 1 (top) depicts parts of a relational database schema provided for the PKDD 1999/2000 challenges [2]. This data set is also used for our experiments reported on below in this paper, with table Loan as target relation containing the target attribute Status. All relations have a single-attribute primary key of type integer with a name built from the relation name, such as Loan_id. Foreign key attributes are named as their primary key counterparts. Single-attribute integer keys are common and correspond to general recommendations for efficient relational database design. Here, this allows for fast propagation of example identifiers, e.g. by a statement such as

    select Loan.Loan_id, Trans.*
    from Loan, Trans
    where Loan.Account_id = Trans.Account_id;

using indexes on the Account_id attributes. The result of this query forms the relation Trans_new.

Figure 1 (bottom) depicts the database following the introduction of additional foreign key attributes for propagated example identifiers in the non-target relations. Note that relation Trans_new, which contains information about transactions on accounts, has become much smaller than the original relation Trans, mainly because there are loans for a minority of accounts only. This holds in a similar way for most other relations in this example. However, relation District_new has grown compared to District, now being the sum of accounts' district and clients' district information.
The new relations can be summarized with aggregation functions in group by Loan_id statements that are especially efficient here because no further joins have to be executed after identifier propagation. Finally, results of summarization such as values for a feature min(Trans_new.Balance) are concatenated to the central table's Loan tuples to form the result of propositionalization.
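The identifier propagation and the subsequent group by summarization of Example 1 can be tried out on a toy database. The sketch below uses Python's sqlite3 module with an in-memory database instead of MySQL; the table contents, the Trans_id column, the index names and the feature names min_balance and n_trans are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    create table Loan  (Loan_id integer primary key, Account_id integer, Status text);
    create table Trans (Trans_id integer primary key, Account_id integer, Balance real);
    create index Loan_acc  on Loan(Account_id);
    create index Trans_acc on Trans(Account_id);
    insert into Loan  values (1, 10, 'A'), (2, 20, 'B');
    insert into Trans values (1, 10, 500.0), (2, 10, 120.0), (3, 20, 80.0), (4, 30, 999.0);

    -- Identifier propagation: materialize the join as Trans_new.
    create table Trans_new as
        select Loan.Loan_id, Trans.*
        from Loan, Trans
        where Loan.Account_id = Trans.Account_id;
""")

# Summarize Trans_new per loan and attach the features to the Loan tuples.
rows = cur.execute("""
    select Loan.Loan_id, Loan.Status, agg.min_balance, agg.n_trans
    from Loan
    join (select Loan_id, min(Balance) as min_balance, count(*) as n_trans
          from Trans_new
          group by Loan_id) as agg
      on agg.Loan_id = Loan.Loan_id
    order by Loan.Loan_id
""").fetchall()

for row in rows:
    print(row)
```

The transaction on account 30 has no associated loan, so it does not appear in Trans_new; this mirrors the observation that Trans_new is much smaller than Trans.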
5 Empirical Evaluation

5.1 Learning Tasks
We chose to focus on binary classification tasks for a series of experiments to evaluate the different approaches to propositionalization described above, although the approaches can also support solutions of multi-class problems, regression problems, and even other types of learning tasks such as subgroup discovery.

As an example of the series of Trains data sets and problems as first instantiated by the East-West challenge [18], we chose a 20 trains problem, already used as an illustrating example earlier in this paper. For these trains, information is given about their cars and the loads of these cars. The learning task is to discover (low-complexity) models that classify trains as eastbound or westbound.

In the chess endgame domain White King and Rook versus Black King, taken from [20], the target relation illegal(A, B, C, D, E, F) states whether a position where the White King is at file and rank (A, B), the White Rook at (C, D) and the Black King at (E, F) is an illegal White-to-move position. For example, illegal(g, 6, c, 7, c, 8) is a positive example, i.e., an illegal position. Two background predicates are available: lt/2 expressing the "less than" relation on a pair of ranks (files), and adj/2 denoting the adjacency relation on such pairs. The data set consists of 1,000 instances.

For the Mutagenesis problem, [23] presents a variant of the original data named NS+S2 (also known as B4) that contains information about chemical concepts relevant to a special kind of drugs, the drugs' atoms and the bonds between those atoms. The Mutagenesis learning task is to predict whether a drug is mutagenic or not. The separation of data into "regression-friendly" (188 instances) and "regression-unfriendly" (42 instances) subsets as described by [23] is kept here. Our investigations concentrate on the first subset.

The PKDD Challenges in 1999 and 2000 offered a data set from a Czech bank [2].
The data set comprises 8 relations that describe accounts, their transactions, orders, and loans, as well as customers including personal, credit card ownership, and socio-demographic data, cf. Fig. 1. A learning task was not explicitly given for the challenges. We compare problematic to non-problematic loans regardless of whether the loan projects are finished or not. We exclude information from the analysis dating after loan grantings in order to arrive at models with predictive power for decision support in loan granting processes. The data describes 682 loans.

The KDD Cup 2001 [4] tasks 2 and 3 asked for the prediction of gene function and gene localization, respectively. From these non-binary classification tasks, we extracted two binary tasks, viz. the prediction whether a gene codes for a protein
that serves cell growth, cell division and DNA synthesis or not and the prediction whether the protein produced by the gene described would be allocated in the nucleus or not. We deal here with the 862 training examples provided for the Cup.

5.2 Procedure
The general scheme for experiments reported here is the following. As a starting point, we take identical preparations of the data sets in Prolog form. These are adapted for usage with the different propositionalization systems, e.g. SQL scripts with create table and insert statements are derived from Prolog ground facts in a straightforward manner. Then, propositionalization is carried out and the results are formatted in a way accessible to the data mining environment WEKA [24]. Here, we use the J48 learner, which is basically a reimplementation of C4.5 [21]. We use default parameter settings of this learner, including a stratified 10-fold cross-validation scheme for evaluating the learning results.

The software used for the experiments as well as SQL scripts used in the RELAGGS application are available on request from the first author. Declaration and background knowledge files used with SINUS and RSD are available from the second and third author, respectively.

Both RSD and SINUS share the same basic first-order background knowledge in all domains, adapted in formal ways for compatibility purposes. The language constraint settings applicable in either system are in principle different, and for each system they were set to values that allowed feature generation to complete within 30 minutes. Varying the language constraints (for RSD also the minimum feature coverage constraint; for RELAGGS: parameter cardinality), feature sets of different sizes were obtained, each supplied for a separate learning experiment.

5.3 Results
Accuracies. Figure 2 presents for all six learning problems the predictive accuracies obtained by the J48 learner supplied with propositional data based on feature sets of growing sizes, resulting from each of the respective propositionalization systems.

Running Times. The three tested systems are implemented in different languages and interpreters and operate on different hardware platforms. An exact comparison of efficiency was thus not possible. For each domain and system we report the approximate average (over feature sets of different sizes) running times. RSD ran under the Yap Prolog on a Celeron 800 MHz computer with 256 MB of RAM. SINUS was running under SICStus Prolog⁵ on a Sun Ultra 10 computer. For the Java implementation of RELAGGS, a PC platform was used with a 2.2 GHz processor and 512 MB main memory. Table 1 shows running times of the propositionalization systems on the learning tasks with best results in bold.

⁵ It should be noted that SICStus Prolog is generally considered to be several times slower than Yap Prolog.

[Figure 2: six panels (East-West Trains; KRK; PKDD'99 Financial Challenge; Mutagenesis; KDD'01 Challenge (growth); KDD'01 Challenge (nucleus)), each plotting Accuracy [%] against Number of Features (logarithmic axis) for SINUS, RSD and RELAGGS; the KRK panel shows SINUS and RSD only.]

Fig. 2. Accuracies resulting from the J48 propositional learner supplied with propositionalized data based on feature sets of varying size obtained from three propositionalization systems. The bottom line of each diagram corresponds to the accuracy of the majority vote

5.4 Discussion
The obtained results are not generally conclusive in favor of either of the tested systems. Interestingly, from the point of view of predictive accuracy, each of them provided the winning feature set in exactly two domains.
Table 1. Indicators of running times (different platforms, cf. text) and systems providing the feature set for the best-accuracy result in each domain

    Problem                    Running Times                         Best Accuracy
                               RSD       SINUS         RELAGGS       Achieved with
    Trains                     < 1 sec   2 to 10 min   < 1 sec       RELAGGS
    King-Rook-King             < 1 sec   2 to 6 min    n.a.          RSD
    Mutagenesis                5 min     6 to 15 min   30 sec        RSD
    PKDD99-00 Loan.status      5 sec     2 to 30 min   30 sec        RELAGGS
    KDD01 Gene.fctCellGrowth   3 min     30 min        1 min         SINUS
    KDD01 Gene.locNucleus      3 min     30 min        1 min         SINUS
The strength of the aggregation approach implemented by RELAGGS manifested itself in the domain of East-West Trains (where counting of structural primitives seems to outperform the pure existential quantification used by the logic-based approaches) and, more importantly, in the PKDD'99 financial challenge rich with numeric data, evidently well-modeled by RELAGGS' features based on the computation of data statistics. On the other hand, this approach could not yield any reasonable results for the purely relational challenge of the King-Rook-King problem.

Different performances of the two logic-based approaches, RSD and SINUS, are mainly due to their different ways of constraining the language bias. SINUS wins in both of the KDD 2001 challenge versions, RSD wins in the KRK domain and Mutagenesis. While the gap on KRK seems insignificant, the result obtained on Mutagenesis with RSD's 25 features⁶ is, with an accuracy of 92.6%, the best reported so far that we are aware of.

From the point of view of running times, RELAGGS seems to be the most efficient system. It seems to be outperformed on the PKDD challenge by RSD; however, on this domain the features of both of the logic-based systems are very simple (ignoring the cumulative effects of numeric observations) and yield relatively poor accuracy results. Whether the apparent efficiency superiority of RSD with respect to SINUS is due to RSD's pruning mechanisms, or the implementation in the faster Yap Prolog, or a combined effect thereof has yet to be determined.
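Two ingredients of the evaluation protocol, the majority-vote baseline shown as the bottom line in Fig. 2 and the stratified 10-fold split, can be sketched in a few lines. This is only an illustration under simple assumptions: the function names are hypothetical, and the round-robin dealing of shuffled indices is one common way to stratify, not necessarily the procedure used inside WEKA.

```python
import random
from collections import Counter, defaultdict


def majority_vote_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def stratified_folds(labels, k=10, seed=0):
    """Assign example indices to k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[position % k].append(index)  # deal indices round-robin
            position += 1
    return folds


# Toy labels (made up): 10 examples of each of two classes.
labels = ["east"] * 10 + ["west"] * 10
folds = stratified_folds(labels, k=10)
print(majority_vote_accuracy(labels))
print([sorted(labels[i] for i in fold) for fold in folds])
```

Each fold serves once as the test set while the remaining folds are used for training; a feature set is only informative if the learner's cross-validated accuracy exceeds the majority-vote value.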
6 Future Work and Conclusion
In future work, we plan to complete the formal framework started in [12], which should also help to clarify relationships between the approaches. We intend to compare our systems to other ILP approaches such as Progol [19] and Tilde [3]. Furthermore, extensions of the feature subset selection mechanisms in the different systems should be considered, as well as other propositional learners such as support vector machines.

⁶ The longest have 5 literals in their bodies. Prior to irrelevant-feature filtering conducted by RSD, the feature set has more than 5,000 features.
Specifically, for RELAGGS, we will investigate a deeper integration with databases, also taking into account their dynamics. The highest future work priorities for SINUS are the implementation of direct support for data relationships informing feature construction, the incorporation of a range of feature elimination mechanisms, and greater control over the bias used for feature construction. In RSD, we will try to devise a procedure to interpret the results of a propositional learner by a first-order theory by plugging the generated features into the obtained hypothesis.

As this paper has shown, each of the three considered systems has certain unique benefits. The common goal of all of the involved developers is to implement a wrapper that would integrate the advantages of each.

Acknowledgments

This work was partially supported by the DFG (German Science Foundation), projects FOR345/1-1TP6 and WR 40/2-1. Part of the work was done during a visit of the second author to the Czech Technical University in Prague, funded by an EC MIRACLE grant (contract ICA1-CT-2000-70002). The third author was supported by DARPA EELD Grant F30602-01-2-0571 and by the Czech Ministry of Education MSM 212300013 (Decision making and control for industrial production).
References

1. E. Alphonse and C. Rouveirol. Lazy propositionalisation for Relational Learning. In W. Horn, editor, Proceedings of the Fourteenth European Conference on Artificial Intelligence (ECAI), pages 256–260. IOS, 2000.
2. P. Berka. Guide to the Financial Data Set. In A. Siebes and P. Berka, editors, PKDD 2000 Discovery Challenge, 2000.
3. H. Blockeel and L. De Raedt. Top-down induction of first-order logical decision trees. Artificial Intelligence, 101(1-2):285–297, 1998.
4. J. Cheng, C. Hatzis, H. Hayashi, M.-A. Krogel, S. Morishita, D. Page, and J. Sese. KDD Cup 2001 Report. SIGKDD Explorations, 3(2):47–64, 2002.
5. P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261–283, 1989.
6. W.W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning (ICML), pages 115–123. Morgan Kaufmann, 1995.
7. P.A. Flach. Knowledge representation for inductive learning. In A. Hunter and S. Parsons, editors, Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU), LNAI 1638, pages 160–167. Springer, 1999.
8. P.A. Flach and N. Lachiche. 1BC: A first-order Bayesian classifier. In S. Džeroski and P.A. Flach, editors, Proceedings of the Ninth International Conference on Inductive Logic Programming (ILP), LNAI 1634, pages 92–103. Springer, 1999.
9. A.J. Knobbe, M. de Haas, and A. Siebes. Propositionalisation and Aggregates. In L. de Raedt and A. Siebes, editors, Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), LNAI 2168, pages 277–288. Springer, 2001.
10. S. Kramer and E. Frank. Bottom-up propositionalization. In Work-in-Progress Track at the Tenth International Conference on Inductive Logic Programming (ILP), 2000.
11. S. Kramer, N. Lavrač, and P.A. Flach. Propositionalization Approaches to Relational Data Mining. In N. Lavrač and S. Džeroski, editors, Relational Data Mining, pages 262–291. Springer, 2001.
12. M.-A. Krogel and S. Wrobel. Transformation-Based Learning Using Multirelational Aggregation. In C. Rouveirol and M. Sebag, editors, Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP), LNAI 2157, pages 142–155. Springer, 2001.
13. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
14. N. Lavrač and P.A. Flach. An extended transformation approach to Inductive Logic Programming. ACM Transactions on Computational Logic, 2(4):458–494, 2001.
15. N. Lavrač, P.A. Flach, B. Kavšek, and L. Todorovski. Adapting classification rule induction to subgroup discovery. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM), pages 266–273. IEEE, 2002.
16. N. Lavrač, D. Gamberger, and P. Turney. A relevancy filter for constructive induction. IEEE Intelligent Systems, 13(2):50–56, 1998.
17. N. Lavrač, F. Železný, and P.A. Flach. RSD: Relational subgroup discovery through first-order feature construction. In S. Matwin and C. Sammut, editors, Proceedings of the Twelfth International Conference on Inductive Logic Programming (ILP), LNAI 2538, pages 149–165. Springer, 2002.
18. R.S. Michalski. Pattern Recognition as Rule-guided Inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(4):349–361, 1980.
19. S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):245–286, 1995.
20. J.R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239–266, 1990.
21. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
22. A. Srinivasan and R.D. King. Feature construction with inductive logic programming: A study of quantitative predictions of biological activity aided by structural attributes. In S. Muggleton, editor, Proceedings of the Sixth International Conference on Inductive Logic Programming (ILP), LNAI 1314, pages 89–104. Springer, 1996.
23. A. Srinivasan, S.H. Muggleton, M.J.E. Sternberg, and R.D. King. Theories for mutagenicity: A study in first-order and feature-based induction. Artificial Intelligence, 85(1,2):277–299, 1996.
24. I.H. Witten and E. Frank. Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
25. S. Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD), LNAI 1263, pages 78–87. Springer, 1997.
Ideal Refinement of Descriptions in AL-Log

Francesca A. Lisi and Donato Malerba

Dipartimento di Informatica, University of Bari, Italy
{lisi,malerba}@di.uniba.it
Abstract. This paper deals with learning in AL-log, a hybrid language that merges the function-free Horn clause language Datalog and the description logic ALC. Our application context is descriptive data mining. We introduce O-queries, a rule-based form of unary conjunctive queries in AL-log, and a generality order ⪰_B for structuring spaces of O-queries. We define a (downward) refinement operator ρ_O for ⪰_B-ordered spaces of O-queries, prove its ideality and discuss an efficient implementation of it in the context of interest.
1 Introduction
Hybrid systems are a special class of knowledge representation systems which are constituted by two or more subsystems dealing with distinct portions of a knowledge base and specific reasoning procedures [11]. The characterizing feature of hybrid systems is that the whole system is in charge of a single knowledge base, thus combining knowledge and reasoning services of the different subsystems in order to answer user questions. Indeed the motivation for building hybrid systems is to improve on two basic features of knowledge representation formalisms, namely representational adequacy and deductive power. Among hybrid systems, languages such as Carin [14] and AL-log [8] are particularly interesting because they bridge the gap between description logics (DLs) and Horn clausal logic (notoriously incomparable with respect to expressive power [3]). E.g., AL-log combines Datalog [5] and ALC [23].

Whereas learning pure DLs has been quite widely investigated [6,13,1], there are very few attempts at learning in DL-based hybrid languages. In [22] the chosen language is Carin-ALN, therefore example coverage and subsumption between two hypotheses are based on the existential entailment algorithm of Carin. Following [22], Kietz studies the learnability of Carin-ALN, thus providing a pre-processing method which enables ILP systems to learn Carin-ALN rules [12]. Closely related to DL-based hybrid systems are the proposals arising from the study of many-sorted logics, where a first-order language is combined with a sort language which can be regarded as an elementary DL [9]. In this respect the study of sorted downward refinement [10] can be also considered a contribution to learning in hybrid languages.

In this paper we deal with learning in AL-log. This language merges Datalog [5] and ALC [23] by using concept assertions essentially as type constraints on variables. For constrained Datalog clauses we have defined the relation of B-subsumption and the subsequent generality order ⪰_B, and provided a decidable procedure to check ⪰_B on the basis of constrained SLD-resolution [17].

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 215–232, 2003. © Springer-Verlag Berlin Heidelberg 2003
This work presents a case study for B-subsumption in the context of descriptive data mining. As opposed to prediction, description focuses on finding human-interpretable patterns describing a data set r. Among descriptive tasks, frequent pattern discovery aims at the extraction of all patterns whose cardinality exceeds a user-defined threshold. Indeed each pattern is considered as an intensional description (expressed in a given language L) of a subset of r. We propose a variant of this task which takes concept hierarchies into account during the discovery process, thus yielding descriptions of r at multiple granularity levels. More formally, given

– a data set r including a taxonomy T where a reference concept and task-relevant concepts are designated,
– a set {L_l}_{1≤l≤maxG} of languages
– a set {minsup_l}_{1≤l≤maxG} of support thresholds

the problem of frequent pattern discovery at l levels of description granularity, 1 ≤ l ≤ maxG, is to find the set F of all the patterns P ∈ L_l frequent in r, namely patterns P with support s such that (i) s ≥ minsup_l and (ii) all ancestors of P w.r.t. T are frequent.

An ILP approach to this problem requires the specification of a language L of hypotheses and a generality relation for L. To this aim we introduce O-queries, a rule-based form of unary conjunctive queries in AL-log, and study ⪰_B-ordered spaces of O-queries. Descriptive data mining problems are characterized by hypothesis spaces with high solution density. Ideal refinement operators are usually suggested to search spaces of this kind [2]. Unfortunately for clausal languages ordered by θ-subsumption or stronger orders, ideal operators have been proven not to exist [20]. Yet they can be defined in some restricted yet meaningful cases. The main contribution of this paper is the definition of an ideal downward refinement operator ρ_O for ⪰_B-ordered spaces of O-queries in the context of frequent pattern discovery at multiple levels of description granularity.
The paper is organized as follows. Section 2 introduces the basic notions of AL-log. Section 3 defines the space of O-queries organized according to B-subsumption. Section 4 presents the refinement operator ρ_O, proves its ideality and discusses an efficient implementation of it in the context of interest. Section 5 concludes the paper with final remarks.
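The two-part frequency condition in the problem statement of the Introduction (a level-specific support threshold, plus frequency of all ancestors) can be made concrete with a small sketch. Patterns are treated here as opaque names; the dictionaries, the pattern names and the support values are hypothetical and only illustrate the condition, not the O-query machinery developed in the paper.

```python
def frequent_patterns(support, level, parents, minsup):
    """Return the patterns that meet their level's threshold and whose ancestors all do.

    support: pattern -> support in r
    level:   pattern -> granularity level l (1 .. maxG)
    parents: pattern -> list of direct ancestors w.r.t. the taxonomy T
    minsup:  level -> support threshold minsup_l
    """
    cache = {}

    def is_frequent(pattern):
        if pattern not in cache:
            cache[pattern] = (
                support[pattern] >= minsup[level[pattern]]                 # condition (i)
                and all(is_frequent(p) for p in parents.get(pattern, []))  # condition (ii)
            )
        return cache[pattern]

    return {p for p in support if is_frequent(p)}


# Made-up example with two granularity levels.
support = {"P_food": 0.60, "P_fruit": 0.30, "P_meat": 0.05,
           "P_drink": 0.15, "P_wine": 0.12}
level = {"P_food": 1, "P_drink": 1, "P_fruit": 2, "P_meat": 2, "P_wine": 2}
parents = {"P_fruit": ["P_food"], "P_meat": ["P_food"], "P_wine": ["P_drink"]}
minsup = {1: 0.20, 2: 0.10}

print(sorted(frequent_patterns(support, level, parents, minsup)))
```

In this toy example P_wine reaches the level-2 threshold but is discarded because its ancestor P_drink is infrequent at level 1, which is exactly what condition (ii) demands.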
2 AL-Log in a Nutshell
The language AL-log [8] combines the representation and reasoning means offered by Datalog and ALC . Indeed it embodies two subsystems, called relational and structural. We assume the reader to be familiar with Datalog, therefore we focus on the structural subsystem and hybridization of the relational subsystem.
Ideal Refinement of Descriptions in AL-Log
2.1 The Structural Subsystem
The structural subsystem of AL-log allows for the specification of structural knowledge in terms of concepts, roles, and individuals. Individuals represent objects in the domain of interest. Concepts represent classes of these objects, while roles represent binary relations between concepts. Complex concepts can be defined by means of constructs, such as ⊓ and ⊔. The structural subsystem is itself a two-component system. The intensional component T consists of concept hierarchies spanned by is-a relations between concepts, namely inclusion statements of the form C ⊑ D (read "C is included in D") where C and D are two arbitrary concepts. The extensional component M specifies instance-of relations, e.g. concept assertions of the form a : C (read "the individual a belongs to the concept C") and role assertions of the form aRb (read "the individual a is related to the individual b by means of the role R").
In ALC knowledge bases, an interpretation I = (∆^I, ·^I) consists of a set ∆^I (the domain of I) and a function ·^I (the interpretation function of I). E.g., it maps concepts to subsets of ∆^I and individuals to elements of ∆^I such that a^I ≠ b^I if a ≠ b (see unique names assumption [21]). We say that I is a model for C ⊑ D if C^I ⊆ D^I, for a : C if a^I ∈ C^I, and for aRb if (a^I, b^I) ∈ R^I.
The main reasoning mechanism for the structural component is the satisfiability check. The tableau calculus proposed in [8] starts with the tableau branch S = T ∪ M and adds assertions to S by means of propagation rules such as
– S →⊔ S ∪ {s : D} if
  1. s : C1 ⊔ C2 is in S,
  2. D = C1 or D = C2,
  3. neither s : C1 nor s : C2 is in S
– S →∀ S ∪ {t : C} if
  1. s : ∀R.C is in S,
  2. sRt is in S,
  3. t : C is not in S
– S →⊑ S ∪ {s : C′ ⊔ D} if
  1. C ⊑ D is in S,
  2. s appears in S,
  3. C′ is the NNF concept equivalent to ¬C,
  4. s : ¬C ⊔ D is not in S
– S →⊥ {s : ⊥} if
  1. s : A and s : ¬A are in S, or
  2. s : ¬⊤ is in S,
  3. s : ⊥ is not in S
until either a contradiction is generated or an interpretation satisfying S can be easily obtained from it.
Francesca A. Lisi and Donato Malerba
2.2 Hybridization of the Relational Subsystem
The relational part of AL-log allows one to define Datalog programs enriched with constraints of the form s : C where s is either a constant or a variable, and C is an ALC-concept. Note that the usage of concepts as typing constraints applies only to variables and constants that already appear in the clause. The symbol & separates constraints from Datalog atoms in a clause.
Definition 1. A constrained Datalog clause is an implication of the form
α0 ← α1, ..., αm & γ1, ..., γn
where m ≥ 0, n ≥ 0, αi are Datalog atoms and γj are constraints. A constrained Datalog program Π is a set of constrained Datalog clauses. An AL-log knowledge base B is the pair ⟨Σ, Π⟩ where Σ is an ALC knowledge base and Π is a constrained Datalog program. For a knowledge base to be acceptable, it must satisfy the following conditions:
– The set of Datalog predicate symbols appearing in Π is disjoint from the set of concept and role symbols appearing in Σ.
– The alphabet of constants in Π coincides with the alphabet O of the individuals in Σ. Furthermore, every constant in Π appears also in Σ.
– For each clause in Π, each variable occurring in the constraint part occurs also in the Datalog part.
These properties allow for the extension of terminology and results related to the notion of substitution from Datalog to AL-log in a straightforward manner.
Example 1. As a running example, we consider an AL-log knowledge base B obtained from the Northwind Traders database. The structural subsystem Σ should reflect the E/R model underlying the Northwind Traders database. To serve our illustrative purpose we focus on the concepts (entities) Order, Product and Customer. The intensional part of Σ encompasses inclusion statements such as DairyProduct ⊑ Product and EuroCustomer = Customer ⊓ ∃LivesIn.EuroCountry that define two taxonomies, one for Product and the other one for Customer. The extensional part of Σ contains assertions like order10248:Order, product11:DairyProduct, 'VINET' LivesIn 'France' and 'France':EuroCountry. The relational subsystem Π expresses the Northwind Traders database as a constrained Datalog program. We restrict ourselves to the relations Order and OrderDetail. The extensional part of Π consists of facts such as order(order10248,'VINET',...) whereas the intensional part defines two views on order and orderDetail:
item(OrderID,ProductID) ← orderDetail(OrderID,ProductID,_,_,_) & OrderID:Order, ProductID:Product
purchaser(OrderID,CustomerID) ← order(OrderID,CustomerID,_,...,_) & OrderID:Order, CustomerID:Customer
that, when triggered on B, can deduce facts such as item(order10248,product11) and purchaser(order10248,'VINET').
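The way the two views of Example 1 combine a Datalog body with typing constraints can be illustrated by a minimal Python sketch. It is not part of AL-log or of any system described here: the dictionaries `MEMBER` and `SUBCONCEPT`, the helper `belongs_to`, and the assertion 'VINET':Customer are assumptions introduced for illustration, and the elided columns of the facts are kept elided by means of Python's `...` literal.

```python
# Structural part (assumed encoding): primitive is-a links and concept assertions.
SUBCONCEPT = {"DairyProduct": "Product"}
MEMBER = {
    "order10248": "Order",
    "product11": "DairyProduct",
    "VINET": "Customer",  # assumption: not stated explicitly in Example 1
}

# Relational part: extensional facts, remaining columns left elided.
ORDER = [("order10248", "VINET", ...)]
ORDER_DETAIL = [("order10248", "product11", ...)]


def belongs_to(individual, concept):
    """True if the asserted concept of `individual` is `concept` or one of its descendants."""
    current = MEMBER.get(individual)
    while current is not None:
        if current == concept:
            return True
        current = SUBCONCEPT.get(current)
    return False


def item():
    """item(OrderID,ProductID) <- orderDetail(...) & OrderID:Order, ProductID:Product"""
    return {(o, p) for o, p, *_ in ORDER_DETAIL
            if belongs_to(o, "Order") and belongs_to(p, "Product")}


def purchaser():
    """purchaser(OrderID,CustomerID) <- order(...) & OrderID:Order, CustomerID:Customer"""
    return {(o, c) for o, c, *_ in ORDER
            if belongs_to(o, "Order") and belongs_to(c, "Customer")}


print(item())
print(purchaser())
```

The sketch checks constraints by walking a chain of primitive inclusions, which is enough for this example; in AL-log proper the constraints are discharged by the tableau calculus of Section 2.1.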
The interaction between the structural and the relational part of an AL-log knowledge base is also at the basis of a model-theoretic semantics for AL-log. We call Π_D the set of Datalog clauses obtained from the clauses of Π by deleting their constraints. We define an interpretation J for B as the union of an O-interpretation I_O for Σ (i.e. an interpretation compliant with the unique names assumption) and a Herbrand interpretation I_H for Π_D. An interpretation J is a model of B if I_O is a model of Σ, and for each ground instance ᾱ′ & γ′1, ..., γ′n of each clause ᾱ & γ1, ..., γn in Π, either there exists one γ′i, i ∈ {1, ..., n}, that is not satisfied by J, or ᾱ′ is satisfied by J.
The notion of logical consequence paves the way to the definition of answer set for queries. Queries to AL-log knowledge bases are special cases of Definition 1. An answer to the query Q is a ground substitution σ for the variables in Q. The answer σ is correct w.r.t. an AL-log knowledge base B if Qσ is a logical consequence of B (B ⊨ Qσ). The answer set of Q in B contains all the correct answers to Q w.r.t. B.
Reasoning for AL-log knowledge bases is based on constrained SLD-resolution [8], i.e. an extension of SLD-resolution to deal with constraints. In particular, the constraints of the resolvent of a query Q and a constrained Datalog clause E are recursively simplified by replacing couples of constraints t : C, t : D with the equivalent constraint t : C ⊓ D. The one-to-one mapping between constrained SLD-derivations and the SLD-derivations obtained by ignoring the constraints is exploited to extend known results for Datalog to AL-log. Note that in AL-log a derivation of the empty clause with associated constraints does not represent a refutation. It actually infers that the query is true in those models of B that satisfy its constraints. Therefore in order to answer a query it is necessary to collect enough derivations ending with a constrained empty clause such that every model of B satisfies the constraints associated with the final query of at least one derivation.
Definition 2. Let Q^(0) be a query ← β1, ..., βm & γ1, ..., γn to an AL-log knowledge base B. A constrained SLD-refutation for Q^(0) in B is a finite set {d1, ..., ds} of constrained SLD-derivations for Q^(0) in B such that:
1. for each derivation di, 1 ≤ i ≤ s, the last query Q^(ni) of di is a constrained empty clause;
2. for every model J of B, there exists at least one derivation di, 1 ≤ i ≤ s, such that J ⊨ Q^(ni).
Constrained SLD-refutation is a complete and sound method for answering ground queries. An answer σ to a query Q is a computed answer if there exists a constrained SLD-refutation for Qσ in B (B ⊢ Qσ). The set of computed answers is called the success set of Q in B. Furthermore, given any query Q, the success set of Q in B coincides with the answer set of Q in B. This provides an operational means for computing correct answers to queries. Indeed, it is straightforward to see that the usual reasoning methods for Datalog allow us to collect in a finite number of steps enough constrained SLD-derivations for Q in B to construct a refutation, if any. Derivations must satisfy both conditions of Definition 2. In particular, the latter requires some reasoning on the structural
component of B. This is done by applying the tableau calculus as shown in the following example.
Example 2. Following Example 1, we compute a correct answer to
Q = ← purchaser(order10248,Y) & order10248:Order, Y:EuroCustomer
w.r.t. B. A refutation for Q = Q^(0) consists of the following single constrained SLD-derivation. Let E^(1) be
purchaser(OrderID,CustomerID) ← order(OrderID,CustomerID,_,...,_) & OrderID:Order, CustomerID:Customer
A resolvent for Q^(0) and E^(1) with substitution σ^(1) = {OrderID/order10248, CustomerID/Y} is the query
Q^(1) = ← order(order10248,Y,_,...,_) & order10248:Order, Y:EuroCustomer
Let E^(2) be order(order10248,'VINET',_,...,_). A resolvent for Q^(1) and E^(2) with substitution σ^(2) = {Y/'VINET'} is the constrained empty clause
Q^(2) = ← & order10248:Order, 'VINET':EuroCustomer
What we need to check is that Σ ∪ {order10248:Order, 'VINET':EuroCustomer} is satisfiable. This check amounts to two unsatisfiability checks to be performed by applying the tableau calculus. The first check operates on the initial tableau S^(0) = Σ ∪ {order10248:¬Order}. The application of the propagation rule →⊥ to S^(0) produces the tableau S^(1) = {order10248:⊥}. Computation stops here because no other rule can be applied to S^(1). Since S^(1) is complete and contains a clash, the initial tableau S^(0) is unsatisfiable. The second check operates on the initial tableau S^(0) = Σ ∪ {'VINET':¬EuroCustomer} = Σ ∪ {'VINET':¬Customer ⊔ ∀LivesIn.(¬EuroCountry)}. By applying →⊔ w.r.t. ∀LivesIn.(¬EuroCountry) to S^(0) we obtain S^(1) = Σ ∪ {'VINET':∀LivesIn.(¬EuroCountry)}. The only propagation rule applicable to S^(1) is →∀, which, given the role assertion 'VINET' LivesIn 'France', yields the tableau S^(2) = Σ ∪ {'France':¬EuroCountry}. It presents a contradiction with the assertion 'France':EuroCountry in Σ. Indeed the application of →⊥ to S^(2) produces the final tableau S^(3) = {'France':⊥}. These two results together prove the satisfiability of Σ ∪ {order10248:Order, 'VINET':EuroCustomer}, and hence the correctness of σ = {Y/'VINET'} as an answer to Q w.r.t. B.
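The second unsatisfiability check of Example 2 can be replayed with a small Python sketch of the propagation rules →⊓, →⊔, →∀, →⊑ and →⊥. This is an illustration under stated assumptions, not the calculus of [8]: the tuple encoding of concepts, the function names, and the assertion 'VINET':Customer are hypothetical, and there is no rule for ∃ (no new individuals are created), so an answer of `True` is not a full ALC satisfiability proof. An answer of `False` does mean that every branch ends in a clash.

```python
# Concepts are nested tuples in negation normal form (NNF):
#   ("atom", A), ("not", A), ("and", C, D), ("or", C, D),
#   ("all", R, C), ("some", R, C)

def nnf_neg(c):
    """Return the NNF concept equivalent to the negation of c."""
    tag = c[0]
    if tag == "atom":
        return ("not", c[1])
    if tag == "not":
        return ("atom", c[1])
    if tag == "and":
        return ("or", nnf_neg(c[1]), nnf_neg(c[2]))
    if tag == "or":
        return ("and", nnf_neg(c[1]), nnf_neg(c[2]))
    if tag == "all":
        return ("some", c[1], nnf_neg(c[2]))
    if tag == "some":
        return ("all", c[1], nnf_neg(c[2]))
    raise ValueError(f"unknown concept: {c!r}")


def has_clash(facts):
    """Clash test: some individual belongs to both A and not A."""
    return any(c[0] == "atom" and (s, ("not", c[1])) in facts for s, c in facts)


def satisfiable(facts, roles, inclusions):
    """Saturate with the deterministic rules, then branch on disjunctions."""
    facts = set(facts)
    individuals = {s for s, _ in facts} | {x for a, _, b in roles for x in (a, b)}
    while True:
        new = set()
        for s in individuals:                      # inclusion rule: C [= D gives s : nnf(not C) or D
            for c, d in inclusions:
                new.add((s, ("or", nnf_neg(c), d)))
        for s, c in facts:
            if c[0] == "and":                      # conjunction rule
                new.update({(s, c[1]), (s, c[2])})
            elif c[0] == "all":                    # universal rule along role assertions
                new.update((b, c[2]) for a, r, b in roles if a == s and r == c[1])
        if new <= facts:
            break
        facts |= new
    if has_clash(facts):
        return False
    for s, c in facts:                             # disjunction rule: try both disjuncts
        if c[0] == "or" and (s, c[1]) not in facts and (s, c[2]) not in facts:
            return any(satisfiable(facts | {(s, d)}, roles, inclusions) for d in c[1:])
    return True


# EuroCustomer = Customer AND EXISTS LivesIn.EuroCountry, as two inclusions.
EURO_DEF = ("and", ("atom", "Customer"), ("some", "LivesIn", ("atom", "EuroCountry")))
INCLUSIONS = [(("atom", "EuroCustomer"), EURO_DEF), (EURO_DEF, ("atom", "EuroCustomer"))]
ROLES = {("VINET", "LivesIn", "France")}
SIGMA = {("VINET", ("atom", "Customer")),          # assumed assertion
         ("France", ("atom", "EuroCountry"))}

print(satisfiable(SIGMA, ROLES, INCLUSIONS))
print(satisfiable(SIGMA | {("VINET", ("not", "EuroCustomer"))}, ROLES, INCLUSIONS))
```

Adding 'VINET':¬EuroCustomer makes every branch close: the disjunct ¬Customer clashes with the assumed 'VINET':Customer, and the disjunct ∀LivesIn.(¬EuroCountry) propagates ¬EuroCountry to 'France'.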
3 The B-Ordered Space of O-Queries
In this section we propose AL-log as the starting point for the definition of a knowledge representation and reasoning framework in the context of interest. The main feature of this framework is the extension of the unique names assumption from the semantic level to the syntactic one. We would like to remind the reader that this assumption holds in ALC. Also it holds naturally for ground constrained Datalog clauses because the semantics of AL-log adopts Herbrand
models for the Datalog part and O-models for the constraint part. Conversely it is not guaranteed in the case of non-ground constrained Datalog clauses, e.g. different variables can be unified. In particular we resort to the bias of Object Identity [24]: In a formula, terms denoted with different symbols must be distinct, i.e. they represent different entities of the domain. This bias leads to a restricted form of substitution whose bindings avoid the identification of terms: A substitution σ is an OI-substitution w.r.t. a set of terms T iff ∀t1, t2 ∈ T: t1 ≠ t2 yields that t1σ ≠ t2σ. From now on, we assume that substitutions are OI-compliant. See [16] for an investigation of OI in the case of Datalog queries.
In this framework descriptions are represented as O-queries, a rule-based form of unary conjunctive queries whose answer set contains individuals of an ALC concept Ĉ of reference.
Definition 3. Given a reference concept Ĉ, an O-query Q to an AL-log knowledge base B is a constrained Datalog clause of the form
Q = q(X) ← α1, ..., αm & X : Ĉ, γ2, ..., γn
where X is the distinguished variable and the remaining variables occurring in the body of Q are the existential variables. We denote by key(Q) the key constraint X : Ĉ of Q. A trivial O-query is a constrained empty clause of the form q(X) ← & X : Ĉ.
We require O-queries to be linked and connected (or range-restricted) constrained Datalog clauses. The language L of descriptions for a given learning problem is implicitly defined by a set A of atom templates, a key constraint γ̂, and an additional set Γ of constraint templates. An atom template α specifies the name and arity of the predicate and the mode of its arguments. An instantiation of α is a Datalog atom with predicate and arguments that fulfill the requirements specified in α. Constraint templates specify the concept name for concept assertions and determine the granularity level l of descriptions.
Example 3. Following Example 1, suppose that we want to perform sales analysis by finding associations between the category of ordered products and the geographic location of the customer within orders. Here the entity Order is the reference concept, and the entities Product and Customer are task-relevant concepts. The language L must be defined so that it can generate descriptions of orders with respect to products and customers. To this aim let A = {item(+,-), purchaser(+,-)} and γ̂ be the key constraint built on the concept Order. Suppose that we are interested in descriptions at two different granularity levels. Thus T consists of the two layers T^1 = {Product, Customer} and T^2 = {Beverage, Condiment, Confection, DairyProduct, GrainsCereals, MeatPoultry, Produce, SeaFood, EuroCustomer, NorthAmericanCustomer, SouthAmericanCustomer} from which the sets Γ^1 and Γ^2 of constraints are derived. Examples of O-queries belonging to this language are:
Q0 = q(X) ← & X:Order
Q1 = q(X) ← item(X,Y) & X:Order
Q2 = q(X) ← purchaser(X,Y) & X:Order
Q3 = q(X) ← item(X,Y) & X:Order, Y:Product
Q4 = q(X) ← purchaser(X,Y) & X:Order, Y:Customer
Q5 = q(X) ← item(X,Y), item(X,Z) & X:Order, Y:Product
Q6 = q(X) ← item(X,Y), purchaser(X,Z) & X:Order, Y:Product
Q7 = q(X) ← item(X,Y), item(X,Z) & X:Order, Y:Product, Z:Product
Q8 = q(X) ← item(X,Y), purchaser(X,Z) & X:Order, Y:Product, Z:Customer
Q9 = q(X) ← item(X,Y) & X:Order, Y:DairyProduct
Q10 = q(X) ← purchaser(X,Y) & X:Order, Y:EuroCustomer
Q11 = q(X) ← item(X,Y), item(X,Z) & X:Order, Y:DairyProduct
Q12 = q(X) ← item(X,Y), purchaser(X,Z) & X:Order, Y:DairyProduct
Q13 = q(X) ← item(X,Y), item(X,Z) & X:Order, Y:DairyProduct, Z:GrainsCereals
Q14 = q(X) ← item(X,Y), purchaser(X,Z) & X:Order, Y:DairyProduct, Z:EuroCustomer
In particular, Q0 and Q1 are valid for both L^1 and L^2, Q3 and Q5 belong to L^1, and Q9 belongs to L^2. Note that all of them are linked and connected.
An answer to an O-query Q is a ground substitution θ for the distinguished variable of Q. The aforementioned conditions of well-formedness guarantee that the evaluation of O-queries is sound according to the following notions of answer set and success set.
Definition 4. Let B be an AL-log knowledge base. An answer θ to an O-query Q is a correct (resp. computed) answer w.r.t. B if there exists at least one correct (resp. computed) answer to body(Q)θ w.r.t. B.
Example 4. Following Examples 2 and 3, the substitution θ = {X/order10248} is a correct answer to Q10 w.r.t. B because there exists a correct answer σ = {Y/'VINET'} to body(Q10)θ w.r.t. B.
The definition of a generality order for structuring the space of O-queries cannot disregard the nature of O-queries as a special case of constrained Datalog clauses, nor the availability of an AL-log knowledge base with respect to which these O-queries are to be evaluated. For constrained Datalog clauses we have defined the relation of B-subsumption [17]. It adapts generalized subsumption [4] to the AL-log framework.
Definition 5. Let P, Q be two constrained Datalog clauses and B an AL-log knowledge base. We say that P B-subsumes Q, P ⪰_B Q, if for every model J of B and every ground atom α such that Q covers α under J, we have that P covers α under J.
We have proved that ⪰_B is a quasi-order for constrained Datalog clauses and provided a decidable procedure to check ⪰_B on the basis of constrained SLD-resolution [15]. Note that the underlying reasoning mechanism of AL-log makes B-subsumption more powerful than generalized subsumption.
Theorem 1. Let P, Q be two constrained Datalog clauses, B an AL-log knowledge base and σ a Skolem substitution for Q with respect to {P} ∪ B. We say that P ⪰_B Q iff there exists a substitution θ for P such that (i) head(P)θ = head(Q) and (ii) B ∪ body(Q)σ ⊢ body(P)θσ where body(P)θσ is ground.
Theorem 2. Checking ⪰_B in AL-log is decidable.
Example 5. Following Example 3, we illustrate the test procedure of Theorem 1 on the pair Q8, Q14 of O-queries to check whether Q8 ⪰_B Q14 holds. Let σ = {X/a, Y/b, Z/c} be a Skolem substitution for Q14 with respect to B ∪ {Q8} and θ the identity substitution for Q8. The condition (i) is immediately verified. It remains to verify that (ii) B ∪ {item(a,b), purchaser(a,c) & a:Order, b:DairyProduct, c:EuroCustomer} ⊨ item(a,b), purchaser(a,c) & a:Order, b:Product, c:Customer. We try to build a constrained SLD-refutation for
Q^(0) = ← item(a,b), purchaser(a,c) & a:Order, b:Product, c:Customer
in B′ = B ∪ {item(a,b), purchaser(a,c), b:DairyProduct, c:EuroCustomer, a:Order}. Once the constrained empty clause has been obtained by means of classical SLD-resolution, we need to check whether Σ ∪ {a:Order, b:Product, c:Customer} is satisfiable. The first unsatisfiability check operates on the initial tableau S^(0) = Σ ∪ {a:¬Order}. The application of the propagation rule →⊥ to S^(0) produces the tableau S^(1) = {a:⊥}. Computation stops here because no other rule can be applied to S^(1). Since S^(1) is complete and contains a clash, the initial tableau S^(0) is unsatisfiable. The second unsatisfiability check operates on the initial tableau S^(0) = Σ ∪ {b:¬Product}. The only propagation rule applicable to S^(0) is →⊑ with respect to the assertion DairyProduct ⊑ Product. It produces the tableau S^(1) = Σ ∪ {b:¬Product, b:¬DairyProduct ⊔ Product}. By applying →⊔ to S^(1) with respect to the concept Product we obtain S^(2) = Σ ∪ {b:¬Product, b:Product} which presents an evident contradiction. Indeed the application of →⊥ to S^(2) produces the final tableau S^(3) = {b:⊥}. The third unsatisfiability check operates on the initial tableau S^(0) = Σ ∪ {c:¬Customer}. Remember that Σ contains the assertion c:Customer ⊓ ∃LivesIn.EuroCountry. By applying →⊓ to S^(0) we obtain S^(1) which encompasses the contradiction {c:¬Customer, c:Customer}. Indeed the application of →⊥ to S^(1) produces the final tableau S^(2) = {c:⊥}. Having proved the satisfiability of Σ ∪ {a:Order, b:Product, c:Customer}, we have proved the existence of a constrained SLD-refutation for Q^(0) in B′. Therefore we can say that Q8 ⪰_B Q14.
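For O-queries such as Q8 and Q14, the test of Theorem 1 can be approximated by a short Python sketch. It deliberately swaps the constrained SLD-refutation and the tableau checks for two simpler steps, which is an assumption of this illustration and not the procedure of [15]: atoms of Pθ are matched syntactically against the Skolemized body of Q, and constraints are compared through a hypothetical `PARENT` map of primitive inclusions (EuroCustomer ⊑ Customer is taken as a consequence of the definition of EuroCustomer). The substitution θ is searched among injective mappings, in line with the OI bias.

```python
from itertools import permutations

# Assumed taxonomy fragment: child concept -> parent concept.
PARENT = {"DairyProduct": "Product", "EuroCustomer": "Customer"}


def is_subconcept(c, d):
    """True if c equals d or is a descendant of d in PARENT."""
    while c is not None:
        if c == d:
            return True
        c = PARENT.get(c)
    return False


def variables(query):
    """All variables occurring in the atoms and constraints of a query body."""
    atoms, constraints = query
    return {v for _, *args in atoms for v in args} | {v for v, _ in constraints}


def b_subsumes(p, q):
    """Simplified check of P >=_B Q: the variables of Q play the role of Skolem constants."""
    p_atoms, p_constraints = p
    q_atoms, q_constraints = q
    p_vars = sorted(variables(p) - {"X"})
    q_terms = sorted(variables(q) - {"X"})
    # Condition (i): the heads q(X) coincide, so theta keeps X fixed.
    for image in permutations(q_terms, len(p_vars)):   # injective theta (OI bias)
        theta = dict(zip(p_vars, image), X="X")
        # Condition (ii), simplified: every atom and constraint of P.theta follows from body(Q).
        atoms_ok = all((pred, *(theta[a] for a in args)) in q_atoms
                       for pred, *args in p_atoms)
        constraints_ok = all(
            any(t == theta[v] and is_subconcept(d, c) for t, d in q_constraints)
            for v, c in p_constraints)
        if atoms_ok and constraints_ok:
            return True
    return False


ATOMS = {("item", "X", "Y"), ("purchaser", "X", "Z")}
Q8 = (ATOMS, {("X", "Order"), ("Y", "Product"), ("Z", "Customer")})
Q14 = (ATOMS, {("X", "Order"), ("Y", "DairyProduct"), ("Z", "EuroCustomer")})

print(b_subsumes(Q8, Q14))
print(b_subsumes(Q14, Q8))
```

The sketch agrees with Example 5 on this pair, but it ignores the Datalog clauses of Π and complex concepts, both of which the full procedure handles.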
4 The Refinement Operator ρO
The space (L, ⪰_B) is a quasi-ordered set, therefore it can be searched by refinement operators [20]. In the application context of interest, the refinement operator being defined must enable the search through multiple spaces, each of which corresponds to a different level of description granularity. Furthermore we restrict our investigation to downward refinement operators because the search towards finer-grained descriptions is more efficient.
Definition 6. Let γ1 = t1 : C and γ2 = t2 : D be two ALC constraints. We say that γ1 is at least as strong as γ2, denoted as γ1 ⪰ γ2, if and only if t1 = t2 and C ⊑ D. Furthermore γ1 is stronger than γ2, denoted as γ1 ≻ γ2, if and only if t1 = t2 and C ⊏ D.
Definition 7. Let L = {L^l}_{1≤l≤maxG} be a language of O-queries. A (downward) refinement operator ρO for (L, ⪰_B) is defined such that, for a given O-query P = q(X) ← α1, ..., αm & X : Ĉ, γ2, ..., γn in L^l, l < maxG, the set ρO(P) contains all Q ∈ L that can be obtained by applying one of the following refinement rules:
Atom: Q = q(X) ← α1, ..., αm, αm+1 & X : Ĉ, γ2, ..., γn where αm+1 is an instantiation of an atom template in A such that αm+1 ∉ body(P).
Constr: Q = q(X) ← α1, ..., αm & X : Ĉ, γ2, ..., γn, γn+1 where γn+1 is an instantiation of a constraint template in Γ^l such that γn+1 constrains an unconstrained variable in body(P).
∀C: Q = q(X) ← α1, ..., αm & X : Ĉ, γ′2, ..., γ′n where each γ′j, 2 ≤ j ≤ n, is an instantiation of a constraint template in Γ^(l+1) such that γ′j ⪰ γj and at least one γ′j ≻ γj.
The rules Atom and Constr help moving within the space L^l (intra-space search) whereas the rule ∀C helps moving from L^l to L^(l+1) (inter-space search). All these rules are correct, i.e. the Q's obtained by applying any of these rules to P ∈ L^l are such that P ⪰_B Q. This can be proved intuitively by observing that they act only on body(P). Thus condition (i) of Theorem 1 is satisfied. Furthermore, it is straightforward to notice that the application of ρO to P reduces the number of models of P in both the intra-space and the inter-space case. In particular, as for ∀C, this intuition follows from the definition of O-model. So condition (ii) is also fulfilled.
From now on we call k-patterns those patterns Q ∈ ρO^k(P) that have been generated after k refinement steps starting from the trivial O-query P in L^l and applying either Atom or Constr.
Example 6. Each edge in Figure 1 indicates the application of only one of the rules defined for ρO to O-queries listed in Example 3. E.g., ρO(Q1) is the set
Q′1 = q(X) ← item(X,Y), item(X,Z) & X:Order
Q′2 = q(X) ← item(X,Y), purchaser(X,Z) & X:Order
Q′3 = q(X) ← item(X,Y) & X:Order, Y:Product
Q′4 = q(X) ← item(X,Y) & X:Order, Y:Customer
Q′5 = q(X) ← item(X,Y) & X:Order, Y:Beverage
Q′6 = q(X) ← item(X,Y) & X:Order, Y:Condiment
Q′7 = q(X) ← item(X,Y) & X:Order, Y:Confection
Q′8 = q(X) ← item(X,Y) & X:Order, Y:DairyProduct
Q′9 = q(X) ← item(X,Y) & X:Order, Y:GrainsCereals
Q′10 = q(X) ← item(X,Y) & X:Order, Y:MeatPoultry
Q′11 = q(X) ← item(X,Y) & X:Order, Y:Produce
Q′12 = q(X) ← item(X,Y) & X:Order, Y:SeaFood
Q′13 = q(X) ← item(X,Y) & X:Order, Y:EuroCustomer
Q′14 = q(X) ← item(X,Y) & X:Order, Y:NorthAmericanCustomer
Q′15 = q(X) ← item(X,Y) & X:Order, Y:SouthAmericanCustomer
where Q′1 and Q′2 are generated by means of Atom, Q′3 and Q′4 by means of Constr, and the O-queries from Q′5 to Q′15 also by means of Constr (but considering Q1 as belonging to L^2). Note that Q′4, Q′13, Q′14, and Q′15 will turn out to be infrequent. Yet they are generated. What matters while searching (L, ⪰_B) is to find patterns that are more specific than a given P under B-subsumption.
Fig. 1. Portion of the refinement graph of ρO in L (nodes Q0 to Q14, arranged by k = 1, ..., 5 within L^1 and L^2).
Conversely, ρO(Q3) is the set
Q′1 = q(X) ← item(X,Y), item(X,Z) & X:Order, Y:Product
Q′2 = q(X) ← item(X,Y), purchaser(X,Z) & X:Order, Y:Product
Q′3 = q(X) ← item(X,Y) & X:Order, Y:Beverage
Q′4 = q(X) ← item(X,Y) & X:Order, Y:Condiment
Q′5 = q(X) ← item(X,Y) & X:Order, Y:Confection
Q′6 = q(X) ← item(X,Y) & X:Order, Y:DairyProduct
Q′7 = q(X) ← item(X,Y) & X:Order, Y:GrainsCereals
Q′8 = q(X) ← item(X,Y) & X:Order, Y:MeatPoultry
Q′9 = q(X) ← item(X,Y) & X:Order, Y:Produce
Q′10 = q(X) ← item(X,Y) & X:Order, Y:SeaFood
where the O-queries Q′1 and Q′2 are generated by means of Atom, and the O-queries from Q′3 to Q′10 by means of ∀C. Note that the query Q9 can be obtained by applying either Constr to Q1 (here Q1 is considered as belonging to L^2) or ∀C to Q3. Actually each node in L^2 can be reached starting from either another node in L^2 or a node in L^1. This can be exploited to speed up the search at levels of finer granularity as shown later.
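The three rules of Definition 7 can be mimicked on the running example by the following Python sketch. The encoding is hypothetical and simplified: a query is a pair (atoms, constraints) with the key constraint X:Order left implicit, the input argument of every atom template is assumed to be bound to the distinguished variable X, fresh variables come from a fixed pool, and `GAMMA` and `PARENT` hard-code the two layers of Example 3. Under these assumptions the sketch yields the 15 refinements of Q1 and the 10 refinements of Q3 listed in Example 6.

```python
from itertools import product as cartesian

PRODUCTS = ["Beverage", "Condiment", "Confection", "DairyProduct",
            "GrainsCereals", "MeatPoultry", "Produce", "SeaFood"]
CUSTOMERS = ["EuroCustomer", "NorthAmericanCustomer", "SouthAmericanCustomer"]
PARENT = {**{c: "Product" for c in PRODUCTS}, **{c: "Customer" for c in CUSTOMERS}}
GAMMA = {1: ["Product", "Customer"], 2: PRODUCTS + CUSTOMERS}  # constraint templates per level
TEMPLATES = ["item", "purchaser"]                               # atom templates, modes (+,-)
FRESH = "YZWVU"                                                 # pool of existential variables


def atom_rule(query):
    """Atom: add one atom whose output argument is a fresh variable."""
    atoms, constraints = query
    used = {a[2] for a in atoms}
    fresh = next(v for v in FRESH if v not in used)
    return {(atoms + ((p, "X", fresh),), constraints) for p in TEMPLATES}


def constr_rule(query, level):
    """Constr: constrain one still unconstrained variable with a concept of the given level."""
    atoms, constraints = query
    constrained = {v for v, _ in constraints}
    free = [a[2] for a in atoms if a[2] not in constrained]
    return {(atoms, constraints | {(v, c)}) for v in free for c in GAMMA[level]}


def forall_c_rule(query, level):
    """Forall-C: replace every constraint by a child concept taken from the next level."""
    atoms, constraints = query
    if not constraints or level + 1 not in GAMMA:
        return set()
    options = [[(v, child) for child in GAMMA[level + 1] if PARENT.get(child) == c]
               for v, c in sorted(constraints)]
    return {(atoms, frozenset(choice)) for choice in cartesian(*options)}


def refine(query, level):
    """All one-step refinements of a query regarded as a member of the given level."""
    return atom_rule(query) | constr_rule(query, level) | forall_c_rule(query, level)


Q1 = ((("item", "X", "Y"),), frozenset())
Q3 = ((("item", "X", "Y"),), frozenset({("Y", "Product")}))

print(len(refine(Q1, 1) | refine(Q1, 2)))  # Q1 regarded as a member of both L1 and L2
print(len(refine(Q3, 1)))
```

Q9 appears both in `refine(Q1, 2)` (via Constr) and in `refine(Q3, 1)` (via ∀C), which is the double reachability noted at the end of Example 6.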
4.1 Reaching Ideality
Descriptive data mining problems are characterized by hypothesis spaces with dense solutions. Ideal refinement operators are usually suggested to search spaces of this kind [2]. They satisfy the following properties.
Definition 8. Let ρ be a downward refinement operator for a quasi-ordered set (L, ⪰), and let ρ* denote the transitive closure of ρ:
• ρ is locally finite iff ∀P ∈ L: ρ(P) is finite and computable;
• ρ is proper iff ∀P ∈ L ∀Q ∈ ρ(P): Q ≁ P;
• ρ is complete iff ∀P, Q ∈ L: if P ≻ Q then ∃Q′ ∈ ρ*(P): Q′ ∼ Q.
Unfortunately, for clausal languages ordered by θ-subsumption or stronger orders such as generalized subsumption, ideal operators have been proven not to exist [20]. They can be approximated by dropping the requirement of properness or by bounding the language. We choose to follow the latter option. It guarantees that, if (L, ⪰) is a quasi-ordered set, L is finite and ⪰ is decidable, then there always exists an ideal refinement operator for (L, ⪰). In our case, since ⪰_B is a decidable quasi-order, we only need to bound L in a suitable manner. The expressive power of AL-log requires several bounds to be imposed on L in order to guarantee its finiteness. First, it is necessary to set a maximum level maxG of description granularity, so that the problem of finiteness of L can be reduced to the finiteness of each L^l, 1 ≤ l ≤ maxG. Second, it is necessary to introduce a complexity measure for O-queries, as a pair of two different coordinates. Note that the complexity of an O-query resides in its body. Therefore, given an O-query P, the former coordinate is the size of the biggest literal in body(P), while the latter is the number of literals in body(P). Constraints do count as literals.
Definition 9. Let L be either a Datalog atom or an ALC constraint. Then size(L) is equal to the difference between the number of symbol occurrences in L and the number of distinct variables in L.
Example 7. Let L1, L2 be the literals item(X,Y) and X:DairyProduct. Then size(L1) = 3 − 2 = 1 and size(L2) = 2 − 1 = 1.
Definition 10. Let P be an O-query, maxsize(body(P)) the maximum of {size(L) | L ∈ body(P)}, and |body(P)| the number of literals in body(P). We call O-size(P) the pair (maxsize(body(P)), |body(P)|).
Definition 11. Let P be an O-query, and maxS, maxD be natural numbers. We say that P is bounded by (maxS, maxD) if maxsize(body(P)) ≤ maxS and |body(P)| ≤ maxD.
Example 8. The language L = L^1 ∪ L^2 partially reported in Example 3 and illustrated in Example 6 contains (1, 5)-bounded O-queries. Note that the bound maxS = 1 derives from the fact that A contains only binary predicates and Γ contains only assertions of primitive concepts.
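Definitions 9 to 11 translate almost literally into code. The sketch below is a hypothetical Python encoding: a literal is a tuple whose first element is the predicate or concept symbol and whose remaining elements are its arguments, so that the constraint X:DairyProduct becomes `("DairyProduct", X)`, and variables are marked by an illustrative `Var` class. It reproduces the values of Example 7 and the (1, 5) bound of Example 8 on the body of Q14.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Marker for variables, so that constants and concept names are not miscounted."""
    name: str


def size(literal):
    """Definition 9: symbol occurrences minus distinct variables."""
    _symbol, *args = literal
    occurrences = 1 + len(args)
    distinct_vars = {a for a in args if isinstance(a, Var)}
    return occurrences - len(distinct_vars)


def o_size(body):
    """Definition 10: (size of the biggest literal, number of literals) of a query body."""
    return (max(size(lit) for lit in body), len(body))


def is_bounded(body, max_s, max_d):
    """Definition 11: bounded by (max_s, max_d)."""
    biggest, length = o_size(body)
    return biggest <= max_s and length <= max_d


X, Y, Z = Var("X"), Var("Y"), Var("Z")
BODY_Q14 = [("item", X, Y), ("purchaser", X, Z),
            ("Order", X), ("DairyProduct", Y), ("EuroCustomer", Z)]

print(size(("item", X, Y)), size(("DairyProduct", X)))
print(o_size(BODY_Q14), is_bounded(BODY_Q14, 1, 5))
```

A ground atom such as item(order10248, product11) would have size 3 under this encoding, since constants are counted as symbol occurrences but not as variables.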
Proposition 1. Given a language L of O-queries and two integers maxS, maxD > 0, the set {P ∈ L | P is bounded by (maxS, maxD)} is finite up to variants.
Proof. We will only sketch the idea behind this. The language L has finitely many constants and predicate symbols because it is built starting from the alphabets A and Γ. Furthermore it has no functors because it is a fragment of AL-log. Suppose we are given (maxS, maxD). It is not difficult to see that the set of literals (either Datalog atoms or constraints) with size ≤ maxS is finite up to variants. Let v be the maximum of the set {n | there is a literal L ∈ P with size ≤ maxS that contains n distinct variables}. Because a (maxS, maxD)-bounded O-query can contain at most maxD literals in the body, each of which can contain at most v distinct variables, a (maxS, maxD)-bounded O-query can contain at most maxD·v distinct variables. Let us fix distinct variables X1, ..., X_(maxD·v). Now let K be the finite set of all literals of size ≤ maxS that can be constructed from the predicate symbols and constants in A, the concept symbols in Γ and variables X1, ..., X_(maxD·v). Since each O-query that is bounded by (maxS, maxD) must be (a variant of) a subset of K, there are only finitely many such O-queries, up to variants.
We can prove that ρO is an ideal refinement operator for (L, ⪰_B) that maps reduced (maxS, maxD)-bounded O-queries into reduced (maxS, maxD)-bounded O-queries under the OI bias.
Theorem 3. Let (maxS, maxD) be a pair of natural numbers, and L be the language {L^l}_{1≤l≤maxG} such that each L^l contains reduced (maxS, maxD)-bounded O-queries. The downward refinement operator ρO is ideal for (L, ⪰_B).
Proof. Let P, Q ∈ L.
Local Finiteness. Suppose P ∈ L^l. The alphabets A and Γ underlying L are finite sets. Furthermore, each of the three refinement rules consists of instructions that can be completed in a finite number of steps. Therefore ρO(P) is finite and computable.
Properness. Suppose P ∈ L^l. In the case of either Atom or Constr, Q is strictly longer than P because the OI bias avoids the identification of literals. Therefore P ≻_B Q. In the case of ∀C, the occurrence of a strictly stronger constraint in Q assures that P ≻_B Q.
Completeness. Suppose P ≻_B Q. Then either Q is a downward cover of P, in which case there is an R ∈ ρO(P) such that Q ∼_B R, or there is an R ∈ ρO(P) such that P ≻_B R ≻_B Q. In the latter case, we can find an S ∈ ρO(R) such that P ≻_B R ≻_B S ⪰_B Q, etc. Since L is finite and ρO is proper, we must eventually find a ρ-chain from P to a member of the equivalence class of Q, so ρO is complete.
Ideal refinement operators are mainly of theoretical interest, because in practice it is often very inefficient to find downward (resp. upward) covers for every P ∈ L. Thus more constructive, though possibly improper, refinement operators are usually to be preferred over ideal ones. In the following we show that efficient algorithms can be designed to implement ρO in the context of interest.
4.2 Making Ideality Something Real
The operator ρO has been implemented in AL-QuIn (AL-log Query Induction) [15], an ILP system that, following Mannila's levelwise method [19] for frequent pattern discovery, searches the space (L, ⪰_B) breadth-first by alternating candidate generation and candidate evaluation phases. In particular, candidate generation consists of a refinement step followed by a pruning step. The former applies one of the three rules of ρO to patterns previously found frequent, preserving the properties of linkedness and safety. The pruning step allows some infrequent patterns to be detected and discarded prior to evaluation thanks to the following property [15]: Under the assumption that minsup^l ≤ minsup^(l−1), 1 < l < maxG, a k-pattern Q in L^l is infrequent if it is B-subsumed w.r.t. an AL-log knowledge base B by either (i) an infrequent (k − 1)-pattern in L^l or (ii) an infrequent k-pattern in L^(l−1). Note that (i) and (ii) require a high number of subsumption checks to be performed. This makes candidate generation computationally expensive. Appropriate complementary data structures can help mitigate the computational effort. Our implementation of ρO uses a graph of backward pointers, updated while searching, in order to keep track of both intra-space and inter-space search stages. Figure 2 gives an example of such a graph for the portion of space reported in Figure 1. Here nodes, dotted edges and dashed edges represent patterns, intra-space parenthood and inter-space parenthood, respectively. We shall illustrate the benefits of using this data structure by going into details of the procedure generateCandidates() reported in Figure 3.
Fig. 2. Graph of backward pointers.
For a given level l of description granularity, 1 ≤ l ≤ maxG, procedure generateCandidates() builds the set C^l_k of candidate k-patterns starting from the set F^l_(k−1) of frequent (k − 1)-patterns and the language L^l by taking the set I^l of infrequent patterns into account. It consists of two computation branches.
Procedure generateCandidates(F^l_(k−1), L^l, var I^l)
  C^l_k ← ∅;
  if l = 1 then /* search in L^1 */
    foreach pattern P in F^l_(k−1) do
      𝒬 ← intraRefine(P, L^l);
      𝒬 ← prune(𝒬, I^l);
      foreach pattern Q in 𝒬 do
        /* set the intra-space edge from Q to P */
        setIntraSpaceEdge(Q, P)
      endforeach
      C^l_k ← C^l_k ∪ 𝒬
    endforeach
  else /* search in L^l, l > 1 */
    foreach pattern P in F^l_(k−1) do
      P′ ← getInterSpaceParent(P);
      𝒬′ ← getIntraSpaceChildren(P′);
      foreach pattern Q′ in 𝒬′ do
        𝒬 ← interRefine(Q′);
        𝒬 ← prune(𝒬, I^l);
        foreach pattern Q in 𝒬 do
          /* set the inter-space edge from Q to Q′ */
          setInterSpaceEdge(Q, Q′);
          /* set the intra-space edge from Q to P */
          setIntraSpaceEdge(Q, P);
        endforeach
        C^l_k ← C^l_k ∪ 𝒬
      endforeach
    endforeach
  return C^l_k
Fig. 3. Implementation of ρO in the candidate generation phase of AL-QuIn
The former concerns the case of search in L^1. It applies either Atom or Constr (procedure intraRefine()), performs the pruning step (procedure prune()) and inserts an intra-space backward pointer for each retained candidate (procedure setIntraSpaceEdge()).
Example 9. As an illustrative example of candidate generation at the first level of description granularity we report the computation of C^1_2 with reference to Example 6. The set F^1_1 contains the trivial query Q0 only. The procedure call intraRefine(Q0, L^1) returns the set 𝒬 of possible refinements of Q0 according to directives of L^1, namely the queries Q1 and Q2 obtained by adding the atoms item(X,Y) and purchaser(X,Y) to the body of Q0 respectively. In both refinements it is not necessary to insert inequality atoms. Furthermore the procedure call prune(𝒬, I^1) affects neither 𝒬 nor I^1. Before returning control to the main procedure of AL-QuIn, generateCandidates(F^1_1, L^1, I^1) updates the graph of backward pointers by inserting intra-space edges for Q1 and Q2 as shown in
Figure 2. Assuming that $F_2^1$ contains Q1, we carry on this example by focusing on the generation of candidate patterns starting from Q1. Possible refinements according to the directives of $\mathcal{L}^1$ are the queries Q′1, Q′2, Q′3 and Q′4 reported in Example 6. Since Q′1 and Q′2 do not fulfill the requirement on the maximum number of unconstrained variables, they are generated and then pruned. So Q′3, namely Q3, and Q′4 are the only ones to survive the pruning step. Since Q′4 will not pass the candidate evaluation phase, it does not appear as a child of Q1 in Figure 2.

The other branch of generateCandidates() concerns the case of search at deeper levels of description granularity. One might expect that it is enough to simply replace the procedure intraRefine() with a procedure interRefine() which implements the refinement rule ∀C. But things are more complicated. Let us suppose that the current space to be searched is $\mathcal{L}^l$, l > 1. On the one hand, searching $\mathcal{L}^l$ with only Atom or Constr implies restarting from scratch. Rather, we would like to capitalize on the computational effort made when searching $\mathcal{L}^{l-1}$ and minimize the number of inter-space subsumption checks. On the other hand, searching $\mathcal{L}^l$ indirectly, i.e. by applying ∀C to O-queries found frequent in $\mathcal{L}^{l-1}$, implies the loss of the useful information that could be collected if $\mathcal{L}^l$ were searched directly. E.g., when generating $C_k^l$, l, k > 1, it happens that intraRefine($F_{k-1}^l$) ⊆ interRefine($F_k^{l-1}$). This means that a blind application of ∀C causes an increase in intra-space subsumption checks. It is necessary to find a compromise between these two apparently irreconcilable solutions. Our solution requires that $C_k^l$, l, k > 1, is computed taking both $F_{k-1}^l$ and $F_k^{l-1}$ into account. In particular, the expansion of a node P in $F_{k-1}^l$ is done as follows:
– retrieve the inter-space parent node P′ of P by following the inter-space backward pointer (step (13));
– retrieve the set 𝒬′ ⊆ $F_k^{l-1}$ of intra-space children nodes of P′ by navigating intra-space backward pointers in the reverse direction (step (14));
– generate the set 𝒬 of O-queries obtained by applying ∀C to each Q′ in 𝒬′ (step (16)).
This not only avoids a blind application of the refinement rule ∀C but also lightens the computational load during the pruning step.

Example 10. Again with reference to Example 6, the expansion of Q1, when Q1 is considered as belonging to $\mathcal{L}^2$, is done indirectly by accessing the successful search stages in $\mathcal{L}^1$ that depart from Q1. In Example 9 we have seen that Q3 is the only child of Q1. Possible refinements of Q3 by means of ∀C are the queries Q 3 up to Q 10 listed in Example 6. Note that these are equivalent to the queries Q′5 up to Q′12 in $\rho_O$(Q1). Furthermore the queries Q′13 up to Q′15, though possible refinements of Q1, are not generated because their ancestor Q′4 turned out to be infrequent. This avoids the generation of certainly infrequent patterns in $\mathcal{L}^2$.
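The indirect expansion described above can be illustrated with a small Python sketch. It is not the AL-QuIn implementation: patterns are reduced to hashable labels, `PointerGraph`, `generate_candidates_deeper`, `inter_refine` and `prune` are hypothetical names, pruning is passed in as a plain function instead of a B-subsumption test, and the graph is assumed to keep only children that passed candidate evaluation.

```python
from collections import defaultdict


class PointerGraph:
    """Backward pointers between patterns (patterns are plain labels here)."""

    def __init__(self):
        self.intra_parent = {}                   # child -> parent in the same space
        self.inter_parent = {}                   # child -> parent in the coarser space
        self.intra_children = defaultdict(list)  # reverse index of intra_parent

    def set_intra_edge(self, child, parent):
        self.intra_parent[child] = parent
        self.intra_children[parent].append(child)

    def set_inter_edge(self, child, parent):
        self.inter_parent[child] = parent


def generate_candidates_deeper(frequent_prev, graph, inter_refine, prune, infrequent):
    """Sketch of the l > 1 branch of Fig. 3 (steps 12-24)."""
    candidates = []
    for p in frequent_prev:
        p_up = graph.inter_parent[p]                         # step 13
        for q_up in list(graph.intra_children[p_up]):        # steps 14-15
            for q in prune(inter_refine(q_up), infrequent):  # steps 16-17
                graph.set_inter_edge(q, q_up)                # step 19
                graph.set_intra_edge(q, p)                   # step 20
                candidates.append(q)
    return candidates
```

Because children that failed evaluation are assumed absent from the graph, their refinements are never generated, which mirrors the behaviour described in Example 10.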
5 Conclusions
Hybrid languages supply expressive and deductive power which have no counterpart in Horn clausal logic. This makes them appealing for challenging applications such as descriptive data mining in application domains that require a
uniform treatment of both relational and structural features of data. This also poses several research issues to the ILP community, e.g. the definition of more sophisticated refinement operators than the ones typically used in ILP. The main contribution of this paper is the definition of an ideal (downward) refinement operator $\rho_O$ to search spaces of descriptions at multiple granularity levels in the AL-log framework. In particular, ideality has been approximated by bounding the language and assuming the OI bias. Though this result is mainly of theoretical interest, it is worth having because searching spaces with dense solutions is peculiar to descriptive data mining tasks. With respect to the context of interest we have provided an efficient implementation of $\rho_O$. Note that the resulting algorithm is not an attempt at deriving an optimal operator from $\rho_O$ but aims at lightening the computational load of inter-space subsumption checks. We would like to emphasize that the choice of an application context and the investigation of ILP issues within the chosen context make a substantial difference between our work and related work on learning in hybrid languages. Indeed our broader goal is the definition of an ILP setting for mining object-relational databases [15]. We claim that AL-log supports a simple yet significant object-relational data model. This provides, among other things, an application-driven motivation for assuming the OI bias and concentrating on concept hierarchies in our AL-log framework. Differences between AL-QuIn and the ILP system Warmr [7] for mining "pure" relational patterns are discussed in [18]. We intend to carry on the work presented in this paper by following this approach of reconciling theory and practice. In particular, besides the already investigated application to spatial data mining [15], the Semantic Web seems to be a promising source of new interesting tasks for AL-QuIn.
Acknowledgments We would like to thank the anonymous reviewers for their useful comments and Nicola Fanizzi for the fruitful discussions.
References
1. L. Badea and S.-W. Nienhuys-Cheng. A refinement operator for description logics. In J. Cussens and A. Frisch, editors, Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial Intelligence, pages 40–59. Springer-Verlag, 2000.
2. L. Badea and M. Stanciu. Refinement operators can be (weakly) perfect. In S. Džeroski and P. Flach, editors, Inductive Logic Programming, volume 1634 of Lecture Notes in Artificial Intelligence, pages 21–32. Springer, 1999.
3. A. Borgida. On the relative expressiveness of description logics and predicate logics. Artificial Intelligence, 82(1–2):353–367, 1996.
4. W. Buntine. Generalized subsumption and its application to induction and redundancy. Artificial Intelligence, 36(2):149–176, 1988.
5. S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Databases. Springer, 1990.
6. W. Cohen and H. Hirsh. Learning the CLASSIC description logic: Theoretical and experimental results. In Proc. of the 4th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR'94), pages 121–133. Morgan Kaufmann, 1994.
7. L. Dehaspe and H. Toivonen. Discovery of frequent Datalog patterns. Data Mining and Knowledge Discovery, 3:7–36, 1999.
8. F. Donini, M. Lenzerini, D. Nardi, and A. Schaerf. AL-log: Integrating Datalog and Description Logics. Journal of Intelligent Information Systems, 10(3):227–252, 1998.
9. A. Frisch. The substitutional framework for sorted deduction: Fundamental results on hybrid reasoning. Artificial Intelligence, 49:161–198, 1991.
10. A. Frisch. Sorted downward refinement: Building background knowledge into a refinement operator for inductive logic programming. In S. Džeroski and P. Flach, editors, Inductive Logic Programming, volume 1634 of Lecture Notes in Artificial Intelligence, pages 104–115. Springer, 1999.
11. A. Frisch and A. Cohn. Thoughts and afterthoughts on the 1988 workshop on principles of hybrid reasoning. AI Magazine, 11(5):84–87, 1991.
12. J.-U. Kietz. Learnability of description logic programs. In S. Matwin and C. Sammut, editors, Inductive Logic Programming, volume 2583 of Lecture Notes in Artificial Intelligence, pages 117–132. Springer, 2003.
13. J.-U. Kietz and K. Morik. A polynomial approach to the constructive induction of structural knowledge. Machine Learning, 14(1):193–217, 1994.
14. A. Levy and M.-C. Rousset. Combining Horn rules and description logics in CARIN. Artificial Intelligence, 104:165–209, 1998.
15. F.A. Lisi. An ILP Setting for Object-Relational Data Mining. Ph.D. Thesis, Department of Computer Science, University of Bari, Italy, 2002.
16. F.A. Lisi, S. Ferilli, and N. Fanizzi. Object Identity as Search Bias for Pattern Spaces. In F. van Harmelen, editor, ECAI 2002. Proceedings of the 15th European Conference on Artificial Intelligence, pages 375–379, Amsterdam, 2002. IOS Press.
17. F.A. Lisi and D. Malerba. Bridging the Gap between Horn Clausal Logic and Description Logics in Inductive Learning. In A. Cappelli and F. Turini, editors, AI*IA 2003: Advances in Artificial Intelligence, volume ? of Lecture Notes in Artificial Intelligence, Springer, 2003. To appear.
18. F.A. Lisi and D. Malerba. Towards Object-Relational Data Mining. In S. Flesca, S. Greco, D. Saccà, and E. Zumpano, editors, Proc. of the 11th Italian Symposium on Advanced Database Systems, pages 269–280. Rubbettino Editore, Italy, 2003.
19. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
20. S. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming, volume 1228 of Lecture Notes in Artificial Intelligence. Springer, 1997.
21. R. Reiter. Equality and domain closure in first order databases. Journal of ACM, 27:235–249, 1980.
22. C. Rouveirol and V. Ventos. Towards Learning in CARIN-ALN. In J. Cussens and A. Frisch, editors, Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial Intelligence, pages 191–208. Springer, 2000.
23. M. Schmidt-Schauss and G. Smolka. Attributive concept descriptions with complements. Artificial Intelligence, 48(1):1–26, 1991.
24. G. Semeraro, F. Esposito, D. Malerba, N. Fanizzi, and S. Ferilli. A logic framework for the incremental inductive synthesis of Datalog theories. In N. Fuchs, editor, Proc. of 7th Int. Workshop on Logic Program Synthesis and Transformation, volume 1463 of Lecture Notes in Computer Science, pages 300–321. Springer, 1998.
Which First-Order Logic Clauses Can Be Learned Using Genetic Algorithms?

Flaviu Adrian Mărginean
Department of Computer Science, The University of York, Heslington, York YO10 5DD, United Kingdom
[email protected]
Abstract. In this paper we present and prove both negative and positive theoretical results concerning the representation and evaluation of first-order logic clauses using genetic algorithms. Over the last few years, a few approaches have been proposed aiming to combine genetic and evolutionary computation (EC) with inductive logic programming (ILP). The underlying rationale is that evolutionary algorithms, such as genetic algorithms, might mitigate the combinatorial explosions generated by the inductive learning of rich representations, such as those used in first-order logic. Particularly, the genetic algorithms approach to ILP presented by Tamaddoni-Nezhad and Muggleton has attracted the attention of both the EC and ILP communities in recent years. Unfortunately, a series of systematic and fundamental theoretical errors renders their framework moot. This paper critically examines the fallacious claims in the mentioned approach. It is shown that, far from restoring completeness to the learner progol's search of the subsumption lattice, their proposed binary representation is both incomplete and noncompact. It is also shown that their fast evaluation of clauses based on bitwise operations is based on false theorems and therefore invalid. As an alternative to Tamaddoni-Nezhad and Muggleton's flawed genetic algorithms attempt, we propose a framework based on provably sound and complete binary refinement.
1 Introduction
Over the last few years there has been a surge of interest in combining the expressiveness afforded by first-order logic representations in inductive logic programming with the robustness of evolutionary search algorithms [8,19,26,27,28]. It is hoped that such hybrid systems would retain the powerful logic programming formalism and its well-understood theoretical foundations, while bringing to the search the versatility of evolutionary algorithms, their inherent parallelism, and their adaptive characteristics [19,28]. For such reasons, the hybrid systems have been and are of mutual interest to both the evolutionary computation (EC) and the inductive logic programming (ILP) communities. The intersectorial interest in hybrid systems, together with the accumulation of reported theoretical and experimental results, warrants a closer look at the theoretical underpinnings of such systems and at the, sometimes strong, claims that have been made concerning the systems' representational power and their efficiency.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 233–250, 2003. © Springer-Verlag Berlin Heidelberg 2003
In this paper we take as our case study the most recent of such proposals, the genetic algorithms approach to ILP by Tamaddoni-Nezhad and Muggleton [26,27,28]. The approach aims to turn progol into a complete and fast inductive reasoner by coupling a binary representation suitable for the implementation of genetic algorithms search procedures with a clause evaluation mechanism based on bitwise operations. progol is a first-order inductive reasoner widely regarded as state of the art. Owing to its relative importance, its soundness and completeness have been the object of numerous theoretical studies [4,9,13,14,18,31]. progol has also been experimentally investigated from the point of view of its tractability [21,22]. Search in first-order logic is notoriously difficult because the expressive power of the hypotheses generates combinatorial explosions [24]. Owing to the two main issues, i.e. (in)completeness and (in)tractability, the announcement by Tamaddoni-Nezhad and Muggleton [26,27,28] that genetic algorithms can solve both problems via a simple binary change in representation and evaluation has understandably been greeted with interest. In this paper we demonstrate that, unfortunately, the hopes elicited by such strong claims are theoretically unfounded. We provide both critical and remedial action. First and foremost, we review Tamaddoni-Nezhad and Muggleton's aforementioned genetic approach and show that it is provably flawed both at the representation level and at the evaluation level. More specifically, we consider the following claims by the authors, which are central to their approach:

Fallacy 1 (Representation). The proposed binary representation for clauses encodes the subsumption lattice lower bounded by the bottom clause in a complete and compact way.

Fallacy 2 (Evaluation). A fast evaluation mechanism for clauses has been given.

Respectively, we show:

Fact 1 (Representation).
The binary encoding of the subsumption lattice is both incomplete and noncompact. An infinity of good clauses are left out (the encoding is incomplete), whilst a huge number of redundant binary strings might get in (the encoding is noncompact): for even a single shared variable in the bottom clause, the proposed space of binary representations is $2^{\binom{n}{2}} / B_n \xrightarrow{n \to \infty} \infty$ times bigger than the number of clauses that it encodes, where $n \stackrel{\mathrm{def}}{=}$ the number of predicates that share the variable and $B_n$ is the n-th Bell number. The representational power of the proposed binary encoding is that of the atomic subsumption lattice.

Fact 2 (Evaluation). The proposed fast evaluation mechanism for clauses is provably unsound.

We shall henceforth refer to Tamaddoni-Nezhad and Muggleton's genetic algorithms approach simply as the TNM's GAA. The TNM's GAA errors that we pinpoint in this paper appear to have no easy fix. They are very fundamental
theoretical errors, which fatally undermine the TNM’s GAA by exploding its two pillars: the allegedly complete and compact binary representation and the allegedly fast evaluation mechanism. It appears to us that these errors are not independent, the confusion in the treatment of the representation having led to a flawed evaluation mechanism. Ever since before the inception of the TNM’s GAA line of research, we have emphasised [19], as we do today, that the problem of combining evolutionary computation with first-order logic learning is worth investigating. In this respect, at the very least, the TNM’s GAA is meritorious. Nevertheless, at present, its methodological conception and its theoretical foundations are much too shaky for the approach to be credible, although the TNM’s GAA does bring some welcome body of experimental evidence. We think such evidence should be considered together with the theoretical evidence we present in this paper in any further investigation. For their part, the theoretical flaws need to be carefully analysed and corrected. To this task the present paper is devoted. The paper is organised as follows. In §2 we review some preliminaries, such as inductive logic programming and inverse entailment. In §3 we analyse the errors in the binary representations used in the TNM’s GAA. In §4 we analyse the errors in the bitwise evaluation mechanism used in the TNM’s GAA. In §5 we present our first-order logic alternative approach to EC—ILP hybridisation, proving the existence of a sound and complete binary refinement operator for subsumption in progol-like search spaces. In addition, we give a novel representation for binding matrices with first-order logic representational power. In §6 we discuss the significance of both our negative and positive results, and articulate the conclusions of this paper alongside some directions for future work.
2 Preliminaries
The reader is assumed to be familiar with the basic formalism of first-order clausal logic. The paragraphs below are intended as brief reminders. A good general reference for inductive logic programming is [20] and Muggleton's seminal paper on progol is [17].

2.1 Inductive Logic Programming (ILP)
Under ILP's normal setting (also called the monotonic setting, or the learning-from-entailment setting) one considers background knowledge B, positive and negative examples $E^+$ and $E^-$, and a hypothesis H. B, $E^+$, $E^-$ and H are all finite sets of clauses (theories). Further restrictions can be applied, for instance by requiring that B, H, $E^+$, $E^-$ are logic programs, rather than general theories, or imposing that positive examples are ground unit definite clauses and negative examples are ground unit headless Horn clauses. The central goal is to induce (learn) H such that the following two conditions are satisfied:
\[ B \wedge H \models E^+ \qquad\qquad B \wedge H \wedge E^- \not\models \Box \]
This looks rather much like solving a system of inequations (logical entailment $\models$ is a quasi-order, i.e. reflexive and transitive), except that it is a rather complicated one. However, if one only considers the positive examples, in the first instance, the following simpler system is obtained:
\[ B \wedge H \models E^+ \qquad\qquad B \wedge H \not\models \Box \]
Progress has been made towards characterising the solutions to this system as follows:

2.2 Inverse Entailment
Definition 1 (Subsumption). Let C and D be clauses. Then C θ-subsumes D, denoted by $C \succeq D$, if there exists a substitution θ such that Cθ ⊆ D (i.e. every literal in Cθ is also a literal in D).

Definition 2 (Inverse Entailment). Inverse Entailment is a generic name for any computational procedure that, given B, $E^+$ as input, will return a bottom clausal theory $\bot(E^+, B)$ as output, such that the following condition is satisfied:
\[ \forall H:\quad H \models \bot(E^+, B) \iff \begin{cases} B \wedge H \models E^+ \\ B \wedge H \not\models \Box \end{cases} \]

Inoue (2001) has provided the only known example of Inverse Entailment in the general case (under the name of Consequence-Finding). It was hoped that entailment on the left-hand side might be replaced with subsumption or another decidable quasi-order, as entailment is only semidecidable in the general case. However, this hope was largely unfulfilled. In more restricted settings, the following results have been obtained:

For H, E restricted to be clauses rather than theories, Yamamoto (1997) gives a computational procedure that computes a bottom clause $\bot(E^+, B)$ such that the following condition is satisfied:
\[ \forall H:\quad H \succeq \bot(E^+, B) \iff \begin{cases} H \succeq_B E^+ \\ H \not\succeq_B \Box \end{cases} \]
Note that entailment has been replaced with the more restricted subsumption on the left-hand side and Plotkin's relative subsumption on the right-hand side.

For H, E restricted to be function-free clauses and B a function-free Horn theory, Muggleton (1998) gives a computational procedure that computes a bottom clause $\bot(E^+, B)$ such that the following condition is satisfied:
\[ \forall H:\quad H \succeq \bot(E^+, B) \;\begin{array}{c} \not\Longrightarrow \\ \Longleftarrow \end{array}\; \begin{cases} B \wedge H \models E^+ \\ B \wedge H \not\models \Box \end{cases} \]
Note that entailment has been replaced with the more restricted subsumption on the left-hand side but the soundness of Inverse Entailment ($\Longrightarrow$) has been lost.
For H, E restricted to be function-free Horn clauses and B a function-free Horn theory, Muggleton (1995) gives a computational procedure that computes a bottom Horn clause $\bot(E^+, B)$ such that the following condition is satisfied:
\[ \forall H:\quad H \succeq \bot(E^+, B) \;\begin{array}{c} \not\Longleftarrow \\ \Longrightarrow \end{array}\; \begin{cases} B \wedge H \models E^+ \\ B \wedge H \not\models \Box \end{cases} \]
Note that entailment has been replaced with the more restricted subsumption on the left-hand side but the completeness of Inverse Entailment ($\Longleftarrow$) has this time been lost. Completeness can be restored to this version if either entailment is restored on the left-hand side or the unique bottom clause is replaced with a family $\{\bot_i(E^+, B)\}_i$ of bottom clauses (subsaturants of the unique bottom clause computed by Muggleton's procedure):
\[ \forall H:\quad \bigvee_i H \succeq \bot_i(E^+, B) \iff \begin{cases} B \wedge H \models E^+ \\ B \wedge H \not\models \Box \end{cases} \]
Subsumption is, in general, preferred to entailment on the left-hand side since it is decidable. However, as apparent from before, it only guarantees completeness and soundness of Inverse Entailment when the general ILP setting is restricted and multiple bottom clauses are generated.
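Since subsumption is decidable, Definition 1 can be checked by exhaustive search in the function-free case. The following Python sketch is only an illustration under stated assumptions: a clause is a collection of literals `(sign, predicate, args)`, variables are strings starting with an upper-case letter, and `subsumes` is a hypothetical helper name, not part of progol or of any cited system.

```python
from itertools import product


def is_var(term):
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def subsumes(c, d):
    """True if some substitution theta satisfies C.theta ⊆ D (Definition 1)."""
    c_vars = sorted({t for (_, _, args) in c for t in args if is_var(t)})
    d_terms = sorted({t for (_, _, args) in d for t in args})
    d_set = set(d)
    # Every variable of C must be sent to a term occurring in D.
    for image in product(d_terms, repeat=len(c_vars)):
        theta = dict(zip(c_vars, image))
        if all((sign, pred, tuple(theta.get(t, t) for t in args)) in d_set
               for (sign, pred, args) in c):
            return True
    return False


# p(U,V) <- q(U,X), r(X,V), with "+" for the head and "-" for body literals
bottom = [("+", "p", ("U", "V")), ("-", "q", ("U", "X")), ("-", "r", ("X", "V"))]
shorter = [("+", "p", ("U", "V")), ("-", "q", ("W", "X"))]
print(subsumes(shorter, bottom), subsumes(bottom, shorter))
```

The search is exponential in the number of variables of C, which is acceptable for a toy check but not for realistic bottom clauses.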
3 Errors at the Representation Level of the TNM's GAA
This section was initially written with knowledge of the TNM's GAA as presented in [27,28]. Since then, their paper [26] has become available, which contains some differences with respect to the treatment of the representation. Since those differences are minor and do not alter in any way the analysis or results presented in this section, nor their significance, we shall discuss those differences separately in §6. We first give a description of the TNM's GAA binary representation. In the context of the Inverse Entailment procedure discussed in the preceding section, Tamaddoni-Nezhad and Muggleton consider the case where one has a function-free bottom Horn clause $\bot(E^+, B)$ and claim that the space of solutions $\{H \mid H \succeq \bot(E^+, B),\ H \text{ is a function-free Horn clause}\}$ can be described as a boolean lattice obtained from the variable sharing in the bottom clause according to a simple procedure (Fig. 1). In [27,28] the following definition is given:

Definition 3 (Binding Matrix). Suppose B and C are both clauses and there exists a variable substitution θ such that Cθ = B. Let C have n variable occurrences representing variables $v_1, v_2, \ldots, v_n$. The binding matrix of C is an n × n matrix M in which $m_{ij}$ is 1 if there exist variables $v_i$, $v_j$ and u such that $v_i/u$ and $v_j/u$ are in θ and $m_{ij}$ is 0 otherwise. We write $M(v_i, v_j) = 1$ if $m_{ij} = 1$ and $M(v_i, v_j) = 0$ if $m_{ij} = 0$.

This definition is unsound because, assuming that the authors meant B to be a fixed bottom clause, there can be many substitutions θ such that Cθ = B.
[Figure 1 (lattice diagram, listed here level by level; edges are labelled with the substitutions {W/U}, {Z/V} and {Y/X}):]
Binary Encoding 000: p(U, V) ← q(W, X), r(Y, Z)
Binary Encoding 100: p(U, V) ← q(U, X), r(Y, Z)
Binary Encoding 010: p(U, V) ← q(W, X), r(Y, V)
Binary Encoding 001: p(U, V) ← q(W, X), r(X, Z)
Binary Encoding 110: p(U, V) ← q(U, X), r(Y, V)
Binary Encoding 101: p(U, V) ← q(U, X), r(X, Z)
Binary Encoding 011: p(U, V) ← q(W, X), r(X, V)
Binary Encoding 111: p(U, V) ← q(U, X), r(X, V)
Fig. 1. Tamaddoni-Nezhad and Muggleton's "subsumption lattice" bounded below by the bottom clause p(U, V) ← q(U, X), r(X, V)

It is not clear, then, with respect to which of these substitutions the binding matrix of C is defined. This notwithstanding, let us consider the bottom clause p(U, V) ← q(U, X), r(X, V) in Fig. 1. Using the equality predicate we can rewrite the clause as follows:
\[ p(X_1, X_2) \leftarrow q(X_3, X_4),\ r(X_5, X_6),\ X_1 = X_3,\ X_2 = X_6,\ X_4 = X_5 \]
We note that the variable sharing in the bottom clause is now completely described by the three equalities. Any other clause in Fig. 1 can be re-written as a combination of the common factor $p(X_1, X_2) \leftarrow q(X_3, X_4), r(X_5, X_6)$ and a subset of the three equalities $\{X_1 = X_3, X_2 = X_6, X_4 = X_5\}$ that describe the variable sharing in the bottom clause. For instance, clause p(U, V) ← q(U, X), r(Y, Z) will become:
\[ p(X_1, X_2) \leftarrow q(X_3, X_4),\ r(X_5, X_6),\ X_1 = X_3 \]
It is now clear that we do not need the common factor, every clause in Fig. 1 being describable by simply noting which of the three equalities in the bottom clause it sets off. If we use the binary string (1, 1, 1) to indicate that the bottom clause satisfies all three equalities, then the second clause above may be encoded as (1, 0, 0). The binary representation approach is therefore an instantiation of a technique more commonly known as propositionalisation. To our knowledge, this approach was first investigated rigorously in [1,2]. It was shown clearly that the approach could not yield completeness for subsumption, not even in the simple
case of function-free (Datalog) languages [1]. We now show that this is indeed the case for subsumption lattices bounded below by bottom clauses.

Binary Representation Space Is Incomplete. The following clauses are missing from Tamaddoni-Nezhad and Muggleton's subsumption lattice, as pictured in Fig. 1:
←, the empty clause, subset of the bottom clause
p(U, V) ←, subset of the bottom clause
← q(U, X), subset of the bottom clause
← r(X, V), subset of the bottom clause
p(U, V) ← q(W, X), maps into the bottom clause by substitution θ = {W/U}
p(U, V) ← r(X, Z), maps into the bottom clause by substitution θ = {Z/V}
← q(U, X), r(Y, V), maps into the bottom clause by substitution θ = {Y/X}
These clauses, together with the ones in Fig. 1, are the ones that weakly subsume the bottom clause¹, i.e. those that map literals injectively to the bottom clause. In addition, an infinity of other clauses that subsume p(U, V) ← q(U, X), r(X, V) are also missing, for instance:
\[ p(U, V) \leftarrow \{ q(U, X_i),\ r(X_j, V) \mid i \neq j,\ 1 \le i, j \le n \} \]
for n ≥ 2, which maps onto the bottom clause by substitution $\{X_i/X\}_{1 \le i \le n}$. We may wonder whether completeness has instead been achieved under subsumption equivalence, i.e. whether one clause from each equivalence class under subsumption is present in the search space. However, this is not the case: none of the exemplified missing clauses is subsume-equivalent with any of the clauses already in the search space, nor are they subsume-equivalent between themselves. The quasi-order used in Fig. 1 is therefore not subsumption but the much weaker atomic subsumption $\succeq_a$ as defined in [20, p. 244]. If, as before, we denote entailment by $\models$, subsumption by $\succeq$, weak subsumption by $\succeq_w$ and atomic subsumption by $\succeq_a$, we have the following relationship between the four orders, where none of the converse implications holds:
\[ \succeq_a \;\Longrightarrow\; \succeq_w \;\Longrightarrow\; \succeq \;\Longrightarrow\; \models \]
Consequently, we have:
\[ \mathcal{H}_{\succeq_a} \subsetneq \mathcal{H}_{\succeq_w} \subsetneq \mathcal{H}_{\succeq} \subsetneq \mathcal{H}_{\models} \]
where
\[ \mathcal{H}_{\succeq_i} \stackrel{\mathrm{def}}{=} \{ H \mid H \succeq_i \bot(E^+, B),\ H \text{ is a function-free Horn clause} \} \]
for $\succeq_i \in \{\succeq_a, \succeq_w, \succeq, \models\}$. progol's existing refinement encodes $\mathcal{H}_{\succeq_w}$, which is incomplete with respect to subsumption: $\mathcal{H}_{\succeq_w} \subsetneq \mathcal{H}_{\succeq}$. However, the binary representation approach only captures $\mathcal{H}_{\succeq_a}$, which is even more incomplete: $\mathcal{H}_{\succeq_a} \subsetneq \mathcal{H}_{\succeq_w}$.

¹ See [4] for a complete and sound encoding of the weak subsumption lattice with a refinement operator.
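To make the propositionalised encoding concrete, the following Python sketch decodes each 3-bit string into the clause it stands for, using the common factor and the three equalities of the running example. It is a hypothetical illustration, not the TNM's GAA code: `TEMPLATE`, `EQUALITIES` and `decode` are assumed names, and the simple renaming works only because the three equalities involve pairwise disjoint variables.

```python
from itertools import product

# Common factor p(X1,X2) <- q(X3,X4), r(X5,X6), one entry per literal
TEMPLATE = [("p", ("X1", "X2")), ("q", ("X3", "X4")), ("r", ("X5", "X6"))]
# One bit per equality describing the variable sharing in the bottom clause
EQUALITIES = [("X1", "X3"), ("X2", "X6"), ("X4", "X5")]


def fmt(literal):
    pred, args = literal
    return f"{pred}({','.join(args)})"


def decode(bits):
    """Bit i set to 1 means that EQUALITIES[i] is enforced in the clause."""
    rename = {}
    for bit, (kept, merged) in zip(bits, EQUALITIES):
        if bit:
            rename[merged] = kept
    literals = [(pred, tuple(rename.get(v, v) for v in args))
                for pred, args in TEMPLATE]
    head, body = literals[0], literals[1:]
    return f"{fmt(head)} <- {', '.join(fmt(b) for b in body)}"


for bits in product((0, 1), repeat=3):
    print(bits, decode(bits))
```

Every decoded clause keeps all three literals of the bottom clause, so shorter clauses such as p(U, V) ← q(W, X) have no bit string at all, which is the incompleteness discussed above.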
Binary Representation Space Is Not Compact. We now show that the binary representation in TNM's GAA is asymptotically noncompact. Let us consider a bottom clause that has n predicates, all sharing the first variable. For simplicity we assume that all the other variables are distinct:
\[ p_0(X, \ldots) \leftarrow p_1(X, \ldots),\ p_2(X, \ldots),\ \ldots,\ p_{n-1}(X, \ldots) \]
Using equality to re-write the clause we arrive at:
\[ p_0(X_0, \ldots) \leftarrow p_1(X_1, \ldots),\ \ldots,\ p_{n-1}(X_{n-1}, \ldots),\ X_0 = X_1 = \cdots = X_{n-1} \]
Note that we have slightly abused the notation in order to write the clause more compactly. The number of binary equalities $X_0 = X_1$, $X_0 = X_2$, etc. is now $\binom{n}{2} = \frac{n(n-1)}{2}$, which will yield $2^{\binom{n}{2}}$ binary strings upon encoding. On the other hand, the number of clauses that can be obtained by anti-unification of variables (as we have seen, Tamaddoni-Nezhad and Muggleton do not consider adding/dropping literals, thereby generating incompleteness) is given by the n-th Bell number $B_n$, i.e. the number of all partitions of the set of variables $\{X_0, \ldots, X_{n-1}\}$. The space of encodings will therefore be $2^{\binom{n}{2}} / B_n$ times bigger than the number of valid clauses that it encodes. To get a feeling for this difference, for n = 12 the space of encodings will contain $2^{66} = 73786976294838206464$ binary strings, while the space of clauses will contain $B_{12} = 4213597$ clauses, i.e. about 1 clause for every 17.5 trillion redundant encodings. In ILP it is not uncommon for bottom clauses to contain tens or hundreds of predicates with complex variable sharing. It can be shown that, as n grows, this gets worse and worse, i.e. the asymptotic behaviour confirms this tendency: $2^{\binom{n}{2}} / B_n \xrightarrow{n \to \infty} \infty$.

Potential Effects of Noncompactness. The redundancy introduced by the noncompact binary encoding may have statistical and computational consequences. Clauses in the atomic subsumption lattice lower bounded by the bottom clause will not all map to the same number of encodings.
Some may happen to have just one binary encoding, e.g. the top clause. Others might have exponentially many, depending on the complexity of the variable multiplicities in the bottom clause. From a statistical point of view, this seems to introduce a distributional prior over the search space, favouring those clauses that are closer to the bottom clause. The implications of this bias towards specificity will need to be analysed.

In order to weed out redundancy in the string population, normalisation closures may be employed. This seems to be problematic in two ways. First, normalisation closures entail computational difficulties: in order to be able to compute normalisation closures, one obvious solution is to revert to the binding matrices$^{2}$, which is an indirect way of reverting to variables; another solution is to keep a separate list of normalisation rules, valid for the bottom clause at hand. For instance, to infer $X_2 = X_3$ from $X_1 = X_2$ and $X_1 = X_3$ one has to encode the variables. Otherwise, if $X_1 = X_2$, $X_1 = X_3$, $X_2 = X_3$ are encoded as 3-bit binary strings $(Y_1, Y_2, Y_3)$, then one needs to know that $Y_1 = 1 \wedge Y_2 = 1 \Longrightarrow Y_3 = 1$, etc. It is easy to show that the volume of normalisation operations is exponential in $n$ in the worst case, which, combined with the need to repeat such operations frequently because of the noncompact representation, puts under serious question the claimed efficiency of the approach.

The second problem with normalisation operations is that they may interfere with the search by increasing the propensity towards specificity. Search operators applied to a population of strings before normalisation would produce a different population to the one obtained after the normalisation has taken place. Normalised strings stand below the strings they normalise in the boolean order over the binary representation space.

Then there is the question of the computational overhead needed to convert strings to clauses for coverage testing. It seems that much of the computational advantage of the approach is lost if this conversion has to be done often. We discuss this issue in the next section.
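The figures quoted in the compactness argument can be checked mechanically. The following is a minimal sketch, not part of the original analysis: it computes Bell numbers with the Bell triangle and, for small $n$, enumerates every bit string over the $\binom{n}{2}$ equalities and normalises it by transitive closure to see how many encodings fall on each clause (partition of the variables). All function names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations, product


def bell(n):
    """n-th Bell number, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]
        for value in row:
            nxt.append(nxt[-1] + value)
        row = nxt
    return row[0]


def redundancy(n):
    """(number of binary encodings, number of clauses) for n variables."""
    return 2 ** (n * (n - 1) // 2), bell(n)


def partition_of(bits, pairs, n):
    """Partition of {0..n-1} induced by the equalities switched on in `bits`.

    The transitive closure (normalisation) is done with a small union-find.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for bit, (i, j) in zip(bits, pairs):
        if bit:
            parent[find(i)] = find(j)
    blocks = {}
    for v in range(n):
        blocks.setdefault(find(v), []).append(v)
    return frozenset(frozenset(b) for b in blocks.values())


def encodings_per_clause(n):
    """Count how many bit strings normalise to each partition (small n only)."""
    pairs = list(combinations(range(n), 2))
    return Counter(partition_of(bits, pairs, n)
                   for bits in product((0, 1), repeat=len(pairs)))
```

Calling `redundancy(12)` reproduces the two numbers given in the text. For `n = 4`, `encodings_per_clause` spreads the 64 bit strings very unevenly over the 15 partitions: the partition with no shared variables has a single encoding, whereas the partition in which all variables are identified has many, which is the bias towards specificity discussed above.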
4 Errors at the Evaluation Level of the TNM’s GAA
The importance of the evaluation step for efficiency considerations in the TNM’s GAA has been described by its authors in their latest paper [26] as follows: “This step is known to be a complex and time-consuming task in first-order concept learning. In the case of genetic-based systems this situation is even worse, because we need to evaluate a population of hypotheses in each generation. This problem is another important difficulty when applying GA’s in first-order concept learning”. Therefore the authors propose to maintain “the coverage sets for a small number of clauses and by doing bitwise operations we can compute the coverage for other binary strings without mapping them into the corresponding clauses. This property is based on the implicit subsumption order which exists in the binary representation” [26, p. 290]. However, we have shown that the binary representation is not based on the subsumption order but on atomic subsumption. That entails that some mgi’s and lgg’s under subsumption will not be present in the atomic subsumption lattice. From that finding it then follows that Example 2, Example 3 and Example 4 in [26], as well as Theorem 2, Theorem 3 and Theorem 5, are false. In addition, many of the associated claims and much of the discussion are incorrect. Theorem 4 is correct but inapplicable.

Let us consider Example 2 in [26]. It is claimed that the clause $p(U, V) \longleftarrow q(U, X), r(X, Z)$ is the mgi$^{3}$ of the clauses $p(U, V) \longleftarrow q(U, X), r(Y, Z)$ and $p(U, V) \longleftarrow q(W, X), r(X, Z)$ under subsumption $\succeq$. However, this is not true. The mgi of the two clauses under $\succeq$, according to [20, p. 251], is $p(U, V) \longleftarrow q(U, X), q(W, X'), r(Y, Z), r(X', Z')$. We may wonder whether this is subsume-equivalent to $p(U, V) \longleftarrow q(U, X), r(X, Z)$. Suppose, by reductio ad absurdum, that the latter clause subsumes the former. Then there should be a substitution mapping it onto a subset of the former. Necessarily, $\{U/U, V/V\}$ because of the common head. Then $q(U, X)$ is mapped to some $q(U, \_)$ and the only such literal available is $q(U, X)$, therefore $\{X/X\}$. Then $r(X, Z)$ is mapped to some $r(X, \_)$, but there are no such literals in the former clause, a contradiction.

In fact the mgi given in [26] is computed under atomic subsumption $\succeq_a$. Crucially, for mgi’s under $\succeq_a$ Theorem 4 in [26] does not hold, i.e. one may not compute coverages of such mgi’s by intersecting the coverages of the parent clauses as described in [26]. This is because coverages of clauses are computed under entailment $\models$ (or, in some circumstances, subsumption $\succeq$) but not under the much weaker atomic subsumption $\succeq_a$. As a counter-example, consider the clause $p(U, V) \longleftarrow q(U, X), q(W, X'), r(Y, Z), r(X', Z')$ referred to above. Under $\succeq$ it is covered by both $p(U, V) \longleftarrow q(U, X), r(Y, Z)$ and $p(U, V) \longleftarrow q(W, X), r(X, Z)$, therefore it belongs to the intersection of their coverages. However, as shown, it is not covered by the clause $p(U, V) \longleftarrow q(U, X), r(X, Z)$, the alleged mgi. This renders the proposed fast evaluation mechanism unsound and therefore moot.

$^{2}$ This seems to be the approach taken in the TNM’s GAA, although very little is said about this necessary step in the interpretation of their strings. See footnote 4 of [28, p. 642] and the paragraph following Example 2 in [27, p. 249].
In our view, there are two main misunderstandings that have generated the flaws in the evaluation mechanism:
– As subsumption got confused with atomic subsumption, so the mgi under subsumption got confused with the mgi under atomic subsumption.
– Even if this was fixed (not possible with a representation based on the atomic subsumption lattice), we still could not obtain a full evaluation mechanism using mgi alone because such a mechanism requires atomicity and the full subsumption lattice is not atomic. We could still be required to do coverage testing for a significant number of clauses during the search.
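The counter-example to Example 2 of [26] can be replayed with a brute-force subsumption test. The sketch below is only an illustration under simplifying assumptions: clauses are function-free, every argument is treated as a variable, and the primed variables of the text are written `X1` and `Z1`. The helper names `clause`, `variables` and `subsumes` are hypothetical and not taken from any ILP system.

```python
from itertools import product


def clause(head, *body):
    """Build a clause as a frozenset of signed literals (sign, predicate, args)."""
    literals = {("+", head[0], tuple(head[1:]))}
    literals |= {("-", lit[0], tuple(lit[1:])) for lit in body}
    return frozenset(literals)


def variables(c):
    """Sorted list of the variables occurring in clause c."""
    return sorted({a for _, _, args in c for a in args})


def subsumes(c, d):
    """True if some substitution theta maps clause c onto a subset of clause d.

    Brute force over all variable-to-variable maps, which is adequate only
    for tiny function-free clauses such as the ones below.
    """
    c_vars, d_vars = variables(c), variables(d)
    for image in product(d_vars, repeat=len(c_vars)):
        theta = dict(zip(c_vars, image))
        if all((s, p, tuple(theta[a] for a in args)) in d for s, p, args in c):
            return True
    return False


# The two parent clauses of Example 2 in [26].
C1 = clause(("p", "U", "V"), ("q", "U", "X"), ("r", "Y", "Z"))
C2 = clause(("p", "U", "V"), ("q", "W", "X"), ("r", "X", "Z"))
# Their mgi under subsumption, as given in the text (X1, Z1 stand for X', Z').
MGI = clause(("p", "U", "V"), ("q", "U", "X"), ("q", "W", "X1"),
             ("r", "Y", "Z"), ("r", "X1", "Z1"))
# The clause claimed to be the mgi in [26].
ALLEGED = clause(("p", "U", "V"), ("q", "U", "X"), ("r", "X", "Z"))
```

With these definitions, `subsumes(C1, MGI)` and `subsumes(C2, MGI)` both hold while `subsumes(ALLEGED, MGI)` does not, which matches the argument above: the clause lies in the intersection of the parents' coverages but outside the coverage of the alleged mgi.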
5 A First-Order Logic Approach: Binary Refinement
In this section we propose an alternative approach to the TNM’s GAA that is theoretically better founded and has the potential to lead, in time, to sound implementations of hybrid systems that combine evolutionary computation with first-order logic search. This approach replaces the flawed binary representations and genetic operators in the TNM’s GAA with a novel first-order logic binding matrix representation and a provably sound and complete binary refinement operator.

$^{3}$ Most general instantiation.
Let us consider the search space

$$\mathcal{H}_{\succeq} \stackrel{\mathrm{def}}{=} \{H \mid H \succeq \bot(E^+, B),\ H \text{ is a function-free Horn clause}\}$$

This space is too difficult to represent and search directly, therefore we will instead consider the related space

$$\mathcal{H}^{\theta}_{\succeq} \stackrel{\mathrm{def}}{=} \{(H, \phi) \mid H \in \mathcal{H}_{\succeq},\ \phi \text{ is a substitution such that } H\phi \subseteq \bot(E^+, B)\}$$

This approach pairs one clause in $\mathcal{H}_{\succeq}$ with one substitution to $\bot(E^+, B)$, similarly to the approach in [17] for $\mathcal{H}_{\succeq_w}$. The difference between $\mathcal{H}_{\succeq}$ and $\mathcal{H}^{\theta}_{\succeq}$ is that a clause can have a few different substitutions onto different subsets of the bottom clause, and we consider each such substitution to generate a distinct element in the search space. Although the number of search elements increases, there are advantages in considering this representation from the point of view of the search. The order inherited by $\mathcal{H}^{\theta}_{\succeq}$ from $\mathcal{H}_{\succeq}$ is as follows:

$$(H_1, \phi_1) \succeq (H_2, \phi_2) \stackrel{\mathrm{def}}{\iff} H_1 \succeq H_2 \text{ and } \exists\, \phi \text{ a substitution with } H_1\phi \subseteq H_2 \text{ such that } \phi_1 = \phi_2\phi.$$

$\langle \mathcal{H}^{\theta}_{\succeq}, \succeq \rangle$ is an infinite search space and can only be represented implicitly by a refinement operator. This situation is similar to a grammar generating a language in the theory of formal languages. However, refinement operators also follow a certain direction with respect to the ordering of hypotheses in the search domain (for instance, $\succeq$).

Definition 4 (Unary Refinement). Let $\langle G, \geq \rangle$ be a quasi-ordered set. A downward refinement operator for $\langle G, \geq \rangle$ is a function $\rho_1 : G \longrightarrow \mathcal{P}(G)$, such that $\rho_1(C) \subseteq \{D \mid C \geq D\}$ for every $C \in G$. Similarly, one can define an upward refinement operator.

It has traditionally been assumed in ILP that refinement operators are unary, i.e. they apply to single clauses. However, in order to accommodate genetic searches we would need refinement operators that apply to pairs of clauses, i.e. binary refinement operators. Since genetic searches are multi-point, proceeding from an initial random seeding rather than top-down or bottom-up, we would like our operator to be both upward and downward. We would also like it to include recombination and mutation sub-operators specific to genetic searches. We prove the following theorem.

Theorem 1. A sound and complete binary refinement operator $\rho_2$ exists such that:

$$\rho_2 : \mathcal{H}^{\theta}_{\succeq} \times \mathcal{H}^{\theta}_{\succeq} \longrightarrow \mathcal{P}(\mathcal{H}^{\theta}_{\succeq})$$

$$\rho_2[(C, \phi_C), (D, \phi_D)] = \left\{ \begin{array}{l} \rho_1^{d}[mgi(C, D), \phi_{mgi(C,D)}] \\ \rho_1^{u}[lgg(C, D), \phi_{lgg(C,D)}] \\ Recomb_{\bot}[(C, \phi_C), (D, \phi_D)] \end{array} \right.$$

where $\rho_1^{d,u} : \mathcal{H}^{\theta}_{\succeq} \longrightarrow \mathcal{P}(\mathcal{H}^{\theta}_{\succeq})$ are two unary refinement operators, downward and upward respectively, and $Recomb_{\bot} : \mathcal{H}^{\theta}_{\succeq} \times \mathcal{H}^{\theta}_{\succeq} \longrightarrow \mathcal{P}(\mathcal{H}^{\theta}_{\succeq})$ is a sound and minimal recombination operator.
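To make the shape of such an operator concrete, here is a toy sketch on a propositional stand-in, not the operator of Theorem 1. Hypotheses are sets of literals drawn from a fixed bottom set and ordered by inclusion, so the meet of two hypotheses (the analogue of mgi) is their union and the join (the analogue of lgg) is their intersection. Variables, substitutions and the recombination component are all left out, and the names `BOTTOM`, `rho_down`, `rho_up` and `rho2` are illustrative assumptions.

```python
# Toy bottom "clause": hypotheses are frozensets of these literal names.
BOTTOM = frozenset({"a", "b", "c", "d"})


def rho_down(h):
    """Unary downward refinement: add one literal of the bottom set."""
    return {h | {lit} for lit in BOTTOM - h}


def rho_up(h):
    """Unary upward refinement: drop one literal."""
    return {h - {lit} for lit in h}


def rho2(c, d):
    """Binary refinement built from the unary operators.

    Refines downward from the meet (union) of the two seeds and upward
    from their join (intersection); the two parts are returned separately.
    """
    return {"down": rho_down(c | d), "up": rho_up(c & d)}
```

For the seeds `frozenset("ab")` and `frozenset("bc")`, the downward part contains only the full bottom set and the upward part only the empty hypothesis. Applied to a pair of identical seeds, `rho2` reduces to the two unary operators, which mirrors the convention mgi(H, H) = H and lgg(H, H) = H adopted in the proof.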
Before going into the proof we note that $\rho_1^{d,u}$ will play the role of mutation, $\rho_1^{d}[mgi, \phi_{mgi}]$ will be a downward binary refinement operator, $\rho_1^{u}[lgg, \phi_{lgg}]$ will be an upward binary refinement operator, and the last two together will maintain the refinement space of the search, i.e. the space of clauses which are below or above the seed elements, while $Recomb_{\bot}$ will extend the search into the space of all possible recombinations of the clauses in the refinement space. Since the genetic search is stochastic and subject to repeated pruning of the search space, proving the soundness and completeness of $\rho_2$ does not entail the completeness of the search. It only guarantees that the search is sound and that it is not incomplete a priori, i.e. incomplete because of an incomplete search space.

Proof. For simplicity, we assume general clauses in the definition of $\mathcal{H}^{\theta}_{\succeq}$. Everything can easily be translated to Horn clauses. Let us consider an element $(H, \phi) \in \mathcal{H}^{\theta}_{\succeq}$ and the bottom clause $\bot = \bot(L_1, \ldots, L_m; X_1, \ldots, X_n)$, where $L_i$, $X_j$ are the literals and the variables in $\bot$, respectively. Then $\phi$ will map every literal in $H$ onto a literal in $\bot$ and every variable in $H$ onto a variable in $\bot$. We define the building blocks of $(H, \phi)$ as follows:

$$BB(H, \phi) \stackrel{\mathrm{def}}{=} \{\phi^{-1}(L_i) \mid 1 \leq i \leq m\} \cup \{\phi^{-1}(X_j) \mid 1 \leq j \leq n\}$$
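The building blocks can be computed by grouping the variables and literals of H by their image under φ. The sketch below does this for the pair (H, φ) that the paper uses in its Figure 2; the data layout (literals as predicate plus argument tuple) and the function name `building_blocks` are assumptions made for illustration. Since the bottom clause of that example is not spelled out, blocks with an empty preimage are simply not listed.

```python
from collections import defaultdict

# (H, phi) from the paper's Figure 2; a literal is (predicate, argument tuple).
H = [("L1", ("V", "W", "A")), ("L2", ("U", "B", "V")),
     ("L3", ("W", "B", "B")), ("L3", ("A", "B", "B")), ("L6", ("U", "A"))]
PHI = {"U": "X1", "V": "X1", "W": "X2", "A": "X2", "B": "X3"}


def building_blocks(clause, phi):
    """Group variables and literals of `clause` by their image under `phi`.

    Returns (variable blocks, literal blocks): the first maps each bottom
    variable to its preimage, the second maps each image literal to the
    list of clause literals that phi sends onto it.
    """
    var_blocks = defaultdict(set)
    lit_blocks = defaultdict(list)
    for pred, args in clause:
        image = (pred, tuple(phi[a] for a in args))
        lit_blocks[image].append((pred, args))
        for a in args:
            var_blocks[phi[a]].add(a)
    return dict(var_blocks), dict(lit_blocks)
```

For this example the variable blocks are {U, V}, {W, A} and {B}, and the two L3 literals share one literal block because φ maps both onto the same image.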
mgi and lgg will be the usual operators under subsumption [20, pp. 251–252], except that for any $H$ we take $mgi(H, H) = H$ and $lgg(H, H) = H$ rather than the clauses subsume-equivalent to $H$ assigned by the general mgi and lgg procedures. For any $(H, \phi) \in \mathcal{H}^{\theta}_{\succeq}$, $\rho_1^{d}[H, \phi]$ will be computed as follows:

Unification. For every two variables $U, V$ in $H$ for which there exists $X_j$ in $\bot$ such that $U, V \in \phi^{-1}(X_j)$, we take $(H_{U/V}, \phi_{U/V}) \in \rho_1^{d}[H, \phi]$, i.e. the clause and substitution obtained from $(H, \phi)$ by unification of the variables $U$ and $V$.

Literal Addition. For every literal $L_i$ in $\bot$, we take $(H \cup L_i, \phi \cup Id_{L_i}) \in \rho_1^{d}[H, \phi]$, i.e. the clause obtained by the addition of literal $L_i$ to the clause $H$ (standardised apart) and the substitution obtained by taking $\phi(Z) = Z$ for every variable $Z$ in $L_i$.

The definition of $\rho_1^{u}[H, \phi]$ is dual.

The basic idea behind defining the operator $Recomb_{\bot}$ is that the participating $(C, \phi_C)$, $(D, \phi_D)$ will exchange building blocks, thereby swapping either literals or variable sharing. Let us assume the first-order logic binding matrix notation for elements in $\mathcal{H}^{\theta}_{\succeq}$ exemplified in Figure 2.

$$\begin{array}{l|ccccc}
 & \phi^{-1}(L_1) & \phi^{-1}(L_2) & \phi^{-1}(L_3) & \phi^{-1}(L_3) & \phi^{-1}(L_6) \\ \hline
\phi^{-1}(X_1) & \emptyset & (1) & \emptyset & \emptyset & (1) \\
\phi^{-1}(X_1) & (1) & (3) & \emptyset & \emptyset & \emptyset \\
\phi^{-1}(X_2) & (2) & \emptyset & (1) & \emptyset & \emptyset \\
\phi^{-1}(X_2) & (3) & \emptyset & \emptyset & (1) & (2) \\
\phi^{-1}(X_3) & \emptyset & (2) & (2, 3) & (2, 3) & \emptyset
\end{array}$$

Fig. 2. First-order logic binding matrix corresponding to $(H, \phi)$, where $H = \{L_1(V, W, A), L_2(U, B, V), L_3(W, B, B), L_3(A, B, B), L_6(U, A)\}$ and $\phi = \{U/X_1, V/X_1, W/X_2, A/X_2, B/X_3\}$

The rows represent distinct variables in the clause, while the labels of the rows determine the building blocks to which the variables belong. The columns represent distinct literals in the clause, while the labels of the columns determine the building blocks to which the literals belong. At the intersection of every row and column we write the positions occupied by the respective variable in the literal, in the obvious way. It is clear how much more expressive this description is compared to the very simple binary binding matrices described in [26,27,28]. It is only with representations of this power that we can capture the space $\mathcal{H}^{\theta}_{\succeq}$ soundly and completely. The element in $\mathcal{H}^{\theta}_{\succeq}$ reconstituted from the matrix notation will be $(H, \phi)$ where $H = \{L_1(V, W, A), L_2(U, B, V), L_3(W, B, B), L_3(A, B, B), L_6(U, A)\}$ and $\phi = \{U/X_1, V/X_1, W/X_2, A/X_2, B/X_3\}$. The first two rows then define a building block, just as do rows 3 and 4. Row 5 will be a building block on its own. Columns will also define individual building blocks, with the exception of columns 3 and 4 that together define a building block.

The recombination operator $Recomb_{\bot}[(C, \phi_C), (D, \phi_D)]$ will always exchange corresponding building blocks, for instance $\phi^{-1}(L_i)$ in $(C, \phi_C)$ with $\phi^{-1}(L_i)$ in $(D, \phi_D)$, or $\phi^{-1}(X_j)$ in $(C, \phi_C)$ with $\phi^{-1}(X_j)$ in $(D, \phi_D)$. $\phi^{-1}(L_i)$ exchanges may always happen, because we can always treat the binding matrices as having the same number of rows in each $\phi^{-1}(X_j)$ building block, simply by inserting dummy rows consisting of empty sets. $\phi^{-1}(X_j)$ exchanges may only happen between binding matrices that have the same $\phi^{-1}(L_i)$ building blocks, and only if those blocks are of the same size. For all other $[(C, \phi_C), (D, \phi_D)]$ the recombination operator yields the empty set. Therefore the recombination operator may be thought to act effectively on that subset of the binding matrices that have the same literal distribution. This natural restriction ensures that the recombination operator is sound. The recombination operator $Recomb_{\bot}[(C, \phi_C), (D, \phi_D)]$ will be computed as follows:

Step 1: Permutation. A permutation of rows and columns is performed in both $(C, \phi_C)$ and $(D, \phi_D)$. The permutation only acts within building blocks. It is assumed that building blocks are always represented contiguously in the binding matrix, although their order does not matter. However, the order of the rows or columns inside the blocks does.

Step 2: Recombination. Following the permutation, the chosen building blocks are exchanged.

The procedure is repeated for every possible permutation in Step 1.

Soundness. For all three sub-operators defining $\rho_2$, the soundness follows directly from construction. The purpose of working in the space $\mathcal{H}^{\theta}_{\succeq}$ rather than $\mathcal{H}_{\succeq}$ was precisely to make it possible to maintain the soundness of the search operators.

Completeness. For reasons of space we only indicate the main steps. The completeness of the operators $\rho_1^{d/u}$ for $\mathcal{H}^{\theta}_{\succeq}$ is derived by a variant of an argument presented in [20, pp. 305–310] for any clausal language under subsumption. One then shows that $[mgi(C, D), \phi_{mgi(C,D)}] = mgi[(C, \phi_C), (D, \phi_D)]$ in $\mathcal{H}^{\theta}_{\succeq}$. The completeness of the operators $\rho_1^{d}[mgi, \phi_{mgi}]$ and $\rho_1^{u}[lgg, \phi_{lgg}]$ then follows from the following theorem proved in full generality in [16,19]:

Theorem 2 ($\rho_1$, $\rho_2$ Extension). For any complete unary refinement operator $\rho_1$ in a lattice $\langle G, \inf, \sup \rangle$, where $\inf(C, C) = \sup(C, C) = C$, there is a complete downward (upward) binary refinement operator $\rho_2$ that extends $\rho_1$, defined as follows: $\rho_2(C, D) \stackrel{\mathrm{def}}{=} \rho_1[\inf(C, D)]$ (or $\rho_2(C, D) \stackrel{\mathrm{def}}{=} \rho_1[\sup(C, D)]$).

Minimality of $Recomb_{\bot}$. Only one block is exchanged at a time, therefore the operator is minimal. This feature is often desirable in refinement operators, so from this point of view the operator will blend well with the refinement operators in the definition of $\rho_2$. It may be said that, while the refinement operators implicitly maintain the refinement space below or above the seeds, the recombination operators implicitly maintain the space of recombinations defined by the seeds and their descendants.
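The binding matrix of Figure 2 and the Unification step of the downward operator can be sketched in a few lines. This is an illustrative reading of the construction rather than the author's implementation: the matrix is stored as a dictionary from variables to rows, an empty tuple stands for the empty set, and the function names are assumptions.

```python
from itertools import combinations

# (H, phi) from Figure 2; a literal is (predicate, argument tuple).
H = [("L1", ("V", "W", "A")), ("L2", ("U", "B", "V")),
     ("L3", ("W", "B", "B")), ("L3", ("A", "B", "B")), ("L6", ("U", "A"))]
PHI = {"U": "X1", "V": "X1", "W": "X2", "A": "X2", "B": "X3"}


def binding_matrix(clause, phi):
    """One row per variable (grouped by image under phi), one column per literal.

    Each cell lists the 1-based argument positions of the variable in the
    literal; an empty tuple plays the role of the empty set.
    """
    rows = sorted(phi, key=phi.get)  # stable sort keeps blocks contiguous
    return {var: [tuple(i + 1 for i, a in enumerate(args) if a == var)
                  for _, args in clause]
            for var in rows}


def unification_refinements(clause, phi):
    """Downward refinements obtained by unifying two variables of one block."""
    refinements = []
    for u, v in combinations(phi, 2):
        if phi[u] == phi[v]:
            new_clause = [(p, tuple(u if a == v else a for a in args))
                          for p, args in clause]
            new_phi = {k: x for k, x in phi.items() if k != v}
            refinements.append((new_clause, new_phi))
    return refinements
```

`binding_matrix(H, PHI)` reproduces the five rows of Figure 2 in the order U, V, W, A, B. `unification_refinements(H, PHI)` returns two refinements, one for each pair of variables that share a block (U with V, and W with A); variable B has no partner in its block and so contributes none.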
6 Conclusions
General Remarks. An eminent statistician once said that those who forget statistics are condemned to reinvent it. Paraphrasing his tag line, we could say that those who forget first-order refinement are doomed to rediscover old pitfalls. Although combining genetic algorithms with inductive logic programming is potentially a valuable approach, it is not straightforward. Inductive logic programming owes its existence both to the first-order representations that it uses and to the recognition of the fact that it employs search mechanisms that are not easily translatable into the propositional domain. Even when such transformations can be made, it is usually under heavy restrictions or at the expense of exponential blow-ups in complexity [7]. This is why the inductive logic programming community has painstakingly developed inductive mechanisms that work with first-order representations directly. In particular, refinement operators defined on first-order clauses lie at the core of many inductive logic programming systems. There is no attempt here to deny the general usefulness of propositionalisation in certain targeted contexts. However, the strength of the claims should be correlated with the merits of such transformations on a case-by-case basis.

Tamaddoni-Nezhad and Muggleton have proposed [26,27,28] a genetic approach to ILP that rediscovers some previously investigated propositionalisation techniques under the name of binary representations. Although it was claimed that this approach achieved completeness in respect of Progol’s search, it is in fact even more incomplete than Progol’s existing search. Although the phrases “genetic refinement” in [27] and “stochastic refinement” in [26,28] were used to name this type of search, no definition of these terms was given, nor any explanation of how they relate to the very consistent body of research on refinement [15,16,20,29,30].

Negative Results and their Significance. Essentially we have shown that not a single theoretical claim of substance in [26,27,28] stands: the allegedly complete and compact binary representation has neither of these properties, whilst the allegedly fast evaluation mechanism is unsound because of the mismatch between the subsumption-based or implication-based coverage testing and the atomic subsumption-based representation. There are some differences in the treatment of TNM’s GAA representation between the initial papers [27,28] and the latest paper [26]. The paper [26] appears to have quietly dropped the previously strong claims to completeness: this is no longer mentioned, and the phrase “subsumption lattice” has been replaced with the more obscure “substitution lattice”. No attempt appears to have been made to correct the previous erroneous claims or to explain their consequences for the proposed approach. In contrast, the claim to compactness of the representation appears stronger. In this paper we establish the representational power of the TNM’s GAA to be that of the atomic subsumption lattice, which is a sub-lattice of the weak subsumption lattice on which the existing Progol is based. The claims regarding the fast evaluation mechanism have also been strengthened in [26]. However, the confusion arising at the representation level has in fact also compromised the evaluation level.

What is the impact of all these mistakes on the TNM’s GAA? From a theoretical standpoint alone, it appears to us that any method that produces a system that is weaker than an existing one is a step backward and not a step forward.
Such is the case with the TNM’s GAA, which we have shown to be even more incomplete than existing versions of Progol. The next question is: what is the impact of the theoretical flaws on the practical usefulness of the method? Based on the theoretical results in this paper, our conjecture is that positive experimental results may only occur when the target clauses are of the same length as their respective bottom clauses and rather specific, and the overall number of examples is not very high. It is possible that, in such cases, atomic subsumption will perform closely enough to subsumption, thereby yielding the appearance of an improvement because of the simplified computation. Indeed, the TNM’s GAA has been reported to achieve good results for complex target theories [26], although it is not clear that closeness to the bottom clauses was the definition of target complexity the authors intended. At present it is not possible to determine with more precision how the flaws in their theoretical framework affect their implementation, since they give neither the exact algorithm on which the implementation is based nor enough detail regarding their experimental settings. It is therefore not possible to either replicate or criticise their experimental evidence. However, in [27] the authors state:
“In our first attempt, we employed the proposed representation to combine Inverse Entailment in c-progol4.4 with a genetic algorithm. In this implementation genetic search is used for searching the subsumption lattice bounded below by the bottom-clause (⊥). According to Theorem 3 the search space bounded by the bottom clause can be represented by S(⊥).”

However, a careful consideration of Theorem 3 given in their paper will show that the theorem’s statement does not entail the identity between $\mathcal{H}_{\succeq}$ and $S(\bot)$. Furthermore, the argument cannot be repaired: $S(\bot)$ is the space of binary representations shown in this paper to be incomplete and noncompact with respect to $\mathcal{H}_{\succeq}$. Since, by the results of this paper, the statement quoted above does not hold, we believe that the burden of proof should now be on the TNM’s GAA authors to correlate their experimental claims with the true representational and computational power of their approach.

Positive Results and their Significance. In this paper we have given a first-order logic binding matrix representation for clauses in the subsumption lattice bounded by a bottom clause. Unlike TNM’s GAA representation, ours is sound, compact and complete. We have also proven the existence of a binary refinement operator for this space, which, unlike the genetic operators in TNM’s GAA, satisfies the requirements of soundness and completeness required of a refinement operator. Although the theoretical apparatus for binary refinement already exists [16,19], the binary refinement operator given in §5 appears for the first time in print in this paper.

Related and Future Work. TNM’s GAA is only the latest of a number of attempts at combining EC with ILP. It has been preceded by, e.g., GA-SMART [11], REGAL [10], DOGMA [12] and G-NET [3], all of which similarly employ representations based on fixed-length bit strings. From a theoretical point of view, the TNM’s GAA was preceded by our binary refinement framework [16,19].
A comparison between the TNM’s GAA and some previous approaches is made in [26,27,28]. We generally concur with Tamaddoni-Nezhad and Muggleton that such systems are much less systematic in their capability of incorporating intensional background knowledge. To this we could add that the mentioned systems are also much more elaborate on the EC side, which may be a source of inspiration in the future. In our opinion, the way to go is binary refinement and not binary representations. Although we may never provably achieve completeness, owing to the stochasticity of the genetic search, we should at least be careful not to exclude potentially valuable clauses from the start. Deriving candidate clauses with complete and sound binary refinement operators, within the framework of first-order logic representations (or encodings of equivalent representational power), ensures that this is indeed the case, thereby essentially giving a theoretically principled way of combining EC with ILP. However, we are not quite there yet. The most important outstanding theoretical question is whether a fast evaluation mechanism for clauses is possible
within the binary refinement framework. Although we have a positive answer to this question [16], it requires much more background than is possible to give in this paper. Clarifying this outstanding issue remains a topic for a future paper.

Acknowledgments

Thanks to James Cussens for giving useful advice on a preliminary version of this paper. The author wishes to thank the three anonymous reviewers for their helpful comments. This paper has been completed under the most difficult of circumstances. It is therefore dedicated to myself and to those who approach the world the way I do: with an open mind and a kind heart.
References

1. É. Alphonse and C. Rouveirol. Object Identity for Relational Learning. Technical report, LRI, Université Paris-Sud, 1999. Supported by ESPRIT Framework IV through LTR ILP2.
2. É. Alphonse and C. Rouveirol. Test Incorporation for Propositionalization Methods in ILP. Technical report, LRI, Université Paris-Sud, 1999. Supported by ESPRIT Framework IV through LTR ILP2.
3. C. Anglano, A. Giordana, G.L. Bello, and L. Saitta. An Experimental Evaluation of Coevolutive Concept Learning. In J. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning, pages 19–27. Morgan Kaufmann, 1998.
4. L. Badea and M. Stanciu. Refinement Operators Can Be (Weakly) Perfect. In S. Džeroski and P. Flach, editors, Inductive Logic Programming, 9th International Workshop, ILP-99, volume 1634 of Lecture Notes in Artificial Intelligence, pages 21–32. Springer, 1999.
5. J. Cussens and A. Frisch, editors. Inductive Logic Programming – ILP 2000, Proceedings of the 10th International Conference on Inductive Logic Programming, Work-in-Progress Reports, Imperial College, UK, July 2000. Work-in-Progress Reports.
6. J. Cussens and A. Frisch, editors. Inductive Logic Programming – ILP 2000, Proceedings of the 10th International Conference on Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial Intelligence, Imperial College, UK, July 2000. Springer.
7. L. De Raedt. Attribute-Value Learning versus Inductive Logic Programming: The Missing Links. In Page [23], pages 1–8.
8. F. Divina and E. Marchiori. Knowledge Based Evolutionary Programming for Inductive Learning in First-Order Logic. In Spector et al. [25], pages 173–181.
9. K. Furukawa and T. Ozaki. On the Completion of Inverse Entailment for Mutual Recursion and its Application to Self Recursion. In Cussens and Frisch [5], pages 107–119.
10. A. Giordana and F. Neri. Search-Intensive Concept Induction. Evolutionary Computation Journal, 3(4):375–416, 1996.
11. A. Giordana and C. Sale. Learning Structured Concepts Using Genetic Algorithms. In D. Sleeman and P. Edwards, editors, Proceedings of the Ninth International Workshop on Machine Learning, pages 169–178. Morgan Kaufmann, 1992.
12. J. Hekanaho. DOGMA: A GA-based Relational Learner. In Page [23], pages 205–214.
13. K. Inoue. Induction, Abduction and Consequence-Finding. In C. Rouveirol and M. Sebag, editors, Inductive Logic Programming – ILP 2001, volume 2157 of Lecture Notes in Artificial Intelligence, pages 65–79. Springer, 2001.
14. K. Ito and A. Yamamoto. Finding Hypotheses from Examples by Computing the Least Generalization of Bottom Clauses. In S. Arikawa, editor, Proceedings of Discovery Science, volume 1532 of Lecture Notes in Artificial Intelligence, pages 303–314. Springer-Verlag, 1998.
15. P.D. Laird. Learning from Good Data and Bad. PhD thesis, Yale University, 1987.
16. F.A. Mărginean. Combinatorics of Refinement. PhD thesis, Department of Computer Science, The University of York, September 2001.
17. S. Muggleton. Inverse entailment and Progol. New Generation Computing, 13:245–286, 1995.
18. S. Muggleton. Completing Inverse Entailment. In Page [23], pages 245–249.
19. S.H. Muggleton and F.A. Mărginean. Binary Refinement. In J. McCarthy and J. Minker, editors, Workshop on Logic-Based Artificial Intelligence, College Park, Washington DC, Maryland, June 14–16, 1999. Computer Science Department, University of Maryland.
20. S-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming. Springer-Verlag, Berlin, 1997. LNAI 1228.
21. K. Ohara, N. Babaguchi, and T. Kitahashi. An Efficient Hypothesis Search Algorithm Based on Best-Bound Strategy. In Cussens and Frisch [5], pages 212–225.
22. H. Ohwada, H. Nishiyama, and F. Mizoguchi. Concurrent Execution of Optimal Hypothesis Search for Inverse Entailment. In Cussens and Frisch [6], pages 165–173.
23. D. Page, editor. Inductive Logic Programming, Proceedings of the 8th International Conference, ILP-98, volume 1446 of Lecture Notes in Artificial Intelligence. Springer, July 1998.
24. D. Page. ILP: Just Do It. In Cussens and Frisch [6], pages 25–40.
25. L. Spector, E.D. Goodman, A. Wu, W.B. Langdon, H-M. Voigt, M. Gen, S. Sen, M. Dorigo, S. Pezeshk, M.H. Garzon, and E. Burke, editors. Proceedings of the Genetic and Evolutionary Computation Conference, GECCO-2001, San Francisco, CA, July 7–11 2001. AAAI, Morgan Kaufmann.
26. A. Tamaddoni-Nezhad and S. Muggleton. A Genetic Algorithms Approach to ILP. In S. Matwin and C. Sammut, editors, Proceedings of the Twelfth International Conference on Inductive Logic Programming (ILP 2002), LNAI 2583, pages 285–300. Springer, 2002.
27. A. Tamaddoni-Nezhad and S.H. Muggleton. Searching the Subsumption Lattice by a Genetic Algorithm. In Cussens and Frisch [6], pages 243–252.
28. A. Tamaddoni-Nezhad and S.H. Muggleton. Using Genetic Algorithms for Learning Clauses in First-Order Logic. In Spector et al. [25], pages 639–646.
29. F. Torre and C. Rouveirol. Private Properties and Natural Relations in Inductive Logic Programming. Technical report, LRI, Université Paris-Sud, July 1997.
30. P.R. van der Laag. An Analysis of Refinement Operators in Inductive Logic Programming. Technical Report 102, Tinbergen Institute Research Series, 1995.
31. A. Yamamoto. Which Hypotheses Can Be Found with Inverse Entailment? In N. Lavrač and S. Džeroski, editors, Inductive Logic Programming, 7th International Workshop, ILP-97, volume 1297 of Lecture Notes in Artificial Intelligence, pages 296–308. Springer, 1997.
Improved Distances for Structured Data

Dimitrios Mavroeidis and Peter A. Flach

Department of Computer Science, University of Bristol, United Kingdom
Abstract. The key ingredient for any distance-based method in machine learning is a proper distance between individuals of the domain. Distances for structured data have been investigated for some time, but no general agreement has been reached. In this paper we use first-order terms for knowledge representation, and the distances introduced are metrics that are defined on the lattice structure of first-order terms. Our metrics are shown to improve on previous proposals in situations where feature equality is important. Furthermore, for one of the distances we introduce, we show that its metric space is isometrically embeddable in Euclidean space. This allows the definition of kernels directly from the distance, thus enabling support vector machines and other kernel methods to be applied to structured objects. An extension of the distances to handle sets and multi-sets, and some initial work for higher-order representations, is presented as well.
1 Introduction Distance-based methods (also called instance-based methods or lazy learning) [ 1] have always been popular in machine learning. Such methods utilize a distance between individuals in a domain for solving predictive and descriptive learning tasks. When individuals are described by numerical attributes, the natural choice for the distance is Euclidean distance (i.e., the norm of the vector connecting two data points in Euclidean vector space). This distance can be easily adapted to take nominal attributes into account. However, much real-world data is structured and can’t be naturally represented as attribute-value tuples. Defining distances between structured individuals means crossing the boundary between attribute-value and relational learning. This enables k-nearest neighbors, k-means clustering and other distance-based machine learning methods to be applied directly on structured data. There is a direct analogy between kernel methods and distance-based methods. Kernels embed objects in a Hilbert space and calculate the inner product in that space, which is a kind of unnormalized similarity measure. It may be fruitful to exploit this analogy to overcome some of the drawbacks of distance-based methods, such as problems dealing with noise and high dimensionality, and to define kernels directly from distances. While it is straightforward to define a distance from a kernel, it is not generally possible to define a kernel from a distance, the reason being that a metric space is a much weaker structure than a Hilbert space. In this paper we introduce a distance whose metric space is isometrically embedable in a Euclidean space, thus enabling us to define a kernel for structured data directly from the distance. Several distances (or similarity measures) for structured data have been proposed [4,9,15,18,10,8,11]. Some of these distances are pseudo-metrics, some of them don’t T. Horv´ath and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 251–268, 2003. 
© Springer-Verlag Berlin Heidelberg 2003
Dimitrios Mavroeidis and Peter A. Flach
satisfy the triangle inequality and some of them are metrics. Our approach follows the idea from [9,18] of defining distances using the lattice structure of first-order terms. The most theoretically sound approach is that of Ramon and Bruynooghe [18]. Their distance, which we will call the RB distance, is a metric that is defined for both ground and non-ground terms and is strictly order-preserving in the lattice of terms (see Section 2 for details). Our approach shares these theoretical properties; the main difference is in how the variables of a term are treated. We will show that our handling of the variables results in a distance measure that takes into account the equality between features when calculating the distance between two terms. Feature equality is important in many relational and ILP domains, such as KRK-illegal.

Our main aim in this paper is to define general and flexible distances in a theoretically sound manner. Our main motivation comes from the fact that in propositional learning various metrics have been proposed for instance-based learning (Euclidean distance, Minkowski distance, etc.); similarly, we want a general theory of distances for structured data offering a variety of distances.

The main contributions of this paper are the following. First, we present two ways of defining distances for structured data represented as first-order terms that are able to model situations where feature equality is important for the learning task. Second, we introduce a distance whose metric space is isometrically embeddable in a vector space with any Minkowski metric (when the Minkowski metric is the Euclidean metric, the vector space is a Euclidean space). This enables us to define kernels for structured data directly from the distance. Furthermore, we extend the distance in order to handle sets and multi-sets. Finally, some initial work is presented concerning the extension of the distance to handle higher-order terms.
The outline of the paper is as follows. Section 2 gives an account of the preliminary notions needed for knowledge representation, distances and inductive logic programming, and recalls the RB distance. Section 3 introduces our distances for first-order terms. Section 4 introduces distances that are isometrically embeddable in Euclidean space and shows how they can be used to define kernels for structured data. The approach is also extended to cover sets and multi-sets. Finally, Section 5 contains the concluding remarks.
2 Preliminaries

2.1 First- and Higher-Order Terms

From an Inductive Logic Programming (ILP) perspective, structured objects are represented with the use of first-order logic. The distances presented in this paper use first-order terms in order to represent individuals. We briefly recall the necessary terminology from first-order logic. The set of terms T is defined using the set of variables V and the set of functors F. A term is either a variable X or an expression f(t1, t2, ..., tn), where f/n is a functor with arity n and ti, i = 1, ..., n, are terms. In this paper we will use the Prolog notation, that is, capital letters for variables and lower case letters for functors. To illustrate how the representation of individuals as terms works, consider the following individual, which describes culinary present sets from different vendors (example
Improved Distances for Structured Data
taken from [11]). In a ground Datalog representation (i.e., all functors have arity 0), individuals are represented by sets of ground facts such as

    {present(set1,125,personal),
     cheese(set1,camembert,150,france),
     wine(set1,mouton,1988,0.75),
     wine(set1,gallo,1995,0.5),
     vineyard(gallo,famous,large,usa),
     vineyard(mouton,famous,small,france)}

As a culinary present set may contain various numbers of wines or cheeses, there doesn’t exist a propositional representation that describes the data perfectly. Using first-order terms, this individual would be described as follows:

    present(125,personal,
            [cheese(camembert,150,france)],
            [wine(description(famous,small,france),1988,0.75),
             wine(description(famous,large,usa),1995,0.5)])

We can see that in an individual-to-term representation, all the facts describing the individual are included in a single term. In the above term the cheese and the wines are represented using Prolog lists. It can be argued that this representation is not appropriate, as the elements of a list are ordered. So, in order to represent the individual more accurately we should use higher-order terms such as sets and multi-sets. The higher-order term representing the individual would be the same as the first-order term with the lists replaced by sets.

We briefly outline the ingredients of higher-order knowledge representation following [12,13]. The main concept used for knowledge representation in higher-order logic is the notion of a basic term. A formal definition of basic terms can be found in [12]. In general a basic term is defined inductively over the following three components.
1. Tuples
2. Data constructors (representing lists, trees, etc.)
3. Abstractions (representing sets, multi-sets, etc.)
Data constructors are like Prolog functors; in Prolog, functors are used to represent tuples as well. The abstractions make basic terms a much more powerful representation than first-order terms. Consider the set {1,2}. This set will be represented using basic terms as λx.if x = 1 then true else if x = 2 then true else false. Similarly, the multi-set that contains 32 occurrences of object A and 25 occurrences of object B is represented by the basic term λx.if x = A then 32 else if x = B then 25 else 0. It should be noted that, while the lattice structure of first-order terms is well-explored, this is not the case with higher-order terms. The inclusion of sets and multi-sets adds significant power to the knowledge representation.

We now summarize the generalization structure of first-order terms, over which the distances are defined. The generalization structure gives rise to a partial order which is a complete lattice. This structure was first introduced by [17] and almost all ILP algorithms make use of it. (A partial order ≤ on a set T is a binary relation that is reflexive, antisymmetric, and transitive. A complete lattice T is a partial order in which every subset of T has a least upper bound and a greatest lower bound.)
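The two abstractions above can be mimicked with ordinary closures. The following is only a minimal sketch: the helper names `set_as_abstraction` and `multiset_as_abstraction` are hypothetical and not part of the basic-term formalism of [12]; they merely show a set as a function into booleans and a multi-set as a function into natural numbers.

```python
def set_as_abstraction(elements):
    """Return the characteristic function of a finite set (hypothetical helper)."""
    members = frozenset(elements)
    return lambda x: x in members


def multiset_as_abstraction(counts):
    """Return the multiplicity function of a finite multi-set (hypothetical helper)."""
    table = dict(counts)
    return lambda x: table.get(x, 0)


# The set {1, 2} and the multi-set with 32 occurrences of A and 25 of B.
one_two = set_as_abstraction({1, 2})
bag = multiset_as_abstraction({"A": 32, "B": 25})

print(one_two(1), one_two(3))
print(bag("A"), bag("B"), bag("C"))
```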
The complete lattice of first-order terms is defined as follows. Let t1 and t2 be two first-order terms. We say that t1 is at least as general as t2 iff there exists a substitution θ from variables to terms such that t1θ = t2; if in addition t2 is not at least as general as t1, then t1 is more general than t2. This notion of generality gives rise to a partial order, which is a complete lattice. We denote the top element of this lattice (a single variable) by ⊤. The distances introduced use the notion of least general generalization (lgg). The least general generalization of two terms t1 and t2 is their least upper bound in the generality lattice. lgg(t1, t2) can be computed by an algorithm called anti-unification, which is the dual of unification.

An important concept in this paper is the notion of a depth function. A depth function is defined formally over a lattice as follows.

Definition 1 (Depth Function). We say that a function depth : V → ℝ, where V is a lattice with partial order ≤ and top element ⊤, is a depth function iff it satisfies the following conditions:
1. depth(x) = 0 ⇔ x = ⊤
2. x < y ⇒ depth(x) > depth(y)

The notion of a depth function is similar to the notion of length function in [2] and the notion of size in [18]. We can see from the definition that the depth function is 0 for the top of the lattice, and that the depth increases as we go down the lattice. In particular, the depth of a term increases when more variables get instantiated. Similarly, unifying variables increases the depth of a term. For example, in p(X, X) < p(X, Y), the two occurrences of variable X mean that the first term has greater depth than the second.

2.2 Distances

We now review material related to distances. A metric is the mathematical abstraction of the notion of distance. We say that a distance is a metric iff it satisfies the following requirements.
1. d(x, y) = 0 ⇔ x = y
2. d(x, y) = d(y, x)
3. d(x, y) ≤ d(x, z) + d(z, y)
The third property is known as the triangle inequality. As our distances are defined over a complete lattice, we want the distance to reflect this order. We say that a distance d is order-preserving iff for all x, y, z such that x < y < z we have that d(x, y) ≤ d(x, z) and d(y, z) ≤ d(x, z). A distance d is strictly order-preserving iff it is order-preserving and for all x, y such that x < y we have that d(x, y) > 0.

A widely used distance for propositional representations is the Minkowski distance. The Minkowski distance is defined for two vectors x = (x1, ..., xn), y = (y1, ..., yn) in a vector space as

$$d_p(x, y) = \Big(\sum_{i=1}^{n} |x_i - y_i|^p\Big)^{1/p},$$

where p is a natural number. Euclidean distance is a special case with p = 2.
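For concreteness, the Minkowski distance can be written in a few lines. This is a small illustrative sketch; the function name `minkowski` and the sample vectors are assumptions chosen for the example, not taken from the paper.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two equal-length numeric vectors (p >= 1)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# p = 2 gives the Euclidean distance, p = 1 the Manhattan distance.
print(minkowski((0, 0), (3, 4), p=2))
print(minkowski((0, 0), (3, 4), p=1))
```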
One of the distances defined in this paper behaves similarly to the symmetric difference distance. The symmetric difference distance is defined over sets A, B as d(A, B) = s((A ∪ B) − (A ∩ B)), where s is a size function. A function s is a size function if it outputs non-negative real numbers and if X ∩ Y = ∅ implies s(X ∪ Y) = s(X) + s(Y).

Among the other distances that have been proposed for structured data represented with first-order terms, the one that stands out is the distance proposed by Ramon and Bruynooghe [18]. The reason is that it has the theoretical properties one would demand from a distance between terms. More precisely, it is a metric, defined for ground and non-ground terms, that is strictly order-preserving in the lattice of terms.

Definition 2 (RB Distance [18]). Let A and B be two first-order terms. The RB distance is defined as

$$d_{RB}(A, B) = |size(A) - size(lgg(A, B))| + |size(B) - size(lgg(A, B))|$$

with size(A) = (F(A), V(A)). F(A) accounts for the size due to functors and V(A) accounts for the size due to variables. |·| is defined as |size(A) − size(B)| = max(size(A) − size(B), size(B) − size(A)). For a term t the F-component is defined as

$$F(t) = \begin{cases} \omega_{p,0} + \sum_{i=1}^{n} \omega_{p,i}\, F(t_i) & \text{if } t = p(t_1, \ldots, t_n) \\ 0 & \text{if } t \text{ is a variable} \end{cases}$$

Here, the $\omega_{p,j}$ are weights assigned to the j-th position of a term with functor p. The V-component is defined as

$$V(A) = \sum_{X \in vars(A)} frq(A, X)\, u(frq(A, X)),$$

where vars(A) is the set of variables in term A, frq(A, X) is the number of occurrences of variable X in term A, and u : ℕ → ℝ is a strictly monotonic and convex function.

As size is a tuple, the output of the distance will be a tuple as well. The ordering that is used considers the distance given by the F-component to be always more important than the distance given by the V-component. This means that the distance between functors is always considered to be more important than the distance between variables. In order to illustrate how this behavior can be problematic, consider the following three terms: t1 = f(a, X, X, X, X), t2 = f(a, X, Y, Z, G) and t3 = f(b, X, X, X, X). t1 and t2 share a constant, but t1 and t3 share a number of variables. Intuitively one would say that, if the number of shared variables increases, there would be a point where d(t1, t2) > d(t1, t3). However, the RB distance will always give dRB(t1, t2) < dRB(t1, t3), because it gives absolute priority to the shared constant between t1 and t2. In general the RB distance will have problems modeling situations where equality between features is important.

Example 1 (Robots). Consider a toy domain where the learning task is to classify robots as friendly or unfriendly. Data occurs as terms robot(shape head, shape body, color head, color body), where the shapes can take the values circle, square and rectangle, and the colors can take the values red, black and white. Suppose the concept is: Robot is unfriendly iff shape head = shape body and color head = color body. With this classification rule we will have that the robots t1 = robot(square, square, black, black)
and t2 = robot(circle, circle, white, white) are classified as unfriendly and the robot t3 = robot(square, rectangle, red, white) is classified as friendly. We would like a distance to output d(t1, t2) < d(t1, t3) and d(t1, t2) < d(t2, t3). The RB distance will in general output d(t1, t2) > d(t1, t3) and d(t1, t2) > d(t2, t3) (this actually depends on the weights assigned to circle, square, red, ...; however, for every weight configuration many problematic t3 can be found). It behaves this way because the distance between functors is always taken to be more important than the distance between variables. More precisely, we have lgg(t1, t2) = robot(X, X, Y, Y) and lgg(t1, t3) = robot(square, X, Y, Z). We can easily see that the distance due to functors will be smaller for t1 and t3, as they have an extra common functor (square). The fact that lgg(t1, t2) has more shared variables is taken to be less important. Clearly, feature equality is a key aspect of many relational domains. In the next section we propose metrics to overcome the limitations of the RB distance.
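The two lggs quoted in the robot example can be reproduced with a short anti-unification sketch. The term encoding is an assumption made for this illustration only: a variable is a Python string, and a constant or compound term is a tuple whose first entry is the functor name. The helpers `const` and `robot` and the generated variable names `V1`, `V2`, ... are hypothetical; the result is an lgg up to renaming of variables.

```python
from itertools import count


def is_var(t):
    """Variables are plain strings; constants and compound terms are tuples."""
    return isinstance(t, str)


def lgg(s, t):
    """Anti-unify two terms and return their least general generalization."""
    table = {}        # (subterm of s, subterm of t) -> variable standing for the pair
    fresh = count(1)

    def go(a, b):
        same_shape = (not is_var(a) and not is_var(b)
                      and a[0] == b[0] and len(a) == len(b))
        if same_shape:
            # Same functor and arity: keep the functor, generalize the arguments.
            return (a[0],) + tuple(go(x, y) for x, y in zip(a[1:], b[1:]))
        # Otherwise replace the pair by a variable. The same pair always gets
        # the same variable, which is how repeated features are recorded.
        if (a, b) not in table:
            table[(a, b)] = "V%d" % next(fresh)
        return table[(a, b)]

    return go(s, t)


def const(name):
    return (name,)


def robot(*features):
    return ("robot",) + tuple(const(f) for f in features)


t1 = robot("square", "square", "black", "black")
t2 = robot("circle", "circle", "white", "white")
t3 = robot("square", "rectangle", "red", "white")

print(lgg(t1, t2))   # corresponds to robot(X, X, Y, Y)
print(lgg(t1, t3))   # corresponds to robot(square, X, Y, Z)
```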
3 Distances for First-Order Terms

The definition of a depth function for first-order terms is fairly straightforward, apart from the case where we have multiple occurrences of a variable in a term. We address this problem as follows. Whenever a variable has a single occurrence, it doesn’t increase the depth of a term. When a variable has n > 1 occurrences, the depth function behaves as if there were only one occurrence of the variable and n − 1 ‘semi-constants’. That is, multiple occurrences of a variable behave like constants with a user-defined weight $\omega_{var}$, with $\omega_{var}$ smaller than all the other user-defined weights.

Definition 3 (Aggregation Depth). We define a function depth_A : Terms → ℝ such that
1. If we have the term t = X then depth_A(t) = 0
2. If we have the term t = p(t1, ..., tn) then

$$depth_A(t) = \omega_{p,0} + \sum_{k=1}^{m} \omega_{p,i_k}\, depth_A(t_{i_k}) + \sum_{\lambda=1}^{n-m} \frac{n_{var(i_\lambda)} - 1}{n_{var(i_\lambda)}}\, \omega_{p,i_\lambda}\, \omega_{var}$$

where the $i_k$ are a renaming of the indices i = 1, ..., n for which $t_{i_k}$ isn’t a variable, and the $i_\lambda$ are a renaming of the indices i = 1, ..., n for which $t_{i_\lambda}$ is a variable that has $n_{var(i_\lambda)}$ occurrences in term t. The weights $\omega_{p,j} > 0$ correspond to the weights assigned to the j-th position of a term with functor p/n. Finally, we take all the weights corresponding to variables to be smaller than the weights corresponding to functors, that is $\omega_{var} < \omega_{p,j}$.

The weights assigned to the functor positions in Definition 3 have the same role as in the RB distance. They allow for more flexibility when configuring the depth of a term. We will see later, when defining the distance, that assigning a larger weight to a position of a term increases its relative importance in the calculation of the distance. For example, if for a term p(a, b) we want a difference in the first argument to be twice as important as a difference in the second argument, we should set $\omega_{p,1}$ to twice the value of $\omega_{p,2}$.

Theorem 1. The Aggregation Depth is a depth function.
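One possible reading of Definition 3 in code is sketched below. The assumptions are ours: terms are nested tuples with strings as variables, all weights are supplied through a hypothetical `weight(functor, arity, position)` callback that defaults to 1, and the occurrence counts of variables are taken over the whole top-level term. For the term f(a, b, g(d, X), X, Y) with unit functor weights and a variable weight of 0.5, this reading gives a depth of 5.5.

```python
from collections import Counter


def is_var(t):
    """Variables are plain strings; constants and compound terms are tuples."""
    return isinstance(t, str)


def variables(t):
    """Yield every variable occurrence of a term, with repetitions."""
    if is_var(t):
        yield t
    else:
        for arg in t[1:]:
            yield from variables(arg)


def unit_weight(functor, arity, position):
    """Hypothetical default: every functor-related weight is 1."""
    return 1.0


def aggregation_depth(term, omega_var=0.5, weight=unit_weight):
    """Aggregation Depth under the stated assumptions."""
    if is_var(term):
        return 0.0
    occurrences = Counter(variables(term))  # counted over the whole term

    def go(t):
        functor, args = t[0], t[1:]
        arity = len(args)
        total = weight(functor, arity, 0)
        for j, arg in enumerate(args, start=1):
            if is_var(arg):
                n = occurrences[arg]
                # Each occurrence contributes (n - 1) / n of a 'semi-constant'.
                total += (n - 1) / n * weight(functor, arity, j) * omega_var
            else:
                total += weight(functor, arity, j) * go(arg)
        return total

    return go(term)


t = ("f", ("a",), ("b",), ("g", ("d",), "X"), "X", "Y")
print(aggregation_depth(t, omega_var=0.5))
```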
The proof of this theorem, as well as other proofs that are not included in the appendix, can be found in [14]. In the Aggregation Depth we can see that, if a variable has only one occurrence in the term, then it makes no contribution to its depth ($\frac{n_{var} - 1}{n_{var}} = 0$). When we have multiple occurrences of a variable, it is intuitive to consider the first occurrence as contributing 0 and all the others as ‘semi-constants’. The contribution of these variables as ‘semi-constants’ is aggregated and configured by $\omega_{var}$.

Example 2. In order to illustrate the use of the depth function defined above, consider the term t = f(a, b, g(d, X), X, Y), with all functor-related weights set to 1 and $\omega_{var}$ set to 0.5. Then we have

$$depth(t) = 1 + 1 \cdot 1 + 1 \cdot 1 + 1 \cdot \big(1 + 1 \cdot 1 + \tfrac{1}{2} \cdot 1 \cdot 0.5\big) + \tfrac{1}{2} \cdot 1 \cdot 0.5 + \tfrac{1-1}{1} \cdot 1 \cdot 0.5 = 5.5.$$

Another way of dealing with the problem of multiple occurrences of variables is to take the occurrence that would contribute the most to the depth function (this depends on the weight configuration) to contribute 0, and the other occurrences to contribute as ‘semi-constants’ with weight $\omega_{var}$. A depth function that follows this intuition is the following.

Definition 4 (Minimal Depth). We define a function depth_M : Terms → ℝ such that
1. If we have term t = X then depth_M(t) = 0
2. If we have term t = p(t1, ..., tn) then

$$depth_M(t) = \omega_{p,0} + \sum_{k=1}^{m} \omega_{p,i_k}\, depth_M(t_{i_k}) + \sum_{\lambda=1}^{n-m} \omega_{p,i_\lambda}\, \omega_{var}$$

where the $i_k$ are a renaming of the indices i = 1, ..., n for which $t_{i_k}$ isn’t a variable, and the $i_\lambda$ are a renaming of the indices i = 1, ..., n for which $t_{i_\lambda}$ is a variable. Now for every variable with n occurrences in term t, one (of these n) will contribute maximally to the depth function. We take $\omega_{var}$ for this occurrence to be 0. Again we stipulate that $\omega_{var} < \omega_{p,j}$.

Theorem 2. The Minimal Depth is a depth function.

Example 3. Now consider the term t = f(a, b, X, X) with all the functor-related weights set to 1 apart from $\omega_{f,4} = 2$, and $\omega_{var}$ set to 0.5. Then we have for the depth function defined above

$$depth(t) = 1 + 1 \cdot 1 + 1 \cdot 1 + 1 \cdot 0.5 + 2 \cdot 0 = 3.5.$$

We can see that the variable at position f,4 would contribute more than the variable at position f,3. So, in the computation of the depth function the variable at position f,4 is taken to contribute 0 and the variable at position f,3 contributes $\omega_{var}$.

If we define the distance between two terms using depth in the intuitive way, as the sum of the differences in depth between each term and their lgg, the result is a metric only if we use Minimal Depth (Theorem 3). However, using an alternative definition we can obtain a metric using Aggregation Depth as well (Theorem 4).

Theorem 3. The function between first-order terms x, y defined as

$$d_l(x, y) = (depth(x) - depth(lgg(x, y))) + (depth(y) - depth(lgg(x, y))),$$

where depth is the Minimal Depth depth_M, is a metric which is strictly order-preserving in the lattice of terms.
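The Minimal Depth and the distance of Theorem 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: terms are nested tuples with strings as variables, weights come from a hypothetical `weight(functor, arity, position)` callback, and "contributes maximally" is read as "has the largest product of position weights along its path". The lgg is passed in by the caller so that the block stays short.

```python
def is_var(t):
    """Variables are plain strings; constants and compound terms are tuples."""
    return isinstance(t, str)


def unit_weight(functor, arity, position):
    """Hypothetical default: every functor-related weight is 1."""
    return 1.0


def minimal_depth(term, omega_var=0.5, weight=unit_weight):
    """Minimal Depth: per variable, the heaviest occurrence contributes 0."""
    if is_var(term):
        return 0.0
    functor_part = 0.0
    var_paths = {}  # variable -> accumulated position weights of its occurrences

    def walk(t, scale):
        nonlocal functor_part
        functor, args = t[0], t[1:]
        arity = len(args)
        functor_part += scale * weight(functor, arity, 0)
        for j, arg in enumerate(args, start=1):
            s = scale * weight(functor, arity, j)
            if is_var(arg):
                var_paths.setdefault(arg, []).append(s)
            else:
                walk(arg, s)

    walk(term, 1.0)
    var_part = sum(sum(ws) - max(ws) for ws in var_paths.values())
    return functor_part + omega_var * var_part


def d_l(x, y, lgg_xy, depth=minimal_depth):
    """Distance of Theorem 3, given the lgg of x and y."""
    g = depth(lgg_xy)
    return (depth(x) - g) + (depth(y) - g)


def example_weight(functor, arity, position):
    # Weights of Example 3: all 1 except the fourth position of f.
    return 2.0 if (functor, position) == ("f", 4) else 1.0


t = ("f", ("a",), ("b",), "X", "X")
print(minimal_depth(t, omega_var=0.5, weight=example_weight))
print(d_l(("p", ("a",), ("b",)), ("p", ("c",), ("d",)), ("p", "X", "Y")))
```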
If we consider the Aggregation Depth, the above distance doesn’t satisfy the triangle inequality.

Theorem 4. The function between first-order terms x, y defined as

$$d_e(x, y) = 1 - \frac{1}{c^{\,d_l(x, y)}}$$

with an appropriate choice for the constant c, and with $d_l$ defined as in Theorem 3 using either Aggregation Depth or Minimal Depth, is a metric which is strictly order-preserving in the lattice of first-order terms.

The constant c is configured such that $c^{\min_{x \neq y} d_l(x, y)} > 2$. This ensures that the triangle inequality holds: every distance between distinct terms then lies strictly between 1/2 and 1, so the sum of any two such distances exceeds any third. This requirement is not very restrictive, and once we have found one constant c we can choose any c′ > c depending on the way we want our distance to behave.

It can be observed from the definitions of the distances that, as we increase the weight of a position of a term, its relative importance in the calculation of the distance increases (as in [18]). To demonstrate this consider the following terms: t1 = p(a, b) and t2 = p(c, d). Their distance will be $d_l(t_1, t_2) = \omega_{p,1}\omega_{a,0} + \omega_{p,2}\omega_{b,0} + \omega_{p,1}\omega_{c,0} + \omega_{p,2}\omega_{d,0}$. If instead we have the terms t1 = p(a, b) and t2 = p(a, d), their distance will be $d_l(t_1, t_2) = \omega_{p,2}\omega_{b,0} + \omega_{p,2}\omega_{d,0}$. We see that by increasing $\omega_{p,1}$ we increase the relative importance of the first position of term t1 in the calculation of the distance.

We now demonstrate how our handling of multiple occurrences of variables enables the distance to handle situations where we want the equality between features to be taken into account. Consider two terms t1 = p(a, a, a, c) and t2 = p(b, b, b, d), with lgg(t1, t2) = p(X, X, X, Y). The distance of the two terms will be $d_l(t_1, t_2) = depth(p(a, a, a, c)) + depth(p(b, b, b, d)) - 2\,depth(p(X, X, X, Y))$. The behavior of the distance with respect to $\omega_{var}$ is the following: as we increase the weight $\omega_{var}$, depth(p(X, X, X, Y)) increases and thus $d_l(t_1, t_2)$ decreases. Thus, by increasing the value of the weight $\omega_{var}$ we increase the relative importance of feature equality in the calculation of the distance. The same configuration rules hold for $d_e$ as well.

Example 4 (Robots Again). Recall that robots t1 = robot(square, square, black, black) and t2 = robot(circle, circle, white, white) are unfriendly and robot t3 = robot(square, rectangle, red, white) is friendly. We would like a distance to output d(t1, t2) < d(t1, t3) and d(t1, t2) < d(t2, t3). The distance introduced above can be configured to output the desired distances. The configuration can be achieved by setting a high value for $\omega_{var}$. In general, with $\omega_{var}$ we can configure the relative importance of feature equality in the calculation of the distance. Although our distance improves the modeling of situations where feature equality is important for the learning task, it can’t model them perfectly. This comes from the fact that $\omega_{var}$ is required to be smaller than the other weights. In the above example this limitation results in a robot t4 = robot(square, rectangle, black, white) giving d(t1, t4) < d(t1, t2), although the robot t4 is classified as friendly.
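The robot example can be checked by hand under one simple configuration. The numbers below are an assumption made purely for illustration and are not taken from the paper: every functor-related weight is set to 1, so that each constant has depth 1 and the constraint on the variable weight reads $\omega_{var} < 1$. With these values the Minimal Depth and the Aggregation Depth coincide on all terms involved.

```latex
\begin{align*}
depth(t_1) = depth(t_2) = depth(t_3) = depth(t_4) &= 1 + 4 \cdot 1 = 5 \\
depth(lgg(t_1, t_2)) = depth(robot(X, X, Y, Y)) &= 1 + 2\,\omega_{var} \\
d_l(t_1, t_2) &= 5 + 5 - 2\,(1 + 2\,\omega_{var}) = 8 - 4\,\omega_{var} \\
depth(lgg(t_1, t_3)) = depth(robot(square, X, Y, Z)) &= 2,
  \qquad d_l(t_1, t_3) = 10 - 4 = 6 \\
depth(lgg(t_2, t_3)) = depth(robot(X, Y, Z, white)) &= 2,
  \qquad d_l(t_2, t_3) = 10 - 4 = 6 \\
depth(lgg(t_1, t_4)) = depth(robot(square, X, black, Y)) &= 3,
  \qquad d_l(t_1, t_4) = 10 - 6 = 4
\end{align*}
```

Under these assumed weights, $d_l(t_1, t_2) < d_l(t_1, t_3) = d_l(t_2, t_3)$ holds exactly when $\omega_{var} > 1/2$, which is the sense in which a high $\omega_{var}$ yields the desired ordering. At the same time $\omega_{var} < 1$ forces $d_l(t_1, t_2) = 8 - 4\,\omega_{var} > 4 = d_l(t_1, t_4)$, which is the limitation described for the friendly robot $t_4$.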
If we assign $\omega_{var}$ larger values than the other weights we can have d(t1, t4) > d(t1, t2); however, the distance would then not be guaranteed to be a metric.

We have seen that our distance has the same theoretical properties as the distance in [18], as it is a metric defined for ground and non-ground terms that is strictly order-preserving in the lattice of terms. The difference is in how we treat multiple occurrences of variables. Moreover, our distance is flexible, as we have weights that configure the relative importance of each position of the term in the calculation of the distance. We also have $\omega_{var}$, which configures the relative importance of feature equality in the calculation of the distance. Since feature equality is important in many relational domains, we would expect a distance that is able to model such situations to improve the accuracy of distance-based methods.
4 Distances that Are Isometrically Embeddable in Euclidean Spaces

A metric space can be embedded isometrically either in a Euclidean space or a pseudo-Euclidean (Minkowski) space [7]. In this section we will define metrics that can always be embedded isometrically in a Euclidean space. Such an embedding enables us to define kernels for structured data, which allow support vector machines and other kernel methods to be applied to structured data.

In a Euclidean space the connection between inner products and distances is well-established. If we consider vectors $x_i$ in a Euclidean space, with d the Euclidean distance, then we can write

$$d^2(x_i, x_j) = \langle x_i - x_j, x_i - x_j \rangle = d^2(x_i, 0) + d^2(x_j, 0) - 2 \langle x_i, x_j \rangle$$

where 0 is the origin of our Euclidean space, and $\langle x, y \rangle$ denotes the inner product of vectors x and y. From this equation we can derive:

$$\langle x_i, x_j \rangle = -\frac{1}{2}\big[d^2(x_i, x_j) - d^2(x_i, 0) - d^2(x_j, 0)\big] \qquad (1)$$
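Equation (1) can be checked numerically on ordinary vectors. The sketch below is only an illustration: the function names `euclidean` and `inner_product_from_distances` and the sample vectors are assumptions introduced for the example. It recovers the inner product of two vectors from three distances alone.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def inner_product_from_distances(d_xy, d_x0, d_y0):
    """Right-hand side of Equation (1), using distances only."""
    return -0.5 * (d_xy ** 2 - d_x0 ** 2 - d_y0 ** 2)


x, y, origin = (1.0, 2.0), (3.0, -1.0), (0.0, 0.0)
k = inner_product_from_distances(euclidean(x, y),
                                 euclidean(x, origin),
                                 euclidean(y, origin))
direct = sum(a * b for a, b in zip(x, y))
print(k, direct)
```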
So, if we achieve an isometric embedding of our metric space in a Euclidean space, we can use the inner product in (1) as the kernel.

4.1 Distances on First-Order Terms Using Elementary Decompositions

In order to perform the embedding procedure, we will use slightly different depth functions than in Section 3. The difference is in the way multiple occurrences of variables are treated. Before we define the new depth functions we first define the notion of elementary decomposition. The elementary decomposition is a set that contains all the necessary information about a term.

Definition 5 (Elementary Decomposition). Consider a first-order term t. Then the elementary decomposition $D_E(t)$ of term t is a set of the form

$$D_E(t) = \Big(\bigcup_i \{(f_i, \text{position of } f_i \text{ in term})\}\Big) \cup \Big(\bigcup \{(\text{pair}, \text{position of pair in term})\}\Big)$$
where the $f_i$ are the functors of the term t with their positions in the term t, and pair accounts for all the pairs of identical sub-terms of t with their positions in the term t.

The following examples illustrate the notation for the positions of functors and pairs. The element $(a, p^4_2(f^2_1(a^0_0)))$ of the elementary decomposition of a term t denotes that the functor a is a functor with arity 0, which is in the first position of a term with functor f and arity 2, which is in the second position of a term with functor p and arity 4; so the term is of the form t = p(_, f(a, _), _, _). The element $(pair, (p^4_2, p^4_4))$ of the elementary decomposition of a term t denotes that the second and the fourth position of a term with functor p and arity 4 hold the same sub-term; so the term is of the form t = p(_, s, _, s).

Example 5. Let t be the term p(a, f(b, c), X, X). The elementary decomposition of this term is

$$D_E(t) = \{(p, p^4_0),\ (a, p^4_1(a^0_0)),\ (f, p^4_2(f^2_0)),\ (b, p^4_2(f^2_1(b^0_0))),\ (c, p^4_2(f^2_2(c^0_0))),\ (pair, (p^4_3, p^4_4))\}.$$

Example 6. Consider the term t = p(a, a, a). The elementary decomposition of this term is

$$D_E(t) = \{(p, p^3_0),\ (a, p^3_1(a^0_0)),\ (a, p^3_2(a^0_0)),\ (a, p^3_3(a^0_0)),\ (pair, (p^3_1, p^3_2)),\ (pair, (p^3_1, p^3_3)),\ (pair, (p^3_2, p^3_3))\}.$$

The key property of elementary decompositions is that they can be used to define a partial order that is equivalent to θ-subsumption.

Theorem 5. Let x and y be two first-order terms; then the order defined by $x \leq y \Leftrightarrow D_E(y) \subseteq D_E(x)$ is equivalent to the order introduced by θ-substitution.

Theorem 6. $D_E(lgg(x, y)) = D_E(x) \cap D_E(y)$.

The proofs of these and subsequent theorems are in the Appendix. From the theorems above we can see that there is a direct analogy between the lgg of two terms and the intersection of two elementary decompositions. This enables us to define depth functions for terms using their elementary decompositions. We can define such depth functions as follows.
Definition 6 (Elementary Depth). We define the functions depth[p] : T → ℝ, where T is the domain of first-order terms and p is a positive integer parameter, such that

$$depth[p](t) = \Big(\sum_{s \in D_E(t)} (depth'(s))^p\Big)^{1/p}$$

where depth′ is any positive function that accounts for the depth assigned to the elements of the elementary decomposition.
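The elementary decomposition and the Elementary Depth can be sketched in code as follows. The encoding is an assumption of this illustration: terms are nested tuples with strings as variables, a position is a tuple of `(functor, arity, index)` steps, and a pair element is recorded for every two distinct argument positions that hold syntactically identical sub-terms. The function names and the constant default for depth′ are hypothetical.

```python
from itertools import combinations


def is_var(t):
    """Variables are plain strings; constants and compound terms are tuples."""
    return isinstance(t, str)


def functor_elements(term, prefix=()):
    """Yield (functor, position) for every functor occurrence of the term."""
    if is_var(term):
        return
    functor, args = term[0], term[1:]
    yield (functor, prefix + ((functor, len(args), 0),))
    for j, arg in enumerate(args, start=1):
        yield from functor_elements(arg, prefix + ((functor, len(args), j),))


def slots(term, prefix=()):
    """Yield (position, sub-term) for every argument position, at any depth."""
    if is_var(term):
        return
    functor, args = term[0], term[1:]
    for j, arg in enumerate(args, start=1):
        position = prefix + ((functor, len(args), j),)
        yield position, arg
        yield from slots(arg, position)


def elementary_decomposition(term):
    elements = set(functor_elements(term))
    for (p1, s1), (p2, s2) in combinations(list(slots(term)), 2):
        if s1 == s2:
            elements.add(("pair", tuple(sorted((p1, p2)))))
    return frozenset(elements)


def elementary_depth(term, p=2, elem_depth=lambda s: 1.0):
    """Elementary Depth with a user-supplied positive depth' (here constant 1)."""
    total = sum(elem_depth(s) ** p for s in elementary_decomposition(term))
    return total ** (1.0 / p)


example5 = ("p", ("a",), ("f", ("b",), ("c",)), "X", "X")
example6 = ("p", ("a",), ("a",), ("a",))
print(len(elementary_decomposition(example5)))
print(len(elementary_decomposition(example6)))
```

On this encoding, intersecting the decompositions of p(a, a) and p(b, b) yields the decomposition of p(X, X), in line with Theorem 6.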
Theorem 7. The Elementary Depth is a depth function.

The role of depth′ in Definition 6 is to configure the relative importance of the elements of the elementary decomposition in the calculation of the distance. In contrast with the distances defined in Section 3, there are no constraints on the configuration of the relative importance of feature equality.

Now we can move on and define a distance that uses the Elementary Depth. In analogy to the way we defined a distance between terms as their difference of depths to their lgg in Section 3, we now define the distance between two terms as the difference of the depths of the two elementary decompositions to the elementary decomposition of their lgg. Theorem 6 ensures that the intersection of two elementary decompositions is the elementary decomposition of their lgg.

Theorem 8 (Elementary Distance). The function d[p] : T × T → ℝ, where T is the domain of first-order terms, defined as

$$d[p](x, y) = \Big((depth[p](x))^p + (depth[p](y))^p - 2\,(depth[p](lgg(x, y)))^p\Big)^{1/p}$$

where depth[p](x) is the Elementary Depth, is a metric.

As the Elementary Distance is defined for first-order terms, we would want it to reflect the natural lattice ordering of first-order terms.

Theorem 9. The Elementary Distance is order-preserving in the lattice of terms.

Theorem 10. The Elementary Distance is strictly order-preserving in the lattice of terms.

The Elementary Distance of two terms is actually the symmetric difference distance between their elementary decompositions (remember that the elementary decompositions are sets). This follows from the fact that the Elementary Depth is a size function (see Section 2.2 for definitions of symmetric difference distance and size function) and we showed (in the context of proving that the Elementary Distance is a metric) that the Elementary Distance between two terms x and y is equal to $depth((D_E(x) \cup D_E(y)) - (D_E(x) \cap D_E(y)))$.

We continue by demonstrating that the metric space of terms (T, d[p]), where d[p] is the Elementary Distance, can be isometrically embedded in a vector space $(E, d_p)$. Furthermore, we will show that, depending on the choice of p, $d_p$ is the corresponding Minkowski metric. The embedding procedure is the following:
1. The first individual represented by term t is embedded in a vector space with coordinates (x1, x2, ..., xn), where each xi corresponds to an element pi of the elementary decomposition of t and xi = depth′(pi) (depth′ is defined in Definition 6).
2. The second individual represented by term s is embedded in a vector space with coordinates (x1, ..., xm). There exist common coordinates (xi1, ..., xik) (or no common coordinates) of s and t, where the xij correspond to the elementary decomposition of lgg(s, t). We will ‘glue’ the two vector spaces together using
the common coordinates, such that the coordinates of term t in this common vector space will be (x1, x2, ..., xn, 0, ..., 0) (m − k zeros). The first n coordinates of term s in the common space are zero except for the common coordinates xij of s and t, which stay in the same positions as for t. The remaining m − k coordinates are filled by the non-common coordinates.
3. For embedding the k-th individual, after we have already embedded k − 1, we follow the same procedure, glueing the common coordinates of the k-th individual with all the others.

The procedure above will be able to embed all our individuals, regardless of the size of our training set. The ability to move from step k − 1 to step k comes from the fact that the first-order terms define a complete lattice. This guarantees that we will be able to find the common coordinates of any set of terms. Now we will show that this procedure actually is an isometric embedding in a vector space with any Minkowski metric.

Theorem 11. If we have a function $f : (T, d[p]) \to (E, d_p)$ that follows the embedding procedure defined above, then f is an isometric function. Here, (T, d[p]) is the space of terms with the Elementary Distance, and $(E, d_p)$ is the vector space with the Minkowski metric $d_p$.

With p = 2 the Minkowski metric becomes the Euclidean metric. In this case we can define a kernel between two individuals with the use of Equation (1). Note that the origin (0) of our vector space is the top of the lattice of terms. This kernel is the kernel induced by our distance metric.

4.2 Extension to Sets and Multi-sets

As we have mentioned earlier, although first-order terms provide a powerful representation, they can’t represent some important data types such as sets and multi-sets. It is easy to extend the notion of elementary decomposition to handle sets and multi-sets. The elementary decomposition of a set is defined as follows.

Definition 7 (Elementary Decomposition for Sets). Consider a set t = {t1, ..., tn}, where the ti are terms; then we define $D_E(t) = \bigcup_{i=1}^{n} D_E(t_i)$.

Example 7. Consider the set t = {p(a, b), p(a, c)}. The elementary decomposition of t is

$$D_E(t) = D_E(p(a, b)) \cup D_E(p(a, c)) = \{(p, p^2_0),\ (a, p^2_1(a^0_0)),\ (b, p^2_2(b^0_0)),\ (c, p^2_2(c^0_0))\}.$$

The Elementary Depth and the Elementary Distance can be defined for sets in a similar way as for first-order terms. By defining the elementary decomposition of sets in this way we achieve a desirable behavior of the Elementary Depth.

Example 8. Consider the set s1 = {p(a, b)}, and suppose we add the element p(a, c), creating the set s2 = {p(a, b), p(a, c)}. The main question is how much the depth of s1 should be increased by the addition of the element p(a, c). By using the elementary decomposition of sets, the depth of set s2 increases only because of the element $(c, p^2_2(c^0_0))$, as only this element of the elementary decomposition of p(a, c) is not already in the elementary decomposition $D_E(s_1)$ of the first set.
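The set extension, the Elementary Distance and the induced kernel can be illustrated together on Examples 7 and 8. This is a sketch under assumptions of our own: the decomposition elements are written out by hand as labelled tuples (the string labels are hypothetical shorthand for the position notation), depth′ is taken to be constantly 1, p = 2, and the distance is computed through the symmetric-difference characterization stated in Section 4.1. The distance to the top element uses the empty decomposition.

```python
# Hand-encoded decomposition elements of p(a, b) and p(a, c).
P = ("p", "p/2:0")
A = ("a", "p/2:1 > a/0:0")
B = ("b", "p/2:2 > b/0:0")
C = ("c", "p/2:2 > c/0:0")

d_pab = frozenset({P, A, B})
d_pac = frozenset({P, A, C})


def set_decomposition(decompositions):
    """Definition 7: the decomposition of a set is the union over its members."""
    return frozenset().union(*decompositions)


def elementary_distance(dx, dy, p=2, elem_depth=lambda s: 1.0):
    """Elementary Distance computed on the symmetric difference of decompositions."""
    return sum(elem_depth(s) ** p for s in dx ^ dy) ** (1.0 / p)


def induced_kernel(dx, dy, elem_depth=lambda s: 1.0):
    """Kernel obtained from Equation (1) with p = 2 and the top element as origin."""
    top = frozenset()
    d_xy = elementary_distance(dx, dy, 2, elem_depth)
    d_x0 = elementary_distance(dx, top, 2, elem_depth)
    d_y0 = elementary_distance(dy, top, 2, elem_depth)
    return -0.5 * (d_xy ** 2 - d_x0 ** 2 - d_y0 ** 2)


s1 = set_decomposition([d_pab])           # the set {p(a, b)}
s2 = set_decomposition([d_pab, d_pac])    # the set {p(a, b), p(a, c)}

print(sorted(s2))
print(elementary_distance(s1, s2))
print(induced_kernel(s1, s2))
```

With a constant depth′ of 1, the kernel value is simply the number of shared decomposition elements, which matches the reading of the intersection as structural similarity.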
We can say that the Elementary Depth takes the ‘structural average’ of the elements of the set. This is done as we average over the contribution of the common elements of the elementary decompositions of $p(a, b)$ and $p(a, c)$ and then add the contribution of the different elements. Furthermore, as the Elementary Distance of two sets calculates the distance of their elementary decompositions to their intersection, we want the intersection of the elementary decompositions of two sets to give a measure of their structural similarity (as the lgg of two terms provides the structural similarity of two terms). We can see that if we have two sets of terms $s_1 = \{t_1, \ldots, t_n\}$ and $s_2 = \{t'_1, \ldots, t'_m\}$ then the intersection of the elementary decompositions of these two sets, $D_E(s_1) \cap D_E(s_2)$, will be equal to the elementary decomposition of the cross-product lgg $\{lgg(t_1, t'_1), lgg(t_1, t'_2), \ldots, lgg(t_1, t'_m), lgg(t_2, t'_1), \ldots, lgg(t_n, t'_m)\}$ (remember that for first-order terms we have $D_E(lgg(t_1, t_2)) = D_E(t_1) \cap D_E(t_2)$). This behavior is desirable, as in general we want the lgg (structural similarity) of two sets to gather the most specific information after performing a cross-product lgg.

Similar to the way we defined the elementary decomposition for sets, we now define the elementary decomposition of multi-sets.

Definition 8 (Elementary Decomposition for Multi-sets). Consider a multi-set $t = \{t_1 \to N_1, \ldots, t_n \to N_n\}$, where the $t_i$ are terms and the $N_i$ natural numbers; then we define $D_E(t) = \bigcup_{i=1}^{n} \{D_E(t_i) \to N_i\}$. Here, $\cup$ stands for multi-set union, and $\{D_E(t_i) \to N_i\}$ denotes the multi-set with every element of $D_E(t_i)$ mapped to $N_i$.

Example 9. Consider the multi-set $t = \{p(a, b) \to 1, p(a, c) \to 2\}$. The elementary decomposition of $t$ will be $D_E(t) = \{D_E(p(a, b)) \to 1\} \cup \{D_E(p(a, c)) \to 2\} = \{(p, p^2_0) \to 2, (a, p^2_1(a^0_0)) \to 2, (b, p^2_2(b^0_0)) \to 1, (c, p^2_2(c^0_0)) \to 2\}$.

In the case of multi-sets we cannot apply the Elementary Depth and Elementary Distance directly.
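Definition 8 and Example 9 can be illustrated with Python's `collections.Counter`. This is a sketch under one stated assumption: the multiplicities in Example 9 are consistent with reading multi-set union as the element-wise maximum of counts (which is what `Counter`'s `|` operator computes), and intersection as the element-wise minimum. The element encoding and function name are hypothetical.

```python
from collections import Counter


def decomposition_of_multiset(members):
    """members: iterable of (decomposition_set, multiplicity) pairs.

    Every element of a member's decomposition is mapped to that member's
    multiplicity, and the results are combined by multi-set union
    (assumed here to be the element-wise maximum of counts).
    """
    result = Counter()
    for elements, multiplicity in members:
        result |= Counter({element: multiplicity for element in elements})
    return result


# Hand-written decompositions of p(a,b) and p(a,c); labels are illustrative.
D_pab = {("p", "p/2@0"), ("a", "p/2@1"), ("b", "p/2@2")}
D_pac = {("p", "p/2@0"), ("a", "p/2@1"), ("c", "p/2@2")}

# The multi-set {p(a,b) -> 1, p(a,c) -> 2} from Example 9.
D_t = decomposition_of_multiset([(D_pab, 1), (D_pac, 2)])

# Multi-set intersection (element-wise minimum) with a second multi-set.
D_u = decomposition_of_multiset([(D_pab, 3)])
common = D_t & D_u
```

Under that reading, `D_t` carries multiplicity 2 for the p, a and c elements and 1 for the b element, as in Example 9.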
We can extend the Elementary Depth to handle multi-sets if we treat a multi-set such as $\{a \to 2\}$ as a ‘set’ with multiple occurrences, $\{a, a\}$. With this change the Elementary Depth and Elementary Distance can be applied to multi-sets. The Elementary Depth of multi-sets will have the same behavior as the Elementary Depth of sets, in the sense that we again take a ‘structural average’ of the elements of the multi-set. The (multi-set) intersection of two elementary decompositions of multi-sets will provide a good measure of the structural similarity of two multi-sets, as the intersection of two elementary decompositions of multi-sets will be equal to the elementary decomposition of the cross-product lgg of the two multi-sets.

4.3 Discussion

The distance introduced in this section is a theoretically sound approach as it is a metric that is strictly order-preserving. It is a general and flexible approach as we can assign any depth to the elements of the elementary decomposition. Moreover, our distance can be considered as a way of defining Minkowski distances for structured data, as our metric space is isometrically embeddable in a vector space with any Minkowski metric.
In a general framework the properties of generality and flexibility are very important as they increase the applicability of the distance. The distance introduced in this section can model situations where feature equality is important for the learning task. The relative importance of the equality between features can be configured with the depth that is assigned to the elements of the elementary decomposition that represent the pairs of a term (pair, position of pair in term). Compared to the distance defined in Section 3, we do not have any restrictions on the configuration of the relative importance of feature equality.

As our metric space is isometrically embeddable in a vector space with any Minkowski distance, taking the Minkowski distance with $p = 2$ makes our metric space isometrically embeddable in Euclidean space. Thus, we can define a kernel directly with the use of the distance. In the general case it is not possible to isometrically embed a metric space in a Euclidean space, and this has led researchers to try to define kernels in pseudo-Euclidean (Minkowski) spaces [16]. There have been other approaches that define kernels for structured data directly [6]. In contrast, we first define a distance and then use it for defining kernels.

4.4 Extension to Abstractions

In order to extend the applicability of the distance we should extend it to handle higher-order representations. The main problem with higher-order terms is the handling of abstractions. In this section we suggest how the elementary distance can be extended to cover abstractions. We will define an elementary decomposition for abstractions that behaves in a similar manner to the elementary decomposition of sets and multi-sets. Consider the abstraction $\lambda x.$ if $x = s_1$ then $t_1$ else if $x = s_2$ then $t_2$ else ... else $s_0$, with $s_i$ and $t_i$ first-order terms.
In order to define an elementary decomposition we have to define a kind of union and intersection for the elementary decompositions of abstractions, similar to set and multi-set union and intersection.

Definition 9. We define operations $\cup_a$ and $\cap_a$ as follows.
– $z \in A \cup_a B \Leftrightarrow z =$ (if $x = a$ then $A' \cup B'$)
– $z \in A \cap_a B \Leftrightarrow z =$ (if $x = a$ then $A' \cap B'$)
with (if $x = a$ then $A'$) $\in A$ and (if $x = a$ then $B'$) $\in B$. ($\cup$ and $\cap$ are standard set union and intersection.)

Definition 10 (Elementary Decomposition for Abstractions). Consider an abstraction $t = \lambda x.$ if $x = s_1$ then $t_1$ else if $x = s_2$ then $t_2$ else ... else $s_0$, with $s_i$ and $t_i$ first-order terms. We define the elementary decomposition $D_E(t) = \cup_a$ (if $x = D_E(s_i)$ then $D_E(t_i)$), where (if $x = D_E(s_i)$ then $D_E(t_i)$) denotes the abstraction where every element of $D_E(s_i)$ is mapped to $D_E(t_i)$.

The purpose of defining $\cup_a$ and $\cap_a$ in this way is to ensure that the intersection $\cap_a$ provides a good measure of the structural similarity between two abstractions. It can be observed that the intersection $\cap_a$ of two elementary decompositions of
abstractions will be equal to the elementary decomposition of the cross-product lgg of the (if..then) parts of the two abstractions. The Elementary Depth cannot be directly applied to elementary decompositions of abstractions. This is because an element of the elementary decomposition has the form (if $x = a$ then $A$), where $A$ is a set. The depth of this element can be calculated by first calculating the Elementary Depth of the set $A$ and then defining the depth of (if $x = a$ then $A$) to be the depth we assign to $a$ times the depth of the set $A$, that is, $depth(a) \cdot depth(A)$. Another way would be to add the two depths, $depth(a) + depth(A)$. The behavior of the Elementary Depth in both cases will be similar to the behavior of the Elementary Depth for sets and multi-sets. The completion of the theoretical framework for higher-order terms is a matter of ongoing work.
5 Conclusion

In this paper we have defined two general and flexible approaches for defining distances for structured data. These approaches have all the desired theoretical properties. The generality and flexibility of the distance defined in Section 3 is due to the weight configuration, which was inspired by Ramon and Bruynooghe’s work [18]. With these weights we can configure the relative importance of each position of a term and, importantly, the relative importance of feature equality. For the distance defined in Section 4 the flexibility and generality is due to the fact that we can configure the depths of the elements of the elementary decomposition. Moreover, the distance defined in Section 4 can be considered as the Minkowski distance for structured data, as our metric space is isometrically embeddable in a vector space with the Minkowski metric. In a theoretical approach such generality and flexibility are important as they increase the applicability of the distance.

We are currently investigating the applicability of our distances to real-world datasets. An important issue in applying distance-based methods to structured data is the efficient memory indexing problem. In propositional learning this problem was successfully addressed by storing the individuals in such a way that the nearest neighbors can be retrieved very efficiently [3,5]. Extending these approaches to handle structured data is essential so that distance-based methods can be applied efficiently.

Acknowledgements

Part of this research was supported by the EU Framework V project (IST-1999-11495) Data Mining and Decision Support for Business Competitiveness: Solomon Virtual Enterprise. Thanks are due to Thomas Gärtner and the anonymous reviewers for their useful comments.
References

1. D.W. Aha, editor. Lazy Learning. Kluwer Academic Publishers, 1997.
2. C. Baier and M.E. Majster-Cederbaum. Metric semantics from partial order semantics. Acta Informatica, 34(9):701-735, 1997.
3. J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
4. W. Emde and D. Wettschereck. Relational Instance-Based Learning. In L. Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 122-130. Morgan Kaufmann, 1996.
5. J. Friedman, J. Bentley, and R. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209-226, 1977.
6. T. Gärtner, J.W. Lloyd, and P.A. Flach. Kernels for Structured Data. In S. Matwin and C. Sammut, editors, Proceedings of the Twelfth International Conference on Inductive Logic Programming, volume 2583 of Lecture Notes in Artificial Intelligence, pages 66-83. Springer-Verlag, 2002.
7. W. Greub. Linear Algebra. Springer-Verlag, 1975.
8. T. Horvath, S. Wrobel, and U. Bohnebeck. Relational Instance-Based Learning with Lists and Terms. Machine Learning, 43(1/2):53-80, 2001.
9. A. Hutchinson. Metrics on terms and clauses. In M. van Someren and G. Widmer, editors, Proceedings of the Ninth European Conference on Machine Learning, volume 1224 of Lecture Notes in Artificial Intelligence, pages 138-145. Springer-Verlag, 1997.
10. M. Kirsten and S. Wrobel. Extending k-means clustering to first-order representations. In J. Cussens and A. Frisch, editors, Proceedings of the Tenth International Conference on Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial Intelligence, pages 112-129. Springer-Verlag, 2000.
11. M. Kirsten, S. Wrobel, and T. Horvath. Distance Based Approaches to Relational Learning and Clustering. In S. Džeroski and N. Lavrač, editors, Relational Data Mining, chapter 9, pages 213-232. Springer-Verlag, 2001.
12. J.W. Lloyd. Knowledge representation, computation and learning in higher-order logic. Available at http://csl.anu.edu.au/~jwl, 2001.
13. J.W. Lloyd. Higher-order computational logic. In Antonis C. Kakas and Fariba Sadri, editors, Computational Logic. Logic Programming and Beyond, volume 2407 of Lecture Notes in Artificial Intelligence, pages 105-137. Springer-Verlag, 2002.
14. D. Mavroeidis. Distances and generalization structures in first-order and higher-order representations. Master’s thesis, Department of Computer Science, University of Bristol, 2002.
15. S.H. Nienhuys-Cheng. Distance Between Herbrand Interpretations: A Measure for Approximations to a target concept. In S. Džeroski and N. Lavrač, editors, Proceedings of the Seventh International Workshop on Inductive Logic Programming, volume 1297 of Lecture Notes in Artificial Intelligence, pages 213-226. Springer-Verlag, 1997.
16. E. Pekalska, P. Paclik, and R. Duin. A Generalized Kernel Approach to Dissimilarity-based Classification. Journal of Machine Learning Research, 2:175-211, 2001.
17. G.D. Plotkin. A note on inductive generalization. In Machine Intelligence, volume 5, pages 153-163. Edinburgh University Press, 1970.
18. J. Ramon and M. Bruynooghe. A framework for Defining Distances Between First-Order Logic Objects. In D. Page, editor, Proceedings of the Eighth International Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Artificial Intelligence, pages 271-280. Springer-Verlag, 1998.
Appendix

Proof. (Theorem 5) In order to prove the theorem we have to show the following:
1. If $y$ θ-subsumes $x$ then $D_E(y) \subseteq D_E(x)$.
2. If $D_E(y) \subseteq D_E(x)$ then $y$ θ-subsumes $x$.

1. We have two first-order terms $x, y$ such that $y$ θ-subsumes $x$, so there exists a substitution $\theta$ from variables to terms such that $\theta(y) = x$. Now we will show that $z \in D_E(y) \Rightarrow z \in D_E(x)$. Since $z \in D_E(y)$, from the definition of the elementary decomposition, $z$ is either a functor with its position or a pair with its position. In both cases the substitution $\theta$ cannot affect $z$, so we will have $z \in D_E(x)$.
2. Now we have $D_E(y) \subseteq D_E(x)$, that is, for all $z$, $z \in D_E(y) \Rightarrow z \in D_E(x)$. Assume now that $y$ does not θ-subsume $x$. From the definition of θ-subsumption, this means that either there is a position at which $y$ and $x$ contain different non-variable sub-terms, or there is a pair in two positions of $y$ that is not in $x$. In both cases this violates the fact that $D_E(y) \subseteq D_E(x)$. So we conclude that $y$ θ-subsumes $x$.

Proof. (Theorem 6) In order to prove the theorem we have to show the following:
1. $D_E(lgg(x, y)) \subseteq D_E(x) \cap D_E(y)$.
2. $D_E(x) \cap D_E(y) \subseteq D_E(lgg(x, y))$.

1. Since $lgg(x, y)$ θ-subsumes both $x$ and $y$ we will have that $D_E(lgg(x, y)) \subseteq D_E(x)$ and $D_E(lgg(x, y)) \subseteq D_E(y)$. So, $D_E(lgg(x, y)) \subseteq D_E(x) \cap D_E(y)$.
2. Assume that $z \in D_E(x) \cap D_E(y)$, so $z \in D_E(x)$ and $z \in D_E(y)$. Since $z$ will be either a functor with its position or a pair with its position, from the definition of the lgg the same functors and pairs that exist in the same position of both terms are guaranteed to exist in $lgg(x, y)$. So $z \in D_E(lgg(x, y))$.

Proof. (Theorem 7) In order to show that this function is a depth function we have to show the following:
1. $depth[p](\top) = 0$
2. $x < y \Rightarrow depth[p](x) > depth[p](y)$, $\forall x, y$

1. The elementary decomposition of the top of the lattice will be $D_E(\top) = \emptyset$. So we have $depth(\top) = 0$.
2. Now if $x < y$ we will have that $D_E(y) \subseteq D_E(x)$ and therefore $depth[p](x) > depth[p](y)$.

Proof. (Theorem 8) We have to prove the following, for all $x, y, z \in T$:
1. $d(x, x) = 0$ and $d(x, y) = 0 \Rightarrow x = y$
2. $d(x, y) = d(y, x)$
3. $d(x, z) \leq d(x, y) + d(y, z)$

0. It can be easily observed that $d(x, y) = depth((D_E(x) \cup D_E(y)) - D_E(lgg(x, y)))$.
1. We have $d(x, x) = depth((D_E(x) \cup D_E(x)) - D_E(lgg(x, x))) = depth(\emptyset) = 0$. Next, $d(x, y) = 0 \Leftrightarrow depth((D_E(x) \cup D_E(y)) - D_E(lgg(x, y))) = 0$. The only case in which $depth(x) = 0$ is $x = \emptyset$. This is because each element of the elementary decomposition would contribute a positive quantity to the depth function. So it should be
$(D_E(x) \cup D_E(y)) - D_E(lgg(x, y)) = \emptyset \Leftrightarrow D_E(lgg(x, y)) = D_E(x) \cup D_E(y) \Leftrightarrow D_E(x) \cap D_E(y) = D_E(x) \cup D_E(y) \Leftrightarrow D_E(x) = D_E(y) \Leftrightarrow x = y$.
2. It comes straight from the definition that $d(x, y) = d(y, x)$ for all terms $x, y$.
3. We have to show that
$d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z$. (1)
We have
$d(x, z) = depth((D_E(x) \cup D_E(z)) - D_E(lgg(x, z))) = depth((D_E(x) \cup D_E(z)) - (D_E(x) \cap D_E(z)))$
and
$d(x, y) + d(y, z) = depth((D_E(x) \cup D_E(y)) - (D_E(x) \cap D_E(y))) + depth((D_E(y) \cup D_E(z)) - (D_E(y) \cap D_E(z))) \geq depth(((D_E(x) \cup D_E(y)) - (D_E(x) \cap D_E(y))) \cup ((D_E(y) \cup D_E(z)) - (D_E(y) \cap D_E(z))))$.
Hence (1) can be proved if we show that
$(X \cup Z) - (X \cap Z) \subseteq ((X \cup Y) - (X \cap Y)) \cup ((Y \cup Z) - (Y \cap Z))$ for all sets $X, Y, Z$. (2)
Applying set properties, (2) can be shown to be true.

Proof. (Theorem 9) We have to show that for all terms $x, y, z$, if $x \leq y \leq z$ then $d(x, z) \geq d(x, y)$ and $d(x, z) \geq d(y, z)$. Assume that we have $x, y, z$ such that $x \leq y \leq z$. Then
$d(x, z) = ((depth(x))^p + (depth(z))^p - 2(depth(lgg(x, z)))^p)^{1/p}$. (1)
Now $x \leq y \Rightarrow depth(x) \geq depth(y) \Rightarrow (depth(x))^p \geq (depth(y))^p$. Moreover $x \leq y \leq z \Rightarrow depth(lgg(x, z)) = depth(lgg(y, z)) = depth(z)$. By substituting the above in (1) we get:
$d(x, z) \geq ((depth(y))^p + (depth(z))^p - 2(depth(z))^p)^{1/p} = d(y, z)$.
Similarly we can show that $d(x, z) \geq d(x, y)$.

Proof. (Theorem 10) We have already shown that the distance is order-preserving, so what is left to show is that if $x < y$ then $d(x, y) > 0$. We have that $x < y \Rightarrow x \neq y$, and since $d$ is a metric, $x \neq y \Rightarrow d(x, y) \neq 0$. Combining this with the triangle inequality gives us $x < y \Rightarrow d(x, y) > 0$.

Proof. (Theorem 11) In order to show that $f$ is an isometric function, we have to show that $d[p](x, y) = d_p(f(x), f(y))$. We have that
$d[p](x, y) = ((depth(x))^p + (depth(y))^p - 2(depth(lgg(x, y)))^p)^{1/p} = (\sum_i (depth[p](s_i))^p)^{1/p}$, where $s_i \in ((D_E(x) \cup D_E(y)) - (D_E(x) \cap D_E(y)))$.
Moreover we have that
$d_p(x, y) = (\sum_i |x_i - y_i|^p)^{1/p}$.
It can be derived from the embedding procedure that we will have $d[p](x, y) = d_p(f(x), f(y))$.
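The identity in the proof of Theorem 11 can be checked numerically on small examples. The sketch below is illustrative only and makes two assumptions that are not taken from the paper: each element of an elementary decomposition gets its own coordinate, and a term's coordinate for an element is that element's depth if the element belongs to the term's decomposition and 0 otherwise. All function names and the depth values are hypothetical.

```python
import math


def elementary_distance(dx, dy, depth, p):
    """(sum of depth(s)^p over the symmetric difference of dx and dy)^(1/p)."""
    symmetric_difference = dx ^ dy
    return sum(depth[s] ** p for s in symmetric_difference) ** (1.0 / p)


def embed(decomposition, universe, depth):
    """Assumed embedding: one coordinate per element, 0 when the element is absent."""
    return [depth[e] if e in decomposition else 0.0 for e in universe]


def minkowski(u, v, p):
    """Standard Minkowski distance d_p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# Hand-written decompositions of p(a,b) and p(a,c) with arbitrary positive depths.
D_x = {"p@0", "a@1", "b@2"}
D_y = {"p@0", "a@1", "c@2"}
depth = {"p@0": 1.0, "a@1": 0.5, "b@2": 0.25, "c@2": 2.0}
universe = sorted(depth)

for p in (1, 2, 3):
    lhs = elementary_distance(D_x, D_y, depth, p)
    rhs = minkowski(embed(D_x, universe, depth), embed(D_y, universe, depth), p)
    assert math.isclose(lhs, rhs)
```

Coordinates of shared elements cancel in the Minkowski sum, so only the elements of the symmetric difference contribute, which is the same set the Elementary Distance sums over.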
Induction of Enzyme Classes from Biological Databases Stephen Muggleton, Alireza Tamaddoni-Nezhad, and Hiroaki Watanabe Department of Computing, Imperial College University of London, 180 Queen’s Gate, London SW7 2BZ, UK {shm,atn,hw3}@doc.ic.ac.uk
Abstract. Bioinformatics is characterised by a growing diversity of large-scale databases containing information on genetics, proteins, metabolism and disease. It is widely agreed that there is an increasingly urgent need for technologies which can integrate these disparate knowledge sources. In this paper we propose that not only is machine learning a good candidate technology for such data integration, but Inductive Logic Programming, in particular, has strengths for handling the relational aspects of this task. Relations can be used to capture, in a single representation, not only biochemical reaction information but also protein and ligand structure as well as metabolic network information. Resources such as the Gene Ontology (GO) and the Enzyme Commission (EC) system both provide isa-hierarchies of enzyme functions. On the face of it GO and EC should be invaluable resources for supporting automation within Functional Genomics, which aims at predicting the function of unassigned enzymes from the genome projects. However, neither GO nor EC can be directly used for this purpose since the classes have only a natural language description. In this paper we make an initial attempt at machine learning EC classes for the purpose of enzyme function prediction in terms of biochemical reaction descriptions found in the LIGAND database. To our knowledge this is the first attempt to do so. In our experiments we learn descriptions for a small set of EC classes including Oxidoreductase and Phosphotransferase. Predictive accuracies are provided for all learned classes. In further work we hope to complete the learning of enzyme classes and integrate the learned models with metabolic network descriptions to support “gap-filling” in the present understanding of metabolism.
1 Introduction
Within Bioinformatics there is a growing diversity of large-scale databases containing information on gene sequences (e.g. EMBL, http://www.ebi.ac.uk/embl/), proteins (e.g. Swiss-Prot, http://www.ebi.ac.uk/swissprot/,
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 269–280, 2003. © Springer-Verlag Berlin Heidelberg 2003
SCOP, http://scop.mrc-lmb.cam.ac.uk/scop/), metabolism (e.g. KEGG, http://www.genome.ad.jp/kegg/; WIT, http://wit.mcs.anl.gov/WIT2/; and BRENDA, http://www.brenda.uni-koeln.de/) and disease (e.g. the JGSNP database, http://www.tmgh.metro.tokyo.jp/jg-snp/). It is widely agreed that there is an increasingly urgent need for technologies which integrate these disparate knowledge sources. In this paper we propose that not only is machine learning a good candidate technology for such data integration, but Inductive Logic Programming, in particular, has strengths for handling the relational aspects of this task. In the context of protein structure prediction this approach has already shown success [6]. Potentially relations can be used to capture, in a single representation, not only biochemical reaction information but also protein and ligand structure as well as metabolic network information.

Bioinformatic resources such as the Gene Ontology (GO) [3] and the Enzyme Commission (EC) [5] system both provide isa-hierarchies of enzyme functions. On the face of it GO and EC should be invaluable for supporting automation within Functional Genomics [8], which aims at predicting the function of unassigned enzymes from the genome projects. However, neither GO nor EC can be directly used for this purpose since the classes have only a natural language description. In this paper we make an initial attempt at machine learning logic programs describing EC classes for the purpose of enzyme function prediction. The function of any particular enzyme is normally described in terms of the biochemical reaction which it catalyses. The LIGAND database (http://www.genome.ad.jp/ligand/) provides an extensive set of biochemical reactions underlying KEGG (the Kyoto Encyclopedia of Genes and Genomes). In LIGAND reactions are described in equational form as follows.

C21H30N7O17P3 + C7H8O5 <=> C21H29N7O17P3 + C7H10O5

To our knowledge the experiments described in this paper are the first attempt to learn enzyme functions in terms of these underlying reaction equations. In our experiments we learn descriptions for a small set of EC classes including Oxidoreductase and Phosphotransferase. Predictive accuracies are provided for all learned classes. In further work we hope to complete the learning of GO classes and integrate the learned models with metabolic network descriptions to support “gap-filling” in the present understanding of metabolism. This will extend previous research described in [2] by allowing for the case in which substrates and products of a reaction are known, but there is no known enzyme for such a reaction. In this case, rather than hypothesise an arbitrary unknown enzyme we could potentially use abduction together with learned biochemical knowledge of the kind we develop in this paper to narrow down the functional class of the missing enzyme.

This paper is arranged as follows. Section 2 introduces the EC classification system. The LIGAND database of biochemical reactions is then described in
Section 3. In the experiments described in Section 4 we investigate the possibility of learning EC classification rules in terms of the biochemical reactions found in LIGAND. In Section 5 we discuss the representation issues with the learned rules. In Section 6 we conclude and describe directions for further research.
2 Enzyme Classification
Enzymes are proteins which catalyse biochemical reactions within organisms. The description of enzyme function provides a characterisation of biological systems which forms a bridge between the micro-level and the macro-level (from atoms, through chemical reaction networks, to diseases). The genome projects are generating ever-larger volumes of genes with unassigned function. This has led in turn to an increasingly important role for enzyme classification systems.

Let us consider homology-based functional genomics as an approach to finding the functions of unknown enzymes using an enzyme classification system. Assume that we have an amino-acid sequence of an unknown enzyme A and a known enzyme B, and that B belongs to the enzyme class X. First, we use software to compute the amino-acid sequence similarity of A and B in order to determine their degree of homology. If A and B are found to be homologues then we next proceed with experiments based on the hypothesis that A has a similar function to B.

Classification of enzyme functions is difficult since many enzyme mechanisms are not fully understood and many enzymes catalyse multiple reactions. To tackle this issue, classification systems have been proposed which focus on different features of enzymes [5,7,1]. For example, the EC List [5] is based on a chemical-formula oriented classification. Thus it is sometimes difficult to classify the enzymes that catalyse several steps of reactions by creating intermediates. By contrast, mechanism-oriented approaches [1] classify enzymes based on reaction mechanisms such as (a) rules of the substructure changes in the chemical structures and (b) chemical structural reasons for the changes. These classifications do not tend to take account of reaction-related issues such as inhibitors, pH, temperature, protein structure, amino-acid sequence, and the context of the metabolic network in which the reaction is taking place.
A relational representation has the potential to capture many of these aspects simultaneously. We believe that such representations are mandatory if we hope to model biological systems from the micro to the macro-level in a seamless fashion.

In our experiments, the oxidoreductase and phosphotransferase classes are learned as logic programs. Oxidoreductases are enzymes which catalyse oxidation and reduction. These reactions cause energy flow within organisms by exchange of electrons. Phosphotransferases are enzymes transferring a phosphate group from one compound (donor) to another (acceptor). The acceptors have electrophilic substructures such as NR, SR, and OR, where N, S, O, and R are nitrogen, sulphur, oxygen, and an alkyl group respectively.
Below we show where oxidoreductases and phosphotransferases fit within the EC classification system. The latest version of the EC List [5] contains 3196 enzymes, and is divided into six first-layer classes as follows:

1. Oxidoreductases                                          (1st layer)
   1.1 Acting on the CH-OH group of donors                  (2nd layer)
       1.1.1 With NAD+ or NADP+ as acceptor                 (3rd layer)
             1.1.1.1 Alcohol dehydrogenase; Aldehyde reductase   (4th layer)
             :
2. Transferases
   :
   2.7 Transferring Phosphorus-Containing Groups
   :
3. Hydrolases
4. Lyases
5. Isomerases
6. Ligases.
For example, the classification of EC enzyme Number 1.1.1.1 can be read as follows: EC Number 1.1.1.1 is an oxidoreductase which acts on the CH-OH group of donors, with NAD+ or NADP+ as acceptor, and the name of the enzyme is Alcohol dehydrogenase or Aldehyde reductase. Oxidoreductases are classified in EC Number 1.*.*.* and phosphotransferases are EC Number 2.7.*.*.
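The layer-by-layer reading of an EC number lends itself to a simple wildcard match. The following is a minimal sketch, not part of the EC system itself; the function name `matches_ec_pattern` is a hypothetical helper, and `*` is used as a per-layer wildcard in the same way as in the patterns 1.*.*.* and 2.7.*.*.

```python
def matches_ec_pattern(ec_number, pattern):
    """Return True if ec_number fits pattern, where '*' matches any one layer.

    Both arguments are dot-separated strings with the same number of layers,
    e.g. ec_number='2.7.1.23' and pattern='2.7.*.*'.
    """
    layers = ec_number.split(".")
    wanted = pattern.split(".")
    if len(layers) != len(wanted):
        return False
    return all(w == "*" or w == layer for layer, w in zip(layers, wanted))


OXIDOREDUCTASE = "1.*.*.*"
PHOSPHOTRANSFERASE = "2.7.*.*"

print(matches_ec_pattern("1.1.1.1", OXIDOREDUCTASE))       # alcohol dehydrogenase
print(matches_ec_pattern("2.7.1.23", PHOSPHOTRANSFERASE))  # enzyme listed in Fig. 1
print(matches_ec_pattern("2.1.1.1", PHOSPHOTRANSFERASE))   # a transferase outside 2.7
```

A check of this kind is enough to label examples as positive or negative for a target class once each enzyme's EC number is known.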
3 LIGAND Database
LIGAND is a database of chemical compounds and reactions in biological pathways [4]. The database consists of three sections: COMPOUND, REACTION, and ENZYME, and data is available in text files from the web site (http://www.genome.ad.jp/ligand/) and the anonymous ftp site (ftp.genome.ad.jp/pub/kegg/ligand/). The COMPOUND section is a collection of metabolic and other compounds such as substrates, products, inhibitors of metabolic pathways as well as drugs and xenobiotic chemicals. The REACTION section is a collection of chemical reactions involved in the pathway diagrams of the KEGG/PATHWAY database as well as in the ENZYME section. The ENZYME section is a collection of all known enzymatic reactions classified according to the EC List. Knowledge integration could be performed for the COMPOUND, REACTION, and ENZYME sections by cross-referring the EC numbers, the compound numbers, and the reaction numbers (Fig. 1). The COMPOUND, REACTION, and ENZYME sections contain several atoms such as carbon (C), hydrogen (H), nitrogen (N), oxygen (O), phosphorus (P), sulphur (S), magnesium (Mg), manganese (Mn), iron (Fe) and iodine (I). Note that R is an alkyl group.
REACTION
ENTRY       R00104
NAME        ATP:NAD+ 2’-phosphotransfera
DEFINITION  ATP + NAD+ <=> ADP + NADP+
EQUATION    C00002 + 2 C00003 <=> C00008 + C00006
PATHWAY     PATH: MAP00760  Nicotinate and nicotinamide metabolism
ENZYME      2.7.1.23
///
ENTRY       R00105
:

COMPOUNDS
ENTRY       C00002
NAME        ATP
            Adenosine 5’-triphosphate
FORMULA     C10H16N5O13P3
REACTION    R00002 R00076 R00085 R00086 R00087 R00088 R00089 R00104
:
///
ENTRY       C00003
NAME        NAD
FORMULA     C21H28N7O14P2
:
///
ENTRY       C00006
NAME        NADP
FORMULA     C21H29N7O17P3
:
///
ENTRY       C00008
NAME        ADP
FORMULA     C10H15N5O10P2
:
Fig. 1. Excerpt from REACTION and COMPOUND sections in LIGAND
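Records like those in Fig. 1 can be read with a few lines of code. The sketch below assumes only what the excerpt shows: a field starts with an upper-case key at the beginning of a line, a line starting with whitespace continues the previous field, and `///` ends a record. The sample text, the function names, and the decision to drop stoichiometric coefficients are illustrative choices, not part of the LIGAND distribution.

```python
SAMPLE = """\
ENTRY       R00104
EQUATION    C00002 + 2 C00003 <=> C00008 + C00006
ENZYME      2.7.1.23
///
ENTRY       R00105
"""


def parse_records(text):
    """Split flat-file text into a list of {field_name: value} dictionaries."""
    records, current, key = [], {}, None
    for line in text.splitlines():
        if line.startswith("///"):          # record separator
            if current:
                records.append(current)
            current, key = {}, None
        elif not line.strip():              # skip blank lines
            continue
        elif not line[0].isspace():         # new field: KEY followed by value
            key, _, value = line.partition(" ")
            current[key] = value.strip()
        elif key is not None:               # continuation of the previous field
            current[key] += " " + line.strip()
    if current:
        records.append(current)
    return records


def split_equation(equation):
    """Return (lhs_ids, rhs_ids); coefficients such as the '2' are dropped."""
    lhs, rhs = equation.split("<=>")

    def compound_ids(side):
        return [term.split()[-1] for term in side.split("+")]

    return compound_ids(lhs), compound_ids(rhs)


records = parse_records(SAMPLE)
lhs_ids, rhs_ids = split_equation(records[0]["EQUATION"])
```

The resulting identifier lists are the raw material for cross-referencing reactions with the compound formulas in the COMPOUND section.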
4 Experiments
The experiments in this section are aimed at evaluating the following null hypothesis: Null Hypothesis: A relational representation cannot capture enzyme classification rules based only on descriptions of the underlying biochemical reactions.
4.1 Materials

The ILP system used in the experiments is Progol 4.4 (available from http://www.doc.ic.ac.uk/~shm/Software/progol4.4/). In order to allow reproducibility of the results, the data sets and Progol’s settings used in the experiments have also been made available (from http://www.doc.ic.ac.uk/bioinformatics/datasets/enzymes/). Our study is restricted to learning classification rules for two classes of enzymes: the main class EC1 (Oxidoreductase) and a more specific class EC2.7 (Phosphotransferase). One justification for these choices is that EC1 and EC2 are the most populated classes of enzymes and they contain a relatively large number of examples, which means that learning is more robust and the results more meaningful. Other classes contain a smaller number of enzymes; for example, EC5 and EC6 each contain around 200 known enzymes compared to EC1 and EC2 with over 1000 enzymes in each.

4.2 Methods

For the experiments reported in this section, we use a relational representation to represent the biochemical reactions catalysed by each enzyme. In this representation, each reaction is defined as a set of compounds in the left hand side (LHS) and the right hand side (RHS) of the reaction. For example, the enzyme with EC number 1.1.1.37 which belongs to the class of Oxidoreductase and catalyses the reaction

C00149 + C00003 <=> C00036 + C00004 + C00080

is represented by the following Prolog facts:

oxidoreductase('1.1.1.37').
lhs('1.1.1.37','C00149').
lhs('1.1.1.37','C00003').
rhs('1.1.1.37','C00036').
rhs('1.1.1.37','C00004').
rhs('1.1.1.37','C00080').

For each chemical compound, we only represent the number of atoms of each element appearing in the compound. For example, compound C00003 with chemical formula C21H28N7O14P2 can be represented as follows.

compound('C00003').
atoms('C00003','c',21).
atoms('C00003','h',28).
atoms('C00003','n',7).
atoms('C00003','o',14).
atoms('C00003','p',2).

In order to capture the exchange of elements in compounds during the reactions, we define the relation ‘diff_atoms/5’ which represents the difference between the number of particular atoms in compounds C1 and C2 which appear in the LHS and RHS respectively.
diff_atoms(Enz,C1,C2,E,Dif):-
    lhs(Enz,C1), rhs(Enz,C2),
    atoms(C1,E,N1), atoms(C2,E,N2),
    Dif is N1 - N2, Dif > 0.

We report on two series of experiments for each of the Oxidoreductase and Phosphotransferase enzyme classes. In ‘Mode 1’, the hypothesis language was limited so that only ‘diff_atoms/5’ and numerical constraint predicates (i.e. =, ≥ and ≤) can appear in the body of each hypothesis. In ‘Mode 2’, the hypothesis language also included ‘atoms/3’, ‘lhs/2’ and ‘rhs/2’. Part of the mode declarations and background knowledge used by Progol are shown in Table 1.

In the experiments, we compared the performance of Progol in learning classification rules for each of the Oxidoreductase and Phosphotransferase enzyme classes from varying-sized training sets. Figure 2 shows the experimental method used for this purpose. The average predictive accuracy was measured in 20 different runs. In each run, the number of positive and negative training examples was varied while the number of ‘hold-out’ test examples was kept fixed (i.e. 200). Test-set positive examples were randomly sampled from the target enzyme class. Negative examples for the target class EC1 (Oxidoreductase) were randomly sampled from other major classes (i.e. EC2 to EC6). For EC2.7 (Phosphotransferase), negative examples were randomly sampled from other sub-classes of EC2. Progol was then run on the training examples using ‘Mode 1’ and ‘Mode 2’. For each iteration of the loop the predictive accuracy of the learned classification rule was measured on the test examples. The average and standard error of these accuracies were then plotted against the number of training examples.
for i = 1 to 20 do
    for j in (10, 20, 40, 80, 160) do
        Randomly sample j positive and j negative ‘training’ examples
        Randomly sample 200 positive and 200 negative ‘test’ examples
        Run Progol on the ‘training’ set using ‘Mode 1’
        A_ij = predictive accuracy of the learned classification rule on the ‘test’ set
        Run Progol on the ‘training’ set using ‘Mode 2’
        A'_ij = predictive accuracy of the learned classification rule on the ‘test’ set
    end
end
for j in (10, 20, 40, 80, 160) do
    Plot average and standard error of A_ij and A'_ij versus j (i ∈ [1..20])
Fig. 2. Experimental method
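The protocol of Fig. 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: `run_learner` is a hypothetical stand-in for a call to Progol (which is not invoked here), the example pools `pos` and `neg` are assumed to be plain lists, and training and test examples are assumed to be drawn without overlap, which is one reading of the term ‘hold-out’.

```python
import math
import random
import statistics


def learning_curve(pos, neg, run_learner, sizes=(10, 20, 40, 80, 160),
                   runs=20, n_test=200, seed=0):
    """Return {j: (mean accuracy, standard error)} over `runs` repetitions.

    `run_learner(train_pos, train_neg)` is a hypothetical stand-in for the
    learner; it must return a classifier, i.e. a function example -> bool.
    """
    rng = random.Random(seed)
    acc = {j: [] for j in sizes}
    for _ in range(runs):
        for j in sizes:
            # Draw j training and n_test test examples per class, disjointly
            # (assumption: hold-out test examples never appear in training).
            p = rng.sample(pos, j + n_test)
            n = rng.sample(neg, j + n_test)
            classify = run_learner(p[:j], n[:j])
            test = [(x, True) for x in p[j:]] + [(x, False) for x in n[j:]]
            correct = sum(classify(x) == label for x, label in test)
            acc[j].append(correct / len(test))
    # Standard error of the mean: sample standard deviation / sqrt(#runs).
    return {j: (statistics.mean(a), statistics.stdev(a) / math.sqrt(len(a)))
            for j, a in acc.items()}
```

To compare two hypothesis languages as in Fig. 2, the function could be called once per mode with the same `seed`, so that both learners see the same samples.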
Stephen Muggleton et al.
Table 1. Part of mode declarations and background knowledge used by Progol in the experiments

:- set(h,10000)?
:- set(r,100000)?
:- set(noise,5)?

:- modeh(1,oxidoreductase(+enzyme))?
:- modeb(*,diff_atoms(+enzyme,-compound,-compound,#element,-int))?
:- modeb(1,eq(+int,#int))?
:- modeb(1,lteq(+int,#int))?
:- modeb(1,gteq(+int,#int))?

% The following mode declarations were added in ‘Mode 2’ experiments
%:- modeb(*,lhs(+enzyme,-compound))?
%:- modeb(*,rhs(+enzyme,-compound))?
%:- modeb(*,atoms(+compound,#element,-nat))?

element(c).  element(h).   element(n).   element(o).   element(p).
element(s).  element(mg).  element(mn).  element(fe).  element(i).

diff_atoms(Enz,C1,C2,E,Dif) :-
    lhs(Enz,C1), rhs(Enz,C2),
    atoms(C1,E,N1), atoms(C2,E,N2),
    Dif is N1 - N2, Dif > 0.

eq(X,X) :- not(var(X)), int(X), !.

gteq(X,Y) :- not(var(X)), not(var(Y)), int(X), int(Y), X >= Y, !.
gteq(X,X) :- not(var(X)), int(X).

lteq(X,Y) :- not(var(X)), not(var(Y)), int(X), int(Y), X =< Y, !.
lteq(X,X) :- not(var(X)), int(X).
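To make the meaning of diff_atoms/5 in Table 1 concrete, the following Python sketch enumerates its solutions for one enzyme. It is an illustrative re-implementation, not part of the Progol background knowledge; the dictionary layout and the enzyme and compound names (`e1`, `L1`, `R1`) with their atom counts are hypothetical.

```python
def diff_atoms(enzyme, lhs, rhs, atoms):
    """Yield (c1, c2, element, dif) for every solution of diff_atoms/5.

    lhs, rhs: dict mapping an enzyme id to a list of compound ids
    atoms:    dict mapping a compound id to a dict element -> atom count
    """
    for c1 in lhs.get(enzyme, []):
        for c2 in rhs.get(enzyme, []):
            for element, n1 in atoms[c1].items():
                # In Prolog, atoms(C2,E,N2) fails when C2 has no fact for E,
                # so elements absent from c2 produce no solution.
                if element in atoms[c2]:
                    dif = n1 - atoms[c2][element]
                    if dif > 0:  # mirrors the final goal `Dif > 0`
                        yield (c1, c2, element, dif)


# Hypothetical toy data: one reaction that loses two hydrogen atoms.
lhs = {"e1": ["L1"]}
rhs = {"e1": ["R1"]}
atoms = {"L1": {"c": 4, "h": 6, "o": 5},
         "R1": {"c": 4, "h": 4, "o": 5}}

solutions = list(diff_atoms("e1", lhs, rhs, atoms))
print(solutions)  # [('L1', 'R1', 'h', 2)]
```

Because of the `Dif > 0` condition, only elements that are more abundant in the LHS compound than in the RHS compound give a solution; equal or smaller counts are silently dropped.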
4.3 Results
The results of the experiments are shown in Figure 3. In these graphs, the vertical axis shows predictive accuracy and the horizontal axis shows the number of training examples. For each experiment, predictive accuracies were averaged over 20 different runs (error bars represent standard errors). According to these graphs, the overall predictive accuracies of the learned rules for the Phosphotransferase dataset are higher than those for the Oxidoreductase dataset. These results suggest that the ‘diff_atoms/5’ information alone is sufficient to obtain a relatively high accuracy for the Phosphotransferase dataset. This is not the case for the Oxidoreductase dataset, which probably requires additional information (e.g. knowledge about the structure) that cannot be captured by ‘diff_atoms/5’. For the Phosphotransferase dataset the accuracy difference between ‘Mode 1’ and ‘Mode 2’ is not significant. For the Oxidoreductase dataset, however, ‘Mode 2’ clearly outperforms ‘Mode 1’ in all experiments. In both graphs the null hypothesis is refuted, as the predictive accuracies are significantly higher than the default accuracy (i.e. 50%). In the next section we discuss some of the descriptions which have been learned for each of the target enzyme classes.
5 Discussion
In this paper we have made an initial attempt at machine learning EC classes for enzyme function prediction based on biochemical reaction descriptions found in the LIGAND database. Figure 4 shows a diagrammatic representation of the classes of chemical reaction catalysed by Oxidoreductase and Phosphotransferase enzymes. We succeeded in learning descriptions of the Oxidoreductase and Phosphotransferase classes from the LIGAND database in the form of a logic program containing the following rules (among others).

Oxidoreductase Rule
oxidoreductase(A) :- diff_atoms(A,B,C,h,D), atoms(B,o,E), atoms(C,o,E), eq(D,2), lteq(E,4).

Phosphotransferase Rule
phosphotrans(A) :- diff_atoms(A,B,C,h,D), diff_atoms(A,B,C,o,E), diff_atoms(A,B,C,p,F), eq(F,1), lteq(E,7).

In the above, the Oxidoreductase Rule captures the elimination of H2 which is typical of oxidation-reduction reactions (see Figure 4a). The Phosphotransferase Rule represents the exchange of the phosphate group PO3 (Figure 4b). Logically speaking, the boundary constraint lteq(E,7) is consistent with a transfer of three oxygen atoms. However, it is not clear why lteq(E,7) is learned instead of eq(E,3). Further analysis by domain experts is required to identify the chemical meaning of this constraint.
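As a reading aid, the body of the Oxidoreductase Rule can be checked by hand against a single reaction. The Python sketch below is a direct transcription of that rule body and nothing more; the compound names and atom counts are hypothetical toy data, not taken from LIGAND, and the function name is an assumption made for this illustration.

```python
def oxidoreductase_rule(lhs_compounds, rhs_compounds, atoms):
    """Return True if some pair (B, C) satisfies the learned rule body.

    atoms: dict mapping a compound id to a dict element -> atom count.
    """
    for b in lhs_compounds:
        for c in rhs_compounds:
            hb, hc = atoms[b].get("h"), atoms[c].get("h")
            ob, oc = atoms[b].get("o"), atoms[c].get("o")
            if None in (hb, hc, ob, oc):
                continue  # a missing atoms/3 fact makes the Prolog goal fail
            # diff_atoms(A,B,C,h,D), eq(D,2): B has two more H atoms than C.
            # atoms(B,o,E), atoms(C,o,E), lteq(E,4): equal O counts, at most 4.
            if hb - hc == 2 and ob == oc and ob <= 4:
                return True
    return False


# Hypothetical compounds: R1 is L1 with two hydrogen atoms removed.
toy_atoms = {"L1": {"c": 3, "h": 8, "o": 1},
             "R1": {"c": 3, "h": 6, "o": 1},
             "R2": {"c": 3, "h": 6, "o": 5}}

print(oxidoreductase_rule(["L1"], ["R1"], toy_atoms))  # True
print(oxidoreductase_rule(["L1"], ["R2"], toy_atoms))  # False
```

The second call fails because the two compounds do not share the same oxygen count, so the shared variable E in atoms(B,o,E) and atoms(C,o,E) cannot be bound consistently.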
[Figure 3: two plots, (a) Oxidoreductase and (b) Phosphotransferase, each showing predictive accuracy (%) on the vertical axis against the number of training examples on the horizontal axis, with one curve for Mode 1 and one for Mode 2.]

Fig. 3. Performance of Progol in learning enzyme classification rules for a) Oxidoreductase and b) Phosphotransferase. In both graphs default accuracy is 50%.
[Figure 4: two schematic reaction diagrams, (a) Oxidoreduction, in which H2 is eliminated, and (b) Phosphotransference, in which an OH group is replaced by OPO3.]

Fig. 4. Chemical reactions catalysed by Oxidoreductase and Phosphotransferase enzymes
6 Conclusion
As mentioned previously, resources such as the Gene Ontology (GO) [3] and the Enzyme Classification (EC) [5] system both provide isa-hierarchies of enzyme functions. On the face of it, GO and EC should be invaluable for supporting automation within Functional Genomics, which aims at predicting the function of unassigned enzymes from the genome projects. However, neither GO nor EC can be directly used for this purpose since the classes presently have only a natural language description. The study described in this paper has taken a first step toward automatic formulation of rules which describe some of the major functional classes of enzymes. By extending this study we believe it should be possible to learn descriptions for all major GO and EC classes. However, in order to do so we will need to involve domain experts to check the quality and comprehensibility of the learned rules.

In order to speed up the learning process in this study, we simply compared the number of atoms between two compounds using the diff_atoms/5 predicate. The limitation of this representation is that it ignores the structure of compounds. For example, enzymes which catalyse the elimination of H2 are called dehydrogenases, and the reaction results in a double bond between C-O, C-C, or C-N. By considering the types of bonds between atoms, such as single and double bonds, we could track the introduction of double bonds between atoms and determine the locations where H2 is eliminated. Logic programs could represent the structural information by expressing connections and the type of connections between atoms. The learned knowledge could be viewed not only as rules for classification but also as programs for a logic-based biological simulation. As a future study, we believe it would be worth adding more background knowledge, including inhibitors, cofactors and amino-acid sequence information, which is available from various public-domain biological databases.
Acknowledgements This work was supported by the ESPRIT IST project “Application of Probabilistic Inductive Logic Programming (APRIL)”, the BBSRC/EPSRC Bioinformatics and E-Science Programme, “Studying Biochemical networks using probabilistic knowledge discovery” and the DTI Metalog project.
References

1. M. Arita and T. Nishioka. Hierarchical classification of chemical reactions. Bio Industry, 17(7):45–50, 2000.
2. C.H. Bryant, S.H. Muggleton, S.G. Oliver, D.B. Kell, P. Reiser, and R.D. King. Combining inductive logic programming, active learning and robotics to discover the function of genes. Electronic Transactions in Artificial Intelligence, 5B1(012):1–36, November 2001.
3. The Gene Ontology Consortium. Gene ontology: Tool for the unification of biology. Nature Genetics, 25:25–29, 2000.
4. S. Goto, Y. Okuno, M. Hattori, T. Nishioka, and M. Kanehisa. Ligand: Database of chemical compounds and reactions in biological pathways. Nucleic Acids Research, 30:402–404, 2002.
5. International Union of Biochemistry and Molecular Biology. Enzyme Nomenclature: Recommendations (1992) of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. Academic Press, New York, 1992.
6. M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. Automated discovery of structural signatures of protein fold and function. Journal of Molecular Biology, 306:591–605, 2001.
7. C. Walsh. Enzymatic Reaction Mechanisms. W. H. Freeman and Company, 1979.
8. M.R. Wilkins, K.L. Williams, R.D. Appel, and D.F. Hochstrasser. Proteome Research: New Frontiers in Functional Genomics (Principles and Practice). Springer Verlag, Berlin, 1997.
Estimating Maximum Likelihood Parameters for Stochastic Context-Free Graph Grammars

Tim Oates, Shailesh Doshi, and Fang Huang

Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County, Baltimore, MD 21250
{oates,sdoshi1,fhuang2}@cs.umbc.edu
Abstract. Given a sample from an unknown probability distribution over strings, there exist algorithms for inferring the structure and parameters of stochastic grammatical representations of the unknown distribution, i.e. string grammars. Despite the fact that research on grammatical representations of sets of graphs has been conducted since the late 1960’s, almost no work has considered the possibility of stochastic graph grammars and no algorithms exist for inferring stochastic graph grammars from data. This paper presents PEGG, an algorithm for estimating the parameters of stochastic context-free graph grammars given a sample from an unknown probability distribution over graphs. It is established formally that for a certain class of graph grammars PEGG finds parameter estimates in polynomial time that maximize the likelihood of the data, and preliminary empirical results demonstrate that the algorithm performs well in practice.
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 281–298, 2003. © Springer-Verlag Berlin Heidelberg 2003

1 Introduction

Graphs are a natural representation for relational data. Nodes correspond to entities, edges correspond to relations, and symbolic or numeric labels on nodes and edges provide additional information about particular entities and relations. Graphs are routinely used to represent everything from social networks (Pattison, 1993) to chemical compounds (Cook et al., 2001) to visual scenes (Hong & Huang, 2002). Suppose you have a set of graphs, each representing an observed instance of a known money laundering scheme (Office of Technology Assessment, 1995). It would be useful to learn a statistical model of these graphs that supports the following operations:

– Compute graph probabilities: If the model represents a probability distribution over graphs, then it is possible to determine the probability of a new graph given the model. In the context of money laundering schemes, this would amount to determining whether a newly observed set of business relationships and transactions (represented as a graph) is likely to be an instance of money laundering.
– Identify recurring structures: Money laundering schemes may contain common components (i.e. sub-graphs) that are arranged in a variety of ways. To better understand the domain, it is useful to explicitly identify such components and the common ways in which they are connected to one another.
– Sample new graphs: Given the model, one might want to sample new graphs (money laundering schemes) according to the probability distribution defined by the model. This might be useful in exploring the space of possible schemes, perhaps looking for new variants that law enforcement has not previously considered, or for generating training examples from which humans or programs can learn.

Stochastic grammatical representations of probability distributions over strings, such as stochastic context-free grammars (SCFGs), support these three operations. Given a SCFG, G, and a string, s, it is possible to efficiently compute p(s|G). It is also straightforward to sample strings from the probability distribution defined by G. Finally, there exist a number of methods for learning both the structure (Stolcke, 1994) and parameters (Lari & Young, 1990) of string grammars from data. The most well-known algorithm for computing maximum likelihood estimates of the parameters of string grammars is the Inside-Outside algorithm (Lari & Young, 1990). In addition to estimating parameters, this algorithm can be used to learn structure. This is done by constructing a grammar containing, for example, all possible CNF productions that can be created from a given set of terminals and non-terminals. Inside-Outside can then prune away (i.e. set production probabilities to zero) those productions that are possible but that are not actually in the grammar that generated the training data.
Also, Inside-Outside can be used as a component in a system that explicitly searches over the space of grammar structures, iteratively evaluating structures/parameters via, for example, the description length of the grammar and the data given the grammar.

We have embarked on a program of research aimed at creating algorithms for learning and reasoning with stochastic grammatical representations of probability distributions over graphs that provide functionality mirroring that available for string grammars. There exists a fairly extensive literature on deterministic graph grammars that define sets of graphs in the language of the grammar (see, for example, (Engelfriet & Rozenberg, 1997) and (Ehrig et al., 1999)), just as deterministic string grammars define sets of strings that are in the language of the grammar. However, the vast majority of existing work on graph grammars has completely ignored the possibility of stochastic graph grammars and there is no work whatsoever on learning either the structure or parameters of stochastic graph grammars.

This paper describes an algorithm for estimating the parameters of stochastic graph grammars that we call Parameter Estimation for Graph Grammars (PEGG), the first algorithm of its kind. PEGG is similar in many respects to the Inside-Outside algorithm. PEGG computes inside and outside probabilities in polynomial time, and can use these probabilities to efficiently compute p(g|G), the probability of graph g given graph grammar G. In addition, PEGG computes maximum likelihood estimates of grammar parameters for a given grammar structure and set of graphs, again in polynomial time. Though we have explored the use of Bayesian model merging techniques developed for learning the structure (i.e. productions) of string grammars (Stolcke, 1994) in the context of learning the structure of graph grammars (Doshi et al., 2002), the current focus is on parameter estimation.

The ability to learn grammar-based representations of probability distributions over graphs has the attractive property that non-terminals encode information about classes of functionally equivalent sub-graphs. For example, most money laundering schemes have a method for introducing illegal funds into the financial system and a method for moving the funds around to distance them from the source. If sub-graphs in the ground instances of money laundering schemes correspond to these methods, and there are different instantiations of each method, it is reasonable to expect that the learned grammar will contain a non-terminal that expands to ways of introducing funds into the financial system and another non-terminal that expands to ways of moving these funds around. Identifying these non-terminals in the learned grammar makes it possible to enumerate the sub-graphs they generate (i.e. all possible instantiations of a method) and to determine their probability of occurrence to, for example, focus law enforcement efforts.

From a more formal standpoint, graphs are logical structures, so individual graphs and sets of graphs can be described by logical formulas. It is possible to deduce properties of graphs and sets of graphs from these descriptions (Immerman, 1999). PEGG opens up the possibility of automatically synthesizing logical descriptions (i.e. graph grammars) of sets of graphs from data. For example, the expressive power of certain graph grammar formalisms is co-extensive with that of monadic second-order logic (Courcelle, 1997).

The remainder of this paper is organized as follows.
Section 2 describes stochastic context-free graph grammars and discusses their relationship to stochastic context-free string grammars. Despite the fact that graph grammars have a rich history of application in a variety of domains, no algorithms exist for learning them from data. To introduce the fundamental concepts of grammar induction from data, section 3 reviews the Inside-Outside algorithm for estimating the parameters of stochastic context-free string grammars. Section 4 introduces the Parameter Estimation for Graph Grammars (PEGG) algorithm for learning maximum likelihood parameter estimates for graph grammars. Section 5 presents the results of a set of preliminary experiments with PEGG. Section 6 reviews related work, concludes, and discusses a number of directions in which we are taking this research.
2 Graph Grammars
This section provides an overview of graph grammars. For a thorough introduction to the formal foundations of graph grammars, see (Engelfriet & Rozenberg, 1997), and to learn more about the vast array of domains in which graph grammars have been applied, see (Ehrig et al., 1999).
The easiest way to build intuition about graph grammars is by way of comparison with string grammars, for which we will take stochastic context-free grammars to be the paradigmatic example. (For the remainder of this paper the term string grammar means stochastic context-free string grammar.) A string grammar G is a 4-tuple (S, N, Σ, P ) where N is a set of non-terminal symbols, S ∈ N is the start symbol, Σ is a set of terminal symbols disjoint from N , and P is a set of productions. Associated with each production is a probability such that the probabilities for productions with the same left-hand side sum to one. Sometimes it will be convenient to describe grammars as being composed of structure and parameters, where the parameters are the production probabilities and the structure is everything else. In this paper we will be concerned exclusively with stochastic context-free graph grammars (Mosbah, 1994), and will use the term graph grammar to refer to grammars of this type. Despite the fact that our present concern is with stochastic graph grammars, it is important to note that prior work reported in the literature has focused almost exclusively on deterministic grammars. Just as string grammars define probability distributions over strings, graph grammars define probability distributions over graphs. A graph grammar G is a 4-tuple (S, N, Σ, P ) where N is a set of non-terminal symbols, S ∈ N is the start symbol, Σ is a set of terminal symbols disjoint from N , and P is a set of productions. Associated with each production is a probability such that the probabilities for productions with the same left-hand side sum to one. The primary difference between string grammars and graph grammars lies in the right-hand sides of productions. String grammar productions have strings of terminals and non-terminals on their right-hand sides. Graph grammar productions have graphs on their right-hand sides. 
At this point the reader may well be wondering where the terminals and non-terminals appear in the graphs generated by graph grammars. It turns out that they can be associated with nodes, yielding a class of grammars known as Node Controlled Embedding (NCE) graph grammars, or they can be associated with edges, yielding a class of grammars known as Hyperedge Replacement (HR) graph grammars. For reasons that will be discussed later, we focus exclusively on HR grammars. Figure 1 shows the three productions in a simple HR grammar (Drewes et al., 1997) that has one non-terminal - S. Each left-hand side is a single non-terminal and each right-hand side is a graph. Some of the edges in the graphs are labeled with non-terminals in boxes. These non-terminal edges can be expanded, a process that involves removing the edge and replacing it with the graph on the right-hand side of a matching production. Each right-hand side has a pair of nodes labeled 1 and 2 that are used to orient the graph when it replaces a non-terminal edge. We will generally use the term host graph to refer to the graph containing the non-terminal edge and the term sub-graph to refer to the graph that replaces the non-terminal edge. Figure 2 shows a partial derivation using the productions in figure 1. The second graph in figure 2 is obtained from the first by removing the labeled edge and replacing it with the sub-graph on the right-hand side of the second production in figure 1. After removing the edge, all that remains is two disconnected
nodes, one that used to be at the head of the edge and the other at the tail. The edge is replaced by gluing the node labeled 1 in the sub-graph to the node that was at the head of the removed edge. Likewise, the node labeled 2 in the sub-graph is glued to (i.e. made the same node as) the node that was at the tail of the removed edge.

[Figure 1: three productions, each with left-hand side S and a right-hand-side graph whose distinguished nodes are labeled 1 and 2.]

Fig. 1. Productions in a simple HR grammar
[Figure 2: a sequence of graphs in which non-terminal edges labeled S are successively replaced.]

Fig. 2. A partial derivation using the productions in figure 1

The last graph in figure 2 is obtained from the penultimate graph by replacing a non-terminal edge with the right-hand side of the first production in figure 1. This results in an edge with no label (a terminal edge), which can therefore not be expanded. A terminal graph is one that contains only terminal edges. Terminal edges can be unlabeled, as in the current example, or productions can specify labels for them from the set of terminals Σ. Note that every production in figure 1 has exactly two distinguished nodes, labeled 1 and 2, that are used to orient the sub-graph in the host graph when an edge is replaced.

When expanding a non-terminal in the derivation of a string there is no ambiguity about how to join the substrings to the left and right of the non-terminal with its expansion. Things are not so clear when expanding non-terminal edges to graphs. Given that the sub-graph to which the non-terminal is expanded will be attached by gluing, there are in general several possible attachments. Consider the second production in figure 1, whose right-hand side has three nodes. When it is used to replace a non-terminal edge, there are 6 possible ways of gluing the sub-graph to the host graph. Any of the three sub-graph nodes can be glued to the host graph node that was at the head of the non-terminal edge, and any of the remaining two sub-graph nodes can be glued to the host graph node that was at the tail of the non-terminal edge. To remove this ambiguity, each production specifies which nodes in the sub-graph are to be glued to which nodes in the host graph.

In general, non-terminal edges can be hyperedges that join more than two nodes. A hyperedge is said to be an n-edge if it joins n nodes. All of the hyperedges in the above example are 2-edges, or simple edges. If an n-edge labeled with non-terminal X is to be expanded, there must be a production with X as its left-hand side and a graph on its right-hand side that has n distinguished nodes (e.g. labeled 1 to n) that will be glued to the nodes that were attached to the hyperedge before it was removed.
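The gluing step described in this section can be sketched in a few lines of Python for the special case of 2-edges. This is a minimal illustration under assumptions of our own: graphs are lists of `(head, tail, label)` triples, a label of `None` marks a terminal edge, and the two productions `P1` and `P2` are hypothetical examples in the spirit of figure 1 rather than an exact copy of it.

```python
import itertools


def replace_edge(host, index, rhs):
    """Replace the non-terminal 2-edge host[index] by the graph `rhs`.

    host, rhs: lists of (head, tail, label) edges; label None = terminal edge.
    In `rhs`, nodes 1 and 2 are the distinguished nodes: 1 is glued to the
    head of the removed edge and 2 to its tail. All other rhs nodes receive
    fresh names so that they cannot clash with nodes already in the host.
    """
    head, tail, label = host[index]
    if label is None:
        raise ValueError("terminal edges cannot be expanded")
    used = {x for u, v, _ in host for x in (u, v)}
    glue = {1: head, 2: tail}
    counter = itertools.count()

    def node(v):
        if v not in glue:
            name = f"n{next(counter)}"
            while name in used:
                name = f"n{next(counter)}"
            used.add(name)
            glue[v] = name
        return glue[v]

    new_edges = [(node(u), node(v), lab) for u, v, lab in rhs]
    return host[:index] + new_edges + host[index + 1:]


# Hypothetical productions for the non-terminal S.
P1 = [(1, 2, None)]              # S -> a single terminal edge
P2 = [(1, 3, "S"), (3, 2, "S")]  # S -> two S-edges through a new node

start = [("a", "b", "S")]
step1 = replace_edge(start, 0, P2)
step2 = replace_edge(step1, 0, P1)
print(step1)  # [('a', 'n0', 'S'), ('n0', 'b', 'S')]
print(step2)  # [('a', 'n0', None), ('n0', 'b', 'S')]
```

Fixing the mapping of nodes 1 and 2 inside each production is what removes the gluing ambiguity discussed above; the function never has to choose among the possible attachments.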
3 Parameter Estimation for String Grammars
Our goal is to develop a set of algorithms for graph grammars that mirror those available for string grammars, with the starting point being an algorithm for estimating the parameters of graph grammars from data. This section reviews the most widely used algorithm for estimating the parameters of string grammars from data, the Inside-Outside (IO) algorithm (Lari & Young, 1990). This review will provide the necessary background for readers unfamiliar with IO and will make it possible to focus on issues specific to graph grammars in section 4 where we derive a version of IO for graph grammars (the PEGG algorithm).

Let $G = (S, \theta)$ be a stochastic context-free string grammar with structure $S$ and parameters $\theta$. Let $E$ be a set of training examples created by sampling from the probability distribution over strings defined by $G$. Given $S$ and $E$, the goal of parameter estimation is to obtain a set of parameters, $\hat{\theta}$, such that $p(E \mid S, \hat{\theta})$ is maximized.

If $G$ is unambiguous then maximum likelihood parameter estimation is easy. A grammar is unambiguous if every string in $L(G)$ has exactly one derivation (Hopcroft & Ullman, 1979). That is, given a string in $L(G)$ it is possible to determine which productions were used to derive the string. Let $c(X \to \gamma \mid s)$ be the number of times production $X \to \gamma$ is used in the derivation of string $s$. Let $c(X \to \gamma \mid E)$ be $\sum_{s \in E} c(X \to \gamma \mid s)$. Then the maximum likelihood estimate for $p(X \to \gamma)$ is:

\[ \hat{p}(X \to \gamma) = \frac{c(X \to \gamma \mid E)}{\sum_{\delta} c(X \to \delta \mid E)} \]
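For the unambiguous case, this estimate amounts to counting and normalizing. The Python sketch below illustrates it; the representation of a derivation as a list of `(lhs, rhs)` pairs and the toy grammar S → a S | b are assumptions made for this example.

```python
from collections import Counter, defaultdict


def ml_estimate(derivations):
    """Relative-frequency estimate of production probabilities.

    derivations: iterable of derivations, each a list of (lhs, rhs) pairs,
    one pair per production application.
    """
    counts = Counter(p for d in derivations for p in d)
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c  # denominator: total number of expansions of lhs
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}


# Toy grammar S -> a S | b; the strings "aab" and "b" each have one derivation.
aS, b = ("S", ("a", "S")), ("S", ("b",))
estimate = ml_estimate([[aS, aS, b], [b]])
print(estimate[aS], estimate[b])  # 0.5 0.5
```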
The estimate is simply the number of times $X$ was expanded to $\gamma$ divided by the number of times $X$ occurred.

If $G$ is ambiguous then strings in $L(G)$ can have multiple derivations, and parameter estimation becomes more difficult. The problem is that only the strings in $E$ are observable, not their derivations. Given a string $s \in E$, one of the possibly many derivations of $s$ was actually used to generate the string when sampling from the probability distribution over strings defined by $G$. It is production counts from this derivation, and none of the other legal derivations, that are needed for parameter estimation. This is an example of a hidden data problem. Given information about which derivation was used when sampling each $s \in E$, the estimation problem is easy, but we do not have this information.

As is typical in such cases the solution is Expectation Maximization (EM) (Dempster et al., 1977). For string $s$ with $m$ possible derivations $d_1, d_2, \ldots, d_m$ we introduce indicator variables $z_1, z_2, \ldots, z_m$, such that $z_i = 1$ if $d_i$ is the derivation used when $s$ was sampled. Otherwise, $z_i = 0$. The expected value of $z_i$ can therefore be computed as follows:

\[
\begin{aligned}
E[z_i] &= 1 \cdot p(z_i = 1) + 0 \cdot p(z_i = 0) \\
       &= p(z_i = 1) \\
       &= p(d_i \text{ is the true derivation}) \\
       &= \frac{p(d_i \mid G)}{\sum_j p(d_j \mid G)}
\end{aligned}
\]

In the expectation step, the indicator variables are used to compute expected counts:

\[
\begin{aligned}
\hat{c}(X \to \gamma \mid s) &= \sum_i E[z_i]\, c(X \to \gamma \mid d_i) \\
                             &= \sum_i \frac{p(d_i \mid G)\, c(X \to \gamma \mid d_i)}{\sum_j p(d_j \mid G)}
\end{aligned}
\tag{1}
\]

In the maximization step, the expected counts are used to compute new maximum likelihood parameter estimates:

\[ \hat{p}(X \to \gamma) = \frac{\hat{c}(X \to \gamma \mid E)}{\sum_{\delta} \hat{c}(X \to \delta \mid E)} \]

Iterating the E-step and the M-step is guaranteed to lead to a local maximum in the likelihood surface. The only potential difficulty is that computing expected counts requires summing over all possible derivations of a string, of which there may be exponentially many. The Inside-Outside algorithm uses dynamic programming to compute
these counts in polynomial time (Lari & Young, 1990). Our discussion of the algorithm will follow the presentation in (Charniak, 1993). For string $s$ and non-terminal $X$, let $s_{i,j}$ denote the sub-string of $s$ ranging from the $i$th to the $j$th position, and let $X_{i,j}$ denote the fact that non-terminal $X$ roots the subtree that derives $s_{i,j}$. We can now define the inside probability, $\beta_X(i,j)$, as the probability that $X$ will derive $s_{i,j}$. More formally:

\[ \beta_X(i,j) = p(s_{i,j} \mid X_{i,j}) \]

The outside probability, $\alpha_X(i,j)$, is the probability of deriving the string $s_{1,i-1}\, X\, s_{j+1,n}$ from the start symbol such that $X$ spans $s_{i,j}$. More formally:

\[ \alpha_X(i,j) = p(s_{1,i-1}, X_{i,j}, s_{j+1,n}) \]

In the formula above, $n = |s|$. As figure 3 suggests, given that non-terminal $X$ roots the sub-tree that derives $s_{i,j}$, the inside probability $\beta_X(i,j)$ is the probability of $X$ deriving the part of $s$ inside the sub-tree and the outside probability $\alpha_X(i,j)$ is the probability of the start symbol deriving the part of $s$ outside the sub-tree.

How are $\alpha$ and $\beta$ useful in parameter estimation? Rather than implementing equation 1 as a sum over derivations, we will soon see that knowing $\alpha$ and $\beta$ makes it possible to compute expected counts by summing over all possible substrings of $s$ that a given non-terminal can generate. For a string of length $n$ there are $n(n-1)/2$ substrings, which is far fewer than the worst case exponential number of possible derivations. For example, consider the somewhat simpler problem of computing the expected number of times $X$ occurs in a derivation of string $s$. This non-terminal can potentially root sub-trees that generate any of the $n(n-1)/2$ substrings of $s$. The expected number of occurrences of $X$ is thus given by the following sum:

\[ \hat{c}(X) = \sum_{i,j} p(X_{i,j} \mid s) \]
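This expression can be rewritten as follows in terms of inside and outside probabilities exclusively:

\[
\begin{aligned}
\hat{c}(X) &= \sum_{i,j} p(X_{i,j} \mid s) \\
           &= \frac{1}{p(s)} \sum_{i,j} p(X_{i,j}, s) \\
           &= \frac{1}{p(s)} \sum_{i,j} p(s_{1,i-1}, s_{i,j}, s_{j+1,n}, X_{i,j}) \\
           &= \frac{1}{p(s)} \sum_{i,j} p(s_{i,j} \mid X_{i,j})\, p(s_{1,i-1}, X_{i,j}, s_{j+1,n}) \\
           &= \frac{1}{p(s)} \sum_{i,j} \alpha_X(i,j)\, \beta_X(i,j)
\end{aligned}
\]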
The move from the first line to the second above is a simple application of the definition of conditional probability. We then expand $s$, apply the chain rule of probability, and finally substitute $\alpha$ and $\beta$.

Equation 1 requires $\hat{c}(X \to \gamma)$, not $\hat{c}(X)$. Suppose for the moment that our grammar is in Chomsky Normal Form. That is, all productions are of the form $X \to YZ$ or $X \to \sigma$ where $X$, $Y$, and $Z$ are non-terminals and $\sigma$ is a terminal. To compute $\hat{c}(X \to \gamma)$, rather than just summing over all possible substrings that $X$ can generate, we sum over all possible substrings that $X$ can generate and all possible ways that $Y$ and $Z$ can carve up the substring. Consider figure 3. If $X$ generates $s_{i,j}$ and $X$ expands to $YZ$, then concatenating the substring generated by $Y$ with the substring generated by $Z$ must yield $s_{i,j}$. The expected counts for $X \to YZ$ are defined to be:

\[ \hat{c}(X \to YZ) = \sum_{i,j,k} p(X_{i,j}, Y_{i,k}, Z_{k+1,j} \mid s) \]

It is easy to show that this is equivalent to:

\[ \hat{c}(X \to YZ) = \frac{1}{p(s)} \sum_{i,j,k} \beta_Y(i,k)\, \beta_Z(k+1,j)\, \alpha_X(i,j)\, p(X \to YZ) \tag{2} \]
A complete derivation will be given in the next section when a formula for computing expected counts for graph grammars is presented.
[Figure 3: a parse tree rooted at S over positions 1 to n, in which X spans $s_{i,j}$ and expands to Y Z, with Y spanning $s_{i,k}$ and Z spanning $s_{k+1,j}$.]

Fig. 3. Given that X derives $s_{i,j}$ and that X expands to Y Z, there are only $j - i + 1$ ways that Y and Z can carve up $s_{i,j}$.

Clearly, evaluating equation 2 requires $O(n^3)$ computation, in addition to that required to compute $\alpha$ and $\beta$. For string grammars, tables of $\alpha$ and $\beta$ values are computed via dynamic programming in $O(m^2 n)$ time where $m$ is the number
of non-terminals in the grammar. Section 4 will show a complete derivation of the formulas for computing α and β in the context of graph grammars.
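Before moving to graphs, the bottom-up computation of the inside probabilities for a string grammar in Chomsky Normal Form can be sketched as follows. This is a minimal illustration, not the authors' implementation; the dictionary encoding of the grammar and the toy grammar S → S S (0.4) | a (0.6) are assumptions chosen so that the result can be checked by hand.

```python
from collections import defaultdict


def inside(s, binary, lexical):
    """Return beta with beta[(X, i, j)] = p(X derives s_i..s_j), 1-based.

    binary:  dict mapping (X, Y, Z) to p(X -> Y Z)
    lexical: dict mapping (X, terminal) to p(X -> terminal)
    """
    n = len(s)
    beta = defaultdict(float)
    # Spans of length 1 are covered by lexical productions.
    for i, sym in enumerate(s, start=1):
        for (x, t), p in lexical.items():
            if t == sym:
                beta[(x, i, i)] += p
    # Longer spans combine two shorter, already computed spans.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (x, y, z), p in binary.items():
                for k in range(i, j):  # Y covers i..k, Z covers k+1..j
                    beta[(x, i, j)] += p * beta[(y, i, k)] * beta[(z, k + 1, j)]
    return beta


# Toy ambiguous grammar: S -> S S (0.4) | a (0.6).
binary = {("S", "S", "S"): 0.4}
lexical = {("S", "a"): 0.6}
beta = inside("aaa", binary, lexical)
print(beta[("S", 1, 3)])  # p("aaa" | G), summed over both parse trees
```

For "aaa" there are two parse trees, each with probability 0.4² · 0.6³ = 0.03456, so the inside probability of S over the whole string is 0.06912. The table sums over both trees without enumerating them.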
4 Parameter Estimation for Graph Grammars
In this section we define and derive analogs of inside and outside probabilities for graph grammars. Just as α and β can be computed efficiently, top down and bottom up respectively, for string grammars by combining sub-strings, they can be computed efficiently for graph grammars by combining sub-graphs. While there are only polynomially many sub-strings of any given string, there can in general be exponentially many sub-graphs of any given graph. It turns out that there is a natural class of graphs (Lautemann, 1990) for which the number of sub-graphs that one must consider when computing α and β is polynomial in the size of the graph. For this type of grammar, maximum likelihood parameter estimates can be computed in polynomial time.

For non-terminal hyperedge X and graph g we define the inside probability β_X(g) to be p(g|X), the probability that X will derive g. Note that β_S(g) is the probability of g in the distribution over graphs defined by the grammar. There are two cases to consider: either X derives g in one step, or X derives some other sub-graph in one step and g can be derived from that sub-graph in one or more steps:
\[ \beta_X(g) = p(g \mid X) = p(X \to g) + \sum_{X \to \gamma} p(X \to \gamma)\, p(\gamma \overset{*}{\to} g) \tag{3} \]
In equation 3, we use → to denote derivation in one step via a production in the grammar and →* to denote derivation in one or more steps.

The difficult part of evaluating equation 3 is computing p(γ →* g). Because γ is the right-hand side of a production it can be an arbitrary hypergraph. Suppose γ has m hyperedges h_1, h_2, ..., h_m. If γ can derive g, then there must be m graphs g_1, g_2, ..., g_m such that h_i derives g_i for 1 ≤ i ≤ m and the graph that results from replacing each h_i with the corresponding g_i is isomorphic to g. Note that each g_i must be isomorphic to a sub-graph of g for this to occur. It is therefore theoretically possible to determine if γ (and thus X) can derive g by generating all possible sub-graphs of g, forming all ordered sets of these sub-graphs of size m, generating the graphs that result from substituting the sub-graphs in each ordered set for the hyperedges in γ, and testing to see if any of the resulting graphs are equal to g. Because the sub-graphs are taken directly from g the equality test can be performed in polynomial time (i.e. a test for graph isomorphism is not required). However, there may be exponentially many sub-graphs. As stated earlier, we will restrict our attention to a robust class of graphs for which the number of subgraphs one must consider is small (polynomial). Let us finish our derivation of β_X(g) before getting to the details of computing it efficiently.
Let Ψ(γ, g) be the set that results from computing all ordered sub-sets of size m of the set of all sub-graphs of g. Recall that γ is a hypergraph with m hyperedges (i.e. non-terminals). Let Ψ_i(γ, g) be the ith element of this set. Each element of Ψ represents a mapping of hyperedges in γ to structure in g. If any of these mappings yields g, then γ can derive g.

To compute p(γ →* g) we simply need to iterate over each element of Ψ(γ, g), compute the probability of the joint event that each of the h_i derives the corresponding g_i, and sum this probability over the elements that produce a graph equal to g. That is, for each Ψ_i(γ, g) we need to compute:
\[ p\bigl(h^i_1 \overset{*}{\to} g^i_1,\; h^i_2 \overset{*}{\to} g^i_2,\; \ldots,\; h^i_m \overset{*}{\to} g^i_m\bigr) \]
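To give a feel for how quickly the set of ordered size-m selections of sub-graphs grows, the following sketch enumerates it for a tiny graph. It rests on an assumption not fixed by the text: a sub-graph is identified with a non-empty set of nodes (a node-induced sub-graph). The function names are hypothetical.

```python
from itertools import combinations, permutations
from math import perm

def node_induced_subgraphs(nodes):
    """Yield every non-empty node subset; each stands for one node-induced sub-graph."""
    nodes = list(nodes)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            yield frozenset(subset)

def enumerate_psi(nodes, m):
    """Yield all ordered selections of m distinct sub-graphs (candidate elements of Psi)."""
    return permutations(node_induced_subgraphs(nodes), m)

def psi_size(n_nodes, m):
    """Closed-form size of the enumeration: ordered m-selections of 2^n - 1 subsets."""
    return perm(2 ** n_nodes - 1, m)
```

Under this assumption the count is exponential in the number of nodes even for m = 1, which is why the restriction to a class of graphs with few relevant sub-graphs matters.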
Because HR graph grammars are context free, derivations that start from different hyperedges are completely independent of one another (Courcelle, 1987). Therefore, the probability of the joint event is equivalent to the product of the probabilities of the individual events. That is:
\[ p\bigl(h^i_1 \overset{*}{\to} g^i_1,\; h^i_2 \overset{*}{\to} g^i_2,\; \ldots,\; h^i_m \overset{*}{\to} g^i_m\bigr) = \prod_{j=1}^{m} p\bigl(h^i_j \overset{*}{\to} g^i_j\bigr) \]
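Combining the above with equation 3 yields an expression for β_X(g) in terms of other inside probabilities:
\[
\begin{aligned}
\beta_X(g) &= p(X \to g) + \sum_{X \to \gamma} p(X \to \gamma)\, p(\gamma \overset{*}{\to} g) \\
&= p(X \to g) + \sum_{X \to \gamma} p(X \to \gamma) \sum_i p(\Psi_i(\gamma, g)) \\
&= p(X \to g) + \sum_{X \to \gamma} p(X \to \gamma) \sum_i p\bigl(h^i_1 \overset{*}{\to} g^i_1,\; h^i_2 \overset{*}{\to} g^i_2,\; \ldots,\; h^i_m \overset{*}{\to} g^i_m\bigr) \\
&= p(X \to g) + \sum_{X \to \gamma} p(X \to \gamma) \sum_i \prod_{j=1}^{m} p\bigl(h^i_j \overset{*}{\to} g^i_j\bigr) \\
&= p(X \to g) + \sum_{X \to \gamma} p(X \to \gamma) \sum_i \prod_{j=1}^{m} p\bigl(g^i_j \mid h^i_j\bigr) \\
&= p(X \to g) + \sum_{X \to \gamma} p(X \to \gamma) \sum_i \prod_{j=1}^{m} \beta_{h^i_j}\bigl(g^i_j\bigr)
\end{aligned}
\tag{4}
\]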
Equation 4 makes it possible to compute inside probabilities in terms of other inside probabilities. Note that this computation can proceed bottom up because the sub-graphs considered in the inner sum, i.e. in the recursive computation of β_{h^i_j}(g^i_j), must be smaller than g because they are composed via γ to yield g. That is, one can compute β for sub-graphs containing one node, then sub-graphs containing two nodes, and so on. The number of levels in this bottom up computation is bounded by the size of g. The outer summation is linear
in the number of productions in the grammar, and the product is linear in the maximum number of hyperedges in any right-hand side, which we assume to be bounded by a small constant. However, the inner sum iterates over all elements of Ψ, of which there can be exponentially many.

If the number of sub-graphs considered in the inner sum in equation 4 were polynomial, then all inside probabilities could be computed in polynomial time. Lautemann (1990) defines a class of HR grammars for which this is the case, i.e. grammars with logarithmic k-separability. The k-separability of graph g (see definition 3.2.3 in (Lautemann, 1990)) is the maximum number of connected components that can be produced by removing k or fewer nodes from g. This definition becomes useful for our current purposes when considered in conjunction with lemma 3.2.1 from (Lautemann, 1990).

To build intuition before stating the lemma, consider how you might try to determine if a hypergraph, γ, with a single hyperedge, h, can generate a given graph, g. Note that all of the nodes and edges in the hypergraph must appear in the final graph. You might therefore try all possible mappings of nodes and edges in the hypergraph to nodes and edges in the graph, and see if the hyperedge can generate the unmapped remainder of the graph. The lemma says, essentially, that if replacing hyperedge h in hypergraph γ with graph g′ yields graph g, then every connected component in g′ minus the nodes in h is a connected component in g minus the nodes in h. That is, if you map the nodes in γ to nodes in g and then remove those mapped nodes from g and find the connected components in the resulting graph, you will have enumerated (at least) all of the connected components in the sub-graph with which h should be replaced to derive g.
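The definition of k-separability quoted above can be illustrated with a brute-force sketch. It is meant only to make the definition tangible on small graphs; it is exponential in k and is not the procedure used by PEGG. The adjacency-dictionary encoding and the function names are assumptions of this example.

```python
from itertools import combinations

def count_components(adj, removed):
    """Number of connected components left after deleting the nodes in `removed`.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    remaining = set(adj) - set(removed)
    seen, components = set(), 0
    for start in remaining:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first search restricted to the remaining nodes
            node = stack.pop()
            for nb in adj[node]:
                if nb in remaining and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return components

def k_separability(adj, k):
    """Maximum number of components obtainable by removing k or fewer nodes."""
    nodes = list(adj)
    best = 0
    for r in range(0, min(k, len(nodes)) + 1):
        for removed in combinations(nodes, r):
            best = max(best, count_components(adj, removed))
    return best

# A star on five nodes: node 1 is the hub, nodes 2..5 are leaves.
star = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}
# Five isolated nodes, no edges.
edgeless = {v: set() for v in range(1, 6)}
```

Removing the hub of the star leaves one component per leaf, and removing any single node of the edgeless graph leaves all remaining nodes as separate components; in both cases the count grows linearly with the number of nodes.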
Therefore, rather than enumerating all possible sub-graphs of g to determine if γ →* g, we can form all possible mappings of nodes and edges in γ onto g, compute the connected components that result when the mapped nodes are removed from g, and consider only those sub-graphs that are combinations of these connected components. Because γ is the right-hand side of a production and we assume that its size is bounded by a small constant, the number of possible mappings of γ onto g is polynomial in the size of g. If we further assume that the k-separability of the graph is logarithmic, then the number of connected components formed for each mapping of γ onto g is O(log |g|) and there are only polynomially many possible combinations of connected components. In polynomial time we can compute all of the subgraphs that need to be considered in the inner sum of equation 4, of which there are polynomially many. All of the inside probabilities can therefore be computed in polynomial time.

Intuitively, bounded k-separability requires that graphs have bounded degree and be connected. Consider the language containing all star graphs, i.e. graphs containing n nodes where nodes 2 through n have a single edge to node 1. If node 1 is removed, n − 1 connected components are created. At the other extreme, consider a graph of n nodes and no edges. Removing any one node results in a graph with n − 1 connected components. In both cases, the k-separability of the graph is linear in the size of the graph. For k-separability to have a lower
bound, there must be a bound on node degree and the graph must be (mostly) connected.

We now turn to the derivation of the outside probability. Recall that the inside probability β_X(g) is the probability that a non-terminal hyperedge labeled X will generate graph g. In practice, given a graph G, β values are computed for sub-graphs of G. That is, β_X(g) is computed for values of g corresponding to different sub-graphs of some fixed graph G. The outside probability α_X(g) is the probability that the start symbol will generate the graph formed by replacing sub-graph g in graph G with a non-terminal hyperedge labeled X. It is called the outside probability because α is the probability of generating the graph structure in G outside the sub-graph generated by X. Note that the quantity α_X(g)β_X(g) is the probability of generating G in such a way that non-terminal X generates sub-graph g.

How might non-terminal X become responsible for generating sub-graph g? Suppose Y is a non-terminal, Y → γ is a production in the grammar, and γ contains a hyperedge labeled X. Further, let g′ be a subgraph of G that contains g. If Y is responsible for generating g′, then it could be the case that X generates g and the remainder of γ generates the remainder of g′. That is, we can compute outside probabilities from outside probabilities of larger subgraphs. Let X_g denote the fact that X generates subgraph g, and h(γ) denote the set of hyperedges contained in hypergraph γ. The outside probability can be derived as follows:
\[
\begin{aligned}
\alpha_X(g) &= p(X_g,\, G - g) \\
&= \sum_{Y \to \gamma,\, X \in h(\gamma)} \; \sum_{g',\, g \subseteq g'} p\bigl(G - g',\; g' - g,\; X_g,\; Y_{g'},\; (\gamma - X)_{g'-g}\bigr) \\
&= \sum_{Y \to \gamma,\, X \in h(\gamma)} \; \sum_{g',\, g \subseteq g'} p(Y_{g'},\, G - g')\; p\bigl(X_g,\, (\gamma - X)_{g'-g} \mid Y_{g'},\, G - g'\bigr)\; p\bigl(g' - g \mid Y_{g'},\, G - g',\, X_g,\, (\gamma - X)_{g'-g}\bigr) \\
&= \sum_{Y \to \gamma,\, X \in h(\gamma)} \; \sum_{g',\, g \subseteq g'} p(Y_{g'},\, G - g')\; p\bigl(X_g,\, (\gamma - X)_{g'-g} \mid Y_{g'}\bigr)\; p\bigl(g' - g \mid (\gamma - X)_{g'-g}\bigr) \\
&= \sum_{Y \to \gamma,\, X \in h(\gamma)} \; \sum_{g',\, g \subseteq g'} \alpha_Y(g')\; p(Y \to \gamma)\; p\bigl((\gamma - X)_{g'-g} \overset{*}{\to} g' - g\bigr) \\
&= \sum_{Y \to \gamma,\, X \in h(\gamma)} \; \sum_{g',\, g \subseteq g'} \alpha_Y(g')\; p(Y \to \gamma) \sum_i \prod_{j \in h(\gamma) - X} \beta_{h^i_j}\bigl(g^i_j\bigr)
\end{aligned}
\tag{5}
\]
The second line is derived from the first by summing over all productions, Y → γ, for which X is a hyperedge in γ, and all subgraphs, g′, such that γ generates g′, g is a subgraph of g′, and X is a hyperedge in γ that generates g. Then we apply the chain rule of probability, and substitute α and β. The move from the fifth line to the sixth line involves the same steps used in the derivation of the formula for inside probabilities that expand p(γ →* g) into an expression that can be computed in terms of other inside probabilities.
Outside probabilities can be computed from the top down, with the base case being α_S(G) = 1. That is, with probability 1 the start symbol is responsible for generating any graph in the language of the grammar. As with β, all of the sums and products are polynomial in the size of the graph except the one that iterates over the elements of Ψ(γ, G). However, as with β, if the graph has logarithmic k-separability, there are only polynomially many elements of Ψ(γ, G) to consider. Therefore, all outside probabilities can be computed in polynomial time.

Once we know the inside and outside probabilities, we can compute the expected counts of each production in the grammar. For productions of the form X → γ, these counts are computed as follows:
\[
\begin{aligned}
\hat{c}(X \to \gamma) &= \sum_g p(X_g,\, \gamma \mid G) \\
&= \frac{1}{p(G)} \sum_g p(X_g,\, \gamma,\, G) \\
&= \frac{1}{p(G)} \sum_g p(X_g,\, \gamma,\, G - g,\, g) \\
&= \frac{1}{p(G)} \sum_g p(G - g,\, X_g)\; p(\gamma \mid X_g)\; p(g \mid \gamma) \\
&= \frac{1}{p(G)} \sum_g \alpha_X(g)\; p(X \to \gamma)\; p(\gamma \overset{*}{\to} g) \\
&= \frac{1}{p(G)} \sum_g \alpha_X(g)\; p(X \to \gamma) \sum_i \prod_{j=1}^{m} \beta_{h^i_j}\bigl(g^i_j\bigr)
\end{aligned}
\]
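For productions of the form X → g, i.e. the right-hand side is a terminal graph, the counts are computed as follows:
\[
\begin{aligned}
\hat{c}(X \to g) &= \sum_{g'} p(X_{g'},\, g' = g \mid G) \\
&= \frac{1}{p(G)} \sum_{g'} p(X_{g'},\, g' = g,\, G) \\
&= \frac{1}{p(G)} \sum_{g'} p(X_{g'},\, g' = g,\, G - g',\, g') \\
&= \frac{1}{p(G)} \sum_{g'} p(G - g',\, X_{g'})\; p(g',\, g' = g \mid X_{g'}) \\
&= \frac{1}{p(G)} \sum_{g'} \alpha_X(g')\; p(X \to g,\, g' = g)
\end{aligned}
\]
In the above formulas, g′ = g denotes that g′ is isomorphic to g.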
5
Preliminary Empirical Results
This section reports the results of some simple, preliminary experiments with an implementation of PEGG. Let G = (S, θ) be a stochastic context-free HR graph grammar with structure S and parameters θ. Let E be a set of training examples created by sampling from the probability distribution over graphs defined by G. Given S and E, the goal of PEGG is to obtain a set of parameters, θ̂, such that p(E | S, θ̂) is maximized. We used the grammar shown in figure 1 for S, and the true parameters were θ = (0.6, 0.2, 0.2). That is, the probability of expanding a hyperedge labeled S with the first production is 0.6, with the second production is 0.2, and with the third production is 0.2.

In the first experiment we sampled 1, 5, and 10 graphs from the grammar and ran PEGG on these samples. The learned parameters are shown in table 1. In all cases the parameters appear to be “reasonable”, but they do deviate from the desired parameters. This might be due to the fact that the samples are small and are therefore not representative of the true distribution over graphs defined by the grammar. To test this hypothesis, we took another sample of 10 graphs for which the estimated parameters deviated significantly from the true parameters. Given the derivations of the 10 graphs, it was a simple matter to count the number of times the various productions were applied and manually compute maximum likelihood parameters. Both the estimated parameters and the ML parameters are shown in table 2. Note that the sample of 10 graphs was clearly not representative of the distribution over graphs defined by the true parameters. The Kullback-Leibler divergence between θ = (0.6, 0.2, 0.2) and θ_ML = (0.75, 0.15, 0.10) is 0.8985. However, PEGG did a good job of estimation. The KL divergence between θ_PEGG and θ_ML is 0.007, two orders of magnitude less than the divergence with the true parameters.
Finally, to determine if PEGG was finding parameters that actually maximize the likelihood of the data, we computed the log-likelihood of a sample of 5 graphs given the true parameters and the parameters estimated for that sample. The log-likelihood of the data given the true parameters was -9.38092, and it was -9.40647 given the estimated parameters, a difference of less than three-tenths of one percent.

Table 1. Parameters estimated by PEGG for samples of size 1, 5, and 10

|E|   θ̂1      θ̂2      θ̂3
1     0.5714  0.2857  0.1429
5     0.6206  0.2168  0.1626
10    0.6486  0.2973  0.0541
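The two manual steps described in this section, computing maximum likelihood parameters from production counts and comparing parameter vectors by Kullback-Leibler divergence, can be sketched as follows. The production counts used here are hypothetical (the per-production counts of the sample are not reported); they were chosen only so that the relative frequencies equal the reported θ_ML. The logarithm base and the argument order of the divergence are likewise assumptions, so values produced by this sketch need not coincide with the reported figures.

```python
from math import log

def ml_parameters(counts):
    """Relative-frequency (maximum likelihood) estimate for productions sharing one left-hand side."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler divergence D(p || q); the base is an assumption, not taken from the text."""
    return sum(pi * log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical application counts of the three productions for S.
hypothetical_counts = [15, 3, 2]
theta_ml = ml_parameters(hypothetical_counts)
theta_true = [0.6, 0.2, 0.2]
theta_pegg = [0.7368, 0.1579, 0.1053]

d_true = kl_divergence(theta_true, theta_ml)
d_pegg = kl_divergence(theta_pegg, theta_ml)
```

Whatever base is used, the comparison of interest is the ratio between the two divergences, since changing the base rescales both by the same constant.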
6
Conclusion
This paper introduced the Parameter Estimation for Graph Grammars (PEGG) algorithm, the first algorithm for estimating the parameters of stochastic context-free hyperedge replacement graph grammars. PEGG computes inside and outside probabilities in polynomial time for graphs with logarithmic k-separability. In addition, PEGG uses these probabilities to compute maximum likelihood parameters for a fixed grammar structure and a sample of graphs drawn from some probability distribution over graphs.

Despite the fact that graph grammars have been an active area of research since the late 1960s, almost no work has dealt with stochastic graph grammars. One notable exception is (Mosbah, 1994), which explores the properties of graphs sampled from stochastic graph grammars. There are only a handful of papers that directly address the problem of learning graph grammars, and none other than the current paper that leverages the vast body of work on inferring string grammars from data. (Bartsch-Sprol, 1983) describes an enumerative (i.e. computationally infeasible) method for inferring a restricted class of context-sensitive graph grammars. (Jeltsch & Kreowski, 1991) describes an algorithm for extracting common hyperedge replacement substructures from a set of graphs via merging techniques. This work is similar to that reported in (Jonyer et al., 2002) in which merging techniques were used to extract node replacement sub-structures from a set of graphs. Fletcher (2001) developed a connectionist method for learning regular graph grammars. To the best of our knowledge, our paper is the first to present a formally sound algorithm for computing maximum likelihood parameter estimates for a large class of HR graph grammars.

Future work will involve developing an approach to inferring the structure of HR graph grammars based on Bayesian model merging techniques, similar to those we developed for node replacement grammars (Doshi et al., 2002).
In combination with the PEGG algorithm described in this paper the result will be a powerful tool for inferring HR graph grammars from data. In addition, we are considering applications of this tool in the domain of bioinformatics. Finally, we are attempting to understand the relationship between stochastic context-free graph grammars and stochastic definite clause grammars as used in stochastic logic programming (Muggleton, 1996).

Table 2. Estimated parameters and manually computed ML parameters for a sample of size 10.

       PEGG    ML
θ̂1    0.7368  0.75
θ̂2    0.1579  0.15
θ̂3    0.1053  0.10
References

Bartsch-Sprol, B. (1983). Grammatical inference of graph grammars for syntactic pattern recognition. In H. Ehrig, M. Nagl, and G. Rozenberg (Eds.), Proceedings of the second international workshop on graph grammars and their applications to computer science. Springer Verlag.
Charniak, E. (1993). Statistical language learning. MIT Press.
Cook, D., Holder, L.B., Su, S., Maglothin, R., and Jonyer, I. (2001). Structural mining of molecular biology data. IEEE Engineering in Medicine and Biology, 20, 231–255.
Courcelle, B. (1987). An axiomatic definition of context-free rewriting and its application to NLC graph grammars. Theoretical Computer Science, 55, 141–181.
Courcelle, B. (1997). The expression of graph properties and graph transformations in monadic second-order logic. In G. Rozenberg (Ed.), Handbook of graph grammars and computing by graph transformation: Foundations. World Scientific Publishing Company.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39, 1–38.
Doshi, S., Huang, F., and Oates, T. (2002). Inferring the structure of graph grammars from data. Proceedings of the International Conference on Knowledge-Based Computer Systems.
Drewes, F., Kreowski, H.J., and Habel, A. (1997). Hyperedge replacement graph grammars. In G. Rozenberg (Ed.), Handbook of graph grammars and computing by graph transformation: Foundations. World Scientific Publishing Company.
Ehrig, H., Engels, G., Kreowski, H.-J., and Rozenberg, G. (Eds.). (1999). Handbook of graph grammars and computing by graph transformation: Applications, languages and tools. World Scientific Publishing Company.
Engelfriet, J., and Rozenberg, G. (1997). Node replacement graph grammars. In G. Rozenberg (Ed.), Handbook of graph grammars and computing by graph transformation: Foundations. World Scientific Publishing Company.
Fletcher, P. (2001). Connectionist learning of regular graph grammars. Connection Science, 13, 127–188.
Hong, P., and Huang, T.S. (2002). Spatial pattern discovery by learning a probabilistic parametric model from multiple attributed relational graphs. Journal of Discrete Applied Mathematics.
Hopcroft, J.E., and Ullman, J.D. (1979). Introduction to automata theory, languages, and computation. Addison-Wesley Publishing Company.
Immerman, N. (1999). Descriptive complexity. Springer.
Jeltsch, E., and Kreowski, H.J. (1991). Grammatical inference based on hyperedge replacement. Lecture Notes in Computer Science, 532, 461–474.
Jonyer, I., Holder, L.B., and Cook, D.J. (2002). Concept formation using graph grammars. Working Notes of the KDD Workshop on Multi-Relational Data Mining.
Lari, K., and Young, S.J. (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4, 35–56.
Lautemann, C. (1990). The complexity of graph languages generated by hyperedge replacement. Acta Informatica, 27, 399–421.
Mosbah, M. (1994). Properties of random graphs generated by probabilistic graph grammars. Proceedings of the Fifth International Workshop on Graph Grammars and their Application to Computer Science.
Muggleton, S. (1996). Stochastic logic programs. In L. De Raedt (Ed.), Advances in inductive logic programming, 254–264.
Office of Technology Assessment, U.C. (1995). Information technologies for control of money laundering. OTA-ITC-360.
Pattison, P.E. (1993). Algebraic models for social networks. Cambridge University Press.
Stolcke, A. (1994). Bayesian learning of probabilistic language models. Doctoral dissertation, University of California, Berkeley.
Induction of the Effects of Actions by Monotonic Methods

Ramon P. Otero
Department of Computer Science, University of Corunna
Corunna 15071, Galicia, Spain
[email protected]
Abstract. Induction of the effects of actions considered here consists in learning an action description of a dynamic system from evidence on its behavior. General logic-based induction methods can deal with this problem but, unfortunately, most of the solutions provided have the frame problem. To cope with the frame problem induction under suitable nonmonotonic formalisms has to be used, though this kind of induction is not well understood yet. We propose an alternative method that relies on the identification of a monotonic induction problem whose solutions correspond one-to-one to those of the original problem without the frame problem. From this result induction of the effects of actions can be characterized under current monotonic induction methods.
1
Preliminaries
Induction of the effects of actions considered here consists in learning an action description of a dynamic system from evidence on its behavior. The area of Inductive Logic Programming (ILP) that studies learning under logic-based formalisms, mainly logic programs (LP), defines an (explanatory) induction problem, in general, as follows. Given the sets of clauses E⁺, called positive examples, negative examples E⁻, and background knowledge B, find a set of clauses H such that
\[ B \cup H \models e_i^{+} \quad \text{for every } e_i^{+} \in E^{+}, \]
\[ B \cup H \not\models e_j^{-} \quad \text{for every } e_j^{-} \in E^{-}, \text{ and} \]
\[ B \cup H \not\models \bot. \]
Sometimes it is also required that B ⊭ e_i⁺ for every e_i⁺ ∈ E⁺ and that B ⊭ e_j⁻ for every e_j⁻ ∈ E⁻, but these will not be demanded in this work. On the other hand, it is usually assumed that H is a clause universally quantified at least in one variable. Induction of the effects of actions can be defined as an induction problem in this framework. Evidence on the behavior of the dynamic system is assumed in the following form.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 299–310, 2003. © Springer-Verlag Berlin Heidelberg 2003
Definition 1 (Narrative) Given a domain with a set of fluent-names 𝓕 and a set of action-names 𝓐, a narrative is a pair (F, A) where each component is a set of ground unit clauses verifying
\[ F \subseteq \{ f(s_i),\ \neg f(s_i) \mid f \in \mathcal{F},\ 0 \le s_i \le n \} \]
\[ A \subseteq \{ a(s_i) \mid a \in \mathcal{A},\ 1 \le s_i \le n \} \]
where f(s_i) and ¬f(s_i) are literals on fluent f, complementary to each other. The natural number s_i is the situation constant and 0 names the initial situation. The maximum number of situations in the narrative (length) is (n + 1). A narrative is consistent iff the set of clauses F is consistent, i.e. F does not contain a complementary pair of literals f(s_i) and ¬f(s_i).

Definition 2 Evidence is a set of narratives, for a given domain.

Narratives correspond to facts true in the domain at different time points, represented by fluent literals at different situation constants; action literals represent the actions performed in the domain at the corresponding time. A narrative thus represents evidence on the particular behavior of the dynamic system after a particular sequence of actions. Different narratives correspond to different sequences of actions and different initial facts in the domain. The action description of the domain is a set of action laws as follows.

Definition 3 (Hypotheses) An action law on an effect of a fluent f ∈ 𝓕 is a clause C
\[ e(S) \leftarrow a(S),\ prev(S, PS),\ precond(PS) \]
where e(S) is f(S) or ¬f(S) for the fluent f ∈ 𝓕 and is called the effect literal; S, PS are situation variables (universally quantified); a ∈ 𝓐; precond(PS) may be missing and is a conjunction of literals f(PS) and ¬f(PS), with f ∈ 𝓕; the relation prev(S, PS) is defined as the successor relation in naturals, i.e. S is the successor of PS.

Each action law corresponds to one effect literal; the whole description will contain action laws for each effect on the fluents.
The induction problem can be defined on one effect at a time; to this end the examples that correspond to the selected target effect are extracted from the evidence narratives.

Definition 4 (Examples) Given a narrative (F, A) and some effect literal e of a fluent in 𝓕, the set of positive examples on e is
\[ E^{+} = \{ e(s_i) \mid e(s_i) \in F,\ 1 \le s_i \} \]
and the set of negative examples on e is
\[ E^{-} = \{ e(s_i) \mid \bar{e}(s_i) \in F,\ 1 \le s_i \} \]
where ē(s_i) is the literal complementary to e(s_i).
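Definitions 1 and 4 can be illustrated with a small sketch. The encoding is an assumption made for this example: a fluent literal is a triple (fluent name, truth value, situation), and examples are returned as the sets of situations at which they occur. The lamp narrative and the function names are hypothetical.

```python
def is_consistent(fluent_facts):
    """A narrative is consistent iff F holds no complementary pair f(s), not-f(s)."""
    return not any((f, not v, s) in fluent_facts for (f, v, s) in fluent_facts)

def examples(fluent_facts, fluent, positive=True):
    """Situations of the positive and negative examples on the target effect literal.

    The target literal e is `fluent` with truth value `positive`.
    E+ collects situations s >= 1 where e(s) is in F;
    E- collects situations s >= 1 where the complement of e(s) is in F.
    """
    e_pos = {s for (f, v, s) in fluent_facts
             if f == fluent and v == positive and s >= 1}
    e_neg = {s for (f, v, s) in fluent_facts
             if f == fluent and v != positive and s >= 1}
    return e_pos, e_neg

# Hypothetical narrative: a lamp that is off, switched on, left alone, switched off.
lamp_F = {("lit", False, 0), ("lit", True, 1), ("lit", True, 2), ("lit", False, 3)}
lamp_A = {("on", 1), ("wait", 2), ("off", 3)}
lamp_pos, lamp_neg = examples(lamp_F, "lit", positive=True)
```

Note that the literal at situation 0 never becomes an example, in line with the bound 1 ≤ s_i in Definition 4.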
Definition 5 (Induction of the Effects of Actions from One Narrative) Given some evidence E = {(F, A)} on a narrative of a domain and a target effect literal e on a fluent in 𝓕, a solution to induction of the effect literal e is a set of action laws H on it, verifying
\[ (X \cup F' \cup A \cup H) \models e^{+}(s_i) \quad \text{for every } e^{+}(s_i) \in E^{+} \]
\[ (X \cup F' \cup A \cup H) \not\models e^{-}(s_i) \quad \text{for every } e^{-}(s_i) \in E^{-} \]
\[ (X \cup F' \cup A \cup H) \not\models \bot \]
where F′ = (F \ E⁺), E⁺ and E⁻ are the sets of positive and negative examples on e, and X is a set of clauses defining relation prev(S, PS).

It follows that this induction problem is a case of (explanatory) induction in ILP, with background B = (X ∪ F′ ∪ A).

Definition 6 (Induction of the Effects of Actions) Given some evidence E = {(F_l, A_l), 1 ≤ l ≤ m} on m narratives of a domain and a target effect literal e on a fluent in 𝓕, a solution to induction of the effect literal e is a set of action laws H on it, verifying that H is a solution to induction on every narrative (F_l, A_l) ∈ E.

Proposition 1. There is no solution to induction from inconsistent narratives.

Proof. For inconsistent narratives two literals e(s_i) and ē(s_i) belong to F for some fluent f and situation constant s_i. If e(s_i) is a target effect literal, by definition of examples, e(s_i) belongs to both E⁺ and E⁻; then it must be entailed and not entailed by every solution to induction, which is impossible. If e(s_i) is not a target effect, both e(s_i) and ē(s_i) are included in F′, hence (monotonically) entailed by it, and no solution to induction can satisfy the condition (X ∪ F′ ∪ A ∪ H) ⊭ ⊥.

As inconsistent narratives in any fluent do not seem to represent actual behaviors, we will consider induction problems from consistent narratives. Most of the solutions to induction as defined before have the frame problem, which can be defined as follows in the representation we are using.

Definition 7 (Frame Problem) A solution H on effect literal e has the frame problem iff it contains an action law that includes e(PS) in precond(PS) (a frame axiom).

To show the existence of the frame problem, consider an action that does not affect the target fluent; this is a common case as fluents are not affected by every action. The narrative likely contains two situations at which this action is performed but the target fluent holds on complementary literals.
If the rest of the fluents are the same at both situations, the only available difference to explain the complementary effects of the action is the literal of the fluent at the
previous situation, which is (respectively) the same as the action has no effect on it. Notice that the number of frame axioms needed for a domain is the number of fluents times the number of actions which do not affect the fluent. The definition of induction cannot be simply restricted by disallowing frame axioms in H, as most of the domains will then have no solution. The frame problem is solved by inertia axioms under suitable nonmonotonic formalisms. The nonmonotonic behavior of LP due to negation as failure (NAF) can be used to represent inertia axioms.

Definition 8 (Inertia Axioms) An inertia axiom on a literal e of a fluent in 𝓕 is a clause I
\[ e(S) \leftarrow prev(S, PS),\ e(PS),\ \text{not } \bar{e}(S) \]
where S, PS are situation variables (universally quantified), ē(S) is the literal complementary to e(S), and prev(S, PS) is the successor relation as defined before.

Inertia axioms do not mention action literals, so a single one suffices for each effect, which strongly reduces the size of the description. Furthermore they have a general form, independent of the domain, and known in advance.

Definition 9 (Induction of the Effects of Actions without the Frame Problem) Given some evidence E = {(F, A)} on a narrative of a domain and a target effect literal e on a fluent in 𝓕, a solution to induction of the effect literal e is a set of action laws H_I on it, not including frame axioms, verifying
\[ (X \cup I \cup F' \cup A \cup H_I) \models e^{+}(s_i) \quad \text{for every } e^{+}(s_i) \in E^{+} \]
\[ (X \cup I \cup F' \cup A \cup H_I) \not\models e^{-}(s_i) \quad \text{for every } e^{-}(s_i) \in E^{-} \]
\[ (X \cup I \cup F' \cup A \cup H_I) \not\models \bot \]
where F′ = (F \ E⁺), E⁺ and E⁻ are the sets of positive and negative examples on e, X is a set of clauses defining relation prev(S, PS), and I is the inertia axiom on e.

Again this induction problem is a case of (explanatory) induction in ILP, with background B = (X ∪ I ∪ F′ ∪ A). This time however, B is a normal logic program using negation as failure, the not operator, in the inertia axioms. Unfortunately, it is not known how to efficiently solve an induction problem under normal background knowledge. But see [1] for a characterization of induction in normal logic programs. As an indication of the nonmonotonic behavior note that it will usually be the case that B ⊨ e⁻(s_i) for some e⁻(s_i) ∈ E⁻, while this will not imply that there is no solution; for this reason the condition was not required in the definition.
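The interplay of action laws and inertia can be pictured with a forward simulation. This is a procedural reading offered only as an illustration, not the logic-program semantics itself: it assumes a complete initial situation, one action per situation, and at most one applicable law per fluent at each step. The tuple encoding of laws and the function name `project` are hypothetical.

```python
def project(initial, actions, laws):
    """Compute fluent values situation by situation.

    initial: dict fluent -> bool, the values at situation 0.
    actions: list where actions[s - 1] is the action performed at situation s.
    laws:    list of (fluent, value, action, precond); precond is a dict
             fluent -> bool that must hold at the previous situation.
    """
    states = [dict(initial)]
    for action in actions:
        prev = states[-1]
        derived = {}
        for fluent, value, act, precond in laws:
            # An action law fires when its action occurs and its
            # preconditions hold at the previous situation.
            if act == action and all(prev.get(f) == v for f, v in precond.items()):
                derived[fluent] = value
        # Inertia: a fluent keeps its previous value unless a law derived a value for it.
        states.append({f: derived.get(f, v) for f, v in prev.items()})
    return states

# Hypothetical lamp domain with two action laws and no frame axioms.
lamp_laws = [("lit", True, "on", {}), ("lit", False, "off", {})]
lamp_states = project({"lit": False}, ["on", "wait", "off"], lamp_laws)
```

In the lamp run, the value of `lit` at the situation of the `wait` action is not produced by any law; it persists from the previous situation, which is the role the inertia axiom plays in the background knowledge.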
2
Monotonic Method
Once inertia axioms are included in the background knowledge some of the examples are entailed by inertia, the so-called persistent examples. We can focus our attention on the rest of the examples and define the sub-problem of induction from the examples that can never be entailed from just inertia. We call these examples on change.

Definition 10 (Examples on Change) Given a narrative (F, A) and some effect literal e of a fluent in 𝓕, the set of positive examples on change to e is
\[ PE^{+} = \{ e(s_i) \mid e(s_i) \in F \text{ and } \bar{e}(s_{i-1}) \in F \} \]
and the set of negative examples on change to e is
\[ PE^{-} = \{ e(s_i) \mid \bar{e}(s_i) \in F,\ 1 \le s_i \} \]
where ē(s_i) is the literal complementary to e(s_i) and s_i, s_{i−1} verify prev(s_i, s_{i−1}).

Compared with the whole sets of examples, PE⁺ ⊆ E⁺ and PE⁻ = E⁻. Notice that positive examples on change intuitively correspond with change on the fluent, whereas negative examples are not defined following the intuition of non-change on the fluent; this is important for completeness.

Definition 11 (Monotonic Induction of the Effects of Actions) Given some evidence E = {(F, A)} on a narrative of a domain and a target effect literal e on a fluent in 𝓕, a solution to monotonic induction of the effect literal e is a set of action laws H_M on it, not including frame axioms, verifying
\[ (X \cup F' \cup A \cup H_M) \models e^{+}(s_i) \quad \text{for every } e^{+}(s_i) \in PE^{+} \]
\[ (X \cup F' \cup A \cup H_M) \not\models e^{-}(s_i) \quad \text{for every } e^{-}(s_i) \in PE^{-} \]
\[ (X \cup F' \cup A \cup H_M) \not\models \bot \]
where F′ = (F \ E⁺), PE⁺ and PE⁻ are the sets of positive and negative examples on change to literal e, E⁺ is the set of positive examples on literal e, and X is a set of clauses defining relation prev(S, PS).

This induction problem is a case of (explanatory) induction in ILP, with background B = (X ∪ F′ ∪ A), and it is a case of monotonic induction as B, H_M, PE⁺, and PE⁻ are sets of Horn clauses.
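Definitions 10 and 11 can be turned into a small checker for a single narrative. This is a sketch under assumptions, not an induction algorithm: literals are encoded as (fluent, truth value, situation) triples, an action law as (fluent, value, action, preconditions), and entailment is decided by firing each law once, which suffices here because laws without frame axioms never depend on the target literal. All names and the lamp narrative are hypothetical.

```python
def change_examples(F, fluent, positive=True):
    """Situations of the examples on change (Definition 10) for the target literal."""
    pe_pos = {s for (f, v, s) in F
              if f == fluent and v == positive and s >= 1
              and (fluent, not positive, s - 1) in F}
    pe_neg = {s for (f, v, s) in F if f == fluent and v != positive and s >= 1}
    return pe_pos, pe_neg

def fired_situations(F, A, laws, fluent, positive=True):
    """Situations at which some action law derives the target literal."""
    last = max(s for (_, _, s) in F)
    fired = set()
    for (f, v, action, precond) in laws:
        if (f, v) != (fluent, positive):
            continue
        if (fluent, positive) in precond:
            raise ValueError("frame axioms are not allowed in the hypothesis")
        for s in range(1, last + 1):
            # Without frame axioms, testing preconditions against F is the
            # same as testing against F' = F minus E+.
            if (action, s) in A and all((pf, pv, s - 1) in F for (pf, pv) in precond):
                fired.add(s)
    return fired

def is_monotonic_solution(F, A, laws, fluent, positive=True):
    """Check the first two conditions of Definition 11 on one consistent narrative."""
    pe_pos, pe_neg = change_examples(F, fluent, positive)
    fired = fired_situations(F, A, laws, fluent, positive)
    return pe_pos <= fired and not (fired & pe_neg)

# Hypothetical lamp narrative.
F_lamp = {("lit", False, 0), ("lit", True, 1), ("lit", True, 2), ("lit", False, 3)}
A_lamp = {("on", 1), ("wait", 2), ("off", 3)}
h_good = [("lit", True, "on", ())]
h_missing = [("lit", True, "wait", ())]
h_overgeneral = h_good + [("lit", True, "off", ())]
```

The persistent example at situation 2 is absent from PE⁺, so the law for `on` alone passes the check; explaining situation 2 is left to inertia in the nonmonotonic formulation.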
3
Correspondence
Solutions to the monotonic induction problem correspond one-to-one to solutions to induction of the effects of actions without the frame problem.
Definition 12 (Complete Narrative) A narrative (F, A) is complete on fluent f iff for every situation s_i, 0 ≤ s_i ≤ n, there is either f(s_i) ∈ F or ¬f(s_i) ∈ F, where (n + 1) is the length of the narrative.

Proposition 2. (Correspondence) From evidence on narratives consistent and complete on the target fluent, H_I is a solution of nonmonotonic induction with inertia (Definition 9) if and only if it is a solution of monotonic induction (Definition 11).

Proof. 1. (H_I ⇐ H_M) Suppose a monotonic solution H_M is not a solution of the nonmonotonic induction problem. Then one of these is true: a) there is some e(s_i) ∈ E⁺ such that (X ∪ I ∪ F′ ∪ A ∪ H_M) ⊭ e(s_i), b) there is some e(s_i) ∈ E⁻ such that (X ∪ I ∪ F′ ∪ A ∪ H_M) ⊨ e(s_i), c) (X ∪ I ∪ F′ ∪ A ∪ H_M) ⊨ ⊥.

In case (a), suppose e(s_i) corresponds in monotonic induction to an example on change, e(s_i) ∈ PE⁺; recall PE⁺ ⊆ E⁺. Then (X ∪ F′ ∪ A ∪ H_M) ⊨ e(s_i), and it follows that (X ∪ I ∪ F′ ∪ A ∪ H_M) ⊨ e(s_i) because the set of clauses (X ∪ F′ ∪ A ∪ H_M) ⊭ ⊥ and it is a Horn program, thus its consequences are monotonically preserved in any other program it belongs to. Alternatively, suppose e(s_i) does not correspond to an example on change, e(s_i) ∉ PE⁺. There are two cases here depending on whether the monotonic solution H_M eventually implies e(s_i) or not. (Note that e(s_i) ∉ PE⁻, as e(s_i) ∈ E⁺, PE⁻ = E⁻ and F is assumed consistent, thus a solution to induction is free to imply the instance e(s_i) or not.) If eventually (X ∪ F′ ∪ A ∪ H_M) ⊨ e(s_i), then we have already shown that e(s_i) also follows in nonmonotonic induction. Alternatively, if (X ∪ F′ ∪ A ∪ H_M) ⊭ e(s_i), then we show that (X ∪ I ∪ F′ ∪ A ∪ H_M) ⊨ e(s_i), i.e. e(s_i) is entailed by inertia. By definition of PE⁺, as e(s_i) ∉ PE⁺ and e(s_i) ∈ E⁺, then ē(s_{i−1}) ∉ F. As narratives are complete on the target fluent it follows e(s_{i−1}) ∈ F, then also e(s_{i−1}) ∈ E⁺. Consider the inertia axiom for the instance at s_i, e(s_i) ← prev(s_i, s_{i−1}), e(s_{i−1}), not ē(s_i); it is the case that ē(s_i) ∉ F′ because e(s_i) ∈ E⁺, thus e(s_i) ∈ F, and F is consistent. Then not ē(s_i) holds as there is no other means of inferring ē(s_i). That e(s_{i−1}) is also entailed follows from mathematical induction over situations, as e(s_{i−1}) ∈ E⁺. Mathematical induction always ends in some e(s_{i−k}) in one of two cases (all the e(s_{i−j}) ∈ E⁺, 1 ≤ j ≤ k, as shown before): either e(s_{i−k}) ∈ PE⁺, in which case we have already shown it follows in nonmonotonic induction, or alternatively e(s_{i−k}) = e(0), which also follows as e(0) ∈ F′ (notice that instances at the initial situation do not belong to the example set).

In case (b), by definition of E⁻, for every e(s_i) ∈ E⁻, ē(s_i) ∈ F and thus also in F′. Then ē(s_i) is monotonically entailed by (X ∪ F′ ∪ A). As H_M is a solution and PE⁻ = E⁻, (X ∪ F′ ∪ A ∪ H_M) ⊭ e(s_i). Consider the inertia axiom for the instance at s_i, e(s_i) ← prev(s_i, s_{i−1}), e(s_{i−1}), not ē(s_i); it is the case that not ē(s_i) does not hold as ē(s_i) is monotonically entailed by F′, thus (X ∪ I ∪ F′ ∪ A ∪ H_M) ⊭ e(s_i).

In case (c), as H_M is a solution, (X ∪ F′ ∪ A ∪ H_M) ⊭ ⊥. The inertia axiom can only entail some additional e(s_i), thus (X ∪ I ∪ F′ ∪ A ∪ H_M) ⊨ ⊥ only if
Induction of the Effects of Actions by Monotonic Methods
305
the entailed e(si ) ∈ E − , thus e(si ) ∈ F . We already shown in case (b) that this is not the case. 2. (HI ⇒ HM ) Consider a nonmonotonic solution HI is not a solution of the monotonic induction problem. Then one of these is true: a) there is some e(si ) ∈ P E + such that (X ∪ F ∪ A ∪ HI ) |= e(si ), b) there is some e(si ) ∈ P E − such that (X ∪ F ∪ A ∪ HI ) |= e(si ), c) (X ∪ F ∪ A ∪ HI ) |= ⊥. In case (a), consider some e(si ) ∈ P E + such that (X ∪ F ∪ A ∪ HI ) |= e(si ), then consider the program with the inertia axiom, its instance at si , e(si ) ← prev(si , si−1 ), e(si−1 ), not e(si ), as e(si ) ∈ P E + it follows e(si−1 ) ∈ F , thus e(si−1 ) ∈ E − and as HI is solution (X ∪ I ∪ F ∪ A ∪ HI ) |= e(si−1 ), then the inertia axiom is not applicable at si , leading to (X ∪ I ∪ F ∪ A ∪ HI ) |= e(si ), which is contradictory with the assumption that HI is solution, thus it must be that (X ∪ F ∪ A ∪ HI ) |= e(si ). In case (b), for every e(si ) ∈ P E − as P E − = E − , (X∪I∪F ∪A∪HI ) |= e(si ). Consider (X ∪ F ∪ A ∪ HI ) |= e(si ), as (X ∪ F ∪ A ∪ HI ) is a Horn program its consequences are monotonically preserved in any other program it belongs to, then it will follow that (X ∪ I ∪ F ∪ A ∪ HI ) |= e(si ) which contradicts the initial assumption on HI , thus (X ∪ F ∪ A ∪ HI ) |= e(si ). In case (c), consider (X ∪ F ∪ A ∪ HI ) |= ⊥, it is a Horn program thus inconsistency will hold in any other program it belongs to, which is a contradiction with the assumption that (X ∪ I ∪ F ∪ A ∪ HI ) |= ⊥. Under some conditions there are efficient monotonic methods sound and complete for induction under Horn logic programs. This is the case of our definition of monotonic induction of the effects of actions in which the background knowledge as well as the examples are sets of ground facts, and the hypotheses are function-free positive rules. 
Tractability further comes from the fact that hypotheses, as required for the representation of action laws, are 12-determinate clauses with respect to the background and examples defined [2]. Domains may have a relational (static) structure, for which fluents with more arguments than just the one for the situation term are used. This leads to higher values of i and j in the ij-determinacy of hypotheses, but efficiency holds as long as the ij-determinate restriction does.

Corollary 1. The monotonic method provides an efficient, sound and complete induction of the effects of actions without the frame problem from evidence on consistent and complete narratives.

Proof. Follows from the fact that there are efficient monotonic methods that are sound and complete for the setting we are using in monotonic induction of actions, and from Prop. 2, which establishes the one-to-one correspondence to nonmonotonic induction with inertia.
4 Example
To illustrate the method consider the simple Yale Shooting Scenario. There is a turkey and a gun, the gun can be loaded or not, and the turkey will be dead when shooting with the gun loaded. There are actions shoot s, load l, and wait w; and fluents loaded ld, and dead d. Consider the following narrative (F, A):

F = {nld(0), nld(1), ld(2), ld(3), ld(4), ld(5), nd(0), nd(1), nd(2), nd(3), d(4), d(5)}
A = {s(1), l(2), w(3), s(4), w(5)}

where nd (not dead) is the complementary fluent¹ to d and nld is the complementary fluent to ld. An example of a learning problem in this scenario is the induction of a description of effect d. The examples from the narrative would be

E+ = {d(4), d(5)}
E− = {d(1), d(2), d(3)}

After (direct) monotonic induction is applied to this problem one solution would be

H = { d(S) ← s(S), prev(S, PS), ld(PS).
      d(S) ← s(S), prev(S, PS), d(PS).
      d(S) ← w(S), prev(S, PS), d(PS).
      d(S) ← l(S), prev(S, PS), d(PS). }

The last three clauses are frame axioms, so this solution has the frame problem. The examples on change corresponding to the narrative would be

PE+ = {d(4)}
PE− = {d(1), d(2), d(3)}

After the monotonic induction method is applied to this problem one solution would be

H = { d(S) ← s(S), prev(S, PS), ld(PS). }
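The extraction of examples on change in this scenario can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: situations are encoded as integers, prev(S, PS) is read as PS = S − 1, and the function names (examples_on_change, shoot_rule, with_inertia) are hypothetical helpers, not part of any ILP system. It also assumes that the complementary instances used by the inertia axiom are read directly from F.

```python
# Narrative of the Yale Shooting example; situations are the integers 0..5.
F = {("nld", 0), ("nld", 1), ("ld", 2), ("ld", 3), ("ld", 4), ("ld", 5),
     ("nd", 0), ("nd", 1), ("nd", 2), ("nd", 3), ("d", 4), ("d", 5)}
A = {("s", 1), ("l", 2), ("w", 3), ("s", 4), ("w", 5)}


def examples(F, e, ne):
    """E+ and E- for effect e with complement ne (situation 0 is excluded)."""
    pos = {s for (f, s) in F if f == e and s >= 1}
    neg = {s for (f, s) in F if f == ne and s >= 1}
    return pos, neg


def examples_on_change(F, e, ne):
    """PE+ keeps e(s) only if the complement holds at s-1; PE- equals E-."""
    pos, neg = examples(F, e, ne)
    return {s for s in pos if (ne, s - 1) in F}, neg


def shoot_rule(F, A):
    """Situations S at which d(S) <- s(S), prev(S, PS), ld(PS) fires."""
    return {s for (a, s) in A if a == "s" and ("ld", s - 1) in F}


def with_inertia(F, derived, e, ne, last):
    """Apply e(S) <- prev(S, PS), e(PS), not ne(S) over situations 1..last.

    Assumption of this sketch: ne(S) can only be inferred from F.
    """
    holds = set(derived)
    if (e, 0) in F:
        holds.add(0)
    for s in range(1, last + 1):
        if (s - 1) in holds and (ne, s) not in F:
            holds.add(s)
    return holds


pe_pos, pe_neg = examples_on_change(F, "d", "nd")
derived = shoot_rule(F, A)
propagated = with_inertia(F, derived, "d", "nd", last=5)
```

Under these assumptions the single action law covers the only example on change, d(4), and inertia then propagates d to situation 5, so that all of E+ is recovered without any frame axiom in H.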
5 Extended Monotonic Method
Monotonic induction of the effects of actions as introduced so far relies on narratives complete on the target fluent. This condition can be removed. When narratives are not complete, the target fluent has missing instances at some situation constants, which makes the definition of examples on change incomplete.

¹ We are using a different predicate (nd) for the (classically) negated fluent (¬d). As far as this work is concerned, this allows for a Horn logic program representation, instead of an extended logic program with classical negation, when the corresponding constraint ( ← d(S), nd(S). ) is in the program.
Induction of the Effects of Actions by Monotonic Methods
Definition 13 (Examples on Change, Missing Instances). Given a narrative (F, A) and some effect literal e of a fluent in F, the set of positive examples on change to e is

PE'+ = {e(s_i) | e(s_i) ∈ F and ē(s_{i−1}) ∈ F} ∪ {e(s'_i) | e(s_i) ∈ F and ē(s_{i−k}) ∈ F and for all s_j, s_{i−k} < s_j < s_i, e(s_j) ∉ F and ē(s_j) ∉ F}

and the set of negative examples on change to e is

PE− = {e(s_i) | ē(s_i) ∈ F, 1 ≤ s_i}

where ē(s_i) is the literal complementary to e(s_i), the constants s_i, s_{i−1} satisfy prev(s_i, s_{i−1}), and the < relation represents the transitive closure of prev/2. For each e(s'_i) instance, the constant s'_i is a new situation constant not present elsewhere in the description.

The new component of PE'+ corresponds to segments of consecutive situation constants with missing instances on the target fluent. The new situation constant s'_i can be understood as representing the missing segment as a whole. Intuitively, a missing segment with complementary literals at both edge situations must contain an example on change, but the available evidence in the narrative does not tell us the particular situation inside the segment at which the change occurs. This is indeed a case of multiple instance learning. To deal with it, the representation of actions is also extended to have an extra argument for the missing segment the action belongs to, in case there is one. For each narrative (F, A), missing segments are named with different new constants and given as an extra argument to action instances. The new narrative (F, A') thus contains

A' = {a(s'_i, s_i) | a(s_i) ∈ A}

where s'_i is the constant naming the missing segment, or just s_i if there is no missing target fluent at s_i. Correspondingly, action laws (hypotheses) will have the new form

e(ES) ← a(ES, S), prev(S, PS), precond(PS)

Definition 11 of monotonic induction of actions is still valid considering the extended set PE'+ instead of PE+, the narratives (F, A') instead of (F, A), and the new hypotheses language.
The induction problem is still a case of (explanatory) induction in ILP, with background B = (X ∪ F ∪ A'), and also a case of monotonic induction, as B, H_M, PE'+, and PE− are sets of Horn clauses. After induction of the action laws the extra situation argument for missing segments is discarded, so the hypotheses recover the regular form. Correspondence with nonmonotonic induction with inertia follows.

Proposition 3 (Correspondence Extended). From evidence on consistent narratives, H_I is a solution of nonmonotonic induction with inertia (Definition 9) if and only if it is a solution of monotonic induction (Definition 11 with the extension).

Proof. We only show the part of the proof corresponding to the extension with respect to Prop. 2.

1. (H_I ⇐ H_M) Suppose a monotonic solution H_M is not a solution of the nonmonotonic induction problem. Then one of these is true:
a) there is some e(s_i) ∈ E+ such that (X ∪ I ∪ F ∪ A ∪ H_M) ⊭ e(s_i),
b) there is some e(s_i) ∈ E− such that (X ∪ I ∪ F ∪ A ∪ H_M) |= e(s_i),
c) (X ∪ I ∪ F ∪ A ∪ H_M) |= ⊥.

In case (a), suppose e(s_i) corresponds in monotonic induction to an (extended) example on change, e(s'_i) ∈ PE'+, but e(s_i) ∉ PE+. Then it must be the case that e(s_i) is the ending edge of a missing segment, and as H_M is a solution the corresponding e(s'_i) ∈ PE'+ is (monotonically) entailed. After discarding the extra situation argument in H_M, the resulting solution will entail one of the e(s_j) inside the missing segment, and inertia (added after learning) will propagate the instance from the induced situation of change to the ending edge of the missing segment.

In cases (b) and (c) the proof is that of Prop. 2.

2. (H_I ⇒ H_M) Suppose a nonmonotonic solution H_I is not a solution of the monotonic induction problem. Then one of these is true:
a) there is some e(s_i) ∈ PE'+ such that (X ∪ F ∪ A' ∪ H_I) ⊭ e(s_i),
b) there is some e(s_i) ∈ PE− such that (X ∪ F ∪ A' ∪ H_I) |= e(s_i),
c) (X ∪ F ∪ A' ∪ H_I) |= ⊥.

In case (a), consider some e(s'_i) ∈ PE'+, with e(s_i) ∉ PE+, such that (X ∪ F ∪ A' ∪ H_I) ⊭ e(s'_i). After H_I is rewritten in the extended form to include the reference to missing segments, the corresponding instance e(s'_i) is entailed.

In cases (b) and (c) the proof is that of Prop. 2.
The monotonic induction problem extended for missing segments requires the induction of nondeterminate clauses, since the action literal is nondeterminate with respect to the background and examples. This compromises efficiency, as induction of 12-nondeterminate clauses is not PAC-learnable [3].

Corollary 2. The extended monotonic method provides a sound and complete induction of the effects of actions without the frame problem from evidence on consistent narratives.

To illustrate the extended method consider, in the previous YSS domain, the following narrative (F, A):

F = {nld(0), nld(1), nld(2), ld(3), ld(4), ld(5), nd(0), nd(1), nd(2), d(4), d(5)}
A = {s(1), w(2), l(3), s(4), w(5)}
The (extended) examples on change corresponding to the narrative would be

PE'+ = {d(m34)}
PE− = {d(1), d(2)}

where m34 is the new situation constant naming the missing segment. The extended component A' of the narrative would be

A' = {s(1, 1), w(2, 2), l(m34, 3), s(m34, 4), w(5, 5)}

After the extended monotonic induction method is applied to this problem one solution would be

H = { d(ES) ← s(ES, S), prev(S, PS), ld(PS). }

Discarding the extra ES term yields the intended solution seen before.
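The construction of the extended examples and the extended actions for this narrative can be sketched as follows. This is a hedged illustration only: situations are integers, a missing segment is taken to cover the situations without target-fluent evidence together with its ending edge (as the example with m34 suggests), and the naming scheme for segment constants and all function names are assumptions of this sketch.

```python
# Incomplete narrative: the target fluent d/nd has no instance at situation 3.
F = {("nld", 0), ("nld", 1), ("nld", 2), ("ld", 3), ("ld", 4), ("ld", 5),
     ("nd", 0), ("nd", 1), ("nd", 2), ("d", 4), ("d", 5)}
A = {("s", 1), ("w", 2), ("l", 3), ("s", 4), ("w", 5)}


def missing_segments(F, e, ne, last):
    """Map each situation of a missing segment to the segment's new constant."""
    def known(s):
        return (e, s) in F or (ne, s) in F

    seg = {}
    for end in range(1, last + 1):
        if (e, end) in F and not known(end - 1):
            start = end - 1
            while start > 0 and not known(start):
                start -= 1
            if (ne, start) in F:  # complementary literals at both edges
                members = range(start + 1, end + 1)
                name = "m" + "".join(str(s) for s in members)  # hypothetical naming
                for s in members:
                    seg[s] = name
    return seg


def extended_examples(F, e, ne, last):
    """PE'+ (ordinary changes plus one example per missing segment) and PE-."""
    seg = missing_segments(F, e, ne, last)
    pos = {s for (f, s) in F if f == e and (ne, s - 1) in F}
    pos |= {seg[s] for (f, s) in F if f == e and s in seg}
    neg = {s for (f, s) in F if f == ne and s >= 1}
    return pos, neg, seg


def extended_actions(A, seg):
    """A' adds the segment constant (or the situation itself) as first argument."""
    return {(a, seg.get(s, s), s) for (a, s) in A}


def shoot_rule(F, A_ext):
    """Constants ES for which d(ES) <- s(ES, S), prev(S, PS), ld(PS) fires."""
    return {es for (a, es, s) in A_ext if a == "s" and ("ld", s - 1) in F}


pe_pos, pe_neg, seg = extended_examples(F, "d", "nd", last=5)
A_ext = extended_actions(A, seg)
covered = shoot_rule(F, A_ext)
```

With these assumptions the sketch reproduces the sets given above: the single positive example is the segment constant m34, and the extended action law covers it without covering any negative example.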
6 Related Work and Discussion
Moyle and Muggleton [4] study the induction of Event Calculus programs, a particular action formalism, with [5] being the most recent work. The methods proposed there are different and rely on working with negation as failure in the background (for inertia) during induction, but it has been shown [6], [1] that monotonic induction does not extend well to normal programs. The approach in [7] uses a causality-based formalism for actions, requiring examples on complete causality in the evidence, i.e. a closed world assumption is made on them, and the so-called nonmonotonic setting of induction is used. This restricts the range of applicability of the method, as causality is usually not directly observable in the domains. In our method, evidence on change is extracted from regular evidence on the domain, and no closed world assumption is made on it. Another important difference with these two experimental approaches is that they are restricted to complete narratives. The results provided here make induction of actions without the frame problem, a nonmonotonic induction problem that no efficient system can deal with, actually efficiently solvable using most of the current (monotonic) ILP systems.

Acknowledgments. I would like to thank the anonymous reviewers for helpful comments. This research is partially supported by the Government of Spain, National Commission of Science and Technology grant TIC-2001-0393 and by the Government of Galicia-Spain, Secretary of R&D grant PGIDIT-02-PXIC-10502PN.
References

1. Otero, R.: Induction of stable models. In: Proc. of the 11th Int. Conference on Inductive Logic Programming, ILP 01, LNAI 2157. (2001) 193–205
2. Muggleton, S., Feng, C.: Efficient induction of logic programs. Inductive Logic Programming (1992)
3. Kietz, J.: Some lower bounds on the computational complexity of inductive logic programming. In: Proc. of the 6th European Conference on Machine Learning, ECML 93, LNAI 667. (1993) 115–123
4. Moyle, S., Muggleton, S.: Learning programs in the event calculus. In: Proc. of the 7th Int. Workshop on Inductive Logic Programming, ILP 97, LNAI 1297. (1997) 205–212
5. Moyle, S.: Using theory completion to learn a robot navigation control program. In: Proc. of the 12th Int. Conf. on Inductive Logic Programming, ILP 02, LNAI 2583. (2003) 182–197
6. Sakama, C.: Inverse entailment in nonmonotonic logic programs. In: Proc. of the 10th Int. Conf. on Inductive Logic Programming, ILP 00, LNAI 1866. (2000) 209–224
7. Lorenzo, D., Otero, R.: Learning to reason about actions. In: Proc. of the 14th European Conference on Artificial Intelligence, ECAI 00. (2000) 316–320
Hybrid Abductive Inductive Learning: A Generalisation of Progol Oliver Ray, Krysia Broda, and Alessandra Russo Department of Computing, Imperial College London 180 Queen’s Gate, London SW7 2BZ {or,kb,ar3}@doc.ic.ac.uk
Abstract. The learning system Progol5 and the underlying inference method of Bottom Generalisation are firmly established within Inductive Logic Programming (ILP). But despite their success, it is known that Bottom Generalisation, and therefore Progol5, are restricted to finding hypotheses that lie within the semantics of Plotkin’s relative subsumption. This paper exposes a previously unknown incompleteness of Progol5 with respect to Bottom Generalisation, and proposes a new approach, called Hybrid Abductive Inductive Learning, that integrates the ILP principles of Progol5 with Abductive Logic Programming (ALP). A proof procedure is proposed, called HAIL, that not only overcomes this newly discovered incompleteness, but further generalises Progol5 by computing multiple clauses in response to a single seed example and deriving hypotheses outside Plotkin’s relative subsumption. A semantics is presented, called Kernel Generalisation, which extends that of Bottom Generalisation and includes the hypotheses constructed by HAIL.
1 Introduction
Machine Learning is the branch of Artificial Intelligence that seeks to better understand and deploy learning systems through the analysis and synthesis of analogous processes in machines. The specific task of generalising from positive and negative examples relative to given background knowledge has been much studied in Machine Learning, and when combined with a first-order clausal representation is known as Inductive Logic Programming (ILP) [9,8]. The Progol system of Muggleton [10] is a state-of-the-art and widely applied ILP system that has been successful in significant real world applications. Progol5 [7] is the latest system in the Progol family, which is based on the inference method of Bottom Generalisation [10,15]. Given background knowledge B and seed example e, Bottom Generalisation constructs and generalises a clause, called the BottomSet [10] of B and e, to return a hypothesis h that together with B entails e. Yamamoto [15] has shown that Bottom Generalisation, and hence Progol5, are limited to deriving clauses h that subsume e relative to B in the sense of Plotkin [11]. But while approaches have been proposed that do not suffer from this limitation, for example in [17,3,2], these have yet to achieve the same degree of practical success as Progol.

This paper identifies a previously unknown incompleteness of Progol5 with respect to Bottom Generalisation, and attributes this incompleteness to the routine, called STARTSET, responsible for computing positive literals in the BottomSet. A proof procedure is proposed, called HAIL, that not only overcomes this newly discovered incompleteness, but further generalises Progol5 by computing multiple clauses in response to a single seed example and deriving hypotheses outside Plotkin's relative subsumption. A semantics is presented, called Kernel Generalisation, which extends that of Bottom Generalisation and includes the hypotheses constructed by HAIL.

The motivation is to develop an enhanced practical system by integrating the proven ILP principles of Progol5 with Abductive Logic Programming (ALP) [4]. The relationship between abduction and Bottom Generalisation was first established in [7], where the authors view the Progol5 STARTSET routine as a form of abduction, and in [16], where the author shows that positive literals in the BottomSet can be computed by an abductive proof procedure called SOLDR. But while existing approaches for integrating abduction and Bottom Generalisation have used abduction to compute single atom hypotheses, HAIL exploits the ability of ALP to compute multiple atom hypotheses. This enables HAIL to hypothesise multiple clauses not derivable by Bottom Generalisation.

The paper is structured as follows. Section 2 defines the relevant notation and terminology and reviews both Bottom Generalisation and Progol5. Section 3 discusses the STARTSET routine and considers its soundness with respect to Bottom Generalisation. Section 4 reveals an incompleteness of STARTSET with respect to Bottom Generalisation. Section 5 introduces the semantics of Kernel Generalisation, presents a refinement called Kernel Set Subsumption, and describes and illustrates the HAIL proof procedure with two worked examples. Section 6 compares this approach with related work, and the paper concludes with a summary and a discussion of future work.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 311–328, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Background
This section defines the notation and terminology used in this paper and provides an introduction to Bottom Generalisation and Progol5. It is assumed that all clauses and formulae are expressed in a first-order language L based on a fixed signature Σ. In addition to the usual function and predicate symbols, this signature is assumed to contain a set of Skolem constants and a set of predicate symbols called starred predicates, such that every non-starred predicate p of arity n ≥ 0 is associated with a unique starred predicate p∗ of arity n. Informally, p∗ represents the negation of p. Skolem symbols and starred predicates are reserved for the process of Skolemisation and for the formation of contrapositives, respectively. The notations GA and GL represent respectively the sets of ground atoms and ground literals in L. The binary relations ⊢, |= and ≡ denote respectively derivability under SLD resolution, classical logical entailment, and classical logical equivalence.

A clause is a set of literals {A1, . . ., Am, ¬B1, . . ., ¬Bn} and will often be written in the implicative form A1, . . ., Am :- B1, . . ., Bn. When a clause appears in a logical formula it denotes the universal closure of the disjunction of its literals. Where no confusion arises, a clause and a disjunction of literals will be treated interchangeably. The complement of a clause C, written C̄, denotes the set of unit clauses obtained from the Skolemised negation of C. A (clausal) theory is a set of implicitly conjoined clauses. The symbols B, H, E+, E− will denote Horn theories representing background knowledge, hypothesis, positive and negative examples. The symbols h and e will be hypotheses and examples consisting of a single Horn clause. The theories and clauses denoted by these symbols are assumed to contain no Skolem symbols or starred predicates.

Given a theory B and a clause e, then ℬ and ε denote the result of normalising B and e. The normalised theory ℬ is obtained by adding to B the positive unit clauses in ē. The normalised clause ε is the Skolemised head of e, if it exists, or the empty clause □ otherwise. Note that B ∧ H |= e iff ℬ ∧ H |= ε for any hypothesis H.

A clause C is said to θ-subsume a clause D, written C ≽ D, if and only if Cθ ⊆ D for some substitution θ. A clause is reduced if and only if it does not θ-subsume some proper subset of itself. The relation ≽ induces a lattice ordering on the set of reduced clauses (up to renaming of variables), and the least element in this ordering is the empty-clause. A clausal theory S is said to clausally subsume a clausal theory T, written S ⊒ T, if and only if every clause in T is θ-subsumed by at least one clause in S.

2.1 Bottom Generalisation
Bottom Generalisation [10,15] is an approach to ILP motivated by the principle of Inverse Entailment, which states B ∧ H |= e iff B ∧ ¬e |= ¬H. Thus, the negations of inductive hypotheses may be deduced from the background knowledge together with the negation of a seed example. Given B and e, the task of finding such an H will be called the task of inductive generalisation, and H will be said to cover e. In Progol5, the principle of Inverse Entailment is realised through the technique of Bottom Generalisation, which is based on the BottomSet [10], formalised in Definition 1 below.

Definition 1 (BottomSet [10]). Let B be a Horn theory and e be a Horn clause. Then the BottomSet of B and e, written Bot(B, e), is the clause Bot(B, e) = {L ∈ GL | B ∧ ē |= ¬L}.

The BottomSet of a theory B and clause e, written Bot(B, e), is the clause containing all ground literals whose negations may be deduced from B and the complement of e. The sets Bot+(B, e) and Bot−(B, e) will denote respectively the positive (head) atoms and the negated (body) atoms of Bot(B, e). A clause h is said to be derivable from B and e by Bottom Generalisation if and only if it θ-subsumes Bot(B, e), as formalised in Definition 2 below.

Definition 2 (Bottom Generalisation [10,15]). Let B be a Horn theory and e a Horn clause. A Horn clause h (containing no Skolem constant) is said to be derivable by Bottom Generalisation from B and e iff h ≽ Bot(B, e).

It is shown in [15] that the class of hypotheses derivable by Bottom Generalisation can be characterised by Plotkin's relative subsumption, or the related notion of C-derivation, formalised in Definitions 3 and 4 below.
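The θ-subsumption test that underlies Definition 2 can be illustrated with a small brute-force search. The Python sketch below is not how Progol5 implements the test; it assumes function-free clauses, represents a literal as a hypothetical triple (sign, predicate, arguments), treats capitalised strings as variables of the subsuming clause, and treats every term of the subsumed clause as a constant.

```python
def is_var(term):
    """Convention of this sketch: variables are capitalised strings."""
    return isinstance(term, str) and term[:1].isupper()


def match(lit_c, lit_d, theta):
    """Extend substitution theta so that lit_c under theta equals lit_d, or return None."""
    sign_c, pred_c, args_c = lit_c
    sign_d, pred_d, args_d = lit_d
    if (sign_c, pred_c, len(args_c)) != (sign_d, pred_d, len(args_d)):
        return None
    theta = dict(theta)
    for s, t in zip(args_c, args_d):
        if is_var(s):
            if theta.setdefault(s, t) != t:  # variable already bound elsewhere
                return None
        elif s != t:
            return None
    return theta


def theta_subsumes(c, d, theta=None):
    """True iff some substitution maps every literal of clause c into clause d."""
    theta = {} if theta is None else theta
    if not c:
        return True
    first, rest = c[0], c[1:]
    for lit_d in d:
        extended = match(first, lit_d, theta)
        if extended is not None and theta_subsumes(rest, d, extended):
            return True
    return False


# h = p(X) :- q(X)   and a ground clause   p(a) :- q(a), r(a)
h = [(True, "p", ("X",)), (False, "q", ("X",))]
bottom = [(True, "p", ("a",)), (False, "q", ("a",)), (False, "r", ("a",))]
result = theta_subsumes(h, bottom)
```

Here h subsumes the ground clause with θ = {X/a}. The search is exponential in the worst case, which is one reason practical systems restrict the hypothesis language.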
Definition 3 (Relative Subsumption [11]). Clause C subsumes a clause D relative to a theory T , iff T |= ∀(Cφ → D) for some substitution φ.
Definition 4 (C-Derivation [11]). A C-derivation of a clause D from a theory T with respect to a clause C, is a resolution derivation of the clause D from the clauses T ∪ {C}, in which C is used at most once as an input clause (i.e. a leaf). A C-derivation of the empty-clause is called a C-refutation.

Yamamoto [15] shows that given a theory B and a clause e such that B ⊭ e, a clause h is derivable by Bottom Generalisation from B and e if and only if h subsumes e relative to B, or equivalently, if and only if there is a C-refutation from B ∪ ē with respect to h. The C-refutation therefore characterises the hypotheses which are derivable by Bottom Generalisation.

2.2 Progol5
Progol5 [7] is an established ILP system based on the efficient realisation of Bottom Generalisation. Given Horn theories B, E+ and E−, Progol5 aims to return an augmented theory B′ = B ∪ {h1, . . . , hn} that entails E+ and is consistent with E−. Each clause hi hypothesised by Progol5 is maximally compressive in the sense that it must cover the greatest number of remaining positive examples, while containing the fewest literals. Progol5 also takes as input a set M of mode-declarations [10] that specifies a language bias with which hypothesised clauses must be compatible. Mode-declarations consist of head-declarations and body-declarations that impose syntactic constraints on the head and body atoms of hypothesised clauses. If p and t are predicates and X is a variable, then the head-declaration modeh[p(+t)] states that the atom p(X) may appear in the head of a hypothesis clause, and the body-declaration modeb[p(+t)] states that the atom p(X) may appear in the body provided X also appears in the head. The type predicate t is used internally by Progol5 when computing literals in the BottomSet.

Progol5 consists of a standard covering loop called COVERSET and three sub-routines called STARTSET, BOTTOMSET and SEARCH. These routines are described in [10,7] and a brief overview is now provided. COVERSET constructs hypotheses incrementally by repeatedly performing three operations until all positive examples have been covered. First it selects a seed example e from among those remaining in E+. Then it constructs a maximally compressive hypothesis h that together with the current B covers at least e. Finally, it adds h to B and removes all covered examples from E+. The step of hypothesis formation is performed in two stages. A finite Horn subset of Bot(B, e) is constructed by STARTSET and BOTTOMSET, and is generalised by SEARCH.
The head atom is computed by STARTSET by reasoning with contrapositives [13], then the body atoms are computed by BOTTOMSET using a Prolog interpreter, and finally the most compressive generalisation is determined by SEARCH using a general-to-specific search through the θ-subsumption lattice.
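The three operations of the covering loop described above can be sketched generically. The Python fragment below is only a schematic illustration, not the Progol5 implementation: build_hypothesis and covers are hypothetical callbacks standing in for the STARTSET/BOTTOMSET/SEARCH stage and for the coverage test, and the toy instantiation at the end has no logical content.

```python
def cover_set(positives, build_hypothesis, covers):
    """Greedy covering loop: select a seed, build a hypothesis, drop covered examples.

    build_hypothesis(seed, theory) and covers(theory, example) are
    placeholders supplied by the caller (assumptions of this sketch).
    """
    remaining = list(positives)
    theory = []
    while remaining:
        seed = remaining[0]                      # 1. select a seed example
        h = build_hypothesis(seed, theory)       # 2. construct a hypothesis for it
        theory.append(h)                         # 3. add it to the theory ...
        still = [e for e in remaining if not covers(theory, e)]
        if len(still) == len(remaining):
            raise ValueError("hypothesis does not cover its own seed")
        remaining = still                        # ... and remove covered examples
    return theory


# Toy instantiation: examples are integers and a "hypothesis" is a parity class.
toy_theory = cover_set(
    [2, 4, 7],
    build_hypothesis=lambda seed, theory: seed % 2,
    covers=lambda theory, example: example % 2 in theory,
)
```

The guard against a hypothesis that fails to cover its seed is a design choice of this sketch, added so that the loop always terminates.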
3 Soundness of StartSet
This section considers an idealisation of the contrapositive reasoning mechanism used by the Progol5 STARTSET and discusses the soundness of this procedure. Contrapositives are a means of propagating negative information backwards through program clauses. For example, consider the clause a :- b where a and b are propositions. Typically this clause would be used to conclude a from b. But by the classical equivalence a ← b iff ¬b ← ¬a, it could equally be used to conclude ¬b from ¬a. This latter inference can be simulated within a logic programming context by introducing the starred predicates a∗ and b∗ to represent the negations of predicates a and b, giving the new clause b∗ :- a∗. Contrapositives obtained in this way will be called contrapositive variants. Any Horn clause C with exactly n body atoms yields exactly n contrapositive variants, each of which is obtained by transposing the head atom C0 (if it exists) with a body atom Cj (assuming one exists) and starring the transposed atoms. The contrapositive variants of a Horn theory T, written Contra(T), are defined as the union of the contrapositive variants of the individual clauses of T.

Recall that Bot+(B, e) is the set of ground atoms whose negations are entailed by B and ē. To compute such negative consequences, STARTSET uses SLD-resolution on the theory obtained from B by first adding ē and then adding the contrapositive variants of the resulting clauses. These theories will be called the complementary and contrapositive extensions of B, as formalised in Definitions 5 and 6 below.

Definition 5 (Complementary Extension). Let B be a Horn theory and e be a Horn clause. Then the complementary extension of B with respect to e, written Be, is the Horn theory B ∪ ē.

Definition 6 (Contrapositive Extension). Let B be a Horn theory and e be a Horn clause. Then the contrapositive extension of B with respect to e, written Be∗, is the Horn theory Be ∪ Contra(Be).

Atoms in Bot+(B, e) may be computed by identifying those ground atoms that succeed under SLD-resolution as starred queries from the contrapositive extension. As formalised in Definition 7 below, the set of all ground atoms obtained in this way is called the StartSet of B with respect to e, and is a subset of Bot+(B, e), as stated in Proposition 1.

Definition 7 (StartSet). Let B be a Horn theory, and e be a Horn clause. Then the StartSet of B with respect to e, denoted StartSet(B, e), is the set of ground atoms {α ∈ GA | Be∗ ⊢ α∗}.

Proposition 1 (Soundness of StartSet). Let B be a Horn theory, and e be a Horn clause. Then StartSet(B, e) ⊆ Bot+(B, e). For proof see [12].

For reasons of efficiency, the STARTSET routine used by Progol5 is more complex than the idealised STARTSET described above. It is shown in [12], however, that the Progol5 routine computes only a subset of the idealised StartSet, and so the soundness and incompleteness results presented in this section and the next apply to both procedures.
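The formation of contrapositive variants and of the contrapositive extension can be made concrete for propositional Horn clauses. In the following Python sketch, which is an illustration under assumptions rather than the Progol5 routine, a clause is a hypothetical pair (head, body) with head None for a headless clause, and starring is done by appending "*" to the atom name.

```python
def star(atom):
    """Starred counterpart of a propositional atom (naming is an assumption)."""
    return atom + "*"


def contra_variants(clause):
    """One variant per body atom: swap it with the head and star both."""
    head, body = clause
    variants = []
    for j, atom in enumerate(body):
        rest = body[:j] + body[j + 1:]
        starred_head = (star(head),) if head is not None else ()
        variants.append((star(atom), starred_head + rest))
    return variants


def contrapositive_extension(B, e_complement):
    """Complementary extension B plus complement of e, plus all its variants."""
    Be = list(B) + list(e_complement)
    return Be + [v for clause in Be for v in contra_variants(clause)]


# The clause a :- b from the text yields the single variant b* :- a*.
single = contra_variants(("a", ("b",)))
# A headless clause :- a yields the starred fact a*.
from_denial = contra_variants((None, ("a",)))
# A clause with two body atoms yields two variants.
two = contra_variants(("p", ("q", "r")))
```

A fact has no body atoms and therefore no contrapositive variants, which agrees with the statement that a clause with exactly n body atoms yields exactly n variants.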
4 Incompleteness of StartSet
This section reveals that STARTSET, and therefore Progol5, are incomplete with respect to Bottom Generalisation. Proposition 2 shows that for B and e as defined below, the atom c ∈ Bot+(B, e) but c ∉ StartSet(B, e). Therefore the hypothesis h = c is derivable by Bottom Generalisation from B and e, but is not computed by Progol5. Let a, b and c be proposition symbols, and define:

B = { a :- b, c
      b :- c }
e = a
h = c

Proposition 2 (Incompleteness of StartSet). Given B and e as defined above, then c ∈ Bot+(B, e) but c ∉ StartSet(B, e).

Proof. First, referring to Definitions 5 and 6, observe that the complementary and contrapositive extensions are as follows:

Be = { a :- b, c
       b :- c
       :- a }

Be∗ = { a :- b, c
        b :- c
        :- a } ∪ { b∗ :- a∗, c
                   c∗ :- a∗, b
                   c∗ :- b∗
                   a∗ }

Then show that c ∈ Bot+(B, e). Observe that Be ∧ c |= ⊥ since c and the first two clauses of Be entail a, and this contradicts the third clause of Be. Therefore Be |= ¬c and so c ∈ Bot+(B, e) by Definitions 1 and 5.

Finally, show that c ∉ StartSet(B, e). Observe that the query c∗ fails under SLD-resolution from Be∗, as shown by the SLD tree in Figure (a) below. Therefore c ∉ StartSet(B, e) by Definition 7.

[Figure: (a) Failed SLD-computation. The query ?c∗ has two branches: one leads to ?a∗, b, then ?b, then ?c; the other leads to ?b∗, then ?a∗, c, then ?c. Both end in the goal ?c, which fails. (b) C-Derivation (with merge), using the clauses :- a, a :- b, c, b :- c and the hypothesis c.]

The incompleteness identified above is related to a refinement of C-refutations. A C-refutation for this example is shown in Figure (b) above. Recall that a clause as defined by Plotkin is a set of literals, and so identical literals are merged. Note that in this example every refutation that uses h only once requires at least one merge. If a C-refutation with no merge of literals is called a C*-refutation, then it remains to show the conjecture that h is derivable by Progol5 from B and e only if there exists a C*-refutation from B ∪ ē with respect to h.
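Both halves of the argument for Proposition 2 can be replayed on the propositional example. The Python sketch below is an illustration under assumptions: clauses are hypothetical (head, body) pairs with head None for a headless clause, SLD-resolution is approximated by backward chaining with an ancestor check to avoid loops, and membership in Bot+(B, e) is tested by forward chaining to a violated headless clause.

```python
B = [("a", ("b", "c")), ("b", ("c",))]
E_BAR = [(None, ("a",))]          # complement of e = a is the clause :- a


def contra_variants(clause):
    """Contrapositive variants: swap head with one body atom and star both."""
    head, body = clause
    starred_head = (head + "*",) if head is not None else ()
    return [(atom + "*", starred_head + body[:j] + body[j + 1:])
            for j, atom in enumerate(body)]


def sld_succeeds(program, goal, ancestors=frozenset()):
    """Backward chaining on propositional clauses, failing on repeated goals."""
    if goal in ancestors:
        return False
    for head, body in program:
        if head == goal and all(
                sld_succeeds(program, g, ancestors | {goal}) for g in body):
            return True
    return False


def inconsistent(program, extra_facts):
    """Forward chaining: is some headless clause violated once the facts are added?"""
    known = set(extra_facts)
    changed = True
    while changed:
        changed = False
        for head, body in program:
            if head is not None and head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return any(head is None and all(b in known for b in body)
               for head, body in program)


Be = B + E_BAR
Be_star = Be + [v for clause in Be for v in contra_variants(clause)]

in_bot_plus = {x: inconsistent(Be, {x}) for x in "abc"}
in_start_set = {x: sld_succeeds(Be_star, x + "*") for x in "abc"}
```

On this example the sketch agrees with the proof: c belongs to Bot+(B, e) because adding c makes Be inconsistent, yet the starred query for c fails, since every branch eventually needs the atom c, for which Be∗ has no clause.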
5 Hybrid Abductive-Inductive Learning
This section proposes a semantics that extends Bottom Generalisation, and introduces a corresponding proof procedure that generalises Progol5. The motivation underlying this approach is given in Proposition 3, which generalises a similar result in [16] from definite clauses to Horn clauses by reformulating the BottomSet in terms of a deductive and an abductive component.

Proposition 3. Let ℬ and ε be the result of normalising a Horn theory B and a Horn clause e where B ⊭ e. Let α, δ ∈ GA denote ground atoms. As usual, let the operators ⋀ and ⋁ denote respectively the conjunction and disjunction of a set of formulae (which are atoms in this case). Then

Bot(B, e) ≡ ⋀{δ ∈ GA | ℬ |= δ} → ⋁{α ∈ GA | ℬ ∧ α |= ε}
Proof. By Definition 1, Bot(B, e) = {L | B ∧ e |= ¬L}. Partitioning into positive literals α and negative literals ¬δ, this set can be rewritten as the union of the two sets: (i) {α | B ∧ e |= ¬α} and (ii) {¬δ | B ∧ e |= δ}. The proof is then by cases, according to whether e is a definite clause or a negative clause. Case 1: Let e be the definite clause e = {E0 , ¬E1 , . . . , ¬En }. Then set (i) is equal to {α | B ∧ ¬E0 σ ∧ E1 σ ∧ . . . ∧ En σ |= ¬α}, which can be written {α | B ∧ ¬ |= ¬α}, and is equal to {α | B ∧ α |= }. Set (ii) is equal to {¬δ | B ∧¬E0 σ∧E1 σ∧. . .∧En σ |= δ}, which can be written as {¬δ | B∧¬ |= δ}, and this is now shown to be equal to {¬δ | B |= δ} using the following argument. If δ is any atom such that B |= δ then B ∧ ¬ |= δ by monotonicity. Therefore {¬δ | B ∧ ¬ |= δ} ⊇ {¬δ | B |= δ}. If B ∧ ¬ |= δ then B ∧ ¬ ∧ ¬δ |= ⊥. Now, by the completeness of Hyper-resolution [1], there is a Hyper-resolution refutation from the clauses of B ∪ {¬} ∪ {¬δ}, in which the electrons are E1 σ, . . . , En σ and any facts in B. And since the nuclei ¬ and ¬δ are negative unit clauses, they can be used only once (if at all) to derive the empty-clause in the very last step of the refutation. But suppose ¬ is used, then ¬δ cannot be used, and so there is a Hyper-resolution refutation from the clauses of B ∪ {¬}, which means that B ∧ ¬ |= ⊥ by the soundness of Hyper-resolution, and so B |= . But this is equivalent to B |= e, which is a contradiction. Therefore ¬ is not used, and so there is a Hyper-resolution refutation from the clauses of B ∪ {¬δ}, which means that B |= δ by the soundness of Hyper-resolution. Therefore {¬δ | B ∧ ¬ |= δ} ⊆ {¬δ | B |= δ}. Hence {¬δ | B ∧ ¬ |= δ} = {¬δ | B |= δ}. Case 2: Let e be the negative clause e = {¬E1 , . . . , ¬En }. Then set (i) is equal to {α | B ∧ E1 σ ∧ . . . ∧ En σ |= ¬α}, which is equal to {α | B ∧ E1 σ ∧ . . . ∧ En σ ∧ α |= ⊥} and can be written {α | B ∧ α |= } as = whenever e is negative. 
Set (ii) is equal to {¬δ | B ∧ E1 σ ∧ . . . ∧ En σ |= δ}, which can be written {¬δ | B |= δ}. In both cases Bot(B, e) = {L | B∧e |= ¬L} = {¬δ | B∧e |= δ}∪{α | B∧e |= ¬α} = {¬δ | B |= δ} ∪ {α | B ∧ α |= }. Since the clause Bot(B, e) represents the disjunction logically equivalent to the formula
of its literals, it is therefore Bot(B, e) ≡ {δ ∈ GA | B |= δ} → {α ∈ GA | B ∧ α |= }.
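The split of the BottomSet into a deductive part and an abductive part can be illustrated on a small ground case. The sketch below assumes a propositional, already normalised theory and a definite example, so the goal is a single atom; the theory {a :- b, c; b :- c; d}, the goal a and all function names are hypothetical choices for illustration.

```python
def least_model(clauses):
    """Atoms derivable from ground definite clauses (head, body) by forward chaining."""
    facts, changed = set(), True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in facts and all(b in facts for b in body):
                facts.add(head)
                changed = True
    return facts


def bottom_set(norm_B, goal, atoms):
    """Return (body atoms, head atoms) of the BottomSet over a finite set of ground atoms.

    Body atoms are those deducible from the normalised theory; head atoms are
    those whose addition to the theory makes the goal derivable.
    """
    deduced = sorted(a for a in atoms if a in least_model(norm_B))
    abduced = sorted(a for a in atoms if goal in least_model(norm_B + [(a, [])]))
    return deduced, abduced


norm_B = [("a", ["b", "c"]), ("b", ["c"]), ("d", [])]   # hypothetical normalised theory
bot_minus, bot_plus = bottom_set(norm_B, "a", ["a", "b", "c", "d"])
print(bot_minus, bot_plus)
```

Here the deductive component contains only d, and the abductive component contains a and c, so the resulting clause reads d → a ∨ c.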
Oliver Ray et al.
Proposition 3 shows that the atoms δ ∈ Bot−(B, e) are those ground atoms that may be deduced from the normalised background ℬ, and that the atoms α ∈ Bot+(B, e) are those ground atoms that may be abduced from ℬ given as goal the normalised example ε. This has two important implications. First, the incompleteness of Progol5 identified in Section 4 can be avoided by replacing the STARTSET routine with an abductive procedure for deriving single atom hypotheses α. Second, the semantics of Bottom Generalisation can be extended, and the Progol5 proof procedure can be further generalised, by exploiting abductive hypotheses with multiple atoms, as shown in the next two subsections.

5.1
Semantics
This subsection introduces a new semantics called Kernel Generalisation, and a refinement of this semantics called Kernel Set Subsumption. The underlying notion, called a Kernel, is a logical formula that generalises the BottomSet by replacing the single atoms α in Proposition 3 by sets of (implicitly conjoined) atoms ∆ = {α1, ..., αn}, as formalised in Definition 8 below.

Definition 8 (Kernel). Let ℬ and ε be the result of normalising a Horn theory B and a Horn clause e such that B ⊭ e. Then the Kernel of B and e, written Ker(B, e), is the formula defined as follows:

\[ Ker(B, e) = \bigwedge \{\delta \in GA \mid \mathcal{B} \models \delta\} \rightarrow \bigvee \{\Delta \subseteq GA \mid \mathcal{B} \wedge \Delta \models \epsilon\} \]

As formalised in Definition 9 below, any formula that logically entails the Kernel is said to be derivable by Kernel Generalisation, and as shown in Proposition 4 below, all such formulae are correct inductive generalisations.

Definition 9 (Kernel Generalisation). Let B be a Horn theory and e be a Horn clause such that B ⊭ e. Then a Horn theory H is said to be derivable by Kernel Generalisation from B and e iff H |= Ker(B, e).

Proposition 4 (Soundness of Kernel Generalisation). Let B be a Horn theory and e be a Horn clause such that B ⊭ e. Then H |= Ker(B, e) only if B ∧ H |= e, for any Horn theory H.

Proof. Assume H |= Ker(B, e). For convenience, let P and S abbreviate the following formulae: let P = ⋀{δ ∈ GA | ℬ |= δ} be the conjunction of all ground atoms entailed by ℬ, and let S = ⋁{∆ ⊆ GA | ℬ ∧ ∆ |= ε} be the disjunction of the conjunctions of ground atoms that together with ℬ entail ε. Then observe that (i) ℬ |= P as each conjunct δ of P is individually entailed by ℬ, and (ii) ℬ ∧ S |= ε as together with ℬ each individual disjunct ∆ of S entails ε, and (iii) H |= P → S by Definition 8 and the assumption above. Let M be a model of ℬ and H. Then M is a model of P using (i), and of S using (iii), and of ε using (ii). Therefore ℬ ∧ H |= ε, which is equivalent to B ∧ H |= e.

To remain within Horn clause logic, it is convenient to introduce a refinement of the Kernel, called a Kernel Set. Informally, a Kernel Set K of B and e is a
partial representation of Ker(B, e). Comparing Definition 10 below with Definition 8 above, the set of head atoms {α1, ..., αn} of K is seen to be an element of the consequent {∆ | ℬ ∧ ∆ |= ε} of Ker(B, e). The set of body atoms {δ_1^1, ..., δ_n^m(n)} of K is seen to be a subset of the antecedent {δ | ℬ |= δ}.

Definition 10 (Kernel Set). Let ℬ and ε be the result of normalising a Horn theory B and a Horn clause e. Then a Horn theory K is said to be a Kernel Set of B and e iff

\[
K = \left\{ \begin{array}{l}
\alpha_1 \text{ :- } \delta_1^1, \ldots, \delta_1^{m(1)} \\
\quad \vdots \\
\alpha_i \text{ :- } \delta_i^1, \ldots, \delta_i^j, \ldots, \delta_i^{m(i)} \\
\quad \vdots \\
\alpha_n \text{ :- } \delta_n^1, \ldots, \delta_n^{m(n)}
\end{array} \right\}
\]

where m(i) ≥ 0 denotes the number of body atoms in the ith clause, and αi ∈ GA denotes the head atom of the ith clause, and δ_i^j ∈ GA denotes the jth body atom of the ith clause, and ℬ ∪ {α1, ..., αn} |= ε and ℬ |= δ_i^j for all i, j such that 1 ≤ i ≤ n and 1 ≤ j ≤ m(i).

As formalised in Definition 11 below, any formula that clausally subsumes a Kernel Set is said to be derivable by Kernel Set Subsumption, and as shown in Proposition 5 below, all such formulae are correct inductive generalisations.

Definition 11 (Kernel Set Subsumption). Let K be a Kernel Set of a Horn theory B and a Horn clause e such that B ⊭ e. Then a Horn theory H is said to be derivable by Kernel Set Subsumption from B and e iff H ⊒ K.

Proposition 5 (Soundness of Kernel Set Subsumption). Let K be a Kernel Set of a Horn theory B and a Horn clause e such that B ⊭ e. Then H ⊒ K only if B ∧ H |= e, for any Horn theory H.

Proof. Assume H ⊒ K. For convenience, let P, Q, R and S abbreviate the following formulae: let P = ⋀{δ ∈ GA | ℬ |= δ} be the conjunction of all ground atoms entailed by ℬ, let Q = ⋀{δ_1^1, ..., δ_i^j, ..., δ_n^m(n)} be the conjunction of all body atoms of K, let R = ⋀{α1, ..., αn} be the conjunction of all head atoms of K, and let S = ⋁{∆ ⊆ GA | ℬ ∧ ∆ |= ε} be the disjunction of the conjunctions of ground atoms that together with ℬ entail ε. Then observe that (i) P |= Q as the conjuncts δ_i^j of Q are included among the conjuncts δ of P, and (ii) R |= S as the conjunction R is one of the disjuncts ∆ in S, and (iii) K |= Q → R, as any model of K that satisfies every body atom must also satisfy every head atom, and (iv) H |= K by definition of θ-subsumption and the assumption above. Let M be a model of H. If M is a model of P, then M is a model of Q using (i), and of K using (iv), and of R using (iii), and of S using (ii). Therefore H |= P → S, and so H |= Ker(B, e) by Definition 8, and thus B ∧ H |= e by Proposition 4.

Proposition 5 above shows that Kernel Set Subsumption is a sound method of inductive generalisation. Proposition 6 below shows that Kernel Set Subsumption is a strict extension of Bottom Generalisation for Horn clause logic.
Proposition 6 (Kernel Set Subsumption Extends Bottom Generalisation). Let B be a Horn theory and e a Horn clause such that B ⊭ e. Then the set of hypotheses KSS derivable by Kernel Set Subsumption strictly includes the set of Horn clause hypotheses BG derivable by Bottom Generalisation.

Proof. First show that KSS ⊇ BG. If the Horn clause h is derivable from B and e by Bottom Generalisation, then h ≽ Bot(B, e) by Definition 2, and therefore hσ ⊆ Bot(B, e) for some substitution σ. By Proposition 3 it follows that hσ = α :- δ1, ..., δn where ℬ ∧ α |= ε and ℬ |= δj for all 1 ≤ j ≤ n. Therefore the Horn theory H = {h} is derivable by Kernel Set Subsumption using the Kernel Set K = {α :- δ1, ..., δn}. Thus KSS ⊇ BG. Now show that KSS ≠ BG. Let p/0 and q/1 be predicates, let a and b be constants, and define B = {p :- q(a), q(b)}, e = p, and h = q(X). Then the hypothesis h = q(X) is not derivable by Bottom Generalisation, as it does not θ-subsume Bot(B, e) = {p}. But the hypothesis H = {q(X)} is derivable by Kernel Set Subsumption, as it clausally subsumes the Kernel Set K = {q(a)} ∪ {q(b)}. Thus KSS ≠ BG.

The notion of Kernel Set introduced above is related to an extension of Plotkin's C-refutation. Let a K-derivation of a clause D from a clausal theory T with respect to a clausal theory K be defined as a resolution derivation of D from T ∪ K in which any clause in K is used at most once. Then the following conjecture remains to be shown: a theory K is a Kernel Set of B and e only if there exists a K-refutation from B ∪ e with respect to K. Note that C-derivations are a special case of K-derivations in which K consists of a single clause C.

5.2
Proof Procedure
This subsection introduces a proof procedure for Kernel Set Subsumption, called HAIL, that integrates abductive, deductive and inductive reasoning within a cycle of learning that generalises Progol5. This cycle is illustrated in Figure 1. HAIL, like Progol5, consists of a CoverSet loop (Steps 1 and 5) with abductive (Step 2), deductive (Step 3) and inductive (Step 4) phases. Given Horn theories B, E+ and E−, and a set of mode-declarations M, HAIL aims to return an augmented background knowledge B′ = B ∪ H1 ∪ ... ∪ Hm that entails E+, is consistent with E−, and such that each theory Hi for 1 ≤ i ≤ m is maximally compressive and compatible with M. On every iteration of the cycle, at least one clause is removed from E+, and a non-empty theory Hi is added to B. It is assumed that initially E+ is non-empty and consistent with B and E−.

The CoverSet loop begins (Step 1) by selecting from E+ a seed example e, which is normalised with B, giving theory ℬ and atom ε. An abductive procedure is then used (Step 2) to find explanations ∆i = {α1, ..., αn} of goal ε from theory ℬ. By definition, each explanation is a set of implicitly conjoined ground atoms such that ℬ ∧ ∆i |= ε. Any abductive procedure can be used, but for the purposes of illustration Figure 1 depicts a tree-like computation representing the ASLD procedure of Kakas and Mancarella [5]. Abduced atoms αj are shown as tapered squares, the goal ε is shown as an oval, and the theory ℬ is implicit.
[Figure 1: cycle diagram. Given B, E+, E− and M: Step 1 selects a seed e from E+ and normalises B and e; Step 2 (ABDUCE) produces ∆i = {α1, ..., αn}; Step 3 (DEDUCE) produces the Kernel Set Ki = {ki1, ..., kin}, where each kij has head αj and body atoms δ_j^1, δ_j^2, ...; Step 4 (SEARCH) produces Hi = {hi1, ..., hin}; Step 5 lets H be the most compressive Hi, adds H to B and removes the cover from E+. The cycle repeats while E+ ≠ ∅ and then returns B′ = B ∪ H1 ∪ ... ∪ Hm.]

Fig. 1. Conceptual View of the HAIL Learning Cycle

Every n-atom hypothesis ∆i = {α1, ..., αn} abduced in Step 2 is used in Step 3 to form an n-clause Kernel Set Ki = {ki1, ..., kin}, with each atom αj becoming the head of exactly one clause kij. To every head atom αj is adjoined a set of body atoms δ_i^j, shown as squares in Figure 1, each of which is determined by a deductive procedure that computes ground atomic consequences of ℬ. The resulting Kernel Set Ki is then generalised (Step 4) by constructing a Horn theory Hi that includes at least one clause hij from the θ-subsumption lattice of each Kernel clause kij. Figure 1 shows the clauses hi1 and hin (rounded rectangles) selected from the θ-subsumption lattices (dotted arrows) of the Kernel clauses ki1 and kin (tapered rectangles). In general, the same clause may be selected from several lattices, as in the example used in Proposition 6.

The hypotheses constructed by HAIL should be compatible with the given language bias, and they should be maximally compressive in the sense of covering the greatest number of remaining positive examples while containing the fewest literals. Therefore, the abductive and search procedures are required to return hypotheses that are minimal in the sense that no subset is also a hypothesis, and, in practice, all three procedures will make use of the mode-declarations M. In this way, the most compressive hypothesis Hi is determined for each Kernel Set Ki resulting from some explanation ∆i. In Step 5, the most compressive such hypothesis, H, is then asserted into B, and any covered examples are removed from E+. The cycle is repeated until E+ is empty, whereupon the augmented background B′ is returned.
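Step 4 relies on θ-subsumption between clauses and on clausal subsumption between theories. The following is a minimal sketch of both tests for function-free definite clauses, assuming the usual convention that the variables of the subsumed clause are treated as constants; the tuple encoding and all function names are assumptions made for illustration and are not part of HAIL.

```python
def is_var(term):
    """Variables are written with an upper-case initial, as in Prolog."""
    return term[:1].isupper()


def match(literal, target, theta):
    """Extend theta so that literal under theta equals target, or return None."""
    (pred, args), (tpred, targs) = literal, target
    if pred != tpred or len(args) != len(targs):
        return None
    theta = dict(theta)
    for s, t in zip(args, targs):
        if is_var(s):
            if theta.setdefault(s, t) != t:
                return None   # s is already bound to a different term
        elif s != t:
            return None
    return theta


def theta_subsumes(c, d):
    """True iff some substitution maps the head of c to the head of d
    and every body atom of c to a body atom of d."""
    (c_head, c_body), (d_head, d_body) = c, d
    theta = match(c_head, d_head, {})
    if theta is None:
        return False

    def extend(i, theta):
        if i == len(c_body):
            return True
        for lit in d_body:
            bigger = match(c_body[i], lit, theta)
            if bigger is not None and extend(i + 1, bigger):
                return True
        return False

    return extend(0, theta)


def clausally_subsumes(H, K):
    """True iff every clause of K is theta-subsumed by some clause of H."""
    return all(any(theta_subsumes(h, k) for h in H) for k in K)


# One clause q(X) selected for both Kernel clauses q(a) and q(b).
H = [(("q", ("X",)), [])]
K = [(("q", ("a",)), []), (("q", ("b",)), [])]
k = (("fries", ("X",)), [("offer", ("X",))])
g = (("fries", ("X",)), [("offer", ("Y",))])
print(clausally_subsumes(H, K), theta_subsumes(g, k), theta_subsumes(k, g))
```

The first check corresponds to a single clause being selected from several lattices; the other two show that fries(X) :- offer(Y) is more general than fries(X) :- offer(X), but not the reverse.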
Begin HAIL
  Input               given B, E+, E−, M
  remove cover        let E+ = E+ − {e ∈ E+ | B |= e}
  CoverSet Loop       while E+ ≠ ∅
  select seed           select seed example e ∈ E+
  normalise             let ⟨ℬ, ε⟩ = Normalise(B, e)
  Abduction             let A = ABDUCE(ℬ, ε, Mh)
                        for each abduced hypothesis ∆i ∈ A
  Deduction               for each abduced atom αj ∈ ∆i
                            let kij = DEDUCE(ℬ, αj, Mb)
                          let Ki = ⋃j {kij}
  Induction               let Hi = SEARCH(Ki, B, E+, E−, M)
  best hypothesis       let H = Hi with greatest Compression
  assert hypothesis     let B = B ∪ H
  remove cover          let E+ = E+ − {e ∈ E+ | B |= e}
  Output              return B
End HAIL
Fig. 2. HAIL Proof Procedure

The high-level operation of the HAIL learning cycle is shown in Figure 2, in which the abductive, deductive and search procedures are referred to generically as ABDUCE, DEDUCE and SEARCH. Given as input B, E+, E− and M, HAIL begins by removing from E+ any examples already covered by B, as these require no hypothesis. The first seed example is then selected and normalised, giving ℬ and ε. From the theory ℬ and the goal ε, ABDUCE computes a set A = {∆1, ..., ∆p} of explanations, each of which is an implicitly conjoined set ∆i = {α1, ..., αn} of ground atoms compatible with the head-declarations Mh in M, and is such that ℬ ∧ ∆i |= ε. In the outer for-loop, each explanation ∆i ∈ A is processed in turn. In the inner for-loop, each atom αj ∈ ∆i becomes the head of a clause kij to which DEDUCE adjoins a set of body atoms, each of which is a ground atomic consequence of ℬ compatible with the body-declarations Mb in M. The Kernel Set Ki formed of the union of the clauses kij is then generalised by SEARCH, which determines the most compressive theory Hi that clausally subsumes Ki and is compatible with M. The most compressive theory obtained in this way is then added to B, and any newly covered examples are removed from E+.

A concrete instance of Figure 2 is proposed in [12] that instantiates ABDUCE, DEDUCE and SEARCH with ASLD, BOTTOMSET and a new search algorithm called M-SEARCH. Very briefly, the language bias Mh is encoded within the ASLD procedure as additional abductive integrity constraints, the Progol5 BOTTOMSET routine is used to compute the body atoms of each individual Kernel clause, and M-SEARCH performs a recursive specific-to-general search through the collection of subsumption lattices obtained from the given Kernel Set. These concrete procedures are now used informally in Examples 1 and 2 below, to illustrate the HAIL proof procedure in Figure 2 above.
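The control flow of this cycle can be sketched in a few lines for the propositional case. The code below is an illustrative simplification and not the procedure of [12]: atoms are plain strings, normalisation is trivial, the abducible and body atoms stand in for the mode-declarations, generalisation is limited to dropping body atoms (there are no variables), and compression is scored as covered positive examples minus literals. All names are assumptions.

```python
from itertools import combinations, product


def least_model(clauses):
    """Atoms derivable from ground definite clauses (head, body) by forward chaining."""
    facts, changed = set(), True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in facts and all(b in facts for b in body):
                facts.add(head)
                changed = True
    return facts


def abduce(B, goal, abducibles):
    """Step 2: minimal sets of abducible atoms that, added to B, entail the goal."""
    found = []
    for size in range(1, len(abducibles) + 1):
        for delta in map(frozenset, combinations(abducibles, size)):
            if any(d <= delta for d in found):
                continue  # a subset already explains the goal
            if goal in least_model(B + [(a, []) for a in delta]):
                found.append(delta)
    return found


def deduce(B, alpha, body_atoms):
    """Step 3: Kernel clause whose body holds the permitted atomic consequences of B."""
    model = least_model(B)
    return (alpha, [b for b in body_atoms if b in model])


def search(K, B, E_pos, E_neg):
    """Step 4: choose one generalisation per Kernel clause (a subset of its body)."""
    def generalisations(clause):
        head, body = clause
        return [(head, list(sub)) for n in range(len(body) + 1)
                for sub in combinations(body, n)]

    best, best_score = None, None
    for H in product(*(generalisations(k) for k in K)):
        model = least_model(B + list(H))
        if any(n in model for n in E_neg):
            continue  # inconsistent with a negative example
        score = sum(e in model for e in E_pos) - sum(1 + len(b) for _, b in H)
        if best is None or score > best_score:
            best, best_score = list(H), score
    return best, best_score


def hail(B, E_pos, E_neg, abducibles, body_atoms):
    """Steps 1 and 5: cover-set loop around abduce, deduce and search."""
    B = list(B)
    E_pos = [e for e in E_pos if e not in least_model(B)]
    while E_pos:
        seed = E_pos[0]
        candidates = []
        for delta in abduce(B, seed, abducibles):
            K = [deduce(B, alpha, body_atoms) for alpha in sorted(delta)]
            H, score = search(K, B, E_pos, E_neg)
            if H is not None:
                candidates.append((score, H))
        if not candidates:
            raise ValueError(f"no hypothesis found for seed {seed}")
        B += max(candidates, key=lambda c: c[0])[1]
        E_pos = [e for e in E_pos if e not in least_model(B)]
    return B


B = [("sad", ["tired", "poor"]), ("lecturer", []), ("academic", [])]
result = hail(B, ["sad"], [], ["tired", "poor"], ["lecturer", "academic"])
print(result[len(B):])
```

In this toy run the only minimal explanation of sad has two atoms, so two clauses are added in response to a single seed. A first-order version would replace the body-subset search with a search through θ-subsumption lattices.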
[Figure 3: HAIL cycle for the seed e = meal(md). Step 1 selects the seed. In Step 2, ASLD resolves ? meal(md) to ? burger(md), fries(md) and abduces ∆ = {fries(md)}. In Step 3, BOTTOMSET gives the Kernel clause k = fries(X) :- offer(X). In Step 4, M-SEARCH explores the lattice containing fries(X) :- offer(X), fries(X) :- offer(Y) and fries(X), and selects h = fries(X) :- offer(X). Step 5 returns B′ = B ∪ {fries(X) :- offer(X)}.]

Fig. 3. Fast Food Example - Solved by HAIL (but not by Progol5)

[Figure 4: HAIL cycle for the seed e = sad(a). Step 1 selects the seed. In Step 2, ASLD resolves ? sad(a) to ? tired(a), poor(a) and abduces ∆ = {tired(a), poor(a)}. In Step 3, BOTTOMSET gives the Kernel clauses k1 = tired(X) :- lecturer(X), academic(X) and k2 = poor(X) :- lecturer(X), academic(X). In Step 4, M-SEARCH explores the two θ-subsumption lattices (clauses abbreviated by the first letter of each predicate, from Tx :- Lx, Ax down to Tx, and from Px :- Lx, Ax down to Px) and selects h1 = tired(X) and h2 = poor(X) :- lecturer(X). Step 5 returns B′ = B ∪ {tired(X), poor(X) :- lecturer(X)}.]

Fig. 4. Academic Example - Solved by HAIL (but not by Bottom Generalisation)
Example 1 (Fast Food). This example shows how HAIL is able to overcome the incompleteness of Progol5 identified in Section 4. The background knowledge B, shown below, describes a domain of three bistros: md, bk and rz (abbreviating mcDonalds, burgerKing and theRitz). To have a meal in a bistro, it is sufficient to have burger and fries. Also, a free burger comes with every order of fries at bistros participating in a special offer. The positive examples E+ state that a meal has been eaten at both md and bk. The negative example E− states that a meal has not been eaten at rz. The mode-declarations state that the atom fries(X) may appear in the heads of hypothesised clauses, and the atom offer(X) may appear in the bodies. It can be verified that hypothesis H is derivable by Bottom Generalisation using e = meal(md) or e = meal(bk) as the seed example. However, it is not computed by Progol5, as the queries meal∗(md) and meal∗(bk) fail from the contrapositive extension Be∗. Therefore, STARTSET computes no atoms and Progol5 computes no hypothesis.

As illustrated in Figure 3, HAIL solves Example 1 in the following way. In Step 1, the seed e = meal(md) is selected and normalised, trivially giving ℬ = B and ε = e. In Step 2, given theory ℬ and goal ε, ASLD abduces the hypothesis ∆ = {fries(md)} containing the atom α = fries(md). In Step 3, this atom α becomes the head of a clause k, to which the body atom offer(md) is added by BOTTOMSET. For efficiency BOTTOMSET replaces the constant md with the variable X, as required by the mode-declarations. In Step 4, the θ-subsumption lattice, bounded from above by the newly computed clause k, and from below by the empty clause □, is searched. The most compressive hypothesis is k itself, as all more general clauses are inconsistent with the negative example :- meal(rz). In Step 5, the theory H = {fries(X) :- offer(X)} is added to B, and because both positive examples are now covered, they are removed from E+. The cycle terminates, returning the augmented background B′ = B ∪ H.
Background Knowledge (Fast Food Example)
B = { meal(X) :- fries(X), burger(X)
      burger(Y) :- fries(Y), offer(Y) }
  ∪ { bistro(md), bistro(bk), bistro(rz) }
  ∪ { offer(md), offer(bk), burger(rz) }

Positive Examples
E+ = { meal(md), meal(bk) }

Negative Examples
E− = { :- meal(rz) }

Head-Declarations
M+ = { modeh[fries(+bistro)] }

Body-Declarations
M− = { modeb[offer(+bistro)] }

Hypothesis
H = { fries(Z) :- offer(Z) }
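The coverage claims of Example 1 can be checked by grounding the clauses over the three bistro constants and computing the least model. This is a minimal sketch; the tuple encoding of atoms and the helper names are assumptions made for illustration.

```python
from itertools import product

CONSTANTS = ["md", "bk", "rz"]


def atom(pred, *args):
    return (pred, args)


def ground(clause):
    """All ground instances of a clause whose variables are upper-case names."""
    head, body = clause
    variables = sorted({t for _, args in [head] + body for t in args if t[0].isupper()})
    for values in product(CONSTANTS, repeat=len(variables)):
        s = dict(zip(variables, values))
        yield ((head[0], tuple(s.get(t, t) for t in head[1])),
               [(p, tuple(s.get(t, t) for t in args)) for p, args in body])


def least_model(clauses):
    """Atoms derivable from ground definite clauses by forward chaining."""
    facts, changed = set(), True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in facts and all(b in facts for b in body):
                facts.add(head)
                changed = True
    return facts


def meals(theory):
    """Bistros c for which meal(c) is entailed by the theory."""
    model = least_model([g for clause in theory for g in ground(clause)])
    return sorted(c for c in CONSTANTS if atom("meal", c) in model)


B = [(atom("meal", "X"), [atom("fries", "X"), atom("burger", "X")]),
     (atom("burger", "Y"), [atom("fries", "Y"), atom("offer", "Y")])]
B += [(atom("bistro", c), []) for c in CONSTANTS]
B += [(atom("offer", "md"), []), (atom("offer", "bk"), []), (atom("burger", "rz"), [])]

H = [(atom("fries", "Z"), [atom("offer", "Z")])]   # hypothesis of Example 1
too_general = [(atom("fries", "Z"), [])]           # the more general clause fries(Z)

print(meals(B), meals(B + H), meals(B + too_general))
```

With H added, meal(md) and meal(bk) become derivable and meal(rz) does not; with the more general clause fries(Z), meal(rz) is also derivable, which violates the negative example.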
Example 2 (Academic). This example shows how HAIL is able to compute more than one clause in response to a single seed example, and to derive hypotheses outside the semantics of Bottom Generalisation. The background knowledge, shown below, describes a domain with three academics: oli, ale and kb. Any academic who is both tired and poor is also sad. The positive examples state that ale and kb are sad. The negative examples state that oli is neither sad nor poor. It can be verified that hypothesis H is not derivable by Bottom Generalisation using e = sad(ale) or e = sad(kb) as the seed example, as no literal with the predicates tired or poor is entailed by the complementary extension Be. Therefore no literal with these predicates is contained in the BottomSet Bot(B, e), nor in any clause derivable by Bottom Generalisation. (Note, therefore, that Progol5 cannot compute H.)

As illustrated in Figure 4, HAIL solves Example 2 in the following way. In Step 1, the seed e = sad(ale) is selected and normalised, again giving ℬ = B and ε = e. In Step 2, ASLD abduces the hypothesis ∆ containing the two atoms α1 = tired(ale) and α2 = poor(ale). In Step 3, α1 and α2 become the heads of two clauses k1 and k2, to which the body atoms lecturer(ale) and academic(ale) are added by BOTTOMSET. Note that for efficiency BOTTOMSET replaces the constant ale with the variable X, as required by the mode-declarations. Note also that, in general, different body atoms will be added to different clauses. Note finally that the two clauses k1 and k2 constitute a Kernel Set of B and e. In Step 4, one clause is chosen from each of the θ-subsumption lattices resulting from this Kernel Set. For ease of presentation the clauses in the θ-subsumption lattices have been written without brackets and only the first letter of each predicate symbol is shown. In Step 5, the most compressive hypothesis H consisting of the two clauses tired(X) and poor(X) :- lecturer(X) is added to B, and because both positive examples are now covered, they are removed from E+. The cycle then terminates, returning the augmented background B′ = B ∪ H.

Background Knowledge (Academic Example)
B = { sad(X) :- tired(X), poor(X) }
  ∪ { academic(oli), academic(ale), academic(kb) }
  ∪ { student(oli), lecturer(ale), lecturer(kb) }

Positive Examples
E+ = { sad(ale), sad(kb) }

Negative Examples
E− = { :- sad(oli), :- poor(oli) }

Head-Declarations
M+ = { modeh[tired(+academic)], modeh[poor(+academic)] }

Body-Declarations
M− = { modeb[lecturer(+academic)], modeb[academic(+academic)] }

Hypothesis
H = { tired(X)
      poor(X) :- lecturer(X) }
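The two claims of Example 2 can be checked on a hand-grounded version of the theory. The sketch below is illustrative only: atoms are plain strings, the clauses are instantiated for the three constants, and the names are assumptions. It shows that neither tired(ale) nor poor(ale) explains the seed on its own, that the pair does, and that H covers the positive examples without deriving the negative ones.

```python
CONSTANTS = ["oli", "ale", "kb"]


def least_model(clauses):
    """Atoms derivable from ground definite clauses (head, body) by forward chaining."""
    facts, changed = set(), True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in facts and all(b in facts for b in body):
                facts.add(head)
                changed = True
    return facts


# Ground instances of the background knowledge.
B = [(f"sad({c})", [f"tired({c})", f"poor({c})"]) for c in CONSTANTS]
B += [(f"academic({c})", []) for c in CONSTANTS]
B += [("student(oli)", []), ("lecturer(ale)", []), ("lecturer(kb)", [])]

# Ground instances of the hypothesis H = {tired(X), poor(X) :- lecturer(X)}.
H = [(f"tired({c})", []) for c in CONSTANTS]
H += [(f"poor({c})", [f"lecturer({c})"]) for c in CONSTANTS]

seed = "sad(ale)"
single = [a for a in ["tired(ale)", "poor(ale)"]
          if seed in least_model(B + [(a, [])])]
pair = seed in least_model(B + [("tired(ale)", []), ("poor(ale)", [])])

model = least_model(B + H)
covered = sorted(e for e in ["sad(ale)", "sad(kb)"] if e in model)
violated = sorted(n for n in ["sad(oli)", "poor(oli)"] if n in model)
print(single, pair, covered, violated)
```

The empty list of single-atom explanations is why an approach restricted to one abduced atom per seed cannot reach this hypothesis.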
6
Related Work
The importance of abductive inference in the context of Bottom Generalisation was first realised in [7] and [16]. In [7], Muggleton and Bryant suggest that Progol5 can be seen as a procedure for efficiently generalising the atoms computed by the STARTSET routine, which they view as implementing a form of abduction based on contrapositive reasoning. This paper confirms the view of Muggleton and Bryant, but reveals that STARTSET, and therefore Progol5, are incomplete with respect to Bottom Generalisation. In [16] it is shown that given definite clauses B and e, then Bot− (B, e) is the set of atoms in the least Herbrand model of the definite theory consisting of B and the Skolemised body of e, and Bot+ (B, e) is the set of atoms abducible by SOLDR-resolution from this theory given as goal the Skolemised head of e. The Kernel semantics presented in this paper can be seen as a generalisation of these results that exploits multiple atom abductive hypotheses. In [16], Yamamoto describes a procedure that incorporates explicit abduction within Bottom Generalisation. Atoms in the head and body of the BottomSet are computed by separate abductive and deductive procedures, and hypotheses are formed by generalising the computed atoms. This procedure, however, is non-deterministic and is restricted to definite clause logic. Yamamoto shows that his procedure is able to induce a single clause or a set of facts for each seed example, but he conjectures that it would be difficult to extend the procedure to induce conjunctions of definite clauses. The proof procedure and semantics described in this paper can be seen as generalising those in [7] and [16] by constructing Horn theories not derivable by Bottom Generalisation. But not all hypotheses can be found with this new approach, as can be seen using the following example due to Yamamoto [14]. 
If B = {even(0)} ∪ {even(s(X)) :- odd(X)} and e = odd(s(s(s(0)))), then the hypothesis H = {odd(s(X)) :- even(X)} is not derivable by Kernel Set Subsumption, nor by Kernel Generalisation, as Ker(B, e) = {odd(s(s(s(0)))) :- even(0)} and H ⊭ Ker(B, e). Note that in this example Ker(B, e) = Bot(B, e).

Complete methods of hypothesis finding for full clausal logic are proposed in [17] and [3]. In [17], Yamamoto and Fronhöfer describe a technique based on Residue Hypotheses. Very briefly, the Residue of a ground theory G, written Res(G), is the ground theory consisting of all non-tautological clauses that contain the negation of one literal from each clause in G. A Residue Hypothesis of two clausal theories B and E is defined as the Residue of a subset of the ground instances of clauses in B and clauses in the Residue of the Skolemisation of E. A hypothesis H is derived by the Residue Procedure from B and E iff H generalises a Residue Hypothesis of B and E. If the example consists of a single clause e, then a theory H is derived by the Residue Procedure from B and e iff H |= Res(Gnd(Be)) where Gnd(Be) denotes the ground instances of the complementary extension Be = B ∪ ē. Compare this with Kernel Set Subsumption, which derives a theory H iff H ⊒ K where K is a Kernel Set of B and e. Both procedures derive hypotheses by generalising a ground theory constructed from B and e. For example, if B = {p :- q(a), q(b)} and e = p then H = {q(X)} is derived by the Residue Procedure with Res(Gnd(Be)) = {q(a), p} ∪ {q(b), p} and is derivable by Kernel Set Subsumption with K = {q(a)} ∪ {q(b)}, but not
by Bottom Generalisation, as shown in Proposition 6. In [3], Inoue describes a technique called Consequence Finding Induction or CF-Induction, which is based on the concepts of Production Fields and Characteristic Clauses. Very briefly, a Production Field defines a syntactic language bias on the hypothesis space, and a Characteristic Clause of two clausal theories B and E is a non-tautological clause entailed by B ∧ Ē that is expressed in the language of some Production Field P, and is not properly subsumed by any other such clause. A hypothesis H is derived by CF-Induction iff H generalises the complement of a theory CC(B, E) containing a set of Characteristic Clauses. For the example above, H = {q(X)} is derived by CF-Induction with CC = {p :- q(a), q(b)} ∪ {:- p}, since the complement of CC is equivalent to the theory {q(a), p} ∪ {q(b), p}, and q(X) entails this complement.

Although the Residue Procedure and CF-Induction are more general than HAIL, they must search a correspondingly larger hypothesis space, which makes them computationally expensive. It is believed, however, that practical systems can be developed for HAIL that will build on the success of Progol by overcoming some of the limitations described in this paper.
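The Residue used in the comparison above can be computed directly for small ground theories. The following is a minimal sketch, assuming literals are strings with a leading "~" for negation and that each residue clause is built from exactly one negated literal per clause of G; the function names are hypothetical.

```python
from itertools import product


def negate(literal):
    return literal[1:] if literal.startswith("~") else "~" + literal


def residue(G):
    """Non-tautological clauses formed by negating one literal from each clause of G."""
    clauses = set()
    for picks in product(*G):
        clause = frozenset(negate(l) for l in picks)
        if not any(negate(l) in clause for l in clause):
            clauses.add(clause)
    return clauses


# Ground complementary extension for B = {p :- q(a), q(b)} and e = p.
G = [["p", "~q(a)", "~q(b)"], ["~p"]]
res = residue(G)
print(sorted(sorted(c) for c in res))
```

For this input the tautology containing both p and ~p is discarded, leaving the two clauses {q(a), p} and {q(b), p} quoted in the text, each of which contains an instance of q(X).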
7
Conclusion
This paper has identified an incompleteness of the ILP proof procedure Progol5 with respect to the semantics of Bottom Generalisation, and has proposed a new approach, called Hybrid Abductive Inductive Learning, that integrates abductive and inductive reasoning within a learning cycle that exploits multiple atom abductive hypotheses. A proof procedure has been presented, called HAIL, that overcomes this newly identified incompleteness and further generalises Progol5 by computing multiple clauses in response to a single seed example, and by finding hypotheses not derivable by Bottom Generalisation. A semantics for this proof procedure, called Kernel Generalisation, has been defined, and a refinement of this semantics, called Kernel Set Subsumption, was shown to extend that of Bottom Generalisation. To better characterise the hypotheses derivable by HAIL, precise completeness results are required for the semantics and proof procedures presented in this paper. It is believed that K-derivations, introduced in this paper as an extension of C-derivations, will serve as the basis of such a characterisation. Although Kernel Set Subsumption is an extension of Bottom Generalisation, it is not complete with respect to the general task of inductive generalisation. Therefore, the possibility of enlarging the class of derivable hypotheses by interleaving the abductive, deductive and inductive phases will be investigated. In addition, a prototype implementation of the HAIL proof procedure needs to be developed in order to evaluate the approach. One possibility would be to generalise the current Progol5 implementation and combine this with an existing abductive system, such as the A-System of Van Nuffelen, Kakas, and Denecker [6].
References
1. C. Chang and R.C. Lee. Symbolic Logic and Mechanical Theorem Proving. Academic Press, 1973.
2. K. Furukawa and T. Ozaki. On the Completion of Inverse Entailment for Mutual Recursion and its Application to Self Recursion. In Proceedings of the Work-in-Progress Track, 10th International Conference on Inductive Logic Programming, Lecture Notes in Computer Science, 1866:107–119, Springer-Verlag, 2000.
3. K. Inoue. Induction, Abduction, and Consequence-Finding. In Proceedings 11th International Conference on Inductive Logic Programming, Lecture Notes in AI, 2157:65–79, Springer-Verlag, 2001.
4. A.C. Kakas, R.A. Kowalski, and F. Toni. Abductive Logic Programming. Journal of Logic and Computation, 2(6):719–770, 1992.
5. A.C. Kakas and P. Mancarella. Database Updates through Abduction. In Proceedings of the 16th International Conference on Very Large Databases, pages 650–661, Morgan Kaufmann, 1990.
6. A. Kakas, B. Van Nuffelen, and M. Denecker. A-system: Problem solving through abduction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, pages 591–596, Morgan Kaufmann, 2001.
7. S.H. Muggleton and C.H. Bryant. Theory Completion Using Inverse Entailment. In Proceedings of the 10th International Conference on Inductive Logic Programming, Lecture Notes in Computer Science, 1866:130–146, Springer-Verlag, 2000.
8. S.H. Muggleton and L. De Raedt. Inductive Logic Programming: Theory and Methods. Journal of Logic Programming, 19,20:629–679, 1994.
9. S.H. Muggleton. Inductive Logic Programming. New Generation Computing, 8(4):295–318, 1991.
10. S.H. Muggleton. Inverse Entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):245–286, 1995.
11. G.D. Plotkin. Automatic Methods of Inductive Inference. PhD thesis, Edinburgh University, 1971.
12. O. Ray. HAIL: Hybrid Abductive-Inductive Learning. Technical report 2003/6, Imperial College, Department of Computing, 2003.
13. M.E. Stickel. A Prolog technology theorem prover: Implementation by an extended Prolog compiler. Journal of Automated Reasoning, 4(4):353–380, 1988.
14. A. Yamamoto. Which Hypotheses Can Be Found with Inverse Entailment? In Proceedings of the 7th International Workshop on Inductive Logic Programming, Lecture Notes in Computer Science, 1297:296–308, Springer-Verlag, 1997.
15. A. Yamamoto. An Inference Method for the Complete Inverse of Relative Subsumption. New Generation Computing, 17(1):99–117, 1999.
16. A. Yamamoto. Using Abduction for Induction based on Bottom Generalisation. In Abduction and Induction: Essays on their relation and integration, Applied Logic Series, 18:267–280, Kluwer Academic Publishers, 2000.
17. A. Yamamoto and B. Fronhöfer. Hypothesis Finding via Residue Hypotheses with the Resolution Principle. In Proceedings of the 11th International Conference on Algorithmic Learning Theory, Lecture Notes in Computer Science, 1968:156–165, Springer-Verlag, 2000.
Query Optimization in Inductive Logic Programming by Reordering Literals
Jan Struyf and Hendrik Blockeel
Katholieke Universiteit Leuven, Dept. of Computer Science
Celestijnenlaan 200A, B-3001 Leuven, Belgium
{Jan.Struyf,Hendrik.Blockeel}@cs.kuleuven.ac.be
Abstract. Query optimization is used frequently in relational database management systems. Most existing techniques are based on reordering the relational operators, where the most selective operators are executed first. In this work we evaluate a similar approach in the context of Inductive Logic Programming (ILP). There are some important differences between relational database management systems and ILP systems. We describe some of these differences and list the resulting requirements for a reordering transformation suitable for ILP. We propose a transformation that meets these requirements and an algorithm for estimating the computational cost of literals, which is required by the transformation. Our transformation yields a significant improvement in execution time on the Carcinogenesis data set.
1 Introduction
Many Inductive Logic Programming (ILP) systems construct a predictive or descriptive model (hypothesis) for a given data set by searching a large space of candidate hypotheses. Different techniques exist to make this search more efficient. Some techniques improve efficiency by reducing the number of hypotheses that are evaluated during the search. Examples of such techniques are language bias specification mechanisms [10], which limit the number of valid hypotheses, and search strategies (e.g. branch-and-bound search [9], heuristic search [12] and stochastic search [16]), which reduce the number of hypotheses to be evaluated by cutting away branches in the search.

Evaluating a candidate hypothesis can also be made more efficient. In ILP each hypothesis is represented by a number of clauses in first order logic. For example, in an ILP rule learner such as Progol [9] or FOIL [12], the condition of each rule r is a first order logic query q. Rules are generated and evaluated one by one. In order to evaluate the performance of a rule, the query q has to be executed on each example in the data set.

Many techniques designed for efficiently executing relational queries (e.g. SQL) in a relational database management system [4] can, after some modifications, also be used for efficiently executing first order queries. An example of this is index structures (e.g., hash-tables), which are used to efficiently retrieve tuples from a relational database given the value of some attribute. Most ILP systems provide a similar kind of indexing [7,3] on the datalog facts that are commonly used to describe the examples.

Query optimization [4,5] is another technique that is frequently used in relational database management systems. Query optimizers transform queries into a different form, which is more efficient to execute. Also in ILP, techniques have been introduced for optimizing first order queries (see [13] for an overview). One specific approach that is popular in relational database management systems is to reorder the relational operators such that the most selective ones are executed first. A similar approach has not yet been evaluated in the context of ILP. In this work we introduce a query transformation that optimizes first order queries by reordering literals. The basic idea is that also first order queries become more efficient to execute if "selective" literals are placed first.

This paper is organized as follows. In Section 2 we present a motivating example that illustrates the amount of efficiency that can be gained by a reordering transformation in ILP. In Section 3 we list three important requirements for such a transformation and point out some significant differences with query optimization in relational databases. In Section 4 we introduce a reordering transformation for optimizing first order queries in ILP. We also describe an extension able to handle queries that are the output of the cut-transform [13]. Section 5 presents experiments that evaluate the performance of the transformation on the Carcinogenesis data set. Section 6 concludes the paper.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 329–346, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 A Motivating Example
We start with a motivating example that illustrates the amount of efficiency that can be gained by reordering the literals of a first order query. In this example we use the Carcinogenesis data set [14], which stores structural information about 330 molecules. Each molecule is described by a number of atom/4¹ and bond/4 facts. Consider the following query:

  atom(M,A1,h,3),atom(M,A2,c,16),bond(M,A2,A1,1).

This query succeeds for a molecule M that contains a hydrogen atom A1 of type 3 that is bound to a carbon atom A2 of type 16 by a single bond. We execute each possible reordering (i.e., permutation) of the query on each example molecule. Table 1 lists all permutations sorted by average execution time.

¹ We have omitted the atom charge to simplify the presentation.

Table 1. Averaged execution times (ms/example) for the 6 permutations of an example query. The queries were executed using the YAP Prolog system version 4.4.2 on an Intel P4 1.8GHz with 512MB RAM running Linux

      Time   Permutation
  q1  0.010  atom(M,A2,c,16),bond(M,A2,A1,1),atom(M,A1,h,3).
  q2  0.023  atom(M,A2,c,16),atom(M,A1,h,3),bond(M,A2,A1,1).
  q3  0.052  bond(M,A2,A1,1),atom(M,A2,c,16),atom(M,A1,h,3).
  q4  0.061  atom(M,A1,h,3),atom(M,A2,c,16),bond(M,A2,A1,1).
  q5  0.081  atom(M,A1,h,3),bond(M,A2,A1,1),atom(M,A2,c,16).
  q6  0.091  bond(M,A2,A1,1),atom(M,A1,h,3),atom(M,A2,c,16).

As can be seen from Table 1, the most efficient permutation is 6 times faster than the original query q4 and 9 times faster than the least efficient one. This can be explained by looking at the size of the SLD-tree [8] that is generated when the ILP system executes the permutation on a given example. Given a query q = l1 . . . ln and example e, the SLD-tree will contain at level i different nodes vi,j that correspond to calls to literal li with input substitution θi,j. The input substitution for the root node θ1,1 binds the key variable of the
query (M in the example) to the identifier of example e. The children of a node vi,j correspond to the solutions of li θi,j. The size of the SLD-tree thus depends on the length of the query, which is the maximum depth of the tree, and the number of children of each node. We will call this number of children the non-determinacy of a literal given the corresponding input substitution (Definition 1). Because the non-determinacy of a literal depends on the input substitution, it also depends on the position of the literal in the SLD-tree and hence also on its position in the query.

Definition 1. (Non-determinacy) The non-determinacy nondet(l, θ) of a literal l given an input substitution θ is the number of solutions or answer substitutions that are obtained by calling lθ.

Example 1. We continue the Carcinogenesis example. The non-determinacy of atom(M,A1,h,3) given {M/m1} is equal to the number of hydrogen atoms of type 3 in molecule m1. If we consider {M/m1,A1/a1} then the non-determinacy is either 1 if a1 is a hydrogen atom of type 3 or 0 if it is another kind of atom.

Non-determinacy as defined above is always associated with one specific call, i.e., with one literal and input substitution. The notion of average non-determinacy (Definition 2) is useful to make more general statements about the number of solutions of a literal.

Definition 2. (Average Non-determinacy) Consider a distribution over possible input substitutions D. We define the average non-determinacy of a literal l with respect to D as:
$$\mathrm{nondet}(l, D) = \sum_{(\theta,p)_k \in D} p_k \cdot \mathrm{nondet}(l, \theta_k)$$

with pk the probability of θk given D.
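Definitions 1 and 2 can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the fact base `ATOMS`, the helper `h3`, and the two molecule identifiers are hypothetical and are not taken from the Carcinogenesis data; a call is modelled as a tuple in which `None` stands for a free variable.

```python
# Hypothetical toy facts atom(Mol, AtomId, Element, Type); not real Carcinogenesis data.
ATOMS = [
    ("m1", "a1", "h", 3), ("m1", "a2", "h", 3), ("m1", "a3", "c", 16),
    ("m2", "b1", "c", 22), ("m2", "b2", "h", 3),
]

def nondet(pattern, facts):
    """Definition 1: number of facts that match the call; None marks a free argument."""
    return sum(
        all(p is None or p == v for p, v in zip(pattern, fact))
        for fact in facts
    )

def avg_nondet(make_call, distribution, facts):
    """Definition 2: sum of p_k * nondet(l, theta_k) over all (theta_k, p_k) in D."""
    return sum(p * nondet(make_call(theta), facts) for theta, p in distribution)

# D: uniform distribution over substitutions that ground only the key argument M.
molecules = ["m1", "m2"]
D = [(m, 1.0 / len(molecules)) for m in molecules]

def h3(mol, atom=None):
    """The call atom(M,A1,h,3) with M (and optionally A1) ground."""
    return (mol, atom, "h", 3)

print(nondet(h3("m1"), ATOMS))        # hydrogen atoms of type 3 in m1
print(nondet(h3("m1", "a3"), ATOMS))  # a3 is a carbon atom in the toy data
print(avg_nondet(h3, D, ATOMS))       # average over both toy molecules
```

Grounding the atom argument in addition to the key turns the count into a 0/1 membership test, which is the behaviour described in Example 1.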
Example 2. Consider all N molecules in the Carcinogenesis domain and let D be a uniform distribution over the input substitutions that ground the key argument of a literal to one of the molecule identifiers. The average non-determinacy of bond(M,A1,A2,N) with respect to D is about 20, which is the average number of bonds that occurs in a molecule. Whenever D is not specified in the future it refers to the same distribution as is defined in this example. Consider again the fastest (q1 ) and slowest (q6 ) permutation from Table 1. In permutation q1 the average non-determinacy of l1 is low because few molecules contain carbon atoms of type 16. The non-determinacy of l2 , given an input substitution, is at most 4 because carbon atoms can have at most 4 bonds. The non-determinacy of l3 is either zero or one because A1 is ground at that point. Since only few molecules contain carbon atoms of type 16, the average SLD-tree size will be small for q1 . In permutation q6 the bond/4 literal is executed first. Because both A1 and A2 are free at that point, the non-determinacy of this call will be equal to the number of bonds in the molecule (about 20 on average). The non-determinacies of l2 and l3 will be either zero or one. The SLD-trees for q6 will be much bigger on average and hence the execution time, which is proportional to the average SLD-tree size, will also be longer.
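The effect of literal order on SLD-tree size can be reproduced on a toy molecule. The sketch below is a simplified model, not the YAP execution mechanism: it counts one node per literal call in the complete SLD-tree, treats capitalised strings as variables, and uses a hypothetical molecule `m` (one carbon of type 16, one carbon of type 22, four hydrogens of type 3) that is an assumption made for illustration.

```python
from itertools import permutations

# Hypothetical toy molecule "m"; facts are (predicate, arg1, ..., arg4).
FACTS = [
    ("atom", "m", "c1", "c", 16), ("atom", "m", "c2", "c", 22),
    ("atom", "m", "h1", "h", 3), ("atom", "m", "h2", "h", 3),
    ("atom", "m", "h3", "h", 3), ("atom", "m", "h4", "h", 3),
    ("bond", "m", "c1", "h1", 1), ("bond", "m", "c1", "c2", 1),
    ("bond", "m", "c2", "h2", 1), ("bond", "m", "c2", "h3", 1),
    ("bond", "m", "c2", "h4", 1),
]

def is_var(term):
    """Variables are strings that start with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def solutions(literal, theta, facts):
    """Answer substitutions of one call: extensions of theta that match a fact."""
    answers = []
    for fact in facts:
        subst, ok = dict(theta), True
        for term, value in zip(literal, fact):
            if is_var(term):
                if subst.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            answers.append(subst)
    return answers

def tree_size(query, theta, facts):
    """Number of literal calls in the complete SLD-tree of the query."""
    if not query:
        return 0
    return 1 + sum(tree_size(query[1:], s, facts)
                   for s in solutions(query[0], theta, facts))

H = ("atom", "M", "A1", "h", 3)
C = ("atom", "M", "A2", "c", 16)
B = ("bond", "M", "A2", "A1", 1)
Q1, Q6 = [C, B, H], [B, H, C]   # orderings q1 and q6 from Table 1

for perm in sorted(permutations([H, C, B]),
                   key=lambda p: tree_size(list(p), {"M": "m"}, FACTS)):
    print(tree_size(list(perm), {"M": "m"}, FACTS), [lit[0] for lit in perm])
```

In this toy setting the ordering that starts with the selective carbon literal produces a smaller tree than the ordering that starts with the bond literal, which mirrors the explanation given for q1 and q6.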
3 Requirements
We start by listing a number of requirements for a reordering transformation for first order queries in the context of ILP.

R1. (Correctness) The reordering transformation should be correct, i.e. the transformed query should succeed (fail) for the same examples as the original query succeeds (fails).

R2. (Optimality) The reordering transformation should approximate the optimal transformation, which replaces the original query by the permutation that has the shortest average execution time.

R3. (Efficiency) The reordering transformation itself should be efficient. The time for performing the transformation should be smaller than the efficiency difference that can be obtained by executing the transformed query instead of the original one.

If all predicates are defined by sets of facts, then the order of the literals does not influence the result of the query (cf. the switching lemma [8]) and the correctness requirement (R1) is met. If some predicates have input arguments that should be ground (e.g., background predicates that perform a certain computation), then it is possible that a given permutation is invalid because it breaks this constraint. The reordering transformation must make sure not to select such a permutation.

The efficiency requirement (R3) can be easily met in the context of a relational database where executing a query typically involves (slow) disk access. In ILP, query execution is much faster. In many applications the entire data set
resides in main memory. Even if the data set does not fit completely in main memory at once, ILP systems can still use efficient caching mechanisms that load examples one by one [1,15] to speed up query execution. When queries are executed fast, transformation time becomes more important. ILP systems also consider different subsets of the data during the refinement process. Some queries are executed on very small sets of examples. If a query is executed on a smaller set, it will be executed faster, which again makes R3 more difficult to achieve.

To meet the optimality requirement (R2), query optimizers in relational database management systems place selection operators with a high selectivity, joins that are expected to return a small number of tuples, and projections that select few attributes, as early as possible. Similarly, a reordering transformation for first order queries should place literals with a low average non-determinacy first in order to avoid unnecessary backtracking.

An important difference between relational database management systems and ILP systems is that relational database management systems execute queries bottom-up instead of top-down. The user is always interested in obtaining all solutions for a given query. In ILP, query execution stops after finding the first solution (success). This is because the ILP system only needs to know for a given example whether a given query succeeds or not. In this context, a top-down execution strategy is more efficient. A consequence of the top-down strategy is that a query optimizer for first order queries should place literals that ground many variables as early as possible. Grounding variables decreases the non-determinacy of the literals that follow, which decreases execution time.

Typical for ILP applications are predicates that have a long execution time (e.g., background predicates that compute aggregates). Such predicates should be placed after fast predicates with a low non-determinacy and before predicates with a high non-determinacy. Due to the top-down strategy, the expensive predicate is not executed if the low non-determinacy predicate fails and may be executed several times if the high non-determinacy predicate has several solutions for a given example.

A last difference is that in relational database systems queries are mostly generated directly by humans and not by a refinement operator ρ as is the case for ILP systems. In ILP systems the efficiency of the original query (as it is generated by ρ) depends (much) on the definition of ρ and on the language bias specification that is used.
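The placement rule for expensive predicates can be checked with a small calculation. The sketch below assumes that the expected number of calls to a literal is the product of the average non-determinacies of the literals placed before it (the approximation the paper adopts in Section 4.1). The three literal names and their numbers are hypothetical.

```python
from itertools import permutations
from math import prod

# Hypothetical literals: name -> (average non-determinacy, cost per call in ms).
LITERALS = {
    "selective": (0.2, 1.0),    # fast, rarely succeeds
    "aggregate": (1.0, 100.0),  # expensive background predicate
    "branching": (5.0, 1.0),    # fast, many solutions
}

def expected_calls(order, name):
    """Expected calls to `name`: product of the non-determinacies placed before it."""
    position = order.index(name)
    return prod(LITERALS[other][0] for other in order[:position])

calls = {order: expected_calls(order, "aggregate")
         for order in permutations(LITERALS)}
best = min(calls, key=calls.get)

for order, n in sorted(calls.items(), key=lambda item: item[1]):
    print(f"{n:4.1f} expected calls to aggregate in {order}")
```

With these assumed numbers the expensive literal is called least often when it sits directly after the selective literal and before the branching one.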
4 Reordering Transformation

4.1 Reordering Dependent Literals
In this section we introduce a possible reordering transformation for first order queries suitable to be implemented in ILP systems. We will introduce two versions of this transformation. In this section we describe the first version T1 . The second version, which extends T1 , will be discussed in Section 4.2.
Given a query q, the optimal reordering transformation T1 should return the permutation T1(q) that has the shortest average execution time. More formally, we can define T1 as follows (cost(q) represents the average execution time of q and perms(q) the set of all permutations of q).

$$T_1(q) = \operatorname*{argmin}_{q_k \in \mathrm{perms}(q)} \mathrm{cost}(q_k)$$

Of course, cost(qk) cannot be computed without executing permutation qk on all examples. Therefore, we will try to approximate T1 by replacing the average execution time by an estimate. We call the resulting approximate transformation $\hat{T}_1$.

$$\hat{T}_1(q) = \operatorname*{argmin}_{q_k \in \mathrm{perms}(q)} \widehat{\mathrm{cost}}(q_k)$$

We will now explain how the estimated average execution time cost(q) of a query q can be computed. The average execution time of a query q depends on the individual execution times of its literals. The execution time and average execution time of a literal can be defined in the same way as non-determinacy and average non-determinacy were defined in Section 2.

Definition 3. (Cost) The execution time cost(l, θ) of a literal l given an input substitution θ is defined as the time necessary to compute all solutions to the call lθ (i.e., the execution time of the Prolog query '?- lθ, fail.').

Definition 4. (Average Cost) Consider a distribution over possible input substitutions D. We define the average execution time of a literal l with respect to D as:

$$\mathrm{cost}(l, D) = \sum_{(\theta,p)_k \in D} p_k \cdot \mathrm{cost}(l, \theta_k)$$

with pk the probability of θk given D.

Using the definitions of average non-determinacy and average execution time of literals we can compute the average execution time of a query q as follows.

$$\mathrm{cost}(q) = \sum_{i=1}^{n} w_i \cdot \mathrm{cost}(l_i, D_i) \qquad (1)$$

$$w_1 = 1.0; \qquad w_i = \overline{\prod_{j=1}^{i-1} \mathrm{nondet}(l_j, \theta_j)} \approx \prod_{j=1}^{i-1} \mathrm{nondet}(l_j, D_j), \quad i \ge 2$$
Equation (1) computes the average execution time by summing the average execution times of all individual literals. Each term is weighted with a weight wi, which is the average number of calls to this literal in an SLD-tree generated by the ILP system. The correct value of wi is the average of the product of the non-determinacies of the preceding literals. For transformation $\hat{T}_1$ we will approximate this value by the product of average non-determinacies as shown above. Each literal in a given query will be called with different instantiations for its arguments (one for each occurrence of the literal in the SLD-tree for a given example). We average over these different instantiations by considering for each literal li a distribution of input substitutions Di.

Example 3. Consider the query q = l1(X), l2(X) and suppose that l1 has non-determinacy 2 and execution time 2ms. Literal l2 has average non-determinacy 0.5 and average execution time 1ms with respect to D2. Distribution D2 is a distribution over the answer substitutions for X of l1. The average execution time of q is cost(q) = 1 × 2ms + 2 × 1ms = 4ms because l1 will be called once, which takes 2ms, and l2 will be called twice (once for each solution to l1), which also takes 2ms.

In order to perform $\hat{T}_1(q)$ we need to estimate the average execution time of each permutation qk. We compute $\widehat{\mathrm{cost}}(q_k)$ with (1) where the average non-determinacy and execution time are replaced by estimates, which we compute based on the data. In practice, it is not feasible to obtain for each literal in each query an estimate for the corresponding distribution Di and associated cost(li, Di) and nondet(li, Di). Therefore, we will use an approximate uniform distribution to estimate average cost and non-determinacy. An algorithm for constructing such an approximate distribution is discussed in Section 4.3.

In order to select the appropriate approximate distribution we will use the following constraint that holds on Di and that must also hold for reasonable approximate distributions. Each input substitution in Di must assign a value to the input variables of li that are grounded by calls to preceding literals. Consider again q = l1(X), l2(X), and assume that l1 grounds X. Then all input substitutions in D2 must also ground X. More formally,

$$\forall (\theta, p)_k \in D_i : \mathrm{vars}(\theta_k) \supseteq \mathrm{vars}(l_i) \cap \bigcup_{j=1}^{i-1} \mathrm{groundvars}(l_j)$$

with pk the probability of θk given Di, vars(θk) the variables grounded by input substitution θk, vars(li) the set of (input) variables of li and groundvars(li) the set of variables that is grounded by calling li.

Note that (1) computes the execution time for generating a complete SLD-tree, which may have several succeeding paths. As said before, most ILP systems stop execution after finding the first success. Having said that, we will (for simplicity reasons) still use (1) to estimate the average execution time of the different permutations.

In order to meet R1 (correctness), the transformation must rule out queries that call background predicates p with free variables for arguments that are required to be ground. This is accomplished by setting $\widehat{\mathrm{cost}}(p, D) = \infty$ for distributions D that do not ground all these variables.
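The estimate in (1) and the resulting permutation search can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ library: each literal carries a hypothetical table `est` that maps the set of its variables already ground at call time to an estimated (non-determinacy, cost) pair, every literal is assumed to ground all its variables, and a missing table entry stands for a call that violates a groundness requirement (cost ∞). The numbers for l1 and l2 are those of Example 3; the assumption that l2 requires X to be ground is added here for illustration only.

```python
from itertools import permutations
from math import inf

def estimated_cost(query):
    """Equation (1): sum of w_i * cost(l_i, D_i), with w_i the product of the
    average non-determinacies of the preceding literals."""
    total, weight, ground = 0.0, 1.0, set()
    for lit in query:
        bound = frozenset(ground & set(lit["vars"]))
        if bound not in lit["est"]:
            return inf          # call with a required argument still free (R1)
        nondet, cost = lit["est"][bound]
        total += weight * cost
        weight *= nondet
        ground |= set(lit["vars"])   # assumption: a call grounds all its variables
    return total

def t1_hat(query):
    """Pick the permutation with the lowest estimated cost (exhaustive search)."""
    return min(permutations(query), key=estimated_cost)

# Example 3: l1 has non-determinacy 2 and cost 2ms; l2 has 0.5 and 1ms once X is ground.
l1 = {"name": "l1", "vars": ("X",), "est": {frozenset(): (2.0, 2.0)}}
l2 = {"name": "l2", "vars": ("X",), "est": {frozenset({"X"}): (0.5, 1.0)}}

print(estimated_cost([l1, l2]))                 # cost of the order used in Example 3
print(estimated_cost([l2, l1]))                 # l2 called with X free
print([lit["name"] for lit in t1_hat([l2, l1])])
```

Because the search enumerates all permutations, its running time grows factorially with the number of literals, so a sketch like this is only practical for short queries.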
Because the transformation requires computing all permutations of q, it has an exponential time complexity in the number of literals n. For large values of n it may become difficult to meet requirement R3 with such a transformation. However, in our experiments, where query lengths varied from 1 to 6, this was not a problem.

4.2 Reordering Sets of Literals
Some queries can be split in several independent sets of literals [13,6]. Consider for example the following query.

  atm(M,A1,n,38),bond(M,A1,A2,2),atm(M,A3,o,40).

This query succeeds if a given molecule contains a nitrogen atom of type 38 with one double bond and an oxygen atom of type 40. If l1 and l2 succeed, but l3 fails, then alternative solutions for l1 and l2 will be tried. However, this is useless if l3 does not depend on l1 and l2, i.e. if it does not share variables with l1 and l2. The cut-transformation proposed in [13] avoids this unnecessary backtracking, by putting cuts between the independent sets of literals.

  atm(M,A1,n,38),bond(M,A1,A2,2),!,atm(M,A3,o,40).

If S1 and S2 are independent sets of literals, then the execution time of S2 will not depend on the execution time of S1. This implies that S1 and S2 can be transformed by our reordering transformation $\hat{T}_1$ independently. Once S1 and S2 have been transformed the question remains whether we should execute S1,!,S2 or S2,!,S1. The efficiency of both permutations of the sets may be different because the set after the cut does not need to be executed if the first set fails for a given example. The following equation defines transformation $\hat{T}_2$, which extends T1, for reordering a query that is a conjunction of independent sets separated by cuts.

$$\hat{T}_2(S_1, !, S_2, !, \ldots, S_m) = \operatorname*{argmin}_{q_k \in \mathrm{perms}(S_1, !, S_2, !, \ldots, S_m)} \mathrm{cost}(q_k)$$

We approximate the average execution time of S1, !, S2, !, . . . , Sm as follows.

$$\mathrm{cost}(S_1, !, S_2, !, \ldots, S_m) = \sum_{i=1}^{m} w_i \cdot \mathrm{cost}(S_i)$$

$$w_1 = 1.0; \qquad w_i = \prod_{j=1}^{i-1} \mathrm{Psucceeds}(S_j), \quad i \ge 2$$

with wi the probability that Si will be executed and Psucceeds(Si) the probability that Si succeeds. We estimate Psucceeds(Si) in the following, rather ad-hoc way.

$$\mathrm{Psucceeds}(S) = \min\left(1.0,\ \prod_{l_i \in S} \mathrm{nondet}(l_i, D_i)\right) \qquad (2)$$
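The ordering of independent sets can be illustrated in the same style. The sketch below applies (2) and the weighted sum above to two hypothetical sets; the names S1 and S2, their costs, and their non-determinacies are assumptions chosen so that the arithmetic is easy to follow.

```python
from itertools import permutations
from math import prod

def p_succeeds(nondets):
    """Equation (2): min(1.0, product of the average non-determinacies in the set)."""
    return min(1.0, prod(nondets))

def cost_of_sets(sets):
    """Set i is only executed if all earlier sets succeeded, so its cost is
    weighted by the product of their success probabilities."""
    total, weight = 0.0, 1.0
    for s in sets:
        total += weight * s["cost"]
        weight *= p_succeeds(s["nondets"])
    return total

def t2_hat(sets):
    """Order the independent sets so that the estimated cost is minimal."""
    return min(permutations(sets), key=cost_of_sets)

# Hypothetical sets: S1 is expensive and almost always succeeds,
# S2 is cheap and succeeds for only a quarter of the examples.
S1 = {"name": "S1", "cost": 4.0, "nondets": [5.0]}
S2 = {"name": "S2", "cost": 1.0, "nondets": [0.5, 0.5]}

print(cost_of_sets([S1, S2]))
print(cost_of_sets([S2, S1]))
print([s["name"] for s in t2_hat([S1, S2])])
```

Placing the cheap, rarely succeeding set first pays off here because the expensive set is then skipped for most examples.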
The motivation for (2) is that if a set has a low non-determinacy (e.g., if it has on average 0.3 solutions) then we can use this value as an estimate for Psucceeds(Si). If the non-determinacy is high (e.g., if the set has on average 5 solutions) then we assign a probability of 1.0.

In order to transform a query that contains several independent sets of literals, we first transform each of these sets separately using $\hat{T}_1$ and after that reorder the sets using T2. Both T1 and T2 require estimates for nondet(li, Di) and cost(li, Di). We will show how these can be obtained from the training examples in the following section.

4.3 Estimating the Cost and Non-determinacy of a Literal
The average non-determinacy and execution time of a literal depend both on the constants that occur as arguments of the literal and on the relevant distribution of input substitutions D. In this section we describe an algorithm that automatically estimates average non-determinacy and execution time for each possible combination of constants and for different uniform distributions D (with different groundness constraints for the arguments). The algorithm takes as input a set of predicate type definitions. Consider the following type definitions from the Carcinogenesis example.

  type(atom(mol:key,atom:any,element:const,atype:const,charge:any)).
  type(bond(mol:key,atom:any,atom:any,btype:const)).

Each argument of a type definition is of the form a:b with a the type of the argument and b its tag. Arguments marked with tag key are grounded to the example key by each substitution in D. Arguments marked with const will always occur with one of the possible constants for the corresponding type filled in. Arguments with tag any can be free or can be ground depending on D. The algorithm works as follows.

1. For each type definition typedef k, collect in a set Ck the possible combinations of constant values that occur in the data for the arguments marked with const.
2. For each type typei that occurs with tag any, collect for each example ej in a set Di,j the constants that occur for arguments of typei in ej for any of the predicates that has at least one such argument.
3. Run for each type definition typedef k the algorithm shown in Fig. 1.

The algorithm from Fig. 1 computes for a given type definition, for each possible combination of constants C and for different input substitution distributions D the average execution time and non-determinacy with respect to D for a literal l with name and arity as given by the type definition and the constants from C filled in. The first loop of the algorithm is over the different combinations of constants and the second loop over the different distributions. For each distribution, a different subset I of the arguments with tag any will be ground.

  function compute_avg_nondet_and_cost(typedef_k)
    I_k = set of all arguments from typedef_k that have tag any
    for each C ∈ C_k
      for each I ∈ 2^(I_k)
        create literal l (name/arity as in typedef_k, constants from C filled in)
        cost = 0; nondet = 0; count = 0
        for each example e_j
          θ_j = substitution that replaces key variable by identifier e_j
          if I = ∅ then
            nondet = nondet + nondet(l, θ_j); cost = cost + cost(l, θ_j)
            count = count + 1
          else
            𝒫 = the set of all possible combinations of constants for the
                attributes in I (constructed using D_i,j)
            for each combination of constants P ∈ 𝒫
              σ = θ_j ∪ substitution for each constant in P
              nondet = nondet + nondet(l, σ); cost = cost + cost(l, σ)
              count = count + 1
        estimated cost[typedef_k, D(I), C] = cost / count
        estimated nondet[typedef_k, D(I), C] = nondet / count

Fig. 1. An algorithm for estimating average non-determinacy and execution time.

For the bond(M,A1,A2,T) type definition, there are two arguments with tag any. The first distribution D(1) that the algorithm considers will leave both atom arguments free. Consecutive iterations will consider D(2) with A1 ground, D(3) with A2 ground and finally D(4) with both A1 and A2 ground.

The part inside the double loop computes for literal l the average execution time and non-determinacy with respect to D(I). The first instructions construct l based on the type definition and constants in C. If all arguments with tag any are free (i.e., I = ∅), then D(I) becomes a distribution of substitutions that unify the key argument with one of the example identifiers. The average cost and non-determinacy for such a distribution can be estimated by averaging over all examples ej (cf. the then branch in Fig. 1). If some of the arguments with tag any are ground in D(I), then constructing a uniform distribution over possible input substitutions is more complex (cf. the else branch in Fig. 1). In that case, the averages have to be computed over each possible combination of relevant constants (from Di,j).

Example 4. We will illustrate the algorithm for the bond/4 type definition. The first step is to collect constants in Ck. For bond/4, there is only one argument tagged const: the bondtype. The corresponding constants are {1, 2, 3, 7} (7 representing an aromatic bond). Type atom is the only type tagged any. We collect for each example ej in set Datom,j all constants of type atom that occur, i.e., for each example the set of atom identifiers. The next step is to run the algorithm
of Fig. 1. It will compute the average non-determinacy and execution time for bond/4 with different constant combinations from Ck filled in and for different subsets of its two tag any arguments ground. A subset of the results is shown in Table 2.
Table 2. Average non-determinacy and execution time for bond(M,A1,A2,T) with different constant combinations filled in and for different subsets of its two tag any arguments ground. The results were obtained with YAP Prolog version 4.4.2. Only the relative values of the costs are important

  A1      A2      T    nondet(l, D)  cost(l, D)
  free    free    1    20.4          0.0081
  free    ground  1    0.74          0.0034
  ground  free    1    0.74          0.0016
  ground  ground  1    0.017         0.0022
  free    free    2    1.36          0.0074
  ...     ...     ...  ...           ...
Note that our algorithm uses uniform distributions over the possible identifiers to estimate average non-determinacy and execution time. If a literal occurs somewhere in the middle of a conjunction, the distribution over input substitutions will probably not be uniform. Using the estimates based on uniform distributions is an approximation.

The computational complexity of the algorithm is $O(2^{|I_k|}\,|C_k|)$, i.e., exponential in the number of variables that occur in a type definition. In practice, the arity of predicates is usually relatively small, so that this approach is not problematic. Moreover, the computation needs to happen only once, not once for each query to be transformed; so the requirement that the query transformation must be efficiently computable remains fulfilled. In those cases where the complexity of the algorithm is nevertheless prohibitive, other techniques for estimating the selectivity of literals could be used instead of this straightforward method; see for instance [11].
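The estimation procedure of Fig. 1 can be sketched for the bond/4 type definition as follows. This is an illustrative Python approximation under several assumptions: the two example molecules are hypothetical, execution time is replaced by the number of facts scanned per call (a proxy, since real timings depend on the Prolog system), and the empty subset I is handled by the same loop as the non-empty ones because the Cartesian product over zero arguments yields exactly one empty combination.

```python
from itertools import combinations, product

# Hypothetical examples: key -> bond(Mol, Atom1, Atom2, Type) facts.
EXAMPLES = {
    "m1": [("m1", "a1", "a2", 1), ("m1", "a2", "a3", 1)],
    "m2": [("m2", "b1", "b2", 2)],
}
ANY_ARGS = (1, 2)    # argument positions tagged `any`
CONST_ARGS = (3,)    # argument positions tagged `const`

def call(pattern, facts):
    """Return (number of solutions, cost proxy); None marks a free argument.
    The cost proxy is the number of facts scanned, an assumption of this sketch."""
    matches = sum(all(p is None or p == v for p, v in zip(pattern, fact))
                  for fact in facts)
    return matches, len(facts)

def estimate():
    all_facts = [f for facts in EXAMPLES.values() for f in facts]
    # Step 1: C_k, the constant combinations seen for the const arguments.
    c_k = sorted({tuple(f[i] for i in CONST_ARGS) for f in all_facts})
    table = {}
    for consts in c_k:
        for size in range(len(ANY_ARGS) + 1):
            for ground_args in combinations(ANY_ARGS, size):   # subset I
                nondet = cost = count = 0
                for key, facts in EXAMPLES.items():
                    # Step 2: D_{i,j}, the atom identifiers occurring in example j.
                    atoms = sorted({f[i] for f in facts for i in ANY_ARGS})
                    for values in product(atoms, repeat=len(ground_args)):
                        pattern = [key, None, None, None]
                        for pos, c in zip(CONST_ARGS, consts):
                            pattern[pos] = c
                        for pos, v in zip(ground_args, values):
                            pattern[pos] = v
                        n, t = call(pattern, facts)
                        nondet, cost, count = nondet + n, cost + t, count + 1
                table[(consts, ground_args)] = (nondet / count, cost / count)
    return table

table = estimate()
for (consts, ground_args), (nd, c) in table.items():
    print(consts, ground_args, round(nd, 3), round(c, 3))
```

The resulting table has one row per constant combination and per subset of ground `any` arguments, which is the same shape as Table 2; the values themselves depend entirely on the toy data.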
5 Experimental Evaluation

5.1 Aims
The aim of our experiments is to obtain more insight in the behavior of the reordering transformation, both with and without the reordering of independent sets.

A first experiment compares for a given query q the efficiency of $\hat{T}_1(q)$ and $\hat{T}_2(q)$ with the efficiency of the original query and that of the most efficient permutations T1(q) and T2(q). It also provides more insight in the influence of the query length on the obtainable efficiency gain. A second experiment compares the efficiency of running the ILP system Aleph without a reordering transformation with its efficiency when using $\hat{T}_1$ and $\hat{T}_2$. In each experiment we use the average non-determinacies and execution times estimated as described in Section 4.3 to perform $\hat{T}_1$ and $\hat{T}_2$ (see also Table 2).

5.2 Materials
All experiments are performed on the Carcinogenesis data set [14], which is the same data set that has been used as running example throughout this paper. As Prolog system we use YAP Prolog² version 4.4.2 on an Intel P4 1.8GHz system with 512MB RAM running Linux. We also use the ILP system Aleph³ version 4 (13 Nov 2002) which implements a version of Progol [9] and runs under YAP. Aleph uses the cut and theta transformations as defined in [13]. The language bias we use in the experiments only contains modes for the atom and bond predicates and is defined as follows.

  :- mode(*,bond(+drug,+atomid,-atomid,#integer)).
  :- mode(*,atom(+drug,-atomid,#element,#integer,-charge)).
  :- set(clauselength,7), set(nodes,50000).

We have implemented both transformations $\hat{T}_1$ and $\hat{T}_2$ as an add-on library for Aleph. This library (written in C++) is available from the authors upon request.

² http://yap.sourceforge.net.
³ http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.

5.3 Experiment 1

$\hat{T}_1$. In this experiment we have sampled 45000 sets of linked literals (i.e., sets that are separated by cuts in the queries, after they have been processed by the cut-transform) uniformly from the queries that occur during a run of Aleph on the Carcinogenesis data set. For each permutation of each of these queries we estimated the average execution time over all 330 molecules. Figure 2 presents the results. The first bar represents the average execution time of the original queries q that are generated by the refinement operator of Aleph, the second bar is the average execution time of the permutations that are selected by $\hat{T}_1(q)$ and the last bar the average execution time of the most efficient permutations T1(q). Above each set of bars, N is the number of queries, Tmax the average execution time of the slowest permutations, Tavg the average of the average execution time over all permutations, Tref the average execution time of the original queries, Sbest the efficiency gain of T1(q) over the original query and Str the efficiency gain of $\hat{T}_1(q)$. In some cases Tmax and Tavg are marked with a >-sign. This occurs because we have set a maximum execution time of 4s for each permutation (in order to make the experiment feasible). When this time is
exceeded the execution of the permutation is aborted and its execution time is excluded from the averages. All execution times are averaged over all queries and examples (the first set of bars) and also averaged over all examples and queries with a given length (the other bars). The averages shown for the long queries have a rather large variance.
[Figure 2: bar chart of query execution time relative to Tref (y-axis, 0.0 to 1.0) per query length (x-axis: AVERAGE, 1 to 6), with bars Refinement, Transformed and Best. The values printed above the bars are listed below.]

  Query length  AVERAGE  1      2      3      4       5      6
  N             45000    31248  7738   3513   1567    689    245
  Tmax          0.094    0.004  0.016  0.041  >0.42   >2.0   >11
  Tavg          0.016    0.004  0.010  0.019  >0.092  >0.27  >0.76
  Tref          0.010    0.004  0.005  0.009  0.036   0.15   0.35
  Sbest         2.1      1.0    1.0    1.3    4.4     17     34
  Str           2.1      1.0    1.0    1.3    4.1     15     20

  N = number of queries; T = execution time (ms/query/example)

Fig. 2. Experiment on independent sets of literals

The main conclusion from this experiment is that $\hat{T}_1$ almost always succeeds in selecting the most efficient permutation or a permutation with execution time close to that of the most efficient one (because Str is close to Sbest). The difference is larger for longer queries, probably because the approximations made for estimating the cost of a query are less accurate in that case. Surprisingly there is almost no gain possible for queries of length 2. This is because in that case the refinement operator of Aleph always generates the most efficient permutation (if compared to Tavg it is possible to gain a factor of 2 in efficiency). Aleph always generates atom,bond instead of bond,atom. The second one is less efficient than the first one because the highly non-determinate bond literal comes first. If we look at longer queries we see that the transformation yields higher gains (up to 20× for queries of length 6). Comparing Tref with Tavg shows that the original queries are already very efficient. This would probably not be the case anymore if we would use a more naively defined language bias (e.g. one that allows adding bonds with the two atom identifiers free): $\hat{T}_1$ would yield much higher gains.
Jan Struyf and Hendrik Blockeel
Note also that Aleph generates many more small independent sets than larger ones, which also reduces the average gain. This again depends on the language bias that is used.
T̂2. In this experiment we have sampled 5000 queries uniformly from the queries that occur during a run of Aleph. Each of these queries can contain up to 6 independent sets of linked literals. We estimated for each permutation of the sets and for each permutation of the literals in each set the average execution time over all examples. Figure 3 presents the results. The first bar again corresponds to the execution time of the original query q, the second bar to that of the best possible permutation without reordering the sets (i.e., T1(q)), the third bar to that of the permutation returned by T̂2 and the last one to that of T2(q). The other quantities are defined similarly to the previous experiment. One important difference is that here the sets of bars are averaged over queries with a given maximum independent set size (1 - 6).
[Bar chart omitted: query execution time relative to Tref, per maximum independent set size, for the bars Refinement, Best (rs=0), Transformed and Best (rs=1).]
Max independent set size   N      Tmax    Tavg     Tref    Sbest   Str
AVERAGE                    5000   0.45    0.049    0.019   3.5     3.3
1                          1802   0.005   0.005    0.005   1.1     1.1
2                          1621   0.016   0.009    0.006   1.1     1.1
3                          910    0.039   0.016    0.011   1.8     1.6
4                          427    0.36    0.068    0.039   5.7     5.3
5                          193    >2.3    >0.29    0.096   13      11
6                          47     >35     >2.7     0.69    78      67
N = number of queries; T = execution time (ms/query/example)
Fig. 3. Experiment on entire clauses
The main conclusion is that not much efficiency can be gained by reordering the sets. This is expected because reordering sets does not alter the average non-determinacy of literals. Contrary to the previous experiment, some gain can be obtained for queries that have 6 independent literals and queries where all independent sets have at most two literals. Transformation T̂2 performs reasonably well for a low maximum set size and less well for queries with larger independent sets. Note that, to simplify the presentation, Fig. 3 does not show results for T̂1.
5.4 Experiment 2
In this experiment we compare the runtime of Aleph without any reordering transformation, with its runtime when using T̂1 and T̂2. Table 3 presents the results for the original Carcinogenesis data set and Table 4 for an over-sampled version in which each molecule occurs 5 times. The total execution time is split up in three components: the query execution time Texec., the time used for performing the reordering transformation Ttrans. and a term Tother which includes, for example, the time for computing the bottom-clauses and the time taken by the refinement operator.
Table 3. Runtime of Aleph (in seconds) on the Carcinogenesis data set
Transformation   Texec.   Ttrans.   Tother   Sexec.   Stotal
none             2230     /         3900     /        /
T̂1               1330     100       3920     1.7      1.15
T̂2               1310     110       3850     1.7      1.16
Table 4. Runtime of Aleph (in seconds) on a 5× over-sampled version of the Carcinogenesis data set
Transformation   Texec.   Ttrans.   Tother   Sexec.   Stotal
none             22720    /         14530    /        /
T̂1               13720    320       14550    1.7      1.3
T̂2               13690    370       14570    1.7      1.3
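The derived columns of Tables 3 and 4 can be recomputed from the three time components. The sketch below assumes that Sexec. is the ratio of the two Texec. values, that Stotal is the ratio of the summed components, and that the overhead of the transformation is Ttrans. relative to the Texec. of the transformed run; these definitions are inferred from the reported numbers, and the function name is hypothetical.

```python
def speedups(baseline, transformed):
    """Return (S_exec, S_total, overhead) from (T_exec, T_trans, T_other) triples."""
    s_exec = baseline[0] / transformed[0]
    s_total = sum(baseline) / sum(transformed)
    overhead = transformed[1] / transformed[0]  # T_trans relative to T_exec
    return s_exec, s_total, overhead

# Values copied from Tables 3 and 4 (seconds); "/" entries are taken as 0.
# Only the first transformation is shown; the second is computed the same way.
table3 = speedups((2230, 0, 3900), (1330, 100, 3920))
table4 = speedups((22720, 0, 14530), (13720, 320, 14550))
```

With these inputs the execution speedup rounds to 1.7 on both data sets, while the total speedup grows on the over-sampled data because the fixed transformation cost is spread over more examples.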
Table 3 and Table 4 both show an efficiency gain in query execution time of 1.7 times. This is less than the average gain of 3.5 times that was obtained in the previous experiment (Fig. 3). The difference is that in that case, each query was executed on the entire data set. Aleph considers different subsets of the data in each refinement step: longer queries are executed on a smaller subset. Because longer queries yield higher gains, the average gain drops. Comparing Ttrans. with Texec. shows that the transformation introduces no significant overhead (7.5% on the original data set and 2.3% on the over-sampled version). Recall from the previous experiment (Fig. 2 and Fig. 3) that the efficiency gain obtained with T̂1 and T̂2 is close to the best possible efficiency gain (i.e., that of T1 and T2). This means that the rather low gain (factor 1.7) obtained in this experiment is not caused by the fact that we use an approximate transformation, but rather by the fact that the queries generated by the refinement operator are already very efficient. This depends on the language bias that is used: here it is well designed and generates efficient queries. Another reason for the low gain is that the refinement operator generates many more small sets of linked literals than larger ones.
6 Conclusion
We have introduced a query transformation that aims at reducing the average execution time of queries by replacing them with one of their permutations. Similar techniques are used in relational database management systems but there are some important differences. ILP systems execute queries top-down instead of bottom-up, look only for the first solution, may use expensive background predicates, and valid permutations of queries are constrained by the modes of the literals. Queries are also generated by the system itself and not (directly) by humans. We have defined two versions of the transformation. Our first version transforms a query by replacing it with the permutation that has the lowest estimated average execution time. For computing this estimate we make two important approximations. The first one is that we use an approximate formula for computing the average execution time of a query based on the average execution times and non-determinacies of its literals. The second one is that we use uniform distributions over input substitutions to estimate the average execution time and non-determinacy of a literal. Such a uniform distribution will differ from the actual distribution of input substitutions for the literal, which depends on its position in the SLD-trees. A second version of the transformation extends the first one and is able to handle queries that are composed of different independent sets of linked literals, separated by cuts. Such queries are generated by the cut-transform [13], which is implemented in the ILP system Aleph. Our experiments show that both versions of the transformation approximate very well the theoretically optimal transformations, which replace a query by the permutation that has minimal average execution time. The obtainable efficiency gain of the transformation over using the original queries, as generated by the refinement operator of the ILP system, depends strongly on the language bias that is used. 
The language bias used in our experiments was well designed in the sense that the efficiency of the generated queries was close to that of the best possible permutations. Our experiments further show that the efficiency gain increases with the length of the sets of linked literals. By reordering the sets themselves, not much efficiency can be gained. The time complexity of our transformation is exponential in query-length (or independent set size) because it considers all possible permutations of a given query. We would like to develop a different version of our transformation that
Query Optimization in ILP by Reordering Literals
uses a greedy method to find an approximate solution in less time. In this way it could be possible to obtain a better trade-off between the overhead introduced by the transformation itself and the efficiency gained by executing the transformed queries. Kietz and Lübbe introduce in [6] an efficient subsumption algorithm which is based on similar ideas. Their algorithm moves deterministic literals to the front while executing a query on a given example. One important difference is that their transformation is dynamic, because it is performed for each query and each example. The transformation described in this work is static: a given query is transformed once and then executed on all examples. One advantage of a static transformation is that the possible overhead introduced by the transformation itself decreases as the number of examples increases. Further work will include comparing both transformations. Another approach for improving the efficiency of query execution is the use of query-packs [2]. The basic idea here is that sets of queries that have a common prefix, as they are generated by the refinement operator of a typical ILP system, can be executed more efficiently by putting them in a tree structure called a query-pack. In further work, we intend to combine the transformations presented here with query-packs. Combining query transformations with query-packs is difficult because the transformation may ruin the structure of the pack. Currently we are working on combining query-packs with the transformations described in [13]. Because the language bias has an important influence on the efficiency gain, we would like to try our transformation on more data sets with different types of language bias. One interesting approach here would be to use a language bias that generates larger sets of linked literals. 
Such a language bias is useful for first order feature construction, where one is interested in predictive relational patterns, which can be used, for example, in propositional learning systems.
Acknowledgments
Jan Struyf is a research assistant and Hendrik Blockeel a post-doctoral fellow of the Fund for Scientific Research (FWO) of Flanders.
References
1. H. Blockeel, L. De Raedt, N. Jacobs, and B. Demoen. Scaling up inductive logic programming by learning from interpretations. Data Mining and Knowledge Discovery, 3(1):59–93, 1999.
2. H. Blockeel, L. Dehaspe, B. Demoen, G. Janssens, J. Ramon, and H. Vandecasteele. Improving the efficiency of inductive logic programming through the use of query packs. Journal of Artificial Intelligence Research, 2001. Submitted.
3. M. Carlsson. Freeze, indexing, and other implementation issues in the WAM. In Jean-Louis Lassez, editor, Proceedings of the 4th International Conference on Logic Programming (ICLP'87), Series in Logic Programming, pages 40–58. MIT Press, 1987.
4. R. Elmasri and S. B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings, 2nd edition, 1989.
5. M. Jarke and J. Koch. Query optimization in database systems. ACM Computing Surveys, 16(2), 1984.
6. J.U. Kietz and M. Lübbe. An efficient subsumption algorithm for inductive logic programming. In Proceedings of the 11th International Conference on Machine Learning, pages 130–138. Morgan Kaufmann, 1994.
7. A. Krall. Implementation techniques for Prolog. In N. Fuchs and G. Gottlob, editors, Proceedings of the Tenth Logic Programming Workshop, WLP 94, pages 1–15, 1994.
8. J.W. Lloyd. Foundations of Logic Programming. Springer-Verlag, 2nd edition, 1987.
9. S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):245–286, 1995.
10. C. Nédellec, H. Adé, F. Bergadano, and B. Tausend. Declarative bias in ILP. In L. De Raedt, editor, Advances in Inductive Logic Programming, volume 32 of Frontiers in Artificial Intelligence and Applications, pages 82–103. IOS Press, 1996.
11. D. Pavlov, H. Mannila, and P. Smyth. Beyond independence: Probabilistic models for query approximation on binary transaction data. IEEE Transactions on Knowledge and Data Engineering, 2003. To appear.
12. J.R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239–266, 1990.
13. V. Santos Costa, A. Srinivasan, R. Camacho, H. Blockeel, B. Demoen, G. Janssens, J. Struyf, H. Vandecasteele, and W. Van Laer. Query transformations for improving the efficiency of ILP systems. Journal of Machine Learning Research, 2002. In press.
14. A. Srinivasan, R.D. King, and D.W. Bristol. An assessment of ILP-assisted models for toxicology and the PTE-3 experiment. In Proceedings of the Ninth International Workshop on Inductive Logic Programming, volume 1634 of Lecture Notes in Artificial Intelligence, pages 291–302. Springer-Verlag, 1999.
15. J. Struyf, J. Ramon, and H. Blockeel. Compact representation of knowledge bases in ILP. In Proceedings of the 12th International Conference on Inductive Logic Programming, volume 2583 of Lecture Notes in Artificial Intelligence, pages 254–269. Springer-Verlag, 2002.
16. F. Železný, A. Srinivasan, and D. Page. Lattice-search runtime distributions may be heavy-tailed. In S. Matwin and C. Sammut, editors, Inductive Logic Programming, 12th International Conference, ILP 2002, volume 2583 of Lecture Notes in Computer Science, pages 333–345. Springer-Verlag, 2003.
Efficient Learning of Unlabeled Term Trees with Contractible Variables from Positive Data
Yusuke Suzuki (1), Takayoshi Shoudai (1), Satoshi Matsumoto (2), and Tomoyuki Uchida (3)
(1) Department of Informatics, Kyushu University, Kasuga 816-8580, Japan, {y-suzuki,shoudai}@i.kyushu-u.ac.jp
(2) Department of Mathematical Sciences, Tokai University, Hiratsuka 259-1292, Japan, [email protected]
(3) Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan, [email protected]
Abstract. In order to represent structural features common to tree structured data, we propose an unlabeled term tree, which is a rooted tree pattern consisting of an unlabeled ordered tree structure and labeled variables. A variable is a labeled hyperedge which can be replaced with any unlabeled ordered tree of size at least 2. In this paper, we deal with a new kind of variable, called a contractible variable, that is an erasing variable which is adjacent to a leaf. A contractible variable can be replaced with any unlabeled ordered tree, including a singleton vertex. Let OTT^c be the set of all unlabeled term trees t such that all the labels attached to the variables of t are mutually distinct. For a term tree t in OTT^c, the term tree language L(t) of t is the set of all unlabeled ordered trees which are obtained from t by replacing all variables with unlabeled ordered trees. First we give a polynomial time algorithm for deciding whether or not a given term tree in OTT^c matches a given unlabeled ordered tree. Next, for a term tree t in OTT^c, we define the canonical term tree c(t) of t in OTT^c which satisfies L(c(t)) = L(t). Then, for two term trees t and t′ in OTT^c, we show that if L(t) = L(t′) then c(t) is isomorphic to c(t′). Using this fact, we give a polynomial time algorithm for finding a minimally generalized term tree in OTT^c which explains all given data. Finally we conclude that the class OTT^c is polynomial time inductively inferable from positive data.
1 Introduction
A term tree is a rooted tree pattern which consists of tree structures, ordered children and internal structured variables. A variable in a term tree is a list of vertices and it can be replaced with an arbitrary tree. The term tree language L(t) of a term tree t, which is considered to be the representing power of t, is the set of all ordered trees which are obtained from t by replacing all variables in t with arbitrary ordered trees. The subtrees which are obtained from t by removing the variables in t represent the common subtree structures among the trees in L(t). A term tree is suited for representing structural features in tree structured data such as HTML/XML files which are represented by rooted trees with ordered children and edge labels [1]. Hence, in data mining from tree structured data, a term tree is used as a knowledge representation.
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 347–364, 2003. © Springer-Verlag Berlin Heidelberg 2003
Fig. 1. (Drawings of the trees T1, T2, T3 and the term trees t1, t2, t3, g1, g2, g3 are not recoverable from the source.) An uncontractible (resp. contractible) variable is represented by a single (resp. double) lined box with lines to its elements. The label inside a box is the variable label of the variable.
Since SGML/XML can freely define tree structures by using strings given by users as tags, extracting meaningful structural features from such data is often difficult. In this case, by ignoring edge labels (e.g., tags) in tree structures and using only structural information, we need to extract meaningful structural features. Based on this motivation, we consider the polynomial time learnability of term trees without any vertex and edge label in the inductive inference model. A term tree t is said to be unlabeled if t has neither vertex label nor edge label, and said to be regular if all variable labels in t are mutually distinct. In this paper, we deal with a new kind of variable, called a contractible variable, that is an erasing variable which is adjacent to a leaf. A contractible variable can be replaced with any unlabeled ordered tree, including a singleton vertex. A usual variable, called an uncontractible variable, does not match a singleton vertex. Let OTT^c be the set of all unlabeled regular term trees with contractible variables. In this paper, we show that the class OTT^c is polynomial time inductively inferable from positive data. First we give a polynomial time algorithm for deciding whether or not a given term tree in OTT^c matches a given unlabeled ordered tree. 
For any term tree t, we denote by s(t) the ordered unlabeled tree obtained from t by replacing all uncontractible variables of t with single edges and all contractible variables of t with singleton vertices. We write t ≡ t′ if t is isomorphic to t′. Second, we give a total of 52 pairs (g_ℓ, g_r) of term trees g_ℓ and g_r such that g_ℓ ≢ g_r, s(g_ℓ) ≡ s(g_r), and L(g_ℓ) = L(g_r). We say that a term tree t in OTT^c is a canonical term tree if no term subtree of t is isomorphic to any second term tree g_r of the 52 pairs. For any term tree t in OTT^c, there exists a canonical term tree in OTT^c, denoted by c(t), which satisfies s(c(t)) ≡ s(t) and L(c(t)) = L(t). Then we show that for any term trees t and t′ in OTT^c, if L(t) = L(t′) then c(t) ≡ c(t′). Using this fact, we give a polynomial time algorithm for finding a minimally generalized term tree in OTT^c which explains all given data. From this algorithm and Angluin's theorem [3], we show that the class OTT^c is polynomial time inductively inferable from positive data. For example, the term tree t3 in Fig. 1 is a minimally generalized term tree in OTT^c which explains T1, T2 and T3. And t2 is also minimally generalized among all unlabeled regular term trees with no contractible variable which explain T1, T2 and T3. On the other hand, t1 is overgeneralized and meaningless, since t1 explains any tree of size at least 2. In analyzing tree structured data, patterns that are sensitive to slight differences among such data are often meaningless. For example, patterns extracted from HTML/XML files are affected by attributes of tags, which can be regarded as noise. In fact, a term tree with only uncontractible variables is very sensitive to such noise. By introducing contractible variables, we can find term trees that are robust to such noise. For this reason, we consider t3 in Fig. 1 to be a more valuable term tree than t2. A term tree is different from other representations of ordered tree structured patterns in [2,4,13] in that an ordered term tree has structured variables which can be substituted by arbitrary ordered trees. In [10,12], we showed that some fundamental classes of regular ordered term tree languages with no contractible variable are polynomial time inductively inferable from positive data. 
In [5,8,9], we showed that some classes of regular unordered term tree languages are polynomial time inductively inferable from positive data. Moreover, we showed in [6] that some classes of regular ordered term tree languages with no contractible variable are exactly learnable in polynomial time using queries. In [7], we gave a data mining method from semistructured data using ordered term trees.
2 Term Trees with Contractible Variables
Let T = (VT, ET) be a rooted tree with ordered children (or simply a tree) which has a set VT of vertices and a set ET of edges. Let Eg and Hg be a partition of ET, i.e., Eg ∪ Hg = ET and Eg ∩ Hg = ∅. And let Vg = VT. A triplet g = (Vg, Eg, Hg) is called a term tree, and elements in Vg, Eg and Hg are called a vertex, an edge and a variable, respectively. We assume that edges and vertices have no label. A label of a variable is called a variable label. X denotes a set of variable labels. For a term tree g and its vertices v1 and vi, a path from v1 to vi is a sequence v1, v2, ..., vi of distinct vertices of g such that for any j with 1 ≤ j < i, there exists an edge or a variable which consists of vj and vj+1. If there is an edge or a variable which consists of v and v′ such that v lies on the path from the root to v′, then v is said to be the parent of v′ and v′ is a child of v. We use a notation [v, v′] to represent a variable {v, v′} ∈ Hg such that v is the parent of v′. Then we call v the parent port of [v, v′] and v′ the child port of [v, v′]. For a term tree g, the children of every internal vertex u in g are totally ordered. The ordering on the children of u is denoted by
for i = 1, . . . , k. Case 1 : If u , u ∈ Vf and u
3 An Efficient Matching Algorithm for Term Trees
A matching algorithm for term trees is an algorithm which decides whether or not T ∈ L(t) for a given term tree t and a tree T. We gave matching algorithms for term trees with no contractible variable in [10,11]. These algorithms are based on dynamic programming and run in O(nN) time where n and N are the numbers of vertices of a given term tree t and a tree T, respectively. In this section, we give a matching algorithm for OTT^c by extending the matching algorithm of [10]. Let t = (Vt, Et, Ht) be a term tree in OTT^c and T = (VT, ET) a tree in OT. We assume that all vertices of a term tree t are associated with mutually distinct numbers, called vertex identifiers. We denote by I(u′) the vertex identifier of u′ ∈ Vt. A correspondence-set (C-set for short) is a set of vertex identifiers, with or without parentheses, of vertices of t. A vertex identifier with parentheses shows that the vertex is the child port of a variable. Our matching algorithm proceeds by constructing C-sets for each vertex of a given tree T in a bottom-up manner, that is, from the leaves to the root of T. First, we construct the C-set-attaching rule of a vertex u′ of t as follows. Let c′_1, ..., c′_{m′} be all ordered children of u′. The C-set-attaching rule of u′ is of the form I(u′) ⇐ J(c′_1), ..., J(c′_{m′}), where J(c′_i) = I(c′_i) if {u′, c′_i} is an edge, J(c′_i) = I(∅) if [u′, c′_i] is a contractible variable, and J(c′_i) = (I(c′_i)) otherwise. I(∅) is a special symbol which shows that c′_i is the child port of a contractible variable. The C-set-attaching rule of t, denoted by Rule(t), is defined as follows. Rule(t) = {I(u′) ⇐ J(c′_1), ..., J(c′_{m′}) | the C-set-attaching rule of all internal vertices} ∪ {(I(u′)) ⇐ (I(u′)) | u′ is the child port of an uncontractible variable} ∪ {I(u′) ⇐ I(u′) | u′ has just one child and connects to the child with a contractible variable}. We repeatedly attach a C-set to vertices of a tree T from the leaves to the root by using C-set-attaching rules. 
The entire algorithm Matching is described in Fig. 2.
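The construction of Rule(t) described above can be sketched as follows. The encoding is an assumption made for illustration only: a term tree is a dictionary from vertex identifiers to ordered lists of (kind, child) links, and I(n), (I(n)) and I(∅) are written as the tuples ("id", n), ("paren", n) and ("empty", None). The function and constant names are hypothetical.

```python
EDGE, VAR, CVAR = "edge", "var", "cvar"  # hypothetical link kinds

def build_rules(children):
    """Build Rule(t) as a list of (head, body) pairs.

    `children` maps each vertex identifier to its ordered list of
    (kind, child_identifier) links.
    """
    rules = []
    # One rule per internal vertex: I(u) <= J(c_1), ..., J(c_m).
    for u, links in children.items():
        if not links:
            continue
        body = []
        for kind, c in links:
            if kind == EDGE:
                body.append(("id", c))
            elif kind == CVAR:
                body.append(("empty", None))
            else:  # uncontractible variable
                body.append(("paren", c))
        rules.append((("id", u), tuple(body)))
    # (I(c)) <= (I(c)) for every child port of an uncontractible variable.
    for links in children.values():
        for kind, c in links:
            if kind == VAR:
                rules.append((("paren", c), (("paren", c),)))
    # I(u) <= I(u) when u has exactly one child, attached by a contractible variable.
    for u, links in children.items():
        if len(links) == 1 and links[0][0] == CVAR:
            rules.append((("id", u), (("id", u),)))
    return rules

# Root 0 has an edge to 1 and an uncontractible variable to 2;
# vertex 1 has a single child 3 attached by a contractible variable.
example = {0: [(EDGE, 1), (VAR, 2)], 1: [(CVAR, 3)], 2: [], 3: []}
rules = build_rules(example)
```

For the small example this yields four rules: one for each of the two internal vertices, one propagation rule for the child port 2, and one self rule for vertex 1.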
Procedure Matching(t, T);
input t: a term tree in OTT^c with root r, T: a tree in OT with root R;
begin
  Construct Rule(t);
  foreach leaf ℓ of T do
    CS(ℓ) := {I(ℓ′) | ℓ′ is a leaf of t that is not the child port of a contractible variable, or ℓ′ has just one child and connects to it with a contractible variable};
  while there is a vertex v of T s.t. v has no C-set and all children of v have C-sets do
    C-Set-Attaching(v, Rule(t));
  if I(r) ∈ CS(R) then t matches T else t does not match T
end.

Procedure C-Set-Attaching(v, Rule(t));
input v: a vertex of T, Rule(t): the C-set-attaching rule of t;
begin
  CS(v) := ∅;
  Let c_1, ..., c_m be all ordered children of v in T;
  foreach I(u′) ⇐ J(c′_1), ..., J(c′_{m′}) in Rule(t) do
    if there is a sequence 0 = j_0 ≤ j_1 ≤ ... ≤ j_i ≤ ... ≤ j_{m′−1} ≤ j_{m′} = m s.t.
        1. if J(c′_i) = I(c′_i) then j_i − j_{i−1} = 1 and I(c′_i) ∈ CS(c_{j_i}),
        2. if J(c′_i) = (I(c′_i)) then CS(c_{k_i}) has I(c′_i) or (I(c′_i)) for some k_i (j_{i−1} < k_i ≤ j_i)
      for all i = 1, ..., m′  // we have no condition on j_i when J(c′_i) = I(∅).
    then CS(v) := CS(v) ∪ {I(u′)};
  foreach (I(u′)) ⇐ (I(u′)) in Rule(t) do
    if there is a set in CS(c_1), ..., CS(c_m) which has I(u′) or (I(u′)) then CS(v) := CS(v) ∪ {(I(u′))};
  foreach I(u′) ⇐ I(u′) in Rule(t) do
    CS(v) := CS(v) ∪ {I(u′)}
end.
Fig. 2. Procedure Matching: An algorithm for deciding whether or not a given term tree t ∈ OTT^c matches a given tree T ∈ OT
Theorem 1. Let t be a term tree with n vertices in OTT^c and T a tree with N vertices in OT. The problem of deciding whether or not t matches T is solvable in O(nN) time.
4 An Algorithm for Finding a Minimally Generalized Term Tree
4.1 Term Trees which Generate the Same Term Tree Language
Let g and t be term trees in OTT^c. We write g ⪯ t if there exists a substitution θ such that g ≡ tθ. Let g = (V, E, H) be a term tree and g′ = (V′, E′, H′) a term subtree of g such that for any two siblings v and v′ of g′, all the siblings between v and v′ are also vertices of g′. A list [u, v1, v2, ..., vk] of vertices of g′ is said to be the port list of g′ w.r.t. g if the following conditions hold: (i) u is the root of g′, (ii) for any i < j, vi appears to the left of vj, (iii) v1 and vk are the leftmost and rightmost leaves of g′, respectively, and (iv) v2, ..., vk−1 are all leaves of g′ which are not leaves of g. Let g′ be a term subtree of g and [u, v1, ..., vk] the port list of g′. Let t′ be a term tree which has at least k leaves, and [u′, v′1, ..., v′k] a list of vertices of t′ such that u′ is the root of t′ and v′1, ..., v′k are k leaves of t′ where for i < j, v′i appears to the left of v′j. We define an operation SubRep on g, g′, and t′ as follows:
SubRep(g, g′, [u, v1, ..., vk], t′, [u′, v′1, ..., v′k]);
Step 1. Remove all edges and variables of g′ from g and also remove all vertices of g′ other than the vertices in the port list [u, v1, ..., vk] of g′.
Step 2. Add all edges, variables, and vertices of t′ to g and identify u′, v′1, ..., v′k with u, v1, ..., vk in this order.
Step 3. Output g.
Fig. 3. (Drawings of g1, g2, g3 and t1, t2, t3 are not recoverable from the source.) For i = 1, 2, 3, gi ≢ ti and L(gi) = L(ti)
Fig. 4. (Drawings of g4, g5, g6 and t4, t5, t6 are not recoverable from the source.) For i = 4, 5, 6, gi ≢ ti and L(gi) = L(ti)
For any term tree g, we denote by s(g) the ordered tree obtained from g by replacing all uncontractible variables of g with edges and all contractible variables of g with singleton vertices. For any two term trees g and t, we write g ≈ t if s(g) is isomorphic to s(t). We give a total of 52 pairs of term trees g and t in Figs. 3–7 such that g ≈ t, g ≢ t, and L(g) = L(t). For a vertex u of a term tree, ch(u) is the number of children of u which connect to u with edges or uncontractible variables. In the figures (Figs. 3–7), the digit in a box k (resp. ≥ k) near u shows that ch(u) is equal to k (resp. is more than or equal to k). A right arrow shows that the vertex to the right of the arrow is the immediately right child of the vertex to the left of it. We omit the proofs of the following three lemmas.
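The two quantities s(g) and ch(u) defined above can be sketched in a few lines. The encoding is an assumption made for illustration: a term tree is a dictionary from vertex identifiers to ordered lists of (kind, child) links with kinds "edge", "var" (uncontractible) and "cvar" (contractible), and an unlabeled ordered tree is a nested list of its children. The sketch relies on the child port of a contractible variable being a leaf; the function names are hypothetical.

```python
def skeleton(children, root):
    """s(g) as nested lists: each vertex is the ordered list of its children.

    Edges and uncontractible variables both become plain edges; a contractible
    variable collapses into its parent port, so its leaf child port is dropped.
    """
    return [skeleton(children, c)
            for kind, c in children[root]
            if kind != "cvar"]

def ch(children, u):
    """Number of children attached to u by an edge or an uncontractible variable."""
    return sum(1 for kind, _ in children[u] if kind in ("edge", "var"))

g = {0: [("edge", 1), ("var", 2)], 1: [("cvar", 3)], 2: [], 3: []}
t = {0: [("var", 1), ("cvar", 4), ("edge", 2)], 1: [], 2: [], 4: []}
same_skeleton = skeleton(g, 0) == skeleton(t, 0)
```

Because the trees are unlabeled and ordered, comparing the nested lists for equality is enough to test whether two skeletons are isomorphic, which is the relation g ≈ t.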
Lemma 1. Let gi and ti (1 ≤ i ≤ 3) be term trees in OTT^c described in Fig. 3. Let g be a term tree in OTT^c which has at least one term subtree of the form gi (1 ≤ i ≤ 3). For one of the occurrences of term subtrees gi, let t be the output of SubRep(g, gi, [u, v], ti, [u, v]). Then L(g) = L(t).
Lemma 2. Let gi and ti (4 ≤ i ≤ 16) be term trees in OTT^c described in Figs. 4–6. Let g be a term tree in OTT^c which has at least one term subtree of the form gi (4 ≤ i ≤ 16). For one of the occurrences of term subtrees gi, we make a new term tree t in the following way. Then L(g) = L(t).
1. For 4 ≤ i ≤ 6, t is the output of SubRep(g, gi, [u, v], ti, [u, v]), where gi and ti are in Fig. 4.
2. For 7 ≤ i ≤ 16, t is the output of SubRep(g, gi, [u, w], ti, [u, w]), where gi and ti are in Figs. 5 and 6.
Proof. (Sketch) We give a proof for the case of i = 4. Then t is obtained by SubRep(g, g4, [u, v], t4, [u, v]). Since t ⪯ g, L(t) ⊆ L(g). We show that gθ ∈ L(t) for any substitution θ. Let f be a term tree with which the variable [u, v] of the term subtree g4 of g is replaced. We have two cases. Case 1: The rightmost child of the root of f is a leaf. Case 2: The rightmost child of the root of f is not a leaf. Since t4 has two variables [u, w1]^c and [v, w3]^c, we define two new bindings for [u, w1] and [v, w3] so that the bindings express the binding for [u, v] of g4. In Case 1, let f1 be a term tree obtained from f by removing the rightmost child of the root of f. Let f2 be a singleton vertex. In Case 2, let f1 be the term tree obtained from f by removing the rightmost child of the root of f and the descendants of it. Let f2 be the term tree consisting of the rightmost child of the root of f and the descendants of it. In both cases, gθ can be obtained from t with a new substitution including the two bindings which replace [u, w1] with f1 and [v, w3] with f2. Similarly we can show L(g) = L(t) for any 5 ≤ i ≤ 16. ✷
Lemma 3. Let g_i^(k,ℓ) and t_i^(k,ℓ) (17 ≤ i ≤ 20, 1 ≤ k, ℓ ≤ 3) be term trees described in Fig. 7. Let g be a term tree in OTT^c which has at least one term subtree of the form g_i^(k,ℓ) (17 ≤ i ≤ 20, 1 ≤ k, ℓ ≤ 3). For one of the occurrences of term subtrees g_i^(k,ℓ), let t be the output of SubRep(g, g_i^(k,ℓ), [u, w1, w2], t_i^(k,ℓ), [u, w1, w2]). Then t g, g t, and L(g) = L(t).
Definition 4. Let g be a term tree in OTT^c. The term tree g is said to be a canonical term tree if no term subtree of g is identical to any of ti (1 ≤ i ≤ 16) (Figs. 3–6) and t_i^(k,ℓ) (17 ≤ i ≤ 20, 1 ≤ k, ℓ ≤ 3) (Fig. 7). We can see that any term tree g is transformed into the canonical term tree by applying SubRep(g, ti, [u, v], gi, [u, v]) (1 ≤ i ≤ 6), SubRep(g, ti, [u, w], gi, [u, w]) (7 ≤ i ≤ 16), and SubRep(g, t_i^(k,ℓ), [u, w1, w2], g_i^(k,ℓ), [u, w1, w2]) (17 ≤ i ≤ 20, 1 ≤ k, ℓ ≤ 3) to g, repeatedly. We denote by c(g) this canonical term tree transformed from g in this way. We note that L(c(g)) = L(g).
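The case split in the proof sketch of Lemma 2 can be illustrated on plain ordered trees. The sketch below is a simplification under stated assumptions: f is treated as an unlabeled ordered tree written as a nested list of children (any variables inside f are ignored), and the function name is hypothetical. With this encoding both cases reduce to the same slicing, because a leaf is the empty list, which is also the singleton vertex.

```python
def split_rightmost(f):
    """Split ordered tree f (nested lists) at the rightmost child of its root."""
    if not f:
        raise ValueError("the root of f must have at least one child")
    f1 = f[:-1]  # f without the rightmost child of the root and its descendants
    f2 = f[-1]   # subtree rooted at that child; [] is a singleton vertex (Case 1)
    return f1, f2

case1 = split_rightmost([[[]], []])      # rightmost child of the root is a leaf
case2 = split_rightmost([[], [[], []]])  # rightmost child has two children
```

Appending f2 as the last child of f1 gives f back, which is the sense in which the two new bindings together express the original binding.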
Fig. 5. (Drawings of g7, ..., g12 and t7, ..., t12 are not recoverable from the source.) For i = 7, 8, 9, 10, 11, 12, gi ≢ ti and L(gi) = L(ti)
Lemma 4. Let g and t be term trees in OTT^c. If L(g) = L(t) then c(g) ≡ c(t).
Proof. (Sketch) If L(g) = L(t), we have g ≈ t. We show that if g ≈ t and c(g) ≢ c(t) then L(g) ≠ L(t). Let c(g) = (Vg, Eg, Hg) and c(t) = (Vt, Et, Ht). A vertex v is said to be an uncontractible vertex if v is not a child port of a contractible variable. Let V′g (resp. V′t) be the set of all uncontractible vertices of Vg (resp. Vt). Since g ≈ t, there is an isomorphism ξ from s(g) to s(t). We consider the mapping ξ to be a bijection from V′g to V′t. For a vertex v ∈ V′g which is not the root, let p(v) be the parent of v. Let an(v) be the nearest ancestor of v with an(v) ≠ v and ch(an(v)) ≥ 2. If there is no ancestor v′ of v with ch(v′) ≥ 2, let an(v) be the root of g. And let de(v) be the nearest descendant of v with ch(de(v)) ≥ 2. If there is no descendant v′ of v with ch(v′) ≥ 2, let de(v) be the furthest descendant of v which is an uncontractible vertex. Let g_v be the term subtree of g which consists of all vertices on the path from an(v) to p(de(v)), all vertices connecting to a vertex on the path, except for p(an(v)) if it exists, and the child of de(v) if ch(de(v)) = 0. We denote by ht(an(v)) the length of the path from an(v) to de(v). Since c(g) ≢ c(t), there is a vertex v ∈ V′g satisfying g_v ≢ t_ξ(v). If ch(de(v)) ≥ 2, we can give a substitution θ such that c(g)θ ∉ L(t) or c(t)θ ∉ L(g), since c(g) and c(t) have no term subtree which is identical to one of t1, t2 (Fig. 3) and t11, t12 (Fig. 5). If (ch(de(v)), ht(an(v))) = (0, 1) or (0, h) for h ≥ 3, we can give a substitution to show L(g) ≠ L(t), since no term subtree of c(g) and c(t) is identical to one of the term subtrees in Figs. 3, 4 and 6. If (ch(de(v)), ht(an(v))) = (0, 2), we have to examine a lot of small cases depending on the left and right siblings of v. In all the small cases, we can give a substitution to show L(g) ≠ L(t). In this way, we show that if c(g) ≢ c(t) then L(g) ≠ L(t). ✷
Fig. 6. (Drawings of g13, ..., g16 and t13, ..., t16 are not recoverable from the source.) For i = 13, 14, 15, 16, gi ≢ ti and L(gi) = L(ti)
4.2 A MINL Algorithm for Term Trees
We can consider the language L(t) to be the representing power of a term tree t. A minimally generalized term tree explaining a given set of trees S ⊆ OT is a term tree t such that S ⊆ L(t) and there is no term tree t′ satisfying S ⊆ L(t′) ⊊ L(t). We want to find a minimally generalized term tree for a given set of trees. This problem is the Minimal Language Problem for (OTTL^c, OTT^c), and we define it formally as follows.

Minimal Language Problem (MINL) for (OTTL^c, OTT^c)
Instance: A nonempty set of trees S ⊆ OT.
Question: Find a term tree t ∈ OTT^c which is a minimally generalized term tree explaining S.

[Figure 7: diagrams of the term trees g17–g20, t17–t20 and f1–f3; not recoverable from the extracted text.]
Fig. 7. For i = 17, 18, 19, 20, k = 1, 2, 3, and ℓ = 1, 2, 3, let g_i^(k,ℓ) = g_i{x := [f_k, [u, w1]], y := [f_ℓ, [u, w2]]} and t_i^(k,ℓ) = t_i{x := [f_k, [u, w1]], y := [f_ℓ, [u, w2]]}. Then g_i^(k,ℓ) ≢ t_i^(k,ℓ), t_i^(k,ℓ) g_i^(k,ℓ), g_i^(k,ℓ) t_i^(k,ℓ) and L(g_i^(k,ℓ)) = L(t_i^(k,ℓ))

The algorithm MINL (Fig. 9) solves the Minimal Language Problem (MINL) for (OTTL^c, OTT^c) correctly. MINL consists of the following two procedures:

Variable-Extension (Fig. 9): The aim of this procedure is to output a term tree t consisting of only uncontractible variables such that there is no term tree t′ consisting of only uncontractible variables with S ⊆ L(t′) ⊊ L(t). Thus this procedure extends a term tree t by adding uncontractible variables as much as possible while S ⊆ L(t) holds.

Lemma 5. Let t be the output of Variable-Extension for an input S. Let t′ be a minimally generalized term tree explaining S. If S ⊆ L(t′) ⊆ L(t) then t′ ≈ t.

Proof. Obviously s(t′) ∈ L(t). Let t″ be the term tree obtained by replacing all edges of s(t′) with uncontractible variables. It is easy to see that S ⊆ L(t′) ⊆ L(t″) ⊆ L(t). Since t is an output of Variable-Extension for an input S, t″ ≡ t, thus, t′ ≈ t. ✷

R(u)       : Replace [p(u), u] with {p(u), u}.
R(u)_ℓ     : Replace [p(u), u] with {p(u), u} and [p(u), w_ℓ]^c.
R(u)_r     : Replace [p(u), u] with {p(u), u} and [p(u), w_r]^c.
R(u)_ℓ,r   : Replace [p(u), u] with {p(u), u}, [p(u), w_ℓ]^c and [p(u), w_r]^c.
R(u)_d     : Replace [p(u), u] with {p(u), u} and [u, w_d]^c.
R(u)_ℓ,d   : Replace [p(u), u] with {p(u), u}, [p(u), w_ℓ]^c and [u, w_d]^c.
R(u)_r,d   : Replace [p(u), u] with {p(u), u}, [p(u), w_r]^c and [u, w_d]^c.
R(u)_ℓ,r,d : Replace [p(u), u] with {p(u), u}, [p(u), w_ℓ]^c, [p(u), w_r]^c and [u, w_d]^c.
Fig. 8. The operations for replacing a variable with an edge and a contractible variable

Let u be a vertex of a term tree which is not the root of the term tree and p(u) the parent of u. Let w_ℓ and w_r be new children of p(u) which become the immediately left and right siblings of u, respectively. If u is a leaf, let w_d be a new child of u. We suppose that [p(u), u] is an uncontractible variable. Then we define the 8 operations in Fig. 8. For example, for the term tree t in Fig. 10, t1 = tR(c1), t2 = tR(c1)_d, and t3 = tR(a1)_ℓ,r R(a2)_ℓ,r · · · R(an)_ℓ,r R(u)_ℓ,r. We use the notation R(a1, a2, . . . , an)_ℓ,r for R(a1)_ℓ,r R(a2)_ℓ,r · · · R(an)_ℓ,r for simplicity. Then t3 = tR(a1, a2, . . . , an, u)_ℓ,r.
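The eight operations of Fig. 8 correspond to the eight subsets of the optional contractible variables attached at w_ℓ, w_r and w_d. The following is a minimal sketch of that enumeration only; the tuple encoding of edges and variables and the function name `replacement_ops` are illustrative assumptions, not the authors' data structures.

```python
from itertools import combinations

def replacement_ops(u, parent):
    """Enumerate the 8 operations R(u)_X of Fig. 8 for the variable [parent, u].

    Each operation is returned as (name, new_items): new_items always contains
    the edge {parent, u}, plus one contractible variable per chosen subscript.
    The tuple encoding is a hypothetical one chosen for this sketch.
    """
    # subscript -> contractible variable it adds; w_l and w_r are new siblings
    # of u, w_d is a new child of u (only meaningful when u is a leaf)
    optional = {
        "l": ("cvar", parent, "w_l"),
        "r": ("cvar", parent, "w_r"),
        "d": ("cvar", u, "w_d"),
    }
    ops = []
    for size in range(len(optional) + 1):
        for subset in combinations("lrd", size):
            name = f"R({u})" + ("_" + ",".join(subset) if subset else "")
            items = [("edge", parent, u)] + [optional[s] for s in subset]
            ops.append((name, items))
    return ops

ops = replacement_ops("u", "p(u)")
for name, items in ops:
    print(name, items)
```

In the algorithm itself only some of these operations are tried at a given vertex (for instance, the subscript d applies only to leaves), so a caller would filter this list accordingly.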
Algorithm MINL(S);
input a set of trees S = {T1, . . . , Tm} ⊆ OT;
begin
  t := ({u, v}, ∅, {[u, v]});
  Let q be a list initialized to be [[u, v]];
  Variable-Extension(t, S, q); (Fig. 9)
  Edge-Replacing(t, S, r_t), where r_t is the root of t; (Fig. 9)
  output t
end.

Procedure Variable-Extension(t, S, q);
input t: a term tree, S: a set of trees, q: a queue of variables;
begin
  while q is not empty do begin
    [u, v] := q[1];
    Let w1, w2, and w3 be new vertices;
    // w1 becomes a vertex between u and v.
    if S ⊆ L(t′ := (Vt ∪ {w1}, Et, Ht ∪ {[u, w1], [w1, v]} − {[u, v]})) then begin
      t := t′; q := q&[[w1, v]]; continue
    end
    else q := q[2..];
    // w2 and w3 become the immediately left and right siblings of v, respectively.
    if S ⊆ L(t′ := (Vt ∪ {w2}, Et, Ht ∪ {[u, w2]})) then begin t := t′; q := q&[[u, w2]] end;
    if S ⊆ L(t′ := (Vt ∪ {w3}, Et, Ht ∪ {[u, w3]})) then begin t := t′; q := q&[[u, w3]] end;
  end;
  return t
end;

Procedure Edge-Replacing(t, S, u);
input t: a term tree, S: a set of trees, u: a vertex;
begin
  if u is a leaf then return;
  Let c1, . . . , ck be the children of u which connect to u with edges or uncontractible variables;
  for i := 1 to k do Edge-Replacing(t, S, ci);
  if k = 1 then
    if c1 is a leaf then Vertically-Edge-Replacing-for-Leaf(t, S, u, c1) (Fig. 10)
    else if [u, c1] is an uncontractible variable then Vertically-Edge-Replacing(t, S, u, c1) (Fig. 12)
    else if ∃ [u, w1]^c s.t. w1 is the left sibling of c1 then begin
      if S ⊆ L(t − [u, w1]^c) then t := t − [u, w1]^c;
      if ∃ [u, w2]^c s.t. w2 is the right sibling of c1 then
        if S ⊆ L(t − [u, w2]^c) then t := t − [u, w2]^c
    end
    else if ∃ [u, w2]^c s.t. w2 is the right sibling of c1 then
      if S ⊆ L(t − [u, w2]^c) then t := t − [u, w2]^c;
  i := 1;
  while i ≤ k do i := Horizontally-Edge-Replacing(t, S, u, ci) (Fig. 13);
  return t
end;

Fig. 9. Algorithm MINL: For a term tree t, we denote by t − [u, v]^c the term tree obtained by removing a contractible variable [u, v]^c
[Figure 10, upper part: diagrams of the term trees t and t1–t6; not recoverable from the extracted text.]

The vertex a is the root of t or the nearest ancestor of c1 which has at least two children.

Procedure Vertically-Edge-Replacing-for-Leaf(t, S, u, c1);
input t = (Vt, Et, Ht): a term tree, S: a set of trees, u, c1: vertices;
begin
  if S ⊆ L(t1 := tR(c1)) then return t1;
  if S ⊆ L(t2 := tR(c1)_d) then return t2;
  // t3 := tR(a1, . . . , an, u)_ℓ,r generates the same language as t (Lemma 2, Fig. 5);
  // Next we try to decide the replacement of [u, c1] and the upper variables.
  if S ⊆ L(t4 := tR(a1, . . . , an, u)_ℓ,r R(c1)_ℓ) then return t4;
  if S ⊆ L(t5 := tR(a1, . . . , an, u)_ℓ,r R(c1)_r) then return t5;
  if S ⊆ L(t6 := tR(a1, . . . , an, u, c1)_ℓ,r) then return t6;
  // We have not decided yet what is [u, c1] in the final term tree.
  return Vertically-Edge-Replacing-for-Leaf-Sub(t, S, u, c1)
end;
Fig. 10. Procedure Vertically-Edge-Replacing-for-Leaf
Edge-Replacing (Fig. 9): Let t be an output of Variable-Extension for an input S. This procedure visits all vertices of t in the reverse order of the breadth-first search of t, and it applies the above 8 operations R to t. If S ⊆ L(tR) then t := tR. We have the 52 pairs of two term trees g and t such that g ≈ t, g ≢ t, and L(g) = L(t). Thus we can not easily decide the replacement R. For example, L(g7) = L(t7) and t7 = g7 R(w) (Fig. 5). If we apply R(w) to g7 and fix it, we can not go to g7 R(v) although L(g7 R(v)) ⊊ L(t7). This procedure calls Vertically-Edge-Replacing-for-Leaf (Fig. 10, 11) and Vertically-Edge-Replacing (Fig. 12). Let [p(u), u] be a target uncontractible variable. These procedures decide that none of the above 8 replacements can be applied to [p(u), u] if the term subtree including [p(u), u] is isomorphic to one of t_i (7 ≤ i ≤ 16) (Fig. 5, 6). In this case, these procedures keep [p(u), u] an uncontractible variable. Horizontally-Edge-Replacing (Fig. 13) processes each child of a vertex u from left to right. When the procedure works at one child of u, if a temporary term tree t′ has one of the term subtrees t_i^(k,ℓ) (17 ≤ i ≤ 20, 1 ≤ k, ℓ ≤ 3) in Fig. 7, the procedure replaces t_i^(k,ℓ) with g_i^(k,ℓ) and goes back to a child at the left of the target child.

[Figure 11, upper part: diagrams of the term trees t and t7–t10; not recoverable from the extracted text.]

The vertex a is the root of t or the nearest ancestor of u which has at least two children.

Procedure Vertically-Edge-Replacing-for-Leaf-Sub(t, S, u, c1);
input t: a term tree, S: a set of trees, u, c1: vertices;
begin
  // Below we suppose that a_{n+1} = u.
  // t7 = tR(a2, . . . , a_{n+1})_ℓ,r generates the same language as t (Lemma 2, Fig. 6).
  // We decide whether the following substructures do not appear in Fig. 5.
  If ∃ i (2 ≤ i ≤ n + 1) s.t. S ⊆ L(t′ := tR(ai)_r R(a_{i+1}, . . . , a_{n+1})_ℓ,r R(c1)_ℓ) then return t′;
  If ∃ i (2 ≤ i ≤ n + 1) s.t. S ⊆ L(t′ := tR(ai)_ℓ R(a_{i+1}, . . . , a_{n+1})_ℓ,r R(c1)_r) then return t′;
  If ∃ i, j (2 ≤ i ≤ j ≤ n + 1) s.t. S ⊆ L(t′ := tR(ai)_ℓ R(a_{i+1}, . . . , a_{j−1})_ℓ,r R(aj)_r R(a_{j+1}, . . . , a_{n+1}, c1)_ℓ,r) then return t′;
  If ∃ i, j (2 ≤ i < j ≤ n + 1) s.t. S ⊆ L(t′ := tR(ai)_r R(a_{i+1}, . . . , a_{j−1})_ℓ,r R(aj)_ℓ R(a_{j+1}, . . . , a_{n+1}, c1)_ℓ,r) then return t′;
  // Note that we do not know what are the upper variables of a_{min{i,j}}.
  // We decide whether the structure is not of the form of t13 (Fig. 6).
  Find the smallest index k s.t. S ⊆ L(t′ := tR(a_{k−1})_ℓ,r R(ak, . . . , a_{n+1}, c1)_ℓ);
  if ∃ i (2 ≤ i ≤ k − 2) s.t. S ⊆ L(t″ := t′R(ai)_ℓ R(a_{i+1}, . . . , a_{k−2})_ℓ,r) then return t″;
  // We decide whether the structure is not of the form of t14 (Fig. 6).
  Find the smallest index k s.t. S ⊆ L(t′ := tR(a_{k−1})_ℓ,r R(ak, . . . , a_{n+1}, c1)_r);
  if ∃ i (2 ≤ i ≤ k − 2) s.t. S ⊆ L(t″ := t′R(ai)_r R(a_{i+1}, . . . , a_{k−2})_ℓ,r) then return t″;
  // We decide whether the structure is not of the form of t15 (Fig. 6).
  Find the smallest index k s.t. S ⊆ L(t′ := tR(a_{k−1})_ℓ,r R(ak, . . . , a_{n+1})_ℓ R(c1)_ℓ,r);
  if ∃ i (2 ≤ i ≤ k − 2) s.t. S ⊆ L(t″ := t′R(ai)_ℓ R(a_{i+1}, . . . , a_{k−2})_ℓ,r) then return t″;
  // We decide whether the structure is not of the form of t16 (Fig. 6).
  Find the smallest index k s.t. S ⊆ L(t′ := tR(a_{k−1})_ℓ,r R(ak, . . . , a_{n+1})_r R(c1)_ℓ,r);
  if ∃ i (2 ≤ i ≤ k − 2) s.t. S ⊆ L(t″ := t′R(ai)_r R(a_{i+1}, . . . , a_{k−2})_ℓ,r) then return t″;
  // Otherwise we keep [u, c1] an uncontractible variable.
  return t
end;

Fig. 11. Procedure Vertically-Edge-Replacing-for-Leaf-Sub
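The description of Edge-Replacing says that vertices are visited in the reverse order of a breadth-first search, which guarantees that every child is processed before its parent. A minimal sketch of that visiting order follows, assuming a hypothetical adjacency-dictionary encoding of a rooted tree (the function name `reverse_bfs_order` is not from the paper).

```python
from collections import deque

def reverse_bfs_order(children, root):
    """Return the vertices of a rooted tree in reverse breadth-first order.

    children maps a vertex to the ordered list of its children (hypothetical
    encoding); vertices without an entry are treated as leaves.
    """
    order = []
    queue = deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        queue.extend(children.get(v, []))
    order.reverse()  # deepest levels first, root last
    return order

tree = {"a": ["b", "c"], "b": ["d"], "c": []}
visit = reverse_bfs_order(tree, "a")
print(visit)
```

The recursive formulation in Fig. 9 (recurse into all children, then handle the vertex itself) has the same children-before-parent property, which appears to be what the replacement steps rely on.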
[Figure 12: diagrams of the term trees t, t1, t2, . . . that are tried in sequence (marked NG or OK); not recoverable from the extracted text.]
Fig. 12. Procedure Vertically-Edge-Replacing: The vertex a is the root or the nearest ancestor which has at least two children. This procedure tries t1, t2, t3 in this order. If S ⊆ L(ti) for one of t1, t2, t3 then return ti. If S ⊆ L(ti) for one of t4, . . . , t_ℓ then return tR(c1)_ℓ,r, otherwise, return t itself
Lemma 6. Let t be the output of the algorithm MINL for an input S. Let t′ be a term tree satisfying S ⊆ L(t′) ⊆ L(t). Then c(t′) ≡ c(t).

Theorem 2. The algorithm MINL finds a minimally generalized term tree in OTT^c for a given set of trees in OT in polynomial time.
Procedure Horizontally-Edge-Replacing(t, S, u, ci);
input t: a term tree, S: a set of trees, u: a vertex;
begin
  if {u, ci} is an edge then begin
    Let w1 and w2 be the children of u which are the immediately left and right siblings of ci, respectively;
    Replace {u, ci} with [u, ci] and remove contractible variables [u, w1]^c and [u, w2]^c
  end;
  if ci is the last child of u and a leaf then begin
    if S ⊆ L(t′ := tR(ci)) then t := t′;
    if S ⊆ L(t′ := tR(ci)_ℓ) then t := t′;
    if S ⊆ L(t′ := tR(ci)_r) then t := t′;
    if S ⊆ L(t′ := tR(ci)_ℓ,r) then t := t′;
    if S ⊆ L(t′ := tR(ci)_d) then t := t′;
    if S ⊆ L(t′ := tR(ci)_r,d) then t := t′
  end
  else if ci is a leaf then begin
    if S ⊆ L(t′ := tR(ci)) then t := t′;
    if S ⊆ L(t′ := tR(ci)_ℓ) then t := t′;
    if S ⊆ L(t′ := tR(ci)_d) then t := t′
  end
  else if ci is the last child of u then begin
    if S ⊆ L(t′ := tR(ci)) then t := t′;
    if S ⊆ L(t′ := tR(ci)_ℓ) then t := t′;
    if S ⊆ L(t′ := tR(ci)_r) then t := t′;
    if S ⊆ L(t′ := tR(ci)_ℓ,r) then t := t′
  end
  else begin
    if S ⊆ L(t′ := tR(ci)) then t := t′;
    if S ⊆ L(t′ := tR(ci)_ℓ) then t := t′
  end;
  // Let w_i1 and w_i2 be the immediately left and right siblings of ci, respectively, if they exist.
  if t has a new contractible variable [u, w_i1]^c and, by regarding w_i1 as w2,
     the term subtree including left siblings of w_i1 is isomorphic to one of t_i^(k,ℓ) (17 ≤ i ≤ 20, 1 ≤ k, ℓ ≤ 3) in Fig. 7
  or t has a new contractible variable [u, w_i2]^c and, by regarding w_i2 as w2,
     the term subtree including left siblings of w_i2 is isomorphic to one of t_i^(k,ℓ) (17 ≤ i ≤ 20, 1 ≤ k, ℓ ≤ 3) in Fig. 7
  then begin
    Replace t_i^(k,ℓ) with g_i^(k,ℓ);
    // What we have done is only to insert a new contractible variable [u, w3]^c.
    Let cj be the immediately right sibling of w3;
    Replace all edges {u, c_{j+1}}, . . . , {u, ci} of t with uncontractible variables and remove all contractible variables adjacent to the right siblings of c_{j+1};
    t := t′;
    return j // Set the next target of children of u to be cj.
  end;
  return i + 1 // Set the next target of children of u to be c_{i+1}.
end;
Fig. 13. Procedure Horizontally-Edge-Replacing
5 Conclusions

We have given polynomial time algorithms for solving the membership and MINL problems for the class of unlabeled term trees with contractible variables. From these algorithms and Angluin's theorem [3], we have the following theorem:

Theorem 3. The class OTT^c is polynomial time inductively inferable from positive data.
References

1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000.
2. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of unordered tree patterns from queries. Proc. COLT-99, ACM Press, pages 323–332, 1999.
3. D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Science, 21:46–62, 1980.
4. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured data from queries. Proc. ALT-2001, Springer-Verlag, LNAI 2225, pages 315–331, 2001.
5. S. Matsumoto, Y. Hayashi, and T. Shoudai. Polynomial time inductive inference of regular term tree languages from positive data. Proc. ALT-97, Springer-Verlag, LNAI 1316, pages 212–227, 1997.
6. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions of tree patterns with internal structured variables from queries. Proc. AI-2002, Springer-Verlag, LNAI 2557, pages 523–534, 2002.
7. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda. Discovery of frequent tag tree patterns in semistructured web documents. Proc. PAKDD-2002, Springer-Verlag, LNAI 2336, pages 341–355, 2002.
8. T. Shoudai, T. Miyahara, T. Uchida, and S. Matsumoto. Inductive inference of regular term tree languages and its application to knowledge discovery. Information Modelling and Knowledge Bases XI, IOS Press, pages 85–102, 2000.
9. T. Shoudai, T. Uchida, and T. Miyahara. Polynomial time algorithms for finding unordered tree patterns with internal variables. Proc. FCT-2001, Springer-Verlag, LNCS 2138, pages 335–346, 2001.
10. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time inductive inference of ordered tree patterns with internal structured variables from positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184, 2002.
11. Y. Suzuki, T. Shoudai, T. Miyahara, and T. Uchida. A polynomial time matching algorithm of structured ordered tree patterns for data mining from semistructured data. Proc. ILP-2002, Springer-Verlag, LNAI 2583, pages 270–284, 2003.
12. Y. Suzuki, T. Shoudai, T. Uchida, and T. Miyahara. Ordered term tree languages which are polynomial time inductively inferable from positive data. Proc. ALT2002, Springer-Verlag, LNAI 2533, pages 188–202, 2002.
13. K. Wang and H. Liu. Discovering structural association of semistructured data. IEEE Trans. Knowledge and Data Engineering, 12:353–371, 2000.
Relational IBL in Music with a New Structural Similarity Measure

Asmir Tobudic¹ and Gerhard Widmer¹,²

¹ Austrian Research Institute for Artificial Intelligence, Vienna
² Department of Medical Cybernetics and Artificial Intelligence, University of Vienna
{asmir,gerhard}@oefai.at
Abstract. It is well known that many hard tasks considered in machine learning and data mining can be solved in a rather simple and robust way with an instance- and distance-based approach. In this paper we present another difficult task: learning, from large numbers of performances by concert pianists, to play music expressively. We model the problem as a multi-level decomposition and prediction task. Motivated by structural characteristics of such a task, we propose a new relational distance measure that is a rather straightforward combination of two existing measures. Empirical evaluation shows that our approach is in general viable and our algorithm, named DISTALL, is indeed able to produce musically interesting results. The experiments also provide evidence of the success of ILP in a complex domain such as music performance: it is shown that our instance-based learner operating on structured, relational data outperforms a propositional k-NN algorithm.

Keywords: Relational instance-based learning, music.

T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 365–382, 2003. © Springer-Verlag Berlin Heidelberg 2003

1 Introduction

Instance-based learning has always been very popular within machine learning and data mining. During the long research history on IBL, countless studies have stressed its strong aspects: algorithmic simplicity, incrementality, almost obvious extensions for handling noisy examples and/or attributes, the ability to deal with discrete as well as with continuous attributes, and often surprisingly good performance. Although most research on IBL has been done in a propositional setting, recently there has been a growing interest in transferring the successful IBL framework to the richer first-order logic (FOL) representation. A number of instance-based learners operating in the FOL framework have already been developed (e.g. KBG [1], RIBL [5], STILL [13]) and shown to work well on a number of tasks.

This paper introduces another difficult task for relational IBL, from the area of music research. We would like to automatically build, via inductive learning from 'real-world' data (i.e., real performances by highly skilled musicians), predictive models of certain aspects of performance (e.g. tempo, timing, dynamics, etc.). Previous research has shown that computers can indeed find and describe interesting and useful regularities at the level of individual notes. Using a new machine learning algorithm [17], we succeeded in discovering a small set of simple, robust and highly general rules that predict a substantial part of the note-level choices of a performer (e.g., whether (s)he will shorten or lengthen a particular note) with high precision [16]. However, music performance is a highly complex activity, and the note level is far from sufficient as a basis for a complete model of expressive performance. The goal of our ongoing work is to build quantitative models of musical expression at different levels of abstraction: we would like to learn tempo and dynamics strategies at levels of hierarchically nested phrases.

In this paper we show how relational IBL can be applied to learn expressive tempo and dynamics patterns at different phrase levels. We also propose a new similarity measure for structured objects described in FOL, which is a fairly straightforward modification of existing measures and can be regarded as a combination of two techniques: (1) RIBL's [5] strategy for assessing similarity between FOL objects by computing the similarity between objects' properties and the similarity of the objects related to them, and (2) a definition of distance between two sets based on the notion of transport networks, as proposed in [10]. Our similarity measure has been built into a relational instance-based learner named DISTALL and applied to our music task. DISTALL predicts timing and dynamics patterns for phrases in a new piece by analogy to the most similar phrases in the training set. Empirical evaluation shows that the relational IBL approach is viable in this domain and DISTALL is indeed able to achieve musically sensible results. Experiments also show that DISTALL produces clearly better results than a propositional k-NN algorithm, which provides additional evidence for the benefits of relational instance-based learning.
The paper is organized as follows: Section 2 introduces the notion of expressive music performance and its representation via performance curves. We also show how hierarchically nested musical phrases are represented in FOL, and how complex tempo and dynamics curves can be decomposed into well-defined training instances for the instance-based learning algorithm. In Section 3 we describe our distance measure in detail. Section 4 gives empirical results on DISTALL’s performance and a comparison with both RIBL and a propositional k-NN algorithm. Section 5 gives a brief conclusion.
2 Real-World Task: Learning to Play Music Expressively

The work presented here is part of a large research project that studies the fundamentals of expressive music performance via AI and, in particular, machine learning [18]. Expressive music performance is the art of shaping a musical piece by continuously varying important parameters like tempo, loudness, etc. while playing a piece. Instead of playing a piece of music with constant tempo or loudness, (skilled) performers rather speed up at some places, slow down at others, stress certain notes or passages etc. The way this 'should be' done is not specified precisely in the written score¹, but at the same time it is absolutely essential for the music to sound alive. The aim of this work is learning predictive models of two of the most important expressive parameters: timing (tempo variations) and dynamics (loudness variations).

The tempo and loudness variations can be represented as curves which quantify the variations of these parameters for each note relative to some reference value (e.g. average loudness or tempo of the same piece). Figure 1 shows a dynamics curve of a small part of the Mozart piano Sonata K.279 (C major), 1st movement, as played by a Viennese concert pianist (computed from recordings on a Bösendorfer SE290 computer-monitored grand piano²). Each point represents the relative loudness with which a particular melody note was played (relative to an average loudness of the piece); a purely mechanical (unexpressive) rendition of the piece would correspond to a flat horizontal line at y = 1.0. Tempo variations can be represented in an analogous way.

[Figure 1: plot of relative dynamics (y-axis, about 0.5 to 1.5) against score position in bars (x-axis, 31 to 39), with four levels of phrase brackets at the bottom.]
Fig. 1. Dynamics curve (relating to melody notes) of performance of Mozart Sonata KV.279, 1st movement, mm. 31–38, by a Viennese concert pianist

¹ The score is the music as actually printed.
² The SE290 is a full concert grand piano with a special mechanism that measures every key and pedal movement with high precision and stores this information in a format similar to MIDI. From these measurements, and from a comparison with the notes in the written score, the tempo and dynamics curves corresponding to the performances can be computed.

A careful examination of the figure reveals some trends in the dynamics curve. For instance, one can notice an up-down, crescendo-decrescendo tendency over the presented part of the piece and relatively consistent smaller up-down patterns embedded in it. This is not an accident since we chose to show a part of the piece which is a musically meaningful unit: a high-level phrase. This phrase contains a number of lower-level phrases, which are apparently also 'shaped' by the performer. The hierarchical, four-level phrase structure of this passage is indicated by four levels of brackets at the bottom of the figure. The aim of our work is the automatic induction of tempo and dynamics strategies, at different levels of the phrase structure, from large amounts of real performances by concert pianists.

The heart of our system, the relational instance-based learning algorithm described below, recognizes similar phrases from the training set and applies their expressive patterns to a new (test) piece. In this section we will describe the steps which precede and succeed the actual learning: First we show how hierarchically nested phrases are represented in first-order logic. We then show how complex tempo and dynamics curves as measured in real performances can be decomposed into well-defined training instances for the learner. Finally, we discuss the last step: at prediction time, the shapes predicted by the learner for nested phrases at different levels must be combined into a final performance curve that can be used to produce a computer-generated 'expressive' performance.

2.1 Representing Musical Phrases in FOL

Phrases are segments of music heard and interpreted as coherent units; they are important structural building blocks of music. Phrases are organized hierarchically: smaller phrases are grouped into higher-level phrases, which are in turn grouped together, constituting a musical context at a higher level of abstraction etc. The phrases and relations between them can be naturally represented in first-order logic. Consider Figure 2. It shows the dynamics curve corresponding to a small portion (2.5 bars) of a Mozart sonata performance, along with the piece's underlying phrase structure. For all scores in our data set phrases are organized at four hierarchical levels, based on a manual phrase structure analysis.

[Figure 2: dynamics curve (rel. dynamics against score position in bars) annotated with three phrase levels: phrCont(p11, A1, A2, ...) and phrCont(p12, A1, A2, ...) at Level 1; phrCont(p21, A1, A2, ...) with contains(p21, p11) and contains(p21, p12) at Level 2; phrCont(p31, A1, A2, ...) with contains(p31, p21) at Level 3.]
Fig. 2. Phrase representation used by our relational instance-based learning algorithm

The musical content of each phrase is encoded in the predicate phrCont(Id, A1, A2, ...). Id is the phrase identifier and A1, A2, ... are attributes that describe very basic phrase properties like the length of a phrase, melodic intervals between the starting and ending notes, information about where the highest melodic point (the 'apex') of the phrase is, the harmonic progression between start, apex, and end, etc. Relations between phrases are specified via the predicate contains(Id1, Id2), which states that the bigger phrase Id1 contains the smaller one Id2. Note that smaller phrases (consisting only of a few melody notes) are described in detail by the predicate phrCont. For the bigger phrases, which may contain several bars, the high-level attributes in phrCont are not sufficient for a full description. But having links to the lower-level phrases through the contains predicate and their detailed description in terms of phrCont, we can also obtain detailed insight into the contents of bigger phrases.

In ILP terms, the description of the musical scores through the predicates phrCont and contains defines the background knowledge of the domain. What is still needed in order to learn are the training examples, i.e. for each phrase in the training set, we need to know how it was played by a musician. This information is given in the predicate phrShape(Id, Coeffs), where Coeffs encode information about the way the phrase was played by a pianist. This is computed from the tempo and dynamics curves, as described in the following section.

2.2 Deriving the Training Instances: Multi-level Decomposition of Performance Curves
Given a complex tempo or dynamics curve (see Figure 1) and the underlying phrase structure, we need to calculate the most likely contribution of each phrase to the overall observed expression curve, i.e., we need to decompose the complex curve into basic expressive phrase 'shapes'. As approximation functions to represent these shapes we decided to use the class of second-degree polynomials (i.e., functions of the form y = ax² + bx + c), because there is ample evidence from research in musicology that high-level tempo and dynamics are well characterized by quadratic or parabolic functions [14]. Decomposing a given expression curve is an iterative process, where each step deals with a specific level of the phrase structure: for each phrase at a given level, we compute the polynomial that best fits the part of the curve that corresponds to this phrase, and 'subtract' the tempo or dynamics deviations 'explained' by the approximation. The curve that remains after this subtraction is then used in the next level of the process. We start with the highest given level of phrasing and move to the lowest. As tempo and dynamics curves are lists of multiplicative factors (relative to a default tempo), 'subtracting' the effects predicted by a fitted curve from an existing curve simply means dividing the y values on the curve by the respective values of the approximation curve. Figure 3 illustrates the result of the decomposition process on the last part (mm. 31–38) of the Mozart Sonata K.279, 1st movement, 1st section. The four-level phrase structure our music analyst assigned to the piece is indicated by the four levels of brackets at the bottom of the plot. The elementary phrase shapes (at four levels of hierarchy) obtained after decomposition are plotted in gray.
We end up with a training example for each phrase in the training set — a predicate phrShape(Id , Coeff ), where Coeff = {a, b, c} are the coefficients of the polynomial fitted to the part of the performance curve associated with the phrase.
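The decomposition loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the normalisation of note positions to [0, 1] within a phrase, the (start, end) index encoding of phrases, and the plain least-squares fit via normal equations are all assumptions made for the sketch.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c).

    Needs at least three distinct x values, otherwise the system is singular.
    """
    # Normal equations (A^T A) p = A^T y for design-matrix rows [x^2, x, 1].
    rows = [[x * x, x, 1.0] for x in xs]
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def decompose(curve, levels):
    """Decompose a performance curve into per-phrase quadratic shapes.

    curve:  list of multiplicative factors, one per note.
    levels: phrase lists, highest level first; each phrase is a (start, end)
            pair of note indices with end exclusive (hypothetical encoding).
    Returns (shapes, residual), where shapes holds
    (level_index, start, end, (a, b, c)) tuples.
    """
    residual = list(curve)
    shapes = []
    for level_index, phrases in enumerate(levels):
        for start, end in phrases:
            n = end - start
            xs = [k / (n - 1) for k in range(n)]  # positions scaled to [0, 1]
            a, b, c = fit_quadratic(xs, residual[start:end])
            shapes.append((level_index, start, end, (a, b, c)))
            # 'Subtract' the explained deviation by dividing it out.
            for k, x in enumerate(xs):
                residual[start + k] /= a * x * x + b * x + c
    return shapes, residual

# A five-note phrase whose curve is exactly quadratic in the scaled position.
xs5 = [k / 4 for k in range(5)]
curve = [1.0 + 0.5 * x - 0.5 * x * x for x in xs5]
shapes, residual = decompose(curve, [[(0, 5)]])
print(shapes, residual)
```

With several levels, each lower level is fitted to the residual left by the levels above it, which mirrors the highest-to-lowest order described in the text.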
[Figure 3: plot of relative dynamics (y-axis, about 0.5 to 1.5) against score position in bars (x-axis, 31 to 39), showing the original curve and the fitted phrase shapes.]
Fig. 3. Multilevel decomposition of dynamics curve of performance of Mozart Sonata K.279:1:1, mm. 31–38: original dynamics curve plus the second-order polynomial shapes giving the best fit at four levels of phrase structure

2.3 Combining Multi-level Phrase Predictions
Input to the learning algorithm are the (relational) representation of the musical scores plus the training examples (i.e. timing and dynamics polynomials), for each phrase in the training set. Given a test piece the learner assigns the shape of the most similar phrase from the training set to each phrase in the test piece. In order to produce final tempo and dynamics curves, the shapes predicted for phrases at different levels must be combined. This is simply the inverse of the curve decomposition problem. Given a new piece to produce a performance for, the system starts with an initial ‘flat’ expression curve (i.e., a list of 1.0 values) and then successively multiplies the current value by the multi-level phrase predictions.
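The recombination step can be sketched in a few lines. As before, this is only an illustration under assumptions: phrases are given as hypothetical (start, end, (a, b, c)) triples with end exclusive, and note positions inside a phrase are scaled to [0, 1] before the polynomial is evaluated.

```python
def combine(n_notes, predictions):
    """Build the final expression curve from predicted phrase shapes.

    predictions: (start, end, (a, b, c)) triples for phrases at any level,
    end exclusive (hypothetical encoding).
    """
    curve = [1.0] * n_notes  # flat, purely mechanical performance
    for start, end, (a, b, c) in predictions:
        n = end - start
        for k in range(n):
            x = k / (n - 1) if n > 1 else 0.0  # position scaled to [0, 1]
            curve[start + k] *= a * x * x + b * x + c
    return curve

# One four-note phrase played 10% louder overall, containing a two-note
# sub-phrase that rises from 0.9 to 1.1.
result = combine(4, [(0, 4, (0.0, 0.0, 1.1)), (0, 2, (0.0, 0.2, 0.9))])
print(result)
```

Because multiplication is commutative, the order in which the levels are applied does not affect the resulting curve.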
3 Instance-Based Learning in FOL
This section presents the relational IBL learner DISTALL and briefly contrasts it with its ancestor RIBL [5] by showing, via an example, how they implement different notions of structural similarity.
3.1 A Structural Similarity Measure
Before explaining DISTALL in detail, we recall some definitions concerning first-order logic and inductive logic programming from [6,9].
Definition 1 (Linked Clause). A clause is linked if all of its variables are linked. A variable v is linked in a clause c if and only if v occurs in the head of c, or there is a literal l in c that contains the variables v and w (v ≠ w) and w is linked in c.
Relational IBL in Music with a New Structural Similarity Measure
Example 1. The clause p(A) ← r(B) is not linked, while p(A) ← q(A, B), r(B, C), u(D, C) is.
Definition 2 (Level of Term). The level l(t) of a term t in a linked clause c is 0 if t occurs as an argument in the head of c, and 1 + min l(s) otherwise, where s and t occur as arguments in the same literal of c.
Example 2. The variable F in father(F, C) ← male(F), parent(F, C) has level 0; the variable G in grandfather(F) ← male(F), parent(F, C), parent(C, G) has level 2.
The algorithm to be presented here can be regarded as a generalization of the propositional k-NN to examples described in first-order logic. In FOL, examples are usually represented as sets of ground facts. The heart of a relational IBL algorithm is thus a distance function between sets of elements. A number of distances on sets already exist, e.g. the Hausdorff metric, symmetric difference distances, distances based on relations between sets, etc. For our algorithm, we adopt the distance proposed in [10,11]. The distance between sets of elements is defined in [10] via transport networks (for more information on the concept of transport networks see [7]). First, the appropriate transport network is constructed. The network has two groups of vertices {a_i} and {b_j} corresponding to the elements of the two sets A and B; a starting and an ending vertex (source s and sink t); and two additional vertices, let us call them a− and b−. For all edges in the network, capacities and weights are defined, where the capacities represent the maximal number of ‘units’ that can ‘flow’ through a connection and the weights are the distances (transport costs) between particular vertices. The distance between two sets of elements is then defined as the solution of the maximum flow minimal weight problem: one would like to transport as much as possible from s to t at minimal cost. In other words, one wants to match as many elements of one set as possible with elements of the other while achieving the minimal possible distance.
By setting the weights of the edges between set elements and the two additional vertices a− and b− to a big constant (e.g. bigger than all other edge weights), a ‘penalty’ is modeled: all elements of one set which do not match any element of the other incur high costs. By associating appropriate capacities with the edges in the network one can generalize the notion of cardinality in such a way that sets of different cardinality can be scaled appropriately (e.g. allowing the elements of the smaller set to match more than one element of the other, so that the distance between two sets with vastly different cardinalities is not expressed mainly through the penalty). An example of a distance network is given in Figure 4. More formally, the weighting function for sets and the generalization of the notion of cardinality can be defined as follows [10]:
Definition 3 (Weighting Function). Let X be a universe. A weighting function W for X is a function that maps each subset $A \in 2^X$ to a function $W[A] : A \to \mathbb{R}^+$.
Fig. 4. A distance network (see [11])
Definition 4 (Size under Weighting Function). Let W be a weighting function for X. Then the function $\mathrm{size}_W : 2^X \to \mathbb{R}^+$ is defined as $\mathrm{size}_W(A) = \sum_{a \in A} W[A](a)$.
The distance network can be formalized as follows (see Figure 4):
Definition 5 (Distance Network). Let X be a set, and d a metric on X. Let M be a constant, W be a weighting function for X and $Q_X^W = \max_{A \in 2^X} \mathrm{size}_W(A)$. Then for all finite $A, B \in 2^X$ with $A = \{a_1, \ldots, a_m\}$ and $B = \{b_1, \ldots, b_n\}$, we define a distance network between A and B for d, M and W in X to be $N[X,d,M,W,A,B] = N(V,E,\mathrm{cap},s,t,w)$ with
$V = A \cup B \cup \{s, t, a_-, b_-\}$,
$E = (\{s\} \times (A \cup \{a_-\})) \cup ((B \cup \{b_-\}) \times \{t\}) \cup ((A \cup \{a_-\}) \times (B \cup \{b_-\}))$,
$\forall a \in A, \forall b \in B : w(s,a) = w(b,t) = w(s,a_-) = w(b_-,t) = w(a_-,b_-) = 0 \wedge w(a,b) = d(a,b) \wedge w(a_-,b) = w(a,b_-) = M/2$, and
$\forall a \in A, \forall b \in B : \mathrm{cap}(s,a) = W[A](a) \wedge \mathrm{cap}(b,t) = W[B](b) \wedge \mathrm{cap}(s,a_-) = Q_X^W - \mathrm{size}_W(A) \wedge \mathrm{cap}(b_-,t) = Q_X^W - \mathrm{size}_W(B) \wedge \mathrm{cap}(a,b) = \mathrm{cap}(a_-,b) = \mathrm{cap}(a,b_-) = \mathrm{cap}(a_-,b_-) = \infty$.
The distance between sets of elements is then defined as follows:
Definition 6 (Netflow Distance). Let X be a set, d a metric on X, M a constant, and W a weighting function for X. For all $A, B \in 2^X$, the netflow distance from A to B under d, M and W in X, denoted $d^N_{X,d,M,W}(A,B)$, is the weight of the minimal weight maximal flow from s to t in $N[X,d,M,W,A,B]$.
In [7] it has been shown that if W has integer values, the maximal flow minimal weight problem can be solved in time polynomial in $\mathrm{size}_W(A)$ and $\mathrm{size}_W(B)$. If the weights of the netflow distance are normalized (in the interval [0,1]), the netflow distance can also be normalized:
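Definition 7 (Normalized Netflow Distance). Let X be a set and d a normalized metric on X. Then, the normalized netflow distance based on d is a distance function $d^{N,n}_{X,d,M,W} : 2^X \times 2^X \to \mathbb{R}$ defined by:
$$
d^{N,n}_{X,d,M,W}(A,B) =
\begin{cases}
0 & \text{if } A = \emptyset \text{ and } B = \emptyset, \\
\dfrac{2 \, d^N_{X,d,M,W}(A,B)}{d^N_{X,d,M,W}(A,B) + (\mathrm{size}_W(A) + \mathrm{size}_W(B))/2} & \text{otherwise.}
\end{cases}
$$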
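To make Definitions 6 and 7 concrete, here is a small Python sketch for the special case in which every element has weight 1. It does not solve a flow problem: it assumes that M is at least as large as every pairwise distance, in which case the optimal flow matches the smaller set completely and each leftover element of the larger set is routed through a− or b− at cost M/2. Under that assumption the distance can be found by exhaustive search over matchings, which is only practical for very small sets. The function names are hypothetical.

```python
from itertools import permutations


def netflow_distance(A, B, d, M=1.0):
    """Netflow distance for unit weights, found by brute force.

    Assumes M >= d(a, b) for all pairs, so that matching an element is never
    worse than leaving it unmatched. Exponential time: small sets only.
    """
    small, large = (A, B) if len(A) <= len(B) else (B, A)
    # Every element of the larger set without a partner costs M/2.
    penalty = (len(large) - len(small)) * M / 2
    best = min(
        sum(d(s, large[j]) for s, j in zip(small, idx))
        for idx in permutations(range(len(large)), len(small))
    )
    return best + penalty


def normalized_netflow_distance(A, B, d, M=1.0):
    """Definition 7 with size_W(A) = |A| (unit weights, an assumption)."""
    if not A and not B:
        return 0.0
    dist = netflow_distance(A, B, d, M)
    return 2 * dist / (dist + (len(A) + len(B)) / 2)


def absdiff(a, b):
    """A normalized metric on numbers in [0, 1]."""
    return abs(a - b)


print(netflow_distance([0.0, 0.5], [0.5], absdiff))             # one match, one penalty
print(normalized_netflow_distance([0.0, 0.5], [0.5], absdiff))
```

With M = 1 and a normalized metric, two sets that share nothing reach the value 1 and identical sets the value 0, which is the behaviour the normalization is meant to give. A polynomial-time implementation would replace the exhaustive search by a minimum-cost maximum-flow solver, as discussed in [7].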
Although the maximal matching distance defined via transport networks can be calculated in polynomial time, applying it directly to examples described in FOL (containing maybe hundreds of ground facts) would cause impractically high computational costs. Another problem is that given an example described by many facts, the relevance of particular facts is hard to tune (e.g. by weighting them differently). We avoid these problems by collecting the facts derived from background knowledge into hierarchical subsets. By doing so we develop a context-sensitive similarity measure where terms with a lower linkage level to the objects whose distance we are interested in can be made to influence the distance more. In the following we describe our algorithm in more detail. The difference between our approach and the RIBL algorithm, which uses a similar strategy for grouping facts, is illustrated on an example from our learning problem in the next subsection.
3.2 The DISTALL Algorithm
The first step of the algorithm is building so-called starting clauses. The building of starting clauses is a well-known method in ILP employed for reasons of computational complexity and implemented in many systems (CLINT [2], GOLEM [8], ITOU [12], RIBL [5]). For each (training and test) example we collect literals that contain terms linked to the example and group them into subsets according to linkage levels. The building of starting clauses is in our case guided by types and modes of literals. The distance between a test and training example is computed as the set distance between all literals found in the test and training starting clause at linkage level 1. In other words, we find the solution of the maximal flow minimal weight problem (see Definition 6), where the transport network vertices are literals containing terms which are directly linked to the examples. The weights of the edges (i.e. distances between individual literals) are computed as the Manhattan distance defined over the literals’ arguments (or set to 1 if the literals have different functors). If the arguments are object identifiers, the distance between them is computed by expanding them into a new transport network where the vertices are literals also containing these objects, found one linkage level deeper in the starting clauses. At the lowest level, the distance between objects is calculated as the distance between discrete values. The capacities associated with the edges in the network can be used to control the ‘virtual’ cardinalities of sets (and, accordingly, the influence of penalty on the set distance). They can also be used to give different importance to the predicates in sets.
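The grouping of literals by linkage level can be sketched as a breadth-first traversal that implements Definition 2. The representation of literals as (functor, arguments) tuples and the rule that a literal belongs to level 1 + the minimum level of its terms are assumptions made for this illustration; types and modes, which guide the construction in DISTALL, are ignored here.

```python
from collections import deque


def term_levels(head_args, body):
    """Level of each linked term per Definition 2.

    head_args : tuple of terms occurring in the head (level 0)
    body      : list of literals, each a (functor, args) tuple
    """
    level = {t: 0 for t in head_args}
    queue = deque(head_args)
    while queue:
        t = queue.popleft()
        for _, args in body:
            if t in args:
                for s in args:
                    if s not in level:          # first visit gives the minimum
                        level[s] = level[t] + 1
                        queue.append(s)
    return level


def group_by_linkage_level(head_args, body):
    """Group body literals by 1 + the lowest level among their terms (assumed rule)."""
    level = term_levels(head_args, body)
    groups = {}
    for literal in body:
        _, args = literal
        linked = [level[a] for a in args if a in level]
        if linked:                              # literals with no linked term are dropped
            groups.setdefault(min(linked) + 1, []).append(literal)
    return groups


# Example 2 from Section 3.1: grandfather(F) <- male(F), parent(F, C), parent(C, G)
body = [("male", ("F",)), ("parent", ("F", "C")), ("parent", ("C", "G"))]
print(term_levels(("F",), body))
print(group_by_linkage_level(("F",), body))
```

On Example 2 this assigns level 2 to the variable G, in agreement with the text, and places parent(C, G) one linkage level deeper than the two literals that mention F directly.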
Fig. 5. Basic principle of DISTALL’s similarity measure
The basic principle of the algorithm, let us call it DISTALL (DIstance on SeTs of Appropriate Linkage Level), is illustrated in Figure 5. In the example, the distance between objects Ob1 and Ob2 is calculated as the solution of the maximal flow minimal weight problem on the sets of literals found at LinkLevel = 1 in the starting clauses built for Ob1 and Ob2. The weights d(a_i, b_j) of edges connecting literals containing no object identifiers are computed directly (via Manhattan distance, see above). In the example, the literals a01 and b01 as well as a02 and b03 have the same functors and have object identifiers as arguments. The weights of the edges between them are thus defined as distance network problems involving the literals containing these objects, found one linkage level deeper in the starting clause. The procedure continues recursively until the depth of the starting clauses is reached. The computational cost is kept small, since the algorithm solves many hierarchically nested transport network problems, each with a small number of vertices. Notice that the distances between vertices are normalized. With a normalized distance between vertices we can apply Definition 7 and normalize the netflow distance. The normalized netflow distance is in turn used in the computation of the (normalized) distance between literals in the ‘higher-level’ transport network.
3.3 DISTALL vs. RIBL
DISTALL can be regarded as a continuation of the line of research initiated in [1], where a clustering algorithm together with its similarity measure was presented. This work was later improved in [5], in the context of the relational instance-based learning algorithm RIBL. The main idea behind RIBL’s similarity measure is that the similarity between two objects is determined by the similarity of their attributes and the similarity of the objects related to them. The similarity of the related objects depends in turn on their attributes and related objects. DISTALL combines this idea with a set distance function based on the notion of transport networks, which was proposed in [10]. While RIBL’s similarity measure permits several literals in one example to match the same literal in the other example, DISTALL’s netflow distance strongly favors matchings that are as complete as possible and penalizes literals left unmatched. We argue that the distance so defined is more natural in structured domains and works better in practice. First we show the main difference between RIBL’s and DISTALL’s behavior in one constructed situation from our musical application domain. In the next section we present DISTALL’s empirical results and provide a direct comparison with RIBL.
Training example (left):
phrShape(p1,Coeffs)
phrCont(p1,Attrs...)
contains(p1,p11)  contains(p1,p12)  contains(p1,p13)
phrCont(p11,Attrs...)  phrCont(p12,Attrs...)  phrCont(p13,Attrs...)
Test case (right):
phrShape(p2,???)
phrCont(p2,Attrs...)
contains(p2,p21)  contains(p2,p22)  contains(p2,p23)
phrCont(p21,Attrs...)  phrCont(p22,Attrs...)  phrCont(p23,Attrs...)
Fig. 6. An example of a relational learning situation: training example (left) and new test case (right)
Consider the situation given in Figure 6. We are interested in predicting the ‘expressive shape’ of the high-level phrase p2 and thus want to calculate the distance between phrases p1 and p2. Each of the two phrases is described via the attributes stored in the phrase-content predicate phrCont. They also contain lower-level phrases (predicate contains), which are in turn described with their phrase-content predicates. RIBL would compute the similarity between p1 and p2 as a (weighted) sum of the similarities between the phrCont and contains predicates, where for each contains predicate of p2, the most similar contains(p1, X) predicate is found (by finding the most similar phrCont and contains predicate at the lower level). Imagine the situation where the short lower-level phrase p13 is a prototype of all lower-level phrases of p2 (i.e. the sum of the distances between p13 and all p2x phrases is minimal, and the other two lower-level phrases p11 and p12 are completely different from all phrases p2x). By matching all p2x phrases to p13, RIBL would obtain a relatively high similarity between p1 and p2. This is not what we want, as the internal details of the whole high-level phrase p1 are ‘responsible’ for the expressive shape and not just a small fraction. In DISTALL, on the other hand, such a matching that leaves two of three subphrases unmatched would receive a high penalty and thus result in a low similarity rating.³
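The constructed situation can be reproduced numerically with a toy distance matrix. The sketch below is a deliberately simplified stand-in, not RIBL's or DISTALL's actual formula: the distance values are invented, ‘best match’ lets every sub-phrase of p2 pick its nearest sub-phrase of p1 with reuse allowed, and ‘one-to-one’ forces a complete matching, which is what the netflow distance favors when both sets have the same size.

```python
from itertools import permutations

# Hypothetical distances: row i is a sub-phrase of p1, column j one of p2.
distances = [
    [1.0, 1.0, 1.0],   # p11: unlike every sub-phrase of p2
    [1.0, 1.0, 1.0],   # p12: unlike every sub-phrase of p2
    [0.1, 0.1, 0.1],   # p13: close to every sub-phrase of p2
]


def best_match_distance(d):
    """Each sub-phrase of p2 is paired with its nearest sub-phrase of p1."""
    n_cols = len(d[0])
    return sum(min(row[j] for row in d) for j in range(n_cols)) / n_cols


def one_to_one_distance(d):
    """Cheapest matching that uses every sub-phrase exactly once (square d)."""
    n = len(d)
    return min(
        sum(d[i][p[i]] for i in range(n)) for p in permutations(range(n))
    ) / n


print(best_match_distance(distances))   # small: p13 absorbs all matches
print(one_to_one_distance(distances))   # large: p11 and p12 must be matched too
```

The many-to-one score makes p1 and p2 look nearly identical, while the complete matching exposes that two of the three sub-phrases of p1 have no good counterpart.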
³ One could avoid RIBL’s ‘mismatching’ of the lower-level phrases, e.g. by introducing a new predicate relPos(Phrase, Position), which describes the relative position of lower-level phrases within higher-level phrases. In this case, the phrases with the same relative position would be matched with higher probability. We do not want to make use of this information, since there are relational learning problems which do not have such well-defined alignment information between objects.
4 Experiments
In the following we present detailed empirical results achieved with DISTALL on a complex real-world dataset derived from piano performances of classical music and compare these with results achieved by RIBL. We also provide a comparison with a simpler propositional approach.
4.1 The Data
The data used for the experiments was derived from performances of Mozart piano sonatas by a Viennese concert pianist on a Bösendorfer SE 290 computer-controlled grand piano. A multi-level phrase structure analysis of the musical score was carried out manually by a musicologist. Phrase structure was marked at four hierarchical levels; three of these were finally used in the experiments. The sonatas are divided into sections, which can be regarded as coherent pieces. The resulting set of annotated pieces is summarized in Table 1. The pieces and performances are quite complex and different in character; automatically learning expressive strategies from them is a challenging task.
Table 1. Mozart sonata sections used in experiments (to be read as <sonataName>:<movement>:<section>); notes refers to ‘melody’ notes
                                phrases at level
sonata section         notes     1     2     3    4
kv279:1:1 fast 4/4       391    50    19     9    5
kv279:1:2 fast 4/4       638    79    36    14    5
kv280:1:1 fast 3/4       406    42    19    12    4
kv280:1:2 fast 3/4       590    65    34    17    6
kv280:2:1 slow 6/8        94    23    12     6    3
kv280:2:2 slow 6/8       154    37    18     8    4
kv280:3:1 fast 3/8       277    28    19     8    4
kv280:3:2 fast 3/8       379    40    29    13    5
kv282:1:1 slow 4/4       165    24    10     5    2
kv282:1:2 slow 4/4       213    29    12     6    3
kv282:1:3 slow 4/4        31     4     2     1    1
kv283:1:1 fast 3/4       379    53    23    10    5
kv283:1:2 fast 3/4       428    59    32    13    6
kv283:3:1 fast 3/8       326    52    30    12    3
kv283:3:2 fast 3/8       558    78    47    19    6
kv332:2   slow 4/4       477    49    23    12    4
Total:                  5506   712   365   165   66
4.2 A Quantitative Evaluation of DISTALL
A systematic leave-one-piece-out cross-validation experiment was carried out. Each of the 16 sections was once set aside as a test piece, while the remaining 15 pieces were used for learning. DISTALL uses one nearest neighbor for prediction,
with the starting clause depth set to 3 (i.e. just those phrases whose relationship order to the phrase in question is ≤ 3 can influence the distance measure). The expressive shapes for each phrase in a test piece were predicted by DISTALL and then combined into a final tempo and dynamics curve, as described in Section 2.3. The following performance measures were computed: the mean squared error of the system’s predictions on the piece relative to the actual expression curve produced by the pianist ($MSE = \sum_{i=1}^{n} (pred(n_i) - expr(n_i))^2 / n$), the mean absolute error ($MAE = \sum_{i=1}^{n} |pred(n_i) - expr(n_i)| / n$), and the correlation between predicted and ‘true’ curve. MSE indicates those cases where the learner produces rather extreme ‘errors’. MSE and MAE were also computed for a default curve that would correspond to a purely mechanical, unexpressive performance (i.e., an expression curve consisting of all 1’s). That allows us to judge if learning is really better than just doing nothing. The results of the experiment are summarized in Table 2, where each row gives the results obtained on the respective test piece when all others were used for training. The last row (WMean) shows the weighted mean performance over all pieces (individual results weighted by the relative length of the pieces).
We are interested in cases where the relative errors (i.e., MSE_L/MSE_D and MAE_L/MAE_D) are less than 1.0, that is, where the curves predicted by the learner are closer to the pianist’s actual performance than a purely mechanical rendition. In the dynamics dimension, this is the case in 12 out of 16 cases for MSE, and in 14 out of 16 for MAE. The results for tempo are dramatically worse: in only 6 cases is learning better than no learning (for both MSE and MAE). This can partly be explained by the fact that quadratic functions may not be as reasonable a model class for expressive timing as has been believed in musicology (see also [19]). On some pieces DISTALL is able to predict expressive curves which are surprisingly close to those actually produced by the pianist; witness, e.g., the correlation of 0.8247 in kv283:3:1 for dynamics.⁴ On the other hand, DISTALL performs poorly on some pieces, especially on those that are entirely different in character from all other pieces in the training set (e.g. the correlation of 0.2389 for kv332:2 for tempo).
Table 2. Results, by sonata sections, of cross-validation experiment with DISTALL (depth=2, k=1). Measures subscripted with D refer to the ‘default’ (mechanical, inexpressive) performance, those with L to the performance produced by the learner. The cases where DISTALL is better than the default are printed in bold
            dynamics                                 tempo
            MSE_D  MSE_L  MAE_D  MAE_L  Corr_L       MSE_D  MSE_L  MAE_D  MAE_L  Corr_L
kv279:1:1   .0383  .0214  .1643  .1100  .6714        .0348  .0375  .1220  .1257  .3061
kv279:1:2   .0318  .0355  .1479  .1384  .5744        .0244  .0291  .1004  .1133  .3041
kv280:1:1   .0313  .0195  .1432  .1052  .6635        .0254  .0188  .1053  .0934  .5611
kv280:1:2   .0281  .0419  .1365  .1482  .4079        .0250  .0290  .1074  .1111  .3398
kv280:2:1   .1558  .0683  .3498  .2064  .7495        .0343  .0373  .1189  .1157  .5888
kv280:2:2   .1424  .0558  .3178  .1879  .7879        .0406  .0508  .1349  .1443  .4659
kv280:3:1   .0334  .0168  .1539  .0979  .7064        .0343  .0260  .1218  .1179  .5136
kv280:3:2   .0226  .0313  .1231  .1267  .4370        .0454  .0443  .1365  .1388  .3361
kv282:1:1   .1076  .0412  .2719  .1568  .7913        .0367  .0376  .1300  .1196  .3267
kv282:1:2   .0865  .0484  .2420  .1680  .7437        .0278  .0474  .1142  .1436  .2072
kv282:1:3   .1230  .0717  .2595  .2172  .6504        .1011  .0463  .2354  .1575  .8075
kv283:1:1   .0283  .0263  .1423  .1067  .7007        .0183  .0202  .0918  .1065  .3033
kv283:1:2   .0371  .0221  .1611  .1072  .7121        .0178  .0171  .0932  .0960  .4391
kv283:3:1   .0404  .0149  .1633  .0928  .8247        .0225  .0183  .1024  .0954  .4997
kv283:3:2   .0424  .0245  .1688  .1156  .6881        .0256  .0308  .1085  .1184  .2574
kv332:2     .0919  .0948  .2554  .2499  .3876        .0286  .0630  .1110  .1767  .2389
WMean       .0486  .0360  .1757  .1370  .6200        .0282  .0326  .1108  .1202  .3600
⁴ Such a high correlation between predicted and observed curves is even more surprising taking into account that kv283:3:1 is a fairly long piece with over 90 hierarchically nested phrases containing over 320 melody notes.
4.3 DISTALL vs. RIBL
In order to put these results into context, we present a direct comparison with RIBL [5]. Since RIBL is not publicly available, in the experiments of this section we used a self-implemented version. While the leave-one-piece-out cross-validation procedure stayed the same as in the previous section, for the direct
comparison of the two learners we chose a somewhat different evaluation procedure. Rather than comparing entire composite performance curves, we compare the learners’ performance directly at the phrase level. That is, for each phrase, we compare RIBL’s and DISTALL’s predictions with the ‘real’ phrase shapes, i.e. those obtained by decomposing tempo and dynamics curves played by the pianist. We then check whose prediction was closer to the actual shape (again in terms of MSE, MAE and correlation). That gives us many more test instances (1240 phrases instead of 16 pieces) and thus more detailed insight into differences between the algorithms. The results are given in Table 3. A look at Table 3 reveals that RIBL and DISTALL agree in more than 70% of all cases. For those cases where the predictions differ, DISTALL’s predictions are closer to the pianist’s shapes more often than RIBL’s in terms of each error measure (i.e., lower MSE, lower MAE, higher correlation), although the difference is not large. Since DISTALL solves the maximal flow minimal weight problem hierarchically, expanding unknown weights into transport network problems at a lower level, the number of elements in each network (and accordingly, the computational complexity of solving the network distance problem) is kept low, making DISTALL’s runtimes only slightly higher than RIBL’s (for the presented experiment involving 1240 phrases and 16-fold cross-validation, the approximate runtimes on our system are 4h and 5h for RIBL and DISTALL, respectively).
Table 3. Direct comparison between RIBL and DISTALL. The table shows absolute numbers and percentages of the phrases where the predictions of both learners are equal, and where one learner is closer to the actual phrase shape than the other. The parameters of both learners are k = 1 and depth = 3. All attribute and predicate weights used by both learners (and especially capacities of the distance networks used by DISTALL) are set to 1.
                 dynamics                            tempo
                 MSE        MAE        CORR          MSE        MAE        CORR
equal            901 (73%)  901 (73%)  901 (73%)     901 (73%)  901 (73%)  901 (73%)
RIBL closer      153 (12%)  157 (12%)  165 (13%)     155 (12%)  159 (13%)  153 (12%)
DISTALL closer   188 (15%)  184 (15%)  176 (14%)     186 (15%)  182 (14%)  188 (15%)
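The three measures and the ‘which learner is closer’ decision can be written down directly. The sketch below is illustrative only: the function names are made up, curves are plain lists of values sampled at the same positions, and two predictions count as ‘equal’ when a measure gives identical values, which is a simplification of checking whether both learners predicted the same shape.

```python
from math import sqrt


def mse(pred, expr):
    """Mean squared error between predicted and actual curve."""
    return sum((p - e) ** 2 for p, e in zip(pred, expr)) / len(expr)


def mae(pred, expr):
    """Mean absolute error between predicted and actual curve."""
    return sum(abs(p - e) for p, e in zip(pred, expr)) / len(expr)


def correlation(pred, expr):
    """Pearson correlation; undefined (division by zero) for a constant curve."""
    n = len(expr)
    mean_p, mean_e = sum(pred) / n, sum(expr) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(pred, expr))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_e = sum((e - mean_e) ** 2 for e in expr)
    return cov / sqrt(var_p * var_e)


def compare(pred_a, pred_b, actual):
    """For each measure, report whether learner 'a' or 'b' is closer, or 'equal'."""
    measures = (("MSE", mse, True), ("MAE", mae, True), ("CORR", correlation, False))
    result = {}
    for name, fn, lower_is_better in measures:
        va, vb = fn(pred_a, actual), fn(pred_b, actual)
        if va == vb:
            result[name] = "equal"
        elif (va < vb) == lower_is_better:
            result[name] = "a"
        else:
            result[name] = "b"
    return result


# Invented values for one phrase: the pianist's shape and two predictions.
actual = [1.0, 1.2, 1.1]
pred_a = [1.0, 1.1, 1.1]
pred_b = [1.1, 1.0, 1.2]
print(compare(pred_a, pred_b, actual))
```

Counting the three outcomes per measure over all phrases yields a tally of the kind reported in Table 3.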
4.4 DISTALL vs. Propositional k-NN
One desirable property of relational learners is performing as well on propositional data as the ‘native’ propositional learners [5,4]. Being generalizations of the propositional k-NN, both DISTALL and RIBL share this property. It would however be interesting to compare the performance of DISTALL, given the relational data representation, with the performance of the standard propositional k-NN,⁵ since it has been shown that a richer relational representation need not always be a guarantee for better generalization performance [4]. We can represent phrases in propositional logic by describing each phrase in the data set with the attributes A1, A2, ... from the predicate phrCont(Id, A1, A2, ...) together with the ‘target’ polynomial coefficients Coeffs from the predicate phrShape(Id, Coeffs). By doing so we lose information about hierarchical relations between phrases and obtain an attribute-value representation which can be used by the k-NN algorithm. The results of the direct comparison are shown in Table 4. Experimental setup and evaluation were the same as in the previous section. Again, both learners predict the same shapes in many cases (more than 55% of the test set). For the remaining cases, however, DISTALL outperforms its propositional counterpart in terms of all error measures.
Table 4. Direct comparison between standard k-NN and DISTALL. The table shows absolute numbers and percentages of the phrases where the predictions of both learners are equal, and where one learner is closer to the actual phrase shape than the other. Both learners use one nearest neighbor for prediction. DISTALL’s depth parameter is set to depth = 3. Capacities of the distance networks are set to 1.
                 dynamics                            tempo
                 MSE        MAE        CORR          MSE        MAE        CORR
equal            697 (56%)  697 (56%)  697 (56%)     697 (56%)  697 (56%)  697 (56%)
prop. closer     240 (19%)  243 (20%)  259 (21%)     249 (20%)  252 (20%)  241 (19%)
DISTALL closer   305 (25%)  302 (24%)  286 (23%)     296 (24%)  293 (24%)  304 (25%)
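A propositional 1-NN baseline over such attribute vectors might look like the following sketch. The attribute values, the use of Manhattan distance over numeric attributes and all names are assumptions for illustration; the actual attributes of phrCont are not listed in this section.

```python
def nearest_neighbor_shape(query, training):
    """Return the coefficients of the training phrase closest to `query`.

    query    : tuple of numeric attributes A1, A2, ... of the test phrase
    training : list of (attributes, coeffs) pairs taken from phrCont and
               phrShape facts (hypothetical flattening)
    """
    def manhattan(u, v):
        # Assumed distance; any propositional metric could be used instead.
        return sum(abs(x - y) for x, y in zip(u, v))

    _, coeffs = min(training, key=lambda example: manhattan(query, example[0]))
    return coeffs


# Two invented training phrases with three attributes each.
training = [
    ((0.2, 4.0, 1.0), (-0.3, 0.3, 1.0)),
    ((0.9, 8.0, 0.0), (0.1, -0.2, 1.1)),
]
print(nearest_neighbor_shape((0.8, 7.0, 0.0), training))
```

Note that the query is compared on its own attributes only: the sub-phrases it contains play no role, which is exactly the hierarchical information that the relational representation retains.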
⁵ For a detailed study on the performance of the straightforward propositional k-NN on the same learning problem see [15].
5 Conclusion
We have presented a complex learning task from the domain of classical music: learning to apply musically ‘sensible’ tempo and dynamics variations to a piece of music, at different levels of the phrase hierarchy. The problem was modelled as a multi-level decomposition and prediction task and attacked via relational instance-based learning. A new structural similarity measure, based on a combination of two existing techniques, was presented and implemented in a learning algorithm named DISTALL. An experimental analysis showed that the algorithm produces slightly better results in this domain than the related algorithm RIBL, and that it is more effective than a propositional k-NN learner on the music task.
In addition to such quantitative evaluations, listening to the performances produced by the learner provides additional qualitative insight. Some of DISTALL’s performances, although being the result of purely automated learning with no additional knowledge about music, do indeed sound musically sensible. We hope to demonstrate some interesting sound examples at the conference. On the other hand, some elementary musical ‘errors’ clearly show that fully automated expressive performance at the level of human pianists is still far from being feasible, and certainly not with such a knowledge-free approach.
In [5] it was argued that attribute weighting is an important issue in propositional IBL and even more important in a relational setting. DISTALL currently lacks a data-driven predicate/attribute weighting module. We suspect that the quality of our similarity measure would become even more evident if the predicates were weighted adequately (e.g., if the detailed structure of the phrases were given higher weight).
Future work with DISTALL could also have an impact on musicology. The poor results in the tempo domain (see Section 4.2) suggest that other types of approximation functions may be worth trying, which might lead to better phrase-level tempo models. We also plan to test empirically whether a concert pianist plays similar phrases in similar ways (by showing a high correlation between expressive shapes for those phrases for which DISTALL suggests high similarity). One could also turn the question around and take a high correlation between the expressive shapes attached to phrases which are suggested to have high similarity as evidence for the reliability of DISTALL’s similarity measure.⁶
Acknowledgments
This research is supported by a START Research Prize by the Austrian Federal Government (project no. Y99-INF) and by the European project HPRN-CT-2000-00115 (MOSART). The Austrian Research Institute for Artificial Intelligence acknowledges basic financial support from the Austrian Federal Ministry for Education, Science and Culture. Thanks to Werner Goebl for performing the harmonic and phrase structure analysis of the Mozart sonatas.
⁶ Certainly, it would be more reliable if we had an objective judgment of similarity between two phrases from some independent third party, but there is no procedure known in musicology that would quantify similarity between phrases.
References
1. Bisson, G. (1992). Learning in FOL with a Similarity Measure. In Proceedings of the 10th AAAI, 1992.
2. De Raedt, L. (1992). Interactive Theory Revision: an Inductive Logic Programming Approach. Academic Press.
3. Duda, R., and Hart, P. (1967). Pattern Classification and Scene Analysis. New York, NY: John Wiley & Sons.
4. Dzeroski S., Schulze-Kremer, Heidtke K.R., Siems K., Wettschereck D., and Blockeel H. (1998). Diterpene structure elucidation from 13C NMR spectra with inductive logic programming. Applied Artificial Intelligence: Special Issue on First-Order Knowledge Discovery in Databases, 12(5):363–384, July–August 1998.
5. Emde, D., and Wettschereck, D. (1996). Relational Instance-Based Learning. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML’96), pages 122–130. Morgan Kaufmann, San Mateo.
6. Helft, N. (1989). Induction as nonmonotonic inference. In Proceedings of the 1st International Conference on Principles of Knowledge Representation and Reasoning, pages 149–156. Kaufmann.
7. Mehlhorn, K. (1984). Graph algorithms and NP-completeness, volume 2 of Data structures and algorithms. Springer Verlag.
8. Muggleton, S.H., and Feng, C. (1990). Efficient Induction of Logic Programs. In Proceedings of the First Conference on Algorithmic Learning Theory, Tokyo.
9. Muggleton, S., and de Raedt, L. (1994). Inductive Logic Programming: Theory and Methods. Journal of Logic Programming, 19,20:629–679.
10. Ramon, J., and Bruynooghe, M. (1998). A Framework for defining distances between first-order logic objects. In D. Page (ed.), Proceedings of the 8th International Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Artificial Intelligence, pages 271–280. Springer-Verlag.
11. Ramon, J., and Bruynooghe, M. (2000). A polynomial time computable metric between point sets. Report CW 301, Department of Computer Science, K.U.Leuven, Leuven, Belgium.
12. Rouveirol, C. (1992). Extensions of inversion of resolution applied to theory completion. In S. Muggleton (ed.), Inductive Logic Programming. Academic Press, London.
13. Sebag, M., and Rouveirol, C. (1997). Tractable induction and classification in FOL Via Stochastic Matching. In Proceedings of IJCAI-97. Morgan Kaufmann.
14. Todd, N. McA. (1992). The Dynamics of Dynamics: A Model of Musical Expression. Journal of the Acoustical Society of America 91, 3540–3550.
15. Tobudic, A., and Widmer, G. (2003). Playing Mozart Phrase By Phrase. In Proceedings of 5th International Conference on Case-Based Reasoning (ICCBR’03), Trondheim, Norway. Berlin: Springer Verlag.
16. Widmer, G. (2002). Machine Discoveries: A Few Simple, Robust Local Expression Principles. Journal of New Music Research 31(1), 37–50.
17. Widmer, G. (2003). Discovering Simple Rules in Complex Data: A Meta-learning Algorithm and Some Surprising Musical Discoveries. Artificial Intelligence 146(2), 129–148.
18. Widmer, G., Dixon, S., Goebl, W., Pampalk, E., and Tobudic, A. (2003). In Search of the Horowitz Factor. AI Magazine, in press.
19. Widmer, G., and Tobudic, A. (2003). Playing Mozart by Analogy. Journal of New Music Research, in press.
An Effective Grammar-Based Compression Algorithm for Tree Structured Data
Kazunori Yamagata¹, Tomoyuki Uchida¹, Takayoshi Shoudai², and Yasuaki Nakamura¹
¹ Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan
k [email protected], {uchida,nakamura}@cs.hiroshima-cu.ac.jp
² Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
[email protected]
Abstract. Many semistructured data such as HTML/XML files are represented by rooted trees t such that all children of each internal vertex of t are ordered and all edges of t have labels. Such data is called tree structured data. Analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of input data without loss of information, we can speed up such a heavy process. In this paper, we consider a problem of effective compression of an ordered rooted tree, which represents given tree structured data, without loss of information. Firstly, in order to define this problem in a grammar-based compression scheme, we present a variable replacement grammar (VRG for short) over ordered rooted trees. The grammar-based compression problem for an ordered rooted tree T is defined as a problem of finding a VRG which generates only T and whose size is minimum. For the grammar-based compression problem for an ordered rooted tree, we show that there is no polynomial time algorithm with approximation ratio less than 8593/8592 unless P=NP. Secondly, based on this theoretical result, we present an effective compression algorithm for finding a VRG which generates only a given ordered rooted tree and whose size is as small as possible. Finally, in order to evaluate the performance of our grammar-based compression algorithm, we report some experimental results.
T. Horváth and A. Yamamoto (Eds.): ILP 2003, LNAI 2835, pp. 383–400, 2003. © Springer-Verlag Berlin Heidelberg 2003

1 Introduction

Background: Owing to the rapid growth of information technologies, semistructured data such as HTML/XML files have been rapidly increasing in number, and each of them has become larger. Semistructured data having tree structures are called tree structured data and are represented by rooted trees t such that all children of each internal vertex of t are ordered and all edges of t have labels. In general, analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of the input data without loss of information, including structural features, we can speed up such a heavy process. In this paper, we consider a problem
of effective compression of an ordered rooted tree, which represents given tree structured data, without loss of information. We must compress a given ordered rooted tree T so that none of the structural features of T is lost. Hence, we cannot simply apply lossless compression algorithms for strings to tree structured data. The aim of this paper is to give a grammar-based compression scheme for an ordered rooted tree and to present an effective algorithm for compressing a given ordered rooted tree without loss of information within this scheme.

Data Model: As our data model for tree structured data, we use a variant of the Object Exchange Model (OEM, for short) presented by Abiteboul et al. [1] as follows. An object o consists of an identifier, a link and a value, which are denoted by &o, link(&o) and val(&o), respectively. The identifier &o uniquely identifies the object o. The link link(&o) is a list (&o1, &o2, . . . , &op) of the identifiers of all subobjects oi (i = 1, 2, . . . , p), where p > 0. The value val(&o) is either a string such as a tag in HTML/XML files, or a text such as one written in a PCDATA field in HTML/XML files. Tree structured data is represented by an ordered rooted tree with edge labels as follows. Each vertex represents an object identifier &o. An edge (&o, &oi) represents a reference &oi in link(&o) and has the value val(&oi). For any object identifier &o with link(&o) = (&o1, &o2, . . . , &op), the children &o1, &o2, . . . , &op of the vertex &o are ordered in this order. For example, in Fig. 1, the ordered tree T represents the structure of Sample html.

Main Results: In this paper, we consider a problem of effective compression of an ordered rooted tree, which represents given tree structured data, without loss of information.
Firstly, in order to define such a data compression problem for an ordered rooted tree in the grammar-based compression scheme, we present a term tree consisting of tree structures and structured variables, and present a Variable Replacement Grammar (VRG for short) over ordered rooted trees, which is based on Hyperedge Replacement Grammar (HRG for short, see [6]). A graph transformation of a VRG is defined as a mechanism of replacing a variable by an ordered rooted tree. In Fig. 1, as examples of a term tree and a graph transformation of a VRG, we give the term tree t and the ordered rooted tree g such that T is obtained from t and g by replacing all variables labeled with x by g. The grammar-based compression problem for an ordered rooted tree T is defined as the problem of finding a VRG which generates only T and whose size is minimum. We can regard this grammar-based compression problem as an optimization problem for minimizing the size of a VRG which generates only T. Secondly, for the grammar-based compression problem for an ordered rooted tree, we show that there is no polynomial time algorithm with approximation ratio less than 8593/8592 unless P=NP. This result shows that approximating the size of the minimum VRG to within a small constant factor is NP-hard. Next, based on this theoretical result, we present an effective grammar-based compression algorithm for finding a VRG which generates only a given ordered rooted tree and whose size is as small as possible. This algorithm is based on a greedy approach
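[Figure 1: four panels labeled Sample html, T, t and g. The panel Sample html lists the following HTML document.]
<table>
<tr> <td><font>Text 1-A</font></td> <td><font>Text 1-B</font></td> </tr>
<tr> <td><font>Text 2-A</font></td> <td><font>Text 2-B</font></td> </tr>
<tr> <td><font>Text 3-A</font></td> <td><font>Text 3-B</font></td> </tr>
<tr> <td><font>Text 4-A</font></td> <td><font>Text 4-B</font></td> </tr>
</table>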
Fig. 1. An HTML document Sample html, the ordered rooted tree T which is a data model of Sample html, a term tree t and an ordered rooted tree g. A variable is represented by a box with lines to its elements. The label of a box is the label of the variable. The number to the left of a vertex denotes its position in the ordering of its siblings.
of replacing isomorphic subtrees t, which do not overlap in a given ordered rooted tree, by the same variable in increasing order of the size of t. Next, by improving the algorithm given by Asai et al. [3], we present an efficient algorithm for finding all candidate subtrees s of a given ordered rooted tree T such that s can be replaced by a variable. This algorithm is a preprocessing step of our grammar-based compression algorithm. Finally, in order to evaluate the performance of our grammar-based compression algorithm, we report some experimental results comparing our algorithm with two other algorithms. One is based on a greedy approach that processes the candidate subtrees, which can be replaced by variables, in decreasing order of size. The other is based on a Minimum Description Length (MDL for short) heuristic such as SUBDUE [5]. The experimental results show the effectiveness of our algorithm.

Related Works: For a string, several grammar-based compression algorithms have been proposed
[4,8,9,12,13]. Such algorithms are based on the idea of representing a string by a context-free grammar (see [8,12]). In particular, based on a grammar-based compression scheme, Charikar et al. [4] presented an $O(\log(n/g^*))$ approximation algorithm and Sakamoto [13] proposed a linear-time approximation algorithm which guarantees an $O(\log^2 n)$ approximation ratio, where n is the length of an input string and $g^*$ is the size of the smallest grammar. On the other hand, for semistructured data, there has been little research on grammar-based compression. Hence, we need to define a new grammar-based compression scheme for an ordered rooted tree, which is based on HRG (see [6]). For semistructured data which can be represented by a general graph, Cook [5] presented a practical data compression algorithm based on an MDL heuristic, which is not a grammar-based compression algorithm. For semistructured data with geometric information, we presented an effective compression algorithm in [7] by introducing the notions of a layout term graph in [15] and a substitution in logic programming. The compression scheme presented in [7] can be regarded as a preliminary version of the grammar-based compression scheme presented in this paper. In the fields of data mining and knowledge discovery, there are increasing demands for effective methods for extracting information from large semistructured data. Several effective algorithms for finding frequent substructures in large tree structured data have been proposed [3,16]. In [11], we presented an effective algorithm for extracting common structural features among ordered rooted trees. Moreover, in [10,14], we discussed the learnability of tree patterns having tree structure, variables and ordered children from the viewpoint of machine learning.

Organization: This paper is organized as follows. In Section 2, we introduce an ordered rooted term tree and define an admissible VRG, which leads us to compress an ordered tree without loss of information. In Section 3, we define the problem of finding an admissible VRG whose size is minimum among the admissible VRGs generating only a given ordered rooted tree. Then, we present an effective greedy algorithm for solving this problem. In Section 4, in order to evaluate the performance of our algorithm, we report some experimental results of applying our algorithm to large artificial trees.
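As an illustration of the data model described in this introduction, the following Python sketch builds a small OEM-style object hierarchy and lists the labeled edges of the corresponding ordered rooted tree. The class name OemObject, the function labeled_edges and the sample identifiers are hypothetical and chosen only for this example; they do not appear in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OemObject:
    """Hypothetical stand-in for an OEM object o = (&o, link(&o), val(&o))."""
    ident: str                                   # plays the role of &o
    value: str                                   # plays the role of val(&o)
    link: List["OemObject"] = field(default_factory=list)  # ordered subobjects


def labeled_edges(obj: OemObject) -> List[Tuple[str, str, str]]:
    """Return (parent id, child id, edge label) triples in preorder.

    Following the data model, the edge (&o, &oi) carries val(&oi),
    and children are visited in the order given by link(&o).
    """
    edges = []
    for child in obj.link:
        edges.append((obj.ident, child.ident, child.value))
        edges.extend(labeled_edges(child))
    return edges


# A tiny document: one table row with two cells (identifiers are made up).
cell_a = OemObject("&3", "td", [OemObject("&4", "Text 1-A")])
cell_b = OemObject("&5", "td", [OemObject("&6", "Text 1-B")])
row = OemObject("&2", "tr", [cell_a, cell_b])
root = OemObject("&1", "table", [row])

for parent, child, label in labeled_edges(root):
    print(parent, "->", child, "labeled", label)
```

Because the edge label is taken from the child object, the value of the root object does not appear on any edge in this sketch.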
2 Preliminaries

2.1 Ordered Term Tree
Let $T = (V_T, E_T)$ be an ordered rooted tree with a vertex set $V_T$ and an edge set $E_T$. Let $\ell \ge 1$ be an integer. A list $h = (u_0, u_1, \ldots, u_\ell)$ of vertices in $V_T$ is called a variable (or a hyperedge) of T if $u_1, \ldots, u_\ell$ is a sequence of consecutive children of $u_0$, i.e., $u_0$ is the parent of $u_1, \ldots, u_\ell$ and $u_{j+1}$ is the next sibling of $u_j$ for any j with $1 \le j < \ell$. Two variables $h = (u_0, u_1, \ldots, u_\ell)$ and $h' = (u'_0, u'_1, \ldots, u'_{\ell'})$ are said to be disjoint if $\{u_1, \ldots, u_\ell\} \cap \{u'_1, \ldots, u'_{\ell'}\} = \emptyset$.

Definition 1. Let $T = (V_T, E_T)$ be an ordered rooted tree and $H_T$ a set of pairwise disjoint variables of T. An ordered term tree obtained from T and $H_T$ is a
triplet $t = (V_t, E_t, H_t)$ where $V_t = V_T$, $E_t = E_T - \bigcup_{h=(u_0,u_1,\ldots,u_\ell) \in H_T} \{\{u_0, u_i\} \in E_T \mid 1 \le i \le \ell\}$ and $H_t = H_T$. For two vertices $u, u' \in V_t$, we say that $u$ is the parent of $u'$ in t if $u$ is the parent of $u'$ in T. Similarly, we say that $u'$ is a child of $u$ in t if $u'$ is a child of $u$ in T. In particular, for a vertex $u \in V_t$ with no child, we call $u$ a leaf of t. We define the order of the children of each vertex $u$ in t as the order of the children of $u$ in T. We often omit the description of the ordered tree T and the variable set $H_T$ because we can find them from the triplet $t = (V_t, E_t, H_t)$.

Example 1. The ordered term tree t in Fig. 1 is obtained from the tree $T = (V_T, E_T)$ and the set $H_T$, where $V_T$ = {v0, v1, . . . , v17}, $E_T$ = {{v0, v1}, {v1, v2}, {v2, v3}, {v1, v4}, {v4, v5}, {v1, v6}, {v6, v7}, {v1, v8}, {v8, v9}, {v1, v10}, {v10, v11}, {v1, v12}, {v12, v13}, {v1, v14}, {v14, v15}, {v1, v16}, {v16, v17}} and $H_T$ = {(v1, v2, v4), (v1, v6, v8), (v1, v10, v12), (v1, v14, v16)}.

For any ordered term tree t, a vertex $u$ of t, and two children $u'$ and $u''$ of $u$, we write $u'$ [...]
$\langle \Lambda, X \rangle$ if all edges and all variables of t are labeled by elements of $\Lambda$ and $X$, respectively. If $\Lambda$ and $X$ need not be specified, we often omit them.

Note. In this paper, we treat only ordered rooted term trees, and so we simply call an ordered rooted term tree a term tree. In particular, a term tree with no variable is called a ground term tree (or simply a tree) and is considered to be a tree with ordered children.

For a term tree t and its vertices $v_1$ and $v_i$, a path from $v_1$ to $v_i$ is a sequence $v_1, v_2, \ldots, v_i$ of distinct vertices of t such that for any j with $1 \le j < i$, $v_j$ is the parent of $v_{j+1}$. Let $t = (V_t, E_t, H_t)$ be a term tree. For subsets $V_f \subseteq V_t$, $E_f \subseteq E_t$ and $H_f \subseteq H_t$, if $f = (V_f, E_f, H_f)$ is a term tree then f is said to be a term subtree of t. For two term subtrees $f = (V_f, E_f, H_f)$ and $g = (V_g, E_g, H_g)$ of t, we say that f and g overlap in t if $(E_f \cap E_g) \cup (H_f \cap H_g) \neq \emptyset$, $V_f \not\subseteq V_g$ and $V_g \not\subseteq V_f$.

Let f and g be term trees over $\langle \Lambda, X \rangle$, each of which has at least two vertices. Let $h = (v_0, v_1, \ldots, v_\ell)$ be a variable in f and $\sigma = (u_0, u_1, \ldots, u_\ell)$ a list of $\ell + 1$ distinct vertices in g such that $u_0$ is the root of g and $u_1, \ldots, u_\ell$ are leaves of g. The pair $[g, \sigma]$ of g and $\sigma$ is called an $(\ell + 1)$-hypertree over $\langle \Lambda, X \rangle$. If $\ell$, $\Lambda$ and $X$ need not be specified, we often omit them. The form $h \leftarrow [g, \sigma]$ is called a variable replacement for h. A new term tree $f' = f\{h \leftarrow [g, \sigma]\}$ is obtained by applying the variable replacement $h \leftarrow [g, \sigma]$ to f in the following way. For the variable $h = (v_0, v_1, \ldots, v_\ell)$, we attach g to f by removing the variable h from $H_f$ and by identifying the vertices $v_0, v_1, \ldots, v_\ell$ with the vertices $u_0, u_1, \ldots, u_\ell$ of g in this order. We define a new ordering
in $f'$ in the following natural way. Suppose that $v$ has more than one child and let $v'$ and $v''$ be two children of $v$ in $f'$. We note that $v_i = u_i$ for any $0 \le i \le \ell$. (1) If $v, v', v'' \in V_g$ and $v'$ [...]

[Figure 2: three panels labeled f, g and f′]
Fig. 2. The new ordering on vertices in the term tree $f' = f\{h \leftarrow [g, (u0, u1, u2, u3)]\}$, where h = (v0, v1, v2, v3)
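To make Definition 1 concrete, the following Python sketch recomputes the edge set of the term tree of Example 1 from the edge set of T and the variable set. The function name term_tree_edges, the representation of an edge as a (parent, child) pair, and the use of the integer i for the vertex vi are choices made only for this illustration.

```python
def term_tree_edges(tree_edges, variables):
    """Return E_t: the edges of T that are not covered by a variable.

    tree_edges: iterable of (parent, child) pairs of the ordered tree T.
    variables:  iterable of tuples (u0, u1, ..., ul); each variable covers
                the edges {u0, ui} for 1 <= i <= l (Definition 1).
    """
    covered = {(h[0], child) for h in variables for child in h[1:]}
    return [edge for edge in tree_edges if edge not in covered]


# Example 1, with the integer i standing for the vertex vi.
E_T = [(0, 1)]
for child in range(2, 17, 2):          # v2, v4, ..., v16 are children of v1
    E_T += [(1, child), (child, child + 1)]
H_T = [(1, 2, 4), (1, 6, 8), (1, 10, 12), (1, 14, 16)]

E_t = term_tree_edges(E_T, H_T)
print(len(E_T), len(E_t))   # 17 edges in T, 9 edges left in t
print(E_t)
```

The eight edges from v1 to its children are absorbed by the four variables, so only the edge {v0, v1} and the eight edges below the children of v1 remain in the term tree.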
2.2 Admissible Variable Replacement Grammar
Next, we formally define an admissible Variable Replacement Grammar, which generates only one tree, based on HRG (see [6]). Let $\Lambda$ and $X$ be finite alphabets with $\Lambda \cap X = \emptyset$.

Definition 2. A Variable Replacement Grammar (VRG for short) $G = (S, R)$ over $\langle \Lambda, X \rangle$ is defined as follows: (1) S is a variable label in X with rank(S) = 0 and is called the start variable label. (2) R is a finite set of productions of the form $x \to [g, \sigma]$, where x is a variable label in X with rank(x) = $\ell$ and $[g, \sigma]$ is an $\ell$-hypertree over $\langle \Lambda, X \rangle$.

Let $G = (S, R)$ be a VRG. For a variable label $x \in X$, an $\ell$-hypertree $[g, \sigma]$ and an integer $i \ge 1$, we define the relation $x \to_G^i [g, \sigma]$ inductively as follows. (1) We write $x \to_G^1 [g, \sigma]$ if there is a production $x \to [g, \sigma]$ in R. (2) For $i \ge 2$, we write $x \to_G^i [g, \sigma]$ if there are $j, m \ge 1$, an $\ell$-hypertree $[f, \sigma]$ and a variable h of rank k with label y in f such that $j + m = i$, $x \to_G^j [f, \sigma]$, $y \to_G^m [d, \sigma']$, and $g = f\{h \leftarrow [d, \sigma']\}$.
We write $x \to_G^+ [g, \sigma]$ if $x \to_G^i [g, \sigma]$ for some $i \ge 1$. The graph language generated by a VRG $G = (S, R)$ is the set $L(G) = \{T \mid T \text{ is a tree and } S \to_G^+ [T, ()]\}$. Let $G = (S, R)$ be a VRG and T a tree. Then, G is said to be admissible if $L(G) = \{T\}$. For a given tree T, an admissible VRG G generating only T lets us compress T without loss of information if the size of G is less than the size of T.
Example 2. Let G = (S, R) be the VRG where R = {S → [t1, ()], x → [t2, (u1, u2)], y → [t3, (v1, v2)]}, and t1, t2 and t3 are the term trees in Fig. 3. Then, we can see that G is admissible and L(G) = {T}, where T is the tree in Fig. 3.
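The way an admissible VRG generates its single tree can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the grammar of Fig. 3: a term tree is written as a nested list of items, every variable label has exactly one production, and the leaves listed in σ are written as integer ports. The names expand, resolve and count_edges and the sample rules are hypothetical.

```python
# A vertex is the list of items hanging below it, in sibling order.
#   ("edge", label, child)  : a labeled edge to the vertex `child`
#   ("var", name, children) : a variable covering consecutive children
# Inside a production body, the integer i marks the leaf of sigma that is
# identified with the (i+1)-th child of the replaced variable.


def resolve(vertex, rules, ports):
    """Expand a child vertex; an integer is a port bound by the caller."""
    if isinstance(vertex, int):
        return ports[vertex]
    return expand(vertex, rules, ports)


def expand(items, rules, ports=()):
    """Replace every variable by the body of its (unique) production."""
    ground = []
    for kind, name, below in items:
        if kind == "edge":
            ground.append(("edge", name, resolve(below, rules, ports)))
        else:
            bound = tuple(resolve(child, rules, ports) for child in below)
            # The body is spliced in at the position of the variable.
            ground.extend(expand(rules[name], rules, bound))
    return ground


def count_edges(items):
    """Number of edges of a ground tree given as nested items."""
    return sum(1 + count_edges(child) for _, _, child in items)


rules = {
    # Start production: two rows, each holding one variable labeled x.
    "S": [("edge", "tr", [("var", "x", [[("edge", "Text 1-A", [])],
                                         [("edge", "Text 1-B", [])]])]),
          ("edge", "tr", [("var", "x", [[("edge", "Text 2-A", [])],
                                         [("edge", "Text 2-B", [])]])])],
    # Body of x: two cells whose leaves are the ports 0 and 1.
    "x": [("edge", "td", [("edge", "font", 0)]),
          ("edge", "td", [("edge", "font", 1)])],
}

tree = expand(rules["S"], rules)
print(count_edges(tree))  # 14
```

Since each variable label has a single production in this sketch, the expansion is deterministic and yields exactly one tree, which mirrors the requirement that an admissible VRG generates only one tree.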
[Figure 3: four panels labeled T, t1, t2 and t3]
Fig. 3. A tree T and term trees t1, t2, t3
3 Grammar-Based Compression for an Ordered Rooted Tree
In this section, we consider a problem of finding an admissible VRG which generates only a given tree and whose size is minimum. Firstly, we formally define this problem and show the hardness of solving this problem. Secondly, for a given tree T , we present an algorithm Find Freq Trees for finding all candidate ground term subtrees of T which can be replaced by variables. Finally, by using Find Freq Trees, we give an effective algorithm for finding an admissible VRG G which generates only a given tree and whose size is as small as possible.
3.1 Hardness of Grammar-Based Compression Problem for an Ordered Rooted Tree
For a term tree $t = (V_t, E_t, H_t)$, we define the size of t as $|t| = |V_t| + 2|E_t| + \sum_{h \in H_t} |h|$. For a VRG $G = (S, R)$, we define the size of G as $|G| = \sum_{x \to [g,\sigma] \in R} (|g| + |\sigma|)$. For a tree T and an admissible VRG G such that $L(G) = \{T\}$, we define the compression ratio $\rho$ of T w.r.t. G as $\rho = \frac{|G|}{|T|} \times 100$.

Example 3. The size of the tree T in Fig. 3 is |T| = 22 + 2 × 21 = 64. The sizes of the term trees t1, t2 and t3 in Fig. 3 are |t1| = 4 + 2 × 1 + (2 + 2) = 10, |t2| = 3 + (2 + 2) = 7 and |t3| = 6 + 2 × 5 = 16, respectively. Then, the size of the admissible VRG G = (S, R) is |G| = (10 + 0) + (7 + 2) + (16 + 2) = 37, where R = {S → [t1, ()], x → [t2, (u1, u2)], y → [t3, (v1, v2)]}. Therefore, the compression ratio $\rho$ of T w.r.t. G is $\rho = \frac{37}{64} \times 100 \approx 57.8$.

A grammar-based compression problem for a tree is defined as the following problem Find Min AVRG.

Find Min AVRG
Instance: A tree T.
Problem: Find an admissible VRG G such that $L(G) = \{T\}$ and, for any admissible VRG $G'$ with $L(G') = \{T\}$, $|G| \le |G'|$.

This problem is regarded as an optimization problem for minimizing the size of an admissible VRG which generates only a given tree. We can prove the following theorem by a reduction from a restricted form of VERTEX COVER, in a similar way to the proof of Theorem 3.1 in [9].

Theorem 1. There is no polynomial time algorithm for solving Find Min AVRG with approximation ratio less than 8593/8592 unless P=NP.

This theorem shows the hardness of solving Find Min AVRG. That is, approximating the size of the minimum VRG to within a small constant factor is NP-hard. Based on this theoretical result, in the following sections we present an effective compression algorithm for finding an admissible VRG which generates only a given tree and whose size is as small as possible.
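The arithmetic of Example 3 can be checked with a few lines of Python. The helper names term_tree_size and grammar_size are hypothetical; the vertex, edge and variable counts are read off the sums quoted in Example 3 (for instance, t2 is taken to have three vertices, no edges and two variables of two vertices each).

```python
def term_tree_size(num_vertices, num_edges, variable_sizes=()):
    """|t| = |V_t| + 2|E_t| + sum of |h| over the variables h of t."""
    return num_vertices + 2 * num_edges + sum(variable_sizes)


def grammar_size(productions):
    """|G| = sum over productions x -> [g, sigma] of (|g| + |sigma|)."""
    return sum(body_size + sigma_len for body_size, sigma_len in productions)


size_T = term_tree_size(22, 21)            # the ground tree T
size_t1 = term_tree_size(4, 1, (2, 2))     # two variables of two vertices each
size_t2 = term_tree_size(3, 0, (2, 2))
size_t3 = term_tree_size(6, 5)

# Productions S -> [t1, ()], x -> [t2, (u1, u2)], y -> [t3, (v1, v2)].
size_G = grammar_size([(size_t1, 0), (size_t2, 2), (size_t3, 2)])
ratio = size_G / size_T * 100

print(size_T, size_t1, size_t2, size_t3, size_G)  # 64 10 7 16 37
print(round(ratio, 1))                             # 57.8
```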
3.2 Algorithm of Finding All Frequent Ground Term Subtrees
Let $T = (V, E, \emptyset)$ be a tree and $t = (V_t, E_t, \emptyset)$ a ground term subtree of T. From the definitions of a variable and a variable replacement, if there exists a path p in T from a vertex $v \in V_t$ to a vertex $u \in V - V_t$ such that v is neither the root nor a leaf of t and p does not contain any leaf of t, or if for two children $w_1$ and $w_2$ of the root r of t, there is a vertex $w \in V - V_t$ such that $w_1$ [...]
$\pi$ is a matching function from T to U$\}$, respectively. Similarly, given a k-pattern $T \in \mathcal{T}_k$ and a pseudo-matching function $\pi'$ from T to U, we define the pseudo rightmost occurrence (the pseudo-rml-occurrence for short) and the candidate rightmost occurrence list of T w.r.t. U to be $\pi'(k)$ and $\mathit{Roc}'_U(T) = \{\pi'(k) \mid \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$, respectively. Let $r \ge 2$ be an integer, which is called an occurrence count. T is said to be r-occurred for U if $|\mathit{Roc}_U(T)| \ge r$, and T is said to be r-pseudo-occurred for U if $|\mathit{Roc}'_U(T)| \ge r$. Then, we define the set of all r-occurred k-patterns in $\mathcal{T}_k$ for U as $F_{U,k,r} = \{T \mid T \in \mathcal{T}_k, |\mathit{Roc}_U(T)| \ge r\}$, and $F_{U,r} = \bigcup_k F_{U,k,r} \subseteq \mathcal{T}$. We define the set of all r-pseudo-occurred k-patterns in $\mathcal{T}_k$ for U as $F'_{U,k,r} = \{T \mid T \in \mathcal{T}_k, |\mathit{Roc}'_U(T)| \ge r\}$, and $F'_{U,r} = \bigcup_k F'_{U,k,r} \subseteq \mathcal{T}$. Let $\mathit{Roc}_{U,k,r} = \bigcup_{T \in F_{U,k,r}} \{\pi(k) \mid \pi \text{ is a matching function from } T \text{ to } U\}$ and let $\mathit{Roc}'_{U,k,r} = \bigcup_{T \in F'_{U,k,r}} \{\pi'(k) \mid \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$.

Let U be a tree, T a tree of normal form, and $\mathit{Roc}_U(T) = \{\pi(\mathit{rml}(T)) \mid \pi \text{ is a matching function from } T \text{ to } U\}$. From the definitions of a matching function and a pseudo-matching function, for a vertex v in $\mathit{Roc}_U(T)$, we can identify the unique matching function $\pi$ from T to U such that $\pi(\mathit{rml}(T)) = v$ and the unique pseudo-matching function $\pi'$ from T to U such that $\pi'(\mathit{rml}(T)) = v$. For a tree T of normal form and a vertex v of U, a ground term subtree $G = (V_G, E_G, \emptyset)$ of U is said to be identified by T and v if there exists an isomorphism $\pi$ between T and G such that $\pi(\mathit{rml}(T)) = v$.

Let $T \in \mathcal{T}_{k-1}$, let p be any integer with $0 \le p < \mathit{depth}_T(\mathit{rml}(T))$, and let $l \in \Lambda$ be any edge label. Then, the (p, l)-expansion of T is the tree S obtained from T by attaching a new vertex k to the vertex v such that the attached vertex k is the rightmost child of v and the edge between k and v has the label l, where $v = \mathit{pa}_T^p(\mathit{rml}(T))$, that is, v is the p-th parent of the rightmost leaf of T.

In Fig. 4, given a tree U and an occurrence count r as inputs, we present an efficient algorithm Find Freq Trees which outputs the set $F_{U,r}$ of all r-occurred patterns for U and the set of their rml-occurrences indexed by trees in $F_{U,r}$ w.r.t. U. In Fig. 5, we present a procedure Expand Trees used in Find Freq Trees. Given the set R(T) calculated in line 4 of the procedure Expand Trees and the integer p as inputs, for every edge label $l \in \Lambda$ and every (p, l)-expansion S of T, the procedure Scanning Sibling in line 5 of Expand Trees returns the candidate rightmost occurrence list NewRoc (that is, the set of all pseudo-rml-occurrences) of S w.r.t. the tree U as follows. Initially, Scanning Sibling creates an empty set NewRoc. Next, for each $v \in R(T)$ and each $l \in \Lambda$, it adds the pair (l, u) to NewRoc if there exists a (p, l)-expansion of T in U such that u is the leftmost child of v in U if p = 0, and u is the vertex $\mathit{next}_U(\mathit{pa}_U^{p-1}(v))$ otherwise.
Then, the following theorem holds.

Theorem 2. When a tree U and an occurrence count $r \ge 2$ are given as inputs, the algorithm Find Freq Trees correctly constructs the set $F_{U,r}$ of all r-occurred patterns for U and the set $\bigcup_{T \in F_{U,r}} \mathit{Roc}_U(T)$ of rml-occurrences indexed by trees in $F_{U,r}$ w.r.t. U in $O(|V_U| + A^2 N + A|F'_{U,r}||\Lambda|)$ time, where $V_U$ is the set of vertices in U, A is the maximum number of vertices of trees in the set $F'_{U,r}$ of all r-pseudo-occurred patterns for U, and $N = \sum_{T \in F'_{U,r}} |\mathit{Roc}'_U(T)|$.
Algorithm Find Freq Trees
Input: A tree U and an occurrence count $r \ge 2$
Output: The set $F_{U,r}$ of all r-occurred patterns for U and their rml-occurrence lists $\mathit{Roc} = \bigcup_{T \in F_{U,r}} \mathit{Roc}_U(T)$

1. Compute $F'_{U,1,r}$, $\mathit{Roc}'_{U,1,r}$, $F'_{U,2,r}$, and $\mathit{Roc}'_{U,2,r}$ from U in level-order traversal;
2. k := 3;
3. while $F'_{U,k-1,r} \neq \emptyset$ do
4.     $F_{U,k-1,r}$, $\mathit{Roc}_{U,k-1,r}$, $F'_{U,k,r}$, $\mathit{Roc}'_{U,k,r}$ := Expand Trees($F'_{U,k-1,r}$, $\mathit{Roc}'_{U,k-1,r}$, r);
5.     k := k + 1;
6. end;
7. $F_{U,r} := F_{U,1,r} \cup \cdots \cup F_{U,k-2,r}$; /* $F_{U,1,r} = F'_{U,1,r}$ */
8. $\mathit{Roc} := \mathit{Roc}_{U,1,r} \cup \cdots \cup \mathit{Roc}_{U,k-2,r}$; /* $\mathit{Roc}_{U,1,r} = \mathit{Roc}'_{U,1,r}$ */
9. return $F_{U,r}$, $\mathit{Roc}$;
Fig. 4. Algorithm Find Freq Trees
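The (p, l)-expansion used by Expand Trees can be illustrated with a short Python sketch. It assumes a preorder encoding that is not taken from the paper: a pattern is the list of (depth, edge label) pairs of its non-root vertices, with the root at depth 0, so that the rightmost leaf is the last entry. It further assumes that p may range up to the root, which corresponds to counting the depth of the root as 1 in the condition 0 ≤ p < depth(rml(T)). The function name rightmost_expansions is hypothetical.

```python
def rightmost_expansions(pattern, labels):
    """Enumerate all (p, l)-expansions of an edge-labeled ordered tree.

    pattern: list of (depth, label) pairs of the non-root vertices in
             preorder; the root is implicit at depth 0.
    labels:  the edge label alphabet.
    The new vertex becomes the rightmost child of the p-th parent of the
    rightmost leaf, so it is appended at depth d + 1 - p, where d is the
    depth of the rightmost leaf.
    """
    d = pattern[-1][0] if pattern else 0
    expansions = []
    for p in range(d + 1):
        for label in labels:
            expansions.append((p, label, pattern + [(d + 1 - p, label)]))
    return expansions


# A path root -a-> v1 -b-> v2; its rightmost leaf v2 has depth 2.
path = [(1, "a"), (2, "b")]
for p, label, expanded in rightmost_expansions(path, ["a"]):
    print(p, label, expanded)
```

Appending the new vertex at the end of the preorder list is valid because every vertex on the rightmost path has all of its existing descendants earlier in preorder, so the new child is automatically the rightmost one.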
3.3 Grammar-Based Compression Algorithm for an Ordered Rooted Tree
Let U be a tree, T a tree of normal form, and $\mathit{Roc}_U(T) = \{\pi(\mathit{rml}(T)) \mid \pi \text{ is a matching function from } T \text{ to } U\}$. Then, a subset $R_T \subseteq \mathit{Roc}_U(T)$ is a valid subset of $\mathit{Roc}_U(T)$ if for any two vertices $u, v \in R_T$, $t_u$ and $t_v$ do not overlap in U, where $t_u$ is the ground term subtree identified by T and u, and $t_v$ is the ground term subtree identified by T and v. Moreover, a valid subset $R_T$ of $\mathit{Roc}_U(T)$ is maximal if no subset R of $\mathit{Roc}_U(T)$ with $R_T \subsetneq R$ is a valid subset of $\mathit{Roc}_U(T)$. We can compute a maximal valid subset $R_T$ of $\mathit{Roc}_U(T)$ by a level-order traversal of U as follows. Let $\mathit{Roc}_U(T) = \{v_1, \ldots, v_n\}$ such that for $1 \le i < j \le n$, $v_i$ is found before $v_j$ by the level-order traversal of U, and let $R_T = \{v_1\}$ initially. For each $i = 2, \ldots, n$, we add $v_i$ to $R_T$ if there exists no vertex u in $R_T$ such that $t_i$ and $t_u$ overlap in U, where $t_u$ is the ground term subtree of U identified by T and u, and $t_i$ is the ground term subtree of U identified by T and $v_i$. We remark that the above maximal valid subset $R_T$ of $\mathit{Roc}_U(T)$ is not always the best for compressing a given tree.

In Fig. 6, given a tree U and an occurrence count r as inputs, we present a greedy algorithm Compress Tree for finding an admissible VRG which generates only U and is as small as possible. The algorithm Compress Tree is based on a greedy approach of replacing isomorphic term subtrees which do not overlap in a given tree by the same variable, in increasing order of the size of the replaced term subtree. In line 1 of Compress Tree, we find the set F of all r-occurred patterns for U and the set of their rml-occurrences indexed by trees in F w.r.t. U by using the algorithm Find Freq Trees. In the while-loop from line 4 to line 25, Compress Tree fixes on all ground term subtrees which
Procedure Expand Trees
Input: A set $F_{old}$ of patterns, a set $\mathit{Roc}_{old}$ of pseudo-rml-occurrences indexed by trees in $F_{old}$, and an occurrence count $r \ge 2$.
Output: A set FixF of r-occurred patterns for U, a set FixRoc of their rml-occurrences indexed by trees in FixF, a set $F'_{new}$ of the rightmost expansions of trees in $F_{old}$, and a set $\mathit{Roc}'_{new}$ of pseudo-rml-occurrences indexed by trees in $F'_{new}$ w.r.t. U.

1. FixF := $F_{old}$; FixRoc := $\mathit{Roc}_{old}$; $F'_{new}$ := ∅; $\mathit{Roc}'_{new}$ := ∅;
2. foreach tree $T \in F_{old}$ do
3.     foreach $0 \le p < \mathit{depth}(\mathit{rml}(T))$ do
4.         R(T) := $\{\pi'(\mathit{rml}(T)) \mid \pi'(\mathit{rml}(T)) \in \mathit{FixRoc}, \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$;
5.         NewRoc := Scanning Sibling(R(T), p);
6.         foreach $l \in \Lambda$ do
7.             compute the (p, l)-expansion S of T;
8.             NewRoc(l) := $\{v \mid (l, v) \in \mathit{NewRoc}\}$;
9.             if |NewRoc(l)| ≥ r then
10.                $F'_{new}$ := $F'_{new} \cup \{S\}$;
11.                $\mathit{Roc}'_{new}$ := $\mathit{Roc}'_{new} \cup \{(S, v) \mid v \in \mathit{NewRoc}(l)\}$; /* end of if */
12.            if $p \neq 0$ and $p \neq \mathit{depth}(\mathit{rml}(T)) - 1$ then
13.                while NewRoc(l) ≠ ∅ do
14.                    choose a vertex v in NewRoc(l);
15.                    FixRoc := FixRoc − $\{\pi'(\mathit{prevRml}(S)) \mid \pi' \text{ is a pseudo-matching function from } S \text{ to } U \text{ and } \pi'(\mathit{rml}(S)) = v\}$;
16.                    NewRoc(l) := NewRoc(l) − {v};
17.                end; /* end of if */
18.        end;
19.        R(T) := $\{\pi'(\mathit{rml}(T)) \mid \pi'(\mathit{rml}(T)) \in \mathit{FixRoc}, \pi' \text{ is a pseudo-matching function from } T \text{ to } U\}$;
20.        if |R(T)| < r then
21.            FixF := FixF − {T};
22.            FixRoc := FixRoc − $\{v \mid v \in R(T)\}$;
23.            break; /* end of if */
24.    end; /* end of foreach-loop */
25. end; /* end of foreach-loop */
26. return FixF, FixRoc, $F'_{new}$, $\mathit{Roc}'_{new}$;
Fig. 5. Procedure Expand Trees
are actually replaced by variables in the procedure Make Grammar of line 26. In line 14, we revise the set Roc as follows: for each $G \in F_{org}$, we remove from Roc all vertices u in $\{\pi(\mathit{rml}(G)) \in \mathit{Roc} \mid \pi \text{ is a matching function from } G \text{ to } U\}$ such that the ground term subtree $g_u$ of U identified by G and u satisfies the following condition. There exists a vertex v in vRoc(T) such that $t_v$ and $g_u$ overlap in U, or there exists a vertex v in vRoc(T) − {w} such that $g_u$ is a ground term subtree of $t_v$, where w is the first rml-occurrence of T in the level-order traversal of U and $t_v$ is the ground term subtree of U identified by T and v. The procedure Make Grammar in line 26 constructs an admissible VRG G by applying the following operations to U in increasing order of the size of T for (T, VList(T)) ∈ tmpRules. Let $Q = (V_Q, E_Q, H_Q)$ be a copy of U. We
initialize $R_Q := \emptyset$ and $H_Q := \emptyset$. For (T, VList(T)) ∈ tmpRules, $H_Q := H_Q \cup \{h_\pi \mid \pi(\mathit{rml}(T)) \in \mathit{Roc}(T), (\pi, h_\pi) \in \mathit{VList}(T)\}$ and $R_Q := R_Q \cup \{x \to [t_T, \sigma]\}$, where x is a new variable label, $t_T$ is the term subtree of Q corresponding to the ground term subtree identified by T and the first rml-occurrence in the level-order traversal, and $\sigma$ is the first list of VList(T). Then, for each element $(\pi, h_\pi) \in \mathit{VList}(T)$ such that $\pi(\mathit{rml}(T)) \in \mathit{Roc}(T)$, we revise the term tree Q by deleting the term subtree of Q corresponding to the ground term subtree identified by T and $\pi(\mathit{rml}(T))$. Finally, the rule S → [Q, ()] is added to $R_Q$ and the procedure Make Grammar outputs the admissible VRG $G = (S, R_Q)$. Then, the following theorem holds.

Theorem 3. When a tree U and an occurrence count r are given, the algorithm Compress Tree in Fig. 6 correctly produces an admissible VRG $G = (S, R)$ over $\langle \Lambda, X \rangle$ with $L(G) = \{U\}$ in $O(|V_U| + A^2 N + A|F'_{U,r}||\Lambda| + BMC)$ time, where $V_U$ is the vertex set of U, A is the maximum number of vertices of trees in the set $F'_{U,r}$ of all r-pseudo-occurred patterns for U, $N = \sum_{T \in F'_{U,r}} |\mathit{Roc}'_U(T)|$, B is the maximum number of vertices of trees in $F_{U,r}$, $M = \sum_{T \in F_{U,r}} |\mathit{Roc}_U(T)|$, and C is the number of variable labels appearing in G.

Proof. (Sketch) The correctness follows from the following facts (1) and (2). (1) The VRG $G = (S, R)$ constructed by Compress Tree is deterministic: for any variable label x appearing in G, G has only one production p in R such that the variable label on the left-hand side of p is x. Therefore, we can see that |L(G)| = 0 or |L(G)| = 1. (2) U is in L(G), since no two term subtrees which are replaced by variables in Make Grammar overlap in U. From (1) and (2), we can see that G is an admissible VRG with $L(G) = \{U\}$. From Theorem 2, line 1 can be executed in $O(|V_U| + A^2 N + A|F'_{U,r}||\Lambda|)$ time. Moreover, lines 4 to 25 can be executed in O(BMC) time. This gives the time complexity of Compress Tree.
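Two steps of Compress Tree lend themselves to a small Python sketch: the greedy selection of a maximal valid subset (line 8 of Fig. 6) and the gain estimate Size (line 12). The sketch assumes that each occurrence is given explicitly by its vertex set and edge set, and it reads the overlap condition of Section 2.1 as sharing an edge while neither vertex set contains the other. The function names and the toy path are hypothetical.

```python
def overlap(f, g):
    """f, g: (vertex set, edge set) of two ground term subtrees of U."""
    (vf, ef), (vg, eg) = f, g
    return bool(ef & eg) and not vf <= vg and not vg <= vf


def maximal_valid_subset(occurrences):
    """Greedy scan; occurrences are assumed to be in level-order of U."""
    chosen = []
    for occ in occurrences:
        if not any(overlap(occ, kept) for kept in chosen):
            chosen.append(occ)
    return chosen


def gain(m, tmp_size, k):
    """Size := (m - 1) * tmpSize(T) - (2m + 1) * k, as in line 12 of Fig. 6."""
    return (m - 1) * tmp_size - (2 * m + 1) * k


def occurrence(path_vertices):
    """Helper: the subtree of a path, given by its consecutive vertices."""
    edges = frozenset(zip(path_vertices, path_vertices[1:]))
    return (frozenset(path_vertices), edges)


# Toy input: U is the path 0-1-2-3-4 and the pattern is a path with two edges.
occs = [occurrence((0, 1, 2)), occurrence((1, 2, 3)), occurrence((2, 3, 4))]
valid = maximal_valid_subset(occs)
print(len(valid))                       # 2: the middle occurrence is skipped
print(gain(m=3, tmp_size=16, k=2))      # 18
```

As the text remarks, the subset found by this scan is maximal but not necessarily the best choice for compression; the sketch only mirrors the level-order greedy rule.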
4 Implementation and Experimental Results
In order to evaluate our grammar-based compression algorithm Compress Tree presented in the previous section, we have implemented Compress Tree and two other algorithms, Algorithm 1 and Algorithm 2. Algorithm 1 is based on a greedy approach of replacing isomorphic term subtrees, which do not overlap in a given tree, by the same variable in decreasing order of the size of the replaced term subtree. That is, Algorithm 1 is obtained from Compress Tree by replacing line 5 of Compress Tree with the instruction “let T be a largest tree in F ”. Algorithm 2 is based on an approach of repeatedly replacing by a variable the isomorphic term subtrees which do not overlap in a given tree T and which give the best compression ratio. Algorithm 2 is
Algorithm Compress Tree
Input: A tree U and an integer $r \ge 2$
Output: An admissible VRG $G = (S, R)$ such that $L(G) = \{U\}$ and a compression ratio $\rho$

1. F, Roc := Find Freq Trees(U, r);
2. remove all trees consisting of one vertex or two vertices from F;
3. tmpRules := ∅, $F_{org}$ := F and for each T ∈ F, tmpSize(T) := |T|;
4. while F ≠ ∅ do
5.     let T be a smallest tree in F;
6.     F := F − {T};
7.     Roc(T) := $\{\pi(\mathit{rml}(T)) \in \mathit{Roc} \mid \pi \text{ is a matching function from } T \text{ to } U\}$;
8.     compute a maximal valid subset vRoc(T) of Roc(T);
9.     m := |vRoc(T)|;
10.    fix on the integer k > 0 and VList(T) := $\bigcup_{v \in \mathit{vRoc}(T)} \{(\pi, h_v) \mid \pi$ is a matching function from T to U such that $\pi(\mathit{rml}(T)) = v$, $h_v$ is a variable which consists of k vertices of U and by which the term subtree identified by T and v can be replaced$\}$;
11.    fix on hypertree [T, σ] such that |σ| = k, by using VList(T);
12.    Size := (m − 1) tmpSize(T) − (2m + 1)k;
13.    if Size ≥ 1 then
14.        Revise Roc by removing all useless vertices in Roc, using $F_{org}$;
15.        tmpRules := tmpRules ∪ {(T, VList(T))};
16.        foreach G ∈ F do
17.            R(G) := $\{\pi(\mathit{rml}(G)) \in \mathit{Roc} \mid \pi \text{ is a matching function from } G \text{ to } U\}$;
18.            if |R(G)| ≤ 1 then F := F − {G}; $F_{org}$ := $F_{org}$ − {G};
19.            else
20.                let w be a vertex in R(G);
21.                let $g_w$ be the identified ground term subtree of U by G and w;
22.                n := $|\{u \in \mathit{vRoc}(T) \mid g_w$ has the identified ground term tree by T and u as a term subtree$\}|$;
23.                tmpSize(G) := tmpSize(G) − n(tmpSize(T) − 2k); /* end of if */
24.        end; /* end of if */
25. end;
26. G := Make Grammar(U, tmpRules, Roc);
27. return G, $\frac{|G|}{|U|} \times 100$;
Fig. 6. Algorithm Compress Tree
the algorithm obtained by adding the instruction “else break;” under line 24 of Compress Tree and replacing line 5 of Compress Tree with the following instruction INSTRUCTION: “let T be a best tree in F with respect to the compression ratio obtained by replacing the term subtrees, which are isomorphic to T and do not overlap, by a variable”. That is, Algorithm 2 can be regarded as the algorithm SUBDUE in [5], which is based on a Minimum Description Length heuristic.

We have evaluated our algorithm Compress Tree by comparing it with Algorithm 1 and Algorithm 2 with respect to the execution time and the compression ratio obtained when applying them to large artificial trees. The machine used in the experiments is a PC with two 2.4GHz CPUs and 1.00GB of main memory. We implemented a data generator which randomly produces a large artificial tree satisfying the following conditions. (1) The number of vertices is 20,000, 40,000, 60,000, 80,000 or 100,000. (2) The degree of each vertex is less than 3. (3) The number of edge labels is less than 2. For N ∈ {20,000, 40,000, 60,000, 80,000, 100,000}, let D(N) be the set of 10 trees with N vertices produced by the data generator.

We tested the execution times and the compression ratios of Compress Tree, Algorithm 1 and Algorithm 2 on the different datasets with the occurrence count 2. Fig. 7 (a) shows the relationship between the number of vertices and the execution time. We remark that each execution time does not include the time for reading the input data and is the average execution time over the trees in a dataset. For example, Fig. 7 (a) indicates that the average execution time of Algorithm 1 for trees in D(60,000) is about 300 seconds. From Fig. 7 (a), our algorithm Compress Tree is the fastest of the three algorithms. Fig. 7 (b) shows the relationship between the number of vertices and the compression ratio. Each compression ratio in Fig. 7 (b) is the average compression ratio over the trees in a dataset. For example, from Fig. 7 (b), we can see that the average compression ratio of Compress Tree for trees in D(60,000) is about 60%. From Fig. 7 (a) and (b), Compress Tree and Algorithm 2 perform far better than Algorithm 1. Fig. 7 (c) shows the relationship between the number of vertices in the input data and the number of variables appearing in the admissible VRG output by each algorithm. Moreover, Fig. 7 (d) shows the relationship between the number of vertices in the input data and the average number of variable labels used in the admissible VRG output by each algorithm. From Fig. 7 (c) and (d), although the number of variables appearing in the admissible VRG produced by each algorithm is almost the same, Algorithm 1 produced an admissible VRG which has far more variable labels in each dataset than the other two algorithms. Moreover, from Fig. 7 (b), (c) and (d), we can see that Compress Tree and Algorithm 2 have almost the same performance. This indicates that the order of the trees chosen at INSTRUCTION in Algorithm 2 almost coincides with the order of the trees chosen at line 5 of Compress Tree. For these reasons, we can see that our algorithm Compress Tree and the algorithm Algorithm 2
398
Kazunori Yamagata et al.
(a) Execution Time vs Number of Vertices
(b) Compression Ratio vs Number of Vertices
(c) Number of Variables vs Number of Vertices
(d) Number of Variable Labels vs Number of Vertices
Fig. 7. Experiment 1 of comparing Compress Tree with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio under the circumstances of different datasets and the fixed occurrence count 2. suit for lossless compression of a large tree, but the algorithm Algorithm 1 does not suit. we tested the execution times and the compression ratios of three algorithms for the dataset D(80, 000) by varying an occurrence count from 2 to 5. Fig. 8 shows the performances of three algorithms for different occurrence counts. We can obtain the similar results as the previous experiments from Fig. 8. From these experimental results, we can see that the algorithm Compress Tree suits for lossless compression of a large tree and have an advantage of execution time.
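The data generator used in these experiments can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "degree" means the number of children of a vertex (at most 2), that "less than 2 edge labels" means a single label drawn from range(num_edge_labels), and that each new vertex is attached to a uniformly chosen vertex that still has room. The names generate_tree, make_dataset, max_children and num_edge_labels are hypothetical.

```python
import random


def generate_tree(n, max_children=2, num_edge_labels=1, seed=None):
    """Return (children, edges) for a random ordered rooted tree with n vertices.

    children[v] lists the children of v in left-to-right order;
    edges holds (parent, child, edge_label) triples. Vertex 0 is the root.
    Assumption: "degree" is read as the number of children of a vertex.
    """
    rng = random.Random(seed)
    children = {0: []}
    edges = []
    open_vertices = [0]  # vertices that can still accept another child
    for v in range(1, n):
        i = rng.randrange(len(open_vertices))
        parent = open_vertices[i]
        label = rng.randrange(num_edge_labels)
        children[parent].append(v)
        edges.append((parent, v, label))
        if len(children[parent]) == max_children:
            # parent is full: drop it in O(1) by overwriting with the last entry
            open_vertices[i] = open_vertices[-1]
            open_vertices.pop()
        children[v] = []
        open_vertices.append(v)
    return children, edges


def make_dataset(n, count=10, seed=0):
    """Build a dataset in the style of D(n): `count` random trees with n vertices."""
    return [generate_tree(n, seed=seed + k) for k in range(count)]
```

Under these assumptions, make_dataset(20000) would play the role of D(20,000). Fixing the seed keeps the dataset reproducible, which matters when average execution times are compared across algorithms.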
5 Conclusions
We have considered the problem of effectively compressing an ordered rooted tree without loss of information. We have presented an admissible VRG, which generates only a given ordered rooted tree. Then, for an ordered rooted tree T, we have defined the grammar-based compression problem as that of finding an admissible VRG of minimum size that generates only T. Moreover, we have shown the hardness of this problem by proving that there is no polynomial time algorithm with approximation ratio less than 8593/8592 unless P=NP. Next, we have presented an effective algorithm for finding an admissible VRG G that generates only a given ordered rooted tree and is as small as possible. In order to evaluate the performance of our algorithm, we have implemented it together with two other algorithms. Then, we have shown the effectiveness of our algorithm by comparing the three algorithms with respect to execution time and compression ratio on artificial large trees.
[Figure 8, four panels: (a) Execution Time vs Occurrence Count; (b) Compression Ratio vs Occurrence Count; (c) Number of Variables vs Occurrence Count; (d) Number of Variable Labels vs Occurrence Count]
Fig. 8. Experiment 2: comparison of Compress Tree with Algorithm 1 and Algorithm 2 with respect to execution time and compression ratio on the dataset D(80,000) with different occurrence counts.
From the viewpoint of computational complexity, we will analyze the approximation ratio of our algorithm, that is, the maximum over all inputs of the ratio between the size of the generated admissible VRG and the size of the smallest possible admissible VRG. Moreover, we will construct efficient data mining tools for losslessly compressed data and apply them to real-world data. We will also apply our grammar-based compression scheme to other graph structured data.
This work is partly supported by Grant-in-Aid for Young Scientists (B) No. 14780303 from the Ministry of Education, Culture, Sports, Science and Technology, Japan, and Hiroshima City University Grant for Special Academic Research (General Studies) No. 2117.
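The approximation ratio named as future work in the conclusions can be written out explicitly. The symbols below are assumptions introduced only for illustration and are not notation from the paper: A(T) denotes the admissible VRG output by an algorithm A on the input tree T, G*(T) denotes a smallest admissible VRG generating only T, and |G| denotes the size of G.

```latex
\rho_A \;=\; \max_{T} \frac{|A(T)|}{|G^{*}(T)|},
\qquad
|G^{*}(T)| \;=\; \min\bigl\{\, |G| \;:\; G \text{ is an admissible VRG generating only } T \,\bigr\}
```

In this notation, the hardness result stated in the conclusions says that no polynomial time algorithm A achieves \rho_A < 8593/8592 unless P=NP.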
References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000.
2. A.V. Aho, J.E. Hopcroft, and J.D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
3. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. Proc. 2nd SIAM Int. Conf. Data Mining (SDM-2002), pages 158–174, 2002.
4. M. Charikar, E. Lehman, D. Liu, and R. Panigrahy. Approximating the smallest grammar: Kolmogorov Complexity in natural models. Proc. 34th ACM STOC’02, pages 792–801, 2002.
5. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15:32–41, 2000.
6. G. Rozenberg (Ed.). Handbook of Graph Grammars and Computing by Graph Transformation, volume 1. World Scientific Publishing, 1997.
7. Y. Itokawa, T. Uchida, T. Shoudai, T. Miyahara, and Y. Nakamura. Finding frequent subgraphs from graph structured data with geometric information and its application to lossless. Proc. PAKDD-2003, Springer-Verlag, LNAI 2637, pages 582–594, 2003.
8. J. C. Kieffer and E-h. Yang. Grammar based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46:737–754, 2000.
9. E. Lehman and A. Shelat. Approximation algorithms for grammar-based compression. Proc. SODA 2002, pages 205–212, 2002.
10. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions of tree patterns with internal structured variables from queries. Proc. AI-2002, Springer-Verlag, LNAI 2557, pages 523–534, 2002.
11. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda. Discovery of frequent tag tree patterns in semistructured web documents. Proc. PAKDD-2002, Springer-Verlag, LNAI 2336, pages 341–355, 2002.
12. C. Nevill-Manning and I. Witten. Compression and explanation using hierarchical grammars. Computer Journal, 40(2/3):103–116, 1997.
13. H. Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. DOI Technical Report 214, Department of Informatics, Kyushu University, 2003.
14. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time inductive inference of ordered tree patterns with internal structured variables from positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184, 2002.
15. T. Uchida, Y. Itokawa, T. Shoudai, T. Miyahara, and Y. Nakamura. A new framework for discovering knowledge from two-dimensional structured data using layout formal graph system. Proc. ALT-2000, Springer-Verlag, LNAI 1968, pages 141–155, 2000.
16. K. Wang and H. Liu. Discovering structural association of semistructured data. IEEE Trans. Knowledge and Data Engineering, 12:353–371, 2000.
Author Index
Appice, Annalisa 4
Arias, Marta 22
Atramentov, Anna 38
Baião, Fernanda 57
Basile, Teresa Maria Altomare 112
Blockeel, Hendrik 329
Broda, Krysia 311
Camacho, Rui 130
Ceci, Michelangelo 4
Cleuziou, Guillaume 75
Colton, Simon 93
Di Mauro, Nicola 112
Doshi, Shailesh 281
Driessens, Kurt 146
Esposito, Floriana 112
Fanizzi, Nicola 112
Ferilli, Stefano 112
Flach, Peter A. 197, 251
Fonseca, Nuno 130
Gärtner, Thomas 146
Hirata, Kouichi 164
Hoche, Susanne 180
Honavar, Vasant 38
Huang, Fang 281
Khardon, Roni 22
King, Ross D. 1
Krogel, Mark-A. 197
Lavrač, Nada 197
Leiva, Hector 38
Lisi, Francesca A. 215
Lloyd, John W. 2
Malerba, Donato 4, 215
Mărginean, Flaviu Adrian 233
Martin, Lionel 75
Matsumoto, Satoshi 347
Mattoso, Marta 57
Mavroeidis, Dimitrios 251
Muggleton, Stephen 93, 269
Nakamura, Yasuaki 383
Oates, Tim 281
Otero, Ramon P. 299
Ramon, Jan 146
Rawles, Simon 197
Ray, Oliver 311
Rocha, Ricardo 130
Russo, Alessandra 311
Shavlik, Jude 57
Shoudai, Takayoshi 347, 383
Silva, Fernando 130
Struyf, Jan 329
Suzuki, Yusuke 347
Tamaddoni-Nezhad, Alireza 269
Tobudic, Asmir 365
Uchida, Tomoyuki 347, 383
Vrain, Christel 75
Watanabe, Hiroaki 269
Widmer, Gerhard 365
Wrobel, Stefan 180, 197
Yamagata, Kazunori 383
Zaverucha, Gerson 57
Železný, Filip 197