Subseries of Lecture Notes in Computer Science
Algorithmic Learning Theory
14th International Conference, ALT 2003
Sapporo, Japan, October 17-19, 2003
Proceedings
Volume Editors

Ricard Gavaldà
Technical University of Catalonia, Department of Software (LSI)
Jordi Girona Salgado 1-3, 08034 Barcelona, Spain
E-mail: [email protected]

Klaus P. Jantke
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Im Stadtwald, Geb. 43.8, 66125 Saarbrücken, Germany
E-mail: [email protected]

Eiji Takimoto
Tohoku University, Graduate School of Information Sciences
Sendai 980-8579, Japan
E-mail: [email protected]

Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): I.2.6, I.2.3, F.1, F.2, F.4.1, I.7

ISSN 0302-9743
ISBN 3-540-20291-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany

Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH
Printed on acid-free paper  SPIN: 10963852  06/3142  5 4 3 2 1 0
Preface
This volume contains the papers presented at the 14th Annual Conference on Algorithmic Learning Theory (ALT 2003), which was held in Sapporo (Japan) during October 17–19, 2003. The main objective of the conference was to provide an interdisciplinary forum for discussing the theoretical foundations of machine learning as well as their relevance to practical applications. The conference was co-located with the 6th International Conference on Discovery Science (DS 2003). The volume includes 19 technical contributions that were selected by the program committee from 37 submissions. It also contains the ALT 2003 invited talks presented by Naftali Tishby (Hebrew University, Israel) on “Efficient Data Representations that Preserve Information,” by Thomas Zeugmann (University of Lübeck, Germany) on “Can Learning in the Limit be Done Efficiently?”, and by Genshiro Kitagawa (Institute of Statistical Mathematics, Japan) on “Signal Extraction and Knowledge Discovery Based on Statistical Modeling” (joint invited talk with DS 2003). Furthermore, this volume includes abstracts of the invited talks for DS 2003 presented by Thomas Eiter (Vienna University of Technology, Austria) on “Abduction and the Dualization Problem” and by Akihiko Takano (National Institute of Informatics, Japan) on “Association Computation for Information Access.” The complete versions of these papers were published in the DS 2003 proceedings (Lecture Notes in Artificial Intelligence Vol. 2843). ALT has been awarding the E. Mark Gold Award for the most outstanding paper by a student author since 1999. This year the award was given to Sandra Zilles for her paper “Intrinsic Complexity of Uniform Learning.” This conference was the 14th in a series of annual conferences established in 1990. Continuation of the ALT series is supervised by its steering committee, consisting of: Thomas Zeugmann (Univ. of Lübeck, Germany), Chair, Arun Sharma (Univ. of New South Wales, Australia), Co-chair, Naoki Abe (IBM T.J. 
Watson Research Center, USA), Klaus Peter Jantke (DFKI, Germany), Phil Long (National Univ. of Singapore), Hiroshi Motoda (Osaka Univ., Japan), Akira Maruoka (Tohoku Univ., Japan), Luc De Raedt (Albert-Ludwigs-Univ., Germany), Takeshi Shinohara (Kyushu Institute of Technology, Japan), and Osamu Watanabe (Tokyo Institute of Technology, Japan). We would like to thank all individuals and institutions who contributed to the success of the conference: the authors for submitting papers, the invited speakers for accepting our invitation and lending us their insight into recent developments in their research areas, as well as the sponsors for their generous financial support. Furthermore, we would like to express our gratitude to all program committee members for their hard work in reviewing the submitted papers and participating in on-line discussions. We are also grateful to the external referees whose reviews made a considerable contribution to this process.
We are also grateful to the DS 2003 Chairs Yuzuru Tanaka (Hokkaido University, Japan), Gunter Grieser (Technical University of Darmstadt, Germany) and Akihiro Yamamoto (Hokkaido University, Japan) for their efforts in coordinating with ALT 2003, and to Makoto Haraguchi and Yoshiaki Okubo (Hokkaido University, Japan) for their excellent work on the local arrangements. Last but not least, Springer-Verlag provided excellent support in preparing this volume.
August 2003
Ricard Gavaldà    Klaus P. Jantke    Eiji Takimoto
Organization
Conference Chair

Klaus P. Jantke    DFKI GmbH, Saarbrücken, Germany
Program Committee

Ricard Gavaldà (Co-Chair)    Tech. Univ. of Catalonia, Spain
Eiji Takimoto (Co-Chair)    Tohoku Univ., Japan
Hiroki Arimura    Kyushu Univ., Japan
Shai Ben-David    Technion, Israel
Nicolò Cesa-Bianchi    Univ. di Milano, Italy
Nello Cristianini    UC Davis, USA
François Denis    LIF, Univ. de Provence, France
Kouichi Hirata    Kyutech, Japan
Sanjay Jain    Nat. Univ. Singapore, Singapore
Stephen Kwek    Univ. Texas, San Antonio, USA
Phil Long    Genome Inst. Singapore, Singapore
Yasubumi Sakakibara    Keio Univ., Japan
Rocco Servedio    Columbia Univ., USA
Hans-Ulrich Simon    Ruhr-Univ. Bochum, Germany
Frank Stephan    Univ. Heidelberg, Germany
Christino Tamon    Clarkson Univ., USA
Local Arrangements

Makoto Haraguchi (Chair)    Hokkaido Univ., Japan
Yoshiaki Okubo    Hokkaido Univ., Japan
Subreferees

Kazuyuki Amano, Dana Angluin, Tijl De Bie, Laurent Brehelin, Christian Choffrut, Pedro Delicado, Claudio Gentile, Rémi Gilleron, Sally Goldman, Joshua Goodman, Colin de la Higuera, Hiroki Ishizaka, Jeffrey Jackson, Satoshi Kobayashi, Jean-Yves Marion, Andrei E. Romashchenko, Hiroshi Sakamoto, Kengo Sato, Dale Schuurmans, Chema Sempere, Shinichi Shimozono, Takeshi Shinohara, Robert Sloan, Lee Wee Sun, Hisao Tamaki, Marc Tommasi, Takashi Yokomori
Sponsoring Institutions The Japanese Ministry of Education, Culture, Sports, Science and Technology The Suginome Memorial Foundation, Japan
Table of Contents
Invited Papers

Abduction and the Dualization Problem . . . . . . . . . . . . . . . 1
  Thomas Eiter

Signal Extraction and Knowledge Discovery Based on Statistical Modeling . . . . . . . . . . . . . . . 3
  Genshiro Kitagawa

Association Computation for Information Access . . . . . . . . . . . . . . . 15
  Akihiko Takano

Efficient Data Representations That Preserve Information . . . . . . . . . . . . . . . 16
  Naftali Tishby

Can Learning in the Limit Be Done Efficiently? . . . . . . . . . . . . . . . 17
  Thomas Zeugmann
Regular Contributions

Inductive Inference

Intrinsic Complexity of Uniform Learning . . . . . . . . . . . . . . . 39
  Sandra Zilles

On Ordinal VC-Dimension and Some Notions of Complexity . . . . . . . . . . . . . . . 54
  Eric Martin, Arun Sharma, Frank Stephan

Learning of Erasing Primitive Formal Systems from Positive Examples . . . . . . . . . . . . . . . 69
  Jin Uemura, Masako Sato

Changing the Inference Type – Keeping the Hypothesis Space . . . . . . . . . . . . . . . 84
  Frank Balbach

Learning and Information Extraction

Robust Inference of Relevant Attributes . . . . . . . . . . . . . . . 99
  Jan Arpe, Rüdiger Reischuk
Efficient Learning of Ordered and Unordered Tree Patterns with Contractible Variables . . . . . . . . . . . . . . . 114
  Yusuke Suzuki, Takayoshi Shoudai, Satoshi Matsumoto, Tomoyuki Uchida, Tetsuhiro Miyahara
Learning with Queries

On the Learnability of Erasing Pattern Languages in the Query Model . . . . . . . . . . . . . . . 129
  Steffen Lange, Sandra Zilles

Learning of Finite Unions of Tree Patterns with Repeated Internal Structured Variables from Queries . . . . . . . . . . . . . . . 144
  Satoshi Matsumoto, Yusuke Suzuki, Takayoshi Shoudai, Tetsuhiro Miyahara, Tomoyuki Uchida

Learning with Non-linear Optimization

Kernel Trick Embedded Gaussian Mixture Model . . . . . . . . . . . . . . . 159
  Jingdong Wang, Jianguo Lee, Changshui Zhang

Efficiently Learning the Metric with Side-Information . . . . . . . . . . . . . . . 175
  Tijl De Bie, Michinari Momma, Nello Cristianini

Learning Continuous Latent Variable Models with Bregman Divergences . . . . . . . . . . . . . . . 190
  Shaojun Wang, Dale Schuurmans

A Stochastic Gradient Descent Algorithm for Structural Risk Minimisation . . . . . . . . . . . . . . . 205
  Joel Ratsaby

Learning from Random Examples

On the Complexity of Training a Single Perceptron with Programmable Synaptic Delays . . . . . . . . . . . . . . . 221
  Jiří Šíma

Learning a Subclass of Regular Patterns in Polynomial Time . . . . . . . . . . . . . . . 234
  John Case, Sanjay Jain, Rüdiger Reischuk, Frank Stephan, Thomas Zeugmann

Identification with Probability One of Stochastic Deterministic Linear Languages . . . . . . . . . . . . . . . 247
  Colin de la Higuera, Jose Oncina

Online Prediction

Criterion of Calibration for Transductive Confidence Machine with Limited Feedback . . . . . . . . . . . . . . . 259
  Ilia Nouretdinov, Vladimir Vovk

Well-Calibrated Predictions from Online Compression Models . . . . . . . . . . . . . . . 268
  Vladimir Vovk
Transductive Confidence Machine Is Universal . . . . . . . . . . . . . . . 283
  Ilia Nouretdinov, Vladimir V’yugin, Alex Gammerman

On the Existence and Convergence of Computable Universal Priors . . . . . . . . . . . . . . . 298
  Marcus Hutter
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Abduction and the Dualization Problem

Thomas Eiter
Institut für Informationssysteme, Technische Universität Wien, Favoritenstraße 9-11, A-1040 Wien, Austria
[email protected]
Abduction is a fundamental mode of reasoning which was extensively studied by C.S. Peirce, who also introduced the term for the inference of explanations for observed phenomena. Abduction has taken on increasing importance in Artificial Intelligence (AI) and related disciplines, where it has been recognized as an important principle of commonsense reasoning. It has applications in many areas of AI and Computer Science, including diagnosis, database updates, planning, natural language understanding, and learning, to name but a few. In a logic-based setting, abduction can be seen as the task of finding, given a set of formulas Σ (the background theory), a formula χ (the query), and a set of formulas A (the abducibles or hypotheses), a minimal subset E of A such that Σ ∪ E is satisfiable and logically entails χ (i.e., an explanation). In many application scenarios Σ is a propositional Horn theory, χ is a literal or a conjunction of literals, and the abducibles A are certain literals of interest. For use in practice, computing abductive explanations in this setting is an important problem, for which well-known early systems such as Poole’s Theorist or assumption-based Truth Maintenance Systems were devised in the 1980s. Since then, there has been a growing literature on this subject. Besides computing some arbitrary explanation for a query, the problem of generating several or all explanations has received increasing attention in recent years. This problem is important since one often would like to select one out of a set of alternative explanations according to a preference or plausibility relation; this relation may be based on subjective intuition which is difficult to formalize and thus cannot be implemented by an algorithm. In general, a query may have exponentially many explanations, and thus generating all explanations inevitably requires exponential time, even in propositional logic. 
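The propositional Horn setting described above can be made concrete with a small brute-force sketch (the names and the subset-enumeration strategy here are our own illustration; actual abduction systems use far more refined algorithms):

```python
from itertools import chain, combinations

def closure(clauses, facts):
    # Forward-chaining closure of a propositional Horn theory.
    # clauses: list of (body, head) pairs; head is None for an
    # integrity constraint (body -> false).
    derived, consistent, changed = set(facts), True, True
    while changed:
        changed = False
        for body, head in clauses:
            if set(body) <= derived:
                if head is None:
                    consistent = False
                elif head not in derived:
                    derived.add(head)
                    changed = True
    return derived, consistent

def explanations(clauses, query, abducibles):
    # All subset-minimal E of the abducibles such that the theory
    # plus E is consistent and entails the query atom. Brute force:
    # exponential in |abducibles|, as the text warns.
    found = []
    subsets = chain.from_iterable(
        combinations(abducibles, k) for k in range(len(abducibles) + 1))
    for e in subsets:
        derived, ok = closure(clauses, e)
        if ok and query in derived:
            e = frozenset(e)
            # enumeration by increasing size: earlier hits can only be
            # proper subsets of later ones, so this keeps minimal sets
            if not any(prev < e for prev in found):
                found.append(e)
    return found
```

For instance, with Σ = {rain → wet, sprinkler → wet}, query wet, and abducibles {rain, sprinkler}, both singleton explanations are produced.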
It is then of interest to know whether generating all explanations is feasible in polynomial total time (a.k.a. output-polynomial time), i.e., in time polynomial in the combined size of the input and the output. Furthermore, if exponential resources are prohibitive, it is of interest to know whether a few explanations (e.g., polynomially many) can be generated in polynomial time. In recent and ongoing work, we have investigated the computational complexity of generating all abductive explanations, and compiled a number of interesting results charting the tractability/intractability frontier of this problem. In this talk, we shall recall some of these results and then focus on abduction from Horn theories represented by their characteristic models. In this setting, the background theory T is represented by a set of so-called characteristic models, char(T), rather than by formulas. The benefit
is that for certain formulas, logical consequence from T efficiently reduces to deciding consequence from char(T) (which is easy), and thus admits tractable inference. In fact, finding some abductive explanation for a query literal is polynomial in this setting, while this is well known to be NP-hard under formula-based representation. Computing all abductive explanations for a query literal, a problem which arises in different contexts, is known to be polynomial-time equivalent (in a precise sense) to the problem of dualizing a Boolean function given by a monotone CNF. The latter problem, Monotone Dualization, is, with respect to complexity, a somewhat mysterious problem which for more than 20 years has resisted a precise classification in terms of well-established complexity classes. Currently, no polynomial total-time algorithm solving this problem is known; on the other hand, there is also no stringent evidence that such an algorithm is unlikely to exist (like, e.g., coNP-hardness of the associated decision problem whether, given two monotone CNFs ϕ and ψ, they represent dual functions). On the contrary, results in the 1990s provided some hints that the problem is closer to polynomial total time: as shown by Fredman and Khachiyan, the decisional variant can be solved in quasi-polynomial time, i.e., in time O(n^{log n}). This was recently refined to solvability in polynomial time with limited nondeterminism, i.e., using a poly-logarithmic number of bit guesses. Apart from this peculiarity, Monotone Dualization has been recognized as an important problem since a large number of other problems in Computer Science are known to be polynomial-time equivalent to it. It has a role similar to the one of SAT for the class NP: a polynomial total-time algorithm for Monotone Dualization implies polynomial total-time algorithms for all the polynomial-time equivalent problems. 
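The decision problem mentioned above asks whether two monotone CNFs ϕ and ψ represent dual functions, i.e., whether ψ(x) = ¬ϕ(¬x) for every assignment x. A naive exponential check can be sketched as follows (for illustration only; the algorithm of Fredman and Khachiyan is quasi-polynomial):

```python
from itertools import product

def eval_monotone_cnf(cnf, assignment):
    # cnf: list of clauses, each a set of variable indices (all
    # literals positive, hence monotone). True iff every clause
    # contains a variable set to True.
    return all(any(assignment[v] for v in clause) for clause in cnf)

def are_dual(phi, psi, n):
    # Dual functions satisfy psi(x) == not phi(not x) for every
    # assignment x over n variables. This brute-force check takes
    # 2^n steps, which is exactly what one would like to avoid.
    for bits in product([False, True], repeat=n):
        comp = tuple(not b for b in bits)
        if eval_monotone_cnf(psi, bits) != (not eval_monotone_cnf(phi, comp)):
            return False
    return True
```

As a sanity check, the dual of x0 ∨ x1 is x0 ∧ x1, while x0 ∨ x1 is not self-dual.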
We will consider some possible extensions of the results for abductive explanations which are polynomial-time equivalent to Monotone Dualization. Besides generating all abductive explanations for a literal, there are many other problems in Knowledge Discovery and Data Mining which are polynomial-time equivalent or closely related to Monotone Dualization, including learning with oracles, computation of infrequent and frequent sets, and key generation. We shall give a brief account of such problems, and finally will conclude with some open problems and issues for future research. The results presented are joint work with Kazuhisa Makino, Osaka University.
Association Computation for Information Access Akihiko Takano National Institute of Informatics Hitotsubashi, Chiyoda, Tokyo 101-8430 Japan [email protected]
Abstract. GETA (Generic Engine for Transposable Association) is a software system that provides efficient generic computation of association. It enables the quantitative analysis of various proposed methods based on association, such as measuring similarity among documents or words. The scalable implementation of GETA can handle large corpora of twenty million documents, and provides the implementation basis for effective next-generation information access. DualNAVI is an information retrieval system which is a successful example of the power and flexibility of GETA-based computation of association. It provides users with rich interaction both in document space and in word space. Its dual-view interface always returns the retrieved results in two views: a list of titles for the document space and a “Topic Word Graph” for the word space. The two views are tightly coupled by their cross-reference relation, and invite the users to further interaction. The two-stage approach in associative search, which is the key to its efficiency, also facilitates content-based correlation among databases. In this paper we describe the basic features of GETA and DualNAVI.
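The kind of document–word association computation the abstract refers to can be illustrated by a toy tf-idf/cosine sketch (our own illustration; GETA's actual data structures and weighting schemes are not specified here):

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one sparse tf-idf vector
    # (dict word -> weight) per document; rare words weigh more.
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return [{w: tf * log(n / df[w]) for w, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    # Cosine similarity of two sparse vectors; 0.0 if either is zero.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two documents sharing vocabulary score higher than two sharing none; associating words with documents rather than documents with words amounts to transposing the underlying term–document matrix, which is the "transposable" aspect in GETA's name.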
The full version of this paper is published in the Proceedings of the 6th International Conference on Discovery Science, Lecture Notes in Artificial Intelligence Vol. 2843.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, p. 15, 2003. © Springer-Verlag Berlin Heidelberg 2003
Efficient Data Representations That Preserve Information Naftali Tishby School of Computer Science and Engineering and Center for Neural Computation The Hebrew University, Jerusalem 91904, Israel [email protected]
Abstract. A fundamental issue in computational learning theory, as well as in biological information processing, is the best possible relationship between model representation complexity and its prediction accuracy. Clearly, we expect more complex models that require longer data representation to be more accurate. Can one provide a quantitative, yet general, formulation of this trade-off? In this talk I will discuss this question from the perspective of Shannon’s Information Theory. I will argue that this trade-off can be traced back to the basic duality between source and channel coding and is also related to the notion of “coding with side information”. I will review some of the theoretical achievability results for such relevant data representations and discuss our algorithms for extracting them. I will then demonstrate the application of these ideas to the analysis of natural language corpora and speculate on possibly universal aspects of human language that they reveal. Based on joint work with Ran Bacharach, Gal Chechik, Amir Globerson, Amir Navot, and Noam Slonim.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, p. 16, 2003. © Springer-Verlag Berlin Heidelberg 2003
Can Learning in the Limit Be Done Efficiently?

Thomas Zeugmann
Institut für Theoretische Informatik, Universität zu Lübeck, Wallstraße 40, 23560 Lübeck, Germany
[email protected]
Abstract. Inductive inference can be considered one of the fundamental paradigms of algorithmic learning theory. We survey recently obtained results and show their impact on potential applications. Since the main focus is put on the efficiency of learning, we also deal with postulates of naturalness and their impact on the efficiency of limit learners. In particular, we look at the learnability of the class of all pattern languages and ask whether or not one can design a learner within the paradigm of learning in the limit that is nevertheless efficient. For achieving this goal, we deal with iterative learning and its interplay with the hypothesis spaces allowed. This interplay also has a severe impact on the postulates of naturalness satisfiable by any learner. Finally, since a limit learner is only supposed to converge in the limit, one never knows at any particular learning stage whether or not the learner has already succeeded. The resulting uncertainty may be prohibitive in many applications. We survey results that resolve this problem by outlining a new learning model, called stochastic finite learning. Though pattern languages can neither be finitely inferred from positive data nor PAC-learned, our approach can be extended to a stochastic finite learner that exactly infers all pattern languages from positive data with high confidence.
1 Introduction
Inductive inference can be considered one of the fundamental paradigms of algorithmic learning theory. In particular, inductive inference of recursive functions and of recursively enumerable languages has been studied intensively within the last four decades (cf., e.g., [3,4,30,16]). The basic model considered within this framework is learning in the limit, which can be informally described as follows. The learner receives more and more data about the target and maps these data to hypotheses. Of special interest is the investigation of scenarios in which the sequence of hypotheses stabilizes to an accurate and finite description (e.g., a grammar, a program) of the target. Clearly, then some form of learning must have taken place. Here by data we mean either any infinite sequence of argument-value pairs (in the case of learning recursive functions) such that all arguments appear eventually, or any infinite sequence of all members of the target language
(in the case of language learning from positive data). Alternatively, one can also study language learning from both positive and negative data. Most of the work done in the field has been aimed at the following goals: showing which general collections of function classes or language classes are learnable, characterizing those collections of classes that can be learned, studying the impact of several postulates on the behavior and learning power of learners, and dealing with the influence of various parameters on the efficiency of learning. However, defining an appropriate measure for the complexity of learning in the limit has turned out to be quite difficult (cf. Pitt [31]). Moreover, when learning in the limit, in general one never knows whether or not the learner has already converged. This is caused by the fact that it is, in general, undecidable whether or not convergence has already occurred; and even when it is decidable, deciding it is practically infeasible. Thus, there is always an uncertainty which may not be tolerable in many applications of learning. Therefore, different learning models have been proposed. In particular, Valiant’s [46] model of probably approximately correct (abbr. PAC) learning has been very influential. As a matter of fact, this model puts strong emphasis on the efficiency of learning and avoids the problem of convergence altogether. In the PAC model, the learner receives a finite labeled sample of the target concept and outputs, with high probability, a hypothesis that is approximately correct. The sample is drawn with respect to an unknown probability distribution, and the error of, as well as the confidence in, the hypothesis are measured with respect to this distribution, too. Thus, if a class is PAC learnable, one obtains nice performance guarantees. Unfortunately, many interesting concept classes are not PAC learnable. Consequently, one has to look for other models of learning, or one is back to learning in the limit. 
So, let us assume that learning in the limit is our method of choice. What we would like to present in this survey is a rather general way to transform learning in the limit into stochastic finite learning. It should also be noted that our ideas may be beneficial even in the case that the considered concept class is PAC learnable. Furthermore, we aim to outline how a thorough study of the limit learnability of concept classes may nicely contribute to supporting our new approach. We exemplify the research undertaken mainly by looking at the class of all pattern languages introduced by Angluin [1]. As Salomaa [37] has put it, “Patterns are everywhere,” and thus we believe that our research is worth the effort undertaken. There are several problems that have to be addressed when dealing with the learnability of pattern languages. First, the nice thing about patterns is that they are very intuitive. Therefore, it seems desirable to design learners outputting patterns as their hypotheses. Unfortunately, membership is known to be NP-complete for the pattern languages (cf. [1]). Thus, many of the usual approaches used in machine learning will directly lead to infeasible learning algorithms. As a consequence, we shall ask what kind of appropriate hypothesis spaces can be used at all to learn the pattern languages, and what the appropriate learning strategies are.
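To illustrate why membership is hard, here is a brute-force matcher for non-erasing pattern languages; it backtracks over all ways of binding variables to non-empty substrings and is exponential in the worst case (the representation is our own sketch; Angluin [1] gives the formal definitions):

```python
def matches(pattern, word, binding=None):
    # Non-erasing pattern matching: pattern is a sequence of items
    # ("const", c) or ("var", x); every variable must be bound,
    # consistently, to a non-empty substring of the word.
    if binding is None:
        binding = {}
    if not pattern:
        return word == ""
    kind, val = pattern[0]
    if kind == "const":
        return word.startswith(val) and matches(pattern[1:], word[len(val):], binding)
    if val in binding:                      # variable already bound
        s = binding[val]
        return word.startswith(s) and matches(pattern[1:], word[len(s):], binding)
    for i in range(1, len(word) + 1):       # try every non-empty prefix
        binding[val] = word[:i]
        if matches(pattern[1:], word[i:], binding):
            return True
        del binding[val]
    return False
```

For example, the pattern xx generates exactly the squares: "abab" matches (with x = "ab"), "aba" does not.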
In particular, we shall deal with the problem of redundancy in the hypothesis space chosen, with consistency, conservativeness, and iterative learning. Here consistency means that the intermediate hypotheses output by the learner correctly reflect the data seen so far. Conservativeness addresses the problem of avoiding overgeneralization, i.e., preventing the learner from guessing a proper superset of the target language. These requirements are naturally arising desiderata, but this does not mean that they can be fulfilled. With iterative learning, the learning machine, in making a conjecture, has access only to its previous conjecture and the latest data item coming in. Iterative learning is also a natural requirement whenever learning in the limit is concerned, since no practical learner can process all examples provided so far at every learning stage; it may not even be able to store them. Finally, we address the question of how efficiently the overall learning process can be performed, and how we can get rid of the uncertainty of not knowing whether or not the learner has already converged.
2 Preliminaries
Unspecified notation follows Rogers [35]. By N = {0, 1, 2, ...} we denote the set of all natural numbers, and we set N+ = N \ {0}. The cardinality of a set S is denoted by |S|. Let ∅, ∈, ⊂, ⊆, ⊃, and ⊇ denote the empty set, element of, proper subset, subset, proper superset, and superset, respectively. Let ϕ_0, ϕ_1, ϕ_2, ... denote any fixed acceptable programming system for all (and only) the partial recursive functions over N (cf. Rogers [35]). Then ϕ_k is the partial recursive function computed by program k. In the following subsection we define the main learning models considered within this paper.

2.1 Learning in the Limit
Gold’s [12] model of learning in the limit allows one to formalize a rather general class of learning problems, i.e., learning from examples. For defining this model we assume any recursively enumerable set X and refer to it as the learning domain. By ℘(X) we denote the power set of X. Let C ⊆ ℘(X), and let c ∈ C be non-empty; then we refer to C and c as a concept class and a concept, respectively. Let c be a concept, and let t = (x_j)_{j∈N} be any infinite sequence of elements x_j ∈ c such that range(t) =df {x_j | j ∈ N} = c. Then t is said to be a positive presentation or, synonymously, a text for c. By text(c) we denote the set of all positive presentations of c. Moreover, let t be a positive presentation, and let y ∈ N. Then we set t_y = x_0, ..., x_y, i.e., t_y is the initial segment of t of length y + 1, and t_y^+ =df {x_j | j ≤ y}. We refer to t_y^+ as the content of t_y. Furthermore, let σ = x_0, ..., x_{n−1} be any finite sequence. Then we use |σ| to denote the length n of σ, and let content(σ) and σ^+, respectively, denote
the content of σ. Additionally, let t be a text and let τ be a finite sequence; then we write σt and στ for the sequence obtained by concatenating σ onto the front of t and τ, respectively. Alternatively, one can also consider complete presentations or, synonymously, informants. Let c be a concept; then any sequence i = (x_j, b_j)_{j∈N} of labeled examples, where b_j ∈ {+, −}, such that {x_j | j ∈ N} = X, i^+ = {x_j | (x_j, b_j) = (x_j, +), j ∈ N} = c, and i^− = {x_j | (x_j, b_j) = (x_j, −), j ∈ N} = X \ c, is called an informant for c. For the sake of presentation, the following definitions are only given for the text case; the generalization to the informant case should be obvious. We sometimes use the term data sequence to refer to both text and informant. An inductive inference machine (abbr. IIM) is an algorithm that takes as input larger and larger initial segments of a text and outputs, after each input, a hypothesis from a prespecified hypothesis space H = (h_j)_{j∈N}. The indices j are regarded as suitable finite encodings of the concepts described by the hypotheses. A hypothesis j is said to describe a concept c iff c = h_j.

Definition 1. Let C be any concept class, and let H = (h_j)_{j∈N} be a hypothesis space for it. C is called learnable in the limit from text iff there is an IIM M such that for every c ∈ C and every text t for c,
(1) for all n ∈ N+, M(t_n) is defined, and
(2) there is a j such that c = h_j and, for all but finitely many n ∈ N+, M(t_n) = j.

By LimTxt we denote the collection of all concept classes C that are learnable in the limit from text¹. Note that instead of LimTxt sometimes TxtEx is used. Note that Definition 1 does not contain any requirement concerning efficiency. Before we deal with efficiency, we want to point to another crucial parameter of our learning model, namely the hypothesis space H. 
Since our goal is algorithmic learning, we can consider the special case that X = N and let C be any subset of the collection of all recursively enumerable sets over N. Let W_i = domain(ϕ_i). In this case, (W_j)_{j∈N} is the most general hypothesis space. Within this setting many learning problems can be described. Moreover, this setting has been used to study the general capabilities of different learning models which can be obtained by suitable modifications of Definition 1. There are numerous papers performing studies along this line of research (cf., e.g., [16,30] and the references therein). On the one hand, the results obtained considerably broaden our general understanding of algorithmic learning. On the other hand, one also has to ask what kind of consequences one may derive from these results for practical learning problems. This is a non-trivial question, since the setting of learning recursively enumerable languages is very rich. Thus, it is conceivable
¹ If learning from informant is considered, we use LimInf to denote the collection of all concept classes C that are learnable in the limit from informant.
that several of the phenomena observed hold in this setting because too many sets are recursively enumerable and have no counterparts within the world of efficient computability. As a first step towards addressing this question we mainly consider the scenario that indexable concept classes with uniformly decidable membership have to be learned (cf. Angluin [2]). A class of non-empty concepts C is said to be an indexable class with uniformly decidable membership provided there are an effective enumeration c_0, c_1, c_2, ... of all and only the concepts in C and a recursive function f such that for all j ∈ N and all elements x ∈ X we have f(j, x) = 1 if x ∈ c_j, and f(j, x) = 0 otherwise. In the following we refer to indexable classes with uniformly decidable membership as indexable classes for short. Furthermore, we call any enumeration (c_j)_{j∈N} of C with uniformly decidable membership problem an indexed family. Since the paper of Angluin [2], learning of indexable concept classes has attracted much attention (cf., e.g., Zeugmann and Lange [51]). Let us briefly provide some well-known indexable classes. Let Σ be any finite alphabet of symbols, and let X be the free monoid over Σ, i.e., X = Σ*. We set Σ+ = Σ* \ {λ}, where λ denotes the empty string. As usual, we refer to subsets L ⊆ X as languages. Then the sets of all regular languages, of all context-free languages, and of all context-sensitive languages are indexable classes. Next, let X_n = {0, 1}^n be the set of all n-bit Boolean vectors. We consider X = ⋃_{n≥1} X_n as the learning domain. Then the sets of all concepts expressible as a monomial, a k-CNF, a k-DNF, and a k-decision list form indexable classes.
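For indexed families, a Gold-style identification-by-enumeration learner can be sketched as follows: after each example, conjecture the least index whose concept contains all data seen so far. This is a toy illustration only; the family used below, c_j = {0, ..., j}, is our own choice, and the strategy does not succeed for every indexable class when learning from text.

```python
def enumeration_learner(family, text_prefix):
    # Identification by enumeration: after each example, guess the
    # least index j whose concept contains every example seen so far.
    # family: a finite cut-off of an indexed family, given as
    # membership predicates, so consistency checks are decidable.
    seen, guesses = set(), []
    for x in text_prefix:
        seen.add(x)
        guesses.append(next(j for j, mem in enumerate(family)
                            if all(mem(e) for e in seen)))
    return guesses
```

For the family c_j = {0, ..., j}, the sequence of guesses stabilizes on the correct index once the maximal element of the target has appeared in the text, which is exactly the convergence behavior Definition 1 asks for.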
When learning indexable classes C, it is generally assumed that the hypothesis space H has to be an indexed family, too. We distinguish class preserving learning and class comprising learning, defined by C = range(H) and C ⊆ range(H), respectively. When dealing with class preserving learning, one has the freedom to choose as hypothesis space a possibly different enumeration of the target class C. In contrast, when class comprising learning is concerned, the hypothesis space may additionally enumerate languages not belonging to C. Note that, in general, one has to allow class comprising hypothesis spaces to obtain the maximum possible learning power (cf. Lange and Zeugmann [20,22]). Finally, we call a hypothesis space redundant if it is larger than necessary, i.e., there is at least one hypothesis in H not describing any concept from the target class or one concept possesses at least two different descriptions in H. Thus, non-redundant hypothesis spaces are as small as possible. Formally, a hypothesis space H = (hj)j∈N is non-redundant for some target concept class C iff range(H) = C and hi ≠ hj for all i, j ∈ N with i ≠ j. Otherwise, H is a redundant hypothesis space for C. Next, let us come back to the issue of efficiency. Looking at Definition 1 we see that an IIM M always has access to the whole history of the learning process, i.e., in order to compute its actual guess M is fed all examples seen so
22
T. Zeugmann
far. In contrast to that, next we define iterative IIMs. An iterative IIM is only allowed to use its last guess and the next element in the positive presentation of the target concept for computing its actual guess. Conceptually, an iterative IIM M defines a sequence (Mn)n∈N of machines each of which takes as its input the output of its predecessor.

Definition 2 (Wiehagen [47]). Let C be a concept class, let c be a concept, let H = (hj)j∈N be a hypothesis space, and let a ∈ N ∪ {∗}. An IIM M ItLimTxtH-infers c iff for every t = (xj)j∈N ∈ text(c) the following conditions are satisfied: (1) for all n ∈ N, Mn(t) is defined, where M0(t) =df M(x0) and for all n ≥ 0: Mn+1(t) =df M(Mn(t), xn+1), (2) the sequence (Mn(t))n∈N converges to a number j such that c = hj. Finally, M ItLimTxtH-infers C iff, for each c ∈ C, M ItLimTxtH-infers c.

In the latter definition Mn(t) denotes the (n+1)th hypothesis output by M when successively fed the text t. Thus, it is justified to make the following convention. Let σ = x0, . . . , xn be any finite sequence of elements over the relevant learning domain. Moreover, let C be any concept class over X, and let M be any IIM that iteratively learns C. Then we denote by My(σ) the (y+1)th hypothesis output by M when successively fed σ provided y ≤ n, and there exists a concept c ∈ C with σ+ ⊆ c. Furthermore, we let M∗(σ) denote M|σ|−1(σ). Moreover, whenever learning a concept class from text, a major problem one has to deal with is avoiding or detecting overgeneralization. An overgeneralization occurs if the learner is guessing a superconcept of the target concept. Clearly, such an overgeneralized guess cannot be detected by using the incoming positive data only. Therefore, one may be tempted to disallow overgeneralized guesses at all. Learners behaving thus are called conservative.
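The update scheme of Definition 2 is a left fold over the text: each machine sees only its predecessor's guess and one new example. A minimal sketch (the learner passed in is a hypothetical placeholder, not an actual IIM from the survey):

```python
# Sketch of the iterative update of Definition 2.  The learner M sees only its
# previous guess and the next text element, never the full history.

def iterate(M, init, text):
    """Return the sequence M_0(t), M_1(t), ... on a finite text prefix."""
    hyps = [init(text[0])]           # M_0(t) = M(x_0)
    for x in text[1:]:
        hyps.append(M(hyps[-1], x))  # M_{n+1}(t) = M(M_n(t), x_{n+1})
    return hyps

# Toy example: "learn" the maximum element seen so far.
hyps = iterate(M=lambda h, x: max(h, x), init=lambda x: x, text=[2, 5, 3, 5])
print(hyps)  # → [2, 5, 5, 5]
```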
Intuitively speaking, a conservative IIM maintains its actual hypothesis at least as long as it has not seen data contradicting it. More formally, an IIM M is said to be conservative iff for all concepts c in the target class C and all texts t for c the following condition is fulfilled: if M(ty) ≠ M(ty+z), then t+y+z ⊈ hM(ty). Another property of learners quite often found in the literature is consistency. Informally, a learner is called consistent if all its intermediate hypotheses correctly reflect the data seen so far. More formally, an IIM M is said to be consistent iff t+x ⊆ hM(tx) for all x ∈ N and every text t for every concept c in the target class C. Whenever one talks about the efficiency of learning, besides the storage needed by the learner one also has to consider the time complexity of the learner. When talking about the time complexity of learning, it does not suffice to consider the time needed to compute the actual guess. What really counts in applications is the overall time needed until successful learning. Therefore, following Daley and Smith [10] we define the total learning time as follows.
Let C be any concept class, and let M be any IIM that learns C in the limit. Then, for every c ∈ C and every text t for c, let

Conv(M, t) =df the least number m ∈ N+ such that for all n ≥ m, M(tn) = M(tm)

denote the stage of convergence of M on t (cf. [12]). Note that Conv(M, t) = ∞ if M does not learn the target concept from its text t. Moreover, by TM(tn) we denote the time to compute M(tn). We measure this time as a function of the length of the input and call it the update time. Finally, the total learning time taken by the IIM M on successive input t is defined as

TT(M, t) =df Σ_{n=1}^{Conv(M,t)} TM(tn).
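The stage of convergence can be illustrated on a finite prefix of a hypothesis sequence. Note the hedge built into the sketch: Conv(M, t) is defined in the limit and cannot, in general, be detected effectively; here we merely inspect a finite run in hindsight.

```python
# Sketch: the stage of convergence on a *finite* hypothesis sequence.  In
# general convergence is not effectively detectable; this helper only inspects
# a given finite prefix in hindsight.  The total learning time TT(M, t) would
# then be the sum of the update times T_M(t_n) for n up to this stage.

def conv_on_prefix(hyps):
    """Least m with hyps[n] == hyps[m] for all n >= m (0-based indices)."""
    m = len(hyps) - 1
    while m > 0 and hyps[m - 1] == hyps[-1]:
        m -= 1
    return m

hyps = ["h0", "h1", "h1", "h1"]
print(conv_on_prefix(hyps))  # → 1: the guesses stabilize from index 1 on
```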
Clearly, if M does not learn the target concept from text t then the total learning time is infinite. Two more remarks are in order here. First, it has been argued elsewhere that within the learning in the limit paradigm a learning algorithm is invoked only when the current hypothesis has some problem with the latest observed data. However, such a viewpoint implicitly assumes that membership in the target concept is decidable in time polynomial in the length of the actual input. This may not be the case. Thus, directly testing consistency would immediately lead to a non-polynomial update time provided membership is not known to be in P. Second, Pitt [31] addresses the question with respect to what parameter one should measure the total learning time. In the definition given above this parameter is the length of all examples seen so far. Clearly, now one could try to play with this parameter by waiting for a large enough input before declaring success. However, when dealing with the learnability of non-trivial concept classes, in the worst case the total learning time will be unbounded anyway. Thus, it does not make much sense to deal with the worst case. Instead, we shall study the expected total learning time. In such a setting one cannot simply wait for long enough inputs. Therefore, using the definition of total learning time given above seems to be reasonable. Next, we define important concept classes which we are going to consider throughout this survey.

2.2 The Pattern Languages
Following Angluin [1] we define patterns and pattern languages as follows. Let A = {0, 1, . . .} be any non-empty finite alphabet containing at least two elements. By A∗ we denote the free monoid over A . The set of all finite non-null strings of symbols from A is denoted by A+ , i.e., A+ = A∗ \ {λ} , where
λ denotes the empty string. Let X = {xi | i ∈ N} be an infinite set of variables such that A ∩ X = ∅. Patterns are non-empty strings over A ∪ X, e.g., 01, 0x0111, 1x0x00x1x2x0 are patterns. The length of a string s ∈ A∗ and of a pattern π is denoted by |s| and |π|, respectively. A pattern π is in canonical form provided that if k is the number of different variables in π then the variables occurring in π are precisely x0, . . . , xk−1. Moreover, for every j with 0 ≤ j < k − 1, the leftmost occurrence of xj in π is to the left of the leftmost occurrence of xj+1. The examples given above are patterns in canonical form. In the sequel we assume, without loss of generality, that all patterns are in canonical form. By Pat we denote the set of all patterns in canonical form. If k is the number of different variables in π then we refer to π as a k-variable pattern. By Patk we denote the set of all k-variable patterns. Furthermore, let π ∈ Patk, and let u0, . . . , uk−1 ∈ A+; then we denote by π[x0/u0, . . . , xk−1/uk−1] the string w ∈ A+ obtained by substituting uj for each occurrence of xj, j = 0, . . . , k − 1, in the pattern π. For example, let π = 0x01x1x0. Then π[x0/10, x1/01] = 01010110. The tuple (u0, . . . , uk−1) is called a substitution. Furthermore, if |u0| = · · · = |uk−1| = 1, then we refer to (u0, . . . , uk−1) as a shortest substitution. Let π ∈ Patk; we define the language generated by pattern π by L(π) = {π[x0/u0, . . . , xk−1/uk−1] | u0, . . . , uk−1 ∈ A+}. By PATk we denote the set of all k-variable pattern languages. Finally, PAT = ⋃k∈N PATk denotes the set of all pattern languages over A. Furthermore, we let Q range over finite sets of patterns and define L(Q) = ⋃π∈Q L(π), i.e., the union of all pattern languages generated by patterns from Q.
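The definitions above translate directly into code. The following sketch (names and representation are our own; patterns are lists with variables written "x0", "x1", ...) applies a substitution and decides membership in L(π) by backtracking. Since membership for arbitrary patterns is NP-complete, the brute force is only meant for tiny inputs.

```python
# Sketch: patterns as lists of symbols; variables are the strings "x0","x1",...
# substitute() applies a substitution (u_0, ..., u_{k-1}) with all u_j non-null;
# matches() decides membership in L(pi) by backtracking over substitutions.

def substitute(pi, subst):
    return "".join(subst[int(sym[1:])] if sym.startswith("x") else sym
                   for sym in pi)

def matches(pi, s, binding=None):
    binding = {} if binding is None else binding
    if not pi:
        return s == ""
    head, rest = pi[0], pi[1:]
    if not head.startswith("x"):                 # constant symbol from A
        return s.startswith(head) and matches(rest, s[len(head):], binding)
    if head in binding:                          # variable already bound
        u = binding[head]
        return s.startswith(u) and matches(rest, s[len(u):], binding)
    for i in range(1, len(s) + 1):               # try all non-null values
        if matches(rest, s[i:], {**binding, head: s[:i]}):
            return True
    return False

pi = ["0", "x0", "1", "x1", "x0"]                 # the pattern 0x01x1x0
print(substitute(pi, ("10", "01")))               # → "01010110" (the example above)
print(matches(pi, "01010110"), matches(pi, "11"))  # → True False
```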
Moreover, we use Pat(k) and PAT(k) to denote the family of all unions of at most k canonical patterns and the family of all unions of at most k pattern languages, respectively. That is, Pat(k) = {Q | Q ⊆ Pat, |Q| ≤ k} and PAT(k) = {L | (∃Q ∈ Pat(k))[L = L(Q)]}. Finally, let L ⊆ A+ be a language, and let k ∈ N+; we define Club(L, k) = {Q | |Q| ≤ k, L ⊆ L(Q), (∀Q′)[Q′ ⊂ Q ⇒ L ⊈ L(Q′)]}. Club stands for consistent least upper bounds. The pattern languages have been intensively investigated (cf., e.g., Salomaa [37,38], and Shinohara and Arikawa [43] for an overview). Nix [29] as well as Shinohara and Arikawa [43] outlined interesting applications of pattern inference algorithms. For example, pattern language learning algorithms have been successfully applied for solving problems in molecular biology (cf., e.g., Shimozono et al. [39], Shinohara and Arikawa [43]). As it turned out, pattern languages and finite unions of pattern languages are subclasses of Smullyan's [45] elementary formal systems (abbr. EFS). Arikawa et al. [5] have shown that EFS can also be treated as a logic programming language over strings. Recently, the techniques for learning finite unions of pattern languages have been extended to show the learnability of various subclasses of EFS (cf. Shinohara [42]). The investigations of the learnability of subclasses of EFS are interesting because they yield corresponding results about the learnability of subclasses of logic programs. Hence, these results are also of relevance for Inductive Logic Programming (ILP) [28,23,8,24]. Miyano et al. [26] intensively studied the polynomial-time learnability of EFS.
Therefore, we may consider the learnability of pattern languages and of unions thereof as a nice test bed for seeing what kind of results one may obtain by considering the corresponding learning problems within the setting of learning in the limit.
3 Results
Within this section we ask whether or not the pattern languages and finite unions thereof can be learned efficiently. The principal learnability of the pattern languages from text with respect to the hypothesis space Pat has been established by Angluin [1]. However, her algorithm is based on computing descriptive patterns for the data seen so far. Here a pattern π is said to be descriptive (for the set S of strings contained in the input provided so far) if π can generate all strings contained in S and no other pattern with this property generates a proper subset of the language generated by π. Since no efficient algorithm is known for computing descriptive patterns, and finding a descriptive pattern of maximum length is NP-hard, its update time is practically intractable. There are also serious difficulties when trying to learn the pattern languages within the PAC model introduced by Valiant [46]. In the original model, the sample complexity depends exclusively on the VC dimension of the target concept class and the error and confidence parameters ε and δ, respectively. Recently, Mitchell et al. [25] have shown that even the class of all one-variable pattern languages has infinite VC dimension. Consequently, even this special subclass of PAT is not uniformly PAC learnable. Moreover, Schapire [40] has shown that pattern languages are not PAC learnable in the generalized model, provided P/poly ≠ NP/poly, with respect to every hypothesis space for PAT that is uniformly polynomially evaluable. Though this result highlights the difficulty of PAC learning PAT, it has no clear application to the setting considered in this paper, since we aim to learn PAT with respect to the hypothesis space Pat. Since the membership problem for this hypothesis space is NP-complete, it is not polynomially evaluable (cf. [1]). In contrast, Kearns and Pitt [18] have established a PAC learning algorithm for the class of all k-variable pattern languages.
Positive examples are generated with respect to arbitrary product distributions while negative examples are allowed to be generated with respect to any distribution. In their algorithm the length of substitution strings is required to be polynomially related to the length of the target pattern. Finally, they use as hypothesis space all unions of polynomially many patterns that have k or fewer variables². The overall learning time of their PAC learning algorithm is polynomial in the length of the target
² More precisely, the number of allowed unions is at most poly(|π|, s, 1/ε, 1/δ, |A|), where π is the target pattern, s is the bound on the length of substitution strings, ε and δ are the usual error and confidence parameters, respectively, and A is the alphabet of constants over which the patterns are defined.
pattern, the bound for the maximum length of substitution strings, 1/ε, 1/δ, and |A|. The constant in the running time achieved depends doubly exponentially on k, and thus, their algorithm rapidly becomes impractical when k increases. Finally, Lange and Wiehagen [19] have proposed an inconsistent but iterative and conservative algorithm that learns PAT with respect to Pat. We shall study this algorithm below in much more detail. But before doing so, we aim to figure out under which circumstances iterative learning of PAT is possible at all. A first answer is given by the following theorems from Case et al. [9]. Note that Pat is a non-redundant hypothesis space for PAT.

Theorem 1 (Case et al. [9]). Let C be any concept class, and let H = (hj)j∈N be any non-redundant hypothesis space for C. Then, every IIM M that ItLimTxtH-infers C is conservative.

Proof. Suppose the converse, i.e., there are a concept c ∈ C, a text t = (xj)j∈N ∈ text(c), and a y ∈ N such that, for j = M∗(ty) and k = M∗(ty+1) = M(j, xy+1), both j ≠ k and t+y+1 ⊆ hj are satisfied. The latter implies xy+1 ∈ hj, and thus we may consider the following text t̃ ∈ text(hj). Let t̂ = (x̂j)j∈N be any text for hj and let t̃ = x̂0, xy+1, x̂1, xy+1, x̂2, . . . Since M has to learn hj from t̃ there must be a z ∈ N such that M∗(t̃z+r) = j for all r ≥ 0. But M∗(t̃2z+1) = M(j, xy+1) = k, a contradiction.

Next, we point to another peculiarity of PAT, i.e., it meets the superset condition defined as follows. Let C be any indexable class. C meets the superset condition if, for all c, c′ ∈ C, there is some ĉ ∈ C that is a superset of both c and c′.

Theorem 2 (Case et al. [9]). Let C be any indexable class meeting the superset condition, and let H = (hj)j∈N be any non-redundant hypothesis space for C. Then, every consistent IIM M that ItLimTxtH-infers C may be used to decide the inclusion problem for H.

Proof.
Let X be the underlying learning domain, and let (wj)j∈N be an effective enumeration of all elements in X. Then, for every i ∈ N, ti = (xij)j∈N is the following computable text for hi. Let z be the least index such that wz ∈ hi. Recall that, by definition, hi ≠ ∅, since H is an indexed family, and thus wz must exist. Then, for all j ∈ N, we set xij = wj, if wj ∈ hi, and xij = wz, otherwise. We claim that the following algorithm Inc decides, for all i, k ∈ N, whether or not hi ⊆ hk.

Algorithm Inc: "On input i, k ∈ N do the following: Determine the least y ∈ N with i = M∗(tiy). Test whether or not ti,+y ⊆ hk. In case it is, output 'Yes,' and stop. Otherwise, output 'No,' and stop."

Clearly, since H is an indexed family and ti is a computable text, Inc is an algorithm. Moreover, M learns hi on every text for it, and H is a non-redundant hypothesis space. Hence, M has to converge on text ti to i, and therefore Inc has to terminate.
It remains to verify the correctness of Inc. Let i, k ∈ N. Clearly, if Inc outputs 'No,' a string s ∈ hi \ hk has been found, and hi ⊈ hk follows. Next, consider the case that Inc outputs 'Yes.' Suppose to the contrary that hi ⊈ hk. Then, there is some s ∈ hi \ hk. Now, consider M when fed the text t = tiy ⋄ tk. Since ti,+y ⊆ hk, t is a text for hk. Since M learns hk, there is some r ∈ N such that k = M∗(tiy ⋄ tkr). By assumption, there are some ĉ ∈ C with hi ∪ hk ⊆ ĉ, and some text t̂ for ĉ having the initial segment tiy ⋄ s ⋄ tkr. By Theorem 1, M is conservative. Since s ∈ hi and i = M∗(t̂y), we obtain M∗(t̂y+1) = M(i, s) = i. Consequently, M∗(tiy ⋄ s ⋄ tkr) = M∗(tiy ⋄ tkr) = k. Finally, since s ∈ t̂+y+r+2, k = M∗(tiy ⋄ s ⋄ tkr), and s ∉ hk, M fails to consistently learn ĉ from text t̂, a contradiction. This proves the theorem.

Taking into account that the inclusion problem for Pat is undecidable (cf. Jiang et al. [17]) and that PAT meets the superset condition, since L(x0) = A+, by Theorem 2 we immediately arrive at the following corollary.

Corollary 3 (Case et al. [9]). If an IIM M ItLimTxtPat-learns PAT then M is inconsistent.

As a matter of fact, the latter corollary generalizes to all non-redundant hypothesis spaces for PAT. All the ingredients to prove this can be found in Zeugmann et al. [52]. Consequently, if one wishes to learn the pattern languages or unions of pattern languages iteratively, then either redundant hypothesis spaces or inconsistent learners cannot be avoided. As for unions, the first result goes back to Shinohara [41] who proved the class of all unions of at most two pattern languages to be in LimTxtPat(2). Wright [49] extended this result to PAT(k) ∈ LimTxtPat(k) for all k ≥ 1. Moreover, Theorem 4.2 in Shinohara and Arimura [44] together with a lemma from Blum and Blum [6] shows that ⋃k∈N PAT(k) is not LimTxtH-inferable for any hypothesis space H.
The iterative learnability of PAT(k) has been established by Case et al. [9]. The learner is also consistent. Thus, the hypothesis space used had to be designed to be redundant. We only sketch the proof here.

Theorem 4. (1) Club(L, k) is finite for all L ⊆ A+ and all k ∈ N+. (2) If L ∈ PAT(k), then Club(L, k) is non-empty and contains a set Q such that L(Q) = L.

Proof. Part (2) is obvious. Part (1) is easy for finite L. For infinite L, it follows from the lemma below.

Lemma 1. Let k ∈ N+, let L ⊆ A+ be any language, and suppose t = (sj)j∈N ∈ text(L). Then,
(1) Club(t+0, k) can be obtained effectively from s0, and Club(t+n+1, k) is effectively obtainable from Club(t+n, k) and sn+1 (* note the iterative nature *).
(2) The sequence Club(t+0, k), Club(t+1, k), . . . converges to Club(L, k).
Putting it all together, one directly gets the following theorem.

Theorem 5. For all k ≥ 1, PAT(k) ∈ ItLimTxt.

Proof. Let can(·) be some computable bijection from finite classes of finite sets of patterns onto N. Let pad be a 1–1 padding function such that, for all x, y ∈ N, Wpad(x,y) = Wx. For a finite class S of sets of patterns, let g(S) denote a grammar, obtained effectively from S, for ⋃Q∈S L(Q). Let L ∈ PAT(k), and let t = (sj)j∈N ∈ text(L). The desired IIM M is defined as follows. We set M0(t) = M(s0) = pad(g(Club(t+0, k)), can(Club(t+0, k))), and for all n > 0, let

Mn+1(t) = M(Mn(t), sn+1) = pad(g(Club(t+n+1, k)), can(Club(t+n+1, k))).

Using Lemma 1 it is easy to verify that Mn+1(t) = M(Mn(t), sn+1) can be obtained effectively from Mn(t) and sn+1. Therefore, M ItLimTxt-identifies PAT(k).

So far, the general theory provided substantial insight into the iterative learnability of the pattern languages. But still, we do not know anything about the number of examples needed until successful learning and the total amount of time to process them. Therefore, we address this problem in the following subsection.

3.1 Stochastic Finite Learning
As we have already mentioned, it does not make much sense to study the worst-case behavior of learning algorithms with respect to their total learning time. The reason for this phenomenon should be clear, since an arbitrary text may provide the information needed for learning very late. Therefore, in the following we always assume a class D of admissible probability distributions over the relevant learning domain. Ideally, this class should be parameterized. Then, the data fed to the learner are generated randomly with respect to one of the probability distributions from the class D of underlying probability distributions. Furthermore, we introduce a random variable CONV for the stage of convergence. Note that CONV can also be interpreted as the total number of examples read by the IIM M until convergence. The first major step to be performed now consists in determining the expectation E[CONV]. Clearly, E[CONV] should be finite for all concepts c ∈ C and all distributions D ∈ D. Second, one has to deal with tail bounds for CONV. The easiest way to perform this step is to use Markov's inequality, i.e., we always know that

Pr(CONV ≥ t · E[CONV]) ≤ 1/t for all t ∈ N+.
However, quite often one can obtain much better tail bounds. If the underlying learner is known to be conservative and rearrangement-independent we always
get exponentially shrinking tail bounds. A learner is said to be rearrangement-independent if its output depends exclusively on the range and length of its input (cf. [21] and the references therein). These tail bounds are established by the following theorem.

Theorem 6 (Rossmanith and Zeugmann [36]). Let CONV be the sample complexity of a conservative and rearrangement-independent learning algorithm. Then Pr(CONV ≥ 2t · E[CONV]) ≤ 2^−t for all t ∈ N.

Theorem 6 puts the importance of rearrangement-independent and conservative learners into the right perspective. As long as the learnability of indexed families is concerned, these results have a wide range of potential applications, since every conservative learner can be transformed into a learner that is both conservative and rearrangement-independent provided the hypothesis space is appropriately chosen (cf. Lange and Zeugmann [21]). Furthermore, since the distribution of CONV decreases geometrically for all conservative and rearrangement-independent learning algorithms, all higher moments of CONV exist in this case, too. Thus, instead of applying Theorem 6 directly, one can hope for further improvements by applying even sharper tail bounds using for example Chebyshev's inequality. Additionally, the learner takes a confidence parameter δ as input. But in contrast to learning in the limit, the learner itself decides how many examples it wants to read. Then it computes a hypothesis, outputs it, and stops. The hypothesis output is correct for the target with probability at least 1 − δ. The explanation given so far describes how the learner works, but not why it works. Intuitively, the stochastic finite learner simulates the limit learner until an upper bound for twice the expected total number of examples needed until convergence has been met. Assuming this bound to be correct, by Markov's inequality the limit learner has now converged with probability 1/2. All that is left is to decrease the probability of failure.
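To get a feeling for the gap between the two tail bounds, the following sketch (our own illustration, not from the survey) computes how many multiples of E[CONV] must be drawn to push the failure probability below δ under Markov's inequality versus the exponential bound of Theorem 6.

```python
# Sketch: sample-size multipliers (in units of E[CONV]) needed to reach failure
# probability <= delta, under Markov's inequality versus the exponential tail
# bound of Theorem 6 (conservative + rearrangement-independent learners).
import math

def markov_multiplier(delta):
    return math.ceil(1 / delta)                  # Pr(CONV >= t*E) <= 1/t

def exponential_multiplier(delta):
    return 2 * math.ceil(math.log2(1 / delta))   # Pr(CONV >= 2t*E) <= 2^-t

for delta in (0.1, 0.01, 0.001):
    print(delta, markov_multiplier(delta), exponential_multiplier(delta))
# e.g. for delta = 0.001: 1000 versus 20 multiples of E[CONV]
```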
This is done by using the tail bounds for CONV. Applying Theorem 6, one easily sees that increasing the sample complexity by a factor of O(log(1/δ)) results in a probability of 1 − δ for having reached the stage of convergence. If Theorem 6 is not applicable, one can still use Markov's inequality, but then the sample complexity needed will increase by a factor of 1/δ. It remains to explain how the stochastic finite learner can calculate the upper bound for E[CONV]. This is precisely the point where we need the parameterization of the class D of underlying probability distributions. Since, in general, it is not known which distribution from D has been chosen, one has to assume a bit of prior knowledge or domain knowledge provided by suitable upper and/or lower bounds for the parameters involved. A more serious difficulty is to incorporate the unknown target concept into this estimate. This step depends on the concrete learning problem at hand, and requires some extra effort. We shall exemplify it below. Now we are ready to formally define stochastic finite learning.
Definition 3 ([33,34,36]). Let D be a set of probability distributions on the learning domain, C a concept class, H a hypothesis space for C, and δ ∈ (0, 1). (C, D) is said to be stochastically finitely learnable with δ-confidence with respect to H iff there is an IIM M that for every c ∈ C and every D ∈ D performs as follows. Given any random data sequence θ for c generated according to D, M stops after having seen a finite number of examples and outputs a single hypothesis h ∈ H. With probability at least 1 − δ (with respect to distribution D), h has to be correct, that is, c = h. If stochastic finite learning can be achieved with δ-confidence for every δ > 0 then we say that (C, D) can be learned stochastically finitely with high confidence.

Note that there are subtle differences between our model and PAC learning. By its definition, stochastic finite learning is not completely distribution independent. A bit of additional knowledge concerning the underlying probability distributions is required. Thus, from that perspective, stochastic finite learning is weaker than the PAC model. On the other hand, we do not measure the quality of the hypothesis with respect to the underlying probability distribution. Instead, we require the hypothesis computed to be exactly correct with high probability. Note that exact identification with high confidence has been considered within the PAC paradigm, too (cf., e.g., Goldman et al. [13]). Conversely, we can also easily relax the requirement to learning probably exactly correct, but whenever possible we shall not do so. Furthermore, in the uniform PAC model as introduced by Valiant [46] the sample complexity depends exclusively on the VC dimension of the target concept class and the error and confidence parameters ε and δ, respectively. This model has been generalized by allowing the sample size to depend on the concept complexity, too (cf., e.g., Blumer et al. [7] and Haussler et al. [15]).
Provided no upper bound for the concept complexity of the target concept is given, such PAC learners decide themselves how many examples they wish to read (cf. [15]). This feature is also adopted in our setting of stochastic finite learning. However, all variants of PAC learning we are aware of require that all hypotheses from the relevant hypothesis space are uniformly polynomially evaluable. Though this requirement may be necessary in some cases to achieve (efficient) stochastic finite learning, it is not necessary in general, as we shall see below. Next, let us exemplify our model by looking at the concept class of all pattern languages. The results presented below have been obtained by Zeugmann [50] and Rossmanith and Zeugmann [36]. Our stochastic finite learner uses Lange and Wiehagen's [19] pattern language learner as a main ingredient. We consider here learning from positive data only. Recall that every string of a particular pattern language is generated by at least one substitution. Therefore, it is convenient to consider probability distributions over the set of all possible substitutions. That is, if π ∈ Patk, then it suffices to consider any probability distribution D over A+ × · · · × A+ (k times). For (u0, . . . , uk−1) ∈ A+ × · · · × A+ we denote by D(u0, . . . , uk−1) the probability
that variable x0 is substituted by u0, variable x1 is substituted by u1, . . . , and variable xk−1 is substituted by uk−1. In particular, we mainly consider a special class of distributions, i.e., product distributions. Let k ∈ N+; then the class of all product distributions for Patk is defined as follows. For each variable xj, 0 ≤ j ≤ k − 1, we assume an arbitrary probability distribution Dj over A+ on substitution strings. Then we call D = D0 × · · · × Dk−1 a product distribution over A+ × · · · × A+, i.e., D(u0, . . . , uk−1) = ∏_{j=0}^{k−1} Dj(uj). Moreover, we call a product distribution regular if D0 = · · · = Dk−1. Throughout this paper, we restrict ourselves to regular distributions. We therefore use d to denote the distribution over A+ on substitution strings, i.e., D(u0, . . . , uk−1) = ∏_{j=0}^{k−1} d(uj). We call a regular distribution admissible if d(a) > 0 for at least two different elements a ∈ A. As a special case of an admissible distribution we consider the uniform distribution over A+, i.e., d(u) = 1/(2 · |A|)^ℓ for all strings u ∈ A+ with |u| = ℓ. We will express all estimates with the help of the following parameters: E[Λ], α, and β, where Λ is a random variable for the length of the examples drawn; α and β are defined below. To get concrete bounds for a concrete implementation one has to obtain the constant c from the algorithm and has to compute E[Λ], α, and β from the admissible probability distribution D. Let u0, . . . , uk−1 be independent random variables with distribution d for substitution strings. Whenever the index i of ui does not matter, we simply write u or u′. The two parameters α and β are now defined via d. First, α is simply the probability that u has length 1, i.e.,

α = Pr(|u| = 1) = Σ_{a∈A} d(a).
Second, β is the conditional probability that two random strings that get substituted into π are identical under the condition that both have length 1, i.e.,

β = Pr(u = u′ | |u| = |u′| = 1) = Σ_{a∈A} d(a)² / (Σ_{a∈A} d(a))².
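The two parameters only depend on the restriction of d to length-1 substitution strings, so they are easy to compute. A sketch (our own helper, with the uniform case as a check: there α = 1/2 and β = 1/|A|):

```python
# Sketch: computing alpha and beta from the restriction of an admissible
# regular distribution d to length-1 substitution strings.  For the uniform
# distribution over A+ with d(u) = 1/(2|A|)^|u| this yields
# alpha = 1/2 and beta = 1/|A|.

def alpha_beta(d1):
    """d1: dict mapping each letter a in A to d(a)."""
    alpha = sum(d1.values())                          # Pr(|u| = 1)
    beta = sum(p * p for p in d1.values()) / alpha ** 2
    return alpha, beta

A = "01"
d1 = {a: 1 / (2 * len(A)) for a in A}                 # uniform case, |u| = 1
print(alpha_beta(d1))  # → (0.5, 0.5), i.e. alpha = 1/2 and beta = 1/|A|
```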
Note that we have omitted the assumption of a text to exhaust the target language. Instead, we only demand the data sequence fed to the learner to contain “enough” information to recognize the target pattern. The meaning of “enough” is mainly expressed by the parameter α . The model of computation as well as the representation of patterns we assume is the same as in Angluin [1]. In particular, we assume a random access machine that performs a reasonable menu of operations each in unit time on registers of length O(log n) bits, where n is the input length. Lange and Wiehagen’s [19] algorithm (abbr. LWA) works as follows. Let hn be the hypothesis computed after reading s1 , . . . , sn , i.e., hn = M (s1 , . . . , sn ) .
Then h1 = s1 and for all n > 1 : if |hn−1 | < |sn | hn−1 , if |hn−1 | > |sn | hn = sn , hn−1 ∪ sn , if |hn−1 | = |sn | The algorithm computes the new hypothesis only from the latest example and the old hypothesis. If the latest example is longer than the old hypothesis, the example is ignored, i.e., the hypothesis does not change. If the latest example is shorter than the old hypothesis, the old hypothesis is ignored and the new example becomes the new hypothesis. If, however, |hn−1 | = |sn | the new hypothesis is the union of hn−1 and sn . The union = π ∪ s of a canonical pattern π and a string s of the same length is defined as π(i), if π(i) = s(i) xj , if π(i) = s(i) & ∃k < i : [(k) = xj , s(k) = s(i), (i) = π(k) = π(i)] xm , otherwise, where m = #var((1) . . . (i − 1)) where (0) = λ for notational convenience. Note that the resulting pattern is again canonical. If the target pattern does not contain any variable then the LWA converges after having read the first example. Hence, this case is trivial and we therefore assume in the following always k ≥ 1 , i.e., the target pattern has to contain at least one variable. Our next theorem analyzes the complexity of the union operation. Theorem 7 (Rossmanith and Zeugmann [36]). The union operation can be computed in linear time. Furthermore, the following bound for the stage of convergence for every target pattern from Pat k can be shown. Theorem 8(Rossmanith and Zeugmann [36]). 1 E[CONV ] = O · log1/β (k) for all k ≥ 2 . αk Hence, by total learning time can be estimated by Theorem 7, the expected 1 E[Λ] log1/β (k) for all k ≥ 2 . E[T T ] = O αk For a better understanding of the bound obtained we evaluate it for the uniform distribution and compare it to the minimum number of examples needed for learning a pattern language via the LWA. Theorem 9 (Rossmanith and Zeugmann [36]). E[T T ] = O(2k |π| log|A| (k)) for the uniform distribution and all k ≥ 2 . Theorem 10 (Zeugmann [50]). 
To learn a pattern π ∈ Pat k the LWA needs exactly log|A| (|A| + k − 1) + 1 examples in the best case.
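For concreteness, the LWA update and the union operation can be sketched in Python. This is a simplified illustration, not the implementation from [19] or [36]: patterns are encoded as Python lists whose entries are single terminal characters or variable names 'x1', 'x2', . . . (an assumed encoding), and a fresh variable gets the next free index, so the result stays canonical.

```python
def union(p, s):
    """Union u = p ∪ s of a canonical pattern p and a string s of equal length.

    p: list of symbols (single terminal characters, or variables 'x1', 'x2', ...).
    """
    assert len(p) == len(s)
    u = []
    for i, (pi, si) in enumerate(zip(p, s)):
        if pi == si:                       # positions agree: keep the terminal
            u.append(pi)
            continue
        for k in range(i):                 # earlier position with the same conflict?
            if s[k] == si and p[k] == pi:
                u.append(u[k])             # reuse the variable introduced there
                break
        else:                              # otherwise introduce a fresh variable
            # variables have names of length > 1, terminals are single characters
            m = len({v for v in u if len(v) > 1})
            u.append('x' + str(m + 1))
    return u


def lwa_update(h, s):
    """One step of the Lange-Wiehagen algorithm on hypothesis h and example s."""
    if h is None:                # the first example becomes the first hypothesis
        return list(s)
    if len(s) > len(h):          # longer example: ignore it
        return h
    if len(s) < len(h):          # shorter example: restart from the example
        return list(s)
    return union(h, s)           # equal length: take the union
```

For the target pattern x1 a x1, the two examples "aaa" and "bab" already suffice: `lwa_update(lwa_update(None, "aaa"), "bab")` yields `['x1', 'a', 'x1']`.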
Can Learning in the Limit Be Done Efficiently?
33
The main difference between the two bounds just given is the factor 2k, which precisely reflects the time the LWA has to wait until it has seen the first shortest string from the target pattern language. Moreover, in the best case the LWA is processing shortest examples only. Thus, we introduce MC to denote the number of minimum-length examples read until convergence. Then one can show that

E[MC] ≤ (2 ln(k) + 3)/ln(1/β) + 2.

Note that Theorem 8 is shown by using the bound for E[MC] just given. More precisely, we have E[CONV] = (1/αk) · E[MC]. Now we are ready to transform the LWA into a stochastic finite learner.

Theorem 11 (Rossmanith and Zeugmann [36]). Let α∗, β∗ ∈ (0, 1). Assume D to be a class of admissible probability distributions over A+ such that α ≥ α∗, β ≤ β∗ and E[Λ] is finite for all distributions D ∈ D. Then (PAT, D) is stochastically finitely learnable with high confidence from text.

Proof. Let D ∈ D, and let δ ∈ (0, 1) be arbitrarily fixed. Furthermore, let t = s1, s2, s3, . . . be any randomly generated text with respect to D for the target pattern language. The desired learner M uses the LWA as a subroutine. Additionally, it has a counter for memorizing the number of examples already seen. Now we exploit the fact that the LWA produces a sequence (τn)n∈N+ of hypotheses such that |τn| ≥ |τn+1| for all n ∈ N+. The learner runs the LWA until for the first time C many examples have been processed, where

C = (1/α∗^|τ|) · ((2 ln(|τ|) + 3)/ln(1/β∗) + 2)   (A)

and τ is the actual output made by the LWA. Finally, in order to achieve the desired confidence, the learner sets γ = log(1/δ) and runs the LWA for a total of 2 · γ · C examples. This is the reason we need the counter for the number of examples processed. Now it outputs the last hypothesis τ produced by the LWA, and stops thereafter. Clearly, the learner described above is finite. Let L be the target language and let π ∈ Pat k be the unique pattern such that L = L(π).
It remains to argue that L(π) = L(τ) with probability at least 1 − δ. First, the bound in (A) is an upper bound for the expected number of examples needed for convergence of the LWA, as established in Theorem 8 (via the reformulation using E[MC] given above). On the one hand, this follows from our assumptions about the allowed α and β as well as from the fact that |τ| ≥ |π| for every hypothesis output. On the other hand, the learner does not know k, but the estimate #var(π) ≤ |π| is sufficient. Note that in (A) we have to use the bound for E[MC] given above, since the target pattern may contain zero or only one distinct variable.
Therefore, after having processed C many examples, the LWA has already converged on average. The desired confidence is then an immediate consequence of Corollary 6. The latter theorem allows for a nice corollary, which we state next. Making the same assumption as Kearns and Pitt [18], i.e., assuming the additional prior knowledge that the target pattern belongs to Pat k, the complexity of the stochastic finite learner given above can be considerably improved. The resulting learning time is linear in the expected string length, and the constant depending on k grows only exponentially in k, in contrast to the doubly exponentially growing constant in Kearns and Pitt’s [18] algorithm. Moreover, in contrast to their learner, our algorithm learns from positive data only, and outputs a hypothesis that is correct for the target language with high probability. Again, for the sake of presentation we shall assume k ≥ 2. Moreover, if the prior knowledge k = 1 is available, then there is also a much better stochastic finite learner for PAT 1 (cf. [34]).

Corollary 12. Let α∗, β∗ ∈ (0, 1). Assume D to be a class of admissible probability distributions over A+ such that α ≥ α∗, β ≤ β∗ and E[Λ] is finite for all distributions D ∈ D. Furthermore, let k ≥ 2 be arbitrarily fixed. Then there exists a learner M such that

(1) M learns (PAT k, D) stochastically finitely with high confidence from text, and
(2) the running time of M is O(α̂∗^k · E[Λ] · log1/β∗(k) · log2(1/δ)). (Note that α̂∗^k and log1/β∗(k) are now constants.)
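Numerically, the sample bound (A) and the confidence amplification used in the proof of Theorem 11 can be sketched as follows. This is a rough sketch under assumptions: the function names, the rounding via math.ceil, and taking γ = ⌈log₂(1/δ)⌉ are illustrative choices, not taken from [36].

```python
import math

def sample_bound(tau_len, alpha_star, beta_star):
    """Bound (A): examples to process before trusting the current hypothesis tau,
    with the unknown k estimated via #var(pi) <= |pi| <= |tau|."""
    return math.ceil((1.0 / alpha_star ** tau_len)
                     * ((2.0 * math.log(tau_len) + 3.0)
                        / math.log(1.0 / beta_star) + 2.0))

def total_examples(tau_len, alpha_star, beta_star, delta):
    """Run the LWA for 2 * gamma * C examples to push the error below delta."""
    gamma = math.ceil(math.log2(1.0 / delta))
    return 2 * gamma * sample_bound(tau_len, alpha_star, beta_star)
```

The exponential dependence on |τ| is exactly the factor that Corollary 12 turns into a constant once k is known a priori.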
4 Conclusions
The present paper surveyed results recently obtained concerning the iterative learnability of the class of all pattern languages and finite unions thereof. In particular, it could be shown that there are strong dependencies between iterative learning, the class of admissible hypothesis spaces, and additional requirements on the learner such as consistency, conservativeness, and the decidability of the inclusion problem for the hypothesis space chosen. Looking at these results, we have seen that the LWA is in some sense optimal. Moreover, by analyzing the average-case behavior of Lange and Wiehagen’s pattern language learning algorithm with respect to its total learning time and by establishing exponentially shrinking tail bounds for a rather rich class of limit learners, we have been able to transform the LWA into a stochastic finite learner. The price paid is the incorporation of a bit of prior knowledge concerning the class of underlying probability distributions. When applied to the class of all k-variable pattern languages, where k is a priori known, the resulting total learning time is linear in the expected string length.
Thus, the present paper provides evidence that analyzing the average-case behavior of limit learners with respect to their total learning time may be considered a promising path towards a new theory of efficient algorithmic learning. Recently obtained results along the same path, as outlined in Erlebach et al. [11] as well as in Reischuk and Zeugmann [32,34], provide further support for the fruitfulness of this approach. In particular, in Reischuk and Zeugmann [32,34] we have shown that one-variable pattern languages are learnable for basically all meaningful distributions within an optimal linear total learning time on the average. Furthermore, this learner can also be modified to maintain the incremental behavior of Lange and Wiehagen’s [19] algorithm. Instead of memorizing the pair (PRE, SUF), it can also store just the two or three examples from which the prefix PRE and the suffix SUF of the target pattern have been computed. While it is no longer iterative, it is still a bounded example memory learner. A bounded example memory learner is essentially an iterative learner that is additionally allowed to memorize an a priori bounded number of examples (cf. [9] for a formal definition). While the one-variable pattern language learner from [34] is highly practical, our stochastic finite learner for the class of all pattern languages is still not good enough for practical purposes. But the results surveyed point to possible directions for potential improvements. However, much more effort seems necessary to design a stochastic finite learner for PAT(k). Additionally, we have applied our techniques to design a stochastic finite learner for the class of all concepts describable by a monomial, which is based on Haussler’s [14] Wholist algorithm. Here we have assumed the examples to be binomially distributed.
The sample size of our stochastic finite learner is mainly bounded by log(1/δ) log n , where δ is again the confidence parameter and n is the dimension of the underlying Boolean learning domain. Thus, the bound obtained is exponentially better than the bound provided within the PAC model. Our approach also differs from U-learnability introduced by Muggleton [27]. First of all, our learner is fed with positive examples only, while in Muggleton’s [27] model examples labeled with respect to their containment in the target language are provided. Next, we do not make any assumption concerning the distribution of the target patterns. Furthermore, we do not measure the expected total learning time with respect to a given class of distributions over the targets and a given class of distributions for the sampling process, but exclusively in dependence on the length of the target. Finally, we require exact learning and not approximately correct learning.
References 1. D. Angluin, Finding Patterns common to a Set of Strings, Journal of Computer and System Sciences 21, 1980, 46–62. 2. D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45, 1980, 117–135.
3. D. Angluin and C.H. Smith. Inductive inference: Theory and methods. Computing Surveys 15, No. 3, 1983, 237–269. 4. D. Angluin and C.H. Smith. Formal inductive inference. “Encyclopedia of Artificial Intelligence” (St.C. Shapiro, Ed.), Vol. 1, pp. 409–418, Wiley-Interscience Publication, New York. 5. S. Arikawa, T. Shinohara and A. Yamamoto, Learning elementary formal systems, Theoretical Computer Science 95, 97–113, 1992. 6. L. Blum and M. Blum, Toward a mathematical theory of inductive inference, Information and Control 28, 125–155, 1975. 7. A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the Vapnik-Chervonenkis Dimension, Journal of the ACM 36 (1989), 929–965. 8. I. Bratko and S. Muggleton, Applications of inductive logic programming, Communications of the ACM, 1995. 9. J. Case, S. Jain, S. Lange and T. Zeugmann, Incremental Concept Learning for Bounded Data Mining, Information and Computation 152, No. 1, 1999, 74–110. 10. R. Daley and C.H. Smith. On the Complexity of Inductive Inference. Information and Control 69 (1986), 12–40. 11. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger and T. Zeugmann, Learning one-variable pattern languages very efficiently on average, in parallel, and by asking queries, Theoretical Computer Science 261, No. 1–2, 2001, 119–156. 12. E.M. Gold, Language identification in the limit, Information and Control 10 (1967), 447–474. 13. S.A. Goldman, M.J. Kearns and R.E. Schapire, Exact identification of circuits using fixed points of amplification functions. SIAM Journal of Computing 22, 1993, 705–726. 14. D. Haussler, Bias, version spaces and Valiant’s learning framework, “Proc. 8th National Conference on Artificial Intelligence,” pp. 564–569, San Mateo, CA: Morgan Kaufmann, 1987. 15. D. Haussler, M. Kearns, N. Littlestone and M.K. Warmuth, Equivalence of models for polynomial learnability. Information and Computation 95 (1991), 129–161. 16. S. Jain, D. Osherson, J.S. Royer and A. 
Sharma, “Systems That Learn: An Introduction to Learning Theory,” MIT-Press, Boston, Massachusetts, 1999. 17. T. Jiang, A. Salomaa, K. Salomaa and S. Yu, Inclusion is undecidable for pattern languages, in “Proceedings 20th International Colloquium on Automata, Languages and Programming,” (A. Lingas, R. Karlsson, and S. Carlsson, Eds.), Lecture Notes in Computer Science, Vol. 700, pp. 301–312, Springer-Verlag, Berlin, 1993. 18. M. Kearns and L. Pitt, A polynomial-time algorithm for learning k-variable pattern languages from examples, in “Proc. Second Annual ACM Workshop on Computational Learning Theory,” pp. 57–71, San Mateo, CA: Morgan Kaufmann, 1989. 19. S. Lange and R. Wiehagen, Polynomial-time inference of arbitrary pattern languages. New Generation Computing 8 (1991), 361–370. 20. S. Lange and T. Zeugmann, Language learning in dependence on the space of hypotheses, in “Proc. of the 6th Annual ACM Conference on Computational Learning Theory,” (L. Pitt, Ed.), pp. 127–136, ACM Press, New York, 1993. 21. S. Lange and T. Zeugmann, Set-driven and Rearrangement-independent Learning of Recursive Languages, Mathematical Systems Theory 29 (1996), 599–634. 22. S. Lange and T. Zeugmann, Incremental Learning from Positive Data, Journal of Computer and System Sciences 53 (1996), 88–103. 23. N. Lavrač and S. Džeroski, “Inductive Logic Programming: Techniques and Applications,” Ellis Horwood, 1994.
24. T. Mitchell. “Machine Learning,” McGraw Hill, 1997. 25. A. Mitchell, A. Sharma, T. Scheffer and F. Stephan, The VC-dimension of Subclasses of Pattern Languages, in “Proc. 10th International Conference on Algorithmic Learning Theory,” (O. Watanabe and T. Yokomori, Eds.), Lecture Notes in Artificial Intelligence, Vol. 1720, pp. 93–105, Springer-Verlag, Berlin, 1999. 26. S. Miyano, A. Shinohara and T. Shinohara, Polynomial-time learning of elementary formal systems, New Generation Computing, 18:217–242, 2000. 27. S. Muggleton, Bayesian Inductive Logic Programming, in “Proc. 7th Annual ACM Conference on Computational Learning Theory” (M. Warmuth, Ed.), pp. 3–11, ACM Press, New York, 1994. 28. S. Muggleton and L. De Raedt, Inductive logic programming: Theory and methods, Journal of Logic Programming, 19/20:669–679, 1994. 29. R.P. Nix, Editing by examples, Yale University, Dept. Computer Science, Technical Report 280, 1983. 30. D.N. Osherson, M. Stob and S. Weinstein, “Systems that Learn, An Introduction to Learning Theory for Cognitive and Computer Scientists,” MIT-Press, Cambridge, Massachusetts, 1986. 31. L. Pitt, Inductive Inference, DFAs and Computational Complexity, in “Proc. 2nd Int. Workshop on Analogical and Inductive Inference” (K.P. Jantke, Ed.), Lecture Notes in Artificial Intelligence, Vol. 397, pp. 18–44, Springer-Verlag, Berlin, 1989. 32. R. Reischuk and T. Zeugmann, Learning One- Variable Pattern Languages in Linear Average Time, in “Proc. 11th Annual Conference on Computational Learning Theory - COLT’98,” July 24th - 26th, Madison, pp. 198–208, ACM Press 1998. 33. R. Reischuk and T. Zeugmann, A Complete and Tight Average-Case Analysis of Learning Monomials, in “Proc. 16th International Symposium on Theoretical Aspects of Computer Science,” (C. Meinel and S. Tison, Eds.), Lecture Notes in Computer Science, Vol. 1563, pp. 414–423, Springer-Verlag , Berlin 1999. 34. R. Reischuk and T. 
Zeugmann, An Average-Case Optimal One-Variable Pattern Language Learner, Journal of Computer and System Sciences 60, No. 2, 2000, 302–335. 35. H. Rogers, Jr., “Theory of Recursive Functions and Effective Computability,” McGraw-Hill, New York, 1967. 36. P. Rossmanith and T. Zeugmann, Stochastic Finite Learning of the Pattern Languages, Machine Learning 44, No. 1-2, 2001, 67–91. 37. A. Salomaa, Patterns (The Formal Language Theory Column), EATCS Bulletin 54, 46–62, 1994. 38. A. Salomaa, Return to patterns (The Formal Language Theory Column), EATCS Bulletin 55, 144–157, 1994. 39. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara and S. Arikawa, Knowledge acquisition from amino acid sequences by machine learning system BONSAI, Trans. Information Processing Society of Japan 35, 2009–2018, 1994. 40. R.E. Schapire, Pattern languages are not learnable, in “Proceedings of the Third Annual ACM Workshop on Computational Learning Theory,” (M.A. Fulk and J. Case, Eds.), pp. 122–129, San Mateo, CA: Morgan Kaufmann, 1990. 41. T. Shinohara, Inferring unions of two pattern languages, Bulletin of Informatics and Cybernetics 20, 83–88, 1983. 42. T. Shinohara, Inductive inference of monotonic formal systems from positive data, New Generation Computing 8, 371–384, 1991. 43. T. Shinohara and S. Arikawa, Pattern inference, in “Algorithmic Learning for Knowledge-Based Systems,” (K.P. Jantke and S. Lange, Eds.), Lecture Notes in Artificial Intelligence, Vol. 961, pp. 259–291, Springer-Verlag, Berlin, 1995.
44. T. Shinohara and H. Arimura, Inductive inference of unbounded unions of pattern languages from positive data, in “Proceedings 7th International Workshop on Algorithmic Learning Theory,” (S. Arikawa and A.K. Sharma, Eds.), Lecture Notes in Artificial Intelligence, Vol. 1160, pp. 256–271, Springer-Verlag, Berlin, 1996. 45. R. Smullyan, “Theory of Formal Systems,” Annals of Mathematical Studies, No. 47. Princeton, NJ, 1961. 46. L.G. Valiant, A Theory of the Learnable, Communications of the ACM 27 (1984), 1134–1142. 47. R. Wiehagen. Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Journal of Information Processing and Cybernetics (EIK) 12, 1976, 93–99. 48. R. Wiehagen and T. Zeugmann, Ignoring Data may be the only Way to Learn Efficiently, Journal of Experimental and Theoretical Artificial Intelligence 6 (1994), 131–144. 49. K. Wright, Identification of unions of languages drawn from an identifiable class, in “Proceedings of the 2nd Workshop on Computational Learning Theory,” (R. Rivest, D. Haussler, and M. Warmuth, Eds.), pp. 328–333, San Mateo, CA: Morgan Kaufmann, 1989. 50. T. Zeugmann, Lange and Wiehagen’s Pattern Language Learning Algorithm: An Average-case Analysis with respect to its Total Learning Time, Annals of Mathematics and Artificial Intelligence 23, No. 1–2, 1998, 117–145. 51. T. Zeugmann and S. Lange, A guided tour across the boundaries of learning recursive languages, in “Algorithmic Learning for Knowledge-Based Systems,” (K.P. Jantke and S. Lange, Eds.), Lecture Notes in Artificial Intelligence, Vol. 961, pp. 190–258, Springer-Verlag, Berlin, 1995. 52. T. Zeugmann, S. Lange and S. Kapur, Characterizations of monotonic and dual monotonic language learning, Information and Computation 120, 155–173, 1995.
Intrinsic Complexity of Uniform Learning
Sandra Zilles
Universität Kaiserslautern, FB Informatik, Postfach 3049, 67653 Kaiserslautern, Germany, [email protected]
Abstract. Inductive inference is concerned with algorithmic learning of recursive functions. In the model of learning in the limit a learner successful for a class of recursive functions must eventually find a program for any function in the class from a gradually growing sequence of its values. This approach is generalized in uniform learning, where the problem of synthesizing a successful learner for a class of functions from a description of this class is considered. A common reduction-based approach for comparing the complexity of learning problems in inductive inference is intrinsic complexity. In this context, reducibility between two classes is expressed via recursive operators transforming target functions in one direction and sequences of corresponding hypotheses in the other direction. The present paper is the first one concerned with intrinsic complexity of uniform learning. The relevant notions are adapted and illustrated by several examples. Characterizations of complete classes finally allow for various insightful conclusions. The connection to intrinsic complexity of non-uniform learning is revealed within several analogies concerning firstly the role and structure of complete classes and secondly the general interpretation of the notion of intrinsic complexity.
1 Introduction
Inductive inference is concerned with algorithmic learning of recursive functions. In the model of learning in the limit, cf. [7], a learner successful for a class of recursive functions must eventually find a correct program for any function in the class from a gradually growing sequence of its values. The learner is understood as a machine – called inductive inference machine or IIM – reading finite sequences of input-output pairs of a target function, and returning programs as its hypotheses, see also [2]. The underlying programming system is then called a hypothesis space. Studying the potential of such IIMs in general leads to the question whether – given a description of a class of functions – a corresponding successful IIM can be synthesized computationally from this description. This idea is generalized in the notion of uniform learning: we consider a collection C0, C1, . . . of learning problems – which may be seen as a decomposition of a class C = C0 ∪ C1 ∪ . . . – and ask for some kind of meta-IIM tackling the whole collection of learning problems.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 39–53, 2003. © Springer-Verlag Berlin Heidelberg 2003

As an input, such a meta-IIM gets a description of one of the learning
problems Ci (in our context a class Ci of recursive functions) in the collection. The meta-IIM is then supposed to develop a successful IIM for Ci. Besides studies on uniform learning of classes of recursive functions, cf. [12,16], this topic has also been investigated in the context of learning formal languages, see in particular [1,13,14]. Since we consider IIMs as tackling a given problem, namely the problem of identifying all elements in a particular class of recursive functions, the complexity of such IIMs might express how hard a learning problem is. For instance, the class of all constant functions allows for a simple and straightforward identification method; for other classes successful methods might seem more complicated. But this does not involve any rule allowing us to compare two learning problems with respect to their difficulty. So a formal approach for comparing the complexity of learning problems (i. e. of classes of recursive functions) is desirable. Different aspects have been analysed in this context. One approach is, e. g., mind change complexity, measured by the maximal number of hypothesis changes a machine needs to identify a function in the given class, see [3]. But since in general this number of mind changes is unbounded, other notions of complexity might be of interest. Various subjects in theoretical computer science deal with comparing the complexity of decision problems, e. g. regarding decidability as such, see [15], or the possible efficiency of decision algorithms, see [5]. In general, Problem A is at most as hard as Problem B if A is reducible to B under a given reduction. Each such reduction involves a notion of complete (hardest solvable) problems. Besides studies concerning language learning, see [9,10,11], in [4] an approach for reductions in the context of learning recursive functions is introduced. This subject, intrinsic complexity, has been further analysed in [8] with a focus on complete classes.
It has turned out that, for learning in the limit, a class is complete iff it contains a dense r. e. subclass. Here the aspect of high topological complexity (density) contrasts with the aspect of low algorithmic complexity of r. e. sets, which is somewhat striking and has caused discussions on whether this particular approach of intrinsic complexity is adequate. The present paper deals with intrinsic complexity in the context of uniform learning. Assume some new reduction expresses such an idea of intrinsic complexity. If a class C of functions is complete in the initial sense, natural questions are (i) whether C can be decomposed into a uniformly learnable collection C0, C1, . . . , which is not a hardest problem in uniform learning, and (ii) whether there are also inappropriate decompositions of C, i. e. collections of highest complexity in uniform learning. Below a notion of intrinsic complexity for uniform learning is developed and the corresponding complete classes are characterized. The obtained structure of degrees of complexity matches recent results on uniform learning: it has been shown that even decompositions into singleton classes can yield problems too hard for uniform learning in Gold’s model. This suggests that collections representing singleton classes may sometimes form hardest problems in uniform learning. Indeed, the notion developed below expresses this intuition, i. e. collections of singleton sets may constitute complete classes in uniform learning. Still, the characterization of completeness here reveals a weakness of the general idea of intrinsic complexity, namely – as in the non-uniform case – complete classes have a low algorithmic complexity (see Theorem 7). All in all, this shows that intrinsic complexity, as in [4], is on the one hand a useful approach, because it can be adapted to match the intuitively desired results in uniform learning. On the other hand, the doubts in [8] are corroborated.
2 Preliminaries
2.1 Notations
Knowledge of basic notions used in mathematics and computability theory is assumed, cf. [15]. N is the set of natural numbers. The cardinality of a set X is denoted by card X. Partial-recursive functions always operate on natural numbers. If f is a function, f(n) ↑ indicates that f(n) is undefined. Our target objects for learning will always be recursive functions, i. e. total partial-recursive functions. R denotes the set of all recursive functions. If α is a finite tuple of numbers, then |α| denotes its length. Finite tuples are coded, i. e. if f(0), . . . , f(n) are defined, a number f[n] represents the tuple (f(0), . . . , f(n)), called an initial segment of f. f[n] ↑ means that f(x) ↑ for some x ≤ n. For convenience, a function may be written as a sequence of values or as a set of input-output pairs. A sequence σ = x0, x1, x2, . . . converges to x, iff xn = x for all but finitely many n; we write lim(σ) = x. For example let f(n) = 7 for n ≤ 2, f(n) ↑ otherwise; g(n) = 7 for all n. Then f = 73 ↑∞ = {(0, 7), (1, 7), (2, 7)}, g = 7∞ = {(n, 7) | n ∈ N}; lim(g) = 7, and f ⊆ g. For n ∈ N, the notion f =n g means that for all x ≤ n either f(x) ↑ and g(x) ↑ or f(x) = g(x). A set C of functions is dense, iff for any f ∈ C, n ∈ N there is some g ∈ C satisfying f =n g, but f ≠ g. Recursive functions – our target objects for learning – require appropriate representation schemes, to be used as hypothesis spaces. Partial-recursive enumerations serve for that purpose: any (n + 1)-place partial-recursive function ψ enumerates the set Pψ := {ψi | i ∈ N} of n-place partial-recursive functions, where ψi(x) := ψ(i, x) for all x = (x1, . . . , xn). Then ψ is called a numbering. Given f ∈ Pψ, any index i satisfying ψi = f is a ψ-program of f. Following [6], we call a family (di)i∈N of natural numbers limiting r. e., iff there is a recursive numbering d′ such that lim(d′i) = di for all i ∈ N.
2.2 Learning in the Limit and Intrinsic Complexity
Below, let τ be a fixed acceptable numbering, serving as a hypothesis space. The learner is a total computable device called IIM (inductive inference machine) working in steps. The input of an IIM M in step n is an initial segment f [n] of some f ; the output M (f [n]) is interpreted as a τ -program. In learning in the limit, M is successful for f , if the sequence M (f ) := (M (f [n]))n∈N of hypotheses is admissible for f :
Definition 1 [4] Let f, σ ∈ R. σ is admissible for f, iff σ converges and lim(σ) is a τ-program for f.

Now a class of recursive functions is learnable in the limit (Ex-learnable; Ex is short for explanatory), if a single IIM is successful for all functions in the class.

Definition 2 [7,2] A class C ⊆ R is Ex-learnable (C ∈ Ex), iff there is an IIM M such that, for any f ∈ C, the sequence M(f) is admissible for f. M is then called an Ex-learner or an IIM for C.

The class of constant functions and the class Cfsup = {α0∞ | α is an initial segment} of recursive functions of finite support are in Ex, but intuitively, the latter is harder to learn. A reduction-based approach for comparing the learning complexity is proposed in [4], using the notion of recursive operators.

Definition 3 [15,8] Let Θ be a total function operating on functions. Θ is a recursive operator, iff for all functions f, g and all numbers n, y ∈ N:
1. if f ⊆ g, then Θ(f) ⊆ Θ(g);
2. if Θ(f)(n) = y, then Θ(f′)(n) = y for some finite initial segment f′ ⊆ f;
3. if f is finite, then one can effectively (in f) enumerate Θ(f).

Reducing a class C1 of functions to a class C2 of functions requires two operators: the first one maps C1 into C2; the second maps any admissible sequence for a mapped function in C2 to an admissible sequence for the associated original function in C1.

Definition 4 [4] Let C1, C2 ∈ Ex. C1 is Ex-reducible to C2, iff there are recursive operators Θ, Ξ such that all functions f ∈ C1 fulfil the following conditions:
1. Θ(f) belongs to C2;
2. if σ is admissible for Θ(f), then Ξ(σ) is admissible for f.

Note, if C1 is Ex-reducible to C2, then an IIM for C1 can be deduced from any IIM for C2; e. g. by [4], each class in Ex is Ex-reducible to Cfsup. As usual, this reduction yields complete classes, i. e. learnable classes of highest complexity.

Definition 5 [4] A class C ∈ Ex is Ex-complete, iff each class C′ ∈ Ex is Ex-reducible to C.
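As a toy illustration of the operator Θ in Definition 4, consider reducing the class of constant functions to Cfsup: map the constant function c∞ to the finite-support function (c+1)0∞. The sketch below acts on finite initial segments (Python lists), which makes the monotonicity condition of Definition 3 visible; the list encoding and the +1 shift (keeping the image's support nonempty even for c = 0) are assumptions of this illustration, not taken from [4].

```python
def theta(prefix):
    """Operator on initial segments: c, c, ..., c  |->  c+1, 0, ..., 0."""
    if not prefix:
        return []
    return [prefix[0] + 1] + [0] * (len(prefix) - 1)

# Monotonicity (condition 1 of Definition 3): extending the input
# only extends the output; it never revises earlier values.
```

Since Θ(f) determines f(0) = Θ(f)(0) − 1, a back-translating operator Ξ is straightforward in this toy case.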
By the remark above, the class Cfsup is Ex -complete. Note that Cfsup is r. e. and dense – a relevant property for characterizing Ex -complete classes: Theorem 1 [8] A class C ∈ Ex is Ex -complete iff it has an r. e. dense subset. Ex -complete classes have subsets, which are dense, i. e. topologically complex, but r. e., i. e. algorithmically non-complex. The latter is astonishing, since there are dense classes, which are not Ex -complete, cf. [8], so they do not contain r. e. dense subsets. These classes are algorithmically more complex than Cfsup , but belong to a lower degree of intrinsic complexity. R. e. subsets as in Theorem 1 are obtained by mapping r. e. Ex -complete classes – such as Cfsup – to C with the help of an operator Θ. So perhaps this approach of intrinsic complexity just makes a class complete, if it is a suitable ‘target’ for recursive operators. This may be considered as a weakness of the notion of intrinsic complexity.
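The density of Cfsup is easy to witness constructively: given f = α0∞ and n, one can always produce some g ∈ Cfsup with g =n f but g ≠ f. A minimal sketch, with a finite-support function encoded as the Python list of its values on an initial segment (an assumed encoding; every position beyond the list is 0):

```python
def value(alpha, x):
    """Value at x of the finite-support function alpha·0∞."""
    return alpha[x] if x < len(alpha) else 0

def dense_witness(alpha, n):
    """Return beta with beta·0∞ =n alpha·0∞ but beta·0∞ ≠ alpha·0∞:
    copy the values up to n, then change the value at n + 1."""
    return [value(alpha, x) for x in range(n + 1)] + [value(alpha, n + 1) + 1]
```

Any two distinct functions produced this way still have finite support, so the witness stays inside Cfsup.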
2.3 Uniform Learning in the Limit
Uniform learning views the approach of Ex -learning on a meta-level; it is not only concerned with the existence of methods solving specific learning problems, but with the problem to synthesize such methods. So the focus is on families of learning problems (here families of classes of recursive functions). Given a representation or description of a class of recursive functions, the aim is to effectively determine an adequate learner, i. e. to compute a program for a successful IIM learning the class. For a formal definition of uniform learning it is necessary to agree on a scheme for describing classes of recursive functions (i. e. describing learning problems). For that purpose we fix a three-place acceptable numbering ϕ. If d ∈ N, the numbering ϕd is the function resulting from ϕ, if the first input is fixed by d. Then any number d corresponds to a two-place numbering ϕd enumerating the set Pϕd of partial-recursive functions. Now it is conceivable to consider the subset of all total functions in Pϕd as a learning problem which is uniquely determined by the number d. Thus each number d acts as a description of the set Rd , where Rd := {ϕdi | i ∈ N and ϕdi is recursive} = Pϕd ∩ R for any d ∈ N . Rd is called recursive core of the numbering ϕd . So any set D = {d0 , d1 , . . . } can be regarded as a set of descriptions, i. e. a collection of learning problems Rd0 , Rd1 , . . . In this context, D is called a description set. A meta-IIM M is an IIM with two inputs: (i) a description d of a recursive core Rd , and (ii) an initial segment f [n] of some f ∈ R. Then Md is the IIM resulting from M , if the first input is fixed by d. A meta-IIM M can be seen as mapping descriptions d to IIMs Md ; it is a successful uniform learner for a set D, in case Md learns Rd for all d ∈ D; i. e. given any description in D, M develops a suitable learner for the corresponding recursive core. Definition 6 Let D ⊆ N. 
D is uniformly Ex -learnable (D ∈ UEx ), iff there is a meta-IIM M such that, for any d ∈ D, the IIM Md is an Ex -learner for Rd . As a numbering ϕd enumerates a superset of Rd , a meta-IIM might also use ϕd as a hypothesis space for Rd . This involves a new notion of admissible sequences. Definition 7 Let d ∈ N, f ∈ Rd , σ ∈ R. σ is r -admissible for d and f , iff σ converges and lim(σ) is a ϕd -program for f . This approach yields just a special (restricted ) case of uniform Ex -learning, because ϕd -programs can be uniformly translated into τ -programs. Definition 8 Let D ⊆ N. D is uniformly Ex -learnable restrictedly (D ∈ rUEx ), iff there is a meta-IIM M such that, for any d ∈ D and any function f ∈ Rd , the sequence Md (f ) is r -admissible for d and f . By the following result, special sets describing only singleton recursive cores are not uniformly Ex -learnable (restrictedly). For Claim 2 cf. a proof in [16].
44
S. Zilles
Theorem 2 1. [12,16] {d ∈ N | card Rd = 1} ∉ UEx . 2. Fix s ∈ R. Then {d ∈ N | Rd = {s}} ∉ rUEx . It has turned out that even UEx -learnable subsets of these description sets are not in UEx (or rUEx ) if additional demands concerning the sequence of hypotheses are posed; see [17]. This suggests that description sets representing only singletons may form the hardest problems in uniform learning; analogously, description sets representing only a fixed singleton recursive core may form the hardest problems in restricted uniform learning. Hopefully, this intuition can be expressed by a notion of intrinsic complexity of uniform learning.
3 Intrinsic Complexity of Uniform Learning
3.1 Intrinsic Complexity of UEx -Learning
The crucial notion now concerns the reduction between description sets D1 and D2 . As in the non-uniform model, a meta-IIM for D1 should be computable from a meta-IIM for D2 , if D1 is reducible to D2 . We first focus on UEx -learning; the restricted variant will be discussed later on. A first idea for UEx -reducibility might be to demand the existence of operators Θ and Ξ such that, for d1 ∈ D1 and f1 ∈ Rd1 , Θ transforms (d1 , f1 ) into a pair (d2 , f2 ) with d2 ∈ D2 and f2 ∈ Rd2 , where Ξ maps any admissible sequence for f2 to an admissible sequence for f1 . Unfortunately, this does not allow us to reduce every set in UEx to a set describing only singleton recursive cores: suppose Rd = Cfsup . As the set D1 = {d} is uniformly Ex -learnable, it should be reducible to a set D2 representing only singleton recursive cores, say via Θ and Ξ as above. Now for any initial segment α, there are d2 ∈ D2 and f2 ∈ Rd2 such that Θ(d, α0∞ ) = (d2 , f2 ). The usual notion of an operator yields an n > 0 and a subfunction σ ⊆ f2 such that Θ(d, α0n ) = (d2 , σ). As card Rd2 = 1, this implies Θ(d, α0n β0∞ ) = (d2 , f2 ) for all initial segments β. In particular, there are f, f' ∈ Rd such that f ≠ f', but Θ(d, f ) = Θ(d, f') = (d2 , f2 ). By assumption, Ξ maps each admissible sequence for f2 to a sequence admissible for both f and f'. The latter is of course impossible, so this approach does not meet our purpose. The problem above is that the description d2 , once it is output by Θ on input of (d1 , f1 [m]), can never be changed depending on the values of f1 still to be read. Hence, Θ should be allowed to return a sequence of descriptions when fed a pair (d1 , f1 ). As an improved approach, it is conceivable to demand that, for d1 ∈ D1 and f1 ∈ Rd1 , Θ transforms (d1 , f1 ) into a pair (δ2 , f2 ). Here δ2 is a sequence converging to some d2 ∈ D2 with f2 ∈ Rd2 . Moreover, Ξ maps any admissible sequence for f2 to an admissible sequence for f1 . Still this approach bears a problem.
Intuitively, reducibility should be transitive. In general, such a transitivity is achieved by connecting the operators of a
Intrinsic Complexity of Uniform Learning
45
first reduction with the operators of a second reduction. The idea above cannot guarantee that: assume D1 is reducible to D2 via Θ1 and Ξ1 , and D2 is reducible to D3 via Θ2 and Ξ2 . If Θ1 maps (d1 , f1 ) to (δ2 , f2 ), then which description d in the sequence δ2 should form an input (d, f2 ) for Θ2 ? It is in general impossible to detect the limit d2 of the sequence δ2 , and any description d ≠ d2 might change the output of Θ2 . So it is inevitable to let Θ operate on sequences of descriptions and on functions, i. e., Θ maps pairs (δ1 , f1 ), where δ1 is a sequence of descriptions, to pairs (δ2 , f2 ). Definition 9 Let Θ be a total function operating on pairs of functions. Θ is a recursive meta-operator, iff the following properties hold for all functions δ, δ', f, f': 1. if δ ⊆ δ', f ⊆ f', as well as Θ(δ, f ) = (γ, g) and Θ(δ', f') = (γ', g'), then γ ⊆ γ' and g ⊆ g'; 2. if n, y ∈ N, Θ(δ, f ) = (γ, g), and γ(n) = y (or g(n) = y, resp.), then there are initial segments δ0 ⊆ δ and f0 ⊆ f such that (γ0 , g0 ) = Θ(δ0 , f0 ) fulfils γ0 (n) = y (g0 (n) = y, resp.); 3. if δ, f are finite and Θ(δ, f ) = (γ, g), one can effectively (in δ, f ) enumerate γ, g. This finally allows for the following definition of UEx -reducibility. Definition 10 Let D1 , D2 ∈ UEx . Fix a recursive meta-operator Θ and a recursive operator Ξ. D1 is UEx -reducible to D2 via Θ and Ξ, iff for any d1 ∈ D1 , any f1 ∈ Rd1 , and any initial segment δ1 there are functions δ2 and f2 satisfying: 1. Θ(δ1 d1∞ , f1 ) = (δ2 , f2 ), 2. δ2 converges to some description d2 ∈ D2 such that f2 ∈ Rd2 , 3. if σ is admissible for f2 , then Ξ(σ) is admissible for f1 . D1 is UEx -reducible to D2 , iff D1 is UEx -reducible to D2 via some Θ and Ξ. Note that this definition expresses intrinsic complexity in the sense that a meta-IIM for D1 can be computed from a meta-IIM for D2 , if D1 is UEx -reducible to D2 .
Moreover, as has been demanded in advance, the resulting reducibility is transitive: Lemma 3 If D1 , D2 , D3 are description sets such that D1 is UEx -reducible to D2 and D2 is UEx -reducible to D3 , then D1 is UEx -reducible to D3 . The notion of completeness can be adapted from the usual definitions. Definition 11 A description set D ∈ UEx is UEx -complete, iff each description set D' ∈ UEx is UEx -reducible to D. The question is whether this notion of intrinsic complexity expresses the intuitions formulated in advance, e. g., that there are UEx -complete description sets representing only singleton recursive cores. Before answering this question consider an illustrative example.
This example states that there is a single description d of an Ex -complete set such that the description set {d} is UEx -complete. On the one hand, this might be surprising, because a description set consisting of just one index representing an Ex -learnable class might be considered rather simple and thus not complete for uniform learning. But on the other hand, this result is not contrary to the intuition that the hardest problems in non-uniform learning may remain hardest when considered in the context of meta-learning. The reason is that the complexity is still of highest degree, if the corresponding class of recursive functions is not decomposed appropriately. Example 4 Let d ∈ N fulfil Rd = Cfsup . Then the set {d} is UEx -complete. Proof. Obviously, {d} ∈ UEx . To show that each description set in UEx is UEx -reducible to {d}, fix D1 ∈ UEx and let M be a corresponding meta-IIM as in Definition 6. It remains to define a recursive meta-operator Θ and a recursive operator Ξ appropriately. Given initial segments δ1 and α, let Θ just modify the sequence of hypotheses returned by the meta-IIM M , if the first input parameter is gradually taken from the sequence δ1 and the second input parameter is gradually taken from the sequence α. The modification is to increase each hypothesis by 1 and to change each repetition of hypotheses into a zero output. A formal definition is omitted. Moreover, given an initial segment σ = (s0 , . . . , sn ), let Ξ(σ) look for the maximal m ≤ n such that at least one of the values τsm (x), x ≤ n, is defined within n steps and greater than 0. In case m does not exist, Ξ(σ) = Ξ(s0 , . . . , sn−1 ). Otherwise, let y ≤ n be maximal such that τsm (y) has already been computed and is greater than 0. Then Ξ(σ) equals Ξ(s0 , . . . , sn−1 ) extended by the value τsm (y) − 1. Now D1 is UEx -reducible to {d} via Θ, Ξ; details are omitted. That decompositions of Ex -complete classes may also fail to be UEx -complete is shown in Section 3.3.
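The hypothesis modification performed by Θ in the proof of Example 4, whose formal definition is omitted in the text, is easy to sketch. The following Python fragment is our illustration, not the paper's construction: it covers only the rewriting of a finite hypothesis sequence (each fresh hypothesis is shifted up by 1, each repetition becomes 0), and does not model the interleaving of the inputs or the operator Ξ.

```python
# Sketch of the hypothesis modification used by Theta in Example 4:
# each new hypothesis is increased by 1, and every repetition of the
# previous hypothesis is turned into a zero output.  If the original
# sequence converges, the non-zero values of the result are finitely
# many, and the last one (minus 1) recovers the final hypothesis.

def modify(hypotheses):
    out = []
    prev = None
    for h in hypotheses:
        if h == prev:
            out.append(0)        # repetition -> zero output
        else:
            out.append(h + 1)    # new hypothesis, shifted by 1
        prev = h
    return out

# A converging hypothesis sequence and its modification:
print(modify([3, 3, 5, 5, 5, 7, 7]))   # [4, 0, 6, 0, 0, 8, 0]
```

On a converging sequence, the result is a function of finite support whose last positive value encodes the limit hypothesis plus 1, which is what lets Ξ decode an admissible sequence for the original function.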
Example 4 moreover serves for proving the completeness of other sets, if Lemma 5 – an immediate consequence of Lemma 3 – is applied. Lemma 5 Let D1 , D2 ∈ UEx . If D1 is UEx -complete and UEx -reducible to D2 , then D2 is UEx -complete. Lemma 5 and Example 4 simplify the proofs of further examples, finally revealing that there are indeed UEx -complete description sets representing singleton recursive cores only. Example 6 1. Let (αi )i∈N be an r. e. family of all initial segments. Let g ∈ R fulfil ϕ^{g(i)}_0 = αi 0∞ and ϕ^{g(i)}_{x+1} =↑∞ for i, x ∈ N. Then the description set {g(i) | i ∈ N} is UEx -complete. 2. Let g ∈ R fulfil ϕ^{g(i)}_0 = τi and ϕ^{g(i)}_{x+1} =↑∞ for i, x ∈ N. Then the description set {g(i) | i ∈ N} is UEx -complete. Proof. ad 1. Obviously, {g(i) | i ∈ N} ∈ UEx . Now we reduce the UEx -complete set {d} from Example 4 to {g(i) | i ∈ N}. Lemma 5 then proves Assertion 1.
It is easy to define Θ such that, if α does not end with 0, then Θ(δ1 , α0∞ ) = (δ2 , α0∞ ), where δ2 converges to some g(i) with αi = α. Let Ξ(σ) = σ for all σ. Then {d} is UEx -reducible to {g(i) | i ∈ N} via Θ and Ξ. Details are omitted. ad 2. Fix an r. e. family (αi )i∈N of all initial segments; fix h ∈ R with τh(i) = αi 0∞ for all i ∈ N. Then ϕ^{g(h(i))}_0 = αi 0∞ and ϕ^{g(h(i))}_{x+1} =↑∞ for i, x ∈ N. As above, the set {g(h(i)) | i ∈ N} is UEx -complete; so is its superset {g(i) | i ∈ N}. Just as the properties of Cfsup are characteristic for Ex -completeness, the properties of description sets representing decompositions of Cfsup are characteristic for UEx -completeness, as is stated in Theorem 7 and Corollary 8. Theorem 7 Let D ∈ UEx . D is UEx -complete, iff there are a recursive numbering ψ and a limiting r. e. family (di )i∈N of descriptions in D such that: 1. ψi belongs to Rdi for all i ∈ N; 2. Pψ is dense. Proof. Fix a description set D in UEx . Necessity. Assume D is UEx -complete. Fix any one-one recursive numbering χ such that Pχ = Cfsup . Moreover fix g ∈ R which, given any i, x ∈ N, fulfils ϕ^{g(i)}_0 = χi and ϕ^{g(i)}_x =↑∞ , if x > 0. Then the description set {g(i) | i ∈ N} is UEx -complete, as can be verified similarly to Example 6. Lemma 5 then implies that {g(i) | i ∈ N} is UEx -reducible to D, say via Θ and Ξ. Fix a one-one r. e. family (αi )i∈N of all finite tuples over N. For i ∈ N, i coding the pair (x, y), define (δi , ψi ) := Θ(αy g(x)∞ , χx ). By definition, ψ is a recursive numbering and, for all i ∈ N, the sequence δi converges to some di ∈ D such that ψi ∈ Rdi . Hence (di )i∈N is a limiting r. e. family of descriptions in D. It remains to verify Property 2. For that purpose fix i, n ∈ N. By definition, if i encodes (x, y), we obtain Θ(αy g(x)∞ , χx ) = (δi , ψi ). The properties of Θ yield some m ∈ N such that Θ(αy g(x)m , χx [m]) = (δi', α') for some δi', α' with δi' ⊆ δi and ψi [n] ⊆ α' ⊆ ψi .
Because of the particular properties of χ, there is some x' ∈ N, x' ≠ x, such that χx' =m χx , but χx' ≠ χx . Moreover, there is some y' ∈ N such that αy' = αy g(x)m . If j encodes (x', y'), this yields Θ(αy g(x)m g(x')∞ , χx' ) = (δj , ψj ), where α' ⊆ ψj . In particular ψj =n ψi . Assume ψi = ψj . Suppose σ is any admissible sequence for ψi . Then σ is admissible for ψj . This implies that Ξ(σ) is admissible for both χx and χx' . As χx ≠ χx' , this is impossible. So ψi ≠ ψj . Sufficiency. Assume D, ψ, and (di )i∈N fulfil the conditions of Theorem 7. Let d denote a numbering associated to the limiting r. e. family (di )i∈N . The results in the context of non-uniform learning help to show that D is UEx -complete: By assumption, Pψ is a dense r. e. subset of R. Theorem 1 then implies that Pψ is Ex -complete, so Cfsup is Ex -reducible to Pψ , say via Θ', Ξ'. Using Θ' and Ξ' one can show that the UEx -complete set {d} from Example 4 is UEx -reducible to D. This implies that D is UEx -complete, too. Note that Rd = Cfsup .
It remains to define a recursive meta-operator Θ and a recursive operator Ξ appropriately. If δ1 and α1 are finite tuples over N, define Θ(δ1 , α1 ) as follows. Compute Θ'(α1 ) = α2 and n = |α2 |. For all x < n, let ix be minimal such that α2 [x] ⊆ ψix . Return Θ(δ1 , α1 ) = ((di0 (0), di1 (1), . . . , din−1 (n − 1)), α2 ) (if n = 0, then the first component of Θ(δ1 , α1 ) is the empty sequence). Clearly, if f1 ∈ R, then Θ(δ1 , f1 ) = (δ2 , Θ'(f1 )) for some sequence δ2 . Moreover, let Ξ := Ξ'. Finally, to verify that {d} is UEx -reducible to D, fix a sequence δ1 and a function f1 ∈ Rd . First, note that f2 = Θ'(f1 ) ∈ Pψ . Let i be the minimal ψ-program of Θ'(f1 ) = f2 . As ψ ∈ R, for all x ∈ N the minimal ix satisfying f2 [x] ⊆ ψix can be computed. Additionally, lim(ix )x∈N = i. Note that the function di converges to the description di . Hence Θ(δ1 , f1 ) = (δ2 , f2 ), where f2 ∈ Pψ and δ2 converges to di , because f2 = ψi . In particular, f2 ∈ Rdi . Second, if σ is admissible for f2 , then Ξ'(σ) is admissible for f1 . So {d} is UEx -reducible to D via Θ and Ξ, and thus D is UEx -complete. Corollary 8 Let D ∈ UEx . D is UEx -complete, iff there are a recursive numbering ψ and a limiting r. e. family (di )i∈N of descriptions in D such that: 1. ψi belongs to Rdi for all i ∈ N; 2. Pψ is Ex -complete. Proof. Necessity. The assertion follows from Theorem 1 and Theorem 7. Sufficiency. Let D ∈ UEx . Assume ψ and (di )i∈N fulfil the conditions above. Let d be a recursive numbering corresponding to the limiting r. e. family (di )i∈N . By Property 2, Pψ is Ex -complete; thus, by Theorem 1, there exists a dense r. e. subclass C ⊆ Pψ . Let ψ' be a one-one, recursive numbering with Pψ' = C; in particular Pψ' is dense. It remains to find a limiting r. e. family (di')i∈N of descriptions in D such that ψi' ∈ Rdi' for all i ∈ N. For that purpose define a corresponding numbering d'. Given i, n ∈ N, define di'(n) as follows. Let j ∈ N be minimal such that ψi' =n ψj .
(* Note that, for all but finitely many n, the index j will be the minimal ψ-program of ψi'. *) Return di'(n) := dj (n). (* lim(di') = dj , for j minimal with ψi' = ψj . *) Finally, let di' be given by the limit of the function di', in case a limit exists. Fix i ∈ N. Then there is a minimal j with ψi' = ψj . By definition, the limit di' of the function di' exists and di' = dj ∈ D. Moreover, as ψj ∈ Rdj , the function ψi' is in Rdi' . As ψ' and (di')i∈N allow us to apply Theorem 7, the set D is UEx -complete. Thus certain decompositions of Ex -complete classes remain UEx -complete, and UEx -complete description sets always represent decompositions of supersets of Ex -complete classes. Example 9 illustrates how to apply the above characterizations of UEx -completeness. A similar short proof may be given for Example 6.
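The limiting construction above can be illustrated by a toy sketch (our illustration, with a hypothetical three-element family, not the paper's numbering): the minimal ψ-program of a function is not computable outright, but the least index agreeing with it on all arguments below n is, and this approximation stabilizes on the minimal program as n grows.

```python
# Sketch of the limiting step in the proof of Corollary 8: approximate
# the minimal psi-program of psi_i by the least j with psi_j agreeing
# with psi_i on arguments < n; for all but finitely many n this yields
# the true minimal program.

PSI = [
    lambda x: 0,                   # psi_0: constant 0
    lambda x: 1 if x < 3 else 0,   # psi_1
    lambda x: 1 if x < 3 else 0,   # psi_2 = psi_1 (a duplicate program)
]

def min_equiv(i, n):
    """Least j such that psi_j =_n psi_i (agreement on arguments < n)."""
    for j in range(len(PSI)):
        if all(PSI[j](x) == PSI[i](x) for x in range(n)):
            return j
    return i

approx = [min_equiv(2, n) for n in range(6)]
# n = 0 gives 0 (vacuous agreement); from n = 1 on, the approximation
# settles on 1, the minimal program of psi_2.
```

The same pattern underlies the limiting r. e. family (di')i∈N : each value di'(n) is computed from finitely much information, and only the limit is guaranteed to be the intended description.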
Example 9 Fix a recursive numbering χ such that Pχ is dense. Let g ∈ R fulfil ϕ^{g(i)}_0 = χi and ϕ^{g(i)}_{x+1} =↑∞ for i, x ∈ N. Then {g(i) | i ∈ N} is UEx -complete. Proof. (g(i))i∈N is a (limiting) r. e. family such that χi ∈ Rg(i) for all i ∈ N and Pχ is Ex -complete. Corollary 8 implies that {g(i) | i ∈ N} is UEx -complete.
3.2 Intrinsic Complexity of rUEx -Learning
Adapting the formalism of intrinsic complexity for restricted uniform learning, we have to be careful concerning the operator Ξ. In UEx -learning, the current description d has no effect on whether a sequence is admissible for a function or not. For restricted learning this is different. Therefore, to communicate the relevant information to Ξ, it is inevitable to include a description from D2 in the input of Ξ. That means Ξ should operate on pairs (δ2 , σ) rather than on sequences σ only. Since only the limit of the function output by Ξ is relevant for the reduction, this idea can be simplified. It suffices if Ξ operates correctly on the inputs d2 and σ, where d2 is the limit of δ2 . Then an operator on the pair (δ2 , σ) is obtained from Ξ by returning the sequence (Ξ(δ2 (0)σ[0]), Ξ(δ2 (1)σ[1]), . . . ). Its limit will equal the limit of Ξ(d2 σ). Definition 12 Let D1 , D2 ∈ rUEx . Fix a recursive meta-operator Θ and a recursive operator Ξ. D1 is rUEx -reducible to D2 via Θ and Ξ, iff for any d1 ∈ D1 , any f1 ∈ Rd1 , and any initial segment δ1 there are functions δ2 and f2 satisfying: 1. Θ(δ1 d1∞ , f1 ) = (δ2 , f2 ), 2. δ2 converges to some description d2 ∈ D2 such that f2 ∈ Rd2 , 3. if σ is r -admissible for d2 and f2 , then Ξ(d2 σ) is r -admissible for d1 and f1 . D1 is rUEx -reducible to D2 , iff D1 is rUEx -reducible to D2 via some Θ and Ξ. Completeness is defined as usual. As in the UEx -case, rUEx -reducibility is transitive; so the rUEx -completeness of one set may help to verify the rUEx -completeness of others. Lemma 10 If D1 , D2 , D3 are description sets such that D1 is rUEx -reducible to D2 and D2 is rUEx -reducible to D3 , then D1 is rUEx -reducible to D3 . Lemma 11 Let D1 , D2 ∈ rUEx . If D1 is rUEx -complete and rUEx -reducible to D2 , then D2 is rUEx -complete. Recall that, intuitively, sets describing just one singleton recursive core may be rUEx -complete. This is affirmed by Example 12, the proof of which is omitted.
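The simplification described before Definition 12 can be simulated with a toy operator (hypothetical, not the paper's construction), where sequences are finite lists of approximations: an operator Ξ expecting the limit description d2 is turned into one that only sees the converging sequence δ2, by emitting the n-th value of Ξ on (δ2(n), σ up to n).

```python
# Sketch of the diagonal trick: Xi expects the limit description d2, but
# only the converging sequence delta2 is available.  Emitting the n-th
# output of Xi(delta2(n), sigma[:n+1]) works because delta2(n) = d2 for
# all large n, so the diagonal sequence inherits the limit of Xi(d2, sigma).

def xi(d, seg):
    """Toy Xi, one output per call: here simply d + last element of seg."""
    return d + seg[-1]

def paired(delta2, sigma):
    """Operator on the pair (delta2, sigma) obtained from xi."""
    return [xi(delta2[n], sigma[:n + 1]) for n in range(len(sigma))]

delta2 = [9, 7, 7, 7, 7, 7]    # converges to d2 = 7
sigma  = [5, 5, 3, 3, 3, 3]    # hypothesis sequence with limit 3
print(paired(delta2, sigma))   # [14, 12, 10, 10, 10, 10] -> limit 10
# Using the true limit directly, xi(7, sigma) also stabilizes at 10.
```

Only the limit of the output matters for the reduction, so the early values computed from wrong approximations of d2 are harmless.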
Example 12 Let s, g ∈ R such that ϕ^{g(i)}_i = s and ϕ^{g(i)}_x =↑∞ , if i, x ∈ N, x ≠ i. Then {g(i) | i ∈ N} is rUEx -complete, but not UEx -complete.
Example 12 helps to characterize rUEx -completeness. In particular, it shows that the demand ‘Pψ is dense’ has to be dropped.
Theorem 13 Let D ∈ rUEx . D is rUEx -complete, iff there are a recursive numbering ψ and a limiting r. e. family (di )i∈N of descriptions in D such that: 1. ψi belongs to Rdi for all i ∈ N; 2. for each i, n ∈ N there are infinitely many j ∈ N satisfying ψi =n ψj and (di , ψi ) ≠ (dj , ψj ). Proof. Fix a description set D in rUEx . Necessity. Assume D is rUEx -complete. Lemma 11 implies that the description set {g(i) | i ∈ N} from Example 12 is rUEx -reducible to D, say via Θ and Ξ. Fix a one-one r. e. family (αi )i∈N of all finite tuples over N. For i ∈ N, i coding the pair (x, y), define (δi , ψi ) := Θ(αy g(x)∞ , s). By definition, ψ is a recursive numbering and, for all i ∈ N, the sequence δi converges to some di ∈ D such that ψi ∈ Rdi . Hence (di )i∈N is a limiting r. e. family of descriptions in D. It remains to verify Property 2. For that purpose fix i, n ∈ N. By definition, if i encodes (x, y), we have Θ(αy g(x)∞ , s) = (δi , ψi ). The properties of Θ yield some m ∈ N such that Θ(αy g(x)m , s) = (δi', α') for some δi' and α' with δi' ⊆ δi and ψi [n] ⊆ α' ⊆ ψi . Now choose any x' ∈ N such that x' ≠ x. Moreover, there is some y' ∈ N such that αy' = αy g(x)m . If j encodes (x', y'), this yields Θ(αy g(x)m g(x')∞ , s) = (δj , ψj ), where α' ⊆ ψj . In particular ψj =n ψi . Assume (di , ψi ) = (dj , ψj ). Suppose σ is any rUEx -admissible sequence for di and ψi . Then Ξ(di σ) is rUEx -admissible for both g(x) and s and g(x') and s. As x is the only ϕg(x) -number for s and x' is the only ϕg(x') -number for s, the latter is impossible. So (di , ψi ) ≠ (dj , ψj ). Repeating this argument for any x' with x' ≠ x yields the desired property. Sufficiency. First note: if (di )i∈N is a limiting r. e. family and ψ any recursive numbering, such that {(di , ψi ) | i ∈ N} is an infinite set, then there are a limiting r. e. family (di')i∈N and a recursive numbering ψ', such that {(di', ψi') | i ∈ N} ⊆ {(di , ψi ) | i ∈ N} and i ≠ j implies (di', ψi') ≠ (dj', ψj'). Details are omitted.
So let D, ψ, (di )i∈N fulfil the demands of Theorem 13 and assume wlog that i ≠ j implies (di , ψi ) ≠ (dj , ψj ). Let d be the numbering associated to the limiting r. e. family (di )i∈N . We show that the set {g(i) | i ∈ N} from Example 12 is rUEx -reducible to D; so Lemma 11 implies that D is rUEx -complete. For that purpose fix a one-one numbering η ∈ R such that Pη equals the set Cconst := {αi∞ | α is a finite tuple over N and i ∈ N} of all recursive finite variants of constant functions. Using a construction from [8] we define an operator Θ' mapping Pη into Pψ . In parallel, a function θ is constructed to mark used indices. Let Θ'(η0 ) := ψ0 and θ(0) := 0. If i > 0, let Θ'(ηi ) be defined as follows. For x < i, let mx be maximal with ηi =mx ηx . Let m := maxx<i mx .
Choose h = min(H); return Θ (ηi ) := ψh , moreover let θ(i) := h. (* Because of Property 2, the index h exists. As ψ is recursive, h is found effectively. *) Note that Θ is a recursive operator mapping Pη into Pψ . θ is a recursive function, that maps each number i to the index h used in the construction of Θ (ηi ) = ψh . θ is one-one, yet it may happen, that Θ (ηi ) = Θ (ηj ), but θ(i) = θ(j) for some i, j ∈ N. It remains to define a recursive meta-operator Θ and a recursive operator Ξ such that {g(i) | i ∈ N} is rUEx -reducible to D via Θ and Ξ. If δ is an infinite sequence, define Θ(δ, s) as follows. For each x ∈ N, let jx ∈ N be minimal such that ηjx =x δ. Let ix := θ(jx ). Return Θ(δ, s) := ((di0 (0), di1 (1), . . . ), Θ (δ)). Clearly, the output of Θ depends only on δ. If δ converges, then Θ(δ, s) = (δ , f ), where f ∈ Pψ and δ converges to some description di such that i = θ(j) for the minimal number j satisfying ηj = δ. To define an operator Ξ, compute Ξ(dσ) for d ∈ N, σ ∈ R as follows. For x ∈ N let X := {y ≤ x | dy (x) = d and, for all z ≤ x, if ϕdσ(x) (z) is defined in x steps of computation, then ϕdσ(x) (z) = ψy (z)}. If X is empty, let ix = 0, otherwise let ix be the minimum of X. (* In the limit, the only i satisfying di = d and ϕdlim(σ) = ψi is found – provided that i exists. *) For each x ∈ N, compute jx ∈ N with θ(jx ) = ix . (* In the limit, a number j with θ(j) = i, di = d, and Θ (ηj ) = ψi is found – provided that j exists. *) Return Ξ(dσ) := (g −1 (ηj0 (0)), g −1 (ηj1 (1)), g −1 (ηj2 (2)), . . . ). (* g −1 denotes the function inverse to g. Ξ(dσ) converges to g −1 (l), where l is the limit of ηj with Θ (ηj ) = ψi and di equals d – provided that i and j exist. *) To show that {g(i) | i ∈ N} is rUEx -reducible to D via Θ and Ξ, fix some δ1 ∈ R converging to some description d ∈ {g(i) | i ∈ N}. 
First, by the remarks below the definition of Θ, we obtain Θ(δ1 , s) = (δ2 , f2 ), where f2 ∈ Pψ and δ2 converges to some description di such that i = θ(j) for the minimal j satisfying ηj = δ1 . This implies f2 = ψi . In particular, f2 ∈ Rdi . Second, if σ ∈ R is r -admissible for di and ψi , then Ξ(di σ) converges to g −1 (d) (by the note in the definition of Ξ and d = lim(ηj )). Recall that g −1 (d) is the only ϕd -program of s, whenever d ∈ {g(i) | i ∈ N}. Hence Ξ(di σ) is r -admissible for d and s. So {g(i) | i ∈ N} is rUEx -reducible to D and finally D is rUEx -complete. As an immediate consequence of Theorems 7 and 13 we have: Corollary 14 Let D ∈ rUEx . If D is UEx -complete, then D is rUEx -complete.
3.3 Algorithmic Structure of Complete Classes in Uniform Learning
Theorems 7 and 13 suggest a weakness of the notion of intrinsic complexity, similar to the non-uniform case: though UEx -/rUEx -complete sets involve a
topologically complex structure, expressed by Property 2, this goes along with the demand for a limiting r. e. subset combined with an r. e. subset Pψ of the union of all represented recursive cores. The latter again can be seen as a noncomplex algorithmic structure. Now Theorem 15 shows that there are non-complete description sets for which the properties of Theorems 7 and 13 can be fulfilled, but only if the demand for limiting r. e. sets is dropped. These sets are algorithmically more complex than our examples of UEx -complete sets, but they belong to a lower degree of intrinsic complexity. Theorem 15 Let C ⊆ R. Then there is a set D ∈ rUEx such that 1. C equals the union of all recursive cores described by D, 2. D is not rUEx -complete (and hence not UEx -complete). Proof. Fix a list A0 , A1 , . . . of all infinite limiting r. e. sets such that ϕd0 ∈ C and ϕdx+1 =↑∞ for all i, x ∈ N and d ∈ Ai . Let A := ∪i∈N Ai and C = {f0 , f1 , . . . }. Define a set D0 as follows. Fix the least elements d0 , d0' of A0 , d0 < d0'. Let I0 := {d0 }, I0' := {d0'}. Let e0 ∈ A \ (I0 ∪ I0') be minimal such that f0 ∈ Re0 . (* e0 exists, because A contains infinitely many descriptions d with ϕd0 = f0 . *) Let D0 := I0 ∪ {e0 }. (* The disjoint sets D0 and I0' both intersect with A0 ; some recursive core described by D0 equals {f0 }. *) Moreover, for any k ∈ N, define a set Dk+1 as follows. Fix the least elements dk+1 , dk+1' of Ak+1 \ (Dk ∪ Ik'), dk+1 < dk+1'. (* These have not been touched in the definition of D0 , . . . , Dk yet. *) Let Ik+1 := Dk ∪ {dk+1 }, Ik+1' := Ik' ∪ {dk+1'}. Let ek+1 ∈ A \ (Ik+1 ∪ Ik+1') be minimal such that fk+1 ∈ Rek+1 . (* ek+1 exists, because A contains infinitely many descriptions d with ϕd0 = fk+1 . *) Let Dk+1 := Ik+1 ∪ {ek+1 }. (* The disjoint sets Dk+1 and Ik+1' both intersect with Ak+1 ; some recursive core described by Dk+1 equals {fk+1 }. *) Choose D := ∪k∈N Dk ⊂ A; then D does not contain any infinite limiting r. e. set.
As ϕdx+1 =↑∞ for all d ∈ D, x ∈ N, we have D ∈ rUEx . Moreover, C is the union of all cores described by D. It remains to prove that D is not rUEx -complete. Assume D is rUEx -complete. Then some limiting r. e. set {di | i ∈ N} ⊆ D and some ψ ∈ R fulfil the conditions of Theorem 13. In particular, {(di , ψi ) | i ∈ N} is infinite. As D does not contain any infinite limiting r. e. set, the set {di | i ∈ N} is finite. card Rdi = 1 for i ∈ N implies that {ψi | i ∈ N} is finite, too; thus {(di , ψi ) | i ∈ N} is finite – a contradiction. So D is not rUEx -complete. The reason why each UEx -/rUEx -complete set D contains a limiting r. e. subset representing a decomposition of an r. e. class is that certain properties of UEx -complete sets are ‘transferred’ by meta-operators Θ. This corroborates the possible interpretation that our approach of intrinsic complexity just makes a
class complete, if it is a suitable ‘target’ for recursive meta-operators – similar to the non-uniform case. By the way, Theorem 15 shows that every Ex -complete class C has a decomposition represented by a description set which is not UEx -complete – answering a question in Section 3.1. Acknowledgements. My thanks are due to the referees and to Steffen Lange for their comments correcting and improving a former version of this paper, and moreover to Frank Stephan for a very helpful discussion on the technical details.
References 1. Baliga, G.; Case, J.; Jain, S. (1999); The synthesis of language learners, Information and Computation 152:16–43. 2. Blum, L.; Blum, M. (1975); Toward a mathematical theory of inductive inference, Information and Control 28:125–155. 3. Case, J.; Smith, C. (1983); Comparison of identification criteria for machine inductive inference, Theoretical Computer Science 25:193–220. 4. Freivalds, R.; Kinber, E.; Smith, C. (1995); On the intrinsic complexity of learning, Information and Computation 123:64–71. 5. Garey, M; Johnson, D. (1979); Computers and Intractability – A Guide to the Theory of NP-Completeness, Freeman and Company. 6. Gold, E. M. (1965); Limiting recursion, Journal of Symbolic Logic 30:28–48. 7. Gold, E. M. (1967); Language identification in the limit, Information and Control 10:447–474. 8. Jain, S.; Kinber, E.; Papazian, C.; Smith, C.; Wiehagen, R. (2003); On the intrinsic complexity of learning recursive functions, Information and Computation 184:45– 70. 9. Jain, S.; Kinber, E.; Wiehagen, R. (2000); Language learning from texts: Degrees of intrinsic complexity and their characterizations, Proc. 13th Annual Conference on Computational Learning Theory, Morgan Kaufmann, 47–58. 10. Jain, S.; Sharma, A. (1996); The intrinsic complexity of language identification, Journal of Computer and System Sciences 52:393–402. 11. Jain, S.; Sharma, A. (1997); The structure of intrinsic complexity of learning, Journal of Symbolic Logic 62:1187–1201. 12. Jantke, K. P. (1979); Natural properties of strategies identifying recursive functions, Elektronische Informationsverarbeitung und Kybernetik 15:487–496. 13. Kapur, S.; Bilardi, G. (1992); On uniform learnability of language families, Information Processing Letters 44:35–38. 14. Osherson, D.; Stob, M.; Weinstein, S. (1988); Synthesizing inductive expertise, Information and Computation 77:138–161. 15. Rogers, H. (1987); Theory of Recursive Functions and Effective Computability, MIT Press. 16. Zilles, S. 
(2001); On the synthesis of strategies identifying recursive functions, Proc. 14th Annual Conference on Computational Learning Theory, LNAI 2111, pp. 160–176, Springer-Verlag. 17. Zilles, S. (2001); On the comparison of inductive inference criteria for uniform learning of finite classes, Proc. 12th Int. Conference on Algorithmic Learning Theory, LNAI 2225, pp. 251–266, Springer-Verlag.
On Ordinal VC-Dimension and Some Notions of Complexity

Eric Martin1 , Arun Sharma2 , and Frank Stephan2

1 School of Computer Science and Engineering, UNSW Sydney, NSW 2052, Australia
[email protected]
2 National ICT Australia, UNSW Sydney, NSW 2052, Australia
{Arun.Sharma,Frank.Stephan}@nicta.com.au
Abstract. We generalize the classical notion of VC-dimension to ordinal VC-dimension, in the context of logical learning paradigms. Logical learning paradigms encompass the numerical learning paradigms commonly studied in Inductive inference. A logical learning paradigm is defined as a set W of structures over some vocabulary, and a set D of first-order formulas that represent data. The sets of models of ϕ in W, where ϕ varies over D, generate a natural topology W over W. We show that if D is closed under boolean operators, then the notion of ordinal VC-dimension offers a perfect characterization for the problem of predicting the truth of the members of D in a member of W, with an ordinal bound on the number of mistakes. This shows that the notion of VC-dimension has a natural interpretation in Inductive Inference, when cast into a logical setting. We also study the relationships between predictive complexity, selective complexity—a variation on predictive complexity—and mind change complexity. The assumptions that D is closed under boolean operators and that W is compact often play a crucial role to establish connections between these concepts. Keywords: Inductive inference, logical paradigms, VC-dimension, predictive complexity, mind change bounds.
1 Introduction
The notion of VC-dimension is a key concept in PAC-learning [6,12,13]. The notion of finite telltale is a key concept in Inductive inference [3,8]. It can be claimed that VC-dimension is to PAC-learning what finite telltales are to Inductive inference. Both provide a characterization of learnability, for fundamental classes of learning paradigms, in the respective settings. Both take the form of a condition where finiteness is a key requirement, in frameworks that deal with infinite objects. In logical learning paradigms of identification in the limit, it has been shown that the finite telltale condition can be seen as a generalization of the compactness property, the latter being the hallmark of, equivalently, finite learning or deductive inference [10]. The finite telltale condition can even be generalized and be interpreted as a property of β-weak compactness, that characterizes classification with less than β mind changes [10]. There are extensions of

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 54–68, 2003. © Springer-Verlag Berlin Heidelberg 2003
VC dimension to infinite domains [4]. But there are few essential connections between VC-dimension and some fundamental concepts from Inductive inference: the relevance of the concept of VC-dimension seems to be closely related to the existence of probability distributions over the sample space. Though connections exist between PAC learning and Inductive inference (e.g., [5]), it does not seem that VC-dimension has any chance to play a key role in learning paradigms of Inductive inference that do not introduce probability distributions over the sample space. We will show that VC-dimension can still provide a perfect characterization for the problem of predicting whether a possible datum is true or false in the underlying world, in the realm of Inductive inference. But for such a characterization to be possible, the condition that the set of possible data is closed under boolean operators has to be imposed. The fact that we work in a logical setting is of course essential to express this condition in a simple, meaningful and natural way. The notions and main results are stated with no computability condition on the procedures that analyze the data and output hypotheses. This is necessary to obtain perfect equivalences, and provides strong evidence that the concepts involved are naturally connected. When computability is a requirement, the relationships become more complex. Our aim is to encourage the study of the connections between ordinal VC-dimension and predictive complexity in paradigms of Inductive inference. Obtaining perfect connections for ideal, unconstrained paradigms suggests that further work in the same direction should be carried out in the context of more realistic or constrained paradigms. Moreover, the results will be illustrated with examples that always involve effective procedures. We proceed as follows. We introduce some background notions and notation in Section 2. We define the various complexity measures in Section 3.
We study the relationships between these complexity measures in Section 4. We conclude in Section 5.
2  Background
The class of ordinals is denoted by Ord. Let a set X be given. The set of finite sequences of members of X, including the empty sequence (), is represented by X*. Given σ ∈ X*, the set of members of X that occur in σ is denoted by rng(σ). Given a finite or an infinite sequence σ of members of X and a natural number i that, in case σ is finite, is at most equal to the length of σ, we represent by σ|i the initial segment of σ of length i. Concatenation between sequences is represented by ⌢, and sequences consisting of a unique element are often identified with that element. We write ⊂ (resp. ⊆) for strict (resp. nonstrict) inclusion between sets, as well as for the notion of a finite sequence being a strict (resp. nonstrict) initial segment of another sequence. We also use the notation ⊃. Let two sets X, Y and a partial function f from X into Y be given. Given x ∈ X, we write f(x)↓ when f(x) is defined, and f(x)↑ otherwise. Given two members x, x′ of X, we write f(x) = f(x′) when both f(x) and f(x′) are defined and equal; we write f(x) ≠ f(x′) otherwise. Let R be a binary relation over a set X. Recall that R is well-founded iff every
56
E. Martin, A. Sharma, and F. Stephan
nonempty subset Y of X contains an element x such that no member y of Y satisfies R(y, x). Suppose that R is well founded. We then denote by ρR the unique function from X into Ord such that for all x ∈ X: ρR (x) = sup{ρR (y) + 1 : y ∈ X, R(y, x)}. The length of R, written as |R|, is the least ordinal not in the range of ρR . Note that |R| = 0 iff X = ∅. For example, Figure 1 depicts a finite binary relation R of length 5. In this diagram, an arrow joins a point x to a point y iff R(x, y) holds. For all points x in the field of R, the value of ρR (x) is indicated.
Fig. 1. A finite binary relation of length 5
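The rank function ρR and the length |R| defined above can be computed directly for finite relations. The following is a small illustrative sketch (not taken from the paper); for finite R the ordinals involved are just natural numbers, and the chain below has the same length, 5, as the relation of Fig. 1.

```python
def rank(relation, points):
    """Return {x: rho_R(x)} where rho_R(x) = sup{rho_R(y) + 1 : R(y, x)}."""
    rho = {}

    def rho_of(x):
        if x not in rho:
            # The recursion terminates because R is well-founded
            # (here: finite and acyclic).
            rho[x] = max((rho_of(y) + 1 for (y, z) in relation if z == x),
                         default=0)
        return rho[x]

    for x in points:
        rho_of(x)
    return rho


def length(relation, points):
    """|R|: the least ordinal not in the range of rho_R (0 on the empty set).

    For finite R the range of rho_R is downward closed (a point of rank n+1
    has an R-predecessor of rank n), so this is just max + 1.
    """
    rho = rank(relation, points)
    return max(rho.values(), default=-1) + 1


# The chain 0 -> 1 -> 2 -> 3 -> 4 gets ranks 0, 1, 2, 3, 4, hence length 5.
chain = {(i, i + 1) for i in range(4)}
print(rank(chain, range(5)))
print(length(chain, range(5)))  # 5
```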
Let us introduce the logical learning paradigms and their constituents. We denote by V a countable vocabulary, i.e., a countable set of function symbols (possibly including constants) and predicate symbols. Let us adopt a convention. If we say that V contains 0 and s, then 0 denotes a constant and s a unary function symbol. Moreover, given a nonnull n ∈ N, n is used as an abbreviation for the term obtained from 0 by n applications of s. We denote by D a nonempty set of first-order sentences (closed formulas) over V that represent data. Three important cases are sets of closed atoms (to model learning from texts), sets of closed literals (i.e., closed atoms and their negations, to model learning from informants), and sets of sentences closed under boolean operators, a natural example being the set of quantifier-free sentences. Note that quantifier-free sentences convey no more information than closed literals. Still, the assumption that D is closed under boolean operators will play a key role in this paper. We denote by # a symbol to be used when no datum is presented. Given a member σ of (D ∪ {#})*, we set cnt(σ) = rng(σ) ∩ D. We denote by W a nonempty set of structures over V. An important case is given by the set of all Herbrand structures, i.e., structures over V each of whose individuals interprets a unique closed term. (When we consider Herbrand structures, V contains at least one constant.) Given a member M of W and a set E
of formulas, the E-diagram of M, denoted DiagE (M), is the set of all members of E that are true in M. We say that a set T of first-order formulas is consistent in W iff T has a model in W. Given a set T of first-order formulas over V, we denote by Mod(T ) the set of models of T , and by ModW (T ) the set of models of T in W (i.e., ModW (T ) = Mod(T ) ∩ W). We denote by P the triple (V, D, W). We call P a logical paradigm. This is a simplification of the notion of logical paradigm investigated in [9,10]. Learning paradigms in the numerical setting are naturally cast into the logical setting as follows. Set V = {0, s, P } where P is a unary predicate symbol. Let E be the set {P (n) : n ∈ N}. If C is the set of languages to be learnt, we define W as the set of Herbrand structures whose E-diagrams are {P (n) : n ∈ L} where L varies over C. The choice of D depends on the type of data: D = E when data are positive, D = E ∪ {¬ψ : ψ ∈ E} when data are positive or negative.
3  The Notions of Complexity
We now define the various concepts of complexity we need in this paper, starting with VC-dimension. The notion of a set of hypotheses shattering a set of data takes the following form when hypotheses are represented as structures and data as formulas.

Definition 1. Let a set E of formulas be given. We say that W shatters E iff E is finite and for all subsets D of E, DiagE (M) = D for some M ∈ W.

Traditionally, the VC-dimension is defined as the greatest number n such that some set consisting of n distinct elements is shattered [13]. When such an n does not exist, the VC-dimension is considered to be either undefined or infinite. We extend the notion of VC-dimension from natural numbers to ordinals as follows.

Definition 2. Let X be the set of nonempty subsets of D that W shatters. Let R be the restriction of ⊃ to X. The VC-dimension of P is equal to the length of R if R is well-founded, and undefined otherwise.

As an intuitive interpretation, the VC-dimension of P is determined by the following game, where we assume for simplicity that D is infinite. Consider two players Anke and Boris. Anke has to output an increasing sequence of nonempty finite subsets of D and Boris a decreasing sequence of ordinals.
– In round 1, Anke outputs a nonempty finite subset of D and Boris outputs an ordinal α.
– If in the n-th round Anke has output a finite set D ⊆ D and Boris has output a nonnull ordinal β, then the following is done in round n + 1.
  • Anke outputs a finite set E with D ⊂ E ⊆ D.
  • Boris outputs an ordinal strictly below β.
The game terminates after Boris has output 0. If Anke’s last set is shattered, Anke wins the game. Otherwise Boris wins the game. The VC-dimension of P
is the smallest nonnull ordinal α for which Boris has a winning strategy. If this ordinal does not exist, the VC-dimension is undefined. Given a structure M ∈ W, we call an environment for M (in P) an infinite sequence e of members of D ∪ {#} such that for all ϕ ∈ D, ϕ occurs in e iff ϕ ∈ DiagD (M). So environments correspond to texts when D is the set of closed atoms, and to informants when D is the set of closed literals. Identification in the limit and the corresponding notion of complexity are defined next.
Definition 3. An identifier (for P) is a partial function from (D ∪ {#})* into {DiagD (M) : M ∈ W}. Let an identifier f be given.
– We say that a member σ of (D ∪ {#})* is consistent in W just in case there exists M ∈ W such that for all ϕ ∈ D that occur in σ, M |= ϕ.
– We say that f is successful (in P) iff for every M ∈ W and for every environment e for M, f(e|k) = DiagD (M) for almost every k ∈ N.
– Let X be the set of all σ ∈ (D ∪ {#})* such that σ is consistent in W and f(τ)↓ for some initial segment τ of σ. We denote by Rf the binary relation over X such that for all σ, τ ∈ X, Rf(σ, τ) holds iff τ ⊂ σ and f(σ) ≠ f(τ).

The identification complexity of P is the least ordinal of the form |Rf|, where f is an identifier that is successful in P and Rf is well-founded; if such an f does not exist, the identification complexity of P is undefined. If the identification complexity of P is equal to a nonnull ordinal β, then the D-diagrams of the members of W are identifiable in the limit with less than β mind changes. (Note that the usual notion of mind change complexity considers the least ordinal β such that at most, rather than less than, β mind changes are sometimes necessary for the procedure to converge [1,2,7]. There are good theoretical reasons for preferring the ‘less than’ formulation.) Note that if some identifier is successful in P, then there are only countably many D-diagrams of members of W. The following is a characterization of identification complexity based on a generalization of Angluin’s finite telltale characterization of learnability in the limit [3].

Proposition 4. The identification complexity of P is defined and equal to a nonnull ordinal β iff there exists a sequence (βM)M∈W of ordinals smaller than β such that for all M ∈ W, there exists a finite AM ⊆ DiagD (M) such that for all N ∈ W:
(∗)
if AM ⊆ DiagD (N) and βM ≤ βN, then DiagD (N) = DiagD (M).
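To make Definition 3 concrete in a trivial finite case, here is a hypothetical Python sketch (not from the paper): for a finite ⊆-chain of languages, the identifier that conjectures the ⊆-least language of the class consistent with the data seen so far is successful from texts, and its mind changes along any environment can simply be counted. `None` plays the role of the pause symbol #.

```python
CLASS = [{0}, {0, 1}, {0, 1, 2}]          # a chain L0 ⊂ L1 ⊂ L2

def identifier(sigma):
    """Map a finite data sequence to the minimal consistent language, if any."""
    seen = {x for x in sigma if x is not None}   # None plays the role of '#'
    for lang in CLASS:                            # CLASS is listed ⊆-increasingly
        if seen <= lang:
            return lang
    return None                                   # undefined outside the class

def mind_changes(identifier, environment):
    """Count how often the conjecture changes along the initial segments."""
    changes, last = 0, None
    for k in range(len(environment) + 1):
        guess = identifier(environment[:k])
        if last is not None and guess != last:
            changes += 1
        last = guess
    return changes

env = [0, None, 1, 2, 0]     # an environment for L2, with one pause
print(mind_changes(identifier, env))  # 2
```

On this chain of three languages, at most two mind changes ever occur, mirroring the idea that the rank of Rf bounds the mind changes of f.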
The next notion of complexity we define is based on selectors. Intuitively, a selector for P is a procedure that, for any given M ∈ W, selects in the limit all formulas from DiagD (M) or their negations, and is correct cofinitely many times. In order to achieve this, the selector needs to receive, for every selected formula ϕ, feedback on whether ϕ is or is not true in M. Thus one can represent the selector as a partial function where at stage n the input is from {0, 1}^n and represents the feedback on the previous selections.
Definition 5. A selector (for P) is a partial function f from {0, 1}* into D. Let a selector f be given.
– We say that a member σ of {0, 1}* is consistent with W and f just in case there exists M ∈ W such that for all τ ∈ {0, 1}* and i ∈ {0, 1} with τ⌢i ⊆ σ, f(τ)↓, and M |= f(τ) iff i = 1.
– We say that f is successful (in P) iff for every M ∈ W, there is a string t of finitely many 0’s and infinitely many 1’s such that every finite initial segment of t is consistent with W and f, and for all ϕ ∈ DiagD (M), either f(σ) = ϕ or f(σ) = ¬ϕ for some finite initial segment σ of t.
– Let X be the set of all σ ∈ {0, 1}* that are consistent with W and f, and that end with a 0. We denote by Rf the binary relation over X such that for all σ, τ ∈ X, Rf(σ, τ) holds iff τ ⊂ σ.

The selective complexity of P is the least ordinal of the form |Rf|, where f is a selector that is successful in P and Rf is well-founded; if such an f does not exist, the selective complexity of P is undefined. The last notion of complexity we define is based on predictors. Whereas selectors have control over the formulas whose truth they want to predict, a predictor has to take the members of D as they come.
Definition 6. A predictor (for P) is a partial function f from (D × {0, 1})* × D into {0, 1}. Let a predictor f be given.
– Let σ = ((ϕ0, ε0), . . . , (ϕp, εp)) ∈ (D × {0, 1})* be given. We say that σ is consistent with W and f iff there exists M ∈ W such that for all i ≤ p:
  • f(σ|i, ϕi)↓, and
  • M |= ϕi iff εi = 1.
If σ is consistent with W and f, we call the number of i ≤ p such that f(σ|i, ϕi) ≠ εi the number of mispredictions that f makes on σ (in P).
– We say that f is successful (in P) iff for every member M of W and every t ∈ (D × {0, 1})^N, the following holds. Assume that every finite initial segment of t is consistent with W and f. Then there exists n ∈ N such that for all finite initial segments σ of t, f makes at most n mispredictions on σ.
– Let X be the set of all σ ∈ (D × {0, 1})* that are consistent with W and f, and on which f makes at least one misprediction. We denote by Rf the binary relation over X such that for all σ, τ ∈ X, Rf(σ, τ) holds iff τ ⊂ σ and f makes more mispredictions on σ than on τ.

The predictive complexity of P is the least ordinal of the form |Rf|, where f is a predictor that is successful in P and Rf is well-founded; if such an f does not exist, the predictive complexity of P is undefined. Clearly, if the predictive complexity of P is defined, then the selective complexity of P is defined, and at most equal to the former. The three notions of complexity we have introduced can, similarly to VC-dimension, be interpreted in terms of the outcome of a game between Anke and Boris. For instance, for predictive complexity, Anke selects formulas from D and Boris has to make a
prediction whether the formula holds (in the unknown world M) or not. Anke tells Boris whether the prediction is correct. If not, Boris has to count down an ordinal counter. The predictive complexity is then the least ordinal to start with for which Boris has a winning strategy.
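For a finite set of worlds, Boris's countdown can be realized by the classical halving strategy, a standard technique that is not taken from the paper: predict by majority vote over the remaining candidate worlds; every misprediction then eliminates at least half of them, so at most log2 |W| mispredictions occur, and a natural-number counter suffices. A sketch, with worlds coded as sets of true atoms:

```python
from math import log2

def halving_game(worlds, data, truth):
    """Play the prediction game with the halving strategy.

    worlds: candidate worlds, each a set of atoms taken to be true;
    truth:  the world Anke has in mind; data: the atoms Anke asks about.
    Returns the number of mispredictions made.
    """
    candidates = list(worlds)
    mispredictions = 0
    for phi in data:
        vote = sum(phi in w for w in candidates)
        prediction = vote * 2 >= len(candidates)   # majority says 'true'
        actual = phi in truth
        if prediction != actual:
            mispredictions += 1
        # Keep only the worlds that agree with the revealed answer.
        candidates = [w for w in candidates if (phi in w) == actual]
    return mispredictions

worlds = [set(), {0}, {1}, {0, 1}]
m = halving_game(worlds, data=[0, 1, 0, 1], truth=set())
assert m <= int(log2(len(worlds)))   # at most 2 mispredictions
print(m)
```

Starting Boris's counter at log2 |W| is thus always enough in the finite case; the ordinal counters of the definitions above generalize exactly this bookkeeping to infinite classes.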
4  Relationships between the Complexity Measures
Let us first introduce an alternative characterization of the predictive complexity, followed by a definition and a lemma that are just technical tools for proving some of the main propositions.

Notation 7. Let α ∈ Ord be given, and suppose that Γβ has been defined for all β < α. Let Γα be the set of all subsets U of W such that for all ϕ ∈ D, U ∩ Mod(ϕ) or U ∩ Mod(¬ϕ) belongs to ⋃β<α Γβ ∪ {∅}.

Property 8. The predictive complexity of P is defined iff W ∈ ⋃α∈Ord Γα; in this case it is equal to the least ordinal α with W ∈ Γα.
Definition 9. A shatterer (for P) is a sequence (Yα)α∈Ord of sets of subsets of D with the following properties.
– W shatters all members of ⋃α∈Ord Yα;
– for all α, β ∈ Ord with β ≤ α and for all D, E ⊆ D with D ⊆ E, if E ∈ Yα then D ∈ Yβ;
– for all α, β ∈ Ord with β < α and for all E ∈ Yα, Yβ contains a proper superset of E.

Lemma 10. Let a sequence (Mα)α∈Ord be inductively defined as follows. For all α ∈ Ord, Mα is the set of all finite subsets D of D such that W shatters D and for all β < α, Mβ contains a proper superset of D.
– (Mα)α∈Ord is a shatterer.
– For all shatterers (Yα)α∈Ord and all α ∈ Ord, Yα ⊆ Mα.
– There exists a least ordinal α with Mα = Mα+1 ∪ {∅}; the VC-dimension of P is defined and equal to α if Mα = {∅}, and undefined otherwise.

We are now in a position to state and prove one of the main results of the paper, which relates VC-dimension to predictive complexity.

Proposition 11. Suppose that D is closed under boolean operators. Then the VC-dimension of P is defined iff the predictive complexity of P is defined; moreover, if they are defined then they are equal.
Proof. Given α ∈ Ord, let Γα denote the set of subsets of W as defined in Notation 7. For all α ∈ Ord, set Zα = {U ⊆ W : U ∉ ⋃β<α Γβ ∪ {∅}}. For all ordinals α, let Yα be the set of all finite subsets E of D such that for all D ⊆ E, {M ∈ W : DiagE (M) = D} ∈ Zα. We show that (Yα)α∈Ord is a shatterer. Using the facts that ∅ ∉ Zα and Γβ ⊆ Γα for all α, β ∈ Ord with β ≤ α, it is immediately verified that the first two conditions in Definition 9 are satisfied. For the third condition, let α, β ∈ Ord with β < α and E ∈ Yα be given. Given D ⊆ E, set ϕD = ⋀D ∧ ¬⋁(E \ D). By the definition of Yα, ModW (ϕD) ∈ Zα. Moreover, ModW (ϕD) ∉ Γβ. Hence there exists ψD ∈ D such that ModW (ϕD ∧ ψD) ∈ Zβ and ModW (ϕD ∧ ¬ψD) ∈ Zβ. Set ϕ = ⋁{ϕD ∧ ψD : D ⊆ E}. It is immediately verified that E ∪ {ϕ} ∈ Yβ. Assume that the predictive complexity of P is undefined. Let an ordinal α be given. Then W ∉ Γα, hence W ∈ Zα+1, hence ∅ ∈ Yα+1. This implies that Yα contains a nonempty set, and by Lemma 10, the VC-dimension of P is not equal to α. It follows that the VC-dimension of P is undefined. Assume that the predictive complexity of P is defined and equal to an ordinal γ. So W ∈ Γγ ∩ Zγ. Let ϕ ∈ D be given, and suppose that {ϕ} ∈ Yγ. Then both ModW (ϕ) and ModW (¬ϕ) are members of Zγ, which is in contradiction with W ∈ Γγ. Hence Yγ = {∅}, and by Lemma 10 again, the VC-dimension of P is at least equal to γ. It is easily verified that the VC-dimension of P is defined and at most equal to γ, which completes the proof of the proposition.

The assumption that D is closed under negation is essential in Proposition 11, as shown in the next result.

Proposition 12. For every nonnull n ∈ N, there exists a finite vocabulary V′ and a finite set B of formulas over V′ with the following property. Suppose that V = V′, W is the set of Herbrand models of B, and D is the set of positive quantifier-free sentences.
Then the VC-dimension of P is equal to 1 and the predictive complexity of P is equal to ω × n.

Proof. Let a nonnull n ∈ N be given. Let V′ consist of a unary function symbol s, a binary predicate symbol <, an n-ary predicate symbol P and equality. Let B consist of the following formulas, which express that P is a predicate on N^n that is downward closed for the lexicographic ordering.
– ∀x(x < s(x)) ∧ ∀xyz((x < y ∧ y < z) → x < z) ∧ ∀xy(x < y → ¬(y < x));
– ∀x1 . . . xn y1 . . . yn ((
Proposition 13. Assume that D is closed under negation and that W is compact. If the identification complexity of P is defined and equal to an ordinal α, then the predictive complexity of P is defined and at most equal to ω × α.

Proof. Suppose that the identification complexity of P is defined and equal to an ordinal α. Choose a canonical identifier f for P. Let a predictor h be defined as follows. Given a member σ = ((ϕ1, ε1), . . . , (ϕn, εn)) of (D × {0, 1})*, put σ̃ = (ψ1, . . . , ψn) where for all nonnull i ≤ n, ψi = ϕi if εi = 1 and ψi = ¬ϕi otherwise. Let σ ∈ (D × {0, 1})* and ϕ ∈ D be given. If ModW (rng(σ̃)) ⊆ Mod(ϕ) then h(σ, ϕ) = 1. If ModW (rng(σ̃)) ⊆ Mod(¬ϕ) then h(σ, ϕ) = 0. If neither ModW (rng(σ̃)) ⊆ Mod(ϕ) nor ModW (rng(σ̃)) ⊆ Mod(¬ϕ), and if f(σ̃) is defined and is a model of ϕ, then h(σ, ϕ) = 1. Otherwise h(σ, ϕ) = 0. It is immediately verified that h is successful in P. The proof of the proposition is completed if we show that the length of Rh is smaller than ω × α. Let a member of the field of Rh of the form σ⌢(ϕ, 0) be given. Let X be the set of all τ ∈ (D ∪ {#})* such that rng(σ̃) ⊆ rng(τ), {ϕ, ¬ϕ} ∩ rng(τ) ≠ ∅, τ is consistent in W, and f(τ)↓. By the choice of f, and since the restriction of W to ModW (rng(σ̃)) is compact, there exists a finite F ⊆ X such that ModW (rng(σ̃)) = ⋃{ModW (rng(τ)) : τ ∈ F}. Let n be the sum of the lengths of the members of F. It is easy to verify that ρRh(σ⌢(ϕ, 0)) ≤ ω × supτ∈F ρRf(τ) + n. This implies immediately that |Rh| is at most equal to ω × α, as wanted.

When D is closed under negation, identification and selective complexities are very close concepts, as shown next.

Proposition 14. Suppose that D is closed under negation. If the selective complexity of P is defined and equal to an ordinal α, then the identification complexity of P is defined and at most equal to α + 1.

Proof. Let a selector g be successful in P and such that the length of Rg is defined and equal to an ordinal α.
Without loss of generality, we can assume that for all σ, τ ∈ {0, 1}*, if σ ⊂ τ then g(σ) ≠ g(τ). Let an identifier f be defined as follows. Let σ = (ϕ0, . . . , ϕn) ∈ (D ∪ {#})* be such that cnt(σ) is consistent in W. We define a p ∈ N, a member σ̃ of {0, 1}*, and a finite sequence (ψi)i<p
Proposition 15. Suppose that D is closed under boolean operators, W is countable, and W is compact. Then there exist ordinals α, β such that:
– ω × α ≤ β ≤ ω × (α + 1);
– the VC-dimension of P is equal to β;
– the predictive complexity of P is equal to β;
– the identification complexity of P is equal to α + 1;
– the selective complexity of P is equal to α or α + 1.
Proof. We first show that the VC-dimension of P is defined, which, together with Propositions 11 and 14, implies immediately that the predictive complexity of P, the identification complexity of P, and the selective complexity of P are also defined. Suppose for a contradiction that there exists an infinite subset X of D such that for all finite E ⊆ X and D ⊆ E, D = DiagE (M) for some M ∈ W. Since W is compact, it follows that for all Y ⊆ X, Y = DiagX (M) for some M ∈ W, which contradicts the assumption that W is countable. Thus every infinite subset of D has a finite subset that W does not shatter, and the VC-dimension of P is defined, as wanted. We now show that the identification complexity of P is a successor ordinal. Choose a canonical identifier f for P. It suffices to show that the length of Rf is not a limit ordinal. Suppose otherwise for a contradiction. Let X be the set of all finite subsets D of D such that f(D)↓. Since W is compact and f is successful in P, there exists a finite subset F of X such that W = ⋃{ModW (D) : D ∈ F}. This implies that |Rf| is equal to sup{ρRf(D) + 1 : D ∈ F}, hence is smaller than |Rf| since F is finite and |Rf| is a limit ordinal. Contradiction. Denote by α + 1 the identification complexity of P, and by β the predictive complexity of P. By Proposition 11, the VC-dimension of P is equal to β. By Proposition 14, the selective complexity of P is at least equal to α. We show that the selective complexity of P is at most equal to α + 1. Choose a set-driven identifier f such that f is successful in P and the length of Rf is defined and equal to α + 1. Fix an enumeration (ϕi)i∈N of D. We define by induction a sequence (hi)i∈N of selectors with finite domains such that for all distinct i, j ∈ N, hi and hj have disjoint domains. Let i ∈ N be given, and suppose that hj has been defined for all j < i. If i = 0, put Z = {()}. If i ≠ 0, let Z be the set of all ⊆-maximal members of the domain of hi−1. Let σ ∈ Z be given.
If σ = (), let σ̃ = σ. If σ is nonempty and equal to (ε0, . . . , εk), denote by σ̃ the sequence (ψ0, . . . , ψk) of members of D such that for all j ≤ k, ψj = f(σ|j) if εj = 1, and ψj = ¬f(σ|j) otherwise. If ModW (rng(σ̃)) ⊆ Mod(ϕi) then set hi(σ) = ϕi. If ModW (rng(σ̃)) ⊆ Mod(¬ϕi) then set hi(σ) = ¬ϕi. Suppose that neither ModW (rng(σ̃)) ⊆ Mod(ϕi) nor ModW (rng(σ̃)) ⊆ Mod(¬ϕi). Let Xσ be the set of all τ ∈ (D ∪ {#})* such that rng(σ̃) ⊆ rng(τ), {ϕi, ¬ϕi} ∩ rng(τ) ≠ ∅, τ is consistent in W, and f(τ)↓. By the choice of f, and since the restriction of W to ModW (rng(σ̃)) is compact, there exists a finite Fσ ⊆ Xσ such that ModW (rng(σ̃)) = ⋃{ModW (rng(τ)) : τ ∈ Fσ}. Without loss of generality, we can assume that there exist ψ0, . . . , ψk ∈ D such that for all τ ∈ Fσ and for all j ≤ k, either ψj or ¬ψj occurs in τ, and every formula that occurs in τ is of the form ψj or ¬ψj for some j ≤ k. Let n be the cardinality of Fσ,
and fix an enumeration (Dp)p<n
Let σ ∈ (D ∪ {#})* be such that f(σ)↓, and let γ be the least ordinal with ModW (cnt(σ)) ∈ Γγ. Suppose for a contradiction that γ is neither 0 nor a limit ordinal. It then follows from the definition of Γγ that there exists ϕ ∈ D such that Γγ−1 contains neither ModW (cnt(σ) ∪ {ϕ}) nor ModW (cnt(σ) ∪ {¬ϕ}), which is impossible by the definition of f. Let λ be the number of limit ordinals less than or equal to β. We infer easily from the previous remark that α + 1 ≤ λ + 1, hence ω × α ≤ ω × λ ≤ β, and we are done.

We illustrate the previous results, especially the bounds that have been obtained, with a few examples.

Example 16. Let V consist of 0, s, and a unary predicate symbol P. Let W be the set of Herbrand models of ∀x(P(x) ∧ P(s(x)) → P(s(s(x)))) ∧ ∃y(P(y) ∧ P(s(y))). Then the identification complexity of P is 1 but the predictive complexity of P is undefined.

Example 17. Assume that D is closed under boolean operators. If the VC-dimension of P is a nonnull n ∈ N, then both the identification and the selective complexity of P are 1, and the predictive complexity of P is n.

Proposition 18. Suppose that V consists of 0, s, and a unary predicate symbol P. Set D = {P(n) : n ∈ N}. Assume that W is the set of Herbrand structures whose D-diagrams are {P(nk) : n ∈ N} for k ∈ N. Then the identification complexity of P is 2, and both the VC-dimension and the predictive complexity of P are ω.

Proof. It is easily verified that an identifier can be successful in P by first hypothesizing {P(0)}, and, in the case of an environment for the structure whose D-diagram is T = {P(nk) : n ∈ N} for some nonnull k ∈ N, by eventually changing {P(0)} to T. It follows that the identification complexity of P is 2. It is
easily verified that the VC-dimension and the predictive complexity of P are at most equal to ω. To see that they are at least equal to ω, let a nonnull n ∈ N be given, and let p0, p1, . . . , p2^n−1 be an enumeration of the first 2^n prime numbers. For all k < n, let qk be the product of all pi, 0 ≤ i < 2^n, such that the (k + 1)st bit in the binary representation of i is equal to 1. It is immediately verified that W shatters {P(q0), P(q1), . . . , P(qn−1)}, implying that the VC-dimension of P is at least equal to n, hence the predictive complexity of P is also at least equal to n.

Example 19. Suppose that V consists of 0, s, and a unary predicate symbol P. Assume that W is the set of Herbrand structures M such that for all n ∈ N, if M |= P(n) then there exist at most n natural numbers m such that M |= P(m). Suppose that D is the set of quantifier-free sentences. Then the identification complexity of P is ω + 1, the selective complexity of P is ω, and the VC-dimension of P is ω^2. Note that if in the previous example we remove from W the structure M such that M |= ∀x¬P(x) (resulting in a noncompact topological space W), then the identification complexity becomes ω; but in both cases, the traditional notion of mind change complexity is ω.

Proposition 20. Let a countable ordinal α be given. There exists a finite vocabulary V′ and two sets K and E of sentences over V′ such that if V = V′, D = E, and W is the set of Herbrand models of K, then the identification complexity of P is α + 1, the selective complexity of P is α, and both the VC-dimension and the predictive complexity of P are ω × α.

Proof. Suppose that V′ consists of a constant, a unary function symbol, a unary predicate symbol P, and a binary predicate symbol ≤. Identify the set X of closed terms with the set of ordinals smaller than ω^α, writing X = {cβ : β < ω^α}. Let D be the closure of {P(cβ) : β < ω^α} under boolean operators.
Suppose that W is the set of Herbrand models of {cβ ≤ cγ : β, γ < ω^α, β ≤ γ} ∪ {∀x∀y((P(x) ∧ x ≤ y) → P(y))}.
We define by induction, for all members σ of (D ∪ {#})*, three ordinals: counterσ, aboveσ, and belowσ. Set counter() = α, above() = ω^α and below() = 0. Let σ ∈ (D ∪ {#})* and x ∈ D ∪ {#} be given, and suppose that counterσ, aboveσ and belowσ have been defined. Let aboveσ⌢x be the minimal ordinal β such that cβ occurs in σ⌢x if such a β exists, and let aboveσ⌢x be ω^α otherwise. Let belowσ⌢x be the maximal ordinal β such that cβ occurs in σ⌢x if such a β exists, and let belowσ⌢x be 0 otherwise. If there exist (unique) ordinals β, γ such that β < counterσ, aboveσ = ω^β × γ and belowσ + ω^β ≥ aboveσ, then set counterσ⌢x = β; otherwise set counterσ⌢x = counterσ. Let f be the identifier defined as follows. Let σ ∈ (D ∪ {#})* be given. If aboveσ = ω^α then f(σ) is the D-diagram of the member M of W such that M |= ∀x¬P(x); otherwise, f(σ) is the D-diagram of the member M of W such that M |= ∀x(P(x) ↔ x ≥ c(aboveσ)).
It is immediately verified that f is successful in P, and that the length of Rf is defined and equal to α + 1. It is also easily verified that the identification complexity of P cannot be smaller than α + 1, and that the selective complexity of P is α. Since W is a compact topological space, it follows from Propositions 11 and 15 that the VC-dimension and the predictive complexity of P are equal, and at most equal to ω × α. So to complete the proof of the proposition, it suffices to show that the VC-dimension of P is at least equal to ω × α. Suppose for a contradiction that the VC-dimension of P is equal to an ordinal α′ with α′ < ω × α. The argument uses the presentation of the VC-dimension as a game between Anke and Boris. At round n, Anke defines a set of ordinals On and a formula ψn, then outputs a set En, before Boris outputs an ordinal βn. Set O0 = {0}, ψ0 = P(cα), E0 = {ψ0}, and β0 = α′. Let n ∈ N be given and suppose that On, ψn, En and βn have been defined. Let the ordinal γ and m ∈ N be such that βn = ω × γ + m. Set On+1 = On ∪ {δ + ω^γ × 2^m : δ ∈ On}. Let ψn+1 be a formula expressing that the cardinality of the set of members x of On+1 that have property P is odd. Let En+1 = {ψ0, . . . , ψn+1}. Note that On+1 expands On with exactly one ordinal between any two ordinals in On with no ordinal of On in between, plus an ordinal greater than all ordinals in On. It is easily verified that for all n ∈ N, W shatters En. Hence Anke wins the game, a contradiction.

If the previous example were modified by taking {P(cβ) : β < ω^α} as the set of possible data, resulting in a new paradigm P′, then the identification complexity of P′ would clearly be equal to ω^α + 1. More generally, there exists a relationship between the identification complexities of two paradigms that differ in their sets of possible data, one set of data being closed under negation in one paradigm, and not in the other.
This relationship is considered in the next proposition, and the previous example shows that the “exponential” difference between the identification complexities of both paradigms is almost as large as it can be. First note that Angluin’s finite telltale condition takes the following form in the logical framework.

Lemma 21. Some identifier is successful in P iff for all M ∈ W, there exists a finite A ⊆ DiagD (M) such that no N ∈ W satisfies A ⊆ DiagD (N) ⊂ DiagD (M).

Proposition 22. Suppose that D is the closure under negation of some set of sentences D′, and set P′ = (V, D′, W). Assume that W is compact, the identification complexity of P is defined and equal to a nonnull ordinal α, and some identifier is successful in P′. Then the identification complexity of P′ is defined and smaller than ω^α.

Proof. The proof is trivial if {DiagD (M) : M ∈ W} is finite, so suppose otherwise. Let (ψi)i∈N be a repetition-free enumeration of D′. Let X be the set of finite sequences of the form (ϕ0, . . . , ϕn−1), n ∈ N, where for all i < n, ϕi = ψi or ϕi = ¬ψi, such that {ϕ0, . . . , ϕn−1} is consistent in W. Let f be a canonical identifier for P. Let a finite subset E of D be consistent in W. Denote by UE the set of all ⊆-minimal members σ of X such that f(σ) is defined and contains E.
Suppose for a contradiction that UE is infinite. By König’s lemma, there exists a sequence (ξi)i∈N of formulas such that for all n ∈ N, (ξ0, . . . , ξn−1) = σ|n for some σ ∈ UE. Note that E ⊆ rng(σ) for cofinitely many members σ of UE, hence E ⊆ {ξi : i ∈ N}. Moreover, for all distinct σ, τ ∈ UE, neither σ ⊆ τ nor τ ⊆ σ, hence σ ⊄ (ξ0, ξ1, . . .) for cofinitely many members σ of UE. Hence f((ξ0, . . . , ξn)) is either undefined or does not contain E for infinitely many n ∈ N. By compactness of W, {ξi : i ∈ N} is the D-diagram of a member of W. But this is in contradiction with the fact that f is successful in P. So we have verified that UE is finite. Note that since f is successful in P, ModW (E) is included in ⋃{ModW (cnt(σ)) : σ ∈ UE}. Let (σ1, . . . , σp) be a repetition-free enumeration of UE with ρRf(σ1) ≥ . . . ≥ ρRf(σp). Set βE = ω^ρRf(σ1) + . . . + ω^ρRf(σp). Let finite E, F ⊆ D be such that E ⊆ F, UF \ UE ≠ ∅, and UF is nonempty. Let τ ∈ UF \ UE be given. Then there exists i ∈ {1, . . . , p} such that σi ⊂ τ, f(σi) ≠ f(τ), and σi ∉ UF. Let D be the set of all members of UF that strictly extend σi. Clearly, ρRf(σi) > ρRf(γ) for all γ ∈ D. Moreover, the factor ω^ρRf(σi) in the sum βE is replaced in the sum βF by the factors ω^ρRf(γ), with γ varying over D. Due to the powers of ω, it is easily verified that the extra factors ω^ρRf(γ), γ ∈ D, ‘weigh less’ than ω^ρRf(σi). Since every σi ∈ UE is either also in UF or replaced in UF by a set D with the properties just described, we conclude that βE > βF. Fix a repetition-free enumeration (Ti)i∈N of {DiagD′ (M) : M ∈ W}. By Lemma 21 applied to P′, choose for all i ∈ N a finite subset Ai of Ti such that no member M of W satisfies Ai ⊆ DiagD′ (M) ⊂ Ti. Let Y be the set of all σ ∈ (D′ ∪ {#})* such that Ai ⊆ cnt(τ) ⊆ Ti for some i ∈ N and some initial segment τ of σ. For all σ ∈ Y, denote by σ̂ the ⊆-maximal initial segment of σ such that Ai ⊆ cnt(σ̂) ⊆ Ti for some i ∈ N.
Now define an identifier f for P as follows. Let σ ∈ (D ∪ {#})∗ be given. If σ ∉ Y then f(σ)↑; otherwise, f(σ) = Ti where i ∈ N is least with Ai ⊆ cnt(σ̄) ⊆ Ti. Let σ, τ ∈ (D ∪ {#})∗ be such that σ ⊂ τ, and f(σ) and f(τ) are defined but distinct. Then both σ and τ belong to Y. Clearly, there exists i ∈ N such that cnt(σ) ⊆ Ti but cnt(τ) ⊈ Ti, which implies that Ti ∈ {ModW(cnt(γ)) : γ ∈ Ucnt(τ)} \ {ModW(cnt(γ)) : γ ∈ Ucnt(σ)}. We infer that Ucnt(τ) is nonempty and distinct from Ucnt(σ), which we know implies that βcnt(σ) > βcnt(τ). It then follows that the height of Rf is defined and at most equal to ω^α. Moreover, since W is compact, the reasoning in Proposition 15 shows that the identification complexity of P is not a limit ordinal, hence is smaller than ω^α, as wanted. It was shown in [11] that every class of languages that is finitely identifiable from informants is also identifiable in the limit from texts. But such a relationship does not generalize to identifiability from informants with at most one mind change. Indeed, the class C consisting of N and all finite initial segments of N is not learnable in the limit from texts, as proved in [8], whereas C is clearly learnable from informants with at most one mind change. Cast into the logical framework,
E. Martin, A. Sharma, and F. Stephan
this provides an example of V, W, D, and D′ such that D′ is the closure of D under negation, W is compact, and the identification complexity of P is equal to 1, but no identifier is successful in P′ = (V, D′, W). This shows that Proposition 22 does not hold if the assumption that some identifier is successful in P′ is dropped.
5 Conclusion
In ideal paradigms of inductive inference, finite tell-tale conditions offer characterizations of identification in the limit, or of classification, with or without a mind change bound. Assuming that the set of data is closed under boolean operators, the VC-dimension offers a characterization of prediction. An extra topological assumption of compactness makes it possible to provide a complete picture of the relationship between VC-dimension and other notions of complexity, including mind change bound complexity.
References
1. Ambainis, A., Jain, S., Sharma, A.: Ordinal Mind Change Complexity of Language Identification. Theoretical Computer Science 220(2), pp. 323–343 (1999)
2. Ambainis, A., Freivalds, R., Smith, C.: Inductive Inference with Procrastination: Back to Definitions. Fundamenta Informaticae 40, pp. 1–16 (1999)
3. Angluin, D.: Inductive Inference of Formal Languages from Positive Data. Information and Control 45, pp. 117–135 (1980)
4. Ben-David, S., Gurvits, L.: A Note on VC-Dimension and Measure of Sets of Reals. Combinatorics, Probability and Computing 9, pp. 391–405 (2000)
5. Ben-David, S., Jacovi, M.: On Learning in the Limit and Non-Uniform (ε, δ)-Learning. In: Proceedings of the Sixth Conference on Computational Learning Theory. ACM Press, pp. 209–217 (1993)
6. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik-Chervonenkis Dimension. J. ACM 36(4), pp. 929–965 (1989)
7. Freivalds, R., Smith, C.: On the Role of Procrastination for Machine Learning. Inform. Comput. 107(2), pp. 237–271 (1993)
8. Gold, E.: Language Identification in the Limit. Information and Control 10 (1967)
9. Martin, E., Sharma, A., Stephan, F.: A General Theory of Deduction, Induction, and Learning. In: Jantke, K., Shinohara, A. (eds.): Proceedings of the Fourth International Conference on Discovery Science. Springer-Verlag, LNAI 2226, pp. 228–242 (2001)
10. Martin, E., Sharma, A., Stephan, F.: Logic, Learning, and Topology in a Common Framework. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.): Proc. of the 13th Intern. Conf. on Algorithmic Learning Theory. Springer-Verlag, LNAI 2533, pp. 248–262 (2002)
11. Sharma, A.: A Note on Batch and Incremental Learnability. Journal of Computer and System Sciences 56, pp. 272–276 (1998)
12. Valiant, L.: A Theory of the Learnable. Commun. ACM 27(11), pp. 1134–1142 (1984)
13. Vapnik, V., Chervonenkis, A.: On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability and its Applications 16(2), pp. 264–280 (1971)
Learning of Erasing Primitive Formal Systems from Positive Examples

Jin Uemura and Masako Sato

Department of Mathematics and Information Sciences, Osaka Prefecture University, Sakai, Osaka 599-8531, Japan
{jin, sato}@mi.cias.osakafu-u.ac.jp
Abstract. An elementary formal system, EFS for short, introduced by Smullyan, is a kind of logic program over strings, and is regarded as a grammar to generate a language. Arikawa and his colleagues introduced some subclasses of EFSs which correspond to the Chomsky hierarchy, and showed that they constitute a useful framework for language learning. This paper considers a subclass of EFSs, called primitive EFSs, from the viewpoint of inductive inference from positive examples in the Gold framework. Shinohara showed that the class of languages generated by primitive EFSs is inferable from positive examples, where ε substitutions, i.e., substitutions that may substitute the empty string for variables, are not allowed. In the present paper, we allow ε substitutions, and call such EFSs erasing EFSs. An erasing pattern language is a language generated by an erasing EFS with just one axiom; it is unknown whether or not the class of erasing pattern languages is learnable from positive examples. We first show that the class PFSL of languages generated by erasing primitive EFSs does not have finite elasticity, but has M-finite thickness. The notions of finite elasticity and M-finite thickness were introduced by Wright, and by Moriyama and Sato, respectively, to present sufficient conditions for learnability from positive examples. Moriyama and Sato showed that a language class with M-finite thickness is learnable from positive examples if and only if for each language in the class there is a finite tell-tale set of the language. We then show that the class PFSL is learnable from positive examples by presenting a finite tell-tale set for each language in the class.
1 Introduction
An elementary formal system, EFS for short, is a kind of logic program over strings, consisting of finitely many axioms. A pattern is a finite string of constant symbols and variables. A pattern is regular if each variable appears in the pattern at most once. In EFSs, patterns are used as terms in logic programming. For example, Γ = {p(ab) ←, p(axb) ← p(x)} is an EFS with two axioms, where p is a unary predicate symbol, a and b are constant symbols, and x is a variable; the patterns ab and axb are used as terms.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 69–83, 2003. © Springer-Verlag Berlin Heidelberg 2003
An EFS generates a language of constant strings, obtained by applying substitutions for variables and Modus Ponens to axioms in the EFS. In the above example, the language generated by Γ is L(Γ) = {a^n b^n | n ≥ 1}. The framework of EFSs was introduced by Smullyan [14] to develop his recursive function theory. Arikawa and his colleagues [2] introduced some subclasses of EFSs whose language classes correspond to the Chomsky hierarchy. Among them, the class of length-bounded EFSs, which generates the class of context-sensitive languages, was especially investigated from the viewpoint of learning languages from positive examples in the Gold framework [4] (Shinohara [13], Moriyama & Sato [5], Mukouchi & Arikawa [7], Sato [10]).

Wright [16] introduced the notion of finite elasticity as a sufficient condition for learnability from positive examples. Shinohara [13] showed that the class of languages generated by length-bounded EFSs with at most n axioms has finite elasticity, where ε substitutions, i.e., substitutions that may substitute the empty string for variables, are not allowed. In the present paper, we allow ε substitutions, and call such EFSs erasing EFSs. As is easily seen, the language generated by an erasing EFS generally differs from that generated by the nonerasing EFS with the same axioms. The language of an EFS Γ = {p(π) ←} is called an erasing or extended pattern language if Γ is erasing, and a nonerasing pattern language otherwise. It is well known that the class of nonerasing pattern languages is learnable from positive examples ([1]), but this is unknown for the class of erasing pattern languages. Note that the latter class was shown not to be learnable from positive examples when the number of constant symbols is two ([9]). The present authors [15] proved a compactness theorem for bounded unions of erasing regular pattern languages, which plays an essential role in designing an efficient learning algorithm for unions.
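The generation process in the example Γ = {p(ab) ←, p(axb) ← p(x)} can be made concrete with a short sketch (illustrative only, not part of the paper): starting from the base axiom, we repeatedly apply the induction step, substituting an already derived string for the variable x and using Modus Ponens.

```python
def efs_language(max_len):
    """Enumerate L(Γ) for Γ = {p(ab) <-, p(axb) <- p(x)} up to max_len.

    The base axiom contributes "ab"; the induction step wraps any
    already-derived string w into "a" + w + "b" (Modus Ponens after
    substituting w for the variable x).
    """
    derived = {"ab"}                      # base axiom p(ab) <-
    changed = True
    while changed:
        changed = False
        for w in list(derived):
            new = "a" + w + "b"           # induction step p(axb) <- p(x)
            if len(new) <= max_len and new not in derived:
                derived.add(new)
                changed = True
    return derived

print(sorted(efs_language(8), key=len))   # ['ab', 'aabb', 'aaabbb', 'aaaabbbb']
```

Up to length 8 this yields exactly {a^n b^n | 1 ≤ n ≤ 4}, matching L(Γ) = {a^n b^n | n ≥ 1}.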
An EFS is regular if each axiom in the EFS is of the form p(π) ← q1(x1), · · · , qn(xn), where π is a regular pattern, p and the qi's are unary predicate symbols, and x1, · · · , xn are all of the mutually distinct variables appearing in π. The class of languages generated by erasing regular formal systems, RFSs for short, corresponds to that of context-free languages ([2]). Recently Mukouchi [8] has shown that the class of languages generated by erasing RFSs with at most n axioms has finite elasticity, and so is learnable from positive examples, provided every regular pattern π in the axioms is of a restricted form called canonical, i.e., contains no successive variables. In this paper, we deal with a subclass of RFSs called primitive formal systems, PFSs for short, consisting of exactly two axioms of the forms p(π) ← and p(τ) ← p(x1), · · · , p(xn), where τ may be of noncanonical form, though some condition is imposed when π = ε. The class of nonerasing PFSs was first introduced by Shinohara [12], and is a proper subclass of RFSs. Note that Mukouchi [8] has shown that the class of languages generated by erasing EFSs with two axioms whose patterns are not necessarily regular is not learnable from positive examples when ♯Σ ≥ 2.
Unfortunately, the class PFSL of languages generated by erasing PFSs is shown not to have finite elasticity. Moriyama and Sato [5] introduced a notion of M-finite thickness, and showed that a class with M-finite thickness is learnable from positive examples if and only if for each language in the class there is a finite tell-tale set of the language. As well as finite elasticity, M-finite thickness is known to have some good properties, such as closure under various class operations, but M-finite thickness alone is not a sufficient condition for learnability ([10]). We first investigate the inclusion problem L(Γ) ⊆ L(Γ′) for given PFSs Γ and Γ′. Then we show that the class of languages generated by PFSs has M-finite thickness, as does the class of nonerasing length-bounded EFSs. Finally we show that the class PFSL is learnable from positive examples by showing that for each PFS Γ there is a finite tell-tale set of the language generated by Γ.
2 Erasing PFS Languages
Let Σ be a finite alphabet, X be a countable set of variables, and Π be a set of predicate symbols. We assume these sets Σ, X, Π are mutually disjoint. Each predicate symbol is associated with a positive integer called its arity. A pattern is a string (possibly the empty string ε) over Σ ∪ X. An atom is an expression of the form p(π1, · · · , πn), where p is a predicate symbol with arity n and π1, · · · , πn are patterns. A definite clause is a clause of the form A ← B1, · · · , Bm (m ≥ 0), where A, B1, · · · , Bm are atoms. The atom A is called the head and the part B1, · · · , Bm the body of the definite clause.

Definition 1. An elementary formal system, EFS for short, is a finite set of definite clauses. For an EFS Γ, each definite clause in Γ is called an axiom of Γ.

A substitution is a homomorphism from patterns to patterns that maps each symbol a ∈ Σ to itself. We permit ε substitutions, which may map some variables to the empty string. By πθ, we denote the image of a pattern π under a substitution θ. For an atom A = p(π1, · · · , πn) and a clause C = A ← B1, · · · , Bm, we define Aθ = p(π1θ, · · · , πnθ) and Cθ = Aθ ← B1θ, · · · , Bmθ. A definite clause C is provable from an EFS Γ, denoted by Γ ⊢ C, if C is obtained from Γ by finitely many (possibly 0) applications of substitutions and Modus Ponens. We define the language L(Γ, p) = {w ∈ Σ∗ | Γ ⊢ p(w)}, where p is a unary predicate symbol.

Definition 2. An EFS Γ is a simple formal system, an SFS for short, if each clause in Γ is of the form p(π) ← q1(x1), · · · , qn(xn), where p and the qi's are unary predicate symbols and x1, · · · , xn are mutually distinct variables appearing in π. A pattern π is regular if each variable appears in π at most once. An SFS Γ is a regular formal system, an RFS for short, if all patterns in heads of clauses in
Γ are regular. An RFS Γ is a primitive formal system, a PFS for short, if it contains exactly two clauses of the forms p(π) ← and p(τ) ← p(x1), p(x2), · · · , p(xn),
where x1, x2, · · · , xn are all of the variables appearing in τ. The former is called the base step and the latter the induction step of Γ.

A language L is an erasing EFS (resp. SFS, RFS or PFS) language if L = L(Γ, p) for some EFS (resp. SFS, RFS or PFS) Γ and some unary predicate symbol p. A language L is an erasing regular pattern language if L = L(Γ, p) for some RFS Γ = {p(π) ←}. It can be shown that the class of erasing RFS languages corresponds to that of context-free languages (Arikawa et al. [2]). Thus a PFS language is context-free, but not always regular. In fact, the following PFS language is not regular.

Example 1. Let us consider the PFS Γ = {p(ε) ←, p(axb) ← p(x)}, where Σ = {a, b}. Then, as easily seen, L(Γ, p) = {a^n b^n | n ≥ 0}.

In this paper, we deal with the class of PFS languages. In what follows, we fix a unary predicate symbol, say p, and denote L(Γ, p) simply by L(Γ). Moreover, we denote a PFS Γ = {p(π) ←, p(τ) ← p(x1), · · · , p(xn)} and L(Γ) by Γ = (π, τ) and L(π, τ), respectively. Similarly, by L(π) we denote L({p(π) ←}). For patterns π and τ, we introduce binary relations ⪯ and ≡ as follows: π ⪯ τ if π = τθ for some substitution θ, and π ≡ τ if π ⪯ τ and τ ⪯ π. A renaming of variables is a substitution θ such that xθ ∈ X for every x ∈ X, and x ≠ y implies xθ ≠ yθ. In this paper, we identify patterns that are equivalent to each other by renaming. Thus ax1bx2 = ax2bx1, and ax1bx2 ≡ ax1x2bx3 but ax1bx2 ≠ ax1x2bx3. A pattern π is of canonical form if π ≡ τ implies |π| ≤ |τ| for any pattern τ, and π contains exactly n variables x1, · · · , xn for some integer n such that the leftmost occurrence of xi is to the left of the leftmost occurrence of xi+1 for each i, where |π| is the length of the pattern π.

Lemma 1 (Shinohara [11]). Suppose that ♯Σ ≥ 3. Let π and τ be regular patterns. Then (i) π ⪯ τ ⇐⇒ L(π) ⊆ L(τ), and (ii) π ≡ τ ⇐⇒ L(π) = L(τ).
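Since each variable of a regular pattern occurs at most once and an erasing substitution may map it to any string including ε, membership in an erasing regular pattern language reduces to regular-expression matching. The following sketch assumes our own encoding of patterns as token lists with variables named x1, x2, …:

```python
import re

def pattern_to_regex(pattern):
    """Translate a regular pattern, given as a list of tokens, into a regex.

    Constant tokens are matched literally; variable tokens (strings that
    start with 'x' in this encoding) become '.*', because an erasing
    substitution may map a variable to any string, including the empty one.
    """
    parts = []
    for tok in pattern:
        parts.append(".*" if tok.startswith("x") else re.escape(tok))
    return re.compile("^" + "".join(parts) + "$")

# L(ax1bx2): all strings that start with 'a' and contain a later 'b'.
lang = pattern_to_regex(["a", "x1", "b", "x2"])
print(bool(lang.match("ab")))      # True  (x1 = x2 = empty)
print(bool(lang.match("acbca")))   # True  (x1 = 'c', x2 = 'ca')
print(bool(lang.match("ba")))      # False
```

This is only a membership test; the syntactic relation π ⪯ τ of Lemma 1 coincides with language inclusion when ♯Σ ≥ 3.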
By the definition of L(π, τ), it follows that L(π, τ) = L(π) ∪ L′, where L′ is the set of strings obtained by applying the induction step of Γ at least once. Concerning an erasing regular pattern language L(π), Shinohara [11] showed that for any regular pattern π there is a unique canonical pattern π′ equivalent to π. Clearly the canonical pattern π′ has the form w0x1w1x2 · · · wn−1xnwn (w0, wn ∈ Σ∗, wi ∈ Σ+ (i = 1, · · · , n − 1)), and L(π) = L(π′), provided that ♯Σ ≥ 3. Hereafter, we assume that ♯Σ ≥ 3 and that the pattern π in a PFS Γ = (π, τ) is of canonical form. On the other hand,
concerning τ in the induction step, we cannot assume that τ is of canonical form. Indeed, let Γ = (aa, bx1x2b). As easily seen, b(aa)(aa)b ∈ L(Γ). On the other hand, for Γ′ = (aa, bx1b), we have b(aa)(aa)b ∉ L(Γ′). Thus L(Γ) ≠ L(Γ′), even though bx1x2b ≡ bx1b. For Γ = (π, τ), we define the following particular pattern: τπ = τ{x := π | x appears in τ}, where the variables of the copies of π substituted for the variables of τ are renamed to be mutually distinct, so that τπ is always regular. Then, for every w ∈ L(Γ), we have |w| ≥ min{|c(π)|, |c(τπ)|}, where c(π) is the string obtained from π by substituting the empty string for all variables. For Γ = (π, τ), τ = x means L(Γ) = L(π); that is, the induction step p(x) ← p(x) is redundant, and thus we assume τ ≠ x. A string w ∈ Σ+ is a multiple string of a string u, called a component, if w = u^l for some l ≥ 2, and w is a multiple string if it has a component. A component u of w is maximal if there is no component u′ of w satisfying |u′| > |u|. We denote by PFS the set of all PFSs except for PFSs Γ = (ε, τ) such that τε is a multiple string.

Definition 3. A PFS Γ is reduced if L(Γ′) ⊊ L(Γ) for any Γ′ ⊊ Γ.
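The construction of τπ can be sketched as follows (patterns encoded as token lists, an encoding of our own choosing); the example reproduces the string b(aa)(aa)b that distinguishes Γ = (aa, bx1x2b) from Γ′ = (aa, bx1b):

```python
def tau_pi(tau, pi):
    """Build τ_π: substitute a fresh-variable copy of π for each variable of τ.

    Patterns are token lists; variables are tokens starting with 'x'.
    Renaming the copies of π apart keeps every variable unique, which is
    why τ_π is always a regular pattern.  (Every variable of τ is replaced,
    so the fresh names cannot collide with remaining variables.)
    """
    fresh = 0
    out = []
    for tok in tau:
        if tok.startswith("x"):
            for t in pi:
                if t.startswith("x"):
                    fresh += 1
                    out.append(f"x{fresh}")   # renamed copy of a variable of π
                else:
                    out.append(t)
        else:
            out.append(tok)
    return out

# τ = bx1x2b, π = aa: τ_π = b aa aa b, whose erased form c(τ_π) = "baaaab"
print(tau_pi(["b", "x1", "x2", "b"], ["a", "a"]))  # ['b', 'a', 'a', 'a', 'a', 'b']
```

With τ = bx1b instead, the result is b aa b, whose erased form "baab" is shorter, matching the observation that the two induction steps generate different languages.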
3 Inductive Inference
We first give the notion of identification in the limit from positive examples ([4]). A language class L = L0, L1, · · · over Σ is an indexed family of recursive languages if there is a computable function f : N × Σ∗ → {0, 1} such that f(i, w) = 1 if w ∈ Li, and 0 otherwise, where N = {i | i ≥ 0}. The function f is called a membership function. Hereafter we confine ourselves to indexed families of recursive languages. An infinite sequence of strings w1, w2, · · · over Σ is a positive presentation of a language L if L = {wn | n ≥ 1} holds. An inference machine is an effective procedure M that runs in stages 1, 2, · · ·, and at each stage requests an example and produces a hypothesis in N based on the examples received so far. Let M be an inference machine and σ = w1, w2, · · · be an infinite sequence of strings. We denote by hn the hypothesis produced by M at stage n, after the examples w1, · · · , wn have been fed to M. M on input σ converges to h if there is an integer n0 ∈ N such that hn = h for every n ≥ n0. M identifies in the limit, or infers, a language L from positive examples if, for any positive presentation σ of L, M on input σ converges to some h with L = Lh. A class of languages L is inferable from positive examples if there is an inference machine that infers every language in L from positive examples. Angluin [1] gave a characterization: a language class L is inferable from positive examples if and only if there exists an effective procedure
that enumerates, for every language L in L, a set SL such that SL ⊆ L, SL is finite, and SL ⊈ L′ for all L′ ∈ L with L′ ⊊ L. The set SL is called a finite tell-tale set of L. Mukouchi [8] has shown that the class of SFS languages defined by at most two clauses is not inferable from positive examples, by considering the following infinite sequence of SFSs: Γn = {p(x1x2 · · · xn−1xnxnxn−1 · · · x2x1) ←} (n ≥ 1). We should note that the pattern x1x2 · · · xn−1xnxnxn−1 · · · x2x1 is not regular, and thus the above EFSs are not RFSs, but SFSs. It was shown that L(Γ1) ⊊ L(Γ2) ⊊ · · · ⊊ L(Γ) and ⋃_{n=1}^∞ L(Γn) = L(Γ) under ♯Σ ≥ 2, where Γ = {p(x1x1) ←, p(x1x2x1) ← p(x2)}.
It means that the language L(Γ) does not have any finite tell-tale set within the class. Angluin [1] gave a very useful sufficient condition for inferability called finite thickness. The class of erasing regular pattern languages discussed in this paper was shown to have finite thickness by Shinohara [11], as was the class of nonerasing pattern languages (Angluin [1]). Wright [16] introduced another sufficient condition for inferability called finite elasticity, more general than finite thickness ([6]). A class L has finite elasticity if there is no infinite sequence of strings w0, w1, · · · and no infinite sequence of languages L1, L2, · · · in L satisfying {w0, w1, · · · , wn−1} ⊆ Ln but wn ∉ Ln for every n ≥ 1. Finite elasticity has the good property of being closed under various class operations such as union, intersection and so on (Wright [16], Moriyama & Sato [5], Sato [10]). Shinohara [13] proved that the class of nonerasing length-bounded EFS languages generated by at most k clauses has finite elasticity, and so is inferable from positive examples. Mukouchi [8] has shown similarly that the class of erasing RFS languages generated by at most k clauses has finite elasticity, provided that all regular patterns in heads of induction steps are of canonical form. Without the canonical-form condition, however, finite elasticity no longer holds, as shown below.

Theorem 1. The class PFSL does not have finite elasticity.

Proof. Define PFSs Γn = (ε, τn) (n ≥ 1) by τn = a(x1x2 · · · xn)b, where a, b ∈ Σ, a ≠ b. Then we can show that {ε, ab, a(ab)^2 b, · · · , a(ab)^k b} ⊆ L(Γk) but a(ab)^{k+1} b ∉ L(Γk) for k ≥ 1. Thus the infinite sequence (wn)n≥0 of strings and the infinite sequence (L(Γn))n≥1 of languages satisfy the above conditions, where w0 = ε and wn = a(ab)^{n+1} b (n ≥ 1). Hence the class PFSL has infinite elasticity.

Moriyama and Sato [5] introduced the notion of M-finite thickness by generalizing finite thickness.
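Membership in L(Γk) is decidable by a straightforward recursion, which lets one verify the witness strings in the above proof mechanically. The following sketch (ours, for illustration) uses the fact that w ∈ L(Γk) iff w = ε, or w = a·u·b where u is a concatenation of at most k members of L(Γk) (empty pieces absorb unused variables):

```python
from functools import lru_cache

def member(w, k):
    """Decide w in L(Γ_k) for Γ_k = (ε, a(x1 ... xk)b)."""

    @lru_cache(maxsize=None)
    def mem(s):
        if s == "":
            return True                       # base step p(ε) <-
        return (len(s) >= 2 and s[0] == "a" and s[-1] == "b"
                and splits(s[1:-1], k))       # induction step

    @lru_cache(maxsize=None)
    def splits(s, slots):
        if s == "":
            return True
        if slots == 0:
            return False
        # peel off a nonempty derivable prefix for one variable slot
        return any(mem(s[:i]) and splits(s[i:], slots - 1)
                   for i in range(1, len(s) + 1))

    return mem(w)

for k in (1, 2, 3):
    witnesses = ["", "ab"] + ["a" + "ab" * j + "b" for j in range(2, k + 1)]
    assert all(member(w, k) for w in witnesses)
    assert not member("a" + "ab" * (k + 1) + "b", k)
print("elasticity witnesses check out")
```

The asserts check exactly the inclusion and non-inclusion claims used in the proof of Theorem 1 for small k.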
For a nonempty finite set S ⊆ Σ∗, we define MIN(S, L) = {L′ ∈ L | L′ is a minimal language of S within L},
where L′ is a minimal language of S within L if S ⊆ L′ and no language L′′ ∈ L satisfies S ⊆ L′′ ⊊ L′.

Definition 4. A class L has M-finite thickness if, for any nonempty finite set S ⊆ Σ∗, (i) ♯MIN(S, L) < ∞, and (ii) S ⊆ L ∈ L implies that there is a language L′ ∈ MIN(S, L) such that L′ ⊆ L.

M-finite thickness alone is not a sufficient condition for inferability from positive examples, but, like finite elasticity, it is closed under various class operations such as union, concatenation and so on ([5,10]).

Theorem 2 (Moriyama & Sato [5]). If a class L has M-finite thickness, then L is inferable from positive examples if and only if for each language L ∈ L there is a finite tell-tale set of L.
4 Reduced PFSs
This section gives a characterization of reduced PFSs. Hereafter, we assume that ♯Σ ≥ 3. Then by Lemma 1, for any regular patterns π and τ, π ⪯ τ ⇐⇒ L(π) ⊆ L(τ). Let Γ = (π, τ) be a PFS. If τ ∈ Σ∗, clearly L(Γ) = L(π) ∪ {τ}. Thus the PFS Γ is reduced if and only if τ ⋠ π. Hereafter, we consider a PFS Γ = (π, τ) with var(τ) ≠ ∅, where var(τ) = {x1, x2, · · · , xn} is the set of variables contained in τ. We define Γτ = ⋃_{t=1}^∞ Γt, where Γt is recursively defined as follows: Γ1 = {τπ} and, for each t ≥ 2, Γt = Γt−1 ∪ {τ{x1 := ξ1, x2 := ξ2, · · · , xn := ξn} | ξi ∈ Γt−1 ∪ {π}, i = 1, 2, · · · , n}. Clearly every ξ in Γτ contains π as a substring whenever var(τ) ≠ ∅. By the definitions of L(Γ) and of Γτ above, it follows that:

Lemma 2. Let Γ = (π, τ) be a PFS. Then L(Γ) = L(π) ∪ L(Γτ) holds, where L(Γτ) = ⋃_{ξ∈Γτ} L(ξ).

By the definition of a reduced PFS, Γ is reduced if and only if L(π) ⊊ L(Γ), i.e., L(Γτ) ⊈ L(π). Clearly, if π ∈ Σ∗, then Γ is always reduced. For a pattern π with var(π) ≠ ∅, let us denote by Aπ the longest constant prefix and by Bπ the longest constant suffix of π. For instance, Aπ = ab and Bπ = ε if π = abx1aax2.

Lemma 3. Let Γ = (π, τ) be a PFS with var(π) ≠ ∅. For any ξ ∈ Γτ, there is a pair (i0, j0) of positive integers such that Aξ = (Aτ)^{i0} Aπ and Bξ = Bπ (Bτ)^{j0}.
For two strings v, w ∈ Σ∗, v ⊑p w denotes that v is a prefix of w, and v ⊑s w means that v is a suffix of w. Moreover, w ⊑p v∗ means that w ⊑p v^i for some i ≥ 0. Similarly we define w ⊑s v∗. By Pref, we denote the set of pairs (v, w) of strings satisfying v ⊑p w or w ⊑p v. Similarly we define the set Suff.
Lemma 4. Let π and τ be regular patterns containing variables. Then L(π) ∩ L(τ) ≠ ∅ ⇐⇒ (Aπ, Aτ) ∈ Pref and (Bπ, Bτ) ∈ Suff.
Lemma 5. Let π = Aπ(xπ′x′)Bπ and τ = Aτ(xτ′x′)Bτ. If π is a substring of τ, then L(τ) ⊆ L(π) ⇐⇒ Aπ ⊑p Aτ and Bπ ⊑s Bτ.

By the above lemmas, the next result immediately follows:

Theorem 3. Let Γ = (π, τ) be a PFS. If Aπ = Bπ = ε or Aτ = Bτ = ε, then Γ is not reduced.

Remember that the pattern π of a PFS Γ = (π, τ) is assumed to be of canonical form. Thus, by the above theorem, it follows that:

Corollary 1. Let Γ = (π, τ) be a reduced PFS with var(τ) ≠ ∅. Then no pattern ξ ∈ Γτ contains successive variables.

In terms of the above corollary, every pattern in Γτ can be assumed to be of canonical form for a reduced PFS Γ = (π, τ).

Lemma 6. For w ∈ Σ∗ and v ∈ Σ+, (i) ∃i0 ≥ 1 s.t. w ⊑p v^{i0}w ⇐⇒ w ⊑p v∗ ⇐⇒ ∀i ≥ 0, w ⊑p v^i w; (ii) ∃j0 ≥ 1 s.t. w ⊑s wv^{j0} ⇐⇒ w ⊑s v∗ ⇐⇒ ∀j ≥ 0, w ⊑s wv^j.

Theorem 4. Let Γ = (π, τ) be a PFS. Then the following statements are equivalent: (i) Γ is reduced. (ii) L(π) ∩ L(Γτ) = ∅. (iii) [Aπ ⋢p Aτ∗ and Aτ ≠ ε] or [Bπ ⋢s Bτ∗ and Bτ ≠ ε], if π ∉ Σ∗.

Proof. If π ∈ Σ∗, clearly (i) and (ii) are equivalent. Hereafter we assume π ∉ Σ∗, i.e., π contains some variables.

(i) ⇒ (iii). Assume that Γ is reduced but [Aπ ⊑p Aτ∗ or Aτ = ε] and [Bπ ⊑s Bτ∗ or Bτ = ε]. If Aτ = Bτ = ε, then by Theorem 3, Γ is not reduced, a contradiction. Consider the case Aπ ⊑p Aτ∗ and Bπ ⊑s Bτ∗. Then by Lemma 6 we have Aπ ⊑p (Aτ)^i Aπ and Bπ ⊑s Bπ(Bτ)^j for i, j = 0, 1, 2, · · ·. Let ξ ∈ Γτ be an arbitrary pattern. Then by Lemma 3 there are integers i0, j0 ≥ 1 such that Aξ = (Aτ)^{i0}Aπ and Bξ = Bπ(Bτ)^{j0}. It follows that Aπ ⊑p Aξ and Bπ ⊑s Bξ. Since π is a substring of ξ, Lemma 5 gives L(ξ) ⊆ L(π). Hence L(Γτ) ⊆ L(π), a contradiction. The cases [Aπ ⊑p Aτ∗ and Bτ = ε] and [Aτ = ε and Bπ ⊑s Bτ∗] are proved similarly.

(iii) ⇒ (ii). Assume that L(Γτ) ∩ L(π) ≠ ∅. Then there is a pattern ξ ∈ Γτ satisfying L(ξ) ∩ L(π) ≠ ∅. By Lemma 4, (Aπ, Aξ) ∈ Pref and (Bπ, Bξ) ∈ Suff.
By ξ ∈ Γτ, similarly to the above, there are i0, j0 ≥ 1 satisfying Aξ = (Aτ)^{i0}Aπ and Bξ = Bπ(Bτ)^{j0}. It means that Aπ ⊑p (Aτ)^{i0}Aπ, so by Lemma 6(i), Aπ ⊑p Aτ∗ holds. Similarly we have Bπ ⊑s Bτ∗. This contradicts the assumption of (iii). (ii) ⇒ (i) is clear.

By the above theorem and Theorem 3, it follows immediately that:

Corollary 2. Given a PFS Γ = (π, τ), the decision problem of whether Γ is reduced is computable in time O(|π| + |τ|).
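Under the reading of condition (iii) given above, Corollary 2's linear-time test can be sketched as follows for the case π ∉ Σ∗ and var(τ) ≠ ∅ (patterns as token lists with variables named x…; all helper names are ours):

```python
def const_prefix(p):
    """Longest constant prefix A_p of a pattern given as a token list."""
    out = []
    for tok in p:
        if tok.startswith("x"):     # variable token: prefix ends here
            break
        out.append(tok)
    return "".join(out)

def prefix_of_power(w, v):
    """Check w ⊑_p v*, i.e. w is a prefix of some power of v."""
    if v == "":
        return w == ""
    reps = -(-len(w) // len(v))     # ceil(len(w) / len(v))
    return (v * reps).startswith(w)

def is_reduced(pi, tau):
    """Theorem 4 (iii), as reconstructed above, for π not constant and
    var(τ) nonempty: Γ = (π, τ) is reduced iff
    [A_π not ⊑p A_τ* and A_τ ≠ ε] or [B_π not ⊑s B_τ* and B_τ ≠ ε]."""
    A_pi, A_tau = const_prefix(pi), const_prefix(tau)
    # longest constant suffixes, via the reversal trick
    B_pi = const_prefix(pi[::-1])[::-1]
    B_tau = const_prefix(tau[::-1])[::-1]
    front = A_tau != "" and not prefix_of_power(A_pi, A_tau)
    back = B_tau != "" and not prefix_of_power(B_pi[::-1], B_tau[::-1])
    return front or back
```

For instance, Γ = (abx1, cx1) is reported reduced: every pattern in Γτ begins with c, while every string of L(abx1) begins with ab, so the two parts of L(Γ) stay disjoint.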
5 Inclusion Problem for Erasing PFS Languages
In this section, we deal with the inclusion problem for erasing PFS languages.

Lemma 7. Let Γ = (π, τ) be a reduced PFS and let γ be a pattern such that L(γ) ⊆ L(Γ). Then γ ⋠ π ⇐⇒ L(γ) ⊆ L(Γτ).

Proof. We prove only the case π ∉ Σ∗, since the case π ∈ Σ∗ is easily shown. Since Γ is reduced, by Theorem 4, L(π) ∩ L(Γτ) = ∅ holds, and so (⇐) is valid. Moreover, [Aπ ⋢p Aτ∗ and Aτ ≠ ε] or [Bπ ⋢s Bτ∗ and Bτ ≠ ε] holds. (⇒) We consider only the case Aπ ⋢p (Aτ)∗ and Aτ ≠ ε; the other case is proved similarly. In this case, by Lemma 6, (∗) Aπ ⋢p (Aτ)^i Aπ, i = 1, 2, · · ·.
Suppose that γ ⋠ π and L(γ) ⊈ L(Γτ). By L(γ) ⊆ L(Γ), it follows that L(γ) ∩ L(π) ≠ ∅ and L(γ) ∩ L(ξ) ≠ ∅ for some ξ ∈ Γτ. Then by Lemma 4, (∗∗) (Aγ, Aπ) ∈ Pref and (Aγ, Aξ) ∈ Pref. Since ξ ∈ Γτ, by Lemma 3 there is an integer i0 ≥ 1 such that (∗∗∗) Aξ = (Aτ)^{i0}Aπ. By (∗), Aπ ⋢p Aξ.

Claim A. If |Aπ| ≤ |Aγ|, then Aπ ⊑p Aξ.

The proof of Claim A. If |Aπ| ≤ |Aγ|, by (∗∗) it follows that Aπ ⊑p Aγ. If |Aξ| ≤ |Aγ|, by (∗∗), Aξ ⊑p Aγ holds, and so Aπ ⊑p Aξ holds. Otherwise, i.e., |Aξ| > |Aγ|, by (∗∗), Aγ ⊑p Aξ holds, and so Aπ ⊑p Aξ holds. As mentioned above, since Aπ ⋢p Aξ, |Aπ| > |Aγ| must hold.

Claim B. If |Aπ| > |Aγ|, then L(γ) ⊈ L(Γ).

The proof of Claim B. By the assumption of the claim, the lengths of both Aπ and Aξ are larger than |Aγ|. Let a and b be the (|Aγ| + 1)-th constant symbols from the left of Aπ and Aξ, respectively. Since ♯Σ ≥ 3, there is a symbol c ∈ Σ such that c ≠ a, b. Let x be the leftmost variable of γ and let γc = γ{x := cx}.
Then L(γc) ⊆ L(γ), and the longest constant prefix of γc is Aγ·c. Clearly (Aγc, (Aτ)^j Aπ) ∉ Pref (j = 0, 1, · · · ), and so by Lemma 4 we have L(γc) ∩ L(π) = ∅. Furthermore, by Lemma 3, L(γc) ∩ L(ξ′) = ∅ for any ξ′ ∈ Γτ. Thus L(γc) ⊈ L(Γ), and hence L(γ) ⊈ L(Γ). Claim B leads to a contradiction, because L(γ) ⊆ L(Γ).

Lemma 8. Let Γ = (π, τ) be a reduced PFS and let γ be a pattern with L(γ) ⊆ L(Γτ). Then L(γ) ∩ L(τπ) ≠ ∅ ⇐⇒ γ ⪯ τπ.

The proof is similar to that of Lemma 7.

Lemma 9. Let Γ = (α, β) be a reduced PFS and let π be a pattern. Then L(Γβ) ⊆ L(π) ⇐⇒ βα ⪯ π and ββα ⪯ π.
Lemma 10. Let Γ = (α, β) be a reduced PFS and let π be a pattern. Then L(Γ) ⊆ L(π) ⇐⇒ α ⪯ π and βα ⪯ π.

By Lemma 9 and Lemma 10, we have the following:

Theorem 5. Let Γ = (π, τ) and Γ′ = (α, β) be reduced PFSs. If L(Γ′) ⊆ L(Γ), the following equivalences are valid:
(i) L(Γ′) ⊆ L(π) ⇐⇒ α ⪯ π and βα ⪯ π;
(ii) L(α) ⊆ L(π) and L(Γβ) ⊆ L(Γτ) ⇐⇒ α ⪯ π and βα ⋠ π;
(iii) L(α) ⊆ L(Γτ) and L(Γβ) ⊆ L(π) ⇐⇒ α ⋠ π, βα ⪯ π and ββα ⪯ π.

6 Learnability of the Class PFSL
In the present section, we show that the class PFSL is inferable from positive examples. To do so, we show that the class has M-finite thickness and that each language in the class has a finite tell-tale set.

6.1 Finite Tell-Tale Sets for PFS Languages
For a pattern π, S(π) denotes the set of strings obtained from π by substituting, for each variable, either the empty string or a constant symbol in Σ.

Lemma 11 (Shinohara [11]). Suppose that ♯Σ ≥ 3. For any regular pattern π, there is no regular pattern τ satisfying S(π) ⊆ L(τ) ⊊ L(π).
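For a regular pattern (each variable occurring once), S(π) is straightforward to enumerate; this small sketch (our own encoding of patterns as token lists) makes the definition concrete:

```python
from itertools import product

def S(pattern, sigma):
    """All strings obtained from a regular pattern (token list, variables
    named 'x...', each occurring once) by substituting each variable
    independently with the empty string or one constant of sigma."""
    vars_ = [t for t in pattern if t.startswith("x")]
    choices = [""] + list(sigma)           # ε or a single constant symbol
    result = set()
    for assignment in product(choices, repeat=len(vars_)):
        sub = dict(zip(vars_, assignment))
        result.add("".join(sub.get(t, t) for t in pattern))
    return result

print(sorted(S(["a", "x1", "b"], "abc")))
# ['aab', 'ab', 'abb', 'acb']
```

For π = ax1b over Σ = {a, b, c}, S(π) has four strings, a finite sample that by Lemma 11 already pins L(π) down among erasing regular pattern languages.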
For a reduced PFS Γ = (π, τ), we introduce the following finite subset of L(Γ): T(Γ) = L(Γ) ∩ Σ^{≤|ττπ|}, where Σ^{≤l} denotes the set of strings of length at most l, for each l ≥ 0. We first consider a PFS Γ = (π, τ) with var(π) ≠ ∅. Clearly S(π) ∪ S(τπ) ∪ S(ττπ) ⊆ T(Γ), where τπ = ττπ = τ if τ ∈ Σ∗. Note that the string c(π) is the unique shortest string of L(Γ) and of L(π) if var(τ) ≠ ∅, and that c(τπ) is the unique shortest string of L(Γτ).

Theorem 6. Let Γ = (π, τ) be a reduced PFS with var(π) ≠ ∅. Then there does not exist a PFS Γ′ such that T(Γ) ⊆ L(Γ′) ⊊ L(Γ).

Proof. The case τ ∈ Σ∗ is easily proved, so we consider the case var(τ) ≠ ∅. Assume T(Γ) ⊆ L(Γ′) ⊊ L(Γ) for some PFS Γ′ = (α, β); we may assume that Γ′ is reduced. Clearly c(π) and c(α) are the shortest strings of L(Γ) and L(Γ′), respectively, and by the above assumption, c(π) = c(α) (= w).

Claim A. α = π and S(τπ) ∪ S(ττπ) ⊆ L(Γβ) ⊊ L(Γτ).

The proof of Claim A. If α ⋠ π, by Lemma 7 we have L(α) ⊆ L(Γτ). It means that w ∈ L(Γτ), a contradiction. Thus α ⪯ π, and so L(α) ⊆ L(π). If α ≺ π, then α ⪯ π{x := ε} for some x ∈ var(π), because c(π) = c(α). If the variable x is neither the first nor the last symbol of π, then π = π1(axb)π2 for some a, b ∈ Σ and some patterns π1, π2. Let c ∈ Σ be such that c ≠ a, b, and let w′ = π{x := c, x′ := ε (x′ ≠ x)}. Clearly w′ ∈ S(π) and |w′| = |w| + 1. As easily seen, w′ ∉ L(α) holds. By T(Γ) ⊆ L(Γ′), w′ ∈ L(Γβ) must hold. The length of the shortest string c(βα) of L(Γβ) is larger than |w|, and is equal to |w| + 1 if and only if β = dx or xd for some d ∈ Σ. In either case, w′ ≠ dw, wd, and so w′ ∉ L(Γ′), a contradiction. Similarly we can argue when the variable x is the first or the last symbol of π. Hence α ≡ π must hold. Since both π and α are of canonical form, this implies π = α. Therefore we have S(τπ) ∪ S(ττπ) ⊆ L(Γβ) ⊊ L(Γτ).

Claim B. βπ = τπ.

The proof of Claim B. By the inclusions of Claim A, c(βπ) = c(τπ) holds. If βπ ⋠ τπ, by Lemma 8 we have L(βπ) ∩ L(τπ) = ∅, which contradicts c(βπ) ∈ L(βπ) ∩ L(τπ). Hence βπ ⪯ τπ holds. Similarly, by Claim A, S(τπ) ⊆ L(Γβ) holds; if τπ ⋠ βπ, by Lemma 8, L(τπ) ∩ L(βπ) = ∅, a contradiction. Therefore τπ ⪯ βπ holds. Consequently, we obtain βπ ≡ τπ. In terms of Corollary 1, both βπ and τπ are patterns of canonical form, and thus βπ = τπ. Since π contains variables, it is easily shown that β = τ. It means that L(Γ′) = L(Γ) holds, a contradiction.
The following result plays an essential role in our problem on finite tell-tale set. Lemma 12. Let w be a nonempty string in Σ + . (i) If w is a nonmultiple string, then there do not exist strings u, v = ε satisfying w = uv = vu or wu = vw. (ii) If u is a maximal component for w, then wv1 = v2 w implies vi ∈ u∗ for i = 1, 2. Now we consider a PFS Γ = (w, τ ) for w ∈ Σ + . We introduce the following strings: ri (τ ) = τ {xi := τw , xj := w | j = i}, i = 1, · · · , n, where var(τ ) = {x1 , · · · , xn } for n ≥ 1. Then we have {w, s} ∪ {ri (τ ) | i = 1, · · · , n} ⊆ T (Γ ), where s = τw . Note that the string w is the unique shortest string of L(w, τ ) and S(Γ ), and s is the unique second shortest string of them. By Lemma 12, it follows that: Lemma 13. Let Γ = (w, τ ) be a reduced PFS with w ∈ Σ + and var(τ ) = {x1 , · · · , xn } (n ≥ 1). If s is a nonmultiple string, then ri (τ ) = rj (τ ) for i = j. Lemma 14. Let Γ = (w, τ ) be a reduced PFS with w ∈ Σ + and var(τ ) = {x1 , · · · , xn } (n ≥ 1), and let u be a maximal component for s. If w ∈ u+ , then ri (τ ) = rj (τ ) for all i and j. Otherwise ri (τ ) = rj (τ ) for i = j. Theorem 7. Let Γ = (w, τ ) be a reduced PFS with w ∈ Σ + . Then there does not exist a PFS Γ such that T (Γ ) ⊆ L(Γ ) L(Γ ). Proof. We assume that there is a PFS Γ = (α, β) satisfying the inclusion relations in our theorem. Then clearly w and s(= τw ) are the unique shortest string and the unique second shortest string of L(Γ ). It implies that α = w and βα = s. Let τ = v0 x1 v1 x2 · · · xn vn and β = v0 y1 · · · yn vn for some vi , vj ∈ Σ ∗ (i = 0, · · · , n, j = 0, · · · , n ). (i) A case that s is a nonmultiple string. In this case, by Lemma 13, ri (τ ) = rj (τ ) for i = j. Clearly the strings ri (τ ) are the third shortest strings of L(Γ ) and S(Γ ). Similarly the strings rj (β)s are those of L(Γ ). By our assumption, (∗)
{ri(τ) | i = 1, · · · , n} = {r′j(β) | j = 1, · · · , n′}.
Since these strings are distinct as mentioned above, we have n = n′.
Learning of Erasing Primitive Formal Systems from Positive Examples
81
We show that vi = v′i for i = 0, · · · , n. Assume that vi = v′i for i = 0, · · · , i0 − 2 and vi0−1 ≠ v′i0−1 for some i0 ≥ 1. Then ri(τ) = r′i(β) for i = 1, · · · , i0 − 1. By the above, there is an integer j ≥ i0 satisfying ri0(τ) = r′j(β). By 2|s| = |ri0(τ)| + |w|, the substring s appears in both ri0(τ) and r′j(β) at different but overlapping places. It means that sw′ = w′′s for some nonempty strings w′ and w′′. Applying Lemma 12, this contradicts the assumption on s. Hence vi = v′i holds for every i. Therefore we have Γ = Γ′, a contradiction.

(ii) A case that a string u is a maximal component for s. By Lemma 14, if w ∈ u+, then ri(τ) = rj(τ) for any i, j, and moreover vi, v′i ∈ u∗. It means that L(Γ) = L(Γ′) ⊆ u+, a contradiction. By Lemma 14, if w ∉ u+, then ri(τ) ≠ rj(τ) for i ≠ j. Similarly to the case of (i), we obtain n = n′ and (∗). Furthermore, we can show that sw′ = w′′s for some nonempty strings w′ and w′′, a contradiction.

Finally, we consider a PFS Γ = (ε, τ), provided τε is a nonmultiple string. We define the following strings: for τ = u0 X1 u1 · · · uk−1 Xk uk with u0, uk ∈ Σ∗, ui ∈ Σ+ (i = 1, · · · , k − 1), and Xi ∈ X+ for i = 1, · · · , k, r̂i(τ) = τ{Xi := s, Xj := ε, j ≠ i},
ti(τ) = τ{Xi := s^|Xi|, Xj := ε, j ≠ i},
for i = 1, · · · , k, where s = τε, Xi := s means the substitution x := s for some one x in Xi together with x′ := ε for every x′ ≠ x in Xi, and Xi := s^|Xi| denotes x := s for every x in Xi. Then clearly {ε, s} ∪ {r̂i(τ), ti(τ) | i = 1, · · · , k} ⊆ T(Γ).
Theorem 8. Let Γ = (ε, τ) be a reduced PFS, where τε is a nonmultiple string. Then there does not exist a PFS Γ′ = (α, β) such that T(Γ) ⊆ L(Γ′) ⊊ L(Γ).

Proof. We assume that there is a PFS Γ′ = (α, β) satisfying the inclusion relations in our theorem. Then clearly α = ε and u0 u1 · · · uk = u′0 u′1 · · · u′k′ (= s), where τ = u0 X1 u1 X2 · · · Xk uk and β = u′0 Y1 · · · Yk′ u′k′ for some u0, u′0, uk, u′k′ ∈ Σ∗, ui, u′j ∈ Σ+ (i = 1, · · · , k − 1, j = 1, · · · , k′ − 1) and Xi, Yj ∈ X+ for each i, j.

Claim A. k = k′ and ui = u′i for i = 1, · · · , k. Since s is a nonmultiple string, the proof of Claim A can be done by showing r̂i(τ) ≠ r̂j(τ) for i ≠ j, similarly to the proof of Theorem 7. Similarly we can prove Xi = Yi for each i.

Theorem 9. Let Γ be a reduced PFS in PFS. Then the set T(Γ) is a finite tell-tale set of the language L(Γ).
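The string constructions behind the tell-tale set are easy to experiment with. The sketch below is our own illustration (the token representation, helper names, and the example pattern are not from the paper): it computes w, s = τw, and the strings ri(τ) for a toy PFS Γ = (w, τ).

```python
# Sketch (our own, not the paper's code): computing the strings w, s = tau_w,
# and r_i(tau) that the paper collects into the finite tell-tale T(Gamma)
# for a PFS Gamma = (w, tau). A pattern is a list of tokens; tokens starting
# with "x" are variables, all others are constant symbols.

def substitute(tau, assignment):
    """Apply a variable substitution to a pattern given as a token list."""
    return "".join(assignment.get(tok, tok) for tok in tau)

def telltale_strings(w, tau):
    variables = [t for t in tau if t.startswith("x")]
    s = substitute(tau, {x: w for x in variables})  # s = tau_w
    rs = []
    for xi in variables:
        # r_i(tau): x_i := tau_w, every other variable := w
        assignment = {x: (s if x == xi else w) for x in variables}
        rs.append(substitute(tau, assignment))
    return w, s, rs

# Toy example: w = a, tau = a x1 b x2.
w, s, rs = telltale_strings("a", ["a", "x1", "b", "x2"])
```

For this example the two strings r1(τ) and r2(τ) are distinct and each has length 2|s| − |w|, matching the length relation used in the proof of Theorem 7.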
6.2 M-Finite Thickness
Theorem 10. The class PFSL has M-finite thickness.

Proof. Let S ⊆ Σ∗ be a nonempty finite set. Let lmin and lmax be the shortest length and the longest length of strings in S, respectively.

Claim A. MIN(S, PFSL) < ∞ holds.

The proof of Claim A. Let L(Γ) ∈ MIN(S, PFSL) for Γ = (π, τ). Without loss of generality, we can assume that Γ is reduced. As mentioned in §2, every w ∈ L(Γ) satisfies |w| ≥ min{|c(π)|, |c(τπ)|}. A case of π ≠ ε. In this case, |c(π)| and |c(τπ)| are less than or equal to lmax. Since π is of canonical form and the number of variables in τ is bounded by lmax, there are at most finitely many such PFSs. A case of π = ε. In this case, c(τ) = τπ holds. Similarly to the above, we have |c(τ)| ≤ lmax, and thus there are at most finitely many such constant strings. Let us put |c(τ)| = l. Then, as is easily seen, the lengths of strings in L(ε, τ) are multiples of l. Thus lmax = kl for some k ≥ 0. Let us put τ = w0 X1 w1 · · · wn−1 Xn wn. A variable x ∈ var(τ) is nonerasable w.r.t. S if S ⊈ L(ε, τ{x := ε}). We can assume that every variable in τ is nonerasable w.r.t. S, and such a τ is called a nonerasable pattern w.r.t. S. Clearly |Xi| ≤ k holds for every i. Hence there are at most finitely many PFSs in MIN(S, PFSL).

Claim B. For any PFS Γ, if S ⊆ L(Γ), then L(Γ′) ⊆ L(Γ) for some L(Γ′) ∈ MIN(S, PFSL).

The proof of Claim B. Let S ⊆ L(Γ) for Γ = (π, τ). We can assume that Γ is reduced and L(Γ) ∉ MIN(S, PFSL). Then we have S ⊆ L(Γ′) ⊊ L(Γ) for some reduced PFS Γ′ = (α, β). Similarly to the proof of Claim A, it can be shown that there are at most finitely many such L(Γ′) containing the set S.

By Theorem 9 and Theorem 10, we obtain the following main theorem.

Theorem 11. The class PFSL is inferable from positive examples.
References

1. D. Angluin: Inductive inference of formal languages from positive data, Information and Control, 45, 117–135, (1980).
2. S. Arikawa, T. Shinohara and A. Yamamoto: Learning elementary formal systems, Theoretical Computer Science, 95, 97–113, (1992).
3. H. Arimura, T. Shinohara and S. Otsuki: Finding minimal generalizations for unions of pattern languages and its application to inductive inference from positive data, Lecture Notes in Computer Science, 775, 646–660, (1994).
4. E.M. Gold: Language identification in the limit, Information and Control, 10, 447–474, (1967).
5. T. Moriyama and M. Sato: Properties of language classes with finite elasticity, IEICE Transactions on Information and Systems, E78-D(5), 532–538, (1995).
6. T. Motoki, T. Shinohara and K. Wright: The correct definition of finite elasticity: corrigendum to identification of unions, Proceedings of the 4th Annual Workshop on Computational Learning Theory, 375, (1991).
7. Y. Mukouchi and S. Arikawa: Towards a mathematical theory of machine discovery from facts, Theoretical Computer Science, 137, 53–84, (1995).
8. Y. Mukouchi: Note on learnability of subclasses of erasing elementary formal systems from positive examples, in preparation.
9. D. Reidenbach: A negative result on inductive inference of extended pattern languages, Lecture Notes in Artificial Intelligence, 2533, 308–320, (2002).
10. M. Sato: Inductive inference of formal languages, Bulletin of Informatics and Cybernetics, 27(1), 85–106, (1995).
11. T. Shinohara: Polynomial time inference of extended regular pattern languages, Lecture Notes in Computer Science, 147, 115–127, (1982).
12. T. Shinohara: Inductive inference of formal systems from positive data, Bulletin of Informatics and Cybernetics, 22, 9–18, (1986).
13. T. Shinohara: Rich classes inferable from positive data, Information and Computation, 108, 175–186, (1994).
14. R.M. Smullyan: "Theory of Formal Systems," Princeton University Press, 1961.
15. J. Uemura and M. Sato: Compactness and learning of unions of erasing regular pattern languages, Lecture Notes in Artificial Intelligence, 2533, 293–307, (2002).
16. K. Wright: Identification of unions of languages drawn from an identifiable class, Proceedings of the 2nd Annual Workshop on Computational Learning Theory, 328–333, (1989).
Changing the Inference Type – Keeping the Hypothesis Space

Frank Balbach

Institut für Theoretische Informatik, Universität zu Lübeck, Wallstraße 40, 23560 Lübeck, Germany
[email protected]
Abstract. In inductive inference all learning takes place in hypothesis spaces. We investigate for which classes of recursive functions learnability according to an inference type I implies learnability according to a different inference type J within the same hypothesis space. Several classical inference types are considered. Among FIN, CONS-CP, and CP the above implication is true for all relevant classes, independently of the hypothesis space. On the other hand, it is proved that for many other pairs (I, J) hypothesis spaces exist that allow full I learning power, but limit that of J to finite classes. Only in a few cases (e.g. LIM vs. CONS) does the result depend on the actual class to be learned.
1 Introduction
In inductive inference a scenario is investigated where a learner receives more and more data about a target object and outputs a sequence of hypotheses. The learner is successful if its sequence of hypotheses eventually converges to a single description for the target object. Usually a learner is required to learn all objects from a (possibly infinite) class. This model of learning in the limit [11], applied to classes of recursive functions, has been studied thoroughly. Thereby, many variations of the basic model have been developed [4,19,1,8,12,10,9,13]. All these models are referred to as "inference types." They often differ in the constraints placed on the intermediate hypotheses or in the way the sequence of hypotheses has to converge. Common to all inference types, however, is the need to interpret the hypotheses a learner ("strategy") outputs. This is usually done by means of a hypothesis space. Its design is thus of key importance to the learning success. In the inductive inference of recursive functions, hypothesis spaces are numberings of partial recursive functions. Hypotheses are represented by indices in such numberings. It is well known that Gödel numberings [15] (acceptable numberings [14]) serve, in many inference types, as universal hypothesis spaces. That is, any solvable learning problem can be solved within such a numbering. Of course, knowing that a solution exists is not sufficient in practical applications. One rather needs to know how to construct an appropriate learner. R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 84–98, 2003. © Springer-Verlag Berlin Heidelberg 2003
Constructing learning strategies is usually easier in hypothesis spaces specifically designed for a certain learning task than it is in Gödel numberings. A first approach to such hypothesis spaces is given by numbering theoretical characterizations [19,9]. As an example, let us consider: A class U of recursive functions is learnable in the limit iff there is a numbering ψ such that (a) ψ contains all functions of U and (b) for different indices i, j one can uniformly compute an upper bound d(i, j) of an argument on which the i-th and j-th function in ψ differ. This characterization provides a sufficient criterion for hypothesis spaces ψ to be suitable for the learning of U. Moreover, learners for classes U can be built uniformly from the hypothesis space ψ and the function d. Specialized hypothesis spaces are also necessary in combination with general types of strategies, like identification by enumeration [11]. In its basic form this strategy outputs the least index whose associated function is consistent with the data known to the strategy so far. Therefore it often needs to decide the consistency of an index with some known data, which usually cannot be done in Gödel numberings [20]. Despite their advantages regarding the construction of learning strategies, specialized non-Gödel numberings suffer from several drawbacks. One of them is a loss of flexibility. Changing the class to be learned or switching to a different type of learning, thereby leaving the hypothesis space unchanged, might lead to an unsolvable new learning problem. In this paper we will concentrate on the consequences of switching to another inference type after fixing the hypothesis space. The inductive inference paradigm can also be viewed from a more practical point of view. Here, the hypothesis spaces correspond to certain output formats, or description languages, for hypotheses.
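Identification by enumeration, mentioned above, is easy to sketch for a total numbering, where consistency with finite data is decidable. The toy numbering and all names below are our own choices, not from the paper; psi here plays the role of a total hypothesis space enumerating the functions 0^i 1^∞.

```python
# Sketch of identification by enumeration over a total numbering psi,
# here psi_i = 0^i 1^infinity (our own toy example). Since every psi_i is
# total, consistency with a finite data segment is decidable.

def psi(i, x):
    """The i-th function of the toy numbering: i zeros, then ones."""
    return 0 if x < i else 1

def enum_strategy(segment):
    """Output the least index consistent with the data seen so far.
    (Only meaningful on data drawn from the class; otherwise the
    minimum search below would not terminate.)"""
    i = 0
    while any(psi(i, x) != v for x, v in enumerate(segment)):
        i += 1
    return i

# On increasing segments of the target f = 0^3 1^infinity the hypotheses
# stabilize on the correct index 3.
target = [0, 0, 0, 1, 1, 1]
hypotheses = [enum_strategy(target[:n + 1]) for n in range(len(target))]
```

In a Gödel numbering this minimum search need not terminate, which is exactly why specialized hypothesis spaces matter for this type of strategy.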
The inference types find their counterpart in various requirements put on the behavior of a learning algorithm and/or the quality of its hypotheses. Consider a typical practical learning algorithm that uses a certain description language for hypotheses and learns according to some requirements. In practice, requirements are often changing. The output format might be fixed, however. The first natural question then is whether the algorithm can still handle the new requirements or whether there is at all an algorithm that can cope with the new situation. However, the new requirements could be too demanding for such an algorithm to exist. But what happens if a learning algorithm for the new problem is known that (unfortunately) uses a different output format? Is this additional fact enough to conclude that there must be such an algorithm for the previously fixed format, too? In general the possible answers are "yes, this conclusion can be drawn", "no, it cannot", or "it depends..." In case of a positive answer the immediate next question is how to find such an algorithm. Can it even be constructed effectively from the previous one?
The following pages give answers to these questions in the general framework of inductive inference of recursive functions, thereby revealing that all three kinds of answers do indeed occur, depending on the inference types under consideration. Section 3 addresses the “no” answers, Sect. 4 presents some “yes” answers, and Sect. 5 some intermediate results. Finally, Sect. 6 gives an overview of all results obtained.
2 Preliminaries
Notations not explained herein follow standard conventions [15]. By IN we denote the set {0, 1, 2, . . .} of natural numbers; inclusion and proper inclusion are denoted by ⊆ and ⊂, respectively; card A is the cardinality of the set A. We denote the set difference of A and B by A \ B. For n ∈ IN the set of all n-ary partial recursive functions over IN will be written P^n, the set of all recursive functions R^n. As abbreviations for P^1 and R^1 we use P and R, respectively. For f ∈ P, x ∈ IN we write f(x) ↓ if f is defined on input x and f(x) ↑ otherwise. Functions f, g ∈ P fulfill f =^n g iff {(x, f(x)) | x ≤ n and f(x) ↓} = {(x, g(x)) | x ≤ n and g(x) ↓}. For i ∈ IN and n ≥ 1, i^n denotes the n-tuple (i, . . . , i). Functions can be identified with the sequence of their values, e.g. f = 1^n 0^∞ means f(x) = 1 for 0 ≤ x < n and f(x) = 0 for x ≥ n. If for f ∈ P and n ∈ IN the values f(0), . . . , f(n) are defined, we will write f^n for the initial segment (f(0), . . . , f(n)) and implicitly identify every f^n with a natural number via a computable bijective coding function. For a tuple α = (α0, . . . , αn) over IN and a class U ⊆ R, we write α ⊑ U iff there is an f ∈ U such that f^n = α. In order to abbreviate certain statements, the symbols ∧ ("and"), =⇒ ("implies") and ⇐⇒ ("iff") as well as the quantifiers ∀ ("for all"), ∀∞ ("for all but finitely many"), and ∃ ("exists") will be used. Let ψ ∈ P^2. Then ψ is called a numbering and Pψ := {ψi | i ∈ IN} denotes the set of the functions enumerated by ψ; for i ∈ IN the function ψi is defined by ψi(x) := ψ(i, x) for all x. A numbering ϕ ∈ P^2 is called Gödel numbering (acceptable numbering) iff (1) Pϕ = P and (2) ∀ψ ∈ P^2 ∃c ∈ R ∀i [ψi = ϕc(i)]. We use ϕ to denote a fixed Gödel numbering. For a function ϕi we will write Si if the function plays the role of a learning strategy (see below). Let Φ be a Blum complexity measure [6] for ϕ. For i ∈ IN we will write ϕi(x) ↓n instead of Φi(x) ↓ ≤ n.
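The "computable bijective coding function" that identifies initial segments f^n with natural numbers is left abstract in the text; one standard instantiation is iterated Cantor pairing. A sketch (function names are ours, not the paper's):

```python
# One possible computable bijective coding of tuples over IN, via the
# Cantor pairing function (the paper only assumes that some such coding
# exists; this concrete choice is our own).

def pair(x, y):
    """Cantor pairing: a bijection IN x IN -> IN."""
    return (x + y) * (x + y + 1) // 2 + y

def unpair(z):
    """Inverse of the Cantor pairing."""
    w = 0
    while (w + 1) * (w + 2) // 2 <= z:
        w += 1
    y = z - w * (w + 1) // 2
    return w - y, y

def code(segment):
    """Fold a nonempty tuple (f(0), ..., f(n)) into a single number."""
    z = segment[0]
    for v in segment[1:]:
        z = pair(z, v)
    return z
```

Any computable bijection works here; the learning-theoretic results do not depend on the particular coding.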
In the basic learning model, a strategy S ∈ P learns a class U ⊆ R with respect to a hypothesis space ψ ∈ P^2. The strategy receives, one after another, the initial segments f^n of a function f ∈ U as input and generates a sequence of hypotheses S(f^n) as output. Each hypothesis is interpreted as the function ψS(f^n). The basic inference type, learning in the limit [11], gives the learning strategy the freedom to output whatever it wants, as long as it reaches a point beyond which the output remains constant as well as correct.
Definition 1. A class U ⊆ R is said to be learned in the limit with respect to (or in) a hypothesis space ψ ∈ P^2 by a strategy S ∈ P iff (1) ∀f ∈ U ∀n [S(f^n) ↓], (2) ∀f ∈ U ∃i [ψi = f ∧ ∀∞ n [S(f^n) = i]]. This fact will be written U ∈ LIMψ(S). Furthermore LIMψ := {U | U ⊆ R ∧ ∃S ∈ P [U ∈ LIMψ(S)]} denotes the set of LIM learnable classes with respect to ψ and LIM := ⋃_{ψ∈P^2} LIMψ the entire set of LIM learnable classes.

Various inference types are built from LIM by adding conditions on the intermediate hypotheses. Probably the most natural one is the consistency condition, demanding that the hypothesized function always agrees with the data already known to the strategy [2,7]. It has been intensively studied [3,18,20,16]. In any case, the hypothesized functions have to be total in order to be correct, hence it is natural to demand that all hypotheses refer to total functions. This is called total learning [19]. It is restricted even further within the so-called class preserving learning [5], where the hypothesized functions are required to be members of the class to be learned. Both total and class preserving learning can be combined with the consistency condition [12].

Definition 2. Let U ⊆ R be a class, ψ ∈ P^2 a hypothesis space, and S ∈ P a strategy such that U ∈ LIMψ(S). (1) U ∈ CONSψ(S) iff ∀f ∈ U ∀n [f =^n ψS(f^n)]. (2) U ∈ TOTALψ(S) iff ∀f ∈ U ∀n [ψS(f^n) ∈ R]. (3) U ∈ CPψ(S) iff ∀f ∈ U ∀n [ψS(f^n) ∈ U]. (4) U ∈ CONS-TOTALψ(S) iff ∀f ∈ U ∀n [f =^n ψS(f^n) ∈ R]. (5) U ∈ CONS-CPψ(S) iff ∀f ∈ U ∀n [f =^n ψS(f^n) ∈ U]. CPψ, TOTALψ, CONSψ, CONS-TOTALψ, and CONS-CPψ as well as CP, TOTAL, CONS, CONS-TOTAL, and CONS-CP are defined in analogy to LIMψ and LIM.

Instead of convergence to a single correct hypothesis, behaviorally correct learning in the limit only requires that the learning strategy almost always output arbitrary, but correct, hypotheses [3].

Definition 3.
A class U is said to be learned behaviorally correct in the limit by a strategy S with respect to a hypothesis space ψ (written: U ∈ BCψ (S)) iff (1) ∀f ∈ U ∀n [S(f n ) ↓], (2) ∀f ∈ U ∀∞ n [ψS(f n ) = f ]. BCψ and BC are defined analogously to the previous inference types. Convergence of a different kind takes place in the finite learning model, also called one-shot learning. Here the strategy may, on each function, output the special hypothesis “?” finitely many times until it outputs a correct one [11]. Definition 4. A class U is said to be learned finitely by a strategy S with respect to a hypothesis space ψ (written: U ∈ FINψ (S)) iff (1) ∀f ∈ U ∀n [S(f n ) ↓],
(2) ∀f ∈ U ∃n (a) ∀x < n [S(f^x) = ?], (b) ∃i [ψi = f ∧ ∀x ≥ n [S(f^x) = i]]. FINψ and FIN are defined in the obvious way. For strategies S (except for BC strategies), the convergence point on a learned function f is denoted by Conv(S, f) := min{n | ∀x ≥ n [S(f^x) = S(f^n)]} and the final hypothesis of S on f by lim S(f) := lim_{n→∞} S(f^n). The convergence point for BC strategies depends on the hypothesis space ψ and is defined by Convψ(S, f) := min{n | ∀x ≥ n [ψS(f^x) = f]}. For every inference type introduced here, Gödel numberings present a universal hypothesis space insofar as every learnable class can be learned in any such numbering [13].

Lemma 5. Let U ⊆ R be a class of recursive functions, I ∈ {LIM, CONS, TOTAL, CP, CONS-TOTAL, CONS-CP, BC, FIN} an inference type, and ϕ an arbitrary Gödel numbering. Then Iϕ = I.

Some inference types have another property, namely that the learning goal can always be achieved by a strategy defined on every input.

Lemma 6. Let I ∈ {LIM, TOTAL, CP, BC, FIN} be an inference type and S ∈ P, U ⊆ R, ψ ∈ P^2 satisfying U ∈ Iψ(S). Then there is a strategy T ∈ R such that U ∈ Iψ(T).

Both lemmata will be used implicitly in the next sections. The set of classes contained in total numberings is denoted by NUM := {U | ∃ψ ∈ R^2 [U ⊆ Pψ]}. It is known [11] that every class U ∈ NUM enumerated by ψ ∈ R^2, i.e. U ⊆ Pψ, can be learned with respect to ψ by the strategy Enumψ(f^n) := min{i | ψi =^n f}. The relations between the inference types in terms of learnable classes are described in the following theorem [3,18,12].

Theorem 7. (1) CONS-CP ⊂ CP ⊂ TOTAL = CONS-TOTAL ⊂ CONS ⊂ LIM ⊂ BC, (2) FIN ⊂ CP, and (3) NUM ⊂ TOTAL. (4) Inference types whose relations are not explicitly stated are incomparable.

The question formulated in the introduction can now be expressed more formally: Does U ∈ J satisfy ∀ψ ∈ P^2 [U ∈ Iψ =⇒ U ∈ Jψ]? We give a name to the condition contained therein.

Definition 8.
Let I, J ∈ {BC, LIM, FIN, CONS, TOTAL, CP, CONS-CP, CONS-TOTAL} be inference types. A class U ⊆ R has the property (satisfies the condition) (I → J) iff ∀ψ ∈ P^2 [U ∈ Iψ =⇒ U ∈ Jψ]. Note that all finite classes U satisfy (I → J) for all introduced inference types I and J. The set {U | U ∈ I ∩ J and U satisfies (I → J)} will be called the scope of (I → J). Classes U ∉ I ∩ J are not considered since it is obvious whether they satisfy (I → J) or not. A scope containing exactly the finite classes will be called minimal; a scope equal to I ∩ J will be called maximal. Often (I → J) is fulfilled for all of I ∩ J simply because ∀ψ ∈ P^2 [Iψ ⊆ Jψ]. The next lemma states when this happens.
Lemma 9. (1) ∀ψ ∈ P^2 [FINψ ⊆ CPψ ⊆ TOTALψ ⊆ CONSψ ⊆ LIMψ ⊆ BCψ], (2) ∀ψ ∈ P^2 [CONS-CPψ ⊆ CONS-TOTALψ ⊆ TOTALψ], (3) ∀ψ ∈ P^2 [CONS-CPψ ⊆ CPψ]. Note, however, that I ⊆ J is not sufficient for ∀ψ ∈ P^2 [Iψ ⊆ Jψ].

Lemma 10. (1) ∃ψ ∈ P^2 [TOTALψ ⊈ CONS-TOTALψ], (2) ∃ψ ∈ P^2 [CPψ ⊈ CONS-TOTALψ].
3 Negative Results and Biased Hypothesis Spaces
We will first consider the transition from TOTAL learning to CONS learning. Both requirements are rather natural and lie close to each other in the hierarchy (see Theorem 7 and Lemma 9). Hence, one would not expect any problems if one wants to learn a TOTAL class in a TOTAL way with respect to a hypothesis space suitable for learning this class consistently. However, this expectation is wrong. As soon as the class to be learned is infinite, the hypothesis space could be a "bad" one preventing the TOTAL learning of the class. This is the subject of the following Theorem 11.

Theorem 11. Let U ∈ TOTAL be an infinite class. Then a hypothesis space ψ ∈ P^2 exists such that U ∈ CONSψ \ TOTALψ.

Proof. Let U ∈ TOTAL. The hypothesis space ψ will be defined via diagonalization against all (TOTAL) learning strategies. The functions in ψ will be grouped into consecutive blocks Zj of increasing size. Within the j-th block, which contains j + 2 functions, diagonalization against the j + 1 strategies S0, . . . , Sj happens. The functions in the block will be defined, argument by argument, to equal ϕj. Meanwhile the output of the strategies S0, . . . , Sj on initial segments of ϕj is watched. As soon as an Si is found to output a hypothesis z from within the j-th block, the definition of ψz is stopped, resulting in ψz ∈ P \ R. From now on neither Si nor ψz are taken into account during the ongoing definition of the j-th block. The formal algorithm for the j-th block is given below.
L_0^(j) := {0, . . . , j}
G_0^(j) := Zj
x := 0
While ϕ_j^x ↓ do:
(1) For all z ∈ G_x^(j): ψz(x) := ϕj(x)
(2) For all z ∈ Zj \ G_x^(j): ψz(x) := ↑
(3) P_x^(j) := {(ℓ, y) | ℓ ∈ L_x^(j) ∧ y ≤ x ∧ S_ℓ(ϕ_j^y) ↓x ∈ G_x^(j)}
(4) G_{x+1}^(j) := G_x^(j) \ {S_ℓ(ϕ_j^y) | (ℓ, y) ∈ P_x^(j) ∧ y = min{z | (ℓ, z) ∈ P_x^(j)}}
(5) L_{x+1}^(j) := L_x^(j) \ {ℓ | ∃y [(ℓ, y) ∈ P_x^(j)]}
(6) x := x + 1

Note that, since the indices outnumber the strategies, there must remain at least one z ∈ Zj such that ψz = ϕj. To prove U ∉ TOTALψ, we assume an Si such that U ∈ TOTALψ(Si). Since U is infinite, there must be an f ∈ U such that Si converges on f to an index k ∈ Zj for a j ≥ i. All total functions in the j-th block equal ϕj, hence f = ψk = ϕj. Therefore, Si outputs on ϕj almost always indices in Zj. Thus, either Si or k (or both) will be "eliminated." In either case, Si outputs a non-total hypothesis on f ∈ U, contradicting the assumption. In order to show U ∈ CONSψ, let R ∈ P be a strategy such that U ∈ CONS-TOTALϕ(R). A strategy T learning U consistently in ψ works as follows: T(f^n) := min G_n^(R(f^n)).
For f ∈ U, ϕR(f^n) is total, hence G_n^(R(f^n)) exists and can be computed using the algorithm above. Let f ∈ U and j = lim R(f^n) be the final hypothesis of R on f. Then T converges to the least k ∈ Zj not "eliminated." It follows that ψk = ϕj, hence T converges correctly. That the intermediate hypotheses of T are consistent is a consequence of R being a consistent strategy for U in ϕ.

If consistent learnability of a class is not sufficient for its total learnability, then neither LIM nor BC learnability are. Moreover, consistent learnability cannot be sufficient for learning in a CONS-TOTAL, CP, CONS-CP, or FIN way.

Corollary 12. Let I ∈ {CONS, LIM, BC} and J ∈ {FIN, CONS-CP, TOTAL, CONS-TOTAL, CP} be inference types and let U ∈ J be an infinite class. Then a hypothesis space ψ ∈ P^2 exists such that U ∈ Iψ \ Jψ.

A closer look at the proof of Theorem 11 reveals that the constructed hypothesis space ψ does not depend on the class U. In ψ no infinite class U ∈ TOTAL can be learned totally. But any such U can be learned consistently in ψ. Even more is true: The hypothesis space ψ does in fact allow for the full learning power of consistent learning as well as of limit learning in general. Behaviorally correct learning, on the other hand, does not increase the learning power further.

Corollary 13. There is a hypothesis space ψ satisfying (1) TOTALψ = {U ⊆ R | card U < ∞}, (2) CONSψ = CONS, (3) LIMψ = LIM, (4) BCψ = LIM.

Proof. Let ψ be the hypothesis space constructed in the proof of Theorem 11. (1) is obvious, as well as the ⊆-part of (2). In order to prove CONSψ ⊇ CONS let U ∈ CONS be a class and R ∈ P a strategy such that U ∈ CONSϕ(R). We
define T as in the proof of Theorem 11, with the only difference that R is a CONS strategy for U. The proof of (3) proceeds similarly to that of (2). In order to show U ∈ LIMψ for a U ∈ LIM we use a strategy R ∈ R with U ∈ LIMϕ(R) and define T as follows:

T(f^n) := "For m = 0, . . . , n compute the sets G_m^(R(f^m)) for at most n steps. Let m0 be the greatest index for which the computation finishes. Output min G_{m0}^(R(f^{m0}))."

This modification is necessary since ϕR(f^n) need not be defined up to n, hence G_n^(R(f^n)) need not exist.
(4) LIM = LIMψ ⊆ BCψ follows from (3) and the definitions of BC and LIM. In order to show BCψ ⊆ LIM we assume a U ∈ BCψ with U ∉ LIM. Let S ∈ R be such that U ∈ BCψ(S). Then there must be a function g ∈ U such that card {S(g^n) | n ∈ IN} = ∞ (otherwise one could amalgamate the finitely many indices of each function of U and learn U in the limit, in contradiction to our assumption). Recall the grouping of the indices of ψ in blocks Zj. The hypotheses of S on g reach infinitely many such blocks. We assume without loss of generality that S outputs on g at most one index from each such block, that is, ∀n [S(g^n) ∉ ⋃_{m<n} Z_{S(g^m)}], where Z_{S(g^m)} denotes the block containing S(g^m). (One can construct an S with this property from an S without it.)
k will be selected for "elimination" in step (4), since (i, m) ∈ P_x^(j) and m = min{z | (i, z) ∈ P_x^(j)} (remember that, by the assumption on S, g^m is the only segment where S outputs a hypothesis from Zj). This contradicts the conclusion above that k "survives."

Note that part (1) of Corollary 13 remains true if TOTAL is substituted by CONS-TOTAL, CP, CONS-CP, or FIN. Obviously, ψ is biased towards CONS and LIM and against most other inference types considered. The diagonalization technique from the proof of Theorem 11 can, suitably modified, be used for other inference types as well. In order to prove the next theorem we need a well known lemma that provides a stronger version of Lemma 6 for LIM.

Lemma 14. There is a function ρ ∈ R such that ∀i ∈ IN (1) ϕρ(i) ∈ R, (2) ∀ψ ∈ P^2 ∀U ⊆ R [U ∈ LIMψ(ϕi) =⇒ U ∈ LIMψ(ϕρ(i))].
That is, for every LIM strategy ϕi, ρ effectively constructs an everywhere defined LIM strategy ϕρ(i) with at least the same learning power. We abbreviate the strategies from the last lemma by Si := ϕρ(i).

Theorem 15. Let U ∈ LIM be an infinite class. Then there is a hypothesis space ψ ∈ P^2 such that U ∈ BCψ \ LIMψ.

Proof. This proof uses the same blockwise grouping of indices as the proof of Theorem 11. However, diagonalization happens against the everywhere defined strategies Si = ϕρ(i). For every j ∈ IN the functions with indices z ∈ Zj are defined as follows: ψz(x) := ϕj(x), if ∀i ≤ j ∃n ≥ x [ϕ_j^n ↓ ∧ Si(ϕ_j^n) ≠ z], and ψz(x) := ↑ otherwise. Provided ϕj ∈ R, the function ψz equals ϕj iff every strategy S0, . . . , Sj outputs on ϕj infinitely often a hypothesis different from z. If, on the other hand, one of those strategies converges on ϕj to z, then ψz ∉ R. Furthermore there is a z ∈ Zj with ψz = ϕj. Assume U ∈ LIMψ(Si) for an i ∈ IN. Then an f ∈ U and a j ≥ i exist such that k := lim Si(f) ∈ Zj. Hence ψk = f = ϕj holds. But then, according to the definition of ψk, Si does not converge to k, a contradiction. In order to prove U ∈ BCψ, let R ∈ R be a LIM strategy for U in ϕ. A BC strategy for U in ψ can be defined in the following way. T(f^n) := "(1) For all z ∈ Z_{R(f^n)} and x ≤ n compute each ψz(x) for at most n steps. (2) Output a z such that ψz has a longest initial segment computed in (1)." Let f ∈ U and j := lim R(f). Then T outputs on f almost always indices of Zj. Let m be the length of the longest initial segment of all non-total functions from Zj. Let n ≥ Conv(R, f). Eventually n will be great enough for step (1) to compute an initial segment longer than m, since the total function ϕj = f appears in the j-th block. Then T will output an index z of a total function ψz = ϕj = f, and therefore learns f behaviorally correctly. The proof does not only prove the scope of (BC → LIM) to be minimal.
Again, it yields a biased hypothesis space, this time favouring BC and disliking LIM.

Corollary 16. There is a hypothesis space ψ satisfying (1) LIMψ = {U ⊆ R | card U < ∞}, (2) BCψ = BC.

We will now turn to the property (TOTAL → FIN). Once again we apply the same proof technique. But this time we only get a proof for the minimality of the scope. We are not provided with a biased hypothesis space.

Theorem 17. Let U ∈ FIN be an infinite class. Then a hypothesis space ψ ∈ P^2 exists such that U ∈ TOTALψ \ FINψ.
Proof. Let R ∈ R be such that U ∈ FINϕ(R). We need to modify the construction of the proof of Theorem 11 in such a way that the class U is taken into account. This is inevitable, as will be shown in Theorem 18. The construction of the j-th block starts by watching R on the function ϕj. Only when (and if) R converges finitely to the index j does a construction very similar to that of the proof of Theorem 11 take place. For the sake of completeness the algorithm constructing the block with indices from Zj is given below.

x := 0
While both (A) ϕ_j^x ↓ and (B) R(ϕ_j^x) = ? do:
For all z ∈ Zj: ψz(x) := ϕj(x)
x := x + 1

Case 1: (A) and (B) hold for all x. Then ∀z ∈ Zj [ψz = ϕj] and R does not learn ϕj, hence ϕj ∉ U.
Case 2: There is an x0 such that (A) does not hold. Then ∀z ∈ Zj ∀x ≥ x0 [ψz(x) ↑] and therefore ∀z ∈ Zj [ψz ∉ U].
Case 3: There is an x0 such that (A) holds, but (B) does not, i.e. ϕ_j^{x0} ↓ and R(ϕ_j^{x0}) ∈ IN. Then ∀z ∈ Zj [ψz =^{x0−1} ϕj] follows and two cases are distinguished:
Case 3.1: R(ϕ_j^{x0}) ≠ j. Then define for all z ∈ Zj and x ≥ x0: ψz(x) := ↑.
Case 3.2: R(ϕ_j^{x0}) = j. Then the construction goes on in the following way.
L_{x0}^(j) := {0, . . . , j}
G_{x0}^(j) := Zj
x := x0
While ϕ_j^x ↓ do: Perform steps (1) to (6) as in the proof of Theorem 11.

The proofs of U ∈ TOTALψ and U ∉ FINψ are not very different from the corresponding ones in Theorem 11, but somewhat more technical, and will be omitted due to space constraints. The question whether there is a biased hypothesis space, i.e. a ψ such that TOTALψ = TOTAL and FINψ = {U | card U < ∞}, has been answered by an anonymous reviewer of this paper's submitted version. The proof is also due to this reviewer.

Theorem 18. If ψ is a hypothesis space with TOTALψ = TOTAL, then there is an infinite class in FINψ.

Proof. Let U1 = {0^n 1^∞ | n ∈ IN}. U1 is in TOTAL and thus in TOTALψ. Let S be a TOTALψ strategy for U1. Now U2 = {ψ_{S(0^n 1^m)} | n ∈ IN and m is the first number such that m > 0 and ψ_{S(0^n 1^m)} extends 0^n 1 as a function}
94
F. Balbach
is a class of total functions, since S only outputs indices of total functions. Furthermore, the test whether ψ_{S(0^n 1^m)} extends 0^n 1 can be carried out effectively for m = 1, 2, . . ., since the indices to be simulated are total. Thus U_2 is well-defined and recursively enumerable. For every n there is exactly one function in U_2 which extends 0^n 1; thus U_2 is infinite. Furthermore, U_2 is in FIN_ψ: on input that does not start with 0^n 1 for any n, one outputs "?"; on input that starts with 0^n 1, one outputs the ψ-index of the unique function in U_2 which extends 0^n 1. Among all properties (I → J) so far proven to have minimal scope, the property (TOTAL → FIN) is the only one where no biased hypothesis space exists.
4
Positive Results
The previous section might look somewhat discouraging. However, not every property (I → J) can be proved to have minimal scope. On the contrary, there are inference types I and J such that the scope of (I → J) is maximal.

Theorem 19. If U ∈ FIN and ψ ∈ P² such that U ∈ CP_ψ, then U ∈ FIN_ψ.

Proof. Let R, S ∈ R be such that U ∈ FIN_ϕ(R) and U ∈ CP_ψ(S). Define

T(f^n) := "If R(f^n) = ? then output '?'.
           If R(f^n) ∈ IN ∧ ∀m ≤ n [ψ^n_{S(f^m)} ↓ ≠ f^n] then output '?'.
           If R(f^n) ∈ IN ∧ ∃m ≤ n [ψ^n_{S(f^m)} ↓ = f^n] then:
               m_0 := min{m ≤ n | ψ^n_{S(f^m)} ↓ = f^n}. Output S(f^{m_0})."

On f ∈ U the strategy T outputs "?" until an n with n ≥ Conv(R, f) and ψ_{S(f^n)} =_n f is reached. Since R is a FIN strategy that has already converged on f, there can be no function g ≠ f in U such that g =_n f (otherwise R would converge on such a g to the same final hypothesis as it does on f, a contradiction). But if there is only one function in U starting with f^n, this function must be identical to ψ_{S(f^n)}, for S is a CP strategy. It follows that S(f^n) is a correct hypothesis for f with respect to ψ.

Theorem 19 addresses the transition from class preserving learning to finite learning. Not only does it say that this transition is possible; it also provides an algorithm to construct a finite learning strategy from a class preserving one. Thus it gives a positive answer to the introductory question on the effective constructability of such learning algorithms. The next theorem gives a very similar result concerning (CP → CONS-CP).

Theorem 20. If U ∈ CONS-CP and ψ ∈ P² such that U ∈ CP_ψ, then U ∈ CONS-CP_ψ.
Proof. The proof is similar to that of Theorem 19. Let R ∈ P be such that U ∈ CONS-CP_ϕ(R) and S ∈ R such that U ∈ CP_ψ(S). Define

T(f^n) := "If ψ_{S(f^n)} =_n f then output S(f^n).
           If ψ_{S(f^n)} ≠_n f then:
               g := ϕ_{R(f^n)}.
               Find m > n with ψ_{S(g^m)} =_n f and output S(g^m)."

Let f ∈ U and n ∈ IN. Then ψ_{S(f^n)} ∈ U follows, since S is a class preserving strategy. Furthermore, "ψ_{S(f^n)} =_n f ?" is decidable. If this condition is satisfied, T(f^n) = S(f^n) is a class preserving consistent hypothesis; otherwise, for g := ϕ_{R(f^n)} we get g ∈ U and g =_n f because of the properties of R. Since S learns all functions of U in the limit, an m > n must exist such that ψ_{S(g^m)} = g =_n f. This condition can easily be checked because S outputs only total hypotheses with respect to ϕ on functions g ∈ U. Thus such an m can be found effectively, and S(g^m) is a consistent and class preserving hypothesis. It remains to show that T learns every f ∈ U in the limit. For n ≥ Conv(S, f) the first condition in the definition of T is satisfied and T(f^n) = S(f^n). Hence T converges on f to the same final hypothesis as S.

Corollary 21. If U ∈ CONS-CP and ψ ∈ P² such that U ∈ FIN_ψ, then U ∈ CONS-CP_ψ.

Corollary 22. If U ∈ FIN and ψ ∈ P² such that U ∈ CONS-CP_ψ, then U ∈ FIN_ψ.

Corollary 22 is dual to Corollary 21. Thus, for all classes U ∈ CONS-CP ∩ FIN and all hypothesis spaces ψ, the equivalence U ∈ FIN_ψ ⟺ U ∈ CONS-CP_ψ holds. Moreover, CP can be added to this equivalence.

Corollary 23. For all U ∈ CONS-CP ∩ FIN and for all ψ ∈ P², U ∈ FIN_ψ ⟺ U ∈ CONS-CP_ψ ⟺ U ∈ CP_ψ.

The equivalence described in the last corollary is remarkable in two ways. First, it concerns an inference type and its consistent variant and, second, it concerns two inference types that are incomparable regarding their learning power. It should be noted, however, that this equivalence is only valid within a relatively small area, FIN ∩ CONS-CP.
Nevertheless, none of the other inference types introduced can be added to the statement of Corollary 23. Hence, there is indeed a close relationship between FIN, CONS-CP, and CP.
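The effective construction behind Theorem 19 — assembling a finite learner T from a finite ϕ-learner R and a class preserving ψ-learner S — can be illustrated by a toy simulation. Everything below (the finite horizon H, the particular class U, and the strategies R and S) is an illustrative assumption of this sketch, not the recursion-theoretic setting of the paper:

```python
# Toy model: "functions" are length-H tuples, segments f^n are prefixes.
H = 6
U = [(0, 0, 1, 0, 0, 0), (0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0)]
psi = [(1, 1, 1, 1, 1, 1)] + U        # hypothesis space; indices 1..3 lie in U

def R(seg):
    """A FIN strategy for U: '?' until the segment singles out one member."""
    cand = [g for g in U if g[:len(seg)] == seg]
    return U.index(cand[0]) if len(cand) == 1 else '?'

def S(seg):
    """A class preserving strategy: psi-index of the first member of U
    consistent with the segment."""
    for j, g in enumerate(psi):
        if g in U and g[:len(seg)] == seg:
            return j

def T(seg):
    """The strategy T from the proof of Theorem 19: wait for R to converge,
    then output S(f^m) for the minimal m whose hypothesis matches seg."""
    n = len(seg)
    if R(seg) == '?':
        return '?'
    ms = [m for m in range(1, n + 1) if psi[S(seg[:m])][:n] == seg]
    return S(seg[:ms[0]]) if ms else '?'

f = U[0]
outputs = [T(f[:n]) for n in range(1, H + 1)]
print(outputs)   # -> ['?', 1, 1, 1, 1, 1]
```

Once R has converged (here at n = 2), T commits to a single correct ψ-index and never changes it again — finite learning with class preserving hypotheses, as the theorem asserts.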
5
An Intermediate Result — CONS vs. LIM
So far, every examined property (I → J ) had either minimal or maximal scope. This need not be so for every pair of inference types. In order to show this, we will turn our attention to the condition (LIM → CONS). All classes U ∈ CP satisfy (LIM → CONS), as has already been proved [16].
Theorem 24. If U ∈ CP and ψ ∈ P² such that U ∈ LIM_ψ, then U ∈ CONS_ψ.

Naturally the question arises whether (LIM → CONS) holds not only for the classes U ∈ CP, but for all U ∈ CONS. The next theorem gives a negative answer to this question. It shows that even in NUM there are classes which do not satisfy (LIM → CONS). This is remarkable since NUM classes tend to be easily learnable. After all, there are always total numberings that can be used as hypothesis spaces for them. However, the properties (I → J) take into account all numberings, whether total or not. This is not unrealistic, even in the case of NUM classes, because deciding whether a class is embedded in a total numbering can be much harder than finding a non-total hypothesis space suitable for learning the class [17].

Theorem 25. There is a class U ∈ NUM and a hypothesis space ψ ∈ P² such that U ∈ LIM_ψ \ CONS_ψ.

Proof. For the construction of ψ let η ∈ R² be a numbering of V := {f ∈ R | ∀^∞ n [f(n) = 0]} \ {0^∞} with the property ∀i ∀j [i ≠ j ⟹ η_i ≠ η_j]. For all i ∈ IN set ψ_{2i+1} := η_i and ψ_{2i}(0) := i. For all i ∈ IN and x ≥ 1 define

ψ_{2i}(x) :=  i, if S_i(i^{x+1}) ∈ 2IN \ {2i},   (=: (A))
              i, if S_i(i^{x+1}) ∈ 2IN + 1,      (=: (B))
              ↑, if S_i(i^{x+1}) = 2i,           (=: (C))
              ↑, if S_i(i^{x+1}) ↑.              (=: (D))

Define U := V ∪ {ψ_{2i} | ψ_{2i} = i^∞}. Clearly, U ∈ NUM. Furthermore, U is learnable in the limit with respect to ψ by the strategy T defined as follows:

T(f^n) :=  2 · f(0),              if f(0) = . . . = f(n),
           2 · Enum_η(f^n) + 1,   otherwise.

To prove U ∉ CONS_ψ, assume a strategy S_i such that U ∈ CONS_ψ(S_i). Considering the behavior of S_i on initial segments of i^∞, and thereby the definition of ψ_{2i}, we distinguish four cases:

Case 1: There is x ≥ 1 such that (A) happens. Then S_i(i^{x+1}) = 2k for some k ≠ i. Hence ψ_{S_i(i^{x+1})}(0) = ψ_{2k}(0) = k ≠ i, and S_i outputs an inconsistent hypothesis on i^{x+1}, an initial segment of a function in U, a contradiction.
Case 2: There is x ≥ 1 such that (C) happens. Then S_i(i^{x+1}) = 2i, but ψ_{2i}(x) ↑. Thus S_i is inconsistent on i^{x+1}, a contradiction.
Case 3: There is x ≥ 1 such that (D) happens. Then S_i is undefined on i^{x+1}, a contradiction.
Case 4: For all x ≥ 1, (B) happens. Then ψ_{2i} = i^∞ ∈ U and S_i almost always outputs odd hypotheses on i^∞. Since the odd indices in ψ belong to nonconstant functions, S_i does not converge to a correct hypothesis on i^∞ ∈ U, a contradiction.

The hypothesis space ψ constructed in the last proof satisfies the conditions (a) and (b) of the characterization theorem given in the introduction. Hence, ψ is a somewhat more "natural" hypothesis space (cf. proof of Theorem 11) biased towards an inference type, namely LIM, although only for a certain class U.
Table 1. Overview of the scope of (I → J). A + means maximal scope, a − minimal scope. For a set M of classes, +M means the scope is a superset of M; −M means the intersection of the scope and M contains only the finite classes. Finally, −U means that a counterexample to the maximality of the scope exists.

I \ J       | FIN                | CONS-CP    | CP         | CONS-TOTAL   | TOTAL          | CONS    | LIM
------------+--------------------+------------+------------+--------------+----------------+---------+----
FIN         | +                  | +          | +          | −U, +CONS-CP | +              | +       | +
CONS-CP     | +                  | +          | +          | +            | +              | +       | +
CP          | +                  | +          | +          | −U, +CONS-CP | +              | +       | +
CONS-TOTAL  | −NUM, +FIN∩CONS-CP | −NUM       | −NUM       | +            | −CONS-CP, −FIN | +       | +
TOTAL       | −                  | −NUM, −FIN | −NUM, −FIN | −U, +CONS-CP | +              | +       | +
CONS        | −                  | −          | −          | −            | −              | +       | +
LIM         | −                  | −          | −          | −            | −              | −U, +CP | +
BC          | −                  | −          | −          | −            | −              | −       | −

6
Overview of Results Concerning (I → J )
Table 1 tries to present the numerous results of this paper in a clear manner. Results stated in that table, but not proved within this paper, can be obtained via techniques similar to those presented in the previous sections. Note that the scope of (I → J ) has not been fully characterized for every such property. Acknowledgments. This paper is based on my diploma thesis at the University of Kaiserslautern. It is a pleasure for me to thank Sandra Zilles and Rolf Wiehagen for their continuous support and helpful advice. Many thanks also to Thomas Zeugmann for many valuable hints and insights. Finally, I wish to thank the members of the Program Committee of the ALT 2003 for carefully reading the paper. In particular I am indebted to the anonymous reviewer who provided the proof of Theorem 18.
References

1. D. Angluin, C. Smith. Inductive inference: theory and methods. Computing Surveys 15, 237–269, 1983.
2. J. Barzdin. Inductive inference of automata, functions and programs. Proceedings of the International Congress of Mathematicians, 455–460, Vancouver, 1974.
3. J. M. Barsdin. Dve Teoremui o predjelnom sintjese funkzii. Teorija algorithmov i programm I, 82–88, Latviiskii Gosudarstvenyi Universitet, Riga, 1974.
4. J. M. Barsdin, R. W. Freiwald. Prognosirovanje i predjelnyi sinfjes effektivno peretschislimyich klassov funkzii. Teorija algorithmov i programm I, 101–111, Latviiskii Gosudarstvenyi Universitjet, Riga, 1974.
5. H.-R. Beick. Einige qualitative Aspekte bei der Erkennung von Klassen allgemein rekursiver Funktionen. Diplomarbeit, Humboldt-Universität, Berlin, 1979.
6. M. Blum. A machine-independent theory of the complexity of recursive functions. Journal of the ACM 14, 322–336, April 1967.
7. L. Blum, M. Blum. Toward a Mathematical Theory of Inductive Inference. Information and Control 28, 125–155, 1975.
8. J. Case, C. Smith. Comparison of Identification Criteria for Machine Inductive Inference. Theoretical Computer Science 25, 193–220, 1983.
9. R. Freivalds, E. B. Kinber, R. Wiehagen. How Inductive Inference Strategies Discover Their Errors. Information and Computation 118, 208–226, 1995.
10. R. Freivalds. Inductive inference of recursive functions: Qualitative theory. (J. Bārzdiņš and D. Bjørner, Eds.) Baltic Computer Science, LNCS 502, 77–110, Springer-Verlag, 1991.
11. E. M. Gold. Language identification in the limit. Information and Control 10, 447–474, 1967.
12. K. P. Jantke, H.-R. Beick. Combining Postulates of Naturalness in Inductive Inference. Elektronische Informationsverarbeitung und Kybernetik 17, 465–484, 1981.
13. S. Jain, D. Osherson, J. S. Royer, A. Sharma. Systems that Learn: An Introduction to Learning Theory, second edition. MIT Press, Cambridge, Massachusetts, 1999.
14. M. Machtey, P. Young. An Introduction to the General Theory of Algorithms. North-Holland, New York, 1978.
15. H. Rogers. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York, 1967.
16. W. Stein. Konsistentes und inkonsistentes Lernen im Limes. Dissertation, Universität Kaiserslautern, 1998.
17. F. Stephan, T. Zeugmann. Learning Classes of Approximations to Non-Recursive Functions. Theoretical Computer Science 288(2), 309–341, 2002. (Special Issue ALT '99.)
18. R. Wiehagen. Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverarbeitung und Kybernetik 12(1/2), 93–99, 1976.
19. R. Wiehagen. Zur Theorie der algorithmischen Erkennung. Dissertation B, Sektion Mathematik, Humboldt-Universität, Berlin, 1978.
20. R. Wiehagen, T. Zeugmann. Learning and Consistency. (K. P. Jantke, S. Lange, Eds.) Algorithmic Learning for Knowledge-Based Systems, LNAI 961, 1–24, Springer, 1995.
Robust Inference of Relevant Attributes

Jan Arpe and Rüdiger Reischuk

Institut für Theoretische Informatik, Universität zu Lübeck
Wallstr. 40, 23560 Lübeck, Germany
{arpe/reischuk}@tcs.uni-luebeck.de
Abstract. Given n Boolean input variables representing a set of attributes, we consider Boolean functions f (i.e., binary classifications of tuples) that actually depend only on a small but unknown subset of these variables/attributes, in the following called relevant. The goal is to determine the relevant attributes given a sequence of examples: input vectors X and corresponding classifications f(X). We analyze two simple greedy strategies and prove that they are able to achieve this goal for various kinds of Boolean functions and various input distributions according to which the examples are drawn at random. This generalizes results obtained by Akutsu, Miyano, and Kuhara for the uniform distribution. The analysis also provides explicit upper bounds on the number of necessary examples. They depend on the distribution and combinatorial properties of the function to be inferred. Our second contribution is an extension of these results to the situation where attribute noise is present, i.e., a certain number of input bits x_i may be wrong. This is a typical situation, e.g., in medical research or computational biology, where not all attributes can be measured reliably. We show that even in such an error-prone situation, reliable inference of the relevant attributes can be performed, because our greedy strategies are robust even against a linear number of errors.
1
Introduction
In many data mining applications, one is faced with the situation that a binary classification of elements with a large number of attributes depends only on a small subset of these attributes. A central task is then to infer these relevant attributes from a given input sample consisting of a series of examples X(k) = (x_1(k), . . . , x_n(k)) with classifications y(k) for k = 1, 2, . . . , m, i.e., one wants to find a set of variables x_{i_1}, . . . , x_{i_d} such that the sample can be explained by a function f : {0, 1}^n → {0, 1} that depends only on these d variables. A function f is said to explain the sample if f(x_1(k), . . . , x_n(k)) = y(k) for all k. Moreover, since real data usually contain noise, it is of particular interest to design algorithms that in some sense behave "robustly" with respect to input disturbances. When inferring relevant attributes, two natural questions can be asked:
Supported by DFG research grant Re 672/3.
R. Gavald` a et al. (Eds.): ALT 2003, LNAI 2842, pp. 99–113, 2003. c Springer-Verlag Berlin Heidelberg 2003
1. Given a fixed sample of an unknown concept, what is the minimum number of variables that explain the sample? 2. How many examples does one need to generate in order to find out the actual relevant attributes? The first question gives rise to an optimization problem introduced in Sect. 3, whereas the second one can be considered as an algorithmic learning problem. In both cases, however, the key task is to infer relevant variables from a sample. Thus our goal is to design efficient algorithms that find a small set of variables explaining the input sample. Akutsu and Bao [2] proposed a greedy algorithm based on a well-known greedy strategy for the Set Cover problem (see [12]). Akutsu, Miyano, and Kuhara [3] describe an efficient implementation of this approach and give an average case analysis of the algorithm for two special types of functions, namely AND and OR of arbitrary literals, under the uniform distribution of input examples. In Sect. 4, we simplify the greedy strategy. We call this strategy Greedy Ranking and show that its performance is similar to the one obtained in [3]. In Sect. 5, the average case analysis of [3] is generalized in two respects: to a broader class of functions and to weaker assumptions on the input distributions. It turns out that a modification of our approach, namely taking the smallest sets of the ranking, may also be useful for some classes of functions and input distributions. We call this strategy Modest Ranking, since its ‘modest’ behavior of first selecting the smallest sets is in contrast to the greedy strategy of taking the largest sets. We apply these very general results to some typical input distributions and some specific functions of major interest (e.g., monomials, clauses, and threshold functions) in Sect. 6. After these investigations we turn to the ‘real case’ of samples that contain partial errors. In Sect. 
7, we assume that for each attribute there is a certain (generally unknown) error probability δi that the value for this attribute xi in an input vector is flipped. This noise model called product random attribute noise ([11]) has been applied to the PAC learning model as well. Note that it is quite different from the classification noise model [5]. We show that for some δ > 0 depending only on combinatorial properties of the function f to be inferred and on the probability distribution according to which the samples are generated, one can tolerate any constant fractions δi ≤ δ of such erroneous bits and still infer the relevant attributes successfully with high probability using the ranking strategies. In addition to their general ability of robustly inferring relevant attributes, the number of examples needed to handle disturbed inputs only grows at most by a factor of 4. Finally, in Sect. 8, we consider a different approach for the parity function, since the ranking strategies do not work in this case. Inferring relevant attributes is related to finding association rules (also called functional dependencies/relations) – a well-studied problem (e.g., see [1,15]). In the variant considered in this paper, the target attribute Y is fixed as in [2,3]. The goal of efficiently inferring concepts with many irrelevant attributes (so-called attribute-efficient learning) has attracted much attention in the past (e.g., see [13,8,9,17]). Most authors consider the mistake-bounded model. In this
on-line setting, one tries to minimize the number of examples for which the current hypothesis turns out to be wrong. Several ways are known to convert on-line algorithms with low mistake bounds into efficient PAC learning algorithms (see [4,14,13]). In this paper, we consider the finite exact learning model: from a randomly selected sample of small size, we have to compute a single hypothesis that with high probability has to be correct (with accuracy 1). Recently, Mossel, O'Donnell, and Servedio [16] have introduced an algorithm that exactly learns the class of concepts f with n input variables and d relevant attributes (also called d-juntas) under the uniform distribution with confidence 1 − δ in time (n^d)^{ω/(ω+1)} · poly(n, 2^d, log(1/δ)), where ω < 2.376 is the matrix multiplication exponent. The Target Ranking algorithm we introduce runs in time O(m² n) on samples of size m. In order to achieve confidence 1 − δ, we roughly need c · log(1/δ) · log n examples, where c depends on the base function f̃ (i.e., the restriction of f to its relevant variables), the number of relevant attributes d, and the probability distribution according to which the examples are drawn. In particular, restricting to the uniform distribution, for arbitrary f satisfying a certain statistical property, c can be bounded by poly(2^d). In this case we are able to exactly infer the relevant attributes with confidence 1 − δ in time n · poly(log n, 2^d, log(1/δ)). Due to space limitations, most proofs have to be omitted. Details are presented in [7].
2
Preliminaries
A concept is a Boolean function f : {0, 1}^n → {0, 1}; a concept class is a set of concepts. A concept f : {0, 1}^n → {0, 1} depends on variable x_i if the two (n−1)-ary subfunctions f_{x_i=0} and f_{x_i=1}, with variable x_i fixed to 0 and 1 respectively, are not identical. If f depends on x_i, then attribute x_i is called relevant for f, otherwise irrelevant. We denote the set of relevant (resp. irrelevant) attributes by V+(f) (resp. V−(f)). If f is clear from the context, we just write V+ and V−. We denote by f̃ the restriction of f to its relevant variables and call it the base function of f. An example is a vector (x_1, . . . , x_n; y) ∈ {0, 1}^{n+1}. It is an example for f if y = f(x_1, . . . , x_n). The values of x_1, . . . , x_n are called variable or attribute assignments, whereas the value for y is called a label. A sequence (x_1(k), . . . , x_n(k); y(k)) (k = 1, . . . , m) of examples for f is called a sample for f of size m, and f is said to explain the sample. A sample T is a sequence of examples such that there exists some f that explains the sample. If f depends only on variables from the set {x_{i_1}, . . . , x_{i_d}}, then we also say that these variables explain T. A sample is stored in a matrix, each line of which represents one example:

T = (X; y) =
    ( x_1(1) . . . x_n(1) | y(1) )
    (    ⋮           ⋮    |  ⋮   )   ∈ {0, 1}^{m×(n+1)},
    ( x_1(m) . . . x_n(m) | y(m) )

where X is the submatrix consisting of the variable assignments in the examples, and y is the column vector containing the labels of the examples. A sample T may contain a certain combination of attributes several times. Then, of course, it is necessary that for k ≠ l the following implication holds:
102
J. Arpe and R. Reischuk
X(k) = X(l)  =⇒  y(k) = y(l).   (1)
Indeed, if (1) does not hold for some k ≠ l, then by definition T is not a sample. In the noisy case, however, it may well be that different combinations of attributes yield different labels, but due to false measurements of the attributes the values for x_1, . . . , x_n all look the same. We assume that the examples of a sample T are drawn according to a fixed probability distribution p : {0, 1}^n → [0, 1], and we say that T is generated according to p.

Definition 1. Let (X; y) ∈ {0, 1}^{m×(n+1)} be a sample. The corresponding functional relations graph is a bipartite labeled graph defined as follows. The vertices are {1, . . . , m}, the edges are S = {{k, l} | y(k) ≠ y(l)}. Each edge {k, l} is labeled by the set of variables x_i such that x_i(k) ≠ x_i(l). The set of edges with a label containing variable x_i is denoted by S_i = {{k, l} ∈ S | x_i(k) ≠ x_i(l)}.

Proposition 1. Let T = (X; y) ∈ {0, 1}^{m×(n+1)} be a sample and {i_1, . . . , i_d} ⊆ {1, . . . , n}. Then the following statements are equivalent:
(a) x_{i_1}, . . . , x_{i_d} explain T.
(b) For each pair k, l ∈ {1, . . . , m} such that y(k) ≠ y(l) there exists r ∈ {1, . . . , d} such that x_{i_r}(k) ≠ x_{i_r}(l).
(c) S = S_{i_1} ∪ . . . ∪ S_{i_d}.
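Definition 1 and the covering criterion of Proposition 1(c) are straightforward to state in code; the following is a small illustrative sketch (the function names are ours, not the paper's):

```python
def functional_relations_graph(X, y):
    """Edge set S of the functional relations graph of the sample (X; y),
    together with the per-variable edge sets S_i (Definition 1)."""
    m, n = len(X), len(X[0])
    S = {(k, l) for k in range(m) for l in range(k + 1, m) if y[k] != y[l]}
    Si = [{(k, l) for (k, l) in S if X[k][i] != X[l][i]} for i in range(n)]
    return S, Si

def explains(X, y, vars_):
    """Proposition 1(c): the variables with indices in vars_ explain the
    sample iff their sets S_i cover all of S."""
    S, Si = functional_relations_graph(X, y)
    covered = set()
    for i in vars_:
        covered |= Si[i]
    return covered == S

X = [(0, 0, 1), (1, 0, 1), (1, 1, 0)]
y = [0, 1, 1]
print(explains(X, y, {0}), explains(X, y, {1}))   # -> True False
```

Here x_0 separates both label-crossing pairs of examples, while x_1 misses the pair {1, 2} of the first two examples.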
3
Approximability
Consider the following optimization problem:

Inference of Relevant Attributes (INFRA)
Instance: a sample T = (X; y) = (x_1(k), . . . , x_n(k); y(k))_{k=1,...,m} ∈ {0, 1}^{m×(n+1)}
Solution: a function f : {0, 1}^n → {0, 1} such that T is a sample for f (i.e., y(k) = f(x_1(k), . . . , x_n(k)) for all k ∈ {1, . . . , m})
Measure: |V+(f)|
Goal: minimize |V+(f)|
Note that in order to find a small set of explaining attributes for an INFRA instance, we do not have to explicitly define a corresponding concept f; by Proposition 1 it is enough to find a set of attributes x_{i_1}, . . . , x_{i_d} such that for all k, l ∈ {1, . . . , m} with y(k) ≠ y(l) there exists r ∈ {1, . . . , d} with x_{i_r}(k) ≠ x_{i_r}(l). In order to obtain results on the approximability of INFRA, we consider the well-studied Set Cover problem. Note that Proposition 1 yields a reduction from INFRA to Set Cover. Based on this fact, Akutsu and Bao [2] have proved the following theorem:

Theorem 1 ([2]). INFRA can be approximated in polynomial time within a factor of 2 ln m + 1.
Robust Inference of Relevant Attributes
103
The next claim is a slightly stronger version of Theorem 8 in [2], since we consider the special case of INFRA for Boolean functions. Proposition 2. Set Cover is reducible to INFRA via a polynomial time computable approximation factor preserving reduction. Applying a result from [10], we obtain the following lower bound: Theorem 2. For any ε > 0, INFRA cannot be approximated within a factor of (1 − ε) ln m unless NP ⊆ DTIME(nO(log log n) ). Therefore, when faced with the INFRA problem, the best one can hope for are efficient approximation algorithms with a nonconstant approximation ratio or fast algorithms providing correct results for ‘most’ inputs. In the rest of this paper, we investigate the latter challenge.
4
From Greedy to Ranking
Let us start with the algorithm discussed in [3] which is presented in Fig. 1. It makes use of the reduction from INFRA to Set Cover given by Proposition 1 and applies a well-known greedy approach to the Set Cover instance obtained. Johnson [12] first analyzed this approach for Set Cover.
input (x_1(k), . . . , x_n(k); y(k))_{k=1,...,m}
V := {x_1, x_2, . . . , x_n}; S := {{k, l} | y(k) ≠ y(l)}
while S ≠ ∅ do
    for i = 1 to n do
        S_i := {{k, l} ∈ S | x_i(k) ≠ x_i(l)}
    find an x_i ∈ V with maximum |S_i|
    output x_i
    S := S \ S_i; V := V \ {x_i}

Fig. 1. Algorithm Greedy
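The pseudocode of Fig. 1 might be rendered in Python as follows (a sketch; how ties between equally large S_i are broken is unspecified, exactly as in the pseudocode):

```python
def greedy(X, y):
    """Algorithm Greedy (Fig. 1): repeatedly output the variable whose
    current set S_i covers the most remaining edges, recomputing the S_i
    in every round."""
    m, n = len(X), len(X[0])
    S = {(k, l) for k in range(m) for l in range(k + 1, m) if y[k] != y[l]}
    V = set(range(n))
    out = []
    while S:
        Si = {i: {e for e in S if X[e[0]][i] != X[e[1]][i]} for i in V}
        best = max(V, key=lambda i: len(Si[i]))
        out.append(best)
        S -= Si[best]
        V.remove(best)
    return out

# Sample for f = x0 OR x1; x2 is constantly 0 and can never be selected.
X = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
y = [0, 1, 1, 1]
print(sorted(greedy(X, y)))   # -> [0, 1]
```

On this sample every label-crossing pair is separated by x_0 or x_1, so the loop terminates after outputting exactly those two variables.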
We apply some modifications of this algorithm and analyze their effects. The strategy is based on a ranking of the sets S1 , . . . , Sn by their cardinalities which is done by the procedure Rank Sets, see Fig. 2.
for i = 1 to n do
    S_i := {{k, l} ∈ S | x_i(k) ≠ x_i(l)}
compute π : {1, . . . , n} → {1, . . . , n} such that |S_{π(1)}| ≥ |S_{π(2)}| ≥ . . . ≥ |S_{π(n)}|

Fig. 2. Procedure Rank Sets
The results may be worse in some cases, since the new greedy approach is based on a single static ranking. However, we show that the ranking still yields
104
J. Arpe and R. Reischuk
properties similar to Greedy and in addition performs quite robustly when confronted with attribute noise. Greedy Ranking (see Fig. 3) outputs the variables xi with maximum |Si | until these Si ’s cover the whole edge set S. In contrast to Greedy, Greedy Ranking does not recompute the sets Si in each step. Given a concept f , the Greedy Ranking algorithm works correctly, if with high probability, the sets Si for xi ∈ V + are larger than the sets Si for xi ∈ V − . On the other hand, if the converse is the case, i.e., if the sets Si for relevant variables are likely to be smaller than the sets Si for the irrelevant variables, we should make use of an algorithm that outputs the variables corresponding to the smallest sets Si . Instead of being greedy, this algorithm rather behaves modestly, so we call it Modest Ranking (see also Fig. 3).
input (x_1(k), . . . , x_n(k); y(k))_{k=1,...,m}
Rank Sets
S := {{k, l} | y(k) ≠ y(l)}
i := 1  /  i := n
while S ≠ ∅ do
    output x_{π(i)}
    S := S \ S_{π(i)}
    i := i + 1  /  i := i − 1

Fig. 3. Algorithms Greedy/Modest Ranking
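Figs. 2 and 3 combine into a short Python sketch (names are ours; Python's sort is stable, so ties keep their original order, while the pseudocode leaves tie-breaking open):

```python
def rank_sets(X, y):
    """Procedure Rank Sets (Fig. 2): the edge sets S_i and a permutation pi
    with |S_pi(1)| >= |S_pi(2)| >= ... >= |S_pi(n)|."""
    m, n = len(X), len(X[0])
    S = {(k, l) for k in range(m) for l in range(k + 1, m) if y[k] != y[l]}
    Si = [{e for e in S if X[e[0]][i] != X[e[1]][i]} for i in range(n)]
    pi = sorted(range(n), key=lambda i: len(Si[i]), reverse=True)
    return S, Si, pi

def greedy_ranking(X, y, modest=False):
    """Greedy/Modest Ranking (Fig. 3): walk the single static ranking from
    the largest (or, if modest, from the smallest) set S_i until S is covered."""
    S, Si, pi = rank_sets(X, y)
    out = []
    for i in (reversed(pi) if modest else pi):
        if not S:
            break
        out.append(i)
        S -= Si[i]
    return out

X = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
y = [0, 1, 1, 1]
print(greedy_ranking(X, y))   # -> [0, 1]
```

Unlike Greedy, the sets S_i are computed once and never updated; only the residual edge set S shrinks.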
Note that, given a sample, all three algorithms terminate after a finite number of steps since by property (1), each pair {k, l} ∈ S belongs to some Si . Clearly, all algorithms presented here compute a cover of S. Thus by the reduction given in Proposition 1, the algorithms work correctly for the optimization problem, i.e., they output sets of variables that explain the input sample. It is not hard to construct instances showing that in general none of the algorithms is superior to the others in terms of finding small sets of explaining variables. It may be the case that an input sample for some concept f can be explained by a proper subset of the relevant variables for f . In case the number d of relevant variables is a priori known, we can overcome this problem by giving d as additional input to the algorithms and output the d variables with the largest (resp., smallest) sets Si . This is done by the Target Ranking and the Modest Target Ranking algorithms (see Fig. 4).
input (x_1(k), . . . , x_n(k); y(k))_{k=1,...,m}, d
Rank Sets
S := {{k, l} | y(k) ≠ y(l)}
for i = 1 to d do
    output x_{π(i)}  /  output x_{π(n−i+1)}

Fig. 4. Algorithms (Modest) Target Ranking
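A sketch of Fig. 4, demonstrated on an exhaustive (and therefore deterministic) uniform sample for the assumed toy target f = x_0 ∧ x_1:

```python
from itertools import product

def target_ranking(X, y, d, modest=False):
    """(Modest) Target Ranking (Fig. 4): given the number d of relevant
    attributes, output the d variables with the largest (smallest) |S_i|."""
    m, n = len(X), len(X[0])
    S = [(k, l) for k in range(m) for l in range(k + 1, m) if y[k] != y[l]]
    sizes = [sum(1 for (k, l) in S if X[k][i] != X[l][i]) for i in range(n)]
    pi = sorted(range(n), key=lambda i: sizes[i], reverse=not modest)
    return pi[:d]

# All 32 assignments over n = 5 attributes, labeled by f = x0 AND x1:
X = list(product((0, 1), repeat=5))
y = [x[0] & x[1] for x in X]
print(target_ranking(X, y, 2))   # -> [0, 1], the relevant attributes
```

On this sample |S_0| = |S_1| = 128 while each irrelevant variable only reaches |S_i| = 96, so the two relevant attributes head the ranking.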
Robust Inference of Relevant Attributes
105
Moreover, if we only have an a priori upper bound d on the number of relevant variables, then the target ranking algorithms output a set of d variables such that the relevant ones are most likely among them.

Definition 2. Given a concept f with relevant variables x_{i_1}, . . . , x_{i_d} and a sample T for f, we say that an algorithm succeeds in a step if the output generated in that step is a relevant variable of f. The algorithm is said to be correct if it is successful in all steps it makes. It is complete if it finds all relevant variables. Finally, an algorithm is said to be successful if it is both correct and complete.

The following properties are easy to show:

Lemma 1. Let f depend on d variables, and let T be a sample for f.
(a) If Target Ranking (resp., Modest Target Ranking) is correct on input (T, d), then it is successful, too.
(b) If Target Ranking (resp., Modest Target Ranking) is complete on input (T, d), then it is also successful.
(c) If Target Ranking (resp., Modest Target Ranking) is successful on input (T, d), then Greedy Ranking (resp., Modest Ranking) is correct on input T.

In order to uniquely recognize the relevance of some variable x_{i_r}, there has to be an edge in the functional relations graph whose only relevant label is x_{i_r}. Thus, independently of the learning algorithm used, a necessary (but not sufficient) condition to infer the relevance of x_{i_r} is the occurrence of two examples k, l in the input sample with x_{i_r}(k) = 0 and x_{i_r}(l) = 1, but with identical values for all other relevant attributes. By the birthday paradox, already for the uniform distribution roughly √(2^{d−1}) examples are necessary to guarantee such an occurrence. This shows that, for any algorithm to be complete, Ω(2^{d/2}) examples have to be provided for information-theoretic reasons.
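The √(2^{d−1}) estimate follows the usual birthday computation: with N = 2^{d−1} equally likely patterns of the other relevant attributes, about √(2N) examples give a constant repetition probability. A small sketch (the choice d = 11 is only an illustration):

```python
import math

def collision_prob(N, m):
    """Exact probability that m independent uniform draws from N equally
    likely patterns contain at least one repetition (birthday paradox)."""
    p_no = 1.0
    for i in range(m):
        p_no *= 1.0 - i / N
    return 1.0 - p_no

d = 11                       # assumed number of relevant attributes
N = 2 ** (d - 1)             # patterns of the other d - 1 relevant attributes
m = math.isqrt(2 * N) + 1    # roughly sqrt(2^d) examples
print(round(collision_prob(N, m), 2))   # -> 0.64, close to 1 - 1/e
```

With substantially fewer than √(2N) examples this probability drops towards 0, matching the Ω(2^{d/2}) lower bound on the sample size.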
5
Probabilistic Analysis of the Ranking Strategies
Let f : {0, 1}^n → {0, 1} be a concept with relevant variables x_{i_1}, . . . , x_{i_d}, and let p be a probability distribution. For x, y ∈ {0, 1}, we denote by "x_i = x" the set of examples with x_i = x, and by "f = y" the set of examples with f(x_1, . . . , x_n) = y. For i ∈ {1, . . . , n} and x, y ∈ {0, 1}, define α_i^{(x,y)} as the probability that a randomly drawn example (x_1, . . . , x_n; f(x_{i_1}, . . . , x_{i_d})) has x_i = x and f(x_{i_1}, . . . , x_{i_d}) = y, i.e., α_i^{(x,y)} = Pr(x_i = x ∧ f = y).

Let T = (X; y) be a sample of size m for f generated according to p. Define K = {1, . . . , m} and K_i^{(x,y)} = {k ∈ K | x_i(k) = x and y(k) = y}. The situation is depicted in Fig. 5. Since all examples are identically distributed, |K_i^{(x,y)}| can be considered a binomially distributed random variable with parameters α_i^{(x,y)} and m. Analogously to the α_i^{(x,y)}'s, we denote by β_i^{(x,y)} the corresponding relative frequencies in the input sample, i.e., β_i^{(x,y)} = |K_i^{(x,y)}| / |K|, and define α_i = α_i^{(0,0)} α_i^{(1,1)} + α_i^{(1,0)} α_i^{(0,1)} and β_i = β_i^{(0,0)} β_i^{(1,1)} + β_i^{(1,0)} β_i^{(0,1)}. It holds that |S_i| = β_i m², and for large m we get the approximation |S_i| ≈ α_i m².
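The identity |S_i| = β_i m² can be verified mechanically on any sample (an illustrative sketch with our own helper names; exact rational arithmetic sidesteps floating-point rounding):

```python
from fractions import Fraction

def beta_i(X, y, i):
    """beta_i = beta^(0,0) beta^(1,1) + beta^(1,0) beta^(0,1) for variable
    x_i, computed exactly with rationals from the cell counts |K_i^(x,y)|."""
    m = len(X)
    K = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for k in range(m):
        K[(X[k][i], y[k])] += 1
    return (Fraction(K[(0, 0)], m) * Fraction(K[(1, 1)], m)
            + Fraction(K[(1, 0)], m) * Fraction(K[(0, 1)], m))

def edge_count(X, y, i):
    """|S_i|, counted directly over the pairs of examples."""
    m = len(X)
    return sum(1 for k in range(m) for l in range(k + 1, m)
               if y[k] != y[l] and X[k][i] != X[l][i])

# The identity |S_i| = beta_i * m^2 holds exactly, not just asymptotically:
X = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
y = [0, 1, 1, 0, 1]
m = len(X)
for i in range(2):
    assert edge_count(X, y, i) == beta_i(X, y, i) * m * m
```

Each unordered edge of S_i joins one example from K_i^{(0,0)} with one from K_i^{(1,1)}, or one from K_i^{(1,0)} with one from K_i^{(0,1)}, which is exactly what the two products count.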
Fig. 5. Partition of K with respect to variable x_i: the four cells K_i^{(0,0)}, K_i^{(0,1)}, K_i^{(1,0)}, K_i^{(1,1)} according to the values of x_i(k) and y(k). The edges in S_i run between K_i^{(0,0)} and K_i^{(1,1)} and between K_i^{(1,0)} and K_i^{(0,1)}; the remaining edges of S are not in S_i.

Lemma 2. For fixed i ∈ {1, . . . , n} and arbitrary 0 ≤ δ ≤ 1/10, it holds that Pr( | |S_i| − α_i m² | ≥ δ m² ) ≤ 8 e^{−(1/3) δ² m}.
Proof. The proof requires lengthy calculations and case distinctions. It is based on standard Chernoff bound techniques and can be found in [7].
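The concentration of |S_i| around α_i m² asserted by Lemma 2 can be illustrated numerically; the target function f = x_0 ∧ x_1, the sample size, and the tolerances below are assumptions of this sketch, and the fixed seed makes the run reproducible:

```python
import random

def alpha_hat(X, y, i):
    """|S_i| / m^2, computed from the cell counts K_i^(x,v) in O(m) time
    instead of enumerating all pairs of examples."""
    m = len(X)
    K = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for k in range(m):
        K[(X[k][i], y[k])] += 1
    return (K[(0, 0)] * K[(1, 1)] + K[(1, 0)] * K[(0, 1)]) / (m * m)

random.seed(0)
m, n = 20000, 4
X = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
y = [x[0] & x[1] for x in X]           # f = x0 AND x1, uniform examples

# alpha_0 = 1/8 for the relevant x0; alpha_2 = 3/32 for the irrelevant x2.
print(alpha_hat(X, y, 0), alpha_hat(X, y, 2))
```

Both empirical values land close to their exact counterparts, and the relevant variable's value exceeds the irrelevant one's — the gap that Theorem 3 below turns into a success guarantee.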
The following theorem provides very general conditions that guarantee the success of the ranking algorithms with respect to a concept f : Theorem 3. Let f : {0, 1}n → {0, 1} depend on xi1 , . . . , xid , let T be a sample for f generated according to a probability distribution p : {0, 1}n → [0, 1], and let c > 0. (a) If min{αi | xi ∈ V + } > max{αj | xj ∈ V − }, then with probability 1 − n−c , Target Ranking is successful on input (T, d), provided that m ≥ 12ε−2 ((c + 1) ln n + ln 8), where ε = min { min{αi | xi ∈ V + } − max{αj | xj ∈ V − } , 1/5}. (b) If max{αi | i ∈ V + } < min{αj | j ∈ V − }, then with probability 1 − n−c , Modest Target Ranking is successful on input (T, d), provided that m ≥ 12ε−2 ((c + 1) ln n + ln 8), where ε = min { min{αj | xj ∈ V − } − max{αi | xi ∈ V + } , 1/5}. Proof. We only prove part (a), 2 since (b) can be +done analogously. Let t = 1 + − min α + max α it holds that i j m . Then for xi ∈ V i∈V j∈V 2
Pr(|S_i| ≤ t) ≤ Pr( |S_i| ≤ (α_i − ε/2) m² ) ≤ Pr( | |S_i| − α_i m² | ≥ (ε/2) m² ) ≤ 8 e^{−(1/3)(ε/2)² m} = 8 e^{−(1/12) ε² m},

where the last inequality is due to Lemma 2. Similarly, for x_j ∈ V−,

Pr(|S_j| ≥ t) ≤ 8 e^{−(1/12) ε² m}.
Robust Inference of Relevant Attributes
Target Ranking is successful on input (T, d) iff it is correct in all of its d steps. This is exactly the case if the largest d sets S_i correspond to the relevant variables, i.e., if min_{x_i ∈ V+} |S_i| > max_{x_j ∈ V−} |S_j|. We have
Pr( min_{x_i ∈ V+} |S_i| > max_{x_j ∈ V−} |S_j| ) ≥ Pr( min_{x_i ∈ V+} |S_i| > t ∧ max_{x_j ∈ V−} |S_j| < t )
= 1 − Pr( ∃ x_i ∈ V+ : |S_i| ≤ t ∨ ∃ x_j ∈ V− : |S_j| ≥ t )
≥ 1 − ( Σ_{x_i ∈ V+} Pr(|S_i| ≤ t) + Σ_{x_j ∈ V−} Pr(|S_j| ≥ t) )
≥ 1 − 8n e^{−(1/12) ε² m}.

If m ≥ 12 ε^{−2} ((c + 1) ln n + ln 8), then 8n e^{−(1/12) ε² m} ≤ n^{−c}, and thus the claim follows.
As Theorem 3 is stated for a general setting, let us now consider some typical input distributions and simplify its conditions in these cases.

• Independent Attributes (IA). Suppose that the values of the x_i's (i = 1, ..., n) are generated independently of each other, say with Pr(x_i = 1) = p_i ∈ [0,1] (thus Pr(x_i = 0) = 1 − p_i). Then we say that the sample is IA(p_1, ..., p_n)-generated.

Lemma 3. Let T be an IA(p_1, ..., p_n)-generated sample for f. Then, for x_j ∈ V−, we have α_j = 2 p_j (1 − p_j) Pr(f = 0) Pr(f = 1).

• Independent Equiprobable Attributes (IEA). If T is IA(p_1, ..., p_n)-generated with p_1 = ... = p_n = q, then we say that T is IEA(q)-generated.

Lemma 4. Let T be an IEA(q)-generated sample for f.
(a) For each x_j ∈ V−, it holds that α_j = 2q(1 − q) Pr(f = 0) Pr(f = 1). In particular, α_j is independent of x_j ∈ V−. We denote the common value of these α_j's by α− in this case.
(b) If f is symmetric, then the α_i's with x_i ∈ V+ are also independent of i. We denote the common value of these α_i's by α+ in this case.

From the previous lemma and Theorem 3 we immediately obtain the following result on the success of the target ranking algorithms when applied to symmetric Boolean functions.

Corollary 1. For f with a symmetric base function f̃, three cases can occur:
– α+ > α−: O(log n) input examples suffice such that Target Ranking is successful with high probability.
– α+ < α−: O(log n) input examples suffice such that Modest Target Ranking is successful with high probability.
– α+ = α−: No success ratios can be guaranteed for the ranking algorithms, regardless of how many input examples are provided.

• Uniformly Distributed Attributes (UDA). If the examples are uniformly distributed, i.e., if a sample T is IEA(1/2)-generated, then we say that T is UDA-generated.
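Lemma 3 is easy to check numerically by enumerating the product distribution. The sketch below (our code, not the paper's) computes the exact α_i of an attribute by brute force, so it can be compared against the closed form 2 p_j (1 − p_j) Pr(f = 0) Pr(f = 1) for an irrelevant attribute:

```python
from itertools import product

def exact_alpha(f, probs, i):
    """Return (alpha_i, Pr(f = 1)) with alpha_i = a00*a11 + a10*a01 and
    a_xy = Pr(x_i = x and f = y), under independent attributes with
    Pr(x_j = 1) = probs[j]."""
    n = len(probs)
    a = {(x, y): 0.0 for x, y in product((0, 1), (0, 1))}
    pf1 = 0.0
    for bits in product((0, 1), repeat=n):
        p = 1.0
        for j, b in enumerate(bits):
            p *= probs[j] if b else 1 - probs[j]
        a[(bits[i], f(bits))] += p
        pf1 += p * f(bits)
    return a[(0, 0)] * a[(1, 1)] + a[(1, 0)] * a[(0, 1)], pf1
```

For instance, with f = x_1 ∧ x_2 and an irrelevant x_3, the exact α_3 coincides with 2 p_3 (1 − p_3) Pr(f = 0) Pr(f = 1) up to floating-point error, for any choice of the p_j's.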
6 Inferring Specific Concepts
We now consider several basic Boolean functions.

Theorem 4 (AND-function). Let {i_1, ..., i_d} ⊆ {1, ..., n}, let f : {0,1}^n → {0,1} be defined by f(x_1, ..., x_n) = x_{i1} ∧ ... ∧ x_{id}, and let T be an IEA(q)-generated sample for f.
(a) If q < 1/2, then it holds that α+ > α−. Thus the success ratio for Target Ranking may be raised arbitrarily close to 1 by choosing a large enough sample size m ∈ O(log n).
(b) If q = 1/2, then it holds that α+ − α− = 2^{−2d−1} > 0. Thus the success ratio for Target Ranking may be raised arbitrarily close to 1 by choosing a large enough sample size m ∈ O(log n), with the constant being of order 2^{4d}.
(c) If q > 1/2, then for sufficiently large d, we have α+ < α−. Thus the success ratio for Modest Target Ranking may be raised arbitrarily close to 1 by choosing a large enough sample size m ∈ O(log n).
The same ideas apply to the OR-function with q substituted by 1 − q.

Sketch of proof: We have Pr(f = 1) = q^d and Pr(f = 0) = 1 − q^d, thus α− = 2 q^{d+1} (1 − q)(1 − q^d) by Lemma 4 (a). Furthermore, α_i^{(0,0)} = 1 − q, α_i^{(1,1)} = q^d, and α_i^{(0,1)} = 0, yielding α+ = q^d (1 − q). Hence, α+ > α− ⇔ q(1 − q^d) < 1/2, from which (a) and (c) follow. Similarly, (b) can be shown by plugging in q = 1/2.
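The closed forms in the proof sketch can be checked by brute force. The sketch below (our code, not from the paper) computes α+ and α− exactly for the AND-function under IEA(q) by enumerating the d relevant bits plus one irrelevant bit:

```python
from itertools import product

def and_alphas(d, q):
    """Exact alpha+ (for a relevant attribute) and alpha- (for an irrelevant
    one) of f = x_1 AND ... AND x_d under IEA(q)."""
    rel = {(x, y): 0.0 for x, y in product((0, 1), (0, 1))}
    irr = {(x, y): 0.0 for x, y in product((0, 1), (0, 1))}
    for bits in product((0, 1), repeat=d + 1):
        p = 1.0
        for b in bits:
            p *= q if b else 1 - q
        f = int(all(bits[:d]))
        rel[(bits[0], f)] += p   # x_1 is relevant
        irr[(bits[d], f)] += p   # the extra attribute is irrelevant
    val = lambda a: a[(0, 0)] * a[(1, 1)] + a[(1, 0)] * a[(0, 1)]
    return val(rel), val(irr)
```

At q = 1/2 the gap α+ − α− comes out as exactly 2^{−2d−1}, matching part (b), and for q < 1/2 one observes α+ > α− as in part (a).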
Theorem 5 (Monomials). Let {i_1, ..., i_d} ⊆ {1, ..., n}, let l_r ∈ {x_{ir}, ¬x_{ir}} for each r ∈ {1, ..., d}, and let f : {0,1}^n → {0,1} be defined by f(x_1, ..., x_n) = l_1 ∧ ... ∧ l_d. Let T be a UDA-generated sample for f. Then Target Ranking is successful with high probability, provided that a sample of size m ∈ Ω(2^{4d} · log n) is given.

Sketch of proof: The analysis is similar to the one in Theorem 4 for q = 1/2. In particular, α+ − α− = 2^{−2d−1} > 0. Now the claim follows from Theorem 3.
Akutsu, Miyano, and Kuhara [3] showed a similar result for the Greedy algorithm. Note that for monomials under the uniform input distribution, 2^d examples are necessary in order to obtain (in expectation) at least one example with label 1. (If there is no such example, then the sample can be explained by the constant zero function.) It is easy to see that if one can infer the relevant attributes of a function f, the same holds for its negation. In particular, the result for monomials translates to clauses. The case of negating individual attribute values is more complex. At least in the case of the uniform distribution, the inferability is not affected.

Theorem 6 (Threshold functions). Let {i_1, ..., i_d} ⊆ {1, ..., n}, 1 ≤ t ≤ d, and f : {0,1}^n → {0,1} be defined by f(x_1, ..., x_n) = 1 iff Σ_{r=1}^d x_{ir} ≥ t. Let T be a UDA-generated sample for f. Then Target Ranking is successful with high probability, provided that m ∈ Ω( (d choose t)^{−4} · 2^{4d} · log n ).
Sketch of proof: A straightforward calculation yields α+ − α− = (d−1 choose t−1)² · 2^{−2d−1}. Now Theorem 3 yields the claim.
If t = d in the previous theorem, then f = AND, and we recover our result from Theorem 4, part (b). Moreover, under uniformly distributed inputs, the gap between α+ and α− for threshold functions is smallest for t ∈ {1, d}. The largest gap is reached for t = ⌈d/2⌉, the majority function. Since (d−1 choose ⌈d/2⌉−1) ∈ Θ((1/√d) 2^d), we have α+ − α− ∈ Θ(d^{−1}). Applying Theorem 3, this proves the following

Corollary 2 (Majority function). Let f : {0,1}^n → {0,1} such that its base function f̃ : {0,1}^d → {0,1} is the majority function. Then Target Ranking is successful with high probability, provided that m ∈ Ω(d² · log n).

For symmetric Boolean functions, one cannot always guarantee α+ ≠ α−, even for UDA-generated samples. A simple counterexample is the parity function f(x_1, ..., x_n) = (x_{i1} + ... + x_{id}) mod 2, for which α_i = 1/8 for all i ∈ {1, ..., n}, no matter whether x_i ∈ V+ or x_i ∈ V−. Thus the ranking strategies do not work for the parity function. We provide an alternative solution for such concepts in Sect. 8.
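The claim α_i = 1/8 for parity can be verified by brute-force enumeration. The following sketch (our code, with a hypothetical function name) computes the exact α_i under the uniform distribution:

```python
from itertools import product

def parity_alphas(n, relevant):
    """Exact alpha_i for every attribute, where f is the parity (XOR) of the
    attributes listed in `relevant`, under the uniform distribution on {0,1}^n."""
    alphas = []
    for i in range(n):
        a = {(x, y): 0.0 for x, y in product((0, 1), (0, 1))}
        for bits in product((0, 1), repeat=n):
            f = sum(bits[j] for j in relevant) % 2
            a[(bits[i], f)] += 2.0 ** -n   # each assignment has weight 2^(-n)
        alphas.append(a[(0, 0)] * a[(1, 1)] + a[(1, 0)] * a[(0, 1)])
    return alphas
```

For any choice of d ≥ 2 relevant attributes, every entry of the returned list equals 1/8, relevant or not, which is exactly why the ranking strategies cannot separate V+ from V− here.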
7 Robust Inference
As real data usually contain noise, our ultimate goal is to handle cases in which the attribute values underlie certain disturbances. More precisely, we assume that in each input example, attribute x_i is flipped with probability δ_i, i.e., an algorithm obtains the complemented value ¬x_i(k) instead of the correct value x_i(k) with probability δ_i. We call the resulting set of disturbed examples a δ-disturbed sample, where δ = (δ_1, ..., δ_n). Note that this assumption introduces a linear number of faults (with respect to the number of attributes). Fortunately, it can be shown that the ranking algorithms still perform well if they are given such disturbed samples.

The key idea in this case is to examine how much the sets S_i computed by the ranking algorithms deviate from the S_i's intended by the real data. We denote the sets derived from the disturbed data by Ŝ_i. Furthermore, for i ∈ {1, ..., n}, let F_i = {k ∈ {1, ..., m} | the input table contains ¬x_i(k) instead of x_i(k)}. The following lemma is analogous to Lemma 2:

Lemma 5. Let i ∈ {1, ..., n} with δ_i ≤ 1/30. Then, for ε such that 6δ_i ≤ ε ≤ 1/5, it holds that Pr( | |Ŝ_i| − α_i m² | ≥ εm² ) ≤ 9 e^{−(1/12) ε² m}.

Sketch of proof: We use the inequality | |Ŝ_i| − α_i m² | ≤ | |Ŝ_i| − |S_i| | + | |S_i| − α_i m² | and compute the probability that each of the summands on the right-hand side is bounded by (ε/2) m². Combinatorial investigations yield | |Ŝ_i| − |S_i| | ≤ m |F_i| + (1/2)|F_i|². In particular, if |F_i| ≤ (1/3) εm, then | |Ŝ_i| − |S_i| | ≤ (1/2) εm². From standard Chernoff bounds, it follows that Pr( |F_i| ≥ (ε/3) m ) ≤ e^{−(1/3)·(ε/6)·m} = e^{−(1/18) εm}, since δ_i ≤ ε/6 (|F_i| can be considered as a binomially distributed random variable with parameters δ_i and m). Now
Pr( | |Ŝ_i| − α_i m² | ≥ εm² ) ≤ Pr( | |Ŝ_i| − |S_i| | ≥ (ε/2) m² ∨ | |S_i| − α_i m² | ≥ (ε/2) m² )
≤ Pr( |F_i| ≥ (ε/3) m ) + 8 e^{−(1/3)(ε/2)² m}
≤ e^{−(1/18) εm} + 8 e^{−(1/12) ε² m} ≤ 9 e^{−(1/12) ε² m},

where we make use of Lemma 2 and the fact that (1/18) εm > (1/12) ε² m for ε ≤ 1/5.

Besides the general information-theoretic problem that a sample may already be explained by a proper subset of the relevant variables, just the opposite phenomenon can occur due to disturbances: Ŝ may not be covered by Ŝ_1, ..., Ŝ_n, so Greedy, Greedy Ranking, and Modest Ranking – as introduced in Sect. 4 – do not terminate on the corresponding input samples. Therefore, when faced with the disturbed situation, we modify the algorithms as follows: All edges that do not belong to any of the computed Ŝ_i's are ignored, i.e., we compute a new set Ŝ_new = Ŝ \ { {k,l} ∈ Ŝ | ∀i ∈ {1, ..., n} : {k,l} ∉ Ŝ_i }. The edges removed are exactly those connecting two example nodes with identical attribute values but different labels. All algorithms make use of this set Ŝ_new instead of Ŝ. However, this modification does not affect our analysis, so we continue by writing Ŝ. In the noisy scenario, Lemma 1 has to be modified as follows:
Lemma 6. Let f be a concept depending on d variables, and T a δ-disturbed sample for f such that Target Ranking is successful on (T, d). If Greedy Ranking outputs at most d variables on input T, then it is correct. Otherwise, the first d variables output by Greedy Ranking are the relevant ones.

We now state our main theorem for the case of disturbed samples:

Theorem 7. Let f : {0,1}^n → {0,1} with relevant variables x_{i1}, ..., x_{id}, and let δ = (δ_1, ..., δ_n) ∈ [0,1]^n. Let T be a δ-disturbed sample for f generated according to a probability distribution p : {0,1}^n → [0,1], and let c > 0.

(a) If min{α_i | x_i ∈ V+} > max{α_j | x_j ∈ V−} and δ_k ≤ (1/12) ε for all k ∈ {1, ..., n}, then with probability 1 − n^{−c}, Target Ranking is successful on input (T, d), provided that m ≥ 48 ε^{−2} ((c + 1) ln n + ln 9), where ε = min{ min{α_i | x_i ∈ V+} − max{α_j | x_j ∈ V−}, 2/5 }.

(b) If max{α_i | x_i ∈ V+} < min{α_j | x_j ∈ V−} and δ_k ≤ (1/12) ε for all k ∈ {1, ..., n}, then with probability 1 − n^{−c}, Modest Target Ranking is successful on input (T, d), provided that m ≥ 48 ε^{−2} ((c + 1) ln n + ln 9), where ε = min{ min{α_j | x_j ∈ V−} − max{α_i | x_i ∈ V+}, 2/5 }.

Proof. Extension of the analysis in the proof of Theorem 3. See [7].
We would like to stress that the algorithms have not been modified in any way in order to overcome the disturbances. In particular, we do not have to assume that the algorithms have any knowledge about the error probabilities δ_1, ..., δ_n. Even more, the sample size required for Target Ranking only has to be enlarged by a factor of 4 in order to obtain the same success probability in case of a (small) constant percentage of errors in the input sample.
8 Inferring Relevant Attributes of the Parity Function
Throughout this section we identify {0,1} with the two-element field GF(2) and denote by ⊕ the sum operation in this field. Furthermore, we define |ξ| = Σ_{i=1}^n ξ_i for ξ ∈ {0,1}^n (here the sum is taken in Z). Let f : {0,1}^n → {0,1} be defined by f(x_1, ..., x_n) = x_{i1} ⊕ ... ⊕ x_{id} for some set of variable indices I = {i_1, ..., i_d} ⊆ {1, ..., n}. Since we have seen at the end of Sect. 6 that ranking the variables according to their occurrences in the functional relations graph does not work for the parity function, we present a different algorithm, Parity Infer, to find the relevant variables. The idea is simply to compute a solution of a system of linear equations associated with the input sample and then to infer from this solution a set of variables that can explain the sample.
input (X; y) ∈ {0,1}^{m×(n+1)}
solve Xξ = y
if there is no solution then output 'sample contains wrong data'
else choose any solution ξ; output all x_i's such that ξ_i = 1

Fig. 6. Algorithm Parity Infer
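The linear-algebra step is ordinary Gaussian elimination over GF(2). A minimal Python sketch of Parity Infer (our implementation; the paper only gives the pseudocode above):

```python
def parity_infer(X, y):
    """Parity Infer: solve X·xi = y over GF(2) by Gaussian elimination.
    Returns the index set {i | xi_i = 1} of one solution (free variables set
    to 0), or None if the system is inconsistent ('sample contains wrong data')."""
    m, n = len(X), len(X[0])
    rows = [list(X[k]) + [y[k]] for k in range(m)]   # augmented matrix over GF(2)
    pivot_cols = []
    r = 0
    for c in range(n):
        # find a row at or below r with a 1 in column c
        p = next((k for k in range(r, m) if rows[k][c]), None)
        if p is None:
            continue                                  # no pivot in this column
        rows[r], rows[p] = rows[p], rows[r]
        for k in range(m):
            if k != r and rows[k][c]:
                rows[k] = [a ^ b for a, b in zip(rows[k], rows[r])]
        pivot_cols.append(c)
        r += 1
        if r == m:
            break
    if any(row[n] for row in rows[r:]):               # a row 0 = 1: no solution
        return None
    xi = [0] * n
    for k, c in enumerate(pivot_cols):
        xi[c] = rows[k][n]
    return [i for i in range(n) if xi[i]]
```

If X has full rank, the solution, and hence the output, is unique; otherwise the sketch returns one solution with the free variables set to 0, which need not have minimum |ξ|.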
Let us again differentiate between two aspects: on the one hand, the optimization problem INFRA(⊕), obtained by restricting the instances and the solutions of INFRA to samples for concepts whose base functions are parity functions – such functions can be uniquely described by the set V+ of relevant variables – and, on the other hand, finding exactly the relevant variables of a given but unknown parity function, provided that the sample size is large enough. Let T = (X; y) ∈ {0,1}^{m×(n+1)}. There is a one-to-one correspondence between solutions V+ of the INFRA(⊕) instance T and the solutions ξ ∈ {0,1}^n of the system of linear equations Xξ = y, given by ξ_i = 1 iff x_i ∈ V+. The task of finding an optimal solution for an INFRA(⊕) instance is equivalent to finding a solution ξ of Xξ = y with minimum |ξ|. Since {x_i | i ∈ I} is a solution for T, the system has at least one solution. Moreover, if X has full rank (i.e., rank(X) = n), then there is a unique solution, which is of course also an optimal solution in this case. There is a well-known correspondence between INFRA(⊕) and the Nearest Codeword problem. A Nearest Codeword instance consists of a matrix
A ∈ {0,1}^{n×r} and a vector b ∈ {0,1}^n. A solution is a vector x ∈ {0,1}^r, and the goal is to minimize the Hamming distance of Ax and b (i.e., |Ax ⊕ b|). The obvious reduction is approximation factor preserving. Using a result of [6], this implies

Theorem 8. For any ε > 0, INFRA(⊕) cannot be approximated within a factor of 2^{log^{1−ε} m} unless NP ⊆ DTIME(n^{polylog(n)}).

Despite this negative result, INFRA(⊕) can be solved efficiently on the average. We show that under certain assumptions the variables detected by Parity Infer are exactly the relevant ones with high probability.

Theorem 9. Let f : {0,1}^n → {0,1} such that its base function f̃ is a parity function, and let T = (X; y) ∈ {0,1}^{m×n} × {0,1}^m be a UDA-generated sample for f. If m ≥ n + k(2 ln k + 1) with k = c log n + 1 for some c > 0, then with probability 1 − n^{−c}, Xξ = y has exactly one solution ξ.

Corollary 3. Under the conditions described in Theorem 9, using sample size m = n + O(c log n log log n), Parity Infer is successful with probability 1 − n^{−c}, where c may be chosen arbitrarily large.
9 Conclusions and Further Research
For inferring relevant Boolean-valued attributes we have presented ranking algorithms, which are modifications of greedy algorithms proposed earlier. We have extended a negative approximability result to the restriction to Boolean values only and have improved a lower bound by using Feige's result. General criteria for the success of our algorithms have been established in terms of some statistical values (depending on the concept considered and the probability distribution). These results have been applied to a series of typical input distributions and specific functions. In the case of monotone functions, a straightforward modification of our strategy restricts the edge set S_i to those edges {k, l} with x_i(k) = y(k) = 0 and x_i(l) = y(l) = 1. This halves the values α_j for x_j ∈ V− and thus may satisfy (or improve) the conditions of the main theorems for certain monotone functions.

Next, we have investigated the case of noisy attribute values. We have shown that our algorithms still succeed with high probability if their input contains a (small) constant fraction of wrong values. This desirable robustness property is achieved without requiring any specific knowledge about the likelihood of errors.

One direction of future research could be to extend these results to more complex Boolean functions such as DNF formulas with a constant number of monomials. Furthermore, the case of robustly inferring relevant attributes of parity functions remains open. Another generalization would be the case that attributes and/or labels may take values from sets with more than two elements.

Given an input instance, Greedy Ranking always outputs a proper solution that is capable of explaining the sample. However, if the input has some
disturbances, Greedy Ranking might indeed stop only after having chosen significantly more than the real number of relevant attributes. In such situations, one might be interested in algorithms that – rather than computing an exact solution for the given input data – output a simple solution fitting an input instance that is in some sense 'near' to the given one. Following Occam's razor, such a simple solution may be much more likely to explain the real phenomenon. Currently, we are working on a general framework for this setting.
References

1. R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases. Proc. 1993 ACM SIGMOD Conf., 207–216.
2. T. Akutsu, F. Bao, Approximating Minimum Keys and Optimal Substructure Screens. Proc. 2nd COCOON, Springer LNCS 1090 (1996), 290–299.
3. T. Akutsu, S. Miyano, and S. Kuhara, A Simple Greedy Algorithm for Finding Functional Relations: Efficient Implementation and Average Case Analysis. TCS 292(2) (2003), 481–495. (See also Proc. 3rd DS, Springer LNAI 1967 (2000), 86–98.)
4. D. Angluin, Queries and Concept Learning. Machine Learning 2(4) (1988), 319–342, Kluwer Academic Publishers, Boston.
5. D. Angluin and P. Laird, Learning from Noisy Examples. Machine Learning 2(4) (1988), 343–370, Kluwer Academic Publishers, Boston.
6. S. Arora, L. Babai, J. Stern, and Z. Sweedyk, The Hardness of Approximate Optima in Lattices, Codes, and Systems of Linear Equations. J. CSS 54 (1997), 317–331.
7. J. Arpe, R. Reischuk, Robust Inference of Relevant Attributes. Techn. Report SIIM-TR-A 03-12, Univ. Lübeck, 2003, available at http://www.tcs.mu-luebeck.de/TechReports.html.
8. A. Blum, L. Hellerstein, and N. Littlestone, Learning in the Presence of Finitely or Infinitely Many Irrelevant Attributes. Proc. 4th COLT '91, 157–166.
9. A. Blum, P. Langley, Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence 97(1–2) (1997), 245–271.
10. U. Feige, A Threshold of ln n for Approximating Set Cover. J. ACM 45 (1998), 634–652.
11. S. Goldman, H. Sloan, Can PAC Learning Algorithms Tolerate Random Attribute Noise? Algorithmica 14 (1995), 70–84.
12. D. Johnson, Approximation Algorithms for Combinatorial Problems. J. CSS 9 (1974), 256–278.
13. N. Littlestone, Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning 4(2) (1988), 285–318, Kluwer Academic Publishers, Boston.
14. N. Littlestone, From On-line to Batch Learning. Proc. 2nd COLT 1989, 269–284.
15. H. Mannila, K. Räihä, On the Complexity of Inferring Functional Dependencies. Discrete Applied Mathematics 40 (1992), 237–243.
16. E. Mossel, R. O'Donnell, R. Servedio, Learning Juntas. Proc. STOC '03, 206–212.
17. L. Valiant, Projection Learning. Machine Learning 37(2) (1999), 115–130, Kluwer Academic Publishers, Boston.
Efficient Learning of Ordered and Unordered Tree Patterns with Contractible Variables

Yusuke Suzuki¹, Takayoshi Shoudai¹, Satoshi Matsumoto², Tomoyuki Uchida³, and Tetsuhiro Miyahara³

¹ Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
{y-suzuki,shoudai}@i.kyushu-u.ac.jp
² Department of Mathematical Sciences, Tokai University, Hiratsuka 259-1292, Japan
[email protected]
³ Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan
{uchida@cs,miyahara@its}.hiroshima-cu.ac.jp
Abstract. Due to the rapid growth of tree structured data such as Web documents, efficient learning from tree structured data becomes more and more important. In order to represent structural features common to such tree structured data, we propose a term tree, which is a rooted tree pattern consisting of tree structures and labeled variables. A variable is a labeled hyperedge, which can be replaced with any tree. A contractible variable is an erasing variable which is adjacent to a leaf. A contractible variable may be replaced with a singleton vertex. A usual variable, called an uncontractible variable, is replaced with a tree of size at least 2. In this paper, we deal with ordered and unordered term trees with contractible and uncontractible variables such that all variables have mutually distinct variable labels. First we give a polynomial time algorithm for deciding whether or not a given term tree matches a given tree. Let Λ be a set of edge labels. Second, when Λ has more than one edge label, we give a polynomial time algorithm for finding a minimally generalized ordered term tree which explains all given tree data. Lastly, when Λ has infinitely many edge labels, we give a polynomial time algorithm for finding a minimally generalized unordered term tree which explains all given tree data. These results imply that the classes of ordered and unordered term trees are polynomial time inductively inferable from positive data.
1 Introduction
Due to the rapid growth of semistructured data such as Web documents, Information Extraction from semistructured data becomes more and more important. Web documents such as HTML/XML files have no rigid structure and are called semistructured data. According to the Object Exchange Model [1], we treat semistructured data as tree structured data. Tree structured data such as HTML/XML files are represented by rooted trees with edge labels. In order to represent a tree structured pattern common to such tree structured data, we

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 114–128, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Ordered term trees t1 , t2 and t3 and ordered trees T1 , T2 and T3 . An uncontractible (resp. contractible) variable is represented by a single (resp. double) lined box with lines to its elements. The label inside a box is the variable label of the variable.
proposed an ordered term tree and an unordered term tree, which are rooted trees with structured variables [12,13]. Many semistructured data have irregularities such as missing or erroneous data. In the Object Exchange Model, the data attached to leaves are essential information, and such data are represented as subtrees. On the other hand, in analyzing tree structured data, knowledge (or patterns) sensitive to slight differences among such data is often meaningless. For example, patterns extracted from HTML/XML files are affected by attributes of tags, which can be regarded as noise. Therefore we introduce a new kind of variable, called a contractible variable, that is an erasing variable which is adjacent to a leaf. A contractible variable can be replaced with any tree, including a singleton vertex. A usual variable, called an uncontractible variable, is replaced with a tree which consists of at least 2 vertices. A term tree with only uncontractible variables is very sensitive to noise. By introducing contractible variables, we can find term trees robust against such noise. Shinohara [11] started to study the learnabilities of extended pattern languages of strings with erasing variables. Since this pioneering work, researchers in the field of computational learning theory have been interested in classes of string or tree pattern languages with erasing variables which are polynomial time learnable. Recently Uemura et al. [16] showed that classes of unions of erasing regular pattern languages are polynomial time learnable from positive data. In this paper, we study the learnabilities of classes of tree structured patterns with restricted erasing variables, called contractible variables. A term tree t is said to be regular if all variable labels in t are mutually distinct. The term tree language of an ordered term tree t is the set of all ordered trees which are obtained from t by substituting ordered trees for variables in t.
Y. Suzuki et al.
The language shows the expressive power of an ordered term tree t. We say that a regular ordered term tree t explains given tree structured data S if the term tree language of t contains all trees in S. A minimally generalized regular ordered term tree explaining S is a regular ordered term tree t such that t explains S and the language of t is minimal among all term tree languages which contain all trees in S. For example, the term tree t1 in Fig. 1 is a minimally generalized regular ordered term tree explaining T1, T2 and T3. And t2 is also a minimally generalized regular ordered term tree, with no contractible variable, explaining T1, T2 and T3. On the other hand, t1 is overgeneralized and meaningless, since t1 explains any tree of size at least 2. An ordered term tree using contractible and uncontractible variables, rather than one using only uncontractible variables, can express the structural features of ordered trees more precisely. For this reason, we consider that in Fig. 1, t3 is a more precise term tree than t2. In a similar way, we define the term tree language of an unordered term tree and a minimally generalized regular unordered term tree explaining given tree structured data S.

Let Λ be a set of edge labels used in tree structured data. We denote by OTT^c_Λ (resp. UTT^c_Λ) the set of all regular ordered (resp. unordered) term trees with contractible and uncontractible variables. For a set S, the number of elements in S is denoted by |S|. First we give a polynomial time algorithm for deciding whether or not a given regular ordered (resp. unordered) term tree explains an ordered (resp. unordered) tree, where |Λ| ≥ 1. Second, when |Λ| ≥ 2, we give a polynomial time algorithm for finding a minimally generalized regular ordered term tree in OTT^c_Λ which explains all given data.
Lastly, when |Λ| is infinite, we give a polynomial time algorithm for finding a minimally generalized regular unordered term tree in UTT^c_Λ which explains all given data. These results imply that both OTT^c_Λ, where |Λ| ≥ 2, and UTT^c_Λ, where |Λ| is infinite, are polynomial time inductively inferable from positive data. A term tree is different from other representations of tree structured patterns such as those in [2,3,5] in that a term tree has structured variables which can be substituted by arbitrary trees. As related work, we proved the learnability of some classes of term tree languages with no contractible variable. In [13,14], we showed that some fundamental classes of regular ordered term tree languages are polynomial time inductively inferable from positive data. And in [7,9,12], we showed that the class of regular unordered term tree languages with infinitely many edge labels is polynomial time inductively inferable from positive data. Moreover, we showed in [8] that some classes of regular ordered term tree languages are exactly learnable in polynomial time using queries. In [15], we showed that the class of regular ordered term trees with contractible variables and no edge label is polynomial time inductively inferable from positive data. Asai et al. [6] studied a data mining problem for semistructured data by modeling semistructured data as labeled ordered trees and presented an efficient algorithm for finding all frequent ordered tree patterns in semistructured data. In [10], we gave a data mining method for semistructured data using ordered term trees.
2 Ordered and Unordered Term Trees
Definition 1 (Ordered term trees and unordered term trees). Let T = (V_T, E_T) be a rooted tree with ordered children or unordered children, which has a set V_T of vertices and a set E_T of edges. We call a rooted tree with ordered (resp. unordered) children an ordered tree (resp. an unordered tree). Let E_g and H_g be a partition of E_T, i.e., E_g ∪ H_g = E_T and E_g ∩ H_g = ∅. And let V_g = V_T. A triplet g = (V_g, E_g, H_g) is called an ordered term tree if T is an ordered tree, and called an unordered term tree if T is an unordered tree. We call an element in V_g, E_g and H_g a vertex, an edge and a variable, respectively. Below we say a term tree if we do not have to distinguish between ordered term trees and unordered term trees. We assume that every edge and variable of a term tree is labeled with some words from specified languages. A label of a variable is called a variable label. Λ and X denote a set of edge labels and a set of variable labels, respectively, where Λ ∩ X = ∅.

For a term tree g and its vertices v_1 and v_i, a path from v_1 to v_i is a sequence v_1, v_2, ..., v_i of distinct vertices of g such that for any j with 1 ≤ j < i, there exists an edge or a variable which consists of v_j and v_{j+1}. If there is an edge or a variable which consists of v and v′ such that v lies on the path from the root to v′, then v is said to be the parent of v′ and v′ is a child of v. We use the notation [v, v′] to represent a variable {v, v′} ∈ H_g such that v is the parent of v′. Then we call v the parent port of [v, v′] and v′ the child port of [v, v′].

Definition 2 (Regular term tree). A term tree g is regular if all variables in H_g have mutually distinct variable labels in X. In this paper, we discuss regular term trees only. Thus we assume that all term trees in this paper are regular.

Definition 3 (Contractible variables). Let X^c be a distinguished subset of X. We call variable labels in X^c contractible variable labels.
A contractible variable label can be attached to a variable whose child port is a leaf. We call a variable with a contractible variable label a contractible variable; it may be replaced with a tree consisting of a single vertex. We state the formal definitions later. We call a variable which is not a contractible variable an uncontractible variable. In order to distinguish a contractible variable from an uncontractible variable, we denote by [v, v′]^c (resp. [v, v′]) a contractible variable (resp. an uncontractible variable). For an ordered term tree g, the children of every internal vertex u in g have a total ordering. The ordering on the children of u is denoted by <^g_u.
f and g are ordered term trees, for any internal vertex u in f which has more than one child, and for any two children u′ and u″ of u, u′ <^f_u u″ [...]
Case 1: If u′, u″ ∈ V_f and u′ [...]
Procedure Ordered-C-Set-Attaching(v, Rule(t));
input v: a vertex of T, Rule(t): the C-set-attaching rule of t;
begin
  CS(v) := ∅;
  Let c_1, ..., c_m be all ordered children of v in T;
  foreach I(u′) ⇐ J(c′_1), ..., J(c′_{m′}) in Rule(t) do
    if there is a sequence 0 = j_0 ≤ j_1 ≤ ... ≤ j_i ≤ ... ≤ j_{m′−1} ≤ j_{m′} = m such that, for all i = 1, ..., m′,
      1. if J(c′_i) = I(c′_i) then j_i − j_{i−1} = 1 and I(c′_i) ∈ CS(c_{j_i}),
      2. if J(c′_i) = (I(c′_i)) then CS(c_{k_i}) contains I(c′_i) or (I(c′_i)) for some k_i (j_{i−1} < k_i ≤ j_i)
      // there is no condition on j_i when J(c′_i) = I(∅)
    then CS(v) := CS(v) ∪ {I(u′)};
  foreach (I(u′)) ⇐ (I(u′)) in Rule(t) do
    if there is a set among CS(c_1), ..., CS(c_m) which contains I(u′) or (I(u′))
    then CS(v) := CS(v) ∪ {(I(u′))};
  foreach I(u′) ⇐ I(u′) in Rule(t) do CS(v) := CS(v) ∪ {I(u′)}
end;

Fig. 2. Procedure Ordered-C-Set-Attaching for |Λ| = 1. We can easily extend this procedure to the case of |Λ| ≥ 2 by checking edge labels of a term tree and a tree when applying C-set-attaching rules.
It is easy to see that the classes OTT^c_Λ and UTT^c_Λ have finite thickness. In Section 3, we give polynomial-time algorithms for the Membership Problem for OTT^c_Λ and UTT^c_Λ, and in Section 4 we give polynomial-time algorithms for the Minimal Language Problem for OTT^c_Λ (|Λ| ≥ 2) and UTT^c_Λ (|Λ| = ∞). Therefore, we obtain the following main result.

Theorem 1. The classes OTT^c_Λ (|Λ| ≥ 2) and UTT^c_Λ (|Λ| = ∞) are polynomial-time inductively inferable from positive data.
3
An Efficient Matching Algorithm for Term Trees
In this section, we give polynomial-time algorithms for the Membership Problem for OTT^c_Λ and UTT^c_Λ by extending the matching algorithm in [9,13]. Let t = (Vt, Et, Ht) be a term tree and T a tree. We assume that all vertices of the term tree t are associated with mutually distinct numbers, called vertex identifiers. We denote by I(u′) the vertex identifier of u′ ∈ Vt. A correspondence set, C-set for short, is a set of vertex identifiers, each with or without parentheses. A vertex identifier with parentheses indicates that the vertex is a child port of a variable. We employ the dynamic programming method: our matching algorithms proceed by constructing C-sets for each vertex of a given tree T in a bottom-up manner, that is, from the leaves to the root of T. At first, we construct the C-set-attaching rule of a vertex u′ of t as follows. Let c′1, · · · , c′m′ be all ordered (or unordered) children of u′. The C-set-attaching rule of u′ is of the form I(u′) ⇐ J(c′1), . . . , J(c′m′), where J(c′i) = I(c′i) if {u′, c′i} ∈ Et, J(c′i) = I(∅) if [u′, c′i]^c ∈ Ht, and J(c′i) = (I(c′i)) otherwise. I(∅) is a special symbol which indicates that c′i
Procedure Unordered-C-Set-Attaching(v, Rule(t));
input v: a vertex of T, Rule(t): the C-set-attaching rule of t;
begin
  CS(v) := ∅;
  Let c1, · · · , cm be all unordered children of v in T;
  foreach I(u′) ⇐ J(c′1), · · · , J(c′m′) in Rule(t) do begin
    E := {{I(c′i), CS(cj)} | I(c′i) ∈ CS(cj)} ∪ {{I(c′i), CS(cj)} | (I(c′i)) ∈ CS(cj) and J(c′i) = (I(c′i))} (1 ≤ i ≤ m′, 1 ≤ j ≤ m);
    Let G be the bipartite graph ({I(c′1), . . . , I(c′m′)}, {CS(c1), . . . , CS(cm)}, E);
    if J(c′i) = I(c′i) for all i = 1, . . . , m′ then begin
      if there is a perfect matching for G then CS(v) := CS(v) ∪ {I(u′)}
    end
    else if there is an index i (1 ≤ i ≤ m′) such that J(c′i) = I(∅) then begin
      if there is a matching of size m′ − 1 for G then CS(v) := CS(v) ∪ {I(u′)}
    end
    else if there is a matching of size m′ for G then CS(v) := CS(v) ∪ {I(u′)}
  end;
  foreach (I(u′)) ⇐ (I(u′)) in Rule(t) do
    if there is a set in CS(c1), · · · , CS(cm) which has I(u′) or (I(u′)) then
      CS(v) := CS(v) ∪ {(I(u′))};
  foreach I(u′) ⇐ I(u′) in Rule(t) do
    CS(v) := CS(v) ∪ {I(u′)}
end;
Fig. 3. Procedure Unordered-C-Set-Attaching for |Λ| = 1. We can easily extend this procedure to the case |Λ| ≥ 2.
is the child port of a contractible variable. The C-set-attaching rule of t, denoted by Rule(t), is defined as follows: Rule(t) = {I(u′) ⇐ J(c′1), . . . , J(c′m′) | the C-set-attaching rules of all inner vertices u′} ∪ {(I(u′)) ⇐ (I(u′)) | u′ is the child port of an uncontractible variable} ∪ {I(u′) ⇐ I(u′) | u′ has just one child and connects to the child with a contractible variable}. Initially we attach the C-set CS = {I(u′) | u′ is a leaf of t that is not a child port of a contractible variable, or u′ has just one child and connects to it with a contractible variable} to all leaves of T. By using Ordered-C-Set-Attaching (Fig. 2) for the Membership Problem for OTT^c_Λ and Unordered-C-Set-Attaching (Fig. 3) for the Membership Problem for UTT^c_Λ, we repeatedly attach a C-set to each vertex of a given tree T in a bottom-up manner, that is, from the leaves to the root of T. When we cannot apply the procedure to any vertex any more, if the C-set of the root of T contains the vertex identifier of the root of t, then we conclude that t matches T.

Theorem 2. For each TT ∈ {OTT^c_Λ, UTT^c_Λ}, the Membership Problem for TT is solvable in polynomial time.
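The matching tests used by Unordered-C-Set-Attaching (Fig. 3) can be realized with any standard maximum bipartite matching routine; a minimal augmenting-path sketch (Kuhn's algorithm, illustrative only) is:

```python
def max_matching(adj, n_right):
    """Maximum bipartite matching by augmenting paths (Kuhn's algorithm).
    adj[i] lists right-side vertices adjacent to left vertex i."""
    match_right = [-1] * n_right  # right vertex -> matched left vertex

    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                # either v is free, or its current partner can be re-routed
                if match_right[v] == -1 or augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    return sum(augment(u, set()) for u in range(len(adj)))
```

Comparing the returned size against the thresholds used in Fig. 3 (perfect matching, one less than the number of children in the rule, or all of them) then implements the three cases of the procedure.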
Fig. 4. For i = 1, 2, 3, gi ≢ ti, gi ⋢ ti, ti ⋢ gi, and L^O_Λ(gi) = L^O_Λ(ti). The digit k (resp. ≥ k) in a box near u shows that the number of children of u is equal to k (resp. is more than or equal to k). An arrow indicates that its right vertex is the immediately right child of its left vertex.
4
An Algorithm for Finding a Minimally Generalized Term Tree
Let g and t be ordered (or unordered) term trees. We write g ⊑ t if there exists a substitution θ such that g ≡ tθ. For ordered (resp. unordered) term trees g = (V, E, H) and g′ = (V′, E′, H′), we say that g′ is an ordered (resp. unordered) term subtree of g if V′ ⊆ V, E′ ⊆ E, and H′ ⊆ H. For an ordered (resp. unordered) term tree t, an occurrence of t in g is an ordered (resp. unordered) term subtree of g which is isomorphic to t. For any ordered (resp. unordered) term tree g, we denote by s(g) the ordered (resp. unordered) tree obtained from g by replacing all edges of g with uncontractible variables and all contractible variables of g with singleton vertices. For any two ordered (or unordered) term trees g and t, we write g ≈ t if s(g) is isomorphic to s(t). 4.1
An Algorithm for Ordered Term Trees
Lemma 1. Let gi and ti (1 ≤ i ≤ 3) be the ordered term trees in OTT^c_Λ described in Fig. 4. Let t be an ordered term tree in OTT^c_Λ which has at least one occurrence of ti (1 ≤ i ≤ 3). For one of the occurrences of ti, we make a new ordered term tree g by replacing the occurrence of ti with gi. Then L^O_Λ(g) = L^O_Λ(t).

Definition 8. Let g be an ordered term tree in OTT^c_Λ for |Λ| ≥ 2. The ordered term tree g is said to be a canonical ordered term tree if g has no occurrence of ti (1 ≤ i ≤ 3) (Fig. 4). Any ordered term tree g is transformed into a canonical term tree by repeatedly replacing all occurrences of gi with ti (1 ≤ i ≤ 3). We denote by c(g) the canonical ordered term tree obtained from g. We note that L^O_Λ(c(g)) = L^O_Λ(g).

Lemma 2. Let g and t be two ordered term trees in OTT^c_Λ (|Λ| ≥ 2). If g ≈ t and L^O_Λ(g) ⊆ L^O_Λ(t) then c(g) ⊑ c(t).
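The relation ≈ only compares the shapes s(·) of two term trees. Isomorphism of rooted ordered or unordered trees can be tested with the classical AHU canonical encoding; a sketch, using nested lists of children as the tree representation (illustrative, not an algorithm from the paper):

```python
def canon(children, ordered=True):
    """AHU-style canonical encoding of a rooted tree given as nested
    lists of children.  Two trees are isomorphic iff their encodings
    are equal; pass ordered=False for unordered isomorphism."""
    enc = [canon(c, ordered) for c in children]
    if not ordered:
        enc.sort()  # child order is irrelevant for unordered trees
    return "(" + "".join(enc) + ")"
```

For ordered trees the encoding respects the sibling order; sorting the child encodings makes it order-independent, so equality of encodings decides isomorphism in both settings.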
Proof. Let c(g) = (Vc(g), Ec(g), Hc(g)) and c(t) = (Vc(t), Ec(t), Hc(t)). Let V′c(g) = {v ∈ Vc(g) | v is not a child port of a contractible variable} and V′c(t) = {v ∈ Vc(t) | v is not a child port of a contractible variable}. For a vertex v which is not the root of a term tree, we denote by p(v) the parent of v. For any v ∈ Vc(g), either {p(v), v} ∈ Ec(g), [p(v), v] ∈ Hc(g), or [p(v), v]^c ∈ Hc(g). We note that v ∈ Vc(g) − V′c(g) if and only if [p(v), v]^c ∈ Hc(g). Since g ≈ t, we have c(g) ≈ c(t). Therefore there is a bijection ξ from V′c(g) to V′c(t) such that {p(v), v} ∈ Ec(g) or [p(v), v] ∈ Hc(g) if and only if {ξ(p(v)), ξ(v)} ∈ Ec(t) or [ξ(p(v)), ξ(v)] ∈ Hc(t). Since Λ contains at least two edge labels, we can easily show the following claim.

Claim 1. If [p(v), v] ∈ Hc(g) then [ξ(p(v)), ξ(v)] ∈ Hc(t).

In the following three claims, we assume that [p(v), v]^c ∈ Hc(g).

Claim 2. Suppose that p(v) has at least two children and v is the leftmost (resp. rightmost) child of p(v). Let w be the immediately right (resp. left) sibling of v. Then one of the following two statements holds: (1) there exists the leftmost (resp. rightmost) child v′ of ξ(p(v)) such that [ξ(p(v)), v′]^c ∈ Hc(t), or (2) [ξ(p(v)), ξ(w)] ∈ Hc(t).

Proof of Claim 2. We note that {p(v), w} must be an edge in Ec(g), since c(g) is the canonical ordered term tree of g. We assume that {ξ(p(v)), ξ(w)} is an edge in Ec(t). Let α be the edge label of {ξ(p(v)), ξ(w)}, and let β be an edge label in Λ − {α}. Let g^β be the ground term tree obtained by replacing all contractible and uncontractible variables with edges labeled with β. This substitution does not increase the number of internal vertices of c(g). Thus, if there is a substitution θ such that g^β ≡ c(t)θ, the vertex of g^β which is p(v) of c(g) must correspond to the vertex of c(t)θ which is ξ(p(v)) of c(t).
Therefore, if there does not exist the vertex v′ stated in this claim, g^β does not belong to L^O_Λ(c(t)), since the edge label of {ξ(p(v)), ξ(w)} is α. (End of Proof)

We can show the next two claims in a similar way to Claim 2.

Claim 3. If v is the only child of p(v), then ξ(p(v)) has exactly one child v′ such that [ξ(p(v)), v′]^c ∈ Hc(t), or there exists the parent of ξ(p(v)) such that [p(ξ(p(v))), ξ(p(v))] ∈ Hc(t).

Claim 4. Suppose that p(v) has at least three children and v has the immediately left sibling wℓ and the immediately right sibling wr. Then one of the following three statements holds: (1) there exists a child v′ of ξ(p(v)) between ξ(wℓ) and ξ(wr) such that [ξ(p(v)), v′]^c ∈ Hc(t), (2) [ξ(p(v)), ξ(wℓ)] ∈ Hc(t), or (3) [ξ(p(v)), ξ(wr)] ∈ Hc(t).

From these claims, we can immediately show this lemma. □
The algorithm MINL-OTT^c (Fig. 5) solves the Minimal Language Problem for OTT^c_Λ (|Λ| ≥ 2) correctly. The algorithm consists of two procedures.

Variable-Extension (Fig. 5): The aim of this procedure is to output an ordered term tree t consisting of only uncontractible variables such that there is no ordered term tree t′ consisting of only uncontractible variables with S ⊆ L^O_Λ(t′) ⊊ L^O_Λ(t). Thus this procedure extends an ordered term tree t by adding uncontractible variables as much as possible while S ⊆ L^O_Λ(t) holds.
Algorithm MINL-OTT^c(S);
input: a set of trees S ⊆ OT_Λ;
begin
  t := ({u, v}, ∅, {[u, v]});
  Let q be a list initialized to be [[u, v]];
  Variable-Extension(t, S, q);
  Edge-Replacing(t, S, rt), where rt is the root of t;
  output t
end.

Procedure Variable-Extension(t, S, q);
input: an ordered term tree t, a set of trees S, a queue of variables q;
begin
  while q is not empty do begin
    [u, v] := q[1];
    Let w1, w2, and w3 be new vertices;
    // w1 becomes a vertex between u and v.
    if S ⊆ L^O_Λ(t′ := (Vt ∪ {w1}, Et, Ht ∪ {[u, w1], [w1, v]} − {[u, v]})) then begin
      t := t′; q := q & [[w1, v]]; continue
    end else q := q[2..];
    // w2 and w3 become the immediately left and right siblings of v, respectively.
    if S ⊆ L^O_Λ(t′ := (Vt ∪ {w2}, Et, Ht ∪ {[u, w2]})) then begin t := t′; q := q & [[u, w2]] end;
    if S ⊆ L^O_Λ(t′ := (Vt ∪ {w3}, Et, Ht ∪ {[u, w3]})) then begin t := t′; q := q & [[u, w3]] end
  end;
  return t
end;

Procedure Edge-Replacing(t, S, u);
input: an ordered term tree t, a set of trees S, a vertex u;
begin
  if u is a leaf then return;
  Let c1, . . . , ck be the children of u;
  for i := 1 to k do Edge-Replacing(t, S, ci);
  for i := 1 to k do
    foreach edge label λ ∈ ΛS do
      if ci is a leaf then begin
        if S ⊆ L^O_Λ(t′ := tR(ci)^{ℓ,r,d}_λ) then begin
          if S ⊆ L^O_Λ(t1 := t′ − [u, wℓ]^c) then t′ := t1;
          if S ⊆ L^O_Λ(t2 := t′ − [u, wr]^c) then t′ := t2;
          if S ⊆ L^O_Λ(t3 := t′ − [u, wd]^c) then t′ := t3;
          t := t′
        end
      end
      else
        if S ⊆ L^O_Λ(t′ := tR(ci)^{ℓ,r}_λ) then begin
          if S ⊆ L^O_Λ(t1 := t′ − [u, wℓ]^c) then t′ := t1;
          if S ⊆ L^O_Λ(t2 := t′ − [u, wr]^c) then t′ := t2;
          t := t′
        end;
  return t
end;

Fig. 5. Algorithm MINL-OTT^c. For an ordered term tree t, we denote by t − [u, v]^c the term tree obtained by removing a contractible variable [u, v]^c.
Lemma 3. Let t ∈ OTT^c_Λ (|Λ| ≥ 2) be the output of Variable-Extension for an input S. Let t′ be a minimally generalized ordered term tree explaining S. If S ⊆ L^O_Λ(t′) ⊆ L^O_Λ(t) then t′ ≈ t.

Proof. Obviously t′ ∈ L^O_Λ(t). Let t″ be the ordered term tree obtained by replacing all edges of s(t′) with uncontractible variables. Then t″ ≈ t′ and L^O_Λ(t′) ⊆ L^O_Λ(t″). Let θ be a substitution such that t″ ≡ tθ, and θ′ a substitution obtained by replacing all edges appearing in θ with uncontractible variables. Then t″ ≡ tθ′. Since Variable-Extension does not add any more variables to t, tθ′ ≡ t. Therefore t″ ≡ t, and hence t′ ≈ t. □
Let u be a vertex of an ordered term tree which is not the root of the term tree, and let p(u) be the parent of u. Let λ be an element of Λ. Let wℓ and wr be new children of p(u) which become the immediately left and right siblings of u, respectively. If u is a leaf, let wd be a new child of u. We suppose that [p(u), u] is an uncontractible variable. Then we define the following two operations.

R(u)^{ℓ,r}_λ: Replace [p(u), u] with the edge {p(u), u} labeled with λ, [p(u), wℓ]^c, and [p(u), wr]^c.
R(u)^{ℓ,r,d}_λ: Replace [p(u), u] with the edge {p(u), u} labeled with λ, [p(u), wℓ]^c, [p(u), wr]^c, and [u, wd]^c.

Edge-Replacing (Fig. 5): Let t be the output of Variable-Extension for an input S. This procedure visits all vertices of t in the reverse order of a breadth-first search of t, and applies the above two operations R to t. If S ⊆ L^O_Λ(tR) then t := tR. It also eliminates contractible variables as much as possible.

Lemma 4. Let t ∈ OTT^c_Λ (|Λ| ≥ 2) be the output of the algorithm MINL-OTT^c for an input S. Let t′ be a term tree satisfying S ⊆ L^O_Λ(t′) ⊆ L^O_Λ(t). Then c(t′) ≡ c(t).

Theorem 3. The Minimal Language Problem for OTT^c_Λ (|Λ| ≥ 2) is solvable in polynomial time. 4.2
An Algorithm for Unordered Term Trees
Lemma 5. Let gi and ti (1 ≤ i ≤ 3) be the unordered term trees in UTT^c_Λ (|Λ| = ∞) in Fig. 6. Let t be a term tree in UTT^c_Λ which has at least one occurrence of ti (1 ≤ i ≤ 3). For one of the occurrences of ti, we make a new unordered term tree g by replacing the occurrence of ti with gi. Then L^U_Λ(g) = L^U_Λ(t).

Definition 9. Let g be an unordered term tree in UTT^c_Λ (|Λ| = ∞). The unordered term tree g is said to be a canonical unordered term tree if g has no occurrence of ti (1 ≤ i ≤ 3) (Fig. 6). We can easily see that any unordered term tree g is transformed into a canonical unordered term tree by repeatedly replacing all occurrences of gi with ti (1 ≤ i ≤ 3). We denote by c(g) the canonical unordered term tree obtained from g. We show the following lemma in a similar way to Lemma 2.
Fig. 6. For i = 1, 2, 3, gi ≢ ti, gi ⋢ ti, ti ⋢ gi, and L^U_Λ(gi) = L^U_Λ(ti). Let u be a vertex of gi and c1, . . . , ck the children of u. We suppose that at least one child among c1, . . . , ck is connected to u by a contractible variable or an uncontractible variable.
Lemma 6. Let g and t be two unordered term trees in UTT^c_Λ (|Λ| = ∞). If g ≈ t and L^U_Λ(g) ⊆ L^U_Λ(t) then c(g) ⊑ c(t).

The algorithm MINL-UTT^c (Fig. 7) solves the Minimal Language Problem for UTT^c_Λ. The procedure Variable-Extension (Fig. 7) extends an unordered term tree t by adding uncontractible variables as much as possible while S ⊆ L^U_Λ(t) holds. We can show the following lemma in a similar way to Lemma 3.

Lemma 7. Let t ∈ UTT^c_Λ (|Λ| = ∞) be the output of Variable-Extension for an input S. Let t′ be a minimally generalized unordered term tree explaining S. If S ⊆ L^U_Λ(t′) ⊆ L^U_Λ(t) then t′ ≈ t.

Let c1, . . . , ck be the children of a vertex u of an unordered term tree, where none of c1, . . . , ck is the root of the term tree. Let λ be an element of Λ. Let w be a new child of u. If ci is a leaf, let wd be a new child of ci. We suppose that [u, ci] is an uncontractible variable. Edge-Replacing (Fig. 7) adds a contractible variable [u, w]^c and applies the following two operations R(ci) to a temporary unordered term tree. If S ⊆ L^U_Λ(tR(ci)) then t := tR(ci), and it eliminates the contractible variable [u, w]^c if possible.

R(ci)λ: Replace [u, ci] with the edge {u, ci} labeled with λ.
R(ci)^d_λ: Replace [u, ci] with the edge {u, ci} labeled with λ and [ci, wd]^c.

Lemma 8. Let t ∈ UTT^c_Λ (|Λ| = ∞) be the output of the algorithm MINL-UTT^c for an input S. Let t′ be an unordered term tree satisfying S ⊆ L^U_Λ(t′) ⊆ L^U_Λ(t). Then c(t′) ≡ c(t).

Theorem 4. The Minimal Language Problem for UTT^c_Λ (|Λ| = ∞) is solvable in polynomial time.
Algorithm MINL-UTT^c(S);
input: a set of trees S ⊆ UT_Λ;
begin
  Let ΛS be the set of edge labels which appear in S;
  Variable-Extension(S);
  Edge-Replacing(t, S, rt), where rt is the root of t;
  output t
end.

Procedure Variable-Extension(S);
input: a set of trees S;
begin
  d := 0; t := ({r}, ∅, ∅);
  t := Breadth-Extension(t, S, r);
  max-depth := the maximum depth of the trees in S;
  d := d + 1;
  while d ≤ max-depth − 1 do begin
    v := a vertex at depth d which is not yet visited;
    t := Breadth-Extension(t, S, v);
    while there exists a sibling of v which is not yet visited do begin
      Let v′ be a sibling of v which is not yet visited;
      t := Breadth-Extension(t, S, v′)
    end;
    d := d + 1
  end;
  return t
end;
Procedure Edge-Replacing(t, S, u);
input: an unordered term tree t, a set of trees S, and a vertex u;
begin
  if u is a leaf then return;
  Let c1, . . . , ck be the children of u;
  for i := 1 to k do Edge-Replacing(t, S, ci);
  Let w be a new child of u; t := t + [u, w]^c;
  for i := 1 to k do
    foreach edge label λ ∈ ΛS do
      if ci is a leaf then begin
        if S ⊆ L^U_Λ(t′ := tR(ci)λ) then t := t′;
        if S ⊆ L^U_Λ(t′ := tR(ci)^d_λ) then t := t′
      end
      else if S ⊆ L^U_Λ(t′ := tR(ci)λ) then t := t′;
  t′ := t − [u, w]^c;
  if S ⊆ L^U_Λ(t′) then t := t′;
  return t
end;
Procedure Breadth-Extension(t, S, v);
input: an unordered term tree t, a set of trees S, a vertex v;
begin
  t′ := Depth-Extension(t, S, v);
  while t′ ≠ t do begin
    t := t′;
    t′ := Depth-Extension(t, S, v)
  end;
  return t
end;

Procedure Depth-Extension(t, S, v);
input: an unordered term tree t, a set of trees S, a vertex v;
begin
  Let v′ be a new vertex and [v, v′] a new variable;
  t′ := (Vt ∪ {v′}, ∅, Ht ∪ {[v, v′]});
  while S ⊆ L^U_Λ(t′) do begin
    t := t′; v := v′;
    Let v′ be a new vertex and [v, v′] a new variable;
    t′ := (Vt ∪ {v′}, ∅, Ht ∪ {[v, v′]})
  end;
  return t
end;
Fig. 7. Algorithm MINL-UTT c : For an unordered term tree t, we denote by t + [u, v]c the term tree obtained by adding a contractible variable [u, v]c , and by t − [u, v]c the term tree obtained by removing a contractible variable [u, v]c .
References 1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000. 2. S. Amer-Yahia, S. Cho, L. V. S. Lakshmanan, and D. Srivastava. Minimization of Tree Pattern Queries. Proc. ACM SIGMOD 2001, pages 497–508, 2001. 3. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of unordered tree patterns from queries. Proc. COLT-99, ACM Press, pages 323–332, 1999. 4. D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Science, 21:46–62, 1980. 5. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured data from queries. Proc. ALT-2001, Springer-Verlag, LNAI 2225, pages 315–331, 2001. 6. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient Substructure Discovery from Large Semi-structured Data. the Proc. of the Second SIAM International Conference on Data Mining, pages 158–174, 2002. 7. S. Matsumoto, Y. Hayashi, and T. Shoudai. Polynomial time inductive inference of regular term tree languages from positive data. Proc. ALT-97, Springer-Verlag, LNAI 1316, pages 212–227, 1997. 8. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions of tree patterns with internal structured variables from queries. Proc. AI-2002, Springer-Verlag, LNAI 2557, pages 523–534, 2002. 9. T. Miyahara, T. Shoudai, T. Uchida, K. Kuboyama, K. Takahashi, and H. Ueda. Discovering New Knowledge from Graph Data Using Inductive Logic Programming Proc. ILP-99, Springer-Verlag, LNAI 1634, pages 222–233, 1999. 10. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, S. Hirokawa, K. Takahashi, and H. Ueda. Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured data. Proc. PAKDD-2003, Springer-Verlag, LNAI 2637, pages 430–436, 2003. 11. T. Shinohara. Polynomial time inference of extended regular pattern languages. Proc. 
RIMS Symposium on Software Science and Engineering, Springer-Verlag, LNCS 147, pages 115–127, 1982. 12. T. Shoudai, T. Uchida, and T. Miyahara. Polynomial time algorithms for finding unordered tree patterns with internal variables. Proc. FCT-2001, Springer-Verlag, LNCS 2138, pages 335–346, 2001. 13. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time inductive inference of ordered tree patterns with internal structured variables from positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184, 2002. 14. Y. Suzuki, T. Shoudai, T. Uchida, and T. Miyahara. Ordered term tree languages which are polynomial time inductively inferable from positive data. Proc. ALT2002, Springer-Verlag, LNAI 2533, pages 188–202, 2002. 15. Y. Suzuki, T. Shoudai, S. Matsumoto and T. Uchida. Efficient Learning of Unlabeled Term Trees with Contractible Variables from Positive Data. Proc. ILP-2003, Springer-Verlag, LNAI (to appear), 2003. 16. J. Uemura and M. Sato. Compactness and Learning of Classes of Unions of Erasing Regular Pattern Languages. Proc. ALT-2002, Springer-Verlag, LNAI 2533, pages 293–307, 2002.
On the Learnability of Erasing Pattern Languages in the Query Model

Steffen Lange¹ and Sandra Zilles²

¹ Deutsches Forschungszentrum für Künstliche Intelligenz, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, [email protected]
² Universität Kaiserslautern, FB Informatik, Postfach 3049, 67653 Kaiserslautern, Germany, [email protected]
Abstract. A pattern is a finite string of constant and variable symbols. The erasing language generated by a pattern p is the set of all strings that can be obtained by substituting (possibly empty) strings of constant symbols for the variables in p. The present paper deals with the problem of learning the erasing pattern languages and natural subclasses thereof within Angluin’s model of learning with queries. The paper extends former studies along this line of research. It provides new results concerning the principal learning capabilities of query learners as well as the power and limitations of polynomial-time query learners. In addition, the paper focusses on a quite natural extension of Angluin’s original model. In this extended model, the query learner is allowed to query languages which are themselves not object of learning. Query learners of the latter type are often more powerful and more efficient than standard query learners. Moreover, when studying this new model in a more general context, interesting relations to Gold’s model of language learning from only positive data have been elaborated.
1
Introduction
A pattern is a finite string of constant and variable symbols (cf. Angluin [2]). The erasing language generated by a pattern p is the set of all strings that can be obtained by substituting strings of constant symbols (including the empty one!) for the variables in p.1 Thereby, each occurrence of a variable has to be substituted by the same string. The erasing pattern languages have found a lot of attention within the past two decades both in the formal language theory community (see, e. g., Salomaa [15,16], Jiang et al. [9]) and in the learning theory community (see, e. g., 1
The term ‘erasing’ is coined to distinguish these languages from those pattern languages originally defined in Angluin [2], where it is forbidden to replace variables by the empty string.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 129–143, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Shinohara [17], Erlebach et al. [6], Mitchell [12], Nessel and Lange [13], Reidenbach [14]). The learning scenarios studied include Gold’s [7] model of learning in the limit and Angluin’s [3] model of learning with queries. Besides that, interesting applications have been outlined. For example, learning algorithms for particular subclasses of erasing pattern languages have been used to solve problems in molecular biology (see Arikawa et al. [5]). The present paper focusses on the learnability of the erasing pattern languages and natural subclasses thereof in Angluin’s [3,4] model of learning with queries. The paper extends the work of Nessel and Lange [13]; the first systematic study in this context. In contrast to Gold’s [7] model of learning in the limit, Angluin’s [3] model deals with ‘one-shot’ learning. Here, a learning algorithm (henceforth called query learner) has the option to ask queries in order to receive information about an unknown language. The queries will truthfully be answered by an oracle. After asking at most finitely many queries, the learner is supposed to output its one and only hypothesis. This hypothesis is required to correctly describe the unknown language. The present paper contains a couple of new results, which illustrate the power and limitations of query learners in the context of learning the class of all erasing pattern languages and natural subclasses thereof. Along the line of former studies, the capabilities of polynomial-time query learners (i. e. learners that are constrained to ask at most polynomially many queries before returning their hypothesis) are studied as well. In addition, a problem is addressed that has mainly been ignored so far. The present paper provides the first systematic study concerning the strength of query learners that are – in contrast to standard query learners – allowed to query languages that are themselves not object of learning. 
As it turns out, these new learners often outperform standard learners, concerning their principal learning capability as well as their efficiency. Moreover, the learning power of non-standard query learners is compared to the capabilities of Gold-style language learners. As a result of this comparison, quite interesting coincidences between Gold-style language learning and query learning – in the more general setting of learning indexable classes of recursive languages – have been observed. One of them allows for a new approach to the long-standing open question of whether or not the erasing pattern languages (over a finite alphabet with at least three constant symbols) are Gold-style learnable from only positive examples. To be more precise, the erasing pattern languages are learnable in the non-standard query model (using a particular type of queries, namely restricted superset queries), iff they are Gold-style learnable from only positive examples by a conservative learner (i. e. a learner that strictly avoids overgeneralized hypotheses). Next, we summarize the disciplinary results on query learning of all erasing pattern languages or natural subclasses thereof. Among the different types of queries investigated in the past (see, e. g., Angluin [3,4]), we consider the following ones:
Membership queries. The input is a string w and the answer is 'yes' or 'no', respectively, depending on whether or not w belongs to the target language L.
Restricted subset queries. The input is a language L′. If L′ ⊆ L, the answer is 'yes'. Otherwise, the answer is 'no'.
Restricted superset queries. The input is a language L′. If L ⊆ L′, the answer is 'yes'. Otherwise, the answer is 'no'.
In the original model of learning with queries (cf. Angluin [3]), the query learner is constrained to choose the input language L′ exclusively from the class of languages to be learned. Our study involves a further approach, in which this constraint is weakened by allowing the learner to query languages that are themselves not object of learning. The following table summarizes the results obtained and compares them to the previously known results. The focus is on the learnability of the class of all erasing pattern languages and the following subclasses thereof: the so-called regular, k-variable, and non-cross erasing pattern languages.² The items in the table are to be interpreted as follows. The item 'No' indicates that queries of the specified type are insufficient to learn the corresponding language class, while the item 'Yes' indicates that the corresponding class is learnable using queries of this type. The superscript † refers to results which can be found in or easily derived from results in Angluin [3], Matsumoto and Shinohara [11], and Nessel and Lange [13], respectively.
Type of queries  | all | regular | 1-variable | const.-free 1-variable | k-variable | const.-free k-variable | non-cross
-----------------|-----|---------|------------|------------------------|------------|------------------------|----------
membership       | No† | Yes†    | No         | Yes                    | No         | No                     | No
restr. subset    | No  | Yes     | No         | Yes                    | No         | No                     | No
restr. superset  | No† | Yes†    | No†        | No†                    | No†        | No†                    | No†
If query learners are allowed to choose input languages that are themselves not object of learning, their learning capabilities change remarkably, particularly when the learner is allowed to ask restricted superset queries. It seems as if this type of query is especially tailored to accumulating learning-relevant information about erasing pattern languages. Note that the superscript ‡ marks immediate outcomes of the table above.
Type of extra queries | all  | regular | 1-variable | const.-free 1-variable | k-variable | const.-free k-variable | non-cross
----------------------|------|---------|------------|------------------------|------------|------------------------|----------
restr. subset         | No   | Yes‡    | No         | Yes‡                   | No         | No                     | No
restr. superset       | Open | Yes‡    | Yes        | Yes                    | Yes        | Yes                    | Yes
Of particular interest is also the complexity of a successful query learner M , cf. Angluin [3]. M learns a class polynomially, if, for each target language L 2
A pattern p is regular provided that p does not contain any variable more than once. Moreover, p is said to be a k-variable pattern if it contains at most k variables, while it is said to be non-cross if there are variables x1, . . . , xn and indices e1, . . . , en such that p = x1^{e1} · · · xn^{en}.
in the class, the total number of queries to be asked by M in the worst-case is polynomial in the length of the minimal description for L. The table below summarizes the corresponding results. The first (second) row displays the types of queries (not) suitable for polynomial learning of a particular class; the third row marks open problems. Here MemQ (SubQ,SupQ) is short for membership (restricted subset, restricted superset) queries; the prefix x denotes extra queries. The superscript † refers to results by Nessel and Lange [13]. Note that the results on non-learnability are not displayed.
                            | all   | regular      | 1-variable | const.-free 1-variable | k-variable     | const.-free k-variable | non-cross
----------------------------|-------|--------------|------------|------------------------|----------------|------------------------|----------
polynomially learnable      | —     | SupQ†, xSupQ | xSupQ      | MemQ, SubQ             | xSupQ if k = 2 | xSupQ if k = 2         | xSupQ
learnable, not polynomially | —     | MemQ†, xSubQ | —          | —                      | —              | —                      | —
open                        | xSupQ | —            | —          | —                      | xSupQ if k > 2 | xSupQ if k > 2         | —

2
Preliminaries
In the following, Σ denotes a fixed finite alphabet, the set of constant symbols. Moreover, X = {x1, x2, x3, . . .} is a countable, infinite set of variables. To distinguish constant symbols from variables, it is assumed that Σ and X are disjoint. By Σ* we refer to the set of all finite strings over Σ (words, for short), where ε denotes the empty string or empty word, respectively. A pattern is a non-empty string over Σ ∪ X. Several special types of patterns are distinguished. Let p be a pattern. If p ∈ X*, then p is said to be a constant-free pattern. p is a regular pattern if each variable in p occurs at most once. If p contains at most k variables, then p is a k-variable pattern. Moreover, p is said to be a non-cross pattern if it is constant-free and there are some n ≥ 1 and indices e1, . . . , en ≥ 1 such that p equals x1^{e1} · · · xn^{en}. For a pattern p, the erasing pattern language Lε(p) generated by p is the set of all words obtained by substituting all variables in p by strings in Σ*. Thereby, each occurrence of a variable in p has to be replaced by the same word. Below, we generally assume that the underlying alphabet Σ consists of at least three elements.³ a, b, c always denote elements of Σ. The erasing pattern languages and natural subclasses thereof will provide the target objects for learning. The formal learning model analyzed is called
As results in Shinohara [17] and Nessel and Lange [13] impressively show, this assumption remarkably reduces the complexity of the proofs needed to establish learnability results in the context of learning the erasing pattern languages and subclasses thereof. However, some of the learnability results presented below may no longer remain valid, if this assumption is skipped. A detailed discussion of this issue is outside the scope of the paper on hand.
On the Learnability of Erasing Pattern Languages in the Query Model
learning with queries, see Angluin [3,4]. In this model, the learner has access to an oracle that truthfully answers queries of a specified kind. A query learner M is an algorithmic device that, depending on the replies to the queries previously made, either computes a new query or a hypothesis and halts. M learns a target language L using a certain type of queries provided that it eventually halts and that its one and only hypothesis correctly describes L. Furthermore, M learns a target language class C using a certain type of queries, if it learns every L ∈ C using queries of the specified type. As a rule, when learning a target class C, M is not allowed to query languages not belonging to C (cf. Angluin [3]). As in Angluin [3], the complexity of a query learner is measured by the total number of queries to be asked in the worst case. The relevant parameter is the length of the minimal description for the target language. Below, only indexable classes of erasing pattern languages are considered. Note that a class of recursive languages is said to be an indexable class if there is an effective enumeration (L_i)_{i≥0} of all and only the languages in that class that has uniformly decidable membership. Such an enumeration is called an indexing.
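The three query types can be pictured as a single truthful oracle object. The sketch below is our own illustration, not part of the paper: the target is given as a finite set, since for infinite languages such as erasing pattern languages these inclusion tests are not computable in general, and a counter records the number of queries, the complexity measure just defined.

```python
class Oracle:
    """Truthful oracle for a fixed target language, here a finite set.

    (For infinite languages such as erasing pattern languages these tests
    are not computable in general; the oracle is the abstraction that the
    learner interacts with.)
    """
    def __init__(self, target):
        self.target = frozenset(target)
        self.queries = 0                        # worst-case query count is the complexity measure

    def membership(self, w):
        self.queries += 1
        return w in self.target

    def restricted_subset(self, language):      # 'yes' iff queried language ⊆ target
        self.queries += 1
        return set(language) <= self.target

    def restricted_superset(self, language):    # 'yes' iff queried language ⊇ target
        self.queries += 1
        return set(language) >= self.target
```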
3 Strength and Weakness of Query Learners

3.1 Learning in the Original Query Model
We first present results related to Angluin’s [3] original model. Here the learner is only allowed to query languages that are themselves objects of learning. The first result points to the general weakness of query learners when arbitrary erasing pattern languages have to be identified.

Theorem 1. The class of all erasing pattern languages is (i) not learnable using membership queries, (ii) not learnable using restricted subset queries, and (iii) not learnable using restricted superset queries.

Proof. Assertions (i) and (iii) are results from Nessel and Lange [13]. To prove Assertion (ii), assume that a query learner M identifies the class of all erasing pattern languages using restricted subset queries. Then it is possible to show that M fails to identify either Lε(x_1^2) or all but finitely many of the languages Lε(x_1^2 x_2^z) for z ≥ 2. □

The observed weakness has one origin: the query learners are only allowed to output one hypothesis, which has to be correct. To see this, consider the following relaxation of the learning model on hand. Suppose that a query learner M may output a hypothesis in each learning step, after asking a query and receiving the corresponding answer. Similarly to Gold’s [7] model of learning in the limit, a query learner is now successful if the sequence of its hypotheses stabilizes on a correct one. Accordingly, we say that M learns in the limit using queries.

Theorem 2. The class of all erasing pattern languages is (i) learnable in the limit using membership queries, (ii) learnable in the limit using restricted subset queries, and (iii) learnable in the limit using restricted superset queries.
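Indexable classes require uniformly decidable membership, and for erasing pattern languages the test "w ∈ Lε(p)?" can be decided by backtracking over substitutions (the problem is NP-complete in general, so brute force is only feasible for small inputs). The following Python sketch is ours, not the paper’s, with the convention that the letters in `variables` play the role of X:

```python
def in_erasing_pattern_language(pattern, word, variables="xyz"):
    """Decide w ∈ L_eps(p): every variable of the pattern may be replaced by
    an arbitrary (possibly empty) word over the constants, all occurrences
    of the same variable by the same word.  Brute-force backtracking."""
    def match(i, j, subst):
        if i == len(pattern):                 # pattern exhausted: word must be too
            return j == len(word)
        s = pattern[i]
        if s not in variables:                # constant symbol: literal match
            return j < len(word) and word[j] == s and match(i + 1, j + 1, subst)
        if s in subst:                        # bound variable: reuse its image
            img = subst[s]
            return word[j:j + len(img)] == img and match(i + 1, j + len(img), subst)
        for k in range(j, len(word) + 1):     # free variable: try each image, incl. ε
            subst[s] = word[j:k]
            if match(i + 1, k, subst):
                return True
            del subst[s]
        return False
    return match(0, 0, {})
```

For instance, `in_erasing_pattern_language("xx", "abab")` holds via x ↦ ab, while no substitution maps x_1^2 to a word of odd length.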
S. Lange and S. Zilles
However, let us come back to the original learning model, in which the first hypothesis of the query learner has to be correct. As Theorem 1 shows, positive results can only be achieved if the scope is limited to proper subclasses of the erasing pattern languages. Suppose that a subclass of the erasing pattern languages is fixed. Naturally, one may ask whether – similarly to Theorems 1 and 2 – the learnability of this class does not depend on the type of queries actually considered. However, as our next theorem shows, this is generally not the case.

Theorem 3. Fix two different query types from the following ones: membership, restricted subset, and restricted superset queries. Then there is a class of erasing pattern languages which is learnable using the first type of queries, but not learnable using the second type of queries.

Proof. As the first table above shows, the class of all erasing pattern languages generated by constant-free 1-variable patterns is learnable with membership or restricted subset queries, but not learnable with restricted superset queries. Moreover, it is not hard to verify that the class which contains Lε(a) and all languages Lε(a x_1^z), where z is a prime number, is learnable using restricted superset queries, but learnable neither using membership queries nor using restricted subset queries. Next, the class containing Lε(x_1^2) and all languages Lε(x_1^2 x_2^2 x_3^z), z ≥ 2, is learnable with membership queries, but not with restricted subset queries. A class learnable with restricted subset queries, but not with membership queries, can be constructed via diagonalization. For that purpose fix an effective enumeration (M_i)_{i≥0} of all query learners using membership queries and posing each query at most once. Let z_i denote the i-th prime number for all i ≥ 0. Given i ≥ 0, let L_{2i} = Lε(x_1^{z_i} a). Moreover, simulate the learner M_i.
If M_i queries a word w ∈ Σ*, provide the answer ‘yes’ iff w ∈ Lε(x_1^{z_i} a); provide the answer ‘no’ otherwise. In case M_i never returns a hypothesis in this scenario, let L_{2i+1} = L_{2i} = Lε(x_1^{z_i} a). In case M_i returns a hypothesis, let l be the length of the longest word M_i has queried in the corresponding scenario. Then define L_{2i+1} = Lε(x_1^{z_i} a x_2^{z_l}). Finally, let C consist of all languages L_i for i ≥ 0. Note that (L_i)_{i≥0} is an indexing for C; membership is decidable as follows: assume w ∈ Σ* and j ≥ 0 are given. If j = 2i for some i ≥ 0, then w ∈ L_j iff w ∈ Lε(x_1^{z_i} a). If j = 2i + 1 for some i ≥ 0 and w ∈ L_{2i}, then w ∈ L_j. If j = 2i + 1 and w ∉ L_{2i}, then let A = {l ≥ 0 | w ∈ Lε(x_1^{z_i} a x_2^{z_l})}. A is finite and can be computed from w and i. Simulate M_i as above in the definition of L_{2i+1}. If M_i does not return a hypothesis, then, since no query is posed twice, M_i queries a word of a length not in A. Thus there is no l ∈ A with L_j = Lε(x_1^{z_i} a x_2^{z_l}); in particular w ∉ L_j. If M_i returns a hypothesis, one can determine the length l* of the longest word M_i has queried. In this case w ∈ L_j iff l* ∈ A. Next, we show that C is learnable using restricted subset queries. A learner M for C may first query the languages Lε(x_1^{z_0} a), Lε(x_1^{z_1} a), Lε(x_1^{z_2} a), . . . , until the
Note that any query learner can be normalized to pose each query at most once without affecting its learning capabilities.
answer ‘yes’ is received for the first time, say as a reply to the query Lε(x_1^{z_i} a) = L_{2i}. Then M queries the language L_{2i+1}. In case the answer is ‘yes’, let M return the hypothesis L_{2i+1}. Otherwise, let M return the hypothesis L_{2i}. It is not hard to verify that M is a successful query learner for C. It remains to verify that C is not learnable using membership queries. Assume to the contrary that C is learnable using membership queries, say by the learner M_i for some i ≥ 0. Then M_i identifies the language L_{2i} = Lε(x_1^{z_i} a). In particular, if its queries are answered truthfully respecting L_{2i}, M_i must return a hypothesis correctly describing L_{2i} after finitely many queries. Let l be the length of the longest word M_i queries in the corresponding learning scenario. Then, by definition, L_{2i+1} = Lε(x_1^{z_i} a x_2^{z_l}). Note that a word of length up to l belongs to L_{2i} iff it belongs to L_{2i+1}. Thus all queries in the learning scenario of M_i for L_{2i} are answered truthfully also for the language L_{2i+1} ≠ L_{2i}. Since M_i correctly identifies L_{2i}, M_i fails to learn L_{2i+1}. This yields a contradiction. □

Next, we systematically investigate the learnability of some prominent subclasses of the erasing pattern languages in Angluin’s [3] model.

Theorem 4. The class of all regular erasing pattern languages is (i) learnable using membership queries, (ii) learnable using restricted subset queries, and (iii) learnable using restricted superset queries.

Proof. For a proof of Assertions (i) and (iii) see Nessel and Lange [13]. Adapting their ideas, one can also prove (ii). □

Theorem 5. The class of all 1-variable erasing pattern languages is (i) not learnable using membership queries, (ii) not learnable using restricted subset queries, and (iii) not learnable using restricted superset queries.

Proof. (i) and (iii) are due to Nessel and Lange [13].
To verify (ii), note that the class of all languages Lε(a x_1^z b), z ≥ 0, is not learnable using restricted subset queries, even if it is allowed to query any 1-variable erasing pattern language. □

To prove Theorems 6 to 9, methods similar to those above can be used. For the results concerning restricted superset queries, ideas from Nessel and Lange [13] can be exploited. Further details are omitted.

Theorem 6. The class of all constant-free 1-variable erasing pattern languages is (i) learnable using membership queries, (ii) learnable using restricted subset queries, and (iii) not learnable using restricted superset queries.

Theorem 7. The class of all k-variable erasing pattern languages is (i) not learnable using membership queries, (ii) not learnable using restricted subset queries, and (iii) not learnable using restricted superset queries.

Theorem 8. For k ≥ 2, the class of all constant-free k-variable erasing pattern languages is (i) not learnable using membership queries, (ii) not learnable using restricted subset queries, and (iii) not learnable using restricted superset queries.

Theorem 9. The class of all non-cross erasing pattern languages is (i) not learnable using membership queries, (ii) not learnable using restricted subset queries, and (iii) not learnable using restricted superset queries.
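The subset-query learner for the diagonal class C from the proof of Theorem 3 has a very simple skeleton. In the sketch below (our own abstraction), languages are referred to only by their indices, and `subset_oracle(j)` is an assumed black box answering whether L_j is a subset of the target; the L_j themselves are infinite, so they never appear explicitly.

```python
def learn_diagonal_class(subset_oracle):
    """Learner for the class C from the proof of Theorem 3, using restricted
    subset queries: probe L_0, L_2, L_4, ... until the first 'yes', then
    decide between L_{2i} and L_{2i+1} with one more query.  When the target
    is L_{2i+1} ⊇ L_{2i}, the query L_{2i} is also answered 'yes', which is
    why the final disambiguating query is needed."""
    i = 0
    while not subset_oracle(2 * i):   # query L_{2i} = L_eps(x_1^{z_i} a)
        i += 1
    if subset_oracle(2 * i + 1):      # query L_{2i+1}
        return 2 * i + 1
    return 2 * i
```

For example, with an oracle for a target equal to L_5 (so that both L_4 ⊆ L_5 and L_5 itself are answered ‘yes’), the learner returns index 5; with a target equal to L_4 alone, it returns 4.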
3.2 Learning with Extra Queries
As it turns out, not many natural subclasses of the erasing pattern languages are learnable using restricted subset and restricted superset queries, respectively. But where does the observed weakness stem from? Does it result from the complexity of the considered language classes? The following investigations indicate that this is not the case. Instead it seems as if, at least in some cases, the query learners are simply not allowed to ask the ‘appropriate’ queries. In the extended model, the query learner is not constrained to query only languages belonging to the target class. That is, the learner and the oracle have the ability to communicate additional queries and corresponding answers. However, in a reasonable model, there has to be an a priori agreement on how to formulate the queries. For that purpose, we assume that the query languages are selected from an a priori fixed indexable class of recursive languages. As we will see below, this may severely increase the learning power concerning natural subclasses of the erasing pattern languages. Still, if the class of all erasing pattern languages is considered, a benefit resulting from extra queries has not been verified yet. Indeed, concerning restricted subset queries, it is clear that extra queries do not help.

Theorem 10. The class of all erasing pattern languages is not learnable using extra restricted subset queries.

It remains open whether or not the class of all erasing pattern languages is learnable using extra restricted superset queries. The relevance of this problem is discussed in the last section. Extra restricted superset queries improve the power of query learners remarkably. Due to space constraints the corresponding proof is omitted.

Theorem 11. The classes of all regular, of all k-variable, and of all non-cross erasing pattern languages, respectively, are learnable using extra restricted superset queries.
In contrast, extra restricted subset queries do not help for learning the natural subclasses of the erasing pattern languages considered. Note that there are nevertheless subclasses which are learnable using restricted subset queries if and only if the learner may ask extra languages. An example can be found in the proof of Theorem 3, third paragraph.

Theorem 12. (i) The classes of all 1-variable, of all constant-free k-variable (k ≥ 2), and of all non-cross erasing pattern languages, respectively, are not learnable using extra restricted subset queries. (ii) The classes of all regular and of all constant-free 1-variable erasing pattern languages, respectively, are learnable using extra restricted subset queries.

Proof. Assertion (ii) is an immediate consequence of Theorems 4 and 6. To prove Assertion (i), note that neither the class consisting of Lε(a) and all languages Lε(a x_1^z), z ≥ 1, nor the class consisting of Lε(x_1^2) and all languages Lε(x_1^2 x_2^z), z ≥ 2, is learnable with extra restricted subset queries. □
4 Efficiency of Query Learners
Having analyzed the learnability of natural subclasses of the class of all erasing pattern languages in the (extended) query model, we now turn our attention to the question which of the learnable classes can even be learned efficiently, i.e. with polynomially many queries. In particular, it is of interest whether or not the permission to query extra languages may speed up learning. As it turns out, there are subclasses which are not learnable in the original model, but efficiently learnable with extra queries, see Theorem 13, Assertion (iv). Thus, extra restricted superset queries may bring the maximal benefit imaginable. In contrast, extra restricted subset queries do not help to speed up learning of the prominent subclasses of the erasing pattern languages considered above.

Theorem 13. (i) Polynomially many restricted superset queries suffice to learn the class of all regular erasing pattern languages. (ii) Polynomially many membership queries suffice to learn the class of all constant-free 1-variable erasing pattern languages. (iii) Polynomially many restricted subset queries suffice to learn the class of all constant-free 1-variable erasing pattern languages. (iv) Polynomially many extra restricted superset queries suffice to learn the classes of all regular, of all 1-variable, and of all non-cross erasing pattern languages, respectively.

Proof. (i) is due to Nessel and Lange [13]. The proofs of (ii) and (iii) are omitted. Results by Nessel and Lange [13] help to verify Assertion (iv) for the case of regular erasing pattern languages. Details are omitted. The more involved proof of Assertion (iv) for the case of non-cross erasing pattern languages is just sketched: Assume that the target language L equals Lε(p) for some non-cross pattern p = x_1^{e_1} · · · x_n^{e_n}. A query learner M successful for all non-cross erasing pattern languages may operate as follows.

1. M poses the query Σ* \ {a}. If the answer is ‘no’, then M returns the hypothesis L = Lε(x_1) and stops; otherwise M acts as described in 2.
2. The queries {w | |w| ≠ j} for j = 1, 2, . . . help to determine the minimal exponent e in {e_1, . . . , e_n}. Knowing e, M executes 3.
3. M poses the query Lε(x_1^e). If the answer is ‘yes’, then M returns the hypothesis L = Lε(x_1^e) and stops; otherwise M acts as described in 4.
4. The queries (Lε(x_1^e) ∩ {w | |w| ≤ j}) ∪ {w | |w| > j} for j = e, e+1, . . . help to determine further candidates for elements in {e_1, . . . , e_n}. Queries concerning special words in a selected class of (at most e_1 + · · · + e_n) 2-variable erasing pattern languages help to compute exactly a next exponent e′. Knowing e′, M executes 5.
5. The queries Σ* \ {w}, for particular words w ∈ Σ* in order of growing length, help to determine in which order the exponents e and e′ appear in p.
Afterwards, M executes (slightly modified versions of) steps 3 to 5 in order to find further exponents, until the correct structure of p is output. All in all, this method is successful for all non-cross erasing pattern languages, but uses only polynomially many extra restricted superset queries. Instead of formalizing the details we try to illustrate the idea with an example. Assume Σ = {a, b, c} and the target language is Lε(x_1^4 x_2^2 x_3^8). Then the corresponding learning scenario can be described by the following table.

Step  Query                                                   Reply
1     Σ* \ {a}                                                ‘yes’
2     {w | |w| ≠ 1}                                           ‘yes’
      {w | |w| ≠ 2}                                           ‘no’
      (* e = 2. *)
3     Lε(x_1^2)                                               ‘no’
      (* There is a second exponent e′. *)
4     (Lε(x_1^2) ∩ {w | |w| ≤ 2}) ∪ {w | |w| > 2}             ‘yes’
      ...
      (Lε(x_1^2) ∩ {w | |w| ≤ 5}) ∪ {w | |w| > 5}             ‘yes’
      (Lε(x_1^2) ∩ {w | |w| ≤ 6}) ∪ {w | |w| > 6}             ‘no’
      (* e′ = 4. *)
5     Σ* \ {a^2 b^4}                                          ‘yes’
      Σ* \ {a^4 b^2}                                          ‘no’
      (* e′ appears only before e in p. *)
3     Lε(x_1^4 x_2^2)                                         ‘no’
      (* There is a third exponent e″. *)
4     (Lε(x_1^4 x_2^2) ∩ {w | |w| ≤ 6}) ∪ {w | |w| > 6}       ‘yes’
      ...
      (Lε(x_1^4 x_2^2) ∩ {w | |w| ≤ 9}) ∪ {w | |w| > 9}       ‘yes’
      (Lε(x_1^4 x_2^2) ∩ {w | |w| ≤ 10}) ∪ {w | |w| > 10}     ‘no’
      (* Candidates for e″: 6, 8. Interesting words: a^6 b^4, a^2 b^8. *)
      Σ* \ {a^6 b^4}                                          ‘yes’
      Σ* \ {a^2 b^8}                                          ‘no’
      (* e″ = 8, appears after e′ in p, step 5 is not necessary. *)
3     Lε(x_1^4 x_2^2 x_3^8)                                   ‘yes’
      Output: Lε(x_1^4 x_2^2 x_3^8)
It remains to prove Assertion (iv) for 1-variable erasing pattern languages: Assume the target language is L = Lε(p) for some 1-variable pattern p. Let v be the shortest word in L, v = v_1 · · · v_l with v_1, . . . , v_l ∈ Σ. A query learner M successful for all 1-variable erasing pattern languages may operate as follows:

1. With the help of the queries Σ* \ {a} and Σ* \ {b} the learner M can find out whether or not L = Lε(x_1). If yes, then M returns the hypothesis L = Lε(x_1) and stops; otherwise M acts as described in 2.
2. The queries {w | |w| ≠ j} for j = 0, 1, 2, . . . help to compute the length l of v. To compute v itself, the |Σ|^l candidates for v are recursively split into two
equally large sets V_1 and V_2; which of these sets is taken under consideration in each splitting step depends only on the query V_1 ∪ {w | |w| ≠ l}. Once v is computed, M goes on as in 3.
3. M poses the query Lε(v). On answer ‘yes’, M returns the hypothesis Lε(v) and stops. On answer ‘no’, M queries all the languages Lε(p_i), 1 ≤ i ≤ l + 1, where p_i is the pattern resulting from x_1 v_1 x_2 v_2 · · · x_l v_l x_{l+1} if the variable x_i is deleted. Thus M can detect exactly those positions in v where the only variable has to occur (at least once). Knowing the positions of the variable, M goes on as in 4.
4. By posing the queries {v} ∪ {w | |w| ≥ l + j} for j = 1, 2, . . ., M finds out the number j* of occurrences of the variable x_1 in p. Afterwards, special queries concerning the words of length l + j* help to find out the multiplicity of x_1 in the positions computed in 3. Finally, a hypothesis for Lε(p) is returned.

All in all, this method is successful for all 1-variable erasing pattern languages, but uses only polynomially many queries. Instead of formalizing the details we try to illustrate the idea with an example. Assume Σ = {a, b, c} and the target language is Lε(a x_1^3 b x_1^2). Then the corresponding learning scenario can be described by the following table.

Step  Query                               Reply
1     Σ* \ {a}                            ‘yes’
2     {w | |w| ≠ 0}                       ‘yes’
      {w | |w| ≠ 1}                       ‘yes’
      {w | |w| ≠ 2}                       ‘no’
      (* l = 2. v ∈ {aa, ab, ac, ba, bb, bc, ca, cb, cc}. *)
      {aa, ab, ac, ba} ∪ {w | |w| ≠ 2}    ‘yes’
      {aa, ab} ∪ {w | |w| ≠ 2}            ‘yes’
      {aa} ∪ {w | |w| ≠ 2}                ‘no’
      (* v = ab. *)
3     Lε(ab)                              ‘no’
      Lε(a x_2 b x_3)                     ‘yes’
      Lε(x_1 a b x_3)                     ‘no’
      Lε(x_1 a x_2 b)                     ‘no’
      (* p = a x_1^{e_1} b x_1^{e_2} for some e_1, e_2 ≥ 1. *)
4     {ab} ∪ {w | |w| ≥ 3}                ‘yes’
      ...
      {ab} ∪ {w | |w| ≥ 7}                ‘yes’
      {ab} ∪ {w | |w| ≥ 8}                ‘no’
      (* j* = 5. Test a^2 b a^4, a^3 b a^3, a^4 b a^2, a^5 b a. *)
      Σ* \ {a^2 b a^4}                    ‘yes’
      Σ* \ {a^3 b a^3}                    ‘yes’
      Σ* \ {a^4 b a^2}                    ‘no’
      Output: Lε(a x_1^3 b x_1^2)

Further details are omitted.
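The splitting in step 2 is the generic halving scheme: the secret word v among |Σ|^l candidates is located with about l · log₂|Σ| yes/no queries rather than |Σ|^l individual tests. A sketch of ours, where `contains_secret(S)` abstracts the superset query V_1 ∪ {w | |w| ≠ l}:

```python
def find_secret_word(candidates, contains_secret):
    """Binary splitting as in step 2 of the 1-variable learner: each query
    asks whether the secret word lies in a chosen half of the remaining
    candidate set, so O(log |candidates|) queries suffice.
    `contains_secret` is an assumed truthful yes/no oracle."""
    pool = sorted(candidates)
    queries = 0
    while len(pool) > 1:
        half = pool[:len(pool) // 2]
        queries += 1
        pool = half if contains_secret(half) else pool[len(pool) // 2:]
    return pool[0], queries
```

With Σ = {a, b, c}, l = 2 and secret word v = ab, three splitting queries suffice for the nine candidates, matching the three queries shown in the table above.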
□
Theorem 14. Polynomially many queries do not suffice to learn the class of all regular erasing pattern languages with either membership queries, or restricted subset queries, or extra restricted subset queries.

Proof. Note that, for any n ≥ 0, there are at least |Σ|^n distinct regular patterns such that each pair of corresponding erasing pattern languages is disjoint. By a result in Angluin [3], given n ≥ 0, any query learner identifying each of these |Σ|^n regular erasing pattern languages using membership or restricted subset queries must make |Σ|^n − 1 queries in the worst case. Angluin’s proof can be adapted to the case of learning with extra restricted subset queries. Concerning membership queries and restricted subset queries, Theorem 14 has also been verified by Nessel and Lange [13]. □

It remains open whether or not, for any k ≥ 3, the class of all k-variable erasing pattern languages, or at least the class of all constant-free k-variable erasing pattern languages, is learnable using polynomially many extra restricted superset queries. So far, we have only been able to show that Theorem 13, Assertion (iv) generalizes to the case of learning the class of all 2-variable erasing pattern languages. A slightly extended version of the method used above for 1-variable erasing pattern languages can be applied. The relevant details are omitted.
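The |Σ|^n pairwise disjoint regular erasing pattern languages invoked in the proof of Theorem 14 are not spelled out in the text; one concrete choice (ours, not the paper’s) takes the pattern w x_1 for each w ∈ Σ^n, so that Lε(w x_1) = wΣ*, and two such languages are disjoint because every word in them determines its length-n prefix uniquely.

```python
from itertools import product

def disjoint_regular_patterns(sigma, n):
    """|sigma|^n regular patterns 'w x1' (w ranging over sigma^n) whose
    erasing pattern languages L_eps(w x1) = { w u : u in sigma^* } are
    pairwise disjoint: two distinct prefixes of equal length n cannot both
    start the same word."""
    return ["".join(w) + " x1" for w in product(sigma, repeat=n)]
```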
5 Connections to Gold-Style Learning
Comparing query learning to the standard models of Gold-style language learning from positive examples requires some more notions. These will be kept short; see, e.g., Gold [7], Angluin [1], and Zeugmann and Lange [18] for more details.

Let L be a language. Any infinite sequence t = (w_j)_{j≥0} with {w_j | j ≥ 0} = L is called a text for L. For any n ≥ 0, t_n denotes the initial segment w_0, . . . , w_n and t_n^+ the set {w_0, . . . , w_n}. Let C be an indexable class, let H = (L_i)_{i≥0} be a hypothesis space, and let L ∈ C. An inductive inference machine (IIM) is an algorithmic device that reads longer and longer initial segments of a text and, from time to time, outputs numbers as its hypotheses. An IIM M returning some i is construed to hypothesize the language L_i. Given a text t for L, M identifies L from t with respect to H, if the sequence of hypotheses output by M, when fed t, stabilizes on a number i (i.e. past some point M always outputs the hypothesis i) with L_i = L. M identifies C from text with respect to H, if it identifies every L ∈ C from every corresponding text. We say that C can be conservatively identified with respect to H iff there is an IIM M that identifies C from text with respect to H and that performs exclusively justified mind changes, i.e. if M, on some text t, outputs hypotheses i and later i′, then M must have seen some word w ∉ L_i before it outputs i′. In other words, M may only change its hypothesis when it has found hard evidence that it is wrong. LimTxt (ConsvTxt) denotes the collection of all indexable classes C for which there are an IIM M and a hypothesis space H such that M (conservatively) identifies C from text with respect to H. Note that ConsvTxt ⊂ LimTxt, cf. Zeugmann and Lange [18]. For the next theorem, let xSupQ denote the collection of all indexable classes which are learnable with extra restricted superset queries.

Theorem 15. ConsvTxt = xSupQ.

Proof. “ConsvTxt ⊆ xSupQ”: Fix C ∈ ConsvTxt. Then there are an indexing (L_i)_{i≥0} and an IIM M such that M ConsvTxt-identifies C with respect to (L_i)_{i≥0}. Obviously, if L ∈ C and t is a text for L, then M never returns an index i with L ⊂ L_i on any segment of t. Now the underlying indexable class used for the queries contains all languages in (L_i)_{i≥0} and all languages L_i \ {w} for i ≥ 0 and w ∈ Σ*. A learner M′ identifying any L ∈ C with extra restricted superset queries may work as follows. M′ looks for a superset of L and uses queries on variants of this superset to construct a text for L. First, to find a superset of L, M′ poses the queries L_0, L_1, . . ., until the answer ‘yes’ is received for the first time, say upon the query L_k. (* Note that L ⊆ L_k. *) Second, to effectively enumerate a text t for L, M′ determines the set T of all words w ∈ Σ* for which the query L_k \ {w} is answered with ‘no’. Since T = L and T is recursively enumerable in k, any effective enumeration of T yields a text for L. Third, to compute its hypothesis, M′ executes step 0, 1, 2, . . . until it receives a stop signal. In general, step n, n ≥ 0, consists of the following instructions: Determine i := M(t_n), where t is a fixed effective enumeration of the set T. Pose the query L_i. If the answer is ‘no’, execute step n + 1. Otherwise hypothesize i and stop. (* In the latter case, as M never hypothesizes a proper superset of L, M′ returns an index for L. *) Further details are omitted.

“xSupQ ⊆ ConsvTxt”: Fix an indexable class C ∈ xSupQ. Then there are an indexing (L_i)_{i≥0} and a query learner M such that M identifies C with extra restricted superset queries respecting (L_i)_{i≥0}.
A new indexing (L′_i)_{i≥0} is defined as follows:
– L′_0 is the empty language.
– If i is the canonical index of the finite set {i_1, . . . , i_n}, then L′_i = L_{i_1} ∩ · · · ∩ L_{i_n}.

An IIM M′ identifying C in the limit from text with respect to the hypothesis space (L′_i)_{i≥0}, given a text t, may work as follows. M′(t_0) := 0. To compute M′(t_{n+1}), the learner M′ simulates a query learning scenario with M for n steps of computation. If M does not return a hypothesis in the n-th step, then M′(t_{n+1}) := M′(t_n). Additionally, if M poses the query L_i
in the n-th step, then M will receive the answer ‘no’ if t_n^+ \ L_i ≠ ∅ (i.e. if L_i ⊉ t_n^+), and the answer ‘yes’ otherwise. If M returns a hypothesis i in the n-th step, then the hypothesis M′(t_{n+1}) is computed as follows:
• Let L_{i_1^+}, . . . , L_{i_m^+} be the queries answered with ‘yes’ in the currently simulated scenario.
• Compute the canonical index i′ of the set {i, i_1^+, . . . , i_m^+}.
• Return the hypothesis M′(t_{n+1}) = i′.

It is not hard to verify that M′ learns C in the limit from text; the relevant details are omitted. Moreover, as we will see next, M′ avoids overgeneralized hypotheses, that means, if t is a text for some L ∈ C, n ≥ 0, and M′(t_n) = i′, then L′_{i′} ⊅ L. Therefore, M′ can easily be transformed into a learner which identifies the class C conservatively in the limit from text.

To prove that M′ learns C in the limit from text without overgeneralizations, assume to the contrary that there are an L ∈ C, a text t for L, and an n ≥ 0, such that the hypothesis i′ = M′(t_n) fulfills L′_{i′} ⊃ L. Then i′ ≠ 0. By definition of M′, there must be a learning scenario S for M, in which

– M poses the queries L_{i_1^−}, . . . , L_{i_k^−}, L_{i_1^+}, . . . , L_{i_m^+} (in some particular order);
– the queries L_{i_1^−}, . . . , L_{i_k^−} are answered with ‘no’;
– the queries L_{i_1^+}, . . . , L_{i_m^+} are answered with ‘yes’;
– afterwards M returns the hypothesis i.

Hence i′ is the canonical index of the set {i, i_1^+, . . . , i_m^+}. This implies L′_{i′} = L_i ∩ L_{i_1^+} ∩ · · · ∩ L_{i_m^+}. So each of the languages L_{i_1^+}, . . . , L_{i_m^+} is a superset of L. By definition of M′, L_{i_j^−} ⊉ t_n^+ for 1 ≤ j ≤ k. Therefore none of the languages L_{i_1^−}, . . . , L_{i_k^−} is a superset of L. So the answers in the learning scenario S above are truthful respecting the language L. As M learns C with extra restricted superset queries, the hypothesis i must be correct for L, i.e. L_i = L. This yields L′_{i′} ⊆ L, in contradiction to L′_{i′} ⊃ L. So M′ learns C in the limit from text without overgeneralizations, which – by the argumentation above – implies C ∈ ConsvTxt. □
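The conservative behaviour established in the proof can be illustrated by a minimal enumerative IIM. This is a sketch of ours rather than the construction above: the indexing is a hypothetical list of membership predicates (standing in for a uniformly decidable indexing), and the learner changes its mind only when its current hypothesis rejects a word of the text, i.e. every mind change is justified.

```python
def conservative_iim(indexing, text):
    """Conservative identification by enumeration (sketch).  `indexing` is a
    list of membership predicates L_0, L_1, ...; on each text word the IIM
    keeps its hypothesis while it still contains all data seen, and
    otherwise switches to the least consistent index -- a justified mind
    change.  Assumes some index is consistent with every finite sample."""
    hypothesis, seen, history = None, set(), []
    for w in text:
        seen.add(w)
        if hypothesis is None or not all(indexing[hypothesis](x) for x in seen):
            hypothesis = next(i for i, L in enumerate(indexing)
                              if all(L(x) for x in seen))
        history.append(hypothesis)
    return history
```

On the chain L_i = {w ∈ {a}* : |w| ≤ i} and the text a, aa, ε the hypothesis sequence is 1, 2, 2: the single mind change from 1 to 2 is triggered by the word aa ∉ L_1.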
As an immediate consequence of xSupQ = ConsvTxt and the fact that ConsvTxt is a proper subset of LimTxt, we obtain the following corollary.

Corollary 1. xSupQ ⊂ LimTxt.

Theorem 15 and Corollary 1 are of relevance for the open question whether or not the class of all erasing pattern languages is learnable in the limit from text, if the underlying alphabet consists of at least three symbols. Obviously, if this class is learnable with extra restricted superset queries, then the open question can be answered in the affirmative. Conversely, if it is not learnable with extra restricted superset queries, then it is not conservatively learnable in the limit
Note that a result by Zeugmann and Lange [18] states that any indexable class, which is learnable in the limit from text without overgeneralizations, belongs to Consv Txt.
from text. Of course the latter would not yet imply that the open question can be answered in the negative. Still it would at least suggest that this is the case, since so far no ‘natural’ class is known that separates LimTxt from ConsvTxt.
References

1. D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117–135, 1980.
2. D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21:46–62, 1980.
3. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
4. D. Angluin. Queries revisited. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 2225, 12–31, Springer, 2001.
5. S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, T. Shinohara. A machine discovery from amino acid sequences by decision trees over regular patterns. New Generation Computing, 11:361–375, 1993.
6. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, T. Zeugmann. Learning one-variable pattern languages very efficiently on average, in parallel, and by asking questions. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 1316, 260–276, Springer, 1997.
7. E. M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
8. J. E. Hopcroft, J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, 1979.
9. T. Jiang, A. Salomaa, K. Salomaa, S. Yu. Decision problems for patterns. Journal of Computer and System Sciences, 50:53–63, 1995.
10. S. Lange, T. Zeugmann. Types of monotonic language learning and their characterization. Proc. ACM Workshop on Computational Learning Theory, 377–390, ACM Press, 1992.
11. S. Matsumoto, A. Shinohara. Learning pattern languages using queries. Proc. European Conf. on Computational Learning Theory, LNAI 1208, 185–197, Springer, 1997.
12. A. Mitchell. Learnability of a subclass of extended pattern languages. Proc. ACM Workshop on Computational Learning Theory, 64–71, ACM Press, 1998.
13. J. Nessel, S. Lange. Learning erasing pattern languages with queries. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 1968, 86–100, Springer, 2000.
14. D. Reidenbach. A negative result on inductive inference of extended pattern languages. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 2533, 308–320, Springer, 2002.
15. A. Salomaa. Patterns (the formal language theory column). EATCS Bulletin, 54:46–62, 1994.
16. A. Salomaa. Return to patterns (the formal language theory column). EATCS Bulletin, 55:144–157, 1995.
17. T. Shinohara. Polynomial time inference of extended regular pattern languages. Proc. RIMS Symposium on Software Science and Engineering, LNCS 147, 115–127, Springer, 1983.
18. T. Zeugmann, S. Lange. A guided tour across the boundaries of learning recursive languages. Algorithmic Learning for Knowledge-Based Systems, LNAI 961, 190–258, Springer, 1995.
Learning of Finite Unions of Tree Patterns with Repeated Internal Structured Variables from Queries

Satoshi Matsumoto¹, Yusuke Suzuki², Takayoshi Shoudai², Tetsuhiro Miyahara³, and Tomoyuki Uchida³

¹ Department of Mathematical Sciences, Tokai University, Hiratsuka 259-1292, Japan
[email protected]
² Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
{y-suzuki,shoudai}@i.kyushu-u.ac.jp
³ Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan
[email protected]
[email protected]
Abstract. In the field of Web mining, a Web page can be represented by a rooted tree T such that every internal vertex of T has ordered children and string data such as tags or texts are assigned to edges of T . A term tree is an ordered tree pattern, which has ordered tree structures and variables, and is suited for a representation of a tree structured pattern in Web pages. A term tree t is allowed to have a repeated variable which occurs in t more than once. In this paper, we consider the learnability of finite unions of term trees with repeated variables in the query learning model of Angluin (1988). We present polynomial time learning algorithms for finite unions of term trees with repeated variables by using superset and restricted equivalence queries. Moreover we show that there exists no polynomial time learning algorithm for finite unions of term trees by using restricted equivalence, membership and subset queries. This result indicates the hardness of learning finite unions of term trees in the query learning model.
1 Introduction
In the field of Web mining, Web documents such as HTML/XML files have tree structures and are called tree structured data. Tree structured data can be naturally represented by rooted trees T such that every internal vertex in T has ordered children, no vertex has a label, and every edge has a label [1]. We are interested in extracting a set (or a union) of tree structured patterns which explains heterogeneous tree structured data having no rigid structure. From this motivation, in this paper, we consider the polynomial time learnability of finite unions of tree structured patterns in the query learning model of Angluin [5]. A term tree is a rooted tree pattern which consists of tree structures, ordered children and internal structured variables [10,13]. A variable in a term tree is a
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 144–158, 2003. © Springer-Verlag Berlin Heidelberg 2003
list of two vertices and it can be substituted by an arbitrary tree. For example, the term tree t in Fig. 1 is a tree pattern explaining the tree T in Fig. 1 because T can be obtained from t by substituting variables x1 , x2 , x3 and x4 by trees g1 , g2 , g3 and g4 in Fig. 1, respectively. In [2,3], Amoth et al. presented the into-matching semantics and introduced the class of ordered tree patterns and ordered forests with the semantics. Such an ordered tree pattern is a standard tree pattern, which is also called a first order term in formal logic. Since a term tree may have variables consisting of two internal vertices (e.g. the variable x2 in Fig. 1), a term tree is more powerful than an ordered tree pattern. For example, in Fig. 1, the tree pattern f (b, x, g(a, z), y) can be represented by the term tree s, but the term tree t cannot be represented by any standard tree pattern because of the existence of internal structured variables represented by x2 and x3 in t. Arimura et al. [7] presented ordered gapped tree patterns and ordered gapped forests under into-matching semantics introduced in [3]. An ordered gapped tree pattern is incomparable to a term tree, since a gap-variable in an ordered gapped tree pattern does not exactly correspond to an internal variable in a term tree. A variable with a variable label x in a term tree t is said to be repeated if x occurs in t more than once. In this paper, we treat a term tree with repeated internal variables. In [7], Arimura et al. discussed the polynomial time learnability of ordered gapped forests without repeated gap-variables in the query learning model. In this paper, we discuss the polynomial time learnability of finite unions of term trees with repeated variables in the query learning model. For a tree T which represents tree structured data such as Web documents, string data such as tags or texts are assigned to edges of T . Hence, we assume naturally that the cardinality of a set of edge labels is infinite. 
Let Λ be a set of strings used in tree structured data. Then, our target class of learning is the set OTFΛ of all finite sets of term trees all of whose edges are labeled with elements in Λ. A term tree t is said to be regular (or repetition-free) if all variable labels in t are mutually distinct. The term tree language of a term tree t, denoted by LΛ(t), is the set of all labeled trees which are obtained from t by substituting arbitrary labeled trees for all variables in t. The language represented by a finite set of term trees R = {t1, t2, ..., tm} in OTFΛ is the finite union of m term tree languages LΛ(R) = LΛ(t1) ∪ LΛ(t2) ∪ ... ∪ LΛ(tm). In the query learning model of Angluin [5], a learning algorithm is said to exactly learn a target finite set R∗ of term trees if it outputs a finite set R of term trees such that LΛ(R) = LΛ(R∗) and halts, after it uses some queries. In this paper, firstly, we present a polynomial time algorithm which exactly learns any finite set in OTFΛ having m∗ term trees by using superset and restricted equivalence queries. Next, we show that there exists no polynomial time learning algorithm for finite unions of term trees by using restricted equivalence, membership and subset queries. This result indicates the hardness of learning finite unions of term trees in the query learning model. In the query learning model, many researchers [2,3,7,10] showed the exact learnabilities of several kinds of tree structured patterns (e.g. the query learning for ordered forests under onto-matching semantics [6], for unordered forests
under into-matching semantics [2,3], for ordered gapped forests [7], for regular term trees [11]). In [10], we showed the polynomial time exact learnability of finite unions of regular term trees using restricted subset queries and equivalence queries. As for other learning models, in [13], we showed that the class of single regular term trees is polynomial time inductively inferable from positive data. Further, we gave a data mining method from semistructured data, based on a learning algorithm for regular term trees [12]. Further related works are discussed in the Conclusions. This paper is organized as follows. In Sections 2 and 3, we explain term trees and the query learning model. In Section 4, we show that the class OTFΛ is exactly learnable in polynomial time by using superset and restricted equivalence queries. In Section 5, we show the hardness of learning unions of term trees in the query learning model.
2 Preliminaries
Definition 1. Let T = (VT, ET) be a rooted tree with ordered children which has a set VT of vertices and a set ET of edges. Let Et and Ht be a partition of ET, i.e., Et ∪ Ht = ET and Et ∩ Ht = ∅, and let Vt = VT. A triplet t = (Vt, Et, Ht) is called a term tree, and elements in Vt, Et and Ht are called a vertex, an edge and a variable, respectively. For a term tree t and its vertices v1 and vi, a path from v1 to vi is a sequence v1, v2, ..., vi of distinct vertices of t such that for any j with 1 ≤ j < i, there exists an edge or a variable which consists of vj and vj+1. If there is an edge or a variable which consists of v and v′ such that v lies on the path from the root to v′, then v is said to be the parent of v′ and v′ is a child of v. We use the notation [v, v′] to represent a variable {v, v′} ∈ Ht such that v is the parent of v′. Then we call v the parent port of [v, v′] and v′ the child port of [v, v′]. A term tree t is called ordered if every internal vertex u in t has a total ordering on all children of u. We define the size of t as the number of vertices in t and denote it by |t|. For a set S, the number of elements in S, called the size of S, is denoted by |S|. For example, the ordered term tree t = (Vt, Et, Ht) in Fig. 1 is defined as follows: Vt = {v1, ..., v11}, Et = {{v1, v2}, {v2, v3}, {v1, v4}, {v7, v8}, {v1, v10}, {v10, v11}} with the root v1 and the sibling relation displayed in Fig. 1, and Ht = {[v4, v5], [v1, v6], [v6, v7], [v6, v9]}. For any ordered term tree t, a vertex u of t, and two children u′ and u′′ of u, we write u′ <_u^t u′′ if u′ is less than u′′ in the total ordering on the children of u.
Fig. 1. A term tree t explains a tree T . A term tree s represents the tree pattern f (b, x, g(a, z), y). A variable is represented by a box with lines to its elements. The label inside a box is the variable label of the variable.
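The triple t = (Vt, Et, Ht) spelled out above for the term tree of Fig. 1 can be checked mechanically. The following is a minimal sketch (our own illustration, not code from the paper): it verifies that Et and Ht partition the |Vt| − 1 parent–child connections of the tree, as Definition 1 requires.

```python
# Term tree t = (Vt, Et, Ht) from the running example (Definition 1 / Fig. 1).
# Edges are unordered two-element sets; variables are [parent port, child port] pairs.
Vt = {f"v{i}" for i in range(1, 12)}
Et = [frozenset(e) for e in [("v1", "v2"), ("v2", "v3"), ("v1", "v4"),
                             ("v7", "v8"), ("v1", "v10"), ("v10", "v11")]]
Ht = [("v4", "v5"), ("v1", "v6"), ("v6", "v7"), ("v6", "v9")]

def is_partition(Vt, Et, Ht):
    """Et and Ht must be disjoint and together cover the |Vt| - 1 connections."""
    edges = set(Et)
    variables = {frozenset(h) for h in Ht}
    return edges.isdisjoint(variables) and len(edges | variables) == len(Vt) - 1

size = len(Vt)  # |t|, the size of the term tree
```

Here |t| = 11, and the 6 edges plus 4 variables account for the 10 parent–child connections of an 11-vertex tree.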
We denote by OTT Λ (resp. µOTT Λ) the set of all term trees (resp. regular term trees) with Λ as a set of edge labels, and by OTFΛ (resp. µOTFΛ) the set of all finite sets of term trees (resp. regular term trees) with Λ as a set of edge labels. An ordered term tree with no variable is called a ground term tree and is considered to be a tree with ordered children. OT Λ denotes the set of all ground term trees with Λ as a set of edge labels. Let f = (Vf, Ef, Hf) and g = (Vg, Eg, Hg) be term trees. We say that f and g are isomorphic, denoted by f ≡ g, if there is a bijection ϕ from Vf to Vg such that (i) the root of f is mapped to the root of g by ϕ, (ii) {u, u′} ∈ Ef if and only if {ϕ(u), ϕ(u′)} ∈ Eg and the two edges have the same edge label, (iii) [u, u′] ∈ Hf if and only if [ϕ(u), ϕ(u′)] ∈ Hg and the two variables have the same variable label, and (iv) for any vertex u in f which has more than one child, and for any two children u′ and u′′ of u, u′ <_u^f u′′ if and only if ϕ(u′) <_{ϕ(u)}^g ϕ(u′′).
Fig. 2. The new ordering on vertices in the term tree f′ = f{x := [g, [u0, u1]]}.
A substitution θ is a finite collection of bindings {x1 := [g1, σ1], ..., xn := [gn, σn]}, where the xi's are mutually distinct variable labels in X. The term tree fθ, called the instance of f by θ, is obtained by applying all the bindings xi := [gi, σi] on f simultaneously. We define a new total ordering on the children of each vertex of fθ, as illustrated in Fig. 2.
3 Learning Model
In this paper, let R∗ be a set in OTFΛ which is to be identified. We say that the set R∗ is a target. Without loss of generality, we assume that LΛ(R∗) ≠ LΛ(R∗ − {r}) for any r ∈ R∗. We introduce the query learning model due to Angluin [5]. In this model, learning algorithms can access oracles that answer specific kinds of queries about the unknown set LΛ(R∗). We consider the following oracles.
1. Superset oracle Sup R∗: The input is a set R in OTFΛ. If LΛ(R) ⊇ LΛ(R∗), then the output is "yes". Otherwise, it returns a counterexample t ∈ LΛ(R∗) − LΛ(R). The query is called a superset query.
2. Restricted equivalence oracle rEquiv R∗: The input is a set R in OTFΛ. The output is "yes" if LΛ(R) = LΛ(R∗) and "no" otherwise. The query is called a restricted equivalence query.
3. Membership oracle Mem R∗: The input is a ground term tree t in OT Λ. The output is "yes" if t ∈ LΛ(R∗), and "no" otherwise. The query is called a membership query.
4. Subset oracle Sub R∗: The input is a set R in OTFΛ. The output is "yes" if LΛ(R) ⊆ LΛ(R∗). Otherwise, it returns a counterexample t ∈ LΛ(R) − LΛ(R∗). The query is called a subset query.
A learning algorithm A collects information about LΛ(R∗) by using queries and outputs a set R in OTFΛ. We say that a learning algorithm A exactly identifies a target R∗ in polynomial time using a certain type of queries if A halts in polynomial time and outputs a set R ∈ OTFΛ such that LΛ(R) = LΛ(R∗) using queries of the specified type.
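As a toy illustration of the four oracles (our own sketch with hypothetical names, using small finite sets in place of the infinite term tree languages LΛ(·)): the superset and subset oracles either answer "yes" or return a counterexample from the appropriate set difference, while the restricted equivalence oracle answers only "yes"/"no".

```python
class Oracles:
    """Toy oracles over a hidden finite target set, standing in for L(R*)."""
    def __init__(self, target):
        self.target = set(target)

    def superset(self, L):
        """Sup: "yes" if L contains the target, else a counterexample in target - L."""
        missing = self.target - set(L)
        return "yes" if not missing else min(missing)

    def restricted_equivalence(self, L):
        """rEquiv: "yes"/"no" only; no counterexample is returned."""
        return "yes" if set(L) == self.target else "no"

    def membership(self, t):
        """Mem: is the single object t in the target?"""
        return "yes" if t in self.target else "no"

    def subset(self, L):
        """Sub: "yes" if L is contained in the target, else a counterexample in L - target."""
        extra = set(L) - self.target
        return "yes" if not extra else min(extra)

orc = Oracles({"t1", "t2"})
```

The choice of `min(...)` just makes the returned counterexample deterministic; any element of the difference would do.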
4 Learning Using Superset and Restricted Equivalence Queries
In this section, we show the learnability of finite unions of term tree languages in the query learning model. In Subsection 4.1, we introduce some notations. In Subsection 4.2, we show that any set in OTFΛ is exactly identifiable in polynomial time using superset queries if the size of a target is known. In Subsection 4.3, we show that any set in OTFΛ is exactly identifiable in polynomial time using superset and restricted equivalence queries even if the size of a target is unknown.

4.1 Compactness and Extensions of Term Trees
The following property, called compactness, plays an important role in the learning of unions of languages [7,8]. By Lemma 1, in this paper, we assume that |Λ| is infinite.
t4 Fig. 3. For regular term trees t1 , t2 , t3 and t4 , LΛ (t1 ) ⊆ LΛ (t2 ), LΛ (t4 ) ⊆ LΛ (t3 ), t1 t2 and t4 t3 .
Lemma 1. Let r be a term tree in OTT Λ, R a set in OTFΛ and |Λ| infinite. Then, r ≼ r′ for some r′ ∈ R if and only if LΛ(r) ⊆ LΛ(R).

Proof. Let wr be a ground term tree obtained from r by replacing its variables with edges which have mutually distinct labels not appearing in any term tree in R. Since wr ∈ LΛ(R) if LΛ(r) ⊆ LΛ(R), there exists a term tree r′ ∈ R such that wr ∈ LΛ(r′). Since none of the substituted labels appears in R, we have r ≼ r′ by inverting the substitution. □

For example, let Λ = {a, b}. In Fig. 3, we have LΛ(t1) ⊆ LΛ(t2) and t1 ⋠ t2. Thus, if |Λ| = 2, then compactness does not hold. Moreover, let Λ = {a, b, c}. Then LΛ(t4) ⊆ LΛ(t3) and t4 ⋠ t3. These examples show that |Λ| must be infinite for compactness to hold. We introduce operations increasing variables in a term tree.
Fig. 4. Regular term trees T2, T4, T6 are obtained by extensions from T1, T3, T5 respectively, that is, T1 ⇒_{[v1,v2]}^1 T2, T3 ⇒_{[v1,v2]}^2 T4 and T5 ⇒_{[v1,v2]}^3 T6. It is clear that T2 ≺ T1, T4 ≺ T3, T6 ≺ T5, |T2| > |T1|, |T4| > |T3| and |T6| > |T5|.
Definition 2. Let r = (Vr, Er, Hr) be a term tree in µOTT Λ, v1, v2 ∈ Vr and v3 ∉ Vr. We define three extensions of term trees as the following operations:
1. If [v1, v2] ∈ Hr, then Vr′ = Vr ∪ {v3} and Hr′ = Hr ∪ {[v1, v3], [v3, v2]} − {[v1, v2]}. The variables [v1, v3] and [v3, v2] have variable labels which do not appear in r, respectively.
2. If [v1, v2] ∈ Hr, then Vr′ = Vr ∪ {v3}, Hr′ = Hr ∪ {[v1, v3]} and v2 is the next sibling of v3. The variable [v1, v3] has a variable label which does not appear in r.
3. If [v1, v2] ∈ Hr, then Vr′ = Vr ∪ {v3}, Hr′ = Hr ∪ {[v1, v3]} and v3 is the next sibling of v2. The variable [v1, v3] has a variable label which does not appear in r.
Let r = (Vr, Er, Hr) be a regular term tree and r′ the regular term tree obtained from r by the i-th extension of Definition 2 for a variable h ∈ Hr.
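The three extension operations can be sketched as follows (our own simplification, not the authors' code: a term tree is reduced to its list of variables, and the sibling-ordering bookkeeping that distinguishes extensions 2 and 3 is not modeled). Each variable admits three extensions, which is why the set ES(r) defined below has at most three elements per variable, hence at most 3|r| elements.

```python
def extensions(H):
    """Enumerate (i, h, new_H): the i-th extension of Definition 2 applied to variable h."""
    out = []
    for h in H:
        v1, v2 = h
        v3 = ("fresh", v1, v2)  # a new vertex, not occurring in Vr
        # 1. split [v1, v2] into [v1, v3] and [v3, v2]
        out.append((1, h, [g for g in H if g != h] + [(v1, v3), (v3, v2)]))
        # 2. add [v1, v3] with v2 as the next sibling of v3 (ordering not modeled here)
        out.append((2, h, H + [(v1, v3)]))
        # 3. add [v1, v3] with v3 as the next sibling of v2 (ordering not modeled here)
        out.append((3, h, H + [(v1, v3)]))
    return out

ES = extensions([("v4", "v5"), ("v1", "v6")])  # a term tree with two variables
```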
Then we write r ⇒_h^i r′. For example, in Fig. 4, regular term trees T2, T4, T6 are obtained by extensions from regular term trees T1, T3, T5 respectively, that is, T1 ⇒_{[v1,v2]}^1 T2, T3 ⇒_{[v1,v2]}^2 T4 and T5 ⇒_{[v1,v2]}^3 T6. Then, we have T2 ≺ T1, T4 ≺ T3, T6 ≺ T5, |T2| > |T1|, |T4| > |T3| and |T6| > |T5|. We define ES(r) as follows: ES(r) = {r′ ∈ µOTT Λ | r ⇒_h^i r′ for some h ∈ Hr and some i ∈ {1, 2, 3}}. Note that |r′| > |r| and r′ ≺ r for any r′ ∈ ES(r). The number of non-isomorphic term trees in ES(r) is at most 3|r|.

4.2 The Size of a Target Is Known
In this section, we assume that we know the size of R∗ in advance. Then, we show that any set in OTFΛ is exactly identifiable in polynomial time using superset queries. Thus, let |R∗| = m∗. We show that the algorithm LEARN KNOWN in Fig. 5 exactly identifies any set R∗ ∈ OTFΛ in polynomial time using superset queries. In LEARN KNOWN, Rhypo denotes a hypothesis set which is included in OTFΛ and Rnocheck denotes a set of regular term trees which are not yet checked by the algorithm LEARN OTT. Note that Rnocheck ∈ µOTFΛ and each regular term tree in Rnocheck consists of variables only.

Lemma 2. Let R be a set in µOTFΛ, r a term tree in R and R′ a set in OTFΛ. If LΛ(R′) ⊆ LΛ(R) and LΛ(R′) ⊄ LΛ(R − {r}) ∪ LΛ(ES(r)), then there exists a term tree r′ ∈ R′ such that r′ ≼ r and |r′| = |r|.

Proof. Let rc be a ground term tree in LΛ(R′) − (LΛ(R − {r}) ∪ LΛ(ES(r))). Since rc ∈ LΛ(R′), there exists a term tree r′ in R′ such that rc ∈ LΛ(r′). We assume r′ ⋠ r. By Lemma 1, there exists a term tree r′′ in R such that r′ ≼ r′′; since r′ ⋠ r, we have r′′ ≠ r. Then, rc ∈ LΛ(r′) ⊆ LΛ(r′′) ⊆ LΛ(R − {r}). This is a contradiction. Thus, we have r′ ≼ r. Since r′ ≼ r, it is clear that |r′| ≥ |r|. We assume |r′| > |r|. Then, rc ∈ LΛ(r′) ⊆ LΛ(ES(r)). This is a contradiction. Thus, we have |r′| = |r|. □

We denote by rin (resp. Rin) a term tree (resp. a set of term trees) given to the algorithm LEARN OTT in Fig. 6. By Lemma 2, the algorithm LEARN OTT always takes as input a term tree rin such that r∗ ≼ rin and |r∗| = |rin| for some r∗ ∈ R∗. Note that ES(rin) ⊆ Rin and (Rnocheck − {rin}) ⊆ (Rhypo − {rin}) ⊆ Rin. We show that if there exists a term tree r∗ ∈ R∗ such that r∗ ≡ rin, then the algorithm outputs rin. Otherwise, the algorithm calls itself recursively and gives a term tree r such that r ≺ rin and |r| = |rin|. We give some notations. Let r be a term tree in OTT Λ, α an edge label and x, y variable labels appearing in r. We denote by Xr the set of all variable labels appearing in r, and by ρe(r, x, α) (resp.
ρv (r, x, y)) the term tree obtained from r by replacing all variables having the variable label x with edges having the
Algorithm LEARN KNOWN
Given: An oracle Sup R∗ for the target R∗ ∈ OTFΛ and an integer m with m ≥ m∗;
Output: A set R ∈ OTFΛ with LΛ(R) = LΛ(R∗);
begin
  Rhypo := ∅;
  if Sup R∗(Rhypo) = "yes" then output Rhypo
  else begin
    Let r = ({u, v}, ∅, {[u, v]}) ∈ µOTT Λ and R = {r};
    Rhypo := Rnocheck := R;
    while Rnocheck ≠ ∅ do begin
      foreach r ∈ Rnocheck do
        if Sup R∗((Rhypo − {r}) ∪ ES(r)) = "yes" then begin
          Rhypo := (Rhypo − {r}) ∪ ES(r);
          Rnocheck := (Rnocheck − {r}) ∪ ES(r);
          /* Remove redundant term trees in ES(r). */
          foreach r′ ∈ ES(r) do
            if Sup R∗(Rhypo − {r′}) = "yes" then begin
              Rhypo := Rhypo − {r′};
              Rnocheck := Rnocheck − {r′};
            end;
        end
        else begin
          R′ := LEARN OTT(m, (Rhypo − {r}) ∪ ES(r), r);
          Rhypo := (Rhypo − {r}) ∪ R′ ∪ ES(r);
          Rnocheck := (Rnocheck − {r}) ∪ ES(r);
          /* Remove redundant term trees in ES(r). */
          foreach r′ ∈ ES(r) do
            if Sup R∗(Rhypo − {r′}) = "yes" then begin
              Rhypo := Rhypo − {r′};
              Rnocheck := Rnocheck − {r′};
            end;
        end;
    end;
    output Rhypo;
  end;
end.
Fig. 5. Algorithm LEARN KNOWN
edge label α (resp. variables having a variable label y). For a subset ∆ of Λ, we define the set RS ∆ (r) as follows: RS ∆ (r) = {ρe (r, x, α) ∈ OTT Λ | x ∈ Xr and α is an edge label in ∆.} ∪{ρv (r, x, y) ∈ OTT Λ | x, y ∈ Xr , x and y are different.}
If r ∈ OT Λ, then we define RS ∆(r) = ∅. Note that r′ ≺ r and |r′| = |r| for any r′ ∈ RS ∆(r), and the number of non-isomorphic term trees in RS ∆(r) is at most |r| · |∆| + |r|^2. In the algorithm LEARN OTT, let t1, t2, ..., ti, ... and ∆1, ∆2, ..., ∆i, ... (i ≥ 1) be the sequence of counterexamples returned by the superset queries in line 7 and the sequence of finite subsets of Λ obtained in line 9, respectively. Let ∆0 be the finite subset of Λ obtained in line 6. And we suppose that at each stage i ≥ 0, LEARN OTT makes a superset query Sup R∗(Rin ∪ RS ∆i(rin)), and receives a counterexample ti+1. First we assume that rin ≡ r∗ for some r∗ ∈ R∗. Then we have the following lemma.

Lemma 3. For any i ≥ 0, LΛ(R∗) ⊄ LΛ(Rin ∪ RS ∆i(rin)).

Proof. If rin has no variable, it is clear. We assume that rin has variables. The proof is by induction on the number of iterations i ≥ 0 of the while-loop. In case i = 0: since |Λ| is infinite, there exists an edge label in Λ − ∆0. Thus, we can construct a ground term tree r′ with r′ ∈ LΛ(rin) − LΛ(Rin ∪ RS ∆0(rin)). Since LΛ(rin) ⊆ LΛ(R∗), it follows that LΛ(R∗) ⊄ LΛ(Rin ∪ RS ∆0(rin)). We assume inductively that the result holds for any number of iterations of the while-loop less than i. By the inductive hypothesis, LΛ(R∗) ⊄ LΛ(Rin ∪ RS ∆i−1(rin)). Thus, ti is obtained and ∆i is defined. We show that LΛ(R∗) ⊄ LΛ(Rin ∪ RS ∆i(rin)). Since |Λ| is infinite, there exists an edge label in Λ − ∆i, and we can construct a ground term tree r′ ∈ LΛ(R∗) − LΛ(Rin ∪ RS ∆i(rin)). Therefore, LΛ(R∗) ⊄ LΛ(Rin ∪ RS ∆i(rin)). □

Next we assume that rin ≢ r∗ for any r∗ ∈ R∗. Let r∗1, ..., r∗ℓ be the term trees in R∗ such that r∗i ≺ rin and |r∗i| = |rin| for each i ∈ {1, ..., ℓ}, where ℓ ≤ m∗. Since LΛ(R∗) ⊆ LΛ(Rin ∪ {rin}) and ES(rin) ⊆ Rin, we have LΛ(R∗ − {r∗1, ..., r∗ℓ}) ⊆ LΛ(Rin). Then we have the following lemma.

Lemma 4. There exists a subset S of {r∗1, ..., r∗ℓ} such that |S| ≥ i + 1 and LΛ(S) ⊆ LΛ(RS ∆i(rin)).

Proof.
The proof is by induction on the number of iterations i ≥ 0 of the while-loop. In case i = 0: let t be a ground term tree given by Sup R∗(Rin) as a counterexample in line 5. Then, t ∈ LΛ(r∗) for some r∗ ∈ {r∗1, ..., r∗ℓ}. Since ∆0 is the set of edge labels appearing in t, r∗ ≼ r for some r ∈ RS ∆0(rin). Thus, we have LΛ({r∗}) ⊆ LΛ(RS ∆0(rin)). We assume inductively that the result holds for any number of iterations of the while-loop less than i. By the inductive hypothesis, there exists a subset S of {r∗1, ..., r∗ℓ} such that |S| ≥ i and LΛ(S) ⊆ LΛ(RS ∆i−1(rin)). If LΛ(R∗) ⊄ LΛ(Rin ∪ RS ∆i−1(rin)), ti is obtained. Since LΛ(S) ⊆ LΛ(RS ∆i−1(rin)), there exists a term tree r∗ ∈ {r∗1, ..., r∗ℓ} − S such that ti ∈ LΛ(r∗). Since ∆i contains all edge labels appearing in ti, we have LΛ({r∗}) ⊆ LΛ(RS ∆i(rin)). Thus, there exists a subset S′ of {r∗1, ..., r∗ℓ} such that |S′| ≥ i + 1 and LΛ(S′) ⊆ LΛ(RS ∆i(rin)), where S ∪ {r∗} ⊆ S′. □

From the above lemma, for some i ≤ m, LΛ({r∗1, ..., r∗ℓ}) ⊆ LΛ(RS ∆i(rin)). It follows that LΛ(R∗) ⊆ LΛ(Rin ∪ RS ∆i(rin)).
Thus, by Lemmas 3 and 4, the algorithm LEARN OTT exactly identifies the set {r∗1, ..., r∗ℓ}. The algorithm is called recursively at most O(|rin|^2) times to identify one term tree in {r∗1, ..., r∗ℓ}. Thus, the algorithm is called recursively at most O(ℓ|rin|^2) times in all. The while-loop of lines 7–11 is repeated at most O(m) times. Since ES(rin) ⊆ Rin, |ti| = |rin| for any i. Thus, in the foreach-loop of lines 15–17, |∆| ≤ |t1| + ... + |tm| = m|rin|. The loop uses at most O(m|rin|^2) superset queries. The number of superset queries needed to identify the set {r∗1, ..., r∗ℓ} is at most O(ℓm|rin|^4). The algorithm LEARN KNOWN uses at most O(|rin|^2) superset queries to obtain a term tree rin. Thus, the number of superset queries the algorithm needs to identify a target R∗ is at most O(m^2 n^4 + 1), where n is the maximum size of term trees in R∗. From the above, we have the following theorem.

Theorem 1. If the algorithm LEARN KNOWN of Fig. 5 takes an integer m with m ≥ |R∗| as input, then it exactly identifies a set R∗ ∈ OTFΛ in polynomial time using at most O(m^2 n^4 + 1) superset queries, where n is the maximum size of term trees in R∗.
4.3 The Size of a Target Is Unknown
In this section, we assume that the size of a target is unknown. Nevertheless, by Theorem 1, we have the following theorem.

Theorem 2. The algorithm LEARN OTF of Fig. 7 exactly identifies any set R∗ ∈ OTFΛ in polynomial time using at most O(m∗^3 n^4 + 1) superset queries and at most O(m∗ + 1) restricted equivalence queries, where m∗ is the size of R∗ and n is the maximum size of term trees in R∗.
5 Hardness Result on Learnability
In this section, we show the insufficiency of learning of OTFΛ in the query learning model. We use the following lemma to show this insufficiency.

Lemma 5. (László Lovász [9]) Let UTn be the number of all rooted unordered trees with no edge labels whose size is n. Then, 2^n < UTn < 4^n, where n ≥ 6.

We denote by OTn the set of all rooted ordered trees with no edge labels whose size is n. From the above lemma, we have |OTn| ≥ 2^n, where n ≥ 6. By Lemma 5 and Lemma 1 in [5], we have Theorem 3.
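For intuition on the counting, rooted ordered trees are enumerated by Catalan numbers. Under the convention that n counts edges (an assumption of ours; the paper's exact size convention is not restated here), C_n = (1/(n+1)) · binom(2n, n), and a quick computation confirms growth past 2^n for n ≥ 6, in line with the exponential lower bound used below.

```python
def catalan(n):
    """C_n computed via the recurrence C_{k+1} = C_k * 2(2k+1)/(k+2), C_0 = 1."""
    c = 1
    for k in range(n):
        c = c * 2 * (2 * k + 1) // (k + 2)
    return c

counts = [catalan(n) for n in range(8)]  # 1, 1, 2, 5, 14, 42, 132, 429
```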
Algorithm LEARN OTT
Given: An oracle Sup R∗ for the target R∗ ∈ OTFΛ, a positive integer m, a set Rin in OTFΛ and a term tree rin in OTT Λ such that LΛ(R∗) ⊆ LΛ(Rin ∪ {rin}) and LΛ(R∗) ⊄ LΛ(Rin);
Output: A set S in OTFΛ;
begin
1.  S := ∅;
2.  if rin ∈ OT Λ then S := {rin}
3.  else begin
4.    Let n := 0;
5.    Let t be a counterexample given by Sup R∗(Rin);
6.    Let ∆ be the set of all edge labels in t;
7.    while Sup R∗(Rin ∪ RS ∆(rin)) ≠ "yes" and n ≤ m do begin
8.      Let t′ be a counterexample and ∆′ the set of all edge labels in t′;
9.      ∆ := ∆ ∪ ∆′;
10.     n := n + 1;
11.   end;
12.   if n > m then S := {rin}
13.   else begin
14.     /* Remove redundant term trees in RS. */
15.     foreach r ∈ RS ∆(rin) do
16.       if Sup R∗(Rin ∪ RS ∆(rin) − {r}) = "yes" then
17.         RS ∆(rin) := RS ∆(rin) − {r};
18.     foreach r ∈ RS ∆(rin) do begin
19.       S′ := LEARN OTT(m, Rin ∪ RS ∆(rin) − {r}, r);
20.       S := S ∪ S′;
21.     end;
22.   end;
23. end;
24. output S;
end.
Fig. 6. Algorithm LEARN OTT
Theorem 3. Any learning algorithm that exactly identifies all finite sets of term trees of size n using restricted equivalence, membership and subset queries must make more than 2^n queries in the worst case, where n ≥ 6 and |Λ| ≥ 1.

Proof. We denote by Sn the class of singleton sets of ground term trees whose size is n. The class Sn is a subclass of OTFΛ. For any distinct L and L′ in Sn, L ∩ L′ = ∅. Moreover, the empty set is included in OTFΛ. Thus, by Lemma 5 and Lemma 1 in [5], any learning algorithm that exactly identifies all finite sets of term trees of size n using restricted equivalence, membership and subset queries must make more than 2^n queries in the worst case, where n ≥ 6 and |Λ| ≥ 1. □
Algorithm LEARN OTF
Given: Oracles Sup R∗ and rEquiv R∗ for the target R∗ ∈ OTFΛ;
Output: A set R ∈ OTFΛ with LΛ(R) = LΛ(R∗);
begin
  m := 0; R := ∅;
  repeat
    m := m + 1;
    R := LEARN KNOWN(m) using Sup R∗;
  until rEquiv R∗(R) = "yes";
  output R;
end.
Fig. 7. Algorithm LEARN OTF
Table 1. Our results and future works

            Exact Learning                                        Inductive Inference from positive data
µOTT Λ      Yes [11]: membership & a positive example             Yes [13]: polynomial time (|Λ| ≥ 1)
            (|Λ| ≥ 2)
µOTFΛ       Yes [10]: restricted subset & equivalence             Open
            (|Λ| is infinite)
OTFΛ        sufficient [This work]: superset & restricted         Open
            equivalence (|Λ| is infinite);
            insufficient [This work]: restricted equivalence,
            membership, subset (|Λ| ≥ 1)
6 Conclusions
We have studied the learnability of OTFΛ in the query learning model. In Section 4, we have shown that any finite set R∗ ∈ OTFΛ is exactly identifiable using at most O(m∗^3 n^4 + 1) superset queries and at most O(m∗ + 1) restricted equivalence queries, where m∗ = |R∗|, n is the maximum size of term trees in R∗ and |Λ| is infinite. In Section 5, we have shown that it is hard to exactly identify any set in OTFΛ efficiently using restricted equivalence, membership and subset queries.
We showed the learnabilities of µOTT Λ and µOTFΛ in the query learning model [10,11]. Suzuki et al. [13] showed the learnability of µOTT Λ in the framework of polynomial time inductive inference from positive data [4]. Thus, we will study the learnabilities of µOTFΛ and OTFΛ in the same framework. We summarize our results and future works in Table 1.
References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000.
2. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of tree patterns from queries and counterexamples. Proc. COLT-98, ACM Press, pages 175–186, 1998.
3. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of unordered tree patterns from queries. Proc. COLT-99, ACM Press, pages 323–332, 1999.
4. D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21:46–62, 1980.
5. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
6. H. Arimura, H. Ishizaka, and T. Shinohara. Learning unions of tree patterns using queries. Theoretical Computer Science, 185(1):47–62, 1997.
7. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured data from queries. Proc. ALT-2001, Springer-Verlag, LNAI 2225, pages 315–331, 2001.
8. H. Arimura, T. Shinohara, and S. Otsuki. Polynomial time algorithm for finding finite unions of tree pattern languages. Proc. NIL-91, Springer-Verlag, LNAI 659, pages 118–131, 1993.
9. L. Lovász. Combinatorial Problems and Exercises, chapter Two classical enumeration problems in graph theory. North-Holland Publishing Company, 1979.
10. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions of tree patterns with internal structured variables from queries. Proc. AI-2002, Springer LNAI 2557, pages 523–534, 2002.
11. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning unions of term tree languages using queries. Proceedings of LA Summer Symposium, July 2002, pages 21-1 – 21-10, 2002.
12. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda. Discovery of frequent tag tree patterns in semistructured web documents. Proc. PAKDD-2002, Springer-Verlag, LNAI 2336, pages 341–355, 2002.
13. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida.
Polynomial time inductive inference of ordered tree patterns with internal structured variables from positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184, 2002.
Kernel Trick Embedded Gaussian Mixture Model

Jingdong Wang, Jianguo Lee, and Changshui Zhang

State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing, 100084, P. R. China
{wangjd01,lijg01}@mails.tsinghua.edu.cn, [email protected]
Abstract. In this paper, we present a kernel trick embedded Gaussian Mixture Model (GMM), called kernel GMM. The basic idea is to embed the kernel trick into the EM algorithm and deduce a parameter estimation algorithm for GMM in feature space. Kernel GMM can be viewed as a Bayesian Kernel Method. Compared with most classical kernel methods, the proposed method can solve problems in a probabilistic framework. Moreover, it can tackle nonlinear problems better than the traditional GMM. To avoid the great computational cost that most kernel methods incur on large scale data sets, we also employ a Monte Carlo sampling technique to speed up kernel GMM so that it is more practical and efficient. Experimental results on synthetic and real-world data sets demonstrate that the proposed approach has satisfying performance.
1 Introduction
The kernel trick is an efficient method for nonlinear data analysis first used by the Support Vector Machine (SVM) [18]. It has been pointed out that the kernel trick can be used to develop a nonlinear generalization of any algorithm that can be cast in terms of dot products. In recent years, the kernel trick has been successfully introduced into various machine learning algorithms, such as Kernel Principal Component Analysis (Kernel PCA) [14,15], Kernel Fisher Discriminant (KFD) [11], Kernel Independent Component Analysis (Kernel ICA) [7] and so on. However, in many cases, we are required to obtain risk minimization results and incorporate prior knowledge, which can easily be provided within a Bayesian probabilistic framework. This motivates the combination of the kernel trick and Bayesian methods, which is called the Bayesian Kernel Method [16]. As the Bayesian Kernel Method works in a probabilistic framework, it can realize Bayesian optimal decisions and estimate confidence or reliability easily with probabilistic criteria such as Maximum-A-Posteriori [5] and so on. Recently some research has been done in this field. Kwok combined the evidence framework with SVM [10]; Gestel et al. [8] incorporated the Bayesian framework with SVM and KFD. These two works both apply the Bayesian framework to known kernel methods. On the other hand, some researchers proposed
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 159–174, 2003. © Springer-Verlag Berlin Heidelberg 2003
new Bayesian methods with the kernel trick embedded, among which one of the most influential works is the Relevance Vector Machine (RVM) proposed by Tipping [17]. This paper also addresses the problem of the Bayesian Kernel Method. We embed the kernel trick into the Expectation-Maximization (EM) algorithm [3] and deduce a new parameter estimation algorithm for the Gaussian Mixture Model (GMM) in the feature space. The entire model is called the kernel Gaussian Mixture Model (kGMM). The rest of this paper is organized as follows. Section 2 reviews some background knowledge, and Section 3 describes the kernel Gaussian Mixture Model and the corresponding parameter estimation algorithm. Experiments and results are presented in Section 4. Conclusions are drawn in the final section.
2 Preliminaries
In this section, we review some background knowledge, including the kernel trick, GMM based on the EM algorithm, and the Bayesian Kernel Method.

2.1 Kernel Trick
The Mercer kernel trick was first applied by SVM. The idea is that we can implicitly map input data into a high-dimensional feature space via a nonlinear function:

Φ : X → H,  x ↦ φ(x)    (1)
A similarity measure is then defined from the dot product in the space H as follows:

k(x, x') = φ(x) · φ(x')    (2)
where the kernel function k should satisfy Mercer's condition [18]. This allows us to deal with learning algorithms using linear algebra and analytic geometry. Generally speaking, on the one hand the kernel trick lets us work with data in the high-dimensional dot product space H, called the feature space, via the map associated with k. On the other hand, it avoids the expensive computation in the feature space by employing the kernel function k instead of directly computing dot products in H. Being an elegant way to perform nonlinear analysis, the kernel trick has been used in many other algorithms such as Kernel Fisher Discriminant [11], Kernel PCA [14,15], Kernel ICA [7] and so on.
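As a brief illustrative sketch (not part of the original derivation), consider the homogeneous polynomial kernel of degree 2 on R². With an explicitly chosen toy feature map, one can verify that the kernel in (2) equals the feature-space dot product without ever constructing the map:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map on R^2: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    # Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The kernel evaluates the feature-space dot product phi(x) . phi(y)
# implicitly, never forming phi.
print(round(float(np.dot(phi(x), phi(y))), 6), k(x, y))  # 1.0 1.0
```

The same identity is what makes the feature-space computations of Section 3 tractable.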
2.2 GMM Based on EM Algorithm
GMM is a mixture density model which assumes that each component of the probabilistic model is a Gaussian density. That is to say:
p(x|Θ) = ∑_{i=1}^{M} α_i G_i(x|θ_i)    (3)
where x ∈ R^d is a random variable, the parameters Θ = {α_1, ..., α_M; θ_1, ..., θ_M} satisfy ∑_{i=1}^{M} α_i = 1, α_i ≥ 0, and G_i(x|θ_i) is a Gaussian probability density function:

G_l(x|θ_l) = (2π)^{-d/2} |Σ_l|^{-1/2} exp( -(1/2)(x - µ_l)^T Σ_l^{-1} (x - µ_l) )    (4)
where θ_l = (µ_l, Σ_l). GMM can be viewed as a generative model [12] or a latent variable model [6] that assumes the data set X = {x_i}_{i=1}^{N} is generated by M Gaussian components, and introduces latent variables Z = {z_i}_{i=1}^{N} whose values indicate which component generated each data point; that is, if sample x_i is generated by the lth component then z_i = l. The parameters of GMM can then be estimated by the EM algorithm [2]. The EM algorithm for GMM is an iterative procedure which estimates the new parameters in terms of the old parameters via the following updating formulas:

α_l^{(t)} = (1/N) ∑_{i=1}^{N} p(l|x_i, Θ^{(t-1)})

µ_l^{(t)} = ∑_{i=1}^{N} x_i p(l|x_i, Θ^{(t-1)}) / ∑_{i=1}^{N} p(l|x_i, Θ^{(t-1)})

Σ_l^{(t)} = ∑_{i=1}^{N} (x_i - µ_l^{(t)})(x_i - µ_l^{(t)})^T p(l|x_i, Θ^{(t-1)}) / ∑_{i=1}^{N} p(l|x_i, Θ^{(t-1)})    (5)
where l denotes the lth Gaussian component, and

p(l|x_i, Θ^{(t-1)}) = α_l^{(t-1)} G(x_i|θ_l^{(t-1)}) / ∑_{j=1}^{M} α_j^{(t-1)} G(x_i|θ_j^{(t-1)}),  l = 1, ..., M    (6)
Θ^{(t-1)} = {α_1^{(t-1)}, ..., α_M^{(t-1)}; θ_1^{(t-1)}, ..., θ_M^{(t-1)}} are the parameters of the (t-1)th iteration and Θ^{(t)} = {α_1^{(t)}, ..., α_M^{(t)}; θ_1^{(t)}, ..., θ_M^{(t)}} are the parameters of the tth iteration. GMM has been successfully applied in many fields, such as parametric clustering and density estimation. However, it cannot give a simple yet satisfactory clustering result on data sets with complex structure [13], as shown in Figure 1. One alternative is to perform GMM-based clustering in another space instead of the original data space.
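The updating formulas (5) and (6) can be sketched in a few lines of NumPy; the 1-D toy data and initialization below are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two well-separated Gaussians.
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

M = 2                                  # number of components
alpha = np.full(M, 1.0 / M)            # mixing weights alpha_l
mu = np.array([-1.0, 1.0])             # means mu_l
var = np.array([1.0, 1.0])             # variances (1-D Sigma_l)

for _ in range(50):
    # E-step: posteriors p(l | x_i, Theta) as in Eq. (6)
    dens = alpha * np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    post = dens / dens.sum(axis=1, keepdims=True)   # shape (N, M)
    # M-step: updating formulas of Eq. (5)
    Nl = post.sum(axis=0)
    alpha = Nl / len(X)
    mu = (post * X[:, None]).sum(axis=0) / Nl
    var = (post * (X[:, None] - mu) ** 2).sum(axis=0) / Nl

print(np.sort(mu))  # close to [-3, 3]
```

Section 3 replaces exactly these mean/covariance updates with kernel-matrix computations.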
Fig. 1. Data set of two concentric circles with 1,000 samples; points marked by '×' belong to one cluster and those marked by '·' to the other. (a) is the partition result of traditional GMM; (b) is the result achieved by kGMM using a polynomial kernel of degree 2; (c) shows the probability that each point belongs to the outer circle: the whiter the point, the higher the probability.
2.3 Bayesian Kernel Method
The Bayesian Kernel Method can be viewed as a combination of Bayesian methods and kernel methods, and it inherits merits from both. It can tackle problems nonlinearly like a kernel method, and it obtains estimation results within a probabilistic framework like classical Bayesian methods. Much work has been done in this field, typically along three different lines:

– interpreting kernel methods such as SVM in a Bayesian framework, as in [10,8];
– employing kernel methods in traditional Bayesian methods, such as Gaussian Processes and Laplacian Processes [16];
– proposing new methods in a Bayesian framework with the kernel trick embedded, such as the Relevance Vector Machine (RVM) [17] and the Bayes Point Machine [9].

We intend to embed the kernel trick into the Gaussian Mixture Model; this work belongs to the second category of Bayesian Kernel Methods.
3 Kernel Trick Embedded GMM
As mentioned before, the Gaussian Mixture Model cannot obtain simple yet satisfactory results on data sets with complex structure, so we employ the kernel trick to realize a Bayesian kernel version of GMM. Our basic idea is to embed the kernel trick into the parameter estimation procedure of GMM. In this section, we first describe GMM in the feature space, then present some properties of the feature space, then formulate the kernel Gaussian Mixture Model and the corresponding parameter estimation algorithm, and finally discuss the algorithm.
3.1 GMM in Feature Space
GMM in the feature space, under a map φ(·) associated with the kernel function k, can easily be rewritten as

p(φ(x)|Θ) = ∑_{i=1}^{M} α_i G(φ(x)|θ_i)    (7)
and the EM updating formulas (5) and (6) are replaced by the following:

α_l^{(t)} = (1/N) ∑_{i=1}^{N} p(l|φ(x_i), Θ^{(t-1)})

µ_l^{(t)} = ∑_{i=1}^{N} φ(x_i) p(l|φ(x_i), Θ^{(t-1)}) / ∑_{i=1}^{N} p(l|φ(x_i), Θ^{(t-1)})

Σ_l^{(t)} = ∑_{i=1}^{N} (φ(x_i) - µ_l^{(t)})(φ(x_i) - µ_l^{(t)})^T p(l|φ(x_i), Θ^{(t-1)}) / ∑_{i=1}^{N} p(l|φ(x_i), Θ^{(t-1)})    (8)
where l denotes the lth Gaussian component, and

p(l|φ(x_i), Θ^{(t-1)}) = α_l^{(t-1)} G(φ(x_i)|θ_l^{(t-1)}) / ∑_{j=1}^{M} α_j^{(t-1)} G(φ(x_i)|θ_j^{(t-1)}),  l = 1, ..., M    (9)
However, computing GMM directly with formulas (8) and (9) in a high-dimensional feature space is computationally expensive and thus impractical. We employ the kernel trick to overcome this difficulty. In the following section, we give some properties based on the Mercer kernel trick that allow the GMM parameters to be estimated in the feature space.

3.2 Properties in Feature Space
For convenience, we first fix the notation for the feature space and then present three properties.

Notation. In all formulas, bold capital letters denote matrices, bold italic letters denote vectors, and italic lower-case letters denote scalars. The subscript l denotes the lth Gaussian component, and the superscript t denotes the tth iteration of the EM procedure. A^T denotes the transpose of matrix A. Further notation is given in Table 1.
Table 1. Notation List
p_li^{(t)} = p(l|φ(x_i), Θ^{(t)}) : posterior that φ(x_i) belongs to the lth component.
w_li^{(t)} = ( p_li^{(t)} / ∑_{j=1}^{N} p_lj^{(t)} )^{1/2} : (w_li^{(t)})² is the ratio by which φ(x_i) is occupied by the lth Gaussian component.
µ_l^{(t)} = ∑_{i=1}^{N} φ(x_i)(w_li^{(t)})² : mean vector of the lth Gaussian component.
φ̃_l(x_i) = φ(x_i) − µ_l^{(t)} : centered image of φ(x_i).
Σ_l^{(t)} = ∑_{i=1}^{N} φ̃(x_i)φ̃(x_i)^T (w_li^{(t)})² : covariance matrix of the lth Gaussian component.
(K_l^{(t)})_ij = ⟨w_li φ(x_i), w_lj φ(x_j)⟩ : kernel matrix.
(K̃_l^{(t)})_ij = ⟨w_li φ̃(x_i), w_lj φ̃(x_j)⟩ : centered kernel matrix.
(K_l^{(t)'})_ij = ⟨φ(x_i), w_lj φ(x_j)⟩ : projecting kernel matrix.
(K̃_l^{(t)'})_ij = ⟨φ̃(x_i), w_lj φ̃(x_j)⟩ : centered projecting kernel matrix.
λ_le^{(t)}, V_le^{(t)} : eigenvalue and eigenvector of Σ_l^{(t)}.
λ_le^{(t)}, β_le^{(t)} : eigenvalue and eigenvector of K̃_l^{(t)}.
Properties in Feature Space. According to Lemma 1 in the Appendix, we obtain the first property.

[Property 1] The centered kernel matrix K̃_l^{(t)} and the centered projecting kernel matrix K̃_l^{(t)'} can be computed as

K̃_l^{(t)} = K_l^{(t)} − W_l^{(t)} K_l^{(t)} − K_l^{(t)} W_l^{(t)} + W_l^{(t)} K_l^{(t)} W_l^{(t)}

K̃_l^{(t)'} = K_l^{(t)'} − W_l^{(t)'} K_l^{(t)} − K_l^{(t)'} W_l^{(t)} + W_l^{(t)'} K_l^{(t)} W_l^{(t)}    (10)

where W_l^{(t)} = w_l^{(t)} (w_l^{(t)})^T, W_l^{(t)'} = 1_N (w_l^{(t)})^T, w_l^{(t)} = [w_l1^{(t)}, ..., w_lN^{(t)}]^T, and 1_N is an N-dimensional column vector with all entries equal to 1. This property gives the way to center the kernel matrix and the projecting kernel matrix.
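Property 1 can be checked numerically. The sketch below uses random stand-in feature vectors (in practice the feature images are never formed explicitly) and compares kernel-side centering against explicit centering:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 4))          # stand-in for feature images phi(x_i)
w = rng.random(6)
w = np.sqrt(w / w.sum())               # weights with sum of w_i^2 equal to 1

# Weighted kernel matrix K_ij = <w_i phi(x_i), w_j phi(x_j)>
K = np.outer(w, w) * (Phi @ Phi.T)

# Property 1: centering performed entirely via kernel matrices
W = np.outer(w, w)
K_centered = K - W @ K - K @ W + W @ K @ W

# Direct check: center the explicit features and rebuild the matrix
mu = (Phi * w[:, None] ** 2).sum(axis=0)        # mu = sum_i phi(x_i) w_i^2
Phi_c = Phi - mu
K_direct = np.outer(w, w) * (Phi_c @ Phi_c.T)

print(np.allclose(K_centered, K_direct))  # True
```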
According to Lemma 2 in the Appendix and Property 1, we obtain the second property.

[Property 2] The feature space covariance matrix Σ_l^{(t)} and the centered kernel matrix K̃_l^{(t)} have the same nonzero eigenvalues, and the following equivalence relation holds:

Σ_l^{(t)} V_le^{(t)} = λ_le^{(t)} V_le^{(t)}  ⇔  K̃_l^{(t)} β_le^{(t)} = λ_le^{(t)} β_le^{(t)}    (11)

where V_le^{(t)} and β_le^{(t)} are eigenvectors of Σ_l^{(t)} and K̃_l^{(t)} respectively. This property enables us to compute eigenvectors from the centered kernel matrix instead of from the feature space covariance matrix of Equation (8). With the first two properties, we do not need to compute means and covariance matrices in the feature space; we perform the eigendecomposition of the centered kernel matrix instead.
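Property 2 is the familiar duality between the covariance of weighted features and their Gram matrix; a toy numeric check (with explicit stand-in features) looks like:

```python
import numpy as np

rng = np.random.default_rng(2)
Phi_c = rng.normal(size=(5, 3))            # centered feature images (toy)
w = np.sqrt(np.full(5, 1.0 / 5))           # equal weights, sum of w_i^2 = 1

# Covariance in feature space: Sigma = sum_i w_i^2 phi~(x_i) phi~(x_i)^T
Sigma = Phi_c.T @ np.diag(w ** 2) @ Phi_c   # 3 x 3

# Centered kernel matrix: K~_ij = <w_i phi~(x_i), w_j phi~(x_j)>
K_t = np.outer(w, w) * (Phi_c @ Phi_c.T)    # 5 x 5

ev_sigma = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
ev_kernel = np.sort(np.linalg.eigvalsh(K_t))[::-1]

# The nonzero spectra coincide (Property 2)
print(np.allclose(ev_sigma[:3], ev_kernel[:3]))  # True
```

This holds because both matrices are AAᵀ and AᵀA for A = diag(w) Φ̃, which always share their nonzero eigenvalues.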
The feature space has a very high dimensionality, which makes it intractable to compute the complete Gaussian probability density G_l(φ(x_j)|θ_l) directly. However, Moghaddam and Pentland's work [22] provides some motivation: it divides the original high-dimensional space into two parts, the principal subspace (of dimension d_φ) and its orthogonal complement. The complete Gaussian density can then be approximated by

Ĝ_l(φ(x_j)|θ_l) = [ (2π)^{-d_φ/2} ∏_{e=1}^{d_φ} λ_le^{-1/2} exp( -(1/2) ∑_{e=1}^{d_φ} y_e²/λ_le ) ] × [ (2πρ)^{-(N-d_φ)/2} exp( -ε²(x_j)/(2ρ) ) ]    (12)
where y_e = φ̃(x_j)^T V_le (V_le being the e-th eigenvector of Σ_l), ρ is the weight ratio, and ε²(x_j) is the residual reconstruction error. On the right-hand side of Equation (12), the first factor comes from the principal subspace and the second factor from the orthogonal complement subspace. The optimal value of ρ can be determined by minimizing a cost function; from an information-theoretic point of view, the cost function should be the Kullback-Leibler divergence between the true density G_l(φ(x_j)|θ_l) and its approximation Ĝ_l(φ(x_j)|θ_l):

J(ρ) = ∫ G_l(φ(x)|θ_l) log[ G_l(φ(x)|θ_l) / Ĝ_l(φ(x)|θ_l) ] dφ(x)    (13)
Plugging (12) into the above equation, it can easily be shown that

J(ρ) = (1/2) ∑_{e=d_φ+1}^{N} [ λ_le/ρ − 1 + log(ρ/λ_le) ]

Solving ∂J/∂ρ = 0 yields the optimal value

ρ* = (1/(N − d_φ)) ∑_{e=d_φ+1}^{N} λ_le
According to Property 2, Σ_l and K̃_l have the same nonzero eigenvalues; employing the properties of symmetric matrices, we obtain

ρ* = (1/(N − d_φ)) ( ‖Σ_l^{1/2}‖_F² − ∑_{e=1}^{d_φ} λ_le ) = (1/(N − d_φ)) ( ‖K̃_l^{1/2}‖_F² − ∑_{e=1}^{d_φ} λ_le )    (14)

where ‖·‖_F is the Frobenius matrix norm, defined as ‖A‖_F = (trace(AA^T))^{1/2}.
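The optimal ρ* is simply the average of the discarded eigenvalues; a one-line sketch with made-up eigenvalues:

```python
import numpy as np

lam = np.array([5.0, 2.0, 1.0, 0.5, 0.25, 0.25])  # eigenvalues of K~_l (toy)
N, d_phi = len(lam), 2

# Optimal weight of the orthogonal-complement subspace:
# the mean of the eigenvalues discarded from the principal subspace.
rho = lam[d_phi:].sum() / (N - d_phi)
print(rho)  # 0.5
```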
The residual reconstruction error, ε²(x_j) = ‖φ̃(x_j)‖² − ∑_{e=1}^{d_φ} y_e², can easily be obtained by employing the kernel trick:

ε²(x_j) = k(x_j, x_j) − ∑_{e=1}^{d_φ} y_e²    (15)

And according to Lemma 3 in the Appendix,

y_e = φ̃(x_j)^T V_le = ⟨V_le, φ̃(x_j)⟩ = β_le^T Γ_l    (16)
where Γ_l is the j-th column of the centered projecting kernel matrix K̃_l'. Note that the centered kernel matrix K̃_l is used to obtain the eigenvalues λ_l and eigenvectors β_l, whereas the projecting kernel matrix K̃_l' is used to compute y_e as in Equation (16); neither matrix can be omitted from the training procedure. Employing all these results, we obtain the third property.

[Property 3] The Gaussian probability density function G_l(φ(x_j)|θ_l) can be approximated by Ĝ_l(φ(x_j)|θ_l) as shown in (12), where ρ, ε²(x_j) and y_e are given by (14), (15) and (16) respectively.

We stress that the approximation of G_l(φ(x_j)|θ_l) by Ĝ_l(φ(x_j)|θ_l) is complete, since it represents not only the principal subspace but also its orthogonal complement. From these three properties, we draw the following conclusions:

– The Mercer kernel trick is introduced to compute dot products in the high-dimensional feature space indirectly.
– The probability that a sample belongs to the lth component can be computed through the centered kernel matrix and the centered projecting kernel matrix, instead of through the mean vector and covariance matrix as in traditional GMM.
– There is no need to compute the full eigendecomposition of the centered kernel matrix: the Gaussian density can be approximated with only the largest d_φ principal components, and since the approximation is complete and optimal it does not depend strongly on d_φ.

With these three properties, we can formulate our kernel Gaussian Mixture Model.
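A sketch of the approximated log-density of Eq. (12), with toy inputs and the constants kept explicit (the function name and argument layout are our own, not the paper's):

```python
import numpy as np

def approx_log_density(y, lam, rho, eps2, N):
    # Log of the approximated Gaussian density of Eq. (12):
    # principal-subspace term (top d_phi eigenpairs) plus the
    # orthogonal-complement term weighted by rho.
    d_phi = len(lam)
    log_principal = -0.5 * (np.sum(y ** 2 / lam)
                            + d_phi * np.log(2 * np.pi) + np.sum(np.log(lam)))
    log_complement = -0.5 * (eps2 / rho + (N - d_phi) * np.log(2 * np.pi * rho))
    return log_principal + log_complement

# Toy numbers: projections y_e, leading eigenvalues, residual error eps^2
lam = np.array([4.0, 1.0])
y = np.array([0.5, -0.2])
rho, eps2, N = 0.5, 0.3, 10
print(approx_log_density(y, lam, rho, eps2, N))
```

Working in log space avoids the underflow that the product form of (12) would cause for large N.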
3.3 Kernel GMM and the Parameter Estimation Algorithm
In the feature space, kernel matrices replace the input-space means and covariance matrices as the representation of each Gaussian component; indeed, computing mean vectors and covariance matrices directly is intractable, because the feature space has a very high or even infinite dimension. Fortunately, with the kernel trick embedded, computing on the kernel matrix is quite feasible, since the dimension of the principal subspace is bounded by the data size N. Consequently, the parameters that kernel GMM needs to estimate are the prior probability α_l^{(t)}, the centered kernel matrix K̃_l^{(t)}, the centered projecting kernel matrix K̃_l^{(t)'}, and w_l^{(t)} (see Table 1). That is, the M-component kernel GMM is determined by the parameters θ_l^{(t)} = {α_l^{(t)}, w_l^{(t)}, K̃_l^{(t)}, K̃_l^{(t)'}}, l = 1, ..., M. According to the properties of the previous sections, the EM algorithm for parameter estimation of kGMM can be summarized as in Table 2. Assuming the number of Gaussian components is M, we initialize the posterior probabilities p_li that each sample belongs to each Gaussian component. The algorithm terminates when it converges or when a preset maximum number of iterations is reached.

Table 2. Parameter Estimation Algorithm for kGMM
Step 0. Initialize all p_li^{(0)} (l = 1, ..., M; i = 1, ..., N), set t = 0, set t_max, and set the stopping condition to false.
Step 1. While the stopping condition is false, set t = t + 1 and do Steps 2-7.
Step 2. Compute α_l^{(t)}, w_li^{(t)}, W_l^{(t)}, W_l^{(t)'}, K_l^{(t)} and K_l^{(t)'} according to the notation in Table 1.
Step 3. Compute the matrices K̃_l^{(t)}, K̃_l^{(t)'} via Property 1.
Step 4. Compute the largest d_φ eigenvalues and eigenvectors of the centered kernel matrices K̃_l^{(t)}.
Step 5. Compute Ĝ_l(φ(x_j)|θ_l^{(t)}) via Property 3.
Step 6. Compute all posterior probabilities p_li^{(t)} via (9).
Step 7. Test the stopping condition: if t > t_max or ∑_{l=1}^{M} ∑_{i=1}^{N} (p_li^{(t)} − p_li^{(t−1)})² < ε, set the stopping condition to true; otherwise loop back to Step 1.
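Putting the pieces together, a compact and deliberately simplified sketch of the Table 2 loop on a precomputed kernel matrix might look as follows; the names follow Table 1, and constants common to all components are dropped from the log-densities (this is our own illustrative implementation, not the authors' code):

```python
import numpy as np

def kgmm_em(Kf, M=2, d_phi=4, n_iter=25, seed=0):
    # Sketch of the Table 2 loop on Kf[i, j] = k(x_i, x_j).
    N = len(Kf)
    post = np.random.default_rng(seed).dirichlet(np.ones(M), size=N)
    for _ in range(n_iter):
        logd = np.zeros((N, M))
        alpha = post.mean(axis=0)                       # Step 2: alpha_l
        for l in range(M):
            w = np.sqrt(post[:, l] / (post[:, l].sum() + 1e-12))  # w_li
            W = np.outer(w, w)
            K = W * Kf                                  # weighted kernel matrix
            Kc = K - W @ K - K @ W + W @ K @ W          # Step 3 (Property 1)
            lam, beta = np.linalg.eigh(Kc)              # Step 4
            lam = np.maximum(lam[-d_phi:][::-1], 1e-10)
            beta = beta[:, -d_phi:][:, ::-1]
            # geometry of each phi(x_j) relative to the component mean
            mu_dot = Kf @ w ** 2                        # phi(x_j) . mu_l
            mumu = w ** 2 @ Kf @ w ** 2                 # mu_l . mu_l
            d2 = np.diag(Kf) - 2.0 * mu_dot + mumu      # |phi~(x_j)|^2
            # projections y_e onto unit eigenvectors, via the centered
            # projecting kernel matrix (Eq. (16))
            Kp = (Kf - mu_dot[:, None] - mu_dot[None, :] + mumu) * w[None, :]
            Y = (Kp @ beta) / np.sqrt(lam)
            eps2 = np.maximum(d2 - (Y ** 2).sum(axis=1), 0.0)     # Eq. (15)
            rho = max((np.trace(Kc) - lam.sum()) / (N - d_phi), 1e-10)
            logd[:, l] = (-0.5 * ((Y ** 2 / lam).sum(axis=1) + np.log(lam).sum())
                          - 0.5 * (eps2 / rho + (N - d_phi) * np.log(rho)))
        logp = np.log(alpha + 1e-12) + logd             # Steps 5-6
        post = np.exp(logp - logp.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post

# Two concentric circles, polynomial kernel of degree 2 (as in Fig. 1)
t = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
X = np.vstack([np.c_[np.cos(t), np.sin(t)], 3.0 * np.c_[np.cos(t), np.sin(t)]])
post = kgmm_em((X @ X.T) ** 2)
print(post.shape)  # (120, 2)
```

Note the unit-norm eigenvectors from `eigh` are rescaled by 1/√λ when projecting, so that V has unit length in the feature space.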
3.4 Discussion of the Algorithm
Computational Cost and Speedup Techniques for Large Scale Problems. With the kernel trick employed, the computational cost of kernel eigendecomposition-based methods is dominated by the eigendecomposition step, and thus depends mainly on the size of the kernel matrix, i.e. the size of the data set. If the size N is not very large (e.g. N ≤ 1,000), obtaining the full eigendecomposition is not a problem. If N is large, however, we run into the curse of dimensionality: for N > 5,000 a full eigendecomposition cannot be finished even within hours on today's fastest PCs, and in problems such as data mining N is usually very large. Fortunately, as pointed out in Section 3.2, we do not need the full eigendecomposition for the components of kGMM; we only need to estimate the largest d_φ nonzero eigenvalues and the corresponding eigenvectors, and for large scale problems we may assume d_φ ≪ N. Several techniques can be adopted to estimate the largest d_φ eigencomponents for kernel methods. The first is based on traditional Orthogonal Iteration or Lanczos Iteration. The second makes the kernel matrix sparse by sampling techniques [1]. The third applies the Nyström method to speed up kernel machines [19]. These three techniques are, however, somewhat complicated. In this paper we adopt a much simpler but practical technique proposed by Shawe-Taylor et al. [20]. It assumes that the samples forming the kernel matrix of each component are drawn from an underlying density p(x), so that the problem can be written as a continuous eigenproblem:
∫ k(x, y) p(x) β_i(x) dx = λ_i β_i(y)    (17)

where λ_i and β_i(y) are an eigenvalue and eigenfunction, and k is a given kernel function. The integral can be approximated by the Monte Carlo method using a subset of m samples {x_i}_{i=1}^{m} (m ≪ N, m ≥ d_φ) drawn according to p(x):

∫ k(x, y) p(x) β_i(x) dx ≈ (1/m) ∑_{j=1}^{m} k(x_j, y) β_i(x_j)    (18)

Plugging in y = x_k for k = 1, ..., m, we obtain a matrix eigenproblem:

(1/m) ∑_{j=1}^{m} k(x_j, x_k) V_i(x_j) = λ̂_i V_i(x_k)    (19)
where λ̂_i is the approximation of the eigenvalue λ_i. This approximation has been proved feasible and has bounded error. We apply it to our parameter estimation algorithm on large scale problems (N > 1,000): in our algorithm, the underlying density of component l is approximated by Ĝ_l(φ(x)|θ_l^{(t)}), so we sample a subset of size m according to Ĝ_l(φ(x)|θ_l^{(t)}) and perform the full eigendecomposition on this subset to obtain the largest d_φ eigencomponents. With this Monte Carlo sampling technique, both the computational cost and the memory required by the parameter estimation algorithm are greatly reduced on large scale problems, making the proposed kGMM efficient and practical.

Comparison with Related Work. Some other work is related to ours, most notably the spectral clustering algorithm [21]. Spectral clustering can be regarded as using an RBF-based kernel method to extract features and then clustering them by K-means. Compared with spectral clustering, the proposed kGMM has at least two advantages: (1) kGMM provides results in a probabilistic framework and can incorporate prior information easily; (2) kGMM can be used in supervised learning problems as a density estimation method. These advantages encourage us to apply the proposed kGMM.
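The subsampling idea of (17)-(19) can be sketched as follows; the RBF kernel, bandwidth, and sample sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))

def rbf(A, B):
    # Gaussian RBF kernel matrix, k(a, b) = exp(-|a - b|^2 / 2)
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2)

# Eigenvalues of (1/m) K on a small random subset (Eq. (19) with m samples)
idx = rng.choice(len(X), size=100, replace=False)
lam_sub = np.linalg.eigvalsh(rbf(X[idx], X[idx]) / len(idx))[::-1]

# Reference: the same scaled eigenproblem on the full sample
lam_full = np.linalg.eigvalsh(rbf(X, X) / len(X))[::-1]

# The leading eigenvalues are stable under subsampling, so the
# expensive full decomposition can be replaced by a small one.
print(np.round(lam_sub[:3], 3), np.round(lam_full[:3], 3))
```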
Misunderstanding. We emphasize a possible misunderstanding of the proposed model. One might think that simply running GMM in the reduced-dimensional space obtained by Kernel PCA would achieve the same result as kGMM: project the data onto the first d_φ dimensions of the feature space by Kernel PCA, and then perform GMM parameter estimation in that principal subspace. However, the choice of a proper d_φ is then critical, and the performance depends completely on it. If d_φ is too large, the estimation is not feasible, because probability estimation demands that the number of samples be large in comparison with the dimension d_φ, and the computational cost increases greatly at the same time. On the contrary, a small d_φ means that the estimated parameters do not represent the data well. The proposed kGMM does not have this problem, since the approximated density function is complete and optimal under the minimum Kullback-Leibler divergence criterion. Moreover, kGMM allows different components to have different d_φ. All this improves the flexibility and widens the applicability of the proposed kGMM.
4 Experiments
In this section, two experiments are performed to validate the proposed kGMM against traditional GMM. First, kGMM is employed as an unsupervised learning (clustering) method on synthetic data sets. Second, kGMM is employed as a supervised density estimation method for real-world handwritten digit recognition.

4.1 Synthetic Data Clustering Using Kernel GMM
To provide an intuitive comparison between the proposed kGMM and traditional GMM, we first conduct experiments on synthetic 2-D data sets, each with 1,000 samples, depicted in Figure 1 and Figure 2. For traditional GMM, all samples are used to estimate the parameters of a two-component mixture of Gaussians; when the algorithm stops, each sample is assigned to one of the components (clusters) according to its posterior. The clustering results of traditional GMM, shown in Figure 1(a) and Figure 2(a), are obviously not satisfactory. However, using kGMM with a polynomial kernel of degree 2, d_φ = 4 for each Gaussian component, and the same clustering scheme as traditional GMM, we achieve the promising results shown in Figure 1(b) and Figure 2(b). Moreover, kGMM provides probabilistic information, as in Figure 1(c) and Figure 2(c), which most classical kernel methods cannot provide.

Fig. 2. Data set consisting of 1,000 samples; points marked by '×' belong to one cluster and those marked by '·' to the other. (a) is the partition result of traditional GMM; (b) is the result achieved by kGMM; (c) shows the probability that each point belongs to one of the two clusters: the whiter the point, the higher the probability.

4.2 USPS Data Set Recognition Using Kernel GMM

Kernel GMM is also applied to a real-world problem, US Postal Service (USPS) handwritten digit recognition. The data set consists of 9,226 grayscale images of size 16×16, divided into a training set of 7,219 images and a test set of 2,007 images. The original input data is simply the vectorized digit image, i.e., the input feature space has dimensionality 256. Optionally, a linear discriminant analysis (LDA) can be performed to reduce the dimensionality; if LDA is performed, the feature space dimensionality becomes 39. For each category ω, a density p(x|ω) is estimated on the training set using a 4-component GMM. To classify a test sample x, we use the Bayesian decision rule
ω* = arg max_ω p(x|ω) P(ω),  ω = 1, ..., 10    (20)
where P(ω) is the prior probability of category ω. In this experiment we set P(ω) = 1/10, i.e., all categories have equal prior probability. For comparison, kGMM adopts the same experimental scheme as traditional GMM, except that it uses an RBF kernel k(x, x') = exp(−γ‖x − x'‖²) with γ = 0.0015, 2 Gaussian mixture components, and d_φ = 40 for each component. The results of GMM in the original input space, GMM in the LDA-reduced space (from [4]), and kGMM are shown in Table 3. kGMM, with fewer components, clearly outperforms traditional GMM both with and without LDA. Although the kGMM result is not the state-of-the-art result on USPS, it can be improved further by incorporating invariance prior knowledge using the tangent distance, as in [4].
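The decision rule (20) reduces to an argmax over per-class densities; a sketch with stand-in log-densities (in the experiment these would come from the per-class kGMM models):

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in log-densities log p(x|w) for 3 test samples and 10 classes.
log_px_given_w = rng.normal(size=(3, 10))
prior = np.full(10, 0.1)                       # P(w) = 1/10 for all classes

# Eq. (20): choose the class maximising p(x|w) P(w); in log space a
# uniform prior adds a constant and does not change the argmax.
pred = np.argmax(log_px_given_w + np.log(prior), axis=1)
print(pred)
```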
Table 3. Comparison results on USPS data set
Method       Best Error Rate
GMM          8.0%
LDA+GMM      6.7%
Kernel GMM   4.3%

5 Conclusion
In this paper we have presented a kernel Gaussian Mixture Model and derived a parameter estimation algorithm for it by embedding the kernel trick into the EM algorithm. Furthermore, we adopt a Monte Carlo sampling technique to speed kGMM up on large scale problems, making it more practical and efficient. Compared with most classical kernel methods, kGMM can solve problems in a probabilistic framework; moreover, it tackles nonlinear problems better than the traditional GMM. Experimental results on synthetic and real-world data sets show that the proposed approach performs satisfactorily. Our future work will focus on incorporating prior knowledge, such as invariance, into kGMM and on enriching its applications.

Acknowledgements. The authors would like to thank the anonymous reviewers for their helpful comments, and Jason Xu for helpful conversations about this work.
References

1. Achlioptas, D., McSherry, F. and Schölkopf, B.: Sampling techniques for kernel methods. In: Advances in Neural Information Processing Systems (NIPS) 14. MIT Press, Cambridge, MA (2002)
2. Bilmes, J. A.: A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, UC Berkeley (1997)
3. Bishop, C. M.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
4. Dahmen, J., Keysers, D., Ney, H. and Güld, M. O.: Statistical Image Object Recognition using Mixture Densities. Journal of Mathematical Imaging and Vision, 14(3) (2001) 285-296
5. Duda, R. O., Hart, P. E. and Stork, D. G.: Pattern Classification, 2nd Edition. John Wiley & Sons, New York (2001)
6. Everitt, B. S.: An Introduction to Latent Variable Models. Chapman and Hall, London (1984)
7. Bach, F. R. and Jordan, M. I.: Kernel Independent Component Analysis. Journal of Machine Learning Research, 3 (2002) 1-48
8. Gestel, T. V., Suykens, J. A. K., Lanckriet, G., Lambrechts, A., De Moor, B. and Vandewalle, J.: Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis. Neural Computation, 15(5) (2002) 1115-1148
9. Herbrich, R., Graepel, T. and Campbell, C.: Bayes Point Machines: Estimating the Bayes Point in Kernel Space. In: Proceedings of the IJCAI Workshop on Support Vector Machines (1999) 23-27
10. Kwok, J. T.: The Evidence Framework Applied to Support Vector Machines. IEEE Transactions on Neural Networks, 11 (2000) 1162-1173
11. Mika, S., Rätsch, G., Weston, J., Schölkopf, B. and Müller, K. R.: Fisher discriminant analysis with kernels. In: IEEE Workshop on Neural Networks for Signal Processing IX (1999) 41-48
12. Mjolsness, E. and DeCoste, D.: Machine Learning for Science: State of the Art and Future Prospects. Science, 293 (2001)
13. Roberts, S. J.: Parametric and Non-Parametric Unsupervised Cluster Analysis. Pattern Recognition, 30(2) (1997) 261-272
14. Schölkopf, B., Smola, A. J. and Müller, K. R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5) (1998) 1299-1319
15. Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K. R., Rätsch, G. and Smola, A.: Input Space vs. Feature Space in Kernel-Based Methods. IEEE Transactions on Neural Networks, 10(5) (1999) 1000-1017
16. Schölkopf, B. and Smola, A. J.: Learning with Kernels: Support Vector Machines, Regularization and Beyond. MIT Press, Cambridge, MA (2002)
17. Tipping, M. E.: Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research (2001)
18. Vapnik, V.: The Nature of Statistical Learning Theory, 2nd Edition. Springer-Verlag, New York (1997)
19. Williams, C. and Seeger, M.: Using the Nyström Method to Speed Up Kernel Machines. In: Leen, T. K., Dietterich, T. G. and Tresp, V. (Eds.): Advances in Neural Information Processing Systems (NIPS) 13. MIT Press, Cambridge, MA (2001)
20. Shawe-Taylor, J., Williams, C., Cristianini, N. and Kandola, J.: On the Eigenspectrum of the Gram Matrix and Its Relationship to the Operator Eigenspectrum. In: Cesa-Bianchi, N. et al. (Eds.): ALT 2002, LNAI 2533, Springer-Verlag, Berlin Heidelberg (2002) 23-40
21. Ng, A. Y., Jordan, M. I. and Weiss, Y.: On Spectral Clustering: Analysis and an Algorithm. In: Advances in Neural Information Processing Systems (NIPS) 14. MIT Press, Cambridge, MA (2002)
22. Moghaddam, B. and Pentland, A.: Probabilistic visual learning for object representation. IEEE Transactions on PAMI, 19(7) (1997) 696-710
Appendix

For convenience, the subscripts denoting Gaussian components and the superscripts denoting iterations are omitted in this part.

[Lemma 1] Suppose φ(·) is a mapping function satisfying Mercer's conditions, as in Equation (1) of Section 2, with N training samples X = {x_i}_{i=1}^{N}, and let w = [w_1, ..., w_N]^T ∈ R^N be an N-dimensional column vector. Define µ = ∑_{i=1}^{N} φ(x_i) w_i² and φ̃(x_i) = φ(x_i) − µ. Then the following hold:

(1) If K is the N × N kernel matrix with K_ij = ⟨w_i φ(x_i), w_j φ(x_j)⟩, and K̃ is the N × N matrix centered in the feature space, with K̃_ij = ⟨w_i φ̃(x_i), w_j φ̃(x_j)⟩, then

K̃ = K − WK − KW + WKW    (a1)

where W = ww^T.

(2) If K' is the N × N projecting kernel matrix with K'_ij = ⟨φ(x_i), w_j φ(x_j)⟩, and K̃' is the N × N matrix centered in the feature space, with K̃'_ij = ⟨φ̃(x_i), w_j φ̃(x_j)⟩, then

K̃' = K' − W'K − K'W + W'KW    (a2)

where W' = 1_N w^T, and 1_N is an N-dimensional column vector with all entries equal to 1.

Proof: (1)

K̃_ij = ⟨w_i φ̃(x_i), w_j φ̃(x_j)⟩
     = w_i ( φ(x_i) − ∑_{k=1}^{N} φ(x_k) w_k² )^T ( φ(x_j) − ∑_{k=1}^{N} φ(x_k) w_k² ) w_j
     = K_ij − w_i ∑_{k=1}^{N} w_k K_kj − ∑_{k=1}^{N} w_k K_ik w_j + w_i ( ∑_{k=1}^{N} ∑_{n=1}^{N} w_k K_kn w_n ) w_j
from which we obtain the compact expression (a1). (2) is proved similarly.

[Lemma 2] Suppose Σ is a covariance matrix such that

Σ = ∑_{i=1}^{N} φ̃(x_i) φ̃(x_i)^T w_i²    (a3)

Then the following hold:
(1) ΣV = λV ⇔ K̃β = λβ.
(2) V = ∑_{i=1}^{N} β_i w_i φ̃(x_i).

Proof: (1) First we prove "⇒". If ΣV = λV, then the solution V lies in the space spanned by w_1 φ̃(x_1), ..., w_N φ̃(x_N), which has two useful consequences: first, we may consider the equivalent equations

λ ⟨w_k φ̃(x_k), V⟩ = ⟨w_k φ̃(x_k), ΣV⟩,  for all k = 1, ..., N    (a4)

and second, there exist coefficients β_i (i = 1, ..., N) such that

V = ∑_{i=1}^{N} β_i w_i φ̃(x_i)    (a5)

Combining (a4) and (a5), we get

λ ∑_{i=1}^{N} β_i ⟨w_k φ̃(x_k), w_i φ̃(x_i)⟩ = ∑_{i=1}^{N} β_i ⟨w_k φ̃(x_k), ∑_{j=1}^{N} w_j φ̃(x_j) ⟨w_j φ̃(x_j), w_i φ̃(x_i)⟩⟩

which reads

λ K̃β = K̃²β

where β = [β_1, ..., β_N]^T. Since K̃ is a symmetric matrix, it has a set of eigenvectors spanning the whole space; thus

λβ = K̃β    (a6)

The direction "⇐" is proved similarly, and (2) is easy to prove, so the proof is omitted.

[Lemma 3] If x ∈ R^d is a sample, with φ̃(x) = φ(x) − µ, then

V · φ̃(x_j) = ∑_{i=1}^{N} β_i w_i φ̃(x_i) · φ̃(x_j) = β^T Γ    (a7)

where Γ is the j-th column of the centered projecting matrix K̃'.

Proof: This lemma follows from (a5).
Efficiently Learning the Metric with Side-Information Tijl De Bie1 , Michinari Momma2 , and Nello Cristianini3 1
2
Department of Electrical Engineering ESAT-SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium [email protected] Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180, USA [email protected] 3 Department of Statistics, University of California, Davis Davis, CA 95616, USA [email protected]
Abstract. A crucial problem in machine learning is to choose an appropriate representation of data, in a way that emphasizes the relations we are interested in. In many cases this amounts to finding a suitable metric in the data space. In the supervised case, Linear Discriminant Analysis (LDA) can be used to find an appropriate subspace in which the data structure is apparent. Other ways to learn a suitable metric are found in [6] and [11]. However recently significant attention has been devoted to the problem of learning a metric in the semi-supervised case. In particular the work by Xing et al. [15] has demonstrated how semi-definite programming (SDP) can be used to directly learn a distance measure that satisfies constraints in the form of side-information. They obtain a significant increase in clustering performance with the new representation. The approach is very interesting, however, the computational complexity of the method severely limits its applicability to real machine learning tasks. In this paper we present an alternative solution for dealing with the problem of incorporating side-information. This side-information specifies pairs of examples belonging to the same class. The approach is based on LDA, and is solved by the efficient eigenproblem. The performance reached is very similar, but the complexity is only O(d3 ) instead of O(d6 ) where d is the dimensionality of the data. We also show how our method can be extended to deal with more general types of side-information.
1 Introduction
Machine learning algorithms rely to a large extent on the availability of a good representation of the data, which is often the result of human design choices. More specifically, a ‘suitable’ distance measure between data items needs to be specified, so that a meaningful notion of ‘similarity’ is induced. The notion of ‘suitable’ is inevitably task dependent, since the same data might need very different representations for different learning tasks.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 175–189, 2003. © Springer-Verlag Berlin Heidelberg 2003
This means that automating the task of choosing a representation will necessarily require utilization of some type of information (e.g. some of the labels, or less refined forms of information about the task at hand). Labels may be too expensive, while a less refined and more readily available source of information (known as side-information) can be used. For example, one may want to define a metric over the space of movie descriptions, using data about customer associations (such as sets of movies liked by the same customer, as in [9]) as side-information. This type of side-information is commonplace in marketing data, recommendation systems, bioinformatics and web data. Many recent papers have dealt with these and related problems; some by imposing extra constraints without learning a metric, as in the constrained K-means algorithm [5], others by learning a metric implicitly, as in [9] and [13], or explicitly, as in [15]. In particular, [15] provides a conceptually elegant algorithm based on semi-definite programming (SDP) for learning the metric in the data space based on side-information, an algorithm that unfortunately has complexity O(d⁶) for d-dimensional data¹. In this paper we present an algorithm for the problem of finding a suitable metric, using side-information that consists of n example pairs (x_i^(1), x_i^(2)), i = 1, …, n, belonging to the same but unknown class. Furthermore, we place our algorithm in a general framework in which the methods described in [14] and [13] would also fit. More specifically, we show how these methods can all be related to Linear Discriminant Analysis (LDA, see [8] or [7]). For reference, we will first give a brief review of LDA. Next we show how our method can be derived as an approximation of LDA in case only side-information is available. Furthermore, we provide a derivation similar to the one in [15] in order to show the correspondence between the two approaches.
Empirical results include a toy example, and UCI data sets also used in [15].

Notation. All vectors are assumed to be column vectors. I_d denotes the identity matrix of dimension d. By 0 we denote a matrix or a vector of appropriate size containing all zero elements; 1 is a vector of appropriate dimension containing all ones. A prime denotes a transpose. To denote the side-information consisting of n pairs (x_i^(1), x_i^(2)) for which it is known that x_i^(1), x_i^(2) ∈ R^d belong to the same class, we use the matrices X^(1) ∈ R^{n×d} and X^(2) ∈ R^{n×d}. These contain x_i^(1)' and x_i^(2)' as their i-th rows:

X^(1) = [x_1^(1)' ; x_2^(1)' ; … ; x_n^(1)']  and  X^(2) = [x_1^(2)' ; x_2^(2)' ; … ; x_n^(2)'].

This means that for any i = 1, …, n, it is
¹ The authors of [15] see this problem, and they try to circumvent it by developing a gradient descent algorithm instead of using standard Newton algorithms for solving SDP problems. However, this may lead to convergence problems, especially for data sets in high-dimensional spaces.
known that the samples at the i-th rows of X^(1) and X^(2) belong to the same class. For ease of notation (but without loss of generality) we will construct the full data matrix² X ∈ R^{2n×d} as X = [X^(1) ; X^(2)]. When we want to denote the sample corresponding to the i-th row of X without regarding the side-information, it is denoted as x_i ∈ R^d (without superscript, and i = 1, …, 2n). The data matrix should be centered, that is 1'X = 0 (the mean of each column is zero). We use w ∈ R^d to denote a weight vector in this d-dimensional data space. Although the labels for the samples are not known in our problem setting, we will consider the label matrix Z ∈ R^{2n×c} corresponding to X in our derivations. (The number of classes is denoted by c.) It is defined via Z̃ (where Z̃_{i,j} indicates the element at row i and column j):

Z̃_{i,j} = 1 when the class of the sample x_i is j, and 0 otherwise,

followed by a centering to make all column sums equal to zero: Z = Z̃ − (1/2n) 1 1' Z̃. We use w_Z ∈ R^c to denote a weight vector in the c-dimensional label space. The matrices C_ZX = C_XZ' = Z'X, C_ZZ = Z'Z, C_XX = X'X are called total scatter matrices of X or Z with X or Z. The total scatter matrices for the subset data matrices X^(k), k = 1, 2, are indexed by integers: C_kl = X^(k)' X^(l). Again, if the labels were known, we could identify the sets C_i = {all x_j in class i}. Then we could also compute the following quantities for the samples in X: the number of samples in each class, n_i = |C_i|; the class means m_i = (1/n_i) Σ_{j: x_j ∈ C_i} x_j; the between-class scatter matrix C_B = Σ_{i=1}^c n_i m_i m_i'; and the within-class scatter matrix C_W = Σ_{i=1}^c Σ_{j: x_j ∈ C_i} (x_j − m_i)(x_j − m_i)'. Since the labels are not known in our problem setting, we will only use these quantities in our derivations, not in our final results.
2 Learning the Metric

In this section, we will show how the LDA formulation, which requires labels, can be adapted for cases where no labels but only side-information is available. The resulting formulation can be seen as an approximation of LDA with labels available. This will lead to an efficient algorithm to learn a metric: given the side-information, solving just a generalized eigenproblem is sufficient to maximize the expected separation between the clusters.
² In all derivations, the only data samples involved are the ones that appear in the side-information. It is not until the empirical results section that data not involved in the side-information is dealt with: the side-information is used to learn the metric, and only subsequently is this metric used to cluster any other available sample. We also assume no sample appears twice in the side-information.
2.1 Motivation
Canonical Correlation Analysis (CCA) Formulation of Linear Discriminant Analysis (LDA) for Classification. When given a data matrix X and a label matrix Z, LDA [8] provides a way to find a projection of the data that has the largest ratio of between-class variance to within-class variance. This can be formulated as a maximization problem of the Rayleigh quotient ρ(w) = (w' C_B w) / (w' C_W w). In the optimum, ∇_w ρ = 0, and w is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem C_B w = ρ C_W w. Furthermore, it is shown that LDA can also be computed by performing CCA between the data and the label matrix ([3],[2],[12]). In other words, LDA maximizes the correlation between a projection of the coordinates of the data points and a projection of their class labels. This means the following CCA generalized eigenvalue problem formulation can be used:

[ 0      C_XZ ] [ w   ]       [ C_XX   0    ] [ w   ]
[ C_ZX   0    ] [ w_Z ]  = λ  [ 0      C_ZZ ] [ w_Z ]

The optimization problem corresponding to CCA is (as shown in e.g. [4]):
max_{w, w_Z}  w' X' Z w_Z    s.t.  ||Xw||² = 1  and  ||Z w_Z||² = 1    (1)
This formulation of LDA will be the starting point for our derivations.

Maximizing the Expected LDA Cost Function. In the problem setting at hand, however, we do not know the label matrix Z. Thus we cannot perform LDA in its basic form. However, the side-information that given pairs of samples (x_i^(1), x_i^(2)) belong to the same class (and thus have the same, but unknown, label) is available. (This side-information is further denoted by splitting X into two matrices X^(1) and X^(2) as defined in the notation paragraph.) Using a parameterization of the label matrix Z that explicitly realizes the constraints given by the side-information, we derive a cost function that is equivalent to the LDA cost function but written in terms of this parameterization. Then we maximize the expected value of this LDA cost function, where the expectation is taken over these parameters under a reasonable symmetry assumption. The derivation can be found in Appendix A. Furthermore, it is shown in Appendix A that this expected LDA cost function is maximized by solving for the dominant eigenvector of:
(2)
where Ckl = X(k) X(l) . In Appendix B we provide an alternative derivation leading to the same eigenvalue problem. This derivation is based on a cost function that is close to the cost function used in [15].
2.2 Interpretation and Dimensionality Selection
Interpretation. Given the eigenvector w, the corresponding eigenvalue λ is equal to (w'(C_12 + C_21)w) / (w'(C_11 + C_22)w). The numerator w'(C_12 + C_21)w is twice the covariance of the projections X^(1)w with X^(2)w (up to a factor equal to the number of samples in X^(k)). The denominator normalizes with the sum of their variances (up to the same factor). This means λ is very close to the correlation between X^(1)w and X^(2)w (it becomes equal to their correlation when the variances of X^(1)w and X^(2)w are equal, which will often be close to true as both X^(1) and X^(2) are drawn from the same distribution). This makes sense: we want Xw = [X^(1)w ; X^(2)w], and thus both X^(1)w and X^(2)w, to be strongly correlated with a projection Z w_Z of their (common but unknown) labels in Z on w_Z (see equation (1); this is what we actually wanted to optimize, but could not do exactly since Z is not known). Now, when we want X^(1)w and X^(2)w to be strongly correlated with the same labels, they necessarily have to be strongly correlated with each other.

Some of the eigenvalues may be negative, however. This means that along these eigenvectors, samples that should be co-clustered according to the side-information are anti-correlated. This can only be caused by features in the data that are irrelevant for the clustering problem at hand (which can be seen as noise).

Dimensionality Selection. As with LDA, one will generally use not only the dominant eigenvector, but a dominant eigenspace to project the data on. The number of eigenvectors used should depend on the signal-to-noise ratio along these components: when it is too low, noise effects will cause poor performance of a subsequent clustering. So we need to make an estimate of the noise level.
This is provided by the negative eigenvalues: they allow us to make a good estimate of the noise level present in the data, thus motivating the strategy adopted in this paper: only retain the k directions corresponding to eigenvalues larger than the largest absolute value of the negative eigenvalues.

2.3 The Metric Corresponding to the Subspace Used
Since we will project the data onto the k dominant eigenvectors w, this finally boils down to using the distance measure
d²(x_i, x_j) = (W'(x_i − x_j))' (W'(x_i − x_j)) = ||x_i − x_j||²_{WW'},

where W is the matrix containing the k eigenvectors as its columns. Normalization of the different eigenvectors could be done so as to make the variance equal to 1 along each of the directions. However, as can be understood from the interpretation in 2.2, a better separation can be expected along directions with a high eigenvalue λ. Therefore, we applied the heuristic of scaling each of the eigenvectors by multiplying them with their corresponding eigenvalue. In doing so, a subsequent clustering like K-means will preferentially find cluster separations orthogonal to directions that will probably separate well (which is desirable).

2.4 Computational Complexity
Operations to be carried out in this algorithm are the computation of the d × d scatter matrices and the solution of a symmetric generalized eigenvalue problem of size d. The computational complexity of this problem is thus O(d³). Since the approach in [15] is basically an SDP with d² parameters, its complexity is O(d⁶). Thus a massive speedup can be achieved.
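Putting Sections 2.2–2.4 together, the full metric construction can be sketched as follows (again my own illustration, with function names of my choosing): solve problem (2), keep only eigenvalues exceeding the largest absolute negative eigenvalue, scale each kept eigenvector by its eigenvalue, and evaluate distances in the resulting metric.

```python
import numpy as np
from scipy.linalg import eigh

def side_info_metric(X1, X2):
    """Return the transformation W of Section 2.3: eigenvectors of
    problem (2), kept only if their eigenvalue exceeds the noise-level
    estimate (largest |negative eigenvalue|), scaled by eigenvalue."""
    X = np.vstack([X1, X2])
    X = X - X.mean(axis=0)
    X1c, X2c = X[: len(X1)], X[len(X1):]
    C12 = X1c.T @ X2c
    A = C12 + C12.T
    B = X1c.T @ X1c + X2c.T @ X2c + 1e-8 * np.eye(X1.shape[1])
    evals, evecs = eigh(A, B)
    evals, evecs = evals[::-1], evecs[:, ::-1]   # descending
    noise = max(-evals.min(), 0.0)               # largest |negative eigenvalue|
    keep = evals > noise
    return evecs[:, keep] * evals[keep]          # scale columns by eigenvalue

def squared_distances(W, X):
    """Pairwise d^2(x_i, x_j) = ||x_i - x_j||^2_{W W'}."""
    P = X @ W
    sq = (P ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * P @ P.T
```

The distance matrix produced this way can be fed directly to any clustering routine (e.g. K-means on the projected data X @ W), as done in the empirical results section.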
3 Remarks

3.1 Relation with Existing Literature
Actually, X^(1) and X^(2) do not have to belong to the same space; they can be of a different kind: it is sufficient that corresponding samples in X^(1) and X^(2) belong to the same class in order to do something similar as above. Of course, we then need different weight vectors in the two spaces: w^(1) and w^(2). Following a similar reasoning as above, in Appendix C we provide an argument that solving the CCA eigenproblem
[ 0      C_12 ] [ w^(1) ]       [ C_11   0    ] [ w^(1) ]
[ C_21   0    ] [ w^(2) ]  = λ  [ 0      C_22 ] [ w^(2) ]
is closely related to LDA as well. This is exactly what is done in [14] and [13] (in both papers, in a kernel-induced feature space).

3.2 More General Types of Side-Information
Using similar approaches, more general types of side-information may also be utilized. We will only briefly mention them:

– The groups of samples that are known to belong to the same class may be larger than 2 (let us call them X^(k) again, but now k is not restricted to only 1 or 2). This can be handled very analogously to our previous derivation. Therefore we just state the resulting generalized eigenvalue problem:

(Σ_k X^(k))' (Σ_k X^(k)) w = λ (Σ_k X^(k)' X^(k)) w
– Also in the case where we are dealing with more than 2 data sets of a different nature (e.g., analogous to [14], we could have more than 2 data sets, each consisting of a text corpus in a different language), but for which it is known that corresponding samples belong to the same class (as described in the previous subsection), the problem is easily shown to reduce to the extension of CCA towards more data spaces, as is e.g. used in [1]. Space restrictions do not permit us to go into this.

– It is possible to keep this approach completely general, allowing for any type of side-information in the form of constraints that express, for any number of samples, that they belong to the same class or, on the contrary, do not belong to the same class. Knowledge of some of the labels can also be exploited. To do this, we have to use a different parameterization for Z than the one used in this paper. In principle, any prior distribution on the parameters can also be taken into account. However, sampling techniques will be necessary to estimate the expected value of the LDA cost function in these cases. We will not go into this in the current paper.
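For the first generalization (groups of more than two co-labeled samples), a minimal sketch of the stated eigenproblem follows; the function name, the centering of the stacked data, and the small ridge are my own additions:

```python
import numpy as np
from scipy.linalg import eigh

def group_side_info_directions(groups):
    """Solve (sum_k X^(k))' (sum_k X^(k)) w = lambda (sum_k X^(k)' X^(k)) w
    for a list of n x d matrices whose i-th rows share an (unknown) class."""
    n = groups[0].shape[0]
    X = np.vstack(groups)
    X = X - X.mean(axis=0)                   # center the stacked data
    parts = [X[i * n:(i + 1) * n] for i in range(len(groups))]
    S = sum(parts)                           # sum_k X^(k)
    A = S.T @ S
    B = sum(Xk.T @ Xk for Xk in parts)       # sum_k C_kk
    B = B + 1e-8 * np.eye(B.shape[0])
    evals, evecs = eigh(A, B)
    return evals[::-1], evecs[:, ::-1]
```

Note that for two groups, A = B + (C_12 + C_21), so the eigenvectors coincide with those of problem (2) (with eigenvalues shifted by one), consistent with the expected cost function derived in Appendix A.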
3.3 The Dual Eigenproblem
As a last remark, the dual or kernelized version of the generalized eigenvalue problem can be derived as follows. The solution w can be expressed in the form w = [X^(1) ; X^(2)]' α, where α ∈ R^{2n} is a vector containing the dual variables.
Now, with Gram matrices K_kl = X^(k) X^(l)', and after introducing the notation

G_1 = [K_11 ; K_21]  and  G_2 = [K_12 ; K_22],

the α's corresponding to the weight vectors w are found as the generalized eigenvectors of

(G_1 G_2' + G_2 G_1') α = λ (G_1 G_1' + G_2 G_2') α.

This suggests that it will be possible to extend the approach to learning nonlinear metrics with side-information as well.
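Since G_1 and G_2 are 2n × n, the operator products above are formed with transposes so that both sides act on α ∈ R^{2n}. A sketch of the dual solver (my own illustration; the ridge is added because G_1 G_1' + G_2 G_2' has rank at most n and is therefore singular):

```python
import numpy as np
from scipy.linalg import eigh

def dual_metric_directions(K11, K12, K22):
    """Dual (kernelized) problem with Gram matrices K_kl = X^(k) X^(l)'.
    G1 = [K11; K21] and G2 = [K12; K22] are 2n x n; the products are
    formed as G1 G2' etc. so the operators are 2n x 2n."""
    K21 = K12.T
    G1 = np.vstack([K11, K21])
    G2 = np.vstack([K12, K22])
    A = G1 @ G2.T + G2 @ G1.T
    B = G1 @ G1.T + G2 @ G2.T
    # Ridge regularization: B has rank at most n, so it must be lifted
    # to positive definiteness before calling the generalized solver.
    B = B + 1e-6 * np.trace(B) / B.shape[0] * np.eye(B.shape[0])
    evals, alphas = eigh(A, B)
    return evals[::-1], alphas[:, ::-1]
```

With a linear kernel K_kl = X^(k) X^(l)', the nonzero generalized eigenvalues agree (up to the ridge perturbation) with those of the primal problem (2); replacing the Gram matrices with those of a nonlinear kernel yields a nonlinear metric.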
4 Empirical Results
The empirical results reported in this paper are for clustering problems with the type of side-information described above. Thus, with our method we learn a suitable metric based on a set of samples for which the side-information is known, i.e. X^(1) and X^(2). Subsequently, a K-means clustering of all samples (including those that are not in X^(1) or X^(2)) is performed, making use of the metric that is learned.
4.1 Evaluation of Clustering Performance
We use the same measure of accuracy as is used in [15]: defining I(x_i, x_j) as the function that is 1 when x_i and x_j are clustered in the same cluster by the algorithm (and 0 otherwise),

Acc = (1/2) [ ( Σ_k Σ_{i, j>i: x_i, x_j ∈ C_k} I(x_i, x_j) ) / ( Σ_k Σ_{i, j>i: x_i, x_j ∈ C_k} 1 )  +  ( Σ_{i, j>i: ¬∃k: x_i, x_j ∈ C_k} (1 − I(x_i, x_j)) ) / ( Σ_{i, j>i: ¬∃k: x_i, x_j ∈ C_k} 1 ) ].
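This pair-counting measure is straightforward to compute; a small sketch (function name mine) that averages the fraction of same-class pairs placed together and the fraction of different-class pairs placed apart:

```python
import numpy as np

def clustering_accuracy(true_labels, cluster_labels):
    """Pair-counting accuracy as in [15]: average of (i) the fraction of
    same-class pairs put in one cluster and (ii) the fraction of
    different-class pairs put in different clusters."""
    y = np.asarray(true_labels)
    z = np.asarray(cluster_labels)
    iu = np.triu_indices(len(y), k=1)                   # all pairs i < j
    same_class = (y[:, None] == y[None, :])[iu]
    same_cluster = (z[:, None] == z[None, :])[iu]
    acc_same = same_cluster[same_class].mean()
    acc_diff = (~same_cluster[~same_class]).mean()
    return 0.5 * (acc_same + acc_diff)
```

Because the measure only compares pairs, it is invariant to a relabeling of the clusters, which is exactly what is needed when evaluating an unsupervised method.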
4.2 Regularization
To deal with inaccuracies, numerical instabilities and the influence of finite sample size, we apply regularization to the generalized eigenvalue problem. This is done in the same spirit as for CCA in [1], namely by adding a diagonal to the scatter matrices C_11 and C_22. This is justified thanks to the CCA-based derivation of our algorithm. To train the regularization parameter, a cost function described below is minimized via 10-fold cross-validation.

In choosing the right regularization parameter, there are two things to consider. Firstly, we want the clustering to be good: the side-information should be reflected as well as possible by the clustering. Secondly, we want this clustering to be informative: we do not want one very large cluster with all others very small (satisfying the side-information would then be too easy). Therefore, the cross-validation cost minimized here is the probability of the measured performance on the test-set side-information, given the sizes of the clusters found. (More exactly, we maximized the difference between this performance and its expected value, divided by its standard deviation.) This approach incorporates both considerations in a natural way.
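Concretely, the regularized problem just adds γI to C_11 and C_22 before solving (2). A sketch (mine; in practice γ would be chosen by the cross-validation criterion described above):

```python
import numpy as np
from scipy.linalg import eigh

def regularized_directions(X1, X2, gamma):
    """Problem (2) with a diagonal gamma*I added to the scatter
    matrices C_11 and C_22, in the spirit of regularized CCA [1]."""
    X = np.vstack([X1, X2])
    X = X - X.mean(axis=0)
    X1c, X2c = X[: len(X1)], X[len(X1):]
    d = X1.shape[1]
    C11 = X1c.T @ X1c + gamma * np.eye(d)
    C22 = X2c.T @ X2c + gamma * np.eye(d)
    C12 = X1c.T @ X2c
    evals, evecs = eigh(C12 + C12.T, C11 + C22)
    return evals[::-1], evecs[:, ::-1]
```

Increasing γ shrinks the eigenvalues toward zero, damping directions that are supported only by a few sample pairs.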
4.3 Performance on a Toy Data Set
The effectiveness of the method is illustrated using a toy example, in which each of the clusters consists of two parts lying far apart (Figure 1). Standard K-means has an accuracy of 0.50 on this data set, while the method developed here gives an accuracy of 0.92.

4.4 Performance on Some UCI Data Sets
The empirical results on some UCI data sets, reported in Table 1, are comparable to the results in [15]. The first column contains the K-means clustering accuracy without any side-information or preprocessing, averaged over 30 different initial conditions. In the second column, results are given for a small amount of side-information leaving 90 percent of the connected components³; in the third column, for a large amount of

³ We use the notion of connected component as defined in [15]. That is, for given side-information, a set of samples makes up one connected component if between each pair of samples in this set there exists a path via edges corresponding to pairs given in the side-information. With no side-information given, the number of connected components is thus equal to the total number of samples.
Fig. 1. A toy example whereby the two clusters each consist of two distinct clouds of samples that are widely separated. Ordinary K-means obviously has a very low accuracy of 0.5, whereas when some side-information is taken into account as described in this paper, the performance goes up to 0.92.
side-information leaving 70 percent of the connected components. For these two columns, averages over 30 randomizations are shown. The side-information is generated by randomly picking pairs of samples belonging to the same cluster. The number between brackets indicates the standard deviation over these 30 randomizations. Table 2 contains the accuracy on the UCI wine data set and on the protein data set, for different amounts of side-information. To quantify the amount of side-information, we used (as in [15]) the number of pairs in the side-information divided by the total number of pairs of samples belonging to the same class (the ratio of constraints). These results are comparable with those reported in [15]. As in [15], constrained K-means [5] allows for a further improvement. (It is important to note that constrained K-means by itself does not learn a metric, i.e., the side-information is not used for learning which directions in the data space are important in the clustering process. Rather, it imposes constraints ensuring that the clustering result does not contradict the side-information.)
5 Conclusions
Finding a good representation of the data is of crucial importance in many machine learning tasks. However, without any assumptions or side-information, there is no way to find the ‘right’ metric for the data. We thus presented a way
Table 1. Accuracies on UCI data sets, for different numbers of connected components. (The more side-information, the fewer connected components. The fraction f is the number of connected components divided by the total number of samples.)

Data set        f = 1          f = 0.9        f = 0.7
wine            0.69 (0.00)    0.92 (0.05)    0.95 (0.03)
protein         0.62 (0.02)    0.71 (0.04)    0.72 (0.06)
ionosphere      0.58 (0.02)    0.69 (0.09)    0.75 (0.05)
diabetes        0.56 (0.02)    0.60 (0.02)    0.61 (0.02)
balance         0.56 (0.02)    0.66 (0.01)    0.67 (0.03)
iris            0.83 (0.06)    0.92 (0.03)    0.92 (0.04)
soy             0.80 (0.08)    0.85 (0.09)    0.91 (0.1)
breast cancer   0.83 (0.01)    0.89 (0.02)    0.91 (0.02)
Table 2. Accuracies on the wine and the protein data sets, as a function of the ratio of constraints.

ratio of constr.  accuracy for wine    ratio of constr.  accuracy for protein
0                 0.69 (0.00)          0                 0.62 (0.03)
0.0015            0.73 (0.08)          0.012             0.59 (0.04)
0.0023            0.78 (0.11)          0.019             0.60 (0.05)
0.0034            0.87 (0.08)          0.028             0.62 (0.04)
0.0051            0.91 (0.05)          0.041             0.67 (0.05)
0.0075            0.93 (0.05)          0.060             0.71 (0.05)
0.011             0.96 (0.05)          0.099             0.75 (0.05)
0.017             0.97 (0.017)         0.14              0.77 (0.05)
0.025             0.97 (0.018)         0.21              0.79 (0.06)
0.037             0.98 (0.015)         0.31              0.78 (0.07)
to learn an appropriate metric based on examples of co-clustered pairs of points. This type of side-information is often much less expensive or easier to obtain than full label information. The proposed method is justified in two ways: as a maximization of the expected value of a Rayleigh quotient corresponding to LDA, and via a second derivation showing connections with previous work on this type of problem. The result is a very efficient algorithm, much faster than, while showing similar performance to, the algorithm derived in [15]. Importantly, the method is put in a more general context, showing it is only one example of a broad class of algorithms that are able to incorporate different forms of side-information. It is pointed out how the method can be extended to deal with essentially any kind of side-information. Furthermore, the result of the algorithm presented here is a lower-dimensional representation of the data, just as in other dimensionality reduction methods such as PCA (Principal Component Analysis), PLS (Partial Least Squares),
CCA and LDA, which try to identify interesting subspaces for a given task. This often comes as an advantage, since algorithms like K-means and constrained K-means will run faster on lower-dimensional data.

Acknowledgements. TDB is a Research Assistant with the Fund for Scientific Research - Flanders (F.W.O.-Vlaanderen). MM is supported by the NSF grant IIS-9979860. This paper was written during a scientific visit of TDB and MM at U.C. Davis. The authors would like to thank Roman Rosipal, Pieter Abbeel and Eric Xing for useful discussions.
References

1. F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
2. M. Barker and W. S. Rayens. Partial least squares for discrimination. Journal of Chemometrics, 17:166–173, 2003.
3. M. S. Bartlett. Further aspects of the theory of multiple regression. Proc. Camb. Philos. Soc., 34:33–40, 1938.
4. M. Borga, T. Landelius, and H. Knutsson. A Unified Approach to PCA, PLS, MLR and CCA. Report LiTH-ISY-R-1992, ISY, SE-581 83 Linköping, Sweden, November 1997.
5. P. Bradley, K. Bennett, and A. Demiriz. Constrained K-means clustering. Technical Report MSR-TR-2000-65, Microsoft Research, 2000.
6. N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
7. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2nd edition, 2000.
8. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II:179–188, 1936.
9. T. Hofmann. What people don't want. In European Conference on Machine Learning (ECML), 2002.
10. R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
11. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical Report CSD-02-1206, Division of Computer Science, University of California, Berkeley, 2002.
12. R. Rosipal, L. J. Trejo, and B. Matthews. Kernel PLS-SVC for linear and nonlinear classification. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
13. J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray data using diffusion kernels and CCA. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
14. A. Vinokourov, N. Cristianini, and J. Shawe-Taylor. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
15. E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
Appendix A: Derivation Based on LDA

Parameterization. As explained before, the side-information is such that we get pairs of samples (x_i^(1), x_i^(2)) which have the same class label. Using this side-information, we stack the corresponding vectors x_i^(1) and x_i^(2) at the same row in their respective matrices X^(1) = [x_1^(1)'; x_2^(1)'; … ; x_n^(1)'] and X^(2) = [x_1^(2)'; x_2^(2)'; … ; x_n^(2)']. The full matrix containing all samples for which side-information is available is then equal to X = [X^(1) ; X^(2)]. Now, since we know each row of X^(1) has the same label as the corresponding row of X^(2), a parameterization of the label matrix Z is easily found to be Z = [L ; L]. Note that Z is centered iff L is centered. The matrix L is in fact just the label matrix of both X^(1) and X^(2) on themselves. (We want to stress that L is not known, but is used in the equations as an unknown matrix parameter for now.)

The Rayleigh Quotient Cost Function That Incorporates the Side-Information. Using this parameterization, we apply LDA on the matrix X = [X^(1) ; X^(2)] with the label matrix Z = [L ; L] to find the optimal directions for separation of the classes. For this we use the CCA formulation of LDA. This means we want to solve the CCA optimization problem (1), where we substitute the values for Z and X:

max_{w, w_Z}  w' [X^(1) ; X^(2)]' [L ; L] w_Z = max_{w, w_Z}  w' X^(1)' L w_Z + w' X^(2)' L w_Z    (3)

s.t.  ||[X^(1) ; X^(2)] w||² = ||X^(1) w||² + ||X^(2) w||² = 1  and  ||L w_Z||² = 1    (4)

The Lagrangian of this constrained optimization problem is:

L = w' X^(1)' L w_Z + w' X^(2)' L w_Z − λ w' (X^(1)' X^(1) + X^(2)' X^(2)) w − µ w_Z' L' L w_Z

Differentiating with respect to w_Z and w and equating to 0 yields

∇_{w_Z} L = 0  ⇒  L' (X^(1) + X^(2)) w = 2µ L' L w_Z    (5)
∇_w L = 0  ⇒  (X^(1) + X^(2))' L w_Z = 2λ (X^(1)' X^(1) + X^(2)' X^(2)) w    (6)

From (5) we find that w_Z = (1/2µ) (L'L)† L' (X^(1) + X^(2)) w. Filling this into equation (6) and choosing λ̃ = 4λµ gives that

(X^(1) + X^(2))' L (L'L)† L' (X^(1) + X^(2)) w = λ̃ (X^(1)' X^(1) + X^(2)' X^(2)) w.
It is well known that solving for the dominant generalized eigenvector is equivalent to maximizing the Rayleigh quotient:

( w' (X^(1) + X^(2))' [L (L'L)† L'] (X^(1) + X^(2)) w ) / ( w' (X^(1)' X^(1) + X^(2)' X^(2)) w ).    (7)

Until now, for the given side-information, there is still an exact equivalence between LDA and maximizing this Rayleigh quotient. The important difference between the standard LDA cost function and (7), however, is that in the latter the side-information is imposed explicitly by using the reduced parameterization for Z in terms of L.

The Expected Cost Function. As pointed out, we do not know the term between [·]. What we will do then is compute the expected value of the cost function (7) by averaging over all possible label matrices Z = [L ; L], possibly weighted with any symmetric⁴ a priori probability for the label matrices. Since the only part that depends on the label matrix is the factor between [·], and since it appears linearly in the cost function, we just need to compute the expectation of this factor. This expectation is proportional to I − 11'/n. To see this we only have to use symmetry arguments (all values on the diagonal should be equal to each other, and all other values should be equal to each other), and the observation that L is centered and thus L(L'L)† L' 1 = 0. Now, since we assume that the data matrix X containing the samples in the side-information is centered too, (X^(1) + X^(2))' (11'/n) (X^(1) + X^(2)) is equal to the null matrix. Thus the expected value of (X^(1) + X^(2))' L (L'L)† L' (X^(1) + X^(2)) is proportional to (X^(1) + X^(2))' (X^(1) + X^(2)). The expected value of the LDA cost function in equation (7), where the expectation is taken over all possible label assignments Z constrained by the side-information, is then shown to be

( w' (C_11 + C_12 + C_22 + C_21) w ) / ( w' (C_11 + C_22) w ) = 1 + ( w' (C_12 + C_21) w ) / ( w' (C_11 + C_22) w ).

The vector w maximizing this cost is the dominant generalized eigenvector of

(C_12 + C_21) w = λ (C_11 + C_22) w
where C_kl = X^(k)' X^(l). (Note that the side-information is symmetric in the sense that one could replace an example pair (x_i^(1), x_i^(2)) with (x_i^(2), x_i^(1)) without losing any information. However, this operation changes neither C_12 + C_21 nor C_11 + C_22, so that
⁴ That is, the a priori probability of a label assignment L is the same as the probability of the label assignment PL, where P can be any permutation matrix. Remember that every row of L corresponds to the label of a pair of points in the side-information. Thus, this invariance means we have no discriminating prior information on which pair belongs to which of the classes. Using this ignorant prior is clearly the most reasonable thing we can do, since we assume only the side-information is given here.
the eigenvalue problem to be solved does not change either, which is of course a desirable property.)
Appendix B: Alternative Derivation

More in the spirit of [15], we can derive the algorithm by solving the constrained optimization problem (where dim(W) = k means that the dimensionality of W is k, that is, W has k columns):

max_W  trace(X^(1) W W' X^(2)')
s.t.  dim(W) = k
      W' (X^(1)' X^(1) + X^(2)' X^(2)) W = I_k

so as to find a subspace of dimension k that optimizes the correlation between samples belonging to the same class. This can be reformulated as

max_W  trace(W' (C_12 + C_21) W)    (8)
s.t.  dim(W) = k
      W' (C_11 + C_22) W = I_k

Solving this optimization problem amounts to solving for the eigenvectors corresponding to the k largest eigenvalues of the generalized eigenvalue problem (2) described above. The proof involves the following theorem by Ky Fan (see e.g. [10]):

Theorem 5.01. Let H be a symmetric matrix with eigenvalues λ_1 > λ_2 > … > λ_n, and the corresponding eigenvectors U = (u_1, u_2, …, u_n). Then

λ_1 + … + λ_k = max_{P'P = I} trace(P' H P).

Moreover, the optimal P* is given by P* = (u_1, …, u_k) Q, where Q is an arbitrary orthogonal matrix.

Since (C_11 + C_22) is positive definite, we can take P = (C_11 + C_22)^{1/2} W, so that the constraint W' (C_11 + C_22) W = I_k becomes P'P = I_k. Also put H = (C_11 + C_22)^{−1/2} (C_12 + C_21) (C_11 + C_22)^{−1/2}, so that the objective function (8) becomes trace(P' H P). Applying the Ky Fan theorem and choosing Q = I_k leads to the fact that P* = (u_1, …, u_k), with u_1, …, u_k the k eigenvectors corresponding to the k largest eigenvalues of H. Thus, the optimal W* = (C_11 + C_22)^{−1/2} P*. For P* a matrix of eigenvectors of H = (C_11 + C_22)^{−1/2} (C_12 + C_21) (C_11 + C_22)^{−1/2}, this W* consists exactly of the generalized eigenvectors (corresponding to the same eigenvalues) of (2). The result is thus exactly the same as obtained in the derivation in Appendix A.
Efficiently Learning the Metric with Side-Information
189
Appendix C: Connection to the Literature

If we replace w in optimization problem (3) subject to (4) once by w^(1) and once by w^(2):

    max_{w^(1), w^(2)}  w^(1)' X^(1)' L w_Z + w^(2)' X^(2)' L w_Z
    s.t.   ||X^(1) w^(1)||^2 + ||X^(2) w^(2)||^2 = 1
           ||L w_Z||^2 = 1

where L corresponds to the common label matrix for X^(1) and X^(2) (both centered). In a similar way as in the previous derivation, this can be shown to amount to solving the eigenvalue problem

    [ 0                              X^(1)' L (L'L)^{-1} L' X^(2) ] [ w^(1) ]     [ C11  0   ] [ w^(1) ]
    [ X^(2)' L (L'L)^{-1} L' X^(1)   0                            ] [ w^(2) ] = λ [ 0    C22 ] [ w^(2) ]

which again corresponds to a Rayleigh quotient. Since here too we do not in fact know the matrix L, we again take the expected value (as in Appendix A). This leads to an expected Rayleigh quotient that is maximized by solving the generalized eigenproblem corresponding to CCA:

    [ 0    C12 ] [ w^(1) ]     [ C11  0   ] [ w^(1) ]
    [ C21  0   ] [ w^(2) ] = λ [ 0    C22 ] [ w^(2) ]
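The CCA eigenproblem above can be solved directly in block form. The numpy sketch below is our illustration (the data, dimensions, and tolerances are assumptions, not from the paper); it reduces A w = λ B w to a symmetric problem and checks two known properties: the top eigenvalue equals the largest canonical correlation, and the spectrum is symmetric about zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = rng.standard_normal((n, 3))
X2 = X1 @ rng.standard_normal((3, 4)) + 0.5 * rng.standard_normal((n, 4))
X1 -= X1.mean(0); X2 -= X2.mean(0)            # both views centered
C11, C22, C12 = X1.T @ X1, X2.T @ X2, X1.T @ X2
d1, d2 = C11.shape[0], C22.shape[0]

A = np.block([[np.zeros((d1, d1)), C12], [C12.T, np.zeros((d2, d2))]])
B = np.block([[C11, np.zeros((d1, d2))], [np.zeros((d2, d1)), C22]])

# reduce A w = lam B w to the symmetric problem B^{-1/2} A B^{-1/2}
d, U = np.linalg.eigh(B)
Bih = U @ np.diag(d ** -0.5) @ U.T
lam, V = np.linalg.eigh(Bih @ A @ Bih)
w = Bih @ V[:, -1]                            # top generalized eigenvector
w1, w2 = w[:d1], w[d1:]

# the top eigenvalue is the largest canonical correlation
r = np.corrcoef(X1 @ w1, X2 @ w2)[0, 1]
assert abs(abs(r) - lam[-1]) < 1e-6
# the spectrum is symmetric: eigenvalues come in +/- pairs
assert np.allclose(np.sort(lam), -np.sort(lam)[::-1], atol=1e-8)
```

The +/- pairing reflects the block-antidiagonal structure of A: its reduced form has eigenvalues equal to plus and minus the singular values of C11^{-1/2} C12 C22^{-1/2}.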
Learning Continuous Latent Variable Models with Bregman Divergences

Shaojun Wang¹ and Dale Schuurmans²

¹ Department of Statistics, University of Toronto, Canada
² School of Computer Science, University of Waterloo, Canada
Abstract. We present a class of unsupervised statistical learning algorithms that are formulated in terms of minimizing Bregman divergences, a family of generalized entropy measures defined by convex functions. We obtain novel training algorithms that extract hidden latent structure by minimizing a Bregman divergence on training data, subject to a set of non-linear constraints that account for hidden variables. An alternating minimization procedure with nested iterative scaling is proposed to find feasible solutions for the resulting constrained optimization problem. The convergence of this algorithm, along with its information geometric properties, is characterized.

Index Terms: statistical machine learning, unsupervised learning, Bregman divergence, information geometry, alternating minimization, forward projection, backward projection, iterative scaling.
1 Introduction
A variety of machine learning and statistical inference problems focus on supervised learning from labeled training data. In such problems, convexity often plays a central role in formulating the loss function to be minimized during training. For example, a standard approach to formulating a training loss is to distinguish a preferred value from a set of candidate prediction values, and measure prediction error by a convex error measure. Examples of this include least squares regression, decision tree learning, boosting, on-line learning, maximum likelihood for exponential models, logistic regression, maximum entropy, support vector machines, statistical signal processing (e.g. Burg's spectral estimation for speech signal analysis and image reconstruction) and optimal portfolio selection. Such problems can often be naturally cast as convex optimization problems involving a Bregman divergence [5,10,23], which can lead to new algorithms, analytical tools, and insights derived from the powerful methods of convex analysis [2,3,7,13]. Training algorithms that solve these problems can be cast as implementing a minimum Bregman divergence (MB) principle. However, in practice, many of the natural patterns we wish to classify are the result of causal processes that have hidden hierarchical structure, yielding data that does not report the value of latent variables. For example, in natural language learning the observed data rarely reports the value of hidden semantic variables or syntactic structure, in speech signal analysis gender information is

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 190-204, 2003.
© Springer-Verlag Berlin Heidelberg 2003
not explicitly marked, etc. Obtaining fully labeled data is tedious or impossible in most realistic cases. This motivates us to propose a class of unsupervised statistical learning algorithms that are still formulated in terms of minimizing a Bregman divergence, except that we must now change the problem formulation to respect hidden variables. In this paper we propose training algorithms for solving the latent minimum Bregman divergence (LMB) principle: given a set of training data and features that one would like to match in the training data, compute a model that minimizes a convex objective function (a Bregman divergence) subject to a set of non-linear constraints that take into account possible latent structure. Our treatment of the LMB principle closely parallels the results presented in [24] for the Kullback-Leibler divergence, but the extension proposed here is not trivial. For probabilistic models under the Kullback-Leibler divergence, we can show an equivalence between satisfying the constraints (i.e. achieving feasibility) and locally maximizing the likelihood under a log-linear assumption. Thus, in this case, we can resort to the EM algorithm [14] to develop a practical technique for finding feasible solutions and proving convergence. However, general Bregman divergences raise a more difficult technical challenge because the EM approach breaks down for these generalized entropy measures. In this paper, we will overcome this difficulty by using an alternating minimization approach [9] in a non-trivial way (see Figure 1). Thus, beyond the generalized KL divergence used for unsupervised boosting in clustering [25], the techniques of this paper can also handle a broader class of functions, such as the Itakura-Saito distortion [17] for speech signal analysis.

[Figure 1: a diagram relating the LMB principle / AM algorithm (jointly convex Bregman divergences, including the Bregman divergence and generalized K-L divergence) to the LME principle / EM algorithm (Shannon entropy and K-L divergence), with unsupervised boosting placed under the generalized K-L divergence.]

Fig. 1. The AM algorithm proposed in this paper is valid for the family of jointly convex Bregman divergences, whereas the EM algorithm proposed in [24] is only valid for the K-L divergence. The unsupervised boosting case deals with the generalized K-L divergence, and thus can be solved by using the AM algorithm to find feasible solutions under the latent minimum Bregman divergence principle.
2 The LMB Principle
To express a joint (probability) model, let X ∈ X denote the complete data, Y ∈ Y be the observed incomplete data and Z ∈ Z be the missing data. That is, X = (Y, Z). Let φ : ℝ → ℝ be a strictly convex function on an interval I ⊂ ℝ, differentiable in the interior of I. Define a closed convex set S ⊂ X, where S is typically assumed to be the set of (probability) distributions (or positive measures) p over X. For functions p and q on X with values in I, a Bregman divergence¹ [4,7,10,11,12,13,18] is a generalized entropy measure associated with a convex function φ:

    Bφ(p; q) = ∫_{x∈X} ∆φ(p(x); q(x)) µ(dx)

where ∆φ(p(x); q(x)) = φ(p(x)) − φ(q(x)) − φ'(q(x)) (p(x) − q(x)) and φ' denotes the derivative of φ.² That is, the Bregman divergence Bφ measures the discrepancy between two distributions p and q by integrating over X the difference between φ evaluated at p and φ's first-order Taylor expansion about q, evaluated at p. To strengthen the interpretation of Bφ(p; q) as a measure of distance, we make the following assumptions.

• ∆φ(u, v) is strictly convex in u and in v separately, and also satisfies the stronger property that it is jointly convex in (u, v). Thus our choice of Bregman divergence Bφ(p; q) is strictly convex in p and in q separately, and also jointly convex. This assumption lies at the heart of the analysis below.
• Bφ(p; q) is lower semi-continuous in p and q jointly.
• For any fixed p ∈ S, the level sets {q : Bφ(p; q) ≤ c} are bounded.
• If Bφ(p^k; q^k) → 0 and p^k or q^k is bounded, then p^k → q^k and q^k → p^k.
• If p ∈ S and q^k → p, then Bφ(p; q^k) → 0.

Examples

1. Let φ(t) = t log t be defined on I = [0, ∞). Then φ'(t) = log t + 1, and

    Bφ(p; q) = D(p; q) = ∫_{x∈X} [ p(x) log(p(x)/q(x)) − p(x) + q(x) ] µ(dx)
¹ The machine learning community [8,13,18,19] is familiar with the discrete case, since for supervised learning there are a finite number of sample points (training examples), so we can write the constraints as pertaining to a finite dimensional vector. However, in unsupervised learning we are usually dealing with continuous variables, and therefore instead of a vector we are working with an infinite dimensional space.
² In this paper, µ denotes a given σ-finite measure on X. If X is finite or countably infinite, then µ is the counting measure, and integrals reduce to sums. If X is a subset of a finite dimensional space, µ is the Lebesgue measure. If X is a combination of both cases, µ will be a combination of both measures.
which is the generalized KL divergence. This is the objective function of the primal problem for AdaBoost [20]. When p and q are restricted to be probability measures, it becomes the KL divergence, the objective function of the primal problem for LogitBoost [16,20]. Furthermore, when q is chosen to be uniform, it becomes the Shannon entropy.

2. Let φ(t) = t² be defined on I = (−∞, ∞). Then φ'(t) = 2t, and

    Bφ(p; q) = ||p(x) − q(x)||²_{L2(µ)}

which is the measure of energy.

3. Let φ(t) = −log t be defined on I = (0, ∞). Then φ'(t) = −1/t, and

    Bφ(p; q) = ∫_{x∈X} [ log(q(x)/p(x)) + p(x)/q(x) − 1 ] µ(dx)

which is the Itakura-Saito distortion that arises in the spectral analysis of speech signals. When q = 1, it becomes the Burg entropy [17].

4. Let φ(t) = t log t + (1 − t) log(1 − t) be defined on I = [0, 1]. Then φ'(t) = log(t/(1 − t)), and

    Bφ(p; q) = ∫_{x∈X} [ p(x) log(p(x)/q(x)) + (1 − p(x)) log((1 − p(x))/(1 − q(x))) ] µ(dx)

which is the Bernoulli entropy. When q = 1/2, it becomes the Fermi-Dirac entropy.
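In the discrete case (see footnote 1), these divergences reduce to sums and are easy to compute directly from the definition. The sketch below is our illustration, not code from the paper; it evaluates the Bregman divergence for the convex functions of Examples 1-3 and checks nonnegativity and Bφ(p; p) = 0:

```python
import numpy as np

def bregman(p, q, phi, dphi):
    """Discrete Bregman divergence: sum_x phi(p) - phi(q) - phi'(q)(p - q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(phi(p) - phi(q) - dphi(q) * (p - q)))

# Example 1: phi(t) = t log t  -> generalized KL divergence
gen_kl = lambda p, q: bregman(p, q, lambda t: t * np.log(t), lambda t: np.log(t) + 1)
# Example 2: phi(t) = t^2     -> squared L2 distance (the "energy" measure)
energy = lambda p, q: bregman(p, q, lambda t: t ** 2, lambda t: 2 * t)
# Example 3: phi(t) = -log t  -> Itakura-Saito distortion
ita_saito = lambda p, q: bregman(p, q, lambda t: -np.log(t), lambda t: -1.0 / t)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
for d in (gen_kl, energy, ita_saito):
    assert d(p, q) > 0           # divergence is positive for p != q
    assert abs(d(p, p)) < 1e-12  # and zero for p == q
```

Nonnegativity follows from the convexity of φ: the integrand ∆φ compares φ at p to its tangent at q.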
To formulate the minimum Bregman divergence principle, assume we have a finite set of features f1 (x), ..., fN (x) which correspond to sufficient statistics in a log-linear model, weak learners in boosting, or basis function in non-parametric ˜ Z), ˜ where Y˜ are obestimation. Given a set of complete data points X˜ = (Y, served “descriptions” and Z˜ are observed “labels”, the minimum Bregman divergence principle (MB) is: MB principle. Choose a conditional distribution p(z|y) to minimize min
p(z|y)∈S
Bφ (˜ p(y)p(z|y); q0 (x))
subject to the constraints p˜(y) fi (x) p(z|y) µ(dz) = p˜(x)fi (x) ˜ y∈Y
z∈Z
(1)
for i = 1, ..., N (2)
x∈X˜
where x = (y, z), q0 ∈ S is a default distribution chosen so that φ (q0 ) = 0, and quite often we set q0 to be uniform, and p˜(x) and p˜(y) denote the empirical distributions of the complete and marginal data respectively. In general, iterative scaling [11,12,13,18] is used to obtain the (global) optimal solution for the MB principle.
In contrast to the MB principle, if the labels Z̃ are unobserved, we propose the latent minimum Bregman divergence (LMB) principle as follows.

LMB principle. Choose a joint distribution p(x) to minimize

    min_{p∈S} Bφ(p; q0)    (3)

subject to the constraints

    ∫_{x∈X} fi(x) p(x) µ(dx) = Σ_{y∈Ỹ} p̃(y) ∫_{z∈Z} fi(x) p(z|y) µ(dz),   for i = 1, ..., N    (4)

where x = (y, z) and q0 ∈ S is a default distribution chosen so that φ'(q0) = 0; quite often we set q0 to be uniform. Here p̃(y) is the empirical distribution of the observed data and p(z|y) is the conditional distribution of the latent variables given the observed data. Note that the conditional p(z|y) implicitly encodes the latent structure and is a nonlinear mapping of p(x). That is, p(z|y) = p(y, z) / ∫_{z'∈Z} p(y, z') µ(dz') = p(x) / ∫_{x'=(y,z')} p(x') µ(dx'), where x = (y, z) and x' = (y, z'). Clearly p(z|y) is a non-linear function of p(x) because of the division. This means we are faced with minimizing an objective (3) subject to a system of non-linear constraints (4). Therefore, even though the objective function (3) is convex, no unique minimum can be guaranteed to exist; in fact, maxima and saddle points may exist. Nevertheless, one can still attempt to derive an iterative training procedure that finds feasible solutions to the LMB problem. With such a subroutine in hand, one can then heuristically solve the LMB principle by gathering several feasible solutions (by starting from different initial points) and choosing the feasible p that attains the smallest Bregman divergence.

To illustrate how the LMB principle is related to unsupervised learning, assume we are given a collection of unlabeled examples from which we wish to construct a linear combination of weak "decision stumps" to create a "strong" predictive model for clustering. In this case, we can formulate the problem as minimizing the generalized K-L divergence of an unnormalized exponential model defined in terms of the features (the decision stumps), subject to the (nonlinear) constraints that the model matches the generalized feature expectations.

Below we focus on developing an iterative algorithm for finding feasible solutions. In general, solving (3) subject to (4) is quite complex.
Since the original problem does not yield a simple closed form solution for p, we instead look for an approximate solution. First, we restrict the model to have an additive form.

Definition 1. [19] Let S ⊂ X be a set of measures. An additive model for S is defined by an operation L : X × S → S satisfying the homomorphism property L(r1 + r2, s) = L(r1, L(r2, s)) for all r1, r2 ∈ X and s ∈ S.

Lemma 1. [19] Given a convex function φ : ℝ → ℝ, let Bφ be the Bregman divergence defined on measures p ∈ S. Define the Legendre transform v ∘φ q0 by

    v ∘φ q0 = arg min_{p∈S} [ Bφ(p; q0) − v · p ]
Then we have v ∘φ q0 = Lφ(q0, v) such that (Lφ(q0, v))(x) = (φ')^{-1}(φ'(q0(x)) − v(x)) for all x. Also, the map (v, q0) → v ∘φ q0 is an additive model for S.

By adopting an additive model restriction, we can make valuable progress toward formulating a practical algorithm for approximately satisfying the LMB principle. In the following, we use a doubly iterative projection algorithm to obtain feasible additive solutions, and also provide a characterization of its convergence properties and information geometry.
3 Preliminaries: Convergence of Alternating Projections

We present a generalization of the alternating projection method of Csiszar and Tusnady [9] for Bregman divergences, and show how this technique can be used to find feasible solutions for the LMB principle. In developing our method we need a slightly more general convergence result than [9], which is due to [15]. These results were originally shown in [6,15] for the discrete case; here we extend them to continuous variables. Since projections onto closed convex sets may be thought of as solutions of minimum divergence problems, we begin by introducing suitable definitions for the Bregman divergence.

Definition 2. (Forward projection) Suppose Q ⊂ S is a nonempty closed convex set, and let p ∈ S. We define the forward projection of p onto Q as the unique element q* ∈ Q such that Bφ(p; q*) = min_{q∈Q} Bφ(p; q).

Definition 3. (Backward projection) Suppose P ⊂ S is a nonempty closed convex set, and let q ∈ S. We define the backward projection of q onto P as the unique element p* ∈ P such that Bφ(p*; q) = min_{p∈P} Bφ(p; q).

We can then define the alternating projection algorithm associated with the Bregman divergence.

Alternating minimization (AM) algorithm. Consider two nonempty closed convex sets P, Q ⊂ S.

Initialization: Let q⁰ ∈ Q be an arbitrary distribution such that there exists p ∈ P with Bφ(p; q⁰) < ∞.
Iterative step: Given q^k, find p^k by backward projection onto P:

    p^k = arg min_{p∈P} Bφ(p; q^k)

Then calculate q^{k+1} by forward projection onto Q:

    q^{k+1} = arg min_{q∈Q} Bφ(p^k; q)

Repeat until convergence.

To prove that this procedure converges, we first demonstrate the "three points" and "four points" properties for the Bregman divergence.
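For φ(t) = t² (Example 2), both the forward and backward projections reduce to Euclidean projections, so the AM loop becomes classical alternating projections between two convex sets. The sketch below is our illustration (the two sets and iteration count are arbitrary assumptions): it alternates between a line and the unit disk and converges to the pair of closest points.

```python
import numpy as np

# Two closed convex sets in R^2:
# P = {x : x[0] + x[1] = 2} (a line), Q = {x : ||x|| <= 1} (unit disk).
def proj_P(x):
    # Euclidean projection onto the line x0 + x1 = 2
    t = (2.0 - x.sum()) / 2.0
    return x + t

def proj_Q(x):
    # Euclidean projection onto the unit disk
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

q = np.array([5.0, -3.0])
for _ in range(200):
    p = proj_P(q)   # backward projection onto P
    q = proj_Q(p)   # forward projection onto Q

# The limit pair attains the minimum distance between the sets:
# p* = (1, 1) on the line and q* = (1, 1)/sqrt(2) on the disk.
assert np.allclose(p, [1.0, 1.0], atol=1e-6)
assert np.allclose(q, np.array([1.0, 1.0]) / np.sqrt(2), atol=1e-6)
```

This is the behaviour Theorem 1 below guarantees in general: the pair (p^k, q^k) converges to a pair minimizing Bφ over P × Q.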
Lemma 2. (Three points property) Consider a Bregman divergence Bφ on two nonempty closed convex sets P, Q ⊂ S. Let q ∈ Q be such that Bφ(p; q) < ∞ for all p ∈ P, and let p* = arg min_{p∈P} Bφ(p; q). Then for all p ∈ P we have

    Bφ(p; q) − Bφ(p*; q) ≥ Bφ(p; p*)

Proof. By the definition of the Bregman divergence, we have

    Bφ(p; q) − Bφ(p*; q)
      = ∫_{x∈X} [ φ(p(x)) − φ(p*(x)) − φ'(q(x)) (p(x) − p*(x)) ] µ(dx)
      = ∫_{x∈X} [ φ(p(x)) − φ(p*(x)) − φ'(p*(x)) (p(x) − p*(x))
                  + (φ'(p*(x)) − φ'(q(x))) (p(x) − p*(x)) ] µ(dx)
      = Bφ(p; p*) + ∫_{x∈X} (φ'(p*(x)) − φ'(q(x))) (p(x) − p*(x)) µ(dx)

Denote the partial gradient of Bφ with respect to its first argument by ∇_p Bφ(p; q), and note that ∂Bφ(p; q)/∂p(x) = φ'(p(x)) − φ'(q(x)). Therefore we have (∇_p Bφ(p*; q))(x) = φ'(p*(x)) − φ'(q(x)) for all x. Since p* minimizes Bφ(p; q) over the convex set P, we must have

    ∇_p Bφ(p*; q) · (p − p*) ≥ 0

The result then follows since

    ∫_{x∈X} (φ'(p*(x)) − φ'(q(x))) (p(x) − p*(x)) µ(dx) = ∇_p Bφ(p*; q) · (p − p*)
Lemma 3. (Four points property) Consider a jointly convex Bregman divergence Bφ on two nonempty closed convex sets P, Q ⊂ S. Let p ∈ P be such that Bφ(p; q) < ∞ for all q ∈ Q, and let q* = arg min_{q∈Q} Bφ(p; q). Then for all u ∈ P, v ∈ Q we have

    Bφ(u; q*) ≤ Bφ(u; p) + Bφ(u; v)

Proof. By the joint convexity assumption on ∆φ(p(x); q(x)), we have

    ∆φ(u(x); v(x)) ≥ ∆φ(p(x); q*(x))
        + (∂/∂p(x)) ∆φ(p(x); q*(x)) (u(x) − p(x))
        + (∂/∂q*(x)) ∆φ(p(x); q*(x)) (v(x) − q*(x))

for all x. Therefore

    Bφ(u; v) ≥ Bφ(p; q*) + ∇_p Bφ(p; q*) · (u − p) + ∇_{q*} Bφ(p; q*) · (v − q*)
Since q* minimizes Bφ(p; q) over the convex set Q, we have

    ∇_{q*} Bφ(p; q*) · (v − q*) ≥ 0

Thus

    Bφ(u; v) − Bφ(p; q*) − ∇_p Bφ(p; q*) · (u − p) ≥ 0

On the other hand, by the definition of the Bregman divergence, we have

    Bφ(u; p) − Bφ(u; q*)
      = ∫_{x∈X} [ φ(q*(x)) − φ(p(x)) + φ'(q*(x))(u(x) − q*(x)) − φ'(p(x))(u(x) − p(x)) ] µ(dx)
      = −Bφ(p; q*) − ∫_{x∈X} (φ'(p(x)) − φ'(q*(x)))(u(x) − p(x)) µ(dx)
      = −Bφ(p; q*) − ∇_p Bφ(p; q*) · (u − p)

Thus we obtain

    Bφ(u; p) + Bφ(u; v) − Bφ(u; q*) = Bφ(u; v) − Bφ(p; q*) − ∇_p Bφ(p; q*) · (u − p) ≥ 0

Given these two lemmas, following [15] we obtain the following convergence result.

Theorem 1. The alternating minimization algorithm (AM) converges. That is, p¹, p², ... converges to some p∞ ∈ P, and q¹, q², ... converges to some q∞ ∈ Q, such that

    Bφ(p∞; q∞) = min_{p∈P, q∈Q} Bφ(p; q)

The proof of this theorem follows the same line of argument as that of Theorem 2.17 in [15].
4 The AM-IS Algorithm for Learning Latent Structure

We now extend this alternating minimization algorithm to finding feasible solutions to the LMB principle. To understand the algorithm and its information geometry, we first define some useful sub-manifolds of S:

    C  = { p ∈ S : ∫_{x∈X} fi(x) p(x) µ(dx) = Σ_{y∈Ỹ} p̃(y) ∫_{z∈Z} fi(x) p(z|y) µ(dz), i = 1, ..., N }

    M  = { p ∈ S : ∫_{z∈Z} p(y, z) µ(dz) = p̃(y), for each y ∈ Ỹ }

    G_a = { p ∈ S : ∫_{x∈X} p(x) fi(x) µ(dx) = a_i, i = 1, ..., N }

    E  = { p_λ ∈ S : p_λ(x) = Lφ(q0, Σ_{i=1}^N λ_i fi(x)), λ ∈ Ω }

where C denotes the set of nonlinear constraints the model should satisfy, M denotes the set of distributions whose observed marginal distribution matches the observed empirical distribution, G_a denotes the set of distributions whose feature expectations are constant, E denotes the set of additive models, and

    Ω = { λ ∈ ℝ^N : Lφ(q0, Σ_{i=1}^N λ_i fi(x)) < ∞ }
Now by choosing the closed convex set M̄ to play the role of P in the previous discussion, and choosing the closed convex set Ē to play the role of Q, we can define the corresponding forward and backward projection operators, and then use these to iterate toward feasible LMB solutions.

First, to derive a backward projection operator, take a current p^k_λ playing the role of q^k in the previous discussion, and use this to determine a distribution p* ∈ M̄ that minimizes

    Bφ(p*; p^k_λ) = min_{p∈M̄} Bφ(p; p^k_λ)

That is, p* is the backward projection of p^k_λ onto M̄. To solve for p*, one can formulate the Lagrangian Λ(p, α):

    Λ(p, α) = Bφ(p; p^k_λ) + Σ_{y∈Ỹ} α_y ( ∫_{z∈Z} p(y, z) µ(dz) − p̃(y) )

Now since

    ∂Λ(p, α)/∂p(x) = φ'(p(x)) − φ'(p^k_λ(x)) + α_y

it is not hard to see that the solution p* must satisfy

    p*(x) = (φ')^{-1}( φ'(p^k_λ(x)) − α_y )

where α_y is chosen so that, for all y ∈ Ỹ,

    ∫_{z∈Z} p*(y, z) µ(dz) = ∫_{z∈Z} (φ')^{-1}( φ'(p^k_λ(y, z)) − α_y ) µ(dz) = p̃(y)    (5)
Lemma 4. For φ corresponding to Examples 1 and 2 above, the backward projection of p^k_λ onto M̄ is given by the closed form solution p*(x) = p̃(y) p^k_λ(z|y) for all x = (y, z).

Proof. Note that for φ(t) = t log t (Example 1) or φ(t) = t² (Example 2) we have

    φ'(p^k_λ(x)) − φ'(p^k_λ(y)) + φ'(p̃(y)) = φ'( p^k_λ(z|y) p̃(y) )

Therefore, if we let α_y = φ'(p^k_λ(y)) − φ'(p̃(y)), we obtain

    p*(x) = (φ')^{-1}( φ'(p^k_λ(x)) − α_y ) = (φ')^{-1}( φ'(p^k_λ(z|y) p̃(y)) ) = p̃(y) p^k_λ(z|y)
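For these cases the lemma says the backward projection simply re-weights the current conditional by the empirical marginal. A small discrete sketch (our illustration, not code from the paper) computes p*(y, z) = p̃(y) p^k_λ(z|y) on a joint table and checks that the marginal constraint defining M̄ holds exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
p_joint = rng.random((4, 3)); p_joint /= p_joint.sum()   # current model p_lambda(y, z)
p_tilde = rng.random(4); p_tilde /= p_tilde.sum()        # empirical marginal p~(y)

cond = p_joint / p_joint.sum(axis=1, keepdims=True)      # conditional p_lambda(z | y)
p_star = p_tilde[:, None] * cond                         # backward projection onto M-bar

assert np.allclose(p_star.sum(axis=1), p_tilde)  # marginal matches p~(y) exactly
assert np.allclose(p_star.sum(), 1.0)            # still a probability distribution
```

No iterative solve for the multipliers α_y is needed here, which is why the backward projection step of AM is cheap for Examples 1 and 2.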
Thus in many cases we can implement the backward projection step for AM merely by calculating the conditional distribution p^k_λ(z|y) of the current model. In general, one has to solve for the Lagrange multipliers that satisfy (5) to yield a general form of the backward projection p*(x) = p̃(y) p^k_{α_y,λ}(z|y). In this case, instead of using the original conditional distribution p(z|y) on the right-hand side of the constraints, Eqn (4), a modified conditional distribution p_{α_y}(z|y), which is a function of p(z|y), has to be used in the problem formulation of the LMB principle.

Next, to formulate the forward projection step, we exploit the following lemma.

Lemma 5. For any p^k ∈ M̄, the forward projection of p^k onto Ē is equivalent to solving the minimization

    min_{p∈G_a} Bφ(p; q0)   where   a_i = ∫_{x∈X} p^k(x) fi(x) µ(dx)    (6)

Proof. To find the solution of (6), form the Lagrangian Ψ(p, λ):

    Ψ(p, λ) = Bφ(p; q0) + Σ_{i=1}^N λ_i ( ∫_{x∈X} p(x) fi(x) µ(dx) − ∫_{x∈X} p^k(x) fi(x) µ(dx) )

Now since

    ∂Ψ(p, λ)/∂p(x) = φ'(p(x)) − φ'(q0(x)) + Σ_{i=1}^N λ_i fi(x)

any solution must satisfy

    p_λ(x) = (φ')^{-1}( φ'(q0(x)) − Σ_{i=1}^N λ_i fi(x) )

Plugging into Ψ, we are left with the problem of maximizing

    Ψ(p_λ, λ) = ∫_{x∈X} [ φ(p_λ(x)) − φ(q0(x)) − φ'(q0(x))(p_λ(x) − q0(x)) ] µ(dx)
              + Σ_{i=1}^N λ_i ( ∫_{x∈X} p_λ(x) fi(x) µ(dx) − ∫_{x∈X} p^k(x) fi(x) µ(dx) )
              = Bφ(p^k; q0) − Bφ(p^k; p_λ)

which is equivalent to minimizing Bφ(p^k; p_λ) over p_λ ∈ Ē, the forward projection of p^k onto Ē.

To solve the minimization problem specified in (6) one can use iterative scaling. By using an auxiliary function to bound the change in Bregman divergence from below, the iterative scaling algorithm can be derived. Following [13], define an auxiliary function as follows:
Definition 4. We call A : S × ℝ^N → ℝ an auxiliary function for p^k and f if it satisfies the following conditions:
1. A(q, λ) is continuous in q and A(q, 0) = 0.
2. Bφ(p^k; q) − Bφ(p^k; Lφ(q, Σ_{i=1}^N fi(x) λ_i)) ≥ A(q, λ).
3. If λ = 0 is a maximum of A(q, λ), then

    ∫_{x∈X} fi(x) q(x) µ(dx) = ∫_{x∈X} fi(x) p^k(x) µ(dx),   i = 1, ..., N
Maximizing this auxiliary function we obtain new parameters λ' = λ + ∆λ and a new model given by

    q_{λ+∆λ} = L( q_λ, Σ_{i=1}^N ∆λ_i fi(x) )
             = L( L( q0, Σ_{i=1}^N λ_i fi(x) ), Σ_{i=1}^N ∆λ_i fi(x) )
             = L( q0, Σ_{i=1}^N (λ_i + ∆λ_i) fi(x) )

When ∆λ = 0, we have that q_λ = arg min_{p∈G_a} Bφ(p; q0).

Lemma 6. Define f#(x) = Σ_{i=1}^N |fi(x)|, σ_i(x) = sign(fi(x)) and l_φ(q, v) = sup_{p∈S} [ v · p − Bφ(p; q) ]. Then

    A(q, λ) := Σ_{i=1}^N λ_i ∫_{x∈X} fi(x) p^k(x) µ(dx)
             − ∫_{x∈X} Σ_{i=1}^N (|fi(x)| / f#(x)) l_φ( q, σ_i(x) f#(x) λ_i ) µ(dx)    (7)

is an auxiliary function for p^k and f, and the corresponding iterative scaling update scheme is given by

    q^k_{t+1} = Lφ( q^k_t, Σ_{i=1}^N λ_i fi(x) )    (8)

where λ_i, i = 1, ..., N, satisfies

    ∫_{x∈X} fi(x) Lφ( q^k_t, σ_i(x) f#(x) λ_i ) µ(dx) = ∫_{x∈X} fi(x) p^k(x) µ(dx)    (9)

and

    lim_{t→∞} q^k_t = q^k_∞ = arg min_{q∈Ē} Bφ(p^k; q)    (10)
Proof. Following [13], which considers discrete state distributions, we consider the continuous case; the proof is essentially identical. We verify that the function defined in (7) satisfies the three properties of the definition. Property (1) holds since l_φ(q, 0) = 0. Property (2) follows from the convexity of l_φ:

    l_φ( q, Σ_{i=1}^N λ_i fi(x) ) = l_φ( q, Σ_{i=1}^N σ_i(x)|fi(x)| λ_i )    (11)
      ≤ ∫_{x∈X} Σ_{i=1}^N (|fi(x)| / f#(x)) l_φ( q, σ_i(x) f#(x) λ_i ) µ(dx)    (12)

The rest of the proof follows exactly the proof of Proposition 4.4 of [13].

We are then able to find feasible solutions for the LMB principle by using an algorithm that combines the previous AM algorithm with a nested IS loop to calculate the forward projection.

AM-IS algorithm:

Backward projection: Compute p^k(x) = p̃(y) p^k_{α_y,λ}(z|y), which yields a_i = ∫_{x∈X} p^k(x) fi(x) µ(dx), i = 1, ..., N, for the forward projection step.
Forward projection: Perform iterations of the full parallel IS update as in (8) and (9) to obtain the parameter values λ^∞ = (λ^∞_1, ..., λ^∞_N), and set p^k_λ(x) = q^k_∞(x).

This alternating procedure will halt at a point where the three manifolds C, E and G_a have a common intersection, since we reach a stationary point in that case. Due to the nonlinearity of the manifold C, the intersection is not unique, and multiple feasible solutions may exist. We are now ready to prove the main result of this section: AM-IS can be shown to converge and hence is guaranteed to yield feasible solutions to the LMB principle.
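For the generalized KL divergence (Example 1) we have Lφ(q, v) = q e^{−v}, and with nonnegative binary features padded by a slack feature so that Σ_i fi(x) is a constant F, solving (9) reduces to a GIS-style closed-form step ∆λ_i = (1/F) log(E_q[f_i] / a_i). The toy sketch below is our illustration; the feature construction, slack-feature trick, and iteration counts are assumptions, not from the paper. It runs the AM-IS loop on a small discrete (y, z) space and checks that the feasibility gap for constraints (4) shrinks:

```python
import numpy as np

nY, nZ = 4, 3
# three hand-picked binary features f_i(y, z)
feats = np.zeros((3, nY, nZ))
feats[0, :2, :] = 1.0       # f_1 = 1 if y in {0, 1}
feats[1, :, 0] = 1.0        # f_2 = 1 if z == 0
feats[2, 1:3, 1:] = 1.0     # f_3 = 1 on a block of (y, z) pairs
Fsum = feats.sum(axis=0)
F = Fsum.max()
feats = np.concatenate([feats, (F - Fsum)[None]], axis=0)  # slack: sum_i f_i == F

p_tilde = np.array([0.4, 0.3, 0.2, 0.1])     # empirical marginal p~(y)
q0 = np.full((nY, nZ), 1.0 / (nY * nZ))      # uniform default q_0

def model(lam):
    # generalized-KL additive model: L_phi(q0, v) = q0 * exp(-v)
    return q0 * np.exp(-np.tensordot(lam, feats, axes=1))

def targets(lam):
    # backward projection p^k = p~(y) p_lambda(z|y), then its expectations a_i
    q = model(lam)
    pk = p_tilde[:, None] * (q / q.sum(axis=1, keepdims=True))
    return (feats * pk).sum(axis=(1, 2))

def gap(lam):
    e = (feats * model(lam)).sum(axis=(1, 2))
    return np.abs(e - targets(lam)).max()

lam = np.zeros(feats.shape[0])
g0 = gap(lam)
for _ in range(100):                  # AM outer loop
    a = targets(lam)                  # backward projection fixes targets a_i
    for _ in range(200):              # nested iterative scaling (forward projection)
        e = (feats * model(lam)).sum(axis=(1, 2))
        lam += np.log(e / a) / F      # GIS-style parallel update solving (9)
g1 = gap(lam)

assert g1 < g0    # constraint residual for (4) shrinks toward feasibility
```

As the text notes, only feasibility is guaranteed: monotone improvement of the objective does not necessarily hold, and different initial points may reach different feasible solutions.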
Theorem 2. The AM-IS algorithm asymptotically yields feasible solutions to the LMB principle for additive models.

Proof. By Lemmas 5 and 6, choose the closed convex set M̄ to play the role of P in Theorem 1, and choose the closed convex set Ē to play the role of Q in Theorem 1. The conclusion immediately follows.

Unlike the standard K-L divergence, for which the EM-IS algorithm can be shown to monotonically increase the likelihood during each iteration [24], monotonic improvement will not necessarily hold under general Bregman divergences.
5 Information Geometry of AM-IS
We gain further insight by considering the well known Pythagorean theorem [13] for additive models, which in the complete data case states that if there exists p_{λ*} ∈ Ḡ_a ∩ Ē then

    Bφ(p; p_λ) = Bφ(p; p_{λ*}) + Bφ(p_{λ*}; p_λ)   for all p ∈ Ḡ_a and p_λ ∈ Ē

In the incomplete data case, this theorem needs to be modified to incorporate the effect of latent variables. Unlike the case in [24], in general M is not a sub-manifold of C̄; thus the information geometric interpretation of the Pythagorean theorem needs to be slightly modified.

Theorem 3. (Pythagorean property) For all p_λ ∈ Ē and p_{λ*} ∈ C̄ ∩ Ē, there exists a p(x) ∈ M̄ such that

    Bφ(p; p_λ) = Bφ(p; p_{λ*}) + Bφ(p_{λ*}; p_λ)    (13)

Proof. For any p_{λ*} ∈ C̄ ∩ Ē, pick p(x) = p̃(y) p_{α_y,λ*}(z|y). Obviously p ∈ M̄. Now we show that for all p_λ ∈ Ē

    Bφ(p̃(y)p_{α_y,λ*}(z|y); p_λ(x)) = Bφ(p̃(y)p_{α_y,λ*}(z|y); p_{λ*}(x)) + Bφ(p_{λ*}(x); p_λ(x))

Establishing the above equation is equivalent to showing

    ∫_{x∈X} [ φ(p̃(y)p_{α_y,λ*}(z|y)) − φ(p_λ(x)) − φ'(p_λ(x)) (p̃(y)p_{α_y,λ*}(z|y) − p_λ(x)) ] µ(dx)
    = ∫_{x∈X} [ φ(p̃(y)p_{α_y,λ*}(z|y)) − φ(p_{λ*}(x)) − φ'(p_{λ*}(x)) (p̃(y)p_{α_y,λ*}(z|y) − p_{λ*}(x)) ] µ(dx)
    + ∫_{x∈X} [ φ(p_{λ*}(x)) − φ(p_λ(x)) − φ'(p_λ(x)) (p_{λ*}(x) − p_λ(x)) ] µ(dx)
Cancelling common terms leaves

    ∫_{x∈X} φ'(p_λ(x)) p̃(y)p_{α_y,λ*}(z|y) µ(dx)
    = ∫_{x∈X} φ'(p_{λ*}(x)) ( p̃(y)p_{α_y,λ*}(z|y) − p_{λ*}(x) ) µ(dx)
    + ∫_{x∈X} φ'(p_λ(x)) p_{λ*}(x) µ(dx)

Plugging

    p_λ(x)   = L( q0, Σ_{i=1}^N λ_i fi(x) )   = (φ')^{-1}( φ'(q0) + Σ_{i=1}^N λ_i fi(x) )
    p_{λ*}(x) = L( q0, Σ_{i=1}^N λ*_i fi(x) ) = (φ')^{-1}( φ'(q0) + Σ_{i=1}^N λ*_i fi(x) )

into the above equation, we then have

    ∫_{x∈X} ( φ'(q0) + Σ_{i=1}^N λ_i fi(x) ) p̃(y)p_{α_y,λ*}(z|y) µ(dx)
    = ∫_{x∈X} ( φ'(q0) + Σ_{i=1}^N λ*_i fi(x) ) ( p̃(y)p_{α_y,λ*}(z|y) − p_{λ*}(x) ) µ(dx)
    + ∫_{x∈X} ( φ'(q0) + Σ_{i=1}^N λ_i fi(x) ) p_{λ*}(x) µ(dx)

Cancelling the common terms, we are then left with

    Σ_{i=1}^N (λ_i − λ*_i) [ Σ_{y∈Ỹ} p̃(y) ∫_{z∈Z} p_{α_y,λ*}(z|y) fi(x) µ(dz) − ∫_{x∈X} p_{λ*}(x) fi(x) µ(dx) ] = 0

The term inside the brackets is 0 since p_{λ*}(x) ∈ C̄ ∩ Ē.
6 Summary

There are a number of iterative methods for performing Bregman divergence projections onto convex sets that can be used to illustrate existing supervised machine learning techniques. In this paper, we have presented a class of unsupervised statistical learning algorithms formulated in terms of minimizing Bregman divergences subject to a set of non-linear constraints that account for hidden variables. We have proposed a new alternating minimization algorithm with nested iterative scaling that asymptotically finds feasible solutions to this constrained optimization problem, and characterized its convergence and information geometry properties. We are developing this framework to provide analytical tools for transforming current supervised machine learning techniques into unsupervised counterparts. In general, a greedy search procedure [25] similar to that in AdaBoost can be developed to automatically extract hidden latent structure. Preliminary experimental results on unsupervised boosting for clustering and gender-independent speech signal analysis are encouraging.

Acknowledgement: This work is supported by MITACS and NSERC.
References

1. H. Bauschke and J. Borwein, "Joint and Separate Convexity of the Bregman Distance," in: Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications, Elsevier, 2001, pp. 23-36
2. J. Borwein and A. Lewis, "Duality relationships for entropy-like minimization problems," SIAM J. Control Optim., Vol. 29, No. 2, pp. 325-338, 1991
3. J. Borwein and A. Lewis, Convex Analysis and Nonlinear Optimization, Springer, 2000
4. L. Bregman, "The relaxation method of finding the common point of convex sets and its applications to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, Vol. 7, pp. 200-217, 1967
5. A. Buja and W. Stuetzle, "Degrees of Boosting," manuscript, 2002
6. C. Byrne and Y. Censor, "Proximity function minimization using multiple Bregman projections with applications to split feasibility and Kullback-Leibler distance minimization," Annals of Operations Research, Vol. 105, pp. 77-98, 2001
7. Y. Censor and S. Zenios, Parallel Optimization: Theory, Algorithms, and Applications, Oxford University Press, 1997
8. M. Collins, R. Schapire and Y. Singer, "Logistic regression, AdaBoost and Bregman distances," Machine Learning, Vol. 48, No. 1-3, pp. 253-285, 2002
9. I. Csiszar and G. Tusnady, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue 1, pp. 205-237, 1984
10. I. Csiszar, "Why least squares and maximum entropy?" The Annals of Statistics, Vol. 19, No. 4, pp. 2032-2066, 1991
11. I. Csiszar, "Generalized projections for non-negative functions," Acta Mathematica Hungarica, Vol. 68, No. 1-2, pp. 161-185, 1995
12. I. Csiszar, "Maxent, mathematics, and information theory," Maximum Entropy and Bayesian Methods, Edited by K. Hanson and R. Silver, pp. 35-50, Kluwer, 1996
13. S. Della Pietra, V. Della Pietra and J. Lafferty, "Duality and auxiliary functions for Bregman distances," Technical Report CMU-CS-01-109, CMU, 2001
14. A. Dempster, N. Laird and D. Rubin, "Maximum likelihood estimation from incomplete data via the EM algorithm," J. Royal Stat. Soc. B, Vol. 39, pp. 1-38, 1977
15. P. Eggermont and V. LaRiccia, "On EM-like algorithms for minimum distance estimation," Technical Report, Mathematical Sciences, University of Delaware, 1998
16. J. Friedman, T. Hastie and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Annals of Statistics, Vol. 28, No. 2, pp. 337-407, 2000
17. R. Johnson and J. Shore, "Which is the better entropy expression for speech processing: -S log S or log S?," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 1, pp. 129-137, 1984
18. J. Lafferty, S. Della Pietra and V. Della Pietra, "Statistical learning algorithms based on Bregman distances," Canadian Workshop on Information Theory, pp. 77-80, 1997
19. J. Lafferty, "Additive models, boosting, and inference for generalized divergences," Annual Conference on Computational Learning Theory, pp. 125-133, 1999
20. G. Lebanon and J. Lafferty, "Boosting and maximum likelihood for exponential models," in Advances in Neural Information Processing Systems (NIPS), 14, 2002
21. D. Luenberger, Optimization by Vector Space Methods, John Wiley & Sons, 1969
22. V. Vapnik, The Nature of Statistical Learning Theory, Springer, 2000
23. T. Zhang, "Statistical behavior and consistency of classification methods based on convex risk minimization," to appear in Annals of Statistics, 2004
24. S. Wang, D. Schuurmans and Y. Zhao, "The latent maximum entropy principle," manuscript, 2002
25. S. Wang, D. Schuurmans, A. Ghodsi and J. Rosenthal, "Unsupervised Boosting with the Latent Maximum Entropy Principle," manuscript, 2003
A Stochastic Gradient Descent Algorithm for Structural Risk Minimisation

Joel Ratsaby
University College London, Gower Street, London WC1E 6BT, United Kingdom
[email protected]
Abstract. Structural risk minimisation (SRM) is a general complexity regularisation method which automatically selects the model complexity that approximately minimises the misclassification error probability of the empirical risk minimiser. It does so by adding a complexity penalty term ε(m, k) to the empirical risk of the candidate hypotheses and then, for any fixed sample size m, minimising the sum with respect to the model complexity variable k. When learning multicategory classification there are M subsamples of sizes m_i, corresponding to the M pattern classes with a priori probabilities p_i, 1 ≤ i ≤ M. Using the usual representation of a multicategory classifier as M individual boolean classifiers, the penalty becomes Σ_{i=1}^M p_i ε(m_i, k_i). If the m_i are given then standard SRM trivially applies here by minimising the penalised empirical risk with respect to k_i, i = 1, ..., M. However, in situations where the total sample size Σ_{i=1}^M m_i needs to be minimal, one needs to also minimise the penalised empirical risk with respect to the variables m_i, i = 1, ..., M. The obvious problem is that the empirical risk can only be defined after the subsamples (and hence their sizes) are given (known). Utilising an on-line stochastic gradient descent approach, this paper overcomes this difficulty and introduces a sample-querying algorithm which extends the standard SRM principle. It minimises the penalised empirical risk not only with respect to the k_i, as standard SRM does, but also with respect to the m_i, i = 1, ..., M. The challenge here is in defining a stochastic empirical criterion which, when minimised, yields a sequence of subsample-size vectors which asymptotically achieve the Bayes-optimal error convergence rate.
1 Introduction
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 205–220, 2003.
© Springer-Verlag Berlin Heidelberg 2003

Consider the general problem of learning classification with M pattern classes, each with a class-conditional probability density f_i(x), 1 ≤ i ≤ M, x ∈ IR^d, and a priori probabilities p_i, 1 ≤ i ≤ M. The functions f_i(x), 1 ≤ i ≤ M, are assumed to be unknown, while the p_i are assumed to be known or unknown depending on the particular setting. The learner observes randomly drawn i.i.d. examples, each consisting of a pair of a feature vector x ∈ IR^d and a label y ∈ {1, 2, ..., M}, which are obtained by first drawing y from {1, ..., M} according
to a discrete probability distribution {p_1, ..., p_M} and then drawing x according to the selected probability density f_y(x). Denoting by c(x) a classifier which represents a mapping c : IR^d → {1, 2, ..., M}, the misclassification error of c is defined as the probability of misclassification of a randomly drawn x with respect to the underlying mixture probability density function f(x) = Σ_{i=1}^M p_i f_i(x). This misclassification error is commonly represented as the expected 0/1-loss, or simply as the loss, L(c) = E 1_{c(x)≠y(x)}, of c, where the expectation is taken with respect to f(x) and y(x) denotes the true label (or class origin) of the feature vector x. In general y(x) is a random variable depending on x, and only in the case of the f_i(x) having non-overlapping probability-1 supports is y(x) a deterministic function¹. The aim is to learn, based on a finite randomly drawn labelled sample, the optimal classifier, known as the Bayes classifier, which by definition has minimum loss. In this paper we pose the following question:

Question: Can the learning accuracy be improved if labelled examples are independently randomly drawn according to the underlying class-conditional probability distributions but the pattern classes, i.e., the example labels, are chosen not necessarily according to their a priori probabilities?

We answer this in the affirmative by showing that there exists a tuning of the subsample proportions which minimises a loss criterion. The tuning is relative to the intrinsic complexity of the Bayes classifier. Before continuing let us introduce some notation. We write const to denote absolute constants or constants which do not depend on other variables in the mathematical expression. We denote by {(x_j, y_j)}_{j=1}^m an i.i.d. sample of labelled examples, where m denotes the total sample size, and y_j, 1 ≤ j ≤ m, are drawn i.i.d.
and take the integer value 'i' with probability p_i, 1 ≤ i ≤ M, while the corresponding x_j are drawn according to the class-conditional probability density f_{y_j}(x). Denote by m_i the number of examples having a y-value of 'i'. Denote by m = [m_1, ..., m_M] the sample-size vector and let |m| = Σ_{i=1}^M m_i ≡ m. The notation argmin_{k∈A} g(k) for a set A means the subset (of possibly more than one element) whose elements have the minimum value of g over A. A slight abuse of notation will be made by using it for countable sets, where the notation means the subset of elements k such that² g(k) = inf_{k'} g(k'). The loss L(c) is expressed in terms of the class-conditional losses L_i(c) as L(c) = Σ_{i=1}^M p_i L_i(c), where L_i(c) = E_i 1_{c(x)≠i} and E_i is the expectation with respect to the density f_i(x). The empirical counterparts of the loss and conditional loss are L_m(c) = Σ_{i=1}^M p_i L_{i,m_i}(c), where L_{i,m_i}(c) = (1/m_i) Σ_{j: y_j=i} 1_{c(x_j)≠i},
¹ According to the probabilistic data-generation model mentioned above, only regions in the probability-1 support of the mixture distribution f(x) have a well-defined class membership.
² In that case, technically, if there does not exist a k in A such that g(k) = inf_{k'} g(k'), then we can always find arbitrarily close approximating elements k_n, i.e., ∀ε > 0 ∃N(ε) such that for n > N(ε) we have |g(k_n) − inf_{k'} g(k')| < ε.
where throughout the paper we assume the a priori probabilities are known to the learner (see Assumption 1 below).
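For concreteness, the loss decomposition L_m(c) = Σ_{i=1}^M p_i L_{i,m_i}(c) defined above can be evaluated directly from a labelled sample. The following sketch is purely illustrative (the function name, the use of NumPy, and the example threshold classifier are ours, not part of the paper's formalism):

```python
import numpy as np

def empirical_loss(classifier, xs, ys, priors):
    """Empirical loss L_m(c) = sum_i p_i * L_{i,m_i}(c), where L_{i,m_i}(c)
    is the fraction of class-i examples that c misclassifies."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    loss = 0.0
    for i, p_i in enumerate(priors, start=1):
        mask = (ys == i)
        if mask.sum() == 0:
            continue  # no examples from class i yet, so L_{i,m_i} is undefined
        preds = np.array([classifier(x) for x in xs[mask]])
        loss += p_i * np.mean(preds != i)  # p_i-weighted class-conditional error
    return loss
```

Note that, exactly as the paper remarks, the class-conditional terms are defined only once the subsamples (and hence their sizes m_i) are known.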
2 Structural Risk Minimisation
The loss L(c) depends on the unknown underlying probability distributions, hence realistically, for a learning algorithm to work, it needs to use only an estimate of L(c). For a finite class C of classifiers the empirical loss L_m(c) is a consistent estimator of L(c) uniformly for all c ∈ C; hence, provided that the sample size m is sufficiently large, an algorithm that minimises L_m(c) over C will yield a classifier ĉ whose loss L(ĉ) is an arbitrarily good approximation of the true minimum Bayes loss, denoted here as L*, provided that the optimal Bayes classifier is contained in C. The Vapnik-Chervonenkis theory (Vapnik, 1982) characterises the condition for such uniform estimation over an infinite class C of classifiers. The condition basically states that the class needs to have a finite complexity, or richness, known as the Vapnik-Chervonenkis dimension, which is defined as follows: for a class H of functions from a set X to {0, 1} and a set S = {x_1, ..., x_l} of l points in X, denote H|_S = {[h(x_1), ..., h(x_l)] : h ∈ H}. Then the Vapnik-Chervonenkis dimension of H, denoted VC(H), is the largest l such that the cardinality |H|_S| = 2^l. The method known as empirical risk minimisation represents a general learning approach which, for learning classification, minimises the 0/1 empirical loss; provided that the hypothesis class has a finite VC-dimension, the method yields a classifier ĉ with a loss asymptotically arbitrarily close to the minimum L*. As is often the case in real learning algorithms, the hypothesis class can be rich and may practically have an infinite VC-dimension, for instance the class of all two-layer neural networks with a variable number of hidden nodes. The method of Structural Risk Minimisation (SRM) was introduced by Vapnik (1982) in order to learn such classes via empirical risk minimisation.
For the purpose of reviewing existing results we limit our discussion for the remainder of this section to the case of two-category classification; thus we use m and k as scalars representing the total sample size and class VC-dimension, respectively. Let us denote by C_k a class of classifiers having a VC-dimension of k and let c*_k be the classifier which minimises the loss L(c) over C_k, i.e., c*_k = argmin_{c∈C_k} L(c). The standard setting for SRM considers the overall class C of classifiers as an infinite union of finite VC-dimension classes, i.e., C = ∪_{k=1}^∞ C_k; see for instance Vapnik (1982), Devroye et al. (1996), Shawe-Taylor et al. (1996), Lugosi & Nobel (1999), Ratsaby et al. (1996). The best performing classifier in C, denoted c*, is defined as c* = argmin_{1≤k≤∞} L(c*_k). Similarly, denote by ĉ_k the empirically-best classifier in C_k, i.e., ĉ_k = argmin_{c∈C_k} L_m(c). Denoting by k* the minimal complexity of a class which contains c*, then depending on the problem and on the type of classifiers used, k* may even be infinite, as in the case when the Bayes classifier is not contained in C. The complexity k* may be thought of as the intrinsic complexity of the Bayes classifier.
The idea behind SRM is to minimise not the pure empirical loss L_m(c_k) but a penalised version taking the form L_m(c_k) + ε(m, k), where ε(m, k) is some increasing function of k and is sometimes referred to as a complexity penalty. The classifier chosen by the criterion is then defined by

ĉ* = argmin_{1≤k≤∞} (L_m(ĉ_k) + ε(m, k))
(1)
The term ε(m, k) is proportional to the worst-case deviation between the true loss and the empirical loss uniformly over all functions in C_k. More recently there has been interest in data-dependent penalty terms for structural risk minimisation which do not have an explicit complexity factor k but are related to the class C_k by being defined as a supremum of some empirical quantity over C_k, for instance the maximum discrepancy criterion (Bartlett et al., 2002) or the Rademacher complexity (Koltchinskii, 2001). We take the penalty to be as in Vapnik (1982) (see also Devroye et al. (1996)): ε(m, k) = const √(k ln m / m), where again const stands for an absolute constant which for our purpose is not important. This bound is central to the computations of the paper³. It can be shown (Devroye et al., 1996) that for the two-pattern classification case the error rate of the SRM-chosen classifier ĉ* (which implicitly depends on the random sample of size m since it is obtained by minimising the sum in (1)) satisfies

L(ĉ*) > L(c*) + const √(k* ln m / m)    (2)

infinitely often with probability 0, where again c* is the Bayes classifier which is assumed to be in C and k* is its intrinsic complexity. The nice feature of SRM is that the selected classifier ĉ* automatically locks onto the minimal error rate as if the unknown k* were known beforehand.
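The selection rule (1) with the penalty ε(m, k) = const √(k ln m / m) can be sketched in a few lines. The following is an illustrative fragment (the function name, the choice const = 1, and the example loss table are ours; in practice the empirical losses come from empirical risk minimisation over each C_k):

```python
import math

def srm_select(empirical_losses, m, const=1.0):
    """Standard SRM: given the empirical losses of the empirical minimisers
    c_hat_k for a set of complexities k, return the k minimising the
    penalised empirical loss  L_m(c_hat_k) + const * sqrt(k ln m / m).
    `empirical_losses` maps k -> L_m(c_hat_k)."""
    def penalised(k):
        return empirical_losses[k] + const * math.sqrt(k * math.log(m) / m)
    return min(empirical_losses, key=penalised)
```

The penalty grows with k while the empirical loss typically shrinks, so the minimiser trades fit against complexity, which is how SRM "locks onto" k* without knowing it.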
3 Multicategory Classification
A classifier c(x) may be represented as a vector of M boolean classifiers b_i(x), where b_i(x) = 1 if x is a pattern drawn from class 'i' and b_i(x) = 0 otherwise. A union of such boolean classifiers forms a well-defined classifier c(x) if for each x ∈ IR^d, b_i(x) = 1 for exactly one i, i.e., ∪_{i=1}^M {x : b_i(x) = 1} = IR^d and {x : b_i(x) = 1} ∩ {x : b_j(x) = 1} = ∅ for 1 ≤ i ≠ j ≤ M. We also refer to these boolean classifiers as the component classifiers c_i(x), 1 ≤ i ≤ M, of a vector classifier c(x). The loss of a classifier c is just the weighted average of the losses of the component classifiers, i.e., L(c) = Σ_{i=1}^M p_i L(c_i), where for a boolean
³ There is actually an improved bound due to Talagrand, cf. Anthony & Bartlett (1999) Section 4.6, but when adapted for almost sure statements it yields O(√((k + ln m)/m)), which for our work is insignificantly better than O(√(k ln m / m)).
classifier c_i the loss is defined as L(c_i) = E_i 1_{c_i(x)≠1}, and the empirical loss is L_{i,m_i}(c_i) = (1/m_i) Σ_{j=1}^{m_i} 1_{c_i(x_j)≠1}, which is based on a subsample {(x_j, i)}_{j=1}^{m_i} drawn i.i.d. from pattern class 'i'. The class C of classifiers is decomposed into a structure S = S_1 × S_2 × ··· × S_M, where S_i is a nested structure (cf. Vapnik (1982)) of classes B_k, k = 1, 2, ..., of boolean classifiers b_i(x), i.e., S_1 = B_1, B_2, ..., B_{k_1}, ..., S_2 = B_1, B_2, ..., B_{k_2}, ..., up to S_M = B_1, B_2, ..., B_{k_M}, ..., where k_i ∈ ZZ_+ denotes the VC-dimension of B_{k_i} and B_{k_i} ⊆ B_{k_i+1}, 1 ≤ i ≤ M. For any fixed positive integer vector k ∈ ZZ_+^M consider the class of vector classifiers C_k = B_{k_1} × B_{k_2} × ··· × B_{k_M}. Define by G_k the subclass of C_k of classifiers c that are well-defined (in the sense mentioned above). For vectors m and k in ZZ_+^M, define ε(m, k) ≡ Σ_{i=1}^M p_i ε(m_i, k_i), where as before ε(m_i, k_i) = const √(k_i ln m_i / m_i). For any 0 < δ < 1, we denote ε(m_i, k_i, δ) = √((k_i ln m_i + ln(1/δ))/m_i) and ε(m, k, δ) = Σ_{i=1}^M p_i ε(m_i, k_i, δ). Lemma 1 below states an upper bound on the deviation between the empirical loss and the loss uniformly over all classifiers in a class G_k and is a direct application of Theorem 6.7 of Vapnik (1982). Before we state it, it is necessary to define what is meant by an increasing sequence of vectors m.

Definition 1. (Increasing sample-size sequence) A sequence m(n) of sample-size vectors is said to increase if: (a) at every n, there exists a j such that m_j(n+1) > m_j(n) and m_i(n+1) ≥ m_i(n) for 1 ≤ i ≠ j ≤ M, and (b) there exists an increasing function T(N) such that for all N > 0, n > N implies every component m_i(n) > T(N), 1 ≤ i ≤ M.

Definition 1 implies that for all 1 ≤ i ≤ M, m_i(n) → ∞ as n → ∞. We will henceforth use the notation m → ∞ to denote such an ever-increasing sequence m(n) with respect to an implicit discrete indexing variable n.
The relevance of Definition 1 will become clearer later, in particular when considering Lemma 3.

Definition 2. (Sequence-generating procedure) A sequence-generating procedure φ is one which generates increasing sequences m(n) with a fixed function T_φ(N) as in Definition 1 and also satisfies the following: for all N, N' ≥ 1 such that T_φ(N') = T_φ(N) + 1, we have |N − N'| ≤ const, where const depends only on φ.

The above definition simply states a lower bound requirement on the rate of increase of T_φ(N). We now state the uniform strong law of large numbers for the class of well-defined classifiers.

Lemma 1. (Uniform SLLN for multicategory classifier class) For any k ∈ ZZ_+^M let G_k be a class of well-defined classifiers. Consider any sequence-generating procedure as in Definition 2 which generates m(n), n = 1, ..., ∞. Let the empirical loss be defined based on examples {(x_j, y_j)}_{j=1}^{m(n)}, each drawn i.i.d. according to
an unknown underlying distribution over IR^d × {1, ..., M}. Then for arbitrary 0 < δ < 1, with probability 1 − δ, sup_{c∈G_k} |L_{m(n)}(c) − L(c)| ≤ const ε(m(n), k, δ), and the events sup_{c∈G_k} |L_{m(n)}(c) − L(c)| > const ε(m(n), k), n = 1, 2, ..., occur infinitely often with probability 0, where m(n) is any sequence generated by the procedure.

The outline of the proof is in Appendix A. We henceforth denote by c*_k the optimal classifier in G_k, i.e., c*_k = argmin_{c∈G_k} L(c), and ĉ_k = argmin_{c∈G_k} L_m(c) is the empirical minimiser over the class G_k. In Section 2 we mentioned that the intrinsic unknown complexity k* of the Bayes classifier is automatically learned by minimising the penalised empirical loss over the complexity variable k. If an upper bound of the form of (2), but based on a vector m, could be derived for the multicategory case then a second minimisation step, this time over the sample-size vector m, would improve the SRM error convergence rate. The main result of this paper (Theorem 1) shows that through a stochastic gradient descent such minimisation improves the standard SRM bound from ε(m, k*) to ε(m*, k*), where m* minimises ε(m, k*) over all possible vectors m whose magnitude |m| equals the given total sample size m. The technical challenge is to obtain this without assuming knowledge of k*. Our approach is to estimate k* and minimise an estimated criterion. We only provide sketches of proofs for the stated lemmas and theorem. The full proofs are in Ratsaby (2003). Concerning the convergence mode of random variables, upper bounds are based on the uniform strong law of large numbers, see Appendix A. Such bounds originated in the work of Vapnik (1982), for instance his Theorem 6.7. Throughout the current paper, almost sure statements are made by a standard application of the Borel-Cantelli lemma. For instance, taking m to be a scalar, the statement sup_{b∈B} |L(b) − L_m(b)| ≤ const √((r log m + log(1/δ))/m) with probability at least 1 − δ for any δ > 0 is alternatively stated as follows by letting δ_m = 1/m²: for the sequence of random variables L_m(b), uniformly over all b ∈ B, the events L(b) > L_m(b) + const √((r log m + log(1/δ_m))/m) occur infinitely often with probability 0. Concerning our, perhaps loose, use of the word optimal: whenever not explicitly stated, optimality of a classifier or of a procedure or algorithm is only with respect to minimisation of the criterion, namely, the upper bound on the loss.
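The vector penalty ε(m, k) = Σ_{i=1}^M p_i ε(m_i, k_i) defined in this section is the quantity every later minimisation acts on, so it is worth making it concrete. An illustrative sketch (function name and const = 1 are ours):

```python
import math

def penalty(m_vec, k_vec, priors, const=1.0):
    """Multicategory complexity penalty
       eps(m, k) = sum_i p_i * const * sqrt(k_i ln m_i / m_i),
    the p_i-weighted sum of the per-class penalties."""
    return sum(p * const * math.sqrt(k * math.log(mi) / mi)
               for p, k, mi in zip(priors, k_vec, m_vec))
```

Because each per-class term decreases in m_i and increases in k_i, allocating more of a fixed total sample to the classes with larger k_i lowers the sum, which is the intuition behind the second minimisation step over m.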
4 Standard SRM Loss Bounds
We will henceforth make the following assumption.

Assumption 1. The Bayes loss L* = 0 and there exists a classifier c_k in the structure S with L(c_k) = L* such that k_i < ∞, 1 ≤ i ≤ M. The a priori pattern class probabilities p_i, 1 ≤ i ≤ M, are known to the learner.
Assumption 1 essentially amounts to the Probably Approximately Correct (PAC) framework, Valiant (1984), Devroye et al. (1996) Section 12.7, but with a more relaxed constraint on the complexity of the hypothesis class C, since it is permitted to have an infinite VC-dimension. Also, in practice the a priori pattern class probabilities can be estimated easily. In assuming that the learner knows the p_i, 1 ≤ i ≤ M, one approach would have the learner allocate subsample sizes according to m_i = p_i m, followed by doing structural risk minimisation. However this does not necessarily minimise the upper bound on the loss of the SRM-selected classifier and hence is inferior in this respect to Principle 1, which is stated later. We note that if the classifier class were fixed and the intrinsic complexity k* of the Bayes classifier were known in advance, then because of Assumption 1 one would resort to a bound of the form O(k log m / m) and not the weaker bound that has a square root, see Chapter 4.5 in Anthony & Bartlett (1999). However, as mentioned before, not knowing k* and hence using structural risk minimisation as opposed to empirical risk minimisation over a fixed class leads to using the weaker bound for the complexity penalty. We next provide some additional definitions needed for the remainder of the paper. Consider the set F* = {argmin_{k∈ZZ_+^M} L(c*_k)} = {k : L(c*_k) = L* = 0}, which may contain more than one vector k. Following Assumption 1 we may define the Bayes classifier c* as the particular classifier c*_{k*} whose complexity is minimal, i.e., k* = argmin_{k∈F*} ‖k‖_∞, where ‖k‖_∞ = max_{1≤i≤M} |k_i|. Note again that there may be more than one such k*. The significance of specifying the Bayes classifier up to its complexity, rather than just saying it is any classifier having a loss L*, will become apparent later in the paper. For an empirical minimiser classifier ĉ_k define the penalised empirical loss (cf. Devroye et al. (1996)) L̃_m(ĉ_k) = L_m(ĉ_k) + ε(m, k). Consider the set F̂ = {argmin_{k∈ZZ_+^M} L̃_m(ĉ_k)}, which may contain more than one vector k. In standard
structural risk minimisation (Vapnik, 1982) the selected classifier is any one whose complexity index k ∈ F̂. This will be modified later when we introduce an algorithm which relies on the convergence of the complexity k̂ to some finite limiting complexity value with increasing⁴ m. The selected classifier is therefore defined to be one whose complexity satisfies k̂ = argmin_{k∈F̂} ‖k‖_∞. This minimal-complexity SRM-selected classifier will be denoted as ĉ_{k̂} or simply as ĉ*. We will sometimes write k̂_n and ĉ_n for the complexity and for the SRM-selected classifier, respectively, in order to explicitly show the dependence on discrete time n. The next lemma states that the complexity k̂ converges to some (not necessarily unique) k* corresponding to the Bayes classifier c*.

Lemma 2. Based on m examples {(x_j, y_j)}_{j=1}^m each drawn i.i.d. according to an unknown underlying distribution over IR^d × {1, ..., M}, let ĉ* be the chosen classifier of complexity k̂. Consider a sequence of samples ζ^{m(n)} with increasing sample-size vectors m(n) obtained by a sequence-generating procedure as in Definition 2. Then (a) the corresponding complexity sequence k̂_n converges a.s. to k*, which from Assumption 1 has finite components. (b) For any sample ζ^{m(n)} in the sequence, the loss of the corresponding classifier ĉ*_n satisfies L(ĉ*_n) > const ε(m(n), k*) infinitely often with probability 0.

⁴ We will henceforth adopt the convention that a vector sequence k̂_n → k*, a.s., means that every component of k̂_n converges to the corresponding component of k*, a.s., as m → ∞.

The outline of the proof is in Appendix B. For the more general case of L* > 0 (but two-category classifiers) the upper bound becomes L* + const ε(m, k*), cf. Devroye et al. (1996). It is an open question whether in this case it is possible to guarantee convergence of k̂_n, or some variation of it, to a finite limiting value. The previous lemma bounds the loss of the SRM-selected classifier ĉ*. As suggested earlier, we wish to extend the SRM approach to do an additional minimisation step by minimising the loss of ĉ* with respect to the sample-size vector m. In this respect, the subsample proportions may be tuned to the intrinsic Bayes complexity k*, thereby yielding an improved error rate for ĉ*. This is stated next:

Principle 1. Choose m to minimise the criterion ε(m, k*) with respect to all m such that Σ_{i=1}^M m_i = m, the latter being the a priori total sample size allocated for learning.

In general there may be other proposed criteria, just as there are many criteria for model selection based on minimisation of different upper bounds. Note that if k* were known then an optimal sample-size vector m* = [m*_1, ..., m*_M] could be computed which yields a classifier ĉ* with the best (lowest) deviation const ε(m*, k*) away from the Bayes loss. The difficulty is that k* = [k*_1, ..., k*_M] is usually unknown since it depends on the underlying unknown probability densities f_i(x), 1 ≤ i ≤ M. To overcome this we will minimise an estimate of ε(·, k*) rather than the criterion ε(·, k*) itself.
5 The Extended SRM Algorithm
In this section we extend the SRM learning algorithm to include a stochastic gradient descent step. The idea is to interleave the standard minimisation step of SRM with a new step which asymptotically minimises the penalised empirical loss with respect to the sample size. As before, m(n) denotes a sequence of sample-size vectors indexed by an integer n ≥ 0 representing discrete time. When referring to a particular ith component of the vector m(n) we write m_i(n). The algorithm initially starts with uniform sample-size proportions m_1 = m_2 = ··· = m_M = const > 0; then at each time n ≥ 1 it selects the classifier ĉ*_n defined as

ĉ*_n = argmin_{ĉ_{n,k} : k ∈ F̂_n} ‖k‖_∞    (Standard Minimisation Step)    (3)

where F̂_n = {k : L̃_n(ĉ_{n,k}) = min_{r∈ZZ_+^M} L̃_n(ĉ_{n,r})}, and for any ĉ_{n,k} which minimises L_{m(n)}(c) over all c ∈ G_k we define the penalised empirical loss as
L̃_n(ĉ_{n,k}) = L_{m(n)}(ĉ_{n,k}) + ε(m(n), k), where L_{m(n)} stands for the empirical loss based on the sample-size vector m(n) at time n. The second minimisation step is done via a query rule which selects the particular pattern class from which to draw examples as the one which minimises the stochastic criterion ε(·, k̂_n) with respect to the sample-size vector m(n). The complexity k̂_n of ĉ*_n will be shown later to converge to k*, hence ε(·, k̂_n) serves as a consistent estimator of the criterion ε(·, k*). We choose an adaptation step which changes one component of m at a time, namely, it increases the component m_{j_max}(n) which corresponds to the direction of maximum descent of the criterion ε(·, k̂_n) at time n. This may be written as

m(n + 1) = m(n) + ∆ e_{j_max}    (New Minimisation Step)    (4)
where the positive integer ∆ denotes some fixed minimisation step-size and, for any integer i ∈ {1, 2, ..., M}, e_i denotes an M-dimensional elementary vector with 1 in the ith component and 0 elsewhere. Thus at time n the new minimisation step produces a new value m(n + 1) which is used for drawing additional examples according to specific sample sizes m_i(n + 1), 1 ≤ i ≤ M.

Learning Algorithm XSRM (Extended SRM)
Let: m_i(0) = const > 0, 1 ≤ i ≤ M.
Given: (a) M uniform-size samples {ζ^{m_i(0)}}_{i=1}^M, where ζ^{m_i(0)} = {(x_j, 'i')}_{j=1}^{m_i(0)} and the x_j are drawn i.i.d. according to the underlying class-conditional probability densities f_i(x). (b) A sequence of classes G_k, k ∈ ZZ_+^M, of well-defined classifiers. (c) A constant minimisation step-size ∆ > 0. (d) Known a priori probabilities p_j, 1 ≤ j ≤ M (for defining L_m).
Initialisation: (Time n = 0) Based on ζ^{m_i(0)}, 1 ≤ i ≤ M, determine a set of candidate classifiers ĉ_{0,k} minimising the empirical loss L_{m(0)} over G_k, k ∈ ZZ_+^M, respectively. Determine ĉ*_0 according to (3) and denote its complexity vector by k̂_0. Output: ĉ*_0. Call Procedure NM: m(1) := NM(0). Let n = 1.
While (still more available examples) Do:
1. Based on the sample ζ^{m(n)}, determine the empirical minimisers ĉ_{n,k} for each class G_k. Determine ĉ*_n according to (3) and denote its complexity vector by k̂_n.
2. Output: ĉ*_n.
3. Call Procedure NM: m(n + 1) := NM(n).
4. n := n + 1.
End Do

Procedure New Minimisation (NM)
Input: Time n.
– j_max(n) := argmax_{1≤j≤M} p_j ε(m_j(n), k̂_{n,j}) / m_j(n); if there is more than one argmax then choose any one.
– Obtain: ∆ new i.i.d. examples from class j_max(n). Denote them by ζ_n.
– Update Sample: ζ^{m_{j_max(n)}(n+1)} := ζ^{m_{j_max(n)}(n)} ∪ ζ_n, while ζ^{m_i(n+1)} := ζ^{m_i(n)} for 1 ≤ i ≠ j_max(n) ≤ M.
– Return Value: m(n) + ∆ e_{j_max(n)}.

The algorithm alternates between the standard minimisation step (3) and the new minimisation step (4) repetitively until exhausting the total sample size m, which for most generality is assumed to be unknown a priori. While for any fixed i ∈ {1, 2, ..., M} the examples {(x_j, i)}_{j=1}^{m_i(n)} accumulated
up until time n are all i.i.d. random variables, the total sample {(x_j, y_j)}_{j=1}^{m(n)} consists of dependent random variables, since based on the new minimisation the choice of the particular class-conditional probability distribution used to draw examples at each time instant l depends on the sample accumulated up until time l − 1. It turns out that this dependency does not alter the results of Lemma 2. This follows from the proof of Lemma 2 and from the bound of Lemma 1, which holds even if the sample is i.i.d. only when conditioned on a pattern class, since it is the weighted average of the individual bounds corresponding to each of the pattern classes. Therefore, together with the next lemma, this implies that Lemma 2 applies to Algorithm XSRM.

Lemma 3. Algorithm XSRM is a sequence-generating procedure.

The outline of the proof is deferred to Appendix C. Next, we state the main theorem of the paper.

Theorem 1. Assume that the Bayes complexity k* is an unknown M-dimensional vector of finite positive integers. Let the step size ∆ = 1 in Algorithm XSRM, resulting in a total sample size which increases with discrete time as m(n) = n. Then the random sequence of classifiers ĉ*_n produced by Algorithm XSRM is such that the events L(ĉ*_n) > const ε(m(n), k*) or ‖m(n) − m*(n)‖_{l_1^M} > 1 occur infinitely often with probability 0, where m*(n) is the solution to the constrained minimisation of ε(m, k*) over all m of magnitude |m| = m(n).

Remark 1. In the limit of large n the bound const ε(m(n), k*) is almost minimum (the minimum being at m*(n)) with respect to all vectors m ∈ ZZ_+^M of size m(n). Note that this rate is achieved by Algorithm XSRM without knowledge of the intrinsic complexity k* of the Bayes classifier. Compare this, for instance, to uniform querying, where at each time n one queries for subsamples of the same size ∆/M from every pattern class. This leads to a different (deterministic) sequence m(n) = (∆/M)[1, 1, ..., 1] n ≡ ∆̄ n, and in turn to a sequence of classifiers ĉ_n whose loss satisfies L(ĉ_n) ≤ const ε(∆̄ n, k*) as n → ∞, where here the upper bound is not even asymptotically minimal. A similar argument holds if the proportions are based on the a priori pattern class probabilities, since in general letting m_i = p_i m does not necessarily minimise the upper bound. In Ratsaby (1998), empirical results show the inferiority of uniform sampling compared to an online approach based on Algorithm XSRM.
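The query rule at the heart of Procedure NM can be sketched compactly. The following fragment is illustrative only: it treats the complexity estimate k̂_n as given (in the full XSRM loop it is re-estimated by the standard SRM step (3) at each n), and the function names and const = 1 are ours:

```python
import math

def eps_i(m_i, k_i, const=1.0):
    # per-class penalty eps(m_i, k_i) = const * sqrt(k_i ln m_i / m_i)
    return const * math.sqrt(k_i * math.log(m_i) / m_i)

def nm_step(m_vec, k_hat, priors, delta=1):
    """Procedure NM: query `delta` new examples from the class j maximising
    p_j * eps(m_j, k_hat_j) / m_j (the direction of steepest descent of the
    penalty) and return the updated sample-size vector m(n) + delta*e_j."""
    j_max = max(range(len(m_vec)),
                key=lambda j: priors[j] * eps_i(m_vec[j], k_hat[j]) / m_vec[j])
    new_m = list(m_vec)
    new_m[j_max] += delta
    return new_m
```

Note that only the sample-size bookkeeping is shown; drawing the ∆ new examples from class j_max and updating the empirical minimisers is left out.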
6 Proving Theorem 1
The proof of Theorem 1 is based on Lemma 2 and on two additional lemmas, Lemma 4 and Lemma 5, which deal with the convergence property of the new minimisation step of Algorithm XSRM. The proof is outlined in Appendix D. Our approach is to show that the adaptation step used in the new minimisation step follows from the minimisation of the deterministic criterion ε(m, k*) with a known k*. Letting t, as well as n, denote discrete time t = 1, 2, ..., we adopt the notation m(t) for a deterministic sample-size sequence governed by the deterministic criterion ε(m, k*), where k* is taken to be known. We write m(n) to denote the stochastic sequence governed by the random criterion ε(m, k̂_n). Thus t or n distinguish between a deterministic or stochastic sample sequence, m(t) or m(n), respectively. We start with the following definition.

Definition 3. (Optimal trajectory) Let m(t) be any positive integer-valued function of t which denotes the total sample size at time t. The optimal trajectory is a set of vectors m*(t) ∈ ZZ_+^M, indexed by t ∈ ZZ_+, defined as m*(t) = argmin_{m∈ZZ_+^M : |m|=m(t)} ε(m, k*).
First let us solve the following constrained minimisation problem. Fix a total sample size m and minimise the error ε(m, k*) under the constraint that Σ_{i=1}^M m_i = m. This amounts to minimising ε(m, k*) + λ(Σ_{i=1}^M m_i − m) over m and λ. Denote the gradient by g(m, k*) = ∇ε(m, k*). Then the above is equivalent to solving g(m, k*) + λ[1, 1, ..., 1] = 0 for m and λ. The vector-valued function g(m, k*) may be approximated by g(m, k*) ≈ [−p_1 ε(m_1, k*_1)/(2m_1), −p_2 ε(m_2, k*_2)/(2m_2), ..., −p_M ε(m_M, k*_M)/(2m_M)], where we used the approximation 1 − 1/ln m_i ≈ 1 for 1 ≤ i ≤ M. We then obtain the set of equations 2λ* m*_i = p_i ε(m*_i, k*_i), 1 ≤ i ≤ M, and λ* = ε(m*, k*)/(2m). We are interested not in obtaining a solution for a fixed m but in obtaining, using local gradient information, a sequence of solutions for the sequence of minimisation problems corresponding to an increasing sequence of total sample-size values m(t). Applying the New Minimisation procedure of Algorithm XSRM to the deterministic criterion ε(m, k*), we have an adaptation rule which modifies the sample-size vector m(t) at time t in the direction of steepest descent of ε(m, k*). This yields j*(t) = argmax_{1≤j≤M} p_j ε(m_j(t), k*_j)/m_j(t), which means we let m_{j*(t)}(t + 1) = m_{j*(t)}(t) + ∆ while the remaining components of m(t) remain unchanged, i.e., m_j(t + 1) = m_j(t), ∀j ≠ j*(t). The next lemma states that this rule achieves the desired result, namely, the deterministic sequence m(t) converges to the optimal trajectory m*(t).

Lemma 4. For any initial point m(0) ∈ IR^M satisfying m_i(0) ≥ 3, for a fixed positive ∆, there exists some finite integer 0 < N < ∞ such that for all discrete time t > N the trajectory m(t) corresponding to a repeated application of the
J. Ratsaby
adaptation rule m_{j*(t)}(t+1) = m_{j*(t)}(t) + Δ is no farther than Δ (in the l₁^M norm) from the optimal trajectory m*(t).

Outline of Proof: Recall that ε(m, k*) = Σ_{i=1}^M ε_i(m_i, k_i*), where ε_i(m_i, k_i) is the per-subclass criterion with a penalty of order p_i √(k_i ln m_i / (2m_i)). Denote by x_i = ∂ε(m, k*)/∂m_i, 1 ≤ i ≤ M, which is approximately −p_i(m_i, k_i*)/(2m_i), and note that dx_i/dm_i ≈ −(3/2)(x_i/m_i), 1 ≤ i ≤ M. There is a one-to-one correspondence between the vectors x and m, thus we may refer to the optimal trajectory also in x-space. Consider the set T = {x = c[1, 1, …, 1] ∈ ℝ₊^M : c ∈ ℝ₊} and refer to T′ as the corresponding set in m-space. Define the Lyapunov function V(x(t)) = V(t) = (x_max(t) − x_min(t))/x_min(t), where for any vector x ∈ ℝ₊^M, x_max = max_{1≤i≤M} x_i and x_min = min_{1≤i≤M} x_i, and write m_max, m_min for the elements of m with the same index as x_max, x_min, respectively. Denote by V̇ the derivative of V with respect to t. Using standard analysis it can be shown that if x ∉ T then V(x) > 0 and V̇(x) < 0, while if x ∈ T then V(x) = 0 and V̇(x) = 0. This means that as long as m(t) is not on the optimal trajectory, V(t) decreases. To show that the trajectory is an attractor, V(t) is shown to decrease to zero fast enough, using the fact that V(t) ≤ const · t^{−3/2}. Hence as t → ∞, the distance between m(t) and the set T′ satisfies dist(m(t), T′) → 0, where dist(x, T) = inf_{y∈T} ‖x − y‖_{l₁^M} and ‖·‖_{l₁^M} denotes the l₁ vector norm on ℝ^M. It is then easy to show that for all large t, m(t) is no farther than Δ from m*(t).

We now show that the same adaptation rule may also be used in the setting where k* is unknown. The next lemma states that even when k* is unknown it is possible, by using Algorithm XSRM, to generate a stochastic sequence which asymptotically converges to the optimal trajectory m*(n) (again, the use of n instead of t just means we have a random sequence m(n) and not a deterministic sequence m(t) as was investigated above).
Lemma 5. Fix any Δ ≥ 1 as the step size used by Algorithm XSRM. Given a sample-size vector sequence m(n), n → ∞, generated by Algorithm XSRM, assume that k̂_n → k* almost surely. Let m*(n) be the optimal trajectory as in Definition 3. Then the events ‖m(n) − m*(n)‖_{l₁^M} > Δ occur infinitely often with probability 0.

Outline of Proof: From Lemma 3, the sequence m(n) generated by Algorithm XSRM is an increasing sample-size sequence. Therefore by Lemma 2 we have k̂_n → k*, a.s., as n → ∞. This means that P(∃n > N : |k̂_n − k*| > η) = δ_N(η), where δ_N(η) → 0 as N → ∞. It follows that for all δ > 0 there is a finite N(δ, η) ∈ ℤ₊ such that with probability 1 − δ, for all n ≥ N(η, δ), k̂_n = k*. It follows that with the same probability, for all n ≥ N, the criterion ε(m, k̂_n) = ε(m, k*) uniformly over all m ∈ ℤ₊^M, and hence the trajectory m(n) taken by Algorithm XSRM, governed by the criterion ε(·, k̂_n), equals the trajectory m(t), t ∈ ℤ₊, taken by minimising the deterministic criterion ε(·, k*). Moreover, this probability 1 − δ goes to 1 as N → ∞ by the a.s. convergence of k̂_n to k*. By Lemma 4, there exists an N′ < ∞ such that for all discrete time t > N′, ‖m(t) − m*(t)‖_{l₁^M} ≤ Δ. Let N′′ =
A Stochastic Gradient Descent Algorithm for Structural Risk Minimisation
217
max{N, N′}; then P(∃n > N′′ : k̂_n ≠ k* or ‖m(t)|_{t=n} − m*(t)|_{t=n}‖_{l₁^M} > Δ) = δ_{N′′}, where δ_{N′′} → 0 as N′′ → ∞. The latter means that the event {k̂_n ≠ k* or ‖m(n) − m*(n)‖_{l₁^M} > Δ} occurs infinitely often with probability 0. The statement of the lemma then follows.
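As an illustration, the adaptation rule analysed in Lemmas 4 and 5 can be simulated numerically. The sketch below is our own simplification (the quantities p_j are frozen constants rather than the data-dependent p_j(m_j, k_j*) of the paper): it greedily adds Δ examples to the component maximising p_j/m_j, and the allocation then tends to the proportional solution 2λ* m_i* = p_i implied by the Lagrangian derivation above.

```python
def allocate(p, m0, delta, steps):
    """Repeatedly add `delta` examples to the component j* maximising
    p_j / m_j, i.e. the direction of steepest descent of the criterion.
    Returns the final sample-size vector."""
    m = list(m0)
    for _ in range(steps):
        j_star = max(range(len(m)), key=lambda j: p[j] / m[j])
        m[j_star] += delta
    return m
```

Running it with p = [1, 2, 4] from m(0) = [3, 3, 3] shows the sample fractions m_j/Σm_i approaching p_j/Σp_i, i.e. the trajectory converges to the set T′ of Lemma 4.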
Appendix

In this section we outline the proofs. Complete proofs appear in Ratsaby (2003).
A Proof Outline of Lemma 1
For a class of boolean classifiers B_r of VC-dimension r it is known (cf. Devroye et al. (1996) ch. 6, Vapnik (1982) Theorem 6.7) that a bound on the deviation between the loss and the empirical loss, uniformly over all classifiers b ∈ B_r, is sup_{b∈B_r} |L(b) − L_m(b)| ≤ const √((r ln m + ln(1/δ))/m) with probability 1 − δ, where m denotes the size of the random sample used for calculating the empirical loss L_m(b). Choosing for instance δ_m = 1/m² implies that the bound const √(r ln m / m) (with a different constant) fails to hold infinitely often with probability 0. We will refer to this as the uniform strong law of large numbers result, and we note that this bound was defined earlier as ε(m, r). This result is used together with an application of the union bound, which reduces the probability P(sup_{c∈C_k} |L(c) − L_m(c)| > ε(m, k, δ)) to P(∃c ∈ ∪_{i=1}^M C_{k_i} : |L(c) − L_{i,m_i}(c)| > ε(m_i, k_i, δ)), which is bounded from above by Mδ. The first part of the lemma then follows since the class of well-defined classifiers G_k is contained in the class C_k.

For the second part of the lemma, by the premise consider any fixed complexity vector k and any sequence-generating procedure φ according to Definition 2. Define the following set of sample-size vector sequences: A_N ≡ {m(n) : n > N, m(n) is generated by φ}. As the space is discrete, note that for any finite N the set A_N contains all possible paths except a finite number of length-N paths. The proof proceeds by showing that the events E_n ≡ {sup_{c∈G_k} L(c) − L_{m(n)}(c) > ε(m(n), k, δ) : m(n) generated by φ} occur infinitely often with probability 0. To show this, we first choose δ to be δ*_m = 1/max_{1≤j≤M} m_j², and then reduce P(∃m(n) ∈ A_N : sup_{c∈G_k} L(c) − L_{m(n)}(c) > ε(m(n), k, δ*_{m(n)})) to Σ_{j=1}^M Σ_{m_j > T_φ(N)} 1/m_j². Then use the fact that m(n) ∈ A_N implies there exists a point m such that min_{1≤j≤M} m_j > T_φ(N), where T_φ(N) is increasing with N; hence the tail sets {m_j : m_j > T_φ(N)} shrink, which implies that the above double sum strictly decreases with increasing N. It then follows that lim_{N→∞} P(∃m(n) ∈ A_N : sup_{c∈G_k} L(c) − L_{m(n)}(c) > ε(m(n), k)) = 0, which implies that the events E_n occur i.o. with probability 0.
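The uniform deviation bound and the choice δ_m = 1/m² can be illustrated numerically. The sketch below is our own (in particular the constant const = 1 is an assumption): with δ_m = 1/m² we have ln(1/δ_m) = 2 ln m, so the bound becomes √((r + 2) ln m / m), the "different constant" mentioned above, and Σ_m δ_m is finite, which is what the Borel–Cantelli argument uses.

```python
import math

def vc_bound(m, r, delta):
    """Uniform deviation bound sqrt((r ln m + ln(1/delta)) / m),
    with the unspecified constant taken as 1."""
    return math.sqrt((r * math.log(m) + math.log(1.0 / delta)) / m)

# Partial sum of delta_m = 1/m^2: bounded by pi^2/6, so the violation
# probabilities are summable and Borel-Cantelli applies.
tail = sum(1.0 / m**2 for m in range(2, 100000))
```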
B Proof Outline of Lemma 2
First we sketch the proof of the convergence k̂ → k*, where k* is some vector of minimal norm over all vectors k for which L(c*_k) = 0. We henceforth denote, for a vector k ∈ ℤ₊^M, ‖k‖_∞ = max_{1≤i≤M} |k_i|. All convergence statements are made with respect to the increasing sequence m(n); the indexing variable n is sometimes suppressed for simpler notation.

The set F̂ defined in Section 4 may be rewritten as F̂ = {k : L̃(ĉ_k) = L̃(ĉ*)}. The cardinality of F̂ is finite, since any k having at least one component k_i larger than some constant satisfies L̃(ĉ_k) > L̃(ĉ*), because ε(m, k) will be larger than L̃(ĉ*); this implies that the set of k for which L̃(ĉ_k) ≤ L̃(ĉ*) is finite. Now, for any α > 0, define F̂_α = {k : L̃(ĉ_k) ≤ L̃(ĉ*) + α}. Recall that F* was defined in Section 4 as F* = {k : L(c*_k) = L* = 0} and define F*_α = {k : L(c*_k) ≤ L* + α}, where the Bayes loss is L* = 0. Recall that the chosen classifier ĉ* has a complexity k̂ = argmin_{k∈F̂} ‖k‖_∞. By Assumption 1, there exists a k* = argmin_{k∈F*} ‖k‖_∞ all of whose components are finite. The proof proceeds by first showing that F̂ ⊄ F*_{ε(m,k*)} i.o. with probability 0, then proving that k* ∈ F̂ and that for all m large enough, k* = argmin_{k∈F*_{ε(m,k*)}} ‖k‖_∞. It then follows that ‖k̂‖_∞ ≠ ‖k*‖_∞ i.o. with probability zero (where k̂ does not necessarily equal k*), and that k̂ → k* componentwise a.s. as m → ∞ (or equivalently, as n → ∞, since the sequence m(n) is increasing), where k* = argmin_{k∈F*} ‖k‖_∞ is not necessarily unique but has all components finite. This proves the first part of the lemma.

The proof of the second part of the lemma follows similarly to the proof of Lemma 1. Start with P(∃m(n) ∈ A_N : L(ĉ*_n) > ε(m(n), k*)), which after some manipulation is shown to be bounded from above by the sum Σ_{j=1}^M Σ_{k_j=1}^∞ P(∃m_j > T_φ(N) : L(ĉ_{k_j}) > L_{j,m_j}(ĉ_{k_j}) + ε(m_j, k_j)). Then make use of the uniform strong law result (see the first paragraph of Appendix A) and choose a const such that ε(m_j, k_j) = const √(k_j ln(e m_j)/m_j) ≥ 3 √(k_j ln m_j / m_j). Using the upper bound on the growth function (cf. Vapnik (1982) Section 6.9, Devroye et al. (1996) Theorem 13.3), we have for some absolute constant κ > 0,

P( L(ĉ_{k_j}) > L_{j,m_j}(ĉ_{k_j}) + ε(m_j, k_j) ) ≤ κ m_j^{k_j} e^{−m_j ε²(m_j, k_j)},

which is bounded from above by κ (1/m_j²) e^{−3k_j} for k_j ≥ 1. The bound on the double sum then becomes 2κ Σ_{j=1}^M Σ_{m_j > T_φ(N)} 1/m_j², which is strictly decreasing with N, as in the proof of Lemma 1. It follows that the events {L(ĉ*_n) > ε(m(n), k*)} occur infinitely often with probability 0.
C Proof Outline of Lemma 3
Note that for this proof we cannot use Lemma 1 or parts of Lemma 2 since they are conditioned on having a sequence-generating procedure. Our approach here relies on the characteristics of the SRM-selected complexity kˆn which is shown to be bounded uniformly over n based on Assumption 1. It follows that
by the stochastic adaptation step of Algorithm XSRM the generated sample-size sequence m(n) is not only increasing but has a minimum rate of increase, as in Definition 2. This establishes that Algorithm XSRM is a sequence-generating procedure. The proof starts by showing that for an increasing sequence m(n), as in Definition 1, for all n there is some constant 0 < ρ < ∞ such that ‖k̂_n‖_∞ < ρ. It then follows that for all n, k̂_n is bounded by a finite constant independent of n. So for a sequence generated by the new minimisation procedure in Algorithm XSRM, the quantities p_j(m_j(n), k̂_{n,j})/m_j(n) are bounded by p_j(m_j(n), k̃_j)/m_j(n) for some finite k̃_j, 1 ≤ j ≤ M, respectively. It can be shown by simple analysis of the function ε(m, k) that for a fixed k the ratio ∂²ε(m_j, k_j)/∂m_j² ÷ ∂²ε(m_i, k_i)/∂m_i² converges to a constant dependent on k_i and k_j with increasing m_i, m_j. Hence the adaptation step, which always increases one of the sub-samples, yields increments Δm_i and Δm_j which are no farther apart than a constant multiple of each other for all n, for any pair 1 ≤ i, j ≤ M. Hence a sequence m(n) generated by Algorithm XSRM satisfies the following: it is increasing in the sense of Definition 1, namely, for all N > 0 there exists a T_φ(N) such that for all n > N every component m_j(n) > T_φ(N), 1 ≤ j ≤ M. Furthermore, its rate of increase is bounded from below, namely, there exists a const > 0 such that for all N, N′ > 0 satisfying T_φ(N′) = T_φ(N) + 1 we have |N − N′| ≤ const. It follows that Algorithm XSRM is a sequence-generating procedure according to Definition 2.
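The claim about the ratio of second derivatives can be checked numerically for the prototypical penalty ε(m, k) = √(k ln m / m) (our stand-in, matching the form used in Appendix B): since ε scales as √k times a function of m alone, the ratio of second derivatives in m for two complexities k_j, k_i equals √(k_j/k_i) at any common m.

```python
import math

def eps(m, k):
    """Prototypical penalty sqrt(k ln m / m)."""
    return math.sqrt(k * math.log(m) / m)

def second_diff(f, m, h=1.0):
    """Central finite-difference approximation of f''(m)."""
    return (f(m + h) - 2.0 * f(m) + f(m - h)) / h**2

# Ratio of second derivatives for k_j = 4 vs k_i = 1 at m = 10^4:
# analytically sqrt(4/1) = 2, independent of m.
ratio = (second_diff(lambda m: eps(m, 4.0), 10000.0)
         / second_diff(lambda m: eps(m, 1.0), 10000.0))
```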
D Proof Outline of Theorem 1
The classifier ĉ*_n is chosen according to (3) based on a sample-size vector m(n) generated by Algorithm XSRM, which is a sequence-generating procedure (by Lemma 3). From Lemma 2, L(ĉ*_n) > const · ε(m(n), k*) i.o. with probability 0, and since Δ = 1 it follows from Lemma 5 that ‖m(n) − m*(n)‖_{l₁^M} > 1 i.o. with probability 0, where m*(n) = argmin_{m : Σ_i m_i = m(n)} ε(m, k*).
References

Anthony M., Bartlett P. L., (1999), "Neural Network Learning: Theoretical Foundations", Cambridge University Press, UK.
Bartlett P. L., Boucheron S., Lugosi G., (2002), Model Selection and Error Estimation, Machine Learning, Vol. 48, No. 1–3, p. 85–113.
Devroye L., Györfi L., Lugosi G., (1996), "A Probabilistic Theory of Pattern Recognition", Springer-Verlag.
Koltchinskii V., (2001), Rademacher Penalties and Structural Risk Minimization, IEEE Trans. on Info. Theory, Vol. 47, No. 5, p. 1902–1914.
Lugosi G., Nobel A., (1999), Adaptive Model Selection Using Empirical Complexities, Annals of Statistics, Vol. 27, p. 1830–1864.
Ratsaby J., (1998), Incremental Learning with Sample Queries, IEEE Trans. on PAMI, Vol. 20, No. 8, p. 883–888.
Ratsaby J., (2003), On Learning Multicategory Classification with Sample Queries, Information and Computation, Vol. 185, No. 2, p. 298–327.
Ratsaby J., Meir R., Maiorov V., (1996), Towards Robust Model Selection using Estimation and Approximation Error Bounds, Proc. 9th Annual Conference on Computational Learning Theory, p. 57, ACM, New York, N.Y.
Shawe-Taylor J., Bartlett P., Williamson R., Anthony M., (1996), A Framework for Structural Risk Minimisation, NeuroCOLT Technical Report Series, NC-TR-96-032, Royal Holloway, University of London.
Valiant L. G., (1984), A Theory of the Learnable, Comm. ACM, Vol. 27, No. 11, p. 1134–1142.
Vapnik V. N., (1982), "Estimation of Dependences Based on Empirical Data", Springer-Verlag, Berlin.
On the Complexity of Training a Single Perceptron with Programmable Synaptic Delays

Jiří Šíma

Department of Theoretical Computer Science, Institute of Computer Science, Academy of Sciences of the Czech Republic, P. O. Box 5, 182 07 Prague 8, Czech Republic
[email protected]
Abstract. We consider a single perceptron N with synaptic delays, which generalizes a simplified model of a spiking neuron: not only is the time that a pulse needs to travel through a synapse taken into account, but the input firing rates may also take several different levels. A synchronization technique is introduced so that the results concerning the learnability of spiking neurons with binary delays also apply to N with arbitrary delays. In particular, the consistency problem for N with programmable delays, and its approximation version, prove to be NP-hard. It follows that the perceptrons with programmable synaptic delays are not properly PAC-learnable and that the spiking neurons with arbitrary delays do not allow robust learning unless RP = NP. In addition, we show that the representation problem for N, which is the question of whether an n-variable Boolean function given in DNF (or as a disjunction of O(n) threshold gates) can be computed by a spiking neuron, is co-NP-hard.
1 Perceptrons with Synaptic Delays
Neural networks establish an important class of learning models that are widely applied in practice to solving artificial intelligence tasks [12]. We consider only a single (perceptron) neuron N having n analog inputs that are encoded by firing rates x₁, …, x_n ∈ [−1, 1]. Here the input values are normalized, but any bounded domain [−a, a] for a positive real a ∈ ℝ₊ can replace [−1, 1] without loss of generality [21]. As usual, each input i (1 ≤ i ≤ n) is associated with a real synaptic weight w_i ∈ ℝ. In addition, N receives the i-th analog input in the form of a unit-length rectangular pulse (spike) of height |x_i| (for x_i < 0, upside down). This pulse travels through the i-th synapse in continuous time, producing a synaptic time delay d_i ∈ ℝ₊⁰, which represents a nonnegative real parameter individual to each input 1 ≤ i ≤ n. Taking these delays into account, the current i-th input x_i(t) ∈ [−1, 1] to N at continuous time t ≥ 0 can be expressed as

x_i(t) = x_i for t ∈ D_i, and x_i(t) = 0 otherwise,   (1)
Research partially supported by project LN00A056 of The Ministry of Education of the Czech Republic.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 221–233, 2003. © Springer-Verlag Berlin Heidelberg 2003
J. Šíma
where D_i = [d_i, d_i + 1) is a time interval of unit length during which N is influenced by the spike from input i. This determines the real excitation

ξ(t) = w₀ + Σ_{i=1}^n w_i x_i(t)   (2)
for N at time instant t ≥ 0 as a weighted sum of current inputs, including a real bias weight w₀ ∈ ℝ. The real output y(t) ∈ ℝ of N at continuous time t ≥ 0 is computed by applying an activation function σ : ℝ → ℝ to the excitation, i.e.

y(t) = σ(ξ(t)).   (3)

For binary outputs y(t) ∈ {0, 1}, the Heaviside activation function

σ(ξ) = 1 for ξ ≥ 0, σ(ξ) = 0 for ξ < 0   (4)
is usually employed. In this case, the output protocol can be defined so that N with weights w₀, …, w_n and delays d₁, …, d_n computes a neuron function y_N : [−1, 1]^n → {0, 1}, defined for every input (x₁, …, x_n) ∈ [−1, 1]^n by y_N(x₁, …, x_n) = 1 iff there exists a time instant t ≥ 0 such that y(t) = 1. Similarly, the logistic sigmoid

σ_L(ξ) = 1/(1 + e^{−ξ}),   (5)
which is well known from back-propagation learning [26], produces analog outputs y(t) ∈ [0, 1], whereas the output protocol can specify a time instant t_out ≥ 0 when the resulting output is read, that is, y_N(x₁, …, x_n) = y(t_out). Unless otherwise stated we assume that neuron N employs the Heaviside activation function (4). By restricting certain parameters in the preceding definition of N we obtain several computational units which are widely used in neurocomputing. For the classical perceptrons [25] all synaptic delays are zero, i.e. d_i = 0 for i = 1, …, n, and also t_out = 0 when the logistic sigmoid (5) is employed [26]. Alternatively, assuming spikes with a uniform firing rate, e.g. x_i ∈ {0, 1} for i = 1, …, n, neuron N coincides with a simplified model of a spiking neuron with binary coded inputs, which was introduced and analyzed in [20]. Hence, the computational power of N computing Boolean functions is the same as that of the spiking neuron with binary coded inputs [27] (cf. Section 4). In addition, the VC-dimension Θ(n log n) of the spiking neuron still applies to N with n analog inputs, as can easily be verified by following the argument in [20]. From this point of view, N represents a generalization of the spiking neuron in which temporal delays are combined with the firing rates of perceptron units. It follows that biological motivations for spiking neurons [10,19] partially apply also to neuron N. For example, it is known that the synaptic delays are
tuned in biological neural systems through a variety of mechanisms. On the other hand, the underlying computational model is still sufficiently simple, providing easy silicon implementation in pulsed VLSI [19]. In this paper we deal with the computational complexity of training a single neuron N with programmable synaptic delays. The article is organized as follows. In Section 2, the so-called consistency problem proves to be NP-hard for N, which implies that the perceptrons with delays are not properly PAC-learnable unless RP = NP. Furthermore, it is shown in Section 3 that even approximate training can be hard for N with binary firing rates, which means that the spiking neurons with binary coded inputs do not allow robust learning unless RP = NP. In addition, the representation problem for spiking neurons is proved to be co-NP-hard in Section 4.
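The neuron function y_N defined above (with the Heaviside activation) can be evaluated exactly: the excitation ξ(t) is piecewise constant, changing only at the interval endpoints d_i and d_i + 1, so it suffices to test ξ at those finitely many breakpoints. The following sketch illustrates this; the function name and data layout are our own, not from the paper.

```python
def neuron_output(w0, w, d, x):
    """Compute y_N(x) for a perceptron with synaptic delays:
    y_N = 1 iff the excitation xi(t) = w0 + sum_i w_i x_i(t) is >= 0
    at some time t >= 0.  xi(t) is piecewise constant, so checking
    every breakpoint (plus t = 0) suffices."""
    times = sorted({0.0} | {di for di in d} | {di + 1.0 for di in d})
    for t in times:
        # input i contributes x_i exactly when t lies in D_i = [d_i, d_i + 1)
        xi = w0 + sum(wi * xv for wi, xv, di in zip(w, x, d)
                      if di <= t < di + 1.0)
        if xi >= 0:
            return 1
    return 0
```

With all delays zero this reduces to a classical perceptron thresholding w₀ + Σ_i w_i x_i; with distinct delays, spikes that never overlap cannot jointly push the excitation above threshold, which is the effect the synchronization technique of Section 2 exploits.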
2 A Single Perceptron with Delays Is Not Learnable
The computational complexity of training a neuron can be analyzed by using the consistency (loading) problem [17], which is the problem of finding the neuron parameters for a training task so that the neuron function is perfectly consistent with all training data. For example, an efficient algorithm for the consistency problem is required within the proper PAC learning framework [5], besides the polynomial VC-dimension that common neural network models usually possess [3,24,29]. Therefore, several learning heuristics have been proposed for networks of spiking neurons, e.g. spike-propagation [6]. On the other hand, NP-hardness of this problem implies that the neuron is not properly PAC learnable (i.e. for any training data that can be loaded into the neuron) under the generally accepted complexity-theoretic assumption RP ≠ NP [22]. An almost exhaustive list of such NP-hardness results for feedforward perceptron networks was presented in [28]. We define a training set

T = {(x^k ; b^k) : x^k = (x^k₁, …, x^k_n) ∈ [−1, 1]^n, b^k ∈ {0, 1}, k = 1, …, m}   (6)

containing m training examples, each composed of an n-dimensional input x^k from [−1, 1]^n labeled with the desired scalar output value b^k from {0, 1}, corresponding to negative and positive examples. The decision version of the consistency problem is formulated as follows:

Consistency Problem for Neuron N (CPN)
Instance: A training set T for N having n inputs.
Question: Are there weights w₀, …, w_n and delays d₁, …, d_n for N such that y_N(x) = b for every training example (x; b) ∈ T?

For ordinary perceptrons with zero delays, i.e. d_i = 0 for i = 1, …, n, the consistency problem is solvable in polynomial time by linear programming, although this problem restricted to binary weights is NP-complete [22]. However, already for binary delays d_i ∈ {0, 1} the consistency problem becomes NP-complete, even for spiking neurons having binary firing rates x_i ∈ {0, 1} and
fixed weights [20]. This implies that neuron N with binary delays is not properly PAC learnable unless RP = NP. The result generalizes also to bounded delay values d_i ∈ {0, 1, …, c} for fixed c ≥ 2. For the spiking neurons with unbounded delays, however, NP-hardness of the consistency problem was listed among open problems [20]. In this section we prove that the consistency problem is NP-hard for a single perceptron N with arbitrary delays, which partially answers the previous open question, provided that several levels of firing rates are allowed. For this purpose a synchronization technique is introduced whose main idea can be described as follows. The consistency of a negative example (x₁, …, x_n; 0) means that for every subset of inputs I ⊆ {1, …, n} whose spikes may simultaneously influence N (i.e. ∩_{i∈I} D_i ≠ ∅) the corresponding excitation must satisfy w₀ + Σ_{i∈I} w_i x_i < 0. At the same time, by using the consistency of other (mostly positive) training examples we can enforce w₀ + Σ_{i∈J} w_i x_i ≥ 0 for some J ⊆ {1, …, n}. In this way we ensure that N is not simultaneously influenced by the spikes from inputs J, that is, ∩_{i∈J} D_i = ∅, which is then exploited for the synchronization of the input spikes.

Theorem 1. CPN is NP-hard.

Proof. In order to achieve the NP-hardness result, the following variant of the set splitting problem, which is known to be NP-complete [9], will be reduced to CPN in polynomial time.

3-Set-Splitting Problem (3SSP)
Instance: A finite set S = {s₁, …, s_p} and a collection C = {c_ℓ ⊆ S ; |c_ℓ| = 3, ℓ = 1, …, r} of three-element subsets c_ℓ of S.
Question: Is there a partition of S into two disjoint subsets S₁ and S₂, i.e. S = S₁ ∪ S₂ where S₁ ∩ S₂ = ∅, such that c_ℓ ⊄ S₁ and c_ℓ ⊄ S₂ for every ℓ = 1, …, r?

The 3SSP problem was also used for proving the result restricted to binary delays [20]. The above-described synchronization technique generalizes the proof to arbitrary delays.
Given a 3SSP instance ⟨S, C⟩, we construct a training set T for neuron N with n inputs, where n = 2p + 2. The input firing rates of training examples exploit only seven levels from {−1, −1/4, −1/8, 0, 3/8, 3/4, 1} ⊆ [−1, 1]. The training examples included in T are the following (each is written with its nonzero coordinates and their positions; all remaining coordinates are 0):

(0, …, 0, 3/4, 0, …, 0 ; 1)  with 3/4 at position 2i−1, for i = 1, …, p,   (7)
(0, …, 0, −1/4, 0, …, 0 ; 1)  with −1/4 at position 2i, for i = 1, …, p,   (8)
(0, …, 0, 3/8, −1/8, 0, …, 0 ; 0)  with 3/8, −1/8 at positions 2i−1, 2i, for i = 1, …, p,   (9)
(0, …, 0, −1/4, 0 ; 1)  with −1/4 at position 2p+1,   (10)
(0, …, 0, −1/4 ; 1)  with −1/4 at position 2p+2,   (11)
(0, …, 0, −1/8, −1/8 ; 0)  with −1/8, −1/8 at positions 2p+1, 2p+2,   (12)
(0, …, 0, 1, 0, …, 0 ; 1)  with 1 at position 2i−1, for i = 1, …, p,   (13)
(0, …, 0, 1, 0, …, 0, 1, 1 ; 0)  with 1 at positions 2i−1, 2p+1, 2p+2, for i = 1, …, p,   (14)
(0, …, 0, −1, 0, …, 0 ; 1)  with −1 at position 2i, for i = 1, …, p,   (15)
(0, …, 0, −1, 0, …, 0, 1, 1 ; 0)  with −1 at position 2i and 1 at positions 2p+1, 2p+2, for i = 1, …, p,   (16)

and

(0, …, 0, 1, 1, 0, …, 0, 1, 1, 0, …, 0, 1, 1, 0, …, 0 ; 0)  with 1 at positions 2i−1, 2i, 2j−1, 2j, 2k−1, 2k,   (17)

for each c_ℓ = {s_i, s_j, s_k} ∈ C (1 ≤ ℓ ≤ r). The number of training examples is |T| = 7p + r + 3, and hence the construction of T can be done in polynomial time in terms of the size of ⟨S, C⟩. Now the correctness of the reduction will be verified, i.e. it will be shown that the 3SSP instance has a solution iff the corresponding CPN instance is solvable. So first assume that there exists a solution S₁, S₂ of the 3SSP instance. Define the weights and delays for N as follows:
w₀ = −1,  w_{2i−1} = 2,  w_{2p+1} = w_{2p+2} = −4,   (18)
w_{2i} = −4  for i = 1, …, p,   (19)
d_{2i−1} = 0 for s_i ∈ S₁, d_{2i−1} = 1 for s_i ∈ S₂, and d_{2i} = 1 − d_{2i−1}, for i = 1, …, p,   (20)
d_{2p+1} = 0,  d_{2p+2} = 1.   (21)
Clearly,

D_{2i−1} ∩ D_{2i} = ∅  for i = 1, …, p+1   (22)

according to (20) and (21). It can easily be checked that N with parameters (18)–(21) is consistent with training examples (7)–(16). For instance, for any positive training example (7), the excitation ξ(t) = −1 + 2 · (3/4) ≥ 0 when t ∈ D_{2i−1}, which is sufficient for N to output 1. Or, for any negative training example (9), the excitation ξ(t) = −1 + 2 · (3/8) < 0 for all t ∈ D_{2i−1} and ξ(t) = −1 − 4 · (−1/8) < 0 for all t ∈ D_{2i}, whereas ξ(t) = −1 < 0 for t ≥ 2, which implies that N outputs the desired 0. The verification for the remaining training examples (7)–(16) is similar. Furthermore, D_{2i−1} ∩ D_{2j−1} ∩ D_{2k−1} = ∅ holds for any c_ℓ = {s_i, s_j, s_k} ∈ C according to (20), since c_ℓ ⊄ S₁ and c_ℓ ⊄ S₂. Hence, for a negative training example (17) corresponding to c_ℓ, the excitation ξ(t) ≤ −1 + 2·1 + 2·1 − 4·1 < 0 for every t ≥ 0 due to (22), which produces zero output. This completes the argument for the CPN instance to be solvable.

On the other hand, assume that there exist weights w₀, …, w_n and delays d₁, …, d_n for N such that N is consistent with training examples (7)–(17). Any consistent negative example ensures

w₀ < 0,   (23)

since the excitation must satisfy ξ(t) < 0 also for t ∉ ∪_{i=1}^n D_i. Hence it follows from (7) and (8) that w₀ + (3/4) w_{2i−1} ≥ 0 and w₀ − (1/4) w_{2i} ≥ 0, respectively, which average to

w₀ + (3/8) w_{2i−1} − (1/8) w_{2i} ≥ 0  for i = 1, …, p.   (24)

On the other hand, by comparing inequality (24) with the consistency of negative examples (9) we conclude that

D_{2i−1} ∩ D_{2i} = ∅  for i = 1, …, p.   (25)
Similarly, positive training examples (10) and (11) compel the inequality

w₀ − (1/8) w_{2p+1} − (1/8) w_{2p+2} ≥ 0,   (26)

which implies

D_{2p+1} ∩ D_{2p+2} = ∅   (27)

when the consistency of negative example (12) is required. Furthermore, positive training examples (13) ensure

w₀ + w_{2i−1} ≥ 0  for i = 1, …, p,   (28)

which, confronted with the consistency of negative examples (14), implies

D_{2i−1} ⊆ D_{2p+1} ∪ D_{2p+2}  for i = 1, …, p.   (29)
Similarly, the simultaneous consistency of positive examples (15) and negative examples (16) gives

D_{2i} ⊆ D_{2p+1} ∪ D_{2p+2}  for i = 1, …, p.   (30)

It follows from (25), (27), (29), and (30) that for each 1 ≤ i ≤ p either

(D_{2i−1} = D_{2p+1} and D_{2i} = D_{2p+2}) or (D_{2i−1} = D_{2p+2} and D_{2i} = D_{2p+1}),   (31)

which represents the synchronization of the input spikes according to (27). Thus define the splitting S = S₁ ∪ S₂ as

S₁ = {s_i ∈ S ; D_{2i−1} = D_{2p+1}},  S₂ = S \ S₁.   (32)

It will be proved that S₁, S₂ is a solution of the 3SSP. On the contrary, assume that there is c_ℓ = {s_i, s_j, s_k} ∈ C such that c_ℓ ⊆ S₁ or c_ℓ ⊆ S₂. According to definition (32), D_{2i−1} = D_{2j−1} = D_{2k−1} = D_{2p+1} holds for c_ℓ ⊆ S₁. Hence the consistency of the corresponding negative example (17) would require

w₀ + w_{2i−1} + w_{2j−1} + w_{2k−1} < 0   (33)

due to (25), but inequalities (23) and (28) imply the opposite. Similarly, D_{2i−1} = D_{2j−1} = D_{2k−1} = D_{2p+2} for c_ℓ ⊆ S₂ because of (32) and (31), providing contradiction (33). This completes the proof that the 3SSP instance is solvable. □

Corollary 1. If RP ≠ NP, then a single perceptron N with programmable synaptic delays is not properly PAC-learnable.
3 A Spiking Neuron Does Not Allow Robust Learning
A single perceptron N with delays can compute only very simple neuron functions. Therefore the consistency problem introduced in Section 2 frequently has no solution: there are no weight and delay parameters such that the neuron function is consistent with all training data. In this case one would, in practice, be satisfied with a good approximation, that is, with neuron parameters yielding a small training error. For example, in the incremental learning algorithms (e.g. [8]) that adapt single neurons before these are wired into a neural network, an efficient procedure for minimizing the training error is crucial to keep the network size small for successful generalization. Thus the decision version of the approximation problem is formulated as follows:

Approximation Problem for Neuron N (APN)
Instance: A training set T for N and a positive integer k.
Question: Are there weights w₀, …, w_n and delays d₁, …, d_n for N such that y_N(x) ≠ b for at most k training examples (x; b) ∈ T?
Within the PAC framework, the NP-hardness of this problem implies that the neuron does not allow robust learning (i.e. probably approximately optimal learning for any training task) unless RP = NP [14]. For the perceptrons with zero delays, the complexity of the approximation problem has been resolved. Several authors proved that the approximation problem is NP-complete in this case [14,23], even if the bias is assumed to be zero [2,16]. This means that the perceptrons with zero delays do not allow robust learning unless RP = NP. In addition, it is NP-hard to achieve a fixed error that is a constant multiple of the optimum [4]. These results were also generalized to analog outputs; e.g., for the logistic sigmoid (5) it is NP-hard to minimize the training error under the L₁ [15] or L₂ [28] norm within a given absolute bound or within 1 of its infimum. In this section the approximation problem is proved to be NP-hard for perceptron N with arbitrary delays. The proof exploits only binary firing rates, which means the result is also valid for spiking neurons with binary coded inputs.

Theorem 2. APN for N with binary firing rates is NP-hard.

Proof. The following vertex cover problem, which is known to be NP-complete [18], will be reduced to APN in polynomial time:

Vertex Cover Problem (VCP)
Instance: A graph G = (V, E) and a positive integer k ≤ |V|.
Question: Is there a vertex cover U ⊆ V of at most k ≥ |U| vertices, that is, a set U such that for each edge {u, v} ∈ E at least one of u and v belongs to U?

A similar reduction was originally used for the NP-hardness result concerning the approximate training of an ordinary perceptron with zero synaptic delays [14]. The technique generalizes to arbitrary delays. Thus, given a VCP instance ⟨G = (V, E), k⟩ with n = |V| vertices V = {v₁, …, v_n} and r = |E| edges, we construct a training set T for neuron N with n inputs. Training set T contains the following m = n + r examples:

(0, …, 0, 1, 0, …, 0 ; 1)  with 1 at position i, for i = 1, …, n,   (34)
(0, …, 0, 1, 0, …, 0, 1, 0, …, 0 ; 0)  with 1 at positions i and j, for each {v_i, v_j} ∈ E,   (35)
which can be constructed in polynomial time in terms of the size of the VCP instance. Moreover, in this APN instance at most k inconsistent training examples are allowed. It will be shown that the VCP instance has a solution iff the corresponding APN instance is solvable. So first assume that there exists a vertex cover U ⊆ V
of size at most k ≥ |U| vertices. Define the weights and delays for N as follows:

w₀ = −1,   (36)
w_i = −1 if v_i ∈ U, w_i = 1 if v_i ∉ U, for i = 1, …, n,   (37)
d_i = 0  for i = 1, …, n.   (38)
Obviously, the negative examples (35) corresponding to edges {v_i, v_j} ∈ E produce excitations either ξ(t) = −3, when both endpoints are in U, or ξ(t) = −1, when only one endpoint is in U, for t ∈ [0, 1), while ξ(t) = w₀ = −1 for t ≥ 1, which means N outputs the desired 0. Furthermore, the positive examples (34) that correspond to vertices v_i ∉ U give excitations ξ(t) = 0 for t ∈ [0, 1), and hence N classifies them correctly. On the other hand, N is not consistent with the positive examples (34) corresponding to vertices v_i ∈ U, since ξ(t) = −2 for t ∈ [0, 1) and ξ(t) = −1 for t ≥ 1. Nevertheless, the size of vertex cover U is at most k, which also upper-bounds the number of inconsistent training examples. This completes the argument for the APN instance to be solvable.

On the other hand, assume that there exist weights w₀, …, w_n and delays d₁, …, d_n making N consistent with all but at most k training examples (34)–(35). Define U ⊆ V so that U contains vertex v_i for each inconsistent positive example (34) corresponding to v_i. In addition, U includes just one of v_i and v_j (chosen arbitrarily) for each inconsistent negative example (35) corresponding to edge {v_i, v_j}. Clearly, |U| ≤ k, since there are at most k inconsistent training examples. It will be proved that U is a vertex cover for G. On the contrary, assume that there is an edge {v_i, v_j} ∈ E such that v_i, v_j ∉ U. It follows from the definition of U that N is consistent with the negative example (35) corresponding to edge {v_i, v_j}, which implies

ξ(t) = w₀ < 0
for t ∈ Di ∪ Dj ,
(39)
and it is consistent with the positive examples (34) corresponding to the vertices vi, vj, which, because of (39), ensures

ξ(t) = w0 + wi ≥ 0  for t ∈ Di ,  (40)
ξ(t) = w0 + wj ≥ 0  for t ∈ Dj .  (41)

By combining inequalities (39)–(41), we obtain

w0 + wi + wj > 0 .  (42)
On the other hand, by comparing inequalities (40) and (41) with the consistency of the negative example (35) corresponding to the edge {vi, vj}, we conclude that Di = Dj (synchronization technique) and hence

ξ(t) = w0 + wi + wj < 0  for t ∈ Di = Dj ,  (43)
which contradicts inequality (42). This completes the proof that U is a solution of the VCP.

Corollary 2. If RP ≠ NP, then a single spiking neuron with binary coded inputs and arbitrary delays does not allow robust learning.
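As an informal aid (our own sketch, not part of the paper's proof), the weight assignment (36)–(38) can be checked numerically on a toy graph. The helper `excitation` and the concrete instance below are illustrative choices of ours:

```python
# Illustrative check of the forward direction of the VCP -> APN reduction.
# Weights follow (36)-(38): w0 = -1; w_i = -1 for v_i in the cover U, else +1.

def excitation(w0, w, active):
    # Excitation on [0, 1) with all delays zero: bias plus active input weights.
    return w0 + sum(w[i] for i in active)

# Toy instance: vertices {0,1,2}, edges {0,1} and {1,2}; U = {1} is a vertex cover.
vertices, edges, U = [0, 1, 2], [(0, 1), (1, 2)], {1}
w0 = -1
w = {i: (-1 if i in U else 1) for i in vertices}

# Negative examples (edges): excitation -3 (both endpoints in U) or -1 (one in U),
# so the neuron outputs the desired 0.
for (i, j) in edges:
    assert excitation(w0, w, [i, j]) in (-3, -1)

# Positive examples (vertices): excitation 0 (correct) outside U, -2 (inconsistent) inside U.
for i in vertices:
    assert excitation(w0, w, [i]) == (-2 if i in U else 0)

# The inconsistent training examples are exactly the positive examples for U,
# so their number equals |U| <= k.
print("inconsistent examples:", len(U))
```

The assertions mirror the excitation values cited in the proof, so running the sketch confirms the case analysis on this toy instance.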
J. Šíma

4   The Representation Problem for Spiking Neurons
In this section we deal with the representation (membership) problem for spiking neurons with binary coded inputs:

Representation Problem for Spiking Neuron N (RPN)
Instance: A Boolean function f in DNF (disjunctive normal form).
Question: Is f computable by a single spiking neuron N, i.e., are there weights w0, . . . , wn and delays d1, . . . , dn for N such that yN(x) = f(x) for every x ∈ {0, 1}^n?

The representation problem for perceptrons with zero delays, known as the linear separability problem, was proved to be co-NP-complete [13]. We generalize the co-NP-hardness result to spiking neurons with arbitrary delays. The RPN is clearly in Σ2p, whereas its hardness for Σ2p (or for NP), which would imply [1] that spiking neurons with arbitrary delays are not learnable with membership and equivalence queries (unless NP = co-NP), remains an open problem. Moreover, it was shown [20] that the class of n-variable Boolean functions computable by spiking neurons is strictly contained in the class DLLT, which consists of the functions representable as disjunctions of O(n) Boolean linear threshold functions over n variables (from the class LT containing the functions computable by threshold gates); the smallest number of threshold gates needed is called the threshold number [11]. For example, the class DLLT corresponds to two-layer networks with a linear number of hidden perceptrons (with zero delays) and one output OR gate. It was shown [27] that the threshold number of spiking neurons with n inputs is at most n − 1 and can be lower-bounded by n/2. On the other hand, there exists a Boolean function with threshold number 2 that cannot be computed by a single spiking neuron [27]. We prove that a modified version of RPN, denoted as DLLT-RPN, whose instances are Boolean functions f from DLLT (instead of DNF), is also co-NP-hard.
This means that it is hard to decide whether a given n-variable Boolean function expressed as a disjunction of O(n) threshold gates can be computed by a single spiking neuron.

Theorem 3. RPN and DLLT-RPN are co-NP-hard and belong to Σ2p.

Proof. The tautology problem, which is known to be co-NP-complete [7], will be reduced to RPN in polynomial time, in a way similar to the reduction for the linear separability problem [13]:

Tautology Problem (TP)
Instance: A Boolean function g in DNF.
Question: Is g a tautology, i.e., g(x) = 1 for every x ∈ {0, 1}^n?

For the DLLT-RPN, a modified version of TP, denoted as DLLT-TP, whose instances are Boolean functions g from DLLT, will be exploited. For proving that the DLLT-TP remains co-NP-complete, note that any TP instance ∨_{j=1}^m Cj with m monomials (conjunctions of literals over n variables) can be equivalently rewritten in
the DNF ∨_{j=1}^m ((Cj ∧ xj) ∨ (Cj ∧ x̄j)), where x1, . . . , xm are m new variables. Clearly, in the new DNF formula the number of monomials is linear in terms of the number of variables. Moreover, any monomial can obviously be computed by a single threshold gate. Thus, given a TP (DLLT-TP) instance g over n variables x1, . . . , xn, we construct a corresponding RPN (DLLT-RPN) instance f over n + 2 variables x1, . . . , xn, y1, y2 in polynomial time as follows:

f(x1, . . . , xn, y1, y2) = (g(x1, . . . , xn) ∧ y1) ∨ (y1 ∧ ȳ2) ∨ (ȳ1 ∧ y2) .  (44)
For a TP instance g, the function f is actually in DNF, as required for the RPN. For a DLLT-TP instance g = ∨_{j=1}^m gj with gj from LT, formula (44) contains terms gj ∧ y1, whose negations ḡj ∨ ȳ1 belong to LT: class LT is closed under negation [21], and a summand W(1 − y1) with a sufficiently large weight W can be added to the weighted sum for ḡj to evaluate ḡj ∨ ȳ1. Hence gj ∧ y1 is in LT as well, which implies that f is from DLLT, representing a DLLT-RPN instance. It will be shown that the TP (DLLT-TP) instance has a solution iff the corresponding RPN (DLLT-RPN) instance is solvable. So first assume that g is a tautology. Then f given by (44) can be equivalently rewritten as y1 ∨ y2, which is trivially computable by a spiking neuron. On the other hand, assume that there exists a ∈ {0, 1}^n such that g(a) = 0. In this case, f(a, y1, y2) reduces to XOR(y1, y2), which cannot be implemented by a single spiking neuron [20]. For proving that RPN ∈ Σ2p (similarly for DLLT-RPN), consider an alternating algorithm for the RPN that, given f in DNF, first guesses polynomial-size representations [20] of weights and delays for a spiking neuron N in its existential state, and then verifies yN(x) = f(x) for every x ∈ {0, 1}^n in its universal state (yN(x) can be computed in polynomial time since there is only a linear number of time intervals to check).
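As a quick illustrative check (our own sketch, not part of the proof), the two cases of construction (44) can be verified by brute force for tiny DNF instances g over two variables:

```python
# Brute-force check of reduction (44): f = (g ∧ y1) ∨ (y1 ∧ ¬y2) ∨ (¬y1 ∧ y2).
from itertools import product

def f(g, x, y1, y2):
    # Formula (44), with g an arbitrary Boolean function of the x-variables.
    return (g(x) and y1) or (y1 and not y2) or ((not y1) and y2)

g_taut = lambda x: x[0] or (not x[0])   # a tautology
g_not  = lambda x: x[0] and x[1]        # falsified, e.g., at (False, False)

for x in product([False, True], repeat=2):
    for y1, y2 in product([False, True], repeat=2):
        # Tautology case: f collapses to y1 ∨ y2 (computable by a spiking neuron).
        assert f(g_taut, x, y1, y2) == (y1 or y2)

# Non-tautology case: at a falsifying point a, f(a, y1, y2) = XOR(y1, y2).
a = (False, False)
for y1, y2 in product([False, True], repeat=2):
    assert f(g_not, a, y1, y2) == (y1 != y2)

print("reduction (44) behaves as claimed on the toy instances")
```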
5   Conclusion
The computational complexity of training a single perceptron with programmable synaptic delays, a model that covers certain aspects of spiking neurons (with binary coded inputs), has been analyzed. We have developed a synchronization technique that generalizes the known non-learnability results to arbitrary synaptic delays. In particular, we have proved that perceptrons with delays are not properly PAC-learnable and that spiking neurons do not allow robust learning unless RP = NP. This represents a step towards solving an open issue concerning the PAC-learnability of spiking neurons with arbitrary delays. In addition, we have shown that it is co-NP-hard to decide whether a disjunction of O(n) threshold gates, which is known to be able to implement any spiking neuron, can conversely be computed by a single spiking neuron. It remains an open issue for further research whether spiking neurons are learnable with membership and equivalence queries.
References

1. Aizenstein, H., Hegedüs, T., Hellerstein, L., Pitt, L.: Complexity Theoretic Hardness Results for Query Learning. Computational Complexity 7 (1) (1998) 19–53
2. Amaldi, E.: On the complexity of training perceptrons. In: Kohonen, T., Mäkisara, K., Simula, O., Kangas, J. (eds.): Proceedings of the ICANN'91 First International Conference on Artificial Neural Networks. Elsevier Science Publishers, North-Holland, Amsterdam (1991) 55–60
3. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK (1999)
4. Arora, S., Babai, L., Stern, J., Sweedyk, Z.: The hardness of approximate optima in lattices, codes, and systems of linear equations. Journal of Computer and System Sciences 54 (2) (1997) 317–331
5. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (4) (1989) 929–965
6. Bohte, M., Kok, J.N., La Poutré, H.: Spike-prop: error-backpropagation in multi-layer networks of spiking neurons. In: Proceedings of the ESANN'2000 European Symposium on Artificial Neural Networks. D-Facto Publications, Brussels (2000) 419–425
7. Cook, S.A.: The complexity of theorem-proving procedures. In: Proceedings of the STOC'71 Third Annual ACM Symposium on Theory of Computing. ACM Press, New York (1971) 151–158
8. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: Touretzky, D.S. (ed.): Advances in Neural Information Processing Systems (NIPS'89), Vol. 2. Morgan Kaufmann, San Mateo (1990) 524–532
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, San Francisco (1979)
10. Gerstner, W., Kistler, W.M.: Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, Cambridge, UK (2002)
11. Hammer, P.L., Ibaraki, T., Peled, U.N.: Threshold numbers and threshold completions. In: Hansen, P. (ed.): Studies on Graphs and Discrete Programming, Annals of Discrete Mathematics 11, Mathematics Studies, Vol. 59. North-Holland, Amsterdam (1981) 125–145
12. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice-Hall, Upper Saddle River, NJ (1999)
13. Hegedüs, T., Megiddo, N.: On the geometric separability of Boolean functions. Discrete Applied Mathematics 66 (3) (1996) 205–218
14. Höffgen, K.-U., Simon, H.-U., Van Horn, K.S.: Robust trainability of single neurons. Journal of Computer and System Sciences 50 (1) (1995) 114–125
15. Hush, D.R.: Training a sigmoidal node is hard. Neural Computation 11 (5) (1999) 1249–1260
16. Johnson, D.S., Preparata, F.P.: The densest hemisphere problem. Theoretical Computer Science 6 (1) (1978) 93–107
17. Judd, J.S.: Neural Network Design and the Complexity of Learning. The MIT Press, Cambridge, MA (1990)
18. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.): Complexity of Computer Computations. Plenum Press, New York (1972) 85–103
19. Maass, W., Bishop, C.M. (eds.): Pulsed Neural Networks. The MIT Press, Cambridge, MA (1999)
20. Maass, W., Schmitt, M.: On the complexity of learning for spiking neurons with temporal coding. Information and Computation 153 (1) (1999) 26–46
21. Parberry, I.: Circuit Complexity and Neural Networks. The MIT Press, Cambridge, MA (1994)
22. Pitt, L., Valiant, L.G.: Computational limitations on learning from examples. Journal of the ACM 35 (4) (1988) 965–984
23. Roychowdhury, V.P., Siu, K.-Y., Kailath, T.: Classification of linearly nonseparable patterns by linear threshold elements. IEEE Transactions on Neural Networks 6 (2) (1995) 318–331
24. Roychowdhury, V.P., Siu, K.-Y., Orlitsky, A. (eds.): Theoretical Advances in Neural Computation and Learning. Kluwer Academic Publishers, Boston (1994)
25. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65 (6) (1958) 386–408
26. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323 (1986) 533–536
27. Schmitt, M.: On computing Boolean functions by a spiking neuron. Annals of Mathematics and Artificial Intelligence 24 (1-4) (1998) 181–191
28. Šíma, J.: Training a single sigmoidal neuron is hard. Neural Computation 14 (11) (2002) 2709–2728
29. Vidyasagar, M.: A Theory of Learning and Generalization. Springer-Verlag, London (1997)
Learning a Subclass of Regular Patterns in Polynomial Time

John Case¹, Sanjay Jain², Rüdiger Reischuk³, Frank Stephan⁴, and Thomas Zeugmann³

¹ Dept. of Computer and Information Sciences, University of Delaware, Newark, DE 19716-2586, USA, [email protected]
² School of Computing, National University of Singapore, Singapore 117543, [email protected]
³ Institute for Theoretical Informatics, University at Lübeck, Wallstr. 40, 23560 Lübeck, Germany, {reischuk, thomas}@tcs.mu-luebeck.de
⁴ Mathematisches Institut, Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany, [email protected]
Abstract. Presented is an algorithm (for learning a subclass of erasing regular pattern languages) which can be made to run with arbitrarily high probability of success on extended regular languages generated by patterns π of the form x0 α1 x1 . . . αm xm, for unknown m but known c, from a number of examples polynomial in m (and exponential in c), where x0, . . . , xm are variables and where α1, . . . , αm are each strings of constants or terminals of length c. This assumes that the algorithm randomly draws samples under natural and plausible assumptions on the distribution. The more general-looking case of extended regular patterns which alternate between a variable and fixed-length constant strings, beginning and ending with either a variable or a constant string, is handled similarly.
1   Introduction
The pattern languages were formally introduced by Angluin [1]. A pattern language is (by definition) the set of all positive-length substitution instances of a pattern, such as, for example, abxycbbzxa, where the variables (for substitutions) are x, y, z and the constants/terminals are a, b, c.
Supported in part by NSF grant number CCR-0208616 and USDA IFAFS grant number 01-04145. Supported in part by NUS grant number R252-000-127-112.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 234–246, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Since then, much work has been done on pattern languages and on extended pattern languages, which also allow empty substitutions, as well as on various special cases of the above (cf., e.g., [1,6,7,10,12,21,20,22,23,26,19,29] and the references therein). Furthermore, several authors have also studied finite unions of pattern languages (or extended pattern languages), unbounded unions thereof, and also important subclasses of (extended) pattern languages (see, for example, [11,5,27,3,32]). Nix [18] as well as Shinohara and Arikawa [28,29] outline interesting applications of pattern inference algorithms. For example, pattern language learning algorithms have been successfully applied to some problems in molecular biology (see [25,29]). Pattern languages and finite unions of pattern languages turn out to be subclasses of Smullyan's [30] Elementary Formal Systems (EFSs), and Arikawa, Shinohara and Yamamoto [2] show that the EFSs can also be treated as a logic programming language over strings. The investigations of the learnability of subclasses of EFSs are interesting because they yield corresponding results about the learnability of subclasses of logic programs. Hence, these results are also of relevance for Inductive Logic Programming (ILP) [17,13,4,15]. Miyano et al. [16] intensively studied the polynomial-time learnability of EFSs. In the following we explain the main philosophy behind our research as well as the ideas from which it emerged. As far as learning theory is concerned, pattern languages are a prominent example of non-regular languages that can be learned in the limit from positive data (cf. [1]). Gold [9] introduced the corresponding learning model. Let L be any language; then a text for L is any infinite sequence of strings that eventually contains all strings of L, and nothing else. The information given to the learner consists of successively growing initial segments of a text. Processing these segments, the learner has to output hypotheses about L.
The hypotheses are chosen from a prespecified set called the hypothesis space. The sequence of hypotheses has to converge to a correct description of the target language. Angluin [1] provides a learner for the class of all pattern languages that is based on the notion of descriptive patterns. Here a pattern π is said to be descriptive (for the set S of strings contained in the input provided so far) if π can generate all strings contained in S and no other pattern having this property generates a proper subset of the language generated by π. But no efficient algorithm is known for computing descriptive patterns. Thus, unless such an algorithm is found, it is infeasible in practice to compute even a single hypothesis by using this approach. Therefore, restricted versions of pattern language learning have been considered in which the number k of different variables is fixed, in particular the case of a single variable. Angluin [1] gives a learner for one-variable pattern languages with update time O(ℓ⁴ log ℓ), where ℓ is the sum of the lengths of all examples seen so far. Note that this algorithm also computes descriptive patterns, even of maximum length.
Another important special case that has been extensively studied are the regular pattern languages introduced by Shinohara [26]. These are generated by the regular patterns, i.e., patterns in which each variable that appears, appears only once. The learners designed by Shinohara [26] for regular pattern languages and extended regular pattern languages also compute descriptive patterns for the data seen so far. These descriptive patterns are computable in time polynomial in the length of all examples seen so far. But when applying these algorithms in practice, another problem comes into play: all the learners mentioned above are only known to converge to a correct hypothesis for the target language in the limit, and the stage of convergence is not decidable. Thus, a user never knows whether or not the learning process has already finished. Such an uncertainty may not be tolerable in practice. Consequently, attempts have been made to learn the pattern languages within Valiant's [31] PAC model. Schapire [24] showed that the whole class of pattern languages is not learnable within the PAC model, unless P/poly = NP/poly, for any hypothesis space that allows a polynomially decidable membership problem. Since membership is NP-complete for the pattern languages, his result does not exclude the learnability of all pattern languages in an extended PAC model, i.e., a model in which one is allowed to use the set of all patterns as hypothesis space. However, Kearns and Pitt [10] have established a PAC learning algorithm for the class of all k-variable pattern languages, i.e., languages generated by patterns in which at most k different variables occur. Positive examples are generated with respect to arbitrary product distributions, while negative examples are allowed to be generated with respect to any distribution. Additionally, the length of substitution strings has been required to be polynomially related to the length of the target pattern.
Finally, their algorithm uses as hypothesis space all unions of polynomially many patterns that have k or fewer variables¹. The overall learning time of their PAC learning algorithm is polynomial in the length of the target pattern, the bound for the maximum length of substitution strings, 1/ε, 1/δ, and |Σ|. The constant in the running time achieved depends doubly exponentially on k, and thus their algorithm rapidly becomes impractical as k increases. As far as the class of extended regular pattern languages is concerned, Miyano et al. [16] showed the consistency problem to be NP-complete. Thus, the class of all extended regular pattern languages is not polynomial-time PAC learnable, unless RP = NP, for any learner that uses the regular patterns as hypothesis space. This is even true for REGPAT1, i.e., the set of all extended regular pattern languages where the length of the constant strings is 1 (see below for a formal
¹ More precisely, the number of allowed unions is at most poly(|π|, s, 1/ε, 1/δ, |Σ|), where π is the target pattern, s is the bound on the length of substitution strings, ε and δ are the usual error and confidence parameters, respectively, and Σ is the alphabet of constants over which the patterns are defined.
definition). The latter result follows from [16] via an equivalence proof to the common subsequence languages studied in [14]. In the present paper we also study special cases of learning the extended regular pattern languages. On the one hand, they already allow non-trivial applications. On the other hand, as noted above, it is by no means easy to design an efficient learner for these classes of languages. Therefore, we aim to design an efficient learner for an interesting subclass of the extended regular pattern languages, which we define next. Let Lang(π) be the extended pattern language generated by the pattern π. For c > 0, let REGPATc be the set of all Lang(π) such that π is a pattern of the form x0 α1 x1 α2 x2 . . . αm xm, where each αi is a string of terminals of length c and x0, x1, x2, . . . , xm are distinct variables. We consider polynomial-time learning of REGPATc for various data presentations and for natural and plausible probability distributions on the input data. As noted above, even REGPAT1 is not polynomial-time PAC learnable unless RP = NP. Thus, one has to restrict the class of all probability distributions. The conceptual idea is then as follows. We explain it here for the case mainly studied in this paper, learning from text (in our above notation). One again looks at the whole learning process as learning in the limit. So the data presented to the learner are growing initial segments of a text. But now we do not allow arbitrary texts. Instead, every text is drawn according to some fixed probability distribution. Next, one determines the expected number of examples needed by the learner until convergence. Let E denote this expectation. Assuming prior knowledge about the underlying probability distribution, E can be expressed in terms the learner may use to calculate E. Using Markov's inequality, one easily sees that the probability of exceeding this expectation by a factor of t is bounded by 1/t.
Thus, as in the PAC model, we introduce a confidence parameter δ. Given δ, one needs roughly (1/δ) · E many examples to converge with probability at least 1 − δ. Knowing this, there is of course no need to compute any intermediate hypotheses. Instead, the learner now first draws as many examples as needed and then computes just one hypothesis from them. This hypothesis is output, and by construction we know it to be correct with probability at least 1 − δ. Thus, we arrive at a learning model which we call probabilistically exact learning (cf. Definition 5 below). Clearly, in order to have an efficient learner one also has to ensure that this hypothesis can be computed in time polynomial in the length of all strings seen. For arriving at an overall polynomial-time learner, it must also be ensured that E is polynomially bounded in a suitable parameter. As such parameters we use the number of variables occurring in the regular target pattern, c (the length of the constant strings), and a term describing knowledge about the probability distribution. For REGPATc, we have results for three different models of data presentation. The data are drawn according to the distribution prob described below.
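The Markov-inequality step above can be sketched numerically (our own illustration, not from the paper): if E is the expected number of examples until convergence, then P(N ≥ t·E) ≤ 1/t, so roughly (1/δ)·E draws suffice with probability at least 1 − δ. We check this exactly for a geometric waiting time:

```python
# Check Markov-style tail bounds for a geometric waiting time with success
# probability p (an illustrative stand-in for "examples until convergence").
import math

p = 0.2
E = 1 / p                       # expected number of draws until first success
for t in [2, 4, 10]:
    k = math.ceil(t * E)        # draw t*E examples
    tail = (1 - p) ** (k - 1)   # exact P(N >= k) for a geometric random variable
    assert tail <= 1 / t        # consistent with Markov's inequality
print("Markov tail bounds hold for the geometric example")
```

The exact geometric tail is of course much smaller than Markov's generic 1/t bound; the point is only that 1/t is already enough for the confidence argument.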
The three models are as follows. Due to space limitations, we present herein the details and verification of our algorithm for the first model only; the journal version of this paper will present more details. Σ is the terminal alphabet. For natural numbers c > 0, Σ^c is Σ∗ restricted to strings of length c.

(1) For drawing of examples according to prob for learning a pattern language generated by π: one draws a terminal string σ according to the distribution prob over Σ∗ until some σ ∈ Lang(π) is obtained. Then σ is returned to the learner.
(2) One draws σ according to prob and gives (σ, χ_Lang(π)(σ)) to the learner.
(3) As in (2), but one gives σ to the learner in the case that σ ∈ Lang(π), and gives a pause-symbol to the learner otherwise.

For this paper, the natural and plausible assumptions on prob are the following.

(i) prob(Σ^c) ≥ prob(Σ^{c+1}) for all c;
(ii) prob(σ) = prob(Σ^c)/|Σ^c|, where σ ∈ Σ^c;
(iii) there is an increasing polynomial pol such that prob(Σ^c) ≥ 1/pol(c) for all c.
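Data-presentation model (1) under assumptions (i)–(iii) can be sketched as follows. The concrete length distribution prob(Σ^c) = 2^{−(c+1)} and all helper names are our own illustrative choices, not taken from the paper:

```python
# Sketch of data-presentation model (1): rejection-sample from a distribution
# satisfying (i)-(iii) until a string of the target language is drawn.
import random
import re

SIGMA = "ab"

def draw_string(rng):
    # Pick a length c with probability 2^-(c+1): non-increasing in c, as
    # (i) requires; then draw uniformly from Sigma^c, as (ii) requires.
    c = 0
    while rng.random() < 0.5:
        c += 1
    return "".join(rng.choice(SIGMA) for _ in range(c))

def draw_example(rng, alphas):
    # Model (1): redraw until the string belongs to Lang(pi), then return it.
    rx = ".*" + ".*".join(re.escape(a) for a in alphas) + ".*"
    while True:
        sigma = draw_string(rng)
        if re.fullmatch(rx, sigma):
            return sigma

rng = random.Random(1)
samples = [draw_example(rng, ["a"]) for _ in range(20)]
assert all("a" in s for s in samples)  # every example lies in Lang(x0 a x1)
print("drew", len(samples), "examples from Lang(x0 a x1)")
```

Assumption (iii) holds here since prob(Σ^c) = 2^{−(c+1)} is bounded below by 1/pol(c) only for an exponential pol; for a faithful polynomial-tail example one would choose, e.g., prob(Σ^c) proportional to 1/(c+1)², but the rejection-sampling mechanism is the same.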
Our algorithm is presented in detail in Section 3 below. The complexity bounds are described more exactly there, but, basically, the algorithm can be made to run with arbitrarily high probability of success on extended regular languages generated by patterns π of the form x0 α1 x1 . . . αm xm, for unknown m but known c, from a number of examples polynomial in m (and exponential in c), where α1, . . . , αm ∈ Σ^c. N.B. Having our patterns defined as starting and ending with variables is not crucial, since one can easily handle patterns starting or ending with constants by just looking at the data and checking whether they have a common prefix or suffix. Our results hold more generally for patterns alternating variables and fixed-length constant strings, where the variables are not repeated. Our statements above and in Section 3 below involving variables at the front and end are more for ease of presentation of the proof.
2   Preliminaries
Let N = {0, 1, 2, . . .} denote the set of natural numbers, and let N+ = N \ {0} . For any set S , we write |S| to denote the cardinality of S . Let Σ be any non-empty finite set of constants such that |Σ| ≥ 2 and let V be a countably infinite set of variables such that Σ ∩ V = ∅ . By Σ ∗ we denote the free monoid over Σ . The set of all finite non-null strings of symbols from Σ is denoted by Σ + , i.e., Σ + = Σ ∗ \ {λ} , where λ denotes the empty string. As above, Σ c denotes the set of strings over Σ with length c . We let a, b, . . .
range over constant symbols, and x, y, z, x1, x2, . . . range over variables. Following Angluin [1], we define patterns and pattern languages as follows.

Definition 1. A term is an element of (Σ ∪ V)∗. A ground term (or a word, or a string) is an element of Σ∗. A pattern is a non-empty term. A substitution is a homomorphism from terms to terms that maps each symbol a ∈ Σ to itself. The image of a term π under a substitution θ is denoted πθ.

We next define the language generated by a pattern.

Definition 2. The language generated by a pattern π is defined as Lang(π) = {πθ ∈ Σ∗ | θ is a substitution}. We set PAT = {Lang(π) | π is a pattern}.

Note that we are considering extended (or erasing) pattern languages, i.e., a variable may be substituted with the empty string λ. Though allowing empty substitutions may seem a minor generalization, it is not: for the case considered within this paper, learning erasing pattern languages is more difficult than learning non-erasing ones. For the general case of arbitrary pattern languages, Angluin [1] already showed the non-erasing pattern languages to be learnable from positive data. However, the erasing pattern languages are not learnable from positive data if |Σ| = 2 (cf. Reidenbach [19]).

Definition 3 (Shinohara [26]). A pattern π is said to be regular if it is of the form x0 α1 x1 α2 x2 . . . αm xm, where αi ∈ Σ+ and xi is the i-th variable. We set REGPAT = {Lang(π) | π is a regular pattern}.

Definition 4. Suppose c ∈ N+. We define
(a) reg_c^m = {π | π = x0 α1 x1 α2 x2 . . . αm xm, where each αi ∈ Σ^c};
(b) reg_c = ∪_m reg_c^m.
(c) REGPAT_c = {Lang(π) | π ∈ reg_c}.

Next, we define the learning model considered in this paper. As already explained in the Introduction, our model differs to a certain extent from the PAC model introduced by Valiant [31], which is distribution independent. In our model, a bit of background knowledge concerning the class of allowed probability distributions is assumed. So we have a stronger assumption, but also a stronger requirement: instead of learning an approximation of the target concept, our learner is required to learn it exactly. Moreover, the class of erasing regular pattern languages is known not to be PAC learnable (cf. [16] and the discussion within the Introduction).
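For erasing regular patterns, the membership test implicit in these definitions can be sketched with a standard regular-expression check (our own illustration): Lang(x0 α1 x1 . . . αm xm) is exactly the set of strings containing α1, . . . , αm as disjoint substrings in that order.

```python
# Membership in Lang(pi) for an erasing regular pattern
# pi = x0 a1 x1 a2 x2 ... am xm, via the regex .*a1.*a2...*am.* .
import re

def lang_member(alphas, sigma):
    # alphas = [a1, ..., am] are the constant blocks of the regular pattern.
    if not alphas:
        return True  # pi = x0 generates every string
    rx = ".*" + ".*".join(re.escape(a) for a in alphas) + ".*"
    return re.fullmatch(rx, sigma) is not None

# pi = x0 ab x1 ba x2 over Sigma = {a, b}
assert lang_member(["ab", "ba"], "abba")      # all variables erased
assert lang_member(["ab", "ba"], "aabbbab")   # "ab" then a disjoint later "ba"
assert not lang_member(["ab", "ba"], "ab")    # no "ba" after the "ab"
print("membership checks passed")
```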
Definition 5. A learner M is said to probabilistically exactly learn a class L of pattern languages according to a probability distribution prob if, for some polynomial q and all δ with 0 < δ < 1, when learning a Lang(π) ∈ L, with probability at least 1 − δ, M draws at most q(|π|, 1/δ) examples according to the probability distribution prob and then outputs a pattern π′ such that Lang(π) = Lang(π′).

As far as drawing of examples according to prob for learning a pattern language generated by π is concerned, we assume the following model (the first model discussed in the Introduction): one draws σ according to the distribution prob over Σ∗ until some σ ∈ Lang(π) is obtained. Then σ is returned to the learner. (Note: prob is thus defined over Σ∗.) The other two models we mentioned in the Introduction are: (2) there is a basic distribution prob, and one draws σ according to prob and gives (σ, χ_Lang(π)(σ)) to the learner; (3) as in (2), but one gives σ to the learner in the case that σ ∈ Lang(π), and gives a pause-symbol to the learner otherwise. We note that our proof works for models (2) and (3) too. For this paper, the assumptions on prob are (as in the Introduction) the following.

(i) prob(Σ^c) ≥ prob(Σ^{c+1}) for all c ∈ N;
(ii) prob(σ) = prob(Σ^c)/|Σ^c|, where σ ∈ Σ^c;
(iii) there is an increasing polynomial pol with prob(Σ^c) ≥ 1/pol(c) > 0 for all c ∈ N.

3   Main Result
In this section we will show that REGPAT_c is probabilistically exactly learnable according to probability distributions prob satisfying the constraints described above.

Lemma 1 (based on Chernoff bounds). Suppose X, Y ⊆ Σ∗, that δ, ε are properly between 0 and 1/2, and that prob(X) ≥ prob(Y) + ε. Let e be the base of the natural logarithm. Then, if one draws at least (−log(δ)) ∗ 2/(ε² ∗ log e) many examples from Σ∗ according to the probability distribution prob, then with probability at least 1 − δ, more elements of X than of Y show up. The number 2/(ε² ∗ δ) is an upper bound for this number. More generally, the following holds.
Lemma 2. One can define a function r(ε, δ, k), polynomial in k, 1/ε, 1/δ, such that for all sets X, Z, Y1, Y2, . . . , Yk ⊆ Σ∗ the following holds. If prob(X) − prob(Yi) ≥ ε for i = 1, 2, . . . , k, and prob(Z) ≥ ε, and one draws at least r(ε, δ, k) many examples from Σ∗ according to the distribution prob, then with probability at least 1 − δ:
(a) there is at least one example from Z;
(b) there are strictly more examples in X than in any of the sets Y1, . . . , Yk.

Proposition 1. For every regular pattern π and all m ∈ N, |Lang(π) ∩ Σ^{m+1}| ≥ |Σ| ∗ |Lang(π) ∩ Σ^m|.

Proof. Since any regular pattern π has a variable at the end, the proposition follows.

Proposition 2. For any fixed constant c ∈ N+ and any alphabet Σ, there is a polynomial f such that for every π ∈ reg_c^m at least half of the strings of length f(m) are generated by π.

Proof. Suppose that π = x0 α1 x1 α2 x2 . . . αm xm with α1, α2, . . . , αm ∈ Σ^c. Clearly, there is a length d ≥ c such that for every τ ∈ Σ^c at least half of the strings in Σ^d contain τ as a substring, that is, are in the set ∪_{k=0}^{d−c} Σ^k τ Σ^{d−k−c}. Now let f(m) = d ∗ m². We show that, given π as above, at least half of the strings of length f(m) are generated by π.
In order to see this, draw a string σ ∈ Σ^{d∗m²} according to a fair |Σ|-sided coin such that all symbols are equally likely. Divide σ into m equal parts of length d ∗ m. The i-th part contains αi as a substring with probability at least 1 − 2^{−m}, and thus the whole string is generated by π with probability at least 1 − m ∗ 2^{−m}. Note that 1 − m ∗ 2^{−m} ≥ 1/2 for all m, and thus f(m) meets the specification.

We now present our algorithm for learning REGPAT_c. The algorithm has prior knowledge of the function r from Lemma 2 and the function f from Proposition 2. It takes as input c, δ, and knowledge about the probability distribution in the form of pol.

Learner(c, δ, pol)
(1) Read examples until an n is found such that the shortest example is strictly shorter than c ∗ n and the total number of examples (including repetitions) is at least

n ∗ r( 1/(2 ∗ |Σ|^c ∗ f(n) ∗ pol(f(n))), δ/n, |Σ|^c ) .
Let A be the set of all examples and, for j ∈ {1, 2, . . . , n}, let Aj be the set of examples whose index is j modulo n; so the (k ∗ n + j)-th example from A goes to Aj, where k is an integer and j ∈ {1, 2, . . . , n}. Let i = 1, π0 = x0, X0 = {λ}, and go to Step (2).

(2) For β ∈ Σ^c, let Y_{i,β} = X_{i−1} β Σ∗. If A ∩ X_{i−1} ≠ ∅, then let m = i − 1 and go to Step (3). Choose αi as the β ∈ Σ^c such that |Ai ∩ Y_{i,β}| > |Ai ∩ Y_{i,β′}| for all β′ ∈ Σ^c − {β} (if there is no such β, then abort the algorithm). Let Xi be the set of all strings σ such that σ is in Σ∗ α1 Σ∗ α2 Σ∗ . . . Σ∗ αi, but no proper prefix of σ is in Σ∗ α1 Σ∗ α2 Σ∗ . . . Σ∗ αi. Let πi = π_{i−1} αi xi, let i = i + 1, and go to Step (2).

(3) Output the pattern πm = x0 α1 x1 α2 x2 . . . αm xm and halt.

End

Note that, since the shortest example is strictly shorter than c ∗ n, it holds that n ≥ 1. Furthermore, if π = x0, then the probability that a string drawn is λ is at least 1/pol(0). A lower bound for this is 1/(2 ∗ |Σ|^c ∗ f(n) ∗ pol(f(n))), whatever n is, due to the fact that pol is monotonically increasing. Thus λ appears with probability 1 − δ/n in the set An and hence in the set A. So the algorithm is correct for the case that π = x0. It remains to consider the case where π is of the form x0 α1 x1 α2 x2 . . . αm xm for some m ≥ 1, where all αi are in Σ^c.

Claim. Suppose any pattern π = x0 α1 x1 α2 x2 . . . αm xm ∈ reg_c^m. Furthermore, let π_{i−1} = x0 α1 x1 . . . α_{i−1} x_{i−1}. Let the sets Y_{i,β}, Xi be as defined in the algorithm and let C(i, β, h) be the cardinality of Y_{i,β} ∩ Lang(π) ∩ Σ^h. Then, for all h > 0 and all β ∈ Σ^c \ {αi}, we have C(i, β, h) ≤ |Σ| ∗ C(i, αi, h − 1) ≤ C(i, αi, h).

Proof. Let σ ∈ Y_{i,β} ∩ Lang(π). Note that σ has a unique prefix σi ∈ X_{i−1}. Furthermore, there exist s ∈ Σ and η, τ ∈ Σ∗ such that (i) σ = σi β s η τ and (ii) β s η is the shortest possible string such that β s η ∈ Σ∗ αi. The existence of s is due to the fact that β ≠ αi and that β and αi both have length c.
So the position of αi in σ must be at least one symbol behind that of β. If the difference is more than one symbol, η takes up these additional symbols. Now consider the mapping t from Lang(π) ∩ Yi,β to Lang(π) ∩ Yi,αi which replaces β s in the above representation of σ by αi – thus t(σ) = σi αi η τ. The mapping t is |Σ|-to-1, since it replaces the constant β by αi and erases s (the information about which element of Σ the value s is, is lost). Clearly, σi, but no proper prefix of σi, is in Xi−1. So σi αi is in Xi−1 αi. The positions of αi+1, . . . , αm in σ lie in the part covered by τ, since σi β s η
is the shortest prefix of σ generated by πi−1 αi. Since πi−1 generates σi and xi αi+1 xi+1 . . . αm xm generates η τ, it follows that π generates t(σ). Hence, t(σ) ∈ Lang(π). Furthermore, t(σ) ∈ Σ^(h−1), since the mapping t omits one symbol. Also, clearly t(σ) ∈ Xi−1 αi Σ∗ = Yi,αi. Thus, for β ≠ αi, β ∈ Σ^c, it holds that C(i, β, h) ≤ |Σ| ∗ C(i, αi, h − 1). Combining this with Proposition 1 gives C(i, αi, h) ≥ |Σ| ∗ C(i, αi, h − 1) ≥ C(i, β, h).

Claim. If m ≥ i, then there is a length h ≤ f(m) such that

C(i, αi, h) ≥ C(i, β, h) + |Σ|^h / (2 ∗ |Σ|^c ∗ f(m))

for all β ∈ Σ^c \ {αi}. In particular,

prob(Yi,β ∩ Lang(π)) + 1/(2 ∗ |Σ|^c ∗ f(m) ∗ pol(f(m))) ≤ prob(Yi,αi ∩ Lang(π)).

Proof. Let D(i, β, h) = C(i, β, h) / |Σ|^h, for all h and β ∈ Σ^c. Proposition 1 and Claim 3 give that

D(i, β, h) ≤ D(i, αi, h − 1) ≤ D(i, αi, h).

Since every string in Lang(π) is in some set Yi,β, it holds that D(i, αi, f(m)) ≥ 1/(2 ∗ |Σ|^c). Furthermore, D(i, αi, h) = 0 for all h < c, since m > 0 and π does not generate the empty string. Thus there is an h ∈ {1, 2, . . . , f(m)} with

D(i, αi, h) − D(i, αi, h − 1) ≥ 1/(2 ∗ |Σ|^c ∗ f(m)).

For this h, it holds that

D(i, αi, h) ≥ D(i, β, h) + 1/(2 ∗ |Σ|^c ∗ f(m)).

The second part of the claim follows by noting that

prob(Σ^h) ≥ 1/pol(h) ≥ 1/pol(f(m)).
We now show that the learner presented above indeed probabilistically exactly learns Lang(π) for π ∈ REGPATc. A loop invariant of Step (2) is that, with probability at least 1 − δ ∗ (i − 1)/n, the pattern πi−1 is a prefix of the desired pattern π. This certainly holds before entering Step (2) for the first time.

Case 1. i ∈ {1, 2, . . . , m}.
By assumption, i ≤ m and, with probability at least 1 − δ ∗ (i − 1)/n, πi−1 is a prefix of π, that is, α1, . . . , αi−1 have been selected correctly. Since αi exists and every string generated by π is in Xi−1 Σ∗ αi Σ∗, no element of Lang(π), and thus no element of A, is in Xi−1, and the algorithm does not stop too early. If β = αi and β′ ≠ αi, then

prob(Yi,β ∩ Lang(π)) ≥ prob(Yi,β′ ∩ Lang(π)) + 1/(2 ∗ |Σ|^c ∗ f(m) ∗ pol(f(m))),

by Claim 3. By Lemma 2, αi is identified correctly with probability at least 1 − δ/n from the data in Ai. It follows that the body of the loop in Step (2) is executed correctly with probability at least 1 − δ/n and the loop invariant is preserved.

Case 2. i = m + 1.

By Step (1) of the algorithm, the shortest example is strictly shorter than c ∗ n, and it has length at least c ∗ m by construction. Thus we already know m < n. With probability at least 1 − δ ∗ (n − 1)/n the previous iterations of Step (2) have gone through successfully and πm = π. Consider the mapping t which omits from every string the last symbol. Now σ ∈ Xm iff σ ∈ Lang(π) and t(σ) ∉ Lang(π). Let D(π, h) be the weighted number of strings generated by π of length h, that is, D(π, h) = |Σ^h ∩ Lang(π)| / |Σ|^h. Since D(π, f(m)) ≥ 1/2 and D(π, 0) = 0, there is an h ∈ {1, 2, . . . , f(m)} such that

D(π, h) − D(π, h − 1) ≥ 1/(2 ∗ f(m)) ≥ 1/(2 ∗ |Σ|^c ∗ f(n)).

Note that h ≤ f(n), since f is increasing. It follows that

prob(Xm) ≥ 1/(2 ∗ |Σ|^c ∗ f(n) ∗ pol(f(n)))
and thus with probability at least 1 − δ/n a string from Xm is in Am, and in particular in A (by Lemma 2). Thus, with probability at least 1 − δ, the algorithm terminates after going through Step (2) m + 1 times with the correct output. To get a polynomial time bound for the learner, we note the following. It is easy to show that there is a polynomial q(m, 1/δ′) which, with sufficiently high probability (1 − δ′, for any fixed δ′), bounds the parameter n of the learning algorithm. Thus, with probability at least 1 − δ − δ′, the whole algorithm is successful in time and example-number polynomial in m, 1/δ, 1/δ′. Thus, for
any given δ′′, by choosing δ = δ′ = δ′′/2, one can get the desired polynomial-time algorithm. We hope in the future (not as part of the present paper) to run our algorithm on molecular biology data to see whether it can quickly provide useful answers.
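To illustrate the structure of the learner (not its probabilistic guarantees), the following Python sketch implements a simplified, deterministic reading of Steps (1)-(3): the confidence machinery (δ, f, pol and the sample-size function r) is omitted, and all helper names are ours. The greedy matcher plays the role of the sets Xi: match_prefix(σ, [α1, . . . , αi]) returns the length of σ's unique prefix in Xi, if any.

```python
from itertools import product

# A simplified, deterministic sketch of the learner's main loop.
# All helper names are ours; the statistical machinery is omitted.

def match_prefix(s, blocks):
    """Length of the shortest prefix of s containing the constant
    blocks in order (greedy left-to-right matching), i.e. the length
    of s's unique prefix in X_i, or None if there is none."""
    pos = 0
    for b in blocks:
        i = s.find(b, pos)
        if i < 0:
            return None
        pos = i + len(b)
    return pos

def learn_blocks(sample, sigma, c):
    """Greedily pick each alpha_i in Sigma^c as the block occurring,
    right after the already-matched part, in most sample strings."""
    blocks = []
    while True:
        # stop when some example lies in X_{i-1}: its greedy match
        # ends exactly at the end of the string
        if any(match_prefix(s, blocks) == len(s) for s in sample):
            return blocks
        best, best_count = None, 0
        for beta in map(''.join, product(sigma, repeat=c)):
            count = sum(
                1 for s in sample
                if (p := match_prefix(s, blocks)) is not None
                and s.startswith(beta, p)
            )
            if count > best_count:
                best, best_count = beta, count
        if best is None:
            return blocks  # abort: no candidate block found
        blocks.append(best)

# Sample plausibly drawn from the pattern x0 "ab" x1 "ba" x2:
print(learn_blocks(["abba", "abbac", "abcba", "abbab"], "abc", 2))
# -> ['ab', 'ba']
```

On real data the counts |Ai ∩ Yi,β| of the paper replace the raw majorities used here, and the sample size prescribed in Step (1) makes the majority vote correct with high probability.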
References
1. D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21:46–62, 1980.
2. S. Arikawa, T. Shinohara, and A. Yamamoto. Learning elementary formal systems. Theoretical Computer Science, 95:97–113, 1992.
3. T. Shinohara and H. Arimura. Inductive inference of unbounded unions of pattern languages from positive data. Theoretical Computer Science, 241:191–209, 2000.
4. I. Bratko and S. Muggleton. Applications of inductive logic programming. Communications of the ACM, 1995.
5. A. Brāzma, E. Ukkonen, and J. Vilo. Discovering unbounded unions of regular pattern languages from positive examples. In Proceedings of the 7th International Symposium on Algorithms and Computation (ISAAC'96), volume 1178 of Lecture Notes in Computer Science, pages 95–104. Springer, 1996.
6. J. Case, S. Jain, S. Kaufmann, A. Sharma, and F. Stephan. Predictive learning models for concept drift. Theoretical Computer Science, 268:323–349, 2001. Special Issue for ALT'98.
7. J. Case, S. Jain, S. Lange, and T. Zeugmann. Incremental concept learning for bounded data mining. Information and Computation, 152(1):74–110, 1999.
8. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, and T. Zeugmann. Learning one-variable pattern languages very efficiently on average, in parallel, and by asking queries. Theoretical Computer Science, 261(1):119–156, 2001.
9. E.M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
10. M. Kearns and L. Pitt. A polynomial-time algorithm for learning k-variable pattern languages from examples. In R. Rivest, D. Haussler, and M.K. Warmuth, editors, Proceedings of the Second Annual ACM Workshop on Computational Learning Theory, pages 57–71. Morgan Kaufmann Publishers Inc., 1989.
11. P. Kilpeläinen, H. Mannila, and E. Ukkonen. MDL learning of unions of simple pattern languages from positive examples. In Paul Vitányi, editor, Second European Conference on Computational Learning Theory, volume 904 of Lecture Notes in Artificial Intelligence, pages 252–260. Springer, 1995.
12. S. Lange and R. Wiehagen. Polynomial time inference of arbitrary pattern languages. New Generation Computing, 8:361–370, 1991.
13. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
14. S. Matsumoto and A. Shinohara. Learnability of subsequence languages. Information Modeling and Knowledge Bases VIII, pages 335–344. IOS Press, 1997.
15. T. Mitchell. Machine Learning. McGraw Hill, 1997.
16. S. Miyano, A. Shinohara, and T. Shinohara. Polynomial-time learning of elementary formal systems. New Generation Computing, 18:217–242, 2000.
17. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19/20:669–679, 1994.
18. R. Nix. Editing by examples. Technical Report 280, Department of Computer Science, Yale University, New Haven, CT, USA, 1983.
19. D. Reidenbach. A negative result on inductive inference of extended pattern languages. In N. Cesa-Bianchi and M. Numao, editors, Algorithmic Learning Theory, 13th International Conference, ALT 2002, Lübeck, Germany, November 2002, Proceedings, pages 308–320. Springer, 2002.
20. R. Reischuk and T. Zeugmann. Learning one-variable pattern languages in linear average time. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 198–208. ACM Press, 1998.
21. P. Rossmanith and T. Zeugmann. Stochastic finite learning of the pattern languages. Machine Learning, 44(1/2):67–91, 2001. Special Issue on Automata Induction, Grammar Inference, and Language Acquisition.
22. A. Salomaa. Patterns (The Formal Language Theory Column). EATCS Bulletin, 54:46–62, 1994.
23. A. Salomaa. Return to patterns (The Formal Language Theory Column). EATCS Bulletin, 55:144–157, 1994.
24. R. Schapire. Pattern languages are not learnable. In M.A. Fulk and J. Case, editors, Proceedings of the 3rd Annual ACM Workshop on Computational Learning Theory, pages 122–129. Morgan Kaufmann Publishers, Inc., 1990.
25. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara, and S. Arikawa. Knowledge acquisition from amino acid sequences by machine learning system BONSAI. Transactions of the Information Processing Society of Japan, 35:2009–2018, 1994.
26. T. Shinohara. Polynomial time inference of extended regular pattern languages. In RIMS Symposia on Software Science and Engineering, Kyoto, Japan, volume 147 of Lecture Notes in Computer Science, pages 115–127. Springer-Verlag, 1982.
27. T. Shinohara. Inferring unions of two pattern languages. Bulletin of Informatics and Cybernetics, 20:83–88, 1983.
28. T. Shinohara and S. Arikawa. Learning data entry systems: An application of inductive inference of pattern languages. Research Report 102, Research Institute of Fundamental Information Science, Kyushu University, 1983.
29. T. Shinohara and S. Arikawa. Pattern inference. In Klaus P. Jantke and Steffen Lange, editors, Algorithmic Learning for Knowledge-Based Systems, volume 961 of Lecture Notes in Artificial Intelligence, pages 259–291. Springer, 1995.
30. R. Smullyan. Theory of Formal Systems. Annals of Mathematical Studies, No. 47. Princeton, NJ, 1961.
31. L.G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.
32. K. Wright. Identification of unions of languages drawn from an identifiable class. In R. Rivest, D. Haussler, and M.K. Warmuth, editors, Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 328–333. Morgan Kaufmann Publishers, Inc., 1989.
33. T. Zeugmann. Lange and Wiehagen's pattern language learning algorithm: An average-case analysis with respect to its total learning time. Annals of Mathematics and Artificial Intelligence, 23(1–2):117–145, 1998.
Identification with Probability One of Stochastic Deterministic Linear Languages

Colin de la Higuera¹ and Jose Oncina²
¹ EURISE, Université de Saint-Etienne, 23 rue du Docteur Paul Michelon, 42023 Saint-Etienne, France. [email protected], http://eurise.univ-st-etienne.fr/~cdlh
² Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Ap. 99, E-03080 Alicante, Spain. [email protected], http://www.dlsi.es/~oncina
Abstract. Learning context-free grammars is generally considered a very hard task. This is even more so when learning has to be done from positive examples only. In this context, one possibility is to learn stochastic context-free grammars, under the implicit assumption that the distribution of the examples is given by such an object. Nevertheless, this is still a hard task for which no algorithm is known. We use recent results to introduce a proper subclass of linear grammars, called deterministic linear grammars, for which we prove that a small canonical form can be found. This has historically been a successful condition for a learning algorithm to exist. We propose an algorithm for this class of grammars and prove that it works in polynomial time and structurally converges to the target in the paradigm of identification in the limit with probability 1. Although this does not ensure that only a polynomial-size sample is necessary for learning to be possible, we argue that the criterion means that no added (hidden) bias is present.
1 Introduction
Context-free grammars are known to have greater modeling capacity than regular grammars or finite state automata. Learning these grammars is also harder, but it is considered an important and challenging task. Yet without external help, such as knowledge of the structure of the strings [Sak92], only clever but limited heuristics have been proposed [LS00,NMW97]. When only positive examples exist, or when the actual problem is that of building a language model, stochastic context-free grammars have been proposed. In a number of applications (computational biology [SBH+94] and speech recognition [WA02] are just two typical examples), it is speculated that success will
The author thanks the Generalitat Valenciana for partial support of this work through project CETIDIB/2002/173.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 247–258, 2003. © Springer-Verlag Berlin Heidelberg 2003
depend on being able to replace finite state models such as Hidden Markov Models by stochastic context-free grammars. Yet the problem of learning this type of grammar from strings has rarely been addressed. The usual way of dealing with the problem still consists in first learning a structure, and then estimating the probabilities [Bak79]. In a more theoretical setting, learning from both examples and counterexamples has been studied for classes of grammars that are more general than the regular grammars but restricted to cases where both determinism and linearity apply [dlHO02]. On the other hand, learning (deterministic) regular stochastic grammars has received a lot of attention over the past 10 years. A well-known algorithm for this task is ALERGIA [CO94], which has been improved by different authors [YLT00,CO99] and applied to different tasks [WA02]. We synthesize in this paper both types of results and propose a novel class of stochastic languages that we call stochastic deterministic linear languages. We prove that each language of the class admits an equivalence relation of finite index, thus leading to a canonical normal form. We propose an algorithm that works in polynomial time with respect to the learning data. It can identify with probability one any language in the class. In Section 2 the necessary definitions are given. We prove in Section 3 the existence of a small normal form, and give in Section 4 a learning algorithm that can learn grammars in normal form.
2 Definitions

2.1 Languages and Grammars
An alphabet Σ is a finite nonempty set of symbols. Σ∗ denotes the set of all finite strings over Σ. A language L over Σ is a subset of Σ∗. In the following, unless stated otherwise, symbols are denoted by a, b, c, . . . , strings by u, v, . . . , and the empty string by λ. The length of a string u will be denoted |u|. Let u, v ∈ Σ∗; u−1v = w such that v = uw (undefined if u is not a prefix of v), and uv−1 = w such that u = wv (undefined if v is not a suffix of u). Let L be a language and u ∈ Σ∗; u−1L = {v : uv ∈ L} and Lu−1 = {v : vu ∈ L}. Let L be a language; the prefix set is Pref(L) = {x : xy ∈ L}. The longest common suffix lcs(L) of L is the longest string u such that (Lu−1)u = L. A context-free grammar G is a quadruple (Σ, V, R, S) where Σ is a finite alphabet (of terminal symbols), V is a finite alphabet (of variables or non-terminals), R ⊂ V × (Σ ∪ V)∗ is a finite set of production rules, and S (∈ V) is the starting symbol. We write uT v → uwv when (T, w) ∈ R. →* is the reflexive and transitive closure of →. If there exist u0, . . . , uk such that u0 → · · · → uk, we write u0 →k uk. We denote by LG(T) the language {w ∈ Σ∗ : T →* w}. Two grammars are equivalent if they generate the same language. A context-free grammar G = (Σ, V, R, S) is linear if R ⊂ V × (Σ∗ V Σ∗ ∪ Σ∗).
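For concreteness, the string and language operations just defined can be spelled out on finite languages. The following Python sketch (helper names are ours, and only finite sets of strings are handled) mirrors u−1v, uv−1, u−1L, Pref(L) and lcs(L):

```python
# Finite-language illustrations of the operations above; the function
# names are ours, and only finite sets of strings are handled.

def left_quotient_str(u, v):
    """u^{-1} v: the w with v = uw (None if u is not a prefix of v)."""
    return v[len(u):] if v.startswith(u) else None

def right_quotient_str(u, v):
    """u v^{-1}: the w with u = wv (None if v is not a suffix of u)."""
    return u[:len(u) - len(v)] if u.endswith(v) else None

def left_quotient(u, L):
    """u^{-1} L = {v : uv in L}."""
    return {v[len(u):] for v in L if v.startswith(u)}

def prefixes(L):
    """Pref(L) = {x : xy in L}."""
    return {v[:i] for v in L for i in range(len(v) + 1)}

def lcs(L):
    """Longest common suffix: the longest u with (L u^{-1}) u = L."""
    words = list(L)
    suf = ""
    while words and all(len(w) > len(suf) for w in words):
        c = words[0][-len(suf) - 1]
        if all(w[-len(suf) - 1] == c for w in words):
            suf = c + suf
        else:
            break
    return suf

L = {"aba", "ba", "cba"}
print(lcs(L))                 # "ba"
print(left_quotient("a", L))  # {"ba"}
```

The paper of course works with arbitrary, possibly infinite, languages; this sketch only fixes intuitions for the finite samples used later.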
2.2 Stochastic Languages
A stochastic language L over Σ is defined by a probability density function over Σ∗ giving the probability p(w|L) that the string w ∈ Σ∗ appears in the language. To be consistent, a necessary condition is that Σx∈Σ∗ p(x|L) = 1. When convenient, we will represent a stochastic language as a set of pairs: L = {(u, p(u|L)) : p(u|L) > 0}. Consequently (u, pu) ∈ L =⇒ p(u|L) > 0. Also, to avoid unnecessary notation, we allow the empty set ∅ to be a stochastic language (paired with an arbitrary function). The probability of any subset X ⊆ Σ∗ is given by p(X|L) = Σu∈X p(u|L).
Let L be a stochastic language and u ∈ Σ∗:
Pref(L) = {u : (uv, p) ∈ L}, Sf(L) = {u : (vu, p) ∈ L},
uL = {(uv, p) : (v, p) ∈ L}, Lu = {(vu, p) : (v, p) ∈ L},
u−1L = {(v, pv) : (uv, p(uΣ∗|L) ∗ pv) ∈ L},
Lu−1 = {(v, pv) : (vu, p(Σ∗u|L) ∗ pv) ∈ L}.
Note that the expressions for u−1L and Lu−1 are equivalent to {(v, pv) : pv = p(uv|L)/p(uΣ∗|L)} and {(v, pv) : pv = p(vu|L)/p(Σ∗u|L)} respectively, but avoid division-by-zero problems. Of course, if u is a common prefix (respectively a common suffix) of L, then p(uΣ∗|L) = 1 (respectively p(Σ∗u|L) = 1) and u−1L = {(v, pv) : (uv, pv) ∈ L} (respectively Lu−1 = {(v, pv) : (vu, pv) ∈ L}). We denote the longest common suffix reduction of a stochastic language L by L↓ = {(u, p) : z = lcs(L), (uz, p) ∈ L}, where lcs(L) = lcs{u : (u, p) ∈ L}. Note that if L is a stochastic language then, for all u, u−1L, Lu−1 and L↓ are also stochastic languages.
A stochastic deterministic linear (SDL) grammar G = (Σ, V, R, S, p) consists of Σ, V, S as for context-free grammars, a finite set R of derivation rules, each with one of the structures X → aY w or X → λ, such that X → aY w, X → aZv ∈ R ⇒ Y = Z ∧ w = v, and a real function p : R → ]0, 1] giving the probability of each derivation. The probability p(S →* w) that the grammar G generates the string w is defined recursively as:
p(X →* avw) = p(X → aY w) ∗ p(Y →* v)
where Y is the only variable such that X → aY w ∈ R (if no such variable exists, then p(X →* avw) = 0 is assumed). It can be shown that, if for all A ∈ V, ΣA→α∈R p(A → α) = 1 and G does not contain useless symbols, then G
defines a stochastic deterministic linear language LG through the probabilities p(w|LG) = p(S →* w). Let X be a variable in the SDL grammar G = (Σ, V, R, S, p); then LG(X) = {(u, pu) : p(X →* u) = pu}. A non-stochastic version of the above definition is studied in [dlHO02]: it corresponds to a very general class of linear grammars that includes, for instance, grammars for all regular languages, palindrome languages and {a^n b^n : n ∈ N}. In the same paper a more general form of deterministic linear grammars was proposed, equivalent to the form we use to support our grammars here. The extension of these results to general deterministic linear grammars will not be done in this paper.
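The recursive definition of p(S →* w) suggests a direct computation. The following Python sketch uses a hypothetical dictionary encoding of an SDL grammar (our own representation, not the paper's): each variable maps to a list of rules ((a, Y, w), probability), with Y = None encoding X → λ.

```python
# Hypothetical encoding (ours): grammar[X] is a list of rules
# ((a, Y, w), probability); Y is None for the rule X -> lambda.
# Determinism guarantees at most one rule of X per leading terminal a.

def prob(grammar, X, s):
    """p(X ->* s), following the recursive definition above."""
    total = 0.0
    for (a, Y, w), p in grammar[X]:
        if Y is None:                       # rule X -> lambda
            if s == "":
                total += p
        elif (s.startswith(a) and s.endswith(w)
              and len(s) >= len(a) + len(w)):
            v = s[len(a):len(s) - len(w)]   # s = a v w
            total += p * prob(grammar, Y, v)
    return total

# Example: {a^n b^n : n >= 0} with p(S -> a S b) = p(S -> lambda) = 1/2
G = {"S": [(("a", "S", "b"), 0.5), (("", None, ""), 0.5)]}

print(prob(G, "S", "aabb"))   # 0.5 * 0.5 * 0.5 = 0.125
```

Because the grammar is deterministic, at most one rule applies at each step, so the recursion mirrors the unique derivation of s when one exists.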
3 A Canonical Form for Stochastic Deterministic Linear Grammars
For a class of stochastic languages to be identifiable in the limit with probability one, a reasonable assumption is that there exists some small canonical form for any language representable in the class. We prove in this section that this is indeed the case for stochastic deterministic linear grammars. The purpose of this section is to reach a computable normal form for SDL grammars. For this we first define a normal form for these grammars (called advanced, as the longest common suffixes appear as soon as possible), and then construct such a grammar from any deterministic linear language.

Definition 1 (Advanced form). A stochastic deterministic linear grammar G = (Σ, V, R, S, p) is in advanced form if:
1. for every rule T → aT′w in R, w = lcs(a−1 LG(T));
2. all non-terminal symbols are accessible: ∀T ∈ V, ∃u, v ∈ Σ∗ : S →* uT v;
3. all non-terminal symbols are useful: ∀T ∈ V, LG(T) ≠ ∅;
4. ∀T, T′ ∈ V, LG(T) = LG(T′) ⇒ T = T′.

We build the canonical form from the language so as to ensure uniqueness:

Definition 2 (Common suffix-free language equivalence). Given a stochastic language L, we define recursively the common suffix-free languages CSFL(·) and the associated equivalence relation as follows:
CSFL(λ) = L
CSFL(xa) = (a−1 CSFL(x)) ↓
x ≡L y ⇐⇒ CSFL(x) = CSFL(y)

Proposition 1. The equivalence relation ≡L has a finite index.

Proof. See the appendix.
Definition 3 (A canonical grammar). Given any stochastic deterministic linear language L, the canonical grammar associated with L is GL = (Σ, V, R, SCSFL(λ), p) where:
V = {SCSFL(x) : CSFL(x) ≠ ∅}
R = {SCSFL(x) → a SCSFL(xa) lcs(a−1 CSFL(x)) : CSFL(xa) ≠ ∅} ∪ {SCSFL(x) → λ : λ ∈ CSFL(x)}
p(SCSFL(x) → a SCSFL(xa) w) = p(aΣ∗w | CSFL(x)) = p(aΣ∗ | CSFL(x))
p(SCSFL(x) → λ) = p(λ | CSFL(x))

Proposition 1 allows this construction to terminate. The correctness of the construction is a consequence of:

Proposition 2. Let L be a SDL language and let GL = (Σ, V, R, S, p) be its associated canonical grammar. Then L = LGL(S).

Proof. See the appendix.

Theorem 1. Given a SDL grammar G = (Σ, VG, RG, SG, pG), let GL = (Σ, VGL, RGL, SGL, pGL) be the canonical grammar that generates L = LG(SG). Then:
1. GL is advanced;
2. |VGL| ≤ |VG| + 1.

Proof. We prove that GL is advanced by showing that conditions 1 to 4 of Definition 1 hold. The proof of the second part is a consequence of Lemma 5 and Proposition 4; both results are given and proved in the appendix: they state that the number of classes of CSFL, and thus the number of variables in the canonical grammar, is bounded by the number of non-terminals in the original grammar.
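Although CSFL is defined on (possibly infinite) stochastic languages, the recursion of Definition 2 can be mimicked on a finite-support approximation. The sketch below is our own illustration, with our own helper names; it is not the construction used in the identification proof.

```python
# Finite-support illustration of Definition 2.  A stochastic language
# is a dict {word: probability}; helper names are ours.

def left_quotient(a, L):
    """a^{-1} L, renormalised as in Section 2.2."""
    mass = sum(p for w, p in L.items() if w.startswith(a))
    if mass == 0:
        return {}
    return {w[len(a):]: p / mass for w, p in L.items() if w.startswith(a)}

def lcs(words):
    """Longest common suffix of a finite set of words."""
    words = list(words)
    suf = ""
    while words and all(len(w) > len(suf) for w in words):
        c = words[0][-len(suf) - 1]
        if all(w[-len(suf) - 1] == c for w in words):
            suf = c + suf
        else:
            break
    return suf

def csf(L, x):
    """CSF_L(x): iterate CSF_L(xa) = (a^{-1} CSF_L(x)) with the
    longest common suffix stripped (the down-arrow operation)."""
    for a in x:
        L = left_quotient(a, L)
        z = lcs(L.keys())
        if z:
            L = {w[:len(w) - len(z)]: p for w, p in L.items()}
    return L

L = {"ab": 0.5, "aab": 0.25, "aaab": 0.25}
print(csf(L, "a"))   # quotient by "a", then strip the common suffix "b"
```

Running this on the example prints {'': 0.5, 'a': 0.25, 'aa': 0.25}: after the quotient by a, the suffix b is common to all remaining words and is therefore pulled out, exactly as the advanced form requires.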
4 Learning SDL Grammars
As SDL languages admit a small canonical form, it is sufficient to have an algorithm that can identify a grammar in this canonical form. We divide the task of learning into two steps:
1. Identify the topology of the grammar, that is, the rules of type A → aBv, without the probabilities.
2. Add the rules of type A → λ and assign the probabilities.
The second step can be done by counting the use of the different rules while parsing a sample (maximum likelihood estimation); alternatively, as this does not achieve identification, techniques based on Stern-Brocot trees can be used in a similar way as in [dlHT00]. Hence we concentrate on the first step.

Definition 4. Let L be a SDL language and ≤ a length-lexicographic order relation over Σ∗. The shortest prefix set of L is
SpL = {x ∈ Pref(L) : CSFL(x) ≠ ∅ ∧ (y ≡L x ⇒ x ≤ y)}
Note that, in a canonical grammar, we have a one-to-one relation between strings in Sp and non-terminals of the grammar. We shall thus use the strings in Sp as identifiers for the non-terminal symbols. To describe the algorithm we shall imagine that we have access to an unlimited oracle that knows the language L and to which we can address the following queries:
nextL(x) = {xa ∈ Pref(L) : a ∈ Σ}
equivL(x, y) ⇐⇒ x ≡L y
rightL(xa) = lcs(a−1 CSFL(x))
Algorithm 1 visits the prefixes of the language L in length-lexicographic order and constructs the canonical grammar corresponding to Definition 3. If a prefix xa is visited and no previous equivalent non-terminal has been found (and placed in Sp), this prefix is added to Sp as a new non-terminal and the corresponding rule is added to the grammar. If there exists an equivalent non-terminal y in Sp, then the corresponding rule is added, but the strings of which xa is a prefix will not be visited (they will not be added to W). When the algorithm finishes, Sp contains all the shortest prefixes of the language. Algorithm 1 is clearly polynomial in the size of the set W, provided the auxiliary functions are polynomial. A stochastic sample S of the stochastic language L is an infinite sequence of strings generated according to the probability distribution p(w|L). We denote by Sn the sequence of the first n strings (not necessarily different) in S, which will be used as input for the algorithm. The number of occurrences in Sn of the string x will be denoted cn(x), and for any subset X ⊆ Σ∗, cn(X) = Σx∈X cn(x). Note that, in the context of the algorithm, nextL(x), rightL(xa) and equivL(xa, y) are only computed when x and y are in SpL. Therefore the size of W is bounded by the number of prefixes of Sn.
Algorithm 1 Computing G using functions next, right and equiv
Require: functions next, right and equiv, language L
Ensure: L(G) = L with G = (Σ, V, R, Sλ)
  Sp = {λ}; V = {Sλ}
  W = nextL(λ)
  while W ≠ ∅ do
    xa = min W
    W = W − {xa}
    if ∃y ∈ Sp : equivL(xa, y) then
      add Sx → a Sy rightL(xa) to R
    else
      Sp = Sp ∪ {xa}; V = V ∪ {Sxa}
      W = W ∪ nextL(xa)
      add Sx → a Sxa rightL(xa) to R
    end if
  end while

In order to use Algorithm 1 with a sample Sn instead of an oracle with access to the whole language L,
the 3 functions must be implemented as functions of Sn (nextSn(·), rightSn(·) and equivSn(·, ·)) rather than of L, so that they give the same result as nextL(x), rightL(xa) and equivL(xa, y) when x, y ∈ SpL and n tends to infinity. In order to simplify notation we introduce:

Definition 5. Let L be a SDL language; then, for all x such that CSFL(x) ≠ ∅,
tailL(x) = lcs(x−1 L) if x ≠ λ, and tailL(x) = λ if x = λ.

A slightly different function tail that works over sequences is now introduced. This function will be used to define a function right that works over sequences.

Definition 6. Let Sn be a finite sequence of strings; then, for all x ∈ Pref(Sn),
tailSn(x) = lcs(x−1 Sn) if x ≠ λ, and tailSn(x) = λ if x = λ.

Lemma 1. Let GL = (Σ, V, R, S, p) be the canonical grammar of a SDL language L. Then, for all x with CSFL(x) ≠ ∅,
lcs(a−1 CSFL(x)) = (tailL(xa))(tailL(x))−1.

Proof. The proof is similar to Lemma 4(1) of [dlHO02].

Definition 7.
nextSn(x) = {xa : ∃xay ∈ Sn}
rightSn(xa) = tailSn(xa) tailSn(x)−1

It should be noticed that the above definition ensures that the functions nextSn and rightSn can be computed in time polynomial in the size of Sn. We now prove that the above definitions allow the functions nextSn and rightSn to converge, in the limit, to the intended functions nextL and rightL:

Lemma 2. Let L be a SDL language. For each sample Sn of L containing a set D ⊆ {x : (x, p) ∈ L} such that:
1. ∀x ∈ SpL ∀a ∈ Σ : xa ∈ Pref(L) ⇒ ∃xaw ∈ D,
2. ∀x ∈ SpL ∀a ∈ Σ : CSFL(xa) ≠ ∅ ⇒ tailD(xa) = tailL(xa),
we have, for all x ∈ SpL:
1. nextSn(x) = nextL(x)
2. rightSn(xa) = rightL(xa)

Proof. Point 1 is clear by definition, and point 2 is a consequence of Lemma 1.

Lemma 3. With probability one, nextSn(x) = nextL(x) and rightSn(xa) = rightL(xa) for all x ∈ SpL, except for finitely many values of n.
Proof. Given a SDL language, there exists (at least one) such set D with non-null probability. Then with probability 1 any sufficiently large sample contains such a set D. Since SpL is unique for each SDL language, the above lemma yields the result.

In order to evaluate the equivalence relation equiv(x, y) ⇐⇒ x ≡L y ⇐⇒ CSFL(x) = CSFL(y), we have to check whether two stochastic languages are equivalent from a finite sample Sn. To do that, instead of comparing the probabilities of each string of the sample, we compare the probabilities of their prefixes. This strategy (also used in ALERGIA [CO94] and RLIPS [CO99]) allows different probabilities to be distinguished faster, as more information is always available about a prefix than about a whole string. It is therefore easy to establish the equivalence between the various definitions:

Proposition 3. Two stochastic languages L1 and L2 are equal iff p(aΣ∗ | w−1 L1) = p(aΣ∗ | w−1 L2) for all a ∈ Σ and all w ∈ Σ∗.

Proof. L1 = L2 =⇒ ∀w ∈ Σ∗ : p(w|L1) = p(w|L2) =⇒ w−1 L1 = w−1 L2 =⇒ ∀z ⊆ Σ∗ : p(z | w−1 L1) = p(z | w−1 L2). Conversely, L1 ≠ L2 =⇒ ∃w ∈ Σ∗ : p(w|L1) ≠ p(w|L2). Let w = az; as p(az|L) = p(aΣ∗|L) p(z | a−1 L), we have p(aΣ∗|L1) p(z | a−1 L1) ≠ p(aΣ∗|L2) p(z | a−1 L2). Now we have 2 cases:
1. p(aΣ∗|L1) ≠ p(aΣ∗|L2), and the proposition is shown.
2. p(aΣ∗|L1) = p(aΣ∗|L2); then p(z | a−1 L1) ≠ p(z | a−1 L2). This can be applied recursively unless w = λ. In such a case we have that ∃w ∈ Σ∗ : p(w|L1) ≠ p(w|L2) ∧ p(wΣ∗|L1) = p(wΣ∗|L2). But since Σx∈Σ∗ p(x|Li) = 1, it follows that ∃a ∈ Σ such that p(waΣ∗|L1) ≠ p(waΣ∗|L2). Thus p(aΣ∗ | w−1 L1) ≠ p(aΣ∗ | w−1 L2).

As a consequence, x ≡L y ⇐⇒ p(aΣ∗ | (xz)−1 L) = p(aΣ∗ | (yz)−1 L) for all a ∈ Σ, z ∈ Σ∗.

If, instead of the whole language, we have a finite sample Sn, we estimate the probabilities by counting the appearances of the strings and comparing them using a confidence range.

Definition 8. Let f/n be the observed frequency of a Bernoulli variable of probability p.
We denote by εα(n) a function such that p(|f/n − p| < εα(n)) > 1 − α (the Hoeffding bound provides one such function).

Lemma 4. Let f1/n1 and f2/n2 be two observed frequencies of a Bernoulli variable of probability p. Then:

p( |f1/n1 − f2/n2| < εα(n1) + εα(n2) ) > (1 − α)^2
Proof. p(|f1/n1 − f2/n2| < εα(n1) + εα(n2)) ≥ p(|f1/n1 − p| + |f2/n2 − p| < εα(n1) + εα(n2)) ≥ p(|f1/n1 − p| < εα(n1) ∧ |f2/n2 − p| < εα(n2)) > (1 − α)^2.

Definition 9. equivSn(x, y) ⇐⇒ for all z ∈ Σ∗ such that xz ∈ Pref(Sn) ∧ yz ∈ Pref(Sn), and all a ∈ Σ:
|cn(xzaΣ∗)/cn(xzΣ∗) − cn(yzaΣ∗)/cn(yzΣ∗)| < εα(cn(xzΣ∗)) + εα(cn(yzΣ∗)) ∧
|cn(xz)/cn(xzΣ∗) − cn(yz)/cn(yzΣ∗)| < εα(cn(xzΣ∗)) + εα(cn(yzΣ∗))
This does not correspond to an infinite number of tests but only to those for which xz or yz is a prefix in Sn. Each of these tests returns the correct answer with probability greater than (1 − α)^2. Because the number of checks grows with |Pref(Sn)|, we allow the parameter α to depend on n.

Theorem 2. Let the parameter αn be such that Σ_{n=0}^∞ n ∗ αn is finite. Then, with probability one, (x ≡L y) = equivSn(x, y) except for finitely many values of n.

Proof. In order to compute equivSn(x, y), a maximum of 2|Pref(Sn)| tests are made, each with confidence above (1 − αn)^2. Let An be the event that at least one of the equivalence tests fails ((x ≡L y) ≠ equivSn(x, y)) when using Sn as a sample. Then Pr(An) < 4 αn |Pref(Sn)|. According to the Borel-Cantelli lemma [Fel68], if Σ_{n=0}^∞ Pr(An) < ∞ then, with probability one, only finitely many events An take place. As the expected size of Pref(Sn) cannot grow faster than linearly with n, it is sufficient that Σ_{n=1}^∞ n ∗ αn < ∞.
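As a rough illustration of Definitions 8 and 9, the following Python sketch performs a one-level version of the compatibility test: Definition 9 recurses over all continuations z and also compares end-of-string frequencies, whereas here only the next-symbol frequencies after x and after y are compared. The Hoeffding bound instantiates εα(n); all names are ours.

```python
import math

# eps(alpha, n) is one possible epsilon_alpha(n): the Hoeffding bound.
# compatible() is a one-level simplification of equiv_{S_n}(x, y);
# all names are ours.

def eps(alpha, n):
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n)) if n else float("inf")

def count_prefix(sample, x):
    return sum(1 for w in sample if w.startswith(x))

def compatible(sample, x, y, alpha):
    nx, ny = count_prefix(sample, x), count_prefix(sample, y)
    if nx == 0 or ny == 0:
        return True                 # nothing observed: cannot reject
    nexts = {w[len(x)] for w in sample if w.startswith(x) and len(w) > len(x)} \
          | {w[len(y)] for w in sample if w.startswith(y) and len(w) > len(y)}
    for a in nexts:
        fx = count_prefix(sample, x + a) / nx
        fy = count_prefix(sample, y + a) / ny
        if abs(fx - fy) >= eps(alpha, nx) + eps(alpha, ny):
            return False
    return True

sample = ["ab", "ab", "ab", "ac", "bb", "bb", "bb", "bc"]
print(compatible(sample, "a", "b", alpha=0.05))   # True: behaviours agree
```

Comparing prefix counts rather than whole-string counts is exactly the ALERGIA/RLIPS idea recalled above: every string contributes to many prefix statistics, so differences show up from smaller samples.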
5 Discussion and Conclusion
We have described a type of stochastic grammars that corresponds to a large class of languages including regular languages, palindrome languages, linear LL(1) languages and other typical linear languages such as {a^n b^n : n ≥ 0}. The existence of a canonical form for any grammar in the class is proved, and an algorithm that can learn stochastic deterministic linear grammars is given. This algorithm works in polynomial time and can identify the structure and the probabilities when these are rational (see [dlHT00] for details). It is nevertheless easy to construct a grammar for which learning is practically doomed: with high probability, not enough examples will be available to notice that some lethal merge should not take place. A counterexample can be constructed by simulating parity functions with a grammar. So somehow the paradigm we are using, of polynomial identification in the limit with probability one, seems too weak. But on the other hand, it is intriguing to notice that the combination of the two criteria of polynomial runtime and identification in the limit with probability one does not seem to result in a very strong condition: it is for instance unclear whether a non-effective enumeration algorithm might also meet the
required standards. It might even be the case that the entire class of context-free grammars is identifiable in the limit with probability one by polynomial algorithms. An open problem for which, in our mind, an answer would be of real help for further research in the field is that of coming up with a new learning criterion for polynomial distribution learning. This should in a certain way better match the idea of polynomial identification with probability one.
References
[Bak79] J. K. Baker. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547–550, 1979.
[CO94] R. Carrasco and J. Oncina. Learning stochastic regular grammars by means of a state merging method. In Proceedings of ICGI'94, number 862 in LNAI, pages 139–150. Springer-Verlag, 1994.
[CO99] R. C. Carrasco and J. Oncina. Learning deterministic regular grammars from stochastic samples in polynomial time. RAIRO (Theoretical Informatics and Applications), 33(1):1–20, 1999.
[dlHO02] C. de la Higuera and J. Oncina. Learning deterministic linear languages. In Proceedings of COLT 2002, number 2375 in LNAI, pages 185–200, Berlin, Heidelberg, 2002. Springer-Verlag.
[dlHT00] C. de la Higuera and F. Thollard. Identification in the limit with probability one of stochastic deterministic finite automata. In Proceedings of ICGI 2000, volume 1891 of LNAI, pages 15–24. Springer-Verlag, 2000.
[Fel68] W. Feller. An Introduction to Probability Theory and Its Applications, volumes 1 and 2. John Wiley & Sons, Inc., New York, 3rd edition, 1968.
[LS00] P. Langley and S. Stromsten. Learning context-free grammars with a simplicity bias. In Proceedings of ECML 2000, volume 1810 of LNCS, pages 220–228. Springer-Verlag, 2000.
[NMW97] C. Nevill-Manning and I. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of A.I. Research, 7:67–82, 1997.
[Sak92] Y. Sakakibara. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97:23–60, 1992.
[SBH+94] Y. Sakakibara, M. Brown, R. Hughley, I. Mian, K. Sjolander, R. Underwood, and D. Haussler. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22:5112–5120, 1994.
[WA02] Y. Wang and A. Acero. Evaluation of spoken language grammar learning in the ATIS domain. In Proceedings of ICASSP, 2002.
[YLT00] M. Young-Lai and F. W. Tompa. Stochastic grammatical inference of text database structure. Machine Learning, 40(2):111–137, 2000.
6 Appendix

Propositions from Section 3 aim at establishing that a small canonical form exists for each SDL grammar. The following proofs follow the ideas from [dlHO02].

6.1 Proof of Proposition 1
In order to prove the propositions we have to establish more definitions. To define another equivalence relation over Σ*, when given a stochastic deterministic linear grammar, we first associate in a unique way prefixes of strings in the language with non-terminals:

Definition 10. Let G = (Σ, V, R, S, p) be a SDL grammar. With every string x we associate the unique non-terminal [x]G = T such that S →* xT u; we extend LG to be a total function by setting LG([x]G) = ∅ if the non-terminal T does not exist.

We use this definition to give another equivalence relation over Σ*, when given a SDL grammar:

Definition 11. Let G = (Σ, V, R, S, p) be a SDL grammar. We define the associated common suffix-free languages CSFG(·) and the associated equivalence relation as follows:

CSFG(λ) = LG(S)
CSFG(xa) = LG([xa]G) ↓
x ≡G y ⟺ CSFG(x) = CSFG(y)

≡G is clearly an equivalence relation, in which all strings x such that [x]G is undefined are in a unique class. The following lemma establishes that ≡G has finite index, when G is a stochastic deterministic linear grammar:

Lemma 5. If [x]G = [y]G, x ≠ λ and y ≠ λ, then x ≡G y. Hence if G contains n non-terminals, ≡G has at most n + 2 classes.

The proof is straightforward. There can be at most two possible extra classes, corresponding to λ (when it is alone in its class) and the undefined class.

Lemma 6. Let G = (Σ, V, R, S, p) be a SDL grammar. If X →* xY w then:
(x⁻¹L(X)) ↓ = L(Y) ↓

Proof. It is enough to prove (a⁻¹L(X)) ↓ = L(Y) ↓ if X → aY w ∈ R, which is clear by double inclusion.

Proposition 4. Let G = (Σ, V, R, S, p) be a SDL grammar, and denote L = LG(S). ∀x ∈ Σ*, either CSFL(x) = CSFG(x) or CSFL(x) = ∅.

Proof. By induction on the length of x. Base: x = λ; then CSFL(x) = L = CSFG(x). Suppose the proposition is true for all strings of length up to k, and consider a string xa of length k + 1. CSFL(xa) = (a⁻¹ CSFL(x)) ↓ (by Definition 2). If CSFL(x) = ∅, then CSFL(xa) = ∅. If not (so CSFL(x) = CSFG(x) by the induction hypothesis), CSFL(xa) = (a⁻¹ CSFL(x)) ↓ = (a⁻¹ CSFG(x)) ↓ and there are two sub-cases:
C. de la Higuera and J. Oncina
if x = λ, CSFG(x) = LG([x]G), so CSFL(xa) = (a⁻¹ LG([x]G)) ↓;
if x ≠ λ, CSFG(x) = LG([x]G) ↓, so CSFL(xa) = (a⁻¹(LG([x]G) ↓)) ↓ (by Definition 11) = (a⁻¹(LG([x]G))) ↓.
In both cases it follows that CSFL(xa) = (a⁻¹ LG([x]G)) ↓ = LG([xa]G) ↓ (by Lemma 6) = CSFG(xa).

Corollary 1 (proof of Proposition 1). Let G = (Σ, V, R, S, p) be a stochastic deterministic linear grammar. Then ≡LG(S) has finite index.

Proof. A consequence of Lemma 5 and Proposition 4.

6.2 Proof of Proposition 2
To avoid extra notation, we will denote (as in Definition 10) by [x] the non-terminal corresponding to x in the associated grammar (formally SCSFL(x) or [x]GL). The proof that GL generates L is established through the following more general result (as the special case where x = λ):

Proposition 5. ∀x ∈ Σ*, LGL([x]) = CSFL(x).

Proof. We prove it by double inclusion.

∀x ∈ Σ*, CSFL(x) ⊆ LGL([x]). Proof by induction on the length of the strings in CSFL(x). Base case: |w| = 0 ⇒ w = λ. If (λ, p) ∈ CSFL(x), by construction of the rules, [x] → λ and p([x] → λ) = p, so (λ, p) ∈ LGL([x]). Suppose now (induction hypothesis) that ∀x ∈ Σ*, ∀w ∈ Σ^k: (w, p) ∈ CSFL(x) ⇒ (w, p) ∈ LGL([x]). Let w = auv be such that |w| = k + 1, (auv, p) ∈ CSFL(x), and let v = lcs(a⁻¹ CSFL(x)). As CSFL(xa) = (a⁻¹ CSFL(x)) ↓, there exists pu such that (u, pu) ∈ CSFL(xa), and then p = pu · p(aΣ* | CSFL(x)). As by construction [x] → a[xa]v and p([x] → a[xa]v) = p(aΣ* | CSFL(x)) and, by the induction hypothesis (|u| ≤ k), (u, pu) ∈ LGL([xa]), then (auv, p) ∈ LGL([x]).

∀x ∈ Σ*, LGL([x]) ⊆ CSFL(x). Proof by induction on the order k of the derivation:

∀x ∈ Σ*, ∀k ∈ ℕ, ∀w ∈ Σ*: [x] →^k w ⇒ (w, p([x] →^k w)) ∈ CSFL(x).

Base case [x] →^1 w. This case is only possible if w = λ, and, by construction, such a rule is in the grammar because (λ, p(λ | CSFL(x))) ∈ CSFL(x). Suppose now (induction hypothesis) that for any n ≤ k:

∀x ∈ Σ*, ∀w ∈ Σ*: [x] →^n w ⇒ ∃p : (w, p) ∈ CSFL(x).

Take w ∈ Σ* such that [x] →^{k+1} w; then [x] → a[xa]v →^k w = auv with [xa] →^k u, and p = p([x] → a[xa]v) · pu where pu = p([xa] →^k u). By the induction hypothesis we know that (u, pu) ∈ CSFL(xa) = (a⁻¹ CSFL(x)) ↓ = {(t, pt) : (atv, pa pt) ∈ CSFL(x), pa = p(aΣ* | CSFL(x)), v = lcs(a⁻¹ CSFL(x))}. As by construction we know that p([x] → a[xa]v) = p(aΣ* | CSFL(x)), then (w, p) = (auv, p([x] → a[xa]v) pu) ∈ CSFL(x).
Criterion of Calibration for Transductive Confidence Machine with Limited Feedback Ilia Nouretdinov and Vladimir Vovk Department of Computer Science Royal Holloway, University of London {ilia,vovk}@cs.rhul.ac.uk
Abstract. This paper is concerned with the problem of on-line prediction in the situation where some data is unlabelled and can never be used for prediction, and even when data is labelled, the labels may arrive with a delay. We construct a modification of the randomised Transductive Confidence Machine for this case and prove a necessary and sufficient condition for its predictions to be calibrated, in the sense that in the long run they are wrong with a prespecified probability under the assumption that the data is generated independently by the same distribution. The condition for calibration turns out to be very weak: feedback should be given on more than a logarithmic fraction of steps.
1 Introduction
In this paper we consider the problem of prediction: given some training data and a new object xn we would like to predict its label yn. We use the randomised on-line version of the Transductive Confidence Machine as the basic method of prediction; first we explain why we are interested in this method and then formulate the main question of this paper. The Transductive Confidence Machine (TCM) [3,4] is a prediction method giving "p-values" py for any possible value y of the unknown label yn; the p-values satisfy the following property (proven in, e.g., [1]): if the data satisfies the i.i.d. assumption, which means that the data is generated independently by the same mechanism, the probability that pyn < δ does not exceed δ for any threshold δ ∈ (0, 1) (the validity property). There are different ways of presenting the p-values. The one used in [3] only works in the case of pattern recognition: the prediction algorithm outputs a "most likely" label (y with the largest py) together with confidence (one minus the second largest py) and credibility (the largest py). Alternatively, the prediction algorithm can be given a threshold δ as an input, and its answer will be that the label yn should lie in the set of those y such that py > δ; this scenario of set (or region) prediction was used in [5,2] and will be used in this paper. The validity property says that the set prediction will be wrong with probability at most δ. Therefore, we can guarantee some maximal probability of error; the downside is that the set prediction can consist of more than one element.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 259–267, 2003. © Springer-Verlag Berlin Heidelberg 2003
Randomised TCM (rTCM), which is described below, is valid in a stronger sense than pure TCM: the error probability is equal to δ. In on-line TCM [5] it is supposed that machine learning is performed step-by-step: on the nth step TCM predicts the new label yn using knowledge of the new object xn and all the previous objects with their labels; after that the true information about yn becomes available and TCM can use it on the next step n + 1. In the paper [5] it was proven that the probability of error on each step is again δ; moreover, errors on different steps are independent of each other, so the mean percentage of errors asymptotically tends to δ (the calibration property). In principle, it is easy to be calibrated in set prediction; what makes TCMs interesting is that they output few uncertain predictions (predictions containing more than one label). This can be demonstrated both empirically on standard benchmark data sets (see, e.g., [5]) and theoretically: a simple Nearest Neighbours rTCM produces asymptotically no more uncertain predictions than any other calibrated algorithm for set prediction. The interest of this paper is in a more general case of on-line TCM prediction, where only some subsequence of labels is available, possibly with a delay; a necessary and sufficient condition for calibration in probability is given in Theorem 1 below. Originally, we stated this result assuming that true labels were given without delay, but then we noticed that Daniil Ryabko's [2] device of "ghost rTCM" (in our terminology) makes it possible to add delays without any extra work.
2 Online Randomised TCM
Now we describe (mainly following [5]) how on-line rTCM works. Suppose we observe a sequence z1, z2, . . . , zn, . . . of examples, where zi = (xi, yi) ∈ Z = X × Y, the xi ∈ X are objects to be labelled and the yi ∈ Y are the labels; X and Y are arbitrary measurable spaces. "On-line" means that for any n we try to predict yn using z1 = (x1, y1), . . . , zn−1 = (xn−1, yn−1), xn. The method is as follows. We need a symmetric function f(z1, . . . , zn) = (α1, . . . , αn). "Symmetric" means that if we change the order of z1, . . . , zn, the order of α1, . . . , αn will change in the same way. In other words, there must exist a function F such that αi = F(⟨z1, . . . , zi−1, zi+1, . . . , zn⟩, zi), where ⟨· · ·⟩ denotes a multiset. The output of on-line rTCM is a set Yn of predictions for yn; a label y is included in Yn if and only if

#{i : αi > αn} + θn #{i : αi = αn} > nδ,

where
(α1, . . . , αn) = f(z1, . . . , zn−1, (xn, y)), the θn ∈ [0, 1] are random numbers distributed uniformly and independently of each other and everything else, and δ > 0 is a given threshold (called the significance level). We will be concerned with the error sequence e1, . . . , en, . . . , where en = 0 if the true value yn is in Yn, and en = 1 otherwise. In the paper [5] it is proven that for any probability distribution P in the set Z of pairs zi = (xi, yi), the corresponding (e1, e2, . . . ) is a Bernoulli sequence: for each i, ei ∈ {0, 1}, ei = 1 with probability δ, and all the ei are independent.
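The step just described is easy to render in code. The sketch below is our illustration only: the function names and the nearest-neighbour strangeness function are hypothetical choices, not prescribed by the paper; the p-value and inclusion rule follow the display above.

```python
import random

def nn_alpha(examples):
    """Toy strangeness scores: distance to the nearest other example with the
    same label (large = strange). Symmetric in the required sense."""
    out = []
    for i, (x, y) in enumerate(examples):
        same = [abs(x - x2) for j, (x2, y2) in enumerate(examples)
                if j != i and y2 == y]
        out.append(min(same) if same else float("inf"))
    return out

def rtcm_step(train, x_new, labels, delta, alpha_fn=nn_alpha,
              rng=random.Random(0)):
    """One step of on-line rTCM set prediction: include label y iff
    #{i: alpha_i > alpha_n} + theta_n * #{i: alpha_i = alpha_n} > n*delta."""
    prediction_set = []
    n = len(train) + 1
    for y in labels:
        alphas = alpha_fn(train + [(x_new, y)])
        a_n = alphas[-1]
        theta = rng.random()  # tie-breaking random number theta_n
        p = (sum(a > a_n for a in alphas)
             + theta * sum(a == a_n for a in alphas)) / n
        if p > delta:
            prediction_set.append(y)
    return prediction_set
```

With two well-separated clusters the set prediction is typically a single label, illustrating how rTCM trades the guaranteed error rate for possibly multi-valued predictions.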
3 Restricted TCM
In practice we are likely to have the true labels yn only for a subset of steps n; moreover, even for this subset yn may be given with a delay. In this paper we consider the following scheme. We are given a function L : N → ℕ defined on an infinite set N ⊆ ℕ and required to satisfy L(n) ≤ n for all n ∈ N and m ≠ n ⟹ L(m) ≠ L(n) for all m ∈ N and n ∈ N; a function satisfying these properties will be called a teaching schedule. The teaching schedule L describes the way the data is disclosed to us: at the end of step n we are given the label yL(n) for the object xL(n). The elements of L's domain N in increasing order will be denoted ni: N = {n1, n2, . . . } and n1 < n2 < · · · . We transform the on-line randomised TCM algorithm into what we call the L-restricted rTCM. We again use a symmetric function f(ζ1, . . . , ζk) = (α1, . . . , αk) and for any n = nk−1 + 1, . . . , nk and any y ∈ Y we include y in Yn if and only if

#{i = 1, . . . , k : αi > αk} + θn #{i = 1, . . . , k : αi = αk} > kδ,

where (α1, . . . , αk) = f(zL(n1), . . . , zL(nk−1), (xn, y)), the θn are random numbers and δ is a given significance level. As before, the error sequence is: en = 1 if yn ∉ Yn and en = 0 otherwise. Let U be the uniform distribution in [0, 1]. If a probability distribution P in Z generates the examples zi, the distribution (P × U)∞ generates the zi and the random numbers θi and therefore determines the distribution of all random variables, such as the errors ei, considered in this paper. We say that a restricted rTCM is (well-)calibrated in probability if the corresponding error sequence e1, e2, . . . has the property that

(e1 + · · · + en)/n → δ
in (P × U)∞-probability for any significance level δ and distribution P in Z. (Remember that, by definition, ξ1, ξ2, . . . converges to a constant c in Q-probability if

lim_{n→∞} Q{|ξn − c| > ε} = 0

for any ε.) Our aim is to prove the following statement.

Theorem 1. Let L be a teaching schedule with domain N = {n1, n2, . . . }, where n1, n2, . . . is an increasing infinite sequence of positive integers.
– If limk→∞(nk/nk−1) = 1, any L-restricted rTCM is calibrated in probability.
– If limk→∞(nk/nk−1) = 1 does not hold, there exists an L-restricted rTCM which is not calibrated in probability.

In words, the theorem asserts that the restricted rTCM is guaranteed to be calibrated in probability if and only if the growth rate of nk is sub-exponential.
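The two objects in Theorem 1 are easy to exercise numerically. The sketch below is purely illustrative (function names and example schedules are ours): a finite prefix of a teaching schedule can be checked against its two defining properties, and the calibration dichotomy shows up directly in the ratio nk/nk−1.

```python
def is_teaching_schedule_prefix(L):
    """Check a finite prefix of a teaching schedule, given as a dict n -> L(n):
    L(n) <= n on the domain, and injectivity (m != n implies L(m) != L(n))."""
    return all(L[n] <= n for n in L) and len(set(L.values())) == len(L)

def growth_ratio(nk, k):
    """n_k / n_{k-1} for a schedule domain given as a function of k."""
    return nk(k) / nk(k - 1)

# a schedule revealing every other label with a delay of one step
assert is_teaching_schedule_prefix({2: 1, 4: 3, 6: 5, 8: 7})

poly = lambda k: k * k    # n_k = k^2: ratio -> 1, calibrated in probability
geom = lambda k: 2 ** k   # n_k = 2^k: ratio stays at 2, calibration can fail
```

The polynomial schedule satisfies the sub-exponential growth condition of Theorem 1; the geometric one violates it.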
4 Proof That nk/nk−1 → 1 Is Sufficient
We start from a simple general lemma about martingale differences.

Lemma 1. If ξ1, ξ2, . . . is a martingale difference w.r. to σ-algebras F1, F2, . . . such that, for all i ≥ 1, E(ξi² | Fi−1) ≤ 1, and w1, w2, . . . is a sequence of positive numbers, then

E((w1ξ1 + · · · + wnξn)/(w1 + · · · + wn))² ≤ (w1² + · · · + wn²)/(w1 + · · · + wn)².

Proof. Since elements of a martingale difference sequence are uncorrelated, we have

E(w1ξ1 + · · · + wnξn)² = Σ_{1≤i≤n} wi² E(ξi²) + 2 Σ_{1≤i<j≤n} wi wj E(ξi ξj) ≤ Σ_{1≤i≤n} wi².
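For intuition, the bound of Lemma 1 can be sanity-checked by Monte Carlo in the simplest case of independent fair ±1 variables (a martingale difference with E(ξi² | Fi−1) = 1). Code and names below are our illustration, not part of the paper.

```python
import random

def weighted_bound_check(weights, trials=20000, seed=0):
    """Estimate E((w1*xi1 + ... + wn*xin)/(w1 + ... + wn))^2 for independent
    fair +/-1 variables, and return it with the Lemma 1 bound
    (w1^2 + ... + wn^2)/(w1 + ... + wn)^2."""
    rng = random.Random(seed)
    W = sum(weights)
    acc = 0.0
    for _ in range(trials):
        s = sum(w * rng.choice((-1.0, 1.0)) for w in weights)
        acc += (s / W) ** 2
    lhs = acc / trials
    rhs = sum(w * w for w in weights) / W ** 2
    return lhs, rhs
```

For independent ±1 variables the two sides are in fact equal, so the Monte Carlo estimate should hover around the bound rather than sit strictly below it.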
Fix a probability distribution P in Z generating the examples zi; let P stand for (P × U)∞ (the probability distribution generating the examples zi and the random numbers θi) and E stand for the expected value w.r. to P. Along with the original L-restricted rTCM making errors e1, e2, . . . we also consider the ghost rTCM (introduced in [2]), which uses the same alpha function as the L-restricted rTCM but is fed with the examples z′1 := zL(n1), z′2 := zL(n2), . . .
and random numbers θ′1, θ′2, . . . (independent from each other and anything else); the error sequence of the ghost rTCM is denoted e′1, e′2, . . . (remember that an error is encoded as 1 and the absence of error as 0). The ghost rTCM is given all labels and each label is given without delay. Notice that the input sequence zL(n1), zL(n2), . . . to the ghost rTCM is also distributed according to P∞. Set, for each n = 1, 2, . . . ,

dn = P{en = 1 | z1, . . . , zn−1}

(it is clear that, for each k, dn will be the same for all n = nk−1 + 1, . . . , nk) and

d′k = P{e′k = 1 | z′1, . . . , z′k−1}.

Notice that, for all k = 1, 2, . . . ,

dnk = d′k.   (1)
Corollary 1. For each k,

E(((e′1 − δ)n1 + (e′2 − δ)(n2 − n1) + · · · + (e′k − δ)(nk − nk−1))/nk)² ≤ (n1² + (n2 − n1)² + · · · + (nk − nk−1)²)/nk².

Proof. It is sufficient to apply Lemma 1 to w1 = n1, w2 = n2 − n1, . . . , wk = nk − nk−1, the independent zero-mean (by the result of [5] described at the end of §2) random variables ξk = e′k − δ, and the trivial σ-algebras.

Corollary 2. For each k,

E(((e′1 − d′1)n1 + (e′2 − d′2)(n2 − n1) + · · · + (e′k − d′k)(nk − nk−1))/nk)² ≤ (n1² + (n2 − n1)² + · · · + (nk − nk−1)²)/nk².

Proof. Use Lemma 1 for w1 = n1, w2 = n2 − n1, . . . , wk = nk − nk−1, ξk = e′k − d′k, and the σ-algebras Fk generated by z′1, . . . , z′k−1.

Corollary 3. For each k,

E(((e1 − d1) + (e2 − d2) + · · · + (enk − dnk))/nk)² ≤ 1/nk.

Proof. Apply Lemma 1 to wi = 1, ξi = ei − di, and the σ-algebras Fi generated by z1, . . . , zi.

Lemma 2. If limk→∞ nk+1/nk = 1 for some increasing sequence of positive integers n1, n2, . . . , nk, . . . , then

limk→∞ (n1² + (n2 − n1)² + · · · + (nk − nk−1)²)/nk² = 0.

Proof. For any ε > 0, there exists K such that (nk − nk−1)/nk−1 < ε for any k ≥ K. Therefore,

(n1² + (n2 − n1)² + · · · + (nk − nk−1)²)/nk²
≤ nK²/nk² + ((nK+1 − nK)² + · · · + (nk − nk−1)²)/nk²
≤ nK²/nk² + ((nK+1 − nK)/nK)((nK+1 − nK)/nk) + ((nK+2 − nK+1)/nK+1)((nK+2 − nK+1)/nk) + · · · + ((nk − nk−1)/nk−1)((nk − nk−1)/nk)
≤ nK²/nk² + ε((nK+1 − nK) + · · · + (nk − nk−1))/nk ≤ 2ε
from some k on.

Now it is easy to finish the proof of the first part of the theorem. In combination with Chebyshev's inequality and Lemma 2, Corollary 1 implies that

((e′1 − δ)n1 + (e′2 − δ)(n2 − n1) + · · · + (e′k − δ)(nk − nk−1))/nk → 0

in probability; using the notation k(n) := min{k : nk ≥ n}, we can rewrite this as

(1/nk) Σ_{n=1}^{nk} (e′k(n) − δ) → 0.   (2)

Similarly, (1) and Corollary 2 imply

(1/nk) Σ_{n=1}^{nk} (e′k(n) − d′k(n)) = (1/nk) Σ_{n=1}^{nk} (e′k(n) − dn) → 0   (3)

and Corollary 3 implies

(1/nk) Σ_{n=1}^{nk} (en − dn) → 0   (4)

(all convergences are in probability). Combining (2)–(4), we obtain

(1/nk) Σ_{n=1}^{nk} (en − δ) → 0;   (5)

the condition nk+1/nk → 1 allows us to replace nk with n in (5).
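The quantity controlled by Lemma 2 can be watched converge directly (a small sketch; the schedules are our illustrative choices):

```python
def squared_increment_ratio(ns):
    """(n1^2 + (n2-n1)^2 + ... + (nk-n_{k-1})^2) / nk^2 for a finite schedule."""
    total = ns[0] ** 2 + sum((b - a) ** 2 for a, b in zip(ns, ns[1:]))
    return total / ns[-1] ** 2

squares = [k * k for k in range(1, 2001)]  # sub-exponential: ratio ~ 4/(3k) -> 0
powers = [2 ** k for k in range(1, 31)]    # exponential: ratio stays above 1/3
```

For nk = k² the ratio decays like 4/(3k), as Lemma 2 predicts; for nk = 2^k it is bounded away from zero.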
5 Proof That nk/nk−1 → 1 Is Necessary
As a first step, we construct the example space Z, the probability distribution P in Z, and an rTCM for which the d′k deviate consistently from δ. Let X = {0}, Y = {0, 1}, so zi is, essentially, always 0 or 1. The probability P is defined by P{0} = P{1} = 1/2. Define the alpha function (α1, . . . , αk) = f(ζ1, . . . , ζk) as follows: (α1, . . . , αk) = (ζ1, . . . , ζk) if ζ1 + · · · + ζk is even and (α1, . . . , αk) = (1 − ζ1, . . . , 1 − ζk) if ζ1 + · · · + ζk is odd. It follows from the central limit theorem that

#{i = 1, . . . , k : z′i = 1}/k ∈ (0.4, 0.6)   (6)
with probability more than 99% for k large enough. Let δ = 5%. Consider some k ∈ {1, 2, . . . }; we will show that d′k deviates significantly from δ with probability more than 99% for sufficiently large k; namely, that d′k is significantly greater than δ if z′1 + · · · + z′k−1 is odd (intuitively, in this case both potential labels are strange) and d′k is significantly less than δ if z′1 + · · · + z′k−1 is even (intuitively, both potential labels are typical). Formally:

– If z′1 + · · · + z′k−1 is odd, then

z′k = 1 ⟹ z′1 + · · · + z′k−1 + z′k is even ⟹ αk = z′k = 1
z′k = 0 ⟹ z′1 + · · · + z′k−1 + z′k is odd ⟹ αk = 1 − z′k = 1;

in both cases we have αk = 1 and, therefore, with probability more than 99%,

d′k = P{θ′k #{i = 1, . . . , k : αi = 1} ≤ kδ} ≥ kδ/#{i = 1, . . . , k : αi = 1} ≥ kδ/(0.7k) = (10/7)δ.

– If z′1 + · · · + z′k−1 is even, then

z′k = 1 ⟹ z′1 + · · · + z′k−1 + z′k is odd ⟹ αk = 1 − z′k = 0
z′k = 0 ⟹ z′1 + · · · + z′k−1 + z′k is even ⟹ αk = z′k = 0;

in both cases αk = 0 and, therefore, with probability more than 99%,

d′k = P{#{i = 1, . . . , k : αi = 1} + θ′k #{i = 1, . . . , k : αi = 0} ≤ kδ} ≤ P{0.3k ≤ kδ} = 0.
To summarise, for large enough k,

|d′k − δ| = |dnk − δ| > δ/3   (7)

with probability more than 99%. Suppose that

(1/n) Σ_{i=1}^{n} ei − δ → 0   (8)

in probability; we will deduce that nk/nk−1 → 1. By (4) (remember that Corollary 3 and, therefore, (4) do not depend on the condition nk/nk−1 → 1) and (8) we have

(1/n) Σ_{i=1}^{n} di − δ → 0;

we can rewrite this in the form

Σ_{i=1}^{n} di = n(δ + o(1))

(all o(1) are in probability). This equality implies

Σ_{k=0}^{K} dnk(nk+1 − nk) = nK+1(δ + o(1))

and

Σ_{k=0}^{K−1} dnk(nk+1 − nk) = nK(δ + o(1));

subtracting the last equality from the penultimate one we obtain

dnK(nK+1 − nK) = (nK+1 − nK)δ + o(nK+1),

i.e.,

(dnK − δ)(nK+1 − nK) = o(nK+1).

In combination with (7) and (1), this implies nK+1 − nK = o(nK+1), i.e., nK+1/nK → 1 as K → ∞.
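The parity construction above is concrete enough to compute with. The sketch below (entirely our illustration, not part of the proof) evaluates the conditional error probability at significance level δ for a given prefix of ghost examples, averaging over the fair-coin label and the tie-breaking θ, and exhibits the two regimes: zero after an even-sum prefix, and above (4/3)δ after an odd-sum balanced prefix, as (7) requires.

```python
def parity_d_k(prefix, delta):
    """Conditional error probability at step k = len(prefix)+1 for the
    parity-based alpha function: alphas equal the bits if their sum is even,
    and the flipped bits if it is odd; labels are fair coin flips."""
    total = 0.0
    for zk in (0, 1):  # the true label, each with probability 1/2
        seq = list(prefix) + [zk]
        flip = sum(seq) % 2
        alphas = [b ^ flip for b in seq]
        a_k = alphas[-1]
        above = sum(a > a_k for a in alphas)
        ties = sum(a == a_k for a in alphas)  # >= 1, includes a_k itself
        k = len(seq)
        # probability over theta ~ U[0,1] that above + theta*ties <= k*delta
        total += 0.5 * min(1.0, max(0.0, (k * delta - above) / ties))
    return total
```

With a balanced prefix of even sum the error probability collapses to 0; appending one more bit makes the sum odd and pushes it well above δ = 0.05.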
References 1. Ilia Nouretdinov, Thomas Melluish, and Vladimir Vovk. Ridge Regression Confidence Machine. In Proceedings of the 18th International Conference on Machine Learning, 2001. 2. Daniil Ryabko, Vladimir Vovk, and Alex Gammerman. Online region prediction with real teachers. Submitted for publication.
3. Craig Saunders, Alex Gammerman, and Vladimir Vovk. Transduction with confidence and credibility. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 722–726, 1999. 4. Vladimir Vovk, Alex Gammerman, Craig Saunders. Machine-learning applications of algorithmic randomness. Proceedings of the 16th International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, pp. 444–453, 1999. 5. Vladimir Vovk. On-line Confidence Machines are well-calibrated. Proceedings of the 43rd Annual Symposium on Foundations of Computer Science, IEEE Computer Society, 2002.
Well-Calibrated Predictions from Online Compression Models Vladimir Vovk Computer Learning Research Centre, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England, [email protected], http://vovk.net
Abstract. It has been shown recently that Transductive Confidence Machine (TCM) is automatically well-calibrated when used in the on-line mode and provided that the data sequence is generated by an exchangeable distribution. In this paper we strengthen this result by relaxing the assumption of exchangeability of the data-generating distribution to the much weaker assumption that the data agrees with a given “on-line compression model”.
1 Introduction
Transductive Confidence Machine (TCM) was introduced in [1,2] as a practically meaningful way of providing information about reliability of the predictions made. In [3] it was shown that TCM’s confidence information is valid in a strong non-asymptotic sense under the standard assumption that the examples are exchangeable. In §2 we define a general class of models, called “on-line compression models”, which include not only the exchangeability model but also the Gaussian model, the Markov model, and many other interesting models. An on-line compression model (OCM) is an automaton (usually infinite) for summarizing statistical information efficiently. It is usually impossible to restore the statistical information from OCM’s summary (so OCM performs lossy compression), but it can be argued that the only information lost is noise, since one of our requirements is that the summary should be a “sufficient statistic”. In §3 we construct “confidence transducers” and state the main result of the paper (proved in Appendix A) showing that the confidence information provided by confidence transducers is valid in a strong sense. In the last three sections, §4–6, we consider three interesting examples of on-line compression models: exchangeability, Gaussian and Markov models. The idea of compression modelling was the main element of Kolmogorov’s programme for applications of probability [4], which is discussed in Appendix B.
2 Online Compression Models
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 268–282, 2003. © Springer-Verlag Berlin Heidelberg 2003

We are interested in making predictions about a sequence of examples z1, z2, . . . output by Nature. Typically we will want to say something about example zn,
n = 1, 2, . . . , given the previous examples z1, . . . , zn−1. In this section we will discuss an assumption that we might be willing to make about the examples, and in the next section the actual prediction algorithms. An on-line compression model is a 5-tuple M = (Σ, □, Z, (Fn), (Bn)), where:

1. Σ is a measurable space called the summary space; its elements are called summaries; □ ∈ Σ is a summary called the empty summary;
2. Z is a measurable space from which the examples zi are drawn;
3. Fn, n = 1, 2, . . . , are functions of the type Σ × Z → Σ called forward functions;
4. Bn, n = 1, 2, . . . , are kernels of the type Σ → Σ × Z called backward kernels; in other words, each Bn is a function Bn(A | σ) which depends on σ ∈ Σ and a measurable set A ⊆ Σ × Z such that
– for each σ, Bn(A | σ) as a function of A is a probability distribution in Σ × Z;
– for each A, Bn(A | σ) is a measurable function of σ;
it is required that Bn be a reverse to Fn in the sense that Bn(Fn⁻¹(σ) | σ) = 1 for each σ ∈ Fn(Σ × Z).

We will sometimes write Bn(σ) for the probability distribution A ↦ Bn(A | σ). Next we explain briefly the intuitions behind this formal definition and introduce some further notation. An OCM is a way of summarizing statistical information. At the beginning we do not have any information, which is represented by the empty summary σ0 := □. When the first example z1 arrives, we update our summary to σ1 := F1(σ0, z1), etc.; when example zn arrives, we update the summary to σn := Fn(σn−1, zn). This process is represented in Figure 1. Let tn be the nth statistic in the OCM, which maps the sequence of the first n examples z1, . . . , zn to σn:

t1(z1) := F1(σ0, z1);
tn(z1, . . . , zn) := Fn(tn−1(z1, . . . , zn−1), zn), n = 2, 3, . . . .

The value tn(z1, . . . , zn) is a summary of the full data sequence z1, . . .
, zn available at the end of trial n; our definition requires that the summaries should be computable on-line: the function Fn updates σn−1 to σn . Condition 3 in the definition of OCM reflects its on-line character, as explained in the previous paragraph. We want, however, the system of summarizing statistical information represented by the OCM to be efficient, so that no useful information is lost. This is reflected in Condition 4: the distribution Pn of the more detailed description (σn−1 , zn ) given the less detailed σn is known and so does not carry any information about the distribution generating the examples z1 , z2 , . . . ; in other words, σn contains the same useful information as (σn−1 , zn ), and the extra information in (σn−1 , zn ) is noise. This intuition would be captured in statistical terminology (see, e.g., [5], §2.2) by saying that σn is
Fig. 1. Using the forward functions Fn to compute σn from z1 , . . . , zn
a "sufficient statistic" of z1, . . . , zn (although this expression does not have a formal meaning in our present context, since we do not have a full statistical model {Pθ : θ ∈ Θ}). Analogously to Figure 1, we can compute the distribution of the data sequence z1, . . . , zn from σn (see Figure 2). Formally, using the kernels Bn(dσn−1, dzn | σn), we can define the conditional distribution Pn of z1, . . . , zn given σn by the formula

Pn(A1 × · · · × An | σn) := ∫ · · · ∫ B1(A1 | σ1) B2(dσ1, A2 | σ2) · · · Bn−1(dσn−2, An−1 | σn−1) Bn(dσn−1, An | σn)   (1)
for each product set A1 × · · · × An , Ai ⊆ Z, i = 1, . . . , n.
Fig. 2. Using the backward functions Bn to extract the distribution of z1 , . . . , zn from σn
We say that a probability distribution P in Z∞ agrees with the OCM (Σ, □, Z, (Fn), (Bn)) if, for each n, Bn(A | σ) is a version of the conditional probability, w.r. to P, that (tn−1(z1, . . . , zn−1), zn) ∈ A given tn(z1, . . . , zn) = σ and given the values of zn+1, zn+2, . . . .
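The 5-tuple translates directly into a small programming interface. This is entirely our illustration — the names are invented, and the backward kernel is represented by a sampler rather than a measure-valued function.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class OnlineCompressionModel:
    """Skeleton of an OCM: an empty summary, forward functions F_n, and
    backward kernels B_n, here given by a sampler drawing a
    (sigma_{n-1}, z_n) pair from B_n(. | sigma_n)."""
    empty_summary: Any
    forward: Callable[[int, Any, Any], Any]       # F_n(sigma_{n-1}, z_n) -> sigma_n
    backward_sample: Callable[[int, Any], tuple]  # (sigma_{n-1}, z_n) ~ B_n(. | sigma_n)

    def statistic(self, examples: List[Any]) -> Any:
        """t_n(z_1, ..., z_n): fold the forward functions over the data."""
        sigma = self.empty_summary
        for n, z in enumerate(examples, start=1):
            sigma = self.forward(n, sigma, z)
        return sigma

def bag_backward_sample(n, sigma, rng=random):
    """Backward kernel of the exchangeability model of Section 4:
    remove a uniformly random element of the bag."""
    i = rng.randrange(len(sigma))
    return sigma[:i] + sigma[i + 1:], sigma[i]

# the exchangeability model, with the bag stored as a sorted tuple
bag_model = OnlineCompressionModel(
    empty_summary=(),
    forward=lambda n, sigma, z: tuple(sorted(sigma + (z,))),
    backward_sample=bag_backward_sample,
)
```

Storing the bag as a sorted tuple makes the summary forget the ordering, which is exactly the information the exchangeability model treats as noise.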
3 Confidence Transducers and the Main Result
A randomised transducer is a function f of the type (Z × [0, 1])∗ → [0, 1]. It is called “transducer” because it can be regarded as mapping each input sequence (z1 , θ1 , z2 , θ2 , . . . ) in (Z × [0, 1])∞ (the examples zi are complemented by random numbers θi ) into the output sequence (p1 , p2 , . . . ) defined
by pn := f(z1, θ1, . . . , zn, θn), n = 1, 2, . . . ; we will say that p1, p2, . . . are the p-values produced by the transducer. We say that the transducer f is valid w.r. to an OCM M if the output p-values p1, p2, . . . are always distributed according to the uniform distribution U∞ in [0, 1]∞, provided the input examples z1, z2, . . . are generated by a probability distribution that agrees with M and θ1, θ2, . . . are generated, independently of z1, z2, . . . , from U∞. If we drop the dependence on the random numbers θn, we obtain the notion of a deterministic transducer. Any sequence of measurable functions An : Σ × Z → ℝ, n = 1, 2, . . . , is called an individual strangeness measure w.r. to the OCM M = (Σ, □, Z, (Fn), (Bn)). The confidence transducer associated with (An) is the deterministic transducer where the pn are defined as

pn := Bn({(σ, z) ∈ Σ × Z : An(σ, z) ≥ An(σn−1, zn)} | σn)
(2)
and σn := tn (z1 , . . . , zn ),
σn−1 := tn−1 (z1 , . . . , zn−1 ) .
The randomised version is obtained by replacing (2) with

pn := Bn({(σ, z) ∈ Σ × Z : An(σ, z) > An(σn−1, zn)} | σn) + θn Bn({(σ, z) ∈ Σ × Z : An(σ, z) = An(σn−1, zn)} | σn).   (3)

A confidence transducer in an OCM M is a confidence transducer associated with some individual strangeness measure w.r. to M.

Theorem 1. Suppose the examples zn ∈ Z, n = 1, 2, . . . , are generated from a probability distribution P that agrees with an on-line compression model. Any randomised confidence transducer in that model is valid (will produce independent p-values pn distributed uniformly in [0, 1]).

Confidence transducers can be used for "prediction with confidence". Suppose each example zn consists of two components, xn (the object) and yn (the label); at trial n we are given xn and the goal is to predict yn; for simplicity, we will assume that the label space Y from which the labels are drawn is finite. One mode of prediction with confidence is "region prediction" (as in [3]). Suppose we are given a significance level δ > 0 (the maximum probability of error we are prepared to tolerate). When given xn, we can output as the predictive region Γn ⊆ Y the set of labels y such that yn = y would lead to a p-value pn > δ. (When a confidence transducer is applied in this mode, we will sometimes refer to it as a TCM.) If error at trial n is defined as yn ∉ Γn, then by Theorem 1 errors at different trials are independent and the probability of error at each trial is δ, assuming the pn are produced by a randomised confidence transducer. In particular, such region predictors are well-calibrated, in the sense that the number En of errors made in the first n trials satisfies

lim_{n→∞} En/n = δ.
This implies that if the pn are produced by a deterministic confidence transducer, we will still have the conservative version of this property,

lim_{n→∞} En/n ≤ δ.
An alternative way of presenting the confidence transducer's output (used in [2] and several other papers) is reporting, after seeing xn, a predicted label ŷn ∈ arg max_{y∈Y} pn(y), the confidence 1 − pn^(2) and the credibility pn^(1), where pn(y) is the p-value that would be obtained if yn = y, pn^(1) is the largest value among the pn(y) and pn^(2) is the second largest value among the pn(y).
4 Exchangeability Model
In this section we discuss the only special case of OCM studied from the point of view of prediction with confidence so far: the exchangeability model. In the next two sections we will consider two other models, Gaussian and Markov; many more models are considered in [6], Chapter 4. For defining specific OCMs, we will specify their statistics tn and conditional distributions Pn; these uniquely identify Fn and Bn. The exchangeability model has statistics tn(z1, . . . , zn) := ⟨z1, . . . , zn⟩; given the value of the statistic, all orderings have the same probability 1/n!. Formally, the set of bags ⟨z1, . . . , zn⟩ of size n is defined as Zn equipped with the σ-algebra of symmetric (i.e., invariant under permutations of components) events; the distribution on the orderings is given by (zπ(1), . . . , zπ(n)), where z1, . . . , zn is a fixed ordering and π is a random permutation (each permutation is chosen with probability 1/n!). The main results of [3] and [7] are special cases of Theorem 1.
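For the exchangeability model the backward kernel removes a uniformly random element of the bag, so the deterministic p-value (2) reduces to the familiar TCM rank statistic: the fraction of bag elements at least as strange as the newest one. A sketch, with a toy strangeness function of our own choosing:

```python
def exchangeability_p_value(examples, strangeness):
    """Deterministic p-value (2) in the exchangeability model: the fraction
    of bag elements at least as strange as the newest example."""
    alphas = [strangeness(examples, i) for i in range(len(examples))]
    return sum(a >= alphas[-1] for a in alphas) / len(alphas)

def dist_from_mean(examples, i):
    """Toy strangeness: distance from the bag average (illustrative only)."""
    return abs(examples[i] - sum(examples) / len(examples))
```

An outlying new example gets a small p-value, a typical one a large p-value, which is what makes the thresholding at δ into a region predictor.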
5 Gaussian Model
In the Gaussian model, Z := ℝ, and the statistics are

tn(z1, . . . , zn) := (z̄n, Rn),

where

z̄n := (1/n) Σ_{i=1}^{n} zi,  Rn := √((z1 − z̄n)² + · · · + (zn − z̄n)²),

and Pn(dz1, . . . , dzn | σ) is the uniform distribution in tn⁻¹(σ) (in other words, it is the uniform distribution in the (n − 2)-dimensional sphere in ℝⁿ with centre (z̄n, . . . , z̄n) ∈ ℝⁿ and radius Rn lying inside the hyperplane (1/n)(z1 + · · · + zn) = z̄n).
Well-Calibrated Predictions from Online Compression Models
Let us give an explicit expression for the predictive region in the Gaussian model with the individual strangeness measure
$$A_n(t_{n-1}, z_n) = A_n((\bar z_{n-1}, R_{n-1}), z_n) := |z_n - \bar z_{n-1}| \qquad (4)$$
(it is easy to see that this individual strangeness measure is equivalent, in the sense of leading to the same p-values, to $|z_n - \bar z_n|$, as well as to several other natural expressions, including (5)). Under $P_n(dz_1,\dots,dz_n \mid \sigma)$, the expression
$$\sqrt{\frac{(n-1)(n-2)}{n}}\,\frac{z_n - \bar z_{n-1}}{R_{n-1}} \qquad (5)$$
has Student's t-distribution with $n-2$ degrees of freedom (assuming $n > 2$; see, e.g., [8], §29.4). If $t^{(\delta)}$ is the value defined by $\mathbb{P}\{|t_{n-2}| > t^{(\delta)}\} = \delta$ (where $t_{n-2}$ has Student's t-distribution with $n-2$ degrees of freedom), the predictive interval corresponding to the individual strangeness measure (4) is the set of $z$ satisfying
$$|z - \bar z_{n-1}| \le t^{(\delta)} \sqrt{\frac{n}{(n-1)(n-2)}}\, R_{n-1}.$$
Therefore, we obtain the usual predictive regions based on the t-test (as in [9] or, in more detail, [10]); now, however, we can see that the errors of this standard procedure (applied in the on-line fashion) are independent.
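These predictive intervals are straightforward to compute. The sketch below (illustrative, not from the paper) stays within the standard library, obtaining $t^{(\delta)}$ by numerically integrating the t density and bisecting; in practice one would use a statistics library's t quantile function.

```python
import math

def t_upper_tail(x, df, steps=10000, cutoff=60.0):
    """P{t_df > x} by trapezoidal integration of the t density.
    Accurate enough for df >= 3; for very small df the tail beyond
    `cutoff` is not negligible."""
    if x >= cutoff:
        return 0.0
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    h = (cutoff - x) / steps
    total = 0.0
    for i in range(steps + 1):
        u = x + i * h
        f = c * (1 + u * u / df) ** (-(df + 1) / 2)
        total += f / 2 if i in (0, steps) else f
    return total * h

def t_critical(delta, df):
    """t^(delta): the value with P{|t_df| > t^(delta)} = delta, by bisection."""
    lo, hi = 0.0, 60.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if 2 * t_upper_tail(mid, df) > delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def gaussian_predictive_interval(zs, delta):
    """Predictive interval for the next example under the Gaussian OCM:
    |z - mean| <= t^(delta) * sqrt(n / ((n-1)(n-2))) * R_{n-1}."""
    n = len(zs) + 1                      # trial number of the predicted example
    mean = sum(zs) / len(zs)             # \bar z_{n-1}
    R = math.sqrt(sum((z - mean) ** 2 for z in zs))   # R_{n-1}
    half = t_critical(delta, n - 2) * math.sqrt(n / ((n - 1) * (n - 2))) * R
    return mean - half, mean + half
```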
6 Markov Model
The Gaussian OCM, considered in the previous section, is narrower than the exchangeability OCM. The OCM considered in this section is interesting in that it goes beyond exchangeability. In this section we always assume that the example space $\mathbf{Z}$ is finite. The following notation for digraphs will be used: $\mathrm{in}(v)$ and $\mathrm{out}(v)$ stand for the number of arcs entering and leaving vertex $v$, respectively; $n_{u,v}$ is the number of arcs leading from vertex $u$ to vertex $v$. The Markov summary of a data sequence $z_1 \dots z_n$ is the following digraph with two marked vertices:

– the set of vertices is $\mathbf{Z}$ (the state space of the Markov chain);
– the vertex $z_1$ is marked as the source and the vertex $z_n$ is marked as the sink (these two vertices are not necessarily distinct);
– the arcs of the digraph are the transitions $z_i \to z_{i+1}$, $i = 1,\dots,n-1$; the arc $z_i \to z_{i+1}$ has $z_i$ as its tail and $z_{i+1}$ as its head.

It is clear that in any such digraph all vertices $v$ satisfy $\mathrm{in}(v) = \mathrm{out}(v)$, with the possible exception of the source and sink (unless they coincide), for which we then have $\mathrm{out}(\mathrm{source}) = \mathrm{in}(\mathrm{source}) + 1$ and $\mathrm{in}(\mathrm{sink}) = \mathrm{out}(\mathrm{sink}) + 1$. We will call a digraph with this property a Markov graph if arcs with the same tail and head are indistinguishable (for example, we do not distinguish two Eulerian paths
V. Vovk
that only differ in the order in which two such arcs are passed); its underlying digraph will have the same structure, but all its arcs will be considered to have their own identity. More formally, the Markov model $(\Sigma, \square, \mathbf{Z}, F, B)$ is defined as follows:

– $\mathbf{Z}$ is a finite set; its elements (examples) are also called states; one of the states is designated as the initial state;
– $\Sigma$ is the set of all Markov graphs with the vertex set $\mathbf{Z}$;
– $\square$ is the Markov graph with no arcs and with both source and sink at the designated initial state;
– $F_n(\sigma, z)$ is the Markov graph obtained from $\sigma$ by adding an arc from $\sigma$'s sink to $z$ and making $z$ the new sink;
– let $\sigma \downarrow z$, where $\sigma$ is a Markov graph and $z$ is one of $\sigma$'s vertices, be the Markov graph obtained from $\sigma$ by removing an arc from $z$ to $\sigma$'s sink ($\sigma \downarrow z$ does not exist if there is no arc from $z$ to $\sigma$'s sink) and moving the sink to $z$, and let $N(\sigma)$ be the number of Eulerian paths from the source to the sink in the Markov graph $\sigma$; $B_n(\sigma)$ is $(\sigma \downarrow z, \mathrm{sink})$ with probability $N(\sigma \downarrow z)/N(\sigma)$, where $\mathrm{sink}$ is $\sigma$'s sink and $z$ ranges over the states for which $\sigma \downarrow z$ is defined.

We will take as the individual strangeness measure
$$A_n(\sigma, z) := -B_n(\{(\sigma, z)\} \mid F_n(\sigma, z)) \qquad (6)$$
(we need the minus sign because lower probability makes an example stranger). To give a computationally efficient representation of the confidence transducer corresponding to this individual strangeness measure, we need the following two graph-theoretic results, versions of the BEST theorem and the Matrix-Tree theorem, respectively.

Lemma 1. In any Markov graph $\sigma = (V, E)$ the number of Eulerian paths from the source to the sink equals
$$\frac{T(\sigma)\,\mathrm{out}(\mathrm{sink})\prod_{v\in V}(\mathrm{out}(v)-1)!}{\prod_{u,v\in V} n_{u,v}!},$$
where $T(\sigma)$ is the number of spanning out-trees in the underlying digraph centred at the source.

Lemma 2. To find the number $T(\sigma)$ of spanning out-trees rooted at the source in the underlying digraph of a Markov graph $\sigma$ with vertices $z_1,\dots,z_n$ ($z_1$ being the source):

– create the $n \times n$ matrix with the elements $a_{i,j} = -n_{z_i,z_j}$;
– change the diagonal elements so that each column sums to 0;
– compute the co-factor of $a_{1,1}$.

These two lemmas immediately follow from Theorems VI.24 and VI.28 in [11].
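As a sanity check, the two lemmas can be turned into code and compared with brute-force enumeration (an illustrative sketch, not from the paper; we apply Lemma 1 in the equivalent form $T(\sigma)\,\mathrm{out}(\mathrm{sink})!\prod_{v\ne \mathrm{sink}}(\mathrm{out}(v)-1)!/\prod n_{u,v}!$, which coincides with the expression above whenever $\mathrm{out}(\mathrm{sink}) \ge 1$ and also covers a sink with no outgoing arcs):

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def det(m):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in m]
    n, sign, prod = len(m), 1, Fraction(1)
    for k in range(n):
        piv = next((i for i in range(k, n) if m[i][k] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != k:
            m[k], m[piv] = m[piv], m[k]
            sign = -sign
        prod *= m[k][k]
        for i in range(k + 1, n):
            r = m[i][k] / m[k][k]
            for j in range(k, n):
                m[i][j] -= r * m[k][j]
    return sign * prod

def eulerian_paths(arcs, source, sink):
    """N(sigma) via Lemmas 1 and 2; `arcs` maps (u, v) -> n_uv."""
    vs = sorted({u for u, v in arcs} | {v for u, v in arcs})
    n = len(vs)
    a = [[-arcs.get((vs[i], vs[j]), 0) for j in range(n)] for i in range(n)]
    for j in range(n):                   # fix diagonal: every column sums to 0
        a[j][j] = sum(arcs.get((vs[i], vs[j]), 0) for i in range(n) if i != j)
    s = vs.index(source)                 # Lemma 2: cofactor of the source entry
    T = det([[a[i][j] for j in range(n) if j != s] for i in range(n) if i != s])
    out = Counter()
    for (u, v), c in arcs.items():
        out[u] += c
    N = T * factorial(out[sink])
    for v in vs:
        if v != sink:
            N *= factorial(out[v] - 1)
    for c in arcs.values():
        N /= factorial(c)                # parallel arcs are indistinguishable
    return int(N)

def eulerian_paths_bruteforce(arcs, cur):
    """Count Eulerian paths directly, treating parallel arcs as identical."""
    if not any(arcs.values()):
        return 1
    total = 0
    for (u, v), c in list(arcs.items()):
        if u == cur and c > 0:
            arcs[(u, v)] -= 1
            total += eulerian_paths_bruteforce(arcs, v)
            arcs[(u, v)] += 1
    return total
```

For the Markov summary of the sequence 0, 0, 1, 0, 1, 1, 0 (so $n_{0,0} = 1$, $n_{0,1} = 2$, $n_{1,0} = 2$, $n_{1,1} = 1$, source and sink both 0) the two counts agree.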
[Figure 3 appears here: a plot of the cumulative numbers of errors, uncertain predictions, and empty predictions (vertical axis, 0 to 250) against the number of examples (horizontal axis, 0 to 10,000).]
Fig. 3. TCM predicting the binary Markov chain with transition probabilities P(1 | 0) = P(0 | 1) = 1% at significance level 2%; the cumulative numbers of errors (predictive regions not covering the true label), uncertain (i.e., containing more than one label) and empty predictive regions are shown
It is now easy to obtain an explicit formula for prediction in the binary case $\mathbf{Z} = \{0, 1\}$. First we notice that
$$B_n(\{(\sigma \downarrow z, \mathrm{sink})\} \mid \sigma) = \frac{N(\sigma \downarrow z)}{N(\sigma)} = \frac{T(\sigma \downarrow z)\, n_{z,\mathrm{sink}}}{T(\sigma)\, \mathrm{out}(\mathrm{sink})}$$
(all $n_{u,v}$ refer to the numbers of arcs in $\sigma$ and $\mathrm{sink}$ is $\sigma$'s sink; we set $N(\sigma \downarrow z) = T(\sigma \downarrow z) := 0$ when $\sigma \downarrow z$ does not exist). The following simple corollary of the last formula is sufficient for computing the probabilities $B_n$ in the binary case:
$$B_n(\{(\sigma \downarrow \mathrm{sink}, \mathrm{sink})\} \mid \sigma) = \frac{n_{\mathrm{sink},\mathrm{sink}}}{\mathrm{out}(\mathrm{sink})}.$$
This gives us the following formulas for the TCM in the binary Markov model (remember that the individual strangeness measure is (6)). Suppose the current summary is given by a Markov graph with $n_{i,j}$ arcs going from vertex $i$ to vertex $j$ ($i, j \in \{0, 1\}$) and let $f : [0,1] \to [0,1]$ be the function that squashes $[0.5, 1]$ to 1:
$$f(p) := \begin{cases} p & \text{if } p < 0.5 \\ 1 & \text{otherwise.} \end{cases}$$
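The four explicit p-value formulas that this yields (stated next in the text) can be sketched in code (illustrative only; names are ours, with the convention 0/0 := 1 as in the text):

```python
def squash(p):
    """The function f: keep p below 0.5, squash [0.5, 1] to 1."""
    return p if p < 0.5 else 1.0

def binary_markov_p_values(n, sink):
    """TCM p-values for the next example in the binary Markov model.
    `n` maps (i, j) to the number of arcs i -> j; `sink` is the current sink."""
    other = 1 - sink

    def ratio(num, den):
        return 1.0 if den == 0 else num / den   # the convention 0/0 := 1

    p_same = squash((n[sink, sink] + 1) / (n[sink, sink] + n[sink, other] + 1))
    p_other = squash(ratio(n[other, sink], n[other, sink] + n[other, other]))
    return {sink: p_same, other: p_other}
```

For example, with $n_{0,0} = 5$, $n_{0,1} = 1$, $n_{1,0} = 1$, $n_{1,1} = 9$ and current sink 0, the p-value for the next example 0 is $f(6/7) = 1$ and for the next example 1 it is $f(1/10) = 0.1$.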
If the current sink is 0, the p-value corresponding to the next example 0 is
$$f\left(\frac{n_{0,0}+1}{n_{0,0}+n_{0,1}+1}\right)$$
and the p-value corresponding to the next example 1 is (with $0/0 := 1$)
$$f\left(\frac{n_{1,0}}{n_{1,0}+n_{1,1}}\right). \qquad (7)$$
If the current sink is 1, the p-value corresponding to the next example 1 is
$$f\left(\frac{n_{1,1}+1}{n_{1,1}+n_{1,0}+1}\right)$$
and the p-value corresponding to the next example 0 is (with $0/0 := 1$)
$$f\left(\frac{n_{0,1}}{n_{0,1}+n_{0,0}}\right).$$
Figure 3 shows the result of a computer simulation; as expected, the error line is close to a straight line with slope close to the significance level.

Acknowledgments. I am grateful to Per Martin-Löf, Glenn Shafer, Alex Gammerman, Phil Dawid, and the participants in the workshop "Statistical Learning in Classification and Model Selection" (January 2003, Eurandom) for useful discussions. The anonymous referees' comments helped to improve the presentation. Gregory Gutin's advice about graph theory is gratefully appreciated. This work was partially supported by EPSRC (grant GR/R46670/01), BBSRC (grant 111/BIO14428), and EU (grant IST-1999-10226).
References

1. Saunders, C., Gammerman, A., Vovk, V.: Transduction with confidence and credibility. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence. (1999) 722–726
2. Vovk, V., Gammerman, A., Saunders, C.: Machine-learning applications of algorithmic randomness. In: Proceedings of the Sixteenth International Conference on Machine Learning, San Francisco, CA, Morgan Kaufmann (1999) 444–453
3. Vovk, V.: On-line Confidence Machines are well-calibrated. In: Proceedings of the Forty-Third Annual Symposium on Foundations of Computer Science, IEEE Computer Society (2002) 187–196
4. Kolmogorov, A.N.: Combinatorial foundations of information theory and the calculus of probabilities. Russian Mathematical Surveys 38 (1983) 29–40
5. Cox, D.R., Hinkley, D.V.: Theoretical Statistics. Chapman and Hall, London (1974)
6. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley, Chichester (2000)
7. Vovk, V., Nouretdinov, I., Gammerman, A.: Testing exchangeability on-line. In: Proceedings of the Twentieth International Conference on Machine Learning. (2003)
8. Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ (1946)
9. Wilks, S.S.: Determination of sample sizes for setting tolerance limits. Annals of Mathematical Statistics 12 (1941) 91–96
10. Guttman, I.: Statistical Tolerance Regions: Classical and Bayesian. Griffin, London (1970)
11. Tutte, W.T.: Graph Theory. Cambridge University Press, Cambridge (2001)
12. Shiryaev, A.N.: Probability. Second edn. Springer, New York (1996)
13. Kolmogorov, A.N.: Logical basis for information theory and probability theory. IEEE Transactions on Information Theory IT-14 (1968) 662–664
14. Martin-Löf, P.: The definition of random sequences. Information and Control 9 (1966) 602–619
15. Asarin, E.A.: Some properties of Kolmogorov δ-random finite sequences. Theory of Probability and its Applications 32 (1987) 507–508
16. Asarin, E.A.: On some properties of finite objects random in the algorithmic sense. Soviet Mathematics Doklady 36 (1988) 109–112
17. Vovk, V.: On the concept of the Bernoulli property. Russian Mathematical Surveys 41 (1986) 247–248
18. Martin-Löf, P.: Repetitive structures and the relation between canonical and microcanonical distributions in statistics and statistical mechanics. In Barndorff-Nielsen, O., Blæsild, P., Schou, G., eds.: Proceedings of Conference on Foundational Questions in Statistical Inference, Aarhus (1974) 271–294
19. Lauritzen, S.L.: Extremal Families and Systems of Sufficient Statistics. Volume 49 of Lecture Notes in Statistics. Springer, New York (1988)
20. Vovk, V., Shafer, G.: Kolmogorov's contributions to the foundations of probability. Problems of Information Transmission 39 (2003) 21–31
21. Vovk, V.: Asymptotic optimality of Transductive Confidence Machine. In: Proceedings of the Thirteenth International Conference on Algorithmic Learning Theory. Volume 2533 of Lecture Notes in Artificial Intelligence. (2002) 336–350
22. Vovk, V.: Universal well-calibrated algorithm for on-line classification. In: Proceedings of the Sixteenth Annual Conference on Learning Theory. (2003)
23. Nouretdinov, I., V'yugin, V., Gammerman, A.: Transductive Confidence Machine is universal. In Gavaldà, R., Jantke, K.P., Takimoto, E., eds.: Proceedings of the Fourteenth International Conference on Algorithmic Learning Theory. Volume 2842 of Lecture Notes in Artificial Intelligence. Berlin, Springer (2003)
A Appendix: Proof of Theorem 1
We will use the notation $\mathbb{E}_{\mathcal F}$ for the conditional expectation w.r. to a σ-algebra $\mathcal F$; if necessary, the underlying probability distribution will be given as an upper index. Similarly, $\mathbb{P}_{\mathcal F}$ will stand for the conditional probability w.r. to $\mathcal F$. In this appendix we will use the following properties of conditional expectation (see, e.g., [12], §II.7.4):

A. If $\mathcal G$ and $\mathcal F$ are σ-algebras, $\mathcal G \subseteq \mathcal F$, $\xi$ and $\eta$ are bounded $\mathcal F$-measurable random variables, and $\eta$ is $\mathcal G$-measurable, then $\mathbb{E}_{\mathcal G}(\xi\eta) = \eta\,\mathbb{E}_{\mathcal G}(\xi)$ a.s.
B. If $\mathcal G$ and $\mathcal F$ are σ-algebras, $\mathcal G \subseteq \mathcal F$, and $\xi$ is a random variable, then $\mathbb{E}_{\mathcal G}(\mathbb{E}_{\mathcal F}(\xi)) = \mathbb{E}_{\mathcal G}(\xi)$ a.s.; in particular, $\mathbb{E}(\mathbb{E}_{\mathcal F}(\xi)) = \mathbb{E}(\xi)$.

Proof of the Theorem

This proof is a generalization of the proof of Theorem 1 in [3], with the same basic idea: to show that $(p_1, \dots, p_N)$ is distributed as $U^N$ (it is easy to get rid of the assumption of a fixed horizon $N$), we reverse the time. Let $P$ be the distribution generating the examples; it is assumed to agree with the OCM. Imagine that the sample $(z_1, \dots, z_N)$ is generated in two steps: first, the summary $\sigma_N$ is generated from some probability distribution (namely, the image of the distribution $P$ generating $z_1, z_2, \dots$ under the mapping $t_N$), and then the sample $(z_1, \dots, z_N)$ is chosen randomly from $P_N(\cdot \mid \sigma_N)$. Already the second step ensures that, conditionally on knowing $\sigma_N$ (and, therefore, unconditionally), the sequence $(p_N, \dots, p_1)$ is distributed as $U^N$. Indeed, roughly speaking (i.e., ignoring borderline effects), $p_N$ will be the p-value corresponding to the statistic $A_N$ and so distributed, at least approximately, as $U$ (see, e.g., [5], §3.2); when the pair $(\sigma_{N-1}, z_N)$ is disclosed, the value $p_N$ will be settled; conditionally on knowing $\sigma_{N-1}$ and $z_N$, $p_{N-1}$ will also be distributed as $U$, and so on.

We start the formal proof by defining the σ-algebra $\mathcal G_n$, $n = 0, 1, 2, \dots$, as the one on the sample space $(\mathbf{Z} \times [0,1])^{\infty}$ generated by the random elements $\sigma_n, z_{n+1}, \theta_{n+1}, z_{n+2}, \theta_{n+2}, \dots$. In particular, $\mathcal G_0$ (the most informative σ-algebra) coincides with the original σ-algebra on $(\mathbf{Z} \times [0,1])^{\infty}$; $\mathcal G_0 \supseteq \mathcal G_1 \supseteq \cdots$. Fix a randomised confidence transducer $f$; it will usually be left implicit in our notation. Let $p_n$ be the random variable $f(z_1, \theta_1, \dots, z_n, \theta_n)$ for each $n = 1, 2, \dots$; $\mathbb{P}$ will refer to the probability distribution $P \times U^{\infty}$ (over the examples $z_n$ and random numbers $\theta_n$) and $\mathbb{E}$ to the expectation w.r. to $\mathbb{P}$. The proof will be based on the following lemma.

Lemma 3.
For any trial $n$ and any $\delta \in [0,1]$,
$$\mathbb{P}_{\mathcal G_n}\{p_n \le \delta\} = \delta. \qquad (8)$$
Proof. Let us fix a summary $\sigma_n$ of the first $n$ examples $(z_1, \dots, z_n) \in \mathbf{Z}^n$; we will omit the condition "$\mid \sigma_n$". For every pair $(\tilde\sigma, \tilde z)$ from $F_n^{-1}(\sigma_n)$ define
$$p^+(\tilde\sigma, \tilde z) := B_n\{(\sigma, z) : A_n(\sigma, z) \ge A_n(\tilde\sigma, \tilde z)\},$$
$$p^-(\tilde\sigma, \tilde z) := B_n\{(\sigma, z) : A_n(\sigma, z) > A_n(\tilde\sigma, \tilde z)\}.$$
It is clear that always $p^- \le p^+$. Notice that the semi-closed intervals $[p^-(\tilde\sigma, \tilde z), p^+(\tilde\sigma, \tilde z))$, $(\tilde\sigma, \tilde z) \in \Sigma \times \mathbf{Z}$, either coincide or are disjoint; it is also easy to see that they "lie next to each other", in the sense that their union is also a semi-closed interval (namely, $[0,1)$). Let us say that a pair $(\tilde\sigma, \tilde z)$ is

– strange if $p^+(\tilde\sigma, \tilde z) \le \delta$;
– ordinary if $p^-(\tilde\sigma, \tilde z) > \delta$;
– borderline if $p^-(\tilde\sigma, \tilde z) \le \delta < p^+(\tilde\sigma, \tilde z)$.
We will use the notation $p^- := p^-(\tilde\sigma, \tilde z)$ and $p^+ := p^+(\tilde\sigma, \tilde z)$, where $(\tilde\sigma, \tilde z)$ is any borderline example. Notice that the $B_n$-measure of the strange examples is $p^-$, the $B_n$-measure of the ordinary examples is $1 - p^+$, and the $B_n$-measure of the borderline examples is $p^+ - p^-$. By the definition of rCT, $p_n \le \delta$ if the pair $(\sigma_{n-1}, z_n)$ is strange, $p_n > \delta$ if the pair is ordinary, and $p_n \le \delta$ with probability
$$\frac{\delta - p^-}{p^+ - p^-} \qquad (9)$$
if the pair is borderline; indeed, in this case
$$p_n = p^- + \theta_n(p^+ - p^-),$$
and so $p_n \le \delta$ is equivalent to
$$\theta_n \le \frac{\delta - p^-}{p^+ - p^-}.$$
Therefore, the overall probability that $p_n \le \delta$ is
$$p^- + (p^+ - p^-)\,\frac{\delta - p^-}{p^+ - p^-} = \delta.$$
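The borderline smoothing is exactly what makes the p-values uniform; a quick Monte Carlo check (an illustrative sketch, not from the paper) draws a pair according to the $B_n$-measure of its interval, smooths it with a uniform $\theta$, and verifies that $\mathbb{P}\{p_n \le \delta\} \approx \delta$:

```python
import random

def smoothed_p_value(p_minus, p_plus, theta):
    """Randomised p-value for a borderline pair: p_n = p^- + theta (p^+ - p^-)."""
    return p_minus + theta * (p_plus - p_minus)

def simulate(delta, intervals, trials=100_000, seed=0):
    """Estimate P{p_n <= delta} when `intervals` is the partition of [0, 1)
    into the semi-closed intervals [p^-, p^+) of the proof: a pair is drawn
    with probability p^+ - p^-, and theta is uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        u = rng.random()                 # selects the pair's interval
        p_minus, p_plus = next(iv for iv in intervals if iv[0] <= u < iv[1])
        if smoothed_p_value(p_minus, p_plus, rng.random()) <= delta:
            hits += 1
    return hits / trials
```

Whatever partition is chosen, the estimated frequency comes out close to $\delta$.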
The other basic result that we will need is the following lemma.

Lemma 4. For any trial $n = 1, 2, \dots$, $p_n$ is $\mathcal G_{n-1}$-measurable.

Proof. Fix a trial $n$ and $\delta \in [0,1]$. We are required to prove that the event $\{p_n \le \delta\}$ is $\mathcal G_{n-1}$-measurable. This follows from the definition, (3): $p_n$ is defined in terms of $\sigma_{n-1}$, $z_n$ and $\theta_n$.

Fix temporarily a positive integer $N$. First we prove that, for any $n = 1, \dots, N$ and any $\delta_1, \dots, \delta_n \in [0,1]$,
$$\mathbb{P}_{\mathcal G_n}\{p_n \le \delta_n, \dots, p_1 \le \delta_1\} = \delta_n \cdots \delta_1. \qquad (10)$$
The proof is by induction on $n$. For $n = 1$, (10) immediately follows from Lemma 3. For $n > 1$ we obtain, making use of Lemmas 3 and 4, properties A and B of conditional expectations, and the inductive assumption:
$$\mathbb{P}_{\mathcal G_n}\{p_n \le \delta_n, \dots, p_1 \le \delta_1\} = \mathbb{E}_{\mathcal G_n}\left(\mathbb{E}_{\mathcal G_{n-1}}\left(I_{\{p_n \le \delta_n\}}\, I_{\{p_{n-1} \le \delta_{n-1}, \dots, p_1 \le \delta_1\}}\right)\right)$$
$$= \mathbb{E}_{\mathcal G_n}\left(I_{\{p_n \le \delta_n\}}\, \mathbb{E}_{\mathcal G_{n-1}}\left(I_{\{p_{n-1} \le \delta_{n-1}, \dots, p_1 \le \delta_1\}}\right)\right) = \mathbb{E}_{\mathcal G_n}\left(I_{\{p_n \le \delta_n\}}\, \delta_{n-1} \cdots \delta_1\right) = \delta_n \delta_{n-1} \cdots \delta_1$$
($I_E$ being the indicator of an event $E$) almost surely. By property B, (10) immediately implies
$$\mathbb{P}\{p_N \le \delta_N, \dots, p_1 \le \delta_1\} = \delta_N \cdots \delta_1.$$
Therefore, we have proved that the distribution of the random sequence $p_1 p_2 \cdots \in [0,1]^{\infty}$ coincides with $U^{\infty}$ on the σ-algebra $\mathcal F_N$ generated by the first $N$ coordinate random variables $p_1, \dots, p_N$. It is well known (see, e.g., [12], Theorem II.3.3) that this implies that the distribution of $p_1 p_2 \cdots$ coincides with $U^{\infty}$ on all measurable sets in $[0,1]^{\infty}$.
B Appendix: Kolmogorov's Programme and Repetitive Structures
In this section we briefly discuss Kolmogorov's programme for applications of probability and two related developments originated by Martin-Löf and Freedman; in particular, we formally define a version of the notion of repetitive structure which is in a sense isomorphic to our notion of OCM.

B.1 Kolmogorov's Programme
The standard approach to modelling uncertainty is to choose a family of probability distributions (a statistical model), one of which is believed to be the true distribution generating, or explaining in a satisfactory way, the data. (In some applications of probability theory, the true distribution is assumed to be known, and so the statistical model is a one-element set. In Bayesian statistics, the statistical model is complemented by another element, a prior distribution on the distributions in the model.) All modern applications of probability depend on this scheme. In 1965–1970 Kolmogorov suggested a different approach to modelling uncertainty based on information theory; its purpose was to provide a more direct link between the theory and applications of probability. His main idea was that "practical conclusions of probability theory can be substantiated as implications of hypotheses of limiting, under given constraints, complexity of the phenomena under study" [4]. The main features of Kolmogorov's programme can be described as follows:

C (Compression): One fixes a "sufficient statistic" for the data. This is a function of the data that extracts, intuitively, all useful information from the data. This can be the number of ones in a binary sequence (the "Bernoulli model" [13,14]); the numbers of ones after ones, ones after zeros, zeros after ones and zeros after zeros in a binary sequence (the "Markov model" [4]); or the sample average and sample variance of a sequence of real numbers (the "Gaussian model" [15,16]).

A (Algorithmic): If the value of the sufficient statistic is known, the information left in the data is noise. This is formalized in terms of Kolmogorov complexity: the complexity of the data under the constraint given by the value of the sufficient statistic should be maximal (in other words, the data should be algorithmically random given the value of the sufficient statistic).
U (Uniformity): Semantically, the requirement of algorithmic randomness in the previous item means that the conditional distribution of the data given the sufficient statistic is uniform.

D (Direct): It is preferable to deduce properties of data sets directly from the assumption of limiting complexity, without a detour through standard statistical models (examples of such direct inferences are given in [15,16] and hinted at in [4]), especially since Kolmogorov's models are not completely equivalent to standard statistical models [17].

(Kolmogorov's only two publications on his programme are [4,13]; the work reported in [14]–[17] was done under his supervision by his PhD students.) After 1965 Kolmogorov and Martin-Löf worked on the information-theoretic approach to probability applications independently of each other, but arrived at similar concepts and definitions. In 1973 [18] Martin-Löf introduced the notion of repetitive structure, later studied by Lauritzen [19]. Martin-Löf's theory of repetitive structures has features C and U of Kolmogorov's programme but not features A and D. An extra feature of repetitive structures is their on-line character: the conditional probability distributions are required to be consistent, and the sufficient statistic can usually be updated recursively as new data arrive. The absence of algorithmic complexity and randomness from Martin-Löf's theory does not look surprising; e.g., it is argued in [20] that these algorithmic notions are powerful sources of intuition, but for stating mathematical results in their strongest and most elegant form it is often necessary to "translate" them into a non-algorithmic form. A more serious deviation from Kolmogorov's ideas seems to be the absence of "direct inferences".
The goal in the theory of repetitive structures is to derive standard statistical models from repetitive structures (in the asymptotic on-line setting the difference between Kolmogorov-type and standard models often disappears); to apply repetitive structures to reality one still needs to go through statistical models. In our approach (see Theorem 1 above or the optimality results in [21,22]) statistical models become irrelevant. Freedman and Diaconis independently came up with ideas similar to Kolmogorov's (Freedman's first paper in this direction was published in 1962); they were inspired by de Finetti's theorem and the Krylov-Bogolyubov approach to ergodic theory. Kolmogorov only considered the three models we discuss in §4–6, but many other models have been considered by later authors (see, e.g., [6]). The difference between standard statistical modelling and Kolmogorov's modelling discussed in [17] is not important for the purpose of one-step-ahead forecasting in the exchangeable case (in particular, for both the exchangeability and Gaussian models of this paper; see [23]); it becomes important, however, in the Markov case. The theory of prediction with confidence has a dual goal: validity (there should not be too many errors) and quality (there should not be too many uncertain predictions). In the asymmetric Markov case, although we have the validity result (Theorem 1), there is little hope of obtaining an optimality result analogous to those of [21,22]. A manifestation of the difference between
the two approaches to modelling is, e.g., the fact that (7) involves the ratio $n_{1,0}/(n_{1,0}+n_{1,1})$ rather than something like $n_{0,1}/(n_{0,0}+n_{0,1})$.

B.2 Repetitive Structures
Let $\Sigma$ and $\mathbf{Z}$ be measurable spaces (of "summaries" and "examples", respectively). An OCM-repetitive structure consists of the following two elements:

– a system of statistics (measurable functions) $t_n : \mathbf{Z}^n \to \Sigma$, $n = 1, 2, \dots$;
– a system of kernels $P_n : \Sigma \to \mathbf{Z}^n$, $n = 1, 2, \dots$.

These two elements are required to satisfy the following consistency requirements:

Agreement between $P_n$ and $t_n$: for each $\sigma \in t_n(\mathbf{Z}^n)$, the probability distribution $P_n(\cdot \mid \sigma)$ is concentrated on the set $t_n^{-1}(\sigma)$.

Consistency of $t_n$ over $n$: for all integers $n > 1$, $t_n(z_1, \dots, z_n)$ is determined by $t_{n-1}(z_1, \dots, z_{n-1})$ and $z_n$, in the sense that the function $t_n$ is measurable w.r. to the σ-algebra generated by $t_{n-1}$ and $z_n$.

Consistency of $P_n$ over $n$: for all integers $n > 1$, all $\sigma \in t_n(\mathbf{Z}^n)$, all $\tau \in t_{n-1}(\mathbf{Z}^{n-1})$, and all $z \in \mathbf{Z}$, $P_{n-1}(\cdot \mid \tau)$ should be a version of the conditional distribution of $z_1, \dots, z_{n-1}$ when $z_1, \dots, z_n$ is generated from $P_n(dz_1, \dots, dz_n \mid \sigma)$ and it is known that $t_{n-1}(z_1, \dots, z_{n-1}) = \tau$ and $z_n = z$.

Remark 1. We say "OCM-repetitive structures" instead of "repetitive structures" since the latter are defined by different authors differently. Martin-Löf [18] is only interested in uniform $P_n$, does not have the condition that $t_n$ should be computable from $t_{n-1}$ and $z_n$ among his requirements, and his requirement of consistency of $P_n$ over $n$ involves conditioning on $t_{n-1} = \tau$ only (not on $z_n = z$). Lauritzen's ([19], p. 207) repetitive structures do not involve any probabilities (which enter the picture through parametric "projective statistical fields"). Bernardo and Smith [6] do not use this term at all.

The notions of OCM and OCM-repetitive structure are very close. If $M = (\Sigma, \square, \mathbf{Z}, (F_n), (B_n))$ is an OCM, then $M' := (\mathbf{Z}, \Sigma, (t_n), (P_n))$, as defined in §2, is an OCM-repetitive structure. If $M = (\mathbf{Z}, \Sigma, (t_n), (P_n))$ is an OCM-repetitive structure, an OCM $M'' := (\Sigma, \square, \mathbf{Z}, (F_n), (B_n))$ can be defined as follows:

– $F_n$ is a measurable function mapping $t_{n-1}(z_1, \dots, z_{n-1})$ and $z_n$ to $t_n(z_1, \dots, z_n)$, for all $(z_1, \dots, z_n) \in \mathbf{Z}^n$ (the existence of such $F_n$ follows from the consistency of $t_n$ over $n$);
– $B_n(d\sigma_{n-1}, dz_n \mid \sigma_n)$ is the image of the distribution $P_n(dz_1, \dots, dz_n \mid \sigma_n)$ under the mapping $(z_1, \dots, z_n) \mapsto (\sigma_{n-1}, z_n)$, where $\sigma_{n-1} := t_{n-1}(z_1, \dots, z_{n-1})$.

If $M$ is an OCM-repetitive structure, $(M'')'$ is essentially the same as $M$, and if $M$ is an OCM, $(M')''$ is essentially the same as $M$. In our examples (the exchangeability, Gaussian and Markov models) we found it more convenient to start from the statistics $t_n$ and distributions $P_n$; the conditions of consistency were obviously satisfied in those cases.
Transductive Confidence Machine Is Universal

Ilia Nouretdinov, Vladimir V'yugin, and Alex Gammerman

Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England
Abstract. Vovk’s Transductive Confidence Machine (TCM) is a practical prediction algorithm giving, in additions to its predictions, confidence information valid under the general iid assumption. The main result of this paper is that the prediction method used by TCM is universal under a natural definition of what “valid” means: any prediction algorithm providing valid confidence information can be replaced, without losing much of its predictive performance, by a TCM. We use as the main tool for our analysis the Kolmogorov theory of complexity and algorithmic randomness.
1 Introduction
In the last several decades new powerful machine-learning algorithms have appeared. A serious shortcoming of most of these algorithms, however, is that they do not directly provide any measures of confidence in the predictions they output. Two of the most important traditional ways to obtain such confidence information are provided by PAC theory (a typical result that can be used is Littlestone and Warmuth's theorem; see, e.g., [3]) and Bayesian theory. The former is discussed in detail in [9] and the latter is discussed in [8], but the disadvantages of the traditional approaches can be summarized as follows: PAC bounds are valid under the general iid assumption but are too weak for typical problems encountered in practice to give meaningful results; Bayesian bounds give practically meaningful results, but are only valid under strong extra assumptions. Vovk [4,16,14,11,12,17] proposed a practical (as confirmed by numerous empirical studies reported in those papers) method of computing confidence information valid under the general iid assumption. Vovk's Transductive Confidence Machine (TCM) is based on a specific formula,
$$p = \frac{|\{i : \alpha_i \ge \alpha_{l+1}\}|}{l+1},$$
where the $\alpha_i$ are numbers representing some measures of strangeness (cf. (1) in Section 2). A natural question is whether there are better ways to produce valid confidence information. In this paper (Sections 3 and 6) we show that the first-order answer is "no": no way of producing valid confidence information is drastically better than TCM. We present our results in terms of Kolmogorov's theory of algorithmic complexity and randomness.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 283–297, 2003.
© Springer-Verlag Berlin Heidelberg 2003
2 Prediction Using TCM
Suppose we have two sets: the training set $(x_1, y_1), \dots, (x_l, y_l)$ and the test set $(x_{l+1}, y_{l+1})$ containing only one example. The unlabelled examples $x_i$ are drawn from a set $\mathbf{X}$ and the labels $y_i$ are drawn from a finite set $\mathbf{Y}$; we assume that $|\mathbf{Y}|$ is small (i.e., we consider the problem of classification with a small number of classes).¹ The examples $(x_i, y_i)$ are assumed to be generated by some probability distribution $P$ (the same for all examples) independently of each other; we call this the iid assumption. Set $\mathbf{Z} := \mathbf{X} \times \mathbf{Y}$. For any $l$ a sequence $z^l = z_1, \dots, z_l$ defines a multiset $B$ of all elements of this sequence, where each element $z \in B$ is supplied with its arity $n(z) = |\{j : z_j = z\}|$. We call a multiset $B$ of this type a bag. Its size $|B|$ is defined as the sum of the arities of all its elements. The bag defined by a sequence $z^l$ is also called the configuration of this sequence; to be precise, define the standard representation of this bag as the set $\mathrm{con}(z^l) = \{(z_1, n(z_1)), \dots, (z_l, n(z_l))\}$. In this paper we discuss four natural ways of predicting with confidence, which we call Randomness Predictors, Exchangeability Predictors, Invariant Exchangeability Predictors, and Transductive Confidence Machines. We start with the latter (following the papers mentioned above).

An individual strangeness measure is a family of functions $A_n$, $n = 1, 2, \dots$, such that each $A_n$ maps every pair $(B, z)$, where $B$ is a bag of $n-1$ elements of $\mathbf{Z}$ and $z$ is an element of $\mathbf{Z}$, to a real (typically non-negative) number $A_n(B, z)$. (Intuitively, $A_n(B, z)$ measures how different $z$ is from the elements of $B$.) The Transductive Confidence Machine associated with $A_n$ works as follows: when given the data $(x_1, y_1), \dots, (x_l, y_l), x_{l+1}$ (the training set and the known component $x_{l+1}$ of the test example), every potential classification $y$ of $x_{l+1}$ is assigned the p-value
$$p(y) := \frac{|\{i : \alpha_i \ge \alpha_{l+1}\}|}{l+1}, \qquad (1)$$
where $\alpha_i := A_{l+1}(\mathrm{con}(z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_{l+1}), z_i)$, $z_j = (x_j, y_j)$ (except that $z_{l+1} = (x_{l+1}, y)$), and $\mathrm{con}(z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_{l+1})$ is a bag. TCM's output $p : \mathbf{Y} \to [0,1]$ can be further packaged in two different ways:

– we can output $\arg\max_y p(y)$ as the prediction and say that $1 - p_2$, where $p_2$ is the second largest p-value, is the confidence and that the largest p-value $p_1$ is the credibility;
– or we can fix some conventional threshold $\delta$ (such as 1% or 5%) and output as our prediction (predictive region) the set of all $y$ such that $p(y) > \delta$.

¹ By $|\mathbf{Y}|$ we mean the cardinality of the set $\mathbf{Y}$.
The essence of TCM is formula (1). The following simple example illustrates a definition of an individual strangeness measure in the spirit of the 1-Nearest Neighbour algorithm (we assume that the objects are vectors in a Euclidean space):
$$\alpha_i = \frac{\min_{j \ne i : y_j = y_i} d(x_i, x_j)}{\min_{j \ne i : y_j \ne y_i} d(x_i, x_j)},$$
where $d$ is the Euclidean distance (i.e., an object is considered strange if it is in the midst of objects labelled in a different way and is far from the objects labelled in the same way). For other examples of TCM (and corresponding algorithms computing the $\alpha_i$), see the papers referred to above.
3 Specific Randomness
Next we define Randomness Predictors (RP). First, we consider a typical example from statistics. Let $\mathbf{Z}^n$ be a sample space and $Q_n$, $n = 1, 2, \dots$, a sequence of probability distributions, $Q_n$ on $\mathbf{Z}^n$. Let $f_n(\omega)$ be statistics, i.e., a sequence of real-valued functions on $\mathbf{Z}^n$. The function
$$t_n(\omega) = Q_n\{\alpha : f_n(\alpha) \ge f_n(\omega)\}$$
is called a p-value and satisfies
$$Q_n\{\omega : t_n(\omega) \le \gamma\} \le \gamma \qquad (2)$$
for any real number $\gamma$. The outcomes $\omega$ with a small p-value have small probability. These outcomes should be considered almost impossible from the standpoint of the holder of the measure $Q_n$. The notion of p-value can easily be extended to the case where for any $n$ we consider a class $\mathcal{Q}_n$ of probability distributions on $\mathbf{Z}^n$:
$$t_n(\omega) = \sup_{Q \in \mathcal{Q}_n} Q\{\alpha : f_n(\alpha) \ge f_n(\omega)\}. \qquad (3)$$
This function satisfies
$$\sup_{Q \in \mathcal{Q}_n} Q\{\omega : t_n(\omega) \le \gamma\} \le \gamma \qquad (4)$$
for all $\gamma$. We fix the properties (2) and (4) as basic for the following definitions. Let, for any $n$, a probability distribution $Q_n$ on $\mathbf{Z}^n$ be given. We say that a sequence of functions $t_n(\omega) : \mathbf{Z}^n \to [0,1]$ is a $Q_n$-randomness test (p-test) if it satisfies inequality (2) for any $\gamma$. Analogously, let for any $n$ a class $\mathcal{Q}_n$ of probability distributions on $\mathbf{Z}^n$ be given. We say that a sequence of functions $t_n(\omega)$ is a $\mathcal{Q}_n$-randomness test if inequality (4) holds for any $\gamma$. We call inequality (2) or (4) the validity property of a test. We will consider two important statistical models on the sequence of sample spaces $\mathbf{Z}^n$. The iid model $\mathcal{Q}_n^{\mathrm{iid}}$, $n = 1, 2, \dots$, is defined for any $n$ by the class
of all probability distributions in Zn of the form Qn = P^n, where P is some probability distribution in Z and P^n is its n-fold product. Instead of (1), we now define the Randomness Predictor (RP)

p(y) := t_{l+1}(z^l, (x_{l+1}, y)),   (5)
where t_{l+1} is a Q^iid_{l+1}-randomness test and z^l = (x1, y1), ..., (xl, yl). Using this function, we can define the corresponding predictive region, or the prediction, confidence, and credibility, as above. The exchangeability model Q^exch_n, n = 1, 2, ..., uses exchangeable probability distributions. A probability distribution P in Zn is exchangeable if, for any permutation π : {1, ..., n} → {1, ..., n} and any data sequence z1, ..., zn ∈ Zn,
P(z1, ..., zn) = P(z_{π(1)}, ..., z_{π(n)}).

A sequence of functions tn : Zn → [0, 1], n = 1, 2, ..., is an exchangeability test if, for every n, any exchangeable probability distribution P in Zn, and any γ ∈ [0, 1],

P{(z1, ..., zn) ∈ Zn | tn(z1, ..., zn) ≤ γ} ≤ γ.   (6)

If we now define p(y) by the same formula (5), we obtain the notion of an Exchangeability Predictor (EP). If we further require tn to be invariant, in the sense that tn(z1, ..., zn) does not change if any zi and zj, i, j = 1, ..., n − 1, are swapped, then we arrive at the notion of an Invariant Exchangeability Predictor (IEP). Our first proposition asserts that TCM and IEP are essentially the same notion. Formally, we identify TCM, RP, EP, and IEP with the functions mapping (x1, y1), ..., (xl, yl), x_{l+1} to the function p = p_{(x1,y1),...,(xl,yl),x_{l+1}} : Y → [0, 1], according to (1) or (5), respectively. We say that a predictor (TCM, RP, EP, or IEP) p^A_{z^l,x_{l+1}}(y) is (at least) as good as a predictor p^B_{z^l,x_{l+1}}(y) if, for any training set z^l = (x1, y1), ..., (xl, yl), any unlabelled test example x_{l+1}, and any label y,

p^A_{z^l,x_{l+1}}(y) ≤ p^B_{z^l,x_{l+1}}(y).   (7)
We say that a class A (such as TCM or RP) of predictors is as good as a class B of predictors if for any B ∈ B there exists A ∈ A such that A is as good as B (i.e., if every predictor in B can be replaced by an equally good or better predictor in A).

Proposition 1. Transductive Confidence Machines are as good as Invariant Exchangeability Predictors, and vice versa.

Proof. For simplicity we will assume that X is finite. First we show that Transductive Confidence Machines are Invariant Exchangeability Predictors; we only need to check the validity property P{z^l, (x_{l+1}, y) | p_{z^l,x_{l+1}}(y) ≤ γ} ≤ γ for the values p(y) = p_{z^l,x_{l+1}}(y) computed according to (1), where P is an exchangeable distribution which generates (x1, y1), ..., (xl, yl), (x_{l+1}, y), and
Transductive Confidence Machine Is Universal
287
z^l = (x1, y1), ..., (xl, yl). Invariance is obvious. The inequality p_{z^l,x_{l+1}}(y) ≤ γ means that α_{l+1} is among the top 100γ% of the list α1, ..., α_{l+1}, where each element is repeated according to its arity; validity then follows from the fact that all permutations of the αi are P-equiprobable. To show that Invariant Exchangeability Predictors can be replaced by Transductive Confidence Machines, we have to construct the α's explicitly. Suppose we are given an IEP generated by an invariant exchangeability test t. If B is a bag in Z of size l and z ∈ Z, define A_{l+1}(B, z) = 1/t_{l+1}(z1, ..., zl, z), where z1, ..., zl is a list of all elements of B with repetitions (in any order; because of invariance, the order does not matter). The corresponding TCM is as good as the IEP. It is clear that EPs are as good as IEPs and that RPs are as good as EPs. In the next sections we will see that the converse relations also hold, in a weaker sense. To prove this we need the notion of an optimal randomness test from the theory of algorithmic randomness.
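The TCM p-value computation used in the proof above can be made concrete. The sketch below is our own illustration, not code from the paper: for each candidate label y it completes the data with (x_{l+1}, y), computes strangeness scores, and returns the fraction of examples at least as strange as the new one (ties counted, as in the "top 100γ%" argument). The nearest-neighbour strangeness measure is a hypothetical choice, in the spirit of the distance-based example at the start of the section.

```python
from typing import Callable, List, Sequence, Tuple

def tcm_p_values(
    train: List[Tuple[float, int]],          # training set (x_i, y_i)
    x_new: float,                            # unlabelled test example x_{l+1}
    labels: Sequence[int],                   # label set Y
    strangeness: Callable[[List[Tuple[float, int]]], List[float]],
) -> dict:
    """For every candidate label y, complete the data with (x_new, y),
    compute strangeness scores a_1..a_{l+1} and return the p-value
    p(y) = #{i : a_i >= a_{l+1}} / (l + 1)."""
    p = {}
    for y in labels:
        full = train + [(x_new, y)]
        alphas = strangeness(full)
        a_last = alphas[-1]
        p[y] = sum(1 for a in alphas if a >= a_last) / len(alphas)
    return p

def nn_strangeness(data):
    """Hypothetical nearest-neighbour strangeness: distance to the nearest
    example with the same label divided by distance to the nearest example
    with a different label (larger = stranger)."""
    scores = []
    for i, (x, y) in enumerate(data):
        same = [abs(x - u) for j, (u, v) in enumerate(data) if j != i and v == y]
        diff = [abs(x - u) for j, (u, v) in enumerate(data) if j != i and v != y]
        d_same = min(same) if same else float("inf")
        d_diff = min(diff) if diff else float("inf")
        scores.append(d_same / d_diff if d_diff > 0 else float("inf"))
    return scores

train = [(0.0, 0), (0.1, 0), (1.0, 1), (1.1, 1)]
p = tcm_p_values(train, 0.05, [0, 1], nn_strangeness)
# the test point sits among the label-0 examples, so label 0 should
# receive a much higher p-value than label 1
```

Prediction, confidence, and credibility are then read off from the per-label p-values as described earlier in the paper.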
4 Algorithmic Randomness
4.1 Uniform Tests for Randomness
We refer readers to [7] for details of the theory of Kolmogorov complexity and algorithmic randomness. We will also consider a logarithmic scale for tests (log-tests of randomness):²

dn(x|P, q) = − log tn(x|P, q),

where tn(x|P, q) is a randomness test satisfying (2) or (4). In this case the validity property (2) of a test must be replaced by

P{z^n : dn(z^n|P, q) ≥ m} ≤ 2^{−m}   (8)
for all n, m, P and q, where z^n = z1, ..., zn ∈ Zn. So, in the following sections we consider log-tests, for example iid log-tests or log-tests of exchangeability. We present our final results (Corollaries 2 and 3) for tests defined in the direct scale (4). Let us define the notion of an optimal uniform randomness test. Recall that Z := X × Y. Let N be the set of all positive integers and Q the set of all rational numbers. We consider the discrete topology on these sets. Let R be the set of all real numbers. We also need a certain "computationally effective" topology
² In the following all logarithms are to base 2. Below, q is a parameter from some set S of constructive objects. Any algorithm computing values of the test uses q as input.
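The log-scale validity property (8) can be checked by direct enumeration for a small discrete distribution. The sketch below is our own illustration (the distribution and statistic are hypothetical): it builds the exact p-value of Section 3, converts it to a log-test d = −log₂ t, and verifies both the direct-scale bound Q{t ≤ γ} ≤ γ and the log-scale bound Q{d ≥ m} ≤ 2^{−m}.

```python
import math

# A small hand-picked distribution Q on outcomes 0..4 and a statistic f
# (both hypothetical, chosen only to illustrate the definitions).
Q = {0: 0.4, 1: 0.3, 2: 0.15, 3: 0.1, 4: 0.05}
f = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}

# p-value of Section 3: t(w) = Q{a : f(a) >= f(w)}
t = {w: sum(q for a, q in Q.items() if f[a] >= f[w]) for w in Q}

# log-test: d(w) = -log2 t(w)
d = {w: -math.log2(t[w]) for w in Q}

# validity in the direct scale: Q{t <= gamma} <= gamma
for gamma in [0.01, 0.05, 0.1, 0.3, 1.0]:
    mass = sum(Q[w] for w in Q if t[w] <= gamma)
    assert mass <= gamma + 1e-12

# validity in the log scale (8): Q{d >= m} <= 2^{-m}
for m in range(0, 8):
    mass = sum(Q[w] for w in Q if d[w] >= m)
    assert mass <= 2.0 ** (-m) + 1e-12
```

The two loops are the same statement in the two scales: an outcome with randomness level at least m bits has probability at most 2^{−m}.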
in the set P(Zn) of all probability distributions in Zn. This topology is generated by the intervals {P ∈ P(Zn) : a1 < P(ω1) < b1, ..., ak < P(ωk) < bk}, where ωi ∈ Zn and ai < bi ∈ Q, i = 1, ..., k, k ∈ N. An open set U is called effectively open if it can be represented as a union of a recursively enumerable set of intervals. A family of real-valued functions dn(ω|P, q) of ω ∈ Zn, P ∈ P(Zn) and q ∈ S, with values in R ∪ {+∞}, is called lower semicomputable if the set

{(n, ω, r, P, q) : n ∈ N, ω ∈ Zn, r ∈ Q, q ∈ S, r < dn(ω|P, q)}   (9)
is effectively open in the product topology on the set³ D = N × Zn × Q × P(Zn) × S. This means that some algorithm, given n, q and finite approximations of P, enumerates rational approximations of the test from below.

Proposition 2. There exists an optimal uniform randomness log-test dn, which means the following: dn is lower semicomputable, and for any other lower semicomputable uniform randomness log-test d′n,⁴

dn(ω|P, q) + O(1) ≥ d′n(ω|P, q).

The proof of this proposition uses the well-known idea of universality from Kolmogorov's theory of algorithmic randomness. We fix some optimal uniform randomness log-test dn(ω|P, q). The value dn(ω|P, q) is called the randomness level of the sequence ω with respect to P. The parameter q will be used only in Section 4.2, for technical reasons. In the following we usually fix some q ∈ S and omit this variable from the notation of the test. Using the direct scale, we consider the optimal uniform randomness test δn(ω|P) = 2^{−dn(ω|P)} satisfying (2). This test is minimal, up to a multiplicative constant, in the class of all upper semicomputable tests satisfying (2).⁵ It is easy to verify that Proposition 2 and all considerations above remain valid if we restrict ourselves to P ∈ Q^iid_n or to P ∈ Q^exch_n. So, we can consider uniform optimal tests of randomness with respect to the classes of iid or exchangeable probability distributions. More precisely, analogously to definition (3),
³ This topology is generated by intervals, which can be considered as constructive objects (more precisely, any such interval has a standard constructive representation).
⁴ a(x1, ..., xn) ≤ b(x1, ..., xn) + O(1) or a(x1, ..., xn) − O(1) ≤ b(x1, ..., xn) means that a constant c ≥ 0 exists such that a(x1, ..., xn) ≤ b(x1, ..., xn) + c holds for all values of the free variables x1, ..., xn. a(x1, ..., xn) = b(x1, ..., xn) + O(1) means that both a ≤ b + O(1) and a ≥ b − O(1) hold. Relations with the product sign are treated analogously, using multiplicative factors.
⁵ A definition of upper semicomputability is obtained from the definition (9) of lower semicomputability by replacing < with >.
we define the optimal log-test with respect to a sequence Q = {Qn} of classes of probability distributions in Zn, n = 1, 2, ...:

d^Q_n(z1, ..., zn) = inf_{P ∈ Qn} d(z1, ..., zn|P),   (10)

where z1, ..., zn ∈ Zn. So, an optimal iid log-test d^iid_n(z1, ..., zn) corresponds to the iid model Q^iid_n. In the direct scale the iid test is represented as

δ^iid_n(z1, ..., zn) = 2^{−d^iid_n(z1, ..., zn)}.
Analogously, an optimal uniform exchangeability log-test d^exch_n (δ^exch_n) is defined. To define the optimal invariant exchangeability test d^invexc_n (δ^invexc_n), we consider in Proposition 2 only invariant log-tests. These optimal tests d^iid_n (δ^iid_n), d^exch_n (δ^exch_n) and d^invexc_n (δ^invexc_n) determine the Optimal Randomness Predictor, the Optimal Exchangeability Predictor, and the Optimal Invariant Exchangeability Predictor, respectively. The main goal of this paper is to prove the following approximate equality:⁶

δ^invexc(z^l, (x_{l+1}, y)) ≈ δ^iid(z^l, (x_{l+1}, y))   (11)
if the data set z^l = (x1, y1), ..., (xl, yl), x_{l+1} is random and the set Y is small. This shows the universality of TCM: the optimal IEP (equivalently, TCM; see Proposition 1) is about as good as the optimal RP. The precise statement involves a multiplicative constant C; this is inevitable, since randomness and exchangeability levels are only defined to within a constant factor (in the direct scale). We prove this assertion in the following way. Approximate equality (11) is split into two:

δ^invexc(z^l, (x_{l+1}, y)) ≈ δ^exch(z^l, (x_{l+1}, y)),   (12)
δ^exch(z^l, (x_{l+1}, y)) ≈ δ^iid(z^l, (x_{l+1}, y))   (13)
(Theorem 1, Section 5 and Theorem 2, Section 6 below).

4.2 p- and i-tests
The definition of an i-test is obtained if we replace the validity property (8) by the stronger requirement

∫ 2^{d^i_n(z^n|P,q)} dP(z^n) ≤ 1.   (14)

We call a log-test satisfying (8) a p-log-test. It is easy to verify that Proposition 2 holds for i-tests; relations (11), (12) and (13) for i-tests are also valid. By the Chebyshev inequality, each i-test is a p-test. The following proposition gives the relation between optimal p- and i-tests.

⁶ In the following we omit the lower index l + 1 in the notation of tests.
Proposition 3. Let d^p_n(z^n|P, q) be the optimal p-log-test and d^i_n(z^n|P, q) the optimal i-log-test. Then⁷

d^i(z^n|P) − O(1) ≤ d^p(z^n|P) ≤ d^i(z^n|P, d^p(z^n|P)) + O(1).

Proof. The first inequality is obvious. To prove the second one, note that the lower semicomputable function

ψ(z^n|P, m) = m − 1 if d^p(z^n|P) ≥ m, and ψ(z^n|P, m) = −1 otherwise,

is an i-log-test. Indeed,

∫ 2^{ψ(z^n|P,m)} dP(z^n) = ∫_{z^n : d^p(z^n|P) ≥ m} 2^{m−1} dP(z^n) + ∫_{z^n : d^p(z^n|P) < m} 2^{−1} dP(z^n) ≤ 1.
Then, by the definition of the optimal i-test, d^i(z^n|P, m) ≥ ψ(z^n|P, m) − O(1). Putting m = d^p(z^n|P) in this inequality we obtain d^i(z^n|P, d^p(z^n|P)) ≥ d^p(z^n|P) − O(1).

The relation between conditional and unconditional i-tests is given by the following proposition.

Proposition 4. Let k ∈ N. Then

d^i(z^n|P) − O(1) ≤ d^i(z^n|P, k) ≤ d^i(z^n|P) + 2 log k + O(1).

Proof. The first inequality is obvious. To prove the second inequality, note that the function

ψ(z^n|P) = log Σ_{k=2}^{∞} k^{−2} 2^{d^i(z^n|P,k)}

is an i-log-test. Indeed, it is lower semicomputable and

∫ 2^{ψ(z^n|P)} dP(z^n) ≤ Σ_{k=2}^{∞} k^{−2} ∫ 2^{d^i(z^n|P,k)} dP(z^n) ≤ 1.
Then d^i(z^n|P) + O(1) ≥ ψ(z^n|P) ≥ d^i(z^n|P, k) − 2 log k.

From Propositions 3 and 4 we obtain the following corollary.
⁷ We omit the lower index, i.e. we write d(z^n|P, q) instead of dn(z^n|P, q). We also omit the parameter q when it is not used.
Corollary 1. d^p(z^n|P) + O(1) ≥ d^i(z^n|P) ≥ d^p(z^n|P) − 2 log d^p(z^n|P) − O(1).

We use i-tests since they simplify our proofs, but we formulate our main result, Theorem 2, for p-tests. In the following, the p- and i-variants of optimal tests for classes of probability distributions, namely d^{p,iid} and d^{i,iid}, d^{p,exch} and d^{i,exch}, d^{p,invexc} and d^{i,invexc}, will be considered.

4.3 Randomness with Respect to Exchangeable Probability Distributions
Any computable function F(p, q) (a method of decoding), where p is a binary string and q ∈ S, defines a measure of (plain) Kolmogorov complexity K_F(x|q) = min{|p| : F(p, q) = x}. The main result of the theory is that an optimal F exists such that K_F(x|q) ≤ K_{F′}(x|q) + O(1) holds for any method of decoding F′. For the detailed definition and the main properties of (conditional) Kolmogorov complexity K(x|q) we refer the reader to the book [7]. In the following we consider the prefix modification of Kolmogorov complexity [7]. This means that only prefix methods of decoding are considered: if F(p, q) and F(p′, q) are defined, then the strings p and p′ are incomparable. Kolmogorov defined in [6] the notion of deficiency of randomness of an element x of a finite set D:

d(x|D) = log |D| − K(x|D).   (15)
It is easy to verify that K(x|D) ≤ log |D| + O(1) and that the number of x ∈ D such that d(x|D) > m does not exceed 2^{−m}|D|. Earlier, in [5], he also defined an m-Bernoulli sequence as a sequence x satisfying

K(x|n, k) ≥ log C(n, k) − m,

where n is the length of x, k is the number of ones in it, and C(n, k) is the binomial coefficient. For any finite sequence x^n = x1, ..., xn ∈ Zn consider the permutation set

Ξ(x^n) = {z^n : con(z^n) = con(x^n)},   (16)
i.e. the set of all sequences with the same configuration as x^n (the set of all permutations of x^n). For any permutation set Ξ we consider the measure

Q_Ξ(z^n) = 1/|Ξ| if z^n ∈ Ξ, and Q_Ξ(z^n) = 0 otherwise,

concentrated on the set Ξ of all sequences with the same configuration. An optimal uniform log-test d(x^n|Q_{Ξ(x^n)}, q) for the class {Q_Ξ : ∃z^n ∈ Zn (Ξ = Ξ(z^n))} can be defined in the spirit of Proposition 2. The next proposition shows that the deficiency of exchangeability can be characterized in a fashion free from the concept of probability.
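Kolmogorov complexity K is uncomputable, so the deficiency (15) cannot be computed exactly; but any real compressor gives an upper bound on K (up to coding details), which lets one illustrate the idea. The sketch below is purely our illustration, with zlib as a crude stand-in for K: within the permutation set of a bag, log|Ξ| is the same for every ordering, so the ordering that compresses better receives the larger estimated deficiency.

```python
import random
import zlib

def compressed_len_bits(seq):
    """Crude upper-bound stand-in for K: length in bits of the
    zlib-compressed byte encoding of the sequence."""
    return 8 * len(zlib.compress(bytes(seq), 9))

# A bag (multiset) of symbols; its permutation set contains all orderings.
bag = [0, 1] * 200

regular = sorted(bag)            # a highly regular element of the permutation set
random.seed(0)
typical = bag[:]
random.shuffle(typical)          # a "typical" (incompressible-looking) element

# log|Xi| is identical for both orderings, so the difference in estimated
# deficiency reduces to the difference in estimated complexity: the regular
# ordering compresses far better, hence gets the larger deficiency estimate.
k_regular = compressed_len_bits(regular)
k_typical = compressed_len_bits(typical)
```

This matches the intuition behind (15): among the elements of a finite set, only the highly compressible ones are "non-random", and there can be few of them.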
Proposition 5. It holds⁸

d^{i,exch}(z^n|q) = log |Ξ(z^n)| − K(z^n|Ξ(z^n), q) + O(1).   (17)
Proof. We prove (17) and that it is also equal to d^i(z^n|Q_{Ξ(z^n)}, q) + O(1). Let us prove that the function

d̂^i(z^n|q) = log |Ξ(z^n)| − K(z^n|Ξ(z^n), q)

is a uniform i-log-test of exchangeability. Indeed, let P̂(Ξ(z^n)) = Σ_{z^n ∈ Ξ(z^n)} P(z^n). Then for any exchangeable measure P ∈ P(Zn)

Σ_{z^n ∈ Zn} 2^{d̂^i(z^n|q)} P(z^n) = Σ_{z^n ∈ Zn} 2^{−K(z^n|Ξ(z^n),q)} P̂(Ξ(z^n)) = Σ_Ξ P̂(Ξ) Σ_{z^n ∈ Ξ} 2^{−K(z^n|Ξ,q)} ≤ 1.
Then d̂^i(z^n|q) ≤ d^i(z^n|P, q) + O(1) for any exchangeable measure P, and so we have d̂^i(z^n|q) ≤ d^{i,exch}(z^n|q) + O(1) ≤ d^i(z^n|Q_{Ξ(z^n)}, q) + O(1). Let us check the converse inequality. Let Ξ = Ξ(z^n). We have

d^{i,exch}(z^n|q) = inf_{P ∈ Q^exch} d(z^n|q, P) ≤ d(z^n|q, Q_Ξ) ≤ log |Ξ| − K(z^n|q, Q_Ξ) + O(1) = d̂^i(z^n|q) + O(1).

Here we take into account that K(z^n|q, Q_Ξ) = K(z^n|q, Ξ) + O(1), which follows from the fact that the measure Q_Ξ and the configuration Ξ are computationally equivalent.

Let D be a bag of elements of Z and let x ∈ D have arity k(x). Then we can assign the probability P(x) = k(x)/|D| to each element x of the bag, and a positive integer l_x such that 2^{−l_x−1} ≤ P(x) ≤ 2^{−l_x}. It follows from the Kraft inequality Σ_x 2^{−l_x−1} ≤ 1 that a corresponding decodable prefix code exists, and so K(x|D) ≤ log(|D|/k(x)) + O(1). Let us define the randomness deficiency of x with respect to a bag D:

d(x|D) = log(|D|/k(x)) − K(x|D).   (18)
We have |{x : d(x|D) ≥ m}| ≤ 2^{−m}|D| for any m. The following proposition implies that the optimal invariant exchangeability log-test d^{i,invexc} of a training set (x1, y1), ..., (xl, yl) and a test example (x_{l+1}, y) coincides with the generalized Kolmogorov deficiency of randomness of the test example (x_{l+1}, y) with respect to the configuration of the whole sequence.

Proposition 6. Let u1, ..., u_{l+1} ∈ Z^{l+1}. Then

d^{i,invexc}(u1, ..., u_{l+1}) = d(u_{l+1}|con(u1, ..., u_{l+1})) + O(1).

The proof of this proposition is analogous to the proof of Proposition 5.

⁸ The same relation holds for d^{p,exch}(z^n|q) if we replace the prefix variant of Kolmogorov complexity by its plain variant.
5 EP and IEP Are Equivalent
Let us define

d^{i,exch}(z^l, x_{l+1}) = min_{y ∈ Y} d^{i,exch}(z^l, (x_{l+1}, y)).   (19)
The following theorem implies that if a training set is random⁹ (with respect to some exchangeable measure), then EP and IEP are almost the same notion.

Theorem 1. It holds

d^{i,invexc}(z^l, (x_{l+1}, y)) − O(1) ≤ d^{i,exch}(z^l, (x_{l+1}, y)) ≤ d^{i,invexc}(z^l, (x_{l+1}, y)) + 2 log d^{i,invexc}(z^l, (x_{l+1}, y)) + d^{i,exch}(z^l, x_{l+1}) + 2 log |Y| + O(1),

where z^l = (x1, y1), ..., (xl, yl) is a training set and (x_{l+1}, y) is a test example.

The proof of this theorem is based on the relation for the complexity of a pair [7] and is presented in Section 7.1. In the direct scale of the definition of a test we have

Corollary 2. For any ε > 0,

O(1) δ^{i,invexc}(z^l, (x_{l+1}, y)) ≥ δ^{i,exch}(z^l, (x_{l+1}, y)) ≥ (δ^{i,invexc}(z^l, (x_{l+1}, y)))^{1+ε} δ^{i,exch}(z^l, x_{l+1}) |Y|^{−2} / O(1),

where z^l = (x1, y1), ..., (xl, yl) is a training set and (x_{l+1}, y) is a test example.
6 RP and IEP Are Equivalent
In this section we use p-tests. Let us define

d^{p,iid}(z^l, x_{l+1}) = min_{y ∈ Y} d^{p,iid}(z^l, (x_{l+1}, y)).   (20)
The following theorem shows that the difference between RP and IEP is not essential in the most interesting case, where the training set and the unlabelled test example are random with respect to some iid probability distribution.

Theorem 2. It holds

d^{p,iid}(z^l, (x_{l+1}, y)) + O(1) ≥ d^{p,invexc}(z^l, (x_{l+1}, y)) ≥ d^{p,iid}(z^l, (x_{l+1}, y)) − 4 d^{p,iid}(z^l, x_{l+1}) − 2 log d^{p,iid}(z^l, x_{l+1}) − 4 log |Y| − O(1),   (21)

where z^l = (x1, y1), ..., (xl, yl) is a training set, x_{l+1} is an unlabelled test example, and y is a label.

The proof of this theorem is based on Theorem 1, on Propositions 7 and 8, and on Corollary 1 (see Section 7.3).
⁹ In other words, we suppose that the optimal log-test of the training set is small.
Corollary 3. Let ε > 0. Then

δ^{p,iid}(z^l, (x_{l+1}, y)) / O(1) ≤ δ^{p,invexc}(z^l, (x_{l+1}, y)) ≤ (δ^{p,iid}(z^l, (x_{l+1}, y)))^{1−ε} |Y|^4 (δ^{p,iid}(z^l, x_{l+1}))^{−(4+ε)} O(1),

where z^l = (x1, y1), ..., (xl, yl) is a training set, x_{l+1} is an unlabelled test example and y is a label.

Acknowledgments. Volodya Vovk initiated this work and proposed the ideas of the main theorems. The authors are deeply grateful to him for valuable discussions.
7 Appendix

7.1 Proof of Theorem 1
Let z^l = (x1, y1), ..., (xl, yl) be a training set and (x_{l+1}, y) a test example. By definition (19), for any z^l and x_{l+1} there exists a ȳ such that d^{i,exch}(z^l, x_{l+1}) = d^{i,exch}(z^l, (x_{l+1}, ȳ)). Let Ξ be the set of all permutations of z^l, (x_{l+1}, y) and Ξ̄ the set of all permutations of z^l, (x_{l+1}, ȳ). By Proposition 5 we have

d^{i,exch}(z^l, (x_{l+1}, y)) = log |Ξ| − K(z^l, (x_{l+1}, y)|Ξ) + O(1),
d^{i,exch}(z^l, (x_{l+1}, ȳ)) = log |Ξ̄| − K(z^l, (x_{l+1}, ȳ)|Ξ̄) + O(1).   (22)

Let k be the arity of (x_{l+1}, y) in con(z^l, (x_{l+1}, y)) and k̄ the arity of (x_{l+1}, ȳ) in con(z^l, (x_{l+1}, ȳ)). By definition, k|Ξ| = k̄|Ξ̄|. Then from (22) we obtain

d^{i,exch}(z^l, (x_{l+1}, y)) = d^{i,exch}(z^l, (x_{l+1}, ȳ)) + log k̄ − log k + K(z^l, (x_{l+1}, ȳ)|Ξ̄) − K(z^l, (x_{l+1}, y)|Ξ) + O(1).   (23)
Transductive Confidence Machine Is Universal
295
Let us prove the following inequalities between complexities: ¯ ≤ K(z l |xl+1 , y, d((xl+1 , y)|con(z l , (xl+1 , y)), Ξ) K(z l , (xl+1 , y¯)|Ξ) +2 log d((xl+1 , y)|con(z l , (xl+1 , y))) + log(l + 1) − log k¯ + 2 log |Y| + O(1) Indeed, let a program p conditional on xl+1 , y, d((xl+1 , y)|con(z l , (xl+1 , y))) and Ξ computes z l . We add to p the binary codes of m, y and d((xl+1 , y)|con(z l , (xl+1 , y))). Using Ξ¯ we can restore con(z l , (xl+1 , y¯)), and then by m we restore xl+1 and y¯. Using this information we can also trans¯ binary codes of m, y and by form Ξ¯ to Ξ. Hence, by the program p, Ξ, l d((xl+1 , y)|con(z , (xl+1 , y))) we can compute z l , xl+1 and y¯. By definition d((xl+1 , y)|con(z l , (xl+1 , y))) = log(l + 1) − log k − K((xl+1 , y)|con(z l , (xl+1 , y))).
(25)
Evidently, con(z l , (xl+1 , y)) and Ξ are computationally equivalent. By (25) and Proposition 6 the value of K(xl+1 , y|Ξ) can be computed by d((xl+1 , y)|con(z l , (xl+1 , y))), Ξ and pair (xl+1 , y). Then we have 10 K(z l |xl+1 , y, d((xl+1 , y)|con(z l , (xl+1 , y)), Ξ)) ≤ K(z l |xl+1 , y, K(xl+1 , y|Ξ), Ξ) + O(1).
(26)
Then by (26), (24) and (25) we obtain di,exch (z l , (xl+1 , y)) ≤ di,exch (z l , (xl+1 , y¯)) + log k¯ − log k + log(l + 1) − log k¯ − K((xl+1 , y)|con(z l , (xl+1 , y))) + 2 log d((xl+1 , y)|con(z l , (xl+1 , y))) + 2 log |Y| + O(1). To obtain the final result we should apply Proposition 6. 7.2
iid and Exchangeability Tests
We recall an important relation between iid and exchangeability tests from [13]. Proposition 7. It holds di,exch (z n ) + O(1) ≥ di,iid (z n ) − di,iid (Ξ(z n )) − 2 log di,iid (Ξ(z n )),
(27)
where z n ∈ Zn . Proof omitted. 10
Here we use inequality K(x|q) ≤ K(x|f (q)) + O(1) which holds for any computable function f (see [7]).
296
7.3
I. Nouretdinov, V. V’yugin, and A. Gammerman
Proof of Theorem 2
Proposition 8. Let z n = (x1 , y1 ), . . . , (xn , yn ). Then dp,iid (Ξ(z n , (xn+1 , y))) ≤ dp,iid (z n , xn+1 ) + 2 log |Y| + O(1).
(28)
For simplicity of presentation we consider only a case where all emements of z n = (x1 , y1 ), . . . , (xn , yn ) are distinct and Y = {0, 1}. Lemma 1. Let z n ∈ Zn . Then dp,iid (Ξ(z n )) ≤ dp,iid (z n ) + O(1). Proof omitted. Lemma 2. Suppose that P1 (x, y) =
n 1 P (x, y) + P (x, 1 − y); n+1 n+1
and U is the epimorphism U (z n , (xn+1 , yn+1 )) = Ξ(z n , (xn+1 , 1 − yn+1 )), where z n = (x1 , y1 ), . . . , (xn , yn ). Then for any class L of permutations sets P n+1 (U −1 (L)) ≤ P1n+1 (L). Proof omitted. Lemma 3. Let dp be the optimal uniform randomness p-log-test. Then for any P ∈ P(Z) there exists a P1 ∈ P(Z) such that dp (Ξ(z n , (xn+1 , y))|P1n+1 ) ≤ dp (z n , (xn+1 , 1 − y)|P n+1 ) + O(1). Proof. The measure P1 can be defined as in the Lemma 2. We know that P n+1 (U −1 (L)) ≤ P1n+1 (L) for any class L of permutation sets, and the statement has the type dp (W |P1n+1 ) ≤ dp (v|P n+1 ) + O(1), where W is a permutation set and v ∈ U −1 (W ). Indeed, d (v|P n+1 ) = dp (U (v)|P1n+1 ) is really an uniform test of randomness, let us check the validity property: P n+1 (v : d (v|P n+1 ) ≥ m) = P n+1 (v : dp (U (v)|P1n+1 ) ≥ m) ≤ P1n+1 (W : dp (W |P1n+1 ) ≥ m) ≤ 2−m for any m. Since d is a p-log-test, we have dp (W |P1n+1 ) = dp (U (v)|P1n+1 ) = d (v|P n+1 ) ≤ dp (v|P n+1 ) + O(1). To obtain the statement of the lemma we put v = z n , (xn+1 , 1 − y) and W = Ξ(z n , (xn+1 , y)).
Transductive Confidence Machine Is Universal
297
Lemma 4. dp,iid (Ξ(z n , (xn+1 , y))) ≤ dp,iid (z n , (xn+1 , 1 − y)) + O(1). Proof. By Lemma 3 we have for some P and P1 dp,iid (z n , (xn+1 , 1 − y)) = dp (z n , (xn+1 , 1 − y)|P n+1 ) + O(1) ≥ dp (Ξ(z n , (xn+1 , y))|P1n+1 ) + O(1) ≥ dp,iid (Ξ(z n , (xn+1 , y))) Proof of Proposition 8. Taking into account definition (20) we obtain inequality (28) as a direct corollary of Lemma 1 and Lemma 4. Proof of Theorem 2. Inequality (21) is a direct corollary of Theorem 1, Proposition 8 and Corollary 1.
References 1. J.M. Bernardo, A.F.M. Smith. Bayesian Theory. Wiley, Chichester, 2000. 2. D.Cox, D.Hinkley. Theoretical Statistics. Chapman, Hall, London, 1974. 3. N. Cristianini, J. Shawe-Taylor. An Introduction to Support Vector Machines and OtherKernel-based Methods. Cambridge, Cambridge University Press, 2000. 4. A. Gammerman, V. Vapnik, V. Vovk. Learning by transduction. In Proceedings of UAI’1998, pages 148–156, San Francisco, MorganKaufmann. 5. A.N. Kolmogorov Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1965, 1 N1, p.4–7. 6. A.N. Kolmogorov Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Suveys, 1983, 38, N4, p.29–40. 7. M. Li, P. Vit´ anyi. An Introduction to Kolmogorov Complexity and ItsApplications. Springer, New York, 2nd edition, 1997. 8. T. Melluish, C. Saunders, I. Nouretdinov, V. Vovk. Comparing the Bayes and typicalness frameworks.In Proceedings of ECML’2001, 2001.Full version published as a CLRC technical report TR-01-05; seehttp://www.clrc.rhul.ac.uk. 9. I. Nouretdinov, V. Vovk, M. Vyugin, A. Gammerman. Pattern recognition and density estimation under the general i.i.d. assumption. In David Helmbold and Bob Williamson, editors, Proceedings of COLT’ 2001, pages 337–353. 10. H. Rogers. Theory of recursive functions and effective computability, New York: McGraw Hill, 1967 11. C. Saunders, A. Gammerman, V. Vovk. Transduction with confidence and credibility. In Proceedings of the 16th IJCAI, pages 722–726, 1999. 12. C. Saunders, A. Gammerman, V. Vovk. Computationally efficient transductive machines. In Proceedings of ALT’00, 2000. 13. V. Vovk. On the concept of the Bernoulli property. Russian Mathematical Surveys, 41:247–248, 1986. 14. V. Vovk, A. Gammerman. Statistical applications of algorithmic randomness. In Bulletin of the International Statistical Institute. The 52ndSession, Contributed Papers, volume LVIII, book 3, pages 469–470, 1999. 15. V. 
Vovk, A. Gammerman. Algorithmic randomness for machine learning. Manuscript, 2001. 16. V. Vovk, A. Gammerman, C. Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the 16th ICML, pages 444–453, 1999. 17. V. Vovk. On-Line Confidence Machines Are Well-Calibrated. In proceedings of FOCS’02, pages 187–196, 2002. 18. I. Nuretdinov, V. Vovk, V. V’yugin, A. Gammerman, Transductive confidence machine is universal. CLRC technical report http://www.clrc.rhul.ac.uk/tech-report/
On the Existence and Convergence of Computable Universal Priors Marcus Hutter IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland [email protected] http://www.idsia.ch/˜marcus
Abstract. Solomonoff unified Occam’s razor and Epicurus’ principle of multiple explanations to one elegant, formal, universal theory of inductive inference, which initiated the field of algorithmic information theory. His central result is that the posterior of his universal semimeasure M converges rapidly to the true sequence generating posterior µ, if the latter is computable. Hence, M is eligible as a universal predictor in case of unknown µ. We investigate the existence and convergence of computable universal (semi)measures for a hierarchy of computability classes: finitely computable, estimable, enumerable, and approximable. For instance, M is known to be enumerable, but not finitely computable, and to dominate all enumerable semimeasures. We define seven classes of (semi)measures based on these four computability concepts. Each class may or may not contain a (semi)measure which dominates all elements of another class. The analysis of these 49 cases can be reduced to four basic cases, two of them being new. We also investigate more closely the types of convergence, possibly implied by universality: in difference and in ratio, with probability 1, in mean sum, and for Martin-L¨ of random sequences. We introduce a generalized concept of randomness for individual sequences and use it to exhibit difficulties regarding these issues.
1
Introduction
All induction problems can be phrased as sequence prediction tasks. This is, for instance, obvious for time series prediction, but also includes classification tasks. Having observed data x1 ,...,xt−1 at times 1,...,t−1, the task is to predict the t-th symbol xt from sequence x=x1 ...xt−1 . The key concept to attack general induction problems is Occam’s razor and to a less extent Epicurus’ principle of multiple explanations. The former/latter may be interpreted as to keep the simplest/all theories consistent with the observations x1 ...xt−1 and to use these theories to predict xt . Solomonoff [Sol64,Sol78] formalized and combined both principles in his universal prior M (x) which assigns high/low probability to simple/complex environments, hence implementing Occam and Epicurus. Solomonoff’s [Sol78] central result is that if the probability µ(xt |x1 ...xt−1 ) of observing xt at time
This work was supported by SNF grant 2000-61847.00 to J¨ urgen Schmidhuber.
R. Gavald` a et al. (Eds.): ALT 2003, LNAI 2842, pp. 298–312, 2003. c Springer-Verlag Berlin Heidelberg 2003
On the Existence and Convergence of Computable Universal Priors
299
t, given past observations x1 ...xt−1 is a computable function, then the universal posterior M (xt |x1 ...xt−1 ) converges rapidly for t → ∞ to the true posterior µ(xt |x1 ...xt−1 ), hence M represents a universal predictor in case of unknown µ. One representation of M is as a weighted sum of all enumerable “defective” probability measures, called semimeasures (see Definition 2). The (from this representation obvious) dominance M (x) ≥ const.×µ(x) for all computable µ is the central ingredient in the convergence proof. What is so special about the class of all enumerable semimeasures Msemi enum ? The larger we choose M the less restrictive is the essential assumption that M should contain the true distribution µ. Why not restrict to the still rather general class of estimable or finitely computable (semi)measures? For every countable class M and ξM (x):= w ν∈M ν ν(x) with wν > 0, the important dominance ξM (x) ≥ wν ν(x) ∀ν ∈ M is satisfied. The question is what properties does the mixture ξM possess. The distinguishing property of M = ξMsemi is that it is itself an element of Msemi enum . enum On the other hand, for prediction ξM ∈M is not by itself an important property. What matters is whether ξM is computable (in one of the senses defined) to avoid getting into the (un)realm of non-constructive math. The intention of this work is to investigate the existence, computability and convergence of universal (semi)measures for various computability classes: finitely computable ⊂ estimable ⊂ enumerable ⊂ approximable (see Definition 1). For instance, M (x) is enumerable, but not finitely computable. The research in this work was motivated by recent generalizations of Kolmogorov complexity and Solomonoff’s prior by Schmidhuber [Sch02] to approximable (and others not here discussed) cases. Contents. In Section 2 we review various computability concepts and discuss their relation. 
In Section 3 we define the prefix Kolmogorov complexity K, the concept of (semi)measures, Solomonoff’s universal prior M , and explain its universality. Section 4 summarizes Solomonoff’s major convergence result, discusses general mixture distributions and the important universality property – multiplicative dominance. In Section 5 we define seven classes of (semi)measures based on four computability concepts. Each class may or may not contain a (semi)measures which dominates all elements of another class. We reduce the analysis of these 49 cases to four basic cases. Domination (essentially by M ) is known to be true for two cases. The two new cases do not allow for domination. In Section 6 we investigate more closely the type of convergence implied by universality. We summarize the result on posterior convergence in difference (ξ −µ → 0) and improve the previous result [LV97] on the convergence in ratio ξ/µ → 1 by showing rapid convergence without use of Martingales. In Section 7 we investigate whether convergence for all Martin-L¨ of random sequences could hold. We define a generalized concept of randomness for individual sequences and use it to show that proofs based on universality cannot decide this question. Section 8 concludes the paper. Notation. We denote strings of length n over finite alphabet X by x=x1 x2 ...xn with xt ∈ X and further abbreviate x1:n := x1 x2 ...xn−1 xn and x
300
M. Hutter n→∞
sequences. We abbreviate limn→∞ [f (n)−g(n)] = 0 by f (n) −→ g(n) and say f converges to g, without implying that limn→∞ g(n) itself exists. We write f (x)g(x) for g(x) = O(f (x)), i.e. if ∃c > 0 : f (x) ≥ cg(x)∀x.
2 Computability Concepts
We define several computability concepts weaker than what can be captured by halting Turing machines.

Definition 1 (Computable functions). We consider functions f : IN → IR:
f is finitely computable or recursive iff there are Turing machines T1, T2 with output interpreted as natural numbers and f(x) = T1(x)/T2(x);
f is approximable iff there is a finitely computable φ(·,·) with lim_{t→∞} φ(x,t) = f(x);
f is lower semi-computable or enumerable iff additionally φ(x,t) ≤ φ(x,t+1);
f is upper semi-computable or co-enumerable iff −f is lower semi-computable;
f is semi-computable iff f is lower- or upper semi-computable;
f is estimable iff f is lower- and upper semi-computable.

If f is estimable we can finitely compute an ε-approximation of f by upper and lower semi-computing f and terminating when the two bounds differ by less than ε. This means that there is a Turing machine which, given x and ε, finitely computes ŷ such that |ŷ − f(x)| < ε; moreover, it gives an interval estimate f(x) ∈ [ŷ−ε, ŷ+ε]. An estimable integer-valued function is finitely computable (take any ε < 1). Note that if f is only approximable or semi-computable, we can still come arbitrarily close to f(x), but we cannot devise a terminating algorithm which produces an ε-approximation. In the case of lower/upper semi-computability we can at least finitely compute lower/upper bounds to f(x). In the case of approximability, the weakest computability form, even this capability is lost. In analogy to lower/upper semi-computability one may think of notions like lower/upper estimability, but they are easily shown to coincide with estimability. The following implications are valid:
recursive = finitely computable ⇒ estimable ⇒ enumerable = lower semi-computable ⇒ semi-computable ⇒ approximable
estimable ⇒ co-enumerable = upper semi-computable ⇒ semi-computable
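The termination argument for estimable functions can be made concrete; the following is a toy sketch in which `lower` and `upper` are hypothetical recursive sequences standing in for the lower and upper semi-computations of f (the target π and the 1/(t+1) bounds are illustrative choices, not from the paper):

```python
import math

# Sketch of the eps-approximation of an estimable f: run the lower and upper
# semi-computations until their bounds differ by less than 2*eps, then output
# the midpoint, which is guaranteed to be within eps of f(x).
def estimate(lower, upper, eps):
    t = 0
    while upper(t) - lower(t) >= 2 * eps:
        t += 1
    return (upper(t) + lower(t)) / 2  # f lies in [lower(t), upper(t)]

# Toy target f = pi, enclosed by bounds shrinking like 1/(t+1):
lo = lambda t: math.pi - 1.0 / (t + 1)
hi = lambda t: math.pi + 1.0 / (t + 1)
print(abs(estimate(lo, hi, 1e-4) - math.pi) < 1e-4)  # True
```

If f were only approximable, no such terminating loop exists: the approximants may oscillate, so no finite t certifies an ε-interval.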
In the following we use the term computable synonymously with finitely computable, but sometimes also generically for some of the computability forms of Definition 1. What we call estimable is often just called computable, but it makes sense to separate the concepts of finite computability and estimability in this work, since the former is conceptually simpler and some previous results have only been proved for that case.
M. Hutter, On the Existence and Convergence of Computable Universal Priors

3 The Universal Prior M
The prefix Kolmogorov complexity K(x) is defined as the length of the shortest binary program p ∈ {0,1}* for which a universal prefix Turing machine U (with binary program tape and X-ary output tape) outputs string x ∈ X*, and similarly K(x|y) in case of side information y [LV97]:

    K(x) := min{l(p) : U(p) = x},    K(x|y) := min{l(p) : U(p,y) = x}.
Solomonoff [Sol64,Sol78] (with a flaw fixed by Levin [ZL70]) had earlier defined the closely related quantity, the universal prior M(x). It is defined as the probability that the output of a universal Turing machine starts with x when provided with fair coin flips on the input tape. Formally, M can be defined as

    M(x) := Σ_{p : U(p)=x*} 2^{−l(p)}    (1)
where the sum is over all so-called minimal programs p for which U outputs a string starting with x (indicated by the ∗). Before we can discuss the stochastic properties of M we need the concept of (semi)measures for strings.

Definition 2 (Continuous (Semi)measures). µ(x) denotes the probability that a sequence starts with string x. We call µ ≥ 0 a (continuous) semimeasure if µ(ε) ≤ 1 and µ(x) ≥ Σ_{a∈X} µ(xa), and a (probability) measure if equality holds.

We have Σ_{a∈X} M(xa) < M(x), because there are programs p which output x not followed by any a ∈ X; they just stop after printing x, or continue forever without any further output. Together with M(ε) ≤ 1 this shows that M is a semimeasure, but not a probability measure. We can now state the fundamental property of M [Sol78]:

Theorem 1 (Universality of M). The universal prior M is an enumerable semimeasure which multiplicatively dominates all enumerable semimeasures in the sense that M(x) ≥× 2^{−K(ρ)}·ρ(x) for all enumerable semimeasures ρ. M is enumerable, but not estimable or finitely computable.

The Kolmogorov complexity of a function like ρ is defined as the length of the shortest self-delimiting code of a Turing machine computing this function in the sense of Definition 1. Up to a multiplicative constant, M assigns higher probability to all x than any other computable probability distribution. It is possible to normalize M to a true probability measure M_norm [Sol78,LV97] with dominance still being true, but at the expense of giving up enumerability (M_norm is still approximable). M is more convenient when studying algorithmic questions, but a true probability measure like M_norm is more convenient when studying stochastic questions.
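The defect in Definition 2 is easy to check mechanically; here is a minimal sketch over binary strings, with a toy "defective coin" chosen purely for illustration (the dict-based representation of µ is an assumption of the sketch):

```python
# Minimal check of Definition 2 on a toy function over binary strings,
# where mu is given as a dict from strings to probabilities.
def is_semimeasure(mu, alphabet=("0", "1")):
    """True iff mu(empty) <= 1 and mu(x) >= sum_a mu(xa) for all recorded x."""
    if mu.get("", 0.0) > 1.0:
        return False
    return all(mu[x] >= sum(mu.get(x + a, 0.0) for a in alphabet) - 1e-12
               for x in mu)

# A defective coin: each continuation keeps only 80% of the available mass,
# so strict inequality holds: mu is a semimeasure but not a measure.
mu = {"": 1.0, "0": 0.4, "1": 0.4, "00": 0.16, "01": 0.16, "10": 0.16, "11": 0.16}
print(is_semimeasure(mu))  # True
```

For a proper measure, the inequality in the check would hold with equality at every recorded x.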
4 Universal Sequence Prediction
In which sense does M incorporate Occam's razor and Epicurus' principle of multiple explanations? Since the shortest programs p dominate the sum in M, M(x) is roughly equal to 2^{−K(x)} (more precisely, M(x) = 2^{−K(x)+O(K(l(x)))}), i.e. M assigns high probability to simple strings. More useful is to think of x as the observed history. We see from (1) that every program p consistent with history x is allowed to contribute to M (Epicurus), while shorter programs give significantly larger contributions (Occam). How does all this affect prediction? If M(x) describes our (subjective) prior belief in x, then M(y|x) := M(xy)/M(x) must be our posterior belief in y. From the symmetry of algorithmic information K(xy) ≈ K(y|x) + K(x), and M(x) ≈ 2^{−K(x)} and M(xy) ≈ 2^{−K(xy)}, we get M(y|x) ≈ 2^{−K(y|x)}. This tells us that M predicts y with high probability iff y has an easy explanation given x (Occam & Epicurus).

The above qualitative discussion should not create the impression that M(x) and 2^{−K(x)} always lead to predictors of comparable quality. Indeed, in the online/incremental setting, K(y) = O(1) invalidates the consideration above. The proof of (2) below, for instance, depends on M being a semimeasure and on the chain rule being exactly true; neither is satisfied by 2^{−K(x)}. See [Hut03] for a more detailed analysis.

Sequence prediction algorithms try to predict the continuation xt ∈ X of a given sequence x1 ...x{t−1}. We assume that the true sequence is drawn from a computable probability distribution µ, i.e. the true (objective) probability of x1:t is µ(x1:t); the probability of xt given x<t hence is µ(xt|x<t). Solomonoff's central result is that the posterior of M converges to the true posterior:

    Σ_{t=1}^∞ Σ_{x<t} µ(x<t) ( M(0|x<t) − µ(0|x<t) )²  ≤  (1/2) ln 2 · K(µ) + O(1) < ∞.    (2)

The infinite sum can only be finite if the difference M(0|x<t) − µ(0|x<t) converges to zero for t → ∞ with µ-probability one.
For a mixture ξ over the enumerable semimeasures, dominance is obvious. Is ξ lower semi-computable? To answer this question one has to be more precise. Levin [ZL70] has shown that the set of all lower semi-computable semimeasures is enumerable (with repetitions). For this (ordered multi)set M = M^semi_enum := {ν1, ν2, ν3, ...} and K(νi) := K(i), one can easily see that ξ is lower semi-computable. Finally, proving M(x) ≥× ξ(x) also establishes universality of M (see [Sol78,LV97] for details; ξ(x) ≥× M(x) also holds). The advantage of ξ over M is that it immediately generalizes to arbitrary weighted sums of (semi)measures for arbitrary countable M.
5 Universal (Semi)Measures
What is so special about the set of all enumerable semimeasures M^semi_enum? The larger we choose M, the less restrictive is the assumption that M should contain the true distribution µ, which will be essential throughout the paper. Why not restrict to the still rather general class of estimable or finitely computable (semi)measures? It is clear that for every countable set M, the universal or mixture distribution

    ξ(x) := ξ_M(x) := Σ_{ν∈M} w_ν ν(x)  with  Σ_{ν∈M} w_ν ≤ 1 and w_ν > 0    (4)

dominates all ν ∈ M. This dominance is necessary for the desired convergence ξ → µ, similarly to (2). The question is what properties ξ possesses. The distinguishing property of M^semi_enum is that ξ is itself an element of M^semi_enum. When concerned with predictions, ξ_M ∈ M is not by itself an important property; what matters is whether ξ is computable in one of the senses of Definition 1. We define M1 ≥× M2 :⇔ there is an element of M1 which dominates all elements of M2:

    M1 ≥× M2  :⇔  ∃ρ ∈ M1 ∀ν ∈ M2 ∃w_ν > 0 ∀x : ρ(x) ≥ w_ν ν(x).

The relation ≥× is transitive (but not necessarily reflexive) in the sense that M1 ≥× M2 ≥× M3 implies M1 ≥× M3, and, more generally, M0 ⊇ M1 ≥× M2 ⊇ M3 implies M0 ≥× M3. For the computability concepts introduced in Section 2 we have the following proper set inclusions:

    M^msr_comp  ⊂ M^msr_est  ≡ M^msr_enum  ⊂ M^msr_appr
        ∩            ∩             ∩             ∩
    M^semi_comp ⊂ M^semi_est ⊂ M^semi_enum ⊂ M^semi_appr

where M^msr_c stands for the set of all probability measures of appropriate computability type c ∈ {comp = finitely computable, est = estimable, enum = enumerable, appr = approximable}, and similarly for semimeasures M^semi_c. From an enumeration of a measure ρ one can construct a co-enumeration by exploiting ρ(x1:n) = 1 − Σ_{y1:n ≠ x1:n} ρ(y1:n). This shows that every enumerable measure is also co-enumerable, hence estimable, which proves the identity ≡ above. With this notation, Theorem 1 implies M^semi_enum ≥× M^semi_enum. Transitivity allows us to conclude, for instance, that M^semi_appr ≥× M^msr_comp, i.e. there is an approximable semimeasure which dominates all computable measures.
The standard "diagonalization" way of proving M1 ⋡ M2 (no element of M1 dominates all of M2) is to take an arbitrary µ ∈ M1, construct a ρ which µ does not dominate, and show that ρ ∈ M2. There are 7×7 combinations of (semi)measure classes M1 and M2 for which M1 ≥× M2 could be true or false. There are four basic cases, explicated in the following theorem, from which the remaining combinations displayed in Table 3 follow by transitivity.

Theorem 2 (Universal (semi)measures). A (semi)measure ρ is said to be universal for M if it multiplicatively dominates all elements of M in the sense that ∀ν ∃w_ν > 0 : ρ(x) ≥ w_ν ν(x) ∀x. The following holds true:
o) ∃ρ : {ρ} ≥× M: For every countable set of (semi)measures M, there is a (semi)measure which dominates all elements of M.
i) M^semi_enum ≥× M^semi_enum: The class of enumerable semimeasures contains a universal element.
ii) M^msr_appr ≥× M^semi_enum: There is an approximable measure which dominates all enumerable semimeasures.
iii) M^semi_est ⋡ M^msr_comp: There is no estimable semimeasure which dominates all computable measures.
iv) M^semi_appr ⋡ M^msr_appr: There is no approximable semimeasure which dominates all approximable measures.
Table 3 (Existence of universal (semi)measures). The entry in row r and column c indicates whether there is an r-able (semi)measure ρ for the set M which contains all c-able (semi)measures, where r, c ∈ {comput., estimat., enumer., approxim.}. Enumerable measures are estimable, which is why the enum. row and column are missing in the measure case. The superscript indicates from which part of Theorem 2 the answer follows: for the bold-face entries directly, for the others using transitivity of ≥×.

    ρ \ M           M: semimeasure                  M: measure
                    comp.   est.    enum.   appr.   comp.   est.    appr.
    semi  comp.     no^iii  no^iii  no^iii  no^iv   no^iii  no^iii  no^iv
          est.      no^iii  no^iii  no^iii  no^iv   no^iii  no^iii  no^iv
          enum.     yes^i   yes^i   yes^i   no^iv   yes^i   yes^i   no^iv
          appr.     yes^i   yes^i   yes^i   no^iv   yes^i   yes^i   no^iv
    msr   comp.     no^iii  no^iii  no^iii  no^iv   no^iii  no^iii  no^iv
          est.      no^iii  no^iii  no^iii  no^iv   no^iii  no^iii  no^iv
          appr.     yes^ii  yes^ii  yes^ii  no^iv   yes^ii  yes^ii  no^iv
If we ask for a universal (semi)measure which at least satisfies the weakest form of computability, namely being approximable, we see that the largest dominated set among the seven sets defined above is the set of enumerable semimeasures. This is the reason why M^semi_enum plays a special role. On the other hand, M^semi_enum is not the largest set dominated by an approximable semimeasure, and indeed no such largest set exists. One may hence ask for "natural" larger sets M. One such set,
namely the set of cumulatively enumerable semimeasures M_CEM, has recently been discovered by Schmidhuber [Sch02], for which even ξ_CEM ∈ M_CEM holds.

Theorem 2 also holds for discrete (semi)measures P : IN → [0,1] with Σ_{x∈IN} P(x) = 1 (≤ 1 for semimeasures). Theorem 2 (i) is Levin's major result [LV97, Th.4.3.1 & Th.4.5.1], and (ii) is due to Solomonoff [Sol78]. The proof of M^semi_comp ⋡ M^msr_comp in [LV97, p249] contains minor errors and is not extensible to (iii), and the proof in [LV97, p276] only applies to infinite alphabet and not to the binary/finite case considered here.

Proof. We present proofs for binary alphabet X = {0,1} only; they naturally generalize to arbitrary finite alphabet. argmin_x f(x) is the x that minimizes f(x); ties are broken in an arbitrary but computable way (e.g. by taking the smallest x).

(o) ρ(x) := Σ_{ν∈M} w_ν ν(x) with w_ν > 0 obviously dominates all ν ∈ M (with domination constant w_ν). With Σ_ν w_ν = 1 and all ν being (semi)measures, ρ is also a (semi)measure.

(i) See [LV97, Th.4.5.1].

(ii) Let ξ be a universal element in M^semi_enum. We define [Sol78] ξ_norm(x1:n) := Π_{t=1}^n ξ(x1:t) / [ξ(x<t 0) + ξ(x<t 1)]. By induction one can show that ξ_norm is a measure which still dominates ξ, hence all enumerable semimeasures, and that ξ_norm is approximable.

(iii) Assume there were an estimable semimeasure σ dominating all computable measures. Choose some ε > 0 and finitely compute an ε-approximation σ̂ of σ(x). If σ̂ > 4ε define µ(x) := σ̂, else halve ε and repeat the process. Since σ(x) > 0 (otherwise it could not dominate, e.g., 2^{−l(x)}), the loop terminates after finite time, so µ is finitely computable. Inserting σ̂ = µ(x) and ε < (1/4)σ̂ = (1/4)µ(x) into |σ(x) − σ̂| < ε we get |σ(x) − µ(x)| < (1/4)µ(x), which implies (3/4)µ(x) ≤ σ(x) ≤ (5/4)µ(x). Unfortunately µ is not a semimeasure, but it still satisfies the weaker inequality µ(x0) + µ(x1) ≤ (4/3)[σ(x0) + σ(x1)] ≤ (4/3)σ(x) ≤ (4/3)·(5/4)µ(x) = (5/3)µ(x). This is sufficient for the first half of the proof of (iii) to go through with 1/2 replaced by (1/2)·(5/3) = 5/6 < 1, which shows that µ ⋡ M^msr_comp. But this contradicts µ ≥ (4/5)σ ≥× M^msr_comp, showing that the assumed estimable semimeasure σ does not exist, i.e. M^semi_est ⋡ M^msr_comp.
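The ε-halving loop in proof (iii) is directly executable; a toy sketch, where `approx` is a hypothetical oracle returning an ε-approximation of σ, and the particular σ is an illustrative choice:

```python
# Sketch of the construction in proof (iii): from an eps-estimator for sigma,
# finitely compute mu(x) with relative error below 1/4, by halving eps until
# the estimate clearly exceeds 4*eps.
def mu_from_sigma(approx, x, eps=1.0):
    while True:
        s_hat = approx(x, eps)
        if s_hat > 4 * eps:   # then eps < s_hat/4, so |sigma(x) - mu(x)| < mu(x)/4
            return s_hat
        eps /= 2              # terminates after finitely many steps since sigma(x) > 0

# Toy sigma with a within-eps estimator (both hypothetical stand-ins):
sigma = lambda x: 2.0 ** (-len(x) - 1)
approx = lambda x, eps: sigma(x) + eps / 2

m = mu_from_sigma(approx, "0101")
print(abs(sigma("0101") - m) < m / 4)  # True
```

The loop is exactly why σ(x) > 0 is needed: if σ(x) were 0, no ε would ever satisfy σ̂ > 4ε and the procedure would not terminate.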
(iv) Assume µ ∈ M^semi_appr dominates all of M^msr_appr. We construct an approximable measure ρ which is not dominated by µ, thus contradicting the assumption. Let µ1, µ2, ... be a sequence of recursive functions converging to µ. We recursively (in t and n) define sequences y_n^1, y_n^2, ... converging to y_n, and from them measures ρ1, ρ2, ... converging to ρ, where ρt is the deterministic measure concentrated on the sequence y_1^t y_2^t .... Let y_n^1 = 0 ∀n, and let y_n^t switch its value only if this decreases the current µt-value of the chosen prefix by more than a factor of 2/3. This hysteresis forces each y_k^t to converge for t → ∞: assume it did not for all t greater than some t0. Since y_k^t ∈ {0,1}, some value, say ỹ_k, is assumed infinitely often, so the sequence leaves and re-enters ỹ_k infinitely often. Each time ỹ_k is left (y_k^{t−1} = ỹ_k ≠ y_k^t), the µt-value drops by more than the factor 2/3, and re-entering requires the reverse jump; such oscillations by a constant factor contradict the convergence of µt(y_{1:k}^t) for t → ∞. Hence the y_n^t converge to a limit sequence y, and ρt to the measure ρ concentrated on y; ρ is approximable. Along y, the chosen bit retains at most a constant fraction of the mass of µ at each step, so µ(y1:n) decays geometrically to 0, while ρ(y1:n) = 1 for all n. Hence µ(y1:n)/ρ(y1:n) → 0, i.e. µ ⋡ ρ. Since µ ∈ M^semi_appr was arbitrary and ρ is an approximable measure, we get M^semi_appr ⋡ M^msr_appr. □
6 Posterior Convergence
We have investigated in detail the computational properties of various mixture distributions ξ. A mixture ξ_M multiplicatively dominates all distributions in M. We have mentioned that dominance implies posterior convergence. In this section we present in more detail what dominance implies and what it does not. Convergence of ξ(xt|x<t) to µ(xt|x<t) holds with µ-probability one; we also ask for which individual sequences convergence holds. We call a sequence ω µ/ξ-random iff there is a constant c_ω such that ξ(ω1:n) ≤ c_ω·µ(ω1:n) for all n, i.e. iff the randomness deficiency of ω is measured relative to the mixture ξ.
For M = M^semi_enum, µ/ξ-randomness is just µ.M.L. (Martin-Löf) randomness: by a result of Levin [Lev73], ω is µ.M.L. random iff M(ω1:n) ≤ c_ω·µ(ω1:n) ∀n (Theorem 4). The larger M, the more patterns are recognized as non-random. Roughly speaking, those regularities characterized by some ν ∈ M are recognized by µ/ξ-randomness, i.e. for M ⊂ M^semi_enum some µ/ξ-random strings may not be M.L. random. Other randomness concepts, e.g. those by Schnorr, Ko, van Lambalgen, Lutz, Kurtz, von Mises, Wald, and Church (see [Wan96,Lam87,Sch71]), could possibly also be characterized in terms of µ/ξ-randomness for particular choices of M.

A classical (non-random) real-valued sequence a_t is defined to converge to a*, short a_t → a*, if ∀ε ∃t0 ∀t ≥ t0 : |a_t − a*| < ε. We are interested in convergence properties of random sequences z_t(ω) for t → ∞, e.g. z_t(ω) = ξ(ω_t|ω<t) − µ(ω_t|ω<t). We say z_t converges to z* in mean sum (i.m.s.) if Σ_{t=1}^∞ E[(z_t − z*)²] < ∞; convergence i.m.s. implies convergence with µ-probability one.
Theorem 5 (Posterior convergence). Let ξ be a (semi)measure dominating the measure µ, i.e. ξ(x) ≥ w_µ µ(x) ∀x (e.g. the mixture (4) with µ ∈ M). Then

    Σ_{t=1}^∞ E[ Σ_{xt} ( √ξ(xt|x<t) − √µ(xt|x<t) )² ]  ≤  ln w_µ^{−1}   and   Σ_{t=1}^∞ E[ ( √(ξ(xt|x<t)/µ(xt|x<t)) − 1 )² ]  ≤  ln w_µ^{−1},
where w_µ is the weight (4) of µ in ξ. This implies

    ξ(xt|x<t) → µ(xt|x<t)  and  ξ(xt|x<t)/µ(xt|x<t) → 1  for t → ∞, both i.m.s.
The latter strengthens the previous result ξ(xt|x<t)/µ(xt|x<t) → 1 with µ-probability one [LV97], since convergence i.m.s. provides a rate and the proof below avoids martingales.

Proof. For y, z ≥ 0, elementary calculus shows that f(y,z) := y ln(y/z) − (√y − √z)² − y + z ≥ 0. Hence, for a probability vector (y_i) and a semiprobability vector (z_i) (Σ_i y_i = 1, Σ_i z_i ≤ 1), Σ_i f(y_i, z_i) ≥ 0, which implies

    Σ_i y_i ln(y_i/z_i) − Σ_i (√y_i − √z_i)²  ≥  Σ_i y_i − Σ_i z_i  ≥  1 − 1 = 0.
The (conditional) µ-expectations of a function f : X^t → IR are defined as

    E[f] := Σ_{x1:t ∈ X^t} µ(x1:t) f(x1:t)   and   E_t[f] := E[f|x<t] = Σ_{xt ∈ X} µ(xt|x<t) f(x1:t),

where the sums run over all xt or x1:t for which µ(x1:t) ≠ 0.
where sums over all xt or x1:t for which µ(x1:t ) = 0. If we insert X = {1,...,N }, N = |X |, i = xt , yi = µt := µ(xt |x
n
Taking the expectation E and the sum n
E[dt (x
t=1
n
we get
t=1
µt µ(x1:n ) µt ]] = E[ln ] = E[ln ] ≤ ln wµ−1 ξt ξt ξ(x1:n ) n
E[Et [ln
t=1
(5)
t=1
where we have used E[E_t[..]] = E[..] and exchanged the t-sum with the expectation E, which turns into a product inside the logarithm. In the last equality we have used the chain rule for µ and ξ. Using universality ξ(x1:n) ≥ w_µ µ(x1:n) yields the final inequality. Finally,

    E_t[ ( √(ξ_t/µ_t) − 1 )² ]  =  Σ_{xt} µ_t ( √(ξ_t/µ_t) − 1 )²  =  Σ_{xt} ( √ξ_t − √µ_t )²  =  d_t(x<t).

Taking the expectation E and the sum Σ_{t=1}^n, and chaining the result with (5), yields Theorem 5. □
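Theorem 5 is easy to observe numerically for the Bernoulli toy class used throughout; the following sketch (class, weights, true parameter, and sample size are all illustrative choices) tracks the mixture's posterior prediction ξ(1|x<t):

```python
import random

random.seed(0)

# Numerical illustration of posterior convergence: a Bayes mixture over a
# small class of Bernoulli parameters, with data drawn from a class member.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
post = {t: 1 / len(thetas) for t in thetas}  # uniform prior weights w_nu
true_theta = 0.7

for _ in range(2000):
    x = 1 if random.random() < true_theta else 0
    for th in thetas:                        # Bayes update: w *= nu(x)
        post[th] *= th if x == 1 else 1 - th
    z = sum(post.values())
    for th in thetas:                        # renormalize to avoid underflow
        post[th] /= z

xi_pred = sum(post[th] * th for th in thetas)  # = xi(1|x_{<t})
print(abs(xi_pred - true_theta) < 0.01)  # True
```

After 2000 observations the posterior weight concentrates on the true parameter, so the mixture prediction is numerically indistinguishable from µ(1|x<t) = 0.7, as Theorem 5 guarantees with µ-probability one.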
7 Convergence in Martin-Löf Sense
An interesting open question is whether ξ converges to µ (in difference or ratio) individually for all Martin-Löf random sequences. Clearly, convergence µ.M.L. may at most fail for a set of sequences with µ-measure zero. A convergence M.L. result would be particularly interesting and natural for Solomonoff's universal prior M, since M.L. randomness can be defined in terms of M (see Theorem 4). Attempts to convert the bounds in Theorem 5 to effective µ.M.L. randomness tests fail, since M(xt|x<t) is not enumerable, but only approximable. Vovk [Vov87] has shown that for recursive measures µ and ρ, every sequence x1:∞ which is random with respect to both satisfies

    Σ_{t=1}^∞ Σ_{xt} ( √ρ(xt|x<t) − √µ(xt|x<t) )² < ∞   and   Σ_{t=1}^∞ ( √(ρ(xt|x<t)/µ(xt|x<t)) − 1 )² < ∞.
If M were recursive, this would imply posterior M → µ and M/µ → 1 for every µ.M.L. random sequence x1:∞, since every sequence is M.M.L. random. Since M is not recursive, Vovk's theorem cannot be applied, and it is not obvious how to generalize it. So the question of individual convergence remains open. More generally, one may ask whether ξ_M → µ for every µ/ξ-random sequence. It turns out that this is true for some M, but false for others.

Theorem 6 (µ/ξ-convergence of ξ to µ). Let X = {0,1} be binary and M_Θ := {µ_θ : µ_θ(1|x<t) = θ, θ ∈ Θ} be a class of Bernoulli measures with parameters from a countable Θ ⊆ (0,1).
i) If Θ is dense in (0,1), then ξ_{M_Θ}(xt|x<t) → µ(xt|x<t) for t → ∞ for every µ ∈ M_Θ and every µ/ξ_{M_Θ}-random x1:∞.
(Footnote: The formulation of their theorem is quite misleading in general: "Let µ be a positive recursive measure. If the length of y is fixed and the length of x grows to infinity, then M(y|x)/µ(y|x) → 1 with µ-probability one. The infinite sequences ω with prefixes x satisfying the displayed asymptotics are precisely ['⇒' and '⇐'] the µ-random sequences." First, for off-sequence y, convergence w.p.1 does not hold (xy must be demanded to be a prefix of ω). Second, the proof of '⇐' has loopholes (see main text). Last, '⇒' is given without proof and is probably wrong. Also the assertion in [LV97, Th.5.2.1] that S_t := E[Σ_{xt}(M(xt|x<t) − µ(xt|x<t))²] decreases faster than 1/t cannot be made, since S_t may not decrease monotonically. For example, for a_t := 1/√t if t is a cube and 0 otherwise, we have Σ_{t=1}^∞ a_t < ∞, but a_t ≠ o(1/t).)
ii) There are Θ_G, µ ∈ M_{Θ_G}, and µ/ξ_{M_{Θ_G}}-random x1:∞ for which ξ_{M_{Θ_G}}(xt|x<t) does not converge to µ(xt|x<t). That is, there are mixtures ξ for which ξ → µ on every µ/ξ-random sequence, and mixtures ξ for which this convergence fails.

Theorem 6 can be generalized to i.i.d. sequences over general finite alphabet X. The idea to prove (ii) is to construct a sequence x1:∞ which is µ_{θ0}/ξ-random and µ_{θ1}/ξ-random for θ0 ≠ θ1. This is possible if and only if Θ contains a gap and θ0 and θ1 are the boundaries of the gap. Obviously ξ cannot converge to both θ0 and θ1, thus proving non-convergence. For no θ ∈ [0,1] will this x1:∞ be µ_θ M.L. random. Finally, the proof of Theorem 6 makes essential use of the mixture representation of ξ, as opposed to the proof of Theorem 5, which only needs dominance ξ ≥× µ.

Proof. Let X = {0,1} and M = {µ_θ : θ ∈ Θ} with countable Θ ⊂ [0,1] and µ_θ(1|x<t) = θ = 1 − µ_θ(0|x<t), which implies µ_θ(x1:n) = θ^{n1}(1−θ)^{n−n1}, n1 := x1 + ... + xn, θ̂ ≡ θ̂_n := n1/n. θ̂ depends on n; all other used/defined θ will be independent of n. ξ is defined in the standard way as

    ξ(x1:n) = Σ_{θ∈Θ} w_θ µ_θ(x1:n)   ⇒   ξ(x1:n) ≥ w_θ µ_θ(x1:n),    (6)

where Σ_θ w_θ = 1 and w_θ > 0 ∀θ. In the following let µ = µ_{θ0} ∈ M be the true environment. ω = x1:∞ is µ/ξ-random

    ⇔   ∃c_ω : ξ(x1:n) ≤ c_ω · µ_{θ0}(x1:n) ∀n.    (7)
For binary alphabet it is sufficient to establish whether ξ(1|x1:n) → θ0 ≡ µ(1|x1:n) (n → ∞) for µ/ξ-random x1:∞ in order to decide ξ(xn|x<n) → µ(xn|x<n). We can represent ξ(1|x1:n) as

    ξ(1|x1:n) = Σ_{θ∈Θ} w_{nθ} µ_θ(1|x1:n),   w_{nθ} := w_θ µ_θ(x1:n)/ξ(x1:n) ≤ (w_θ/w_{θ0}) · µ_θ(x1:n)/µ_{θ0}(x1:n),   Σ_{θ∈Θ} w_{nθ} = 1.    (8)
The ratio µ_θ/µ_{θ0} can be represented as follows:

    µ_θ(x1:n)/µ_{θ0}(x1:n) = e^{n[D(θ̂_n||θ0) − D(θ̂_n||θ)]},   where   D(θ̂||θ) := θ̂ ln(θ̂/θ) + (1−θ̂) ln((1−θ̂)/(1−θ))    (9)

is the relative entropy between θ̂ and θ, which is continuous in θ̂ and θ, and is 0 if and only if θ̂ = θ. We also need the following implication for sets Ω ⊆ Θ:

    if w_{nθ} ≤ w_θ g_θ(n) → 0 (n → ∞) and g_θ(n) ≤ c ∀θ ∈ Ω, then Σ_{θ∈Ω} w_{nθ} µ_θ(1|x1:n) ≤ Σ_{θ∈Ω} w_{nθ} → 0,    (10)
which follows from the boundedness Σ_θ w_θ ≤ 1 and µ_θ ≤ 1. We now prove Theorem 6. We leave the special considerations necessary when 0,1 ∈ Θ to the reader and assume henceforth 0,1 ∉ Θ.

(i) Let Θ be a countable dense subset of (0,1) and x1:∞ be µ/ξ-random. Using (6) and (7) in (9), for θ ∈ Θ to be determined later, we can bound

    e^{n[D(θ̂_n||θ0) − D(θ̂_n||θ)]} = µ_θ(x1:n)/µ_{θ0}(x1:n) ≤ c_ω/w_θ =: c < ∞.    (11)

Let us assume that θ̂ ≡ θ̂_n ↛ θ0. Then there exists a cluster point θ̃ ≠ θ0 of the sequence θ̂_n, i.e. θ̂_n is infinitely often in an ε-neighborhood of θ̃, e.g. D(θ̂_n||θ̃) ≤ ε for infinitely many n (θ̃ ∈ [0,1] may be outside Θ). Since θ̃ ≠ θ0, θ̂_n must be "far" away from θ0 infinitely often: e.g. for ε = (1/4)(θ̃−θ0)², using D(θ̂||θ̃) + D(θ̂||θ0) ≥ (θ̃−θ0)², we get D(θ̂||θ0) ≥ 3ε. We now choose θ ∈ Θ so near to θ̃ that |D(θ̂||θ) − D(θ̂||θ̃)| ≤ ε (here we use denseness of Θ). Chaining all inequalities we get D(θ̂||θ0) − D(θ̂||θ) ≥ 3ε − ε − ε = ε > 0. Together with (11) this implies e^{nε} ≤ c for infinitely many n, which is impossible. Hence the assumption θ̂_n ↛ θ0 was wrong.

Now, θ̂_n → θ0 implies that for arbitrary θ ≠ θ0, θ ∈ Θ, there exists δ_θ > 0 such that, for sufficiently large n, D(θ̂_n||θ) ≥ 2δ_θ (since D(θ0||θ) > 0) and D(θ̂_n||θ0) ≤ δ_θ. This implies

    w_{nθ} ≤ (w_θ/w_{θ0}) e^{n[D(θ̂_n||θ0) − D(θ̂_n||θ)]} ≤ (w_θ/w_{θ0}) e^{−nδ_θ} → 0 (n → ∞),

where we have used (8) and (9) in the first inequality; the second inequality holds for sufficiently large n. Hence Σ_{θ≠θ0} w_{nθ} → 0 by (10), and w_{nθ0} → 1 by normalization (8), which finally gives

    ξ(1|x1:n) = w_{nθ0} µ_{θ0}(1|x1:n) + Σ_{θ≠θ0} w_{nθ} µ_θ(1|x1:n) → µ_{θ0}(1|x1:n)  (n → ∞).
θ=θ0 0 (ii) We first consider the case Θ ={θ0 ,θ1 }: Let us choose θ¯ (=ln( 1−θ )/ln( θθ10 1−θ1 potentially ∈ Θ) in the (KL) middle of θ0 and θ1 such that
¯ 0 ) = D(θ||θ ¯ 1 ), D(θ||θ n1 n
1−θ0 ), 1−θ1
0 < θ0 < θ¯ < θ1 < 1,
¯ ≤ 1 satisfies |θˆn − θ| and choose x1:∞ such that θˆn := n ˆ ¯ ˆ θ| ¯ ∀ θ,θ, ˆ θ¯ ∈ [θ0 ,θ1 ] (c = ln θ1 (1−θ0 ) Using |D(θ||θ)−D( θ||θ)| ≤ c|θ− θ0 (1−θ1 )
(12) n→∞
¯ (⇒ θˆn −→ θ) < ∞) twice in (9) we
get
ˆ ˆ ¯ ˆ ¯ ¯ ˆ ¯ µθ1 (x1:n ) = en[D(θn ||θ0 )−D(θn ||θ1 )] ≤ en[D(θ||θ0 )+c|θn −θ|−D(θ||θ1 )+c|θn −θ|] ≤ e2c µθ0 (x1:n )
(13)
where we have used (12) in the last inequality. Now, (13) and (8) lead to wnθ0 = wθ0
µθ0 (x1:n ) wθ wθ µθ (x1:n ) −1 = [1 + 1 1 ] ≥ [1 + 1 e2c ]−1 =: c0 > 0, ξ(x1:n ) wθ0 µθ0 (x1:n ) wθ0
(14)
which shows that x1:∞ is µθ0 /ξ-random by (7). Exchanging θ0 ↔ θ1 in (13) and (14) we similarly get wnθ1 ≥ c1 > 0, which implies (using wnθ0 +wnθ1 = 1) ξ(1|x1:n ) =
wnθ µθ (1|x1:n ) = wnθ0 ·θ0 + wnθ1 ·θ1 = θ0 = µθ0 (1|x1:n ).
(15)
θ∈{θ0 ,θ1 } n→∞
This shows ξ(1|x1:n ) −→ µ(1|x1:n ). For general Θ with gap in the sense that there exist 0 < θ0 < θ1 < 1 with [θ0 ,θ1 ] ∩ Θ = {θ0 ,θ1 } one can show that all θ = θ0 ,θ1 give asymptotically no contribution to ξ(1|x1:n ), i.e. (15) still applies. 2
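The "KL middle" θ̄ and property (12) can be checked numerically; a short sketch (gap boundaries 0.2 and 0.7 are an illustrative choice, and D is re-derived from (9)):

```python
import math

# theta_bar from the proof of Theorem 6(ii): the parameter equidistant from
# theta0 and theta1 in relative entropy, D(theta_bar||theta0) = D(theta_bar||theta1).
def kl_middle(t0, t1):
    return math.log((1 - t0) / (1 - t1)) / math.log(t1 * (1 - t0) / (t0 * (1 - t1)))

def D(p, q):  # Bernoulli relative entropy, Eq. (9)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

t0, t1 = 0.2, 0.7
tb = kl_middle(t0, t1)
print(t0 < tb < t1)                          # theta_bar lies strictly between
print(abs(D(tb, t0) - D(tb, t1)) < 1e-9)     # equality (12) holds
```

A sequence whose relative frequency tracks θ̄ is then random with respect to both µ_{θ0} and µ_{θ1}, which is exactly what makes the mixture prediction oscillate instead of converge.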
8 Conclusions
For a hierarchy of four computability definitions, we completed the classification of the existence of computable (semi)measures dominating all computable (semi)measures. Dominance is an important property of a prior, since it implies rapid convergence of the corresponding posterior with probability one. A strengthening would be convergence for all Martin-Löf (M.L.) random sequences. This seems natural, since M.L. randomness can be defined in terms of Solomonoff's prior M, so there is a close connection. Contrary to what was believed before, the question of posterior convergence M/µ → 1 for all M.L. random sequences is still open. We introduced a new flexible notion of µ/ξ-randomness which contains Martin-Löf randomness as a special case. Though this notion may have a wider range of application, the main purpose of its introduction was to show that standard proof attempts of M/µ → 1 for all M.L. random sequences, based on dominance only, must fail. This follows from the derived result that the validity of ξ/µ → 1 for µ/ξ-random sequences depends on the Bayes mixture ξ.
References

[Doo53] J. L. Doob. Stochastic Processes. John Wiley & Sons, New York, 1953.
[Hut01] M. Hutter. Convergence and error bounds of universal prediction for general alphabet. In Proceedings of the 12th European Conference on Machine Learning (ECML-2001), pages 239–250, 2001.
[Hut03] M. Hutter. Sequence prediction based on monotone complexity. In Proceedings of the 16th Conference on Computational Learning Theory (COLT-2003), 2003.
[Lam87] M. van Lambalgen. Random Sequences. PhD thesis, University of Amsterdam, 1987.
[Lev73] L. A. Levin. On the notion of a random sequence. Soviet Math. Dokl., 14(5):1413–1416, 1973.
[LV97] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 2nd edition, 1997.
[Sch71] C. P. Schnorr. Zufälligkeit und Wahrscheinlichkeit. Springer, Berlin, 1971.
[Sch02] J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. International Journal of Foundations of Computer Science, 13(4):587–612, 2002.
[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1–22 and 224–254, 1964.
[Sol78] R. J. Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE Transactions on Information Theory, IT-24:422–432, 1978.
[VL00] P. M. B. Vitányi and M. Li. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory, 46(2):446–464, 2000.
[Vov87] V. G. Vovk. On a randomness criterion. Soviet Mathematics Doklady, 35(3):656–660, 1987.
[Wan96] Y. Wang. Randomness and Complexity. PhD thesis, University of Heidelberg, 1996.
[ZL70] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys, 25(6):83–124, 1970.
Author Index

Arpe, Jan 99
Balbach, Frank 84
Case, John 234
Cristianini, Nello 175
De Bie, Tijl 175
Eiter, Thomas 1
Gammerman, Alex 283
Higuera, Colin de la 247
Hutter, Marcus 298
Jain, Sanjay 234
Kitagawa, Genshiro 3
Lange, Steffen 129
Lee, Jianguo 159
Martin, Eric 54
Matsumoto, Satoshi 114, 144
Miyahara, Tetsuhiro 114, 144
Momma, Michinari 175
Nouretdinov, Ilia 259, 283
Oncina, Jose 247
Ratsaby, Joel 205
Reischuk, Rüdiger 99, 234
Sato, Masako 69
Schuurmans, Dale 190
Sharma, Arun 54
Shoudai, Takayoshi 114, 144
Šíma, Jiří 221
Stephan, Frank 54, 234
Suzuki, Yusuke 114, 144
Takano, Akihiko 15
Tishby, Naftali 16
Uchida, Tomoyuki 114, 144
Uemura, Jin 69
V'yugin, Vladimir 283
Vovk, Vladimir 259, 268
Wang, Jingdong 159
Wang, Shaojun 190
Zeugmann, Thomas 17, 234
Zhang, Changshui 159
Zilles, Sandra 39, 129