Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6339
José M. Sempere Pedro García (Eds.)
Grammatical Inference: Theoretical Results and Applications 10th International Colloquium, ICGI 2010 Valencia, Spain, September 13-16, 2010 Proceedings
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors
José M. Sempere
Universidad Politécnica de Valencia
Departamento de Sistemas Informáticos y Computación
Camino de Vera s/n, 46022 Valencia, Spain
E-mail: [email protected]

Pedro García
Universidad Politécnica de Valencia
Departamento de Sistemas Informáticos y Computación
Camino de Vera s/n, 46022 Valencia, Spain
E-mail: [email protected]
Library of Congress Control Number: 2010933123
CR Subject Classification (1998): I.2, F.1, I.4, I.5, J.3, H.3
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-642-15487-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-15487-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
The first edition of the International Colloquium on Grammatical Inference (ICGI) was held in Essex (United Kingdom) in 1993. After the success of this meeting there have been eight more editions that have been hosted by different academic institutions across the world: Alicante (Spain, 1994), Montpellier (France, 1996), Ames, Iowa (USA, 1998), Lisbon (Portugal, 2000), Amsterdam (The Netherlands, 2002), Athens (Greece, 2004), Tokyo (Japan, 2006) and Saint-Malo (France, 2008). ICGI 2010 was held in Valencia (Spain) during September 13–16. It was organized by the Research Group on Formal Language Theory, Computability and Complexity from the Technical University of Valencia.

This was the tenth edition of ICGI, which is a nice number for celebrations. Ten editions is a sign of good health for any conference. In the case of Grammatical Inference, it means that the topics, problems and applications of this research area are alive and serve as a good framework to study related aspects of artificial intelligence, natural language processing, formal language theory, computability and complexity, bioinformatics, pattern recognition, etc.

Every work proposed to the conference received two reviews and was discussed locally among the members of the Program Committee (PC). This volume contains the texts of 32 papers presented at ICGI 2010. They are divided into two groups of works. There are 18 regular papers (out of 25) and 14 short papers (11 out of 15, and three regular papers proposed as short ones). The topics of the papers range from theoretical results about the learning of different formal language classes (regular, context-free, context-sensitive, etc.) to application papers on bioinformatics, language modelling, software engineering, etc. In addition, there are two invited lectures delivered by distinguished scientists on the following topics:

– Simon Lucas (University of Essex, UK): Grammatical Inference and Games
– David B. Searls (University of Pennsylvania, USA): Molecules, Languages, and Automata

In this edition, for the first time, there was a Best Student Paper Award to motivate young researchers in this area to continue their research work. The award was given to Franco Luque for his paper "Bounding the Maximal Parsing Performance of Non-Terminally Separated Grammars."

The first day of the conference hosted four tutorial talks given by prominent scientists of the area on different aspects of grammatical inference. We are grateful to the tutorial lecturers for the brilliant talks: Tim Oates (with Sourav Mukherjee), Colin de la Higuera, François Coste, and Damián López (with Pedro García).

We would like to thank the many people who contributed to the success of ICGI 2010. First of all, we are grateful to the members of the Steering Committee
that supported our proposal to organize the conference. It was very exciting to organize ICGI 2010 given that some members of the Local Organizing Committee were involved in the organization of ICGI 1994. We are very grateful to the members of the PC for their time and effort in carrying out the reviewing process. The help and the experience that they provided were invaluable, and the suggestions that they proposed to improve different aspects of the conference were brilliant. Thanks are given to the external reviewers that helped the PC members during the review process: Kengo Sato, Manuel Vázquez de Parga and Damián López. The joint effort of these people ensured the quality of the works presented in this volume.

The success of the conference was possible due to the work of the Local Organizing Committee. We especially thank the effort and work made by Damián López, who was involved in many aspects of the conference. In addition, we received the support of the Centre for Innovation, Research and Technology Transfer (CTT) and the Continuous Training Centre (CFP) of the Technical University of Valencia. We are grateful to the people of these institutions for helping us to carry out different aspects of the organization of the conference.

Last, but not least, we are grateful to the sponsors of the conference: The PASCAL2 Network of Excellence, the Spanish Ministry of Science and Innovation, BANCAJA, and the Technical University of Valencia together with the Department of Information Systems and Computation and the School of Engineering in Computer Science.

We hope to celebrate the next ten editions of ICGI. We are sure that it will have a brilliant and exciting future in this research area that tries to identify and solve many interesting problems before the limit.

June 2010
José M. Sempere
Pedro García
Conference Organization

Program Chair
José M. Sempere, Universidad Politécnica de Valencia, Spain

Program Committee
Pieter Adriaans, Universiteit van Amsterdam, The Netherlands
Dana Angluin, Yale University, USA
Jean-Marc Champarnaud, Université de Rouen, France
Alexander Clark, Royal Holloway University of London, UK
François Coste, INRIA, France
Colin de la Higuera, Université de Nantes - LINA, France
François Denis, Université de Provence, France
Henning Fernau, Universität Trier, Germany
Pedro García, Universidad Politécnica de Valencia, Spain
Makoto Kanazawa, National Institute of Informatics, Japan
Satoshi Kobayashi, University of Electro-Communications, Japan
Laurent Miclet, ENSSAT-Lannion, France
Tim Oates, University of Maryland Baltimore County, USA
Arlindo Oliveira, Lisbon Technical University, Portugal
Jose Oncina, Universidad de Alicante, Spain
Georgios Paliouras, Institute of Informatics and Telecommunications, Greece
Yasubumi Sakakibara, Keio University, Japan
Etsuji Tomita, University of Electro-Communications, Japan
Menno van Zaanen, Tilburg University, The Netherlands
Ryo Yoshinaka, Japan Science and Technology Agency, Japan
Sheng Yu, The University of Western Ontario, Canada
Thomas Zeugmann, Hokkaido University, Japan

Local Organization
All members are from the Universidad Politécnica de Valencia, Spain:
Marcelino Campos, Antonio Cano, Damián López, Alfonso Muñoz-Pomer, Piedachu Peris and Manuel Vázquez de Parga

Sponsoring Institutions
The PASCAL2 Network of Excellence
Ministerio de Ciencia e Innovación (Spain)
Universidad Politécnica de Valencia (UPV)
Department of Information Systems and Computation (DSIC, UPV)
School of Engineering in Computer Science (ETSINF, UPV)
BANCAJA
Table of Contents

Invited Talks
Grammatical Inference and Games: Extended Abstract (Simon M. Lucas), 1
Molecules, Languages and Automata (David B. Searls), 5

Regular Papers
Inferring Regular Trace Languages from Positive and Negative Samples (Antonio Cano Gómez), 11
Distributional Learning of Some Context-Free Languages with a Minimally Adequate Teacher (Alexander Clark), 24
Learning Context Free Grammars with the Syntactic Concept Lattice (Alexander Clark), 38
Learning Automata Teams (Pedro García, Manuel Vázquez de Parga, Damián López, and José Ruiz), 52
Exact DFA Identification Using SAT Solvers (Marijn J.H. Heule and Sicco Verwer), 66
Learning Deterministic Finite Automata from Interleaved Strings (Joshua Jones and Tim Oates), 80
Learning Regular Expressions from Representative Examples and Membership Queries (Efim Kinber), 94
Splitting of Learnable Classes (Hongyang Li and Frank Stephan), 109
PAC-Learning Unambiguous k,l-NTS≤ Languages (Franco M. Luque and Gabriel Infante-Lopez), 122
Bounding the Maximal Parsing Performance of Non-Terminally Separated Grammars (Franco M. Luque and Gabriel Infante-Lopez), 135
CGE: A Sequential Learning Algorithm for Mealy Automata (Karl Meinke), 148
Using Grammar Induction to Model Adaptive Behavior of Networks of Collaborative Agents (Wico Mulder and Pieter Adriaans), 163
Transducer Inference by Assembling Specific Languages (Piedachu Peris and Damián López), 178
Sequences Classification by Least General Generalisations (Frédéric Tantini, Alain Terlutte, and Fabien Torre), 189
A Likelihood-Ratio Test for Identifying Probabilistic Deterministic Real-Time Automata from Positive Data (Sicco Verwer, Mathijs de Weerdt, and Cees Witteveen), 203
A Local Search Algorithm for Grammatical Inference (Wojciech Wieczorek), 217
Polynomial-Time Identification of Multiple Context-Free Languages from Positive Data and Membership Queries (Ryo Yoshinaka), 230
Grammatical Inference as Class Discrimination (Menno van Zaanen and Tanja Gaustad), 245

Short Papers
MDL in the Limit (Pieter Adriaans and Wico Mulder), 258
Grammatical Inference Algorithms in MATLAB (Hasan Ibne Akram, Colin de la Higuera, Huang Xiao, and Claudia Eckert), 262
A Non-deterministic Grammar Inference Algorithm Applied to the Cleavage Site Prediction Problem in Bioinformatics (Gloria Inés Alvarez, Jorge Hernán Victoria, Enrique Bravo, and Pedro García), 267
Learning PDFA with Asynchronous Transitions (Borja Balle, Jorge Castro, and Ricard Gavaldà), 271
Grammar Inference Technology Applications in Software Engineering (Barrett R. Bryant, Marjan Mernik, Dejan Hrnčič, Faizan Javed, Qichao Liu, and Alan Sprague), 276
Hölder Norms and a Hierarchy Theorem for Parameterized Classes of CCG (Christophe Costa Florêncio and Henning Fernau), 280
Learning of Church-Rosser Tree Rewriting Systems (M. Jayasrirani, D.G. Thomas, Atulya K. Nagar, and T. Robinson), 284
Generalizing over Several Learning Settings (Anna Kasprzik), 288
Rademacher Complexity and Grammar Induction Algorithms: What It May (Not) Tell Us (Sophia Katrenko and Menno van Zaanen), 293
Extracting Shallow Paraphrasing Schemata from Modern Greek Text Using Statistical Significance Testing and Supervised Learning (Katia Lida Kermanidis), 297
Learning Subclasses of Parallel Communicating Grammar Systems (Sindhu J. Kumaar, P.J. Abisha, and D.G. Thomas), 301
Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages (Herman Stehouwer and Menno van Zaanen), 305
Learning Fuzzy Context-Free Grammar—A Preliminary Report (Olgierd Unold), 309
Polynomial Time Identification of Strict Prefix Deterministic Finite State Transducers (Mitsuo Wakatsuki and Etsuji Tomita), 313

Author Index, 317
Grammatical Inference and Games: Extended Abstract Simon M. Lucas School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
[email protected]
1
Introduction
This paper discusses the potential synergy between research in grammatical inference and research in artificial intelligence applied to games. There are two aspects to this: the potential as a rich source of challenging and engaging test problems, and the potential for real applications. Grammatical Inference (GI) addresses the problem of learning a model for recognising, interpreting, generating or transducing data structures. Learning may proceed based on samples of the structures or via access to a simulator or oracle with which the learner can interact by asking questions or running experiments. In the majority of GI research the data structures are labelled strings, and the most successful GI algorithms infer finite state automata, or their stochastic counterparts such as N-Gram models, or hidden Markov models. We now consider some different types of grammatical inference, and the application of those types to particular problems in AI and Games.
2
Sequence Recognition
A common application of GI is to sequence recognition. The aim of the learning phase is to infer a sequence recognition model which is then used for classification. Real-world problems tend to be noisy, and recognition of real-world sequences is usually best performed by stochastic models. The type of GI that works best for these applications is often based on relatively simple statistical models, such as n-gram models or hidden Markov models. A significant application in computer games is the so-called "bot-detection" problem. Massively Multiplayer Online Games often involve the players acquiring valuable assets, and this acquisition process may involve a significant amount of tedious labour on behalf of the player. An alternative is for the player to spend real-world money to acquire such assets. Typically the in-game assets can be bought either with real money or with virtual game money (hence there is an exchange rate between the two). Unscrupulous players may use bots to do the tedious work needed to acquire the assets, which can then be sold to generate real-world revenue. The use of bots has a detrimental effect on the game play. People play on-line games to play against other people, and bots are typically
less fun to play against, partly because bots lack the flexible intelligence that people take for granted. There are two main possible approaches to bot detection: active detection and passive detection. Active detection involves modifying the game to introduce tests which are specifically hard for bots to pass, such as CAPTCHA style tests. These are highly effective, but rather disruptive to the game play. The passive approach is to try to identify behaviour that would be unnatural for human players, based on some statistical measures of the observed actions of the player. An example of this uses the trajectories of the players’ avatars (the in-game characters controlled by the players) to compare against typical bot trajectories [1]. Given the vast amount of player data available this would make an interesting challenge for statistical GI methods, such as those that have been reported in previous ICGI conferences.
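To make the statistical flavour of such a passive detector concrete, here is a minimal sketch in Python of one plausible approach: two add-one-smoothed bigram models over discretised avatar movements, one fitted to human logs and one to bot logs, with classification by log-likelihood ratio. The movement encoding, the training sequences and the decision threshold are hypothetical illustrations invented for this sketch; they are not taken from [1] or from any deployed detection system.

from collections import defaultdict
import math

def train_bigram(sequences, alphabet):
    """Add-one-smoothed bigram model over discretised actions (e.g. movement directions)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        model[prev] = {cur: (counts[prev][cur] + 1) / total for cur in alphabet}
    return model

def log_likelihood(model, seq):
    return sum(math.log(model[p][c]) for p, c in zip(seq, seq[1:]))

# Hypothetical discretised avatar trajectories: N/E/S/W movement symbols.
human_logs = ["NNEENWWS", "ENNNWSSE", "NWNWNEES"]
bot_logs   = ["NENENENE", "NENENENW", "ENENENEN"]
alphabet   = "NESW"

human_model = train_bigram(human_logs, alphabet)
bot_model   = train_bigram(bot_logs, alphabet)

def looks_like_bot(seq, margin=0.0):
    return log_likelihood(bot_model, seq) - log_likelihood(human_model, seq) > margin

print(looks_like_bot("NENENENE"))  # True on this toy data
print(looks_like_bot("NNEENWWS"))  # False on this toy data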
3
Learning Finite State Machines
Finite state automata have been among the most widely studied models within the GI community and have been the subject of some interesting competitions such as the Abbadingo One DFA induction competition [2] and the Gecco 2005 DFA from noisy samples competition1 . State machines are also the most widely used architecture for controlling the non-player characters (NPCs) in video games. The state machines used in video games are typically more complex than the ones used in GI research. In particular, the states represent actions that the character may execute continuously until an event or a condition being satisfied triggers a transition to a new state. Hence the complete representation of a state machine controller goes beyond a transition matrix and set of labels, and includes some decision logic to trigger the transitions, and perhaps also procedural code to map high-level actions into low-level actions. Finite state machines have proven to be useful for encoding relatively simple behaviours but the main limitation is that they do not scale well to more complex problems. For this reason more sophisticated NPC control architectures such as hierarchical state machines, planning systems and behaviour trees are being developed and applied, and grammatical inference research could benefit from trying to learn richer models of this type. This would have the potential to reduce game development costs by realising a programming-by-example model. The idea would be for the game designers to provide sample behaviours for the non-player characters using standard game controllers, and have the system learn an underlying finite state machine able to reproduce the desired behaviour. The learning of finite state machines has also been studied from the perspective of learning to play games, such as the Resource Protection Game [3]. The challenge here was to learn a strategy encoded as a finite state machine, where the objective for the player is to capture grid cells by visiting them before the opponent does, given only local information about the neighbouring grid cells. By placing the finite state machine induction problem within the context of 1
http://cswww.essex.ac.uk/staff/sml/gecco/NoisyDFA.html
game playing, it becomes an even more challenging problem than the more conventional GI problem of trying to learn a model from a fixed sample of data, or with reference to an oracle, since now the learner must also attempt to solve an even harder credit assignment problem. Actions taken early in the game may lead to success or to failure, but this also depends on the actions taken by the opponent. Over the years the grammatical inference community has run several competitions that go beyond learning DFA from samples of data, such as context-free grammar learning (Omphalos, ICGI 2004 [4]), learning models of machine translation (Tenjinno, ICGI 2006), and the active learning of DFA in the minimum number of queries to an oracle (Zulu, ICGI 2010). An interesting future competition would be the learning of finite-state (or other) game controllers either from game logs or by embedding the learning agent directly in the game, giving it control over its learning experience.
4
Semantic Language Learning
Most of the work on grammatical inference involves learning only the syntax of language, but it is well understood that children learn language within a rich semantic and pragmatic context. Feldman [5] describes how computational modelling of language acquisition can be extended and applied to grammatical inference within a semantic context. Orkin and Roy [6] devised a relatively simple on-line game called the Restaurant Game, with part of the motivation being to test how well a system would be able to learn to behave in realistic ways using a plan network built from the observed interactions of human users playing the game. To play the game, users play either as a customer or a waitress, and click actions while typing free text to fill in the details, with the aim of completing a successful dining transaction. This is of interest to grammatical inference in several ways. The system learned a plan network from the game logs of over 5,000 games. The plan network consists of a set of action nodes together with arcs showing which nodes follow other nodes. Each action node is defined by its name (e.g. Pickup), its requirements (e.g. actor=customer and object=menu), the local-world pre-conditions (e.g. actor sitting on chair, menu on table), and the effects of taking the action (e.g. customer has menu). The learning algorithm was able to infer plan networks from the game logs using clustering and statistical n-gram methods, and the inferred networks were able to rate the degree to which a particular game log was representative of typical restaurant behaviour.
5
Grammatical Inference and Opponent Modelling
In order to provide some simple yet interesting examples of game-log analysis, results will be reported on some problems related to playing Ms. Pac-Man. This is a classic arcade game requiring great skill in order to achieve high scores. The best human players can score over 900,000 after many hours of play. The ghosts in Ms Pac-Man were programmed to provide the player with a fun experience,
and they do not play optimally. Part of the control logic of the ghosts is a finite state machine. Expert players are able to make good predictions about the next moves of the ghosts, and by making such predictions are able to escape from apparently impossible situations. The challenge here for GI methods is to infer finite state machines and hence perform ghost behaviour prediction. This can be done either passively, by studying the game logs of any players, or, for potentially higher-performance learning, by an active learner embedded in the game that deliberately attempts to reach states of the game in which it is likely to learn most about the ghost behaviours.
6
Conclusion
The overall conclusion of this paper is that there is a significant overlap in some of the fundamental problems and architectures used in grammatical inference and in games. Now that games have superb graphics and increasingly realistic physics, the next frontier is improving the game AI. Grammatical inference has the potential to contribute to this, but to make a convincing impact, it will need to deal with the richer control models used in game AI. The talk will discuss these ideas in more detail, and describe some on-going experiments by the author.
References
1. Chen, K.-T., Liao, A., Pao, H.-K.K., Chu, H.-H.: Game bot detection based on avatar trajectory. In: Stevens, S.M., Saldamarco, S.J. (eds.) ICEC 2008. LNCS, vol. 5309, pp. 94–105. Springer, Heidelberg (2008)
2. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 1–12. Springer, Heidelberg (1998)
3. Spears, W.M., Gordon, D.F.: Evolution of strategies for resource protection problems. In: Advances in Evolutionary Computing: Theory and Applications, pp. 367–392. Springer, Heidelberg (2000)
4. Starkie, B., Coste, F., van Zaanen, M.: The Omphalos context-free grammar learning competition. In: Paliouras, G., Sakakibara, Y. (eds.) ICGI 2004. LNCS (LNAI), vol. 3264, pp. 16–27. Springer, Heidelberg (2004)
5. Feldman, J.A.: Real language learning. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 114–125. Springer, Heidelberg (1998)
6. Orkin, J., Roy, D.: The restaurant game: Learning social behavior and language from thousands of players online. Journal of Game Development 3(1), 39–60 (2007)
Molecules, Languages and Automata David B. Searls Lower Gwynedd, PA 19454, USA
Abstract. Molecular biology is full of linguistic metaphors, from the language of DNA to the genome as “book of life.” Certainly the organization of genes and other functional modules along the DNA sequence invites a syntactic view, which can be seen in certain tools used in bioinformatics such as hidden Markov models. It has also been shown that folding of RNA structures is neatly expressed by grammars that require expressive power beyond context-free, an approach that has even been extended to the much more complex structures of proteins. Processive enzymes and other “molecular machines” can also be cast in terms of automata. This paper briefly reviews linguistic approaches to molecular biology, and provides perspectives on potential future applications of grammars and automata in this field.
1
Introduction
The terminology of molecular biology from a very early point adopted linguistic and cryptologic tropes, but it was not until some two decades ago that serious attempts were made to apply formal language theory in this field. These included efforts to model both the syntactic structure of genes, reflecting their hierarchical organization, and the physical structure of nucleic acids such as DNA and RNA, where grammars proved suitable for representing folding patterns in an abstract manner. In the meantime, it was also recognized that automata theory could be a basis for representing some of the key string algorithms used in the analysis of macromolecular sequences. These varied approaches to molecular biology are all bound together by formal language theory, and its close relationship to automata theory. In reviewing these approaches, and discussing how they may be extended in new directions within biology, we hope to demonstrate the power of grammars as a uniform, computer-readable, executable specification language for biological knowledge.
2
Structural Grammars
Nucleic acids are polymers of four bases, and are thus naturally modeled as languages over the corresponding alphabets, which for DNA comprise the well-known set Σ = {a, c, g, t}. (RNA bases are slightly different, but for all practical purposes can be treated the same.) DNA, which carries the genetic information in our chromosomes, tends to form itself into a double helix with two strands
that are held together by a complementary pairing of the opposing bases, 'a' with 't' and 'g' with 'c'. RNA molecules are more often single-stranded, though they can fold back on themselves to form regions of double-stranded structure, called secondary structure. Given these bare facts, the language of all possible RNA molecules is specified by the following trivial grammar (with S the start symbol and ε the empty string, as usual):

S → xS | ε    for each x ∈ {a, c, g, u}    (1)
It would seem to be natural to specify DNA helices, which comprise two strands, as a pair of strings, and a DNA language as a set of such pairs. However, DNA has an additional important constraint, in fact two: first, the bases opposing one another are complementary, and second, the strings have directionality (which is chemically recognizable), with the base-paired strands running in opposite directions. We have previously shown how this sort of model can be extended to describe a number of specific phenomena in RNA secondary structure, and in fact a simple addition to the stem-and-loop grammar allows for arbitrarily branching secondary structures:

S → xSx̄ | SS | ε    where ḡ = c, c̄ = g, ā = t, t̄ = a    (2)
Examples of such branching secondary structure include a cloverleaf form such as is found in transfer RNA or tRNA, an important adaptor molecule in the translation of genetic information from messenger RNA to protein. The language of (2) describes what is called orthodox secondary structure, which for our purposes can be considered to be all fully base-paired structures describable by context-free grammars. There are, however, secondary structures that are beyond context-free, the archetype of which are the so-called pseudoknots. Pseudoknots can be conceived as a pair of stem-loop structures, one of whose loops constitutes one side of the other's stem. The corresponding (idealized) language is of the form uvūᴿv̄ᴿ, which cannot be expressed by any context-free grammar. It is sometimes described as the intersection of two context-free palindromic languages of the forms uvūᴿ and vūᴿv̄ᴿ, but of course context-free languages are not closed under intersection. Pseudoknots and other non-context-free elements of the language of secondary structure can be easily captured with context-sensitive grammars, but the resulting complex movements of nonterminals in sentential forms tend not to enlighten. Rather, grammars with more structured rules, such as Tree-Adjoining Grammars (TAG), have been more profitably used for this purpose [6].
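Membership in the language generated by grammar (2) can be decided with a standard interval dynamic program in the CYK style: a substring derives S if it is empty, if its outer bases are complementary and the interior derives S, or if it splits into two substrings that both derive S. The Python sketch below is only an illustration of that recurrence, using the a–t, g–c pairing of (2); it is not a fragment of any RNA-folding tool.

from functools import lru_cache

# Complementarity as in grammar (2): g-c and a-t (u is treated like t in the text).
COMPLEMENT = {"g": "c", "c": "g", "a": "t", "t": "a"}

def derives_orthodox(s):
    """True iff s is generated by S -> x S x_bar | S S | epsilon (grammar (2))."""
    @lru_cache(maxsize=None)
    def derivable(i, j):
        if i == j:                                        # S -> epsilon
            return True
        if (j - i >= 2 and COMPLEMENT.get(s[i]) == s[j - 1]
                and derivable(i + 1, j - 1)):             # S -> x S x_bar
            return True
        return any(derivable(i, k) and derivable(k, j)
                   for k in range(i + 1, j))              # S -> S S
    return derivable(0, len(s))

print(derives_orthodox("gaattc"))   # True: a fully paired stem
print(derives_orthodox("gcat"))     # True: two adjacent stems (gc)(at)
print(derives_orthodox("gca"))      # False: cannot be fully paired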
Proteins are more complex macromolecular structures with several kinds of intermolecular interactions. Some of the basic types of such recurrent structural themes have been described with a variety of grammars [6].
3
Gene Grammars
Genes, which are encoded in the DNA of organisms, have a hierarchical organization to them that is determined by the process by which they are converted into proteins (for the most part). Genes are first transcribed into messenger RNA, or mRNA, which constitutes a complementary copy of the gene, and then this is translated into protein. The latter step requires the DNA/RNA code to be adapted to that of proteins, whose alphabet comprises the twenty amino acids. This encoding is called the genetic code, which appears as a table of triplets of bases mapped to amino acids. Transcription itself involves a number of complications regarding the structure of genes, such as the fact that the actual coding sequence is interrupted by segments that are spliced out at an intermediate step, establishing what is called the intron/exon structure of the gene. In addition there are many signal sequences embedded in the gene, including in flanking non-coding regions, that determine such things as the starting point of transcription, the conditions under which transcription will occur, and the points at which splicing will occur. The author has demonstrated how grammars can effectively capture all these features of genes, including ambiguities such as alternative splicing whereby different versions of genes may arise from the same genome sequence [1]. Such grammars have been used to recognize the presence of genes in raw sequence data by means of parsing, in what amounts to an application of syntactic pattern recognition [2]. (Modern ‘gene-finders’, however, use highly customized algorithms for efficiency, though the most effective of these do capture the syntactic structure of the standard gene model.) As the variety of genes and related features (such as immunoglobulin superfamily genes and microRNA) and their higher-level organization in genomes continues to grow more complex, grammars may yet prove to be a superior means to formally specify knowledge about such structure.
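As a toy illustration of gene recognition as parsing, the Python sketch below scans raw sequence for open reading frames, that is, a start codon followed by whole codons and an in-frame stop codon, which form a simple (regular) sublanguage of real gene structure. It is my own minimal example of syntactic pattern recognition on DNA, far simpler than the gene grammars and gene-finders discussed above, which must also handle introns, splice signals and flanking regions.

STOPS = {"taa", "tag", "tga"}

def find_orfs(dna):
    """Return (start, end) spans of open reading frames: atg ... in-frame stop codon."""
    spans = []
    for i in range(len(dna) - 2):
        if dna[i:i + 3] != "atg":
            continue
        for j in range(i + 3, len(dna) - 2, 3):   # step codon by codon
            if dna[j:j + 3] in STOPS:
                spans.append((i, j + 3))
                break
    return spans

print(find_orfs("ccatggcttaatag"))   # [(2, 11)] : atg gct taa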
4
Genetic Grammars
Gregor Mendel laid the foundation for modern genetics by asserting a model for the inheritance of traits based on a parsimonious set of postulates. While many modifications have been required to account for a wider and wider set of observations, the basic framework has proven robust. Many mathematical and computational formalizations of these postulates and their sequelae have been developed, which support such activities as pedigree analysis and genetic mapping. The author has been developing a grammar-based specification of Mendelian genetics which is able to depict the basic processes of gamete formation, segregation of alleles, zygote formation, and phenotypic expression within a uniform
framework representing genetic crosses [unpublished]. With this basic 'Mendelian grammar,' extensions are possible that account in a natural way for various known mechanisms for modification of segregation ratios, linkage, crossing-over, interference, and so forth. One possible use of this formalism is as a means to frame certain types of analysis as a form of grammar inference. For example, mapping of genes to linkage groups and ordering linkage groups can be seen as finding an optimal structure of an underlying grammar so as to fit experimental data. Especially intriguing is the possibility of combining the genetic analysis with phenotypic grammars in a single framework, for example in the genetic dissection of pathways.
5
Molecular Machines
Enzymes and other biochemical structures such as ribosomes are sometimes called 'molecular machines' because they perform repetitive chemical and/or mechanical operations on other molecules. In particular, a large class of such objects process nucleic acids in various ways, many of them by attaching to and moving along the DNA or RNA in what is termed processive fashion. This immediately brings to mind computational automata which perform operations on tapes. Since automata have their analogues in grammars, it is natural to ask whether grammars can model enzymes that act on DNA or RNA. In fact the trivial right-recursive grammar that we showed at the outset (1) can be considered a model for terminal transferase, an enzyme that synthesizes DNA by attaching bases to a growing chain, as in this derivation:

S ⇒ cS ⇒ ctS ⇒ ctcS ⇒ ctcaS ⇒ ctcaaS ⇒ ctcaagS ⇒ ctcaag

We can view the nonterminal S as the molecular machine, the terminal transferase itself, laying down bases sequentially and then departing into solution. Similarly, we can envision a context-sensitive grammar that models an exonuclease, an enzyme that degrades nucleic acids a base at a time from one or the other end. The orientation is important, because exonucleases are specific for which end they chew on, and therefore whether they run in the forward or reverse direction on the strand:

F x → F | ε    forward exonuclease
x R → R | ε    reverse exonuclease    (3)
These can produce derivations such as the following, with the nonterminals again physically mimicking the action of the corresponding enzymes:

F gcaa ⇒ F caa ⇒ F aa ⇒ F a ⇒ F ⇒ ε
atggacR ⇒ atggaR ⇒ atggR ⇒ atgR ⇒ atR ⇒ at
In the first derivation, the F completely digests the nucleic acid strand and then itself disappears, via the ε disjunct, into solution as it were. On the other hand, in the second example we show the R exonuclease departing without completing the job, which mirrors the biological fact that enzymes can show greater or lesser propensity to hop on or off the nucleic acid spontaneously. We could model the tendency to continue the recursion (known as an enzyme's processivity) with a stochastic grammar, where probabilities attached to rules would establish the half-lives of the actions of the biological processes. The author's most recent efforts [unpublished] have been to catalogue a wide range of grammars describing the activities of enzymes acting on nucleic acids in various circumstances. This requires the extension of the model to double-stranded DNA, as well as the ability to act on more than one double-stranded molecule at once. With the employment of stochastic grammars, it appears possible to specify a wide variety of biochemical details of molecular machines.
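As a concrete reading of that remark, the short Python simulation below attaches a single continuation probability to the chewing rule of the forward exonuclease of grammar (3), so that the number of bases removed before the enzyme detaches is geometrically distributed. The probability value and the strand are invented for illustration; they are not measured parameters of any real enzyme.

import random

def forward_exonuclease(strand, p_continue, rng=random):
    """Apply Fx -> F with probability p_continue, otherwise F -> epsilon (enzyme detaches)."""
    remaining = list(strand)
    while remaining and rng.random() < p_continue:
        remaining.pop(0)          # chew one base from the end where F sits in grammar (3)
    return "".join(remaining)

random.seed(0)
strand = "atggactacgt"
digests = [forward_exonuclease(strand, p_continue=0.8) for _ in range(5)]
print(digests)   # a sample of partially digested strands; lengths vary from run to run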
6
Edit Grammars
Another view of the movement of nonterminals is as a means to perform editing operations on strings. As in the case for processive enzymes, we view a nonterminal as a 'machine' that begins at the left end of an input string and processes to the right end, leaving an altered string as output:

Sx →0 xS    identity (x ∈ Σ)
Sy →1 xS    substitution (x ≠ y)
S →1 xS     insertion
Sx →1 S     deletion
To frame this input/output process in a more standard fashion, one can simply assert a new starting nonterminal S′, a rule S′ → Swτ where w ∈ Σ∗ is the input string and τ is a new terminal marker not in the language, and an absorbing rule Sτ → ε that is guaranteed to complete any derivation and leave only the output string. Note that the deletion rule is not strictly context-sensitive (its left side being longer than its right side), and the insertion rule can generate any string whatever as output. The numbers above the arrows here represent the cost of applying the corresponding edit rule. An overall derivation would again move the S nonterminal from the beginning of an input string to the end, leaving the output to its left, as follows:

Sgact ⇒0 gSact ⇒1 gtSct ⇒1 gtcSt ⇒2 gtcgSt ⇒2 gtcgtS

Here the numbers above the double arrows represent the cumulative cost of the derivation. The rules applied are an identity (for no cost), a substitution of a 't' for an 'a' (adding a cost of 1), another identity, an insertion of a 'g' (adding a cost of 1), and an identity.
Minimal edit distances are typically calculated with dynamic programming algorithms that are O(nm) in the lengths of the strings being compared. The same order of results can be obtained with the appropriate table-based parsers for grammars such as that above, though perhaps with the sacrifice of some efficiency for the sake of generality. The great advantage of the parsing approach is that grammars and their cognate automata make it possible to describe more complex models of string edits, and therefore of processes related to molecular evolution. The author has recast a number of the algorithms developed for such purposes in the form of automata, which can be shown to be equivalent to the recurrence relations typically used to specify such algorithms [4].
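For concreteness, the sketch below computes the minimal cumulative cost of such a derivation with the usual O(nm) dynamic program, using the unit costs of the edit grammar above (0 for identity, 1 for substitution, insertion and deletion). It is the textbook recurrence written in Python for illustration, not the author's grammar-based parser.

def edit_cost(source, target):
    """Minimal cumulative cost of rewriting `source` into `target` with the edit rules above."""
    n, m = len(source), len(target)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                                           # delete remaining input
    for j in range(1, m + 1):
        cost[0][j] = j                                           # insert remaining output
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if source[i - 1] == target[j - 1] else 1   # identity / substitution
            cost[i][j] = min(cost[i - 1][j - 1] + match,
                             cost[i - 1][j] + 1,                 # deletion
                             cost[i][j - 1] + 1)                 # insertion
    return cost[n][m]

print(edit_cost("gact", "gtcgt"))   # 2, matching the derivation shown above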
References
1. Searls, D.B.: The linguistics of DNA. Am. Sci. 80, 579–591 (1992)
2. Dong, S., Searls, D.B.: Gene structure prediction by linguistic methods. Genomics 23, 540–551 (1994)
3. Searls, D.B.: String Variable Grammar: a logic grammar formalism for DNA sequences. J. Logic Prog. 24, 73–102 (1995)
4. Searls, D.B.: Formal language theory and biological macromolecules. In: Farach-Colton, M., Roberts, F.S., Vingron, M., Waterman, M. (eds.) Mathematical Support for Molecular Biology, pp. 117–140. American Mathematical Society, Providence (1999)
5. Searls, D.B.: The language of genes. Nature 420, 211–217 (2002)
6. Chiang, D., Joshi, A.K., Searls, D.B.: Grammatical representations of macromolecular structure. J. Comp. Biol. 13, 1077–1100 (2006)
Inferring Regular Trace Languages from Positive and Negative Samples Antonio Cano Gómez Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain
[email protected]
Abstract. In this work, we give an algorithm that infers Regular Trace Languages. Trace languages can be seen as regular languages that are closed under a partial commutation relation called the independence relation. This algorithm is similar to the RPNI algorithm, but it is based on Asynchronous Cellular Automata. For this purpose, we define Asynchronous Cellular Moore Machines and implement the merge operation as the calculation of an equivalence relation. After presenting the algorithm we provide a proof of its convergence (which is more complicated than the proof of convergence of the RPNI because there are no Minimal Automata for Asynchronous Automata), and we discuss the complexity of the algorithm.
1
Introduction
This work presents an algorithm that infers Regular Trace Languages. Traces were first introduced by Mazurkiewicz [7] to describe the behavior of concurrent systems. The main idea of traces is to consider that each letter of a given alphabet represents a process. When two processes in a concurrent system can be executed simultaneously, they are considered to be independent, so it does not matter which letter is written first in the word that represents the concurrent system. Mazurkiewicz's theory of traces has developed very rapidly since its introduction [13,9]. In Grammatical Inference, the inference of finite automata has been a central subject [3,6,8,10]. One of the most popular algorithms is RPNI [8], which has led to many other algorithms that have attempted to improve it. Another option for improving the efficiency of the RPNI algorithm is to work not with all Regular Languages but with some subclasses of Regular Languages. The FCRPNI algorithm [1] was created for this purpose. It is based on RPNI, but the merging of states is not allowed if the resulting automaton does not belong to the corresponding subclass (in other words, if it contains a forbidden configuration for that class).
Work supported by the project Técnicas de Inferencia Gramatical y aplicación al procesamiento de biosecuencias (TIN2007-60769), supported by the Spanish Ministry of Education and Sciences.
In [4], another idea was introduced: to define a new kind of automaton for a given subclass and apply the ideas of RPNI to that new kind of automaton. In [4], that idea was applied to Commutative Regular Languages, and the results for the efficiency and complexity of the algorithm were very good. The problem with this algorithm is that Commutative Regular Languages are a very small subclass of Regular Languages. This is where the inference of Regular Trace Languages might be useful. Regular Trace Languages can be viewed as Regular Languages that are closed under an independence relation that states which letters of the alphabet can commute. For instance, if we take equality as the independence relation, we obtain Regular Languages. However, if we take the relation that relates every letter of the alphabet as the independence relation, we obtain Commutative Regular Languages (an overview of subclasses of regular languages that are closed under independence relations can be found in [2]). The aim of our work is to present an algorithm for the inference of Regular Trace Languages, prove its convergence, and analyze its complexity.

In Section 2, we present the definitions of the main concepts of Trace Theory and Grammatical Inference that will be used in this paper. In Section 3, we introduce the concept of Asynchronous Automata, which are used to recognize Regular Trace Languages. Specifically, we focus on a special kind of Asynchronous Automata called Asynchronous Cellular Automata; we present their formal definition and provide some definitions that are useful for the following sections. In Section 4, we adapt Asynchronous Cellular Automata to Moore Machines and present the corresponding definitions and results. In Section 5, we define a version of RPNI that is based on equivalence relations on an Asynchronous Cellular Moore Machine and that can be adapted to an Asynchronous Cellular Automaton. In Section 6, we study our main algorithm. In Section 7, we study the convergence of this algorithm. The proof of convergence is not a simple adaptation of the convergence proof of the RPNI algorithm, since there are several minimal Asynchronous Cellular Automata for a given trace language; therefore, we need to use the lexicographical order to determine which of the irreducible automata the algorithm converges to. In Section 8, we discuss the general complexity of the algorithm, and in Section 9, we present the conclusions of our work and give an overview of possible further work.
2
Preliminaries
Let Σ be a finite alphabet, whose elements are called letters. We denote the set of all words over Σ by Σ∗. Formally, Σ∗ with the concatenation operation forms the free monoid with the set of generators Σ. The empty word, denoted by λ, plays the role of unit element. Given a set S, we denote the set of subsets of S by P(S). Given two sets S and T, we denote the complement of S by S̄, the union of S and T by S ∪ T, the intersection of S and T by S ∩ T, and the difference of S and T by S \ T = S ∩ T̄.
For any word x of Σ∗, |x| denotes the length of x, and |x|a denotes the number of occurrences of a letter a in x. Alph(x) denotes the set of all letters appearing in x. Given words p, x of Σ∗, we say that p is a prefix of x if and only if there exists a word y of Σ∗ such that x = py. Given a word x of Σ∗, we define Pref(x) = {p ∈ Σ∗ | p is a prefix of x}. Given a word x of Σ∗ and a letter a ∈ Σ, we define Prefa(x) = {p ∈ Σ∗ | p is a prefix of x and the last letter of p is a} = {λ} ∪ (Pref(x) ∩ Σ∗a). We extend these last two concepts to languages as usual: given a language L ⊆ Σ∗ and a letter a ∈ Σ, we define Pref(L) = ∪x∈L Pref(x) and Prefa(L) = ∪x∈L Prefa(x). Given a total order < on Σ, we can define a lexicographical order <lex on Σ∗ in the usual way; for a non-empty language L ⊆ Σ∗, minlex(L) denotes the lexicographically smallest word of L. An independence relation I over Σ is a symmetric and irreflexive relation I ⊆ Σ × Σ; its complement D = (Σ × Σ) \ I is the dependence relation, and for a letter a ∈ Σ we write D(a) = {b ∈ Σ | (a, b) ∈ D}. The relation ∼I is the least congruence on Σ∗ such that ab ∼I ba for every (a, b) ∈ I; two words are therefore equivalent when one can be obtained from the other by repeatedly exchanging adjacent independent letters.
For a language L ⊆ Σ∗, [L]I is the set of words that are equivalent to some word x ∈ L; hence, [L]I = {y ∈ Σ∗ | y ∼I x for some x ∈ L}. A trace language is any subset T ⊆ Σ∗/∼I. Given a language L ⊆ Σ∗, we identify L with a trace language if L = [L]I.
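As an illustration of these definitions, the Python sketch below computes the equivalence class [x]I of a word by closing {x} under swaps of adjacent independent letters, and uses it to test whether a finite language L satisfies L = [L]I. It is written only to make the definition concrete; a trace may have exponentially many linearisations, so this explicit enumeration is not efficient.

from itertools import chain

def trace_class(word, independence):
    """[word]_I : close {word} under swaps of adjacent independent letters (worklist search)."""
    seen = {word}
    frontier = [word]
    while frontier:
        w = frontier.pop()
        for i in range(len(w) - 1):
            if (w[i], w[i + 1]) in independence:
                swapped = w[:i] + w[i + 1] + w[i] + w[i + 2:]
                if swapped not in seen:
                    seen.add(swapped)
                    frontier.append(swapped)
    return seen

def is_trace_language(language, independence):
    """Check L = [L]_I for a finite language L (the empty string stands for λ)."""
    closure = set(chain.from_iterable(trace_class(w, independence) for w in language))
    return closure == set(language)

I = {("b", "c"), ("c", "b")}                     # the independence relation of Example 3.1
print(trace_class("abc", I))                     # {'abc', 'acb'}
print(is_trace_language({"", "abc", "acb"}, I))  # True
print(is_trace_language({"", "abc"}, I))         # False: acb is missing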
3
Asynchronous Automaton
An asynchronous automaton A has a distributed finite state control such that independent actions may be performed in parallel. The set of global states is modeled as a Cartesian product Q = ∏i∈J Qi, where Qi is the set of states of the local component i ∈ J and J is a finite set of indexes. There are many different kinds of Asynchronous Automata in the literature, and all of them try to simulate the independence relation by restrictions on how the transition function may behave. Even though there is an entire theory about this subject, we focus on a certain type of Asynchronous Automata called Asynchronous Cellular Automata.

Definition 1. A Cellular Asynchronous Automaton is defined as A = (Σ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F) where:
– for any a ∈ Σ, Qa is a set of states, and the set of global states is denoted by Q = ∏a∈Σ Qa (for any state q ∈ Q, we denote the a-th projection of q by (q)a, i.e., if q = (qa)a∈Σ then (q)a = qa for any a ∈ Σ);
– for any a ∈ Σ, δa is a (possibly partial) local transition function from ∏b∈D(a) Qb to Qa. The local transition functions (δa)a∈Σ give rise to a partially defined transition function on global states,

δ : (∏a∈Σ Qa) × Σ → ∏a∈Σ Qa,

where δ((pb)b∈Σ, a) = (qb)b∈Σ is defined if and only if δa((pb)b∈D(a)) is defined; in this case, for any b ∈ Σ,

qb = δa((pc)c∈D(a)) if b = a, and qb = pb otherwise;

– q0 ∈ Q is the initial global state;
– F ⊆ Q is the set of final global states.

Here we give an example of a Cellular Asynchronous Automaton.

Example 3.1. Let us consider the alphabet Σ = {a, b, c} and the independence relation I = {(b, c), (c, b)}. We consider the regular language L = (a(bc + cb))∗; we can consider it as a trace language since [L]I = L. We now define a Cellular Asynchronous Automaton recognizing L: A = (Σ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F) where:
– Qa = {q0a, q1a}, Qb = {q0b, q1b} and Qc = {q0c, q1c};
– δa(q0a, q0b, q0c) = q1a and δa(q1a, q1b, q1c) = q0a;
  δb(q1a, q0b) = q1b and δb(q1a, q1b) = q0b;
  δc(q1a, q0c) = q1c and δc(q1a, q1c) = q0c;
– q0 = (q0a, q0b, q0c);
– F = {(q0a, q0b, q0c), (q1a, q1b, q1c)}.
The general automaton represented by A is given in Figure 1.

Fig. 1. General states of the Cellular Asynchronous Automaton A of Example 3.1
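A small sketch, in Python, of how the local functions of Example 3.1 induce the partial global transition function (anticipating Definitions 2 and 3 below): each letter updates only its own component, reading the components of its dependent letters. The dictionary encoding of states is my own; only the transitions listed above are defined, so any word that needs a missing entry is rejected.

# Dependence sets for Example 3.1, where I = {(b,c),(c,b)}.
D = {"a": ("a", "b", "c"), "b": ("a", "b"), "c": ("a", "c")}

delta = {   # local transition tables: letter -> {tuple of dependent components: new local state}
    "a": {("q0a", "q0b", "q0c"): "q1a", ("q1a", "q1b", "q1c"): "q0a"},
    "b": {("q1a", "q0b"): "q1b", ("q1a", "q1b"): "q0b"},
    "c": {("q1a", "q0c"): "q1c", ("q1a", "q1c"): "q0c"},
}

q0 = {"a": "q0a", "b": "q0b", "c": "q0c"}
F = [{"a": "q0a", "b": "q0b", "c": "q0c"}, {"a": "q1a", "b": "q1b", "c": "q1c"}]

def step(state, letter):
    """Global delta(state, letter): defined only if the local delta_letter is defined."""
    key = tuple(state[x] for x in D[letter])
    if key not in delta[letter]:
        return None
    new_state = dict(state)
    new_state[letter] = delta[letter][key]
    return new_state

def accepts(word):
    state = q0
    for letter in word:
        state = step(state, letter)
        if state is None:
            return False
    return state in F

print(accepts("abc"), accepts("acb"), accepts("cba"))  # True True False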
From these definitions, we can extend the transition function to words and then define the language accepted by a Cellular Asynchronous Automaton.

Definition 2. Given a Cellular Asynchronous Automaton A = (Σ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F) and a global state q ∈ ∏a∈Σ Qa, we extend the global transition function to words by setting δ(q, λ) = q and δ(q, x) = δ(δ(q, y), a), where x = ya.

Definition 3. Given a Cellular Asynchronous Automaton A = (Σ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F), we define the language accepted by A as L(A) = {x ∈ Σ∗ | δ(q0, x) ∈ F}.

As shown in [5], there is not just a single minimal Cellular Asynchronous Automaton for a given trace language; there are many of them. In this article, we define a similar concept
that is close to minimality, this concept is the concept of Irreducible Cellular Asynchronous Automata. In Section 5 (Definition 10), we present the definition of this concept by means of some congruences on Cellular Asynhchronous Automata that can also be defined for the Cellular Asynchronous Moore Machine as well. In order to prove convergence of our algorithm, we define an order between Asynchronous Cellular Automata based on lexicographical order. To do this, we need the following definitions. Definition 4. Given an irreducible Asynchronous Cellular Automaton A = (Σ , I , (Qa )a∈Σ , (δa )a∈Σ , q0 , F ), where Q and δ are the global set of states and transition function, respectively, given a state q ∈ Q, we define the minimal path of q as minpath (q) = minlex ({x ∈ Σ ∗ | δ(q0 , x) = q}). We can extend this definition to the projection of the general set of states on the letter states. Definition 5. Given an irreducible Asynchronous Cellular Automaton A = (Σ , I , (Qa )a∈Σ , (δa )a∈Σ , q0 , F ), where Q and δ are the global set of states and transition function, respectively, given a state qa ∈ Qa , we define the minimal path of qa as minpath (qa ) = minlex ({P refa (x) | x = minpath (qa ) with q ∈ Q and (q)a = qa }). Given these definitions, we are now ready to present our order on automata. Definition 6. Given two Cellular Asynchronous Automata denoted as A1 = (Σ , I , (Q1a )a∈Σ , (δa1 )a∈Σ , q01 , F 1 ) and A2 = (Σ , I , (Q2a )a∈Σ , (δa2 )a∈Σ , q02 , F 2 ). We say that A1
{minpath (qa )},
a∈Σqa ∈Q2a
we have minlex (S1 \S2 )
4
Cellular Asynchronous Moore Machine
Here we introduce the concept of a Cellular Asynchronous Moore Machine, which is needed for the definition of the main algorithm.

Definition 7. A Cellular Asynchronous Moore Machine is defined as M = (Σ, Γ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ) where:
– (Σ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F) is a Cellular Asynchronous Automaton;
– Φ is a function from Q = ∏a∈Σ Qa into Γ.
Here we define a special Cellular Asynchronous Moore Machine that is the analogue of the prefix tree acceptor.

Definition 8. Given two sets of samples D+ and D− over the alphabet Σ, we define the Asynchronous Tree Prefix Acceptor (ATPA) as ATPA(Σ, I, D+, D−) = (Σ, Γ, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ), where:
– Q = ∏a∈Σ Qa is the set of general states;
– Qa = Prefa(D+ ∪ D−) for every a ∈ Σ;
– for every a ∈ Σ and every tuple of states (qb)b∈D(a) with qb ∈ Qb, δa((qb)b∈D(a)) = xa if and only if xa = minlex({x ∈ Qa | for every b ∈ D(a), Prefb(x) = qb});
– q0 = (λ)a∈Σ;
– for every q ∈ Q,
  Φ(q) = + if there exists x ∈ D+ such that δ(q0, x) = q;
  Φ(q) = − if there exists x ∈ D− such that δ(q0, x) = q;
  Φ(q) = ? otherwise.
5
Equivalence Relations
In this section, we discuss different kinds of equivalence relations on the states of Cellular Asynchronous Moore Machines. Given a Cellular Asynchronous Moore Machine M = (Σ, Γ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ), we can define an equivalence relation letter-wise, as follows. Suppose that for any a ∈ Σ we have an equivalence relation ∼a on Qa. We define an equivalent Σ-relation ∼ = (∼a)a∈Σ on the general set of states ∏a∈Σ Qa by setting (pa)a∈Σ ∼ (qa)a∈Σ if and only if pa ∼a qa for every a ∈ Σ.

Some properties of Σ-relations on Cellular Asynchronous Moore Machines will be useful later in this work. Given a Cellular Asynchronous Moore Machine M = (Σ, Γ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ) and an equivalence Σ-relation ∼ = (∼a)a∈Σ, we say that two states p = (pa)a∈Σ and q = (qa)a∈Σ are congruent on ∼b, for b ∈ Σ, if p ∼b q implies δ(p, b) ∼b δ(q, b). We say that two states p, q ∈ Q are congruent on ∼ if they are congruent on ∼b for every b ∈ Σ; consequently, we say that two states p, q ∈ Q are incongruent if they are not congruent on ∼b for some b ∈ Σ. We say that M is congruent on ∼ if and only if, for every p, q ∈ Q with p ∼ q, p and q are congruent. The next proposition is a well-known fact.

Proposition 1. For any Cellular Asynchronous Moore Machine M = (Σ, Γ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ), the family of identity relations (=Qa)a∈Σ is a congruent equivalent Σ-relation.
Given a Cellular Asynchronous Moore Machine M = (Σ, Γ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ), a Σ-relation ∼, and a subalphabet Θ ⊆ Γ, we say that M is Θ-consistent with ∼ if and only if for any p, q ∈ ∏a∈Σ Qa with p ∼ q, Φ(p) ∈ Θ, Φ(q) ∈ Θ, or Φ(p) = Φ(q).

Given an Asynchronous Moore Machine M = (Σ, Γ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ), an equivalent Σ-relation ∼ on M, and a subalphabet Θ ⊆ Γ, we present Algorithm 1 (Join) to find the minimal congruent Σ-relation ∼′ such that ∼ ⊆ ∼′ and M is Θ-consistent with ∼′, if such a relation exists, and to return an error otherwise.

Join
Input: Σ: alphabet; I: independence relation; M = (Σ, Γ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F, Φ): a Moore Machine; ∼ = (∼a)a∈Σ: an equivalent Σ-relation on M; Θ ⊆ Γ: a subalphabet
Output: the minimal congruent Σ-relation ∼′ such that ∼ ⊆ ∼′ and M is Θ-consistent with ∼′, if it exists; error otherwise
while ∼ is not congruent do
    foreach (b, p = (pa)a∈Σ, q = (qa)a∈Σ) with p ∼ q that is incongruent on ∼b do
        ∼b = ∼b ∪ {(δb((pc)c∈D(b)), δb((qc)c∈D(b)))};
        if ∼ is inconsistent then
            Return Error;
        end
    end
end
Return ∼
Algorithm 1. Join algorithm used by T race − RP N I algorithm Example 5.2. Let Σ = {a, b, c} and I = {(a, b), (b, a), (b, c), (c, b)}. If we take as examples D+ = {abc} and D− = {cba}, when constructing the AP T A(D+ , D− ), we obtain the Moore Machine M1 = (Σ, I, (Qa )a∈Σ , (δa )a∈Σ , q0 , Φ) where, – Qa = {λ, a, cba}, Qb = {λ, ab} and Qc = {λ, c, abc} – • δa (λ, λ) = a and δa (λ, c) = cba • δb (λ) = ab • δc (λ, λ) = c and δc (a, λ) = abc – q0 = (q0a , q0b , q0c ) – Φ(a, ab, abc) = +, Φ(cba, ab, c) = −, and Φ(q) =? for any other state q. M1 is shown in Figure 2. After the algorithm, we obtain the Moore Machine M2 = (Σ , I , (Qa )a∈Σ , (δa )a∈Σ , q0 , Φ ) where, – Qa = {{λ, a, cba}}, Qb = {{λ, ab}} and Qc = {{λ}, {c, abc}} – • δa ({λ, a, cba}, {λ}) = {λ, a, cba} and δa ({λ, a, cba}, {c, abc}) = { λ , a , cba} • δb ({λ, ab}) = {λ, ab} • δc ({λ, a, cba}, {λ}) = {c, abc} and δc ({λ, a, cba}, {c, abc}) = {c, abc} • – q0 = (q0a , q0b , q0c )
Inferring Regular Trace Languages from Positive and Negative Samples
19
a, λ, abc c
b
a, λ, λ
a, ab, abc c
b a
a, ab, λ a
λ, λ, λ
b
λ, ab, λ c
c
λ, ab, c a
b λ, λ, c
cba, ab, c a
b cba, λ, c
Fig. 2. General States of the AP T A(D+ , D− ) for D+ = {abc} and D− = {cba}, where ab = ba and cb = bc on the independence relation with Φ(a, ab, abc) = +, Φ(cba, ab, c) = −, and Φ is equals ? for any other state
{λ, a, cba}, {λ, ab}, {λ}
c
{λ, a, cba}, {λ, ab}, {c, abc}
a, b
a, b, c
Fig. 3. General States of the Asynchronous Cellular Automaton resulting from applying T race − RP N I algorithm to the sample D+ = {abc} and D− = {cba}, where ab = ba and cb = bc on the independence relation, with {λ, a, cba}, {λ, ab}, {λ} ∈ F
– Φ′(a, ab, abc) = +, Φ′(cba, ab, c) = −, and Φ′(q) = ? for any other state q.
The automaton resulting from M2 is shown in Figure 3.
Definition 9. Given an Asynchronous Cellular Automaton A = (Σ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F), we can obtain the associated Asynchronous Cellular Machine M(A) = (Σ, {0, 1}, I, (Qa)a∈Σ, (δa)a∈Σ, q0, Φ), where Φ(q) = 1 if q ∈ F, and Φ(q) = 0 if q ∉ F.
Using this definition, we can state a notion similar to that of the minimal automaton.
Definition 10. An Asynchronous Cellular Automaton A = (Σ, I, (Qa)a∈Σ, (δa)a∈Σ, q0, F) is irreducible if and only if the identity Σ-relation (=Qa)a∈Σ is the maximal equivalence Σ-relation congruent with M(A).
6
Main Algorithm
In this section, we define the main algorithm of the paper (Algorithm 2). This algorithm tries to emulate the RPNI algorithm. Its behaviour is similar to that of the Com-RPNI algorithm [4], but in this case the state merging order is quite different and difficult to define, and it is strongly based on the lexicographical order. This is why we define the APTA so close to the sample. The use of the Join algorithm (Algorithm 1) is similar to [4] as well. In this work, the definition of Join is based on congruences. This definition is less simple than the usual one but, because of the complexity of working with Asynchronous Automata, it is necessary.
Trace-RPNI
Input: Σ: alphabet, I: independence relation, D+: positive samples, D−: negative samples
Output: M: a Cellular Asynchronous Moore Machine consistent with D+ and D−
M(A) = (Σ, {0, 1}, I, (Qa)a∈Σ, (δa)a∈Σ, q0, Φ) ;
∼ ← (=Qa)a∈Σ ;
foreach x ∈ ⋃a∈Σ Qa in lexicographical order do
    set a such that x ∈ Qa ;
    foreach y ∈ Qa with y
Algorithm 2. Trace-RPNI algorithm
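To make the merging scheme concrete, the following Python sketch (mine, not taken from the paper) implements the RPNI-style loop in a simplified setting: it builds an ordinary prefix-tree Moore machine over words and merges states in length-lexicographic order, using a Join-like routine that closes merges under the transitions and rejects any merge that would identify a + state with a − state. The per-letter state sets (Qa)a∈Σ and the independence relation I are deliberately left out, and the helper names are illustrative, not the paper's.

```python
# Simplified word-level analogue of the Trace-RPNI merge loop (a sketch only).

def build_apta(positive, negative):
    """Prefix-tree Moore machine: states are prefixes, outputs in {+, -, ?}."""
    states, delta, out = {""}, {}, {}
    for w, label in [(w, "+") for w in positive] + [(w, "-") for w in negative]:
        for i in range(len(w)):
            states.add(w[: i + 1])
            delta[(w[:i], w[i])] = w[: i + 1]
        out[w] = label
    return states, delta, {q: out.get(q, "?") for q in states}

def join(block_of, delta, out, p, q):
    """Merge the blocks of p and q, closing under transitions (congruence).
    Return the new partition, or None if a '+' block would meet a '-' block."""
    block_of = dict(block_of)
    pending = [(p, q)]
    while pending:
        p, q = pending.pop()
        bp, bq = block_of[p], block_of[q]
        if bp == bq:
            continue
        labels = {out[s] for s in block_of if block_of[s] in (bp, bq)} - {"?"}
        if len(labels) > 1:
            return None                       # inconsistent merge
        for s in block_of:                    # fold bq into bp
            if block_of[s] == bq:
                block_of[s] = bp
        for (s, a), t in delta.items():       # close under the transitions
            for (s2, a2), t2 in delta.items():
                if a == a2 and block_of[s] == block_of[s2] and block_of[t] != block_of[t2]:
                    pending.append((t, t2))
    return block_of

def trace_rpni(positive, negative):
    states, delta, out = build_apta(positive, negative)
    order = sorted(states, key=lambda s: (len(s), s))   # length-lexicographic
    block_of = {q: q for q in states}
    for i, q in enumerate(order):
        if block_of[q] != q:
            continue                          # q already merged away
        for r in order[:i]:
            if block_of[r] != r:
                continue
            merged = join(block_of, delta, out, r, q)
            if merged is not None:            # keep the first consistent merge
                block_of = merged
                break
    return block_of, delta, out
```

The real Trace-RPNI must in addition keep the relation decomposed letter by letter, which is exactly what the congruence-based definition of Join above is responsible for.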
7
Convergence
In this section we prove that the algorithm converges in the limit. First, we give a characteristic sample D = (D+, D−) for which the algorithm will converge.
Definition 11. We define the characteristic sample D as follows:
(1) for every a ∈ Σ and pa ∈ Qa, minpath(pa) ∈ D.
(2) for every a ∈ Σ, for every p ∈ Q, and for any qa ∈ Qa with minpath (qa )
well defined. It remains to prove that for any a ∈ Σ and pa, qa ∈ Qa with minpath(pa)
8
Complexity
According to the definition of the APTA, its construction is linear in the size of the samples |S| = |D+| + |D−|. The Trace-RPNI algorithm is polynomial with respect to the size of APTA(D+, D−) if we assume that the complexity of Join is linear. Under this assumption, the complexity of the whole algorithm is polynomial.
9
Conclusions and Further Work
In this paper we have presented an algorithm to infer Regular Trace Languages. We have proved that the algorithm converges, and we have presented a general description of its complexity. For future work, based on the experimental results of [4], we think that the efficiency of the algorithm could be improved by acting on a dependence alphabet different from equality. How much improvement could be achieved can only be determined by detailed experimentation. Another interesting subject is to determine whether the improvements made to RPNI, such as the red-blue algorithm [11,12] and others, could be adapted to Trace-RPNI. It would be useful to know if we could obtain the same advantages in the trace version as in the word case.
References
1. Ruiz, J., Cano, A., García, P.: Inferring subclasses of regular languages faster using RPNI and forbidden configurations. In: Adriaans, P.W., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, pp. 28–36. Springer, Heidelberg (2002)
2. Pin, J.-E., Cano Gómez, A., Guaiana, G.: When does partial commutative closure preserve regularity? In: Aceto, L., Damgård, I., Goldberg, L.A., Halldórsson, M.M., Ingólfsdóttir, A., Walukiewicz, I. (eds.) ICALP 2008, Part II. LNCS, vol. 5126, pp. 209–220. Springer, Heidelberg (2008)
3. Angluin, D.: Inductive inference of formal languages from positive data. Information and Control 45(2), 117–135 (1980)
4. Cano, A., Alvarez, G.: Learning commutative regular languages. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 71–83. Springer, Heidelberg (2008)
5. Pighizzini, G., Bruschi, D., Sabadini, N.: On the existence of the minimum asynchronous automaton and on decision problems for unambiguous regular trace languages. In: Proceedings of the 5th Annual Symposium on Theoretical Aspects of Computer Science. LNCS, pp. 334–345. Springer, Heidelberg (1988)
6. Gold, E.M.: Language identification in the limit. Information and Control 10, 447–474 (1967)
7. Mazurkiewicz, A.W.: Trace theory. In: Brauer, W., Reisig, W., Rozenberg, G. (eds.) APN 1986. LNCS, vol. 254, pp. 279–324. Springer, Heidelberg (1987)
8. Oncina, J., García, P.: Inferring regular languages in polynomial updated time. In: Pattern Recognition and Image Analysis (1992)
9. Rozenberg, G., Salomaa, A. (eds.): Handbook of Formal Languages. Beyond Words, vol. 3. Springer, New York (1997)
10. Trakhtenbrot, B., Barzdin, Y.: Finite Automata: Behavior and Synthesis (1973)
11. Lang, K.J.: Evidence-driven state merging with search (1998), http://citeseer.nj.nec.com/lang98evidence.html
12. Lang, K.J., Pearlmutter, B.A.: Abbadingo One: DFA learning competition (1997), http://abbadingo.cs.unm.edu
13. Diekert, V., Rozenberg, G.: The Book of Traces. World Scientific Publishing Co., Inc., River Edge (1995)
Distributional Learning of Some Context-Free Languages with a Minimally Adequate Teacher Alexander Clark Department of Computer Science Royal Holloway, University of London
[email protected]
Abstract. Angluin showed that the class of regular languages could be learned from a Minimally Adequate Teacher (mat) providing membership and equivalence queries. Clark and Eyraud (2007) showed that some context free grammars can be identified in the limit from positive data alone by identifying the congruence classes of the language. In this paper we consider learnability of context free languages using a mat. We show that there is a natural class of context free languages, that includes the class of regular languages, that can be polynomially learned from a mat, using an algorithm that is an extension of Angluin’s lstar algorithm.
1
Introduction
The inference of context free languages is in a less developed state than the study of the inference of regular languages. Angluin [2] showed that a simple criterion, reversibility, could be used to identify a class of regular languages from positive data alone using Deterministic Finite Automata (dfas). This approach has two parts; first basing the states of the learned automata on the residual languages or right congruence classes of the language, and secondly a simple test for telling whether two strings are in the same class. Later, she showed [3] that a much larger class, indeed the class of all regular languages, could be learned using a richer model, the Minimally Adequate Teacher (mat) model. In this model the learner is provided with two sources of information about the language. First, the learner can ask membership queries — for a given string the learner can find out whether that string is in the language — and secondly the learner can ask equivalence queries — the learner provides the teacher with a hypothesis, and the teacher will either confirm that it is correct, or it will provide the learner with a counter-example. The algorithm lstar is a classic and well-studied algorithm; it uses the same representational idea — the states again correspond to the residual languages, but the test for equivalence of strings is much more sophisticated, using a series of test suffixes to define the equivalence classes. When it comes to context free inference, Clark and Eyraud [1] showed a result which is an exact analogue of the first paper of Angluin. They show that a learnability result could be established from positive data alone, by combining a representational decision – the non-terminals correspond to the syntactic congruence class – with a simple test – weak substitutability. In this paper we try
to extend this approach to produce an algorithm for context free grammatical inference that is similar to the lstar algorithm. In the process, we will borrow some ideas and terminology from [4]. We will make equivalence queries where the hypothesis may not be in the learnable class – this is sometimes called an Extended Minimally Adequate Teacher [5]. The minimal dfa has a special status – there is a bijection between the states of the automaton and the residual languages. Though it is “minimal” it may be exponentially larger than the smallest non-deterministic automaton for the same language. We can consider the equivalent construction for cfgs to be where we have a correspondence between the congruence classes of the language and the non-terminals of the grammar – we define these precisely in Section 2. An important difference is that the number of congruence classes of a language will be infinite if it is not regular. We will therefore only model some finite subset of the congruence classes; as a result the class of languages that we can learn is limited, and as we shall see in Section 3, does not correspond to the class of all context free languages. In this algorithm we will use an observation table that is similar to that used in the lstar algorithm, but one that consists of substrings and contexts, rather that prefixes and suffixes (Section 4). In the remaining sections of the paper, we will define the algorithm and prove its correctness and polynomial efficiency in the standard way.
2
Notation
We have a finite non-empty alphabet Σ which is known; we use Σ∗ to denote the set of all finite strings over Σ, and λ is the empty string. A language is a subset of Σ∗. A context (l, r) is just a pair of strings; an element of Σ∗ × Σ∗. The distribution of a string is CL(w) = {(l, r) | lwr ∈ L}. We define (l, r)u to be lur – the wrapping operation which combines a context with a substring – and we extend this to sets of contexts and sets of strings in the natural way. We will write Sub(w) for the set of all substrings of a string, so Sub(w) = {u | ∃l, r ∈ Σ∗, lur = w}. Two strings are congruent with respect to a language L, written u ≡L v, iff CL(u) = CL(v). This is an equivalence relation and we write [u]L = {v | u ≡L v} for the equivalence class of u. This elementary lemma [1] establishes the basis for this learning approach:
Lemma 1. For any language L, if u ≡L u′ and v ≡L v′ then uv ≡L u′v′.
This means that we can define a concatenation operation between these classes: [u] ◦ [v] = [uv]. This is a monoid, as it is associative and contains a unit [λ]; it is called the syntactic monoid Σ∗/≡L.
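As a concrete illustration of these definitions, the following small Python sketch (mine, not part of the paper) computes distributions restricted to a finite set of contexts for the language {an bn | n ≥ 0}; the membership predicate and the particular context set are assumptions made only for this example.

```python
# Restricted distributions and the induced (approximate) congruence test.

def in_L(w):            # membership oracle for L = {a^n b^n | n >= 0}
    n = len(w) // 2
    return w == "a" * n + "b" * n

def distribution(w, contexts):
    """Contexts (l, r) from the given finite set such that lwr is in L."""
    return frozenset((l, r) for (l, r) in contexts if in_L(l + w + r))

# with respect to the full (infinite) set of contexts, equality of these
# restricted distributions only approximates the syntactic congruence u ≡_L v
contexts = [("", ""), ("a", ""), ("", "b"), ("a", "b"), ("aa", "b")]

print(distribution("ab", contexts) == distribution("aabb", contexts))  # True
print(distribution("a", contexts) == distribution("b", contexts))      # False
```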
3
Objective CFGs
Our representations are context free grammars (cfg). We define a very slightly non-standard definition of a cfg. A cfg is a tuple Σ, V, P, I where Σ is a set
of terminal symbols, V is a finite non-empty set (non-terminals), I is a non-empty subset of V, the set of initial symbols, and P is a finite set of productions of the form V × (V ∪ Σ)∗, which we write N → α, where N ∈ V and α is a possibly empty string of terminal and non-terminal symbols. We define the standard derivation relation γNδ ⇒G γαδ, when N → α ∈ P, and let ⇒∗G be the transitive reflexive closure of this relation. We define L(G, A) = {w | A ⇒∗G w} and L(G) = ⋃S∈I L(G, S). Note that we allow multiple start symbols. Clearly this does not change the class of languages that can be defined, since we could add one more symbol S and add productions S → S′ for all S′ ∈ I. Secondly, we will allow the alphabet Σ to be empty. We will consider the case of cfgs in Chomsky Normal Form (CNF): all productions are of the form N → PQ, N → a or N → λ. While in general we can assume w.l.o.g. that grammars are in CNF, this assumption might in this case limit the class of languages that can be represented.
3.1
Congruential Grammars
We are interested in grammars where there is a relation between the non-terminals and the congruence classes. We say that a cfg G is congruential if for every non-terminal N it is the case that L(G, N) is a subset of a congruence class of L(G); i.e. N ⇒∗ u and N ⇒∗ v implies that u ≡L(G) v. This means that L(G, N) will be a subset of a congruence class, but each non-terminal need not generate every string in the congruence class. Note that this differs from the definition of congruential given in [6]. There are a few interesting properties of these grammars: first, note that we can assume w.l.o.g. that we only have one non-terminal for each congruence class. If we have two non-terminals that both generate strings from the same congruence class [w] then we can clearly merge them, without changing the language defined by the grammar. That is to say, if N ⇒∗ u and u ≡L v, then if we add productions so that N ⇒∗ v this will leave the language unchanged. Secondly, the binary productions will all be of the form [u] ◦ [v] → [u][v]. We consider a simple example: the context-free language L = {an bn | n ≥ 0}. This has an infinite number of congruence classes, including the following five: [λ] = {λ}, [a] = {a}, [b] = {b}, [ab] = {ab, aabb, . . .} = L \ {λ} and [aab] = {aab, aaabb, . . .}. Let us define a cfg whose non-terminals correspond to these 5 congruence classes; we will label them as Z, A, B, S, T. We have productions Z → λ, S → AB, S → TB, T → AS, A → a and B → b. It is easy to verify that this is congruential in that L(G, A) = [a], L(G, B) = [b], L(G, S) = [ab] and so on. The set of initial symbols is I = {S, Z}; we see that L(I) = L.
Definition 1. Let Lccfg be the set of all languages definable by a congruential cfg.
Space does not permit a full exploration of the relationship of this class to other learnable classes, but we can make a few basic points. We can assume that the grammar is in Chomsky Normal Form, as we discuss below.
First note that Lccfg includes all regular languages. Regular languages have only a finite number of congruence classes [7]. We can therefore construct a grammar which has one non-terminal for every congruence class of a regular language L, together with the set of productions [a] → a, [uv] → [u][v] and [λ] → λ; it is trivial to show that this will have the property that [u] ⇒∗ u for all u, and thus if we set I = {[u] | u ∈ L}, we will have a grammar in Lccfg that defines the language L. Secondly, Lccfg includes all NTS languages [6]. An NTS grammar is a grammar such that for any non-terminals N, M and strings l, w, r, if N ⇒∗ w and M ⇒∗ lwr then M ⇒∗ lNr. NTS grammars are clearly congruential. Suppose G is an NTS grammar defining a language L. Suppose N ⇒∗ u and lur ∈ L, and N ⇒∗ v. Then there is an S ∈ I such that S ⇒∗ lur; and therefore, by the NTS property, S ⇒∗ lNr ⇒∗ lvr; and so u ≡L v. Moreover, if α ∈ (V ∪ Σ)+ then if α ⇒∗ u and α ⇒∗ v then u ≡L v, by induction on the length of α using Lemma 1. Therefore we can binarise the right-hand sides of the rules of G; the resulting grammar may not be NTS but will still be congruential. Not all CFLs are in Lccfg. In particular, languages which are a union of infinitely many congruence classes are not. For example, the languages {an bm | n > m > 0} and {an bn} ∪ {an b2n} are not in Lccfg for exactly this reason. Similarly, this class does not include the palindrome language over a, b or the even palindrome language over a, b, and thus does not include the class of even linear languages [8]. However Lccfg does include the substitutable CFLs [1] and the k-l-substitutable languages [9]. As we shall see it includes the Dyck language, which is neither substitutable, linear nor regular. We conjecture that the classes of NTS languages, pre-NTS languages and congruential languages all coincide; the corresponding classes of grammars are clearly distinct, but the exact relationships are still not fully established.
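The example grammar for {an bn | n ≥ 0} given above can be checked mechanically. The sketch below (mine, not from the paper) encodes its CNF rules and runs a naive CYK-style membership test; the rule tables and function names are illustrative.

```python
# The congruential grammar Z -> λ, S -> AB, S -> TB, T -> AS, A -> a, B -> b
# with start symbols I = {S, Z}, checked by a small CYK membership test.

BINARY = {("A", "B"): {"S"}, ("T", "B"): {"S"}, ("A", "S"): {"T"}}
LEXICAL = {"a": {"A"}, "b": {"B"}}
START = {"S", "Z"}          # Z only generates the empty string

def parses_to(w):
    """Set of non-terminals deriving w (CYK over the binary/lexical rules)."""
    n = len(w)
    if n == 0:
        return {"Z"}
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, c in enumerate(w):
        table[i][i + 1] = set(LEXICAL.get(c, set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for x in table[i][k]:
                    for y in table[k][j]:
                        table[i][j] |= BINARY.get((x, y), set())
    return table[0][n]

def in_language(w):
    return bool(parses_to(w) & START)

assert all(in_language("a" * n + "b" * n) for n in range(5))
assert not any(in_language(w) for w in ["a", "b", "ba", "aab", "abab"])
```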
4
Observation Table
We now define the basic data structure that we will use which is a modification of the observation table used by Gold[10] and Angluin [3]. It is an observation table which consists of a non-empty finite set of strings K a non-empty finite set of contexts F and a finite function mapping F KK to {0, 1}. Since K always contains λ, K is a subset of KK. Given a context (l, r) in F and a substring w ∈ KK, we have a 1 in the table if lwr is in the language and a 0 if it is not. We will write this as a tuple K, D, F , where D is the set of grammatical strings in F KK. Figure 1 illustrates a simple example. For two strings u, v ∈ KK we say they are equivalent if they appear in the same set of contexts; and we write this u ∼F v; in Angluin’s terms this is row[u] = row[v]. This means that CL (u) ∩ F = CL (v) ∩ F . Note that if u ≡L v then u ∼F v for any set of contexts; conversely, if it is not the case that u ≡L v then we can find some context (l, r) such that if (l, r) ∈ F , it is not the case that u ∼F v. Of course, our set of contexts may be too small, in which case we may
      (λ, λ)  (a, λ)  (λ, b)  (λ, ab)   |         (λ, λ)  (a, λ)  (λ, b)  (λ, ab)
λ       1       0       0       1      |  aab      0       0       1       0
a       0       0       1       0      |  abb      0       1       0       0
b       0       1       0       0      |  aa       0       0       0       0
ab      1       0       0       0      |  ba       0       0       0       0
                                        |  bb       0       0       0       0
                                        |  bab      0       0       0       0
                                        |  aba      0       0       0       0
                                        |  abab     0       0       0       0
Fig. 1. Observation table for L = {an bn |n ≥ 0}. We have F = {(λ, λ), (a, λ), (λ, b), (λ, ab)}, which has 4 elements; these head the 4 columns in the diagram. Each row corresponds to an element of KK. K consists of the 4 strings λ, a, b, ab. We have split the table into two parts; on the left we have K, and on the right we have KK \ K.
have u ∼F v even though u and v are not congruent. In general we will want to increase F so that this does not happen.
4.1
Construction of Grammar
Given an observation table we can construct a CFG from it in a fairly straightforward way. First we assume that we have all the information we need: no holes in the table. In the learning model we use, we will have a membership oracle Mem, and we can use this to fill in the cell of the table corresponding to context (l, r) and substring u, by querying Mem(lur). Definition 2. An observation table, K, F, D is consistent if for all u1 , u2 , v1 , v2 in K, if u1 ∼F u2 and v1 ∼F v2 , then u1 u2 ∼F v1 v2 . If a table is not consistent, then we know that we do not have a large enough set of features, by Lemma 1 and we could add additional features until it is consistent. Algorithm 1 selects appropriate contexts to make sure that the table is consistent. We want to limit the number of contexts we add; note that every time we add a context we will increase the number of congruence classes of K. Thus it is clear that Algorithm 1 will terminate after at most |K| iterations. We use a similar approach in Algorithm 3. However, though the number of iterations that Algorithm 1 will make is bounded, we do not yet see how to bound the total run-time of the algorithm as the length of the contexts being generated might become very large. It is also not necessary to have a consistent table; if the table is inconsistent, then we may generate a grammar which will have two rules of the form N1 → AB and N2 → AB, for distinct non-terminals N1 and N2 . Algorithm 2 presents the algorithm for constructing the grammar from an observation table. This runs in polynomial time, and always returns a valid cfg. We will refer to the output of this algorithm as G(K, D, F ).
Data: K, D, F
Result: A set of features that is consistent
while K, D, F is not consistent do
    Find u1, u2 and v1, v2 in K, and (l, r) ∈ F, such that u1 ∼F u2, v1 ∼F v2, lu1v1r ∈ D, and lu2v2r is not in D ;
    if Mem(lu1v2r) = 1 then
        F ← F ∪ {(l, v2r)} ;
    else
        F ← F ∪ {(lu1, r)} ;
    Use Mem() to increase D to fill in the observation table ;
return F ;
Algorithm 1. MakeConsistent
Data: K, D, F
Result: A context free grammar G
Divide K into equivalence classes according to ∼F ;
Let V be the set of these equivalence classes ;
Let I = {N ∈ V | ∀w ∈ N, w ∈ D} ;
P ← {N → a | a ∈ N ∩ Σ} ;
P ← P ∪ {N → P Q | u ∈ P, v ∈ Q, w ∈ N, uv ∼F w} ;
P ← P ∪ {N → λ | λ ∈ N} ;
return G = Σ, V, P, I ;
Algorithm 2. MakeGrammar
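A compact way to see how Algorithm 2 reads a grammar off the table is the following Python sketch (mine, not the paper's code): rows are restricted distributions over F, the ∼F-classes of K become non-terminals, initial symbols are the classes whose strings all lie in the language, and lexical, λ- and binary productions are extracted as in MakeGrammar. Mem stands for the assumed membership oracle; the toy language and the choice of K and F follow Figure 1.

```python
# Observation-table bookkeeping behind Algorithms 1 and 2 (a sketch only).
from collections import defaultdict

def row(w, F, Mem):
    return frozenset((l, r) for (l, r) in F if Mem(l + w + r))

def is_consistent(K, F, Mem):
    """Definition 2: u1 ~ u2 and v1 ~ v2 must imply u1 v1 ~ u2 v2."""
    r = {u: row(u, F, Mem) for u in K}
    return all(row(u1 + v1, F, Mem) == row(u2 + v2, F, Mem)
               for u1 in K for u2 in K if r[u1] == r[u2]
               for v1 in K for v2 in K if r[v1] == r[v2])

def make_grammar(K, F, Mem, alphabet):
    classes = defaultdict(set)                 # row -> equivalence class of K
    for u in K:
        classes[row(u, F, Mem)].add(u)
    V = {r: frozenset(ws) for r, ws in classes.items()}
    initials = {N for N in V.values() if all(Mem(w) for w in N)}
    lexical = {(N, a) for N in V.values() for a in N
               if (len(a) == 1 and a in alphabet) or a == ""}   # N -> a, N -> λ
    binary = set()
    for P in V.values():
        for Q in V.values():
            for N_row, N in V.items():         # N -> P Q when some uv ~_F w in N
                if any(row(u + v, F, Mem) == N_row for u in P for v in Q):
                    binary.add((N, P, Q))
    return initials, lexical, binary

# toy run on L = {a^n b^n}, with the K and F of Figure 1
Mem = lambda w: w == "a" * (len(w) // 2) + "b" * (len(w) // 2)
K = ["", "a", "b", "ab"]
F = [("", ""), ("a", ""), ("", "b"), ("", "ab")]
print(is_consistent(K, F, Mem))
I, lex, bin_rules = make_grammar(K, F, Mem, "ab")
```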
5
Adding Features
There are two sorts of errors to deal with; undergeneration and overgeneration. The more difficult one is to cope with overgeneralisation, so before we define the full algorithm, we will discuss how this is dealt with. First, if the partition of KK into classes is correct, then we will not overgeneralise; that is to say, if for any u, v in KK, we have that u ∼F v implies u ≡L v, then we will not overgeneralise. If we overgeneralise, then there must be two strings w1 , w2 in KK that appear to be congruent but are not. So we need to add a feature/context to have a more fine division into classes so that the two strings w1 and w2 are in different classes. The categories are equivalence classes of KK under ∼F , that contain at least one element of K. We will consider the categories to be subsets of KK, and write w ∈ X, where X is one of these categories. Note that some elements of KK will not be in a category, if they do not contain any element of K. These will correspond to congruence classes that we do not model, but are aware of. In Algorithm 3 we are given a derivation that is incorrect: we have a derivation of a string w from a non-terminal X such that we know that w is not congruent ∗ to the strings in X. Formally we have a context (l, r) such that X ⇒ w and
lwr ∈ L(G) \ L, and yet there is a w′ ∈ X such that lw′r ∈ L. We return a context that splits some category X in the grammar. We say that a category X is split by a context (l, r) if there are u, v ∈ X such that lur ∈ L and lvr ∉ L. We will explain the algorithm informally: suppose we have a non-terminal X that generates a string w. X corresponds to a subset of strings in KK, say {x1, . . . , xk}. Ideally we will have that w is congruent to all of these xi and that all of these xi are congruent to each other. Suppose we observe that this is not the case and that w has some context that an xi does not. If some of the xi have it and some do not, then we can use this to split the category X. Otherwise it might be that all of the xi are in fact congruent to each other, and that the problem is with some other productions. Consider a derivation of w from X. Suppose the derivation starts with the production X → Y Z, such that Y ⇒∗ u and Z ⇒∗ v and w = uv. We will have two strings u′, v′ ∈ K and u′v′ ∈ KK such that u′ is in the category Y, v′ ∈ Z and u′v′ ∈ X, and lu′v′r ∈ L. Now crucially, we know that w = uv is not congruent to u′v′, since the context (l, r) distinguishes them. Therefore either u is not congruent to u′ or v is not congruent to v′, or possibly both are not congruent. If they were both congruent, this would violate Lemma 1. Note that we might have that u = u′ or v = v′, in which case we know immediately which of the pair is different. Suppose not; then we consider the intermediate string u′v. If lu′vr ∈ L, then we know that u′ has the context (l, vr) but u does not. If lu′vr ∉ L then we know that v′ has the context (lu′, r) and v does not. We then recurse.¹ When we reach a leaf, we know that the algorithm must terminate. Since K contains the strings of length 1, if our derivation is of the form X → a, then we know that the feature (l, r) will split it, since a ∈ X by construction.
Lemma 2. Algorithm 4 works in polynomial time in |w| and |K|, and terminates with a set of features such that the derived grammar no longer generates w.
Proof. We can prove that FindContext will always find a context that splits a class of KK. Since we can have at most |K|² equivalence classes, it will terminate after adding at most |K|² new contexts.
More generally we can see that, as with Binary Feature Grammars [4], if we increase F then we decrease the language defined by the grammar.
Lemma 3. Suppose F1 ⊆ F2 and Gi = G(K, D, Fi). If u ∈ K, we write Ni[u] for the category in Gi that contains u. For all u ∈ K, if N2[u] ⇒∗G2 v then N1[u] ⇒∗G1 v.
Proof. Note that since F2 is bigger than F1 and they are based on the same data, u ∼F2 v implies u ∼F1 v. We prove the result by induction on the length of the derivation. It is clearly true for strings of length 1, since if u ∼F2 a, then u ∼F1 a. Suppose it is true for all strings up to length k, and let v be a string of length k + 1, with N2[u] ⇒∗G2 v. Expanding the first step of this derivation, there must be
A slightly more complex and efficient approach would be to consider also uv and possibly add two contexts if necessary.
Distributional Learning of Some Context-Free Languages with a MAT
1 2 3 4 5 6 7 8 9
31
Data: A finite set of strings K, a finite set of contexts F , a finite set of strings D, and a triple X,(l, r),w, where w is a string, (l, r) is a context, X is a ∗ non-terminal such that X ⇒ w; and lwr ∈ L(G) \ L ; Result: A context that splits some category of G if (l, r) splits X then return (l, r); else ∗ ∗ ∗ Let X → Y Z ⇒ uv = w be a derivation of w such that Y ⇒ u, Z ⇒ v; Find a pair of strings u , v ∈ K such that u ∈ Y, v ∈ Z, u v ∈ X ; if Mem(lu vr) = 1 then return FindContext(Y, (l, vr), u); else return FindContext(Z, (lu , r), v);
Algorithm 3. FindContext
7
Data: K, D, F and a string w that is not in L Result: A set of features including F , such that the grammar does not generate w G ← MakeGrammar(K,D,F ); while G generates w do ∗ Suppose S ⇒ w for some S ∈ I ; f be FindContext(S,(λ, λ), w ) ; F ← F ∪ {f } ; Increase D = L ∩ (F KK) using Mem() ; G = MakeGrammar(K,D,F ) ;
8
return F ;
1 2 3 4 5 6
Algorithm 4. AddContexts
N2[u] → N2[x]N2[y], where x, y ∈ K, where N2[x] ⇒∗G2 p and N2[y] ⇒∗G2 q and v = pq. By the inductive hypothesis, we have that N1[x] ⇒∗G1 p and N1[y] ⇒∗G1 q. We know that by construction of G2 and its consistency, we have u ∼F2 xy; therefore u ∼F1 xy, therefore there is a production N1[u] → N1[x]N1[y], and therefore N1[u] ⇒∗G1 v and the result follows.
6
Algorithm
We now informally describe our algorithm: we maintain a set of strings K, a set of contexts F and some data D. We initialise K = {λ} and F = {(λ, λ)}. We fill in D using the membership oracle. We make it consistent, generate a grammar and then query. Note that we only want to add strings to the grammar as a result of counter-examples. Since the class of grammars includes ones which
define languages where every string is exponentially large in the number of nonterminals, we need to wait until we are given a long string; otherwise we will violate the polynomial bound. If the grammar is correct, then we terminate; otherwise if we overgenerate, we add more features until we no longer generate that string. If, on the other hand we undergenerate, then we add more strings to K to increase the number of congruence classes of the languge that we generate. Suppose the target G = V, P, I is a congruential cfg in CNF; and let V be a subset of V . we say that K is sufficient for V if it contains one string from each non-terminal yield; i.e. for all N ∈ V there is a string u ∈ K such that u ∈ L(G, A). ˆ = Lemma 4. If K is sufficient for V , and F is consistent, then let G G(K, D, F ) denote w(N ) for a string in K for some N ∈ V . Then for every ∗ derivation of a string in G, that contains only non-terminals of V say N ⇒G w, ∗ ˆ of N (w(N )) ⇒ ˆ w; we have a derivation in G G Proof. If N is a non-terminal in G let w(N ) be one of the corresponding strings in K, by sufficiency. Suppose we have a rule N → P Q in G; then w(N ) ≡L w(P )w(Q); which means that w(N ) ∼F w(P )w(Q); which means there is a production N (w(N )) → N (w(P ))N (w(Q)) in the set of productions ˆ Similarly for the productions N → a and the production N → λ, and of G. thus every derivation of a string w with respect to G, can be converted into a ˆ derivation of a string w in G. A consequence of this is that if we undergenerate, this can only be because we do not have a large enough set of K; crucially, if we observe a positive counterexample, the derivation of that string must use at least one non-terminal that we do not have an example of in K. Therefore if we add all substrings of this counter-example to K, we will increase the number of non-terminals we have covered by at least 1. Therefore, if there is a grammar with n non-terminals for the target language, we will only need at most n positive counter-examples before we have a grammar that includes all of the target language. 6.1
Convergence Proof
We now prove that this algorithm will learn the class. Lemma 5. Given a sub-congruential cfg in CNF, G, with non-terminals V ; ∗ For K, let n(K) be |{N ∈ V |∃u ∈ K, N ⇒ u}|. Suppose ˆ(G) = G(K, L, F ), If w ˆ then n(K ∪ Sub(w)) > n(K). is a positive countexample, (i.e w ∈ L(G) \ L(G)) Proof. Let w be such an example; there must be a non-terminal that we have not yet observed used in the derivation; call this N . Therefore there is a derivation ∗ ∗ S ⇒ lN r ⇒ lur = w; u ∈ Sub(w), therefore n(K ∪ Sub(w)) > n(K). Theorem 1. There is a polynomial p, such that if L ∈ Lccfg and is generated by a sub-congruential grammar in CNF with n non-terminals, then Algorithm 5
Distributional Learning of Some Context-Free Languages with a MAT
1 2 3 4 5 6 7 8 9 11 12
33
Result: A cfg G K ← Σ ∪ {λ}, K2 = K ; F ← {(λ, λ)}; D = L ∩ {λ} ; G = K, D, F ; while true do if Equiv(G) returns correct then return G ; w ← Equiv(G) ; if w is not in L(G) then K ← K ∪ Sub(w) ;
14
else F ← AddContexts(G,w );
15
G ← MakeGrammar(K, D, F ) ;
Algorithm 5. LearnCFG
will terminate, returning a correct grammar, after p(n, l) steps, where l is the maximum length of examples returned by the equivalence oracle. Proof. First, if it terminates at all, then the result is correct. Note that we will only add positive counterexamples at most n times. Each positive example will add at most 12 l(l + 1) elements to K; and at the start K is of size 1. Therefore |K| is always at most 1 + n2 l(l + 1). Every time we add a feature we will split a class of KK, and the total number of equivalence classes of KK cannot be more than |K|2 . Each time we get a negative example, we will add at least one more feature, and so the total number of negative examples cannot be greater than |K|2 . Therefore Algorithm 5 will terminate after at most n + |K|2 iterations; by previous lemmas the overall complexity is polynomial. This proof is a short-cut proof – while valid it perhaps does not explain the result fully. When we add contexts, we reduce the language, until eventually we will only undergenerate. In particular, eventually we will have that u ∼F v implies u ≡L v; i.e. that we have split the classes up as finely as possible until they correspond to the actual congruence classes of the language. We can explain this with an additional lemma, which we won’t prove: Lemma 6. For any K and any L there is a finite set of contexts F0 such that for all F ⊇ F0 , L(G(K, L, F )) ⊆ L. Note additionally that the algorithm is polynomial at every step – it is possible to create an algorithm that “cheats” by using an exponential amount of data and then constructing a hypothesis with only an exponentially long counter-example [11], in order to force l to be exponentially large, so that the overall complexity will be polynomial. This algorithm satisfies the stricter condition that at each step the amount of computation used is polynomially bounded in n and l.
34
7
A. Clark
Sample Run
We will now illustrate this with a sample run of the algorithm on the Dyck language over the alphabet {a, b} where L = {λ, ab, abab, aabb, . . . }. This is an infinite non-linear context free language. We initialise K and F to the trivial starting points. Our observation table is shown as Step 0 in Figure 2. This is vacuously consistent, so we create a grammar with one non-terminal, S and the rules S → λ, and S → SS, which just generates the language {λ}. We query this, and it undergenerates; let us suppose we receive the positive counterexample ab. We add Sub(ab) to K, getting K = {λ, a, b, ab}; this is shown as Step 1.
Step 0 (λ, λ) λ 1
Step 1 λ a b ab aa ba bb abb aba aab bab abab
(λ, λ) 1 0 0 1 0 0 0 0 0 0 0 1
Step 2 λ a b ab aa ba bb abb aba aab bab abab
(λ, λ) 1 0 0 1 0 0 0 0 0 0 0 1
(a, λ) 0 0 1 0 0 0 0 1 0 0 1 0
Step 3 λ a b ab aa ba bb abb aba aab bab abab
(λ, λ) 1 0 0 1 0 0 0 0 0 0 0 1
(a, λ) 0 0 1 0 0 0 0 1 0 0 1 0
(λ, b) 0 1 0 0 0 0 0 0 1 1 0 0
Fig. 2. States of observation table; in each case, the elements of K are above the line, and KK \ K below the line
So this is not consistent since a ∼F b but aa is not equivalent to ab. We add the feature (a, λ), which will separate a and b. This gives us Step 2 in the figure; this is consistent, since the only two strings in K that are similar are λ and ab and these are in fact congruent. So we define the grammar which has 3 non-terminals S, A, B. This has the three lexical rules S → λ A → a and B → b. We also have these binary rules: S → SS, S → AB A → AA A → BA and A → BB. The set of rules expanding A are clearly bizarre; but we get them since for example if we combine an a with an a we get aa which has the same features as A; the problem is that a ∼F aa but clearly they are not congruent. This defines a language which overgenerates since we have S → AB → AAB → aab. So we then equivalence query this grammar, and get, let us suppose, the string aab. We now call FindContext with arguments S, (λ, λ), aab; we have a derivation S → AB from the strings ab → a, b. We test ab which is in the language, so we then call FindContext with arguments A, (λ, b), aa. A has the set of strings a, aa and we note that (λ, b) splits these, so we return (λ, b) and add it to F .
This is again consistent, shown as Step 3, and gives us the grammar with lexical rules as before and binary rules:
– S → SS, S → AB
– A → AS, A → SA
– B → BS, B → SB
This is consistent and defines the right language, so we query and terminate.
8
Discussion
The mat model, though standard in grammatical inference, is unrealistic for classes of representations where the equivalence queries are undecidable. We decided to use this model for several reasons: the first is the ZULU competition [12] which looks again at practical issues in the use of the lstar algorithm. In the context of the specific representation classes we use here, we note that the equivalence of NTS grammars is decidable [13], and that it is polynomially decidable whether a given cfg is NTS [14]. Therefore synthetic experiments with this algorithm are certainly practical. We have implemented the algorithm using a sampling approximation to the equivalence oracle and, though we do not present any experimental results here, it is simple and efficient. More generally, one can approximate the equivalence oracle by generating examples from a fixed distribution or from both target and hypothesis grammars, either using a probabilistic cfg or otherwise[15]. Finally, the mat model divides representation classes along interesting lines: DFAs are learnable and NFAs are not. Gold style identification in the limit models tend to be either too restrictive or too permissive. mat learning, on the other hand, seems to have the right level of difficulty: it accords well with practical experiences in learning competitions such as the Tenjinno [16] and Omphalos competitions [17]. de la Higuera [18] says the question as to whether it (the class of context free grammars) can be identified with a polynomial number of queries to a mat is still an open question, but widely believed to be also intractable. The work presented here provides a partial answer to this question: it seems clear that using these approaches only a limited subclass can be learned efficiently; however this work does seem to have overcome the barrier of “linearity” that has been observed. There is still a form of determinism in the grammars, but this is a bottom up determinism, rather than a left-to-right determinism. Shirakawa and Yokomori [5] propose the following definition: A grammar G ∗ is context-deterministic iff whenever the derivation S ⇒ lAr exists L(G, A) = {w|lwr ∈ L}. This is a very interesting approach that is closely related to our own; this requirement states that for every non-terminal any of the contexts will uniquely pick out the language. Thus we have the same idea, albeit in a much
weaker form: we consider grammars where the sets of strings generated by nonterminals are distributionally defined. Note that the Dyck language for example, is not context-deterministic, since ab occurs in the context (a, b) but so does ba. The approach here is less powerful than the lattice based approaches in [4,19], but they are easier to understand since they are based on the classic cfg representation. Nonetheless this paper gives a very natural “textbook” algorithm for a reasonable class of context free languages. The class of languages definable is the largest class we can hope to represent using the natural one non-terminal per congruence class representation. This is also a properly polynomial result for a class of representations that includes some languages where the smallest strings are exponentially large. [1] shows only a characteristic set of polynomial cardinality, which is strictly speaking too weak, since one could specify an additional very long string in the set in order to give the algorithm arbitrary amounts of additional computational time, and [4] only gives a polynomial update time algorithm. [9] suggests using the thickness as an additional parameter. Comparing this result to the Binary Feature Grammar (bfg) result in [4], one important difference is that the bfg approach uses as its representational primitives sets of the form {w|CL (w) ⊇ CL (v)} rather than the congruence classes here. This increases the class of languages, but means that it is more difficult to get a polynomial mat ¸ result.
Acknowledgments I am very grateful to two anonymous reviewers for their helpful suggestions, and to Ryo Yoshinaka for some technical suggestions and corrections.
References 1. Clark, A., Eyraud, R.: Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research 8, 1725–1745 (2007) 2. Angluin, D.: Inference of reversible languages. Communications of the ACM 29, 741–765 (1982) 3. Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75(2), 87–106 (1987) 4. Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 29–42. Springer, Heidelberg (2008) 5. Shirakawa, H., Yokomori, T.: Polynomial-time MAT Learning of C-Deterministic Context-free Grammars. Transactions of the information processing society of Japan 34, 380–390 (1993) 6. Boasson, L., S´enizergues, S.: NTS languages are deterministic and congruential. J. Comput. Syst. Sci. 31(3), 332–342 (1985) 7. Ginsburg, S.: The Mathematical Theory of Context-Free Languages. McGraw-Hill, Inc., New York (1966) 8. Takada, Y.: Grammatical inference for even linear languages based on control sets. Information Processing Letters 28(4), 193–199 (1988)
9. Yoshinaka, R.: Identification in the Limit of k-l-Substitutable Context-Free Languages. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 266–279. Springer, Heidelberg (2008) 10. Gold, E.M.: Complexity of automaton identification from given data. Information and Control 37(3), 302–320 (1978) 11. Angluin, D.: Negative results for equivalence queries. Machine Learning 5(2), 121– 150 (1990) 12. Combe, D., de la Higuera, C., Janodet, J.C.: Zulu: an interactive learning competition. In: Yli-Jyr¨ a, A. (ed.) FSMNLP 2009. LNCS(LNAI), vol. 6062, pp. 139–146. Springer, Heidelberg (2010) 13. S´enizergues, G.: The equivalence and inclusion problems for NTS languages. J. Comput. Syst. Sci. 31(3), 303–331 (1985) 14. Engelfriet, J.: Deciding the NTS property of context-free grammars. Results and Trends in Theoretical Computer Science, 124–130 (1994) 15. Gore, V., Jerrum, M., Kannan, S., Sweedyk, Z., Mahaney, S.: A Quasi-polynomialtime Algorithm for Sampling Words from a Context-Free Language. Information and Computation 134(1), 59–74 (1997) 16. Starkie, B., van Zaanen, M., Estival, D.: The Tenjinno Machine Translation Competition. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 214–226. Springer, Heidelberg (2006) 17. Starkie, B., Coste, F., van Zaanen, M.: The Omphalos context-free grammar learning competition. In: Paliouras, G., Sakakibara, Y. (eds.) ICGI 2004. LNCS (LNAI), vol. 3264, pp. 16–27. Springer, Heidelberg (2004) 18. de la Higuera, C.: A bibliographical study of grammatical inference. Pattern Recognition 38(9), 1332–1348 (2005) 19. Clark, A.: A learnable representation for syntax using residuated lattices. In: Proceedings of the 14th Conference on Formal Grammar, Bordeaux, France (2009)
Learning Context Free Grammars with the Syntactic Concept Lattice Alexander Clark Department of Computer Science Royal Holloway, University of London Egham, TW20 0EX
[email protected]
Abstract. The Syntactic Concept Lattice is a residuated lattice based on the distributional structure of a language; the natural representation based on this is a context sensitive formalism. Here we examine the possibility of basing a context free grammar (cfg) on the structure of this lattice; in particular by choosing non-terminals to correspond to concepts in this lattice. We present a learning algorithm for context free grammars which uses positive data and membership queries, and prove its correctness under the identification in the limit paradigm. Since the lattice itself may be infinite, we consider only a polynomially bounded subset of the set of concepts, in order to get an efficient algorithm. We compare this on the one hand to learning algorithms for context free grammars, where the non-terminals correspond to congruence classes, and on the other hand to the use of context sensitive techniques such as Binary Feature Grammars and Distributional Lattice Grammars. The class of cfgs that can be learned in this way includes inherently ambiguous and thus non-deterministic languages; this approach therefore breaks through an important barrier in cfg inference.
1
Introduction
In recent years, grammatical inference has started to move from the learnability of regular languages onto the study of the inference of context free languages. The approach developed by Clark and Eyraud [1] is one active research direction: they consider defining context free grammars where the non-terminals correspond to the congruence classes of the language and are able to demonstrate a learnability result for the class of substitutable languages. Though this class is small, the result is significant and has already led to a number of extensions [2,3,4]. Given a richer source of data, including membership queries, it is possible to increase the class of languages learned, while maintaining this basic representational assumption (see [5] in this volume). The limitations of the pure congruence approach are however quite strict. Though it includes many standard languages, such as the Dyck language and so on, there are many simple languages which are not in the class. Consider the following language [6]:
L2 = {an bn | n ≥ 0} ∪ {an b2n | n ≥ 0}
This is clearly a context free language. It is easy to show, using a pumping argument, that any cfg for this language must have a non-terminal that generates an infinite set of strings of the form {ap+qn bqn |n ≥ 0 ∧ (p + qn) ≥ 0} for some positive q and some integer value p. None of these strings are congruent to each other. Moreover, since the language itself is a union of infinitely many congruence classes, it is immediate that it is not in the class of languages that we can learn using a pure congruential class. Similarly, the language of all odd and even length palindromes over {a, b} has the property that no two distinct strings are congruent, and thus a congruence based approach will also fail. In related work, [7] extends this work by switching to a more powerful context sensitive representation; [8] bases a more powerful approach in the theory of residuated lattices. That work is motivated by the problems of linguistics – context free languages are well known to be inadequate for natural language syntax. However there are other non-linguistic areas where grammatical inference is relevant, and in those areas, it may be worthwhile restricting the representations to context free grammars for external reasons – for example one might have prior knowledge the representations are in fact cfgs. Moreover, there are problems involved with generating from these context-sensitive representations, and with making stochastic variants of them, whereas these problems are well understood in the field of cfgs [9]. Therefore, even though cfgs may not be the right representation for natural languages, it is still worth studying their learnability under various paradigms. The question is then how can we increase the class of context free languages that we can learn using the same family of distributional techniques, while still keeping a cfg as a representation. Alternatively, we can ask whether we really need to use the context-sensitive grammar formalisms if all we are really interested in is languages which are in fact context free. The goal of this paper is to take the lattice-based techniques of [8] and use them to push cfg inference techniques as far as we can. The basic strategy is to define various sets of strings that we will correspond to the non-terminals, or more precisely to the set of strings generated by a nonterminal. We consider a finite set of possibly infinite sets of strings, C. If we have that P Q ⊆ N for three sets in C, then we can add a rule N → P Q; see for example [10]. Similarly, if w ∈ N we can add a rule N → w; in particular we will need rules of the form N → a and N → λ where a ∈ Σ. Thus we can add all rules of the first type, and only a finite number of the second type to get a context free grammar, and this context free grammar will clearly have ∗ the property that N ⇒ w will imply that w ∈ N . Obviously this leaves many questions unanswered: how to define these sets, and how to compute whether one is a subset of the other. In [1], the class C consists of a finite set of congruence classes; these are nonoverlapping sets, which form a congruence: that is to say for any two strings u,v, we have that [u][v] ⊆ [uv]. This makes the inference process quite straightforward. Here we will look at using a much larger class C which includes many
overlapping sets and has the structure of a lattice. In particular the primitive elements are defined dually: in terms of finite intersections of sets of strings which are defined by contexts, not strings. Given a context, or pair of strings, (l, r), the elementary sets we consider are of the form: {w|lwr ∈ L} Whereas the congruence classes are the smallest and most fine-grained sets that can be distributionally defined, since they are the sets of strings that have identical distributions, the elementary classes that we define here are in some sense the largest possible classes that we can define. These are sets of strings that have only one context in their common distribution, as opposed to sets that have all their contexts in their common distribution. The languages that can be defined using these classes are thus rather different in character from those that are defined using the congruence based approaches.
2
Distributional Lattice
The theoretical base of our approach is the syntactic concept lattice [8], which is a rich algebraic object that can be thought of as a lattice-theoretic generalisation of the syntactic monoid.
2.1
Notation
Given a finite non-empty alphabet Σ, we use Σ∗ to refer to the set of all strings and λ to refer to the empty string. As usual a language L is any subset of Σ∗. A context is just an ordered pair of strings that we write (l, r) – l and r refer to left and right. We can combine a context (l, r) with a string u with a wrapping operation that we write ⊙: so (l, r) ⊙ u is defined to be lur. We will sometimes write f for a context (l, r). We will extend this notation to sets of contexts and strings in the natural way. Given a formal language L and a given string w we can define the distribution of that string to be the set of all contexts that it can appear in: CL(w) = {(l, r) | lwr ∈ L}, equivalently {f | f ⊙ w ∈ L}. There is a special context (λ, λ): clearly (λ, λ) ∈ CL(w) iff w ∈ L. There is a natural equivalence relation on strings defined by equality of distribution: u ≡L v iff CL(u) = CL(v); this is called the syntactic congruence. We write [u] for the congruence class of u.
2.2
Lattice
Distributional learning is learning that exploits or models the distribution of strings in the language. A sophisticated approach can be based on the Galois connection between sets of strings and sets of contexts.
For a given language L we can define two polar maps from sets of strings to sets of contexts and vice versa. Given a set of strings S we can define a set of contexts S′ to be the set of contexts that appear with every element of S:
S′ = {(l, r) ∈ Σ∗ × Σ∗ : ∀w ∈ S, lwr ∈ L}    (1)
Dually we can define for a set of contexts C the set of strings C′ that occur with all of the elements of C:
C′ = {w ∈ Σ∗ : ∀(l, r) ∈ C, lwr ∈ L}    (2)
We define a syntactic concept to be an ordered pair of a set of strings S and a set of contexts C, written ⟨S, C⟩, such that S′ = C and C′ = S. An alternative way of looking at this is that the concepts are the pairs ⟨S, C⟩ such that C ⊙ S ⊆ L and such that these are maximal. For any set of strings S we can define a concept C(S) = ⟨S′′, S′⟩. S′′ is called the closure of the set S; this is a closure operator, as S′′′ = S′ for any set of strings. We can also form a concept from a set of contexts C as C(C) = ⟨C′, C′′⟩. Importantly, there will be a finite number of concepts in the lattice if and only if the language is regular. Each concept represents a natural set of strings in the language; these form an overlapping hierarchy of all sets of strings that can be distributionally defined. We can define a partial order on these concepts where ⟨S1, C1⟩ ≤ ⟨S2, C2⟩ iff S1 ⊆ S2. Note that S1 ⊆ S2 iff C1 ⊇ C2. We can see that C(L) = C({(λ, λ)}), and clearly w ∈ L iff C({w}) ≤ C({(λ, λ)}). We will drop brackets from time to time to improve legibility. This poset in fact forms a complete lattice, with a top element ⊤ which will normally be ⟨Σ∗, ∅⟩, though there may be some contexts shared by every string, for example in the language Σ∗, and similarly a bottom element ⊥, ⟨∅, Σ∗ × Σ∗⟩, though again there may be some strings in the bottom element. Indeed if L = Σ∗ then the lattice has only one element and ⊤ = ⊥ = C(L) = ⟨Σ∗, Σ∗ × Σ∗⟩. If L = Σ∗aΣ∗, then the lattice has two elements: the top element is ⟨Σ∗, L × Σ∗ ∪ Σ∗ × L⟩, and the bottom element is ⟨L, Σ∗ × Σ∗⟩. The relation to the syntactic monoid is crucial: C({u}) = C([u]), but there may be other strings in the concept that are not congruent to u. In particular, the set of strings in the concept of u will be the union of all of the congruence classes whose distribution contains the distribution of u, i.e. if CL(u) ⊆ CL(v) then v, and indeed [v], will be in the concept C(u). More formally, C({u}) will have the set of strings {v | CL(v) ⊇ CL(u)} whereas [u] = {v | CL(v) = CL(u)}. We define a concatenation operation on these as follows, somewhat similar to the concatenation in the syntactic monoid.
Definition 1. ⟨Sx, Cx⟩ ◦ ⟨Sy, Cy⟩ = ⟨(SxSy)′′, (SxSy)′⟩
We refer the reader to [8] for a more detailed derivation of the properties of this lattice, and a proof that it is a residuated lattice.
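The two polar maps and the closure operator are easy to compute on finite data. The following Python sketch (mine, not from the paper) does so for the Dyck language over {a, b}, restricting both maps to small finite sets of candidate strings and contexts; the language and the particular finite sets are illustrative assumptions.

```python
# The two polar maps S' and C', and the concept C(S) = <S'', S'>, computed
# with respect to finite candidate sets only.

def dyck(w):
    depth = 0
    for c in w:
        depth += 1 if c == "a" else -1
        if depth < 0:
            return False
    return depth == 0

STRINGS = ["", "a", "b", "ab", "ba", "aabb", "abab"]
CONTEXTS = [("", ""), ("a", ""), ("", "b"), ("a", "b")]

def prime_of_strings(S):                    # S' : contexts shared by all of S
    return {f for f in CONTEXTS if all(dyck(f[0] + w + f[1]) for w in S)}

def prime_of_contexts(C):                   # C' : strings fitting all of C
    return {w for w in STRINGS if all(dyck(l + w + r) for (l, r) in C)}

def concept(S):                             # C(S) = <S'', S'>
    Sp = prime_of_strings(S)
    return frozenset(prime_of_contexts(Sp)), frozenset(Sp)

print(concept({"ab"}))   # concept of the language itself: λ, ab, aabb, abab
print(concept({"a"}))    # strings sharing the context (λ, b) with a
```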
3
Grammar
We will use cfgs, which we define standardly as a tuple Σ, V, P, S ; where Σ is a non-empty finite set, the alphabet; V is a finite set of non-terminals disjoint from Σ, S is a distinguished element of V , the start symbol, and P is a finite set of productions of the form V × (Σ ∪ V )∗ ; we will write these as V → α. We will consider cfgs in Chomsky Normal Form, where all of the productions are either of the form N → a, N → λ or N → RS. We write the standard derivation as ∗ βN γ ⇒G βαγ, when N → α ∈ P and the transitive reflexive closure as ⇒G . Given a lattice B(L) and a finite set of concepts V ⊆ B(L), which includes the concept C(L), we can define a cfg as follows. The set of non-terminals will be equal to this set of concepts; either a set of symbols that are in bijection with the concepts or alternatively the concepts themselves. The start symbol will be the concept C(L). We define the set of productions P as follows: If N = S, C and a ∈ Σ ∪ {λ} and a ∈ S, then we add a rule N → a. If A ◦ B ≤ N then we add a rule N → AB. We will call this grammar G(L, V ). ∗
Lemma 1. For all L and for all V ⊆ B(L), if N = SN , CN and N ⇒G w then w ∈ SN . Proof. By induction on the length of the derivation. Suppose N → w is the complete derivation; then by construction w ∈ SN . Suppose it is true for all ∗ derivations of length at most k, and let N ⇒G w be a derivation of length k + 1. ∗ ∗ Suppose the first step of the derivation is N → P Q ⇒G uv = w, and P ⇒G u ∗ and Q ⇒G v; by the inductive hypothesis u ∈ SP and v ∈ SQ ; (using the obvious notation for the sets of strings in the concepts P and Q). By the definition of P ◦ Q we have that uv ∈ SP ◦Q , and since P ◦ Q is less than N we have that uv ∈ SN . Of course this result assumes that we can correctly identify both the concepts and the concatenation operation; we now address this point. 3.1
Partial Lattice
Given a finite set of strings K, and a finite set of contexts F we can produce a partial lattice. We consider a set D which is a sufficiently large subset of L; in particular we take D = L ∩ (F KK). We then define the lattice B(K, D, F ) to be the set of all ordered pairs S, C where S ⊆ K and C ⊆ F and C = S ∩ F , and S = C ∩ K. We can compute this lattice using only the finite sets K, D and F . We use two subroutines defined as GetK and GetF; if C is a set of contexts, GetK(C) will return C , and GetF(S) will return S . We therefore define the following algorithm which uses a membership oracle, and takes as input a finite set of strings K and a finite set of contexts F . We assume that (λ, λ) ∈ F . Algorithm 1 will generate a cfg in Chomsky normal
form given the finite sets K, D and F , and an integer bound f which is at least 1. We discuss f further below. The important point of this algorithm is the schema we use for generating the branching productions of the grammar. If we have three non-terminals N, P, Q which correspond to SN , CN , SP , CP , SQ , CQ , then we have rules of the form N → P Q if and only if ((SP SQ ) ∩ F ) ∩ K ⊆ SN . Let us consider this condition for a moment. SP SQ is a subset of KK; this may not contain any elements of K at all; but we take the set of all strings of K that have all of the contexts that are shared by all of the elements of SP SQ , and compare this to SN . It could be that ((SP SQ ) ∩ F ) ∩ K is empty; then for every non-terminal N we will have a rule N → P Q; or it could be that N = in which case for every non-terminals P, Q, we will have N → P Q. We can remove these excessive productions and redundant non-terminals using standard techniques. An alternative condition would be ((SP SQ ) ∩ F ) ⊇ CN ; but this does not have quite the right properties. Example 1. Consider the Dyck language over a, b, that consists of strings like {λ, ab, abab, aabb, . . . }. Let K = {λ, a, b, ab} and F = {(λ, λ), (a, λ), (b, λ)}. D = L ∩ (F KK) = {λ, ab, aabb, abab}. There are therefore 5 concepts in the lattice B(K, D, F ) – – – – –
– The top element T = ⟨K, ∅⟩
– The bottom element N = ⟨∅, F⟩
– S = ⟨{λ, ab}, {(λ, λ)}⟩
– A = ⟨{a}, {(λ, b)}⟩
– B = ⟨{b}, {(a, λ)}⟩
The grammar therefore has 5 non-terminals, labelled S, A, B, T and N. We will have lexical rules T → a, A → a, T → b and B → b, and λ-rules S → λ and T → λ. Since there are 5 concepts, there are 125 possible branching rules. We have a large number of vacuous rules with N on the right hand side: there are 45 such rules, N → NN, A → BN and so on. We then have a large number of vacuous rules with T on the left: there are 25 of these, T → AA, T → AN etc., 16 of which are not duplicated among the 45 previous rules. Stripping out all of these we are left with the following rules: S → SS, S → AB, A → AS, A → SA, B → BS, B → SB and S → λ, A → a, B → b. This grammar clearly generates the Dyck language.

For a fixed value of f, Algorithm 1 will produce a polynomial sized grammar and will always run in polynomial time. Note that if we did not bound the set of concepts in some way, and used the whole lattice B(K, D, F), we might have exponentially large grammars.

Example 2. Suppose Σ = {a_1, . . . , a_n} and L = {xy | x, y ∈ Σ, x ≠ y}. If K = Σ and F = {(λ, x) | x ∈ Σ} then B(K, L, F) will have 2^n elements;
Algorithm 1. MakeGrammar
Data: A finite set of strings K, a finite set of contexts F, a finite set of strings D, a non-empty finite set Σ, a bound f
Result: A context-free grammar G
S = C({(λ, λ)});
for A ⊆ F, |A| ≤ f do
    V ← V ∪ {C(A)};
P ← ∅;
for each a ∈ Σ ∪ {λ} do
    for each N ∈ V, N = ⟨S_N, C_N⟩ do
        if a ∈ S_N then
            P ← P ∪ {N → a};
for each A, B ∈ V do
    J = S_A S_B;
    S_X ← GetK(GetF(J));
    for each N ∈ V do
        if S_X ⊆ S_N then
            P ← P ∪ {N → AB};
return G = ⟨Σ, V, P, S⟩;
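The following Python sketch mirrors Algorithm 1 under some simplifying assumptions: each concept is represented directly by its string component restricted to K, GetF and GetK are implemented with a membership oracle in_l, contexts are pairs (l, r) with l·w·r tested for membership, and λ is the empty string. It is only intended to make the rule-generation condition ((S_P S_Q)′ ∩ F)′ ∩ K ⊆ S_N concrete, not to reproduce the data structures of the paper.

from itertools import combinations

def get_f(S, F, in_l):
    # All contexts in F shared by every string in S (S' restricted to F).
    return {(l, r) for (l, r) in F if all(in_l(l + w + r) for w in S)}

def get_k(C, K, in_l):
    # All strings in K that occur in every context of C (C' restricted to K).
    return {w for w in K if all(in_l(l + w + r) for (l, r) in C)}

def make_grammar(K, F, alphabet, in_l, f):
    # Bounded set of concepts: closures of context sets of size at most f.
    concepts = {frozenset(get_k(set(A), K, in_l))
                for size in range(0, f + 1)
                for A in combinations(F, size)}
    start = frozenset(get_k({("", "")}, K, in_l))
    concepts.add(start)
    prods = set()
    for N in concepts:                      # lexical and lambda rules
        for a in list(alphabet) + [""]:
            if a in N:
                prods.add((N, (a,)))
    for A in concepts:                      # branching rules
        for B in concepts:
            J = {u + v for u in A for v in B}
            SX = get_k(get_f(J, F, in_l), K, in_l)
            for N in concepts:
                if SX <= N:
                    prods.add((N, (A, B)))
    return start, prods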
4 Inference
Given a suitable source of information we can then consider the search process for a suitable set of strings K and set of contexts F. We will now establish two monotonicity lemmas, just as in [8]. We first change our notation slightly. We assume that L is fixed and that D = L ∩ (F KK). We therefore will write B(K, L, F) as a shorthand for B(K, L ∩ (F KK), F), and similarly G(K, L, F) for the grammar generated from this. In what follows we will consider the bounds k and f to be fixed.

The first lemma states that if we increase the set of contexts we use, then the language will increase. We take a language L, two sets of contexts F1 ⊆ F2 and a set of strings K. Let G1 be the grammar formed from K, L, F1 and G2 = G(K, L, F2). For a concept ⟨S, C⟩ in B(K, L, F1) note that there is a corresponding concept ⟨S, S′ ∩ F2⟩ in B(K, L, F2). We will write f* : B(K, L, F1) → B(K, L, F2) for the function f*(⟨S, C⟩) = ⟨S, S′ ∩ F2⟩.

Lemma 2. If N → PQ is a production in G1 then f*(N) → f*(P)f*(Q) is a production in G2.

Proof. First of all, since N → PQ is a production in G1 it follows that ((S_P S_Q)′ ∩ F1)′ ∩ K ⊆ S_N. Clearly, since F2 ⊇ F1, we have (S_P S_Q)′ ∩ F2 ⊇ (S_P S_Q)′ ∩ F1, and so ((S_P S_Q)′ ∩ F2)′ ⊆ ((S_P S_Q)′ ∩ F1)′, therefore ((S_P S_Q)′ ∩ F2)′ ∩ K ⊆ S_N; which means that there is a rule f*(N) → f*(P)f*(Q) in G2. Note that if N, P, Q lie
in the bounded subset of concepts defined by f, then f*(N), f*(P) and f*(Q) will also lie in the bounded set in the larger lattice.

Lemma 3. If ⟨S, C⟩ ⇒*_{G1} w then ⟨S, S′ ∩ F2⟩ ⇒*_{G2} w.

Proof. Suppose ⟨S, C⟩ = N ⇒*_{G1} w. We proceed by induction on the length of the derivation. It is obviously true for a derivation of length 1 by construction. Suppose it is true for all derivations of length at most k. We must have N → PQ ⇒*_{G1} uv = w, where P ⇒*_{G1} u and Q ⇒*_{G1} v. By the inductive hypothesis we have that f*(P) ⇒*_{G2} u and f*(Q) ⇒*_{G2} v. Since f*(N) → f*(P)f*(Q) will be a production in G2, by the previous lemma, the result holds by induction.

Clearly this means that as we increase the set of contexts, the language defined can only increase. Conversely, we have that if we increase the set of strings in the kernel, the language will decrease. Suppose J ⊆ K, and define the map between B(K, L, F) (which defines a grammar G1) and B(J, L, F) (which defines the grammar G2) as g(⟨S, C⟩) = ⟨S ∩ J, (S ∩ J)′⟩.
Lemma 4. If ⟨S, C⟩ ⇒*_{G1} w then g(⟨S, C⟩) ⇒*_{G2} w.

Proof. If we have a rule N → PQ in G1, then this means that ((S_P S_Q)′ ∩ F)′ ∩ K ⊆ S_N. Now (S_P ∩ J)(S_Q ∩ J) ⊆ S_P S_Q, and so (((S_P ∩ J)(S_Q ∩ J))′ ∩ F)′ ∩ K ⊆ S_N, and so (((S_P ∩ J)(S_Q ∩ J))′ ∩ F)′ ∩ J ⊆ S_N ∩ J, which means there is a rule g(N) → g(P)g(Q) in G2, and the result follows by induction as before.

Lemma 5. For any language L and set of contexts F, there is a set of strings K such that L(G(K, L, F)) ⊆ L.

Proof. As we increase K the number of concepts in B(K, L, F) may increase, but it is obviously bounded by 2^{|F|}. We start by assuming that K is large enough that we have a maximal number of concepts. Given two concepts X, Y, define D = (C_X′ C_Y′)′ ∩ F. This is the set of contexts shared in the infinite data limit. For each element (l, r) ∈ F \ D we take a pair of strings u ∈ C_X′ and v ∈ C_Y′ such that luvr ∉ L; if we have all such pairs in K, then we can easily see that ⟨S, C⟩ ⇒* w implies that w ∈ C′. This means that for any set of contexts, as we increase K the language will decrease until finally it will be a subset of the target language.

For a given L and F define the limit language as the intersection of L(G(K, L, F)) over all finite K ⊂ Σ*. This limit will be attained for some finite K.

Definition 2. Given a language L, a finite set of contexts F is adequate iff for every finite set of strings K that includes Σ ∪ {λ}, L(G(K, L, F)) ⊇ L.

Clearly, by the previous definitions, any superset of an adequate set of contexts is also adequate. If a language has an adequate finite set of contexts F, then for sufficiently large K the grammar will define the right language.
We say that a cfg in cnf has the finite context property if every non-terminal can be defined by a finite set of contexts. Defining L(G, N) = {w | N ⇒*_G w}, this property requires that for all non-terminals N there is a finite set of contexts F_N such that L(G, N) = F_N′. For a given cfg with the FCP we can define f(G) to be the maximum cardinality of F_N over the non-terminals in G, and we will define f(L) to be the minimum of f(G) over all grammars G such that L(G) = L. We will now show that the algorithm will learn all context-free languages with a bound on f(L).
5 Learning Model
Before we present an algorithm we will describe the learning model that we will use. We use the same approach as in [7] and other papers. We assume that we have a sequence of positive examples, and that we can query examples for membership. See [11] for arguments that this is a plausible model. In other words, we have two oracles: one which will generate positive examples from the language, and another which will allow us to test whether a given string is in the language. Each time the algorithm receives a positive example, the learner must use only a polynomial amount of computation, and produce a hypothesis.

Given a language L, a presentation for L is an infinite sequence of strings w_1, w_2, . . . such that {w_i | i > 0} = L. An algorithm receives a sequence T and an oracle, and must produce a hypothesis H at every step, using only a polynomial number of queries to the membership oracle. It identifies the language L in the limit iff for every presentation T of L there is an N such that for all n > N, H_n = H_N and L(H_N) = L. We say it identifies a class of languages L in the limit iff it identifies in the limit all L in L. We say that it identifies the class in polynomial update time iff there is a polynomial p such that at each step the model uses an amount of computation (and thus also a number of queries) that is less than p(n, l), where n is the number of strings and l is the maximum length of a string in the observed data. We note that this is slightly too weak: it is possible to produce vacuous enumerative algorithms that can learn anything by only processing a logarithmically small prefix of the string [12].
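For concreteness, the two oracles of this model can be thought of as a pair of callables; the interface below is purely illustrative and all names are invented for this sketch.

class LearningOracles:
    # A stream of positive examples drawn from some presentation of L,
    # and a membership oracle for L.
    def __init__(self, positive_examples, membership):
        self._examples = positive_examples      # an iterator enumerating L
        self._member = membership               # w -> bool
    def get_positive_example(self):
        return next(self._examples)
    def is_member(self, w):
        return self._member(w)

def run_learner(oracles, update, steps):
    # At every step the learner receives one positive example and must
    # return a new hypothesis using a polynomial amount of work and queries.
    hypothesis = None
    for _ in range(steps):
        w = oracles.get_positive_example()
        hypothesis = update(hypothesis, w, oracles.is_member)
    return hypothesis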
6 Algorithm
Algorithm 2 is the learning algorithm we use. We will present the basic ideas informally before proving its correctness. We initialise K and F to the most basic sets we can: F will just consist of the empty context (λ, λ) and K will be Σ ∪ {λ}. We generate a grammar, and then we repeat the following process. We draw a positive example and, if the positive example is not in our current hypothesis, we add additional contexts to F. We want to keep F very limited and only increase it when we are forced to. We maintain a large set of all of the substrings that we have seen so far; this is stored in the variable K2; we compare the grammar formed with K2 to
the one formed with K. If they are different, then that means that K2 is more accurate, in that it will eliminate some incorrect rules, that we have not yet attained the limit language and as a result might overgeneralise, and so we increase K to K2. In general we can add to K as much as we want; it can only make the grammar more accurate. We just need to check whether two grammars with the same set of contexts are identical. This merely requires us to verify that there are the same number of concepts, that for every concept in one there is a concept in the other with the same set of contexts, and that for every triple of concepts X ≥ Y ◦ Z the same inequality holds between the corresponding elements in the larger lattice.

Algorithm 2. cfg learning algorithm
Data: Input alphabet Σ, bounds k, f
Result: A sequence of cfgs G1, G2, . . .
K ← Σ ∪ {λ}, K2 = K;
F ← {(λ, λ)}, E = {};
D = (F KK) ∩ L;
G = Make(K, D, F, f);
repeat
    w = GetPositiveExample;
    E ← E ∪ {w}; K2 ← K2 ∪ Sub(w);
    if there is some w′ ∈ E that is not in L(G) then
        F ← Con(E);
        K ← K2;
        D = (F KK) ∩ L;
        G = Make(K, D, F, f);
    else
        D2 ← (F K2 K2) ∩ L;
        if ⟨K2, D2, F⟩ not isomorphic to ⟨K, D, F⟩ then
            K ← K2;
            D = (F KK) ∩ L;
            G = Make(K, D, F, f);
    Output G;
until false;
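A Python rendering of this loop is given below. It is a sketch only: make_grammar stands for the routine of Algorithm 1, parses for a CKY-style membership test on the hypothesis, same_grammar for the isomorphism check described above, and con(E) for the set of contexts extracted from E; the explicit accumulation of E and of the observed substrings in K2 follows the prose description.

def sub(w):
    # All substrings of w, including the empty string.
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

def learn(alphabet, oracles, make_grammar, parses, same_grammar, con, f):
    K = set(alphabet) | {""}
    K2 = set(K)
    F = {("", "")}
    E = set()
    G = make_grammar(K, F, alphabet, oracles.is_member, f)
    while True:                      # identification in the limit: run forever
        w = oracles.get_positive_example()
        E.add(w)
        K2 |= sub(w)                 # accumulate all observed substrings
        if any(not parses(G, e) for e in E):     # the hypothesis undergenerates
            F = con(E)               # add contexts drawn from E
            K = set(K2)
            G = make_grammar(K, F, alphabet, oracles.is_member, f)
        else:
            G2 = make_grammar(K2, F, alphabet, oracles.is_member, f)
            if not same_grammar(G, G2):          # K2 is strictly more accurate
                K = set(K2)
                G = G2
        yield G                      # output the current hypothesis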
6.1 Proof
We will now show that this algorithm is correct for a certain class of languages. We have a parameter f which we assume is fixed; for each value we have a learnable class. For a fixed value of f we can learn all context-free languages L such that f (L) ≤ f . That is to say, all context-free languages that can be defined by a cfg where each non-terminal can be contextually defined by at most f contexts. As is now standard in context-free grammatical inference we cannot define a decidable syntactic property, but rely on defining a class of languages by reference to the algorithm. We will later try to clarify the class of languages that lie in each class.
Theorem 1. Algorithm 2 identifies in the limit, from positive data and membership queries, the class Lfcp(f).

This theorem is an immediate consequence of the following two lemmas.

Lemma 6. There is a point N at which F_N is adequate for L, and for all n > N, F_n = F_N.

Proof. First, once F is adequate the language defined will always include the target and thus F will never be increased again. Suppose L ∈ Lfcp(f) and let F be an adequate set of contexts. Let n be the first time that Con(E_n) contains F, and let F_n be the set of contexts at that point. If F_n is adequate we are done; assume it is not. Then there is some set of strings K such that L(G(K, L, F_n)) is not a subset of L, i.e., there is some w ∈ L \ L(G(K, L, F_n)). Let n_2 be the first time that K2 contains K and E contains w; at this point, or at some earlier point, we will increase F and it will be adequate.

Lemma 7. If F_n is adequate then there is some n_2 > n at which point L(G(K_{n_2}, L, F_n)) = L.

Proof. Let K_0 be a set of strings such that G(K_0, L, F_n) defines exactly the right language; furthermore K_0 ⊆ Sub(L). Let T be a set of strings of L such that K_0 ⊂ Sub(T) and let m be a point such that T ⊆ E_m and m > n. Either G(K_m, L, F_n) is correct, in which case we are done, or it is not, in which case it will differ from G(Sub(E), L, F_n) and thus K will be increased to include K_0, which means that it will be correct at this point.

We will now give some simple examples of languages for varying values of f. We define Lfcp(f) to be the set of languages learnable for a value of f. The simplest class is the class Lfcp(1). First, we note that the regular languages lie within this class. Given a regular language L, consider a deterministic regular grammar for L with no unreachable non-terminals. For a given non-terminal N in this grammar let w_N be a string such that S ⇒* w_N N. Since the grammar is deterministic, we know that if S ⇒* w_N M then N = M. The context (w_N, λ) therefore contextually defines the non-terminal N.

Consider the language Lnd = {a^n b^n c^m | n, m ≥ 0} ∪ {a^m b^n c^n | n, m ≥ 0}. Lnd is a classic example of an inherently ambiguous and thus non-deterministic language; moreover it is a union of infinitely many congruence classes. This language also lies within the class Lfcp(1). In Table 1, we give on the left the sets of strings generated by a natural cfg for this language, and on the right a context that defines that set of strings. Thus even the very simplest element of this class contains inherently ambiguous languages.

If we consider the class Lfcp(2), then we note that the palindrome languages and the language L2 defined in the introduction lie in this class. The palindrome languages require two contexts to define the elementary letters of the alphabet; so for example {a} can be defined by the two contexts (λ, ba) and (λ, aa).
Table 1. Contextually defined grammar for the language Lnd = {a^n b^n c^m | n, m ≥ 0} ∪ {a^m b^n c^n | n, m ≥ 0}

L(G, N)                         F_N
λ                               (aaabb, bccc)
a                               (λ, abbccc)
b                               (aaab, bccc)
c                               (aaabbc, λ)
c*                              (c, λ)
a*                              (λ, a)
{a^n b^n | n ≥ 0}               (a, b)
{a^n b^{n+1} | n ≥ 0}           (aa, b)
{b^n c^n | n ≥ 0}               (b, c)
{b^n c^{n+1} | n ≥ 0}           (bb, c)
Lnd                             (λ, λ)

7 Discussion
There is an important point that we will discuss at length: the switch from context-free representations to context-sensitive representations. Backing off to cfgs makes it clear how important the use of Distributional Lattice Grammars (dlgs) is. With a dlg, we can compactly represent the potentially exponentially large set of concepts and perform the parsing directly. The difference is this: in the cfg we have to treat each derivation separately. If we have one concept that contains the context f and another that contains the context g, and both of these can generate the same string w, this does not mean that there is a concept containing f and g that will generate that string, since the two contexts could come from different sub-derivations. In a dlg, however, we can aggregate information from different derivations; far from making things more difficult, this actually makes the derivation process simpler, since we only need to keep one concept for each span in the parse table. Thus in this case using a context-sensitive representation is a solution to a problem, rather than creating new problems itself.

Suppose we are given the sets of strings generated by each non-terminal of a cfg in cnf. Then it is easy to write down the rules: for any triple of sets/non-terminals X, Y, Z we have a rule X → YZ iff X ⊇ YZ. We also have rules of the form X → a iff a ∈ X. Thus the general strategy for CF inference that we propose is to define a suitable collection of sets of strings, and then to construct all valid rules. The congruence based approach uses the congruence classes: since [u][v] ⊆ [uv] we have the rule schema [uv] → [u][v]. Here we have the same approach with concepts: S_X S_Y ⊆ (S_X S_Y)′′, and so we have a rule C(X) ◦ C(Y) → C(X) C(Y).

The current approach is very heavily influenced by the class of Binary Feature Grammars (bfgs), but they are also very different. In particular, this model is in some sense a dual representation. In the bfg formalism the primitive elements are determined by strings, and the contexts are used to restrict the generated
rules. Here, dually, the primitive elements are defined by the contexts and the strings are used to restrict the generated rules. As a result, the monotonicity lemmas for bfgs go in the opposite direction: for a bfg, if we increase K we increase the language, and if we increase F we decrease the language. In the current construction, the opposite is true.

In general there are two ways of proceeding: we define the sets of non-terminals in terms of strings (the set of all strings congruent to a given string), or alternatively in terms of contexts (the set of all strings that can occur in a given set of contexts). These two approaches give rise to two different sets of algorithms. Clearly this does not exhaust the approach, as we could combine the two. For example, if we define [l, r] to be the set of all strings that have the context (l, r), then the rule schemas [l, r] → [l, xr][x] and [l, r] → [y][ly, r] are also valid. Since the concepts form a lattice, we suggest as a direction for future research that a variety of heuristic algorithms could be used, since a lattice is a good search space, as is known for regular grammatical inference [13]; we do not consider such heuristic algorithms here.
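The general strategy just described (fix a finite collection of string sets and generate every valid rule) can be phrased in a few lines of Python. In this sketch each set is given as a finite sample, so the inclusion test X ⊇ YZ is only approximate, and all names are illustrative.

def rules_from_sets(named_sets, alphabet):
    # named_sets: dict mapping a non-terminal name to a finite sample of
    # the set of strings it should generate.
    prods = set()
    for X, SX in named_sets.items():
        for a in alphabet:
            if a in SX:
                prods.add((X, (a,)))            # X -> a   iff  a in X
        for Y, SY in named_sets.items():
            for Z, SZ in named_sets.items():
                concat = {u + v for u in SY for v in SZ}
                if concat <= SX:                # X -> Y Z iff  X contains YZ
                    prods.add((X, (Y, Z)))
    return prods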
8 Conclusion
We have presented an approach to context-free grammatical inference where the non-terminals correspond to elements of a distributional lattice. Using this representational assumption we have presented an efficient algorithm for the inference of a very large set of context free languages, subject to the setting of certain bounds on the number of non-terminals defined.
Acknowledgments

I am very grateful to the reviewers for identifying several confusing points in this paper; the paper has been much improved as a result of their comments.
References

1. Clark, A., Eyraud, R.: Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research 8, 1725–1745 (2007)
2. Clark, A.: PAC-learning unambiguous NTS languages. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 59–71. Springer, Heidelberg (2006)
3. Yoshinaka, R.: Identification in the limit of k-l-substitutable context-free languages. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 266–279. Springer, Heidelberg (2008)
4. Yoshinaka, R.: Learning mildly context-sensitive languages with multidimensional substitutability from positive data. In: Gavaldà, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 278–292. Springer, Heidelberg (2009)
5. Clark, A.: Distributional learning of some context-free languages with a minimally adequate teacher. In: Proceedings of the ICGI, Valencia, Spain (September 2010)
6. Asveld, P., Nijholt, A.: The inclusion problem for some subclasses of context-free languages. Theoretical Computer Science 230(1-2), 247–256 (2000)
7. Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 29–42. Springer, Heidelberg (2008)
8. Clark, A.: A learnable representation for syntax using residuated lattices. In: Proceedings of the 14th Conference on Formal Grammar, Bordeaux, France (2009)
9. Chi, Z., Geman, S.: Estimation of probabilistic context-free grammars. Computational Linguistics 24(2), 299–305 (1998)
10. Martinek, P.: On a construction of context-free grammars. Fundamenta Informaticae 44(3), 245–264 (2000)
11. Clark, A., Lappin, S.: Another look at indirect negative evidence. In: Proceedings of the EACL Workshop on Cognitive Aspects of Computational Language Acquisition, Athens (March 2009)
12. Pitt, L.: Inductive inference, DFAs, and computational complexity. In: Jantke, K.P. (ed.) AII 1989. LNCS (LNAI), vol. 397, pp. 18–44. Springer, Heidelberg (1989)
13. Dupont, P., Miclet, L., Vidal, E.: What is the search space of the regular inference? In: Carrasco, R.C., Oncina, J. (eds.) ICGI 1994. LNCS, vol. 862, pp. 25–37. Springer, Heidelberg (1994)
Learning Automata Teams
Pedro García, Manuel Vázquez de Parga, Damián López, and José Ruiz
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain
{pgarcia,mvazquez,dlopez,jruiz}@dsic.upv.es
Abstract. We prove in this work that, under certain conditions, an algorithm that arbitrarily merges states of the prefix tree acceptor of the sample in a consistent way converges to the minimum DFA for the target language in the limit. This fact is used to learn automata teams, which use the different automata output by this algorithm to classify the test set. Experimental results show that the use of automata teams improves the best known results for this type of algorithm. We also prove that the well-known Blue-Fringe EDSM algorithm, which represents the state of the art in merging states algorithms, converges with a polynomial characteristic set.

Keywords: DFA learning, Automata teams.
1 Introduction
Gold [4] proposed the identification in the limit model as a framework to study the convergence of inference algorithms. He also proved [5] that any minimum DFA can be reconstructed from a set of words of polynomial size in the number of states of the automaton, and he proposed an algorithm to do this task in polynomial time. This algorithm uses a red-blue strategy (this denomination is taken from a later algorithm [8] that will be mentioned afterwards), which maintains two subsets in the set of states of the prefix tree acceptor of the sample:

– States belonging to the solution (the red set, denoted R in the sequel).
– States not in R that can be reached from a state of R using a symbol (the blue set, denoted B in the sequel).

In Gold's algorithm, the selection of the states of B to be promoted to R and the choice of equivalent states are both made in an arbitrary way. The RPNI [9] and Lang [7] algorithms were both proposed in 1992. They assure the consistency of the hypothesis by merging states in lexicographical order, starting from the prefix tree acceptor of the sample. Later in that decade, the Blue-Fringe EDSM algorithm [8] was developed; this algorithm uses a strategy
Work partially supported by Spanish Ministerio de Educación y Ciencia under project TIN2007-60769.
named Red-Blue to decide the candidate states to be merged. Blue-Fringe became the state of the art for DFA learning algorithms. The question that was left open was whether there exists a polynomial characteristic set for this algorithm.

In [6], De la Higuera et al. studied how the order of state merging could affect the convergence of RPNI-type algorithms. They established that if the order of merging is data-independent, both the convergence and the existence of a polynomial characteristic set are guaranteed. Otherwise, these properties do not hold for data-dependent algorithms. They also empirically showed that this latter type of algorithm may behave well.

We prove in this work the existence of a polynomial characteristic set for the Blue-Fringe EDSM algorithm [8]¹, which realizes a data-dependent merging of states. We also propose, and prove the convergence of, a merging states algorithm that we denote the Generalized Red-Blue Merging (GRBM) algorithm, which shares with Gold's algorithm the red-blue strategy and merges states in an arbitrary way. This makes it possible to learn automata teams, that is, to use several different automata output by the algorithm for classification tasks. In the experiments, several classification criteria have been used, and they obtain better recognition rates than both the Blue-Fringe EDSM, which outputs DFAs, and the DeLeTe2 [3], which outputs a subclass of the nondeterministic automata called Residual Finite State Automata.
2 Definitions and Notation
Let Σ be a finite alphabet and let Σ* be the monoid generated by Σ with concatenation as the internal operation and λ as neutral element. A language L over Σ is a subset of Σ*. The elements of L are called words. Given x ∈ Σ*, if x = uv with u, v ∈ Σ*, then u (resp. v) is called a prefix (resp. suffix) of x. Pr(L) (resp. Suf(L)) denotes the set of prefixes (resp. suffixes) of L.

A Deterministic Finite Automaton (DFA) is a 5-tuple A = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is an alphabet, q0 ∈ Q is the initial state, F ⊆ Q is the set of final states and δ : Q × Σ → Q is the transition function. The language accepted by an automaton A is denoted L(A).

A Moore machine is a 6-tuple M = (Q, Σ, Γ, δ, q0, Φ), where Σ (resp. Γ) is the input (resp. output) alphabet, δ is a partial function that maps Q × Σ into Q and Φ is a function that maps Q into Γ, called the output function. Throughout this paper, the behavior of M will be given by the partial function tM : Σ* → Γ defined as tM(x) = Φ(δ(q0, x)), for every x ∈ Σ* such that δ(q0, x) is defined. A DFA A = (Q, Σ, δ, q0, F) can be simulated by a Moore machine M = (Q, Σ, {0, 1}, δ, q0, Φ), where Φ(q) = 1 if q ∈ F and Φ(q) = 0 otherwise. Then, the language defined by M is L(M) = {x ∈ Σ* : Φ(δ(q0, x)) = 1}.
¹ Throughout this paper, the strategy of maintaining three sets of states in the automaton while merging will be denoted the Red-Blue strategy, whereas the implementation of the algorithm by Lang, using a merging score proposed by Price, will be denoted Blue-Fringe EDSM.
Given two disjoint finite sets of words D+ and D−, we define the (D+, D−)-prefix tree Moore machine (PTMM(D+, D−)) as the Moore machine having Γ = {0, 1, ?}, Q = Pr(D+ ∪ D−), q0 = λ and δ(u, a) = ua if u, ua ∈ Q and a ∈ Σ. For every state u, the value of the output function associated to u is 1, 0 or ? (undefined) depending on whether u belongs to D+, to D− or to Q − (D+ ∪ D−) respectively.

A Moore machine M = (Q, Σ, {0, 1, ?}, δ, q0, Φ) is consistent with (D+, D−) if ∀x ∈ D+ we have tM(x) = 1 and ∀x ∈ D− we have tM(x) = 0.

Given a language L, a characteristic set for an inference algorithm for L is a set of words such that, when they are used as input to the algorithm, a representation of the target language is obtained, and the use of further input words does not change the output. We use the model of learning called identification in the limit [4]. An algorithm identifies a class of languages H in the limit if and only if every language in the class has an associated finite characteristic set.
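A small Python sketch of the prefix tree Moore machine construction; the chosen representation (prefix strings as states, a dictionary for δ) is one convenient option, not the one mandated by the paper.

def prefixes(words):
    return {w[:i] for w in words for i in range(len(w) + 1)}

def build_ptmm(d_plus, d_minus, alphabet):
    # Q = Pr(D+ u D-), q0 is the empty string, delta follows the tree edges.
    states = prefixes(set(d_plus) | set(d_minus))
    delta = {(u, a): u + a
             for u in states for a in alphabet if u + a in states}
    def phi(u):                                   # output function
        if u in d_plus:
            return 1
        if u in d_minus:
            return 0
        return "?"
    output = {u: phi(u) for u in states}
    return states, delta, "", output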
3 Gold's Algorithm
Aiming to focus our proposal, and as a way to analyze the main features of most of the inference algorithms that have been proposed so far, we present in this section a version of Gold's algorithm that uses a prefix tree Moore machine of the sample as a way of representing the input data, instead of the original representation, which was the so-called state characterization matrix. The algorithm we describe (Algorithm 1) behaves exactly as the original and its main features are: 1) it does not merge states; 2) some decisions can be taken in an unspecified (arbitrary) way (lines 4 and 16 of the algorithm); and 3) it converges with a polynomial characteristic sample (if there is a relation between the order in which states are considered and the way the characteristic sample is built). The main drawback of Gold's algorithm is that if it is not supplied with enough data, the output may not be consistent with the input data. It thus behaves differently from RPNI, which is a merging states algorithm whose output is always consistent with the input.

Gold's algorithm uses the function od (obviously distinguishable, lines 3 and 16), defined in the following way: two states u1 and u2 of PTMM(D+, D−) are obviously distinguishable if there exists a word x such that Φ(u1x), Φ(u2x) ∈ {0, 1} and Φ(u1x) ≠ Φ(u2x).

The characteristic set proposed by Gold is based on the following definition and proposition:

Definition 1. Let A = (Q, Σ, δ, q0, F) be a DFA. We say that S ⊂ Σ* is a minimal set of test states if for every q ∈ Q there exists exactly one word x ∈ S such that δ(q0, x) = q.

Note that if S is minimal, Card(S) = Card(Q). Let A = (Q, Σ, δ, q0, F) be the minimum complete DFA for a language L and let S be a prefix closed minimal set of test states. Two sets D+(S) and D−(S) can be built starting from S as follows:

1. For every u ∈ (SΣ ∩ Pr(L)) ∪ {λ} we add uv to D+(S), where v is a suffix that completes u in L (uv ∈ L). If u ∈ L we take v = λ.
Algorithm 1. Gold(D+ ∪ D−)
Require: Two disjoint finite sets (D+, D−)
Ensure: A consistent Moore machine
1: M0 := PTMM(D+, D−) = (Q0, Σ, {0, 1, ?}, δ0, q0, Φ0)
2: R = {λ}; B = Σ ∩ Q0;
3: while there exists s ∈ B such that od(s, s′, M0), ∀s′ ∈ R do
4:     choose s
5:     R = R ∪ {s};
6:     B = (RΣ − R) ∩ Q0;
7: end while
8: Q = R;
9: q0 = λ;
10: for s ∈ R do
11:     Φ(s) = Φ0(s);
12:     for a ∈ Σ do
13:         if sa ∈ R then
14:             δ(s, a) = sa
15:         else
16:             δ(s, a) = any s′ ∈ R such that ¬od(sa, s′, M0);
17:         end if
18:     end for
19: end for
20: M = (Q, Σ, {0, 1, ?}, δ, q0, Φ);
21: if M is consistent with (D+, D−) then
22:     Return(M)
23: else
24:     Return(M0);
25: end if
26: End
2. For every pair (u1, u2) with u1 ∈ S and u2 ∈ SΣ, if u1^{-1}L ≠ u2^{-1}L we choose v ∈ Σ* which distinguishes u1 from u2, that is, v is chosen under the condition that just one of the two words u1v or u2v belongs to L. We add u1v and u2v to D+(S) or to D−(S) according to their membership of L.

A rough bound for the size of D+(S) ∪ D−(S) is easily seen to be quadratic in the size of Q. There are families of automata for which the number of prefix closed minimal sets of test states grows exponentially with the size of the automaton.

Example 1. For n ≥ 1 let An = ({1, 2, ..., n + 1}, {a, b}, δ, 1, {n + 1}) be the automaton defined as: δ(i, c) = i + 1, for i = 1, ..., n, c ∈ {a, b}, and δ(n + 1, a) = δ(n + 1, b) = n + 1. For every An there exist 2^n prefix closed minimal sets of test states.

Proposition 1. [5] Let A = (Q, Σ, δ, q0, F) be a minimum complete automaton. Let S = {u0, u1, . . . , un} be a minimal set of test states and let D+(S) and D−(S) be the sets obtained as shown above. If, in Gold's algorithm, for any
i = 0, . . . , n the state ui is considered for promotion before any other state u such that u^{-1}L = ui^{-1}L, the output is a DFA isomorphic to A.

Then, if the order in which the states of B are promoted to R is established as in the above proposition, the characteristic set for Gold's algorithm is of polynomial size. Otherwise it requires a set of exponential size to guarantee identification. To obtain this set, let S1, S2, . . . , Sr be all the minimal and prefix closed sets of test states. The number r is finite and roughly bounded above by 2^{|Σ|n}, where n + 1 is the number of states of the minimum DFA. Taking D+ as the union of the D+(Si) and D− as the union of the D−(Si), for i = 1, . . . , r (obviously exponential), we have a characteristic set for Gold's algorithm with no restrictions in the choose statements of lines 4 and 16.
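The obviously-distinguishable test used in lines 3 and 16 of Algorithm 1 can be programmed directly on the prefix tree Moore machine; the sketch below is written against the hypothetical representation used in the earlier PTMM fragment and is only illustrative.

def od(u1, u2, states, output):
    # u1 and u2 are obviously distinguishable iff there is a word x with
    # Phi(u1 x), Phi(u2 x) in {0, 1} and Phi(u1 x) != Phi(u2 x).
    suffixes = {s[len(u1):] for s in states if s.startswith(u1)}
    for x in suffixes:
        s1, s2 = u1 + x, u2 + x
        if s1 in states and s2 in states:
            o1, o2 = output[s1], output[s2]
            if o1 in (0, 1) and o2 in (0, 1) and o1 != o2:
                return True
    return False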
4 The Blue-Fringe EDSM Algorithm has a Polynomial Characteristic Set
As has been mentioned above, De la Higuera et al. [6] proposed a general merging states inference algorithm. Aiming to avoid undesired merges of states, particularly those that take place in the first steps of the run of the algorithm, the authors proposed an algorithm that uses a function that establishes the order in which states are selected to be merged. It is important to note that although the authors claim that the function can implement any ordering of the set of states, the structure of the algorithm means that the only states that can possibly be merged belong to two disjoint sets: the first one contains the consolidated states which will belong to the set of states of the final DFA, and the second set contains the states that can be reached from the first one using only one transition. These two sets are usually denominated the set of Red (R) and Blue (B) states respectively.

The algorithm presented in [6] is used by the authors to prove that, in an inference algorithm based on merging states, when the order of the merging is data-independent there is a polynomial characteristic set that makes the algorithm converge to the target automaton. When the order of merging is data-dependent, the existence of a characteristic set polynomial in size is not so clear.

The best known algorithm that implements a function to select the states to be merged is the Blue-Fringe EDSM [8]. It uses a PTMM as the data structure to manipulate the sample, and different training sets may lead to different orderings of the states. Although it has shown a very good experimental behavior, as far as we know the existence of a polynomial characteristic sample has not been proved. In order to prove the existence of this characteristic set, we will first briefly describe the algorithm.

Blue-Fringe EDSM starts from the PTMM(D+, D−) = (Q, Σ, {0, 1, ?}, δ, q0, Φ). Initially R = {λ} and B = Σ ∩ Q. The algorithm compares every pair (u, v) ∈ R × B. To make the proof easier, we assume that the states of B are visited in lexicographical order. So, every state q of B, in lexicographical order, is compared with every state p of R. If q is distinguishable from every state of R
it is promoted to the set R and afterwards the set B is recalculated. Otherwise, if no state can be promoted, the pair of states with the greatest score is merged (the score is assigned using the number of coincidences in the subgraphs that have p and q as initial states). The algorithm continues doing this task until B becomes empty. The merges are done in a deterministic way. It is easily seen that Blue-Fringe EDSM converges. Let us see that it converges with a polynomial characteristic set.

Proposition 2. The algorithm Blue-Fringe EDSM has a polynomial characteristic set.

Proof. Let S = {u0, u1, . . . , un} be the minimal prefix-closed set of test states such that for every i and for every u ∈ Σ* with u^{-1}L = ui^{-1}L we have that ui precedes u in lexicographical order; that is, S is the set of smallest (in lexicographical order) words that reach every state of the automaton. Let us see that, on input of the sets D+(S) and D−(S) built as in Definition 1, the algorithm Blue-Fringe EDSM outputs the minimum DFA for L.

1. The Blue-Fringe EDSM algorithm promotes to the set R all the states of S, and only the states of S, before any merging is done: as the set B is traversed in canonical order, no state that could be promoted will be considered before its equivalent state in S.
2. After the promotion step we have R = S and B = SΣ − S. There are no more possible promotions. The number of states in the set B equals the number of transitions left in the subautomaton induced by R. This will be true during the whole process.
3. By definition of D+(S) and D−(S), for every state in B = SΣ − S there is only one compatible state in R = S, so every state of B will be correctly merged independently of the order in which they are processed; that is, the data only influence the order in which merges are done.
4. Every state in B can only be merged with one in R. Once a merge is made, a state in B disappears (and never comes back again). The rest of them either have the same information they had (rooted subtree) or some of them increase it. In both cases every state in B is distinguished from every state in R except from exactly one of them. This process continues until the set B becomes empty.

D+(S) and D−(S) form a characteristic set. If we add new data to D+(S) and D−(S), the promotion process of the states of S to the set R (before any merging is done) is not altered. From this proof it follows that not only the Blue-Fringe EDSM, but any other algorithm based on the Red-Blue strategy, will converge with a polynomial characteristic sample, under the condition that the promotion of states (in lexicographical order) from B to R is considered before any merging.

For a better understanding of the fact that if the elements of B are traversed in lexicographical order then all the promotions are done before the first merge takes place, let us see the following example:
Example 2. Let us consider the automaton of Figure 1 (a). Following Proposition 2, we have S = {λ, a, aa} and thus SΣ = {a, b, aa, ab, aaa, aab}. From those sets we obtain D+(S) = {aa, ab, bb, ba, aaa, aab} and D−(S) = {λ, a, b}. The prefix tree Moore machine is depicted in Figure 1 (b).
Fig. 1. (a) Starting automaton. (b) The prefix tree Moore machine for D+ (S) = {aa, ab, bb, ba, aaa, aab} and D− (S) = {λ, a, b}.
At the beginning R = {1} and thus B = {2, 3}. As state 2 cannot be merged with 1, we have R = {1, 2} and then B = {3, 4, 5}. At this point, state 3 can be merged with 2. Finally, as state 4 cannot be merged with states 1 or 2, it is added to R and thus we obtain R = {1, 2, 4} and B = {3, 5, 8, 9}. States 5, 8 and 9 can be merged with 4 and the whole process ends. One should observe that the states of S have been promoted to R before doing any merging.

Let us see that adding new data does not affect the set R. For example, let us suppose that D+ = D+(S) ∪ {baa, baaa, baaaa, baaaab, baba} and D− = D−(S). The prefix tree Moore machine is depicted in Figure 2, where the new states and transitions are drawn dashed. The algorithm proceeds exactly as before and thus R = {1, 2, 4} and B = {3, 5, 8, 9}. After merging state 2 with state 3, the latter state disappears from B.
Fig. 2. The prefix tree Moore machine for D+ = D+ (S) ∪ {baa, baaa, baaaa, baaaab, baba} and D− = D− (S)
Fig. 3. Moore machine after merging states 2 and 3
The resulting automaton at this point is depicted in Figure 3, where states belonging to the sets R and B are depicted in different levels of grey. Observe that state 5 has the same information it had before, whereas states 8 and 9 have increased it.
5 Generalized Red-Blue Merging Algorithm
The Generalized Red-Blue Merging algorithm (GRBM) is described in Algorithm 2. It starts from the PTMM of the sample. First, R is initialized to the initial state of the tree. The set R contains, at every step, those states that will be part of the output hypothesis. The set B is constructed from R: it contains the successors of elements of R which do not belong to R. Next, the algorithm chooses a state of B in an arbitrary way and tries to merge it with some state of R. In case it cannot be merged with any state of R, it is added to R. The set B has to be recalculated in both cases. The algorithm continues processing the input until B becomes empty.

Possible merges of the states of the sets R and B are done in an arbitrary order. Merges of states in M are done in a deterministic way (merging two states may lead to further merges in order to avoid nondeterminism) using the function detmerge(M, p, q). In case this merging is not consistent, it returns M.

The main difference between Gold's algorithm and GRBM is that the latter merges states whereas the former does not. The other difference is that in Gold's algorithm states of B which are obviously different from every state of the set R are promoted to R, while transitions are only established at the end (analyzing the equivalences between states in B and R). Both Blue-Fringe EDSM and GRBM consider merging of states, but while the former tries to promote before merging, the latter arbitrarily chooses one state from B and tries to merge it with one from R; only when no merge is possible is the state promoted to R. This fact improves the computational efficiency. The convergence of the algorithm is always guaranteed and, under certain restrictions (the same as in Gold's), there exists a polynomial characteristic set. The following proposition paraphrases Proposition 1 for the new GRBM and establishes the conditions for the existence of a polynomial characteristic sample.

Proposition 3. Let A = (Q, Σ, δ, q0, F) be a minimum automaton. Let S = {u0, u1, . . . , un} be a minimal set of test states and let D+(S) and D−(S) be
Algorithm 2. Generalized Red-Blue Merging(D+ ∪ D−)
1: Input: Two finite sets (D+ and D−)
2: Output: A consistent Moore machine
3: Method:
4: M := PTMM(D+, D−) = (Q, Σ, {0, 1, ?}, δ, q0, Φ)
5: R := {q0}
6: B := {q ∈ Q : δ(q0, a) = q, a ∈ Σ}
7: while B ≠ ∅ do
8:     for q ∈ B (in arbitrary order) do
9:         merged := false
10:        for p ∈ R (in arbitrary order) do
11:            if detmerge(M, p, q) ≠ M then
12:                merged := true
13:                M = detmerge(M, p, q)
14:                break()
15:            end if
16:        end for
17:        if ¬merged then
18:            R := R ∪ {q}
19:        end if
20:        B := {q ∈ Q | δ(p, a) = q, p ∈ R, a ∈ Σ} − R
21:    end for
22: end while
23: Return M
the sets obtained as above. If in algorithm GRBM, for any i = 0, . . . , n, the state ui (when the set B is ordered) is considered before any other u such that u^{-1}L = ui^{-1}L, then GRBM outputs a DFA isomorphic to A.

Proof. Let S = {u0, u1, . . . , un}, with u0 = λ. It is enough to see that at every step we have R ⊆ S and B ⊆ SΣ − S. Initially, R = {u0} and B = Σ ∩ Pr(D+ ∪ D−). Suppose the claim holds up to a certain step, and let u ∈ B be the state to be compared with R. If it is distinguishable from every member of R, the way the set B is traversed allows us to affirm that u = ui for some i (otherwise, if v ∈ B and v^{-1}L = u^{-1}L, v would be processed after ui). When ui is promoted to R, R ⊆ S still holds and, once B is recalculated, B ⊆ SΣ − S. If u is not distinguishable from all the states in R, then, as u ∈ SΣ − S and every member of R is in S, there is only one element of R that can be merged with u (because of the way that D+(S) and D−(S) have been constructed). The set R does not change (R ⊆ S) and the update of the set B keeps B ⊆ SΣ − S. When the algorithm finishes, R = S and B = ∅. Besides, by definition of the characteristic set, every transition of the automaton A appears in the output automaton.

If one merges states in an arbitrary way, the characteristic set is exponential, so there is no guarantee that an accurate output for a given input will be obtained. The use of automata teams increases the probability of good results.
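A compact Python rendering of the GRBM loop follows. It is a sketch only: detmerge is assumed to be given (it attempts a determinizing, consistency-preserving merge and returns the machine unchanged on failure), successors(M, S) is an assumed helper returning the states reachable from S by one transition, and random shuffling stands in for the arbitrary order of lines 8 and 10.

import random

def grbm(M, detmerge, successors, q0):
    # M is a prefix tree Moore machine; states are assumed sortable (strings).
    R = {q0}
    B = successors(M, R) - R
    while B:
        q = random.choice(sorted(B))                 # arbitrary blue state
        merged = False
        for p in random.sample(sorted(R), len(R)):   # arbitrary order over red
            M2 = detmerge(M, p, q)
            if M2 is not M:                          # merge succeeded
                M, merged = M2, True
                break
        if not merged:
            R.add(q)                                 # promote q to the red set
        B = successors(M, R) - R
    return M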
6 Experiments
In this section we examine the behavior of Algorithm 2 (GRBM) and compare it with two previous algorithms that have shown the best recognition rates: the Blue-Fringe EDSM, which outputs DFAs, and the DeLeTe2, which outputs NFAs. The authors of the latter affirm [3] that DeLeTe2 performs better than Blue-Fringe EDSM when samples are drawn from NFAs or regular expressions, and that the opposite happens when they are drawn from DFAs. We will first describe the data set used in the experiments and then the protocol and the recognition rates.

The programs used in the experiments are the software developed by the authors of DeLeTe2 [3] and a version of the Blue-Fringe EDSM implemented in [1] (which obtains slightly better results than those reported in [3] for the same set of experiments). Concerning the run time of the algorithms, GRBM, which has O(k × n^2) time complexity (where k is the number of automata in the team), is faster than both the Blue-Fringe EDSM and DeLeTe2 algorithms, and the Blue-Fringe EDSM algorithm is faster than DeLeTe2.

The data set we use is the corpus of the experiments for DeLeTe2 [3] together with the DFAs of [2]; we eliminate the repeated automata and thus obtain 102 regular expressions (re), 120 NFAs and 119 DFAs. We then generate 500 different training samples of length (randomly) varying from 0 to 18. The percentage of positive and negative samples is not controlled. The average number of states of the automata used is 20 (in the case of NFAs, the number of states when converted to DFAs is around 120), the average size of the regular expressions is 8 and the size of the alphabet is 2 (see [3]). These samples are distributed in five incremental sets of size 100, 200, 300, 400 and 500. We also generate 1000 test samples which are different from the training ones. The length of the test words also varies from 0 to 18. The test set is labeled by every automaton and we thus obtain the following groups: er 100, er 200, er 300, er 400, er 500, nfa 100, nfa 200, nfa 300, nfa 400, nfa 500, dfa 100, dfa 200, dfa 300, dfa 400, dfa 500.

Different runs of the algorithm GRBM may lead to different output automata. We aim to measure the recognition rates of automata teams obtained using GRBM. The protocol considered the languages of the corpus (119 languages obtained from random DFAs, 120 from NFAs and 102 from regular expressions). For each of the languages, teams of 5, 11, 21, 41 and 81 automata were inferred using training sets of increasing size (100, 200, 300, 400 and 500 samples). Every team was used to classify the test set using the following criteria: fair vote, weighted vote (inverse to the size of the automaton) and use of the smallest automaton. Aiming to obtain statistically uniform results, the protocol was repeated 10 times.

The best results, as expected, were obtained using the biggest team (81 automata). Classification done using the fair vote criterion cannot compete with the other criteria. The average results obtained for these teams are shown in Table 1, where they are compared with the results of the Blue-Fringe EDSM and DeLeTe2 algorithms.
Table 1. Comparison of the classification rates of our approach, the Blue-Fringe EDSM and the DeLeTe2 algorithms. The classification rates are established considering weighted vote (% w.v.), as well as those obtained by the smallest automaton of the team (% ↓FA). The third, fifth and seventh columns show the average size of the smallest automaton of the GRBM team and the sizes of the automata output by the Blue-Fringe EDSM and DeLeTe2 algorithms respectively.

              GRBM (81 FA)              Blue-fringe         DeLeTe2
Set       % w.v.  % ↓FA  |↓FA|        % rec.   |FA|       % rec.    |FA|
er 100     94.50  94.64   7.28         87.94  10.00        91.65   30.23
er 200     98.14  97.94   7.85         94.81   9.97        96.96   24.48
er 300     98.67  98.59   8.39         96.46  11.05        97.80   31.41
er 400     99.16  99.02   8.54         97.74  10.43        98.49   27.40
er 500     99.36  99.27   8.75         98.54  10.47        98.75   29.85
nfa 100    77.08  71.23  17.11         68.15  18.83        73.95   98.80
nfa 200    80.70  74.80  28.18         72.08  28.80        77.79  220.93
nfa 300    83.19  77.14  37.26         74.55  36.45        80.86  322.13
nfa 400    85.01  79.01  44.95         77.53  42.58        82.66  421.30
nfa 500    86.57  80.65  51.84         80.88  47.54        84.29  512.55
dfa 100    76.45  70.32  17.21         69.12  18.59        62.94  156.89
dfa 200    82.10  80.69  24.66         77.18  25.83        64.88  432.88
dfa 300    88.47  91.29  23.02         88.53  25.10        66.37  706.64
dfa 400    93.37  96.62  19.60         94.42  21.36        69.07  903.32
dfa 500    96.76  98.84  17.00         97.88  18.75        72.41 1027.42
Looking for uniform results, two new approaches were used in the experiments. The first one was to consider a weighted vote that gives more weight to the smaller automata, and thus the classification was made with a weight parameter inverse to the square of their size. The second approach aimed to select those automata with the right to vote. Several approaches were tried, and the best results were obtained when the automata of size smaller than the average size were selected. Once the automata were selected, the classification was done using a weighted vote inverse to the square of their size. The results are shown in Table 2, compared with the results of the Blue-Fringe EDSM and DeLeTe2 algorithms. Both approaches obtain better classification rates than the Blue-Fringe EDSM. Note that the number of selected automata with the right to vote is near half of the size of the team. This number increases when we have more information about the target language (languages obtained from regular expressions).

Figure 4 shows the comparison of the performance of the automata teams used in the experiments with the Blue-Fringe EDSM and DeLeTe2 algorithms. We have used teams of 5, 11, 21, 41 and 81 automata, although we only show the results of some of them to avoid confusion. The choice of the number of automata in each team was simple: we started with 5, a small odd number, and continued by roughly multiplying the number of automata by 2. It is also worth noting that the performance of the Blue-Fringe EDSM and DeLeTe2 algorithms is always worse than GRBM, except for the case of teams of 5 automata.
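The classification criteria used in these experiments can be written down in a few lines of Python. The fragment below is illustrative only: accepts(A, w) is an assumed predicate testing whether automaton A accepts w, and size(A) is assumed to return its number of states.

def team_classify(team, w, accepts, size, select_small=False, power=2):
    team = list(team)
    # Optionally keep only the automata smaller than the team's average size
    # (falling back to the full team if the filter would empty it).
    if select_small:
        avg = sum(size(A) for A in team) / len(team)
        team = [A for A in team if size(A) < avg] or team
    # Weighted vote: each automaton votes with weight 1 / size(A)^power.
    score = 0.0
    for A in team:
        weight = 1.0 / (size(A) ** power)
        score += weight if accepts(A, w) else -weight
    return score > 0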
Fig. 4. Comparative behavior between our approach (GRBM), RedBlue and DeLeTe2 algorithms
Table 2. Results obtained in the second set of experiments. GRBM classification rates consider weighted vote inverse to the square of the sizes of the automata (first and second columns). The second and third columns show respectively the classification rates and the number of automata selected when only automata smaller than the average size are selected for classification purposes.

              GRBM (81 FA)                              Blue-     DeLeTe2
           no select.   sel. FA size smaller than avg.  Fringe
Set         % w.v.^2      % w.v.^2      #FA
er 100        95.19         95.55      50.00             87.94      91.65
er 200        98.32         98.45      58.24             94.81      96.96
er 300        98.78         98.87      60.58             96.46      97.80
er 400        99.24         99.32      63.45             97.74      98.49
er 500        99.42         99.49      65.91             98.54      98.75
nfa 100       77.24         77.45      40.51             68.15      73.95
nfa 200       80.96         81.25      41.28             72.08      77.79
nfa 300       83.46         83.77      41.26             74.55      80.86
nfa 400       85.14         85.50      41.47             77.53      82.66
nfa 500       86.71         86.98      42.25             80.88      84.29
dfa 100       76.68         76.83      40.48             69.12      62.94
dfa 200       83.13         84.04      36.00             77.18      64.88
dfa 300       90.04         91.59      36.00             88.53      66.37
dfa 400       95.24         96.48      36.31             94.42      69.07
dfa 500       98.16         98.68      37.69             97.88      72.41

7 Conclusions
In this paper we propose the algorithm GRBM for automata inference. It uses the Red-Blue strategy to divide the states of the PTMM of the sample into two sets. The order in which states belonging to those sets are tried for merging is chosen arbitrarily. This algorithm, under certain conditions, converges with a polynomial characteristic set. As a byproduct, the existence of a polynomial characteristic set for the earlier Blue-Fringe EDSM algorithm has also been proved.

Different runs of GRBM may lead to different output automata. This fact has been used for learning using automata teams. Experiments done using different classification criteria for the test sets improve the classification rates obtained by both the Blue-Fringe EDSM and DeLeTe2 algorithms.
References

1. Álvarez, G.I.: Estudio de la Mezcla de Estados Determinista y No Determinista en el Diseño de Algoritmos para Inferencia Gramatical de Lenguajes Regulares. PhD Thesis, DSIC UPV (2008)
2. Coste, F., Fredouille, D.: Unambiguous automata inference by means of state merging methods. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) ECML 2003. LNCS (LNAI), vol. 2837, pp. 60–71. Springer, Heidelberg (2003)
3. Denis, F., Lemay, A., Terlutte, A.: Learning regular languages using RFSAs. Theoretical Computer Science 313(2), 267–294 (2004)
4. Gold, E.M.: Language identification in the limit. Information and Control 10, 447–474 (1967)
5. Gold, E.M.: Complexity of automaton identification from given data. Information and Control 37, 302–320 (1978)
6. de la Higuera, C., Oncina, J., Vidal, E.: Data dependent vs data independent algorithms. In: Miclet, L., de la Higuera, C. (eds.) ICGI 1996. LNCS (LNAI), vol. 1147, pp. 313–325. Springer, Heidelberg (1996)
7. Lang, K.J.: Random DFAs can be approximately learned from sparse uniform examples. In: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 45–52 (1992)
8. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 1–12. Springer, Heidelberg (1998)
9. Oncina, J., García, P.: Inferring regular languages in polynomial updated time. In: Pérez de la Blanca, Sanfeliú, Vidal (eds.) Pattern Recognition and Image Analysis. World Scientific, Singapore (1992)
Exact DFA Identification Using SAT Solvers

Marijn J.H. Heule (1) and Sicco Verwer (2)

(1) Delft University of Technology
[email protected]
(2) Eindhoven University of Technology
Abstract. We present an exact algorithm for identification of deterministic finite automata (DFA) which is based on satisfiability (SAT) solvers. Despite the size of the low level SAT representation, our approach is competitive with alternative techniques. Our contributions are fourfold: First, we propose a compact translation of DFA identification into SAT. Second, we reduce the SAT search space by adding lower bound information using a fast max-clique approximation algorithm. Third, we include many redundant clauses to provide the SAT solver with some additional knowledge about the problem. Fourth, we show how to use the flexibility of our translation in order to apply it to very hard problems. Experiments on a well-known suite of random DFA identification problems show that SAT solvers can efficiently tackle all instances. Moreover, our algorithm outperforms state-of-the-art techniques on several hard problems.
1
Introduction
The problem of identifying (learning) a deterministic finite state automaton (DFA) is one of the best studied problems in grammatical inference, see, e.g., [1]. A DFA is a well-known language model that can be used to recognize a regular language. The goal of DFA identification is to find a (non-unique) smallest DFA that is consistent with a set of given labeled examples. The size of a DFA is measured by the amount of states it contains. An identified DFA has to be as small as possible because of an important principle known as Occam’s razor, which states that among all possible explanations for a phenomenon, the simplest is to be preferred. A smaller DFA is simpler, and therefore a better explanation for the observed examples. DFA identification thus consists of finding the regular language that is most likely to have generated a set of labeled examples. This problem has many applications in for example computational linguistics, bioinformatics, speech processing, and verification. The problem of finding a smallest consistent DFA can be very difficult. It is the optimization variant of the problem of finding a consistent DFA of fixed size,
This is an extended version of: Marijn Heule and Sicco Verwer. Using a Satisfiability Solver to Identify Deterministic Finite State Automata. In BNAIC 2009, pp. 91-98. Supported by the Dutch Organization for Scientific Research (NWO) under grant 617.023.611.
J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 66–79, 2010. c Springer-Verlag Berlin Heidelberg 2010
Exact DFA Identification Using SAT Solvers
67
which has been shown to be NP-complete [2]. In spite of this hardness results, quite a few DFA identification algorithms exist, see, e.g., [1]. The current stateof-the-art in DFA identification is the evidence driven state-merging (EDSM) algorithm [3]. Essentially, EDSM is a heuristic method that tries to find a good local optimum efficiently. It has been shown using a version of EDSM called RPNI, that it is guaranteed to efficiently converge to the global optimum in the limit [4]. However, wrapping a specialized search procedure around the EDSM heuristic method will typically lead to better results, see, e.g., [5,6,7,8]. Although the different search techniques improve the performance of EDSM, they are still less advanced than solvers for well-studied problems such as graph coloring and satisfiability (SAT). Especially SAT solvers have become very powerful in the last decade. The power of these solvers can be used in other problems by translating these problems into SAT instances, and subsequently running a SAT solver on these translated problems. This approach is very competitive for several problems, see, e.g., [9,10,11]. We adopt this approach for DFA identification. In [12], such a translation is introduced from DFA identification into graph coloring. The main idea of this translation is to use a distinct color for every state of the identified DFA. The nodes in the graph coloring instance represent the labeled examples. Two nodes are connected if the examples they represent have different labels, i.e., if they cannot end in the same state in the DFA. Dynamic constraints are used to guarantee that examples with different labels cannot end in the same state. The amount of colors used in the graph coloring problem is equal to the size of the identified DFA, and hence this should be as small as possible. Finding this minimum can be done by iterating over this amount. An alternative approach [13] uses the well-known translation of DFA identification to an integer constraint satisfaction problem (CSP) from [14]. It translates this CSP into SAT in two ways: a unary and a binary encoding of the integers. Again, the minimum can be found by iterating over the number of states. We propose a different method inspired by the encoding by [12]. The main problem we solve is how to efficiently encode the graph coloring constraints of [12] into SAT. A naive direct encoding [15] of these constraints would lead to O(k 2 |V |2 ) clauses, where k is the size of the identified DFA, and V is the set of labeled examples. Such a direct encoding is in fact identical to the unary encoding from [13], which can be considered the current state-of-the-art in translations of DFA identification to SAT. Our encoding, however, requires only O(k 2 |V |) clauses. The crucial part of our translation is the use auxiliary variables to represent the problem more efficiently. In addition, we apply symmetry breaking [16] to prevent overlapping searches with different colors by preprocessing the result of our translation with a fast max-clique approximation algorithm. Furthermore, we add many redundant clauses to our translation that provide the SAT solver with some additional knowledge about the DFA identification instance. A nice feature of our encoding is that it is flexible in the sense that it can also be applied to partially identified DFAs. Starting with a partially identified DFA reduces the size of the SAT instance significantly. 
Thus, one could use our encoding as a subprocess in a larger DFA identification algorithm as follows:
first identify a small part of a DFA, and then run our encoding to determine how many additional states are required. In this way, our encoding can also be applied in cases where the number of clauses resulting from our initial encoding is too large for the current state-of-the-art SAT solvers. The contributions of this paper are thus fourfold:
– We introduce a simple and efficient encoding of DFA identification to SAT.
– We suggest max-clique symmetry breaking to reduce the search space.
– We add redundant clauses to improve the performance of the SAT solver.
– We show how the flexibility of our encoding can be used in order to apply it to very hard problems.
We compare the performance of our SAT approach with the naive direct encoding and two state-of-the-art search procedures for EDSM. We first tested these algorithms on a set of well-known benchmark problem instances. These results show that our approach is competitive with the state-of-the-art in DFA identification, and significantly outperforms the current state-of-the-art in translations of DFA identification to SAT. In addition, we tested our approach on a suite of very hard instances. For this second experiment we applied our encoding to a DFA that was partially identified by EDSM. This experiment shows that the flexibility of our encoding allows it to be applied to very difficult DFA identification problems. In a few of these instances we could determine the exact solution starting from a short initial run of EDSM. In addition, during these experiments we discovered that our max-clique symmetry breaking technique can potentially be used to reduce the search space of the EDSM algorithm. Adapting EDSM to make use of this technique is left as future work. This paper is organized as follows. We start with a short description of the EDSM algorithm (Section 2) and the translation into graph coloring (Section 3). We then give our translation into SAT, including symmetry breaking and redundant clauses (Section 4). Next, we explain the application of our encoding to partially identified DFAs (Section 5). We present our experimental results (Section 6), and end with some conclusions and ideas for future work (Section 7).
2 The State-of-the-Art in DFA Identification
We assume the reader to be familiar with the theory of languages and automata. A deterministic finite state automaton (DFA) A is an automaton model consisting of states and labeled transitions. It recognizes those symbol sequences formed by the labels of transitions on paths from a specific start state to a final state. In this way, DFAs can be used to recognize any regular language. We use L(A) to denote the language of a DFA A. Given a pair of finite sets of positive sample strings S+ and negative sample strings S−, called the input sample, the goal of DFA identification is to find a smallest DFA A that is consistent with S = {S+, S−}, i.e., such that S+ ⊆ L(A) and S− ⊆ Σ* \ L(A) (where Σ* is the set of all strings). The size of a DFA is measured by the usual measure, i.e., by the number of states it contains.
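For concreteness, the consistency requirement can be checked mechanically. The following Python sketch (ours, not from the paper; the transition-table representation is chosen only for illustration) tests whether a candidate DFA is consistent with a sample (S+, S−).

# A DFA is represented as a dict delta[(state, symbol)] -> state,
# together with a start state and a set of accepting states.
def accepts(delta, start, finals, word):
    """Return True/False if the run is defined, or None if it falls off the DFA."""
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return None  # undefined transition: the string is not accepted
        state = delta[(state, symbol)]
    return state in finals

def consistent(delta, start, finals, s_plus, s_minus):
    """A DFA is consistent with S = {S+, S-} iff it accepts every string in S+
    and accepts no string in S-."""
    ok_pos = all(accepts(delta, start, finals, w) is True for w in s_plus)
    ok_neg = all(accepts(delta, start, finals, w) is not True for w in s_minus)
    return ok_pos and ok_neg

# Example: a two-state DFA over {'a', 'b'} accepting strings with an even number of 'b's.
delta = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 0}
print(consistent(delta, 0, {0}, ['', 'aa', 'bb'], ['b', 'ab']))  # True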
Fig. 1. An augmented prefix tree acceptor for S = (S+ = {a, abaa, bb}, S− = {abb, b}) (left) and the corresponding consistency graph (right). Some vertex pairs in the consistency graph are not directly inconsistent, but become inconsistent due to determinization. For instance, states 2 and 6 are inconsistent because the strings abb and bb will end in the same state if these states are merged. Likewise, states 1 and 2 are inconsistent because the strings a and abb will end in the same state if these states are merged.
The idea of a state-merging algorithm is to first construct a tree-shaped DFA from this input, and then to merge the states of this DFA. Such a tree-shaped DFA is called an augmented prefix tree acceptor (APTA), see Figure 1. An APTA is a DFA such that the computations of two strings s and s′ reach the same state q if and only if s and s′ share the same prefix until they reach q, hence the name prefix tree. An APTA is called augmented because it contains (is augmented with) states for which it is not yet known whether they are accepting or rejecting. No execution of any sample string from S ends in such a state. We use V, V+, and V− to denote all states, the accepting states, and the rejecting states in the APTA, respectively. A merge of two states q and q′ combines the states into one: it creates a new state q′′ that has the incoming and outgoing transitions of both q and q′. Such a merge is only allowed if the states are consistent, i.e., it is not the case that q is accepting while q′ is rejecting, or vice versa. Whenever a merge introduces a non-deterministic choice, i.e., the merged state is the source of two transitions with the same symbol, the target states of these transitions are merged as well. This is called the determinization process, and it is continued until there are no non-deterministic choices left. Of course, all of these merges should be consistent too. A state-merging algorithm iteratively applies this merging process until no more consistent merges are possible. Currently, the most successful method for solving the DFA identification problem is the evidence-driven state-merging (EDSM) algorithm in the red-blue framework [3]. EDSM is a greedy procedure that uses a simple heuristic to determine which merge to perform. In grammatical inference, there is a lot of research into developing advanced and efficient search techniques for EDSM. The idea is to increase the quality of a solution by searching other paths in addition to the path determined by the greedy EDSM heuristic. Examples of such advanced techniques are dependency directed backtracking [5], using mutually (in)compatible
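As an illustration (not code from the paper), the following Python sketch builds an APTA from a sample. State 0 is the root; labels are stored as True (accepting), False (rejecting), or None (unknown). The name build_apta is our own.

def build_apta(s_plus, s_minus):
    """Build an augmented prefix tree acceptor for the sample (S+, S-)."""
    delta = {}          # (state, symbol) -> state
    label = {0: None}   # state -> True (accepting) / False (rejecting) / None (unknown)
    next_state = 1

    def add(word, accepting):
        nonlocal next_state
        state = 0
        for symbol in word:
            if (state, symbol) not in delta:
                delta[(state, symbol)] = next_state
                label[next_state] = None
                next_state += 1
            state = delta[(state, symbol)]
        label[state] = accepting  # only the end of a sample string gets a label

    for w in s_plus:
        add(w, True)
    for w in s_minus:
        add(w, False)
    return delta, label

# The sample used in Figure 1:
delta, label = build_apta(['a', 'abaa', 'bb'], ['abb', 'b'])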
Fig. 2. The red-blue framework. The red states (labeled R) are the identified parts of the automaton. The blue states (labeled B) are the current candidates for merging. The uncolored states are pieces of the APTA.
merges [6], and searching most-constrained nodes first [7]. A comparison of different search techniques for EDSM can be found in [8]. Typically, a time bound is set and the algorithm is stopped when its running time exceeds this bound. However, it can guarantee that it has found the optimal solution (a smallest DFA) if all smaller solutions have been visited by its breadth-first search. In total, EDSM tries |V|^2 possible merges in every iteration, and since V can be very large, this can take a very large amount of time. In order to avoid this, EDSM is often applied within the red-blue framework. The red-blue framework maintains a core of red states with a fringe of blue states, see Figure 2. A red-blue state-merging algorithm performs merges only between blue and red states. If no red-blue merge is possible, the algorithm changes the color of a blue state into red. This framework reduces the number of possible merges significantly without reducing the number of possible solutions, i.e., the algorithm is still complete. Since the algorithm is guaranteed not to change any of the transitions between red states, the red core of the DFA can be viewed as a part of the DFA that is already identified. Within the red-blue framework, EDSM is a polynomial-time (greedy) algorithm that converges quickly to a local optimum. The evidence measure used by EDSM is based on the idea that bad merges can often be avoided by performing those merges that have passed the most tests for consistency, and are hence most likely to be correct. Despite its simplicity, and using this evidence measure, EDSM participated in and won (in a tie) the Abbadingo DFA learning competition in 1997 [3]. The competition consisted of sparse data sets; in it, EDSM was capable of approximately (with 99% accuracy) learning a DFA with 500 states from a training set consisting of 60,000 strings. The current state-of-the-art techniques are two simple search strategies called ed-beam and exbar [7]. The ed-beam procedure calculates one greedy EDSM path starting from every node in the search tree in breadth-first order. The smallest
DFA found by these EDSM paths is returned as a solution. This solution then serves as an upper bound of the DFA size for the breadth-first search. The exbar procedure iteratively runs EDSM with an increasing upper bound on the number of DFA states. It continues this procedure until a solution is found. To reduce the size of the search space, exbar searches the most-constrained nodes first.
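The evidence heuristic itself is not spelled out here; one common formulation, sketched below in Python under our own assumptions, scores a candidate merge on a tree-shaped APTA by counting the labeled state pairs that agree when the two subtrees are folded together, and rejects the merge on any conflict.

def merge_score(delta, label, q1, q2, alphabet):
    """Return an evidence score for merging q1 and q2 in a tree-shaped APTA,
    or None if the merge (after determinization) is inconsistent."""
    score = 0
    if label[q1] is not None and label[q2] is not None:
        if label[q1] != label[q2]:
            return None            # accepting merged with rejecting: conflict
        score += 1                 # two agreeing labels count as evidence
    for a in alphabet:
        c1, c2 = delta.get((q1, a)), delta.get((q2, a))
        if c1 is not None and c2 is not None:
            child = merge_score(delta, label, c1, c2, alphabet)  # forced child merges (determinization)
            if child is None:
                return None
            score += child
    return score

An EDSM-style greedy step would then simply pick the red-blue pair with the highest score among all consistent candidate merges.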
3 From DFA Identification to Graph Coloring
The EDSM search techniques are usually based on successful techniques for other, more actively studied problems, such as satisfiability and graph coloring. There have been many competitions for algorithms that solve these problems, and these solvers are therefore highly optimized. Although the different search techniques improve the performance of EDSM, and the implementations use efficient data structures, a lot of work still has to be done before the EDSM implementations are as efficient and advanced as the solvers for these problems. Since the decision version of DFA identification is NP-complete, it is also possible to translate the DFA identification problem into a more actively studied problem, and thus make use of the optimized search techniques immediately. In [12], such a translation is introduced from DFA identification into graph coloring. The main idea of this translation is to use a distinct color for every state of the identified DFA. Every vertex in the graph of the graph coloring problem represents a distinct state in the APTA. Two vertices v and w in this graph are connected by an edge (cannot be assigned the same color) if merging v and w results in an inconsistency (i.e., an accepting state is merged with a rejecting state). These edges are called inequality constraints. Figure 1 shows an example of such a graph. In addition to these inequality constraints, equality constraints are required: if the parents p(v) and p(w) of two vertices v and w with the same incoming label are merged, then v and w must be merged too. With the addition of these constraints, some of the inequality constraints become redundant: only the directly inconsistent edges (between accepting and rejecting states) are actually necessary; the other edges (resulting from the determinization process) are no longer needed because they logically follow from combining the direct constraints and the equality constraints. These redundant constraints are kept in our translation in order to help the search process. In the graph coloring problem, the equality constraints imply that the two parent nodes p(v) and p(w) can get the same color only if v and w get the same color. Such a constraint is difficult to express in graph coloring. In [12], this is dealt with by modifying the graph according to the consequences of these constraints. This implies that a new graph coloring instance has to be solved every time an equality constraint is used. We propose a different method to handle these constraints, namely by encoding them directly into satisfiability. In addition, using auxiliary variables, we reduce the number of constraints that are required by the encoding.
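As an illustration under our own representation (not code from [12]), the direct inequality constraints and the pairs used for the equality constraints can be collected from an APTA built as above; the determinization-induced edges are omitted here, since, as noted, they follow from the direct constraints combined with the equality constraints.

def constraints_from_apta(delta, label):
    """Collect direct conflict edges E (accepting vs. rejecting states) and the set E_L
    of vertex pairs that share the same incoming label, used for equality constraints."""
    accepting = [q for q, l in label.items() if l is True]
    rejecting = [q for q, l in label.items() if l is False]
    conflict_edges = {(v, w) for v in accepting for w in rejecting}

    parent, in_label = {}, {}
    for (p, a), child in delta.items():
        parent[child], in_label[child] = p, a

    same_label_pairs = {(v, w)
                        for v in in_label for w in in_label
                        if v < w and in_label[v] == in_label[w]}
    return conflict_edges, same_label_pairs, parent, in_label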
4 Translating DFA Identification into SAT
The satisfiability problem (SAT) deals with the question whether there exists an assignment to Boolean variables such that a given formula evaluates to true. Such a formula in conjunctive normal form (CNF) is a conjunction (∧) of clauses, each clause being a disjunction (∨) of literals. Literals refer either to a Boolean variable x_i or to its negation ¬x_i. In the last decade, SAT solvers have become very powerful. This can be exploited by translating a problem into CNF and solving it with a SAT solver. Despite the low-level representation, such an approach is very competitive for several problems. Examples are bounded model checking [9], equivalence checking [10] and rewriting termination problems [11]. Below we present such an approach to DFA identification.
4.1 Direct Encoding
Our translation reduces DFA identification into a graph coloring problem [12], which in turn is translated into SAT. A widely used translation of graph coloring problems into SAT is known as the direct encoding [15]. Given a graph G = (V, E) and a set of colors C, the direct encoding uses (Boolean) color variables x_{v,i} with v ∈ V and i ∈ C. If x_{v,i} is assigned to true, it means that vertex v has color i. The constraints on these variables are as follows (see Table 1 for details): for each vertex, at-least-one color clauses make sure that each vertex is colored, while at-most-one color clauses forbid that a vertex can have multiple colors. The latter clauses are redundant. Additionally, we have to translate that adjacent vertices cannot have the same color. The direct encoding uses the following clauses:
⋀_{i∈C} ⋀_{(v,w)∈E} (¬x_{v,i} ∨ ¬x_{w,i})
Finally, let E_L be the set consisting of pairs of vertices that have the same incoming label in the APTA. In case the parents p(v) and p(w) of such a pair (v, w) ∈ E_L have the same color, then v and w must have the same color as well. This corresponds to the equality constraints in [12]. A straightforward translation of these constraints into CNF is:
⋀_{i∈C} ⋀_{j∈C} ⋀_{(v,w)∈E_L} (¬x_{p(v),i} ∨ ¬x_{p(w),i} ∨ ¬x_{v,j} ∨ x_{w,j}) ∧ (¬x_{p(v),i} ∨ ¬x_{p(w),i} ∨ x_{v,j} ∨ ¬x_{w,j})
This encoding is identical to the CSP-based translation given in [13], and can be considered the current state-of-the-art in translations of DFA identification to SAT. Notice that the size of the direct encoding is O(|C|^2 |V|^2). For interesting DFA identification problems this results in a formula that is too large for the current state-of-the-art SAT solvers. Therefore we propose a more compact encoding below.
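A sketch of this direct encoding as CNF clause generation (Python, DIMACS-style integer literals; the helper names and variable numbering are ours, building on the sketches above) might look as follows.

def direct_encoding(num_vertices, colors, conflict_edges, same_label_pairs, parent):
    """Generate the clauses of the direct encoding as lists of non-zero integers."""
    def x(v, i):                      # variable number of x_{v,i}
        return v * colors + i + 1

    clauses = []
    for v in range(num_vertices):     # at-least-one and at-most-one color
        clauses.append([x(v, i) for i in range(colors)])
        clauses += [[-x(v, i), -x(v, j)]
                    for i in range(colors) for j in range(i + 1, colors)]
    for (v, w) in conflict_edges:     # adjacent vertices must get different colors
        clauses += [[-x(v, i), -x(w, i)] for i in range(colors)]
    for (v, w) in same_label_pairs:   # equality constraints via parents p(v), p(w)
        pv, pw = parent[v], parent[w]
        for i in range(colors):
            for j in range(colors):
                clauses.append([-x(pv, i), -x(pw, i), -x(v, j), x(w, j)])
                clauses.append([-x(pv, i), -x(pw, i), x(v, j), -x(w, j)])
    return clauses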
4.2 Compact Encoding
The majority of clauses in the direct encoding originate from translating the equality constraints into SAT. We propose a more efficient encoding based on auxiliary variables y_{a,i,j}, which we refer to as parent relation variables. If set to true, y_{a,i,j} means that for any vertex with color i, the child reached by label a has color j. Let l(v) denote the incoming label of vertex v, and let c(v) denote the color of vertex v. As soon as both a child v_i and its parent p(v_i) are colored, we force the corresponding parent relation variable to true by the clause (y_{l(v_i),c(p(v_i)),c(v_i)} ∨ ¬x_{p(v_i),c(p(v_i))} ∨ ¬x_{v_i,c(v_i)}). This leads to O(|C|^2 |V|) clauses. Additionally, we require at-most-one parent relation clauses to guarantee that each relation is unique; see Table 1 for details. This new encoding reduces the number of clauses significantly. To reduce the size further, we introduce an additional set of auxiliary variables z_i with i ∈ C. If z_i is true, color i is only used for accepting vertices. Therefore, we refer to them as accepting color variables. They are used for the constraint that requires all accepting vertices to be colored differently from the rejecting states. Without auxiliary variables, this can be encoded as (¬x_{v,i} ∨ ¬x_{w,i}) for v ∈ V+, w ∈ V−, i ∈ C, resulting in |V+| · |V−| · |C| clauses. Using the auxiliary variables z_i, the same constraints can be encoded as (¬x_{v,i} ∨ z_i) ∧ (¬x_{w,i} ∨ ¬z_i), requiring only (|V+| + |V−|)|C| clauses.
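Continuing the illustrative clause generator from Section 4.1 (our own variable numbering, not the authors' code), the compact encoding can be sketched as follows.

def compact_encoding(num_vertices, colors, alphabet, label, parent, in_label):
    """Compact encoding with parent relation variables y and accepting color variables z."""
    n_x = num_vertices * colors
    n_y = len(alphabet) * colors * colors
    a_index = {a: k for k, a in enumerate(sorted(alphabet))}

    def x(v, i): return v * colors + i + 1
    def y(a, i, j): return n_x + (a_index[a] * colors + i) * colors + j + 1
    def z(i): return n_x + n_y + i + 1

    clauses = []
    for v in range(num_vertices):
        clauses.append([x(v, i) for i in range(colors)])        # at least one color
        lbl = label.get(v)
        for i in range(colors):                                  # accepting color clauses
            if lbl is True:
                clauses.append([-x(v, i), z(i)])
            elif lbl is False:
                clauses.append([-x(v, i), -z(i)])
    for v in range(num_vertices):
        if v not in parent:                                      # the root has no parent
            continue
        pv, a = parent[v], in_label[v]
        for i in range(colors):
            for j in range(colors):
                clauses.append([y(a, i, j), -x(pv, i), -x(v, j)])  # set the parent relation
    for a in alphabet:                                           # at most one target color
        for i in range(colors):
            clauses += [[-y(a, i, h), -y(a, i, j)]
                        for h in range(colors) for j in range(h + 1, colors)]
    return clauses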
4.3 Symmetry Breaking
In case a graph cannot be colored with k colors, the corresponding (unsatisfiable) SAT instance will in effect be solved k! times: once for each permutation of the colors. Therefore, when dealing with CNF formulas representing graph coloring problems, it is good practice to add symmetry breaking predicates (SBPs) [16]. Notice that in any valid coloring of a graph, all vertices in a clique must have a different color. So, one can fix the vertices of a large clique to distinct colors in a preprocessing step. Although finding the largest clique in a graph is NP-complete, a large clique can be computed cheaply using a greedy algorithm: start with the vertex v_0 with the highest degree; in each step i, add the vertex v_i that is connected to all vertices v_0 to v_{i−1}, again choosing the one with the highest degree. Because the corresponding graph of an APTA can be huge (many edges), we propose a variant of this algorithm. First, compute the induced subgraph of accepting vertices (v ∈ V+) and determine a large clique in this subgraph. Second, in a similar way find a large clique among the rejecting vertices (v ∈ V−). Because all accepting vertices are connected to all rejecting vertices, the union of both cliques is also a clique. This variant often provides a clique that is larger than the clique found using the entire APTA. In addition, the computation costs are very low.
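A possible implementation of this greedy clique search (again our own sketch; the conflict graph is given as an adjacency-set dictionary) is shown below.

def greedy_clique(adj, candidates):
    """Greedily grow a clique inside 'candidates', always picking the vertex of
    highest degree that is adjacent to everything chosen so far."""
    clique = []
    pool = set(candidates)
    while pool:
        v = max(pool, key=lambda u: len(adj[u]))
        clique.append(v)
        pool = {u for u in pool if u != v and u in adj[v]}
    return clique

def apta_clique(adj, accepting, rejecting):
    """Variant used here: a clique among accepting vertices plus a clique among
    rejecting vertices; their union is a clique because every accepting vertex
    conflicts with every rejecting vertex."""
    return greedy_clique(adj, accepting) + greedy_clique(adj, rejecting)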
4.4 Adding Redundant Clauses
The compact encoding discussed above can be extended with several types of redundant clauses; see Table 1 for an overview.
Table 1. Encoding of DFA identification into SAT. C = set of colors, L = set of labels (alphabet), V = vertices, E = conflict edges.

Variables | Range | Meaning
x_{v,i} | v ∈ V; i ∈ C | x_{v,i} ≡ 1 iff vertex v has color i
y_{a,i,j} | a ∈ L; i, j ∈ C | y_{a,i,j} ≡ 1 iff parents of vertices with color j and incoming label a must have color i
z_i | i ∈ C | z_i ≡ 1 iff an accepting state has color i

Clauses | Range | Meaning
(x_{v,1} ∨ x_{v,2} ∨ · · · ∨ x_{v,|C|}) | v ∈ V | each vertex has at least one color
(¬x_{v,i} ∨ z_i) ∧ (¬x_{w,i} ∨ ¬z_i) | v ∈ V+; w ∈ V−; i ∈ C | accepting vertices cannot have the same color as rejecting vertices
(y_{l(v),i,j} ∨ ¬x_{p(v),i} ∨ ¬x_{v,j}) | v ∈ V; i, j ∈ C | a parent relation is set when a vertex and its parent are colored
(¬y_{a,i,h} ∨ ¬y_{a,i,j}) | a ∈ L; h, i, j ∈ C; h < j | each parent relation can target at most one color

Redundant Clauses | Range | Meaning
(¬x_{v,i} ∨ ¬x_{v,j}) | v ∈ V; i, j ∈ C; i < j | each vertex has at most one color
(y_{a,i,1} ∨ y_{a,i,2} ∨ · · · ∨ y_{a,i,|C|}) | a ∈ L; i ∈ C | each parent relation must target at least one color
(¬y_{l(v),i,j} ∨ ¬x_{p(v),i} ∨ x_{v,j}) | v ∈ V; i, j ∈ C | a parent relation forces a vertex once the parent is colored
(¬x_{v,i} ∨ ¬x_{w,i}) | i ∈ C; (v, w) ∈ E | all determinization conflicts explicitly added as clauses
First, we can explicitly state that each vertex must be colored with exactly one color by adding the redundant at-most-one color clauses (¬x_{v,i} ∨ ¬x_{v,j}) with v ∈ V and i, j ∈ C, i < j. Similarly, we can explicitly state that for each combination of a color and a label exactly one parent relation variable must be true. This is achieved by adding the at-least-one parent relation clauses (⋁_{j∈C} y_{a,i,j}) for all a ∈ L and i ∈ C. Also, once a parent relation is set and some vertices have the source color, then all child nodes should have the target color: (¬y_{l(v),i,j} ∨ ¬x_{p(v),i} ∨ x_{v,j}) for v ∈ V and i, j ∈ C. All three types of clauses are known as blocked clauses [17]. These clauses have at least one literal that cannot be removed by resolution. Therefore, blocked clauses cannot be used to derive the empty clause (i.e., to show that the formula is unsatisfiable). So, formulas with and without blocked clauses are equisatisfiable. We will show that these blocked clauses improve the performance of DFA identification. However, for other problems, removal of blocked clauses results in a speed-up [18]. Other types of redundant clauses consist of a second constraint on the parent relation and adding all edges that are not covered by the accepting color clauses. Although these clauses are redundant, they provide some additional knowledge about the problem to the SAT solver. However, since the largest number of clauses is created by the first parent relation constraint, and since there can be an even larger number of conflicts, the addition of these clauses could potentially blow up the size of the encoding.
4.5 Iterative SAT Solving
The translation of DFA identification into SAT (direct encoding, compact encoding with and without redundant clauses) uses a fixed set of colors. To prove that the minimal size of a DFA equals k, we have to show that the translation with k colors is satisfiable and that the translation with k − 1 colors is unsatisfiable. The following procedure is used to determine the minimal size (a code sketch of this loop is given below):
S1: find a large clique L (set of vertices) in the graph representing the APTA.
S2: initialize the set of colors C in such a way that |C| = |L|.
S3: construct a CNF by translating the APTA based on C and SBPs on L.
S4: solve the formula of step S3.
S5: if the formula is unsatisfiable, then add a color to C and go to step S3.
S6: return the DFA found in step S4.
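A minimal sketch of this loop, assuming a placeholder solve_cnf(clauses) backed by any SAT solver (for example one reading DIMACS input, such as picosat), the illustrative encoders and clique routine from above, and two hypothetical helpers (build_conflict_graph, decode_dfa) that are not shown:

def identify_min_dfa(apta, solve_cnf):
    """Iteratively increase the number of colors until the encoding is satisfiable.
    solve_cnf(clauses) is assumed to return a model (list of true literals) or None."""
    delta, label = apta
    # build_conflict_graph is an assumed helper returning the conflict graph and APTA structure
    adj, parent, in_label, accepting, rejecting = build_conflict_graph(delta, label)
    clique = apta_clique(adj, accepting, rejecting)
    colors = len(clique)
    alphabet = {a for (_, a) in delta}
    while True:
        clauses = compact_encoding(len(label), colors, alphabet, label, parent, in_label)
        # symmetry breaking: fix each clique vertex to its own color (unit clauses on x_{v,i})
        clauses += [[v * colors + i + 1] for i, v in enumerate(clique)]
        model = solve_cnf(clauses)
        if model is not None:
            return decode_dfa(model, colors, delta, in_label)  # assumed helper
        colors += 1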
5 Translating Partially Identified DFAs
In spite of the efficiency of our translation, there can still be cases where the above procedure leads to a formula that is too large for the current state-of-the-art SAT solvers. For instance, the Abbadingo problem set [3] contains some very difficult problems that require hundreds of colors, resulting in over 100,000,000 clauses. Since the current state-of-the-art SAT solvers are known to work well up to 5,000,000 clauses, this is much too large. In such cases another nice feature of our encoding can be used, which is that it also works when the input is a (partially identified) DFA instead of an APTA. Thus, a simple method that can be used to reduce the size of the problem is to:
1. apply a few steps of the EDSM algorithm, and then
2. apply our translation to SAT.
Every merge that is performed before applying the translation reduces the size of V significantly. Therefore the encoding also becomes much smaller. The price to pay is of course that the solution provided by the SAT solver will no longer be exact. The first few merges are performed by a greedy procedure, and hence they can lead to a larger DFA size. These first few merges will however be based on a lot of evidence. Consequently, they are likely to be correct, i.e., they are likely to lead to the optimal solution. Intuitively, this approach should therefore work well in practice; the main problem is to know how many merges to perform. For more information on the EDSM evidence value and its rationale, the reader is referred to [3]. An additional benefit of first applying the EDSM algorithm is that we automatically obtain a clique of conflicting states: no red state can be merged with another red state. Hence, the red states in a partially identified DFA resulting from a few steps of the EDSM algorithm can be used to construct symmetry breaking predicates. These predicates can be used instead of the ones resulting from the greedy max-clique algorithm.
Fig. 3. Results on the set from [19,7]. The graph shows run-times in seconds of exbar, ed-beam, our encoding with and without redundant clauses, and the naive direct encoding. The horizontal axis shows the instances of DFA size 16 or more sorted by run-time.
6 Results
Our experiments are based on a suite of 810 instances (available at http://algos.inesc.pt/~aml/tar_files/moore_dfas.tar.gz) that were also used to evaluate exact DFA identification algorithms in [19,7]. The suite is partitioned into sizes ranging from 4 to 21. Since the larger ones are more difficult, we focus on the sizes 16 to 21. In addition, we performed tests on some instances from the challenging Abbadingo problem set [3]. All tests were performed on an Intel Pentium 4, 3.0 GHz with 1 GB of memory running Fedora Core 8. We compare two implementations of our SAT encoding with the current state-of-the-art in exact DFA identification: ed-beam and exbar (see Section 2 for a description). In addition, we include the naive encoding described in Section 4. This encoding can be considered as the current state-of-the-art translation to SAT. All SAT algorithms follow the iterative SAT solving procedure presented in Section 4.5. We used picosat [20] to solve the CNF instances. The performance is measured by summing the computational costs of all unsatisfiable runs together with the time to solve the smallest satisfiable instance. Figure 3 shows the run-times of all algorithms. All algorithms except the naive encoding solved the full suite within 200 seconds per instance. Most problems require almost no search at all and are solved by all algorithms except the naive encoding in a few seconds. Some of the larger problems, however, do require some search time, and there one clearly sees the strength of our approach: it outperforms the state-of-the-art on these instances. An interesting observation is the effect of the redundant clauses; without these, the SAT solver no longer outperforms the state-of-the-art. The huge difference between the naive and our
encoding clearly shows the benefit of using the auxiliary variables we introduced in our encoding. On average, the ed-beam and breadth-first search implementations are faster. However, the SAT translation with all redundant clauses performed best on the hardest problems. These results are promising, since they show that the search techniques used by SAT can be a lot more efficient than the state-of-the-art search variants of EDSM. We also experimented with instances of the Abbadingo challenge [3] (available at http://www-bcl.cs.may.ie/data-sets.html). Initially, these benchmarks appeared too hard for our exact SAT approach. The smallest problem has an APTA with 12,796 states, resulting in 77,730,715 clauses. Therefore, we first ran a few iterations of EDSM and applied our translation as soon as the size of the partial DFA was small enough. Throughout our experiments, we observed that a partial DFA of about 5000 states is currently the limit of what state-of-the-art SAT solvers can manage. This means that for the smallest problems (#1, #4, and #7) about a dozen merge steps are required. Using this combined approach we were able to solve the first four challenge problems.
7 Conclusions and Future Work
We presented an efficient translation from DFA identification into satisfiability. By performing this transformation, we are able to make direct use of the advanced search techniques that are currently used in satisfiability solving. The result is a simple, efficient, and advanced algorithm for solving the DFA identification problem. In our experimental results, we show that our approach is very competitive with the current state-of-the-art in DFA identification. It even outperformed the state-of-the-art on several hard instances. In addition, we show that the flexibility of our transformation can be used to apply it to very challenging DFA identification instances. The use of auxiliary variables by this transformation results in a significant improvement in the number of required clauses with respect to the current state-of-the-art in translating DFA identification to SAT [12,13]. Our transformation only requires O(k^2|V|) clauses for a DFA identification problem, where k is the size of the sought DFA and V is the set of states of the APTA constructed from the input sample. Using the current state-of-the-art, we would have required O(k^2|V|^2) clauses. Since |V| is typically large, this is a big improvement. We plan to experiment with alternative translations. Most of the graph coloring to SAT translations presented in [21] can be used for DFA identification too. In order to determine the usefulness of these alternatives, we will construct variants that use auxiliary variables to reduce the size of these translations as well. We make use of symmetry breaking predicates in order to prevent overlapping searches with different colors. These predicates are produced by preprocessing the result of the transformation with a fast max-clique approximation algorithm. In the future, we would like to perform this symmetry breaking also dynamically,
i.e., during the satisfiability solving. This is a new technique for graph-coloring-based satisfiability solving, proposed in [22], that shows promising improvements. In our experiments, the greedy max-clique algorithm often discovered a larger clique than the one induced by the red states in a partially identified DFA. This opens up a very interesting path for future work in DFA identification, namely to replace the red states in EDSM by the states represented by this clique. Since this clique poses more constraints on the partially identified DFA, we believe this will improve the performance of the EDSM algorithm.
References
1. de la Higuera, C.: A bibliographical study of grammatical inference. Pattern Recognition 38(9), 1332–1348 (2005)
2. Gold, E.M.: Complexity of automaton identification from given data. Information and Control 37(3), 302–320 (1978)
3. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, p. 1. Springer, Heidelberg (1998)
4. Oncina, J., Garcia, P.: Inferring regular languages in polynomial update time. In: Pattern Recognition and Image Analysis. Series in Machine Perception and Artificial Intelligence, vol. 1, pp. 49–61. World Scientific, Singapore (1992)
5. Oliveira, A.L., Marques-Silva, J.P.: Efficient search techniques for the inference of minimum sized finite state machines. In: SPIRE, pp. 81–89 (1998)
6. Abela, J., Coste, F., Spina, S.: Mutually compatible and incompatible merges for the search of the smallest consistent DFA. In: Paliouras, G., Sakakibara, Y. (eds.) ICGI 2004. LNCS (LNAI), vol. 3264, pp. 28–39. Springer, Heidelberg (2004)
7. Lang, K.J.: Faster algorithms for finding minimal consistent DFAs. Technical report, NEC Research Institute (1999)
8. Bugalho, M., Oliveira, A.L.: Inference of regular languages using state merging algorithms with search. Pattern Recognition 38, 1457–1467 (2005)
9. Biere, A., Cimatti, A., Clarke, E.M., Zhu, Y.: Symbolic model checking without BDDs. In: Cleaveland, W.R. (ed.) TACAS 1999. LNCS, vol. 1579, pp. 193–207. Springer, Heidelberg (1999)
10. Marques-Silva, J.P., Glass, T.: Combinational equivalence checking using satisfiability and recursive learning. In: DATE 1999, p. 33. ACM, New York (1999)
11. Endrullis, J., Waldmann, J., Zantema, H.: Matrix interpretations for proving termination of term rewriting. J. Autom. Reason. 40(2-3), 195–220 (2008)
12. Coste, F., Nicolas, J.: Regular inference as a graph coloring problem. In: Workshop on Grammatical Inference, Automata Induction, and Language Acquisition, ICML 1997 (1997)
13. Grinchtein, O., Leucker, M., Piterman, N.: Inferring network invariants automatically. In: Furbach, U., Shankar, N. (eds.) IJCAR 2006. LNCS (LNAI), vol. 4130, pp. 483–497. Springer, Heidelberg (2006)
14. Biermann, A.W., Feldman, J.A.: On the synthesis of finite-state machines from samples of their behavior. IEEE Trans. Comput. 21(6), 592–597 (1972)
15. Walsh, T.: SAT v CSP. In: Dechter, R. (ed.) CP 2000. LNCS, vol. 1894, pp. 441–456. Springer, Heidelberg (2000)
16. Sakallah, K.A.: Symmetry and satisfiability. In: Handbook of Satisfiability, ch. 10, pp. 289–338. IOS Press, Amsterdam (2009)
17. Kullmann, O.: On a generalization of extended resolution. Discrete Applied Mathematics 96-97(1), 149–176 (1999)
18. Järvisalo, M., Biere, A., Heule, M.J.H.: Blocked clause elimination. In: Esparza, J., Majumdar, R. (eds.) TACAS 2010. LNCS, vol. 6015, pp. 129–144. Springer, Heidelberg (2010)
19. Oliveira, A.L., Marques-Silva, J.P.: Efficient search techniques for the inference of minimum size finite automata. In: South American Symposium on String Processing and Information Retrieval, pp. 81–89. IEEE Computer Society Press, Los Alamitos (1998)
20. Biere, A.: PicoSAT essentials. Journal on Satisfiability, Boolean Modeling and Computation 4, 75–97 (2008)
21. Velev, M.N.: Exploiting hierarchy and structure to efficiently solve graph coloring as SAT. In: ICCAD 2007: International Conference on Computer-Aided Design, Piscataway, NJ, USA, pp. 135–142. IEEE Press, Los Alamitos (2007)
22. Schaafsma, B., Heule, M.J.H., van Maaren, H.: Dynamic symmetry breaking by simulating Zykov contraction. In: Kullmann, O. (ed.) SAT 2009. LNCS, vol. 5584, pp. 223–236. Springer, Heidelberg (2009)
Learning Deterministic Finite Automata from Interleaved Strings
Joshua Jones and Tim Oates
University of Maryland, Baltimore County
1000 Hilltop Circle, Baltimore, Maryland, USA
{jkj,oates}@umbc.edu
Abstract. Workflows are an important knowledge representation used to understand and automate processes in diverse task domains. Past work has explored the problem of learning workflows from traces of processing. In this paper, we are concerned with learning workflows from interleaved traces captured during the concurrent processing of multiple task instances. We first present an abstraction of the problem of recovering workflows from interleaved example traces in terms of grammar induction. We then describe a two-stage approach to reasoning about the problem, highlighting some negative results that demonstrate the need to work with a restricted class of languages. Finally, we give an example of a restricted language class called terminated languages for which an accepting deterministic finite automaton (DFA) can be recovered in the limit from interleaved strings, and make remarks about the applicability of the two-stage approach to terminated languages.
Keywords: regular languages, interleaved languages, learning in the limit, workflow inference.
1 Introduction
Workflows, state-based representations of information processing tasks, are an important knowledge representation used to understand and model processes in diverse task domains. For example, workflows have been used to model business processes such as the handling of commercial credit applications at a bank [8] and scientific processes such as the acquisition and synthesis of astronomical data in an observatory [3]. Often, workflow knowledge can be represented as a finite state machine (FSM). Beyond modeling and understanding these kinds of processes, workflows have also been used to (partially) automate tasks. However, the knowledge extraction process required to specify a complex workflow is difficult and time-consuming, posing a significant obstacle to widespread deployment of workflow-based automation. For this reason, we are interested in automatically extracting workflows from sample traces of a process’ execution, which should be relatively straightforward to capture. Past work has viewed the problem of learning workflows from example traces as one of grammar induction [9]. Existing work on this topic tends to assume that process traces can be captured in
isolation – that is, the system generating the trace is working on a single task instance at the time that the trace is captured. Our goal in this work is to relax this demand, acknowledging that in general an information processing system (e.g. a user and an application running on his or her workstation) may be working concurrently on multiple task instances. For instance, an astronomer may be working with instruments to survey several celestial objects at the same time, or a bank agent may be multitasking, flipping back and forth among the credit applications of several potential customers. In such cases, the resulting trace that is captured may contain arbitrarily interleaved steps executed in service of each of the concurrently processed task instances. In this paper, we first present an abstraction of the problem of recovering workflows from interleaved example traces in terms of grammar induction. We then describe a two-stage approach to reasoning about the problem, highlighting some negative results that demonstrate the need to work with a restricted class of languages. Finally, we give an example of a restricted language class called terminated languages for which an accepting deterministic finite automaton (DFA) can be recovered in the limit from interleaved strings, and make remarks about the applicability of the two-stage approach to terminated languages. The contributions of this paper are the identification and formulation of the problem of learning DFAs from interleaved strings, a number of negative results that help to circumscribe potential solutions, and the description and analysis of a class of languages that are learnable from interleaved strings. We also enumerate open issues.
2 Problem Description
To understand the feasibility of learning workflows from interleaved traces, we have devised a grammar induction problem that is an abstraction of the workflow learning problem. Here, we restrict the general problem, where an arbitrary number of task instances may be concurrently processed, to a simpler case where exactly two task instances are being executed during the generation of each trace. In this setting, we imagine that we are presented with a set of example strings generated by an unknown DFA. We wish to recover the unknown DFA based on the examples. The twist in this setting is that the strings are not each generated by a single DFA execution, but instead are generated by two concurrent executions. That is, the strings are generated by the following process (a code sketch of this process appears below):
1. Begin with both executions in the DFA’s start state.
2. Choose an execution to update.
3. For the chosen execution, choose a transition to follow from the current state.
4. Place the symbol on the chosen transition at the end of the string being generated and update the current state of the chosen execution.
5. If both executions are in final states, the process may be terminated. If continuing, return to step (2).
Let L be the language accepted by the DFA. We will then refer to the language generated by the interleaved execution process described above as L^2.
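A direct Python rendering of this generation process (our own sketch; the DFA representation matches the earlier examples and is not prescribed by the paper):

import random

def sample_interleaved(delta, start, finals, max_steps=50):
    """Generate one string of L^2 by interleaving two random walks over the DFA."""
    states = [start, start]          # current state of each of the two executions
    out = []
    for _ in range(max_steps):
        if states[0] in finals and states[1] in finals and random.random() < 0.3:
            return ''.join(out)      # step 5: both executions are in final states, so we may stop
        e = random.randrange(2)      # step 2: choose an execution to update
        moves = [(a, q) for (s, a), q in delta.items() if s == states[e]]
        if not moves:
            continue                 # no outgoing transition for this execution
        a, q = random.choice(moves)  # step 3: choose a transition
        out.append(a)                # step 4: emit its symbol and update the state
        states[e] = q
    return None                      # give up if no string was completed within max_steps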
Notice that the nondeterminism in step (2) of this process allows strings in L^2 to be generated in a large number of ways – for instance, by first generating a complete string from L and then appending another complete string from L, by strictly alternating characters from two strings in L, and so on. In the literature, there are multiple ways in which a “shuffle operator” is sometimes defined. In some cases, the shuffle operator is defined as interleaving two strings in a rigid way, forming a new string by strictly alternating symbols from the two argument strings [4,7]. This type of shuffle operator does not behave in a similar fashion to the interleaving process described here, which can produce strings by strictly alternating between executions at each iteration, but in general does not. There is an alternative definition of the shuffle operator that has also appeared in the literature, where two strings are shuffled together without a requirement for strict alternation [6]. The interleaved languages we deal with here can be defined by shuffling a language with itself according to this latter type of shuffle operation. In an attempt to understand the characteristics of this problem, we have found it useful to separate the problem of recovering the underlying DFA from these strings into two stages. First, we will attempt to find an expanded super-DFA that directly accepts L^2 in the standard, non-interleaved sense. Call the original DFA A = ⟨Q, Σ, δ, q_0, F⟩, where Q is a finite set of states, Σ is a finite alphabet, δ is a transition function taking Q × Σ to Q, q_0 ∈ Q is the initial state, and F ⊆ Q is the set of final states. Given A, we can construct a non-deterministic finite automaton (NFA) N^2 = ⟨Q × Q, Σ, δ^2, {q_0, q_0}, F × F⟩, where δ^2 has a transition from ({q_i, q_j}, a) to {q_i, q_k} iff δ has a transition from (q_j, a) to q_k. This NFA can then be converted into a DFA A^2 via the standard transformation. Thus, it is clear that the super-DFA that we wish to recover from the given example strings in the first stage of our solution does indeed exist. In the second stage of the problem, we then wish to take the super-DFA A^2 recovered in the first stage and use it to recover the original DFA A. In the following subsections, we discuss negative results with respect to the feasibility of these problems in the general case, as well as in some restricted cases. In Section 3, we propose a particular restricted class of languages for which this problem is solvable.
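The construction of N^2 and its determinization into A^2 can be sketched as follows (our own code; frozensets stand in for the unordered pairs {q_i, q_j}).

from itertools import product

def super_nfa(delta, start, finals, states, alphabet):
    """Build N^2: states are unordered pairs of original states."""
    d2 = {}   # (pair, symbol) -> set of successor pairs
    for (qi, qj), a in product(product(states, repeat=2), alphabet):
        pair = frozenset((qi, qj))
        if (qj, a) in delta:  # advance the q_j component
            d2.setdefault((pair, a), set()).add(frozenset((qi, delta[(qj, a)])))
        if (qi, a) in delta:  # advance the q_i component
            d2.setdefault((pair, a), set()).add(frozenset((delta[(qi, a)], qj)))
    start2 = frozenset((start, start))
    finals2 = {frozenset((f, g)) for f in finals for g in finals}
    return d2, start2, finals2

def determinize(d2, start2, finals2, alphabet):
    """Standard subset construction from N^2 to the super-DFA A^2."""
    start_set = frozenset([start2])
    dfa, work, seen = {}, [start_set], {start_set}
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(t for s in S for t in d2.get((s, a), ()))
            if not T:
                continue
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    accepting = {S for S in seen if S & finals2}
    return dfa, start_set, accepting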
2.1 Recovering the Super-DFA from Examples
It is well known that arbitrary DFAs are not learnable in the limit from positive examples [5]. Thus, we would like to establish some restriction on L^2, the language to be accepted by the super-DFA, that will make the problem tractable. Popular restricted language classes that are learnable include the k-reversible [1] and k-testable languages [2]. In this context, we wish to determine whether restrictions such as these, which are widely considered to be reasonable, are preserved from the original language L to the concurrent language L^2. Unfortunately, in both of these cases, L's membership in the restricted class does not ensure L^2's membership.
Theorem 1. For a k-reversible language L, L^2 is not in general k′-reversible for any value of k′.
Fig. 1. A zero-reversible original DFA. The start state is depicted with a rectangle, final states are depicted with double ovals, and non-final non-start states are depicted with single ovals.
Fig. 2. The super-DFA corresponding to the original DFA of Figure 1. The start state is depicted with a rectangle, final states are depicted with double ovals, and non-final non-start states are depicted with single ovals.
Proof. We present the following counterexample as proof of Theorem 1, showing that the property of k-reversibility need not be preserved from L to L^2 (even for increased values of k). A zero-reversible original DFA is depicted in Figure 1, and the corresponding super-DFA is depicted in Figure 2. The super-DFA of Figure 2 is not k-reversible for any value of k. To see that this is true, examine the reverse path from the node labeled ’1’ to the node labeled ’2’, and the reverse path from the node labeled ’3’ to the node labeled ’4’. These are two distinct final states from which we can generate reverse symbol sequences of arbitrary length consisting of an ’a’ followed by any number of ’b’s. Thus, there is no finite value of k for which it is guaranteed that we can disambiguate reverse executions of the DFA.
Theorem 2. For a k-testable language L, L^2 is not in general k′-testable for any value of k′.
Proof. The following counterexample similarly proves that k-testability is not necessarily preserved from an original language L to the two-execution interleaved language L^2 (even for increased values of k). Take a language L that consists of
either two or more ’a’s or two or more ’b’s. This language is 2-testable – as long as all of a string’s length-2 substrings are either ’aa’ or ’bb’, the string is in the language. However, L^2 is not k-testable for any value of k. L^2 in this case contains all strings of length four or more that have either zero or at least two ’a’s and either zero or at least two ’b’s. Thus, if there is one ’a’ at some position within the string, there must be another ’a’ within the string somewhere for it to be a member of L^2. However, because one ’a’ within the string can be separated from the next by an arbitrary number of ’b’s, there is no fixed substring length that will enable us to verify the presence of a second ’a’. For example, imagine that we select k = 5. In that case, notice that ’abbbba’ is in L^2, while ’abbbb’ and ’bbbba’ are not. For ’abbbba’ to be accepted, we must allow ’abbbb’ and ’bbbba’ as valid length-5 substrings. Thus, ’abbbb’ and ’bbbba’ consist only of valid substrings, and cannot be rejected.
It may be the case that some more stringent conditions can be established on L to ensure that L^2 is k-reversible – this question is currently open. However, it does not appear that any such conditions (within reason) can be established to ensure the k-testability of L^2. In particular, any infinite language L will not yield a k-testable L^2. To see that this is true, assume that we are given a k-testable original DFA. We wish to find some finite value k′ such that the super-DFA is k′-testable. For non-finite languages, there is in general no fixed substring length k′ that can be employed in examining the interleaved strings that will allow us to verify the validity of the “embedded” k-length substrings from the two generating executions. To see that this is true, notice that an adversarial string generator can defeat a window size k′ by:
1. Generate a symbol from the first execution.
2. Generate ⌈k′/k⌉ length-k sequences from the second execution.
3. Generate k − 1 symbols from the first execution.
Step (2) in this procedure effectively pushes the first symbol out of the fixed window’s scope, making it impossible to recognize the k − 1 following symbols as the end of a valid length-k sequence. Based on this analysis, it appears that neither the k-testable nor the k-reversible property is dependably preserved from a language L to its interleaved language L^2. It may still be the case that some other, more stringent restrictions on L may guarantee a k-reversible corresponding L^2, or an L^2 belonging to some other feasibly learnable class of languages. It is not clear whether these restrictions, if they exist, will be reasonably broad. At this time, the existence of suitable restrictions on L remains an open question.
2.2 Recovering the Original DFA from the Super-DFA
In evaluating the possibility of recovering A given A^2, a question immediately becomes apparent: is the mapping from DFAs to super-DFAs one-to-one? Unfortunately, it turns out that the answer is ’no’. Take, for example, the original DFAs depicted in Figure 3.
Fig. 3. Two original DFAs. The start state is depicted with a rectangle, final states are depicted with double ovals, and non-final non-start states are depicted with single ovals.
Fig. 4. The super-DFA corresponding to both the top and bottom original DFAs of Figure 3. The start state is depicted with a rectangle, final states are depicted with double ovals, and non-final non-start states are depicted with single ovals.
The DFA on the top of Figure 3 accepts all non-empty strings of ’a’s. The DFA on the bottom accepts the same language with the exception that strings of length three are not accepted. However, the languages generated by interleaving two concurrent walks over either DFA are the same, and are accepted by the same super-DFA, depicted in Figure 4. The super-DFA accepts all strings of ’a’s of length two or longer. It is clear why the original DFA on the top of Figure 3 gives rise to this super-DFA. To see why the DFA on the bottom of Figure 3 does also, notice that any interleaved string length that the top DFA can produce using strings of length three (e.g. 6 = 3+3) can be produced by the bottom DFA with alternative component lengths (6 = 2+4). It is easy to generate more examples of original DFAs that map to the same super-DFA. Of course, the fact that the mapping between original and super-DFAs is not one-to-one also means that the mapping between original and interleaved languages is not one-to-one. Thus, the problem of learning a language from interleaved strings is not solvable in the general case. This characteristic, along with those noted in the previous subsection, makes clear the need to identify a restricted class of languages for which a solution is possible. We describe such a restricted class in the next section.
3 Restriction to “Terminated” Languages
Given the difficulties elaborated upon in the previous section, we would like to find some restricted class of languages for which a DFA may be tractably learned from interleaved strings. To this end, we propose a class which we call terminated languages. A terminated language is one accepted by a DFA A = ⟨Q, Σ, δ, q_0, f⟩, where Q is a finite set of states, Σ is a finite alphabet, δ is a transition function taking Q × Σ to Q, q_0 ∈ Q is the initial state, and f ∈ Q is a unique, absorbing final state. δ is further restricted such that δ(q_i, s) = f only if s = #, and δ(q_i, #) = q only if q = f. Terminated languages thus have a special symbol, here denoted #, which occurs as the last symbol in every valid string, and in no other position within any valid string. Notice that we can easily adapt any DFA into a terminated version by adding a new state which will become the unique final state, adding a new symbol to the alphabet, marking all previously final states as non-final, and adding transitions from these previously final states to the new final state labeled with the newly added symbol; a code sketch of this adaptation is given below. This class of languages is reasonable for the workflow domain with which we are concerned, as we can require the user (or a particular participant in the modeled system) to take some explicit action designating the completion of a task instance. In some systems, such an action may already exist – for example, clicking “Send” or striking the Enter key. To apply the two-stage learning technique to a terminated language L, we will first need to recover the super-DFA accepting L^2. It is not clear that the restriction of L to terminated languages will cause properties such as k-testability or k-reversibility to be preserved from L to L^2. The issue of which restrictions can be placed on L such that L^2 will belong to a tractably learnable class remains open. However, if we for the moment assume that L^2 belongs to a learnable class, we can apply known methods to recover the super-DFA from positive examples of L^2. Given the super-DFA we have learned, we can now use it to generate examples of L in the following manner:
1. Follow a shortest path through a terminal (’#’) symbol.
2. Follow any path to a final state.
3. The suffix of the generated string that follows the initial terminal symbol is a string in L.
Any string that is not a valid suffix (in L^2) following the initial terminal symbol of a string generated in this fashion is not in L. This procedure works because a shortest prefix ending in a terminal symbol can only have advanced a single execution. If this were not the case, a shorter prefix ending in a terminal symbol would exist which advanced only a single execution, eliminating all symbols due to the advancement of the other execution. Because the emission of the terminal symbol means that the execution advanced to generate the prefix can emit no more symbols, it must be the case that all subsequently emitted symbols are due to the other (single) execution, and thus constitute a valid string in L.
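A minimal sketch of the adaptation mentioned above, using the dictionary DFA representation from the earlier examples (our own code, not from the paper; integer state names are assumed):

def terminate_dfa(delta, start, finals, terminal='#'):
    """Turn an arbitrary DFA into one accepting the terminated version of its language."""
    states = {start} | {s for (s, _) in delta} | set(delta.values())
    f = max(states) + 1                   # fresh unique final state (assumes integer states)
    new_delta = dict(delta)
    for q in finals:                      # every previously final state now has a '#' transition to f
        new_delta[(q, terminal)] = f
    return new_delta, start, {f}          # only f is accepting; it has no outgoing edges

# Example: terminate the "even number of b's" DFA used earlier.
delta = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 0}
t_delta, t_start, t_finals = terminate_dfa(delta, 0, {0})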
Further, since a single execution must be able to generate all valid strings in L, any string that cannot follow such a “shortest-to-terminal” prefix cannot be a member of L. Notice that this procedure can be readily modified to cases with more than two interleaved executions (as long as the number of executions is fixed), by changing step (1) to take a shortest path through N − 1 terminal symbols, for N interleaved executions. Using this technique, we can generate all of the strings in L, and so if L is learnable from positive examples, this procedure will suffice. We can also generate as many negative examples as desired, by finding strings starting with a shortest-to-terminal prefix that are rejected by the super-DFA. There is, in fact, a computationally much simpler approach to moving from the super-DFA to the underlying DFA in the case of terminated languages (a code sketch of this trimming procedure is given below):
1. Follow a shortest path through a terminal (’#’) symbol.
2. Mark the currently occupied state (the one reached upon emitting the terminal symbol) as the start state.
3. Remove all states and transitions not reachable from the newly marked start state.
4. The resulting DFA is the original DFA.
As explained above, a string is accepted by the original DFA iff, for each shortest-to-terminal prefix, there is a string in L^2 consisting of that prefix followed immediately by the string in question. The DFA remaining after applying the procedure above accepts exactly this set of strings. This procedure extends in a straightforward fashion to higher-order interleaved languages (e.g. L^3, L^4, ...) in the same manner as the previous procedure. For example, Figure 5 contains terminated versions of the original DFAs of Figure 3. Recall that the two non-terminated DFAs in Figure 3 both map to the same super-DFA – that is, L^2 is the same for both of the DFAs in Figure 3. However, for the terminated versions of these languages, this is no longer the case. The two distinct super-DFAs are shown in Figure 6 (corresponding to the DFA on the top of Figure 5) and Figure 7 (corresponding to the DFA on the bottom of Figure 5). In each case, the state reached by following the shortest path through a terminal symbol is darkened. In each case, removing all unreachable transitions and states and making the darkened state the new start state results in the original DFA.
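A sketch of this trimming procedure over a super-DFA in the dictionary representation (assumptions as before; a breadth-first search finds a shortest path that crosses a '#' transition, and the super-DFA is assumed to contain at least one such transition):

from collections import deque

def recover_original(dfa, start, finals, terminal='#'):
    """Recover the underlying DFA from the super-DFA of a terminated language."""
    # Step 1: BFS for the state reached by a shortest path through a '#' transition.
    queue, seen = deque([(start, False)]), {(start, False)}
    new_start = None
    while queue:
        state, crossed = queue.popleft()
        if crossed:
            new_start = state      # first '#'-crossing configuration dequeued = shortest such path
            break
        for (s, a), t in dfa.items():
            if s == state:
                nxt = (t, a == terminal)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    # Steps 2-4: re-root at new_start and drop everything not reachable from it.
    reachable, queue = {new_start}, deque([new_start])
    while queue:
        s = queue.popleft()
        for (p, a), t in dfa.items():
            if p == s and t not in reachable:
                reachable.add(t)
                queue.append(t)
    trimmed = {(s, a): t for (s, a), t in dfa.items() if s in reachable}
    return trimmed, new_start, finals & reachable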
Fig. 5. Two original DFAs accepting terminated languages. The start state is depicted with a rectangle, final states are depicted with double ovals, and non-final non-start states are depicted with single ovals.
Fig. 6. The super-DFA corresponding to the terminated language accepted by the DFA on the top of Figure 5. The start state is depicted with a rectangle, final states are depicted with double ovals, and non-final non-start states are depicted with single ovals.
Based on this analysis, an algorithm for learning a language L from interleaved strings (i.e., examples of L^2) in the limit from positive examples becomes apparent, if L is both learnable and terminated. As this algorithm learns, it will store the shortest prefix ending in a terminal symbol seen so far. The algorithm will depend upon an “embedded” learning algorithm to which we will pass examples believed to be from L. The specifics of this embedded algorithm will depend upon the learnable class to which L is known to belong – for each such class (by definition), an existing algorithm can be employed. The algorithm for learning from L^2 then proceeds by finding the shortest prefix of each incoming example that ends in a terminating symbol. This prefix is then compared to the stored shortest prefix so far. If the prefix from the new example is shorter, it is stored as the shortest prefix seen so far and the embedded learning algorithm is completely reset. The suffix following the newly stored prefix is presented to the embedded algorithm as the first example in a new training regime. Alternatively, if the prefix from the new example is exactly the same length as the stored shortest prefix, the suffix following the first terminal symbol is presented to the embedded algorithm as another example of L. Table 1 more formally describes a procedure that can be employed to learn a DFA accepting a learnable, terminated language L from strings in the interleaved language L^2 in the limit. The procedure of Table 1 will be called an infinite number of times on strings from L^2 that are generated in accordance with the typical definition of learning from text in the limit – that is, Learn-Interleaved will be called infinitely often on every string in L^2. In the following, we formally demonstrate that the algorithm of Table 1 indeed learns L from examples of L^2 in the limit.
Fig. 7. The super-DFA corresponding to the terminated language accepted by the DFA on the bottom of Figure 5. The start state is depicted with a rectangle, final states are depicted with double ovals, and non-final non-start states are depicted with single ovals.
Table 1. Learning a DFA from interleaved strings in the limit
/* Global Variables */
Integer p ← ∞         /* The length of the shortest prefix ending in a terminal symbol seen yet. */
DFA L ← ∅             /* The current “answer” for the DFA accepting the original language. */

/* Subroutines */
Prefix(a,x)           /* Return the shortest prefix of x ending in symbol a. */
Suffix(i,x)           /* Return the suffix of x following the initial i symbols. */
Learn-Standard(x)     /* Standard in-the-limit learning algorithm. */
Reset-Standard()      /* Completely reset the standard in-the-limit learning algorithm. */

begin Learn-Interleaved(string x)
    /* If the shortest prefix of x ending in a terminal (’#’) is shorter
       than the currently stored prefix length p, alter p and reset the
       embedded learning algorithm. */
    if |Prefix(#, x)| < p then:
        p ← |Prefix(#, x)|
        Reset-Standard()
    /* If the shortest prefix of x ending in a terminal symbol is the same as
       the currently stored prefix length, provide the suffix of x following
       that prefix as input to the standard learning algorithm. */
    if |Prefix(#, x)| = p then:
        L ← Learn-Standard(Suffix(p, x))
    return L
end
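For concreteness, the procedure of Table 1 can be phrased as follows in Python. This is only an illustrative sketch, not code from the paper; the embedded learner (here an object with reset and learn methods) is a placeholder for whatever standard in-the-limit algorithm is chosen for the class containing L.

class InterleavedLearner:
    # Sketch of Learn-Interleaved from Table 1 (assumed interface, not the authors' code).
    def __init__(self, embedded_learner, terminal='#'):
        self.p = float('inf')             # length of the shortest terminated prefix seen so far
        self.embedded = embedded_learner  # standard in-the-limit learner for L
        self.terminal = terminal
        self.hypothesis = None            # current "answer" for the DFA accepting L

    def _prefix_len(self, x):
        # Length of the shortest prefix of x ending in the terminal symbol (infinite if none).
        i = x.find(self.terminal)
        return i + 1 if i >= 0 else float('inf')

    def learn(self, x):
        k = self._prefix_len(x)
        if k == float('inf'):             # no terminal at all: cannot occur for L2 of a terminated language
            return self.hypothesis
        if k < self.p:                    # strictly shorter terminated prefix: store it and restart
            self.p = k
            self.embedded.reset()
        if k == self.p:                   # suffix after the stored prefix is believed to be in L
            self.hypothesis = self.embedded.learn(x[self.p:])
        return self.hypothesis

Called infinitely often on a text of L2, the resets stop once a shortest terminated prefix has been seen; from that point on the learner simply forwards suffixes to the embedded algorithm, mirroring the lemmas that follow.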
Lemma 1. Subroutine Reset-Standard will be called by Learn-Interleaved (Table 1) a finite number of times while learning from any interleaved language L2 . Once the final reset occurs, p will contain the length of the shortest strings in L. Proof. For any language L underlying an interleaved language L2 , we can define a subset Ls ⊆ L such that (1) all of the strings in Ls are of equal length and (2) no string in L has length less than a string in Ls . That is, Ls is the set of the shortest strings in L. Next, notice that some subset of the symbols occurring before the first terminal symbol in a string from L2 , along with the terminal symbol itself, must form a valid string in L. Thus, the length of the shortest possible prefix of a string in L2 ending in a terminal symbol is |x|, x ∈ Ls . In the limit, it is guaranteed that Learn-Interleaved will be called on a string from
L2 with such a prefix (since such strings are in L2 and all strings are eventually presented), and this length (|x|, x ∈ Ls ) will be stored in p. Once this occurs, no subsequently presented string from L2 can ever be seen to have a shorter terminated prefix and no further calls to Reset-Standard will be made. Lemma 2. After the final call to Reset-Standard, every string passed in a call to Learn-Standard is a valid string in L. Proof. In order to generate a string in L2 having a shortest terminated prefix with length equal to |x|, x ∈ Ls , exactly one execution must have produced all of the symbols within the prefix, and that execution must have terminated with the production of the terminal symbol (i.e. it is unable to produce further symbols). The production of a symbol within the prefix by the other execution would require a string y ∈ L with |y| < |x|, x ∈ Ls . Thus, only one execution may have produced symbols within the prefix. Once a terminal is seen, the execution having produced it may produce no more symbols. Given these facts, the other (non-prefix generating) execution is the only producer of symbols occurring after the terminated prefix, and each symbol it generates occurs after the terminated prefix. Lemma 3. After the final call to Reset-Standard, every string in L will be passed in a call to Learn-Standard infinitely often. Proof. ∀x ∈ L, ∃yx ∈ L2 where y ∈ Ls . If Learn-Interleaved is called on yx after the final reset, x will be passed to Learn-Standard, as p contains |y| by Lemma 1. By the definition of our problem setting, learning from text (of L2 ) in the limit, Learn-Interleaved will be called infinitely often on every string in L2 . Theorem 3. Learn-Interleaved learns the learnable, terminated language L in the limit from positive examples of L2 . Proof. Because L is learnable (by definition), we can select Learn-Standard appropriately such that it will converge on a correct representation of L based on positive examples in the limit. By Lemmas 2 and 3, after the final reset of the standard learning algorithm, we will present only valid examples of L, and each member of L will be presented infinitely often. It is certainly the case that the algorithm of Table 1 will in general lead to many false starts and resets of the embedded algorithm before a string with a truly shortest prefix is seen. However, in the limit we are guaranteed to see such a string, and once this happens there will be no further resets of the embedded algorithm. Further, since we are guaranteed to see every string in L2 infinitely often in the limit, every string in L2 will be seen infinitely often after the final reset of the embedded algorithm. If we call the prefix stored by the algorithm upon the final reset p, notice that ∀x ∈ L, px ∈ L2 . Thus, the embedded learning algorithm will see every string in L infinitely often after the reset. Since L belongs
to a learnable class, the embedded algorithm will converge on a DFA that accepts L. While this algorithm demonstrates the learnability of L from examples of L2 in the limit if L is learnable and terminated, it clearly is wildly inefficient. Thus, if L2 is efficiently learnable, it will be far more practical to learn the super-DFA and apply the transformation to the original DFA given above. This algorithm is generalizable to learning from more than two interleaved executions, for a fixed number of executions N , by working with shortest prefixes containing N − 1 terminal symbols rather than a single terminal symbol. We can also readily generalize the algorithm to handle learning from a language with strings generated by a variable number of interleaved executions, L+ . Here, we will simply first count n, the number of terminal symbols in an incoming example, find the prefix with n − 1 terminal symbols, and then proceed as normal. Thus, after the final reset the embedded algorithm will end up learning only from those strings in L+ resulting from the fewest interleaved executions.
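As an illustration of the transformation from the super-DFA back to the original DFA referenced above (the four-step procedure for terminated languages), the following sketch assumes the super-DFA is given explicitly as a transition dictionary; the representation and function name are ours, chosen only for this example.

from collections import deque

def recover_original_dfa(delta, start, accepting, terminal='#'):
    # delta: dict mapping (state, symbol) -> state; accepting: set of states.
    # Step 1: breadth-first search for a shortest path that emits the terminal symbol.
    visited = {start}
    queue = deque([start])
    new_start = None
    while queue and new_start is None:
        q = queue.popleft()
        for (s, a), t in delta.items():
            if s != q:
                continue
            if a == terminal:
                new_start = t             # Step 2: the state reached upon emitting '#' becomes the start state
                break
            if t not in visited:
                visited.add(t)
                queue.append(t)
    # Step 3: remove all states and transitions not reachable from the new start state.
    reachable = {new_start}
    frontier = deque([new_start])
    while frontier:
        q = frontier.popleft()
        for (s, a), t in delta.items():
            if s == q and t not in reachable:
                reachable.add(t)
                frontier.append(t)
    # Step 4: the restricted automaton is the original DFA.
    new_delta = {(s, a): t for (s, a), t in delta.items()
                 if s in reachable and t in reachable}
    return new_start, new_delta, accepting & reachable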
4  Conclusions and Future Work
In this paper we have presented an abstraction from a workflow learning problem of practical interest to one of grammar induction. Part of the contribution of this paper is identifying the problem of learning a DFA from interleaved strings. We have also described some negative results that demonstrate the difficulty of the problem. Finally, we have proposed a new restricted class of languages, terminated languages, that are reasonable in the motivating problem context and that are learnable in the limit from interleaved strings. We also describe a two-stage learning scheme by which some subclasses of the terminated languages may be more efficiently learned. There are a number of open problems that could be usefully addressed with further work, including the question of restrictions that can be placed on L to cause L2 to belong to a learnable class (in terms of learning the super-DFA), the existence of other interesting classes of languages that may be learnable in the interleaved context, and the generalization of more efficient learning techniques to languages that arise when allowing a variable number of interleaved executions.
References
1. Angluin, D.: Inference of reversible languages. J. ACM 29(3), 741–765 (1982)
2. García, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925 (1990)
3. Gil, Y., Deelman, E., Ellisman, M., Fahringer, T., Fox, G., Gannon, D., Goble, C., Livny, M., Moreau, L., Myers, J.: Examining the challenges of scientific workflows. Computer 40, 24–32 (2007)
4. Gischer, J.: Shuffle languages, Petri nets, and context-sensitive grammars. Commun. ACM 24(9), 597–605 (1981)
5. Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967), http://www.isrl.uiuc.edu/~amag/langev/paper/gold67limit.html
6. Jedrzejowicz, J.: Structural properties of shuffle automata. Grammars 2(1), 35–51 (1999)
7. Mayer, A.J., Stockmeyer, L.J.: The complexity of word problems—this time with interleaving. Inf. Comput. 115(2), 293–311 (1994)
8. Reijers, H.A.: Design and Control of Workflow Processes. Springer, New York (2003)
9. Yaman, F., Oates, T., Burstein, M.H.: A context driven approach for workflow mining. In: Boutilier, C. (ed.) IJCAI, pp. 1798–1803 (2009)
Learning Regular Expressions from Representative Examples and Membership Queries

Efim Kinber

Department of Computer Science, Sacred Heart University, Fairfield, CT 06825-1000, U.S.A.
[email protected]
Abstract. A learning algorithm is developed for a class of regular expressions equivalent to the class of all unionless unambiguous regular expressions of loop depth 2. The learner uses one representative example of the target language (where every occurrence of every loop in the target expression is unfolded at least twice) and a number of membership queries. The algorithm works in time polynomial in the length of the input example.
1  Introduction
Over the last forty years, a large number of algorithms for learning regular languages have been developed. The class of regular languages being learnt, in most learning scenarios, is represented by some class of target objects representing regular languages. In most cases, this target class is a class of DFAs of one or another type. Yet, as H. Fernau points out in [7], “in practical applications, learning regular languages often means to infer regular expressions (REs), because REs are arguably the most suitable model to specify regular languages, especially for human beings”. He also points out that, unfortunately, very few learning algorithms are known that deal directly with regular expressions. Some learning algorithms for regular expressions have been developed in [7]. An interesting algorithm for learning regular expressions applicable to the practically important task of DTD inference for XML documents has been proposed in [5]. The algorithms in the aforementioned papers use the model of learning in the limit from potentially the full set of positive examples of the target language (proposed by E. M. Gold in [8]). Back in 1993, A. Brāzma developed algorithms for learning some classes of regular expressions from so-called representative examples (where all loops are unfolded approximately the same number of times), see, for example, [4]. In the learning model defined in [4], the algorithm infers a finite union of unionless regular expressions from a finite fixed number of examples representing each member of the union; examples are representative in the sense that, to obtain an example from a regular expression R, every occurrence of every loop in R is unfolded approximately the same number of times. Regular expressions in [4] must be in a certain standard form and, in a certain
sense, unambiguous. A natural notion of an unambiguous regular expression was introduced in [3]; however, the notion used in [4] is different, and somewhat artificial (though suited for the purpose of the model suggested in [4]). In this paper, we use the so-called active learning model (introduced by D. Angluin in [1,2]): a learner uses a finite number of membership queries to the teacher (oracle), gets truthful answers, and, in a finite amount of time, produces a regular expression equivalent to the target one (within the class of expressions being learnt). In addition to membership queries, our learner uses one representative example of the target expression R, which is obtained from R when all occurrences of all loops in R are unfolded at least twice. Regular expressions in our class do not contain unions and are unambiguous (in the sense of [3]). The expressions inferred by our learner are in some standard form, namely, left-aligned: every loop in any such expression may not be shifted to the left (while preserving equivalence). While inferred expressions are in such standard form, they represent all regular languages that can be represented by unambiguous unionless regular expressions. Our learning algorithm learns all such expressions of loop depth 2 in polynomial time. Note that the classes of regular expressions in [7] and [5], in many ways, are much more limited than our class. In particular, regular expressions in these classes contain only very simple loops of depth 1, and they are so-called 1-unambiguous, which is a much stronger condition than being unambiguous. The algorithm in [4] uses a finite number of examples; however, to obtain such an example from a regular expression R, all loops in R must be unfolded approximately the same number of times, and, typically, the length of the input example would be much larger than the size of the target expression - whereas the length of representative examples in our model would typically be comparable with the size of the target expression. A learning model similar to the one in this paper was considered in [9]; however, regular expressions in [9] use +, rather than Kleene star, and have loop depth 1 (though they may be ambiguous).
2  The Class of Regular Languages
Let Σ be a finite alphabet. Σ∗ denotes the set of all finite strings (words) over Σ. Let ε denote the empty string. A language is any subset of Σ∗. We will refer to elements of a language L as (positive) examples of L. uv denotes the concatenation of the strings u and v. By a substring of a string w we will understand any string u such that w = vux for some strings v and x. For any string v and any integer k ≥ 0, let v k denote the concatenation of k copies of v; by convention, v 0 = ε. We consider the following subclass of regular expressions in this paper: all regular expressions that use concatenation and Kleene star ∗ (but no unions), of loop depth at most 2. (Loop depth of such an expression can be defined as follows. For any ∗-expression V, the expression (V)∗ is called a loop. If no loop (E)∗ in the expression R contains a loop, then R has loop depth 1; if each loop (E)∗ in R contains loops of depth at most n − 1, then R has loop depth at most
n). Following [4], we call such expressions ∗-expressions. For any ∗-expression R, let L(R) denote the language generated by R. Given a ∗-expression R, an example v ∈ L(R) is also called an example of R. For each ∗-expression E over a finite alphabet Σ, we define the length l(E) of it as follows: it is the length of the string over Σ obtained from E when all parentheses and Kleene stars are dropped. For example, the length l(E) of the expression E = (ab)∗ ab∗ is 4. Two ∗-expressions R1 and R2 are equivalent if L(R1) = L(R2). Given a loop (T)∗ and an integer n ≥ 0, the expression T n is called an unfolding of (T)∗. Now, we inductively define an unfolding of any ∗-expression R: 1. If there are loops in R that are not inside other loops, then replace all such loops by some of their unfoldings. Let T be the resulting expression. Go to Step 2. 2. Terminate, or set R := T and return to Step 1. For example, for the expression a∗ b(ab∗ ac∗)∗, the expressions aabab∗ ac∗ ab∗ ac∗, aaababbacccabacccc, and a∗ bab∗ ac∗ are different examples of unfoldings. An unfolding that does not contain any loops is called a complete unfolding. Obviously, for any ∗-expression R, any of its complete unfoldings is a string in the language L(R). Now, following [3,6], we will define unambiguous ∗-expressions. To indicate different positions of the same symbol a ∈ Σ, we mark different instances of the same symbol with subscripts. For example, the expression R = ab∗ a((bb)∗ a)∗ becomes a1 b∗1 a2 ((b2 b3)∗ a3)∗. We call the latter expression the marking of R and denote it Rf. If H is a subexpression of R, we assume that the markings H f and Rf are chosen so that H f is a subexpression of Rf. Now, a marked ∗-expression is a ∗-expression over Π, the alphabet of subscripted symbols, where each subscripted symbol occurs at most once and is mapped to exactly one symbol in Σ. The reverse of marking is dropping subscripts, indicated by #, and defined as follows: if H is a ∗-expression over the alphabet of subscripted characters Π, then H # is the ∗-expression over Σ that is obtained from H by dropping all the subscripts in H. Obviously, unmarking can be done for words w over the alphabet Π - then one gets a word w# over the alphabet Σ. Now, let R be a ∗-expression. We call R unambiguous if, for any string w in L(R), there exists exactly one string u in L(Rf) (called the witness) such that u# = w. Roughly speaking, R is unambiguous if, for each w ∈ L(R) there is just one (complete) unfolding of R resulting in w. Consider the regular ∗-expression R = (a∗ b∗)∗ aa∗ (see [6]). A marked version Rf of this expression is (a∗1 b∗1)∗ a2 a∗3. Now, the string aaa ∈ L(R) has three witnesses: a1 a2 a3, a1 a1 a2, and a2 a3 a3; therefore R is ambiguous. However, R is equivalent to the unambiguous ∗-expression (a∗ b∗)∗ a. Note that not all regular languages represented by ∗-expressions can be represented by unambiguous ∗-expressions. For example, the language represented by the expression (aaa)∗ (aaaaa)∗ cannot be represented by an unambiguous ∗-expression. Still, the class of all languages represented by unambiguous ∗-expressions is quite rich.
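To make the notion of unfolding concrete, here is a small Python sketch; the nested-list encoding of ∗-expressions is ours, introduced purely for illustration.

# A *-expression is modelled as a list whose items are either a plain string
# (a block of alphabet symbols) or a pair ('loop', body), where body is again such a list.

def unfold(expr, copies):
    # Complete unfolding: replace every loop by `copies` concatenated copies of
    # its (recursively unfolded) body, yielding a single string of L(expr).
    out = []
    for item in expr:
        if isinstance(item, str):
            out.append(item)
        else:
            _, body = item
            out.append(unfold(body, copies) * copies)
    return ''.join(out)

# Example: a*b(ab*ac*)* with every loop unfolded exactly twice.
expr = [('loop', ['a']), 'b',
        ('loop', ['a', ('loop', ['b']), 'a', ('loop', ['c'])])]
print(unfold(expr, 2))   # -> 'aababbaccabbacc'

For brevity the sketch uses one count for all loops; in general each occurrence of each loop may be unfolded a different number of times (including zero), which yields the other unfoldings discussed above.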
A. Br¯ azma in [4] considered a different notion of unambiguity (better suited for the purpose of the topic of [4]). For example, expressions a∗ a and ab∗ (cb∗ )∗ d are not unambiguous according to [4], however, they are obviously unambiguous, according to our definition. Now, following [4], we define the concept of shifting loops to the left (and to the right). We need this concept to define the class of expressions to be learnt: roughly speaking, loops in such expressions may not be shifted to the left. To define what this means precisely, we are going to introduce a number of shifting procedures. First, we define the procedure LSHIFT1(T ∗ , i) that, given a loop T ∗ in a ∗-expression R and an integer i ≤ l(T ), finds the expression T2 of the length i such that T = T1 T2 and the loop T ∗ is the tail of the subexpression T2 (T )∗ , and replaces T2 (T )∗ in R by (T2 T1 )∗ T2 . We also define the procedure LSHIFT2(E ∗ , i) that, given a loop E ∗ and an integer i ≤ l(E), where E is of the form (T )∗ F T2 , T = T1 T2 , l(T2 ) = i and the loop (E)∗ is the tail of the subexpression P = T2 (E)∗ , replaces P by ((T2 T1 )∗ T2 F )∗ T2 . It is easy to see that applying each of these procedures to a loop preserves equivalence. We also define the procedure MAXLSHIFT1(T ∗ ) that runs the procedure LSHIFT1(T ∗ , i) for i = l(T ) and the procedure MAXLSHIFT2(E ∗ ) that runs the procedure LSHIFT2(E ∗ , i) for i = l(T ) (where E is of the form (T )∗ F T for some F and T ). For example, when applied to the loop (abab)∗ in the expression abab(abab)∗ , MAXLSHIFT1 outputs the expression (abab)∗ abab. When applied to the loop ((abc)∗ dabc)∗ in the expression abc((abc)∗ dabc)∗ , MAXLSHIFT2 outputs ((abc)∗ abcd)∗ abc. We also define right-shifting counterparts of the procedures LSHIFT1, LSHIFT2 and their MAX versions: RSHIFT1, MAXRSHIFT1, etc. It can be easily seen that they also preserve equivalence. Now, we define the procedure LALIGN(T ∗ ) that, given a loop T ∗ , repeatedly applies MAXLSHIFT1 or MAXLSHIFT2 to it (whichever applicable) and to the resulting loops and, possibly, LSHIFT1(., k) or LSHIFT2(., k) for an appropriate k on the last step, so that no procedure LSHIFT1 or LSHIFT2 can be applied to the final result. For example, when applied to the loop a∗ in aaa∗ , LALIGN outputs a∗ aa. If LALIGN cannot be applied to a loop T ∗ in an expression R, we say that this loop is left-aligned in R. Now we need one more version of the procedure RSHIFT1. The procedure RSHIFT1a(T ∗ , i) applies RSHIFT1(T ∗ , i) if there are T1 of the length i and T2 such that T = T1 T2 and T ∗ T1 is a subexpression of R, and if there is no such T1 on the right from T ∗ , the procedure checks if such T1 on the right from the given loop T ∗ can be obtained by repeatedly (recursively) applying (not more than i times) RSHIFT1a(.,k) for appropriate values k ≤ i to the outermost loop H ∗ neighbouring T ∗ on the right, if any, such that the expression following T ∗ on the right is F (H)∗ , where l(F ) < i; if the desired T1 is obtained this way, RSHIFT1a applies RSHIFT1(T ∗ , i).
For example, when applied to the loop a∗ in the expression a∗(ab)∗(ac)∗a, the procedure RSHIFT1a(a∗, 1) recursively calls RSHIFT1a((ab)∗, 1) (in turn, calling RSHIFT1a((ac)∗, 1)) and results in the expression aa∗(ba)∗(ca)∗. When applied to the loop (bbc)∗ in the expression (bbc)∗b∗bbd, RSHIFT1a((bbc)∗, 2) twice applies RSHIFT1a(.,1) to the loop b∗ and then applies RSHIFT1((bbc)∗, 2) to obtain the expression bb(cbb)∗b∗d. Again, it can be easily seen that the procedure RSHIFT1a preserves equivalence. Some of the procedures defined above are similar to procedures in [4]. Now, we define the following procedure SLSHIFT1(T∗, i), applied to any expression R of loop depth 2 where all loops are left-aligned. Let T = T1 T2, where l(T2) = i. Let E be the subexpression of R such that R = ET∗P for some P. Let E = F G for some G of the length i. Let H∗ be the rightmost outermost loop that is either inside F or “crosses” F (so that a part of it is in G). If this loop is of depth 2, then the procedure fails. Otherwise, repeatedly applying MAXRSHIFT1 to H∗ and, possibly, applying RSHIFT1a(.,k) for some k on the last step, try to get the resulting loop as the leftmost left-aligned outermost loop in G. If this is not possible, the procedure fails. If this is possible, let F′ and G′ be the modified expressions F and, respectively, G. Test if G′ is identical to T2. If it is not, the procedure fails. Otherwise, replace F′G′T∗ by F′(T2 T1)∗T2. For example, SLSHIFT1, when applied to the loop T∗ = (a∗(ba)∗bac)∗ in the following expression a∗ a(ab)∗ abac(a∗ (ba)∗ bac)∗ a∗ (ba)∗ bac
(1)
and i = l(T ) = 6, applies MAXRSHIFT1 and, consequently, RSHIFT1a(a∗ , 1) to the loop a∗ (rightmost outermost loop preceding (ab)∗ abac - expression of the length l(T ) preceding the loop to be shifted). Note that applying RSHIFT1a(a∗ , 1) causes shifting the loop (ab)∗ one letter to the right. Thus, the prefix a∗ a(ab)∗ abac of the whole expression is transformed to aaa∗ (ba)∗ bac, and the loop T ∗ is shifted to the left, getting the resulting expression aa(a∗ (ba)∗ bac)∗ a∗ (ba)∗ baca∗ (ba)∗ bac
(2)
Similarly, one can define the procedure SLSHIFT2(T∗, i) based on using LSHIFT2. Note that all S versions of the procedures defined above preserve equivalence (S here stands for “strong”). Finally, we consider the procedure SLALIGN that, given a ∗-expression R of loop depth 2, first, using LALIGN, left-aligns all loops of depth 1, in the order from left to right, then left-aligns all loops of depth 2, from left to right, and then applies SLSHIFT1(T∗, k) or SLSHIFT2(T∗, k) for the maximal possible value of k to each loop T∗ of loop depth 2, from left to right. It can be shown that the obtained expression R′ is equivalent to R. We call the expression R′ obtained by applying SLALIGN to an arbitrary ∗-expression R (of loop depth 2) strongly left-aligned. For example, the expression (2) is strongly left-aligned.
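For the special case in which the loop body is a plain string of symbols (no nested loops), the basic left shift can be written down directly. The string representation with explicit parentheses, the function name and the error handling below are ours, used only to illustrate LSHIFT1 and MAXLSHIFT1, not the general procedures.

def lshift1(R, T, i):
    # LSHIFT1(T*, i) for a literal loop body T: with T = T1 T2 and |T2| = i,
    # rewrite the first occurrence of  T2(T)*  in R as  (T2 T1)* T2.
    T1, T2 = T[:-i], T[-i:]
    old = T2 + '(' + T + ')*'
    new = '(' + T2 + T1 + ')*' + T2
    if old not in R:
        raise ValueError('T2 does not immediately precede the loop in R')
    return R.replace(old, new, 1)

# MAXLSHIFT1 corresponds to i = len(T):
print(lshift1('abab(abab)*', 'abab', 4))            # -> '(abab)*abab'
# LALIGN on the loop a* in aa(a)* applies the shift repeatedly:
print(lshift1(lshift1('aa(a)*', 'a', 1), 'a', 1))   # -> '(a)*aa'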
Now we are ready to define the class R of regular expressions to be learnt by our algorithm. It contains all strongly left-aligned unambiguous ∗-expressions of loop depth 2. As we have actually shown, every unambiguous ∗-expression is equivalent to some strongly left-aligned ∗-expression. Moreover, one can prove the following

THEOREM 1. Every unambiguous ∗-expression (of loop depth 2) is equivalent to some strongly left-aligned unambiguous ∗-expression (of loop depth 2).

The proof is based on following the procedures defined above and showing that the application of each of them preserves unambiguity. We omit the details. It follows from Theorem 1 that the class of languages {L | L = L(R) for some R ∈ R} is the class of all languages represented by unambiguous ∗-expressions (of loop depth up to 2). Denote this class by L. The definition of strongly left-aligned expressions can be extended to ∗-expressions of arbitrary loop depth (and, accordingly, our algorithm, developed in this paper, can be extended for learning unambiguous strongly left-aligned ∗-expressions of arbitrary loop depth). In order to be able to make such an extension, we need to redefine the procedure LSHIFT2 so that it would be applicable to loops of arbitrary depth. However, such a procedure would then be much more complex, involving many recursive calls of itself and other procedures defined above. To demonstrate the problem that arises when using analogs of LSHIFT2 for loops of depth greater than 2, consider the following example of an expression of depth 3: ab(((ab)∗ cdab)∗ eab)∗. When our procedure LSHIFT2(.,2) is applied to the loop following the prefix ab, we get the expression ((ab(ab)∗ cd)∗ abe)∗ ab, which can be easily seen to be equivalent to the original expression (and preserving unambiguity). However, this expression is not (strongly) left-aligned. In order to left-align it, the loop (ab)∗ must be left-aligned. It is not hard in this case; however, in the general case, it might involve multiple (and recursive) calls to the procedures defined above, which makes the whole process quite messy, and it is not clear how it affects the correctness of the algorithm.
3  The Learning Model
Now we define the algorithmic learning model within which we intend to show the class L to be learnable in polynomial time. Our learning model is based on the so-called active learning model introduced by D. Angluin in [1,2]. In this learning model, a learner asks an oracle (or a teacher) queries about the concept (language) to be learnt and gets (correct) answers. The most natural type of queries is membership queries: the learner asks if a string w belongs to the language to be learnt and gets the (correct) answer “yes” or “no”.
In our variant of the active learning paradigm, a learner is initially given one representative example of the target language L. Let R be a ∗-expression. Consider the following process deriving strings in L(R): first, in the order from right to left, unfold every loop of depth 2 a number of times; then, in the order from right to left, unfold every loop in the obtained expression R′ a number of times. We will refer to this process of deriving any example in L(R) as standard derivation. We call an example w ∈ L(R) a representative example of the language L(R) (or just of R) if, to obtain w, every loop in the process of standard derivation of w is unfolded at least twice. For example, aabbcaaabbc is a representative example of the expression R = (a∗ b∗ c)∗. On the other hand, aabbcabbc is not a representative example of R, as the loop a∗ is unfolded only once inside the second unfolding of the outer loop (a∗ b∗ c)∗. Note that, for any unambiguous expression R, the concept of a representative example is well defined, since, for each example w ∈ L(R), there is only one standard derivation of R resulting in w, and, thus, there is no standard derivation of R resulting in w and making it not representative (which can happen if R is ambiguous: for example, bbabbabbabbabba is representative using one standard derivation applied to R = (bba)∗ b∗ a(bba)∗ and is not representative using another standard derivation applied to R). Of course, representative examples can be obtained using different derivation processes, not necessarily the standard one, as long as every loop in the corresponding “parse tree” (unique for an example of an unambiguous ∗-expression) is unfolded at least twice. Now, given one representative example of an expression R, an algorithmic learner in our model asks a finite number of membership queries, gets the (correct) answers, and, in a finite amount of time, outputs a ∗-expression P equivalent to R. (Our learning algorithm, given a representative example of an expression R, will actually output the expression R itself.) If, given a representative example of any language L(R) represented by a regular expression R in some class S, a learner A in our model learns R, we say that A learns the class of expressions R ∈ S (and the class of languages {L | L = L(R), R ∈ S}). We intend to design an algorithmic learner, within the given model, that learns the class R defined in the previous section. Thus, our learner infers all languages in the class L using their strongly left-aligned representations. Moreover, our learner will work in time polynomial in the length of the input representative example. Using the definitions of our procedures, one can prove the following

THEOREM 2. For any unambiguous ∗-expression R of loop depth 2, any of its representative examples is a representative example of the strongly left-aligned unambiguous expression R′ equivalent to R.

This means that if the target expression R is an arbitrary unambiguous ∗-expression of loop depth 2, our learner, given a representative example of R, will output a ∗-expression equivalent to it.
4  Learning the Class R
Let R be a strongly left-aligned unambiguous target expression and w be a representative example of R. Let C be some unfolding of R used in the process of getting the string w. C must be unambiguous, as otherwise R would not be unambiguous. It is easy to see, then, that for any well-defined subexpression E of C such that C = AEB for some subexpressions A, B, there is a unique substring v of w obtained when all loops in E are completely unfolded. We will denote this substring v by w(E). At the heart of the algorithm, there is a simple procedure that, given a subexpression S k of a partial unfolding C of the target expression, uses a number of membership queries (removing substrings w(S) for consecutive copies of S from w) to determine which part S m of the expression S k is an unfolding of the loop S∗, and replaces S m by S∗ S k−m. However, in order to be able to find repetitions and to preserve strong left-alignment, before making queries, the algorithm performs a number of complex loop shifts.

4.1  The Algorithm: An Example
We begin with demonstrating our algorithm for learning L on an example. Let the target (strongly left-aligned) unambiguous expression R be a∗ (bbba)∗ ba((ba)∗ c)∗ and let the input example be w = aaabbbabbbababababacbabac. It is easy to see that the target expression is unambiguous (each substring baba . . . or bbba . . . clearly defines which part of the expression R was used to derive it), and the example w is representative, as, in order to generate this example, every loop in R is unfolded at least twice. The algorithm uses some loop-shifting procedures defined in the Section 2. Let R0 := w (R0 is the conjecture before the first Stage of the algorithm). In the first Stage, the algorithm builds the expression (R, w)1 (it is the expression that contains loops of the depth 1 only, and when all of them are unfolded, one gets the string w). Let H0 := R0 . In the first Phase, the algorithm tries to find all one-letter loops. First, it finds all longest substrings in H0 that are repetitions of one letter containing at least two copies of this letter. Obviously, such substrings of w (from left to right) are v1 = aaa, v2 = bbb, v3 = bbb. Now, the algorithm tests if v1 is an unfolding of a loop in (R, w)1 as follows. It removes a from the substring v1 in H0 and queries if the remaining string aabbbabbbababababacbabac is in L. The answer is ‘yes’, and the algorithm removes aa from w and queries if the remaining string is in L. The answer is ‘yes’ again. The algorithm removes now the whole prefix aaa from w and queries the remaining string. The answer is ‘yes’. The algorithm substitutes the whole substring aaa in w by a∗ and sets (the current conjecture) C1 := a∗ bbbabbbababababacbabac.
Now, it removes b from the substring v2 in w and queries the remaining string. The answer is ‘no’, and the algorithm does not replace any part of v2 by the loop b∗; it sets C2 := C1. Then, similarly, the algorithm finds out that v3 is not an unfolding of any loop and sets C3 := C2. Since there are no other substrings that could be unfoldings of one-letter loops, the algorithm sets H1 := C3 (the conjecture after the first Phase) and moves to the second Phase.

In the second Phase, the algorithm tries to find all two-letter loops in the expression H1. First, working from left to right, it finds all longest substrings in H1 that are repetitions of a substring of length 2 containing at least two copies of this substring. Obviously, the leftmost such substring is v1 = bababababa. The algorithm queries all the strings obtained from w by deleting ba, baba, . . . from the part v1 and then v1 itself until it gets the first answer ‘no’ - in our case, it happens when it deletes the substring bababa. As the first answer ‘no’ is obtained when two occurrences of ba in the part v1 remain in the queried string, the algorithm sets (the current conjecture) C1 := a∗ bbbabb(ba)∗ babacbabac. Then the algorithm finds the next longest substring consisting of two-letter blocks, v2 = baba. Now, deleting ba, then baba from the part v2 in w, and querying the corresponding strings, the algorithm gets the answers ‘yes’ for both of them, and sets C2 := a∗ bbbabb(ba)∗ babac(ba)∗ c. Since there is no v3, the algorithm completes this Phase and sets H2 := C2.

In Phase 3, the algorithm fails to find three-letter loops in the expression H2, thus setting H3 := H2 and going to Phase 4, where it tries to build all four-letter loops. The only substring v consisting of (at least) two repetitions of the same four-letter substring bbba can be obtained when the first loop (ba)∗ in H2 is shifted to the right using RSHIFT1((ba)∗, 2). The algorithm first deletes bbba from w and queries the remaining string, then deletes bbbabbba from w and queries the remaining string. The answer in both cases is ‘yes’. The algorithm sets (the current conjecture) C1 := a∗ (bbba)∗ (ba)∗ bac(ba)∗ c. As no other four-letter loops are possible, it sets H4 := C1. As no loops of five or more letters are possible, the algorithm sets R1 := H4 and goes to Stage 2.

In Stage 2, the algorithm, using R1 = (R, w)1 as the current conjecture, builds (R, w)2. It sets H1 := R1 and goes to Phase 2 (since no loops (T)∗ of depth 2 with the length l(T) = 1 are possible). In this Phase, the algorithm tries to find all the longest substrings T in H1 that have loop depth 1 and consist of at least two blocks of the length l(T) = 2 (like, for example, a∗ ba∗ ba∗ b), possibly shifting some loops to the right. Such a substring T can be found if the first loop (ba)∗ is shifted to the right - it is T = (ba)∗ c(ba)∗ c. The algorithm first cuts out the substring bababac of w(T) = bababacbabac (the unfolding of the first subexpression (ba)∗ c in T) from w and queries the remaining string. Then it cuts out the whole w(T) and queries the remaining string. In both cases, the answer is ‘yes’, and the algorithm sets the current conjecture C1 := a∗ (bbba)∗ ba((ba)∗ c)∗. As no other loops of depth 2 are to be found, the algorithm sets R := C1 and terminates.
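The first step of each Phase, the scan for candidate repetitions, is ordinary string processing. A sketch of it, operating on the flat input string only (the function name and encoding are ours), is:

def maximal_repetitions(s, j):
    # Return (start, block, count) for maximal runs of at least two consecutive
    # copies of some block of length j, scanning left to right.
    found, i = [], 0
    while i + 2 * j <= len(s):
        block = s[i:i + j]
        k = 1
        while s[i + k * j:i + (k + 1) * j] == block:
            k += 1
        if k >= 2:
            found.append((i, block, k))
            i += k * j
        else:
            i += 1
    return found

w = 'aaabbbabbbababababacbabac'
print(maximal_repetitions(w, 1))   # [(0, 'a', 3), (3, 'b', 3), (7, 'b', 3)]  -- the v1, v2, v3 of Phase 1
print(maximal_repetitions(w, 2))   # [(9, 'ba', 5), (20, 'ba', 2)]            -- the v1, v2 of Phase 2

Which of these candidates actually become loops is then decided by the membership queries and loop shifts described above; the scan itself involves no queries.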
4.2  The Algorithm: Formal Description
Now we give a more formal description of the learning algorithm. It uses a number of procedures defined in Section 2 and the procedure RSHIFT2a, extending RSHIFT2 in the same way as RSHIFT1a extends RSHIFT1.

Algorithm. Let R be the target ∗-expression. Let w be a representative example in L = L(R). Let n = l(w). We will give just the description of Stage 2, on which the algorithm creates the loops of depth 2 (as the first Stage is similar, but simpler).

Stage 2: Let R1 be the expression obtained on Stage 1. Set H1 := R1. On each Phase j of the FOR loop below, the algorithm builds loops of depth 2 with bodies T such that l(T) = j. Obviously, j cannot be smaller than 2. Let r := n/2 if n is even, and r := n/2 − 1 otherwise.

For j = 2, 3, . . . , r DO:

Phase j: By induction, we assume that the conjecture Hj−1 has been constructed on Phase j − 1. Let C := Hj−1 be the current conjecture, and Tail := C.

WHILE (l(Tail) ≥ 2j) DO:

By induction, we will assume that, for every well-formed subexpression E of C such that C = F EG for some well-formed subexpressions F and G, there is a unique segment w(E) in w which is obtained when all loops in E are unfolded in the process of getting w from the target expression. We will make sure that this invariant also holds after one iteration of the WHILE loop. It will be achieved in the following two ways. Firstly, when a loop T∗, where T = T1 T2 and l(T1) = i, is shifted to the right using the procedure RSHIFT1(T∗, i), then w(T∗)w(T1) is substituted by w(T1)w((T2 T1)∗) (a corresponding substitution is being made for each application of RSHIFT2; as these two procedures are at the heart of all right-shifting procedures, similar associations are extended to their applications as well). Similar associations are being made for each application of the left-shifting procedures used by the algorithm. Secondly, when (and if), at the end of the iteration of the WHILE loop, a new loop S∗ is created, replacing some S k, then w(S∗) replaces w(S k).

Step 1. Let S be the prefix of Tail of the length j. Check that it is a well-formed ∗-expression. If yes, go to Step 2. If not, then, obviously, there is an outermost loop (P T)∗ in C such that some prefix of Tail is the part T)∗. If possible, shift this loop to the right (and, possibly, some other loops to the right from this loop) in C, using the procedure RSHIFT1a or RSHIFT2a, so that it would be within the prefix of the length j in the Tail and would be left-aligned in Tail. (* For example, if C = a(ba)∗ bac(ab)∗ ac and Tail = a)∗ bac(ab)∗ ac, then Tail is transformed into (ab)∗ ac(ab)∗ ac. *) If this is not possible, then remove from Tail the first character. Remove also all characters (if any) ’)’ and ’∗’ following this character.
(For example, if a prefix of Tail is a)∗ b, then remove a)∗ .) Let Tail be the remaining part of the former Tail. Go to the top of the WHILE loop. If it has been possible to transform S to a well-formed ∗-expression, as above, then go to Step 2.

Step 2. Let Tail′ be the part of Tail left when the prefix S is removed. Let T be the prefix of Tail′ of the length j. Check that T does not contain loops of depth 2. If not, go to Step 4. If yes, check that, after, possibly, shifting the rightmost outermost loop (P)∗ in T to the right repeatedly using the procedures MAXRSHIFT1 or MAXRSHIFT2 (whichever is applicable) and, possibly, on the last step the procedures RSHIFT1a(.,k) or RSHIFT2a(.,k) for an appropriate value k, the (possibly modified) prefix T of Tail′ of the length j is identical to S. (* For example, if Tail′ is (b∗ a)∗ b∗ db∗ (dc∗)∗ d∗ de and S = (b∗ a)∗ b∗ db∗ d, then the algorithm, applying RSHIFT1a((dc∗)∗, 1) to right-shift the input loop, shifting also the loop d∗, and thus transforming Tail′ into (b∗ a)∗ b∗ db∗ d(c∗ d)∗ d∗ e, will determine that the prefix of Tail′ of the same length as S is identical to S. *) If yes, go to Step 6. If no, go to Step 3.

Step 3. Repeatedly using MAXRSHIFT1 and, possibly, RSHIFT1a(.,k) for the appropriate value of k on the last step, check if the rightmost outermost loop in the part of C before Tail can be shifted to the right (possibly causing shifting some other outermost loops to the right in Tail) so that it becomes the first loop in the prefix S of Tail, is left-aligned within Tail, and S does not contain loops of depth 2. (* For example, if C = b∗ bbbbcb∗ bc, Tail = bbcb∗ bc, and the prefix S = bbc, then shift the first loop b∗ in C to the right to obtain the expression bbbb∗ bcb∗ bc; thus, the prefix S of Tail becomes b∗ bc. *) If no, go to Step 4. If yes, go to Step 5.

Step 4. Return back to the original expression C. Remove the first character from Tail. Remove also all characters (if any) ’)’ and ’∗’ following this character. Let Tail be the remaining part of the former Tail. Go to the top of the WHILE loop.

Step 5. Let Tail′ be the part of Tail left when the prefix S (as defined on Step 3) is removed. Let T be the prefix of Tail′ of the length j. Check that, after, possibly, shifting the rightmost outermost loop (P)∗ in T to the right repeatedly using MAXRSHIFT1 or MAXRSHIFT2 (whichever is applicable) and, possibly, RSHIFT1a(.,k) or RSHIFT2a(.,k) for an appropriate value of k (which, in turn, may require right-shifting some loops to the right from (P)∗), the (possibly modified) prefix T of Tail′ of the length j is identical to S. If no, go to Step 4. If yes, go to Step 6.

Step 6. Find the longest prefix V of Tail that consists of repetitions of S. Let Tail′ be the part of Tail left when the prefix V is removed. Let V′ be the prefix of Tail′ of the length l(S). Repeatedly using MAXRSHIFT1 or MAXRSHIFT2, and, possibly, RSHIFT1a(.,k) or RSHIFT2a(.,k) for
an appropriate k on the last step, try to right-shift the rightmost outermost loop in V′ so that it would be left-aligned in the part of Tail′ with V′ removed, attempting to make the prefix V′ of the modified Tail′ identical to S. If this is possible, set V := V V′; otherwise, leave V unchanged. Let p be the number of copies of S in V. Let S1 be the first copy of S in V, and, for each i ≤ p, let Si := Si−1 S. Thus, V = Sp.

For i = 1, 2, . . . , p DO: Remove w(Si) from w and, for the obtained string v, query ‘v ∈ L?’. If the answer is “yes”, go to the next step of this loop. If “no”, then break out of this loop.

Step 6.1. If the answer to the first query in the FOR loop on Step 6 is “no”, go to Step 4.

Step 6.2. If there was an answer “yes” in the above FOR loop, let i be the last step of the FOR loop on which the answer was “yes”. Substitute Si in Tail (and, thus, in C) by (S)∗ and set C to the modified expression. Let Tail′ be such that Tail = (S)∗ Tail′. Set Tail := Tail′. If p, as defined on Step 6, is greater than i, then, if there was a loop that was shifted out from the last copy of S in Sp on Step 6, restore this loop, as well as all other loops to the right from it that could have been shifted on Step 6, to their original positions in the expression C.

END WHILE loop. Set Hj := C.

END FOR loop. Set R2 := Hr. Output R2 as the target expression.

4.3  Correctness of the Algorithm
We give a sketch of the correctness proof, omitting many technical details. We will begin with the following REMARK 1. Suppose R is strongly left-aligned. One way to derive an example w ∈ L(R) from R is to use standard derivation. Another way, which combines standard derivation with shifting loops, is as follows. Let m be the maximal length of loops of depth 2 in R. First, unfold all loops of depth 2 of length m, from right to left, left-aligning all loops in the expression after unfolding each loop. Then, repeat this process for all loops of depth 2 of length m − 1, then for all loops of depth 2 of length m − 2, etc. Once there are no loops of depth 2 left, apply the same process to all loops of depth 1. It is quite obvious that, given a string w, if the expression R is unambiguous, then each expression obtained after unfolding just one loop in the above process (and left-aligning) must be unique. For example, consider the following (unambiguous) strongly left-aligned expression R = bb(b∗ c)∗ dabb(b∗ aa)∗ a∗ and the representative example
bbbbcbbbbcdabbbbbaabbaaaa. Using the process described in Remark 1, we can get w by unfolding loops and left-aligning loops in the obtained expressions in the following order:

unfolding the loop (of depth 2 and maximal length 3) (b∗ aa)∗ : bb(b∗ c)∗ dabbb∗ aab∗ aaa∗ ;
left-aligning all loops in the above expression: bb(b∗ c)∗ dab∗ bbaab∗ a∗ aa;
unfolding the loop (of depth 2 and length 2) (b∗ c)∗ : bbb∗ cb∗ cdab∗ bbaab∗ a∗ aa;
left-aligning the first loop b∗ : b∗ bbcb∗ cdab∗ bbaab∗ a∗ aa;
etc.

The algorithm obviously attempts to synthesize loops in the order opposite to the derivation order described in Remark 1. As Remark 1 states, given the string w, each expression obtained after the unfolding of every next loop is unique. By induction, we will assume that the expression C before some iteration of the WHILE loop on some Phase j of some Stage is the expression obtained in the process described in Remark 1 (and, thus, the correct unfolding of the target expression R on the path to getting w). Let C′ be the expression described in Remark 1 from which C is obtained by unfolding one loop, and let (G)∗ be this loop. We will consider just the case when the length of the loop (G)∗ is the same as the length of the last loop created on some iteration of the given WHILE loop (other cases are similar). We intend to show that (G)∗ will be created on the first iteration of the given WHILE loop on which a new loop is created (and, thus, the expression C′ will be synthesized). Thus, suppose we are on some iteration of the WHILE loop. If the prefix of Tail of the length j is not a well-formed ∗-expression and it cannot be made well-formed by shifting to the right the loop U = (P T)∗ in C such that some prefix of Tail is the part T)∗, then moving to the next iteration of the WHILE loop is clearly justified. Now, suppose that after, possibly, right-shifting the loop U mentioned just above, the prefix S of Tail of the length j is a well-formed ∗-expression. Several cases are possible. First, it is possible that, in order to obtain C from C′, the loop in C′ unfolded at this step of the process described in Remark 1 does not contain loops that are to be shifted to the left (and out from S) in order to left-align C. As the loop U is unfolded at least twice, when this happens, if S is the body of U, it must be followed by another copy of S. However, when later some other loop is unfolded (and left-alignment occurs), some loops may be shifted to the second copy of S from the right. On Step 2, the algorithm tries to shift these loops out of the second copy of S. We now argue that it does so correctly. Let P∗ be the loop as defined on Step 2. We are going to show that this loop is the only loop that could possibly have been shifted to T (as defined on Step 2). Suppose there was a loop A∗ neighbouring P∗ on the right that also was shifted to S. There
are many cases to consider here, but essentially all of them can be reduced to the following case: the expression P∗ A∗ is the tail of the second copy of S, and P∗ A∗ was the prefix of the expression following S on the right before shifting. This means that S, after left-shifting P∗ A∗, is followed by the expression P A. That is, we have the expression P∗ A∗ P A, and it was obtained from P AP∗ A∗ by our left-shifting procedures. However, then the following claim can be proved, based on the definitions of our left-shifting procedures.

CLAIM 1. If P∗ A∗ P A is obtained from P AP∗ A∗ by our left-shifting procedures, then P = B k and A = B m for some expression B whose length is the GCD of l(A) and l(P).

It follows from Claim 1 that an expression containing such neighbouring loops P∗, A∗ cannot be unambiguous - there exist unfoldings that can be created by using, for example, just the loop P∗, or just the loop A∗. By contradiction, we conclude that there could have been at most one loop shifted to S from the right. Similar arguments apply to the last copy of S in V on Step 6. It follows from the description of the algorithm that no copy of the expression S defined on Step 2 would have preceded it in C (since, otherwise, it would have been found on earlier iterations of the WHILE loop). Similarly, there was no such B preceding S = AB in C that would have resulted in extending BA to a V as described on Step 6, since, otherwise, it would have resulted in generating the loop (BA)∗ substituting at least a part of S on an earlier iteration of the WHILE loop. It is clear now that, in the case under consideration, the loop in C resulting in the expression S r could have been only the loop created on Step 6.2 (note that C is left-aligned). Due to space restrictions, we omit the case when there is some loop to the left of S that can be right-shifted to S, as described on Step 3.

4.4  Complexity
Let n be the length of the input example w. Let us determine the complexity of one iteration of the WHILE loop. First, the running time of each of the procedures LSHIFT1, LSHIFT2, RSHIFT1, RSHIFT2, and their MAX versions does not exceed O(n2). The procedures RSHIFT1a and RSHIFT2a make up to O(n) recursive calls and thus have running time O(n3). Now it is easy to see that the total running time of one iteration of the WHILE loop is O(n3). As the number of iterations of one WHILE loop does not exceed O(n) and the total number of WHILE loops does not exceed O(n), the overall complexity of the algorithm does not exceed O(n5). As the algorithm makes up to n membership queries on each iteration of the WHILE loop, the overall number of queries does not exceed O(n3).
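In symbols, the bounds above combine as (merely a restatement of the counting just given):

\[
\underbrace{O(n^{3})}_{\text{one WHILE iteration}} \cdot \underbrace{O(n)}_{\text{iterations per WHILE loop}} \cdot \underbrace{O(n)}_{\text{WHILE loops}} \;=\; O(n^{5}),
\qquad
n \cdot O(n) \cdot O(n) \;=\; O(n^{3}) \text{ queries}.
\]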
5  Conclusion
We have developed a polynomial-time learning algorithm for all unionless unambiguous regular expressions in a natural normal form using one representative
input example and membership queries. One must note that, while our algorithm uses help from the teacher (oracle), it does not use counterexamples (which, within the framework of active learning model, are usually obtained using quite powerful equivalence queries, see, for example, [1], or subset queries). Representative examples used in our model are obtained by unfolding each occurrence of each loop in the target expression at least twice. In fact, the algorithm can be modified to work (in polynomial time) for representative example using at least one unfolding of each loop. However, for practical purposes, it is more natural to assume that each loop is unfolded at least twice, since, in this case, the algorithm avoids testing segments of input example that cannot possibly be unfoldings of loops. An interesting question is whether our algorithm can be extended to learn unambiguous finite unions R1 ∪ R2 ∪ . . . Rk of ∗-expressions R1 , R2 , . . . , Rk from finite sets of examples representing each Ri , i = 1, 2, . . . , k. In this case, a more complex notion of normal form, extending strong left-alignment, would probably be needed (for example, this form could probably require replacing the expression a∗ aaa ∪ aa by the equivalent unambiguous expression a∗ aa). Acknowledgments. The author is grateful to anonymous referees of ICGI’2010 for a number of helpful comments and suggestions.
References
1. Angluin, D.: Learning Regular Sets from Queries and Counterexamples. Information and Computation 75, 87–106 (1987)
2. Angluin, D.: Queries and Concept Learning. Machine Learning 2, 319–342 (1988)
3. Book, R., Even, S., Greibach, S., Ott, G.: Ambiguity in Graphs and Expressions. IEEE Transactions on Computers C-20(2), 149–153 (1971)
4. Brāzma, A.: Efficient Identification of Regular Expressions from Representative Examples. In: 6th Annual ACM Conference on Comp. Learning Theory, pp. 236–242. ACM Press, New York (1993)
5. Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of Concise DTDs from XML Data. In: Proceedings of the 32nd International Conference on VLDB, pp. 115–126. ACM Press, New York (2006)
6. Brüggemann-Klein, A., Wood, D.: One-Unambiguous Regular Languages. Information and Computation 142(2), 182–206 (1998)
7. Fernau, H.: Algorithms for Learning Regular Expressions from Positive Data. Information and Computation 207(4), 521–541 (2009)
8. Gold, E.M.: Language Identification in the Limit. Information and Control 10, 447–474 (1967)
9. Kinber, E.: On Learning Regular Expressions and Patterns. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 125–138. Springer, Heidelberg (2008)
Splitting of Learnable Classes

Hongyang Li 1 and Frank Stephan 2

1 Department of Mathematics and Department of Computer Science, National University of Singapore, Singapore 119076, Republic of Singapore
[email protected]
2 Department of Mathematics and Department of Computer Science, National University of Singapore, Singapore 119076, Republic of Singapore
[email protected]
Abstract. A class L is called mitotic if it admits a splitting L0 , L1 such that L, L0 , L1 are all equivalent with respect to a certain reducibility. Such a splitting might be called a symmetric splitting. In this paper we investigate the possibility of constructing a class which has a splitting and where any splitting of the class is a symmetric splitting. We call such a class a symmetric class. In particular we construct an incomplete symmetric BC-learnable class with respect to strong reducibility. We also introduce the notion of very strong reducibility and construct a complete symmetric BC-learnable class with respect to very strong reducibility. However, for EX-learnability, it is shown that there does not exist a symmetric class with respect to any weak, strong or very strong reducibility. Keywords: inductive inference, mitotic classes, intrinsic complexity.
1  Introduction
Gold [7] initiated the study of inductive inference; he considered, besides various other models, in particular the learning of classes of recursively enumerable sets from positive data. The basic idea of this scenario is that the learner is presented with a list of all elements of some member set in the class in arbitrary order and has to find, in the limit, a program which enumerates the given language. The initial study was soon extended [2,3,4,13,14] and notions to compare the difficulties of classes were introduced, in particular notions which translate the sequence of data describing the language from the first class into a corresponding sequence of data for a language in the second class plus a reverse translation from any sequence of hypotheses for the image language back to a sequence of hypotheses for the first class; if such a reduction exists and the second class is learnable, so is the first. These reducibilities were introduced in order to measure the intrinsic complexity of learning [5,8,9,10] and the field is quite well-studied within inductive inference. Based on this notion, Jain and Stephan [11] investigated whether mitoticity occurs in inductive inference. The notion of mitoticity stems from the study of recursively enumerable sets [15] and means that an r.e. set A is the
union of two disjoint r.e. sets B, C such that A, B, C all have the same Turing degree [12]. This concept was also brought over to complexity theory [1,6]. When studying it in inductive inference, Jain and Stephan [11] failed to solve the following related question (where the notions of splitting, intrinsic reducibility ≤r and complete class will be made more precise below).

Question 1. Given an intrinsic reducibility ≤r , is there a learnable class L such that L admits a splitting and every splitting of L is symmetric, that is, for every splitting of L into two halves L0 and L1 it holds that L0 ≤r L1 and L1 ≤r L0 ?

Jain and Stephan [11] did not solve this question. However, they showed one result on the way to it: If a class L is BC-complete with respect to strong reducibility and if L0 , L1 form a splitting of L then either L0 ≡strong L or L1 ≡strong L. Note that splittings always exist in the case of a BC-complete class [11]. In the following, the underlying definitions and notions are explained formally.

– A general recursive operator Θ is a mapping from total functions to total functions such that there is a recursively enumerable set E of triples such that, for every total function f and every x, y, Θ(f)(x) = y iff there is an n such that f(0)f(1) . . . f(n), x, y ∈ E.
– A language L is a recursively enumerable subset of the natural numbers.
– A class L is a set of languages.
– A text T of a language L is an infinite sequence T(0), T(1), T(2), . . . such that every member of L is equal to some T(m) and every T(m) is either a member of L or a special symbol to denote a pause. The content of a text T, denoted by content(T), is the set containing all symbols that have appeared in T.
– A learner M is a general recursive operator that reads more and more elements of the text T and outputs a sequence e0 , e1 , . . . of conjectures. M explanatorily (EX) learns some language L iff there is an n such that en = en+1 = . . . and L = Wen where W0 , W1 , . . . is an underlying acceptable numbering of all r.e. sets which is used as a fixed hypothesis space for learning. M behaviourally correctly (BC) learns L iff L = Wen for almost all n. Now M learns a class L iff M learns every language L ∈ L from any text of L under the learning criterion considered (EX or BC, respectively).
– A classifier C is a general recursive operator that reads more and more elements of the text T and outputs a binary sequence a0 , a1 , . . . where each an is either 0 or 1. We say a classifier C classifies a class L if C converges to an element in {0, 1} in the limit on any text T of any language L in the class L, and C converges to the same number on any text of the same language. For convenience we write C(L) to denote the number that C converges to on any text of L. It should be noted that a classifier of a class L is not required to converge on texts of languages outside the class L.

Note that in the framework of inductive inference it does not matter how fast a learner M or a classifier C converges. The machine can be slowed down by
starting with an arbitrary guess and later repeating conjectures. Similarly, if one translates one text of a language L into a text of a language H, it is not important how fast the symbols of H show up in the translated text; it is only important that they show up eventually. Therefore the translator can put into the translated text pause symbol, #, until more data are available or certain simulated computations have terminated. Therefore, learners, operators translating texts and classifiers can be made primitive recursive by the just mentioned delaying techniques. Thus one can have recursive enumerations Θ0 , Θ1 , Θ2 , . . . of translators from texts to texts, M0 , M1 , M2 , . . . of learners and C0 , C1 , C2 , . . . of classifiers such that, for every given translator, learner or classifier, the corresponding list contains an equivalent one. Definition 2. We say that a class L is weakly reducible to H (written L weak H) iff there are two general recursive operators Γ, Θ such that for every language L ∈ L and every text T of L there is a language H ∈ H satisfying that Γ translates T into a text of H and that whenever a sequence E converges to an index eH such that WeH = H then the sequence Θ(E) converges to an index eL such that WeL = L. We say that a general recursive operator Γ strongly maps texts of languages in L to texts of languages in H iff, whenever T, T are texts of the same language in L, Γ (T ), Γ (T ) are texts of the same language in H. Furthermore, L is strongly reducible to H (written L strong H) iff L weak H via general recursive operators Γ, Θ such that Γ strongly maps texts of languages in L to texts of languages in H. A class L is very strongly reducible to H (written L vs H) iff there is a general recursive operator Γ and a recursive function f such that the following conditions hold: – If T is a text of a language in L then Θ(T ) is a text of a language in H. – If T, T are texts of the same language in L then Γ (T ), Γ (T ) are texts of the same language in H. – If T is a text of a language L ∈ L and e is an index such that We = content(Γ (T )) then Wf (e) = L. Definition 3. Given a learning criterion I and a reducibility r , a class H is I-complete with respect to r if for any I-learnable class L we have L r H. Ladner [12] introduced the recursion-theoretic version of mitoticity. He defined that an infinite recursively enumerable set A splits into infinite sets A0 , A1 if there is a partial recursive function with domain A mapping the elements of Aa to a for all a ∈ {0, 1}. In learning theory, the corresponding role of the partial recursive function is played by a classifier C which classifies the class L into L0 , L1 . Definition 4. A splitting of a class L is a pair of infinite sub-classes L0 , L1 such that L0 ∩ L1 = ∅, L0 ∪ L1 = L and there exists a classifier C such that, for all a ∈ {0, 1} and for all texts T with content(T ) ∈ La , C converges on T to a.
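To make these abstract definitions a little more tangible, here is a small illustration (ours, not part of the paper): a learner and a classifier, simulated on finite prefixes of a text, for the toy class of singleton languages {{n} : n ∈ N}. The representation of conjectures as Python sets and the finite-prefix simulation are simplifying assumptions; real learners and classifiers are general recursive operators on infinite texts.

```python
# Illustrative only: learners and classifiers in the paper are general
# recursive operators on infinite texts; here we simulate them on finite
# text prefixes and represent a conjecture for W_e = {n} directly as a set.

PAUSE = "#"  # the pause symbol allowed in texts

def singleton_learner(prefix):
    """EX-style learner for the class {{n} : n in N}: conjecture the set
    containing the first non-pause datum seen so far."""
    for x in prefix:
        if x != PAUSE:
            return frozenset([x])
    return frozenset()  # arbitrary initial guess before any data is seen

def parity_classifier(prefix):
    """Classifier splitting the class into even and odd singletons:
    converges to 0 on texts of {2x} and to 1 on texts of {2x + 1}."""
    for x in prefix:
        if x != PAUSE:
            return x % 2
    return 0  # arbitrary guess before any data is seen

# A text of {5} with pauses; both machines converge after the first datum.
text_prefix = [PAUSE, 5, PAUSE, 5, 5, PAUSE]
print([singleton_learner(text_prefix[:i + 1]) for i in range(len(text_prefix))])
print([parity_classifier(text_prefix[:i + 1]) for i in range(len(text_prefix))])
```

The parity classifier induces a splitting of the class in the sense of Definition 4: the even singletons form one half and the odd singletons the other.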
While mitoticity only demands the existence of one splitting whose sub-classes are both equivalent to the original class, a symmetric class requires that the sub-classes of any splitting are equivalent to the original class.

Definition 5. A class L is called a symmetric class with respect to a certain reducibility ≤r if L has a splitting, and for any splitting L0, L1 of L, we have L0 ≡r L1 ≡r L.

This paper contains two main results extending the theorems established in [11]. While Jain and Stephan have shown the existence of a complete BC-learnable class with only comparable splittings, in this paper both a complete BC-learnable symmetric class and an incomplete BC-learnable symmetric class are constructed. Jain and Stephan have also shown in [11] that an EX-learnable class complete for ≤weak (≤strong) always has a symmetric splitting with respect to ≤weak (≤strong). However, in this paper it is shown that there exists no EX-learnable symmetric class for ≤weak. Both results needed novel approaches which were not available in [11]. The other results in the present work are less involved and obtainable with known techniques.
2
Very Strong Reducibility
In this section we present several results related to the very strong reducibility. First we show in Theorem 6 that ≤vs is strictly stronger than ≤strong. Then we show in Theorem 7 that various learnabilities are preserved under ≤vs, so that it is reasonable to consider splittings of a learnable class with respect to ≤vs.

Theorem 6. There exist explanatorily learnable classes L and H such that L ≤strong H but L ≰vs H.

Proof. Take the following classes, which are clearly EX-learnable:

L = {{n} : n ∈ N}    (1)

and

H = {{0, 1, . . . , n} : n ∈ N}.    (2)

Let Γ be the recursive operator which replaces in every text each occurrence of n by an occurrence of the sequence 0, 1, . . . , n and therefore translates every text of {n} into a text of a language of the form {0, 1, . . . , n}. Then Γ strongly maps texts of languages in L to texts of languages in H. Let E = (e0, e1, e2, . . .) be a sequence of indices. Define a recursive operator Θ such that whenever E converges syntactically to an index e and We has maximum n, then Θ(E) is a sequence of indices which converges to some index m with Wm = {n}. The sequence output by the operator might either fail to converge or converge to an arbitrary index in the case that E does not converge or E converges to an index for a set which is not of the form {0, 1, . . . , n} for any n. Then Γ, Θ witness the strong reducibility from L to H. Therefore the statement L ≤strong H holds.
Now assume for a contradiction that L ≤vs H. Then there is a recursive operator mapping every text of a set of the form {n} to a text of a set of the form {0, 1, . . . , γ(n)} for some K-recursive function γ, and a recursive function f such that for all n ∈ N and for all e ∈ N, Wf(e) = {n} as long as We = {0, 1, . . . , γ(n)}. Now choose m, n such that γ(m) < γ(n) and let

Wg(d) = {0, 1, . . . , γ(m)} if d ∉ K;  Wg(d) = {0, 1, . . . , γ(n)} if d ∈ K.    (3)

It follows that

m ∈ Wf(g(d)) ⇔ d ∉ K,  n ∈ Wf(g(d)) ⇔ d ∈ K    (4)

and therefore one could find out whether d ∈ K by enumerating Wf(g(d)) until either m or n shows up in the enumeration. This would give that the halting problem is recursive, a contradiction. Therefore L ≰vs H.
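The translator Γ used in this proof is concrete enough to sketch in code. The following is an illustrative rendering under the same finite-prefix simplification as before; it is not the formal operator of the paper.

```python
PAUSE = "#"

def gamma(text_prefix):
    """Translate a prefix of a text of {n} into a prefix of a text of
    {0, 1, ..., n}: every occurrence of n is replaced by the block 0..n,
    and pause symbols are passed through unchanged."""
    out = []
    for x in text_prefix:
        if x == PAUSE:
            out.append(PAUSE)
        else:
            out.extend(range(x + 1))  # emit 0, 1, ..., x
    return out

print(gamma([PAUSE, 3, PAUSE, 3]))
# ['#', 0, 1, 2, 3, '#', 0, 1, 2, 3]
```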
Theorem 7. Suppose L and H are classes of languages and suppose L ≤vs H. Then the following statements hold:
– If H is EX-learnable, then L is also EX-learnable;
– If H is BC-learnable, then L is also BC-learnable.

Proof. We prove the case of EX-learnability; the proof for the BC case is similar. Let M be an EX-learner of H. Assume the very strong reducibility is witnessed by (Γ, f). Define the new learner N for L as follows:

N(σ) = f(M(Γ(σ))).    (5)

Let T be a text of a language in L. Since M EX-learns H, M converges on Γ(T) to an index e of content(Γ(T)). Then N converges on T to f(e). By definition of very strong reducibility, Wf(e) = content(T). Therefore, N converges on T to a correct index and N EX-learns L.
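The learner built in this proof is just a composition, which the following sketch (ours; indices are represented directly as finite sets, and the concrete M, Γ and f are toy stand-ins wired up with the Theorem 6 classes) makes explicit.

```python
def make_learner_for_L(M, Gamma, f):
    """Given a learner M for H, a text translator Gamma and an index
    translator f witnessing a very strong reduction, build the learner
    N(sigma) = f(M(Gamma(sigma))) for L, exactly as in the proof."""
    return lambda sigma: f(M(Gamma(sigma)))

# Toy wiring with the classes of Theorem 6:
PAUSE = "#"
Gamma = lambda sigma: [y for x in sigma if x != PAUSE for y in range(x + 1)]
M = lambda sigma: set(range(max(sigma) + 1)) if sigma else set()  # learns {0..n}
f = lambda conjecture: {max(conjecture)} if conjecture else set()  # recovers {n}

N = make_learner_for_L(M, Gamma, f)
print(N([PAUSE, 4, 4, PAUSE]))  # {4}
```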
The next result investigates the question of whether there are complete classes with respect to very strong reducibility, and it gives an affirmative answer for BC-learnable classes. Similarly one can also show the existence of complete EX-learnable classes with respect to very strong reducibility.

Theorem 8. There exists a BC-learnable class which is complete for very strong reducibility.

Proof. Fix an enumeration M0, M1, M2, . . . of all learners. Consider the class

L = {{x} ⊕ We : Mx BC-learns We}.    (6)

Now L is BC-learnable since there is a learner which waits until the element 2x has been seen in the input text (as {x} ⊕ We = {2x} ∪ {2y + 1 : y ∈ We}); once 2x is known, the learner simulates Mx to learn We and translates every index d conjectured by Mx to an index for {2x} ∪ {2y + 1 : y ∈ Wd}.

For the converse direction, given an arbitrary BC-learnable class H, let Mx be a learner for H. We define a mapping Γ from languages in H to languages in L such that

Γ(We) = {x} ⊕ We.    (7)

Note that there also exists a recursive function f such that

Wf(e′) = {y : 2y + 1 ∈ We′}.    (8)

Then the pair Γ, f shows that H ≤vs L. Therefore the class L is complete for ≤vs.
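The join coding used in this proof is easy to make explicit. The sketch below is an illustration with finite Python sets standing in for r.e. sets; the function names are ours.

```python
def join(x, W):
    """Encode the pair ({x}, W) as {2x} united with {2y + 1 : y in W}."""
    return {2 * x} | {2 * y + 1 for y in W}

def learner_index(joined):
    """Recover x, i.e. which learner M_x is responsible for this language."""
    return next(e // 2 for e in joined if e % 2 == 0)

def payload(joined):
    """Recover W, as the recursive function f of the proof does."""
    return {e // 2 for e in joined if e % 2 == 1}

L = join(3, {0, 5, 7})
print(sorted(L))          # [1, 6, 11, 15]
print(learner_index(L))   # 3
print(sorted(payload(L))) # [0, 5, 7]
```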
3
Symmetric BC-Learnable Classes
This section contains the main results of the paper. We construct a BC-complete symmetric class with respect to ≤vs (which is automatically complete for ≤strong) in Theorem 9 and a BC-incomplete symmetric class with respect to ≤vs in Theorem 10. In the end we construct a BC-complete class which is not symmetric in Theorem 11.

Theorem 9. There is a BC-learnable class J which is complete for ≤vs such that, for any splitting J0, J1 of J, we have J0 ≡vs J1 ≡vs J.

Proof. Let L be a BC-learnable class which is complete for ≤vs. Fix a numbering of r.e. sets such that W0 = ∅. Define

f(n) = max{0, ϕe(m) ↓ : e ≤ n, m ≤ n}.    (9)

Define for any recursive function g the language

Jg = {⟨x, y, z⟩ : (y < f(x)) ∨ (y = f(x) ∧ z ∈ Wg(x))}    (10)

and define the class

J = {Jg : (∀∞x [g(x) = 0]) ∧ (∀x [g(x) ≠ 0 ⇒ Wg(x) ∈ L])}.    (11)

First we show that J is BC-learnable. Given any input string σ, let

σx,y(n) = z if σ(n) = ⟨x, y, z⟩;  σx,y(n) = # otherwise.    (12)

Let M be a learner for L such that ∀n [M(#n) = 0]. We define the learner N such that

WN(σ) = {⟨x, y, z⟩ : ∃s [(y < fs(x)) ∨ (y = fs(x) ∧ z ∈ WM(σx,y))]}    (13)
where fs (x) is the maximum of 0 and all values ϕe (m) where e ≤ x, m ≤ x and the corresponding computation terminates within s computation steps. Note that the sequence f0 , f1 , . . . recursively approximates f from below. Let Jg be any language in J . To see that N learns Jg , first note that given any x, we always have fs (x) = f (x) for sufficiently large s. Then we have ∀x∀y[y < f (x) ⇒ ∃s[y < fs (x)]]. Therefore, all tuples x, y, z ∈ Jg with y < f (x) will eventually go into the set WN (σ) for any σ. Also note that ∀s∀x[fs (x) ≤ f (x)]. Therefore, the condition y < fs (x) will not put any tuple x, y, z with y ≥ f (x) into WN (σ) . Since there are only finitely many x with g(x) = 0, there exists an s such that fs (x) = f (x) for all x with g(x) = 0. By definition, if g(x) = 0, then Wg(x) ∈ L. Since M BC-learns L, we have WM(σx,f (x) ) = Wg(x) in the limit, namely, given any text T of Jg , for sufficiently long initial segment σ of T , we have for all x that WM(σx,f (x) ) = Wg(x) . It follows that WN (σ) enumerates a tuple of the form x, f (x), z if and only if g(x) = 0 and z ∈ Wg(x) . This justifies that N learns Jg . Since Jg is chosen arbitrarily, we claim that N BC-learns J . Next we show that L vs J0 for any splitting J0 , J1 of J . Then analogously L vs J1 and the theorem follows automatically from the transitivity of vs and the completeness of L. Fix some Jh ∈ J0 . Let C be an arbitrary classifier. Without loss of generality assume C(Jh ) = 0. Let be the locking sequence of C on Jh . Define ϕe (n) = max{y : ∃x, z[τn is defined and x, y, z ∈ τn ]}
(14)
where τn is the first string found such that 1. C(τn ) = C(); 2. u, v, z ∈ τn ⇒ u > n or u, v, z ∈ Jh . If such a τn cannot be found, then ϕe (n) is undefined. / Jh ; otherwise the condition Note that τn must contain some xτn , yτn , zτn ∈ C(τn ) = C() would contradict the fact that is a locking sequence. Then we have yτn ≥ f (xτn ). Since ∀x ≤ n[x, y, z ∈ τn ⇒ x, y, z ∈ Jh ], we have xτn > n. Therefore we have ϕe (n) ≥ yτn ≥ f (xτn ) ≥ f (n). We claim that there exists some m such that ϕe (n) is undefined for all n > m. Consider the function 1 + ϕe (n) if ϕe (n) ↓, (15) ϕe (n) = ↑ if ϕe (n) ↑. Let m = max{e , e}. Assume that n > m and ϕe (n) is defined. Then ϕe (n) is also defined. By definition of f we have f (n) ≥ ϕe (n) > ϕe (n), contradicting the fact that ϕe (n) ≥ f (n). Since h is 0 almost everywhere and since W0 = ∅, there exists some d > m such that (16) ∀x ≥ d ∀y, z[x, y, z ∈ Jh ⇐⇒ y < f (x)]. Since d > m, ϕe (d) is undefined. Therefore any string τ must violate at least one of the conditions in the definition of τn where n = d. Let B = {x, y, z : x ≤
d − 1}. Now consider any superset Jg of Jh such that Jg ∩ B = Jh ∩ B, and let τ be any string from the language Jg . Since Jg preserves Jh up to x = d − 1, τ always satisfies the second condition in the definition of τn for n = d−1. However τn is not defined for n = d − 1 since d − 1 ≥ m. Therefore one must conclude that τ violates the first condition in the definition of τn , which implies that C() = C(τ ). Since τ is chosen arbitrarily, we have the following conclusion: Let B = {x, y, z : x ≤ d − 1}. For any language Jg ∈ J , if Jh ⊆ Jg and Jg ∩ B = Jh ∩ B, then C(Jg ) = C(Jh ) = 0. Now consider a mapping Γ from texts of languages in L to texts of languages in J0 such that Γ (We ) = Jh ∪ {d, f (d), z : z ∈ We }. (17) Since d is a fixed finite number, we can assume that Γ knows the value of d and f (d). Then we can construct such a Γ by putting into the resulting text all tuples x, y, z ∈ Jh and d, f (d), w for all w in the input text. It is clear that C(Γ (We )) = C(Jh ) = 0 since Γ (We ) ∩ B = Jh ∩ B and Γ (We ) is a superset of Jh . For the other direction of the reduction, define a recursive function f˜ such that for any index e Wf˜(e ) = {z : d, f (d), z ∈ We }.
(18)
Note that Jh ∩ {d, f (d), z : z ∈ We } = ∅ by the choice of d. Then the pair Γ, f˜ shows that L vs J0 . Due to the transitivity of vs , we have J vs L J0 , which implies that J ≡vs J0 . Since Jh and the classifier C are arbitrary, the theorem is proven.
By a similar proof, one can show that there is also a BC-incomplete symmetric class.

Theorem 10. There is a BC-learnable class J which is incomplete for ≤strong (and thus incomplete for ≤vs) such that, for any splitting J0, J1 of J, we have J0 ≡vs J1 ≡vs J.

Proof. Fix any acceptable numbering of r.e. sets such that W0 = ∅. Define

f(n) = max{0, ϕe(m) ↓ : e ≤ n, m ≤ n}.    (19)

Define for any recursive function g the language

Jg = {⟨x, y, z⟩ : (y < f(x)) ∨ (y = f(x) ∧ z ∈ Dg(x))}    (20)

where De is the finite set with canonical index e, that is, De = E iff E is finite and ∑_{d∈E} 2^d = e. Note that D0 = ∅. Let

J = {Jg : ∀∞x [g(x) = 0]}.    (21)

Similar to the proof of Theorem 9 one could show that J is BC-learnable and symmetric.
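The canonical indexing of finite sets used here is just the binary representation of the index; a quick sketch (ours) shows both directions.

```python
def canonical_set(e):
    """D_e: the finite set whose characteristic 0/1 vector is the binary
    expansion of e (so D_0 is the empty set)."""
    return {d for d in range(e.bit_length()) if (e >> d) & 1}

def canonical_index(E):
    """The inverse direction: the sum of 2**d over d in E."""
    return sum(2 ** d for d in E)

print(canonical_set(0))              # set()
print(sorted(canonical_set(13)))     # [0, 2, 3], since 13 = 2**0 + 2**2 + 2**3
print(canonical_index({0, 2, 3}))    # 13
```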
To show that the class J is not BC-complete, consider the class

H = {Hn : n ∈ N}, where x ∈ Hn ⇔ x ≥ n.    (22)

Clearly H is BC-learnable. Note that H contains an infinite descending chain starting with H0. However, for any Jg ∈ J, there does not exist an infinite descending chain in J starting with Jg, due to the fact that any Jg̃ ∈ J is a finite-variant superset of {⟨x, y, z⟩ : y < f(x)}. Since a strong reduction preserves the proper subset relationship among the members of the class [10], it also preserves infinite descending chains with respect to the subset relation, and so H ≰strong J.

Note that H from the previous proof is EX-learnable and therefore J is not hard for all EX-learnable classes with respect to strong and very strong reducibility.

Theorem 11. There is a BC-complete class H which has a splitting H0, H1 such that H ≰strong H1 and H0 ≰strong H1.

Proof. Let L be an arbitrary BC-complete class. Define

H0 = {{0, 1} ∪ {x + 2 : x ∈ L} : L ∈ L}.    (23)

Then H0 is also BC-complete. Now let H1 = {{x} : x ∈ N}. Note that H1 is EX-learnable. Then H = H0 ∪ H1 is also BC-complete. It is easy to see that there is a classifier that splits H into H0 and H1. However, we cannot have H ≤strong H1 or H0 ≤strong H1, as either would imply that every BC-learnable class is EX-learnable.
4
Asymmetric EX-Learnable Classes
We have shown that there is a BC-learnable symmetric class with respect to vs . It is natural to ask whether it is possible to construct a symmetric class for EX-learnability. As we will show in this section, there is no EX-learnable symmetric classes. Definition 12. We say that a general recursive operator Γ mapping texts of languages in L to texts of languages in H is 1-1 if, whenever T1 and T2 are texts of different languages in L, Γ (T1 ) and Γ (T2 ) are texts of different languages in H. Note that the general recursive operator Γ in the definition of weak, strong and very-strong reducibility must be 1-1; otherwise it is impossible for the operator Θ to translate a sequence of indices converging to an index of content(Γ (T )) to a sequence of indices converging to an index of content(T ). Theorem 13. There is no EX-learnable class L such that L has a splitting and for every splitting L0 , L1 of L, L0 ≡weak L1 .
Proof. Let M be an EX-learner for L. Without loss of generality we could assume that M converges to the same index on all texts of the same language in L [4]. Let C be an arbitrary classifier that classifies L and let L0 , L1 be the splitting of L produced C. Fix some language L ∈ L0 . Then C(L) = 0. Suppose M converges on all texts of L to an index eL . Let σL be the minimum locking sequence of M on L. Define another classifier C such that ⎧ ⎪ ⊆ content(τ ) ⎨C(τ ) if content(σL ) (24) C (τ ) = or M (σL τ ) = M (σL ); ⎪ ⎩ 1 otherwise. We claim that C produces a splitting L0 − {L}, L1 ∪ {L}. To see this, note = L, if σL ⊆ L , then the condition content(σL ) ⊆ content(τ ) that for any L will always hold, and C will preserve the classification made by C on L . If σL is contained in L , then σL cannot be a locking sequence for L since L = L. = M (σL ), and C will also Then there exists some τ ⊆ L such that M (σL τ ) preserve the classification made by C on L . It is clear that C (L) = 1 while C(L) = 0. Therefore, C moves exactly L from L0 to L1 and preserves all the rest classifications made by C. Now assume the contrary that every splitting of L produces two sub-classes of the same complexity, then we must have L0 ≡weak L1 and L0 − {L} ≡weak L1 ∪ {L}. Then there exists a 1-1 general recursive operator Γ which maps texts of languages in L0 to texts of languages in L1 and a 1-1 general recursive operator Γ which maps texts of languages in L1 ∪ {L} to texts of languages in L0 − {L}. Therefore Γ = Γ ◦ Γ is a 1-1 general recursive operator which maps texts of languages in L0 to texts of languages in L0 − {L}. Let T0 be a text of L and Tn = Γ (Tn−1 ) for all n > 0. Now define Hn = content(Tn ) for all n ∈ N. Note that i = j ⇒ Hi = Hj since Γ is a 1-1 general recursive operator. Fix an enumeration of all primitive-recursive operators {Γ0 , Γ1 , Γ2 , . . .} translating texts. Define Od,e = content(Γd (Te )).
(25)
Then there exists a K-recursive function f(b, d, e) such that

∀c ≤ b [c = f(b, d, e) or Hc ≠ Od,e].    (26)

To see the existence of such a function f, first note that Hi ≠ Hj whenever i ≠ j. Therefore, given any b, d, e, there can be at most one c such that c ≤ b and Hc = Od,e. Moreover, for any i, j with i ≠ j, there must be some x such that Hi(x) ≠ Hj(x). Then it follows that for any i, j with i < j ≤ c, there must be some x such that either Hi(x) ≠ Od,e(x) or Hj(x) ≠ Od,e(x). Therefore between any two different languages Hi and Hj, we can always identify one of them that is not identical to Od,e using oracle K. Moreover, we can repeat this process b times to identify, among all the b + 1 languages Hi with i ≤ b, b languages which are not identical to Od,e. Then we set f(b, d, e) = c where Hc is the only language with index less than or equal to b that has not been identified as different from Od,e.
Now we define a K-recursive sequence {an }, where a0 = 0. To define an+1 , let b = (n + 1)(an + 2n + 5) + (n + 2) + (an + 2n + 5) + 1 and let an+1 be the least c found which satisfies all the following conditions: – ∀d ≤ n∀e ≤ an + 2n + 4 [c = f (b, d, e)]; – M (Tc ) > n + 1; – c > an + 2n + 4. Note that there can be at most (n + 1)(an + 2n + 5) different choices of c that violate the first condition, at most n + 2 choices of c with M (Tc ) ≤ n + 1 that violate the second condition, and at most an + 2n + 5 choices of c with c ≤ an + 2n + 4 that violate the third condition. Therefore, given our definition of b, it is guaranteed that there exists a c ≤ b which satisfies all the three conditions. The set E = {M (Tan ) : n ∈ N} = {e : ∃n ≤ e[M (Tan ) = e]} is K-recursive. Let Es be the s-th recursive approximation to E. Now define C (σ) = E|σ| (M (σ))
(27)
and note that this is a classifier splitting L into two halves H0 and H1 where H1 = {Han : n ∈ N} and H0 contains all other members of L. To see this, consider any text T . On T , M converges to a value e and C converges to the value E(e). If T is a text of a language in H1 then M converges on T to some value of the form M (Tan ) and hence M converges to a value in E; if T is a text of a language in H0 then M converges on T to an index e for a language outside H1 , hence e ∈ / E and C converges on T to 0. ≡weak H1 by showing it is impossible for any 1-1 general We show that H0 recursive operator to map texts of languages in H0 to texts of languages in H1 . Assume the contrary that Γd is a 1-1 general recursive operator which maps texts of languages in H0 to texts of languages in H1 . Choose some n > d. Note that since an+1 > an +2n+4, there are at most n+1 numbers e ≤ an +2n+4 such that He ∈ H1 . It follows that there exist more than n+2 numbers e ≤ an +2n+4 such that He ∈ H0 . Then there must be some m > n and some e ≤ an + 2n + 4 such that Od,e = Ham , which contradicts our definition of the array {an }. Since Γd is chosen arbitrarily, we conclude that it is not possible to find a 1-1 general recursive operator which maps texts of languages in H0 to texts of languages in H1 . Therefore H0 ≡weak H1 .
Although there is no EX-learnable symmetric class, one can construct an EX-learnable class which has splittings, all of which are comparable, as shown in the next theorem.

Theorem 14. There is an EX-learnable class L such that L has a splitting, and for every splitting L0, L1 of L, either L0 ≤vs L1 or L1 ≤vs L0.

Proof. Let A be a K-cohesive set, that is, A satisfies that whenever a K-r.e. set contains infinitely many elements of A then this set contains all but finitely many
elements of A. Now define the class L such that L = {{2x}, {2x + 1} : x ∈ A}. It is clear that L is EX-learnable since every language in L is a singleton set. Let C be any classifier that classifies L and let L0 , L1 be the splitting of L produced by C. Without loss of generality assume C puts {2x} into L0 for infinitely many x ∈ A. Then, by definition of K-cohesive sets, C must put {2x} into L0 for all but finitely many x ∈ A and must put {2x + 1} into L1 for all but finitely many x ∈ A. Hence there exists a number m such that ∀x ≥ m [x ∈ A ⇒ (C({2x}) = 0 and C({2x + 1}) = 1)].
(28)
Let L0 = L0 ∩ {{y} : y < 2m} and L1 = L1 ∩ {{y} : y < 2m}. Again we may assume that |L0 | ≤ |L1 |. Note that both L0 and L1 are finite. It is clear that there exists a 1-1 mapping from languages in L0 to languages in L1 , and that there exists a 1-1 general recursive operator Γ which maps texts of languages in L0 to texts of languages in L1 . Now define another recursive operator Γ such that Γ ({y}) if y < 2m; Γ ({y}) = (29) {y + 1} otherwise. It is then easy to verify that Γ 1-1 strongly maps texts of languages in L0 to texts of languages in L1 . The reverse mapping is done by a recursive function f which maps an index e to an index f (e) such that Wf (e) = {y} for the first pair x, y found with x ∈ We ∧ Γ ({y}) = {x}; if such x, y do not exist then Wf (e) = ∅. Note that only the search-algorithm for the x, y is coded into the index f (e) but not the values x, y themselves, therefore f can be chosen to be a total-recursive
function. It follows that L0 vs L1 .
5
Conclusion
In this paper we have investigated the existence of symmetric classes with respect to various reducibilities and learning criteria. In particular we have shown the existence of BC-learnable symmetric classes with respect to the very strong reducibility ≤vs. We have also shown that there exists no EX-learnable symmetric class even for the weak reducibility ≤weak. Note that a symmetric class requires each sub-class in any splitting to have the same complexity as the original class. While the existence of a BC-learnable symmetric class has been shown, it is not yet clear whether there is a BC-learnable class L which has a splitting such that for any splitting L0, L1 of L, L0 ≡r L1
References 1. Ambos-Spies, K.: P-mitotic sets. In: Bekic, H. (ed.) Programming Languages and their Definition. LNCS, vol. 177, pp. 1–23. Springer, Heidelberg (1984) 2. Angluin, D.: Inductive inference of formal languages from positive data. Information and Control 45, 117–135 (1980)
3. B¯ arzdi¸ nˇs, J.: Two theorems on the limiting synthesis of functions. Theory of Algorithms and Programs 1(210), 82–88 (1974) 4. Blum, L., Blum, M.: Toward a mathematical theory of inductive inference. Information and Control 28(2), 125–155 (1975) 5. Freivalds, R., Kinber, E., Smith, C.: On the intrinsic complexity of learning. Information and Computation 123, 64–71 (1995) 6. Glaßer, C., Ogihara, M., Pavan, A., Selman, A., Zhang, L.: Autoreducibility, mitoticity and immunity. In: Jedrzejowicz, J., Szepietowski, A. (eds.) MFCS 2005. LNCS, vol. 3618, pp. 387–398. Springer, Heidelberg (2005) 7. Mark Gold, E.: Language identification in the limit. Information and Control 10, 447–474 (1967) 8. Jain, S., Kinber, E., Wiehagen, R.: Language learning from texts: degrees of intrinsic complexity and their characterizations. Journal of Computer and System Sciences 63, 305–354 (2001) 9. Jain, S., Sharma, A.: The intrinsic complexity of language identification. Journal of Computer and System Sciences 52, 393–402 (1996) 10. Jain, S., Sharma, A.: The Structure of Intrinsic Complexity of Learning. The Journal of Symbolic Logic 62, 1187–1201 (1997) 11. Jain, S., Stephan, F.: Mitotic classes in inductive inference. SIAM Journal on Computing 38, 1283–1299 (2008) 12. Ladner, R.: Mitotic recursively enumerable sets. The Journal of Symbolic Logic 38, 199–211 (1973) 13. Osherson, D.N., Stob, M., Weinstein, S.: Systems That Learn, An Introduction to Learning Theory for Cognitive and Computer Scientists. Bradford — The MIT Press, Cambridge (1986) 14. Osherson, D.N., Weinstein, S.: Criteria of language learning. Information and Control 52, 123–138 (1982) 15. Post, E.: Recursively enumerable sets of positive integers and their decision problems. Bulletin of the American Mathematical Society 50, 284–316 (1944)
PAC-Learning Unambiguous k, l-NTS≤ Languages Franco M. Luque and Gabriel Infante-Lopez Grupo de Procesamiento de Lenguaje Natural Universidad Nacional de Córdoba & Conicet Córdoba, Argentina {francolq,gabriel}@famaf.unc.edu.ar
Abstract. In this paper we present two hierarchies of context-free languages: the k, l-NTS languages and the k, l-NTS≤ languages. k, l-NTS languages generalize the concept of Non-Terminally Separated (NTS) languages by adding a fixed-size context to the constituents, in the same way as k, l-substitutable languages generalize substitutable languages (Yoshinaka, 2008). k, l-NTS≤ languages are k, l-NTS languages that also consider the edges of sentences as possible contexts. We then prove that Unambiguous k, l-NTS≤ (k, l-UNTS≤) languages can be converted to plain old UNTS languages over a richer alphabet. Using this and the result of polynomial PAC-learnability with positive data of UNTS grammars proved by Clark (2006), we prove that k, l-UNTS≤ languages are also PAC-learnable under the same conditions.
1
Introduction
A major goal in the area of grammatical inference is the discovery of more and more expressive classes of languages that happen to be learnable in some sense, such as Probably Approximately Correct (PAC) learnability [2] or identifiability in the limit [5]. There is also a particular interest in those classes that may, to some degree, include the natural languages. Among the most recent results in this field are the polynomial PAC-learnability of Unambiguous NTS (UNTS) languages from Clark [3], and the polynomial identifiability in the limit of Substitutable Context-Free Languages (SCFLs) from Clark and Eyraud [4]. These two results are related in the sense that substitutable languages are a sort of language-level definition of the NTS property, and Clark and Eyraud actually conjectured that the SCFLs are a subclass of the NTS languages. Another recent result is Yoshinaka's [8] generalization of Clark and Eyraud's work, defining the hierarchy of k, l-substitutable languages and proving their polynomial identifiability in the limit. In this paper, we define a hierarchy of k, l-NTS languages that generalizes NTS languages as k, l-substitutable languages generalize substitutable languages. We also show how k, l-NTS fits in the map of language classes, generalizing Clark's conjecture to the inclusion of k, l-SCFLs into k, l-NTS languages. k, l-NTS languages add the notion of a fixed-size context to the constituency property of
the NTS grammars. We then present the k, l-NTS≤ languages, a hierarchy of subclasses of the k, l-NTS languages. These languages also consider the edges of sentences as valid contexts. We prove that k, l-UNTS≤ languages can be converted injectively to UNTS languages over a richer alphabet. We do this by showing how to convert k, lUNTS≤ grammars with l > 0 to k, l − 1-UNTS≤ grammars. Applying this conversion recursively, we can get to k, 0-UNTS≤ grammars, and applying a symmetric conversion, we finally get to 0, 0-UNTS≤ grammars, that are exactly UNTS grammars. We then use the conversion to UNTS languages to prove that k, l-UNTS≤ languages are PAC-learnable. A sample of a k, l-UNTS≤ language can be converted to a sample of a UNTS language, and this sample can be used with Clark’s algorithm to learn a language that converges to the UNTS language. The learned language can then be converted back, to obtain a language that converges to the original k, l-UNTS≤ language. k, l-NTS languages express the idea of letting the contexts to influence the decision of considering the enclosed strings as constituents. This idea has been successfully applied to natural language and in particular to unsupervised learning, for instance in the Constituent Context Model by Klein and Manning [6], where a context of size (k, l) = (1, 1) is used. Their idea of adding starting and ending markers to the sentences corresponds exactly to our definition of k, l-NTS≤ .
2
Notation and Definitions
A Context-Free Grammar (CFG) is a tuple G = Σ, N, S, P , where Σ is the terminal alphabet, N is the set of non-terminals, S ∈ N is the initial nonterminal, and P ⊆ N × (Σ ∪ N )+ is the set of productions or rules. We use letters from the beginning of the alphabet a, b, c, . . . to represent elements of Σ, from the end of the alphabet r, s, t, u, v, . . . to represent elements of Σ ∗ , and Greek letters to represent elements of (Σ ∪ N )∗ . We use the notation (α)i to refer to the i-th element of α. We write the rules using the form X → α, ∗ and call the derivation relation ⇒ to the transitive closure of the relation ⇒ where αXβ ⇒ αγβ iff X → γ ∈ P . The language generated by a CFG G is the Context-Free Language (CFL) ∗ L(G) = {s ∈ Σ ∗ |S ⇒s}. As in [3], we will assume that all the non-terminals are useful, that is, that they are used in the derivation of some element of the language, and that they are not duplicated, that is, that they all generate different sets of strings. A grammar is said to be Non-Terminally Separated (NTS) if and only if ∗ ∗ ∗ whenever X ⇒αβγ and Y ⇒β, we have that X ⇒αY γ. This definition implies that if a string occurs as a constituent, then all of its occurrences are also constituents. 2.1
k, l-NTS Languages
We present a generalization of NTS grammars that relaxes the NTS condition to introduce the influence of fixed size contexts. In the generalization, the
constituency of an occurrence of a string will be determined not only by the string itself but also by the context where it occurs.

Definition 1. Given k, l non-negative integers, a grammar G = ⟨Σ, N, S, P⟩ is k, l-Non-Terminally Separated (k, l-NTS) if, for all X, Y ∈ N, α, β, γ, α′, γ′ ∈ (Σ ∪ N)∗, and (u, v) ∈ Σ^k × Σ^l,

X ⇒∗ αuβvγ, Y ⇒∗ β and S ⇒∗ α′uYvγ′ implies X ⇒∗ αuYvγ.

A grammar is k, l-Unambiguous NTS (k, l-UNTS) if it is unambiguous and k, l-NTS.

This definition states that if a string occurs as a constituent in a context (u, v), then every time it occurs in that context, it has to be a constituent. In contrast with NTS grammars, which have one global set of constituents, in k, l-NTS there is a set of constituents for each possible context (u, v) such that (|u|, |v|) = (k, l). From the definition, it is easy to prove that the 0, 0-NTS grammars are exactly the NTS grammars, and that if a grammar is k, l-NTS, then it is m, n-NTS for m ≥ k, n ≥ l. So, k, l-NTS grammars form a hierarchy of classes of grammars that starts with the NTS grammars. This hierarchy is strict. For instance, the grammar with productions P = {S → a^k Y c^(l−1), S → a^(k+1) b c^l, Y → b} with l > 0 is k, l-NTS but it is not k, l−1-NTS. At the language level, the k, l-NTS classes also form a strict hierarchy. In this case, to prove strictness we must give infinite languages as witnesses and use the pumping lemma for CFLs, because every finite language is k, l-NTS if k + l > 0. For instance, if l > 0, L = {a^k b^n c^(l−1) | n > 0} ∪ {a^n b^n c^(l−1) | n > 0} is k, l-NTS but it is not k, l−1-NTS. The reverse of L is l, k-NTS but not l−1, k-NTS. These examples are also the witnesses that prove that the hierarchy of k, l-UNTS languages is strict.

So far, the relationship between k, l-NTS and k, l-substitutable languages is not clear to us. In [4], Clark and Eyraud conjecture that the Substitutable CFLs (SCFLs) are NTS, while in [8] Yoshinaka generalizes this conjecture saying that the k, l-SCFLs are NTS for all k, l. This generalization is not correct, and a source of counterexamples is the fact that the k, l-SCFL condition does not affect some strings that the NTS condition does. For instance, L = {a, ab} is a 0, 1-SCFL but it is not NTS because S ⇒∗ ab and S ⇒∗ a but not S ⇒∗ Sb. We believe that the correct generalization of Clark's conjecture is to say that every k, l-SCFL is a k, l-NTS language. 2.2
k, l-NTS≤ Languages
We will focus our attention on a particular family of subclasses of k, l-NTS grammars. Observe that in the definition of k, l-NTS grammars, the Y non-terminals are those that can be derived from S with contexts of size at least (k, l). This means that the k, l-NTS condition does not affect those that have smaller contexts. In k, l-NTS≤ grammars the constituency of the strings will be determined not only by contexts of size (k, l) but also by smaller contexts if they occur next to
the edges of the sentences. We will define k, l-NTS≤ in terms of k, l-NTS and use the following simple grammar transformation.

Definition 2. Let G = ⟨Σ, N, S, P⟩ be a CFG, and let u, v ∈ Σ′∗ for some alphabet Σ′. Then, uGv = ⟨Σ ∪ Σ′, N ∪ {S′}, S′, P′⟩ is a CFG such that P′ = P ∪ {S′ → uSv}. It is easy to see that L(uGv) = {usv | s ∈ L(G)}.

Definition 3. Let G be a CFG and let • be a new element of the alphabet. Then, G is k, l-Non-Terminally Separated≤ (k, l-NTS≤) if and only if •^k G •^l is k, l-NTS. A grammar is k, l-Unambiguous NTS≤ (k, l-UNTS≤) if it is unambiguous and k, l-NTS≤.

The definition says that if we add a prefix •^k and a suffix •^l to every element of the language, then the resulting language is k, l-NTS. Doing this, we guarantee that every substring in the original language has a context of size (k, l) and therefore is affected by the k, l-NTS condition. It is easy to see that 0, 0-NTS≤ = 0, 0-NTS = NTS, and that k, l-NTS≤ is a hierarchy, where k, l-NTS≤ ⊂ m, n-NTS≤ for every m > k, n > l. It can be seen that this hierarchy is strict using the same example given for k, l-NTS grammars. Also, it can be proved that k, l-NTS≤ ⊆ k, l-NTS, and that this inclusion is proper if k + l > 0.

The difference between k, l-NTS and k, l-NTS≤ grammars can be illustrated with the following example. In the grammar G with P = {S → cba, S → Aa, A → b}, there is no constituent with a context of size (0, 2), so it is trivially 0, 2-NTS. Instead, in G•^2, which has P′ = {S′ → S•^2, S → cba, S → Aa, A → b}, every constituent (excluding S′) has a context of size (0, 2); in particular S ⇒∗ ba with context (λ, •^2), and A ⇒∗ b with (λ, a•). But these two constituents occur with the same contexts in S′ ⇒∗ cba•^2, so the 0, 2-NTS≤ condition says that S′ ⇒∗ cS•^2 and S′ ⇒∗ cAa•^2 should hold. As this is not true, G is not 0, 2-NTS≤.

At the language level, both the k, l-NTS≤ and the k, l-UNTS≤ hierarchies are strict, and in both cases the witnesses are small modifications of the witnesses used in the previous section. Now, L = {a^k b^n c^(l−1) d | n > 0} ∪ {a^n b^n c^(l−1) e | n > 0} with l > 1 is the language that distinguishes k, l-NTS≤ and k, l-UNTS≤ from k, l−1-NTS≤ and k, l−1-UNTS≤ respectively. The reverse of L distinguishes l, k-NTS≤ and l, k-UNTS≤ from l−1, k-NTS≤ and l−1, k-UNTS≤ respectively.
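To make the role of the •-padded contexts concrete, the following sketch (ours, not part of the paper; '*' stands in for • and the sample and values of k, l are arbitrary) groups the substrings of a finite sample by the (k, l)-context in which they occur after appending the padding of Definition 3.

```python
from collections import defaultdict

def contexts(sample, k, l, bullet="*"):
    """For each context (u, v) with |u| = k and |v| = l, collect the set of
    substrings occurring in that context over the padded sentences.
    Sentences are sequences of terminals."""
    groups = defaultdict(set)
    for sentence in sample:
        s = [bullet] * k + list(sentence) + [bullet] * l
        for i in range(k, len(s) - l):
            for j in range(i + 1, len(s) - l + 1):
                u = tuple(s[i - k:i])
                beta = tuple(s[i:j])
                v = tuple(s[j:j + l])
                groups[(u, v)].add(beta)
    return dict(groups)

# In a k,l-NTS<= language, whenever one string of a group is a constituent in
# its context (u, v), every string of that group occurring there must be too.
print(contexts(["abc", "abd"], k=1, l=1))
```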
3
Learning Algorithm for k, l-UNTS≤ Languages
In this section we see intuitively how k, l-UNTS≤ languages can be injectively converted into UNTS languages. Using this and the PACCFG algorithm for UNTS grammars from [3], we will define the learning algorithm k,l-PACCFG. First, remember that in k, l-UNTS≤ languages the substrings are constituents or not depending on the substrings themselves and on the contexts where they occur, while in UNTS the constituency of the substrings depends only on the substrings themselves. So, in the conversion of k, l-UNTS≤ languages to UNTS languages,
we must find a way to encode into the substrings the context where they are occurring each time. A way to do this is to add to each letter the context where it is occurring. To mark the contexts of the letters that are at the edges of the sentences, we will use a new terminal •. So, we must change the alphabet from Σ to the triplets Σ•^k × Σ × Σ•^l, where Σ• = Σ ∪ {•}. To be more compact, we will write the triplet (u, b, v) with u and v as subscripts of b. For instance, with (k, l) = (1, 1), the string abc would be mapped to the triplets (•, a, b)(a, b, c)(b, c, •). With (k, l) = (0, 2), abc would be mapped to (λ, a, bc)(λ, b, c•)(λ, c, ••). This is simply the way to convert a k, l-UNTS≤ language into a UNTS language. We use it in the first step of our algorithm in the following way:

Algorithm 1 (k,l-PACCFG).
– Input: A sample S, the values k, l and some other parameters.
– Result: A context-free grammar Ĝ.
– Steps:
  1. Convert S into a new sample S′ by marking the contexts.
  2. Run PACCFG with S′ and let Ĝ′ be the resulting grammar.
  3. Remove the marks in Ĝ′ and return the resulting grammar Ĝ.

In the third step, the removal of contextual marks is done on the terminals that occur in the rules, which are known to belong to the alphabet Σ•^k × Σ × Σ•^l.
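Steps 1 and 3 of the algorithm are simple string transformations between Σ and the triplet alphabet; a minimal sketch (ours — PACCFG itself, run in step 2, is of course not reproduced here) could look as follows.

```python
BULLET = "*"  # stands in for the new terminal •

def mark(sentence, k, l):
    """Step 1: replace each letter by the triple (k letters to the left,
    the letter, l letters to the right), padding with • at the edges."""
    s = [BULLET] * k + list(sentence) + [BULLET] * l
    return [(tuple(s[i - k:i]), s[i], tuple(s[i + 1:i + 1 + l]))
            for i in range(k, len(s) - l)]

def unmark(marked):
    """Step 3: forget the contexts and keep the middle component."""
    return "".join(b for (_, b, _) in marked)

m = mark("abc", 1, 1)
print(m)          # [(('*',), 'a', ('b',)), (('a',), 'b', ('c',)), (('b',), 'c', ('*',))]
print(unmark(m))  # abc
```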
3.1 Towards a Proof of PAC-Learnability
At the core of the proof that the presented algorithm k,l-PACCFG is PAC is the fact that the given conversion effectively results in a UNTS language. To prove this, we have to give a UNTS grammar that generates the converted language. In this section we illustrate with an example our algorithmic procedure to build the UNTS grammar starting with the original k, l-UNTS≤ grammar. Consider the grammar in Fig. 1 (a), that generates the language {abc, abd, bdc}. ∗ ∗ ∗ It is 1, 1-UNTS≤ , but it is not UNTS because S ⇒abd and X ⇒ab but S ⇒Xd. The derivation trees are shown in Fig. 1 (a’). Our aim is to change the rules in order to add the context marks in the terminals. We must do this consistently, in a way that the resulting grammar generates the desired language. In the example, the language must be { • ab a bc b c• , • bd b dc d c• , • ab a bd b d• }. For instance, to mark the contexts in the terminals of the rule X → ab, we must first know the contexts where X may appear. Once we know these contexts, we create a new non-terminal and rule for each possible context. As X only occurs in the context (•, c), it will result in a new non-terminal • Xc , and the rule X → ab will result in a rule • Xc → • ab a bc . But we come to a problem when we consider rules that have non-terminals in the right side, as in S → Xc. To mark the left context of c, we need a new type of information, that is the last letter that will be generated by X. So, we must also add boundaries marks to all the non-terminals, this is, the initial and final substrings that the non-terminals will generate.
[Figure 1 shows the example grammar of panel (a), with productions S → Xc, S → aY, X → ab, X → bd and Y → bd, together with its derivation trees (a′); panels (b)/(b′) show the grammar and trees after adding the boundary marks to the non-terminals, and panels (c)/(c′) after additionally adding the contexts.]
Fig. 1. Example of conversion of a grammar (a) first adding the boundaries (b) and then adding the contexts (c).
Actually, to be able to mark all the contexts in the grammar, we must first mark completely all the boundaries on the non-terminals. This marking can be done with a bottom-up procedure that starts at the terminal rules and then recursively process the other rules using the already marked non-terminals. The boundaries are marked in the non-terminals using superscripts. In the example, X → ab and X → bd are marked as aX b → ab and bX d → bd, and after that the rule S → Xc is marked in two different ways: aS c → aX b c and bS c → bX d c. The resulting grammar and derivations for the example can be seen in Fig. 1 (b) and (b’). Just for uniformity, the initial symbol will be denoted •S • . Once that all the boundaries have been marked, we can proceed to mark the contexts. This procedure is done top-down, starting at the initial symbol with context (•, •). For instance, in the rule aS c → aX b c we know that the left context of c is b, so it will be marked as a• S•c → a• Xcb b c• . The resulting grammar and derivations are shown in Fig. 1 (c) and (c’). In Sect. 4 we will formally define this procedure, and prove that the resulting grammar has a language that is equal to the converted language, and that if the original grammar is k, l-UNTS≤ , then the resulting grammar is UNTS. As in [3], to prove that the algorithm is PAC we will assume that the samples are generated by a PCFG. So, we will have to generalize the procedure to PCFGs in a way that the probabilities distributions are preserved. 3.2
Parameters and Bounds
The two standard main parameters of a PAC-learning algorithm are the precision ε and the confidence δ [2]. The precision determines how close the induced
instance will be to the hidden instance, and the confidence determines with how much probability. These parameters are commonly used in the body of the algorithm and in the definition of the sample complexity. The sample complexity is the minimum number of samples required to guarantee that the algorithm achieves a given precision and confidence [2]. In PACCFG, there is a set of additional parameters that stratify the learning problem, stating properties that are assumed to be satisfied by the underlying UNTS PCFGs. There must be known upper bounds for the number of non-terminals, n, for the number of productions, p, and for the length of the right sides of the productions, m. There are also parameters that specify distributional properties. There must be a known upper bound L for the expected number of substrings, and there must be known μ1, ν and μ2 such that the underlying PCFGs are μ1-distinguishable, ν-separable and μ2-reachable. The sample complexity for PACCFG is a function of all these parameters, N(μ1, μ2, ν, n, p, m, L, δ, ε). The exact formula for N is a bit complex and is not of special interest. We just observe that N is O((n + p)/(μ1^m μ2^2 ν^2)). We refer the reader to [3] for the details of the parameters of PACCFG and for the definition of μ1-distinguishability, ν-separability and μ2-reachability. As our algorithm k,l-PACCFG is defined in terms of PACCFG, it also stratifies the learning problem. We will assume that the underlying k, l-UNTS≤ PCFGs have the mentioned bounds n, p, m and L, and a new bound o for the number of non-terminals in the right sides (o ≤ m). Knowing these bounds, it is possible to compute the corresponding bounds for the UNTS PCFGs that are the result of converting the k, l-UNTS≤ PCFGs with the process described in Sect. 3. We will show this in Sect. 4. There will be a different treatment for the parameters μ1, ν and μ2. We will directly assume that the k, l-UNTS≤ PCFGs are such that, when converted, the resulting UNTS PCFGs are μ1-distinguishable, ν-separable and μ2-reachable.
4
Proof of PAC-Learnability
In this section we give the formal elements required to prove that our algorithm is PAC. In the first place, we show how to convert a CFG using the Left Marked Form and then the Right Contextualized Grammar. Then we show that when these conversions are applied to a k, l-UNTS≤ grammar with l > 0, they return a k, l − 1-UNTS≤ grammar. Then, we extend the conversion procedure to PCFGs, showing that the distributions are preserved. Finally, we see that these results give a conversion from k, l-UNTS≤ to UNTS PCFGs and prove that our algorithm is PAC. 4.1
The Left Marked Form of a Grammar
The Left Marked Form of a grammar adds a mark to each non-terminal that states which is the first terminal it generates, without changing the shape of the rules and the generated language. It is constructed using a recursive bottom-up procedure as described intuitively in Sect. 3.
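Before the formal definition, the bottom-up idea can be sketched in a few lines: compute, for every non-terminal X, the set of terminals a for which a marked copy aX will be created, i.e. the possible first terminals of the strings X derives. The grammar encoding and the function below are ours; the example productions are those of Fig. 1 (a).

```python
def first_terminals(productions, terminals):
    """For each non-terminal X, the set of terminals that can begin a string
    derived from X.  Rules have non-empty right-hand sides and there are no
    erasing rules, so only the first symbol of each right-hand side matters."""
    first = {X: set() for X in productions}
    changed = True
    while changed:  # iterate bottom-up until a fixpoint is reached
        changed = False
        for X, rhss in productions.items():
            for rhs in rhss:
                head = rhs[0]
                new = {head} if head in terminals else first[head]
                if not new <= first[X]:
                    first[X] |= new
                    changed = True
    return first

# The grammar of Fig. 1 (a): S -> Xc | aY, X -> ab | bd, Y -> bd.
prods = {"S": [["X", "c"], ["a", "Y"]],
         "X": [["a", "b"], ["b", "d"]],
         "Y": [["b", "d"]]}
print(first_terminals(prods, terminals=set("abcd")))
# S and X can start with a or b (so copies aS, bS, aX, bX arise); Y only with b.
```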
Definition 4. Let G = Σ, N, S, P be a CFG. Let Gi = Σ, Ni , •S, Pi , i ≥ 0 be such that Ni = { •S} ∪ { aX| aX → α ∈ Pi } P0 = { aX → as|X → as ∈ P }
Pn+1 = { aX → u1 a1X1 u2 . . . un anXn un+1 |∀i aiXi ∈ Nn and X → u1 X1 u2 . . . un Xn un+1 ∈ P and
a = (u1 )0 if u1
= λ, a = a1 otherwise } ∪ { •S → aS| aS ∈ Nn } If there exists k such that Gk+1 = Gk , the Left Marked Form (LMF) of G is the grammar G = Gk . Lemma 1. The LMF always exists and is unique. The sequence of grammars G0 , G1 , . . . represents the steps in the procedure. Lemma 1 shows that this procedure converges. The following lemmas state the fundamental properties of the LMF. ∗
Lemma 2. Let G be a CFG and G its LMF. Then, for all a, X, α, aX ⇒G α if ∗ and only if X ⇒G α and there exist β, Y such that α = aβ or α = aY β. Lemma 3. L(G ) = L(G). For the parameter conversion of Sect. 3.2, we must observe that if n, p and o are bounds for G, then n = n|Σ| + 1, p = p|Σ|o and o = o are bounds for G . 4.2
The Right Contextualized Grammar of a Grammar
The Right Contextualized Grammar adds a mark to each terminal that says which terminal goes next, while preserving the shape of the rules. As we saw intuitively in Sect. 3, it uses the LMF and is constructed with a recursive topdown procedure. The procedure involves changing the right sides of the LMF rules to add the right contexts. To do this we use the following definition. Definition 5. Let G be a CFG and G its LMF. Let α ∈ (Σ ∪ N )∗ and b ∈ Σ. Then, the right-contextualization of α with final context b, rcb (α), is recursively defined by rcb (λ) = λ rcb (αa) = rca (α)ab rcb (α aX) = rca (α) aXb The right-contextualization of a language L is rc(L) = {rc• (s)|s ∈ L}.
130
F.M. Luque and G. Infante-Lopez
Definition 6. Let G = Σ, N, S, P be a CFG and G = Σ, N , •S, P its LMF. Let Gi = Σ × Σ, Ni , •S• , Pi , i ≥ 0 be such that Nn = { •S• } ∪ { aXb | cYd → α aXb β ∈ Pn }
P0 = ∅ = { aXb → rcb (α)| aXb ∈ Nn and aX → α ∈ P } Pn+1 If there exists k such that Gk+1 = Gk , the Right Contextualized Grammar (RCG) of G is the grammar G = Gk . Lemma 4. The RCG always exists and is unique. The following lemmas state the fundamental properties of the RCG. Lemma 5. Let G be a CFG, G its LMF and G its RCG. Then, for all ∗ a, b, X, α, aXb ⇒G α if and only if there is α0 such that α = rcb (α0 ) and ∗ a X ⇒G α0 . Lemma 6. L(G ) = rc(L(G)). For the parameter conversion of Sect. 3.2, we must observe that if n , p and o are bounds for G , then n = (n − 1)(|Σ| + 1) + 1, p = p (|Σ| + 1) and o = o are bounds for G . 4.3
Converting k, l-UNTS≤ Grammars
Lemma 7. Let G be a k, l-NTS≤ grammar. Then, its LMF G is also k, l-NTS≤. Lemma 8. Let G be a k, l-UNTS≤ grammar with l > 0. Then, its RCG G is k, l − 1-UNTS≤ . Proof sketch. We will only prove that G is k, l − 1-NTS≤ , omitting the unambiguity part. Let H = •k G •l and H = •k G •l−1 . Then, we must prove that H is k, l − 1-NTS knowing that H is k, l-NTS. ∗ ∗ ∗ Suppose that aXb ⇒H αuβvγ, cYd ⇒H β and •S• ⇒H α u cYd vγ , with |u| = ∗ k, |v| = l − 1. We have to show that aXb ⇒H αu cYd vγ. To do this, we must try to generate the three conditions over H that let us apply the hypothesis that it is k, l-NTS. First, we see that if cYd = •S• , then everything is λ except β that has the ∗ ∗ form •k β •l−1 , so aXb is also •S• , aXb ⇒H αu cYd vγ is equivalent to •S• ⇒H •S• and we are done with the proof. So, we can continue assuming cYd = •S• . ∗
– Condition 1: It is derived from aXb ⇒H αuβvγ but will not talk about a X. Instead, we will go back to •S• using that there are α , γ such that ∗ ∗ • S• ⇒H α aXb γ ⇒H α αuβvγγ . Moving this derivation to G , applying Lemma 5 to go to G , and adding again the • markers to go to H , we can ∗ see that •S ⇒H α0 α0 u0 β0 v0 γ0 γ0 •. We still do not have the desired condition because we need a right context of size l.
– Condition 2: Since cYd
= •S• , then β does not have any •, and also in G ∗ ∗ ∗ c Yd ⇒G β. Using Lemma 5, we have cY ⇒G β0 and also in H cY ⇒H β0 , so we have the second condition. ∗ – Condition 3: Using that •S• ⇒H α u cYd vγ and applying the same arguments ∗ used in the first condition we can see that •S ⇒H α0 u0 cY v0 γ0 •. Again, we still need to find a right context of size l. – Conditions 1 and 3: In order to find the correct right context for both conditions we start observing that either γ0 γ0 and γ0 must be both λ, or both must derive the same first terminal e. In the first case the right context will be v0 • and in the second case it will be v0 e, and in both cases we will have the desired conditions. Having the conditions, we apply the hypothesis to obtain in H a derivation of ∗ the form •S ⇒H α0 α0 u0 cY v0 eδ0 . Moving to G , applying Lemma 5 to go to G ∗ and then to H adding again the • markers, we have that •S• ⇒H α αu cYd vef δ for some f . Now, in H we have two different ways to derive the same thing: ∗ ∗ ⇒α αu cYd vef δ ⇒ ∗ ∗ • S• ⇒ ∗ a ⇒α αuβvef δ ∗ ∗ ⇒α Xb γ ⇒α αuβvγγ ⇒ Since H is unambiguous, these two derivations must be the same tree, enforcing ∗ Xb ⇒H αu cYd vγ.
a
4.4
Extending to PCFGs
In this section we extend the definitions of Left Marked Form and Right Contextualized Grammar to PCFGs so that the probabilities of the derivations are preserved. In the case of the LMF, the derivations have the same tree structure as in the original grammar but using an additional initial rule of the form •S → aS. So, if every rule that comes from a rule of the original grammar takes the same probability as the original rule, and all the new initial rules have probability 1, the probabilities of the derivations will be preserved. The problem is that we will not end up with a PCFG but a Weighted CFG (WCFG), because the probabilities of the rules for •S may sum more than 1 and the rules for the other p1 p2 non-terminals may sum less than 1. For instance, the rules A → a, A → b will be p1 p2 mapped to the non-PCFG rules aA → a, bA → b. To solve this, we can use the renormalization formula presented in [1] to convert the WCFG back to a PCFG with the same language probability distribution. Fortunately, in the case of the RCG the derivations have the same tree structure as in the LMF, and every rule comes from a rule of the LMF, so the probabilities can be directly propagated. Despite the fact that the non-terminals are more granular, in this case we always end up with a PCFG without the need p1 pn of renormalization. The LMF rules aA → α1 , . . . , aA → αn will be mapped in p pn 1 the RCG to several sets of rules of the form aAb → rcb (α1 ), . . . , aAb → rcb (αn ), each set with a different b, and each set summing 1. All these observations are summarized in the following definition and lemma:
Definition 7. Let G be a PCFG with production probability π. The LMF of G is the PCFG G with production probability π such that (α)i a a π ( X → α) = w ( X → α) i a X where w ( aX → u1 a1X1 u2 . . . un anXn un+1 ) = π(X → u1 X1 u2 . . . un Xn un+1 ) w ( •S → aS) = 1 ∗ and aX = s Ww ( aX ⇒G s) for aX ∈ N and a = 1 for a ∈ Σ. The RCG of G is the PCFG G with production probability π such that π ( aXb → rcb (α)) = π ( aX → α).
In this definition, Ww refers to the weight of a derivation where the weights of the rules are given by w [1]. Lemma 9. Let G be a PCFG, G its LMF and G its RCG. Then, for every s ∈ L(G), PG (s) = PG (s) = PG (rc• (s)).
4.5
The Theorems
We will call markk,l the function that marks the contexts of a language in Σ∗, and unmarkk,l the function that unmarks the contexts. For convenience, the domain of unmark will be any language in (Σ•^k × Σ × Σ•^l)∗, not only the ones that are the result of marking a language.

Theorem 1. Let G be a k, l-UNTS≤ PCFG. Then, there is a UNTS PCFG G′ such that
1. L(G′) = markk,l(L(G)), and
2. for every s ∈ L(G), PG(s) = PG′(markk,l(s)).

Proof sketch. Do induction on k and l, starting with k = l = 0, then continuing with k = 0, l > 0 using the RCG, and finally with k, l > 0, using a symmetric version of the RCG.

Theorem 2. Given δ and ε, there is N such that, if S is a sample of a k, l-UNTS≤ PCFG G with |S| > N, then with probability greater than 1 − δ, Ĝ = k,l-PACCFG(S) is such that
1. L(Ĝ) ⊆ L(G), and
2. PG(L(G) − L(Ĝ)) < ε.
Proof. Let G′ be the UNTS conversion of G. If the known bounds for G are n, p, m, o and L, the corresponding bounds for G′ are n′ = n(|Σ|(|Σ|+1))^(k+l) + 1, p′ = p|Σ|^((k+l)o) (|Σ|+1)^(k+l), m′ = m and L′ = L. Let N = N(μ1, μ2, ν, n′, p′, m′, L′, δ, ε), and suppose that |S| > N. As defined in step 1 of the k,l-PACCFG algorithm, S′ is a sample of G′ and |S′| = |S| > N. Then, by Clark's PAC-learning theorem, step 2 returns Ĝ′ such that with probability 1 − δ, L(Ĝ′) ⊆ L(G′), and PG′(L(G′) − L(Ĝ′)) < ε. Now, let Ĝ be the result of step 3. Then,

L(Ĝ) = unmarkk,l(L(Ĝ′)) ⊆ unmarkk,l(L(G′)) (with probability 1 − δ) = L(G).

Also, markk,l(L(Ĝ)) = L(Ĝ′) with probability 1 − δ, because L(Ĝ′) ⊆ L(G′) = markk,l(L(G)) with the same probability. So,

PG(L(G) − L(Ĝ)) = PG′(markk,l(L(G) − L(Ĝ))) = PG′(markk,l(L(G)) − markk,l(L(Ĝ))) = PG′(L(G′) − markk,l(L(Ĝ))) = PG′(L(G′) − L(Ĝ′)) (with probability 1 − δ) < ε.
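The bound conversions used at the beginning of this proof are plain arithmetic; the throwaway helper below (ours) just evaluates the expressions for n′, p′, m′ and L′.

```python
def converted_bounds(n, p, m, o, L, sigma_size, k, l):
    """Bounds for the converted UNTS PCFG from the bounds n, p, m, o, L of a
    k,l-UNTS<= PCFG over an alphabet of size |Sigma| = sigma_size."""
    n2 = n * (sigma_size * (sigma_size + 1)) ** (k + l) + 1
    p2 = p * sigma_size ** ((k + l) * o) * (sigma_size + 1) ** (k + l)
    return n2, p2, m, L

print(converted_bounds(n=10, p=20, m=2, o=2, L=50, sigma_size=3, k=1, l=1))
```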
5
Discussion
As the results presented here are heavily based on the results of [3], the discussion section of that paper applies entirely to our work. The sample complexity of our learning algorithm is

O((n′ + p′)/(μ1^m μ2^2 ν^2)) = O((n|Σ|^(2(k+l)) + p|Σ|^((k+l)(o+1)))/(μ1^m μ2^2 ν^2)).

As pointed out in [3], the m exponent is worrying, and we can say the same about the o exponent. A small restriction on the UNTS grammars can guarantee conversion to CNF preserving the UNTS property, giving 2 ≥ m ≥ o. However, it is not clear to us how this restriction affects the k, l-UNTS≤ grammars. In this work we directly assume known values of μ1, ν and μ2 for the converted UNTS grammars. We do this because these values cannot be computed from the corresponding values for the original k, l-UNTS≤ grammars, as can be done with the other parameters. We can define new ad-hoc properties which, when valid for k, l-UNTS≤ grammars, imply μ1-distinguishability, ν-separability and μ2-reachability of the converted grammars. However, these properties are in essence equivalent to the given assumptions. We could not use our approach to prove PAC-learnability of all the k, l-UNTS languages. Using a modified version of the conversion process we could show that
k, l-UNTS languages with k + l > 0 can be converted to 0,1-, 1,0- or 1,1-UNTS. However, we could not manage to give a conversion to UNTS, and we believe it is actually not possible. In any case, PAC-learnability of k, l-UNTS languages is not of special interest to us, because we find k, l-UNTS≤ more suitable for modelling natural language. UNTS grammars have a limited expressivity in the typical approach to unsupervised parsing of natural language, where the POS tagging problem is isolated and the grammars are defined over the alphabet of POS tags [7]. We believe this expressivity can be greatly improved just by moving to 0,1-, 1,0- or 1,1-UNTS≤.
Acknowledgments This work was supported in part by grant PICT 2006-00969, ANPCyT, Argentina.
References
1. Abney, S., McAllester, D., Pereira, F.: Relating probabilistic grammars and automata. In: Proceedings of the 37th ACL, pp. 542–549 (1999)
2. Anthony, M., Biggs, N.: Computational learning theory: an introduction. Cambridge University Press, Cambridge (1992)
3. Clark, A.: PAC-learning unambiguous NTS languages. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 59–71. Springer, Heidelberg (2006)
4. Clark, A., Eyraud, R.: Polynomial identification in the limit of substitutable context-free languages. J. Mach. Learn. Res. 8, 1725–1745 (2007)
5. Gold, M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)
6. Klein, D., Manning, C.D.: Corpus-based induction of syntactic structure: Models of dependency and constituency. In: Proceedings of the 42nd ACL, pp. 478–485 (2004)
7. Luque, F., Infante-Lopez, G.: Bounding the maximal parsing performance of Non-Terminally Separated grammars. In: Sempere, J.M., García, P. (eds.) ICGI 2010. LNCS (LNAI), vol. 6339, pp. 135–147. Springer, Heidelberg (2010)
8. Yoshinaka, R.: Identification in the limit of k, l-substitutable context-free languages. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 266–279. Springer, Heidelberg (2008)
Bounding the Maximal Parsing Performance of Non-Terminally Separated Grammars Franco M. Luque and Gabriel Infante-Lopez Grupo de Procesamiento de Lenguaje Natural Universidad Nacional de Córdoba & Conicet Córdoba, Argentina {francolq,gabriel}@famaf.unc.edu.ar
Abstract. Unambiguous Non-Terminally Separated (UNTS) grammars have good learnability properties but are too restrictive to be used for natural language parsing. We present a generalization of UNTS grammars called Unambiguous Weakly NTS (UWNTS) grammars that preserve the learnability properties. Then, we study the problem of using them to parse natural language and evaluating against a gold treebank. If the target language is not UWNTS, there will be an upper bound in the parsing performance. In this paper we develop methods to find upper bounds for the unlabeled F1 performance that any UWNTS grammar can achieve over a given treebank. We define a new metric, show that its optimization is NP-Hard but solvable with specialized software, and show a translation of the result to a bound for the F1 . We do experiments with the WSJ10 corpus, finding an F1 bound of 76.1% for the UWNTS grammars over the POS tags alphabet.
1 Introduction
Unsupervised parsing of natural language has received increasing attention in recent years [3,4,8,13,14]. The standard way of evaluating the developed parsers is to compare their output against gold treebanks, such as the Penn Treebank [11], using the precision, recall and F1 measures [15]. Related to the unsupervised parsing problem is the grammatical inference problem, which deals with the theoretical learnability of formal languages. One of the goals of this area is to find classes of languages that are suitable to model natural language and that are learnable in some sense. In [5,6], it is argued that natural language is close to being in the class of Non-Terminally Separated (NTS) languages, and in [5] it is proved that Unambiguous NTS (UNTS) languages are PAC-learnable in polynomial time. However, we will see that UNTS grammars are much less expressive than NTS grammars and that they cannot admit elementary sets of natural language sentences. In order to narrow the gap between UNTS and NTS, we present a slight generalization of UNTS grammars called the Unambiguous Weakly NTS (UWNTS) grammars. In contrast with UNTS, UWNTS grammars are general enough to admit any finite language, while at the same time preserving every other aspect of UNTS grammars, including PAC-learnability in polynomial time.
In this work, we study the potential of UWNTS grammars to model natural language syntax. Our approach to study UWNTS grammars is in the context of the unsupervised parsing problem, using the standard quantitative evaluation over gold treebanks. We do not aim at developing a learning algorithm that returns a UWNTS grammar, because the resulting evaluation will depend on the algorithm. Instead, our aim is to find upper bounds for the achievable F1 performance of all the UWNTS grammars over a given gold treebank, regardless of the learning algorithm and the training material used to induce the grammars. Our bounds should only depend on the gold treebank that is going to be used for evaluation. In general, knowing an upper bound for a model is useful in at least two senses. First, if it is lower than the performance we want to achieve, we should consider not using that model at all. Second, if it is not low, it might be useful for the development of a learning algorithm, given that it can be assessed how close the performance of the algorithm is to the upper bound. In this paper we present methods to obtain such upper bounds for the class of UWNTS grammars. Our methods allow us to compute the bounds without explicitly working with grammars, because we are only interested in the parsings of the gold sentences that the grammars can return. In principle, the function to be optimized is F1 , but instead we define a new metric W , related to F1 , but whose optimization is feasible. Our methods are based in the optimization of W and the recall, and in the translation of these optimal values to an upper bound of the F1 . The optimization problems are solved by reducing them to the well known Integer Linear Programming (ILP) with binary variables problem, that is known to be NP-Hard, but for which there exists software to solve it [2]. Moreover, we show that the optimization of W is NP-Hard, by reducing the Maximum Independent Set problem for 2-subdivision graphs to it. Finally, we solve the optimization problems for the WSJ10 subset of the Penn Treebank [11], compute the upper bounds, and compare them to the performances of state-of-the-art unsupervised parsers. Our results show that UWNTS grammars over the POS tags alphabet can not improve state-of-the-art unsupervised parsing performance.
2 Notation and Definitions
We skip conventional definitions and notation for formal languages and contextfree grammars, but define concepts that are more specific to our paper. Given a language L, Sub(L) is the set of non-empty substrings of the elements of L, and Sub(L) is the set of non-empty proper substrings. We say that two strings r, s overlap if there exist non-empty strings r , s , t such that r = r t and s = ts ; r ts is called an overlapping of r and s. We say that two strings r, s occur overlapped in a language L if they overlap and if an overlapping of them is in Sub(L). Given a grammar G and s ∈ L(G), a substring r of s = urv is called ∗ ∗ a constituent in (u, v) if and only if there is X ∈ N such that S ⇒ uXv ⇒ s.
In contrast, r is called a non-constituent or distituent in (u, v) if it is not a constituent in (u, v). More than in the grammars, we are in fact interested in all the possible ways a given finite set of sentences S = {s1 , . . . , sn } can be parsed by a particular class of grammars. As we are modeling unlabeled parsing, the parse of a sentence is an unlabeled tree, or equivalently, a bracketing of the sentence. Formally, a bracketing of a sentence s = a0 . . . an−1 is a set b of pairs of indexes that marks the starting and ending positions of the constituents, consistently representing a tree. A bracketing always contains the full-span bracket (0, n), never has duplicated brackets (it is a set), and does not have brackets of span 1, i.e., of the form (i, i + 1). Usually, we will represent the bracketings together with their corresponding sentences. For instance, we will jointly represent the sentence abcde and the bracketing {(0, 5), (2, 5), (2, 4)} as ab((cd)e). Observe that the full-span bracket is omitted. Given an unambiguous grammar G and s ∈ L(G), the bracketing of s with G, brG (s), is the bracketing corresponding to the parse tree of s with G. Given S ⊆ L(G), the bracketing of S with G, BrG (S), is the set {brG (s)|s ∈ S}. 2.1
UWNTS Grammars
A grammar G is said to be Non-Terminally Separated (NTS) if and only if, for all X, Y ∈ N and α, β, γ ∈ (Σ ∪ N)∗, X ⇒∗ αβγ and Y ⇒∗ β implies X ⇒∗ αY γ [5]. A grammar is Unambiguous NTS (UNTS) if it is unambiguous and NTS. Unambiguous NTS grammars are much less expressive than NTS grammars: it can be proved that a UNTS language having two overlapping sentences cannot have the overlapping of them as a sentence. For instance, the set {ab, bc, abc} cannot be a subset of a UNTS language because abc is an overlapping of ab and bc. This situation is very common in the WSJ10 sentences of POS tags, and consequently there is no UNTS grammar that accepts this set. For instance, it has the sentences "NN NN NN" and "NN NN", but "NN NN NN" is an overlapping of "NN NN" with itself. The limitations of UNTS grammars lead us to define a more expressive class, UWNTS, that is able to parse any finite set of sentences while preserving, at the same time, every other aspect of UNTS grammars. The properties of UWNTS will also let us characterize the sets of bracketings of all the possible grammars that parse a given finite set of sentences. This characterization will be at the core of our methods for finding bounds.

Definition 1. A grammar G = ⟨Σ, N, S, P⟩ is Weakly Non-Terminally Separated (WNTS) if S does not appear on the right side of any production and, for all X, Y ∈ N with Y ≠ S and α, β, γ ∈ (Σ ∪ N)∗,

X ⇒∗ αβγ and Y ⇒∗ β implies X ⇒∗ αY γ.

A grammar is Unambiguous WNTS (UWNTS) if it is unambiguous and WNTS. Note that any finite set {s1, . . . , sn} is parsed by a UWNTS grammar with rules {S → s1, . . . , S → sn}. It is easy to see that every NTS language is also WNTS.
A much more interesting result is that a WNTS language can be converted into an NTS language without loosing any information. Lemma 1. If L is WNTS, then xLx = {xsx|s ∈ L} is NTS, where x is a new element of the alphabet. This conversion allows us to prove PAC-learnability of UWNTS languages using the PAC-learnability result for UNTS languages of [5]. If we want to learn a UWNTS grammar from a sample S, we can simply give the sample xSx to the learning algorithm for UNTS grammars, and remove the x’s from the resulting grammar. A detailed proof of this result is beyond the scope of this work. We present a much more general result of PAC-learnability in [10]. In a UWNTS grammar G, every substring is either always a constituent in every proper occurrence or always a distituent in every proper occurrence. If two strings r, s occur overlapped in L(G), then at least one of them must be always a distituent. In this case, we say that r and s are incompatible in L(G), and we say that they are compatible in the opposite case. We say that a set of strings is compatible in L(G) if every pair of strings of the set is compatible in L(G). Now let us consider a finite set of sentences S, a UWNTS grammar G, and the bracketing of S with G. The given properties imply that in the bracketing there are no substrings marked some times as constituents and some times as distituents. Consequently, the information in BrG (S) can be simply represented as the set of substrings in Sub(S) that are always constituents. We call this the set ConstG (S). When considering all the possible UWNTS grammars G with S ⊆ L(G), there is a 1-to-1 mapping between all the possible bracketings BrG (S) and all the possible sets ConstG (S). Using this information, we can define our search space C (S) as . Definition 2. C (S) = {ConstG (S) : G UWNTS and S ⊆ L(G)}. With UWNTS grammars we can characterize the search space in a way that there is no need to explicitly refer to the grammars. We can see that the search space is equal to all the possible subsets of Sub(S) that are compatible in S: Theorem 1. C (S) = {C : C ⊆ Sub(S), C compatible in S}. The proof of ⊆ follows immediately from the given properties. ⊇ is proved by constructing a UWNTS grammar mainly using the fact that S is finite and C is compatible in S. 2.2
UWNTS-SC Grammars
In the parser induction problem, it is usually expected that the induced parsers will be able to parse any sentence, and not only the finite set of sentences S that is used to evaluate them. However, the defined search space for UWNTS grammars does not guarantee this. In Theorem 1, it is evident that the sets of constituents are required to be compatible in S, but may be incompatible with other sets of sentences. For instance, if S = {abd, bcd} and C = {ab, bc}, C
is compatible in S but not in {abc}. We define UWNTS-SC grammars as the subclass of UWNTS that guarantee that the constituents are compatible with any set of sentences. Definition 3. A grammar G is UWNTS Strongly Compatible (UWNTS-SC) if it is UWNTS and, for every r, s ∈ ConstG (L(G)), r and s do not overlap. When r and s do not overlap, we say that r and s are strongly compatible. And when every pair of a set of strings C do not overlap, we say that C is strongly compatible. As UWNTS-SC is a subclass of UWNTS, all the properties of UWNTS grammars still hold. The search space for UWNTS-SC grammars and its characterization are: . Definition 4. CSC (S) = {ConstG (S) : G UWNTS-SC and S ⊆ L(G)}. Theorem 2. CSC (S) = {C : C ⊆ Sub(S), C strongly compatible}. The proof of this theorem is analog to the proof of Theorem 1. Theorem 2 shows a more natural way of understanding the motivation under UWNTS-SC grammars. While the compatibility property depends on the sample S, the strong compatibility do not. It can be checked for a given set of constituents without looking at the sample, and if it holds, we know that the set if compatible for any sample.
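The two compatibility notions are easy to test mechanically. The following Python sketch (ours; plain character strings stand in for sequences of POS tags) checks whether two substrings overlap at all, which is what strong compatibility forbids, and whether one of their overlappings actually occurs in Sub(S), which is what compatibility in S forbids.

def substrings(sentences):
    """Sub(S): all non-empty contiguous substrings of the sample."""
    subs = set()
    for s in sentences:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                subs.add(s[i:j])
    return subs

def overlappings(r, s):
    """Overlappings r' t s' with r = r' t, s = t s', and r', s', t non-empty."""
    return [r + s[n:] for n in range(1, min(len(r), len(s)))
            if r[-n:] == s[:n]]

def strongly_compatible(r, s):
    """UWNTS-SC: r and s must not overlap (in either order)."""
    return not overlappings(r, s) and not overlappings(s, r)

def compatible_in(r, s, sentences):
    """UWNTS: r and s must not occur overlapped in S."""
    subs = substrings(sentences)
    return not any(o in subs for o in overlappings(r, s) + overlappings(s, r))

if __name__ == "__main__":
    S = ["abd", "bcd"]
    print(strongly_compatible("ab", "bc"))   # False: "ab" and "bc" overlap
    print(compatible_in("ab", "bc", S))      # True: the overlapping "abc" is not in Sub(S)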
3 The W Measure and Its Relationship to the F1
In this section we will study the evaluation measures in the framework of UWNTS grammars. In general, a measure is a function of similarity between a given gold bracketing B = {b1, . . . , bn} of a set of gold sentences and a proposed bracketing B̂ = {b̂1, . . . , b̂n} over the same set of sentences. The standard measures used in unsupervised constituent parsing are the micro-averaged precision, recall and F1 as defined in [8]:

P(B̂) = Σ_{k=1..n} |bk ∩ b̂k| / Σ_{k=1..n} |b̂k|,   R(B̂) = Σ_{k=1..n} |bk ∩ b̂k| / Σ_{k=1..n} |bk|,   F1(B̂) = 2 P(B̂) R(B̂) / (P(B̂) + R(B̂)).

Note that these measures differ from the standard PARSEVAL measures [1], because from the definition of bracketing it follows that the syntactic categories are ignored and that unary branches are ignored. Another way to define precision and recall is in terms of two measures that we will call hits and misses.

Definition 5. The hits is the number of proposed brackets that are correct, and the misses is the number of proposed brackets that are not correct. Formally,

H(B̂) = Σ_{k=1..n} |bk ∩ b̂k|,   M(B̂) = Σ_{k=1..n} |b̂k − bk|.
Using these two measures, and defining K = Σ_{k=1..n} |bk| to be the number of brackets in the gold bracketing, we have

P(B̂) = H(B̂) / (H(B̂) + M(B̂)),   R(B̂) = H(B̂) / K.
As we saw in the previous section, in the case of UWNTS grammars the bracketings can be represented as sets of constituents Ĉ. So, we will rewrite the definitions of the measures in terms of Ĉ instead of B̂. Observe that, if s ∈ Ĉ, B̂ contains all the occurrences of s marked as a constituent. But in the gold B, s may be marked sometimes as a constituent and sometimes as a distituent. Let c(s) and d(s) be the number of times s appears in B as a constituent and as a distituent respectively:

Definition 6.
c(s) = |{(u, s, v) : uv ≠ λ, usv ∈ S and s is a constituent in (u, v)}|
d(s) = |{(u, s, v) : uv ≠ λ, usv ∈ S and s is a distituent in (u, v)}|

Then, for every s ∈ Ĉ, B̂ will have c(s) hits and d(s) misses. This is,

H(Ĉ) = Σ_{s∈Ĉ} c(s),   M(Ĉ) = Σ_{s∈Ĉ} d(s).
Using this, we can see that

F1(Ĉ) = 2 Σ_{s∈Ĉ} c(s) / ( K + Σ_{s∈Ĉ} (c(s) + d(s)) ).
Now that we have written the F1 in terms of Ĉ, we would like to define an algorithm to find the optimal F1(Ĉ) for every Ĉ ∈ C(S). As the search space is finite, a simple algorithm to find max_Ĉ F1(Ĉ) is to compute the F1 for every Ĉ. The problem is that the order of this algorithm is O(2^{|Sub(S)|}). Given that the optimization of the F1 does not seem to be feasible, we define another measure W whose optimization turns out to be more tractable. This measure has its own intuition, and is a natural way to combine hits and misses. It is very different from the F1, but we will see that an upper bound of W can be translated into an upper bound of the F1 measure. We will also see that an upper bound of the recall R can be used to find a better bound for the F1. We want W to be such that a high value reflects a large number of hits and a low number of misses, so we simply say:

Definition 7. W(B̂) = H(B̂) − M(B̂).

Note that, when dealing with UWNTS grammars, as H and M are linear expressions over Ĉ, W will also be linear over Ĉ, unlike the F1 measure. This is what will make it more tractable.
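Since H and M are linear in Ĉ, all the scores of a candidate set can be read off the counts c(s) and d(s). The following Python sketch (ours, with toy numbers) computes H, M, W and the F1 exactly as in the expressions above.

def scores(C_hat, c, d, K):
    """Compute H, M, W and F1 for a candidate constituent set C_hat,
    given the gold counts c(s), d(s) and the number K of gold brackets."""
    H = sum(c.get(s, 0) for s in C_hat)
    M = sum(d.get(s, 0) for s in C_hat)
    W = H - M
    F1 = 2.0 * H / (K + H + M) if (K + H + M) > 0 else 0.0
    return H, M, W, F1

if __name__ == "__main__":
    # Toy counts for two candidate substrings (illustrative values only).
    c = {"ab": 2, "bc": 1}
    d = {"ab": 0, "bc": 3}
    print(scores({"ab"}, c, d, K=5))        # H=2, M=0, W=2, ...
    print(scores({"ab", "bc"}, c, d, K=5))  # H=3, M=3, W=0, ...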
Table 1. Comparison of the F1 and W measures: The scores of two bracketings with respect to the gold bracketing (S, B) = {(ab)c, a(bd), (ef)g, efh, efi}.

 i   (S, B̂i)                                P     R     F1    H   M    W
 1   {(ab)c, (ab)d, efg, efh, efi}          50%   33%   40%   1   1    0
 2   {(ab)c, (ab)d, (ef)g, (ef)h, (ef)i}    40%   67%   50%   2   3   −1
W and F1 are different measures that do not define the same ordering over the candidate bracketings B̂. For instance, Table 1 shows the scores for two different bracketings B̂1, B̂2 with respect to the same gold bracketing. Here, F1 is higher for B̂2 but W is higher for B̂1. Nevertheless, it can be proved that the W and F1 measures are related through the following formula:

f1(r, w) = 2r / (1 + 2r − w/K).   (1)

From this formula it can be seen that the F1 is monotonically increasing in both w and r. Then, if r and w are upper bounds for their respective measures, f1(r, w) is an upper bound for the F1 measure. If we do not know an upper bound for the recall, we can simply use the bound r = 1.
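Translating optimal values of W and R into an F1 bound via (1) is then immediate; the sketch below (ours, with hypothetical values for K, w and r) also shows the weaker fallback r = 1.

def f1_bound(r, w, K):
    """Upper bound for F1 from upper bounds r (recall) and w (W), Eq. (1)."""
    return 2.0 * r / (1.0 + 2.0 * r - w / float(K))

if __name__ == "__main__":
    K = 100.0                 # hypothetical number of gold brackets
    w_max, r_max = 40.0, 0.7  # hypothetical optima of W and R
    print(f1_bound(r_max, w_max, K))   # bound using both optima
    print(f1_bound(1.0, w_max, K))     # weaker bound with r = 1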
4 The Optimization of W and R
In this section we first formalize the optimization problems of W and R over the UWNTS and UWNTS-SC grammars. We define them in terms of a more general problem of optimization of a score over a class of grammars. Then, we show how to reduce the problems to Integer Linear Programming problems, using the fact that W and R are computed as linear expressions in terms of Ĉ. Finally, we show that the problems are NP-Hard.

Definition 8. Given a class of grammars G, a score function s, a set of gold sentences S and a set of gold bracketings B, the Maximum Score Grammar (MSG) problem is such that it computes

MSG(G, s, S, B) = max_{G∈G, S⊆L(G)} s(B, BrG(S)).
Definition 9. The Maximum W UWNTS (MW-UWNTS), Maximum W UWNTS-SC (MW-UWNTS-SC), Maximum R UWNTS (MR-UWNTS) and Maximum R UWNTS-SC (MR-UWNTS-SC) problems are such that they compute

MW-UWNTS(S, B) = MSG(UWNTS, W, S, B)
MW-UWNTS-SC(S, B) = MSG(UWNTS-SC, W, S, B)
MR-UWNTS(S, B) = MSG(UWNTS, R, S, B)
MR-UWNTS-SC(S, B) = MSG(UWNTS-SC, R, S, B).
4.1 Solving the Problems for UWNTS Grammars
Let us first consider the problem of W maximization, MW-UWNTS. By the characterization of the search space of Theorem 1, we have that

MW-UWNTS(S, B) = max_{C ⊆ Sub(S), C compatible in S} Σ_{s∈C} (c(s) − d(s)).
Now, let H(S, B) = ⟨V, E, w⟩ be an undirected weighted graph such that V = Sub(S), w(s) = c(s) − d(s) and E = {(s, t) : s, t incompatible in S}. An independent set of a graph is a set of nodes such that every pair of nodes in it is not connected by an edge. So, C is an independent set of H(S, B) if and only if C is compatible in S. Now, consider the Maximum Weight Independent Set problem, MWIS(G), which, given G, returns the weight of a maximum-weight independent set of G [7]. Clearly,

MW-UWNTS(S, B) = MWIS(H(S, B)).

This is a reduction of our problem to the MWIS problem, which is known to be NP-Hard [7]. In turn, MWIS is reducible to the Integer Linear Programming (ILP) problem with binary variables. This problem is also NP-Hard, but there exists software that implements efficient strategies for solving some of its instances [2]. An ILP problem with binary variables is defined by three parameters ⟨x, f, c⟩: a set of variables x that can take values in {0, 1}, an objective linear function f(x) that has to be maximized, and a set of linear constraints c(x) that must be satisfied. The result of the ILP problem is a valuation of the variables that satisfies the constraints and maximizes the objective function. In the case of MWIS, given a weighted graph G = ⟨V, E, w⟩, we define a binary variable x_v for every node v. A valuation in {0, 1} of the variables will define a subset of nodes {v | x_v = 1}. To ensure that the possible subsets are independent sets, we define the constraints c(x) = {x_v + x_w ≤ 1 | (v, w) ∈ E}. Finally, the objective function is f(x) = Σ_{v∈V} x_v w(v). Using this instance I(G) = ⟨x, f, c⟩, we have that in general MWIS(G) = ILP(I(G)), and in particular that

MW-UWNTS(S, B) = ILP(I(H(S, B))).

We illustrate the reduction of MW-UWNTS to MWIS and then to ILP with the example instance of Fig. 1 (a). The set of substrings such that w(s) ≥ 0 is {da, cd, bc, cda, ab, bch}. The input graph for the MWIS problem and the instance of the ILP problem are given in Fig. 1 (b) and (c). Now we consider the solution of the problem of recall maximization, MR-UWNTS. We can do a similar reduction to MWIS, changing the graph weights from c(s) − d(s) to c(s), and then reduce MWIS to ILP. If H'(S, B) is the MWIS instance, then

MR-UWNTS(S, B) = (1/K) ILP(I(H'(S, B))).
143
(S, B) = {(da)((bc)h), b((cd)a), (ab)e, (ab)f, (ab)g, (bc)i, (da)j} (c)
x ={xda , xcd , xbc , xcda , xab , xbch } f (x) = xda + xcd + xbc + xcda + 2xab + xbch c(x) ={xda + xcd ≤ 1, xcd + xbc ≤ 1, xbc + xcda ≤ 1, xda + xab ≤ 1, xab + xbc ≤ 1, xab + xbch ≤ 1}
Fig. 1. (a) A gold bracketing. (b) Graph for the MWIS problem. The shadowed nodes form the solution. (c) Instance for the ILP problem.
(a) H: u y x v (2-subdivision of G: u v) (b) Σ = {au , bu , av , bv , cuv , d, e} Substrings = {au bu , av bv , bu cuv , cuv av } (S, B) = {(au bu )cuv , (bu cuv )av , (cuv av )bv , (bu cuv )d, (cuv av )d, (av bv )d, (av bv )e} 1 1 1 1 (c) H(S, B): cuv av a u bu bu cuv a v bv Fig. 2. Example of conversion of a 2-subdivision graph H (a), instance of the MIS problem, to a pair (S, B) (b), instance of the MW-UWNTS problem. In (c) it is shown that H is equivalent to H(S, B).
4.2
Solving the Problems for UWNTS-SC Grammars
The problems for UWNTS-SC grammars can also be reduced to MWIS as in the previous section just changing in the graph construction the definition of the edge set. Instead of saying that (s, t) ∈ E iff s and t are incompatible in S, we say that (s, t) ∈ E iff s and t are strongly incompatible, this is, s and t overlap. But this reduction leads to a much bigger number of edges. In our experiments with the WSJ10, this resulted in a number of ILP constraints that was unmanageable by our working version of the ILP software we used. For instance, if there are n substrings that end with the symbol a and m substrings that start with a, there will be nm edges/constraints. These constraints express the fact that “every string starting with an a is strongly incompatible with every string ending with an a”, or equivalently “every string starting with an a have value 0 or every string ending with an a have value 0”. But this can be expressed with less constraints using a new ILP variable ya that has value 0 if every substring starting with a has value 0, and value 1 if every substring ending with a has value 0. To do this we use, for each substring of the form as, the constraint xas ≤ ya and for each substring of the form ta, the constraint xta ≤ 1 − ya leading to n + m constraints instead of nm. In the general form, the ILP instance is I = x, f, c, where c(x) = {xs ≤ yt |s ∈ V, t proper prefix of s} ∪ {xs ≤ 1 − yu |s ∈ V, u proper suffix of s}.
144
F.M. Luque and G. Infante-Lopez
4.3
NP-Hardness of the Problems
It can be shown that the presented problems, MW-UWNTS, MR-UWNTS, MWUWNTS-SC, and MR-UWNTS-SC, are all NP-Hard. In this section, we prove only that the MW-UWNTS problem is NP-Hard. The proofs for the rest of the problems are analogous. Theorem 3. MW-UWNTS is NP-Hard. Proof. We will prove this by reducing the NP-Hard Maximum Independent Set (MIS) graph problem to the MW-UWNTS problem. MIS is the optimization problem over unweighted graphs that is equivalent to the MWIS problem over weighted graphs where all the weights are 1. We have to provide a way to convert instances of MIS to instances of MW-UWNTS. To simplify this process, we will restrict the domain of the MIS problem to the class of 2-subdivision graphs, where it is still known to be NP-Hard [12]. This is the key idea of the proof. The 2-subdivision graphs are those graphs that are obtained from another graph by replacing every edge {u, v} by a path uxyv, where x and y are new nodes. The path uxyv is called the 2-subdivision path of the edge {u, v} and x and y are called the 2-subdivision nodes of {u, v}. Let H be a 2-subdivision graph for which we want to solve the MIS problem, and let G be the graph from which H is obtained by 2-subdividing it. We will construct an instance of the MW-UWNTS problem (S, B) in terms of H, this is, a set of gold sentences and its corresponding gold bracketing. The instance (S, B) will be such that, when reduced to the graph Gw S,B as explained in section 4.1, this graph will have all weights equal to 1 and will be structurally equal to the original graph H. As a solution of the MW-UWNTS problem gives a solution of the MWIS problem for Gw S,B , this solution is also a solution of the MIS problem for H. This way, we can successfully solve the MIS problem in terms of a MWUWNTS problem. To describe the instance (S, B), we will first define the alphabet it uses, then the set of substrings that occur in S, and finally the gold sentences S and the brackets that conform the gold bracketing B. The alphabet will have, for each node v in H that is also in G, two symbols av and bv , and for each 2-subdivision path uxyv in H, a symbol cuv . The set of substrings will have one substring for each node in H. The substrings must overlap in a way that the overlappings encode the edges of H. For every node v in G we will use the substring av bv in the treebank, and for every 2-subdivision nodes x and y of {u, v} we will use the substrings bu cuv and cuv av respectively. Note that, for every 2-subdivision path uxyv of H, the string corresponding to the node u overlaps with the string of the node x, the string of x overlaps with the string of y, and the string of y overlaps with the string of v. Also, it can be seen that there is no overlapping that doesn’t come from an edge of a 2-subdivision path of H. The sentences must contain the mentioned substrings in a way that all the pairs of substrings that overlap, effectively appear overlapped. This way, the edges of H will be rebuilt from the treebank. To do this, for each 2-subdivision path uxyv we define the sentences {au bu cuv , bu cuv av , cuv av bv }.
Bounding the Maximal Parsing Performance of NTS Grammars
145
We must define also the brackets for these sentences. These have to be such that every substring s that correspond to a node in H has weight w(s) = c(s) − d(s) = 1. This will not be possible unless we use some extra sentences. For instance, if we use the bracketings {(au bu )cuv , (bu cuv )av , (cuv av )bv } we will have w(au bu ) = 1, w(bu cuv ) = w(bu cuv ) = 0, and w(av bv ) = −1. All the weights can be set to 1 using new symbols and sentences like {(bu cuv )d, (cuv av )d, (av bv )d, (av bv )e}. The new substrings, such as cuv d, will all have weight < 0 so they will not appear in the graph. In general, it will always be possible to fix the weights of the substrings by adding new sentences. The described conversion of H to (S, B) can be carried in O(|V (H)|), or equivalently O(|V (G)| + |E(G)|), time and space. A very simple example of this process is shown in Fig. 2.
5
Upper Bounds for the WSJ10 Treebank
This section shows the results of computing concrete upper bounds for the classes of UWNTS and UWNTS-SC grammars over the WSJ10 treebank. The WSJ10 consists of the sentences of the WSJ Penn Treebank whose length is of at most 10 words after removing punctuation marks [8]. There is a total of 63742 different non-empty substrings of POS tags in Sub(S). To solve the optimization problems, the strings with w(s) ≤ 0 can be ignored because they do not improve the scores. In the case of the MW problems, where w(s) = c(s) − d(s), the number of strings with w(s) > 0 is 7029. In the case of the MR problems, where w(s) = c(s), the number of strings with w(s) > 0 is 9112. The sizes of the resulting instances are summarized in Table 2. Observe that the number of edges of the MWIS instances for UWNTS-SC grammars have a higher order of magnitude than the size of the improved ILP instances. Using the obtained results of Maximal W and Maximal R we proceeded to compute the upper bounds for the F1 using (1) from Sect. 3. Table 3 shows the results, together with the performance of actual state-of-the-art unsupervised parsers. We show the computed maximal W and maximal recall for UWNTS and UWNTS-SC grammars, and also the values of all the other measures associated to these solutions. We also show the upper bounds for the F1 in the rows labeled UBoundF1 (UWNTS) and UBoundF1 (UWNTS-SC), together with their corresponding precision and recall. RBranch is a baseline parser that parses every sentence with a right-branching bracketing. DMV+CCM is the parser from Table 2. Sizes of the MWIS and ILP instances generated by the MW and MR problems for the WSJ10 treebank Nodes MWIS Edges Variables ILP Constraints
MW-UWNTS MR-UWNTS MW-UWNTS-SC MR-UWNTS-SC 7029 9112 7029 9112 1204 45984 1467257 3166833 7029 9112 26434 28815 1204 45984 67916 79986
146
F.M. Luque and G. Infante-Lopez
Table 3. Summary of the results of our experiments with the WSJ10, in contrast with state-of-the-art unsupervised parsers. The numbers in bold mark the upper bounds or maximal values obtained in each row. Model RBranch DMV+CCM U-DOP Incremental MW-UWNTS MR-UWNTS UBoundF1 (UWNTS) MW-UWNTS-SC MR-UWNTS-SC UBoundF1 (UWNTS-SC)
P 55.1 69.3 70.8 75.6 91.2 73.8 85.0 89.1 65.3 79.9
R 70.0 88.0 88.2 76.2 62.8 69.0 69.0 52.2 61.1 61.1
F1 61.7 77.6 78.5 75.9 74.4 71.3 76.1 65.8 63.1 69.2
H
M
W
22169 24345
2127 8643
20042 15702
18410 21562
2263 11483
16147 10079
Klein and Manning [8], U-DOP is the parser from Bod [4] and Incremental is the parser from Seginer [13].
6
Discussion
Our bounding methods are specific to the evaluation approach of comparing against gold treebanks, but this is the most accepted and supported approach, and it is scalable and unbiased with respect to the evaluator [15]. Our methods are also specific to the unlabeled F1 measure as we defined it, with micro-averaging and removal of unary brackets. Also in [15], micro-averaging and removal of trivial structure are proposed as the preferred evaluation method. Our experiments show that UWNTS grammars over short English sentences of POS tags have a low performance compared to state-of-the-art unsupervised parsers. We believe that these experiments provide enough evidence to conclude that unsupervised parsing over POS tags requires more expressive grammars. However, our results should not be interpreted as a theoretical conclusion about the relationship between natural languages and UWNTS languages in general. This work is an extended and improved version of [9]. Here, we explicitly define the UWNTS grammars, improve the bounding method using the recall, formalize the optimization problems, and show that they are NP-Hard. We also define the subclass of UWNTS-SC grammars and provide an alternative reduction to ILP for them.
Acknowledgments This work was supported in part by grant PICT 2006-00969, ANPCyT, Argentina. We would like to thank Pablo Rey (UDP, Chile) for his help with ILP, and Demetrio Martín Vilela (UNC, Argentina) for his comments.
Bounding the Maximal Parsing Performance of NTS Grammars
147
References 1. Abney, S., Flickenger, S., Gdaniec, C., Grishman, C., Harrison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M., Roukos, S., Santorini, B., Strzalkowski, T.: A procedure for quantitatively comparing the syntactic coverage of English grammars. In: Black, E. (ed.) Proceedings of a workshop on Speech and natural language, pp. 306–311 (1991) 2. Achterberg, T.: SCIP - a framework to integrate Constraint and Mixed Integer Programming. Tech. rep. (2004) 3. Adriaans, P.W., Vervoort, M.: The EMILE 4.1 grammar induction toolbox. In: Adriaans, P.W., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, pp. 293–295. Springer, Heidelberg (2002) 4. Bod, R.: Unsupervised parsing with U-DOP. In: Proceedings of the 10th CoNLL (CoNLL-X), pp. 85–92 (2006) 5. Clark, A.: PAC-learning unambiguous NTS languages. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 59–71. Springer, Heidelberg (2006) 6. Clark, A.: Learning deterministic context free grammars: The Omphalos competition. Machine Learning 66(1), 93–110 (2007) 7. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Computer Computations, pp. 85–103 (1972) 8. Klein, D., Manning, C.D.: Corpus-based induction of syntactic structure: Models of dependency and constituency. In: Proceedings of the 42nd ACL, pp. 478–485 (2004) 9. Luque, F., Infante-Lopez, G.: Upper bounds for unsupervised parsing with Unambiguous Non-Terminally Separated grammars. In: Proceedings of CLAGI, 12th EACL, pp. 58–65 (2009) 10. Luque, F., Infante-Lopez, G.: PAC-learning unambiguous k, l-NTS≤ languages. In: J.M. Sempere and P. García (Eds.): ICGI 2010. LNCS(LNAI), vol. 6339, pp. 122– 134. Springer, Heidelberg (2010) 11. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of english: The Penn treebank. Computational Linguistics 19(2), 313–330 (1994) 12. Poljak, S.: A note on stable sets and coloring of graphs. Commentationes Mathematicae Universitatis Carolinae 15(2), 307–309 (1974) 13. Seginer, Y.: Fast unsupervised incremental parsing. In: Proceedings of the 45th ACL, pp. 384–391 (2007) 14. van Zaanen, M.: ABL: alignment-based learning. In: Proceedings of the 18th conference on Computational linguistics, pp. 961–967 (2000) 15. van Zaanen, M., Geertzen, J.: Problems with evaluation of unsupervised empirical grammatical inference systems. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 301–303. Springer, Heidelberg (2008)
CGE: A Sequential Learning Algorithm for Mealy Automata Karl Meinke School of Computer Science and Communication, Royal Institute of Technology, 100-44 Stockholm, Sweden
Abstract. We introduce a new algorithm for sequential learning of Mealy automata by congruence generator extension (CGE). Our approach makes use of techniques from term rewriting theory and universal algebra for compactly representing and manipulating automata using finite congruence generator sets represented as string rewriting systems (SRS). We prove that the CGE algorithm correctly learns in the limit.
1
Introduction
Developments in software testing such as black-box checking [Peled et al. 1999], learning-based testing [Meinke 2004], adaptive model checking [Groce et al. 2006], and dynamic testing [Raffelt et al. 2008] motivate the need for new types of learning algorithms for automata. The common aim of these approaches to software testing is to model an unknown black-box system under test (SUT) as some form of automaton, and to dynamically learn this automaton and analyse its behavioural correctness using relevant test cases as membership queries and the SUT itself as the teacher. A learning algorithm for automata is sequential if it can produce a sequence of hypothesis automata A0 , A1 , . . . which are approximations to an unknown automata A, based on a sequence of information (queries and results) about A. This sequence should converge to a behaviourally equivalent automaton when given sufficient information about A. A sequential algorithm is incremental if computation of successive approximations can reuse previous results (e.g. equivalence relations on states). For further details see e.g. [Parekh and Honavar 2000]). Our research [Meinke 2004] on learning-based testing has shown that sequential and incremental learning algorithms are crucial for the effectiveness of the above testing methods for four important reasons. (1) Real software systems may be too big to be exactly or completely learned within a feasible timescale. (2) Testing of specific system requirements or use cases typically does not require learning and analysis of the entire software system. (3) The overhead of SUT execution to answer a membership query during learning may be non-neglegible compared with the execution time of the learning algorithm itself (see e.g. [Bohlin and Jonsson 2008]). Therefore membership queries should be seen as “expensive”. J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 148–162, 2010. c Springer-Verlag Berlin Heidelberg 2010
CGE: A Sequential Learning Algorithm for Mealy Automata
149
(4) Since membership queries are expensive, as many queries (i.e. test cases) as possible should be derived from the behavioural analysis of the hypothesis automaton, while as few queries as possible should be derived for reasons of internal book-keeping by the learning algorithm. Ideally, every membership query should represent a relevant and interesting test case. In this paper we introduce a new sequential learning algorithm satisfying these four criteria. We prove that it correctly learns in the limit in the sense of [Gold 1967]. Since [Pao, Carr 1978] and [Dupont et al. 1994] it has been clear that automata learning can be viewed as a search problem in the lattice of automata solutions. We follow this philosophy, while changing it slightly to search among the lattice of all consistent finitely generated congruences on the prefix automaton. Hence, a significant contribution of our approach is to show how universal algebra (see e.g. [Meinke, Tucker 1993]) and term rewriting (see e.g. [Dershowitz, Jouannaud 1990]) can be applied to the learning problem. Using finitely generated congruences increases the efficiency of hypothesis automaton representation. Congruence construction is non-trivial in the case of Mealy automata since two interrelated congruences, on states and output values, must be learned. This makes the search space of solutions inherently more complex than the search space for learning a regular language acceptor. The CGE learning algorithm computes a pair of finite congruence generator sets, which is inherently more compact than an explicit state space partition for large automata. Furthermore, each hypothesis automaton can be simulated and statically analysed (for example by model checking) without ever explicitly constructing it as a quotient automaton. One can perform these operations using the generator sets alone. In Section 1.1, we review some automata learning algorithms from the literature. In Section 2, we briefly review some elementary concepts from automata theory and universal algebra. In Section 3 we define and prove the correctness of a new prefix completion algorithm for SRS which is more efficient than the well known Knuth-Bendix completion algorithm [Knuth and Bendix 1970]. This algorithm efficiently constructs (a generator set for) the smallest state congruence containing a given set of generator pairs. In Section 4, we present the CGE learning algorithm and prove its correctness. In Section 5 we draw some conclusions, and in Appendix 1 we present a simple illustrative case study of CGE learning. 1.1
Sequential and Incremental Learning Algorithms for Automata
In [Dupont 1996], an incremental version RPNI2 of the RPNI learning algorithm of [Oncina and Garcia 1992] and [Lang 1992] is presented. The RPNI2 algorithm has some features in common with the CGE algorithm, but there are significant differences. Both RPNI2 and CGE perform a recursive depth first search of a lexicographically ordered state set with backtracking. However, RPNI2 is explicitly coded for Moore machines with binary (positive/negative) outputs (i.e. language recognition). Furthermore, RPNI2, like many other automata learning algorithms, represents and manipulates the hypothesis automaton state set by computing an equivalence relation on input strings. By contrast, CGE uses a
150
K. Meinke
purely symbolic approach based on finite congruence generator sets represented as string rewriting systems (SRS) that are used to compute normal forms of states. This latter representation is more compact because a congruence is an equivalence relation that is also closed under substitution. Several computational steps in RPNI2 such as quotient automaton derivation, and compatibility (consistency) checking by parsing, are unnecessary in the CGE algorithm, or are replaced by an efficient alternative based on string rewriting. Furthermore, RPNI2 computes a non-deterministic hypothesis automaton that is subsequently rendered deterministic, whereas CGE always maintains a deterministic hypothesis automaton by working directly with congruences. However, both RPNI2 and CGE learn in the limit. Neither algorithm needs to generate any new membership queries, other than those contained in the current query database, in order to produce a hypothesis automaton. Thus both RPNI2 and CGE fulfill criterion (4) of Section 1 above. The IID algorithm of [Parekh et al. 1998] and the incremental learning algorithm introduced in [Porat, Feldman 1991] are also explicitly coded for positive/negative outcomes only (language recognition). Furthermore, because of internal book-keeping operations (see [Meinke and Sindhu 2010]) and lexicographic query ordering, these algorithms seem less efficient according to criterion (4).
2
Congruences, Generators and Quotient Automata
It is natural to model a Mealy automaton as a many-sorted algebraic structure (c.f. [Meinke, Tucker 1993]) by considering states and outputs as two separate types. In the sequel, Σ = { σ1 , . . . , σm } is a fixed finite input alphabet, and Ω = { ω1 , . . . , ωn } is a fixed finite output alphabet. 2.1. Definition. A Mealy automaton A over Σ and Ω is a two-sorted algebra 0 , ω1A , . . . , ωnA A = QA , ΩA , δA , λA , qA
where the carrier set QA is termed the state set of A and the carrier set ΩA is termed the output set of A. If the state set QA is finite then A is termed a finite state Mealy automaton otherwise A is termed infinite state. Also δA : Q A × Σ → Q A ,
λA : QA × Σ → ΩA
are the state transition function, and output function respectively. Furthermore 0 ∈ QA is the initial state and ω1A , . . . , ωnA ∈ ΩA are output constants that qA interpret the output symbols ω1 , . . . , ωn respectively. We let MA(Σ, Ω) denote the class of all Mealy automata over Σ and Ω. 2.2. Example. Define the term, initial or absolutely free Mealy automaton T (Σ, Ω) ∈ MA(Σ, Ω) (which is infinite state) by: QT (Σ, Ω) = Σ ∗ ,
ΩT (Σ,
Ω)
= Σ+ ∪ Ω
CGE: A Sequential Learning Algorithm for Mealy Automata
then for any σ ∈ Σ ∗ and σ ∈ Σ, δT (Σ, Ω) (σ, σ) = λT (Σ, qT0 (Σ, Ω) = ε and ωiT (Σ, Ω) = ωi for i = 1, . . . , n.
Ω) (σ,
151
σ) = σ . σ. Also
It is easily shown that T (Σ, Ω) is initial in the class MA(Σ, Ω), i.e. there exists a unique homomorphism φ : T (Σ, Ω) → A for each A ∈ MA(Σ, Ω). Hence every minimal or reachable Mealy automaton A ∈ MA(Σ, Ω) can be constructed as a quotient of T (Σ, Ω). This principle is applied in Section 4. An observation for any Mealy automaton is defined to be a pair (σ, ω) ∈ Σ + × Ω consisting of an input sequence and the final output value. We define lhs((σ, ω)) = σ, and extend this operation to sets of obervations. Recall the principles of quotient construction applied to Mealy automata. 2.3. Definition. Let A ∈ MA(Σ, Ω) be any Mealy automaton, and let ≡ = ≡Q , ≡Ω , where ≡Q ⊆ QA × QA is an equivalence relation on states and ≡Ω ⊆ ΩA × ΩA is an equivalence relation on outputs. We say that ≡ is a congruence on A if, and only if, ≡ satisfies the following substitutivity conditions. For any q, q ∈ QA and σ ∈ Σ, if q ≡Q q then: δA ( q, σ ) ≡Q δA ( q , σ ), and λA ( q, σ ) ≡Ω λA ( q , σ ). In this case, ≡Q is termed a state congruence and ≡Ω is termed an output congruence. Let X = XQ , XΩ be a pair of binary relations XQ ⊆ QA × QA and XΩ ⊆ ΩA × ΩA then X generates ≡ if and only if, ≡ is the smallest congruence on A containing X, and ≡ is finitely generated if XQ and XΩ are both finite. Define the quotient automaton A/ ≡ ∈ MA(Σ, Ω) by QA/≡ = QA /≡Q ,
ΩA/≡ = ΩA /≡Ω
and for any q ∈ QA and σ ∈ Σ, δA/≡ ( q/≡Q , σ ) = δA ( q, σ )/≡Q ,
λA/≡ ( q/≡Q , σ ) = λA ( q, σ )/≡Ω .
0 0 = qA /≡Q and ωiA/≡ = ωiA /≡Ω for 1 ≤ i ≤ n. Also qA/≡
For black-box testing and many other applications, it suffices to learn an unknown automaton up to behavioural equivalence. 2.4. Definition. Let A, B ∈ MA(Σ, Ω) be any Mealy automata, and let ≡φ and ≡ψ be the kernels of the unique homomorphisms φ : T (Σ, Ω) → A and ψ : T (Σ, Ω) → B respectively. We say that A and B are behaviourally equivalent if, and only if, ≡φΩ = ≡ψ Ω . Intuitively, two Mealy automata A and B are behaviourally equivalent if A and B always produce the same output sequences given the same input sequence.
3
String Rewriting Systems and Rule Completion
In this section we consider the concept of a confluent terminating string rewriting system (SRS) which not only provides a compact representation of a congruence,
152
K. Meinke
but also comes with a natural model of computation known as string rewriting. This allows us to directly simulate computations performed by quotient automata while avoiding their explicit construction. String rewriting is a special case of the more general theory of term rewriting ([Dershowitz, Jouannaud 1990]). We borrow important concepts from this theory such as the completion of a set of rewrite rules. Our main result in this section is to define and prove the correctness of the prefix completion algorithm. This auxiliary algorithm plays an important role for efficient congruence construction and consistency checking in the CGE algorithm of Section 4. 3.1. Definition (i) A string rewriting rule over Σ is a pair (l, r) ∈ Σ ∗ × Σ ∗ . Often we use the more intuitive notation l → r to denote the rule (l, r). By a string rewriting system (SRS) over Σ we mean a set R ⊆ Σ ∗ × Σ ∗ of string rewriting rules. (ii) Let ρ = (l, r) be a string rewriting rule and let σ, σ ∈ Σ ∗ be any strings. ρ We say that σ rewrites to σ using ρ and write σ −→ σ if, and only if for some σ0 ∈ Σ ∗ and σ = l . σ0 , i.e. l is a prefix of σ and, σ = r . σ0 . (iii) If R ⊆ Σ ∗ × Σ ∗ is an SRS then we say that σ rewrites to σ using R in one ρ R step and write σ −→ σ if, and only if, for some ρ ∈ R, σ −→ σ . R∗
R
(iv) We let −→ ⊆ Σ ∗ × Σ ∗ denote the reflexive transitive closure of −→ . We R∗
define the bi-rewriting relation ←→ ⊆ Σ ∗ × Σ ∗ for any σ, σ ∈ Σ ∗ , by: R∗
R∗
R∗
σ ←→ σ ⇔ ∃σ0 ∈ Σ ∗ such that σ −→ σ0 and σ −→ σ0 . Any rewrite sequence can be shown to terminate if it consists of a sequence of strings that can be shown to be strictly decreasing according to some wellordering, e.g. the short-lex ordering on strings. 3.2. Definition (i) Let D ⊆ Σ × Σ be a linear ordering on Σ, and let ≤D ⊆ Σ n × Σ n for n ≥ 1 be the induced lexicographic ordering. Define the short-lex ordering ≤D ⊆ Σ ∗ × Σ ∗ by σ1 , . . . , σm ≤D σ1 , . . . , σn ⇔ m < n or m = n and σ1 , . . . , σm ≤D σ1 , . . . , σn . Then ≤D is a well-ordering on finite strings. Our main purpose for introducing this ordering is to prove termination of rewrite sequences by showing that the individual strings in a sequence are well-ordered. 3.3. Definition. Let l → r ∈ Σ ∗ × Σ ∗ be a rewrite rule. We say that l → r is reducing if, and only if, r
CGE: A Sequential Learning Algorithm for Mealy Automata
153
Algorithm 1. Prefix Completion Algorithm 1 2 3 4 5 6 7 8
9
10 11 12
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Function C(R) unf inished ← true while unfinished do Compute the set M ax (R) of all maximal lhs terms foreach l ∈ M ax (R) do construct the tower T (l) under l unf inished ← f alse foreach l ∈ M ax (R) do // Process the first block in T (l) // Collect all rhs terms from rules in block T (l)(1) for j ← 2 to |T (l)(1)| do Rhs T erms ← Rhs T erms ∪ { T (l)(i)(j)2 } // Get the minimum rhs term in block T (l)(1) rhs min ← T (l)(1)(1)2 // Orient all critical pairs as new rules. forall t ∈ Rhs T erms do if t → rhs min ∈ R then R ← R ∪ { t → rhs min } unf inished ← true // Process all remaining blocks in T (l) for i ← 2 to |T (l)| do // Collect all rhs terms from rules in block T (l)(i) for j ← 2 to |T (l)(i)| do Rhs T erms ← Rhs T erms ∪ { T (l)(i)(j)2 } // Complete the previous minimum rhs term rhs min completed ← concat(rhs min, (T (l)(i)(1)1 − T (l)(i − 1)(1)1 )) // Compute new value of rhs min and update Rhs Terms if T (l)(i)(1)2 ≤D rhs min completed then rhs min ← T (l)(i)(1)2 if T (l)(i)(1)2 = rhs min completed then Rhs T erms ← Rhs T erms ∪ { rhs min completed } else rhs min ← rhs min completed Rhs T erms ← Rhs T erms ∪ { T (l)(i)(1)2 } // Orient all critical pairs as new rules forall t ∈ Rhs T erms do if t → rhs min ∈ R then R ← R ∪ { t → rhs min } unf inished ← true return R end
In general, the order in which rewrite rules are applied can influence the outcome. An SRS which does not depend on the order of application of rules is termed a confluent SRS. 3.5. Definition. Let R ⊆ Σ ∗ × Σ ∗ be an SRS.
154
K. Meinke R∗
(i) We say that R is confluent if, and only if, for any σ0 , σ1 , σ2 ∈ Σ ∗ , if σ0 −→ σ1 R∗
R∗
R∗
and σ0 −→ σ2 then there exists σ ∈ Σ ∗ such that σ1 −→ σ and σ2 −→ σ. (ii) We say that R is locally confluent if, and only if, for any σ0 , σ1 , σ2 ∈ Σ ∗ , R
R∗
R
if σ0 −→ σ1 and σ0 −→ σ2 then there exists σ ∈ Σ ∗ such that σ1 −→ σ and R∗ σ2 −→ σ. Clearly confluence implies local confluence. The converse comes from a celebrated general result in term rewriting theory. 3.6. Newman’s Lemma. Let R ⊆ Σ ∗ × Σ ∗ be a reducing SRS. Then R is locally confluent if, and only if, R is confluent. Non-confluent SRS are undesirable since the resulting equivalence classes lack unique normal forms. Fortunately, they can sometimes be converted to an equivalent confluent SRS by completion, i.e. conservatively adding extra rewrite rules that rectify divergent critical pairs of rules. We introduce an efficient algorithm that completes R by adding an additional set of rules that grows only linearly in the size of the problem. For this purpose we use a specific well-ordering on rewrite rules. Let ⊆ Σ ∗ × Σ ∗ be the prefix ordering on strings. 3.7. Definition. Define the tower ordering ≤T (D) ⊆ (Σ ∗ × Σ ∗ )2 for any ρ1 = l1 → r1 and ρ2 = l2 → r2 by ρ1 ≤T (D) ρ2 ⇔ l1 ≺ l2 or l1 = l2 and r1 ≤D r2 . Next we introduce an abstract data structure for storing rewrite rules in such a way that we can conveniently access exactly those pairs of rules that yield critical pairs. 3.8. Definition (i) Let R ⊆ Σ ∗ × Σ ∗ be any finite SRS. Define Max (R) to be the set of all maximal left hand sides of rules in R under the prefix ordering, i.e. Max
(R) =
{ l ∈ Σ ∗ | l → r ∈ R and there is no l → r ∈ R such that l ≺ l }. (ii) For each maximal lhs l ∈ Max (R) define the tower of rules T (l) ∈ ((Σ ∗ × Σ ∗ )+ )+ under l to be the finite sequence of finite sequences of all rewrite rules l → r ∈ R such that l l ordered according to the following conditions. Let us term the ith finite sequence T (l)(i) the ith block in T (l). Then each block T (l)(i) must satisfy the following properties: (ii.a) All rules in the same block T (l)(i) have the same lhs, i.e. for all 1 ≤ i ≤ |T (l)|, and all 1 ≤ j, k ≤ |T (l)(i)|, T (l)(i)(j)1 = T (l)(i)(k)1 . (ii.b) All blocks are strictly linearly ordered by the prefix ordering on their unique lhs, i.e. for all 1 ≤ i, j ≤ |T (l)|, i < j implies T (l)(i)(1)1 ≺ T (l)(j)(1)1 . (ii.c) All rewrite rules in the same block T (l)(i) are strictly linearly ordered by the short-lex ordering ≤D on their rhs, i.e. for all 1 ≤ i ≤ |T (l)| and for all 1 ≤ j, k ≤ |T (l)(i)|, j < k implies T (l)(i)(j)2
CGE: A Sequential Learning Algorithm for Mealy Automata
155
3.9. Prefix Completion Algorithm Define the prefix completion function C : ℘fin (Σ ∗ × Σ ∗ ) → ℘fin (Σ ∗ × Σ ∗ ) constructively by Algorithm 1. 3.10. Correctness Theorem. Let R ⊆ Σ ∗ × Σ ∗ be any finite reducing SRS. The prefix completion algorithm terminates given input R and C(R) is a finite C(R)∗
confluent reducing SRS. Also the bi-rewriting relation ←→ ⊆ Σ ∗ × Σ ∗ is the smallest state congruence on the term automaton T (Σ, Ω) that contains R. Proof. Exercise. The most important property of a confluent reducing SRS (used in Consistency Algorithm 4.2) is that it yields a unique normal form for every string. 3.11. Definition. Let R ⊆ Σ ∗ × Σ ∗ be a confluent reducing SRS. For any σ ∈ Σ ∗ we define the normal form Norm R (σ) ∈ Σ ∗ of σ (modulo R) to be the unique irreducible string obtained by rewriting σ using R. To construct a quotient automaton it is not enough to have a state congruence, we also need an output congruence. 3.12. Definition. Let ≡ ⊆ Σ ∗ × Σ ∗ be any binary relation on Σ ∗ . Let Λ ⊆ Σ + × Ω be a set of observations. Define the binary relation ≡Λ ⊆ (Σ + ∪ Ω)2 by ≡Λ = RST ( Λ ∪ { (σ . σ, σ . σ) | (σ, σ ) ∈ ≡ and σ ∈ Σ } ), where RST (X) is the reflexive symmetric transitive closure of X. 3.13. Proposition. Let ≡ ⊆ Σ ∗ × Σ ∗ be any state congruence on T (Σ, Ω). Let Λ ⊆ Σ + ×Ω be a set of observations. Then ≡Λ is the smallest output congruence on the term automaton T (Σ, Ω) that contains Λ. Proof. Exercise.
4
Learning by Congruence Generator Extension (CGE)
In this section we introduce the CGE learning algorithm for Mealy automata and prove its correctness. The basic idea of our method is to sequentially construct a sequence of hypothesis automata H1 , H2 , . . . based on the results of a series of observations o1 , o2 , . . . about an unknown Mealy automaton A. We represent each hypothesis automaton Hi as a quotient term automaton: Hi = T (Σ, Ω)/ ≡i ,
i = 1, 2, . . . ,
and the sequence will eventually converge to a quotient automaton Hn that is behaviorally equivalent to A (Theorem 4.7), and has a minimised state space. Each congruence ≡i , for i = 1, 2, . . ., is constructed from the current finite
156
K. Meinke
observation set Λi = { o1 , o2 , . . . , oi }. We identify bounds on n and Λn which guarantee convergence, in terms of the size and structure of A. Key features of the CGE learning algorithm are the following: (i) Learning in the limit is achieved in the sense of [Gold 1967]. (ii) The number of observations between successive hypothesis automata constructions Hi and Hi+1 can be as small as one observation oi+1 . (Recall Section 1 criterion (4).) (iii) There is always complete freedom to choose a new observation oi+1 . For example, new observations may be made randomly, or from static analysis of the current hypothesis Hi against a requirement specification. (iv) The hypothesis automaton Hi is unambiguously represented by finite congruence generator sets and never explicitly constructed. Let us begin by formalising some notions of consistency. 4.1. Definition (i) Let ≡ ⊆ (Σ + ∪ Ω)2 be an equivalence relation on outputs. We say that ≡ is consistent if and only if, for any output symbols ω, ω ∈ Ω ω = ω ⇒ ω ≡ ω. (ii) Let ≡ ⊆ Σ ∗ ×Σ ∗ be any relation and Λ ⊆ Σ + ×Ω be any set of observations. Then ≡ is said to be Λ-consistent if, and only if ≡Λ is consistent (iii) Let R ⊆ Σ ∗ × Σ ∗ be a confluent reducing SRS then R is said to be ΛR∗
consistent if and only if the bi-rewriting relation ←→^{R∗} is Λ-consistent. (iv) Let A ∈ MA(Σ, Ω) be any automaton. Then A is said to be consistent if, and only if, for any output symbols ω, ω′ ∈ Ω, ω ≠ ω′ ⇒ ωA ≠ ω′A.
During CGE we build up each SRS Ri (which represents ≡i) sequentially from the empty set by checking new rules for consistency.
4.2. Consistency Algorithm. Define the consistency function Cons : ℘fin(Σ∗ × Σ∗) × ℘fin(Σ+ × Ω) × Σ∗ × Σ∗ → { true, false } for any SRS R ⊆ Σ∗ × Σ∗, observation set Λ ⊆ Σ+ × Ω and strings σ, σ′ ∈ Σ∗ by:
Cons(R, Λ, σ, σ′) = let S = C( R ∪ { σ → σ′ } ) in
  false  if NormS(σ1, . . . , σm) = NormS(σ′1, . . . , σ′n) for some (σ1, . . . , σm, σ′′, ω) ∈ Λ and (σ′1, . . . , σ′n, σ′′, ω′) ∈ Λ with ω ≠ ω′,
  true   otherwise.
4.3. Correctness Theorem. Let R ⊆ Σ∗ × Σ∗ be a finite confluent reducing SRS and let Λ ⊆ Σ+ × Ω be any set of observations. For any σ, σ′ ∈ Σ∗ such that σ
Proof. Exercise, using Definition 3.13.
The CGE algorithm takes as input a set Λ of observations, an SRS R which is under construction, a finite sequence A = A1, . . . , Al of strings Ai ∈ Σ∗ (state representations) which is strictly linearly ordered by ≤D, and two indices m and n.
4.4. CGE Learning Algorithm. Let Λ ⊆ Σ+ × Ω be a given finite set of observations. Define the Congruence Generator Extension function CGEΛ : ℘fin(Σ∗ × Σ∗) × (Σ∗)+ × N × N → ℘fin(Σ∗ × Σ∗) constructively for any SRS R ⊆ Σ∗ × Σ∗, any finite sequence A = A1, . . . , Al ∈ (Σ∗)+ of strings, and any indices m, n ∈ N, by Algorithm 2. Define the congruence generator extension function CGEΛ(A) = CGEΛ( ∅, A, 2, 1 ).
4.5. Proposition. Let Λ ⊆ Σ+ × Ω be any finite set of observations on a consistent automaton A ∈ MA(Σ, Ω). Let A ∈ (Σ∗)+ be a sequence of strings of length 2 or more (|A| ≥ 2), in strictly ascending ≤D order. Then
CGEΛ(A) is a Λ-consistent, confluent and reducing SRS, and hence ←→^{CGEΛ(A)∗} is a state congruence on T(Σ, Ω). Proof. By induction.
The key insight for proving the correctness of CGE learning is to establish that all loops in the path structure of an unknown automaton A will be correctly learned by executing a minimum number of observations on A. This number is bounded by the maximum loop size of A. So under an appropriate observation strategy the sequence of hypothesis automata generated by CGE learning eventually converges in terms of overall state space size.
4.6. Convergence Theorem. Let A ∈ MA(Σ, Ω) be a consistent automaton. Let n be the length of the longest acyclic path in A. Let Λ ⊆ Σ+ × Ω be any set of observations on A that contains all observations of length 2n + 1 or less. Let Λ+ be the prefix closure of all inputs in Λ, and let Λ+
For every σ ∈ Σ∗ there exists σ′ ∈ Σ∗ such that σ −→ σ′ and |σ′| ≤ n. Proof. By induction on the well-ordering ≤D.
Algorithm 2. CGE Algorithm

Function CGEΛ( R, A, m, n )
    // Check subsumption and consistency of rule Am → An
    if ¬( Am ←→^{R∗} An ) and Cons( R, Λ, Am, An ) then
        // Add rule Am → An to R
        if n = m − 1 and m < |A| then
            return CGEΛ( C( R ∪ { Am → An } ), A, m + 1, 1 )
        else if n < m − 1 then
            return CGEΛ( C( R ∪ { Am → An } ), A, m, n + 1 )
        else
            // Finished traversal of A
            return C( R ∪ { Am → An } )
    else
        // Don't add rule Am → An to R
        if n = m − 1 and m < |A| then
            return CGEΛ( R, A, m + 1, 1 )
        else if n < m − 1 then
            return CGEΛ( R, A, m, n + 1 )
        else
            // Finished traversal of A
            return R
end
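The following sketch transcribes the control flow of Algorithm 2, together with the consistency check as reconstructed in 4.2 above, into Python. It is illustrative only: the prefix completion C of 3.9 is replaced by an identity stand-in passed as a parameter, observations are assumed to be (input string, output symbol) pairs, and the subsumption test uses normal forms (which coincides with ←→^{R∗} only when the rule set is confluent). All names and representations are our own assumptions, not the paper's implementation.

```python
from itertools import product

def norm(sigma, rules):
    """Prefix rewriting to a normal form (same helper as in the earlier sketch)."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if sigma.startswith(lhs) and (len(rhs), rhs) < (len(lhs), lhs):
                sigma = rhs + sigma[len(lhs):]
                changed = True
                break
    return sigma

def cons(rules, obs, s, t, complete=lambda r: r):
    """Consistency check (4.2): adding the rule s -> t must not identify the
    prefixes of two observations that end in the same input symbol but
    produce different outputs."""
    system = complete(rules | {(s, t)})
    for (u, w1), (v, w2) in product(obs, repeat=2):
        if w1 != w2 and u[-1] == v[-1] and norm(u[:-1], system) == norm(v[:-1], system):
            return False
    return True

def cge(rules, A, obs, complete=lambda r: r):
    """Iterative rendering of the recursion in Algorithm 2, starting at (m, n) = (2, 1).
    A is a sequence of state-representation strings, treated 1-based as in the text."""
    assert len(A) >= 2          # required by Proposition 4.5
    m, n = 2, 1
    while True:
        am, an = A[m - 1], A[n - 1]
        subsumed = norm(am, rules) == norm(an, rules)
        if not subsumed and cons(rules, obs, am, an, complete):
            rules = complete(rules | {(am, an)})        # add rule Am -> An
        if n < m - 1:
            n += 1                                      # next candidate rhs for Am
        elif m < len(A):
            m, n = m + 1, 1                             # next lhs, restart rhs
        else:
            return rules                                # finished traversal of A

# Tiny usage example over the alphabet {a}: observations of a machine that
# always outputs 0, with state representations listed in short-lex order.
if __name__ == "__main__":
    observations = {("a", 0), ("aa", 0), ("aaa", 0)}
    A = ["", "a", "aa"]
    print(sorted(cge(set(), A, observations)))          # learns the rule a -> ""
```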
From Convergence Theorem 4.6 we can conclude that the state space size of each generated hypothesis automaton is bounded from above by |Σ|^n, where n is the length of the longest acyclic path in A. Our main result is the following theorem.
4.7. Correctness Theorem. Let A ∈ MA(Σ, Ω) be a consistent automaton. Let n be the length of the longest acyclic path in A. Let Λ ⊆ Σ+ × Ω be any set of observations on A that contains all observations of length 2n + 1 or less. Let R = CGEΛ(Λ+). Then A and T(Σ, Ω)/( ←→^{R∗}, (←→^{R∗})Λ ) are behaviourally equivalent. Proof. Induction using Theorem 4.6, Definitions 2.4 and 3.12.
As we indicated at the start of this section, the CGE function presented in Algorithm 2 is applied iteratively to a sequence of input/output observations on an unknown automaton A. Correctness Theorem 4.7 then establishes that when this sequence of observations is sufficiently large, the sequence of hypothesis automata indeed converges to A (up to behavioural equivalence). This iterative method is defined more precisely in Algorithm 3. In line 6 we compute the R-normal form of the input string of each observation in the prefix closure PrefObs((σ̄i, ωi)) of observation (σ̄i, ωi), which consists of all prefixes of σ̄i and their respective outputs. Taking the prefix closure speeds up average convergence.
Algorithm 3. CGE Iteration Algorithm

Input: A sequence S = (σ̄1, ω1), . . . , (σ̄n, ωn) of n observations of the I/O behavior of A, where (σ̄i, ωi) = (σi^1, . . . , σi^{k(i)}, ωi) ∈ Σ+ × Ω.
Output: A sequence (Ri^state, Ri^output) for i = 1, . . . , n of congruence generator sets for quotient Mealy machines Mi represented as SRS.
1.  begin
2.    // Perform Initialization
3.    Λ = ∅, ΛR = ∅, R = ∅, A = ∅, i = 1
4.    while i ≤ n do
5.      // Normalise the i-th observation and all its prefix observations with R
6.      norm = NormR( PrefObs((σ̄i, ωi)) )
7.
8.      if norm ⊈ ΛR then
9.        // At least one prefix observation of the i-th observation (σ̄i, ωi)
10.       // has no equivalent in ΛR.
11.       // So update Λ, R and ΛR. This will also resolve inconsistency
12.       // if some prefix of (σ̄i, ωi) is inconsistent with R.
13.       Λ = Λ ∪ { PrefObs((σ̄i, ωi)) }
14.       A = lhs(Λ) ∪ { τσ0 | ∃ σ ∈ lhs(Λ), τ ≺ σ, σ0 ∈ Σ }
15.       R = CGEΛ( A
Notice also that the input sequence is pruned (line 8), so that only observations that give rise to a new hypothesis automaton are accepted and integrated. On line 14 we extend the set Λ of input strings with all one element extensions of proper prefixes. This step ensures that the quotient automaton will be finite (c.f. the kernel construction in [Dupont 1996]). For convenience, the sequence of constructed hypothesis automata, each of which is represented as a pair of SRS, is buffered in a print statement (line 18). To illustrate the principles of our approach, in Appendix 1 we give a small case study of the sequence of SRS produced when learning a simple Mealy machine using CGE.
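Since the listing of Algorithm 3 is cut short above, the following sketch shows one way the outer iteration can be organised around the CGE call: computing the prefix observations of a new observation, checking whether anything genuinely new was observed, and rebuilding the state-representation set A. The bookkeeping of ΛR is simplified here to a set of normalised (input, output) pairs; this, and the helper names, are assumptions rather than the paper's exact data structures.

```python
def pref_obs(inputs, outputs):
    """All prefix observations of one observation: the j-th prefix of the
    input paired with the j-th output symbol."""
    return {(inputs[:j], outputs[j - 1]) for j in range(1, len(inputs) + 1)}

def iterate_cge(stream, alphabet, cge, norm):
    """Feed a stream of observations (input string, output string) to CGE,
    re-learning only when a prefix observation has no equivalent yet."""
    Lam, LamR, R = set(), set(), set()
    hypotheses = []
    for inputs, outputs in stream:
        new = pref_obs(inputs, outputs)
        normed = {(norm(u, R), w) for (u, w) in new}
        if not normed <= LamR:                        # pruning: something new observed
            Lam |= new
            # State representations: all observed inputs plus one-symbol
            # extensions of their proper prefixes (keeps the quotient finite).
            lhs = {u for (u, _) in Lam}
            A = sorted(lhs | {u[:j] + a for u in lhs
                              for j in range(len(u)) for a in alphabet},
                       key=lambda s: (len(s), s))     # short-lex order
            R = cge(set(), A, Lam)
            LamR = {(norm(u, R), w) for (u, w) in Lam}
            hypotheses.append(R)
    return hypotheses
```

The `cge` and `norm` callables are the ones sketched earlier; passing them in keeps this fragment independent of any particular implementation of the completion step.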
5 Conclusions
We have introduced the CGE algorithm for sequential learning of Mealy automata. This algorithm efficiently represents and manipulates learned hypothesis automata using pairs of finite generator sets for state and output congruences represented as string rewriting systems (SRS). We have developed the prefix completion algorithm for efficient SRS representation. We have shown that CGE correctly learns in the limit. Elsewhere we will present detailed proofs and show that it computes the minimum state space.
The simple CGE algorithm presented here is rich in optimisations that can be applied, though the proof of correctness becomes progressively more complex. We will explore such optimisations in future research. For example, this algorithm is not yet as incremental as one would like, since the SRS set R is fully recomputed each time a new observation is read, even if it is consistent with the current set R. It seems possible to reduce this computation and avoid recomputing the whole of R. Nevertheless, computing a monotonically increasing sequence of SRS is fundamentally impossible, since each SRS rule is a current hypothesis about A that might possibly be rejected later. So purely incremental learning is generally not possible. We gratefully acknowledge assistance from Mr Niu Fei in the construction of a Python implementation of CGE, which was used for initial benchmarking studies. Financial support for this research came from the Swedish Research Council (VR) and the European Union under project HATS FP7-231620.
References
[Balcazar et al. 1997] Balcazar, J.L., Diaz, J., Gavalda, R.: Algorithms for learning finite automata from queries: a unified view. In: Advances in Algorithms, Languages and Complexity, pp. 53–72. Kluwer, Dordrecht (1997)
[Bohlin and Jonsson 2008] Bohlin, T., Jonsson, B.: Regular Inference for Communication Protocol Entities. Tech. Report 2008-024, Dept. of Information Technology, Uppsala University (2008)
[Dershowitz, Jouannaud 1990] Dershowitz, N., Jouannaud, J.-P.: Rewrite systems. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science. North Holland, Amsterdam (1990)
[Dupont 1996] Dupont, P.: Incremental regular inference. In: Miclet, L., de la Higuera, C. (eds.) ICGI 1996. LNCS (LNAI), vol. 1147, pp. 222–237. Springer, Heidelberg (1996)
[Dupont et al. 1994] Dupont, P., Miclet, L., Vidal, E.: What is the search space of the regular inference? In: Carrasco, R.C., Oncina, J. (eds.) ICGI 1994. LNCS, vol. 862, pp. 25–37. Springer, Heidelberg (1994)
[Gold 1967] Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)
[Groce et al. 2006] Groce, A., Peled, D., Yannakakis, M.: Adaptive Model Checking. Logic Journal of the IGPL 14(5), 729–744 (2006)
[Knuth and Bendix 1970] Knuth, D.E., Bendix, P.: Simple word problems in universal algebras. In: Leech, J. (ed.) Computational Problems in Abstract Algebra, pp. 263–269. Pergamon Press, Oxford (1970)
[Lang 1992] Lang, K.J.: Random DFA's can be approximately learned from sparse uniform examples. In: Proc. of 5th ACM Workshop on Computational Learning Theory, pp. 45–52 (1992)
[Meinke 2004] Meinke, K.: Automated Black-Box Testing of Functional Correctness using Function Approximation. In: Rothermel, G. (ed.) Proc. ACM SIGSOFT Int. Symp. on Software Testing and Analysis, ISSTA 2004. Software Engineering Notes, vol. 29(4), pp. 143–153. ACM Press, New York (2004)
[Meinke and Sindhu 2010] Meinke, K., Sindhu, M.: On the Correctness and Performance of the IID Incremental Learning Algorithm for DFA. Technical report, School of Computer Science and Communication, Royal Institute of Technology, Stockholm (2010)
[Meinke, Tucker 1993] Meinke, K., Tucker, J.V.: Universal Algebra. In: Abramsky, S., Gabbay, D., Maibaum, T.S.E. (eds.) Handbook of Logic in Computer Science, vol. 1, pp. 189–411. Oxford University Press, Oxford (1993)
[Oncina and Garcia 1992] Oncina, J., Garcia, P.: Inferring regular languages in polynomial update time. In: Perez de la Blanca, N., Sanfeliu, A., Vidal, E. (eds.) Pattern Recognition and Image Analysis. Series in Machine Perception and Artificial Intelligence, vol. 1, pp. 49–61. World Scientific, Singapore (1992)
[Parekh and Honavar 2000] Parekh, R., Honavar, V.: Grammar inference, automata induction and language acquisition. In: Dale, Moisl, Somers (eds.) Handbook of Natural Language Processing. Marcel Dekker, New York (2000)
[Pao, Carr 1978] Pao, T.W., Carr, J.W.: A solution of the syntactic induction-inference problem for regular languages. Computer Languages 3, 53–64 (1978)
[Parekh et al. 1998] Parekh, R.G., Nichitiu, C., Honavar, V.G.: A polynomial time incremental algorithm for regular grammar inference. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, p. 37. Springer, Heidelberg (1998)
[Peled et al. 1999] Peled, D., Vardi, M.Y., Yannakakis, M.: Black-box Checking. In: Wu, J., et al. (eds.) Formal Methods for Protocol Engineering and Distributed Systems, FORTE/PSTV, Beijing, pp. 225–240. Kluwer, Dordrecht (1999)
[Porat, Feldman 1991] Porat, S., Feldman, J.: Learning automata from ordered examples. Machine Learning 7, 109–138 (1991)
[Raffelt et al. 2008] Raffelt, H., Steffen, B., Margaria, T.: Dynamic Testing Via Automata Learning. In: Yorav, K. (ed.) HVC 2007. LNCS, vol. 4899, pp. 136–152. Springer, Heidelberg (2008)
Appendix 1: Case Study

The automaton to be CGE learned has:
State set = { 0, 1, 2, 3, 4 }, Input alphabet = { 0, 1 }, Output alphabet = { 0, 1 }, Starting state = 0

Transition/Output table (entry = next state / output):

  input \ state |  0    1    2    3    4
  0             | 2/0  0/0  3/1  0/0  4/1
  1             | 0/0  0/0  1/0  4/1  3/1

——————–
1-st observation = (0) : [] 0
R = { (0) → (), (1) → () }
Lambda R = { (0) → 0 }
——————–
2-nd observation = (0, 0, 0) : [0, 1] 0 : Run CGE - inconsistent
R = { (1) → (), (0, 0) → (), (0, 1) → () }
Lambda R = { (0) → 0, (0, 0) → 1 }
——————–
3-rd observation = (0, 0, 0, 0) : [0, 1, 0] 0 : Run CGE - inconsistent
R = { (1) → (), (0, 1) → (), (0, 0, 0) → (), (0, 0, 1) → () }
Lambda R = { (0) → 0, (1) → 0, (0, 0) → 1, (0, 0, 0) → 0 }
——————–
4-th observation = (0, 1, 0, 0) : [0, 0, 0] 0 : Run CGE - inconsistent
R = { (1) → (), (0, 1) → (0, 0), (0, 0, 0) → (), (0, 0, 1) → () }
Lambda R = { (0) → 0, (1) → 0, (0, 0) → 1, (0, 1) → 0, (0, 0, 0) → 0 }
——————–
5-th observation = (0, 0, 1, 0) : [0, 1, 1] 1 : Run CGE - inconsistent
R = { (1) → (), (0, 1) → (0, 0), (0, 0, 0) → (), (0, 0, 1) → (0) }
Lambda R = { (0) → 0, (1) → 0, (0, 0) → 1, (0, 1) → 0, (0, 0, 0) → 0, (0, 0, 1) → 1 }
——————–
6-th observation = (0, 1, 1) : [0, 0] 0 : Run CGE - inconsistent
R = { (1) → (), (0, 0, 0) → (), (0, 0, 1) → (0), (0, 1, 0) → (), (0, 1, 1) → () }
Lambda R = { (0) → 0, (1) → 0, (0, 0) → 1, (0, 1) → 0, (0, 0, 0) → 0, (0, 0, 1) → 1, (0, 1, 0) → 0, (0, 1, 1) → 0 }
——————–
7-th observation = (0, 0, 1, 0, 0, 0) : [0, 1, 1, 1, 1] 1 : Run CGE - inconsistent
R = { (1) → (), (0, 0, 0) → (), (0, 1, 0) → (), (0, 1, 1) → (), (0, 0, 1, 0) → (0, 0, 1), (0, 0, 1, 1) → () }
Lambda R = { (0) → 0, (1) → 0, (0, 0) → 1, (0, 1) → 0, (0, 0, 0) → 0, (0, 0, 1) → 1, (0, 1, 0) → 0, (0, 1, 1) → 0, (0, 0, 1, 0) → 1 }
——————–
8-th observation = (0, 0, 1, 1, 0, 0) : [0, 1, 1, 1, 0] 0 : Run CGE - inconsistent
R = { (1) → (), (0, 0, 0) → (), (0, 1, 0) → (), (0, 1, 1) → (), (0, 0, 1, 0) → (0, 0, 1), (0, 0, 1, 1) → (0, 0) }
Lambda R = { (0) → 0, (1) → 0, (0, 0) → 1, (0, 1) → 0, (0, 0, 0) → 0, (0, 0, 1) → 1, (0, 1, 0) → 0, (0, 1, 1) → 0, (0, 0, 1, 0) → 1, (0, 0, 1, 1) → 1 }
—————————–
Learning complete, computation time = 0.15s, average time per observation = 0.01871s
Using Grammar Induction to Model Adaptive Behavior of Networks of Collaborative Agents

Wico Mulder and Pieter Adriaans

Department of Computer Science, University of Amsterdam, Science Park 107, 1098 XG Amsterdam, The Netherlands
[email protected], [email protected]
Abstract. We introduce a formal paradigm to study global adaptive behavior of organizations of collaborative agents with local learning capabilities. Our model is based on an extension of the classical language learning setting in which a teacher provides examples to a student that must guess a correct grammar. In our model the teacher is transformed into a workload dispatcher and the student is replaced by an organization of worker-agents. The jobs that the dispatcher creates consist of sequences of tasks that can be modeled as sentences of a language. The agents in the organization have language learning capabilities that can be used to learn local work-distribution strategies. In this context one can study the conditions under which the organization can adapt itself to structural pressure from an environment. We show that local learning capabilities contribute to global performance improvements. We have implemented our theoretical framework in a workbench that can be used to run simulations. We discuss some results of these simulations. We believe that this approach provides a viable framework to study processes of self-organization and optimization of collaborative agent networks.

Keywords: collaborative agents, learning, grammar induction, self-organization.

1 Introduction
The notion of an organization, as a network of collaborative agents, is almost as general as the idea of a system. In this paper¹ we study a formal model of learning organizations. We build on earlier work done in the domain of grammar induction, specifically the work of learning Deterministic Finite Automata (DFA) using the principle of Minimum Description Length (MDL), reported in
¹ Our work is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ) and a grant from the Casimir program from NWO.
[2] and [3]. We base our approach on the broadly accepted theory of learnability, the notion of learning by identification [5], which deals with linguistic structures and the learnability of these structures. We replace the classical teacher-student model with one in which a teacher/dispatcher presents a structured workload to an organization of students/agents with a certain learning capacity. Our work is motivated by research questions concerning the management of grid environments [11] and collaborative network organizations [10], but it also touches issues studied in ant colony behavior [12] and deep belief networks [7]. In hindsight, planning problems like the ones studied in [6] and [1] belong to the same domain, but there we used genetic algorithms to analyze the structure of the workload because at that time the techniques for learning DFA models from positive examples were not yet developed. Our work can also be seen as a more specific version of the problems studied in scheduling using local optimization [4], in the sense that we study variants with highly structured workloads.

The paradigm: Consider an organization of specialized worker-agents. Each agent can perform only one type of task and can work at only one task at a time. He has a limited overview of the rest of the organization. He can delegate work to colleagues in his immediate environment, but he does not know the whole organization. A job description consists of a sequence of typed tasks. Workloads that consist of jobs are submitted to the organization by an agent or a dispatcher in the environment outside the organization. This dispatcher generates workloads with a certain structure. An agent accepts a job when the first task of this job matches the type of work he is specialized in. After acceptance, and after other pending work is finished, the agent executes the task and sends the rest of the job to one of his colleagues. The agents that were involved with the execution of a particular job report back to each other when the job has been processed successfully, indicating which individual agents executed the tasks. The agent keeps track of this information and uses these data to learn which types of jobs are to be routed to which of his direct colleagues. In the absence of (sufficient) data the agent will dispatch the jobs to his close colleagues at random, but as soon as there is enough data to learn a model of the successful tracks of jobs through the organization the agent will use this model to route the work. In this sense one can say that the organization adapts its global behavior on the basis of local learning capabilities. We are interested in this kind of global adaptation as a result of local learning.
2 Formal Definitions
Definition 1. A job < ID, AID, [< t1, d1 >, ..., < ti, di >] > consists of a job index together with a finite sequence of tasks. Here ID ∈ N is a job index, and AID is the index of the agent that sent the job. A task is a tuple < ti, di >, where ti ∈ T is a type from a finite set of types T and di ∈ R is a number indicating a duration. We will consider tasks with duration 1 only; consequently the job notation can be simplified to < ID, AID, [t1, ..., ti] >. A workload consists of a finite sequence of jobs.
Definition 2. A collaborative learning agent is a tuple < AID, t, WL, H, M, L, A, S, F > where:
– AID ∈ N is an index.
– t ∈ T is the type.
– WL is the work-list, a list of (partial) jobs to be executed by the agent. The first task in each job must be of type t.
– H is the history, which consists of a list of reduced jobs and a list of processed job paths. The list of reduced jobs contains jobs that are waiting or have been sent through by the agent after finishing a task. Jobs in the history can have three statuses: waiting (if the job still needs to be accepted by another agent), sent (if the job has been accepted by another agent) or finished. If the total job is finished the (partial) processed path is stored in the history. A processed path of a job that has been processed successfully by the organization has the form < ID, AID, [< t1, AID1, d1 >, ..., < ti, AIDi, di >] >, where AID1, ..., AIDi are the indexes of the agents that have actually executed the tasks.
– M is a learned model. The framework allows various types of models to be used. In our work we study the use of DFA models.
– A is a learning algorithm. The learning algorithm takes a list of processing paths P as input and produces a model M.
– L is a learning strategy that defines how and when A is invoked. One could consider batch learning, continuous learning, interval learning, learning with depreciation, etc.
– S is a job acceptation and distribution strategy. This regulates how the learned model M is used to dispatch jobs to other agents.
– F is a status of the agent, which can be either free or busy, depending on whether the agent is working on a job or not.
Definition 3. A collaborative learning agent is resource-bounded if one or more of its resources are limited. Limits to be considered could be: the size of the work-list, the size of the history, the size of the list of processed paths, the size of the model and the amount of processing time allowed to learn the model. An interesting boundary case is the situation in which the size of the work-list is 1, i.e. every agent can only handle one job at a time.
Definition 4. An organization of collaborative agents is a network (or digraph) of agents O = < Γ, r >, where Γ is the set of collaborative learning agents and r ⊆ Γ × Γ is a directed cooperation relation. A direct controlled neighborhood of an agent i in an organization O is {x | < i, x > ∈ r}, i.e. the set of all agents that have a direct relation starting in i. A direct supervised neighborhood of an agent i in an organization O is {x | < x, i > ∈ r}, i.e. the set of all agents that have a direct relation ending in i. Agent j can be reached from agent i iff there is a path from i to j.
Definition 5. A teacher or workload dispatcher is an agent outside the organization that generates and submits workloads to the organization according to
a certain submission strategy. This can be either a batch or a continuous stream of jobs.
Based on these definitions the learning process takes the following form:
1. Given are a workload dispatcher W and a finite number of agents Γ of various types T.
2. The workload dispatcher W and the agents agree on a class of workload descriptions from which W may select one to generate workloads. This step is analogous to the selection of a class of languages to be learned in the Gold model ([5]).
3. The agents Γ select an initial organization form, i.e. they select r.
4. The workload dispatcher starts to generate and submit jobs.
The whole process is discrete and regulated by a central timer. At each time-step a two-phase process takes place (a simplified sketch of such a time-step is given below, after the research questions):
1. Communication: The teacher submits a job < ID, dispatcher, [< t1, d1 >, ..., < ti, di >] > to an agent of the organization. The agents submit, using their distribution strategy S, reduced jobs < ID, AID, [< tk, dk >, ..., < ti, di >] > to their colleagues, where AID is the index of the dispatching agent. The agents accept a job from the dispatcher or from one of their colleagues and put it on their work-list. If a reduced job cannot be submitted it is kept in the history with waiting status. As soon as it is accepted the job gets the status sent. When an agent AIDn finishes the last task of a job < ID, AIDm, [< ti, di >] > he sends a message < ID, [< ti, AIDn, di >] > to his supervising agent AIDm. This agent AIDm updates his history and sends the enriched description < ID, [< tl, AIDm, dl >, < ti, AIDn, di >] > to his supervising agent AIDl, etc.
2. Execution: The agents perform a task of a job and put the reduced job in the history with the waiting status. Each agent selects a new job from his local work-list. If there is no new job the status of the agent is free. If there is a task being carried out the status of the agent is busy.
Apart from this two-stage process agents can independently start learning sessions to update their model. This can simply be performed in a step after the communication or execution step. One can study research questions of the following form in this setting:
– Does the organization accept a job with a certain structure at a certain moment in time? We say that an organization accepts a job when it is capable of processing this job, i.e. the job travels through the organization and ends in a situation in which each task of the job has been handled by an agent and finished.
– Is the organization adequate for a certain class of workloads, i.e. will all possible sequences of tasks be accepted?
– Is the structure of the organization optimal for a certain class of workloads, i.e. will all possible tasks be accepted in the shortest possible time?
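As a simplified illustration of Definitions 1–5 and the two-phase time-step, the following Python fragment models jobs as lists of task types and agents that execute one task per step and forward the reduced job to a random neighbour of the right type. It is a sketch, not the authors' WICO implementation: the learning components M, A, L and the strategy S are deliberately left out, and forwarding is folded into the execution phase for brevity.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Job:
    jid: int
    sender: int           # AID of the dispatching agent (-1 for the external dispatcher)
    tasks: list           # remaining task types, each of duration 1

@dataclass
class Agent:
    aid: int
    typ: str
    neighbours: list = field(default_factory=list)   # direct controlled neighbourhood
    worklist: list = field(default_factory=list)
    history: list = field(default_factory=list)      # finished (jid, task, aid) records

def step(agents, incoming):
    """One discrete time-step: communication phase, then execution phase."""
    # Communication: accept new jobs whose first task matches the agent's type.
    for job in incoming:
        candidates = [a for a in agents if a.typ == job.tasks[0]]
        if candidates:
            random.choice(candidates).worklist.append(job)
    # Execution: every agent with pending work performs exactly one task.
    active = [(a, a.worklist.pop(0)) for a in agents if a.worklist]
    stuck = []
    for agent, job in active:
        task = job.tasks.pop(0)
        agent.history.append((job.jid, task, agent.aid))
        if job.tasks:                                  # forward the reduced job
            nxt = [n for n in agent.neighbours if n.typ == job.tasks[0]]
            if nxt:
                random.choice(nxt).worklist.append(job)
            else:
                stuck.append(job)                      # no suitable neighbour
    return stuck

# Minimal usage: a two-agent chain handling the job [X, Y].
if __name__ == "__main__":
    a = Agent(0, "X"); b = Agent(1, "Y"); a.neighbours = [b]
    stuck = step([a, b], [Job(1, -1, ["X", "Y"])])
    stuck = step([a, b], stuck)
    print(a.history, b.history)   # [(1, 'X', 0)] [(1, 'Y', 1)]
```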
The setting that we will study in this paper is the one in which the structural descriptions of the jobs match a regular language. Here the teacher selects a DFA to generate a workload. The agents use a learning algorithm based on MDL to learn DFA models on the basis of positive examples. The intuition is that an optimal organization for such a workload would be a model that is isomorphic to a parallel nondeterministic automaton (NFA) equivalent to the original DFA selected by the workload dispatcher. In order to analyze this we need a result from language learning theory. Let a theoretically optimal compression algorithm be an algorithm that always finds the optimal compression of a data-set in terms of its Kolmogorov complexity. We know that such an algorithm does not exist, but also that it can be approximated in the limit ([9]). We also know that an MDL algorithm using such a compression algorithm is optimal, in the sense that it always finds the best (or 'a' best, if there are more) theory in terms of randomness deficiency ([3]). Let's call such an MDL algorithm optimal. Such an optimal MDL algorithm does not exist, but it can be approximated in the limit. This insight allows us to use the notion of an optimal DFA-learner in some of the proofs below. The results represent limit cases that can be approximated empirically using practical implementations of MDL. Of course the observation remains that it might in practice be impossible to implement an effective coding scheme for the model and the data. We can now turn our attention to organizational learning issues. We distinguish two types of learning:
Definition 6. Given a set of processing paths of the form < ID, AID, [< t1, AID1, d1 >, ..., < ti, AIDi, di >] > one can make two sets of sentences. Sentences in the first set have the form [t1, ..., ti]. Learning a DFA structure of this set amounts to learning the workload language. We call this environmental learning. Sentences in the second set have the form [AID1, ..., AIDi]. Learning a DFA structure of this second set amounts to learning the structure of the organization given the workload language. We call this organizational learning. Efficient adaptive behavior depends on an intertwining of these two forms of learning.
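To illustrate Definition 6, the sketch below derives the two training sets from a list of processed paths, here assumed to be sequences of (task type, agent index, duration) triples; a DFA/MDL learner (not shown) would then be run on each set separately.

```python
def environmental_sample(paths):
    """Sentences over task types: input for learning the workload language."""
    return [[t for (t, aid, d) in path] for path in paths]

def organizational_sample(paths):
    """Sentences over agent indices: input for learning the organization model."""
    return [[aid for (t, aid, d) in path] for path in paths]

# Example: two successfully processed jobs.
paths = [
    [("A", 3, 1), ("B", 7, 1), ("C", 2, 1)],
    [("A", 3, 1), ("C", 5, 1)],
]
print(environmental_sample(paths))   # [['A', 'B', 'C'], ['A', 'C']]
print(organizational_sample(paths))  # [[3, 7, 2], [3, 5]]
```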
3 Some Theoretical Results and an Open Problem
In this section we present some theoretical results. It is useful to consider some boundary cases. Definition 7. A minimal unbounded clique is an organization in which there is exactly one agent of each type with unbounded resources and in which each agent is connected to every other agent (including the reflexive connection). A clique is the organizational counterpart of a universal automaton that accepts any language. The corresponding theorem is:
Theorem 1. A minimal unbounded clique is adequate for any finite workload.
Proof: Each agent can locally maintain a work-list of any length. Therefore the dispatcher can simply dispatch the whole workload to the relevant agents at once. After performing a task of a job the agent always has a neighbor of the right type to which to dispatch the rest of the job. Such an agent will always accept the task since there are no bounds on the work-list. Therefore, at any moment in time, as long as there are jobs in the system, at least one agent will perform at least one task. The total amount of work is reduced with each time step. Since the workload is finite the organization will finish all the work in a finite amount of time.
The theorem also holds for cliques that are not minimal. That the unboundedness is essential is clear from the following result:
Lemma 1. A resource-bounded clique cannot accept every workload.
Proof: Suppose we have a workload of size l containing jobs consisting of similar tasks, < ID, AID, [t1, ..., ti] > with i > 1 and t1 = ... = ti = t. These jobs are accepted by an agent of type t having a bounded memory of size k. For each job, this agent forwards a reduced job to itself. Now, if l > k, then after a finite number of steps the memory of this agent gets fully occupied and both the dispatcher as well as the agent will keep on waiting. Since each agent acts as a dispatcher of reduced jobs, this can happen for any number of agents of type t in the clique.
Such a situation can be explained using the notion of gridlock, commonly used to describe congestion due to traffic that blocks itself. An adaptive organization needs to find a balance between two forces: 1) the structure of the workloads and 2) its internal structure. Separating these two issues is not always possible or necessary on the basis of local learning capabilities. For example, suppose that a teacher/dispatcher is not a good informer, in the sense of Gold, for the workload language, i.e. there are parts of the language that are never produced. In that case there might be parts of the organization that are never used, but an agent with only local knowledge of the organization might never know this. We therefore introduce the notion of a universal dispatcher:
Definition 8. Given a type set T, a universal dispatcher U is one which creates workloads on the basis of the universal language of T, i.e. any finite subset of T∗ can be a valid workload. We demand that the universal dispatcher also is a text for this language, i.e. every string in T∗ will be produced by U in the limit. A universal dispatcher randomly chooses for each job an agent of the right type, i.e. an agent that can perform the first task.
One could view a universal dispatcher as an environment that creates maximally noisy messages. Such noise gives a possibility for the local agents to explore the organization. The following situation illustrates this. Let us define the notion of a mixed-clique organization. This will be an organization that consists of two or more cliques that are mixed over the individual agents:
Definition 9. O1,2 = O1 ∪ O2 is a two-clique organization iff the following conditions hold: we have type sets T1 and T2 such that T1 ∩ T2 ≠ ∅ and (T1 − T2) ∪
Fig. 1. an example of a two-clique organization
(T2 − T1) ≠ ∅, i.e. they overlap but are mutually different. The two organizations O1 = < Γ1, r1 > and O2 = < Γ2, r2 > are such that O1 contains a finite non-empty set of agents for each type t ∈ T1 and O2 contains a finite non-empty set of agents for each type t ∈ T2. Moreover O1 and O2 are non-minimal cliques that overlap in the sense that there are agents that belong to O1 as well as O2, but for some types t ∈ T1 ∩ T2 there are agents < a, t, WL, H, M, L, A, S, F > ∈ O1 with < a, t, WL, H, M, L, A, S, F > ∉ O2, i.e. the organizations share types but not all agents of a certain type belong to both organizations.
The problem for agents in a two-clique organization is that they do not know to which part of the organization they themselves or their direct controlled neighborhood belong. A fundamental question is whether the agents can still learn optimal routing in such a confused setting. We can prove the following lemma:
Lemma 2. Given a universal dispatcher for T1 ∪ T2 and a corresponding two-clique organization O1,2 with agents that use an optimal DFA induction strategy, the agents will in the limit create a maximally adequate organization, in the sense that any workload that can be processed by the organization will be processed.
Proof: (Sketch) Note that the two cliques in O1,2 only cooperate with each other over the agents that they share. The corresponding workload-language is one in which arbitrary fragments of T1∗ can, via shared types, be mixed with arbitrary fragments of T2∗. The universal dispatcher will create four types of jobs: 1) jobs that only contain tasks of types in T1, 2) jobs that only contain tasks of types in T2, 3) jobs that contain tasks that belong to T1 ∪ (T2 − T1) or 4) to T2 ∪ (T1 − T2). The first two types of jobs can be processed by the organization, the others not necessarily, since they contain tasks that can only be performed in different parts of the organization that do not necessarily have direct
communication. The universal dispatcher will distribute the jobs randomly over the appropriate agents. These agents will perform their task and then select an appropriate agent for the reduced task in their direct controlled environment. The job-descriptions will remain in the histories of the individual agents. In the limit these histories will contain sublists of successful jobs and jobs that apparently never were processed by the rest of the organization. Note that the histories of the successful jobs are tagged with the IDs of the individual agents. Now, by performing two learning algorithms, an agent can learn two models. By performing an MDL learning algorithm on the sequences of types that correspond to successful jobs he can learn which parts of the organization accept which type of jobs; this is environmental learning. This model will be a DFA over the alphabet of types. By performing a DFA learning algorithm on the sequences of IDs of agents he can learn a model of the organization; this is organizational learning. This will be a DFA over the alphabet of agent IDs. The optimality of the DFA induction guarantees that in the limit these models will be correct. This ensures that any local agent in the organization will only dispatch jobs to agents of which he is certain that they can handle them.
This lemma can be generalized to the following general theorem:
Theorem 2. Given a universal dispatcher for T and a corresponding organization O of any structure with agents that use an optimal DFA induction strategy, the agents will in the limit create a maximally adequate organization (given its structure), in the sense that any workload that can be processed by the organization will be processed.
Proof: (Sketch) A job that has to be handled by an organization O can end up in three ways: 1) it gets stuck when there is no connection to a colleague agent that can handle the next task, 2) it gets stuck for similar reasons, but due to an agent that made a wrong routing decision, 3) it gets accepted. Every agent in O maintains a history. On the basis of this history an agent can in the limit learn which agents in his direct controlled environment can process which types of sentences. He can use this information to route workloads. If the learned models are adequate, then no routing decision of any agent diminishes the processing capacity of the total organization and no jobs get unnecessarily stuck. While learning, the number of jobs that get stuck due to a wrong routing decision diminishes, i.e. the organization is maximally adequate. Note that if there are multiple entry agents with different levels of adequacy, the successful processing of a job might depend on the first agent that is selected by the workload dispatcher, but this is obviously also a problem of the original organization, so this does not depreciate the value of this proof. If the original organization could process the job starting from a certain agent, then so can the optimized organization.
Unfortunately nice general proofs such as the one presented above are not available for organizational learning. Even if the agents have optimal DFA learning algorithms, issues of local versus global organization come into play. It might be the case that a local optimization of one agent prevents other agents from
performing more efficient optimization. Since there is a timing issue, local adaptations might oscillate or sweep through the organization in a chaotic way. We conclude this section with the formulation of an open problem:
Definition 10. Organizational Learning Problem: Given a work dispatcher that uses a regular job language, can an adequately rich clique of agents with optimal learners always converge to an optimal organization?
If this problem is unsolvable in general, we would be interested in the particular constraints that make it viable. We will leave this for future research.
4 Experiments

4.1 Simulator
We developed a software workbench to run simulations of learning agent organizations. The workbench, called "Workbench for Intelligent Collaborative Organizations" (WICO), can be used to create various workloads, organizations and experimental setups. Components of the workbench are:
– a DFA editor that can be used to build a pre-defined probabilistic DFA structure for possible workloads
– a workload generator that generates jobs from a given probabilistic DFA
– a work dispatcher that can be configured to send jobs to an organization of agents
– an organization factory for creating agent organizations with different topologies
– a DFA learning algorithm based on MDL which is used by the agents to learn DFA models
– a number of visual components to visualize experiments, organizations, workloads and the DFA models
– an experiment controller unit that can be used to define experiments and capture results
4.2 Learning Capacity
In a first series of experiments we used the workbench to study the environmental learning capacity of an unbounded network. The workload was generated using a probabilistic DFA as shown in figure 4. Figure 3 shows an example of a minimal unbounded clique aimed at learning the task structure of the workload. Using this DFA a workload was generated containing strings such as BDEFE, ACEFE, BDEFE, BDEFEFEFEF, BDEFEF, ACEFEFE, BDEFEFE, BDEFEF, BDEFEFE, ACEFEF, ... . Figure 5 shows the DFA of agent A as it has learned the successful jobs in which it was involved. This grammar also reflects the organization 'below' him, i.e. those agents that were involved with the reduced jobs.
Fig. 2. screen shot of the workbench environment
Fig. 3. a minimal-clique
Fig. 4. DFA used for workload generation
Fig. 5. DFA of agent A0
We studied the organizational learning capabilities by using a uniform dispatcher sending jobs through various types of organizations. Figure 6 shows a typical result of 1000 dispatched jobs with a task length between 5 and 10. We found that only a fraction of these random jobs got successfully processed, as most of them got stuck in the network because of the lack of a possible connection. The figure shows the DFA models of three agents. Each agent maintains two models: one only containing the task labels and one containing the task labels together with the index of the agent that processed that task. Figure 7 shows another example.

4.3 Network Performance
In a second series of experiments, we looked at the network performance by measuring the proportion of jobs that are successfully handled. The local DFA model is used to determine whether an agent is able to handle the rest of the job. We used a network that consists of two cliques, symbolizing an organization of two departments. One department contains agents that can handle tasks of type A, B, C, D and another department that is specialized in the processing of tasks E and F . Figure 8 shows the organization that was used in this experiment.
Fig. 6. organizational learning in a simple two-clique organization
Fig. 7. organizational learning in a three-clique organization
Fig. 8. Graph of a typical organization used in our performance experiments
Fig. 9. Results of the experiment on network performance while learning
We looked at the influence of using the locally learned DFA models on the global task-handling performance while the network processed a series of 100 workloads. To be able to measure the network performance while the network gradually learns, we used a step-by-step approach: we processed a workload of 500 jobs by the network while the agents were instructed not to update their models, after which we processed a small workload of 10 jobs, letting the agents update their model. This was repeated 100 times. The result can be seen in figure 9 (upper curve). One can see that as the network learns, the number of successfully processed jobs increases gradually from roughly 185 to 425. Gradually more C
and D agents updated their model on successful jobs. For the A and B agents, the C and D agents that updated their model become more attractive to forward a job to. Other C and D agents become less attractive to get a forwarded job. In case there are many agents of one type in an organization, each agent uses its DFA model to decide which agent gets forwarded a job. The reason for not being able to handle all 500 jobs is that the agents used a strategy based on a probability distribution: the chance of selecting an agent that can verify that it can handle the rest of a particular forwarded job is high, but a small chance of an alternative is still left open. We know that such a strategy can be improved. We looked at the DFA and MDL complexity scores of the individual agent models during the learning process. For the calculation of these scores we refer to earlier work in [11]. The score for the network is calculated as the sum of the DFA and MDL complexity of the individual agent models. Figure 9 (lower two curves) shows the DFA and MDL scores. Both curves show that the model-code (DFA score) as well as the model-code including the data (MDL score) gradually evolve until all agents have learned an (almost) complete model for these kinds of tasks. The curve of the DFA complexity is expected to behave asymptotically as the structure of the models will converge; the MDL score keeps slowly increasing as long as there are new unique series of jobs sent by the workload dispatcher.
5 Conclusion
An organization can learn grammatical models of both workload structures and their own organization while handling sequences of tasks. Using the models, the agents can make early statements about the acceptance of tasks and make decisions on forwarding jobs. We showed that locally learning agents, when using their grammar models in their decision to forward jobs, contribute to the improvement of the global network performance on processing jobs. We believe that our framework is useful for the analysis of problems in the optimization of agent organizations in general. This holds in particular for the networks we studied in earlier work: grid infrastructures and collaborative business organizations. The distinction between organizational and environmental learning, and the insight that the strategic impact of both forms of learning is very different, is important. Every manager working in a complex organization has the experience that extreme efficiency is at times counterproductive. Sometimes it is useful to blow a bit of random noise through the organization to discover where the real bottlenecks are. The theoretical results in the paper (Theorem 2) seem to corroborate this insight.
6 Future Work
The theoretical framework and workbench can be used to investigate issues in related fields of research. The notion of agents that make local decisions to handle
and dispatch tasks poses new research questions in the field of routing and load balancing. Developing agent strategies for optimal path learning allows one to investigate how local modeling can lead to robust behavior and global performance optimization. For continuous streams of work one can have the following additional criteria: is the organization insufficient, in equilibrium or redundant. The notion of dynamic agent topologies, i.e. the creation and deletion of agents and connections on the fly, allows for research on the handling of continuous streams. We also want to study the influence of perturbations on the network of agents, which amounts to research on the stability and reliability of grid infrastructures. We want to investigate under which conditions an organization in unstable environments can still learn to handle task structures and optimize its behavior. A cloud infrastructure, or cloud for short, is a scalable and configurable network of ICT resources that are implemented and exploited as services that can be accessed via the internet. The fact that different services might be owned by different organizations imposes many challenges on the management of clouds. In previous work [11] we discussed support for performance management in Grids. We foresee similar problems in the field of cloud computing and think that our approach of self-learning agents can contribute there as well.
References
[1] Adriaans, P.W.: Predicting pilot bid behavior with genetic algorithms (Abstract). In: Anzai, Y., et al. (eds.) Symbiosis of Human and Artifact, Proceedings of the Sixth International Conference on Human-Computer Interaction, HCI International 1995, Tokyo, Japan, pp. 1109–1113 (1995)
[2] Adriaans, P.W., Jacobs, C.J.H.: Using MDL for grammar induction. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 293–306. Springer, Heidelberg (2006)
[3] Adriaans, P.W., Vitányi, P.: Approximation of the Two-Part MDL Code. IEEE Transactions on Information Theory 55(1), 444–457 (2009)
[4] Anderson, E.J., Glass, C.A., Potts, C.N.: Machine scheduling. In: Aarts, E., Lenstra, J.K. (eds.) Local Search in Combinatorial Optimization, pp. 361–414. John Wiley & Sons, Inc., New York (1997)
[5] Gold, E.M.: Language Identification in the Limit. Information and Control 10(5), 447–474 (1967)
[6] den Heijer, E., Adriaans, P.W.: The Application of Genetic Algorithms in a Career Planning Environment: CAPTAINS. International Journal of Human-Computer Interaction 8, 343–360 (1996)
[7] Hinton, G.E., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets. Neural Computation 18, 1527–1554 (2006)
[8] Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation, 2nd edn. Addison-Wesley, Reading (2001)
[9] Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications, 3rd edn. Springer, Heidelberg (2008)
[10] Mulder, W., Meijer, G.R.: Distributed information services supporting collaborative network management. In: IFIP International Federation for Information Processing, Proceedings PRO-VE 2006, Network-Centric Collaboration and Supporting Frameworks, vol. 224, pp. 491–498. Springer, Heidelberg (2006) ISBN 0-387-38266-6
[11] Mulder, W., Jacobs, C.J.H.: Grid Management Support by Means of Collaborative Learning Agents. In: Grids Meet Autonomic Computing, Workshop at the 6th IEEE International Conference on Autonomic Computing (ICAC), Barcelona, pp. 43–50. ACM, New York (2009) ISBN 978-1-60558-564-2
[12] Sim, K.M., Sun, W.H.: Ant Colony Optimization for Routing and Load-Balancing: Survey and New Directions. IEEE Transactions on Systems, Man and Cybernetics 33(5), 560–572 (2003)
Transducer Inference by Assembling Specific Languages

Piedachu Peris and Damián López

Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46071 Valencia, Spain
{pperis,dlopez}@dsic.upv.es
Abstract. Grammatical Inference has recently been applied successfully to bioinformatic tasks as protein domain prediction. In this work we present a new approach to infer regular languages. Although used in a biological task, our results may be useful not only in bioinformatics, but also in many applied tasks. To test the algorithm we consider the transmembrane domain prediction task. A preprocessing of the training sequences set allows us to use this heuristic to obtain a transducer. The transducer obtained is then used to label problem sequences. The experimentation carried out shows that this approach is suitable for the task. Keywords: Inference of regular languages, bioinformatics, protein motif location.
1 Introduction
Formal Language Theory and Grammatical Inference (GI) are playing an important role in the development of new methods to process biological data [1,2]. Many works propose GI techniques to tackle bioinformatic tasks such as: secondary structure identification [3], protein motif detection [4,5], optimal consensus sequence discovery [6,7], gene prediction [8] or multiple sequence alignment [9]. The selection of proteins with certain characteristics from amino acid sequences is a central goal of computational biology. One aspect of this problem is to detect certain sub-sequences, known as domains or motifs, with some interesting functional features. Among the membrane-related proteins, some lie integrally in the membrane and others span the hydrophobic core of the membrane. Note that this fact classifies the segments of the protein sequence into transmembrane, inside and outside regions. In this work we deal with transmembrane protein sequences which, from now on, we will refer to as membrane proteins. Figure 1 shows a schematic representation of these proteins. From a biological point of view, membrane proteins play an important role in a variety of important biological functions [10,11], mainly as receptors or
Work supported by the Spanish CICYT under contract TIN2007-60769.
Fig. 1. Schematic representation of two transmembrane proteins (one single-spanning and one multi-spanning). The grey bars represent the (two) lipidic layers of the membrane.
transporters. In order to identify relevant features of a given membrane protein, as well as its role in the cell [12], the number of transmembrane segments of a protein and some characteristics such as loop lengths are to be taken into account. Thus, an important and interesting task is to predict the location of transmembrane domains along the sequence, since these are the basic structural building blocks defining the protein topology. Several works have dealt with this prediction task from different approaches, mainly using Hidden Markov Models (HMM) [13,14,15], neural networks [16,17] or statistical analysis [18]. A rich literature is available on protein prediction. For reviews on different methods for predicting transmembrane domains in proteins, we refer the reader to [19,20,21]. In our work, we use a grammatical approach to locate transmembrane motifs within protein sequences. We extend previous work by Peris et al. that proposes approaches to predict the coiled-coil and the transmembrane domains [4,5]. Briefly, the approach proposed in this work is based on the assumption that all the transmembrane, inside and outside regions share common features suitable to be modelled grammatically. Thus, we propose a method that infers automata for each kind of region and another automaton to model the sequence of regions that appear in the training set. A substitution allows us to obtain the final transducer. Several works have proposed operations over languages by means of automata transformations. For instance, [22] used a specialization strategy from positive examples that consists in creating recursive automata by replacing transitions by states of the same automaton. The results of our experimentation are compared with other existing approaches and show that this approach improves on the performance of previous work. Our work is organized as follows: Section 2 summarizes some definitions and the notation used; Section 3 explains our approach to the problem; Section 4 shows the experimental results and the indexes used to compare our results with previous ones. Finally, some conclusions and future lines of research end the paper.
2 Notation and Definitions
Let Σ be an alphabet and Σ ∗ the set of words over the alphabet. For any word x ∈ Σ ∗ let xi denote the i-th symbol of the sequence and let |x| denote the
length of the word. Let also λ denote the empty word. A language over Σ is any set L ⊆ Σ∗. A finite automaton is defined as a tuple A = (Q, Σ, δ, I, F), where Q is a finite set of states, Σ is an alphabet, I, F ⊆ Q are the sets of initial and final states and δ : Q × Σ → P(Q) is the transition function. For the sake of clarity, we will consider the transition function as a subset of Q × Σ × Q. This function can also be extended in a natural way to consider words over an alphabet instead of symbols. The language accepted by the automaton is L(A) = {x ∈ Σ∗ : δ(q0, x) ∩ F ≠ ∅}. A finite automaton is deterministic (DFA) if the transition function is defined as Q × Σ → Q. A language L is k-testable in the strict sense (k-TSS) if there exist sets P, S ⊆ Σ^{k−1} and N ⊆ Σ^k such that L − {λ} = (PΣ∗ ∩ Σ∗S) − Σ∗NΣ∗. A finite state transducer is defined by a system τ = (Q, Σ, Δ, q0, QF, E) where: Q is a set of states and QF ⊆ Q is the set of final states; Σ and Δ are the input and output alphabets respectively; q0 is the initial state, and E ⊆ (Q × Σ × Δ∗ × Q) is the set of transitions of the transducer. Given an input word x = a1a2 . . . an, a successful path in a transducer is a sequence of transitions (q0, a1, o1, q1), (q1, a2, o2, q2), . . . , (qn−1, an, on, qn) where qn ∈ QF, and ai ∈ Σ∗, oi ∈ Δ∗ and qi ∈ Q for 1 ≤ i ≤ n. Note that a path can be denoted as (q0, a1a2 . . . an, o1o2 . . . on, qn) whenever the sequence of states is not of particular concern. A transduction is defined as a function t : Σ∗ → Δ∗ where t(x) = y if and only if there exists a successful path (q0, x, y, qn). We refer the interested reader to [23].
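As a point of reference for the k-TSS definition above, the following is a minimal sketch of the textbook construction of an automaton for the smallest 2-TSS (local) language containing a positive sample; it is not necessarily the exact inference algorithm used later in the paper.

```python
def infer_local(sample):
    """Build a DFA (as dicts) for the smallest local (2-TSS) language
    containing the given non-empty words: allowed initial symbols, allowed
    final symbols and allowed two-symbol factors are read off the sample."""
    initials = {w[0] for w in sample}
    finals = {w[-1] for w in sample}
    bigrams = {(w[i], w[i + 1]) for w in sample for i in range(len(w) - 1)}
    # States: an initial state None plus one state per observed symbol.
    delta = {(None, a): a for a in initials}
    delta.update({(a, b): b for (a, b) in bigrams})
    return delta, finals

def accepts(delta, finals, word):
    state = None
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals

if __name__ == "__main__":
    delta, finals = infer_local(["abb", "ab", "abab"])
    print(accepts(delta, finals, "ababab"))   # True: the loop generalises
    print(accepts(delta, finals, "ba"))       # False: 'b' never starts a word
```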
3 Combination of Specific Languages
Usually, the class of 2-TSS languages is referred to as the class of local languages. This is a very important subclass of the regular languages. A strong result that relates both classes states that a language L ⊆ Σ∗ is regular if and only if there exist a finite alphabet Σ′, a local language K ⊆ Σ′∗ and a morphism h that maps Σ′∗ into Σ∗ such that L = h(K). In [24] the authors take this result into account to propose a methodology that allows, whenever relevant expert information is available, regular languages to be inferred from positive data. This methodology (MGGI) has been widely and successfully applied, mainly to language and dialog processing (for instance [25,26]). Note also that the inclusion of this expert knowledge makes it possible to address practical tasks under a GI approach using only positive presentation. This is important because negative information is usually hard to define in applied tasks. In a similar way to MGGI, we assume that, under a classical GI framework and for some practical tasks, it is possible to detect sub-sequences of the training set M that model/share a common feature f. This allows the sequences to be labelled and, therefore, this knowledge to be taken into account. In this work we isolate the sub-sequences with the same feature label in order to build more specific training subsets. Thus, we obtain, for each feature f considered, a set of samples Mf that allows us to infer a language Lf.
M = { b1 a1 a1 a1 b1 b1 b1 b1 a2 b2 c2 a2 a2 c2 b3 a3 a3 c3 a3 b3 a3 b4 a4 a4 b4 a4 b4 a4 c4 b4 a4,
      a4 b4 b4 c4 a4 b1 b1 c1 a1 c1 a3 c3 a3 a3 b3 b3 a3 b4 c4 a4 a4 b4 a2 a2 c2 b2 a2,
      c1 b1 a1 b1 b1 b1 a1 a1 a1 c2 c2 b2 a2 b2 a2 a2 a2 b3 a3 c3 b2 a2 a2 b2 b4 a4 a4 c4 a4 c4,
      a1 b1 b1 b1 b1 c2 b2 c2 a2 a2 a2 b3 a3 b3 a3 a4 a4 a4 b4 a4 c2 a2 a2 b2 a2,
      a4 b4 b4 a4 a4 a2 a2 a2 b2 a2 a2 a3 b3 b3 a3 a3 b3 b3 a3 b2 a2 a2 b2 b2 a2 b2 a2 a2 a2 }

S(M) = { 1234, 41342, 12324, 12342, 4232 }

M1 = E(M, 1) = { baaabbbb, bbcac, cbabbbaaa, abbbb }

M2 = E(M, 2) = { abcaac, aacba, ccbabaaa, baab, cbcaaa, caaba, aaabaa, baabbabaaa }

M3 = E(M, 3) = { baacaba, acaabba, bac, baba, abbaabba }

M4 = E(M, 4) = { baababacba, abbca, bcaab, baacac, aaaba, abbaa }
Fig. 2. Scheme of the preprocessing. The different features considered are denoted by sub-indexes in the original training set M . Note that the derived training sets E(M, i) contain only those sub-sequences with the corresponding feature.
The original training set M is modified by substituting the extracted subsequences by a unique feature identifier. In a more formal way, let Σ denote the strings alphabet and Δ = {i1 , i2 , . . . , ik } the feature identifiers alphabet. Let us denote the labelled alphabet by ΣΔ . Thus, ∗ denotes the set of all possible strings over Σ whose symbols are labelled with ΣΔ symbols in Δ. Let the homomorphism h : ΣΔ → Σ be defined as h(aI ) = a, for each aI ∈ ΣΔ . That is, the homomorphism that erases the labelling of the symbols. Let us define the simplification function as follows: ∗ → Δ∗ S : ΣΔ ∗ S(u1 u2 . . . um ) = I1 I2 . . . Im where: ui ∈ Σ{I i}
(1)
Let us define the extraction function as follows: ∗ E : ΣΔ × Δ → P(Σ ∗ ) ∗ E(u1 u2 . . . uk , I) = {h(ui ) : ui ∈ Σ{I} }
(2)
We also extend these functions to operate on sets of strings, thus, for any set of labelled strings M and a label identifier I: S(w) E(M, I) = E(w, I) S(M ) = w∈M
w∈M
182
P. Peris and D. L´ opez
Figure 2 shows schematically an example of this preprocessing. Please, note that it may be possible that the considered features do not cover completely the training sequences in M . In that case it is possible to consider an extra identifier to label those sub-sequences. Note also that, it is possible to use whichever combination of inference algorithms with the feature training sets obtained. The main assumption we make is that the more specific the data, the better the automata inferred. Let us denote by LS(M) (AS(M) ) the language (resp. automaton) inferred from the set S(M ), as well as LMi (AMi ) will denote the language (resp. automaton) inferred considering the strings in Mi . In order to build the final transducer, a substitution over LS(M) is carried out. This substitution guarantees that, for each transition (p, fi , q) of AS(M) , it is possible to reach in the transducer the state q with x from the state p, where x ∈ LMi . An implementation of this language operation may consist of substituting each transition (p, fi , q) in AS(M) by the automaton that models the same feature (AMi ), adding λ transitions in order to connect state p with the initial state of AMi , as well as the final states of AMi with state q. All the transitions of AMi should also be modified to consider the same feature output symbol fi . Example 1 illustrates our approach. Example 1. Let us consider the following set of labelled strings: M = {a1 a1 b2 c2 c2 c3 c3 , b2 b2 c2 c3 c3 , b2 c2 c3 b2 c2 c3 } from the alphabet ΣΔ . The output of the function 1 when applied to each string in M is the set of sequence identifiers S(M ) = {123, 23, 2323}. In the same way, when the set M is considered, the function 2 returns the sets of sub-strings labelled with the same feature identifier, that is: M1 = {aa}, M2 = {bcc, bbc, bc} and M3 = {cc, c}. Then, an automaton is inferred for each set obtained by the preprocessing: AS(M) , AM1 , AM2 and AM3 . If it is considered an algorithm to infer local languages, the resulting automata are shown in Figure 3. Finally, a substitution over LS(M) is carried out. We note here that several approaches can be used to obtain a probabilistic transducer. The study of which one has the better experimental behaviour is not pursued in this paper. In this work, we infer probabilistic automata for the preprocessed sets, although probabilities are not shown in Figure 3. In order to maintain the distribution of probability as much as possible, we follow the procedure described above. Thus, the implementation of the substitution replaces each transition (p, l, q) with the corresponding automaton (i.e. Ml ) and connects the state p and the initial state of Ml , as well as the final state(s) of Ml and q, with λtransitions. The transitions of Ml are also modified to consider the transduction. The λ-transitions are removed using a traditional algorithm. The resulting (non-deterministic) transducer is neither determinized nor minimized. For this example, the final transducer is shown in Figure 4.
Transducer Inference by Assembling Specific Languages
183
a
A M1 : a AS(M ) : 1 2
c
b
A M2 :
2 3
b
c
2
c
A M3 : c
Fig. 3. Automata inferred for the sets obtained by the preprocessing of M a/1
b/2
c/2
AM : a/1
b/2
b/2
c/2
c/2
c/3 b/2
c/2
c/3
b/2
c/2 b/2
c/3
c/3
c/2
Fig. 4. Transducer obtained by our approach when it is considered the data set M = {a1 a1 b2 c2 c2 c3 c3 , b2 b2 c2 c3 c3 , b2 c2 c3 b2 c2 c3 }
4 4.1
Experimental Results Dataset
We used a dataset composed of 160 membrane proteins in order to evaluate the performance of our approach. Of the proteins included in the dataset, 108 are multi-spanning proteins and 52 are single-spanning. This dataset has
184
P. Peris and D. L´ opez
been introduced by [13], and will be referred to as the TMHMM set. Most of the topology data included in TMHMM have been determined with biochemical and genetic methods. Only the structure of a small number of membrane protein domains have been determined at an atomic resolution. On the one hand, this gives the set biological relevance, but on the other hand, these methods are considered not completely reliable, and may output contradictory topologies for the same protein sequence. We removed from the original dataset those protein sequences for which different (biochemical or genetic) methods output different topologies. This dataset is available for the community at: http://people.binf.ku.dk/krogh/TMHMM/. 4.2
Performance Measures
In literature, different measures have been proposed to evaluate sequence analysis methods, especially gene-finding methods. An exhaustive review of these measures can be found in [27]. The most used measures in functional domain location tasks are probably: recall or sensitivity (Sn); and precision or specificity (Sp). These measures can be computed as follows: Sn =
TP TP + FN
Sp =
TP TP + FP
where: True positives (TP): correctly localized amino acids into a TM domain. True Negatives (TN): correctly annotated amino acids out of a TM domain. False positives (FP): amino acids out of a TM domain annotated as belonging to a domain. False Negative (FN): amino acids into a TM domain not correctly localized (annotated as out of any domain). Sn and Sp, however, took by themselves, do not constitute an exhaustive measure. A measure that is more complete and summarizes both Sn and Sp is the Correlation Coefficient (CC) or Matthews Correlation Coefficient [28], which also presents some interesting statistical properties. It is computed as follows: (T P ·T N )−(F N ·F P ) (T P +F N )·(T N +F P )·(T P +F P )·(T N +F N )
CC = √
The main drawback of the CC measure is that it is not defined if any factor of the root is equal to zero. Some measures have been proposed to tackle this issue. We have selected the Approximate Correlation (AC) which is calculated as follows: P TP TN TN AC = ( 14 T PT+F N + T P +F P + T N +F P + T N +F N − 0.5) · 2 In the results, we omitted the samples for which it was not possible to calculate CC (independently of the dataset considered). On the other hand, AC
Transducer Inference by Assembling Specific Languages
185
has a 100% coverage, which could explain the considerable difference that can be observed in some experiments between AC and CC values. We also used a segment-based measure, called Segment overlap, (Sovδ obs ) defined in [29]: Sovδ obs =
1 min(E) − max(B) + 1 + δ len(s1 ) N s max(E) − min(B) + 1
where N is the total number of amino acids observed within all the domains of the protein, s1 and s2 are two overlapped segments, E is {end(s1 ); end(s2 )}, B is {beg(s1 ); beg(s2 )} and δ is a parameter for the accepted (maximal) deviation. We used a value of δ = 3. We have also calculated the number of transmembrane segments correctly predicted at three accuracy thresholds: 100%, 90% and 75%, that is, number of segments with the 100%, 90% or more, and 75% or more of their amino acids are correctly predicted. This measure is similar to Sensibility, but it is based on segments. This measure allows to obtain a reliable evaluation for those segments that contain false negatives not only at the ends of the segment. For example, this occurs when a domain sequence is predicted as more than one segment, and there are some false negatives between two of this predicted segments. In other words, these values show the continous coverage degree of the prediction. A drawback of these measures is their sensitivity to overprediction. Therefore, it is necessary to complement it with the Sp measure. 4.3
Results
We considered inference of k-TSS languages to obtain the automata which model the distinct regions. Looking for the best behaviour, the experimentation was run using values of k from 2 to 7. We recall that, in this task, three different labels are considered, that is: inner regions, outer regions, and transmembrane domains. The best results were obtained using the following k-values: k = 3 to infer the inner and transmembrane automata models; k = 6 to infer the outer model; and k = 2 to infer the automata for the set of label sequences. In order to test our method, we followed a leaving one out scheme. In this scheme, for each protein of the dataset, a transducer is obtained using the rest of sequences in the dataset. The transducer is then used to process the sequence left out. This process is repeated until all proteins have been used as test sequences. Table 1 shows the accuracy of our approach (igS ), along with the results we obtained with TMHMM 2.0[13] and igTM[5]. In order to do the comparison, we run our method, TMHMM and igTM over the same dataset of 160 proteins. igTM is a Grammatical Inference approach which achieves values close to 80% in both specificity and sensitivity. TMHMM 2.0 is a method based on a Hidden Markov Model, whose prediction accuracy for membrane domain location is near 83%. It is worth to note here that this tool is available as a closed package and, therefore, no leaving one out procedure has been carried out for this method. Thus, these results may have some slight bias when compared with other results.
186
P. Peris and D. L´ opez
Table 1. Results of the experiments carried out with igS over the 160 proteins dataset, compared to the results of TMHMM 2.0 and and the two best configurations of igTM over the same dataset Sn igS 0.877 igTM config. 1 0.808 config. 2 0.819 TMHMM 2.0 0.900
Sp 0.784 0.810 0.796 0.879
TMHMM database CC AC Sov3 obs 100% 0.733 0.728 0.722 0.591 0.707 0.702 0.680 0.474 0.715 0.707 0.707 0.490 0.830 0.827 0.915 0.339
≥ 90% 0.693 0.603 0.618 0.636
≥ 75% 0.856 0.756 0.789 0.920
With respect to igTM, the method we propose in this work improves the sensitivity (Sn of 0.88), and, despite a slightly lower specificity (Sp of 0.78), obtains a better AC (from 0.707 to 0.722). This method also improves the Sovδobs obtained by igTM. This means that our proposal behaves better at the ends of the predicted membrane domains. Another relevant improvements are the results obtained in the segment-based measures. Thus, a 59.1% of segments were correctly covered by the prediction. There was also obtained a slight better coverage of the membrane domains at 90% of accuracy. Neverthless, in order to do a fair comparison, these results have to be combined with the Sp measure, because it is possible that this value is due to the presence of false positives at the boundaries of the correctly predicted segments. The improved value of Sovδobs may show that this is not the case. Nevertheless, the improvement of the Sp value is key for further developments.
5
Conclusions and Future Lines of Work
This paper describes igS, an application of Grammatical Inference to the task of transmembrane domain prediction. Grammatical Inference has already been used to tackle this task [5]. The method we propose in this work introduces a preprocessing where the protein is divided into sub-sequences that belong to different domains according to the topology of the protein (inner, outer or transmembrane). We consider the inference of k-TSS languages to obtain an automaton for each of these subsets. We also generate a k-TSS language for the set of label sequences. Our method then combines the automata of each subset with the automaton of labels using a language substitution. The experimental results of this method outperformed the previous GI approach to the task (igT M [5]). The improved accuracy of the method may be attributed to the higher specificity of generated languages, due to the classification of protein sub-sequences introduced in the preprocessing. Nevertheless, this approach is slightly less accurate than the method based on HMM. This may be caused by the need of more data in training phase in GI.
Transducer Inference by Assembling Specific Languages
187
We note that, although applied to a bioinformatic task, this approach may be useful whenever there exists relevant information to label the training sequences (which is usually the case in pattern recognition tasks). In the future, we plan to combine igS together with (an)other prediction method(s). This may allow to raise the specificity of the approach and therefore to improve the results. At present, we are testing other inference algorithms to learn the automata, the use of new labelling information of the sequences [30,31], and larger datasets, by merging the existing ones. The influence of the substitution procedure in the experimental behaviour remains also as future work.
References 1. Searls, D.B.: The language of genes. Nature 420, 211–217 (2002) 2. Sakakibara, Y.: Grammatical inference in bioinformatics. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7), 1051–1062 (2005) 3. Yokomori, T., Kobayashi, S.: Learning local languages and their application to dna sequence analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(10), 1067–1079 (1998) 4. Peris, P., L´ opez, D., Campos, M., Sempere, J.M.: Protein motif prediction by grammatical inference. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 175–187. Springer, Heidelberg (2006) 5. Peris, P., L´ opez, D., Campos, M.: Igtm: an algorithm to predict transmembrane domains and topology in proteins. BMC-Bioinformatics 9, 367–378 (2008) 6. Brazma, A., Johansen, I., Vilo, J., Ukkonen, E.: Pattern discovery in biosequences. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 257–270. Springer, Heidelberg (1998) 7. Arimura, H., Wataki, A., Fujino, R., Arikawa, S.: A fast algorithm for discovery optimal string patterns in large databases. In: Richter, M.M., Smith, C.H., Wiehagen, R., Zeugmann, T. (eds.) ALT 1998. LNCS (LNAI), vol. 1501, pp. 247–261. Springer, Heidelberg (1998) 8. Peris, P., L´ opez, D., Campos, M.: Localizaci´ on de genes en el adn mediante inferencia gramatical. In: Universidad de Valencia (ed.) Proceedings of the XII Congreso de la Sociedad Espa˜ nola de Neurociencia, Universidad de Valencia (2007) (spanish) 9. Campos, M., L´ opez, D., Peris, P.: Incremental multiple sequence alignment. In: Rueda, L., Mery, D., Kittler, J. (eds.) CIARP 2007. LNCS, vol. 4756, pp. 604–614. Springer, Heidelberg (2007) 10. Wallin, E., von Heijne, G.: Genome-wide analyses of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Science 7(4), 1029– 1038 (1998) 11. Mitaku, S., Ono, M., Hirokawa, T., Boon-Chieng, S., Sonoyama, M.: Sonoyama. Proportion of membrane proteins in proteomes of 15 single-cell organisms analyzed by the sosui prediction system. Biophysical Chemistry 82(2-3), 165–171 (1999) 12. Sugiyama, Y., Polulyakh, N., Shimizu, T.: Identification of transmembrane protein functions by binary topology patterns. Protein Engineering Design and Selection (PEDS) 16(7), 479–488 (2003)
188
P. Peris and D. L´ opez
13. Sonnhammer, E.L.L., von Heijne, G., Krogh, A.: A hidden markov model for predicting transmembrane helices in protein sequences. In: Glasgow, J.I., Littlejohn, T.G., Major, F., Lathrop, R.H., Sankoff, D., Sensen, C. (eds.) ISMB, pp. 175–182. AAAI, Menlo Park (1998) 14. Tusn´ ady, G.E., Simon, I.: The hmmtop transmembrane topology prediction server. Bioinformatics 17(9), 849–850 (2001) 15. Viklund, H., Elofsson, A.: Best alpha-helical transmembrane protein topology predictions are achieved using hidden markov models and evolutionary information. Protein Science 13(7), 1908–1917 (2004) 16. Fariselli, P., Casadio, R.: Htp: a neural network-based method for predicting the topology of helical transmembrane domains in proteins. Computer Applications in the Biosciences 12(1), 41–48 (1996) 17. Michael Gromiha, M., Ahmad, S., Suwa, M.: Neural network-based prediction of transmembrane -strand segments in outer membrane proteins. Journal of Computational Chemistry 25(5), 762–767 (2004) 18. Pasquier, C., Promponas, V.J., Palaios, G.A., Hamodrakas, J.S., Hamodrakas, S.J.: A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm. Protein Eng. 12(5), 381–385 (1999) 19. Sadovskaya, N.S., Sutormin, R.A., Gelfand, M.S.: Recognition of transmembrane segments in proteins: Review and consistency-based benchmarking of internet servers. J. Bioinformatics and Computational Biology 4(5), 1033–1056 (2006) 20. Bagos, P.G., Liakopoulos, T., Hamodrakas, S.J.: Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics 6, 7 (2005) 21. Punta, M., Forrest, L.R., Bigelow, H., Kernytsky, A., Liu, J., Rost, B.: Membrane protein prediction methods. Methods 41(4), 460–474 (2007) 22. Tellier, I.: How to split recursive automata. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 200–212. Springer, Heidelberg (2008) 23. Berstel, J.: Transductions and Context-Free Languages. Teubner Studienb¨ ucher, Stuttgart (1979) 24. Vidal, E., Garc´ıa, P., Casacuberta, F.: Local languages, the succesor method, and a step towards a general methodology for the inference of regular grammars. IEEE Trans. on PAMI 9(6), 841–845 (1987) 25. Segarra, E., Hurtado, L.: Construction of Language Models using the Morphic Generator Grammatical Inference (MGGI) Methodology. In: Proc. of Eurospeech, Rhodes (Grecia), pp. 2695–2698 (1997) 26. Grau, S., Segarra, E., Sanchis, E., Garc´ıa, F., Hurtado, L.F.: Incorporating semantic knowledge to the language model in a speech unders- tanding system. In: IV Jornadas en Tecnologia del Habla, pp. 145–148 (2006) 27. Burset, M., Guigo, R.: Evaluation of gene structure prediction programs. Genomics 34(3), 353–367 (1996) 28. Mathews, B.W.: Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica Biophysica Acta 405(2), 442–451 (1975) 29. Rost, B., Sander, C., Schneider, R.: Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235, 13–26 (1994) 30. Reed Murphy, L., Wallqvist, A., Levy, R.M.: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Engineering 13(3), 149–152 (2000) 31. Li, T., Fan, K., Wang, J., Wang, W.: Reduction of protein sequence complexity by residue grouping. Protein Engineering 16(5), 323–330 (2003)
Sequences Classification by Least General Generalisations Frédéric Tantini1 , Alain Terlutte2 , and Fabien Torre2 1
2
Parole, CNRS/LORIA Nancy Mostrare (INRIA Lille Nord Europe et CNRS LIFL) Université Lille Nord de France
Abstract. In this paper, we present a general framework for supervised classification. This framework provides methods like boosting and only needs the definition of a generalisation operator called lgg. For sequence classification tasks, lgg is a learner that only uses positive examples. We show that grammatical inference has already defined such learners for automata classes like reversible automata or k-TSS automata. Then we propose a generalisation algorithm for the class of balls of words. Finally, we show through experiments that our method efficiently resolves sequence classification tasks. Keywords: sequence classification, least general automata, balls of words.
1
Introduction
We investigate in this paper the problem of sequence classification with two main ideas. First, we want to benefit from supervised classification advances like ensemble methods (bagging, boosting, etc.). For this, we use a general framework for supervised classification based on the notion of least general generalisation (lgg). This framework, called volata, provides various ensemble methods whenever we are able to define this lgg operator. Second, we claim that results from the grammatical inference domain, like language identification with positive instances only, can be used in order to classify sequences. We consider learnability results: proofs of learnability from positive examples could offer learning algorithms close to an lgg operator. Applying this approach, we study 0-reversible automata [1], k-TSS automata [2] and balls of words [3]. The paper is organised as follow. In Section 2, we describe a lgg-based machine learning framework and give its main generic algorithms. This generic method is instantiated for some automata families in Section 3, and for balls of words in Section 4. In Section 5, we run these algorithms on well-known sequence classification problems and also on a real handwritten digit recognition problem. Finally, Section 6, we assess our results and propose some future research works. J.M. Sempere and P. García (Eds.): ICGI 2010, LNAI 6339, pp. 189–202, 2010. © Springer-Verlag Berlin Heidelberg 2010
190
2
F. Tantini, A. Terlutte, and F. Torre
Supervised Learning with Least General Generalisations
In this section, we describe our framework which is based on the following points: – choose example language (E) and hypothesis language (H) such that E ⊂ H; – define a subsumption relation between hypotheses of H that allows to check if a hypothesis subsumes an example, and also to decide if a hypothesis is more general than another one; – given H and , prove the existence of a unique least general hypothesis for every set of examples and define an algorithm lgg to compute this hypothesis. The last point uses the notion of least general generalisation defined as follow. Definition 1 (least general generalisation). Given a set of examples E ⊆ E, a hypothesis h ∈ H is a least general generalisation of E iff: – ∀e ∈ E : h e; – there exists no hypothesis h such that ∀e ∈ E : h e and h h . Assuming a unique and computable least general generalisation, the volata system is designed through three levels. 1. The first one defines the lgg operator and depends on H and . The next levels are generics, they depend neither on language representation nor on generality ordering. 2. The second one uses classes to control the generalisation. A possible algorithm here is cg (for correct generalisation) that aims at producing a correct hypothesis (Algorithm 1): the first example is used as a seed, then we try to generalise other positive examples using the lgg operator; each generalisation must be validated against negative examples; an example that produces an incorrect generalisation is rejected and we continue to add positive examples from the previous correct hypothesis. Let us note that the result of cg depends on the presentation order of positive examples. This characteristic will be an advantage in ensemble methods since it enhances diversity of produced hypotheses. 3. The last level provides generic full learners: dlg for a fast learning, globo to obtain a comprehensive theory and ensemble methods like globoost, bagging and adaboost-mg that ensure better predictions. In this paper, we focus on globoost that is the simplest ensemble method: it randomly produces T correct hypotheses; at each step, a class is randomly chosen, sets of positive and negative examples are built, randomly shuffled and then the algorithm cg is called on these sets; finally, a new example is classified by the T hypotheses in a vote. globoost is described in Algorithm 2. The two presented algorithms, globoost and cg, are generics, they have to be instantiated by a generalisation operator lgg, itself depending on H and .
Sequences Classification by Least General Generalisations
191
Algorithm 1. cg (correct generalisation) Require: E = [p1 , . . . , pn ] ⊆ E an ordered set of n examples with the same class, N ⊆ E a set of counter-examples. Ensure: h ∈ H generalisation of some examples in E and correct wrt N . 1: g = p1 2: for i = 2 to n do 3: g = lgg(g, pi ) /* generalisation with pi */ 4: if (∀e ∈ N : g e) then /* if g is correct */ 5: g = g /* g is now the current generalisation */ 6: end if 7: end for 8: return h(x) = class(E) if g x, else 0 (abstention)
Algorithm 2. globoost Require: labelled examples (xi , yi ) and T the number of steps. Ensure: H the final classifier. 1: for t = 1 to T do 2: target = class randomly chosen 3: P = [xi |yi = target ] 4: N = [xi |yi = target ] 5: randomly shuffle P 6: ht = cg(P ,N ) /* Call to the correct generalisation algorithm */ 7: end for T 8: return H(x) = sign t=1 ht (x)
In the rest of this paper, examples are words and hypotheses are either automata or balls of words. For these families, the subsumption test between an hypothesis and an example is naturally the word membership of the denoted language. What remains is the definition of the lgg operator. In grammatical inference, this means finding for each family an algorithm that learns with positive examples only, and provides the smallest language that contains the training set. In the two following sections, we investigate this question for k-TSS automata, 0-reversible automata and balls of words.
3
Generalisation of Words to Automata
We consider in this section that hypotheses are automata. The subsumption test is obviously whether or not the automaton accepts the example. We have now to define the lgg operator for each class of language we want to use, that is, an algorithm from positive examples only ensuring the smallest language that includes the given examples. 3.1
The k-TSS Languages
The class of k-testable in the strict sense (k-TSS) languages is a well known subclass of regular languages [2]. It is characterised by the set of sequences of
192
F. Tantini, A. Terlutte, and F. Torre
length k that do not appear in the word of the language. These languages are very simple and their expressivity is relatively poor. However, there exists an algorithm learning from positive examples only. It is known to compute the smallest language including the sample, then the least general generalisation. We propose an incremental version, lgg-tssi (Algorithm 3), for which the key point is that each state is represented by the last (k − 1) letters read to reach this state. Algorithm 3. lgg-tssi Require: h = (Q, Σ, q0 , F ) a k-TSS automaton, e an example, a given integer k. Ensure: h a k-TSS automaton, least general generalisation of h, subsuming e. 1: q = q0 2: for i = 1 to |e| do 3: v = q.ei /* concatenation of the word of q with the ith letter of e */ 4: if (|v| = k) then 5: v = v2,...,|v| 6: end if 7: nq = v 8: add nq to Q /* nq may be already present in Q */ 9: add (q, ei , nq) to δ 10: q = nq 11: end for 12: add q to F 13: return h = (Q, Σ, δ, q0 , F )
3.2
The 0-Reversible Languages
The class of k-reversible languages is often said to be more interesting than the previous one, partly because it is more expressive: (k − 1)-TSS languages are all k-reversible ones. Note that, given a fixed k, some finite languages are not k-reversible. In this paper, we will focus on k = 0, that is 0-reversible languages. A least general generalisation computation in the class of 0-reversible languages is given in [1]. We propose an incremental version: lgg-zr (Algorithm 4). In this algorithm, the new word is added to the current automaton in the PTA way, that induces the creation of a new branch recognising only this word. Then, merges are made in order to get a unique final state, deterministic transitions and a deterministic mirror -automaton.
4 4.1
Generalisation of Words to Balls of Strings Definitions
We now consider hypotheses as balls of strings. A ball is defined by a centre-string o and a radius r, and is noted Br (o). A ball of strings Br (o) is the set of all words
Sequences Classification by Least General Generalisations
193
Algorithm 4. lgg-zr Require: h = (Q, Σ, q0 , F ) a 0-reversible automaton, e an example. Ensure: h a 0-reversible automaton, least general generalisation of h subsuming e. 1: i = 1 ; q = q0 /* following existing states */ 2: while δ(q, ei ) is defined do 3: q = δ(q, ei ) ; i = i + 1 4: end while /* creating new branch for remaining letters */ 5: while i ≤ |e| do 6: create state q , add q to Q 7: add (q, ei , q ) to δ 8: i=i+1 9: end while 10: add q to F /* merging */ 11: merge all states in F 12: repeat 13: if ∃A, B ∈ Q and ∃l ∈ Σ such that δ(A, l) = δ(B, l), merge A and B 14: if ∃A, B, E ∈ Q and ∃l ∈ Σ such that δ(E, l) = {A, B}, merge A and B 15: until no fusion 16: return h = (Q, Σ, δ, q0 , F )
at distance less or equals to r from o, that is, Br (o) = {w ∈ Σ ∗ |d(o, w) ≤ r}. The subsumption test between a hypothesis h = Br (o) and an example e is then true if the word is in the ball, that is, h e ⇔ d(e, o) ≤ r. The distance we use is the edit distance, or Levenshtein distance [4], for which each edit operation (among insertion, deletion, substitution) has a unit cost. It is the minimal number of symbol operations needed to rewrite one word into another one. More formally, let w and w be two words in Σ ∗ , we rewrite w into w in one step if one of the following condition is true: 1. deletion : w = uav and w = uv with u, v ∈ Σ ∗ and a ∈ Σ; 2. insertion : w = uv and w = uav with u, v ∈ Σ ∗ and a ∈ Σ; 3. substitution : w = uav and w = ubv with u, v ∈ Σ ∗ , a, b ∈ Σ, a = b. k
We note w − → w if w can be rewritten into w by means of k operations. Definition 2 (Edit Distance). The edit distance between two words w and k w , noted d(w, w ), is the smallest k such that w − → w . The edit distance d(w, w ) can be computed in O (|w| · |w |) time by dynamic programming [5]. Basically, we compute a |w| × |w | matrix M , where M [i][j] is . Moreover, this allows us the edit distance between the prefixes w1...i and w1...j to deduce the required edit operations to go from one word into the other.
194
4.2
F. Tantini, A. Terlutte, and F. Torre
Learning Generalised Balls
Non-unicity of least general generalisation Unlike the generalisation to automata, we can note that with balls of strings, least general generalisation are not unique anymore. Example 1. Let E = [a, b, ab], h = B1 (a) and h = B1 (b). Both hypotheses subsume the examples (h E and h E) but h h and h h ! This property is obvious in R2 ; in Figure 1, three points are subsumed by several disk-hypotheses that are not comparable to each other.
Fig. 1. Endless number of disks containing 3 points and non comparable to each other
Thus, we are definitely not in the ideal case of Section 2 where least general generalisation is unique. One could claim that there is nonetheless a smallest ball containing all the examples as we can see in Figure 1. But there is no reason that the smallest ball should be the least general generalisation. Indeed, it is not contained in the other ones, thus the two concepts are different. Furthermore, there is a computational barrier if we made this choice: finding the centre string of a set is NP-hard [6]. Monotonic generalisation operator To tackle theses problems, we propose the incremental Algorithm 5 (called gballs) as the generalisation operator for balls of strings. Algorithm 5. Generic algorithm g-balls of the generalised ball Require: h = Br (o) a ball, e an example. Ensure: g ∈ H a least general generalisation of h subsuming e (g h and g e). ∗ 1: p = o − → e /* a shortest path */ y x →u− → e, x + y = d(o, e) */ 2: Let u be a string on the path p /* p = o − 3: x = d(o, u) 4: y = d(u, e) 5: k = max(x + r, y) 6: return Bk (u)
Sequences Classification by Least General Generalisations
195
This algorithm requires a method to choose the new centre u on the path p, but whatever this choice is, we keep the monotonic property, that is the new ball subsumes the new example and the previous hypothesis: – d(u, e) = y ≤ k =⇒ e ∈ Bk (u) =⇒ g e; – ∀w ∈ Br (o), d(o, w) ≤ r and with the triangular inequality d(u, w) ≤ d(u, o) + d(o, w), we deduce d(u, w) ≤ x + r ≤ k, so w ∈ Bk (u) and then g h. The algorithm mainly relies on the computation of the path between the centre and the new example, so it is, as the edit distance, polynomial in the length of the words. Unfortunately, the downside of the monotony and the complexity gain is that the new hypothesis is not always a least general generalisation anymore, but the balls of strings combinatorial complexity keeps us from a better construction. For instance: Example 2. let E = [a, b]. The first hypothesis is the first example, that is h = B0 (a). Then, as the path between the centre and the new example is 1 c : a − → b, there are two ways of choosing u. Either g-balls(h, b) = B1 (a), or g-balls(h, b) = B1 (b). But the ball of radius 1 centred on the empty word B1 (λ) contains E and is more specific than both hypothesis: B1 (λ) ⊆ B1 (a) and B1 (λ) ⊆ B1 (b). However, let us note that the result of g-balls depends on the presentation order of examples and this is a suitable property for our ensemble methods. cg properties At last, the cg is no more monotonic. Example 3. Let us suppose that the strategy for the choice of the new centre is to always take the new centre at distance 1 from the old one (x = 1). If the examples are λ, b, a, and the counter-example is bb, the produced hypotheses are, in order: – B0 (λ), the first hypothesis; – B1 (b), which is rejected since it contains bb; – B1 (a), which is accepted. And yet, B1 (a) contains b, while the addition of b was rejected in the previous step. The only consequence of this non-monotony is AdaBoost-like implementations less efficient since covering of hypotheses should be systematically computed. Summary for g-balls and comparison with lgg operators unique lgg sensibility to order monotony of g-balls monotony of cg
lgg g-balls × × ×
196
F. Tantini, A. Terlutte, and F. Torre
To summarise the properties of generalisation to balls of strings, we can then say that: g-balls does not produce the least general generalisation but is monotonic; cg depends on the presentation order of the examples and produces correct hypotheses but is not monotonic. The only point remaining now is to choose a minimal rewriting path p, and a centre u on it. 4.3
Strategies to Compute the Centre of the New Ball
In Algorithm 5, p is a rewriting path from o to e. Yet, there are many ways to go ∗ from a word to an other. In order to always compute the same path p = o − → e, we use the matrix M used for the edit distance. Starting from the last entry, we go back according to the origin of the resulting computation, in such (arbitrary) ways: – if M [i][j] can come from a deletion or another operation, we choose the deletion; – if M [i][j] can come from an insertion or a substitution, we choose the insertion; – once all edit operations are found, we execute them from left to right. We denote such a path as an edit path. By setting these choices, the algorithm cg is then deterministic and depends on the order of the examples. For instance, let w = ABACD and w = EAFCGD. In Table 1 we show the computation of the edit distances d(ABACD, EAFCGD) and d(EAFCGD, ABACD). Gray entries indicate the trace of the edit operation selection for the edit path computation, as previously defined. If we apply the operation from left to right, we obtain the following edit path: – ·ABACD → EABACD → EAF ACD → EAF C·D → EAF CGD – EAF CGD → A·AF CGD → ABAF CGD → ABACGD → ABACD Finally, several strategies are conceivable to choose the new centre u along the edit path. This choice relies on the problem we are dealing with, or heuristics if we want to give importance to the seed, or favour long examples, etc. ∗
Table 1. Edit path computation matrices of ABACD − → EAFCGD (left) and EAFCGD ∗ − → ABACD (right) ABACD
EAFCGD A B A C D
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 2 3 4
3 2 2 3 3 4
4 3 3 3 3 4
5 4 4 4 4 4
6 5 5 5 5 4
E A F C G D
0 1 2 3 4 5 6
1 1 1 2 3 4 5
2 2 2 2 3 4 5
3 3 2 3 3 4 5
4 4 3 3 3 4 5
5 5 4 4 4 4 4
Sequences Classification by Least General Generalisations
197
– simple: x + r = y (if k is even, otherwise, x + r = y + 1). Naive version, that we would use in Euclidian space. It “minimises” the new ball radius. – seed: weighted by the number of examples in the ball. The seed is privileged. d(o,e) – longCentre: weigthed by the old centre length x = d(o, e) × d(o,e)+|o| . The longer the centre, the closer we come to it. |o| – shortCentre: weigthed by the old centre length x = d(o, e) × d(o,e)+|o| . The longer the centre, the farther we move away from it. – random: the new centre is on the path, randomly chosen (x ∈ [0; d(o, e)]). We have obviously considered balls centred on the first example. In this case, neither lgg nor cg are needed to learn the radius: taking the distance to the nearest counterexample is enough. We are not driven by examples anymore and we lose in diversity: each example leads to one unique ball. These suspicions are confirmed with poor results in experiments with this strategy.
5
Experimentations
Among the methods that are suitable for including our generalisation algorithm, we will keep globoost, previously described in Algorithm 2 and implemented in the volata1 system. We have set the following protocol: 10-fold cross validation, with 10 runs of globoost on each fold. A given result is then the average of 100 runs. 5.1
UCI Repository Datasets
In this section, we will use the sequential datasets from the UCI Repository [7], namely: tic-tac-toe, badges, promoters, us-first-name and splice. With these few problems, we tackle alphabets of size 3 to about 30. In the described protocol, the data is split in 90% for learning an 10% for testing. Our goal is to compare globoost instantiated with least general generalisation computations to classical grammatical inference methods (such as rpni [8]). Results are given Table 2 (with 1 000, 10 000 and 100 000 balls, 1 000 automata, all produced by globoost, for each considered problem). Missing values are due to a lack of time with our available implementations. On each problem, one of our method is the best one. Ensemble method gives better performances in prediction. gb-b is generally the best choice, then comes gb-tssi. gb-zr is not very good: it can be explained by the fact that the 0reversible class is rich and leads to nearly learn by heart on some data. A unique 0-reversible can then subsume all positive data without accepting counterexamples. We will now concentrate on balls of strings as hypotheses. 5.2
Handwritten Digit Classification
We consider now the Nist special database 3. This database consists in 128 × 128 bitmap images of handwritten letters and digits. We will focus on a subset of 1
http://www.grappa.univ-lille3.fr/~torre/Recherche/Softwares/volata/
198
F. Tantini, A. Terlutte, and F. Torre
Table 2. Precision of the methods on UCI Repository databases. gb-M denotes globoost instancied by M , with M a k-TSS automata computation(tssi), a 0-reversible computation (zr), or a ball of strings computation (B). In this case, the choice of the new centre strategy is given in subscript. (references) Majority rpni traxbar red-blue
tic-tac-toe 65.34 % 91.13 % 90.81 % 93.89 %
badges 71.43 % 62.24 % 57.48 % 61.09 %
promoters 50.00 % 56.60 % 63.02 %
first-name splice 81.62 % 50.26 % 81.42 % 81.37 % 58.33 % 82.83 % 54.65 %
(globoost ×1 000) gb-tssi gb-zr
tic-tac-toe badges promoters first-name splice 91.47 % 72.69 % 61.13 % 89.50 % 78.07 % 98.36 % 71.43 % 50.00 % 83.07 % -
(globoost ×1 000) gb-bsimple gb-bseed gb-blongCentre gb-bshortCentre gb-brandom
tic-tac-toe 92.64 % 92.77 % 91.04 % 74.89 % 92.62 %
badges 81.10 % 81.72 % 81.12 % 80.43 % 80.41 %
promoters 88.58 % 86.13 % 86.53 % 87.55 % 87.63 %
first-name 87.24 % 87.45 % 86.84 % 86.95 % 87.10 %
splice 93.78 % 93.63 % 93.48 % 92.70 % 93.76 %
(globoost ×10 000) gb-bsimple gb-bseed gb-blongCentre gb-bshortCentre gb-brandom
tic-tac-toe 94.54 % 94.14 % 94.69 % 74.46 % 94.69 %
badges 81.38 % 82.04 % 82.21 % 81.12 % 81.39 %
promoters 88.90 % 86.30 % 86.75 % 88.63 % 88.43 %
first-name 88.59 % 88.93 % 89.47 % 89.37 % 88.80 %
splice 95.16 % 95.29 % 95.54 % 95.83 % 95.63 %
(globoost ×100 000) gb-bsimple gb-bseed gb-blongCentre gb-bshortCentre gb-brandom
tic-tac-toe 94.43 % 94.26 % 95.03 % 74.45 % 94.90 %
badges 81.21 % 81.39 % 81.98 % 81.13 % 81.36 %
promoters 89.79 % 86.58 % 87.58 % 89.20 % 89.08 %
first-name 88.72 % 89.11 % 89.93 % 89.88 % 89.06 %
splice 96.05 % 95.64 % 95.68 % 96.02 % 95.42 %
digits, written by 100 different writers. Each class (from 0 to 9) has about 1 000 instances, giving a 10 568 digits corpus. As we are working on words, each image is transformed in an octal string, with the algorithm described in [9]: from the upper left pixel, we follow the border of the digit until going back to the first one. Each direction gives a different letter of the string (see Figure 2). We aim at comparing our approach to the one of [10], thanks to the use of SEDiL [11] and a weighted edit matrix. The matrix is learnt on the same data as Marc Sebban with a stochastic transducer on 8 000 (input, output) pairs of strings (the input is the string from the learning set, the output the 1nearest-neighbour). The final class is given according to the 1-nearest-neighbour
Sequences Classification by Least General Generalisations
199
Fig. 2. Handwritten digit example. The corresponding string is “2”=22222 24324444466656565432222222466666666660000212121210076666546600210.
computed with the weighted distance matrix learnt. We have kept the same protocol (10-fold cross validation), with 10% of the data for the learning set, 90% for the test set (the matrix of SEDiL has been learnt on the same test examples, thus inducing a bias in its favour). Results are given Table 3. Table 3. Precision on the Nist special database 3, for 1 000, 10 000 and 100 000 produced balls by globoost (SEDiL performance: 95.86 %)
gb-bsimple gb-bseed gb-blongCentre gb-bshortCentre gb-brandom
5.3
×1 000 92.59 % 93.74 % 93.64 % 92.92 % 93.81 %
×10 000 95.14 % 95.77 % 95.89 % 95.73 % 95.93 %
×100 000 95.57 % 96.16 % 96.22 % 96.17% 96.27 %
Experimental Observations and Discussion
Apart from the shortCentre counterperformance on the tic-tac-toe problem, we can consider that our strategies to compute the new centre are very close. Note also that, with few exceptions, predicting qualities increase with the number of produced balls. Finally, combining balls is more competitive in prediction terms than the other tested methods, especially on genomic data (promoters and splice problems) and on handwritten recognition, where we overcome SEDiL in spite of our protocol. Being able to produce 100 000 hypotheses is characteristic of balls of strings. It is inconceivable for 0-reversible or k-TSS automata. On the one hand, runs of these algorithms are too long to give such a large amount of hypotheses this quickly; with SEDiL, it is the classification that requires a quadratic number of distance computations. On the other hand, the produced automata are quickly the same. In other words, balls are diverse and fast to compute. Note that wide diversity is usually considered as an important point for ensemble method [12].
200
F. Tantini, A. Terlutte, and F. Torre
Another observation is that for each experiment, examples are on the border of the learnt ball and its centre is never an example of the concept. Even if we can explain this by our hypotheses construction and the intrinsic properties of balls, this is nevertheless noteworthy. Indeed, when used to learn from noisy data (as in [13]), the centre is usually a non-noisy data, and the radius is seen as a noise tolerance level. Here, the centre of the final hypothesis is rather considered as a median string of the positive examples. Example 4. On the tic-tac-toe data set, which encodes possible board configurations at the end of the game, positives examples being “win for x”: we learn the ball with radius 5 and centre bbbb. It covers no negative examples but 120 positive ones, all of them at distance 5 from the centre: xxxoobbbb, xobxbbxbo, xbboxbobx, obxbbxobx, boxoxbxbb, bbxobxobx, etc. Example 5. On the us-first-name data set, which contains american first name, classes being female versus male first name: we learn the ball with radius 7 and centre LRLRTSVKCA. It covers 346 female first names but no male first names. Here again, covered examples are on the border of the ball. These hollow balls are also part of the proof of the balls VC-dimension [14]. Theorem 1 ([14]). The VC-dimension of balls, with a 2-letter alphabet, is infinite. Proof. We take n words, all of length n, defined as follows: the ith word is made with only as, except for the ith letter which is a b. Let us suppose now that these words are labelled: k positives, (n − k) negatives. We can build a ball covering only positive examples as follows: the centre is the word of length n that has bs at the same places than the positive examples, as everywhere else, and the radius is (k − 1). By construction, positive examples can be reached from the centre by (k − 1) substitutions, and negative examples are at distance strictly greater. In this proof, we can note that the ball contains more words than the sample set (thus there is a generalisation) and that words are on the border of the ball (indicating that the learnt ball remains relatively specific to the sample).
6
Conclusion
Our goal in this paper is the classification of sequences, by deciding whether a word belongs in some language or not. In other words, we try to guess a target language, by minimising the generalisation error. We have chosen to integrate classical grammatical inference techniques in a general framework resulting from supervised classification: our hypotheses are automata or balls of strings, that we combine using globoost algorithm. We have shown through experiments that our approach is generally better than classical methods and require usually less examples: we learn a combination of automata that are individually simpler than a unique corresponding
Sequences Classification by Least General Generalisations
201
automaton. Leveraging grammatical inference learners induces good sequence classifiers. Although least general generalisation is supposed to be unique, our method can cope with multiple ones such as balls of strings. We have then considered to follow one of the generalisations. The ball with the smallest radius is certainly attractive, but its computation is exponential. Finally, we have chosen a more general ball, still close to the examples. Experimental results tend to show that our approach is valid: balls of strings combinations are better than automata combinations on classical problems of sequences classification, and than the reference method on the handwritten recognition problem. Moreover, learning balls of strings is fast since operations are on words; on the contrary, operations on automata are more complex (merging, determinisation, etc.). Finally, we have been able to deal with multiple least general generalisation in the volata framework, dedicated to unique least general generalisation. This allows us to explore the integration of many more hypotheses classes. Among other perspectives, we think that our methods can be improved by using weighted edit distance, learning one distance for each specific problem. At last, we are encouraged by the good results in genomic data to experiment in this field. Acknowledgments. We would like to thank Jean-Christophe Janodet for providing us the result about the VC dimension of balls and Marc Sebban for the discussions on handwritten recognition and SEDiL. This work was partially supported by Ministry of Higher Education and Research, Nord-Pas de Calais Regional Council and FEDER through the Contrat de Projets Etat Region (CPER) 2007-2013.
References 1. Angluin, D.: Inference of reversible languages. Journal of the ACM 29(3), 741–765 (1982) 2. García, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925 (1990) 3. de la Higuera, C., Janodet, J.C., Tantini, F.: Learning languages from bounded resources: The case of the dfa and the balls of strings. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 43–56. Springer, Heidelberg (2008) 4. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965) 5. Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. Journal of the ACM 21, 168–178 (1974) 6. de la Higuera, C., Casacuberta, F.: Topology of strings: median string is NPcomplete. Theoretical Computer Science 230, 39–48 (2000) 7. Asuncion, A., Newman, D.: UCI machine learning repository (2007) 8. Oncina, J., García, P.: Identifying regular languages in polynomial time. In: Advances in Structural and Syntactic Pattern Recognition, pp. 99–108. World Scientific Publishing, Singapore (1992)
202
F. Tantini, A. Terlutte, and F. Torre
9. Micó, L., Oncina, J.: Comparison of fast nearest neighbour classifiers for handwritten character recognition. Pattern Recognition Letter 19(3-4), 351–356 (1998) 10. Oncina, J., Sebban, M.: Learning stochastic edit distance: Application in handwritten character recognition. Pattern Recognition 39(9), 1575–1587 (2006) 11. Boyer, L., Esposito, Y., Habrard, A., Oncina, J., Sebban, J.: Sedil: Software for Edit Distance Learning. In: Daelemans, W., Goethals, B., Morik, K. (eds.) Proceedings of the 19th European Conference on Machine Learning, pp. 672–677. Springer, Heidelberg (2008) 12. Kuncheva, L.I., Whitaker, C.J.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51(2), 181–207 (2003) 13. Tantini, F., de la Higuera, C., Janodet, J.C.: Identification in the limit of systematic-noisy languages. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 19–31. Springer, Heidelberg (2006) 14. Janodet, J.C.: The vapnik-chervonenkis dimension of balls of strings is infinite. Personal Communication (2010)
A Likelihood-Ratio Test for Identifying Probabilistic Deterministic Real-Time Automata from Positive Data Sicco Verwer1 , Mathijs de Weerdt2 , and Cees Witteveen2 1
Eindhoven University of Technology 2 Delft University of Technology [email protected], {M.M.deWeerdt,C.Witteveen}@tudelft.nl
Abstract. We adapt an algorithm (RTI) for identifying (learning) a deterministic real-time automaton (DRTA) to the setting of positive timed strings (or time-stamped event sequences). An DRTA can be seen as a deterministic finite state automaton (DFA) with time constraints. Because DRTAs model time using numbers, they can be exponentially more compact than equivalent DFA models that model time using states. We use a new likelihood-ratio statistical test for checking consistency in the RTI algorithm. The result is the RTI+ algorithm, which stands for real-time identification from positive data. RTI+ is an efficient algorithm for identifying DRTAs from positive data. We show using artificial data that RTI+ is capable of identifying sufficiently large DRTAs in order to identify real-world real-time systems.
1
Introduction
In previous work [11], we described the RTI algorithm for identifying (learning) deterministic real-time automata (DRTAs) from labeled data, i.e., from an input sample S = (S+ , S− ). The RTI algorithm is based on the currently bestperforming algorithm for the identification of deterministic finite state automata (DFAs), called evidence-driven state-merging (ESDM) [9]. The only difference between DFAs and DRTAs are that DRTAs contain time constraints. In addition to using the standard state-merging techniques, RTI identifies these time constraints by splitting transitions into two, see [11] for details. The RTI algorithm is efficient in both run-time and convergence because it is a special case of an efficient algorithm for identifying one-clock timed automata, see [12]. In practice, however, it can sometimes be difficult to apply RTI. The reason being that data can often only be obtained from actual observations of the process to be modeled. From such observations we only obtain timed strings that have actually been generated by the system. In other words, we only have access to the positive data S+ . In this paper, we adapt the RTI algorithm to this setting. A straightforward way to do this is to make the model probabilistic, and to check for consistency using statistics. This has been done many times, and in different ways, for the J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 203–216, 2010. c Springer-Verlag Berlin Heidelberg 2010
204
S. Verwer, M. de Weerdt, and C. Witteveen
problem of identifying (probabilistic) DFAs, see, e.g., [2,8,3]. As far as we know, this is the first time such an approach is applied to the problem of identifying DRTAs. We start this paper by defining DRTAs and probabilistic DRTAs (PDRTA, Section 2). In addition to a DRTA structure, a PDRTA contains parameters that model the probabilities of events in the DRTA structure. In order to identify a PDRTA, we thus need to solve two different identification problems: the first problem is to identify the correct DRTA structure, and the second is to set the probabilistic parameters of this model correctly. However, because a PDRTA is a deterministic model, we can simply set these parameters to the normalized frequency counts of events in the input sample S+ .1 This is very easy to compute and it is the unique correct setting of the parameters given the data. We therefore focus on identifying the DRTA structure of a PDRTA. We introduce a new likelihood-ratio test that can be used to solve this identification problem (Section 3). Intuitively, this test tests the null-hypothesis that the suffixes of strings that can occur after two different states have been reached follow the same PDRTA distribution, i.e., whether these two states can be modeled using a single state in a PDRTA. If this null-hypothesis is rejected with sufficient confidence, then this is considered to be evidence that these two states should not be merged. Equivalently, if these two states result from a split of a transition, then this is evidence that this transition should be split. In this way, the statistical evidence resulting from these tests replace the evidence value in the original RTI algorithm. The result is the RTI+ algorithm (Section 3.3), which stands for real-time identification from positive data. The RTI+ algorithm is an efficient algorithm for identifying DRTAs from positive data. The likelihood-ratio test used by RTI+ is designed specifically for the purpose of identifying a PDRTA from positive data. Although many algorithms like RTI+ exist for the problem of identifying (probabilistic) DFAs, none of these algorithms uses the non-timed version of the likelihood-ratio test of RTI+. Hence, since this test can easily be modified in order to identify (probabilistic) DFAs, it also contributes to the current state-of-the-art in DFA identification. In order to evaluate the performance of the RTI+ algorithm we show a typical result of RTI+ when run on data generated from a random PDRTA (Section 4). This result shows that our algorithm is capable of identifying sufficiently complex real-time systems in order to be useful in practice. We end this paper with some conclusions and pointers for future work (Section 5).
2
Probabilistic Deterministic Real-Time Automata
The following exposition uses basic notation from language, automata, and complexity theory. For an introduction the reader is referred to [10]. In the following, we first describe non-probabilistic real-time automata, then we show how to add probability distributions to these models.
¹ In the case of a non-deterministic model, setting the model parameters is a lot harder. In fact, it can be as difficult as identifying the model itself.
Fig. 1. An example of a DRTA. The leftmost state is the start state, indicated by the sourceless arrow. The topmost state is an end state, indicated by the double circle. Every state transition contains both a label a or b and a delay guard [n, n′]. Missing transitions lead to a rejecting garbage state.
2.1 Real-Time Automata
In a real-time system, each occurrence of a symbol (event) is associated with a time value, i.e., its time of occurrence. We model these time values using the natural numbers N. This is sufficient because in practice we always deal with a finite precision of time, e.g., milliseconds. Timed automata [1] can be used to accept or generate a sequence τ = (a1, t1)(a2, t2)(a3, t3) . . . (an, tn) of symbols ai ∈ Σ paired with time values ti ∈ N, called a timed string. Every time value ti in a timed string represents the time (delay) until the occurrence of symbol ai since the occurrence of the previous symbol ai−1. In timed automata, timing conditions are added using a finite number of clocks and a clock guard for each transition. In this paper, we use a class of timed automata known as real-time automata (RTAs) [5]. An RTA has only one clock that represents the time delay between two consecutive events. The clock guards for the transitions are then constraints on this time delay. When trying to identify an RTA from data, one can always determine an upper bound on the possible time delays by taking the maximum observed delay in this data. Therefore, we represent a delay guard [n, n′] by a closed interval in N. Definition 1. (RTA) A real-time automaton (RTA) is a 5-tuple A = ⟨Q, Σ, Δ, q0, F⟩, where Q is a finite set of states, Σ is a finite set of symbols, Δ is a finite set of transitions, q0 is the start state, and F ⊆ Q is a set of accepting states. A transition δ ∈ Δ in an RTA is a tuple ⟨q, q′, a, [n, n′]⟩, where q, q′ ∈ Q are the source and target states, a ∈ Σ is a symbol, and [n, n′] is a delay guard. Due to the complexity of identifying non-deterministic automata (see [4]), we only consider deterministic RTAs (DRTAs). An RTA A is called deterministic if A does not contain two transitions with the same symbol, the same source state, and overlapping delay guards. Like timed automata, in DRTAs, it is possible to make time transitions in addition to the normal state transitions used in DFAs. In other words, during its execution a DRTA can remain in the same state for a while before it generates the next symbol. The time it spends in every state is represented by the time values of a timed string. In a DRTA, a state transition is possible (can fire) only if its delay guard contains the time spent in the previous state. A transition ⟨q, q′, a, [n, n′]⟩ of a DRTA is thus interpreted as follows:
whenever the automaton is in state q and reads a timed symbol (a, t) such that t ∈ [n, n′], it moves to the next state q′. Example 1. Figure 1 shows an example DRTA. This DRTA accepts and rejects timed strings not only based on their event symbols, but also based on their time values. For instance, it accepts (a, 4)(b, 3) (state sequence: left → bottom → top) and (a, 6)(a, 5)(a, 6) (left → top → left → top), and rejects (a, 6)(b, 2) (left → top → reject) and (a, 5)(a, 5)(a, 6) (left → bottom → top → left).
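To make these semantics concrete, the following sketch (in Python, with an encoding of transitions that is our own and not taken from [11]) checks whether a DRTA accepts a timed string; the toy automaton at the bottom is hypothetical and not the one of Figure 1.

def drta_accepts(timed_string, transitions, start_state, accepting_states):
    """Return True iff the DRTA accepts [(a1, t1), ..., (an, tn)]; transitions maps
    (state, symbol) to a list of (guard_low, guard_high, target) triples."""
    state = start_state
    for symbol, delay in timed_string:
        for low, high, target in transitions.get((state, symbol), []):
            if low <= delay <= high:      # the delay guard contains the time spent
                state = target            # deterministic: guards do not overlap
                break
        else:
            return False                  # missing transition: rejecting garbage state
    return state in accepting_states

# Hypothetical toy automaton (not the one of Figure 1):
toy = {(0, 'a'): [(0, 5, 1), (6, 10, 2)], (1, 'b'): [(0, 2, 2)]}
print(drta_accepts([('a', 4), ('b', 1)], toy, 0, {2}))   # True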
2.2 Adding Probability Distributions
In order to identify a DRTA from positive data S+, we need to model a probability distribution for timed strings using a DRTA structure. Identifying a DRTA then consists of fitting this distribution and the model structure to the data available in S+. We want to adapt RTI [11] to identify such probabilistic DRTAs (PDRTAs). Since they have the same structure as DRTAs, we only need to decide how to represent the probability of observing a certain timed event (a, t) given the current state q of the PDRTA, i.e., Pr(O = (a, t) | q). In order to determine the probability distribution of this random variable O, we require two distributions for every state q of the PDRTA: one for the possible symbols Pr(S = a | q), and one for the possible time values Pr(T = t | q). The probability of the next state Pr(X = q′ | q) is determined by these two distributions because the PDRTA model is deterministic. The distribution over events Pr(S = a | q) that we use is the standard generalization of the Bernoulli distribution, i.e., every symbol a has some probability Pr(S = a | q) given the current state q, and it holds that Σ_{a∈Σ} Pr(S = a | q) = 1 (also known as the multinomial distribution). This is the most straightforward choice for a distribution function and it is used in many probabilistic models, such as Markov chains, hidden Markov models, and probabilistic automata. A flexible way to model a distribution over time Pr(T = t | q) is by using histograms. A histogram divides the domain of the distribution (in our case time) into a fixed number of bins H. Every bin [v, v′] ∈ H is an interval in N. The distributions inside the bins are modeled uniformly, i.e., for all [v, v′] ∈ H and all t, t′ ∈ [v, v′], Pr(T = t | q) = Pr(T = t′ | q). Naturally, it has to hold that all these probabilities sum to one: Σ_{t∈N} Pr(T = t | q) = 1. Using histograms to model the time distribution might look simple, but it is very effective. In fact, it is a common way to model time in hidden semi-Markov models, see, e.g., [6]. The price of using a histogram to model time is that we need to specify the number and the sizes (division points) [v, v′] of the histogram bins. Choosing these values boils down to making a tradeoff between the model complexity and the amount of data required to identify the model. More bins lead to a more complex model that is capable of modeling the time distribution more accurately, but it requires more data in order to do so. To simplify matters, we assume that these bins are specified beforehand, for example by a domain expert, or by performing data analysis.
Fig. 2. A probabilistic DRTA. Every state is associated with a probability distribution over events and over time. The distribution over time is modeled using histograms. The bin sizes of the histograms are predetermined but left out for clarity.
In addition to choosing how to model the time and symbol distributions, we need to decide whether to make these two distributions dependent or independent. It is common practice to make these distributions independent, see, e.g., [6]. In this case, the time distribution represents a distribution over the waiting (or sojourn) time of every state. In some cases, however, it makes sense to let the time spent in a state depend on the generated symbol. By modeling this dependence, the model can deal with cases where some events are generated more quickly than others. Unfortunately, this dependence comes with a cost: the size of the model is increased by a polynomial factor (the product of the sizes of the distributions). Due to this blowup, we require a lot more data in order to identify a similar sized PDRTA. This is our main reason for modeling these two distributions independently. This results in the following PDRTA model: Definition 2. (PDRTA) A probabilistic DRTA (PDRTA) A is a quadruple ⟨A′, H, S, T⟩, where A′ = ⟨Q, Σ, Δ, q0⟩ is a DRTA without final states, H is a finite set of bins (time intervals) [v, v′], v, v′ ∈ N, known as the histogram, S is a finite set of symbol probability distributions Sq = {Pr(S = a | q) | a ∈ Σ, q ∈ Q}, and T is a finite set of time-bin probability distributions Tq = {Pr(T ∈ h | q) | h ∈ H, q ∈ Q}. The DRTA without final states specifies the structure of the PDRTA. The symbol- and time-probabilities S and T specify the probabilistic properties of a PDRTA. The probabilities in these sets are called the parameters of A. However, in every set Sq and Tq, the value of one of these parameters follows from the others because their values have to sum to 1. Hence, there are (|Sq| − 1) + (|Tq| − 1) parameters per state q of our PDRTA model. The probability that the next time value equals t given that the current state is q is defined as Pr(T = t | q) = Pr(T ∈ h | q) / (v′ − v + 1), where h = [v, v′] ∈ H is such that t ∈ [v, v′]. Thus, in every time-bin the probabilities of the individual time points are modeled uniformly. The probability of an observation (a, t) given that the current state is q is defined as Pr(O = (a, t) | q) = Pr(S = a | q) × Pr(T = t | q)
Thus, the distributions over events and time are modeled to be independent.² The probability of the next state q′ given the current state q is defined as Pr(X = q′ | q) = Σ_{⟨q,q′,a,[v,v′]⟩∈Δ} Σ_{t∈[v,v′]} Pr(O = (a, t) | q)
Thus, the model is deterministic. A PDRTA models a distribution over timed strings Pr(O* = τ), defined using the computation of a PDRTA: Definition 3. (PDRTA computation) A finite computation of a PDRTA A = ⟨Q, Σ, Δ, q0, H, S, T⟩ over a timed string τ = (a1, t1) . . . (an, tn) is a finite sequence q0 −(a1,t1)→ q1 . . . qn−1 −(an,tn)→ qn such that for all 1 ≤ i ≤ n, ⟨qi−1, qi, ai, [ni, ni′]⟩ ∈ Δ, and ti ∈ [ni, ni′]. The probability of τ given A is defined as Pr(O* = τ | A) = Π_{1≤i≤n} Pr(O = (ai, ti) | qi−1, H, S, T).
Example 2. Figure 2 shows a PDRTA A. Let H = {[0, 2]; [3, 4]; [5, 6]; [7, 10]} be the histogram. In every bin the distribution over time values is uniform. We can use A as a predictor of timed events. For example, the probability of (a, 3)(b, 1)(a, 9)(b, 5) is Pr((a, 3)(b, 1)(a, 9)(b, 5)) = 0.5 × (0.2/2) × 0.5 × (0.3/3) × 0.8 × (0.25/4) × 0.5 × (0.4/2) = 1.25 × 10⁻⁵. A PDRTA essentially models a certain type of distribution over timed strings. An input sample S+ can be seen as a sample drawn from such a distribution. The problem of identifying a PDRTA then consists of finding the distribution that generated this sample. We now describe how we adapt RTI in order to identify a PDRTA from such a sample.
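A minimal sketch of how such a probability can be computed, assuming an encoding of our own (per-state symbol distributions sym_p, per-state bin probabilities bin_p, and a guarded transition map delta); only the histogram H of Example 2 is reused, the automaton itself is not the one of Figure 2.

BINS = [(0, 2), (3, 4), (5, 6), (7, 10)]     # the histogram H of Example 2

def time_prob(t, bin_probs):
    """Pr(T = t | q): the probability of t's bin spread uniformly over the bin."""
    for (v, w), p in zip(BINS, bin_probs):
        if v <= t <= w:
            return p / (w - v + 1)
    return 0.0

def string_prob(timed_string, sym_p, bin_p, delta, q0):
    """Pr(O* = tau | A): product of Pr(S = a | q) * Pr(T = t | q) along the
    unique computation; delta maps (state, symbol) to guarded successors."""
    q, prob = q0, 1.0
    for a, t in timed_string:
        prob *= sym_p[q][a] * time_prob(t, bin_p[q])
        for low, high, target in delta[(q, a)]:
            if low <= t <= high:          # the delay guard that fires
                q = target
                break
    return prob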
3 Identifying PDRTAs from Positive Data
In this section, we adapt the RTI algorithm for the identification of DRTAs from labeled data (see [11]) to the setting of positive data. The result is the RTI+ algorithm, which stands for real-time identification from positive data. Given a set of observed timed strings S+, the goal of RTI+ is to find a PDRTA that describes the real-time process that generated S+. Note that, because RTI+ uses statistics (occurrence counts) to find this PDRTA, S+ is a multi-set, i.e., S+ can contain the same timed string multiple times. Like RTI (see [11] for details), RTI+ starts with an augmented prefix tree acceptor (APTA). However, since we only have positive data available, the APTA will not contain rejecting states. Moreover, since the points in time where the observations are stopped are arbitrary, it also does not contain accepting states. Thus, the initial PDRTA is simply the prefix tree of S+, see Figure 3.
² Modeling dependencies between events and time values is possible but this comes with a cost: the number of parameters of the model is increased by a polynomial factor. This blowup also increases the amount of data required for identification.
Fig. 3. A prefix tree. It is identical to an augmented prefix tree acceptor, but without accepting and rejecting states. The bounds of the delay guards are initialized to the minimum and maximum observed time value.
Starting from a prefix tree, our original algorithm tries to merge states and split transitions using a red-blue framework. A merge is the standard state-merging operation used in DFA identification algorithms such as EDSM [9]. A split can be seen as the opposite of a merge. A split of a transition δ requires a time value t and uses this to divide δ, its delay guard [n, n′], and the part of the PDRTA reached afterwards into two parts. The first part is reached by the timed strings that fire δ with a delay value less than or equal to t, creating a new delay guard [n, t]. The second part is reached by timed strings for which this value is greater than t, creating delay guard [t + 1, n′]. The parts of the PDRTA reached after firing δ are reconstructed as new prefix trees, using the suffixes of the timed strings that reach these parts as input sample. See [11] for more information on the split operation. RTI+ uses exactly the same operations and framework as RTI. The only difference is the evidence value we use. Originally, the evidence was based on the number of positive and negative examples that end in the same state. For RTI+, we require an evidence value that uses only positive examples, and that disregards which states these examples end in. We use a likelihood-ratio test for this purpose. We now describe this test and explain how we use it both as an evidence value and as a consistency check.
3.1 A Likelihood-Ratio Test for State-Merging
The likelihood-ratio test (see, e.g., [7]) is a common way to test nested hypotheses. A hypothesis H is called nested within another hypothesis H′ if the possible distributions under H form a strict subset of the possible distributions under H′. Less formally, this means that H can be created by constraining H′. Thus, by definition H′ has more unconstrained parameters (or degrees of freedom) than H. Given two hypotheses H and H′ such that H is nested in H′, and a data set S+, the likelihood-ratio test statistic is computed by LR = likelihood(S+, H) / likelihood(S+, H′)
Fig. 4. The likelihood-ratio test. We test whether using the left model (two prefix trees) instead of the right model (a single prefix tree) results in a significant increase in the likelihood of the data with respect to the number of additional parameters (used to model the state distributions).
where likelihood is a function that returns the maximized likelihood of a data set under a hypothesis, i.e., likelihood(S+, H) is the maximum probability (with optimized parameter settings) of observing S+ under the assumption that H was used to generate the data. Let H and H′ have n and n′ parameters respectively. Since H is nested in H′, the maximized likelihood of S+ under H′ is always greater than the maximized likelihood under H. Hence, the likelihood-ratio LR is a value between 0 and 1. When the difference between n and n′ grows, the likelihood under H′ can be optimized more and hence LR will be closer to 0. Thus, we can increase the likelihood of the data S+ by using a different model (hypothesis) H′, but at the cost of using more parameters n′ − n. The likelihood-ratio test can be used to test whether this increase in likelihood is statistically significant. The test compares the value −2 ln(LR) to a χ² distribution with n′ − n degrees of freedom. The result of this comparison is a p-value. A high p-value indicates that H is a better model since the probability that n′ − n extra parameters results in the observed increase in likelihood is high. A low p-value indicates that H′ is a better model. Applying the likelihood-ratio test to state-merging and transition-splitting is remarkably straightforward. Suppose that we want to test whether we should perform a merge of two states. Thus, we have to make a choice between two PDRTAs (models): the PDRTA A resulting from the merge of these states, and the PDRTA A′ before merging these states. Clearly, A is nested in A′. Thus all we need to do is compute the maximized likelihood of S+ under A and A′, and apply the likelihood-ratio test. Since PDRTAs are deterministic, the maximized likelihood can be computed simply by setting all the probabilities in the PDRTAs to their normalized counts of occurrence in S+. We now show how to use this test in order to determine whether to perform a merge using an example. Example 3. For simplicity, we disregard the time values of timed strings and the timed properties of PDRTAs. Suppose we want to test whether to merge the two root states of the prefix trees of Figure 4. These two prefix trees are parts
of the PDRTA we are currently trying to identify. Hence only some strings from S+ reach the top tree, and some reach the bottom tree. Let S = {10 × a, 10 × aa, 20 × ab, 10 × b} and S′ = {20 × aa, 20 × bb} be the suffixes of these strings starting from the point where they reach the root state of the top and bottom tree respectively, where n × τ means that the (timed) string τ occurs n times. We first set all the parameters of the top tree in such a way that the likelihood of S is maximized: pa,q0 = 4/5, pb,q0 = 1/5, pa,q1 = 1/3, pb,q1 = 2/3 (this is easy because the model is deterministic). We do the same for the bottom tree and S′: pa,q0 = 1/2, pb,q0 = 1/2, pa,q1 = 1, pb,q2 = 1. We can now compute the probability of S under the top tree: p1 = (4/5)^40 × (1/5)^10 × (1/3)^10 × (2/3)^20 ≈ 6.932 × 10⁻²⁰, and the probability of S′ under the bottom tree: p2 = (1/2)^20 × (1/2)^20 ≈ 9.095 × 10⁻¹³. Next, we set the parameters of the right tree to maximize the likelihood of S ∪ S′: pa,q0 = 2/3, pb,q0 = 1/3, pa,q1 = 3/5, pb,q1 = 2/5, pb,q2 = 1, and compute the likelihood of the data under the right (merged) tree: p3 = (2/3)^60 × (1/3)^30 × (3/5)^30 × (2/5)^20 ≈ 3.211 × 10⁻⁴⁰. We multiply the top and bottom tree probabilities in order to get the likelihood of the data under the left (un-merged) tree, and use this to compute the likelihood-ratio: LR = p3/(p1 × p2) ≈ 5.093 × 10⁻⁹. The χ² value that we need to compare to a χ² distribution then becomes χ² ≈ 38.19. Per state |Σ| − 1 parameters are used. In the un-merged model, the number of (untimed) parameters is 5, in the merged model it is 3. A likelihood-ratio test using these values results in a p-value of 5.093 × 10⁻⁹. This is a lot less than 0.05, and hence the merge results in a significantly worse model. Testing whether to perform a split of a transition can be done in a similar way. When we want to decide whether to perform a split, we also have to make a choice between two PDRTAs: the PDRTA before splitting A, and the PDRTA after splitting A′. A is again nested in A′, and hence we can perform the likelihood-ratio test in the same way.
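The computation of Example 3 can be reproduced with a few lines of Python; the per-state suffix counts below are read off S and S′ as described above, while the helper function and the use of scipy.stats.chi2 are our own illustration, not part of RTI+.

import math
from scipy.stats import chi2

def max_log_likelihood(states):
    """Maximized log-likelihood of a deterministic model: every parameter is set
    to its normalized frequency count, so the likelihood is prod_i p_i^{c_i}."""
    ll = 0.0
    for counts in states:
        total = sum(counts.values())
        ll += sum(c * math.log(c / total) for c in counts.values() if c > 0)
    return ll

top    = [{'a': 40, 'b': 10}, {'a': 10, 'b': 20}]             # symbol counts of S
bottom = [{'a': 20, 'b': 20}, {'a': 20}, {'b': 20}]           # symbol counts of S'
merged = [{'a': 60, 'b': 30}, {'a': 30, 'b': 20}, {'b': 20}]  # after the merge

test_stat = -2.0 * (max_log_likelihood(merged)
                    - (max_log_likelihood(top) + max_log_likelihood(bottom)))
p_value = chi2.sf(test_stat, df=5 - 3)      # two extra parameters in the big model
print(round(test_stat, 2), p_value)         # about 38.19 and 5.09e-09: reject the merge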
3.2 Dealing with Small Frequencies
The likelihood-ratio test does not perform well when the tested models contain many unused parameters. The test checks whether an increase in the number of parameters leads to a significantly higher likelihood. Thus, if there are many unused parameters, this increase will usually not be significant. Hence, there will be a tendency to accept null-hypotheses, i.e., to merge states. This causes problems especially in the leaves of the prefix tree. We deal with the issue of small frequencies by pooling the bins of the histogram and symbol distributions if the frequency of these bins in both states is less than 10. Pooling is the process of combining the frequencies of two bins into a single bin. In other words, we treat two bins as though they were a single one. For example, suppose we have three bins, and their frequencies are 7, 14, and 5, respectively. Then we treat it as being two bins with frequencies 12 and 14. In the likelihood-ratio test, this effectively reduces the number of parameters
of the tested models. Theoretically, it can be objected that this changes the model using the data. However, if we do not pool data, we will obtain too many parameters for the states in which some bin occurrences are very unlikely. For instance, suppose we have a state in which 1000 symbols could occur, but only 10 of them actually occur. Then according to theory, we should count this state as having 999 parameters. We count it as having only 9.
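A sketch of one way to implement this pooling rule follows; the exact bookkeeping (a single shared "rare" bin per pair of compared states, appended at the end) is our reading of the description above, and the second state's frequencies in the usage line are invented for illustration.

def pool_bins(counts_a, counts_b, threshold=10):
    """Fold every bin whose frequency is below the threshold in both states
    into one pooled bin (per state), keeping the other bins unchanged."""
    keep = [i for i in range(len(counts_a))
            if counts_a[i] >= threshold or counts_b[i] >= threshold]
    rare = [i for i in range(len(counts_a)) if i not in keep]
    pooled_a = [counts_a[i] for i in keep] + ([sum(counts_a[i] for i in rare)] if rare else [])
    pooled_b = [counts_b[i] for i in keep] + ([sum(counts_b[i] for i in rare)] if rare else [])
    return pooled_a, pooled_b

# The example from the text (7, 14, 5), with a hypothetical second state:
print(pool_bins([7, 14, 5], [3, 20, 4]))   # ([14, 12], [20, 7]): bins 1 and 3 pooled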
3.3 The Algorithm
We have just described the test we use to determine whether two states are similar. The null-hypothesis of this test is that two states are the same. When we obtain a p-value less than 0.05, we can reject this hypothesis with 95% certainty. When we obtain a p-value greater than 0.05, we cannot reject the possibility that the two states are the same. Instead of testing whether two states are the same, however, we want to test whether to perform a merge or a split, and if so, which one. When we test a merge, a high p-value indicates that the merge is good. When we test a split, a low p-value indicates that the split is good. We implemented this statistical evidence in RTI+ in a very straightforward way:
– If there is a split that results in a p-value less than 0.05, perform the split with the lowest p-value.
– If there is a merge that results in a p-value greater than 0.05, perform the merge with the highest p-value.
– Otherwise, perform a color operation.
Thus, we merge two states unless we are very certain that the two states are different. In addition, we always perform the merge or split that leads to the most certain conclusions. In every iteration, RTI+ selects the most visited transition from a red state to a blue state and determines whether to merge the blue state, split the transition, or color the blue state red. The main reason for trying out only the most visited transition is that it reduces the run-time of the algorithm. Trying every possible merge and split would take much longer. Additionally, the tests performed using the most visited transition will be based on the largest amount of data. Hence, we are more confident that these conclusions are correct. An overview of the RTI+ algorithm is shown in Algorithm 1. We claim that RTI+ is efficient, i.e., that it runs in polynomial time: Proposition 1. RTI+ is a polynomial-time algorithm. Proof. This follows from the fact that ID 1DTA is efficient [12] and the fact that every statistic can be computed (up to sufficient accuracy) in polynomial time for every state. Since, at any time during a run of the algorithm, the number of states does not exceed the size of the input, the proposition follows. In addition to being time-efficient, we believe that RTI+ is also data-efficient. More specifically, we conjecture that it returns a PDRTA that is equal to the correct PDRTA At in the limit. By equal we mean that these PDRTAs model the exact same probability distributions over timed strings.
Algorithm 1. Real-time identification from positive data: RTI+
Require: A multi-set of timed strings S+ generated by a PDRTA At
Ensure: The result is a small DRTA A, in the limit A = At
Construct a timed prefix tree A from S+, color the start state q0 of A red
while A contains non-red states do
   Color blue all non-red target states of transitions with red source states
   Let δ = ⟨qr, qb, a, g⟩ be the most visited transition from a red to a blue state
   Evaluate all possible merges of qb with red states
   Evaluate all possible splits of δ
   If the lowest p-value of a split is less than 0.05 then perform this split
   Else if the highest merge p-value is greater than 0.05 then perform this merge
   Else color qb red
end while
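The decision step inside the loop can be written compactly as follows; the dictionaries mapping candidate splits and merges to their p-values are hypothetical names of our own.

def decide(split_pvalues, merge_pvalues, alpha=0.05):
    """One iteration of RTI+: take the most significant split if any is significant,
    otherwise the least significant merge if one is insignificant, otherwise color."""
    if split_pvalues and min(split_pvalues.values()) < alpha:
        return 'split', min(split_pvalues, key=split_pvalues.get)
    if merge_pvalues and max(merge_pvalues.values()) > alpha:
        return 'merge', max(merge_pvalues, key=merge_pvalues.get)
    return 'color', None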
Conjecture 1. The result A of RTI+ converges efficiently in the limit to the correct PDRTA A′ with probability 1. Completeness of the algorithm follows from the fact that the algorithm is a special case of the ID 1DTA algorithm from [12]. The conjecture therefore holds if all correct merges and splits are performed given an input sample of size polynomial in the size of A′. The main reason for our conjecture follows from the fact that with increasing amounts of data, the p-value resulting from the likelihood-ratio test converges to 0 if the two states are different. Thus in the limit, RTI+ will perform all the necessary splits, and perhaps some more, and it will never perform an incorrect merge. However, when the two states tested in the likelihood-ratio test are the same, there is always a probability of 0.05 that the p-value is less than 0.05. Thus, at times it will not perform a merge when it should. Fortunately, not performing a merge or performing an extra split does not influence the language of the DRTA, or the distribution of the PDRTA. It only adds additional (unnecessary) states to the resulting PDRTA A. Thus, in the limit, the algorithm should return a PDRTA A that is language equivalent to the target PDRTA A′. Unfortunately, since we use multiple statistical tests that can become dependent, proving this conjecture is complex and left as future work.
4 Tests on Artificial Data
In order to evaluate the RTI+ algorithm, we test it on artificially generated data. First we generate a random PDRTA (without final states), and then we generate data using the distributions of this PDRTA. Unfortunately, it is difficult to measure the quality of models that are identified from such data. Commonly used measures include the predictive quality or a model selection criterion. However, such measures are meaningless on their own; they are only useful to compare the performance of different methods against each other. Since we know of no other method for identifying a PDRTA, we cannot make use of these measures.
[Figure 5, two panels: "Originally generated random DRTA" (top) and "Identified using RTI+ with the likelihood-ratio test" (bottom); solid = correct, dashed = (partially) incorrect.]
Fig. 5. A randomly generated DRTA (top) and the DRTA identified by our algorithm (bottom). The dashed lines are (partially) incorrectly identified transitions. The solid states are correctly identified, including all outgoing transitions.
Therefore, in order to provide some insight into the capabilities of RTI+, we only show a typical result of RTI+ when run on this data. We generate a random PDRTA with 8 states and a size 4 alphabet. Of the transitions of the PDRTA, 4 are split and assigned different target states at random. The number of possible time values for the timed strings is fixed at 100. The number of histogram bins used in the PDRTA is set to 10. Thus, there
are individual probabilities for [0, 9], [10, 19], etc. The probabilities of these bins and the symbol bins are generated by first assigning to each bin a value between 0 and 1, drawn from a uniform distribution. These values are then normalized such that both the histogram values and the symbol values sum to 1. We generated 2000 timed strings from this PDRTA, all with an exponentially distributed length with an average of 10. Figure 5 shows the resulting original and identified PDRTA (no probability distributions are drawn). From this figure, it is clear that the most common mistake is the incorrect identification (or absence) of a clock guard. These are usually only minor errors, involving only infrequently visited transitions. The resulting PDRTA is thus very similar to the original used to generate the data. We performed such a test multiple times and with differently sized random PDRTAs. The results of these tests are encouraging for up to 8 states, a size 4 alphabet, and 4 splits. When either of these values is increased, the algorithm needs more than 2000 examples to come up with a similar PDRTA. These results are encouraging because PDRTAs of this size are complex enough to model interesting real-time systems.
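For completeness, a sketch of how timed strings can be drawn from such a PDRTA (the encoding is our own; the paper does not prescribe an implementation): pick a symbol from the state's multinomial, a histogram bin, a time value uniformly inside the bin, and follow the matching guarded transition.

import random

def sample_timed_string(pdrta, length):
    """pdrta is a dict with keys 'q0', 'symbols', 'sym_p', 'bins', 'bin_p', 'delta';
    sym_p[q] and bin_p[q] are weight lists aligned with 'symbols' and 'bins'."""
    q, word = pdrta['q0'], []
    for _ in range(length):
        a = random.choices(pdrta['symbols'], weights=pdrta['sym_p'][q])[0]
        v, w = random.choices(pdrta['bins'], weights=pdrta['bin_p'][q])[0]
        t = random.randint(v, w)            # uniform inside the chosen bin
        word.append((a, t))
        for low, high, target in pdrta['delta'][(q, a)]:
            if low <= t <= high:            # deterministic: one guard matches
                q = target
                break
    return word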
5 Future Work
In previous work, we described the RTI algorithm for identifying deterministic real-time automata (DRTAs) from labeled data. In this paper, we showed how to adapt it to the setting of positive data. The result is the RTI+ algorithm. RTI+ runs in polynomial time, and we conjecture that it converges efficiently to the correct probabilistic DRTA (PDRTA). In future work, we would like to prove this conjecture. This should be possible, because none of the statistics we use requires a large amount of data. Moreover, the fact that there exist polynomial characteristic sets for DRTAs (see [12]) should somehow extend to identifying PDRTAs. RTI+ uses a likelihood-ratio test in order to determine which states to merge and which transitions to split. Although this test is designed for the purpose of identifying a PDRTA from positive data, it can easily be modified in order to identify probabilistic DFAs. It would be interesting to test such an approach. The performance of RTI+ is shown to be sufficient to identify complex real-time systems, and we believe it to be sufficient for identifying real-world real-time systems as well. We invite everyone with timed data to try RTI+ to identify behavioral models and network protocols. The source code of RTI+ is available on-line from the first author's homepage.
References 1. Alur, R., Dill, D.L.: A theory of timed automata. Theoretical Computer Science 126, 183–235 (1994) 2. Carrasco, R., Oncina, J.: Learning stochastic regular grammars by means of a state merging method. In: Carrasco, R.C., Oncina, J. (eds.) ICGI 1994. LNCS, vol. 862, pp. 139–150. Springer, Heidelberg (1994)
3. Clark, A., Thollard, F.: PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research 5, 473–497 (2004) 4. de la Higuera, C.: A bibliographical study of grammatical inference. Pattern Recognition 38(9), 1332–1348 (2005) 5. Dima, C.: Real-time automata. Journal of Automata, Languages and Combinatorics 6(1), 2–23 (2001) 6. Gu´edon, Y.: Estimating hidden semi-Markov chains from discrete sequences. Journal of Computational and Graphical Statistics 12(3), 604–639 (2003) 7. Hays, W.L.: Statistics, 5th edn. Wadsworth Pub Co. (1994) 8. Kermorvant, C., Dupont, P.: Stochastic grammatical inference with multinomial tests. In: Adriaans, P.W., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, pp. 149–160. Springer, Heidelberg (2002) 9. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 1–12. Springer, Heidelberg (1998) 10. Sipser, M.: Introduction to the Theory of Computation. PWS Publishing (1997) 11. Verwer, S., de Weerdt, M., Witteveen, C.: An algorithm for learning real-time automata. In: Benelearn, pp. 128–135 (2007) 12. Verwer, S., de Weerdt, M., Witteveen, C.: One-clock deterministic timed automata are efficiently identifiable in the limit. In: Dediu, A.H., Ionescu, A.M., Mart´ın-Vide, C. (eds.) LATA 2009. LNCS, vol. 5457, pp. 740–751. Springer, Heidelberg (2009)
A Local Search Algorithm for Grammatical Inference Wojciech Wieczorek Institute of Computer Science University of Silesia Bedzinska 39 41-200 Sosnowiec, Poland [email protected]
Abstract. In this paper, a heuristic algorithm for the inference of an arbitrary context-free grammar is presented. The input data consist of a finite set of representative words chosen from a (possibly infinite) context-free language and of a finite set of counterexamples—words which do not belong to the language. The time complexity of the algorithm is polynomially bounded. The experiments have been performed for a dozen or so languages investigated by other researchers and our results are reported.
1 Introduction
Grammatical inference can be stated as follows: identification of the grammar that generates the language for a given finite number of examples of a formal language. Sometimes there are negative examples as well as positive ones. The lower the type of the target language in the Chomsky hierarchy, the harder the problem is [11,22]. Even for the simplest—regular languages—inferring a grammar belongs to the intractable class of problems. In 1978, Gold showed that the problem of finding a deterministic finite automaton with a minimum number of states consistent with a given finite set of positive and negative examples is NP-hard [9]. In 1976, Angluin showed that the problem of finding a smallest regular expression compatible with arbitrary positive and negative examples is NP-hard [1]. In fact, even finding a polynomially larger DFA (deterministic finite automaton) than the minimum DFA, consistent with the data, is NP-hard [20]. On the process of grammatical inference we can adopt different points of view. Four main approaches dominate the literature. In Gold’s model of identification
in the limit, the input of a learning algorithm is an infinite sequence of examples (including counterexamples) of the unknown grammar [8]. The setting of this model is that of on-line, incremental learning. After each new example, the learner (algorithm) must return some hypothesis (a grammar). Identification is achieved when the learner returns a correct answer and does not change its decision afterwards. In the query learning model by Angluin, the learning algorithm is based on two query instructions relevant to the unknown grammar G: (a) membership—the input is a string w and the output is ‘yes’ if w is generated by G and ‘no’ otherwise, (b) equivalence—the input is a grammar G′ and the output is ‘yes’ if G′ is equivalent to G and ‘no’ otherwise [2]. If the answer is ‘no’, a string w in the symmetric difference of the language L(G) and the language L(G′) is returned. In the probably approximately correct learning model (PAC learning), we assume that random samples are drawn independently from examples and counterexamples [25]. The goal is to minimize the probability of learning an incorrect grammar. From the fourth, practitioner’s perspective we see grammatical inference as a computationally hard problem and try to solve it by means of heuristic methods. The most exploited method in this scope is a genetic algorithm or evolutionary computation in a broader sense. For a standard—with binary representation of the solution—genetic algorithm, a good example is Lankhorst’s report [13]. A more sophisticated way of using an evolutionary algorithm is presented in Tsoulos and Lagaris’ paper [24]. They made use of a meta-grammar based on grammatical evolution, a variant of the evolutionary algorithm introduced by Ryan, Colins and O’Neill [21]. The need for genetic algorithms emerged also in Sakakibara’s work [23]. He proposed an efficient hypothesis representation method which consists of a table-like data structure similar to the parse table used in the CYK (Cocke–Younger–Kasami) parsing algorithm. Another heuristic method, Tabu Search, has been developed for the purpose of regular grammar induction by Giordano [7]. A rough set approach is proposed by Yokomori and Kobayashi [28]. The rationale behind using heuristic methods is their marvellous accomplishment in recent DFA or context-free grammar competitions [15,3]. Our inference algorithm is also a member of the fourth group in the taxonomy described above. Strictly speaking, we deal with the following problem: given a finite alphabet Σ and two finite subsets X, Y ⊆ Σ∗, find the smallest possible context-free grammar G such that, if L ⊆ Σ∗ is the language represented by G, then X ⊆ L and Y ⊆ Σ∗ − L. The size of a grammar is defined as the total number of symbols on the right sides of all rules. Obviously, without additional kinds of information about L, we can never be sure we have correctly identified L. As far as the classic problem of GI is concerned, this problem can serve as a guess step in Gold’s model of identification in the limit. This paper is organised into five sections. We present in Section 2 necessary definitions and facts originated from automata and formal languages. Section 3 describes our inference algorithm. Section 4 shows experimental results of our approach. Concluding comments are made in Section 5.

⋆ We thank the following computing centres where the preliminary computations of our project were carried out: Academic Computer Centre in Gdańsk TASK, Academic Computer Centre CYFRONET AGH, Kraków (computing grant 027/2004), Wroclaw Centre for Networking and Supercomputing (computing grant 04/97), Interdisciplinary Centre for Mathematical and Computational Modeling, Warsaw University (computing grant G27-9), and Poznań Supercomputing and Networking Centre. The research was supported by the Minister of Science and Higher Education Grant No 3177/B/T02/2008/35.
2 Preliminaries
In this section, we are going to describe some facts about automata in order to make the notation understandable to the reader. For further details about the definitions, the reader is referred to [12,14] (words and languages), [12,4] (automata and their transition diagrams) and [12] (context-free grammars). In our implementation we used a procedure for constructing a minimal DFA (deterministic finite automaton) based on the following definition [14]. To every set X ⊂ Σ∗ is associated a deterministic automaton A(X). Its set of states Q(X) is Q(X) = {w⁻¹X: w ∈ Σ∗}. Its initial state is X, its set of final states is F(X) = {S ∈ Q(X): ε ∈ S}. Its transitions are defined for S ∈ Q(X) and a ∈ Σ by δ(S, a) = a⁻¹S. The automaton A(X) is the minimal automaton of X. The procedure is given as recursive Algorithm 1, in which all states Q(X) are numbered by integers from 0 on up. The set of states is represented by a map (an associative array), where a key is the set of words and a value is an integer. The following preconditions must be met: X is a finite language over Σ; s = 0; Q is empty; and F is empty. The algorithm ensures that A(X) = (Q, Σ, δ, 0, F) is a minimal acyclic DFA for X ⊂ Σ∗.

Algorithm 1. build_minDFA(X)
1: Q[X] := s
2: if ε ∈ X then
3:    add s to F
4: p := s
5: s := s + 1
6: for a ∈ Σ do
7:    U := a⁻¹X
8:    if U ≠ ∅ then
9:       if U ∈ Q then
10:         δ(p, a) := Q[U]
11:      else
12:         δ(p, a) := build_minDFA(U)
13: return p

Assuming that words are random, the average running time T of the algorithm can be assessed in the following way. Let n be the cardinal number of X, k be the length of the longest word in X, and m be the number of symbols in Σ. Instructions from lines 1–5 can be realized in time O(kn). Then the recurrence relation for T can be directly derived from our recursive algorithm: T(k, m, n) = O(kn) + m(O(kn) + T(k − 1, m, n/m)), which, for m ≥ 2, leads to T(k, m, n) = O(kmn log n). In order to recall the necessary results in the simplest way, we ought to define the right and left languages of a state q. For the state q ∈ Q of a DFA A = (Q, Σ, δ, s, F) we consider the two languages: →L(q) = {w ∈ Σ∗: δ(q, w) ∈ F}, ←L(q) = {w ∈ Σ∗: δ(s, w) = q}. Thus, the right language of a state q, →L(q), is the set of all words spelled out on paths from q to a final state, whereas the left language of a state q, ←L(q), is the set of all words spelled out on paths from the initial state s to q. To present our approaches we recall some notation and results from [16]. Let A = (Q, Σ, δ, s, F) be a deterministic finite automaton. For a subset P ⊆ Q we define the languages R1^P = ⋃_{p∈P} ←L(p), R2^P = ⋃_{p∈P} →L(p).
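A Python rendering of Algorithm 1, under the assumption that states can be keyed by their residual languages (frozensets of suffixes); the function and variable names are ours.

def build_min_dfa(X, states=None, delta=None, final=None):
    """Build the minimal acyclic DFA of the finite language X (a set of strings)."""
    states = {} if states is None else states      # residual language -> state id
    delta = {} if delta is None else delta
    final = set() if final is None else final
    states[frozenset(X)] = p = len(states)         # number states 0, 1, 2, ...
    if '' in X:
        final.add(p)
    for a in sorted({w[0] for w in X if w}):       # symbols that actually occur
        U = {w[1:] for w in X if w and w[0] == a}  # the residual a^{-1}X
        if frozenset(U) in states:
            delta[(p, a)] = states[frozenset(U)]   # reuse an existing state
        else:
            delta[(p, a)] = build_min_dfa(U, states, delta, final)[0]
    return p, states, delta, final

start, states, delta, final = build_min_dfa({'ax', 'ay', 'az', 'bx', 'by', 'bz'})
print(len(states), final)        # 3 states; the residual containing '' (state 2) accepts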
Theorem 1. [16] Let A be the minimal DFA for a language L and assume that we can write L = L1L2. Then L = R1^P R2^P, where P ⊆ Q is defined by P = {p ∈ Q: (∃w ∈ L1) δ(s, w) = p}. Furthermore, we know that Li ⊆ Ri^P, i = 1, 2. By definition, a non-empty subset P ⊆ Q is a decomposition set (for a regular language L) if L = R1^P R2^P. The decomposition L = R1^P R2^P is referred to as the decomposition of L induced by the decomposition set P.
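To make the decomposition concrete, the sketch below computes R1^P and R2^P for a finite language by replaying every word through the DFA; the automaton is the three-state example used later in Section 3.1, and the helper itself is our own illustration.

def decomposition(words, delta, start, P):
    """Collect the prefixes that reach a state of P (R1^P) and the suffixes
    readable from P to acceptance (R2^P) over a finite language."""
    R1, R2 = set(), set()
    for w in words:
        q = start
        for i in range(len(w) + 1):
            if q in P:
                R1.add(w[:i])
                R2.add(w[i:])
            if i < len(w):
                q = delta[(q, w[i])]
    return R1, R2

delta = {(0, 'a'): 1, (0, 'b'): 1, (1, 'x'): 2, (1, 'y'): 2, (1, 'z'): 2}
L = {'ax', 'ay', 'az', 'bx', 'by', 'bz'}
R1, R2 = decomposition(L, delta, 0, {1})
print(sorted(R1), sorted(R2))                    # ['a', 'b'] ['x', 'y', 'z']
print({u + v for u in R1 for v in R2} == L)      # True: P = {1} is a decomposition set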
3 An Inference Algorithm
Let Σ be an alphabet, and let X, Y ⊂ Σ∗ be the nonempty sets of examples and counterexamples of an unknown context-free language. Our method of grammatical inference is based on a local search procedure ([19]) and is presented as Algorithm 2. The repetition counter t is a small integer constant (in the experiments, t = 20 was chosen). The basic components of the local search are: constructing an initial feasible solution (used in line 6); the choice of a neighbourhood for the grammar and a method for searching it (used in lines 7 and 8). These two subroutines are described in the following subsections. After the searching phase (lines 3–12), unnecessary variables and rules are removed from the resultant grammar specified by min_rules. This phase (line 13) can be done in a straightforward way. First, for every variable (except V0, which is the start symbol) check whether it can be removed (together with all rules which contain it) without affecting the acceptance of the words in X. If so, remove it along with all rules which contain it. If not, do not remove anything. Second, repeatedly remove an unnecessary rule as long as, after the removal, the grammar still accepts all words in X. It is worth emphasising that the last phase does not affect the words in Y, i.e. both removals can only decrease the number of accepted words.
Algorithm 2. local search for GI(X, Y)
1: min_size := ∞
2: min_rules := ∅
3: for iter := 1 to t do
4:    P := ∅
5:    V := empty map
6:    initialG(X, P, V) {-- sets up rules P and variables V}
7:    while improve(Y, P, V) ≠ ‘no’ do
8:       (P, V) := improve(Y, P, V)
9:    let s be the size of P
10:   if s < min_size then
11:      min_size := s
12:      min_rules := P
13: remove from min_rules unnecessary variables and rules
14: return G = (variables in min_rules, Σ, min_rules, V0)

3.1 An Initial Grammar for the Set of Examples
Before we go on to show the procedure initialG, let us introduce its fundamental step—a splitting operation. Let L ⊂ Σ∗ be a finite language. The language is said to possess a splitting if it can be written as the union of two languages, one of which is a catenation of two nontrivial languages (we call {ε} a trivial language): L = AB ∪ C, A, B, C ⊂ Σ∗, A, B ≠ {ε}. It is desirable to have the language AB as large as possible due to the nature of the procedure initialG: the larger the |AB| achieved, the smaller is the initial grammar that is generated. Please notice that the operation of obtaining sets U, V ⊆ Σ∗ such that L = UV has a close connection with splitting and is called a decomposition of the language [16,26,27]. Suppose we wish to find a splitting for a language L ⊂ Σ∗, |L| ≥ 1, L ≠ {ε}. First, a minimal acyclic DFA M = (Q, Σ, δ, s, F) such that L(M) = L is constructed. Next, because of Theorem 1, some subsets P of Q are checked for the maximization of |R1^P R2^P| (R1^P = A, R2^P = B). Our language splitting function is presented as randomized Algorithm 3. In every step of the essential part of the algorithm (lines 14–18), we are faced with adding a state j to a set P (line 16). The state j is selected uniformly at random from the set J (line 15) including all states from the set Q − P which will allow us to obtain a better splitting. This is a randomized algorithm and it does not guarantee optimal splitting sets (i.e. AB = L, C = ∅) or even feasible splitting (i.e. AB ⊆ L, A, B ≠ {ε}, C = L − AB). Such a procedure which completes in a fixed amount of time (as a function of the input size) but allows a certain probability of error is called a Monte Carlo algorithm. Example. Let us consider the language L = {ax, ay, az, bx, by, bz}.
Algorithm 3. split(L)
1: Q := empty map
2: δ is unspecified
3: s := 0
4: F := ∅
5: build_minDFA(L)
6: iter_limit := max(min(|L|, |Q|), 10)
7: let ℓ = ℓ1ℓ2 · · · ℓk (ℓi ∈ Σ) be the longest word in L
8: (A, B, C) := ({ℓ1}, {ℓ2 · · · ℓk}, L − {ℓ})
9: best_size := 1
10: for iter := 1 to iter_limit do
11:    select uniformly at random q ∈ Q
12:    P := ∅
13:    J := {q}
14:    while J ≠ ∅ do
15:       select uniformly at random j ∈ J
16:       add to P a state j
17:       J := {q′ ∈ Q − P: (|R1^{P∪{q′}} R2^{P∪{q′}}| > |R1^P R2^P|)
18:             ∧ (R1^{P∪{q′}} ≠ {ε}) ∧ (R2^{P∪{q′}} ≠ {ε})}
19:    if |R1^P R2^P| > best_size and R1^P ≠ {ε} and R2^P ≠ {ε} then
20:       best_size := |R1^P R2^P|
21:       (A, B, C) := (R1^P, R2^P, L − R1^P R2^P)
22: return (A, B, C)
[Figure: a three-state DFA; state 0 goes to state 1 on a, b, and state 1 goes to the accepting state 2 on x, y, z.]
Fig. 1. A minimal acyclic DFA for L
A minimal acyclic DFA for L is shown in Fig. 1. It has the set of states Q = {0, 1, 2}. If the set P = {1} has been created, the splitting L = R1^P R2^P = {a, b} × {x, y, z} will be obtained (and later we will have a rule V0 → Vi Vj in an initial grammar). As for complexity issues, let us assume that the size of an alphabet is a small integer constant m ≥ 2. The assumption is very realistic since, in the most practical of grammatical inference settings—except natural language processing where an alphabet is the set of words—Σ does not exceed the Latin alphabet (and possibly digits and, rarely, punctuation characters). Additionally, let us assume that the words of a language L are random, so that the creation of a minimal acyclic DFA takes time O(kn log n). The number of states, |Q|, is O(kn). Thus, it can be readily verified that the running time of Algorithm 3 is O(k²n⁴). The main procedure for generating an initial grammar is presented as Algorithm 4. The following facts are helpful in proving the subsequent theorem:
Algorithm 4. initialG(X, P, V)
1: V[X] := the consecutive number i from (0, 1, 2, . . .)
2: if |X| = 1 or X ⊆ Σ ∪ {ε} then
3:    add to P the rule Vi → s for every s ∈ X
4: else
5:    (A, B, C) := split(X)
6:    a := if A ∈ V then V[A] else initialG(A)
7:    b := if B ∈ V then V[B] else initialG(B)
8:    add to P the rule Vi → Va Vb
9:    if C ≠ ∅ then
10:      c := if C ∈ V then V[C] else initialG(C)
11:      add to P the rule Vi → Vc
12: return i
– Before the execution of initialG, the set P and the map V are empty (see lines 4–5 of Algorithm 2).
– In the ‘else’ section (lines 5–11) |X| > 1 and X has at least one word ℓ such that |ℓ| ≥ 2. As a consequence, the function split (line 5) returns sets A, B, C for which AB ∪ C = X, A, B ≠ {ε}, C = X − AB hold.
– To every set U in V is associated a number i (V[U] = i) such that a grammar variable Vi ‘represents’ the set U.
– V0 is the start symbol (see line 14 of Algorithm 2) and V[X] = 0.
– As usual, let k be the length of the longest word in X, |X| = n. The number of generated variables does not exceed T(k, n) = 1 + T(k − i, n − j) + T(i, n − j) + T(k′, j), where 1 ≤ i < k, 0 ≤ j < n, 0 ≤ k′ ≤ k, T(1, n) = T(k, 1) = 1, which is a finite integer. Thus, by implication, the algorithm has the stop property. In fact, it may be proved by induction on k + n that for k, n ≥ 1 the inequality T(k, n) ≤ kn − 1 holds.
Theorem 2. Let G be a context-free grammar determined by rules P after the execution of initialG(X, P, V) for a finite, nonempty language X ⊂ Σ∗, and initially empty P, V. Then L(G) = X.
Proof. The conclusion of the theorem can be written as follows: for an arbitrary x ∈ Σ∗, V0 ⇒∗ x iff x ∈ X. (Only-if part) We assume x = x1x2 . . . xr, r ≥ 0, is derived from V0 and will prove x ∈ X. Let ϕ be the function induced by: ϕ(ε) = {ε}; for a ∈ Σ, ϕ(a) = {a}; for i = 0, 1, . . ., ϕ(Vi) = U, where V[U] = i; for a sentential form α = w1w2 . . . ws, ϕ(α) = ϕ(w1)ϕ(w2) . . . ϕ(ws). Every production rule A → α in the set P has the property that ϕ(A) ⊇ ϕ(α). Hence, for every derivation α ⇒ β in G we have ϕ(α) ⊇ ϕ(β). Let us consider the sequence of derivations: V0 ⇒ α1 ⇒ α2 ⇒ . . . ⇒ αp ⇒ x1x2 . . . xr. Since X = ϕ(V0) ⊇ ϕ(α1) ⊇ ϕ(α2) ⊇ . . . ⊇ ϕ(αp) ⊇ ϕ(x1x2 . . . xr) = {x}, x must be a member of the set X. (If part) Now, we assume x is an element of the set X and prove V0 ⇒∗ x. In fact, we may prove by induction on k + n, where n is the size of ϕ(Vi) ∈ V and k
is the length of the longest word in ϕ(Vi), a stronger theorem: if u ∈ ϕ(Vi) then Vi ⇒∗ u. Basis: We use k + n = 1 as the basis. Because n ≥ 1, let |ϕ(Vi)| = 1; that is, ϕ(Vi) = {u} = {ε}. Obviously we have Vi ⇒ u, because there is the production rule (added in line 3) Vi → u. Induction: Suppose that |ϕ(Vi)| = n, the length of the longest word in ϕ(Vi) is k, k + n > 1, and that the statement of the theorem holds for all sets ϕ(Vj) ∈ V such that 1 ≤ fj + gj < k + n, where fj is the length of the longest word in ϕ(Vj) and |ϕ(Vj)| = gj. For a set ϕ(Vi), there are two cases to consider: n = 1 ∨ ϕ(Vi) ⊆ Σ ∪ {ε} and n > 1 ∧ ϕ(Vi) ⊄ Σ ∪ {ε}. If n = 1 or ϕ(Vi) ⊆ Σ ∪ {ε}, line 3 was executed so this is a trivial case. If n > 1 and ϕ(Vi) ⊄ Σ ∪ {ε}, lines 5–11 were executed so we have ϕ(Vi) = AB ∪ C, where the lengths fa and fb of the longest word in A and in B are less than k, |A| ≤ n and |B| ≤ n. As for C, the length fc of the longest word is less than or equal to k, but |C| < n. If u ∈ C, we have directly Vi ⇒ Vc, where ϕ(Vc) = C and 1 ≤ fc + |C| < k + n. We invoke the inductive hypothesis to claim that Vc ⇒∗ u. On the other side, if AB ∋ u = u1u2 . . . ur, r ≥ 0, we have the rule Vi → Va Vb, where ϕ(Va) = A, ϕ(Vb) = B, u1u2 . . . up ∈ A and up+1up+2 . . . ur ∈ B. Because of the inequalities 1 ≤ fa + |A| < k + n and 1 ≤ fb + |B| < k + n, we again invoke the inductive hypothesis to claim that Va ⇒∗ u1u2 . . . up and Vb ⇒∗ up+1up+2 . . . ur. Then there is a derivation of u from Vi, namely Vi ⇒ Va Vb ⇒∗ u1u2 . . . up Vb ⇒∗ u1 . . . up up+1 . . . ur = u.
3.2 The Definition of a Neighbourhood
In this subsection, we will describe the function ‘improve’ (see lines 7–8 in Algorithm 2), which—in a local search process—improves the current grammar by reducing the number of variables and rules. This reduction leads in turn to an increase in the number of accepted words and this is why, in most cases, we get a grammar that generates an infinite language. So as to avoid the acceptance of a counterexample, the improvement process is controlled by means of Y. According to the local search idea, after starting at some initial feasible solution (an initial grammar in our algorithm), the subroutine ‘improve’ is used to search for a better solution in its neighbourhood. Thus, we have to choose a ‘good’ neighbourhood for the problem. Our choice is guided by intuition, because very little theory is available as a guide. A grammar (a current feasible solution) is represented by only the set of rules P and the map V—whose values are the indices of grammar variables—the reason being that the alphabet and the start symbol are immutable. Definition 1. Let I and J be such sets of words that V[I] = i, V[J] = j, i < j, and I ∩ J ≠ ∅. The neighbour of a grammar represented by P and V is a grammar represented by P′ and V′, where P′ can be obtained from P by the substitution of Vi for all Vj, whereas V′ can be obtained from V by removing I along with i as well as J along with j and by the addition of V′[I ∪ J] = i. A supplementary condition for the neighbourhood is that the new grammar should not accept any word from the counterexamples Y.
Algorithm 5. improve(Y, P, V)
1: if |V| = 1 then
2:    return ‘no’
3: for (I, i) ∈ V do
4:    for (J, j) ∈ V do
5:       if i < j and I ∩ J ≠ ∅ then
6:          P′ := ∅
7:          for (A → α) ∈ P do
8:             A′ → α′ := substitute Vi for Vj in A → α
9:             P′ := P′ ∪ {A′ → α′}
10:         if a grammar induced by P′ does not accept any y ∈ Y then
11:            V′ := V without V[I] and V[J]
12:            V′[I ∪ J] := i
13:            return (P′, V′)
14: return ‘no’
This merging of the variables is the quintessence of the function ‘improve’, which is presented as Algorithm 5. Let k be the length of the longest word in X ∪ Y. The time requirement for membership queries (the CYK method [10]) in line 10 is O(|Y|k³). Since there are at most kn variables (n = |X|), the whole function costs O(k³n³ + k⁵n²|Y|). Now it is easy to see that the time complexity of Algorithm 2 is polynomially bounded in terms of the input size.
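A sketch of the neighbour move of Definition 1 under a rule encoding of our own (heads and variables as integers, bodies as tuples of symbols); Algorithm 5 additionally rejects the move when the new grammar accepts a counterexample, which is omitted here.

def merge_variables(rules, i, j):
    """Replace every occurrence of variable j by variable i in a set of
    (head, body) rules, producing the rules of the neighbouring grammar."""
    ren = lambda s: i if s == j else s
    return {(ren(head), tuple(ren(s) for s in body)) for head, body in rules}

rules = {(0, (1, 2)), (1, ('a',)), (2, ('b',))}
# After merging variables 1 and 2 the rules become {(0, (1, 1)), (1, ('a',)), (1, ('b',))}.
print(merge_variables(rules, 1, 2))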
4 Experimental Results
In all experiments, we used an implementation of the algorithms written in Python. The interpreter ran on an Intel Xeon 5120, 1.86 GHz Dual Core processor under the Linux operating system with 3 GB RAM. As regards the language membership problem, we took advantage of the Esrapy¹ library, which is intended for parsing any context-free grammar and is very easy to use. The generation of all 14 samples used in the experiments is described in the following subsection. We checked the correctness of all the inferred grammars.
4.1 Generation of the Problem Sets
The benchmark is composed of fourteen languages (#x(w) denotes the number of x’s in the word w):
– L1: any word without an odd number of consecutive a’s after an odd number of consecutive b’s (regular),
– L2: any word on {a, b} without more than two consecutive a’s (regular),
– L3: any word with an even number of a’s and an even number of b’s (regular),
– L4: (aa)+(bbb)+ (regular),
– L5: aᵐbⁿ, 1 ≤ m ≤ n (not regular),
¹ http://sifter.org/~simon/esrapy/
– L6: balanced parentheses (not regular),
– L7: regular expressions over the letters a, b (not regular),
– L8: {w: w is a palindrome and w ∈ {a, b}{a, b}+} (not regular),
– L9: {w: w ∈ {a, b}+ and #a(w) = #b(w)} (not regular),
– L10: {w: w ∈ {a, b}+ and 2#a(w) = #b(w)} (not regular),
– L11: the language of Łukasiewicz (S → aSS; S → b) (not regular),
– L12: {aᵐbᵐcⁿ: m, n ≥ 1} over Σ = {a, b, c} (not regular),
– L13: {acᵐ: m ≥ 1} ∪ {bcᵐ: m ≥ 1} over Σ = {a, b, c} (regular),
– L14: the string representations of binary trees storing letter a or b in every node which has a parent, and c in the root (not regular).
The first four languages have been used in [5,7]. The inference of these regular languages is known to be a difficult problem. The next six languages were considered by Nakamura and Matsumoto [18], L11 was considered by Eyraud et al. [6], while L12 and L13 were taken by Sakakibara [23] as the unknown context-free languages to be learned. We believe that the last language is one of the ‘hardest’ and, in view of the possibility of future comparisons with other methods, the examples and counterexamples of L14 are given below. The target finite samples, which had dozens of words each, were constructed as follows. Let Zi = ⋃_{k=1}^{Ki} (Li ∩ Σᵏ), where Ki is a positive integer. |Xi| words, chosen randomly from the set Zi, constituted examples. Let z ∈ Zi and y ∈ Σ∗ be words which differ by a few letters—as a consequence of a swap, insertion or deletion. |Yi| words y ∈ Yi, y ∉ Li, 1 ≤ |y| ≤ Ki, generated randomly in this way, constituted counterexamples. The exact number of examples and counterexamples, and the length boundary (Ki) for particular languages Li, are given in Table 1. A sample for L14 is:
X14 = {nanbnancn, nanbnbncn, nbnanancn, nbnbncnbn, nbncnbnan, ncnanbnbn, nbncn, nbncnan, nbnancn, ncnbnbnbn}
Y14 = {anccnbnan, cbnancacn, nbbncnbnan, naanbncnbn, nbanbnb, nnananbncn, nbncbcan, nancnnnn, ncnbbn, ncabanban, ncanban, ncbabn, ncbnca, ncnannanbn, ncnbanbnbn, ncnbbc, nnbncnnn, nannbn, nbn, nnnbbcnbn}
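A sketch of this counterexample construction; the interpretation of 'swap' as exchanging two positions, and all function names, are our own assumptions.

import random

def perturb(word, alphabet):
    """One random edit of a nonempty word: exchange two positions, insert a symbol,
    or delete one."""
    k = random.randrange(len(word))
    op = random.choice(['swap', 'insert', 'delete'])
    if op == 'swap':
        j = random.randrange(len(word))
        w = list(word); w[k], w[j] = w[j], w[k]
        return ''.join(w)
    if op == 'insert':
        return word[:k] + random.choice(alphabet) + word[k:]
    return word[:k] + word[k + 1:]

def counterexamples(Z, in_language, alphabet, how_many, max_len):
    """Keep perturbed words that fall outside the language and inside the length bound."""
    Z, Y = sorted(Z), set()
    while len(Y) < how_many:
        y = perturb(random.choice(Z), alphabet)
        if 1 <= len(y) <= max_len and not in_language(y):
            Y.add(y)
    return Y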
4.2 Performance Results
It emerged that Algorithm 2 is suitable for the data sets. The computational results with generated samples are presented in table 1. The column captioned L contains the number of the language. In the next two columns, captioned |X| and |Y |, the cardinalities of examples and counterexamples are given. The boundary of the length of words, K, is given in the fourth (and twelfth) column. Next, we have the size |G| of an inferred grammar—defined as the total number of symbols on the right sides of all rules—and the number |P | of production rules. The algorithm has been run several times, N , until a satisfactory grammar
Table 1. Samples and inferred grammar characteristics and CPU time of computations

L    |X|  |Y|  K   |G|  |P|  N   τ
1    20   20   5   16   10   2   320
2    20    5   5   19   11   3   370
3    15   30   6   14    8   3    83
4    20   20  20   22   11   1   609
5    15   40  10   16   10   4   301
6    20   20  10    8    5   1   317
7    40   40  12   23   14   4   857
8    15   15   6   22   15   5   690
9    20   40   8   12    7   1   565
10   40   40  10   14    8   1  1562
11   10   10  10    6    4   1    54
12   10   60  12   27   12   3   371
13   10   10  10    7    5   1    21
14   10   20  10   16   10   2    54
has been obtained. The average CPU time (in seconds) of one execution of the inference algorithm is given in the column captioned τ . As can be seen from the table, small enough grammars can be inferred by means of our local search method in a reasonable time. The results would seem to suggest that our algorithm is a good alternative to methods investigated by other researchers. What is so surprising is that in many cases it is sufficient to take only the small number of random examples and counterexamples. In Sakakibara’s work [23], in contrast, all examples of length up to 10 and all counterexamples of length up to 20 were given for L12 , similarly all examples of length up to 6 and all counterexamples of length up to 12 were given for L13 . An inferred grammar for L14 is: V0 → V1 V2 V1 → n
5
V2 → V3 V1 V3 → V8 V20 | c | V12 V8
V8 → a | b V12 → V3 V1
V20 → V1 V3
Conclusions
In this paper, we were interested in the identification of a formal language based on finite samples of words, called examples and counterexamples. We therefore asked the following question: given two sets of words, X and Y , is it possible to find a grammar G such that X ⊆ L(G) and Y ∩ L(G) = ∅? As stated, the inference problem has many solutions among which we are searching for short ones. For regular languages, this problem has many theoretical results and practical methods [17]. In order to address the context-free case, we have defined a grammar’s neighbourhood and a subroutine for generating an initial grammar. On this foundation, we have built an local search inference algorithm. Our experiments showed that our algorithm works satisfactorily for standard benchmarks as well as for ‘harder’ languages.
References 1. Angluin, D.: An application of the theory of computational complexity to the study of inductive inference. Doctoral Thesis. UMI Order Number: AAI7704361, University of California, Berkeley (1976)
228
W. Wieczorek
2. Angluin, D.: Queries and concept learning. Mach. Learn. 2, 319–342 (1998) 3. Clark, A.: Learning deterministic context free grammars: The Omphalos competition. Mach. Learn. 66, 93–110 (2007) 4. Du, D.-Z., Ko, K.-I.: Problem Solving in Automata, Languages, and Complexity. John Wiley & Sons, Chichester (2001) 5. Dupont, P.: Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: the GIG Method. In: Carrasco, R.C., Oncina, J. (eds.) ICGI 1994. LNCS, vol. 862, pp. 236–245. Springer, Heidelberg (1994) 6. Eyraud, R., Higuera, C., Janodet, J.: LARS: A learning algorithm for rewriting systems. Mach. Learn. 66, 7–31 (2007) 7. Giordano, J.Y.: Grammatical inference using tabu search. In: Miclet, L., de la Higuera, C. (eds.) ICGI 1996. LNCS (LNAI), vol. 1147, pp. 292–300. Springer, Heidelberg (1996) 8. Gold, E.M.: Language identification in the limit. Inform. and Control 10, 447–474 (1967) 9. Gold, E.M.: Complexity of automaton identification from given data. Inform. and Control 37, 302–320 (1978) 10. Grune, D., Jacobs, C.J.: Parsing Techniques: a Practical Guide. Ellis Horwood (2008) 11. Higuera, C.: A bibliographical study of grammatical inference. Pattern Recognition 38, 1332–1348 (2005) 12. Hopcroft, J.E., Ullman, J.D.: Introduction to automata theory, languages and computation. Addison-Wesley, Reading (1979) 13. Lankhorst, M.M.: Breeding Grammars: Grammatical Inference with a Genetic Algorithm. Comp. Science Report CS-R9401, C.S. Department, Univ. of Gronigen, The Netherlands (1994) 14. Lothaire, M.: Algebraic Combinatorics on Words. In: Encyclopedia of Mathematics and its Applications, Cambridge, vol. 90 (2002) 15. Lucas, S.M., Reynolds, T.J.: Learning deterministic finite automata with a smart state labelling evolutionary algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1063–1074 (2005) 16. Mateescu, A., Salomaa, A., Yu, S.: On the decomposition of finite languages. Turku Centre for Computer Science Technical Report No. 222 (1998) 17. Miclet, L.: Grammatical inference. In: Syntactic and Structural Pattern Recognition, Theory and Applications, pp. 237–290. World Scientific, Singapore (1990) 18. Nakamura, K., Matsumoto, M.: Incremental learning of context free grammars based on bottom-up parsing and search. Pattern Recognition 38, 1384–1392 (2005) 19. Papadimitriou, C.H., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Dover Publications, Mineola (1998) 20. Pitt, L., Warmuth, M.: The minimum consistent DFA problem cannot be approximated within any polynomial. J. Assoc. Comput. Mach. 40, 95–142 (1993) 21. Ryan, C., Colins, J., O’Neill, M.: Grammatical evolution: Evolving programs for an arbitrary language. In: Banzhaf, W., Poli, R., Schoenauer, M., Fogarty, T.C. (eds.) EuroGP 1998. LNCS, vol. 1391, pp. 83–95. Springer, Heidelberg (1998) 22. Sakakibara, Y.: Recent advances of grammatical inference. Theoretical Computer Science 185, 15–45 (1997) 23. Sakakibara, Y.: Learning context-free grammars using tabular representations. Pattern Recognition 38, 1372–1383 (2005)
A Local Search Algorithm for Grammatical Inference
229
24. Tsoulos, I.G., Lagaris, I.E.: Grammar inference with grammatical evolution (2005), http://www.cs.uoi.gr/~ lagaris/papers/PREPRINTS/meta_grammars.pdf 25. Valiant, L.G.: A theory of the learnable. Communications of the ACM 27, 1134– 1142 (1984) 26. Wieczorek, W.: An algorithm for the decomposition of finite languages. Logic Journal of IGPL, 12 p. (2009), doi:10.1093/jigpal/jzp032 27. Wieczorek, W.: Metaheuristics for the Decomposition of Finite Languages. In: Recent Advances in Intelligent Information Systems, pp. 495–505 (2009) ISBN 97883-60434-59-8 28. Yokomori, T., Kobayashi, S.: Inductive learning of regular sets from examples. In: Proceedings of International Workshop on Rough Sets and Soft Computing, pp. 570–577 (1994)
Polynomial-Time Identification of Multiple Context-Free Languages from Positive Data and Membership Queries Ryo Yoshinaka Minato Discrete Structure Manipulation System Project, Erato, Japan Science and Technology Agency [email protected]
Abstract. This paper presents an efficient algorithm that identifies a rich subclass of multiple context-free languages in the limit from positive data and membership queries by observing where each tuple of strings may occur in sentences of the language of the learning target. Our technique is based on Clark et al.’s work (ICGI 2008) on learning of a subclass of context-free languages. Our algorithm learns those context-free languages as well as many non-context-free languages.
1
Introduction
Most approaches for investigating learnable classes of languages assume a fixed form of grammars or automata and then design a learner which computes conjectures in that form. Clark and Eyraud’s work [1] of learning substitutable contextfree languages ( cfls) is based on a quite different approach, where the target languages are defined by a purely language theoretic closure property and have no grammatical characterization. Their idea behind is to identify a syntactic category with the set of phrases belonging to that category and moreover with the set of contexts in which those phrases may occur. Then they use contextual information for determining nonterminal symbols and production rules. The literature has already got several results in this line, where the phrase-context relation plays a crucial role. The discussion in this paper should be put in this line of research, and it can particularly be seen as a unification of techniques from Clark et al. [2] and Yoshinaka [3]. Yoshinaka [3] has presented efficient algorithms that identify special kinds of multiple context-free languages ( mcfls) in the limit from positive data based on the technique by Clark and Eyraud [1] for learning substitutable cfls. Mcfls are a natural extension of cfls and they form a representative class of mildly contextsensitive languages. Since some natural language phenomena were found not to be context-free, the notion of mildly context-sensitive languages was proposed for better describing natural languages. It is an important and challenging issue
He is concurrently working in Graduate School of Information Science and Technology, Hokkaido University.
J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 230–244, 2010. c Springer-Verlag Berlin Heidelberg 2010
Polynomial-Time Identification of MCFLs from Positive Data
231
in grammatical inference to find an efficiently learnable class of mildly contextsensitive languages. While substitutable context-free languages are not very expressive, Clark et al. [2] have shown that a much richer subclass of cfls is efficiently identifiable in the limit from positive data with the aid of an oracle for membership queries. Actually their learning algorithm runs on contextual binary feature grammars, which are another generalization of context-free grammars. Though their algorithm seems to be able to learn some non-context-free languages, they focus on a sufficient condition on cfls for being learnt and it is not discussed what kind of context-sensitive languages are learnt by their algorithm. In spite of several virtues of contextual binary feature grammars [4], the merit of the formalism is not very clear from a learning theoretical point of view. This paper proposes algorithms that identify special sorts of mcfls in the limit from positive data and membership queries which update their conjecture in polynomial time in the size of the given positive data. Our discussion is based on Clark et al.’s [2], but includes an expansion of the class of learnable languages and a refinement of the conditions for learnability. Our algorithms learn complex languages like { am bn cm dn | 1 ≤ m ≤ n } and { ww | w ∈ Σ + }.
2 2.1
Preliminaries Basic Definitions and Notations
The set of non-negative integers is denoted by N and this paper will consider only numbers in N. The cardinality of a set S is denoted by |S|. If w is a string over an alphabet Σ, |w| denotes its length. ∅ is the empty set and λ is the empty string. Σ ∗ denotes the set of all strings over Σ. We write Σ + = Σ ∗ − {λ}, Σ k = { w ∈ Σ ∗ | |w| = k } and Σ ≤k = { w ∈ Σ ∗ | |w| ≤ k }. Any subset of (over Σ). If L is a finite language, its size is defined Σ ∗ is called a language as L = |L| + w∈L |w|. An m-word is an m-tuple of strings and we denote the set of m-words by (Σ ∗ )m . Similarly we define (·)∗ , (·)+ , (·)≤m . Any m-word is called a multiword. Thus (Σ ∗ )∗ denotes the set of all multiwords. For w = w1 , . . . , wm ∈ (Σ ∗ )m , |w| denotes its length m and w denotes its size m + 1≤i≤m |wi |. We use the symbol , assuming that ∈ / Σ, for representing a “hole”, which is supposed to be replaced by another string. We write Σ for Σ ∪ {}. A string x over Σ is called an m-context if x contains m occurrences of . m-contexts are also called multicontexts. For an m-context x = x0 x1 . . . xm with x0 , . . . , xm ∈ Σ ∗ and an m-word y = y1 , . . . , ym ∈ ∗ m (Σ ) , we define x y = x0 y1 x1 . . . yn xn and say that y is a sub-multiword of x y. Note that is the empty context and we have y = y for any y ∈ Σ ∗ . For L ⊆ Σ ∗ and p ≥ 1, we define ∗ S ≤p (L) = { y ∈ (Σ + )≤p | x y ∈ L for some x ∈ Σ }, ∗ C ≤p (L) = { x ∈ Σ | x y ∈ L for some y ∈ (Σ + )≤p },
232
R. Yoshinaka
and for y ∈ (Σ ∗ )∗ , we define ∗ | x y ∈ L }. L/y = { x ∈ Σ
Computation of S ≤p (L) and C ≤p (L) can be done in O(L2p ) time if L is finite. 2.2
Linear Regular Functions
Let us suppose a countably infinite set Z of variables disjoint from Σ. A function f from (Σ ∗ )m1 × · · · × (Σ ∗ )mn to (Σ ∗ )m is said to be linear regular, if there is α1 , . . . , αm ∈ ((Σ ∪ { zij ∈ Z | 1 ≤ i ≤ n, 1 ≤ j ≤ mi })∗ )m such that each variable zij occurs exactly once in α1 , . . . , αm and f (w1 , . . . , wn ) = α1 [z := w], . . . , αm [z := w] for any w i = wi1 , . . . , wimi ∈ (Σ ∗ )mi with 1 ≤ i ≤ n, where αk [z := w] denotes the string obtained by replacing each variable zij with the string wij . We say that f is λ-free if no αk from α1 , . . . , αm is λ. If zij always occurs left of zi(j+1) in α1 . . . αm for 1 ≤ i ≤ n and 1 ≤ j < mi , f is said to be non-permuting. The rank of f is defined to be rank(f ) = n and the size of f is size(f ) = α1 , . . . , αm . Example 1. Among the functions defined below, where a, b ∈ Σ, f is not linear regular, while g and h are linear regular. Moreover h is λ-free and non-permuting. f (z11 , z12 , z21 ) = z11 z12 , z11 z21 z11 , g(z11 , z12 , z21 ) = z12 , z11 bz21 , λ, h(z11 , z12 , z21 ) = a, z11 bz21 , z12 . Lemma 1. For any language L ⊆ Σ ∗ and any u1 , . . . , un , v 1 , . . . , v n ∈ (Σ ∗ )∗ such that |ui | = |v i | and L/ui ⊆ L/vi for all i, we have L/f (u1 , . . . , un ) ⊆ L/f (v1 , . . . , v n ) for any non-permuting linear regular function f . Proof. Let mi = |ui | = |v i |. Suppose that x ∈ L/f (u1 , . . . , un ), i.e., x f (u1 , . . . , un ) ∈ L. The following inference is allowed: x f (m1 , u2 , . . . , un ) ∈ L/u1 ⊆ L/v1 =⇒ x f (v 1 , u2 , . . . , un ) ∈ L =⇒ x f (v 1 , m2 , u3 , . . . , un ) ∈ L/u2 ⊆ L/v2 =⇒ x f (v 1 , v 2 , u3 , . . . , un ) ∈ L =⇒ . . . =⇒ x f (v 1 , . . . , v n ) ∈ L. Hence x ∈ L/f (v1 , . . . , v n ). 2.3
Multiple Context-Free Grammars
A multiple context-free grammar ( mcfg) is a tuple G = Σ, Vdim , F, P, I, where – Σ is a finite set of terminal symbols,
Polynomial-Time Identification of MCFLs from Positive Data
233
– Vdim = V, dim is the pair of a finite set V of nonterminal symbols and a function dim giving a positive integer, called a dimension, to each element of V , – F is a finite set of linear regular functions,1 – P is a finite set of rules of the form A → f (B1 , . . . , Bn ) where A, B1 , . . . , Bn ∈ V and f ∈ F maps (Σ ∗ )dim(B1 ) × · · · × (Σ ∗ )dim(Bn ) to (Σ ∗ )dim(A) , – I is a subset of V and all elements of I have dimension 1. Elements of I are called initial symbols. We note that our definition of mcfgs is slightly different from the original [5], where functions from F may delete some arguments (some variable zij may be absent in α1 , . . . , αm in the definition of f in Section 2.2) and grammars have exactly one initial symbol. In fact this modification does not change their generative capacity. We will simply write V for Vdim if no confusion occurs. If a rule has a function f , then its right hand side must have rank(f ) occurrences of nonterminals by definition. If rank(f ) = 0 and f () = v, we may write A → v instead of A → f (). If rank(f ) = 1 and f is the identity, we may write A → B instead of A → f(B), where dim(A) = dim(B). The size G of G is defined as G = |P | + ρ∈P size(ρ) where size(A → f (B1 , . . . , Bn )) = size(f ) + n + 1. For all A ∈ V we define L(G, A) to be the smallest subset of (Σ ∗ )dim(A) such that if A → f (B1 , . . . , Bn ) is a rule and wi ∈ L(G, Bi ) for i = 1, . . . , n, then f (w1 , . . . , wn ) ∈ L(G, A). Those series of recursive steps for defining elements of L(G, A) are called derivations. The language L(G) generated by G means the set { w ∈ Σ ∗ | w ∈ L(G, S) with S ∈ I }, which is called a multiple context-free language ( mcfl). Two grammars G and G are equivalent if L(G) = L(G ). We denote by G(p, r) the collection of mcfgs G whose nonterminals are assigned a dimension at most p and whose functions have a rank at most r. Then we define L(p, r) = { L(G) | G ∈ G(p, r) }. We also write G(p, ∗) = r∈N G(p, r) and L(p, ∗) = r∈N L(p, r). The class of context-free grammars is identified with G(1, ∗) and G(1, 2) corresponds to the class of context-free grammars in Chomsky normal form. Thus L(1, 2) = L(1, ∗). Example 2. Let G be the mcfg Σ, V, F, P, {S} over Σ = {a, b, c, d} whose rules are π1 : S → f (A, B) with f (z11 , z12 , z21 , z22 ) = z11 z21 z12 z22 , π3 : A → a, c, π2 : A → g(A) with g(z1 , z2 ) = az1 , cz2 , π4 : B → h(B) with h(z1 , z2 ) = z1 b, z2 d,
π5 : B → b, d,
where V = {S, A, B} with dim(S) = 1, dim(A) = dim(B) = 2, and F consists of f , g, h and the constant functions appearing in the rules π3 and π5 . For example, a, c ∈ L(G, A) by π3 , aa, cc ∈ L(G, A) by π2 , b, d ∈ L(G, B) by π5 and aabccd ∈ L(G, S) by π1 . We have L(G) = { am bn cm dn | m, n ≥ 1 }. 1
We identify a function with its name for convenience.
234
R. Yoshinaka
If all functions of an mcfg G are λ-free, non-permuting, we say that G is λ-free, non-permuting, respectively. In fact every mcfg in G(p, r) has an equivalent λfree and non-permuting one in G(p, r) modulo λ [5, 6]. We assume without loss of generality that all mcfgs are λ-free and non-permuting in this paper. Seki et al. [5] and Rambow and Satta [7] have investigated the hierarchy of mcfls. Proposition 1 (Seki et al. [5], Rambow and Satta [7]). For p ≥ 1, L(p, ∗) L(p + 1, ∗). For p ≥ 2, r ≥ 1, L(p, r) L(p, r + 1) except for L(2, 2) = L(2, 3). For p ≥ 1, r ≥ 3 and 1 ≤ k ≤ r − 2, L(p, r) ⊆ L((k + 1)p, r − k). Theorem 1 (Seki et al. [5], Kaji et al. [8]). Let p and r be fixed. It is decidable in O(G2 |w|p(r+1) ) time whether w ∈ L(G) for any mcfg G ∈ G(p, r) and w ∈ Σ ∗ . 2.4
Multiple Context-Free Grammars with Finite Kernel Property
Now we introduce subclasses of mcfls that will be our learning targets. Definition 1. Let G = Σ, V, F, P, I ∈ G(p, r). If each A ∈ V has an element v A ∈ L(G, A) such that we have L(G)/v A ⊆ L(G)/w for all w ∈ L(G, A), then we say that G has the finite kernel property ( fkp). We let KG(p, r) denote the class of mcfgs with the fkp in G(p, r), and KL(p, r) the corresponding class of languages. When p = 1 and r ≥ 2, this definition is equivalent to Clark et al.’s [2] definition for cfgs in Chomsky normal form. It is not difficult to see that KL(1, 2) = r∈N KL(1, r) and that KG(1, 1) includes all regular languages [2]. Example 3. Let us consider the grammar G ∈ G(2, 1) with the following rules: S → f (A) with f (z1 , z2 ) = z1 z2 , A → g(A) with g(z1 , z2 ) = z1 b, z2 d, A → h(A) with h(z1 , z2 ) = az1 b, cz2 d, A → b, d, where S is the only initial symbol of G. The generated language is L(G) = { am bn cm dn | m < n }. For each am bn , cm dn ∈ L(G, A) with m < n, we have L(G)/bn , dn = { ai bj1 bj2 ci dj3 dj4 | i ≤ n + j1 + j2 = n + j3 + j4 }, L(G)/am bn , cm dn = { ai bj ci dj | i ≤ n − m + j }
for m ≥ 1.
Thus we have L(G)/abb, cdd ⊆ L(G)/am bn , cm dn for any m, n ∈ N with m < n. Clearly L(G)/w = for any w ∈ L(G, S). Therefore G ∈ KG(2, 1). Another example with the fkp is { wcwc | w ∈ {a, b}∗ } ∈ KL(2, 1). On the other hand, the language { an bn | n ≥ 1 } ∪ { an b2n | n ≥ 1 } does not have the fkp.
Polynomial-Time Identification of MCFLs from Positive Data
3 3.1
235
Learning Multiple Context-Free Languages with the Finite Kernel Property Hypotheses
Hereafter we arbitrarily fix two natural numbers p ≥ 1 and r ≥ 1. Our learning algorithm A(p, r) computes grammars in G(p, r) using positive data and membership queries. Let L∗ ⊆ Σ ∗ be the target language belonging to KL(p, r). The hypothesized grammar is defined by three parameters K ⊆ S ≤p (L∗ ), X ⊆ C ≤p (L∗ ) and L∗ . K and X are finite sets computed from given positive data. Of course A(p, r) cannot take L∗ as a part of input, but in fact a finite number of membership queries is enough to construct the following mcfg G r (K, X, L∗ ) = Σ, V, F, P, I. The set of nonterminal symbols is V = K and we will write [[v]] instead of v for clarifying that it means a nonterminal symbol (indexed with v). The dimension dim([[v]]) is |v|. The set of initial symbols is I = { [[w]] | w ∈ K and w ∈ L∗ }, where each element of I has dimension 1. The rules of P are divided into the following two types: – (Type I) [[v]] → f ([[v 1 ]], . . . , [[v n ]]), if 0 ≤ n ≤ r, v, v 1 , . . . , v n ∈ K and v = f (v 1 , . . . , v n ) for f λ-free and non-permuting; – (Type II) [[u]] → [[v]], if L∗ /u ∩ X ⊆ L∗ /v ∩ X and u, v ∈ K, where rules of the form [[v]] → [[v]] are of Type I and Type II at the same time, but they are anyway meaningless. F is the set of λ-free and non-permuting functions that appear in the definition of P . We want each nonterminal symbol [[v]] to derive v. Here the construction of I appears to be trivial: initial symbols derive elements of L∗ that appear in K. This property is realized by the rules of Type I. For example, for p = r = 2 and K = S ≤2 ({ ab}), one has the following rules π1 , . . . , π5 of Type I that have [[a, b]] on its left hand side: π1 : [[a, b]] → a, b, π2 : [[a, b]] → fa ([[b]])
with
fa (z) = a, z,
π3 : [[a, b]] → fb ([[a]]) with fb (z) = z, b, π4 : [[a, b]] → g([[a]], [[b]]) with g(z1 , z2 ) = z1 , z2 , π5 : [[a, b]] → [[a, b]], where π1 indeed derives a, b, while π5 is meaningless. Instead of deriving a, b directly by π1 , one can derive it by two steps with π3 and π6 : [[a]] → a (or π2 and π7 : [[b]] → b), or by three steps by π4 , π6 and π7 . One may regard application of rules of Type I as a decomposition of the multiword that appears on its left hand side. It is easy to see that there are finitely many rules of Type I, because K is finite and nonterminals on the right hand side of a rule are all
236
R. Yoshinaka
λ-free sub-multiwords of that on the left hand side. If the grammar had only rules of Type I, then it should derive all and only elements of I. Now we explain the intuition behind rules of Type II. Suppose that L∗ /u ⊆ L∗ /v. This means that L∗ is closed under substituting v for u, and such substitution is realized by the rule [[u]] → [[v]] of Type II. However the algorithm cannot check whether L∗ /u ⊆ L∗ /v in finitely many steps, while it can approximate this relation with L∗ /u ∩ X ⊆ L∗ /v ∩ X by membership queries, because X is finite. Clearly L∗ /u ⊆ L∗ /v implies that L∗ /u ∩ X ⊆ L∗ /v ∩ X, but the inverse is not true. We say that a rule [[u]] → [[v]] of Type II is wrong (with respect to L∗ ) if L∗ /u L∗ /v. It might often happen that L∗ /u ∩ X = L∗ /v ∩ X and in this case we have symmetric rules [[u]] → [[v]] and [[v]] → [[u]] and then trivially they derive the same set of multiwords. One can merge those nonterminals to compact the grammar. Computation process of rules of Type II can be handled by a collection of matrices, called observation tables. For each dimension m ≤ p, we have an observation table Tm . Let Km and Xm be the sets of m-words from K and mcontexts from X, respectively. The rows of the table Tm are indexed with just elements of Km and the columns are indexed with elements of Xm . For each pair u, v ∈ Km , to compare the sets L∗ /u ∩ X and L∗ /v ∩ X, one needs to know whether x v ∈ L∗ or not for all of x ∈ Xm . The membership of x v is recorded in the corresponding entry of the observation table with the aid of a membership query. By comparing the entries of the rows corresponding to u and v, one can determine whether the grammar should have the rule [[u]] → [[v]]. Note that the initial symbols are determined by K and L∗ and the rules of Type I are constructed solely by K, while X is used only for determining rules of Type II. As Clark et al.’s [2] learning algorithm does, our algorithm A(p, r) monotonically expands K and X, where the set I and rules of Type I monotonically increase, while wrong rules of Type II may be deleted. Example 4. Let p = 2, r = 1, L = { am bn cm dn | 1 ≤ m ≤ n } ∈ L(2, 1), K = {aabbccdd, aabb, ccdd, aab, ccd, abb, cdd, ab, cd}) and X = {, ac, ˆ = G 1 (K, X, L) has the following bd, abcd}. The computed grammar G rules of Type I: π0 : [[aabbccdd]] → f0 ([[aabb, ccdd]]) with f0 (z1 , z2 ) = z1 z2 , π1 : [[aabb, ccdd]] → fabcd ([[ab, cd]]) with fabcd (z1 , z2 ) = az1 b, cz2 d, π2 : [[aabb, ccdd]] → fbd ([[aab, ccd]]) with fbd (z1 , z2 ) = z1 b, z2 d, π3 : [[aabb, ccdd]] → fac ([[abb, cdd]]) with fac (z1 , z2 ) = az1 , cz2 , π4 : , [[aabbccdd]] → (f0 ◦ fabcd )([[ab, cd]])), π5 : , [[aabbccdd]] → (f0 ◦ fbd )([[aab, ccd]]), π6 : , [[aabbccdd]] → (f0 ◦ fac )([[abb, cdd]]), π7 : [[aab, ccd]] → fac ([[ab, cd]]), π80 : [[abb, cdd]] → fbd ([[ab, cd]]), π10 : [[aabb, ccdd]] → aabb, ccdd, π9 : [[aabbccdd]] → aabbccdd, π11 : [[aab, ccd]] → aab, ccd, π12 : [[abb, cdd]] → abb, cdd, π13 : [[ab, cd]] → ab, cd Here we ignore all meaningless rules of the form [[v]] → [[v]].
Polynomial-Time Identification of MCFLs from Positive Data
237
We have the following observation tables T1 and T2 :
T1 aabbccdd 1
T2 ac bd abcd aabb, ccdd 1 0 1 1 ab, cd 1 0 1 1 aab, ccd 0 0 1 0 abb, cdd 1 1 1 1
which induce the following rules of Type II: ρ1 : [[ab, cd]] → [[abb, cdd]], ρ3 : [[aabb, ccdd]] → [[ab, cd]],
ρ2 : [[aab, ccd]] → [[ab, cd]], ρ4 : [[ab, cd]] → [[aabb, ccdd]].
ˆ does not have the rules symmetric Note that while ρ3 and ρ4 are symmetric, G to ρ1 nor ρ2 . ˆ for any m, n if 1 ≤ m ≤ n. Trivially Let us see that am bn cm dn ∈ L(G) ˆ ab, cd ∈ L(G, [[ab, cd]]) by π13 . By repeatedly applying the rules π1 and ρ4 , ˆ [[ab, cd]]) for all m ≥ 1. Then alternating apone gets am bm , cm dm ∈ L(G, ˆ [[ab, cd]]) for plications of the rules π8 and ρ1 give us am bn , cm dn ∈ L(G, m n m n ˆ all n ≥ m. The rules ρ3 and π0 give a b c d ∈ L(G, [[aabbccdd]]). Since ˆ we conclude that am bn cm dn ∈ L(G). ˆ On [[aabbccdd]] is an initial symbol of G, m n m n ˆ the other hand, it is not hard to see that a b c d ∈ L(G) if m > n. Before giving the overall structure of the algorithm A(p, r), we discuss properties of G r (K, X, L), where L can be an arbitrary language over Σ. 3.2
Properties of Hypotheses
Lemma 2. Let G = G r (K, X, L) and G = G r (K , X, L). If K ⊆ K , then L(G) ⊆ L(G ). Proof. By definition, every rule of G is also a rule of G and every initial symbol of G is also an initial symbol of G .
Lemma 3. Let G = G r (K, X, L) and G = G r (K, X , L). If X ⊆ X , then L(G ) ⊆ L(G). Proof. Clearly G and G share the same set of initial symbols and the same rules of Type I. If G has a rule [[u]] → [[v]] of Type II, we have L/u ∩ X ⊆ L/v ∩ X . By X ⊆ X , we have L/u ∩ X ⊆ L/v ∩ X and thus [[u]] → [[v]] is a rule of G, too.
∗ is said to be fiducial on K ⊆ (Σ ∗ )∗ with respect to L Definition 2. X ⊆ Σ if L/u ∩ X ⊆ L/v ∩ X implies L/u ⊆ L/v for any u, v ∈ K.
By definition G = G r (K, X, L) has no wrong rules of Type II if and only if X is fiducial on K. In order to get rid of wrong rules from the conjecture, our algorithm wants a fiducial set. If X is fiducial on K with respect to L and X ⊆ X , then X is also fiducial on K and we have G r (K, X, L) = G r (K, X , L). The following lemma is easy.
238
R. Yoshinaka
Lemma 4. Every pair of K and L admits a finite fiducial set X whose cardinality is at most |K|2 . Proof. For each pair u, v ∈ K such that L/u L/v, there is x ∈ L/u − L/v.
By collecting all such x, a finite fiducial set XK on K is obtained. Lemma 5. Suppose that X is fiducial on K with respect to L and let G = G r (K, X, L). If w ∈ L(G, [[v]]), then L/v ⊆ L/w for any v ∈ K. Proof. We show the lemma by induction on the derivation of w. Suppose that the fact w ∈ L(G, [[v]]) is due to a rule of Type I of the form [[v]] → f ([[v 1 ]], . . . , [[v n ]]) and w = f (w1 , . . . , wn ) for wi ∈ L(G, [[v i ]]) with i = 1, . . . , n. By induction hypothesis, we have L/vi ⊆ L/wi for i = 1, . . . , n. Applying Lemma 1, we get L/v ⊆ L/w. Suppose that the fact w ∈ L(G, [[v]]) is due to a rule of Type II of the form [[v]] → [[u]] and w ∈ L(G, [[u]]). By the presence of the rule, we have v, u ∈ K and L/v ∩ X ⊆ L/u ∩ X, which implies L/v ⊆ L/u by the fiduciality of X. Together with the induction hypothesis that L/u ⊆ L/w, we get L/v ⊆ L/w.
Corollary 1. If X is fiducial on K with respect to L, then L(G) ⊆ L for G = G r (K, X, L). Proof. If w ∈ L(G), then G has an initial symbol [[v]] such that w ∈ L(G, [[v]]) and v ∈ L. Then ∈ L/v ⊆ L/w by Lemma 5, i.e., w ∈ L.
Corollary 1 establishes that a fiducial set indeed ensures that the grammar does not overgeneralize. We now turn our attention to a condition for preventing undergeneralization. Definition 3. A finite set K ⊆ (Σ ∗ )≤p is said to be a p, r-kernel of a language ∗ of multicontext. L ⊆ Σ ∗ , if L ⊆ L(G r (K, X, L)) for any finite set X ∈ Σ Lemma 6. K ⊆ Σ ≤p is a p, r-kernel of L if and only if L = L(G r (K, X∗ , L)) for a fiducial set X∗ on K. Proof. The “only if” part is trivial by Definition 3 and Corollary 1. Conversely suppose that L = L(G r (K, X∗ , L)) for a fiducial set X∗ on K. Let [[u]] → [[v]] be a rule of Type II from G r (K, X∗ , L). Then, for an arbitrary ∗ X ⊆ Σ , L/u ∩ X∗ ⊆ L/v ∩ X∗ =⇒ L/u ⊆ L/v =⇒ L/u ∩ X ⊆ L/v ∩ X. Thus every rule of G r (K, X∗ , L) belongs to G r (K, X, L), too. That is, L = L(G r (K, X∗ , L)) ⊆ L(G r (K, X, L)).
Proposition 2. A language L has a finite p, r-kernel if and only if L ∈ KL(p, r). ˆ = G r (K, X, L) Proof. First suppose that L has a finite p, r-kernel K and let G ˆ by Lemma 6. where X is a fiducial set X on K with respect to L, where L = L(G)
Polynomial-Time Identification of MCFLs from Positive Data
239
ˆ has the fkp. For each nonterminal symbol [[v]] of G, ˆ Lemma 5 We show that G claims that v ∈ L(G, [[v]]) satisfies Definition 1. Conversely suppose that we have an mcfg G = Σ, V, F, P, I0 ∈ KG(p, r) such that L(G) = L. Let v A ∈ L(G, A) satisfy Definition 1. We show that K = { v A | A ∈ V } ∪ { f (v B1 , . . . , v Bn ) | A → f (B1 , . . . , Bn ) ∈ P } ∗ is a p, r-kernel of L. Let GX = G r (K, X, L) for X ⊆ Σ . By induction on derivation we show that if w ∈ L(G, A) then w ∈ L(GX , [[v A ]]). Then in particular when w ∈ L(G, A) with A ∈ I0 , we have w ∈ L(GX , [[vA ]]), where [[vA ]] ∈ I by vA ∈ L(G). That is, w ∈ L(GX ). Suppose that w ∈ L(G, A) due to A → f (B1 , . . . , Bn ) ∈ P , w i ∈ L(G, Bi ) and w = f (w1 , . . . , wn ). We have w i ∈ L(GX , [[v Bi ]]) by induction hypothesis. By definition u = f ([[v B1 ]], . . . , [[v Bn ]]) ∈ K ∩ L(G, A). We have L/vA ⊆ L/u by the assumption. Hence GX has the rules [[u]] → f ([[v B1 ]], . . . , [[v Bn ]]) of Type I and [[v A ]] → [[u]] of Type II, and thus w ∈ L(GX , [[v A ]]).
3.3
Learning Algorithm
Our algorithm A(p, r), shown in Algorithm 1, wants a p, r-kernel of the target language L∗ and a fiducial set on the kernel, from which it can compute a grammar generating the target language. Expanding X infinitely causes no problem, because it is used only for removing wrong rules. On the other hand, expansion of K means increment of nonterminal symbols and production rules, which should not happen infinitely many times. Thus A(p, r) expands K only when it knows that K is not a p, r-kernel of L∗ , i.e., when it observes that it is undergenerating. Algorithm 1. A(p, r) Data: A sequence of strings w1 , w2 , · · · ∈ L∗ ; membership oracle O Result: A sequence of mcfgs G1 , G2 , · · · ∈ G(p, r) ˆ := G 0 (K, X, L∗ ); let D := ∅; K := ∅; X := ∅; G for n = 1, 2, . . . do let D := D ∪ {wn }; X := C ≤p (D); ˆ then if D L(G) let K := S ≤p (D); end if ˆ = G r (K, X, L∗ ) as Gn ; output G end for
ˆ is such that L∗ L(G), ˆ then the learner Lemma 7. If the current conjecture G ˆ will discard G at some point. ˆ is given to the learner. Proof. At some point, some element from L∗ − L(G)
ˆ L∗ , then the learner will discard G ˆ at some point. Lemma 8. If L(G)
240
R. Yoshinaka
ˆ L∗ implies that X is not fiducial on K by Corollary 1. Proof. The fact L(G) ˆ That is, G has a wrong rule [[u]] → [[v]] of Type II, where L∗ /u L∗ /v. At some / L for some point the learner will have D ⊆ L∗ such that x u ∈ D and x v ∈ x ∈ L∗ /u − L∗ /v. The wrong rule must be removed.
Theorem 2. The learner A(p, r) identifies KL(p, r) in the limit. Proof. Let L∗ ∈ KL(p, r) be the learning target. By Lemmas 7 and 8, the learner never converges to a wrong hypothesis. It is impossible that the set K is changed infinitely many times, because K is monotonically expanded and sometime K will be a p, r-kernel of L∗ , in which case the learner never updates K any more by Definition 3. Then sometime X will be fiducial on the p, r-kernel K by Lemmas 8 ˆ has no wrong rules of Type II. Thereafter no rules will be added and 4, where G ˆ any more. to or removed from G
ˆ = G r (K, X, L∗ ) The converse does hold. If A(p, r) converges to a grammar G ˆ such that L(G) = L∗ , then K is a p, r-kernel of L∗ by definition. Lemma 9. The algorithm updates its conjecture in polynomial time in D, where the degree of the polynomial is linear in pr. Proof. Let us think about computing G r (K, X, L∗ ). By K ⊆ S ≤p (D) and X ⊆ C ≤p (D), K and X are bounded by a polynomial in D with a degree linear in p. We first estimate the number of rules of Type I that have a fixed [[v]] on the left hand side, which are of the form [[v]] → f ([[v 1 ]], . . . , [[v n ]]). Roughly speaking, this is the number of ways to decompose v into sub-multiwords v 1 , . . . , v n with the aid of a linear regular function f . Once one determines where the occurrence of each substring from v 1 , . . . , v n starts and ends in v, the function f is uniquely determined. We have at most pr substrings from v 1 , . . . , v n , hence the number of ways to determine such starting and ending points are at most v2pr . Thus, the number of rules of Type I is at most O(|K|2pr ) where is the maximal size of elements of K. Clearly the description size of each rule is at most O(). One can construct the observation tables by calling the oracle at most |K||X| times. Then one can determine initial symbols and rules of Type II in polynomial ˆ of G ˆ is bounded by a polynomial, whose degree is linear time. Clearly the size G in pr, in D. ˆ it takes O(G ˆ 2 Dp(r+1) ) steps by For checking whether D ⊆ L(G), Theorem 1. All in all, the algorithm updates its conjecture in polynomial time in D, where the degree of the polynomial is linear in pr.
We have assumed that the parameters p and r are known to the learner a priori, but it is not mandatory for identifiability, if we give up the polynomial-time update. By assuming p = r = max{ |w| | w ∈ D }, the algorithm can compute its conjecture.
Polynomial-Time Identification of MCFLs from Positive Data
3.4
241
Slight Enhancement
Our algorithm A(p, r) learns all regular languages and some cfls and mcfls. As Clark et al. [2] have discussed, the language L0 = a+ ∪ { an bn | n > 0 } is not learnt by A(1, r). This is because, to derive an for infinitely many n, the conjectured grammar must have a rule of Type II of the form [[am ]] → [[am+k ]] for some positive integers m and k. However, as bm ∈ L0 /am − L0 /am+k , such a rule should be discarded at some point. The algorithm does not converge to a correct grammar. It is not hard to modify the algorithm A(1, r) to learn L0 by using the fact that $L0 $ = $a+ $ ∪ { $an bn $ | n > 0 } is identifiable by A(1, r), where $ is a new terminal symbol not in Σ. Our alternative algorithm B(p, r) is a slight modification of A(p, r). It behaves as if it wants to learn $L∗ $ instead of the target L∗ by assuming the special marker $ at the head and the tail of each positive example. It is obvious that checking whether or not w ∈ $L∗ $ is possible by using a membership query for L∗ . The new algorithm B(p, r) identifies languages such as L0 (with p = 1, r = 1) and { ww | w ∈ {a, b}+ } (with p = 2, r = 1), which A(p, r) does not learn. Still languages like { an bn | n > 0 } ∪ { an b2n | n > 0 } or { ww | w ∈ {a, b}+ } are not learnt by B(p, r). Example 5. Let L = { w#w | w ∈ {a, b}∗ } ∈ L(2, 1), D = {a#a, b#b} ⊆ L, K$ = S ≤2 ($D$) and X$ = C ≤2 ($D$). The following table is a fragment of the observation table T2 : T2 $#$ #$ $$ a#a b#b $ a, a 1 0 0 0 0 0 b, b 1 0 0 0 0 0 $a, a 0 1 0 0 0 0 $b, b 0 1 0 0 0 0 a, #a 0 0 1 0 0 0 b, #b 0 0 1 0 0 0 $, $ 0 0 0 1 1 0 $a, a$ 0 0 0 1 0 0 $b, b$ 0 0 0 0 1 0 $, # 0 0 0 0 0 1 $a, #a 0 0 0 0 0 1 $b, #b 0 0 0 0 0 1 ˆ = G 1 (K$ , X$ , L) by B(2, 1) generates all the eleThe conjectured grammar G $ ments of L by using the rules of Type II among [[$, #]], [[$a, #a]] and [[$b, #b]], which are not wrong, and the rules of Type I: [[$a, #a]] → fa ([[$, #]]) with fa (z1 , z2 ) = z1 a, z2 a and [[$b, #b]] → fb ([[$, #]]) with fb (z1 , z2 ) = z1 b, z2 b. However some of the rules of Type II inferred from this table are wrong, e.g., the rule [[a, a]] → [[b, b]] and its symmetry. Suppose that the next positive example is aba#aba. Because it is generated by the current conjecture ˆ the algorithm does not expand K, while it finds wrong rules by expanding G, the table:
242
R. Yoshinaka
T2 $a#a$ a#a$ $aa$ ba#ab a, a 1 0 0 0 b, b 0 0 0 0 $a, a 0 1 0 0 $b, b 0 0 0 0 a, #a 0 0 1 0 b, #b 0 0 0 0 $, $ 0 0 0 0 $a, a$ 0 0 0 1 $b, b$ 0 0 0 0 $, # 0 0 0 0 $a, #a 0 0 0 0 $b, #b 0 0 0 0 Then a half of the wrong rules are now removed. If furthermore B(2, 1) gets bab#bab for example, the rest half disappears. Then the rules of Type II that will survive are those among [[$, #]], [[$a, #a]] and [[$b, #b]] (and those among [[#, $]], [[a#, a$]] and [[b#, b$]]), except the trivial ones, i.e., A → A for all nonterminals A. Here the inserted marker $, as well as the center marker #, plays a crucial role.
4
Discussion
Observation tables are a classical technique used in different schemes for computing deterministic finite state automata (e.g., [9, 10]), Rows and columns of an observation table are indexed with (potential) prefixes and suffixes of elements of the target language, respectively, and each entry shows whether the corresponding combination of the prefix and the suffix gives an element of the language. Each row corresponds to a state and if one finds two rows of the same entries, they are identified. If the index of a row is obtained by adding an input symbol on the tail of the index of another row, the corresponding transition rule is inferred. This technique has been extended in several algorithms that learns structures more complex than deterministic finite automata such as context-free grammars [11], regular tree languages [12, 13] and even a non-context-free formalism, multidimensional tree languages [14]. While those extensions take trees as input, a recent study shows that a rich subclass of context-free languages can be learnt from string data by using an observation table [15]. They use substructures as indices of rows and their context as indices of columns like our algorithm. An important difference of our algorithm from those preceding ones is that our algorithm finds an asymmetric relation between two rows that have comparable entries, while the classical techniques merge two rows with identical entries. Two rows with identical entries will be identified as a result of finding the inclusion relations of the both directions. This generalization enables us to learn complex languages like { am bn cm dn | 1 ≤ m ≤ n }.
Polynomial-Time Identification of MCFLs from Positive Data
243
Table 1. Comparison of observation tables for different language classes target regular languages [9, 10] congruential cfls [15] mcfls with fkp rows states nonterminals nonterminals row indices prefixes substrings multiwords column indices suffixes contexts multicontexts key relation equivalence equivalence inclusion between rows
Our algorithm computes a grammar from any observation tables, while the standard technique for learning deterministic finite state automata requires the observation table to be closed and consistent. We may think our observation tables are always closed, because rules of Type I are just decompositions and the row indices are sub-multiword-closed. We may think of two kinds of consistency, g-consistency and f-consistency. Observation tables are said to be gconsistent if the grammar computed from the tables is compatible with all the entries of the tables. Observation tables are said to be f-consistent if one can assume that all the rules of Type II are not wrong without contradicting Lemma 1. More formally, we say that observation tables are f-consistent, if for any u, u1 , . . . , un , v, v 1 , . . . , v n ∈ K such that u = f (u1 , . . . , un ) and v = f (v 1 , . . . , v n ) for some linear regular function f and L/ui ∩ X ⊆ L/vi ∩ X, we have L/u ∩ X ⊆ L/v ∩ X. For example, the first observation table T2 of Example 5 is not f-consistent. The two rows a, a and b, b have the same entries, but the incomparability of the rows $a, a$ and $b, b$ entails that L/a, a and L/b, b are also incomparable by Lemma 1. One can modify our algorithm so that it extends the observation tables until they become f-consistent and g-consistent. The new algorithm is more restrained from asking equivalence queries. However currently we have no detailed mathematical analysis on this modification, which is future work.
Acknowledgement The author is grateful to Alexander Clark, Anna Kasprzik and Takeshi Shibata for valuable comments and suggestions on a draft of this paper. This work was supported in part by Grant-in-Aid for Young Scientists (B-20700124) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References 1. Clark, A., Eyraud, R.: Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research 8, 1725–1745 (2007) 2. Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In: [16], pp. 29–42
244
R. Yoshinaka
3. Yoshinaka, R.: Learning mildly context-sensitive languages with multidimensional substitutability from positive data. In: Gavald` a, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 278–292. Springer, Heidelberg (2009) 4. Clark, A., Eyraud, R., Habrard, A.: A note on contextual binary feature grammars. In: EACL 2009 workshop on Computational Linguistic Aspects of Grammatical Inference, pp. 33–40 (2009) 5. Seki, H., Matsumura, T., Fujii, M., Kasami, T.: On multiple context-free grammars. Theoretical Computer Science 88(2), 191–229 (1991) 6. Kracht, M.: The Mathematics of Language. Studies in Generative Grammar, vol. 63, pp. 408–409. Walter de Gruyter, Berlin (2003) 7. Rambow, O., Satta, G.: Independent parallelism in finite copying parallel rewriting systems. Theor. Comput. Sci. 223(1-2), 87–120 (1999) 8. Kaji, Y., Nakanishi, R., Seki, H., Kasami, T.: The universal recognition problems for parallel multiple context-free grammars and for their subclasses. IEICE Transaction on Information and Systems E75-D(7), 499–508 (1992) 9. Gold, E.M.: System identification via state characterization. Automatica 8(5), 621– 636 (1972) 10. Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75(2), 87–106 (1987) 11. Sakakibara, Y.: Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science 76(2-3), 223–242 (1990) ´ 12. Drewes, F., H¨ ogberg, J.: Learning a regular tree language from a teacher. In: Esik, Z., F¨ ul¨ op, Z. (eds.) DLT 2003. LNCS, vol. 2710, pp. 279–291. Springer, Heidelberg (2003) 13. Besombes, J., Marion, J.Y.: Learning tree languages from positive examples and membership queries. Theoretical Computer Science 382(3), 183–197 (2007) 14. Kasprzik, A.: A learning algorithm for multi-dimensional trees, or: Learning beyond context-freeness. In: [16], pp. 111–124 15. Clark, A.: Distributional learning of some context-free languages with a minimally adequate teacher. In: proceedings of the 10th International Colloquium on Grammatical Inference (2010) (to appear) 16. Clark, A., Coste, F., Miclet, L. (eds.): ICGI 2008. LNCS (LNAI), vol. 5278. Springer, Heidelberg (2008)
Grammatical Inference as Class Discrimination Menno van Zaanen and Tanja Gaustad TiCC, Tilburg University Tilburg, The Netherlands {M.M.vanZaanen,T.Gaustad}@uvt.nl
Abstract. Grammatical inference is typically defined as the task of finding a compact representation of a language given a subset of sample sequences from that language. Many different aspects, paradigms and settings can be investigated, leading to different proofs of language learnability or practical systems. The general problem can be seen as a one class classification or discrimination task. In this paper, we take a slightly different view on the task of grammatical inference. Instead of learning a full description of the language, we aim to learn a representation of the boundary of the language. Effectively, when this boundary is known, we can use it to decide whether a sequence is a member of the language or not. An extension of this approach allows us to decide on membership of sequences over a collection of (mutually exclusive) languages. We will also propose a systematic approach that learns language boundaries based on subsequences from the sample sequences and show its effectiveness on a practical problem of music classification. It turns out that this approach is indeed viable. Keywords: empirical grammatical inference, class discrimination, tf*idf .
1
Introduction
Grammatical inference deals with the learning of languages. The task is typically defined as follows: Given a set of example sequences, find a compact representation of the underlying language of which the sequences are examples. The compact representation is called a grammar, the example sequences are generated from the grammar by a teacher and it is the learner that aims to find the underlying grammar. The field of grammatical inference is often divided into two subfields: formal and empirical grammatical inference [1]. Formal grammatical inference investigates learnability of classes of languages given a particular learning setting. The result of this research is a formal, mathematical proof showing that a certain class or family of languages is learnable (or not) provided the environment corresponds to the requirements of the learning setting. Probably the most famous of these settings is that of identification in the limit [2], but others exist [3]. Here, however, we are more interested in empirical grammatical inference. In contrast to formal grammatical inference, where mathematical proofs are provided on learnability of predetermined classes of languages, empirical grammatical inference deals with learning of languages in situations where the underlying J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 245–257, 2010. c Springer-Verlag Berlin Heidelberg 2010
246
M. van Zaanen and T. Gaustad
grammar or class of grammars is not known. This typically leads to empirical results on naturally occurring data and addresses practical learning situations. In the ideal case, we would like to combine both formal and practical grammatical inference techniques. This means that we know formally that languages can be learned in the setting under consideration and that in practice this is also true. Knowing that a language is learnable formally does not necessarily mean that it is also learnable in practice, due to, for instance, noise, limited amounts of available data, or a (minor) mismatch between the practical and formal learning settings. In this paper, we propose to treat the problem of empirical grammatical inference in a slightly different way. Instead of trying to learn a full, compact representation of the underlying language, we redefine the task to find a representation of the boundary of the language. In many cases, both the learned grammar or the learned boundaries can be applied. For instance, when the learned construction is used to classify sequences into classes (such as inside or outside the language), both representations are equally applicable. In addition to the theoretical specification of our new language learning approach, we describe a practical implementation of the approach. This implementation relies on finding patterns in the shape of subsequences from the example sequences for each of the languages. (In the case of learning sequence membership of one language, negative examples are considered as an alternative language.) The patterns that have high predictive power are selected and are subsequently used to classify new sequences. The paper is structured as follows. Firstly, we will specify the new approach to empirical grammatical inference in more detail, including a discussion of the advantages and disadvantages as well as a description of a practical system. Next, the results of applying the practical system to two data sets are provided. The paper ends with a conclusion.
2
Approach
The research presented in this paper introduces two novelties. First, we redefine the task of grammatical inference as a discrimination task. The new task is to identify the boundary of the underlying language(s) rather than to construct a compact representation of it (in the form of e.g. a grammar). Second, we propose a practical system that identifies patterns that describe language boundaries based on the example sequences. We apply an existing statistical measure to identify patterns that are useful for the identification of the boundary. Both aspects will now be described in more detail. 2.1
Class Discrimination
Languages can be visualized in the sequence space (the space that contains all possible sequences) and are typically described as an area in the shape of a circle or oval, like a Venn diagram. The area contains all sequences that are part of the language and all sequences outside the area are non-members. Typically, the
Grammatical Inference as Class Discrimination
247
aim of grammatical inference is to find a grammar that describes the entire area of the language. Most often, grammatical inference approaches aim to learn a representation in the form of a grammar that fully describes the underlying language from which the example sequences are drawn. The advantage of learning a full description is that this can also be used to generate more sequences in the language, which leads to a proper generalization of the sample sequences. However, grammatical inference settings, such as identification in the limit or PAC learning do not specify that such a full description from a generative point of view is required. In contrast to learning a full description of the language, we propose to find a representation of the line describing the boundary of the language only. Once we know this boundary, essentially we also know which sequences are in the language and which are out, without having an explicit representation of the sequences that are part of the language (which one has in the case of a grammar). Note however, that generating sequences (in addition to the ones known from the learning sample) in the language is non-trivial in this case. When looking for the boundary between languages (where in the case of learning one language L, the other language would be its complement LC ), we do not need to know exactly which sequence is inside the language. We are only interested in sequences that are close to the boundary of the language. This idea of finding a representation of the boundary of the language can be compared to the supervised machine learning methods based on support vector machines (SVMs) [4]. Given a set of training examples belonging to either one of two categories, an SVM model is built that predicts into which category a new, unseen example falls. The model represents the examples as points in the instance space mapped in such a way that the examples from the two categories are divided by a clear margin. Ideally, the boundary falls right in the middle of the margin and this boundary represents the largest distance to the nearest training data points of any class, thereby minimizing the generalization error of the SVM classifier. Unseen examples are mapped into the instance space and, based on which side of the boundary they fall on, their class is predicted. Interestingly, SVMs only rely on examples that are close to the boundary. Examples that are far away from the boundary are not used to build the vector that distinguishes the areas describing the classes. Alternatively, our approach can be seen as being similar to the k-NN (Nearest Neighbor) supervised machine learning approach [5]. Here, just as in the SVM case, the training instances are placed in the instance space. Classification of a particular (unseen) instance is then performed by finding the instances from the training data that are closest to the unseen instance. The assigned class is found by taking a majority vote over all classes of the nearest instances. The boundaries in this situation are computed on the fly. In a way, the k-NN approach does not aim to learn a complete description of the boundary in the sense of a formula describing that boundary. Whereas SVMs aim to learn linear classifiers on a mapped instance space (allowing for non-linear classification), k-NN only computes local boundaries when required
248
M. van Zaanen and T. Gaustad
for classification. At no point in time a complete formal description of the boundary is known (although this can be extracted from the known instances in the instance space if required). With the approach described here, we essentially treat the task of grammatical inference as a discrimination task. Without creating a description of all the sequences in the language, we can still decide for unseen sequences in which area of the sequence space they should be placed. It also means means that such an inference system can be used to distinguish between one or more languages at the same time. The difference there is that boundaries between each of the languages need to be learned. Note that the practical approach we will describe here identifies patterns that can be used to distinguish language membership of example sequences. Each pattern only describes a small part of the language boundary. In that sense, it fits in between SVMs and the k-NN classifiers. The patterns are simple (just like the simple representation used in the SVM context) and each one describes a small part of the boundary, just like a k-NN classifier does. So far, we have not said anything about the properties of the boundaries. For instance, what shape the boundaries should have or whether the boundaries may overlap (allowing sequences to be in multiple languages at the same time). We will discuss some properties of the boundaries in the next section, which describes a practical system. However, more work needs to be done in this area for alternative practical systems. 2.2
tf*idf Pattern Identification
The discussion so far has been quite abstract. It may be unclear exactly how we should find the boundaries between languages or perhaps even how we should describe these boundaries. To show that this abstract idea can actually lead to a practical system, we will propose a working system that is entirely based on the theoretical approach that was described in the previous section. The representation of the boundary between languages we use here consists of subsequences. These are consecutive symbols that occur in the example language sequences that the system received during learning. In fact, for practical purposes, we search for subsequences of a certain length, which means they can be seen as n-grams (with n describing the length of the subsequence). By using n-grams as the representation of our patterns, we explicitly limit the languages we can identify. In fact, using patterns of a specific length, we can learn the boundaries of the family of k-testable languages [6]. This family contains all languages that can be described by a finite set of subsequences of length k. It may be clear that these subsequences of length k correspond well with our patterns of fixed length n. Note, however, that we do not present a formal proof of learnability of this family of languages (which has already been shown before [7]), but we will implicitly assume that the language(s) we are trying to learn are in fact k-testable or if they are not, we will provide an approximation of the language that is k-testable.
Grammatical Inference as Class Discrimination
249
The subsequences we are interested in should help us decide whether an unseen sequence is part of the language (or in the more generic case, it should help us identify which language the sequence belongs to). Therefore, we will use the subsequences as patterns. During testing, the patterns are matched against the to be classified sequence (counting number of occurrences per language). Based on this information, the sequence is classified. For the patterns to be maximally useful, during learning we would like to identify patterns (i.e. subsequences in the shape of n-grams) that are maximally discriminative between languages and that at the same time occur often. To measure the effectiveness and usability of the patterns, we apply a classic statistical measure from the field of information retrieval, namely the “term frequency*inverse document frequency” (tf*idf ) weight [8]. This measure consists of two terms, term frequency (tf ) which measures the regularity and inverse document frequency (idf ) which measures the discriminative power of the pattern. Originally, in the context of information retrieval, the tf*idf weight is used to evaluate how relevant a document in a large document collection is given a search term. In its classic application, tf*idf weights are computed for all documents separately in the collection with respect to a search term. The first part of the tf*idf metric is tf . It is defined as the number of times a given term appears in a document. Simply counting the number of occurrences, will yield a bias towards longer documents. To prevent this, the tf measure is often normalized normalized by the length of the document. This results in the following metric: ni,j tf i,j = (1) k nk,j where ni,j describes the number of occurrences of term ti in document dj . The denominator represents the length of document dj , which is measured as the total number of terms in document dj . The idea behind tf is that when the term ti occurs frequently in certain documents, these documents are considered more relevant to the term than documents with fewer instances. Taking this into the extreme, when no occurrences of the term are found in a document that document is probably not about the topic represented by the term. (In the case of natural language terms, this may not always be true. In fact, this has led to research into, for instance, stemming, pseudo relevance feedback and automatic synonym generation [9].) The second part of the tf*idf is idf . For a given term ti , it is calculated as follows: |D| idf i = log (2) |{d : ti ∈ d}| where |D| is the total number of documents in the collection and |{d : ti ∈ d}| is the number of documents that contain the term ti . The idf measures relevance of a term with respect to the documents. Intuitively, this can be described as follows. On the one hand, terms that occur in all documents are not particularly useful when deciding which document is relevant. On the hand, terms that occur only in one or a few documents are good indicators, as those documents are probably about the term under consideration.
To obtain the tf*idf weight for a particular term, the term frequency tf and inverse document frequency idf are combined:

  tf*idf_{i,j} = tf_{i,j} × idf_i    (3)
The default way of computation of tf*idf provides us with an indication of how relevant a particular document is to a particular term. This metric can be extended, resulting in tf*idf scores for multiple terms. In this case, the tf*idf for all documents is computed for each of the terms. These tf*idf values are then summed and the documents that have the highest tf*idf scores (representing that these documents are most relevant with respect to the terms) are preferred. In the research presented here, we extend the tf*idf metric in a different way. Instead of computing the tf*idf score of a collection of terms (in the sense of a “bag-of-terms”), we want to be able to compute the tf*idf score of a sequence of terms with a fixed order. This corresponds to treating n-grams (a sequence of terms) as if it is a single term. The underlying idea behind using sequences of terms instead of single terms is that we think that sequences are more informative than single terms to determine the boundary between languages (and this will be shown empirically in Section 3). The modification of the computation of the tf*idf weights is rather straightforward. Instead of counting single terms (for instance in the computation of the tf ), n-grams are counted as if they are single terms (with single terms being a specific case where n = 1). For instance, ni,j is the number of occurrences of a particular n-gram ti in document dj . To summarize, during the learning phase, the learner receives example sequences from the languages under consideration. Out of these sequences, all n-gram patterns are extracted and for each of these, the tf*idf score is computed (with respect to each of the languages). Patterns that have a non-zero tf*idf are retained as patterns for classification afterwards. Note that if patterns occur in all languages, their idf will be zero (and the idf will be high if it only occurs in one language). At the same time, if the patterns occur more often, they are considered more important, which increases the overall tf*idf value for that pattern due to a higher tf . During classification, a new, unseen sequence is presented. All patterns are matched against it, leading to a score for each of the languages. This score is calculated by summing the tf*idf scores for each match of a pattern, keeping track of the tf*idf per language. The sequence is then classified into the language that has the highest combined tf*idf value. In Section 3 we will describe experiments performed with fixed length n-grams, but also with n-grams of varying sizes. This brings up an interesting aspect of tf*idf . Shorter patterns (with small n) have a higher likelihood of occurring compared to longer patterns (with large n). This means that the tf*idf will typically be higher for short patterns. To reduce this effect, we multiply each tf*idf score by n, the length of the n-gram. This leads to a higher impact for longer patterns (which, if they can be found in the sequence to be classified, gives more pronounced evidence that the sequence actually belongs to that language).
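A compact sketch of this classification step might look as follows in Python (again an illustration under our own naming, not the original code): every stored pattern that matches the unseen sequence contributes its tf*idf weight, multiplied by its length n, to the score of the language it was learned for, and the language with the highest total wins.

```python
def classify(sequence, patterns_per_language):
    """patterns_per_language: {language: {pattern (tuple of symbols): tfidf_weight}}."""
    scores = {lang: 0.0 for lang in patterns_per_language}
    for lang, patterns in patterns_per_language.items():
        for pattern, weight in patterns.items():
            n = len(pattern)
            occurrences = sum(
                1 for i in range(len(sequence) - n + 1)
                if tuple(sequence[i:i + n]) == pattern
            )
            # Length-weighted contribution: longer matching patterns count more.
            scores[lang] += occurrences * weight * n
    return max(scores, key=scores.get)
```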
2.3 Imperfect Languages and Noise
So far, we have assumed that there is a perfect distinction between the languages. In the simplest case, we consider a language L and its complement LC. This means that all possible sequences come from either L or LC. In practice, the situation may be more difficult.

Firstly, there may be an area in the sequence space that is not described by any language. This happens when the sequence space is not perfectly partitioned. In other words, the sequence space S is not entirely covered by the languages (L1, . . . , Ln): S ⊃ ∪_{i=1}^{n} L_i. In this case, sequences exist that are not a member of any language. The system will decide (perhaps randomly) that the sequence is a member of one of the known languages, because it assumes that the entire sequence space is covered by the languages.

Secondly, there may be an overlap between the languages. For instance, sequences that really belong to L are presented to the learner as sequences from LC or vice versa. If this occurs, the training data contains noise. A major advantage of the use of tf*idf in this system is that if noise occurs in the data, the patterns dealing with the subsequences containing the noise are automatically ignored in the pattern identification phase. This works through the idf component in the tf*idf formula. When noise introduces sequences in the wrong language, the patterns that would otherwise have been found (because they are distinctive for the sequences in a particular language) will now receive a zero idf and hence a zero tf*idf, which results in the pattern being dropped. This allows for a very robust practical system.
3 Empirical Results
To empirically evaluate the effectiveness of the tf*idf pattern identification and discrimination approach to detecting boundaries between languages, we test this approach in two practical experiments. The next section describes the data sets and classification tasks used, followed by an explanation of the data representation.

3.1 Data Sets and Classification Tasks
To evaluate our approach, we compiled two separate data sets from the area of music classification. Both data sets were retrieved from the **kern scores website (http://kern.ccarh.org/) [10]. The two different data sets lead to two different classification tasks.

Firstly, we have a binary class data set containing folksongs. One class (i.e. language) consists of Asian folksongs and the other of European folksongs. Both are taken from the Essen Folksong Collection. This data set is called country. An overview of the data set can be found in Table 1. We will use these data sets to show the feasibility of the approach.
Table 1. Overview of the country data set

Class     Description                                   # of pieces
Asia      Chinese folksongs (4 provinces)                     2,241
Europe    European folksongs (19 countries plus misc)           848
Total                                                         3,089
Music has a fairly limited number of symbols (compared to, for instance, natural language), but the training data is extracted from real world data. Music also has inherent "rules" or restrictions, which we aim to learn here. Furthermore, music allows us to experiment with different representations easily.

The aim of the country classification task is discriminating folksongs. Two classes are distinguished: Asian folksongs and European folksongs. Even though the original data set has more fine-grained classes, we have not tried to further distinguish either collection into sub-classes (e.g. different provinces or countries), as we expect there to be a partial overlap between the songs from different European countries. Intuitively, the country task is relatively easy for several reasons. There are only two classes to classify into (compared to four in the other task). Also, we expect that the difference between Asian and European folksongs will be quite pronounced. However, the musical pieces to be classified are relatively short, which might make identifying and matching patterns, and hence classification, more difficult.

Secondly, we have extracted the musedata selection from the **kern scores website, which contains pieces by four composers: J.S. Bach, A. Corelli, J. Haydn, and W.A. Mozart. We call this data set composer and numerical information on the data set is shown in Table 2.

Table 2. Overview of the composer data set

Class     Description            # of pieces
Bach      chorales and various           246
Corelli   trio sonatas                   247
Haydn     quartets                       212
Mozart    quartets                        82
Total                                    787
In the composer classification task, the system should identify which composer, out of the four composers, composed a given musical piece. The system selects one out of four classes (Bach, Corelli, Haydn, and Mozart). Note that the composers come from different, but overlapping periods. One has to keep in mind that the composer classification task is actually quite difficult. For instance, when people are asked to distinguish between musical pieces from these composers (see e.g. the “Haydn/Mozart String Quartet
Quiz" at http://qq.themefinder.org/), the identification accuracies are only 55% and 57% for Mozart and Haydn respectively. Given these results, we expect this task to be hard for automatic classification as well.

3.2 Data Representation or Features and Patterns
We start with the collections of musical pieces in the humdrum **kern format [11]. This format is a symbolic representation of sheet music. Because we want to identify patterns in the musical pieces, we need to define exactly which aspects of the musical representations are going to be used to define the patterns. We convert the music from the **kern humdrum format to a simpler format describing melody (pitch) and rhythm (duration) only. This information is extracted directly from the humdrum **kern format and converted into a new symbolic representation.

For both pitch and duration, we chose one way of rendering, namely what are typically called absolute representations. Absolute pitch refers to the absolute value (in semitones) of the melody with c = 0 (e.g. d = 2, e = 4, etc.). Similarly, absolute duration gives the absolute duration of a given note (e.g. 2, 16). This absolute representation allows for a one-to-one mapping from the **kern humdrum representation of sheet music to a simple symbolic representation that can be used to learn. We know that alternative representations of symbolic music are possible [12,13] and may perhaps even lead to better results. However, here we have selected a fairly simple representation, which allows us to demonstrate the feasibility of the new language learning approach.

To make the meaning of the n-gram patterns explicit: the patterns with n = 1 correspond to patterns of a single note in a piece of music. When n = 2, the patterns describe two consecutive notes, etc. Other representations of the music may lead to patterns that describe more complex aspects of music (potentially non-consecutive notes or more abstract descriptions of the music). Each piece of music is converted to a sequence of symbols, where each symbol is a combination of the pitch and duration of a single note. This means that each symbol in the representation that is used to find patterns consists of two components (pitch and duration) that are "glued" together, leading to a single symbol.

Starting from the converted sequences of symbols for each of the musical pieces, we combined them into classes. Each class contains all the sequences (i.e. musical pieces) of a single composer or geographical area. These collections of sequences are used as input from which we build various patterns of n-grams as outlined in Section 2. We assume that each composer or geographical area has its own "language" which was used to generate musical pieces. The task is then to learn the boundaries between the languages, which allows us to classify new, unseen musical pieces into the corresponding classes. (Unfortunately, this approach does not
explicitly allow us to generate new music that is similar to existing musical pieces of a particular language or class.)

With respect to the shape of the patterns, we tried n-grams of size n = 1, . . . , 7 and also tried combinations of n-grams of length 1−2, . . . , 1−7. The experiments based on the combinations of n-grams use patterns of n-grams of all the specified lengths combined. Remember that the tf*idf score is multiplied by the length of the n-gram, which means that longer patterns will have more impact in the final score.

The main disadvantage of the current music representation is that only local patterns can be found. For instance, languages that require global information in a pattern (such as the number of symbols in the sequence) simply cannot be identified with the current system using n-grams. This problem might be solved if a more complex representation of the data or a completely different shape of patterns is used. The solution to this problem should, however, be seen as future work.

3.3 Quantitative Results
Table 3 contains the results of applying the tf*idf grammatical inference pattern finding system to the two data sets. The figures describe accuracy (the number of correctly classified musical pieces divided by the total number of classified pieces, in %), combined with the standard deviation (in brackets). All experiments are performed using ten-fold cross-validation.

Table 3. Classification results in % correct (and standard deviation) for the country and composer classification tasks

n-gram size   Country classification   Composer classification
Baseline      73.49 (±1.64)            27.96 (±4.01)
1             62.05 (±1.52)            64.19 (±6.79)
2             87.90 (±2.08)            78.65 (±2.25)
3             95.52 (±1.06)            81.95 (±2.85)
4             95.54 (±1.72)            79.79 (±4.31)
5             94.12 (±2.65)            78.01 (±4.01)
6             91.97 (±2.96)            74.58 (±4.84)
7             90.65 (±2.75)            71.91 (±4.57)
1−2           79.82 (±3.02)            76.75 (±3.93)
1−3           89.33 (±2.84)            81.06 (±3.31)
1−4           92.27 (±1.94)            81.82 (±3.56)
1−5           93.00 (±1.54)            82.07 (±4.25)
1−6           93.13 (±1.48)            81.56 (±3.91)
1−7           93.16 (±1.44)            81.06 (±3.77)
The results clearly show that using tf*idf to identify useful patterns works well for both discriminating between two classes (or languages) and multiple classes (four in our case).
The first figures in the table are majority class baselines. The class occurring most often in the training data is selected and used to classify all test sequences. In the country classification, the Asian class clearly has more pieces (the accuracy is higher than the 50% that is expected with a perfectly balanced data set), whereas in the composer task, the number of instances is more balanced (expected baseline with a perfectly balanced data set would be 25%). Looking at the results of the single size n-grams (the first seven entries following the baseline), we see that the results peak around n = 3 or n = 4. This illustrates that, on the one hand, small patterns, even though occurring frequently, have less discriminative power to classify sequences in classes compared to larger n-gram patterns. On the other hand, large n-gram patterns have high discriminative power, but do not occur enough (and hence are less usable). Hence, the optimum size of the patterns is around length three or four. The story is different when a collection of patterns of varying length is collected and used for classification. The results on the country task are still increasing after n = 1 − 7, but so far the results are worse than the best single n-gram pattern (n = 4). On the composer task, the results of the combination of n-gram patterns peaks at n = 1 − 5. It results in the best score for that task. However, the difference in results comparing n = 1 − 5 against n = 3 is not statistically significant. Overall, the results show that the tf*idf pattern finding system significantly outperforms the majority class baseline. The experiments also show that there seems to be an optimum pattern length regardless of the experiment. This can be explained by considering how the tf*idf metric works.
4 Conclusion
Empirical grammatical inference is typically defined as the task of finding a compact representation (in the shape of a grammar) of a language, given a set of example sequences. Typically, the learned grammar is a full description of the language, often allowing for the generation of additional sequences in the language. The underlying grammar from which the example sequences are generated is often unknown, which means that evaluation of the effectiveness of the empirical grammatical inference system needs to be performed according to the classification of unseen sequences. Here, we modified the task slightly. Instead of finding an explicit grammar for the language, we aim to find a representation of the boundary of the language. Once this boundary is known, it can be used to indicate which sequences should be considered as a member of the language or not. Generation of additional sequences is not directly supported by this representation. The advantage of this view on empirical grammatical inference is that the system can be used to distinguish between one or more languages at the same time. Effectively, the task of grammatical inference is treated as a discrimination task. The situation that is normally seen as the grammatical inference task (learning a representation of one language) can be seen as a one-class discrimination task.
However, the view that is proposed in this paper also allows for the learning of multiple languages simultaneously. The patterns that are learned using this approach, taken together, describe the boundary between languages. Each pattern only describes a small part of the complete boundary. Often, when classifying, only a limited number of patterns is used to decide which language the sequence belongs to.

In addition to the new approach to grammatical inference, we have also proposed a practical system that finds patterns in example sequences. These patterns allow for the classification of new and unseen sequences into languages. Using an extension of the tf*idf metric, the system identifies patterns that both occur often and are helpful in discriminating the sequences. Another advantage of the presented system is that if noise occurs in the data, these sequences are automatically ignored in the pattern identification phase. This allows for a very robust system.

Applying the system to real world data sets yields good results. Two classification tasks (dividing musical data based on geography or era) have been used as experimental cases. Alternative representations of the music may still lead to improvements over the results discussed here, but these experimental results already show that this approach is practically viable.

To fully appreciate the effectiveness of the proposed approach, more experiments need to be performed. Not only should the effectiveness of different representations of the data be investigated, but completely different data sets taken from other domains should be used as well. Furthermore, to get a better idea about the state-of-the-art, the approach should be compared against other grammatical inference systems.

The main disadvantage of the current system is that only local patterns can be found. As such, languages for which global information of a sequence (such as the number of symbols in the sequence) is required cannot be learned with the current system. This problem might be solved using a different, more complex representation of the data or, alternatively, using a completely different type of patterns. This different representation of patterns should then extend the current n-gram patterns and allow for the description of more global information. We consider this problem as future work.
References

1. Adriaans, P.W., van Zaanen, M.M.: Computational grammatical inference. In: Holmes, D.E., Jain, L.C. (eds.) Innovations in Machine Learning. Studies in Fuzziness and Soft Computing, vol. 194. Springer, Heidelberg (2006) ISBN 3-540-30609-9
2. Gold, E.M.: Language identification in the limit. Information and Control 10, 447–474 (1967)
3. de la Higuera, C.: Grammatical inference: learning automata and grammars. Cambridge University Press, Cambridge (2010)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000)
5. Daelemans, W., van den Bosch, A.: Memory-Based Language Processing. Cambridge University Press, Cambridge (2005)
6. García, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 920–925 (1990)
7. García, P., Vidal, E., Oncina, J.: Learning locally testable languages in the strict sense. In: Proceedings of the Workshop on Algorithmic Learning Theory, Japanese Society for Artificial Intelligence, pp. 325–338 (1990)
8. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. University of Glasgow, Glasgow (1979) (printout)
9. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Publishing Company, Reading (1999)
10. Sapp, C.S.: Online database of scores in the humdrum file format. In: Proceedings of the Sixth International Conference on Music Information Retrieval (ISMIR), London, United Kingdom, pp. 664–665 (September 2005)
11. Huron, D.: Humdrum and kern: selective feature encoding. In: Selfridge-Field, E. (ed.) Beyond MIDI: The Handbook of Musical Codes, pp. 375–401. Massachusetts Institute of Technology Press, Cambridge (1997)
12. Conklin, D., Anagnostopoulou, C.: Representation and discovery of multiple viewpoint patterns. In: Proceedings of the 2001 International Computer Music Conference, International Computer Music Association, pp. 479–485 (2001)
13. Geertzen, J., van Zaanen, M.: Composer classification using grammatical inference. In: Proceedings of the MML 2008 International Workshop on Machine Learning and Music held in conjunction with ICML/COLT/UAI 2008, Helsinki, Finland, pp. 17–18 (2008)
MDL in the Limit

Pieter Adriaans1 and Wico Mulder2

1 Theory of Computer Science Group, IVI, University of Amsterdam, Science Park 107, 1098 XG Amsterdam, The Netherlands
2 Logica, Prof. W.H. Keesomlaan, 1183 DJ Amstelveen, The Netherlands
[email protected], [email protected]
Abstract. We show that within the Gold paradigm for language learning an informer for a superfinite set can cause an optimal MDL learner to make an infinite number of mind changes. In this setting an optimal learner can make an infinite number of wrong choices without approximating the right solution. This result helps us to understand the relation between MDL and identification in the limit in learning: MDL is an optimal model selection paradigm, while identification in the limit defines recursion-theoretic conditions for convergence of a learner.
1 Introduction
In a landmark paper Gold [1] introduced the idea of identification in the limit as a paradigm to study language learning. We start with a student and a teacher. At the beginning of the learning process they select a class of languages L. The teacher subsequently selects an element Li ∈ L and starts to produce example sentences from Li. After each example the student is allowed to update his guess for the language the teacher has selected. We expect the teacher to be an informer for the language, i.e. each sentence from Li will be produced in the limit. The class of languages L is considered to be identifiable in the limit from positive information if the student can, for each language in L, reach a stable guess in a finite amount of time on the basis of only positive examples. A well-known border case for this form of learning is the class of so-called superfinite sets. We give an example:

Definition 1. Let L∞ = {a}* and let Lk = {a^j | 0 ≤ j ≤ k}. We define the superfinite class of languages L = {L∞} ∪ {Lk | k ∈ N}.

One can prove ([4], pg. 203) that for the set of finite languages {Lk | k ∈ N} the student can deploy a lazy learning strategy, i.e. take the longest sentence a^k the teacher has produced so far to be an indication of the intended language Lk. As soon as we add L∞ to the set this does not work anymore. If L∞ is the intended language this strategy will lead to an infinite chain of mind changes of the student. On the other hand there is no point in this process where the student can, with certainty, guess that L∞ is the target language. We saw that superfinite sets are not identifiable in the limit from positive information.
Fig. 1. The main idea behind the proof. Suppose the circles define some complexity borders in a sample space. Examples further removed from the center are more complex. In phase 1 the distribution is sparse and we prove that the best MDL model is infinite. In phase 2 the distribution is dense and the best MDL model is finite. In phase 3 the best model is infinite again. This can go on indefinitely.
A totally different paradigm for learning is the so-called Minimum Description Length (MDL) principle [2]: the best theory M to explain the data x is the one that minimizes the sum of 1) the length of the description in bits of M (the model code) and 2) the length of the description in bits of x given M (the data-to-model code). We want to investigate how MDL performs in terms of identification in the limit. As a preliminary exercise we investigate the performance of MDL on super-finite sets. Since MDL is technically an optimal model selection strategy we do not expect MDL to settle for either L∞ or a finite language Lk. In some cases it might make an infinite number of mind changes.
2 Outline of the Proof of the Main Theorem
The prefix-free Kolmogorov complexity of a binary string can (following [3]) be defined as K(x) = min_{i,p} {|i| + |p| : T_i(p) = x}, where i ∈ {1, 2, ...} and p ∈ {0, 1}*. Here |i| is the length of a self-delimiting code of an index (see [3], pg. 79) and T is a universal Turing machine that runs program p after interpreting the index i. The length of |i| is for practical purposes limited by n + 2 log n + 1, where n = |i|. Let the universal distribution be m(x) = 2^{−K(x)}. Let M be the set of prefix-free programs. Using Bayes' law, the optimal computational model under this distribution would be

  M_map(x) = argmax_{M∈M} m(M)m(x|M)/m(x),

which can be rewritten as

  M_map(x) = argmin_{M∈M} −log m(M) − log m(x|M).

Here −log m(M) can be interpreted as the length of the optimal model code in Shannon's sense and −log m(x|M) as the length of the optimal data-to-model code. Using Levin's coding theorem ([3], pg. 273) this can be rewritten as

  M_map(x) = argmin_{M∈M} K(M) + K(x|M).    (1)
This gives optimal two-part code compression of x. We now give the central theorem:
Theorem 1. An informer for a super-finite set can cause an optimal MDL learner to make an infinite number of mind changes.

Outline of proof.

– Suppose the language chosen by the informer is L∞. We need to estimate the optimal model code for M and the optimal data-to-model code for x given M. We can interpret the string a^k as the unary representation of the number k. Model and data set can now be interpreted as (sets of) natural numbers.
– Optimal Model Code: any finite model Lk can be coded as a natural number k, i.e. K(Lk) = log k + O(1). The code for the infinite model is of small constant length and is given by L∞ = {a}*, i.e. K(L∞) = O(1).
– Optimal Data-to-Model Code: the data produced by the informer at time t can be effectively coded as a set of natural numbers. There are two optimal coding techniques:
  1. A self-delimiting list of numbers for sparse sets. Let D be the data set at time t coded in terms of natural numbers. Given the fact that we need no more than 2 log log n additional bits to make the number n self-delimiting, the optimal code length for D is limited by K(D) ≤ Σ_{i∈D} (log i + log log i). Note that in this code the largest sentence produced by the observer is included in self-delimiting form in the data-to-model code. The sparse coding scheme is therefore associated with the selection of L∞ as optimal model, since this adds the least additional bits. The self-delimiting representation does not take into account the mutual information between the elements of D. For sparse sets where the numbers have no mutual information this is optimal.
  2. Subset coding for dense sets. Here the data is encoded as elements of a subset using Newton's binomial formula: K(D) = log d + log m + log C(m, d), where C(m, d) is the binomial coefficient, m = |M| and d = |D|. Since log k! ≈ ∫_1^k log x dx, we can approximate log C(m, d) ≈ ∫_d^m log x dx − ∫_1^d log x dx. Note that in this case we can interpret M as the model with log m as model code. The dense coding is associated with a finite language as optimal model. The subset coding is optimal if the numbers are so dense that there is a lot of mutual information.
– We now have two estimates for the MDL code:

  K(L∞) + K(D|L∞) = O(1) + Σ_{i∈D} (log i + log log i)
  K(Lm) + K(D|Lm) = log m + log d + ∫_d^m log x dx − ∫_1^d log x dx + O(1)
– The potential oscillating behavior of an MDL learner is proved by the following observation: For large enough m there is always a set of natural numbers D such that: K(L∞ ) + K(D|L∞ ) ≈ K(Lm ) + K(D|Lm ) The proof is as follows: suppose that the best MDL code for a certain D is K(Lm )+K(D|Lm ). This implies that the elements of D have a lot of mutual
information. Start adding large new elements a^k where k >> m such that the numbers have low mutual information. Since there are infinitely many sentences this is always possible. Very soon (if you do it right after the first added element) the MDL code K(L∞) + K(D|L∞) starts to be more efficient. Suppose on the other hand that K(L∞) + K(D|L∞) is the best model. Now start to add new elements smaller than a^m to D, where a^m is the biggest sentence you have seen so far. After some point the mutual information will be so big that K(Lm) + K(D|Lm) is a better model. The 'tipping point' is defined by a comparison between the two MDL estimates (a small numerical sketch of this comparison is given after this proof outline):

  O(1) + Σ_{i∈D} (log i + log log i) = log m + log d + ∫_d^m log x dx − ∫_1^d log x dx + O(1)
– This shows that at each point in time the informer has the power to steer an MDL learning process in the direction of L∞ or in the direction of some Lk. Note that the optimal model in practice will often be a mixture of the two approaches, i.e. a dense model for an initial segment and a sparse model for the rest of the data, but this does not affect the main point of the proof: the optimal model for the largest sentences seen so far determines the final model estimate.
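To make the comparison of the two code-length estimates concrete, the following Python sketch evaluates both sides of the tipping-point equation for a given set D of natural numbers, taking m as the largest element. It is only a numerical illustration of the estimates used above; the O(1) terms are ignored and the base-2 logarithm and the example sets are our own choices.

```python
import math

def sparse_code(D):
    """Estimate of K(L_inf) + K(D | L_inf): self-delimiting list of numbers."""
    return sum(math.log2(i) + math.log2(max(math.log2(i), 1)) for i in D)

def integral_log(a, b):
    """Integral of log2(x) dx from a to b (antiderivative: x*log2(x) - x/ln 2)."""
    antiderivative = lambda x: x * math.log2(x) - x / math.log(2)
    return antiderivative(b) - antiderivative(a)

def dense_code(D):
    """Estimate of K(L_m) + K(D | L_m): subset (binomial) coding with m = max(D)."""
    m, d = max(D), len(D)
    return math.log2(m) + math.log2(d) + integral_log(d, m) - integral_log(1, d)

sparse_D = [2 ** k for k in range(2, 12)]   # few, widely spread numbers (little mutual information)
dense_D = list(range(100, 200))             # a contiguous block of numbers (much mutual information)
for D in (sparse_D, dense_D):
    winner = "sparse (L_inf)" if sparse_code(D) < dense_code(D) else "dense (L_m)"
    print(len(D), "elements -> cheaper code:", winner)
```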
3 Discussion
MDL, as a criterion for optimal model selection, is in a way orthogonal to identification in the limit. Note also that in order to generate the infinite number of mind shifts for the student the teacher also has to make an infinite number of changes of strategy. As soon as the teacher uses a probability distribution over the whole of L∞ the MDL learner will with high probability stabilize on this guess early in the process and never change its mind. The fact that this class of languages can be interpreted as consisting of sets of unary representations of natural numbers makes it easy to calculate the MDL scores. It might be difficult to generalize these results to more complex languages.
References

[1] Gold, E.M.: Language Identification in the Limit. Information and Control 10(5), 447–474 (1967)
[2] Grünwald, P.D.: The Minimum Description Length Principle, 570 pages. MIT Press, Cambridge (2007)
[3] Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and Its Applications, 3rd edn. Springer, New York (2008)
[4] Zeugmann, T., Lange, S.: A Guided Tour Across the Boundaries of Learning Recursive Languages. In: Lange, S., Jantke, K.P. (eds.) GOSLER 1994. LNCS (LNAI), vol. 961, pp. 190–258. Springer, Heidelberg (1995)
Grammatical Inference Algorithms in MATLAB

Hasan Ibne Akram1, Colin de la Higuera2, Huang Xiao1, and Claudia Eckert1

1 Technische Universität München, Munich, Germany
{hasan.akram,huang.xiao,claudia.eckert}@sec.in.tum.de
2 Nantes University, Nantes, France
[email protected]
Abstract. Although MATLAB1 has become one of the mainstream languages for the machine learning community, there is still skepticism among the Grammatical Inference (GI) community regarding the suitability of MATLAB for implementing and running GI algorithms. In this paper we present implementation results of several GI algorithms, e.g., RPNI (Regular Positive and Negative Inference), EDSM (Evidence Driven State Merging), and k-testable machines. We show experimentally, based on our MATLAB implementation, that state merging algorithms can successfully be implemented and manipulated using MATLAB in a similar fashion to other machine learning tools. Moreover, we also show that MATLAB provides a range of toolboxes that can be leveraged to gain parallelism, speedups, etc.
1 Introduction
In this paper we focus on two important tasks for GI: learning regular languages from an informant and learning k-testable languages [1] from text. We have implemented the RPNI [2] and EDSM [3] algorithms to investigate the feasibility of running such classes of algorithms in MATLAB. We have also implemented algorithms to learn k-testable languages, which correspond to a subclass of the regular languages. We have followed the notations and algorithms given in Colin de la Higuera's book [4]. The details of the RPNI algorithm can be found in chapter 12, EDSM in chapter 14 and k-testable languages in chapter 11 of the book. In this section we briefly introduce notations and definitions.

1.1 Preliminaries
Definition 1. A Deterministic Finite Automaton is defined as a 6-tuple A = ⟨Σ, Q, q0, FA, FR, δ⟩, where Σ is the alphabet (set of symbols), Q is the finite set of states, q0 ∈ Q is the initial state, FA ⊆ Q is the set of final accepting states, FR ⊆ Q is the set of final rejecting states, and δ is the transition function.
1 MATLAB is a registered trademark of The MathWorks, Inc.
Definition 2. A Prefix Tree Acceptor (PTA) is a tree-like DFA generated by extracting all the prefixes of the samples as states; it accepts only the samples it is built from. A prefix tree, also known as a trie, is an ordered data structure; expressed as a DFA it forms a PTA. Let S be the sample from which we build a PTA. A_PTA = PTA(S) is a DFA that contains a path from the initial state to a final accepting state for each string in S.

To recognize k-testable languages we require a special machine, called a k-testable machine, from which we can build an equivalent DFA.

Definition 3. Given k > 0, a k-testable machine k-TSS is a 5-tuple Zk = ⟨Σ, I, F, T, C⟩ where Σ is the alphabet, I ⊆ Σ^{k−1} is the set of prefixes of length k − 1, F ⊆ Σ^{k−1} the set of suffixes of length k − 1, C ⊆ Σ^k the set of short strings, and T ⊆ Σ^k the set of allowed segments. A k-testable machine k-TSS recognizes only strings that are either exactly in C, or whose prefix of length k − 1 is in I, whose suffix of length k − 1 is in F, and all of whose substrings of length k are in T.
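As an illustration of Definition 2, a prefix tree acceptor can be built from a positive sample with a few lines of Python (a sketch of the idea only; the toolbox itself stores automata as MATLAB matrices, and the dictionary representation below is ours).

```python
def build_pta(positive_sample):
    """Build a PTA as (transitions, accepting) from a list of strings.

    States are identified by the prefixes themselves; the initial state is ''.
    """
    transitions = {}   # (state, symbol) -> next state
    accepting = set()
    for word in positive_sample:
        state = ""
        for symbol in word:
            nxt = state + symbol
            transitions[(state, symbol)] = nxt
            state = nxt
        accepting.add(state)
    return transitions, accepting

transitions, accepting = build_pta(["ab", "abb", "ba"])
print(sorted(accepting))   # ['ab', 'abb', 'ba']
```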
2 State Merging Algorithms
The basic idea of a state merging algorithm for inferring a DFA is to build a PTA from the positive sample (S+) and then to conduct state merging iteratively; each intermediate DFA is verified by examining the negative samples. Only a merge resulting in a DFA that rejects all the negative samples is kept as the current status, otherwise the merge is discarded. This process is repeated until the target automaton is found. Examples of such state merging algorithms are RPNI, EDSM, Blue-Fringe, etc.
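The verification step mentioned above, checking that an intermediate hypothesis still rejects every negative sample, can be sketched as follows (a simplified Python illustration over a dictionary-based DFA, not the toolbox code).

```python
def accepts(transitions, accepting, word, initial=""):
    """Run a (partial) DFA given as {(state, symbol): state} on a word."""
    state = initial
    for symbol in word:
        if (state, symbol) not in transitions:
            return False            # undefined transition: the word is rejected
        state = transitions[(state, symbol)]
    return state in accepting

def consistent_with_negatives(transitions, accepting, negative_sample):
    """A candidate merge is kept only if no negative string becomes accepted."""
    return not any(accepts(transitions, accepting, w) for w in negative_sample)
```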
2.1 RPNI
The RPNI version given in [4] uses two labels for the states in the automaton: red states and blue states. After a series of merges between red states and blue states, and promotions of states (e.g., blue to red), the target DFA is produced. For the details of RPNI please consult chapter 12 of [4]. To reduce the computational overhead, we developed another version of RPNI, Parallel RPNI, which opens multiple sessions for states in the PTA and runs those sessions concurrently, so that if the merge at state i fails, the merges at states {i + 1, i + 2, i + 3, · · ·} may already have been prepared for execution.

2.2 EDSM
RPNI basically performs a greedy search to find the target DFA, meaning that whenever two states are mergeable, they are merged. Obviously there are other options, e.g., choosing a better or even the best merge by means of some heuristic. The EDSM algorithm, introduced by Lang et al. [3], takes such a heuristic into consideration.
3 MATLAB GI Toolbox
In this section we present the MATLAB GI Toolbox, an open source implementation of a set of GI algorithms in MATLAB. The Toolbox offers out-of-the-box implementations of a range of GI algorithms that can be used like every other machine learning tool provided in MATLAB. The fundamental data structures in the MATLAB [5] platform are matrices. Therefore, in our implementation we have used matrices as the primary data structure to represent a DFA. The DFA object contains eight different sets represented as matrices:

– FiniteSetOfStates: the set of finite states (Q) of the DFA; it is stored as an integer vector in MATLAB.
– Alphabets: the set of symbols. It is stored as a cell array where each cell contains a character.
– TransitionMatrix: each column of the matrix corresponds to a symbol a ∈ Σ (Alphabets) and each row corresponds to a state q ∈ Q (FiniteSetOfStates). Each cell of the matrix is a transition δ, e.g., if there is a transition from a state qi to qj via a symbol a, then the corresponding cell for qi and a is marked as qj. Cells with no transition are marked with −1.
– InitialState: the set of initial states, an integer vector which contains only one state.
– FinalAcceptedStates: the set of final accepting states, an integer vector which is a subset of FiniteSetOfStates.
– FinalRejectStates: the set of final rejecting states, an integer vector which is a subset of FiniteSetOfStates.
– RED: the set of red states, an integer vector which is a subset of FiniteSetOfStates.
– BLUE: the set of blue states, an integer vector which is a subset of FiniteSetOfStates.

The input file is given in a format similar to the Abbadingo2 format. However, internally in MATLAB the training dataset is represented as cell arrays [5], each cell containing a character, which is independent of the input file format.
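For concreteness, the matrix-based representation described above can be mimicked as follows (a Python/NumPy sketch of the same idea; the actual toolbox objects are MATLAB matrices and cell arrays, and the names below merely echo the fields listed in the text).

```python
import numpy as np

# Alphabet {a, b} mapped to column indices 0 and 1.
alphabet = {"a": 0, "b": 1}

# TransitionMatrix: rows are states, columns are symbols, -1 marks "no transition".
transition_matrix = np.array([
    [1, 2],    # state 0: a -> 1, b -> 2
    [1, -1],   # state 1: a -> 1
    [-1, 0],   # state 2: b -> 0
])
initial_state = 0
final_accepting = {1}

def accepts(word):
    state = initial_state
    for symbol in word:
        state = transition_matrix[state, alphabet[symbol]]
        if state == -1:
            return False
    return state in final_accepting

print(accepts("aa"), accepts("ab"))   # True False
```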
3.1 Features
The MATLAB GI Toolbox provides the GI algorithms in a modular fashion so that they can be reused to enhance or improve existing GI algorithms. The DFA data structure and the built-in methods can be used for RPNI, Blue-Fringe and EDSM, and can be extended to incorporate other methods, such as genetic algorithm techniques, to optimize the search strategy. Moreover, this toolbox has been made fully compatible with other MATLAB features and toolboxes, which makes it very easy to try out new experiments using other MATLAB toolboxes. It is simple to use the toolbox to do k-fold cross-validation, ROC analysis, etc. using internal MATLAB classes to obtain fast results.
4 Experimental Results
In this section we present our experiments, which were conducted on Gowachin3 data sets, varying the size of the target DFA and the sample size; Table 1 reports the results for RPNI and EDSM.
2 Abbadingo is a DFA learning competition held in 1998. The details about the competition and data format can be found at: http://www-bcl.cs.may.ie/
3 Gowachin: DFA Learning Competition is a test version of a follow-on to Abbadingo One. Artificial datasets for training and testing can be generated using this website: http://www.irisa.fr/Gowachin/
Table 1. MATLAB GI ToolBox executions of RPNI and EDSM with different target DFA sizes and sample sizes. Row a shows the accuracy (%) and row t shows the time cost in seconds.

                 sample size 200    sample size 500    sample size 1000    sample size 5000
DFA size         RPNI     EDSM      RPNI     EDSM      RPNI     EDSM       RPNI     EDSM
3          a     99.17    99        100      99.28     99.33    100        99.17    100
           t     0.49     0.44      0.48     2.25      1.03     1221.6     43.12    23.09
5          a     100      77.94     92.28    100       99.78    99.67      100      99.72
           t     0.50     639.59    511.67   31.84     2.28     26.06      11.10    245.95
10         a     67.39    66.78     100      99.89     99.61    100        100      100
           t     16.09    736.31    2.9      257.79    4.57     464.95     31.47    1421.8
20         a     63.94    52.5      61.67    60.77     99.06    99.33      100      100
           t     13.10    2043.6    180.71   1873.8    21.46    3677.99    99.20    3989.71
Table 2. MATLAB GI ToolBox executions of learning k-testable machines. Column t shows the time cost in seconds, p the precision and r the recall.

     sample size 200        sample size 500        sample size 1000       sample size 5000
k    t      p      r        t      p      r        t      p      r        t      p      r
2    1.16   0.12   0.14     2.98   1      0.58     6.03   1      0.408    31.77  0.5    1
3    1.16   1      0.14     2.84   1      0.579    5.83   1      0.407    31.30  1      1
5    1.17   1      0.04     2.86   1      0.142    5.89   1      0.023    30.78  1      0.543
10   4.58   1      0.002    9.19   NaN    0        16.08  1      0.007    65.06  1      0.002
The results obtained from the MATLAB GI Toolbox were also checked with Gowachin. Table 2 shows experiments on learning k-testable machines. The experiments were run on a machine with two CPUs, each a Quad-Core AMD Opteron™ Processor 2384, cache size 512 KB, memory 66175292 KB. In these experiments, the accuracy results are as expected: bad when an insufficient amount of data is provided.
5 Conclusion and Future Work
The experimental results shown above clearly indicate that MATLAB is perfectly suitable for GI algorithms and experiments, at least for reasonable sizes of datasets. Besides the two state merging algorithms, we have also implemented a learning algorithm for k-testable machines. Moreover, we have implemented a parallel version of RPNI using the MATLAB Parallel Toolbox [5], where we have been able to gain a 10–15% speedup for each additional CPU. Our future plan is to incorporate other GI algorithms, such as L*, OSTIA (for learning transducers), etc. into the toolbox. To the best of our knowledge this is the first open source implementation of GI algorithms in MATLAB. We plan to publish the MATLAB GI ToolBox as an open source library under the MIT License for open source software. The beta version of the MATLAB GI ToolBox can be downloaded from the following link: http://www.sec.in.tum.de/~hasan/matlab/gi_toolbox/
References

1. García, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925 (1990)
2. Oncina, J., García, P.: Identifying regular languages in polynomial time. In: Advances in Structural and Syntactic Pattern Recognition. Series in Machine Perception and Artificial Intelligence, vol. 5, pp. 99–108. World Scientific, Singapore (1992)
3. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 1–12. Springer, Heidelberg (1998)
4. de la Higuera, C.: Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, Cambridge (2010)
5. MathWorks: MATLAB - the language of technical computing (2010), http://www.mathworks.com/products/matlab
A Non-deterministic Grammar Inference Algorithm Applied to the Cleavage Site Prediction Problem in Bioinformatics

Gloria Inés Alvarez1, Jorge Hernán Victoria1, Enrique Bravo2, and Pedro García3

1 Pontificia Universidad Javeriana Cali, Calle 18 118-250, Cali, Colombia
{galvarez,jhvictoria}@javerianacali.edu.co
2 Universidad del Valle, Sede Melendez, Cali, Colombia
[email protected]
3 Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, España
[email protected]
Abstract. We report results on applying the OIL (Order Independent Language) grammar inference algorithm to predict cleavage sites in polyproteins from the translation of Potyvirus genomes. This non-deterministic algorithm is used to generate a group of models which vote to predict the occurrence of the pattern. We built nine models, one for each cleavage site in this kind of virus genome, and report sensitivity, specificity and accuracy for each model. Our results show that this technique is useful to predict cleavage sites in the given task with accuracy rates higher than 95%.
Introduction

Grammar inference is a technique of inductive learning, belonging to the syntactic approach of machine learning. Here we propose an inference algorithm to predict cleavage sites in polyproteins from the translation of Potyvirus genomes. Our goal is to develop an application for the automatic segmentation of polyproteins available in large bioinformatic databases. Often these databases collect sequences which are not segmented, making it difficult to analyse and extract features from a particular segment or from segmented chains. Furthermore, this real problem allows us to evaluate the behaviour of the OIL algorithm in real world conditions, which are different from synthetic data tests. The paper is organised as follows: in Section 1 we briefly overview the OIL algorithm, in Section 2 we describe the cleavage site prediction problem. Design and experimental results are presented in Section 3. Finally, in Section 4 some final remarks and future work are discussed.
This work was partially supported by the Spanish Ministry of Education and Science TIN2007-60769.
1 Algorithm OIL
The Order Independent Language (OIL) inference algorithm, first published in [3], is a non-deterministic approach to grammar inference for regular languages. Algorithm 1 below presents the OIL strategy: positive and negative samples are sorted in lexicographical order (lines 1,2). At the beginning, the hypothesis M is empty (line 3). Then positive samples are considered one by one (line 4). If the current hypothesis M accepts a positive sample pS, M remains unchanged. If hypothesis M rejects it (line 5), a new automaton M' is built to accept pS and it is added to M (lines 6,7). The elements of M' are defined in the following way: Q' = Pref(pS), δ' = {(u, v) | u ∈ Pref(pS), v = ua, a ∈ Σ, ua ∈ Pref(pS)}, q0' = ε, and finally Φ' is defined by: ∀w ∈ (Pref(pS) − {pS}), Φ'(w) = ? and Φ'(pS) = 1. In line 8, M is modified by merging as many states as possible. The states to be merged are selected randomly. Once a merge is completed, the negative samples are checked against the new model M; if there are any inconsistencies, the merging procedure is undone. When all the positive samples are processed, the algorithm ends and the final value of M is the model learned. OIL is a convergent algorithm; the proof is in [1].

Notice that every run of OIL may produce a different model because it is a non-deterministic algorithm. For this reason, we compute a group of models from a given training sample. To test the algorithm, several heuristics may be applied to obtain a final response. For example, we can test with the smallest model (the one with the fewest states) or apply a voting method among the models to tag the test samples.

Algorithm 1. OIL(D+, D−)
1:  posSample = sort(D+) (in lexicographical order)
2:  negSample = sort(D−) (in lexicographical order)
3:  M = (Q, Σ, {0, 1, ?}, δ, q0, Φ) (empty automaton)
4:  for pS in posSample do
5:    if M doesn't accept pS then
6:      M' = (Q', Σ, {0, 1, ?}, δ', q0', Φ') (M' accepts only pS)
7:      M = M ∪ M'
8:      M = DoAllMergesPossible(M, negSample)
9:    end if
10: end for
11: return M
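A small Python sketch of the automaton M' built in lines 6–7 of Algorithm 1 may help to fix the construction: states are the prefixes of pS, Φ' equals '?' on proper prefixes and 1 on pS itself (the dictionary representation is ours, not the authors').

```python
def single_string_acceptor(ps):
    """Build M' for a positive sample ps, following the definition in the text."""
    prefixes = [ps[:i] for i in range(len(ps) + 1)]          # Pref(ps), including '' and ps
    delta = {(ps[:i], ps[i]): ps[:i + 1] for i in range(len(ps))}
    phi = {w: "?" for w in prefixes}
    phi[ps] = 1                                              # only ps is labelled accepting
    return {"states": prefixes, "delta": delta, "q0": "", "phi": phi}

m_prime = single_string_acceptor("abb")
print(m_prime["phi"])   # {'': '?', 'a': '?', 'ab': '?', 'abb': 1}
```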
2 The Cleavage Site Prediction Problem
Given a sequence of amino acids, the cleavage site prediction problem consists of predicting where a particular subsequence with a specific meaning or function begins and ends. We can predict cleavage sites for signal peptides, viral coding segments and other biological patterns. Thus the generic problem is present in any genome, from viruses to human beings. We are interested in cleavage site
prediction for Potyviruses, since they are pathogenic for many important crop plants such as beans, soybean, sugarcane and tobacco, among others, which have a large economic and alimentary impact in South America. Prediction of cleavage sites may facilitate the understanding of the molecular mechanisms underlying the diseases caused by these viruses. Researchers in the region have studied this family of viruses [2] and more than fifty viruses have been sequenced. The potyviral genome is expressed through the translation of a polyprotein which is cut by virus-encoded proteinases at specific sites in the sequence of amino acids, resulting in 10 functionally mature proteins responsible for the infection and virus replication, called: P1, HCPro, P3, 6K1, CI, 6K2, VPg, NIa, NIb and CP. The functions of these viral-encoded proteins are partially understood. Each cleavage site is identified by the name of the segments it separates. Prediction of cleavage sites is not trivial because even though there are patterns of symbols that mark these places, these patterns can be variable. Because of the complexity of the cleavage site sequences, the use of algorithms makes the detection of specific features of those points easier. The prediction of cleavage sites allows isolating specific segments to be studied and facilitates the analysis and annotation of the data obtained experimentally and their comparison with those existing in databases such as GenBank.
3 Experimental Results
We apply the OIL algorithm to the problem of predicting cleavage sites in polyproteins translated from the genome of viruses of the family Potyviridae. Our purpose is to learn a model for recognising each of the nine cleavage sites present in the polyprotein. Training samples are obtained from sequences published at www.dpvweb.net/potycleavage/index.html. Approximately 50 samples are used for training for each cleavage site. Since the algorithm needs negative samples, we use positive samples of other sites as negative samples for a given model; the ratio between positive and negative samples is 1/10. The amino acid sequence is considered one window at a time. Three window lengths are explored: in the first case we suppose the cleavage site is located between the fourth and fifth symbols; for this reason we refer to this window as 4/1. In a similar way, we experiment with windows 14/1 and 10/10.

We train 15 hypotheses from each training set with the OIL algorithm and all of them vote to decide if a test sample is accepted or rejected. We use a simple voting criterion where each model adds 1 to a counter initialised to zero if it accepts the sample and subtracts 1 from the counter if it rejects it. We calculate several very common measures for evaluating algorithms in bioinformatics: sensitivity, specificity and accuracy. Table 1 shows the average performance of the algorithm for the three window sizes 4/1, 14/1 and 10/10, for each cleavage site; the best window for each cleavage site is discussed below. We obtain models for each cleavage site with an accuracy higher than 0.95.
Table 1. Average sensitivity, specificity and accuracy of the OIL algorithm when predicting cleavage sites on polyproteins from Potyviridae family virus genomes with a group of 15 models

                     window 4/1             window 14/1            window 10/10
Cleavage site   Sens.  Spec.  Acc.     Sens.  Spec.  Acc.     Sens.  Spec.  Acc.
P1-HCPro        0.81   0.85   0.98     0.63   0.84   0.97     0.58   0.78   0.96
HCPro-P3        0.88   0.97   0.99     0.74   0.86   0.97     0.67   0.88   0.97
P3-6K1          0.65   0.80   0.96     0.61   0.74   0.96     0.65   0.76   0.96
6K1-CI          0.74   0.76   0.97     0.72   0.66   0.95     0.79   0.87   0.98
CI-6K2          0.70   0.65   0.95     0.72   0.74   0.96     0.79   0.92   0.98
6K2-VPg         0.74   0.73   0.96     0.63   0.69   0.95     0.60   0.74   0.96
VPg-NIa         0.81   0.81   0.97     0.81   0.92   0.98     0.72   0.91   0.98
NIa-NIb         0.73   0.77   0.96     0.71   0.73   0.96     0.76   0.77   0.97
NIb-CP          0.94   0.91   0.93     0.96   0.87   0.92     0.95   0.95   0.95
From Table 1 we can decide which window size is best suited for each cleavage site: P1-HCPro, HCPro-P3, P3-6K1 and 6K2-VPg yield better results when learning from a 4/1 window, while 6K1-CI, CI-6K2, NIa-NIb and NIb-CP do so from window 10/10, and VPg-NIa from 14/1. This information gives hints about the size of the pattern to be learned and allows us to specialise the training process for each cleavage site.
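As a concrete illustration of the voting scheme and the evaluation measures described in this section, the following Python sketch (ours, for illustration only) combines the votes of the trained hypotheses and computes sensitivity, specificity and accuracy from the resulting confusion counts.

```python
def vote(models, window):
    """models: list of functions mapping a window to True (accept) or False (reject)."""
    counter = sum(1 if m(window) else -1 for m in models)
    # With an odd number of models (15 in the paper) there are no ties.
    return counter > 0

def evaluate(predictions, labels):
    """predictions, labels: lists of booleans (True = cleavage site)."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    tn = sum((not p) and (not l) for p, l in zip(predictions, labels))
    fp = sum(p and (not l) for p, l in zip(predictions, labels))
    fn = sum((not p) and l for p, l in zip(predictions, labels))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(labels)
    return sensitivity, specificity, accuracy
```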
4 Conclusions and Future Work
It is possible to learn patterns to predict cleavage sites in potyvirus polyproteins with grammar inference algorithms like OIL. Our experimental results suggest that it is possible to develop an automatic segmentation tool from such models. Currently, we are applying other methods in order to compare their performance. We will assemble the best models into a computational tool which receives a complete polyprotein and segments it. Finally, some biological considerations will be taken into account to improve the performance of the proposed tool.
References

1. Alvarez, G.: Estudio de la Mezcla de Estados Determinista y No Determinista en el Diseño de Algoritmos para Inferencia Gramatical de Lenguajes Regulares. PhD thesis, Universidad Politécnica de Valencia (2007)
2. Bravo, E., Calvert, L.A., Morales, F.J.: The complete nucleotide sequence of the genomic RNA of bean common mosaic virus strain NL4. Revista de la Academia Colombiana de Ciencias Exactas, Físicas y Naturales 32(122), 37–46 (2008)
3. García, P., de Parga, M.V., Alvarez, G.I., Ruiz, J.: Universal automata and NFA learning. Theoretical Computer Science 407, 192–202 (2008)
Learning PDFA with Asynchronous Transitions

Borja Balle, Jorge Castro, and Ricard Gavaldà

Universitat Politècnica de Catalunya, Barcelona
{bballe,castro,gavalda}@lsi.upc.edu
Abstract. In this paper we extend the PAC learning algorithm due to Clark and Thollard for learning distributions generated by PDFA to automata whose transitions may take varying time lengths, governed by exponential distributions.
1 Motivation
The problem of learning (distributions generated by) probabilistic automata and related models has been intensely studied by the grammatical inference community; see [4,12,13] and references therein. The problem has also been studied in variants of the PAC model. In particular, it has been observed that polynomial-time learnability of PDFA is feasible if one allows polynomiality not only in the number of states but also in other measures of the target automaton's complexity. Specifically, Ron et al. [11] showed that acyclic PDFA can be learned w.r.t. the Kullback–Leibler (KL) divergence in time polynomial in the alphabet size, 1/ε, 1/δ, the number of target states, and 1/μ, where μ denotes the distinguishability of the target automaton. Clark and Thollard extended the result to general PDFA by considering also as a parameter the expected length L of the strings generated by the automaton [3]. Their algorithm, a state merge-split method, was in turn extended or refined in subsequent work [6,7,5,2].

Here we consider what we call asynchronous PDFA (AsPDFA), in which each transition has an associated exponential distribution. We think of this distribution as indicating the 'time' or duration of the transition. Note that there are several models of timed automata in the literature with other meanings, for example automata with timing constraints on the transitions. Our model is rather the finite-state and deterministic restriction of so-called semi-Markov processes; a widely-studied particular case of the latter are continuous-time Markov chains, in which times between transitions are exponentially distributed.

We show a general expression for the KL divergence between two given AsPDFA similar to that in [1] for PDFA. Based on this expression and a variant of the Clark–Thollard algorithm from [2], we show that AsPDFA are learnable w.r.t. the KL divergence. Technically, the algorithm requires bounds on the largest and smallest possible values of the parameters of the exponential distributions, which can be thought of as defining the 'time-scale' of the target AsPDFA. Full proofs are omitted in this version and will appear elsewhere.

The result above is motivated by the importance of modeling temporal components in many scenarios where probabilistic automata or HMMs are used as
modeling tools. We were in particular brought to this problem by the work of one of the authors and other collaborators on modeling users' access patterns to websites [8,9,10]. Models similar to (visible- or hidden-state) Markov models have been used for this purpose in marketing circles and are called Customer Behavior Model Graphs. After the work in [8,9,10], we noted that the time between successive web clicks, the user think time, was extremely informative for discriminating among different user types and predicting their future behavior, and this information is not captured by standard PFA.
2 Results
We essentially follow the notation and learning model from [3,2]. In particular, the definition of probabilistic deterministic finite automaton (PDFA) and the associated notation used here are from [2]. Furthermore, we borrow the KL–PAC model for learning distributions over sequences and the notion of μ-distinguishability of PDFA from [3]. We will denote by KL(D1‖D2) the relative entropy, or KL divergence, between a pair of distributions over the same set. The distributions are sometimes denoted by their models or parameters. In particular, in the case of two exponential distributions Exp(λ) and Exp(λ̂) one has KL(λ‖λ̂) = ln(λ/λ̂) + λ̂/λ − 1.

An asynchronous PDFA (AsPDFA) is a tuple ⟨Q, Σ, τ, γ, ξ, q0, Λ⟩, where the sub-tuple ⟨Q, Σ, τ, γ, ξ, q0⟩ defines a PDFA and Λ : Q × Σ → IR is a partial function that assigns a rate parameter Λ(q, σ) = λ_{q,σ} > 0 to each transition defined in the PDFA. We will say that an AsPDFA is μ-distinguishable if the underlying PDFA is μ-distinguishable. When acting as a generator, an AsPDFA works like a PDFA with a minor modification. If q is the current state, after 'deciding' to emit the symbol σ (with probability γ(q, σ)), it also emits a real number t, called the duration of the transition, sampled at random from Exp(λ_{q,σ}), an exponential distribution with parameter λ_{q,σ}. The next state is τ(q, σ). In this process, all durations sampled from exponential distributions are mutually independent. An observation generated by an AsPDFA is a temporal string x = ((σ0, t0), . . . , (σk, tk), (ξ, tk+1)) where σi ∈ Σ and ti ∈ IR. Thus, an AsPDFA induces a probability measure over the space X = (Σ × IR)* × ({ξ} × IR).

Our first theorem provides an expression for the relative entropy between two AsPDFA that generalizes the formula in [1] for PDFA. Carrasco's formula was used in [3] to bound the KL divergence between a target PDFA and a hypothesis produced by a learning algorithm. By the following result, similar techniques can be used to prove learnability for AsPDFA.

Theorem 1. Let A and Â be AsPDFA over the same alphabet Σ with the same terminal symbol ξ. The KL divergence between the probability distributions induced by A and Â is

  KL(A‖Â) = Σ_{q∈Q} Σ_{q̂∈Q̂} W(q, q̂) Σ_{σ∈Σ′} γ(q, σ) ( log (γ(q, σ) / γ̂(q̂, σ)) + KL(λ_{q,σ} ‖ λ̂_{q̂,σ}) ),    (1)

where Σ′ = Σ ∪ {ξ} and W(q, q̂) = Σ_{s∈P(q,q̂)} γ(q0, s), with

  P(q, q̂) = {s ∈ Σ* | τ(q0, s) = q and τ̂(q̂0, s) = q̂}.    (2)
ˆ as a sum of two terms, one Note that (1) yields a decomposition of KL(AA) correponding to the KL divergence between the underlying PDFA and another ˆ The proof of Theorem 1 is similar in that contains all the terms from Λ and Λ. spirit to that in [1]. However, some measurability issues need to be taken into account in this case. Essentially, this is due to the fact that an AsPDFA defines a probability measure over (Σ × IR)∗ × ({ξ} × IR), a space which is neither discrete nor continuous. As already mentioned, the decomposition given by (1) opens the door to algorithms for learning AsPDFA similar to those for PDFA. In particular, a variation of the Clark–Thollard algorithm [3] for learning AsPDFA will be outlined next. The algorithm is called AsLearner and is built as an extension, with some improvements, over the Learner algorithm from [2]. As input parameters AsLearner receives the alphabet size |Σ|, an upper bound n on the number of states of the target, a confidence parameter δ, and upper and lower bounds, λmax and λmin respectively, on all rate parameters of the target. Furthermore, AsLearner is provided with a sample S of examples, in this case temporal strings, drawn independently at random from the target AsPDFA A. Grosso modo, the algorithm uses S to build a graph which captures all ‘essential’ parts of A, the so-called frequent states and frequent transitions. Each node and each arc in this graph is assigned a multiset. In the case of nodes, multisets collect suffixes generated from states corresponding to them. These multisets can be used to estimate stopping and transition probabilities associated to that state. For arcs, multisets contain all observed durations of the corresponding transition. From these durations a rate parameter for each transitions can be easily estimated. These estimation steps turn the graph into an hypothesis AsPDFA. Finally, a smoothing step is performed and a ground state is added to the hypothesis. The resulting AsPDFA Aˆ is returned. Some little differences between Learner and AsLearner are to be found on how the graph is constructed. Remarkably, a variation of the distinctness test from [2] requiring less samples is employed. Furthermore, a different stopping condition is used to determine when the graph contains all relevant states and transitions. The analysis of the algorithm follows a scheme similar to that in [2]. Using Chernoff bounds as the main technical tool, the graph is guaranteed to be correct with high probability. Subsequently, estimations of transition probabilities and rate parameters are shown to be accurate with high probability. Here, a concentration inequality for the sample mean of exponential random variables is invoked. Finally, bounding techniques from [3] are combined with (1) to prove that, with high probability, Aˆ is close to A w.r.t. the KL divergence. The outcome of this analysis is summarized in the following.
274
B. Balle, J. Castro, and R. Gavald` a
Theorem 2. Given a sample S from an AsPDFA A, the algorithm AsLearner ˆ < outputs, with probability at least 1 − δ, an hypothesis Aˆ satisfying KL(AA) whenever the number of examples in S is |S| > N , where N is a function from 5 9 3 n L |Σ| 1 λmax 3 ˜ O · ln · ln . (3) 6 μ2 δ λmin Furthermore, the algorithm runs in time polynomial in |S| and the lengths of examples in S.
3
Discussion
Improving on previous algorithms, AsLearner needs less input parameters (about the underlying PDFA) thanks to the new stopping condition. Futhermore, an improved test for comparing states and a sharper analysis yield a dependence on |Σ| in the sample bound from Theorem 2 one degree smaller than in the Learner algorithm. On the other side, note the dependence on the number of states is one degree larger. That is because some more samples are needed in order to guarantee a good approximation of all relevant rate parameters. Apart from the dependence on λmax and λmin , which determine the ‘time scale’ of the target AsPDFA, the rest of parameters appear with the same degree as in the bounds from [2]. Recall that the distinguishability of an AsPDFA is defined here as the distinguishability of its underlying PDFA. This allows to prove learnability for AsPDFA using almost the same algorithm for learning PDFA. In particular, the same statistical test for distinguishing between different states can be used for learning PDFA and AsPDFA under this definition of distinguishability. However, it is conceivable that a new test using information provided by transition durations in addition to information from suffix distributions can be used to learn AsPDFA. Such a test would require a novel definition of distinguishability and would provide means for learning AsPDFA whose underlying PDFA are not learnable. Thus, we regard our results as a proof of concept on AsPDFA learning which we plan to extend along these lines in future work. Finally, it is worth remarking that Theorem 1 can be generalized to broader classes of automata where, instead of duration, transitions convey more general types of information. This generalization can be proved under very mild measurability conditions on the distributions that generate such information. Identifying families of distributions, other than exponential, for which learning is feasible, can significantly extend the range of practical applications where these techniques can be used. Acknowledgements. This work is partially supported by the Ministry of Science and Technology of Spain under contracts TIN-2008-06582-C0301 (SESAAME) and TIN-2007-66523 (FORMALISM), by the Generalitat de Catalunya 2009-SGR-1428 (LARCA), and by the EU PASCAL2 Network of Excellence (FP7-ICT-216886). B. Balle is supported by an FPU fellowship (AP2008-02064) of the Spanish Ministry of Education.
Learning PDFA with Asynchronous Transitions
275
References 1. Carrasco, R.C.: Accurate computation of the relative entropy between stochastic regular grammars. RAIRO (Theoretical Informatics and Applications) 31(5), 437– 444 (1997) 2. Castro, J., Gavald` a, R.: Towards feasible PAC-learning of probabilistic deterministic finite automata. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 163–174. Springer, Heidelberg (2008) 3. Clark, A., Thollard, F.: PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research (2004) 4. Dupont, P., Denis, F., Esposito, Y.: Links between probabilistic automata and hidden markov models: probability distributions, learning models and induction algorithms. Pattern Recognition 38(9), 1349–1371 (2005) 5. Gavald` a, R., Keller, P.W., Pineau, J., Precup, D.: PAC-learning of markov models with hidden state. In: F¨ urnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 150–161. Springer, Heidelberg (2006) 6. Guttman, O., Vishwanathan, S.V.N., Williamson, R.C.: Learnability of probabilistic automata via oracles. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 171–182. Springer, Heidelberg (2005) 7. Palmer, N., Goldberg, P.W.: PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance. Theor. Comput. Sci. 387(1), 18–31 (2007) 8. Poggi, N., Berral, J.L., Moreno, T., Gavald` a, R., Torres, J.: Automatic detection and banning of content stealing bots for e-commerce. In: NIPS 2007 Workshop on Machine Learning in Adversarial Environments for Computer Security (2007), http://mls-nips07.first.fraunhofer.de/ 9. Poggi, N., Moreno, T., Berral, J.L., Gavald` a, R., Torres, J.: Web customer modeling for automated session prioritization on high traffic sites. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS (LNAI), vol. 4511, pp. 450–454. Springer, Heidelberg (2007) 10. Poggi, N., Moreno, T., Berral, J.L., Gavald` a, R., Torres, J.: Self-adaptive utilitybased web session management. Computer Networks 53(10), 1712–1721 (2009) 11. Ron, D., Singer, Y., Tishby, N.: On the learnability and usage of acyclic probabilistic finite automata. J. Comput. Syst. Sci. 56(2), 133–152 (1998) 12. Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., Carrasco, R.C.: Probabilistic finite-state machines - part I. IEEE Trans. Pattern Anal. Mach. Intell. 27(7), 1013–1025 (2005) 13. Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., Carrasco, R.C.: Probabilistic finite-state machines - part II. IEEE Trans. Pattern Anal. Mach. Intell. 27(7), 1026–1039 (2005)
Grammar Inference Technology Applications in Software Engineering Barrett R. Bryant1 , Marjan Mernik1,2 , Dejan Hrnˇciˇc2, Faizan Javed3 , Qichao Liu1 , and Alan Sprague1 1
3
The University of Alabama at Birmingham, Department of Computer and Information Sciences, Birmingham, AL 35294-1170, U.S.A. {bryant,mernik,qichao,sprague}@cis.uab.edu 2 University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, SI-2000 Maribor, Slovenia {marjan.mernik,dejan.hrncic}@uni-mb.si Regions Financial Corp., Mortgage Shared Systems, Birmingham, AL 35244, U.S.A. [email protected]
Abstract. While Grammar Inference (GI) has been successfully applied to many diverse domains such as speech recognition and robotics, its application to software engineering has been limited, despite wide use of context-free grammars in software systems. This paper reports current developments and future directions in the applicability of GI to software engineering, where GI is seen to offer innovative solutions to the problems of inference of domain-specific language (DSL) specifications from example DSL programs and recovery of metamodels from instance models. Keywords: domain-specific languages, grammar inference, metamodel.
1
Introduction
Grammatical inference (GI), often called grammar induction or language learning, is the process of learning a grammar from positive and/or negative sentence examples. Machine learning of grammars finds many applications in syntactic pattern recognition, computational biology, and natural language acquisition [3]. In the realm of software engineering, context-free grammars (CFGs) are of paramount importance for defining the syntactic component of programming languages. Grammars are increasingly being used in various software development scenarios, and recent efforts seek to carve out an engineering discipline for grammars and grammar dependent software [6]. The structure of this paper is as follows. Section 2 explores the application of GI to inferring domain-specific language (DSL) specifications from example DSL programs. In Section 3, application of GI to recovery of a metamodel from instance models is discussed. We conclude in Section 4. J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 276–279, 2010. c Springer-Verlag Berlin Heidelberg 2010
Grammar Inference Technology Applications in Software Engineering
2
277
Inferring DSL Specifications from DSL Programs
A domain-specific language (DSL) is a programming language dedicated to a particular domain or problem [9]. It provides appropriate built-in abstractions and notations; it is usually small, more declarative than imperative, and less expressive than a general-purpose language (GPL). Using grammar inference, language specifications can be generated for DSLs, facilitating programming language development for domain experts not well versed in programming language design. This enables a domain expert to create a DSL by supplying sentences from the desired DSL to the grammar induction system, which would then create a parser for the DSL represented by those samples, thus expediting programming language development. A memetic algorithm [10] is a population-based evolutionary algorithm with local search. MAGIc, Memetic Algorithm for Grammatical Inference [8] infers CFGs from positive samples, which are divided into a learning set and a test set. The learning set is used only in local search, while grammar fitness is calculated on samples from the learning and test sets. Using samples from the test set in the grammar inference process is the main difference between our approach and many machine learning approaches, where the test set is used for testing the result accuracy. Although the initial population has been created mostly randomly in evolutionary algorithms such an approach has been proven insufficient for grammar inference [1]. Indeed, a more sophisticated approach is needed and an initial population of grammars is generated using the Sequitur algorithm [11], which generates a grammar that only parses a particular sample from a learning set. Hence, Sequitur does not generalize productions. Moreover, the initialization procedure can be enhanced with seeds of partially correct grammars or grammar dialects, which are useful for learning grammar versions [2]. We have tested MAGIc on the formal language {an bn cm |n, m ≥ 1} used in [13] and also on various DSLs, including desk calculator language DESK [12] and a DSL for describing 3-D shapes and textures [15]. MAGIc is able to infer CFGs, which are non-ambiguous and of type LR(1). Often these have fewer productions than the original grammar. Further details may be found in [8].
3
Recovery of a Metamodel from Instance Models
GI can also be applied to solve problems in domain-specific modeling (DSM) [14]. DSM assists subject matter experts in describing the essential characteristics of a problem in their domain. The typical approach to DSM involves the construction of a metamodel, from which instances are defined that represent specific configurations of the metamodel entities. Throughout the evolution of a metamodel, repositories of instance models (or domain models) can become orphaned from their defining metamodel. In such a case, instance models that contain important design knowledge cannot be loaded into the modeling tool due to the version changes that have occurred to the metamodel. Thus, the ability to recover the design knowledge in a repository of legacy models is needed. A correspondence exists
278
B.R. Bryant et al.
between the domain models that can be instantiated from a metamodel, and the set of programs that can be described by a grammar. This correspondence can be leveraged by creating a GI-based tool that allows the recovery of lost metamodels by making use of information contained in the domain models. MARS (MetAmodel Recovery System) [4][7] is a system we have developed to infer metamodels from model instances, based on the GME1 modeling tool. GME, and most other modeling tools, represent the model instance using XML. XSLT2 is a visual-to-textual representation transformation process which we use for reading in a set of instance models in XML and translating them into a DSL, called the Model Representation Language (MRL), that accurately represents the visual GME domain models contained in XML files. MRL is then loaded into the LISA language description environment [9]. As a result of the inference process, an inferred metamodel in XML is generated using formal transformation rules. The inferred metamodel can then be used by GME to load the previous instance models into the modeling tool. MARS has been validated using example models developed in ESML (Embedded Systems Modeling Language), a domain-specific graphical modeling language for modeling embedded systems [5]. We created various model instances exhibiting different properties and were able to infer a very good approximation of the metamodel. The quality of the inferred metamodel depends on the level of details available in the domain instances and also the number of domain instances used. If the set of domain instances used in the inference did not make use of all the constituent elements of the original metamodel, then those elements cannot be recovered in the inferred metamodel. Almost all properties addressed in the instances we developed were used properly in the metamodel inference and the cardinalities of connections appeared in the original were also correctly computed. Further details may be found in [7].
4
Conclusions
We have shown that grammar inference may be used to infer DSLs from example DSL source programs and metamodels from model instances. In both cases results are good when the sample set is sufficient to illustrate the key aspects of the underlying grammar. There are still some limitations in inferring highly complex recursive structures. We are currently working on improving our algorithms to overcome this. It is also the case that for both DSLs and models, the addition of semantic information would be beneficial. Such semantic information could allow the extension of the construction of DSL parsers into complete DSL compilers, and also allow for more refined metamodels to be inferred using semantic information.
Acknowledgments This work was supported in part by United States National Science Foundation award CCF-0811630. 1 2
Generic Modeling Environment - http://www.isis.vanderbilt.edu Extensible Stylesheet Transformation - http://www.w3.org/TR/xslt
Grammar Inference Technology Applications in Software Engineering
279
References ˇ 1. Crepinˇ sek, M., Mernik, M., Javed, F., Bryant, B.R., Sprague, A.: Extracting grammar from programs: Evolutionary approach. ACM SIGPLAN Notices 40(4), 39–46 (2005) 2. Dubey, A., Jalote, P., Aggarwal, S.K.: Learning context-free grammar rules from a set of program. IET Software 2(3), 223–240 (2008) 3. de la Higuera, C.: Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, Cambridge (2010) 4. Javed, F., Mernik, M., Gray, J., Bryant, B.: MARS: A metamodel recovery system using grammar inference. Information and Software Technology 50, 948–968 (2008) 5. Karsai, G., Neema, S., Sharp, D.: Model-driven architecture for embedded software: A synopsis and an example. Science of Computer Programming 73(1), 26–38 (2008) 6. Klint, P., L¨ ammel, R., Verhoef, C.: Toward an engineering discipline for grammarware. ACM Transactions on Software Engineering Methodology 14(3), 331–380 (2005) 7. Liu, Q., Bryant, B.R., Mernik, M.: Metamodel recovery from multi-tiered domains using extended MARS. In: Proc. COMPSAC 2010, 34th Annual International Computer Software and Applications Conference (to appear, 2010) 8. Mernik, M., Hrnˇciˇc, D., Bryant, B.R., Javed, F.: Applications of grammatical inference in software engineering: Domain specific language development. In: MartinVide, C. (ed.) Scientific Applications of Language Methods, pp. 475–511. Imperial College Press (2010) 9. Mernik, M., Heering, J., Sloane, A.M.: When and how to develop domain-specific languages. ACM Computing Surveys 37(4), 316–344 (2005) 10. Moscato, P.: On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Tech. rep., California Institute of Technology, Concurrent Computation Program 158-79 (1989) 11. Nevill-Manning, C.G., Witten, I.H.: Identifying hierarchical structure in sequences: a linear-time algorithm. Journal of Artificial Intelligence Research 7(1), 67–82 (1997) 12. Paakki, J.: Attribute grammar paradigms—a high-level methodology in language implementation. ACM Computing Surveys 27(2), 196–255 (1995) 13. Sakakibara, Y.: Learning context-free grammars using tabular representations. Pattern Recognition 38(9), 1372–1383 (2005) 14. Schmidt, D.C.: Guest editor’s introduction: Model-driven engineering. IEEE Computer 39(2), 25–31 (2006) 15. Strnad, D., Guid, N.: Modeling trees with hypertextures. Computer Graphics Forum 23(2), 173–187 (2004)
H¨ older Norms and a Hierarchy Theorem for Parameterized Classes of CCG Christophe Costa Florˆencio1 and Henning Fernau2 1
2
Department of Computer Science, K.U. Leuven, Leuven, Belgium [email protected] Universit¨ at Trier, FB IV—Abteilung Informatik, D-54286 Trier, Germany [email protected]
Abstract. We develop a framework based on H¨ older norms that allows us to easily transfer learnability results. This idea is concretized by applying it to Classical Categorial Grammars (CCG).
1
Introduction
The question to present new, potentially interesting language classes that can be learnt in certain scenarios has been always in the focus of Grammatical Inference. To the knowledge of the authors, no systematic way to design or define new learnable language classes, starting out from known learnability results, is known, possibly apart from the results exhibited in [1, 2, 3] where it is shown how formal language results can be used to transfer learnability results by using a kind of preprocessing reduction. We will here show another way of such a transfer, namely by using elementary mathematical properties of Hilbert spaces. This works in particular for learning from text, also known as learning from positive data, a setting where such transfer results have been hitherto unknown. One way of obtaining learnable classes of grammars is by imposing bounds on their descriptive complexity. This approach has been applied in the past to the Classical Categorial Grammar (CCG) formalism [4, 5] resulting in, among others, the learnable class of k-valued grammars, the grammars of which assign at most k categorial types to any single symbol in the (fixed) alphabet.1 Using ideas exhibited in [6], we might first associate to any categorial grammar G over an n-letter alphabet {a1 , . . . , an } an n-dimensional vector v(G), where the ith component counts that number of categorial types associated to ai . From this viewpoint, a categorial grammar G is k-valued iff the maximum norm (also known as L∞ norm) of v(G) is bounded by k. We will therefore term the corresponding class k-max-valued in what follows. This association of G to a normed Hilbert space poses a natural mathematical question: How do formal language (hierarchy) and learnability results transfer when changing the norm? 1
For reasons of space, we are not giving definitions of CCG here, but refer to the given references. In particular, range(G) denotes the types occurring in the CCG G and Pr is the set of primitive types.
J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 280–283, 2010. c Springer-Verlag Berlin Heidelberg 2010
H¨ older Norms and a Hierarchy Theorem for Parameterized Classes of CCG
2
281
k-Sum-Valued CCG
We first study the sum norm: a categorial grammar G is k-sum-valued iff the sum norm (also known as L1 norm) of v(G) is bounded by k. For a crisp notation, let Gk−max-val (Σ), Lk−max-val (Σ), and FLk−max-val (Σ) denote that class of k-max-valued grammars, their languages and functor-argument structure languages, respectively; sometimes, the basic alphabet Σ is made explicit. Similarly, the notations Gk−sum-val (Σ), Lk−sum-val (Σ), and FLk−sum-val (Σ) are understood. Kanazawa showed that both the hierarchies Lk−max-val (Σ) and FLk−max-val (Σ) have finite elasticity, hence entailing text learnability. Theorem 1. Fix some finite alphabet Σ. Families from both the hierarchies Lk−sum-val (Σ) and FLk−sum-val (Σ) have finite elasticity. Proof. The maximum number of types that can be assigned to any symbol in a k-max-valued grammar is k. Thus, Gk−sum-val (Σ) ⊂ Gk−max-val (Σ), and Lk−sum-val (Σ) ⊆ Lk−max-val (Σ). Since Lk−max-val is known to have finite elasticity for every k, every subclass, including Lk−sum-val , has finite elasticity (for every k), and is thus learnable. An analogous argument can be given for FLk−sum-val . This yields the easy corollary: Corollary 2. Fix some finite alphabet Σ. Both families from the hierarchies Lk−sum-val (Σ) and FLk−sum-val (Σ) are identifiable in the limit. For the maximum norm, a hierarchy theorem was proved in [4], i.e., for any k ≥ 1, Lk−max-val L(k+1)−max-val . Such a result also holds for the k-sum-val grammars, as we will now demonstrate. Theorem 3. For any k ≥ 1, Lk−sum-val L(k+1)−sum-val . Proof. This proof parallels Theorem 5.5 as presented in Section 8.1.1 of [4]: By definition, Gk−sum-val ⊂ G(k+1)−sum-val , which immediately implies Lk−sum-val ⊆ L(k+1)−sum-val . It remains to show that L(k+1)−sum-val − Lk−sum-val = ∅. Let Ln = {ai | 1 ≤ i ≤ n}, and Ln = {Ln | n ∈ N+ }. Note that Ln is an infinite ascending chain, so, since Lk−sum-val has finite elasticity (Theorem 1), ⊆ Lk−sum-val for any fixed k. Thus, for every k, there is an n such that Ln Ln ∈ Lk−sum-val . We will show that for the least such n, Ln ∈ L(k+1)−sum-val . Let Gn−1 be a k-sum-valued grammar such that L(Gn−1 ) = Ln−1 . There must exist a type A ∈ range(Gn−1 ) − Pr such that A is not a proper subtype of any type in range(Gn−1 ). Let B = (. . . (t/ A)/ . . .)/A, and let Gn = Gn−1 ∪ { a, B }. The reader can verify that L(Gn ) = Ln .
n
From this proof, it immediately follows that: Theorem 4. For any k ≥ 1, FLk−sum-val FL(k+1)−sum-val .
282
3
C. Costa Florˆencio and H. Fernau
H¨ older Norms
The bounds defining the k-max-valued and k-sum-valued n classes are special cases of H¨ older norms. These are of the form x p = p i=1 |xi |p for x ∈ Rn and p ≥ 1. For p = 1, this is the sum bound, for p → ∞ this is the maximum bound, and for p = 2 we obviously obtain the Euclidian distance bound. Proposition 5 (Folklore). 1. x ∞ ≤ x p for any p ≥ 1. 2. For 1-dimensional spaces, x ∞ = x p for any p ≥ 1. Thus, natural questions present themselves: Do learnability and hierarchy theorems hold for every class of CCGs defined in terms of H¨ older norms? More specifically, let GH(k,p) (Σ) collect those grammars G over the alphabet Σ which satisfy v(G) p ≤ k. Accordingly, the notations LH(k,p) (Σ) and FLH(k,p) (Σ) are understood. By definition, we have: Lemma 6. For every k, p ≥ 1, GH(k,p) ⊆ GH(k+1,p) . Proposition 5 (1) entails, completely analogous to Theorem 1: Theorem 7. Fix some finite alphabet Σ and some p ≥ 1. Families from both the hierarchies LH(k,p) (Σ) and FLH(k,p) (Σ) have finite elasticity. Theorem 8. For every k, p ≥ 1, LH(k,p) (Σ) LH(k+1,p) (Σ). Proof. From Lemma 6, it immediately follows that LH(k,p) ⊆ LH(k+1,p) . Since the strictness (for the case p = 1) was shown in Theorem 3 by a unary example, based on the finite elasticity of that class, Proposition 5 (2), together with Theorem 7 yield the claim. From this proof, it immediately follows that: Theorem 9. For every k, p ≥ 1, FLH(k,p) (Σ) FLH(k+1,p) (Σ). More generally, if Lk now denotes any language class that has finite elasticity and whose definition is obtained by restricting the L∞ norm of some vector v(G) associated to some grammar G for L ∈ Lk , then, for any p ≥ 1, Lk,p is a hierarchy with finite elasticity, as well, where Lk,p collects all languages that can be described by grammars G with G p ≤ k.
4
Further Extensions and Questions
We have indicated a quite general framework that allows us to transfer learnability results within the scenario of identification in the limit. We exemplified this with results on classical categorial grammars. In passing, the interplay between learnability and formal languages allowed us to prove a hierarchy theorem
H¨ older Norms and a Hierarchy Theorem for Parameterized Classes of CCG
283
based on previously shown learnability results (which seems to be a new way of argumentation). Our reasoning was based on the notion of finite elasticity. Can such arguments be extended to similar notions like finite thickness? Can we find other interesting special instances of our approach? The learning algorithms underlying this framework are not very efficient in general (although they are in the special case of CCG). So, is there a good way of transferring efficient learnability? To give a better known example: We might associate to any nondeterministic n-state automaton A an n-dimensional vector v(A), indicating in component i the amount of lookahead needed to disambiguate any nondeterminism in state si . Then, a deterministic automaton A is k-reversible [7] iff its reversal Ar satisfies: v(Ar ) ∞ ≤ k.2 From the viewpoint exhibited in this paper, it appears natural to study the learnability of classes of regular languages defined by the restriction v(Ar ) p ≤ k for any p ≥ 1. In general terms, we believe that there are further quite interesting connections between the areas of Grammatical Inference and that of Descriptional Complexity that deserve further studies.
References [1] Takada, Y.: A hierarchy of language families learnable by regular language learning. Information and Computation 123, 138–145 (1995) [2] Fernau, H., Sempere, J.M.: Permutations and control sets for learning non-regular language families. In: Oliveira, A.L. (ed.) ICGI 2000. LNCS (LNAI), vol. 1891, pp. 75–88. Springer, Heidelberg (2000) [3] Fernau, H.: Even linear simple matrix languages: formal language properties and grammatical inference. Theoretical Computer Science 289, 425–489 (2002) [4] Kanazawa, M.: Learnable Classes of Categorial Grammars. PhD, CSLI (1998) [5] Costa Florˆencio, C., Fernau, H.: Finding consistent categorial grammars of bounded value: a parameterized approach. In: Dediu, A.-H., Fernau, H., Mart´ın-Vide, C. (eds.) LATA 2010. LNCS, vol. 6031, pp. 202–213. Springer, Heidelberg (2010) [6] Dassow, J., Fernau, H.: Comparison of some descriptional complexities of 0L systems obtained by a unifying approach. Information and Computation 206, 1095– 1103 (2008) [7] Angluin, D.: Inference of reversible languages. Journal of the ACM 29(3), 741–765 (1982)
2
We omit here the particularity on the disambiguation of final states that can be treated analogously.
Learning of Church-Rosser Tree Rewriting Systems M. Jayasrirani1, D.G. Thomas2 , Atulya K. Nagar3 , and T. Robinson3 1
3
1
Arignar Anna Government Arts College, Walajapet, India 2 Madras Christian College, Chennai - 600 059, India [email protected] Department of Computer Science, Liverpool Hope University, United Kingdom
Introduction
Tree rewriting systems are sets of tree rewriting rules used to compute by repeatedly replacing equal trees in a given formula until the simplest possible form (normal form) is obtained. The Church-Rosser property is certainly one of the most fundamental properties of tree rewriting system. In this system the simplest form of a given tree is unique since the final result does not depend on the order in which the rewritings rules are applied. The Church-Rosser system can offer both flexible computing and effecting reasoning with equations and have been intensively researched and widely applied to automated theorem proving and program verification etc. [3,5]. On the other hand, grammatical inference is concerned with finding some grammatical description of a language when given only examples of strings from this language, with some additional information about the structure of the strings, some counter-examples or the possibility of interrogating an oracle. The grammatical inference model of Gold [4] called “identification in the limit from positive data” has inputs of the inference process as just examples of the target language and there is no interaction with the environment. Angluin introduced the grammatical inference based on positive examples and membership questions of computed elements. In [1], Angluin presented another grammatical inference model to identify regular languages based on membership and equivalence queries with a help of a teacher (called minimally adequate teacher). Sakakibara studied the grammatical inference of languages of unlabeled derivation trees of context-free grammars with the help of structural equivalence queries and structural membership queries (1990). Inference of regular tree languages from positive examples only has been studied by Knuutila and Steinby (1994). An inference algorithm for learning a k-testable tree language was presented by Damian Lopez et al. (2004). Besombes and Marion [2] investigated regular tree language exact learning from positive examples and membership queries. Fernau (2007) studied the problem of learning regular tree languages from text using the generalised frame work of function distinguisablity. In this paper, we investigate Church-Rosser tree rewriting systems as an alternative to describe and manipulate context-sensitive tree languages. Church-Rosser tree rewriting systems have many interesting properties such as decidability of J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 284–287, 2010. c Springer-Verlag Berlin Heidelberg 2010
Learning of Church-Rosser Tree Rewriting Systems
285
word problem, language description of congruence classes, etc. We present an algorithm for learning a subclass of the class of Church-Rosser tree rewriting systems. Learning is obtained using membership queries. A teacher or an oracle possesses the knowledge of the Church-Rosser tree rewriting system and hence the knowledge of the congruence classes and answers the membership queries related to the congruence classes made by the learner.
2
Tree Replacement Systems
For the notions of ranked alphabets, trees, tree replacement, tree composition and substitutions we refer to [3]. We recall the necessary definitions and notations related to tree rewriting system [3]. The set of finite Σ-trees is denoted as TΣ . Let X = {x1 , x2 , . . . } be a countable set of variables and TΣ (X), the set of trees with variables from X. The set dom(t) stands for the domain of a tree t and V ar(t), for the set of variables associated with t. A substitution is a function h : X → TΣ (X) and this h extends to a unique homomorphism h : TΣ (X) → TΣ (X). Definition 1. A set of rules S over Σ is a subset of TΣ (X) × TΣ (X). Each pair (s, t) in S is called a rule and is also denoted as s → t. The congruence generated by S is the reflexive transitive closure ⇔∗S of the relation ⇔S defined as follows: For any two trees t1 and t2 in TΣ (X), if there is some tree T in TΣ (X), some tree address u both in dom(t1 ) and dom(t2 ), some pair (s, t) such that either s → t or t → s is a rule in S, some substitution ¯ ¯ h : V ar(s) ∪ V ar(t) → TΣ (X) and t1 = T [u ← h(s)], t2 = T [u ← h(t)], then ¯ we write t1 ⇔ t2 . In other words, t2 is obtained by taking a subtree (h(s)) of t1 which is a substitution instance of one side of a rule in S(s → t, t → s) and replacing it. Definition 2. Two trees t1 and t2 are congruent (mod S) if t1 ⇔∗ t2 . The class of trees congruent to the tree ’t’ is [t]S = {t /t ⇔∗S t}. The set of congruence classes {[t]S /t ∈ TΣ } forms a monoid under multiplication, [t]S · [r]S = [tr]S with identity [Λ]S where Λ stands for the empty tree. This monoid is the quotient monoid TΣ / ⇔∗S denoted by MS . Definition 3. Given a set of rules S over Σ, the relation ⇒S is defined as t ⇒S s, if t ⇔ s and hg(t) > hg(s), ∀ t, s ∈ TΣ (X) where hg(t) stands for height of t. ⇒∗S is the reflexive transitive closure of ⇒S and (S, ⇒S ) is called a tree replacement (rewriting) system on Σ. Given a tree replacement system (S, ⇒S ), a tree t is irreducible (mod S) if there is no tree t such that t ⇒S t . Definition 4. A tree replacement system (S, ⇒S ) is Church-Rosser if for all trees t1 , t2 with t1 ⇔∗S t2 , there exists a tree t3 such that t1 ⇒∗S t3 and t2 ⇒∗S t3 . The word problem for a tree replacement system (S, ⇒S ) is that given any two trees s, t in TΣ (X), deciding whether s and t are congruent to each other or not. The word problem is undecidable in general for any tree replacement
286
M. Jayasrirani et al.
system but it has been proved that the word problem for any Church-Rosser tree replacement system is decidable [3]. Let S be a tree rewriting system on Σ. Let IRR(S) be the set of all irreducible trees with respect to S. Definition 5. A tree rewriting system T on Σ is called reduced if for every rewriting rule (s, t) ∈ T , t is an irreducible tree with respect to T and s is an irreducible tree with respect to T − {(s, t)}.
3
Learning Church-Rosser Tree Rewriting System R
Let Σ be a given ranked alphabet. We consider a Church-Rosser tree rewriting system T on Σ. Let MT = {L1 , L2 , . . . , Ln } be the quotient monoid where each Li is a congruence class of a tree with respect to T . Then, the congruence relation ⇔∗T is of finite index and so each congruence class Li (1 ≤ i ≤ n) is a regular tree language by Myhill-Nerode theorem for trees. Algebraic properties of a Church-Rosser tree rewriting system T for which MT is finite enable us to present an efficient learning procedure for congruence classes with only membership queries. Since the congruence of T partitions the set TΣ into disjoint congruence classes, any tree in TΣ is in only one congruence class with respect to T . So, the membership query for congruence classes is meaningful and reasonable. The unique reduced Church-Rosser tree rewriting system R equivalent to T is then obtained. The learning procedure to obtain R consists of two parts, one for IRR(R) and the other for the tree rewriting system R. For any tree t ∈ TΣ given as input, the oracle answers membership query by producing an n-tuple that contains n − 1 zeros and one 1 since MT = MR = {L1 , L2 , . . . , Ln }. The learner gets the value of n when the empty tree Λ is given as input for membership query. The input is a tree t ∈ TΣ and the output is an n tuple q(t) = (k1 , k2 , . . . , kn ) where ki = 1 if t ∈ Li and ti = 0 if t ∈ Li (1 ≤ i ≤ n). Let pi be the projection defined by pi (x) = xi for any n-tuple x = (x1 , x2 , . . . , xn ), 1 ≤ i ≤ n. Membership queries are made to the oracle for the input trees, starting with the empty tree Λ, which is an irreducible tree with respect to R and continued with the trees in TΣ0 . Let t1 = Λ and suppose t2 , t3 , . . . , ts are the lexicographically ordered trees in TΣ0 where s − 1 is the number of constants in Σ. A tree ti (2 ≤ i ≤ s) belonging to Lj for some j (1 ≤ j ≤ n) is an irreducible tree with respect to R whenever ti ∈ Lj but tp ∈ Lj for p = 1, 2, . . . , i − 1. Hence by membership queries all the irreducible trees in TΣ0 with respect to R are obtained. The process is continued by making membership queries for trees in TΣ1 (TΣ0 ∩ IRR(R)), the set of all trees of height one with subtrees in TΣ0 ∩ IRR(R), which can be lexicographically ordered. Thus the process gives irreducible trees with respect to R in TΣ0 and TΣ1 . In general the process is continued recursively by making membership queries for trees in TΣ1 (TΣr−1 ∩ IRR(R)), the set of all trees of height r, with subtrees in TΣr−1 ∩ IRR(R), r ≥ 1. This process terminates when each Lj receives an irreducible tree with respect to R.
Learning of Church-Rosser Tree Rewriting Systems
287
The algorithm for forming irreducible trees with respect to R, terminates when the process for finding trees with respect to R in TΣk ends, when k = max{hg(t)/t ∈ IRR(R)} since (a) IRR(R) is finite and (b) each Lj (1 ≤ j ≤ n) contains exactly one irreducible tree with respect to R. To identify the unique, reduced Church-Rosser tree rewriting system R equivalent to the unknown tree rewriting system T , the learner performs again the membership queries as in the procedure for the lexicographically ordered trees in the set TΣ1 (IRR(R)) − IRR(R), where TΣ1 (IRR(R)) is the set of all trees with subtrees in IRR(R) in the next level. The learning then forms the tree rewriting system S = {(s, t)/s ∈ TΣ1 (IRR(T )) - IRR(T ), t ∈ IRR(T ), s and t both belong to Lj for some j(1 ≤ j ≤ n)} on Σ. From S a reduced tree rewriting system S equivalent to S on Σ is obtained and thus the learner obtains R which is same as S on Σ. We can show that the time taken by the learning algorithm to learn IRR(R) is polynomial in the number of congruence classes, the arities of members of Σ and the number of elements in Σ. An example run We illustrate the procedure for learning the reduced Church-Rosser tree rewriting system R = {(b(c), c), (b(d), d), (a(c, c), c), (a(d, d), d), (a(c, d), c), (a(d, c), d)} on Σ = {a, b, c, d} with arities of a, b, c, d are 2, 1, 0, 0 respectively. MR = {[Λ]R , [c]R , [d]R } where L1 = [Λ]R , L2 = [c]R and L3 = [d]R . Membership queries as made for the trees Λ, c, d belonging to TΣ0 and the oracle produces the answers q(Λ) = (1, 0, 0), q(c) = (0, 1, 0), q(d) = (0, 0, 1) for which the learner obtains IRR(R) as {Λ, c, d}. Again membership queries are made for the trees in the set TΣ1 = {b(c), b(d), a(c, c), a(d, d), a(c, d), a(d, c)} and the oracle produces the answers q(b(c)) = (0, 1, 0), q(b(d)) = (0, 0, 1), q(a(c, c)) = (0, 1, 0), q(a(d, d)) = (0, 0, 1), q(a(c, d)) = (0, 0, 1), q(a(d, c)) = (0, 0, 1). From which the learner obtains S = {b(c), c), (b(d), d), (a(c, c), c), (a(d, d), d), (a(c, d), c), (a(d, c), d)}. The reduced tree rewriting system S equivalent to S is obtained as S = S = R.
References 1. Angluin, D.: Learning regular sets from queries and counter examples. Inform. Comput. 75, 87–106 (1987) 2. Besombes, J., Marion, J.Y.: Learning tree languages from positive examples and membership queries. Theoretical Computer Science 382, 183–197 (2007) 3. Gallier, J.H., Book, R.V.: Reductions in tree replacement systems. Theoretical Computer Science 37, 123–150 (1985) 4. Gold, M.: Language identification in the limit. Information and Control 10, 447–474 (1967) 5. Rosen, B.K.: Tree-manipulating systems and Church-Rosser theorems. Journal of the Association for Computing Machinery 20(1), 160–187 (1973)
Generalizing over Several Learning Settings Anna Kasprzik University of Trier [email protected]
Introduction. We recapitulate inference from membership and equivalence queries, positive and negative samples. Regular languages cannot be learned from one of those information sources only [1,2,3]. Combinations of two sources allowing regular (polynomial) inference are MQs and EQs [4], MQs and positive data [5,6], positive and negative data [7,8]. We sketch a meta-algorithm fully presented in [9] that generalizes over as many combinations of those sources as possible. This includes a survey of pairings for which there are no well-studied algorithms. Definition 1. T = S, E, obs (S, E ⊆ Σ ∗ ) is an observation table if S is prefixclosed and obs(s, e) = 1 if se ∈ L, 0 if se ∈ / L, ∗ if unknown. Let row(s) := {(e, obs(s, e))|e ∈ E}. S is partitioned into red and blue. We call r, s ∈ S obviously different (OD; r <> s) iff ∃e ∈ E with obs(r, e) = obs(s, e) and obs(r, e), obs(s, e) ∈ {0, 1}. T is closed iff ¬∃s ∈ blue : ∀r ∈ red : r <> s. Let r ≡L s iff re ∈ L ⇔ se ∈ L for all r, s, e ∈ Σ ∗ . Let IL := |{[s0 ]L |s0 ∈ Σ ∗ }|. Due to the Myhill-Nerode theorem there is a unique total state-minimal DFA AL with IL states and each state recognizing a different equivalence class. From a closed and consistent (see [4]) table T = S, E, obs with ε ∈ E we derive a DFA AT = Σ, QT , qT , FT , δT with QT = row(red), qT = row(ε), → q|¬(q <> FT = {row(s)|s ∈ red, obs(s, ε) = 1}, and δT = {(row(s), a) row(sa)), s ∈ red, a ∈ Σ, sa ∈ S}. AT has at most IL states (see [4], Th. 1). Definition 2. A finite X ⊆ L is representative for L with min. DFA A = Σ, Q, q0 , F, δ iff [∀(q1 , a) → q2 ∈ δ : a ∈ Σ ⇒ ∃w ∈ X : ∃u, v ∈ Σ ∗ : w = uav ∧ (q0 , u) → q1 ∈ δ]∧[∀q ∈ F : ∃w ∈ X : (q0 , w) → q ∈ δ]. A finite X ⊆ Σ ∗ \L is separative ∗ iff ∀q1 = q2 ∈ Q : ∃w ∈ X : ∃u, v ∈ Σ : w = uv ∧[δ(qL , u) = q1 ∨δ(qL , u) = q2 ]∧ → qa , (q2 , v) → qb ∈ δ : [(qa ∈ F ∧qb ∈ (Q\F ))∨(qb ∈ F ∧qa ∈ (Q\F ))]. ∃(q1 , v) All learning algorithms we consider can be seen to start out with a provisional set of classes and converge to the partition by ≡L by splitting or merging them according to obtained information. In a table S contains strings whose rows are candidates for states in the minimal DFA, and E experiments (‘contexts’) proving that two strings belong to distinct classes and represent different states. Algorithm GENMODEL. The input is a tuple IP = EQ , MQ, X+ , X− with Boolean values stating if EQs or MQs can be asked, a positive, and a negative finite sample of L. After initializing T we enter a loop checking if T is closed and, J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 288–292, 2010. c Springer-Verlag Berlin Heidelberg 2010
Generalizing over Several Learning Settings
289
if it is, if we can still find states that should be split up, until there is no more information to process. The algorithm is composed of the following procedures: INIT initializes the oracle (MQORACLE returns a blackbox for L if MQ = 1 and otherwise the prefix automaton for X+ as an imperfect oracle improved during the process) and T . The set red contains candidates that were fixed to represent a state in the output, and is initialized by ε (start state), and blue contains candidates representing states to which there is a transition from one of the states in red. white is the set of remaining candidates from which blue = ∅ POOL is filled up. The set of (initial) candidates is given by POOL: If X+ = ∅ ∧ MQ = 1 returns Pref (X+ ), otherwise all strings up to length 2. If X− POOL builds X+ from X− : Let n− := |Suff (X− )|. In a worst case, every suffix in a separative X− distinguishes a different of the (IL2 − IL )/2 pairs of states. From n− ≤ (IL2 − IL )/2 we compute an upper bound for IL and take all strings up to that length as X+ as the longest shortest representative of a state in AL is at most of length IL . Note that |X+ | can be exponential with respect to |X− |. We also have UPDATE which clears the elements that were moved to blue out of white and fills in the cells of T if we have a perfect membership oracle which for MQ = 1 is true at any time and for MQ = 0 when we have processed all available information, provided that it was sufficient. For the cases with empty samples but MQ = 1 we fill up white with all one-symbol extensions of blue. CLOSURE is straightforward, it successively finds all elements preventing the closedness of T , moves them to red, and calls UPDATE to fill up the table. NEXTDIST calls FINDNEXT to look for a candidate to be fixed as another state of the output. Then T is modified by MAKEOD such that CLOSURE will move this string to red. If no such candidate is found FINDNEXT returns ε, ε (this can be seen as a test for the termination criterion). In that case white is emptied if we use queries only, for all other cases the remaining candidates are moved to blue in order not to lose the information contained in the pool. If MQ = 1 FINDNEXT exploits a counterexample. EQ = 1: c is given by the oracle. Else if X+ = ∅ the learner tries to build c from Text = S ∪ white, E ∪ Suff (X+ ), obs. This succeeds if X+ is representative (see [9]). At least one prefix of c must be a distinct state of the output, but as it may not be in blue MINIMIZE is called to replace the blue prefix of c until it finds s e with s ∈ blue and e distinguishes s from all red elements: FINDNEXT returns s e . If MQ = 0 we continue merging states unless there is information preventing it. After the call of MERGENEXT either all blue strings correspond to states resulting from a merge or there is s which is a non-mergeable state. FINDNEXT returns s, ε as s should be a distinct state of the solution. In cases not covered by these distinctions we cannot reliably find another candidate to move and return ε, ε. MAKEOD is called if FINDNEXT returns s, e with s = ε, i.e., s is to be moved to red by CLOSURE. If MQ = 1 there is a single r ∈ red not OD from s (red elements are pairwise OD, and rows of S are complete), and e separates s from r, so add e to E. If MQ = 0 row (s) consists of ‘∗’s – we have to make s
290
A. Kasprzik
OD from all r ∈ red “by hand”: Find c ∈ X− preventing the merge of qr and qs via PREVENTMERGE and a suffix er of c leading from qr or qs to a final state (X− = ∅ as FINDNEXT returns ε, ε for MQ = 0 otherwise). As c should not be accepted er separates s from r. Add er to E and fill the two cells of T with differing values – note that they do not have to be correct as they are used only once by CLOSURE, and T will be updated completely just before termination. GENMODEL is intended as a generalization of algorithms for settings where polynomial one-shot inference is possible, which also implies that it is deterministic and does not guess/backtrack. However, note that it behaves in an “intuitively appropriate” way when (polynomial) inference is not possible as well. We call an information source non-void for queries if MQ = 1/EQ = 1, for a positive sample if it is representative, and for a negative sample if it is separative. Theorem 1. a. Let L be the regular target language. GENMODEL terminates for any input after at most 2IL − 1 main loop executions and returns a DFA. b. For any input including at least two non-void information sources except for 1, 0, X+ , X− with X+ or X− void the output is a minimal DFA for L. See [9] for the proof. Note that Theorem 1b can also be seen from the proofs of the algorithms in [4,6,8]. We comment on the following three cases because to our knowledge there are no such well-studied algorithms for these settings. 0, 1, ∅, X− : As 0, 1, X+ , ∅. We build a positive sample from X− (see above) which however may be exponential in size with respect to |X− | so that the number of MQs is not polynomial with respect to the size of the given data. 1, 0, X+ , ∅: Suppose we wanted to handle this case analogously: We would have to test state mergeability in O via EQs. For X+ representative a positive counterexample reveals the existence of states that should be merged, a negative one of states that should not have been. When we query the result of a merge (even without repairing non-determinism by further merges) and get a positive counterexample we could either repeat the EQ and wait for a negative one but the number of positive ones may be infinite. Or we could query the next merge but when (if) we eventually get a negative one we do not know which of the previous merges was illegitimate. So this strategy is no less complex than ignoring all counterexamples and asking an EQ for the result of every possible set of merges, of which there are exponentially many. Therefore, since we cannot proceed as in the cases where inference is possible with a polynomial number of steps or queries this case is eclipsed from GENMODEL by the corresponding case distinctions. 1, 0, ∅, X− : If X− is separative negative counterexamples do not carry new information, and the number of negative counterexamples may be infinite. The set of positive counterexamples so far may not be representative so that we cannot reliably detect an illegitimate merge as there may be final states that are not even represented in the current O such that a compatibility check is too weak. If we make the merge we might have to undo it because of another positive counterexample, a situation we want to avoid. Hence we eclipse this case as well. Note: For input with more than two non-empty sources the algorithm chooses one of the two-source options with priority MQs&EQs > MQ&X+ > X+ &X− .
Generalizing over Several Learning Settings
291
Conclusion. We have aimed to design GENMODEL as modular as possible as an inventory of the essential procedures in existing and conceivable polynomial one-shot regular inference algorithms of the considered kind. This may help to give clearer explanations for the interchangeability of information sources. Practically, an extended GENMODEL (see below) could be used as a template from which individual algorithms for hitherto unstudied scenarios can be instantiated. We have chosen observation tables as an abstract and flexible means to perform and document the process, from which various descriptions can be derived. GENMODEL offers itself to be extended in several directions. We could try to generalize over the type of objects, such as trees (see [10,6,11,12]), graphs, matrices, or infinite strings. Then there are other kinds of information sources which might be integratable, such as correction queries [13], active exploration [14], or distinguishing functions [15]. The third direction concerns an extension of the learned language class beyond regularity (for example by using strategies as in [16] for even linear languages, or [17] for languages recognized by DFA with infinite transition graphs) and even beyond context-freeness [16,18]. The development of GENMODEL may be of use in the concretization of an even more general model of learning in the sense of polynomial one-shot inference as considered here – also see the very interesting current work of Clark [19].
References 1. 2. 3. 4. 5. 6.
7.
8. 9. 10.
11. 12.
Gold, E.: Language identification in the limit. Inf. & Contr. 10(5), 447–474 (1967) Angluin, D.: Queries and concept learning. Mach. L. 2, 319–342 (1988) Angluin, D.: Negative results for equivalence queries. Mach. L. 5, 121–150 (1990) Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75(2), 87–106 (1987) Angluin, D.: A note on the number of queries needed to identify regular languages. Inf. & Contr. 51, 76–87 (1981) Besombes, J., Marion, J.Y.: Learning tree languages from positive examples and membership queries. In: Gavald´ a, R., Jantke, K.P., Takimoto, E. (eds.) ALT 2003. LNCS (LNAI), vol. 2842, pp. 440–453. Springer, Heidelberg (2003) Oncina, J., Garcia, P.: Identifying regular languages in polynomial time. Machine Perception and Artificial Intelligence, vol. 5, pp. 99–108. World Scientific, Singapore (2002) de la Higuera, C.: Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, Cambridge (2010) Kasprzik, A.: Generalizing over several learning settings. Technical report, University of Trier (2009) ´ Drewes, F., H¨ ogberg, J.: Learning a regular tree language from a teacher. In: Esik, Z., F¨ ul¨ op, Z. (eds.) DLT 2003. LNCS, vol. 2710, pp. 279–291. Springer, Heidelberg (2003) Oncina, J., Garcia, P.: Inference of recognizable tree sets. Technical report, DSIC II/47/93, Universidad de Valencia (1993) Kasprzik, A.: A learning algorithm for multi-dimensional trees, or: Learning beyond context-freeness. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 111–124. Springer, Heidelberg (2008)
292
A. Kasprzik
13. Tˆırn˘ auc˘ a, C.: A note on the relationship between different types of correction queries. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 213–223. Springer, Heidelberg (2008) 14. Pitt, L.: Inductive inference, DFAs, and computational complexity. In: Jantke, K.P. (ed.) AII 1989. LNCS, vol. 397. Springer, Heidelberg (1989) 15. Fernau, H.: Identification of function distinguishable languages. Theoretical Computer Science 290(3), 1679–1711 (2003) 16. Fernau, H.: Even linear simple matrix languages: Formal language properties and grammatical inference. Theoretical Computer Science 289(1), 425–456 (2002) 17. Berman, P., Roos, R.: Learning one-counter languages in polynomial time. In: SFCS, pp. 61–67 (1987) 18. Yoshinaka, R.: Learning mildly context-sensitive languages with multidimensional substitutability from positive data. In: Gavald` a, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 278–292. Springer, Heidelberg (2009) 19. Clark, A.: Three learnable models for the description of language. In: Dediu, A.H., Fernau, H., Mart´ın-Vide, C. (eds.) LATA 2010. LNCS, vol. 6031, pp. 16–31. Springer, Heidelberg (2010)
Rademacher Complexity and Grammar Induction Algorithms: What It May (Not) Tell Us Sophia Katrenko1 and Menno van Zaanen2 1
Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands [email protected] 2 TiCC, Tilburg University, Tilburg, The Netherlands [email protected] Abstract. This paper revisits a problem of the evaluation of computational grammatical inference (GI) systems and discusses what role complexity measures can play for the assessment of GI. We provide a motivation for using the Rademacher complexity and give an example showing how this complexity measure can be used in practice.
1
Introduction
Various aspects of grammatical inference (GI) have been studied extensively from both theoretical and practical points of view [3]. These include formal learnability results in the frameworks of the identification in the limit and PAC learning, as well as empirical methods. In the latter case, given a finite amount of sequential data, the aim is to find the underlying structure that was used to generate the data. Empirical approaches usually fall into the unsupervised learning paradigm and explore vast volumes of unlabeled sequences. One of the widely discussed questions in the literature concerns the performance of GI methods and their means of assessment. Van Zaanen and Geertzen [5] identify four evaluation strategies: the looks-good-to-me, rebuilding an apriori known grammars, language membership detection and comparison against a treebank approaches. All have weaknesses, some of which can be attributed to subjectivity, low scalability, and a bias towards specific grammars. In practice, the comparison against a gold standard remains the most popular evaluation strategy. For instance the empirical comparison of ABL and EMILE [4] was based on unlabeled precision and recall. In this paper, we do not focus on accuracy of the GI methods but on their overfitting. In particular, it is known from statistical learning theory that classifiers prone to overfitting do not provide high generalization. In what follows, we give a definition of Rademacher complexity and discuss how to use it in the context of GI.
2 Rademacher Complexity
A goal of a learning system is to be able to analyze new, unseen examples and predict them correctly. In other words, given a set of n examples {(x_i, y_i)}_{i=1}^n
drawn i.i.d. from the joint distribution P_XY, it is supposed to produce a classifier h : X → Y such that it is able to categorize a new example x ∈ X. Any incorrect predictions that a classifier makes on a training set are counted as its empirical error ê(h) = Σ_{i=1}^{n} I(h(x_i) ≠ y_i), where I is an indicator function which returns 1 in the case h(x_i) ≠ y_i and 0 otherwise. Even though a classifier has access only to the limited number of examples (training set), one would ideally wish the empirical error on training examples ê(h) to be close to the true error e(h). In statistical learning theory, it is common to describe the difference between true and empirical errors in terms of generalization bounds. These bounds typically depend on the number of training examples and the capacity of a hypothesis space H. If a hypothesis space is very large and there are only few training examples, the difference between true and empirical errors can be large. Capacity closely relates to the notion of overfitting and emphasizes the fact that even if a classifier performs very well on the training set, it may yield poor results on a new data set. It is measured either by the Vapnik-Chervonenkis dimension or Rademacher complexity, and here we focus on the latter.

Definition 1. For n training examples from a domain X, a set of real-valued functions H (where h ∈ H, h : X → R), and a distribution P_X on X, the Rademacher complexity R(H, X, P_X, n) is defined as follows:

    R(H, X, P_X, n) = E_{x,σ} [ sup_{h∈H} (2/n) Σ_{i=1}^{n} σ_i h(x_i) ]    (1)
where σ = σ_1, . . . , σ_n are random numbers distributed identically and independently according to the Bernoulli distribution with values ±1 (with equal probability), and the expectation is taken over σ and x = x_1, . . . , x_n. Equation 1 shows that Rademacher complexity depends on the number of training examples n. In particular, a larger number of examples will lead to lower complexity and, consequently, overfitting will also be low. In the binary case, where h : X → {−1, 1}, Rademacher complexity ranges from 0 to 2. In a nutshell, Rademacher complexity shows how well a classifier can match random noise. The use of Rademacher complexity to bound the generalization error is discussed in [1] and is illustrated below.

Theorem 1. (Bartlett and Mendelson) Let P_XY be a probability distribution on X × {−1, 1} with marginal distribution P_X on X, and let H be a set of functions such that each h ∈ H, h : X → {−1, 1}. Let {(x_i, y_i)}_{i=1}^n be a training set sampled i.i.d. from P_XY. For any δ > 0, with probability at least 1 − δ, every function h ∈ H satisfies

    e(h) − ê(h) ≤ R(H, X, P_X, n)/2 + sqrt( ln(1/δ) / (2n) )    (2)

Equation 2 shows that if Rademacher complexity is high and the number of training examples is small, the generalization bound will be loose. Ideally, one would like to keep Rademacher complexity as low as possible, and the number of training examples sufficiently large.
3 Grammar Induction: Some Considerations
Tailoring Rademacher complexity to GI is not trivial because, even though GI is evaluated against existing annotated resources, it does not always fall into a typical supervised learning scenario. We assume that a grammar induction algorithm maintains several hypotheses and chooses the best one available, hg. Depending on the input data, there are three possible strategies.

Supervised GI. When a GI method is supervised, i.e. it is trained on sentences with their corresponding constituency structures, Rademacher complexity can be used to measure overfitting. This is the case of probabilistic context-free grammars (PCFGs). To measure Rademacher complexity, we need to specify what the input space X and the output space Y are. Usually, GI methods take a text corpus as input and generate constituents as output, which may suggest that X is a set of sequences (sentences) and Y is a set of subsequences (constituents). When comparing the output of an algorithm against a structured version of the sentences (i.e. a treebank), one considers how many constituents were found by a GI method and whether they match the annotations. Consequently, we assume a hypothesis to be a mapping from constituents to binary labels, hg : X → {−1, 1}. Labels indicate whether a constituent from the gold standard was found by a GI algorithm (1) or not (−1). To summarize, in the supervised case one may use the following evaluation scheme. For each constituent xi, i = 1, . . . , n from the gold standard corpus, we generate a random label σi. In addition, we have a binary prediction from the GI method which indicates whether this constituent is generated by this particular method, hg(xi). Finally, Rademacher complexity is computed as described in Equation 1.

Semi-supervised GI. The second scenario is applicable when a GI method uses both labeled and unlabeled data. In such a case, transductive Rademacher complexity may be used, which is a counterpart of the standard Rademacher complexity.

Unsupervised GI. In a fully unsupervised scenario, a GI method does not make use of labeled data for training, and in this case we need another measure of overfitting instead of Rademacher complexity. However, in order to see what would happen if we simulate the evaluation proposed for the supervised scenario, we have applied Alignment-Based Learning (ABL) [4] to the 578-sentence Air Traffic Information System (ATIS3) subset of the Penn treebank (with the edit distance-based alignment algorithm and the term probability selection learning method). As baselines, we also consider left- and right-branching binary tree structures. The generated structures have been compared against the ATIS3 gold standard, not taking empty constituents (traces) and the constituents spanning the entire sentence into account. Table 1 shows that complexity rates for all three algorithms are low, which suggests that overfitting is low. Figure 1 illustrates that increasing the size of the training data lowers Rademacher complexity, although the differences are small here as well.
Table 1. Rademacher complexity and standard deviation on the ATIS3 corpus (100 runs)

Settings           Rademacher complexity
ABL                0.0267 (± 0.0227)
left branching     0.0295 (± 0.0215)
right branching    0.0304 (± 0.0219)

Fig. 1. Learning curve of Rademacher complexity of ABL on the ATIS3 corpus (Rademacher complexity, on a scale from 0 to 0.1, plotted against the percentage of ATIS3 used, from 10% to 100%)
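To make the evaluation scheme described above concrete, the following is a minimal Python sketch of a Monte Carlo estimate of Equation (1) over a finite set of hypotheses. The system names, their ±1 prediction vectors over gold-standard constituents, and the number of runs are hypothetical placeholders; the exact averaging protocol behind the 100 runs of Table 1 is not reproduced here, so this illustrates the formula rather than the reported numbers.

```python
import random

def rademacher_estimate(predictions, n_runs=100, seed=0):
    """Monte Carlo estimate of Eq. (1) for a finite hypothesis set.
    `predictions` maps a system name to its vector of +1/-1 outputs on
    the n gold-standard constituents (+1: constituent recovered)."""
    rng = random.Random(seed)
    n = len(next(iter(predictions.values())))
    total = 0.0
    for _ in range(n_runs):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]   # random +/-1 labels
        # sup over the hypothesis set of (2/n) * sum_i sigma_i * h(x_i)
        total += max(
            2.0 / n * sum(s * h for s, h in zip(sigma, outputs))
            for outputs in predictions.values()
        )
    return total / n_runs

# Toy usage with three hypothetical systems judged on ten constituents.
predictions = {
    "ABL":   [1, 1, -1, 1, -1, 1, 1, -1, 1, 1],
    "left":  [1, -1, -1, 1, -1, -1, 1, -1, 1, -1],
    "right": [-1, 1, -1, 1, 1, -1, 1, -1, -1, 1],
}
print(round(rademacher_estimate(predictions), 4))
```

With real data, the prediction vector of each system is obtained by matching its output constituents against the gold standard, exactly as described for the supervised scheme.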
4 Conclusions
In this paper, we discuss how to use Rademacher complexity to analyze existing grammar induction algorithms. In addition to commonly used measures, such as unlabeled precision or recall, the use of Rademacher complexity allows us to measure the overfitting of a method at hand. Since complexity is computed for a data sample, it makes it possible to study overfitting for the entire text collection, as well as on subsets defined based on sentence length or certain linguistic phenomena. Rademacher complexity is well suited for supervised and semi-supervised settings. However, it remains an open question how overfitting should be measured in a completely unsupervised scenario. Recent work on clustering [2] suggests that, similarly to supervised learning, it is possible to restrict a function space in order to avoid overfitting. In the future, we plan to investigate whether these findings can be used for unsupervised grammar induction.
References 1. Bartlett, P.L., Mendelson, S.: Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3 (2002) 2. Bubeck, S., von Luxburg, U.: Nearest neighbor clustering: A baseline method for consistent clustering with arbitrary objective functions. JMLR 10, 657–698 (2009) 3. Clark, A.: Unsupervised Language Acquisition: Theory and Practice. PhD thesis, COGS, University of Sussex (2001) 4. van Zaanen, M., Adriaans, P.: Alignment-Based Learning versus EMILE: A Comparison. In: Proceedings of the Belgian-Dutch Conference on Artificial Intelligence (BNAIC), pp. 315–322 (2001) 5. van Zaanen, M., Geertzen, J.: Problems with evaluation of unsupervised empirical grammatical inference systems. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 301–303. Springer, Heidelberg (2008)
Extracting Shallow Paraphrasing Schemata from Modern Greek Text Using Statistical Significance Testing and Supervised Learning Katia Lida Kermanidis Department of Informatics, Ionian University 7 Pl. Tsirigoti, 49100 Corfu, Greece [email protected]
Abstract. Paraphrasing normally involves sophisticated linguistic resources for pre-processing. In the present work Modern Greek paraphrases are automatically generated using statistical significance testing in a novel manner for the extraction of applicable reordering schemata of syntactic constituents. Next, supervised filtering helps remove erroneously generated paraphrases, taking into account the context surrounding the reordering position. The proposed process is knowledge-poor, and thus portable to languages with similar syntax, robust and domain-independent. The intended use of the extracted paraphrases is hiding secret information underneath a cover text. Keywords: paraphrasing, statistical significance testing, supervised learning.
1 Introduction
Paraphrasing is expressing the meaning of a sentence using a different set of words and/or a different syntactic structure. Paraphrasing is useful in language learning, authoring support, text summarization, question answering, machine translation, textual entailment, and natural language generation. Significant research effort has been put into paraphrase identification [9, 6, 1] and generation [7, 2]. The present work describes the automatic inference of syntactic patterns from Modern Greek (MG) text for generating shallow paraphrases. The proposed methodology is a combination of a statistical significance testing process for generating ‘swapable’ phrase (chunk) pairs based on their co-occurrence statistics, followed by a supervised filtering phase (a support vector machines classifier) that helps remove pairs that lead to erroneous swaps. A first goal is to produce as many correct paraphrases as possible for an original sentence, due to their intended use in steganographic communication [5], i.e. for embedding hidden information in unremarkable cover text [3, 8, 7, 11]. Among others, one way to insert hidden bits within a sentence is by taking advantage of the multiple syntactic structures it can appear in, i.e. its paraphrases. Steganographic security relies on the number and the grammaticality of produced paraphrases, not on their complexity [5]. Instead of focusing on few intricate alterations (common in previous work), the methodology aims at generating a significant number of paraphrases. Unlike the syntactic rules in previous
work [7], each swapping schema (and different schemata simultaneously) may be applied multiple times (i.e. in multiple positions) to a sentence [5]. A second goal is to employ as few external linguistic resources as possible, thereby ensuring the portability of the methodology to languages with a syntax similar to MG, robustness, and domain independence (the proposed alterations are applicable to any MG text).
2 Inferring Paraphrasing Schemata
MG is highly inflectional and allows for a large degree of freedom in the ordering of the chunks within a sentence. This freedom enables paraphrase generation merely by changing the chunk order. The ILSP/ELEFTHEROTYPIA corpus [4] used in the experiments consists of 5244 sentences and is manually annotated with morphological information. Phrase structure information is obtained automatically by a multi-pass parsing chunker that exploits minimal resources [10] and detects non-overlapping noun (NP), verb (VP), prepositional (PP), adverbial phrases (ADP) and conjunctions (CON). Next, phrase types are formed by stripping phrases of superfluous information. NP types retain the phrase case. VP types retain the verb voice, the conjunction introducing them and their copularity. PP types retain their preposition and CON types their conjunction type (coo/sub-ordinating). 156 phrase types were formed. Next, the statistical significance of the co-occurrence of two phrase types is measured using hypothesis testing: the t-test, the log likelihood ratio (LLR), the chi-squared metric (χ2) and pointwise mutual information (MI). Phrase type pairs that occur in both orderings ([TYPE1][TYPE2] and [TYPE2][TYPE1]) among the top results with the highest rank are selected. These are considered permissible phrase swaps, as both orderings show significant correlation between the phrases forming them. If a swap pair is detected in an input sentence, the two phrases are swapped and a paraphrase is produced. The left column in Table 1 shows, for every metric and for various values of the N-best results, the size of the selected swap set and the average number of swaps permitted per sentence (each pair is counted once). If more than one swap is applicable at different positions, all swap combinations are performed, and all respective paraphrases are produced. As a first step towards evaluation, certain swap pairs that are incapable of producing legitimate swaps are removed from the sets, e.g. pairs like [Phrase][#] (# denotes end of sentence), [Phrase][CONcoo], [Phrase][CONsub] and their symmetrical pairs. Then, two native speakers judged the produced paraphrases of 193 randomly selected sentences, according to grammaticality and naturalness. Inter-expert agreement exceeded 96% using the kappa statistic. The percentage of paraphrases that required one or more manual swaps from the judges in order to become grammatical and/or natural is shown in the right column of Table 1. MI returns a smaller but more diverse set of infrequent swap pairs. Such phrase types are: copular VPs, genitive NPs, unusual PPs (e.g. PPs introduced by the preposition ως - until). This set leads to a small average number of swaps per sentence, and a high error rate. The t-test returns a more extensive set of swap pairs that consist of more frequent phrase types and results in the smallest error rate. A significant part of the errors is attributed to the automatic nature and the low level of the chunking process: erroneous phrase splitting, incorrect attachment of punctuation marks, and the inability to identify certain relative and adverbial expressions, to resolve PP attachment ambiguities, subordination dependencies, etc.
Table 1. Swap set size and error rate for every metric

          Swap set size / avg nr of swaps               Error rate
          Top50    Top100   Top200   Top300     Top50   Top100   Top200   Top300
T-test    21/3.8   38/4.2   67/4.6   92/4.9     27.8%   29.1%    29.7%    36.9%
LLR       11/2.2   31/2.5   49/2.8   77/3.0     34.8%   35.5%    37.1%    41.2%
χ2        12/3.1   30/3.4   47/3.6   71/3.8     28.1%   29.9%    30.6%    37.7%
MI        16/0.6   19/0.6   36/0.9   60/1.4     33.1%   35.1%    35.4%    39.9%
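As an illustration of how adjacent phrase-type pairs might be ranked by association strength, here is a hedged Python sketch using textbook formulations of pointwise mutual information and the t-score for bigrams. The paper does not give its exact formulas, so the scoring functions, the toy phrase-type names, and the single shared normalisation constant are assumptions rather than the author's implementation.

```python
import math
from collections import Counter

def rank_swap_pairs(sentences, top_n=200):
    """Rank adjacent phrase-type pairs and keep those whose two orderings
    both appear among the top_n bigrams. `sentences` is a list of
    phrase-type sequences, e.g. [["NPnom", "VP", "PPse"], ...]."""
    unigrams, bigrams = Counter(), Counter()
    for types in sentences:
        unigrams.update(types)
        bigrams.update(zip(types, types[1:]))
    n = sum(bigrams.values())  # single normalisation constant (approximation)

    def pmi(pair):
        a, b = pair
        return math.log2((bigrams[pair] / n) /
                         ((unigrams[a] / n) * (unigrams[b] / n)))

    def t_score(pair):
        a, b = pair
        p_ab = bigrams[pair] / n
        expected = (unigrams[a] / n) * (unigrams[b] / n)
        return (p_ab - expected) / math.sqrt(p_ab / n)

    selected = {}
    for name, score in (("t-test", t_score), ("PMI", pmi)):
        top = set(sorted(bigrams, key=score, reverse=True)[:top_n])
        # keep a pair only if both orderings made the cut; count it once
        selected[name] = {tuple(sorted(p)) for p in top if (p[1], p[0]) in top}
    return selected

pairs = rank_swap_pairs([["NPnom", "VP", "PPse"], ["VP", "NPnom", "PPse"],
                         ["NPnom", "VP", "ADP"], ["VP", "ADP", "NPnom"]])
print({name: len(s) for name, s in pairs.items()})
```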
To reduce the error rate, the extracted swap sets undergo a filtering process, where erroneous swap pairs are learned using supervised classification and withdrawn from the final pair sets. The positions of possible swaps are identified according to the T-test swap set for the top 200 results. A learning vector is created for every input sentence and each swap position for the 193 sentences. The features forming the vector encode syntactic information for the phrase right before the swap position, two phrases to the left and two phrases to the right. Thereby, context information is taken into account. Each of the five phrases is represented through six features (Table 2). Unlike previous supervised learning approaches to paraphrase identification [6], the presented dataset does not consist of candidate sentence-paraphrase pairs, but of single sentences that in certain positions allow (or do not allow) the neighboring phrases to be swapped. So commonly employed features like shared word sequences and word similarity [6] are outside the scope of the methodology and do not abide by the low-resource policy. A support vector machines (SVM) classifier (first-degree polynomial kernel function, and SMO for training) classified instances using 10-fold cross validation. SVMs were selected because they are known to cope well with high data sparseness and multiple-attribute problems. Classification reached 82% precision and 86.2% recall. The correlation of each swap pair with the target class (valid/not valid paraphrase) was estimated next. 28 swap pairs that appear more frequently with the negative than with the positive class value were removed from the final swap set.

Table 2. The features of the learning vector
     NP                                 VP                       PP            CON/ADP
1    NP                                 VP                       PP            CON/ADP
2    case of phrase headword            conjunction in VP        preposition   1st word lemma
3    NP is (in)definite                 verb is (not) copular    -             -
4    pronoun in NP (if any)             -                        -             -
5    contains (not) genitive element    -                        -             -
6    nr of words in phrase (all phrase types)
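The filtering phase described above could be approximated with an off-the-shelf SVM. The sketch below (scikit-learn, first-degree polynomial kernel) is only indicative: the feature names, the two toy instances and their labels are invented placeholders, and the original SMO-based setup and the full 30-dimensional feature vector are not reproduced.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# One instance per candidate swap position: categorical features for the
# phrase before the position and its neighbours (names are illustrative).
instances = [
    {"p0_type": "NPnom", "p0_def": "def", "p0_len": 2,
     "r1_type": "VP", "r2_type": "PPgia", "l1_type": "VP", "l2_type": "CONcoo"},
    {"p0_type": "VP", "p0_cop": "yes", "p0_len": 1,
     "r1_type": "NPgen", "r2_type": "ADP", "l1_type": "NPnom", "l2_type": "PPse"},
]
labels = [1, 0]  # 1: the swap yields a valid paraphrase, 0: it does not

vectorizer = DictVectorizer()        # one-hot encodes the categorical features
X = vectorizer.fit_transform(instances)

# First-degree polynomial kernel, mirroring the setting described above.
classifier = SVC(kernel="poly", degree=1)
classifier.fit(X, labels)
print(classifier.predict(X))
```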
The reduced swap set was evaluated against a held-out test set (100 new corpus sentences, not included in the training data of the filtering phase) and reached an error rate of 17.6%. Against the 193-sentence training set, the error rate dropped to 14.3%. Given the ‘knowledge poverty’ of the approach, the results are satisfactory when compared to those of approaches that utilize sophisticated resources [7].
It is interesting to study the pairs that tend to lead to correct vs. incorrect swaps. PPs introduced by the preposition για (for) are usually attached to the sentence verb, and so may almost always be swapped with the preceding phrase. PPs introduced by the preposition σε (to) are more problematic. ADPs may usually be swapped with preceding NPs, but preceding VPs are confusing. Consecutive main verb phrases are rarely ‘swapable’. Certain secondary clauses (e.g. final or relative clauses) may often be swapped with their preceding main verb phrase, but not with a preceding NP. The use of other filters, the set of features for supervised learning, and the context window size should be further explored. Another challenging perspective would be to enlarge the window size between the phrases to be swapped, instead of focusing only on two consecutive chunks. This would increase paraphrasing accuracy.
References 1. Barzilay, R., Lee, L.: Learning to Paraphrase: An Unsupervised Approach Using MultipleSequence Alignment. In: Proceedings of the Conference on Human Language Technology (HLT-NAACL), Edmonton, pp. 16–23 (2003) 2. Bentivogli, L., Dagan, I., Dang, H., Giampiccolo, D., Magnini, B.: The Fifth PASCAL Recognizing Textual Entailment Challenge. In: Proceedings of the Text Analysis Conference. Gaithersburg, Maryland (2009) 3. Cox, I., Miller, M.L., Bloom, J.A.: Digital Watermarking. Morgan Kaufmann, San Francisco (2002) 4. Hatzigeorgiu, N., et al.: Design and Implementation of the online ILSP Greek Corpus. In: Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, pp. 1737–1742 (2000) 5. Kermanidis, K.L., Magkos, E.: Empirical Paraphrasing of Modern Greek Text in Two Phases: An Application to Steganography. In: Gelbukh, A. (ed.) CICLing 2009. LNCS, vol. 5449, pp. 535–546. Springer, Heidelberg (2009) 6. Kozareva, Z., Montoyo, A.: Paraphrase Identification on the Basis of Supervised Machine Learning Techniques. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds.) FinTAL 2006. LNCS (LNAI), vol. 4139, pp. 524–533. Springer, Heidelberg (2006) 7. Meral, H.M., Sevinc, E., Unkar, E., Sankur, B., Ozsoy, A.S., Gungor, T.: Syntactic Tools for Text Watermarking. In: Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents IX, vol. 6505 (2007) 8. Provos, N., Honeyman, P.: Hide and Seek: An Introduction to Steganography. IEEE Security and Privacy, 32–44 (2003) 9. Rus, V., McCarthy, P.M., Lintean, M.C., McNamara, D.S., Graesser, A.C.: Paraphrase Identification with Lexico-syntactic Graph Subsumption. In: Proceedings of the Florida Artificial Intelligence Research Society, pp. 201–206 (2008) 10. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: A Practical Chunker for Unrestricted Text. In: Christodoulakis, D.N. (ed.) NLP 2000. LNCS (LNAI), vol. 1835, pp. 139–150. Springer, Heidelberg (2000) 11. Topkara, M., Taskiran, C.M., Delp, E.: Natural Language Watermarking. In: Proceedings of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, San Jose (2005)
Learning Subclasses of Parallel Communicating Grammar Systems Sindhu J. Kumaar1 , P.J. Abisha2 , and D.G. Thomas2 1
Department of Mathematics, B.S. Abdur Rahman University Chennai - 600 048, Tamil Nadu, India [email protected] 2 Department of Mathematics, Madras Christian College East Tambaram, Chennai - 600 059, Tamil Nadu, India [email protected], [email protected]
Abstract. Pattern language learning algorithms within the inductive inference model and the query learning setting have been of great interest. In this paper we give an algorithm that learns a parallel communicating grammar system in which the master component is a regular grammar and the other components are pure pattern grammars.
1 Introduction
Inferring a pattern common to all words in a given sample is a typical instance of inductive inference [5]. Motivated by the study of Angluin on pattern languages [3], a generative device called pattern grammar was defined by Dassow et al. [4]. In [1], a generative device called pure pattern grammar is defined. In a pure pattern grammar, variables are not specified; instead, the constants themselves are initially replaced by axioms, and the process is continued with the current set of words to get the associated language. A parallel communicating (PC) grammar system consists of several grammars working synchronously, each on its own sentential form and communicating by request. Here we give an algorithm to learn a parallel communicating grammar system in which the master component is a regular grammar and the remaining components are pure pattern grammars.
2 Pure Pattern Grammar
Definition 1. [1] A pure pattern grammar (PPG) is a triple G = (T, A, P) where T is an alphabet, A ⊆ T∗ is a finite non-empty set of elements of T∗ called axioms, and P is a finite non-empty subset of T+ called the set of patterns. For a set P and a language L ⊆ T∗, let P(L) be the set of strings obtained by replacing, uniformly and in parallel, each letter of all patterns in P by strings in L, all different occurrences of the same letter in a pattern being replaced by the same string. The pure pattern language (PPL) generated by G, denoted by L(G), is the smallest language L ⊆ T∗ for which we have P ⊆ L, A ⊆ L, and P(L) ⊆ L. In fact L(G) = P ∪ A ∪ P(A) ∪ P(P(A)) ∪ . . . .
Example 1. G1 = ({a}, {a}, {aa}), L(G1) = {a, aa, aaaa, . . .} = {a^(2^n) | n ≥ 0}, since P = {aa}, A = {a}, P(A) = {aa}, P(P(A)) = {aaaa}, . . .

Definition 2. A parallel communicating pure pattern grammar system PC(PPG) is a construct Γ = (N, T, K, (P0, S0), (P1, A1), . . . , (Pn, An)) where N, T, K are non-empty pairwise disjoint finite alphabets, N is the set of nonterminals, T is the set of terminals, K = {Q1, . . . , Qn} is the set of query symbols, and S0 ∈ N. (N, T ∪ K, P0, S0) is a regular grammar and the (T, Ai, Pi) are pure pattern grammars. The rewriting in the component (Pi, Ai) is done according to the PPG, i.e., P_i^k(A_i) is considered in the k-th step, until a query is asked. If a query symbol Qj appears in the master component (P0, S0), then the strings in the j-th component are communicated to the master component. The language generated by such a system is the set of all words in T∗ generated by the master component; it is called a parallel communicating pure pattern language, written PC(PPL) for short.

Example 2. Γ = (N, T, K, (P0, S0), (P1, A1)), N = {S0}; T = {a, b}; K = {Q1}; P0 = {S0 → aS0, S0 → bS0, S0 → aQ1, S0 → bQ1}; P1 = {aba}, A1 = {a, ab}.
(S0, {aba}) ⇒ (aS0, {aaa, aaba, abaab, ababab}) ⇒ (abQ1, {a^9, aaaaabaaaa, aaaabaabaaa, . . .}) ⇒ (ab{a^9, aaaaabaaaa, aaaabaabaaa, . . .}, y) where y = aba.
L(Γ) = {a, b}^+ {a^9, aaaaabaaaa, aaaabaabaaa, . . .}, if Γ works in the returning mode.
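A small Python sketch may help to see how P(A), P(P(A)), . . . unfold; it follows the iterative characterization of L(G) given in Definition 1 and only computes a finite-depth approximation, so it is an illustration rather than a decision procedure.

```python
from itertools import product

def apply_patterns(patterns, language):
    """P(L): replace each letter of every pattern, uniformly and in
    parallel, by strings from L (all occurrences of the same letter in a
    pattern receive the same string)."""
    result = set()
    for pattern in patterns:
        letters = sorted(set(pattern))
        for choice in product(sorted(language), repeat=len(letters)):
            substitution = dict(zip(letters, choice))
            result.add("".join(substitution[c] for c in pattern))
    return result

def ppl_approximation(A, P, depth=3):
    """Finite approximation of L(G) = P ∪ A ∪ P(A) ∪ P(P(A)) ∪ ..."""
    language, current = set(P) | set(A), set(A)
    for _ in range(depth):
        current = apply_patterns(P, current)
        language |= current
    return language

# Example 1: G1 = ({a}, {a}, {aa}) yields {a, aa, aaaa, aaaaaaaa, ...}
print(sorted(ppl_approximation({"a"}, {"aa"}), key=len))
# First step of Example 2's second component: P1(A1) for P1 = {aba}, A1 = {a, ab}
print(sorted(apply_patterns({"aba"}, {"a", "ab"})))
```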
3 Learning Parallel Communicating Grammar Systems PC(PPL)
In this section we attempt to learn parallel communicating grammar systems in which the master component is a regular grammar and the second component is a pure pattern grammar with a single pattern. The algorithm to learn a PC(PPL) is as follows:
1. From the language generated by the parallel communicating grammar system, one string w of length r is given as the input. We assume that the length n of the pattern, the maximum length m of the axioms and the alphabet T are known. Using restricted superset queries the pattern is learnt, and by restricted subset queries the set of axioms of the required pure pattern grammar is learnt. This checking is done until an equivalent pure pattern grammar is obtained.
2. Now, we check whether x = x1 x2 . . . xr is a member of the pure pattern language. If yes, the program halts because we have already learnt the pure pattern grammar. Otherwise we split the first character x1 from the left extreme of the sample x and check whether the remaining string is a member of
the required pure pattern language. This process is repeated until we get the longest suffix x_{i+1} x_{i+2} . . . x_r of x which is a member of the pure pattern language. For the remaining prefix x1 x2 . . . xi of x, we ask a membership query, i.e., we first check if the string x1 x2 . . . xi is a member of the required regular set (the language generated by the master component); then, taking x1 x2 . . . xi as a sample, we try to learn the regular grammar.

Algorithm 1
Input: The alphabet T, a positive sample w ∈ T+ of length r with w = w1 w2 . . . wr, the length n of the pattern, the maximum length m of the axioms, r ≥ n, and the words x1, x2, . . . , xs of ∪_{i=1}^{m} T^i given in increasing length order, among words of equal length according to the lexicographic order.
Output: A parallel communicating grammar system Γ' = (N, T, Q, (P0, S0), (P1, A1)) with L(Γ') = L(Γ).

Procedure (Pattern)
begin
  Let u1, u2, . . . , ut be the words in T^n in the lexicographic order
  for i = 1 to t
  begin
    ask the restricted superset query for (T, ∪_{j=1}^{m} T^j, {ui}), ui ∈ T^n
    if yes then p := ui; call (Axiom)
    else i := i + 1
  end

Procedure (Axiom)
  Let x1, x2, . . . , xs be the words in ∪_{i=1}^{m} T^i arranged in lexicographic order
  A := ∅
  for t = 1 to s do
  begin
    ask the restricted subset query for G = (T, A ∪ {xt}, {p})
    if yes then A := A ∪ {xt} and t := t + 1
    else output G
  end
  Print the pure pattern grammar (T, A, p)

Procedure (Master)
  for i = 1 to r − 1
  begin
    Ask the membership query for wi+1 . . . wr: is wi+1 . . . wr ∈ L(T, A, {p})?
    If yes, then for the prefix x = w1 w2 . . . wi, ask the membership query: is xq ∈ L(N, T ∪ {q}, P0, S0)?
      If yes, then run L∗ using prefixes of x.
        If L∗ gives the correct automaton, write the corresponding regular grammar which is equivalent to G0 = (N, T ∪ {q}, P0, S0)
        else i := i + 1
      else i := i + 1
  Print Γ' = (N, T, Q, (P0, S0), (P1, A1)), the PC grammar system
  end
end

Time Analysis: As each of the procedures runs in polynomial time, the algorithm to learn PC(PPL) is also polynomial.
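The skeleton below is a hedged Python rendering of the three procedures. The five oracle callables (restricted superset and subset queries, the two membership queries and an L* routine) are assumed to be supplied by a teacher and are left unimplemented, and the function and parameter names are placeholders rather than part of the original algorithm.

```python
from itertools import product

def learn_pc_ppl(T, n, m, w, superset_query, subset_query,
                 ppl_member, master_member, run_lstar):
    """Skeleton of Algorithm 1; the five oracle callables are assumed to
    be provided by a teacher and are not implemented here."""
    # Words over T of length 1..m, in increasing length / lexicographic order.
    universe = ["".join(t) for k in range(1, m + 1)
                for t in product(sorted(T), repeat=k)]

    # Procedure (Pattern): candidate patterns of length n, lexicographically.
    pattern = None
    for candidate in ("".join(t) for t in product(sorted(T), repeat=n)):
        if superset_query(T, universe, {candidate}):
            pattern = candidate
            break

    # Procedure (Axiom): grow the axiom set with restricted subset queries.
    axioms = set()
    for x in universe:
        if subset_query(T, axioms | {x}, {pattern}):
            axioms.add(x)

    # Procedure (Master): longest suffix of w in the PPL, then L* on the prefix.
    for i in range(len(w)):
        if ppl_member(w[i:], T, axioms, pattern) and master_member(w[:i] + "q"):
            return run_lstar(w[:i]), (T, axioms, pattern)
    return None, (T, axioms, pattern)
```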
References 1. Abisha, P.J., Subramanian, K.G., Thomas, D.G.: Pure Pattern Grammars. In: Proceedings of International Workshop Grammar Systems, Austria, pp. 253–262 (2000) 2. Abisha, P.J., Thomas, D.G., Sindhu J. Kumaar: Learning Subclasses of Pure Pattern Languages. In: Clark, A., Coste, F., Miclet, L. (eds.) ICGI 2008. LNCS (LNAI), vol. 5278, pp. 280–283. Springer, Heidelberg (2008) 3. Angluin, D.: Learning Regular Sets from Queries and Counter Examples. Information and Computation 75, 87–106 (1987) 4. Dassow, J., Paun, G., Rozenberg, G.: Generating Languages in a Distributed Way: Grammar Systems. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages. Springer, Heidelberg (1997) 5. Gold, E.M.: Language Identification in the Limit. Information and Control 10, 447– 474 (1967) 6. Salomaa, A.: Formal Languages. Academic Press, New York (1973)
Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages Herman Stehouwer and Menno van Zaanen TiCC, Tilburg University, Tilburg, The Netherlands {J.H.Stehouwer,M.M.vanZaanen}@uvt.nl
Abstract. In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build this kind of models with additional information for each symbol, such as part-of-speech tags and dependency information. The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous backoff automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.
1 Introduction
When writing texts, people often use spelling checkers to reduce the number of mistakes in their texts. Many spelling checkers concentrate on non-word errors. However, there are also types of errors in which words are correct, but used incorrectly in context. These errors are called contextual errors and are much harder to recognize than non-word errors. In this paper, we describe a novel approach to storing the models which is based on suffix arrays, i.e. sorted arrays containing all suffixes of a collection of sequences. This approach can be used to make decisions about alternative corrections of contextual errors. The use of suffix arrays allows us to use large, potentially enriched n-grams and as such can be seen as an extension of more conventional n-gram models. The underlying assumption of the language model is that using more (precise) information pertaining to the decision is better [3]. The approach can also be seen as a collection of k-testable automata that we can access using a single query. As de la Higuera states in [4], choosing the right size k is a crucial issue. When k is too small, over-generalization will occur; conversely, a k that is too large leads to models that might not generalize enough. The approach described here automatically chooses the largest k applicable to the situation.
2 Approach
To select the best sequence out of a set of alternative sequences, such as in the problem of contextual errors in text, we consider all possible alternatives and use a model to select the most likely sequence. The sequence with the highest probability is selected as the correct form. The language model we use here is based on unbounded size n-grams. The probability of a sequence is computed by multiplying the probabilities of the n-gram for each position in the sequence:

    P_seq = ∏_{w ∈ seq} P_LM(w | w_{−1} . . . w_{−n})
Considering that the probabilities are extracted from the training data, when using n-grams with very large n, data sparseness is an issue. Long sequences may simply not occur in the data, even though the sequence is correct, leading to a probability of zero, even though the correct probability should be non-zero (albeit small). To reduce the impact of data sparseness, we can use techniques such as smoothing [2], which redistributes probability mass to estimate the probability of previously unseen word sequences1 or back-off, where probabilities of lower order n-grams are used to approximate the probability of the larger n-gram. In this article, we use the synchronous back-off method [6] to deal with data sparseness. This method analyzes n-grams of the same size for each of the alternative sequence in parallel. If all n-grams have zero probability, the method backs off to n − 1-grams. This continues until at least one n-gram for an alternative has a non-zero probability. This implements the idea that, assuming the training data is sufficient, if a probability is zero the n-gram combination is not in the language. Effectively, this method selects the largest, usable n-grams automatically. Probabilities of all n-grams (from the training data) of all sizes are stored in an enhanced suffix array. A suffix array is a flat data-structure containing an implicit suffix tree structure [1]. A suffix tree is a trie-based data structure [5, pp. 492] that stores all suffixes of a sequence in such a way that a suffix (and similarly an infix) can be found in linear time in the length of the suffix. All suffixes occupy a single path from the root of the suffix tree to a leaf. Construction of the data structure only needs to be performed once. Due to the way suffix arrays are constructed, we can efficiently find the number of occurrences of subsequences (used as n-grams) of the training data. Starting from the entire suffix array we can quickly identify the interval(s) that pertain to the particular n-gram query. The interval specifies exactly the number of occurrences of the subsequence in the training data. Effectively, this means that we can find the largest non-zero n-gram efficiently. 1
In this paper we do not employ smoothing or interpolation methods as they modify the probabilities of all alternatives equally and hence will not affect the ordering of alternative sequences.
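A minimal sketch of synchronous back-off follows, assuming that raw n-gram counts stand in for the conditional probabilities and that the candidate corrections are given as full token sequences. The toy corpus and the dictionary-based counting are simplifications of the suffix-array machinery described in the next section.

```python
from collections import defaultdict

def ngram_counts(corpus, max_n):
    """Count every n-gram up to max_n in the tokenised training sentences."""
    counts = defaultdict(int)
    for sentence in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts

def synchronous_backoff(counts, alternatives, max_n):
    """Compare all alternatives at the same n; back off to n-1 only when
    every alternative scores zero at the current n."""
    for n in range(max_n, 0, -1):
        scores = {}
        for alternative in alternatives:
            score = 1
            for i in range(len(alternative)):
                gram = tuple(alternative[max(0, i - n + 1):i + 1])
                score *= counts.get(gram, 0)
            scores[tuple(alternative)] = score
        if any(s > 0 for s in scores.values()):
            return max(scores, key=scores.get), n
    return tuple(alternatives[0]), 0

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
counts = ngram_counts(corpus, max_n=4)
alternatives = [["the", "cat", "sat"], ["the", "cat", "sit"]]
print(synchronous_backoff(counts, alternatives, max_n=4))
```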
3 Suffix Arrays as Collections of k-Testable Machines
An enhanced suffix array extends a regular suffix array with a data-structure allowing for the implicit access of the longest-common-prefix (lcp) intervals [1]. An lcp interval represents a virtual node in the implicit suffix trie. A simple enhanced suffix array with its corresponding implicit suffix trie is shown in Figure 1 as an example.

i   suffix  lcp  S[suffix]
0   2       0    aaacatat$
1   3       2    aacatat$
2   0       1    acaaacatat$
3   4       3    acatat$
4   6       1    atat$
5   8       2    at$
6   1       0    caaacatat$
7   5       2    catat$
8   7       0    tat$
9   9       1    t$
10  10      0    $

Fig. 1. An enhanced suffix array on the string S = acaaacatat, shown above, and its corresponding lcp-interval tree. From [1].

We can view a suffix array as a virtual DFA in which each state is described by a set of lcp-intervals over the suffix array. This view allows us to determine (by the size of the interval) the number of valid sequences that terminated in each state. If there is no valid path in the DFA for the queried sequence, it results in an empty state and the sequence is rejected by the learned grammar. Since the suffix array stores the n-grams of all sizes n, this comes down to a collection of k-testable machines with k = 1 . . . |T| (with T the training data). Querying with length k automatically results in using a k-testable machine. There is an interesting property of the n-gram suffix array approach which separates it from collections of regular k-testable machine DFAs: all the states on the suffix array are accepting states. Rejection of a sequence only happens when the query cannot be found in the training data at all. The system also does not support negative training examples, only positive ones. To enhance the system, we have generalized a state to be described by a set of lcp intervals. This allows for the support of single-position wildcards. In practice, wildcards allow for the integration of additional information. By interleaving the symbol sequences with the additional symbols, we can incorporate, for instance, long-range information, such as dependency information, and local, less specific features such as part-of-speech tags. Using wildcards, we can construct queries that either use such additional information on one or more positions or not.

To evaluate the approach, we ran experiments on three contextual error problems from the natural language domain, namely confusible disambiguation, verb
and noun agreement and adjective ordering. The synchronous back-off method automatically selects the k-testable machine that has the right amount of specificity for selecting between the alternative sequences. These experiments were run with a simple words-only approach and also with part-of-speech tags. The experiments show that the approach is feasible and efficient. When trained on the first 675 thousand sequences of the British National Corpus, building the enhanced suffix array takes 2.3 minutes on average. These sequences contain about 27 million tokens. When loaded into memory, the enhanced suffix array uses roughly 500 megabytes. We ran speed tests using 10,000 randomly selected sequences of length 10. The system has an average runtime of 10.2 minutes over tens of runs, with 8.1 and 12.1 minutes as extremes. This means that we can expect the enhanced suffix array to process around 1200 queries per minute. All tests were run on a 2GHz Opteron system with 32GB of main memory. The suffix array process is single-threaded.
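For illustration, counting n-gram occurrences through a plain, toy-scale suffix array can be sketched as follows. The token list, the quadratic-memory construction and the sentinel trick are simplifications and not the enhanced-suffix-array implementation evaluated above.

```python
import bisect

def build_suffix_array(tokens):
    """All suffixes of the training sequence, sorted (toy-scale stand-in
    for a suffix array over tokens)."""
    return sorted(tokens[i:] for i in range(len(tokens)))

def count_occurrences(suffixes, query):
    """Occurrences of an n-gram = width of the interval of suffixes that
    start with it (cf. the lcp intervals of Figure 1)."""
    low = bisect.bisect_left(suffixes, query)
    high = bisect.bisect_left(suffixes, query + [chr(0x10FFFF)])  # sentinel
    return high - low

tokens = list("acaaacatat")            # the example string of Figure 1
suffixes = build_suffix_array(tokens)
print(count_occurrences(suffixes, list("aca")))   # 2 (positions 0 and 4)
print(count_occurrences(suffixes, list("at")))    # 2 (positions 6 and 8)
```

A production implementation would build the suffix array in (near-)linear time and store only start positions, but the interval width used for counting is the same idea.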
4 Conclusion and Future Work
We have proposed a novel approach which implements a collection of k-testable automata using an enhanced suffix-array. This approach describes automata that have no explicit reject states and do not require (or support) negative examples during training. Nevertheless, this approach allows for an efficient implementation of many concurrent k-testable machines of various k using suffix arrays. The implementation will be applied as a practical system in the context of text correction, allowing additional linguistic information to be added when needed. In this context, the effectiveness of the additional information in combination with the limitations of k-testable languages still needs to be evaluated.
References 1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004) 2. Chen, S., Goodman, J.: An empirical study of smoothing techniques for language modelling. In: Proceedings of the 34th Annual Meeting of the ACL, pp. 310–318. ACL (June 1996) 3. Daelemans, W., Van den Bosch, A., Zavrel, J.: Forgetting exceptions is harmful in language learning. Machine Learning, Special issue on Natural Language Learning 34, 11–41 (1999) 4. de la Higuera, C.: Grammatial Inference, Learning Automata and Grammars. Cambridge University Press, Cambridge (2010) 5. Knuth, D.E.: The art of computer programming. Sorting and searching, vol. 3. Addison-Wesley, Reading (1973) 6. Stehouwer, H., Van den Bosch, A.: Putting the t where it belongs: Solving a confusion problem in Dutch. In: Verberne, S., van Halteren, H., Coppen, P.A. (eds.) Computational Linguistics in the Netherlands 2007: Selected Papers from the 18th CLIN Meeting, pp. 21–36. Nijmegen, The Netherlands (2009)
Learning Fuzzy Context-Free Grammar—A Preliminary Report
Olgierd Unold, Member, IEEE
Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
[email protected]
http://olgierd.unold.staff.iiar.pwr.wroc.pl/
Abstract. This paper takes up the task of learning a fuzzy context-free grammar from data. The induction process is divided into two phases: first the generic grammar is derived from the positive sentences, then membership grades are assigned to the productions, taking into account the occurrences of the productions in a learning set. The problem of predicting the location of promoters in Escherichia coli is examined. The language of a bacterial sequence can be described using a formal system such as a context-free grammar, and the problem of promoter region recognition can be replaced by grammar induction. The induced fuzzy grammar was compared to other machine learning methods. Keywords: Grammatical Inference, Fuzzy Grammar.
1 Introduction
Fuzzy languages and grammars were introduced in [1]. The fuzzy language theory enables us, contrary to the crisp language theory, to distinguish between the huge and the tiny errors we allow in the input of the parser or the recognizer. Fuzzifying context-free languages (CFLs) is a great step towards robustness in parsing CFLs. We refer the reader to [2] for more details of fuzzy languages and fuzzy automata. In this paper, we are interested in inducing a fuzzy context-free grammar (FCFG) that accepts a CFL given a finite number of positive and negative examples drawn from that language. Relatively few efforts have been made to learn FCFGs or fuzzy finite automata that recognize FCFGs [3,4,5,6]. This paper addresses fuzzy context-free grammar induction using a novel, flexible approach based on a learning set. First, the algorithm produces a crisp, generic context-free grammar in Chomsky Normal Form (CNF). The generic grammar includes all possible production rules for a chosen learning string, i.e. we assume that in each position of the string we can insert all terminals. Next, the algorithm determines the membership grades for all productions of the grammar. A fuzzy formal language is a formal language where each word has a degree of membership to the language. A FCFG G = (V, T, P, S, ω, ⊗, ⊕) consists of a set
of variables V, a set of terminals T, a set of productions P, a start symbol S, a set of weights ω defined over the production rules P, a t-norm ⊗, and a t-conorm ⊕. Productions are of the form A →^ω α, where A ∈ V, α ∈ (V ∪ T)∗ and ω ∈ [0, 1]. The empty word is denoted by λ. The fuzzy language L(G) generated by this fuzzy grammar is {(w, μ_L(w)) | w ∈ T∗, S ⇒∗ w}, where μ_L(w) represents the degree of membership of the word w to the language L and is obtained by applying the t-norm ⊗ to the weights of all productions involved in the generation of w. Should the grammar be ambiguous, and a word w be reachable from S by different sequences of productions, then the t-conorm ⊕ is used to calculate the final degree of membership from the degrees of membership obtained through different sequences of productions. A λ-free (fuzzy) context-free grammar G is in CNF iff P ⊆ V × [0, 1] × (T ∪ V × V).
2 Learning Fuzzy Grammar
For assigning a (crisp) grammar to a learning set, we adopt the algorithm proposed in [7]. All sentences in the learning set are assumed to be of equal length, and for one chosen (positive) sentence the generic grammar is derived. For example, for a string of length 4 the following productions P are obtained:

S → AW1    W1 → AW2    W2 → AW3    W3 → A    A → a
S → CW1    W1 → CW2    W2 → CW3    W3 → C    C → c
S → GW1    W1 → GW2    W2 → GW3    W3 → G    G → g
S → TW1    W1 → TW2    W2 → TW3    W3 → T    T → t
After the generic grammar has been generated, the membership grades are assigned to the production rules. The initial membership grades are set to 0.5. Note that setting the initial values is necessary in order to use different t-norms (like fuzzy AND). The membership grade of each production Pi is calculated as μ_Pi = (NP_i + NN_i) / (PS + NS), where PS denotes the number of positive sentences in the learning set, NS the number of negative sentences, NP_i the number of occurrences of the production Pi in a derivation of the positive sentences, and NN_i the number of non-occurrences of the production Pi in a derivation of the negative sentences (i.e. NN_i is counted for the production Wi → xWi+1 as the sum of the occurrences of the productions Wi → yWi+1, where x, y ∈ T and x ≠ y). During the testing phase, the final degree of membership of each sentence is worked out from the degrees of membership obtained through different sequences of productions; the average function was used. The threshold was set to 0.5, and each sentence with a membership over this threshold is counted as a positive sentence.
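The two phases can be sketched briefly in Python, under the assumption that the learning set consists of equal-length strings over {a, c, g, t} and that a production Wi → xWi+1 is identified with "symbol x at position i". The data below are invented, and the aggregation follows the average-and-threshold rule stated above.

```python
def membership_grades(positives, negatives, alphabet="acgt"):
    """Grade each production Wi -> x Wi+1 (symbol x at position i) as
    mu = (NP_i + NN_i) / (PS + NS), following the formula above."""
    length, ps, ns = len(positives[0]), len(positives), len(negatives)
    mu = {}
    for i in range(length):
        for x in alphabet:
            np_i = sum(1 for s in positives if s[i] == x)   # occurrences in positives
            nn_i = sum(1 for s in negatives if s[i] != x)   # non-occurrences in negatives
            mu[(i, x)] = (np_i + nn_i) / (ps + ns)
    return mu

def classify(sentence, mu, threshold=0.5):
    """Average the grades of the productions used to derive the sentence."""
    grades = [mu[(i, x)] for i, x in enumerate(sentence)]
    return sum(grades) / len(grades) >= threshold

positives = ["acgt", "acct", "gcgt"]     # invented toy learning set
negatives = ["tttt", "tgta", "ttaa"]
mu = membership_grades(positives, negatives)
print(classify("acga", mu), classify("ttta", mu))
```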
In this paper, we address the problem of predicting the location of promoters in Escherichia coli [8]. The language of a bacterial sequence can be described by using a formal system such as a context-free grammar, and the problem of promoter region recognition can be replaced by grammar induction. The gene content of these genomes was mostly computationally recognized. However, the promoter regions are still undetermined in most cases and software able to accurately predict promoters in sequenced genomes is not yet available in the public domain.

Table 1. Comparison of the induced fuzzy grammar (IFG) with different methods. Leung et al. [13] introduced Basic Gene Grammars (BGG) to represent many formulations of the knowledge of E. coli promoters. BGG is able to represent knowledge acquired from knowledge-based artificial neural network learning (the KBANN approach in [14]), and a combination of a grammar of weight matrices [15] and KBANN (denoted as WANN). The development of BGG is supported by DNA-ChartParser. GCS, introduced by O. Unold [16], is a kind of learning classifier system which evolves, using a genetic algorithm, a population of context-free grammar productions. After each execution four numbers were calculated: True Positives (correctly recognized positive examples), True Negatives (correctly recognized negatives), False Negatives (positives recognized as negatives), and False Positives (negatives recognized as positives). Then the average of these numbers was found and the following measures were calculated: Specificity, Sensitivity, and Accuracy. Specificity is a measure of the incidence of the negative results in testing all the non-promoter sequences, i.e., (True Negatives/(False Positives + True Negatives)) x 100. Sensitivity is a measure of the incidence of positive results in testing all the promoter sequences, i.e., (True Positives/(True Positives + False Negatives)) x 100. Accuracy is measured by the number of correct results, the sum of true positives and true negatives, in relation to the number of tests carried out, i.e., ((True Positives + True Negatives)/Total) x 100.

Method   Specificity   Sensitivity   Accuracy
KBANN    97            16            56
WANN     82            69            75
GCS      94            61            78
IFG      72            78            75
Promoter recognition, the computational task of finding the promoter regions on a DNA sequence, is very important for defining the transcription units responsible for specific pathways. A promoter enables the initiation of gene expression after binding with an enzyme called RNA polymerase, which moves bidirectionally in searching for a promoter, and starts making RNA according to the DNA sequence at the transcription initiation site, following the promoter [9]. The genome can be treated as a string composed of the letters A, C, T, G. The goal is, given an arbitrary potential promoter region, to be able to find out whether it is a true or a false promoter region. As the learning set, the database introduced by M. Noordewier and J. Shavlik to the UCI repository was used [10]. The database consists of 53 positive instances and 53 negative instances, 57 letters each. Negative learning sentences were derived from the E. coli bacteriophage T7, believed to not contain any promoter sites. In order to get an estimate of how well the algorithm learned the concept of a promoter, a test set consisting of 36 unseen instances, including 18 positive and 18 negative examples, was prepared. Positive test instances were prepared by mutating the bases of randomly chosen positive learning sentences in non-critical positions, negative test instances by mutating any positions of randomly chosen negative learning sentences. This method increases the amount of available examples and was first proposed in [11]. The induced fuzzy grammar (IFG) achieved 75% accuracy, 72% specificity,
and 75% sensitivity on the testing set. Table 1 compares the results of IFG and three formal-system-based methods presented in [12]. The results obtained by the induced fuzzy grammar are broadly comparable to those methods: its specificity is the lowest of the compared methods, but its sensitivity is the highest. Note that by replacing the symbols A, C, T, G by a, c, t, g in the grammar, one gets an equivalent regular grammar. Moreover, the induced grammar is not ambiguous. However, we believe that the use of fuzzy grammars can be a significant step towards robustness in parsing formal languages, and that the proposed approach is flexible enough to deal with complex tasks. The use of different t-norms and t-conorms will be a subject of further testing.
References 1. Lee, E.T., Zadeh, L.A.: Note on fuzzy languages. Inform. Sci. 1, 421–434 (1969) 2. Mordeson, J.N., Mailk, D.S.: Fuzzy Automata and Languages: Theory and Applications. Chapman and Hall, Boca Raton (2002) 3. Mozhiwen, W.: An Evolution Strategy for the Induction of Fuzzy Finite-state Automata. Journal of Mathematics and Statistics 2(2), 386–390 (2006) 4. Wen, M.Z., Min, W.: Fuzzy Automata Induction using Construction Method. Journal of Mathematics and Statistics 2(2), 395–400 (2006) 5. Molina-Lozano, H., Vallejo-Clemente, E.E., Morett-Sanchez, J.E.: DNA sequence analysis using fuzzy grammars. In: IEEE International Conference on Fuzzy Systems, pp. 1915–1921 (2008) 6. Carter, P., Kremer, S.C.: Fuzzy Grammar Induction from Large Corpora. In: IEEE International Conference on Fuzzy Systems (2006) 7. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (2001) 8. Blattner, F., Plunkett, G., Bloch, C., Perna, N., Burland, V., Riley, M., ColladoVides, J., Glasner, J., Rode, C., Mayhew, G., et al.: The complete genome sequence of Escherichia coli k-12. Science 277, 1453–1462 (1997) 9. Lewin, B.: Genes VII. Oxford University Press, Oxford (2000) 10. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases, Department of Information and Computer Science. University of California, Irvine, CA (1992) 11. O’Neill, M.: Escherichia coli promoters: neural networks develop distinct descriptions in learning to search for promoters of different spacing classes. Nucleic Acids Res. 20, 3471–3477 (1992) 12. Unold, O.: Grammar-Based Classifier System for Recognition of Promoter Regions. In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007, Part I. LNCS, vol. 4431, pp. 798–805. Springer, Heidelberg (2007) 13. Leung, S.W., Mellish, C., Robertson, D.: Basic gene grammars and dna-chart parser for language processing of Escherichia coli promoter dna sequences. Bioinformatics 17, 226–236 (2001) 14. Towell, G., Shavlik, J.: Extracting refined rules from knowledge-based neural networks. Machine Learning 13, 71–101 (1993) 15. Rice, P., Elliston, K., Gribskov, M.: DNA. In: Girbskov, M., Devereux, J. (eds.) Sequence Analysis Primer, ch. 1, pp. 1–59. Stockton Press (1991) 16. Unold, O.: Context-free grammar induction with grammar-based classifier system. Archives of Control Science 15 (LI) 4, 681–690 (2005)
Polynomial Time Identification of Strict Prefix Deterministic Finite State Transducers
Mitsuo Wakatsuki and Etsuji Tomita
Graduate School of Informatics and Engineering, The University of Electro-Communications, Chofugaoka 1–5–1, Chofu, Tokyo 182-8585, Japan
{wakatuki,tomita}@ice.uec.ac.jp
Abstract. This paper is concerned with a subclass of finite state transducers, called strict prefix deterministic finite state transducers (SPDFST's for short), and studies the problem of identifying the subclass in the limit from positive data. After providing some properties of languages accepted by SPDFST's, we show that the class of SPDFST's is polynomial time identifiable in the limit from positive data in the sense of Yokomori.
(This work was supported in part by Grants-in-Aid for Scientific Research Nos. 18500108 and 20500007 from the MEXT of Japan.)
1 Introduction
A reasonable definition for polynomial time identifiability in the limit [3] from positive data has been proposed by Yokomori [4]. He has also proved that a class of languages accepted by strictly deterministic automata (SDA’s for short) [4] and a class of very simple languages [5] are polynomial time identifiable in the limit from positive data. As for a class of transducers, Oncina et al. [2] have proved that a class of onward subsequential transducers (OST’s for short), which is a proper subclass of finite state transducers, is polynomial time identifiable in the limit from positive data. The present paper deals with a subclass of finite state transducers called strict prefix deterministic finite state transducers (SPDFST ’s for short), and discusses the identification problem of the class of SPDFST’s. The class of SDA’s forms a proper subclass of associated automata with SPDFST’s. Moreover, the class of languages accepted by SPDFST’s is incomparable to the class of languages accepted by OST’s. After providing some properties of languages accepted by SPDFST’s, we show that the class of SPDFST’s is polynomial time identifiable in the limit from positive data in the sense of Yokomori [4]. The main result in this paper provides another interesting instance of a class of transducers which is polynomial time identifiable in the limit. This identifiability is proved by giving an exact characteristic sample of polynomial size for a language accepted by an SPDFST.
2 Basic Definitions and Notation
An alphabet Σ is a finite set of symbols. We denote by Σ∗ the set of all finite-length strings over Σ. The string of length 0 (the empty string) is denoted by
ε. Let Σ+ = Σ∗ − {ε}. We denote by |w| the length of a string w and by |S| the cardinality of a set S. A language over Σ is any subset L of Σ∗. For a string w ∈ Σ+, first(w) denotes the first symbol of w. For w ∈ Σ∗, alph(w) denotes the set of symbols appearing in w. For w ∈ Σ∗ and its prefix x ∈ Σ∗, x^{-1}w denotes the string y ∈ Σ∗ such that w = xy. For S ⊆ Σ∗, lcp(S) denotes the longest common prefix of all strings in S. Let Σ be any alphabet and suppose that Σ is totally ordered by some binary relation ≺. Let x = a1 · · · ar, y = b1 · · · bs, where r, s ≥ 0, ai ∈ Σ for 1 ≤ i ≤ r, and bi ∈ Σ for 1 ≤ i ≤ s. We write x ≺ y if (i) |x| < |y|, or (ii) |x| = |y| and there exists k ≥ 1 such that ai = bi for 1 ≤ i < k and ak ≺ bk. The relation x ⪯ y means that x ≺ y or x = y.
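A one-line sort key realises this length-then-lexicographic order; the alphabet ordering passed in is an assumption (any total order on Σ will do).

```python
def length_lex_key(word, order="ab"):
    """Sort key realising ≺: shorter words first; equal-length words are
    compared symbol by symbol in the given total order on Σ."""
    return (len(word), [order.index(c) for c in word])

words = ["ba", "a", "ab", "b", "aa"]
print(sorted(words, key=length_lex_key))   # ['a', 'b', 'aa', 'ab', 'ba']
```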
3 Strict Prefix Deterministic Finite State Transducers
A finite state or rational transducer (FST for short) is defined as a 6-tuple T = (Q, Σ, Δ, δ, q0, F), where Q is a finite set of states, Σ is an input alphabet, Δ is an output alphabet, δ is a finite subset of Q × Σ∗ × Δ∗ × Q whose elements are called transitions or edges, q0 is the initial state, and F (⊆ Q) is a set of final states [1][2]. A finite automaton M = (Q, Σ, δ', q0, F), where δ' ⊆ Q × Σ∗ × Q and (p, x, y, q) ∈ δ implies that (p, x, q) ∈ δ', is called an associated automaton with an FST T. A path in an FST T is a sequence of transitions π = (p0, x1, y1, p1)(p1, x2, y2, p2) · · · (pn−1, xn, yn, pn), where pi ∈ Q for 0 ≤ i ≤ n, and xi ∈ Σ∗, yi ∈ Δ∗ for 1 ≤ i ≤ n. When the intermediate states involved in a path are insignificant, a path is written as π = (p0, x1 x2 · · · xn, y1 y2 · · · yn, pn). For p, q ∈ Q, ΠT(p, q) denotes the set of all paths from p to q. By convention, we let (p, ε, ε, p) ∈ ΠT(p, p) for any p ∈ Q. We extend this notation by setting ΠT(p, Q') = ∪_{q∈Q'} ΠT(p, q) for any Q' ⊆ Q. A path π from p to q is successful iff p = q0 and q ∈ F. Thus, the set of all successful paths is ΠT(q0, F). Here, a state p ∈ Q is said to be reachable if ΠT(q0, p) ≠ ∅, and it is said to be live if ΠT(p, F) ≠ ∅. For an FST T, the language accepted by T is defined to be L(T) = {(x, y) ∈ Σ∗ × Δ∗ | (q0, x, y, q) ∈ ΠT(q0, F)}.

Definition 1. Let T = (Q, Σ, Δ, δ, q0, F) be an FST. Then, T is a strict prefix deterministic finite state transducer (SPDFST) iff T satisfies the following conditions: (1) δ ⊆ Q × Σ+ × Δ+ × Q. (2) For any (p, x1, y1, q1), (p, x2, y2, q2) ∈ δ, if first(x1) = first(x2), then x1 = x2, y1 = y2 and q1 = q2 (determinism condition). (3) For any (p, x1, y1, q1), (p, x2, y2, q2) ∈ δ, if first(x1) ≠ first(x2), then first(y1) ≠ first(y2). (4) For any (p1, x1, y1, q1), (p2, x2, y2, q2) ∈ δ with p1 ≠ p2 or q1 ≠ q2, it holds that first(x1) ≠ first(x2) or first(y1) ≠ first(y2) (i.e., the uniqueness of labels). If T satisfies conditions (3) and (4), we say that T has the strict prefix property. An SPDFST T = (Q, Σ, Δ, δ, q0, F) is said to be in canonical form if, for any p ∈ Q, p is reachable and live, and for any p ∈ Q − {q0}, it holds that p ∈ F or |{(p, x, y, q) ∈ δ | x ∈ Σ+, y ∈ Δ+, q ∈ Q}| ≥ 2. For any SPDFST T, there exists an SPDFST T' in canonical form such that L(T') = L(T), and we can
construct an algorithm that outputs such a T′. Hereafter, we are concerned with SPDFST's in canonical form. The following lemmas are derived from Definition 1.

Lemma 1. Let T = (Q, Σ, Δ, δ, q0, F) be an SPDFST, and let p, p′, q, q′ ∈ Q, x, x′ ∈ Σ+, and y, y′ ∈ Δ+. Then the following hold. (1) If (p, x, y, q) ∈ ΠT(p, q) and (p, x, y′, q′) ∈ ΠT(p, q′), then y = y′ and q = q′. (2) If (p, x, y, q) ∈ ΠT(p, q) and (p′, x, y, q′) ∈ ΠT(p′, q′), then p = p′ and q = q′. (3) For any π = (p, x, y, q) ∈ ΠT(p, q) and π′ = (p, x′, y′, q′) ∈ ΠT(p, q′), if first(x) = first(x′) and first(y) = first(y′), then π can be divided into (p, xc, yc, r) and (r, xc−1x, yc−1y, q), and π′ can be divided into (p, xc, yc, r) and (r, xc−1x′, yc−1y′, q′), where xc = lcp({x, x′}), yc = lcp({y, y′}), and r ∈ Q.

Lemma 2. Let T = (Q, Σ, Δ, δ, q0, F) be an SPDFST and let (x, y), (x1, y1), (x2, y2) ∈ L(T). Then, for each a, a1, a2 ∈ Σ with a1 ≠ a2 and b, b1, b2 ∈ Δ with b1 ≠ b2, the following hold. (1) If x = ax′ and y = by′ for some x′ ∈ Σ∗, y′ ∈ Δ∗, then there exists a transition (q0, u, v, p) ∈ δ such that first(u) = a and first(v) = b for some p ∈ Q. (2) If x1 = x′a1x1′, x2 = x′a2x2′, y1 = y′b1y1′ and y2 = y′b2y2′ for some x′, x1′, x2′ ∈ Σ∗ and y′, y1′, y2′ ∈ Δ∗, then there exist p, q1, q2 ∈ Q, u1, u2 ∈ Σ+, and v1, v2 ∈ Δ+ such that (p, u1, v1, q1), (p, u2, v2, q2) ∈ δ with first(u1) = a1, first(u2) = a2, first(v1) = b1 and first(v2) = b2. (3) If x2 = x1ax2′ and y2 = y1by2′ for some x2′ ∈ Σ∗, y2′ ∈ Δ∗, then there exist p ∈ F, q ∈ Q, u ∈ Σ+, and v ∈ Δ+ such that (p, u, v, q) ∈ δ with first(u) = a and first(v) = b.

From the definition of SDA's [4, p. 159, Definition 5], we can show that the class of SDA's is a proper subclass of the associated automata of SPDFST's. Moreover, from the definition of OST's [2, p. 450], we can show that the class of languages accepted by OST's is incomparable with the class of languages accepted by SPDFST's.
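To make Definition 1 concrete, the following Python sketch (ours; the paper gives no code) checks conditions (1)–(4) on an explicitly given transition set δ, represented as tuples (p, x, y, q); reachability, liveness and canonical form are not checked here.

    from itertools import combinations

    def satisfies_definition_1(delta) -> bool:
        """Check conditions (1)-(4) of Definition 1 on a finite set of transitions."""
        if any(x == "" or y == "" for (_p, x, y, _q) in delta):
            return False                                  # (1) input and output labels are nonempty
        for (p1, x1, y1, q1), (p2, x2, y2, q2) in combinations(delta, 2):
            if p1 == p2:
                if x1[0] == x2[0] and (x1, y1, q1) != (x2, y2, q2):
                    return False                          # (2) determinism on the first input symbol
                if x1[0] != x2[0] and y1[0] == y2[0]:
                    return False                          # (3) distinct first inputs force distinct first outputs
            if (p1, q1) != (p2, q2) and x1[0] == x2[0] and y1[0] == y2[0]:
                return False                              # (4) uniqueness of labels
        return True

    # Two edges leaving state 0 with distinct first input and output symbols: all conditions hold.
    assert satisfies_definition_1({(0, "ab", "cd", 1), (0, "ba", "dc", 2)})
    # Same first input symbol 'a' but different labels/targets: condition (2) is violated.
    assert not satisfies_definition_1({(0, "ab", "cd", 1), (0, "ac", "ce", 2)})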
4  Identifying SPDFST's
Let T = (Q, Σ, Δ, δ, q0, F) be any SPDFST in canonical form. A finite subset R ⊆ Σ∗ × Δ∗ of L(T) is called a characteristic sample of L(T) if L(T) is the smallest language containing R that is accepted by an SPDFST, i.e., if for any SPDFST T′, R ⊆ L(T′) implies that L(T) ⊆ L(T′). For each p ∈ Q, define pre(p) as the shortest input string x ∈ Σ∗ from q0 to p, i.e., (q0, x, y, p) ∈ ΠT(q0, p) and x ⪯ x′ for any x′ such that (q0, x′, y′, p) ∈ ΠT(q0, p). Moreover, for each p ∈ Q and q ∈ F, define post(p, q) (∈ Σ∗) as the shortest input string from p to q. Then, define RI(T) = {pre(p) · post(p, q) | p ∈ Q, q ∈ F} ∪ {pre(p) · x · post(r, q) | p ∈ Q, (p, x, y, r) ∈ δ, q ∈ F} ∪ {pre(p) · x1 · x2 · post(s, q) | p ∈ Q, (p, x1, y1, r), (r, x2, y2, s) ∈ δ, q ∈ F} and R(T) = {(x, y) ∈ Σ∗ × Δ∗ | x ∈ RI(T), (q0, x, y, q) ∈ ΠT(q0, F)}. R(T) is called a representative sample of T. Note that the cardinality |R(T)| of a representative sample is at most |Q|2(|Σ|2 + |Σ| + 1); that is, |R(T)| is polynomial in the description length of T. We can prove that R(T) is a characteristic sample of L(T).
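The three unions defining RI(T) translate directly into code. The sketch below is our illustration only: pre and post are supplied as precomputed dictionaries rather than derived by a shortest-string search, and post is partial, defined only where a path to the final state exists. It builds RI(T) and then pairs each input with the unique output obtained by running the transducer.

    def run(delta, q0, finals, x):
        """Run an SPDFST deterministically on input x; return its output string,
        or None if x is not the input component of any accepted pair."""
        p, out, rest = q0, "", x
        while rest:
            step = [(u, v, q) for (s, u, v, q) in delta if s == p and u[0] == rest[0]]
            if not step or not rest.startswith(step[0][0]):   # at most one candidate, by determinism
                return None
            u, v, q = step[0]
            rest, out, p = rest[len(u):], out + v, q
        return out if p in finals else None

    def representative_sample(states, delta, q0, finals, pre, post):
        """R(T): the inputs RI(T) of the three unions, each paired with its output."""
        inputs = set()
        for p in states:
            for q in finals:
                if (p, q) in post:
                    inputs.add(pre[p] + post[(p, q)])                   # pre(p)·post(p, q)
        for (p, x, _y, r) in delta:
            for q in finals:
                if (r, q) in post:
                    inputs.add(pre[p] + x + post[(r, q)])               # pre(p)·x·post(r, q)
                for (r2, x2, _y2, s) in delta:
                    if r2 == r and (s, q) in post:
                        inputs.add(pre[p] + x + x2 + post[(s, q)])      # pre(p)·x1·x2·post(s, q)
        pairs = ((x, run(delta, q0, finals, x)) for x in inputs)
        return {(x, y) for (x, y) in pairs if y is not None}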
Let T∗ be a target SPDFST. The identification algorithm IA is given below.

Input: a positive presentation (x1, y1), (x2, y2), . . . of L(T∗)
Output: a sequence of SPDFST's T1, T2, . . .

Procedure IA
begin
  initialize i := 0; q0 := p[ε]; h(p[ε]) := ε;
  let T0 = ({p[ε]}, ∅, ∅, ∅, q0, ∅) be the initial SPDFST;
  repeat (forever)
    let Ti = (Qi, Σi, Δi, δi, q0, Fi) be the current conjecture;
    i := i + 1; read the next positive example (xi, yi);
    if (xi, yi) ∈ L(Ti−1) then
      output Ti = Ti−1 as the i-th conjecture
    else
      Qi := Qi−1; Σi := Σi−1; Δi := Δi−1; δi := δi−1; Fi := Fi−1;
      if xi = ε and yi = ε then
        Fi := Fi ∪ {p[ε]};
        output Ti = (Qi, Σi, Δi, δi, q0, Fi) as the i-th conjecture
      else /* the case where xi ≠ ε and yi ≠ ε */
        Qi := Qi ∪ {p[xi]}; Σi := Σi ∪ alph(xi); Δi := Δi ∪ alph(yi);
        Fi := Fi ∪ {p[xi]}; h(p[xi]) := xi;
        Ti := CONSTRUCT(Qi, Σi, Δi, δi ∪ {(p[ε], xi, yi, p[xi])}, q0, Fi);
        output Ti as the i-th conjecture
      fi
    fi
  until (false)
end

Here, the function CONSTRUCT(Q, Σ, Δ, δ, q0, F) repeatedly merges states in Q so that Lemma 1 (2) is satisfied and divides a transition in δ into two transitions so that Lemma 1 (3) is satisfied, and then outputs the updated SPDFST. By using Lemmas 1 and 2 and analyzing the behavior of the identification algorithm IA in a similar way to [4], we obtain the following conclusion.

Theorem 1. The class of SPDFST's is polynomial-time identifiable in the limit from positive data in the sense of Yokomori [4].
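The function CONSTRUCT is described only abstractly above. As a rough, hedged sketch (ours, not the authors' procedure), the fragment below performs one transition-splitting step in the spirit of Lemma 1 (3): two transitions leaving the same state whose input labels and output labels agree on their first symbols are factored through a fresh intermediate state at their longest common prefixes. Only the generic case is handled; degenerate cases and the state merging required by Lemma 1 (2) are deliberately left out.

    from os.path import commonprefix

    def split_step(delta, fresh_state):
        """Factor two conflicting transitions through a new state at their lcp's;
        return the updated transition set, or delta unchanged if nothing applies."""
        for t1 in delta:
            for t2 in delta:
                (p, x1, y1, q1), (p2, x2, y2, q2) = t1, t2
                if t1 == t2 or p != p2 or x1[0] != x2[0] or y1[0] != y2[0]:
                    continue
                xc, yc = commonprefix([x1, x2]), commonprefix([y1, y2])
                if xc in (x1, x2) or yc in (y1, y2):
                    continue                 # degenerate case: left to the full procedure
                r = fresh_state()
                return (delta - {t1, t2}) | {
                    (p, xc, yc, r),
                    (r, x1[len(xc):], y1[len(yc):], q1),
                    (r, x2[len(xc):], y2[len(yc):], q2),
                }
        return delta

    # Example: both edges share input prefix 'ab' and output prefix 'xy', so they are
    # split into (0, 'ab', 'xy', r) followed by (r, 'c', 'z', 1) and (r, 'd', 'w', 2).
    counter = iter(range(100, 200))
    delta = {(0, "abc", "xyz", 1), (0, "abd", "xyw", 2)}
    print(split_step(delta, lambda: next(counter)))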
References

1. Berstel, J.: Transductions and Context-Free Languages. Teubner Studienbücher, Stuttgart (1979)
2. Oncina, J., García, P., Vidal, E.: Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Trans. on Pattern Analysis and Machine Intelligence 15(5), 448–458 (1993)
3. Pitt, L.: Inductive inference, DFAs, and computational complexity. In: Jantke, K.P. (ed.) AII 1989. LNCS (LNAI), vol. 397, pp. 18–44. Springer, Heidelberg (1989)
4. Yokomori, T.: On polynomial-time learnability in the limit of strictly deterministic automata. Machine Learning 19, 153–179 (1995)
5. Yokomori, T.: Polynomial-time identification of very simple grammars from positive data. Theoretical Computer Science 298, 179–206 (2003)